Gopi Trinadh Maddikunta
Nova Bench: Benchmarking 11 LLMs
on NVIDIA DGX Spark
The first comprehensive LLM benchmark suite for NVIDIA's $4,699 desktop supercomputer.
How fast can Blackwell really run 8B to 123B models?
When NVIDIA released the DGX Spark — a compact desktop powered by the GB10 Grace Blackwell Superchip — one question stood out: how fast can it actually run large language models?
Nobody had published comprehensive benchmark data. Reviews showed one or two models. Marketing materials quoted theoretical TOPS. But no one had systematically loaded a dozen models and measured real-world performance across different tasks.
So I did. Over the course of a week, I loaded 14 models onto my DGX Spark, built a benchmarking framework, and ran four systematic benchmark suites covering general inference, code generation, context scaling, and vision.
The Hardware
The DGX Spark measures just 150mm × 150mm × 50.5mm — smaller than most routers — yet packs serious specifications under the hood.
The key differentiator is the unified memory architecture. CPU and GPU share the same 128GB pool connected via NVLink-C2C at 5x PCIe Gen5 bandwidth. Models that wouldn't fit on a discrete GPU's VRAM can load entirely into this shared memory space. That's why a 73GB model like Mistral Large 123B runs on hardware that costs less than a used car.
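As a back-of-the-envelope check, you can estimate whether a quantized model fits in that 128GB pool. This is my own sketch: the ~4.75 bits-per-weight figure approximates Q4_K_M-style GGUF quantization (4-bit weights plus scales and metadata) and is an assumption, not a published Ollama number, as is the headroom reserved for KV cache and the OS.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float = 4.75) -> float:
    """Rough in-memory size of a quantized model.

    ~4.75 bits/weight approximates Q4_K_M GGUF quantization
    (4-bit weights plus scales/metadata) -- an estimate, not exact.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits_in_unified_memory(params_billion: float, memory_gb: float = 128.0,
                           headroom_gb: float = 16.0) -> bool:
    """Leave headroom for KV cache, OS, and CUDA context."""
    return quantized_size_gb(params_billion) <= memory_gb - headroom_gb
```

At ~4.75 bits/weight, 123B parameters works out to roughly 73 GB, which lines up with the Mistral Large entry in the table below.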
The Results
I tested 11 models with the same prompt using Ollama 0.18.3. Here's the full ranking by generation speed:
| Rank | Model | Size | Prompt tok/s | Gen tok/s | Load Time | Tier |
|---|---|---|---|---|---|---|
| 1 | Llama 3.1 8B | 4.9 GB | 574.79 | 42.86 | 4.73s | ⚡ Fast |
| 2 | Gemma3 27B | 17 GB | 164.64 | 11.71 | 9.83s | ⚡ Fast |
| 3 | Qwen2.5-Coder 32B | 19 GB | 288.96 | 10.36 | 15.13s | ⚡ Fast |
| 4 | Qwen3 32B | 20 GB | 141.91 | 9.88 | 4.08s | ✓ Usable |
| 5 | CodeLlama 70B | 38 GB | 133.35 | 5.73 | 28.81s | ✓ Usable |
| 6 | Nemotron 70B | 42 GB | 87.97 | 4.77 | 27.25s | ✓ Usable |
| 7 | Llama 3.1 70B | 42 GB | 67.92 | 4.76 | 28.35s | ✓ Usable |
| 8 | DeepSeek-R1 70B | 42 GB | 24.18 | 4.68 | 47.02s | ✓ Usable |
| 9 | Llama 3.3 70B | 42 GB | 67.97 | 4.66 | 27.41s | ✓ Usable |
| 10 | Qwen 2.5 72B | 47 GB | 122.19 | 4.40 | 44.75s | ✓ Usable |
| 11 | Mistral Large 123B | 73 GB | 10.43 | 2.28 | 86.14s | 🐢 Slow |
The sweet spot? 27–32B models. Fast enough to feel interactive at 10–12 tok/s, large enough to produce quality output. If I could only keep three models, it would be Gemma3 27B, Qwen2.5-Coder 32B, and Llama 3.1 70B.
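The tok/s figures in the table fall straight out of the timing metadata Ollama returns with each `/api/generate` response, where durations are reported in nanoseconds. A minimal sketch of the calculation (the field names are from the real API; the numbers here are illustrative):

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Ollama reports durations in nanoseconds."""
    return count / (duration_ns / 1e9)


# Example metrics shaped like Ollama's /api/generate response
# (field names are real; values are illustrative):
metrics = {
    "prompt_eval_count": 26,
    "prompt_eval_duration": 45_000_000,   # 45 ms to process the prompt
    "eval_count": 300,                    # generated tokens
    "eval_duration": 7_000_000_000,       # 7 s of generation
    "load_duration": 4_730_000_000,       # 4.73 s to load the model
}

prompt_tps = tokens_per_second(metrics["prompt_eval_count"],
                               metrics["prompt_eval_duration"])
gen_tps = tokens_per_second(metrics["eval_count"], metrics["eval_duration"])
```

With these illustrative values, generation speed comes out to about 42.86 tok/s, the same order as the Llama 3.1 8B result above.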
🔬 Explore the Interactive Dashboard
Sort models by speed, compare code generation benchmarks, explore context scaling data, and examine vision model results — all with real data from my DGX Spark:
The Code Benchmark
I asked each model to write a Python function for finding the longest palindromic substring — a common interview problem that tests reasoning, code quality, and documentation.
The most striking result: DeepSeek-R1 generated 10,710 tokens of step-by-step reasoning for this single question. It spent 39 minutes thinking through the problem before producing its answer. Qwen3 produced 8,180 tokens. These reasoning models treat every question as a deep thinking exercise.
For practical daily coding, Qwen2.5-Coder 32B at 10.33 tok/s is the winner — purpose-built for code, fast enough to be interactive, and produces clean solutions with proper type hints and docstrings.
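For reference, a typical expand-around-center solution to the benchmark prompt looks like this. This is my own sketch of the standard O(n²) approach, not any model's actual output:

```python
def longest_palindromic_substring(s: str) -> str:
    """Return the longest palindromic substring of s.

    Expand-around-center: O(n^2) time, O(1) extra space.
    """
    if not s:
        return ""
    start, end = 0, 0
    for i in range(len(s)):
        # Try odd-length centers (i, i) and even-length centers (i, i+1).
        for left, right in ((i, i), (i, i + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            # After the loop, the palindrome spans s[left+1 : right].
            if right - left - 2 > end - start:
                start, end = left + 1, right - 1
    return s[start:end + 1]
```

For example, `longest_palindromic_substring("cbbd")` returns `"bb"`.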
Context Scaling — The Surprise
The most interesting finding came from comparing short versus long prompts. With a 130-token complex business analysis prompt versus a 19-token simple prompt:
| Model | Short prompt (tok/s) | Long prompt (tok/s) | Speedup | Gen tok/s |
|---|---|---|---|---|
| Llama 3.1 8B | 574 tok/s | 2,090 tok/s | 3.6x | 41.82 |
| Gemma3 27B | 164 tok/s | 616 tok/s | 3.8x | 11.70 |
| Qwen3 32B | 141 tok/s | 491 tok/s | 3.5x | 10.10 |
| Llama 3.1 70B | 67 tok/s | 212 tok/s | 3.2x | 4.69 |
| Qwen 2.5 72B | 122 tok/s | 225 tok/s | 1.8x | 4.33 |
| Nemotron 70B | 87 tok/s | 164 tok/s | 1.9x | 4.62 |
In other words, prompt evaluation (prefill) gets relatively faster as inputs grow: prefill processes the whole prompt in parallel, so longer prompts keep the GPU better utilized while generation speed barely changes. That is exactly the behavior you want for research and production workloads with long contexts.
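The speedup column is simply the ratio of long-prompt to short-prompt prefill throughput, as a quick check against the table above shows:

```python
def prefill_speedup(short_tps: float, long_tps: float) -> float:
    """How much faster prompt evaluation runs on the longer prompt."""
    return long_tps / short_tps


# Prompt tok/s figures from the table above:
llama_8b = prefill_speedup(574, 2090)   # ~3.6x
qwen_72b = prefill_speedup(122, 225)    # ~1.8x
```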
Vision — 90B on a Desktop
Llama3.2-Vision 90B — a 54GB multimodal model — loaded in 16.8 seconds and processed images at 3.47 tok/s for generation and 6.02 tok/s for prompt evaluation.
A 90 billion parameter vision model running on a desktop computer. Not fast enough for production serving, but absolutely viable for development, prototyping, and research. Try doing that on a consumer GPU.
Key Findings
- The 27–32B class is the sweet spot: 10–12 tok/s generation feels interactive while output quality stays high.
- Prompt evaluation throughput scales 1.8–3.8x with longer prompts, so complex inputs are relatively cheap.
- 70B models run at a usable 4–5 tok/s; the 123B Mistral Large loads and runs, but 2.28 tok/s tests your patience.
- A 90B vision model is viable for development and research on a single desktop.
Should You Buy a DGX Spark?
Yes, if you need to run 70B+ models locally for development, prototyping, or research. No consumer GPU can fit these models in VRAM. The Spark handles them natively.
Maybe, if you primarily work with 8B–32B models. An RTX 4090 or Mac Studio with 192GB RAM can match or exceed the Spark's speed for smaller models at a lower price.
No, if you need production serving throughput. At 4–5 tok/s for 70B models through Ollama, this is a development platform, not a serving platform.
⚡ Explore the Full Dashboard
Sort models, compare benchmarks, toggle dark/light theme — all real data from my DGX Spark.
View on GitHub → Dataset on HuggingFace →
Reproduce These Results
Everything is open source. Clone the repo and run on your own hardware:
Full dataset: huggingface.co/datasets/G3nadh/dgx-spark-benchmarks
