Gopi Trinadh Maddikunta

Copyright © 2026 GT Groups. All rights reserved.

📊 Research · Benchmarks · Open Source

Nova Bench: Benchmarking 11 LLMs
on NVIDIA DGX Spark

The first comprehensive LLM benchmark suite for NVIDIA's $4,699 desktop supercomputer.
How fast can Blackwell really run 8B to 123B models?

✍️ Gopi Trinadh Maddikunta 📅 March 30, 2026 ⏱️ 8 min read 🏷️ LLM · NVIDIA · Benchmarks

When NVIDIA released the DGX Spark — a compact desktop powered by the GB10 Grace Blackwell Superchip — one question stood out: how fast can it actually run large language models?

Nobody had published comprehensive benchmark data. Reviews showed one or two models. Marketing materials quoted theoretical TOPS. But no one had systematically loaded a dozen models and measured real-world performance across different tasks.

So I did. Over the course of a week, I loaded 14 models onto my DGX Spark, built a benchmarking framework, and ran 4 systematic benchmark suites covering general inference, code generation, context scaling, and vision.

💡 All data is open source. Raw benchmark results on GitHub and structured dataset on HuggingFace. Every result is reproducible.

The Hardware

The DGX Spark measures just 150mm × 150mm × 50.5mm — smaller than most routers — yet packs serious specifications under the hood.

- 128GB unified memory
- 20 ARM cores
- 1 PFLOP FP4 performance
- 4TB NVMe storage
- $4,699 price tag

The key differentiator is the unified memory architecture. CPU and GPU share the same 128GB pool connected via NVLink-C2C at 5x PCIe Gen5 bandwidth. Models that wouldn't fit on a discrete GPU's VRAM can load entirely into this shared memory space. That's why a 73GB model like Mistral Large 123B runs on hardware that costs less than a used car.
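A quick sanity check on that claim: single-stream decoding is memory-bandwidth bound, because every generated token has to stream the full weight set from memory once. Using NVIDIA's published ~273 GB/s memory bandwidth figure for the Spark (treat that constant as an assumption), a rough per-model ceiling is:

```python
# Back-of-envelope ceiling on generation speed for bandwidth-bound decoding:
# each generated token requires one full pass over the model weights.
BANDWIDTH_GB_S = 273  # DGX Spark LPDDR5x bandwidth (assumed from NVIDIA's spec sheet)

def gen_ceiling_tok_s(model_size_gb: float) -> float:
    """Upper bound on tokens/s: memory bandwidth divided by weight footprint."""
    return BANDWIDTH_GB_S / model_size_gb

for name, size_gb in [("Llama 3.1 8B", 4.9),
                      ("Gemma3 27B", 17),
                      ("Mistral Large 123B", 73)]:
    print(f"{name}: ceiling ~{gen_ceiling_tok_s(size_gb):.1f} tok/s")
```

The measured speeds below (42.86, 11.71, and 2.28 tok/s for those three models) land at roughly 60–80% of these ceilings, which is what you would expect from a bandwidth-bound workload with some runtime overhead.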

The Results

I tested 11 models with the same prompt using Ollama 0.18.3. Here's the full ranking by generation speed:

| Rank | Model | Size | Prompt tok/s | Gen tok/s | Load Time | Tier |
|------|-------|------|--------------|-----------|-----------|------|
| 1 | Llama 3.1 8B | 4.9 GB | 574.79 | 42.86 | 4.73s | ⚡ Fast |
| 2 | Gemma3 27B | 17 GB | 164.64 | 11.71 | 9.83s | ⚡ Fast |
| 3 | Qwen2.5-Coder 32B | 19 GB | 288.96 | 10.36 | 15.13s | ⚡ Fast |
| 4 | Qwen3 32B | 20 GB | 141.91 | 9.88 | 4.08s | ✓ Usable |
| 5 | CodeLlama 70B | 38 GB | 133.35 | 5.73 | 28.81s | ✓ Usable |
| 6 | Nemotron 70B | 42 GB | 87.97 | 4.77 | 27.25s | ✓ Usable |
| 7 | Llama 3.1 70B | 42 GB | 67.92 | 4.76 | 28.35s | ✓ Usable |
| 8 | DeepSeek-R1 70B | 42 GB | 24.18 | 4.68 | 47.02s | ✓ Usable |
| 9 | Llama 3.3 70B | 42 GB | 67.97 | 4.66 | 27.41s | ✓ Usable |
| 10 | Qwen 2.5 72B | 47 GB | 122.19 | 4.40 | 44.75s | ✓ Usable |
| 11 | Mistral Large 123B | 73 GB | 10.43 | 2.28 | 86.14s | 🐢 Slow |

The sweet spot? 27–32B models: fast enough to feel interactive at 10–12 tok/s, large enough to produce quality output. If I could keep only three models, they would be Gemma3 27B, Qwen2.5-Coder 32B, and Llama 3.1 70B.
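The per-model numbers come straight from Ollama's own counters. The actual harness is in the Nova Bench repo; as a minimal sketch of where each column originates (model name and host below are placeholders), the non-streaming /api/generate response carries nanosecond-resolution duration fields that convert directly into the table's tok/s and load-time columns:

```python
import json
import urllib.request

def tok_per_s(count: int, duration_ns: int) -> float:
    """Ollama reports durations in nanoseconds; convert to tokens/second."""
    return count / (duration_ns / 1e9)

def bench(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Run one non-streaming generation and extract the timing counters."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        r = json.load(resp)
    return {
        "prompt_tok_s": tok_per_s(r["prompt_eval_count"], r["prompt_eval_duration"]),
        "gen_tok_s": tok_per_s(r["eval_count"], r["eval_duration"]),
        "load_s": r["load_duration"] / 1e9,
    }
```

For example, `bench("llama3.1:8b", "Explain unified memory in one paragraph.")` against a local Ollama instance returns the three metrics used in the ranking above.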

🔬 Explore the Interactive Dashboard

Sort models by speed, compare code generation benchmarks, explore context scaling data, and examine vision model results — all with real data from my DGX Spark:

gopitrinadh.site/Nova-bench

The Code Benchmark

I asked each model to write a Python function for finding the longest palindromic substring — a common interview problem that tests reasoning, code quality, and documentation.

The most striking result: DeepSeek-R1 generated 10,710 tokens of step-by-step reasoning for this single question. It spent 39 minutes thinking through the problem before producing its answer. Qwen3 produced 8,180 tokens. These reasoning models treat every question as a deep thinking exercise.

For practical daily coding, Qwen2.5-Coder 32B at 10.33 tok/s is the winner — purpose-built for code, fast enough to be interactive, and produces clean solutions with proper type hints and docstrings.
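For reference, the task the models were given: a typical expand-around-center solution (my own sketch for context, not any model's output) runs in O(n²) time and O(1) extra space:

```python
def longest_palindrome(s: str) -> str:
    """Return the longest palindromic substring of s.

    Expand-around-center: every palindrome is symmetric about either one
    character (odd length) or a gap between two characters (even length).
    """
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        for lo, hi in ((i, i), (i, i + 1)):  # odd- and even-length centers
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            cand = s[lo + 1:hi]  # the loop overshoots by one on each side
            if len(cand) > len(best):
                best = cand
    return best
```

A solid answer here needs the even-length center case, the off-by-one slice after the expansion loop, and the empty-string guard, which is exactly why the problem separates careful models from sloppy ones.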

Context Scaling — The Surprise

The most interesting finding came from comparing short versus long prompts. With a 130-token complex business analysis prompt versus a 19-token simple prompt:

📈 Llama 3.1 8B prompt eval: 574 → 2,090 tok/s, a 3.6x improvement with longer prompts. The unified memory architecture batches prompt tokens more efficiently at scale, while generation speed stays essentially constant.
| Model | Short Prompt | Long Prompt | Scale | Gen tok/s |
|-------|--------------|-------------|-------|-----------|
| Llama 3.1 8B | 574 tok/s | 2,090 tok/s | 3.6x | 41.82 |
| Gemma3 27B | 164 tok/s | 616 tok/s | 3.8x | 11.70 |
| Qwen3 32B | 141 tok/s | 491 tok/s | 3.5x | 10.10 |
| Llama 3.1 70B | 67 tok/s | 212 tok/s | 3.2x | 4.69 |
| Qwen 2.5 72B | 122 tok/s | 225 tok/s | 1.8x | 4.33 |
| Nemotron 70B | 87 tok/s | 164 tok/s | 1.9x | 4.62 |

This means the DGX Spark gets relatively faster with more complex inputs — exactly what you want for research and production workloads with long contexts.
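Concretely, prompt-eval speed is what sets time to first token on long contexts. A sketch of the effect, extrapolating the measured Gemma3 27B rates to a hypothetical 4,000-token prompt (an assumption; the benchmark itself only measured up to a 130-token prompt):

```python
def ttft_s(prompt_tokens: int, prompt_tok_s: float) -> float:
    """Time to first token: duration of the prompt-eval phase alone."""
    return prompt_tokens / prompt_tok_s

# Gemma3 27B on a hypothetical 4,000-token RAG-style prompt:
print(f"at the short-prompt rate (164 tok/s): {ttft_s(4000, 164):.1f} s")
print(f"at the long-prompt rate  (616 tok/s): {ttft_s(4000, 616):.1f} s")
```

If the long-prompt rate holds at that scale, the wait before the first generated token drops from about 24 seconds to about 6.5 seconds, which is the difference between unusable and acceptable for interactive long-context work.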

Vision — 90B on a Desktop

Llama3.2-Vision 90B — a 54GB multimodal model — loaded in 16.8 seconds and processed images at 3.47 tok/s for generation and 6.02 tok/s for prompt evaluation.

A 90 billion parameter vision model running on a desktop computer. Not fast enough for production serving, but absolutely viable for development, prototyping, and research. Try doing that on a consumer GPU.

Key Findings

1. 27–32B is the sweet spot for interactive use. Gemma3 27B runs at 11.7 tok/s and Qwen3 32B at 9.9 tok/s: fast enough for real-time coding and research, large enough for quality responses.

2. Prompt eval scales 3–4x with longer inputs. The unified memory architecture handles batched prompt tokens efficiently; Llama 3.1 8B jumped from 574 to 2,090 tok/s, a scaling behavior that is especially pronounced on this hardware.

3. Reasoning models generate 10K+ tokens per question. DeepSeek-R1 spent 39 minutes reasoning through a coding problem, generating 10,710 tokens of chain-of-thought. Viable for async workloads on the Spark.

4. A 90B vision model runs on desktop hardware. Llama3.2-Vision analyzes images at 3.47 tok/s: not production speed, but viable for development, prototyping, and research demos.

5. Generation speed is constant regardless of prompt length. 70B models hold ~4.5 tok/s whether the prompt is 19 or 130 tokens, which makes workload planning stable and predictable.

6. 123B is the practical ceiling for a single Spark. Mistral Large runs at 2.28 tok/s: it fits in memory but is barely interactive. Two Sparks connected via QSFP could handle larger models.

Should You Buy a DGX Spark?

Yes, if you need to run 70B+ models locally for development, prototyping, or research. No consumer GPU can fit these models in VRAM. The Spark handles them natively.

Maybe, if you primarily work with 8B–32B models. An RTX 4090 or Mac Studio with 192GB RAM can match or exceed the Spark's speed for smaller models at a lower price.

No, if you need production serving throughput. At 4–5 tok/s for 70B models through Ollama, this is a development platform, not a serving platform.


View on GitHub →    Dataset on HuggingFace →

Reproduce These Results

Everything is open source. Clone the repo and run on your own hardware:

git clone https://github.com/GOPITRINADH3561/nova-bench.git

Full dataset: huggingface.co/datasets/G3nadh/dgx-spark-benchmarks

Tags: NVIDIA DGX Spark · Blackwell · LLM Benchmark · Ollama · Open Source · GB10 · 128GB Unified · ARM
Gopi Trinadh Maddikunta
MS Engineering Data Science, University of Houston. Research Assistant under Dr. Peizhu Qian. GSoC 2025 Contributor (Scala Center). Building AI infrastructure on NVIDIA DGX Spark.