Gopi Trinadh Maddikunta

Copyright © 2026 GT Groups. All rights reserved.

📊 Research · Benchmarks · Open Source

Nova Bench: Benchmarking 11 LLMs
on NVIDIA DGX Spark

The first comprehensive LLM benchmark suite for NVIDIA's $4,699 desktop supercomputer.
How fast can Blackwell really run 8B to 123B models?

✍️ Gopi Trinadh Maddikunta 📅 March 30, 2026 ⏱️ 8 min read 🏷️ LLM · NVIDIA · Benchmarks

When NVIDIA released the DGX Spark — a compact desktop powered by the GB10 Grace Blackwell Superchip — one question stood out: how fast can it actually run large language models?

Nobody had published comprehensive benchmark data. Reviews showed one or two models. Marketing materials quoted theoretical TOPS. But no one had systematically loaded a dozen models and measured real-world performance across different tasks.

So I did. Over the course of a week, I loaded 14 models onto my DGX Spark, built a benchmarking framework, and ran 4 systematic benchmark suites covering general inference, code generation, context scaling, and vision.

💡 All data is open source. Raw benchmark results on GitHub and structured dataset on HuggingFace. Every result is reproducible.

The Hardware

The DGX Spark measures just 150mm × 150mm × 50.5mm — smaller than most routers — yet packs serious specifications under the hood.

- 128GB unified memory
- 20 ARM cores
- 1 PFLOP FP4 performance
- 4TB NVMe storage
- $4,699 price tag

The key differentiator is the unified memory architecture. CPU and GPU share the same 128GB pool connected via NVLink-C2C at 5x PCIe Gen5 bandwidth. Models that wouldn't fit on a discrete GPU's VRAM can load entirely into this shared memory space. That's why a 73GB model like Mistral Large 123B runs on hardware that costs less than a used car.
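A quick sanity check on that claim: single-stream decoding is memory-bandwidth bound, because every generated token has to stream the full weight set from memory once. Using NVIDIA's published ~273 GB/s memory bandwidth figure for the Spark (treat that constant as an assumption), a rough per-model ceiling is:

```python
# Back-of-envelope ceiling on generation speed for bandwidth-bound decoding:
# each generated token requires one full pass over the model weights.
BANDWIDTH_GB_S = 273  # DGX Spark LPDDR5x bandwidth (assumed from NVIDIA's spec sheet)

def gen_ceiling_tok_s(model_size_gb: float) -> float:
    """Upper bound on tokens/s: memory bandwidth divided by weight footprint."""
    return BANDWIDTH_GB_S / model_size_gb

for name, size_gb in [("Llama 3.1 8B", 4.9),
                      ("Gemma3 27B", 17),
                      ("Mistral Large 123B", 73)]:
    print(f"{name}: ceiling ~{gen_ceiling_tok_s(size_gb):.1f} tok/s")
```

The measured speeds below (42.86, 11.71, and 2.28 tok/s for those three models) land at roughly 60–80% of these ceilings, which is what you would expect from a bandwidth-bound workload with some runtime overhead.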

The Results

I tested 11 models with the same prompt using Ollama 0.18.3. Here's the full ranking by generation speed:

| Rank | Model | Size | Prompt tok/s | Gen tok/s | Load Time | Tier |
|------|-------|------|--------------|-----------|-----------|------|
| 1 | Llama 3.1 8B | 4.9 GB | 574.79 | 42.86 | 4.73s | ⚡ Fast |
| 2 | Gemma3 27B | 17 GB | 164.64 | 11.71 | 9.83s | ⚡ Fast |
| 3 | Qwen2.5-Coder 32B | 19 GB | 288.96 | 10.36 | 15.13s | ⚡ Fast |
| 4 | Qwen3 32B | 20 GB | 141.91 | 9.88 | 4.08s | ✓ Usable |
| 5 | CodeLlama 70B | 38 GB | 133.35 | 5.73 | 28.81s | ✓ Usable |
| 6 | Nemotron 70B | 42 GB | 87.97 | 4.77 | 27.25s | ✓ Usable |
| 7 | Llama 3.1 70B | 42 GB | 67.92 | 4.76 | 28.35s | ✓ Usable |
| 8 | DeepSeek-R1 70B | 42 GB | 24.18 | 4.68 | 47.02s | ✓ Usable |
| 9 | Llama 3.3 70B | 42 GB | 67.97 | 4.66 | 27.41s | ✓ Usable |
| 10 | Qwen 2.5 72B | 47 GB | 122.19 | 4.40 | 44.75s | ✓ Usable |
| 11 | Mistral Large 123B | 73 GB | 10.43 | 2.28 | 86.14s | 🐢 Slow |

The sweet spot? 27–32B models: fast enough to feel interactive at 10–12 tok/s, large enough to produce quality output. If I could keep only three models, they would be Gemma3 27B, Qwen2.5-Coder 32B, and Llama 3.1 70B.
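The per-model numbers come straight from Ollama's own counters. The actual harness is in the Nova Bench repo; as a minimal sketch of where each column originates (model name and host below are placeholders), the non-streaming /api/generate response carries nanosecond-resolution duration fields that convert directly into the table's tok/s and load-time columns:

```python
import json
import urllib.request

def tok_per_s(count: int, duration_ns: int) -> float:
    """Ollama reports durations in nanoseconds; convert to tokens/second."""
    return count / (duration_ns / 1e9)

def bench(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Run one non-streaming generation and extract the timing counters."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        r = json.load(resp)
    return {
        "prompt_tok_s": tok_per_s(r["prompt_eval_count"], r["prompt_eval_duration"]),
        "gen_tok_s": tok_per_s(r["eval_count"], r["eval_duration"]),
        "load_s": r["load_duration"] / 1e9,
    }
```

For example, `bench("llama3.1:8b", "Explain unified memory in one paragraph.")` against a local Ollama instance returns the three metrics used in the ranking above.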

🔬 Explore the Interactive Dashboard

Sort models by speed, compare code generation benchmarks, explore context scaling data, and examine vision model results — all with real data from my DGX Spark:

gopitrinadh.site/Nova-bench

The Code Benchmark

I asked each model to write a Python function for finding the longest palindromic substring — a common interview problem that tests reasoning, code quality, and documentation.

The most striking result: DeepSeek-R1 generated 10,710 tokens of step-by-step reasoning for this single question. It spent 39 minutes thinking through the problem before producing its answer. Qwen3 produced 8,180 tokens. These reasoning models treat every question as a deep thinking exercise.

For practical daily coding, Qwen2.5-Coder 32B at 10.33 tok/s is the winner — purpose-built for code, fast enough to be interactive, and produces clean solutions with proper type hints and docstrings.
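For reference, the task the models were given: a typical expand-around-center solution (my own sketch for context, not any model's output) runs in O(n²) time and O(1) extra space:

```python
def longest_palindrome(s: str) -> str:
    """Return the longest palindromic substring of s.

    Expand-around-center: every palindrome is symmetric about either one
    character (odd length) or a gap between two characters (even length).
    """
    if not s:
        return ""
    best = s[0]
    for i in range(len(s)):
        for lo, hi in ((i, i), (i, i + 1)):  # odd- and even-length centers
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            cand = s[lo + 1:hi]  # the loop overshoots by one on each side
            if len(cand) > len(best):
                best = cand
    return best
```

A solid answer here needs the even-length center case, the off-by-one slice after the expansion loop, and the empty-string guard, which is exactly why the problem separates careful models from sloppy ones.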

Context Scaling — The Surprise

The most interesting finding came from comparing short versus long prompts. With a 130-token complex business analysis prompt versus a 19-token simple prompt:

📈 Llama 3.1 8B prompt eval: 574 → 2,090 tok/s, a 3.6x improvement with longer prompts. The unified memory architecture batches prompt tokens more efficiently at scale, while generation speed stays essentially constant.
| Model | Short Prompt | Long Prompt | Scale | Gen tok/s |
|-------|--------------|-------------|-------|-----------|
| Llama 3.1 8B | 574 tok/s | 2,090 tok/s | 3.6x | 41.82 |
| Gemma3 27B | 164 tok/s | 616 tok/s | 3.8x | 11.70 |
| Qwen3 32B | 141 tok/s | 491 tok/s | 3.5x | 10.10 |
| Llama 3.1 70B | 67 tok/s | 212 tok/s | 3.2x | 4.69 |
| Qwen 2.5 72B | 122 tok/s | 225 tok/s | 1.8x | 4.33 |
| Nemotron 70B | 87 tok/s | 164 tok/s | 1.9x | 4.62 |

This means the DGX Spark gets relatively faster with more complex inputs — exactly what you want for research and production workloads with long contexts.
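Concretely, prompt-eval speed is what sets time to first token on long contexts. A sketch of the effect, extrapolating the measured Gemma3 27B rates to a hypothetical 4,000-token prompt (an assumption; the benchmark itself only measured up to a 130-token prompt):

```python
def ttft_s(prompt_tokens: int, prompt_tok_s: float) -> float:
    """Time to first token: duration of the prompt-eval phase alone."""
    return prompt_tokens / prompt_tok_s

# Gemma3 27B on a hypothetical 4,000-token RAG-style prompt:
print(f"at the short-prompt rate (164 tok/s): {ttft_s(4000, 164):.1f} s")
print(f"at the long-prompt rate  (616 tok/s): {ttft_s(4000, 616):.1f} s")
```

If the long-prompt rate holds at that scale, the wait before the first generated token drops from about 24 seconds to about 6.5 seconds, which is the difference between unusable and acceptable for interactive long-context work.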

Vision — 90B on a Desktop

Llama3.2-Vision 90B — a 54GB multimodal model — loaded in 16.8 seconds and processed images at 3.47 tok/s for generation and 6.02 tok/s for prompt evaluation.

A 90 billion parameter vision model running on a desktop computer. Not fast enough for production serving, but absolutely viable for development, prototyping, and research. Try doing that on a consumer GPU.

Key Findings

1. 27–32B is the sweet spot for interactive use. Gemma3 27B runs at 11.7 tok/s and Qwen3 32B at 9.9 tok/s: fast enough for real-time coding and research, large enough for quality responses.

2. Prompt eval scales 3–4x with longer inputs. The unified memory architecture handles batched prompt tokens efficiently; Llama 3.1 8B jumped from 574 to 2,090 tok/s, a scaling behavior that is especially pronounced on this hardware.

3. Reasoning models generate 10K+ tokens per question. DeepSeek-R1 spent 39 minutes reasoning through a coding problem, generating 10,710 tokens of chain-of-thought. Viable for async workloads on the Spark.

4. A 90B vision model runs on desktop hardware. Llama3.2-Vision analyzes images at 3.47 tok/s: not production speed, but viable for development, prototyping, and research demos.

5. Generation speed is constant regardless of prompt length. 70B models hold ~4.5 tok/s whether the prompt is 19 or 130 tokens, which makes workload planning stable and predictable.

6. 123B is the practical ceiling for a single Spark. Mistral Large runs at 2.28 tok/s: it fits in memory but is barely interactive. Two Sparks connected via QSFP could handle larger models.

Should You Buy a DGX Spark?

Yes, if you need to run 70B+ models locally for development, prototyping, or research. No consumer GPU can fit these models in VRAM. The Spark handles them natively.

Maybe, if you primarily work with 8B–32B models. An RTX 4090 or Mac Studio with 192GB RAM can match or exceed the Spark's speed for smaller models at a lower price.

No, if you need production serving throughput. At 4–5 tok/s for 70B models through Ollama, this is a development platform, not a serving platform.


View on GitHub →    Dataset on HuggingFace →

Reproduce These Results

Everything is open source. Clone the repo and run on your own hardware:

git clone https://github.com/GOPITRINADH3561/nova-bench.git

Full dataset: huggingface.co/datasets/G3nadh/dgx-spark-benchmarks

Tags: NVIDIA DGX Spark · Blackwell · LLM Benchmark · Ollama · Open Source · GB10 · 128GB Unified · ARM
Gopi Trinadh Maddikunta
MS Engineering Data Science, University of Houston. Research Assistant under Dr. Peizhu Qian. GSoC 2025 Contributor (Scala Center). Building AI infrastructure on NVIDIA DGX Spark.