llama.cpp on NVIDIA DGX Spark — Benchmarks
Build: b6767 (5acd45546) · Published: · Platform: NVIDIA GB10 (SM 12.1), VMM: yes
llama-bench · llama-batched-bench
pp = prefill · tg = token generation · d = context depth · B = batch size
Overview
This document summarizes the performance of llama.cpp for various models on the new NVIDIA DGX Spark.
Benchmarks include:
- Prefill (pp) and generation (tg) at various context depths (d)
- Batch sizes of 1, 2, 4, 8, 16, 32 — typical for local environments
Models:
- gpt-oss-20b
- gpt-oss-120b
- Qwen3 Coder 30B A3B
- Qwen2.5 Coder 7B
- Gemma 3 4B QAT
- GLM 4.5 Air
Feel free to request additional benchmarks for other models and use cases.
Benchmarks
Build
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda -j
Commands
# sequential requests
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
# parallel requests
llama-batched-bench -m [model.gguf] -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32
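The sweep defined by these two commands can be enumerated ahead of time. The following is a minimal Python sketch, not part of llama.cpp; the variable names are illustrative, and it assumes each parallel sequence stores its own prompt and generated tokens (no shared prompt), which matches the N_KV column in the tables below. It lists the llama-bench test labels and checks that the largest batched case fits within `-c 300000`.

```python
from itertools import product

# Values copied from the llama-bench command above.
depths = [0, 4096, 8192, 16384, 32768]   # -d
n_prompt, n_gen = 2048, 32               # -p, -n

# One prefill test and one generation test per context depth.
bench_tests = []
for d in depths:
    suffix = f" @ d{d}" if d else ""
    bench_tests.append(f"pp{n_prompt}{suffix}")
    bench_tests.append(f"tg{n_gen}{suffix}")

# Values copied from the llama-batched-bench command above.
npp = [4096, 8192]              # -npp
ntg = 32                        # -ntg
npl = [1, 2, 4, 8, 16, 32]      # -npl
n_ctx = 300000                  # -c

# Assumption: every sequence holds its own prompt plus generated tokens.
batched_cases = [(pp, ntg, b, b * (pp + ntg)) for pp, b in product(npp, npl)]
largest_n_kv = max(n_kv for *_, n_kv in batched_cases)

print(bench_tests)
print(largest_n_kv, largest_n_kv <= n_ctx)
```

Under that assumption the largest case (PP = 8192, B = 32) needs 263168 KV cells, which is below the 300000-token context requested with `-c`.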
History
| Date | Build | Commit | Notes |
| 2025 Oct 14 | b6761 | 7ea15bb64 | Initial numbers |
| 2025 Oct 15 | b6767 | 5acd45546 | Improved decode via ggml-org/llama.cpp#16585 |
gpt-oss-20b
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)
llama-bench
| model | size | params | test | t/s |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 | 3610.56 ± 15.16 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 | 79.74 ± 0.43 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d4096 | 3361.11 ± 12.95 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d4096 | 74.63 ± 0.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d8192 | 3147.73 ± 15.77 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d8192 | 69.49 ± 1.12 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d16384 | 2685.54 ± 5.76 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d16384 | 64.02 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d32768 | 2055.34 ± 20.43 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d32768 | 55.96 ± 0.07 |
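To read the depth sweep as a trend rather than as absolute numbers, each rate can be expressed as a fraction of its depth-0 value. The sketch below is illustrative Python with the means copied from the table above; the `retention` helper is a hypothetical name, not a llama.cpp function.

```python
# Mean t/s values copied from the gpt-oss-20b llama-bench table above.
pp = {0: 3610.56, 4096: 3361.11, 8192: 3147.73, 16384: 2685.54, 32768: 2055.34}
tg = {0: 79.74, 4096: 74.63, 8192: 69.49, 16384: 64.02, 32768: 55.96}

def retention(series):
    """Return the throughput at each depth as a fraction of the depth-0 value."""
    base = series[0]
    return {depth: rate / base for depth, rate in series.items()}

pp_ret = retention(pp)
tg_ret = retention(tg)

for depth in pp:
    print(f"d={depth:>5}: pp {pp_ret[depth]:.1%}, tg {tg_ret[depth]:.1%}")
```

At a depth of 32768, prefill retains about 57% of its depth-0 rate and generation about 70%.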
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| 4096 | 32 | 1 | 4128 | 1.124 | 3644.93 | 0.431 | 74.24 | 1.555 | 2654.99 |
| 4096 | 32 | 2 | 8256 | 2.252 | 3637.13 | 0.766 | 83.59 | 3.018 | 2735.57 |
| 4096 | 32 | 4 | 16512 | 4.485 | 3652.80 | 0.960 | 133.37 | 5.445 | 3032.46 |
| 4096 | 32 | 8 | 33024 | 8.958 | 3657.89 | 1.228 | 208.45 | 10.186 | 3242.01 |
| 4096 | 32 | 16 | 66048 | 17.883 | 3664.62 | 1.695 | 302.07 | 19.578 | 3373.52 |
| 4096 | 32 | 32 | 132096 | 35.738 | 3667.60 | 2.403 | 426.22 | 38.140 | 3463.42 |
| 8192 | 32 | 1 | 8224 | 2.285 | 3584.99 | 0.458 | 69.81 | 2.743 | 2997.64 |
| 8192 | 32 | 2 | 16448 | 4.547 | 3603.29 | 0.797 | 80.29 | 5.344 | 3077.82 |
| 8192 | 32 | 4 | 32896 | 9.117 | 3594.24 | 1.004 | 127.47 | 10.121 | 3250.29 |
| 8192 | 32 | 8 | 65792 | 18.248 | 3591.40 | 1.356 | 188.77 | 19.604 | 3356.01 |
| 8192 | 32 | 16 | 131584 | 36.389 | 3601.99 | 1.951 | 262.37 | 38.340 | 3432.02 |
| 8192 | 32 | 32 | 263168 | 72.880 | 3596.94 | 2.937 | 348.69 | 75.816 | 3471.12 |
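The llama-batched-bench columns are related by simple arithmetic, which helps when comparing rows. The sketch below is illustrative Python (the function name is hypothetical) and assumes the prompt is not shared between sequences, so every sequence contributes PP + TG tokens to the KV cache. It recomputes the derived columns from PP, TG, B and the two measured times for the first and last gpt-oss-20b rows.

```python
def derived_columns(pp, tg, b, t_pp, t_tg):
    """Recompute N_KV, S_PP, S_TG, T and S from the measured inputs."""
    n_kv = b * (pp + tg)      # total tokens held in the KV cache
    s_pp = b * pp / t_pp      # aggregate prefill speed, t/s
    s_tg = b * tg / t_tg      # aggregate generation speed, t/s
    t = t_pp + t_tg           # total wall time, s
    s = n_kv / t              # overall speed, t/s
    return n_kv, s_pp, s_tg, t, s

# (PP, TG, B, T_PP, T_TG) copied from the table above.
rows = [
    (4096, 32, 1, 1.124, 0.431),
    (8192, 32, 32, 72.880, 2.937),
]

results = [derived_columns(*row) for row in rows]
for row, res in zip(rows, results):
    n_kv, s_pp, s_tg, t, s = res
    print(row[:3], n_kv, f"{s_pp:.2f} {s_tg:.2f} {t:.3f} {s:.2f}")
```

For these two rows the recomputed values agree with the reported columns to well under 0.1%; the small residual comes from T_PP and T_TG being printed with three decimals.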
gpt-oss-120b
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)
llama-bench
| model | size | params | test | t/s |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 1689.47 ± 107.67 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 52.87 ± 1.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 1733.41 ± 5.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 51.02 ± 0.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 1705.93 ± 7.89 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 48.46 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 1514.78 ± 5.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 44.78 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 1221.23 ± 7.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 38.76 ± 0.06 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| 4096 | 32 | 1 | 4128 | 2.345 | 1746.43 | 0.751 | 42.59 | 3.097 | 1333.05 |
| 4096 | 32 | 2 | 8256 | 4.442 | 1844.38 | 1.252 | 51.13 | 5.693 | 1450.15 |
| 4096 | 32 | 4 | 16512 | 8.816 | 1858.46 | 1.583 | 80.86 | 10.399 | 1587.86 |
| 4096 | 32 | 8 | 33024 | 17.570 | 1865.00 | 2.124 | 120.55 | 19.694 | 1676.88 |
| 4096 | 32 | 16 | 66048 | 35.112 | 1866.50 | 3.083 | 166.05 | 38.195 | 1729.22 |
| 4096 | 32 | 32 | 132096 | 70.158 | 1868.23 | 4.581 | 223.53 | 74.739 | 1767.42 |
| 8192 | 32 | 1 | 8224 | 4.462 | 1835.87 | 0.773 | 41.42 | 5.235 | 1571.02 |
| 8192 | 32 | 2 | 16448 | 8.960 | 1828.57 | 1.304 | 49.10 | 10.264 | 1602.57 |
| 8192 | 32 | 4 | 32896 | 17.800 | 1840.87 | 1.669 | 76.70 | 19.469 | 1689.65 |
| 8192 | 32 | 8 | 65792 | 35.706 | 1835.41 | 2.339 | 109.44 | 38.046 | 1729.29 |
| 8192 | 32 | 16 | 131584 | 71.322 | 1837.75 | 3.507 | 145.98 | 74.829 | 1758.46 |
| 8192 | 32 | 32 | 263168 | 142.658 | 1837.57 | 5.593 | 183.08 | 148.251 | 1775.15 |
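S_TG is the aggregate generation rate across all B sequences, so it is worth separating total throughput from what each individual request sees. The sketch below is illustrative Python using the gpt-oss-120b PP = 4096 rows above; dividing S_TG by B to obtain a per-sequence rate is an interpretation added here, not a column reported by the tool.

```python
# S_TG (t/s) for gpt-oss-120b at PP = 4096, copied from the table above.
s_tg = {1: 42.59, 2: 51.13, 4: 80.86, 8: 120.55, 16: 166.05, 32: 223.53}

scaling = {}
for b, total in s_tg.items():
    scaling[b] = {
        "speedup": total / s_tg[1],   # aggregate throughput relative to B = 1
        "per_seq": total / b,         # average rate seen by one sequence
    }

for b, row in scaling.items():
    print(f"B={b:>2}: {row['speedup']:.2f}x aggregate, {row['per_seq']:.2f} t/s per sequence")
```

Aggregate decode throughput rises by about 5.2x from B = 1 to B = 32, while the average rate per sequence falls from 42.59 t/s to roughly 7 t/s.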
Qwen3 Coder 30B A3B
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)
llama-bench
| model | size | params | test | t/s |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 | 2933.39 ± 9.43 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 | 59.95 ± 0.26 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d4096 | 2537.98 ± 7.17 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d4096 | 52.70 ± 0.75 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d8192 | 2246.86 ± 6.45 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d8192 | 44.48 ± 0.34 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d16384 | 1772.41 ± 10.58 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d16384 | 37.10 ± 0.05 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d32768 | 1252.10 ± 2.16 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d32768 | 27.82 ± 0.01 |
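For further processing, rows of the llama-bench table can be parsed into numbers. The sketch below is illustrative Python; `parse_row` is a hypothetical helper written for the five-column layout shown in this document, not a llama.cpp utility.

```python
import re

ROW = "| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d4096 | 2537.98 ± 7.17 |"

def parse_row(line):
    """Split one table row into model, test kind, token count, depth and rate."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    model, size, params, test, rate = cells
    mean, spread = (float(x) for x in rate.split("±"))
    # Test labels look like "pp2048", "tg32" or "pp2048 @ d4096".
    m = re.fullmatch(r"(pp|tg)(\d+)(?: @ d(\d+))?", test)
    if m is None:
        raise ValueError(f"unrecognized test label: {test!r}")
    return {
        "model": model,
        "kind": m.group(1),
        "n_tokens": int(m.group(2)),
        "depth": int(m.group(3) or 0),
        "mean": mean,
        "spread": spread,
    }

parsed = parse_row(ROW)
print(parsed)
```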
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| 4096 | 32 | 1 | 4128 | 1.446 | 2831.93 | 0.609 | 52.58 | 2.055 | 2008.85 |
| 4096 | 32 | 2 | 8256 | 2.890 | 2834.16 | 1.117 | 57.27 | 4.008 | 2059.94 |
| 4096 | 32 | 4 | 16512 | 5.778 | 2835.64 | 1.546 | 82.78 | 7.324 | 2254.48 |
| 4096 | 32 | 8 | 33024 | 11.505 | 2848.21 | 2.195 | 116.63 | 13.700 | 2410.57 |
| 4096 | 32 | 16 | 66048 | 23.016 | 2847.43 | 3.218 | 159.11 | 26.234 | 2517.67 |
| 4096 | 32 | 32 | 132096 | 46.022 | 2848.01 | 4.926 | 207.86 | 50.949 | 2592.73 |
| 8192 | 32 | 1 | 8224 | 3.075 | 2664.32 | 0.724 | 44.18 | 3.799 | 2164.77 |
| 8192 | 32 | 2 | 16448 | 6.114 | 2679.65 | 1.267 | 50.50 | 7.382 | 2228.27 |
| 8192 | 32 | 4 | 32896 | 12.251 | 2674.64 | 1.807 | 70.85 | 14.058 | 2340.03 |
| 8192 | 32 | 8 | 65792 | 24.435 | 2682.01 | 2.729 | 93.82 | 27.164 | 2422.02 |
| 8192 | 32 | 16 | 131584 | 48.952 | 2677.56 | 4.322 | 118.47 | 53.274 | 2469.97 |
| 8192 | 32 | 32 | 263168 | 97.879 | 2678.26 | 7.057 | 145.10 | 104.936 | 2507.89 |
Qwen2.5 Coder 7B
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)
llama-bench
| model | size | params | test | t/s |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 | 2267.08 ± 6.38 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 | 29.40 ± 0.02 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d4096 | 2094.87 ± 11.61 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d4096 | 28.31 ± 0.10 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d8192 | 1906.26 ± 4.45 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d8192 | 27.53 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d16384 | 1634.82 ± 6.67 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d16384 | 26.03 ± 0.03 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d32768 | 1302.32 ± 4.58 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d32768 | 22.08 ± 0.03 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| 4096 | 32 | 1 | 4128 | 1.818 | 2252.90 | 1.131 | 28.30 | 2.949 | 1399.87 |
| 4096 | 32 | 2 | 8256 | 3.629 | 2257.46 | 1.273 | 50.29 | 4.901 | 1684.40 |
| 4096 | 32 | 4 | 16512 | 7.267 | 2254.46 | 1.393 | 91.86 | 8.661 | 1906.54 |
| 4096 | 32 | 8 | 33024 | 14.516 | 2257.44 | 1.598 | 160.22 | 16.113 | 2049.48 |
| 4096 | 32 | 16 | 66048 | 29.025 | 2257.90 | 2.092 | 244.69 | 31.118 | 2122.53 |
| 4096 | 32 | 32 | 132096 | 58.059 | 2257.55 | 2.764 | 370.44 | 60.824 | 2171.79 |
| 8192 | 32 | 1 | 8224 | 3.748 | 2185.91 | 1.171 | 27.33 | 4.918 | 1672.09 |
| 8192 | 32 | 2 | 16448 | 7.502 | 2183.95 | 1.354 | 47.28 | 8.856 | 1857.35 |
| 8192 | 32 | 4 | 32896 | 15.018 | 2181.92 | 1.556 | 82.27 | 16.574 | 1984.82 |
| 8192 | 32 | 8 | 65792 | 30.024 | 2182.77 | 1.908 | 134.16 | 31.933 | 2060.34 |
| 8192 | 32 | 16 | 131584 | 60.044 | 2182.93 | 2.673 | 191.55 | 62.717 | 2098.06 |
| 8192 | 32 | 32 | 263168 | 120.112 | 2182.49 | 3.903 | 262.39 | 124.015 | 2122.07 |
Gemma 3 4B QAT
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)
llama-bench
| model | size | params | test | t/s |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 | 5694.21 ± 13.18 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 | 79.83 ± 0.18 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d4096 | 5228.77 ± 20.56 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d4096 | 67.49 ± 1.17 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d8192 | 4882.66 ± 37.61 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d8192 | 66.87 ± 0.80 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d16384 | 4491.42 ± 44.60 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d16384 | 63.36 ± 0.66 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d32768 | 3840.09 ± 14.52 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d32768 | 57.67 ± 0.09 |
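The ± figure next to each rate indicates run-to-run spread; the sketch below treats it as a standard deviation, which is how llama-bench reports its results. Dividing it by the mean gives a relative spread that shows which measurements are noisiest. This is illustrative Python with the Gemma 3 tg32 values copied from the table above.

```python
# (mean t/s, spread) for the tg32 tests, keyed by context depth.
tg = {
    0: (79.83, 0.18),
    4096: (67.49, 1.17),
    8192: (66.87, 0.80),
    16384: (63.36, 0.66),
    32768: (57.67, 0.09),
}

# Relative spread: spread divided by the mean.
rel_spread = {depth: spread / mean for depth, (mean, spread) in tg.items()}
noisiest = max(rel_spread, key=rel_spread.get)

for depth, value in rel_spread.items():
    print(f"d={depth:>5}: {value:.2%}")
print("noisiest depth:", noisiest)
```

All five measurements have a relative spread below 2%; the largest, about 1.7%, is at d4096.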
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| 4096 | 32 | 1 | 4128 | 0.704 | 5819.48 | 0.467 | 68.59 | 1.170 | 3527.14 |
| 4096 | 32 | 2 | 8256 | 1.403 | 5837.53 | 0.585 | 109.49 | 1.988 | 4153.21 |
| 4096 | 32 | 4 | 16512 | 2.790 | 5871.91 | 0.675 | 189.62 | 3.465 | 4764.98 |
| 4096 | 32 | 8 | 33024 | 5.554 | 5899.92 | 0.907 | 282.33 | 6.461 | 5111.51 |
| 4096 | 32 | 16 | 66048 | 11.106 | 5900.73 | 1.349 | 379.54 | 12.455 | 5302.76 |
| 4096 | 32 | 32 | 132096 | 22.194 | 5905.76 | 2.113 | 484.68 | 24.307 | 5434.56 |
| 8192 | 32 | 1 | 8224 | 1.425 | 5750.56 | 0.477 | 67.12 | 1.901 | 4325.44 |
| 8192 | 32 | 2 | 16448 | 2.818 | 5814.33 | 0.653 | 98.07 | 3.470 | 4739.43 |
| 8192 | 32 | 4 | 32896 | 5.628 | 5822.18 | 0.809 | 158.21 | 6.437 | 5110.32 |
| 8192 | 32 | 8 | 65792 | 11.256 | 5822.26 | 1.183 | 216.33 | 12.439 | 5288.96 |
| 8192 | 32 | 16 | 131584 | 22.470 | 5833.08 | 1.913 | 267.69 | 24.383 | 5396.51 |
| 8192 | 32 | 32 | 263168 | 44.856 | 5844.16 | 3.251 | 314.98 | 48.107 | 5470.50 |
GLM 4.5 Air
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)
llama-bench
| model | size | params | test | t/s |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 | 841.44 ± 12.67 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 | 22.59 ± 0.11 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d4096 | 749.08 ± 2.10 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d4096 | 20.10 ± 0.01 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d8192 | 680.95 ± 1.38 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d8192 | 18.78 ± 0.07 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d16384 | 565.44 ± 1.47 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d16384 | 16.47 ± 0.01 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d32768 | 418.84 ± 0.53 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d32768 | 13.19 ± 0.01 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| 4096 | 32 | 1 | 4128 | 5.554 | 737.54 | 1.660 | 19.28 | 7.214 | 572.25 |
| 4096 | 32 | 2 | 8256 | 9.781 | 837.53 | 2.696 | 23.74 | 12.477 | 661.68 |
| 4096 | 32 | 4 | 16512 | 19.548 | 838.12 | 3.812 | 33.58 | 23.361 | 706.83 |
| 4096 | 32 | 8 | 33024 | 38.980 | 840.64 | 6.407 | 39.95 | 45.387 | 727.61 |
| 4096 | 32 | 16 | 66048 | 77.938 | 840.88 | 12.197 | 41.98 | 90.134 | 732.77 |
| 8192 | 32 | 1 | 8224 | 10.343 | 792.02 | 1.803 | 17.75 | 12.146 | 677.11 |
| 8192 | 32 | 2 | 16448 | 20.678 | 792.34 | 3.161 | 20.25 | 23.839 | 689.96 |
| 8192 | 32 | 4 | 32896 | 41.320 | 793.04 | 4.496 | 28.47 | 45.816 | 718.01 |
| 8192 | 32 | 8 | 65792 | 82.728 | 792.19 | 8.426 | 30.38 | 91.153 | 721.77 |
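The GLM 4.5 Air table has fewer rows than the other models. The sketch below is illustrative Python that compares the batch sizes reported above with the `-npl` list from the Commands section and computes the N_KV each absent case would have needed, assuming N_KV = B × (PP + TG) as in the reported rows. The source does not state why these cases are absent.

```python
expected_npl = [1, 2, 4, 8, 16, 32]   # -npl from the Commands section
ntg = 32                              # -ntg
n_ctx = 300000                        # -c

# Batch sizes that appear in the GLM 4.5 Air table above, keyed by PP.
reported = {4096: [1, 2, 4, 8, 16], 8192: [1, 2, 4, 8]}

# Batch sizes from the command that have no row, with the N_KV they would need.
missing = {
    pp: [(b, b * (pp + ntg)) for b in expected_npl if b not in seen]
    for pp, seen in reported.items()
}

for pp, cases in missing.items():
    for b, n_kv in cases:
        print(f"PP={pp} B={b}: N_KV would be {n_kv} (fits in -c: {n_kv <= n_ctx})")
```

Three cases are absent: B = 32 at PP = 4096, and B = 16 and B = 32 at PP = 8192. Each would fit within the 300000-token context requested with `-c`.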