llama.cpp on NVIDIA DGX Spark — Benchmarks

Build: b6767 (5acd45546) · Platform: NVIDIA GB10 (SM 12.1), VMM: yes
Tools: llama-bench, llama-batched-bench · Legend: pp = prefill, tg = token generation, d = context depth, B = batch size

Overview

This document summarizes the performance of llama.cpp for various models on the new NVIDIA DGX Spark.

Benchmarks include:
  • Prefill (pp) and generation (tg) at various context depths (d)
  • Batch sizes of 1, 2, 4, 8, 16, 32 — typical for local environments
Models:
  • gpt-oss-20b
  • gpt-oss-120b
  • Qwen3 Coder 30B A3B
  • Qwen2.5 Coder 7B
  • Gemma 3 4B QAT
  • GLM 4.5 Air

Feel free to request additional benchmarks for other models and use cases.

Benchmarks

Build

cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda -j
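
CMake should detect the GB10 target automatically. If it does not, the CUDA architecture can be pinned explicitly; a minimal sketch, assuming SM 12.1 is exposed to the toolchain as architecture "121" (verify against your CUDA toolkit):

# optional: pin the CUDA architecture (assumption: SM 12.1 maps to "121")
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"
cmake --build build-cuda -j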

Commands

# sequential requests
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

# parallel requests
llama-batched-bench -m [model.gguf] -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32
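
To reproduce the full sweep, the sequential benchmark can be looped over every model file; a minimal sketch, assuming the GGUF files sit under a models/ directory (hypothetical layout):

# run the sequential benchmark for each model (hypothetical models/ layout)
for m in models/*.gguf; do
    llama-bench -m "$m" -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
done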

History

| Date | Build | Commit | Notes |
| --- | --- | --- | --- |
| 2025 Oct 14 | b6761 | 7ea15bb64 | Initial numbers |
| 2025 Oct 15 | b6767 | 5acd45546 | Improved decode via ggml-org/llama.cpp#16585 |

gpt-oss-20b

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 | 3610.56 ± 15.16 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 | 79.74 ± 0.43 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d4096 | 3361.11 ± 12.95 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d4096 | 74.63 ± 0.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d8192 | 3147.73 ± 15.77 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d8192 | 69.49 ± 1.12 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d16384 | 2685.54 ± 5.76 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d16384 | 64.02 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d32768 | 2055.34 ± 20.43 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d32768 | 55.96 ± 0.07 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 1.124 | 3644.93 | 0.431 | 74.24 | 1.555 | 2654.99 |
| 4096 | 32 | 2 | 8256 | 2.252 | 3637.13 | 0.766 | 83.59 | 3.018 | 2735.57 |
| 4096 | 32 | 4 | 16512 | 4.485 | 3652.80 | 0.960 | 133.37 | 5.445 | 3032.46 |
| 4096 | 32 | 8 | 33024 | 8.958 | 3657.89 | 1.228 | 208.45 | 10.186 | 3242.01 |
| 4096 | 32 | 16 | 66048 | 17.883 | 3664.62 | 1.695 | 302.07 | 19.578 | 3373.52 |
| 4096 | 32 | 32 | 132096 | 35.738 | 3667.60 | 2.403 | 426.22 | 38.140 | 3463.42 |
| 8192 | 32 | 1 | 8224 | 2.285 | 3584.99 | 0.458 | 69.81 | 2.743 | 2997.64 |
| 8192 | 32 | 2 | 16448 | 4.547 | 3603.29 | 0.797 | 80.29 | 5.344 | 3077.82 |
| 8192 | 32 | 4 | 32896 | 9.117 | 3594.24 | 1.004 | 127.47 | 10.121 | 3250.29 |
| 8192 | 32 | 8 | 65792 | 18.248 | 3591.40 | 1.356 | 188.77 | 19.604 | 3356.01 |
| 8192 | 32 | 16 | 131584 | 36.389 | 3601.99 | 1.951 | 262.37 | 38.340 | 3432.02 |
| 8192 | 32 | 32 | 263168 | 72.880 | 3596.94 | 2.937 | 348.69 | 75.816 | 3471.12 |
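
The derived columns follow from the raw timings as S_PP = B·PP/T_PP, S_TG = B·TG/T_TG, and S = B·(PP + TG)/T. A quick awk sanity check against the first row above (B = 1) confirms the relation within rounding:

# recompute the derived speeds from the first row's timings (B=1, PP=4096, TG=32)
awk 'BEGIN { B=1; PP=4096; TG=32; T_PP=1.124; T_TG=0.431; T=1.555;
    printf "S_PP=%.2f S_TG=%.2f S=%.2f\n", B*PP/T_PP, B*TG/T_TG, B*(PP+TG)/T }'
# prints S_PP=3644.13 S_TG=74.25 S=2654.66, matching the table up to
# the three-decimal rounding of the reported timings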

gpt-oss-120b

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 1689.47 ± 107.67 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 52.87 ± 1.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 1733.41 ± 5.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 51.02 ± 0.65 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 1705.93 ± 7.89 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 48.46 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 1514.78 ± 5.66 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 44.78 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 1221.23 ± 7.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 38.76 ± 0.06 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.345 | 1746.43 | 0.751 | 42.59 | 3.097 | 1333.05 |
| 4096 | 32 | 2 | 8256 | 4.442 | 1844.38 | 1.252 | 51.13 | 5.693 | 1450.15 |
| 4096 | 32 | 4 | 16512 | 8.816 | 1858.46 | 1.583 | 80.86 | 10.399 | 1587.86 |
| 4096 | 32 | 8 | 33024 | 17.570 | 1865.00 | 2.124 | 120.55 | 19.694 | 1676.88 |
| 4096 | 32 | 16 | 66048 | 35.112 | 1866.50 | 3.083 | 166.05 | 38.195 | 1729.22 |
| 4096 | 32 | 32 | 132096 | 70.158 | 1868.23 | 4.581 | 223.53 | 74.739 | 1767.42 |
| 8192 | 32 | 1 | 8224 | 4.462 | 1835.87 | 0.773 | 41.42 | 5.235 | 1571.02 |
| 8192 | 32 | 2 | 16448 | 8.960 | 1828.57 | 1.304 | 49.10 | 10.264 | 1602.57 |
| 8192 | 32 | 4 | 32896 | 17.800 | 1840.87 | 1.669 | 76.70 | 19.469 | 1689.65 |
| 8192 | 32 | 8 | 65792 | 35.706 | 1835.41 | 2.339 | 109.44 | 38.046 | 1729.29 |
| 8192 | 32 | 16 | 131584 | 71.322 | 1837.75 | 3.507 | 145.98 | 74.829 | 1758.46 |
| 8192 | 32 | 32 | 263168 | 142.658 | 1837.57 | 5.593 | 183.08 | 148.251 | 1775.15 |

Qwen3 Coder 30B A3B

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 | 2933.39 ± 9.43 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 | 59.95 ± 0.26 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d4096 | 2537.98 ± 7.17 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d4096 | 52.70 ± 0.75 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d8192 | 2246.86 ± 6.45 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d8192 | 44.48 ± 0.34 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d16384 | 1772.41 ± 10.58 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d16384 | 37.10 ± 0.05 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d32768 | 1252.10 ± 2.16 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d32768 | 27.82 ± 0.01 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 1.446 | 2831.93 | 0.609 | 52.58 | 2.055 | 2008.85 |
| 4096 | 32 | 2 | 8256 | 2.890 | 2834.16 | 1.117 | 57.27 | 4.008 | 2059.94 |
| 4096 | 32 | 4 | 16512 | 5.778 | 2835.64 | 1.546 | 82.78 | 7.324 | 2254.48 |
| 4096 | 32 | 8 | 33024 | 11.505 | 2848.21 | 2.195 | 116.63 | 13.700 | 2410.57 |
| 4096 | 32 | 16 | 66048 | 23.016 | 2847.43 | 3.218 | 159.11 | 26.234 | 2517.67 |
| 4096 | 32 | 32 | 132096 | 46.022 | 2848.01 | 4.926 | 207.86 | 50.949 | 2592.73 |
| 8192 | 32 | 1 | 8224 | 3.075 | 2664.32 | 0.724 | 44.18 | 3.799 | 2164.77 |
| 8192 | 32 | 2 | 16448 | 6.114 | 2679.65 | 1.267 | 50.50 | 7.382 | 2228.27 |
| 8192 | 32 | 4 | 32896 | 12.251 | 2674.64 | 1.807 | 70.85 | 14.058 | 2340.03 |
| 8192 | 32 | 8 | 65792 | 24.435 | 2682.01 | 2.729 | 93.82 | 27.164 | 2422.02 |
| 8192 | 32 | 16 | 131584 | 48.952 | 2677.56 | 4.322 | 118.47 | 53.274 | 2469.97 |
| 8192 | 32 | 32 | 263168 | 97.879 | 2678.26 | 7.057 | 145.10 | 104.936 | 2507.89 |

Qwen2.5 Coder 7B

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 | 2267.08 ± 6.38 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 | 29.40 ± 0.02 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d4096 | 2094.87 ± 11.61 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d4096 | 28.31 ± 0.10 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d8192 | 1906.26 ± 4.45 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d8192 | 27.53 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d16384 | 1634.82 ± 6.67 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d16384 | 26.03 ± 0.03 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d32768 | 1302.32 ± 4.58 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d32768 | 22.08 ± 0.03 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 1.818 | 2252.90 | 1.131 | 28.30 | 2.949 | 1399.87 |
| 4096 | 32 | 2 | 8256 | 3.629 | 2257.46 | 1.273 | 50.29 | 4.901 | 1684.40 |
| 4096 | 32 | 4 | 16512 | 7.267 | 2254.46 | 1.393 | 91.86 | 8.661 | 1906.54 |
| 4096 | 32 | 8 | 33024 | 14.516 | 2257.44 | 1.598 | 160.22 | 16.113 | 2049.48 |
| 4096 | 32 | 16 | 66048 | 29.025 | 2257.90 | 2.092 | 244.69 | 31.118 | 2122.53 |
| 4096 | 32 | 32 | 132096 | 58.059 | 2257.55 | 2.764 | 370.44 | 60.824 | 2171.79 |
| 8192 | 32 | 1 | 8224 | 3.748 | 2185.91 | 1.171 | 27.33 | 4.918 | 1672.09 |
| 8192 | 32 | 2 | 16448 | 7.502 | 2183.95 | 1.354 | 47.28 | 8.856 | 1857.35 |
| 8192 | 32 | 4 | 32896 | 15.018 | 2181.92 | 1.556 | 82.27 | 16.574 | 1984.82 |
| 8192 | 32 | 8 | 65792 | 30.024 | 2182.77 | 1.908 | 134.16 | 31.933 | 2060.34 |
| 8192 | 32 | 16 | 131584 | 60.044 | 2182.93 | 2.673 | 191.55 | 62.717 | 2098.06 |
| 8192 | 32 | 32 | 263168 | 120.112 | 2182.49 | 3.903 | 262.39 | 124.015 | 2122.07 |

Gemma 3 4B QAT

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 | 5694.21 ± 13.18 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 | 79.83 ± 0.18 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d4096 | 5228.77 ± 20.56 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d4096 | 67.49 ± 1.17 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d8192 | 4882.66 ± 37.61 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d8192 | 66.87 ± 0.80 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d16384 | 4491.42 ± 44.60 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d16384 | 63.36 ± 0.66 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d32768 | 3840.09 ± 14.52 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d32768 | 57.67 ± 0.09 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 0.704 | 5819.48 | 0.467 | 68.59 | 1.170 | 3527.14 |
| 4096 | 32 | 2 | 8256 | 1.403 | 5837.53 | 0.585 | 109.49 | 1.988 | 4153.21 |
| 4096 | 32 | 4 | 16512 | 2.790 | 5871.91 | 0.675 | 189.62 | 3.465 | 4764.98 |
| 4096 | 32 | 8 | 33024 | 5.554 | 5899.92 | 0.907 | 282.33 | 6.461 | 5111.51 |
| 4096 | 32 | 16 | 66048 | 11.106 | 5900.73 | 1.349 | 379.54 | 12.455 | 5302.76 |
| 4096 | 32 | 32 | 132096 | 22.194 | 5905.76 | 2.113 | 484.68 | 24.307 | 5434.56 |
| 8192 | 32 | 1 | 8224 | 1.425 | 5750.56 | 0.477 | 67.12 | 1.901 | 4325.44 |
| 8192 | 32 | 2 | 16448 | 2.818 | 5814.33 | 0.653 | 98.07 | 3.470 | 4739.43 |
| 8192 | 32 | 4 | 32896 | 5.628 | 5822.18 | 0.809 | 158.21 | 6.437 | 5110.32 |
| 8192 | 32 | 8 | 65792 | 11.256 | 5822.26 | 1.183 | 216.33 | 12.439 | 5288.96 |
| 8192 | 32 | 16 | 131584 | 22.470 | 5833.08 | 1.913 | 267.69 | 24.383 | 5396.51 |
| 8192 | 32 | 32 | 263168 | 44.856 | 5844.16 | 3.251 | 314.98 | 48.107 | 5470.50 |

GLM 4.5 Air

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 | 841.44 ± 12.67 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 | 22.59 ± 0.11 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d4096 | 749.08 ± 2.10 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d4096 | 20.10 ± 0.01 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d8192 | 680.95 ± 1.38 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d8192 | 18.78 ± 0.07 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d16384 | 565.44 ± 1.47 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d16384 | 16.47 ± 0.01 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 @ d32768 | 418.84 ± 0.53 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 @ d32768 | 13.19 ± 0.01 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 5.554 | 737.54 | 1.660 | 19.28 | 7.214 | 572.25 |
| 4096 | 32 | 2 | 8256 | 9.781 | 837.53 | 2.696 | 23.74 | 12.477 | 661.68 |
| 4096 | 32 | 4 | 16512 | 19.548 | 838.12 | 3.812 | 33.58 | 23.361 | 706.83 |
| 4096 | 32 | 8 | 33024 | 38.980 | 840.64 | 6.407 | 39.95 | 45.387 | 727.61 |
| 4096 | 32 | 16 | 66048 | 77.938 | 840.88 | 12.197 | 41.98 | 90.134 | 732.77 |
| 8192 | 32 | 1 | 8224 | 10.343 | 792.02 | 1.803 | 17.75 | 12.146 | 677.11 |
| 8192 | 32 | 2 | 16448 | 20.678 | 792.34 | 3.161 | 20.25 | 23.839 | 689.96 |
| 8192 | 32 | 4 | 32896 | 41.320 | 793.04 | 4.496 | 28.47 | 45.816 | 718.01 |
| 8192 | 32 | 8 | 65792 | 82.728 | 792.19 | 8.426 | 30.38 | 91.153 | 721.77 |

More info