llama.cpp on NVIDIA DGX Spark — Benchmarks

Build: b6767 (5acd45546) · Published: Oct 31, 2025 · Platform: NVIDIA GB10 (SM 12.1), VMM: yes

Tools: llama-bench, llama-batched-bench · Legend: pp = prefill, tg = token generation, d = context depth, B = batch size

Overview

This document summarizes llama.cpp performance for the six models listed below on the new NVIDIA DGX Spark.

Benchmarks include:

Prefill (pp) and generation (tg) at context depths (d) of 0, 4096, 8192, 16384, and 32768 tokens
Batch sizes of 1, 2, 4, 8, 16, 32, which are typical for local environments

Models:

gpt-oss-20b
gpt-oss-120b
Qwen3 Coder 30B A3B
Qwen2.5 Coder 7B
Gemma 3 4B QAT
GLM 4.5 Air

Feel free to request additional benchmarks for other models and use cases.

Benchmarks

Build

cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda -j

Commands

# sequential requests
llama-bench -m [model.gguf] -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048

# parallel requests
llama-batched-bench -m [model.gguf] -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32
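
As a rough sanity check on the `-c 300000` value in the parallel command, the sketch below computes how many KV-cache cells each `-npp` / `-npl` combination would need. It assumes N_KV = B × (PP + TG), which matches the N_KV column in the llama-batched-bench tables in this document; the names `kv_cells`, `grid`, and `largest` are illustrative and not part of llama.cpp.

```python
# Sketch: KV-cache cells needed per llama-batched-bench configuration.
# Assumption: N_KV = B * (PP + TG), as the N_KV column in this document suggests.

CTX = 300_000                 # value passed via -c
NPP = [4096, 8192]            # values passed via -npp
NTG = 32                      # value passed via -ntg
NPL = [1, 2, 4, 8, 16, 32]    # values passed via -npl


def kv_cells(pp: int, tg: int, b: int) -> int:
    """Total KV cells for b parallel sequences of pp prompt + tg generated tokens."""
    return b * (pp + tg)


# Map (prompt length, batch size) -> required KV cells.
grid = {(pp, b): kv_cells(pp, NTG, b) for pp in NPP for b in NPL}
largest = max(grid.values())

for (pp, b), n in grid.items():
    print(f"PP={pp:5d} B={b:2d} N_KV={n:7d}")
print(f"largest N_KV = {largest}, fits in -c {CTX}: {largest <= CTX}")
```

Under this assumption the largest configuration (PP = 8192, B = 32) needs 263168 cells, which is below the 300000-token context requested in the command.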

History

Date	Build	Commit	Notes
2025 Oct 14	b6761	`7ea15bb64`	Initial numbers
2025 Oct 15	b6767	`5acd45546`	Improved decode via ggml-org/llama.cpp#16585

gpt-oss-20b

Model card: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench

model	size	params	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048	3610.56 ± 15.16
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32	79.74 ± 0.43
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d4096	3361.11 ± 12.95
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d4096	74.63 ± 0.15
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d8192	3147.73 ± 15.77
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d8192	69.49 ± 1.12
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d16384	2685.54 ± 5.76
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d16384	64.02 ± 0.72
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	pp2048 @ d32768	2055.34 ± 20.43
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	tg32 @ d32768	55.96 ± 0.07

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	1.124	3644.93	0.431	74.24	1.555	2654.99
4096	32	2	8256	2.252	3637.13	0.766	83.59	3.018	2735.57
4096	32	4	16512	4.485	3652.80	0.960	133.37	5.445	3032.46
4096	32	8	33024	8.958	3657.89	1.228	208.45	10.186	3242.01
4096	32	16	66048	17.883	3664.62	1.695	302.07	19.578	3373.52
4096	32	32	132096	35.738	3667.60	2.403	426.22	38.140	3463.42
8192	32	1	8224	2.285	3584.99	0.458	69.81	2.743	2997.64
8192	32	2	16448	4.547	3603.29	0.797	80.29	5.344	3077.82
8192	32	4	32896	9.117	3594.24	1.004	127.47	10.121	3250.29
8192	32	8	65792	18.248	3591.40	1.356	188.77	19.604	3356.01
8192	32	16	131584	36.389	3601.99	1.951	262.37	38.340	3432.02
8192	32	32	263168	72.880	3596.94	2.937	348.69	75.816	3471.12
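
The throughput columns in the llama-batched-bench tables appear to follow directly from the timing columns. The sketch below recomputes them for one row, assuming S_PP = B × PP / T_PP, S_TG = B × TG / T_TG, T = T_PP + T_TG, and S = N_KV / T. These relations are inferred from the numbers in this document rather than taken from the tool's source, and `derive` is a hypothetical helper name.

```python
# Sketch: recompute the derived llama-batched-bench columns from the timings.
# The formulas are inferred from the tables in this document (an assumption).


def derive(pp: int, tg: int, b: int, t_pp: float, t_tg: float) -> dict:
    """Return the derived columns for one table row."""
    n_kv = b * (pp + tg)          # total tokens held in the KV cache
    t = t_pp + t_tg               # total wall time in seconds
    return {
        "N_KV": n_kv,
        "S_PP": b * pp / t_pp,    # aggregate prefill speed, tokens/s
        "S_TG": b * tg / t_tg,    # aggregate generation speed, tokens/s
        "T": t,
        "S": n_kv / t,            # overall speed, tokens/s
    }


# gpt-oss-20b row: PP=4096, TG=32, B=32, T_PP=35.738 s, T_TG=2.403 s
row = derive(4096, 32, 32, 35.738, 2.403)
for name, value in row.items():
    print(f"{name:5s} {value:.2f}")
```

The recomputed values differ slightly from the table because the published timings are rounded to three decimals. Note that S_TG is the aggregate rate across all B sequences, so the per-sequence generation speed is S_TG / B.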

gpt-oss-120b

Model card: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench

model	size	params	test	t/s
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048	1689.47 ± 107.67
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32	52.87 ± 1.70
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d4096	1733.41 ± 5.19
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d4096	51.02 ± 0.65
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d8192	1705.93 ± 7.89
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d8192	48.46 ± 0.53
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d16384	1514.78 ± 5.66
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d16384	44.78 ± 0.07
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	pp2048 @ d32768	1221.23 ± 7.85
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	tg32 @ d32768	38.76 ± 0.06

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	2.345	1746.43	0.751	42.59	3.097	1333.05
4096	32	2	8256	4.442	1844.38	1.252	51.13	5.693	1450.15
4096	32	4	16512	8.816	1858.46	1.583	80.86	10.399	1587.86
4096	32	8	33024	17.570	1865.00	2.124	120.55	19.694	1676.88
4096	32	16	66048	35.112	1866.50	3.083	166.05	38.195	1729.22
4096	32	32	132096	70.158	1868.23	4.581	223.53	74.739	1767.42
8192	32	1	8224	4.462	1835.87	0.773	41.42	5.235	1571.02
8192	32	2	16448	8.960	1828.57	1.304	49.10	10.264	1602.57
8192	32	4	32896	17.800	1840.87	1.669	76.70	19.469	1689.65
8192	32	8	65792	35.706	1835.41	2.339	109.44	38.046	1729.29
8192	32	16	131584	71.322	1837.75	3.507	145.98	74.829	1758.46
8192	32	32	263168	142.658	1837.57	5.593	183.08	148.251	1775.15

Qwen3 Coder 30B A3B

Model card: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench

model	size	params	test	t/s
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048	2933.39 ± 9.43
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32	59.95 ± 0.26
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d4096	2537.98 ± 7.17
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d4096	52.70 ± 0.75
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d8192	2246.86 ± 6.45
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d8192	44.48 ± 0.34
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d16384	1772.41 ± 10.58
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d16384	37.10 ± 0.05
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	pp2048 @ d32768	1252.10 ± 2.16
qwen3moe 30B.A3B Q8_0	30.25 GiB	30.53 B	tg32 @ d32768	27.82 ± 0.01

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	1.446	2831.93	0.609	52.58	2.055	2008.85
4096	32	2	8256	2.890	2834.16	1.117	57.27	4.008	2059.94
4096	32	4	16512	5.778	2835.64	1.546	82.78	7.324	2254.48
4096	32	8	33024	11.505	2848.21	2.195	116.63	13.700	2410.57
4096	32	16	66048	23.016	2847.43	3.218	159.11	26.234	2517.67
4096	32	32	132096	46.022	2848.01	4.926	207.86	50.949	2592.73
8192	32	1	8224	3.075	2664.32	0.724	44.18	3.799	2164.77
8192	32	2	16448	6.114	2679.65	1.267	50.50	7.382	2228.27
8192	32	4	32896	12.251	2674.64	1.807	70.85	14.058	2340.03
8192	32	8	65792	24.435	2682.01	2.729	93.82	27.164	2422.02
8192	32	16	131584	48.952	2677.56	4.322	118.47	53.274	2469.97
8192	32	32	263168	97.879	2678.26	7.057	145.10	104.936	2507.89

Qwen2.5 Coder 7B

Model card: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench

model	size	params	test	t/s
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048	2267.08 ± 6.38
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32	29.40 ± 0.02
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d4096	2094.87 ± 11.61
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d4096	28.31 ± 0.10
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d8192	1906.26 ± 4.45
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d8192	27.53 ± 0.04
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d16384	1634.82 ± 6.67
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d16384	26.03 ± 0.03
qwen2 7B Q8_0	7.54 GiB	7.62 B	pp2048 @ d32768	1302.32 ± 4.58
qwen2 7B Q8_0	7.54 GiB	7.62 B	tg32 @ d32768	22.08 ± 0.03

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	1.818	2252.90	1.131	28.30	2.949	1399.87
4096	32	2	8256	3.629	2257.46	1.273	50.29	4.901	1684.40
4096	32	4	16512	7.267	2254.46	1.393	91.86	8.661	1906.54
4096	32	8	33024	14.516	2257.44	1.598	160.22	16.113	2049.48
4096	32	16	66048	29.025	2257.90	2.092	244.69	31.118	2122.53
4096	32	32	132096	58.059	2257.55	2.764	370.44	60.824	2171.79
8192	32	1	8224	3.748	2185.91	1.171	27.33	4.918	1672.09
8192	32	2	16448	7.502	2183.95	1.354	47.28	8.856	1857.35
8192	32	4	32896	15.018	2181.92	1.556	82.27	16.574	1984.82
8192	32	8	65792	30.024	2182.77	1.908	134.16	31.933	2060.34
8192	32	16	131584	60.044	2182.93	2.673	191.55	62.717	2098.06
8192	32	32	263168	120.112	2182.49	3.903	262.39	124.015	2122.07

Gemma 3 4B QAT

Model card: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench

model	size	params	test	t/s
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048	5694.21 ± 13.18
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32	79.83 ± 0.18
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d4096	5228.77 ± 20.56
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d4096	67.49 ± 1.17
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d8192	4882.66 ± 37.61
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d8192	66.87 ± 0.80
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d16384	4491.42 ± 44.60
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d16384	63.36 ± 0.66
gemma3 4B Q4_0	2.35 GiB	3.88 B	pp2048 @ d32768	3840.09 ± 14.52
gemma3 4B Q4_0	2.35 GiB	3.88 B	tg32 @ d32768	57.67 ± 0.09

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	0.704	5819.48	0.467	68.59	1.170	3527.14
4096	32	2	8256	1.403	5837.53	0.585	109.49	1.988	4153.21
4096	32	4	16512	2.790	5871.91	0.675	189.62	3.465	4764.98
4096	32	8	33024	5.554	5899.92	0.907	282.33	6.461	5111.51
4096	32	16	66048	11.106	5900.73	1.349	379.54	12.455	5302.76
4096	32	32	132096	22.194	5905.76	2.113	484.68	24.307	5434.56
8192	32	1	8224	1.425	5750.56	0.477	67.12	1.901	4325.44
8192	32	2	16448	2.818	5814.33	0.653	98.07	3.470	4739.43
8192	32	4	32896	5.628	5822.18	0.809	158.21	6.437	5110.32
8192	32	8	65792	11.256	5822.26	1.183	216.33	12.439	5288.96
8192	32	16	131584	22.470	5833.08	1.913	267.69	24.383	5396.51
8192	32	32	263168	44.856	5844.16	3.251	314.98	48.107	5470.50

GLM 4.5 Air

Model card: https://huggingface.co/unsloth/GLM-4.5-Air-GGUF/tree/main

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes · build 5acd45546 (6767)

llama-bench

model	size	params	test	t/s
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048	841.44 ± 12.67
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32	22.59 ± 0.11
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d4096	749.08 ± 2.10
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d4096	20.10 ± 0.01
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d8192	680.95 ± 1.38
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d8192	18.78 ± 0.07
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d16384	565.44 ± 1.47
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d16384	16.47 ± 0.01
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	pp2048 @ d32768	418.84 ± 0.53
glm4moe 106B.A12B Q4_K	67.85 GiB	110.47 B	tg32 @ d32768	13.19 ± 0.01

llama-batched-bench

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
4096	32	1	4128	5.554	737.54	1.660	19.28	7.214	572.25
4096	32	2	8256	9.781	837.53	2.696	23.74	12.477	661.68
4096	32	4	16512	19.548	838.12	3.812	33.58	23.361	706.83
4096	32	8	33024	38.980	840.64	6.407	39.95	45.387	727.61
4096	32	16	66048	77.938	840.88	12.197	41.98	90.134	732.77
8192	32	1	8224	10.343	792.02	1.803	17.75	12.146	677.11
8192	32	2	16448	20.678	792.34	3.161	20.25	23.839	689.96
8192	32	4	32896	41.320	793.04	4.496	28.47	45.816	718.01
8192	32	8	65792	82.728	792.19	8.426	30.38	91.153	721.77
