Jetson AGX Thor — llama.cpp Benchmarks
Build: b6767 (5acd4554) · Published: Oct 30, 2025 · Platform: NVIDIA Thor (SM 11.0), VMM: yes
Tools: llama-bench · llama-batched-bench

Legend: pp = prefill · tg = token generation · d = context depth · B = batch size
Overview
This document summarizes llama.cpp performance for several models on the NVIDIA Jetson AGX Thor. Benchmarks cover:

- Prefill (pp) and token generation (tg) throughput at several context depths (d)
- Batch sizes of 1, 2, 4, 8, 16, and 32, typical of local serving workloads
Notes:

- Device 0: NVIDIA Jetson AGX Thor, compute capability 11.0, VMM: yes
- All binaries were built from b6767 (5acd4554) with -DGGML_CUDA=ON
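For reference, a typical way to reproduce runs like these with the stock llama.cpp tools is sketched below. The model path and the `-c` value are placeholders, not taken from the source; only the pp/tg/d and batch-size settings match the tables that follow.

```shell
# Build llama.cpp with CUDA enabled (per the -DGGML_CUDA=ON note above).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Single-sequence benchmark: pp2048 / tg32 at the context depths used here.
# The model path is a placeholder.
./build/bin/llama-bench -m model.gguf -p 2048 -n 32 -d 0,4096,8192,16384,32768

# Batched benchmark: prompt lengths 4096 and 8192, 32 generated tokens,
# parallel sequence counts 1-32. The -c value is illustrative; it only needs
# to cover the largest N_KV in the tables below (263168).
./build/bin/llama-batched-bench -m model.gguf -c 266240 \
    -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32
```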
gpt-oss-20b
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 | 1861.26 ± 6.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 | 57.18 ± 0.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d4096 | 1728.68 ± 6.75 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d4096 | 52.46 ± 0.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d8192 | 1614.11 ± 6.20 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d8192 | 51.26 ± 0.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d16384 | 1377.71 ± 43.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d16384 | 51.60 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d32768 | 1123.22 ± 2.55 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d32768 | 46.53 ± 0.07 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.191 | 1869.50 | 0.576 | 55.52 | 2.767 | 1491.70 |
| 4096 | 32 | 2 | 8256 | 4.153 | 1972.64 | 1.083 | 59.08 | 5.236 | 1576.77 |
| 4096 | 32 | 4 | 16512 | 8.285 | 1977.65 | 1.316 | 97.24 | 9.601 | 1719.85 |
| 4096 | 32 | 8 | 33024 | 16.531 | 1982.21 | 1.748 | 146.42 | 18.279 | 1806.62 |
| 4096 | 32 | 16 | 66048 | 33.076 | 1981.38 | 2.278 | 224.76 | 35.354 | 1868.19 |
| 4096 | 32 | 32 | 132096 | 66.070 | 1983.83 | 3.369 | 303.98 | 69.439 | 1902.34 |
| 8192 | 32 | 1 | 8224 | 4.279 | 1914.62 | 0.586 | 54.57 | 4.865 | 1690.44 |
| 8192 | 32 | 2 | 16448 | 8.507 | 1925.91 | 1.110 | 57.67 | 9.617 | 1710.31 |
| 8192 | 32 | 4 | 32896 | 16.957 | 1932.46 | 1.382 | 92.65 | 18.338 | 1793.85 |
| 8192 | 32 | 8 | 65792 | 33.888 | 1933.88 | 1.879 | 136.24 | 35.767 | 1839.45 |
| 8192 | 32 | 16 | 131584 | 67.772 | 1934.02 | 2.542 | 201.43 | 70.314 | 1871.38 |
| 8192 | 32 | 32 | 263168 | 135.397 | 1936.11 | 3.872 | 264.49 | 139.269 | 1889.64 |
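The batched-bench columns are related arithmetically: N_KV = B × (PP + TG), S_PP = B·PP / T_PP, S_TG = B·TG / T_TG, and S = N_KV / T (small deviations come from the tool printing rounded times). A quick check against the B = 2, PP = 4096 row above:

```python
# Verify the column relationships on the B = 2, PP = 4096 row
# of the gpt-oss-20b llama-batched-bench table.
PP, TG, B = 4096, 32, 2
T_PP, T_TG, T = 4.153, 1.083, 5.236

n_kv = B * (PP + TG)    # total tokens in the KV cache across the batch
s_pp = B * PP / T_PP    # aggregate prefill throughput (t/s)
s_tg = B * TG / T_TG    # aggregate generation throughput (t/s)
s = n_kv / T            # end-to-end throughput (t/s)

print(n_kv, round(s_pp, 2), round(s_tg, 2), round(s, 2))
```

The recomputed values land within rounding error of the table's 1972.64 / 59.08 / 1576.77.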
Run summary
```
load time     =   6610.08 ms
prompt eval   = 417768.33 ms / 778128 tok (0.54 ms/tok, 1862.58 tok/s)
decode eval   =   1162.30 ms / 64 runs (18.16 ms/tok, 55.06 tok/s)
total time    = 425524.98 ms / 778192 tok
graphs reused = 372
```
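The per-token and throughput figures in the summary follow directly from the raw totals; for the prompt-eval line:

```python
# Derive the prompt-eval rates from the totals in the run summary above.
eval_ms = 417768.33   # total prompt-eval time (ms)
tokens = 778128       # prompt tokens processed

ms_per_tok = eval_ms / tokens         # matches the reported 0.54 ms/tok
tok_per_s = tokens / eval_ms * 1000   # matches the reported 1862.58 tok/s
```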
gpt-oss-120b
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 937.81 ± 4.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 41.82 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 897.09 ± 2.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 39.35 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 856.72 ± 3.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 38.26 ± 0.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 774.93 ± 1.60 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 36.31 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 635.81 ± 1.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 32.62 ± 0.05 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 4.758 | 860.84 | 0.821 | 38.97 | 5.579 | 739.87 |
| 4096 | 32 | 2 | 8256 | 8.684 | 943.38 | 1.952 | 32.79 | 10.635 | 776.28 |
| 4096 | 32 | 4 | 16512 | 17.392 | 942.05 | 2.449 | 52.26 | 19.841 | 832.20 |
| 4096 | 32 | 8 | 33024 | 34.777 | 942.24 | 3.237 | 79.09 | 38.013 | 868.75 |
| 4096 | 32 | 16 | 66048 | 69.644 | 941.02 | 4.548 | 112.57 | 74.192 | 890.23 |
| 4096 | 32 | 32 | 132096 | 139.233 | 941.38 | 6.893 | 148.55 | 146.127 | 903.98 |
| 8192 | 32 | 1 | 8224 | 8.857 | 924.94 | 0.832 | 38.47 | 9.689 | 848.84 |
| 8192 | 32 | 2 | 16448 | 17.752 | 922.94 | 2.071 | 30.91 | 19.823 | 829.76 |
| 8192 | 32 | 4 | 32896 | 35.434 | 924.75 | 2.529 | 50.60 | 37.964 | 866.51 |
| 8192 | 32 | 8 | 65792 | 70.713 | 926.78 | 3.548 | 72.15 | 74.262 | 885.95 |
| 8192 | 32 | 16 | 131584 | 141.543 | 926.02 | 4.860 | 105.35 | 146.403 | 898.78 |
| 8192 | 32 | 32 | 263168 | 282.951 | 926.46 | 7.722 | 132.61 | 290.673 | 905.37 |
Run summary
```
load time     =  67534.35 ms
prompt eval   = 871709.25 ms / 778128 tok (1.12 ms/tok, 892.65 tok/s)
decode eval   =   1652.57 ms / 64 runs (25.82 ms/tok, 38.73 tok/s)
total time    = 940801.38 ms / 778192 tok
graphs reused = 372
```
Qwen3 Coder 30B A3B
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 | 1533.71 ± 3.66 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 | 42.70 ± 0.08 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d4096 | 1314.86 ± 2.31 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d4096 | 37.74 ± 0.07 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d8192 | 1143.21 ± 1.68 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d8192 | 35.88 ± 0.58 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d16384 | 928.23 ± 2.83 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d16384 | 32.12 ± 0.03 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d32768 | 610.49 ± 1.52 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d32768 | 25.69 ± 0.04 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.747 | 1491.24 | 0.829 | 38.61 | 3.575 | 1154.54 |
| 4096 | 32 | 2 | 8256 | 5.349 | 1531.41 | 1.388 | 46.10 | 6.738 | 1225.37 |
| 4096 | 32 | 4 | 16512 | 10.688 | 1532.90 | 1.775 | 72.11 | 12.463 | 1324.84 |
| 4096 | 32 | 8 | 33024 | 21.332 | 1536.09 | 2.447 | 104.64 | 23.779 | 1388.81 |
| 4096 | 32 | 16 | 66048 | 42.617 | 1537.79 | 3.513 | 145.75 | 46.130 | 1431.79 |
| 4096 | 32 | 32 | 132096 | 85.262 | 1537.29 | 5.421 | 188.90 | 90.683 | 1456.68 |
| 8192 | 32 | 1 | 8224 | 5.744 | 1426.21 | 0.885 | 36.17 | 6.629 | 1240.67 |
| 8192 | 32 | 2 | 16448 | 11.468 | 1428.69 | 1.487 | 43.04 | 12.955 | 1269.64 |
| 8192 | 32 | 4 | 32896 | 22.937 | 1428.62 | 2.015 | 63.52 | 24.952 | 1318.37 |
| 8192 | 32 | 8 | 65792 | 45.809 | 1430.63 | 2.990 | 85.63 | 48.799 | 1348.23 |
| 8192 | 32 | 16 | 131584 | 91.578 | 1431.27 | 4.508 | 113.58 | 96.085 | 1369.45 |
| 8192 | 32 | 32 | 263168 | 183.218 | 1430.77 | 7.403 | 138.32 | 190.621 | 1380.58 |
Run summary
```
load time     =  16355.53 ms
prompt eval   = 561791.71 ms / 778128 tok (0.72 ms/tok, 1385.08 tok/s)
decode eval   =   1713.13 ms / 64 runs (26.77 ms/tok, 37.36 tok/s)
total time    = 579823.96 ms / 778192 tok
graphs reused = 372
```
Qwen2.5 Coder 7B
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 | 1606.76 ± 2.75 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 | 25.26 ± 0.09 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d4096 | 1441.19 ± 2.14 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d4096 | 24.18 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d8192 | 1306.22 ± 0.76 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d8192 | 23.91 ± 0.23 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d16384 | 1140.95 ± 1.19 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d16384 | 22.83 ± 0.03 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d32768 | 808.52 ± 1.25 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d32768 | 20.42 ± 0.01 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.656 | 1541.92 | 1.309 | 24.45 | 3.965 | 1041.10 |
| 4096 | 32 | 2 | 8256 | 5.103 | 1605.30 | 1.286 | 49.75 | 6.390 | 1292.10 |
| 4096 | 32 | 4 | 16512 | 9.840 | 1665.05 | 1.369 | 93.52 | 11.209 | 1473.15 |
| 4096 | 32 | 8 | 33024 | 19.661 | 1666.64 | 1.867 | 137.09 | 21.528 | 1533.97 |
| 4096 | 32 | 16 | 66048 | 39.260 | 1669.28 | 2.143 | 238.89 | 41.403 | 1595.23 |
| 4096 | 32 | 32 | 132096 | 78.482 | 1670.09 | 3.000 | 341.33 | 81.482 | 1621.17 |
| 8192 | 32 | 1 | 8224 | 5.164 | 1586.31 | 1.323 | 24.19 | 6.487 | 1267.71 |
| 8192 | 32 | 2 | 16448 | 10.295 | 1591.43 | 1.346 | 47.54 | 11.641 | 1412.90 |
| 8192 | 32 | 4 | 32896 | 20.589 | 1591.54 | 1.523 | 84.07 | 22.111 | 1487.74 |
| 8192 | 32 | 8 | 65792 | 41.102 | 1594.46 | 2.175 | 117.72 | 43.277 | 1520.26 |
| 8192 | 32 | 16 | 131584 | 82.176 | 1595.02 | 2.759 | 185.60 | 84.934 | 1549.24 |
| 8192 | 32 | 32 | 263168 | 164.147 | 1597.01 | 4.196 | 244.06 | 168.343 | 1563.29 |
Gemma 3 4B QAT
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 | 2642.30 ± 3.33 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 | 66.78 ± 0.08 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d4096 | 2442.52 ± 8.84 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d4096 | 59.65 ± 0.46 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d8192 | 2325.68 ± 12.33 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d8192 | 58.93 ± 1.24 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d16384 | 2156.82 ± 18.37 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d16384 | 57.26 ± 0.21 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d32768 | 1819.49 ± 30.83 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d32768 | 52.09 ± 0.20 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 1.520 | 2695.26 | 0.528 | 60.57 | 2.048 | 2015.59 |
| 4096 | 32 | 2 | 8256 | 2.985 | 2744.63 | 0.632 | 101.20 | 3.617 | 2282.48 |
| 4096 | 32 | 4 | 16512 | 5.943 | 2757.00 | 0.783 | 163.43 | 6.726 | 2455.00 |
| 4096 | 32 | 8 | 33024 | 11.862 | 2762.35 | 1.255 | 204.05 | 13.117 | 2517.66 |
| 4096 | 32 | 16 | 66048 | 23.690 | 2766.42 | 1.474 | 347.30 | 25.164 | 2624.70 |
| 4096 | 32 | 32 | 132096 | 47.366 | 2767.23 | 2.290 | 447.23 | 49.655 | 2660.25 |
| 8192 | 32 | 1 | 8224 | 3.020 | 2712.78 | 0.534 | 59.92 | 3.554 | 2314.12 |
| 8192 | 32 | 2 | 16448 | 6.015 | 2723.81 | 0.698 | 91.74 | 6.713 | 2450.26 |
| 8192 | 32 | 4 | 32896 | 12.001 | 2730.49 | 0.915 | 139.83 | 12.916 | 2546.88 |
| 8192 | 32 | 8 | 65792 | 23.980 | 2732.90 | 1.518 | 168.65 | 25.498 | 2580.24 |
| 8192 | 32 | 16 | 131584 | 47.940 | 2734.08 | 2.021 | 253.35 | 49.961 | 2633.73 |
| 8192 | 32 | 32 | 263168 | 95.821 | 2735.76 | 3.355 | 305.17 | 99.177 | 2653.52 |
Run summary
```
load time     =   2203.52 ms
prompt eval   = 297107.59 ms / 778128 tok (0.38 ms/tok, 2619.01 tok/s)
decode eval   =   1062.01 ms / 64 runs (16.59 ms/tok, 60.26 tok/s)
total time    = 300420.11 ms / 778192 tok
graphs reused = 372
```
GLM 4.5 Air
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 | 461.94 ± 0.22 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 | 20.58 ± 0.30 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d4096 | 416.40 ± 2.31 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d4096 | 18.19 ± 0.01 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d8192 | 368.01 ± 0.32 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d8192 | 16.43 ± 0.03 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d16384 | 293.14 ± 0.39 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d16384 | 10.83 ± 0.01 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d32768 | 199.93 ± 0.13 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d32768 | 7.37 ± 0.01 |
© 2025 · Benchmarks generated with llama.cpp tools on Jetson AGX Thor.