Jetson AGX Thor — llama.cpp Benchmarks
Build: b6767 (5acd4554) · Published: Oct 30, 2025 · Platform: NVIDIA Thor (SM 11.0), VMM: yes
Tools: llama-bench · llama-batched-bench

Legend: pp = prefill · tg = token generation · d = context depth · B = batch size
Overview
This document summarizes llama.cpp performance for several models on the NVIDIA Jetson AGX Thor. Benchmarks cover:

- Prefill (pp) and token generation (tg) throughput at several context depths (d)
- Batch sizes of 1, 2, 4, 8, 16, and 32, typical of local serving workloads
Notes:

- Device 0: NVIDIA Jetson AGX Thor, compute capability 11.0, VMM: yes
- All binaries were built from b6767 (5acd4554) with -DGGML_CUDA=ON
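For reference, a typical way to reproduce runs like these with the stock llama.cpp tools is sketched below. The model path and the `-c` value are placeholders, not taken from the source; only the pp/tg/d and batch-size settings match the tables that follow.

```shell
# Build llama.cpp with CUDA enabled (per the -DGGML_CUDA=ON note above).
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Single-sequence benchmark: pp2048 / tg32 at the context depths used here.
# The model path is a placeholder.
./build/bin/llama-bench -m model.gguf -p 2048 -n 32 -d 0,4096,8192,16384,32768

# Batched benchmark: prompt lengths 4096 and 8192, 32 generated tokens,
# parallel sequence counts 1-32. The -c value is illustrative; it only needs
# to cover the largest N_KV in the tables below (263168).
./build/bin/llama-batched-bench -m model.gguf -c 266240 \
    -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32
```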
gpt-oss-20b
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 | 1861.26 ± 6.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 | 57.18 ± 0.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d4096 | 1728.68 ± 6.75 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d4096 | 52.46 ± 0.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d8192 | 1614.11 ± 6.20 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d8192 | 51.26 ± 0.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d16384 | 1377.71 ± 43.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d16384 | 51.60 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d32768 | 1123.22 ± 2.55 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d32768 | 46.53 ± 0.07 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.191 | 1869.50 | 0.576 | 55.52 | 2.767 | 1491.70 |
| 4096 | 32 | 2 | 8256 | 4.153 | 1972.64 | 1.083 | 59.08 | 5.236 | 1576.77 |
| 4096 | 32 | 4 | 16512 | 8.285 | 1977.65 | 1.316 | 97.24 | 9.601 | 1719.85 |
| 4096 | 32 | 8 | 33024 | 16.531 | 1982.21 | 1.748 | 146.42 | 18.279 | 1806.62 |
| 4096 | 32 | 16 | 66048 | 33.076 | 1981.38 | 2.278 | 224.76 | 35.354 | 1868.19 |
| 4096 | 32 | 32 | 132096 | 66.070 | 1983.83 | 3.369 | 303.98 | 69.439 | 1902.34 |
| 8192 | 32 | 1 | 8224 | 4.279 | 1914.62 | 0.586 | 54.57 | 4.865 | 1690.44 |
| 8192 | 32 | 2 | 16448 | 8.507 | 1925.91 | 1.110 | 57.67 | 9.617 | 1710.31 |
| 8192 | 32 | 4 | 32896 | 16.957 | 1932.46 | 1.382 | 92.65 | 18.338 | 1793.85 |
| 8192 | 32 | 8 | 65792 | 33.888 | 1933.88 | 1.879 | 136.24 | 35.767 | 1839.45 |
| 8192 | 32 | 16 | 131584 | 67.772 | 1934.02 | 2.542 | 201.43 | 70.314 | 1871.38 |
| 8192 | 32 | 32 | 263168 | 135.397 | 1936.11 | 3.872 | 264.49 | 139.269 | 1889.64 |
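The batched-bench columns are related arithmetically: N_KV = B × (PP + TG), S_PP = B·PP / T_PP, S_TG = B·TG / T_TG, and S = N_KV / T (small deviations come from the tool printing rounded times). A quick check against the B = 2, PP = 4096 row above:

```python
# Verify the column relationships on the B = 2, PP = 4096 row
# of the gpt-oss-20b llama-batched-bench table.
PP, TG, B = 4096, 32, 2
T_PP, T_TG, T = 4.153, 1.083, 5.236

n_kv = B * (PP + TG)    # total tokens in the KV cache across the batch
s_pp = B * PP / T_PP    # aggregate prefill throughput (t/s)
s_tg = B * TG / T_TG    # aggregate generation throughput (t/s)
s = n_kv / T            # end-to-end throughput (t/s)

print(n_kv, round(s_pp, 2), round(s_tg, 2), round(s, 2))
```

The recomputed values land within rounding error of the table's 1972.64 / 59.08 / 1576.77.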
Run summary
```
load time     =   6610.08 ms
prompt eval   = 417768.33 ms / 778128 tok (0.54 ms/tok, 1862.58 tok/s)
decode eval   =   1162.30 ms / 64 runs (18.16 ms/tok, 55.06 tok/s)
total time    = 425524.98 ms / 778192 tok
graphs reused = 372
```
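The per-token and throughput figures in the summary follow directly from the raw totals; for the prompt-eval line:

```python
# Derive the prompt-eval rates from the totals in the run summary above.
eval_ms = 417768.33   # total prompt-eval time (ms)
tokens = 778128       # prompt tokens processed

ms_per_tok = eval_ms / tokens         # matches the reported 0.54 ms/tok
tok_per_s = tokens / eval_ms * 1000   # matches the reported 1862.58 tok/s
```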
gpt-oss-120b
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 937.81 ± 4.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 41.82 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 897.09 ± 2.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 39.35 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 856.72 ± 3.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 38.26 ± 0.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 774.93 ± 1.60 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 36.31 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 635.81 ± 1.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 32.62 ± 0.05 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 4.758 | 860.84 | 0.821 | 38.97 | 5.579 | 739.87 |
| 4096 | 32 | 2 | 8256 | 8.684 | 943.38 | 1.952 | 32.79 | 10.635 | 776.28 |
| 4096 | 32 | 4 | 16512 | 17.392 | 942.05 | 2.449 | 52.26 | 19.841 | 832.20 |
| 4096 | 32 | 8 | 33024 | 34.777 | 942.24 | 3.237 | 79.09 | 38.013 | 868.75 |
| 4096 | 32 | 16 | 66048 | 69.644 | 941.02 | 4.548 | 112.57 | 74.192 | 890.23 |
| 4096 | 32 | 32 | 132096 | 139.233 | 941.38 | 6.893 | 148.55 | 146.127 | 903.98 |
| 8192 | 32 | 1 | 8224 | 8.857 | 924.94 | 0.832 | 38.47 | 9.689 | 848.84 |
| 8192 | 32 | 2 | 16448 | 17.752 | 922.94 | 2.071 | 30.91 | 19.823 | 829.76 |
| 8192 | 32 | 4 | 32896 | 35.434 | 924.75 | 2.529 | 50.60 | 37.964 | 866.51 |
| 8192 | 32 | 8 | 65792 | 70.713 | 926.78 | 3.548 | 72.15 | 74.262 | 885.95 |
| 8192 | 32 | 16 | 131584 | 141.543 | 926.02 | 4.860 | 105.35 | 146.403 | 898.78 |
| 8192 | 32 | 32 | 263168 | 282.951 | 926.46 | 7.722 | 132.61 | 290.673 | 905.37 |
Run summary
```
load time     =  67534.35 ms
prompt eval   = 871709.25 ms / 778128 tok (1.12 ms/tok, 892.65 tok/s)
decode eval   =   1652.57 ms / 64 runs (25.82 ms/tok, 38.73 tok/s)
total time    = 940801.38 ms / 778192 tok
graphs reused = 372
```
Qwen3 Coder 30B A3B
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 | 1533.71 ± 3.66 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 | 42.70 ± 0.08 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d4096 | 1314.86 ± 2.31 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d4096 | 37.74 ± 0.07 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d8192 | 1143.21 ± 1.68 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d8192 | 35.88 ± 0.58 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d16384 | 928.23 ± 2.83 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d16384 | 32.12 ± 0.03 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d32768 | 610.49 ± 1.52 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d32768 | 25.69 ± 0.04 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.747 | 1491.24 | 0.829 | 38.61 | 3.575 | 1154.54 |
| 4096 | 32 | 2 | 8256 | 5.349 | 1531.41 | 1.388 | 46.10 | 6.738 | 1225.37 |
| 4096 | 32 | 4 | 16512 | 10.688 | 1532.90 | 1.775 | 72.11 | 12.463 | 1324.84 |
| 4096 | 32 | 8 | 33024 | 21.332 | 1536.09 | 2.447 | 104.64 | 23.779 | 1388.81 |
| 4096 | 32 | 16 | 66048 | 42.617 | 1537.79 | 3.513 | 145.75 | 46.130 | 1431.79 |
| 4096 | 32 | 32 | 132096 | 85.262 | 1537.29 | 5.421 | 188.90 | 90.683 | 1456.68 |
| 8192 | 32 | 1 | 8224 | 5.744 | 1426.21 | 0.885 | 36.17 | 6.629 | 1240.67 |
| 8192 | 32 | 2 | 16448 | 11.468 | 1428.69 | 1.487 | 43.04 | 12.955 | 1269.64 |
| 8192 | 32 | 4 | 32896 | 22.937 | 1428.62 | 2.015 | 63.52 | 24.952 | 1318.37 |
| 8192 | 32 | 8 | 65792 | 45.809 | 1430.63 | 2.990 | 85.63 | 48.799 | 1348.23 |
| 8192 | 32 | 16 | 131584 | 91.578 | 1431.27 | 4.508 | 113.58 | 96.085 | 1369.45 |
| 8192 | 32 | 32 | 263168 | 183.218 | 1430.77 | 7.403 | 138.32 | 190.621 | 1380.58 |
Run summary
```
load time     =  16355.53 ms
prompt eval   = 561791.71 ms / 778128 tok (0.72 ms/tok, 1385.08 tok/s)
decode eval   =   1713.13 ms / 64 runs (26.77 ms/tok, 37.36 tok/s)
total time    = 579823.96 ms / 778192 tok
graphs reused = 372
```
Qwen2.5 Coder 7B
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 | 1606.76 ± 2.75 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 | 25.26 ± 0.09 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d4096 | 1441.19 ± 2.14 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d4096 | 24.18 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d8192 | 1306.22 ± 0.76 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d8192 | 23.91 ± 0.23 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d16384 | 1140.95 ± 1.19 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d16384 | 22.83 ± 0.03 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d32768 | 808.52 ± 1.25 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d32768 | 20.42 ± 0.01 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.656 | 1541.92 | 1.309 | 24.45 | 3.965 | 1041.10 |
| 4096 | 32 | 2 | 8256 | 5.103 | 1605.30 | 1.286 | 49.75 | 6.390 | 1292.10 |
| 4096 | 32 | 4 | 16512 | 9.840 | 1665.05 | 1.369 | 93.52 | 11.209 | 1473.15 |
| 4096 | 32 | 8 | 33024 | 19.661 | 1666.64 | 1.867 | 137.09 | 21.528 | 1533.97 |
| 4096 | 32 | 16 | 66048 | 39.260 | 1669.28 | 2.143 | 238.89 | 41.403 | 1595.23 |
| 4096 | 32 | 32 | 132096 | 78.482 | 1670.09 | 3.000 | 341.33 | 81.482 | 1621.17 |
| 8192 | 32 | 1 | 8224 | 5.164 | 1586.31 | 1.323 | 24.19 | 6.487 | 1267.71 |
| 8192 | 32 | 2 | 16448 | 10.295 | 1591.43 | 1.346 | 47.54 | 11.641 | 1412.90 |
| 8192 | 32 | 4 | 32896 | 20.589 | 1591.54 | 1.523 | 84.07 | 22.111 | 1487.74 |
| 8192 | 32 | 8 | 65792 | 41.102 | 1594.46 | 2.175 | 117.72 | 43.277 | 1520.26 |
| 8192 | 32 | 16 | 131584 | 82.176 | 1595.02 | 2.759 | 185.60 | 84.934 | 1549.24 |
| 8192 | 32 | 32 | 263168 | 164.147 | 1597.01 | 4.196 | 244.06 | 168.343 | 1563.29 |
Gemma 3 4B QAT
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 | 2642.30 ± 3.33 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 | 66.78 ± 0.08 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d4096 | 2442.52 ± 8.84 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d4096 | 59.65 ± 0.46 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d8192 | 2325.68 ± 12.33 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d8192 | 58.93 ± 1.24 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d16384 | 2156.82 ± 18.37 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d16384 | 57.26 ± 0.21 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d32768 | 1819.49 ± 30.83 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d32768 | 52.09 ± 0.20 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 1.520 | 2695.26 | 0.528 | 60.57 | 2.048 | 2015.59 |
| 4096 | 32 | 2 | 8256 | 2.985 | 2744.63 | 0.632 | 101.20 | 3.617 | 2282.48 |
| 4096 | 32 | 4 | 16512 | 5.943 | 2757.00 | 0.783 | 163.43 | 6.726 | 2455.00 |
| 4096 | 32 | 8 | 33024 | 11.862 | 2762.35 | 1.255 | 204.05 | 13.117 | 2517.66 |
| 4096 | 32 | 16 | 66048 | 23.690 | 2766.42 | 1.474 | 347.30 | 25.164 | 2624.70 |
| 4096 | 32 | 32 | 132096 | 47.366 | 2767.23 | 2.290 | 447.23 | 49.655 | 2660.25 |
| 8192 | 32 | 1 | 8224 | 3.020 | 2712.78 | 0.534 | 59.92 | 3.554 | 2314.12 |
| 8192 | 32 | 2 | 16448 | 6.015 | 2723.81 | 0.698 | 91.74 | 6.713 | 2450.26 |
| 8192 | 32 | 4 | 32896 | 12.001 | 2730.49 | 0.915 | 139.83 | 12.916 | 2546.88 |
| 8192 | 32 | 8 | 65792 | 23.980 | 2732.90 | 1.518 | 168.65 | 25.498 | 2580.24 |
| 8192 | 32 | 16 | 131584 | 47.940 | 2734.08 | 2.021 | 253.35 | 49.961 | 2633.73 |
| 8192 | 32 | 32 | 263168 | 95.821 | 2735.76 | 3.355 | 305.17 | 99.177 | 2653.52 |
Run summary
```
load time     =   2203.52 ms
prompt eval   = 297107.59 ms / 778128 tok (0.38 ms/tok, 2619.01 tok/s)
decode eval   =   1062.01 ms / 64 runs (16.59 ms/tok, 60.26 tok/s)
total time    = 300420.11 ms / 778192 tok
graphs reused = 372
```
GLM 4.5 Air
llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 | 461.94 ± 0.22 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 | 20.58 ± 0.30 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d4096 | 416.40 ± 2.31 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d4096 | 18.19 ± 0.01 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d8192 | 368.01 ± 0.32 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d8192 | 16.43 ± 0.03 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d16384 | 293.14 ± 0.39 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d16384 | 10.83 ± 0.01 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d32768 | 199.93 ± 0.13 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d32768 | 7.37 ± 0.01 |
© 2025 · Benchmarks generated with llama.cpp tools on Jetson AGX Thor.