Jetson AGX Thor — llama.cpp Benchmarks

Build: b6767 (5acd4554) · Platform: NVIDIA Jetson AGX Thor (SM 11.0), VMM: yes
Tools: llama-bench, llama-batched-bench · Legend: pp = prefill, tg = token generation, d = context depth, B = batch size

Overview

This document summarizes the performance of llama.cpp for various models on the NVIDIA Jetson AGX Thor.

Benchmarks include:
  • Prefill (pp) and token generation (tg) throughput at various context depths (d)
  • Batch sizes of 1, 2, 4, 8, 16, and 32, typical for local serving setups (see the invocation sketch below)
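
The exact commands were not published with the results. A minimal sketch that would produce the same test matrix, assuming a recent llama-bench with the -d (depth) option and the standard llama-batched-bench flags, with the model path as a placeholder:

    # llama-bench: pp2048 / tg32 at context depths 0..32768
    ./build/bin/llama-bench -m /path/to/model.gguf \
        -p 2048 -n 32 -d 0,4096,8192,16384,32768

    # llama-batched-bench: 4096- and 8192-token prompts, 32 generated tokens,
    # 1-32 parallel sequences; -c must cover the largest N_KV (263168 here)
    ./build/bin/llama-batched-bench -m /path/to/model.gguf \
        -c 263168 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32

In the batched tables, S_PP and S_TG are aggregate throughput across all B parallel sequences (e.g. B = 32 with TG = 32 means 1024 generated tokens over T_TG seconds), and S t/s is total tokens over total wall time.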
Notes:
  • Device 0: Jetson AGX Thor, compute capability 11.0, VMM: yes
  • All runs built from b6767 (5acd4554) with -DGGML_CUDA=ON
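
For reference, the standard CUDA build of llama.cpp looks like the following; the exact CMake invocation used for these runs is an assumption:

    # configure with the CUDA backend and build in Release mode
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j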

gpt-oss-20b

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes · build 5acd4554 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 | 1861.26 ± 6.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 | 57.18 ± 0.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d4096 | 1728.68 ± 6.75 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d4096 | 52.46 ± 0.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d8192 | 1614.11 ± 6.20 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d8192 | 51.26 ± 0.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d16384 | 1377.71 ± 43.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d16384 | 51.60 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 @ d32768 | 1123.22 ± 2.55 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 @ d32768 | 46.53 ± 0.07 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.191 | 1869.50 | 0.576 | 55.52 | 2.767 | 1491.70 |
| 4096 | 32 | 2 | 8256 | 4.153 | 1972.64 | 1.083 | 59.08 | 5.236 | 1576.77 |
| 4096 | 32 | 4 | 16512 | 8.285 | 1977.65 | 1.316 | 97.24 | 9.601 | 1719.85 |
| 4096 | 32 | 8 | 33024 | 16.531 | 1982.21 | 1.748 | 146.42 | 18.279 | 1806.62 |
| 4096 | 32 | 16 | 66048 | 33.076 | 1981.38 | 2.278 | 224.76 | 35.354 | 1868.19 |
| 4096 | 32 | 32 | 132096 | 66.070 | 1983.83 | 3.369 | 303.98 | 69.439 | 1902.34 |
| 8192 | 32 | 1 | 8224 | 4.279 | 1914.62 | 0.586 | 54.57 | 4.865 | 1690.44 |
| 8192 | 32 | 2 | 16448 | 8.507 | 1925.91 | 1.110 | 57.67 | 9.617 | 1710.31 |
| 8192 | 32 | 4 | 32896 | 16.957 | 1932.46 | 1.382 | 92.65 | 18.338 | 1793.85 |
| 8192 | 32 | 8 | 65792 | 33.888 | 1933.88 | 1.879 | 136.24 | 35.767 | 1839.45 |
| 8192 | 32 | 16 | 131584 | 67.772 | 1934.02 | 2.542 | 201.43 | 70.314 | 1871.38 |
| 8192 | 32 | 32 | 263168 | 135.397 | 1936.11 | 3.872 | 264.49 | 139.269 | 1889.64 |
Run summary
load time =    6610.08 ms
prompt eval =  417768.33 ms / 778128 tok  (0.54 ms/tok, 1862.58 tok/s)
decode eval =    1162.30 ms / 64 runs     (18.16 ms/tok, 55.06 tok/s)
total time =  425524.98 ms / 778192 tok
graphs reused = 372
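
The derived rates in these summaries follow directly from tokens over wall time; a quick check of the figures above, assuming nothing beyond the arithmetic:

    # prompt eval: 778128 tokens in 417768.33 ms
    awk 'BEGIN { printf "%.2f tok/s\n", 778128 / (417768.33 / 1000) }'   # 1862.58
    # decode: 64 tokens in 1162.30 ms
    awk 'BEGIN { printf "%.2f tok/s\n", 64 / (1162.30 / 1000) }'         # 55.06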

gpt-oss-120b

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes · build 5acd4554 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 937.81 ± 4.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 41.82 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d4096 | 897.09 ± 2.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d4096 | 39.35 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d8192 | 856.72 ± 3.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d8192 | 38.26 ± 0.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d16384 | 774.93 ± 1.60 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d16384 | 36.31 ± 0.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 @ d32768 | 635.81 ± 1.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 @ d32768 | 32.62 ± 0.05 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 4.758 | 860.84 | 0.821 | 38.97 | 5.579 | 739.87 |
| 4096 | 32 | 2 | 8256 | 8.684 | 943.38 | 1.952 | 32.79 | 10.635 | 776.28 |
| 4096 | 32 | 4 | 16512 | 17.392 | 942.05 | 2.449 | 52.26 | 19.841 | 832.20 |
| 4096 | 32 | 8 | 33024 | 34.777 | 942.24 | 3.237 | 79.09 | 38.013 | 868.75 |
| 4096 | 32 | 16 | 66048 | 69.644 | 941.02 | 4.548 | 112.57 | 74.192 | 890.23 |
| 4096 | 32 | 32 | 132096 | 139.233 | 941.38 | 6.893 | 148.55 | 146.127 | 903.98 |
| 8192 | 32 | 1 | 8224 | 8.857 | 924.94 | 0.832 | 38.47 | 9.689 | 848.84 |
| 8192 | 32 | 2 | 16448 | 17.752 | 922.94 | 2.071 | 30.91 | 19.823 | 829.76 |
| 8192 | 32 | 4 | 32896 | 35.434 | 924.75 | 2.529 | 50.60 | 37.964 | 866.51 |
| 8192 | 32 | 8 | 65792 | 70.713 | 926.78 | 3.548 | 72.15 | 74.262 | 885.95 |
| 8192 | 32 | 16 | 131584 | 141.543 | 926.02 | 4.860 | 105.35 | 146.403 | 898.78 |
| 8192 | 32 | 32 | 263168 | 282.951 | 926.46 | 7.722 | 132.61 | 290.673 | 905.37 |
Run summary
load time =   67534.35 ms
prompt eval =  871709.25 ms / 778128 tok  (1.12 ms/tok, 892.65 tok/s)
decode eval =    1652.57 ms / 64 runs     (25.82 ms/tok, 38.73 tok/s)
total time =  940801.38 ms / 778192 tok
graphs reused = 372

Qwen3 Coder 30B A3B

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes · build 5acd4554 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 | 1533.71 ± 3.66 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 | 42.70 ± 0.08 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d4096 | 1314.86 ± 2.31 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d4096 | 37.74 ± 0.07 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d8192 | 1143.21 ± 1.68 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d8192 | 35.88 ± 0.58 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d16384 | 928.23 ± 2.83 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d16384 | 32.12 ± 0.03 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 @ d32768 | 610.49 ± 1.52 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 @ d32768 | 25.69 ± 0.04 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.747 | 1491.24 | 0.829 | 38.61 | 3.575 | 1154.54 |
| 4096 | 32 | 2 | 8256 | 5.349 | 1531.41 | 1.388 | 46.10 | 6.738 | 1225.37 |
| 4096 | 32 | 4 | 16512 | 10.688 | 1532.90 | 1.775 | 72.11 | 12.463 | 1324.84 |
| 4096 | 32 | 8 | 33024 | 21.332 | 1536.09 | 2.447 | 104.64 | 23.779 | 1388.81 |
| 4096 | 32 | 16 | 66048 | 42.617 | 1537.79 | 3.513 | 145.75 | 46.130 | 1431.79 |
| 4096 | 32 | 32 | 132096 | 85.262 | 1537.29 | 5.421 | 188.90 | 90.683 | 1456.68 |
| 8192 | 32 | 1 | 8224 | 5.744 | 1426.21 | 0.885 | 36.17 | 6.629 | 1240.67 |
| 8192 | 32 | 2 | 16448 | 11.468 | 1428.69 | 1.487 | 43.04 | 12.955 | 1269.64 |
| 8192 | 32 | 4 | 32896 | 22.937 | 1428.62 | 2.015 | 63.52 | 24.952 | 1318.37 |
| 8192 | 32 | 8 | 65792 | 45.809 | 1430.63 | 2.990 | 85.63 | 48.799 | 1348.23 |
| 8192 | 32 | 16 | 131584 | 91.578 | 1431.27 | 4.508 | 113.58 | 96.085 | 1369.45 |
| 8192 | 32 | 32 | 263168 | 183.218 | 1430.77 | 7.403 | 138.32 | 190.621 | 1380.58 |
Run summary
load time =   16355.53 ms
prompt eval =  561791.71 ms / 778128 tok  (0.72 ms/tok, 1385.08 tok/s)
decode eval =    1713.13 ms / 64 runs     (26.77 ms/tok, 37.36 tok/s)
total time =  579823.96 ms / 778192 tok
graphs reused = 372

Qwen2.5 Coder 7B

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes · build 5acd4554 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 | 1606.76 ± 2.75 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 | 25.26 ± 0.09 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d4096 | 1441.19 ± 2.14 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d4096 | 24.18 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d8192 | 1306.22 ± 0.76 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d8192 | 23.91 ± 0.23 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d16384 | 1140.95 ± 1.19 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d16384 | 22.83 ± 0.03 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | pp2048 @ d32768 | 808.52 ± 1.25 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | tg32 @ d32768 | 20.42 ± 0.01 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 2.656 | 1541.92 | 1.309 | 24.45 | 3.965 | 1041.10 |
| 4096 | 32 | 2 | 8256 | 5.103 | 1605.30 | 1.286 | 49.75 | 6.390 | 1292.10 |
| 4096 | 32 | 4 | 16512 | 9.840 | 1665.05 | 1.369 | 93.52 | 11.209 | 1473.15 |
| 4096 | 32 | 8 | 33024 | 19.661 | 1666.64 | 1.867 | 137.09 | 21.528 | 1533.97 |
| 4096 | 32 | 16 | 66048 | 39.260 | 1669.28 | 2.143 | 238.89 | 41.403 | 1595.23 |
| 4096 | 32 | 32 | 132096 | 78.482 | 1670.09 | 3.000 | 341.33 | 81.482 | 1621.17 |
| 8192 | 32 | 1 | 8224 | 5.164 | 1586.31 | 1.323 | 24.19 | 6.487 | 1267.71 |
| 8192 | 32 | 2 | 16448 | 10.295 | 1591.43 | 1.346 | 47.54 | 11.641 | 1412.90 |
| 8192 | 32 | 4 | 32896 | 20.589 | 1591.54 | 1.523 | 84.07 | 22.111 | 1487.74 |
| 8192 | 32 | 8 | 65792 | 41.102 | 1594.46 | 2.175 | 117.72 | 43.277 | 1520.26 |
| 8192 | 32 | 16 | 131584 | 82.176 | 1595.02 | 2.759 | 185.60 | 84.934 | 1549.24 |
| 8192 | 32 | 32 | 263168 | 164.147 | 1597.01 | 4.196 | 244.06 | 168.343 | 1563.29 |

Gemma 3 4B QAT

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes · build 5acd4554 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 | 2642.30 ± 3.33 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 | 66.78 ± 0.08 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d4096 | 2442.52 ± 8.84 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d4096 | 59.65 ± 0.46 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d8192 | 2325.68 ± 12.33 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d8192 | 58.93 ± 1.24 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d16384 | 2156.82 ± 18.37 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d16384 | 57.26 ± 0.21 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 @ d32768 | 1819.49 ± 30.83 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | tg32 @ d32768 | 52.09 ± 0.20 |
llama-batched-bench
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4096 | 32 | 1 | 4128 | 1.520 | 2695.26 | 0.528 | 60.57 | 2.048 | 2015.59 |
| 4096 | 32 | 2 | 8256 | 2.985 | 2744.63 | 0.632 | 101.20 | 3.617 | 2282.48 |
| 4096 | 32 | 4 | 16512 | 5.943 | 2757.00 | 0.783 | 163.43 | 6.726 | 2455.00 |
| 4096 | 32 | 8 | 33024 | 11.862 | 2762.35 | 1.255 | 204.05 | 13.117 | 2517.66 |
| 4096 | 32 | 16 | 66048 | 23.690 | 2766.42 | 1.474 | 347.30 | 25.164 | 2624.70 |
| 4096 | 32 | 32 | 132096 | 47.366 | 2767.23 | 2.290 | 447.23 | 49.655 | 2660.25 |
| 8192 | 32 | 1 | 8224 | 3.020 | 2712.78 | 0.534 | 59.92 | 3.554 | 2314.12 |
| 8192 | 32 | 2 | 16448 | 6.015 | 2723.81 | 0.698 | 91.74 | 6.713 | 2450.26 |
| 8192 | 32 | 4 | 32896 | 12.001 | 2730.49 | 0.915 | 139.83 | 12.916 | 2546.88 |
| 8192 | 32 | 8 | 65792 | 23.980 | 2732.90 | 1.518 | 168.65 | 25.498 | 2580.24 |
| 8192 | 32 | 16 | 131584 | 47.940 | 2734.08 | 2.021 | 253.35 | 49.961 | 2633.73 |
| 8192 | 32 | 32 | 263168 | 95.821 | 2735.76 | 3.355 | 305.17 | 99.177 | 2653.52 |
Run summary
load time =    2203.52 ms
prompt eval =  297107.59 ms / 778128 tok  (0.38 ms/tok, 2619.01 tok/s)
decode eval =    1062.01 ms / 64 runs     (16.59 ms/tok, 60.26 tok/s)
total time =  300420.11 ms / 778192 tok
graphs reused = 372

GLM 4.5 Air

Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes · build 5acd4554 (6767)

llama-bench
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 | 461.94 ± 0.22 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 | 20.58 ± 0.30 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d4096 | 416.40 ± 2.31 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d4096 | 18.19 ± 0.01 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d8192 | 368.01 ± 0.32 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d8192 | 16.43 ± 0.03 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d16384 | 293.14 ± 0.39 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d16384 | 10.83 ± 0.01 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | pp2048 @ d32768 | 199.93 ± 0.13 |
| glm4moe 106B.A12B Q4_K - Medium | 68.01 GiB | 110.47 B | tg32 @ d32768 | 7.37 ± 0.01 |
