llama.cpp Benchmarks — DGX Spark vs Jetson AGX Thor

This page presents comparative benchmark results for the NVIDIA DGX Spark and the Jetson AGX Thor using llama.cpp build b6767 (commit 5acd45546). Each section covers a different large language model tested under identical conditions. Values are reported in tokens per second, and the Uplift column gives relative performance (Spark ÷ Thor). Benchmarks were recorded and published on October 31, 2025.

The benchmark data presented here originates from the public discussion on the llama.cpp GitHub repository. The original DGX Spark benchmark results are archived at spark-llamacpp-bench.html, and the Jetson AGX Thor results at thor-llamacpp-bench.html. These datasets form the basis for the comparative tables and performance summaries shown below.
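As a quick sanity check on how the Uplift column is derived, the ratio can be recomputed from any pair of throughput values. A minimal sketch, using the pp2048 row for gpt-oss-20B from the tables below:

```python
# Uplift = DGX Spark throughput / Jetson AGX Thor throughput (tokens/s).
# Values are the gpt-oss-20B pp2048 row from the tables below.
spark_pp2048 = 3610.56  # tokens/s on DGX Spark
thor_pp2048 = 1861.26   # tokens/s on Jetson AGX Thor

uplift = spark_pp2048 / thor_pp2048
print(f"{uplift:.2f}x")  # -> 1.94x
```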

gpt-oss‑20B MXFP4 MoE 11.27 GiB • 20.91 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.
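Throughput figures map directly to latency: time-to-first-token is roughly prompt length ÷ prefill rate. A small illustrative calculation under that simplification (the rates are the gpt-oss-20B pp2048 row below; real runs add model-load, sampling, and scheduling overhead):

```python
# Approximate prompt-encode time for a 2048-token prompt, ignoring
# per-run overheads (model load, sampling, scheduling).
prompt_tokens = 2048
spark_prefill = 3610.56  # tokens/s, DGX Spark pp2048
thor_prefill = 1861.26   # tokens/s, Jetson AGX Thor pp2048

spark_s = prompt_tokens / spark_prefill
thor_s = prompt_tokens / thor_prefill
print(f"Spark ~{spark_s:.2f} s, Thor ~{thor_s:.2f} s to encode the prompt")
```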

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 3,610.56 | 1,861.26 | 1.94× |
| pp2048 @ d4096 | 3,361.11 | 1,728.68 | 1.94× |
| pp2048 @ d8192 | 3,147.73 | 1,614.11 | 1.95× |
| pp2048 @ d16384 | 2,685.54 | 1,377.71 | 1.95× |
| pp2048 @ d32768 | 2,055.34 | 1,123.22 | 1.83× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 79.74 | 57.18 | 1.39× |
| tg32 @ d4096 | 74.63 | 52.46 | 1.42× |
| tg32 @ d8192 | 69.49 | 51.26 | 1.36× |
| tg32 @ d16384 | 64.02 | 51.6 | 1.24× |
| tg32 @ d32768 | 55.96 | 46.53 | 1.20× |
Interpretation: DGX Spark averages 1.36× token-generation throughput (range ~1.20–1.42×) versus Jetson Thor for these tests. This suggests lower per-token overhead and steadier GPU utilization on Spark at these context depths.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.
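The Cache Entries column follows directly from that setup: each concurrent request keeps its prompt plus generated tokens in the KV cache, so entries ≈ batch size × (context length + 32). A sketch that reproduces the table values:

```python
# Cache entries ≈ batch_size * (context_length + generated_tokens),
# since each concurrent request keeps its full context in the KV cache.
def cache_entries(context_length: int, batch_size: int, gen_tokens: int = 32) -> int:
    return batch_size * (context_length + gen_tokens)

print(cache_entries(4096, 1))   # -> 4128
print(cache_entries(4096, 32))  # -> 132096
print(cache_entries(8192, 32))  # -> 263168
```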

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 3,644.93 | 1,869.5 | 1.95× |
| 4096 | 2 | 8256 | 3,637.13 | 1,972.64 | 1.84× |
| 4096 | 4 | 16512 | 3,652.8 | 1,977.65 | 1.85× |
| 4096 | 8 | 33024 | 3,657.89 | 1,982.21 | 1.85× |
| 4096 | 16 | 66048 | 3,664.62 | 1,981.38 | 1.85× |
| 4096 | 32 | 132096 | 3,667.6 | 1,983.83 | 1.85× |
| 8192 | 1 | 8224 | 3,584.99 | 1,914.62 | 1.87× |
| 8192 | 2 | 16448 | 3,603.29 | 1,925.91 | 1.87× |
| 8192 | 4 | 32896 | 3,594.24 | 1,932.46 | 1.86× |
| 8192 | 8 | 65792 | 3,591.4 | 1,933.88 | 1.86× |
| 8192 | 16 | 131584 | 3,601.99 | 1,934.02 | 1.86× |
| 8192 | 32 | 263168 | 3,596.94 | 1,936.11 | 1.86× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 74.24 | 55.52 | 1.34× |
| 4096 | 2 | 8256 | 83.59 | 59.08 | 1.41× |
| 4096 | 4 | 16512 | 133.37 | 97.24 | 1.37× |
| 4096 | 8 | 33024 | 208.45 | 146.42 | 1.42× |
| 4096 | 16 | 66048 | 302.07 | 224.76 | 1.34× |
| 4096 | 32 | 132096 | 426.22 | 303.98 | 1.40× |
| 8192 | 1 | 8224 | 69.81 | 54.57 | 1.28× |
| 8192 | 2 | 16448 | 80.29 | 57.67 | 1.39× |
| 8192 | 4 | 32896 | 127.47 | 92.65 | 1.38× |
| 8192 | 8 | 65792 | 188.77 | 136.24 | 1.39× |
| 8192 | 16 | 131584 | 262.37 | 201.43 | 1.30× |
| 8192 | 32 | 263168 | 348.69 | 264.49 | 1.32× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.37× higher token-generation throughput (range ~1.28–1.42×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.
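One way to read the S_TG rows above: aggregate throughput grows with batch size while the per-request rate falls, so dividing by batch size shows what each client actually sees. A sketch using the DGX Spark 4096-context figures from the table:

```python
# Per-request decode rate = aggregate S_TG / batch size.
# DGX Spark, gpt-oss-20B, 4096-token context (values from the table above).
s_tg = {1: 74.24, 2: 83.59, 4: 133.37, 8: 208.45, 16: 302.07, 32: 426.22}

for batch, total in sorted(s_tg.items()):
    print(f"batch {batch:2d}: {total:6.2f} t/s aggregate, "
          f"{total / batch:5.2f} t/s per request")
```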
gpt-oss‑120B MXFP4 MoE 59.02 GiB • 116.83 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 1,689.47 | 937.81 | 1.80× |
| pp2048 @ d4096 | 1,733.41 | 897.09 | 1.93× |
| pp2048 @ d8192 | 1,705.93 | 856.72 | 1.99× |
| pp2048 @ d16384 | 1,514.78 | 774.93 | 1.95× |
| pp2048 @ d32768 | 1,221.23 | 635.81 | 1.92× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 52.87 | 41.82 | 1.26× |
| tg32 @ d4096 | 51.02 | 39.35 | 1.30× |
| tg32 @ d8192 | 48.46 | 38.26 | 1.27× |
| tg32 @ d16384 | 44.78 | 36.31 | 1.23× |
| tg32 @ d32768 | 38.76 | 32.62 | 1.19× |
Interpretation: DGX Spark averages 1.26× token-generation throughput (range ~1.19–1.30×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 1,746.43 | 860.84 | 2.03× |
| 4096 | 2 | 8256 | 1,844.38 | 943.38 | 1.96× |
| 4096 | 4 | 16512 | 1,858.46 | 942.05 | 1.97× |
| 4096 | 8 | 33024 | 1,865 | 942.24 | 1.98× |
| 4096 | 16 | 66048 | 1,866.5 | 941.02 | 1.98× |
| 4096 | 32 | 132096 | 1,868.23 | 941.38 | 1.98× |
| 8192 | 1 | 8224 | 1,835.87 | 924.94 | 1.98× |
| 8192 | 2 | 16448 | 1,828.57 | 922.94 | 1.98× |
| 8192 | 4 | 32896 | 1,840.87 | 924.75 | 1.99× |
| 8192 | 8 | 65792 | 1,835.41 | 926.78 | 1.98× |
| 8192 | 16 | 131584 | 1,837.75 | 926.02 | 1.98× |
| 8192 | 32 | 263168 | 1,837.57 | 926.46 | 1.98× |
Interpretation: DGX Spark delivers about 2.0× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 42.59 | 38.97 | 1.09× |
| 4096 | 2 | 8256 | 51.13 | 32.79 | 1.56× |
| 4096 | 4 | 16512 | 80.86 | 52.26 | 1.55× |
| 4096 | 8 | 33024 | 120.55 | 79.09 | 1.52× |
| 4096 | 16 | 66048 | 166.05 | 112.57 | 1.48× |
| 4096 | 32 | 132096 | 223.53 | 148.55 | 1.50× |
| 8192 | 1 | 8224 | 41.42 | 38.47 | 1.08× |
| 8192 | 2 | 16448 | 49.1 | 30.91 | 1.59× |
| 8192 | 4 | 32896 | 76.7 | 50.6 | 1.52× |
| 8192 | 8 | 65792 | 109.44 | 72.15 | 1.52× |
| 8192 | 16 | 131584 | 145.98 | 105.35 | 1.39× |
| 8192 | 32 | 263168 | 183.08 | 132.61 | 1.38× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.51× higher token-generation throughput (range ~1.08–1.59×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.
Qwen3 Coder 30B A3B 30.25 GiB • 30.53 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 2,933.39 | 1,533.71 | 1.91× |
| pp2048 @ d4096 | 2,537.98 | 1,314.86 | 1.93× |
| pp2048 @ d8192 | 2,246.86 | 1,143.21 | 1.97× |
| pp2048 @ d16384 | 1,772.41 | 928.23 | 1.91× |
| pp2048 @ d32768 | 1,252.1 | 610.49 | 2.05× |
Interpretation: DGX Spark delivers about 2.0× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 59.95 | 42.7 | 1.40× |
| tg32 @ d4096 | 52.7 | 37.74 | 1.40× |
| tg32 @ d8192 | 44.48 | 35.88 | 1.24× |
| tg32 @ d16384 | 37.1 | 32.12 | 1.16× |
| tg32 @ d32768 | 27.82 | 25.69 | 1.08× |
Interpretation: DGX Spark averages 1.24× token-generation throughput (range ~1.08–1.40×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 2,831.93 | 1,491.24 | 1.90× |
| 4096 | 2 | 8256 | 2,834.16 | 1,531.41 | 1.85× |
| 4096 | 4 | 16512 | 2,835.64 | 1,532.9 | 1.85× |
| 4096 | 8 | 33024 | 2,848.21 | 1,536.09 | 1.85× |
| 4096 | 16 | 66048 | 2,847.43 | 1,537.79 | 1.85× |
| 4096 | 32 | 132096 | 2,848.01 | 1,537.29 | 1.85× |
| 8192 | 1 | 8224 | 2,664.32 | 1,426.21 | 1.87× |
| 8192 | 2 | 16448 | 2,679.65 | 1,428.69 | 1.88× |
| 8192 | 4 | 32896 | 2,674.64 | 1,428.62 | 1.87× |
| 8192 | 8 | 65792 | 2,682.01 | 1,430.63 | 1.87× |
| 8192 | 16 | 131584 | 2,677.56 | 1,431.27 | 1.87× |
| 8192 | 32 | 263168 | 2,678.26 | 1,430.77 | 1.87× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 52.58 | 38.61 | 1.36× |
| 4096 | 2 | 8256 | 57.27 | 46.1 | 1.24× |
| 4096 | 4 | 16512 | 82.78 | 72.11 | 1.15× |
| 4096 | 8 | 33024 | 116.63 | 104.64 | 1.11× |
| 4096 | 16 | 66048 | 159.11 | 145.75 | 1.09× |
| 4096 | 32 | 132096 | 207.86 | 188.9 | 1.10× |
| 8192 | 1 | 8224 | 44.18 | 36.17 | 1.22× |
| 8192 | 2 | 16448 | 50.5 | 43.04 | 1.17× |
| 8192 | 4 | 32896 | 70.85 | 63.52 | 1.12× |
| 8192 | 8 | 65792 | 93.82 | 85.63 | 1.10× |
| 8192 | 16 | 131584 | 118.47 | 113.58 | 1.04× |
| 8192 | 32 | 263168 | 145.1 | 138.32 | 1.05× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.11× higher token-generation throughput (range ~1.04–1.36×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.
Qwen2.5 Coder 7B (Q8_0) 7.54 GiB • 7.62 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 2,267.08 | 1,606.76 | 1.41× |
| pp2048 @ d4096 | 2,094.87 | 1,441.19 | 1.45× |
| pp2048 @ d8192 | 1,906.26 | 1,306.22 | 1.46× |
| pp2048 @ d16384 | 1,634.82 | 1,140.95 | 1.43× |
| pp2048 @ d32768 | 1,302.32 | 808.52 | 1.61× |
Interpretation: DGX Spark delivers about 1.5× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 29.4 | 25.26 | 1.16× |
| tg32 @ d4096 | 28.31 | 24.18 | 1.17× |
| tg32 @ d8192 | 27.53 | 23.91 | 1.15× |
| tg32 @ d16384 | 26.03 | 22.83 | 1.14× |
| tg32 @ d32768 | 22.08 | 20.42 | 1.08× |
Interpretation: DGX Spark averages 1.15× token-generation throughput (range ~1.08–1.17×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 2,252.9 | 1,541.92 | 1.46× |
| 4096 | 2 | 8256 | 2,257.46 | 1,605.3 | 1.41× |
| 4096 | 4 | 16512 | 2,254.46 | 1,665.05 | 1.35× |
| 4096 | 8 | 33024 | 2,257.44 | 1,666.64 | 1.35× |
| 4096 | 16 | 66048 | 2,257.9 | 1,669.28 | 1.35× |
| 4096 | 32 | 132096 | 2,257.55 | 1,670.09 | 1.35× |
| 8192 | 1 | 8224 | 2,185.91 | 1,586.31 | 1.38× |
| 8192 | 2 | 16448 | 2,183.95 | 1,591.43 | 1.37× |
| 8192 | 4 | 32896 | 2,181.92 | 1,591.54 | 1.37× |
| 8192 | 8 | 65792 | 2,182.77 | 1,594.46 | 1.37× |
| 8192 | 16 | 131584 | 2,182.93 | 1,595.02 | 1.37× |
| 8192 | 32 | 263168 | 2,182.49 | 1,597.01 | 1.37× |
Interpretation: DGX Spark delivers about 1.4× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 28.3 | 24.45 | 1.16× |
| 4096 | 2 | 8256 | 50.29 | 49.75 | 1.01× |
| 4096 | 4 | 16512 | 91.86 | 93.52 | 0.98× |
| 4096 | 8 | 33024 | 160.22 | 137.09 | 1.17× |
| 4096 | 16 | 66048 | 244.69 | 238.89 | 1.02× |
| 4096 | 32 | 132096 | 370.44 | 341.33 | 1.09× |
| 8192 | 1 | 8224 | 27.33 | 24.19 | 1.13× |
| 8192 | 2 | 16448 | 47.28 | 47.54 | 0.99× |
| 8192 | 4 | 32896 | 82.27 | 84.07 | 0.98× |
| 8192 | 8 | 65792 | 134.16 | 117.72 | 1.14× |
| 8192 | 16 | 131584 | 191.55 | 185.6 | 1.03× |
| 8192 | 32 | 263168 | 262.39 | 244.06 | 1.08× |
Interpretation: Across batch sizes and contexts, Spark averages roughly 1.05× the token-generation throughput of Thor (range ~0.98–1.17×). Several mid-batch rows sit at or slightly below parity, so for this smaller model the two platforms are effectively matched on decode under concurrency.
Gemma 3 4B QAT (Q4_0) 2.35 GiB • 3.88 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 5,694.21 | 2,642.3 | 2.16× |
| pp2048 @ d4096 | 5,228.77 | 2,442.52 | 2.14× |
| pp2048 @ d8192 | 4,882.66 | 2,325.68 | 2.10× |
| pp2048 @ d16384 | 4,491.42 | 2,156.82 | 2.08× |
| pp2048 @ d32768 | 3,840.09 | 1,819.49 | 2.11× |
Interpretation: DGX Spark delivers about 2.1× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 79.83 | 66.78 | 1.20× |
| tg32 @ d4096 | 67.49 | 59.65 | 1.13× |
| tg32 @ d8192 | 66.87 | 58.93 | 1.13× |
| tg32 @ d16384 | 63.36 | 57.26 | 1.11× |
| tg32 @ d32768 | 57.67 | 52.09 | 1.11× |
Interpretation: DGX Spark averages 1.13× token-generation throughput (range ~1.11–1.20×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 5,819.48 | 2,695.26 | 2.16× |
| 4096 | 2 | 8256 | 5,837.53 | 2,744.63 | 2.13× |
| 4096 | 4 | 16512 | 5,871.91 | 2,757 | 2.13× |
| 4096 | 8 | 33024 | 5,899.92 | 2,762.35 | 2.14× |
| 4096 | 16 | 66048 | 5,900.73 | 2,766.42 | 2.13× |
| 4096 | 32 | 132096 | 5,905.76 | 2,767.23 | 2.13× |
| 8192 | 1 | 8224 | 5,750.56 | 2,712.78 | 2.12× |
| 8192 | 2 | 16448 | 5,814.33 | 2,723.81 | 2.13× |
| 8192 | 4 | 32896 | 5,822.18 | 2,730.49 | 2.13× |
| 8192 | 8 | 65792 | 5,822.26 | 2,732.9 | 2.13× |
| 8192 | 16 | 131584 | 5,833.08 | 2,734.08 | 2.13× |
| 8192 | 32 | 263168 | 5,844.16 | 2,735.76 | 2.14× |
Interpretation: DGX Spark delivers about 2.1× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 68.59 | 60.57 | 1.13× |
| 4096 | 2 | 8256 | 109.49 | 101.2 | 1.08× |
| 4096 | 4 | 16512 | 189.62 | 163.43 | 1.16× |
| 4096 | 8 | 33024 | 282.33 | 204.05 | 1.38× |
| 4096 | 16 | 66048 | 379.54 | 347.3 | 1.09× |
| 4096 | 32 | 132096 | 484.68 | 447.23 | 1.08× |
| 8192 | 1 | 8224 | 67.12 | 59.92 | 1.12× |
| 8192 | 2 | 16448 | 98.07 | 91.74 | 1.07× |
| 8192 | 4 | 32896 | 158.21 | 139.83 | 1.13× |
| 8192 | 8 | 65792 | 216.33 | 168.65 | 1.28× |
| 8192 | 16 | 131584 | 267.69 | 253.35 | 1.06× |
| 8192 | 32 | 263168 | 314.98 | 305.17 | 1.03× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.11× higher token-generation throughput (range ~1.03–1.38×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.