llama.cpp Benchmarks — DGX Spark vs Jetson AGX Thor

This page presents comparative benchmark results for the NVIDIA DGX Spark and the Jetson AGX Thor using llama.cpp build b6767 (commit 5acd45546). Each section covers a different large language model tested under identical conditions. Values are reported in tokens per second, and the Uplift column gives relative performance (Spark ÷ Thor). Benchmarks were recorded and published on October 31, 2025.

The benchmark data presented here originates from the public discussion on the llama.cpp GitHub repository. The original DGX Spark benchmark results are archived at spark-llamacpp-bench.html, and the Jetson AGX Thor results at thor-llamacpp-bench.html. These datasets form the basis for the comparative tables and performance summaries shown below.
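As a quick sanity check on how the Uplift column is derived, the ratio can be recomputed from any pair of throughput values. A minimal sketch, using the pp2048 row for gpt-oss-20B from the tables below:

```python
# Uplift = DGX Spark throughput / Jetson AGX Thor throughput (tokens/s).
# Values are the gpt-oss-20B pp2048 row from the tables below.
spark_pp2048 = 3610.56  # tokens/s on DGX Spark
thor_pp2048 = 1861.26   # tokens/s on Jetson AGX Thor

uplift = spark_pp2048 / thor_pp2048
print(f"{uplift:.2f}x")  # -> 1.94x
```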

gpt-oss‑20B MXFP4 MoE 11.27 GiB • 20.91 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.
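Throughput figures map directly to latency: time-to-first-token is roughly prompt length ÷ prefill rate. A small illustrative calculation under that simplification (the rates are the gpt-oss-20B pp2048 row below; real runs add model-load, sampling, and scheduling overhead):

```python
# Approximate prompt-encode time for a 2048-token prompt, ignoring
# per-run overheads (model load, sampling, scheduling).
prompt_tokens = 2048
spark_prefill = 3610.56  # tokens/s, DGX Spark pp2048
thor_prefill = 1861.26   # tokens/s, Jetson AGX Thor pp2048

spark_s = prompt_tokens / spark_prefill
thor_s = prompt_tokens / thor_prefill
print(f"Spark ~{spark_s:.2f} s, Thor ~{thor_s:.2f} s to encode the prompt")
```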

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 3,610.56 | 1,861.26 | 1.94× |
| pp2048 @ d4096 | 3,361.11 | 1,728.68 | 1.94× |
| pp2048 @ d8192 | 3,147.73 | 1,614.11 | 1.95× |
| pp2048 @ d16384 | 2,685.54 | 1,377.71 | 1.95× |
| pp2048 @ d32768 | 2,055.34 | 1,123.22 | 1.83× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 79.74 | 57.18 | 1.39× |
| tg32 @ d4096 | 74.63 | 52.46 | 1.42× |
| tg32 @ d8192 | 69.49 | 51.26 | 1.36× |
| tg32 @ d16384 | 64.02 | 51.6 | 1.24× |
| tg32 @ d32768 | 55.96 | 46.53 | 1.20× |
Interpretation: DGX Spark averages 1.36× token-generation throughput (range ~1.20–1.42×) versus Jetson Thor for these tests. This suggests lower per-token overhead and steadier GPU utilization on Spark at these context depths.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.
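The Cache Entries column follows directly from that setup: each concurrent request keeps its prompt plus generated tokens in the KV cache, so entries ≈ batch size × (context length + 32). A sketch that reproduces the table values:

```python
# Cache entries ≈ batch_size * (context_length + generated_tokens),
# since each concurrent request keeps its full context in the KV cache.
def cache_entries(context_length: int, batch_size: int, gen_tokens: int = 32) -> int:
    return batch_size * (context_length + gen_tokens)

print(cache_entries(4096, 1))   # -> 4128
print(cache_entries(4096, 32))  # -> 132096
print(cache_entries(8192, 32))  # -> 263168
```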

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 3,644.93 | 1,869.5 | 1.95× |
| 4096 | 2 | 8256 | 3,637.13 | 1,972.64 | 1.84× |
| 4096 | 4 | 16512 | 3,652.8 | 1,977.65 | 1.85× |
| 4096 | 8 | 33024 | 3,657.89 | 1,982.21 | 1.85× |
| 4096 | 16 | 66048 | 3,664.62 | 1,981.38 | 1.85× |
| 4096 | 32 | 132096 | 3,667.6 | 1,983.83 | 1.85× |
| 8192 | 1 | 8224 | 3,584.99 | 1,914.62 | 1.87× |
| 8192 | 2 | 16448 | 3,603.29 | 1,925.91 | 1.87× |
| 8192 | 4 | 32896 | 3,594.24 | 1,932.46 | 1.86× |
| 8192 | 8 | 65792 | 3,591.4 | 1,933.88 | 1.86× |
| 8192 | 16 | 131584 | 3,601.99 | 1,934.02 | 1.86× |
| 8192 | 32 | 263168 | 3,596.94 | 1,936.11 | 1.86× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 74.24 | 55.52 | 1.34× |
| 4096 | 2 | 8256 | 83.59 | 59.08 | 1.41× |
| 4096 | 4 | 16512 | 133.37 | 97.24 | 1.37× |
| 4096 | 8 | 33024 | 208.45 | 146.42 | 1.42× |
| 4096 | 16 | 66048 | 302.07 | 224.76 | 1.34× |
| 4096 | 32 | 132096 | 426.22 | 303.98 | 1.40× |
| 8192 | 1 | 8224 | 69.81 | 54.57 | 1.28× |
| 8192 | 2 | 16448 | 80.29 | 57.67 | 1.39× |
| 8192 | 4 | 32896 | 127.47 | 92.65 | 1.38× |
| 8192 | 8 | 65792 | 188.77 | 136.24 | 1.39× |
| 8192 | 16 | 131584 | 262.37 | 201.43 | 1.30× |
| 8192 | 32 | 263168 | 348.69 | 264.49 | 1.32× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.37× higher token-generation throughput (range ~1.28–1.42×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.
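One way to read the S_TG rows above: aggregate throughput grows with batch size while the per-request rate falls, so dividing by batch size shows what each client actually sees. A sketch using the DGX Spark 4096-context figures from the table:

```python
# Per-request decode rate = aggregate S_TG / batch size.
# DGX Spark, gpt-oss-20B, 4096-token context (values from the table above).
s_tg = {1: 74.24, 2: 83.59, 4: 133.37, 8: 208.45, 16: 302.07, 32: 426.22}

for batch, total in sorted(s_tg.items()):
    print(f"batch {batch:2d}: {total:6.2f} t/s aggregate, "
          f"{total / batch:5.2f} t/s per request")
```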
gpt-oss‑120B MXFP4 MoE 59.02 GiB • 116.83 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 1,689.47 | 937.81 | 1.80× |
| pp2048 @ d4096 | 1,733.41 | 897.09 | 1.93× |
| pp2048 @ d8192 | 1,705.93 | 856.72 | 1.99× |
| pp2048 @ d16384 | 1,514.78 | 774.93 | 1.95× |
| pp2048 @ d32768 | 1,221.23 | 635.81 | 1.92× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 52.87 | 41.82 | 1.26× |
| tg32 @ d4096 | 51.02 | 39.35 | 1.30× |
| tg32 @ d8192 | 48.46 | 38.26 | 1.27× |
| tg32 @ d16384 | 44.78 | 36.31 | 1.23× |
| tg32 @ d32768 | 38.76 | 32.62 | 1.19× |
Interpretation: DGX Spark averages 1.26× token-generation throughput (range ~1.19–1.30×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 1,746.43 | 860.84 | 2.03× |
| 4096 | 2 | 8256 | 1,844.38 | 943.38 | 1.96× |
| 4096 | 4 | 16512 | 1,858.46 | 942.05 | 1.97× |
| 4096 | 8 | 33024 | 1,865 | 942.24 | 1.98× |
| 4096 | 16 | 66048 | 1,866.5 | 941.02 | 1.98× |
| 4096 | 32 | 132096 | 1,868.23 | 941.38 | 1.98× |
| 8192 | 1 | 8224 | 1,835.87 | 924.94 | 1.98× |
| 8192 | 2 | 16448 | 1,828.57 | 922.94 | 1.98× |
| 8192 | 4 | 32896 | 1,840.87 | 924.75 | 1.99× |
| 8192 | 8 | 65792 | 1,835.41 | 926.78 | 1.98× |
| 8192 | 16 | 131584 | 1,837.75 | 926.02 | 1.98× |
| 8192 | 32 | 263168 | 1,837.57 | 926.46 | 1.98× |
Interpretation: DGX Spark delivers about 2.0× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 42.59 | 38.97 | 1.09× |
| 4096 | 2 | 8256 | 51.13 | 32.79 | 1.56× |
| 4096 | 4 | 16512 | 80.86 | 52.26 | 1.55× |
| 4096 | 8 | 33024 | 120.55 | 79.09 | 1.52× |
| 4096 | 16 | 66048 | 166.05 | 112.57 | 1.48× |
| 4096 | 32 | 132096 | 223.53 | 148.55 | 1.50× |
| 8192 | 1 | 8224 | 41.42 | 38.47 | 1.08× |
| 8192 | 2 | 16448 | 49.1 | 30.91 | 1.59× |
| 8192 | 4 | 32896 | 76.7 | 50.6 | 1.52× |
| 8192 | 8 | 65792 | 109.44 | 72.15 | 1.52× |
| 8192 | 16 | 131584 | 145.98 | 105.35 | 1.39× |
| 8192 | 32 | 263168 | 183.08 | 132.61 | 1.38× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.51× higher token-generation throughput (range ~1.08–1.59×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.
Qwen3 Coder 30B A3B 30.25 GiB • 30.53 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 2,933.39 | 1,533.71 | 1.91× |
| pp2048 @ d4096 | 2,537.98 | 1,314.86 | 1.93× |
| pp2048 @ d8192 | 2,246.86 | 1,143.21 | 1.97× |
| pp2048 @ d16384 | 1,772.41 | 928.23 | 1.91× |
| pp2048 @ d32768 | 1,252.1 | 610.49 | 2.05× |
Interpretation: DGX Spark delivers about 2.0× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 59.95 | 42.7 | 1.40× |
| tg32 @ d4096 | 52.7 | 37.74 | 1.40× |
| tg32 @ d8192 | 44.48 | 35.88 | 1.24× |
| tg32 @ d16384 | 37.1 | 32.12 | 1.16× |
| tg32 @ d32768 | 27.82 | 25.69 | 1.08× |
Interpretation: DGX Spark averages 1.24× token-generation throughput (range ~1.08–1.40×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 2,831.93 | 1,491.24 | 1.90× |
| 4096 | 2 | 8256 | 2,834.16 | 1,531.41 | 1.85× |
| 4096 | 4 | 16512 | 2,835.64 | 1,532.9 | 1.85× |
| 4096 | 8 | 33024 | 2,848.21 | 1,536.09 | 1.85× |
| 4096 | 16 | 66048 | 2,847.43 | 1,537.79 | 1.85× |
| 4096 | 32 | 132096 | 2,848.01 | 1,537.29 | 1.85× |
| 8192 | 1 | 8224 | 2,664.32 | 1,426.21 | 1.87× |
| 8192 | 2 | 16448 | 2,679.65 | 1,428.69 | 1.88× |
| 8192 | 4 | 32896 | 2,674.64 | 1,428.62 | 1.87× |
| 8192 | 8 | 65792 | 2,682.01 | 1,430.63 | 1.87× |
| 8192 | 16 | 131584 | 2,677.56 | 1,431.27 | 1.87× |
| 8192 | 32 | 263168 | 2,678.26 | 1,430.77 | 1.87× |
Interpretation: DGX Spark delivers about 1.9× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 52.58 | 38.61 | 1.36× |
| 4096 | 2 | 8256 | 57.27 | 46.1 | 1.24× |
| 4096 | 4 | 16512 | 82.78 | 72.11 | 1.15× |
| 4096 | 8 | 33024 | 116.63 | 104.64 | 1.11× |
| 4096 | 16 | 66048 | 159.11 | 145.75 | 1.09× |
| 4096 | 32 | 132096 | 207.86 | 188.9 | 1.10× |
| 8192 | 1 | 8224 | 44.18 | 36.17 | 1.22× |
| 8192 | 2 | 16448 | 50.5 | 43.04 | 1.17× |
| 8192 | 4 | 32896 | 70.85 | 63.52 | 1.12× |
| 8192 | 8 | 65792 | 93.82 | 85.63 | 1.10× |
| 8192 | 16 | 131584 | 118.47 | 113.58 | 1.04× |
| 8192 | 32 | 263168 | 145.1 | 138.32 | 1.05× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.11× higher token-generation throughput (range ~1.04–1.36×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.
Qwen2.5 Coder 7B (Q8_0) 7.54 GiB • 7.62 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 2,267.08 | 1,606.76 | 1.41× |
| pp2048 @ d4096 | 2,094.87 | 1,441.19 | 1.45× |
| pp2048 @ d8192 | 1,906.26 | 1,306.22 | 1.46× |
| pp2048 @ d16384 | 1,634.82 | 1,140.95 | 1.43× |
| pp2048 @ d32768 | 1,302.32 | 808.52 | 1.61× |
Interpretation: DGX Spark delivers about 1.5× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 29.4 | 25.26 | 1.16× |
| tg32 @ d4096 | 28.31 | 24.18 | 1.17× |
| tg32 @ d8192 | 27.53 | 23.91 | 1.15× |
| tg32 @ d16384 | 26.03 | 22.83 | 1.14× |
| tg32 @ d32768 | 22.08 | 20.42 | 1.08× |
Interpretation: DGX Spark averages 1.15× token-generation throughput (range ~1.08–1.17×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 2,252.9 | 1,541.92 | 1.46× |
| 4096 | 2 | 8256 | 2,257.46 | 1,605.3 | 1.41× |
| 4096 | 4 | 16512 | 2,254.46 | 1,665.05 | 1.35× |
| 4096 | 8 | 33024 | 2,257.44 | 1,666.64 | 1.35× |
| 4096 | 16 | 66048 | 2,257.9 | 1,669.28 | 1.35× |
| 4096 | 32 | 132096 | 2,257.55 | 1,670.09 | 1.35× |
| 8192 | 1 | 8224 | 2,185.91 | 1,586.31 | 1.38× |
| 8192 | 2 | 16448 | 2,183.95 | 1,591.43 | 1.37× |
| 8192 | 4 | 32896 | 2,181.92 | 1,591.54 | 1.37× |
| 8192 | 8 | 65792 | 2,182.77 | 1,594.46 | 1.37× |
| 8192 | 16 | 131584 | 2,182.93 | 1,595.02 | 1.37× |
| 8192 | 32 | 263168 | 2,182.49 | 1,597.01 | 1.37× |
Interpretation: DGX Spark delivers about 1.4× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 28.3 | 24.45 | 1.16× |
| 4096 | 2 | 8256 | 50.29 | 49.75 | 1.01× |
| 4096 | 4 | 16512 | 91.86 | 93.52 | 0.98× |
| 4096 | 8 | 33024 | 160.22 | 137.09 | 1.17× |
| 4096 | 16 | 66048 | 244.69 | 238.89 | 1.02× |
| 4096 | 32 | 132096 | 370.44 | 341.33 | 1.09× |
| 8192 | 1 | 8224 | 27.33 | 24.19 | 1.13× |
| 8192 | 2 | 16448 | 47.28 | 47.54 | 0.99× |
| 8192 | 4 | 32896 | 82.27 | 84.07 | 0.98× |
| 8192 | 8 | 65792 | 134.16 | 117.72 | 1.14× |
| 8192 | 16 | 131584 | 191.55 | 185.6 | 1.03× |
| 8192 | 32 | 263168 | 262.39 | 244.06 | 1.08× |
Interpretation: Across batch sizes and contexts, Spark averages roughly 1.05× the token-generation throughput of Thor (range ~0.98–1.17×). Several mid-batch rows sit at or slightly below parity, so for this smaller model the two platforms are effectively matched on decode under concurrency.
Gemma 3 4B QAT (Q4_0) 2.35 GiB • 3.88 B params
Single‑batch tests use a 2048‑token prompt. The suffix (e.g., @ d4096) indicates a pre‑filled 4096‑token context is already resident. Prefill encodes the prompt; Token Generation measures decoding speed for 32 tokens.

Single‑Batch (llama‑bench)

Prefill
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| pp2048 | 5,694.21 | 2,642.3 | 2.16× |
| pp2048 @ d4096 | 5,228.77 | 2,442.52 | 2.14× |
| pp2048 @ d8192 | 4,882.66 | 2,325.68 | 2.10× |
| pp2048 @ d16384 | 4,491.42 | 2,156.82 | 2.08× |
| pp2048 @ d32768 | 3,840.09 | 1,819.49 | 2.11× |
Interpretation: DGX Spark delivers about 2.1× higher prefill throughput.
Token Generation
| Test | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- |
| tg32 | 79.83 | 66.78 | 1.20× |
| tg32 @ d4096 | 67.49 | 59.65 | 1.13× |
| tg32 @ d8192 | 66.87 | 58.93 | 1.13× |
| tg32 @ d16384 | 63.36 | 57.26 | 1.11× |
| tg32 @ d32768 | 57.67 | 52.09 | 1.11× |
Interpretation: DGX Spark averages 1.13× token-generation throughput (range ~1.11–1.20×) versus Jetson Thor for these single-batch tests. At these context depths, decode appears dominated by per-token overhead (launch/latency), which Spark handles more efficiently.

Multi‑batch runs use a 2048‑token prompt and generate 32 tokens per request. Batch Size is concurrent requests. Cache Entries approximates total KV pairs during the run.

Multi‑Batch (llama‑batched‑bench)

Prefill Throughput (S_PP)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 5,819.48 | 2,695.26 | 2.16× |
| 4096 | 2 | 8256 | 5,837.53 | 2,744.63 | 2.13× |
| 4096 | 4 | 16512 | 5,871.91 | 2,757 | 2.13× |
| 4096 | 8 | 33024 | 5,899.92 | 2,762.35 | 2.14× |
| 4096 | 16 | 66048 | 5,900.73 | 2,766.42 | 2.13× |
| 4096 | 32 | 132096 | 5,905.76 | 2,767.23 | 2.13× |
| 8192 | 1 | 8224 | 5,750.56 | 2,712.78 | 2.12× |
| 8192 | 2 | 16448 | 5,814.33 | 2,723.81 | 2.13× |
| 8192 | 4 | 32896 | 5,822.18 | 2,730.49 | 2.13× |
| 8192 | 8 | 65792 | 5,822.26 | 2,732.9 | 2.13× |
| 8192 | 16 | 131584 | 5,833.08 | 2,734.08 | 2.13× |
| 8192 | 32 | 263168 | 5,844.16 | 2,735.76 | 2.14× |
Interpretation: DGX Spark delivers about 2.1× higher prefill throughput under concurrency.
Token Generation Throughput (S_TG)
| Context Length (PP) | Batch Size | Cache Entries | DGX Spark (t/s) | Jetson AGX Thor (t/s) | Uplift (×) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 1 | 4128 | 68.59 | 60.57 | 1.13× |
| 4096 | 2 | 8256 | 109.49 | 101.2 | 1.08× |
| 4096 | 4 | 16512 | 189.62 | 163.43 | 1.16× |
| 4096 | 8 | 33024 | 282.33 | 204.05 | 1.38× |
| 4096 | 16 | 66048 | 379.54 | 347.3 | 1.09× |
| 4096 | 32 | 132096 | 484.68 | 447.23 | 1.08× |
| 8192 | 1 | 8224 | 67.12 | 59.92 | 1.12× |
| 8192 | 2 | 16448 | 98.07 | 91.74 | 1.07× |
| 8192 | 4 | 32896 | 158.21 | 139.83 | 1.13× |
| 8192 | 8 | 65792 | 216.33 | 168.65 | 1.28× |
| 8192 | 16 | 131584 | 267.69 | 253.35 | 1.06× |
| 8192 | 32 | 263168 | 314.98 | 305.17 | 1.03× |
Interpretation: Across batch sizes and contexts, Spark sustains roughly 1.11× higher token-generation throughput (range ~1.03–1.38×), indicating better scaling of decode workloads and reduced launch/latency overhead under concurrency.