When a new generation of Jetsons is introduced, we know that they will perform much better after software is tuned to their capabilities. At launch in September 2025, the initial benchmarks running LLMs using the vLLM inference engine were promising. But only five weeks later, performance was 3.5 times faster! Looky here:
Introduction
One of my favorite quotes is from Alvy Ray Smith, of Pixar fame: “Pixels are bits that you can see.” We associate GPUs with graphics, and increasingly with Large Language Models (LLMs), but it’s not immediately obvious why GPUs are a natural fit for LLMs.
There’s one clue that can help you put the story together. If you think about a bitmapped computer display, it is a matrix of pixels. We typically dimension the matrix in width and height; for example, a 1080p monitor is 1920 pixels wide by 1080 pixels high. What those pixel values represent varies by format, but one you’ll hear often is RGBA: 8 bits for red, 8 bits for green, 8 bits for blue, and 8 bits for an alpha channel that controls opacity. So a simple display is 1920×1080×32, or a matrix of 1920×1080 with each entry being 32 bits.
The first bitmapped displays showed up at Xerox PARC in the 1970s. On early machines like the Xerox Alto, the screen displayed a block of memory the CPU wrote into directly — you could literally watch the image “paint” itself during refresh. Over time, that special-purpose display hardware evolved into programmable graphics cards. Roughly 25 years later, we had what we now recognize as GPUs. You might have heard of the company which created the GPU: NVIDIA.
So what does this have to do with LLMs? Here’s the thing: A huge amount of graphics work boils down to linear algebra. And the interesting thing about dealing with graphics, matrices and linear algebra is that operations can be done in parallel. Fast.
At their core, LLMs are matrices of numbers. Words are broken into tokens, and those tokens are turned into vectors in an embedding space, a big learned coordinate system where linear algebra captures relationships in vocabulary and context. It’s computational statistics at heart. So we leverage GPUs for their math capabilities, though we don’t show the results on the monitor.
In general, we break working with LLMs into two tasks. The first is training the LLM. During training, the matrices of numbers that make up the model are created. These matrices are the model’s learned parameters, usually called weights. While this training can be done on a single GPU for smaller models, modern large models are trained on clusters of GPUs in data centers. For the biggest models, that process can take days, weeks, or even months.
The second task is inference. Inference is the process of taking a prompt, turning it into numbers, and repeatedly predicting the next token to build up a response. This process is driven by the model’s weights, which form a large statistical model that estimates which token is most likely to come next. Inference is the part you can run on a local GPU, and it is what we are covering here.
LLM Inferencing Explained
From a developer’s perspective, the high-level view of the inference algorithm is simple. We’re staying high-level here, but know that the devil is in the details. An LLM is a large data structure that holds an organized set of parameters, mostly vectors and matrices. But before the model can work with numbers, we need a way to turn text into numbers.
That conversion is called tokenization. A tokenizer splits text into small pieces called tokens. These may be whole words, partial words, or even bits of punctuation. Each token is then looked up in a vocabulary table and assigned a unique integer ID. That list of token IDs becomes the input to the model.
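If you want to see this in action, here’s a quick sketch using the Hugging Face transformers library, with the GPT-2 tokenizer standing in for whichever model you’re running (the library and model choice are just for illustration):

from transformers import AutoTokenizer

# Load a tokenizer; every model ships with its own vocabulary and splitting rules
tokenizer = AutoTokenizer.from_pretrained("gpt2")

token_ids = tokenizer.encode("Pixels are bits that you can see.")
print(token_ids)                                    # a list of integer IDs, one per token
print(tokenizer.convert_ids_to_tokens(token_ids))   # the text pieces those IDs map to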
The first data structure we encounter in the LLM is the embedding table, which we use to convert tokens into what are called activations. Activations are vectors that represent the location of a token in the model’s embedding space.
Most of the rest of the LLM consists of a group of transformer layers. Each transformer layer has two main sets of weight matrices. The first set belongs to the attention block, which determines how each token relates to the other tokens around it. The second set belongs to the Multi-Layer Perceptron (MLP), sometimes called the feed-forward block. If attention decides what each token should pay attention to, the MLP decides what to do with that information. It’s where the model synthesizes new ideas from the information attention pulled together.
Conceptually, the inference algorithm is a large loop that takes the current context of activations and runs them through these transformer layers sequentially.
At this point, we know a little about the LLM data. Before we start describing the inferencing algorithm, I suggest you take a look at the Transformer Explainer: https://poloclub.github.io/transformer-explainer/ . It’s a great visual description of how a simple LLM works. Now that we have the pieces, we can talk about how they are used at runtime.
A Note About Context: LLMs never work with a single token at a time. They work with a sequence of tokens called the context. The context includes everything the model has seen so far: the system prompt, your prompt, any previous conversation, and any tokens the model has generated in response.
During the prompt “prefill” phase, the model processes all of the tokens in the context at once. Each token produces an activation vector. These activations are passed through every transformer layer, and within each layer the tokens are processed in parallel. This is where the model builds its sense of history and meaning.
Once prefill is finished, we switch to the decode loop to generate tokens. We take the last token in the prompt and run it through the inference algorithm. Each new token the model generates is appended to the context, and the process repeats. The context grows with every generated token. This growth is an important performance consideration, and it’s why efficient caching strategies are so important.
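To make the prefill/decode split concrete, here’s a rough sketch of the outer loop. The names here (tokenizer, run_layers, sample_next_token, max_new_tokens) are hypothetical stand-ins, not a real API; the per-layer work hiding inside run_layers is what the next section sketches.

context = tokenizer.encode(system_prompt + user_prompt)    # token IDs seen so far

# Prefill: process every prompt token (real engines do this in parallel) and fill the KV-cache
for token_id in context:
    activation = run_layers(token_id, kv_cache)

# Decode loop: generate one token at a time, growing the context each step
for _ in range(max_new_tokens):
    next_id = sample_next_token(activation)
    context.append(next_id)
    activation = run_layers(next_id, kv_cache)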
A Minimal Pseudocode View of Inference
activation = embedding_table[token_id]   # look up the token's embedding, shape [D]
for layer in model.layers:               # run through every transformer layer in order
    activation = attention(layer, activation, kv_cache)   # attention block, uses the KV-cache
    activation = mlp(layer, activation)                    # feed-forward (MLP) block
Now that we have the idea of context and how the activations move through the layers, we can talk about where the heavy computation happens.
Attention: Where the Work Happens
The attention block is where things get expensive. For each layer we create three new vectors by multiplying the activation by three different D × D weight matrices:
Q = activation * W_Q   # query: what this token is looking for
K = activation * W_K   # key: what this token offers to other tokens
V = activation * W_V   # value: the information this token carries forward
If D is 8,192, as in Llama-70B, each of these matrices has about 67 million elements. That is a lot of math per token and per layer, which is why GPUs are such a good match for this workload.
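A quick back-of-the-envelope calculation, using the simplified D × D shape described above:

D = 8192                        # hidden dimension in the Llama-70B range
elements = D * D                # 67,108,864 elements per projection matrix
mb_fp16 = elements * 2 / 1e6    # ~134 MB per matrix at 16-bit precision
print(elements, mb_fp16)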
Prefill
During prefill we run every token through every transformer layer. At each layer we compute K and V for each token and store them in a KV-cache for that layer. The cache is not a single blob. It is a set of K and V buffers — one pair per layer — that grows as we process more tokens.
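Here’s a rough picture of that structure in Python. This is a plain-list sketch for clarity, not how a real engine lays the cache out in GPU memory:

num_layers = 80                 # e.g. Llama-70B has 80 transformer layers

# One pair of growing K and V buffers per layer
kv_cache = [{"K": [], "V": []} for _ in range(num_layers)]

def append_kv(layer_idx, k_vec, v_vec):
    # Each processed token adds one K vector and one V vector to every layer's buffers
    kv_cache[layer_idx]["K"].append(k_vec)
    kv_cache[layer_idx]["V"].append(v_vec)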
Decode Loop
After prefill, we generate one new token at a time. For each new token:
- We compute fresh Q, K, and V at every layer.
- We append the new K and V to that layer’s KV-cache.
- We compare the new Q against every previous K.
- We blend the corresponding Vs to create the next activation.
We do not recompute Ks and Vs for previous tokens. Only the new token’s K and V are added each step. This is why the KV-cache grows with every token in the sequence.
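Here’s roughly what that per-step attention looks like, sketched with NumPy for a single head. Real kernels split this across many heads and fuse the steps, but the shape of the computation is the same:

import numpy as np

def decode_attention(q, layer_cache):
    K = np.stack(layer_cache["K"])          # [T, D] — every previous token's K
    V = np.stack(layer_cache["V"])          # [T, D] — every previous token's V
    scores = K @ q / np.sqrt(q.shape[0])    # compare the new Q against every previous K
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the previous tokens
    return weights @ V                      # blend the Vs into the next activation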
Turning Activations Into Tokens
Once a token’s activation has passed through all of the layers, the model converts that activation into vocabulary scores. It does this with one more matrix multiply using an output weight matrix. The result is a vector of logits, one score per token in the vocabulary. We apply a softmax to produce probabilities, select the next token according to the sampling strategy, and feed that token ID back into the model.
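In code, that final step looks something like this NumPy sketch, where W_out is the output weight matrix with shape [D, vocab_size] and temperature sampling stands in for whatever sampling strategy you choose:

import numpy as np

def next_token(activation, W_out, temperature=1.0):
    logits = activation @ W_out              # one score per token in the vocabulary
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                     # softmax -> probabilities
    return int(np.random.choice(len(probs), p=probs))   # sample the next token ID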
Why GPUs Matter
All of this work is highly parallel. Each token’s activation can be multiplied by the Q, K, and V weight matrices in parallel across the hidden dimension, and across many tokens at once. That is why running inference on a GPU is so important. Tensor Cores on devices like Jetson Thor are built for this type of matrix math, which is where a large part of the speed comes from.
All of these weight matrices can be stored in full precision, but they do not have to be. Quantization lets us store them in formats like 8-bit or 4-bit values, which reduces memory requirements and bandwidth while keeping the underlying algorithm the same.
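As a simple illustration, here’s one flavor of quantization, per-tensor symmetric int8. Real schemes use per-channel or per-group scales and 4-bit formats, but the idea is the same:

import numpy as np

def quantize_int8(W):
    scale = np.abs(W).max() / 127.0              # one scale factor for the whole matrix
    W_q = np.round(W / scale).astype(np.int8)    # 8-bit storage: 2x smaller than FP16, 4x smaller than FP32
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale        # approximate reconstruction at compute time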
Improving Performance
When the Jetson AGX Thor was introduced in September 2025, one of the demonstrations was an LLM benchmark running on the vLLM inference engine. Just five short weeks later, the performance of that benchmark improved by 3.5×. The question is:
What changed?
vLLM’s first release on Jetson AGX Thor used a generic code path that runs everywhere. It worked, but it didn’t take advantage of Thor’s architecture. The 3.5× speedup came from adding kernels and code specifically tuned for Thor, which enabled performance features the generic path could not.
Automatic Prefix Caching
Once the Thor-specific code paths were added, vLLM could finally use features that depended on fast access to cached values. Automatic Prefix Caching is the most important of these. APC recognizes when a new prompt starts with the same sequence of tokens as a previous one and skips regenerating K and V for that shared prefix. Instead, it pulls those values directly from the KV-cache and continues from there. This means the model doesn’t have to rebuild the prefix portion of the context during prefill.
When running the generic code path on Thor, APC provided little benefit. Without hardware-aware kernels, the latency of retrieving cached blocks roughly equaled the time saved by avoiding computation. With the Thor-optimized code in place, vLLM can now load cached K and V blocks efficiently, allowing prefix reuse to produce a real reduction in prefill time.
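In vLLM, Automatic Prefix Caching can be enabled with a single engine argument (depending on your vLLM version it may already be on by default; the model name and sampling settings below are just placeholders):

from vllm import LLM, SamplingParams

# enable_prefix_caching turns on Automatic Prefix Caching
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

# Both prompts share the same long prefix, so the second request can reuse
# the K and V values the first one already computed.
shared = "You are a helpful assistant. Here is the document: <long document text> "
outputs = llm.generate([shared + "Summarize it.", shared + "List the key terms."], params)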
PagedAttention
PagedAttention is a standard feature of vLLM, but it needs hardware-aware kernels to deliver its benefits. On the new Jetson Thor, PagedAttention only became effective once kernels tuned for Thor’s architecture were added. The idea behind it is straightforward: instead of storing the KV-cache as one large, growing buffer, vLLM breaks it into fixed-size pages. This keeps memory organized, avoids fragmentation, and allows multiple sequences to coexist cleanly.
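Here’s a sketch of that bookkeeping. The numbers are illustrative; vLLM calls the pages “blocks” and commonly uses 16 tokens per block:

PAGE_SIZE = 16             # tokens per page

# Each sequence keeps a block table mapping logical pages to physical pages
# in a shared pool, much like virtual memory.
block_table = [42, 7, 93]  # this sequence's KV-cache lives in pool pages 42, 7, 93

def locate_token(position):
    logical_page = position // PAGE_SIZE    # which page the token falls in
    offset = position % PAGE_SIZE           # where it sits within that page
    return block_table[logical_page], offset

print(locate_token(37))    # -> (93, 5): token 37 lives at slot 5 of physical page 93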
Generic attention kernels don’t understand the paged layout. They treat the KV-cache as a single flat buffer, so the paging structure doesn’t provide any real benefit. With the Thor-optimized kernels in place, vLLM can access the KV-cache page by page, using memory patterns that align with how the GPU prefers to read and write data. Fragmentation dropped, access patterns became more regular, and throughput rose accordingly.
Optimized Attention Kernels
FlashInfer and xFormers provided the next major improvement. These libraries include attention kernels that fuse operations together, reduce intermediate memory traffic, and make much better use of the GPU’s Tensor Cores. Early versions of vLLM on Jetson ran attention through more generic CUDA and PyTorch operators, which worked correctly but didn’t match the way Thor’s hardware schedules matrix operations.
Once the optimized kernels were integrated—and once they were connected to the paged KV-cache—they allowed the GPU to execute the attention step with far fewer stalls. The fused kernels kept the computation paths shorter and reduced the number of trips to memory, which let Thor sustain a higher level of activity during each attention pass. This didn’t exhaust the hardware’s potential, but it unlocked performance the generic operators simply couldn’t reach.
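PyTorch’s built-in scaled_dot_product_attention gives a feel for the difference. It’s standing in here for the FlashInfer/xFormers kernels vLLM actually uses; the shapes are illustrative and this needs a CUDA-capable GPU:

import torch
import torch.nn.functional as F

# q, k, v: [batch, heads, seq_len, head_dim]
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Unfused: separate matmul, softmax, matmul, each writing intermediates to GPU memory
scores = q @ k.transpose(-2, -1) / (64 ** 0.5)
out_unfused = torch.softmax(scores, dim=-1) @ v

# Fused: a single kernel that keeps intermediates on-chip when a fast backend is available
out_fused = F.scaled_dot_product_attention(q, k, v)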
CUDA Graphs
The decode loop in an LLM generates one token at a time. Each token triggers a sequence of CUDA kernels—attention, MLP, layer norms, and so on. These kernels run quickly, but each one still needs to be launched from the CPU. When the model is generating many small batches, that launch overhead starts to dominate the total latency.
CUDA Graphs require the underlying kernels to be stable and ‘capturable,’ which wasn’t fully supported by the generic fallback kernels on this hardware. Once the Thor-specific kernels were in place, vLLM could enable CUDA Graphs. This captures the entire decode step as a single executable graph, so instead of launching each kernel individually, the runtime replays the whole sequence at once.
With fewer pauses between operations and less back-and-forth with the CPU, Thor can maintain a steadier level of activity during decoding, which translates directly into faster token generation.
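PyTorch exposes the same capture-and-replay mechanism, which gives a feel for what vLLM is doing. This is a minimal sketch with a single linear layer standing in for a full decode step, and it needs a CUDA GPU:

import torch

step = torch.nn.Linear(4096, 4096, bias=False).cuda().half()   # stand-in for one decode step
static_in = torch.randn(1, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream so lazy initialization happens outside the capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out = step(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the step once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = step(static_in)

# ...then replay it per token: one replay call instead of a string of CPU kernel launches
for _ in range(100):
    static_in.copy_(torch.randn(1, 4096, device="cuda", dtype=torch.half))
    g.replay()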
Putting It All Together
All of these features—Automatic Prefix Caching, PagedAttention, the optimized attention kernels, and CUDA Graphs—were already part of vLLM. What changed on Jetson AGX Thor was the foundation underneath them. Once vLLM gained kernels and execution paths tuned specifically for Thor’s architecture, these higher-level features could finally operate the way they were designed.
The KV-cache began to behave like a structured memory system instead of a flat buffer. Attention kernels stopped stalling on mismatched layouts. Memory access patterns were optimized for the GPU. Kernel launches became cheaper. And the GPU stayed busier between tokens.
The 3.5× speedup didn’t come from new hardware or new model weights. It came from software that finally let Thor run closer to its intended performance envelope. Several features of the new Thor software stack helped make integration easier, which cut development time. Now, we have a new baseline of performance! What comes next?
