Archive of posts with category 'llm'

Accelerate LLM Inference with Speculative Decoding

Many inference speedup techniques mirror classic systems techniques such as caching, paging, tiling, pipelining, and speculative execution (e.g., branch prediction and cache prefetching). Speculative decoding, which generalizes speculative execution to stochastic...
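
A minimal sketch of the draft-then-verify loop at the core of speculative decoding, assuming toy stand-ins for the models: `draft_dist` and `target_dist` are hypothetical placeholders that just return a next-token distribution over a small vocabulary, not a real model API.

```python
# Toy speculative decoding: a cheap draft model proposes k tokens, the target
# model verifies them with an accept/reject rule so the output matches the
# target distribution exactly. Model functions here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def draft_dist(context):
    # Hypothetical cheap draft model: returns a next-token distribution.
    logits = rng.normal(size=VOCAB) * 0.5
    p = np.exp(logits - logits.max())
    return p / p.sum()


def target_dist(context):
    # Hypothetical expensive target model; in practice a single forward pass
    # scores all drafted positions at once rather than one call per position.
    logits = rng.normal(size=VOCAB)
    p = np.exp(logits - logits.max())
    return p / p.sum()


def speculative_step(context, k=4):
    """Draft k tokens, then accept/reject them against the target model."""
    drafted, q_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = rng.choice(VOCAB, p=q)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok, q in zip(drafted, q_dists):
        p = target_dist(ctx)  # target distribution at this position
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # drafted token accepted
            ctx.append(tok)
        else:
            # On the first rejection, resample from the normalized
            # residual distribution max(0, p - q) and stop.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            return accepted
    # All drafts accepted: sample one bonus token from the target model.
    accepted.append(rng.choice(VOCAB, p=target_dist(ctx)))
    return accepted


print(speculative_step(context=[1, 2, 3]))
```

Each call thus emits between one and k+1 tokens while the target model is only queried once per drafted position, which is where the speedup comes from when the draft model agrees with the target often enough.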

Parallelizing Multi-Head Attention

In the multi-head attention mechanism, why do we need to transpose the token dimension with the head dimension after reshaping the Q/K/V projections from 3 dimensions to 4?
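
A minimal numpy sketch of that reshape and transpose, assuming illustrative sizes (batch=2, tokens=5, heads=4, head_dim=16); the shapes and names are for demonstration only, not tied to a specific library.

```python
# Why the transpose: batched matmul contracts over the last two axes, so the
# heads axis must sit next to batch for each head to attend independently.
import numpy as np

batch, tokens, heads, head_dim = 2, 5, 4, 16
d_model = heads * head_dim

q = np.random.randn(batch, tokens, d_model)  # after the Q projection
k = np.random.randn(batch, tokens, d_model)  # after the K projection

# Split d_model into (heads, head_dim): shape (batch, tokens, heads, head_dim)
q = q.reshape(batch, tokens, heads, head_dim)
k = k.reshape(batch, tokens, heads, head_dim)

# Swap the tokens and heads axes: shape (batch, heads, tokens, head_dim),
# so heads behave like an extra batch dimension in the matmul below.
q = q.transpose(0, 2, 1, 3)
k = k.transpose(0, 2, 1, 3)

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
print(scores.shape)  # (batch, heads, tokens, tokens): one attention map per head
```

Without the transpose, the last two axes would be (heads, head_dim) and the matmul would mix heads at a single token position instead of producing a tokens-by-tokens attention map per head.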