Charles Xu
Essays, books, wiki on technologies, career, markets, and more.
Archive of posts with category 'llm'
Many inference speedup techniques mirror classic systems techniques, such as caching, paging, tiling, pipelining, and speculative execution (e.g., branch prediction and cache prefetching). Speculative decoding, generalizing speculative execution to stochastic...
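The core loop of speculative decoding can be sketched in a few lines. This is a toy greedy variant under stated assumptions: `draft_model` and `target_model` are hypothetical stand-in functions (real systems sample from token distributions and verify the draft block in one batched target pass), but the accept-longest-agreeing-prefix structure is the same.

```python
def draft_model(prefix):
    # Hypothetical cheap model: always predicts (last token + 1) mod 10.
    return (prefix[-1] + 1) % 10

def target_model(prefix):
    # Hypothetical strong model: same rule, except it emits 0 after a 7.
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10

def speculative_step(prefix, k=4):
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_model(proposal))
    drafted = proposal[len(prefix):]
    # 2) The target model verifies each drafted position; in practice this
    #    is a single batched forward pass rather than a Python loop.
    accepted = []
    for tok in drafted:
        expected = target_model(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            break
        accepted.append(tok)           # accepted: matches the target
    return prefix + accepted

print(speculative_step([5]))  # → [5, 6, 7, 0]
```

When draft and target agree, several tokens are accepted per target pass; on disagreement, one corrected token is still emitted, so the output matches what the target model alone would have produced.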
In the multi-head attention mechanism, why, after reshaping the Q/K/V projections from 3 dimensions to 4, do we need to transpose the token dimension with the head dimension?
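The shape manipulation behind this question can be shown with a minimal NumPy sketch (shapes chosen for illustration). Batched matrix multiplication contracts over the last two axes, so each head's `(tokens, head_dim)` slice must occupy those trailing positions; the transpose moves `heads` out of the way into a batch axis.

```python
import numpy as np

batch, tokens, heads, head_dim = 2, 5, 4, 8
d_model = heads * head_dim

# Projected Q with shape (batch, tokens, d_model)
q = np.random.randn(batch, tokens, d_model)

# Split d_model into (heads, head_dim): (batch, tokens, heads, head_dim)
q = q.reshape(batch, tokens, heads, head_dim)

# Swap the token and head axes: (batch, heads, tokens, head_dim).
# Without this swap, the last two axes would be (heads, head_dim), and a
# batched matmul would mix tokens across heads instead of computing
# per-head token-by-token attention scores.
q = q.transpose(0, 2, 1, 3)

# Now Q @ K^T runs independently per head: (batch, heads, tokens, tokens)
k = np.random.randn(batch, heads, tokens, head_dim)
scores = q @ k.transpose(0, 1, 3, 2)
print(scores.shape)  # (2, 4, 5, 5)
```

In short: `reshape` only splits the feature axis; the `transpose` is what turns `heads` into a batch dimension so every head attends over all tokens in parallel.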