Charles Xu
Essays, books, wiki on technologies, career, markets, and more.
Archive of posts with category 'llm'
Many inference speedup techniques mirror classic systems techniques, such as caching, paging, tiling, pipelining, and speculative execution (e.g., branch prediction and cache prefetching). Speculative decoding, generalizing speculative execution to stochastic...
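The core loop of speculative decoding can be sketched in a few lines. This is a toy greedy variant under stated assumptions: `draft_model` and `target_model` are hypothetical stand-in functions (real systems sample from token distributions and verify the draft block in one batched target pass), but the accept-longest-agreeing-prefix structure is the same.

```python
def draft_model(prefix):
    # Hypothetical cheap model: always predicts (last token + 1) mod 10.
    return (prefix[-1] + 1) % 10

def target_model(prefix):
    # Hypothetical strong model: same rule, except it emits 0 after a 7.
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10

def speculative_step(prefix, k=4):
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_model(proposal))
    drafted = proposal[len(prefix):]
    # 2) The target model verifies each drafted position; in practice this
    #    is a single batched forward pass rather than a Python loop.
    accepted = []
    for tok in drafted:
        expected = target_model(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            break
        accepted.append(tok)           # accepted: matches the target
    return prefix + accepted

print(speculative_step([5]))  # → [5, 6, 7, 0]
```

When draft and target agree, several tokens are accepted per target pass; on disagreement, one corrected token is still emitted, so the output matches what the target model alone would have produced.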
In the multi-head attention mechanism, why, after reshaping the Q/K/V projections from 3 dimensions to 4, do we need to transpose the token dimension with the head dimension?
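The shape manipulation behind this question can be shown with a minimal NumPy sketch (shapes chosen for illustration). Batched matrix multiplication contracts over the last two axes, so each head's `(tokens, head_dim)` slice must occupy those trailing positions; the transpose moves `heads` out of the way into a batch axis.

```python
import numpy as np

batch, tokens, heads, head_dim = 2, 5, 4, 8
d_model = heads * head_dim

# Projected Q with shape (batch, tokens, d_model)
q = np.random.randn(batch, tokens, d_model)

# Split d_model into (heads, head_dim): (batch, tokens, heads, head_dim)
q = q.reshape(batch, tokens, heads, head_dim)

# Swap the token and head axes: (batch, heads, tokens, head_dim).
# Without this swap, the last two axes would be (heads, head_dim), and a
# batched matmul would mix tokens across heads instead of computing
# per-head token-by-token attention scores.
q = q.transpose(0, 2, 1, 3)

# Now Q @ K^T runs independently per head: (batch, heads, tokens, tokens)
k = np.random.randn(batch, heads, tokens, head_dim)
scores = q @ k.transpose(0, 1, 3, 2)
print(scores.shape)  # (2, 4, 5, 5)
```

In short: `reshape` only splits the feature axis; the `transpose` is what turns `heads` into a batch dimension so every head attends over all tokens in parallel.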