TinyLama

Minimal LLM Inference Engine from Scratch

TinyLama is a ~3,000-line C++20 LLM inference engine with zero external dependencies, built for learning how large language models work at the lowest level.

Features

GGUF parser with mmap and quantization support (Q4_0, Q6_K, F16, F32)
BPE tokenizer extracted directly from model vocabulary
Vulkan compute backend — 16 Slang shaders for GPU inference
CUDA compute backend — 16 hand-written PTX kernels
Dual-GPU inference — split model across Vulkan and CUDA via --split flag
Full LLaMA architecture — attention, RoPE, GQA, SwiGLU FFN

Why Build This?

Instead of using existing frameworks, TinyLama implements every component from scratch: tokenization, model loading, GPU compute kernels, and memory management. Building each piece by hand gives a deep understanding of:

How quantized weights are stored and dequantized on GPU
How attention mechanisms map to GPU compute shaders
How to schedule work across multiple heterogeneous GPUs
The full pipeline from text input to token generation
