Projects

TinyLama

Minimal LLM Inference Engine from Scratch

TinyLama is a ~3,000-line C++20 LLM inference engine with zero external dependencies, built to show how large language models work at the lowest level.

Features

  • GGUF parser with mmap and quantization support (Q4_0, Q6_K, F16, F32)
  • BPE tokenizer extracted directly from model vocabulary
  • Vulkan compute backend — 16 Slang shaders for GPU inference
  • CUDA compute backend — 16 hand-written PTX kernels
  • Dual-GPU inference — split model across Vulkan and CUDA via --split flag
  • Full LLaMA architecture — attention, RoPE, GQA, SwiGLU FFN

Why Build This?

Instead of using existing frameworks, TinyLama implements every component from scratch — tokenization, model loading, GPU compute kernels, and memory management. This provides deep understanding of:

  • How quantized weights are stored and dequantized on GPU
  • How attention mechanisms map to GPU compute shaders
  • How to schedule work across multiple heterogeneous GPUs
  • The full pipeline from text input to token generation