Complete GPT-2 transformer inference engine from scratch in CUDA with custom tiled GEMM, fused attention, LayerNorm, KV cache, and autoregressive generation — ~190 tok/s on RTX 3050, profiled with Nsight Compute.
EAGLE-style speculative decoding engine in Triton for Qwen3-4B→Qwen3-32B on AMD MI300X (ROCm 7.2). Custom prefill, GQA decode, and EAGLE draft attention kernels — 2.5× wall-clock speedup.
An adaptive Vision Transformer inference system that escalates to high-resolution computation only when needed, achieving ~3× faster inference than a static high-resolution ViT.
Built a no-code, web-based machine learning trainer that lets users upload CSVs, choose algorithms and visualizations, and train models end to end.
A lightweight vector database, retrieval engine, and custom indexer, all built completely from scratch.
Custom RAG system that uses local document embeddings and generative AI to provide accurate, context-aware answers from private knowledge.
Implemented an autograd engine and a fully connected neural network from scratch in pure Python.
C++ port of Andrej Karpathy's micrograd, which was originally written in Python.