Built a no-code, web-based machine learning trainer that lets users upload CSVs, select an algorithm, visualize the data, and train models end-to-end.
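The upload-select-train flow can be sketched in pure Python. This is a hypothetical illustration of the core loop, not the app's actual code: the algorithm names ("mean", "linreg") and the registry layout are assumptions.

```python
import csv
import io
import statistics

def parse_csv(text):
    """Parse uploaded CSV text into a header and float rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    data = [[float(v) for v in r] for r in rows[1:]]
    return header, data

def train_mean(xs, ys):
    """Baseline: always predict the mean of the targets."""
    mu = statistics.fmean(ys)
    return lambda x: mu

def train_linreg(xs, ys, lr=0.05, steps=2000):
    """1-D linear regression fit by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * dw, b - lr * db
    return lambda x: w * x + b

# Registry mapping user-selected algorithm names to trainers.
ALGORITHMS = {"mean": train_mean, "linreg": train_linreg}

def train(csv_text, algorithm):
    """End-to-end: parse the upload, dispatch to the chosen trainer."""
    header, data = parse_csv(csv_text)
    xs, ys = [r[0] for r in data], [r[1] for r in data]
    return ALGORITHMS[algorithm](xs, ys)
```

The returned model is a plain callable, which keeps the trainer UI decoupled from any particular algorithm.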
An adaptive Vision Transformer inference system that avoids unnecessary high-resolution computation, achieving ~3× faster inference than a static high-resolution ViT by escalating to the expensive high-resolution pass only when needed.
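The escalation policy can be sketched with a confidence threshold: run the cheap low-resolution pass first and only fall back to the full-resolution pass when the prediction is uncertain. The two "passes" and the 0.8 threshold below are toy stand-ins, not the system's real models.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-ins: the real system runs a low-resolution ViT first and the
# full-resolution ViT on escalation. Here "class 0 vs class 1" is just
# whether the first or second half of the input carries more signal.
def low_res_logits(x):
    half = len(x) // 2
    return np.array([x[:half].mean(), x[half:].mean()])  # coarse evidence

def high_res_logits(x):
    half = len(x) // 2
    return np.array([x[:half].sum(), x[half:].sum()])    # full evidence

def predict(x, threshold=0.8):
    p = softmax(low_res_logits(x))
    if p.max() >= threshold:              # confident: early exit
        return int(p.argmax()), "low-res"
    p = softmax(high_res_logits(x))       # uncertain: escalate
    return int(p.argmax()), "high-res"
```

The speedup comes from the fraction of inputs that exit at the cheap pass; the threshold trades accuracy against latency.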
Fused ReLU + LayerNorm into a single CUDA RawKernel that is 5.8× faster than running them separately: a CUDA Python experiment demonstrating kernel fusion by combining both operations into one GPU pass and comparing it against the unfused multi-kernel pipeline.
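A CPU analogy of the fusion (a sketch, not the CUDA RawKernel itself): the unfused pipeline materializes the ReLU result and reads it back for LayerNorm, while the fused version does both in one pass per row, which is what eliminating the intermediate global-memory round trip buys on a GPU.

```python
import numpy as np

def unfused(x, eps=1e-5):
    """Two 'kernels': ReLU writes an intermediate, LayerNorm reads it."""
    y = np.maximum(x, 0.0)                   # kernel 1: ReLU
    mu = y.mean(axis=-1, keepdims=True)      # kernel 2: LayerNorm
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + eps)

def fused(x, eps=1e-5):
    """One pass per row: ReLU stays 'in registers', no intermediate array."""
    out = np.empty_like(x)
    for i, row in enumerate(x):
        r = np.maximum(row, 0.0)
        mu, var = r.mean(), r.var()
        out[i] = (r - mu) / np.sqrt(var + eps)
    return out
```

On a GPU the two versions differ in memory traffic, not in the result, so correctness can be checked by comparing outputs elementwise.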
Flash Attention from scratch: a tiled CUDA forward kernel, online softmax with a running max and correction factor, the recomputation trick in the backward pass, and O(N) memory; the full forward and backward passes are verified against PyTorch autograd to 1e-6.
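The online softmax is the heart of the tiling. A minimal NumPy sketch of the same recurrence: process scores tile by tile, keep a running max `m` and running denominator `l`, and rescale everything accumulated so far by `exp(m_old - m_new)` whenever the max grows. Here the softmax-weighted sum of a value matrix stands in for the attention output `P @ V`.

```python
import numpy as np

def online_weighted_sum(scores, values, tile=4):
    """Compute softmax(scores) @ values tile-by-tile, never
    materializing the full probability vector."""
    m, l = float("-inf"), 0.0
    acc = np.zeros(values.shape[1])
    for i in range(0, len(scores), tile):
        s, v = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, float(s.max()))
        corr = np.exp(m - m_new)        # correction factor for prior tiles
        p = np.exp(s - m_new)           # tile probabilities (unnormalized)
        l = l * corr + p.sum()          # running denominator
        acc = acc * corr + p @ v        # running numerator
        m = m_new
    return acc / l
```

Because each tile only rescales the running accumulators, memory stays proportional to the tile size regardless of sequence length, which is the O(N) (vs O(N²)) memory claim above.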
A lightweight vector database, retrieval engine, and custom indexer, built entirely from scratch.
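The core of such an engine is a flat cosine-similarity index. This is a minimal sketch of the idea, not the project's actual indexer (which presumably adds persistence and a smarter index structure): normalize vectors at insert time so search reduces to a single matrix-vector product.

```python
import numpy as np

class VectorIndex:
    """Minimal flat index: cosine similarity over unit-normalized vectors."""

    def __init__(self, dim):
        self.dim = dim
        self.vecs = np.empty((0, dim))
        self.ids = []

    def add(self, key, vec):
        v = np.asarray(vec, dtype=float)
        v = v / np.linalg.norm(v)            # normalize once at insert time
        self.vecs = np.vstack([self.vecs, v])
        self.ids.append(key)

    def search(self, query, k=3):
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        sims = self.vecs @ q                 # cosine = dot of unit vectors
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in top]
```

A flat scan is exact and simple; approximate structures (e.g. IVF or HNSW) trade that exactness for sublinear search time.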
A custom RAG system that combines local document embeddings with a generative model to provide accurate, context-aware answers from a private knowledge base.
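The retrieve-then-generate flow can be sketched end to end. The bag-of-words "embedding" and the prompt template below are toy stand-ins for the system's learned embeddings and LLM call; only the control flow is representative.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, docs, k=2):
    """Rank documents by similarity to the question; keep the top k."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, docs):
    """Assemble retrieved context into a grounded prompt for the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
```

Constraining the model to answer from retrieved context is what keeps answers grounded in the private knowledge base rather than the model's parametric memory.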
Implemented an autograd engine and a fully connected neural network from scratch in pure Python.
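The engine's core idea can be sketched in a few dozen lines (micrograd-style; this is an illustrative minimum, not the project's full implementation): each `Value` records its parents and a local backward rule, and `backward()` applies the chain rule in reverse topological order.

```python
class Value:
    """Scalar node in a dynamically built computation graph."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # local chain-rule step

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():                # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():                # product rule
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        """Chain rule over the graph in reverse topological order."""
        order, seen = [], set()
        def topo(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()
```

A fully connected network is then just neurons built from these `Value` operations, trained by calling `backward()` on the loss and stepping each parameter against its `grad`.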
A C++ port of Andrej Karpathy's micrograd (originally implemented in Python).