Projects
A collection of things I've built, researched, and tinkered with over the years.
Some project links are unavailable because the work was done as part of graded coursework or unpublished research. The academic integrity policies of certain courses prohibit publicly sharing code or reports. Please contact me privately for more information about these projects.
GPU Kernel Programming with Triton
- Implemented Triton kernels covering vector addition, matrix multiplication, convolution, attention, normalization, and fused optimizer paths as a comprehensive suite of ML primitives.
- Applied kernel optimizations such as autotuning, pipelining, and alignment hints to outperform the baseline reference implementations.
Accelerating CNN Inference via AVX Intrinsics
- Implemented and accelerated core CNN layers (Conv2D, Fully Connected, ReLU, MaxPool) in C++ using AVX intrinsics.
- Coded multiple advanced convolution strategies, including Input/Weight/Output-Stationary dataflows and 2D Tiling, to optimize performance on an AlexNet model.
High-Performance Algorithms on Shared & Distributed Memory
- Engineered high-throughput parallel solutions for data compression (Huffman Coding) and vector processing (Blelloch and Hillis–Steele Prefix Sum), utilizing OpenMP and Pthreads for shared-memory threading and MPI for distributed message passing.
- Achieved lock-free concurrent writes via an offset-precomputation strategy and header-based partitioning, while benchmarking scalability and communication overheads to evaluate trade-offs between work-efficiency and step-complexity.
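As a rough illustration of the two scans named above, here is a minimal sequential Python sketch. It is not the project's OpenMP/Pthreads/MPI code, and the function names and sample data are hypothetical. Each inner `for` loop stands in for one parallel step. Hillis–Steele finishes in fewer steps but does more additions, while Blelloch does O(n) additions across an up-sweep and a down-sweep. The exclusive scan is also one way to precompute write offsets: scanning per-chunk output lengths gives every writer a disjoint start position, so no locks are needed.

```python
def hillis_steele_scan(xs):
    """Inclusive scan: O(log n) steps, O(n log n) additions."""
    out = list(xs)
    step = 1
    while step < len(out):
        prev = out[:]  # snapshot models one synchronous parallel step
        for i in range(step, len(out)):
            out[i] = prev[i] + prev[i - step]
        step *= 2
    return out


def blelloch_scan(xs):
    """Exclusive scan: up-sweep then down-sweep, O(n) additions."""
    n = len(xs)
    size = 1
    while size < n:
        size *= 2
    tree = list(xs) + [0] * (size - n)  # pad to a power of two

    # Up-sweep (reduce): build partial sums up a balanced tree.
    d = 1
    while d < size:
        for i in range(2 * d - 1, size, 2 * d):
            tree[i] += tree[i - d]
        d *= 2

    # Down-sweep: push prefix sums back down the tree.
    tree[size - 1] = 0
    d = size // 2
    while d >= 1:
        for i in range(2 * d - 1, size, 2 * d):
            left = tree[i - d]
            tree[i - d] = tree[i]
            tree[i] += left
        d //= 2
    return tree[:n]


# Hypothetical per-chunk encoded lengths; the exclusive scan yields
# the offset at which each chunk can be written independently.
chunk_lengths = [3, 1, 7, 0, 4, 1, 6, 3]
write_offsets = blelloch_scan(chunk_lengths)
```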
MPI Collectives over 2-D Mesh Topology
- Implemented broadcast, reduce, and allreduce collectives on an m × m mesh using only MPI_Send/MPI_Recv, enforcing strict neighbor-only communication and minimizing per-link load via planar row/column decomposition.
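One way to picture a row/column decomposition for broadcast is the Python sketch below, which only builds the list of neighbor-to-neighbor sends and makes no MPI calls. The function name and the exact ordering are assumptions for illustration, not the project's schedule. The data first travels along the root's row, then every node in that row forwards it along its own column, so each link carries at most one message.

```python
def mesh_broadcast_schedule(m, root=(0, 0)):
    """Return (src, dst) sends for a broadcast on an m x m mesh.

    Nodes are (row, col) pairs; every send is between adjacent nodes.
    """
    r0, c0 = root
    sends = []
    # Phase 1: spread along the root's row, outward in both directions.
    for c in range(c0 + 1, m):
        sends.append(((r0, c - 1), (r0, c)))
    for c in range(c0 - 1, -1, -1):
        sends.append(((r0, c + 1), (r0, c)))
    # Phase 2: each node in that row forwards along its own column.
    for c in range(m):
        for r in range(r0 + 1, m):
            sends.append(((r - 1, c), (r, c)))
        for r in range(r0 - 1, -1, -1):
            sends.append(((r + 1, c), (r, c)))
    return sends
```

In a real run the column phases proceed in parallel, so a node at (r, c) receives the data after |r - r0| + |c - c0| hops.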
Lightweight Hypervisor with KVM API
- Built a C-based hypervisor using the Linux KVM API to initialize and manage VCPU state and memory maps, successfully booting guest VMs into protected mode.
- Orchestrated a multi-VM producer-consumer model by implementing hypercalls (via KVM_EXIT_IO) to act as a broker, trapping VM exits to copy a shared data buffer between guests.
Concurrent Min/Max Heap as Loadable Kernel Module
- Implemented a loadable kernel module (LKM) in C to provide a per-process, concurrency-safe min/max heap, managing kernel memory and state for multiple processes.
- Exposed kernel-space heap operations (init, insert, extract-top) to userspace via the /proc filesystem using both standard read/write and ioctl system calls.
Extended CLI Shell with New Fan-Out Operators
- Built a C-based CLI shell from scratch using syscalls with robust I/O redirection (<, >, >>), built-ins (cd, exit, logout, type, history), and background job control (&) with process status reporting and child cleanup.
- Extended POSIX pipelining with new fan-out operators (||, |||) to broadcast output to multiple downstream processes simultaneously.
Kernel-Level Thread Scheduler and Alarm Clock in PintOS
- Reimplemented timer_sleep() in the PintOS kernel using interrupt-driven wake-ups instead of busy-waiting.
- Extended the thread scheduler to maintain an ordered sleep queue and unblock threads on timer interrupts for deterministic wake-ups.
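The sleep-queue idea can be sketched in a few lines of Python. This is a model of the logic only, not PintOS kernel code: the class and method names are hypothetical, and a binary heap stands in for the ordered list a kernel would more likely use. Sleeping threads are kept ordered by wake-up tick, so the timer interrupt handler only has to inspect the front of the queue.

```python
import heapq
import itertools


class SleepQueue:
    """Threads ordered by wake-up tick; the earliest wake-up is at the front."""

    def __init__(self):
        self._heap = []
        # Tie-breaker keeps FIFO order among threads with equal wake ticks.
        self._order = itertools.count()

    def sleep_until(self, wake_tick, thread_name):
        """Record that a thread is blocked until the given tick."""
        heapq.heappush(self._heap, (wake_tick, next(self._order), thread_name))

    def on_timer_tick(self, now):
        """Called once per timer interrupt; returns the threads to unblock."""
        woken = []
        while self._heap and self._heap[0][0] <= now:
            _, _, name = heapq.heappop(self._heap)
            woken.append(name)
        return woken
```

Because the queue is ordered, the handler stops at the first thread whose wake-up tick is still in the future, which keeps time spent in interrupt context short.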
Autoscaling Cloud Management System with libvirt API
- Built an autoscaling program using the libvirt API to manage the lifecycle of a multi-VM, CPU-intensive client-server application.
- Implemented a monitoring loop to read VM CPU utilization via virDomainPtr handles, triggering horizontal scaling (N to N+1 replicas) when load exceeded a threshold.
- Programmatically spawned new server VMs from XML templates and notified the multi-threaded client to distribute load to the new replica, mitigating the overload.
Container Runtime with Linux Namespaces & Cgroups
- Built a container runtime in C, invoking the clone and setns system calls to create isolated PID, NET, MNT, and UTS namespaces.
- Mounted a new rootfs and /proc for process isolation, configured veth (virtual ethernet) pairs for host networking, and enforced memory limits using the cgroups API.
Memory-Efficient Compiler for a Scoped, Statically-Typed Language
- Implemented a full compiler pipeline (recursive-descent parser, AST, hash-based symbol table with scoped entries, semantic checks, and NASM code generation) for a statically-typed language designed by faculty, called ERPLAG.
- Designed lightweight activation records and register allocation, lowering memory usage by 35% compared to naive stack allocation, while supporting recursion and dynamic arrays.
32-bit MIPS Processor Design & Custom Pipeline Architecture
- Implemented a 32-bit single-cycle MIPS processor in Verilog, modularly designing the Control Unit, Instruction and Data Memory, ALU, and Register File, and integrating them into a complete R-format execution datapath.
- Built a 3-stage instruction pipeline (Fetch/Encode, Execute, Generate Parity) supporting 8 custom ALU operations.
x86 Real-Mode Split-Screen Editor & System Utilities
- Developed a dual-viewport text editor in x86 assembly using BIOS video interrupts (INT 10h), and implemented a low-level file management module using DOS file handles (INT 21h) for sequential and random-access (LSEEK) file operations.
Low-Level Profiling and Benchmarking of the Lua Garbage Collector
- Profiled the C implementation of Lua's garbage collector using Valgrind (Callgrind) to identify performance hotspots under various configurations (stop-the-world, generational, incremental).
- Analyzed source code and call graphs to benchmark the performance trade-offs of different memory management strategies under varying workloads.
High-Performance Key-Value Database with Kernel Bypass
- Built a high-performance key-value database engine in C, with RB-Tree memtables, WAL durability, Bloom-filter SSTables, and leveled compaction. Implemented storage paths, on-disk formats, and crash-recovery logic.
- Integrated the engine with a networking stack using Linux network namespaces and DPDK for kernel bypass, and benchmarked latency gains against the standard kernel stack.
File System on an In-Memory Disk Emulator
- Built an in-memory disk emulator in C to provide a persistent block-level device interface with a fixed 4KB block size.
- Designed and implemented a file system on the emulated disk, managing all core metadata including the superblock, inodes, data bitmaps, and indirect blocks.
Multi-Client TCP Server with Custom ARQ Protocol
- Built a multi-client TCP server in C with a custom packet format (seq num, type flags) and stop-and-wait ARQ to handle 10% simulated packet loss with a 2s retransmission timeout, guaranteeing reliable, in-order delivery.
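The stop-and-wait behaviour can be modelled with a small Python simulation. This is a sketch under simplifying assumptions, not the C server: there are no sockets, a retransmission timeout is represented by simply looping again, and the function name, the alternating one-bit sequence number, and the `max_tries` cap are illustrative choices. Both data packets and ACKs are dropped with the same probability.

```python
import random


def stop_and_wait(messages, loss_rate=0.10, seed=0, max_tries=50):
    """Simulate stop-and-wait ARQ; returns (delivered, transmissions)."""
    rng = random.Random(seed)
    delivered = []
    expected_seq = 0  # receiver state: next sequence number it will accept
    transmissions = 0
    for index, payload in enumerate(messages):
        seq = index % 2  # alternating-bit sequence number
        for _ in range(max_tries):
            transmissions += 1
            if rng.random() < loss_rate:
                continue  # data packet lost: sender times out and retransmits
            # Receiver accepts only the expected sequence number.
            # Duplicates are re-ACKed but not delivered again.
            if seq == expected_seq:
                delivered.append(payload)
                expected_seq ^= 1
            if rng.random() < loss_rate:
                continue  # ACK lost: sender times out and retransmits
            break  # ACK received: move on to the next message
        else:
            raise RuntimeError("gave up after max_tries retransmissions")
    return delivered, transmissions


messages = [f"msg{i}" for i in range(200)]
delivered, transmissions = stop_and_wait(messages)
```

The duplicate check on the receiver side is what keeps delivery exactly-once and in order when an ACK, rather than the data packet, is the one that gets lost.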
Network Message Bus: Distributed Message Queue with UDP
- Built a distributed message queue in C providing System V-style APIs over UDP multicast for inter-host communication.
- Implemented per-host servers and error-capture processes to demultiplex multicasts and propagate ICMP failures (e.g., Host Unreachable) across nodes for end-to-end reliability.
Customizable Load Balancer with Dynamic Replica Management
- Designed a load balancer to route asynchronous client requests, using consistent hashing to efficiently distribute load and minimize remapping on server failure.
- Implemented dynamic replica management by interfacing with the Docker daemon to programmatically spawn and terminate server containers in response to requests.
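For readers unfamiliar with consistent hashing, the Python sketch below shows the property the load balancer relies on. It is an illustration rather than the project's code: the `HashRing` class, the virtual-node count, and the SHA-256 hash are all assumptions. Each server owns many points on a ring, a key is routed to the first point at or after its hash, and removing a server remaps only the keys that server owned.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, servers=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) points
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, server):
        # Several virtual nodes per server smooth out the load distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("no servers on the ring")
        # First ring point whose hash is >= the key's hash, wrapping around.
        i = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[i % len(self._ring)][1]


ring = HashRing(["s1", "s2", "s3"])
keys = [f"request-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("s2")  # simulate a server failure
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Here `moved` contains only keys that were previously served by `s2`; every other key keeps its server.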
Network Packet Sniffer
- Developed a packet-capture utility using raw sockets to inspect Ethernet/IP/TCP headers for educational analysis.
- Reconstructed simple TCP flows and exported pcap-compatible output for downstream tooling.
Anytime Clustering for Streaming Data
- Implemented an anytime hierarchical k-medoids pipeline for streaming data experiments.
- Employed micro-cluster sketches and asynchronous insertion logic to maintain low-latency updates under arrival bursts.
Affine Short-Rate Models for Swap Valuation
- Studied Hull–White and CIR affine term-structure models for pricing interest-rate derivatives.
- Implemented semi-analytical calibration routines and finite-difference checks for PDE-based pricing validation.
Monte Carlo Methods for Option Pricing
- Built path-dependent Monte Carlo estimators for derivative pricing under stochastic volatility models.
- Used antithetic variates and control variates to reduce estimator variance and implemented parallel sampling.
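To show what antithetic variates look like in code, here is a deliberately simplified Python sketch. It prices a plain European call under constant-volatility geometric Brownian motion, which is an assumption made so the result can be checked against the Black-Scholes closed form; the project itself targeted path-dependent payoffs and stochastic volatility. Function names and parameter values are illustrative.

```python
import math
import random


def european_call_mc(s0, strike, rate, sigma, maturity, n_pairs, seed=0):
    """Monte Carlo call price using antithetic pairs (z, -z)."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * sigma ** 2) * maturity
    vol = sigma * math.sqrt(maturity)
    total = 0.0
    for _ in range(n_pairs):
        z = rng.gauss(0.0, 1.0)
        payoff_plus = max(s0 * math.exp(drift + vol * z) - strike, 0.0)
        payoff_minus = max(s0 * math.exp(drift - vol * z) - strike, 0.0)
        # Averaging the negatively correlated pair reduces variance.
        total += 0.5 * (payoff_plus + payoff_minus)
    return math.exp(-rate * maturity) * total / n_pairs


def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form reference price used only to sanity-check the estimator."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t

    def norm_cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


mc_price = european_call_mc(100.0, 100.0, 0.05, 0.2, 1.0, n_pairs=20000)
bs_price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
```

With these inputs the two prices should agree closely; the same pairing trick carries over to full simulated paths by negating every normal draw along the path.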
Clustering Stability in Noisy Streams
- Investigated medoid-based clustering stability under nonstationary stream arrivals and noise.
- Performed controlled experiments measuring purity and silhouette with randomized concept-drift injections.
Randomized Algorithms for Data Summarization
- Explored sketching methods for approximate frequency and quantile estimation on large streams.
- Implemented Count-Min and t-digest sketches and compared memory/accuracy trade-offs.
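A Count-Min sketch is compact enough to show in full. The Python version below is a minimal sketch with assumed defaults: the `width` and `depth` values and the SHA-256-based row hashing are illustrative choices, not the project's settings. Each item increments one counter per row, and the estimate is the minimum across rows, so it can overcount but never undercount.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Prefixing the row number gives a different hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best bound.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
stream = ["a"] * 50 + ["b"] * 20 + ["c"] * 5
for token in stream:
    sketch.add(token)
```

Memory is fixed at `width * depth` counters regardless of stream length; widening the table lowers the expected overcount, and adding rows lowers the chance of a bad estimate.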