Rebuilding micrograd in Rust - Introduction
Introduction
Karpathy’s micrograd teaches backpropagation in about 150 lines of Python. I found his lecture a strong starting point for building a foundation in LLM interpretability.
After working through the material I had the following questions:
- How is the neural net represented in memory? How are dependencies tracked during the forward pass, and how are they inverted during backpropagation?
- For simplicity, micrograd targets `batch_size=1`. What do production-grade deep learning libraries do to maximize hardware utilization?
- How do these deeply nested data structures maintain ownership of their dependents?
- How do closures store and retain gradients across iterations and epochs?
- What’s the scope of parallel weight updates? What sort of challenges exist in practice?
- How does CUDA impact training? To me, CUDA was a black box for matrix multiplication on GPUs. If that’s all it is, couldn’t we write a small kernel for weight updates?
To answer them, I had to rewrite and extend micrograd myself. I selected Rust because it forces me to be intentional about ownership at every step. Python -> CPython -> C/C++ would offer better initial velocity, which my mind would misread as proficiency.
Smart Pointers
Primer | Given the number of options, I needed a mental model mapping smart pointers to use-cases.
Shared Ownership
Sole Ownership
Micrograd nodes need shared, mutable access to weights and gradients. This narrows our choice down to `Rc<RefCell<T>>`. I will consider `Arc<Mutex<T>>` when parallel computation becomes a priority.
Traits
Primer | Rust defines shared behaviour through traits. We need to implement add, multiply, and activation functions that can be applied to a struct.
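A hedged sketch of what this looks like, assuming a hypothetical `Value` struct (not micrograd's exact API): implementing the standard `Add` trait is what makes `a + b` work on our own type, while an activation like ReLU can be an ordinary method.

```rust
use std::ops::Add;

// Hypothetical scalar value; names are illustrative, not ugrad's real API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Value {
    data: f64,
}

// Implementing std's `Add` trait enables the `+` operator on our struct.
impl Add for Value {
    type Output = Value;
    fn add(self, rhs: Value) -> Value {
        Value { data: self.data + rhs.data }
    }
}

// A ReLU activation as an inherent method on the same struct.
impl Value {
    fn relu(self) -> Value {
        Value { data: self.data.max(0.0) }
    }
}

fn main() {
    let a = Value { data: -1.5 };
    let b = Value { data: 2.0 };
    println!("{:?}", (a + b).relu()); // Value { data: 0.5 }
}
```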
Back Function
Each operation needs a closure that captures its inputs during the forward pass and applies gradient updates during backprop. These closures are:
- Dynamic functions assigned at runtime, so they are stored on the heap.
- Able to mutate the values they capture.
We need the `Box` smart pointer here because:
- In a single-threaded autograd, a back function is owned by exactly one node, so shared ownership is unnecessary.
- The concrete type is not known at compile time; we only know the function implements the trait `FnMut`, i.e. a function that can mutate its captured state.
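These two constraints can be sketched like this, with a hypothetical `Op` struct and `mul_op` constructor (the names are mine, not ugrad's): the backward closure lives behind `Box<dyn FnMut()>` because its concrete type is unknowable at compile time, and `FnMut` lets it mutate the shared gradient it captured.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical op node owning its backward closure on the heap.
struct Op {
    // `dyn FnMut()` has no compile-time size -> Box; mutates captures -> FnMut.
    backward: Box<dyn FnMut()>,
}

fn mul_op(a_grad: Rc<RefCell<f64>>, b: f64) -> Op {
    // The closure moves in `a_grad` (a shared gradient cell) and `b`.
    let backward = Box::new(move || {
        // For out = a * b, d(out)/d(a) = b; accumulate into a's gradient.
        *a_grad.borrow_mut() += b;
    });
    Op { backward }
}

fn main() {
    let grad_a = Rc::new(RefCell::new(0.0));
    let mut op = mul_op(Rc::clone(&grad_a), 3.0);
    (op.backward)(); // one backward pass
    (op.backward)(); // the closure persists, so gradients accumulate
    println!("{}", grad_a.borrow()); // 6
}
```

This also answers how closures retain gradients across iterations: the boxed closure owns an `Rc` handle, so the gradient cell outlives any single call.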
Kahn’s Topological Sort
Micrograd implements its topological sort with recursive DFS. Recursive algorithms risk stack overflow on deep graphs. Kahn’s algorithm is the standard way to flatten a DAG in dependency order iteratively.
- Calculate the indegree of each node in the graph, `HashMap<Node, Indegree>`, where a node’s indegree is its number of incoming edges (here, its number of parents).
- Seed a queue with the nodes where `indegree == 0`.
- Create a `result: Vec<T>` to store the topological order.
- While the queue has nodes:
  - Pop a node from the front and push it into `result`.
  - For each child of the node:
    - Decrement the child’s indegree.
    - If its indegree reaches 0, push the child onto the queue.
- Return `result`.
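The steps above can be sketched as follows, under simplifying assumptions (nodes are plain `usize` ids and the graph is an adjacency map from node to children, rather than real autograd nodes):

```rust
use std::collections::{HashMap, VecDeque};

// Kahn's topological sort over a DAG given as node -> children adjacency.
fn kahn_topo_sort(graph: &HashMap<usize, Vec<usize>>) -> Vec<usize> {
    // 1. Count indegrees: each appearance as a child adds one incoming edge.
    let mut indegree: HashMap<usize, usize> = HashMap::new();
    for (&node, children) in graph {
        indegree.entry(node).or_insert(0);
        for &child in children {
            *indegree.entry(child).or_insert(0) += 1;
        }
    }
    // 2. Seed the queue with indegree == 0 nodes.
    let mut queue: VecDeque<usize> = indegree
        .iter()
        .filter(|&(_, &d)| d == 0)
        .map(|(&n, _)| n)
        .collect();
    // 3. Repeatedly pop, record, and decrement children's indegrees.
    let mut result = Vec::new();
    while let Some(node) = queue.pop_front() {
        result.push(node);
        if let Some(children) = graph.get(&node) {
            for &child in children {
                let d = indegree.get_mut(&child).unwrap();
                *d -= 1;
                if *d == 0 {
                    queue.push_back(child);
                }
            }
        }
    }
    result
}

fn main() {
    // Diamond DAG: 0 -> {1, 2}, 1 -> 3, 2 -> 3.
    let mut g = HashMap::new();
    g.insert(0, vec![1, 2]);
    g.insert(1, vec![3]);
    g.insert(2, vec![3]);
    let order = kahn_topo_sort(&g);
    assert_eq!(order[0], 0); // source first
    assert_eq!(*order.last().unwrap(), 3); // sink last
}
```

Note the only heap-deep loop is over an explicit `VecDeque`, so graph depth never grows the call stack.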
With these components we can simulate a neural network with a few layers.
Next: Building neural network primitives for ugrad.