I recently decided to make a serious effort in learning more about AI and machine learning, and one topic that came up recently was backpropagation.
I'd watched 3Blue1Brown's videos on how basic neural networks worked a long time ago, and had a reasonable understanding of the overall concept and the mathematics involved, but I find that it's difficult for me to internalize something until I've applied it in a project, made mistakes, and learned how to fix those mistakes along the way.
As part of internalizing how backpropagation and gradient descent work, I implemented Andrej Karpathy's micrograd in Go.
But why?
One reason I did this is that the original video on building micrograd in Python was incredibly clear and approachable, which made understanding the codebase and the concepts involved quite simple.
The more important reason is that I do not know how to program in Go: rewriting the Python codebase I'd already built while following Karpathy's video would not only let me get cozy in a slick new language but also force me to truly understand what I was implementing, rather than fool myself into thinking I understood it.
Bugs
There were some fun bugs along the way that I kept track of in a debug.txt file, which I thought I'd share.
1. The predictions are all the same
After I'd finished writing all the core methods and implemented the training loop in main, I found that the predictions produced by the forward pass were the same for every item in the training set (ypred[0] == ypred[1] == ...).
It wasn't obvious to me why this was the case, and this proved tricky to debug, so I learned how to use the Delve debugger for Go to step through my code and see where I went wrong.
Eventually, I figured out that the activation values for every neuron (the output of the neuron struct's Call method) were the same no matter what training data I fed into the MLP. But why?
Then I realized: duh, I'd forgotten to set the act variable to act.Add(n.Weights[i].Mul(x[i])), so the value of each activation was just the neuron's bias, set randomly when the MLP was first created!
I didn't expect the program to compile when I wasn't doing anything with the output of Add, so this was a funny bug to find.
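The shape of that bug can be shown in a minimal sketch (the Value struct and Add method here are simplified stand-ins, not my actual implementation): because Add returns a new node rather than mutating its receiver, Go happily compiles a call whose result is thrown away.

```go
package main

import "fmt"

// Value is a toy stand-in for an autograd node; the real struct
// would also track children and gradients.
type Value struct {
	Data float64
}

// Add returns a new Value instead of mutating the receiver, so
// discarding its result compiles fine but does nothing.
func (v *Value) Add(other *Value) *Value {
	return &Value{Data: v.Data + other.Data}
}

func main() {
	act := &Value{Data: 1.0} // pretend this is the neuron's bias
	x := &Value{Data: 3.0}   // one weighted input

	// The bug: the sum is computed and thrown away.
	act.Add(x)
	fmt.Println(act.Data) // 1: act is unchanged

	// The fix: reassign act to the new node.
	act = act.Add(x)
	fmt.Println(act.Data) // 4
}
```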
2. Incorrect graphs
Another bug I found was that loss.Backward() produced a much smaller graph than expected, compared to the Python version.
To figure out where I went wrong, I used both a Python and a Go debugger to step through the two codebases side by side on ypred[0].Backward() (so I could dissect a much smaller graph than loss.Backward()'s) to see where they diverged and what the data structures looked like.
I noticed that my Go implementation had only a ninth of the nodes my Python implementation had (the graph contained the Value operations for just one neuron, the output layer's), and only one tanh node. I must not have been implementing my layers properly!
Sure enough, I'd subtly introduced a bug in the layer's Call method: I wasn't passing the outputs of the previous layer into the next layer. After I made sure to do this, the graphs were much closer in length (with small differences because I'd implemented some of the Value operations slightly differently).
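A toy sketch of the fix (with illustrative types, not the real micrograd structs): the MLP's Call has to reassign its working input to each layer's output, so the next layer consumes that output rather than the original x.

```go
package main

import "fmt"

// Layer is a toy stand-in that just doubles each input; a real
// layer would hold neurons with weights and biases.
type Layer struct{}

func (l Layer) Call(x []float64) []float64 {
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = 2 * v
	}
	return out
}

type MLP struct {
	Layers []Layer
}

// Call threads each layer's output into the next layer's input.
// The bug was feeding the original x into every layer, so only
// one layer's work ended up in the graph.
func (m MLP) Call(x []float64) []float64 {
	for _, layer := range m.Layers {
		x = layer.Call(x) // reassign: the next layer sees this output
	}
	return x
}

func main() {
	m := MLP{Layers: []Layer{{}, {}, {}}}
	fmt.Println(m.Call([]float64{1.0})) // [8]: doubled three times
}
```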
3. Gradient descent doesn't work
The last, and probably most time-consuming, bug to solve was that backpropagation worked correctly in the Backward() method but wasn't reflected back in the training loop through n.Parameters(), so all the gradients were still set to zero and gradient descent didn't do anything.
I figured I had messed up handling the pointers somewhere, so as a quick, hacky fix to get things working, I matched the gradients calculated in Backward() with the parameters in the MLP and set them individually in a horribly slow loop involving some silly UUID matching.
This didn't work either, and the loss went up to 8.0 each time, which indicated to me that my gradients were being calculated incorrectly somewhere.
It was getting past midnight as I was working on this, so I decided to tackle it the next day, and I wound down by reading up on Go pointers to see where I'd messed up.
Eventually I came across this post by Dave Cheney, which made me realize that I'd declared my Value operation methods on Value instead of on its pointer. And since I was mutating that Value, taking &Value inside the method just got the address of the method's copy rather than of the original Value. Fixing this fixed gradient descent.
In hindsight, this was a silly bug. But it was very satisfying to find!
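The receiver distinction boils down to this minimal, illustrative example (the field and method names are stand-ins, not mine): a method declared on Value mutates a copy of the struct, while one declared on *Value mutates the original.

```go
package main

import "fmt"

type Value struct {
	Data float64
	Grad float64
}

// Value receiver: v is a copy, so the mutation is lost when
// the method returns.
func (v Value) AddGradByValue(g float64) {
	v.Grad += g
}

// Pointer receiver: v points at the caller's struct, so the
// mutation sticks.
func (v *Value) AddGradByPointer(g float64) {
	v.Grad += g
}

func main() {
	a := Value{Data: 2.0}

	a.AddGradByValue(1.0)
	fmt.Println(a.Grad) // 0: only the copy was mutated

	a.AddGradByPointer(1.0)
	fmt.Println(a.Grad) // 1: the original was mutated
}
```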
What I learned
Overall, I'd say implementing micrograd in Go went quite well, noob bugs aside: I now have both a significantly better understanding of how small multilayer perceptrons work and a feel for how to do things in Go.
I think there are still a few minor bugs to fix (I believe a.Add(a) results in two nodes instead of one), and a few extra features from the final micrograd implementation on Karpathy's GitHub to build, but making a working version of micrograd has certainly let me distill what are often highly abstract machine-learning concepts into something I can play around with, reason about, and build. I find this very valuable.
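Relatedly, self-referencing expressions like a.Add(a) are exactly where gradient accumulation matters: the Python micrograd accumulates with += in each backward closure so that a node appearing twice gets both gradient contributions. A hedged Go sketch of that idea (illustrative names, not my actual code):

```go
package main

import "fmt"

// Value is a minimal autograd node, just enough to show
// gradient accumulation (illustrative, not the real struct).
type Value struct {
	Data     float64
	Grad     float64
	backward func()
}

func Add(a, b *Value) *Value {
	out := &Value{Data: a.Data + b.Data}
	out.backward = func() {
		// Accumulate with += rather than =, so that when the
		// same node appears as both operands (as in Add(a, a)),
		// both contributions to its gradient are kept.
		a.Grad += out.Grad
		b.Grad += out.Grad
	}
	return out
}

func main() {
	a := &Value{Data: 3.0}
	b := Add(a, a) // b = a + a, so db/da = 2
	b.Grad = 1.0
	b.backward()
	fmt.Println(a.Grad) // 2
}
```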
To quote Richard Feynman: what I cannot create, I do not understand.