So, let’s start off by imagining this scenario. You have to go to some government department to renew your licence or change your son’s middle name to Hodl, and they make you complete copious amounts of complicated, burdensome paperwork to do so. Since we’re in the glorious age of technology, you, like me, have probably forgotten what it’s like to write using pen and paper. In case you have forgotten, I’ll remind you. It’s cumbersome, there’s no spell check, and on top of all that it also destroys forests. After all of this, the written information has a high probability of being captured incorrectly on the system we need it on anyway. This archaic process basically means that you’ll need to fill out even more forms to correct what was right in the first place.

Setting aside the fact that we should be doing all this stuff easily on the internet anyway, it would of course be a lot less labour-intensive and more accurate if machines recognised all this handwritten information and seamlessly transferred it into said government’s database. So why don’t we do this? Well, actually, many places already do.

Handwritten digit (and character) recognition is a very well researched problem, and many successful implementations exist around the world. I decided to build some of my own machine learning models to tackle the problem, and I’ll give you a little insight into my thinking. First take a look at the image below; it’s an example of some of the digits I used to train my model.

Okay, so if you don’t know much about machine learning — here is the super high level primer that will make sure you understand the essence of what I’m trying to do.

Firstly, this type of problem falls into a category known as supervised learning. Generally speaking, this means our data takes the form of features and labels. In the handwritten digit case, the features would be all the pixels making up the digit, and the label tells us what this digit actually is. It’s then our task to build a model where the ‘machine learns’ based on all these provided examples of the features and labels, so that when it is eventually given a set of features (the pixels of a handwritten digit) it can now predict the number based on what it has learnt from all these past examples.

Imagine an alien lands on Earth with no concept of digits. If we provide this alien with a massive sheet like the above picture, with thousands of examples of handwritten digits, we hope Mr. Alien could eventually learn by example to recognise different digits. That is essentially the supervised learning problem: using lots and lots of examples for the ‘machine to learn’ so that it can eventually start making its own predictions.

Another interesting supervised learning example is predicting who survived the Titanic based on age, gender, and a number of other features. If you want to get started with machine learning yourself, try this exact challenge here.

Okay back to digit prediction…

So, there are lots of different ways for our ‘machine to learn’. There are various flavors of regression, support vector machines, decision trees, and so on. An Introduction to Statistical Learning (Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani) is a great place to start if you’d like to learn more about these.

Then there is deep learning…

Deep learning? I’m not even going to attempt to explain this. Instead I’m going to point you to the most insanely clear explanation of a difficult concept I’ve ever had the pleasure of stumbling upon. Check it out here.

Okay, I lied. Here goes my abstract explanation of a deep learning neural network anyway. Think of it as the system pictured below, with hidden layers of neurons sitting between the input and the output. Pretend each neuron is like a knob on a DJ’s mixer, and adjusting it will somehow change the final predicted output. Each time we feed it an example digit, these knobs get adjusted and this ‘neural network’ learns a bit more (and hopefully predicts a little better).
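To make the knob idea a little more concrete, here is a minimal sketch with exactly one knob. It is not a deep network and not the model from this post: it is a single number being nudged after every example so that its predictions get closer to the targets. The function name train_single_knob, the learning rate, and the toy examples are all made up for illustration.

```python
def train_single_knob(examples, learning_rate=0.1, epochs=50):
    """Fit prediction = knob * x by nudging the knob after every example."""
    knob = 0.0  # start with the knob turned all the way down
    for _ in range(epochs):
        for x, target in examples:
            prediction = knob * x
            error = prediction - target
            # The gradient of 0.5 * error**2 with respect to the knob is
            # error * x, so we turn the knob a small step the other way.
            knob -= learning_rate * error * x
    return knob


# Toy examples (hypothetical): the hidden rule is target = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
knob = train_single_knob(examples)
print(knob)
```

A real neural network does the same kind of nudging, only with many thousands of knobs at once and with a more involved rule for working out which way to turn each one.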


Now, you may ask, “What affects how well this deep learning model actually predicts the digits?” Well, lots and lots of things. That’s why people with machine learning PhDs from Stanford get the big bucks: they know these things best, allowing them to make the best models in the world. In the above example, ask yourself these questions: how many hidden layers should we have? How many neurons should we have in each layer? At what rate should we adjust these neurons given each training example? These are some of the hyper-parameters of the model (many more exist than those just mentioned), and selecting the optimal hyper-parameters for model performance is no trivial task.
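One way to get a feel for why these choices matter is to count how many adjustable numbers (weights and biases) a fully connected network ends up with for a given choice of layers. The sketch below assumes a plain fully connected network with 784 inputs and 10 outputs (one per digit); the hidden layer sizes are hypothetical examples, not the ones I actually used.

```python
def count_parameters(n_inputs, hidden_layers, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        # One weight per connection between the two layers,
        # plus one bias per neuron in the receiving layer.
        total += fan_in * fan_out + fan_out
    return total


# Hypothetical architectures, purely for comparison.
small = count_parameters(784, (32,), 10)
large = count_parameters(784, (128, 64), 10)
print(small, large)
```

Every extra layer or neuron adds more numbers to tune, which is part of why picking these settings is a balancing act between flexibility and the amount of data you have.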

Since handwritten digit prediction is a well-known problem, a couple of Google searches helped me find some fairly optimal hyper-parameters, so I’m not going to focus on that in this post. Instead I’m going to talk about how I further improved my model using Principal Components Analysis (PCA).

Before I get stuck in, it’s worth noting a bit more about the actual data set I worked with. Each digit comes in the form of 784 pixels (a 28×28 pixel image), with the colour intensity of each pixel on a scale from 0–255 (these are our features). For each data point we also have a label telling us what the digit actually is. I used a training set of only 2 500 digits. Now consider this logic: the more examples to learn from the better, right? And it makes sense that the more complex our data is (generally the more features, the more complex), the more examples we would need to train our model, right? Hopefully this makes intuitive sense; if not, you can Google ‘the curse of dimensionality’ to help you along.
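To picture the layout of the data, here is a small sketch of how 28×28 images turn into rows of 784 features. The pixel values and labels below are random placeholders generated with NumPy, not the real digits, and the variable names are my own; only the shapes and the 0–255 range mirror the data set described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data (NOT real digits): 2 500 random 28x28 images with
# pixel intensities from 0 to 255, and a random label from 0 to 9 for each.
images = rng.integers(0, 256, size=(2500, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=2500)

# Flatten each 28x28 image into one row of 784 features and rescale
# the intensities from 0-255 to the range 0-1.
features = images.reshape(len(images), -1).astype(np.float64) / 255.0

print(features.shape, labels.shape)
```

Rescaling to 0–1 is a common convenience for training neural networks rather than something the data requires.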

So, in my case I had 784 pixels (my features), and only 2 500 examples to train my deep learning neural network. This isn’t fantastic – I have relatively few training examples compared to a large number of features. At this point, my deep learning model’s test accuracy was sitting at about 96%.

So how did PCA help me do better? PCA is essentially a dimensionality reduction technique that allows you to shrink your number of features, whilst still capturing as much information as possible. To help explain this a bit better, let’s look at some cat pictures… (taken from a presentation by Greg Distiller, University of Cape Town.)

Here you can see that with just 150 principal components, as opposed to all 420, we can still easily make out the cat in this photograph. I am essentially planning on doing the exact same thing with my digit images: instead of using all 784 pixels, I plan on using only about 40 principal components. Take a look at the plot I did below.

What this essentially means is that I managed to shrink my number of features from 784 (the number of original pixels) to only 40 (these are now termed principal components), with these 40 features still capturing about 99% of the original variation.
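For the curious, here is a rough sketch of how PCA can be computed and how the number of components needed to reach a variance target such as 99% could be picked. It is not the code behind the results in this post: it assumes NumPy’s SVD as the way to find the components, it runs on synthetic stand-in data rather than the digits, and the names fit_pca, transform, and variance_to_keep are made up for illustration.

```python
import numpy as np


def fit_pca(X, variance_to_keep=0.99):
    """Return the feature means, the leading principal directions, and the
    cumulative share of variance explained by the components."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # Smallest number of components whose cumulative share reaches the target.
    n_components = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    n_components = min(n_components, len(cumulative))
    return mean, Vt[:n_components], cumulative


def transform(X, mean, components):
    """Project data onto the principal directions found by fit_pca."""
    return (X - mean) @ components.T


# Synthetic stand-in data (NOT the digit images): 200 samples with 50 features
# that are really driven by only 5 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.001 * rng.normal(size=(200, 50))

mean, components, cumulative = fit_pca(X, variance_to_keep=0.99)
n_components = components.shape[0]
reduced = transform(X, mean, components)
print(n_components, reduced.shape)
```

With real data, the usual practice is to fit the components on the training examples only and then reuse the same mean and components to transform the test examples, so that nothing about the test set leaks into the model.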

Training a deep learning model with 2 500 examples but only 40 features sounds a lot better, right? It did indeed boost my test accuracy to about 98.4%. The table shows how well my various models performed (even though I’ve only discussed my deep learning one, I thought I’d show you all of them for interest’s sake).



The big takeaway for me from this process is that there is so much to learn, and so many cunning little techniques you can use to boost performance. Ultimately, all that’s required is lots of curiosity and experimentation.

If you are interested in any of my actual code or more technical details, please get in touch with me. For brevity’s sake, I had to leave a lot of the juicy details out!