Scratch-built Neural Network
Neural network designed to classify MNIST handwritten digits, built from raw linear algebra and calculus.
Introduction
In this project, we uncover the rich mathematics that gives rise to modern machine learning and artificial intelligence. Artificial intelligence (AI) is now used across a wide range of applications. From large language models (LLMs) such as OpenAI's ChatGPT and Anthropic's Claude, to vision systems in self-driving cars, AI is taking centre stage and affects the lives of us all. In this project, we turn the complexity down to analyse how seemingly 'dumb' models can make accurate predictions when trained on large amounts of data. In this case, we will focus on handwritten digit classification. Given tens of thousands of training images, we will train a neural network to learn the underlying structure of digit classification, then validate it to see whether it can make accurate predictions on unseen data. Better yet, we will forgo modern Python frameworks such as PyTorch and TensorFlow, building the mathematical algorithms and operations from scratch.
Handwritten Digits and the MNIST Set
Humans are imperfect. One notable example of this is the way we hand-write digits. Some choose to strike through their '7's, some choose to draw '4's in one stroke, while others draw '9's that look like a 'g'. Generally, however, the human brain is able to process the visual appearance of a variety of differently drawn digits and classify them, almost instantaneously, as belonging to one of the classes 0-9. This is remarkable, as no two handwritten digits are the same. One '9' may have a 0.2mm thicker stem than another, while another '6' might have a circle with a radius 0.15mm smaller. With the neural network developed in section 3, we will attempt to emulate what the human brain does with powerful mathematics and computation.
Any machine learning model requires a large amount of data to learn the complex relationships between inputs and their labelled classes. For this project, we will use the MNIST handwritten digit dataset introduced by LeCun et al. at Bell Labs in 1998. It has the following characteristics (a sketch for loading the raw files follows the list):
- The MNIST dataset contains 70,000 labelled images of handwritten digits (0-9).
- The set is partitioned into 60,000 training images and 10,000 test images.
- Each image is \(28 \times 28\) pixels in grayscale and contains a single digit.
- Pixel values are integers from 0 to 255, representing the darkness of each pixel.
- Each image carries its ground-truth label, a digit from 0 to 9.
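Since we are avoiding ML frameworks, we also load the data ourselves. The sketch below is a minimal way of parsing the raw gzipped IDX files into NumPy arrays; the file names and helper functions are illustrative assumptions, not part of the project code.

```python
# A minimal sketch of loading the raw MNIST IDX files using only NumPy.
# File names are assumptions -- adjust to wherever the dataset is stored.
import gzip
import struct
import numpy as np

def load_idx_images(path):
    """Parse an IDX3 image file into an (n, 784) float array scaled to [0, 1]."""
    with gzip.open(path, "rb") as f:
        _magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
    return pixels.reshape(n, rows * cols).astype(np.float64) / 255.0

def load_idx_labels(path):
    """Parse an IDX1 label file into an (n,) integer array of digits 0-9."""
    with gzip.open(path, "rb") as f:
        _magic, n = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8).astype(np.int64)

X_train = load_idx_images("train-images-idx3-ubyte.gz")   # shape (60000, 784)
y_train = load_idx_labels("train-labels-idx1-ubyte.gz")   # shape (60000,)
```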
There is clearly much variation between handwritten digits. This mostly comes down to the individual's handwriting legibility, but there are also some cultural differences, for example the '7' with the strikethrough or the '1' with a hat. Figure 2.2 shows 20 examples of each digit.

Inevitably, given enough humans and written digits, there will be handwritten digits so illegible that even a human could not decipher what the intention was. If anything, this reinforces the importance of clear writing! Figure 1.3 shows ambiguous examples from the MNIST set, along with their questionable labels.

Keep in mind that statistical machine learning is not perfect, and neither is human inference. Nevertheless, most of the time we humans can read handwritten digits with minimal ambiguity. This reliable structure is ultimately what we try to unearth in this project through careful formulation and training of a neural network.
Neural Network Architecture
In this section we introduce the machine learning model used to classify the digits. This, of course, is the neural network. A neural network is a layered mathematical model made up of nodes (neurons) connected by edges (weights). It takes an input (like a grid of pixel values), processes it through one or more hidden layers, and produces an output such as a probability distribution over classes. One can think of a neural network as a highly non-linear function which typically maps a large number of inputs to one or more outputs. The non-linearity is what allows the model to mould itself in complex ways and learn the underlying function which maps inputs to their correct labels. We will define what we mean by 'non-linearity' later in this chapter. In a typical diagram, such as that in Figure 3.1, you'll see input neurons on the left, hidden layers in the middle, and output neurons on the right, with every connection representing a learned weight. The neural network models how biological neurons and synapses (equivalent to weights/edges) work in the brain. Thus, if trained correctly this model should be able to understand complex patterns within spatial image data, just like humans do, to make the correct classification.
Each layer of the neural network in Figure 3.1 is explained below.
- Input Layer: On the far left, we see the input neurons. There is no computation inside an input neuron; each one corresponds directly to a feature of the input. In the case of the MNIST handwritten digits, which are \(28 \times 28\) pixel images, there are a total of \(28 \times 28 = 784\) input features, so our neural network requires \(784\) neurons on the input layer. These input features are denoted \(x_i\) and correspond to the dimensions of the input space. This number of dimensions (i.e. pixels) dictates the granularity and quality of the input. For example, if there were only \(3 \times 3 = 9\) pixels in the image, there would clearly not be enough 'room' to distinctively draw the digits '0' to '9'. On the other hand, if we were to take an extremely high resolution such as \(1,000 \times 1,000 = 1,000,000\) pixels per image, there would be too much detail in each digit. This additional quality does not make the digits any more recognisable and, as the reader will later learn, would result in weight matrices with hundreds of millions or even billions of parameters, effectively impossible to train without a large GPU cluster (a parameter-count sketch for our layer sizes follows this list). The \(28 \times 28\) input size strikes a balance: it avoids polluting the model with extraneous information while giving enough quality to represent digits in an easily recognisable way.
- Hidden Layer: In the centre, we find the hidden layer. A neural network can have many hidden layers, each with any number of neurons; adding more than around three hidden layers gives rise to so-called deep learning. The number of hidden layers and the number of hidden neurons determine the flexibility of the model. The output of a hidden neuron is denoted \(z_{i}^{(n)}\), meaning the \(i\)-th neuron in the \(n\)-th hidden layer. If we have too few hidden neurons, the model will not be flexible enough. Conversely, if we have too many, the model will rely on the training data too much and effectively start memorising it. The model then misses the essence of the problem, which is to judge the shapes and contours in each image to determine the most appropriate digit; instead, it learns "this is training example 15239 and I remember it's a '9'". Memorisation is not learning. This is known as overfitting. To combat overfitting, several regularisation techniques have been established, such as dropout and weight decay, which aim to break the fine neural 'memory' structures that cause overfitting.
- Output Layer: On the right, we find the output layer. This layer is responsible for taking the outputs of the final hidden layer and, for a classification problem, producing a set of pseudoprobabilities corresponding to the model's confidence that the given input belongs to each class. Hence, the number of output neurons is equal to the number of classes; for our problem there will be 10 output neurons, one for each of the digits '0' to '9'. The prediction function that we will later define is responsible for taking the final hidden layer outputs and producing a pseudoprobability strictly in the range of 0 (near impossible) to 1 (near certain). We denote the outputs \(\hat{y}_i\), with a little hat to indicate that they are predictions, not the ground truth associated with the training example (which is denoted \(y_i\)).
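The three layers above fix the shape of our network: 784 inputs, one hidden layer, and 10 outputs. The sketch below counts the parameters this implies; the hidden size of 128 is an illustrative assumption, not a value fixed by the text.

```python
# Parameter count for a 784 -> H -> 10 network as described above.
# The hidden size H = 128 is an illustrative assumption.
import numpy as np

n_in, n_hidden, n_out = 784, 128, 10

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.01, size=(n_hidden, n_in))    # hidden-layer weights
b1 = np.zeros(n_hidden)                               # hidden-layer biases
W2 = rng.normal(0.0, 0.01, size=(n_out, n_hidden))    # output-layer weights
b2 = np.zeros(n_out)                                  # output-layer biases

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 784*128 + 128 + 128*10 + 10 = 101,770 parameters
```

Even at this modest resolution and hidden size, the network already has over a hundred thousand parameters, which illustrates why a \(1,000 \times 1,000\) input would be impractical.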
Now that we understand the structure of the neural network and each of its layers, it is natural to wonder what all the connections (edges) mean and how the network actually processes data. First of all, in a general graph, nodes may be connected by edges in any way. In a neural network, the layers are typically fully connected, meaning every node in one layer connects to every node in the following layer. The network should also be read from left to right. Input features, \(x_i\), enter the model on the left and forward propagate via the edges to the neurons in the following layer, where some mathematics is performed; those outputs then follow the same process until we are left with our outputs, \(\hat{y}_i\). The details of that mathematics are given in Figure 3.2.
Figure 3.2 is a truncated version of Figure 3.1, showing only the input neurons and the first neuron in the first (and only) hidden layer. The output of this neuron is \(z_{1}^{(1)}\). Following the diagram, the first step is to take the weighted sum of each input \(x_i\) with the weight on the edge between the origin and destination node. The destination node in this case is the first node in the hidden layer, so the first index of each weight is '1'. After the sum we add the bias term \(b_{1}^{(1)}\). The bias makes the model even more flexible by introducing an offset which is independent of the input values; think of it as the '\(b\)' term in the linear function \(y = mx + b\). Finally, we feed the weighted sum plus bias into a non-linear activation function, whose output is the familiar \(z_{1}^{(1)}\) from Figure 3.1. Whilst Figure 3.2 describes only \(z_{1}^{(1)}\), every other \(z_{i}^{(n)}\) follows the exact same process, with input data always coming from the previous layer. The exception is the output layer, which operates in the same way but with a different activation function.
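The computation in Figure 3.2 takes only a few lines of NumPy. The sketch below computes \(z_{1}^{(1)}\) for a single input; the sigmoid activation and the random placeholder values are illustrative assumptions, since the text has not yet fixed those choices.

```python
# Weighted sum + bias + non-linear activation for the first hidden neuron,
# mirroring Figure 3.2. The sigmoid activation is an illustrative choice.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.random(784)               # one flattened 28x28 image (placeholder values)
w1 = rng.normal(0.0, 0.01, 784)   # weights on the edges feeding the first hidden neuron
b1 = 0.0                          # bias of the first hidden neuron

a1 = np.dot(w1, x) + b1           # weighted sum of the inputs, plus the bias
z1 = sigmoid(a1)                  # z_1^(1), the neuron's output

# Every hidden neuron follows the same recipe, so the whole layer is just a
# matrix-vector product: z = sigmoid(W1 @ x + b_vec).
```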
Lastly, we motivate the selection of the activation and prediction functions.
Activation Function
The non-linear activation step is paramount in giving the network the high level of flexibility required to learn complex relationships. In fact, when we start discussing the linear algebra behind these operations in section 4, we will see that if no activation function is present, or it is linear, the effect of subsequent layers is nil and the model's output is simply a linear combination of the inputs. Besides being non-linear, the activation function should also be differentiable (at least piecewise), so that the gradient-based training introduced later can be applied.
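A quick numerical check of the claim above: with no (or a linear) activation, two stacked layers collapse into a single linear map, so the extra layer adds nothing. The matrix sizes here are arbitrary placeholders.

```python
# Without a non-linear activation, two layers collapse into one linear map:
# W2 @ (W1 @ x + b1) + b2  ==  (W2 @ W1) @ x + (W2 @ b1 + b2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)
W1, b1 = rng.random((128, 784)), rng.random(128)
W2, b2 = rng.random((10, 128)), rng.random(10)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True -- the second layer added no power
```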
Prediction Function
As alluded to earlier, the prediction function is a special activation function used only on the output layer of the network to produce the pseudoprobabilities \(\hat{y}_{i}\). Its raw inputs, one per possible class, are called logits.
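The text has not yet named its prediction function, but the standard choice for multi-class classification is the softmax, which maps the 10 logits to pseudoprobabilities that sum to 1. A numerically stable sketch, under that assumption:

```python
# Softmax: maps raw logits to pseudoprobabilities in (0, 1) that sum to 1.
# Subtracting the max logit avoids overflow in exp() without changing the result.
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, -1.0, 0.5, 3.1, 0.0, -2.2, 1.7, 0.3, -0.5, 0.9])
y_hat = softmax(logits)       # one pseudoprobability per digit class 0-9
print(y_hat.sum())            # 1.0
print(int(np.argmax(y_hat)))  # predicted digit: 3 (the largest logit)
```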
Forward Propagation and Measuring Loss
We sim
Conclusion
Neural networks are remarkably capable, even when built from nothing more than raw linear algebra and calculus.