Machine Learning Building Blocks: Logistic Regression
Logistic Regression from scratch with NumPy.
Preamble
In my previous article, I wrote about linear regression, starting with linear equations and analytical solutions for fitting your data, moving on to gradient-descent-optimised models, and finally using PyTorch primitives to create a single-layer neural network for a continuous linear regression problem. This time, I will explain how to make discrete classifications using logistic regression on a binary breast cancer data set.
Introduction
In the scikit-learn breast cancer data set, we have labels for benign or malignant tumours and a matrix of thirty features describing each of 569 breast tumours on which to make predictions. I find it useful to visualise data before starting any project, so I’ll do that first. High-dimensional data can be difficult to conceptualise, so we will reduce the dimensionality down to something we can plot before trying to visualise it. This can be achieved with methods like PCA, t-SNE, UMAP, VAEs, etc. Here I have chosen the sklearn implementation of t-SNE; there are pros and cons to each method, but I won’t go into them here, as with data of this size there isn’t really a wrong choice. The resulting two-dimensional plot shows how the data looks when reduced to just the X/Y planes. It is clear that a decision boundary drawn in this reduced space would not be completely accurate, but even in two dimensions there is visual evidence that benign samples and malignant samples have generally separable characteristics, which will help when making predictions on the test data.
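As a minimal sketch, the reduction and plot could be produced along these lines (the TSNE settings, random seed and plotting choices are my own assumptions, not the exact ones used for the original figure):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE

# Load the breast cancer data: X is (569, 30), y holds the 0/1 class labels.
X, y = load_breast_cancer(return_X_y=True)

# Reduce the thirty features to two dimensions for plotting.
embedding = TSNE(n_components=2, random_state=42).fit_transform(X)

# Colour each point by its class label (malignant vs. benign).
plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="coolwarm", s=10)
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.title("Breast cancer data reduced to 2D with t-SNE")
plt.show()
```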
Logistic Function
The logistic curve, or sigmoid curve, is useful for mapping any real-valued input to a probability between 0 and 1. You can see below that the sigmoid curve passes through 0.5 on the Y-axis where X is zero. This means that positive values of X map closer to 1 and negative values map closer to 0, with the 50/50 point sitting at X equal to zero. In practical terms, this means that once we map our input values into this logistic space, we can classify each sample by checking whether its y-value is greater than or less than 0.5 and applying the appropriate label. The function is very easy to implement as, mathematically, it is 1/(1+e^-r). Below is a NumPy implementation of the logistic function, along with the values of r (between -10 and 10) used to trace the curve.
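A minimal NumPy sketch (the function name and the number of sample points are my own choices):

```python
import numpy as np

def sigmoid(r):
    """Map any real value r to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-r))

# Values of r between -10 and 10 to trace out the logistic curve.
r = np.linspace(-10, 10, 200)
logistic_values = sigmoid(r)
```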
Making Predictions
So now that we have a function for calculating the logistic of a given value, we need a function to predict the output probability for a sample. This function will take in our features and a given set of weights and output the logistic value of their matrix dot product. The output will be the logistic probability, between 0 and 1, of the sample belonging to class 1. The dot product calculation is very similar to how predictions were made for the linear regression model. We are looking for the matrix of weights which, when multiplied by our features, gives us a model that robustly maps samples to the correct class. A code snippet for this is below, using the NumPy dot product operator.
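A sketch of such a prediction function, reusing the sigmoid helper above (the function names and the 0.5-threshold helper are my own additions):

```python
def predict_proba(features, weights):
    """Probability of class 1 for each sample: sigmoid of the dot product."""
    return sigmoid(np.dot(features, weights))

def predict(features, weights, threshold=0.5):
    """Hard class labels: 1 where the probability reaches the threshold, else 0."""
    return (predict_proba(features, weights) >= threshold).astype(int)
```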
Logistic Loss
The next step in this problem is deciding how to measure the loss, or cost, of the predictions we make given our current weights. Because plugging the logistic function into the squared-error loss gives a non-convex cost, we cannot reuse the loss function from linear regression; instead we use the log loss function. There are great write-ups on this function here: Link and Link. The result is a differentiable and convex loss function that we can optimise over our weights with gradient descent to improve this model’s performance, just as we saw with linear regression models.
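As a sketch, the plain (unregularized) log loss could be written like this in NumPy, assuming y holds the 0/1 labels and y_pred the predicted probabilities (the clipping epsilon is a detail I have added to keep the logarithms finite):

```python
def log_loss(y, y_pred, eps=1e-15):
    """Average binary cross-entropy between labels y and probabilities y_pred."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
```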
An immediate improvement we can make is to add regularization, as suggested by Sebastian Raschka and V. Mirjalili in Python Machine Learning. Regularization constrains the model by preventing the parameters from growing too large. In this case, we will use L2 regularization, also known as Ridge Regression when applied to linear regression. It penalises overly large weights in our model, which helps prevent overfitting. It can be implemented by adding the sum of the squared weights, scaled by lambda, to the end of the cost function: J_reg(w) = J(w) + lambda * sum(w^2), where J(w) is the cost function we defined above. Lambda is a value we can set: intuitively, a lambda of 0 removes regularization entirely, while increasingly large values of lambda make the regularization term dominate the equation.
We can now put this whole thing together in code for our set of features, labels and weights.
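A sketch of that combined, L2-regularized cost, reusing the helpers above (the default lambda value and the choice to penalise every weight equally are assumptions):

```python
def regularized_cost(features, labels, weights, lam=0.1):
    """Log loss of the current predictions plus an L2 penalty on the weights."""
    predictions = predict_proba(features, weights)
    penalty = lam * np.sum(weights ** 2)
    return log_loss(labels, predictions) + penalty
```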
Intermezzo
This model is very naive at this point, as our first set of weights will be completely random, but let’s look at the performance of those random weights to establish a baseline for how well this is working so far. First we will create our train/test splits and make our first predictions, then we can use the very handy classification_report function from scikit-learn to report back our performance metrics.
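A sketch of that baseline check, using scikit-learn's train_test_split and classification_report (the test size and random seeds are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the data and hold out a portion of it for testing.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Completely random starting weights, one per feature.
rng = np.random.default_rng(42)
weights = rng.normal(size=X_train.shape[1])

# Score the untrained, random-weight model on the held-out test set.
print(classification_report(y_test, predict(X_test, weights)))
```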
Predictably, this did very poorly. But it is good to have something with which to compare our progress!
Calculating the Gradient
To figure out how to change our weights to improve the model, we need the gradient of the cost function with respect to the weights for a set of predictions. Because we have defined a cost function that is both differentiable and convex, this approach is 1) possible and 2) guaranteed to reach optimal parameters for the model given enough training time. We can calculate the gradient by hand by taking the first derivative of the logistic loss function, which works out to (1/m) * X^T * (sigma(Xw) - y), where m is the number of observations, X is the feature matrix, sigma(Xw) are the predictions and y are the actual binary class labels [0, 1].
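As a NumPy sketch, reusing the predict_proba helper above; the extra 2 * lam * weights term is simply the derivative of the L2 penalty we added to the cost:

```python
def compute_gradient(features, labels, weights, lam=0.1):
    """Gradient of the regularized log loss with respect to the weights."""
    m = len(labels)
    errors = predict_proba(features, weights) - labels   # sigma(Xw) - y
    grad = np.dot(features.T, errors) / m                # (1/m) * X^T * (sigma(Xw) - y)
    return grad + 2 * lam * weights                      # derivative of lam * sum(w^2)
```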
Now all we need to do is pick a learning rate, a constant to multiply our gradients by, then nudge our weights closer to the minimum with each iteration, make new predictions, and repeat.
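A single update step might look like this (the learning rate here is just a placeholder value):

```python
learning_rate = 0.01  # placeholder; a constant used to scale the gradient

# One gradient descent step: move the weights against the gradient of the cost.
new_weights = weights - learning_rate * compute_gradient(X_train, y_train, weights)
```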
Putting it all together
We have now created a function for predicting the probability of a sample belonging to class 1 (or, inversely, class 0), a function to calculate the loss/cost for our weight matrix, and a function to update those weights and improve our predictions. We can now chain these together in a training loop for n iterations and get back a set of optimised weights for this classification problem. We will also keep a list of the costs so we can examine the convex nature of this optimisation problem ourselves! I chose 10,000 iterations, but this was an arbitrary number and, as the graph shows, the process could have stopped much earlier with only a negligible increase in the final cost.
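A sketch of that training loop, chaining the helpers above (the hyperparameter values are placeholders; the 10,000 iterations matches the text):

```python
def train(features, labels, weights, learning_rate=0.01, n_iterations=10_000, lam=0.1):
    """Run gradient descent, recording the cost at every iteration."""
    costs = []
    for _ in range(n_iterations):
        weights = weights - learning_rate * compute_gradient(features, labels, weights, lam)
        costs.append(regularized_cost(features, labels, weights, lam))
    return weights, costs

trained_weights, costs = train(X_train, y_train, weights)

# The recorded costs should fall steeply at first and then flatten out.
plt.plot(costs)
plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.show()
```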
So now that the model is trained, how has it performed? Before training we had an average F1 score of 0.15 on our test data; after training that has increased to 0.99. That is probably about as good as we are going to get on this dataset with logistic regression for today.
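The post-training check is the same classification_report call as the baseline, just with the trained weights:

```python
# Re-score the held-out test set using the optimised weights.
print(classification_report(y_test, predict(X_test, trained_weights)))
```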
Extending Logistic Regression
Logistic regression can also be extended beyond y = 0, 1 to y = 0, 1, …, n by simply running one one-vs-all binary classification problem per class. In this way we can use a regression model for multi-class problems without having to do some kind of output binning or otherwise combine models. This is useful if we want to approach a problem with the simplest possible algorithmic implementation before adding any intellectual or technical overhead from more complicated techniques. A rough sketch of this one-vs-rest approach is below.
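As an illustration only (the breast cancer task itself is binary), a one-vs-rest wrapper around the helpers above could look roughly like this:

```python
def train_one_vs_rest(features, labels, n_classes, **train_kwargs):
    """Fit one binary classifier per class, treating that class as 1 and the rest as 0."""
    all_weights = []
    for k in range(n_classes):
        binary_labels = (labels == k).astype(int)
        initial = np.zeros(features.shape[1])
        weights_k, _ = train(features, binary_labels, initial, **train_kwargs)
        all_weights.append(weights_k)
    return np.array(all_weights)

def predict_multiclass(features, all_weights):
    """Pick the class whose classifier assigns the highest probability."""
    scores = np.array([predict_proba(features, w) for w in all_weights])
    return np.argmax(scores, axis=0)
```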
I’ve written the training and model portions of this tutorial entirely in NumPy to dig the guts of these methods out of low-code frameworks and help anyone who wants to gain a deeper understanding of this algorithm. NumPy is a real staple of the Python data science/numerical computing world and an invaluable tool to learn, not only for implementing these well-known algorithms but also for implementing custom algorithms for our own specific use cases. I hope this example helped shed light on how easy it can be to create these models from scratch, and that both technically and conceptually they are very accessible.