## Logistic Regression: Machine Learning for Binary Classification

**Introduction
**

**Logistic Regression** belongs to a family of machine learning models called Linear Classification, and it is used to predict which of two *learned* classes — called labels — a sample specimen belongs to. It is a binary prediction/**classification** model. Logistic Regression should not be confused with Linear Regression, which is a machine learning model concerned with learning the coefficients of a * polynomial* (i.e. a function) that best fits a collection of specimens represented as points in *n*-feature space.

Classification is a very natural way to organize and understand things, and takes on a few forms. Classifying computer files as being or not being a virus, or email messages as being or not being SPAM, are examples of *singular classification*, where specimens can only belong to one class. Decomposing and analyzing the sounds of a classical score to identify it’s period, is an example of singular classification too, but with more than two classes. *Multinomial classification* on the other hand, handles the case where specimens can belong to more than one class. A machine learning algorithm that analyzes documents for Term Frequency/Inverse Document Frequency (TF/IDF) might, for example, classify this article under the subjects of *machine learning*, *big data* and *python*. Similarly, songs in your MP3 player can be multiply classified (a.k.a tagged). And finally, we note that in classification, it is sometimes more appropriate to calculate the **probability** of a specimen belonging to a class, rather than asserting that it belongs to that class. For instance, a physician might tell a patient that he has a high probability of experiencing a heart attack, given certain clinical risk factors; but she cannot assert that he will have one. For these cases, we are interested in a probability value being emitted by the trained predictor, rather than having it predict a class label.

For this concise *Insight* on Logistic Regression — which again applies to binary predictions — we’ll *learn* the MNIST handwritten digits database, and then use the trained model to predict the class of new handwritten digit specimens. According to it’s Wiki entry:

* The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by “re-mixing” the samples from NIST’s original datasets. The creators felt that since NIST’s training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST’s complete dataset was too hard. Furthermore, the black and white images from NIST were normalized to fit into a 20×20 pixel bounding box and anti-aliased, which introduced grayscale levels. The database contains 60,000 training images and 10,000 testing images.*

There are many sources of data we could’ve used for this *Insight*, but this one has a 2D aspect that will appeal to our visual intuition, as we will see later. Similarly, there are multiple machine learning platforms and packages that we could apply to this learning problem. To name a few:

**Apache Spark**: via raw SparkContext code, or using the MLlib API (Java, Python, Scala)

**Apache Mahou**t: via it’s OnlineLogisticRegression class (Java)

**SciKit-Learn**: via it’s LogisticRegression class (Python)

Apache** Spark** is the natural choice for large and/or high-velocity streaming datasets. Apache **Mahout** is a natural choice for batch-oriented analytics where Hadoop is already deployed (although Mahout is currently undergoing a transition to provide a library for Spark as well). And finally, there is the Python-based **SciKit-Learn** ML library. SciKit-Learn isn’t a distributed compute option like Spark and Mahout are, but in practice ML scale doesn’t come up as often as one may think. And when it occasionally does, it has been shown that employing *sampling* and *ensemble-learning* techniques and/or using a single, properly configured server can outperform distributed system solutions.

We will employ Python — specifically Python3 — and the SciKit-Learn ML library for this *Insight*. Not only do their concise and powerful syntax offer a far cleaner implementation than anything written in Java can offer; but we’re also able to leverage Numpy’s powerful Array Oriented (vectorized) Processing as well as leverage Matplotlib’s powerful plotting capability. This option also has the advantage of not requiring a lot of up-front configuration steps, as Spark and Mahout do. All that is necessary is a Python3 interpreter and a few Python modules.

**Theory
**

*To keep this Insight both practical and manageable in length, we present the theory of Logistic Regression somewhat informally, and hope that the reader will be curious enough to learn the formalities and rigors detailed elsewhere.*

As mentioned, we want a Logistic Regression predictor, h_{β}(x), to emit either a **0** (**False**) or **1** (**True**) class/label as it’s prediction; or, alternatively, to emit a value indicating the probability that the predicted class/label is **1** (**True**). Both cases are mathematically described here:

- Predictor output (a class/label): h
_{β}(x) ∈ {0,1} - Predictor output (a probability): 0 ≤ h
_{β}(x) ≤ 1

Both of these properties are simultaneously captured in the Sigmoid (S-shaped) function illustrated here. Below, we’ll cover aspects of this curve that make it interesting for 2-way labelling / classification (i.e. for Logistic Regression):

- First, we notice that no matter what value the independent input variable
**x**may take on (an unbound**domain**), the dependent output variable**y**is restricted to**continuous****real numbers**between**0**and**1**(a bounded**range**). This is useful for*probability-based*predictors (case**2**above).

- Next, we notice that the function is also
*symmetric*about the inflection point**(x,y) = (0, 0.5)**, which is the mid-point of the function’s range,**y**. (*More formally, by symmetry we mean that the derivative of the sigmoid function,*. This means that, in addition to**f ‘(x),**has the property that= |**f ‘(x)****f ‘(-x)|**for all values of**x**)**y**being bound to continuous real numbers between**0**and**1**, values for**y**are also evenly and symmetrically distributed on both sides of the inflection point. This allows us to define**y = 0.5**as the mid-point predictor-value where the class prediction inflects from**0**to**1**(or vice versa); and where probability of the class being**1**is exactly**50%**. This mid-point/inflection-point is called the**decision boundary**and is used in class-based predictors (case**1**above). We’ll have more to say about*decision boundaries*later on.

- Lastly, we observe that the rate-of-change (
) is largest near**f ‘(x)****x = 0**and quickly decreases outside of that. This is significant because**x**, the independent variable, actually represents the value of a single-featured specimen whose class we’re trying to predict, and we can leverage this property — that is, the steep rate-of-change that occurs within a narrow range of**x**— to separate the class**0**specimens from the class**1**specimens.

Because it’s central to Logistic Regression ML, let’s explore that last point with an example. Let’s pretend to apply the Sigmoid function to predict the **probability**, **y**, that a student *has mastered a complex course*, given a certain amount of **study** **time**,** x**, in months. Let’s also say that statisticians *observed/learned* that all 5,000 previous course participants required an average of **six months** to attain a **working knowledge** of the subject, and so we’ll define six-months to be the **mid-point** milestone on their way to mastery.

*(article continues in column two …)*

In order to mathematically capture that statistical observation, we need to shift the curve horizontally six units to the right by replacing **x** with **x – 6** in the Sigmoid equation. Doing so makes the point **(x,y)** = **(6, 0.5)** the **new** mid-point/inflection-point, which corresponds to the six month mid-point milestone. The “*statistically* *corrected*” Sigmoid predictor function looks like this:

Intuitively, we can see what this predictor tells us: (a) during the initial stages of study, acquisition of understanding is a struggle; (b) but as the student approaches the six-month milestone, with terminology and concepts now familiar, things begin to click and the rate of understanding increases substantially; and (c) finally, as the student approaches expert level, there is little left to learn, which predictor models by emitting predictions that increase very little with additional study time.

The Sigmoid shape, indeed, models the human learning curve quite nicely. But take a moment to appreciate that, if we hadn’t first corrected (i.e. shifted) the function, the predictor would predict that, with zero study time, every student starts out at the **working knowledge** expertise level (i.e. at the *pre-corrected* mid-point), which is clearly wrong. But because statisticians had previously *learned/observed* that every student implicitly comes with this six-month lead time (which would be an input *feature* to a machine learner), we were able to *manually* correct for it by replacing **x** with a very specific polynomial of **x**, (in this case,** x – 6**). Now what about for **Martian students**? Well, the same statisticians *learned/observed* that Martians only have a three-month lead time, and are also able to acquire understanding five times faster than humans near the mid-point. Because of this, the replacement polynomial for a Martian-based predictor would be, **5x – 15** (where, for Martians,** (x,y) = (3, 0.5)** becomes the mid-point, and where the slope at that mid-point is 5 times greater than it is for humans.

This discovering and statistically correcting for *observations *and* attributes* contained in datasets, samples, populations, and specimens, is exactly what the machine learning algorithm programmatically does for us: given a training dataset of *n*-featured specimens — each labeled as either class **0** or **1** — the algorithm calculates the coefficients of an ** n+1 **term

**1st**. degree polynomial that essentially shifts and/or distorts (i.e. corrects) the shape of a

*normal*Sigmoid function so as to optimally separate the two classes based on what it

*learns/observes*from that training dataset. When

**x**is replaced by this

**term, multivarite polynomial, the predictor function take on the following generalized form:**

*n+1*The parenthesis-enclosed polynomial can be more succinctly expressed using vector notation like this: **β**^{T}**X**. (read: Beta-transpose *times* X), where **x _{0}** = 1 (always).

The vector **β** represents the coefficients that the machine learning algorithm has learned for a given training set. And the vector **X** represents the values of an *n-*featured specimen whose class/label we are predicting. Thus for the **i**–*th*. specimen, inserting **X _{i}** into the trained Sigmoid function above, reveals the predicted class/label for that specimen.

Now we won’t cover how the coefficients vector **β** is calculated, but generally speaking it involves defining one or more equations that essentially *tally* the cumulative prediction error across an entire training set; that is, the cumulative difference between the *predicted* values of the training samples (emitted by the trained predictor function) and the known *true* value for those same samples. This equation(s) is called the **Cost Function** **J**_{β}(x), and is used to *algorithmically* *search* for the set of coefficients **β** that minimizes the cumulative cost/error/tally for a given training set. If you’re familiar with Calculus, the technique involves taking the partial derivative of the Cost Function, **J**_{β}(x), with respect to each coefficient β**_{i}**, and then applying an iterative numerical method called

**Gradient Descent**—

*the learning algorithm*— to find the set of coefficients

**β**that minimizes

**J**

_{β}(x) for a given training dataset. It’s important to highlight that when the training set is changed, the coefficients and therefore the predictor function changes, too. This is why using

*cross-validation*training techniques (such as K-fold or LOO-XV) to asses the average score of multiple predictor accuracies, is vitally important.

Finally, before moving onto the implementation section, we make two closing remarks**. **First**, **the set of vectors **X** for which **β**^{T}**X** = **0**, are the roots of the polynomial **β**^{T}**X**. Those vectors are points in ** n-feature** space, and collectively form a

**hyperplane**known as the

**Decision Boundary**— that is, the boundary where the predicted classification/label inflects it’s value from

**0**to

**1**(or vice versa). Above, we showed the decision boundary for the single featured, univariate case. The cost function,

**J**

_{β}(x), for Logistic Regression — which is already defined for us — attempts to maximize the “distance” between points (specimens) on either side of this hyperplane and the hyperplane itself; with equidistant points on each side being the ideal (though rare) case. This maximization improves the accuracy of predictions, which becomes especially important for samples that appear close to the decision boundary. Which brings us to the second remark, which is that trained predictors can sometimes not perform well generally, yet still perform extremely well in predicting one of the two labels — for instance, when it predicts a

**0**it is always right. In this case, that predictor is still very useful, but in an asymmetric prediction way.

Having laid the theoretical foundation for Logistic Regression, we now turn our attention to machine learning the MNIST dataset discussed earlier. Each sample in that dataset comes in the form of a vector of length 784 (so, 784 features per sample). Each vector represents a square *image* that is 28-pixels wide by 28-pixels high, and is *logically* grouped into consecutive vector ranges, like so:

- [0 – 27]: 28 grayscale pixel values for row-1
- [28 – 55]: 28 grayscale pixel values for row-2
- [56 – 83]: 28 grayscale pixel values for row-3
- …
- [756 – 783]: 28 grayscale pixel values for image-row-28

Each vector location stores an integer value between [0 – 255] inclusive, which is the grayscale value of the pixel it represents. Finally, as a whole, each vector is classified to a numeric label in the range [0 – 9] inclusive, corresponding to the image representation of the handwritten digit stored in that vector.

As this *Insight* covers Binary (and not Multinomial) Logistic Regression, the implementation code that follows will focus on learning/distinguishing between just two of the 10 available classes: labels 3 and 7 (chosen arbitrarily). Thus, the image vectors and the associated labels for these two cases will be extracted from the larger dataset near the beginning of the implementation code. Although we are learning to distinguish between two of the 10 possible cases here, it is easy to extend this to learn all 10 cases simultaneously by using a simple technique called “O*ne versus Rest*“.

Click on the image-link below to visit the page containing SciKit-Learn/Python3-based code that implements the solution to the binary classification problem we’ve been discussing. The solution contains helpful by step-by-step comments and annotations to the code, along with illustrative 2D images of the results returned by the trained predictor. The **raw** source code, which you can run on your laptop, is also available on our GitHub page, here:

Click on the image below to see the complete learning example, in color. The equivalent source code is available in the above GitHub link.