According to Google Trends, interest in the term 'machine learning' (ML) has increased over 300% since 2013. The world has watched ML go from the realm of a relatively small number of data scientists to the mainstream of analysis and business. And while this has resulted in a plethora of innovations and improvements among our customers and for organisations worldwide, it's also provoked reactions ranging from curiosity to anxiety among people everywhere.
What is machine learning?
Machine learning is, more or less, a way for computers to learn things without being specifically programmed. But how does that actually happen? The answer in one word is: algorithms.
Algorithms are sets of rules that a computer is able to follow. Think about how you learned to do long division – maybe you learned to take the divisor and divide it into the first digits of the dividend, then subtract the subtotal and continue with the next digits until you were left with a remainder.
What does machine learning look like?
In machine learning, our goal is either prediction or clustering. Prediction is a process where, from a set of input variables, we estimate the value of an output variable. For example, using a set of characteristics of a house, we can predict its sale price. Prediction problems are divided into two main categories:
Regression problems, where the variable to predict is numerical (e.g., the price of a house).
Classification problems, where the variable to predict is one of a set of pre-defined categories, which can be as simple as "yes" or "no" (for example, predicting whether a certain piece of equipment will experience a mechanical failure).
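To make the two problem types concrete, here is a minimal sketch using hypothetical house data: the same inputs can feed a regression target (a number) or a classification target (a category derived from a threshold). The field names and the 300,000 threshold are illustrative assumptions, not from the original text.

```python
# Hypothetical house records: each has a size and a sale price.
houses = [
    {"size_m2": 50,  "price": 150_000},
    {"size_m2": 120, "price": 420_000},
]

# Regression target: the numerical price itself.
regression_targets = [h["price"] for h in houses]

# Classification target: a yes/no label derived from a price threshold.
classification_targets = ["yes" if h["price"] > 300_000 else "no"
                          for h in houses]

print(regression_targets)       # numbers -> a regression problem
print(classification_targets)   # categories -> a classification problem
```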
The most prominent and common machine learning algorithms, both historically and today, fall into three groups: linear models, tree-based models, and neural networks.
Linear Model Approach
A linear model uses a simple formula to find the "best fit" line through a set of data points. This methodology dates back over 200 years, and it has been used widely throughout statistics and machine learning. It is useful for statistics because of its simplicity – the variable you want to predict (the dependent variable) is represented as an equation of variables you know (independent variables), and so prediction is just a matter of inputting the independent variables and having the equation spit out the answer.
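The "best fit" line described above can be sketched in a few lines of plain Python: fit y = a + b·x by the classic least-squares formulas, then predict by plugging a new value into the equation. The house sizes and prices below are made-up illustrative numbers.

```python
# Hypothetical data: house sizes (m^2) and sale prices (in thousands).
sizes = [50.0, 80.0, 110.0, 140.0]     # independent variable
prices = [150.0, 240.0, 330.0, 420.0]  # dependent variable

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Slope and intercept from the closed-form least-squares solution.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
    / sum((x - mean_x) ** 2 for x in sizes)
a = mean_y - b * mean_x

# Prediction is just plugging a new size into the fitted equation.
predicted = a + b * 100.0
print(round(predicted, 1))  # the price the line implies for 100 m^2
```

In practice you would use a library routine rather than the hand-rolled formulas, but the idea – one equation, known inputs in, prediction out – is the same.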
Tree-Based Model Approach
A decision tree is a graph that uses a branching method to show each possible outcome of a decision. Think of it as ordering a salad: you first decide the type of lettuce, then the toppings, then the dressing. We can represent all possible outcomes in a decision tree. In machine learning, the branches used are binary yes/no answers.
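A hand-built sketch of such a tree, with binary yes/no branches, might look like the function below, which predicts equipment failure from two hypothetical sensor readings. The function name, readings, and thresholds are all illustrative assumptions; a real tree learns its splits from data.

```python
def will_fail(temperature: float, vibration: float) -> bool:
    """Toy decision tree: each `if` is one yes/no branch."""
    # Root split: is the machine running hot?
    if temperature > 90.0:
        # Second split: is it also vibrating heavily?
        if vibration > 0.5:
            return True    # hot and shaky -> predict failure
        return False       # hot but steady -> predict no failure
    return False           # cool machine -> predict no failure

print(will_fail(95.0, 0.7))  # True
```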
Neural Network Approach
Neural networks refer to a biological phenomenon: interconnected neurons that exchange messages with each other. This idea has been adapted to the world of machine learning in the form of artificial neural networks (ANNs).
Deep learning, which you've heard a lot about, can be done with several layers of neural networks put one after the other.
ANNs are a family of models that learn a task from examples, adjusting the strength of the connections between their artificial neurons during training.
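The basic building block can be sketched very simply: one artificial neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function (a sigmoid here). The weights below are fixed, hypothetical values; training is the process of learning them, and stacking layers of such neurons gives the deep networks mentioned above.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum + activation."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Illustrative fixed weights; in a real ANN these are learned.
out = neuron([1.0, 0.0], [2.0, -1.0], -1.0)
print(round(out, 3))  # 0.731
```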
Evaluating Machine Learning Models
There are several metrics for evaluating machine learning models, depending on whether you are working with a regression model or a classification model.
For regression models, you want to look at mean squared error and R2. Mean squared error is calculated by computing the square of all errors and averaging them over all observations. The lower this number is, the more accurate your predictions were. R2 (pronounced R-squared) is the percentage of the observed variance from the mean that is explained (that is, predicted) by your model. R2 usually falls between 0 and 1 (it can go negative for a model that fits worse than simply predicting the mean), and a higher number is better.
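Both metrics are easy to compute by hand. The sketch below uses made-up actual and predicted values and follows the definitions above: MSE is the average squared error, and R2 is one minus the ratio of the model's squared error to the squared deviation from the mean.

```python
# Hypothetical actual vs. predicted values from a regression model.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

n = len(actual)

# Mean squared error: average of the squared errors.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

# R^2: share of variance around the mean explained by the model.
mean_a = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r2 = 1.0 - ss_res / ss_tot

print(mse)           # 0.125 -- lower is better
print(round(r2, 3))  # 0.975 -- higher is better
```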
For classification models, the simplest metric for evaluating a model is accuracy. Accuracy is a common word, but in this case we have a very specific way of calculating it. Accuracy is the percentage of observations that are correctly predicted by the model. Accuracy is simple to understand, but should be interpreted with caution, in particular when the various classes to predict are unbalanced.
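The unbalanced-classes caveat is easy to demonstrate with made-up numbers: on a dataset where only 2 of 100 machines fail, a useless model that always predicts "no failure" still scores 98% accuracy.

```python
# Hypothetical labels: 1 = failure, 0 = no failure.
actual = [1] * 2 + [0] * 98

# A degenerate model that always predicts "no failure".
predicted = [0] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 0.98 -- yet the model never catches a single failure
```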
Another metric you might come across is the ROC AUC. ROC stands for "receiver operating characteristic" and AUC for "area under the curve"; together they measure how well the model separates the classes across all possible decision thresholds. A higher ROC AUC generally means you have a better model. Logarithmic loss, or log loss, is a metric often used in competitions like those run by Kaggle, and it is applied when your classification model outputs not strict classifications (e.g., true and false), but class membership probabilities (e.g., a 10% chance of being true, a 75% chance of being true, etc.). Log loss applies heavier penalties to incorrect predictions that your model made with high confidence.
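The penalty behaviour of log loss can be sketched directly from its definition for binary labels. The `log_loss` helper below is an illustrative implementation written for this example (not a library function), and the probabilities are made up.

```python
import math

def log_loss(y_true, y_prob):
    """Average binary log loss over a set of predictions."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

# A confident, correct prediction (90% for a true positive) costs little;
# the same confidence on a wrong answer costs far more.
print(round(log_loss([1], [0.9]), 3))  # small penalty
print(round(log_loss([0], [0.9]), 3))  # large penalty
```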
What does it all mean?
Once limited to the realm of hardcore data science, machine learning has now become a reality for businesses seeking to deliver insights out of their ever-growing data stores. Simply put, ML can be integrated into existing code to enrich data analytics projects or deliver insights for multiple data sets, making ML one of the best friends that Business Intelligence (BI) professionals could have. What's more, ML accuracy increases with use, meaning the more an adaptive algorithm powered by an intelligent machine is used, the better it gets at doing its job.
Jennifer Roubaud is the VP of the UK and Ireland for Dataiku, the maker of the all-in-one data science software platform Dataiku Data Science Studio (DSS), a unique advanced analytics software solution that enables companies to build and deliver their own data products more efficiently. To learn more about Machine Learning download the free guidebook from Dataiku: Machine Learning Basics
Newsweek's AI and Data Science in Capital Markets conference on December 6-7 in New York is the most important gathering of experts in Artificial Intelligence and Machine Learning in trading. Join us for two days of talks, workshops and networking sessions with key industry players.