Machine Learning (ML) isn’t just a big black box that understands data and is magically able to make accurate predictions about the future. There are many different ML predictive algorithms, each with its own strengths and weaknesses. With experience and some rules of thumb, data scientists look at the nature of the data and the question being asked to determine the right algorithm to use.
For a recent project, we chose to use Decision Tree based algorithms because most of the data was categorical (as opposed to continuous) and the outcome we were looking for was a binary Yes/No prediction. The idea of using a decision tree to predict something is easy for most people to understand: each node in the tree represents a Yes/No question. You start at the top of the tree and answer the question. A “yes” takes you down one branch of the tree, and a “no” takes you down the other. When you run out of questions to ask, the last node in the tree has a predictive statement such as “given all of your answers, you are 85% likely to prefer the color blue.” Unfortunately, simple Decision Trees don’t usually make very good predictions with complex real-world problems.
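The question-and-answer walk described above can be sketched in a few lines of Python. The tree, questions, and predictions here are hypothetical, made up purely for illustration:

```python
# A hand-built decision tree as nested dicts (hypothetical survey questions).
# Each internal node asks a Yes/No question; each leaf holds a prediction.
tree = {
    "question": "Do you prefer cool colors?",
    "yes": {
        "question": "Do you live near the ocean?",
        "yes": {"prediction": "85% likely to prefer blue"},
        "no": {"prediction": "60% likely to prefer green"},
    },
    "no": {"prediction": "70% likely to prefer red"},
}

def predict(node, answers):
    """Walk the tree: follow the 'yes' or 'no' branch until reaching a leaf."""
    while "prediction" not in node:
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node["prediction"]

answers = {"Do you prefer cool colors?": True,
           "Do you live near the ocean?": True}
print(predict(tree, answers))  # 85% likely to prefer blue
```

A real tree isn’t written by hand, of course; a learning algorithm chooses the questions and their order from training data. But the prediction step is exactly this walk from root to leaf.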
Roaming the Random Forest
Fortunately, there are several ways of enhancing Decision Trees. There are literally billions of unique Decision Trees that could be built from a training data set with more than 100 different attributes, like the data set for our recent work. So, which is the best one? That answer depends on the specific data that the model is tested with and, unfortunately, it’s never possible to have all of that data up front. (Remember, you’re going to use this model to try to predict what will happen based on data you’ve never seen before.) One common way to leverage the fact that there are many possible decision trees for any data set is to create many different trees, use them all, and average the results of the predictions they provide. This is what data scientists call a Random Forest. Random Forests almost always perform better than individual Decision Trees.
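Here’s a minimal sketch of that idea using scikit-learn (assumed to be installed); the toy data set and the parameters like `n_estimators=100` are illustrative, not the settings from our project:

```python
# Random Forest sketch: train many decision trees, each on a random sample of
# the data, and average their votes into one prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data set: 500 rows, 10 attributes, binary Yes/No outcome.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample with a random subset of
# attributes considered at each split; predictions average all 100 votes.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # accuracy on data the trees never saw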
CatBoost or Bust
Another way to enhance Decision Trees is to use a technique called Gradient Boosting. One of the leading Gradient Boosting implementations for decision trees is called CatBoost, and it is the algorithm that ended up working best for our recent project. With a decision tree, gradient boosting works like this: First, build a simple decision tree. For every observation in your training data, record the error in the prediction it makes. Then build another decision tree to predict that error. Combine the original prediction and the prediction of the error to generate an updated and improved prediction. Record the error in that updated prediction and build another decision tree to predict that error. Repeat until you get bored.
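The loop described above can be sketched directly. This is a hedged illustration using scikit-learn’s `DecisionTreeRegressor` rather than CatBoost itself, on made-up data, with illustrative settings like `learning_rate = 0.1`:

```python
# Gradient boosting sketch: each new shallow tree is trained to predict the
# current errors (residuals), and its output nudges the running prediction.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

prediction = np.full_like(y, y.mean())  # start with a very simple prediction
learning_rate = 0.1                     # how much each tree's correction counts

for _ in range(50):                     # "repeat until you get bored"
    residuals = y - prediction                     # record the current error
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                         # build a tree to predict it
    prediction += learning_rate * tree.predict(X)  # combine into a better guess

print(np.mean((y - prediction) ** 2))   # mean squared error after boosting
```

Each round shrinks the remaining error a little, so the ensemble of small trees ends up far more accurate than the naive starting prediction. CatBoost adds many refinements on top of this basic loop, particularly for handling categorical features.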
If you’re interested in drilling down on this topic, StatQuest has a great video that explains it in more detail. Amitech’s own Mehdi Khodayari is also a great resource for more information about these and other machine learning techniques.