Algorithms – Data Spelunk

Whether you are just exploring a machine learning tool or trying to tackle a real problem, one challenge when starting out is deciding which algorithm(s) to try once you have done some basic data discovery and cleanup. Of course, the answer is “It depends”.

Decision Criteria

When choosing, several factors come into play.

What kind of data do you have and does it contain the “correct answer” for each input set (a requirement for supervised learning)?
What kind of output are you looking for? Predicted values, yes/no answers, one of many classifications, groupings and anomalies each have different algorithms (or variations) available.
Is your problem space related to image or audio processing, time series data or some other specialty that has specialized algorithms you may consider?
Do you have to be able to explain what it means or how it works? If all people want is an answer and a sense of how well it scored, that’s easy. But if you have to explain why it creates a specific result for a specific input, you may want to consider an algorithm like linear regression or a simple decision tree that is fairly easy to explain. Explaining why a neural network that took inputs A, B and C produced an output of W may not be practical, even if you are a guru.
How linear are the relationships (and can you make them linear by engineering the data, using a log function for example)? Some algorithms handle non-linear relationships better than others.
Other factors to consider are the size of the dataset, the number of dimensions, and how much time and money should be spent to get an answer (some algorithms train quickly while others can be more expensive).
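
To make the point about linearity a bit more concrete, here is a minimal sketch of how a log function can straighten out a curved relationship. The data is made up for illustration (an exponential curve), and the `pearson` helper is a hypothetical name of my own, not something from a particular library. It uses only the Python standard library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data that grows exponentially: y = 2 * exp(0.5 * x).
xs = [float(x) for x in range(1, 11)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Linear correlation with the raw target versus the log-transformed target.
r_raw = pearson(xs, ys)
r_log = pearson(xs, [math.log(y) for y in ys])

print(f"correlation of x with y:      {r_raw:.3f}")
print(f"correlation of x with log(y): {r_log:.3f}")
```

With the raw values the relationship is clearly curved, so the linear correlation falls well short of 1. After taking the log, the points fall on a straight line, which is the kind of input a linear algorithm handles well. Real data will be noisier, so treat this as a way to check whether a transform helps, not as a guarantee.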

Nice, but what’s a person to do?

So in come the cheat sheets! Here are a couple that I like:

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice provides a good overview with relative strengths and weaknesses. Obviously this is geared towards AzureML, but the information on each family or general type is applicable across platforms. It includes a link to the related cheat sheet: https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-cheat-sheet

http://www.dummies.com/programming/big-data/data-science/choosing-right-algorithm-machine-learning/ is a more recent find for me that had some additional insights I liked.

A more extreme example of “if this then that” guidance comes from the folks at Scikit-learn: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html. This is obviously geared towards their platform of choice, but again it offers insights into potential strengths or weaknesses of different algorithms.

Looking at several of these can help you get a better sense of what kinds of algorithms may work for your current problem.

Whatever you choose, try more than one. There is rarely a clear-cut single answer, at least not until you’ve iterated a couple of times, gone back to engineer some more factors, and tried it all again! In many cases, selecting which factors to include, which to exclude and what can be engineered may have a greater impact on your success than the specific algorithm you select. Making sure data is normalized if the algorithm needs it, dealing with missing values, outliers and covariant inputs, and just plain data wrangling are key. Some algorithms are more susceptible to these issues than others, but in almost all cases doing this work beforehand is at least as important as the choice of algorithm, if not more so, depending on how the data comes to you. So don’t let choosing an algorithm take too much of your attention. Explore a couple of likely candidates and see where they take you!
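
As a rough illustration of “try more than one”, the sketch below fits three very simple candidates to the same made-up data and compares them on points held back from training. Everything here is an assumption for the sake of the example: the data (y = x squared), the helper names (`fit_mean`, `fit_line`, `fit_nearest`, `mse`) and the choice of candidates. In practice you would more likely reach for the equivalent tools in your platform of choice.

```python
def fit_mean(train):
    """Baseline: always predict the average training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Simple linear regression (least squares) on one input."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest(train):
    """1-nearest neighbour: copy the target of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def mse(model, rows):
    """Mean squared error of a fitted model on (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Made-up, deliberately non-linear data: y = x squared for x = 0, 0.5, ..., 10.
data = [(i / 2, (i / 2) ** 2) for i in range(21)]
train = data[::2]    # whole-number x values are used for fitting
held_out = data[1::2]  # the in-between points are held back for scoring

candidates = {
    "mean baseline": fit_mean,
    "straight line": fit_line,
    "1-nearest neighbour": fit_nearest,
}
scores = {name: mse(fit(train), held_out) for name, fit in candidates.items()}

# Lowest error first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:>20}: held-out MSE = {score:.2f}")
```

On this particular curve the nearest-neighbour candidate comes out ahead of the straight line, and both beat the baseline. With different data, or after engineering the input (squaring x, say), the ranking could easily flip, which is exactly why it pays to try a few candidates and score them on data they have not seen.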

More information can be found at several other excellent blog posts. Here are a couple you may find enlightening:

Choosing Machine Learning Algorithms: Lessons from Microsoft Azure

A Tour of Machine Learning Algorithms

Choosing an Azure Machine Learning Algorithm
