Kevin Queen – Data Spelunk

Summer break is over…. :)

Summer has certainly had its distractions – hope you have enjoyed yours!

I’m excited to be presenting again, this time at the Devup 2017 Conference in October. This year it will be at a new location – the Saint Charles Convention Center. It’s a great venue (with decent WiFi – yeah!). It is the same location used for the STL SilverLinings Conference 2017 in May of this year, and I am looking forward to returning there.

I’ll be presenting on Azure ML Studio – which continues to grow in its offerings and capabilities (now with neural nets)! The biggest challenge will be deciding how much ground can be covered in a little less than an hour <grin>. Click the title of this posting and tell me in the comments which you prefer for a presentation on a new area: briefly covering everything, or only mentioning most features and going into more depth on the core ones. I’m kind of torn as to which approach to take – so please provide me with your suggestions!

Azure ML Studio is very exciting. It allows organizations to quickly leverage the power of machine learning to derive value from their data and either enhance operations or improve decisions. The ability to easily generate web services that can be used with Excel or incorporated into custom applications or websites puts this capability within reach at a reasonable cost. An analyst reviewing data for insights can use their Excel spreadsheet to get predictions, classifications, etc. using the models developed and run in Azure against large data sets. Functional programs can run real-time, on-the-fly analysis for everything from anomaly detection (like fraud) to classifications (risk, prospect, customer type, etc.) and predictions (future demand, value, or another numeric quantity). This is a very flexible tool.

When presenting on such a topic – particularly in an introductory level presentation – there is a plethora of topics to discuss. Some include:

Supported algorithms and models
Data sources
Data manipulation and cleansing
Web Service generation and publishing
R, SQL, and Python integration
Data exports
Full life cycle development of the solution

Not to mention just the general “how do you get started and use the thing”.

Share your ideas on what deserves the most attention given a relatively short period for presentation (an hour). I’d really like to hear your thoughts. And if you want to poke around on your own – check out http://studio.azureml.net! Nah, I don’t get any kickbacks for promoting this – I just think it’s a great tool.

How to choose an algorithm

Whether just exploring a machine learning tool or trying to tackle a problem, one challenge when starting out is which algorithm(s) to try once you have done some basic data discovery and clean up. Of course, the answer is “It depends”.

Decision Criteria

When choosing, several factors come into play.

What kind of data do you have and does it contain the “correct answer” for each input set (a requirement for supervised learning)?
What kind of output are you looking for? Predicted values, Yes/No or one-of-many classifications, groupings and anomalies each have different algorithms (or variations) available.
Is your problem space related to image or audio processing, time series data or some other specialty that has specialized algorithms you may consider?
Do you have to be able to explain what it means or how it works? If all people want is an answer and a sense of how it scored, that’s easy. But if you have to explain why it creates a specific result for a specific input, you may want to consider an algorithm like linear regression or a simple decision tree that is fairly easy to explain. Explaining why a neural network that took an input of A, B and C produced an output of W may not be practical even if you are a guru.
How linear are the relationships (and can you make them linear by engineering the data using a log function for example)? Some algorithms handle non-linear relationships better than others.
Other factors to consider are the size of the dataset, the number of dimensions, and how much time and money should be spent to get an answer (some algorithms train quickly while others can be more expensive).
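To make the point about engineering a relationship into a linear one a bit more concrete, here is a minimal sketch in plain Python. The data is made up (an exponential curve with no noise), and the helper name r_squared is just something chosen for this illustration. It measures how well a straight line fits the raw values versus the log of the same values.

```python
import math


def r_squared(xs, ys):
    """R^2 of the least-squares straight line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up example data: the target grows exponentially with the input.
xs = list(range(1, 21))
ys = [3.0 * math.exp(0.4 * x) for x in xs]

# Straight-line fit against the raw target versus the log of the target.
raw_fit = r_squared(xs, ys)
log_fit = r_squared(xs, [math.log(y) for y in ys])

print(f"R^2 on the raw target: {raw_fit:.3f}")
print(f"R^2 on the log target: {log_fit:.3f}")
```

On data like this the log of the target is an exact straight line, so a linear model that struggles with the raw values fits the transformed ones almost perfectly. Real data will be noisier, and a model trained on the log scale needs its predictions converted back with exp().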

Nice, but what’s a person to do?

So in come the cheat sheets! Here are a couple that I like:

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice provides a good overview with relative strengths and weaknesses. Obviously this is geared towards AzureML but the information on each family or general type is applicable across platforms. It includes a link to the related cheat sheet https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-cheat-sheet

http://www.dummies.com/programming/big-data/data-science/choosing-right-algorithm-machine-learning/ is a more recent find for me that had some additional insights I liked.

A more extreme example of “if this then that” guidance comes from the folks at Scikit-learn. http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html This is obviously geared towards their platform of choice, but again it offers insights into potential strengths or weaknesses of different algorithms.

Looking at several of these can help you get a better sense of what kinds of algorithms may work for your current problem.

Whatever you choose, try more than one. There rarely is a clear-cut single answer – at least not until you’ve iterated a couple of times, gone back to engineer some more factors, and tried it all again! In many cases selecting which factors to include, which to exclude and what can be engineered may have a greater impact on your success than a specific algorithm selection. Making sure data is normalized if the algorithm needs it, dealing with missing values, outliers, covariant inputs and just plain data wrangling are key. Some algorithms are more susceptible to these issues than others, but in almost all cases doing this work beforehand is at least as important as the algorithm choice, if not more so, depending on how the data comes to you. So don’t let choosing an algorithm take too much of your attention. Explore a couple of likely candidates and see where they take you!
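As a rough illustration of "try more than one", here is a small sketch in plain Python with no ML library. Everything in it is an assumption made for the example: the made-up curved data, the two deliberately simple candidates (a straight-line fit and a 3-nearest-neighbour average), and helper names such as fit_line and fit_knn. The idea is just to normalize the input, hold back some data, and score each candidate the same way.

```python
from statistics import fmean, pstdev


def standardize(train, other):
    """Scale both lists using the mean and spread of the training data only."""
    mean, spread = fmean(train), pstdev(train)
    return ([(v - mean) / spread for v in train],
            [(v - mean) / spread for v in other])


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=3):
    """Average the targets of the k closest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return fmean(y for _, y in nearest)

    return predict


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true targets."""
    return fmean(abs(model(x) - y) for x, y in zip(xs, ys))


# Made-up data with a curved (non-linear) relationship.
xs = [i * 0.25 for i in range(40)]
ys = [x ** 2 for x in xs]

# Hold out every other point so each candidate is scored on unseen data.
train_x, test_x = xs[0::2], xs[1::2]
train_y, test_y = ys[0::2], ys[1::2]

# With one input this scaling changes little; it matters for distance-based
# models once several inputs on different scales are involved.
train_x, test_x = standardize(train_x, test_x)

candidates = {"straight line": fit_line, "3 nearest neighbours": fit_knn}
errors = {
    name: mean_abs_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}

for name, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name}: mean absolute error = {err:.2f}")
```

On this curved toy data the neighbour-based candidate comes out ahead of the straight line, but the ranking depends entirely on the data. With a real problem you would swap in the candidates suggested by the cheat sheets and keep the scoring step the same for all of them.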

More information can be found in several other excellent blog posts. Here are a couple you may find enlightening:

Choosing Machine Learning Algorithms: Lessons from Microsoft Azure

A Tour of Machine Learning Algorithms

Choosing an Azure Machine Learning Algorithm

Kicking things off

Welcome to my new blog. I look forward to exploring a variety of topics – primarily related to machine learning in one form or another. My focus is more on leveraging machine learning in applications rather than research (which obviously has its own value). Researchers have been using this tool set for years, but the technology has advanced to a point where many companies and organizations can begin to reap the benefits in their own application / environment. I don’t pretend to know it all so feel free to share your thoughts and experiences!