Comparing Quality of Different Models

DevUp 2017 was another great conference! The team at devupconf.org did their usual great job of getting everything put together. This year they switched venues, and the new location was perfect: lots of room and better facilities, food, and WiFi. That last one is pretty important for an Azure demo session! I saw several really good sessions and got to meet more great people. If you are looking for a copy of my presentation, it can be downloaded here. There are slides to show some of the demo screen content, but of course the demo was the heart of the presentation.

During my session on Azure ML Studio, I showed some R code I created for the purpose of comparing the quality of different models. Now ML Studio comes with a nice Evaluation Module that already does this. As shown here, it has all the important statistics as well as some nice graphs.

One problem here is that you have to keep clicking between the two options on the right to see the numbers for the two models. The graph shows both models, but when models are close in performance, the numbers are easier to interpret. My second problem is that you can only compare two models. To compare more, you have to create more Evaluation modules and bounce between modules (while bouncing between models within a module) to make comparisons. It quickly becomes unwieldy. Most of the time this may be all right, but in the early exploration of a number of models it is cumbersome.

Of even more interest is whether or not your model has been overfit to the training data. The Evaluation module can’t help you there. So…. a little R to the rescue. This example is for a two class classification problem. Slightly different R code will be needed to plot results for other problem types using different measures.

The Cross Validation module will do a ten-fold cross validation and report the various quality measures for the model against each of the ten folds. When tuning hyperparameters (one of the great tools of Azure ML), this can be very important for getting the best results. It also gives you a chance to see how the model performs against a variety of subsets, giving an indication of how sensitive it might be to different data subsets. You need your model to generalize well and not be overfit to the training data. The “How to perform cross-validation with a parameter sweep” section of the Tune Model Hyperparameters documentation does a nice job of explaining how to do this. The Cross Validation output shows the various scores for all ten folds.

While it shows a mean and standard deviation at the bottom, this is only for one model, and mentally comparing models from just those two numbers is not that easy.

The first block of R code processes these results for graphing:

# Map 1-based optional input ports to variables
# Prepare cross validation result sets for assessment
# of the different performance measures for each run
df <- maml.mapInputPort(1) # class: data.frame
# remove the two summary data rows as these are
# not result sets
datarows <- df$Fold.Number != "Mean" & df$Fold.Number != "Standard Deviation"
df.data <- df[datarows,]
# enter the desired model identifier as it will
# appear in the graphs
modelname <- "Boosted DT"
df.data$Model <- modelname
maml.mapOutputPort("df.data");

You can do this for each model's cross validation results and then rbind the sets together (or use the Add Rows module to do the same thing without having to write code). Note the need to identify each model so it can be easily distinguished from the other models in the resulting graph. You may also need to rename the "F-Score" column to "F.Score" using the Edit Metadata module; R won't properly interpret F-Score in the plotting code because the hyphen gets parsed as a minus sign rather than as part of the column name.
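
If you would rather combine the result sets in code than with the Add Rows module, a minimal sketch might look like the following. It assumes two labeled cross validation result sets are wired to the two input ports of an Execute R Script module, and that the column really does arrive in R named "F-Score" (depending on how your experiment is set up, the rename may not be needed or may be better handled by Edit Metadata as described above):

# Map 1-based optional input ports to variables
df1 <- maml.mapInputPort(1) # class: data.frame - first model's labeled folds
df2 <- maml.mapInputPort(2) # class: data.frame - second model's labeled folds

# Stack the per-fold results from both models into one data frame
df.all <- rbind(df1, df2)

# Rename F-Score to F.Score so later code can reference it directly
# (the hyphen would otherwise be parsed as subtraction)
names(df.all)[names(df.all) == "F-Score"] <- "F.Score"

# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("df.all");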

Finally generate a graph that compares all models across each measure.

# plot the results of each performance measure
# by model to compare model performance and
# sensitivity to different data sets

# requires the following to have been done in the input data frame:
# - cross validation result sets without the mean and standard deviation summary rows
# - a Model column to distinguish the different models' results
# - the F-Score column renamed to F.Score so it can be properly interpreted

# Map 1-based optional input ports to variables
df.data <- maml.mapInputPort(1) # class: data.frame
#df2 <- maml.mapInputPort(2) # class: data.frame
library(ggplot2)
#df.data <- rbind(df1, df2)
dataplot <- ggplot(df.data, aes(Model, Accuracy)) + geom_boxplot()
dataplot1 <- ggplot(df.data, aes(Model, Precision)) + geom_boxplot()
dataplot2 <- ggplot(df.data, aes(Model, Recall)) + geom_boxplot()
dataplot3 <- ggplot(df.data, aes(Model, F.Score)) + geom_boxplot()
dataplot4 <- ggplot(df.data, aes(Model, AUC)) + geom_boxplot()
plot(dataplot)
plot(dataplot1)
plot(dataplot2)
plot(dataplot3)
plot(dataplot4)
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("df.data");

No matter how many model results have been added, you will see them side by side for each performance measure. You can easily compare them both in terms of how well they did and in how much the results varied by fold. Ideally you want a nice tight box plot with a high average score. Below are the results for four models. Note that with only ten folds, a single outlier is almost half of a quartile, so I would consider the whisker to extend all the way to the outlier. With that said, the Decision Jungle (second model from the left) has a nice high precision (what we were tuning for in these models) and is fairly consistent in its results. All of the other models have a lower average score and are spread over a greater range of results, suggesting they are not as consistent across different data sets. Even the best model has a pretty wide spread – the limited size of the data set used here is likely partially to blame.

I found this made it much easier to compare the relative quality of all four models together, as well as to gauge whether a model is particularly sensitive to the data subset it is working with. You can write similar code for other problem types. The built-in Evaluation module does a nicer job of graphing ROC, Precision/Recall, and Lift curves, as well as showing whether adjusting your threshold will improve things. But when comparing multiple models across the core quality measures, I find I prefer this set of graphs.
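
As a rough illustration of what "similar code for other problem types" could look like, here is a hedged sketch for a regression model's cross validation results. The measure column names (Mean.Absolute.Error, Root.Mean.Squared.Error, Coefficient.of.Determination) are assumptions about how the scores would arrive in the R data frame and may need adjusting to match your actual output:

# Compare regression models across cross validation folds
# (measure column names below are assumed - adjust to your data)
df.data <- maml.mapInputPort(1) # class: data.frame
library(ggplot2)
plot(ggplot(df.data, aes(Model, Mean.Absolute.Error)) + geom_boxplot())
plot(ggplot(df.data, aes(Model, Root.Mean.Squared.Error)) + geom_boxplot())
plot(ggplot(df.data, aes(Model, Coefficient.of.Determination)) + geom_boxplot())
maml.mapOutputPort("df.data");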

Summer break is over…. :)

Summer has certainly had its distractions – hope you have enjoyed yours!

I’m excited to be presenting again, this time at the DevUp 2017 Conference in October. This year it will be at a new location – the Saint Charles Convention Center. It’s a great venue (with decent WiFi – yeah!), the same one used for the STL SilverLinings Conference 2017 in May of this year, and I am looking forward to returning.

I’ll be presenting on Azure ML Studio – which continues to grow in its offerings and capabilities (now with neural nets)! The biggest challenge will be deciding how much ground can be covered in a little less than an hour <grin>. Click the title of this posting and tell me in the comments whether you prefer a presentation on a new area to briefly cover everything, or to only touch on most features and go into more depth on the core ones. I’m kind of torn as to which approach to take – so please provide me with your suggestions!

Azure ML Studio is very exciting. It allows organizations to quickly leverage the power of machine learning to derive value from their data and either enhance operations or improve decisions. The ability to easily generate web services that can be used from Excel or readily incorporated into custom applications or websites puts this capability within reach at reasonable cost. An analyst reviewing data for insights can use an Excel spreadsheet to get predictions, classifications, and so on from the models developed and run in Azure against large data sets. Applications can run real-time, on-the-fly analysis for everything from anomaly detection (like fraud) to classification (risk, prospect, customer type, etc.) and prediction (future demand, value, or other numeric targets). This is a very flexible tool.
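
As a rough sketch of the web service side, here is what calling a published request/response endpoint from R with the httr package might look like. The URL, API key, column names, and input values are all placeholders, and the exact request body layout should be taken from your web service's own API help page:

library(httr)

# Placeholders - copy the real values from your web service's API help page
api_url <- "https://your-region.services.azureml.net/.../execute?api-version=2.0"
api_key <- "YOUR_API_KEY"

# One row of input features; names and values must match the service's schema
request_json <- '{
  "Inputs": {
    "input1": {
      "ColumnNames": ["Age", "Income", "Tenure"],
      "Values": [["42", "55000", "3"]]
    }
  },
  "GlobalParameters": {}
}'

response <- POST(api_url,
                 add_headers(Authorization = paste("Bearer", api_key)),
                 content_type_json(),
                 body = request_json)

# The scored labels and probabilities come back as JSON
content(response, as = "text")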

When presenting on such a topic – particularly at an introductory level – there is a plethora of topics to discuss. Some include:

  • Supported algorithms and models
  • Data sources
  • Data manipulation and cleansing
  • Web Service generation and publishing
  • R, SQL, and Python integration
  • Data exports
  • Full life cycle development of the solution

Not to mention just the general “how do you get started and use the thing”.

Share your ideas on what deserves the most attention given a relatively short period for the presentation (an hour). I’d really like to hear your thoughts. And if you want to poke around on your own – check out http://studio.azureml.net! Nah, I don’t get any kickbacks for promoting this – I just think it’s a great tool.

How to choose an algorithm

Whether just exploring a machine learning tool or trying to tackle a problem, one challenge when starting out is which algorithm(s) to try once you have done some basic data discovery and clean up. Of course, the answer is “It depends”.

Decision Criteria

When choosing, several factors come into play.

  • What kind of data do you have and does it contain the “correct answer” for each input set (a requirement for supervised learning)?
  • What kind of output are you looking for? Predicted values, yes/no or multi-class classifications, groupings, and anomalies each have different algorithms (or variations) available.
  • Is your problem space related to image or audio processing, time series data or some other specialty that has specialized algorithms you may consider?
  • Do you have to be able to explain what it means or how it works? If all people want is an answer and a sense of how it scored, that’s easy. But if you have to explain why it produces a specific result for a specific input, you may want to consider an algorithm like linear regression or a simple decision tree that is fairly easy to explain. Explaining why a neural network turned inputs A, B, and C into output W may not be practical even if you are a guru.
  • How linear are the relationships (and can you make them linear by engineering the data, using a log function for example)? Some algorithms handle non-linear relationships better than others; a small sketch of the log idea follows this list.
  • Other factors to consider are the size of the dataset, the number of dimensions, and how much time and money should be spent to get an answer (some algorithms train quickly while others can be more expensive).
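
As promised in the linearity bullet above, here is a small, self-contained R sketch (made-up data, not from any particular experiment) of how a log transform can turn an exponential-looking relationship into one a plain linear model fits well:

# Made-up data with an exponential relationship between x and y
set.seed(42)
x <- runif(200, 1, 10)
y <- exp(0.8 * x) * rlnorm(200, sdlog = 0.2)

# A straight linear fit struggles with the raw values...
fit.raw <- lm(y ~ x)
# ...but taking the log of y makes the relationship linear
fit.log <- lm(log(y) ~ x)

summary(fit.raw)$r.squared # noticeably lower for this engineered example
summary(fit.log)$r.squared # close to 1 for this engineered example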

Nice, but what’s a person to do?

So in come the cheat sheets! Here are a couple that I like:

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice provides a good overview with relative strength and weaknesses. Obviously this is geared towards AzureML but the information on each family or general type is applicable across platforms. It includes a link to the related cheat sheet https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-cheat-sheet

http://www.dummies.com/programming/big-data/data-science/choosing-right-algorithm-machine-learning/ is a more recent find for me that had some additional insights I liked.

A more extreme example of “if this then that” guidance comes from the folks at Scikit-learn. http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html This is obviously geared towards their platform of choice, but again it offers insights into potential strengths or weaknesses of different algorithms.

Looking at several of these can help you get a better sense of what kinds of algorithms may work for your current problem.

Whatever you choose, try more than one. There rarely is a clear-cut single answer – at least not until you’ve iterated a couple of times, gone back to engineer some more factors, and tried it all again! In many cases, selecting which factors to include, which to exclude, and what can be engineered may have a greater impact on your success than the specific algorithm you pick. Making sure data is normalized if the algorithm needs it, dealing with missing values, outliers, covariant inputs, and just plain data wrangling are key. Some algorithms are more susceptible to these issues than others, but in almost all cases doing this work beforehand is at least as important as the algorithm choice – if not more so, depending on how the data comes to you. So don’t let choosing an algorithm take too much of your attention. Explore a couple of likely candidates and see where they take you!
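
As a minimal illustration of that data wrangling point (plain R with made-up values, not tied to any particular Azure ML module), here is one way a few of those prep steps might look:

# Made-up data frame with a couple of missing values
df <- data.frame(
  age    = c(25, 31, NA, 47, 52, 38),
  income = c(48000, 52000, 61000, NA, 120000, 75000),
  spend  = c(4800, 5300, 6000, 7200, 11900, 7600)
)

# Impute missing numeric values with the column median
for (col in names(df)) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}

# Min-max normalize each column to the 0-1 range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
df.norm <- as.data.frame(lapply(df, normalize))

# Check for strongly correlated (covariant) inputs that may be redundant
round(cor(df.norm), 2)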

More information can be found at several other excellent blog posts. Here are a couple you may find enlightening:

Choosing Machine Learning Algorithms: Lessons from Microsoft Azure

A Tour of Machine Learning Algorithms

Choosing an Azure Machine Learning Algorithm

Kicking things off

Welcome to my new blog. I look forward to exploring a variety of topics – primarily related to machine learning in one form or another. My focus is more on leveraging machine learning in applications than on research (which obviously has its own value). Researchers have been using this tool set for years, but the technology has advanced to a point where many companies and organizations can begin to reap the benefits in their own applications and environments. I don’t pretend to know it all, so feel free to share your thoughts and experiences!