Comparing Quality of Different Models

DevUp 2017 was another great conference! The team at devupconf.org did their usual great job of putting everything together. This year they switched venues, and the new location was perfect: lots of room and better facilities, food, and WiFi. That last one is pretty important for an Azure demo session! I saw several really good sessions and got to meet more great people. If you are looking for a copy of my presentation, it can be downloaded here. There are slides to show some of the demo screen content, but of course the demo was the heart of the presentation.

During my session on Azure ML Studio, I showed some R code I created for the purpose of comparing the quality of different models. Now ML Studio comes with a nice Evaluation Module that already does this. As shown here, it has all the important statistics as well as some nice graphs.

One problem here is that you have to keep clicking between the two options on the right to see the numbers for each model. The graph shows both models, but when the models are close in performance the numbers are easier to interpret. My second problem is that you can only compare two models. To compare more, you have to create more evaluation modules and bounce between modules (while bouncing between models within a module) to make comparisons. That may be fine most of the time, but during early exploration of a number of candidate models it quickly becomes unwieldy.

Of even more interest is whether or not your model has been overfit to the training data. The Evaluation module can’t help you there. So… a little R to the rescue. This example is for a two-class classification problem; slightly different R code will be needed to plot results for other problem types, which use different measures.

The Cross Validation module performs a ten-fold cross-validation and reports the various quality measures for the model against each of the ten folds. When tuning hyperparameters (one of the great tools of Azure ML), this can be very important for getting the best results. It also gives you a chance to see how the model performs against a variety of data subsets, which is an indication of how sensitive it might be to the particular data it is given. You need your model to generalize well and not be overfit to the training data. The “How to perform cross-validation with a parameter sweep” section of the Tune Model Hyperparameters documentation does a nice job of explaining how to do this. The Cross Validation output shows the various scores for all ten folds.

While it shows a mean and standard deviation at the bottom, this is only for one model and mentally gauging these two numbers is not that easy.
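To make the structure concrete, here is a rough mock of what that output can look like once it lands in R as a data frame. This is just an illustrative sketch with made-up values; the column names are my assumption of how the two-class measures come through once converted to valid R names, with F-Score already shown as F.Score (a rename I come back to below).

# Illustrative only: a mock of the Cross Validation output as an
# R data frame (made-up values; column names are assumptions)
set.seed(42)
folds <- data.frame(
  Fold.Number = as.character(0:9),
  Accuracy  = round(runif(10, 0.80, 0.90), 3),
  Precision = round(runif(10, 0.75, 0.88), 3),
  Recall    = round(runif(10, 0.70, 0.85), 3),
  F.Score   = round(runif(10, 0.72, 0.86), 3),
  AUC       = round(runif(10, 0.85, 0.93), 3),
  stringsAsFactors = FALSE
)
# the module appends two summary rows like these, which the
# script below strips out before graphing
summary.rows <- data.frame(
  Fold.Number = c("Mean", "Standard Deviation"),
  Accuracy  = c(mean(folds$Accuracy), sd(folds$Accuracy)),
  Precision = c(mean(folds$Precision), sd(folds$Precision)),
  Recall    = c(mean(folds$Recall), sd(folds$Recall)),
  F.Score   = c(mean(folds$F.Score), sd(folds$F.Score)),
  AUC       = c(mean(folds$AUC), sd(folds$AUC)),
  stringsAsFactors = FALSE
)
cv.results <- rbind(folds, summary.rows)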

The first block of R code processes these results for graphing:

# Map 1-based optional input ports to variables
# Prepare cross validation result sets for assessment
# of the different performance measures for each run
df <- maml.mapInputPort(1) # class: data.frame
# remove the two summary data rows as these are
# not result sets
datarows <- df$Fold.Number != "Mean" & df$Fold.Number != "Standard Deviation"
df.data <- df[datarows, ]
# enter the desired model identifier as it will
# appear in the graphs
modelname <- "Boosted DT"
df.data$Model <- modelname
maml.mapOutputPort("df.data");

You can do this for each model’s cross validation results and then rbind the sets together (or use the Add Rows module to do the same thing without having to write code). Note the need to identify each model so it can easily be distinguished from the other models in the resulting graphs. You may also need to change “F-Score” to “F.Score” with the Edit Metadata module; the R script won’t handle F-Score properly because ggplot’s aes() treats the hyphen as a minus sign rather than part of a column name.
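If you would rather do the combining and renaming in code than with Add Rows and Edit Metadata, something along these lines should work in a single Execute R Script module with two processed result sets wired to its input ports. This is just a sketch; the variable names are mine.

# Sketch of combining two models' processed cross validation results
# in one Execute R Script module (assumes each input already has the
# Model column added by the script above)
df1 <- maml.mapInputPort(1) # class: data.frame
df2 <- maml.mapInputPort(2) # class: data.frame
combined <- rbind(df1, df2)
# alternative to the Edit Metadata module: rename F-Score here so
# ggplot's aes() does not read the hyphen as subtraction
names(combined)[names(combined) == "F-Score"] <- "F.Score"
maml.mapOutputPort("combined");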

Finally, generate a graph for each measure that compares all of the models:

# plot the results of each performance measure
# by model to compare model performance and
# sensitivity to different data sets

# requires the following previously be done in the input data frame:
# - cross validation result sets without mean and standard deviation summary rows
# - a Model column to distinguish the different models' results
# - rename the F-Score column to F.Score so it can be properly interpreted

# Map 1-based optional input ports to variables
df.data <- maml.mapInputPort(1) # class: data.frame
#df2 <- maml.mapInputPort(2) # class: data.frame
library(ggplot2)
#df.data <- rbind(df.data, df2)
dataplot <- ggplot(df.data, aes(Model, Accuracy)) + geom_boxplot()
dataplot1 <- ggplot(df.data, aes(Model, Precision)) + geom_boxplot()
dataplot2 <- ggplot(df.data, aes(Model, Recall)) + geom_boxplot()
dataplot3 <- ggplot(df.data, aes(Model, F.Score)) + geom_boxplot()
dataplot4 <- ggplot(df.data, aes(Model, AUC)) + geom_boxplot()
plot(dataplot)
plot(dataplot1)
plot(dataplot2)
plot(dataplot3)
plot(dataplot4)
# Select data.frame to be sent to the output Dataset port
maml.mapOutputPort("df.data");

No matter how many model results have been added, you will see them side by side for each performance measure. You can easily compare them both in terms of how well they did and how much the results varied by fold. Ideally you want a nice tight box plot with a high average score. Below are the results for four models. Note that with only ten folds a single outlier makes up almost half of its quartile, so I would consider the whisker to effectively extend all the way to the outlier. With that said, the Decision Jungle (second model from the left) has a nice high precision (what we were tuning for in these models) and is fairly consistent in its results. All of the other models have a lower average score and are spread over a greater range of results, suggesting they are not as consistent across different data sets. Even the best model has a pretty wide spread; the limited size of the data set used here is likely partly to blame.
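If you also want the numbers behind the box plots, a quick aggregate over the combined results gives the per-model mean and standard deviation for any measure. A minimal sketch, assuming the same combined df.data used for the plots:

# Sketch: per-model mean and standard deviation of one measure,
# to back up what the box plots show (assumes the combined df.data)
precision.summary <- aggregate(Precision ~ Model, data = df.data,
                               FUN = function(x) c(mean = mean(x), sd = sd(x)))
precision.summary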

I found this made it much easier to compare the relative quality of all four models together, as well as to gauge whether a model is particularly sensitive to the data subset it is working with. You can write similar code for other problem types. The built-in Evaluation module does a nicer job of graphing ROC, Precision/Recall, and Lift curves, and of showing whether adjusting your threshold will improve things, but when comparing multiple models across the core quality measures I prefer this set of graphs.
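For example, here is a rough sketch of the same idea for a regression model. The column names are my assumption of how the regression cross validation measures would appear once converted to valid R names, so check them against your own output before using this.

# Sketch of the same approach for a regression model (column names
# are assumptions about the regression cross validation output)
df.data <- maml.mapInputPort(1) # class: data.frame
library(ggplot2)
maeplot  <- ggplot(df.data, aes(Model, Mean.Absolute.Error)) + geom_boxplot()
rmseplot <- ggplot(df.data, aes(Model, Root.Mean.Squared.Error)) + geom_boxplot()
r2plot   <- ggplot(df.data, aes(Model, Coefficient.of.Determination)) + geom_boxplot()
plot(maeplot)
plot(rmseplot)
plot(r2plot)
maml.mapOutputPort("df.data");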
