Max Kuhn Gives Talk at NYC Open Data Meetup
NYC Open Data Meetup was pleased to host Max Kuhn on February 17, who gave a wonderful talk about predictive analytics. He focused on several key points: prediction vs inferential statistics, the process of model building, the choice of methodology, and next steps in data science. He also spoke about the ethics that can come into play when building predictive models.
"Predictive modeling is the process of creating a model whose primary goal is to achieve high levels of accuracy." The objective is to make the best possible prediction for an individual data instance. While this might seem like an obvious, even trivial, definition, Max Kuhn was making a point about the difference between predictive analytics and traditional inferential statistics: they measure different things. He quoted Friedman (2001), who describes an example involving boosted trees and maximum likelihood estimation (MLE).
"...degrading the likelihood by overfitting actually improves misclassification error rates. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality."
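Friedman's observation is easy to reproduce in a toy setting. The sketch below is not his boosted-tree example; it is a hypothetical pair of classifiers scored in plain Python. The "extreme" classifier makes fewer mistakes than the modestly calibrated one, yet its near-certainty about its single mistake gives it a worse likelihood (higher log-loss):

```python
import math

def log_loss(y_true, p_pred):
    # Average negative log-likelihood of the observed labels.
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

def error_rate(y_true, p_pred):
    # Fraction of instances misclassified at a 0.5 probability threshold.
    return sum(int(p >= 0.5) != y for y, p in zip(y_true, p_pred)) / len(y_true)

y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Modest, well-calibrated probabilities that misclassify two instances.
calibrated = [0.8, 0.8, 0.8, 0.8, 0.45, 0.2, 0.2, 0.2, 0.2, 0.55]
# Extreme probabilities (the kind boosting tends to produce) that misclassify
# only one instance, but are nearly certain about that one mistake.
extreme = [0.99, 0.99, 0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01, 0.99]

assert error_rate(y, extreme) < error_rate(y, calibrated)  # fewer mistakes...
assert log_loss(y, extreme) > log_loss(y, calibrated)      # ...yet worse likelihood
```

The single confidently wrong prediction (0.99 for a true 0) costs the extreme classifier about 4.6 units of log-loss on its own, while the error rate counts it the same as a barely wrong one.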
Traditional inferential statistics focus on the appropriateness of models with respect to things like distributional assumptions, parsimony, and degrees of freedom. This is not the case in predictive modeling. In fact, in some cases the entire concept of degrees of freedom may be meaningless, for example when you have more predictors than observations. So one of the key takeaways from Max Kuhn's talk was that it is okay to throw the inferential book away if doing so makes your predictions better.
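The more-predictors-than-observations situation can be sketched quickly. In this illustration (the data and the ridge penalty lam = 1.0 are made up, not from the talk), with 100 predictors and 20 observations the classical normal equations are singular and residual degrees of freedom would be negative, yet a regularized fit goes through without complaint:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # far more predictors than observations
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = [2.0, -1.0, 1.5, 0.5, -2.0]   # only five informative predictors
y = X @ w_true + 0.1 * rng.normal(size=n)

# X'X is 100x100 but has rank at most 20: the classical normal equations
# are singular, and residual degrees of freedom (n - p) would be negative.
assert np.linalg.matrix_rank(X.T @ X) <= n

# A ridge penalty makes the system solvable regardless of how p compares to n.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
train_mse = float(np.mean((X @ w_ridge - y) ** 2))
assert train_mse < float(np.var(y))  # fits far better than predicting the mean
```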
The issues in predictive modeling are not so much about confidence intervals or probabilities. The issues to focus on are overfitting vs. underfitting and bias vs. variance. The choice of model may have some constraints. One is the nature of the data, for example when there is a lot of multicollinearity or a lot of unlabeled data. Cross-validated error estimates will contribute to model selection. And at times, if the model is going to be deployed, ease of use may be a consideration.
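Cross-validated error estimates of this kind can be sketched with a hand-rolled k-fold loop. The sine-plus-noise data and the candidate polynomial degrees below are illustrative, not from the talk; the underfit straight line loses to the cubic that matches the signal's flexibility:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(2.5 * x) + 0.2 * rng.normal(size=60)   # smooth signal plus noise

def cv_mse(degree, k=5):
    """k-fold cross-validated mean squared error of a polynomial fit."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[test])
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

# An underfit line, a reasonable cubic, and a wiggly degree-10 polynomial.
scores = {d: cv_mse(d) for d in (1, 3, 10)}
assert scores[3] < scores[1]   # the cubic beats the underfit straight line
```

The same held-out estimates would be the basis for choosing among candidate models of any kind, not just polynomials.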
Primarily, though, he suggests that when doing predictive modeling one not be deterred by the complexity of the model itself. Complex or non-linear pre-processing can make predictions better; so can ensembles of models. Models that are overparameterized (by traditional statistical criteria) but highly regularized and non-linear can often make excellent predictions.
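The variance-reduction argument for ensembles can be sketched with stand-in "models". Each one below is just the truth plus independent noise, which is entirely illustrative, but it shows why averaging several imperfect predictors tends to beat any single one:

```python
import numpy as np

rng = np.random.default_rng(2)
truth = np.sin(np.linspace(0, 3, 200))

# Stand-ins for three imperfect models: the truth plus independent errors.
preds = [truth + 0.3 * rng.normal(size=truth.shape) for _ in range(3)]
ensemble = np.mean(preds, axis=0)

def mse(p):
    return float(np.mean((p - truth) ** 2))

# Averaging independent errors shrinks variance roughly by the number of models.
assert mse(ensemble) < min(mse(p) for p in preds)
```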
Many different models can work well on a given problem. In fact, Max Kuhn believes the currently available models are good enough for most problems. With a few exceptions, he believes model improvement will have more to do with feature selection than with new algorithms. The exceptions, areas where he believes we could use better models, are: using unlabeled data, severe class imbalances, feature engineering, applicability-domain techniques, and confidence assessments.
Max Kuhn told a story about analyzing molecules in an assay that offered many predictors but also had many imperfections, so the data was noisy. A purer assay could provide only a limited number of predictors, yet it was the purer assay that gave better results. So big data is not always better data; except in areas like anomaly detection, where one may need a lot of data to get a large enough sample of the target of interest, smaller data sets can do just as well.
Finally, one of the biggest takeaways of the evening was the importance of advocating for a model that is a superior predictor, even if it is difficult to interpret. A somewhat trivial example Max Kuhn used was a spam filter: nobody cares whether the model is hard to interpret as long as it accurately filters spam and keeps important email out of the junk folder. In Max Kuhn's own history, he had to defend a diagnostic model when the FDA was asking for a simpler one. His predictive model was excellent, and he did not want to compromise accuracy. Later, when his daughter had an illness that required this very kind of diagnostic tool, and the instrument used in her hospital was made by a different company, he wondered whether that company's data scientists had also stood their ground.
The work data scientists do is important, and in some cases can be a matter of life and death.