Max Kuhn Gives Talk at NYC Open Data Meetup

Posted on Feb 19, 2015

NYC Open Data Meetup was pleased to host Max Kuhn on February 17. He gave a wonderful talk about predictive analytics, focusing on several key points: prediction versus inferential statistics, the process of model building, the choice of methodology, and next steps in data science. He also spoke about the ethics that can come into play when building predictive models.

"Predictive modeling is the process of creating a model whose primary goal is to achieve high levels of accuracy." The objective here is to make the best possible prediction for each individual data instance. While this might seem like an obvious and trivial definition, Max Kuhn was making a point about the difference between predictive analytics and traditional inferential statistics: they measure different things. He quoted Friedman (2001), who describes an example related to boosted trees and maximum likelihood estimation (MLE):
"...degrading the likelihood by overfitting actually improves misclassification error rates. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality."
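
Friedman's point can be illustrated with a small sketch. The labels and predicted probabilities below are made up for illustration and do not come from the talk. One hypothetical model is cautious and another is overconfident: the overconfident one misclassifies fewer cases but has a much worse likelihood, because its single mistake is made with near certainty.

```python
import math

def log_loss(y_true, p_pred):
    """Mean negative log-likelihood for binary labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Probability the model assigned to the class that actually occurred.
        p_true_class = p if y == 1 else 1.0 - p
        total -= math.log(p_true_class)
    return total / len(y_true)

def error_rate(y_true, p_pred, threshold=0.5):
    """Fraction of cases misclassified when predicting 1 if p >= threshold."""
    wrong = sum((p >= threshold) != (y == 1) for y, p in zip(y_true, p_pred))
    return wrong / len(y_true)

# Hypothetical data: six cases with true labels and P(label == 1) from two models.
y = [1, 1, 1, 0, 0, 0]
cautious = [0.55, 0.55, 0.45, 0.45, 0.45, 0.55]            # two mild mistakes
overconfident = [0.99, 0.99, 0.99, 0.01, 0.01, 0.999999]   # one extreme mistake

for name, probs in [("cautious", cautious), ("overconfident", overconfident)]:
    print(f"{name}: log loss = {log_loss(y, probs):.3f}, "
          f"error rate = {error_rate(y, probs):.3f}")
```

Ranking the two models by log loss and by error rate gives opposite answers, which is the sense in which the two metrics measure different aspects of fit quality.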

Traditional inferential statistics focuses on the appropriateness of models as related to things like distributional assumptions, parsimony, and degrees of freedom. This is not the case in predictive modeling. In fact, in some cases the entire concept of degrees of freedom may be meaningless, for example when you have more predictors than observations. So one of the key takeaways from Max Kuhn's talk was that it is okay to throw the inferential book away if doing so makes your predictions better.

The issues in predictive modeling are not so much related to confidence intervals or probabilities. The issues to focus on are overfitting versus underfitting and bias versus variance. The choice of model may be subject to some constraints. The nature of the data is one such constraint, for example when there is a lot of multicollinearity or a lot of unlabeled data. Cross-validated cost estimates will contribute to model selection. And at times, ease of use may be a consideration if the model is going to be deployed.
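
As a rough sketch of how cross-validated error estimates can guide model selection, the example below compares k-nearest-neighbor regressors of different complexity using 5-fold cross-validation. Everything in it is an assumption made for illustration: the synthetic sine data, the choice of k-nearest neighbors, and the candidate values of k. None of it is taken from the talk. A small k tends to chase noise (overfitting), while a k equal to the training set size predicts the global mean (underfitting).

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def cv_mse(data, k, n_folds=5):
    """Mean squared error of k-NN regression estimated by n_folds-fold CV."""
    # Data is already in random order, so striding gives random folds.
    folds = [data[i::n_folds] for i in range(n_folds)]
    squared_errors = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        for x, target in held_out:
            squared_errors.append((knn_predict(train, x, k) - target) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Synthetic data (an assumption for this sketch): noisy samples of sin(x).
rng = random.Random(0)
data = []
for _ in range(60):
    x = rng.uniform(0.0, 6.0)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.2)))

# With 60 points and 5 folds, each training set has 48 points,
# so k = 48 amounts to predicting the training mean everywhere.
candidates = [1, 5, 15, 48]
scores = {k: cv_mse(data, k) for k in candidates}
best_k = min(scores, key=scores.get)

for k in candidates:
    print(f"k = {k:2d}: cross-validated MSE = {scores[k]:.3f}")
print("selected k:", best_k)
```

The selection rule here is simply the lowest cross-validated error; in practice, constraints such as ease of deployment might justify picking a slightly worse but simpler candidate.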

Primarily, though, he suggests that when doing predictive modeling one should not be deterred by the complexity of the model itself. Complex or non-linear pre-processing can make predictions better; so too can ensembles of models. Models that are overparameterized (by traditional statistical criteria) but highly regularized and non-linear can often make excellent predictions.

Many different models can work well to solve a problem. In fact, Max Kuhn believes that the currently available models are good enough for solving most problems. With a few exceptions, he believes, model improvement will have more to do with feature selection than with a new algorithm. The exceptions, areas where he believes we could use better models, are: using unlabeled data, severe class imbalances, feature engineering, applicability domain techniques, and confidence assessments. In most cases, though, it will be feature selection that improves models.

Max Kuhn told a story about analyzing molecules in an assay that had a lot of predictors but also a lot of imperfections, so the data was noisy. A purer assay could provide only a limited number of predictors, yet it was the latter that gave better results. So bigger data is not always better data. Except for anomaly detection, where one might need a lot of data to get a large enough sample of the target of interest, smaller data sets can do just as well.

Finally, one of the biggest takeaways of the evening was the importance of advocating for a model that is a superior predictor, even if it is difficult to interpret. A somewhat trivial example Max Kuhn used was a spam filter. Nobody will care that the model is difficult to interpret if it is accurate in filtering spam and keeping important email out of the junk folder. In his own career, Max Kuhn once had to defend a model that does diagnostic work when the FDA was asking for a simpler model. But his predictive model was excellent, and he did not want to compromise accuracy. When his daughter had an illness that required this very diagnostic tool, and the instrument used in his daughter's hospital was made by a different company, he wondered if the other group of data scientists had also stood their ground.

The work data scientists do is important, and in some cases can be a matter of life and death.
