Max Kuhn Gives Talk at NYC Open Data Meetup

Posted on Feb 19, 2015

NYC Open Data Meetup was pleased to host Max Kuhn on February 17, who gave a wonderful talk about predictive analytics. He focused on several key points: prediction vs inferential statistics, the process of model building, the choice of methodology, and next steps in data science. He also spoke about the ethics that can come into play when building predictive models.

"Predictive modeling is the process of creating a model whose primary goal is to achieve high levels of accuracy". The objective here is to make the best possible prediction in an individual data instance. While this might seem like an obvious and trivial definition, Max Kuhn was making a point about the difference between predictive analytics and traditional inferential statistics. They are measuring different things. He quotes Friedman (2001), who describes an example related to boosted trees and MLE.
"...degrading the likelihood by overfitting actually improves misclassification error rates. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality."

Traditional inferential statistics focuses on the appropriateness of models with respect to things like distributional assumptions, parsimony, and degrees of freedom. This is not the case in predictive modeling. In fact, in some cases the entire concept of degrees of freedom may be meaningless, for example when you have more predictors than observations. So one of the key takeaways from Max Kuhn's talk was that it is okay to throw the inferential book away if doing so makes your predictions better.
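The more-predictors-than-observations situation is easy to simulate (a minimal sketch using scikit-learn on synthetic data, not anything shown in the talk): with 100 predictors and only 30 observations, classical degrees-of-freedom accounting breaks down, yet a regularized model still produces a usable fit.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical p > n setting: 30 observations, 100 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))
coef = np.zeros(100)
coef[:5] = 1.0                        # only the first 5 predictors matter
y = X @ coef + rng.normal(scale=0.1, size=30)

# An unpenalized least-squares fit is ill-posed here; the ridge penalty
# makes the problem solvable.
model = Ridge(alpha=1.0).fit(X, y)
print(round(model.score(X, y), 3))    # in-sample R^2
```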

The central issues in predictive modeling are not so much confidence intervals or probabilities; the issues to focus on are overfitting versus underfitting and the bias-variance trade-off. The choice of model may face some constraints. The nature of the data is one, for example a lot of multicollinearity or a lot of unlabeled data. Cross-validated cost estimates will contribute to model selection, and at times ease of use may be a consideration if the model is going to be deployed.
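The role of cross-validated estimates in model selection can be sketched with scikit-learn (the data, candidate models, and settings below are illustrative placeholders, not anything from the talk):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real feature matrix and label vector.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Selection is driven by held-out performance, not in-sample fit.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```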

Primarily, though, he suggested that when doing predictive modeling one should not be deterred by the complexity of the model itself. Complex or non-linear pre-processing can improve predictions, and so can ensembles of models. Models that are over-parameterized by traditional statistical criteria, but highly regularized and non-linear, can often make excellent predictions.
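An ensemble of the kind he describes might look like the following (a hedged sketch on synthetic data; the member models and settings are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a real modeling problem.
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Soft voting averages the members' predicted class probabilities; the
# combined model is harder to interpret but often predicts better than
# any single member.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
    ],
    voting="soft",
).fit(X_tr, y_tr)

print(round(ensemble.score(X_te, y_te), 3))  # held-out accuracy
```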

Many different models can work well on a given problem. In fact, Max Kuhn believes that the currently available models are good enough for most problems. With a few exceptions, he expects model improvement to come more from feature selection than from new algorithms. The exceptions, areas where he believes we could use better models, are: using unlabeled data, severe class imbalance, feature engineering, applicability-domain techniques, and confidence assessments.

Max Kuhn told a story about analyzing molecules from an assay that produced many predictors but also many imperfections, so the data were noisy. A purer assay could provide only a limited number of predictors, yet it was the latter that gave better results. So big data is not always better data; except in settings like anomaly detection, where one may need a lot of data to get a large enough sample of the target of interest, smaller data sets can do just as well.

Finally, one of the biggest takeaways of the evening was the importance of advocating for a model that is a superior predictor, even if it is difficult to interpret. A somewhat trivial example Max Kuhn used was a spam filter: nobody will care that the model is difficult to interpret if it accurately filters spam and keeps important email out of the junk folder. In Max Kuhn's own history, he had to defend a diagnostic model when the FDA asked for a simpler one; his predictive model was excellent, and he did not want to compromise accuracy. Later, when his daughter had an illness that required this very kind of diagnostic tool, and the instrument used in her hospital was made by a different company, he wondered whether the other group of data scientists had also stood their ground.

The work data scientists do is important, and in some cases can be a matter of life and death.
