Predictive Analytics by Eric Siegel

Predictive analytics is roughly what the mass media calls data science or big data.  Practitioners seem to prefer the more precise term, predictive analytics, which is what it sounds like: using analytics to predict behavior.  The book is a quick, non-technical introduction to the field, explaining basic concepts and definitions and then sharing real-world examples of predictive analytics.

My Notes:

  • Machine Learning: software algorithms that can learn from and then make predictions about data
  • Predictive analytics focuses on the micro (what is one person likely to do).  Forecasting focuses on the macro (what is the economy likely to do).
  • A predictive model generates a predictive score (credit score) based on the traits of an entity (borrower).  The score can be used to predict behavior (loan default).
  • Predictive models do not imply causation, but predictive analytics does not require causation.
  • The analysis is only as good as the data (garbage in, garbage out).
  • Only a controlled experiment (control group vs experiment group) can show causality.
  • Visualizing the data can help the analyst identify patterns to guide the model building.
  • Machine learning is multivariate.
  • The decision tree is by far the most popular methodology for building a predictive model.
  • Other methodologies include artificial neural networks, log-linear regression, support vector machines, and TreeNet.
  • The data preparation phase of building the predictive model is tedious but necessary.
  • Predictive multiplier is a simple metric used to compare predictive models.
  • Overlearning (overfitting) is when the model mistakes noise for information, which makes it too complex and less useful.
  • Typically, 80% of the data is used to train the model.  20% is set aside to test the model afterwards.
  • Kaggle is a startup that uses crowdsourcing and competitions to build predictive data models, much like Netflix’s $1m competition.
  • An ensemble model is a combination of multiple predictive models.  Ensemble models are consistently superior to a single model.
  • The most popular open source software for analytics is R.
  • An uplift model is a predictive model that predicts the influence on an individual’s behavior from one treatment vs another.  It is the analog to a controlled experiment when one is not possible.  Uplift models were heavily used by the Obama campaign.
  • Further Reading: KDNuggets, Kaggle, Competing on Analytics, Data.gov, The Signal and the Noise, The Wisdom of Crowds
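
To make a few of the notes above concrete (the 80/20 train/test split, a predictive score, and the predictive multiplier), here is a minimal Python sketch. It is not from the book: the borrower data is synthetic, the bucket-based model is a deliberately crude stand-in for a real decision tree, and every function name is made up for illustration. I am also assuming the predictive multiplier is computed as lift, meaning the default rate among the top-scored 20% of borrowers divided by the overall default rate.

```python
import random


def make_borrowers(n, seed=0):
    """Generate synthetic (debt_ratio, defaulted) records; purely illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        debt_ratio = rng.random()  # a single trait in [0, 1)
        # Assumption: the chance of default rises with the debt ratio.
        defaulted = rng.random() < 0.05 + 0.4 * debt_ratio
        rows.append((debt_ratio, defaulted))
    return rows


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records, then cut it into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_bucket_model(train, n_buckets=5):
    """Learn the observed default rate in each equal-width debt-ratio bucket."""
    counts = [0] * n_buckets
    defaults = [0] * n_buckets
    for debt_ratio, defaulted in train:
        b = min(int(debt_ratio * n_buckets), n_buckets - 1)
        counts[b] += 1
        defaults[b] += defaulted  # True counts as 1, False as 0
    return [d / c if c else 0.0 for d, c in zip(defaults, counts)]


def score(model, debt_ratio):
    """Predictive score: the learned default rate of the borrower's bucket."""
    b = min(int(debt_ratio * len(model)), len(model) - 1)
    return model[b]


def lift(model, test, top_fraction=0.2):
    """Default rate among the top-scored borrowers divided by the overall rate."""
    ranked = sorted(test, key=lambda r: score(model, r[0]), reverse=True)
    top = ranked[: max(1, int(len(ranked) * top_fraction))]
    top_rate = sum(d for _, d in top) / len(top)
    overall_rate = sum(d for _, d in test) / len(test)
    return top_rate / overall_rate


rows = make_borrowers(5000)
train, test = train_test_split(rows)   # 80% to train, 20% held out
model = fit_bucket_model(train)        # fit on the training data only
print("train/test sizes:", len(train), len(test))
print("lift on held-out data:", round(lift(model, test), 2))
```

Measuring lift on the held-out 20% rather than on the training data is what guards against overlearning: a model that has only memorized noise in the training set will not score well on records it has never seen.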
