November 2018

Data Science for Business by Foster Provost and Tom Fawcett

The Big Idea: Invest in data and data science teams. Better data + better data scientists = better models = better business decisions = sustainable competitive advantage. 

Chapter 1: Introduction, Data Analytic Thinking

  • Data mining is the extraction of knowledge from data.
  • Data science is a set of principles to guide data mining.
  • Big data means datasets that are too large for traditional data processing systems and require new technologies such as Hadoop, HBase, MongoDB.
  • We are still in Big Data 1.0. Big Data 2.0 will be the golden era of data science.
  • Building a top-notch data science team is nontrivial but can be a tremendous strategic advantage.
  • Ex: fraud detection, Amazon, Harrah’s casinos.
  • It’s important for managers and executives to understand basic data science principles to get the most from data science projects and teams.
  • Just like chemistry is not about test tubes, data science is not about data engineering or data mining.

Chapter 2: Business Problems and Data Science Solutions

  • There are a few fundamental types of data mining tasks: classification, regression, similarity matching, clustering, co-occurrence grouping, profiling, link prediction, data reduction, and causal modeling.
  • This book focuses on classification, regression, similarity matching, and clustering.
  • Ex: churn prediction is a classification problem.
  • Supervised vs unsupervised. Supervised data mining has a specific target. Unsupervised data mining is used to learn and observe patterns in the data but doesn’t have a specific target.
  • It’s important to appropriately evaluate prediction models.
  • Your model is not what the data scientists design; it’s what the engineers build.
  • Data science engineers are software engineers who have expertise in production systems and in data science.
  • Data mining is closer to R&D than to software engineering.
  • Invest in pilot studies and throwaway prototypes.
  • Analytics skills (ability to formulate problems well, to prototype solutions, to make reasonable assumptions) are more important than software engineering skills in a data science team.
  • Useful skills for a business analyst: statistics, SQL, data warehousing, regression analysis, machine learning.
  • There is much overlap, but there is a difference between understanding the reasons for churn and predicting which customers to target to reduce future churn.
  • Ex: Who are the most profitable customers? SQL.
  • Ex: Is there really a difference between the profitable customers and the average customers? Statistics and hypothesis testing.
  • Ex: But who really are these profitable customers? Can I characterize them? SQL, statistics, automated pattern finding. Classification.
  • Ex: Will some particular new customer be profitable? How much revenue should I expect this customer to generate? Predictive model. Regression.

Chapter 3: Introduction to Predictive Modeling, From Correlation to Supervised Segmentation

  • Predictive modeling can be viewed as supervised segmentation: we have some target quantity we would like to predict.
  • A classification tree, or decision tree, is a method to classify data instances.
  • Tree structured models are a very popular data modeling technique and work remarkably well.
  • Tree structured models are also easy for business users to understand.
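
Classification trees choose each split by how much it reduces entropy, i.e. by information gain. A minimal pure-Python sketch of that calculation, using toy churn labels invented for illustration:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into the `children` subsets."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Toy churn data, split on a hypothetical binary attribute
parent = ["churn", "churn", "churn", "stay", "stay", "stay"]
left   = ["churn", "churn", "churn"]   # attribute = yes
right  = ["stay", "stay", "stay"]      # attribute = no
gain = information_gain(parent, [left, right])
```

Tree induction repeatedly picks the attribute with the highest gain and recurses on the resulting segments.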

Chapter 4: Fitting a Model to Data

  • Tuning the parameters so that the model fits the data is parameter learning or parametric modeling.
  • The most common procedure is one you’re already familiar with, linear regression.
  • Logistic regression applies linear models to class probability estimation, and is one of the most useful data mining techniques.
  • Nonlinear support vector machines and neural networks fit parameters based on complex, nonlinear functions.
  • If we increase model complexity too far, we can fit the data too well; at that point we are just memorizing the data.
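
Logistic regression passes a linear function of the features through the logistic (sigmoid) function to produce a class-probability estimate. A sketch with hypothetical, hand-picked weights rather than fitted ones:

```python
import math

def predict_proba(weights, bias, x):
    """Class-probability estimate: logistic function applied to a linear model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical "fitted" weights for two features (illustration only)
weights, bias = [1.5, -0.8], 0.2
p = predict_proba(weights, bias, [2.0, 1.0])
```

The linear part gives the log-odds; the sigmoid squashes it into [0, 1], which is why logistic regression is a linear model for class probability estimation.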

Chapter 5: Overfitting and Its Avoidance

  • Just because a model fits the data very well, doesn’t mean it is better at predicting. It could just be memorizing the data.
  • If you torture the data long enough, it will confess.
  • Fundamental tradeoff between model complexity and overfitting.
  • Always hold out data to test the model.
  • A fitting graph plots accuracy on the training data and on holdout data as model complexity increases; the gap between the two curves reveals overfitting.
  • Overfitting is bad because the model picks up spurious correlations that produce incorrect generalizations.
  • A learning curve is a plot of the generalization performance against the amount of training data.
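
The memorization pitfall, and why holdout data exposes it, can be shown with a deliberately overfit "model" that just looks up training labels by ID. The random labels below are hypothetical and carry no real signal:

```python
import random

random.seed(0)
# Toy data: (id, label) pairs where the label is pure noise
data = [(i, random.randint(0, 1)) for i in range(100)]
train, test = data[:80], data[80:]

# Extreme overfitting: memorize every training label
memorized = {i: y for i, y in train}

def predict(i):
    return memorized.get(i, 0)  # unseen IDs fall back to class 0

train_acc = sum(predict(i) == y for i, y in train) / len(train)
test_acc  = sum(predict(i) == y for i, y in test) / len(test)
```

Training accuracy is a perfect 1.0, yet the model has learned nothing that generalizes; only the holdout accuracy tells the truth.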

Chapter 6: Similarity, Neighbors, and Clusters

  • Similarity between data instances is described as distance between their feature vectors.
  • Nearest-neighbor methods predict by calculating the distance between a new data instance and its neighbors in the training set.
  • Similarity is used as the basis for the most common methods of unsupervised data mining, clustering.
  • Hierarchical clustering can provide insights that instruct further data mining.
  • A cluster centroid can be used as the basis for understanding clusters.

Chapter 7: Decision Analytic Thinking I, What is a Good Model?

  • Accuracy is too simplistic a metric.
  • A confusion matrix differentiates between different types of errors (e.g., sensitivity vs. specificity).
  • Expected value frameworks are extremely useful in organizing data science thinking and evaluating models.

Chapter 8: Visualizing Model Performance

  • A profit curve is useful for business users to evaluate classifiers.
  • A receiver operating characteristic (ROC) graph is useful for evaluating models when class priors or costs/benefits are not known.
  • Since ROC curves are not intuitive, a cumulative response curve (or lift curve) is more appropriate for some business users.
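
An ROC graph is built by sweeping a threshold down the classifier's scores and recording a (false positive rate, true positive rate) point at each step. A sketch over a toy, perfectly ranked score list:

```python
def roc_points(scored):
    """ROC curve points (FPR, TPR) from (score, label) pairs, label 1 = positive.
    Sweeps the classification threshold over the scores, highest first."""
    pos = sum(1 for _, y in scored if y == 1)
    neg = len(scored) - pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in sorted(scored, key=lambda s: -s[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Toy scores where every positive outranks every negative (perfect ranking)
scored = [(0.9, 1), (0.8, 1), (0.3, 0), (0.1, 0)]
pts = roc_points(scored)
```

A perfect ranker's curve passes through the top-left corner (0, 1); a random one hugs the diagonal. Cumulative response and lift curves are derived from the same threshold sweep, just plotted against the fraction of the population targeted.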

Chapter 9: Evidence and Probabilities

  • Bayes’ rule is used to compute conditional probabilities, which occur frequently in business problems.
  • Naive Bayes is valuable because it is very efficient, practical to use, and can learn on the fly.
  • Naive Bayes should be avoided when costs/benefits enter the decision, because its probability estimates are not well calibrated; it is best used when rankings matter more than the probabilities themselves.
  • Bayes’ rule is the basis of evidence lifts. Evidence lifts are useful for understanding data like “Facebook Likes as a predictor of High IQ”.
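
Bayes' rule and the evidence lift fall out of a few lines of arithmetic. The probabilities below are invented for a hypothetical fraud-detection example:

```python
# Bayes' rule: p(C|E) = p(E|C) * p(C) / p(E)
p_c = 0.05              # prior: assumed fraction of fraudulent accounts
p_e_given_c = 0.60      # p(evidence | fraud), assumed
p_e_given_not_c = 0.10  # p(evidence | legitimate), assumed

# Total probability of seeing the evidence at all
p_e = p_e_given_c * p_c + p_e_given_not_c * (1 - p_c)

# Posterior probability of fraud given the evidence
p_c_given_e = p_e_given_c * p_c / p_e

# Evidence lift: how much the evidence multiplies the prior
lift = p_c_given_e / p_c
```

A lift of, say, 4.8 reads as "this piece of evidence makes the class 4.8 times more likely than the base rate," which is exactly the framing used in the Facebook Likes study.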

Chapter 10: Representing and Mining Text

  • Term frequency-inverse document frequency (TFIDF) is a simple and useful data mining technique for text.
  • Topic models can also be used to assist with understanding text.
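
One common TFIDF formulation weights a term by its frequency within a document times the log of how rare it is across the corpus (exact details vary by implementation). A sketch over two toy tokenized documents:

```python
import math

def tfidf(docs):
    """Per-document TFIDF weights: term frequency * log(N / document frequency)."""
    n = len(docs)
    df = {}  # number of documents each term appears in
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            w[term] = tf * math.log(n / df[term])
        weights.append(w)
    return weights

# Toy pre-tokenized documents
docs = [["churn", "model", "model"], ["churn", "profit"]]
w = tfidf(docs)
```

Terms appearing in every document (here "churn") get weight zero; terms that are frequent in one document but rare overall score highest, which is what makes TFIDF useful for characterizing documents.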

Chapter 11: Decision Analytic Thinking II, Towards Analytical Engineering

  • Expected value framework is a core approach useful in many data science scenarios.

Chapter 12: Other Data Science Tasks and Techniques

  • Not discussed in depth in this book: co-occurrence grouping, lift and leverage, market basket analysis, profiling, link prediction, social recommendation, data reduction, latent information, bias vs variance, ensemble models, causal explanations

Chapter 13: Data Science and Business Strategy

  • Understanding data science concepts leads to awareness of new opportunities.
  • Understanding the ROI of data science results in increased investment in data and data science teams.
  • Data science is a sustainable competitive advantage.
  • A culture of data science is valuable in building a data science team.
  • A top data scientist is worth many times an average data scientist.
  • Data science is learned by working with top data scientists, either in industry or academia.
  • A top data science manager understands the technical principles, understands the business needs, and manages people and projects well.
  • There is only one reliable predictor of success of a data science research project: prior success.
  • Top data scientists want to work with other top data scientists. Most want more responsibility. Most want to be part of a fast-growing, successful company.
  • Consider funding a PhD student for $50k/year.
  • Consider taking on a data science professor or a top data science consultant as a scientific advisor to guide projects and attract data scientists.
  • An immature data science team has processes that are ad-hoc.
  • A medium-maturity data science team employs well-trained data scientists and managers.
  • A high-maturity data science team focuses on processes as well as projects.