The Big Idea: Invest in data and data science teams. Better data + better data scientists = better models = better business decisions = sustainable competitive advantage.
Chapter 1: Introduction, Data Analytic Thinking
- Data mining is the extraction of knowledge from data.
- Data science is a set of principles to guide data mining.
- Big data means datasets that are too large for traditional data processing systems and require new technologies such as Hadoop, HBase, and MongoDB.
- We are still in Big Data 1.0. Big Data 2.0 will be the golden era of data science.
- Building a top-notch data science team is nontrivial but can be a tremendous strategic advantage.
- Ex: fraud detection, Amazon, Harrah’s casinos.
- It’s important for managers and executives to understand basic data science principles to get the most from data science projects and teams.
- Just like chemistry is not about test tubes, data science is not about data engineering or data mining.
Chapter 2: Business Problems and Data Science Solutions
- There are a few fundamental types of data mining tasks: classification, regression, similarity matching, clustering, co-occurrence grouping, profiling, link prediction, data reduction, causal modeling.
- This book will focus on: classification, regression, similarity matching, and clustering.
- Ex: churn prediction is a classification problem.
- Supervised vs unsupervised. Supervised data mining has a specific target. Unsupervised data mining is used to learn and observe patterns in the data but doesn’t have a specific target.
- It’s important to appropriately evaluate prediction models.
- Your model is not what the data scientists design, it’s what the engineers build.
- Data science engineers are software engineers who have expertise in production systems and in data science.
- Data mining is closer to R&D than to software engineering.
- Invest in pilot studies and throwaway prototypes.
- Analytics skills (ability to formulate problems well, to prototype solutions, to make reasonable assumptions) are more important than software engineering skills in a data science team.
- Useful skills for a business analyst: statistics, SQL, data warehousing, regression analysis, machine learning.
- There is much overlap, but there is a difference: understanding the reasons for churn is not the same as predicting which customers to target to reduce future churn.
- Ex. Who are the most profitable customers? SQL
- Ex. Is there really a difference between the profitable customers and the average customers? Statistics and hypothesis testing.
- Ex. But who really are these profitable customers? Can I characterize them? SQL, statistics, automated pattern finding. Classification.
- Ex. Will some particular new customer be profitable? How much revenue should I expect this customer to generate? Predictive model. Regression.
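The first of the example questions above ("Who are the most profitable customers?") is a plain SQL ranking problem. A minimal sketch using Python's built-in sqlite3 and an invented toy customer table (names and amounts are illustrative, not from the book):

```python
import sqlite3

# Hypothetical toy customer table; revenue/cost figures are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, revenue REAL, cost REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", 1200.0, 300.0), ("Bob", 400.0, 350.0), ("Cara", 900.0, 200.0)],
)

# "Who are the most profitable customers?" as a SQL ranking query.
rows = conn.execute(
    "SELECT name, revenue - cost AS profit FROM customers ORDER BY profit DESC"
).fetchall()
most_profitable = rows[0][0]
```

The later questions in the list (characterizing and predicting profitability) are where SQL alone stops and statistics, classification, and regression take over.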
Chapter 3: Introduction to Predictive Modeling, From Correlation to Supervised Segmentation
- Predictive modeling is supervised segmentation. We have some target quantity we would like to predict.
- A classification tree, or decision tree, is a method to classify data instances.
- Tree structured models are a very popular data modeling technique and work remarkably well.
- Tree structured models are also easy for business users to understand.
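As a minimal sketch of supervised segmentation with a classification tree, assuming scikit-learn is available; the churn-style features, labels, and the new customer are all invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy churn-style data: [age, monthly_usage]; label 1 = churned (illustrative).
X = [[25, 10], [30, 5], [45, 40], [50, 35], [23, 8], [48, 42]]
y = [1, 1, 0, 0, 1, 0]

# A shallow tree keeps the segmentation readable for business users.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
prediction = tree.predict([[28, 7]])[0]  # classify a new customer
rules = export_text(tree, feature_names=["age", "monthly_usage"])
```

Printing `rules` yields the human-readable if/then segmentation that makes tree models easy to explain.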
Chapter 4: Fitting a Model to Data
- Tuning the parameters so that the model fits the data is parameter learning or parametric modeling.
- The most common procedure is one you’re already familiar with, linear regression.
- Logistic regression applies linear models to class probability estimation, and is one of the most useful data mining techniques.
- Nonlinear support vector machines and neural networks fit parameters based on complex, nonlinear functions.
- If we increase model complexity, we can fit the data too well; at that point we are just memorizing the data.
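A minimal sketch of logistic regression as class probability estimation, assuming scikit-learn; the one-dimensional data is invented so the probability of class 1 rises with the feature:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative 1-D data: class 1 becomes likelier as the feature grows.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear model to the log-odds of class membership.
model = LogisticRegression().fit(X, y)
p = model.predict_proba([[4.5]])[0, 1]  # estimated probability of class 1
```

The fitted parameters define a linear decision boundary; the logistic function turns the linear score into a probability estimate.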
Chapter 5: Overfitting and Its Avoidance
- Just because a model fits the data very well, doesn’t mean it is better at predicting. It could just be memorizing the data.
- If you torture the data long enough, it will confess.
- Fundamental tradeoff between model complexity and overfitting.
- Always hold out data to test the model.
- A fitting graph plots training accuracy and holdout (testing) accuracy against model complexity, revealing where overfitting begins.
- Overfitting is bad because the model picks up spurious correlations that produce incorrect generalizations.
- A learning curve is a plot of the generalization performance against the amount of training data.
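A minimal holdout experiment on synthetic, invented data, showing the training/testing gap that signals memorization: an unconstrained tree fits its training set perfectly but generalizes worse on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: only feature 0 matters, plus label noise (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# An unconstrained tree memorizes the noisy training labels.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, deep.predict(X_tr))
test_acc = accuracy_score(y_te, deep.predict(X_te))
```

Repeating this at several training-set sizes and plotting `test_acc` against size would produce the learning curve described above.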
Chapter 6: Similarity, Neighbors, and Clusters
- Similarity between data instances is described as distance between their feature vectors.
- Nearest-neighbor methods predict by calculating the distance between a new data instance and its neighbors in the training set.
- Similarity is used as the basis for the most common methods of unsupervised data mining, clustering.
- Hierarchical clustering can provide insights that instruct further data mining.
- A cluster centroid can be used as the basis for understanding clusters.
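A minimal nearest-neighbor sketch in plain Python, with an invented toy training set; distance between feature vectors is the Euclidean distance described above:

```python
import math

# Tiny illustrative training set: (feature vector, class label) pairs.
train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([5.0, 5.0], "B"), ([5.5, 4.5], "B")]

def euclidean(u, v):
    # Distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor(x):
    # Predict the label of the closest training instance.
    return min(train, key=lambda pair: euclidean(pair[0], x))[1]

label = nearest_neighbor([1.1, 0.9])
```

The same distance function underlies clustering: a centroid is just the mean feature vector of a cluster's members, and instances are grouped with their nearest centroid.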
Chapter 7: Decision Analytic Thinking I, What is a Good Model?
- Accuracy is too simplistic a metric.
- A confusion matrix differentiates between different types of errors (e.g., sensitivity vs. specificity).
- Expected value frameworks are extremely useful in organizing data science thinking and evaluating models.
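A minimal sketch of both ideas in plain Python: confusion-matrix counts and the cost/benefit weights are invented numbers for a hypothetical fraud model, not figures from the book.

```python
# Confusion-matrix counts for a hypothetical fraud model (illustrative).
tp, fp, fn, tn = 30, 10, 5, 955
n = tp + fp + fn + tn

sensitivity = tp / (tp + fn)   # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
accuracy = (tp + tn) / n       # misleading when classes are imbalanced

# Expected value per case: weight each outcome by an assumed cost/benefit.
benefit_tp, cost_fp, cost_fn, benefit_tn = 100.0, -20.0, -150.0, 0.0
expected_value = (tp * benefit_tp + fp * cost_fp + fn * cost_fn + tn * benefit_tn) / n
```

Note how accuracy looks excellent here simply because negatives dominate; the expected value ties the error types to their business consequences instead.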
Chapter 8: Visualizing Model Performance
- A profit curve is useful for business users to evaluate classifiers.
- A Receiver Operating Characteristic (ROC) graph is useful for evaluating models when class priors or costs/benefits are not known.
- Since ROC curves are not intuitive, a cumulative response curve (or lift curve) is most appropriate for some business users.
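The ROC curve is often summarized by the area under it (AUC). A minimal sketch, assuming scikit-learn; the labels and classifier scores are invented for illustration:

```python
from sklearn.metrics import roc_auc_score

# Illustrative scores from a classifier and the true class labels.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]

# AUC: probability a random positive is ranked above a random negative.
# 1.0 = perfect ranking, 0.5 = random guessing.
auc = roc_auc_score(y_true, scores)
```

Because AUC depends only on the ranking of scores, it needs no class priors or cost/benefit assumptions, which is exactly why ROC analysis is useful when those are unknown.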
Chapter 9: Evidence and Probabilities
- Bayes' rule is used to compute conditional probabilities, which occur frequently in business problems.
- Naive Bayes is valuable because it is very efficient, practical to use, and can learn on the fly.
- Naive Bayes should be avoided when costs/benefits are used, because its probability estimates are poorly calibrated; it is best used when rankings are more important.
- Bayes' rule is the basis of evidence lifts. Evidence lifts are useful for understanding data like "Facebook Likes as a predictor of high IQ".
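A minimal worked example of Bayes' rule and an evidence lift in plain Python; the fraud-detection prior and evidence rates are invented illustrative numbers:

```python
# Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E), with illustrative numbers.
p_fraud = 0.01                 # prior: 1% of transactions are fraudulent
p_flag_given_fraud = 0.9       # evidence rate among fraud cases
p_flag_given_legit = 0.05      # false-alarm rate among legitimate cases

# Total probability of seeing the evidence (a flagged transaction).
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_legit * (1 - p_fraud)

# Posterior probability of fraud given the flag.
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag

# Evidence lift: how much the evidence multiplies the prior.
lift = p_fraud_given_flag / p_fraud
```

Here the flag lifts the fraud probability roughly fifteenfold over the prior; that multiplicative reading is what makes evidence lifts interpretable for findings like the Facebook Likes example.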
Chapter 10: Representing and Mining Text
- Term frequency–inverse document frequency (TFIDF) is a simple and useful representation for mining text.
- An intermediate topic layer (topic models) can also be used to assist with understanding text.
Chapter 11: Decision Analytic Thinking II, Towards Analytical Engineering
- Expected value framework is a core approach useful in many data science scenarios.
Chapter 12: Other Data Science Tasks and Techniques
- Not discussed in depth in this book: co-occurrence grouping, lift and leverage, market basket analysis, profiling, link prediction, social recommendation, data reduction, latent information, bias vs. variance, ensemble models, causal explanations.
Chapter 13: Data Science and Business Strategy
- Understanding data science concepts leads to awareness of new opportunities.
- Understanding the ROI of data science results in increased investment in data and data science teams.
- Data science is a sustainable competitive advantage.
- A culture of data science is valuable in building a data science team.
- A top data scientist is worth many times an average data scientist.
- Data science is learned by working with top data scientists, either in industry or academia.
- A top data science manager understands the technical principles, understands the business needs, and manages people and projects well.
- There is only one reliable predictor of success of a data science research project: prior success.
- Top data scientists want to work with other top data scientists. Most want more responsibility. Most want to be part of a fast-growing, successful company.
- Consider funding a PhD student for $50k/year.
- Consider taking on a data science professor or a top data science consultant as a scientific advisor to guide projects and attract data scientists.
- An immature data science team has processes that are ad-hoc.
- A medium-maturity data science team employs well-trained data scientists and managers.
- A high-maturity data science team focuses on processes as well as projects.