Data Science for Business by Foster Provost and Tom Fawcett

The Big Idea: Invest in data and data science teams. Better data + better data scientists = better models = better business decisions = sustainable competitive advantage.

Chapter 1: Introduction, Data Analytic Thinking

Data mining is the extraction of knowledge from data.
Data science is a set of principles to guide data mining.
Big data means datasets that are too large for traditional data processing systems and require new technologies such as Hadoop, HBase, and MongoDB.
We are still in Big Data 1.0. Big Data 2.0 will be the golden era of data science.
Building a top-notch data science team is nontrivial but can be a tremendous strategic advantage.
Ex: fraud detection, Amazon, Harrah’s casinos.
It’s important for managers and executives to understand basic data science principles to get the most from data science projects and teams.
Just like chemistry is not about test tubes, data science is not about data engineering or data mining.

Chapter 2: Business Problems and Data Science Solutions

There are a few fundamental types of data mining tasks: classification, regression, similarity matching, clustering, association grouping, profiling, link prediction, data reduction, causal modeling.
This book focuses on classification, regression, similarity matching, and clustering.
Ex: churn prediction is a classification problem.
Supervised vs unsupervised. Supervised data mining has a specific target. Unsupervised data mining is used to learn and observe patterns in the data but doesn’t have a specific target.
It’s important to appropriately evaluate prediction models.
Your model is not what the data scientists design, it’s what the engineers build.
Data science engineers are software engineers who have expertise in production systems and in data science.
Data mining is closer to R&D than to software engineering.
Invest in pilot studies and throwaway prototypes.
Analytics skills (ability to formulate problems well, to prototype solutions, to make reasonable assumptions) are more important than software engineering skills in a data science team.
Useful skills for a business analyst: statistics, SQL, data warehousing, regression analysis, machine learning.
There is much overlap, but there is a difference between understanding the reasons for churn and predicting which customers to target to reduce future churn.
Ex. Who are the most profitable customers? SQL
Ex. Is there really a difference between the profitable customers and the average customers? Statistics and hypothesis testing.
Ex. But who really are these profitable customers? Can I characterize them? SQL, statistics, automated pattern finding. Classification.
Ex. Will some particular new customer be profitable? How much revenue should I expect this customer to generate? Predictive model. Regression.

Chapter 3: Introduction to Predictive Modeling, From Correlation to Supervised Segmentation

Predictive modeling is supervised segmentation. We have some target quantity we would like to predict.
A classification tree, or decision tree, is a method to classify data instances.
Tree structured models are a very popular data modeling technique and work remarkably well.
Tree structured models are also easy for business users to understand.
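
As a rough illustration of supervised segmentation, the sketch below scores candidate attributes by how much they reduce uncertainty about the target. Using entropy and information gain as the splitting criterion is an assumption here (these notes do not name one), and the customer records and attribute names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting rows on one attribute."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children

# Hypothetical toy data: did the customer churn?
customers = [
    {"contract": "monthly", "support_calls": "many", "churn": "yes"},
    {"contract": "monthly", "support_calls": "few", "churn": "yes"},
    {"contract": "annual", "support_calls": "many", "churn": "no"},
    {"contract": "annual", "support_calls": "few", "churn": "no"},
]

gain_contract = information_gain(customers, "contract", "churn")
gain_calls = information_gain(customers, "support_calls", "churn")
```

In this toy data, splitting on contract separates churners from non-churners completely, while splitting on support calls tells us nothing, so a tree builder using this criterion would pick contract as the first split.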

Chapter 4: Fitting a Model to Data

Tuning the parameters so that the model fits the data is parameter learning or parametric modeling.
The most common procedure is one you’re already familiar with, linear regression.
Logistic regression applies linear models to class probability estimation, and is one of the most useful data mining techniques.
Nonlinear support vector machines and neural networks fit parameters based on complex, nonlinear functions.
If we increase model complexity too far, we can fit the data too well; at that point we are just memorizing the data.
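
To make parameter learning concrete, here is a minimal sketch of fitting a one-variable linear model by ordinary least squares, plus the logistic function that logistic regression applies to a linear score. The function names and the data points are illustrative assumptions, not taken from the book.

```python
from math import exp

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def logistic(z):
    """Map a linear score to a value in (0, 1), usable as a class probability."""
    return 1.0 / (1.0 + exp(-z))

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

The two numbers returned by fit_line are the learned parameters. A logistic regression would learn its parameters differently (by maximizing likelihood rather than minimizing squared error), but it still passes a linear score through logistic to produce a probability estimate.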

Chapter 5: Overfitting and Its Avoidance

Just because a model fits the data very well does not mean it is better at predicting. It could just be memorizing the data.
If you torture the data long enough, it will confess.
There is a fundamental tradeoff between model complexity and the risk of overfitting.
Always hold out data to test the model.
A fitting graph shows accuracy on the training data and on the holdout data as a function of model complexity.
Overfitting is bad because the model picks up spurious correlations that produce incorrect generalizations.
A learning curve is a plot of the generalization performance against the amount of training data.
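
The gap between memorizing and generalizing can be sketched with a lookup-table model evaluated on holdout data. The data, the noisy point, and both models below are hypothetical and chosen only to make the effect visible.

```python
def accuracy(model, data):
    """Fraction of (x, y) examples for which model(x) equals y."""
    return sum(model(x) == y for x, y in data) / len(data)

# Hypothetical data: the label is 1 when x >= 5, with one noisy training point.
train = [(1, 0), (2, 0), (3, 1), (6, 1), (7, 1), (8, 1)]  # (3, 1) is noise
holdout = [(0, 0), (4, 0), (5, 1), (9, 1)]

# Table model: memorizes every training example, guesses 1 for unseen inputs.
table = dict(train)

def table_model(x):
    return table.get(x, 1)

# Simple threshold model with a single parameter.
def threshold_model(x):
    return 1 if x >= 5 else 0

train_acc_table = accuracy(table_model, train)
holdout_acc_table = accuracy(table_model, holdout)
train_acc_threshold = accuracy(threshold_model, train)
holdout_acc_threshold = accuracy(threshold_model, holdout)
```

The table model is perfect on the training set but no better than a coin flip on the holdout set, while the simpler threshold model misses the noisy training point yet generalizes well. Only the holdout numbers reveal the difference.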

Chapter 6: Similarity, Neighbors, and Clusters

Similarity between data instances is described as distance between their feature vectors.
Nearest-neighbor methods predict by calculating the distance between a new data instance and its neighbors in the training set.
Similarity is used as the basis for the most common method of unsupervised data mining: clustering.
Hierarchical clustering can provide insights that instruct further data mining.
A cluster centroid can be used as the basis for understanding clusters.
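
A minimal sketch of these ideas, assuming Euclidean distance as the similarity measure and majority voting among the k nearest neighbors (both are common choices, not the only ones). The feature vectors and labels are made up for illustration.

```python
from collections import Counter
from math import dist

def nearest_neighbors(query, training, k):
    """Return the k training examples closest to query (Euclidean distance)."""
    return sorted(training, key=lambda example: dist(query, example[0]))[:k]

def knn_predict(query, training, k=3):
    """Predict the majority label among the k nearest training examples."""
    votes = Counter(label for _, label in nearest_neighbors(query, training, k))
    return votes.most_common(1)[0][0]

def centroid(points):
    """Coordinate-wise mean of a group of points, a simple cluster summary."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical training set: (feature vector, label).
training = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 8.5), "high"), ((8.5, 9.5), "high"),
]

prediction_a = knn_predict((2.0, 2.0), training)
prediction_b = knn_predict((8.0, 9.0), training)
low_centroid = centroid([point for point, label in training if label == "low"])
```

In practice the features usually need to be put on comparable scales first, since a feature with a large numeric range would otherwise dominate the distance.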

Chapter 7: Decision Analytic Thinking I, What is a Good Model?

Accuracy is too simplistic a metric.
A confusion matrix separates the different types of errors (false positives vs. false negatives), from which metrics such as sensitivity and specificity are derived.
Expected value frameworks are extremely useful in organizing data science thinking and evaluating models.
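
The sketch below counts the four confusion matrix cells for a binary classifier and then weights each cell by a business value to get an expected value per decision. The labels, predictions, and dollar values are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def expected_value(counts, values):
    """Average value per instance: sum of cell rate times cell value."""
    total = sum(counts.values())
    return sum(counts[cell] / total * values[cell] for cell in counts)

# Hypothetical outcomes for ten customers (1 = responds to an offer).
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Assumed values: a targeted responder earns 99, a wasted offer costs 1.
values = {"tp": 99.0, "fp": -1.0, "fn": 0.0, "tn": 0.0}

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
value_per_customer = expected_value(counts, values)
```

Accuracy treats the false positive and the false negative here as equally bad. The expected value calculation does not, which is why it is the more useful number once costs and benefits can be estimated.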

Chapter 8: Visualizing Model Performance

A profit curve is useful for business users evaluating classifiers.
A Receiver Operating Characteristics (ROC) graph is useful for evaluating models when class priors or costs/benefits are not known.
Since ROC curves are not intuitive, a cumulative response curve (or lift curve) is most appropriate for some business users.
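
For readers who want to see how an ROC graph is built, here is a minimal sketch that sweeps a threshold down through the model's scores and records the false positive rate and true positive rate at each step. It assumes distinct scores and 0/1 labels with both classes present; the scores themselves are made up.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs as the threshold drops.

    Assumes scores are distinct and labels are 0/1 with both classes present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
```

Nothing in this calculation depends on class priors or on costs and benefits, which is what makes the ROC graph useful when those are unknown.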

Chapter 9: Evidence and Probabilities

Bayes' rule is used to reason about conditional probabilities, which occur frequently in business problems.
Naive Bayes is valuable because it is very efficient, practical to use, and can learn on the fly.
Naive Bayes should be avoided when its probability estimates feed directly into cost/benefit calculations. It is best used when rankings are what matter.
Bayes' rule is the basis of evidence lifts. Evidence lifts are useful for understanding data like “Facebook Likes as a predictor of High IQ.”
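
A small worked sketch of Bayes' rule and evidence lift, using made-up probabilities. Here the lift of a piece of evidence E for a class C is taken to be p(E|C) / p(E), so the posterior is the prior multiplied by the lift.

```python
def posterior(prior, likelihood, evidence_prob):
    """Bayes' rule: p(C|E) = p(E|C) * p(C) / p(E)."""
    return likelihood * prior / evidence_prob

def evidence_lift(likelihood, evidence_prob):
    """How much seeing E multiplies the probability of C: p(E|C) / p(E)."""
    return likelihood / evidence_prob

# Hypothetical numbers: C = "high IQ", E = "likes a particular page".
p_c = 0.10          # prior p(C)
p_e_given_c = 0.30  # likelihood p(E|C)
p_e = 0.06          # overall p(E)

lift = evidence_lift(p_e_given_c, p_e)
p_c_given_e = posterior(p_c, p_e_given_c, p_e)
```

With these assumed numbers the evidence has a lift of about 5, so observing it raises the probability of the class from 0.10 to about 0.50. Naive Bayes combines many such pieces of evidence by assuming they are independent given the class.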

Chapter 10: Representing and Mining Text

TFIDF (term frequency times inverse document frequency) is a simple and useful data mining technique for text.
Topic layers can also be used to assist with understanding text.
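
A minimal sketch of TFIDF weighting over tokenized documents. It uses one common variant (term frequency normalized by document length, IDF as log(N / document frequency)); other formulations add smoothing or a constant, so treat the exact formula as an assumption. The documents are hypothetical.

```python
from collections import Counter
from math import log

def tfidf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Number of documents each term appears in at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-tokenized documents.
documents = [
    ["data", "science", "team"],
    ["data", "mining", "text"],
    ["data", "text", "mining"],
]

weights = tfidf(documents)
```

A term that appears in every document ("data" here) gets zero weight under this variant, while a term unique to one document ("science") gets the highest weight, which matches the intuition that rare terms are more informative.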

Chapter 11: Decision Analytic Thinking II, Towards Analytical Engineering

The expected value framework is a core approach useful in many data science scenarios.

Chapter 12: Other Data Science Tasks and Techniques

Not discussed in depth in this book: co-occurrence grouping, lift and leverage, market basket analysis, profiling, link prediction, social recommendation, data reduction, latent information, bias vs variance, ensemble models, causal explanations.

Chapter 13: Data Science and Business Strategy

Understanding data science concepts leads to awareness of new opportunities.
Understanding the ROI of data science results in increased investment in data and data science teams.
Data science is a sustainable competitive advantage.
A culture of data science is valuable in building a data science team.
A top data scientist is worth many times an average data scientist.
Data science is learned by working with top data scientists, either in industry or academia.
A top data science manager understands the technical principles, understands the business needs, and manages people and projects well.
There is only one reliable predictor of success of a data science research project: prior success.
Top data scientists want to work with other top data scientists. Most want more responsibility. Most want to be part of a fast-growing, successful company.
Consider funding a PhD student for $50k/year.
Consider taking on a data science professor or a top data science consultant as a scientific advisor to guide projects and attract data scientists.
An immature data science team has processes that are ad-hoc.
A medium-maturity data science team employs well-trained data scientists and managers.
A high-maturity data science team focuses on processes as well as projects.