Confident Data Skills – by Kirill Eremenko
Date read: 3/29/19. Recommendation: 8/10.

Great resource for those wanting to learn the fundamentals of data science. It’s particularly relevant if you’re looking to better leverage data in your existing job (as I am in product management) or explore a new career path in data science (huge opportunities here, in case you’ve been living under a rock). Eremenko does a great job breaking down the data science process for beginners and explaining the essential algorithms. Case studies from Netflix and LinkedIn help bring these concepts to life.

See my notes below or Amazon for details and reviews.

My Notes:

Fundamentals:
Data has always been out there. What’s changed in the past decade is our ability to collect, organize, analyze and visualize it.

Data = quantitative AND qualitative.

“Big data” is a dynamic term given to datasets that are massive in volume (too big), velocity (too rapid), or variety (too many different data attributes). Technology is always being developed to handle more of all three, which is why what we consider “big data” is in constant flux.

Cloud = storage facility with a virtualized infrastructure. 

Netflix:
The Netflix recommendation engine is a great example of the power of data science. Netflix was able to use viewing habits to create niche subcategories (e.g. “Exciting horror movies from the 1980s”). They were also able to see overlap in audiences’ viewing patterns - identifying that people who enjoyed political dramas also enjoyed Kevin Spacey films, which led them to remake House of Cards.

Healthcare:
One of the things that makes data science so powerful is the sheer volume of data it enables us to process. It can help support doctors in diagnosing patients: a doctor might have seen 5,000 patients in their career, while a machine has accumulated knowledge of 1,000,000 cases.

Multidisciplinary:
Beneficial to have roots in a different discipline when you enter data science – gives you an advantage and helps you ask the right questions. 

The data science process:

  1. Identify the question

  2. Prepare the data (ETL - extract, transform, load)

  3. Analyze the data

  4. Visualize the insights

  5. Present the insights

Prepare the data:
-Extract the data from its sources – working on an extracted copy ensures that you aren’t altering the original source.

-Transform the data into a comprehensible language for access in a relational database. This step is about reformatting, joining, splitting, aggregating, and cleaning the data. 

-Load the data into the end target (the data warehouse).
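
A minimal ETL sketch in Python with pandas (the CSV sources, column names, and SQLite warehouse here are hypothetical, not from the book):

```python
import sqlite3

import pandas as pd

# Extract: read from the sources; the original files are never modified
orders = pd.read_csv("orders.csv")        # hypothetical source
customers = pd.read_csv("customers.csv")  # hypothetical source

# Transform: join, clean, and aggregate into an analysis-ready table
merged = orders.merge(customers, on="customer_id", how="left")
merged = merged.dropna(subset=["order_total"])
summary = merged.groupby("region", as_index=False)["order_total"].sum()

# Load: write the transformed table into the warehouse
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("regional_sales", conn, if_exists="replace", index=False)
```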

Essential algorithms:
Three main groups – classification, clustering, reinforcement learning.

Classification – when you know the categories you want to group, or classify, new data points into (e.g. survey response to a yes/no question)

-Types of classification algorithms: decision trees, random forest, K-nearest neighbors (K-NN), Naive Bayes, logistic regression.

-Decision tree runs tests on individual attributes in your dataset in order to determine the possible outcomes. Questions are the branches, answers are the leaves. Better for smaller datasets.
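
A minimal scikit-learn sketch (the toy age/income data is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, income] -> purchased (1) or not (0)
X = [[25, 40000], [35, 60000], [45, 80000], [22, 25000], [52, 95000]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# A new point is classified by answering the tree's questions down to a leaf
print(tree.predict([[30, 50000]]))
```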

-Random forest builds upon the same principles as a decision tree; it just uses many different trees to make the same prediction and averages the results from the individual trees. Every decision tree casts its vote, and random forest takes the most-voted option. Better for larger datasets.
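
Same toy problem as above, but as a forest (a sketch, assuming the same made-up data):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[25, 40000], [35, 60000], [45, 80000], [22, 25000], [52, 95000]]
y = [0, 1, 1, 0, 1]

# 100 trees, each trained on a random sample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The prediction is the majority vote across all 100 trees
print(forest.predict([[30, 50000]]))
```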

-K-nearest neighbors (K-NN) analyzes likeness by calculating the distance between a new data point and existing data points. Deterministic model. The assumption it makes is that a point’s unknown features will be similar to those of its nearest neighbors.
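
A K-NN sketch on the same made-up data (I scale the features first so income doesn’t dominate the distance calculation):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = [[25, 40000], [35, 60000], [45, 80000], [22, 25000], [52, 95000]]
y = [0, 1, 1, 0, 1]

# Scale features so age and income contribute comparably to distance
scaler = StandardScaler().fit(X)

# The new point gets the majority class of its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X), y)
print(knn.predict(scaler.transform([[30, 50000]])))
```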

-Naive Bayes allows new data points to be easily included in the algorithm to dynamically update the probability value. Probabilistic model. Good for non-linear problems where classes cannot be separated with a straight line on the scatter plot, and for datasets containing outliers (other algorithms are easily biased by outliers). Drawback: its naive assumptions can create bias.
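
A Naive Bayes sketch (same hypothetical data); note the probabilistic output and the incremental update:

```python
from sklearn.naive_bayes import GaussianNB

X = [[25, 40000], [35, 60000], [45, 80000], [22, 25000], [52, 95000]]
y = [0, 1, 1, 0, 1]

nb = GaussianNB()
nb.fit(X, y)

# Probabilistic model: returns P(class) for the new point, not just a label
print(nb.predict_proba([[30, 50000]]))

# New data points can be folded in without retraining from scratch
nb.partial_fit([[28, 45000]], [0])
print(nb.predict_proba([[30, 50000]]))
```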

-Logistic regression is good for analyzing the likelihood of a customer’s interest in your product, evaluating the response of customers based on demographic data, and identifying which variable is the most statistically significant.
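
A logistic regression sketch for the “likelihood of interest” use case (hypothetical demographic data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical demographics: [age, income] -> interested (1) or not (0)
X = [[25, 40000], [35, 60000], [45, 80000], [22, 25000], [52, 95000]]
y = [0, 1, 1, 0, 1]

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Output is a likelihood of interest rather than a hard yes/no
print(model.predict_proba([[30, 50000]]))  # [[P(not interested), P(interested)]]
```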

-Simple linear regression analyzes the relationship between one dependent and one independent variable.

-Multiple linear regression analyzes the relationship between one dependent and two or more independent variables.
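
A sketch covering both flavors (the experience/education/salary numbers are made up):

```python
from sklearn.linear_model import LinearRegression

# Simple: one independent variable (years of experience -> salary)
X_simple = [[1], [2], [3], [4], [5]]
salary = [40000, 48000, 55000, 63000, 70000]
simple = LinearRegression().fit(X_simple, salary)
print(simple.coef_, simple.intercept_)  # slope and baseline salary

# Multiple: two independent variables (experience and years of education)
X_multi = [[1, 12], [2, 14], [3, 16], [4, 16], [5, 18]]
multi = LinearRegression().fit(X_multi, salary)
print(multi.predict([[3, 15]]))  # predicted salary for a new person
```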

Clustering – when you don’t know the groups you want an analysis to place your data into (e.g. segmenting survey respondents by age and distance from the company’s closest store).

-Types of clustering algorithms: K-means, hierarchical.

-K-means discovers statistically significant categories or groups in a given dataset. 
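
A K-means sketch using the survey example above (made-up age/distance data):

```python
from sklearn.cluster import KMeans

# Hypothetical survey responses: [age, km from company's closest store]
X = [[22, 1.5], [25, 2.0], [47, 30.0], [52, 28.5], [30, 1.0], [49, 31.0]]

# We supply only the number of clusters; K-means finds the groups itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each respondent
print(kmeans.cluster_centers_)  # the "average member" of each cluster
```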

-Hierarchical includes agglomerative clustering (bottom-up: starts from single data points and groups each with its nearest data points in incremental steps until all points have been absorbed into a single cluster; this is the most common) and divisive clustering (top-down: begins with a single cluster encompassing all data points and works its way down, splitting the cluster apart in order of distance between data points). Both are recorded in a dendrogram.
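
An agglomerative sketch with SciPy (same made-up survey data; matplotlib is needed for the plot):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = [[22, 1.5], [25, 2.0], [47, 30.0], [52, 28.5], [30, 1.0], [49, 31.0]]

# Agglomerative (bottom-up): repeatedly merge the closest clusters
Z = linkage(X, method="ward")

# The dendrogram records each merge and the distance at which it happened
dendrogram(Z)
plt.show()
```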

Reinforcement learning - a form of machine learning that leans on concepts of behaviorism to train AI. 

-Types of algorithms: Upper confidence bound, Thompson sampling.

-Upper confidence bound (UCB) is a dynamic strategy that increases in accuracy as additional information is collected. Deterministic. After each round, the data collected is used to alter the confidence bound of the variant that was tried. Good for finding the most effective ad campaign or managing finances across multiple projects.
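
A minimal UCB sketch for the ad-campaign use case (the three click-through rates are simulated, purely for illustration):

```python
import math
import random

# Hypothetical true click-through rates for three ad variants
true_rates = [0.05, 0.12, 0.09]
counts = [0, 0, 0]   # times each variant has been shown
rewards = [0, 0, 0]  # clicks each variant has earned

for round_n in range(1, 10001):
    # Pick the variant with the highest upper confidence bound;
    # unshown variants get an infinite bound so each is tried once first
    ucb = [
        rewards[i] / counts[i] + math.sqrt(2 * math.log(round_n) / counts[i])
        if counts[i] > 0 else float("inf")
        for i in range(3)
    ]
    choice = ucb.index(max(ucb))

    # Show the ad, observe a (simulated) click, and update that variant's bound
    clicked = random.random() < true_rates[choice]
    counts[choice] += 1
    rewards[choice] += int(clicked)

print(counts)  # over time, most rounds go to the best variant (index 1)
```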