Data Science
On this page we cover the following topics:
1. What is Data Science
2. Data Science Tools
3. Introduction to Probability Theory
4. Introduction to Statistics
1. What is Data Science (DS)?
The goal of DS is to extract knowledge or insights from data. Data Scientists spend most of their time on these basic activities:
 Collecting the data (CSV & Excel files, web scraping, log files, databases & APIs, etc.)
 Cleaning, aggregating, and loading the data (ETL = Extract, Transform, Load)
 Visualizing the data to get an intuitive understanding of it
 Performing processing, mathematical/statistical analysis, and mathematical modeling. Most common tasks:
 – cluster analysis (splitting data into groups)
 – regression analysis (approximating with theoretical curves)
 Visualizing the results – reporting and recommendations
A Data Scientist should combine several skills:
 knowledge of a given subject matter (context)
 mathematics and statistical methods
 using computers to work with data.
More advanced topics associated with DS are Machine Learning (ML) and Artificial Intelligence (AI).
 Machine Learning (ML) – algorithms to recognize patterns and make predictions
 Deep Machine Learning (DML) – complex ML (multilayer, structured, hierarchical)
 Artificial Intelligence (AI) – learning and problem solving
 Data Mining – exploring data searching for patterns (Data stores, Databases, Statistics, ML, AI)
 Sentiment Analysis – opinion mining
 Natural Language Processing (NLP) – extract data and meaning from the web and literature
Here are some common areas where the methods can be applied:
 Making recommendations (people who liked this also liked that)
 Finding similar users (users who are also likely to buy this product)
 Cluster analysis (discovering groups)
 Searching (search engine, ranking)
 Optimizing (travel, cost, etc.)
 Removing Spam (email, blog feeds/posts, searches)
 Predictions and making decisions
 Building Price Models
 Finding promising new products, new markets, new acquisition candidates, etc.
 Epidemiology
Here are some links about Data Science:
 https://en.wikipedia.org/wiki/Data_science
 http://techbeacon.com/softwareengineersguidedatascience
 https://datascience.berkeley.edu/about/whatisdatascience/
 http://www.evanmiller.org/abtesting/ – very nice statistical calculators
 http://firstround.com/review/doingdatasciencerightyourmostcommonquestionsanswered/
 https://yanirseroussi.com/category/datascience2/
 http://www.datascienceglossary.org/ – good glossary of Data Science terms
Here are the lists of books about Data Science:
 Books on probability and statistics https://www.quora.com/Whataresomegoodbooksforlearningprobabilityandstatistics
 Books on Data Science https://www.quora.com/Whatarethebestbooksaboutdatascienc
2. Data Science Tools
Databases:
 Data warehouse – OLAP vs OLTP, star schema, dimensional model
 SQL (Structured Query Language)
 ETL tools (ETL = Extract, Transform, Load)
Databases on Cloud:
 Amazon Redshift – Amazon’s own database product in Amazon cloud
 Snowflake Computing – Elastic SQL Data Warehouse on top of the Amazon cloud, supports both relational tables and JSON
 Microsoft Azure – Elastic SQL Data Warehouse, on Microsoft Cloud
 Teradata – Expandable relational data warehouse system (since 1979), “shared nothing” architecture.
 IBM dashDB – IBM's cloud data warehouse
 HPE Vertica – HP's Vertica analytics database, available in the cloud
 MongoDB
Big Data tools – see on a separate page:
 Big Data, Cloud – http://www.selectorweb.com/resources/bigdata/
CSV and JSON data formats:
 CSV – https://en.wikipedia.org/wiki/Comma-separated_values (Comma-Separated Values)
 JSON – https://en.wikipedia.org/wiki/JSON – (JavaScript Object Notation)
Python:
 https://en.wikipedia.org/wiki/Python_(programming_language)
 https://www.python.org/
 https://www.continuum.io
 https://www.yhat.com/products/rodeo
R Programming Language:
 https://en.wikipedia.org/wiki/R_(programming_language)
 https://www.r-project.org/
 https://cran.r-project.org/doc/manuals/r-release/R-intro.html
 https://www.rstudio.com/home/
 http://shiny.rstudio.com
STAN:
 http://mc-stan.org – Statistical Modeling (mostly Columbia University)
 http://mc-stan.org/workshops/nyc/
 http://stats.stackexchange.com/questions/165/howwouldyouexplainmarkovchainmontecarlomcmctoalayperson
Some Data Science Algorithms:
 Naive Bayesian Classifier –
 Decision Tree Learning – (Gradient Tree Boosting, Random Forest, …)
 Neural Networks –
 Support-Vector Machines (SVMs) – supervised learning models
 k-Nearest Neighbors – used for classification or regression
 Clustering (Cluster Analysis) –
 Multidimensional Scaling (MDS) – visualizing similarities in a dataset
 Non-Negative Matrix Factorization (NMF) – representing a big matrix as the product of two smaller ones
 Optimization –
Seminal Articles Every Data Scientist Should Read (from datasciencecentral.com, August 2014)
External Papers
 Bigtable: A Distributed Storage System for Structured Data
 A Few Useful Things to Know about Machine Learning
 Random Forests
 A Relational Model of Data for Large Shared Data Banks
 MapReduce for Machine Learning on Multicore
 Pasting Small Votes for Classification in Large Databases and On-Line
 Amazon.com Recommendations: Item-to-Item Collaborative Filtering
 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
 Spanner: Google’s Globally-Distributed Database
 Megastore: Providing Scalable, Highly Available Storage for Interactive Services
 F1: A Distributed SQL Database That Scales
 APACHE DRILL: Interactive Ad-Hoc Analysis at Scale
 A New Approach to Linear Filtering and Prediction Problems
 Top 10 algorithms on Data mining
 The PageRank Citation Ranking: Bringing Order to the Web
 MapReduce: Simplified Data Processing on Large Clusters
 The Google File System
 Amazon’s Dynamo
DSC Internal Papers
 How to detect spurious correlations, and how to find the …
 Automated Data Science: Confidence Intervals
 16 analytic disciplines compared to data science
 From the trenches: 360-degree data science
 10 types of regressions. Which one to use?
 Practical illustration of MapReduce (Hadoop-style), on real data
 Jackknife logistic and linear regression for clustering and predict…
 A synthetic variance designed for Hadoop and big data
 Fast Combinatorial Feature Selection with New Definition of Predict…
 Internet topology mapping
 11 Features any database, SQL or NoSQL, should have
 10 Features all Dashboards Should Have
 Clustering idea for very large datasets
 Hidden decision trees revisited
 Correlation and R-Squared for Big Data
 What Map Reduce can’t do
 Excel for Big Data
 Fast clustering algorithms for massive datasets
 The curse of big data
 Interesting Data Science Application: Steganography
Some Data Science videos for beginners:
 Data Science for Beginners video 1: The 5 questions that data science can answer
 Data Science for Beginners video 2: Is your data ready for data science?
 Part 3 – How to ask a question you can answer with data
 Part 4 – Predict an answer with a simple model
 Part 5 – Copy other people’s work to do data science
 Data Science for Rest of Us
Fantastic Data Science materials from Microsoft:
Good site with Data Science Articles and Tutorials:
3. Introduction to Probability Theory
Probability Definitions
The probability of an event is a measure of the likelihood that the event will occur (in some experiment). Probability is a number between 0 and 1, though sometimes people express it as a percentage.
0 – event is impossible
0.5 – the event is just as likely to occur as it is to not occur
1 – event will certainly happen, 100%.
The sum of probabilities of all outcomes of an experiment should always be 1.
Outcomes can be combinations of individual events.
For example, if we throw a single die, there are 6 possible outcomes (1,2,3,4,5,6), each with the same probability – 1/6.
If we throw two dice, there are 6*6=36 possible combinations. The probability of each combination is the same – 1/36.
If we define an outcome as the sum of the numbers on the top faces after the throw, then there are 11 possible outcomes:
sum    combinations                   number of combinations   probability
2      1+1                            1                        1/36
3      1+2, 2+1                       2                        2/36
4      1+3, 2+2, 3+1                  3                        3/36
5      1+4, 2+3, 3+2, 4+1             4                        4/36
6      1+5, 2+4, 3+3, 4+2, 5+1        5                        5/36
7      1+6, 2+5, 3+4, 4+3, 5+2, 6+1   6                        6/36
8      2+6, 3+5, 4+4, 5+3, 6+2        5                        5/36
9      3+6, 4+5, 5+4, 6+3             4                        4/36
10     4+6, 5+5, 6+4                  3                        3/36
11     5+6, 6+5                       2                        2/36
12     6+6                            1                        1/36
Total                                 36                       1
From the table above you can see how some outcomes result from more combinations and thus have a higher probability of happening.
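The table above can be verified with a short Python sketch that enumerates all 36 equally likely combinations:

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 equally likely (die1, die2) combinations
# produce each possible sum.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

# Probability of each sum = (number of combinations) / 36.
probs = {s: Fraction(n, 36) for s, n in counts.items()}

print(counts[7])            # 6 combinations give a sum of 7
print(probs[7])             # 1/6
print(sum(probs.values()))  # 1 -- probabilities of all outcomes sum to one
```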
Marginal Probability = Simple Probability – the probability of an independent event occurring.
The term “marginal variable” refers to a subset of variables in a table. They are called “marginal” because they used to be calculated by summing values in a table along rows or columns and writing the sums in the margins of the table.
Examples of assigning probability.
1. Tossing a coin.
Possible outcomes: tails or heads.
The events (tails or heads) are mutually exclusive. Their probabilities should be equal.
Pt + Ph = 1
Pt = 0.5
Ph = 0.5
2. Rolling a die.
Possible outcomes: (1,2,3,4,5,6).
They are mutually exclusive, their probabilities are equal, and they sum to 1.
So each probability = 1/6 ≈ 0.167.
3. Job Search.
I get 2 emails with job descriptions every day, 10 per week.
Two of them may be a fit: 2 of 10 (0.2).
So I send an email saying that I am interested. I am invited to interview for every third position I apply to (0.2/3 ≈ 0.067).
And I get 1 offer out of 10 interviews (0.067 * 0.1 ≈ 0.0067).
So the probability that an incoming email will result in a job is 0.0067, or ~0.7%. So I need to consider roughly 100-150 positions (10-15 weeks), send 20-30 emails expressing interest, go to 6-10 interviews, and eventually get a job.
So the job search takes 2-3 months.
Rules (Or, And, etc.)
Let A be an event, and P(A) the probability of event A. Then:
P(A) + P(not A) = 1
If A and B are events, then:
P(A or B) = P(A) + P(B) – P(A and B)
For mutually exclusive events P(A and B) = 0, so P(A or B) = P(A) + P(B)
Graphical representation of events as circles (Venn Diagrams).
See many examples – just google images for: Venn Diagram A or B.
For example, a typical diagram shows the intersection (A and B),
the negation of the intersection (A but not B, or B but not A),
the union (A or B), and the subtraction (A but not B):
Conditional Probability
P(A|B) – probability of A given that B has occurred.
P(A|B) = P(A and B) / P(B) (if P(B) is not zero)
If A & B are independent from each other, then:
P(A|B) = P(A)
P(B|A) = P(B)
Bayes’ Theorem (Law, Rule)
– https://en.wikipedia.org/wiki/Thomas_Bayes – (/ˈbeɪz/; 1701–1761)
– https://en.wikipedia.org/wiki/Bayes%27_theorem –
– https://en.wikipedia.org/wiki/Bayesian_inference –
P(A|B) = P(B|A)*P(A)/P(B)
where A and B are events and P(B) is not equal to 0.
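As a worked example, here is Bayes' theorem applied in Python to hypothetical numbers for a diagnostic test (all figures below are made up for illustration):

```python
# Hypothetical numbers for illustration: a condition affects 1% of a
# population (the prior), a test detects it 99% of the time, and gives
# a false positive 5% of the time.
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.99      # P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# Total probability of a positive test:
# P(B) = P(B|A)*P(A) + P(B|not A)*P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A)*P(A)/P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.167 -- only ~1 in 6 positives is real
```

The counter-intuitive result (a 99%-accurate test, yet most positives are false) is exactly the kind of update over a small prior that Bayes' theorem captures.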
Frequentist interpretation of Bayes’ theorem.
Probability measures a proportion of outcomes.
Suppose an experiment is performed many times.
P(A) is the proportion of outcomes with property A,
and P(B) that with property B.
P(B|A) is the proportion of outcomes with property B out of outcomes with property A,
and P(A|B) the proportion of those with A out of those with B.
Bayesian interpretation of Bayes’ theorem.
Probability measures a degree of belief.
Bayes’ theorem then links the degree of belief in a proposition before and after accounting for evidence.
For example, suppose it is believed with 50% certainty that a coin is twice as likely to land heads than tails.
If the coin is flipped a number of times and the outcomes observed,
then that degree of belief may rise, fall or remain the same depending on the results.
For proposition A and evidence B,
P(A), the prior, is the initial degree of belief in A.
P(A|B), the posterior, is the degree of belief having accounted for B.
P(B|A)/P(B) represents the support B provides for A.
See also Bayesian inference.
– https://en.wikipedia.org/wiki/Bayesian_inference
Bayesian inference is a method of statistical inference in which
Bayes’ theorem is used to update the probability for a hypothesis
as more evidence or information becomes available.
Random Variable. Probability Distribution
A random variable is a numerical variable taking different values.
Values may be discrete or continuous.
For example, think of a variable “x” which may have any real value between -10 and +10.
Or think of a variable “z” which can have discrete values 0, 0.1, 0.2, …, 0.9, 1.0.
Discrete Case
The probability function P(x) of a discrete random variable “x” is simply a function which gives the probability of each discrete value.
For example, if the random variable “x” may have two values (0 or 1),
and if the probabilities of those values are equal, then
P(0) = 0.5, P(1) = 0.5.
The sum of individual probabilities for all possible values should be 1.
The expected value is a weighted average of all values:
E(x) = sum(Xi*P(Xi)), where Xi goes through all allowed values of X.
Variance = sigma squared = sum( (Xi - E(x))**2 * P(Xi) )
Standard Deviation = sigma = sqrt(Variance)
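These formulas can be checked for a fair six-sided die with a small Python sketch:

```python
import math

# Fair six-sided die: values 1..6, each with probability 1/6.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# E(x) = sum(Xi * P(Xi))
e = sum(x * p for x, p in zip(values, probs))

# Variance = sum((Xi - E(x))**2 * P(Xi))
var = sum((x - e) ** 2 * p for x, p in zip(values, probs))

# Standard deviation = sqrt(Variance)
std = math.sqrt(var)

print(e)    # 3.5
print(var)  # ~2.9167 (= 35/12)
print(std)  # ~1.7078
```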
Probability Density Function (PDF) vs. Cumulative Probability Function (CPF)
Continuous Case
Probability Density Function f(x) is a continuous function.
f(x)*dx gives probability that value is between x and x+dx.
The total integral of f(x)*dx should be equal to 1.
Cumulative Probability Function (CPF)
This is simply a function F(x) which gives the probability that the value is less than x.
It grows monotonically from 0 to 1 as x grows (from small to large values).
The PDF is the first derivative of the CPF:
CPF(x+dx) – CPF(x) = [Prob. of value < (x+dx)] – [Prob. of value < x] = f(x)*dx
Mean value = Expected value
For the discrete case we had:
E(x) = sum(Xi*P(Xi))
For the continuous case this translates into:
E(x) = Integral(x*f(x)*dx)
Binomial Distribution
– https://en.wikipedia.org/wiki/Binomial_distribution
if:
The experiment consists of N trials. Each trial can have two outcomes (True or False).
The probabilities of these outcomes are the same for all trials (p and q = 1 - p).
The trials are independent of each other.
Then:
The probability of having “x” True values (out of N) is:
f(x) = C(N,x) * p^x * (1-p)^(N-x)
where C(N,x) = N!/(x!*(N-x)!) – a binomial coefficient
At large N the distribution f(x) has a bell-curve shape.
Mean value = Expected value E(x) = N*p
Variance = N*p*(1-p).
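A minimal Python sketch of the binomial formula, using the standard-library `math.comb` for the binomial coefficient:

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x True outcomes in n independent trials,
    each succeeding with probability p: C(n,x) * p^x * (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5  # e.g. 10 fair coin flips
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

print(binomial_pmf(5, n, p))  # 0.24609375 = 252/1024, the most likely count
print(sum(pmf))               # ~1.0 -- probabilities sum to one
print(round(sum(x * q for x, q in enumerate(pmf)), 6))  # 5.0 = N*p
```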
Combinatorics
Factorial
N! = N*(N-1)*(N-2)*…*1
Reordering
Let’s say we have N coins or some other items (an N-element unordered set).
Question: In how many ways can we arrange them?
Answer: N!
Example: How many ways are there to arrange the letters ABCDE in a row?
Answer: 5!=120
Subsets
Question: In how many ways can we select m elements out of an N-element set?
Answer: C(N,m) = N!/(m!*(N-m)!) – a binomial coefficient
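Both counts are available in Python's standard library; a quick check of the formulas above:

```python
from math import factorial, comb

# Reordering: N distinct items can be arranged in N! ways.
print(factorial(5))  # 120 -- arrangements of the letters ABCDE

# Subsets: choosing m elements out of an N-element set.
# C(N, m) = N! / (m! * (N-m)!)
print(comb(5, 2))                                     # 10
print(factorial(5) // (factorial(2) * factorial(3)))  # 10 -- same, by the formula
```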
Poisson and Exponential Distributions
– https://en.wikipedia.org/wiki/Poisson_distribution
– https://en.wikipedia.org/wiki/Exponential_distribution
Suppose you are sitting near the road. You notice that on average you see 10 cars passing by every minute (we say – the rate lambda = 10/min). Knowing the average rate, what would be the probability of having only 5 cars in one minute? Or 15 cars? Or any given number? Poisson Probability P(k) is the probability of a given number of events occurring in a fixed interval (of time) if these events occur with a known average rate and independently of each other and of the time since the previous events.
P(x) = (lambda**x)*exp(-lambda)/x!
For example, for x=0 (probability of zero events in a given interval):
P(0) = (lambda**0)*exp(-lambda)/0! = 1*exp(-lambda)/1 = exp(-lambda)
Intervals between Poisson events are distributed exponentially.
If we have events happening which follow a Poisson distribution, we can measure the time intervals between these events.
Question: How will these times be distributed?
Answer: These times will have exponential distribution with Probability Density Function (PDF) as:
f(t) = lambda*exp(-lambda*t)
The average time interval will be (1/lambda).
The standard deviation happens to be the same (1/lambda).
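A small Python sketch of both distributions, using the 10 cars/minute rate from the example above (`random.expovariate` draws exponential inter-arrival times):

```python
import math
import random

lam = 10.0  # average rate: e.g. 10 cars per minute

def poisson_pmf(k, lam):
    """Probability of exactly k events in one interval at rate lambda."""
    return lam**k * math.exp(-lam) / math.factorial(k)

print(round(poisson_pmf(5, lam), 4))   # 0.0378 -- only 5 cars in a minute
print(round(poisson_pmf(10, lam), 4))  # 0.1251 -- exactly 10 cars

# Simulate inter-arrival times: they follow an exponential distribution
# with mean 1/lambda.
random.seed(0)
gaps = [random.expovariate(lam) for _ in range(100_000)]
print(sum(gaps) / len(gaps))  # close to 0.1 = 1/lambda
```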
Uniform and Normal Distributions. Central Limit Theorem.
Uniform Distribution
If some value can change between some minimum and maximum values a & b, and the probability of any value in interval [a,b] is the same, then we say that we have uniform distribution.
The PDF (Probability Density Function) for Uniform Distribution looks like a rectangle with the height (1/(ba)), so that the total area under the curve is exactly 1.
More: Uniform distribution (continuous), Normal Distribution
Normal (Gaussian Bellshaped) Distribution
The basic shape of the PDF is exp(-x*x/2). You need to add a coefficient to normalize it (that is, make the area under the curve equal to exactly 1). You can also give the curves a different position (average value) and a different width (sigma).
Please see Generic formula and shape.
The normal distribution is very common in nature, because when we have a big number of sets of variables, the distribution of parameters of these sets (for example, their averages) tends to converge to this bell shape. This fact is called the Central Limit Theorem.
In other words, suppose we have a variable with exponential distribution (or any other distribution). Suppose we get a sample set of N1 values, and we have N2 such sets. We calculate an average in each set – so we have N2 average values.
Question: How will these average values be distributed?
Answer: As N1 & N2 become big enough, the distribution will get close to the Normal Gaussian bell curve.
Interestingly, if the original variable has some other distribution (not exponential but something else – uniform, Poisson, etc.), we will still get the Normal distribution!
See Illustration of the central limit theorem. On that page you can select the shape of the underlying distribution and see how regardless of that distribution, the result is always the Normal distribution.
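This convergence can be demonstrated with a short simulation (a sketch assuming an exponential distribution with mean 1; the values of N1 and N2 are chosen arbitrarily):

```python
import random
import statistics

random.seed(1)
N1, N2 = 50, 2000  # N2 sample sets, N1 exponential values each

# Draw N2 sets of N1 exponentially distributed values (mean 1.0)
# and compute the average of each set.
averages = [statistics.mean(random.expovariate(1.0) for _ in range(N1))
            for _ in range(N2)]

# The averages cluster around the true mean 1.0, and their spread
# shrinks as 1/sqrt(N1). A histogram of `averages` is close to a
# Gaussian bell even though the underlying distribution is exponential.
print(statistics.mean(averages))   # ~1.0
print(statistics.stdev(averages))  # ~0.14 = 1/sqrt(50)
```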
See also:
 Central Limit Theorem – Uniform Distribution
 Central Limit Theorem – Parabolic Distribution
 Central Limit Theorem – Triangular Distribution
There are also some nice animated demonstrations – just google for them.
De Moivre–Laplace theorem
The De Moivre–Laplace theorem is a simple demonstration of how a symmetrical binomial distribution can be approximated by a Gaussian bell curve (under certain conditions).
 https://en.wikipedia.org/wiki/De_Moivre–Laplace_theorem
4. Introduction to Statistics
“Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.” ( Source: Wikipedia article about Statistics.)
Being a statistician is a major part of being a Data Scientist. And it is tricky, because it is easy to misuse statistics:
“There are three kinds of lies: lies, damned lies, and statistics.” (Source: Lies, damned lies, and statistics)
Statistics involves
 Collecting (sampling) data from different kinds of observations
 Visualizing the data, estimating the parameters (average/mean values, errors, etc.)
 Testing validity of a hypothesis (null hypothesis) – or alternative hypothesis
 Estimating intervals, estimating significance of observed differences, etc.
Here are some basic formulas:
Intuitive explanation for dividing by (n-1) when calculating standard deviation:
– http://stats.stackexchange.com/questions/3931/intuitiveexplanationfordividingbyn1whencalculatingstandarddeviation
More advanced topics:
Linear Regression – ( https://en.wikipedia.org/wiki/Linear_regression ) – draw a line through experimental points.
Simple Linear Regression – ( https://en.wikipedia.org/wiki/Simple_linear_regression ) – the least squares estimator of a linear regression model.
Here is the simplest standard OLS method (OLS = Ordinary Least Squares).
ym = y_mean = sum(y)/N
xm = x_mean = sum(x)/N
var(x) = variance(x) = sum((x-xm)^2)
covar(x,y) = covariance(x,y) = sum((y-ym)*(x-xm))
slope b = covar(x,y) / var(x)
intercept a = ym - b*xm

We can simplify as follows:
var(x) = sum((x-xm)^2) = sum(x^2) - sum(2*x*xm) + sum(xm^2) = sum(x^2) - N*xm^2
covar(x,y) = ... using similar logic = sum(y*x) - N*xm*ym

So:
slope b = (sum(y*x) - N*xm*ym) / (sum(x^2) - N*xm^2)
intercept a = ym - b*xm
            = (ym*sum(x^2) - ym*N*xm^2 - xm*sum(y*x) + xm*N*xm*ym) / (sum(x^2) - N*xm^2)
            = (ym*sum(x^2) - xm*sum(y*x)) / (sum(x^2) - N*xm^2)
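The OLS derivation above translates directly into Python (a minimal sketch; the helper name `ols_fit` is made up):

```python
def ols_fit(x, y):
    """Ordinary Least Squares fit of y ~ a + b*x using the formulas above."""
    n = len(x)
    xm = sum(x) / n
    ym = sum(y) / n
    var_x = sum((xi - xm) ** 2 for xi in x)
    covar = sum((yi - ym) * (xi - xm) for xi, yi in zip(x, y))
    b = covar / var_x  # slope
    a = ym - b * xm    # intercept
    return a, b

# Exact line y = 2x + 1: the fit should recover a=1, b=2.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
print(ols_fit(x, y))  # (1.0, 2.0)
```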
When drawing a curve through experimental points, it is very important to estimate how well this curve fits the data – and whether fitting the data with this curve makes sense at all. For this you can use the Coefficient of Determination:
 Coefficient of Determination R^{2} – ( https://en.wikipedia.org/wiki/Coefficient_of_determination ) – pronounced “R squared”; a number between 0 and 1. The closer it is to 1, the better the data fits the model.
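A minimal sketch of computing R^2 from observed values and model predictions (the data below is made up for illustration):

```python
def r_squared(y, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ym = sum(y) / len(y)
    ss_tot = sum((yi - ym) ** 2 for yi in y)           # total variance
    ss_res = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))  # residuals
    return 1 - ss_res / ss_tot

y     = [1.0, 3.0, 5.0, 7.0, 9.0]   # observed values
y_fit = [1.1, 2.9, 5.0, 7.1, 8.9]   # predictions from some fitted model
print(round(r_squared(y, y_fit), 3))  # 0.999 -- a very good fit
```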
More Statistics:
 Confidence Interval – ( https://en.wikipedia.org/wiki/Confidence_interval ) –
 Cluster Analysis – ( https://en.wikipedia.org/wiki/Cluster_analysis ) – finding groups (clusters) in data.
 Student’s t-distribution – ( https://en.wikipedia.org/wiki/Student%27s_t-distribution ) – estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.
 Chi-squared distribution – ( https://en.wikipedia.org/wiki/Chi-squared_distribution ) – the distribution of a sum of the squares of k independent standard normal random variables.
 F-distribution – ( https://en.wikipedia.org/wiki/F-distribution ) – a right-skewed distribution used most commonly in Analysis of Variance.
 Gamma distribution – ( https://en.wikipedia.org/wiki/Gamma_distribution ) – a two-parameter family of continuous probability distributions. The common exponential distribution and chi-squared distribution are special cases of the gamma distribution.
Stochastic Processes, Time Series Analysis
– https://en.wikipedia.org/wiki/Stochastic_process –
– https://en.wikipedia.org/wiki/Time_series –
– https://en.wikipedia.org/wiki/Random_walk –
– https://en.wikipedia.org/wiki/Brownian_motion –
– https://en.wikipedia.org/wiki/Diffusion , https://en.wikipedia.org/wiki/Diffusion_process –
– https://en.wikipedia.org/wiki/Stationary_process –
– https://en.wikipedia.org/wiki/Stable_process –
Poisson Process
– https://en.wikipedia.org/wiki/Poisson_point_process –
– https://en.wikipedia.org/wiki/Dirac_delta_function –
– https://en.wikipedia.org/wiki/Compound_Poisson_process –
White Noise, Gaussian Noise, Brownian Noise
– https://en.wikipedia.org/wiki/White_noise –
– https://en.wikipedia.org/wiki/Gaussian_noise –
– https://en.wikipedia.org/wiki/Brownian_noise –
Markov Process
– https://en.wikipedia.org/wiki/Markov_chain –
– https://en.wikipedia.org/wiki/Markov_process –
Correlation function:
– https://en.wikipedia.org/wiki/Correlation_function –
– https://en.wikipedia.org/wiki/Autocorrelation –
Spectral Density, Fourier Analysis
– https://en.wikipedia.org/wiki/Spectral_density –
– https://en.wikipedia.org/wiki/Fourier_analysis –
– https://en.wikipedia.org/wiki/Fast_Fourier_transform –
Convolution, Functional Analysis
– https://en.wikipedia.org/wiki/Convolution –
– https://en.wikipedia.org/wiki/Functional_analysis –
Extracting signal from Noise
– https://en.wikipedia.org/wiki/Signaltonoise_ratio –
– https://en.wikipedia.org/wiki/Noise_reduction – (filtering, etc.)
If we add N samples of noisy data, then the signal will add proportionally to N,
whereas the noise will grow only as sqrt(N). So, if we add 100 samples, we will
improve our Signal/Noise ratio ~10 times.
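The sqrt(N) improvement can be simulated (a sketch with made-up numbers: a constant signal of 1.0 buried in Gaussian noise of standard deviation 5):

```python
import random
import statistics

random.seed(2)
signal = 1.0    # constant signal
noise_sd = 5.0  # noise much stronger than the signal

def averaged_sample(n):
    """Average of n noisy measurements of the same signal."""
    return statistics.mean(signal + random.gauss(0, noise_sd) for _ in range(n))

# Noise on a single sample has sd 5; after averaging 100 samples the
# remaining noise shrinks by sqrt(100) = 10 times, to sd ~0.5.
single = [averaged_sample(1) for _ in range(2000)]
averaged = [averaged_sample(100) for _ in range(2000)]
print(statistics.stdev(single))    # ~5
print(statistics.stdev(averaged))  # ~0.5
```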
Monte Carlo Method
– https://en.wikipedia.org/wiki/Monte_Carlo_method – used to evaluate something by using randomly generated samples. The more samples – the more accurate the result.
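A classic illustration is estimating pi by sampling random points (a minimal sketch):

```python
import random

random.seed(3)
n = 1_000_000

# Throw random points into the unit square; the fraction landing inside
# the quarter circle x^2 + y^2 <= 1 approaches pi/4.
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)

pi_estimate = 4 * inside / n
print(pi_estimate)  # ~3.14; more samples give a more accurate result
```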