Data Science

On this page we cover the following topics:

1. What is Data Science
2. Data Science Tools
3. Introduction to Probability Theory
4. Introduction to Statistics


1. What is Data Science (DS)?

The goal of DS is to extract knowledge or insights from data. Data Scientists spend most of their time on these basic activities:

  • Collecting the data (CSV & Excel files, web scraping, log files, databases & APIs, etc.)
  • Cleaning, aggregating, and loading the data (ETL = Extract, Transform, Load)
  • Visualizing the data to get an intuitive understanding of it
  • Processing the data, performing mathematical/statistical analysis, and building mathematical models. The most common tasks are:
      – cluster analysis (splitting data into groups)
      – regression analysis (approximating with theoretical curves)
  • Visualizing the results: reporting and recommendations

A Data Scientist should combine several skills:

  • knowledge of a given subject matter (context)
  • mathematics and statistical methods
  • using computers to work with data.

More advanced topics associated with DS are Machine Learning (ML) and Artificial Intelligence (AI).

Here are some common areas where the methods can be applied:

  • Making recommendations (people who liked this also liked that)
  • Finding similar users (finding more users who are also likely to buy this product)
  • Cluster analysis (discovering groups)
  • Searching (search engine, ranking)
  • Optimizing (travel, cost, etc.)
  • Removing Spam (email, blog feeds/posts, searches)
  • Predictions and making decisions
  • Building Price Models
  • Finding new prospective products, new markets, new acquisition candidates, etc.
  • Epidemiology


Here are some links about Data Science:


Here are the lists of books about Data Science:

 2. Data Science Tools


Databases on Cloud:

  • Amazon Redshift – Amazon’s own database product in the Amazon cloud
  • Snowflake Computing – elastic SQL data warehouse on top of the Amazon cloud; supports both relational tables and JSON
  • Microsoft Azure – elastic SQL data warehouse on the Microsoft cloud
  • Teradata – expandable relational data warehouse system (since 1979) with a “shared nothing” architecture
  • IBM dashDB
  • HPE Vertica in the cloud – the HP Vertica database in the cloud
  • MongoDB

Big Data tools – see on a separate page:


CSV and JSON data formats:




R Programming Language:




Some Data Science Algorithms:


Seminal Articles Every Data Scientist Should Read (from, August 2014)

External Papers

  1. Bigtable: A Distributed Storage System for Structured Data
  2. A Few Useful Things to Know about Machine Learning
  3. Random Forests
  4. A Relational Model of Data for Large Shared Data Banks
  5. Map-Reduce for Machine Learning on Multicore
  6. Pasting Small Votes for Classification in Large Databases and On-Line
  7. Recommendations Item-to-Item Collaborative Filtering
  8. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
  9. Spanner: Google’s Globally-Distributed Database
  10. Megastore: Providing Scalable, Highly Available Storage for Interactive Services
  11. F1: A Distributed SQL Database That Scales
  12. APACHE DRILL: Interactive Ad-Hoc Analysis at Scale
  13. A New Approach to Linear Filtering and Prediction Problems
  14. Top 10 algorithms on Data mining
  15. The PageRank Citation Ranking: Bringing Order to the Web
  16. MapReduce: Simplified Data Processing on Large Clusters
  17. The Google File System
  18. Amazon’s Dynamo


DSC Internal Papers

  1. How to detect spurious correlations, and how to find the …
  2. Automated Data Science: Confidence Intervals
  3. 16 analytic disciplines compared to data science
  4. From the trenches: 360-degree data science
  5. 10 types of regressions. Which one to use?
  6. Practical illustration of Map-Reduce (Hadoop-style), on real data
  7. Jackknife logistic and linear regression for clustering and predict…
  8. A synthetic variance designed for Hadoop and big data
  9. Fast Combinatorial Feature Selection with New Definition of Predict…
  10. Internet topology mapping
  11. 11 Features any database, SQL or NoSQL, should have
  12. 10 Features all Dashboards Should Have
  13. Clustering idea for very large datasets
  14. Hidden decision trees revisited
  15. Correlation and R-Squared for Big Data
  16. What Map Reduce can’t do
  17. Excel for Big Data
  18. Fast clustering algorithms for massive datasets
  19. The curse of big data
  20. Interesting Data Science Application: Steganography


Some Data Science videos for beginners:



3. Introduction to Probability Theory

Probability Definitions

The probability of an event is a measure of the likelihood that the event will occur (in some experiment). Probability is a number between 0 and 1, although people sometimes express it as a percentage.

0 – event is impossible
0.5 – the event is just as likely to occur as it is to not occur
1 – event will certainly happen, 100%.

The sum of probabilities of all outcomes of an experiment should always be 1.

Outcomes can be combinations of individual events.
For example, if we throw one die, there are 6 possible outcomes (1, 2, 3, 4, 5, 6), each with the same probability: 1/6.
If we throw 2 dice, there are 6*6 = 36 possible combinations. The probability of each combination is the same: 1/36.

If we define an outcome as the sum of the numbers on the top faces of the dice after the throw, then there are 11 possible outcomes:

Sum     Combinations                    Number of combinations   Probability
2       1+1                             1                        1/36
3       1+2, 2+1                        2                        2/36
4       1+3, 2+2, 3+1                   3                        3/36
5       1+4, 2+3, 3+2, 4+1              4                        4/36
6       1+5, 2+4, 3+3, 4+2, 5+1         5                        5/36
7       1+6, 2+5, 3+4, 4+3, 5+2, 6+1    6                        6/36
8       2+6, 3+5, 4+4, 5+3, 6+2         5                        5/36
9       3+6, 4+5, 5+4, 6+3              4                        4/36
10      4+6, 5+5, 6+4                   3                        3/36
11      5+6, 6+5                        2                        2/36
12      6+6                             1                        1/36
Total                                   36                       1

From the table above you can see that some outcomes result from more combinations and thus have a higher probability of happening.
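The two-dice table can be reproduced by brute-force enumeration. The following is a minimal sketch in Python; the variable names are illustrative only. Note that `Fraction` prints reduced values, so 2/36 appears as 1/18.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (first die, second die) combinations.
combinations = list(product(range(1, 7), repeat=2))

# Count how many combinations produce each sum.
counts = Counter(a + b for a, b in combinations)

# Probability of each sum = number of combinations / 36.
probabilities = {s: Fraction(n, len(combinations)) for s, n in counts.items()}

for s in sorted(probabilities):
    print(s, counts[s], probabilities[s])

# The probabilities of all outcomes add up to 1.
print("Total", sum(counts.values()), sum(probabilities.values()))
```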

Note: Events can be mutually exclusive, or not. Also, events may be independent, or dependent on each other.

Marginal Probability = Simple Probability: the probability of a single event occurring, regardless of the outcomes of other events.

The term “marginal variable” is used to refer to a subset of variables in a table. They are called “marginal” because they used to be calculated by summing values in a table along rows or columns, and writing the sum in the margins of the table.

Examples of assigning probability.

1. Tossing a coin.

Possible outcomes: (tails or heads).
The events (tails or heads) are mutually exclusive. For a fair coin their probabilities should be equal.

Pt + Ph = 1
Pt = 0.5
Ph = 0.5

2. Rolling a die.

Possible outcomes: (1,2,3,4,5,6).
They are mutually exclusive and equally probable, and their probabilities sum to 1.
So each probability = 1/6 = 0.167.

3. Job Search.

I get 2 emails with job descriptions every day, 10 per week.

Two of them may be a fit: 2 of 10 (0.2).

So I send an email saying that I am interested. I am invited to an interview for every third position I requested (0.2/3 = 0.067).

And I get 1 position out of 10 initial interviews (0.067/10 = 0.0067).

So the probability that an incoming email will result in a job is about 0.0067, or roughly 1% after rounding up. Using that rounded figure, I need to consider about 100 positions (this will take 10 weeks), send 20 emails expressing interest, go to 6 or 7 interviews, and eventually get a job.

So the job search takes 2 to 3 months.

Rules (Or, And, etc.)

Let A be an event and P(A) the probability of event A. Then:

P(A) + P(not A) = 1

If A and B are events, then:

P(A or B) = P(A) + P(B) – P(A and B)

For mutually exclusive events P(A and B) = 0, so:

P(A or B) = P(A) + P(B)
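These rules can be checked on a single fair die by representing events as sets of outcomes. This is only a sketch: the events A ("even number") and B ("greater than 3") are chosen for illustration and do not come from the text above.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                 # one fair die
A = {n for n in outcomes if n % 2 == 0}     # illustrative event A: even number
B = {n for n in outcomes if n > 3}          # illustrative event B: greater than 3

def prob(event):
    """Probability of an event = favourable outcomes / all outcomes."""
    return Fraction(len(event), len(outcomes))

# Complement rule: P(A) + P(not A) = 1
p_complement_sum = prob(A) + prob(outcomes - A)

# Or rule: P(A or B) = P(A) + P(B) - P(A and B)
p_or_direct = prob(A | B)
p_or_formula = prob(A) + prob(B) - prob(A & B)

# Mutually exclusive events: the intersection is empty, so probabilities just add.
C, D = {1}, {2}
p_exclusive = prob(C | D)

print(p_complement_sum, p_or_direct, p_or_formula, p_exclusive)
```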

Graphical representation of events as circles (Venn Diagrams).

See many examples – just google images for: Venn Diagram A or B.
For example, this image shows intersection (A and B),
symmetric difference (A but not B, or B but not A),
union (A or B), and subtraction (A but not B):
[Image: Venn Diagrams]

Conditional Probability

P(A|B) – probability of A given that B has occurred.

P(A|B) = P(A and B) / P(B) (if P(B) is not zero)

If A and B are independent of each other, then:

P(A|B) = P(A)

P(B|A) = P(B)


Bayes’ Theorem (Law, Rule) – named after Thomas Bayes (/beɪz/; 1701–1761)

P(A|B) = P(B|A)*P(A)/P(B)

where A and B are events and P(B) is not equal to 0.
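As a sanity check, the conditional probability definition and Bayes’ theorem can be compared on two dice. The events below (A: the first die shows 6, B: the sum is at least 10) are assumptions chosen for illustration. The sketch computes P(A|B) directly by counting and again via Bayes’ formula.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls

def prob(event):
    """Probability of an event given as a predicate over (first, second)."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def is_a(r):
    return r[0] == 6            # illustrative event A: first die shows 6

def is_b(r):
    return r[0] + r[1] >= 10    # illustrative event B: sum is at least 10

p_a = prob(is_a)
p_b = prob(is_b)
p_a_and_b = prob(lambda r: is_a(r) and is_b(r))

# Definition of conditional probability.
p_b_given_a = p_a_and_b / p_a
p_a_given_b_direct = p_a_and_b / p_b

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b_direct, p_a_given_b_bayes)
```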

Frequentist interpretation of Bayes’ theorem.

Probability measures a proportion of outcomes.
Suppose an experiment is performed many times.
P(A) is the proportion of outcomes with property A,
and P(B) that with property B.
P(B|A) is the proportion of outcomes with property B out of outcomes with property A,
and P(A|B) the proportion of those with A out of those with B.

Bayesian interpretation of Bayes’ theorem.

Probability measures a degree of belief.
Bayes’ theorem then links the degree of belief in a proposition before (A)
and after (B) accounting for evidence.

For example, suppose it is believed with 50% certainty that a coin is twice as likely to land heads as tails.
If the coin is flipped a number of times and the outcomes observed,
then that degree of belief may rise, fall or remain the same depending on the results.

For proposition A and evidence B,
P(A), the prior, is the initial degree of belief in A.
P(A|B), the posterior, is the degree of belief having accounted for B.
P(B|A)/P(B) – represents the support B provides for A.

See also Bayesian inference.

Bayesian inference is a method of statistical inference in which
Bayes’ theorem is used to update the probability for a hypothesis
as more evidence or information becomes available.
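The coin example can be turned into a small Bayesian update loop. This is a sketch under stated assumptions: the only alternative to the "biased" hypothesis (heads twice as likely as tails, so P(heads) = 2/3) is assumed to be a fair coin, and the flip sequence is made up for illustration.

```python
from fractions import Fraction

# Two competing hypotheses about the coin (the "fair" alternative is an assumption):
#   "biased": heads is twice as likely as tails, so P(heads) = 2/3
#   "fair":   P(heads) = 1/2
p_heads = {"biased": Fraction(2, 3), "fair": Fraction(1, 2)}

def update(prior_biased, flip):
    """Return the posterior P(biased | flip), given the prior P(biased)."""
    prior = {"biased": prior_biased, "fair": 1 - prior_biased}
    # Likelihood of the observed flip under each hypothesis, P(B|A).
    like = {h: (p if flip == "H" else 1 - p) for h, p in p_heads.items()}
    # Total probability of the evidence, P(B).
    evidence = sum(like[h] * prior[h] for h in prior)
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    return like["biased"] * prior["biased"] / evidence

belief = Fraction(1, 2)      # prior: 50% certainty that the coin is biased
for flip in "HHTH":          # hypothetical observed flips
    belief = update(belief, flip)
    print(flip, float(belief))
```

Each observed head raises the degree of belief in the "biased" hypothesis and each tail lowers it, which matches the description above.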

Random Variable. Probability Distribution

A random variable is a numerical variable whose value depends on the outcome of a random experiment.
Values may be discrete or continuous.

For example, think of variable “x” which may have any real value between -10 and +10.

Or think of a variable “z” which can have discrete values 0,0.1,0.2, … 0.9, 1.0

Discrete Case

The probability function P(x) of a discrete random variable “x” is simply a function that gives the probability of each discrete value.

For example, if the random variable “x” may have two values: ( 0 or 1 ),
and if probabilities of those values are equal, then

P(0) = 0.5, P(1) = 0.5.

The sum of individual probabilities for all possible values should be 1.

Expected value is a weighted average of all values, with the probabilities as weights.

E(x) = sum(Xi*P(Xi)), where Xi goes through all allowed values of X.

Variance = sigma squared = sum( (Xi-E(x))**2 * P(Xi) )

Standard Deviation = sqrt(sigma squared)
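For a concrete case, the three formulas above can be applied to a single fair die. This is a minimal sketch; the die is used only as an illustration.

```python
import math

values = [1, 2, 3, 4, 5, 6]
P = {x: 1 / 6 for x in values}     # probability function of a fair die

# E(x) = sum(Xi * P(Xi))
expected = sum(x * P[x] for x in values)

# Variance = sum((Xi - E(x))**2 * P(Xi))
variance = sum((x - expected) ** 2 * P[x] for x in values)

# Standard Deviation = sqrt(Variance)
std_dev = math.sqrt(variance)

print(expected, variance, std_dev)
```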

Probability Density Function (PDF) vs. Cumulative Probability Function (CPF)

Continuous Case

Probability Density Function f(x) is a continuous function.

f(x)*dx gives probability that value is between x and x+dx.

The total integral of f(x)*dx should be equal to 1.

Cumulative Probability Function (CPF)

This is simply a function F(x) which gives the probability that the value is less than x. (It is more commonly called the Cumulative Distribution Function, CDF.)
It grows monotonically from 0 to 1 as x grows (from small to large values).

Probability Density Function vs Cumulative Probability Function

The PDF is the first derivative of the CPF.

CPF(x+dx) – CPF(x) = [Prob. of value < (x+dx)] – [Prob. of value < x] = f(x)*dx

Mean value = Expected value

For discrete case we had:
E(x) = sum(Xi*P(Xi))

For continuous case this translates into
E(x) = Integral(x*f(x)*dx), where f(x) is the PDF
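The relations between the PDF, the CPF, and the expected value can be checked numerically. The sketch below assumes a hypothetical density f(x) = 2x on the interval [0, 1], chosen only because its integrals are easy to verify by hand. It uses a simple midpoint rule, not a library integrator.

```python
def f(x):
    """Hypothetical density on [0, 1]: f(x) = 2x (zero elsewhere)."""
    return 2.0 * x

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g from a to b."""
    dx = (b - a) / steps
    return sum(g(a + (i + 0.5) * dx) for i in range(steps)) * dx

# The total area under the PDF should be 1.
total = integrate(f, 0.0, 1.0)

# E(x) = Integral(x * f(x) * dx); for this density the exact value is 2/3.
mean = integrate(lambda x: x * f(x), 0.0, 1.0)

def F(x):
    """CPF: probability that the value is less than x."""
    return integrate(f, 0.0, x)

# The PDF is the derivative of the CPF: (F(x+dx) - F(x)) / dx is close to f(x).
x, dx = 0.5, 1e-3
slope = (F(x + dx) - F(x)) / dx

print(total, mean, slope, f(x))
```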

Binomial Distribution

The experiment consists of N trials. Each trial can have two outcomes (True or False).
The probabilities of these outcomes are the same for all trials (p and q = 1-p).
The trials are independent of each other.

The probability of having “x” True values (out of N) is:

f(x) = C(N,x)*p^x*(1-p)^(N-x)
where C(N,x) = N!/(x!*(N-x)!) – a binomial coefficient
For large N the distribution f(x) has a bell-curve shape

Mean value = Expected value E(x) = N*p
Variance = N*p*(1-p).
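The binomial formula and its mean and variance can be verified numerically. In this sketch the values N = 10 and p = 0.3 are arbitrary illustrative choices. `math.comb` (Python 3.8+) computes the binomial coefficient C(N,x).

```python
import math

def binomial_pmf(x, n, p):
    """f(x) = C(n, x) * p^x * (1-p)^(n-x)"""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

N, p = 10, 0.3     # illustrative values only

pmf = [binomial_pmf(x, N, p) for x in range(N + 1)]

total = sum(pmf)                                               # should be 1
mean = sum(x * pmf[x] for x in range(N + 1))                   # should be N*p
variance = sum((x - mean) ** 2 * pmf[x] for x in range(N + 1)) # should be N*p*(1-p)

print(total, mean, variance)
```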



N! = N*(N-1)*(N-2)*…*1


Let’s say we have N coins or some other things (an N-element unordered set).

Question: In how many ways can we arrange them?

Answer: N!

Explanation: there are N variants for the 1st element, (N-1) variants for the 2nd element, etc.

Example: How many ways are there to arrange the letters ABCDE in a row?

Answer: 5!=120


Question: In how many ways can we select m-elements out of N-element set?

Answer: C(N,m) = N!/(m!*(N-m)!) – a binomial coefficient


Explanation: there are N variants for the first element of the subset, (N-1) for the 2nd, etc., down to (N-m+1) for the m-th element. The product of these can be written as N!/(N-m)!. Then we need to divide by the number of reordering permutations within the subset, that is, by m!.
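Both counting results can be confirmed by listing the arrangements and subsets explicitly. This is a small sketch using the ABCDE example from above; the subset size m = 2 is an illustrative choice.

```python
import math
from itertools import combinations, permutations

letters = "ABCDE"

# All arrangements of the 5 letters in a row: there should be 5! of them.
arrangements = list(permutations(letters))

# Choosing m elements out of N, following the explanation above:
N, m = len(letters), 2
ordered = math.factorial(N) // math.factorial(N - m)   # N!/(N-m)! ordered picks
subsets = ordered // math.factorial(m)                  # divide by m! reorderings

# Cross-check by listing the subsets explicitly.
listed = list(combinations(letters, m))

print(len(arrangements), ordered, subsets, len(listed))
```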


Poisson and Exponential Distributions

Suppose you are sitting near the road. You notice that on average you see 10 cars passing by every minute (we say the rate lambda = 10/min). Knowing the average rate, what would be the probability of having only 5 cars in one minute? Or 15 cars? Or any given number?

Poisson probability P(x) is the probability of a given number of events x occurring in a fixed interval (of time), if these events occur with a known average rate and independently of each other and of the time since the previous event.

Poisson Formula

P(x) = (lambda**x)*exp(-lambda)/x!

For example, for x=0 (probability of zero events in given interval):
P(0) = (lambda**0)*exp(-lambda)/0! = 1*exp(-lambda)/1 = exp(-lambda)
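Applying the Poisson formula to the cars example (lambda = 10 per minute) answers the questions posed above. This is a minimal sketch; the function name is illustrative.

```python
import math

def poisson_pmf(x, lam):
    """P(x) = lam**x * exp(-lam) / x!"""
    return lam**x * math.exp(-lam) / math.factorial(x)

lam = 10                      # average of 10 cars per minute

p5 = poisson_pmf(5, lam)      # probability of exactly 5 cars, about 0.038
p15 = poisson_pmf(15, lam)    # probability of exactly 15 cars, about 0.035
p0 = poisson_pmf(0, lam)      # probability of no cars at all = exp(-lambda)

# Probabilities of all possible counts add up to 1 (the tail beyond 60 is negligible).
total = sum(poisson_pmf(x, lam) for x in range(61))

print(p5, p15, p0, total)
```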

Poisson Distribution


Intervals between Poisson events are distributed exponentially.

If we have events that follow a Poisson distribution, we can measure the time intervals between these events.

Question: How will these times be distributed?

Answer: These times will have an exponential distribution with the following Probability Density Function (PDF):

f(t) = lambda*exp(-lambda*t)

The average time interval will be (1/lambda).
Standard deviation happens to be the same (1/lambda).
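A quick simulation illustrates that both the mean and the standard deviation of exponentially distributed intervals are close to 1/lambda. The sketch uses `random.expovariate` from the Python standard library; the rate, sample size, and seed are arbitrary illustrative choices.

```python
import math
import random

random.seed(0)     # fixed seed so the run is repeatable
lam = 10.0         # average rate: 10 events per minute

# Draw time intervals with density f(t) = lam * exp(-lam * t).
intervals = [random.expovariate(lam) for _ in range(100_000)]

n = len(intervals)
mean_interval = sum(intervals) / n
std_interval = math.sqrt(sum((t - mean_interval) ** 2 for t in intervals) / n)

# Both values should be close to 1/lam = 0.1 minutes.
print(mean_interval, std_interval, 1 / lam)
```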

Uniform and Normal Distributions. Central Limit Theorem.

Uniform Distribution

If some value can change between some minimum and maximum values a and b, and the probability of any value in the interval [a,b] is the same, then we say that we have a uniform distribution.
The PDF (Probability Density Function) for Uniform Distribution looks like a rectangle with the height (1/(b-a)), so that the total area under the curve is exactly 1.

More: Uniform distribution (continuous)

Normal (Gaussian Bell-shaped) Distribution

The basic shape of the PDF is exp(-x*x/2). You need to add a coefficient to normalize it to 1 (that is, to make the area under the curve exactly equal to 1); for this basic shape the coefficient is 1/sqrt(2*pi). The curve can also have a different position (average value, mu) and a different width (sigma):

f(x) = exp(-(x-mu)**2 / (2*sigma**2)) / (sigma*sqrt(2*pi))


Normal Distribution

Please see Generic formula and shape.

The normal distribution is very common in nature, because when we have a large number of sets of values, the distribution of the parameters of these sets (for example, their averages) tends to converge to this bell shape. This fact is called the Central Limit Theorem.

In other words, suppose we have a variable with exponential distribution (or any other distribution). Suppose we get a sample set of N1 values, and we have N2 such sets. We calculate an average in each set – so we have N2 average values.

Question: How will these average values be distributed?

Answer: As N1 & N2 become big enough, the distribution will get close to the Normal Gaussian bell curve.

Interestingly, if the original variable has some other distribution with a finite variance (not exponential, but something else: uniform, Poisson, etc.), we will still get the Normal distribution!
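The experiment described above (N2 sets of N1 exponentially distributed values each) can be simulated in a few lines. This is a sketch with arbitrary illustrative sizes and seed. It compares the spread of the averages with the CLT prediction sigma/sqrt(N1), and checks that roughly 68% of the averages fall within one standard deviation of their mean, as expected for a bell curve.

```python
import math
import random

random.seed(1)     # fixed seed so the run is repeatable
lam = 1.0          # exponential rate; mean = std = 1/lam
N1 = 50            # number of values in each sample set
N2 = 10_000        # number of sample sets

# One average per sample set.
averages = [
    sum(random.expovariate(lam) for _ in range(N1)) / N1
    for _ in range(N2)
]

mean_of_averages = sum(averages) / N2
std_of_averages = math.sqrt(
    sum((a - mean_of_averages) ** 2 for a in averages) / N2
)

# CLT prediction: the averages are centred on 1/lam with std (1/lam)/sqrt(N1).
predicted_std = (1 / lam) / math.sqrt(N1)

# For a normal distribution about 68% of values lie within one sigma of the mean.
within_one_sigma = sum(
    abs(a - mean_of_averages) <= std_of_averages for a in averages
) / N2

print(mean_of_averages, std_of_averages, predicted_std, within_one_sigma)
```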

See Illustration of the central limit theorem. On that page you can select the shape of the underlying distribution and see how, regardless of that distribution, the result is always the Normal distribution.

See also:

There are also some nice animated demonstrations – just google for them.

De Moivre–Laplace theorem

The De Moivre–Laplace theorem is a simple demonstration of how a symmetrical binomial distribution can be approximated by a Gaussian bell curve (under certain conditions). See video demonstration below.
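The approximation can also be checked numerically by comparing the binomial probabilities with a normal density that has the same mean N*p and variance N*p*(1-p). In this sketch N = 100 and p = 0.5 are illustrative choices for a symmetrical binomial distribution.

```python
import math

N, p = 100, 0.5                        # illustrative symmetrical case
mu = N * p                             # mean of the binomial distribution
sigma = math.sqrt(N * p * (1 - p))     # its standard deviation

def binomial_pmf(x):
    return math.comb(N, x) * p**x * (1 - p) ** (N - x)

def normal_pdf(x):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Largest absolute difference between the two curves over all x = 0..N.
max_diff = max(abs(binomial_pmf(x) - normal_pdf(x)) for x in range(N + 1))

print(binomial_pmf(50), normal_pdf(50), max_diff)
```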





4. Introduction to Statistics

“Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.” ( Source: Wikipedia article about Statistics.)

Being a statistician is a major part of being a Data Scientist. And it is tricky, because it is easy to misuse statistics:
“There are three kinds of lies: lies, damned lies, and statistics.” (Source: Lies, damned lies, and statistics)

Statistics involves

  • Collecting (sampling) the data from different kinds of observations
  • Visualizing the data, estimating the parameters (average/mean values, errors, etc.)
  • Testing validity of a hypothesis (null hypothesis) – or alternative hypothesis
  • Estimating intervals, estimating significance of observed differences, etc.


Here are some basic formulas:

Statistics formulas

Intuitive explanation for dividing by (n-1) when calculating standard deviation:
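As a hedged illustration of the (n-1) divisor, the sketch below computes the mean and both versions of the standard deviation for a small made-up sample, then cross-checks them against the `statistics` module from the Python standard library (`pstdev` divides by n, `stdev` divides by n-1).

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical sample
n = len(data)

mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

# Dividing by n describes the data set itself (population formula).
population_std = math.sqrt(sum_sq / n)

# Dividing by (n-1) estimates the spread of the wider population from a sample.
sample_std = math.sqrt(sum_sq / (n - 1))

print(mean, population_std, sample_std)
print(statistics.pstdev(data), statistics.stdev(data))
```

The sample version is always slightly larger, and the difference shrinks as n grows.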

More advanced topics:

Linear Regression – draw a line through experimental points.

Simple Linear Regression – the least squares estimator of a linear regression model.

Here is the simplest standard OLS method (OLS = Ordinary Least Squares).

ym = y_mean = sum(y)/N
xm = x_mean = sum(x)/N

var(x) = variance(x) = sum((x-xm)^2)
covar(x,y) = covariance(x,y) = sum((y-ym)*(x-xm))

slope b = covar(x,y) / var(x)
intercept a = ym - b*xm

We can simplify as follows:
var(x) = sum((x-xm)^2) = sum(x^2) - sum(2*x*xm) + sum(xm^2) = sum(x^2) - N*xm^2
covar(x,y) = ... using similar logic = sum(y*x) - N*xm*ym

slope b = (sum(y*x) - N*xm*ym) / (sum(x^2) - N*xm^2)
intercept a = ym - b*xm = (ym * sum(x^2) - ym*N*xm^2 -xm*sum(y*x)+xm*N*xm*ym)/(sum(x^2) - N*xm^2)
= (ym * sum(x^2) -xm*sum(y*x))/(sum(x^2) - N*xm^2)
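The OLS formulas above translate almost line by line into Python. The sketch below implements both the direct form and the simplified form and checks that they agree. The function names and the test data (points lying exactly on y = 1 + 2x) are illustrative assumptions. As in the formulas, var(x) and covar(x,y) are plain sums without the 1/N factor, which cancels in the slope.

```python
def ols_fit(x, y):
    """Slope and intercept from the centred sums (direct form)."""
    n = len(x)
    xm = sum(x) / n
    ym = sum(y) / n
    var_x = sum((xi - xm) ** 2 for xi in x)
    covar_xy = sum((yi - ym) * (xi - xm) for xi, yi in zip(x, y))
    b = covar_xy / var_x          # slope
    a = ym - b * xm               # intercept
    return a, b

def ols_fit_simplified(x, y):
    """Same fit using the simplified sums sum(x^2) and sum(y*x)."""
    n = len(x)
    xm = sum(x) / n
    ym = sum(y) / n
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    denom = sum_xx - n * xm**2
    b = (sum_xy - n * xm * ym) / denom
    a = (ym * sum_xx - xm * sum_xy) / denom
    return a, b

# Illustrative data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

a1, b1 = ols_fit(x, y)
a2, b2 = ols_fit_simplified(x, y)
print(a1, b1, a2, b2)
```

The simplified form needs only running sums, but it can lose precision when the x values are large and close together, so the centred form is usually the safer choice numerically.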


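When drawing a curve through experimental points, it is very important to estimate how well this curve fits the data, and whether fitting the data with this curve makes sense at all. For this you can use the Coefficient of Determination (R squared):

R^2 = 1 - sum((y-f)^2) / sum((y-ym)^2)

where f is the value predicted by the curve and ym is the mean of y. R^2 close to 1 means the curve explains most of the variation in the data.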
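A minimal sketch of the coefficient of determination, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data points and the candidate line y = 1 + 2x are made up for illustration.

```python
def r_squared(y, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ym = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, predicted))  # unexplained variation
    ss_tot = sum((yi - ym) ** 2 for yi in y)                      # total variation
    return 1 - ss_res / ss_tot

# Hypothetical noisy measurements scattered around the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

line = [1 + 2 * xi for xi in x]      # candidate curve (assumed, not fitted here)
flat = [sum(y) / len(y)] * len(y)    # "curve" that always predicts the mean

r2_line = r_squared(y, line)    # close to 1: the line explains the data well
r2_flat = r_squared(y, flat)    # 0: no better than predicting the mean

print(r2_line, r2_flat)
```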


More Statistics:


Stochastic Processes, Time Series Analysis

Poisson Process

White Noise, Gaussian Noise, Brownian Noise

Markov Process

Correlation function

Spectral Density, Fourier Analysis

Convolution, Functional Analysis

Extracting signal from Noise (filtering, etc.)
If we add N samples “s” of noisy data, the signal adds up proportionally to N,
whereas the noise grows only as sqrt(N). So, if we add 100 samples, we
improve our Signal/Noise ratio ~10 times.
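The sqrt(N) argument can be checked with a short simulation. This is a sketch under simple assumptions: a constant signal, independent Gaussian noise on every sample, and arbitrary illustrative values for the signal level, noise level, and seed.

```python
import math
import random

random.seed(2)       # fixed seed so the run is repeatable
signal = 1.0         # constant true signal in every sample (assumed)
noise_sigma = 1.0    # standard deviation of the Gaussian noise per sample (assumed)
N = 100              # number of samples added together
trials = 5_000       # number of repeated experiments

# Each trial adds up N noisy samples.
sums = [
    sum(signal + random.gauss(0.0, noise_sigma) for _ in range(N))
    for _ in range(trials)
]

mean_sum = sum(sums) / trials                                    # close to N * signal
noise_of_sum = math.sqrt(
    sum((s - mean_sum) ** 2 for s in sums) / trials
)                                                                # close to sqrt(N) * noise_sigma

snr_single = signal / noise_sigma
snr_summed = mean_sum / noise_of_sum
improvement = snr_summed / snr_single                            # close to sqrt(N) = 10

print(mean_sum, noise_of_sum, improvement)
```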

Monte Carlo Method – used to evaluate something by using randomly-generated samples. The more samples – the more accurate the result.
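A classic illustration is estimating pi from random points: the fraction of points in the unit square that fall inside the quarter circle of radius 1 approaches pi/4. The sketch below is one simple example of the method, with an arbitrary seed and arbitrary sample sizes. The error typically shrinks like 1/sqrt(number of samples).

```python
import math
import random

random.seed(3)   # fixed seed so the run is repeatable

def estimate_pi(samples):
    """Estimate pi from the share of random points inside the quarter circle."""
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()   # random point in the unit square
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

# More samples generally give a more accurate result.
for samples in (100, 10_000, 200_000):
    print(samples, estimate_pi(samples))

estimate = estimate_pi(200_000)
print(abs(estimate - math.pi))
```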