The goal of Data Science (DS) is to extract knowledge or insights from data. Data Scientists spend most of their time on these basic activities:
- Collecting the data (CSV & Excel files, web scraping, log files, databases & APIs, etc.)
- Cleaning, aggregating, and loading the data (ETL = Extract, Transform, Load)
- Visualizing the data to get intuitive understanding about it
- Performing processing and mathematical/statistical analysis and modeling. The most common tasks are cluster analysis (splitting data into groups) and regression analysis (approximating data with theoretical curves)
- Visualizing the results – reporting and recommendations
A Data Scientist should combine several skills:
- knowledge of a given subject matter (context)
- mathematics and statistical methods
- using computers to work with data.
More advanced topics associated with DS are Machine Learning (ML) and Artificial Intelligence (AI).
- Machine Learning (ML) – algorithms to recognize patterns and make predictions
- Deep Machine Learning (DML) – complex ML (multi-layer, structured, hierarchical)
- Artificial Intelligence (AI) – learning and problem solving
- Data Mining – exploring data searching for patterns (Data stores, Databases, Statistics, ML, AI)
- Sentiment Analysis – opinion mining
- Natural Language Processing (NLP) – extract data and meaning from the web and literature
Here are some common areas where the methods can be applied:
- Making recommendations (people who liked this also liked that)
- Finding similar users (more users who are also likely to buy this product)
- Cluster analysis (discovering groups)
- Searching (search engine, ranking)
- Optimizing (travel, cost, etc.)
- Removing Spam (email, blog feeds/posts, searches)
- Predictions and making decisions
- Building Price Models
- Finding prospective new products, new markets, new acquisition candidates, etc.
Here are some links about Data Science:
- http://www.evanmiller.org/ab-testing/ – very nice statistical calculators
- http://www.datascienceglossary.org/ – good glossary of Data Science terms
Here are some lists of books about Data Science:
- Books on probability and statistics https://www.quora.com/What-are-some-good-books-for-learning-probability-and-statistics
- Books on Data Science https://www.quora.com/What-are-the-best-books-about-data-scienc
- Data warehouse – OLAP vs OLTP, star schema, dimensional model
- SQL (Structured Query Language)
- ETL tools (ETL = Extract, Transform, Load)
Databases on Cloud:
- Amazon Redshift – Amazon’s own database product in Amazon cloud
- Snowflake Computing – Elastic SQL Data Warehouse on top of amazon cloud, supports both relational tables and JSON
- Microsoft Azure – Elastic SQL Data Warehouse, on Microsoft Cloud
- Teradata – expandable relational data warehouse system (since 1979), “shared nothing” architecture.
- IBM dashDB – IBM’s cloud data warehouse service
- HPE Vertica in the cloud – HP Vertica database on the cloud
Big Data tools – see on a separate page:
- Big Data, Cloud – http://www.selectorweb.com/resources/big-data/
CSV and JSON data formats:
- CSV – https://en.wikipedia.org/wiki/Comma-separated_values (Comma-Separated Values)
- JSON – https://en.wikipedia.org/wiki/JSON (JavaScript Object Notation)
R Programming Language:
- http://mc-stan.org – Stan, a platform for statistical modeling (developed mostly at Columbia University)
Some Data Science Algorithms:
- Naive Bayesian Classifier – a simple probabilistic classifier based on Bayes’ theorem with strong independence assumptions
- Decision Tree Learning – (Gradient Tree Boosting, Random Forest, …)
- Neural Networks – networks of simple connected units (“neurons”) trained to approximate complex functions
- Support-Vector Machines (SVMs) – supervised learning models
- k-Nearest Neighbors – used for classification or regression
- Clustering (Cluster Analysis) – grouping objects so that objects in the same group are more similar to each other than to objects in other groups
- Multidimensional Scaling (MDS) – visualizing similarities in a dataset
- Non-Negative Matrix Factorization (NMF) – representing a big matrix as a product of two smaller matrices
- Optimization – finding parameter values that minimize (or maximize) a cost/objective function
Seminal Articles Every Data Scientist Should Read (from datasciencecentral.com, August 2014)
- Bigtable: A Distributed Storage System for Structured Data
- A Few Useful Things to Know about Machine Learning
- Random Forests
- A Relational Model of Data for Large Shared Data Banks
- Map-Reduce for Machine Learning on Multicore
- Pasting Small Votes for Classification in Large Databases and On-Line
- Amazon.com Recommendations: Item-to-Item Collaborative Filtering
- Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
- Spanner: Google’s Globally-Distributed Database
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- F1: A Distributed SQL Database That Scales
- Apache Drill: Interactive Ad-Hoc Analysis at Scale
- A New Approach to Linear Filtering and Prediction Problems
- Top 10 algorithms on Data mining
- The PageRank Citation Ranking: Bringing Order to the Web
- MapReduce: Simplified Data Processing on Large Clusters
- The Google File System
- Amazon’s Dynamo
DSC Internal Papers
- How to detect spurious correlations, and how to find the …
- Automated Data Science: Confidence Intervals
- 16 analytic disciplines compared to data science
- From the trenches: 360-degree data science
- 10 types of regressions. Which one to use?
- Practical illustration of Map-Reduce (Hadoop-style), on real data
- Jackknife logistic and linear regression for clustering and predict…
- A synthetic variance designed for Hadoop and big data
- Fast Combinatorial Feature Selection with New Definition of Predict…
- Internet topology mapping
- 11 Features any database, SQL or NoSQL, should have
- 10 Features all Dashboards Should Have
- Clustering idea for very large datasets
- Hidden decision trees revisited
- Correlation and R-Squared for Big Data
- What Map Reduce can’t do
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- The curse of big data
- Interesting Data Science Application: Steganography
Some Data Science videos for beginners:
Data Science for Beginners video 1: The 5 questions that data science can answer
Data Science for Beginners video 2: Is your data ready for data science?
Part 3 – How to ask a question you can answer with data
Part 4 – Predict an answer with a simple model
Part 5 – Copy other people’s work to do data science
Data Science for the Rest of Us
Fantastic Data Science materials from Microsoft:
Good site with Data Science Articles and Tutorials:
The probability of an event is a measure of the likelihood that the event will occur (in some experiment). Probability is a number between 0 and 1, although people sometimes express it as a percentage.
- 0 – the event is impossible
- 0.5 – the event is just as likely to occur as not
- 1 – the event will certainly happen (100%)
The sum of probabilities of all outcomes of an experiment should always be 1.
Outcomes can be combinations of individual events.
For example, if we throw one die, there are 6 possible outcomes (1,2,3,4,5,6), each with the same probability – 1/6.
If we throw 2 dice, there are 6*6=36 possible combinations. The probability of each combination is the same – 1/36.
If we define an outcome as the sum of the numbers on the top sides of the dice after the throw, then there are 11 possible outcomes (2 through 12):

Sum:          2    3    4    5    6    7    8    9    10   11   12
Combinations: 1    2    3    4    5    6    5    4    3    2    1
Probability = combinations/36 (for example, P(7) = 6/36 = 1/6).

From the table above you can see that some outcomes result from more combinations – and thus have a higher probability of happening.
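As a quick check, here is a minimal Python sketch that enumerates all 36 equally likely combinations and counts how many produce each sum:

```python
from collections import Counter

# Enumerate all 36 equally likely combinations of two dice
# and count how many of them produce each possible sum.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))

for total in sorted(counts):
    print(f"sum={total:2d}  combinations={counts[total]}  probability={counts[total]}/36")
```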
Marginal Probability = Simple Probability, the probability of an independent event occurring.
The term “marginal variable” refers to a subset of variables in a table. They are called “marginal” because they used to be calculated by summing values in a table along rows or columns and writing the sums in the margins of the table.
Examples of assigning probability.
1. Tossing a coin.
Possible outcomes : (tails or heads).
The events (tails or heads) are mutually exclusive, and their probabilities should be equal.
Pt + Ph = 1
Pt = 0.5
Ph = 0.5
2. Rolling a die.
Possible outcomes: (1,2,3,4,5,6).
They are mutually exclusive, their probabilities are equal, and they sum to 1.
So each probability = 1/6 ≈ 0.167.
3. Job Search.
I get 2 emails with job descriptions every day, 10 per week.
Two of them may be a fit: 2 of 10 (0.2).
So I send an email saying that I am interested. I am invited to interview for every third position I respond to (0.2/3 ≈ 0.067).
And I get an offer from 1 position out of 10 initial interviews (0.067/10 ≈ 0.0067).
So the probability that an incoming email will eventually result in a job is about 0.0067, i.e. roughly 1 in 150. So I need to consider on the order of 100–150 positions (10–15 weeks), send 20–30 emails expressing interest, and go to 7–10 interviews to eventually get a job.
So the job search takes roughly 2–4 months.
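A tiny sketch of this back-of-the-envelope chain (the rates are the illustrative numbers from the example above):

```python
# Back-of-the-envelope job-search funnel from the example above.
p_fit = 2 / 10          # an incoming job email looks like a fit
p_interview = 1 / 3     # an expressed interest leads to an interview
p_offer = 1 / 10        # an interview leads to an offer

p_job_per_email = p_fit * p_interview * p_offer
print(f"P(job per incoming email) = {p_job_per_email:.4f}")   # ~0.0067

emails_needed = 1 / p_job_per_email   # expected number of incoming emails per offer
weeks_needed = emails_needed / 10     # at 10 job emails per week
print(f"~{emails_needed:.0f} emails, ~{weeks_needed:.0f} weeks")
```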
Rules (Or, And, etc.)
Let A be an event, and P(A) – the probability of event A. Then:
P(A) + P(not A) = 1
If A and B are events, then:
P(A or B) = P(A) + P(B) – P(A and B)
For mutually exclusive events P(A and B) = 0, so:
P(A or B) = P(A) + P(B)
Graphical representation of events as circles (Venn Diagrams).
See many examples – just google images for: Venn Diagram A or B.
For example, such a diagram can show the intersection (A and B), the negation of the intersection (A but not B, or B but not A), the union (A or B), and the difference (A but not B).
P(A|B) – probability of A given that B has occurred.
P(A|B) = P(A and B) / P(B) (if P(B) is not zero)
If A and B are independent of each other, then:
P(A|B) = P(A)
P(B|A) = P(B)
Bayes’ Theorem (Law, Rule)
P(A|B) = P(B|A)*P(A)/P(B)
where A and B are events and P(B) not equal 0.
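As a quick numeric illustration of the theorem (the probabilities below are made up for the example):

```python
# Hypothetical numbers: a test detects a condition with P(B|A) = 0.9,
# the condition itself has prior P(A) = 0.01, and the test comes back
# positive with overall probability P(B) = 0.05.
p_b_given_a = 0.9
p_a = 0.01
p_b = 0.05

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.2f}")   # 0.18
```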
Frequentist interpretation of Bayes’ theorem.
Probability measures a proportion of outcomes.
Suppose an experiment is performed many times.
P(A) is the proportion of outcomes with property A,
and P(B) that with property B.
P(B|A) is the proportion of outcomes with property B out of outcomes with property A,
and P(A|B) the proportion of those with A out of those with B.
Bayesian interpretation of Bayes’ theorem.
Probability measures a degree of belief.
Bayes’ theorem then links the degree of belief in a proposition before (A)
and after (B) accounting for evidence.
For example, suppose it is believed with 50% certainty that a coin is twice as likely to land heads as tails.
If the coin is flipped a number of times and the outcomes observed,
then that degree of belief may rise, fall or remain the same depending on the results.
For proposition A and evidence B,
P(A), the prior, is the initial degree of belief in A.
P(A|B), the posterior, is the degree of belief having accounted for B.
P(B|A)/P(B) – represents the support B provides for A.
See also Bayesian inference.
Bayesian inference is a method of statistical inference in which
Bayes’ theorem is used to update the probability for a hypothesis
as more evidence or information becomes available.
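A minimal sketch of such Bayesian updating for the coin example above (the observed flips are made up; two discrete hypotheses are assumed):

```python
# Two competing hypotheses about the coin from the example above:
#   "biased": heads is twice as likely as tails -> P(heads) = 2/3
#   "fair":   the coin is fair                  -> P(heads) = 1/2
# Prior belief: 50% / 50%.
p_heads = {"biased": 2 / 3, "fair": 1 / 2}
posterior = {"biased": 0.5, "fair": 0.5}

flips = "HHTHHTHH"   # made-up observations

for flip in flips:
    # Likelihood of this flip under each hypothesis
    likelihood = {h: p if flip == "H" else 1 - p for h, p in p_heads.items()}
    # Bayes' theorem: posterior is proportional to likelihood * prior
    unnorm = {h: likelihood[h] * posterior[h] for h in posterior}
    total = sum(unnorm.values())
    posterior = {h: v / total for h, v in unnorm.items()}

print(posterior)   # belief in the "biased" hypothesis rises after seeing mostly heads
```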
Random Variable. Probability Distribution
A random variable is a numerical variable that can take different values.
Values may be discrete or continuous.
For example, think of a variable “x” which may take any real value between -10 and +10.
Or think of a variable “z” which can take the discrete values 0, 0.1, 0.2, …, 0.9, 1.0.
The probability function P(x) of a discrete random variable “x” is simply a function which gives the probability of each discrete value.
For example, if the random variable “x” may have two values: ( 0 or 1 ),
and if probabilities of those values are equal, then
P(0) = 0.5, P(1) = 0.5.
The sum of individual probabilities for all possible values should be 1.
The expected value is a weighted average of all values.
E(x) = sum(Xi*P(Xi)), where Xi goes through all allowed values of X.
Variance = sigma squared = sum( (Xi-E(x))**2 * P(Xi) )
Standard Deviation = sqrt(sigma squared)
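A minimal sketch computing these quantities for a discrete random variable (the values and probabilities are arbitrary illustration):

```python
import math

# A discrete random variable: possible values and their probabilities
values = [0, 1, 2, 3]
probs  = [0.1, 0.2, 0.3, 0.4]          # must sum to 1

expected = sum(x * p for x, p in zip(values, probs))                 # E(x) = sum(Xi * P(Xi))
variance = sum((x - expected) ** 2 * p for x, p in zip(values, probs))
std_dev  = math.sqrt(variance)

print(f"E(x) = {expected}, Var = {variance:.3f}, sigma = {std_dev:.3f}")
```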
Probability Density Function (PDF) vs. Cumulative Probability Function (CPF)
Probability Density Function f(x) is a continuous function.
f(x)*dx gives probability that value is between x and x+dx.
The total integral of f(x)*dx should be equal to 1.
Cumulative Probability Function (CPF)
This is simply a function F(x) which gives the probability that the value is less than x.
It grows monotonically from 0 to 1 as x grows (from small to large values).
The PDF is the first derivative of the CPF:
CPF(x+dx) – CPF(x) = [Prob. of value < (x+dx)] – [Prob. of value < x] = f(x)*dx
Mean value = Expected value
For discrete case we had:
E(x) = sum(Xi*P(Xi))
For the continuous case this translates into
E(x) = Integral(x*f(x)*dx)
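A quick numeric sanity check of both facts (a sketch using simple Riemann-sum integration of an exponential PDF, which also appears in the Poisson section below):

```python
import math

lam = 2.0                                 # rate parameter of an exponential distribution
f = lambda t: lam * math.exp(-lam * t)    # PDF: f(t) = lambda * exp(-lambda*t)

dt = 1e-4
ts = [i * dt for i in range(int(10 / dt))]     # integrate far enough to cover the tail

total = sum(f(t) * dt for t in ts)             # should be ~1
mean  = sum(t * f(t) * dt for t in ts)         # E(x) = integral(x * f(x) dx) ~ 1/lambda

print(f"integral of f = {total:.4f}, mean = {mean:.4f} (expected {1/lam})")
```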
Binomial Distribution
An experiment consists of N trials. Each trial can have two outcomes (True or False).
The probabilities of these outcomes are the same for all trials (p and q = 1-p).
The trials are independent from each other.
The probability of having “x” True values (out of N) is:
f(x) = C(N,x)*p^x*(1-p)^(N-x)
where C(N,x) = N!/(x!*(N-x)!) – a binomial coefficient
For large N, the distribution f(x) has a bell-curve shape.
Mean value = Expected value E(x) = N*p
Variance = N*p*(1-p).
N! = N*(N-1)*(N-2)*…*1
Let’s say we have N coins or some other thing-ies (an N-element unordered set).
Question: In how many ways can we arrange them?
Answer: N! (the number of permutations).
Example: How many ways are there to arrange the letters ABCDE in a row? Answer: 5! = 120.
Question: In how many ways can we select m-elements out of N-element set?
Answer: C(N,m) = N!/(m!*(N-m)!) – a binomial coefficient
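A minimal sketch of these formulas using only the standard library (the example values of N and p are arbitrary):

```python
from math import comb

def binomial_pmf(x, N, p):
    """Probability of exactly x True outcomes in N independent trials."""
    # f(x) = C(N, x) * p^x * (1-p)^(N-x)
    return comb(N, x) * p**x * (1 - p)**(N - x)

N, p = 10, 0.5
print(comb(5, 2))                      # C(5,2) = 10 ways to pick 2 elements out of 5
for x in range(N + 1):
    print(x, round(binomial_pmf(x, N, p), 4))
# Mean = N*p = 5, variance = N*p*(1-p) = 2.5
```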
Poisson and Exponential Distributions
Suppose you are sitting near a road. You notice that on average you see 10 cars passing by every minute (we say the rate lambda = 10/min). Knowing the average rate, what is the probability of seeing only 5 cars in one minute? Or 15 cars? Or any given number? The Poisson probability P(x) is the probability of a given number of events occurring in a fixed interval (of time) if these events occur with a known average rate and independently of each other and of the time since the previous events.
P(x) = (lambda**x)*exp(-lambda)/x!
For example, for x=0 (probability of zero events in given interval):
P(0) = (lambda**0)*exp(-lambda)/0! = 1*exp(-lambda)/1 = exp(-lambda)
Intervals between Poisson events are distributed exponentially.
If we have events happening which follow Poisson Distribution, we can measure the time intervals between these events.
Question: How will these times be distributed?
Answer: These times will have exponential distribution with Probability Density Function (PDF) as:
f(t) = lambda*exp(-lambda*t)
The average time interval will be (1/lambda).
Standard deviation happens to be the same (1/lambda).
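A small simulation sketch of both facts (Poisson counts per interval, exponential gaps between events); the rate used is just an example:

```python
import math
import random

lam = 10.0   # average rate: 10 events per unit of time

def poisson_pmf(x, lam):
    # P(x) = lambda^x * exp(-lambda) / x!
    return lam**x * math.exp(-lam) / math.factorial(x)

print("P(5 cars in a minute)  =", round(poisson_pmf(5, lam), 4))
print("P(15 cars in a minute) =", round(poisson_pmf(15, lam), 4))

# Simulate a Poisson process: gaps between events are exponential
gaps = [random.expovariate(lam) for _ in range(100_000)]
print("average gap ~", round(sum(gaps) / len(gaps), 4), "(expected 1/lambda =", 1 / lam, ")")
```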
Uniform and Normal Distributions. Central Limit Theorem.
If some value can change between some minimum and maximum values a & b, and the probability of any value in the interval [a,b] is the same, then we say that we have a uniform distribution.
The PDF (Probability Density Function) of the Uniform Distribution looks like a rectangle of height 1/(b-a), so that the total area under the curve is exactly 1.
Normal (Gaussian Bell-shaped) Distribution
The basic shape of the PDF is exp(-x*x/2). You need to add a coefficient to normalize it (that is, to make the area under the curve equal to exactly 1). You can also shift the curve to a different position (average value) and give it a different width (sigma).
Please see Generic formula and shape.
The normal distribution is very common in nature, because when we have a big number of sets of variables, the distribution of the parameters of these sets (for example, their averages) tends to converge to this bell shape. This fact is called the Central Limit Theorem.
In other words, suppose we have a variable with an exponential distribution (or any other distribution). Suppose we take a sample of N1 values, and we have N2 such samples. We calculate the average of each sample – so we have N2 average values.
Question: How will these average values be distributed?
Answer: As N1 & N2 become big enough, the distribution will get close to the Normal Gaussian bell curve.
Interestingly, even if the original variable has some other distribution (not exponential, but something else – uniform, Poisson, etc.), we will still get the Normal distribution!
See Illustration of the central limit theorem. On that page you can select the shape of the underlying distribution and see how regardless of that distribution, the result is always the Normal distribution.
- Central Limit Theorem – Uniform Distribution
- Central Limit Theorem – Parabolic Distribution
- Central Limit Theorem – Triangular Distribution
There are also some nice animated demonstrations – just google for them.
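A small simulation sketch of the Central Limit Theorem (sample averages of an exponential variable; the sample sizes are arbitrary):

```python
import random
import statistics

N1 = 50        # values per sample
N2 = 20_000    # number of samples

# Underlying variable: exponential with mean 1.0
averages = [
    sum(random.expovariate(1.0) for _ in range(N1)) / N1
    for _ in range(N2)
]

# The distribution of these averages is close to Normal with
# mean ~1.0 and standard deviation ~1/sqrt(N1).
print("mean of averages  =", round(statistics.mean(averages), 3))
print("stdev of averages =", round(statistics.stdev(averages), 3),
      "(expected ~", round(1 / N1**0.5, 3), ")")
```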
De Moivre–Laplace theorem
The De Moivre–Laplace theorem is a simple demonstration of how a symmetrical binomial distribution can be approximated by a Gaussian bell curve (under certain conditions).
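A quick numeric comparison (a sketch; N and p chosen arbitrarily) of the symmetric binomial PMF against its Gaussian approximation with mean N*p and variance N*p*(1-p):

```python
import math
from math import comb

N, p = 100, 0.5
mu = N * p
sigma = math.sqrt(N * p * (1 - p))

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The two columns agree closely near the center of the distribution.
for k in (40, 45, 50, 55, 60):
    binom = comb(N, k) * p**k * (1 - p)**(N - k)
    print(k, round(binom, 5), round(normal_pdf(k, mu, sigma), 5))
```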
“Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.” ( Source: Wikipedia article about Statistics.)
Being a statistician is a major part of being a Data Scientist. And it is tricky, because it is easy to misuse statistics:
“There are three kinds of lies: lies, damned lies, and statistics.” (Source: Lies, damned lies, and statistics)
- Collecting (sampling) the data from different kind of observations
- Visualizing the data, estimating the parameters (average/mean values, errors, etc.)
- Testing the validity of a hypothesis (the null hypothesis) against an alternative hypothesis
- Estimating intervals, estimating significance of observed differences, etc.
Here are some basic formulas:
mean = sum(Xi)/n
sample variance = sum((Xi - mean)^2) / (n-1)
standard deviation = sqrt(variance)
Intuitive explanation for dividing by (n-1) when calculating the standard deviation: the deviations are measured from the sample mean, which is computed from the same data and therefore lies “too close” to the sample values; dividing by (n-1) instead of n (Bessel’s correction) compensates for this and makes the variance estimate unbiased.
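A minimal sketch of these formulas (the sample values are arbitrary); note that Python's statistics module uses the (n-1) version:

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

mean = sum(x) / n
var_population = sum((xi - mean) ** 2 for xi in x) / n         # divide by n
var_sample     = sum((xi - mean) ** 2 for xi in x) / (n - 1)   # divide by n-1

print(mean, math.sqrt(var_population), math.sqrt(var_sample))
print(statistics.stdev(x))    # matches the (n-1) version
```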
More advanced topics:
Linear Regression – ( https://en.wikipedia.org/wiki/Linear_regression ) – draw a line through experimental points.
Simple Linear Regression – ( https://en.wikipedia.org/wiki/Simple_linear_regression ) – the least squares estimator of a linear regression model.
Here is the simplest standard OLS method (OLS = Ordinary Least Squares).
ym = y_mean = sum(y)/N
xm = x_mean = sum(x)/N
var(x) = variance(x) = sum((x-xm)^2)
covar(x,y) = covariance(x,y) = sum((y-ym)*(x-xm))
slope b = covar(x,y) / var(x)
intercept a = ym - b*xm

We can simplify as follows:
var(x) = sum((x-xm)^2) = sum(x^2) - sum(2*x*xm) + sum(xm^2) = sum(x^2) - N*xm^2
covar(x,y) = ... using similar logic ... = sum(y*x) - N*xm*ym

So:
slope b = (sum(y*x) - N*xm*ym) / (sum(x^2) - N*xm^2)
intercept a = ym - b*xm
            = (ym*sum(x^2) - ym*N*xm^2 - xm*sum(y*x) + xm*N*xm*ym) / (sum(x^2) - N*xm^2)
            = (ym*sum(x^2) - xm*sum(y*x)) / (sum(x^2) - N*xm^2)
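A minimal sketch of this OLS calculation in Python (with a tiny made-up dataset, so the result can be checked against the formulas above):

```python
def ols_fit(x, y):
    """Ordinary Least Squares fit of y = a + b*x."""
    n = len(x)
    xm = sum(x) / n
    ym = sum(y) / n
    covar = sum((yi - ym) * (xi - xm) for xi, yi in zip(x, y))
    var_x = sum((xi - xm) ** 2 for xi in x)
    b = covar / var_x          # slope
    a = ym - b * xm            # intercept
    return a, b

# Made-up points lying roughly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b = ols_fit(x, y)
print(f"intercept = {a:.3f}, slope = {b:.3f}")
```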
When drawing a curve through experimental points, it is very important to estimate how good this curve fits the data. And if fitting the data with this curve makes sense. For this you can use the Coefficient of Determination:
- Coefficient of Determination R2 – ( https://en.wikipedia.org/wiki/Coefficient_of_determination ) – pronounced “R squared”; a number between 0 and 1. The closer it is to 1, the better the model fits the data.
- Confidence Interval – ( https://en.wikipedia.org/wiki/Confidence_interval ) – a range of values which is likely to contain the true value of an unknown parameter (with a given confidence level).
- Cluster Analysis – ( https://en.wikipedia.org/wiki/Cluster_analysis ) – finding groups (clusters) in data.
- Student’s t-distribution – ( https://en.wikipedia.org/wiki/Student%27s_t-distribution ) – estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
- Chi-squared distribution – ( https://en.wikipedia.org/wiki/Chi-squared_distribution ) – distribution of a sum of the squares of k independent standard normal random variables.
- F-distribution – ( https://en.wikipedia.org/wiki/F-distribution ) – right-skewed distribution used most commonly in Analysis of Variance.
- Gamma distribution – ( https://en.wikipedia.org/wiki/Gamma_distribution ) – a two-parameter family of continuous probability distributions. The common exponential distribution and chi-squared distribution are special cases of the gamma distribution.
Stochastic Processes, Time Series Analysis
– https://en.wikipedia.org/wiki/Stochastic_process –
– https://en.wikipedia.org/wiki/Time_series –
– https://en.wikipedia.org/wiki/Random_walk –
– https://en.wikipedia.org/wiki/Brownian_motion –
– https://en.wikipedia.org/wiki/Diffusion , https://en.wikipedia.org/wiki/Diffusion_process –
– https://en.wikipedia.org/wiki/Stationary_process –
– https://en.wikipedia.org/wiki/Stable_process –
– https://en.wikipedia.org/wiki/Correlation_function –
– https://en.wikipedia.org/wiki/Autocorrelation –
Spectral Density, Fourier Analysis
– https://en.wikipedia.org/wiki/Spectral_density –
– https://en.wikipedia.org/wiki/Fourier_analysis –
– https://en.wikipedia.org/wiki/Fast_Fourier_transform –
Extracting signal from Noise
– https://en.wikipedia.org/wiki/Signal-to-noise_ratio –
– https://en.wikipedia.org/wiki/Noise_reduction – (filtering, etc.)
If we add up N samples of noisy data, the signal grows proportionally to N,
whereas the noise grows only as sqrt(N). So, if we add 100 samples, we will
improve our Signal/Noise ratio roughly 10 times.
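A small simulation sketch of this effect (the signal value and noise level are arbitrary):

```python
import random
import statistics

signal = 1.0          # true value we are trying to measure
noise_sigma = 1.0     # noise level of a single sample

def noisy_sample():
    return signal + random.gauss(0.0, noise_sigma)

# Average N samples many times and look at the spread of the averages:
for n in (1, 100):
    averages = [sum(noisy_sample() for _ in range(n)) / n for _ in range(2000)]
    print(f"N={n:3d}  residual noise of the average ~ {statistics.stdev(averages):.3f}")
# The residual noise drops ~10x when going from 1 to 100 samples (sqrt(100) = 10).
```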
Monte Carlo Method
– https://en.wikipedia.org/wiki/Monte_Carlo_method – used to evaluate something by using randomly-generated samples. The more samples – the more accurate the result.
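A classic minimal example of the method: estimating pi by throwing random points into a unit square and counting how many fall inside the quarter circle:

```python
import random

def estimate_pi(n_samples):
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:       # point falls inside the quarter circle
            inside += 1
    # (area of quarter circle) / (area of square) = pi/4
    return 4.0 * inside / n_samples

for n in (1_000, 100_000, 1_000_000):
    print(n, estimate_pi(n))
# More samples -> the estimate gets closer to 3.14159...
```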