Central Limit Theorem and Distributions in Data Science

Given a distribution with mean µ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean µ and variance σ²/N as the sample size N increases.

Satyabrata Sahoo

Jun 29, 2022 — 6 min read

Image Credits: Casey Dunn on vimeo

In this article, we will look at the Central Limit Theorem and the main types of distributions used in Data Science.

Central Limit Theorem:

Let us begin with an example. There is no exact data on the income of every Indian, and incomes vary with region, gender, education, and other factors, so we cannot know the true average income of Indians. When precise data on a question is not available, we can instead estimate the average from a random sample to get some idea of it.

In that case, the average we have is not the population mean µ but an estimate X̄ of it (that is, the sample mean, a measure of the sample's central tendency).

If we take a second sample from the same population, it is unlikely that its average will be the same as the average of the first sample. In fact, statisticians know that repeated samples from the same population give different sample means.

They have also proved that the distribution of those sample means will be approximately normal when the sample size is large enough, irrespective of the shape of the parent population (provided its variance is finite). This is referred to as the Central Limit Theorem.

Given a distribution with mean µ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean µ and variance σ²/N as the sample size N increases.

The amazing and counter-intuitive thing about the central limit theorem is that the distribution of an average tends to be normal even when the distribution from which the average is computed is decidedly non-normal.

As the sample size N increases, the variance of the sampling distribution decreases. This is logical because the larger the sample, the closer we are to quantifying the true population parameters.
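The shrinking variance can be checked with a small simulation. The sketch below is an illustration rather than part of the original article: it assumes an Exponential(1) parent distribution (chosen because it is clearly non-normal, with µ = 1 and σ² = 1), and the helper name `sample_means` is hypothetical.

```python
import random
import statistics

def sample_means(n, trials, rng):
    """Return `trials` sample means, each computed from n Exponential(1) draws."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]

rng = random.Random(42)  # fixed seed so the run is repeatable
results = {}
for n in (2, 10, 50):
    means = sample_means(n, trials=5000, rng=rng)
    # Store (mean of the sample means, variance of the sample means)
    results[n] = (statistics.fmean(means), statistics.variance(means))
    print(f"N={n:>2}  mean of means={results[n][0]:.3f}  "
          f"variance of means={results[n][1]:.4f}  sigma^2/N={1.0 / n:.4f}")
```

The mean of the sample means should stay close to µ = 1, while their variance should track σ²/N and shrink as N grows.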

Central Limit Theorem formula:

As we know, the central limit theorem states that if we collect a sufficiently large sample, the mean of the sample will approximately follow a normal distribution.
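One standard way to write this, using the mean µ and variance σ²/N stated earlier, is shown below. The symbols X̄ (sample mean), Z (standardized sample mean), and z with subscript α/2 (the standard normal critical value for a confidence level of 1 − α) are conventional notation assumed here.

```latex
\bar{X} \;\approx\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^{2}}{N}\right),
\qquad
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{N}} \;\approx\; \mathcal{N}(0, 1),
\qquad
\text{interval estimate of } \mu:\quad \bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{N}}
```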

Here we get an interval estimate of the population mean. If the population standard deviation σ is known, the standard error (SE) of the mean, σ/√N, is known too, and we can use it to build the interval. This is how the central limit theorem is used to obtain an interval estimate of the population mean.
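As a minimal sketch of such an interval estimate, assuming σ is known: the function name `mean_confidence_interval`, the income figures, and the value σ = 8 are all hypothetical and only serve to illustrate the calculation.

```python
import math
from statistics import NormalDist, fmean

def mean_confidence_interval(sample, sigma, confidence=0.95):
    """Interval estimate of the population mean when sigma is known."""
    n = len(sample)
    x_bar = fmean(sample)                 # sample mean
    se = sigma / math.sqrt(n)             # standard error of the mean
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return x_bar - z * se, x_bar + z * se

# Hypothetical sample of incomes (arbitrary units) and assumed known sigma
incomes = [32, 45, 28, 51, 39, 47, 36, 42, 30, 50]
low, high = mean_confidence_interval(incomes, sigma=8.0)
print(f"95% interval for the population mean: ({low:.2f}, {high:.2f})")
```

With only ten observations the normal approximation is rough unless the population itself is close to normal, so this is best read as a demonstration of the arithmetic.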

Here we introduced a new term: the normal distribution. So what is it?

Normal Distribution:

The normal distribution is a pattern for the distribution of a set of data that follows a bell-shaped curve. It is also called the Gaussian distribution.

The normal distribution has its mean, median, and mode all coinciding at the peak.
The curve is centered on the mean and falls off on both sides. Most observations are close to the mean.
It is determined entirely by its mean and standard deviation. If the mean or the standard deviation (SD) changes, the position or width of the bell changes too.
The area under the curve is one.
The normal distribution fits many real-world quantities because of the central limit theorem: whatever the underlying distribution, the mean (or sum) of a large enough number of independent observations tends toward a normal distribution. The law of large numbers is a related but different result: it says that the sample mean converges to the population mean as the number of observations grows.
Normal distribution = Gaussian distribution = bell-shaped curve; the z-distribution is the special case with mean 0 and SD 1.
The normal distribution has zero skewness and is mesokurtic.
Mean = median = mode.
The normal distribution is symmetric about the mean: it is exactly the same on either side. If we go a distance x to the left of the mean and a distance x to the right, the probability on the right side is the same as the probability on the left side.
If we go 1 standard deviation (SD) to the right of the mean and 1 SD to the left, the range contains about 68% of the total probability, which means that about 68% of the observations are within ± 1 SD.
The range of ± 2 SD around the mean contains about 95% of the total probability, which means that about 95% of the observations are within ± 2 SD.
The range of ± 3 SD around the mean contains about 99.7% of the total probability, which means that about 99.7% of the observations are within ± 3 SD.
The normal distribution with mean 0 and standard deviation 1 is known as the standard normal distribution (z-distribution).
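The 68%, 95%, and 99.7% figures can be reproduced from the standard normal cumulative distribution function. This is a small sketch using Python's standard library; the helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # the z-distribution

def prob_within(k):
    """Probability that a standard normal value lies within ± k SD of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within ± {k} SD: {p:.4f}")
```

Because any normal distribution can be standardized to the z-distribution, the same proportions hold for every mean and standard deviation.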

Additional Distributions:

There are some distributions that are derived from the normal distribution.

1) Chi-Squared Distribution or χ²-distribution:

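If X follows a normal distribution with mean µ and standard deviation σ, then Z = (X − µ)/σ follows the standard normal distribution, and Z² follows a chi-squared distribution with 1 degree of freedom.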

The chi-squared distribution is created when we sum the squares of independent standard normal variables.
The chi-squared distribution has only one important parameter, known as the degrees of freedom (df).
The degrees of freedom are the number of squared normal variables we are adding.
A chi-squared value is never negative: the values in a dataset may be positive or negative, but the chi-square statistic is built from squares, so the result is always non-negative.
A chi-squared distribution table lists values indexed by p-value and degrees of freedom (df).

To find the p-value for a computed statistic, we look it up in the chi-squared distribution table in the row for the relevant degrees of freedom.
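The construction described above, summing squared standard normal variables, can be sketched as a simulation. The helper name `chi_squared_draw`, the choice df = 4, and the seed are illustrative assumptions. The comparison values use the known facts that a chi-squared distribution has mean df and variance 2·df.

```python
import random
import statistics

def chi_squared_draw(df, rng):
    """One chi-squared(df) value: the sum of df squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

rng = random.Random(0)  # fixed seed so the run is repeatable
df = 4
draws = [chi_squared_draw(df, rng) for _ in range(20000)]

chi_mean = statistics.fmean(draws)
chi_var = statistics.variance(draws)
print(f"simulated mean={chi_mean:.2f} (theory {df}), "
      f"simulated variance={chi_var:.2f} (theory {2 * df})")
print("smallest value:", min(draws))  # never negative, since every term is a square
```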

2) t-Distribution or Student's t-distribution:

The z-distribution is used when the population standard deviation is known, while the t-distribution is used when it has to be estimated from the sample. The shape of the t sampling distribution is similar to that of the z sampling distribution in that it is:

Symmetrical
Centered over a mean of zero
Unlike z, its variance depends on the sample size, more specifically on the degrees of freedom.

As the number of degrees of freedom increases, the variance of the t-distribution approaches that of z.

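A t-distribution with k degrees of freedom is created as the ratio T = Z / √(V/k), where Z follows the standard normal distribution and V is an independent chi-squared variable with k degrees of freedom.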

In general, when the sample size n ≥ 30, the shapes of the t and z distributions are almost the same, so n = 30 is conventionally taken as the dividing point between small and large samples.
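To see the t-distribution approach z as the degrees of freedom grow, the sketch below builds t-distributed values from the standard construction T = Z / √(V/k), with Z standard normal and V an independent chi-squared variable with k degrees of freedom. The helper name `t_draw`, the choices k = 6 and k = 30, and the seed are illustrative assumptions.

```python
import math
import random
import statistics

def t_draw(k, rng):
    """One draw of T = Z / sqrt(V / k), with Z standard normal and V chi-squared(k)."""
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
    return z / math.sqrt(v / k)

rng = random.Random(7)  # fixed seed so the run is repeatable
t_variance = {}
for k in (6, 30):
    draws = [t_draw(k, rng) for _ in range(20000)]
    t_variance[k] = statistics.variance(draws)
    print(f"df={k:>2}  simulated variance={t_variance[k]:.3f}  (z has variance 1)")
```

The variance for the smaller df should come out noticeably above 1, reflecting the heavier tails, while the variance for df = 30 should already be close to 1.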

t-test for the population mean:
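The usual one-sample statistic is t = (x̄ − µ₀) / (s / √n), with n − 1 degrees of freedom, where x̄ is the sample mean, s the sample standard deviation, and µ₀ the hypothesized population mean. A minimal sketch follows; the function name `one_sample_t`, the measurements, and µ₀ = 5.0 are hypothetical.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    x_bar = statistics.fmean(sample)   # sample mean
    s = statistics.stdev(sample)       # sample SD (n - 1 in the denominator)
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical measurements, tested against a hypothesized mean of 5.0
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.5]
t_stat, dof = one_sample_t(sample, mu0=5.0)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The resulting statistic would then be compared with the t-distribution table at the matching degrees of freedom.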

When to use Z-test and T-test?

z-test:

σ is known and the population is normal.
σ is known and the sample size n ≥ 30 (the population need not be normal).

t-test:

σ is not known and the population is assumed to be normal.
the sample size is smaller than 30.

3) F-Distribution (Fisher–Snedecor distribution):

The F-distribution (the distribution of the F ratio), also known as the Fisher–Snedecor distribution, is a continuous probability distribution that often arises as the null distribution of a test statistic, especially in the analysis of variance (ANOVA) and other F-tests.

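It is created by taking the ratio of two independent chi-squared variables, each divided by its own degrees of freedom: F = (U/d₁) / (V/d₂), where U has d₁ and V has d₂ degrees of freedom.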
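The standard construction of an F-distributed value, as the ratio of two independent chi-squared variables each divided by its degrees of freedom, can be sketched as a simulation. The helper names `chi_squared_draw` and `f_draw`, the choices d1 = 5 and d2 = 20, and the seed are illustrative assumptions. The comparison value uses the known mean d2 / (d2 − 2), valid for d2 > 2.

```python
import random
import statistics

def chi_squared_draw(df, rng):
    """One chi-squared(df) value: the sum of df squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

def f_draw(d1, d2, rng):
    """One F(d1, d2) value: ratio of two scaled, independent chi-squared draws."""
    return (chi_squared_draw(d1, rng) / d1) / (chi_squared_draw(d2, rng) / d2)

rng = random.Random(1)  # fixed seed so the run is repeatable
d1, d2 = 5, 20
draws = [f_draw(d1, d2, rng) for _ in range(20000)]

f_mean = statistics.fmean(draws)
print(f"simulated mean={f_mean:.3f}  theory={d2 / (d2 - 2):.3f}")
```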