
Inferential Statistics in Data Science

Satyabrata Sahoo

May 17, 2022 — 4 min read


Statistics is the science concerned with collecting, organizing, analyzing, interpreting, and presenting empirical data. In our previous article, "Descriptive Statistics and Analysis", all the analysis we did was descriptive statistics. With descriptive statistics, we simply organize and summarize the characteristics of data. We saw how to discover patterns in given data using various approaches and visualization techniques. This article discusses inferential statistics, one of the most prominent notions in statistics for data science.

Inferential statistics is a set of methods used to draw conclusions or inferences about the characteristics of a population based on data from a sample. When we work with real-world datasets, the population may have millions of observations, making analysis complex and time-consuming. So we use sample data to estimate population parameters or to test a hypothesis about the population.

Sampling Schemas:

Sampling is the technique of selecting a subset of the population in order to estimate the characteristics of the entire population. Several sampling schemas are widely used in industry.

I. Simple Random Sampling (SRS)
II. Stratified Random Sampling (StRS)
III. Cluster Random Sampling (CRS)
IV. Systematic Random Sampling (SyRM)

I. Simple Random Sampling (SRS):

The SRS is of two types:

a) Simple Random Sampling with Replacement (SRSWR)

b) Simple Random Sampling without Replacement (SRSWOR)

a) SRSWR:

We select observations at random and, after noting each one down, put it back into the population. The same observation may therefore be selected again in a later draw.

b) SRSWOR:

We select observations at random and, after noting each one down, do not put it back. The same observation therefore cannot appear more than once in the sample.
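The two variants can be sketched with Python's standard `random` module. This is only a minimal illustration: the population of ten numbered units, the seed, and the sample size of five are assumptions made for the example.

```python
import random

population = list(range(1, 11))  # hypothetical population of 10 numbered units
rng = random.Random(42)          # fixed seed so the run is repeatable

# SRSWR: every draw is made from the full population, so repeats are possible.
sample_wr = rng.choices(population, k=5)

# SRSWOR: a drawn unit is not put back, so each unit appears at most once.
sample_wor = rng.sample(population, k=5)

print("with replacement:   ", sample_wr)
print("without replacement:", sample_wor)
```

`random.choices` draws with replacement and `random.sample` draws without replacement, which is why only the second sample is guaranteed to contain five distinct units.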

II. Stratified Random Sampling (StRS):

Stratified sampling divides the population into homogeneous groups (strata) based on a particular variable. StRS is again of two types.

a) Stratified Random Sampling with Replacement (StRSWR)

b) Stratified Random Sampling without Replacement (StRSWOR)

For example, divide the population of India into four parts (east, west, north, and south) for sampling. Within each part, we then draw a random sample, either with replacement (StRSWR) or without replacement (StRSWOR). This ensures that we get observations from every segment.
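A minimal sketch of the without-replacement variant follows. The helper name `stratified_sample`, the toy population of 40 people, and the choice of three draws per stratum are all assumptions made for illustration.

```python
import random

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw n_per_stratum units without replacement from every stratum."""
    strata = {}
    for unit in units:
        # Group each unit under the stratum it belongs to.
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Sample separately inside each stratum, so no stratum is left out.
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical population: 40 people, each tagged with one of four regions.
regions = ["east", "west", "north", "south"]
people = [(i, regions[i % 4]) for i in range(40)]

sample = stratified_sample(people, lambda p: p[1], 3, random.Random(0))
for region, members in sample.items():
    print(region, members)
```

Replacing `rng.sample` with `rng.choices(members, k=n_per_stratum)` would give the with-replacement variant.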

III. Cluster Random Sampling (CRS):

Cluster sampling is a method where the population is divided into multiple groups (clusters) based on some variable, and we then randomly select some of those groups to collect data from. It is of two types.

a) Cluster Random Sampling with Replacement (CRSWR)

b) Cluster Random Sampling without Replacement (CRSWOR)

For example, divide the population of India into four parts (east, west, north, and south), randomly select one or more of those groups, and take people only from the selected groups. Here we will not get data from every part.
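The idea can be sketched as one-stage cluster sampling, where whole groups are selected at random and every unit in a selected group is kept. The helper name `cluster_sample`, the toy clusters, and the choice of two clusters are assumptions for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Pick n_clusters whole clusters at random (without replacement)."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Keep every unit from the chosen clusters; other clusters are ignored.
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical population: four regional clusters with five people each.
clusters = {
    region: [f"{region}-{i}" for i in range(5)]
    for region in ["east", "west", "north", "south"]
}

sample = cluster_sample(clusters, 2, random.Random(0))
for region, members in sample.items():
    print(region, members)
```

Only two of the four regions appear in the result, which matches the point that cluster sampling does not return data from every part.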

The difference between stratified and cluster sampling is:

· In both cases, we divide the population into groups or segments.

· In stratified sampling, we choose randomly from every group.

· In cluster sampling, we randomly choose one or more groups and select data only from those groups.

IV. Systematic Random Sampling (SyRM):

In systematic random sampling, we arrange the observations in a particular order and then select them at a regular interval along that order. For example, randomly order people and select every alternate person. In about 95 per cent of the cases, we use simple random sampling without replacement.
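Selecting at a regular interval can be sketched as below. The helper name `systematic_sample`, the population of 20 numbered units, and the step of 2 (every alternate unit) are assumptions for illustration.

```python
import random

def systematic_sample(ordered_units, step, rng):
    """Take every `step`-th unit, starting from a random offset."""
    start = rng.randrange(step)  # random start within the first `step` positions
    return ordered_units[start::step]

units = list(range(20))  # hypothetical population, already placed in some order
sample = systematic_sample(units, 2, random.Random(1))
print(sample)
```

With a step of 2 the result is every alternate unit; only the starting point is random.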

Estimation:

The essential part of inferential statistical analysis is the "Estimation" of a population parameter. Estimation is of two types.

a) Point

b) Interval

a) Point Estimation:

Point estimation gives a single value as the estimate of a population parameter. It is usually done for two kinds of population parameters.

Central Tendency
Dispersion

We have already understood these two topics in our previous article.

For example, consider the score you will get in an examination. If the estimate is a single value, say 87, it is a point estimate. If the estimate is a range, say between 85 and 90, it is an interval estimate.

So how do we estimate the population mean from a sample? The usual estimate of central tendency from a sample is the simple average (mean) of that sample.

Dispersion refers to the spread or variability in the data. It describes how spread out the scores are around the mean.

Why is Dispersion significant?

It gives additional information that enables us to judge the reliability of the measures of central tendency. If the data is widely spread, the central location is less representative of the data as a whole than it would be for data more closely centred around the mean. Widely dispersed data also bring their own problems, and measuring dispersion helps us identify and handle them. For example, if many values lie far from the centre, this may be undesirable or present a risk, and one may avoid choosing that distribution.

So here, the sample standard deviation is the usual estimator. It differs slightly from the population standard deviation. For the population standard deviation, we divide the sum of squared differences from the mean by the number of observations N; for the sample standard deviation, we divide by N - 1. Dividing by N - 1 corrects the tendency of a sample to underestimate the spread of the population. When the sample size is large enough, the difference between N and N - 1 becomes negligible, so the sample standard deviation converges toward the population standard deviation.
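Both point estimates can be computed with Python's standard `statistics` module. The six exam scores below are made-up values used only for illustration.

```python
import statistics

sample = [82, 85, 87, 88, 90, 91]  # hypothetical exam scores

# Point estimate of central tendency: the simple average of the sample.
mean_hat = statistics.mean(sample)

# Dispersion: pstdev divides by N, stdev divides by N - 1.
sd_divide_by_n = statistics.pstdev(sample)
sd_divide_by_n_minus_1 = statistics.stdev(sample)

print("sample mean:       ", mean_hat)
print("std dev (N):       ", sd_divide_by_n)
print("std dev (N - 1):   ", sd_divide_by_n_minus_1)
```

The N - 1 version is always slightly larger than the N version for the same data, and the gap shrinks as the sample grows.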

b) Interval Estimation:

If we estimate a range or interval within which the true population parameter is expected to lie with a certain degree of confidence, we are using interval estimation.
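As a rough sketch, an interval estimate for the population mean can be built around the sample mean. The version below assumes a normal approximation, which is reasonable for larger samples; for small samples a t-distribution is more usual. The helper name `mean_confidence_interval` and the scores are assumptions for illustration.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the N - 1 standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z value that leaves (1 - confidence) / 2 in each tail.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error

scores = [82, 85, 87, 88, 90, 91]  # hypothetical exam scores
low, high = mean_confidence_interval(scores)
print(f"95% interval for the mean: {low:.2f} to {high:.2f}")
```

Asking for a higher degree of confidence, for example `confidence=0.99`, gives a wider interval.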

To be continued…
