Data Science

Hypothesis Testing Statistics in Data Science

Hypothesis testing is a form of inferential statistics that helps scientists, researchers, or almost anyone to draw conclusions about an entire population based on a representative sample.

Satyabrata Sahoo

Jul 19, 2022 — 6 min read


Hypothesis testing is a form of inferential statistics that lets scientists, researchers, or almost anyone draw conclusions about an entire population from a representative sample. In statistics and data science, it is commonly used to determine whether an assertion about the occurrence of an event is correct, and whether the performance metrics of a machine learning or AI model are representative of the model or arose by chance. Many engineers have never worked in statistics or data science, which causes problems when they build data science pipelines or rewrite code written by data scientists into maintainable code. This piece is for those data and ML engineers, and for data scientists.

Hypothesis testing follows a simple procedure:

First, a hypothesis is formulated.
The validity of the hypothesis is then tested against sample data.
If the data are consistent with the hypothesis, it is retained (we fail to reject it).
If the data contradict the hypothesis, it is rejected.

Estimation and hypothesis testing differ in purpose. Estimation is about calculating either a good value for a parameter or an interval in which the parameter is expected to lie with a certain degree of confidence. In hypothesis testing, we verify a notion we already hold about the parameter. In practice, that notion usually comes from business inputs.

Hypothesis tests consist of the following steps:

a) Null Hypothesis (Notion):

Hypothesis testing is usually about a measure of central tendency, such as the mean. The hypothesis being tested for possible rejection is called the null hypothesis, written H0. It states that nothing new is happening in the population. For a test on the mean, with μ0 as the assumed value, it looks like:

H0: μ = μ0

We always assume that the null hypothesis is true, or at least is the most plausible explanation before we do the test. The test can only disprove the null hypothesis.

b) Alternate Hypothesis:

The alternate hypothesis is the hypothesis that we set out to test for. It is the hypothesis that we wish to prove. It is denoted H1 and can take one of three forms:

H1: μ ≠ μ0 (two-tailed)
H1: μ > μ0 (right-tailed)
H1: μ < μ0 (left-tailed)

We can't test all three forms at once. Each test uses exactly one of them.

c) Test Statistics:

A test statistic is a number calculated from the sample that helps us decide between the null and the alternate hypothesis. Common choices, discussed in our previous article, are the z-test, t-test, and F-test, each of which yields a p-value. As a rule of thumb, z-statistics are recommended when the sample size is more than 30 or the population standard deviation is known. Otherwise, t-statistics are used.

d) Distributional Assumption (P-Value calculation):

To decide between the null and the alternate hypothesis, we have to assume a distribution for the test statistic. The p-value is then calculated from that distribution.

e) Significance Level:

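The significance level (α) is the threshold against which the p-value is compared. It is the probability of wrongly rejecting a true null hypothesis that we are willing to tolerate: if the p-value is at or below this level, we reject the null hypothesis.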
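The components above can be tied together in a few lines of Python. This is only a minimal sketch for a z-statistic: the function names `p_value_from_z` and `reject_null` and the labels "greater", "less", and "two-sided" are illustrative choices, not part of any library.

```python
from statistics import NormalDist


def p_value_from_z(z, alternative="two-sided"):
    """Return the p-value of a z-statistic under the standard normal N(0, 1)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":    # H1: mu > mu0, area to the right of z
        return 1.0 - std_normal.cdf(z)
    if alternative == "less":       # H1: mu < mu0, area to the left of z
        return std_normal.cdf(z)
    if alternative == "two-sided":  # H1: mu != mu0, area in both tails
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")


def reject_null(p_value, alpha=0.05):
    """Reject H0 when the p-value is at or below the significance level."""
    return p_value <= alpha


# Example: a z-statistic of 1.2 tested against each form of H1.
for alt in ("greater", "less", "two-sided"):
    p = p_value_from_z(1.2, alt)
    print(f"{alt}: p-value = {p:.4f}, reject H0: {reject_null(p)}")
```

Only one form of the alternate hypothesis would be chosen for a real test. The loop is there just to show how the tail area changes with that choice.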

Let’s understand this with an example.

Q) Suppose the price of petrol in India is normally distributed with a mean of 95 INR per litre and a standard deviation of 3.2 INR per litre. To test whether this mean is in fact correct, we sample 50 service stations and obtain a mean of 96.6 INR per litre.

Ans)
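Step 1: Null hypothesis, H0: μ = 95

Step 2: Alternate hypothesis, H1: μ > 95

Step 3:
Test Statistics:
Population mean (μ) = 95
Population standard deviation (σ) = 3.2
Sample size (n) = 50, sample mean (x̄) = 96.6
Under the Central Limit Theorem,
z = (x̄ − μ) / (σ / √n) = (96.6 − 95) / (3.2 / √50) ≈ 3.5355339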

The test statistic we have computed is known as the Z-statistic. We now have to find the probability of observing a value of at least 3.5355339 if the null hypothesis is true. Since the alternate hypothesis is "greater than", we need the area to the right of this value, which makes this a right-tailed test. If the alternate hypothesis were "less than", we would need the area to the left, which is a left-tailed test. So here we calculate the area to the right of 3.5355339 under N(0,1).

The Z-table gives the area to the left. Once we know the area to the left, the area to the right is one minus that value. Here the area to the left is approximately 1 from the Z-table, so the area to the right is approximately 0. The data therefore give essentially no support to the null hypothesis.

Step 4: The distributional assumption here is that, under the Central Limit Theorem, the test statistic follows the normal distribution. The area to the right under that distribution is the p-value, so here the p-value is approximately zero (0).

Step 5: The significance level is usually taken as 5%. Our p-value is approximately 0, which is below 5%, so we reject the null hypothesis and conclude that the mean is greater than 95. The p-value rounds to 0 because the largest z-value in a typical Z-table is 3.4: for any value above that, the area to the left is taken as 1.
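The same calculation can be checked without a Z-table. The sketch below uses only the Python standard library. The variable names are illustrative, and the numbers are the ones given in the question.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 95.0         # hypothesised population mean (INR per litre)
sigma = 3.2         # known population standard deviation (INR per litre)
n = 50              # number of service stations sampled
sample_mean = 96.6  # observed sample mean (INR per litre)
alpha = 0.05        # significance level

# Z-statistic: distance of the sample mean from mu_0 in standard errors.
standard_error = sigma / sqrt(n)
z = (sample_mean - mu_0) / standard_error

# Right-tailed test (H1: mu > 95): p-value is the area to the right of z.
p_value = 1.0 - NormalDist().cdf(z)

print(f"z = {z:.4f}")
print(f"p-value = {p_value:.6f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

Computed this way, the p-value is a small positive number (on the order of 0.0002) rather than exactly zero. The decision at the 5% level is the same.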

When we perform hypothesis testing, the p-value helps us determine the significance of the result. A p-value is a number between zero (0) and one (1). A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so we reject the null hypothesis.

Error in Hypothesis Testing:

There are two types of errors: Type I and Type II. If we reject the null hypothesis in favor of the alternate when the null hypothesis is in fact true, we commit a Type I error. If we fail to reject the null hypothesis even though it is false, we commit a Type II error.
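A small simulation can make the Type I error concrete. The sketch below repeatedly draws samples from a population in which the null hypothesis really is true and counts how often a right-tailed z-test rejects it anyway. The function name `type_i_error_rate` and all parameter values are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(n_trials=5000, n=50, mu_0=95.0, sigma=3.2,
                      alpha=0.05, seed=0):
    """Fraction of right-tailed z-tests that reject H0 although H0 is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_trials):
        # Draw a sample from a population where H0 (mu = mu_0) really holds.
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu_0) / (sigma / sqrt(n))
        p_value = 1.0 - std_normal.cdf(z)
        if p_value <= alpha:
            rejections += 1  # Type I error: H0 is true but was rejected
    return rejections / n_trials


rate = type_i_error_rate()
print(f"observed Type I error rate: {rate:.3f}")
```

With enough trials, the observed rate should land close to the chosen significance level. This is why the significance level can be read as the Type I error rate we are willing to tolerate.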

Testing for the mean when the standard deviation is given is known as a Z-test. In most use cases, however, the standard deviation is not given. In those cases we use a different test statistic, known as the t-statistic. The concept remains the same and only the statistic changes. We still just look at the p-value and compare it with the significance level.

When we are testing for the mean, we don't know the standard deviation, and we have one sample, we do a one-sample t-test.

When we are testing for the mean, we don't know the standard deviation, and we have two samples, we do a two-sample t-test.

When we are testing for the mean, we don't know the standard deviation, and we have paired data, meaning two data points for each observation, we do a paired-sample t-test.
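The three variants map onto three SciPy functions: `scipy.stats.ttest_1samp`, `scipy.stats.ttest_ind`, and `scipy.stats.ttest_rel`. The sketch below runs each on randomly generated data. The arrays `group_a`, `group_b`, `before`, and `after` are hypothetical and exist only to show the calls.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data, generated only for illustration.
group_a = rng.normal(loc=52.0, scale=5.0, size=30)
group_b = rng.normal(loc=50.0, scale=5.0, size=30)
before = rng.normal(loc=50.0, scale=5.0, size=30)
after = before + rng.normal(loc=1.0, scale=2.0, size=30)

# One sample: is the mean of group_a different from 50?
one_sample = stats.ttest_1samp(group_a, popmean=50.0)

# Two independent samples: do group_a and group_b have the same mean?
# (ttest_ind assumes equal variances unless equal_var=False is passed.)
two_sample = stats.ttest_ind(group_a, group_b)

# Paired data: two measurements for each observation.
paired = stats.ttest_rel(before, after)

for name, result in [("one-sample", one_sample),
                     ("two-sample", two_sample),
                     ("paired", paired)]:
    print(f"{name}: t = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
```

In each case the decision rule is unchanged: compare the reported p-value with the significance level.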

Hypothesis testing using Python:

Importing and reading Datasets:

Information about dataset:
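As an illustration of these two steps, the sketch below reads a small CSV with pandas and prints basic information about it. The CSV content and the column names `student_id` and `score` are hypothetical placeholders, not the dataset analysed here. A real project would pass a file path to `pd.read_csv` instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """student_id,score
1,55
2,61
3,48
4,57
5,63
6,52
7,59
8,66
"""

# Importing and reading the dataset.
df = pd.read_csv(io.StringIO(csv_text))

# Information about the dataset.
print(df.head())      # first few rows
df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # summary statistics for the numeric columns
```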

One-sample T-test:

Here the p-value is below 5%, so the null hypothesis is rejected. But how do we know in which direction it is rejected? The test performed is a not-equal-to (two-sided) test, so we look at the sign of the test statistic. Here it is positive, so the null hypothesis is rejected on the > side. Had it been negative, it would have been rejected on the < side. The test statistic therefore indicates that the population mean is more than 50.
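The reasoning about direction can be sketched with `scipy.stats.ttest_1samp`, which runs a two-sided test by default. The `scores` array below is made up for illustration and is not the dataset used above. Only the hypothesised mean of 50 is taken from the text.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, made up for illustration.
scores = np.array([55, 61, 48, 57, 63, 52, 59, 66, 54, 60], dtype=float)
pop_mean = 50.0  # hypothesised population mean
alpha = 0.05     # significance level

# Two-sided one-sample t-test (H1: mean != pop_mean).
result = stats.ttest_1samp(scores, popmean=pop_mean)

if result.pvalue <= alpha:
    # The sign of the t-statistic tells us on which side H0 is rejected.
    direction = "greater than" if result.statistic > 0 else "less than"
    conclusion = f"reject H0: the mean appears {direction} {pop_mean}"
else:
    conclusion = "fail to reject H0"

print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
print(conclusion)
```

If the direction is known in advance, recent SciPy versions also accept an `alternative` argument ("greater" or "less") so that the one-sided test is run directly.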