In our previous article, “Inferential Statistics in Data Science”, we discussed different sampling schemes and point estimation in inferential statistics. So far, we have worked out results for individual observations or sample means when the actual population parameters are known. In reality, the true population parameters are seldom known. We now learn how to attach a level of confidence, a measure of accuracy, to parameters estimated from samples. Let’s start with the new concepts.
If we estimate a range or interval within which the true population parameter is expected to lie, we are using interval estimation. This is a widely used method of estimation, because it lets us state how confident we are in the estimate. Note that we can never be 100% confident unless we measure the whole population. For example, imagine that you randomly sample 100 students in a college and record their ages, and you get a sample average age (x̄) of 23. This is probably close to the population’s real average age (μ), but it is unlikely to match it exactly. It is more accurate to say that the average age of students in the college lies within a specific interval, say [20, 25]. We can then say that we are 95% confident that the average age in our college falls somewhere between 20 and 25. There is still a 5% chance that the population parameter is outside this range.
We will soon move on to a very important topic, hypothesis testing, but before that we need to understand what p-values and confidence intervals are.
To understand the p-value, or probability value, consider an example: everyone uses the keyboard to enter data into a system. While typing, most people press the spacebar near its centre, and fewer press it towards the left or right end. Plotting where the presses land gives the Gaussian distribution curve below:
Suppose the p-value for a region is 0.90; this means the probability of a press landing in that region is 90%. If the p-value is 0.05, then only 5% of all presses land in that region. In this informal sense, a p-value is the probability (the area under the curve) attached to a region of the distribution. In hypothesis testing the term has a more specific meaning: the probability, assuming the null hypothesis is true, of seeing a result at least as extreme as the one observed. Now let's discuss what a confidence interval is.
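The idea of a probability attached to a region of a Gaussian curve can be sketched in a few lines of Python. This is only an illustration: the standardized press position (0 at the centre of the spacebar, one unit per standard deviation) and the helper name `region_probability` are assumptions made for the example, not measurements.

```python
from statistics import NormalDist

# Assumption: press positions are standardized so that 0 is the centre of the
# spacebar and 1 unit equals one standard deviation.
press_position = NormalDist(mu=0.0, sigma=1.0)


def region_probability(dist, low, high):
    """Probability that a value drawn from `dist` falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Central region within 1.96 standard deviations of the centre.
centre = region_probability(press_position, -1.96, 1.96)
# Everything outside that region (both outer edges together).
edges = 1.0 - centre

print(f"centre region: {centre:.3f}")
print(f"outer edges:   {edges:.3f}")
```

Under this model the central region holds about 95% of the presses and the two outer edges together hold about 5%.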
A confidence interval is a range of values within which we expect the population parameter to lie. The confidence level of the interval is generally denoted by 1-α. If we are 95% confident that the population parameter is inside the interval, then α = 5%. And if we are 98% confident that the population parameter is inside the interval, then α must be 2%. α is known as the significance level; it is the threshold against which a p-value is compared in hypothesis testing, not the p-value itself. Hence, α is the probability that an interval built from a randomly picked sample will fail to contain the population parameter.
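One way to read a 95% confidence level is as a long-run coverage rate: if we kept drawing samples and building an interval from each one, roughly 95% of those intervals would contain the population parameter. The simulation below is a rough sketch of that reading. The population values (mean age 23, standard deviation 4), the sample size, and the number of trials are all hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of student ages (assumed values, not real data).
TRUE_MEAN, TRUE_SD = 23.0, 4.0
SAMPLE_SIZE, TRIALS = 100, 2000
ALPHA = 0.05  # 1 - ALPHA is the confidence level

# Two-sided critical value for the chosen confidence level (about 1.96).
z = NormalDist().inv_cdf(1 - ALPHA / 2)

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    centre = mean(sample)
    margin = z * stdev(sample) / SAMPLE_SIZE ** 0.5
    # Count the interval as a hit when it contains the true mean.
    if centre - margin <= TRUE_MEAN <= centre + margin:
        hits += 1

coverage = hits / TRIALS
print(f"intervals containing the true mean: {coverage:.1%}")
```

The printed share should land close to 95%, and the remaining intervals, about α of them, miss the true mean.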
How do we calculate confidence interval estimates? The interval estimate is calculated as sample mean ± margin of error, where the margin of error is a critical value multiplied by the standard error (SE).
We can extend this principle further:
- We can be 90% confident that the true population mean lies within x̄ ± 1.645(SE)
- We can be 95% confident that the true population mean lies within x̄ ± 1.960(SE)
- We can be 99% confident that the true population mean lies within x̄ ± 2.576(SE)
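The multipliers in the list above can be recovered from the standard normal distribution, and the whole interval can be computed with Python's standard library. The sketch below is a minimal illustration: the `ages` list is made-up data, and `z_critical` and `confidence_interval` are hypothetical helper names. It estimates the standard error from the sample as s/√n.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_critical(confidence):
    """Two-sided standard normal multiplier for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the mean: x_bar ± z * SE, with SE = s / sqrt(n)."""
    x_bar = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    margin = z_critical(confidence) * se
    return x_bar - margin, x_bar + margin


# Hypothetical sample of student ages, for illustration only.
ages = [21, 24, 22, 25, 23, 26, 20, 23, 24, 22, 23, 25, 21, 24, 22, 23]

for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(ages, level)
    print(f"{level:.0%}: z = {z_critical(level):.3f}, interval = [{low:.2f}, {high:.2f}]")
```

The interval widens as the confidence level rises. These multipliers assume the sample mean is approximately normally distributed; for a sample as small as this one, multipliers from the t-distribution would be somewhat larger.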
Here you may ask: what is the standard error?
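Since every sample drawn from a population is similar to, but not the same as, the population, the sample mean varies from sample to sample, and we measure that variation with the standard error. The standard error is the standard deviation of the sample means around the population mean. For samples of size n it equals σ/√n, where σ is the population standard deviation (in practice estimated by the sample standard deviation). The standard error therefore shrinks towards zero as the sample size grows; it does not converge to the standard deviation of the population.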
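The standard error of the mean is commonly computed as σ/√n. A small simulation can show the idea: draw many samples, take the mean of each, and compare the standard deviation of those means with σ/√n. The population values and the helper name `empirical_se` below are assumptions chosen only for illustration.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population (assumed values, not real data).
POP_MEAN, POP_SD = 23.0, 4.0
TRIALS = 1000


def empirical_se(n):
    """Standard deviation of the means of TRIALS samples of size n."""
    sample_means = [
        fmean([random.gauss(POP_MEAN, POP_SD) for _ in range(n)])
        for _ in range(TRIALS)
    ]
    return stdev(sample_means)


results = {}
for n in (25, 100, 400):
    # Store (simulated SE, theoretical SE) for each sample size.
    results[n] = (empirical_se(n), POP_SD / sqrt(n))
    print(f"n = {n:3d}: simulated SE = {results[n][0]:.3f}, sigma/sqrt(n) = {results[n][1]:.3f}")
```

The simulated values should track σ/√n closely, and both shrink as n grows: quadrupling the sample size halves the standard error.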
In the next article, we will cover the main concept of inferential statistics, hypothesis testing, and why it is used. We will also discuss the vital role of the Central Limit Theorem in inferential statistics.