Descriptive Statistics and Visualization with Plotly in Python
Statistics matters in machine learning and deep learning because everything a model learns comes from data. If we understand the data properly, we can also understand how statistics is used in machine learning and deep learning. So what is statistics? Put simply, statistics is the science of collecting, organizing, presenting, analyzing, and interpreting numerical data to assist in making more effective decisions. Statistics is commonly divided into two types: descriptive statistics and inferential statistics. In this article, we focus on descriptive statistics: we cover common descriptive measures and construct tables and graphs using Python libraries such as pandas and Plotly for statistical analysis and visualization.
Descriptive statistics consists of organizing and summarizing data. In brief, descriptive statistics describes the data set being analyzed but does not allow us to draw conclusions or make inferences beyond that data. It involves calculating measures of central tendency and measures of dispersion, and building tables and graphs.
Types of Data Available:
Before beginning any computation or plotting, we first need to understand the types of data commonly used in statistical analysis. There are two main types: numerical (quantitative) data and categorical (qualitative) data.
Numerical data: Numerical data refers to the data that is measured or counted such as age, height, weight, salary, number of employees in an organization, etc.
Categorical data: Categorical data, as the name implies, is usually non-numerical information grouped into one or more categories, such as religion, colour, gender, marital status, etc.
Let's review the type of variables:
Load the Dataset:
As our sample data, we use a dataset of car sales for different manufacturers, “car_sales.csv”. The dataset is available in our GitHub repository.
To get a better understanding of the data, let’s analyze it using the pandas library in Python.
In Figure 2: In cell 1, we import two libraries: pandas and NumPy. In cell 2, we read the raw data from our GitHub repository and store it in a data frame named “df”. Cell 3 shows the first five records in the dataset. Cell 4 gives information about the dataset: the total number of rows and columns, the column data types, the range index, and the number of non-null cells in each column. Most columns have the data type “int64” or “float64”, which represent numerical data, whereas “object” represents categorical data. The one exception is the column named “Latest_Launch”: it is read as a categorical (“object”) variable, but it actually holds dates in month/day/year format. A single line of pandas code is enough to convert this variable to datetime format.
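The date conversion described above can be sketched as follows. Because the repository URL is not reproduced here, this minimal example builds a small stand-in data frame instead of reading “car_sales.csv”. The rows and the “Manufacturer” values are made up for illustration; only the “Latest_Launch” and “Price_in_thousands” column names come from the article, and the explicit format string assumes the month/day/year layout mentioned above.

```python
import pandas as pd

# Made-up stand-in for a few rows of car_sales.csv (values are illustrative only)
df = pd.DataFrame({
    "Manufacturer": ["Maker A", "Maker B", "Maker C"],
    "Price_in_thousands": [21.5, 28.4, 42.0],
    "Latest_Launch": ["02/02/2012", "06/03/2011", "10/15/2011"],
})

# Before the conversion, Latest_Launch holds plain strings
print(df.dtypes)

# The single conversion line: parse month/day/year strings into datetimes
df["Latest_Launch"] = pd.to_datetime(df["Latest_Launch"], format="%m/%d/%Y")

# After the conversion, Latest_Launch has a datetime dtype
print(df.dtypes)
```

With the real file, the same conversion line would simply follow the call that reads the CSV into “df”.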
Let’s try to Understand:
Measures of Central Tendency:
To summarize numerical data, we need to find its central tendency. For example, from the imported dataset we can ask: what is the average sales figure across the different manufacturers? The most common measure for answering this question is the mean.
1) Mean: It is the average of the data, computed as the sum of the data points divided by the number of points. It is the easiest metric to understand and communicate. However, the mean is sensitive to outliers. An outlier is a data point that differs significantly from the other observations, and outliers can cause serious problems in statistical analysis.
2) Median: It is the middle value of a sorted list of numbers. If the data set has an odd number of values, the median is the value in the middle position; if it has an even number of values, the median is the mean of the two middle values. The median is more “robust” to outliers than the mean.
Example: To compare the performance of any single employee against a group.
3) Mode: The value or values observed most frequently. If every value occurs only once, there is no mode. If two or more values occur equally often, and more often than any other value, then there are two or more modes.
Example: A parent wanting to know whether their child is better or worse than a typical child at his grade level.
In Python code:
pandas offers three simple methods, mean(), median(), and mode(), to calculate these measures. So let’s use them on our car sales dataset.
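Here is a minimal sketch of these three methods. It uses a small made-up data frame rather than the real “car_sales.csv”, so the numbers are purely illustrative; the “Manufacturer” column name and its values are assumptions. The last price is deliberately large to show how an outlier pulls the mean above the median.

```python
import pandas as pd

# Made-up stand-in for the car sales data (values are illustrative only)
df = pd.DataFrame({
    "Manufacturer": ["Maker A", "Maker B", "Maker C", "Maker B", "Maker A", "Maker B"],
    "Price_in_thousands": [21.5, 28.4, 33.4, 42.0, 28.4, 62.0],
})

price = df["Price_in_thousands"]

mean_price = price.mean()      # sum of the values divided by their count
median_price = price.median()  # middle value of the sorted prices
mode_price = price.mode()      # a Series, because there can be several modes

print("Mean:", mean_price)
print("Median:", median_price)
print("Mode(s):", mode_price.tolist())
```

Note that mode() returns a Series rather than a single number, which matches the definition above: a data set can have more than one mode.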
Measures of Dispersion:
1) Standard Deviation (SD): The SD measures the amount of variation or dispersion in a set of values. It shows how far the values spread out from their mean. A low standard deviation means the values lie close to the mean, and a high standard deviation means they are spread far from it. The SD is affected by outliers. It is calculated as the square root of the variance.
To calculate the standard deviation, we can follow these steps:
- Calculate the mean of the dataset
- Subtract the mean from each value present in the dataset.
- Square them
- Add up the squared results
- Divide the sum by the number of values minus one (this gives the sample variance)
- Calculate the Square root of it.
Example: dataset = {10, 11, 12, 13, 14}
- Mean = (10 + 11 + 12 + 13 + 14)/5 = 12
- {10-12, 11-12, 12-12, 13-12, 14-12} = {-2, -1, 0, 1, 2}
- {(-2)^2, (-1)^2, 0^2, 1^2, 2^2} = {4, 1, 0, 1, 4}
- 4 + 1 + 0 + 1 + 4 = 10
- 10/(5-1) = 2.5 (Variance)
- Sqrt(2.5) ≈ 1.58 (SD)
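The same worked example can be reproduced step by step in plain Python. This is only a sketch that mirrors the steps listed above; the variable names are chosen for illustration, and the standard library's statistics.stdev is used at the end as a cross-check.

```python
import math
import statistics

data = [10, 11, 12, 13, 14]

# Step 1: mean of the dataset
mean = sum(data) / len(data)

# Step 2: subtract the mean from each value
deviations = [x - mean for x in data]

# Step 3: square each deviation
squared = [d ** 2 for d in deviations]

# Step 4: add up the squared deviations
total = sum(squared)

# Step 5: divide by the number of values minus one (sample variance)
variance = total / (len(data) - 1)

# Step 6: take the square root to get the standard deviation
sd = math.sqrt(variance)

print("Variance:", variance)
print("SD:", round(sd, 2))

# Cross-check against the standard library's sample standard deviation
print("statistics.stdev:", round(statistics.stdev(data), 2))
```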
2) Variance: Variance measures the spread of the data about the mean. It is the square of the standard deviation and, like the SD, it is affected by outliers.
3) Percentiles: A percentile (or centile) is a measure used in statistics that gives the value below which a given percentage of the data falls. For example, the 35th percentile is the value below which 35% of the observations are found. The minimum value is the zeroth percentile, the maximum is the 100th percentile, and the 50th percentile is the median.
4) Range: It is defined as the maximum value minus the minimum value.
5) Quartile Deviation: Quartile deviations can be defined as half the difference between the third and first quartiles of a particular dataset.
6) Minimum value: The smallest or lowest value in the data set.
7) Maximum value: The greatest or highest value in the data set.
8) Coefficient of Range: The ratio of the difference between the highest and lowest value in a data set to the sum of the highest and lowest value.
9) Coefficient of Variation (CoV): The CoV is the ratio of the standard deviation to the arithmetic mean: CoV = SD/mean. Because it is unit free, we can use it to compare variables measured in different units, for example the coefficient of variation of age with the coefficient of variation of income.
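The measures listed above can be computed with a few pandas calls. The sketch below uses a made-up Series rather than a column from the car sales data, and the variable names are illustrative. Note that pandas computes quantiles with linear interpolation by default, and that var() and std() use the sample formulas (dividing by n - 1), in line with the worked example above.

```python
import pandas as pd

# Made-up values, used only to illustrate the formulas
values = pd.Series([10, 20, 30, 40, 50])

minimum = values.min()
maximum = values.max()

# Range and coefficient of range
value_range = maximum - minimum
coeff_of_range = (maximum - minimum) / (maximum + minimum)

# Percentiles: the 25th, 50th (median) and 75th
q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)

# Quartile deviation: half the difference between Q3 and Q1
quartile_deviation = (q3 - q1) / 2

# Sample variance, standard deviation and coefficient of variation
variance = values.var()
sd = values.std()
cov = sd / values.mean()

print("Range:", value_range)
print("Coefficient of range:", coeff_of_range)
print("Q1, median, Q3:", q1, median, q3)
print("Quartile deviation:", quartile_deviation)
print("Variance:", variance)
print("SD:", sd)
print("CoV:", cov)
```

For numerical columns of a data frame, describe() reports several of these values (count, mean, std, min, quartiles, max) in one table.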
For categorical data, the statistical summary reports the count, the number of unique values (unique), the most frequent value (top), and how often that value occurs (freq).
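A minimal sketch of such a summary is shown below. The data frame is made up, and the “Manufacturer” and “Vehicle_type” column names and values are assumptions for illustration. The categorical columns are selected explicitly so that describe() returns count, unique, top, and freq for them.

```python
import pandas as pd

# Made-up categorical data (column names and values are illustrative only)
df = pd.DataFrame({
    "Manufacturer": ["Maker A", "Maker B", "Maker A", "Maker C", "Maker A"],
    "Vehicle_type": ["Passenger", "Car", "Passenger", "Passenger", "Car"],
})

# Select the categorical columns explicitly, then summarize them
categorical_columns = ["Manufacturer", "Vehicle_type"]
summary = df[categorical_columns].describe()

print(summary)
```

On a data frame that mixes numerical and categorical columns, describe() summarizes only the numerical columns by default, which is why the categorical ones are picked out first.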
Let's plot a boxplot of Price_in_thousands for all the manufacturers present in our data. A boxplot does not show the shape of the distribution in detail, but it gives us a good view of how the data is spread and of any potential outliers that may be present.
Now, let’s try to build a histogram to observe the data distribution for our car sales data.
When we analyze categorical variables, we create bar charts and pie charts instead.
GitHub Resources:
All the Python scripts presented here are written in a Jupyter Notebook and shared through a GitHub repository. Feel free to download the notebook from: