In data science, we build regression models to see how well one variable can be predicted based on one or more variables. Can we use excel to do the same? Yes we can use different excel Add-ins or tools to do this. With latest versions of Excel, it does not take more than a minute to build a model. Before building a model, let us know a little about what is Linear Regression? So Linear regression is a statistical method used to explore the correlation between two continuous quantitative variables. This technique is used to predict the value of an independent variable based on the value of other dependent variables. If there is only one independent variable, it is a simple linear regression; if there are multiple independent variables, it is a multiple linear regression. So here, we are trying to use the relationship pattern to predict the value of a variable. It is a supervised learning task where output is a continuous value.
When Y is numerical, 𝑌 = 𝐹(𝑋) + 𝐶
Since we are using linear regression, the equation is a linear equation.
Here 𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5 … . . 𝑥𝑛 are variables that we used to predict 𝑌. 𝛽0, 𝛽1, 𝛽2, … . 𝛽𝑛 are the coefficient variables. The whole instruments of linear regression is built around predicting the value or getting the values of 𝛽0, 𝛽1, 𝛽2, …
The Error can be measured by the quantity MSE (Mean Squared Error). The MSE is the average of squared error.
We should build an equation so that MSE is very low.
𝐻𝑒𝑟𝑒 𝑎𝑐𝑡𝑢𝑎𝑙 𝑖𝑠 𝑘𝑛𝑜𝑤𝑛 𝑡𝑜 𝑢𝑠. 𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5 … . . 𝑥𝑛 𝑎𝑛𝑑 𝛽0 , 𝛽1, 𝛽2, … . 𝛽𝑛 𝑎𝑟𝑒 𝑢𝑛𝑘𝑛𝑜𝑤𝑛. So we have to minimize the quantity with respect to 𝛽0 , 𝛽1, 𝛽2, … . 𝛽𝑛 by using partial differentiation. We will partially differentiate the quantity 𝛽0 with respect to 𝛽0 and equate to zero, we will get 𝛽0 . Similarly we calculate for 𝛽2, … . 𝛽𝑛 . This method is known as ordinary least squares (OLS) method.
Example: Let us consider a small Car Sales dataset provided by the Analytixlabs when I was studying there, with variables Manufacturer, Model, Vehicle type, year resale value, Price in thousands, Engine size, Horsepower, Wheel base, Width, Length, Curb weight, Fuel capacity, Fuel efficiency, Latest Launch, Power perf factor, Sales in thousands. Here we have features of different cars sold in the past. Sales depend upon the features of a car. Now we are proposing a regression model on the features so that when next time a new car is launched, we can predict how much it will sell.
Steps follow to build the model in Excel:
Step 1: Remove everything that is categorical. So we have: year resale value, Price in thousands, Engine size, Horsepower, Wheelbase, Width, Length, Curb weight, Fuel capacity, Fuel efficiency, Power perf factor, and Sales in thousands.
Step 2: Replace missing value by zero.
Step 3: 𝐻𝑒𝑟𝑒 𝒀 = 𝐷𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡 𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒 = 𝑆𝑎𝑙𝑒𝑠 𝑖𝑛 𝑡ℎ𝑜𝑢𝑠𝑎𝑛𝑑𝑠.
𝐼𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡 𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠: 𝑥1 = 𝑦𝑒𝑎𝑟 𝑟𝑒𝑠𝑎𝑙𝑒 𝑣𝑎𝑙𝑢𝑒, 𝑥2 = 𝑃𝑟𝑖𝑐𝑒 𝑖𝑛 𝑡ℎ𝑜𝑢𝑠𝑎𝑛𝑑𝑠, 𝑥3 = 𝐸𝑛𝑔𝑖𝑛𝑒 𝑠𝑖𝑧𝑒, 𝑥4 = 𝐻𝑜𝑟𝑠𝑒𝑝𝑜𝑤𝑒𝑟, 𝑥5 = 𝑊ℎ𝑒𝑒𝑙𝑏𝑎𝑠𝑒, 𝑥6 = 𝑊𝑖𝑑𝑡ℎ, 𝑥7 = 𝐿𝑒𝑛𝑔𝑡ℎ, 𝑥8 = 𝐶𝑢𝑟𝑏 𝑤𝑒𝑖𝑔ℎ𝑡, 𝑥9 = 𝐹𝑢𝑒𝑙 𝑐𝑎𝑝𝑎𝑐𝑖𝑡𝑦, 𝑥10 = 𝑃𝑜𝑤𝑒𝑟 𝑝𝑒𝑟𝑓 𝑓𝑎𝑐𝑡𝑜𝑟, 𝑥11 = 𝐹𝑢𝑒𝑙 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑐𝑦
Analysis ToolPak may not be enabled by default, so we need to enable it manually. To do this,
I. On the Menu Bar, click “File”
II. Next, click on “Options”
III. Then Click on “Add-ins”, in the “Manage” dropdown box, select ‘Excel Add-Ins’ and click on ‘Go...’ button.
IV. Select Checkbox of ‘Analysis ToolPak’ and then click on ‘OK’.
Step 4: Now we can see ‘Data Analysis’ tools in the ‘Data’ tab. Click on ‘Data Analysis’ tab. From the “Analysis Tools”, select “Regression” and click “OK”
Select the Input Y range and Input X range. In our case, we use linear regression to select the dependent and independent variables from Step 3. We have included headers so check the ‘Labels’ box. Then Click “OK”.
The regression analysis output is now created in a new worksheet with regression statistics, ANOVA, residuals, t Stat, P-value, Lower 95%, Upper 95%, Lower 95.0%, Upper 95.0%, and coefficients.
- Multiple R stands for multiple correlations. Now in Pearson correlation, we are looking at the correlation of two variables at a time. Here the correlation is in Y and all the X’s together. X’s are Independent Variables in Step 3. The higher the Multiple R value better is the model. This is one of the strengths of the model.
- Observations are the number of observations taken part in the regression model.
- In layman’s understanding, the concept of R Square is the total information that is present in Y, a portion of it is captured by the model and the rest is the error. R Square is what percentage of the total information in Y is captured by the model.
The higher the value of R Square, the better is the model. Here R Square=0.288 means that 28.8% of the information or variance of Y is captured by the model.
- In the ANOVA table (which gives an idea about the level of variability within our regression model) from the summary output, the only interesting variable that we can look into is Significance F (P- value). To determine whether the difference between groups means is statistically significant or not, examine the p-value corresponding to the F-statistic.
- The df (degrees of freedom) regression is the number of independent variables in our regression model. Here df regression is 11, because 12 independent variables, including intercept (or constant, implying that the expected mean value of Y when all X=0) we have selected for model building. df residual is the total number of observations (rows) in the dataset subtracted from the number of estimated variables. Here it is 157- 12=145. df residual is the sum of df regression and df residual.
- The SS stands for Sum Square. SS regression is basically the total variation of the dependent variable explained by our regression model. It is the sum of the squared differences between the predicted value and the mean of all data points.
- Also we can calculate R Square from SS Regression,
- The MS stands for Mean Square. It is calculated by the sum of squares divided by df (degrees of freedom).
- The F-Value is used to test the hypothesis.
We can get the p-value from the F-Distribution. The high value of F indicates higher value of mean square of regression means higher value of SS, more information captured by the regression model.
- Standard Error is the average error that we have made in estimating the 𝛽 . This is the average difference between actual and predicted values. Standard error returns the estimated standard deviation of the coefficient distribution. We need the standard error to calculate the t-stat.
- t-stat is basically the coefficient divided by the standard error. This t-stat follows t-distribution therefore we calculate the p-value.
- The p-value is testing the hypothesis. For each 𝛽, we are testing individually.
- If p-value > 0.05, we need to drop that variable. In our case, we have multiple variables like: year resale value, Price in thousands, Length, Curb weight, Fuel capacity, Power perf factor, and Fuel efficiency having 𝛽 > 𝑝 − 𝑣𝑎𝑙𝑢𝑒 . So we look at the variable having the highest p-value drop that first and rebuild the model and then drop the variables one by one in descending order from the regression model and rebuild the model in each drop.
- The Lower 95% and Higher 95% give the confidence interval. This is the interval estimate of β. The coefficient is the point estimate of β. Out of all the models, if we built 100 on 100 different data, 95% of those models have their β in between Lower 95% and Higher 95%.
So as a result, here 𝛽0, 𝛽1, 𝛽2, 𝛽3, 𝛽4, 𝛽5, 𝛽6, 𝛽7, 𝛽8, 𝛽9, 𝛽10, 𝛽11 are the Coefficients which plays a vital role in building a regression equation. The value of 𝛽 not only gives us the direction but the absolute value gives the amount of relationship. So if we want to increase the sales, should we increase any variables like Engine size, Wheelbase, Horsepower etc. Then we should first look into the coefficient values and figure out which one is higher. In our case, when looking into the coefficient value of Engine size, if we increase one unit from engine size, the sales increase by 31 units.
Here Intercept coefficient=𝛽0, year resale value coefficient= 𝛽1, Price in thousands coefficient= 𝛽2, Engine size coefficient=𝛽3 and so on.
So, our regression equation would be:
𝑺𝒂𝒍𝒆𝒔 𝒊𝒏 𝒕𝒉𝒐𝒖𝒔𝒂𝒏𝒅𝒔 = 0.926 + 0.142 ∗ 𝑦𝑒𝑎𝑟 𝑟𝑒𝑠𝑎𝑙𝑒 𝑣𝑎𝑙𝑢𝑒 + 0.078 ∗ 𝑃𝑟𝑖𝑐𝑒 𝑖𝑛 𝑡ℎ𝑜𝑢𝑠𝑎𝑛𝑑𝑠 + 0.004 ∗ 𝐸𝑛𝑔𝑖𝑛𝑒 𝑠𝑖𝑧𝑒 + 0.019 ∗ 𝐻𝑜𝑟𝑠𝑒𝑝𝑜𝑤𝑒𝑟 + 0.001 ∗ 𝑊ℎ𝑒𝑒𝑙𝑏𝑎𝑠𝑒 + 0.045 ∗ 𝑊𝑖𝑑𝑡ℎ + 0.478 ∗ 𝐿𝑒𝑛𝑔𝑡ℎ + 0.428 ∗ 𝐶𝑢𝑟𝑏 𝑤𝑒𝑖𝑔ℎ𝑡 + 0.465 ∗ 𝐹𝑢𝑒𝑙 𝑐𝑎𝑝𝑎𝑐𝑖𝑡𝑦 + 0.371 ∗ 𝐹𝑢𝑒𝑙 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑐𝑦 + 0.370 ∗ 𝑃𝑜𝑤𝑒𝑟 𝑝𝑒𝑟𝑓 𝑓𝑎𝑐𝑡𝑜r