How do we perform Linear Regression Model in Excel?
In data science, we build regression models to see how well one variable can be predicted based on one or more variables. Can we use excel to do the same? Yes we can use different excel Add-ins or tools to do this. With latest versions of Excel, it does not take more than a minute to build a model. Before building a model, let us know a little about what is Linear Regression? So Linear regression is a statistical method used to explore the correlation between two continuous quantitative variables. This technique is used to predict the value of an independent variable based on the value of other dependent variables. If there is only one independent variable, it is a simple linear regression; if there are multiple independent variables, it is a multiple linear regression. So here, we are trying to use the relationship pattern to predict the value of a variable. It is a supervised learning task where output is a continuous value.
When Y is numerical, π = πΉ(π) + πΆ
Since we are using linear regression, the equation is a linear equation.
Here π₯1, π₯2, π₯3, π₯4, π₯5 β¦ . . π₯π are variables that we used to predict π. π½0, π½1, π½2, β¦ . π½π are the coefficient variables. The whole instruments of linear regression is built around predicting the value or getting the values of π½0, π½1, π½2, β¦
Error:
The Error can be measured by the quantity MSE (Mean Squared Error). The MSE is the average of squared error.
We should build an equation so that MSE is very low.
π»πππ πππ‘π’ππ ππ ππππ€π π‘π π’π . π₯1, π₯2, π₯3, π₯4, π₯5 β¦ . . π₯π πππ π½0 , π½1, π½2, β¦ . π½π πππ π’πππππ€π. So we have to minimize the quantity with respect to π½0 , π½1, π½2, β¦ . π½π by using partial differentiation. We will partially differentiate the quantity π½0 with respect to π½0 and equate to zero, we will get π½0 . Similarly we calculate for π½2, β¦ . π½π . This method is known as ordinary least squares (OLS) method.
Example: Let us consider a small Car Sales dataset provided by the Analytixlabs when I was studying there, with variables Manufacturer, Model, Vehicle type, year resale value, Price in thousands, Engine size, Horsepower, Wheel base, Width, Length, Curb weight, Fuel capacity, Fuel efficiency, Latest Launch, Power perf factor, Sales in thousands. Here we have features of different cars sold in the past. Sales depend upon the features of a car. Now we are proposing a regression model on the features so that when next time a new car is launched, we can predict how much it will sell.
Steps follow to build the model in Excel:
Step 1: Remove everything that is categorical. So we have: year resale value, Price in thousands, Engine size, Horsepower, Wheelbase, Width, Length, Curb weight, Fuel capacity, Fuel efficiency, Power perf factor, and Sales in thousands.
Step 2: Replace missing value by zero.
Step 3: π»πππ π = π·ππππππππ‘ ππππππππ = πππππ ππ π‘βππ’π ππππ .
πΌππππππππππ‘ πππππππππ : π₯1 = π¦πππ πππ πππ π£πππ’π, π₯2 = πππππ ππ π‘βππ’π ππππ , π₯3 = πΈπππππ π ππ§π, π₯4 = π»πππ ππππ€ππ, π₯5 = πβππππππ π, π₯6 = ππππ‘β, π₯7 = πΏππππ‘β, π₯8 = πΆπ’ππ π€πππβπ‘, π₯9 = πΉπ’ππ πππππππ‘π¦, π₯10 = πππ€ππ ππππ ππππ‘ππ, π₯11 = πΉπ’ππ ππππππππππ¦
Step 4:
Analysis ToolPak may not be enabled by default, so we need to enable it manually. To do this,
I. On the Menu Bar, click βFileβ
II. Next, click on βOptionsβ
III. Then Click on βAdd-insβ, in the βManageβ dropdown box, select βExcel Add-Insβ and click on βGo...β button.
IV. Select Checkbox of βAnalysis ToolPakβ and then click on βOKβ.
Step 4: Now we can see βData Analysisβ tools in the βDataβ tab. Click on βData Analysisβ tab. From the βAnalysis Toolsβ, select βRegressionβ and click βOKβ
Select the Input Y range and Input X range. In our case, we use linear regression to select the dependent and independent variables from Step 3. We have included headers so check the βLabelsβ box. Then Click βOKβ.
The regression analysis output is now created in a new worksheet with regression statistics, ANOVA, residuals, t Stat, P-value, Lower 95%, Upper 95%, Lower 95.0%, Upper 95.0%, and coefficients.
Result Interpretation:
- Multiple R stands for multiple correlations. Now in Pearson correlation, we are looking at the correlation of two variables at a time. Here the correlation is in Y and all the Xβs together. Xβs are Independent Variables in Step 3. The higher the Multiple R value better is the model. This is one of the strengths of the model.
- Observations are the number of observations taken part in the regression model.
- In laymanβs understanding, the concept of R Square is the total information that is present in Y, a portion of it is captured by the model and the rest is the error. R Square is what percentage of the total information in Y is captured by the model.
The higher the value of R Square, the better is the model. Here R Square=0.288 means that 28.8% of the information or variance of Y is captured by the model. - In the ANOVA table (which gives an idea about the level of variability within our regression model) from the summary output, the only interesting variable that we can look into is Significance F (P- value). To determine whether the difference between groups means is statistically significant or not, examine the p-value corresponding to the F-statistic.
- The df (degrees of freedom) regression is the number of independent variables in our regression model. Here df regression is 11, because 12 independent variables, including intercept (or constant, implying that the expected mean value of Y when all X=0) we have selected for model building. df residual is the total number of observations (rows) in the dataset subtracted from the number of estimated variables. Here it is 157- 12=145. df residual is the sum of df regression and df residual.
- The SS stands for Sum Square. SS regression is basically the total variation of the dependent variable explained by our regression model. It is the sum of the squared differences between the predicted value and the mean of all data points.
- Also we can calculate R Square from SS Regression,
- The MS stands for Mean Square. It is calculated by the sum of squares divided by df (degrees of freedom).
- The F-Value is used to test the hypothesis.
We can get the p-value from the F-Distribution. The high value of F indicates higher value of mean square of regression means higher value of SS, more information captured by the regression model.
- Standard Error is the average error that we have made in estimating the π½ . This is the average difference between actual and predicted values. Standard error returns the estimated standard deviation of the coefficient distribution. We need the standard error to calculate the t-stat.
- t-stat is basically the coefficient divided by the standard error. This t-stat follows t-distribution therefore we calculate the p-value.
- The p-value is testing the hypothesis. For each π½, we are testing individually.
- If p-value > 0.05, we need to drop that variable. In our case, we have multiple variables like: year resale value, Price in thousands, Length, Curb weight, Fuel capacity, Power perf factor, and Fuel efficiency having π½ > π β π£πππ’π . So we look at the variable having the highest p-value drop that first and rebuild the model and then drop the variables one by one in descending order from the regression model and rebuild the model in each drop.
- The Lower 95% and Higher 95% give the confidence interval. This is the interval estimate of Ξ². The coefficient is the point estimate of Ξ². Out of all the models, if we built 100 on 100 different data, 95% of those models have their Ξ² in between Lower 95% and Higher 95%.
So as a result, here π½0, π½1, π½2, π½3, π½4, π½5, π½6, π½7, π½8, π½9, π½10, π½11 are the Coefficients which plays a vital role in building a regression equation. The value of π½ not only gives us the direction but the absolute value gives the amount of relationship. So if we want to increase the sales, should we increase any variables like Engine size, Wheelbase, Horsepower etc. Then we should first look into the coefficient values and figure out which one is higher. In our case, when looking into the coefficient value of Engine size, if we increase one unit from engine size, the sales increase by 31 units.
Here Intercept coefficient=π½0, year resale value coefficient= π½1, Price in thousands coefficient= π½2, Engine size coefficient=π½3 and so on.
So, our regression equation would be:
πΊππππ ππ ππππππππ π = 0.926 + 0.142 β π¦πππ πππ πππ π£πππ’π + 0.078 β πππππ ππ π‘βππ’π ππππ + 0.004 β πΈπππππ π ππ§π + 0.019 β π»πππ ππππ€ππ + 0.001 β πβππππππ π + 0.045 β ππππ‘β + 0.478 β πΏππππ‘β + 0.428 β πΆπ’ππ π€πππβπ‘ + 0.465 β πΉπ’ππ πππππππ‘π¦ + 0.371 β πΉπ’ππ ππππππππππ¦ + 0.370 β πππ€ππ ππππ ππππ‘πr
Keep Learning!