Chapter 3 Linear Regression
3.1 Model Fitting
3.1.1 Model Summary
##
## Call:
## lm(formula = gross ~ year + certificate + runtime + genre + rating,
## data = train_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -290.69 -26.08 -7.30 13.03 830.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -137.85487 8.63938 -15.957 < 2e-16 ***
## year1970 -3.87542 9.42478 -0.411 0.680935
## year1971 -2.80330 9.59515 -0.292 0.770169
## year1972 -15.06170 9.16267 -1.644 0.100234
## year1973 -13.95687 9.26134 -1.507 0.131828
## year1974 -14.56539 9.15107 -1.592 0.111480
## year1975 -3.12536 10.39778 -0.301 0.763739
## year1977 3.97240 10.57808 0.376 0.707270
## year1978 2.84313 10.60215 0.268 0.788575
## year1979 -0.18021 10.51887 -0.017 0.986332
## year1980 10.64177 9.70045 1.097 0.272640
## year1981 5.10760 9.21075 0.555 0.579226
## year1982 2.59656 9.11590 0.285 0.775771
## year1983 6.98645 9.09738 0.768 0.442520
## year1984 2.61209 8.92271 0.293 0.769719
## year1985 -7.95517 8.70555 -0.914 0.360833
## year1986 3.19815 8.58474 0.373 0.709496
## year1987 1.02436 8.51156 0.120 0.904208
## year1988 5.22470 8.50893 0.614 0.539207
## year1989 8.44815 8.55863 0.987 0.323612
## year1990 11.29015 8.54004 1.322 0.186178
## year1991 10.18008 8.63533 1.179 0.238460
## year1992 13.53419 8.53408 1.586 0.112781
## year1993 11.16539 8.55122 1.306 0.191670
## year1994 9.06280 8.46649 1.070 0.284441
## year1995 11.86080 8.45463 1.403 0.160672
## year1996 13.81837 8.45351 1.635 0.102145
## year1997 11.76002 8.36493 1.406 0.159781
## year1998 16.74185 8.36450 2.002 0.045351 *
## year1999 25.37426 8.43803 3.007 0.002641 **
## year2000 16.93140 8.32003 2.035 0.041865 *
## year2001 16.10946 8.32936 1.934 0.053123 .
## year2002 19.23010 8.31165 2.314 0.020700 *
## year2003 19.19027 8.35074 2.298 0.021572 *
## year2004 14.76399 8.29930 1.779 0.075267 .
## year2005 20.32833 8.29115 2.452 0.014224 *
## year2006 15.37888 8.25290 1.863 0.062416 .
## year2007 18.86797 8.25311 2.286 0.022257 *
## year2008 19.56242 8.26219 2.368 0.017910 *
## year2009 18.49899 8.27099 2.237 0.025325 *
## year2010 26.46834 8.28003 3.197 0.001393 **
## year2011 20.30902 8.26150 2.458 0.013971 *
## year2012 22.47593 8.27189 2.717 0.006592 **
## year2013 20.29357 8.22216 2.468 0.013591 *
## year2014 18.74780 8.22410 2.280 0.022643 *
## year2015 18.31896 8.21495 2.230 0.025764 *
## year2016 19.67048 8.19812 2.399 0.016433 *
## year2017 24.85789 8.18792 3.036 0.002402 **
## year2018 17.58427 8.18299 2.149 0.031658 *
## year2019 40.61469 8.41087 4.829 1.39e-06 ***
## year2020 4.74310 10.18953 0.465 0.641588
## year2021 40.70115 9.51255 4.279 1.89e-05 ***
## yearOther 5.95449 8.00326 0.744 0.456882
## certificateOther 22.60671 1.88203 12.012 < 2e-16 ***
## certificatePG 58.25183 1.84888 31.507 < 2e-16 ***
## certificatePG-13 61.69036 1.69226 36.454 < 2e-16 ***
## certificateR 25.36004 1.58204 16.030 < 2e-16 ***
## runtime 0.38742 0.02071 18.711 < 2e-16 ***
## genreAdult -10.91955 31.37147 -0.348 0.727790
## genreAdventure -9.65920 1.71302 -5.639 1.74e-08 ***
## genreAnimation 13.72542 1.81341 7.569 3.96e-14 ***
## genreBiography -38.84403 2.99654 -12.963 < 2e-16 ***
## genreComedy -21.43946 1.29996 -16.492 < 2e-16 ***
## genreCrime -22.64123 1.54289 -14.675 < 2e-16 ***
## genreDrama -29.53421 1.36724 -21.601 < 2e-16 ***
## genreFamily -19.61369 9.28584 -2.112 0.034683 *
## genreFantasy -7.60225 4.68469 -1.623 0.104654
## genreFilm-Noir -36.21209 24.33484 -1.488 0.136750
## genreHistory -23.66423 27.16556 -0.871 0.383706
## genreHorror -0.57666 2.11120 -0.273 0.784748
## genreMusic -22.32211 54.34872 -0.411 0.681283
## genreMusical -17.40108 16.40813 -1.061 0.288925
## genreMystery -24.14321 6.29763 -3.834 0.000127 ***
## genreRomance -37.53941 8.73826 -4.296 1.75e-05 ***
## genreSci-Fi -28.84733 20.54008 -1.404 0.160206
## genreThriller -13.24588 7.31069 -1.812 0.070027 .
## genreWar -48.76727 54.31289 -0.898 0.369254
## rating 14.16936 0.45469 31.163 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.24 on 16749 degrees of freedom
## Multiple R-squared: 0.2614, Adjusted R-squared: 0.258
## F-statistic: 76.97 on 77 and 16749 DF, p-value: < 2.2e-16
We can read from the summary of the linear model that the R-squared of our model on the training set is only 0.2614, which indicates a very poor fit. To further examine whether the linearity assumption holds, we draw the residual plot below.
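For reference, the fit above can be reproduced with a call like the following (a sketch; `train_dat` is the training split from the data-preparation step, with `year`, `certificate`, and `genre` coded as factors):

```r
# Fit the linear model; lm() dummy-codes the factor variables automatically.
fit <- lm(gross ~ year + certificate + runtime + genre + rating,
          data = train_dat)
summary(fit)$r.squared  # training R-squared, about 0.2614 here
```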
3.1.2 Check Linearity

From the residual plot, we can observe that the majority of points lie along a line with negative slope. This might indicate that our model over-estimates the box office of certain types of movies and under-estimates that of others. That is, some common characteristic of observations with high fitted values contributes heavily to their predictions, and likewise for observations with low fitted values.
Also note that we have negative fitted values, which obviously do not make sense in reality. Checking the dataset confirms that the gross variable contains no negative values, so the negative fitted values cannot come from the data itself. This is another indication that a linear model might not be appropriate.
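The residual plot can be drawn from the fitted model with base graphics (a sketch; `fit` is the `lm` object from Section 3.1.1):

```r
# Residuals vs. fitted values; under linearity we expect a patternless
# horizontal band around zero.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, lty = 2, col = "red")
```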
3.2 Model Evaluation
## MSE MAE R-Squared
## 1 2726.577 31.71711 0.2759302
The evaluation table shows the MSE, MAE, and R-squared of our model on the test set. All three criteria indicate that our model performs poorly. Hence, in terms of accuracy, the linear model should not be our final model.
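The three criteria in the table can be computed on the held-out set as follows (a sketch; `test_dat` is assumed to be the test split):

```r
pred <- predict(fit, newdata = test_dat)
err  <- test_dat$gross - pred
data.frame(
  MSE = mean(err^2),     # mean squared error
  MAE = mean(abs(err)),  # mean absolute error
  R.Squared = 1 - sum(err^2) /
                  sum((test_dat$gross - mean(test_dat$gross))^2)
)
```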
3.3 Model Interpretation
3.3.1 Partial Dependence Plot
Next, we want to explore the marginal effect of rating, runtime, certificate, and genre on the predicted value.


Note that the PDPs of rating and runtime are both straight lines; this is because in a linear model the relationship between the response and each feature is linear. We can also see that both variables have a positive marginal effect on gross. However, the ranges of predicted box office in the two figures differ: rating produces a larger variation in the predicted value. This suggests that rating explains more of the variation in movie box office, so rating should be a more important feature than runtime, which is consistent with our intuition.
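The curves above can be produced with the `pdp` package (a sketch; we assume `pdp` is installed and `fit` is the `lm` object from before):

```r
library(pdp)
# Partial dependence: average the model's predictions over the training
# data while sweeping one feature across its observed range.
pd_rating  <- partial(fit, pred.var = "rating",  train = train_dat)
pd_runtime <- partial(fit, pred.var = "runtime", train = train_dat)
plotPartial(pd_rating)
plotPartial(pd_runtime)
```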

The PDP of certificate shows that PG-13 and PG movies tend to earn higher box office while Not Rated movies tend to earn less. This is consistent with the scatter plot of box office vs. certificate.

The PDP of genre shows that animation, action, and horror are the three genres with the highest predicted box office, while war, film-noir, and biography have the lowest.
3.3.2 Local Interpretable Model-agnostic Explanations (LIME)
Given the high interpretability of the linear model, LIME feature plots may seem redundant. But I want to check whether the results obtained from the local interpretable model are consistent with the ordinary linear model.
| model_intercept | model_prediction | feature | feature_value | feature_weight | feature_desc | prediction |
|---|---|---|---|---|---|---|
| 93.333 | 45.399 | year | 27 | -2.133 | year = 1996 | 43.805 |
| 93.333 | 45.399 | certificate | 3 | 27.663 | certificate = PG | 43.805 |
| 93.333 | 45.399 | runtime | 100 | -58.298 | runtime <= 232 | 43.805 |
| 93.333 | 45.399 | genre | 1 | 16.098 | genre = Action | 43.805 |
| 93.333 | 45.399 | rating | 5 | -31.264 | 3.27 < rating <= 5.25 | 43.805 |

From the explanation table, the local model for case 3649 is \(\hat{y}_{lime} = 93.333 - 31.264 \cdot \mathbf{1}_{\{3.27 < \text{rating} \le 5.25\}} + 27.663 \cdot \mathbf{1}_{\{\text{certificate} = \text{PG}\}} + 16.098 \cdot \mathbf{1}_{\{\text{genre} = \text{Action}\}} - 58.298 \cdot \mathbf{1}_{\{\text{runtime} \le 232\}} - 2.133 \cdot \mathbf{1}_{\{\text{year} = 1996\}}\). In our original model, by contrast, the coefficient of certificatePG is 58.252; genreAction is the baseline level and the coefficients of most other genre levels are negative; the coefficient of year1996 is 13.818; the coefficient of rating is 14.169; and the coefficient of runtime is 0.387. Thus the local (linear) model differs from our original linear model quantitatively. However, from the LIME plot, the effects of certificate and genre agree in direction with the results from the original model, so in that respect the local model is consistent with the original linear model.
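An explanation table like the one above can be generated with the `lime` package along these lines (a sketch; the case index 3649 and the five features are taken from the table, and `lime` may require wrapping a plain `lm` fit, e.g. via `caret::train(..., method = "lm")`, before it recognizes the model):

```r
library(lime)
# Build an explainer on the training data, then explain a single case
# with a local model over five features.
explainer   <- lime(train_dat, fit)
explanation <- explain(test_dat[3649, , drop = FALSE], explainer,
                       n_features = 5)
plot_features(explanation)  # LIME feature plot
```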
## lime error original error
## 1 -42.78843 -16.20443
## 2 118.21543 102.77843
## 3 -28.09811 -26.50411
## 4 -13.22456 2.63944
The residual table gives the residuals of the local model and the original model on the four data points used in LIME. We can see that our ordinary linear model outperforms the local model. This may be related to the motivation for using LIME: we trade some accuracy for interpretability. More specifically, LIME generates perturbed (fake) data points around the observation we want to explain and uses the black-box model's predictions on those points to fit the local model. These two approximation steps may explain why the local model predicts even worse than the ordinary linear model does. However, a linear model is already highly interpretable; we would mainly need the assistance of LIME for less interpretable models.
3.3.2.1 Gower Distance

As the Gower distance increases, the variance of the difference between the paired predictions increases. This indicates that for very different observations our model sometimes makes very different predictions and sometimes makes similar ones, while for very similar observations it tends to consistently make similar predictions. From the figure, we can claim that for our linear model, if the Gower distance between two observations is less than 0.18, their predictions will be very similar (difference < 25).
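Gower distance for mixed numeric/factor observations can be computed with `cluster::daisy` (a sketch; to keep the pairwise matrices small we use only the first 200 test cases):

```r
library(cluster)
sub <- test_dat[1:200, c("year", "certificate", "runtime",
                         "genre", "rating")]
# Gower distance: per-feature distances scaled to [0, 1], then averaged.
gd    <- as.matrix(daisy(sub, metric = "gower"))
pred  <- predict(fit, newdata = test_dat[1:200, ])
pdiff <- abs(outer(pred, pred, "-"))  # pairwise prediction differences
plot(gd[lower.tri(gd)], pdiff[lower.tri(pdiff)],
     xlab = "Gower distance", ylab = "|Prediction difference|")
```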
3.3.3 Shapley Additive Explanations
For a linear model, the Shapley value of feature \(j\) for an observation \(x\) is simply \(\beta_j (x_j - \mathbb{E}[x_j])\), which can be read directly from the coefficients. Hence, we do not need this technique to help interpret the model.
3.3.4 Feature Importance

The feature importance plot shows the relative importance of each feature fed into our linear model. Surprisingly, the most important feature is certificate rather than rating. This differs from the conclusion drawn from the partial dependence plots, where rating appeared to be the most important feature because it caused the largest variation in the response variable.
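An importance plot of this kind can be obtained with the `vip` package (a sketch; for `lm` objects `vip` ranks terms by the absolute t-statistic by default, which need not match a PDP-based ranking):

```r
library(vip)
# One bar per model term; dummy-coded factor levels appear separately.
vip(fit, num_features = 5)
```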