Multiple Regression - University of Minnesota


Chapter 18: Multiple Regression

18.1 Introduction
In this chapter we extend the simple linear regression model to allow for any number of independent variables. We expect to build a model that fits the data better than the simple linear regression model.

We shall use computer printout to:
- Assess the model: How well does it fit the data? Is it useful? Are any required conditions violated?
- Employ the model: interpret the coefficients, make predictions using the prediction equation, and estimate the expected value of the dependent variable.

18.2 Model and Required Conditions
We allow for k independent variables to potentially be related to the dependent variable.

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

Here $y$ is the dependent variable, $x_1, \ldots, x_k$ are the independent variables, $\beta_0, \ldots, \beta_k$ are the coefficients, and $\varepsilon$ is the random error variable.

Multiple Regression for k = 2, Graphical Demonstration - I
The simple linear regression model allows for one independent variable, x: $y = \beta_0 + \beta_1 x + \varepsilon$.
The multiple linear regression model allows for more than one independent variable: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$.
Note how the straight line becomes a plane.
[Figure: the line $y = \beta_0 + \beta_1 x_1$ and the plane $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, drawn over the X1 and X2 axes.]

Multiple Regression for k = 2, Graphical Demonstration - II
Note how a parabola becomes a parabolic surface: $y = b_0 + b_1 x^2$ becomes $y = b_0 + b_1 x_1^2 + b_2 x_2$.
[Figure: the parabola and the parabolic surface, drawn over the X1 and X2 axes.]

Required conditions for the error variable
- The error $\varepsilon$ is normally distributed.
- The mean is equal to zero and the standard deviation $\sigma_\varepsilon$ is constant for all values of y.
- The errors are independent.

18.3 Estimating the Coefficients and Assessing the Model
The procedure used to perform regression analysis:
- Obtain the model coefficients and statistics using statistical software.
- Diagnose violations of required conditions. Try to remedy problems when identified.
- Assess the model fit using statistics obtained from the sample.
- If the model assessment indicates good fit to the data, use it to interpret the coefficients and generate predictions.

Estimating the Coefficients and Assessing the Model, Example
Example 18.1: Where to locate a new motor inn?
La Quinta Motor Inns is planning an expansion. Management wishes to predict which sites are likely to be profitable. Several areas where predictors of profitability can be identified are:
- Competition
- Market awareness
- Demand generators
- Demographics
- Physical quality

Profitability is measured by the operating margin (Margin). The predictors chosen for each area are:
- Competition: Rooms, the number of hotel/motel rooms within 3 miles of the site.
- Market awareness: Nearest, the distance to the nearest La Quinta inn.
- Customers: Office space and College enrollment.
- Community: Income, the median household income.
- Physical: Disttwn, the distance to downtown.

Data were collected from 100 randomly selected inns that belong to La Quinta, and the following suggested model was run (Xm18-01):

$Margin = \beta_0 + \beta_1 Rooms + \beta_2 Nearest + \beta_3 Office + \beta_4 College + \beta_5 Income + \beta_6 Disttwn + \varepsilon$

Margin  Number  Nearest  Office Space  Enrollment  Income  Distance
55.5    3203    4.2      549           8           37      2.7
33.8    2810    2.8      496           17.5        35      14.4
49      2890    2.4      254           20          35      2.6
31.9    3422    3.3      434           15.5        38      12.1
57.4    2687    0.9      678           15.5        42      6.9
49      3759    2.9      635           19          33      10.8

Regression Analysis, Excel Output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7246
R Square            0.5251
Adjusted R Square   0.4944
Standard Error      5.51
Observations        100

ANOVA
            df  SS      MS     F      Significance F
Regression  6   3123.8  520.6  17.14  0.0000
Residual    93  2825.6  30.4
Total       99  5949.5

              Coefficients  Standard Error  t Stat  P-value
Intercept     38.14         6.99            5.45    0.0000
Number        -0.0076       0.0013          -6.07   0.0000
Nearest       1.65          0.63            2.60    0.0108
Office Space  0.020         0.0034          5.80    0.0000
Enrollment    0.21          0.13            1.59    0.1159
Income        0.41          0.14            2.96    0.0039
Distance      -0.23         0.18            -1.26   0.2107

This is the sample regression equation (sometimes called the prediction equation):
Margin = 38.14 - 0.0076 Number + 1.65 Nearest + 0.020 Office Space + 0.21 Enrollment + 0.41 Income - 0.23 Distance

Model Assessment
The model is assessed using three tools:
- The standard error of estimate
- The coefficient of determination
- The F-test of the analysis of variance

The standard error of estimate participates in building the other tools.

Standard Error of Estimate
The standard deviation of the error is estimated by the standard error of estimate:
$s_\varepsilon = \sqrt{\dfrac{SSE}{n - k - 1}}$
The magnitude of $s_\varepsilon$ is judged by comparing it to $\bar{y}$.
From the printout, $s_\varepsilon = 5.51$. Calculating the mean value of y, we have $\bar{y} = 45.739$. It seems $s_\varepsilon$ is not particularly small. Question: Can we conclude the model does not fit the data well?

Coefficient of Determination
The definition is
$R^2 = 1 - \dfrac{SSE}{\sum (y_i - \bar{y})^2}$
From the printout, $R^2 = 0.5251$: 52.51% of the variation in operating margin is explained by the six independent variables, and 47.49% remains unexplained.
When adjusted for degrees of freedom,
Adjusted $R^2 = 1 - \dfrac{SSE/(n-k-1)}{SS(Total)/(n-1)} = 49.44\%$

Testing the Validity of the Model
We pose the question: Is there at least one independent variable linearly related to the dependent variable? To answer the question we test the hypotheses
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_1$: At least one $\beta_i$ is not equal to zero.
If at least one $\beta_i$ is not equal to zero, the model has some validity.
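The fit statistics on the La Quinta printout follow from the ANOVA sums of squares by simple arithmetic. The Python sketch below recomputes the standard error of estimate, R-square, adjusted R-square, and the F ratio (MSR/MSE) shown in the ANOVA table, using SSR = 3123.8, SSE = 2825.6, n = 100, and k = 6 from that output. The function name `anova_summary` and its dictionary keys are illustrative choices, not part of the original material.

```python
import math


def anova_summary(ssr, sse, n, k):
    """Derive fit statistics from the regression (SSR) and error (SSE) sums of squares.

    n is the number of observations and k the number of independent variables.
    """
    sst = ssr + sse                # total variation in y
    mse = sse / (n - k - 1)        # mean square for error
    msr = ssr / k                  # mean square for regression
    return {
        "s": math.sqrt(mse),                      # standard error of estimate
        "r2": 1 - sse / sst,                      # coefficient of determination
        "adj_r2": 1 - mse / (sst / (n - 1)),      # adjusted for degrees of freedom
        "f": msr / mse,                           # F statistic of the ANOVA table
    }


# Values taken from the La Quinta Excel printout.
stats = anova_summary(ssr=3123.8, sse=2825.6, n=100, k=6)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Small differences from the printed values are expected, because the sums of squares on the printout are themselves rounded.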

Testing the Validity of the La Quinta Inns Regression Model
The hypotheses are tested by an ANOVA procedure (the Excel output):

ANOVA
            df              SS            MS                         F                Significance F
Regression  k = 6           SSR = 3123.8  MSR = SSR/k = 520.6        MSR/MSE = 17.14  0.0000
Residual    n-k-1 = 93      SSE = 2825.6  MSE = SSE/(n-k-1) = 30.4
Total       n-1 = 99        5949.5

[Variation in y] = SSR + SSE. A large F results from a large SSR. Then much of the variation in y is explained by the regression model; the model is useful, and thus the null hypothesis should be rejected. The test statistic is
$F = \dfrac{SSR/k}{SSE/(n-k-1)}$
and the rejection region is $F > F_{\alpha,k,n-k-1}$.

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the $\beta_i$ is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.

$F_{\alpha,k,n-k-1} = F_{0.05,6,100-6-1} = 2.17$, and F = 17.14 > 2.17. Also, the p-value (Significance F) = 0.0000.

Reject the null hypothesis.

Interpreting the Coefficients
- b0 = 38.14. This is the intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.
- b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by 0.0076% (assuming the other variables are held constant).
- b2 = 1.65. In this model, for each additional mile that the nearest competitor is to a La Quinta inn, the operating margin increases on average by 1.65% when the other variables are held constant.
- b3 = 0.020. For each additional 1000 sq-ft of office space, the operating margin increases on average by 0.02% when the other variables are held constant.
- b4 = 0.21. For each additional thousand students, the operating margin increases on average by 0.21% when the other variables are held constant.
- b5 = 0.41. For each additional $1000 of median household income, the operating margin increases on average by 0.41% when the other variables are held constant.
- b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by 0.23% when the other variables are held constant.

Testing the Coefficients
The hypotheses for each $\beta_i$ are
$H_0: \beta_i = 0$
$H_1: \beta_i \neq 0$
Test statistic:
$t = \dfrac{b_i - \beta_i}{s_{b_i}}$, with d.f. = n - k - 1

Excel printout:
              Coefficients  Standard Error  t Stat  P-value
Intercept     38.14         6.99            5.45    0.0000
Number        -0.0076       0.0013          -6.07   0.0000
Nearest       1.65          0.63            2.60    0.0108
Office Space  0.020         0.0034          5.80    0.0000
Enrollment    0.21          0.13            1.59    0.1159
Income        0.41          0.14            2.96    0.0039
Distance      -0.23         0.18            -1.26   0.2107

Using the Linear Regression Equation
The model can be used for making predictions by
- producing a prediction interval estimate for a particular value of y, for given values of $x_i$;
- producing a confidence interval estimate for the expected value of y, for given values of $x_i$.
The model can also be used to learn about relationships between the independent variables $x_i$ and the dependent variable y, by interpreting the coefficients $\beta_i$.

La Quinta Inns, Predictions (Xm18-01)
Predict the average operating margin of an inn at a site with the following characteristics:
- 3815 rooms within 3 miles
- Closest competitor 0.9 miles away
- 476,000 sq-ft of office space
- 24,500 college students
- $35,000 median household income
- 11.2 miles distance to downtown center

MARGIN = 38.14 - 0.0076(3815) + 1.65(0.9) + 0.020(476) + 0.21(24.5) + 0.41(35) - 0.23(11.2) = 37.1%
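The point prediction for this site is a direct evaluation of the sample regression equation. A minimal Python sketch, assuming the rounded coefficients from the Excel printout, reproduces the arithmetic; the names `INTERCEPT`, `COEFFICIENTS`, and `predict_margin` are hypothetical and chosen only for this illustration.

```python
# Rounded estimates from the La Quinta Excel printout.
INTERCEPT = 38.14
COEFFICIENTS = {
    "Number": -0.0076,       # rooms within 3 miles
    "Nearest": 1.65,         # miles to the nearest competitor
    "Office Space": 0.020,   # thousands of sq-ft
    "Enrollment": 0.21,      # thousands of students
    "Income": 0.41,          # thousands of dollars
    "Distance": -0.23,       # miles to downtown
}


def predict_margin(site):
    """Evaluate the prediction equation for one site (a dict keyed by variable name)."""
    return INTERCEPT + sum(coef * site[name] for name, coef in COEFFICIENTS.items())


# The site described in the slide, in the same units as the data.
site = {
    "Number": 3815,
    "Nearest": 0.9,
    "Office Space": 476,
    "Enrollment": 24.5,
    "Income": 35,
    "Distance": 11.2,
}

predicted = predict_margin(site)
print(f"Predicted operating margin: {predicted:.1f}%")
```

This gives only the point prediction. The interval estimates require the full data set and are left to the software.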

La Quinta Inns, Predictions
Interval estimates by Excel (Data Analysis Plus):

Predicted value of Margin: 37.1

                                     Lower limit  Upper limit
Prediction Interval                  25.4         48.8
Interval Estimate of Expected Value  33.0         41.2

It is predicted, with 95% confidence, that the operating margin will lie between 25.4% and 48.8%. It is estimated that the average operating margin of all sites that fit this category falls between 33% and 41.2%.

The average inn would not be profitable (less than 50%).

Assessment and Interpretation: MBA Program Admission Policy
The dean of a large university wants to raise the admission standards to the popular MBA program. She plans to develop a method that can predict an applicant's performance in the program. She believes a student's success can be predicted by:
- Undergraduate GPA
- Graduate Management Admission Test (GMAT) score
- Number of years of work experience

A random sample of students who completed the MBA was selected (see file MBA).

MBA GPA  UnderGPA  GMAT  Work
8.43     10.89     584   9
6.58     10.38     483   7
8.15     10.39     484   4
8.88     10.73     646   6
...

Develop a plan to decide which applicant to admit.

Solution
The model to estimate is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
where y = MBA GPA, x1 = undergraduate GPA [UnderGPA], x2 = GMAT score [GMAT], and x3 = years of work experience [Work].
The estimated model:
MBA GPA = b0 + b1 UnderGPA + b2 GMAT + b3 Work

MBA Program Admission Policy: Model Diagnostics
We estimate the regression model, then we check the normality of the errors and the variance of the error variable.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.6808
R Square            0.4635
Adjusted R Square   0.4446
Standard Error      0.788
Observations        89

ANOVA
            df  SS     MS     F      Significance F
Regression  3   45.60  15.20  24.48  0.0000
Residual    85  52.77  0.62
Total       88  98.37

           Coefficients  Standard Error  t Stat  P-value
Intercept  0.466         1.506           0.31    0.7576
UnderGPA   0.063         0.120           0.52    0.6017
GMAT       0.011         0.001           8.16    0.0000
Work       0.093         0.031           3.00    0.0036

[Figure: histogram of the standardized residuals, used to check the normality of errors.]
[Figure: plot of the residuals, used to check the variance of the error variable.]

MBA Program Admission Policy: Model Assessment
- 46.35% of the variation in MBA GPA is explained by the model.
- The model is valid (p-value = 0.0000).
- GMAT score and years of work experience are linearly related to MBA GPA.
- There is insufficient evidence of a linear relationship between undergraduate GPA and MBA GPA.

18.4 Regression Diagnostics - II
The conditions required for the model assessment to apply must be checked.
- Is the error variable normally distributed? Draw a histogram of the residuals.
- Is the error variance constant? Plot the residuals versus $\hat{y}$.
- Are the errors independent? Plot the residuals versus the time periods.
- Can we identify outliers?
- Is multicollinearity (intercorrelation) a problem?

Diagnostics: Multicollinearity
Example 18.2: Predicting house price (Xm18-02)
A real estate agent believes that a house's selling price can be predicted using the house size, number of bedrooms, and lot size. A random sample of 100 houses was drawn and the data recorded.

Price   Bedrooms  H Size  Lot Size
124100  3         1290    3900
218300  4         2080    6600
117800  3         1250    3750
...

Analyze the relationship among the four variables.

The proposed model is
$PRICE = \beta_0 + \beta_1 BEDROOMS + \beta_2 HSIZE + \beta_3 LOTSIZE + \varepsilon$

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.7483
R Square            0.5600
Adjusted R Square   0.5462
Standard Error      25023
Observations        100

ANOVA
            df  SS            MS           F      Significance F
Regression  3   76501718347   25500572782  40.73  0.0000
Residual    96  60109046053   626135896
Total       99  136610764400

            Coefficients  Standard Error  t Stat  P-value
Intercept   37718         14177           2.66    0.0091
Bedrooms    2306          6994            0.33    0.7423
House Size  74.30         52.98           1.40    0.1640
Lot Size    -4.36         17.02           -0.26   0.7982

The model is valid, but no variable is significantly related to the selling price!

Diagnostics: Multicollinearity
Multicollinearity is found to be a problem. The correlation matrix is:

          Price   Bedrooms  H Size  Lot Size
Price     1
Bedrooms  0.6454  1
H Size    0.7478  0.8465    1
Lot Size  0.7409  0.8374    0.9936  1

Multicollinearity causes two kinds of difficulties:
- The t statistics appear to be too small.
- The coefficients cannot be interpreted as slopes.
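One informal way to screen for multicollinearity is to scan the correlations among the independent variables for large values. The Python sketch below does this for the three predictor pairs in the house-price correlation matrix. The 0.8 cutoff and the function name `flag_collinear_pairs` are assumptions made for illustration, not a rule stated in the source.

```python
# Correlations between pairs of independent variables, from the house-price example.
PREDICTOR_CORRELATIONS = {
    ("Bedrooms", "H Size"): 0.8465,
    ("Bedrooms", "Lot Size"): 0.8374,
    ("H Size", "Lot Size"): 0.9936,
}


def flag_collinear_pairs(correlations, threshold=0.8):
    """Return the predictor pairs whose absolute correlation reaches the threshold.

    The default threshold is an illustrative assumption, not a formal test.
    """
    return sorted(pair for pair, r in correlations.items() if abs(r) >= threshold)


flagged = flag_collinear_pairs(PREDICTOR_CORRELATIONS)
for first, second in flagged:
    r = PREDICTOR_CORRELATIONS[(first, second)]
    print(f"{first} and {second}: r = {r}")
```

With these values every pair is flagged, which is consistent with a valid model in which no individual t statistic is significant.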

Remedying Violations of the Required Conditions
Nonnormality or heteroscedasticity can be remedied using transformations on the y variable. The transformations can improve the linear relationship between the dependent variable and the independent variables. Many computer software systems allow us to make the transformations easily.

Transformations, Example: Reducing Nonnormality by Transformations
A brief list of transformations:
- $y' = \log y$ (for y > 0): use when $\sigma_\varepsilon$ increases with y, or when the error distribution is positively skewed.
- $y' = y^2$: use when $\sigma_\varepsilon^2$ is proportional to E(y), or when the error distribution is negatively skewed.
- $y' = y^{1/2}$ (for y > 0): use when $\sigma_\varepsilon^2$ is proportional to E(y).
- $y' = 1/y$: use when $\sigma_\varepsilon^2$ increases significantly when y increases beyond some critical value.
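The four transformations listed above are one-line functions applied to every value of the dependent variable before the regression is rerun. A minimal Python sketch follows; the names `TRANSFORMATIONS` and `transform` are hypothetical, the sample values are the first four operating margins from the La Quinta data, and the natural logarithm is assumed because the source does not state a base.

```python
import math

# Each entry maps a label to a transformation of a single y value.
TRANSFORMATIONS = {
    "log": math.log,                  # natural log (assumed base); requires y > 0
    "square": lambda y: y ** 2,
    "sqrt": math.sqrt,                # the slide restricts this to y > 0
    "reciprocal": lambda y: 1 / y,    # undefined at y = 0
}


def transform(values, name):
    """Apply the named transformation to every value of the dependent variable."""
    func = TRANSFORMATIONS[name]
    return [func(y) for y in values]


# Sample values: the first four operating margins from the La Quinta data.
margins = [55.5, 33.8, 49.0, 31.9]
for name in TRANSFORMATIONS:
    print(name, [round(v, 4) for v in transform(margins, name)])
```

After transforming y, the model is re-estimated with the transformed values as the dependent variable and the residual diagnostics are checked again.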

Durbin-Watson Test: Are the Errors Autocorrelated?
This test detects first-order autocorrelation between consecutive residuals in a time series. If autocorrelation exists, the error variables are not independent.
$d = \dfrac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
where $e_i$ is the residual at time i. The range of d is $0 \le d \le 4$.

Positive First Order Autocorrelation
Positive first-order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).
[Figure: residuals plotted against time.]

Negative First Order Autocorrelation

Negative first-order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).
[Figure: residuals plotted against time.]

One-Tail Test for Positive First Order Autocorrelation
- If d < dL, there is enough evidence to show that positive first-order correlation exists.
- If d > dU, there is not enough evidence to show that positive first-order correlation exists.
- If d is between dL and dU, the test is inconclusive.
[Scale: below dL, positive first-order correlation exists; between dL and dU, the test is inconclusive; above dU, positive first-order correlation does not exist.]

One-Tail Test for Negative First Order Autocorrelation
- If d > 4 - dL, negative first-order correlation exists.
- If d < 4 - dU, negative first-order correlation does not exist.
- If d falls between 4 - dU and 4 - dL, the test is inconclusive.
[Scale: below 4 - dU, negative first-order correlation does not exist; between 4 - dU and 4 - dL, the test is inconclusive; above 4 - dL, negative first-order correlation exists.]
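The two one-tail decision rules can be written as small functions of d and the table values dL and dU. The Python sketch below is one way to express them; the function names and return strings are hypothetical, and treating a value of d that lands exactly on a boundary as inconclusive is an assumption, since the slides do not cover that case. The sample inputs (d = 0.5931, dL = 1.10, dU = 1.54) are the values quoted in the ski-resort example of this chapter.

```python
def positive_autocorrelation_test(d, d_lower, d_upper):
    """One-tail Durbin-Watson decision for positive first-order autocorrelation."""
    if d < d_lower:
        return "exists"
    if d > d_upper:
        return "does not exist"
    return "inconclusive"  # d between dL and dU (boundaries included by assumption)


def negative_autocorrelation_test(d, d_lower, d_upper):
    """One-tail Durbin-Watson decision for negative first-order autocorrelation."""
    if d > 4 - d_lower:
        return "exists"
    if d < 4 - d_upper:
        return "does not exist"
    return "inconclusive"  # d between 4 - dU and 4 - dL (boundaries included by assumption)


# Values quoted in the ski-resort example (n = 20, k = 2).
d, d_lower, d_upper = 0.5931, 1.10, 1.54
print("Positive autocorrelation:", positive_autocorrelation_test(d, d_lower, d_upper))
print("Negative autocorrelation:", negative_autocorrelation_test(d, d_lower, d_upper))
```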

Two-Tail Test for First Order Autocorrelation
- If d < dL or d > 4 - dL, first-order autocorrelation exists.
- If d falls between dL and dU or between 4 - dU and 4 - dL, the test is inconclusive.
- If d falls between dU and 4 - dU, there is no evidence for first-order autocorrelation.
[Scale from 0 to 4: first-order correlation exists below dL and above 4 - dL; the test is inconclusive between dL and dU and between 4 - dU and 4 - dL; first-order correlation does not exist between dU and 4 - dU.]

Testing the Existence of Autocorrelation, Example
Example 18.3 (Xm18-03): How does the weather affect the sales of lift tickets in a ski resort? Data on ticket sales for the past 20 years, along with the total snowfall and the average temperature during Christmas week in each year, were collected. The model hypothesized was
$TICKETS = \beta_0 + \beta_1 SNOWFALL + \beta_2 TEMPERATURE + \varepsilon$
Regression analysis yielded the following results:

The Regression Equation, Assessment (I) (Xm18-03)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.3465
R Square            0.1200
Adjusted R Square   0.0165
Standard Error      1712
Observations        20

ANOVA
            df  SS        MS       F     Signif. F
Regression  2   6793798   3396899  1.16  0.3373
Residual    17  49807214  2929836
Total       19  56601012

           Coefficients  Standard Error  t Stat  P-value
Intercept  8308.0        903.73          9.19    0.0000
Snowfall   74.59         51.57           1.45    0.1663
Tempture   -8.75         19.70           -0.44   0.6625

The model seems to be very poor:
- R-square = 0.1200.
- It is not valid (Signif. F = 0.3373).
- No variable is linearly related to Sales.

Diagnostics: The Error Distribution
[Figure: histogram of the errors.]

The errors may be normally distributed.

Diagnostics: Heteroscedasticity
[Figure: residuals versus predicted y.]
It appears there is no problem of heteroscedasticity (the error variance seems to be constant).

Diagnostics: First Order Autocorrelation
[Figure: residuals over time.]
The errors are not independent!

Using the computer (Excel):
- Tools > Data Analysis > Regression (check the residual option and then OK)
- Tools > Data Analysis Plus > Durbin Watson Statistic > Highlight the range of the residuals from the regression run > OK

Durbin-Watson Statistic: d = 0.5931
The residuals: -2793.99, -1723.23, -2342.03, -956.955, -1963.73, ...

Test for positive first-order autocorrelation: n = 20, k = 2. From the Durbin-Watson table we have dL = 1.10 and dU = 1.54. The statistic is d = 0.5931.
Conclusion: Because d < dL, there is enough evidence to show that positive first-order autocorrelation exists.
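The Durbin-Watson statistic that Excel reports can be computed directly from the residuals listed in time order. The Python sketch below implements the definition of d, the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The function name `durbin_watson` and both residual series are made up for illustration; the full ski-resort residuals are not reproduced in the source, so the sketch does not attempt to recover d = 0.5931.

```python
def durbin_watson(residuals):
    """Durbin-Watson d for residuals listed in time order."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    sum_squares = sum(e ** 2 for e in residuals)
    if sum_squares == 0:
        raise ValueError("all residuals are zero; d is undefined")
    # Squared differences between each residual and the one before it.
    sum_squared_diffs = sum(
        (curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:])
    )
    return sum_squared_diffs / sum_squares


# Illustrative (made-up) series: runs of similar residuals push d toward 0,
# while residuals that alternate in sign push d toward 4.
similar = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
alternating = [1.0, -1.0, 1.0, -1.0]

print(f"similar residuals:     d = {durbin_watson(similar):.4f}")
print(f"alternating residuals: d = {durbin_watson(alternating):.4f}")
```

The resulting d is then compared with dL and dU from the Durbin-Watson table, as in the test for the ski-resort data.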
