Regression Analysis

Regression analysis “fits” or derives a model that describes the variation of a response (or “dependent”) variable as a function of one or more predictor (or “independent”) variables. The general regression model is one of several that share the same basic conceptual model:

data = systematic component + irregular component

where the systematic component is predictable or explainable by the predictor variables, and is represented by the regression model, while the irregular component is regarded as “noise” or prediction errors, i.e. variations in the response variable that cannot be accounted for by the predictor variables.

The regression equation

The specific bivariate, linear, regression model is

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where $Y_i$ is the dependent or response variable, $\beta_0$ and $\beta_1$ are the coefficients of the regression line (also known as the slope ($\beta_1$) and intercept ($\beta_0$)), $X_i$ is the independent or predictor variable, and $\varepsilon_i$ is the prediction error, or “noise” or “residual.” (Note that usually in descriptions of regression analysis, upper-case $X$’s and $Y$’s stand for “raw” data values, while lower-case $x$’s and $y$’s stand for deviations of $X$ and $Y$ about their respective means, i.e. $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$.)

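There are several alternative ways of writing the regression equation or model:

·     true model, no error: $Y_i = \beta_0 + \beta_1 X_i$,

·     true model, with error: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$,

·     true model, no subscripts: $Y = \beta_0 + \beta_1 X + \varepsilon$,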

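·     true model, with error: $Y = \alpha + \beta X + \varepsilon$ (where $\alpha$ and $\beta$ are the regression coefficients and $\varepsilon$ is the residual),

·     estimated model: $Y = a + bX + e$, where $a$ and $b$ are estimates of $\alpha$ and $\beta$, and $e$ is an estimate of $\varepsilon$,

·     estimated model: $Y_i = b_0 + b_1 X_i + e_i$, where $b_0$ and $b_1$ are estimates of $\beta_0$ and $\beta_1$, and $e_i$ is an estimate of $\varepsilon_i$.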

Other variables and quantities

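There are a number of other quantities that are important in the analysis, including:

·     the “fitted” or predicted values of the response variable, $\hat{Y}_i$ (called “y-hat”)

$$\hat{Y}_i = b_0 + b_1 X_i$$

·     the residuals or prediction errors

$$e_i = Y_i - \hat{Y}_i$$

·     the sums of squared deviations and their cross products

$$\sum x_i^2 = \sum (X_i - \bar{X})^2, \qquad \sum y_i^2 = \sum (Y_i - \bar{Y})^2, \qquad \sum x_i y_i = \sum (X_i - \bar{X})(Y_i - \bar{Y})$$

·     and the residual sum of squares

$$\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$$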
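As a rough illustration of the quantities listed above, the following Python sketch computes them for a small made-up data set. The data values, the coefficient values `b0` and `b1`, and the helper name `deviations` are assumptions introduced here for illustration only; they do not come from the text.

```python
def deviations(values):
    """Return the deviations of each value about the mean of the list."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical data and coefficients, chosen only for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99

# Fitted ("y-hat") values and residuals (observed minus fitted).
fitted = [b0 + b1 * Xi for Xi in X]
residuals = [Yi - Yh for Yi, Yh in zip(Y, fitted)]

# Sums of squared deviations and their cross products.
x = deviations(X)
y = deviations(Y)
sum_xx = sum(xi * xi for xi in x)
sum_yy = sum(yi * yi for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Residual sum of squares.
rss = sum(e * e for e in residuals)

print(sum_xx, sum_yy, sum_xy, rss)
```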

Fitting the regression equation (i.e. estimating parameters)

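The regression equation is “fitted” by choosing the values of $b_0$ and $b_1$ in such a way that the sum of squares of the prediction errors is minimized, i.e.

$$\min_{b_0,\,b_1} D, \qquad D = \sum e_i^2 = \sum (Y_i - b_0 - b_1 X_i)^2$$

The specific values of $b_0$ and $b_1$ that minimize $D$ could be found iteratively, or by trial and error, but it is known that the following “ordinary least-squares” (OLS) estimates of $\beta_0$ and $\beta_1$ do in fact minimize $D$:

$$b_1 = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$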
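The closed-form estimates can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function names `ols_fit` and `sum_sq_errors` and the data are hypothetical, and the slope is taken as the cross-product sum divided by the sum of squared deviations of the predictor, with the intercept following from the two means. The last lines compare the sum of squared errors at the OLS estimates with its value at slightly perturbed coefficients, which should never be smaller.

```python
def ols_fit(X, Y):
    """Return (b0, b1), the OLS intercept and slope for bivariate data."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    # Cross products and squared deviations about the means.
    sum_xy = sum((Xi - x_bar) * (Yi - y_bar) for Xi, Yi in zip(X, Y))
    sum_xx = sum((Xi - x_bar) ** 2 for Xi in X)
    b1 = sum_xy / sum_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_sq_errors(X, Y, b0, b1):
    """Sum of squared prediction errors for a candidate line."""
    return sum((Yi - b0 - b1 * Xi) ** 2 for Xi, Yi in zip(X, Y))

# Hypothetical data, chosen only for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(X, Y)
d_min = sum_sq_errors(X, Y, b0, b1)

# Perturbing either coefficient should not reduce the sum of squares.
d_shift_slope = sum_sq_errors(X, Y, b0, b1 + 0.01)
d_shift_intercept = sum_sq_errors(X, Y, b0 + 0.01, b1)
print(b0, b1, d_min, d_shift_slope, d_shift_intercept)
```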

Goodness-of-fit statistics

The “goodness of fit” of the regression equation, or a measure of the strength of the relationship between $X$ and $Y$, can be described in several ways. As in analysis of variance, the total variation of the dependent variable $Y$ can be decomposed into two components:

$$SS_{total} = SS_{reg} + SS_{resid}$$

where $SS_{total} = \sum (Y_i - \bar{Y})^2$ is the “total sum of squares” (of deviations of individual dependent variable values about the mean), $SS_{reg} = \sum (\hat{Y}_i - \bar{Y})^2$ is the “regression sum of squares,” or that component of the total sum of squares “explained” by the regression equation, and $SS_{resid}$ is the “residual sum of squares,” or the sum of squares of the residuals,

$$SS_{resid} = \sum e_i^2$$

An F-statistic that can be used to test the null hypothesis that there is no significant relationship between $X$ and $Y$ is

$$F = \frac{SS_{reg} / 1}{SS_{resid} / (n - 2)}$$

where $n$ is the number of observations. The denominator of this expression, $SS_{resid} / (n - 2)$, is also known as the “mean-square error of the regression,” and is sometimes represented by $s^2$. The square root of the mean-square error, $s$, is called the “standard error of the regression,” and provides a measure of uncertainty in the estimates of $Y$ produced by the regression equation.
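The decomposition and the F-statistic can be checked numerically. The sketch below is illustrative only: the function name `regression_sums_of_squares`, the variable names, and the data are assumptions made here, and the degrees of freedom (1 for the regression, n - 2 for the residuals) are those of the bivariate model.

```python
import math

def regression_sums_of_squares(X, Y):
    """Return (total, regression, residual) sums of squares for an OLS line."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    # OLS slope and intercept from deviations about the means.
    sum_xy = sum((Xi - x_bar) * (Yi - y_bar) for Xi, Yi in zip(X, Y))
    sum_xx = sum((Xi - x_bar) ** 2 for Xi in X)
    b1 = sum_xy / sum_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * Xi for Xi in X]
    ss_total = sum((Yi - y_bar) ** 2 for Yi in Y)
    ss_reg = sum((Yh - y_bar) ** 2 for Yh in fitted)
    ss_resid = sum((Yi - Yh) ** 2 for Yi, Yh in zip(Y, fitted))
    return ss_total, ss_reg, ss_resid

# Hypothetical data, chosen only for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_total, ss_reg, ss_resid = regression_sums_of_squares(X, Y)
n = len(X)

mse = ss_resid / (n - 2)        # mean-square error of the regression
f_stat = (ss_reg / 1) / mse     # F-statistic with 1 and n - 2 degrees of freedom
std_error = math.sqrt(mse)      # standard error of the regression

print(ss_total, ss_reg + ss_resid, f_stat, std_error)
```

The F value would then be compared with the F distribution on 1 and n - 2 degrees of freedom; that lookup is left out here to keep the sketch free of external libraries.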

Another measure of the strength of the relationship between the response and predictor variable is the “explained variance” (a proportion, but sometimes expressed as a percentage), also known as the “coefficient of determination,” or $R^2$:

$$R^2 = \frac{SS_{reg}}{SS_{total}} = 1 - \frac{SS_{resid}}{SS_{total}}$$
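A short sketch of the coefficient of determination follows, computed both as the explained share of the total sum of squares and as one minus the unexplained share. For an OLS fit the two should agree up to rounding. The function name `r_squared` and the data are hypothetical.

```python
def r_squared(X, Y):
    """Return R^2 computed two ways for an OLS line fitted to (X, Y)."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    # OLS slope and intercept from deviations about the means.
    sum_xy = sum((Xi - x_bar) * (Yi - y_bar) for Xi, Yi in zip(X, Y))
    sum_xx = sum((Xi - x_bar) ** 2 for Xi in X)
    b1 = sum_xy / sum_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * Xi for Xi in X]
    ss_total = sum((Yi - y_bar) ** 2 for Yi in Y)
    ss_reg = sum((Yh - y_bar) ** 2 for Yh in fitted)
    ss_resid = sum((Yi - Yh) ** 2 for Yi, Yh in zip(Y, fitted))
    # Explained share, and one minus the unexplained share.
    return ss_reg / ss_total, 1.0 - ss_resid / ss_total

# Hypothetical data, chosen only for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2_explained, r2_from_residuals = r_squared(X, Y)
print(r2_explained, r2_from_residuals)
```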

Significance of the regression coefficients

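There are a number of other quantities that are useful in interpreting a regression equation. These include standard errors for the slope and intercept:

$$se(b_1) = \frac{s}{\sqrt{\sum x_i^2}}, \qquad se(b_0) = s \sqrt{\frac{\sum X_i^2}{n \sum x_i^2}}$$

where $s$ is the standard error of the regression and $n$ is the number of observations.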
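The standard errors of the coefficients can be sketched as follows, assuming the usual bivariate OLS expressions: the standard error of the regression divided by the square root of the sum of squared deviations of the predictor (slope), and the same standard error scaled by the square root of the sum of squared raw predictor values over n times the sum of squared deviations (intercept). The function name `coefficient_standard_errors` and the data are hypothetical.

```python
import math

def coefficient_standard_errors(X, Y):
    """Return (se_b0, se_b1) for an OLS line fitted to (X, Y)."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    sum_xx = sum((Xi - x_bar) ** 2 for Xi in X)
    sum_xy = sum((Xi - x_bar) * (Yi - y_bar) for Xi, Yi in zip(X, Y))
    b1 = sum_xy / sum_xx
    b0 = y_bar - b1 * x_bar
    # Standard error of the regression: sqrt of residual SS over n - 2.
    ss_resid = sum((Yi - b0 - b1 * Xi) ** 2 for Xi, Yi in zip(X, Y))
    s = math.sqrt(ss_resid / (n - 2))
    se_b1 = s / math.sqrt(sum_xx)
    se_b0 = s * math.sqrt(sum(Xi ** 2 for Xi in X) / (n * sum_xx))
    return se_b0, se_b1

# Hypothetical data, chosen only for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

se_b0, se_b1 = coefficient_standard_errors(X, Y)
print(se_b0, se_b1)
```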

Using these standard errors, t-statistics that can be used to test hypotheses about the regression coefficients can be constructed:

$$t_{b_0} = \frac{b_0 - \beta_0^{*}}{se(b_0)}, \qquad t_{b_1} = \frac{b_1 - \beta_1^{*}}{se(b_1)}$$

where $\beta_0^{*}$ and $\beta_1^{*}$ are hypothesized values of the regression coefficients, which are usually taken to be 0, so that large absolute values of the t-statistics signal that $b_0$ and $b_1$ are significantly different from zero. The standard error or standard deviation of the predicted value of the response variable, $\hat{Y}$, given a particular value of the predictor variable, $X$, is