Negative Binomial; Testing for Overdispersion in Poisson Regression

 

Deviance and Pearson Chi-Square divided by the degrees of freedom are used to detect overdispersion or underdispersion in Poisson regression. Values greater than 1 indicate overdispersion, that is, the true variance is larger than the mean; values smaller than 1 indicate underdispersion, that is, the true variance is smaller than the mean. Evidence of underdispersion or overdispersion indicates inadequate fit of the Poisson model. We can test for overdispersion with a likelihood ratio test based on the Poisson and negative binomial distributions. This test checks the equality of the mean and the variance imposed by the Poisson distribution against the alternative that the variance exceeds the mean. For the negative binomial distribution, variance = mean + k*mean^2, where k >= 0; the negative binomial distribution reduces to the Poisson when k = 0. The null hypothesis is:

 

            H0  : k=0

 

and the alternative hypothesis is:

 

            Ha  : k>0.

 

To carry out the test, follow the steps below:

 

(i)                  Run the regression model using negative binomial distribution, record LL (log likelihood) value.

(ii)                Record LL for the Poisson model.

(iii)               Use the LR (likelihood ratio) test, that is, compute the LR statistic, -2(LL(Poisson) - LL(negative binomial)).

The asymptotic distribution of the LR statistic is a mixture: a probability mass of one half at zero and one half of a Chi-sq distribution with 1 df (see A. C. Cameron, P. K. Trivedi, Regression Analysis of Count Data, Cambridge University Press, 1998). To test the null hypothesis at significance level alpha, use the critical value of the Chi-sq distribution corresponding to significance level 2*alpha, that is, reject H0 if the LR statistic > Chi-sq(1 - 2*alpha, 1 df).
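The decision rule can be sketched as a short DATA step. This is a minimal illustration, assuming a significance level of alpha = 0.05 (a value chosen here for illustration, not taken from the text); the CINV function returns the Chi-sq quantile for a given probability and degrees of freedom.

```sas
data _null_;
      /* alpha is an illustrative assumption, not a value from the text */
      alpha = 0.05;
      /* critical value: Chi-sq quantile at 1 - 2*alpha with 1 df */
      crit = cinv(1 - 2*alpha, 1);
      put 'Reject H0: k=0 if the LR statistic exceeds ' crit 8.4;
run;
```

Because half of the null distribution sits at zero, the critical value is taken at level 2*alpha rather than alpha, which makes it smaller than the usual Chi-sq critical value with 1 df.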

 

Fitting Negative Binomial  Regression in SAS

 

Fitting negative binomial regression in SAS is very similar to fitting Poisson regression. We assume that  the model is the same as the one described in the section titled Poisson Regression Overview, that is, the log of the mean, m, is a linear function of independent variables,

 

log(m) = intercept + b1*X1 + b2*X2 + ... + bp*Xp,

which implies that m is the exponential function of a linear combination of the independent variables,

            m = exp(intercept + b1*X1 + b2*X2 + ... + bp*Xp).

 

Instead of assuming, as before, that the distribution of Y, the number of occurrences of an event, is Poisson, we will now assume that Y has a negative binomial distribution. In particular, this relaxes the assumption that the mean and variance are equal (a property of the Poisson distribution), since the variance of the negative binomial is equal to m + k*m^2, where k >= 0 is a dispersion parameter.

                The maximum likelihood method is used to estimate k as well as the  parameters of the regression model for log(m).
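To see how the dispersion parameter changes the variance, the following DATA step is a minimal sketch that tabulates m + k*m^2 for a few values of k. The mean m = 5 and the listed values of k are assumptions chosen for illustration, not values from the example data.

```sas
data nb_variance;
      m = 5;                         /* illustrative mean (assumption) */
      do k = 0, 0.5, 1, 2;           /* illustrative dispersion values (assumption) */
            variance = m + k*m**2;   /* negative binomial variance */
            output;
      end;
run;

proc print data=nb_variance noobs;
run;
```

For k = 0 the variance equals the mean, as under the Poisson distribution; for any k > 0 the variance exceeds the mean, and the excess grows with the square of the mean.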

 

            The SAS syntax for running negative binomial regression is almost the same as for Poisson regression. The only change is the dist option in the MODEL statement. Instead of  dist = poisson, dist = nb should be used.

 

            The following SAS program runs negative binomial regression. n_c is the number of occurrences of  a disease by region and age group and total is the total number of subjects at risk, also by region and age group.

 

options ls=78;

/* Disease counts (n_c) and subjects at risk (total) by region and age group */
data d;
      input region age n_c total;
      l_total = log(total);   /* offset: log of the number of subjects at risk */
      cards;
1 1  0 102250
1 2  2  89321
1 3  5  76345
2 1 26 256789
2 2  5 223962
2 3  0 199768
3 1  0  87954
3 2  0  90327
3 3  0  78213
4 1  0 123456
4 2  1 112657
4 3  0 108918
;
run;

*************************************;
*Poisson regression *****************;
*************************************;
proc genmod data=d;
      class region age;
      model n_c = region age / dist   = poisson
                               link   = log
                               offset = l_total type3 wald;
run;

*************************************;
*negative binomial regression *******;
*************************************;
proc genmod data=d;
      class region age;
      model n_c = region age / dist   = nb
                               link   = log
                               offset = l_total type3 wald;
run;

 

 

The first PROC GENMOD step runs the Poisson regression. The table below provides the fit statistics for that model.

 

Criteria For Assessing Goodness Of Fit

Criterion             DF       Value    Value/DF
Deviance               6     31.5023      5.2504
Scaled Deviance        6     31.5023      5.2504
Pearson Chi-Square     6     33.9957      5.6659
Scaled Pearson X2      6     33.9957      5.6659
Log Likelihood               47.4400

 

 

The model does not fit the data well. The values of the deviance and Pearson Chi-sq divided by the degrees of freedom (5.2504 and 5.6659) are considerably larger than 1, giving evidence of overdispersion. Next, the negative binomial regression is run (second PROC GENMOD) to see if it fits the data better. Below is the goodness-of-fit table.
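The dispersion ratios for the Poisson fit can be recomputed, and the two statistics referred to a Chi-sq distribution, with a short DATA step. This is a sketch only: it uses the values from the Poisson goodness-of-fit table, and the Chi-sq approximation is rough when many cell counts are small, as they are here.

```sas
data _null_;
      df       = 6;          /* degrees of freedom from the Poisson fit */
      deviance = 31.5023;    /* Deviance from the Poisson table */
      pearson  = 33.9957;    /* Pearson Chi-Square from the Poisson table */

      dev_ratio  = deviance / df;              /* Value/DF for the deviance */
      pear_ratio = pearson  / df;              /* Value/DF for Pearson Chi-Square */

      p_dev  = 1 - probchi(deviance, df);      /* upper-tail Chi-sq probability */
      p_pear = 1 - probchi(pearson,  df);

      put dev_ratio= 8.4 pear_ratio= 8.4;
      put p_dev= pvalue8.4 p_pear= pvalue8.4;
run;
```

Small upper-tail probabilities here point the same way as ratios well above 1: the Poisson model leaves more variation unexplained than it should.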

 

Criteria For Assessing Goodness Of Fit

Criterion             DF       Value    Value/DF
Deviance               6      8.8671      1.4779
Scaled Deviance        6      8.8671      1.4779
Pearson Chi-Square     6      5.0962      0.8494
Scaled Pearson X2      6      5.0962      0.8494
Log Likelihood               53.2509

 

 

 

To carry out the LR test for significance of  overdispersion, that is, to test the hypothesis

 

H0  : k=0

 

against the alternative hypothesis:

 

Ha  : k>0,

 

we need to compute -2(LL(Poisson) - LL(negative binomial)) (see Negative Binomial; testing for overdispersion). In the example, it is equal to -2(47.4400 - 53.2509) = 11.6218, which corresponds to p-value = 0.0003. Hence, we reject H0: k=0 and conclude that the variance exceeds the mean and that the Poisson distribution assumption has to be abandoned. The goodness-of-fit statistics for the negative binomial model, presented in the second table above, are used to assess its fit. The values of the deviance and Pearson Chi-sq divided by the degrees of freedom are close to 1 (1.4779 and 0.8494), and the deviance = 8.8671 and Pearson Chi-sq = 5.0962, each with 6 degrees of freedom for their approximate Chi-sq distribution, are not unusually large, so the fit is adequate.
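The LR test and the goodness-of-fit check for the negative binomial model can be reproduced with a DATA step. This is a minimal sketch that types in the log likelihoods and fit statistics from the two tables by hand; the p-value for the LR statistic is half the upper-tail Chi-sq probability with 1 df, because of the probability mass of one half at zero.

```sas
data _null_;
      ll_poisson = 47.4400;   /* Log Likelihood, Poisson model */
      ll_nb      = 53.2509;   /* Log Likelihood, negative binomial model */

      /* LR statistic for H0: k=0 against Ha: k>0 */
      lr = -2*(ll_poisson - ll_nb);

      /* half the upper-tail Chi-sq(1 df) probability (mixture distribution) */
      p_lr = 0.5*(1 - probchi(lr, 1));

      /* goodness of fit of the negative binomial model, 6 df each */
      p_dev  = 1 - probchi(8.8671, 6);   /* deviance */
      p_pear = 1 - probchi(5.0962, 6);   /* Pearson Chi-Square */

      put lr= 8.4 p_lr= 8.4;
      put p_dev= 8.4 p_pear= 8.4;
run;
```

The LR statistic and its p-value should agree with the 11.6218 and 0.0003 quoted above; large values of p_dev and p_pear are consistent with an adequate fit.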

 

The interpretation of the results is the same as in the case of Poisson regression (see interpretation of results in Modeling Number of Occurrences of an Event and interpretation of results in Modeling Incidence).

 

References

 

Cameron, A. C. and P. K. Trivedi, Regression Analysis of Count Data, Cambridge University Press, 1998.

SAS/STAT User’s Guide, Version 8, SAS Institute, 2000.

Stokes, M. E., C. S. Davis and G. G. Koch, Categorical Data Analysis Using the SAS System, SAS Institute, 1995.