Poisson Regression Overview

 

            Poisson regression is often used to analyze count data. It can be used to model the number of occurrences of an event of interest or the rate of occurrence of an event of interest, as a function of some independent variables. For example, rate of insurance claims, number of doctor visits, incidence of diseases, crime incidence, number of days a child is absent from school, colony counts for bacteria, can be modeled using Poisson regression.

            In Poisson regression it is assumed that the dependent variable Y,  number of  occurrences of an event, has a Poisson distribution given the independent variables X1, X2, ...., Xm,

 

            P(Y=k| x1, x2, ..., xm) = e- m mk / k!,            k=0, 1, 2, ......,

 

where the log of the mean m is assumed to be a linear function of the independent variables. That is,

 

log(m) = intercept + b1*X1 +b2*X2 + ....+ b3*Xm,

 

which implies that m is  the exponential function of independent variables,

 

            m = exp(intercept + b1*X1 +b2*X2 + ....+ b3*Xm).

 

In many situations the rate or incidence of  an event needs to be modeled instead of the number of occurrences. For example, suppose that we know the number of occurrences of certain disease by county and we want to find out if frequency of occurrence depends on certain demographic variables and health policy programs also recorded by county. Since more at risk subjects result in more occurrences of the disease, we need to adjust for the number of subjects at risk in each county. For such data, we can write a Poisson regression model  in the following form:

 

            log(m) = log(N) + intercept + b1*X1 +b2*X2 + ....+ b3*Xm,

 

where N is the total number of subjects at risk by county. The logarithm of variable N is used as an offset, that is, a regression variable with a constant coefficient of 1 for each observation. The log of  the incidence, log (m / N), is modeled now as a linear function of  independent variables.

            The maximum likelihood method is used to estimate the parameters of Poisson regression models.

            In SAS, the GENMOD procedure can fit Poisson regression models.

 

Fitting Poisson Regression in SAS

 

            We will use PROC GENMOD to fit Poisson regression models in two examples. In the first example, the number of occurrences of an event and  in the second example, the incidence, will be modeled.

 

Modeling number of occurrences of an event

 

The data used in this example was generated from a Poisson distribution using SAS function RANPOI. There are four variables, n_events, c1, x2, x3, and 240 subjects in the data set.  For each subject the number of occurrences of an event is saved in the variable called n_events. Variables c1 (nominal), x2 and x3 (ordinal with four categories) will be used as independent variables in the model. We can think of  this data analysis as, for example, modeling the number of books read by a student in 5th grade (n_events) as a function of student’s grade in math (x2), language arts (x3) and a reading incentive program (c1 with three categories, three different reading incentive programs). A table below shows 10 observations from the data set.

 

c1

x2

x3

n_events

1

2

1

9

1

3

3

13

1

3

4

22

1

4

4

12

2

2

3

12

2

2

4

13

2

3

2

17

2

3

3

11

2

4

4

12

3

3

4

22

 

The following sas program runs the Poisson regression analysis (in the program the data set is called one).

 

proc genmod data=one;

      class c1;

      model n_events = c1 x2 x3 / dist   = poisson

                                  link   = log type3;

contrast 'c1, 1 vs 2' c1 1 -1 0;

contrast 'c1, 1 vs 3' c1 1 0 -1;

contrast 'c1, 2 vs 3' c1 0 1 -1;

estimate 'c1=1, x2=4, x3=4' int 1

                            c1 1 0 0

                            x2 4

                            x3 4 /exp;

estimate 'c1=2, x2=4, x3=4' int 1

                            c1 0 1 0

                            x2 4

                            x3 4 /exp;

estimate 'c1=3, x2=4, x3=4' int 1

                            c1 0 0 1

                            x2 4

                            x3 4 /exp;

run;

 

Explanation of the statements and options used in the program

 

class c1; Specifies c1 as a classification variable. The remaining variables will be treated as continuous variables in the analysis.

 

dist   = poisson link   = log  Specifies Poisson distribution to be used in the model, that is, declares n_events to have a Poisson distribution,  and requests modeling log of the mean of n_events as a linear function of the independent variables.

 

type3 requests type 3 tests (overall tests for the effects in the model). By default, likelihood ratio tests are printed. If Wald tests are desired, change type3 to

type3 wald .

 

contrast 'c1, 1 vs 2' c1 1 -1 0; compares level 1 of class variable c1 with level 2. Contrast specification is the same as in proc glm;

 

estimate 'c1=1, x2=4, x3=4' int 1

                            c1 1 0 0

                            x2 4

                            x3 4 /exp; Estimates the mean number of events for subjects in category 1 of variable c1, with values 4 for x2 and 4 for x3. If exp option is not used then only the log of the mean number of events is estimated.

 

 

Interpretation of results

 

The following table contains information on assessment of fit.

 

Criteria For Assessing Goodness Of Fit

Criterion

DF

Value

Value/DF

Deviance

235

241.3872

1.0272

Scaled Deviance

235

241.3872

1.0272

Pearson Chi-Square

235

234.5383

0.9980

Scaled Pearson X2

235

234.5383

0.9980

Log Likelihood

 

4149.0588

 

 

 

The Deviance and Pearson Chi-Square have approximately Chi-square distribution with the number of degrees of freedom printed in column titled DF.  Both indicate adequate fit. Scaled Deviance and Scaled Pearson X2 are Deviance and Pearson Chi-Square, respectively, divided by the dispersion parameter. Here they are the same because for the Poisson distribution the dispersion parameter is 1. Deviance and Pearson Chi-Square divided by the degrees of freedom  are often used to detect overdispersion or underdispersion. For Poisson distribution the mean and the variance are equal, which implies that the deviance and the Pearson statistic divided by the degrees of freedom should be approximately one. Values greater than 1 indicate overdispersion, that is, the true variance is bigger than the mean, values smaller than 1 indicate underdispersion, the true variance is smaller than the mean.  Evidence of underdispersion or overdispersion indicates inadequate fit of the Poisson  model. Corrective measures include using the deviance or Pearson Chi-Square divided by degrees of freedom as an estimate of the dispersion parameter instead of setting it to 1 (options DSCALE and PSCALE in the MODEL statement) or, in the case of overdispersion, running the negative binomial regression instead of the Poisson regression (options DIST=NB instead of DIST=POISSON in the MODEL statement).  For our example, the ratios are close to 1, so we can conclude that the fit of the Poisson model is adequate.

 

The following table presents the significance tests of the model parameters.

 

Analysis Of Parameter Estimates

Parameter

 

DF

Estimate

Standard Error

Wald 95% Confidence Limits

Chi-Square

Pr > ChiSq

Intercept

 

1

1.4096

0.0749

1.2628

1.5563

354.51

<.0001

c1

1

1

0.3438

0.0465

0.2528

0.4349

54.79

<.0001

c1

2

1

0.0893

0.0492

-0.0070

0.1857

3.30

0.0693

c1

3

0

0.0000

0.0000

0.0000

0.0000

.

.

x2

 

1

0.1405

0.0171

0.1069

0.1740

67.30

<.0001

x3

 

1

0.1981

0.0173

0.1642

0.2319

131.65

<.0001

Scale

 

0

1.0000

0.0000

1.0000

1.0000

 

 

 

The first variable, C1, is a class variable. SAS orders categories alphabetically or, if numeric, from smallest to largest, and takes the last one as a reference category. Positive parameters for C1=1 and C2=2 mean that the mean number of events (books read) is higher for subjects in program 1 and 2 then in program 3. The results of the Wald Chi-Square tests indicate that there is a statistically significant difference between category 1 of C1 and category 3 of C1 (Chi-Square=54.79, p-value <0.0001), but the difference between program 2 and 3 (category 2 of C1 and category 3 of C1) is not significant (Chi-Square=3.30, p-value=0.0693). Positive parameters for variables X2 and X3 indicate that the mean number of events increases when the value of X2 or X3 increases. That is, the average number of books read (events) is bigger for students with higher grades in math or language arts.

            The parameter estimates given in the table can be used to predict the mean number of events (books read) for different categories of C1 and different values of  X2 and X3. For example, for C1=1,  X2=4 and X3=4,

 

predicted n_events =exp(1.4096 +0.3438 +0.1405*4 + 0.1981*4)

 

and for C1=3,  X2=4 and X3=4,

 

predicted n_events =exp(1.4096 +0.1405*4 + 0.1981*4).

 

It is easy to see that the predicted number of events for subjects in category 1 of C1 divided by the predicted number of events for subjects in category 3 of C1 is exp(0.3438) = 1.41, that is, children in the reading program 1 read, on average,  41% more books than children in program 3. The ratio does not depend on the values of X2 and X3 because there is no interaction term between C1 and these variables in the model.

 

The next table gives the overall test of significance and is the result of option TYPE3 in the MODEL statement. The test for variables with just one degree of freedom are the same as in the table above.

 

LR Statistics For Type 3 Analysis

Source

DF

Chi-Square

Pr > ChiSq

c1

2

61.31

<.0001

x2

1

67.86

<.0001

x3

1

133.84

<.0001

 

The Contrast Estimate Results table is the result of the ESTIMATE statement. It gives the predicted values of  log of  mean number of events (first row for each label) and the predicted mean number of events (labeled Exp), their standard errors,  95% confidence intervals and the test of the null hypothesis that the mean number of events is 0.

 

Contrast Estimate Results

Label

Estimate

Standard Error

Alpha

Confidence Limits

Chi-Square

Pr > ChiSq

c1=1, x2=4, x3=4

3.1075

0.0434

0.05

3.0225

3.1924

5137.3

<.0001

Exp(c1=1, x2=4, x3=4)

22.3644

0.9696

0.05

20.5425

24.3479

 

 

c1=2, x2=4, x3=4

2.8530

0.0463

0.05

2.7623

2.9436

3805.0

<.0001

Exp(c1=2, x2=4, x3=4)

17.3389

0.8019

0.05

15.8363

18.9841

 

 

c1=3, x2=4, x3=4

2.7636

0.0474

0.05

2.6707

2.8565

3399.0

<.0001

Exp(c1=3, x2=4, x3=4)

15.8573

0.7517

0.05

14.4504

17.4012

 

 

 

As predicted from the model, the average number of events (books read) for children with grade 4 in math and language arts is 15.8573 for reading program 3, 17.3389 for reading program 2 and 22.3644 for reading program 1.

 

The last table is the result of the CONTRAST statement. It provides tests for all pairwise comparisons of  variable C1 levels. The tests are likelihood ratio tests as indicated in the Type column. To get the Wald tests, the WALD option in the contrast statement needs to be requested. Difference between program 1 and 2 as well as the difference between program 1 and 3 are significant (Chi-square 31.86, p-value <0.0001 and Chi-square 55.60, p-value <0.0001, respectively), but programs 2 and 3 are not significantly different (Chi-square 3.30, p-value 0.0691).

 

Contrast Results

Contrast

DF

Chi-Square

Pr > ChiSq

Type

c1, 1 vs 2

1

31.86

<.0001

LR

c1, 1 vs 3

1

55.60

<.0001

LR

c1, 2 vs 3

1

3.30

0.0691

LR

 

Since category 3 of C1 is the reference category, tests presented in the Analysis of Parameter Estimates table for c1=1  and c1=2 and tests in rows 2 and 3 of the above table are for the same hypotheses. Small differences in test statistics values are due the fact that the Analysis of Parameter Estimates table gives the Wald tests and the Contrast Results table, the likelihood ratio tests (option Wald was not used in the contrast statement).