Poisson
regression is often used to analyze count data. It can be used to model the
number of occurrences of an event of interest or the rate of occurrence of an event
of interest, as a function of some independent variables. For example, rate of
insurance claims, number of doctor visits, incidence of diseases, crime
incidence, number of days a child is absent from school, colony counts for
bacteria, can be modeled using Poisson regression.
In
Poisson regression it is assumed that the dependent variable Y, number of
occurrences of an event, has a Poisson distribution given the
independent variables X1, X2, ...., Xm,
P(Y=k|
x1, x2, ..., xm) = e- m mk / k!, k=0, 1, 2, ......,
where the log of the mean m is assumed to be a linear function of the
independent variables. That is,
log(m) = intercept + b1*X1 +b2*X2 + ....+ b3*Xm,
which implies that m is the exponential function of independent variables,
m = exp(intercept + b1*X1 +b2*X2 + ....+
b3*Xm).
In many situations the rate or incidence of an event needs to be modeled instead of the
number of occurrences. For example, suppose that we know the number of
occurrences of certain disease by county and we want to find out if frequency
of occurrence depends on certain demographic variables and health policy
programs also recorded by county. Since more at risk subjects result in more
occurrences of the disease, we need to adjust for the number of subjects at
risk in each county. For such data, we can write a Poisson regression
model in the following form:
log(m) = log(N) + intercept + b1*X1 +b2*X2 + ....+
b3*Xm,
where N is the total number of subjects at risk by
county. The logarithm of variable N is used as an offset, that is, a regression
variable with a constant coefficient of 1 for each observation. The log of the incidence, log (m / N), is modeled now as a linear function
of independent variables.
The
maximum likelihood method is used to estimate the parameters of Poisson
regression models.
In
SAS, the GENMOD procedure can fit Poisson regression models.
We
will use PROC GENMOD to fit Poisson regression models in two examples. In the
first example, the number of occurrences of an event and in the second example, the incidence, will
be modeled.
|
c1 |
x2 |
x3 |
n_events |
|
1 |
2 |
1 |
9 |
|
1 |
3 |
3 |
13 |
|
1 |
3 |
4 |
22 |
|
1 |
4 |
4 |
12 |
|
2 |
2 |
3 |
12 |
|
2 |
2 |
4 |
13 |
|
2 |
3 |
2 |
17 |
|
2 |
3 |
3 |
11 |
|
2 |
4 |
4 |
12 |
|
3 |
3 |
4 |
22 |
proc genmod data=one;
class c1;
model n_events = c1 x2 x3 / dist = poisson
link = log type3;
contrast 'c1, 1 vs 2' c1 1 -1 0;
contrast 'c1, 1 vs 3' c1 1 0 -1;
contrast 'c1, 2 vs 3' c1 0 1 -1;
estimate 'c1=1, x2=4,
x3=4'
int 1
c1 1 0 0
x2 4
x3 4 /exp;
estimate 'c1=2, x2=4,
x3=4'
int 1
c1 0 1 0
x2 4
x3 4 /exp;
estimate 'c1=3, x2=4,
x3=4'
int 1
c1 0 0 1
x2 4
x3 4 /exp;
run;
class c1; Specifies c1 as a
classification variable. The remaining variables will be treated as continuous
variables in the analysis.
dist = poisson link
= log Specifies
Poisson distribution to be used in the model, that is, declares n_events to
have a Poisson distribution, and
requests modeling log of the mean of n_events as a linear function of the
independent variables.
type3 requests type 3 tests (overall tests for the effects
in the model). By default, likelihood ratio tests are printed. If Wald tests
are desired, change type3 to
type3 wald .
contrast 'c1, 1 vs 2' c1 1 -1 0; compares
level 1 of class variable c1 with level 2. Contrast specification is the same
as in proc glm;
estimate 'c1=1, x2=4,
x3=4'
int 1
c1 1 0 0
x2 4
x3 4 /exp; Estimates the
mean number of events for subjects in category 1 of variable c1, with values 4
for x2 and 4 for x3. If exp option is not used
then only the log of the mean number of events is estimated.
The
following table contains information on assessment of fit.
|
Criteria For
Assessing Goodness Of Fit |
|||
|
Criterion |
DF |
Value |
Value/DF |
|
Deviance |
235 |
241.3872 |
1.0272 |
|
Scaled Deviance |
235 |
241.3872 |
1.0272 |
|
Pearson Chi-Square |
235 |
234.5383 |
0.9980 |
|
Scaled Pearson X2 |
235 |
234.5383 |
0.9980 |
|
Log Likelihood |
|
4149.0588 |
|
The Deviance and Pearson Chi-Square have approximately Chi-square distribution with the number of degrees of freedom printed in column titled DF. Both indicate adequate fit. Scaled Deviance and Scaled Pearson X2 are Deviance and Pearson Chi-Square, respectively, divided by the dispersion parameter. Here they are the same because for the Poisson distribution the dispersion parameter is 1. Deviance and Pearson Chi-Square divided by the degrees of freedom are often used to detect overdispersion or underdispersion. For Poisson distribution the mean and the variance are equal, which implies that the deviance and the Pearson statistic divided by the degrees of freedom should be approximately one. Values greater than 1 indicate overdispersion, that is, the true variance is bigger than the mean, values smaller than 1 indicate underdispersion, the true variance is smaller than the mean. Evidence of underdispersion or overdispersion indicates inadequate fit of the Poisson model. Corrective measures include using the deviance or Pearson Chi-Square divided by degrees of freedom as an estimate of the dispersion parameter instead of setting it to 1 (options DSCALE and PSCALE in the MODEL statement) or, in the case of overdispersion, running the negative binomial regression instead of the Poisson regression (options DIST=NB instead of DIST=POISSON in the MODEL statement). For our example, the ratios are close to 1, so we can conclude that the fit of the Poisson model is adequate.
The
following table presents the significance tests of the model parameters.
|
Analysis Of
Parameter Estimates |
||||||||
|
Parameter |
|
DF |
Estimate |
Standard Error |
Wald 95% Confidence
Limits |
Chi-Square |
Pr > ChiSq |
|
|
Intercept |
|
1 |
1.4096 |
0.0749 |
1.2628 |
1.5563 |
354.51 |
<.0001 |
|
c1 |
1 |
1 |
0.3438 |
0.0465 |
0.2528 |
0.4349 |
54.79 |
<.0001 |
|
c1 |
2 |
1 |
0.0893 |
0.0492 |
-0.0070 |
0.1857 |
3.30 |
0.0693 |
|
c1 |
3 |
0 |
0.0000 |
0.0000 |
0.0000 |
0.0000 |
. |
. |
|
x2 |
|
1 |
0.1405 |
0.0171 |
0.1069 |
0.1740 |
67.30 |
<.0001 |
|
x3 |
|
1 |
0.1981 |
0.0173 |
0.1642 |
0.2319 |
131.65 |
<.0001 |
|
Scale |
|
0 |
1.0000 |
0.0000 |
1.0000 |
1.0000 |
|
|
The first variable, C1, is a
class variable. SAS orders categories alphabetically or, if numeric, from
smallest to largest, and takes the last one as a reference category. Positive
parameters for C1=1 and C2=2 mean that the mean number of events (books read)
is higher for subjects in program 1 and 2 then in program 3. The results of the
Wald Chi-Square tests indicate that there is a statistically significant
difference between category 1 of C1 and category 3 of C1 (Chi-Square=54.79,
p-value <0.0001), but the difference between program 2 and 3 (category 2 of
C1 and category 3 of C1) is not significant (Chi-Square=3.30, p-value=0.0693).
Positive parameters for variables X2 and X3 indicate that the mean number of
events increases when the value of X2 or X3 increases. That is, the average
number of books read (events) is bigger for students with higher grades in math
or language arts.
The parameter estimates given in the table can be used to
predict the mean number of events (books read) for different categories of C1
and different values of X2 and X3. For
example, for C1=1, X2=4 and X3=4,
predicted n_events =exp(1.4096
+0.3438 +0.1405*4 + 0.1981*4)
and for C1=3, X2=4 and X3=4,
predicted n_events
=exp(1.4096 +0.1405*4 + 0.1981*4).
It is easy to see that the predicted number of events for subjects in category 1 of C1 divided by the predicted number of events for subjects in category 3 of C1 is exp(0.3438) = 1.41, that is, children in the reading program 1 read, on average, 41% more books than children in program 3. The ratio does not depend on the values of X2 and X3 because there is no interaction term between C1 and these variables in the model.
The next table gives the overall test of significance and is the result of option TYPE3 in the MODEL statement. The test for variables with just one degree of freedom are the same as in the table above.
|
LR Statistics For
Type 3 Analysis |
|||
|
Source |
DF |
Chi-Square |
Pr > ChiSq |
|
c1 |
2 |
61.31 |
<.0001 |
|
x2 |
1 |
67.86 |
<.0001 |
|
x3 |
1 |
133.84 |
<.0001 |
The Contrast Estimate Results
table is the result of the ESTIMATE statement. It gives the predicted values
of log of mean number of events (first row for each label) and the
predicted mean number of events (labeled Exp), their standard errors, 95% confidence intervals and the test of the
null hypothesis that the mean number of events is 0.
|
Contrast Estimate
Results |
|||||||
|
Label |
Estimate |
Standard Error |
Alpha |
Confidence Limits |
Chi-Square |
Pr > ChiSq |
|
|
c1=1, x2=4, x3=4 |
3.1075 |
0.0434 |
0.05 |
3.0225 |
3.1924 |
5137.3 |
<.0001 |
|
Exp(c1=1, x2=4, x3=4) |
22.3644 |
0.9696 |
0.05 |
20.5425 |
24.3479 |
|
|
|
c1=2, x2=4, x3=4 |
2.8530 |
0.0463 |
0.05 |
2.7623 |
2.9436 |
3805.0 |
<.0001 |
|
Exp(c1=2, x2=4, x3=4) |
17.3389 |
0.8019 |
0.05 |
15.8363 |
18.9841 |
|
|
|
c1=3, x2=4, x3=4 |
2.7636 |
0.0474 |
0.05 |
2.6707 |
2.8565 |
3399.0 |
<.0001 |
|
Exp(c1=3, x2=4, x3=4) |
15.8573 |
0.7517 |
0.05 |
14.4504 |
17.4012 |
|
|
As predicted from the model, the average number of
events (books read) for children with grade 4 in math and language arts is
15.8573 for reading program 3, 17.3389 for reading program 2 and 22.3644 for
reading program 1.
The last table is the result of the CONTRAST statement. It provides tests for all pairwise comparisons of variable C1 levels. The tests are likelihood ratio tests as indicated in the Type column. To get the Wald tests, the WALD option in the contrast statement needs to be requested. Difference between program 1 and 2 as well as the difference between program 1 and 3 are significant (Chi-square 31.86, p-value <0.0001 and Chi-square 55.60, p-value <0.0001, respectively), but programs 2 and 3 are not significantly different (Chi-square 3.30, p-value 0.0691).
|
Contrast Results |
||||
|
Contrast |
DF |
Chi-Square |
Pr > ChiSq |
Type |
|
c1, 1 vs 2 |
1 |
31.86 |
<.0001 |
LR |
|
c1, 1 vs 3 |
1 |
55.60 |
<.0001 |
LR |
|
c1, 2 vs 3 |
1 |
3.30 |
0.0691 |
LR |
Since category 3 of C1 is
the reference category, tests presented in the Analysis of Parameter Estimates
table for c1=1 and c1=2 and tests in rows
2 and 3 of the above table are for the same hypotheses. Small differences in
test statistics values are due the fact that the Analysis of Parameter
Estimates table gives the Wald tests and the Contrast Results table, the
likelihood ratio tests (option Wald was not used in the contrast statement).