Poisson Model

Plot
Poisson model
- Log link
- Identity link
Poisson model with Overdispersion
- Quasi-likelihood approach
- Negative Binomial Regression
Offset: Rate Data
- Poisson MOdel with Offset
- Negative Binomial Regression with offset

[Data] Crab Data: comes from a study of nesting horseshoe crabs (J. Brockmann, Ethology, 102: 1-21, 1996). Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. The response outcome for each female crab is her number of satellites. An explanatory variable thought possibly to affect this was the female crab’s shell width, which is a summary of her size.

# if the crab.csv is under the same directory as the code, 
# just use "crab.csv", instead of "../data/crab.csv".

crab=read.csv("../data/crab.csv",header=TRUE)
colnames(crab)

## [1] "Obs"        "Color"      "Spine"      "Width"      "Weight"    
## [6] "Satellites"

Plot

First, we group the records using the interval of the width variable:<= 23.25 <=24.25 <=25.25 <=26.25 <=27.25 <=28.25 <=29.25 > 29.25. For each interval, we obtain the mean number of satellites.

upper.endpoint = c(23.25, 24.25,25.25,26.25,27.25,28.25,29.25,(max(crab$Width)+0.1))

width.interval =cut(crab$Width, c(min(crab$Width)-0.1,upper.endpoint))
crab = cbind(crab,width.interval)

# run the command if you have not installed this software:install.packages("plyr")
library(plyr) 
mean.x.byinterval = ddply(crab,~width.interval,summarise,mean=mean(Width))$mean
mean.y.byinterval = ddply(crab,~width.interval,summarise,mean=mean(Satellites))$mean

# plot the mean number against the upper point of the intevals
par(mfrow=c(1,2))
plot(mean.x.byinterval,mean.y.byinterval,ylab="Mean number of Satellites")
plot(mean.x.byinterval,log(mean.y.byinterval),ylab="Mean number of Satellites")

Poisson model

[Goal] Fit the Poisson loglinear model with response = number of satellites and covariate = width

Log link

poisson.M1 = glm(Satellites ~ Width, data=crab, family=poisson(link="log"))
summary(poisson.M1)

## 
## Call:
## glm(formula = Satellites ~ Width, family = poisson(link = "log"), 
##     data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## Width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

# add other predictors
poisson.M2 = glm(Satellites ~ Width + Weight, data=crab, family=poisson(link="log"))
summary(poisson.M2)

## 
## Call:
## glm(formula = Satellites ~ Width + Weight, family = poisson(link = "log"), 
##     data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9308  -1.9705  -0.5481   0.9700   4.9905  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.29168    0.89929  -1.436  0.15091   
## Width        0.04590    0.04677   0.981  0.32640   
## Weight       0.44744    0.15864   2.820  0.00479 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.89  on 170  degrees of freedom
## AIC: 921.18
## 
## Number of Fisher Scoring iterations: 6

# compare the model 1 and model 2 using the likelihood ratio test
anova(poisson.M1, poisson.M2, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: Satellites ~ Width
## Model 2: Satellites ~ Width + Weight
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       171     567.88                        
## 2       170     559.89  1   7.9934 0.004695 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Identity link

R encounters some difficulties with identity link link="identity".

poisson.M3 = glm(Satellites ~ Width, data=crab, family=poisson(link="identity"))

## Warning in log(y/mu): NaNs produced

## Error: no valid set of coefficients has been found: please supply starting values

Poisson model with Overdispersion

Quasi-likelihood approach

Use family="quasipoisson" instead of family="poisson" to estimate the dispersion parameter.

poisson.M4 = glm(Satellites ~ Width, data=crab, family=quasipoisson(link="log"))
summary(poisson.M4)

## 
## Call:
## glm(formula = Satellites ~ Width, family = quasipoisson(link = "log"), 
##     data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.30476    0.96729  -3.417 0.000793 ***
## Width        0.16405    0.03562   4.606 7.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.182205)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Negative Binomial Regression

library(MASS)
NB.M1 = glm.nb(Satellites ~ Width, data=crab, link="log")
summary(NB.M1)

## 
## Call:
## glm.nb(formula = Satellites ~ Width, data = crab, link = "log", 
##     init.theta = 0.90456808)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7798  -1.4110  -0.2502   0.4770   2.0177  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.05251    1.17143  -3.459 0.000541 ***
## Width        0.19207    0.04406   4.360  1.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)
## 
##     Null deviance: 213.05  on 172  degrees of freedom
## Residual deviance: 195.81  on 171  degrees of freedom
## AIC: 757.29
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.905 
##           Std. Err.:  0.161 
## 
##  2 x log-likelihood:  -751.291

Note that the dispersion parameter from R (i.e., 0.9046) is actually the estimate of \(k\), the inverse of that from SAS.

Offset: Rate Data

Poisson MOdel with Offset

accident=read.csv("../data/trainAccident.csv",header=TRUE)
accident$x = accident$year - 1975
accident = as.data.frame(accident)
head(accident)  # to examine the data input.

##   year train.km collision1 collision2  x
## 1 2003      518          0          3 28
## 2 2002      516          1          3 27
## 3 2001      508          0          4 26
## 4 2000      503          1          3 25
## 5 1999      505          1          2 24
## 6 1998      487          0          4 23

accident.fit = glm(collision2 ~ x + offset(log(train.km)), data=accident, 
                   family=poisson(link="log") )
summary(accident.fit)

## 
## Call:
## glm(formula = collision2 ~ x + offset(log(train.km)), family = poisson(link = "log"), 
##     data = accident)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0580  -0.7825  -0.0826   0.3775   3.3873  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.21142    0.15892  -26.50  < 2e-16 ***
## x           -0.03292    0.01076   -3.06  0.00222 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 47.376  on 28  degrees of freedom
## Residual deviance: 37.853  on 27  degrees of freedom
## AIC: 133.52
## 
## Number of Fisher Scoring iterations: 5

Negative Binomial Regression with offset

library(MASS)
accident.nb = glm.nb(collision2 ~ x + offset(log(train.km)), data=accident, link="log")
summary(accident.nb)

## 
## Call:
## glm.nb(formula = collision2 ~ x + offset(log(train.km)), data = accident, 
##     link = "log", init.theta = 10.11828724)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.72370  -0.65461  -0.05868   0.32984   2.64065  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.19999    0.19584 -21.446  < 2e-16 ***
## x           -0.03367    0.01288  -2.615  0.00893 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(10.1183) family taken to be 1)
## 
##     Null deviance: 32.045  on 28  degrees of freedom
## Residual deviance: 25.264  on 27  degrees of freedom
## AIC: 132.69
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  10.12 
##           Std. Err.:  8.00 
## 
##  2 x log-likelihood:  -126.69

Again, note that the dispersion parameter from R is the inverse of that from SAS.