[Data] Crab Data: comes from a study of nesting horseshoe crabs (J. Brockmann, Ethology, 102: 1-21, 1996). Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. The response outcome for each female crab is her number of satellites. An explanatory variable thought possibly to affect this was the female crab’s shell width, which is a summary of her size.
# if the crab.csv is under the same directory as the code,
# just use "crab.csv", instead of "../data/crab.csv".
crab=read.csv("../data/crab.csv",header=TRUE)
colnames(crab)
## [1] "Obs" "Color" "Spine" "Width" "Weight"
## [6] "Satellites"
First, we group the records using the interval of the width variable:<= 23.25 <=24.25 <=25.25 <=26.25 <=27.25 <=28.25 <=29.25 > 29.25. For each interval, we obtain the mean number of satellites.
upper.endpoint = c(23.25, 24.25,25.25,26.25,27.25,28.25,29.25,(max(crab$Width)+0.1))
width.interval =cut(crab$Width, c(min(crab$Width)-0.1,upper.endpoint))
crab = cbind(crab,width.interval)
# run the command if you have not installed this software:install.packages("plyr")
library(plyr)
mean.x.byinterval = ddply(crab,~width.interval,summarise,mean=mean(Width))$mean
mean.y.byinterval = ddply(crab,~width.interval,summarise,mean=mean(Satellites))$mean
# plot the mean number against the upper point of the intevals
par(mfrow=c(1,2))
plot(mean.x.byinterval,mean.y.byinterval,ylab="Mean number of Satellites")
plot(mean.x.byinterval,log(mean.y.byinterval),ylab="Mean number of Satellites")
[Goal] Fit the Poisson loglinear model with response = number of satellites and covariate = width
poisson.M1 = glm(Satellites ~ Width, data=crab, family=poisson(link="log"))
summary(poisson.M1)
##
## Call:
## glm(formula = Satellites ~ Width, family = poisson(link = "log"),
## data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## Width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
# add other predictors
poisson.M2 = glm(Satellites ~ Width + Weight, data=crab, family=poisson(link="log"))
summary(poisson.M2)
##
## Call:
## glm(formula = Satellites ~ Width + Weight, family = poisson(link = "log"),
## data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9308 -1.9705 -0.5481 0.9700 4.9905
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.29168 0.89929 -1.436 0.15091
## Width 0.04590 0.04677 0.981 0.32640
## Weight 0.44744 0.15864 2.820 0.00479 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.89 on 170 degrees of freedom
## AIC: 921.18
##
## Number of Fisher Scoring iterations: 6
# compare the model 1 and model 2 using the likelihood ratio test
anova(poisson.M1, poisson.M2, test="LRT")
## Analysis of Deviance Table
##
## Model 1: Satellites ~ Width
## Model 2: Satellites ~ Width + Weight
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 170 559.89 1 7.9934 0.004695 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R encounters some difficulties with identity link link="identity"
.
poisson.M3 = glm(Satellites ~ Width, data=crab, family=poisson(link="identity"))
## Warning in log(y/mu): NaNs produced
## Error: no valid set of coefficients has been found: please supply starting values
Use family="quasipoisson"
instead of family="poisson"
to estimate the dispersion parameter.
poisson.M4 = glm(Satellites ~ Width, data=crab, family=quasipoisson(link="log"))
summary(poisson.M4)
##
## Call:
## glm(formula = Satellites ~ Width, family = quasipoisson(link = "log"),
## data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.30476 0.96729 -3.417 0.000793 ***
## Width 0.16405 0.03562 4.606 7.99e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.182205)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
library(MASS)
NB.M1 = glm.nb(Satellites ~ Width, data=crab, link="log")
summary(NB.M1)
##
## Call:
## glm.nb(formula = Satellites ~ Width, data = crab, link = "log",
## init.theta = 0.90456808)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7798 -1.4110 -0.2502 0.4770 2.0177
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.05251 1.17143 -3.459 0.000541 ***
## Width 0.19207 0.04406 4.360 1.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)
##
## Null deviance: 213.05 on 172 degrees of freedom
## Residual deviance: 195.81 on 171 degrees of freedom
## AIC: 757.29
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.905
## Std. Err.: 0.161
##
## 2 x log-likelihood: -751.291
Note that the dispersion parameter from R (i.e., 0.9046) is actually the estimate of \(k\), the inverse of that from SAS.
accident=read.csv("../data/trainAccident.csv",header=TRUE)
accident$x = accident$year - 1975
accident = as.data.frame(accident)
head(accident) # to examine the data input.
## year train.km collision1 collision2 x
## 1 2003 518 0 3 28
## 2 2002 516 1 3 27
## 3 2001 508 0 4 26
## 4 2000 503 1 3 25
## 5 1999 505 1 2 24
## 6 1998 487 0 4 23
accident.fit = glm(collision2 ~ x + offset(log(train.km)), data=accident,
family=poisson(link="log") )
summary(accident.fit)
##
## Call:
## glm(formula = collision2 ~ x + offset(log(train.km)), family = poisson(link = "log"),
## data = accident)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0580 -0.7825 -0.0826 0.3775 3.3873
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.21142 0.15892 -26.50 < 2e-16 ***
## x -0.03292 0.01076 -3.06 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 47.376 on 28 degrees of freedom
## Residual deviance: 37.853 on 27 degrees of freedom
## AIC: 133.52
##
## Number of Fisher Scoring iterations: 5
library(MASS)
accident.nb = glm.nb(collision2 ~ x + offset(log(train.km)), data=accident, link="log")
summary(accident.nb)
##
## Call:
## glm.nb(formula = collision2 ~ x + offset(log(train.km)), data = accident,
## link = "log", init.theta = 10.11828724)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.72370 -0.65461 -0.05868 0.32984 2.64065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.19999 0.19584 -21.446 < 2e-16 ***
## x -0.03367 0.01288 -2.615 0.00893 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(10.1183) family taken to be 1)
##
## Null deviance: 32.045 on 28 degrees of freedom
## Residual deviance: 25.264 on 27 degrees of freedom
## AIC: 132.69
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 10.12
## Std. Err.: 8.00
##
## 2 x log-likelihood: -126.69
Again, note that the dispersion parameter from R is the inverse of that from SAS.