[Data] Crab Data: comes from a study of nesting horseshoe crabs (J. Brockmann, Ethology, 102: 1-21, 1996). Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. The response outcome for each female crab is her number of satellites. An explanatory variable thought possibly to affect this was the female crab’s shell width, which is a summary of her size.
[Goal] Fit the Poisson loglinear model with response = number of satellites and covariate = width.
[Specification]
The dist=poi
specifies the random component to be the Poisson.
The link=log
specifies the link function to be the log link. It is also possible to use other links, such as identity link link=identity
.
/*
if the crab.csv is under the same directory as the code,
just use "crab.csv", instead of "../data/crab.csv".
*/
proc import datafile='..\data\crab.csv' out=crab
dbms=csv replace;
run;
Proc Genmod data=crab;
Model Satellites=width / LRCI TYPE3 dist=poi link=log;
run;
The GENMOD Procedure
Model Information
Data Set WORK.CRAB
Distribution Poisson
Link Function Log
Dependent Variable Satellites
Number of Observations Read 173
Number of Observations Used 173
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 171 567.8786 3.3209
Scaled Deviance 171 567.8786 3.3209
Pearson Chi-Square 171 544.1570 3.1822
Scaled Pearson X2 171 544.1570 3.1822
Log Likelihood 68.4463
Full Log Likelihood -461.5881
AIC (smaller is better) 927.1762
AICC (smaller is better) 927.2468
BIC (smaller is better) 933.4828
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Likelihood Ratio
Standard 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square
Intercept 1 -3.3048 0.5422 -4.3662 -2.2407 37.14
Width 1 0.1640 0.0200 0.1247 0.2030 67.51
Scale 0 1.0000 0.0000 1.0000 1.0000
Analysis Of Maximum
Likelihood Parameter
Estimates
Parameter Pr > ChiSq
Intercept <.0001
Width <.0001
Scale
NOTE: The scale parameter was held fixed.
LR Statistics For Type 3 Analysis
Chi-
Source DF Square Pr > ChiSq
Width 1 64.91 <.0001
[Result]
Or \[ \hat\mu = \exp(-3.305+0.164x) \]
Since \(\exp(\hat\beta) = \exp(0.164)=1.18\), a 1 cm increase in width has an 18% increase in the estimated mean number of satellites. For instance, at \(x=26.3+1=27.3\), \(\hat\mu\) is equal to \(1.18(2.7) = 3.2\)
/*
if the crab.csv is under the same directory as the code,
just use "crab.csv", instead of "../data/crab.csv".
*/
proc import datafile='..\data\crab.csv' out=crab
dbms=csv replace;
run;
Proc Genmod data=crab;
Model Satellites=width / dist=poi link=identity;
run;
The GENMOD Procedure
Model Information
Data Set WORK.CRAB
Distribution Poisson
Link Function Identity
Dependent Variable Satellites
Number of Observations Read 173
Number of Observations Used 173
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 171 557.7083 3.2615
Scaled Deviance 171 557.7083 3.2615
Pearson Chi-Square 171 542.4854 3.1724
Scaled Pearson X2 171 542.4854 3.1724
Log Likelihood 73.5314
Full Log Likelihood -456.5030
AIC (smaller is better) 917.0060
AICC (smaller is better) 917.0766
BIC (smaller is better) 923.3126
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square
Intercept 1 -11.5321 1.5104 -14.4924 -8.5717 58.29
Width 1 0.5495 0.0593 0.4333 0.6657 85.89
Scale 0 1.0000 0.0000 1.0000 1.0000
Analysis Of Maximum
Likelihood Parameter
Estimates
Parameter Pr > ChiSq
Intercept <.0001
Width <.0001
Scale
NOTE: The scale parameter was held fixed.
[Result]
\(\hat\mu\) can be negative for small \(x\)
A 1 cm increase in width has predicted increase of \(\hat\beta=0.55\) in the expected number of satellites
At \(x=26.3\), the fitted value is \(\hat\mu=-11.51+0.55(26.3)=2.93\). At \(x=27.3\), it is 3.48
This model provides a simple description of the width effect: On average, a 2 cm increase in width corresponds to about an extra satellite
In model statement, add scale=Pearson
or scale=Deviance
/*
if the crab.csv is under the same directory as the code,
just use "crab.csv", instead of "../data/crab.csv".
*/
proc import datafile='..\data\crab.csv' out=crab
dbms=csv replace;
run;
Proc Genmod data=crab;
Model Satellites=width / dist=poi link=log scale=Pearson;
run;
The GENMOD Procedure
Model Information
Data Set WORK.CRAB
Distribution Poisson
Link Function Log
Dependent Variable Satellites
Number of Observations Read 173
Number of Observations Used 173
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 171 567.8786 3.3209
Scaled Deviance 171 178.4544 1.0436
Pearson Chi-Square 171 544.1570 3.1822
Scaled Pearson X2 171 171.0000 1.0000
Log Likelihood 21.5091
Full Log Likelihood -461.5881
AIC (smaller is better) 927.1762
AICC (smaller is better) 927.2468
BIC (smaller is better) 933.4828
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square
Intercept 1 -3.3048 0.9673 -5.2006 -1.4089 11.67
Width 1 0.1640 0.0356 0.0942 0.2339 21.22
Scale 0 1.7839 0.0000 1.7839 1.7839
Analysis Of Maximum
Likelihood Parameter
Estimates
Parameter Pr > ChiSq
Intercept 0.0006
Width <.0001
Scale
NOTE: The scale parameter was estimated by the square root of Pearson's
Chi-Square/DOF.
[Result]
Pearson Chi-Square
is \(\chi^2/df\) and the value is 3.1822. Under the pscale
option, the SE of the estimated coefficients will be rescaled by a multiple factor \(\sqrt{\chi^2/df} = \sqrt{3.1822} = 1.7839\).
The estimated cofficients remain the same.
Comparing this output with the previous Poisson regression:
“Scaled Deviance” and “Scaled Pearson X2” are different
Wald test statistic is recomputed
“Scale” estimate is no longer equal to 1
/*
if the crab.csv is under the same directory as the code,
just use "crab.csv", instead of "../data/crab.csv".
*/
proc import datafile='..\data\crab.csv' out=crab
dbms=csv replace;
run;
Proc Genmod data=crab;
Model Satellites=width / dist=negbin link=log;
run;
The GENMOD Procedure
Model Information
Data Set WORK.CRAB
Distribution Negative Binomial
Link Function Log
Dependent Variable Satellites
Number of Observations Read 173
Number of Observations Used 173
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 171 195.8112 1.1451
Scaled Deviance 171 195.8112 1.1451
Pearson Chi-Square 171 144.7507 0.8465
Scaled Pearson X2 171 144.7507 0.8465
Log Likelihood 154.3889
Full Log Likelihood -375.6455
AIC (smaller is better) 757.2910
AICC (smaller is better) 757.4330
BIC (smaller is better) 766.7508
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square
Intercept 1 -4.0525 1.2642 -6.5303 -1.5747 10.28
Width 1 0.1921 0.0476 0.0987 0.2854 16.27
Dispersion 1 1.1055 0.1971 0.7795 1.5679
Analysis Of Maximum
Likelihood Parameter
Estimates
Parameter Pr > ChiSq
Intercept 0.0013
Width <.0001
Dispersion
NOTE: The negative binomial dispersion parameter was estimated by maximum
likelihood.
[Result]
The negative binomial GLM is \[ \log(\hat\mu) = -4.0525+0.1921 \cdot width \] with \(SE(\hat\beta) = 0.0476\). For instance, at \(width=26.3 cm\), the mean number of satellites is \[ \exp(-4.0525+0.1921\times 26.3) = 2.718 \]
At a predicted \(\hat \mu\), the estimated variance of response is \[ \hat\mu+\hat D\hat\mu^2 = \hat\mu + 1.1055\hat\mu^2 \]
The 95% Wald confidence interval for the effect of width (\(\beta\)) is \[ 0.1921 \pm 1.96(0.0476) = (0.0987, 0.2854) \] It is much wider than that for the Poisson GLM, which is \[ 0.164 \pm 1.96(0.020) = (0.125, 0.203) \]
[Data] TrainAccident: number of two types of train-related accidents in the UK between 1975 and 2003.
Accidents involving trains alone (collisions, derailments, and overruns)
Collisions between trains and road vehicles. This is the type of accident to be modeled. Denote it by \(Y\).
The data also provide the annual number of train-kilometres, which is a measure of railway activity indicating the millions of kilometres travelled by trains during the year. Denote it by \(t\).
[Goal] Fit the Poisson loglinear model with offset
[Specification] Use offset
option in the model statement.
data accident;
input year Train_km collision1 collision2 @@;
log_trainkm = log(train_km);
x = year - 1975;
datalines;
2003 518 0 3 1988 443 2 4
2002 516 1 3 1987 397 1 6
2001 508 0 4 1986 414 2 13
2000 503 1 3 1985 418 0 5
1999 505 1 2 1984 389 5 3
1998 487 0 4 1983 401 2 7
1997 463 1 1 1982 372 2 3
1996 437 2 2 1981 417 2 2
1995 423 1 2 1980 430 2 2
1994 415 2 4 1979 426 3 3
1993 425 0 4 1978 430 2 4
1992 430 1 4 1977 425 1 8
1991 439 2 6 1976 426 2 12
1990 431 1 2 1975 436 5 2
1989 436 4 4
;
/* x is the number of years since 1975*/
proc genmod;
model collision2 = x / dist=poisson link=log offset=log_trainkm;
run;
The GENMOD Procedure
Model Information
Data Set WORK.ACCIDENT
Distribution Poisson
Link Function Log
Dependent Variable collision2
Offset Variable log_trainkm
Number of Observations Read 29
Number of Observations Used 29
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 27 37.8528 1.4020
Scaled Deviance 27 37.8528 1.4020
Pearson Chi-Square 27 42.1918 1.5627
Scaled Pearson X2 27 42.1918 1.5627
Log Likelihood 55.8825
Full Log Likelihood -64.7596
AIC (smaller is better) 133.5192
AICC (smaller is better) 133.9808
BIC (smaller is better) 136.2538
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square
Intercept 1 -4.2114 0.1589 -4.5229 -3.8999 702.27
x 1 -0.0329 0.0108 -0.0540 -0.0118 9.36
Scale 0 1.0000 0.0000 1.0000 1.0000
Analysis Of Maximum
Likelihood Parameter
Estimates
Parameter Pr > ChiSq
Intercept <.0001
x 0.0022
Scale
NOTE: The scale parameter was held fixed.
[Result]
The fitted model is \[ \log(\hat\mu) = \log(t) - 4.2114 -0.0329 x \] with \(SE(\hat\beta) = 0.0108\)
In 1975 (\(x=0\)), the estimated rate of train accidents is \[ (0.0148)(0.968)^0 = 0.0148 \]
In 2003 (\(x=28\)), the estimated rate of train accidents is \[ (0.0148)(0.968)^{28} = 0.0059 \text{ per million kilometers} \]
data accident;
input year Train_km collision1 collision2 @@;
log_trainkm = log(train_km);
x = year - 1975;
datalines;
2003 518 0 3 1988 443 2 4
2002 516 1 3 1987 397 1 6
2001 508 0 4 1986 414 2 13
2000 503 1 3 1985 418 0 5
1999 505 1 2 1984 389 5 3
1998 487 0 4 1983 401 2 7
1997 463 1 1 1982 372 2 3
1996 437 2 2 1981 417 2 2
1995 423 1 2 1980 430 2 2
1994 415 2 4 1979 426 3 3
1993 425 0 4 1978 430 2 4
1992 430 1 4 1977 425 1 8
1991 439 2 6 1976 426 2 12
1990 431 1 2 1975 436 5 2
1989 436 4 4
;
/* x is the number of years since 1975*/
proc genmod;
model collision2 = x / dist=negbin link=log offset=log_trainkm;
run;
The GENMOD Procedure
Model Information
Data Set WORK.ACCIDENT
Distribution Negative Binomial
Link Function Log
Dependent Variable collision2
Offset Variable log_trainkm
Number of Observations Read 29
Number of Observations Used 29
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 27 25.2639 0.9357
Scaled Deviance 27 25.2639 0.9357
Pearson Chi-Square 27 28.6415 1.0608
Scaled Pearson X2 27 28.6415 1.0608
Log Likelihood 57.2972
Full Log Likelihood -63.3450
AIC (smaller is better) 132.6899
AICC (smaller is better) 133.6499
BIC (smaller is better) 136.7918
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square
Intercept 1 -4.2000 0.1981 -4.5883 -3.8117 449.48
x 1 -0.0337 0.0130 -0.0592 -0.0081 6.66
Dispersion 1 0.0988 0.0782 0.0210 0.4660
Analysis Of Maximum
Likelihood Parameter
Estimates
Parameter Pr > ChiSq
Intercept <.0001
x 0.0099
Dispersion
NOTE: The negative binomial dispersion parameter was estimated by maximum
likelihood.
[Result]
The fitted negative binomial regression is \[ \log(\hat\mu) -\log(t) = -4.20-0.0337 x \] which is similar to the Poisson regression
There is some overdispersion because \(\hat D=0.0988\) (\(SE=0.0782\))