Count data: Poisson model

[Data] Crab Data: comes from a study of nesting horseshoe crabs (J. Brockmann, Ethology, 102: 1-21, 1996). Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. The response outcome for each female crab is her number of satellites. An explanatory variable thought possibly to affect this was the female crab’s shell width, which is a summary of her size.

[Goal] Fit the Poisson loglinear model with response = number of satellites and covariate = width.

[Specification]

Poisson model with Overdispersion

Quasi-likelihood approach

In model statement, add scale=Pearson or scale=Deviance

/*
  if the crab.csv is under the same directory as the code, 
  just use "crab.csv", instead of "../data/crab.csv".
*/
  
proc import datafile='..\data\crab.csv' out=crab
dbms=csv  replace;
run;

Proc Genmod data=crab; 
 Model Satellites=width / dist=poi link=log  scale=Pearson;
run;
                           The GENMOD Procedure

                            Model Information

                     Data Set               WORK.CRAB
                     Distribution             Poisson
                     Link Function                Log
                     Dependent Variable    Satellites

                  Number of Observations Read         173
                  Number of Observations Used         173

                  Criteria For Assessing Goodness Of Fit
 
     Criterion                     DF           Value        Value/DF

     Deviance                     171        567.8786          3.3209
     Scaled Deviance              171        178.4544          1.0436
     Pearson Chi-Square           171        544.1570          3.1822
     Scaled Pearson X2            171        171.0000          1.0000
     Log Likelihood                           21.5091                
     Full Log Likelihood                    -461.5881                
     AIC (smaller is better)                 927.1762                
     AICC (smaller is better)                927.2468                
     BIC (smaller is better)                 933.4828                


Algorithm converged.                                                       

            Analysis Of Maximum Likelihood Parameter Estimates
 
                              Standard   Wald 95% Confidence         Wald
  Parameter   DF   Estimate      Error          Limits         Chi-Square

  Intercept    1    -3.3048     0.9673    -5.2006    -1.4089        11.67
  Width        1     0.1640     0.0356     0.0942     0.2339        21.22
  Scale        0     1.7839     0.0000     1.7839     1.7839             

                           Analysis Of Maximum
                           Likelihood Parameter
                                Estimates
 
                          Parameter   Pr > ChiSq

                          Intercept       0.0006
                          Width           <.0001
                          Scale                 

NOTE: The scale parameter was estimated by the square root of Pearson's 
      Chi-Square/DOF.

[Result]

Pearson Chi-Square is \(\chi^2/df\) and the value is 3.1822. Under the pscale option, the SE of the estimated coefficients will be rescaled by a multiple factor \(\sqrt{\chi^2/df} = \sqrt{3.1822} = 1.7839\).

  • The Standard Error for width is 0.0356 (= \(1.7839 \times 0.02\)), here the 0.02 is the output from the Poisson model.
  • The estimated cofficients remain the same.

    Comparing this output with the previous Poisson regression: 
    • “Scaled Deviance” and “Scaled Pearson X2” are different

    • Wald test statistic is recomputed

    • “Scale” estimate is no longer equal to 1

Negative Binomial Regression

/*
  if the crab.csv is under the same directory as the code, 
  just use "crab.csv", instead of "../data/crab.csv".
*/
  
proc import datafile='..\data\crab.csv' out=crab
dbms=csv  replace;
run;

Proc Genmod data=crab; 
 Model Satellites=width / dist=negbin link=log;
run;
                           The GENMOD Procedure

                             Model Information

                  Data Set                      WORK.CRAB
                  Distribution          Negative Binomial
                  Link Function                       Log
                  Dependent Variable           Satellites

                  Number of Observations Read         173
                  Number of Observations Used         173

                  Criteria For Assessing Goodness Of Fit
 
     Criterion                     DF           Value        Value/DF

     Deviance                     171        195.8112          1.1451
     Scaled Deviance              171        195.8112          1.1451
     Pearson Chi-Square           171        144.7507          0.8465
     Scaled Pearson X2            171        144.7507          0.8465
     Log Likelihood                          154.3889                
     Full Log Likelihood                    -375.6455                
     AIC (smaller is better)                 757.2910                
     AICC (smaller is better)                757.4330                
     BIC (smaller is better)                 766.7508                


Algorithm converged.                                                       

            Analysis Of Maximum Likelihood Parameter Estimates
 
                              Standard   Wald 95% Confidence         Wald
 Parameter    DF   Estimate      Error          Limits         Chi-Square

 Intercept     1    -4.0525     1.2642    -6.5303    -1.5747        10.28
 Width         1     0.1921     0.0476     0.0987     0.2854        16.27
 Dispersion    1     1.1055     0.1971     0.7795     1.5679             

                            Analysis Of Maximum
                           Likelihood Parameter
                                 Estimates
 
                          Parameter    Pr > ChiSq

                          Intercept        0.0013
                          Width            <.0001
                          Dispersion             

NOTE: The negative binomial dispersion parameter was estimated by maximum 
      likelihood.

[Result]

  • The negative binomial GLM is \[ \log(\hat\mu) = -4.0525+0.1921 \cdot width \] with \(SE(\hat\beta) = 0.0476\). For instance, at \(width=26.3 cm\), the mean number of satellites is \[ \exp(-4.0525+0.1921\times 26.3) = 2.718 \]

  • At a predicted \(\hat \mu\), the estimated variance of response is \[ \hat\mu+\hat D\hat\mu^2 = \hat\mu + 1.1055\hat\mu^2 \]

  • The 95% Wald confidence interval for the effect of width (\(\beta\)) is \[ 0.1921 \pm 1.96(0.0476) = (0.0987, 0.2854) \] It is much wider than that for the Poisson GLM, which is \[ 0.164 \pm 1.96(0.020) = (0.125, 0.203) \]

Rate Data

[Data] TrainAccident: number of two types of train-related accidents in the UK between 1975 and 2003.

  1. Accidents involving trains alone (collisions, derailments, and overruns)

  2. Collisions between trains and road vehicles. This is the type of accident to be modeled. Denote it by \(Y\).

  3. The data also provide the annual number of train-kilometres, which is a measure of railway activity indicating the millions of kilometres travelled by trains during the year. Denote it by \(t\).

[Goal] Fit the Poisson loglinear model with offset

[Specification] Use offset option in the model statement.

Poisson Model with offset

            data accident;
            input year Train_km collision1 collision2 @@;
            log_trainkm = log(train_km);
            x = year - 1975;
            datalines;
            2003 518 0 3 1988 443 2 4
            2002 516 1 3 1987 397 1 6
            2001 508 0 4 1986 414 2 13
            2000 503 1 3 1985 418 0 5
            1999 505 1 2 1984 389 5 3
            1998 487 0 4 1983 401 2 7
            1997 463 1 1 1982 372 2 3
            1996 437 2 2 1981 417 2 2
            1995 423 1 2 1980 430 2 2
            1994 415 2 4 1979 426 3 3
            1993 425 0 4 1978 430 2 4
            1992 430 1 4 1977 425 1 8
            1991 439 2 6 1976 426 2 12
            1990 431 1 2 1975 436 5 2
            1989 436 4 4
            ;
            
            /* x is the number of years since 1975*/
            proc genmod;
            model collision2 = x / dist=poisson link=log offset=log_trainkm;
            run;
                           The GENMOD Procedure

                             Model Information

                    Data Set              WORK.ACCIDENT
                    Distribution                Poisson
                    Link Function                   Log
                    Dependent Variable       collision2
                    Offset Variable         log_trainkm

                  Number of Observations Read          29
                  Number of Observations Used          29

                  Criteria For Assessing Goodness Of Fit
 
     Criterion                     DF           Value        Value/DF

     Deviance                      27         37.8528          1.4020
     Scaled Deviance               27         37.8528          1.4020
     Pearson Chi-Square            27         42.1918          1.5627
     Scaled Pearson X2             27         42.1918          1.5627
     Log Likelihood                           55.8825                
     Full Log Likelihood                     -64.7596                
     AIC (smaller is better)                 133.5192                
     AICC (smaller is better)                133.9808                
     BIC (smaller is better)                 136.2538                


Algorithm converged.                                                       

            Analysis Of Maximum Likelihood Parameter Estimates
 
                              Standard   Wald 95% Confidence         Wald
  Parameter   DF   Estimate      Error          Limits         Chi-Square

  Intercept    1    -4.2114     0.1589    -4.5229    -3.8999       702.27
  x            1    -0.0329     0.0108    -0.0540    -0.0118         9.36
  Scale        0     1.0000     0.0000     1.0000     1.0000             

                           Analysis Of Maximum
                           Likelihood Parameter
                                Estimates
 
                          Parameter   Pr > ChiSq

                          Intercept       <.0001
                          x               0.0022
                          Scale                 

NOTE: The scale parameter was held fixed.

[Result]

  • The fitted model is \[ \log(\hat\mu) = \log(t) - 4.2114 -0.0329 x \] with \(SE(\hat\beta) = 0.0108\)

  • The estimated rate is \[\begin{eqnarray*} \hat\mu/t &=& \exp(- 4.2114 -0.0329 x)\\ &=& (0.0148)(0.968)^x \end{eqnarray*}\]
  • In 1975 (\(x=0\)), the estimated rate of train accidents is \[ (0.0148)(0.968)^0 = 0.0148 \]

  • In 2003 (\(x=28\)), the estimated rate of train accidents is \[ (0.0148)(0.968)^{28} = 0.0059 \text{ per million kilometers} \]

Negative Binomial Regression with offset

            data accident;
            input year Train_km collision1 collision2 @@;
            log_trainkm = log(train_km);
            x = year - 1975;
            datalines;
            2003 518 0 3 1988 443 2 4
            2002 516 1 3 1987 397 1 6
            2001 508 0 4 1986 414 2 13
            2000 503 1 3 1985 418 0 5
            1999 505 1 2 1984 389 5 3
            1998 487 0 4 1983 401 2 7
            1997 463 1 1 1982 372 2 3
            1996 437 2 2 1981 417 2 2
            1995 423 1 2 1980 430 2 2
            1994 415 2 4 1979 426 3 3
            1993 425 0 4 1978 430 2 4
            1992 430 1 4 1977 425 1 8
            1991 439 2 6 1976 426 2 12
            1990 431 1 2 1975 436 5 2
            1989 436 4 4
            ;
            
            /* x is the number of years since 1975*/
            proc genmod;
            model collision2 = x / dist=negbin link=log offset=log_trainkm;
            run;
                           The GENMOD Procedure

                             Model Information

                  Data Set                  WORK.ACCIDENT
                  Distribution          Negative Binomial
                  Link Function                       Log
                  Dependent Variable           collision2
                  Offset Variable             log_trainkm

                  Number of Observations Read          29
                  Number of Observations Used          29

                  Criteria For Assessing Goodness Of Fit
 
     Criterion                     DF           Value        Value/DF

     Deviance                      27         25.2639          0.9357
     Scaled Deviance               27         25.2639          0.9357
     Pearson Chi-Square            27         28.6415          1.0608
     Scaled Pearson X2             27         28.6415          1.0608
     Log Likelihood                           57.2972                
     Full Log Likelihood                     -63.3450                
     AIC (smaller is better)                 132.6899                
     AICC (smaller is better)                133.6499                
     BIC (smaller is better)                 136.7918                


Algorithm converged.                                                       

            Analysis Of Maximum Likelihood Parameter Estimates
 
                              Standard   Wald 95% Confidence         Wald
 Parameter    DF   Estimate      Error          Limits         Chi-Square

 Intercept     1    -4.2000     0.1981    -4.5883    -3.8117       449.48
 x             1    -0.0337     0.0130    -0.0592    -0.0081         6.66
 Dispersion    1     0.0988     0.0782     0.0210     0.4660             

                            Analysis Of Maximum
                           Likelihood Parameter
                                 Estimates
 
                          Parameter    Pr > ChiSq

                          Intercept        <.0001
                          x                0.0099
                          Dispersion             

NOTE: The negative binomial dispersion parameter was estimated by maximum 
      likelihood.

[Result]

  • The fitted negative binomial regression is \[ \log(\hat\mu) -\log(t) = -4.20-0.0337 x \] which is similar to the Poisson regression

  • There is some overdispersion because \(\hat D=0.0988\) (\(SE=0.0782\))