---
title: "Nonexistence of estimates of Poisson models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Nonexistence of estimates of Poisson models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette is adapted from https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_benchmarks.md#r-packages.

## Example 1

```r
library(capybara)
```

The table below shows a dataset with 12 observations and four regressors. Observations 5 is separated
because $y=0$ and $z \neq x_2 - x_1$ is positive in those observations, and zero otherwise.

```r
correia2019$example1
which(correia2019$example1$x2 - correia2019$example1$x1 != 0 & correia2019$example1$y == 0)
```

```
# A tibble: 12 × 5
       y    x1    x2    x3    x4
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1     0     0     0     2    10
 2     0     0     0     0    -2
 3     0     0     0     0     6
 4     0     0     0     4     5
 5     0     1     0     0     3
 6     2     0     0     0     3
 7     2     0     0     0     4
 8     2     0     0    -2    15
 9     2     0     0     0    -7
10     4     0     0    -3    15
11     6    -3    -3     0     4
12     6     0     0     0     4
[1] 5
```

Base R shall not give a warning when estimating a Poisson model on this data.

```r
glm(
  y ~ x1 + x2 + x3 + x4,
  data = correia2019$example1,
  family = poisson()
)
```

```r
Call:  glm(formula = y ~ x1 + x2 + x3 + x4, family = poisson(), data = correia2019$example1)

Coefficients:
(Intercept)           x1           x2           x3           x4  
    0.59095    -17.78017     17.32952     -0.47085     -0.03779  

Degrees of Freedom: 11 Total (i.e. Null);  7 Residual
Null Deviance:      31.91 
Residual Deviance: 15.96        AIC: 46.99
```

Capybara will detect separation on this dataset.

```r
fepoisson(
  y ~ x1 + x2 + x3 + x4,
  data = correia2019$example1
)
```

```r
Separation found in 1 observation(s)
Formula: y ~ x1 + x2 + x3 + x4

Family: Poisson

Estimates:

|             | Estimate | Std. Error | z value | Pr(>|z|)  |
|-------------|----------|------------|---------|-----------|
| (Intercept) |   0.5910 |     0.3029 |  1.9510 | 0.0511 +  |
| x1          |  -0.4507 |     0.1648 | -2.7352 | 0.0062 ** |
| x2          |       NA |         NA |      NA | NA        |
| x3          |  -0.4708 |     0.2311 | -2.0367 | 0.0417 *  |
| x4          |  -0.0378 |     0.0438 | -0.8634 | 0.3879    |

Significance codes: ** p < 0.01; * p < 0.05; + p < 0.10

Pseudo R-squared: 0.4813 

Number of observations: Full 12; Separated 1; Perfect classification 0 

Number of Fisher Scoring iterations: 5
```

## Example 2

The table below shows a different dataset with 12 observations and four regressors. Observations 2, 3, 6, 7 and 8 are
separated because $y=0$ and $z > x_2 - x_1$ is positive in those observations, and zero otherwise.

```r
correia2019$example2
which(correia2019$example2$x2 - correia2019$example2$x1 > 0 & correia2019$example2$y == 0)
```

```r
# A tibble: 12 × 5
       y    x1    x2    x3    x4
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1     0    14     4     0   -12
 2     0     0    35    34    12
 3     0     0     3     0    17
 4     0     0     0     0     1
 5     0     0     0    -2     7
 6     0     0    25     0    12
 7     0     0    13     0    60
 8     0     0    15     0     7
 9     0     1     0     7   -24
10     9     0     0     0    18
11     4     2     0     6    -1
12     2     1     0     0    -7
[1] 2 3 6 7 8
```

Base R shall give a convergence warning when estimating a Poisson model on this data.

```r
glm(
  y ~ x1 + x2 + x3 + x4,
  data = correia2019$example2,
  family = poisson()
)
```

```
Call:  glm(formula = y ~ x1 + x2 + x3 + x4, family = poisson(), data = correia2019$example2)

Coefficients:
(Intercept)           x1           x2           x3           x4  
    -367.83       512.42     -1644.86      -105.85        20.56  

Degrees of Freedom: 11 Total (i.e. Null);  7 Residual
Null Deviance:      46.72 
Residual Deviance: 9.672e-06    AIC: 19.93
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted rates numerically 0 occurred
```

Capybara will detect separation on this dataset.

```r
fepoisson(
  y ~ x1 + x2 + x3 + x4,
  data = correia2019$example2
)
```

```r
Separation found in 6 observation(s)
Formula: y ~ x1 + x2 + x3 + x4

Family: Poisson

Estimates:

|             | Estimate  | Std. Error | z value | Pr(>|z|) |
|-------------|-----------|------------|---------|----------|
| (Intercept) | -239.2049 |  1134.4711 | -0.2109 |   0.8330 |
| x1          |  333.7767 |  1575.6630 |  0.2118 |   0.8322 |
| x2          |        NA |         NA |      NA |   NA     |
| x3          |  -68.9251 |   325.6385 | -0.2117 |   0.8324 |
| x4          |   13.4112 |    63.0263 |  0.2128 |   0.8315 |

Significance codes: ** p < 0.01; * p < 0.05; + p < 0.10

Pseudo R-squared: 1.00 

Number of observations: Full 12; Separated 6; Perfect classification 0 

Number of Fisher Scoring iterations: 25
```

## Example 3 (fixed effects)

Base R shall not give a warning when estimating a Poisson model with fixed effects on the dataset below. The separation
in this case is less clear, which motivates why capybara uses linear programming to detect separation in Poisson models.

```r
correia2019$fe1
```

```
# A tibble: 18 × 5
       y    x1    x2     i     j
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1     2     0     0     2     1
 2     0     0     0     4     2
 3     0     0     0     1     1
 4     1     1     0     4     3
 5     0     0     1     2     2
 6     0     0     0     2     2
 7     1     0     0     5     4
 8     0     1     2     2     3
 9     1     0     0     1     1
10     0     0     0     2     2
11     0     2     0     1     3
12     0     0     0     1     3
13     0     1     0     2     2
14     0     0     1     5     2
15     0     0     1     2     4
16     0     0     0     1     2
17     0     0     0     1     1
18     2     0     0     1     2
```

Base R shall not give a warning when estimating a Poisson model with fixed effects on this data.

```r
glm(
  y ~ x1 + x2 + factor(i) + factor(j),
  data = correia2019$fe1,
  family = poisson()
)
```

```r
Call:  glm(formula = y ~ x1 + x2 + factor(i) + factor(j), family = poisson(), 
    data = correia2019$fe1)

Coefficients:
(Intercept)           x1           x2   factor(i)2   factor(i)4   factor(i)5  
    -0.4114      -0.4845     -18.5226       0.4233       0.7375       0.9101  
 factor(j)2   factor(j)3   factor(j)4  
    -0.9854      -0.5696      -0.4986  

Degrees of Freedom: 17 Total (i.e. Null);  9 Residual
Null Deviance:      18.77 
Residual Deviance: 13.36        AIC: 42.59
Warning message:
glm.fit: fitted rates numerically 0 occurred
```

Capybara will detect separation when estimating Poisson models.

```r
fepoisson(
  y ~ x1 + x2 | i + j,
  data = correia2019$fe1
)
```

```
Separation found in 4 observation(s)
Formula: y ~ x1 + x2 | i + j

Family: Poisson

Estimates:

|    | Estimate | Std. Error | z value | Pr(>|z|) |
|----|----------|------------|---------|----------|
| x1 |  -0.4811 |     1.2419 | -0.3874 |   0.6984 |
| x2 |       NA |         NA |      NA |   NA     |

Significance codes: ** p < 0.01; * p < 0.05; + p < 0.10

Pseudo R-squared: 0.1874 

Fixed effects:
  i: 4
  j: 4

Number of observations: Full 18; Separated 4; Perfect classification 0 

Number of Fisher Scoring iterations: 4
```

'correia2019' (Stata) will also detect separation when estimating Poisson models with fixed effects.

```stata
. correia2019 y x1 x2, a(i j)
(simplex method dropped 4 separated observations)
(dropped 1 singleton observations)
note: 1 variable omitted because of collinearity: x2
 $$ Stopping (no negative residuals); separation found in 0 observations (1 iterations and 732 subiterations)
Iteration 1:   deviance = 1.4149e+01  eps = .         iters = 4    tol = 1.0e-04                                                                            
...
Iteration 6:   deviance = 1.3364e+01  eps = 1.85e-16  iters = 3    tol = 1.0e-07                                                                            
-------------------------------------------------------------------------------
(legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon 
> below tolerance)
Converged in 6 iterations and 21 HDFE sub-iterations (tol = 1.0e-08)

HDFE PPML regression                              No. of obs      =         13
Absorbing 2 HDFE groups                           Residual df     =          7
                                                  Wald chi2(1)    =       0.53
Deviance             =  13.36443052               Prob > chi2     =     0.4686
Log pseudolikelihood =  -11.2959209               Pseudo R2       =     0.0607
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -.4845469   .6685201    -0.72   0.469    -1.794822    .8257284
          x2 |          0  (omitted)
       _cons |  -.5708466   .4604099    -1.24   0.215    -1.473233    .3315402
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
           i |         3           0           3     |
           j |         3           1           2     |
-----------------------------------------------------+
```

A difference with respect to Stata's 'correia2019' is that capybara requires an explicit cluster term in the formula
to compute robust standard errors using a sandwich operation.

```r
correia2019$fe1$pair <- paste0(correia2019$fe1$i, correia2019$fe1$j)

fepoisson(
  y ~ x1 + x2 | i + j | pair,
  data = correia2019$fe1
)
```

```r
Separation found in 4 observation(s)
Formula: y ~ x1 + x2 | i + j | pair

Family: Poisson

Estimates:

|    | Estimate | Std. Error | z value | Pr(>|z|) |
|----|----------|------------|---------|----------|
| x1 |  -0.4811 |     0.3101 | -1.5514 |   0.1208 |
| x2 |       NA |         NA |      NA |   NA     |

Significance codes: ** p < 0.01; * p < 0.05; + p < 0.10

Pseudo R-squared: 0.1874 

Fixed effects:
  i: 4
  j: 4

Number of observations: Full 18; Separated 4; Perfect classification 0 

Number of Fisher Scoring iterations: 4
```

Other R packages may not detect separation in Poisson models with fixed effects. By disabling the separation check in
Capybara, we can match 'alpaca' and 'fixest' results.

Capybara without checking separation (incorrect $\beta_2$):

```r
fepoisson(
  y ~ x1 + x2 | i + j,
  data = correia2019$fe1,
  control = list(check_separation = FALSE)
)
```

```r
Formula: y ~ x1 + x2 | i + j

Family: Poisson

Estimates:

|    | Estimate | Std. Error | z value | Pr(>|z|) |
|----|----------|------------|---------|----------|
| x1 |  -0.4845 |     1.2439 | -0.3895 |   0.6969 |
| x2 | -12.3711 |   383.1463 | -0.0323 |   0.9742 |

Significance codes: ** p < 0.01; * p < 0.05; + p < 0.10

Pseudo R-squared: 0.2614 

Fixed effects:
  i: 4
  j: 4

Number of observations: Full 18; Missing 0; Perfect classification 0 

Number of Fisher Scoring iterations: 14
```

Alpaca:

```r
alpaca::feglm(
  y ~ x1 + x2 | i + j,
  data = correia2019$fe1,
  family = poisson()
)
```

```r
poisson - log link, l= [4, 4]

      x1       x2 
 -0.4845 -17.3711 
```

Fixest:

```r
fixest::feglm(
  y ~ x1 + x2 | i + j,
  data = correia2019$fe1,
  family = poisson()
)
```

```r
GLM estimation, family = poisson, Dep. Var.: y
Observations: 18
Fixed-effects: i: 4,  j: 4
Standard-errors: IID 
     Estimate Std. Error   z value Pr(>|z|) 
x1  -0.484547    1.70957 -0.283432  0.77685 
x2 -18.514805 6880.80328 -0.002691  0.99785 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -12.3   Adj. Pseudo R2: -0.353285
           BIC:  50.6     Squared Cor.:  0.261426
```
