While this is hardly a tutorial, I’ve been spending a good deal of time working with zero-inflated data for a forthcoming project, and have worked with it extensively in the past.

The point of zero-inflated models is there are ultimately two sources of zeros, zeros can come from the primary model (usually Poisson), or they can be injected into the total model from some other source. For example, casualties from terror attacks may be zero inflated. Consider the idea that either there is no attack on a particular day (0 casualties), or the attack occurs but no one is hurt (also 0 casualties). However, we may want to model the daily casualties as a single model, so we need both parts in one area. (Seminal ZIP paper, my applied paper on ZIP)

Here’s some code creating ZIP data, and then testing the ZIP model in Python to see if it properly recovers the coefficients I am looking for. Indeed, it does. Since the ZIP models in python are not designed for the same things economists are typically looking for (ex, no output of standard errors and coefficient estimates in one table), this is a good exploration in to ZIP models, and a way to look around Python, the code du jour.

ZIP

Load Prerequisites

from matplotlib import pyplot as plt
import numpy as np
import math as math
import pandas as pd
from scipy import stats
import statsmodels.discrete.count_model as reg_models
#help(reg_mo1dels)

Generate Mock Data

The key point here in zero inflated (ZI) processes is that there is TWO ways of generating zeros. The zero can be generated either through the (ZI) or through another process, usually Poisson (P). Common examples include assembly line failure, the number of crimes in a neighborhood in a given hour. Critically here was the challenge of indexing Python appropriately.

np.random.seed(123456789)

N = 100000
x_1= np.random.normal(5,1,size=N)
x_2= np.random.normal(5,1,size=N)
x=pd.DataFrame([x_1,x_2]).T
poisson_part = np.zeros(N)
zi_part = np.zeros(N)

for i, item in enumerate(x_1):
    poisson_part[i] = np.random.poisson(math.exp(0.2*x_1[i]-0.1*x_2[i])) #needed to initialize the test object. Note the poisson parameter is of the form e^(Bx), ln(lambda) = Bx
for i, item in enumerate(x_1):
    zi_part[i] = np.random.logistic(0.3*x_1[i]-0.2*x_2[i]) > 0 #needed to initialize the test object.

y = zip_model_data = poisson_part * zi_part

print(x.iloc[0:10,:]) 
print(poisson_part[0:10])
print(zi_part[0:10,])
print(y[0:10,])

Let’s take a look at the data:

          0         1
0  7.212902  5.344758
1  7.128398  4.654915
2  6.841711  4.715646
3  5.082382  5.770086
4  5.858964  3.736356
5  4.173984  3.839683
6  6.157271  5.560635
7  6.375915  7.051355
8  5.943028  4.532561
9  5.847871  5.960908
[0. 1. 4. 3. 0. 2. 0. 2. 3. 0.]
[1. 0. 1. 0. 1. 0. 1. 1. 1. 1.]
[0. 0. 4. 0. 0. 0. 0. 2. 3. 0.]

Generate the model-object itself.

out=reg_models.ZeroInflatedPoisson(y,x,x, inflation='logit')

Fit it:

fit_regularized=out.fit_regularized(maxiter = 100) #essentially forces convergence by penalizing. Biases estimates.

Optimization terminated successfully.    (Exit mode 0)
            Current function value: 1.479267848993134
            Iterations: 10
            Function evaluations: 12
            Gradient evaluations: 10

Let’s take a look at what our (regularized) fit was:

fit_regularized.params ##notice that these are regularized values, not the true values.  The ordinal scale of the variables

inflate_0   -3.339664
inflate_1   -3.144663
0            0.237893
1           -0.232087
dtype: float64

Let’s try a REAL fit, no regularization, no biasing the estimates.

fit=out.fit(method='bfgs', maxiter = 1000) # May need more than the default 35 iterations, very small number!

..../python3.6/site-packages/statsmodels/base/model.py:508: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

Optimization terminated successfully.
         Current function value: 1.374978
         Iterations: 13
         Function evaluations: 16
         Gradient evaluations: 16

How did it look?

fit.params

inflate_0   -0.305607
inflate_1    0.208904
0            0.197614
1           -0.096841
dtype: float64

Nailed it! Regularization had had a noticeable effect on the process, but these are the true coefficients I was attempting to recover.

Note that ZIP in python does not automatically assume the existence of an intercept term. Nor does it automatically calculate the standard errors. I’ll be bootstrapping those soon enough.

4 responses to “Python and Zero-Inflated Models”

YM

Jun 25

Thank you for this clean example! Do you have an intuition regarding the signs of the coefficients “inflate_” ? I was expecting the opposite sign but I probably miss something here.

1. regressforward
  
  Jun 25
  
  The “inflate_” coefficients are from the inflation process, so in practice the higher the “inflate_” coefficient, the more likely there are to be zeros in the process. This is the opposite of the Poisson portion of the model where higher coefficients are associated with higher observed values. It can be very tricky!
  
YM

Jun 25

Thank you for your prompt answer. Nonetheless, I’m still puzzled: inflate_0 = -0.3 while zi_part is defined with +0.3 . The intuitive coefficient signs arise with out=reg_models.ZeroInflatedPoisson(y,x,-x, inflation=’logit’) but I don’t get why a minus is necessary… Any hint is welcome 🙂

YM

Jun 28

Thank you! I finally got it

Python and Zero-Inflated Models

ZIP

Share this:

4 responses to “Python and Zero-Inflated Models”

Leave a comment Cancel reply