Programming, Statistics

Python and Zero-Inflated Models

While this is hardly a tutorial, I’ve been spending a good deal of time working with zero-inflated data for a forthcoming project, and have worked with it extensively in the past.

The point of zero-inflated models is there are ultimately two sources of zeros, zeros can come from the primary model (usually Poisson), or they can be injected into the total model from some other source. For example, casualties from terror attacks may be zero inflated. Consider the idea that either there is no attack on a particular day (0 casualties), or the attack occurs but no one is hurt (also 0 casualties). However, we may want to model the daily casualties as a single model, so we need both parts in one area. (Seminal ZIP paper,  my applied paper on ZIP)

Here’s some code creating ZIP data, and then testing the ZIP model in Python to see if it properly recovers the coefficients I am looking for. Indeed, it does.  Since the ZIP models in python are not designed for the same things economists are typically looking for (ex, no output of standard errors and coefficient estimates in one table), this is a good exploration in to ZIP models, and a way to look around Python, the code du jour.

ZIP

Load Prerequisites
from matplotlib import pyplot as plt
import numpy as np
import math as math
import pandas as pd
from scipy import stats
import statsmodels.discrete.count_model as reg_models
#help(reg_mo1dels)
Generate Mock Data
The key point here in zero inflated (ZI) processes is that there is TWO ways of generating zeros.   The zero can be generated either through the (ZI) or through another process, usually Poisson (P).   Common examples include assembly line failure, the number of crimes in a neighborhood in a given hour.    Critically here was the challenge of indexing Python appropriately.

np.random.seed(123456789)

N = 100000
x_1= np.random.normal(5,1,size=N)
x_2= np.random.normal(5,1,size=N)
x=pd.DataFrame([x_1,x_2]).T
poisson_part = np.zeros(N)
zi_part = np.zeros(N)

for i, item in enumerate(x_1):
    poisson_part[i] = np.random.poisson(math.exp(0.2*x_1[i]-0.1*x_2[i])) #needed to initialize the test object. Note the poisson parameter is of the form e^(Bx), ln(lambda) = Bx
for i, item in enumerate(x_1):
    zi_part[i] = np.random.logistic(0.3*x_1[i]-0.2*x_2[i]) > 0 #needed to initialize the test object.

y = zip_model_data = poisson_part * zi_part

print(x.iloc[0:10,:]) 
print(poisson_part[0:10])
print(zi_part[0:10,])
print(y[0:10,])
Let’s take a look at the data:
          0         1
0  7.212902  5.344758
1  7.128398  4.654915
2  6.841711  4.715646
3  5.082382  5.770086
4  5.858964  3.736356
5  4.173984  3.839683
6  6.157271  5.560635
7  6.375915  7.051355
8  5.943028  4.532561
9  5.847871  5.960908
[0. 1. 4. 3. 0. 2. 0. 2. 3. 0.]
[1. 0. 1. 0. 1. 0. 1. 1. 1. 1.]
[0. 0. 4. 0. 0. 0. 0. 2. 3. 0.]
Generate the model-object itself.
out=reg_models.ZeroInflatedPoisson(y,x,x, inflation='logit')
Fit it:
fit_regularized=out.fit_regularized(maxiter = 100) #essentially forces convergence by penalizing. Biases estimates.
Optimization terminated successfully.    (Exit mode 0)
            Current function value: 1.479267848993134
            Iterations: 10
            Function evaluations: 12
            Gradient evaluations: 10
Let’s take a look at what our (regularized) fit was:
fit_regularized.params ##notice that these are regularized values, not the true values.  The ordinal scale of the variables 
inflate_0   -3.339664
inflate_1   -3.144663
0            0.237893
1           -0.232087
dtype: float64
Let’s try a REAL fit, no regularization, no biasing the estimates.
fit=out.fit(method='bfgs', maxiter = 1000) # May need more than the default 35 iterations, very small number!
..../python3.6/site-packages/statsmodels/base/model.py:508: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
Optimization terminated successfully.
         Current function value: 1.374978
         Iterations: 13
         Function evaluations: 16
         Gradient evaluations: 16
How did it look?
fit.params
inflate_0   -0.305607
inflate_1    0.208904
0            0.197614
1           -0.096841
dtype: float64

Nailed it! Regularization had had a noticeable effect on the process, but these are the true coefficients I was attempting to recover.

Note that ZIP in python does not automatically assume the existence of an intercept term. Nor does it automatically calculate the standard errors. I’ll be bootstrapping those soon enough.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s