While this is hardly a tutorial, I’ve been spending a good deal of time working with zero-inflated data for a forthcoming project, and have worked with it extensively in the past.
The point of zero-inflated models is there are ultimately two sources of zeros, zeros can come from the primary model (usually Poisson), or they can be injected into the total model from some other source. For example, casualties from terror attacks may be zero inflated. Consider the idea that either there is no attack on a particular day (0 casualties), or the attack occurs but no one is hurt (also 0 casualties). However, we may want to model the daily casualties as a single model, so we need both parts in one area. (Seminal ZIP paper, my applied paper on ZIP)
Here’s some code creating ZIP data, and then testing the ZIP model in Python to see if it properly recovers the coefficients I am looking for. Indeed, it does. Since the ZIP models in python are not designed for the same things economists are typically looking for (ex, no output of standard errors and coefficient estimates in one table), this is a good exploration in to ZIP models, and a way to look around Python, the code du jour.
from matplotlib import pyplot as plt import numpy as np import math as math import pandas as pd from scipy import stats import statsmodels.discrete.count_model as reg_models #help(reg_mo1dels)
np.random.seed(123456789) N = 100000 x_1= np.random.normal(5,1,size=N) x_2= np.random.normal(5,1,size=N) x=pd.DataFrame([x_1,x_2]).T poisson_part = np.zeros(N) zi_part = np.zeros(N) for i, item in enumerate(x_1): poisson_part[i] = np.random.poisson(math.exp(0.2*x_1[i]-0.1*x_2[i])) #needed to initialize the test object. Note the poisson parameter is of the form e^(Bx), ln(lambda) = Bx for i, item in enumerate(x_1): zi_part[i] = np.random.logistic(0.3*x_1[i]-0.2*x_2[i]) > 0 #needed to initialize the test object. y = zip_model_data = poisson_part * zi_part print(x.iloc[0:10,:]) print(poisson_part[0:10]) print(zi_part[0:10,]) print(y[0:10,])
0 1 0 7.212902 5.344758 1 7.128398 4.654915 2 6.841711 4.715646 3 5.082382 5.770086 4 5.858964 3.736356 5 4.173984 3.839683 6 6.157271 5.560635 7 6.375915 7.051355 8 5.943028 4.532561 9 5.847871 5.960908 [0. 1. 4. 3. 0. 2. 0. 2. 3. 0.] [1. 0. 1. 0. 1. 0. 1. 1. 1. 1.] [0. 0. 4. 0. 0. 0. 0. 2. 3. 0.]
fit_regularized=out.fit_regularized(maxiter = 100) #essentially forces convergence by penalizing. Biases estimates.
Optimization terminated successfully. (Exit mode 0) Current function value: 1.479267848993134 Iterations: 10 Function evaluations: 12 Gradient evaluations: 10
fit_regularized.params ##notice that these are regularized values, not the true values. The ordinal scale of the variables
inflate_0 -3.339664 inflate_1 -3.144663 0 0.237893 1 -0.232087 dtype: float64
fit=out.fit(method='bfgs', maxiter = 1000) # May need more than the default 35 iterations, very small number!
..../python3.6/site-packages/statsmodels/base/model.py:508: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)
Optimization terminated successfully. Current function value: 1.374978 Iterations: 13 Function evaluations: 16 Gradient evaluations: 16
inflate_0 -0.305607 inflate_1 0.208904 0 0.197614 1 -0.096841 dtype: float64
Nailed it! Regularization had had a noticeable effect on the process, but these are the true coefficients I was attempting to recover.
Note that ZIP in python does not automatically assume the existence of an intercept term. Nor does it automatically calculate the standard errors. I’ll be bootstrapping those soon enough.
4 thoughts on “Python and Zero-Inflated Models”
Thank you for this clean example! Do you have an intuition regarding the signs of the coefficients “inflate_” ? I was expecting the opposite sign but I probably miss something here.
The “inflate_” coefficients are from the inflation process, so in practice the higher the “inflate_” coefficient, the more likely there are to be zeros in the process. This is the opposite of the Poisson portion of the model where higher coefficients are associated with higher observed values. It can be very tricky!
Thank you for your prompt answer. Nonetheless, I’m still puzzled: inflate_0 = -0.3 while zi_part is defined with +0.3 . The intuitive coefficient signs arise with out=reg_models.ZeroInflatedPoisson(y,x,-x, inflation=’logit’) but I don’t get why a minus is necessary… Any hint is welcome 🙂
Thank you! I finally got it