While this is hardly a tutorial, I’ve been spending a good deal of time working with zero-inflated data for a forthcoming project, and have worked with it extensively in the past.
The point of zero-inflated models is there are ultimately two sources of zeros, zeros can come from the primary model (usually Poisson), or they can be injected into the total model from some other source. For example, casualties from terror attacks may be zero inflated. Consider the idea that either there is no attack on a particular day (0 casualties), or the attack occurs but no one is hurt (also 0 casualties). However, we may want to model the daily casualties as a single model, so we need both parts in one area. (Seminal ZIP paper, my applied paper on ZIP)
Here’s some code creating ZIP data, and then testing the ZIP model in Python to see if it properly recovers the coefficients I am looking for. Indeed, it does. Since the ZIP models in python are not designed for the same things economists are typically looking for (ex, no output of standard errors and coefficient estimates in one table), this is a good exploration in to ZIP models, and a way to look around Python, the code du jour.
ZIP
from matplotlib import pyplot as plt
import numpy as np
import math as math
import pandas as pd
from scipy import stats
import statsmodels.discrete.count_model as reg_models
#help(reg_mo1dels)
np.random.seed(123456789)
N = 100000
x_1= np.random.normal(5,1,size=N)
x_2= np.random.normal(5,1,size=N)
x=pd.DataFrame([x_1,x_2]).T
poisson_part = np.zeros(N)
zi_part = np.zeros(N)
for i, item in enumerate(x_1):
poisson_part[i] = np.random.poisson(math.exp(0.2*x_1[i]-0.1*x_2[i])) #needed to initialize the test object. Note the poisson parameter is of the form e^(Bx), ln(lambda) = Bx
for i, item in enumerate(x_1):
zi_part[i] = np.random.logistic(0.3*x_1[i]-0.2*x_2[i]) > 0 #needed to initialize the test object.
y = zip_model_data = poisson_part * zi_part
print(x.iloc[0:10,:])
print(poisson_part[0:10])
print(zi_part[0:10,])
print(y[0:10,])
out=reg_models.ZeroInflatedPoisson(y,x,x, inflation='logit')
fit_regularized=out.fit_regularized(maxiter = 100) #essentially forces convergence by penalizing. Biases estimates.
fit_regularized.params ##notice that these are regularized values, not the true values. The ordinal scale of the variables
fit=out.fit(method='bfgs', maxiter = 1000) # May need more than the default 35 iterations, very small number!
fit.params
Nailed it! Regularization had had a noticeable effect on the process, but these are the true coefficients I was attempting to recover.
Note that ZIP in python does not automatically assume the existence of an intercept term. Nor does it automatically calculate the standard errors. I’ll be bootstrapping those soon enough.
Thank you for this clean example! Do you have an intuition regarding the signs of the coefficients “inflate_” ? I was expecting the opposite sign but I probably miss something here.
The “inflate_” coefficients are from the inflation process, so in practice the higher the “inflate_” coefficient, the more likely there are to be zeros in the process. This is the opposite of the Poisson portion of the model where higher coefficients are associated with higher observed values. It can be very tricky!
Thank you for your prompt answer. Nonetheless, I’m still puzzled: inflate_0 = -0.3 while zi_part is defined with +0.3 . The intuitive coefficient signs arise with out=reg_models.ZeroInflatedPoisson(y,x,-x, inflation=’logit’) but I don’t get why a minus is necessary… Any hint is welcome 🙂
Thank you! I finally got it