While this is hardly a tutorial, I’ve been spending a good deal of time working with zero-inflated data for a forthcoming project, and have worked with it extensively in the past.
The point of zero-inflated models is there are ultimately two sources of zeros, zeros can come from the primary model (usually Poisson), or they can be injected into the total model from some other source. For example, casualties from terror attacks may be zero inflated. Consider the idea that either there is no attack on a particular day (0 casualties), or the attack occurs but no one is hurt (also 0 casualties). However, we may want to model the daily casualties as a single model, so we need both parts in one area. (Seminal ZIP paper, my applied paper on ZIP)
Here’s some code creating ZIP data, and then testing the ZIP model in Python to see if it properly recovers the coefficients I am looking for. Indeed, it does. Since the ZIP models in python are not designed for the same things economists are typically looking for (ex, no output of standard errors and coefficient estimates in one table), this is a good exploration in to ZIP models, and a way to look around Python, the code du jour.
ZIP
from matplotlib import pyplot as plt
import numpy as np
import math as math
import pandas as pd
from scipy import stats
import statsmodels.discrete.count_model as reg_models
#help(reg_mo1dels)
np.random.seed(123456789)
N = 100000
x_1= np.random.normal(5,1,size=N)
x_2= np.random.normal(5,1,size=N)
x=pd.DataFrame([x_1,x_2]).T
poisson_part = np.zeros(N)
zi_part = np.zeros(N)
for i, item in enumerate(x_1):
poisson_part[i] = np.random.poisson(math.exp(0.2*x_1[i]-0.1*x_2[i])) #needed to initialize the test object. Note the poisson parameter is of the form e^(Bx), ln(lambda) = Bx
for i, item in enumerate(x_1):
zi_part[i] = np.random.logistic(0.3*x_1[i]-0.2*x_2[i]) > 0 #needed to initialize the test object.
y = zip_model_data = poisson_part * zi_part
print(x.iloc[0:10,:])
print(poisson_part[0:10])
print(zi_part[0:10,])
print(y[0:10,])
out=reg_models.ZeroInflatedPoisson(y,x,x, inflation='logit')
fit_regularized=out.fit_regularized(maxiter = 100) #essentially forces convergence by penalizing. Biases estimates.
fit_regularized.params ##notice that these are regularized values, not the true values. The ordinal scale of the variables
fit=out.fit(method='bfgs', maxiter = 1000) # May need more than the default 35 iterations, very small number!
fit.params
Nailed it! Regularization had had a noticeable effect on the process, but these are the true coefficients I was attempting to recover.
Note that ZIP in python does not automatically assume the existence of an intercept term. Nor does it automatically calculate the standard errors. I’ll be bootstrapping those soon enough.
Leave a Reply to YM Cancel reply