Reddit Button Forecast (Post-Mortem*: 5/25)

The forecast correctly predicted the button would hit zero on May 23rd.  I see that the timer did not expire as a result of low latency “zombie” botting, but we have a 0.00 timer displayed on the day I was predicting. The final forecast was set on May 16th, seven days earlier.  I am pleased that the forecast was accurate so many days out, I was worried about posting a flurry of +3 and +2 day forecasts before it ended. While a detractor would say my forecast did update and change with new data, the technique remained comfortably in the ARIMA family each time. Most importantly, the only forecast to expire was correct, seven days ahead.

I will not continue to manually update the forecast given that the lower bound of zero has been reached.  Keep an eye out for code to be released.

Below is the final forecast, announced 5/16:

The current forecast suggests the button will end on the 23rd.  On a positive note, the forecasts have been getting shorter and shorter as I have updated them. The first forecast was +15 days, then +12, +13, and +8. This current one is +7.   I am still hoping to get two forecasts which both suggest the same day, but I imagine the trends are changing over time faster than my technique allows.

The grey confidence interval suggests that the timer will probably never over 30 seconds for any 10 minute period again.  If you want a badge indicating less than 30 seconds, you will always be able to find it if you wait for, at most, 10 minutes, even during peak hours. Very few people will have to wait that long.


Each cycle or wave in the graph is approximately one day, representing the daily cycle of activity on the button, high in the afternoons, low in the late nights/mornings. There is also a slight weekly cycle, but it is not easy to notice in the plot.

The button’s values have partially stabilized with a pretty persistent protection around the red barrier, but there is still some noticeable decay between the 4000 and 6000 period marks. I have continued to use an indicator for the lowest observed badge color to help soften the impact of the early periods, when it was impossible to get a low badge color due to the frequency of clicking. We are now in a period where it is demonstratively possible to get red, with patience and discipline- we have observed red badges occur. Using the lowest observed badge color as a variable allows us to separate out this current period from earlier ones where the data was less descriptive of the current state.

Out of the grey collection of possible futures highlighted, it looks like button is declining steadily, the general future looks rather grim. The upper line of the grey 75% confidence interval is below 30 seconds, suggests that the timer will not be kept at over half for a full 10 minutes ever again. [This prediction did not end up to be true.]  I note that the existence of a good forecast means that the red guard can simply pay extra close attention to the period in which they think it will end, and this forecast might actually extend the life of the button. Maybe.


First, I downloaded the second-by-second data at about 5/16 at 12:00pm CST from here. To ease the computational load and reduce unwanted noise in the forecast, the  4+ million data points were aggregated from seconds into intervals of ten minutes each. I examine only the lowest value of the timer, since the topic of interest is when the timer hits zero. (This strikes me as somewhat ad hoc, because the distributions of minimums are likely non-normal, they would be from an extreme value distribution.) Below is a plot of the ten minute minimums for the button. Each cycle is about a day, and there appears to be a weekly cycle that is very slight.

I exclude any period where the timer was not well recorded for technical reasons, which has helped return the forecast to normal after the “great button failure”. I am much more confident in this current forecast as a result. New to this forecast, I have also added a dummy indicator for the lowest badge observed. It began as purple, and then slowly slid to red. We are in a post-red period, but when the button began, we had only seen purple. The structure of the model ought to reflect that. This significant set of variables suggests that the button’s lowest observed value in a 10 minute period is sinking at an accelerated pace compared to the early stages of the button.

Then, I estimate the process using ARIMA(1,1,1) and weekly, daily, and hourly Fourier cyclical components. I include one pair of sin() and cos() waves to catch any cyclical trends in weeks, days, or hours. This is roughly the same technique I used to predict the next note in Beethoven’s Symphony, which worked with 95+% accuracy. They tend to fit very well, and in fact, I am often shocked by how effective they are when used correctly.

Below I show how well the past data fits with our model model.  This type of extreme fit is typical when ARIMA sequences are applied correctly, and only shows that I do seem to fit the past reasonably well. I check this plot to ensure that my forecast does not predict impossible amounts and it stays between 0-60 for our past data.

The fit appears to be very good, better than prior weeks, suggesting my model is better now that I have included lowest observed badge. There are few periods where the forecast is very off base.  (I am not sure why the last line spikes up so much, I would like to take a careful look at the code to see what’s going on, that spike is not part of ARIMA and therefore is a problem within my forecast, likely involving the very last period.)

Below, I show the errors of the forecast above. At this scale it is clear there are a few days where my model misjudges the fit. I am unsurprising by this, given I have about so many observations, but I am disappointed some intervals are incorrectly predicted by 20 seconds or more. This is the cost of estimation, perhaps.


On to more technical details.  My process looks at its own previous values to make a forecast. I need to make sure that my sequence is not missing critical associations.  Let us see how well the past values are associated with a current one. Big lines mean big association. We call plots of these correlations the ACF and PACF.  I plot them below. They suggest our fit is relatively well done. (They fall mostly within the blue lines for many/most steps, the first of the ACF is excluded, because the current value is 100% equal to itself.) For these steps that are outside of the blue in the PACF, I doubt the sequence has 25 lags or leads, and such things are not quickly calculable on a home computer anyway, so I am going to reject them as a possibility. Adding too many terms and over-fitting the data would be equally unwise.

5_16_acf 5_16_pacf

I avoid looking at the Augmented Dickey-Fuller Test because I am looking at minimums, and therefore have concerns about the normality of the errors, but have considered it.

Commentary on Other predictions Types and Styles:

Some are attempting to use the remaining number of greys. I am currently not encouraged that this approach is good. I note that the count of remaining greys appear to be largely insignificant in predicting the next lowest value of the button. (I have tried to include them in a variety of ways, including natural logs, and they did not influence the prediction.) I conclude from this that the number of greys largely is irrelevant. I suspect that a portion of the greys are pre-disposed to click, and this proportion of “click eventually” vs “never-click” matters more than the total number of greys, but I suspect this proportion fluctuates dramatically from minute to minute and I cannot isolate what the true proportion is without serious adjustment in my technique.

Some are attempting to predict the button failure by a clicks/minute approach, which I am intrigued by, but I have not investigated this closely as an approach.

I note that I have some reservations about the asymptotic validity of my estimators. I am investigating these currently.

Historical Forecasts

To see how my forecast changes, and in the interest of full disclosure, I will keep tabs of my past estimates and note how additional data improves or worsens these estimations.

Current Update (5/16) – May 23rd. Noting that previous updates have all shrunk distance to button failure: +14 days, then +13, +13, and +8. This current one is +7, within a week.

Updated Badge Technique (5/11) – May 19th New Technique Added: Used lowest observed badge color to help separate out the pattern of early periods (the purple, blue, orange periods) from the late patterns (the post-red period).

Revisiting the Forecast (5/3) -May 16th

Update: After Button Failure (4/27) – May 9th.

Great Button Failure Update (4/26) – May 28th, likely in error.

Initial Forecast (4/24) 
 – May 8th.


#Load Dependencies


library(“zoo”, lib.loc=”~/R/win-library/3.1″)
library(“xts”, lib.loc=”~/R/win-library/3.1″)
library(“lubridate”, lib.loc=”~/R/win-library/3.1″)
library(“forecast”, lib.loc=”~/R/win-library/3.1″)

#Source of data:
button button$time<-as.POSIXct(button$now_timestamp, origin=”1970-01-01″) #taken from
#Surprisingly, this feeds several periods of wrong time for just shy of 720 seconds. They are all zero.
#I must manually input a minimum for the button- prior to button time hitting zero, there had been false zeros, for lack of a better word.

#First there is the missing data. There is the periods between clicks where the timer clicks down by 1 second, and actually missing data. The ticking down periods are irrelevant because every click always happens at a local minimum.
#Get opening and closing time to sequence data.
all.dates<-seq(time.min, time.max, by=”sec”)
#merge data into single data frame with all data<-merge(all.dates.frame, button,all=FALSE)

#I trust that I did this correctly. Let us replace the button data frame now, officially.

#let us collapse this
#Need objects as xts:
button_xts<-button_xts[‘2015/’] #2015 to end of data set. Fixes odd error timings.
t15 min.
end<-endpoints(button_xts,on=”seconds”,t*60) # t minute periods #I admit end is a terrible name.
col1<-period.apply(button_xts$seconds_left,INDEX=end,FUN=function(x) {min(x,na.rm=TRUE)}) #generates some empty sets
col2<-period.apply(button_xts$participants,INDEX=end,FUN=function(x) {min(x,na.rm=TRUE)})

# we will add a lowest observed badge marker.
for(i in 1:length(button_xts$seconds_left)){
min_badge[i]<-floor(min(button_xts$seconds_left[1:(max(c(i-60/t,1)))])/10) #lowest badge seen yesterday is important.
#let’s get these factors as dummy variables.

#Seasons matter. I prefer Fourier Series:
fourier {
n X for(i in 1:terms)
X[,2*i-1] X[,2*i] }
colnames(X) return(X)
regressors<-data.frame(hours,days,weeks,badge_class[,2:dim(badge_class)[2]]) #badge_class[,2:dim(badge_class)[2]] #tried to use particpants. They are not significant.

#automatically chose from early ARIMA sequences, seasonal days, weeks, individual badge numbers are accounted for as a DRIFT term in the ARIMA sequence.
png(filename=”~/Button Data/5_20_acf.png”)
png(filename=”~/Button Data/5_20_pacf.png”)

#Let’s see how good this plot is of the hourly trend?
t.o.forecast<-paste(“Prediction starts at: “, date(),sep=””)
png(filename=”~/Button Data/5_20_historical.png”)
plot(fitted(reg), main=”Past Values of Button”, xlab=”Time (in 10 minute increments)”, ylab=”Lowest Button Time in 10 minute Interval)”, ylim=c(0,60))
png(filename=”~/Button Data/5_20_error.png”)
plot(res, main=”Error of Forecast”,,xlab=”Time (in 10 minute increments)”, ylab=”Error of Forecast Technique on Past Data”)
png(filename=”~/Button Data/5_20_overlay.png”)
plot(fitted(reg), main=”Past Values of Button overlayed with Forecast”,xlab=”Time (in 10 minute increments)”, ylab=”Lowest Button Time in 10 minute Interval”, ylim=c(0,60))

#forecast value of button:
#size of forecast
w n<-7*24*60/t
viable<-(dim(regressors)[1]-n):dim(regressors)[1] #gets the last week.

a=as.Date(seq(start, by=”15 min”,length.out=length(f_cast)))

png(filename=”~/Button Data/5_20_forecast.png”)
plot(forecast_values,ylim=c(0,60), main=”Lowest Button Time In Every 10 minute Period”, ylab=”10 minute Minimum of Button”, xlab=”Number of 10 minute Periods Since Button Creation”)
footnotes<-paste(“Timer Death in about 4 weeks. Prediction starts at “, date(),”. 75% CI in Grey.”,sep=””)

Programming, Statistics

An Example of Plotting Multiple Time Series (Stock Values) on a Graph in R

I am currently in the process of designing a portfolio to manage investments. While such programs are not best plastered over the internet, a few basic concepts about plotting can be displayed.  For example, I have created a rather appealing plot, which demonstrates how to plot series of multiple images in a single plot, shown below:
Code is below, including my process to detrend the data. The critical lines are in bold, highlighting the fact that you can use sample(colors()) to select from the body of colors at random. This is useful when you may have to generate many plots, potentially without greatly detailed manual supervision, and you are not demanding publication-quality color selection (which is plausible for personal investigative use).

#after obtaining closing prices, you should make sure you clean your inputs. Ensure you know why there are NA’s, or you will make a critical error of omission.



data<-diff(data, lag=1, differences=1)

#Check for any remaining trends in data over and above the natural cyclical or time-trending motion of the stocks!
#Detrend based off of the bond, a necessary part of even a basic CAPM portfolio
detrended<-data-xhat #also, norm.
for(n in 2:N){

lines(index(detrended),detrended[,n], col=sample(colors(),size=1))


Programming, Statistics

Music and Math

Many people claim there is a strong correlation between music and math.
Below, I demonstrate that the patterns in music are NOT well predicted by typical statistical approaches.

I have taken a MIDI file of Beethoven’s 5th, and analyzed the track using non-parametric estimation techniques. These techniques included panel data techniques, ARMA, and extensive non-parametric estimation techniques (polynomial and Fourier series to capture cyclical components). I then use the song’s notes and my estimation technique to create a forecast of following notes. I then play the “forecasted song”.

After listening, the “forecasted song” does does not well match the original. As a consequence, I can state that the mathematical techniques common to forecasting do not well predict a song.  Below are several attempts which I have highlighted:

The R-squared for these estimations are in fact VERY high, in the high 90’s. (Only few of the coefficients are significant, the data is clearly overfitted in some regressions.) This song in fact falls into the so-called uncanny valley, and is only slightly deviant from the actual Beethoven’s 5th. However, the ear is strongly cultured to perfection in the subject of music, and the errors are devastating to us.

Programming, Statistics

A Brief Presentation for the Student Investment Club (SIC)

In an attempt to branch out and see what other people do in terms of work, I’ve been creating a model for the Student Investment Club to simultaneously forecast GDP, CPI, and Unemployment.  While such a prediction is clearly overly ambitious for a casual effort, I made an attempt at it using some basic methodologies.  The dependent variables that I used in this case were guided by the preferences of the group, rather than by any particular theoretical model.  As such, I have very little faith in the model to be a powerful predictor on a fundamental level, but I do expect it to be correlated with the actual values.

Attached is my presentation (5-10 minutes) about my preliminary forecasts and findings. It is meant to be delivered to a nontechnical audience and is meant to be a partial (but not complete) disclosure of the problems with my approach. Below is a version of the model I am using to make such a forecast, with some admittedly sparse commentary.

SIC Presentation – Professional (Warm)


# Old work functions as a great key
# gdp<-getSymbols(‘GDPC1′,src=’FRED’)
# indp<-getSymbols(‘INDPRO’, src=’FRED’)
# ism<-getSymbols(‘NAPM’, src=’FRED’)
# cap<-getSymbols(‘TCU’,src=’FRED’)
# wage<-getSymbols(‘AHETPI’,src=’FRED’) #Productivity? Proxy:wages
# ppi<-getSymbols(‘PPIACO’,src=’FRED’)
# unemploy<-U1RATENSA
# libor<-USDONTD156N
# cpi<-getSymbols(‘CPIAUCSL’,src=’FRED’)
# nom_pers_inc<-getSymbols(‘PINCOME’,src=’FRED’) #this might need to be real
# senti<-getSymbols(‘UMCSENT’,src=’FRED’)
# #demand<-getSymbols(‘DEMOTHCONS’,src=’FRED’)#Consumer demand? Proxy: request for more loans
# #cpi<-getSymbols(‘TCU’,src=’FRED’) #Total sales? Proxy: Change in buisness inventories

#Get the data
out<-lapply(b, aggregate, by=as.yearqtr, mean)

# Scale it appropriately.
series<-lapply(out,window,start=as.yearqtr(“2000 Q1”), end=as.yearqtr(“2013 Q1”))#trims to a consistant window.
series<-log(series) #log the series
series<-as.ts(series) #need time series for this following operator:
series<-diff.ts(series[,c(“GDPC1″,”INDPRO”,”TCU”,”AHETPI”,”PPIACO”,”CPIAUCSL”,”PINCOME”,”UMCSENT”,”FEDFUNDS”,”U1RATENSA”)]) #first difference
series<-data.frame(series) #back to df
series$NAPM<-matrix(NAPM[(dim(NAPM)[1]+2-dim(series)[1]):dim(NAPM)[1]]) #Some may be stationary!
series$t.index<-zooreg(series, start=as.yearqtr(“2000 Q1”),end=as.yearqtr(“2013 Q1″), frequency = 4) #need a time trend
series$quarter<-as.vector(seq(from=1,to=4, by=1))
# series$PINCOME_2<-(series$PINCOME)^2 #are these acceptable?
# series$GDPC_2<-(series$GDPC1)^2
# documentation
series$Lead_GDPC1<-lag(zoo(lagGDP),k=+2, na.pad=TRUE)
series$Lead_CPIAUCSL<-lag(zoo(lagCPI),k=+2, na.pad=TRUE)
series$Lead_U1RATENSA<-lag(zoo(lagUNEMP),k=+2, na.pad=TRUE) #impact takes at least 2 quarters. This is needed because we are missing CPI numbers for last quarter. Sentiment is delayed 6 months as propietary information. If it is set to +2, the estimates are to see what it would be like if we had the current info (pay for it).
eq1<- Lead_GDPC1 ~ INDPRO + lvl_NAPM + lvl_UMCSENT + GDPC1 + TCU + CPIAUCSL + FEDFUNDS + U1RATENSA + factor(quarter)
eq2<- Lead_CPIAUCSL ~ INDPRO + lvl_NAPM + lvl_UMCSENT + GDPC1 + TCU + CPIAUCSL + FEDFUNDS + U1RATENSA + factor(quarter)
eq3<- Lead_U1RATENSA ~ INDPRO + lvl_NAPM + lvl_UMCSENT + GDPC1 + TCU + CPIAUCSL + FEDFUNDS + U1RATENSA + factor(quarter)
# series<-data.frame(series)
fit<-systemfit(eqsystem, method=”SUR”, data=series)
pred<-predict(fit,series, se.pred=TRUE)
pred_ci<-predict(fit, series, interval=”confidence”, level=0.95) #note events are not normal.
plot(series$GDPC1, type=”l”, col=”darkgreen”, ylab=”% Change in GDP”, xlab=”Quarters (since 2000)”, main=”GDP forecast”) #the dimseries -40 gets me 10years.
points(pred[1], type=”l”, col=”blue”, lty=5)
points(pred_ci[,c(3)],type=”l”, col=”red”, lty=2)
points(pred_ci[,c(2)],type=”l”, col=”red”, lty=2)
legend(x=”bottomleft”,c(“Green= Actual GDP”,”Red= 95% CI”,”Blue=Forecast”), cex=0.90)

plot(series$CPIAUCSL, type=”l”, col=”darkgreen”, ylab=”% Change in CPI”, xlab=”Quarters (since 2000)”,main=”CPI forecast”)
points(pred[3], type=”l”, col=”blue”, lty=5)
points(pred_ci[,5],type=”l”, col=”red”, lty=2)
points(pred_ci[,6],type=”l”, col=”red”, lty=2)
legend(x=”bottomleft”,c(“Green= Actual GDP”,”Red= 95% CI”,”Blue=Forecast”), cex=0.90)

plot(series$U1RATENSA, type=”l”, col=”darkgreen”, ylab=”% Change in UNEMP”, xlab=”Quarters (since 2000)”, main=”UNEMP forecast”)
points(pred[5], type=”l”, col=”blue”, lty=5)
points(pred_ci[,8],type=”l”, col=”red”, lty=2)
points(pred_ci[,9],type=”l”, col=”red”, lty=2)
legend(x=”bottomleft”,c(“Green= Actual GDP”,”Red= 95% CI”,”Blue=Forecast”), cex=0.90)




Generating correlated variables in R, and an uninvertability hypotheises.

This has been a problem I’ve been suffering with as of late.  How do you generate variables with an arbitrary covariance in R?  There’s no “covariance=3” button!  This creates some issues when I’m trying to simulate various behaviours in my econometrics studying (such as omitted variable bias, forthcoming). So I attempted to come up with a solution– only to discover the problem is much more difficult than I initially thought!

Say x is an arbitrary NxK matrix composed of N draws from K different random variables, and we would like to generate a matrix y (NxK) such that it has  an arbitrary covariance matrix c (KxK) and arbitrary mean m=E(y)=1xK. Define 1 as a 1xN vector of 1’s, which allows 1*x=E(x)=1xK, and N remains a scalar.

c+E(y)'(1*x/N)=y'x/N Convert into sample analogs, E(y) is a selected parameter, so it has no sample analog at the moment.
cN+E(y)'(1*x)=y'x Keep in mind N is a scalar

But unfortunately, xx‘ needs to be invertible! xx‘ rarely is invertable as an NxN when N is large and K is small-> only x’x, a KxK is likely invertable!  I think we can show that as N approaches infinity, xx’ is singular with probability approaching 1 for a finite K. At a minimum, we can show that this isn’t going to work in R! Work below.

c<-matrix(runif(K^2,-1,1), ncol=K, nrow=K)
c[upper.tri(c)]<-c[lower.tri(c)] #correlation matrixes are symmetric by definition.

x<-matrix(rnorm(N,ex,10)) #one row is alcready created.
tick<-0 #cheap trick to make sure your loop repeated successfully. You’ll see it.
for(i in 2:K){
holding<-as.matrix(rnorm(N,ex,1)) #it’s an Nx1, k=1

solve(t(x)%*%x/N) #kxk is invertable

solve(x%*%t(x)/N) #nxn is not!

My thought is that k squared terms: (x_{n1}^2+x_{n2}^2+...x_{nk}^2)/N for K different terms. But if K is small, then the sum of x_k^2‘s is an unincreasing finite number, divided by N, which approaches infinity. As such, all the diagonal elements approach 0.  The same is true of the cross diagonal elements. An NxN matrix with all 0’s in all elements has rank less than its dimension and cannot be inverted.

Programming, Statistics

Basic bootstrapping in R

I’ve been having some trouble determining how strata works in the boot() command.   My intuition says it should select from within each type of strata. But no guarantees!  I can always just type “boot” and read the whole command line for line… but this isn’t always easy to interpret without comments.
So here’s a quick test to make sure it does what I think it does.

library(boot) #load the package
x<-rep(c(0,100,0,1), each=20) #make a long list of 20 zeros,20 one hundreds, 20 zeros, 20 ones.pool<-matrix(x, ncol=2) #make that list into a 20 by 2 matrix, filling it downward.

pool #let’s look at it.
mean(pool[i,1]) #mean of the first column of pool, using individuals i
} #create a function that uses (data, individuals)
boot(pool,f, R=500) #resample and perform that operation 500 times.  Has variation in output.
boot(pool,f, R=500, strata=pool[,2]) #resample and perform that operation -reselecting from within the strata on the rhs. Since all observations within each strata are identical, the standard deviation should be zero.

Here’s an interesting mistake that I made while creating this command.

} #create a function that uses (data, individuals)
boot(pool,f, R=500) #Has no variation in bootstrapping.  Why? What’s up with that?

Answer: There’s no individuals. It just resamples the entire population of the first column of pooled 500 times, getting the final result as shown.

Programming, Statistics

Ordering and structuring text in STATA

Like most researchers, I have difficulty finding data. On occasion, when I manage to find data, it comes in a useless or meaningless format. For example, a local police department offered to print out all of its police reports for me for the last 15 years at $0.25 a page… which I would have to enter by hand.  I politely declined.  Here’s an example of sorting out city titles from a poorly formatted document- a short version of that document can be found here.

The first and most important thing is to have your data loaded.

clear all
use "location", clear

This loads the data and clears any extra junk STATA may have floating around.  The next thing to do is identify any sort of regular pattern.  The first identifiable patterns were that titles for cities began every 64 lines, and sometimes went for between 3 or 4 lines, before having another 64 lines of regular income reports (useless at this point) are posted. The second pattern is that they all are either villages, cities, or towns.  Below are some attempts at accomplishing this goal via making some sort of indicator every 64 lines (didn’t work), or marking every area that starts with CITY (inefficient). Don’t forget to put strings in quotation marks!

*Marks every line with a list that counts 0...63. I thought the 63rd would be useful, but because of the irregular length of the header, it's worthless.
*gen city_titles=mod(_n-1,64)
*Is the line saying city? If so, give it a mark, literally. 
*gen mark=cond(var1=="CITY","mark",".",".")

Notice that using (*) comments out the line in STATA, and (.) is STATA’s version of NA.  Here is a more successful attempt:

*Identify if it's probably a title
gen mark=1 if var1=="CITY"
replace mark=1 if var1=="TOWN"
replace mark=1 if var1=="VILLAGE"

Don’t forget the double equals!  In STATA, all if statements I have observed are followed by (==) instead of (=), and I still manage to forget them often!  Next, I will create 3 more indicators, aptly labeled mark_1 to mark_3, all showing if it follows the indicator above by 1, 2, or 3 variables. In programming, it is better to give clear, boring, brief names rather than unclear mess. I would rather have a long object name than an unclear short one, since you never know who is going to have to read your code:

*Is it following a title?
gen mark_1=mark[_n-1]
gen mark_2=mark[_n-2]
gen mark_3=mark[_n-3]

In retrospect, I could have perhaps consolidated those more efficiently. Does anyone else see how I could have done that in a single variable (mark) instead of having multiple indicators (mark_1…mark_3)?  Either way, I have some extra chaff I have to eliminate. having (.)’s in my data set is agitating. Let’s set those all as 0’s, and sum the remaining columns together in something called “total”, so I know if they have an indicator (1 or more) or not (0).  Then we can drop everything we don’t want!

*Eliminate non-numerical entries as numeric.
replace mark=0 if mark==.
replace mark_1=0 if mark_1==.
replace mark_2=0 if mark_2==.
replace mark_3=0 if mark_3==.

*Is it marked?
gen total=mark+mark_1+mark_2+mark_3

drop if total!=1

Bam!  Much better. We dropped somewhere around 80% of our unwanted junk right there! We still could never compile this all by hand, though.  Here, we want it to concatenate all our variables 1…11 in a downward direction.  So we make a local marco list called to_combine, since that’s what we’ll be doing with them.

local to_combine var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11

Below is a STATA style do-loop, a common element in meaningful programming everywhere.  We want to repeat for each and every variable in the “to_combine” list a series of operations designed to deal with a few regular problems we observed. Firstly, we want to fix an issue with names. You may notice “FOND DU LAC” takes up 3 columns rather than the intended 1, being as that is the name of a city/town/village/county in Wisconsin.  This is the first thing we fix, by dropping the spaces with “DU” and “LAC” and replacing all “FOND” entries with “FOND DU LAC”.  Such a procedure is repeated for other problems we noted. There are 300+ problems we found, so this is not a perfect fix yet, (as noted by entry v9/2 in the “after” image below), but it’s much better. The second thing we fix is putting “COUNTY” horizontally under “TOTAL”, since in the original document (pdf), we know that after an entry says “TOTAL”, it always says the total for the county, not for any other grouping.  Lastly, we want it to concatenate horizontally but stop before it hits any part of the income sheet.  Stata uses & for “and” and | for “or”. You may use both in long chains for a single if statement. This last was done by a bit of trial and error to make sure I got it right. That’s fine, since computers are cheap and thinking hurts.  It stops at the line containing every end of line indicator I could think of, like “TAXES”, “POPULATION”, or “REVENUES”, which are clearly not city titles.

foreach var in `to_combine'{

*We have a space problem (Part 1)
replace `var'="" if `var'=="DU" 
replace `var'="LA CROSSE" if `var'=="LA"

*Puts county under total.  It's a mandatory identifier visible in the pdf. (Part 2)
replace `var'="" if `var'=="COUNTY"
replace `var'="COUNTY" if `var'[_n-2]=="TOTAL"

*Concatenates appropriately, stopping if it hits anything
replace `var'=`var'[_n-1]+"."+`var' if var2[_n-1]!="POPULATION" & var1[_n-1]!="REVENUES" & var1[_n-1]!="TAXES" & mark_3[_n-1]!=1 & var2!="POPULATION" & var1!="REVENUES" & var1!="TAXES"


Now, let’s just drop the stuff we don’t need, and we’re looking at a much much cleaner worksheet, giving us the before/after image shown below:

*Drops excess rows or columns.
drop if mark==1
drop if mark_1==1
drop if mark_2==1
drop if var2=="POPULATION"
drop if var1=="REVENUES"
drop if var1=="TAXES"