Programming

Accidental Art 2

My earlier paper mentioned in “Accidental Art” on phased entry has been postponed.   This postponement is a fortunate side effect of my successful publication in Regional Science and Urban Economics.  My successful publication has promptly propelled me into a direction of crime economics, rather than one of phased entry. As a result, this phased entry paper has been put on hold. Currently I’m doing some basic modeling on counter-terrorism, which makes me feel like a criminal mastermind. Amusingly enough, there’s still some some beautiful accidental art being churned out by my model, this time in Matlab!

accidental art 2 accidental art 1

Advertisements
Programming, Statistics

A Brief Presentation for the Student Investment Club (SIC)

In an attempt to branch out and see what other people do in terms of work, I’ve been creating a model for the Student Investment Club to simultaneously forecast GDP, CPI, and Unemployment.  While such a prediction is clearly overly ambitious for a casual effort, I made an attempt at it using some basic methodologies.  The dependent variables that I used in this case were guided by the preferences of the group, rather than by any particular theoretical model.  As such, I have very little faith in the model to be a powerful predictor on a fundamental level, but I do expect it to be correlated with the actual values.

Attached is my presentation (5-10 minutes) about my preliminary forecasts and findings. It is meant to be delivered to a nontechnical audience and is meant to be a partial (but not complete) disclosure of the problems with my approach. Below is a version of the model I am using to make such a forecast, with some admittedly sparse commentary.

SIC Presentation – Professional (Warm)

library(tseries)
library(quantmod)
library(systemfit)

# Old work functions as a great key
# gdp<-getSymbols(‘GDPC1′,src=’FRED’)
# indp<-getSymbols(‘INDPRO’, src=’FRED’)
# ism<-getSymbols(‘NAPM’, src=’FRED’)
# cap<-getSymbols(‘TCU’,src=’FRED’)
# wage<-getSymbols(‘AHETPI’,src=’FRED’) #Productivity? Proxy:wages
# ppi<-getSymbols(‘PPIACO’,src=’FRED’)
# unemploy<-U1RATENSA
# libor<-USDONTD156N
#
# cpi<-getSymbols(‘CPIAUCSL’,src=’FRED’)
# nom_pers_inc<-getSymbols(‘PINCOME’,src=’FRED’) #this might need to be real
# senti<-getSymbols(‘UMCSENT’,src=’FRED’)
# #demand<-getSymbols(‘DEMOTHCONS’,src=’FRED’)#Consumer demand? Proxy: request for more loans
# #cpi<-getSymbols(‘TCU’,src=’FRED’) #Total sales? Proxy: Change in buisness inventories

#Get the data
out<-NULL
b_names<-c(“GDPC1″,”INDPRO”,”NAPM”,”TCU”,”AHETPI”,”PPIACO”,”CPIAUCSL”,”PINCOME”,”UMCSENT”,”FEDFUNDS”,”U1RATENSA”)
getSymbols(b_names,src=’FRED’)
b<-list(GDPC1,INDPRO,NAPM,TCU,AHETPI,PPIACO,CPIAUCSL,PINCOME,UMCSENT,FEDFUNDS,U1RATENSA)
FEDFUNDS<-na.exclude(FEDFUNDS)
out<-lapply(b, aggregate, by=as.yearqtr, mean)

# Scale it appropriately.
series<-lapply(out,window,start=as.yearqtr(“2000 Q1”), end=as.yearqtr(“2013 Q1”))#trims to a consistant window.
series<-lapply(series,cbind)
series<-data.frame(series)
names(series)<-b_names
series<-log(series) #log the series
series<-as.ts(series) #need time series for this following operator:
series<-diff.ts(series[,c(“GDPC1″,”INDPRO”,”TCU”,”AHETPI”,”PPIACO”,”CPIAUCSL”,”PINCOME”,”UMCSENT”,”FEDFUNDS”,”U1RATENSA”)]) #first difference
lagGDP<-series[,”GDPC1″]
lagCPI<-series[,”CPIAUCSL”]
lagUNEMP<-series[,”U1RATENSA”]
series<-data.frame(series) #back to df
series$NAPM<-matrix(NAPM[(dim(NAPM)[1]+2-dim(series)[1]):dim(NAPM)[1]]) #Some may be stationary!
series$lvl_UMCSENT<-matrix(UMCSENT[(dim(UMCSENT)[1]+2-dim(series)[1]):dim(UMCSENT)[1]])
series$lvl_TCU<-matrix(TCU[(dim(TCU)[1]+2-dim(series)[1]):dim(TCU)[1]])
series$lvl_NAPM<-matrix(NAPM[(dim(NAPM)[1]+2-dim(series)[1]):dim(NAPM)[1]])
series$lvl_FEDFUNDS<-matrix(FEDFUNDS[(dim(FEDFUNDS)[1]+2-dim(series)[1]):dim(FEDFUNDS)[1]])
series$t.index<-zooreg(series, start=as.yearqtr(“2000 Q1”),end=as.yearqtr(“2013 Q1″), frequency = 4) #need a time trend
series$quarter<-as.vector(seq(from=1,to=4, by=1))
# series$PINCOME_2<-(series$PINCOME)^2 #are these acceptable?
# series$GDPC_2<-(series$GDPC1)^2
series_hold<-data.frame(series)
# documentation http://cran.r-project.org/web/packages/systemfit/vignettes/systemfit.pdf
series$Lead_GDPC1<-lag(zoo(lagGDP),k=+2, na.pad=TRUE)
series$Lead_CPIAUCSL<-lag(zoo(lagCPI),k=+2, na.pad=TRUE)
series$Lead_U1RATENSA<-lag(zoo(lagUNEMP),k=+2, na.pad=TRUE) #impact takes at least 2 quarters. This is needed because we are missing CPI numbers for last quarter. Sentiment is delayed 6 months as propietary information. If it is set to +2, the estimates are to see what it would be like if we had the current info (pay for it).
eq1<- Lead_GDPC1 ~ INDPRO + lvl_NAPM + lvl_UMCSENT + GDPC1 + TCU + CPIAUCSL + FEDFUNDS + U1RATENSA + factor(quarter)
eq2<- Lead_CPIAUCSL ~ INDPRO + lvl_NAPM + lvl_UMCSENT + GDPC1 + TCU + CPIAUCSL + FEDFUNDS + U1RATENSA + factor(quarter)
eq3<- Lead_U1RATENSA ~ INDPRO + lvl_NAPM + lvl_UMCSENT + GDPC1 + TCU + CPIAUCSL + FEDFUNDS + U1RATENSA + factor(quarter)
eqsystem<-list(GDP=eq1,CPI=eq2,UNEMP=eq3)
# series<-data.frame(series)
fit<-systemfit(eqsystem, method=”SUR”, data=series)
pred<-predict(fit,series, se.pred=TRUE)
pred_ci<-predict(fit, series, interval=”confidence”, level=0.95) #note events are not normal.
plot(series$GDPC1, type=”l”, col=”darkgreen”, ylab=”% Change in GDP”, xlab=”Quarters (since 2000)”, main=”GDP forecast”) #the dimseries -40 gets me 10years.
points(pred[1], type=”l”, col=”blue”, lty=5)
points(pred_ci[,c(3)],type=”l”, col=”red”, lty=2)
points(pred_ci[,c(2)],type=”l”, col=”red”, lty=2)
legend(x=”bottomleft”,c(“Green= Actual GDP”,”Red= 95% CI”,”Blue=Forecast”), cex=0.90)

plot(series$CPIAUCSL, type=”l”, col=”darkgreen”, ylab=”% Change in CPI”, xlab=”Quarters (since 2000)”,main=”CPI forecast”)
points(pred[3], type=”l”, col=”blue”, lty=5)
points(pred_ci[,5],type=”l”, col=”red”, lty=2)
points(pred_ci[,6],type=”l”, col=”red”, lty=2)
legend(x=”bottomleft”,c(“Green= Actual GDP”,”Red= 95% CI”,”Blue=Forecast”), cex=0.90)

plot(series$U1RATENSA, type=”l”, col=”darkgreen”, ylab=”% Change in UNEMP”, xlab=”Quarters (since 2000)”, main=”UNEMP forecast”)
points(pred[5], type=”l”, col=”blue”, lty=5)
points(pred_ci[,8],type=”l”, col=”red”, lty=2)
points(pred_ci[,9],type=”l”, col=”red”, lty=2)
legend(x=”bottomleft”,c(“Green= Actual GDP”,”Red= 95% CI”,”Blue=Forecast”), cex=0.90)
summary(fit)

tail(pred)
pred<-rbind(0,rbind(0,pred))
pred_ci<-rbind(0,rbind(0,pred_ci))
tail(series[c(“CPIAUCSL”,”GDPC1″,”U1RATENSA”)])

 

Programming

Using “cbind” and “as.vector”: Computationally intensive commands

As perhaps a mere interesting note, cbind when combined with as.vector can be a particularly RAM-intensive set of commands. I noted the following script excerpt caused my computer to quickly consume  11GB of RAM on a 300k entry dataset. :

for(j in c(1:200)){
mod.out$coefficients$count[1:80]<-lim.my.df[1:80,j]
mod.out$coefficients$zero[1:80]<-lim.my.df[81:160,j]
a<-predict(mod.out,clip.1)
b<-predict(mod.out,clip.mean)
diff.j<-mean(a-b)
# diff[,paste(i,”.”,j)]<-diff.j
diff<-as.vector(cbind(diff,diff.j))
}

The purpose of this script is to use bootstrapped coefficients generate an average partial effect between clip.1 and clip.mean. We will later use this to get a estimate of the standard errors of the APE. As it stands, it eats all my RAM quite promptly and causes the computer to crash.  The following script, nearly identical, does not have this problem:

for(j in c(1:200)){
mod.out$coefficients$count[1:80]<-lim.my.df[1:80,j]
mod.out$coefficients$zero[1:80]<-lim.my.df[81:160,j]
a<-predict(mod.out,clip.1)
b<-predict(mod.out,clip.mean)
diff.j<-mean(a-b)
# diff[,paste(i,”.”,j)]<-diff.j
diff<-cbind(diff,diff.j)
}
diff<-as.vector(diff)

And this works just fine! In fact, it barely  consumes 25% of my RAM.

Programming

Basic programming in Linux

I’ve just written my first program in Linux(Ubuntu) using bash commands. I’d love to give more details about what’s happening and how I’m doing it, but basically I’m following instructions and then playing with the little that I know about it.  I’ll be writing a program in two parts here, basically just as a demonstration of what they do and how they work.

Here’s the first program. Put it into a text file and save it as “testprogram”.

#!/usr/bin/env bash
# A simple command in a shell script
name=”COMPUTER”
printf “Hello, world!\n”
printf “My name is \n”
printf “${name}! \n”
printf “What’s your name? \n”
read user_name
printf “I didn’t get that, is your name $name? \n”
read confirm
printf “What’s your name again? \n”
read user_name2
while [ $user_name = $user_name2 ]
do
printf “I heard your name is $user_name! No need to SHOUT!\n”
done

After doing that, navigate using “cd” and “ls” to the appropriate directory that you saved “testprogram” to.
Now, enter the command, this is the second part:

chmod a+rx testprogram

This will make the program “testprogram”, for All users (a), Readable and eXecutable (+rx).
You may now run the file by entering:

./testprogram

The program will produce the following output:

Hello World!
My name is COMPUTER!
What’s your name?
-prompt1-
I didn’t get that, is your name <returns prompt>?
-prompt2, irrelevant what you say here, yes, your name again, etc.-
What’s your name again?
-prompt3-
If your name in the first prompt is exactly the same as that of the third, it will scream:
I heard your name is <return prompt 1>! No need to SHOUT!
        It will repeat this until the program is terminated by the user.

I’m really pumped about this. This is part of my ongoing work at completing my dissertation (I need to make a standalone script to start the programs I want and pass arguments to it), as well as getting me to be a more useful part of the universe, so I’m always excited to get things done.

Programming, Statistics

Basic bootstrapping in R

I’ve been having some trouble determining how strata works in the boot() command.   My intuition says it should select from within each type of strata. But no guarantees!  I can always just type “boot” and read the whole command line for line… but this isn’t always easy to interpret without comments.
So here’s a quick test to make sure it does what I think it does.

library(boot) #load the package
x<-rep(c(0,100,0,1), each=20) #make a long list of 20 zeros,20 one hundreds, 20 zeros, 20 ones.pool<-matrix(x, ncol=2) #make that list into a 20 by 2 matrix, filling it downward.

pool #let’s look at it.
f<-function(pool,i){
mean(pool[i,1]) #mean of the first column of pool, using individuals i
} #create a function that uses (data, individuals)
boot(pool,f, R=500) #resample and perform that operation 500 times.  Has variation in output.
boot(pool,f, R=500, strata=pool[,2]) #resample and perform that operation -reselecting from within the strata on the rhs. Since all observations within each strata are identical, the standard deviation should be zero.

Here’s an interesting mistake that I made while creating this command.

f<-function(pool,i){
mean(pool[,1])
} #create a function that uses (data, individuals)
boot(pool,f, R=500) #Has no variation in bootstrapping.  Why? What’s up with that?

Answer: There’s no individuals. It just resamples the entire population of the first column of pooled 500 times, getting the final result as shown.

Programming, Statistics

Ordering and structuring text in STATA

Like most researchers, I have difficulty finding data. On occasion, when I manage to find data, it comes in a useless or meaningless format. For example, a local police department offered to print out all of its police reports for me for the last 15 years at $0.25 a page… which I would have to enter by hand.  I politely declined.  Here’s an example of sorting out city titles from a poorly formatted document- a short version of that document can be found here.

The first and most important thing is to have your data loaded.

clear all
use "location", clear

This loads the data and clears any extra junk STATA may have floating around.  The next thing to do is identify any sort of regular pattern.  The first identifiable patterns were that titles for cities began every 64 lines, and sometimes went for between 3 or 4 lines, before having another 64 lines of regular income reports (useless at this point) are posted. The second pattern is that they all are either villages, cities, or towns.  Below are some attempts at accomplishing this goal via making some sort of indicator every 64 lines (didn’t work), or marking every area that starts with CITY (inefficient). Don’t forget to put strings in quotation marks!

*Marks every line with a list that counts 0...63. I thought the 63rd would be useful, but because of the irregular length of the header, it's worthless.
*gen city_titles=mod(_n-1,64)
*Is the line saying city? If so, give it a mark, literally. 
*gen mark=cond(var1=="CITY","mark",".",".")

Notice that using (*) comments out the line in STATA, and (.) is STATA’s version of NA.  Here is a more successful attempt:

*Identify if it's probably a title
gen mark=1 if var1=="CITY"
replace mark=1 if var1=="TOWN"
replace mark=1 if var1=="VILLAGE"

Don’t forget the double equals!  In STATA, all if statements I have observed are followed by (==) instead of (=), and I still manage to forget them often!  Next, I will create 3 more indicators, aptly labeled mark_1 to mark_3, all showing if it follows the indicator above by 1, 2, or 3 variables. In programming, it is better to give clear, boring, brief names rather than unclear mess. I would rather have a long object name than an unclear short one, since you never know who is going to have to read your code:

*Is it following a title?
gen mark_1=mark[_n-1]
gen mark_2=mark[_n-2]
gen mark_3=mark[_n-3]

In retrospect, I could have perhaps consolidated those more efficiently. Does anyone else see how I could have done that in a single variable (mark) instead of having multiple indicators (mark_1…mark_3)?  Either way, I have some extra chaff I have to eliminate. having (.)’s in my data set is agitating. Let’s set those all as 0’s, and sum the remaining columns together in something called “total”, so I know if they have an indicator (1 or more) or not (0).  Then we can drop everything we don’t want!

*Eliminate non-numerical entries as numeric.
replace mark=0 if mark==.
replace mark_1=0 if mark_1==.
replace mark_2=0 if mark_2==.
replace mark_3=0 if mark_3==.

*Is it marked?
gen total=mark+mark_1+mark_2+mark_3

drop if total!=1

Bam!  Much better. We dropped somewhere around 80% of our unwanted junk right there! We still could never compile this all by hand, though.  Here, we want it to concatenate all our variables 1…11 in a downward direction.  So we make a local marco list called to_combine, since that’s what we’ll be doing with them.

local to_combine var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11

Below is a STATA style do-loop, a common element in meaningful programming everywhere.  We want to repeat for each and every variable in the “to_combine” list a series of operations designed to deal with a few regular problems we observed. Firstly, we want to fix an issue with names. You may notice “FOND DU LAC” takes up 3 columns rather than the intended 1, being as that is the name of a city/town/village/county in Wisconsin.  This is the first thing we fix, by dropping the spaces with “DU” and “LAC” and replacing all “FOND” entries with “FOND DU LAC”.  Such a procedure is repeated for other problems we noted. There are 300+ problems we found, so this is not a perfect fix yet, (as noted by entry v9/2 in the “after” image below), but it’s much better. The second thing we fix is putting “COUNTY” horizontally under “TOTAL”, since in the original document (pdf), we know that after an entry says “TOTAL”, it always says the total for the county, not for any other grouping.  Lastly, we want it to concatenate horizontally but stop before it hits any part of the income sheet.  Stata uses & for “and” and | for “or”. You may use both in long chains for a single if statement. This last was done by a bit of trial and error to make sure I got it right. That’s fine, since computers are cheap and thinking hurts.  It stops at the line containing every end of line indicator I could think of, like “TAXES”, “POPULATION”, or “REVENUES”, which are clearly not city titles.

foreach var in `to_combine'{

*We have a space problem (Part 1)
replace `var'="" if `var'=="DU" 
...snipped...
replace `var'="LA CROSSE" if `var'=="LA"

*Puts county under total.  It's a mandatory identifier visible in the pdf. (Part 2)
replace `var'="" if `var'=="COUNTY"
replace `var'="COUNTY" if `var'[_n-2]=="TOTAL"

*Concatenates appropriately, stopping if it hits anything
replace `var'=`var'[_n-1]+"."+`var' if var2[_n-1]!="POPULATION" & var1[_n-1]!="REVENUES" & var1[_n-1]!="TAXES" & mark_3[_n-1]!=1 & var2!="POPULATION" & var1!="REVENUES" & var1!="TAXES"

}

Now, let’s just drop the stuff we don’t need, and we’re looking at a much much cleaner worksheet, giving us the before/after image shown below:

*Drops excess rows or columns.
drop if mark==1
drop if mark_1==1
drop if mark_2==1
drop if var2=="POPULATION"
drop if var1=="REVENUES"
drop if var1=="TAXES"

Image