Programming, Statistics

The Optimum Pokemon Portfolio and Principal Component Decomposition (PCD) using R

I have very recently completed the Stanford Lagunita online course on Statistical Learning, and Tibshirani & Hastie have taught me a great deal about Principal Components.  No learning is complete without exercises, however, so I have found a wonderful data set that seems popular: the attacks and weaknesses of Pokemon.  (I am, admittedly, not a Pokemon player, so I have had to ask others to help me understand some of the intricacies of the game.)

Principal Component Decomposition:

First and foremost, principal component decomposition finds the direction that maximizes variation in the data.  This direction is the leading eigenvector of the data’s covariance matrix, and the matching eigenvalue measures how much variation lies along it.
For example, if there is a spill of dirt on a white tile floor, the first principal component (the leading eigenvector) points in the direction in which the dirt is most widely spread, and its eigenvalue tells you how far the dirt spreads in that direction.
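
As a minimal sketch of this idea, assuming simulated two-dimensional "spill" data rather than the Pokemon chart, the first principal component reported by prcomp() should line up with the leading eigenvector of the covariance matrix, and the eigenvalues should equal the variances along each component. All variable names here are illustrative.

```r
set.seed(1)
n <- 500
spill_x <- rnorm(n, sd = 3)                  # wide spread in one direction
spill_y <- 0.5 * spill_x + rnorm(n, sd = 1)  # narrower spread around it
spill <- cbind(spill_x, spill_y)

eig <- eigen(cov(spill))                     # eigenvectors = directions, eigenvalues = variances
pca <- prcomp(spill, center = TRUE, scale. = FALSE)

first_eigenvector <- eig$vectors[, 1]
first_pc <- pca$rotation[, 1]

# The two directions agree up to sign, so the absolute inner product is 1
alignment <- abs(sum(first_eigenvector * first_pc))
print(alignment)

# Each eigenvalue is the variance along the matching principal component
print(cbind(eigenvalue = eig$values, pc_variance = pca$sdev^2))
```

The sign of an eigenvector is arbitrary, which is why the comparison uses the absolute value.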

After looking at the beautiful charts used in the link above, I realized this would be a very interesting data set for a PCD. Which Pokemon types are most similar and which are most different in terms of strengths and weaknesses? To find out, we will break the data into its principal components and see in which directions it is spread out.

Pokemon can vary along 18 dimensions of strengths and weaknesses, since there are 18 types of Pokemon. This means there can be up to 18 principal components. We are not sure which principal components are useful without investigation. We show below how much variation is explained by each principal component. There doesn’t appear to be any clear point where the principal components drop off in their usefulness; perhaps the first 3 or the first 5 capture the most variation.
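
A minimal sketch of how the share of variation per component can be computed, assuming a small four-type excerpt of an effectiveness chart (rows attack, columns defend) in place of the full 18 by 18 data. The object names are illustrative, not the ones used in the post's code.

```r
types <- c("Fire", "Water", "Grass", "Electric")
# Illustrative attack multipliers: row type attacking column type
toy_chart <- matrix(c(0.5, 0.5, 2.0, 1.0,
                      2.0, 0.5, 0.5, 1.0,
                      0.5, 2.0, 0.5, 1.0,
                      1.0, 2.0, 0.5, 0.5),
                    nrow = 4, byrow = TRUE, dimnames = list(types, types))

toy_pca <- prcomp(toy_chart, center = TRUE, scale. = TRUE)

# Variance of each component divided by total variance
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_explained <- cumsum(var_explained)
print(round(rbind(var_explained, cum_explained), 3))
```

A drop-off point, if there is one, shows up as a row of var_explained values that falls sharply and then flattens.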
Let us now look at the principal components of the Pokemon attack/weakness chart directly.  We can visualize them in a biplot, where the arrows show the general attacking direction of each Pokemon type and the black labels show the defending types.  The distance from the center of the biplot shows how far that Pokemon type deviates from the average along the two plotted principal components.  Labels that are close together are more similar than those further apart.


So for example, Ghost attacks (arrows) are closely aligned with Ghost defence (black label) and Dark defence (black label).  In general, the Pokemon that are most different in defence are Fighting and Ghost, and these are in turn distinct from Flying and Ground defence.  This suggests that if you wanted a Pokemon portfolio that would be very resilient to attack, you would want Fighting/Ghost types.  If you want a variety of attacks, you might want to look into Ghost/Normal types or Grass/Electric.

Keep in mind that together these two components only explain about 35.5% of the variation of Pokemon types; there are other dimensions in which Pokemon vary.  I expected fire and water to be more clearly different (and they are very distinct, they go opposite directions for a long distance from the center!), but they are less distinct than ghost/normal.

The Optimum Pokemon Portfolio:

This led me to wonder what type of pokemon portfolio would be best against the world, something outside the scope of the Statistical Learning course but well within my reach as an economist.  Since I don’t know what the pokemon-world looks like, I assumed that each pokemon that shows up is of a type drawn uniformly at random. (This is a relatively strong assumption; it is likely the pokemon encounters are not evenly distributed among the types.)  The question is then, what type of pokemon should we collect to be the best against a random encounter, assuming we simply reach into our bag and grab the first pokemon we see to fight with?

First, I converted the matrix of strengths and weaknesses above into one that describes the strength-weakness gap. That is to say, if Water attacks Fire at 200% effectiveness, and Fire attacks Water at only 50% effectiveness, a fight between Water and Fire is +150% more effective for Water than a regular pokemon fight (say Normal against Normal or Ice against Ice). Any bonuses a pokemon may have against its own type were discarded, because the two sides cancel out.  The chart for this, much like the wonderful link that got me the data in the first place, is here, where red is bad and blue is good: [figure Rplot12]
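
The gap calculation can be sketched on a three-type excerpt, assuming rows are the attacking type and columns the defending type. The toy_chart object is illustrative; the formula is the same one applied to the full chart in the code at the end of this post.

```r
types <- c("Fire", "Water", "Grass")
# Illustrative attack multipliers: row type attacking column type
toy_chart <- matrix(c(0.5, 0.5, 2.0,
                      2.0, 0.5, 0.5,
                      0.5, 2.0, 0.5),
                    nrow = 3, byrow = TRUE, dimnames = list(types, types))

# Bonus when attacking minus bonus the opponent gets when attacking back
gap <- (toy_chart - 1) - (t(toy_chart) - 1)
print(gap)

# Water against Fire: (2.0 - 1) - (0.5 - 1) = 1.5, i.e. +150%
print(gap["Water", "Fire"])
```

The resulting matrix is antisymmetric, so every advantage for one type appears as an equal disadvantage for its opponent, and the diagonal (a type against itself) is zero.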

Then I averaged the strength-weakness gap across all 18 opponents for each type of pokemon, which assumes that the pokemon is facing an opponent of a random type.  According to this then, the most effective type of pokemon are on average:

Type                              Effectiveness
Steel                               0.22222222
Fire                                0.11111111
Ground                              0.11111111
Fairy                               0.11111111
Water                               0.08333333
Ghost                               0.08333333
Flying                              0.05555556
Electric                            0.00000000
Fighting                            0.00000000
Poison                             -0.02777778
Rock                               -0.02777778
Dark                               -0.02777778
Ice                                -0.08333333
Dragon                             -0.08333333
Normal                             -0.11111111
Psychic                            -0.11111111
Bug                                -0.11111111
Grass                              -0.19444444

That is to say, Steel pokemon, against a random opponent, will on average be 22% more effective.  (This is the mean, not the median.) And against a random opponent a Grass pokemon will be expected to be 19% less effective than a Fighting pokemon, shockingly low. Amusingly, Normal pokemon are worse than normal (0) against the average pokemon.

This does not mean you ONLY want Steel pokemon because you could come up with an opponent that is strong against Steel. Nor do you want to entirely avoid Grass pokemon, since they are very strong against many things that Steel is weak against. Merely that if you’re willing to roll the dice, a Steel pokemon will probably be your best bet.  Trainers do not want to take strong risks, trainers are risk averse.  You want to maximize your poke-payoff while minimizing how frequently you face negatively stacked fights. The equation for this is:

Maximize: \ \mu \cdot vars - \delta \cdot vars^{T} \cdot cov \cdot vars + \lambda (1 - ones^{T} \cdot vars) \ wrt. \ vars

Where \mu is your vector of payoffs in the table above, \delta is your risk aversion, cov is the covariance matrix of the differenced pokemon data set, ones is a vector of ones, \lambda is the multiplier on the adding-up constraint, and vars is your portfolio selection, which must add up to one hundred percent.
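
A minimal sketch of this optimization on three types, assuming made-up covariances and ignoring the requirement that weights be non-negative (the code at the end of the post handles that with solve.QP from the quadprog package). Setting the derivative of the objective to zero gives vars = cov^{-1} (\mu - \lambda * ones) / (2 * \delta), and \lambda is then chosen so the weights sum to one. The payoffs echo three rows of the table above; cov_toy is purely hypothetical.

```r
mu <- c(0.22, 0.11, -0.19)            # payoffs for three types (illustrative)
cov_toy <- matrix(c( 1.0, 0.3, -0.2,
                     0.3, 0.8,  0.1,
                    -0.2, 0.1,  1.2), nrow = 3, byrow = TRUE)  # hypothetical covariance
delta <- 1                            # risk aversion
ones <- rep(1, 3)

# The objective from the equation above, without the multiplier term
objective <- function(vars) {
  sum(mu * vars) - delta * drop(t(vars) %*% cov_toy %*% vars)
}

cov_inv <- solve(cov_toy)
# Multiplier that makes the weights add up to one
lambda <- (drop(t(ones) %*% cov_inv %*% mu) - 2 * delta) /
  drop(t(ones) %*% cov_inv %*% ones)
# First-order condition solved for the portfolio weights
vars <- drop(cov_inv %*% (mu - lambda * ones)) / (2 * delta)

print(round(vars, 3))
print(sum(vars))
print(objective(vars))
```

Because short positions are allowed in this sketch, some weights may come out negative; adding the non-negativity constraints is what turns the problem into the quadratic program solved in the post's code.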

How risk averse are you?  You could be very risk averse and want to never come across a bad pokemon to fight, or you could love rolling the dice and only want one type of pokemon. So I have plotted the optimal portfolio for many levels of risk-tolerance.  It is a little cluttered, so I have labelled them directly as well as in the legend.


The visualization is indeed a little messy, but as you become more risk averse, you add more Electric, Normal, Fire, and Ice pokemon (and more!) to help reduce the chance of a bad engagement.  In order to do this, we reduce the weight we put on Steel, Ground, and Fairy pokemon, but don’t eliminate them entirely.  Almost no portfolio adds Dragon, Ghost, Rock, or Bug pokemon; they are nearly completely dominated by other combinations of pokemon types.

I’ve plotted two interesting portfolios along the spectrum of risk aversion below. They include one with nearly no risk aversion (0.001), and one with high risk aversion (10).


Of course, most importantly of all, regardless of your Pokemon and your interest in being “the very best”, you should still pick the coolest Pokemon and play for fun.

Code is included below:

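#Data from:
#write.csv(chart, file="/home/bsweber/Documents/poke_chart.csv")
# library(devtools) 
# install_github("vqv/ggbiplot", force=TRUE)
library(ggbiplot) #biplots of prcomp objects
library(quadprog) #solve.QP for the portfolio problem
library(reshape2) #melt
library(ggplot2)  #qplot, ggplot
library(ggrepel)  #geom_text_repel
#Assumes the chart was saved with the write.csv call above, with the type names in the first column.
poke_chart <- as.matrix(read.csv("/home/bsweber/Documents/poke_chart.csv", row.names=1))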

differences <- (poke_chart-1) - (t(poke_chart)-1)
core <- poke_chart

poke_pcd<-prcomp(core, center=TRUE, scale=TRUE)
plot(poke_pcd, type="l", main="Pokemon PCD")

poke_palette<-c("#A8A878", "#EE8130", "#6390F0", "#F7D02C", "#7AC74C", "#96D9D6", "#C22E28", "#A33EA1", "#E2BF65", "#A98FF3", "#F95587", "#A6B91A", "#B6A136", "#735797", "#6F35FC", "#705746", "#B7B7CE", "#D685AD")

#Score plot is for rows, attack data. Loading plot is for columns, defense data.  So bug and fairy have similar attacks (shown by rays), similar defences (shown by points). Ghost and normal have almost identical defences, but different attacks.
#Components 1 and 2:
ggbiplot(poke_pcd, labels= rownames(core), ellipse = TRUE, circle = TRUE, obs.scale = 1, var.scale = 1) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')
#Components 2 and 3:
ggbiplot(poke_pcd, labels= colnames(core), ellipse = TRUE, circle = TRUE, obs.scale = 1, var.scale = 1, choices=c(2,3)) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')
#Components 5 and 6:
ggbiplot(poke_pcd, labels= colnames(core), ellipse = TRUE, circle = TRUE, obs.scale = 1, var.scale = 1, choices=c(5,6)) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')
#Components 7 and 8:
ggbiplot(poke_pcd, labels= colnames(core), ellipse = TRUE, circle = TRUE, obs.scale = 1, var.scale = 1, choices=c(7,8)) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')

cov_core<- t(differences-mean(differences)) %*% (differences-mean(differences)) #Make the Cov. Matrix of differences. 
cov_core[order(diag(cov_core), decreasing=TRUE),order(diag(cov_core), decreasing=TRUE)]
vars<-as.matrix(rep(1/18, times=18))
mu<-t(as.matrix(apply(differences/18, 1, sum))) #Average rate of return over 18 pokemon types.

data.frame(mu[,order(t(mu), decreasing=TRUE)]) #Table of Pokemon Types

out<- matrix(0, nrow=0, ncol=18)
deltas<- (1:1000)/100  #risk aversion grid. Assumed scaling: the line that varied delta inside the loop was lost from the post.
for(j in 1:1000){
  delta<- deltas[j]  #risk aversion parameter
  Dmat <- cov_core * 2 * delta
  dvec <- as.vector(mu)
  Amat <- cbind(1, diag(18)) #first column: weights sum to one. Remaining columns: each weight is non-negative.
  bvec <- c(1, rep(0, 18) )
  qp <- solve.QP(Dmat, dvec, Amat, bvec, meq=1)
  pos_answers <- qp$solution #optimal portfolio weights for this level of risk aversion
  out<-rbind(out, round(pos_answers, digits=3))
}
dimnames(out)<-list(deltas, colnames(core)) #rows: risk aversion, columns: pokemon type
df.melted <- melt(out)
colnames(df.melted)<-c("Risk_Aversion", "Pokemon_Type", "Amount_Used")
qplot(Risk_Aversion, Amount_Used, data=df.melted, color=Pokemon_Type, geom="path", main="Pokemon % By Risk Aversion") + 
   # ylim(0, 0.175) +
   scale_color_manual(values = poke_palette) +
#   geom_smooth(se=FALSE) +
   geom_text_repel(data=df.melted[df.melted$Risk_Aversion==8.5,], aes(label=Pokemon_Type, size=9, fontface = 'bold'), nudge_y = 0.005, show.legend = FALSE) 

  # Another plot that is less appealing
  # matplot(out, type = "l", lty = 1, lwd = 2, col=poke_palette)
  # legend( 'center' , legend = colnames(core), cex=0.8,  pch=19, col=poke_palette)
  pie(head(out, 1), labels= colnames(core), col=poke_palette) #least risk-averse portfolio
  pie(tail(out, 1), labels= colnames(core), col=poke_palette) #most risk-averse portfolio
  #Build the two data frames used below; keep the chart's column order so colors follow it, as in the pie charts.
  df_1<-data.frame(Pokemon_Type=factor(colnames(core), levels=colnames(core)), Percentage=out[1,])
  df_2<-data.frame(Pokemon_Type=factor(colnames(core), levels=colnames(core)), Percentage=out[nrow(out),])
  ggplot(data=df_1, aes(x=Pokemon_Type, y=Percentage, fill=Pokemon_Type))+
    geom_bar(stat="identity", position=position_dodge()) +
    scale_fill_manual(values = poke_palette)+
    ggtitle("Pokemon Portfolio With Almost No Risk Aversion")
  ggplot(data=df_2, aes(x=Pokemon_Type, y=Percentage, fill=Pokemon_Type))+
    geom_bar(stat="identity", position=position_dodge()) +
    scale_fill_manual(values = poke_palette) +
    ggtitle("Pokemon Portfolio With Very Strong Risk Aversion")
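  cov_core[order(diag(cov_core), decreasing=TRUE),order(diag(cov_core), decreasing=TRUE)]
  #Heat map of the net advantage matrix.
  melt_diff<-melt(as.matrix(differences)) #long format with columns Var1, Var2, value. Reconstructed: this step was missing from the post.
  melt_diff$value<- factor(melt_diff$value)
  N<-nlevels(melt_diff$value) #number of distinct net advantage values (assumed definition of N)
  simplepalette<-colorRampPalette(c("red", "grey", "darkgreen"))
  ggplot(data = melt_diff, aes(x=Var1, y=Var2, fill=value) ) + 
    geom_tile() + #assumed geom: the original layer was missing from the post
    scale_fill_manual(values=simplepalette(9), breaks=levels(melt_diff$value)[seq(1, N, by=1)], name="Net Advantage" )+
    ggtitle("Net Pokemon Combat Advantage")+
    xlab("Opponent") +
    ylab("Pokemon of Choice")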

Multiple Linear Regression in R


In the previous exercise: Why do we need N-2?, I show a simple 1 dimensional regression by hand, which is followed by an examination of sample standard errors.  Below I make more extensive use of R (and an additional package) to plot what linear regression looks like in multiple dimensions. This generates the images above, (along with several others).  This illustrates that linear regression remains flat even in N dimensions, the surface of the regression is linear in coefficients.
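
To make the "flat" claim concrete, here is a minimal sketch assuming simulated data with made-up coefficients (10, 3, and -3). It predicts the fitted surface on an evenly spaced grid and checks that the step between neighbouring grid points is the same everywhere, which is exactly what a plane looks like.

```r
set.seed(2343)
n <- 25
x_1 <- rnorm(n)
x_2 <- rnorm(n)
y <- 10 + 3 * x_1 - 3 * x_2 + rnorm(n)   # assumed true relationship, for illustration only
fit <- lm(y ~ x_1 + x_2)

# Evenly spaced grid with step 1 in each direction; x_1 varies fastest
grid <- expand.grid(x_1 = seq(-2, 2, by = 1), x_2 = seq(-2, 2, by = 1))
surface <- matrix(predict(fit, newdata = grid), nrow = 5, ncol = 5)

# Steps along x_1 (down the rows) and along x_2 (across the columns)
step_x_1 <- diff(surface)
step_x_2 <- t(diff(t(surface)))

# Every step equals the matching slope coefficient, so the surface never bends
print(range(step_x_1) - coef(fit)["x_1"])
print(range(step_x_2) - coef(fit)["x_2"])
```

Both printed ranges should be zero up to rounding error, no matter how the true data were generated.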

As a class exercise, I ask that you consider different pairs of explanatory variables that are functions of one another. What happens if the function is linear? What happens if the function is nonlinear, for example, cos(x_1)=x_2? Examine what happens to the surface of your regression as compared to the shape of the relationship you are investigating.  Is there a way you can contort the regression estimate into a curved surface to better match?  Why or why not?

install.packages("plot3D") # we need 3d plotting
library("plot3D", lib.loc="~/R/win-library/3.1") #Load it into R's current library, may vary by computer.

set.seed(2343) #ensures replication. Sets seed of random number generators.
n<-25 #number of samples
x_1<-rnorm(n) #Our x's come from a random sampling of X's.
x_2<-rnorm(n) #Assumed: the second regressor is drawn the same way. Its definition was missing from the post.
b_0<-10 #Assumed intercept. Its definition was missing from the post.
b_1<-3 #Those cursed jello puddings are associated with increased crime. Linear regression is supportive of association- not causation.
b_2<-(-3) # But student transit programs are associated with a decline in crime.
u<-rnorm(n) #Assumed: standard normal errors, as in the one dimensional exercise.
y<-b_0+b_1*x_1+b_2*x_2+u #This is defining our true Y. The true relationship is linear.

#look at data in each dimension
#look at data overall
points3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=5) #look at data. phi/theta is tilt.

fit<-lm(y~x_1+x_2)  #fit it with a linear model, regressing y on x_1, x_2

#Make a surface
x_1.pred <- seq(min(x_1), max(x_1), length.out = n)
x_2.pred <- seq(min(x_2), max(x_2), length.out = n)
xy <- expand.grid(x_1=x_1.pred, x_2=x_2.pred)
y.pred <- matrix (nrow = n, ncol = n, data = predict(fit, newdata = data.frame(xy))) #fitted values only: asking for prediction intervals would return three columns, which do not fit an n by n matrix

summary(fit) #view output of variables.

fitpoints<-predict(fit)  #get predicted points, needed to make a surface.

scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=5 , surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints)) #look at data. phi/theta is tilt.
scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=45, surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints)) #From straight on it is a flat plane, residuals are highlighted
scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=30, surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints)) #From other angles it is clear it is somewhat straight.
scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=60, surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints)) #look at data. phi/theta is tilt.


Programming, Statistics, Teaching Materials

Why do we need n-2? An example in R

Below is a simple example showing why we may want (\Sigma \hat{u}^2_i )/ (n-2) as our estimate of \large \sigma^2, when our naive intuition may suggest we only want the simple average of squared residuals (\Sigma \hat{u}^2_i )/ (n).

To show this in no uncertain terms, I have coded a linear regression by hand in R.  Also embedded in the work below are several rules I follow about writing code. They are rules 0-6.  There are many other rules, since code writing is an art.

####Coding in R
#### Rule 1: Always comment on every few lines of code. It is not unheard of to comment every single line, particularly for new coders, or complex code.
#### You will need to reference your work at a later date, and after about 3 months, the purpose is lost. Also, I need to read it.

#### Rule 2: Define your variables first. Luckily these names are shared for us.
#### For your projects, use names which are clear for your research: (y=crime in Williamsburg, VA, X= Number of jello puddings consumed)

set.seed(1223) #ensures replication. Sets seed of random number generators.
n<-25 #number of samples
x<-2*rnorm(n) #Our x's come from a random sampling of X's.
b_0<-10 #Assumed intercept: its definition was missing from the post, and the estimate of 9.89 reported below suggests a value near 10.
b_1<-3 #Those cursed jello puddings are associated with increased crime. Linear regression is supportive of association- not causation.
u<-rnorm(n) #We satisfy both the mean independence and zero mean assumptions
y<-b_0+b_1*x+u #This is defining our true Y. The true relationship is linear.

plot(x,y) #Rule 0, really. Always check your data.

#### Rule 3: After definitions begin your second stage of work. Probably trimming existing data, etc. Do these in the order they were added.
hat_b_1<-sum( (x-mean(x)) * (y-mean(y)) ) / sum( (x-mean(x))^2 ) #Spaces between any parenthesized section of operations. We need to be able to see which parentheses are which.
    hat_b_1 # Rule 4: Indent work which is conceptually subordinate. Indent more as needed. Four spaces=1 tab.
hat_b_0<-mean(y) - hat_b_1*mean(x) #OLS intercept: the fitted line passes through the point of means.
    hat_b_0 # Rule 5: Check your work as you go along. For our example, I got 9.89

abline(a=hat_b_0, b=hat_b_1, col="red") #let's add a red line of best fit. And we must see how our plot looks. Repeat rule 0.

hat_u<-y - hat_b_0 - hat_b_1*x #residuals: observed y minus fitted y.
plot(x,hat_u) # Let's see our residuals
hist(hat_u) # Let's see our histogram

#### Rule 6: Keep your final analysis as punchy and short as possible without sacrificing clarity.
#### The mean sum of the squared errors (usually unknown to us as researchers)
sigma_sq<-sum(u^2)/n #this is the value we’re trying to estimate
sigma_sq_naive<-sum(hat_u^2)/n #this is a naive estimation of it
sigma_sq_hat<-sum(hat_u^2)/(n-2) #this turns out to be more accurate, particularly in small samples. If n->infinity this goes away. Try it for yourself!

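#R, is this assessment true? Is sigma_sq_hat a better estimator of sigma_sq than our naive estimator? Is it true we need the (-2)?
#Compare absolute errors: the naive estimate is always the smaller of the two, so a signed comparison would be TRUE for every sample.
abs(sigma_sq-sigma_sq_naive) > abs(sigma_sq-sigma_sq_hat)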

Here is one of several plots made by this code, showing a nice linear regression over the data:


Please don’t forget the derivation of why this is true!  This is simply some supportive evidence that it might be true.
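
A single sample can land either way, so a slightly stronger piece of supportive evidence is to repeat the experiment many times. The sketch below assumes the same setup as above with an intercept of 10 (an assumed value) and a true error variance of 1, and averages both estimators over 5000 simulated samples. In theory the corrected estimator averages to 1 and the naive one to (n-2)/n = 0.92.

```r
set.seed(1223)
n <- 25
reps <- 5000
b_0 <- 10   # assumed true intercept
b_1 <- 3    # true slope; the true error variance is 1

one_draw <- function() {
  x <- 2 * rnorm(n)
  u <- rnorm(n)
  y <- b_0 + b_1 * x + u
  # OLS by hand, as in the code above
  hat_b_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  hat_b_0 <- mean(y) - hat_b_1 * mean(x)
  hat_u <- y - hat_b_0 - hat_b_1 * x
  c(naive = sum(hat_u^2) / n, corrected = sum(hat_u^2) / (n - 2))
}

estimates <- replicate(reps, one_draw())  # 2 x reps matrix
avg <- rowMeans(estimates)
print(avg)
```

The naive average sits below the true variance because two degrees of freedom were spent estimating the intercept and slope.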


Reddit Button Forecast (Post-Mortem*: 5/25)

The forecast correctly predicted the button would hit zero on May 23rd.  I see that the timer did not expire as a result of low latency “zombie” botting, but we have a 0.00 timer displayed on the day I was predicting. The final forecast was set on May 16th, seven days earlier.  I am pleased that the forecast was accurate so many days out, I was worried about posting a flurry of +3 and +2 day forecasts before it ended. While a detractor would say my forecast did update and change with new data, the technique remained comfortably in the ARIMA family each time. Most importantly, the only forecast to expire was correct, seven days ahead.

I will not continue to manually update the forecast given that the lower bound of zero has been reached.  Keep an eye out for code to be released.

Below is the final forecast, announced 5/16:

The current forecast suggests the button will end on the 23rd.  On a positive note, the forecasts have generally been getting shorter as I have updated them. The first forecast was +14 days, then +12, +13, and +8. This current one is +7.   I am still hoping to get two forecasts which both suggest the same day, but I imagine the trends are changing over time faster than my technique allows.

The grey confidence interval suggests that the timer will probably never be over 30 seconds for any 10 minute period again.  If you want a badge indicating less than 30 seconds, you will always be able to find it if you wait for, at most, 10 minutes, even during peak hours. Very few people will have to wait that long.


Each cycle or wave in the graph is approximately one day, representing the daily cycle of activity on the button, high in the afternoons, low in the late nights/mornings. There is also a slight weekly cycle, but it is not easy to notice in the plot.

The button’s values have partially stabilized with a pretty persistent protection around the red barrier, but there is still some noticeable decay between the 4000 and 6000 period marks. I have continued to use an indicator for the lowest observed badge color to help soften the impact of the early periods, when it was impossible to get a low badge color due to the frequency of clicking. We are now in a period where it is demonstrably possible to get red, with patience and discipline- we have observed red badges occur. Using the lowest observed badge color as a variable allows us to separate out this current period from earlier ones where the data was less descriptive of the current state.

Out of the grey collection of possible futures highlighted, it looks like the button is declining steadily, and the general future looks rather grim. The upper line of the grey 75% confidence interval is below 30 seconds, suggesting that the timer will not be kept at over half for a full 10 minutes ever again. [This prediction did not end up to be true.]  I note that the existence of a good forecast means that the red guard can simply pay extra close attention to the period in which they think it will end, and this forecast might actually extend the life of the button. Maybe.


First, I downloaded the second-by-second data at about 5/16 at 12:00pm CST from here. To ease the computational load and reduce unwanted noise in the forecast, the  4+ million data points were aggregated from seconds into intervals of ten minutes each. I examine only the lowest value of the timer, since the topic of interest is when the timer hits zero. (This strikes me as somewhat ad hoc, because the distributions of minimums are likely non-normal, they would be from an extreme value distribution.) Below is a plot of the ten minute minimums for the button. Each cycle is about a day, and there appears to be a weekly cycle that is very slight.
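
The aggregation step can be sketched in base R, assuming one made-up hour of second-by-second timer readings. The code at the end of this post does the same job with xts endpoints() and period.apply(); tapply() is swapped in here only to keep the example self-contained.

```r
set.seed(42)
seconds <- 0:3599                                   # one hypothetical hour, one row per second
seconds_left <- sample(20:60, length(seconds), replace = TRUE)  # made-up timer readings
seconds_left[1234] <- 7                             # one unusually low reading

# Index of the ten-minute interval (600 seconds each) that every second falls into
interval <- seconds %/% 600

# Lowest timer value observed within each interval
ten_min_min <- tapply(seconds_left, interval, min)
print(ten_min_min)
```

The single low reading dominates its interval, which illustrates why a series of minimums behaves more like an extreme value process than a normal one.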

I exclude any period where the timer was not well recorded for technical reasons, which has helped return the forecast to normal after the “great button failure”. I am much more confident in this current forecast as a result. New to this forecast, I have also added a dummy indicator for the lowest badge observed. It began as purple, and then slowly slid to red. We are in a post-red period, but when the button began, we had only seen purple. The structure of the model ought to reflect that. This significant set of variables suggests that the button’s lowest observed value in a 10 minute period is sinking at an accelerated pace compared to the early stages of the button.
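
A minimal sketch of the lowest-observed-badge indicator, assuming a short made-up series of ten-minute minimums. As in the code at the end of this post, the badge class is the running minimum divided by ten and rounded down; here the running minimum is lagged by a single period, which is a simplification of the longer lag used there.

```r
ten_min_min <- c(58, 55, 47, 49, 36, 41, 28, 33, 19, 25, 9, 14)  # hypothetical ten-minute minimums

# Lowest value seen up to and including each period
lowest_so_far <- cummin(ten_min_min)

# Lag by one period so each row only uses what was observed before it
lowest_before <- c(ten_min_min[1], head(lowest_so_far, -1))

# Badge class: 5 for the 50s, 4 for the 40s, and so on down to 0
badge_class <- factor(floor(lowest_before / 10))

# One dummy column per class, dropping the intercept column
badge_dummies <- model.matrix(~ badge_class)[, -1, drop = FALSE]
print(cbind(ten_min_min, lowest_before, badge_dummies))
```

Because the running minimum can only fall, the dummies switch on in sequence and never switch back, which is what lets them separate the early periods from the post-red period.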

Then, I estimate the process using ARIMA(1,1,1) and weekly, daily, and hourly Fourier cyclical components. I include one pair of sin() and cos() waves to catch any cyclical trends in weeks, days, or hours. This is roughly the same technique I used to predict the next note in Beethoven’s Symphony, which worked with 95+% accuracy. They tend to fit very well, and in fact, I am often shocked by how effective they are when used correctly.
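
The shape of this model can be sketched with base R's arima(), assuming a simulated series with only a daily cycle rather than the button data. The names X, S1, and C1 are illustrative, and the real model also carries weekly and hourly pairs plus the badge dummies as regressors.

```r
set.seed(516)
period <- 144                      # ten-minute periods per day
n <- 5 * period                    # five hypothetical days
t_index <- 1:n

# One sine/cosine pair for the daily cycle
X <- cbind(S1 = sin(2 * pi * t_index / period),
           C1 = cos(2 * pi * t_index / period))

# Simulated series: daily cycle plus an integrated ARMA(1,1) disturbance
noise <- cumsum(as.numeric(arima.sim(model = list(ar = 0.3, ma = 0.2), n = n)))
y <- 40 + 8 * X[, "S1"] + 3 * X[, "C1"] + noise

# ARIMA(1,1,1) with the Fourier terms as external regressors
fit <- arima(y, order = c(1, 1, 1), xreg = X, method = "ML")
print(coef(fit))

# Forecast one day ahead; the regressors must be extended over the forecast horizon
t_future <- (n + 1):(n + period)
X_future <- cbind(S1 = sin(2 * pi * t_future / period),
                  C1 = cos(2 * pi * t_future / period))
fc <- predict(fit, n.ahead = period, newxreg = X_future)

# 75% interval, matching the grey band described above
upper75 <- fc$pred + qnorm(0.875) * fc$se
lower75 <- fc$pred - qnorm(0.875) * fc$se
```

Since sine and cosine are known for any future date, the seasonal regressors never need to be forecast themselves, which is a large part of why this approach is convenient.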

Below I show how well the past data fits with our model.  This type of extreme fit is typical when ARIMA sequences are applied correctly, and only shows that I do seem to fit the past reasonably well. I check this plot to ensure that my forecast does not predict impossible amounts and it stays between 0-60 for our past data.

The fit appears to be very good, better than prior weeks, suggesting my model is better now that I have included lowest observed badge. There are few periods where the forecast is very off base.  (I am not sure why the last line spikes up so much, I would like to take a careful look at the code to see what’s going on, that spike is not part of ARIMA and therefore is a problem within my forecast, likely involving the very last period.)

Below, I show the errors of the forecast above. At this scale it is clear there are a few days where my model misjudges the fit. I am unsurprised by this, given that I have so many observations, but I am disappointed that some intervals are incorrectly predicted by 20 seconds or more. This is the cost of estimation, perhaps.


On to more technical details.  My process looks at its own previous values to make a forecast. I need to make sure that my sequence is not missing critical associations.  Let us see how well the past values are associated with a current one. Big lines mean big association. We call plots of these correlations the ACF and PACF.  I plot them below. They suggest our fit is relatively well done. (They fall mostly within the blue lines for many/most steps, the first of the ACF is excluded, because the current value is 100% equal to itself.) For these steps that are outside of the blue in the PACF, I doubt the sequence has 25 lags or leads, and such things are not quickly calculable on a home computer anyway, so I am going to reject them as a possibility. Adding too many terms and over-fitting the data would be equally unwise.
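
A minimal sketch of this check, assuming white-noise residuals as a stand-in for the model's actual residuals. The blue lines in R's default ACF and PACF plots sit at roughly plus and minus 1.96 divided by the square root of the sample size, so the check amounts to counting how many lags poke outside them.

```r
set.seed(7)
res <- rnorm(1000)                          # stand-in for model residuals

acf_res <- acf(res, lag.max = 25, plot = FALSE)
pacf_res <- pacf(res, lag.max = 25, plot = FALSE)

# Approximate 95% bound under the hypothesis of no autocorrelation
bound <- qnorm(0.975) / sqrt(length(res))

lag0 <- acf_res$acf[1]                      # a series is perfectly correlated with itself
outside_acf <- sum(abs(acf_res$acf[-1]) > bound)   # lag 0 excluded
outside_pacf <- sum(abs(pacf_res$acf) > bound)

print(c(lag0 = lag0, outside_acf = outside_acf, outside_pacf = outside_pacf))
```

Even for pure noise, about one lag in twenty is expected to fall outside the bound by chance, so a handful of small excursions is not by itself a sign of a missing term.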

[Figures: ACF plot (5_16_acf) and PACF plot (5_16_pacf)]

I avoid looking at the Augmented Dickey-Fuller Test because I am looking at minimums, and therefore have concerns about the normality of the errors, but have considered it.

Commentary on Other Prediction Types and Styles:

Some are attempting to use the remaining number of greys. I am currently not encouraged that this approach is good. I note that the count of remaining greys appears to be largely insignificant in predicting the next lowest value of the button. (I have tried to include it in a variety of ways, including natural logs, and it did not influence the prediction.) I conclude from this that the number of greys is largely irrelevant. I suspect that a portion of the greys are pre-disposed to click, and this proportion of “click eventually” vs “never-click” matters more than the total number of greys, but I suspect this proportion fluctuates dramatically from minute to minute and I cannot isolate what the true proportion is without serious adjustment in my technique.

Some are attempting to predict the button failure by a clicks/minute approach, which I am intrigued by, but I have not investigated this closely as an approach.

I note that I have some reservations about the asymptotic validity of my estimators. I am investigating these currently.

Historical Forecasts

To see how my forecast changes, and in the interest of full disclosure, I will keep tabs of my past estimates and note how additional data improves or worsens these estimations.

Current Update (5/16) – May 23rd. Noting that previous updates have generally shrunk the distance to button failure: +14 days, then +12, +13, and +8. This current one is +7, within a week.

Updated Badge Technique (5/11) – May 19th. New Technique Added: Used lowest observed badge color to help separate out the pattern of early periods (the purple, blue, orange periods) from the late patterns (the post-red period).

Revisiting the Forecast (5/3) – May 16th.

Update: After Button Failure (4/27) – May 9th.

Great Button Failure Update (4/26) – May 28th, likely in error.

Initial Forecast (4/24) – May 8th.


#Load Dependencies


library("zoo", lib.loc="~/R/win-library/3.1")
library("xts", lib.loc="~/R/win-library/3.1")
library("lubridate", lib.loc="~/R/win-library/3.1")
library("forecast", lib.loc="~/R/win-library/3.1")

#Source of data:
#button<-... (the call that loaded the data into button was lost from this post)
button$time<-as.POSIXct(button$now_timestamp, origin="1970-01-01") #taken from
#Surprisingly, this feeds several periods of wrong time for just shy of 720 seconds. They are all zero.
#I must manually input a minimum for the button- prior to button time hitting zero, there had been false zeros, for lack of a better word.

#First there is the missing data. There is the periods between clicks where the timer clicks down by 1 second, and actually missing data. The ticking down periods are irrelevant because every click always happens at a local minimum.
#Get opening and closing time to sequence data.
all.dates<-seq(time.min, time.max, by="sec")
#merge data into single data frame
data<-merge(all.dates.frame, button,all=FALSE)

#I trust that I did this correctly. Let us replace the button data frame now, officially.

#let us collapse this
#Need objects as xts:
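button_xts<-button_xts['2015/'] #2015 to end of data set. Fixes odd error timings.
t<-10 #interval length in minutes. Assumed from the ten-minute intervals described in the post: this assignment was garbled in the post, which also mentions 15 min.
end<-endpoints(button_xts,on="seconds",t*60) # t minute periods #I admit end is a terrible name.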
col1<-period.apply(button_xts$seconds_left,INDEX=end,FUN=function(x) {min(x,na.rm=TRUE)}) #generates some empty sets
col2<-period.apply(button_xts$participants,INDEX=end,FUN=function(x) {min(x,na.rm=TRUE)})

# we will add a lowest observed badge marker.
min_badge<-numeric(length(button_xts$seconds_left)) #holds the lowest badge class seen so far
for(i in 1:length(button_xts$seconds_left)){
min_badge[i]<-floor(min(button_xts$seconds_left[1:(max(c(i-60/t,1)))])/10) #lowest badge seen yesterday is important.
}
#let's get these factors as dummy variables.

#Seasons matter. I prefer Fourier Series:
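fourier<-function(t,terms,period) { #argument names are assumed: the function header was lost from the post
n<-length(t)
X<-matrix(NA_real_,nrow=n,ncol=2*terms) #one sin and one cos column per term
for(i in 1:terms) {
X[,2*i-1]<-sin(2*pi*i*t/period)
X[,2*i]<-cos(2*pi*i*t/period)
}
colnames(X)<-paste(c("S","C"),rep(1:terms,rep(2,terms)),sep="")
return(X)
}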
regressors<-data.frame(hours,days,weeks,badge_class[,2:dim(badge_class)[2]]) #badge_class[,2:dim(badge_class)[2]] #tried to use particpants. They are not significant.

#automatically chose from early ARIMA sequences, seasonal days, weeks, individual badge numbers are accounted for as a DRIFT term in the ARIMA sequence.
png(filename="~/Button Data/5_20_acf.png")
png(filename="~/Button Data/5_20_pacf.png")

#Let's see how good this plot is of the hourly trend?
t.o.forecast<-paste("Prediction starts at: ", date(),sep="")
png(filename="~/Button Data/5_20_historical.png")
plot(fitted(reg), main="Past Values of Button", xlab="Time (in 10 minute increments)", ylab="Lowest Button Time in 10 minute Interval", ylim=c(0,60))
png(filename="~/Button Data/5_20_error.png")
plot(res, main="Error of Forecast", xlab="Time (in 10 minute increments)", ylab="Error of Forecast Technique on Past Data")
png(filename="~/Button Data/5_20_overlay.png")
plot(fitted(reg), main="Past Values of Button overlaid with Forecast",xlab="Time (in 10 minute increments)", ylab="Lowest Button Time in 10 minute Interval", ylim=c(0,60))

#forecast value of button:
#size of forecast
n<-7*24*60/t #number of t-minute periods in one week
viable<-(dim(regressors)[1]-n):dim(regressors)[1] #gets the last week.

a=as.Date(seq(start, by="15 min",length.out=length(f_cast)))

png(filename="~/Button Data/5_20_forecast.png")
plot(forecast_values,ylim=c(0,60), main="Lowest Button Time In Every 10 minute Period", ylab="10 minute Minimum of Button", xlab="Number of 10 minute Periods Since Button Creation")
footnotes<-paste("Timer Death in about 4 weeks. Prediction starts at ", date(),". 75% CI in Grey.",sep="")

Programming, Statistics

An Example of Plotting Multiple Time Series (Stock Values) on a Graph in R

I am currently in the process of designing a portfolio to manage investments. While such programs are not best plastered over the internet, a few basic concepts about plotting can be displayed.  For example, I have created a rather appealing plot, which demonstrates how to plot multiple series in a single plot, shown below:
Code is below, including my process to detrend the data. The critical lines are in bold, highlighting the fact that you can use sample(colors()) to select from the body of colors at random. This is useful when you may have to generate many plots, potentially without greatly detailed manual supervision, and you are not demanding publication-quality color selection (which is plausible for personal investigative use).
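
Since most of the surrounding data handling is specific to my portfolio, here is a minimal self-contained sketch of the plotting idea alone, assuming five made-up series of daily price changes. The first series sets up the axes with plot(), and each remaining series is added with lines() in a color drawn at random from colors().

```r
set.seed(2015)
n_days <- 250
n_series <- 5

# Hypothetical daily price changes for five assets, one column each
changes <- matrix(rnorm(n_days * n_series), nrow = n_days, ncol = n_series)

# One randomly chosen named color per series
series_cols <- sample(colors(), size = n_series)

# The first series creates the plot; ylim covers every series so none is clipped
plot(1:n_days, changes[, 1], type = "l", col = series_cols[1],
     ylim = range(changes), xlab = "Day", ylab = "Change in price",
     main = "Several series on one plot")

# Remaining series are layered on top
for (k in 2:n_series) {
  lines(1:n_days, changes[, k], col = series_cols[k])
}
```

Random colors occasionally come out too pale to read against a white background, which is the price paid for not supervising each plot by hand.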

#after obtaining closing prices, you should make sure you clean your inputs. Ensure you know why there are NA’s, or you will make a critical error of omission.



data<-diff(data, lag=1, differences=1)

#Check for any remaining trends in data over and above the natural cyclical or time-trending motion of the stocks!
#Detrend based off of the bond, a necessary part of even a basic CAPM portfolio
detrended<-data-xhat #also, norm.
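for(n in 2:N){
  #add each remaining series to the existing plot in a randomly chosen named color
  #(index() comes from the zoo package)
  lines(index(detrended),detrended[,n], col=sample(colors(),size=1))
}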



Music and Math

Many people claim there is a strong correlation between music and math.
Below, I demonstrate that the patterns in music are NOT well predicted by typical statistical approaches.

I have taken a MIDI file of Beethoven’s 5th and analyzed the track using several statistical estimation techniques. These included panel data techniques, ARMA, and extensive non-parametric estimation (polynomial and Fourier series to capture cyclical components). I then use the song’s notes and my estimates to forecast the following notes, and play the “forecasted song”.  (I note that there has been a lot of recent development in this area and other techniques have been developed and popularized since I wrote this post.)

After listening, it is clear that the “forecasted song” does not match the original well. As a consequence, I can state that the mathematical techniques common to forecasting do not predict a song well.  Below are several attempts which I have highlighted:

The R-squared values for these estimations are in fact VERY high, in the high 90s. (Only a few of the coefficients are significant; the data is clearly overfitted in some regressions.) The forecasted song in fact falls into the so-called uncanny valley: it deviates only slightly from the actual Beethoven’s 5th. However, the ear is strongly attuned to perfection in music, and the errors are devastating to us.


Accidental Art 2

My earlier paper on phased entry, mentioned in “Accidental Art”, has been postponed. This postponement is a fortunate side effect of my successful publication in Regional Science and Urban Economics, which has promptly propelled me toward crime economics rather than phased entry. Currently I’m doing some basic modeling on counter-terrorism, which makes me feel like a criminal mastermind. Amusingly enough, there’s still some beautiful accidental art being churned out by my model, this time in Matlab!

[Images: accidental art 2, accidental art 1]


Using “cbind” and “as.vector”: Computationally intensive commands

As perhaps a mere interesting note, cbind when combined with as.vector can be a particularly RAM-intensive set of commands. I noted that the following script excerpt caused my computer to quickly consume 11GB of RAM on a 300k entry dataset:

for(j in c(1:200)){
# diff[,paste(i,".",j)]<-diff.j

The purpose of this script is to use bootstrapped coefficients to generate an average partial effect (APE) between clip.1 and clip.mean. We will later use this to get an estimate of the standard errors of the APE. As it stands, it eats all my RAM quite promptly and causes the computer to crash.  The following script, nearly identical, does not have this problem:

for(j in c(1:200)){
# diff[,paste(i,".",j)]<-diff.j

And this works just fine! In fact, it barely consumes 25% of my RAM.


Basic programming in Linux

I’ve just written my first program in Linux (Ubuntu) using bash commands. I’d love to give more details about what’s happening and how I’m doing it, but basically I’m following instructions and then playing with the little that I know about it.  I’ll be working in two parts here, basically just as a demonstration of what the commands do and how they work.

Here’s the first program. Put it into a text file and save it as “testprogram”.

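#!/usr/bin/env bash
# A simple command in a shell script
name="COMPUTER"  # the computer's name, as it appears in the output below
printf "Hello, world!\n"
printf "My name is %s!\n" "$name"
printf "What's your name? \n"
read -r user_name
# Echo back the user's answer (not the computer's name)
printf "I didn't get that, is your name %s? \n" "$user_name"
read -r confirm
printf "What's your name again? \n"
read -r user_name2
# Repeats for as long as the two answers match; stop it with Ctrl+C
while [ "$user_name" = "$user_name2" ]
do
  printf "I heard your name is %s! No need to SHOUT!\n" "$user_name"
done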

After doing that, navigate using “cd” and “ls” to the directory where you saved “testprogram”.
Now for the second part, enter the command:

chmod a+rx testprogram

This will make the program “testprogram” Readable and eXecutable (+rx) for All users (a).
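
As a minimal sketch of what that permission change does, the snippet below writes a throwaway script into a temporary directory, checks whether it is executable, applies the same chmod a+rx, and checks again. The names demo_dir and demo_script are assumptions made up for this illustration; they are not part of the program above.

```shell
#!/usr/bin/env bash
# Sketch: watch a file become executable after chmod a+rx.
# demo_dir and demo_script are hypothetical names used only for this demo.
demo_dir=$(mktemp -d)
demo_script="$demo_dir/demo_script"

# Write a tiny script; a file created by redirection starts without execute permission.
cat > "$demo_script" <<'EOF'
#!/usr/bin/env bash
printf "Hello, world!\n"
EOF

before="no"
if [ -x "$demo_script" ]; then before="yes"; fi

chmod a+rx "$demo_script"   # a = all users, +rx = add read and execute

after="no"
if [ -x "$demo_script" ]; then after="yes"; fi

run_output=$("$demo_script")   # the file can now be run directly by its path

printf 'Executable before chmod: %s\n' "$before"
printf 'Executable after chmod:  %s\n' "$after"
printf 'Output when run: %s\n' "$run_output"

rm -rf "$demo_dir"
```

Running ls -l on the file after the chmod would show the r and x bits set for owner, group, and others.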
You may now run the file from that directory by entering:

./testprogram

The program will produce the following output:

Hello, world!
My name is COMPUTER!
What’s your name?
I didn’t get that, is your name <your answer to the first prompt>?
(Second prompt: it is irrelevant what you type here.)
What’s your name again?
If your answer to the third prompt is exactly the same as your answer to the first, it will scream:
I heard your name is <your answer to the first prompt>! No need to SHOUT!
It will repeat this until the program is terminated by the user.
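
The shouting depends on a single string comparison. As a rough sketch, that comparison can be pulled into a function and tried without typing at any prompts. The function name shout_if_same and the max_repeats limit are assumptions added for illustration; the program described above has no such limit and repeats until interrupted.

```shell
#!/usr/bin/env bash
# Sketch: the name comparison as a non-interactive function.
# shout_if_same and max_repeats are hypothetical, not part of the original program.
shout_if_same() {
  local first="$1" second="$2" max_repeats="${3:-3}"
  local count=0
  # [ a = b ] tests string equality; the quotes guard against
  # empty names and names containing spaces.
  while [ "$first" = "$second" ] && [ "$count" -lt "$max_repeats" ]; do
    printf 'I heard your name is %s! No need to SHOUT!\n' "$first"
    count=$((count + 1))
  done
}

# Matching names: the message is printed max_repeats times.
same_output=$(shout_if_same "Ada" "Ada" 2)
# Different names: the loop body never runs, so nothing is printed.
diff_output=$(shout_if_same "Ada" "Grace" 2)

printf '%s\n' "$same_output"
```

Passing the names as arguments rather than reading them with read makes the behaviour easy to check from another script.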

I’m really pumped about this. This is part of my ongoing work at completing my dissertation (I need to make a standalone script to start the programs I want and pass arguments to it), as well as getting me to be a more useful part of the universe, so I’m always excited to get things done.