
# Multiple Linear Regression in R

[Slideshow: 3D scatter plots of the data and fitted regression plane, viewed from several angles.]

In the previous exercise, "Why do we need n-2?", I show a simple one-dimensional regression by hand, followed by an examination of sample standard errors. Below I make more extensive use of R (and an additional package) to plot what linear regression looks like in multiple dimensions. This code generates the images above, along with several others. It illustrates that linear regression remains flat even in N dimensions: the regression surface is linear in the coefficients.

As a class exercise, I ask that you consider different pairs of explanatory variables that are functions of one another. What happens if the function is linear? What happens if it is nonlinear, for example $\cos(x_1)=x_2$? Examine what happens to the surface of your regression as compared to the shape of the relationship you are investigating. Is there a way you can contort the regression estimate into a curved surface to better match? Why or why not? (A starting sketch follows.)
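To make the starting point concrete, here is a minimal sketch of the nonlinear version of the exercise, using the same setup as the code below but with x_2 tied deterministically to x_1:

#Sketch of the exercise: x_2 is a nonlinear function of x_1
set.seed(2343)
n<-25
x_1<-rnorm(n)
x_2<-cos(x_1) #the nonlinear relationship under investigation
y<-10+3*x_1-3*x_2+rnorm(n)
fit<-lm(y~x_1+x_2) #the fitted surface over (x_1, x_2) is still a flat plane
summary(fit)
#No matter how y curves when plotted against x_1 alone, the regression
#surface remains linear in the coefficients; only the inputs can be made nonlinear.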

install.packages("plot3D") #we need 3D plotting
library("plot3D", lib.loc="~/R/win-library/3.1") #load it into R's current session; the lib.loc path may vary by computer and can usually be omitted

set.seed(2343) #ensures replication. Sets the seed of the random number generator.
n<-25 #number of samples
x_1<-rnorm(n) #Our x's come from a random sampling of X's.
x_2<-rnorm(n)
b_0<-10
b_1<-3 #Those cursed jello puddings are associated with increased crime. Linear regression is supportive of association, not causation.
b_2<-(-3) #But student transit programs are associated with a decline in crime.
u<-rnorm(n)
y<-b_0+b_1*x_1+b_2*x_2+u #This defines our true y. The true relationship is linear.

#look at data in each dimension
plot(x_1,y)
plot(x_2,y)
#look at data overall
points3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=5) #look at the data; phi/theta control the tilt

fit<-lm(y~x_1+x_2)  #fit it with a linear model, regressing y on x_1, x_2

#Make a surface
x_1.pred <- seq(min(x_1), max(x_1), length.out = n)
x_2.pred <- seq(min(x_2), max(x_2), length.out = n)
xy <- expand.grid(x_1=x_1.pred, x_2=x_2.pred)
y.pred <- matrix(predict(fit, newdata = xy), nrow = n, ncol = n) #predicted surface heights over the grid, in grid order

summary(fit) #view the estimated coefficients and fit statistics

fitpoints<-predict(fit) #fitted values at the observed points, needed to draw the surface with residuals

scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=5, surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints)) #data with the fitted surface; phi/theta control the tilt
scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=45, surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints)) #viewed edge-on the fit is a flat plane; residuals are highlighted
scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=30, surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints)) #other angles confirm the surface stays flat
scatter3D(x_1,x_2,y,xlab="x_1",ylab="x_2",zlab="y",phi=60, surf=list(x = x_1.pred, y = x_2.pred, z = y.pred, facets = NA, fit = fitpoints))

# Why do we need n-2? An example in R

Below is a simple example showing why we may want $(\sum \hat{u}_i^2)/(n-2)$ as our estimate of $\sigma^2$, when naive intuition may suggest we only want the simple average of squared residuals, $(\sum \hat{u}_i^2)/n$.

To show this in no uncertain terms, I have coded a linear regression by hand in R. Also embedded in the work below are several rules I follow when writing code, numbered 0 through 6. There are many other rules, since writing code is an art.

#### Coding in R
#### Rule 1: Comment every few lines of code. It is not unheard of to comment every single line, particularly for new coders or complex code.
#### You will need to reference your work at a later date, and after about 3 months the purpose is lost. Also, I need to read it.

#### Rule 2: Define your variables first. Luckily these names are given to us by the model.
#### For your projects, use names which are clear for your research: (y = crime in Williamsburg, VA; x = number of jello puddings consumed)

set.seed(1223) #ensures replication. Sets seed of random number generators.
n<-25 #number of samples
x<-2*rnorm(n) #Our x's come from a random sampling of X's.
b_0<-10
b_1<-3 #Those cursed jello puddings are associated with increased crime. Linear regression is supportive of association, not causation.
u<-rnorm(n) #u is independent of x with mean zero, so the zero conditional mean assumption holds
y<-b_0+b_1*x+u #This defines our true y. The true relationship is linear.

plot(x,y) #Rule 0, really. Always check your data.

#### Rule 3: After definitions, begin your second stage of work: probably trimming existing data, etc. Do these steps in the order they were added.
hat_b_1<-sum( (x-mean(x)) * (y-mean(y)) ) / sum( (x-mean(x))^2 ) #Spaces between any parenthesized section of operations. We need to be able to see which parentheses are which.
hat_b_1 # Rule 4: Indent work which is conceptually subordinate. Indent more as needed. Four spaces=1 tab.
hat_b_0<-mean(y)-hat_b_1*mean(x)
hat_b_0 # Rule 5: Check your work as you go along. For our example, I got 9.89

abline(a=hat_b_0, b=hat_b_1, col="red") #let's add a red line of best fit. And we must see how our plot looks. Repeat rule 0.

hat_y<-hat_b_0+hat_b_1*x #fitted values
hat_u<-y-hat_y #residuals: actual minus fitted

plot(x,hat_u) #inspect the residuals against x
hist(hat_u) #and their histogram

#### Rule 6: Keep your final analysis as punchy and short as possible without sacrificing clarity.
#### The mean of the squared errors (usually unknown to us as researchers, since the true errors are unobserved)
sigma_sq<-sum(u^2)/n #this is the value we're trying to estimate
sigma_sq_naive<-sum(hat_u^2)/n #a naive estimate of it
sigma_sq_hat<-sum(hat_u^2)/(n-2) #this turns out to be more accurate, particularly in small samples. As n grows the difference vanishes. Try it for yourself!

#R, is this assessment true? Is sigma_sq_hat a better estimator of sigma_sq than our naive estimator? Is it true we need the (-2)?
abs(sigma_sq-sigma_sq_naive) > abs(sigma_sq-sigma_sq_hat) #compare absolute errors; TRUE means the (n-2) version is closer
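A single draw can go either way, so here is a quick Monte Carlo sketch (using lm() for brevity) that repeats the experiment many times and compares the average of each estimator to the true $\sigma^2=1$:

#Monte Carlo check: average each estimator over many replications
set.seed(1223)
reps<-10000
est<-replicate(reps, {
  x<-2*rnorm(n)
  u<-rnorm(n) #true sigma^2 is 1
  y<-b_0+b_1*x+u
  hat_u<-residuals(lm(y~x))
  c(naive=sum(hat_u^2)/n, corrected=sum(hat_u^2)/(n-2))
})
rowMeans(est) #the naive estimator averages about (n-2)/n = 0.92; the corrected one about 1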

Here is one of several plots made by this code: the scatter of x against y with the red fitted regression line drawn over the data.

Please don’t forget the derivation of why this is true!  This is simply some supportive evidence that it might be true.
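For reference, the key fact that derivation establishes is that the residual sum of squares loses one degree of freedom to each of the two estimated coefficients, $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$E\left[\sum_{i=1}^{n}\hat{u}_i^2\right]=(n-2)\,\sigma^2 \quad\Longrightarrow\quad E\left[\frac{\sum_{i=1}^{n}\hat{u}_i^2}{n-2}\right]=\sigma^2,$$

so dividing by $n-2$ rather than $n$ makes the estimator unbiased.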

# An Example of Plotting Multiple Time Series (Stock Values) on a Graph in R

I am currently in the process of designing a portfolio to manage investments. While such programs are best not plastered all over the internet, a few basic concepts about plotting can be shown. For example, I have created a rather appealing plot demonstrating how to draw multiple time series in a single figure, shown below:

The code is below, including my process for detrending the data. The critical lines are the ones using sample(colors()) to select from the body of available colors at random. This is useful when you must generate many plots without detailed manual supervision and you are not demanding publication-quality color selection (which is plausible for personal investigative use).

#after obtaining closing prices, make sure you clean your inputs. Know why there are NAs, or you will make a critical error of omission.

closeprice<-log(closeprice) #work with log prices
data<-closeprice[is.finite(rowSums(closeprice)),] #keep only rows where every series is finite

#first-difference the log prices to get returns

data<-diff(data, lag=1, differences=1)
data<-na.omit(data)

#Check for any remaining trends in the data over and above the natural cyclical or time-trending motion of the stocks!
#Detrend based off of the bond, a necessary part of even a basic CAPM portfolio
xhat<-lm(data$TYX.Close~1)$coefficients #an intercept-only regression, i.e. the mean of the bond series
detrended<-data-xhat #subtract it from every series; this also normalizes against the bond
N<-ncol(detrended) #number of series to plot
plot(index(detrended),detrended[,1],type="l")
for(n in 2:N){
  lines(index(detrended),detrended[,n], col=sample(colors(),size=1)) #random color for each added series
}
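The snippet above assumes closeprice is a zoo/xts object of closing prices (hence index()) and that N holds the number of series. For a reproducible stand-in, here is a toy version on simulated random-walk prices in a plain matrix, with the detrending step reduced to simple demeaning:

#Toy version with simulated prices; no external data required
set.seed(1)
N<-5 #number of series
T<-250 #number of trading days
prices<-exp(apply(matrix(rnorm(T*N, sd=0.01), T, N), 2, cumsum)+4) #random-walk prices
data<-diff(log(prices)) #log first-differences, i.e. returns
detrended<-sweep(data, 2, colMeans(data)) #demean each series
plot(detrended[,1], type="l", ylim=range(detrended), xlab="t", ylab="detrended return")
for(n in 2:N){
  lines(detrended[,n], col=sample(colors(), size=1)) #random color per series
}

Setting ylim to the range of all the series keeps every randomly colored line inside the plotting window.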

# Music and Math

Many people claim there is a strong correlation between music and math.
Below, I demonstrate that the patterns in music are NOT well predicted by typical statistical approaches.

Methodology:
I have taken a MIDI file of Beethoven's 5th and analyzed the track using a battery of estimation techniques: panel-data methods, ARMA models, and extensive non-parametric estimation (polynomial and Fourier series to capture cyclical components). I then use the song's notes and my estimates to forecast the following notes, and play back the "forecasted song". (I note that there has been a lot of recent development in this area, and other techniques have been developed and popularized since I wrote this post.)
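The full MIDI pipeline is too long to reproduce here, but the flavor of the Fourier-series step can be sketched on a made-up note sequence (the pitch vector, cycle length, and number of harmonics below are all hypothetical, just to show the shape of the regression):

#Toy sketch of the Fourier + polynomial fit on a note sequence
set.seed(5)
t<-1:64
notes<-round(60+5*sin(2*pi*t/8)+rnorm(64)) #fake MIDI pitches with an 8-step cycle
K<-3 #number of Fourier sine/cosine pairs (harmonics)
X<-cbind(poly(t,2), #polynomial trend
         sapply(1:K, function(k) sin(2*pi*k*t/8)), #cyclical components
         sapply(1:K, function(k) cos(2*pi*k*t/8)))
fit<-lm(notes~X)
summary(fit)$r.squared #typically very high, as in the caveat below
#The "forecast" is round(fitted(fit)): small numeric errors become wrong notes when played back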

Result:
After listening, the "forecasted song" does not match the original well. As a consequence, I can state that the mathematical techniques common to forecasting do not predict a song well. Below are several of the attempts, which I have highlighted:

Caveat:
The R-squared values for these estimations are in fact VERY high, in the high .90s. (Only a few of the coefficients are significant; the data is clearly overfitted in some regressions.) The forecasted song in fact falls into the so-called uncanny valley, deviating only slightly from the actual Beethoven's 5th. However, the ear is strongly cultured toward perfection in music, and the errors are devastating to us.


# Accidental Art 2

My earlier paper on phased entry, mentioned in "Accidental Art", has been postponed. This postponement is a fortunate side effect of a successful publication in Regional Science and Urban Economics, which has promptly propelled me in the direction of crime economics rather than phased entry. Currently I'm doing some basic modeling on counter-terrorism, which makes me feel like a criminal mastermind. Amusingly enough, there's still some beautiful accidental art being churned out by my model, this time in Matlab!


# Accidental Art

I’m working on phased entry with spatial price discrimination. This got spat out, and I’m really enjoying the subtle patterns.


# Using "cbind" and "as.vector": Computationally intensive commands

As perhaps a mere interesting note, cbind, when combined with as.vector, can be a particularly RAM-intensive pair of commands. The following script excerpt caused my computer to quickly consume 11GB of RAM on a 300k-entry dataset:

for(j in c(1:200)){
  mod.out$coefficients$count[1:80]<-lim.my.df[1:80,j] #plug in the j-th set of bootstrapped count coefficients
  mod.out$coefficients$zero[1:80]<-lim.my.df[81:160,j] #and the bootstrapped zero coefficients
  a<-predict(mod.out,clip.1)
  b<-predict(mod.out,clip.mean)
  diff.j<-mean(a-b)
  # diff[,paste(i,".",j)]<-diff.j
  diff<-as.vector(cbind(diff,diff.j))
}

The purpose of this script is to use bootstrapped coefficients to generate an average partial effect between clip.1 and clip.mean; we will later use this to get an estimate of the standard errors of the APE. As it stands, it eats all my RAM quite promptly and causes the computer to crash. The culprit is the last line: cbind() recycles the scalar diff.j to the length of the vector diff, and as.vector() then flattens the result, so diff roughly doubles in length on every iteration. The following script, nearly identical, does not have this problem:

for(j in c(1:200)){
  mod.out$coefficients$count[1:80]<-lim.my.df[1:80,j]
  mod.out$coefficients$zero[1:80]<-lim.my.df[81:160,j]
  a<-predict(mod.out,clip.1)
  b<-predict(mod.out,clip.mean)
  diff.j<-mean(a-b)
  # diff[,paste(i,".",j)]<-diff.j
  diff<-cbind(diff,diff.j) #grow by one column per iteration
}
diff<-as.vector(diff) #flatten once, after the loop

And this works just fine! In fact, it consumes barely 25% of my RAM.
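To see the blow-up in isolation: because the scalar is recycled against the full length of the vector, the as.vector() version doubles the object on every pass. A toy loop makes the exponential growth obvious (kept to 20 iterations so it stays safe to run):

#Toy demonstration of the exponential growth
diff<-1
for(j in 1:20){
  diff<-as.vector(cbind(diff, 0)) #the scalar 0 is recycled to length(diff), doubling it
}
length(diff) #2^20, about a million elements after only 20 passes; 200 passes would be 2^200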


# Basic programming in Linux

I've just written my first program on Linux (Ubuntu) using bash commands. I'd love to give more details about what's happening and how I'm doing it, but basically I'm following instructions and then playing with the little that I know. I'll present this in two parts, as a demonstration of what the pieces do and how they work.

Here's the first program. Put it into a text file and save it as "testprogram".

#!/usr/bin/env bash
# A simple command in a shell script
name="COMPUTER"
printf "Hello, world!\n"
printf "My name is ${name}!\n"
printf "What's your name?\n"
read user_name
printf "I didn't get that, is your name $user_name?\n"
read answer # the reply here is irrelevant
printf "What's your name again?\n"
read user_name2
while [ "$user_name" = "$user_name2" ]
do
    printf "I heard your name is $user_name! No need to SHOUT!\n"
done

After doing that, navigate using "cd" and "ls" to the directory where you saved "testprogram".
Now enter the following command; this is the second part:

chmod a+rx testprogram

This will make the program "testprogram" Readable and eXecutable (+rx) for All users (a).
You may now run the file by entering:

./testprogram

The program will produce the following output:

Hello, world!
My name is COMPUTER!
What's your name? (first prompt: type a name)
I didn't get that, is your name <what you typed>?
(second prompt: whatever you answer here is irrelevant)
What's your name again? (third prompt)

If the name you give at the first prompt is exactly the same as the one you give at the third, it will scream:
I heard your name is <what you typed>! No need to SHOUT!
It will repeat this until the program is terminated by the user.

I’m really pumped about this. This is part of my ongoing work at completing my dissertation (I need to make a standalone script to start the programs I want and pass arguments to it), as well as getting me to be a more useful part of the universe, so I’m always excited to get things done.

# Basic bootstrapping in R

I've been having some trouble determining how strata works in the boot() command. My intuition says it should resample from within each stratum, but no guarantees! I can always just type "boot" and read the whole function line by line, but that isn't always easy to interpret without comments.
So here's a quick test to make sure it does what I think it does.

library(boot) #for the boot() command
x<-rep(c(0,100,0,1), each=20) #a long vector: 20 zeros, 20 one-hundreds, 20 zeros, 20 ones
pool<-matrix(x, ncol=2) #make that vector into a 40 by 2 matrix, filling it downward: column 1 holds the values, column 2 the strata

pool #let's look at it.
f<-function(pool,i){
  mean(pool[i,1]) #mean of the first column of pool, using resampled individuals i
} #create a function that takes (data, indices)
boot(pool,f, R=500) #resample and perform that operation 500 times. Has variation in output.
boot(pool,f, R=500, strata=pool[,2]) #resample within the strata defined by column 2. Since all observations within each stratum are identical, the standard error should be zero.

Here’s an interesting mistake that I made while creating this command.

f<-function(pool,i){
  mean(pool[,1]) #note: the index i is never used
} #create a function that takes (data, indices)
boot(pool,f, R=500) #has no variation in the bootstrap. Why? What's up with that?

Answer: there are no resampled individuals. The function ignores the indices i and simply computes the mean of the entire first column of pool on every one of the 500 replications, so every replicate returns the same number, as shown in the final result.