# Why do we need n-2? An example in R

Below is a simple example showing why we may want the $(\Sigma u^2_i )/ (n-2)$ as our estimates of $\large \sigma^2$, when our naive intuition may suggest we only want the simple average of squared errors $(\Sigma u^2_i )/ (n)$.

To show this in no uncertain terms, I have coded a linear regression by hand in R.  Also embedded in the work below are several rules I follow about writing code. They are rules 0-6.  There are many other rules, since code writing is an art.

####Coding in R
#### Rule 1: Always comment on every few lines of code. It is not unheard of to comment every single line, particularly for new coders, or complex code.
#### You will need to reference your work at a later date, and after about 3 months, the purpose is lost. Also, I need to read it.

#### Rule 2: Define your variables first. Luckily these names are shared for us.
#### For your projects, use names which are clear for your research: (y=crime in Williamsburg, VA, X= Number of jello puddings consumed)

set.seed(1223) #ensures replication. Sets seed of random number generators.
n<-25 #number of samples
x<-2*rnorm(n) #Our x’s come from a random sampling of X’s.
b_0<-10
b_1<-3 #Those cursed jello puddings are associated with increased crime. Linear regression is supportive of association- not causation.
u<-rnorm(n) #We satisfy both independent mean and zero mean assumptions
y<-b_0+b_1*x+u #This is defining our true Y. The true relationship is linear.

plot(x,y) #Rule 0, really. Always check your data.

#### Rule 3: After definitions begin your second stage of work. Probably trimming existing data, etc. Do these in the order they were added.
hat_b_1<-sum( (x-mean(x)) * (y-mean(y)) ) / sum( (x-mean(x))^2 ) #Spaces between any parenthesized section of operations. We need to be able to see which parentheses are which.
hat_b_1 # Rule 4: Indent work which is conceptually subordinate. Indent more as needed. Four spaces=1 tab.
hat_b_0<-mean(y)-hat_b_1*mean(x)
hat_b_0 # Rule 5: Check your work as you go along. For our example, I got 9.89

abline(a=hat_b_0, b=hat_b_1, col=”red”) #let’s add a red line of best fit. And we must see how our plot looks. Repeat rule 0.

hat_y<-hat_b_0+hat_b_1*x
hat_u<-hat_y-y

plot(x,hat_u) # Let’s see our residuals
hist(hat_u) # Let’s see our histogram

#### Rule 6: Keep your final analysis as punchy and short as possible without sacrificing clarity.
#### The mean sum of the squared errors (usually unknown to us as researchers)
sigma_sq<-sum(u^2)/n #this is the value we’re trying to estimate
sigma_sq_naive<-sum(hat_u^2)/n #this is a naive estimation of it
sigma_sq_hat<-sum(hat_u^2)/(n-2) #this turns out to be more accurate, particularly in small samples. If n->infinity this goes away. Try it for yourself!

#R, is this assessment true? Is sig_sq_hat a better estimator of sig_sq than our naive estimator? Is it true we need the (-2)?
(sigma_sq-sigma_sq_naive) > (sigma_sq-sigma_sq_hat)

Here is one of several plots made by this code, showing a nice linear regression over the data:

Please don’t forget the derivation of why this is true!  This is simply some supportive evidence that it might be true.

# An Example of Plotting Multiple Time Series (Stock Values) on a Graph in R

I am currently in the process of designing a portfolio to manage investments. While such programs are not best plastered over the internet, a few basic concepts about plotting can be displayed.  For example, I have created a rather appealing plot, which demonstrates how to plot series of multiple images in a single plot, shown below:

Code is below, including my process to detrend the data. The critical lines are in bold, highlighting the fact that you can use sample(colors()) to select from the body of colors at random. This is useful when you may have to generate many plots, potentially without greatly detailed manual supervision, and you are not demanding publication-quality color selection (which is plausible for personal investigative use).

#after obtaining closing prices, you should make sure you clean your inputs. Ensure you know why there are NA’s, or you will make a critical error of omission.

closeprice<-log(closeprice)
data<-closeprice[is.finite(rowSums(closeprice)),]

#first-difference

data<-diff(data, lag=1, differences=1)
data<-na.omit(data)

#Check for any remaining trends in data over and above the natural cyclical or time-trending motion of the stocks!
#Detrend based off of the bond, a necessary part of even a basic CAPM portfolio
xhat<-lm(data$TYX.Close~1)$coefficients
detrended<-data-xhat #also, norm.
plot(index(detrended),detrended[,1],type=”l”)
for(n in 2:N){

lines(index(detrended),detrended[,n], col=sample(colors(),size=1))

}

# Music and Math

Many people claim there is a strong correlation between music and math.
Below, I demonstrate that the patterns in music are NOT well predicted by typical statistical approaches.

Methodology:
I have taken a MIDI file of Beethoven’s 5th, and analyzed the track using non-parametric estimation techniques. These techniques included panel data techniques, ARMA, and extensive non-parametric estimation techniques (polynomial and Fourier series to capture cyclical components). I then use the song’s notes and my estimation technique to create a forecast of following notes. I then play the “forecasted song”.  (I note that there has been a lot of recent development in this area and other techniques have been developed and popularized since I wrote this post.)

Result:
After listening, the “forecasted song” does does not well match the original. As a consequence, I can state that the mathematical techniques common to forecasting do not well predict a song.  Below are several attempts which I have highlighted:

Caveat:
The R-squared for these estimations are in fact VERY high, in the high 90’s. (Only few of the coefficients are significant, the data is clearly overfitted in some regressions.) This song in fact falls into the so-called uncanny valley, and is only slightly deviant from the actual Beethoven’s 5th. However, the ear is strongly cultured to perfection in the subject of music, and the errors are devastating to us.

# Basic bootstrapping in R

I’ve been having some trouble determining how strata works in the boot() command.   My intuition says it should select from within each type of strata. But no guarantees!  I can always just type “boot” and read the whole command line for line… but this isn’t always easy to interpret without comments.
So here’s a quick test to make sure it does what I think it does.

x<-rep(c(0,100,0,1), each=20) #make a long list of 20 zeros,20 one hundreds, 20 zeros, 20 ones.pool<-matrix(x, ncol=2) #make that list into a 20 by 2 matrix, filling it downward.

pool #let’s look at it.
f<-function(pool,i){
mean(pool[i,1]) #mean of the first column of pool, using individuals i
} #create a function that uses (data, individuals)
boot(pool,f, R=500) #resample and perform that operation 500 times.  Has variation in output.
boot(pool,f, R=500, strata=pool[,2]) #resample and perform that operation -reselecting from within the strata on the rhs. Since all observations within each strata are identical, the standard deviation should be zero.

Here’s an interesting mistake that I made while creating this command.

f<-function(pool,i){
mean(pool[,1])
} #create a function that uses (data, individuals)
boot(pool,f, R=500) #Has no variation in bootstrapping.  Why? What’s up with that?

Answer: There’s no individuals. It just resamples the entire population of the first column of pooled 500 times, getting the final result as shown.

# Ordering and structuring text in STATA

Like most researchers, I have difficulty finding data. On occasion, when I manage to find data, it comes in a useless or meaningless format. For example, a local police department offered to print out all of its police reports for me for the last 15 years at \$0.25 a page… which I would have to enter by hand.  I politely declined.  Here’s an example of sorting out city titles from a poorly formatted document- a short version of that document can be found here.

The first and most important thing is to have your data loaded.

clear all
use "location", clear

This loads the data and clears any extra junk STATA may have floating around.  The next thing to do is identify any sort of regular pattern.  The first identifiable patterns were that titles for cities began every 64 lines, and sometimes went for between 3 or 4 lines, before having another 64 lines of regular income reports (useless at this point) are posted. The second pattern is that they all are either villages, cities, or towns.  Below are some attempts at accomplishing this goal via making some sort of indicator every 64 lines (didn’t work), or marking every area that starts with CITY (inefficient). Don’t forget to put strings in quotation marks!

*Marks every line with a list that counts 0...63. I thought the 63rd would be useful, but because of the irregular length of the header, it's worthless.
*gen city_titles=mod(_n-1,64)
*Is the line saying city? If so, give it a mark, literally.
*gen mark=cond(var1=="CITY","mark",".",".")

Notice that using (*) comments out the line in STATA, and (.) is STATA’s version of NA.  Here is a more successful attempt:

*Identify if it's probably a title
gen mark=1 if var1=="CITY"
replace mark=1 if var1=="TOWN"
replace mark=1 if var1=="VILLAGE"

Don’t forget the double equals!  In STATA, all if statements I have observed are followed by (==) instead of (=), and I still manage to forget them often!  Next, I will create 3 more indicators, aptly labeled mark_1 to mark_3, all showing if it follows the indicator above by 1, 2, or 3 variables. In programming, it is better to give clear, boring, brief names rather than unclear mess. I would rather have a long object name than an unclear short one, since you never know who is going to have to read your code:

*Is it following a title?
gen mark_1=mark[_n-1]
gen mark_2=mark[_n-2]
gen mark_3=mark[_n-3]

In retrospect, I could have perhaps consolidated those more efficiently. Does anyone else see how I could have done that in a single variable (mark) instead of having multiple indicators (mark_1…mark_3)?  Either way, I have some extra chaff I have to eliminate. having (.)’s in my data set is agitating. Let’s set those all as 0’s, and sum the remaining columns together in something called “total”, so I know if they have an indicator (1 or more) or not (0).  Then we can drop everything we don’t want!

*Eliminate non-numerical entries as numeric.
replace mark=0 if mark==.
replace mark_1=0 if mark_1==.
replace mark_2=0 if mark_2==.
replace mark_3=0 if mark_3==.

*Is it marked?
gen total=mark+mark_1+mark_2+mark_3

drop if total!=1

Bam!  Much better. We dropped somewhere around 80% of our unwanted junk right there! We still could never compile this all by hand, though.  Here, we want it to concatenate all our variables 1…11 in a downward direction.  So we make a local marco list called to_combine, since that’s what we’ll be doing with them.

local to_combine var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11

Below is a STATA style do-loop, a common element in meaningful programming everywhere.  We want to repeat for each and every variable in the “to_combine” list a series of operations designed to deal with a few regular problems we observed. Firstly, we want to fix an issue with names. You may notice “FOND DU LAC” takes up 3 columns rather than the intended 1, being as that is the name of a city/town/village/county in Wisconsin.  This is the first thing we fix, by dropping the spaces with “DU” and “LAC” and replacing all “FOND” entries with “FOND DU LAC”.  Such a procedure is repeated for other problems we noted. There are 300+ problems we found, so this is not a perfect fix yet, (as noted by entry v9/2 in the “after” image below), but it’s much better. The second thing we fix is putting “COUNTY” horizontally under “TOTAL”, since in the original document (pdf), we know that after an entry says “TOTAL”, it always says the total for the county, not for any other grouping.  Lastly, we want it to concatenate horizontally but stop before it hits any part of the income sheet.  Stata uses & for “and” and | for “or”. You may use both in long chains for a single if statement. This last was done by a bit of trial and error to make sure I got it right. That’s fine, since computers are cheap and thinking hurts.  It stops at the line containing every end of line indicator I could think of, like “TAXES”, “POPULATION”, or “REVENUES”, which are clearly not city titles.

foreach var in to_combine'{

*We have a space problem (Part 1)
replace var'="" if var'=="DU"
...snipped...
replace var'="LA CROSSE" if var'=="LA"

*Puts county under total.  It's a mandatory identifier visible in the pdf. (Part 2)
replace var'="" if var'=="COUNTY"
replace var'="COUNTY" if var'[_n-2]=="TOTAL"

*Concatenates appropriately, stopping if it hits anything
replace var'=var'[_n-1]+"."+var' if var2[_n-1]!="POPULATION" & var1[_n-1]!="REVENUES" & var1[_n-1]!="TAXES" & mark_3[_n-1]!=1 & var2!="POPULATION" & var1!="REVENUES" & var1!="TAXES"

}

Now, let’s just drop the stuff we don’t need, and we’re looking at a much much cleaner worksheet, giving us the before/after image shown below:

*Drops excess rows or columns.
drop if mark==1
drop if mark_1==1
drop if mark_2==1
drop if var2=="POPULATION"
drop if var1=="REVENUES"
drop if var1=="TAXES"