6. Serious Data Analysis in R
Section author: Andrea Bruno <bruno.andrea.191@gmail.com>
Let’s take a moment to examine how we might make sense of a dataset when working on a research topic. To follow along with this tutorial, you will need to make an account on Posit Cloud, which offers a browser-based version of RStudio. (However, I recommend actually installing R on your PC if you’re planning on using it for research!) https://posit.cloud/
Once you’re set up in Posit Cloud, go to your workspace, press “New Project,” and then select “New RStudio Project.” Once you are in your R project, create a new file (an R script). You will type commands into the upper-left pane, the lower-left pane (the console) will show your output, and the right side of the screen will show your environment.
Now, one thing of note which may be confusing to seasoned programmers: in RStudio, if you want to run a line of code, your cursor must be on that line, or, if you want to run several lines, you must select them all. If you do not do this, the code will not execute.
For our first example, we will be using a dataset which is native to R. To bring the data into your project, execute:
data(longley)
View(longley)
Note: R commands are case sensitive. For some reason, the “View()” command is capitalized, though this is rare and something you will need to remember case by case. No pun intended.
Running this code will give you a dataset with a few variables (labeled at the top of each column) and several yearly observations (the rows) relating to the Gross National Product and the labor force. Take some time to explore your variables and understand what each one means.
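If you’d rather explore from the console than scroll through the spreadsheet view, a few base R commands are handy here (these particular calls are just illustrations; none of them are required for the steps below):
str(longley)      # the structure: variable names, types, and the first few values
summary(longley)  # min, max, median, and quartiles for every column
?longley          # open the help page describing each variable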
Now, once you are familiar, I want you to compare some of these values. We will start this by running a correlation test.
Pick two variables, such as GNP and Population. Now, type the following command:
cor.test(longley$GNP, longley$Population)
For many commands in R, you tell it which dataset to look at with the first part of the term, in this case longley, followed by a $ and then the variable it should retrieve, like so: dataset$variable.
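This dataset$variable pattern works anywhere a command expects a single column of values. For instance (again, just illustrations of the pattern, not part of the exercise):
mean(longley$GNP)          # the average GNP across all of the years in the data
range(longley$Population)  # the smallest and largest Population values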
With your cursor on the cor.test line, hit Ctrl+Enter, and the console should spit out something like this:
data:  longley$GNP and longley$Population
t = 27.842, df = 14, p-value = 1.168e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9738032 0.9969870
sample estimates:
      cor
0.9910901
This gives us lots of great information. Let’s review what these terms mean!
- t value:
The test statistic: a measure of how large the estimated relationship is compared to the noise in your sample. Values close to 0 are totally insignificant, while anything above 2.5 or below -2.5 usually indicates a real relationship.
- df (degrees of freedom):
Roughly, the number of observations minus the number of quantities you have already estimated from them. For this correlation test, it is the number of observations minus two (16 years of data, so df = 14). “But why subtract anything?” I hear you cry. It’s because some of the information in your data has already been used up estimating the means, and it can’t be counted again when you calculate things like variance and statistical significance. In general, the more degrees of freedom, the more robust your result.
- p-value:
The probability of seeing a result at least this extreme purely by chance, assuming the null hypothesis is true. The lower your p-value, the more significant your result is! More on this later.
- alternative hypothesis:
When you are investigating a question with statistics, you start with a “null hypothesis,” which is usually the most common, unremarkable scenario. For example, the null hypothesis in a clinical trial is “The drug does nothing.” The alternative hypothesis could be: “the drug makes people immortal,” if you found that those who took the pill lived 100,000+ years longer than those who took the placebo.
- correlation:
The correlation value ranges between -1 and 1, with 1 representing a perfect positive correlation, -1 representing a perfect negative correlation, and 0 representing no correlation to speak of.
- 95% confidence interval:
When we use statistics, we can’t be sure the numbers in our data are the exact true value we would see if we had perfect information. The best we can do is estimate. That’s what the confidence interval is for! A 95% confidence interval tells us that we can be 95% sure the true value lies between the lower and upper bounds.
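If you’d rather pull these numbers out of the result than read them off the console, the object that cor.test() returns stores them as named pieces you can access with the same $ syntax (a quick sketch):
ct <- cor.test(longley$GNP, longley$Population)
ct$estimate   # the correlation itself
ct$conf.int   # the 95 percent confidence interval
ct$p.value    # the p-value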
Now that we understand how to interpret our results, we see that GNP and Population are very close to perfectly correlated, with a coefficient of .99.
However, as every person past the age of critical thinking should have tattooed on their brain, correlation does not equal causation. We can see that there is a strong relationship between these factors, but we cannot say which one causes the other, or whether either causes the other at all.
We also must consider confounding variables, or factors which relate to both the population and the GNP. Think of it this way: there is a strong relationship between the degree of umbrella usage in a given year and the amount of water in the reservoir. A crude observer might posit that people using their umbrellas magically contributes to the water supply! However, we know that there is another variable which is strongly associated with both umbrella usage and the amount of water in the reservoir: Rainfall! There isn’t a perfect way to detangle such relationships. However, one thing that can help is a regression model.
If a correlation maps the relationship between one variable, say GNP, and another variable, like the population, a regression model can be thought of as stacking multiple relationships together to weave a more complex understanding. However, you still need to be wary of predictors that are closely correlated with each other (statisticians call this multicollinearity)! In our umbrella example, if you try to plug in both rainfall and umbrella usage, you will essentially be double-counting the same phenomenon measured in two different ways. This will distort your significance levels, and could even break your model. So, if you have two variables which basically measure the same thing, it’s best to either merge them into a single metric or omit one of them.
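One quick way to spot this kind of overlap before you build a model is to look at the correlations among your candidate predictors (a sketch using a few longley columns; any pair correlated above roughly .8 or .9 deserves suspicion):
# correlation matrix of three candidate predictors
cor(longley[, c("Population", "Employed", "Armed.Forces")])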
Now: to code a regression model in R, first, decide what you want to name your regression. It can be anything, but for mine, I will keep it simple and call it reg_model1.
Now, go ahead and write:
reg_model1 <- lm(GNP ~ Employed + Armed.Forces, data = longley)
There are a few conventions of the R language to remember here. First, you assign results to a name with an arrow, written “<-”.
In this regression, you do not need to use the format dataset$variable in order to retrieve your variables, because the final argument tells R which data you are looking at. (I know, the inconsistency annoys me too.)
Run this regression model, and if you have done it correctly, you should see the name of your regression pop up in your global environment. Next, type: summary(reg_model1), which should give you something like:
Call:
lm(formula = GNP ~ Employed + Armed.Forces, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max
-39.210 -12.123   0.953  14.588  23.508

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.432e+03  9.647e+01 -14.850 1.56e-09 ***
Employed      2.789e+01  1.594e+00  17.498 2.04e-10 ***
Armed.Forces -6.047e-03  8.044e-02  -0.075    0.941
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.28 on 13 degrees of freedom
Multiple R-squared: 0.9674,  Adjusted R-squared: 0.9624
F-statistic: 192.8 on 2 and 13 DF, p-value: 2.172e-10
You should be able to understand most of this using the previous statistics overview, but let’s go more in depth about P values.
To better understand p-values and significance, think of a six-sided die. Say that as a researcher, you decide to run a series of trials to determine whether a given die is weighted unfairly. In this case, your null hypothesis is that the die is fair, and the alternative hypothesis is that the die is weighted. You roll it once, and you get a six. Now, you can go out and say that the die rolled a 6 in 100% of your trials, but this isn’t very impressive: a fair die does exactly that one time in six, so the p-value will still be far too large to show anything. Remember, the lower the p-value, the more evidence you have. If you run six trials and each roll gives a different number between 1 and 6, your p-value will be 1, which is as high as a p-value gets, because your experiment perfectly aligns with the null hypothesis. However, say you run 100 trials, and in 92 of them the die gives a 6. In this case, the p-value will be incredibly low, which means you have significant evidence to support the claim that the die is weighted!
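You can run exactly this kind of test in R with chisq.test(), which compares the counts you observed against the “all faces equally likely” null hypothesis. (The tallies below are made up purely for illustration; swap in your own.)
# made-up tallies of 100 rolls, for faces 1 through 6
rolls <- c(2, 1, 2, 2, 1, 92)
chisq.test(rolls)   # null hypothesis: every face has probability 1/6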
An easy way to quickly gauge your significance is to look at the asterisks, aka ‘stars’, next to your results. A single asterisk represents a p-value below .05, meaning there is less than a 5% chance the die would have rolled the way it did if it were fair. Two asterisks means there is less than a 1% chance, and three asterisks means there is less than a .1% chance. Great! But what if you want to look at a more specific dataset you found online?
(By the way, you can find a huge list of publicly available datasets here: https://github.com/awesomedata/awesome-public-datasets)
For this example, I’m going to be using this dataset: https://efotw.org/economic-freedom/dataset?geozone=world&page=dataset&min-year=2&max-year=0&filter=0 which is meant to evaluate the economic freedom of various countries.
If you’re still using Posit Cloud, you will need to upload the data file to your environment. In the bottom-right pane, under the tab labeled ‘Files’, you will find a button labeled ‘Upload.’ Click it and locate the data file on your computer. (It’s probably in Downloads!)
Assuming you saved your data as a CSV file, execute the following command:
Ecofreedata <- read.csv("/cloud/project/*THEFILENAME!!.csv*")
Now, run:
View(Ecofreedata)
Once you do this, you will notice that we have a few rows of blank space and random integers scattered about. This is a great opportunity to learn some basic data cleaning techniques!
If you want to remove the undesirable rows at the top, redo the command to call your file and add a skip term:
Ecofreedata <- read.csv("/cloud/project/economicdata2022-2022.csv", skip = 4, header = T)
You will also want the “header = T” term (T for TRUE), which tells R that the first row it reads after the skipped lines contains your variable names. Make sure you do this! If your columns aren’t properly named, the instructions to come will not work!
Now, try View(Ecofreedata)
again. Better, right?
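A couple of other quick sanity checks are worth running at this point; they print the column names and structure to the console, which is handy when the spreadsheet view is too wide to scan. (The exact names you see will depend on the file you downloaded.)
names(Ecofreedata)  # the column names R assigned from the header row
str(Ecofreedata)    # each column's type and its first few values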
Spend some time playing around with
cor.test(Dataset$variable1, Dataset$variable2)
and
regmodel <- lm(dependentvariable ~ indepvar1 + indepvar2 + ... + indepvarN, data = yourdataset)
and see if you can find any interesting relationships.
Once you have done this, it’s time for another important lesson in data analysis: Your codebook is your trusted map through a foreign land. Without it, you will surely get lost. Most datasets have a codebook attached, which explains how to interpret the variables and describes how they were measured. You can find the codebook for this dataset at: https://www.fraserinstitute.org/sites/default/files/2024-10/economic-freedom-of-the-world-2024.pdf
For our first lesson, note that the direction of your data will not always be obvious!
For example, let’s take a look at the variable Inflation..Most.Recent.Year. When we plug it into a regression model predicting, say, Government.consumption from Inflation..Most.Recent.Year, we find that they have a negative relationship. What?? But I thought government consumption causes inflation? Have we just broken a hole in the basic laws of economics??
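If you want to reproduce the puzzle yourself, the model looks something like this (the exact column names depend on how read.csv() mangled your header row, so check names(Ecofreedata) and adjust accordingly):
inflation_model <- lm(Government.consumption ~ Inflation..Most.Recent.Year, data = Ecofreedata)
summary(inflation_model)   # look at the sign of the Inflation coefficient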
Well, when we consult our codebook and find the explanation for the inflation variable on page 76… oh.
“The lower the rate of inflation, the higher the rating. Countries that achieve perfect price stability earn a rating of 10. As the current annual inflation rate moves towards 25%, the rating for this component moves toward 0.”
So low inflation = high score, high inflation = low score. Right.
Sometimes the dataset designers will go out of their way to challenge your intuition.
So, I want you to put your hands together, close your eyes, and make the following promise to yourself and all that you love:
“I will thoroughly understand my dataset’s variables before I start any projects!”
Say it ten times, write it on your fridge, and paste it on your ceiling. This may be the most important research skill you learn today, and it will save you from many potential tragedies.
Starting a research paper before you befriend your dataset is like trying to build a mansion on a styrofoam foundation. Don’t do that.
Now, to wrap up, this tutorial has given you some solid building blocks for future research! Your suggested exercise is this:
Using the Ecofree dataset (or another one, if you’re so inspired!), create a regression model with a handful of variables which meaningfully predict your dependent variable. Aim for an adjusted R-squared above .3, which means your model explains roughly 30% of the variance after accounting for the number of predictors! Only include variables with a p-value below .05, and watch out for collinearity!
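If you want a concrete, runnable warm-up before wrestling with your own dataset, the same workflow on the built-in longley data looks like this (the variable choices here are just one illustration, not a suggested answer):
# fit the model, inspect it, and check the predictors for overlap
warmup_model <- lm(Employed ~ GNP + Unemployed + Armed.Forces, data = longley)
summary(warmup_model)                                    # adjusted R-squared and p-value stars
cor(longley[, c("GNP", "Unemployed", "Armed.Forces")])   # watch for highly correlated predictors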