03/12/2019 Slides available at https://www.ThiagoROliveira.com/IntroR
R basics
R as a calculator
Statistical packages for data
Stuff that you can do using:
But using the R programming language
Because:
On the other hand:
With R, you can…
R is getting popular…
To obtain R, visit https://cran.r-project.org/
Highly recommended: use RStudio
In the console, we can type in arithmetic operations of any kind:
2 + 2
## [1] 4
5 / 3
## [1] 1.666667
4 * 4
## [1] 16
5 * (10 - 3)
## [1] 35
12 / (5 * 2)
## [1] 1.2
sqrt(4)
## [1] 2
You can use the cursor or arrow keys on your keyboard to edit your code at the console:
Take a few minutes to play around at the console and try different things out. Don’t worry if you make a mistake, you can’t break anything easily!
Working directly with the console is fine, but…
Scripts are just text files (like those you would open in Notepad or TextEdit) embedded within RStudio
It is highly recommended that you always work from a script file
To run your code from a script, select the lines you want to run and hit Ctrl-Enter (Cmd-Enter on a Mac) or use the Run button
R can store information as an object with a name of our choice, using the assignment operator <-
result <- 2 + 2
print(result)
## [1] 4
We can use objects to perform subsequent calculations
result * 3
## [1] 12
We can even use an object and assign the result to a new object:
new_result <- result / 8
new_result
## [1] 0.5
Take a look at the upper-right window. The Environment pane lists all R objects created in this session.
Note that if we assign a different value to the same object name, the value of the object will be changed
result <- 7 - 2
print(result)
## [1] 5
And remember that object names are case sensitive: result is not the same as Result or RESULT.
print(Result)
## Error in print(Result): object 'Result' not found
So far, we have only assigned numbers to an object. But R can represent various types of values as objects.
my_name <- "thiago"
my_name
## [1] "thiago"
my_name <- "thiago r. oliveira"
my_name
## [1] "thiago r. oliveira"
Notice that we can treat numbers like characters if we want to
RESULT <- "4"
RESULT
## [1] "4"
However, arithmetic operations cannot be used for character strings.
sqrt(RESULT)
## Error in sqrt(RESULT): non-numeric argument to mathematical function
Each object belongs to a different class. The Environment window shows the class of an object. We can also use the function class()
class(result)
## [1] "numeric"
class(RESULT)
## [1] "character"
class(sqrt)
## [1] "function"
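If a number has been stored as a character string, it can usually be converted back before doing arithmetic. A minimal sketch, assuming the string really contains a number; the object names `text_four` and `num_four` are made up for illustration:

```r
# A number stored as text (hypothetical object name)
text_four <- "4"
class(text_four)          # "character"

# as.numeric() converts the string to a number
num_four <- as.numeric(text_four)
class(num_four)           # "numeric"

# Arithmetic now works
sqrt(num_four)            # 2
```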
A vector is a set of information contained together in a specific order. To create a vector, use the function c(), which stands for concatenate:
new_vector <- c(0, 3, 1, 4, 1, 5, 9, 2)
new_vector
## [1] 0 3 1 4 1 5 9 2
To access specific elements of a vector, use square brackets [ ] (we call it indexing). For instance, if we wish to access the 2nd element of the vector we just created, we can do the following:
new_vector[2]
## [1] 3
We can also use indexing to subset a vector. For example, remember that the new_vector we created was c(0, 3, 1, 4, 1, 5, 9, 2):
new_vector[c(1, 5, 6)]
## [1] 0 1 5
new_vector[c(6, 5, 1)]
## [1] 5 1 0
new_vector[-5]
## [1] 0 3 1 4 5 9 2
The c() function can be used to combine multiple vectors
x1 <- c(1, 3, 5, 7)
x1
## [1] 1 3 5 7
x2 <- c(4:7)
x2
## [1] 4 5 6 7
x1x2 <- c(x1, x2)
x1x2
## [1] 1 3 5 7 4 5 6 7
In R, mathematical operations on vectors occur elementwise:
fib <- c(1, 1, 2, 3, 5, 8, 13, 21)
fib[1:7]
## [1] 1 1 2 3 5 8 13
fib[2:8]
## [1] 1 2 3 5 8 13 21
fib[1:7] + fib[2:8]
## [1] 2 3 5 8 13 21 34
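The same elementwise rule applies to other operators, and a single number is applied to every element. A small sketch with made-up vectors `a` and `b`:

```r
# Two vectors of the same length (illustrative values)
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Each element of a is combined with the matching element of b
a + b     # 11 22 33 44
a * b     # 10 40 90 160

# A single number is applied to every element
a * 2     # 2 4 6 8
a^2       # 1 4 9 16
```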
Functions are the backbone of R operations. We have already used several: sqrt(), print(), class(), and c().
Functions take the form funcname(input), where funcname is the name of the function and input is the argument passed to the function.
sqrt(49)
## [1] 7
A function always requires the use of parentheses or round brackets ( ). Inputs to the function are called arguments and go inside the brackets. Some basic functions useful for summarising data include:
- length() for the length of a vector
- min() for the minimum value
- max() for the maximum value
- range() for the range of the data
- mean() for the mean
- sum() for the sum of the data
length(new_vector)
## [1] 8
min(new_vector)
## [1] 0
max(new_vector)
## [1] 9
range(new_vector)
## [1] 0 9
sum(new_vector)
## [1] 25
We can be creative. Instead of running
mean(new_vector)
## [1] 3.125
we can compute the same value ourselves:
sum(new_vector) / length(new_vector)
## [1] 3.125
We can also perform calculations on the output of a function:
mean(new_vector) * 3
## [1] 9.375
Which means that we can also have nested functions:
log(mean(new_vector))
## [1] 1.139434
We can also assign the output of any function to a new object for use later:
log_pie <- log(mean(new_vector))
world.pop <- c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
year <- seq(from = 1950, to = 2010, by = 10)
year
## [1] 1950 1960 1970 1980 1990 2000 2010
names(world.pop) <- year
names(world.pop)
## [1] "1950" "1960" "1970" "1980" "1990" "2000" "2010"
world.pop
##    1950    1960    1970    1980    1990    2000    2010
## 2525779 3026003 3691173 4449049 5320817 6127700 6916183
You can write your own functions!
my_new_mean <- function(x) {        # function takes one input
  new.mean <- sum(x) / length(x)    # object 'new.mean' is defined as this ratio
  return(new.mean)                  # return output
}
my_new_mean(world.pop)              # Testing the new function
## [1] 4579529
mean(world.pop)
## [1] 4579529
my.summary <- function(x) {
  s.out <- sum(x)
  l.out <- length(x)
  m.out <- mean(x)
  out <- c(s.out, l.out, m.out)               # define the output
  names(out) <- c("sum", "length", "mean")    # add labels
  return(out)
}
my.summary(world.pop)
##      sum   length     mean
## 32056704        7  4579529
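Functions can also take more than one argument, and an argument can have a default value. The sketch below is only an illustration: `value_spread` and its `divisor` argument are hypothetical names, not part of R or of these slides.

```r
# Hypothetical helper: the gap between the largest and smallest value,
# optionally rescaled by a divisor (the default of 1 leaves it unchanged)
value_spread <- function(x, divisor = 1) {
  spread <- max(x) - min(x)
  return(spread / divisor)
}

# Same population values as in the slides
pop <- c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)

value_spread(pop)                   # 4390404
value_spread(pop, divisor = 1000)   # 4390.404
```

Calling the function without `divisor` uses the default; naming the argument overrides it.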
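Loading data into R can be tricky sometimes. R looks for files relative to the working directory, which getwd() reports:
getwd()
## [1] "/Users/rodri147/Dropbox/LSE/Teaching/Intro to R"
To change the working directory, use the setwd() function:
setwd("~/Dropbox/LSE/Teaching/Intro to R")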
Now assuming the data files are in the working directory…
If the data file is saved as a CSV file, we just use the read.csv function. Click here to download the data.
data_pop <- read.csv("UNpop.csv")
class(data_pop)
## [1] "data.frame"
We can inspect the result with the View() command, which displays the data frame like a spreadsheet.
If the data file is saved as an RData file, we just use the load() function. Click here to download the data.
load("UNpop.RData")
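To try both formats without the course files, one option is to write a small data frame out and read it back. This is a sketch under assumptions: the object name `pop_df` and the temporary file paths are invented here, and only the first three rows of the population data are used.

```r
# A small data frame to write out (values copied from the slides)
pop_df <- data.frame(
  year = c(1950, 1960, 1970),
  world.pop = c(2525779, 3026003, 3691173)
)

# Temporary file paths, so nothing is written to the working directory
csv_path <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

# CSV: write.csv() and read.csv() are a pair
write.csv(pop_df, csv_path, row.names = FALSE)
pop_from_csv <- read.csv(csv_path)

# RData: save() stores the object under its name, load() restores that name
save(pop_df, file = rdata_path)
rm(pop_df)
load(rdata_path)    # pop_df exists again

dim(pop_from_csv)   # 3 2
```

Note the difference: read.csv() returns a value that we assign to a name of our choice, while load() recreates objects under the names they were saved with.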
Data files from other statistical software cannot be loaded with base R alone. Fortunately though, one of R’s strengths is the existence of a large community of R users who contribute by writing R packages.
# install.packages("foreign")   # install package
library(foreign)                # load package
mydata_stata <- read.dta("UNpop.dta")
names(data_pop) # names of the variables
## [1] "year" "world.pop"
nrow(data_pop) # number of rows (observations)
## [1] 7
ncol(data_pop) # number of columns (variables)
## [1] 2
dim(data_pop) # dimensions
## [1] 7 2
The $ operator is very useful to access an individual variable.
summary(data_pop$world.pop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 2525779 3358588 4449049 4579529 5724258 6916183
summary(data_pop$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1950    1965    1980    1980    1995    2010
Another way of accessing individual variables is to use [ ]
- A data frame is two-dimensional, so we need two indexes: [rows, columns]
mean(data_pop[, "world.pop"])
## [1] 4579529
data_pop[1:3, ]
##   year world.pop
## 1 1950   2525779
## 2 1960   3026003
## 3 1970   3691173
data_pop[c(1, 3, 5), "world.pop"]
## [1] 2525779 3691173 5320817
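The different ways of reaching a column give the same vector. The sketch below rebuilds the population data by hand so that it runs without the CSV file; the name `pop_df` is an assumption made for this example.

```r
# Rebuild the population data by hand (values from the slides)
pop_df <- data.frame(
  year = seq(from = 1950, to = 2010, by = 10),
  world.pop = c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
)

# Three equivalent ways to pull out one column as a vector
by_dollar <- pop_df$world.pop
by_name <- pop_df[, "world.pop"]
by_position <- pop_df[, 2]
identical(by_dollar, by_name)       # TRUE
identical(by_dollar, by_position)   # TRUE

# Rows and columns together: the year of the last two observations
pop_df[6:7, "year"]                 # 2000 2010
```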
ggplot
The basic plotting syntax is very simple. plot(x_var, y_var) will give you a scatter plot (click here to download data):
sim.df <- read.csv("file.csv")
plot(sim.df$x, sim.df$y)
Hmm, let’s work on that.
The plot function takes a number of arguments (?plot for a full list). The fewer you specify, the uglier your plot:
plot(x = sim.df$x, y = sim.df$y,
     xlab = "X variable",
     ylab = "Y variable",
     main = "Awesome plot title",
     pch = 19,        # Solid points
     cex = 0.5,       # Smaller points
     bty = "n",       # Remove surrounding box
     col = sim.df$g   # Colour by grouping variable
)
The default behaviour of plot() depends on the type of input variables for the x and y arguments. If x is a factor variable, and y is numeric, then R will produce a boxplot:
plot(x = sim.df$g, y = sim.df$x)
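To see this behaviour without the course data, the sketch below simulates a stand-in data frame. Everything about `fake_df` (its name, the seed, the coefficients) is an assumption for illustration and is not the real sim.df. The plots are sent to a null PDF device so the code runs anywhere; in an interactive RStudio session you would drop the pdf(NULL) and dev.off() lines.

```r
# Simulated stand-in for sim.df (the real file is not reproduced here)
set.seed(1)
fake_df <- data.frame(
  x = rnorm(100),
  g = factor(sample(c("a", "b"), size = 100, replace = TRUE))
)
fake_df$y <- 0.5 + 0.3 * fake_df$x + rnorm(100)

# Draw to a null PDF device so no window or file is needed
pdf(NULL)
plot(x = fake_df$x, y = fake_df$y)   # numeric x: scatter plot
plot(x = fake_df$g, y = fake_df$x)   # factor x: one boxplot per level of g
dev.off()
```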
ggplot
A very popular alternative to base R plots is the ggplot2 library (the 2 in the name refers to the second iteration, which is the standard). This is a separate package (i.e. it is not a part of the base R environment) but is very widely used.
Wilkinson, L. (2005). The Grammar of Graphics 2nd Ed. Heidelberg: Springer. https://doi.org/10.1007/0-387-28695-0
Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://doi.org/10.1198/jcgs.2009.07098
Graphs are broken into multiple layers
Layers can be recycled across multiple plots
Let’s recreate the previous scatter plot using ggplot:
library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  # Add scatterplot
  geom_point() +
  # Change axes labels and plot title
  labs(x = "X variable", y = "Y variable", title = "Awesome plot title") +
  # Change default grey theme to black and white
  theme_bw()
One nice feature of ggplot is that it is very easy to create facet plots:
library("ggplot2")
ggplot(data = sim.df, aes(x = x, y = y, col = g)) +
  geom_point() +
  labs(x = "X variable", y = "Y variable", title = "Awesome plot title") +
  theme_bw() +
  # Separate plots by variable g
  facet_wrap(~ g)
Linear regression models in R are implemented using the lm function.
lm.fit <- lm(formula = y ~ x, data = sim.df)
The formula argument is the specification of the model, and the data argument is the data on which you would like the model to be estimated.
lm.fit
##
## Call:
## lm(formula = y ~ x, data = sim.df)
##
## Coefficients:
## (Intercept)            x
##      0.4416       0.3402
lm
We can specify multivariate models:
lm.multi.fit <- lm(formula = y ~ x + z, data = sim.df)
Interaction models:
lm.inter.fit <- lm(formula = y ~ x * z, data = sim.df)
Note that the direct effects of x and z are also included when an interaction term is specified with *.
Fixed-effect models:
lm.fe.fit <- lm(formula = y ~ x + g, data = sim.df)
And many more!
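The formula shorthand for interactions can be checked directly: y ~ x * z fits the same model as y ~ x + z + x:z. A sketch on simulated data, where `fake_df`, the seed and the coefficients 1, 2, -1 and 0.5 are all arbitrary assumptions rather than the course data:

```r
# Simulated data; the true coefficients are arbitrary
set.seed(42)
n <- 200
fake_df <- data.frame(x = rnorm(n), z = rnorm(n))
fake_df$y <- 1 + 2 * fake_df$x - fake_df$z +
  0.5 * fake_df$x * fake_df$z + rnorm(n)

# y ~ x * z is shorthand for y ~ x + z + x:z
fit_star <- lm(y ~ x * z, data = fake_df)
fit_long <- lm(y ~ x + z + x:z, data = fake_df)

names(coef(fit_star))                       # "(Intercept)" "x" "z" "x:z"
all.equal(coef(fit_star), coef(fit_long))   # TRUE
```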
The lm function returns a list with many components.
When we call the fitted object (e.g. lm.fit), we are presented only with the estimated coefficients.
For more information on the estimated model, use summary(fitted.model):
lm.fit.summary <- summary(lm.fit)
lm.fit.summary
##
## Call:
## lm(formula = y ~ x, data = sim.df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.1618 -0.6669  0.0217  0.6872  3.2098
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.44164    0.03215   13.74   <2e-16 ***
## x            0.34023    0.03192   10.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.016 on 998 degrees of freedom
## Multiple R-squared:  0.1022, Adjusted R-squared:  0.1013
## F-statistic: 113.6 on 1 and 998 DF,  p-value: < 2.2e-16
As with any other function, summary(fitted.model) returns an object. Here, it is a list. What is saved as the output of this function?
names(lm.fit.summary)
##  [1] "call"          "terms"         "residuals"     "coefficients"
##  [5] "aliased"       "sigma"         "df"            "r.squared"
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
If we want to extract other information of interest from the fitted model object, we can use the $ operator to do so:
lm.fit.summary$r.squared
## [1] 0.1022217
Accessing elements from saved models can be very helpful in making comparisons across models.
Suppose we want to extract and compare \(R^2\) across different models.
lm.r2 <- summary(lm.fit)$r.squared
lm.multi.r2 <- summary(lm.multi.fit)$r.squared
lm.inter.r2 <- summary(lm.inter.fit)$r.squared
r2.compare <- data.frame(
  model = c("bivariate", "multivariate", "interaction"),
  r.squared = c(lm.r2, lm.multi.r2, lm.inter.r2)
)
We can print the data frame containing values of \(R^2\):
r2.compare
##          model r.squared
## 1    bivariate 0.1022217
## 2 multivariate 0.1101408
## 3  interaction 0.1205508
Or we can plot them:
ggplot(r2.compare, aes(x = model, y = r.squared)) +
  geom_point(size = 4) +
  # Use `expression` to add 2 as a superscript to R
  ggtitle(expression(paste(R^{2}, " ", "Comparison"))) +
  theme_bw()
lm diagnostics
There are a number of functions that are helpful in producing model diagnostics:
- residuals(fitted.model) extracts the residuals from a fitted model
- coefficients(fitted.model) extracts coefficients
- fitted(fitted.model) extracts fitted values
- plot(fitted.model) is a convenience function for producing a number of useful diagnostic plots
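These extractor functions fit together in a simple way: the residuals are the observed outcome minus the fitted values. A sketch on simulated data, where `fake_df`, the seed, the intercept 0.4 and the slope 0.3 are assumptions for illustration only:

```r
# Simulated data (arbitrary intercept and slope)
set.seed(7)
fake_df <- data.frame(x = rnorm(50))
fake_df$y <- 0.4 + 0.3 * fake_df$x + rnorm(50)
fit <- lm(y ~ x, data = fake_df)

# Residuals are the observed outcome minus the fitted values
manual_resid <- fake_df$y - fitted(fit)
all.equal(unname(residuals(fit)), unname(manual_resid))   # TRUE

# coefficients() and its shorter alias coef() return the same named vector
identical(coefficients(fit), coef(fit))                   # TRUE
```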