TL;DR
Calculating the sum or the average of a variable isn’t too complicated in R if you are used to software such as Excel. You just need to know what code to use!
Introduction
When working in R and exploring data for research, calculating basic statistics such as the mean or standard deviation is commonplace. There are a few ways of achieving this in R, which we will cover, in addition to exploring how to summarise whole data sets by a grouping variable.
Common summary statistics
But before we get started, we need to cover the code for some of the most basic summary statistics. These are pretty self-explanatory, but we have annotated all of them for good measure:
sum() #total of a variable
mean() #the average of a variable
median() #the median of a variable
sd() #the standard deviation of a variable
sd() / mean() #coefficient of variation of a variable
min() #minimum value of a variable
max() #maximum value of a variable
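To see what these functions return, here is a minimal sketch using a small made-up vector. The object name x and its values are assumptions for illustration only and do not come from the HStack data set.

```r
# Hypothetical example values, not taken from HStack_data.csv
x <- c(2, 4, 4, 6, 9)

sum(x)            # total: 25
mean(x)           # average: 5
median(x)         # middle value: 4
sd(x)             # sample standard deviation: sqrt(7), about 2.65
sd(x) / mean(x)   # coefficient of variation: sd divided by mean
min(x)            # smallest value: 2
max(x)            # largest value: 9
```

Note that sd() uses the sample formula (dividing by n - 1), which is the usual choice when your data are a sample rather than a whole population.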
Running the code
When you want to use these with your data, you can simply insert the name of the variable within the brackets of the above code to print the result into the console. To do this, we are going to use the HStack_data.csv data set from the reading data into R post.
Data <- read.csv("HStack_data.csv", header=TRUE)
#Remember you need to set your working directory or reference the file location above.
#We have said header=TRUE as there are variable names within the file
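If you are unsure whether R is looking in the right folder, the sketch below may help. It writes a tiny made-up file to a temporary folder and reads it back using a full path. The file name example.csv, the object names and the values are all assumptions for illustration.

```r
# Hypothetical file written to a temporary folder purely for illustration
csv_path <- file.path(tempdir(), "example.csv")
write.csv(data.frame(Age = c(60, 70), Height = c(160, 180)),
          csv_path, row.names = FALSE)

# Reading with a full path avoids relying on the working directory
example <- read.csv(csv_path, header = TRUE)

str(example)   # lists each variable with its type
getwd()        # the folder R searches when given only a file name
```

Checking str() straight after reading a file is a quick way to confirm that the variable names were picked up from the header row.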
Now if you want to summarise a certain variable such as age, you can use the below code, which prints the result into the console:
#We have selected the Age variable but you could try others such as Height or Weight
mean(Data$Age)
## [1] 66.2
sd(Data$Age)
## [1] 7.438166
This is particularly useful if you are cleaning your data and want to quickly see any differences you have made, but it doesn’t save any of the results to be used later on. So, using what you learned about creating objects in R, you can assign any of the results to an object:
sum_height <- sum(Data$Height)
mean_height <- mean(Data$Height)
sd_height <- sd(Data$Height)
#By assigning objects using the <- symbol you can call upon them later within R
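As a rough illustration of why storing results is handy, the sketch below reuses saved objects in a later calculation. The height vector is made up and simply stands in for Data$Height.

```r
# Hypothetical heights in cm, standing in for Data$Height
height <- c(160, 170, 180)

mean_height <- mean(height)   # 170
sd_height   <- sd(height)     # 10

# Stored results can be combined in later calculations
cv_height <- sd_height / mean_height
round(cv_height * 100, 1)     # coefficient of variation as a percentage: 5.9
```

Typing the name of an object on its own, such as mean_height, prints its stored value into the console.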
However, this also isn’t very efficient. Luckily, you can use packages to make the process a bit easier for you. There are other packages that are probably much easier, but we think it is best for you to get to grips with dplyr. With this package, you can summarise a whole data set using just two lines of code:
library(dplyr)
Results <- Data %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#This code asks R to take the data frame 'Data' and then assign the mean of all the variables to the object 'Results'
#across applies the function (mean) across all columns (everything()) in the data set
#The .names part adds the suffix "_mean" to the end of the already existing variable name
Results_2 <- Data %>%
  summarise(across(everything(), list(mean = mean, sd = sd, min = min), .names = "{.col}_{.fn}"))
#This code will return the mean, sd and min of all of the variables in the data frame and assign it to the object 'Results_2'
#The "mean = mean" code passes the function names to .names which renames the columns by "variable_function"
#You can keep adding statistics to the summarise line by inserting them within the list() portion of the code
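To show how the column names are built, here is a small sketch on a made-up data frame. The object toy and its values are assumptions; only the column names Age and Height echo the variables mentioned in this post.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  Age    = c(60, 65, 70),
  Height = c(160, 170, 180)
)

toy_summary <- toy %>%
  summarise(across(everything(),
                   list(mean = mean, median = median, max = max),
                   .names = "{.col}_{.fn}"))

names(toy_summary)
# "Age_mean" "Age_median" "Age_max" "Height_mean" "Height_median" "Height_max"
```

The result is a single row with one column for every combination of variable and statistic, so it can get wide quickly if you add many functions to the list.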
The code won’t give you a table ready for publication but allows you to quickly calculate a number of statistics on your data.
The extension to this is if you want to group the data by a factor variable such as sex or ethnicity. This can be completed by inserting another line of code into the above which also uses the dplyr package.
Results <- Data %>%
  group_by(Sex) %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#The group_by phrase can be used to group the analyses by a factor variable
#A variable coded as numbers, e.g. 0 and 1, works, and it can have more than two categories
#This can work with character variable types too
The results are then presented for each group on a new row. The output only shows the numeric codes, so you need to remember what each value of the grouping variable represents.
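One way to avoid having to remember the codes is to label the grouping variable before summarising. The sketch below assumes a made-up data frame in which 0 means "Female" and 1 means "Male". That coding is purely an assumption for illustration and may not match your data.

```r
library(dplyr)

# Hypothetical toy data; the 0/1 coding and its labels are assumptions
toy <- data.frame(
  Sex    = c(0, 0, 1, 1),
  Height = c(160, 170, 175, 185)
)

labelled <- toy %>%
  mutate(Sex = factor(Sex, levels = c(0, 1), labels = c("Female", "Male"))) %>%
  group_by(Sex) %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))

labelled   # one row per group, with Sex shown as a label rather than a number
```

Because the grouping column is left out of everything() automatically, only Height is averaged here.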
Common errors
When you start to become more familiar with R and use a wider range of variables or data sets you may start to encounter errors with the above code. Things to bear in mind are:
- Non-numeric data: The above code may not like character variables in your data (mean() returns NA with a warning for them) or you may want to remove them from the output.
- Missing data: Similarly, if you have missing data, the code may return NA instead of a result.
To get around these issues, you can slightly amend the code to include the following.
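Results <- Data %>%
  summarise(across(where(is.numeric), function(x) mean(x, na.rm = TRUE), .names = "{.col}_mean"))
#Use where(is.numeric) to exclude variables that are not numeric
#Use na.rm = TRUE to exclude missing values within the variable
#na.rm = TRUE sits inside a small function because dplyr 1.1.0 onwards deprecates passing extra arguments through across()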
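To see what missing data does to a single calculation, here is a minimal sketch with a made-up vector containing one NA. The object name x and its values are assumptions.

```r
# Hypothetical values with one missing observation
x <- c(10, 20, NA)

mean(x)                 # NA: missing values propagate by default
mean(x, na.rm = TRUE)   # 15: the NA is dropped before averaging
sum(is.na(x))           # 1: how many values are missing
```

Counting the missing values first is worthwhile, as na.rm = TRUE quietly changes how many observations each statistic is based on.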
Conclusion
Now that you know how to use some of the basic summarising functions within R, you can start to explore your data in a bit more detail. Eventually, you may include some of the above in other functions or even use other packages to summarise your data, but the basic principles remain the same.
Complete code
sum() #total of a variable
mean() #the average of a variable
median() #the median of a variable
sd() #the standard deviation of a variable
sd() / mean() #coefficient of variation of a variable
min() #minimum value of a variable
max() #maximum value of a variable
Data <- read.csv("HStack_data.csv", header=TRUE)
#Remember you need to set your working directory or reference the file location above.
#We have said header=TRUE as there are variable names within the file
mean(Data$Age)
sd(Data$Age)
#We have selected the Age variable but you could try others such as Height or Weight
sum_height <- sum(Data$Height)
mean_height <- mean(Data$Height)
sd_height <- sd(Data$Height)
#By assigning objects using the <- symbol you can call upon them later within R
library(dplyr)
Results <- Data %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#This code asks R to take the data frame 'Data' and then assign the mean of all the variables to the object 'Results'
#across applies the function (mean) across all columns (everything()) in the data set
#The .names part adds the suffix "_mean" to the end of the already existing variable name
Results_2 <- Data %>%
  summarise(across(everything(), list(mean = mean, sd = sd, min = min), .names = "{.col}_{.fn}"))
#This code will return the mean, sd and min of all of the variables in the data frame and assign it to the object 'Results_2'
#The "mean = mean" code passes the function names to .names which renames the columns by "variable_function"
#You can keep adding statistics to the summarise line by inserting them within the list() portion of the code
Results <- Data %>%
  group_by(Sex) %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#The group_by phrase can be used to group the analyses by a factor variable
#A variable coded as numbers, e.g. 0 and 1, works, and it can have more than two categories
#This can work with character variable types too
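Results <- Data %>%
  summarise(across(where(is.numeric), function(x) mean(x, na.rm = TRUE), .names = "{.col}_mean"))
#Use where(is.numeric) to exclude variables that are not numeric
#Use na.rm = TRUE to exclude missing values within the variable
#na.rm = TRUE sits inside a small function because dplyr 1.1.0 onwards deprecates passing extra arguments through across()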