TL;DR

Calculating the sum or even the average of a variable isn’t too complicated in R if you are used to software such as Excel. You just need to know what code to use!

Introduction

When working in R and exploring data for research, calculating basic statistics such as the mean or standard deviation is commonplace. There are a few ways of achieving this in R which we will cover, in addition to exploring how to summarise whole data sets by a grouping variable.

Common summary statistics

Before we get started, we need to cover the code for some of the most basic summary statistics. These are pretty self-explanatory, but we have annotated all of them for good measure:

sum() #total of a variable
mean() #the average of a variable
median() #the median of a variable
sd() #the standard deviation of a variable
sd() / mean() #coefficient of variation of a variable
min() #minimum value of a variable
max() #maximum value of a variable
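
As a quick illustration of what each function returns, here is a minimal sketch using a small made-up vector. The name x and its values are assumptions for demonstration only and are not part of the HStack data.

```r
# Hypothetical example values, not taken from the HStack data
x <- c(2, 4, 6, 8)

sum(x)          # total: 20
mean(x)         # average: 5
median(x)       # median: 5
sd(x)           # standard deviation: about 2.58
sd(x) / mean(x) # coefficient of variation: about 0.52
min(x)          # minimum: 2
max(x)          # maximum: 8
```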

Running the code

When you want to use these with your data, you can simply insert the name of the variable within the brackets of the above code to print the result into the console. To do this, we are going to use the HStack_data.csv data set from the reading data into R post.

Data <- read.csv("HStack_data.csv", header=TRUE)
#Remember you need to set your working directory or reference the file location above.
#We have said header=TRUE as there are variable names within the file

Now if you want to summarise a certain variable such as age, you can use the code below, which prints the result into the console:

#We have selected the Age variable but you could try others such as Height or Weight
mean(Data$Age)
## [1] 66.2
sd(Data$Age)
## [1] 7.438166

This is particularly useful if you are cleaning your data and want to quickly check the effect of any changes you have made, but it doesn’t save any of the results for later use. So, using what you learned about creating objects in R, you can assign any of the results to an object:

sum_height <- sum(Data$Height)
mean_height <- mean(Data$Height)
sd_height <- sd(Data$Height)
#By assigning objects using the <- symbol you can call upon them later within R
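
To show why storing results is handy, here is a minimal sketch that reuses two stored objects to calculate the coefficient of variation. The Height values are made up for illustration (an assumption, not the real HStack data), so your own numbers will differ.

```r
# Hypothetical stand-in for the Height column; your own values will differ
Height <- c(160, 170, 180)

mean_height <- mean(Height)
sd_height <- sd(Height)

# Reuse the stored objects to calculate the coefficient of variation
cv_height <- sd_height / mean_height
round(cv_height * 100, 1) # coefficient of variation as a percentage: 5.9
```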

However, this isn’t very efficient either. Luckily, you can use packages to make the process a bit easier. There are other packages that are probably much easier, but we think it is best for you to get to grips with dplyr. With this package, you can summarise a whole data set using just two lines of code:

library(dplyr)
Results <- Data %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#This code asks R to take the data frame 'Data' and then assign the mean of all the variables to the object 'Results'
#across applies the function (mean) across all columns (everything()) in the data set
#The .names part adds the suffix "_mean" to the end of the already existing variable name

Results_2 <- Data %>%
  summarise(across(everything(), list(mean = mean, sd = sd, min = min), .names = "{.col}_{.fn}"))
#This code will return the mean, sd and min of all of the variables in the data frame and assign it to the object 'Results_2'
#The name on the left of each "mean = mean" pair is passed to {.fn} in .names, which names the columns as "variable_function"
#You can keep adding statistics to the summarise line by inserting them within the list () portion of the code

The code won’t give you a table ready for publication but allows you to quickly calculate a number of statistics on your data.
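
To see what the output looks like, here is a minimal sketch run on a small made-up data frame. The object names Toy and Toy_results and the values are assumptions for illustration, not the HStack data. The result is a single row with one column per variable and statistic.

```r
library(dplyr)

# Hypothetical data standing in for the HStack data set
Toy <- data.frame(Age = c(60, 65, 70), Height = c(160, 170, 180))

Toy_results <- Toy %>%
  summarise(across(everything(), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))

names(Toy_results)
# "Age_mean" "Age_sd" "Height_mean" "Height_sd"
```

Individual results can then be pulled out by name, for example Toy_results$Age_mean.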

The extension to this is if you want to group the data by a factor variable such as sex or ethnicity. This can be completed by inserting another line of code into the above which also uses the dplyr package.

Results <- Data %>%
  group_by(Sex) %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#The group_by phrase can be used to group the analyses by a factor variable
#The grouping variable is often coded as a binary variable, i.e. 0 and 1, but any number of groups will work
#This can work with character variable types too

The results are then presented for each group on a new row. The output only shows the numeric codes of the grouping variable, so you need to remember what each code represents.
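
One way to avoid having to remember the codes is to label the grouping variable before summarising. The sketch below uses made-up data, and the coding 0 = Female, 1 = Male is purely an assumption for illustration, so check how your own data are coded.

```r
library(dplyr)

# Hypothetical data; the coding 0 = Female, 1 = Male is an assumption
Toy <- data.frame(
  Sex = c(0, 0, 1, 1),
  Age = c(60, 70, 62, 68),
  Height = c(160, 164, 176, 180)
)

Toy_grouped <- Toy %>%
  mutate(Sex = factor(Sex, levels = c(0, 1), labels = c("Female", "Male"))) %>%
  group_by(Sex) %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))

Toy_grouped # one row per group, with Sex shown as a label rather than a number
```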

Common errors

When you start to become more familiar with R and use a wider range of variables or data sets you may start to encounter errors with the above code. Things to bear in mind are:

  • Non numeric data: mean() returns NA with a warning when it is given a character variable, and you may want to remove such variables from the output anyway.
  • Missing data: Similarly, if a variable contains missing values (NA), functions such as mean() return NA instead of a number.

To get around these issues, you can slightly amend the code to include the following.

Results <- Data %>%
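  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"))
#Use where(is.numeric) to exclude variables that are not numeric
#Use na.rm = TRUE to exclude missing values within the variable
#The ~ mean(.x, ...) form passes na.rm directly to mean, as recent versions of dplyr warn when extra arguments are given to across itself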
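
To see both fixes in action, here is a minimal sketch on a made-up data frame that contains a character column and a missing value. The names Toy, ID and Toy_results are assumptions for illustration. The mean is written with a formula (~) so that na.rm is handed straight to mean.

```r
library(dplyr)

# Hypothetical data with a character column and a missing value
Toy <- data.frame(
  ID = c("a", "b", "c"),
  Age = c(60, NA, 70),
  stringsAsFactors = FALSE
)

mean(Toy$Age)               # NA, because one value is missing
mean(Toy$Age, na.rm = TRUE) # 65, the missing value is ignored

Toy_results <- Toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"))

names(Toy_results) # only "Age_mean"; the character column ID is left out
```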

Conclusion

Now that you know how to use some of the basic summarising functions within R, you can start to explore your data in a bit more detail. Eventually, you may include some of the above in other functions or even use other packages to summarise your data, but the basic principles remain the same.

Complete code

sum() #total of a variable
mean() #the average of a variable
median() #the median of a variable
sd() #the standard deviation of a variable
sd() / mean() #coefficient of variation of a variable
min() #minimum value of a variable
max() #maximum value of a variable

Data <- read.csv("HStack_data.csv", header=TRUE)
#Remember you need to set your working directory or reference the file location above.
#We have said header=TRUE as there are variable names within the file

mean(Data$Age)
sd(Data$Age)
#We have selected the Age variable but you could try others such as Height or Weight

sum_height <- sum(Data$Height)
mean_height <- mean(Data$Height)
sd_height <- sd(Data$Height)
#By assigning objects using the <- symbol you can call upon them later within R

library(dplyr)
Results <- Data %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#This code asks R to take the data frame 'Data' and then assign the mean of all the variables to the object 'Results'
#across applies the function (mean) across all columns (everything()) in the data set
#The .names part adds the suffix "_mean" to the end of the already existing variable name

Results_2 <- Data %>%
  summarise(across(everything(), list(mean = mean, sd = sd, min = min), .names = "{.col}_{.fn}"))
#This code will return the mean, sd and min of all of the variables in the data frame and assign it to the object 'Results_2'
#The name on the left of each "mean = mean" pair is passed to {.fn} in .names, which names the columns as "variable_function"
#You can keep adding statistics to the summarise line by inserting them within the list () portion of the code

Results <- Data %>%
  group_by(Sex) %>%
  summarise(across(everything(), mean, .names = "{.col}_mean"))
#The group_by phrase can be used to group the analyses by a factor variable
#The grouping variable is often coded as a binary variable, i.e. 0 and 1, but any number of groups will work
#This can work with character variable types too

Results <- Data %>%
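  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"))
#Use where(is.numeric) to exclude variables that are not numeric
#Use na.rm = TRUE to exclude missing values within the variable
#The ~ mean(.x, ...) form passes na.rm directly to mean, as recent versions of dplyr warn when extra arguments are given to across itself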