TL;DR

One of the most commonly performed actions in any data analysis is the creation or mutation of our data. This post will walk through how to generate new columns as well mutate existing columns using both base R functionality as well as the dplyr package.

Introduction

One of the most commonly performed tasks in any data analysis is the creation and mutation (manipulation) of your data. This often involves performing actions on your existing columns to generate new, potentially more valuable data. You may hear this process referred to by a number of names (manipulation, wrangling, munging) however they all generally refer to the same process: changing our data from how it is, to how we want it to be. There are a number of functions within base R that allow us to perform this process as well as a number of packages that offer us a range of different methodologies to achieve the same outcome.

Creating Variables

Creating variables in a dataframe is a very common task in many data analysis work flows. Variables can be created using both base R as well as the dplyr package. Lets look at base R first. We will use the NHS R dataset ae_attendance to demonstrate our examples in this post. Lets take a look at this dataframe to begin with.

The NHSRdatasets

The NHSRdatasets package provides us a range of datasets that we can use to practice our R skills. I will use them in several of my posts as they provide a more realistic experience of using R than a small dataset generated by myself. To load these datasets, we first library in the NHSRdatasets package and then can call the data from the package as we could a normal function. There are many other packages that provide datasets that you can use when learning R including the datasets package.

library(NHSRdatasets)
ae_attendances <- ae_attendances 
#the ae_attendance dataset is from the NHSRdatasets package so we assign it to our own object so we can manipulate it
#we use the head function here to view the first few rows of our dataset
head(ae_attendances) 
##       period org_code  type attendances breaches admissions
## 1 2017-03-01      RF4     1       21289     2879       5060
## 2 2017-03-01      RF4     2         813       22          0
## 3 2017-03-01      RF4 other        2850        6          0
## 4 2017-03-01      R1H     1       30210     5902       6943
## 5 2017-03-01      R1H     2         807       11          0
## 6 2017-03-01      R1H other       11352      136          0

Now we have had a look at our variables, lets create a new column. To begin with, lets keep this really simple. We may want to create a variable to show all this data came from a first wave of data collection. To do this, we will generate a column called wave and give it a value of 1 for every row.

ae_attendances$wave <- 1
#here create a column wave by specifying the dataset, followed by the column name separated by $ 
#we give every cell in this column the value 1 - this code relies on R knowing to repeat a vector of length 1

Lets take through the syntax for this. We start by specifying the dataframe we want to generate the column in followed by the $ operator followed by the name of the column we want to create. We then use the assignment operator and the value we want to assign to the column. Helpfully, if only a single value is provided R will repeat this value across all rows in that column.

Now lets make a more realistic (and useful!) column. We have the columns attendance and breaches, lets create a column that represents the percent of attendances that turn into admissions.

ae_attendances$attendees_admitted <- (ae_attendances$admissions / ae_attendances$attendances) * 100
#create a column that represents the percent of attendances that turn into admissions

You can see that we start by dividing the admissions by the attendance column and then multiplying this number by 100. A key feature of R is when creating a new column there is no need to write a for loop. R as a language is vectorised meaning that operations are performed in parallel in certain R objects. This means the code we just wrote will create a value for each row in our dataframe without the need to iterate over every row individually. For more information on this, visit this chapter in Programming for R

It is worth noting that whilst we used division and multiplication operators above, we can also use addition and subtraction. There are also other arithmetic operators that can be used we can use in R including ^ to calculate exponents, %% for remainder division and %/% for integer division.

Conditionally mutating

When creating new variables sometimes we want to perform different operations based on the value of another column. For example, in our attendance data there is a factor variable type. Let’s say we wanted to multiply all the the rows categorised as type 1 by 2.

ae_attendances$breaches[ae_attendances$type == 1] <- ae_attendances$breaches[ae_attendances$type == 1] * 2
#multiply all the the rows categorised as type 1 by 2

This code looks a little complex but in reality it combines a number of simple operations. We first select the column we want to mutate (using the $ operator), then in the square brackets we specify the condition we want to mutate the column based on. We then use the assignment indicator, followed by the same code again to select the rows we want to mutate and then multiply these values by 2. This syntax is both repetitive and not very easy to read. Instead of the above, we can use an ifelse statement. In this statement, we specify a test condition, followed by an action to perform if the test is true followed by an action to perform if the statement is false.

ae_attendances$breaches <- ifelse(ae_attendances$type == 1, ae_attendances$breaches * 2, ae_attendances$breaches)
#specify a test condition, followed by an action to perform if the test is true followed by an action to perform if the statement is false

These statements can be made as simple or complex as necessary to achieve nearly all data manipulations. However, as with data visualisations, a number of packages have been developed to simplify the above process. We will now take a look at the dplyr package, part of the tidyverse, and look at how basic data manipulation can be performed.

dplyr

The dplyr package has the ability to both create new as well as mutate existing columns. This can all be done using the mutate function. Lets start by creating a new column where we divide the number of data breaches by the attendance.

library(dplyr)
library(magrittr)
#if you are unaware of %>% symbol - please visit our pipe operator post
#within a mutate function we use the = symbol to create the breaches_attendees column
ae_attendances %>%
  mutate(breaches_attendees = breaches/attendances)
## # A tibble: 12,765 x 9
##    period     org_code type  attendances breaches admissions  wave
##    <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl> <dbl>
##  1 2017-03-01 RF4      1           21289    11516       5060     1
##  2 2017-03-01 RF4      2             813       22          0     1
##  3 2017-03-01 RF4      other        2850        6          0     1
##  4 2017-03-01 R1H      1           30210    23608       6943     1
##  5 2017-03-01 R1H      2             807       11          0     1
##  6 2017-03-01 R1H      other       11352      136          0     1
##  7 2017-03-01 AD913    other        4381        2          0     1
##  8 2017-03-01 RYX      other       19562      258          0     1
##  9 2017-03-01 RQM      1           17414     8120       3597     1
## 10 2017-03-01 RQM      other        7817       86          0     1
## # ... with 12,755 more rows, and 2 more variables: attendees_admitted <dbl>,
## #   breaches_attendees <dbl>

We can see the syntax here differs from base R. We start by specifying the dataset we want to followed by the pipe operator (if you want to know more about the pipe operator, we discuss it in the following blog post). We then specify the mutate function. Inside this function we provide the name of the column we want to create, followed by specifying how we want to calculate the values in this column. A key difference that increases the readability of dplyr code is the fact we can specify the column names rather than the dataframe as well as the column name. This is due to a concept known as tidy evaluation (you do not need to understand this concept to use it effectively).

We can also perform conditional mutation in dplyr using a similar ifelse function as in base R. There are subtle differences between these two functions but we will not cover that at the moment.

#you may notice we use = to create the column and == to check for equality (are the values equal)
ae_attendances %>%
  mutate(breaches = if_else(type == 1, breaches * 2, breaches))
## # A tibble: 12,765 x 8
##    period     org_code type  attendances breaches admissions  wave
##    <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl> <dbl>
##  1 2017-03-01 RF4      1           21289    23032       5060     1
##  2 2017-03-01 RF4      2             813       22          0     1
##  3 2017-03-01 RF4      other        2850        6          0     1
##  4 2017-03-01 R1H      1           30210    47216       6943     1
##  5 2017-03-01 R1H      2             807       11          0     1
##  6 2017-03-01 R1H      other       11352      136          0     1
##  7 2017-03-01 AD913    other        4381        2          0     1
##  8 2017-03-01 RYX      other       19562      258          0     1
##  9 2017-03-01 RQM      1           17414    16240       3597     1
## 10 2017-03-01 RQM      other        7817       86          0     1
## # ... with 12,755 more rows, and 1 more variable: attendees_admitted <dbl>

dplyr also offers an alternative method to conditionally mutate columns: case_when. This can be a very useful function especially when the list of conditions is longer than 2. Let’s take a look at an example.

#the above conditionally mutates columns using case_when
#the tilde symbol (~) is used to separate the condition to the output
ae_attendances %>%
  mutate(
    large_breach= case_when(
      breaches < 500 ~ 1, 
      breaches > 500 & breaches < 1000 ~ 2,
      TRUE ~ 3
    )
  )
## # A tibble: 12,765 x 9
##    period     org_code type  attendances breaches admissions  wave
##    <date>     <fct>    <fct>       <dbl>    <dbl>      <dbl> <dbl>
##  1 2017-03-01 RF4      1           21289    11516       5060     1
##  2 2017-03-01 RF4      2             813       22          0     1
##  3 2017-03-01 RF4      other        2850        6          0     1
##  4 2017-03-01 R1H      1           30210    23608       6943     1
##  5 2017-03-01 R1H      2             807       11          0     1
##  6 2017-03-01 R1H      other       11352      136          0     1
##  7 2017-03-01 AD913    other        4381        2          0     1
##  8 2017-03-01 RYX      other       19562      258          0     1
##  9 2017-03-01 RQM      1           17414     8120       3597     1
## 10 2017-03-01 RQM      other        7817       86          0     1
## # ... with 12,755 more rows, and 2 more variables: attendees_admitted <dbl>,
## #   large_breach <dbl>

The code above assigns the new column, large_breach, a value of 1 if the number of breaches is less than 500, a value of 2 if the value of breaches is between 500 and 1000, and any value greater than 1000 a value of 3. Once again, this function makes it much more readable to conditionally mutate based on multiple conditions.

Conclusion

Creating new variables as well as mutating existing variables is a key step in nearly all data analysis. Both base R, as well as the dplyr package, offer a range wide range of possibilities when it comes to these tasks. In my opinion, the increased readability, as well as the potential for increased speed in large datasets makes dplyr the optimum choice for data manipulation.

Complete code

library(NHSRdatasets)
ae_attendances <- ae_attendances 
#the ae_attendance dataset is from the NHSRdatasets package so we assign it to our own object so we can manipulate it
head(ae_attendances) 
#we use the head function here to view the first few rows of our dataset

ae_attendances$wave <- 1
#here create a column wave by specifying the dataset, followed by the column name separated by $ 
#we give every cell in this column the value 1 - this code relies on R knowing to repeat a vector of length 1

ae_attendances$attendees_admitted <- (ae_attendances$admissions / ae_attendances$attendances) * 100
#create a column that represents the percent of attendances that turn into admissions

ae_attendances$breaches[ae_attendances$type == 1] <- ae_attendances$breaches[ae_attendances$type == 1] * 2
#multiply all the the rows categorised as type 1 by 2

ae_attendances$breaches <- ifelse(ae_attendances$type == 1, ae_attendances$breaches * 2, ae_attendances$breaches)
#specify a test condition, followed by an action to perform if the test is true followed by an action to perform if the statement is false

library(dplyr)
library(magrittr)
ae_attendances %>%
  mutate(breaches_attendees = breaches/attendances)
#if you are unaware of %>% symbol - please visit our pipe operator post
#within a mutate function we use the = symbol to create the breaches_attendees column

ae_attendances %>%
  mutate(breaches = if_else(type == 1, breaches * 2, breaches))
#you may notice we use = to create the column and == to check for equality (are the values equal)

ae_attendances %>%
  mutate(
    large_breach= case_when(
      breaches < 500 ~ 1, 
      breaches > 500 & breaches < 1000 ~ 2,
      TRUE ~ 3
    )
  )
#the above conditionally mutates columns using case_when
#the tilde symbol (~) is used to separate the condition to the output