TL;DR
For a language to be a true programming language, you must be able to repeat a task multiple times and to execute commands based on a condition. Once you can do these two things, you have everything you need to solve surprisingly complex problems. IF statements allow code to be executed only if a certain condition is met, whilst FOR and WHILE loops allow us to repeat a block of code multiple times. This post will walk you through the basics of how to perform each of these tasks.
Introduction
When you first start performing data science in R, you may not come across many occasions where you need to control the order in which your code is executed. This is because, in most basic data science scripts, you (the user) choose when each chunk of code is run. However, as the tasks and scripts you write increase in complexity, and are run without your direct supervision, you will find the need to execute certain code only when a condition is met and to repeat blocks of code many times. R offers a number of programming constructs for this. In this post, we will look at IF statements as well as for and while loops.
Conditionals
The main conditional in programming is the IF statement. This statement evaluates a condition to either TRUE or FALSE and, based on the result, performs a given action. Below is a basic example of an IF statement.
sequence <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #generates a sequence of numbers to iterate over
if (length(sequence) == 10) { #initiate the if statement and provide the condition: if the length of the sequence is equal to 10
  print("This sequence has a length of 10") #if the above condition is met, then execute this line of code
}
## [1] "This sequence has a length of 10"
We start by creating a sequence of numbers. Then we write our IF statement. To do this in R, we use the word if followed by a set of brackets. In these brackets we write our condition. In this example, we use the length function to get the length of the sequence of numbers and then use the == operator to check whether this value is the same as another value (in this case 10). We then open a set of curly brackets and, within these, write the code we want to execute if our condition is met (if it evaluates to TRUE). In our example, if the length of the sequence is equal to 10 we print the statement “This sequence has a length of 10”, whilst if the length of the sequence is not equal to 10 nothing is printed. This is shown below.
sequence <- c(1, 2, 3, 4, 5) #all code the same as above except a shorter sequence is used
if (length(sequence) == 10) { #therefore this condition is not met
  print("This sequence has a length of 10") #as the condition is not met, this line of code is not executed
}
As you can see, this statement generates no output as the condition was not met. However, having no output is not ideal. We can address this by using an ELSE statement. An ELSE statement is executed when all the IF conditions above it evaluate to FALSE.
if (length(sequence) == 10) {
  print("This sequence has a length of 10")
} else { #begin an else statement with the else key word
  print("This sequence does not have a length of 10") #if the IF condition is not met, this line of code is executed
}
## [1] "This sequence does not have a length of 10"
That’s better…but could still be improved. We could set up an IF statement to tell us whether the length of the sequence is greater than, less than or equal to 10.
if (length(sequence) == 10) {
  print("This sequence has a length of 10")
} else if (length(sequence) < 10) { #an else if statement is used here, if the IF condition above is not met, this condition is evaluated
  print("This sequence has a length of less than 10") #if the else if condition is met, this line is executed
} else {
  print("This sequence has a length of more than 10") #if neither condition is met, this line is executed
}
## [1] "This sequence has a length of less than 10"
We can see in this code that we use ELSE IF to link multiple IF statements together. This gives us the power to write as many conditions as we need to control the execution of our code.
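As a small sketch of how this chain can be reused, the same IF, ELSE IF and ELSE logic can be wrapped in a function. The function name describe_length and its target argument are hypothetical names chosen for illustration and do not come from any package.

```r
# Hypothetical helper: wraps the if / else if / else chain so the same
# check can be applied to any vector and any target length.
describe_length <- function(x, target = 10) {
  if (length(x) == target) {
    "equal to the target"
  } else if (length(x) < target) {
    "shorter than the target"
  } else {
    "longer than the target"
  }
}

# The value of whichever branch runs is returned by the function
print(describe_length(1:10))
print(describe_length(1:5))
print(describe_length(1:12))
```

Only one branch ever runs for a given input, so the order of the conditions matters when they overlap.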
Alternatives to base R IF statements
There is a wide range of alternatives to base R IF statements. These include the if_else and case_when functions from the dplyr package. I won’t cover these in detail (as they are more useful in data analysis than in programming), but the dplyr documentation describes each function.
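To give a feel for why these alternatives exist, here is a minimal sketch using base R's ifelse rather than the dplyr functions, so that it runs without any packages. An IF statement expects a single TRUE or FALSE, whereas ifelse tests every element of a vector at once. The heights and the cut-off of 160 are made-up values for illustration.

```r
heights <- c(160, 174, 147, 188, 159) # illustrative values only

# ifelse(test, yes, no) returns "yes" where the test is TRUE
# and "no" where it is FALSE, element by element
height_group <- ifelse(heights >= 160, "tall", "short")
print(height_group)
```

dplyr's if_else takes its condition, true value and false value in the same order, but is stricter about the types of the two values.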
Loops
For loops
Loops are incredibly powerful in programming as a way to do the same task multiple times. They are particularly useful for avoiding repeated code (DRY: don’t repeat yourself). Let’s look at a basic for loop below.
sequence <- list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #generates a list of numbers to iterate over
for (i in sequence) { #initiate a for loop that iterates through each number in the list sequence
  print(i) #print each item in the list
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
We start a for loop using the for key word. We then open the brackets and set an iterator. By convention i is often used as the iterator; however, any letter or word can be used. We then specify something to iterate over, which in our case is our list. Finally, we open our curly brackets and write the code we want to execute at each iteration of the loop. In our case we print i, which prints each number in the list.
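Printing is handy for demonstration, but more often we want to keep the result of each iteration. The sketch below is one common way to do this: it iterates over the positions of a vector with seq_along and writes each result into a pre-allocated vector. The names values and squares are made up for this example.

```r
values <- c(3, 6, 9)               # illustrative input
squares <- numeric(length(values)) # pre-allocate a vector for the results

for (i in seq_along(values)) {     # i takes the values 1, 2, 3
  squares[i] <- values[i]^2        # store the result in position i
}

print(squares)
```

Here the iterator is a position rather than the item itself, which is what lets us write into the matching slot of the result.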
While loops
Another type of loop that is used in programming is a while loop. A while loop checks a condition, and if the condition is met, executes a block of code. After this, the condition is re-checked and if it is still met, the code executes again. This process continues until the condition is no longer met. A basic example is shown below.
n <- 1 #set the value of n to 1
while (n < 5) { #initiate a while loop with the condition if n is less than 5
  print(n) #if the above condition is met, print the value of n
  n <- n + 1 #then add 1 to the value of n
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
We start by creating a count variable (n) and assigning it a value of 1. We then use the while key word to start our loop and give it a condition to check (is n less than 5?). We then provide the code to run whilst this condition is met, which in our case prints the value of n and then adds 1 to n. The result is that the numbers are printed from 1 until the value is no longer less than 5 (therefore 1 through 4 are printed). Note that if we forgot to add 1 to n, the condition would always be met and the loop would never stop.
In my experience, you will use for loops much more frequently than while loops. However, having both at your disposal gives you the greatest flexibility in solving problems that present during your coding.
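A while loop earns its place when we do not know in advance how many iterations are needed. As a rough sketch, the loop below keeps halving a number until it is no longer greater than 1 and counts how many halvings that took. The starting value of 100 and the variable names are arbitrary choices for illustration.

```r
value <- 100 # arbitrary starting value
steps <- 0   # counts how many times the loop body has run

while (value > 1) {    # keep going whilst the value is still above 1
  value <- value / 2   # halve the value
  steps <- steps + 1   # record that another halving has happened
}

print(steps)
print(value)
```

With a for loop we would have had to work out the number of iterations before starting, whereas here the condition decides when to stop.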
Putting them together
Combining these two constructs within our code is very powerful. We can put a conditional (IF statement) within a for loop to iterate over an object, check if a condition is met, and then do something with our data.
sequence <- list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #generates a list of numbers to iterate over
for (i in sequence) { #for each number in the list sequence
  if (i %% 2 == 0) { #if the remainder of dividing the number by 2 is 0 (therefore the number is even)
    print(i) #print the value of i (the number in the sequence)
  }
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
In the code above, we use a for loop as before to iterate over each item in our list. We then use an IF statement to check whether dividing the number by 2 leaves a remainder (the %% operator returns the remainder of a division). If the remainder is 0, we know the number is even, so we print it. This allows us to print the even numbers from our sequence within our loop. It highlights how, by combining these two methods, we can perform some more interesting operations relatively simply.
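A variation on the same idea, sketched below, collects the even numbers into a new vector instead of printing them, so they can be used later in a script. The names numbers and evens are made up for this example.

```r
numbers <- 1:10 # the numbers to check
evens <- c()    # start with an empty vector to collect results

for (i in numbers) {
  if (i %% 2 == 0) {     # remainder of 0 means the number is even
    evens <- c(evens, i) # append the even number to the results
  }
}

print(evens)
```

For a simple filter like this, R can also do the work without a loop, for example with numbers[numbers %% 2 == 0], but the loop version extends more easily when each step needs several lines of logic.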
A more real world example
Whilst the examples above illustrate the basic use cases of loops and conditionals, they are not particularly useful for real world data analysis. Below I will try to provide a more meaningful example of where a for loop can be highly effective, and a problem I am commonly asked to solve - loading all csv files from a folder and combining them into one dataset.
library(dplyr)
library(kableExtra)
folder_path <- "./csv_files" #this sets the path to a folder - please change this to point to your folder
files <- list.files(folder_path, pattern = "\\.csv$") #this lists all files ending in .csv in the above specified folder (pattern is a regular expression)
file_list <- list() #create an empty list to write data into
for (i in seq_along(files)) { #creates a for loop to iterate through numbers from 1 to the number of files in the folder (seq_along also copes with an empty folder)
  #we select the ith (1st, 2nd, 3rd etc) item in the empty list, then read the csv data and save it into this position in the list
  #we paste together the folder_path and the ith file name, and set header to TRUE so the first row is used for column names
  file_list[[i]] <- read.csv(paste0(folder_path, "/", files[i]), header = TRUE)
}
#we bind together all the dataframes saved in the file_list list
#we use kable to make the output prettier
kable(file_list %>% bind_rows()) %>%
kable_styling(bootstrap_options = "basic")
ID | Height |
---|---|
1 | 160 |
2 | 174 |
3 | 147 |
4 | 188 |
5 | 159 |
6 | 160 |
7 | 174 |
8 | 147 |
9 | 188 |
10 | 159 |
11 | 160 |
12 | 174 |
13 | 147 |
14 | 188 |
15 | 159 |
Let’s work through the code to see how a for loop can help us load the csv data into a dataframe. We first set a folder_path variable to specify the path to the folder where our csv files sit. We then use the list.files function to get the names of all the files in the folder with the .csv file type. Next, we create an empty list to hold the content of each csv file.

We then get into our for loop. We use i as our iterator and make it iterate from 1 through to the length of our files object (1 through to 3 in our case as we have 3 files). We assign into file_list[[i]], which means assign to the first element of the list, then the second and so on as the iterator progresses. We use the read.csv function to read the files into R, and paste0 to create our file paths, using the notation files[i] to select the first file in our files object, then the second and so on.

This loop creates a list of three dataframes in file_list. We can then use the bind_rows function from dplyr to bind all these dataframes into a single dataframe. I use the kable and kable_styling functions to make this output more attractive, but you could just assign it back to a dataframe object.
Hopefully this shows how a relatively simple for loop can accomplish what could be a very time consuming job. There are many other ways of writing this code in R (lapply, map etc) and we may cover these in future posts.
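As a rough sketch of the lapply route, the code below applies read.csv to every file path and then stacks the results with base R's rbind (used here in place of bind_rows so that no packages are needed). So that it runs anywhere, it first writes three small demo files to a temporary folder; the folder name, file names and values are all made up for illustration.

```r
# Create a temporary folder holding three small demo csv files
folder_path <- file.path(tempdir(), "csv_files_demo") # hypothetical folder
dir.create(folder_path, showWarnings = FALSE)
for (k in 1:3) {
  demo <- data.frame(ID = ((k - 1) * 2 + 1):(k * 2), Height = c(160, 174))
  write.csv(demo, file.path(folder_path, paste0("file", k, ".csv")), row.names = FALSE)
}

# full.names = TRUE returns complete paths, so no pasting is needed
files <- list.files(folder_path, pattern = "\\.csv$", full.names = TRUE)

# lapply calls read.csv once per file and returns a list of dataframes
file_list <- lapply(files, read.csv)

# stack the dataframes on top of each other into one dataframe
combined <- do.call(rbind, file_list)
print(combined)
```

The lapply call replaces both the empty list and the loop, which is why many R users reach for it once the loop version feels familiar.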
More complex programming
As the name of this post highlights, we are only looking to introduce some of the basic ideas of programming in this post. There are many more complex ideas (such as nesting, multiple iterators etc) which we may cover in a later post.
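To give a flavour of nesting, here is a minimal sketch of one for loop inside another, used to fill in a small multiplication table. The matrix size and the names are arbitrary choices for illustration.

```r
times_table <- matrix(0, nrow = 3, ncol = 3) # empty 3 x 3 table to fill in

for (i in 1:3) {     # outer loop: one pass per row
  for (j in 1:3) {   # inner loop: runs in full for every row
    times_table[i, j] <- i * j
  }
}

print(times_table)
```

The inner loop completes all of its iterations each time the outer loop moves on by one, so the body here runs nine times in total.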
Conclusions
Once you can use loops to repeatedly execute a block of code, and use conditional statements to control when blocks of code are executed, you have the ability to write incredibly complex programmes.
Complete code
sequence <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #generates a sequence of numbers to iterate over
if (length(sequence) == 10) { #initiate the if statement and provide the condition: if the length of the sequence is equal to 10
  print("This sequence has a length of 10") #if the above condition is met, then execute this line of code
}
sequence <- c(1, 2, 3, 4, 5) #all code the same as above except a shorter sequence is used
if (length(sequence) == 10) { #therefore this condition is not met
  print("This sequence has a length of 10") #as the condition is not met, this line of code is not executed
}
if (length(sequence) == 10) {
  print("This sequence has a length of 10")
} else { #begin an else statement with the else key word
  print("This sequence does not have a length of 10") #if the IF condition is not met, this line of code is executed
}
if (length(sequence) == 10) {
  print("This sequence has a length of 10")
} else if (length(sequence) < 10) { #an else if statement is used here, if the IF condition above is not met, this condition is evaluated
  print("This sequence has a length of less than 10") #if the else if condition is met, this line is executed
} else {
  print("This sequence has a length of more than 10") #if neither condition is met, this line is executed
}
sequence <- list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #generates a list of numbers to iterate over
for (i in sequence) { #initiate a for loop that iterates through each number in the list sequence
  print(i) #print each item in the list
}
n <- 1 #set the value of n to 1
while (n < 5) { #initiate a while loop with the condition if n is less than 5
  print(n) #if the above condition is met, print the value of n
  n <- n + 1 #then add 1 to the value of n
}
sequence <- list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #generates a list of numbers to iterate over
for (i in sequence) { #for each number in the list sequence
  if (i %% 2 == 0) { #if the remainder of dividing the number by 2 is 0 (therefore the number is even)
    print(i) #print the value of i (the number in the sequence)
  }
}
library(dplyr)
library(kableExtra)
folder_path <- "./csv_files" #this sets the path to a folder - please change this to point to your folder
files <- list.files(folder_path, pattern = "\\.csv$") #this lists all files ending in .csv in the above specified folder (pattern is a regular expression)
file_list <- list() #create an empty list to write data into
for (i in seq_along(files)) { #creates a for loop to iterate through numbers from 1 to the number of files in the folder (seq_along also copes with an empty folder)
  #we select the ith (1st, 2nd, 3rd etc) item in the empty list, then read the csv data and save it into this position in the list
  #we paste together the folder_path and the ith file name, and set header to TRUE so the first row is used for column names
  file_list[[i]] <- read.csv(paste0(folder_path, "/", files[i]), header = TRUE)
}
#we bind together all the dataframes saved in the file_list list
#we use kable to make the output prettier
kable(file_list %>% bind_rows()) %>%
  kable_styling(bootstrap_options = "basic")