TL;DR

Our earlier post on analysing glucose data introduces how to analyse datasets collected by FreeStyle Libre devices. This post extends that idea by showing how to run the same analysis across multiple files.

Introduction

Many of our posts cover how to conduct analysis on a dataset loaded from a single source, whether that is an API, an Excel file, or a file generated by an accelerometer or continuous glucose monitor. However, research data often does not come neatly packaged in a single file. Instead, we have tens, hundreds or thousands of files that we need to combine. This post will cover how to run analysis across many files.

Examining our files

Before we begin batch processing our files, let's examine what they look like so we understand the manipulations we need to perform. For this example, we will use continuous glucose data stored in .txt files, which we will read into R with the read.delim function. We specify header = FALSE to indicate that the first row of our data does not contain column names (as we don't yet know whether it does).

library(kableExtra)
data <- read.delim("./cgm_files/CGM02.txt", header = FALSE)
#using the read.delim function to read the .txt file into R
kable(head(data[, 1:5])) 

We can see our data has the subject ID in the first cell of the first column. We can also see that our column names are stored in the third row. Now we can perform a couple of basic manipulations to tidy our data a little more.

library(dplyr)
library(janitor)
#library necessary packages into R

raw <- read.delim("./cgm_files/CGM02.txt", header = FALSE)
data <- raw %>% 
  row_to_names(3) %>% 
  mutate(id = raw[1, 1]) %>%
  relocate(id) 
#read the raw data in, keeping it in raw so the id cell is still available after the headers move
#move the third row to become the column names
#set the entire id column to the value of the first cell of the raw data
#relocate id to the front of the dataframe

That looks a lot better! First, we use the row_to_names function from the janitor package to make the 3rd row of our data the column names. Then we generate an id column so we know which participant the data refers to, and relocate it to the front of the dataframe. Next, to prepare our data for the iglu analysis functions, we need to rename and mutate a few columns.
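If row_to_names is new to you, it helps to see it in isolation. Below is a minimal sketch on a made-up two-column data frame (the values are invented purely for illustration):

```r
library(janitor)

#a made-up stand-in for the raw file: the real headers sit in row 3
toy <- data.frame(
  V1 = c("CGM02", "meta", "Time",        "08:00"),
  V2 = c("",      "",     "Record Type", "0")
)

toy <- row_to_names(toy, 3)
#row 3 becomes the column names; rows 1 to 3 are removed from the data
names(toy)
# → "Time" "Record Type"
```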

#filter the records to only include those with Record Type 0
#rename the Time and Historic Glucose columns
#set the gl column to numeric and multiply by 18 to convert mmol/L to mg/dL
#parse the time column into a POSIXct object
#drop the original ID column using select
#view the dataset using head
data %>%
  filter(`Record Type` == 0) %>% 
  rename(time = Time, gl = `Historic Glucose (mmol/L)`) %>% 
  mutate(gl = as.numeric(gl) * 18, 
         time = as.POSIXct(time, format = "%Y/%m/%d %H:%M", tz = "")) %>% 
  select(-ID) %>% 
  head() %>%
  kable()

In this code we filter the record type to 0 (our CGM analysis post covers why we make this choice), then rename the Time and Historic Glucose columns so they can be entered into the iglu functions. Next, we change the time column to a POSIXct object and the gl column to numeric, multiplying the value by 18 to convert from mmol/L to the mg/dL units iglu works with. Finally, we remove the ID column, as we won't use it going forward and it adds confusion between the id and ID columns.
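The multiply-by-18 step is just the standard glucose unit conversion (1 mmol/L is approximately 18 mg/dL), and the POSIXct call parses the timestamp text. A quick base-R sketch with invented values:

```r
gl_mmol <- c(5.5, 7.2)     #glucose readings in mmol/L (invented)
gl_mgdl <- gl_mmol * 18    #convert to mg/dL
gl_mgdl
# → 99.0 129.6

stamp <- as.POSIXct("2021/03/01 08:15", format = "%Y/%m/%d %H:%M", tz = "UTC")
format(stamp, "%H:%M")
# → "08:15"
```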

Functionalising this code

The code above is not hugely complex, but it does span around 10 lines. It is therefore best to package it up into a function. Below is that code.

folder <- "./cgm_files" 
#set the folder path

#get a list of file names from that folder
file_names <- list.files(path = folder, pattern = "\\.txt$") 
#the pattern argument takes a regular expression, so "\\.txt$" matches file names ending in .txt

read_glucose_txt <- function(folder_path, file_name) { #create the read_glucose_txt function
  raw <- read.delim(file.path(folder_path, file_name), header = FALSE) #file.path builds the path with the correct separator on any operating system
  
  data <- raw %>%
    row_to_names(3) %>% #move the data in the 3rd row to column names
    filter(`Record Type` == 0) %>%
    rename(time = Time, gl = `Historic Glucose (mmol/L)`) %>%
    mutate(
      id = raw[1, 1],
      gl = as.numeric(gl) * 18,
      time = as.POSIXct(time, format = "%Y/%m/%d %H:%M", tz = "")
    ) %>%
    relocate(id) %>%
    select(-ID)
  
  return(data)
}

The code above is slightly reordered from what we have used before, but all the constituent parts are there. The key difference is that our function takes its folder path and file name as arguments, which we will supply from the folder and file_names objects we created at the start. If you have not yet fully grasped the concept of functions, consider reading our introduction to them here. By creating this function, we can read any text file containing continuous glucose data into R and perform the same manipulations on each file, meaning they will be consistent. The importance of this will become clear in the next step.
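Before moving to many files, it is worth sanity-checking the function on one. The snippet below is deliberately self-contained: it repeats the function from above and writes a tiny invented four-line CGM file to a temporary folder, so you can run it without any real data.

```r
library(dplyr)
library(janitor)

read_glucose_txt <- function(folder_path, file_name) {
  raw <- read.delim(file.path(folder_path, file_name), header = FALSE)
  raw %>%
    row_to_names(3) %>%
    filter(`Record Type` == 0) %>%
    rename(time = Time, gl = `Historic Glucose (mmol/L)`) %>%
    mutate(
      id = raw[1, 1],
      gl = as.numeric(gl) * 18,
      time = as.POSIXct(time, format = "%Y/%m/%d %H:%M", tz = "")
    ) %>%
    relocate(id) %>%
    select(-ID)
}

#write a fake CGM file (values invented) to a temporary folder
tmp <- tempdir()
writeLines(c(
  "CGM02",
  "meta",
  "ID\tTime\tRecord Type\tHistoric Glucose (mmol/L)",
  "1\t2021/03/01 08:00\t0\t5.5"
), file.path(tmp, "CGM02.txt"))

one_file <- read_glucose_txt(tmp, "CGM02.txt")
#one_file now has id, time and gl columns; gl is 5.5 * 18 = 99 mg/dL
```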

Reading in multiple data files

Above we read in a single data file; now let's look at how to read in more than one at once. To do this, we will place the read_glucose_txt function we created above inside a for loop. I will assume you have an understanding of for loops, but if not, please revisit our previous post that covers them here.

library(magrittr)

files <- list() 
#define an empty list
  
for (i in seq_along(file_names)) {
  files[[i]] <- read_glucose_txt(folder, file_names[i]) #set the ith element of files to the output of read_glucose_txt, given the folder path and the ith name from file_names
}

#bind together all the dataframes stored within the files object
files %>% bind_rows() %>% 
  head() %>%
  kable()

We start in this code by defining an empty list to hold the files as we read them in. We then initialise a for loop that iterates from 1 to the length of file_names (in our case 4). Within the loop, we call read_glucose_txt with the folder path and the ith item of file_names (first file name 1, then 2 and so on), and assign the result to the ith element of the empty list files. Finally, we use the bind_rows function to bind every file read in during the loop into one large dataframe.
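As an aside, this define-a-list-then-fill pattern is exactly what base R's lapply does for us. A self-contained sketch using an invented stand-in for the per-file reader:

```r
library(dplyr)

#an invented stand-in for read_glucose_txt: returns two readings per "file"
read_one <- function(file_name) {
  data.frame(id = file_name, gl = c(99, 108))
}

file_names <- c("CGM01.txt", "CGM02.txt")

files    <- lapply(file_names, read_one)  #one list element per file, like our for loop
all_data <- bind_rows(files)              #stack the list into one dataframe

nrow(all_data)
# → 4
```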

Functionalising this code

Again, this code is not massively complex, but we will wrap it in a function for simplicity later on.

read_multiple_glucose_txt <- function(folder_path, file_names){
  files <- list() #define an empty list
  
  for (i in seq_along(file_names)) {
    files[[i]] <- read_glucose_txt(folder_path, file_names[i]) 
  } #use the read_glucose_txt function above to read in the txt files and save them in the list
  
  return(files %>% bind_rows()) #bind all dataframes from the list together 
}

As before, the code used here is the same as above, just slightly reorganised. We can run this function to create our data.

#run the read_multiple_glucose_txt function
#view the data using head
data <- read_multiple_glucose_txt(folder, file_names) 
kable(head(data))

Now we have all our files read in and ready for analysis.

The problem of speed

I wanted to add a quick comment here about the speed of our code. I have made no effort to make this code run quickly because, with only 4 files to load, it makes little to no difference. However, if you were reading in hundreds or thousands of files, other functions may be beneficial. I would recommend exploring the map functions from purrr for faster iteration and vroom from the vroom package for faster data loading. However, the biggest lesson I have learned when looking at code speed is: first make the code work, then work out where your bottleneck is (the slowest part of your code), then look to speed that up. This topic is massively complex and not something we are likely to cover any time soon, but the following post provides an excellent overview of the process.
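For completeness, here is roughly what a purrr version of our loop might look like. This is a sketch with an invented stand-in reader; the real read_glucose_txt from above would slot in the same way.

```r
library(purrr)
library(dplyr)

#invented stand-in for read_glucose_txt
read_one <- function(file_name) data.frame(id = file_name, gl = c(95, 120))

file_names <- c("CGM01.txt", "CGM02.txt", "CGM03.txt")

#map applies read_one to every file name; bind_rows stacks the results
all_data <- map(file_names, read_one) %>% bind_rows()

nrow(all_data)
# → 6
```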

In our current code, the bottleneck is the iglu analysis function. Therefore, worrying about the speed of our data load is irrelevant until we have sped up our analysis code.
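You can verify a claim like this yourself with base R's system.time. A self-contained sketch with two artificial steps (the sleep simulates a slow analysis; the numbers are illustrative only):

```r
fast_load    <- function() "loaded"                   #stands in for reading the data
slow_analyse <- function() { Sys.sleep(0.2); "done" } #stands in for the analysis

t_load    <- system.time(fast_load())["elapsed"]
t_analyse <- system.time(slow_analyse())["elapsed"]

t_analyse > t_load
# → TRUE, so effort is best spent speeding up the analysis step
```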

Running the files through iglu

Now that we have all our data prepared, we can pass it through iglu.

library(iglu)
#run the active_percent function from the iglu package
active_percent(data) 
#run the all_metrics function from the iglu package
all_metrics(data) 

We can see from the above code that we get summary data for each participant in our dataframe. If we wanted a daily summary, we could use a system similar to the one Andy used in his previous post and add it to our read_glucose_txt function.
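For illustration, one simple way to get a per-day summary with dplyr (a sketch on invented readings, not necessarily the approach from Andy's post):

```r
library(dplyr)

#invented glucose readings across two days for one participant
cgm <- data.frame(
  id   = "CGM02",
  time = as.POSIXct(c("2021/03/01 08:00", "2021/03/01 20:00", "2021/03/02 08:00"),
                    format = "%Y/%m/%d %H:%M", tz = "UTC"),
  gl   = c(99, 126, 108)
)

daily <- cgm %>%
  mutate(day = as.Date(time)) %>%
  group_by(id, day) %>%
  summarise(mean_gl = mean(gl), .groups = "drop")
#one row per participant per day, with the mean glucose for that day
```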

Conclusion

Hopefully the post above has shown how we can perform analysis when our participants' data is spread across multiple files. This methodology can be applied to any analysis function, not just those in the iglu package.

Complete Code

library(kableExtra)
data <- read.delim("./cgm_files/CGM02.txt", header = FALSE)
#using the read.delim function to read the .txt file into R
kable(head(data[, 1:5])) 

library(dplyr)
library(janitor)
#library necessary packages into R

raw <- read.delim("./cgm_files/CGM02.txt", header = FALSE)
data <- raw %>% 
  row_to_names(3) %>% 
  mutate(id = raw[1, 1]) %>%
  relocate(id) 
#read the raw data in, keeping it in raw so the id cell is still available after the headers move
#move the third row to become the column names
#set the entire id column to the value of the first cell of the raw data
#relocate id to the front of the dataframe

data %>%
  filter(`Record Type` == 0) %>% 
  rename(time = Time, gl = `Historic Glucose (mmol/L)`) %>% 
  mutate(gl = as.numeric(gl) * 18, 
         time = as.POSIXct(time, format = "%Y/%m/%d %H:%M", tz = "")) %>% 
  select(-ID) %>% 
  head() %>%
  kable()
#filter the records to only include those with Record Type 0
#rename the Time and Historic Glucose columns
#set the gl column to numeric and multiply by 18 to convert mmol/L to mg/dL
#parse the time column into a POSIXct object
#drop the original ID column using select
#view the dataset using head

folder <- "./cgm_files" 
#set the folder path

#get list of file names from that folder
file_names <- list.files(path = folder, pattern = "\\.txt$") 
#the pattern argument takes a regular expression, so "\\.txt$" matches file names ending in .txt

read_glucose_txt <- function(folder_path, file_name) { #create the read_glucose_txt function
  raw <- read.delim(file.path(folder_path, file_name), header = FALSE) #file.path builds the path with the correct separator on any operating system
  
  data <- raw %>%
    row_to_names(3) %>% #move the data in the 3rd row to column names
    filter(`Record Type` == 0) %>%
    rename(time = Time, gl = `Historic Glucose (mmol/L)`) %>%
    mutate(
      id = raw[1, 1],
      gl = as.numeric(gl) * 18,
      time = as.POSIXct(time, format = "%Y/%m/%d %H:%M", tz = "")
    ) %>%
    relocate(id) %>%
    select(-ID)
  
  return(data)
}

library(magrittr)

files <- list() 
#define an empty list
  
for (i in seq_along(file_names)) {
  files[[i]] <- read_glucose_txt(folder, file_names[i]) #set the ith element of files to the output of read_glucose_txt, given the folder path and the ith name from file_names
}

files %>% bind_rows() %>% 
  head() %>%
  kable()
#bind together all the dataframes stored within the files object

read_multiple_glucose_txt <- function(folder_path, file_names){
  files <- list() #define an empty list
  
  for (i in seq_along(file_names)) {
    files[[i]] <- read_glucose_txt(folder_path, file_names[i]) 
  } #use the read_glucose_txt function above to read in the txt files and save them in the list
  
  return(files %>% bind_rows()) #bind all dataframes from the list together 
}

data <- read_multiple_glucose_txt(folder, file_names) 
kable(head(data))
#run the read_multiple_glucose_txt function
#view the data using head

library(iglu)
active_percent(data) 
#run the active_percent function from the iglu package

all_metrics(data) 
#run the all_metrics function from the iglu package