Chapter 7 Data Wrangling

In data analysis, having known and expected patterns in the data is important. Wrangling is the act of getting messy real world data organized for analysis.

\label{fig:7201}Visualizing messy data.

Figure 7.1: Visualizing messy data.


At the end of this chapter you should be able to

  • Understand concept of tidy data.

  • Wrangle some messy data into a tidier form.


7.1 Tidy Data

Tidy data is a concept in data science and statistics that refers to a specific way of organizing data into a tabular format that is easy to work with. The concept of tidy data was introduced by Hadley Wickham in his paper “Tidy Data” published in the Journal of Statistical Software in 2014.

In tidy data, each row represents a single observation or record, and each column represents a single variable or attribute. Additionally, there should be no duplicate columns or rows, and each cell should contain a single value.

More specifically, tidy data should have the following properties:

  1. Each variable has its own column: In tidy data, each variable or attribute should have its own column. This makes it easy to compare and analyze different variables.

  2. Each observation has its own row: In tidy data, each observation or record should have its own row. This makes it easy to perform calculations on individual observations or groups of observations.

  3. There should be no duplicate columns or rows: In tidy data, there should be no duplicate columns or rows. This ensures that the data is organized in a clear and concise manner.

  4. Each cell should contain a single value: In tidy data, each cell should contain a single value. This makes it easy to perform calculations and manipulate the data.

By organizing data in a tidy format, it becomes easier to analyze and visualize the data using tools like R and Python. Additionally, tidy data is more robust and less prone to errors than other types of data formats, making it easier to work with and share with others.


1.2.1 Online Cheat-sheets


Tidy data visualized:
\label{fig:7202}Visualizing tidy data concepts.

Figure 7.2: Visualizing tidy data concepts.

7.2 Example Walkthrough

We have a processed MS data set of metabolites measured for different bacteria at different time-points with different dosages of an antibiotic.

\label{fig:7203}Checking our messy data table.

Figure 7.3: Checking our messy data table.

Having a look at the data, we can see definitely that it is not tidy.

Our goal will be to: * Read excel data with readxl library. * In a single dplyr pipe create a tidy with culture, dose_mg, metabolite, time_min, abundance as our columns.

read_

Download the data …

url <- "https://raw.githubusercontent.com/jeffsocal/ASMS_R_Basics/main/data/bacterial-metabolites_dose-simicillin_wide.csv"
download.file(url, destfile = "./data/bacterial-metabolites_dose-simicillin_wide.csv")

Let’s start with the read-in.

require(tidyverse)

# read_csv
tbl_met <- "data/bacterial-metabolites_dose-simicillin_wide.csv" %>% read_csv()

tbl_met
## # A tibble: 9 × 21
##   Culture      glutamate_Time_0min glutamate_Time_10min glutamate_Time_20min glutamate_Time_50min glutamate_Time_2hr
##   <chr>                      <dbl>                <dbl>                <dbl>                <dbl>              <dbl>
## 1 e coli dose…              191666              2082823              1796910               846402            1782879
## 2 staph aureu…              125210               132156               114445               117150             137747
## 3 p aeruginos…              978126               397012               267599               424183            1630613
## 4 e coli dose…                1445              1017467             14456533               813321              29825
## 5 staph aureu…                 152               128683              1360154               102188              10920
## 6 p aeruginos…                  23              1612622             18282386               199374             230032
## 7 e coli dose…                 885                42356             11758951                22498               1715
## 8 staph aureu…                 136                 4276              1162855                 4841                130
## 9 p aeruginos…                3098                68146             19333030               106262                164
## # ℹ 15 more variables: lysine_Time_0min <dbl>, lysine_Time_10min <dbl>, lysine_Time_20min <dbl>,
## #   lysine_Time_50min <dbl>, lysine_Time_2hr <dbl>, `succinic acid_Time_0min` <dbl>,
## #   `succinic acid_Time_10min` <dbl>, `succinic acid_Time_20min` <dbl>, `succinic acid_Time_50min` <dbl>,
## #   `succinic acid_Time_2hr` <dbl>, phosphatidylcholine_Time_0min <dbl>, phosphatidylcholine_Time_10min <dbl>,
## #   phosphatidylcholine_Time_20min <dbl>, phosphatidylcholine_Time_50min <dbl>, phosphatidylcholine_Time_2hr <dbl>

separate

Let’s separate Culture into two new variables: Species and Dose.

\label{fig:7204}Separating columns in a tibble.

Figure 7.4: Separating columns in a tibble.

tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:")
## # A tibble: 9 × 22
##   culture      dose_mg glutamate_Time_0min glutamate_Time_10min glutamate_Time_20min glutamate_Time_50min
##   <chr>        <chr>                 <dbl>                <dbl>                <dbl>                <dbl>
## 1 e coli       0mg                  191666              2082823              1796910               846402
## 2 staph aureus 0mg                  125210               132156               114445               117150
## 3 p aeruginosa 0mg                  978126               397012               267599               424183
## 4 e coli       10mg                   1445              1017467             14456533               813321
## 5 staph aureus 10mg                    152               128683              1360154               102188
## 6 p aeruginosa 10mg                     23              1612622             18282386               199374
## 7 e coli       20mg                    885                42356             11758951                22498
## 8 staph aureus 20mg                    136                 4276              1162855                 4841
## 9 p aeruginosa 20mg                   3098                68146             19333030               106262
## # ℹ 16 more variables: glutamate_Time_2hr <dbl>, lysine_Time_0min <dbl>, lysine_Time_10min <dbl>,
## #   lysine_Time_20min <dbl>, lysine_Time_50min <dbl>, lysine_Time_2hr <dbl>, `succinic acid_Time_0min` <dbl>,
## #   `succinic acid_Time_10min` <dbl>, `succinic acid_Time_20min` <dbl>, `succinic acid_Time_50min` <dbl>,
## #   `succinic acid_Time_2hr` <dbl>, phosphatidylcholine_Time_0min <dbl>, phosphatidylcholine_Time_10min <dbl>,
## #   phosphatidylcholine_Time_20min <dbl>, phosphatidylcholine_Time_50min <dbl>, phosphatidylcholine_Time_2hr <dbl>

mutate

Still some issues here. We managed a separation but out dose column isn’t numeric! We need to remove the “mg”.

\label{fig:7205}Mutate:Making new columns or changing existing ones.

Figure 7.5: Mutate:Making new columns or changing existing ones.

tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(Dose = str_remove(dose_mg, "mg"))
## # A tibble: 9 × 23
##   culture      dose_mg glutamate_Time_0min glutamate_Time_10min glutamate_Time_20min glutamate_Time_50min
##   <chr>        <chr>                 <dbl>                <dbl>                <dbl>                <dbl>
## 1 e coli       0mg                  191666              2082823              1796910               846402
## 2 staph aureus 0mg                  125210               132156               114445               117150
## 3 p aeruginosa 0mg                  978126               397012               267599               424183
## 4 e coli       10mg                   1445              1017467             14456533               813321
## 5 staph aureus 10mg                    152               128683              1360154               102188
## 6 p aeruginosa 10mg                     23              1612622             18282386               199374
## 7 e coli       20mg                    885                42356             11758951                22498
## 8 staph aureus 20mg                    136                 4276              1162855                 4841
## 9 p aeruginosa 20mg                   3098                68146             19333030               106262
## # ℹ 17 more variables: glutamate_Time_2hr <dbl>, lysine_Time_0min <dbl>, lysine_Time_10min <dbl>,
## #   lysine_Time_20min <dbl>, lysine_Time_50min <dbl>, lysine_Time_2hr <dbl>, `succinic acid_Time_0min` <dbl>,
## #   `succinic acid_Time_10min` <dbl>, `succinic acid_Time_20min` <dbl>, `succinic acid_Time_50min` <dbl>,
## #   `succinic acid_Time_2hr` <dbl>, phosphatidylcholine_Time_0min <dbl>, phosphatidylcholine_Time_10min <dbl>,
## #   phosphatidylcholine_Time_20min <dbl>, phosphatidylcholine_Time_50min <dbl>, phosphatidylcholine_Time_2hr <dbl>,
## #   Dose <chr>

Ok, the “mg” is removed but if we check the tibble, Dose is still chr, so it’s still computed as text by R! Let’s make it numeric.

tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(dose_mg = str_remove(dose_mg, "mg")) %>%
  mutate(dose_mg = as.numeric(dose_mg))
## # A tibble: 9 × 22
##   culture      dose_mg glutamate_Time_0min glutamate_Time_10min glutamate_Time_20min glutamate_Time_50min
##   <chr>          <dbl>               <dbl>                <dbl>                <dbl>                <dbl>
## 1 e coli             0              191666              2082823              1796910               846402
## 2 staph aureus       0              125210               132156               114445               117150
## 3 p aeruginosa       0              978126               397012               267599               424183
## 4 e coli            10                1445              1017467             14456533               813321
## 5 staph aureus      10                 152               128683              1360154               102188
## 6 p aeruginosa      10                  23              1612622             18282386               199374
## 7 e coli            20                 885                42356             11758951                22498
## 8 staph aureus      20                 136                 4276              1162855                 4841
## 9 p aeruginosa      20                3098                68146             19333030               106262
## # ℹ 16 more variables: glutamate_Time_2hr <dbl>, lysine_Time_0min <dbl>, lysine_Time_10min <dbl>,
## #   lysine_Time_20min <dbl>, lysine_Time_50min <dbl>, lysine_Time_2hr <dbl>, `succinic acid_Time_0min` <dbl>,
## #   `succinic acid_Time_10min` <dbl>, `succinic acid_Time_20min` <dbl>, `succinic acid_Time_50min` <dbl>,
## #   `succinic acid_Time_2hr` <dbl>, phosphatidylcholine_Time_0min <dbl>, phosphatidylcholine_Time_10min <dbl>,
## #   phosphatidylcholine_Time_20min <dbl>, phosphatidylcholine_Time_50min <dbl>, phosphatidylcholine_Time_2hr <dbl>

pivot_

Now, we have a bigger problem… multiple pieces of information encoded into each column! We need this tibble to be long rather than wide to by tidy. We can use pivot_longer from tidyr.

\label{fig:7206x`} Pivoting longer from wide data.

Figure 7.6: Pivoting longer from wide data.

tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(dose_mg = str_remove(dose_mg, "mg")) %>%
  mutate(dose_mg = as.numeric(dose_mg)) %>%
  pivot_longer(cols = 3:22, names_to = "metabolite_time", values_to = "abundance")
## # A tibble: 180 × 4
##    culture dose_mg metabolite_time      abundance
##    <chr>     <dbl> <chr>                    <dbl>
##  1 e coli        0 glutamate_Time_0min     191666
##  2 e coli        0 glutamate_Time_10min   2082823
##  3 e coli        0 glutamate_Time_20min   1796910
##  4 e coli        0 glutamate_Time_50min    846402
##  5 e coli        0 glutamate_Time_2hr     1782879
##  6 e coli        0 lysine_Time_0min          1418
##  7 e coli        0 lysine_Time_10min         2004
##  8 e coli        0 lysine_Time_20min         1498
##  9 e coli        0 lysine_Time_50min         1356
## 10 e coli        0 lysine_Time_2hr            962
## # ℹ 170 more rows

Another happy table…but STILL has multiple data encoded in a single column!

tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(dose_mg = str_remove(dose_mg, "mg")) %>%
  mutate(dose_mg = as.numeric(dose_mg)) %>%
  pivot_longer(cols = 3:22, names_to = "metabolite_time", values_to = "abundance") %>%
  separate(metabolite_time, c("metabolite", "time_min"), sep="_")
## # A tibble: 180 × 5
##    culture dose_mg metabolite time_min abundance
##    <chr>     <dbl> <chr>      <chr>        <dbl>
##  1 e coli        0 glutamate  Time        191666
##  2 e coli        0 glutamate  Time       2082823
##  3 e coli        0 glutamate  Time       1796910
##  4 e coli        0 glutamate  Time        846402
##  5 e coli        0 glutamate  Time       1782879
##  6 e coli        0 lysine     Time          1418
##  7 e coli        0 lysine     Time          2004
##  8 e coli        0 lysine     Time          1498
##  9 e coli        0 lysine     Time          1356
## 10 e coli        0 lysine     Time           962
## # ℹ 170 more rows
tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(dose_mg = str_remove(dose_mg, "mg")) %>%
  mutate(dose_mg = as.numeric(dose_mg)) %>%
  pivot_longer(cols = 3:22, names_to = "metabolite_time", values_to = "abundance") %>%
  separate(metabolite_time, c("metabolite", "time_min"), sep="_", convert = TRUE)
## # A tibble: 180 × 5
##    culture dose_mg metabolite time_min abundance
##    <chr>     <dbl> <chr>      <chr>        <dbl>
##  1 e coli        0 glutamate  Time        191666
##  2 e coli        0 glutamate  Time       2082823
##  3 e coli        0 glutamate  Time       1796910
##  4 e coli        0 glutamate  Time        846402
##  5 e coli        0 glutamate  Time       1782879
##  6 e coli        0 lysine     Time          1418
##  7 e coli        0 lysine     Time          2004
##  8 e coli        0 lysine     Time          1498
##  9 e coli        0 lysine     Time          1356
## 10 e coli        0 lysine     Time           962
## # ℹ 170 more rows

Almost there, but notice how metabolite_time has three items when we separate by _? glutamate_Time_0min -> glutamateTime0min We only provided two column names, though. Perhaps we can rethink our separator.

tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(dose_mg = str_remove(dose_mg, "mg")) %>%
  mutate(dose_mg = as.numeric(dose_mg)) %>%
  pivot_longer(cols = 3:22, names_to = "metabolite_time", values_to = "abundance") %>%
  separate(metabolite_time, c("metabolite", "time_min"), sep="_Time_") %>%
  mutate(
    time_min = case_when(
      grepl("min", time_min, ignore.case = TRUE) ~ as.numeric(gsub("min", "", time_min)),
      grepl("hr", time_min, ignore.case = TRUE) ~ as.numeric(gsub("hr", "", time_min)) * 60
    )
  )
## # A tibble: 180 × 5
##    culture dose_mg metabolite time_min abundance
##    <chr>     <dbl> <chr>         <dbl>     <dbl>
##  1 e coli        0 glutamate         0    191666
##  2 e coli        0 glutamate        10   2082823
##  3 e coli        0 glutamate        20   1796910
##  4 e coli        0 glutamate        50    846402
##  5 e coli        0 glutamate       120   1782879
##  6 e coli        0 lysine            0      1418
##  7 e coli        0 lysine           10      2004
##  8 e coli        0 lysine           20      1498
##  9 e coli        0 lysine           50      1356
## 10 e coli        0 lysine          120       962
## # ℹ 170 more rows

Ok a new problem has appeared: mixed time units in our time_min column. We have both “min” and “hr”. We need to standardize to understand these as time units. We can mutate with case_when, which does specific things conditionally on particular observations in the tibble. This is a dplyr way of doing if statements. We will also advance it up by doing the “character” to “numeric” conversion in one go!

tbl_met_tidy <- tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(dose_mg = str_remove(dose_mg, "mg")) %>%
  mutate(dose_mg = as.numeric(dose_mg)) %>%
  pivot_longer(cols = 3:22, names_to = "metabolite_time", values_to = "abundance") %>%
  separate(metabolite_time, c("metabolite", "time_min"), sep="_Time_") %>%
  mutate(
    time_min = case_when(
      grepl("min", time_min, ignore.case = TRUE) ~ as.numeric(gsub("min", "", time_min)),
      grepl("hr", time_min, ignore.case = TRUE) ~ as.numeric(gsub("hr", "", time_min)) * 60
    )
  )

tbl_met_tidy
## # A tibble: 180 × 5
##    culture dose_mg metabolite time_min abundance
##    <chr>     <dbl> <chr>         <dbl>     <dbl>
##  1 e coli        0 glutamate         0    191666
##  2 e coli        0 glutamate        10   2082823
##  3 e coli        0 glutamate        20   1796910
##  4 e coli        0 glutamate        50    846402
##  5 e coli        0 glutamate       120   1782879
##  6 e coli        0 lysine            0      1418
##  7 e coli        0 lysine           10      2004
##  8 e coli        0 lysine           20      1498
##  9 e coli        0 lysine           50      1356
## 10 e coli        0 lysine          120       962
## # ℹ 170 more rows

Now we can examine our tibble and confirm that once and for all, it is tidy. Let’s save it for later use. We can also pipe in our saving.

write_

tbl_met %>%
  separate(Culture, c("culture", "dose_mg"), sep = " dose:") %>%
  mutate(dose_mg = str_remove(dose_mg, "mg")) %>%
  mutate(dose_mg = as.numeric(dose_mg)) %>%
  pivot_longer(cols = 3:22, names_to = "metabolite_time", values_to = "abundance") %>%
  separate(metabolite_time, c("metabolite", "time_min"), sep="_Time_") %>%
  mutate(
    time_min = case_when(
      grepl("min", time_min, ignore.case = TRUE) ~ as.numeric(gsub("min", "", time_min)),
      grepl("hr", time_min, ignore.case = TRUE) ~ as.numeric(gsub("hr", "", time_min)) * 60
    )
  ) %>%
  write_csv("data/tidyed-met-df.csv")

Non-pipe

One thing you can notice is how the dplyr verbs add up. We can continually pipe one data frame through a series of wrangling steps until we are satisfied. Operating in this way organizes processes in a way that is easy to manage. We could also do this:

# need an additional library for excel documents
library(readxl)

# read_excel can read portions of the excel
no_pipe_tbl_met <- read_excel("data/bacterial-metabolites_dose-simicillin_wide.xlsx")

no_pipe_tbl_met <- separate(no_pipe_tbl_met, Culture, c("culture", "dose_mg"), sep = " dose:")

no_pipe_tbl_met <- mutate(no_pipe_tbl_met, dose_mg = str_remove(dose_mg, "mg"))

no_pipe_tbl_met <- mutate(no_pipe_tbl_met, dose_mg = as.numeric(dose_mg))

no_pipe_tbl_met <- pivot_longer(no_pipe_tbl_met, cols = 3:22, names_to = "metabolite_time", values_to = "abundance")

no_pipe_tbl_met <- separate(no_pipe_tbl_met, metabolite_time, c("metabolite", "time_min"), sep="_Time_")

no_pipe_tbl_met_tidy <-  mutate(no_pipe_tbl_met,
    time_min = case_when(
      grepl("min", time_min, ignore.case = TRUE) ~ as.numeric(gsub("min", "", time_min)),
      grepl("hr", time_min, ignore.case = TRUE) ~ as.numeric(gsub("hr", "", time_min)) * 60
    )
  )


no_pipe_tbl_met_tidy
## # A tibble: 180 × 5
##    culture dose_mg metabolite time_min abundance
##    <chr>     <dbl> <chr>         <dbl>     <dbl>
##  1 e coli        0 glutamate         0    191666
##  2 e coli        0 glutamate        10   2082823
##  3 e coli        0 glutamate        20   1796910
##  4 e coli        0 glutamate        50    846402
##  5 e coli        0 glutamate       120   1782879
##  6 e coli        0 lysine            0      1418
##  7 e coli        0 lysine           10      2004
##  8 e coli        0 lysine           20      1498
##  9 e coli        0 lysine           50      1356
## 10 e coli        0 lysine          120       962
## # ℹ 170 more rows

summarize

Now we have a tidy tibble, we can summarize very rapidly.

tbl_met_tidy %>%
  group_by(culture, metabolite) %>%
  summarize(mean = mean(abundance))
## # A tibble: 12 × 3
## # Groups:   culture [3]
##    culture      metabolite               mean
##    <chr>        <chr>                   <dbl>
##  1 e coli       glutamate            2323045.
##  2 e coli       lysine               1519827.
##  3 e coli       phosphatidylcholine 67749754.
##  4 e coli       succinic acid        1906850.
##  5 p aeruginosa glutamate            2902178 
##  6 p aeruginosa lysine               2336119.
##  7 p aeruginosa phosphatidylcholine 77078132.
##  8 p aeruginosa succinic acid        3525432.
##  9 staph aureus glutamate             226736.
## 10 staph aureus lysine                193944.
## 11 staph aureus phosphatidylcholine  5836359.
## 12 staph aureus succinic acid         133337.

What other questions can we ask???

Exercises

  • Create a new R Studio Project and name it 005_data_wrangling.

  • Create a new R script, add your name and date at the top as comments.

  • Locate and/or download a Tidyverse cheat-sheet and refer to it as needed.

  1. Download the data.
url <- "https://raw.githubusercontent.com/jeffsocal/ASMS_R_Basics/main/data/bacterial-metabolites_dose-simicillin_messy.csv"
download.file(url, destfile = "./data/bacterial-metabolites_dose-simicillin_messy.csv")
  1. Read in the messy bacteria data and store it as a variable.
## # A tibble: 9 × 12
##   Culture           User  glutamate_runtime_0hr glutamate_runtime_24hr glutamate_runtime_2day glutamate_runtime_3day
##   <chr>             <chr>                 <dbl>                  <dbl>                  <dbl>                  <dbl>
## 1 e coli dose--0-m… Mass…                191666                2082823                1796910                 846402
## 2 staph aureus dos… Mass…                125210                 132156                 114445                 117150
## 3 p aeruginosa dos… Mass…                978126                 397012                 267599                 424183
## 4 e coli dose--10-… Mass…                  1445                1017467               14456533                 813321
## 5 staph aureus dos… Mass…                   152                 128683                1360154                 102188
## 6 p aeruginosa dos… Mass…                    23                1612622               18282386                 199374
## 7 e coli dose--20-… Mass…                   885                  42356               11758951                  22498
## 8 staph aureus dos… Mass…                   136                   4276                1162855                   4841
## 9 p aeruginosa dos… Mass…                  3098                  68146               19333030                 106262
## # ℹ 6 more variables: glutamate_runtime_5day <dbl>, lysine_runtime_0hr <dbl>, lysine_runtime_24hr <dbl>,
## #   lysine_runtime_2day <dbl>, lysine_runtime_3day <dbl>, lysine_runtime_5day <dbl>

In all proceeding exercises, pipe results from previous exercise into current exercise creating a single lone pipe for data processing

  1. Separate Culture column containing culture and dose into culture and dose_mg_ml columns.
## # A tibble: 9 × 13
##   culture      dose_mg_ml User                glutamate_runtime_0hr glutamate_runtime_24hr glutamate_runtime_2day
##   <chr>        <chr>      <chr>                               <dbl>                  <dbl>                  <dbl>
## 1 e coli       0-mg/ml    Mass Spectrometrist                191666                2082823                1796910
## 2 staph aureus 0-mg/ml    Mass Spectrometrist                125210                 132156                 114445
## 3 p aeruginosa 0-mg/ml    Mass Spectrometrist                978126                 397012                 267599
## 4 e coli       10-mg/ml   Mass Spectrometrist                  1445                1017467               14456533
## 5 staph aureus 10-mg/ml   Mass Spectrometrist                   152                 128683                1360154
## 6 p aeruginosa 10-mg/ml   Mass Spectrometrist                    23                1612622               18282386
## 7 e coli       20-mg/ml   Mass Spectrometrist                   885                  42356               11758951
## 8 staph aureus 20-mg/ml   Mass Spectrometrist                   136                   4276                1162855
## 9 p aeruginosa 20-mg/ml   Mass Spectrometrist                  3098                  68146               19333030
## # ℹ 7 more variables: glutamate_runtime_3day <dbl>, glutamate_runtime_5day <dbl>, lysine_runtime_0hr <dbl>,
## #   lysine_runtime_24hr <dbl>, lysine_runtime_2day <dbl>, lysine_runtime_3day <dbl>, lysine_runtime_5day <dbl>
  1. Make dose_mg_ml column numeric by removing the text and change the column data type from character to numeric.
## # A tibble: 9 × 13
##   culture      dose_mg_ml User                glutamate_runtime_0hr glutamate_runtime_24hr glutamate_runtime_2day
##   <chr>             <dbl> <chr>                               <dbl>                  <dbl>                  <dbl>
## 1 e coli                0 Mass Spectrometrist                191666                2082823                1796910
## 2 staph aureus          0 Mass Spectrometrist                125210                 132156                 114445
## 3 p aeruginosa          0 Mass Spectrometrist                978126                 397012                 267599
## 4 e coli               10 Mass Spectrometrist                  1445                1017467               14456533
## 5 staph aureus         10 Mass Spectrometrist                   152                 128683                1360154
## 6 p aeruginosa         10 Mass Spectrometrist                    23                1612622               18282386
## 7 e coli               20 Mass Spectrometrist                   885                  42356               11758951
## 8 staph aureus         20 Mass Spectrometrist                   136                   4276                1162855
## 9 p aeruginosa         20 Mass Spectrometrist                  3098                  68146               19333030
## # ℹ 7 more variables: glutamate_runtime_3day <dbl>, glutamate_runtime_5day <dbl>, lysine_runtime_0hr <dbl>,
## #   lysine_runtime_24hr <dbl>, lysine_runtime_2day <dbl>, lysine_runtime_3day <dbl>, lysine_runtime_5day <dbl>
  1. Pivot the table from wide to long creating metabolite, time_hr & abundance columns.
## # A tibble: 90 × 6
##    culture dose_mg_ml User                metabolite time_hr abundance
##    <chr>        <dbl> <chr>               <chr>      <chr>       <dbl>
##  1 e coli           0 Mass Spectrometrist glutamate  0hr        191666
##  2 e coli           0 Mass Spectrometrist glutamate  24hr      2082823
##  3 e coli           0 Mass Spectrometrist glutamate  2day      1796910
##  4 e coli           0 Mass Spectrometrist glutamate  3day       846402
##  5 e coli           0 Mass Spectrometrist glutamate  5day      1782879
##  6 e coli           0 Mass Spectrometrist lysine     0hr          1418
##  7 e coli           0 Mass Spectrometrist lysine     24hr         2004
##  8 e coli           0 Mass Spectrometrist lysine     2day         1498
##  9 e coli           0 Mass Spectrometrist lysine     3day         1356
## 10 e coli           0 Mass Spectrometrist lysine     5day          961
## # ℹ 80 more rows
  1. Make sure time_hr contains just hours and not a mixture of days and hours.
## # A tibble: 90 × 6
##    culture dose_mg_ml User                metabolite time_hr abundance
##    <chr>        <dbl> <chr>               <chr>        <dbl>     <dbl>
##  1 e coli           0 Mass Spectrometrist glutamate        0    191666
##  2 e coli           0 Mass Spectrometrist glutamate       24   2082823
##  3 e coli           0 Mass Spectrometrist glutamate       48   1796910
##  4 e coli           0 Mass Spectrometrist glutamate       72    846402
##  5 e coli           0 Mass Spectrometrist glutamate      120   1782879
##  6 e coli           0 Mass Spectrometrist lysine           0      1418
##  7 e coli           0 Mass Spectrometrist lysine          24      2004
##  8 e coli           0 Mass Spectrometrist lysine          48      1498
##  9 e coli           0 Mass Spectrometrist lysine          72      1356
## 10 e coli           0 Mass Spectrometrist lysine         120       961
## # ℹ 80 more rows
  1. Remove the User column. See cheatsheet here: data-wrangling-cheatsheet or consult internet.
## # A tibble: 90 × 5
##    culture dose_mg_ml metabolite time_hr abundance
##    <chr>        <dbl> <chr>        <dbl>     <dbl>
##  1 e coli           0 glutamate        0    191666
##  2 e coli           0 glutamate       24   2082823
##  3 e coli           0 glutamate       48   1796910
##  4 e coli           0 glutamate       72    846402
##  5 e coli           0 glutamate      120   1782879
##  6 e coli           0 lysine           0      1418
##  7 e coli           0 lysine          24      2004
##  8 e coli           0 lysine          48      1498
##  9 e coli           0 lysine          72      1356
## 10 e coli           0 lysine         120       961
## # ℹ 80 more rows