Chapter 11 Putting It All Together

Over the past few days, you have embarked on an enlightening journey into the world of R programming, a language renowned for its prowess in data analysis and visualization. This expedition has been as rewarding as it has been challenging, with each new concept building upon the last, culminating in a cohesive and comprehensive understanding of how to wield R effectively.

The initial steps involved mastering the basics of R. The syntax, the various data structures, and the fundamental functions were the foundation stones. These basics, though seemingly simple, are crucial as they form the bedrock upon which more complex operations are built. Every line of code, every function call, and every variable assignment reinforced the importance of a solid grasp of these foundational elements.

As you progressed, the focus shifted to data wrangling, a critical aspect of data analysis. Here, you learned how to manipulate data frames, clean datasets, and transform data into a usable format. Packages like dplyr and tidyr became invaluable allies, enabling you to filter, select, and arrange data with precision and efficiency. Data wrangling is akin to preparing a canvas before painting; it ensures that the data is in its optimal form, ready to reveal its insights.
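
To make this concrete, here is a minimal sketch of those wrangling verbs in action, using a small made-up peptide table (the column names and values are purely illustrative):

library(dplyr)

# a small, made-up peptide table for illustration
tbl <- tibble(
  protein   = c("P1", "P1", "P2", "P2"),
  peptide   = c("AAK", "LLR", "GGK", "VVR"),
  abundance = c(10, 25, 5, 40)
)

tbl |>
  # keep only rows above an abundance threshold
  filter(abundance > 8) |>
  # keep just the columns of interest
  select(protein, abundance) |>
  # order from most to least abundant
  arrange(desc(abundance))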

With clean, well-structured data in hand, you moved on to the art of creating succinct graphs and plots. Visualization is a powerful tool in data analysis, transforming raw numbers into visual stories that are easy to comprehend. Using libraries like ggplot2, you explored various ways to represent data visually, from simple bar charts and histograms to more complex scatter plots and line graphs. Each plot was an exercise in clarity, aiming to convey information as effectively as possible.
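
For instance, a minimal ggplot2 sketch of a scatter plot, using the built-in mtcars data purely for illustration:

library(ggplot2)

# scatter plot: fuel efficiency versus vehicle weight
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       title = "Fuel efficiency vs vehicle weight")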

All these skills coalesced in the organized notebook, a repository of knowledge and analysis. This notebook is not just a collection of code snippets and plots; it is a narrative that answers complex questions methodically. Each section flows logically into the next, ensuring that the analysis is easy to follow and reproduce. The notebook stands as a testament to the structured approach you have taken—beginning with data importation, followed by cleaning, analysis, visualization, and finally, interpretation of results.

This process of learning and application has underscored the importance of organization and clarity. By putting everything you have learned together, you can now approach complex questions with confidence, armed with the ability to analyze data thoroughly and present your findings in a manner that is both understandable and reproducible. The journey through R has not just been about acquiring a new skill; it has been about learning how to think critically and systematically about data, ensuring that every analysis is as insightful as it is rigorous.


Nothing to learn here, just practice

  • Download the data.

  • Approach the scientific question.

  • Document your analysis and results in a notebook.


11.1 Test Scenario 1

Proteomics Optimization

The LCMS Proteomics Lab has run a dilution series of a HeLa digest to determine the optimal LC column loading conditions with respect to yield and reproducibility. Determine which dilution provides the greatest number of unique peptides with a minimum amount of quantitative variability.

The Data

url <- "https://raw.githubusercontent.com/jeffsocal/ASMS_R_Basics/main/data/HelaDilution_Skyline-MSstats_peptides.csv.zip"
download.file(url, destfile = "./data/HelaDilution_Skyline-MSstats_peptides.csv.zip")

Other Observations

  • The Proteomics Bioinformatician suggests that
    • some dilutions may yield different identifications.
    • not all identifications are real, i.e., some are MBR (Match Between Run).

Plan Your Approach

  • Create a new notebook.
  • Examine the data file. Tidy if necessary.
  • Provide some basic summary accounting of the data.
  • Get into the analysis (a starter sketch follows below).
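
Below is a hedged starter sketch for this scenario. The column names (dilution, peptide, abundance) are assumptions for illustration; inspect the actual file with glimpse() and adjust accordingly.

library(tidyverse)

# read_csv() can read a single-file .zip archive directly
tbl <- read_csv("./data/HelaDilution_Skyline-MSstats_peptides.csv.zip")

# inspect the structure before anything else
tbl |> glimpse()

# hypothetical summary: unique peptides and abundance CV per dilution
tbl |>
  group_by(dilution) |>
  summarise(
    n_peptides = n_distinct(peptide),
    cv = sd(abundance, na.rm = TRUE) / mean(abundance, na.rm = TRUE)
  )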

11.2 Test Scenario 2

Metabolomics Biomarker

Cala MP, Aldana J, Medina J, Sánchez J, Guio J, Wist J, Meesters RJW. Multiplatform plasma metabolic and lipid fingerprinting of breast cancer: A pilot control-case study in Colombian Hispanic women. PLoS One. 2018 Feb 13;13(2):e0190958. doi: 10.1371/journal.pone.0190958. PMID: 29438405; PMCID: PMC5810980.

An open-access dataset from this study contains the quantitative abundances of several metabolites measured in healthy patients and patients with breast cancer. Determine whether there are any potential biomarkers.

The Data

url <- "https://www.metabolomicsworkbench.org/studydownload/ST000918_AN001504_Results.txt"
download.file(url, destfile = "./data/ST000918_AN001504_Results.txt")

Other Observations

  • The data is in a matrix format and has been post-processed to normalize and impute missing values.
  • You will likely need to adjust for multiple hypothesis testing (see the sketch after this list).
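
A minimal sketch of that adjustment, assuming the data has already been tidied into a long table with metabolite, group, and abundance columns (hypothetical names for illustration):

library(tidyverse)

# per-metabolite t-test between groups, then
# Benjamini-Hochberg correction across all tests
results <- tbl |>
  group_by(metabolite) |>
  summarise(p_value = t.test(abundance ~ group)$p.value) |>
  mutate(p_adjusted = p.adjust(p_value, method = "BH"))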

Plan Your Approach

  • Create a new notebook.
  • Examine the data file. Tidy if necessary.
  • Provide some basic summary accounting of the data.
  • Get into the analysis (see the starter sketch below).
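
Below is a hedged starter sketch for reading the results file. Metabolomics Workbench exports often carry metadata rows above the data matrix, so inspect the raw lines first; the reshape step assumes the first column identifies the metabolite and should be adjusted once you have seen the file.

library(tidyverse)

# peek at the first lines to find where the data matrix begins
read_lines("./data/ST000918_AN001504_Results.txt", n_max = 10)

# read as tab-delimited, then pivot the sample columns into a long table
tbl <- read_tsv("./data/ST000918_AN001504_Results.txt") |>
  pivot_longer(-1, names_to = "sample", values_to = "abundance")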

11.3 Some Pro Coding Tricks

Tidy Columns

One way to avoid backticks when referencing columns is to rename them, replacing spaces with underscores.

tbl <- tbl |> 
  # rename all the columns with a function, using the 'dplyr' and 'stringr' packages
  rename_all(function(x){tolower(x) |> str_replace_all("\\s+", "_")})
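
For example, applied to a made-up table with spaces in its column names:

library(dplyr)
library(stringr)

# hypothetical table with spaces in the column names
tbl <- tibble(`Protein Name` = c("A", "B"), `Peptide Count` = c(3, 5))

tbl |> 
  rename_all(function(x){tolower(x) |> str_replace_all("\\s+", "_")})
# the columns are now protein_name and peptide_count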

Nested Summary Stats

One way to compute summary statistics within the groups of a table is with nesting.

tbl <- tbl |>
  # group by the column(s) that you want to summarize across
  group_by(column) |>
  # "nest" the remaining columns as a nested table, using the 'tidyr' package 
  nest() |>
  mutate(
    # compute the summary stats using map from the `purrr` package
    stats = data |> map(some_function),
    # extract the summary stat values using the 'broom' package
    stats = stats |> map(tidy)) |>
  # unnest and ungroup
  unnest(stats) |>
  ungroup()
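
As a concrete, runnable illustration of this template, here is the same pattern applied to the built-in mtcars data with a one-sample t.test (both chosen purely for demonstration):

library(tidyverse)
library(broom)

mtcars |>
  # group by cylinder count, then nest the remaining columns
  group_by(cyl) |>
  nest() |>
  mutate(
    # run a one-sample t-test on mpg within each group
    stats = data |> map(function(x) t.test(x$mpg)),
    # flatten each test result into a one-row tibble
    stats = stats |> map(tidy)) |>
  unnest(stats) |>
  ungroup()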