Chapter 6 Tidyverse

The tidyverse is a collection of R packages designed for data analysis and visualization. It is an essential tool for data scientists and statisticians who work with large datasets.


At the end of this chapter you should be able to

  • Grasp the utility of the tidyverse.

  • Understand how to construct a data pipeline.

  • Compose a simple workflow.


The tidyverse packages are built around a common philosophy of data manipulation. The goal is to provide a consistent and intuitive syntax for data analysis that is easy to learn and use. The packages in the tidyverse include:

6.1.1 magrittr provides the pipe, %>% used throughout the tidyverse.
6.1.2 tibble creates the main data object.
6.2.1 readr reading and writing data in various formats.
6.3.1 dplyr data manipulation.
6.3.2 tidyr transforming messy data into a tidy format.
6.3.3 purrr functional programming with vectors and lists.
6.4.1 stringr working with strings.
6.4.2 lubridate working with dates and date strings.
6.5.2 ggplot2 graphical plotting and data visualization.

These packages work seamlessly together, allowing users to easily manipulate and visualize their data. The tidyverse also includes a set of conventions and best practices for data analysis, making it easy to follow a consistent workflow.


Cheat-sheets summarizing each of these packages are available from the Posit (RStudio) website: https://posit.co/resources/cheatsheets/.


Consider the following workflow, which reads in data, fits a linear regression for each year, and visualizes the results using nine (9) of the underlying packages in the tidyverse.

library(tidyverse)

# readr, tibble, magrittr: read in a table of comma-separated values
tbl_csv <- "data/denver_climate.csv" %>% read_csv()

# define a function to fit a linear regression model
lm_func <- function(data) {
  lm(snowfall ~ min_temp, data = data)
}

# readr, tibble, magrittr: using the data imported from above
tbl_csv_lm <- tbl_csv %>%
  # dplyr
  group_by(year) %>%
  # tidyr
  nest() %>%
  # dplyr, purrr: apply the function to each nested data frame
  mutate(model = map(data, lm_func)) %>%
  # dplyr, broom, purrr: extract the coefficients from each model
  mutate(tidy = map(model, broom::tidy)) %>%
  # tidyr
  unnest(tidy) %>%
  # dplyr, stringr
  mutate(term = term %>% str_replace_all("\\(|\\)", "")) %>%
  # dplyr: retain only specific columns
  select(year, term, estimate) %>%
  # tidyr: convert from a long table to a wide table
  pivot_wider(names_from = 'term', values_from = 'estimate')


# ggplot2
tbl_csv %>%
  ggplot(aes(min_temp, snowfall)) + 
  geom_point() +
  # use the linear model data to plot regression lines
  geom_abline(data = tbl_csv_lm,
              aes(slope = min_temp, intercept = Intercept)) +
  # plot each year separately 
  facet_wrap(~year)

To get started with the tidyverse, you can install the package using the following command:

install.packages("tidyverse")

Once installed, users can load the package and begin using the individual packages within the tidyverse:

library(tidyverse)

Overall, the tidyverse is an essential tool for data analysis and visualization in R. Its user-friendly syntax and consistent conventions make it easy for data scientists and statisticians to work with large datasets.

6.1 Core Packages

Two important packages in the tidyverse are tibble and magrittr. These core packages enable other data manipulation operations to work seamlessly, improving efficiency and ease of use when working with data in R. Despite their importance, they are often taken for granted.

6.1.1 magrittr

The tidyverse package magrittr provides a set of operators for chaining operations in a sequence. The package was developed by Stefan Milton Bache and Hadley Wickham, with the main goal of making code more readable and easier to maintain.

The pipe operator, %>%, is the most famous operator provided by magrittr. It allows you to chain multiple operations without the need to use intermediate variables. The pipe operator takes the output of the previous function and passes it as the first argument to the next function. This chaining of operations allows for more concise and readable code.

Here is an example of how to use the pipe operator with magrittr:

# create a vector of numbers
numbers <- c(1, 2, 3, 4, 5)

# use the pipe operator to chain operations
numbers %>%
  sum() %>%
  sqrt()
## [1] 3.872983

In this example, we create a vector of numbers and use the pipe operator to chain the sum() and sqrt() functions. The output of the sum() function is passed as the first argument to the sqrt() function. This allows us to calculate the sum and square root of the vector in a single line of code.

Magrittr also provides other useful operators, such as the assignment pipe %<>%, which allows you to update a variable in place, and the tee operator %T>%, which allows you to inspect the output of an operation without interrupting the chain.
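
Here is a minimal sketch of these two operators; note that %<>% and %T>% are not attached by library(tidyverse), so magrittr is loaded explicitly, and the numeric values are assumed for illustration.

library(magrittr)

# assignment pipe: update 'values' in place with the result of sqrt()
values <- c(1, 4, 9)
values %<>% sqrt()
values
## [1] 1 2 3

# tee operator: print an intermediate result, then continue the chain
# with the same (unmodified) value
values %T>% print() %>% sum()
## [1] 1 2 3
## [1] 6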

6.1.2 tibble

R tibble is a class of data frame in the R programming language. It is an improved alternative to the traditional data frame and is part of the tidyverse package. Tibbles are data frames with stricter requirements, and they provide a streamlined and more efficient way to work with data.

One of the main advantages of tibbles is that they provide a cleaner and more consistent way to display data. Tibbles only print the first 10 rows and the columns that fit on the screen, making it easier to work with large datasets. Additionally, tibbles never convert character vectors to factors, avoiding a common source of errors with traditional data frames, which historically performed this conversion by default.

Another important feature of tibbles is the way they handle column names. Tibbles permit non-syntactic column names, such as names containing spaces, which are referenced with backticks. This makes it easier to work with datasets that have complex column names. Tibbles also take a consistent approach to missing values, representing them with NA throughout, which makes missing data easier to work with.
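
A minimal sketch of these behaviors follows; the column names and values are assumptions for illustration.

library(tidyverse)

# character columns stay character; tibbles never coerce them to factors
tbl <- tibble(
  plant = c("A", "B"),
  `leaf count` = c(12, 15)   # a non-syntactic column name containing a space
)

class(tbl$plant)
## [1] "character"

# reference the non-syntactic name with backticks
tbl %>% select(`leaf count`)
## # A tibble: 2 × 1
##   `leaf count`
##          <dbl>
## 1           12
## 2           15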

6.2 Importing

6.2.1 readr

R readr is a package in the R language that is used to read structured data files into R. The package is an efficient and user-friendly toolkit that allows for the reading of different types of flat files such as CSV, TSV, and fixed-width files. It is part of the tidyverse collection of packages, which is popular among data scientists and statisticians.

One of the key features of readr is its ability to quickly read data into R, making it an ideal package for data analysis and data cleaning. readr is designed to handle various types of data, including numeric, date, and character data. The package also has advanced features such as automatic guessing of column types, encoding detection, and parsing of dates and times.

# read comma separated values
tbl_csv <- "data/bacterial-metabolites_dose-simicillin_tidy.csv" %>% read_csv()

One of the best things about readr is its consistency in dealing with file formats, which allows for easy and fast data manipulation. The package provides a high level of control over the import process, allowing you to specify the location of the data file, the delimiter, and the encoding type. Additionally, readr can handle large datasets with ease, making it one of the most efficient packages for data handling.
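
As a brief sketch of that control, read_csv() accepts an explicit column specification through the col_types argument; the column names below are taken from the climate example at the start of this chapter and are assumptions about that file's full contents.

# declare column types explicitly instead of relying on guessing
tbl_csv <- read_csv(
  "data/denver_climate.csv",
  col_types = cols(
    year     = col_integer(),
    min_temp = col_double(),
    snowfall = col_double()
  )
)

# for other delimiters, read_delim() exposes the separator directly
# (hypothetical tab-delimited file)
# read_delim("data/denver_climate.tsv", delim = "\t")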

6.3 Wrangling

Data wrangling is the process of cleaning, transforming, and formatting raw data into a usable format for analysis. The steps involved in data wrangling include removing duplicates, dealing with missing or erroneous values, converting data types, and formatting data into a consistent structure. It also involves merging data from different sources, reshaping data, and transforming data for analysis.

The objective of data wrangling is to create high-quality, structured data for further analysis and modeling. It requires technical skills, domain knowledge, and creativity. Without proper data wrangling, analysis and modeling may be compromised, leading to incorrect conclusions and decisions. This is where tidyverse functions become quite useful, and we will go deeper into data wrangling in the subsequent chapter.

Consider the following example of wide data, in which Arabidopsis thaliana plants are measured for height each week for three weeks post-germination.

library(tidyverse)

tbl_wide <- tibble(
  plant = LETTERS[1:3],
  condition = c('wet, cold', 'moist, cold', 'moist, hot'),
  week_1 = c(0.3,0.2,0.4),
  week_2 = c(1.3,1.5,1.7),
  week_3 = c(3.4,4.1,5.2)
)

tbl_wide
## # A tibble: 3 × 5
##   plant condition   week_1 week_2 week_3
##   <chr> <chr>        <dbl>  <dbl>  <dbl>
## 1 A     wet, cold      0.3    1.3    3.4
## 2 B     moist, cold    0.2    1.5    4.1
## 3 C     moist, hot     0.4    1.7    5.2

6.3.1 dplyr

R dplyr is perhaps one of the most powerful libraries in the tidyverse, providing a set of tools for data manipulation and transformation. It is designed to work seamlessly with data stored in data frames.

The library comes with a set of functions that can be used to filter, arrange, group, mutate, and summarize data. These functions are optimized for speed and memory efficiency, allowing users to work with large datasets easily.

Some of the most commonly used functions in dplyr are:

  • filter: used to extract specific rows from a data frame based on certain conditions.
tbl_wide %>% filter(plant == 'A')
## # A tibble: 1 × 5
##   plant condition week_1 week_2 week_3
##   <chr> <chr>      <dbl>  <dbl>  <dbl>
## 1 A     wet, cold    0.3    1.3    3.4
  • arrange: used to sort the rows of a data frame based on one or more columns.
tbl_wide %>% arrange(week_3)
## # A tibble: 3 × 5
##   plant condition   week_1 week_2 week_3
##   <chr> <chr>        <dbl>  <dbl>  <dbl>
## 1 A     wet, cold      0.3    1.3    3.4
## 2 B     moist, cold    0.2    1.5    4.1
## 3 C     moist, hot     0.4    1.7    5.2
  • select: used to select specific columns from a data frame.
tbl_wide %>% select(plant, week_3)
## # A tibble: 3 × 2
##   plant week_3
##   <chr>  <dbl>
## 1 A        3.4
## 2 B        4.1
## 3 C        5.2
  • mutate: used to add new columns to a data frame.
tbl_wide %>% mutate(week_4 = c(3.8, 4.6, 5.7))
## # A tibble: 3 × 6
##   plant condition   week_1 week_2 week_3 week_4
##   <chr> <chr>        <dbl>  <dbl>  <dbl>  <dbl>
## 1 A     wet, cold      0.3    1.3    3.4    3.8
## 2 B     moist, cold    0.2    1.5    4.1    4.6
## 3 C     moist, hot     0.4    1.7    5.2    5.7

6.3.2 tidyr

R tidyr is a package in R that helps to reshape data frames. It is an essential tool for data cleaning and analysis. Tidyr is used to convert data from wide to long format and vice versa, and it also helps to separate and unite columns.

  • pivot_longer: used to reshape data from a column-based wide format to a row-based long format.
tbl_long <- tbl_wide %>% pivot_longer(cols = matches('week'), names_to = 'time', values_to = 'inches')

tbl_long
## # A tibble: 9 × 4
##   plant condition   time   inches
##   <chr> <chr>       <chr>   <dbl>
## 1 A     wet, cold   week_1    0.3
## 2 A     wet, cold   week_2    1.3
## 3 A     wet, cold   week_3    3.4
## 4 B     moist, cold week_1    0.2
## 5 B     moist, cold week_2    1.5
## 6 B     moist, cold week_3    4.1
## 7 C     moist, hot  week_1    0.4
## 8 C     moist, hot  week_2    1.7
## 9 C     moist, hot  week_3    5.2
  • pivot_wider: used to reshape data from a row-based long format to a column-based wide format.
tbl_long %>% pivot_wider(names_from = 'time', values_from = 'inches')
## # A tibble: 3 × 5
##   plant condition   week_1 week_2 week_3
##   <chr> <chr>        <dbl>  <dbl>  <dbl>
## 1 A     wet, cold      0.3    1.3    3.4
## 2 B     moist, cold    0.2    1.5    4.1
## 3 C     moist, hot     0.4    1.7    5.2

The package tidyr also has functions to separate and unite columns. The “separate” function is used when you have a column that contains multiple variables. For example, if you have a column that contains both the first and last name of a person, you can separate them into two columns. The “unite” function is the opposite of separate. It is used when you want to combine two or more columns into one column (see the sketch after the examples below).

  • separate: used to separate a column with multiple values into two or more columns.
tbl_long %>% separate(condition, into = c('soil', 'temp'))
## # A tibble: 9 × 5
##   plant soil  temp  time   inches
##   <chr> <chr> <chr> <chr>   <dbl>
## 1 A     wet   cold  week_1    0.3
## 2 A     wet   cold  week_2    1.3
## 3 A     wet   cold  week_3    3.4
## 4 B     moist cold  week_1    0.2
## 5 B     moist cold  week_2    1.5
## 6 B     moist cold  week_3    4.1
## 7 C     moist hot   week_1    0.4
## 8 C     moist hot   week_2    1.7
## 9 C     moist hot   week_3    5.2
  • separate_rows: used to split a column containing multiple values into separate rows, duplicating the values in the remaining columns.
tbl_wide %>% separate_rows(condition, sep = ', ')
## # A tibble: 6 × 5
##   plant condition week_1 week_2 week_3
##   <chr> <chr>      <dbl>  <dbl>  <dbl>
## 1 A     wet          0.3    1.3    3.4
## 2 A     cold         0.3    1.3    3.4
## 3 B     moist        0.2    1.5    4.1
## 4 B     cold         0.2    1.5    4.1
## 5 C     moist        0.4    1.7    5.2
## 6 C     hot          0.4    1.7    5.2
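
The unite() function described above reverses separate(); a minimal sketch using the data from this section splits condition apart and then reassembles it.

tbl_long %>%
  # split 'condition' into two columns ...
  separate(condition, into = c('soil', 'temp')) %>%
  # ... then combine them back into a single column, separated by ", "
  unite('condition', soil, temp, sep = ', ')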

6.3.3 purrr

The purrr package is a functional programming toolkit for R that enables users to easily and rapidly apply a function to a set of inputs, returning a list or vector of outputs. It is designed to work seamlessly with the tidyverse ecosystem of packages, but can also be used with base R functions.

The most important feature of purrr is its ability to replace explicit loops, saving time and effort. The package provides a family of mapping functions for applying a function over one or more inputs, including map, map2, pmap, and imap.

The map function is purrr’s flagship function and is used to apply a function to each element of a list or vector, returning a list of outputs. The map2 function applies a function to two lists or vectors in parallel, returning a list of outputs. The pmap function applies a function to an arbitrary number of lists or vectors in parallel, returning a list of outputs. The imap function is similar to map, but also provides the index of the current element in the input vector.

purrr also supports mapping over nested lists, iterating over grouped (nested) data, as in the regression workflow at the start of this chapter, and modifying elements of a data structure in place with modify() and its variants.

numbers <- list(1, 2, 3, 4, 5)

# define a function to square a number
square <- function(x) { x ^ 2 }

# use map to apply the function to each element of the list
squared_numbers <- map(numbers, square)

# print the result
squared_numbers
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25
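
A brief sketch of map2 and imap follows; the input values are assumptions for illustration.

# map2 applies a function to pairs of elements from two lists in parallel
widths  <- list(2, 3, 4)
heights <- list(5, 6, 7)
map2(widths, heights, ~ .x * .y)
## [[1]]
## [1] 10
## 
## [[2]]
## [1] 18
## 
## [[3]]
## [1] 28

# imap also passes each element's index (or name) as the second argument
imap(c(a = 10, b = 20), ~ paste0(.y, " = ", .x))
## $a
## [1] "a = 10"
## 
## $b
## [1] "b = 20"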

6.3.4 glue

R glue is a tidyverse package that provides a simple way to interpolate values into strings. It allows users to combine multiple strings or variables into a single string with minimal effort, and is simpler than using base R functions such as paste() and sprintf().

The glue function can handle various types of inputs, including vectors, lists, and expressions. It also supports user-defined formats and allows users to specify separators between the values.

One of the significant advantages of using glue is that it provides a more readable and concise way to create strings in R. It eliminates the need for multiple paste() or paste0() statements, which can be cumbersome and error-prone.

For example, instead of writing:

paste0("The value of x is: ", x, ", and the value of y is: ", y)

we can use R glue:

glue("The value of x is: {x}, and the value of y is: {y}")

This code will produce the same output, but it’s more readable and easier to modify.
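
A runnable sketch with assumed values for x and y follows; note that glue is installed with the tidyverse but is not attached by library(tidyverse), so it is loaded explicitly.

library(glue)

x <- 3
y <- 7

glue("The value of x is: {x}, and the value of y is: {y}")
## The value of x is: 3, and the value of y is: 7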

6.4 Data Types

6.4.1 stringr

The tidyverse package stringr provides a cohesive set of functions designed to make working with strings more efficient. It is especially useful when dealing with messy or unstructured data that needs to be cleaned and transformed into a more structured format.

Several stringr functions provide methods for working with strings, for example:

  • str_replace: replaces a pattern with another pattern in a string.
str_replace("Hello World", "W.+", "Everyone")
## [1] "Hello Everyone"
  • str_extract: extracts the first occurrence of a pattern from a string.
str_extract("Hello World", "W.+")
## [1] "World"
  • str_split: splits a string into pieces based on a specified pattern.
str_split("Hello World", "\\s")
## [[1]]
## [1] "Hello" "World"

6.4.2 lubridate

The tidyverse package lubridate helps with the handling of dates and times. The package has several functions that make it easier to work with dates and times, especially when dealing with data that has different formats.

Some of the functions in lubridate package include:

  • ymd - this is used to convert dates in the format of year, month, and day to the date format in R. For example, ymd("20220101") will return the date in R format.
  • dmy - this is used to convert dates in the format of day, month, and year to the date format in R. For example, dmy("01-01-2022") will return the date in R format.
  • hms - this is used to convert time in the format of hours, minutes, and seconds to the time format in R. For example, hms("12:30:15") will return the time in R format.
  • ymd_hms - this is used to convert dates and times in the format of year, month, day, hours, minutes, and seconds to the date and time format in R. For example, ymd_hms("2022-01-01 12:30:15") will return the date and time in R format.

There are also functions for extracting information from dates and times such as year(), month(), day(), hour(), minute(), and second().
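
The following sketch shows these parsers and extractors in action; lubridate is attached by library(tidyverse) in recent versions, otherwise load it explicitly.

library(lubridate)

ymd("20220101")
## [1] "2022-01-01"

dmy("01-01-2022")
## [1] "2022-01-01"

ymd_hms("2022-01-01 12:30:15")
## [1] "2022-01-01 12:30:15 UTC"

# extract components from a parsed date
year(ymd("20220101"))
## [1] 2022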

6.4.3 forcats

R forcats is a tidyverse package that provides a set of tools for working with categorical data. It is designed to make it easier to work with factors in R, which are used to represent categorical data.

The forcats package provides several functions that can be used to manipulate factors, including reordering levels, combining levels, and handling missing values. It also provides functions for working with ordered factors, which are used to represent data that has a natural ordering, such as age groups or ratings.

One of the key benefits of using forcats is that it makes categorical data easier to visualize and analyze. Rather than producing plots itself, forcats supplies factor-manipulation functions, such as fct_reorder() and fct_infreq(), that put levels into a meaningful order before plotting with ggplot2 or computing summary statistics.

In addition to reordering, forcats provides functions for collapsing rare levels (fct_lump_n()), recoding levels (fct_recode()), and making missing values an explicit level, giving fine-grained control over how factors behave in downstream analyses, as sketched below.
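
A minimal sketch of level reordering and lumping follows; the soil vector is an assumption for illustration.

library(forcats)   # attached with library(tidyverse)

soil <- factor(c("wet", "moist", "moist", "dry", "moist", "wet"))

# reorder levels by how frequently they occur
fct_infreq(soil) %>% levels()
## [1] "moist" "wet"   "dry"

# keep the single most frequent level and lump the rest into "Other"
fct_lump_n(soil, n = 1) %>% levels()
## [1] "moist" "Other"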

6.5 Summarizing

6.5.1 dplyr

Summarizing data is a common task performed on data frames in the tidyverse. The dplyr package provides a set of functions that make it easy to summarize data based on one or more variables.

  • group_by: used to group rows of a data frame by one or more columns.
  • summarize: used to summarize the data based on one or more aggregate functions.

The summarise() function is used to perform simple summary statistics on data frames. It takes the name of each new variable along with the summary function used to calculate its value. For example, to calculate the minimum and maximum of the inches column in tbl_long, we can use the following code:

tbl_long %>%
  summarise(min = min(inches), 
            max = max(inches))
## # A tibble: 1 × 2
##     min   max
##   <dbl> <dbl>
## 1   0.2   5.2

The group_by() function is used to group data frames by one or more variables. This is useful when we want to summarize data by different categories. For example, to calculate the minimum and maximum of inches for each plant, we can use the following code:

tbl_long %>%
  group_by(plant) %>%
  summarise(min = min(inches), 
            max = max(inches))
## # A tibble: 3 × 3
##   plant   min   max
##   <chr> <dbl> <dbl>
## 1 A       0.3   3.4
## 2 B       0.2   4.1
## 3 C       0.4   5.2

The summarise_at() and summarise_all() functions perform summary statistics on multiple variables at once. summarise_at() takes a selection of variables to summarize, while summarise_all() summarizes every variable in the data frame; in current versions of dplyr both are superseded by across(), sketched after the example below. For example, to calculate the maximum and minimum of every column of tbl_long, we can use the following code:

tbl_long %>%
  summarise_all(list(max = max, min = min))
## # A tibble: 1 × 8
##   plant_max condition_max time_max inches_max plant_min condition_min time_min inches_min
##   <chr>     <chr>         <chr>         <dbl> <chr>     <chr>         <chr>         <dbl>
## 1 C         wet, cold     week_3          5.2 A         moist, cold   week_1          0.2
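
In current versions of dplyr, the same style of summary is written with across(); a sketch grouping by plant and summarizing only the inches column:

tbl_long %>%
  group_by(plant) %>%
  summarise(across(inches, list(min = min, max = max)))
## # A tibble: 3 × 3
##   plant inches_min inches_max
##   <chr>      <dbl>      <dbl>
## 1 A            0.3        3.4
## 2 B            0.2        4.1
## 3 C            0.4        5.2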

Summarizing data is an essential task that can be performed using several functions. These functions make it easy to calculate summary statistics based on one or more variables, group data frames by different categories, and summarize multiple variables at once.

6.5.2 ggplot2

The tidyverse package ggplot2, demonstrated at the onset of this chapter, is a data visualization package for the R programming language that provides a flexible and powerful framework for creating graphs and charts. It is built on the grammar of graphics, which is a systematic way of mapping data to visual elements like points, lines, and bars.

With ggplot2, you can create a wide range of graphs including scatterplots, bar charts, line charts, and more. The package offers a variety of customization options, such as color schemes, themes, and annotations, allowing you to create professional-looking visualizations with ease.

One of the key benefits of ggplot2 is that it allows you to quickly explore and analyze your data visually. You can easily create multiple graphs with different variables and subsets of your data, and compare them side by side to identify patterns and trends.
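
As a small sketch using the tbl_long data created earlier in this chapter, the following plots growth over time for each plant.

tbl_long %>%
  ggplot(aes(x = time, y = inches, group = plant, color = plant)) +
  geom_point() +
  geom_line() +
  labs(x = "week", y = "growth (inches)")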

Exercises

  • Create a new R Studio Project and name it 004_tidyverse.

  • Create a new R script, add your name and date at the top as comments.

  1. Calculate the mean of the following vector.
##  [1]  7.48 14.15  6.23 10.21 15.13  8.19  8.58  8.09  9.14 10.41
## [1] 9.761
  2. Pipeline (e.g. %>%) a data operation that provides the mean of the following vector.
## [1] 9.761
  3. Employing a pipeline (e.g. %>%), construct a tibble with columns named radi and area which contains the AREA of circles with integer RADII 1 to 5. Remember PEMDAS.
## # A tibble: 5 × 2
##    radi  area
##   <int> <dbl>
## 1     1  3.14
## 2     2 12.6 
## 3     3 28.3 
## 4     4 50.3 
## 5     5 78.5
  4. Extract all AREAs greater than 50.
## # A tibble: 2 × 2
##    radi  area
##   <int> <dbl>
## 1     4  50.3
## 2     5  78.5
  5. Add a column named circ_type where you assign the string odd or even depending on the column radi. Attempt to use the purrr::map function, along with the oddeven() function from the previous chapter, then compute the mean, standard deviation, and coefficient of variation of the AREA for each circ_type.
## # A tibble: 2 × 4
##   circ_type area_mean area_sd area_cv
##   <chr>         <dbl>   <dbl>   <dbl>
## 1 even           31.4    26.7   0.849
## 2 odd            36.7    38.4   1.05

More Exercises

  • Create a new R Studio Project and name it 104_tidyverse.
  • Create a new R script, add your name and date at the top as comments.

Load the tidyverse library …

library(tidyverse)

Download the data.

url <- "https://raw.githubusercontent.com/jeffsocal/ASMS_R_Basics/main/data/Choi2017_DDA_Skyline_input.csv.zip"
download.file(url, destfile = "./data/Choi2017_DDA_Skyline_input.csv.zip")

Exercise #1 – Reading data

1.1 Read the example data from a proteomics experiment. NOTE: the file is a zipped .csv file – R knows how to read it!

Exercise #2 – Reviewing Data Frames

2.1 Review some basic properties of the data frame

  • How many rows?
## [1] 1257732
  • How many columns?
## [1] 16
  • How many rows & columns (use one expression)
## [1] 1257732      16
  • What are the column names?
##  [1] "ProteinName"             "PeptideSequence"         "PeptideModifiedSequence" "PrecursorCharge"        
##  [5] "PrecursorMz"             "FragmentIon"             "ProductCharge"           "ProductMz"              
##  [9] "IsotopeLabelType"        "Condition"               "BioReplicate"            "FileName"               
## [13] "Area"                    "StandardType"            "Truncated"               "DetectionQValue"
  • What are the data types stored in each column?
## spc_tbl_ [1,257,732 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ProteinName            : chr [1:1257732] "DECOY_sp|P0CF18|YM085_YEAST" "DECOY_sp|P0CF18|YM085_YEAST" "DECOY_sp|P0CF18|YM085_YEAST" "DECOY_sp|P0CF18|YM085_YEAST" ...
##  $ PeptideSequence        : chr [1:1257732] "KDMYGNPFQK" "KDMYGNPFQK" "KDMYGNPFQK" "KDMYGNPFQK" ...
##  $ PeptideModifiedSequence: chr [1:1257732] "KDM[+16]YGNPFQK" "KDM[+16]YGNPFQK" "KDM[+16]YGNPFQK" "KDM[+16]YGNPFQK" ...
##  $ PrecursorCharge        : num [1:1257732] 3 3 3 3 3 3 3 3 3 3 ...
##  $ PrecursorMz            : num [1:1257732] 415 415 415 415 415 ...
##  $ FragmentIon            : chr [1:1257732] "precursor" "precursor" "precursor" "precursor" ...
##  $ ProductCharge          : num [1:1257732] 3 3 3 3 3 3 3 3 3 3 ...
##  $ ProductMz              : num [1:1257732] 415 415 415 415 415 ...
##  $ IsotopeLabelType       : chr [1:1257732] "light" "light" "light" "light" ...
##  $ Condition              : chr [1:1257732] "Condition1" "Condition1" "Condition1" "Condition2" ...
##  $ BioReplicate           : num [1:1257732] 1 2 3 4 5 6 7 8 9 10 ...
##  $ FileName               : chr [1:1257732] "JD_06232014_sample1-A.raw" "JD_06232014_sample1_B.raw" "JD_06232014_sample1_C.raw" "JD_06232014_sample2_A.raw" ...
##  $ Area                   : chr [1:1257732] "71765.046875" "147327.265625" "1373396.5" "66387.4453125" ...
##  $ StandardType           : logi [1:1257732] NA NA NA NA NA NA ...
##  $ Truncated              : logi [1:1257732] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ DetectionQValue        : chr [1:1257732] "#N/A" "#N/A" "#N/A" "#N/A" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ProteinName = col_character(),
##   ..   PeptideSequence = col_character(),
##   ..   PeptideModifiedSequence = col_character(),
##   ..   PrecursorCharge = col_double(),
##   ..   PrecursorMz = col_double(),
##   ..   FragmentIon = col_character(),
##   ..   ProductCharge = col_double(),
##   ..   ProductMz = col_double(),
##   ..   IsotopeLabelType = col_character(),
##   ..   Condition = col_character(),
##   ..   BioReplicate = col_double(),
##   ..   FileName = col_character(),
##   ..   Area = col_character(),
##   ..   StandardType = col_logical(),
##   ..   Truncated = col_logical(),
##   ..   DetectionQValue = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

What kind of data is present? What is the structure of the data?

  • Use the View function to review the data in RStudio.

  • It appears that some of the data is duplicated across many rows. Look at the data column by column and see if you can understand why.

Exercise #3 – Working with Data Frames

3.1 Retrieve the data from the column called "FileName". How many values do you expect to get? Write an expression using the data you retrieved to see if your guess is correct.

## [1] 1257732
  • you’d expect to have the same number of values as there are rows
## [1] 1257732

3.2 How many unique values of the data from “FileName” are there? What are these values and what do they correspond to?

##  [1] "JD_06232014_sample1-A.raw" "JD_06232014_sample1_B.raw" "JD_06232014_sample1_C.raw" "JD_06232014_sample2_A.raw"
##  [5] "JD_06232014_sample2_B.raw" "JD_06232014_sample2_C.raw" "JD_06232014_sample3_A.raw" "JD_06232014_sample3_B.raw"
##  [9] "JD_06232014_sample3_C.raw" "JD_06232014_sample4-A.raw" "JD_06232014_sample4_B.raw" "JD_06232014_sample4_C.raw"

3.3 Using data frame indexing syntax, subset the data to rows for the protein “sp|P33399|LHP1_YEAST”

## # A tibble: 288 × 16
##    ProteinName          PeptideSequence PeptideModifiedSeque…¹ PrecursorCharge PrecursorMz FragmentIon ProductCharge
##    <chr>                <chr>           <chr>                            <dbl>       <dbl> <chr>               <dbl>
##  1 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  2 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  3 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  4 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  5 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  6 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  7 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  8 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
##  9 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
## 10 sp|P33399|LHP1_YEAST NDGWVPISTIATFNR NDGWVPISTIATFNR                      2        846. precursor               2
## # ℹ 278 more rows
## # ℹ abbreviated name: ¹​PeptideModifiedSequence
## # ℹ 9 more variables: ProductMz <dbl>, IsotopeLabelType <chr>, Condition <chr>, BioReplicate <dbl>, FileName <chr>,
## #   Area <chr>, StandardType <lgl>, Truncated <lgl>, DetectionQValue <chr>

3.4 How many unique peptides are present in the data for the above protein?

  • first store the subset in a variable

  • now calculate the number of unique peptides

## [1] 7
  • or you can do it with one expression
## [1] 7