An R package for the tidying, post-processing, and analysis of quantitative proteomic data.
Proteomics analysis software, whether available through a paid subscription or as an open-source tool, fails to output data in a well-conceived tidy format. A majority of these tools generate output with mixed wide- and long-format data structures, column headers with messy names and added symbols, and often confusing variable names. This leads researchers to write one-off scripts for cleaning and importing data from various formats, often creating an environment of unmaintained, bespoke code. This package attempts to solve that problem by providing a flexible import tool that unifies multiple formats into a new tidy R object for proteomics analysis.
This package supports at a high level:
- data importing
- data filtering
- data visualization
- quantitative normalization & imputation
- two-sample expression & term enrichment analysis
- protein inference, sequence coverage and visualization
Importing is currently implemented for a few platforms and assumes peptide-level FDR (at the user’s desired level) has already been accounted for. See vignette("importing"). Importing is flexible enough to accept other data platforms in flat files (.csv, .tsv, and .xlsx) with a custom configuration.
| Platform | Peptides | Proteins | Notes |
|---|---|---|---|
| ProteomeDiscoverer | *.xlsx peptides export | *.xlsx proteins export | requires layout configuration |
| Skyline | *.csv MSstats peptide report | | requires MSstats install |
| DIA-NN | *.tsv peptide report | | |
| mzTab | *.mzTab (v1.0.0) | *.mzTab (v1.0.0) | does not track MBR |
Ease of Use
This package supports the same syntactic sugar used in tidyverse functions like filter, and introduces the %like% operator; see vignette("subsetting"). These operations extend to all aspects of the data set, including sample names, protein IDs, annotations, and accountings like match_between_runs and num_peptides.
| Operator | Description |
|---|---|
| != | does not equal |
| <, > | less, greater than |
| ! %like% | does not contain |
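To make the operator semantics concrete, here is a minimal base-R sketch. The `%like%` definition below is a hypothetical stand-in (a case-insensitive pattern match), not the package's own implementation, and the `proteins` data frame holds made-up values; only the column name num_peptides comes from the text above.

```r
# Illustrative stand-in for a %like% operator: TRUE where `x` contains `pattern`.
# NOT the package's implementation; case-insensitive matching is an assumption.
`%like%` <- function(x, pattern) grepl(pattern, x, ignore.case = TRUE)

# Hypothetical protein table with made-up values.
proteins <- data.frame(
  protein = c("P1", "P2", "P3"),
  description = c("60S ribosomal protein L3",
                  "Actin, cytoplasmic 1",
                  "40S ribosomal protein S6"),
  num_peptides = c(5, 1, 3),
  stringsAsFactors = FALSE
)

# "contains": keep rows whose description matches the pattern.
ribosomal <- subset(proteins, description %like% "ribosom")

# "does not contain": %like% binds tighter than `!`, so this negates the match.
non_ribosomal <- subset(proteins, !description %like% "ribosom")

# Comparison operators combine in the usual way.
well_supported <- subset(proteins, num_peptides > 1 & protein != "P3")
```

Here `subset` is base R's data-frame method; how the package dispatches the same expressions on its own data object is described in vignette("subsetting").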
Expression analysis also uses this type of syntax when referencing samples for analysis. For example, data %>% expression(knockdown/control) runs the differential expression of the sample knockdown with respect to the sample control, such that a positive log2 difference is up-expressed in knockdown and a negative log2 difference is down-expressed in knockdown.
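The sign convention can be checked with plain arithmetic. The sketch below uses made-up abundances for a single protein and a simple ratio of means. It only illustrates the direction of the log2 difference, and is not a description of the statistics the package applies.

```r
# Made-up abundances for one protein across three replicates of each sample.
knockdown <- c(400, 420, 380)
control   <- c(100, 110, 90)

# log2 difference of knockdown relative to control (numerator / denominator).
log2_fc <- log2(mean(knockdown) / mean(control))
log2_fc  # 2: positive, so the protein is up-expressed in knockdown

# Swapping numerator and denominator flips the sign.
log2_fc_reversed <- log2(mean(control) / mean(knockdown))
log2_fc_reversed  # -2
```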
To install, open R and type:
You will also need the Bioconductor packages limma, qvalue, and fgsea. To install these, type:
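Bioconductor packages are commonly installed through BiocManager. A minimal sketch, assuming BiocManager is the installer you want to use and that a CRAN mirror is reachable from your session:

```r
# Install BiocManager from CRAN if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install the three Bioconductor dependencies named above.
BiocManager::install(c("limma", "qvalue", "fgsea"))
```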
NOTE: Several other required packages will be prompted for and automatically downloaded from CRAN during installation. Depending on your system, some packages require the installation of OS-level libraries for advanced math and string manipulation.
It's simple to get started. Make a new project and drop your raw data in a folder labeled data. For more information see
    library(tidyproteomics)
    data <- "./data/some_ProteomeDiscoverer_data.xlsx" %>% import("ProteomeDiscoverer", "proteins")
    data %>% summary("samples")