Data Support

Importing is currently implemented for a few platforms and assumes that peptide-level FDR (at the user’s desired level) has already been accounted for. See vignette("importing"). Importing is flexible enough to accept data from other platforms in flat files (.csv, .tsv, and .xlsx) with a custom configuration.

| Platform | peptides | proteins | notes |
| --- | --- | --- | --- |
| ProteomeDiscoverer | *.xlsx peptides export | *.xlsx proteins export | requires layout configuration; website |
| MaxQuant | evidence.txt | proteinGroups.txt | website |
| FragPipe | combined_peptide.tsv | combined_protein.tsv | website |
| Skyline | *.csv MSstats peptide report | | requires MSstats install; website |
| DIA-NN | *.tsv peptide report | | website |
| mzTab | *.mzTab (v1.0.0) | *.mzTab (v1.0.0) | does not track MBR; website |

Table-formatted data (e.g. .csv, .tsv, .xlsx) from ProteomeDiscoverer, MaxQuant and FragPipe meet the requirements, and their configurations are defined in the package data under tidyproteomics/inst/extdata/config/. Note that the groups sample, identifier and quantitative are required, while the rest are optional and only used if a match is found. Error handling for missing or mismatched columns is currently limited, so check the configuration carefully. For ProteomeDiscoverer, the peptide config can be modified to use Master Protein Accessions or Protein Accessions, whichever column is present. Also note that for some configurations the sample group has no defined “supplied name”, as this is later derived by the extraction code defined in the columns labeled “pattern”.
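Because error handling is limited, it can help to check a configuration before importing. The following is a minimal base R sketch, not part of tidyproteomics: the helper check_config(), the toy config table and the example headers are all hypothetical, and real config files contain more columns and rows.

```r
# Hypothetical, trimmed-down configuration table for illustration only.
config <- data.frame(
  category       = c("sample", "identifier", "quantitative", "annotation"),
  column_defined = c("sample", "protein", "abundance_raw", "description"),
  column_import  = c(NA, "^Accession$", "Abundance\\:", "Description"),
  stringsAsFactors = FALSE
)

# The three groups the text names as required.
required_groups <- c("sample", "identifier", "quantitative")

# check_config() is an illustrative helper, not a tidyproteomics function.
check_config <- function(config, required = required_groups) {
  missing <- setdiff(required, unique(config$category))
  if (length(missing) > 0) {
    stop("config is missing required group(s): ",
         paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

check_config(config)

# Report which config rows would find a matching column in a data file.
# Rows without a column_import pattern (NA) are reported as FALSE.
headers <- c("Accession", "Abundance: F1: Sample, WT", "Abundance: F2: Sample, KO")
has_match <- vapply(config$column_import, function(p) {
  !is.na(p) && any(grepl(p, headers, perl = TRUE))
}, logical(1), USE.NAMES = FALSE)

data.frame(column_defined = config$column_defined, has_match = has_match)
```

Optional rows without a match (here, description) would simply be skipped on import, so a report like this makes silent omissions visible.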

Known quantitative proteomics platforms that are not directly supported include Spectronaut and Proteograph Analysis Suite. Given a flat-file export, data from these platforms could also be imported. See User Defined Import.

ProteomeDiscoverer

The ProteomeDiscoverer software suite can export data post-analysis for both peptides and proteins. Open the Study Results, then select File > Export > To Microsoft Excel. In the pop-up, keep only Level 1 checked and select either “Proteins” or “Peptide Groups”. See the Exporting section below to make sure the required columns are present.

The data exported from ProteomeDiscoverer is not very “tidy”, as it mixes wide-format columns (e.g. an Abundance column for each sample on a single row) with long-format columns. The data table from ProteomeDiscoverer is protein-centric: each row is dedicated to a single protein rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns so that each row in the new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we pivot the wide-format columns flagged in the column pivot, then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.

Note that the pivot columns provide the basis for both sample and sample_file: their values are extracted from the pivoted column headers using the patterns in the pattern_extract column, which is why these rows have no value for column_import.
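The pivot-then-extract idea can be sketched in base R. This is only an illustration of the concept, not the package's implementation: the toy table, the accessions, the values and the "F[0-9]+" file pattern are assumptions, and the extract() helper is hypothetical.

```r
# Toy protein-centric table mimicking a ProteomeDiscoverer export.
wide <- data.frame(
  Accession = c("P12345", "Q67890"),
  `Abundance: F1: Sample, WT` = c(1000, 250),
  `Abundance: F2: Sample, KO` = c(1200, NA),
  check.names = FALSE
)

# 1. Pick the wide-format columns to pivot.
pivot_cols <- grep("Abundance\\:", names(wide), value = TRUE, perl = TRUE)

# 2. Stack them so that each row is one protein in one sample.
long <- do.call(rbind, lapply(pivot_cols, function(col) {
  data.frame(
    protein = wide$Accession,
    header = col,
    abundance_raw = wide[[col]],
    stringsAsFactors = FALSE
  )
}))

# 3. Extract sample and file from the original column header.
#    Assumes every header matches the pattern (true for this toy data).
extract <- function(x, pattern) {
  trimws(regmatches(x, regexpr(pattern, x, perl = TRUE)))
}
long$sample <- extract(long$header, "(?<=,).+")
long$sample_file <- extract(long$header, "F[0-9]+")
long$header <- NULL

long
```

Each of the four resulting rows holds one abundance value for one protein in one sample, which is the tidy layout described above.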

Initial Set Up

When setting up your experiment it is essential to create sample names. In the Study Definition tab of the open Study, navigate to the Study Factors box, add a Categorical Factor, name it (e.g. My_Sample) and name your study groups (e.g. WT and KO). These designations then need to be applied to the imported files under the Samples tab. Notice that the left-most column is the added Categorical Factor, carrying the name that was supplied. Each line has a pull-down menu for assigning each file (or label, for TMT) to one of the Categorical Factor groups. This ensures that the imported data has properly labeled samples. If this has not been done, it can be fixed afterwards with the modify() function.

Exporting

Abundances

A quick note on reported abundances: ProteomeDiscoverer reports protein and peptide abundances in several ways that may not be immediately clear. tidyproteomics should have access to the raw abundance values, so that normalization and imputation start from scratch. We want to export, for example, Abundance: F3: Sample, p97-KD, not Abundances (Grouped) ..., Abundance Ratio: ..., or any other derivation. This can be set by toggling the display columns in ProteomeDiscoverer to show Abundance. Leaving the other columns in the export should not affect importing, as they are disregarded.
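To see why the derived columns are disregarded, the sketch below applies the Abundance\: pattern from the configuration to a few example headers in base R. The header suffixes after “(Grouped)” and “Ratio:” are made up for illustration.

```r
# Example headers; only the second is a raw abundance column.
headers <- c(
  "Accession",
  "Abundance: F3: Sample, p97-KD",
  "Abundances (Grouped): p97-KD",
  "Abundance Ratio: (p97-KD) / (WT)"
)

# The pattern requires the literal text "Abundance:" so the grouped
# and ratio columns do not match.
is_raw <- grepl("Abundance\\:", headers, perl = TRUE)
raw_cols <- headers[is_raw]

raw_cols
```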

Proteins

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/ProteomeDiscoverer_proteins.tsv:

category column_defined column_import pattern_extract pattern_remove pattern_split pivot REQUIRED
sample sample “(?<=,).+” YES
sample sample_file (?<=)F(?=:) YES
identifier protein ^Accession$ \; YES
quantitative abundance_raw Abundance\: TRUE YES
impute match_between_runs Found in Sample\: Found TRUE YES
accounting num_peptides ^# Peptides$
accounting num_unique_peptides ^# Unique Peptides$
accounting num_psms # PSMs
annotation description Description
…etc

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/ProteomeDiscoverer_peptides.tsv:

category column_defined column_import pattern_extract pattern_remove pattern_split pivot REQUIRED
sample sample “(?<=,).+” YES
sample sample_file (?<=)F(?=:) YES
identifier protein Master Protein Accessions \; YES
identifier peptide Annotated Sequence YES
identifier modification ^Modifications$ YES
quantitative abundance_raw Abundance\: TRUE YES
impute match_between_runs Found in Sample\: Found TRUE YES
accounting num_psms # PSMs
accounting description Description
…etc

# Replace path_to_package_data("hela_proteins") with the path to your local data, e.g.:
# hela_proteins <- "./data/hela_export_table.xlsx" %>%
#    import("ProteomeDiscoverer", "proteins")
data_proteins <- path_to_package_data("hela_proteins") %>%
   import("ProteomeDiscoverer", "proteins")

MaxQuant

The MaxQuant software suite creates files in the project subdirectory current_project > combined > txt, with data for both peptides (evidence.txt) and proteins (proteinGroups.txt).

The data exported from MaxQuant for the file proteinGroups.txt is not very “tidy”, as it mixes wide-format columns (e.g. an Abundance column for each sample on a single row) with long-format columns. The data table from MaxQuant is protein-centric: each row is dedicated to a single protein rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns so that each row in the new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we pivot the wide-format columns flagged in the column pivot, then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.

Note also that rows from the decoy search, labeled REV_*, should be removed; the pattern for this is given in the column pattern_remove.
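The effect of the pattern_remove, pattern_split and pattern_extract columns can be sketched in base R as follows. This is an illustration rather than the package's code: the identifiers are invented, and the "sp|ACCESSION|NAME" layout is an assumption that depends on how the FASTA headers were parsed in MaxQuant.

```r
# Invented protein group identifiers, including one decoy entry.
protein_ids <- c(
  "sp|P12345|PROT1_HUMAN;sp|Q67890|PROT2_HUMAN",
  "REV_sp|P99999|DECOY_HUMAN",
  "sp|O11111|PROT3_HUMAN"
)

# pattern_remove: drop rows from the decoy search.
# "^REV_" matches the same rows as the config pattern ^REV\_.
keep <- !grepl("^REV_", protein_ids, perl = TRUE)
targets <- protein_ids[keep]

# pattern_split: separate grouped identifiers on ";".
groups <- strsplit(targets, ";", fixed = TRUE)

# pattern_extract: keep the accession between the first pair of pipes.
accessions <- lapply(groups, function(ids) {
  regmatches(ids, regexpr("(?<=\\|).*?(?=\\|)", ids, perl = TRUE))
})

accessions
```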

Initial Set Up

While evidence.txt contains values for each imported file (important for comparative statistics), proteinGroups.txt will only contain an entry for each file if, in the initial MaxQuant configuration, the Experiment column in the raw data tab has a unique value for each file (e.g. 1, 2, 3, …). Otherwise the values are merged on common experiment groups in the protein-level output. The sample groups can then be set with the modify() function.

Exporting

Proteins

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/MaxQuant_proteins.tsv:

category column_defined column_import pattern_extract pattern_remove pattern_split pivot REQUIRED
sample sample (?<=\s)[0-9]+ YES
sample sample_file (?<=\s)[0-9]+ YES
identifier protein ^Proteins IDs$ (?<=\|).*?(?=\|) ^REV\_ \; YES
quantitative abundance_raw ^Intensity\s$ TRUE YES
accounting num_psms ^MS/MS count
accounting num_peptides ^Peptides\s TRUE
accounting num_unique_peptides ^Unique peptides\s TRUE

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/MaxQuant_peptides.tsv:

category column_defined column_import pattern_extract pattern_remove pattern_split pivot REQUIRED
sample sample Raw file YES
sample sample_file Experiment YES
identifier protein ^Proteins$ (?<=\|).*?(?=\|) \; YES
identifier peptide ^Sequence$ YES
identifier modification ^Modified sequence$ YES
quantitative abundance_raw ^Intensity$ YES
impute match_between_runs Type MATCH YES
accounting num_psms MS/MS count
data_proteins <- "path_to_maxquant_project/combined/txt/proteinGroups.txt" %>%
   import("MaxQuant", "proteins") %>%
   reassign(field = 'sample', pattern = 'sample_1', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_2', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_3', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_4', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_5', replace = 'wt') %>%
   reassign(field = 'sample', pattern = 'sample_6', replace = 'wt') %>%
   reassign(field = 'sample', pattern = 'sample_7', replace = 'wt') %>%
   reassign(field = 'sample', pattern = 'sample_8', replace = 'wt')

FragPipe

The FragPipe software suite creates files in the project subdirectory with data for both peptides (combined_peptide.tsv) and proteins (combined_protein.tsv).

The data exported from FragPipe for the file combined_protein.tsv is not very “tidy”, as it mixes wide-format columns (e.g. an Intensity column for each sample on a single row) with long-format columns. The data table from FragPipe is protein-centric: each row is dedicated to a single protein rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns so that each row in the new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we pivot the wide-format columns flagged in the column pivot, then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.

Initial Set Up

FragPipe already removes rows from the decoy search if this is indicated in the workflow setup. If it is not, you can indicate a pattern in the column pattern_remove.

Exporting

Proteins

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/FragPipe_proteins.tsv:

category column_defined column_import pattern_extract pattern_remove pattern_split pivot REQUIRED
sample sample (?<=\s)[0-9]+ YES
sample sample_file (?<=\s)[0-9]+ YES
identifier protein ^Protein ID$ (?<=\|).*?(?=\|) \; YES
quantitative abundance_raw \sMaxLFQ\sIntensity$ TRUE YES
accounting num_psms ^\d\sSpectral\sCount$ TRUE
accounting num_psms_unique ^\sUnique\sSpectral\sCount$ TRUE
annotation description ^Description$
annotation gene_name ^Gene$

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/FragPipe_peptides.tsv:

category column_defined column_import pattern_extract pattern_remove pattern_split pivot REQUIRED
sample sample .+(?=\_\d+\s) YES
sample sample_file .+(?=\sMax) YES
identifier protein ^Protein ID$ (?<=\|).*?(?=\|) \; YES
identifier peptide ^Peptide Sequence$ YES
quantitative abundance_raw \sMaxLFQ\sIntensity$ TRUE YES
accounting num_psms ^\sSpectral\sCount$ TRUE
annotation description ^Protein Description$
annotation gene_name ^Gene$
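
As a rough check of how the peptide-level patterns above behave, the following base R sketch applies them to a few invented column headers. The header naming ("<group>_<replicate> MaxLFQ Intensity") is an assumption about the export, and first_match() is a hypothetical helper.

```r
# Invented FragPipe-style column headers.
headers <- c(
  "Protein ID",
  "control_1 MaxLFQ Intensity",
  "control_2 MaxLFQ Intensity",
  "treated_1 MaxLFQ Intensity",
  "control_1 Spectral Count"
)

# Select the quantitative (pivot) columns.
quant_cols <- grep("\\sMaxLFQ\\sIntensity$", headers, value = TRUE, perl = TRUE)

# Return the first regex match in each string
# (assumes every string matches, which holds for this toy data).
first_match <- function(x, pattern) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# sample_file: everything before " Max"; sample: everything before "_<digits> ".
sample_file <- first_match(quant_cols, ".+(?=\\sMax)")
sample <- first_match(quant_cols, ".+(?=_\\d+\\s)")

data.frame(header = quant_cols, sample = sample, sample_file = sample_file)
```

If your headers follow a different naming scheme, the two lookahead patterns are the part of the configuration that would need adjusting.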

Skyline

The Skyline software suite can export quantitative peptide data for most analyses. The exported data file is in a fairly “tidy” long format CSV file, where each peptide for each sample is reported on an individual row.

Initial Set Up

A report needs to be established under File > Export > Report. Select Edit list..., then Group: > External Tools, then click Add.... Select the Columns that correspond to the required values shown below, name the report in Report Name: and click OK.

Exporting

Proteins

Not yet supported. Peptides can be combined into proteins with collapse().
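
Conceptually, combining peptides into proteins is a grouped summary. The sketch below shows the idea in base R by summing peptide abundances per protein and sample. It is not how collapse() is implemented, and the data, column values and the choice of sum are assumptions for illustration.

```r
# Invented long-format peptide table.
peptides <- data.frame(
  sample = c("WT", "WT", "WT", "KO", "KO"),
  protein = c("P12345", "P12345", "Q67890", "P12345", "Q67890"),
  peptide = c("AAAK", "CCCR", "DDDK", "AAAK", "DDDK"),
  abundance_raw = c(100, 50, 30, 80, NA),
  stringsAsFactors = FALSE
)

# Sum peptide abundances for each protein in each sample.
# With the formula interface, rows with a missing abundance are dropped
# before summing, so a protein with only NA values disappears from the result.
proteins <- aggregate(
  abundance_raw ~ sample + protein,
  data = peptides,
  FUN = sum
)

proteins
```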

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/SkyLine_peptides.tsv:

data_peptides <- "path_to_skyline_project/output_file_name.csv" %>%
   import("Skyline", "peptides")

DIA-NN

The DIA-NN software suite exports quantitative peptide data back into the project folder as report.tsv. The exported data file is in a fairly “tidy” long format file, where each peptide for each sample is reported on an individual row.

Exporting

Proteins

Not yet supported. Peptides can be combined into proteins with collapse().

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/DIA-NN_peptides.tsv:

data_peptides <- "path_to_diann_project/report.tsv" %>%
   import("DIA-NN", "peptides")

mzTab

The mzTab data has limited support from major vendors - ProteomeDiscoverer for example only supports version 1.0.0. The data for proteins, peptides and psms are all contained within a single file. Tidyproteomics assembles the psm, peptide and protein data independently then sequentially combines them to generate the desired protein or peptide level output.
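
In an mzTab 1.0 file each line starts with a tag that identifies its section (for example PRH/PRT for the protein header and rows, PEH/PEP for peptides, PSH/PSM for PSMs). The base R sketch below separates sections by tag. It is only an illustration of the file layout, not the package's parser: read_section() is a hypothetical helper and the toy lines carry far fewer columns than a real file.

```r
# Toy mzTab content with trimmed-down columns.
mztab_lines <- c(
  "MTD\tmzTab-version\t1.0.0",
  "PRH\taccession\tdescription",
  "PRT\tP12345\tExample protein",
  "PSH\tsequence\taccession",
  "PSM\tAAAK\tP12345",
  "PSM\tCCCR\tP12345"
)

# Build a data frame for one section from its header tag and row tag.
# Assumes the section is present with at least one row.
read_section <- function(lines, header_tag, row_tag) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  tags <- vapply(fields, `[`, character(1), 1)
  header <- fields[[which(tags == header_tag)[1]]][-1]
  rows <- fields[tags == row_tag]
  out <- as.data.frame(
    do.call(rbind, lapply(rows, `[`, -1)),
    stringsAsFactors = FALSE
  )
  names(out) <- header
  out
}

psms <- read_section(mztab_lines, "PSH", "PSM")
proteins <- read_section(mztab_lines, "PRH", "PRT")

# The sections can then be combined, here on the shared accession column.
merge(psms, proteins, by = "accession")
```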

Exporting

Proteins

data_proteins <- "path_to_data/project.mzTab" %>% import("mzTab", "proteins")

Peptides

data_peptides <- "path_to_data/project.mzTab" %>% import("mzTab", "peptides")