Importing • tidyproteomics

Data Support

Importing is currently implemented for a few platforms and assumes that peptide-level FDR (at the user’s desired level) has already been accounted for. See vignette("importing"). Data from other platforms can also be imported from flat files (.csv, .tsv, and .xlsx) with a custom configuration.

Platform	peptides	proteins	notes
ProteomeDiscoverer	.xlsx peptides export*	.xlsx proteins export*	requires layout configuration website
MaxQuant	evidence.txt	proteinGroups.txt	website
FragPipe	combined_peptide.tsv	combined_protein.tsv	website
Skyline	.csv MSstats peptide report*		requires MSstats install website
DIA-NN	.tsv peptide report*		website
mzTab	*.mzTab (v1.0.0)	*.mzTab (v1.0.0)	does not track MBR website

Table-formatted data (e.g. .csv, .tsv, .xlsx) from ProteomeDiscoverer, MaxQuant and FragPipe meet the requirements, and their import configurations are defined in the package data under tidyproteomics/inst/extdata/config/. Note that the groups sample, identifier and quantitative are required, while the rest are optional and only used if a match is found. Error handling for missing or unmatched columns is currently limited, so check the configuration carefully. For ProteomeDiscoverer, the peptide config can be modified to use Master Protein Accessions or Protein Accessions if either column is present. Also note that for some configurations the sample group has no defined “supplied name”, as this is later derived by the extraction code defined in the columns labeled “pattern”.
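Because unmatched columns are not reported clearly, it can help to sanity-check a configuration before importing. The following is a minimal base-R sketch and not part of tidyproteomics: the inline config, the data_headers vector and all object names are invented for illustration. It assumes only that a config is a tab-separated table with category and column_import columns, as in the tables on this page.

```r
# Hypothetical config, inlined as text so the example is self-contained
config_text <- "category\tcolumn_defined\tcolumn_import
sample\tsample\tRaw file
identifier\tprotein\t^Proteins$
quantitative\tabundance_raw\t^Intensity$
accounting\tnum_psms\tMS/MS count"
config <- read.delim(text = config_text, stringsAsFactors = FALSE)

# The three groups that must be present
required <- c("sample", "identifier", "quantitative")
missing_groups <- setdiff(required, config$category)

# Hypothetical column headers read from the data file to be imported
data_headers <- c("Raw file", "Proteins", "Intensity")

# Config patterns that match none of the headers would be silently skipped
has_match <- vapply(
  config$column_import,
  function(p) any(grepl(p, data_headers, perl = TRUE)),
  logical(1)
)
unmatched <- config$column_import[!has_match]

print(missing_groups)
print(unmatched)
```

Here every required group is present, and the only pattern without a matching header belongs to an optional accounting column.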

Spectronaut and Proteograph Analysis Suite are known quantitative proteomics platforms that are not directly supported. Given a flat-file export, data from these platforms could also be imported. See User Defined Import.

library("dplyr")
library("tidyproteomics")

ProteomeDiscoverer

The ProteomeDiscoverer software suite has the ability to export data post analysis for both peptides and proteins. This can be accomplished by opening the Study Results, then selecting File > Export > To Microsoft Excel. In the pop-up keep only Level 1 checked, and select either “Proteins” or “Peptide Groups”. See the Exporting section below to make sure the required columns are present.

The data exported from ProteomeDiscoverer is not very “tidy”, as it has mixed wide-format columns (eg. Abundance for each sample on a single row) and long-format. The data table from ProteomeDiscoverer is protein-centric wherein each row is dedicated to a single protein, rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns such that each row in our new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we need to pivot the wide-format columns defined in the column pivot then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.

Note that the pivot columns provide the basis for sample and sample_file: the values are extracted from the column headers as indicated in the pattern_extract column, which is why these rows have no value for column_import.
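The pivot-and-extract step described above can be sketched in base R. This is only an illustration of the idea, not the code tidyproteomics runs: the toy table, its values and the helper extract() are invented, and the sample_file pattern F[0-9]+(?=:) is an assumption chosen to fit the toy headers.

```r
# Toy protein-centric table shaped like a ProteomeDiscoverer export
wide <- data.frame(
  Accession = c("P00001", "P00002"),
  `Abundance: F1: Sample, WT` = c(100, 200),
  `Abundance: F2: Sample, KO` = c(110, NA),
  check.names = FALSE
)

# Wide-format columns to pivot: one abundance column per sample
pivot_cols <- grep("^Abundance:", names(wide), value = TRUE)

# Rotate so that each row holds one protein in one sample
long <- do.call(rbind, lapply(pivot_cols, function(col) {
  data.frame(
    protein = wide$Accession,
    header = col,
    abundance_raw = wide[[col]],
    stringsAsFactors = FALSE
  )
}))

# Pull values out of the original header with look-around regular expressions
extract <- function(x, pattern) {
  trimws(regmatches(x, regexpr(pattern, x, perl = TRUE)))
}
long$sample <- extract(long$header, "(?<=,).+")
long$sample_file <- extract(long$header, "F[0-9]+(?=:)")
long$header <- NULL

print(long)
```

Each of the two proteins now appears once per sample, with sample and sample_file derived from the column header rather than from a dedicated column.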

Initial Set Up

When setting up your experiment it is essential to create sample names. In the Study Definition tab of the open Study, navigate to the Study Factors box, add a Categorical Factor, name it (e.g. My_Sample) and name your study groups (e.g. WT and KO). These designations need to be applied to the imported files under the Samples tab. Notice that the left-most column is the added Categorical Factor, named as supplied. In each line there is a pull-down menu for assigning each file (or label for TMT) to one of the Categorical Factor values. This ensures that the imported data has properly labeled samples. If this has not been done, it can be fixed after import with the modify() function.

Exporting

A quick note on reported abundances: ProteomeDiscoverer reports protein and peptide abundances in several ways that may not be immediately clear. tidyproteomics should have access to the raw abundance values, so that normalization and imputation start from scratch. We want to export, for example, Abundance: F3: Sample, p97-KD, not Abundances (Grouped) ..., Abundance Ratio: ..., or any other derivation. This can be set by toggling the display columns in ProteomeDiscoverer to show Abundance. Leaving the other columns visible should not affect importing, as they will be disregarded.
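As a quick illustration of why the derived columns are disregarded, the sketch below applies the Abundance\: pattern used in the configuration tables to a few headers. The grouped and ratio header names are hypothetical completions invented for this example. Only the raw abundance header is taken from the text above.

```r
# Example headers: only the second one is a raw abundance column
headers <- c(
  "Accession",
  "Abundance: F3: Sample, p97-KD",
  "Abundances (Grouped): p97-KD",          # hypothetical derived column
  "Abundance Ratio: (p97-KD) / (Control)"  # hypothetical derived column
)

# "Abundance" followed directly by a colon matches raw abundances only
is_raw <- grepl("Abundance\\:", headers, perl = TRUE)
print(headers[is_raw])
```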

Proteins

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/ProteomeDiscoverer_proteins.tsv:

category	column_defined	column_import	pattern_extract	pattern_split	pivot	REQUIRED
sample	sample		“(?<=,).+”			YES
sample	sample_file		(?<=)F(?=:)			YES
identifier	protein	^Accession$		\;		YES
quantitative	abundance_raw	Abundance\:			TRUE	YES
impute	match_between_runs	Found in Sample\:	Found		TRUE	YES
accounting	num_peptides	^# Peptides$
accounting	num_unique_peptides	^# Unique Peptides$
accounting	num_psms	# PSMs
annotation	description	Description
…etc

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/ProteomeDiscoverer_peptides.tsv:

category	column_defined	column_import	pattern_extract	pattern_split	pivot	REQUIRED
sample	sample		“(?<=,).+”			YES
sample	sample_file		(?<=)F(?=:)			YES
identifier	protein	Master Protein Accessions		\;		YES
identifier	peptide	Annotated Sequence				YES
identifier	modification	^Modifications$				YES
quantitative	abundance_raw	Abundance\:			TRUE	YES
impute	match_between_runs	Found in Sample\:	Found		TRUE	YES
accounting	num_psms	# PSMs
accounting	description	Description
…etc

# replace path_to_package_data("hela_proteins") with the path to your local data, e.g.:
# hela_proteins <- "./data/hela_export_table.xlsx" %>%
#    import("ProteomeDiscoverer", "proteins") 
data_proteins <- path_to_package_data("hela_proteins") %>%
   import("ProteomeDiscoverer", "proteins")

MaxQuant

The MaxQuant software suite creates files in the project subdirectory current_project > combined > txt, with data for both peptides (evidence.txt) and proteins (proteinGroups.txt).

The data exported from MaxQuant for the file proteinGroups.txt is not very “tidy”, as it has mixed wide-format columns (e.g. Intensity for each sample on a single row) and long-format. The data table from MaxQuant is protein-centric wherein each row is dedicated to a single protein, rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns such that each row in our new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we need to pivot the wide-format columns defined in the column pivot then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.

Note also that we want to remove rows from the decoy search, which are labeled REV_*, as indicated in the column pattern_remove.
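The effect of the pattern_remove, pattern_split and pattern_extract entries for the protein identifier can be sketched in base R. The identifier strings below are invented, and the order in which tidyproteomics applies the three patterns internally is an assumption made for this illustration.

```r
# Hypothetical identifier entries, including one decoy hit
ids <- c(
  "sp|P04406|G3P_HUMAN;sp|P00558|PGK1_HUMAN",
  "REV_sp|P68871|HBB_HUMAN",
  "sp|P68871|HBB_HUMAN"
)

# pattern_remove: drop rows that come from the decoy search
ids_kept <- ids[!grepl("^REV\\_", ids, perl = TRUE)]

# pattern_split: protein groups list several entries separated by ';'
split_ids <- strsplit(ids_kept, ";", fixed = TRUE)

# pattern_extract: keep only the accession between the two pipes
accessions <- lapply(split_ids, function(x) {
  regmatches(x, regexpr("(?<=\\|).*?(?=\\|)", x, perl = TRUE))
})

print(accessions)
```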

Initial Set Up

While evidence.txt contains values for each imported file (important for comparative statistics), proteinGroups.txt will only contain an entry for each file if the Experiment column in the raw data tab of the initial MaxQuant configuration has a unique value for each file (e.g. 1, 2, 3, …). Otherwise, the protein-level values are merged across files that share an experiment group. The sample groups can then be set with the modify() function.

Exporting

Proteins

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/MaxQuant_proteins.tsv:

category	column_defined	column_import	pattern_extract	pattern_remove	pattern_split	pivot	REQUIRED
sample	sample		(?<=\s)[0-9]+				YES
sample	sample_file		(?<=\s)[0-9]+				YES
identifier	protein	^Proteins IDs$	(?<=\\|).*?(?=\\|)	^REV\_	\;		YES
quantitative	abundance_raw	^Intensity\s$				TRUE	YES
accounting	num_psms	^MS/MS count
accounting	num_peptides	^Peptides\s				TRUE
accounting	num_unique_peptides	^Unique peptides\s				TRUE

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/MaxQuant_peptides.tsv:

category	column_defined	column_import	pattern_extract	pattern_split	REQUIRED
sample	sample	Raw file			YES
sample	sample_file	Experiment			YES
identifier	protein	^Proteins$	(?<=\\|).*?(?=\\|)	\;	YES
identifier	peptide	^Sequence$			YES
identifier	modification	^Modified sequence$			YES
quantitative	abundance_raw	^Intensity$			YES
impute	match_between_runs	Type	MATCH		YES
accounting	num_psms	MS/MS count

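# import the protein-level MaxQuant output, then relabel each sample
# with its experimental group (here the first four are ko, the last four wt)
data_proteins <- "path_to_maxquant_project/combined/txt/proteinGroups.txt" %>%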
   import("MaxQuant", "proteins") %>%
   reassign(field = 'sample', pattern = 'sample_1', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_2', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_3', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_4', replace = 'ko') %>%
   reassign(field = 'sample', pattern = 'sample_5', replace = 'wt') %>%
   reassign(field = 'sample', pattern = 'sample_6', replace = 'wt') %>%
   reassign(field = 'sample', pattern = 'sample_7', replace = 'wt') %>%
   reassign(field = 'sample', pattern = 'sample_8', replace = 'wt')

FragPipe

The FragPipe software suite creates files in the project subdirectory, with data for both peptides (combined_peptide.tsv) and proteins (combined_protein.tsv).

The data exported from FragPipe for the file combined_protein.tsv is not very “tidy”, as it has mixed wide-format columns (eg. Intensity for each sample on a single row) and long-format. The data table from FragPipe is protein-centric wherein each row is dedicated to a single protein, rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns such that each row in our new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we need to pivot the wide-format columns defined in the column pivot then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.

Initial Set Up

FragPipe already removes rows from the decoy search if this is indicated in the workflow setup. If that is not being done, you can indicate a pattern in the column pattern_remove.

Exporting

Proteins

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/FragPipe_proteins.tsv:

category	column_defined	column_import	pattern_extract	pattern_split	pivot	REQUIRED
sample	sample		(?<=\s)[0-9]+			YES
sample	sample_file		(?<=\s)[0-9]+			YES
identifier	protein	^Protein ID$	(?<=\\|).*?(?=\\|)	\;		YES
quantitative	abundance_raw	\sMaxLFQ\sIntensity$			TRUE	YES
accounting	num_psms	^\d\sSpectral\sCount$			TRUE
accounting	num_psms_unique	^\sUnique\sSpectral\sCount$			TRUE
annotation	description	^Description$
annotation	gene_name	^Gene$

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/FragPipe_peptides.tsv:

category	column_defined	column_import	pattern_extract	pattern_split	pivot	REQUIRED
sample	sample		.+(?=\_\d+\s)			YES
sample	sample_file		.+(?=\sMax)			YES
identifier	protein	^Protein ID$	(?<=\\|).*?(?=\\|)	\;		YES
identifier	peptide	^Peptide Sequence$				YES
quantitative	abundance_raw	\sMaxLFQ\sIntensity$			TRUE	YES
accounting	num_psms	^\sSpectral\sCount$			TRUE
annotation	description	^Protein Description$
annotation	gene_name	^Gene$
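To see what the two sample patterns in the peptide configuration above pull out of a column header, the sketch below applies them to a few headers. The header names are invented and assume the layout "<group>_<replicate> MaxLFQ Intensity". Real exports may differ, in which case the patterns would need adjusting.

```r
# Hypothetical wide-format intensity headers from combined_peptide.tsv
headers <- c(
  "ctrl_1 MaxLFQ Intensity",
  "ctrl_2 MaxLFQ Intensity",
  "kd_1 MaxLFQ Intensity"
)

# sample: everything before the trailing "_<number> "
sample <- regmatches(
  headers, regexpr(".+(?=\\_\\d+\\s)", headers, perl = TRUE)
)

# sample_file: everything before " Max"
sample_file <- regmatches(
  headers, regexpr(".+(?=\\sMax)", headers, perl = TRUE)
)

print(data.frame(header = headers, sample, sample_file))
```

With this layout the replicates of one group share a sample value, while sample_file stays unique per column.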

Skyline

The Skyline software suite can export quantitative peptide data for most analyses. The exported data file is in a fairly “tidy” long format CSV file, where each peptide for each sample is reported on an individual row.

Initial Set Up

A report needs to be established under File > Export > Report. Select Edit list..., choose Group: > External Tools, then click Add.... Select the columns that correspond to the required values shown below, name the report in Report Name: and click OK.

Exporting

Proteins

Not yet supported. Peptides can be combined into proteins with collapse().
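As a rough picture of what combining peptides into proteins means, the base-R sketch below sums peptide abundances per protein and sample. It is a stand-in for illustration only: the data are invented, and collapse() may use a different aggregation rule and offers options not shown here.

```r
# Toy long-format peptide table, one row per peptide per sample
peptides <- data.frame(
  sample = c("wt", "wt", "wt", "ko", "ko"),
  protein = c("P04406", "P04406", "P00558", "P04406", "P00558"),
  peptide = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDEA", "PEPTIDEC"),
  abundance_raw = c(10, 20, 5, 8, 7),
  stringsAsFactors = FALSE
)

# Sum peptide abundances within each sample and protein
proteins <- aggregate(
  abundance_raw ~ sample + protein,
  data = peptides,
  FUN = sum
)

print(proteins)
```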

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/SkyLine_peptides.tsv:

data_peptides <- "path_to_skyline_project/output_file_name.csv" %>%
   import("Skyline", "peptides")

DIA-NN

The DIA-NN software suite exports quantitative peptide data back into the project folder as report.tsv. The exported data file is in a fairly “tidy” long format file, where each peptide for each sample is reported on an individual row.

Exporting

Proteins

Not yet supported. Peptides can be combined into proteins with collapse().

Peptides

The following columns should be considered. These can be modified in the file tidyproteomics/inst/extdata/config/DIA-NN_peptides.tsv:

data_peptides <- "path_to_diann_project/report.tsv" %>%
   import("DIA-NN", "peptides")

mzTab

The mzTab data has limited support from major vendors - ProteomeDiscoverer for example only supports version 1.0.0. The data for proteins, peptides and psms are all contained within a single file. Tidyproteomics assembles the psm, peptide and protein data independently then sequentially combines them to generate the desired protein or peptide level output.
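The single-file layout can be illustrated with a small base-R sketch. In mzTab 1.0 each line starts with a tag: PRH/PRT for the protein header and rows, PEH/PEP for peptides, and PSH/PSM for PSMs. The toy lines below carry only a few columns (a real file has many more required ones), and read_section() is a hypothetical helper written for this example, not a tidyproteomics function.

```r
# Heavily trimmed, invented mzTab content held in memory
mztab_lines <- c(
  "MTD\tmzTab-version\t1.0.0",
  "PRH\taccession\tdescription",
  "PRT\tP04406\tGlyceraldehyde-3-phosphate dehydrogenase",
  "PEH\tsequence\taccession",
  "PEP\tPEPTIDEA\tP04406",
  "PSH\tsequence\tPSM_ID\taccession",
  "PSM\tPEPTIDEA\t1\tP04406",
  "PSM\tPEPTIDEA\t2\tP04406"
)

# Collect the rows of one section into a data frame named by its header line
read_section <- function(lines, header_tag, row_tag) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  tags <- vapply(fields, `[`, character(1), 1)
  header <- fields[[which(tags == header_tag)[1]]][-1]
  rows <- lapply(fields[tags == row_tag], function(f) f[-1])
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- header
  out
}

# Each level is assembled independently from the same file
proteins <- read_section(mztab_lines, "PRH", "PRT")
peptides <- read_section(mztab_lines, "PEH", "PEP")
psms <- read_section(mztab_lines, "PSH", "PSM")

print(psms)
```

The three tables share the accession and sequence columns, which is what allows them to be combined afterwards into protein-level or peptide-level output.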

Exporting

Proteins

data_proteins <- "path_to_data/project.mzTab" %>% import("mzTab", "proteins")

Peptides

data_peptides <- "path_to_data/project.mzTab" %>% import("mzTab", "peptides")