Importing
importing.Rmd
Data Support
Importing is currently implemented for a few platforms and assume
peptide level FDR (at the user’s desired level) has already been
accounted for. See vignette("importing")
. Importing is
flexible enough to accept other data platforms in flat files (.csv,
.tsv, and .xlsx) with a custom configuration.
Platform | peptides | proteins | notes |
---|---|---|---|
ProteomeDiscoverer | *.xlsx peptides export | *.xlsx proteins export |
requires layout configuration |
MaxQuant | evidence.txt | proteinGroups.txt | website |
FragPipe | combined_peptide.tsv | combined_protein.tsv | website |
Skyline | *.csv MSstats peptide report |
requires MSstats install |
|
DIA-NN | *.tsv peptide report | website | |
mzTab | *.mzTab (v1.0.0) | *.mzTab (v1.0.0) |
does not track MBR |
Table formatted data (eg. .csv, .tsv, .xlsx) from ProteomeDiscoverer,
MaxQuant and FragPipe meet the requirements, and are defined in the
package data tidyproteomics/inst/extdata/config/
accordingly. Note that the groups sample, identifier
and quantitative are required, while the rest are optional, and
only used if a match is found - currently this does not have good error
handeling, so be aware. For ProteomeDiscoverer the peptide’s config can
be modified to use Master Protein Accessions
or
Protein Accessions
if either column is present. Also note,
that for some configurations the sample group has no defined
“supplied name”, as this is later derived by the extraction code defined
in the columns labeled “pattern”.
The currently known and not directly supported quantitative proteomics platforms are Spectronaut and Proteograph Analysis Suite. Given a flat-file export, data from these platforms could also be importable. See User Defined Import.
ProteomeDiscoverer
The ProteomeDiscoverer software suite has the ability to export data
post analysis for both peptides
and proteins
.
This can be accomplished by opening the Study Results, then selecting
File \> Export \> To Microsoft Excel
. In the pop up
keep only Level 1 checked, and select either “Proteins” or “Peptide
Groups”. See the Exporting section below to make sure the required
columns are present.
The data exported from ProteomeDiscoverer is not very “tidy”, as it has mixed wide-format columns (eg. Abundance for each sample on a single row) and long-format. The data table from ProteomeDiscoverer is protein-centric wherein each row is dedicated to a single protein, rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns such that each row in our new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we need to pivot the wide-format columns defined in the column pivot then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.
Note that the pivot columns will provide the basis for sample and sample_file by extracting the correct values as indicated in the pattern_extract column, and hence the absence of a value for column_import.
Initial Set Up
When setting up your experiment it is essential to create sample
names in the Study Definition
tab of the open Study,
navigate to the Study Factors
box and simply add
Catigorical Factor
and name it (eg. My_Sample) and name
your study groups (eg. WT and KO). These destinations need to be applied
to the import files under the Samples
tab. Notice the left
most column is the added Catigorical Factor
named the same
as what was supplied. In each line there is a pull-down menu for
designating each file (or label for TMT) to one of the
Catigorical Factors
. This will now ensure that when
importing the data there are properly labeled sample. No big deal if
that hasn’t been done, it can be fixed with the modify()
function.
Exporting
Abundances A quick note on reported abundances.
ProteomeDiscoverer reports protein and peptide abundances in several
ways that may not be immediately clear - tidyproteomics should have
access to the raw abundance values, so that normalization and imputation
are starting from scratch. We want to export for example
Abundance: F3: Sample, p97-KD
, not
Abundances (Grouped) ...
,
Abundance Ratio: ...
, or any-other derivation. This can be
set by toggling the display columns in ProteomeDiscoverer to show
Abundance
, leaving the other columns should not impact data
importing and will be disregarded.
Proteins
The columns following columns should be considered. These can be
modified in the file
tidyproteomics/inst/extdata/config/ProteomeDiscoverer_proteins.tsv
:
category | column_defined | column_import | pattern_extract | pattern_remove | pattern_split | pivot | REQUIRED |
---|---|---|---|---|---|---|---|
sample | sample | “(?<=,).+” | YES | ||||
sample | sample_file | (?<=)F(?=:) | YES | ||||
identifier | protein | ^Accession$ | \; | YES | |||
quantitative | abundance_raw | Abundance\: | TRUE | YES | |||
impute | match_between_runs | Found in Sample\: | Found | TRUE | YES | ||
accounting | num_peptides | ^# Peptides$ | |||||
accounting | num_unique_peptides | ^# Unique Peptides$ | |||||
accounting | num_psms | # PSMs | |||||
annotation | description | Description | |||||
…etc |
Peptides
The columns following columns should be considered. These can be
modified in the file
tidyproteomics/inst/extdata/config/ProteomeDiscoverer_ppeptides.tsv
:
category | column_defined | column_import | pattern_extract | pattern_remove | pattern_split | pivot | REQUIRED |
---|---|---|---|---|---|---|---|
sample | sample | “(?<=,).+” | YES | ||||
sample | sample_file | (?<=)F(?=:) | YES | ||||
identifier | protein | Master Protein Accessions | \; | YES | |||
identifier | peptide | Annotated Sequence | YES | ||||
identifier | modification | ^Modifications$ | YES | ||||
quantitative | abundance_raw | Abundance\: | TRUE | YES | |||
impute | match_between_runs | Found in Sample\: | Found | TRUE | YES | ||
accounting | num_psms | # PSMs | |||||
accounting | description | Description | |||||
…etc |
# replace path_to_package_data("proteins") with the path to your local data.
# hela_proteins <- "./data/hela_export_table.xlsx" %>%
# import("ProteomeDiscoverer", "proteins")
data_proteins <- path_to_package_data("hela_proteins") %>%
import("ProteomeDiscoverer", "proteins")
MaxQuant
The MaxQuant software suite creates files in project sub directories
following current_project \> combined \> txt
with
data both peptides (evidence.txt) and proteins
(proteinGroups.txt).
The data exported from MaxQuant for the file proteinGroups.txt is not very “tidy”, as it has mixed wide-format columns (eg. Abundance for each sample on a single row) and long-format. The data table from MaxQuant is protein-centric wherein each row is dedicated to a single protein, rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns such that each row in our new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we need to pivot the wide-format columns defined in the column pivot then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.
Note also that we want to remove rows from the decoy search labeled REV_*, and indicated in the column pattern_remove.
Initial Set Up
While evidence.txt contains values for each imported file
(important for comparative statistics), the proteinGroups.txt
file will only contain an entry for each file if in the initial MaxQuant
configuration the Experiment column in the raw data tab has a unique
value for each file (eg 1, 2, 3, …), otherwise the values get merged on
common experiment groups in the output for the protein level data. The
sample groups can then be set with the modify()
function.
Exporting
Proteins
The columns following columns should be considered. These can be
modified in the file
tidyproteomics/inst/extdata/config/MaxQuant_proteins.tsv
:
category | column_defined | column_import | pattern_extract | pattern_remove | pattern_split | pivot | REQUIRED |
---|---|---|---|---|---|---|---|
sample | sample | (?<=\s)[0-9]+ | YES | ||||
sample | sample_file | (?<=\s)[0-9]+ | YES | ||||
identifier | protein | ^Proteins IDs$ | (?<=\|).*?(?=\|) | ^REV\_ | \; | YES | |
quantitative | abundance_raw | ^Intensity\s$ | TRUE | YES | |||
accounting | num_psms | ^MS/MS count | |||||
accounting | num_peptides | ^Peptides\s | TRUE | ||||
accounting | num_unique_peptides | ^Unique peptides\s | TRUE |
Peptides
The columns following columns should be considered. These can be
modified in the file
tidyproteomics/inst/extdata/config/MaxQuant_peptides.tsv
:
category | column_defined | column_import | pattern_extract | pattern_remove | pattern_split | pivot | REQUIRED |
---|---|---|---|---|---|---|---|
sample | sample | Raw file | YES | ||||
sample | sample_file | Experiment | YES | ||||
identifier | protein | ^Proteins$ | (?<=\|).*?(?=\|) | \; | YES | ||
identifier | peptide | ^Sequence$ | YES | ||||
identifier | modification | ^Modified sequence$ | YES | ||||
quantitative | abundance_raw | ^Intensity$ | YES | ||||
impute | match_between_runs | Type | MATCH | YES | |||
accounting | num_psms | MS/MS count |
data_proteins <- "path_to_maxquant_project/combined/txt/proteinGroups.txt" %>%
import("MaxQuant", "proteins") %>%
reassign(field = 'sample', pattern = 'sample_1', replace = 'ko') %>%
reassign(field = 'sample', pattern = 'sample_2', replace = 'ko') %>%
reassign(field = 'sample', pattern = 'sample_3', replace = 'ko') %>%
reassign(field = 'sample', pattern = 'sample_4', replace = 'ko') %>%
reassign(field = 'sample', pattern = 'sample_5', replace = 'wt') %>%
reassign(field = 'sample', pattern = 'sample_6', replace = 'wt') %>%
reassign(field = 'sample', pattern = 'sample_7', replace = 'wt') %>%
reassign(field = 'sample', pattern = 'sample_8', replace = 'wt')
FragPipe
The FragPipe software suite creates files in project sub directory with data both peptides (combined_peptide.tsv) and proteins (combined_protein.tsv).
The data exported from FragPipe for the file combined_protein.tsv is not very “tidy”, as it has mixed wide-format columns (eg. Intensity for each sample on a single row) and long-format. The data table from FragPipe is protein-centric wherein each row is dedicated to a single protein, rather than a single measurement. To clean up this data we need to “rotate” the wide-format columns such that each row in our new table is a single observation. In other words, we need a single abundance value for a single protein from a single sample, per row. To accomplish this we need to pivot the wide-format columns defined in the column pivot then “extract” the sample name from the column header as defined in the column pattern_extract. These patterns conform to standard regular expressions.
Initial Set Up
FragPipe already removes rows from the decoy search if indicated in the workflow setup, if however this is not being done you can indicated a pettern in the column pattern_remove.
Exporting
Proteins
The columns following columns should be considered. These can be
modified in the file
tidyproteomics/inst/extdata/config/FragPipe_proteins.tsv
:
category | column_defined | column_import | pattern_extract | pattern_remove | pattern_split | pivot | REQUIRED |
---|---|---|---|---|---|---|---|
sample | sample | (?<=\s)[0-9]+ | YES | ||||
sample | sample_file | (?<=\s)[0-9]+ | YES | ||||
identifier | protein | ^Protein ID$ | (?<=\|).*?(?=\|) | \; | YES | ||
quantitative | abundance_raw | \sMaxLFQ\sIntensity$ | TRUE | YES | |||
accounting | num_psms | ^\d\sSpectral\sCount$ | TRUE | ||||
accounting | num_psms_unique | ^\sUnique\sSpectral\sCount$ | TRUE | ||||
annotation | description | ^Description$ | |||||
annotation | gene_name | ^Gene$ |
Peptides
The columns following columns should be considered. These can be
modified in the file
tidyproteomics/inst/extdata/config/FragPipe_peptides.tsv
:
category | column_defined | column_import | pattern_extract | pattern_remove | pattern_split | pivot | REQUIRED |
---|---|---|---|---|---|---|---|
sample | sample | .+(?=\_\d+\s) | YES | ||||
sample | sample_file | .+(?=\sMax) | YES | ||||
identifier | protein | ^Protein ID$ | (?<=\|).*?(?=\|) | \; | YES | ||
identifier | peptide | ^Peptide Sequence$ | YES | ||||
quantitative | abundance_raw | \sMaxLFQ\sIntensity$ | TRUE | YES | |||
accounting | num_psms | ^\sSpectral\sCount$ | TRUE | ||||
annotation | description | ^Protein Description$ | |||||
annotation | gene_name | ^Gene$ |
Skyline
The Skyline software suite can export quantitative peptide data for most analyses. The exported data file is in a fairly “tidy” long format CSV file, where each peptide for each sample is reported on an individual row.
Initial Set Up
A report need to be established, under
File > Export > Report
. Select
Edit list...
, Group: > External Tools
then
click Add...
. Select the Columns that correspond to the
required values shown below, name the report in
Report Name:
and click OK
.
Exporting
Proteins
Not yet supported. Peptides can be combined into proteins with
collapse()
.
DIA-NN
The DIA-NN software suite exports quantitative peptide data back into the project folder as report.tsv. The exported data file is in a fairly “tidy” long format file, where each peptide for each sample is reported on an individual row.
Exporting
Proteins
Not yet supported. Peptides can be combined into proteins with
collapse()
.
mzTab
The mzTab data has limited support from major vendors - ProteomeDiscoverer for example only supports version 1.0.0. The data for proteins, peptides and psms are all contained within a single file. Tidyproteomics assembles the psm, peptide and protein data independently then sequentially combines them to generate the desired protein or peptide level output.