Skip to contents

Protein accounting is an important aspect of quantiative proteomics. It consists of aggregating individual peptide measurements into a single reported protein. However, sequence homology make this a less than trivial process, as the LCMS process does not identify proteins specifically, rather their individual peptide segments. It is then up to the researcher to implement a method that infers the original protein content.

While the proteomics community at large has devised several schemes for protein accounting, not all methods are universally amenable. For example, when accounting for cellular differences between to cell states, it is often preferred to arrive at the minimal number of protein sequences that can explain the data. That means tossing aside identifiers to sequence variants - something researchers attempting to identify specific genes might like to retain. Furthermore, for researchers attempting to grasp if gene insertions are viable, gross summation of peptide abundances may miss nuances in sequence variations. These latter points are not well controlled for in platforms that cater for proteome wide discoveries. Therefor, we have attempted to provide several simple mechanisms that can apply to a variety of LCMS proteomics projects both larger proteome wide discoveries and smaller induced expression projects.

More sophisticated methods of protein accounting should utilize those current implementations, export the quantitative protein level data and then import into tidyproteomics.

NOTE: tidyproteomics allows you to group and summarize quantitative values for any annotation term, such as gene_name or even molecular_function giving you flexibility on how individual peptide values get summarized into larger observations. See Collapsing to Gene Name.

The inputs

assign_by

The collapse() function accommodates four methods for accounting with the variable:

Method Description
all-possible retain all protein assignments regardless of homology
non-homologous retain all protein assignments, but only use quantitative values from peptides that identify a single unique protein.
razor-local peptide will be assigned to the protein group with the largest number of total peptides identified within a sample group
razor-global peptide will be assigned to the protein group with the largest number of total peptides identified globally

top_n

this variable sets the number of peptide quantitative values to include in the final protein quantitative value.

split_abundance

a true/false variable, if true, will split abundances of razor-peptides by the proportional abundance between proteins. This method is experimental, has not been validated or peer-reviewed.

fasta_path can be provided to calculate iBAQ based values. However, currently this method only supports Tryptic based peptide accounting.

Examples

Here we import a data-object of ProteomeDiscoverer’s protein level accounting and investigate the difference between ProteomeDiscoverer’s protein accounting and that implemented by tidyproteomics.

# download the data
url_pro <- "https://data.caltech.edu/records/aevwq-2ps50/files/ProteomeDiscoverer_2.5_p97KD_HCT116_proteins.xlsx?download=1"
url_pep <- "https://data.caltech.edu/records/aevwq-2ps50/files/ProteomeDiscoverer_2.5_p97KD_HCT116_peptides.xlsx?download=1"

download.file(url_pro, destfile = "./data/pd_proteins.xlsx")
download.file(url_pep, destfile = "./data/pd_peptides.xlsx")

# import the data
pro_data <- "./data/pd_proteins.xlsx" %>% import('ProteomeDiscoverer', 'proteins')
pep_data <- "./data/pd_peptides.xlsx" %>% import('ProteomeDiscoverer', 'peptides')

Assign peptides by razor logic

pro_frompep_razor <- pep_data %>% 
  collapse(assign_by = 'razor-global', top_n = Inf) %>%
  normalize(.method = c('median', 'linear', 'loess'))

pro_frompep_razor %>% 
  summary("sample", destination = 'return') %>% 
  select(c('sample', 'proteins', 'peptides', 'CVs')) %>%
  knitr::kable()
sample proteins peptides CVs
ctl 7056 63622 0.14
p97_kd 7056 63622 0.11

Check to see how the two results agree …

tbl_merged_razor <- pro_frompep_razor$quantitative %>% select('sample','replicate','protein', 'abundance_raw') %>%
  inner_join(pro_data$quantitative %>% select('sample','replicate','protein', 'abundance_raw'), 
             by = c('sample','replicate','protein'),
             suffix = c("_tidyproteomics", "_proteomediscoverer"))

tbl_merged_razor %>%
  mutate(itentical_protein_abundances = abundance_raw_proteomediscoverer == abundance_raw_tidyproteomics) %>%
  filter(!is.na(itentical_protein_abundances)) %>%
  group_by(itentical_protein_abundances) %>%
  summarise(n = n(), .groups = 'drop') %>%
  mutate(r = paste(n / sum(n) * 100, "%")) %>%
  knitr::kable()
itentical_protein_abundances n r
FALSE 812 2.69257552143781 %
TRUE 29345 97.3074244785622 %

A quick plot for visualizing the

tbl_merged_razor %>%
  mutate(sample_rep = paste(sample, replicate)) %>%
  ggplot(aes(abundance_raw_proteomediscoverer, abundance_raw_tidyproteomics)) + 
  geom_abline(color='red') +
  geom_point(alpha = .67) +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~sample_rep)

Lets look at a few examples of outliers …

tbl_merged_razor %>%
  filter(abundance_raw_proteomediscoverer != abundance_raw_tidyproteomics) %>%
  head(3) %>%
  knitr::kable()
sample replicate protein abundance_raw_tidyproteomics abundance_raw_proteomediscoverer
ctl 2 P45983 869555.1 703721.1
ctl 2 Q8N138 3530659.8 2396096.2
ctl 2 Q16537 4131858.3 3433024.4

We will take the protein P45983 to examine …

pep_P45983 <- pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>% 
  filter(protein == 'P45983')

pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>%
  filter(peptide %in% pep_P45983$peptide) %>%
  group_by(peptide, modifications) %>%
  summarise(
    proteins = paste(protein, collapse = "; "),
    abundance_raw = unique(abundance_raw)
  ) %>%
  knitr::kable()
peptide modifications proteins abundance_raw
IIDFGLAR NA O15264; Q86YV6; P45984; P53778; Q16539; P45983 3426127.8
ISVDEALQHPYINVWYDPSEAEAPPPK NA P45983 417442.9
MSYLLYQMLCGIK 1xCarbamidomethyl [C10] P45984; P45983 289726.1
VIEQLGTPCPEFMK 1xCarbamidomethyl [C9] P45983 566715.8
YAGYSFEK NA P45983 425560.5

Here we see that P45983 yielded 5 peptides, 2 of which are shared. In the outlier table we see that ProteomeDiscoverer excluded all razor peptides from the protein quantitation. And if we dig further, we find that the peptide IIDFGLAR by razor definition belongs to the protein Q16539, while the peptide MSYLLYQMLCGIK can be assigned to P45983, thus giving us the larger value shown in the outlier table.

pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>%
  filter(protein %in% c('O15264','Q86YV6','P45984','P53778','Q16539','P45983')) %>%
  group_by(protein) %>%
  summarise(
    n_peptides = length(unique(peptide))
  ) %>%
  arrange(desc(n_peptides)) %>%
  knitr::kable()
protein n_peptides
Q16539 7
P45983 5
P45984 3
O15264 2
Q86YV6 2
P53778 1

Assign by excluding homologous peptides

That means the ProteomeDiscoverer method for protein accounting excluded the abundance measurement from all razor peptides. Let try to do replicate that as well.

pro_frompep_nonhom <- pep_data %>% 
  collapse(assign_by = 'non-homologous', top_n = Inf) %>%
  normalize(.method = c('median', 'linear', 'loess'))

pro_frompep_nonhom %>% 
  summary("sample", destination = 'return') %>% 
  select(c('sample', 'proteins', 'peptides', 'CVs')) %>%
  knitr::kable()
sample proteins peptides CVs
ctl 7056 63622 0.14
p97_kd 7056 63622 0.12

Check to see how the two results agree …

tbl_merged_nonhom <- pro_frompep_nonhom$quantitative %>% select('sample','replicate','protein', 'abundance_raw') %>%
  inner_join(pro_data$quantitative %>% select('sample','replicate','protein', 'abundance_raw'), 
             by = c('sample','replicate','protein'),
             suffix = c("_tidyproteomics", "_proteomediscoverer"))

tbl_merged_nonhom %>%
  mutate(itentical_protein_abundances = abundance_raw_proteomediscoverer == abundance_raw_tidyproteomics) %>%
  filter(!is.na(itentical_protein_abundances)) %>%
  group_by(itentical_protein_abundances) %>%
  summarise(n = n(), .groups = 'drop') %>%
  mutate(r = paste(n / sum(n) * 100, "%")) %>%
  knitr::kable()
itentical_protein_abundances n r
FALSE 3449 11.5247101279781 %
TRUE 26478 88.4752898720219 %

A quick plot for visualizing the

tbl_merged_nonhom %>%
  mutate(sample_rep = paste(sample, replicate)) %>%
  ggplot(aes(abundance_raw_proteomediscoverer, abundance_raw_tidyproteomics)) + 
  geom_abline(color='red') +
  geom_point(alpha = .67) +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~sample_rep)

Interesting, we now show that P45983 agrees with ProteomeDiscoverer, yet obviously several others do not

tbl_merged_nonhom %>% filter(protein == 'P45983') %>%
  knitr::kable()
sample replicate protein abundance_raw_tidyproteomics abundance_raw_proteomediscoverer
ctl 2 P45983 703721.1 703721.1
p97_kd 2 P45983 1635934.1 1635934.1
ctl 1 P45983 535040.5 535040.5
ctl 3 P45983 494210.7 494210.7
p97_kd 1 P45983 1409719.2 1409719.2
p97_kd 3 P45983 2219424.9 2219424.9

Lets look at a few examples of outliers …

tbl_merged_nonhom %>%
  filter(abundance_raw_proteomediscoverer != abundance_raw_tidyproteomics) %>%
  head(3) %>%
  knitr::kable()
sample replicate protein abundance_raw_tidyproteomics abundance_raw_proteomediscoverer
ctl 2 Q15366 67906.09 112335311
ctl 2 Q9Y3L5 81606.47 4567305
ctl 2 P53990 114462.68 16431834

We will take the protein Q96LR5 to examine …

pep_Q96LR5 <- pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>% 
  filter(protein == 'Q96LR5')

pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>%
  filter(peptide %in% pep_Q96LR5$peptide) %>%
  group_by(peptide, modifications) %>%
  summarise(
    proteins = paste(protein, collapse = "; "),
    abundance_raw = unique(abundance_raw)
  ) %>%
  knitr::kable()
#> `summarise()` has grouped output by 'peptide'. You can override using the
#> `.groups` argument.
peptide modifications proteins abundance_raw
DNWSPALTISK NA Q96LR5; P51965; Q969T4 7005611.0
ELAEITLDPPPNCSAGPK 1xCarbamidomethyl [C13] Q96LR5; Q969T4 2169172.8
IYHCNINSQGVICLDILK 2xCarbamidomethyl [C4; C13] Q96LR5; P51965; Q969T4 2543214.8
VDDSPSTSGGSSDGDQR NA Q96LR5 216778.9

A quick check on the peptide sum and it appears that ProteomeDiscoverer ignored razor-peptides all together for this protein

pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>%
  filter(peptide %in% pep_Q96LR5$peptide) %>%
  group_by(peptide, modifications) %>%
  summarise(
    proteins = paste(protein, collapse = "; "),
    abundance_raw = unique(abundance_raw),
    .groups = 'drop'
  ) %>%
  summarise(abundance_sum = sum(abundance_raw)) %>%
  knitr::kable()
abundance_sum
11934777

If we examine the other proteins in that group for all the peptides that may be shared …

get_peptides <- pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>%
  filter(protein %in% c('Q96LR5','P51965','Q969T4'))

pep_data$quantitative %>% 
  filter(!is.na(abundance_raw)) %>% 
  filter(sample == 'p97_kd', replicate == 1) %>%
  filter(peptide %in% get_peptides$peptide) %>%
  group_by(peptide, modifications) %>%
  summarise(
    proteins = paste(protein, collapse = "; "),
    abundance_raw = unique(abundance_raw)
  ) %>%
  knitr::kable()
#> `summarise()` has grouped output by 'peptide'. You can override using the
#> `.groups` argument.
peptide modifications proteins abundance_raw
DNWSPALTISK NA Q96LR5; P51965; Q969T4 7005611.0
DPAAPEPEEQEER NA Q969T4 371158.4
ELADITLDPPPNCSAGPK 1xCarbamidomethyl [C13] P51965 2536644.8
ELAEITLDPPPNCSAGPK 1xCarbamidomethyl [C13] Q96LR5; Q969T4 2169172.8
IYHCNINSQGVICLDILK 2xCarbamidomethyl [C4; C13] Q96LR5; P51965; Q969T4 2543214.8
VDDSPSTSGGSSDGDQR NA Q96LR5 216778.9

… we see that ProteomeDiscoverer actually assigned the abundances of razor-peptides. And if we check our results from the merged razor data we find that to be correct.

tbl_merged_razor %>%
  filter(sample == 'p97_kd', replicate == 1) %>% 
  filter(protein %in% c('Q96LR5','P51965','Q969T4')) %>%
  knitr::kable()
sample replicate protein abundance_raw_tidyproteomics abundance_raw_proteomediscoverer
p97_kd 1 Q969T4 371158.4 371158.4
p97_kd 1 P51965 2536644.8 2536644.8
p97_kd 1 Q96LR5 11934777.4 11934777.4

Collapsing to Gene Name

Alternatively, peptide level data can be collapsed to any Annotation term, in this case we will collapse to gene_name after adding the annotations to the peptide data object. See vignette("annotating").

library(tidyr)

data_fasta <- "uniprot_human-20398_20220920.fasta" %>% 
  tidyproteomics:::fasta_parse(as = "data.frame") %>%
  select(protein = accession, gene_name, description) %>%
  pivot_longer(
    cols = c('gene_name', 'description'),
    names_to = 'term',
    values_to = 'annotation'
  )

gene_data <- pep_data %>% 
  annotate(data_fasta, duplicates = 'merge') %>%
  collapse(collapse_to = 'gene_name')
gene_data %>% plot_counts()
#> Warning: Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)

gene_data %>% 
  normalize(.method = 'linear') %>%
  expression(p97_kd/ctl) %>% 
  plot_volcano(p97_kd/ctl, significance_column = 'p_value') + 
  labs(title = "Gene Level Expression")
#>  Normalizing quantitative data
#>  ... using linear regression
#>  ... using linear regression [222ms]
#> 
#>  Selecting best normalization method
#>  Selecting best normalization method ... done
#> 
#>   ... selected linear
#>  .. expression::t_test testing p97_kd / ctl
#>  .. expression::t_test testing p97_kd / ctl [3s]
#> 
#> Warning: Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#>   dplyr::filter(n > 1L)