Protein Accounting
collapsing.Rmd
Protein accounting is an important aspect of quantiative proteomics. It consists of aggregating individual peptide measurements into a single reported protein. However, sequence homology make this a less than trivial process, as the LCMS process does not identify proteins specifically, rather their individual peptide segments. It is then up to the researcher to implement a method that infers the original protein content.
While the proteomics community at large has devised several schemes for protein accounting, not all methods are universally amenable. For example, when accounting for cellular differences between to cell states, it is often preferred to arrive at the minimal number of protein sequences that can explain the data. That means tossing aside identifiers to sequence variants - something researchers attempting to identify specific genes might like to retain. Furthermore, for researchers attempting to grasp if gene insertions are viable, gross summation of peptide abundances may miss nuances in sequence variations. These latter points are not well controlled for in platforms that cater for proteome wide discoveries. Therefor, we have attempted to provide several simple mechanisms that can apply to a variety of LCMS proteomics projects both larger proteome wide discoveries and smaller induced expression projects.
More sophisticated methods of protein accounting should utilize those current implementations, export the quantitative protein level data and then import into tidyproteomics.
NOTE: tidyproteomics allows you to group and summarize quantitative values for any annotation term, such as gene_name or even molecular_function giving you flexibility on how individual peptide values get summarized into larger observations. See Collapsing to Gene Name.
The inputs
assign_by
The collapse()
function accommodates four methods for
accounting with the variable:
Method | Description |
---|---|
all-possible | retain all protein assignments regardless of homology |
non-homologous | retain all protein assignments, but only use quantitative values from peptides that identify a single unique protein. |
razor-local | peptide will be assigned to the protein group with the largest number of total peptides identified within a sample group |
razor-global | peptide will be assigned to the protein group with the largest number of total peptides identified globally |
top_n
this variable sets the number of peptide quantitative values to include in the final protein quantitative value.
split_abundance
a true/false variable, if true, will split abundances of razor-peptides by the proportional abundance between proteins. This method is experimental, has not been validated or peer-reviewed.
fasta_path can be provided to calculate iBAQ based values. However, currently this method only supports Tryptic based peptide accounting.
Examples
Here we import a data-object of ProteomeDiscoverer’s protein level accounting and investigate the difference between ProteomeDiscoverer’s protein accounting and that implemented by tidyproteomics.
# download the data
url_pro <- "https://data.caltech.edu/records/aevwq-2ps50/files/ProteomeDiscoverer_2.5_p97KD_HCT116_proteins.xlsx?download=1"
url_pep <- "https://data.caltech.edu/records/aevwq-2ps50/files/ProteomeDiscoverer_2.5_p97KD_HCT116_peptides.xlsx?download=1"
download.file(url_pro, destfile = "./data/pd_proteins.xlsx")
download.file(url_pep, destfile = "./data/pd_peptides.xlsx")
# import the data
pro_data <- "./data/pd_proteins.xlsx" %>% import('ProteomeDiscoverer', 'proteins')
pep_data <- "./data/pd_peptides.xlsx" %>% import('ProteomeDiscoverer', 'peptides')
Assign peptides by razor logic
pro_frompep_razor <- pep_data %>%
collapse(assign_by = 'razor-global', top_n = Inf) %>%
normalize(.method = c('median', 'linear', 'loess'))
pro_frompep_razor %>%
summary("sample", destination = 'return') %>%
select(c('sample', 'proteins', 'peptides', 'CVs')) %>%
knitr::kable()
sample | proteins | peptides | CVs |
---|---|---|---|
ctl | 7056 | 63622 | 0.14 |
p97_kd | 7056 | 63622 | 0.11 |
Check to see how the two results agree …
tbl_merged_razor <- pro_frompep_razor$quantitative %>% select('sample','replicate','protein', 'abundance_raw') %>%
inner_join(pro_data$quantitative %>% select('sample','replicate','protein', 'abundance_raw'),
by = c('sample','replicate','protein'),
suffix = c("_tidyproteomics", "_proteomediscoverer"))
tbl_merged_razor %>%
mutate(itentical_protein_abundances = abundance_raw_proteomediscoverer == abundance_raw_tidyproteomics) %>%
filter(!is.na(itentical_protein_abundances)) %>%
group_by(itentical_protein_abundances) %>%
summarise(n = n(), .groups = 'drop') %>%
mutate(r = paste(n / sum(n) * 100, "%")) %>%
knitr::kable()
itentical_protein_abundances | n | r |
---|---|---|
FALSE | 812 | 2.69257552143781 % |
TRUE | 29345 | 97.3074244785622 % |
A quick plot for visualizing the
tbl_merged_razor %>%
mutate(sample_rep = paste(sample, replicate)) %>%
ggplot(aes(abundance_raw_proteomediscoverer, abundance_raw_tidyproteomics)) +
geom_abline(color='red') +
geom_point(alpha = .67) +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~sample_rep)
Lets look at a few examples of outliers …
tbl_merged_razor %>%
filter(abundance_raw_proteomediscoverer != abundance_raw_tidyproteomics) %>%
head(3) %>%
knitr::kable()
sample | replicate | protein | abundance_raw_tidyproteomics | abundance_raw_proteomediscoverer |
---|---|---|---|---|
ctl | 2 | P45983 | 869555.1 | 703721.1 |
ctl | 2 | Q8N138 | 3530659.8 | 2396096.2 |
ctl | 2 | Q16537 | 4131858.3 | 3433024.4 |
We will take the protein P45983 to examine …
pep_P45983 <- pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(protein == 'P45983')
pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(peptide %in% pep_P45983$peptide) %>%
group_by(peptide, modifications) %>%
summarise(
proteins = paste(protein, collapse = "; "),
abundance_raw = unique(abundance_raw)
) %>%
knitr::kable()
peptide | modifications | proteins | abundance_raw |
---|---|---|---|
IIDFGLAR | NA | O15264; Q86YV6; P45984; P53778; Q16539; P45983 | 3426127.8 |
ISVDEALQHPYINVWYDPSEAEAPPPK | NA | P45983 | 417442.9 |
MSYLLYQMLCGIK | 1xCarbamidomethyl [C10] | P45984; P45983 | 289726.1 |
VIEQLGTPCPEFMK | 1xCarbamidomethyl [C9] | P45983 | 566715.8 |
YAGYSFEK | NA | P45983 | 425560.5 |
Here we see that P45983 yielded 5 peptides, 2 of which are shared. In the outlier table we see that ProteomeDiscoverer excluded all razor peptides from the protein quantitation. And if we dig further, we find that the peptide IIDFGLAR by razor definition belongs to the protein Q16539, while the peptide MSYLLYQMLCGIK can be assigned to P45983, thus giving us the larger value shown in the outlier table.
pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(protein %in% c('O15264','Q86YV6','P45984','P53778','Q16539','P45983')) %>%
group_by(protein) %>%
summarise(
n_peptides = length(unique(peptide))
) %>%
arrange(desc(n_peptides)) %>%
knitr::kable()
protein | n_peptides |
---|---|
Q16539 | 7 |
P45983 | 5 |
P45984 | 3 |
O15264 | 2 |
Q86YV6 | 2 |
P53778 | 1 |
Assign by excluding homologous peptides
That means the ProteomeDiscoverer method for protein accounting excluded the abundance measurement from all razor peptides. Let try to do replicate that as well.
pro_frompep_nonhom <- pep_data %>%
collapse(assign_by = 'non-homologous', top_n = Inf) %>%
normalize(.method = c('median', 'linear', 'loess'))
pro_frompep_nonhom %>%
summary("sample", destination = 'return') %>%
select(c('sample', 'proteins', 'peptides', 'CVs')) %>%
knitr::kable()
sample | proteins | peptides | CVs |
---|---|---|---|
ctl | 7056 | 63622 | 0.14 |
p97_kd | 7056 | 63622 | 0.12 |
Check to see how the two results agree …
tbl_merged_nonhom <- pro_frompep_nonhom$quantitative %>% select('sample','replicate','protein', 'abundance_raw') %>%
inner_join(pro_data$quantitative %>% select('sample','replicate','protein', 'abundance_raw'),
by = c('sample','replicate','protein'),
suffix = c("_tidyproteomics", "_proteomediscoverer"))
tbl_merged_nonhom %>%
mutate(itentical_protein_abundances = abundance_raw_proteomediscoverer == abundance_raw_tidyproteomics) %>%
filter(!is.na(itentical_protein_abundances)) %>%
group_by(itentical_protein_abundances) %>%
summarise(n = n(), .groups = 'drop') %>%
mutate(r = paste(n / sum(n) * 100, "%")) %>%
knitr::kable()
itentical_protein_abundances | n | r |
---|---|---|
FALSE | 3449 | 11.5247101279781 % |
TRUE | 26478 | 88.4752898720219 % |
A quick plot for visualizing the
tbl_merged_nonhom %>%
mutate(sample_rep = paste(sample, replicate)) %>%
ggplot(aes(abundance_raw_proteomediscoverer, abundance_raw_tidyproteomics)) +
geom_abline(color='red') +
geom_point(alpha = .67) +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~sample_rep)
Interesting, we now show that P45983 agrees with ProteomeDiscoverer, yet obviously several others do not
sample | replicate | protein | abundance_raw_tidyproteomics | abundance_raw_proteomediscoverer |
---|---|---|---|---|
ctl | 2 | P45983 | 703721.1 | 703721.1 |
p97_kd | 2 | P45983 | 1635934.1 | 1635934.1 |
ctl | 1 | P45983 | 535040.5 | 535040.5 |
ctl | 3 | P45983 | 494210.7 | 494210.7 |
p97_kd | 1 | P45983 | 1409719.2 | 1409719.2 |
p97_kd | 3 | P45983 | 2219424.9 | 2219424.9 |
Lets look at a few examples of outliers …
tbl_merged_nonhom %>%
filter(abundance_raw_proteomediscoverer != abundance_raw_tidyproteomics) %>%
head(3) %>%
knitr::kable()
sample | replicate | protein | abundance_raw_tidyproteomics | abundance_raw_proteomediscoverer |
---|---|---|---|---|
ctl | 2 | Q15366 | 67906.09 | 112335311 |
ctl | 2 | Q9Y3L5 | 81606.47 | 4567305 |
ctl | 2 | P53990 | 114462.68 | 16431834 |
We will take the protein Q96LR5 to examine …
pep_Q96LR5 <- pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(protein == 'Q96LR5')
pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(peptide %in% pep_Q96LR5$peptide) %>%
group_by(peptide, modifications) %>%
summarise(
proteins = paste(protein, collapse = "; "),
abundance_raw = unique(abundance_raw)
) %>%
knitr::kable()
#> `summarise()` has grouped output by 'peptide'. You can override using the
#> `.groups` argument.
peptide | modifications | proteins | abundance_raw |
---|---|---|---|
DNWSPALTISK | NA | Q96LR5; P51965; Q969T4 | 7005611.0 |
ELAEITLDPPPNCSAGPK | 1xCarbamidomethyl [C13] | Q96LR5; Q969T4 | 2169172.8 |
IYHCNINSQGVICLDILK | 2xCarbamidomethyl [C4; C13] | Q96LR5; P51965; Q969T4 | 2543214.8 |
VDDSPSTSGGSSDGDQR | NA | Q96LR5 | 216778.9 |
A quick check on the peptide sum and it appears that ProteomeDiscoverer ignored razor-peptides all together for this protein
pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(peptide %in% pep_Q96LR5$peptide) %>%
group_by(peptide, modifications) %>%
summarise(
proteins = paste(protein, collapse = "; "),
abundance_raw = unique(abundance_raw),
.groups = 'drop'
) %>%
summarise(abundance_sum = sum(abundance_raw)) %>%
knitr::kable()
abundance_sum |
---|
11934777 |
If we examine the other proteins in that group for all the peptides that may be shared …
get_peptides <- pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(protein %in% c('Q96LR5','P51965','Q969T4'))
pep_data$quantitative %>%
filter(!is.na(abundance_raw)) %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(peptide %in% get_peptides$peptide) %>%
group_by(peptide, modifications) %>%
summarise(
proteins = paste(protein, collapse = "; "),
abundance_raw = unique(abundance_raw)
) %>%
knitr::kable()
#> `summarise()` has grouped output by 'peptide'. You can override using the
#> `.groups` argument.
peptide | modifications | proteins | abundance_raw |
---|---|---|---|
DNWSPALTISK | NA | Q96LR5; P51965; Q969T4 | 7005611.0 |
DPAAPEPEEQEER | NA | Q969T4 | 371158.4 |
ELADITLDPPPNCSAGPK | 1xCarbamidomethyl [C13] | P51965 | 2536644.8 |
ELAEITLDPPPNCSAGPK | 1xCarbamidomethyl [C13] | Q96LR5; Q969T4 | 2169172.8 |
IYHCNINSQGVICLDILK | 2xCarbamidomethyl [C4; C13] | Q96LR5; P51965; Q969T4 | 2543214.8 |
VDDSPSTSGGSSDGDQR | NA | Q96LR5 | 216778.9 |
… we see that ProteomeDiscoverer actually assigned the abundances of razor-peptides. And if we check our results from the merged razor data we find that to be correct.
tbl_merged_razor %>%
filter(sample == 'p97_kd', replicate == 1) %>%
filter(protein %in% c('Q96LR5','P51965','Q969T4')) %>%
knitr::kable()
sample | replicate | protein | abundance_raw_tidyproteomics | abundance_raw_proteomediscoverer |
---|---|---|---|---|
p97_kd | 1 | Q969T4 | 371158.4 | 371158.4 |
p97_kd | 1 | P51965 | 2536644.8 | 2536644.8 |
p97_kd | 1 | Q96LR5 | 11934777.4 | 11934777.4 |
Collapsing to Gene Name
Alternatively, peptide level data can be collapsed to any Annotation
term, in this case we will collapse to gene_name after adding the
annotations to the peptide data object. See
vignette("annotating")
.
library(tidyr)
data_fasta <- "uniprot_human-20398_20220920.fasta" %>%
tidyproteomics:::fasta_parse(as = "data.frame") %>%
select(protein = accession, gene_name, description) %>%
pivot_longer(
cols = c('gene_name', 'description'),
names_to = 'term',
values_to = 'annotation'
)
gene_data <- pep_data %>%
annotate(data_fasta, duplicates = 'merge') %>%
collapse(collapse_to = 'gene_name')
gene_data %>% plot_counts()
#> Warning: Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
#> Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)
gene_data %>%
normalize(.method = 'linear') %>%
expression(p97_kd/ctl) %>%
plot_volcano(p97_kd/ctl, significance_column = 'p_value') +
labs(title = "Gene Level Expression")
#> ℹ Normalizing quantitative data
#> ℹ ... using linear regression
#> ✔ ... using linear regression [222ms]
#>
#> ℹ Selecting best normalization method
#> ✔ Selecting best normalization method ... done
#>
#> ℹ ... selected linear
#> ℹ .. expression::t_test testing p97_kd / ctl
#> ✔ .. expression::t_test testing p97_kd / ctl [3s]
#>
#> Warning: Values from `annotation` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(gene_name, term)) |>
#> dplyr::filter(n > 1L)