Expression • tidyproteomics

Analysis of Two Samples

One of the most fundamental analyses for quantitative proteomics is the estimation of univariate differences between two samples. Tidyproteomics accommodates two sample analysis using the expression() function, the results of which can be visualized using the plot_volcano() and plot_proportion() visualization functions as shown below. The expression analysis can be accomplished thought the R limma package, or using any two-sample statistical comparison, the default method employs the two-sided T-test. Note, if no values pass the cutoff it will automatically choose the top proteins to display.

It should be noted that this package does not incorporate some of the more recent methods described in literature. These methods could be incorporated in future revisions, or through a user provided function.

Available Methods

t.test

Johnston, L. W. “Student’s t-Test.” Journal of Quality Technology 2.4 (1970): 243-245.

R::stats::t.test

limma

(Empirical Bayes)

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47.

Bioconductor::limma

Alternative Methods

MSstats	Choi, Meena, Ching-Yun Chang, Timothy Clough, Daniel Broudy, Trevor Killeen, Brendan MacLean, and Olga Vitek. 2014. “MSstats: An R Package for Statistical Analysis of Quantitative Mass Spectrometry-Based Proteomic Experiments.” Bioinformatics 30 (17): 2524–26. msstats.org
DEqMS	Zhu, Yafeng, Lukas M. Orre, Yan Zhou Tran, Georgios Mermelekas, Henrik J. Johansson, Alina Malyutina, Simon Anders, and Janne Lehtiö. 2020. “DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis.” Molecular & Cellular Proteomics: MCP 19 (6): 1047–57. Bioconductor: DEqMS
MS -EmpiRe	Ammar, Constantin, Markus Gruber, Gergely Csaba, \| and Ralf Zimmer. 2019. “MS-EmpiRe Utilizes \| Peptide-Level Noise Distributions for \| Ultra-Sensitive Detection of Differentially \| Expressed Proteins.” Molecular & Cellular \| Proteomics: MCP 18 (9): 1880–92. \| \| GitHub::MS-EmpiRe
meqrob	Sticker, Adriaan, Ludger Goeminne, Lennart Martens, and Lieven Clement. 2020. “Robust Summarization and Inference in Proteome-Wide Label-Free Quantification.” Molecular & Cellular Proteomics: MCP 19 (7): 1209–19. Bioconductor: msqrob2

Expression Analysis

Input Parameters

parameter	description	inputs	default
log2fc_min	the minimum cutoff for displaying	0 to Inf	2
signifi cance_column	the column to use for significance	a dj_p_value, p_value*	a dj_p_value
sign ificance_max	the minimum cutoff for displaying	1 to 1/Inf	0.05
l abels_column	how to label the points	any value in annotations, eg gene_name, description	protein
show_pannels	show the colored areas highlighting significance	TRUE / FALSE	TRUE
show_lines	show the dotted cutoff lines	TRUE / FALSE	TRUE
s how_fc_scale	show the second scale on top converting l og2 _foldchange to foldchange	TRUE / FALSE	TRUE
point_size	size of each dot	1/10 to 5	NULL
co lor_positive	color for positive expression	[any R accepts] (htt%20ps:// r-charts%20. com/colors/)	dodgerblue
co lor_negative	color for negative expression	[any R accepts] (htt%20ps:// r-charts%20. com/colors/)	firebrick1

library("dplyr")
library("tidyproteomics")

rdata <- hela_proteins %>% 
  normalize(.method = 'loess') %>%
  expression(knockdown/control)

Exporting Results

The results of this analysis are stored in the tidyproteomics data-object and can easily be exported to save as a flat data table or used in an external down stream analysis.

rdata %>% 
  export_analysis(knockdown/control, .analysis = 'expression')
#> # A tibble: 5,187 × 27
#>    protein imputed     n imputed_control imputed_knockdown average_expression
#>    <chr>     <dbl> <int>           <dbl>             <dbl>              <dbl>
#>  1 Q15011    0         6           0                     0           4183224.
#>  2 Q5SVJ8    0.333     6           0.667                 0            828978.
#>  3 F6SA91    0.167     6           0.333                 0            994174.
#>  4 Q9NXV2    0         6           0                     0           2768778.
#>  5 Q13751    0         6           0                     0            990356.
#>  6 Q8TBM8    0.333     6           0.667                 0           3005450.
#>  7 O95833    0.167     6           0.333                 0           3047900.
#>  8 Q9BY50    0         6           0                     0           1776266.
#>  9 O00308    0         6           0                     0            661377.
#> 10 Q14534    0.333     6           0.667                 0           4043660.
#> # ℹ 5,177 more rows
#> # ℹ 21 more variables: proportional_expression <dbl>, foldchange <dbl>,
#> #   log2_foldchange <dbl>, p_value <dbl>, adj_p_value <dbl>,
#> #   normalization <chr>, abundance_control_1 <dbl>, abundance_control_2 <dbl>,
#> #   abundance_control_3 <dbl>, abundance_knockdown_1 <dbl>,
#> #   abundance_knockdown_2 <dbl>, abundance_knockdown_3 <dbl>,
#> #   description <chr>, biological_process <chr>, cellular_component <chr>, …

Volcano Plot

The default is to plot accoring to adjusted p-value with a Log2 fold-change cutoff of 2 or greater.

rdata %>% 
  plot_volcano(knockdown/control)

The parameters can be adjusted to suit your needs …

rdata %>% 
  plot_volcano(knockdown/control,
               significance_column = 'p_value',
               significance_max = 0.01,
               log2fc_min = 2)

… or desires.

library(ggplot2)
#> 
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:tidyproteomics':
#> 
#>     annotate

rdata %>% 
  plot_volcano(knockdown/control,
               significance_column = 'p_value',
               significance_max = 0.01,
               log2fc_min = 2,
               color_positive = 'orange',
               color_negative = 'purple',
               show_lines = FALSE,
               show_fc_scale = FALSE,
               show_pannels = FALSE,
               labels_column = 'gene_name') +
  labs(title = "A nice volcano plot", subtitle = "with great colors")

Proportional Plot

The proportional plot is complementary to the volcano plot and utilizes all the same underlying data. The default is to show both the top 1% (proportion_min = 0.01) along with the values passing the fold-change and significance cutoffs.

library(ggplot2)

rdata %>% 
  plot_proportion(knockdown/control)
#> Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps

This plot can likewise also be manipulated like the volcano plot.

library(ggplot2)

rdata %>% 
  plot_proportion(knockdown/control,
                  significance_column = 'p_value',
                  proportion_min = 0.1,
                  log2fc_min = 2,
                  color_positive = 'orange',
                  color_negative = 'purple',
                  show_lines = TRUE,
                  labels_column = 'gene_name')
#> ℹ proportional_expression appears to sum to 1 adjusting values to 100(%)