Visualizing
Pre Normalization
Critical to any processing pipeline is the ability to summarize and
visualize data, both before and after processing. Tidyproteomics covers
this with a summary() function and several plot_() functions. The
summary function (described in more detail in vignette("summarizing"))
uses the same syntax as subset() to generate summary statistics on any
variable set, including all annotated and accounting terms.
Post Normalization
Visualizing data after processing is an important part of data analysis,
and tidyproteomics provides a variety of plot functions for exploring
the data after normalization. Each is intended to lend insight into a
specific aspect of the data: the quantitative dynamic ranges before and
after normalization (plot_normalization()), the sample-specific CVs and
dynamic range (plot_variation_cv()), and the principal component
variation (plot_variation_pca()) for each normalization.
library("dplyr")
library("tidyproteomics")
rdata <- hela_proteins %>%
normalize(.method = c("scaled", "median", "linear", "loess", "randomforest"))
rdata %>% plot_normalization()
#> Warning: Removed 73038 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
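To make concrete what a normalization does to the per-sample distributions, here is a minimal base R sketch of a median normalization on simulated data. The matrix, the offsets, and the variable names are all invented for illustration, and this is not necessarily how the "median" method in normalize() is implemented.

```r
# Simulated log2 abundances: 200 proteins (rows) x 4 samples (columns).
# The per-sample offsets mimic loading differences; all values are made up.
set.seed(1)
offsets <- c(0, 0.5, -0.3, 0.8)
log2_abund <- sapply(offsets, function(o) rnorm(200, mean = 20 + o, sd = 2))
colnames(log2_abund) <- paste0("sample_", 1:4)

# Median normalization: shift each sample so that its median equals
# the mean of all sample medians.
sample_medians <- apply(log2_abund, 2, median)
shift <- mean(sample_medians) - sample_medians
normalized <- sweep(log2_abund, 2, shift, FUN = "+")

# Compare the per-sample distributions before and after the shift
boxplot(log2_abund, main = "Before normalization")
boxplot(normalized, main = "After median normalization")
```

After the shift, every sample has the same median, which is the kind of alignment the box plots from plot_normalization() let you check visually for each method.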
Variation
Coefficient of Variation and Dynamic Range
The statistical assessment often referred to as CVs (Coefficients of Variation) or RSD (Relative Standard Deviation) measures the dispersion of a measurement relative to its mean. The term is plural in proteomics because we often measure hundreds or thousands of proteins simultaneously, each with its own CV. Understanding that variability, and how normalization affects it, will help improve the accuracy of your experiments.
rdata %>% plot_variation_cv()
#> TableGrob (2 x 2) "arrange": 3 grobs
#> z cells name grob
#> 1 1 (2-2,1-1) arrange gtable[layout]
#> 2 2 (2-2,2-2) arrange gtable[layout]
#> 3 3 (1-1,1-2) arrange text[GRID.text.741]
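As a reminder of what is being plotted, the CV of a protein is its standard deviation across replicates divided by its mean. The following base R sketch computes it on a small invented matrix; the values and names are illustrative only and are unrelated to hela_proteins.

```r
# Made-up raw abundances: 5 proteins (rows) x 3 replicates (columns)
abund <- matrix(
  c( 100,  110,   90,
     200,  220,  180,
      50,   80,   20,
    1000, 1000, 1000,
      10,   12,    8),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("protein_", 1:5), paste0("rep_", 1:3))
)

# CV = standard deviation / mean, computed per protein across replicates
cv <- apply(abund, 1, sd) / rowMeans(abund)
round(cv, 3)
```

Here protein_1 and protein_2 both have a CV of 0.1 even though their abundances differ twofold, which is why CVs are a convenient scale-free way to compare variability across proteins.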
Principal Component Analysis
This is a plot of the cumulative variation explained by each of the principal components. Ideally, normalization should increase the variation captured by the first few principal components by removing measurement and instrument variability and exposing the underlying biological variability. This plot should help visualize that.
rdata %>% plot_variation_pca()
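The quantity on the y-axis can be reproduced with base R's prcomp(). The sketch below uses simulated data with an invented two-group design; it illustrates how cumulative explained variation is computed and is not the code behind plot_variation_pca().

```r
set.seed(42)
# Simulated log2 abundances: 6 samples (rows) x 50 proteins (columns),
# with a made-up shift in the first 10 proteins of the treated group
group <- rep(c("control", "treated"), each = 3)
x <- matrix(rnorm(6 * 50, mean = 20, sd = 1), nrow = 6)
x[group == "treated", 1:10] <- x[group == "treated", 1:10] + 3

# PCA on the centered data
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Fraction of variance per component, then the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

plot(cum_var, type = "b", ylim = c(0, 1),
     xlab = "Principal component",
     ylab = "Cumulative variation explained")
```

A curve that rises more steeply over the first few components indicates that more of the total variation is concentrated in them.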
Dynamic Range
Perhaps more intriguing is the plot from plot_dynamic_range(), which
shows a density heat map of sample-specific CVs in relation to
quantitative abundance. This plot highlights how CVs increase at the
lower quantitative range and, more importantly, how each normalization
method addresses these large variances. Note how random forest
normalization is best able to minimize the CVs at the lower quantitative
range.
rdata %>% plot_dynamic_range()
#> Warning in ggplot2::geom_point(ggplot2::aes(x = range_x, y = range_y), color = "lightblue"): All aesthetics have length 1, but the data has 60912 rows.
#> ℹ Please consider using `annotate()` or provide this layer with data containing
#> a single row.
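The relationship between CV and abundance can be illustrated with a small simulation in base R. The noise model below (5% multiplicative noise plus a constant additive noise) is an assumption chosen only to reproduce the typical shape, and none of the names come from tidyproteomics.

```r
set.seed(7)
n_proteins <- 1000

# Made-up "true" abundances spanning four orders of magnitude
true_abund <- 10^runif(n_proteins, min = 4, max = 8)

# Four replicates: 5% multiplicative noise plus additive noise (sd = 2000).
# The additive term dominates at low abundance (assumed noise model).
reps <- sapply(1:4, function(i) {
  true_abund * rnorm(n_proteins, mean = 1, sd = 0.05) +
    rnorm(n_proteins, mean = 0, sd = 2000)
})

# Per-protein mean abundance and CV across replicates
mean_abund <- rowMeans(reps)
cv <- apply(reps, 1, sd) / mean_abund

# Median CV within five equal-width bins of log10 abundance
bins <- cut(log10(mean_abund), breaks = 5)
median_cv <- tapply(cv, bins, median)
median_cv

plot(log10(mean_abund), cv,
     xlab = "log10 mean abundance", ylab = "CV")
```

Under this noise model the median CV is largest in the lowest abundance bin, the same pattern the density heat map is meant to expose in real data.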
Clustering
Once normalization and imputation methods have been applied and
selected, it is often desirable to visualize the unbiased clustering of
samples. This can be accomplished with the plot_heatmap() and plot_pca()
functions.
Heatmap
rdata %>% plot_heatmap()
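For intuition about what unbiased clustering of samples means, the sketch below runs a plain hierarchical clustering in base R on simulated data. The design, the distance measure (Euclidean), and the linkage (average) are assumptions made for the example; plot_heatmap() may use different choices.

```r
set.seed(3)
# Simulated log2 abundances: 100 proteins (rows) x 6 samples (columns),
# with a made-up shift in the first 30 proteins of the treated samples
group <- rep(c("control", "treated"), each = 3)
x <- matrix(rnorm(100 * 6, mean = 20, sd = 0.5), nrow = 100,
            dimnames = list(paste0("protein_", 1:100),
                            paste0(group, "_", 1:3)))
x[1:30, group == "treated"] <- x[1:30, group == "treated"] + 3

# Cluster samples without using the group labels:
# Euclidean distance between sample columns, average linkage
sample_dist <- dist(t(x))
hc <- hclust(sample_dist, method = "average")

# Cut the tree into two clusters and compare with the known design
clusters <- cutree(hc, k = 2)
table(group, clusters)

# Dendrogram of the samples, and a base R heatmap scaled by protein
plot(hc, main = "Sample clustering")
heatmap(x, scale = "row")
```

The group labels are used only afterwards, in table(), to check whether the clusters found from the data alone agree with the experimental design.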