Subsetting
Subsetting data in tidyproteomics with subset() is straightforward and works much like the tidyverse function dplyr::filter(). The name subset() is used here to avoid conflicting with dplyr::filter() and to mark it as a tidyproteomics-specific function, a distinction that follows from the underlying tidyproteomics data structure. The function subset() filters data with simple semantic expressions, in the same way the filter function in the tidyverse package dplyr operates. This package also introduces two new operators that work as regular expression filters, similar to SQL syntax (such as %like%), which can be used in the semantic expression to subset data based on pattern matching in variable groups. For example, the expression !description %like% 'ribosome' would keep all proteins with a description that does not include the word ‘ribosome’.
Additionally, together with the merge() and reassign() functions, data can be combined from multiple sources, assigned to specific sample groups and analyzed as a single collective. Alternatively, data can be separated, normalized and imputed independently, then recombined into a single collective for analysis and visualization.
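The pattern-matching idea can be illustrated with a minimal base R sketch. The operator defined below is a hypothetical stand-in written for this example, not the tidyproteomics source; case-insensitive matching and the toy protein table are assumptions.

```r
# Hypothetical stand-in for a %like%-style operator (not the package's code):
# returns TRUE where the regular expression `pattern` matches `x`.
# Case-insensitive matching is an assumption made for this sketch.
`%like%` <- function(x, pattern) {
  grepl(pattern, x, ignore.case = TRUE)
}

# Toy protein table; accessions and descriptions are made up for illustration
proteins <- data.frame(
  protein = c("P1", "P2", "P3", "P4"),
  description = c(
    "60S ribosomal protein L7",
    "Ribosome biogenesis protein BRX1",
    "Heat shock protein HSP 90-alpha",
    "Mitochondrial ribosome recycling factor"
  ),
  stringsAsFactors = FALSE
)

# Keep rows whose description matches the pattern (base R subset on a data frame)
with_ribosome <- base::subset(proteins, description %like% "ribosome")

# `!` binds more loosely than %like%, so this negates the whole match
without_ribosome <- base::subset(proteins, !description %like% "ribosome")

print(with_ribosome$protein)
print(without_ribosome$protein)
```

Note that "ribosomal" does not contain the string "ribosome", so P1 ends up in the negated set; a broader pattern such as "ribosom" would be needed to catch both spellings.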
library("dplyr")
library("tidyproteomics")
hela_proteins %>% summary('sample')
#> sample proteins peptides peptides_unique quantifiable CVs
#> control 7055 66329 58706 0.908 0.16
#> knockdown 7055 66329 58706 0.909 0.21
Examples
Subsetting can be a powerful way to slice and dice data, print quick statistics or produce a quick visualization. For example, the HeLa data set can be subset to just the proteins with “Ribosome” in the description:
hela_proteins %>%
subset(description %like% "Ribosome") %>%
summary('sample')
#> sample proteins peptides peptides_unique quantifiable CVs
#> control 18 224 224 0.976 0.094
#> knockdown 18 224 224 0.996 0.170
Aside from filtering directly on protein_accession values, subsetting can use any of the columns in the experiments table:
colnames(hela_proteins$experiments)
#> [1] "sample_id" "import_file" "sample_file" "sample" "replicate"
any of the terms in the accounting table:
colnames(hela_proteins$accounting)
#> [1] "sample_id" "protein" "num_peptides"
#> [4] "num_psms" "num_unique_peptides" "protein_group"
#> [7] "imputed"
and any of the terms in the annotations table:
Using Annotations
Specialized terms can be imported with annotate() and then used for subsetting.
hela_proteins %>%
subset(cellular_component %like% "nucleus") %>% summary('sample')
#> sample proteins peptides peptides_unique quantifiable CVs
#> control 4227 44075 39105 0.921 0.16
#> knockdown 4227 44075 39105 0.921 0.20
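Conceptually, an annotation-based subset finds the proteins whose annotation matches a pattern and then keeps only those proteins in the quantitative data. The base R sketch below illustrates that idea on toy tables. The long-format layout with protein, term and annotation columns is an assumption made for this example, and the code is not the package's implementation.

```r
# Toy long-format annotation table: one row per protein/term/annotation.
# The column names (protein, term, annotation) are assumptions for this sketch.
annotations <- data.frame(
  protein = c("P1", "P1", "P2", "P2", "P3"),
  term = c("description", "cellular_component",
           "description", "cellular_component",
           "cellular_component"),
  annotation = c("Histone H4", "nucleus; chromosome",
                 "Actin, cytoplasmic 1", "cytoplasm; cytoskeleton",
                 "nucleus; cytoplasm"),
  stringsAsFactors = FALSE
)

# Toy quantitative table keyed by the same protein identifiers
quantitative <- data.frame(
  sample = rep(c("control", "knockdown"), each = 3),
  protein = rep(c("P1", "P2", "P3"), times = 2),
  abundance = c(10.1, 8.4, 9.0, 10.3, 8.1, 9.6),
  stringsAsFactors = FALSE
)

# Step 1: find proteins whose cellular_component annotation matches the pattern
is_match <- annotations$term == "cellular_component" &
  grepl("nucleus", annotations$annotation, ignore.case = TRUE)
keep <- unique(annotations$protein[is_match])

# Step 2: carry that protein list over to the quantitative values
nuclear <- quantitative[quantitative$protein %in% keep, ]

print(keep)
print(nrow(nuclear))
```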
Additionally, provided the quantitative platform produces an imputed value, commonly referred to as “match between runs”, the data can be filtered to exclude these values. This can be valuable in cases where true presence/absence is desired, for example when larger portions of the proteome differ.
Using Accountings
hela_proteins %>%
subset(match_between_runs == FALSE) %>% summary('sample')
#> sample proteins peptides peptides_unique quantifiable CVs
#> control 7055 66329 58706 0.908 0.16
#> knockdown 7055 66329 58706 0.909 0.21
Also, data can be filtered to proteins containing a desired number of underlying peptides.
hela_proteins %>%
subset(num_peptides <= 1) %>% summary('sample')
#> sample proteins peptides peptides_unique quantifiable CVs
#> control 7055 66329 58706 0.908 0.16
#> knockdown 7055 66329 58706 0.909 0.21
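The direction of the comparison matters when filtering on peptide counts. The base R sketch below uses a toy accounting table to contrast the two directions. The values are made up, and the row-by-row filtering shown here is an assumption for illustration rather than a statement of how tidyproteomics resolves the filter across samples.

```r
# Toy accounting table reusing column names shown in the accounting listing
accounting <- data.frame(
  sample_id = rep(c("s1", "s2"), each = 3),
  protein = rep(c("P1", "P2", "P3"), times = 2),
  num_peptides = c(5, 1, 2, 4, 1, 1),
  stringsAsFactors = FALSE
)

# `<= 1` keeps observations supported by at most one peptide
single_hit <- base::subset(accounting, num_peptides <= 1)

# `>= 2` keeps observations supported by two or more peptides
multi_hit <- base::subset(accounting, num_peptides >= 2)

print(single_hit)
print(multi_hit)
```

In this toy table P3 has two peptides in one sample and one in the other, so it appears in both results when the filter is applied row by row.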
Split then Merge
Here is an example where data is split into two groups, independently manipulated, then merged back together. This is not advisable for an experiment like this one; it is shown for demonstration purposes only.
data_kd <- hela_proteins %>%
subset(sample %like% "knockdown") %>%
normalize(.method = c('median')) %>%
impute()
data_ct <- hela_proteins %>%
subset(sample %like% "control") %>%
normalize(.method = c('median')) %>%
impute()
data_new <- merge(list(data_kd, data_ct), quantitative_source = 'all')
data_new
#> Origin Merged
#> proteins (10.84 MB)
#> Composition 6 files
#> 2 samples (knockdown, control)
#> Quantitation 7055 proteins
#> 4 log10 dynamic range
#> 27% missing values
#> *imputed by 'row' samples via 'base::quote .Primitive("min")' group_by_sample 'FALSE'. by 'row' samples via 'base::quote .Primitive("min")' group_by_sample 'FALSE'.
#> Accounting (4) num_peptides num_psms num_unique_peptides imputed
#> Annotations (9) description biological_process cellular_component molecular_function
#> gene_id_entrez gene_name wiki_pathway reactome_pathway
#> gene_id_ensemble
#>
data_new %>% summary('sample')
#> sample proteins peptides peptides_unique quantifiable CVs
#> control 7055 66329 58706 0.927 0.15
#> knockdown 7055 66329 58706 0.909 0.21
data_new %>% operations()
#> ℹ Data Transformations
#> • ProteomeDiscoverer [1]: Data files (p97KD_HCT116_proteins.xlsx) were
#> imported as proteins from ProteomeDiscoverer
#>
#> • ProteomeDiscoverer [1]: Data subset `sample` %like% `knockdown`
#>
#> • ProteomeDiscoverer [1]: Data normalized via median.
#>
#> • ProteomeDiscoverer [1]: Normalization automatically selected as median.
#>
#> • ProteomeDiscoverer [1]: Missing values imputed by 'row' samples via
#> 'base::quote .Primitive("min")' group_by_sample 'FALSE'.
#>
#> • ProteomeDiscoverer [1]: ... 689 values imputed
#>
#> • ProteomeDiscoverer [2]: Data files (p97KD_HCT116_proteins.xlsx) were
#> imported as proteins from ProteomeDiscoverer
#>
#> • ProteomeDiscoverer [2]: Data subset `sample` %like% `control`
#>
#> • ProteomeDiscoverer [2]: Data normalized via median.
#>
#> • ProteomeDiscoverer [2]: Normalization automatically selected as raw.
#>
#> • ProteomeDiscoverer [2]: Missing values imputed by 'row' samples via
#> 'base::quote .Primitive("min")' group_by_sample 'FALSE'.
#>
#> • ProteomeDiscoverer [2]: ... 753 values imputed
#>
#> • Merged 2 data sets
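The split, process and recombine pattern shown above can also be sketched in base R on a toy table. The helper process_group() is hypothetical, and its median scaling and minimum-value imputation are simplified stand-ins chosen for this example; they are not the algorithms used by normalize() and impute().

```r
# Toy quantitative table with one missing value per sample group
quantitative <- data.frame(
  sample = rep(c("control", "knockdown"), each = 4),
  protein = rep(c("P1", "P2", "P3", "P4"), times = 2),
  abundance = c(100, 200, NA, 400, 150, NA, 300, 600),
  stringsAsFactors = FALSE
)

# Hypothetical helper: scale a group so its median equals `target_median`,
# then replace missing values with the group's smallest observed value
process_group <- function(df, target_median) {
  df$abundance <- df$abundance * target_median / median(df$abundance, na.rm = TRUE)
  df$imputed <- is.na(df$abundance)
  df$abundance[df$imputed] <- min(df$abundance, na.rm = TRUE)
  df
}

# Shared target so both groups end up on a common scale
target <- median(quantitative$abundance, na.rm = TRUE)

# Split by sample, process each group independently, then recombine
groups <- split(quantitative, quantitative$sample)
processed <- lapply(groups, process_group, target_median = target)
merged <- do.call(rbind, processed)
rownames(merged) <- NULL

print(merged)
```

Keeping an imputed flag alongside the values mirrors the imputed column in the accounting table, so filled-in values remain identifiable after the groups are recombined.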