Subsetting data in tidyproteomics with subset() is straightforward and works much like dplyr::filter(). The name subset() is used to avoid conflicting with dplyr::filter() and to mark it as a tidyproteomics-specific function that operates on the underlying tidyproteomics data structure. Data can be filtered with simple semantic expressions, and the package introduces new operators that work as regular expression filters, similar to SQL syntax (for example %like%), which can be used in those expressions to subset data based on pattern matching in variable groups. For example, the expression !description %like% 'ribosome' keeps all proteins whose description does not include the word ‘ribosome’.

Together with the merge() and reassign() functions, data can be combined from multiple sources, assigned to specific sample groups and analyzed as a single collective. Alternatively, data can be separated, normalized and imputed independently, then recombined into a single collective for analysis and visualization.
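
The pattern-matching idea can be sketched in base R without the package. The `%like%` defined below is a hypothetical stand-in built on `grepl()`, not the tidyproteomics implementation, and its case-insensitive matching is an assumption; the toy `proteins` table is invented for illustration.

```r
# Hypothetical stand-in for a pattern-matching operator; the real
# tidyproteomics %like% may behave differently (case-insensitive
# matching is an assumption made for this sketch).
`%like%` <- function(x, pattern) grepl(pattern, x, ignore.case = TRUE)

# Toy data, invented for illustration only.
proteins <- data.frame(
  protein = c("P1", "P2", "P3"),
  description = c("Ribosome biogenesis protein",
                  "Mitochondrial ribosome recycling factor",
                  "Actin, cytoplasmic 1"),
  stringsAsFactors = FALSE
)

# Keep rows whose description matches the pattern.
with_ribosome <- proteins[proteins$description %like% "ribosome", ]

# In R, %any% operators bind tighter than `!`, so `!` negates the
# result of the whole match rather than the description column.
without_ribosome <- proteins[!proteins$description %like% "ribosome", ]

print(with_ribosome$protein)
print(without_ribosome$protein)
```

Because of that operator precedence, an expression such as `!description %like% 'ribosome'` reads as "not (description matches 'ribosome')" without extra parentheses.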

library("dplyr")
library("tidyproteomics")

hela_proteins %>% summary('sample')
#>     sample proteins peptides peptides_unique quantifiable  CVs
#>    control     7055    66329           58706        0.908 0.16
#>  knockdown     7055    66329           58706        0.909 0.21

Examples

Subsetting can be a powerful way to slice and dice data, print quick stats or produce a quick visualization. For example, the HeLa data set can be subset to just the proteins with “Ribosome” in the description:

hela_proteins %>% subset(description %like% "Ribosome") %>% summary('sample')
#>     sample proteins peptides peptides_unique quantifiable   CVs
#>    control       18      224             224        0.976 0.094
#>  knockdown       18      224             224        0.996 0.170

Aside from filtering directly on protein_accession values, subsetting can use any of the columns in the experiments table:

colnames(hela_proteins$experiments)
#> [1] "sample_id"   "import_file" "sample_file" "sample"      "replicate"

any of the terms in the accounting table:

colnames(hela_proteins$accounting)
#> [1] "sample_id"           "protein"             "num_peptides"       
#> [4] "num_psms"            "num_unique_peptides" "protein_group"      
#> [7] "imputed"

and any of the terms in the annotations table:

hela_proteins$annotations$term %>% unique()
#> [1] "description"        "biological_process" "cellular_component"
#> [4] "molecular_function" "gene_id_entrez"     "gene_name"         
#> [7] "wiki_pathway"       "reactome_pathway"   "gene_id_ensemble"
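
The `term` column listed above suggests that annotations are kept in a long format, one row per protein and term. As a rough base R sketch of how a term-based filter might resolve to a set of proteins, consider the toy table below. Only the `term` column name comes from the output above; the `protein` and `annotation` column names and all values are assumptions for illustration, not the package's internals.

```r
# Toy long-format annotation table. Only `term` is taken from the text;
# the other column names and all values are hypothetical.
annotations <- data.frame(
  protein = c("P1", "P1", "P2", "P2", "P3"),
  term = c("description", "cellular_component",
           "description", "cellular_component",
           "cellular_component"),
  annotation = c("Histone H4", "nucleus; chromosome",
                 "Actin, cytoplasmic 1", "cytosol; cytoskeleton",
                 "nucleus; cytosol"),
  stringsAsFactors = FALSE
)

# Restrict to one term, then pattern-match on the annotation text.
is_term <- annotations$term == "cellular_component"
is_match <- grepl("nucleus", annotations$annotation, fixed = TRUE)
nuclear_proteins <- unique(annotations$protein[is_term & is_match])

print(nuclear_proteins)
```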

Using Annotations

Because subsetting works on any annotation term, specialized terms can be imported with annotate() and then used for subsetting.

hela_proteins %>% subset(cellular_component %like% "nucleus") %>% summary('sample')
#>     sample proteins peptides peptides_unique quantifiable  CVs
#>    control     4227    44075           39105        0.921 0.16
#>  knockdown     4227    44075           39105        0.921 0.20

Using Accountings

Additionally, if the quantitative platform produces imputed values, commonly referred to as “match between runs”, the data can be filtered to exclude them. This can be valuable in cases where true presence/absence is desired or where larger portions of the proteome differ.

hela_proteins %>% subset(match_between_runs == FALSE) %>% summary('sample')
#>     sample proteins peptides peptides_unique quantifiable  CVs
#>    control     7055    66329           58706        0.908 0.16
#>  knockdown     7055    66329           58706        0.909 0.21

Also, data can be filtered to proteins containing a desired number of underlying peptides.

hela_proteins %>% subset(num_peptides <= 1) %>% summary('sample')
hela_proteins %>% subset(num_unique_peptides <= 1) %>% summary('sample')
#>     sample proteins peptides peptides_unique quantifiable  CVs
#>    control     1459     4202            1459        0.505 0.19
#>  knockdown     1459     4202            1459        0.489 0.21
hela_proteins %>% 
  subset(cellular_component %like% "cytosol") %>% 
  summary() 
#>  proteins peptides peptides_unique quantifiable  CVs
#>      2855    33053           29506        0.929 0.24

Split then Merge

Here is an example where data is split into two groups, independently manipulated, then merged back together. This is not advisable for an experiment like this one and is shown for demonstration purposes only.

data_kd <- hela_proteins %>% 
  subset(sample %like% "knockdown") %>% 
  normalize(.method = c('median')) %>%
  impute()

data_ct <- hela_proteins %>% 
  subset(sample %like% "control") %>% 
  normalize(.method = c('median')) %>%
  impute()

data_new <- merge(list(data_kd, data_ct), quantitative_source = 'all')
data_new
#> Origin          Merged 
#>                 proteins (10.84 MB) 
#> Composition     6 files 
#>                 2 samples (knockdown, control) 
#> Quantitation    7055 proteins 
#>                 4 log10 dynamic range 
#>                 27% missing values 
#>  *imputed       by 'row' samples via 'base::quote .Primitive("min")' group_by_sample 'FALSE'. by 'row' samples via 'base::quote .Primitive("min")' group_by_sample 'FALSE'. 
#> Accounting      (4) num_peptides num_psms num_unique_peptides imputed 
#> Annotations     (9) description biological_process cellular_component molecular_function
#>                 gene_id_entrez gene_name wiki_pathway reactome_pathway
#>                 gene_id_ensemble 
#> 
data_new %>% summary('sample')
#>     sample proteins peptides peptides_unique quantifiable  CVs
#>    control     7055    66329           58706        0.927 0.15
#>  knockdown     7055    66329           58706        0.909 0.21
data_new %>% operations()
#>  Data Transformations
#>   • ProteomeDiscoverer [1]: Data files (p97KD_HCT116_proteins.xlsx) were
#>   imported as proteins from ProteomeDiscoverer
#> 
#>   • ProteomeDiscoverer [1]: Data subset `sample` %like% `knockdown`
#> 
#>   • ProteomeDiscoverer [1]: Data normalized via median.
#> 
#>   • ProteomeDiscoverer [1]: Normalization automatically selected as median.
#> 
#>   • ProteomeDiscoverer [1]: Missing values imputed by 'row' samples via
#>   'base::quote .Primitive("min")' group_by_sample 'FALSE'.
#> 
#>   • ProteomeDiscoverer [1]: ... 689 values imputed
#> 
#>   • ProteomeDiscoverer [2]: Data files (p97KD_HCT116_proteins.xlsx) were
#>   imported as proteins from ProteomeDiscoverer
#> 
#>   • ProteomeDiscoverer [2]: Data subset `sample` %like% `control`
#> 
#>   • ProteomeDiscoverer [2]: Data normalized via median.
#> 
#>   • ProteomeDiscoverer [2]: Normalization automatically selected as raw.
#> 
#>   • ProteomeDiscoverer [2]: Missing values imputed by 'row' samples via
#>   'base::quote .Primitive("min")' group_by_sample 'FALSE'.
#> 
#>   • ProteomeDiscoverer [2]: ... 753 values imputed
#> 
#>   • Merged 2 data sets
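
The split, process, recombine pattern above can also be illustrated on a plain data frame. This is a minimal base R sketch with invented values; median centering on the log2 scale is used as a simple stand-in and is not necessarily what normalize() does internally.

```r
# Toy abundance table, invented for illustration.
abundances <- data.frame(
  sample = rep(c("control", "knockdown"), each = 3),
  protein = rep(c("P1", "P2", "P3"), times = 2),
  abundance = c(100, 200, 400, 50, 100, 800),
  stringsAsFactors = FALSE
)

# Center log2 abundances on the group median (a simple stand-in for
# median normalization, assumed for this sketch).
median_center <- function(df) {
  log_abundance <- log2(df$abundance)
  df$abundance_norm <- log_abundance - median(log_abundance)
  df
}

# Split by sample, process each piece independently, then recombine.
pieces <- split(abundances, abundances$sample)
pieces <- lapply(pieces, median_center)
recombined <- do.call(rbind, pieces)
rownames(recombined) <- NULL

print(recombined)
```

Each group is centered on its own median, so the two groups no longer share a common reference after recombining. This is one reason independent processing is rarely advisable for a single experiment.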