PhosR is a package for the all-rounded analysis of phosphoproteomic data from processing to downstream analysis. This vignette will provide a step-by-step workflow of how PhosR can be used to process and analyse a panel of phosphoproteomic datasets. As one of the first steps of data processing, we will begin by performing filtering and imputation of the data with
First, we will load the PhosR package. If you already haven’t done so, please install PhosR as instructed in the main page.
We assume that you will have the raw data processed using platforms frequently used for mass-spectrometry based proteomics such as MaxQuant. For demonstration purposes, we will take a parts of phosphoproteomic data generated by Humphrey et al. with accession number PXD001792. The dataset contains the phosphoproteomic quantifications of two mouse liver cell lines (Hepa1.6 and FL38B) that were treated with either PBS (mock) or insulin.
Let us load the PhosphoExperiment (ppe) object.
A quick glance of the object.
ppe #> class: PhosphoExperiment #> dim: 5000 24 #> metadata(0): #> assays(1): Quantification #> rownames(5000): Q7TPV4;MYBBP1A;S1321;PQSALPKKRARLSLVSRSPSLLQSGVKKRRV #> Q3UR85;MYRF;S304;PARAPSPPWPPQGPLSPGTGSLPLSIARAQT ... #> P28659-4;NA;S18;AFKLDFLPEMMVDHCSLNSSPVSKKMNGTLD #> E9Q8I9;FRY;S1380;HNIELVDSRLLLPGSSPSSPEDEVKDREGEV #> rowData names(0): #> colnames(24): Intensity.FL83B_Control_1 Intensity.FL83B_Control_2 ... #> Intensity.Hepa1.6_Ins_5 Intensity.Hepa1.6_Ins_6 #> colData names(0):
We will take the grouping information from
colnames of our matrix.
For each cell line, there are two conditions (Control vs Insulin-stimulated) and 6 replicates for each condition.
# FL38B gsub("Intensity.", "", grps)[1:12] #>  "FL83B_Control" "FL83B_Control" "FL83B_Control" "FL83B_Control" #>  "FL83B_Control" "FL83B_Control" "FL83B_Ins" "FL83B_Ins" #>  "FL83B_Ins" "FL83B_Ins" "FL83B_Ins" "FL83B_Ins" # Hepa1 gsub("Intensity.", "", grps)[13:24] #>  "Hepa1.6_Control" "Hepa1.6_Control" "Hepa1.6_Control" "Hepa1.6_Control" #>  "Hepa1.6_Control" "Hepa1.6_Control" "Hepa1.6_Ins" "Hepa1.6_Ins" #>  "Hepa1.6_Ins" "Hepa1.6_Ins" "Hepa1.6_Ins" "Hepa1.6_Ins"
Note that there are in total 24 samples and 5,000 phosphosites profiled.
dim(ppe) #>  5000 24
Next, we will perform some filtering of phosphosites so that only phosphosites with quantification for at least 50% of the replicates in at least one of the conditions are retained. For this filtering step, we use the
selectGrps function. The filtering leaves us with 1,772 phosphosites.
selectGrps gives you the option to relax the threshold for filtering. The filtering threshold can therefore be optimized for each dataset.
We can proceed to imputation now that we have filtered for suboptimal phosphosites. To take advantage of data structure and experimental design, PhosR provides users with a lot of flexibility for imputation. There are three functions for imputation:
ptImpute. Here, we will demonstrate the use of
scImpute function is used for site- and condition-specific imputation. A pre-defined thereshold is used to select phosphosites to impute. Phosphosites with missing values equal to or greater than a predefined value will be imputed by sampling from the empirical normal distribution constructed from the quantification values of phosphosites from the same condition.
In the above example, only phosphosites that are quantified in more than 50% of samples from the same condition will be imputed.
We then perform paired tail-based imputation on the dataset imputed with
scImpute. Paired tail-based imputation performs imputation of phosphosites that have missing values in all replicates in one condition (e.g. in basal) but not in another condition (e.g., in stimulation). This method of imputation ensures that we do not accidentally filter phosphosites that seemingly have low detection rate.
scImpute, we can set a predefined threshold to in another condition (e.g. stimulation), the tail-based imputation is applied to impute for the missing values in the first condition.
set.seed(123) ppe_imputed <- ppe_imputed_tmp ppe_imputed[,seq(6)] <- ptImpute(ppe_imputed[,seq(7,12)], ppe_imputed[,seq(6)], percent1 = 0.6, percent2 = 0, paired = FALSE) ppe_imputed[,seq(13,18)] <- ptImpute(ppe_imputed[,seq(19,24)], ppe_imputed[,seq(13,18)], percent1 = 0.6, percent2 = 0, paired = FALSE)
Lastly, we perform normalisation of the filtered and imputed phosphoproteomic data.
ppe_imputed_scaled <- medianScaling(ppe_imputed, scale = FALSE, assay = "imputed")
A useful function in
PhosR is to visualize the percentage of quantified sites before and after filtering and imputation. The main inputs of
plotQC are the quantification matrix, sample labels (equating the column names of the matrix), an integer indicating the panel to plot, and lastly, a color vector. To visualize the percentage of quantified sites, use the
plotQC function and set panel = quantify to visualise bar plots of samples.
By setting panel = dendrogram, we can visualise the results of unsupervised hierarchical clustering of samples as a dendrogram. The dendrogram demonstrates that imputation has improved the clustering of the samples so that replicates from the same conditions cluster together.
We can now move onto the next step in the
PhosR workflow: integration of datasets and batch correction.
sessionInfo() #> R version 4.1.2 (2021-11-01) #> Platform: x86_64-apple-darwin17.0 (64-bit) #> Running under: macOS Catalina 10.15.7 #> #> Matrix products: default #> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib #> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib #> #> locale: #>  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 #> #> attached base packages: #>  stats graphics grDevices utils datasets methods base #> #> other attached packages: #>  PhosR_1.5.1 BiocStyle_2.20.2 #> #> loaded via a namespace (and not attached): #>  ggtext_0.1.1 bitops_1.0-7 #>  matrixStats_0.61.0 fs_1.5.0 #>  RColorBrewer_1.1-2 rprojroot_2.0.2 #>  GenomeInfoDb_1.28.4 tools_4.1.2 #>  backports_1.3.0 utf8_1.2.2 #>  R6_2.5.1 BiocGenerics_0.38.0 #>  colorspace_2.0-2 tidyselect_1.1.1 #>  gridExtra_2.3 GGally_2.1.2 #>  preprocessCore_1.54.0 compiler_4.1.2 #>  textshaping_0.3.6 Biobase_2.52.0 #>  xml2_1.3.2 network_1.17.1 #>  desc_1.4.0 DelayedArray_0.18.0 #>  ggdendro_0.1.22 labeling_0.4.2 #>  bookdown_0.24 scales_1.1.1 #>  proxy_0.4-26 pkgdown_1.6.1 #>  systemfonts_1.0.2 stringr_1.4.0 #>  digest_0.6.28 rmarkdown_2.11 #>  XVector_0.32.0 pkgconfig_2.0.3 #>  htmltools_0.5.2 MatrixGenerics_1.4.3 #>  highr_0.9 ruv_0.9.7.1 #>  fastmap_1.1.0 limma_3.48.3 #>  rlang_0.4.12 GlobalOptions_0.1.2 #>  farver_2.1.0 shape_1.4.6 #>  jquerylib_0.1.4 generics_0.1.1 #>  statnet.common_4.5.0 dendextend_1.15.2 #>  dplyr_1.0.7 car_3.0-12 #>  RCurl_1.98-1.5 magrittr_2.0.1 #>  GenomeInfoDbData_1.2.6 Matrix_1.3-4 #>  Rcpp_1.0.7 munsell_0.5.0 #>  S4Vectors_0.30.2 fansi_0.5.0 #>  abind_1.4-5 viridis_0.6.2 #>  lifecycle_1.0.1 stringi_1.7.5 #>  yaml_2.2.1 carData_3.0-4 #>  MASS_7.3-54 SummarizedExperiment_1.22.0 #>  zlibbioc_1.38.0 plyr_1.8.6 #>  grid_4.1.2 parallel_4.1.2 #>  crayon_1.4.2 lattice_0.20-45 #>  cowplot_1.1.1 gridtext_0.1.4 #>  circlize_0.4.13 knitr_1.36 #>  pillar_1.6.4 igraph_1.2.8 #>  ggpubr_0.4.0 GenomicRanges_1.44.0 #>  markdown_1.1 ggsignif_0.6.3 #>  reshape2_1.4.4 stats4_4.1.2 #>  glue_1.5.0 evaluate_0.14 #>  pcaMethods_1.84.0 BiocManager_1.30.16 #>  vctrs_0.3.8 gtable_0.3.0 #>  purrr_0.3.4 tidyr_1.1.4 #>  reshape_0.8.8 cachem_1.0.6 #>  ggplot2_3.3.5 xfun_0.28 #>  broom_0.7.10 e1071_1.7-9 #>  coda_0.19-4 rstatix_0.7.0 #>  ragg_1.1.3 class_7.3-19 #>  viridisLite_0.4.0 tibble_3.1.6 #>  pheatmap_1.0.12 memoise_2.0.0 #>  IRanges_2.26.0 ellipsis_0.3.2