Batch correction of phosphoproteomics dataset with PhosR

Introduction

A common but largely unaddressed challenge in phosphoproteomic data analysis is to correct for batch effect. Without correcting for batch effect, it is often not possible to analyze datasets in an integrative manner. To perform data integration and batch effect correction, we identified a set of stably phosphorylated sites (SPSs) across a panel of phosphoproteomic datasets and, using these SPSs, implemented a wrapper function of RUV-III from the ruv package called RUVphospho.

Note that when the input data contains missing values, imputation should be performed before batch correction since RUV-III requires a complete data matrix. The imputed values are removed by default after normalisation but can be retained for downstream analysis if the users wish to use the imputed matrix. This vignette will provide an example of how PhosR can be used for batch correction.

Loading packages and data

If you haven’t already done so, load the PhosR package.

suppressPackageStartupMessages({
  library(PhosR)
  library(stringr)
})

In this example, we will use L6 myotube phosphoproteome dataset (with accession number PXD019127) and the SPSs we identified from a panel of phosphoproteomic datasets (please refer to our preprint for the full list of the datasets used). The SPSs will be used as our negative control in RUV normalisation.

data("phospho_L6_ratio_pe")
data("SPSs")

ppe <- phospho.L6.ratio.pe
ppe
#> class: PhosphoExperiment 
#> dim: 6654 12 
#> metadata(0):
#> assays(1): Quantification
#> rownames(6654): D3ZNS8;AAAS;S495;THIPLYFVNAQFPRFSPVLGRAQEPPAGGGG
#>   Q9R0Z7;AAGAB;S210;RSVGSAESCQCEQEPSPTAERTESLPGHRSG ...
#>   D3ZG78;ZZEF1;S1516;SGPSAAEVSTAEEPSSPSTPTRRPPFTRGRL
#>   D3ZG78;ZZEF1;S1535;PTRRPPFTRGRLRLLSFRSMEETRPVPTVKE
#> rowData names(0):
#> colnames(12): AICAR_exp1 AICAR_exp2 ... AICARIns_exp3 AICARIns_exp4
#> colData names(0):

Setting up the data

The L6 myotube data contains phosphoproteomic samples from three treatment conditions each with quadruplicates. Myotube cells were treated with either AICAR or Insulin (Ins), which are both important modulators of the insulin signalling pathway, or both (AICARIns) before phosphoproteomic analysis.

colnames(ppe)[grepl("AICAR_", colnames(ppe))]
#> [1] "AICAR_exp1" "AICAR_exp2" "AICAR_exp3" "AICAR_exp4"
colnames(ppe)[grepl("^Ins_", colnames(ppe))]
#> [1] "Ins_exp1" "Ins_exp2" "Ins_exp3" "Ins_exp4"
colnames(ppe)[grepl("AICARIns_", colnames(ppe))]
#> [1] "AICARIns_exp1" "AICARIns_exp2" "AICARIns_exp3" "AICARIns_exp4"

Note that we have in total 6654 quantified phosphosites and 12 samples in total.

dim(ppe)
#> [1] 6654   12

We have already performed the relevant processing steps to generate a dense matrix. Please refer to the imputation page to perform filtering and imputation of phosphosites in order to generate a matrix without any missing values.

sum(is.na(ppe@assays@data$Quantification))
#> [1] 0

We will extract phosphosite labels.

sites = paste(sapply(ppe@GeneSymbol, function(x)x),";",
                 sapply(ppe@Residue, function(x)x),
                 sapply(ppe@Site, function(x)x),
                 ";", sep = "")

Lastly, we will take the grouping information from colnames of our matrix.

# take the grouping information
grps = gsub("_.+", "", colnames(ppe))
grps
#>  [1] "AICAR"    "AICAR"    "AICAR"    "AICAR"    "Ins"      "Ins"     
#>  [7] "Ins"      "Ins"      "AICARIns" "AICARIns" "AICARIns" "AICARIns"

Diagnosing batch effect

There are a number of ways to diagnose batch effect. In PhosR, we make use of two visualisation methods to detect batch effect: dendrogram of hierarchical clustering and a principal component analysis (PCA) plot. We use the plotQC function we introduced in the imputation section of the vignette.

By setting panel = dendrogram, we can plot the dendrogram illustrating the results of unsupervised hierarchical clustering of our 12 samples. Clustering results of the samples demonstrate that there is a strong batch effect (where batch denoted as expX, where X refers to the batch number). This is particularly evident for samples from Ins and AICARIns treated conditions.

plotQC(ppe@assays@data$Quantification, panel = "dendrogram", grps=grps, labels = colnames(ppe)) +
  ggplot2::ggtitle("Before batch correction")

We can also visualise the samples in PCA space by setting panel = pca. The PCA plot demonstrates aggregation of samples by batch rather than treatment groups (each point represents a sample coloured by treatment condition). It has become clearer that even within the AICAR treated samples, there is some degree of batch effect as data points are separated between samples from batches 1 and 2 and those from batches 3 and 4.

plotQC(ppe@assays@data$Quantification, grps=grps, labels = colnames(ppe), panel = "pca") +
  ggplot2::ggtitle("Before batch correction")

Correcting batch effect

We have now diagnosed that our dataset exhibits batch effect that is driven by experiment runs for samples treated with three different conditions. To address this batch effect, we correct for this unwanted variation in the data by utilising our pre-defined SPSs as a negative control for RUVphospho.

First, we construct a design matrix by condition.

design = model.matrix(~ grps - 1)
design
#>    grpsAICAR grpsAICARIns grpsIns
#> 1          1            0       0
#> 2          1            0       0
#> 3          1            0       0
#> 4          1            0       0
#> 5          0            0       1
#> 6          0            0       1
#> 7          0            0       1
#> 8          0            0       1
#> 9          0            1       0
#> 10         0            1       0
#> 11         0            1       0
#> 12         0            1       0
#> attr(,"assign")
#> [1] 1 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$grps
#> [1] "contr.treatment"

We will then use the RUVphospho function to normalise the data. Besides the quantification matrix and the design matrix, there are two other important inputs to RUVphospho: 1) the ctl argument is an integer vector denoting the position of SPSs within the quantification matrix 2) k parameter is an integer denoting the expected number of experimental (e.g., treatment) groups within the data

# phosphoproteomics data normalisation and batch correction using RUV
ctl = which(sites %in% SPSs)
ppe = RUVphospho(ppe, M = design, k = 3, ctl = ctl)

Quality control

As quality control, we will demonstrate and evaluate our normalisation method with hierarchical clustering and PCA plot using again plotQC. Both the hierarchical clustering and PCA results demonstrate the normalisation procedure in PhosR facilitates effective batch correction.

# plot after batch correction
p1 = plotQC(ppe@assays@data$Quantification, grps=grps, labels = colnames(ppe), panel = "dendrogram" )
p2 = plotQC(ppe@assays@data$normalised, grps=grps, labels = colnames(ppe), panel="dendrogram")
ggpubr::ggarrange(p1, p2, nrow = 1)

p1 = plotQC(ppe@assays@data$Quantification, panel = "pca", grps=grps, labels = colnames(ppe)) +
  ggplot2::ggtitle("Before Batch correction")
p2 = plotQC(ppe@assays@data$normalised, grps=grps, labels = colnames(ppe), panel="pca") +
  ggplot2::ggtitle("After Batch correction")
ggpubr::ggarrange(p1, p2, nrow = 2)

Generating SPSs

We note that the current SPS are derived from phosphoproteomic datasets derived from mouse cells. To enable users working on phosphoproteomic datasets derived from other species, we have developed a function getSPS that takes in multiple phosphoproteomics datasets stored as phosphoExperiment objects to generate set of SPS. Users can use their in-house or public datasets derived from the same speciies to generate species-specific SPS, which they can use to normalise their data.Below demonstrate the usage of the getSPS function.

First, load datasets to use for SPS generation.

data("phospho_L6_ratio_pe")
data("phospho.liver.Ins.TC.ratio.RUV.pe")
data("phospho.cells.Ins.pe")

ppe1 <- phospho.L6.ratio.pe
ppe2 <- phospho.liver.Ins.TC.ratio.RUV.pe
ppe3 <- phospho.cells.Ins.pe

Filter, impute, and transform ppe3 (other inputs have already been processed).

grp3 = gsub('_[0-9]{1}', '', colnames(ppe3))
ppe3 <- selectGrps(ppe3, grps = grp3, 0.5, n=1)
ppe3 <- tImpute(ppe3)
FL83B.ratio <- ppe3@assays@data$imputed[, 1:12] - rowMeans(ppe3@assays@data$imputed[,grep("FL83B_Control", colnames(ppe3))])
Hepa.ratio <- ppe3@assays@data$imputed[, 13:24] - rowMeans(ppe3@assays@data$imputed[,grep("Hepa1.6_Control", colnames(ppe3))])
ppe3@assays@data$Quantification <- cbind(FL83B.ratio, Hepa.ratio)

Generate inputs of getSPS.

ppe.list <- list(ppe1, ppe2, ppe3)
cond.list <- list(grp1 = gsub("_.+", "", colnames(ppe1)),
                  grp2 = str_sub(colnames(ppe2), end=-5),
                  grp3 = str_sub(colnames(ppe3), end=-3))

Finally run getSPS to generate list of SPSs.

inhouse_SPSs <- getSPS(ppe.list, conds = cond.list)

head(inhouse_SPSs)
#> [1] "TMPO;S67;"    "RBM5;S621;"   "SRRM1;S427;"  "THOC2;S1417;" "SSB;S92;"    
#> [6] "PPIG;S252;"

Run RUVphospho using the newly generateed SPSs. (NOTE that we do not expect the results to be reproducible as the exempler SPSs have been generated from dummy datasets that have been heavily filtered.)

sites = paste(sapply(ppe@GeneSymbol, function(x)x),";",
              sapply(ppe@Residue, function(x)x),
              sapply(ppe@Site, function(x)x), ";", sep = "")

ctl = which(sites %in% inhouse_SPSs)
ppe = RUVphospho(ppe, M = design, k = 3, ctl = ctl)

SessionInfo

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.5.1 PhosR_1.13.1 
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-9                gridExtra_2.3              
#>   [3] rlang_1.1.4                 magrittr_2.0.3             
#>   [5] matrixStats_1.4.1           e1071_1.7-16               
#>   [7] compiler_4.4.1              systemfonts_1.1.0          
#>   [9] vctrs_0.6.5                 reshape2_1.4.4             
#>  [11] pkgconfig_2.0.3             shape_1.4.6.1              
#>  [13] crayon_1.5.3                fastmap_1.2.0              
#>  [15] backports_1.5.0             XVector_0.42.0             
#>  [17] labeling_0.4.3              utf8_1.2.4                 
#>  [19] rmarkdown_2.28              markdown_1.13              
#>  [21] preprocessCore_1.64.0       ragg_1.3.3                 
#>  [23] network_1.18.2              purrr_1.0.2                
#>  [25] xfun_0.48                   zlibbioc_1.48.2            
#>  [27] cachem_1.1.0                GenomeInfoDb_1.38.8        
#>  [29] jsonlite_1.8.9              highr_0.11                 
#>  [31] DelayedArray_0.28.0         broom_1.0.7                
#>  [33] R6_2.5.1                    stringi_1.8.4              
#>  [35] bslib_0.8.0                 RColorBrewer_1.1-3         
#>  [37] limma_3.58.1                GGally_2.2.1               
#>  [39] car_3.1-3                   GenomicRanges_1.54.1       
#>  [41] jquerylib_0.1.4             Rcpp_1.0.13                
#>  [43] SummarizedExperiment_1.32.0 knitr_1.48                 
#>  [45] IRanges_2.36.0              Matrix_1.7-0               
#>  [47] igraph_2.1.1                tidyselect_1.2.1           
#>  [49] abind_1.4-8                 yaml_2.3.10                
#>  [51] viridis_0.6.5               ggtext_0.1.2               
#>  [53] lattice_0.22-6              tibble_3.2.1               
#>  [55] plyr_1.8.9                  withr_3.0.2                
#>  [57] Biobase_2.62.0              coda_0.19-4.1              
#>  [59] evaluate_1.0.1              desc_1.4.3                 
#>  [61] ggstats_0.7.0               proxy_0.4-27               
#>  [63] xml2_1.3.6                  circlize_0.4.16            
#>  [65] pillar_1.9.0                BiocManager_1.30.25        
#>  [67] ggpubr_0.6.0                MatrixGenerics_1.14.0      
#>  [69] carData_3.0-5               stats4_4.4.1               
#>  [71] generics_0.1.3              RCurl_1.98-1.16            
#>  [73] S4Vectors_0.40.2            ggplot2_3.5.1              
#>  [75] commonmark_1.9.2            munsell_0.5.1              
#>  [77] scales_1.3.0                BiocStyle_2.30.0           
#>  [79] class_7.3-22                glue_1.8.0                 
#>  [81] pheatmap_1.0.12             tools_4.4.1                
#>  [83] dendextend_1.18.1           ggsignif_0.6.4             
#>  [85] fs_1.6.4                    cowplot_1.1.3              
#>  [87] grid_4.4.1                  tidyr_1.3.1                
#>  [89] colorspace_2.1-1            GenomeInfoDbData_1.2.11    
#>  [91] Formula_1.2-5               cli_3.6.3                  
#>  [93] ruv_0.9.7.1                 textshaping_0.4.0          
#>  [95] fansi_1.0.6                 S4Arrays_1.2.1             
#>  [97] viridisLite_0.4.2           ggdendro_0.2.0             
#>  [99] dplyr_1.1.4                 pcaMethods_1.94.0          
#> [101] gtable_0.3.6                rstatix_0.7.2              
#> [103] sass_0.4.9                  digest_0.6.37              
#> [105] BiocGenerics_0.48.1         SparseArray_1.2.4          
#> [107] farver_2.1.2                htmlwidgets_1.6.4          
#> [109] htmltools_0.5.8.1           pkgdown_2.1.1              
#> [111] lifecycle_1.0.4             GlobalOptions_0.1.2        
#> [113] statnet.common_4.10.0       statmod_1.5.0              
#> [115] gridtext_0.1.5              MASS_7.3-60.2

29 October 2024