A central repository for projects of the Computational Systems Biology (CSB) group
https://github.com/PYangLab
Tools and resources related to single-cell omics data
Cepo
A new method to detect cell identity genes from single-cell RNA-sequencing data using differential stability, a biologically motivated metric. Cepo computes cell-type-specific gene statistics that capture differentially stable gene expression. [Github Repo], [BioC R package], [User tutorial]
Kim, H., Wang, K., Chen, C., Lin, Y., Tam, PPL., Lin, D., Yang, J. & Yang, P.† (2021) Uncovering cell identity through differential stability with Cepo. Nature Computational Science, 1, 784-790. [Full Text] [Nature Content Sharing link]
Kim, H., Tam, P. & Yang, P.† (2021) Defining cell identity beyond the premise of differential gene expression. Cell Regeneration, 10, 20. [Full Text]
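The differential stability idea can be illustrated with a small Python sketch. This is a simplified illustration, not the Cepo implementation: it assumes that stability within a cell type is scored from two per-gene statistics (coefficient of variation and proportion of zeros), and that differential stability is a cell type's stability minus the mean stability in the other cell types. All function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def rank_ascending(values):
    """Return ranks scaled to (0, 1]; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg / len(values)
        i = j + 1
    return ranks

def stability(expr):
    """expr: {gene: [expression per cell]} for one cell type.
    Higher score = more stable (low coefficient of variation, few zeros)."""
    genes = list(expr)
    cv, zeros = [], []
    for g in genes:
        x = expr[g]
        m = mean(x)
        cv.append(pstdev(x) / m if m > 0 else float("inf"))
        zeros.append(sum(v == 0 for v in x) / len(x))
    # Negate so that low CV and few zeros receive the highest ranks.
    r_cv = rank_ascending([-c for c in cv])
    r_zero = rank_ascending([-z for z in zeros])
    return {g: (a + b) / 2 for g, a, b in zip(genes, r_cv, r_zero)}

def differential_stability(expr_by_type):
    """expr_by_type: {cell_type: {gene: [values]}}.
    Assumes at least two cell types measured on the same genes."""
    stab = {ct: stability(e) for ct, e in expr_by_type.items()}
    result = {}
    for ct in stab:
        others = [o for o in stab if o != ct]
        result[ct] = {
            g: stab[ct][g] - mean(stab[o][g] for o in others)
            for g in stab[ct]
        }
    return result

# Toy data: g1 is stable only in type A, g2 only in type B, g3 in both.
toy = {
    "A": {"g1": [5, 5, 6, 5], "g2": [0, 9, 0, 1], "g3": [3, 3, 3, 3]},
    "B": {"g1": [0, 8, 0, 0], "g2": [4, 4, 5, 4], "g3": [3, 3, 3, 3]},
}
scores = differential_stability(toy)
top = {ct: max(s, key=s.get) for ct, s in scores.items()}
print(top)  # {'A': 'g1', 'B': 'g2'}
```

In this toy example g3 is equally stable in both cell types, so its differential stability is zero and it is not selected as an identity gene for either type, even though it is the most stable gene overall.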
CiteFuse
A suite of methods and tools for CITE-seq data analysis from pre-processing to integrative analytics, including doublet detection, network-based modality integration, cell type clustering, differential RNA and protein expression analysis, ADT evaluation, ligand-receptor interaction analysis, and interactive web-based visualisation of the analyses. [Github Repo], [BioC R package], [User tutorial]
Kim, H.✢, Lin, Y.✢, Geddes, T., Yang, J. & Yang, P.† (2020) CiteFuse enables multi-modal analysis of CITE-seq data. Bioinformatics, 36(14), 4137-4143. [Full Text]
scReClassify
scReClassify is a method for correcting potentially mislabelled cells in single-cell RNA-sequencing data. It uses a semi-supervised algorithm, adaSampling, to learn and predict mis-annotation of cell types in the dataset. [Github Repo & tutorial], [BioC R package]
Kim, T., Lo, K., Geddes, T., Kim, H., Yang, J. & Yang, P.† (2019) scReClassify: post hoc cell type classification of single-cell RNA-seq data. BMC Genomics, 20, 913. [Full Text]
Stably expressed genes
Stably expressed genes (SEGs) are genes that are expressed at similar levels across individual cells in a large number of cell types. They are identified from large-scale mouse and human single-cell RNA-seq datasets covering a large number of cell types. SEGs can be used for batch correction and integration of multiple datasets. Biologically, SEGs share characteristics with housekeeping genes. [List of mouse SEGs], [List of human SEGs], [Interactive tool for SEG download]
Lin, Y., Ghazanfar, S., Strbenac, D., Wang, A., Patrick, E., Lin, D., Speed, T., Yang, J.† & Yang, P.† (2019) Evaluating stably expressed genes in single cells. GigaScience, 8(9), giz106. [Full Text] [Giga DB]
Tools and resources related to proteomics and phosphoproteomics data
PhosR
An R package for comprehensive analysis of phosphoproteomic data. PhosR consists of various processing tools for phosphoproteomic data, including filtering, imputation, normalisation and batch correction using stably phosphorylated sites (SPS), which enables integration of multiple phosphoproteomic datasets. Downstream analytical tools consist of site- and protein-centric pathway analysis to evaluate activities of kinases and signalling pathways, large-scale kinase-substrate annotation from dynamic phosphoproteomic profiling, and visualisation and construction of signalomes present in the phosphoproteomic data of interest. [Github Repo], [BioC R package], [User tutorial], [STAR Protocol]
Kim, H.✢, Kim, T.✢, Hoffman, N., Xiao, D., James, D., Humphrey, S. & Yang, P.† (2021) PhosR enables processing and functional analysis of phosphoproteomic data. Cell Reports, 34(8), 108771. [Full Text]
Kim, H., Kim, T., Xiao, D. & Yang, P.† (2021) Protocol for the processing and downstream analysis of phosphoproteomic data with PhosR. STAR Protocols, 2(2), 100585. [Full Text]
Stably phosphorylated sites
This list of stably phosphorylated sites (SPS) was identified across 53 human phosphoproteomics datasets covering 40 cell/tissue types and 194 conditions/treatments. SPS can be used for batch correction, normalisation, and integration of multiple phosphoproteomics datasets, as in PhosR. The SPS themselves were evolutionarily conserved, functionally important, and enriched in a range of core signalling and gene pathways, including the RNA splicing pathway, an essential cellular process in mammalian cells, and frequently disrupted by cancer mutations. [List of human SPS], [Github Repo]
Xiao, D., Kim, H., Pang, I. & Yang, P.† (2022) Functional analysis of the stable phosphoproteome reveals cancer vulnerabilities. Bioinformatics, 38(7), 1956-1963. [Full Text]
ClueR
The Cluster evaluation R (ClueR) package is designed to identify the optimal fuzzy c-means clustering of mass spectrometry (MS)-based phosphoproteomics data, using prior knowledge of kinase-substrate relationships such as those in the PhosphoSitePlus database. [Github Repo & tutorial], [CRAN R package]
Yang, P.†, Zheng, X., Jayaswal, V., Hu, G., Yang, J. & Jothi, R. (2015). Knowledge-based analysis for detecting key signaling events from time-series phosphoproteomics data. PLoS Computational Biology, 11(8), e1004403. [Full Text], [PDF]
KSP-PUEL
Positive-unlabeled ensemble learning (PUEL) for kinase substrate prediction (KSP-PUEL) is an application developed for predicting novel substrates of kinases of interest using MS-based phosphoproteomics data by learning from kinase recognition motifs and substrate phosphorylation profiles. [Github Repo]
Yang, P.†, Humphrey, S., James, D., Yang, J. & Jothi, R.† (2016). Positive-unlabeled ensemble learning for kinase substrate prediction from dynamic phosphoproteomics data. Bioinformatics, 32(2), 252-259. [Full Text], [PDF]
directPA & KinasePA
An R package designed to identify combinatorial effects of multiple treatments and/or perturbations on pathways and kinases profiled by microarray, RNA-seq, proteomics, or phosphoproteomics data. [Github Repo & tutorial], [CRAN R package]
Yang, P.✢, Patrick, E.✢, Tan, S., Fazakerley, D., Burchfield, J., Gribben, C., Prior, M., James, D. & Yang, J. (2014). Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway. Bioinformatics, 30(6), 808-814. [Full Text], [PDF]
Yang, P., Patrick, E., Humphrey, S., Ghazanfar, S., James, D., Jothi, R. & Yang, J. (2016). KinasePA: Phosphoproteomics data annotation using hypothesis driven kinase perturbation analysis. Proteomics, 16(13), 1868-1871. [Full Text]
General purpose tools
AdaSampling
AdaSampling is a semi-supervised model for learning from data with noisy class labels or with only the positive class labelled. It can be applied where the training data contain mislabelled examples, or where only positive training examples are known and negative examples are unknown. [CRAN package], [Github Repo]
Yang, P.†, Ormerod, J., Liu, W., Ma, C., Zomaya, A. & Yang, J. (2019) AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Transactions on Cybernetics, 49(5), 1932-1943. [PDF]
Yang, P., Liu, W. & Yang, J. (2017) Positive unlabeled learning via wrapper-based adaptive sampling. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Pre-print [PDF]
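The positive-unlabeled setting described for AdaSampling can be sketched in a few lines of Python. This is a toy illustration of adaptive sampling under stated assumptions, not the AdaSampling package or its published algorithm: the base classifier is swapped for a simple nearest-centroid score, and the function names, the number of iterations, and the toy data are all hypothetical. Each round samples a "negative" training set from the unlabeled examples, favouring those currently believed to be true negatives, then re-estimates those beliefs.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def prob_positive(x, c_pos, c_neg):
    """Soft score: closer to the positive centroid gives a value near 1."""
    z = math.dist(x, c_neg) - math.dist(x, c_pos)
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def pu_adaptive_sampling(positives, unlabeled, n_iter=5, seed=0):
    """Return, for each unlabeled example, an estimated probability of being positive."""
    rng = random.Random(seed)
    p_neg = [1.0] * len(unlabeled)  # start by treating every unlabeled example as negative
    c_pos = centroid(positives)
    for _ in range(n_iter):
        # Sample negatives with probability proportional to the current belief.
        sample = rng.choices(unlabeled, weights=p_neg, k=len(unlabeled))
        c_neg = centroid(sample)
        # Update the belief that each unlabeled example is a true negative.
        p_neg = [1.0 - prob_positive(x, c_pos, c_neg) for x in unlabeled]
    return [1.0 - p for p in p_neg]

# Toy data: the first two unlabeled points are hidden positives.
positives = [[0.0, 0.0], [0.2, 0.1], [0.1, -0.1]]
unlabeled = [[0.1, 0.0], [0.0, 0.2]] + [[5.0 + 0.1 * i, 5.0 - 0.1 * i] for i in range(8)]
scores = pu_adaptive_sampling(positives, unlabeled)
predicted_positive = [s > 0.5 for s in scores]
```

The same loop covers the label-noise case if the weights are read as the probability that a given label is correct rather than the probability that an unlabeled example is negative.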
Sample Subset Optimization (SSO)
Sample subset optimization (SSO) is a sampling technique that utilizes an evolutionary algorithm to optimize sample subsets for learning from imbalanced datasets. The key idea is to select a subset of the most representative training examples from the major class and combine them with the minor class to generate a balanced dataset for training classification models. [Software]
Yang, P.†, Yoo, P., Fernando, J., Zhou, B., Zhang, Z. & Zomaya, A. (2014). Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. IEEE Transactions on Cybernetics, 44(3), 445-455. [Full Text] [PDF]
Yang, P., Zhang, Z., Zhou, B. & Zomaya, A. (2011). Sample subset optimization for classifying imbalanced biological data. In Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Artificial Intelligence 6635, Springer Berlin Heidelberg, 333-344. [Text]
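The key idea of SSO, selecting a subset of the major class and combining it with the minor class, can be sketched as follows. This is a minimal illustration under stated assumptions, not the SSO software: the evolutionary search is reduced to a single-individual, mutation-only loop, and the fitness function (how closely the subset centroid matches the full major-class centroid) is a hypothetical stand-in for whatever criterion the actual method optimizes. Function names and toy data are likewise hypothetical.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def fitness(subset_idx, majority, target):
    """Hypothetical fitness: higher when the subset centroid is closer to the target."""
    return -math.dist(centroid([majority[i] for i in subset_idx]), target)

def evolve_subset(majority, k, n_gen=200, seed=0):
    """Search for k major-class indices; assumes len(majority) > k."""
    rng = random.Random(seed)
    target = centroid(majority)
    best = rng.sample(range(len(majority)), k)
    best_fit = fitness(best, majority, target)
    for _ in range(n_gen):
        child = list(best)
        # Mutation: swap one selected example for a currently unselected one.
        outside = [i for i in range(len(majority)) if i not in child]
        child[rng.randrange(k)] = rng.choice(outside)
        fit = fitness(child, majority, target)
        if fit >= best_fit:  # keep the child only if it is at least as fit
            best, best_fit = child, fit
    return sorted(best)

def balanced_training_set(majority, minority, seed=0):
    """Combine the selected major-class subset with all minor-class examples."""
    idx = evolve_subset(majority, k=len(minority), seed=seed)
    X = [majority[i] for i in idx] + list(minority)
    y = [0] * len(idx) + [1] * len(minority)
    return X, y

# Toy data: 12 major-class examples, 3 minor-class examples.
majority = [[float(i), float(i % 3)] for i in range(12)]
minority = [[20.0, 20.0], [21.0, 19.0], [19.0, 21.0]]
X, y = balanced_training_set(majority, minority)
```

The returned `X` and `y` contain equal numbers of examples from each class and can be passed to any classifier in place of the original imbalanced data.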
Legacy projects hosted on Google Code
- Legacy version of sample subset optimization package
https://code.google.com/p/sample-subset-optimization
- A machine learning algorithm in Java for protein inference
http://code.google.com/p/re-fraction
- A boosted learning algorithm in Java for peptide filtering
http://code.google.com/p/self-boosted-percolator
- A parallel genetic algorithm in Java for SNP interaction detection
http://code.google.com/p/genetic-ensemble-snpx
- An ensemble algorithm in Perl for SNP interaction filtering
http://code.google.com/p/ensemble-of-filters
- An open source mass spectrometry analysis pipeline in R
http://code.google.com/p/ocap
- A dynamic wavelet package in C/C++ for mass spectrum modeling
http://code.google.com/p/dywave/DyWave
- A particle swarm optimisation algorithm in Java for imbalanced data sampling
http://code.google.com/p/imbalanced-data-sampling