Parsimonious Gene Correlation Network Analysis (PGCNA) is a gene correlation network analysis approach that is computationally simple yet yields stable and biologically meaningful modules. It also allows visualisation of very large networks, showing substructure and relationships that are normally hard to see. The parsimonious approach, retaining only the 3 most correlated edges per gene, vastly reduces network complexity, so large networks can be clustered quickly and reduced to a small output file that can be used by downstream software.
Motivation for method
Our overall motivation was based on the following requirements for the method:
Visualize/contextualize the biology contained in large gene expression data-sets.
Generate modules (also referred to as clusters) that have distinct and meaningful biology using an unsupervised approach (no prior assumptions).
Be able to map the discovered modules onto gene expression data-sets to generate module 'fingerprints' per sample/patient.
Use these derived 'fingerprints' to study relationships between module biology and external factors (not expression based; e.g. mutations).
Use module gene/signature membership to study recurrent features between cancers.
This is an overview of the analysis steps when PGCNA is applied to multiple expression data-sets in the paper npj Syst. Biol. Appl. 5, 13. https://doi.org/10.1038/s41540-019-0090-7
(1) : Multiple available gene expression data-sets are gathered, in this case publicly available data for breast cancer (BRCA; n=26) and colorectal cancer (CRC; n=11). Each data-set is normalised, probes merged per gene and the genes re-annotated to the latest gene symbols.
(2) : Per data-set, rank genes by variance across patients and, for the top 80% most variant genes, calculate Spearman's rank correlations between every gene pair.
(3) : Per cancer, merge data-sets by taking the median gene pair correlations for all genes that are present in > 1/3 of the data-sets. All correlations that have a p-value > 0.05 are set to 0.
(4) : Per gene, retain only the top 3 edges (correlations). Because some genes are common partners of many others (appearing in their top 3 lists), this radical edge reduction still results in a scale-free network with hub nodes (high degree/connectivity).
(5) : Using the FastUnfold/Louvain method (https://sites.google.com/site/findcommunities/) the merged correlation matrices are clustered 10,000 times. The top 100 (selected using the FastUnfold modularity score) are ranked by their gene signature enrichment to select the optimal clustering of data. The stability of modules is assessed by comparing this optimal solution against the remaining 99 clusterings of the data.
(6) : Using Gephi, visualise the networks, and create user explorable versions, allowing contextualisation of biology in the large networks.
(7) : Explore characteristics of the data-sets by mapping them onto the networks, allowing understanding of regions of the networks that differ in expression level, or correlation with prognosis (meta hazard ratios).
(8) : Explore the commonality of the derived modules between cancers/data-sets by looking at their shared modules (gene-level) or biology (signature-level).
(9) : Per module, calculate a Module Expression Value (MEV) based on the module's 25 most representative genes. This allows a 'fingerprint' to be generated per patient/sample showing the relative expression of the discovered modules.
(10) : Use MEVs to cluster and explore the original and new (unseen; i.e. not used in network construction) expression data-sets. This helps visualise the heterogeneity within established cancer subtypes, and highlight their relative characteristics.
(11) : Explore additional meta data (e.g. mutations, CNV, etc.) by analysing their correlation with the MEVs. Again, this demonstrates the heterogeneity of mutations in established cancer subtypes, but also reveals potentially unexplored relationships.
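As a rough illustration of steps (2) and (4) above, the following minimal Python sketch filters a single expression matrix to its most variant genes, computes Spearman correlations, and retains the top 3 edges per gene. It is not the PGCNA implementation: the function names `spearman_top_variance` and `retain_top_edges`, the genes-by-samples array layout, ranking edges by signed (not absolute) correlation, and keeping an edge when either endpoint selects it are all assumptions made for this example. The merging and p-value filtering of step (3) are omitted.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_top_variance(expr, keep_frac=0.8):
    """Step (2) sketch: Spearman correlations for the most variant genes.

    expr is assumed to be a genes x samples array (hypothetical layout).
    Returns the kept row indices and their gene x gene correlation matrix.
    """
    variances = expr.var(axis=1)
    n_keep = max(2, int(round(keep_frac * expr.shape[0])))
    # Indices of the n_keep most variant genes, restored to original order.
    kept = np.sort(np.argsort(variances)[::-1][:n_keep])
    # Spearman correlation equals Pearson correlation on (average) ranks.
    ranks = np.apply_along_axis(rankdata, 1, expr[kept])
    return kept, np.corrcoef(ranks)


def retain_top_edges(corr, n_edges=3):
    """Step (4) sketch: keep only each gene's n_edges strongest correlations.

    Assumption: edges are ranked by signed correlation, and an edge survives
    if either of its two genes selected it, so hub genes can exceed n_edges.
    """
    work = corr.copy()
    np.fill_diagonal(work, -np.inf)  # never select self-correlation
    top = np.argsort(work, axis=1)[:, -n_edges:]  # largest values per row
    rows = np.arange(corr.shape[0])[:, None]
    mask = np.zeros(corr.shape, dtype=bool)
    mask[rows, top] = True
    mask |= mask.T  # symmetrise: undirected network
    return np.where(mask, corr, 0.0)


# Toy data standing in for one normalised expression data-set.
rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 30))  # 20 genes, 30 samples

kept, corr = spearman_top_variance(expr, keep_frac=0.8)
network = retain_top_edges(corr, n_edges=3)

degree = (network != 0).sum(axis=1)
print("genes kept:", len(kept))
print("edges retained:", int((network != 0).sum() // 2))
print("max degree:", int(degree.max()))
```

In this sketch every gene keeps at least 3 edges, but a gene that appears in many other genes' top 3 lists ends up with a higher degree, which is how hub nodes arise from the edge reduction described in step (4).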
Example PGCNA analyses
A python implementation of PGCNA can be downloaded here: https://github.com/medmaca/PGCNA