Microbial abundance profiles are widely used to understand diseases from the perspective of microbial communities. By investigating the abundance associations among species or genes, we can construct molecular ecological networks (MENs). MENs are often constructed by calculating the Pearson correlation coefficient (PCC) between genes. In this work, we also applied multimodal mutual information (MMI) to construct MENs. The members that drive the MENs of interest are referred to as key drivers.
Key drivers, the major components that drive disease-associated MENs, provide hints for understanding disease mechanisms and have been intensively studied with RNA data. There are various methods to identify the key drivers in a co-expression network. One approach incorporates the annotation of genes and disease pathways and locates the key drivers by considering enrichment statistics of gene neighborhoods [12, 13]. Another category of methods distinguishes important MENs by calculating associations between gene modules and meta-information such as phenotype and GWAS analysis [14, 15], and then detects the key drivers by measuring the topological effect of genes. For example, MEGENA performed multiscale hub analysis, and Zhang et al. examined the number of N-hop downstream nodes. These methods for detecting key drivers in RNA data analysis can be adopted to detect key drivers in MENs. Although Portune et al. located important microbial species and genes with the assistance of gene annotation to study MENs, the annotation of microbial genes and species still demands intensive effort, and the pathways of diseases are incomplete.
The distinction between keystone species and key drivers is that keystone species are only topologically important, whereas key drivers drive disease-associated networks. The MENs of patients can differ from those of healthy individuals. By analyzing the factors driving these differences, we can gain insight into the development of the disease.
Inspired by key driver analysis with RNA data and keystone species studies in MENs, we propose a method to perform key driver analysis when annotation information is unavailable. Given a microbial abundance profile, we first construct the MEN, in which the nodes represent microbial species or phylogenetic gene markers and the edges capture the associations between the nodes they connect. We then divide the MEN into multiple subnetworks and extract the subnetworks most relevant to the disease by calculating the associations between subnetworks and phenotype variables. A single phenotype variable can be biased and may be insufficient to capture how disease networks differ from healthy networks. To address this issue, we applied principal component analysis to extract a delegated phenotype, which is more robust. Finally, our method detects the key drivers with PageRank, which uses the topological properties of nodes within each extracted subnetwork. Because PageRank captures the global link structure of a subnetwork, it outperforms statistical algorithms that use only local information.
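The delegated phenotype can be illustrated with a minimal sketch. It assumes that the delegated phenotype is the first principal component of the standardized phenotype matrix; the function name `delegated_phenotype` and the simulated data are hypothetical and are not taken from the published implementation.

```python
import numpy as np

def delegated_phenotype(phenotypes):
    """Return first-principal-component scores of a samples x variables matrix.

    Assumption: the delegated phenotype is the first principal component
    of the standardized phenotype variables.
    """
    X = np.asarray(phenotypes, dtype=float)
    # Standardize each variable so that differences in scale do not dominate.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized matrix; U[:, 0] * S[0] gives the PC1 scores.
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, 0] * S[0]
    explained = S[0] ** 2 / np.sum(S ** 2)  # fraction of variance on PC1
    return scores, explained

# Hypothetical data: four noisy measurements of one latent disease signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=50)
pheno = np.column_stack([latent + 0.3 * rng.normal(size=50) for _ in range(4)])
scores, explained = delegated_phenotype(pheno)
```

The sign of a principal component is arbitrary, so downstream associations are best compared by absolute value.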
Our main contribution is that we refined the framework of key driver detection and proposed the delegated phenotype to capture how disease networks differ from healthy networks. To validate our method, we performed experiments on simulated data. We then tested KDiamend with two real microbiome datasets. We conducted key driver analysis on Type 2 diabetes (T2D) and Rheumatoid Arthritis (RA), using gut microbiome and oral microbiome data respectively. For each disease, we also compared experiments using PCC and MMI as two different inference methods, and found both consensus and divergence. The two inference methods identified multiple identical phylogenetic gene markers and a consensus pattern of disease-associated networks, indicating the robustness of our framework. On the other hand, the two inference methods also led to specific findings, providing different perspectives from which to study disease mechanisms. We detected six T2D-relevant subnetworks and identified key drivers for each of them. The identified key drivers include IPR006047, IPR018485 and IPR003385, which are related to the carbohydrate metabolic process, an important factor in the development of T2D. In addition, we detected key drivers for RA. Both the PCC and MMI experiments located multiple InterPro matches (IPRs) that are related to membrane and infection. Six subnetworks were extracted by PCC, containing IPRs concerned with immunoglobulin and sporulation. Three subnetworks were detected by MMI, with IPRs related to biofilm, Flaviviruses, bacteriophage, etc. This result is encouraging because the development of biofilms is regarded as one of the drivers of persistent infections and some biofilm-growing bacteria contribute to RA.
Our method detects the key drivers that drive disease-related networks in the microbial community. The key drivers can be microbial species or phylogenetic gene markers. For simplicity, we describe our method with genes as nodes in the following descriptions.
The detection of key drivers consists of the following steps (see Fig. 1). First, we construct a MEN to represent the relationships between genes based on microbial abundance profiles and infer the weight of each edge. Second, we cluster the genes and partition the MEN into multiple subnetworks. Third, we analyze the phenotype variables and extract the delegated phenotype. By computing the associations between the subnetworks and the delegated phenotype, we obtain the subnetworks that are most related to the disease. Last, based on PageRank, we identify the nodes with the greatest influence over others in each subnetwork as key drivers.
Flowchart. First, we build a MEN and cluster genes into multiple subnetworks. After that, we summarize the phenotype variables and connect them to the subnetworks. Then, we locate key drivers through PageRank.
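The first step of the workflow, inferring edge weights with PCC, can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the absolute correlation is used as the edge weight, and the function name `pcc_network` and the cut-off `threshold=0.6` are hypothetical. The MMI alternative is not shown.

```python
import numpy as np

def pcc_network(abundance, threshold=0.6):
    """Build a weighted adjacency matrix from a samples x genes abundance matrix.

    Assumption: an edge is kept when |PCC| >= threshold (hypothetical cut-off),
    and its weight is the absolute correlation.
    """
    corr = np.corrcoef(abundance, rowvar=False)  # genes x genes PCC matrix
    adj = np.abs(corr)
    np.fill_diagonal(adj, 0.0)                   # no self-loops
    adj[adj < threshold] = 0.0                   # drop weak associations
    return adj

# Hypothetical data: genes 0 and 1 share a signal, gene 2 is independent.
rng = np.random.default_rng(1)
base = rng.normal(size=100)
abundance = np.column_stack([
    base + 0.1 * rng.normal(size=100),
    base + 0.1 * rng.normal(size=100),
    rng.normal(size=100),
])
adj = pcc_network(abundance)
```

The resulting matrix is symmetric, so it can be passed directly to a clustering step or to a PageRank routine that accepts weighted undirected graphs.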
We tested our method with real microbiome datasets and compared PCC with MMI in this framework. First, to detect key drivers for T2D, we downloaded processed InterPro match (IPR) abundance data from EBI (SRP008047), which is gut metagenome (microbiome) data from Chinese samples. InterPro provides a functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. The phenotype information of the dataset is provided in the related paper. We used the 145 samples from stage one.
In addition, in the PCC experiment, the key drivers in subnetwork 89 and subnetwork 208 are related to the carbohydrate metabolic process, including IPR006047, IPR018485, and IPR003385. The key driver in subnetwork 166 is IPR001789, which plays a role in the phosphorelay signal transduction system. MMI also detected IPR018211, IPR005538, IPR003501, and IPR001790, which are related to phosphorelay. PCC and MMI both identified IPRs related to the carbohydrate metabolic process and phosphorelay.
We applied our method to the oral microbiome to detect the key drivers in the microbial community related to dysbiosis in Rheumatoid Arthritis (RA). The abundance data were downloaded from EBI (ERP006678). Information on phenotype variables for the individuals was acquired from the published paper. We mapped the samples downloaded from EBI to the individual IDs and obtained 49 oral microbial samples in total. Among them, 27 samples were collected from patients with RA in different disease states, and 22 samples, used as controls, were collected from people without RA. Of the 49 samples, 21 are saliva samples and 28 are dental samples.
Similar to the analysis for T2D, we processed the phenotype matrix and detected the subnetworks most related to RA. First, we removed phenotype variables with more than 1/3 missing values. Then, for the remaining phenotype variables, we conducted imputation using the R package MICE. By computing the correlation between the delegated phenotype and the eigengenes of the subnetworks, we extracted six subnetworks most related to RA using PCC and three subnetworks using MMI. Finally, we identified key drivers for each detected disease-associated subnetwork.
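The filtering and ranking steps described above can be sketched as follows. The sketch makes two assumptions that are not stated in the text: the eigengene of a subnetwork is taken to be the first principal component of its centered abundance matrix, and subnetworks are ranked by the absolute Pearson correlation with the delegated phenotype. All function names and the toy data are hypothetical, and the MICE imputation step is not reproduced.

```python
import numpy as np

def drop_sparse_variables(pheno, max_missing=1 / 3):
    """Remove phenotype columns with more than max_missing fraction of NaN."""
    pheno = np.asarray(pheno, dtype=float)
    missing_frac = np.isnan(pheno).mean(axis=0)
    keep = missing_frac <= max_missing
    return pheno[:, keep], keep

def eigengene(sub_abundance):
    """First principal component of a samples x genes matrix (assumed definition)."""
    X = np.asarray(sub_abundance, dtype=float)
    X = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0] * S[0]

def rank_subnetworks(subnetworks, phenotype):
    """Rank subnetworks by |PCC| between their eigengene and the phenotype."""
    scores = {
        name: abs(np.corrcoef(eigengene(mat), phenotype)[0, 1])
        for name, mat in subnetworks.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy phenotype matrix: the second column has 3 of 6 values missing.
pheno = np.array([
    [1.0, np.nan, 0.2],
    [2.0, np.nan, 0.1],
    [3.0, np.nan, np.nan],
    [4.0, 1.0, 0.4],
    [5.0, 0.0, 0.3],
    [6.0, 1.0, 0.6],
])
filtered, kept = drop_sparse_variables(pheno)

# Toy subnetworks: one follows the phenotype, the other is pure noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=40)
subnetworks = {
    "related": np.column_stack(
        [trait + 0.2 * rng.normal(size=40) for _ in range(5)]
    ),
    "unrelated": rng.normal(size=(40, 5)),
}
ranking = rank_subnetworks(subnetworks, trait)
```

The absolute value is used because the sign of an eigengene is arbitrary.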
We performed key driver analysis for RA using PCC and MMI as two different inference methods. Both experiments show that the IPRs in the extracted disease-associated subnetworks have higher abundance in the disease state than in the normal state (see Fig. 4). For the PCC experiment, annotation shows that most IPRs in subnetworks 335 and 63 are related to the cell membrane, while most IPRs in subnetworks 676, 128, 679 and 680 are related to replication and cell growth. Functions of the IPRs were inferred from the keywords and Gene Ontology (GO) terms given in InterPro. Moreover, subnetwork 335 also contains IPR014879 (Sporulation initiation factor Spo0A, C-terminal) and IPR013783 (Immunoglobulin-like fold). IPR013783 is related to immunoglobulin molecules and T-cell receptor antigen [40, 41], while RA is a disease caused by compromised immune systems.
For the MMI experiment, subnetwork 1642, which has the largest correlation with the delegated phenotype, contains multiple IPRs related to biofilm: IPR024487, IPR019669, and IPR010344. This subnetwork contains 24 IPRs in total, and the top five are IPR003496, IPR024205, IPR008542, IPR010344, and IPR019669. Specifically, IPR010344 plays a role in biofilm formation and IPR019669 participates in single-species biofilm formation on inanimate substrates. The development of biofilms is one of the drivers of persistent infections. Some bacteria, when growing in a biofilm, e.g., Porphyromonas gingivalis in dental plaque, can become destructive and may contribute to RA. In addition, subnetwork 1642 contains IPR013756, associated with Flaviviruses, and IPR009774, related to a hypothetical Streptococcus thermophilus bacteriophage, which hints at an infection process in this subnetwork.
We generated the simulated data with various parameters. For each parameter group, 100 samples were produced. We compared the performance of PageRank with that of the degree algorithm, which selects the node with the highest degree as the key driver. As shown in Fig. 7, the noise level has little effect on the prediction precision of PageRank. The result of the degree algorithm follows the same pattern. To compare the two algorithms, we collected the cases where only one algorithm correctly found the key driver; the result is shown in C. Because both algorithms have high prediction precision when the number of sub-genes is large, we focus on cases where the sub-gene number is relatively small. In this situation, PageRank performs better.
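The two ranking strategies compared here can be sketched on a small graph. This is a minimal illustration, assuming a symmetric weighted adjacency matrix, a standard power-iteration PageRank with damping factor 0.85, and a degree baseline that counts non-zero edges; the function names and the toy graph are hypothetical and do not reproduce the simulation described above.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on a weighted adjacency matrix."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1)
    # Row-normalize edge weights into transition probabilities.
    P = A / np.where(out > 0, out, 1.0)[:, None]
    P[out == 0] = 1.0 / n  # isolated nodes jump uniformly
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1.0 - damping) / n + damping * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r_new

def degree_key_driver(adj):
    """Baseline: index of the node with the most non-zero edges."""
    return int(np.argmax((np.asarray(adj) > 0).sum(axis=1)))

# Toy subnetwork: node 0 is a hub linked to nodes 1-4; node 4 also links to 5.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)]:
    adj[i, j] = adj[j, i] = 1.0

ranks = pagerank(adj)
pagerank_driver = int(np.argmax(ranks))
degree_driver = degree_key_driver(adj)
```

On this toy graph both strategies select the hub; the two are expected to diverge only on subnetworks where the node with the most direct neighbors is not the one best connected through longer paths.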