Atanas Kamburov, Konstantin Pentchev, Hanna Galicka, Christoph Wierling, Hans Lehrach, Ralf Herwig (2011)
ConsensusPathDB: toward a more complete picture of cell biology.
Nucleic Acids Research 39(Database issue):D712-717.
Atanas Kamburov, Christoph Wierling, Hans Lehrach, Ralf Herwig (2009)
ConsensusPathDB--a database for integrating human functional interaction networks.
Nucleic Acids Research 37(Database issue):D623-D628.
For questions and comments regarding ConsensusPathDB, please contact Dr. Atanas Kamburov: firstname.lastname@example.org.
ConsensusPathDB is a database that integrates different types of functional interactions between physical entities in the cell like genes, RNA, proteins, protein complexes and metabolites in order to assemble a more complete and a less biased picture of cellular biology. Currently, ConsensusPathDB
contains metabolic and signaling reactions, physical protein interactions, genetic interactions, gene
regulatory interactions and drug-target interactions in human, mouse and yeast. The interaction information is collected from 30 public
resources and is integrated into a seamless network.
Physical entities from the source databases
are mapped to each other on the basis of common
identifiers like UniProt and Entrez accession numbers.
Interactions with matching primary participants
(i.e., substrates and products in the case of biochemical
reactions, interactors in the case of protein interactions and
regulated gene in the case of gene regulatory interactions) are
also mapped and grouped together according to similarity.
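As a rough illustration of identifier-based mapping, the sketch below groups source records that share at least one accession number. It assumes that mapping is transitive (a union-find structure), which the description above does not state; the record keys and the `merge_entities` function are hypothetical names chosen for this example.

```python
from collections import defaultdict


def merge_entities(records):
    """Group source records that share at least one identifier.

    `records` maps a hypothetical record key (e.g. "kegg:entry1") to a set
    of accession numbers (e.g. {"uniprot:P04637"}).
    """
    parent = {key: key for key in records}

    def find(key):
        # Follow parent links to the group representative, halving the path.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    first_seen = {}  # identifier -> first record that carried it
    for key, identifiers in records.items():
        for identifier in identifiers:
            if identifier in first_seen:
                # Shared identifier: put both records into the same group.
                parent[find(key)] = find(first_seen[identifier])
            else:
                first_seen[identifier] = key

    groups = defaultdict(set)
    for key in records:
        groups[find(key)].add(key)
    return sorted(sorted(group) for group in groups.values())


# Illustrative records only; the keys are invented for this sketch.
records = {
    "kegg:entry1": {"uniprot:P04637", "entrez:7157"},
    "reactome:entry2": {"uniprot:P04637"},
    "intact:entry3": {"entrez:7157"},
    "kegg:entry4": {"uniprot:P38398"},
}
merged = merge_entities(records)
```

Here the first three records end up in one group because they are linked through a shared UniProt or Entrez identifier, while the fourth stays on its own.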
The database content is updated every three months with the newest available versions of the source databases; new source databases are integrated at the rate of 1-2 databases per release.
The functionalities of the ConsensusPathDB web interface are described below.
Clicking on the link "content information" on the left panel of the ConsensusPathDB start page will take you to a summary of the current interaction content in the database. Most notably, the entity and interaction overlap tables summarize the number of unique physical entities and interactions present in each integrated resource, as well as the number of overlapping interactions and physical entities between resources. Figure 1 below shows the overlap tables for Release 1 of ConsensusPathDB as an example.
One way to access the integrated interaction information in ConsensusPathDB is to search for specific interactions. Currently, the web interface to ConsensusPathDB offers two possibilities to do this.
Search interactions of specific molecules or pathways
The ConsensusPathDB user can search for interactions of specific physical entities or pathways.
First, search terms are provided in the form of trivial names or accession numbers (e.g., UniProt or KEGG
accession numbers). It is possible to search for multiple physical entities or pathways simultaneously by
entering query keywords in separate rows. Search results summarize the names of matching objects and,
if the objects have been searched by accession numbers, the accession numbers matching the query.
Additionally, external web links are provided for each hit that show the origin of the physical entity
or pathway, and navigate the user to the web site containing the original information about the corresponding item. Normally, pathways have only one external link (because pathways from
different resources are not compared with each other), and physical entities may have multiple links (because
matching physical entities from different resources are merged).
After selecting relevant pathways or physical entities, their functional interactions are listed. If the selected pathways contain subpathways or if the physical entities constitute groups, e.g. protein families, then the interactions of the subpathways or group members, respectively, are also listed. At this stage, each interaction that is displayed has a single source. The user is able to specify mapping criteria according to which similar interactions are to be merged. Interactions with matching primary participants are considered similar. Which similar interactions are considered identical depends on the user's settings: the user is able to specify whether the primary participants of similar interactions must have matching modification patterns, subcellular localization and/or matching stoichiometry (in the case of biochemical reactions) in order for the similar interactions to be considered identical. According to these settings, interactions are merged together and displayed in the visualization environment of ConsensusPathDB, which is described below.
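The merging criteria can be pictured as a key built from the primary participants, to which the optional attributes are added only when the user asks for them. The following is a minimal sketch under that assumption; the dictionary layout, the function names and the source labels `db_A`, `db_B`, `db_C` are hypothetical, and participant roles are ignored for brevity.

```python
from collections import defaultdict


def interaction_key(participants, match_modifications=False,
                    match_localization=False, match_stoichiometry=False):
    """Build a hashable key from the primary participants of one interaction."""
    parts = []
    for p in participants:
        part = [p["entity"]]
        # Each optional criterion makes the key stricter.
        if match_modifications:
            part.append(tuple(sorted(p.get("modifications", ()))))
        if match_localization:
            part.append(p.get("localization"))
        if match_stoichiometry:
            part.append(p.get("stoichiometry", 1))
        parts.append(tuple(part))
    return frozenset(parts)


def merge_interactions(interactions, **criteria):
    """Return, for every merged interaction, the sorted list of its sources."""
    merged = defaultdict(list)
    for interaction in interactions:
        key = interaction_key(interaction["participants"], **criteria)
        merged[key].append(interaction["source"])
    return [sorted(sources) for sources in merged.values()]


# Illustrative single-source interactions with invented source labels.
cytosolic = [{"entity": "glucose", "localization": "cytosol"},
             {"entity": "G6P", "localization": "cytosol"}]
unlocalized = [{"entity": "glucose"}, {"entity": "G6P"}]
interactions = [
    {"source": "db_A", "participants": cytosolic},
    {"source": "db_B", "participants": cytosolic},
    {"source": "db_C", "participants": unlocalized},
]
loose = merge_interactions(interactions)
strict = merge_interactions(interactions, match_localization=True)
```

With the default settings all three interactions collapse into one; requiring matching localization keeps the interaction without localization annotation separate.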
Apart from the standard search for interactions of physical entities or pathways, the web interface of ConsensusPathDB features searching for the shortest path(s) of functional interactions that link a pair of physical entities with each other. Here, the user specifies a path "start" and a path "end". One possible shortest path between the two is calculated and displayed (Figure 2). The user is able to exclude particular physical entities from the shortest path, which is especially useful when non-specific hubs like ubiquitin and ATP are present in the path. Interaction paths of interest can be visualized in the visualization environment of ConsensusPathDB, which is described below.
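A shortest-path search with excluded entities can be sketched as a breadth-first search that never visits the excluded nodes. This is only an illustration, not the ConsensusPathDB implementation: the graph is simplified to a plain adjacency dictionary between entities, and the function name and toy network are assumptions.

```python
from collections import deque


def shortest_path(graph, start, end, excluded=()):
    """Breadth-first search for one shortest path that avoids `excluded` nodes.

    `graph` maps each node to an iterable of its neighbors.
    Returns the path as a list of nodes, or None if no path exists.
    """
    excluded = set(excluded)
    if start in excluded or end in excluded:
        return None
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous and neighbor not in excluded:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Toy network: A and B are linked through the hub ATP and through C and D.
graph = {
    "A": ["ATP", "C"],
    "ATP": ["A", "B"],
    "B": ["ATP", "D"],
    "C": ["A", "D"],
    "D": ["C", "B"],
}
via_hub = shortest_path(graph, "A", "B")
without_hub = shortest_path(graph, "A", "B", excluded={"ATP"})
```

In this toy network the unrestricted search runs through the hub, while excluding `ATP` yields the longer but more specific route via `C` and `D`.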
Interaction networks can be viewed in an interactive visualization environment. The web interface user has two choices for a visualization environment: an image-based visualization and a visualization based on Cytoscape.js. Both frameworks display interaction networks in the same style so switching between them involves barely any user acclimatization. The image-based visualization is no longer supported and will be removed soon. The newer, Cytoscape.js visualization environment allows much more flexibility, e.g. it allows the user to pan, zoom, and re-arrange the networks. Notably, gene / protein expression data can be overlaid on the nodes of a currently viewed network to enable the interaction network based interpretation of such data. Here, we describe the common features of both visualization environments.
Elements of ConsensusPathDB interaction networks
Functional interaction graphs displayed in ConsensusPathDB (example in Figure 3)
contain two types of nodes (bipartite network) and several types of
edges. Both nodes and edges differ in their shape and
color. Rectangular nodes represent physical entities (genes,
proteins, compounds, etc.) and circular nodes represent interactions
between physical entities. The color
of physical entity nodes shows the type of the physical
entity, and the same is true for interaction nodes. For example,
biochemical reactions are shown as green circles, and physical
interactions are shown as orange circles. The primary name of each
physical entity in the graph is displayed inside the representing
node by default. Post-translational modifications and subcellular locations
of physical entities are shown in square brackets behind the name, if
the user has chosen those as differentiation criteria for interaction
mapping when visualizing selected interactions of entities or pathways. Edges linking an
interaction participant with the corresponding interaction node also differ
in their style, arrow shape, and color. The edge style and arrow
shape indicate the role of the corresponding physical entity in the
interaction. The origin of interactions is indicated by the edge
color: multiple edges with the same style and arrow shape but
with different colors may exist between a physical entity and an
interaction, indicating that the interaction is present in multiple
databases. Encoding origin information as an edge attribute
(i.e. edge color) and not as an interaction attribute
allows more detailed attribution of origin information: think for
example of an interaction which is present in two different databases
but for which different enzymes are provided in each database.
A detailed graph legend is available in the visualization environment and is shown by clicking on the graph legend link in the menu panel of the visualization environment (Figure 4).
Interaction network graphs displayed in the visualization
environment are dynamic. Not only can the visibility of certain
interaction participants be toggled, but interactions can also be
added to or removed from the graph.
The names of physical entities, which are normally shown inside the rectangles representing the corresponding physical entities, can be hidden. This is especially useful when the user wishes to hide the names of certain, less important entities in the interaction graph and thus emphasize the remaining objects. Moreover, physical entities with a secondary role (e.g. enzymes, modifiers) can be hidden, for example when the user wishes to visualize only the mass flow in a biochemical pathway. The visibility of names and of complete physical entity nodes can be toggled for single physical entities by clicking on the corresponding node with the left mouse button and selecting the appropriate option from the tooltip menu, or for all physical entities in the network graph by choosing the appropriate options from the visibility settings menu in the menu panel at the top of the visualization environment page.
Apart from manipulating existing physical entity nodes, the user can hide single interactions, lists of interactions or connected components of the graph (again by choosing the corresponding options in the tooltip menus of nodes or in the menu panel). New interactions can be added to the interaction graph by expanding physical entity nodes with their further interactions (left-click on an entity node -> expand node in the tooltip menu) or by searching for physical entities or pathways that are possibly not present in the graph, and selecting their interactions of interest (menu panel -> misc functions -> add interactions).
The user can always undo the last graph modification by clicking on the undo last graph changes button on the menu panel.
For interaction networks displayed in the visualization
interface, the user can view a summary of the number of physical
entity and interaction nodes of specific types, as well as an interaction
mapping summary. To access this feature, click on misc
functions in the menu panel -> network statistics.
Visualization graphs can be exported from the visualization environment by clicking on misc functions (in the menu panel) -> export network. Several possibilities are provided: the interaction network-graph can be exported as a graphics file in several formats, including png, jpg, postscript and others, as a BioPAX file or as a ConsensusPathDB model dump. BioPAX is an XML-based file format that carries information on interaction systems and is currently supported by many software packages for e.g. biochemical system modeling. The ConsensusPathDB model dump can be imported into the visualization environment (see Section File upload and mapping) and worked with at a later time point. Model dumps can be imported only if the ConsensusPathDB version has not changed in the meantime.
ConsensusPathDB offers two statistical approaches to analyze user-specified lists of genes or metabolites obtained e.g. by microarray experiments or mass spectrometry, respectively. The first approach is over-representation analysis, where predefined lists of functionally associated genes (pathways, Gene Ontology (GO) categories and neighborhood-based entity sets, explained below) are tested for over-representation in the user-specified list based on the hypergeometric test. The second approach is based on the Wilcoxon signed-rank test and takes as input genes with exactly two measurement values, typically expression values in two distinct phenotypes. While the over-representation functionality typically takes as input a relatively short, non-weighted list of "special" (e.g., differentially expressed) genes, for the Wilcoxon enrichment analysis approach, genome-wide expression data containing measurements of possibly thousands of genes are the preferred input.
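To make the second approach more concrete, the sketch below computes a two-sided paired Wilcoxon signed-rank test for the members of one gene set, given exactly two measurement values per gene. It is an illustration under stated assumptions rather than the ConsensusPathDB code: it uses the normal approximation without tie or continuity correction, and the function name and example values are invented.

```python
import math


def wilcoxon_signed_rank(pairs):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    `pairs` is a list of (value_in_phenotype_1, value_in_phenotype_2).
    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [b - a for a, b in pairs if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # Test statistic: sum of the ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value


# Invented example: ten genes of one set, all higher in the second phenotype.
pairs = [(10, 10 + k) for k in range(1, 11)]
w_plus, p_value = wilcoxon_signed_rank(pairs)
```

For small sets an exact test would be more appropriate than the normal approximation used here.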
The molecular concept-based analysis approaches are detailed below (although they are explained on the basis of input lists of genes, the same applies to metabolite lists).
The user can provide a list of identifiers of interesting genes or proteins, e.g. of genes that are significantly over-
or underexpressed in a certain phenotype compared to a control phenotype. The gene identifiers are mapped to physical entities in ConsensusPathDB.
Over-represented sets are searched among currently three categories of predefined gene sets: network neighborhood-based sets,
pathway-based sets and Gene Ontology-based sets. For each of the predefined sets, a p-value is calculated according to the hypergeometric test based on
the number of physical entities present in both the predefined set and user-specified list of physical entities. If no
background is uploaded by the user (corresponding to the list of IDs of all measured entities in the experiment), the
background parameter value for the hypergeometric test will depend on the type of the accession numbers used for the
input list: more precisely, the background size is the number of ConsensusPathDB entities that are annotated with an ID of
the type the user has provided, and participate in at least one pathway / GO category / neighborhood-based entity set (depending on
which of these predefined classes are considered by the user). The size of the tested predefined sets is also
corrected to the number of set members that are annotated with an ID of the user-specified ID type (since the rest of the
set members can never occur in the candidates list, using the specific namespace). This "effective" set size is provided in the analysis results in brackets after the absolute size. The p-values are corrected for
multiple testing using the false discovery rate method and are available as q-values on the results page. Sets whose hypergeometric p-value
passes the threshold defined by the user are listed, and for each set, the set size (the absolute size, as well as the corrected size), the over-representation p-value, the q-value and all
interaction resources contributing to the construction of the set are provided. The set definition in terms of set
centers in the case of neighborhood-based sets (see below), pathway name in the case of pathway-based sets or GO category name in the case of GO category over-representation is also
displayed. For sets with a reasonable size, more details like the list of all set members and the list of interactions
connecting the physical entities can be viewed. In the case of neighborhood-based sets and pathways, bar charts are displayed that summarize the GO cellular
component, molecular function and biological process annotations (flattened to GO level 2) of the members of the corresponding set. Each of the three charts shows the five most common annotations from the corresponding GO category. For each entity set
of reasonable size, the underlying interaction network-graph can be viewed in the ConsensusPathDB visualization
interface, where physical entities from the uploaded list are marked with a red border.
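The over-representation p-value described above is the upper tail of a hypergeometric distribution, and the q-values result from a false discovery rate correction. The sketch below shows both steps under some assumptions: the Benjamini-Hochberg procedure is used for the correction (the text does not name the specific method), and all function names and numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, list_size, background_size):
    """P(X >= overlap) when `list_size` entities are drawn from the background.

    `set_size` is the effective size of the predefined set, i.e. the number
    of its members that can occur in the input list at all.
    """
    upper = min(set_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Invented numbers: 8 of 100 input genes fall into a set of effective size 40,
# with a background of 5000 annotated entities.
p = hypergeom_pvalue(overlap=8, set_size=40, list_size=100, background_size=5000)
q = benjamini_hochberg([p, 0.2, 0.04])
```

Exact integer arithmetic via `math.comb` avoids rounding problems for large backgrounds, at the cost of speed compared with a dedicated statistics library.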
Neighborhood-based sets are sets of physical entities that are connected to each other with one or more functional interactions. Each neighborhood-based set has a central physical entity (or simply a center) and a radius denoting the maximal distance in terms of number of interactions separating the center from the rest of the entities in the set. The distance between physical entities involved in some functional interaction is 1. The same is true also for enzymes catalyzing neighboring biochemical reactions (i.e. reactions that have common primary participants, e.g. when a product of the first reaction is a substrate in the second). To avoid redundancy in sets, we allow sets to have more than one center, and merge sets with different centers that have the same members.
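The construction of neighborhood-based sets can be sketched as a radius-limited breadth-first search around each center, followed by merging of sets that have identical members. This is a simplified illustration: the network is reduced to an entity-to-entity adjacency dictionary in which neighbors are at distance 1, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def neighborhood(graph, center, radius):
    """Entities within `radius` interaction steps of `center` (center included)."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the radius
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return frozenset(distance)


def neighborhood_sets(graph, radius):
    """Map each distinct member set to the sorted list of its centers."""
    centers = defaultdict(list)
    for center in graph:
        centers[neighborhood(graph, center, radius)].append(center)
    return {members: sorted(found) for members, found in centers.items()}


# Toy network: a triangle A-B-C with an extra entity D attached to C.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
sets_radius_1 = neighborhood_sets(graph, radius=1)
```

In the toy network the centers `A` and `B` produce the same members, so they end up as two centers of a single set.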
We have predefined a large number of neighborhood-based sets, according to different centers and radii. The user can choose to search for enriched neighborhood-based sets with a radius of 1 or 2 interactions and can additionally impose constraints on the sets such as minimal size (i.e. minimal number of different proteins and genes contained in the set), minimal connectivity index, minimal overlap with the input list, and p-value. The connectivity index, which quantifies the level of interconnectedness within neighborhood-based sets, is defined as the number of interaction edges actually present between the entities in the set divided by the number of all possible edges connecting them, which is k*(k-1)/2 for a set of size k.
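The connectivity index can be computed directly from its definition. The sketch below assumes undirected edges given as pairs of entity names, counts each pair once and ignores self-loops; the function name is hypothetical.

```python
def connectivity_index(members, edges):
    """Fraction of possible member pairs that are linked by an edge.

    `members` is an iterable of entity names, `edges` an iterable of
    (entity, entity) pairs. Edges touching non-members are ignored.
    """
    members = set(members)
    k = len(members)
    if k < 2:
        return 0.0
    possible = k * (k - 1) / 2
    # Treat edges as undirected and count each member pair only once.
    real = {
        frozenset(edge)
        for edge in edges
        if len(set(edge)) == 2 and set(edge) <= members
    }
    return len(real) / possible


# Four entities with three distinct edges among them: 3 / 6 possible pairs.
index = connectivity_index(
    "ABCD",
    [("A", "B"), ("B", "C"), ("C", "A"), ("B", "A"), ("A", "E")],
)
```

A value of 1 corresponds to a set in which every pair of entities is directly connected.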
Apart from neighborhood-based entity sets, ConsensusPathDB contains pre-defined pathway-based sets from several pathway databases. A pathway-based set contains all the proteins and genes that are involved in a curated biochemical pathway. Gene Ontology (GO) based sets contain genes that are together annotated with a specific GO term. In ConsensusPathDB, GO categories from levels 2 and 3 of the GO hierarchy are available for pathway analysis. Complex-based sets are sets of genes whose protein products are members of the same annotated protein complex.
Wilcoxon enrichment analysis
The Wilcoxon enrichment analysis method carries out a paired Wilcoxon signed-rank test for each neighborhood-based entity set (NEST) / GO category / pathway based on the user-specified measurement values of its members. The measurement values for every gene / protein, uploaded by the user, typically reflect genome-wide gene expression or proteome-wide protein abundance in two different phenotypes. For every uploaded gene or protein, exactly two values must be supplied in the input form (or uploaded file). The Wilcoxon test assigns a P-value to each functional set based on how probable it is that the combined measurement differences of genes in the functional set between the phenotypes have appeared by chance. Q-values are calculated using the same method as in the over-representation analysis approach.
Visualization of molecular concepts
The typical output of most tools for gene/metabolite set over-representation/enrichment analysis is a table where the different molecular concepts (e.g. pathways) are listed, ranked according to some statistical measure of association with the user-specified gene/metabolite list (most often a P-value). However, the molecular concepts often overlap with each other to some extent - for example, they may stand in a hierarchical relationship to each other (like Reactome pathways and Gene Ontology categories) or may share key elements. In the latest version of ConsensusPathDB, we have thus introduced the possibility to visualize the different molecular concepts (pathways, neighborhood-based sets, Gene Ontology categories and protein complexes), resulting from a particular over-representation or enrichment analysis, as concept overlap graphs (Figure 7). In these graphs, each node represents a separate concept whose member list size (i.e., number of genes/metabolites contained) and P-value are encoded as node size and node color, respectively. Two nodes are connected by an edge if they share members. The edge width reflects the relative overlap (corresponding to the Fowlkes-Mallows index) between the nodes, while the edge color encodes the number of shared members that are also found in the user's input (denoted "shared candidates"). This visual representation helps the user to quickly identify related biological processes that together show a changed activity, e.g. because they have the same key regulators. Moreover, it gives a quick overview of the relationships between the different types of concepts (e.g., particular Gene Ontology biological process categories may be very similar to particular pathways contained in pathway databases). Last but not least, the color coding of edges can provide clues about potentially dysregulated crosstalk between different biological processes.
The overlap graph visualization environment features a filter that can be applied to edges in order to highlight only the closest relationships between concepts.
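The edges of a concept overlap graph can be sketched as follows. The sketch assumes that the relative overlap of two member sets A and B is |A ∩ B| / sqrt(|A| * |B|), which is the form the Fowlkes-Mallows index takes for two sets, and that the edge filter is a simple minimum-overlap threshold. Both are assumptions, as are the function names and the example concepts.

```python
import math


def relative_overlap(a, b):
    """Overlap of two member sets: |A & B| / sqrt(|A| * |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def overlap_edges(concepts, candidates, min_overlap=0.0):
    """List edges as (concept_1, concept_2, relative_overlap, shared_candidates).

    Only concept pairs that share at least one member and reach `min_overlap`
    are returned.
    """
    candidates = set(candidates)
    names = sorted(concepts)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(concepts[first]) & set(concepts[second])
            weight = relative_overlap(concepts[first], concepts[second])
            if shared and weight >= min_overlap:
                edges.append((first, second, weight, len(shared & candidates)))
    return edges


# Invented concepts and input list for illustration.
concepts = {
    "pathway_1": {"g1", "g2", "g3", "g4"},
    "go_term_1": {"g3", "g4", "g5", "g6"},
    "complex_1": {"g6", "g7", "g8", "g9"},
}
candidates = {"g3", "g6"}
all_edges = overlap_edges(concepts, candidates)
close_edges = overlap_edges(concepts, candidates, min_overlap=0.4)
```

Raising `min_overlap` removes the weaker edge between `complex_1` and `go_term_1` and leaves only the closest relationship.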
In addition to the over-representation/enrichment analysis of predefined molecular concepts, the web interface of ConsensusPathDB provides another approach for the interaction- and pathway-centric analysis of lists of genes, called induced network modules analysis. The approach was first proposed in Berger et al., 2007. Given a list of so-called seed genes (e.g. resulting from microarray experiments, which are unable to directly disclose the functional relationships between genes), it aims to interconnect those genes through different types of interactions (physical, biochemical, regulatory, etc.; selectable by the user). This information on the pairwise functional/physical relationships between the genes can shed light e.g. on the biological reasons why they are identified together in the experiment. For example, if a group of genes found to be over-expressed in a microarray experiment are highly interconnected through physical interactions, this suggests that those genes may encode proteins which together form a protein complex that has a high concentration in the phenotype under study and thus may be relevant for this phenotype.
Notably, the induced network modules may optionally include genes that are not in the user-supplied seeds list, but associate two or more seed genes with each other and overall have significantly many connections within the induced network module (purple nodes in Figure 8 below). These so-called intermediate genes are likely to be associated with the phenotype under study, although they may not be regulated on the transcriptional level and thus do not appear in the input gene list. For example, if a group of seed genes are all connected through gene regulatory interactions to an intermediate node that represents a transcription factor, this suggests that the transcription factor may be dysfunctional (e.g. due to a mutation, which does not necessarily impact the transcription factor's expression). Intermediate genes are ranked according to the significance of association with the seeds list given their overall connectivity in the background network. This is quantified by a z-score calculated for each intermediate node with the binomial proportions test. The z-score threshold can be controlled dynamically by the user in order to create sub-networks involving many intermediate and seed genes with a less stringent threshold or more compact sub-networks with a more stringent threshold.
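The ranking of intermediate genes can be illustrated with a one-sample binomial proportion z-score. The exact quantities entering the ConsensusPathDB calculation are not spelled out above, so the sketch makes explicit assumptions: an intermediate node has `total_links` links in the background network, `links_to_seeds` of them lead to seed genes, and `background_seed_fraction` is the proportion of seed links expected by chance. All names and numbers are hypothetical.

```python
import math


def intermediate_z_score(links_to_seeds, total_links, background_seed_fraction):
    """z-score of the observed number of seed links against the expectation."""
    expected = total_links * background_seed_fraction
    variance = total_links * background_seed_fraction * (1 - background_seed_fraction)
    if variance == 0:
        return 0.0
    return (links_to_seeds - expected) / math.sqrt(variance)


def select_intermediates(nodes, background_seed_fraction, z_threshold):
    """Keep intermediate nodes whose z-score reaches the threshold.

    `nodes` maps a node name to (links_to_seeds, total_links).
    """
    selected = {}
    for name, (links_to_seeds, total_links) in nodes.items():
        z = intermediate_z_score(links_to_seeds, total_links, background_seed_fraction)
        if z >= z_threshold:
            selected[name] = z
    return selected


# Invented example: a specific regulator versus a non-specific hub.
nodes = {"regulator": (10, 100), "hub": (12, 400)}
relaxed = select_intermediates(nodes, background_seed_fraction=0.05, z_threshold=-5.0)
stringent = select_intermediates(nodes, background_seed_fraction=0.05, z_threshold=2.0)
```

With the stringent threshold only the node with more seed links than expected from its overall connectivity remains, which mirrors how a higher threshold produces more compact sub-networks.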
If numerical values (e.g. expression values) are available for the genes, they can be uploaded and overlaid on the nodes of the resulting induced modules to allow an easier interpretation.
Through the web interface of ConsensusPathDB, users can upload interaction networks in BioPAX, PSI-MI or SBML format. Upon upload, physical entities and functional interactions are compared against the content of ConsensusPathDB. This is done based on identifiers (UniProt, KEGG, ChEBI, ...) in the case of physical entities and on participant composition in the case of interactions. The uploaded networks are visualized, and for interactions present in ConsensusPathDB, the source databases are displayed. Networks can be expanded in the context of the database content by adding further interactions, and edited.
SBML files should have physical entity annotations as in the BioModels model repository.
The use of CPDB is free of charge for academic users. Commercial users should contact Dr. Ralf Herwig ( herwig [at] molgen.mpg.de ). Data stored in ConsensusPathDB is available under the license terms of each contributing database.
Disclaimer
Although best efforts are always applied, the developers of ConsensusPathDB do not assume any legal responsibility for correctness or usefulness of the information in ConsensusPathDB.
Acknowledgements
ConsensusPathDB is being developed by the Bioinformatics group of the Vertebrate Genomics Department at the Max-Planck-Institute for Molecular Genetics in Berlin, Germany. The project was supported by the EMBRACE and CARCINOGENOMICS projects that are funded by the European Commission within its 6th Framework Programme under the thematic area "Life Sciences, Genomics and Biotechnology for Health" (LSHG-CT-2004-512092 and LSHB-CT-2006-037712); 7th Framework Programme project APO-SYS (HEALTH-F4-2007-200767); German Federal Ministry of Education and Research within the NGFN-2 program (SMP-Protein, FKZ01GR0472); Max Planck Society within its International Research School program (IMPRS-CBSC).
For more information, please contact kamburov[at]molgen.mpg.de .