Atanas Kamburov, Christoph Wierling, Hans Lehrach, Ralf Herwig (2009) ConsensusPathDB--a database for integrating human functional interaction networks. Nucleic Acids Research 37(Database issue):D623-D628
Contact: For questions and comments regarding ConsensusPathDB, please contact Atanas Kamburov: kamburov@molgen.mpg.de.
ConsensusPathDB is a database that integrates different types of functional interactions between physical entities in the cell like genes, RNA, proteins, protein complexes and metabolites in order to assemble a more complete and a less biased picture of cellular biology. Currently, ConsensusPathDB
contains metabolic and signaling reactions, physical protein interactions and gene
regulatory interactions in human, mouse and yeast.
Physical entities from the external
resources are mapped to each other on the basis of common
identifiers like UniProt and Entrez.
Interactions with matching primary participants
(i.e., substrates and products in the case of biochemical
reactions, interactors in the case of protein interactions and
regulated genes in the case of gene regulatory interactions) are
also mapped and grouped together according to similarity.
The database content is updated every three months
with the newest available versions of the source databases.
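The identifier-based mapping of physical entities can be illustrated with a minimal sketch. It assumes that two records describe the same entity whenever they share at least one (namespace, accession) pair, and that this relation is applied transitively; the record layout, the function name `merge_entities` and the source names `DB_A`, `DB_B`, `DB_C` are hypothetical and are not taken from the ConsensusPathDB implementation.

```python
from collections import defaultdict


def merge_entities(records):
    """Group entity records that share at least one (namespace, accession) pair.

    `records` is a list of dicts with the hypothetical keys "source" and "ids".
    Returns a list of sets of record indices, one set per merged entity.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # identifier -> index of the first record carrying it
    for idx, rec in enumerate(records):
        for key in rec["ids"]:
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(set)
    for idx in range(len(records)):
        groups[find(idx)].add(idx)
    return sorted(groups.values(), key=min)


# Hypothetical records from three source databases.
records = [
    {"source": "DB_A", "ids": {("uniprot", "P04637")}},
    {"source": "DB_B", "ids": {("uniprot", "P04637"), ("entrez", "7157")}},
    {"source": "DB_C", "ids": {("entrez", "7157")}},
    {"source": "DB_C", "ids": {("uniprot", "P38398")}},
]
merged = merge_entities(records)
```

Here the first three records end up in one group because they are linked through a shared UniProt or Entrez identifier, while the fourth record remains a separate entity.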
The functionalities of the ConsensusPathDB web interface are described below.
Clicking on the link "content information" on the left panel of the ConsensusPathDB start page will take you to a summary of the current interaction content in the database. Most notably, the entity and interaction overlap tables summarize the number of unique physical entities and interactions present in each integrated resource, as well as the number of overlapping interactions and physical entities between resources. Figure 1 shows the overlap tables for Release 1 of ConsensusPathDB as an example.

One way to access the integrated interaction information in ConsensusPathDB is to search for specific interactions. Currently, the web interface to ConsensusPathDB offers two possibilities to do this.
Interactions of specific molecules or pathways
The ConsensusPathDB user can search for interactions of specific physical entities or pathways.
First, search terms are provided in the form of trivial names or accession numbers (e.g., UniProt or KEGG
accession numbers). It is possible to search for multiple physical entities or pathways simultaneously by
entering query keywords in separate rows. Search results summarize the names of matching objects and,
if the objects have been searched by accession numbers, the accession numbers matching the query.
Additionally, external web links are provided for each hit that show the origin of the physical entity
or pathway, and navigate the user to the web site containing the original information about the corresponding item. Normally, pathways have only one external link (because pathways from
different resources are not compared with each other), and physical entities may have multiple links (because
matching physical entities from different resources are merged in
ConsensusPathDB).
After selecting relevant pathways or physical entities, their
functional interactions are listed. If the selected
pathways contain subpathways or if the physical entities
constitute groups, e.g. protein families, then the interactions
of the subpathways or group members, respectively, are also listed.
At this stage, each
interaction that is displayed has a single source. The user is
able to specify mapping criteria according to which similar
interactions are to be merged. Interactions
with matching primary participants are considered
similar. Which interactions are considered identical
depends on the user's settings: the user is able to
specify whether the primary participants of similar interactions must have
matching modification patterns, subcellular localization and/or
matching stoichiometry (in the case of biochemical reactions) in
order for the similar interactions to be considered identical. According to these settings,
interactions are merged together and displayed in the visualization environment
of ConsensusPathDB which is described below.
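The effect of the mapping criteria can be illustrated with a minimal sketch. It assumes that each interaction is a dict listing its primary participants, and that two interactions are merged when the keys built from those participants are equal; the field names, the flags `use_modifications`, `use_localization`, `use_stoichiometry` and the source names are hypothetical.

```python
from collections import defaultdict


def interaction_key(interaction, use_modifications=False,
                    use_localization=False, use_stoichiometry=False):
    """Build a hashable key; interactions with equal keys are treated as identical."""
    parts = []
    for p in interaction["primary"]:
        item = [p["entity"], p["role"]]
        if use_modifications:
            item.append(p.get("modification"))
        if use_localization:
            item.append(p.get("location"))
        if use_stoichiometry:
            item.append(p.get("stoichiometry", 1))
        parts.append(tuple(item))
    return (interaction["type"], frozenset(parts))


def merge_interactions(interactions, **criteria):
    """Map each distinct key to the set of sources that report the interaction."""
    merged = defaultdict(set)
    for inter in interactions:
        merged[interaction_key(inter, **criteria)].add(inter["source"])
    return dict(merged)


# The same reaction from two hypothetical sources; only DB_A gives a location.
reaction_a = {"type": "reaction", "source": "DB_A", "primary": [
    {"entity": "glucose", "role": "substrate", "location": "cytosol"},
    {"entity": "glucose-6-phosphate", "role": "product", "location": "cytosol"},
]}
reaction_b = {"type": "reaction", "source": "DB_B", "primary": [
    {"entity": "glucose", "role": "substrate"},
    {"entity": "glucose-6-phosphate", "role": "product"},
]}

default_merge = merge_interactions([reaction_a, reaction_b])
strict_merge = merge_interactions([reaction_a, reaction_b], use_localization=True)
```

With the default settings the two records collapse into one interaction with two sources. Requiring matching subcellular localization keeps them apart, because only one of the records carries a location.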
Apart from the standard search for interactions of physical entities and pathways, the web interface of ConsensusPathDB features searching for the shortest path(s) of functional interactions that link a pair of physical entities with each other. Here, the user specifies a path "start" and a path "end". One possible shortest path between the two is calculated and displayed (Figure 2). The user is able to exclude particular physical entities from the shortest path, which is especially useful when non-specific hubs like ubiquitin and ATP are present in the path. Interaction paths of interest can be visualized in the visualization environment of ConsensusPathDB, which is described below.
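A shortest-path search with excluded entities can be sketched as a breadth-first search. This is only an illustration under simplifying assumptions: the network is reduced to an adjacency dict in which two entities are neighbors if they take part in a common interaction, and the function name `shortest_path` and the toy network are hypothetical. The algorithm actually used by ConsensusPathDB is not described here.

```python
from collections import deque


def shortest_path(adjacency, start, end, excluded=()):
    """Return one shortest path from start to end that avoids excluded nodes.

    Returns None if no such path exists.
    """
    blocked = set(excluded)
    if start in blocked or end in blocked:
        return None
    previous = {start: None}  # node -> predecessor on the path from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous and neighbor not in blocked:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Toy network in which ATP acts as a non-specific hub between A and D.
adjacency = {
    "A": ["ATP", "B"],
    "ATP": ["A", "D"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["ATP", "C"],
}
via_hub = shortest_path(adjacency, "A", "D")
without_hub = shortest_path(adjacency, "A", "D", excluded=["ATP"])
```

In the toy network the unrestricted search runs through the hub (A, ATP, D), whereas excluding ATP yields the longer but more specific path A, B, C, D.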

Interaction networks can be viewed in an interactive visualization environment. The web interface user has two choices for a visualization environment: an image/JavaScript-based visualization framework and a Java applet-based framework. Both frameworks display interaction networks in the same style, so switching between them involves no user acclimatization. While the latter framework requires a Java Runtime Environment to be installed on the client computer and thus has higher processor and workspace requirements than a simple computer image, it has several advantages, especially when it comes to visualizing larger networks: network nodes are movable and can be re-arranged using different layout methods. Network viewing is further facilitated through the zoom function controlled by the computer mouse wheel. In this Java-based visualization environment, gene / protein expression data can be overlaid on the nodes of a currently viewed network to enable the interaction network-based interpretation of such data. Here, we describe the common properties of visualized networks.
Elements of ConsensusPathDB interaction networks
Functional interaction graphs displayed in ConsensusPathDB (example in Figure 3)
contain two types of nodes (bipartite network) and several types of
edges. Both nodes and edges differ in their shape and
color. Rectangular nodes represent physical entities (genes,
proteins, compounds, etc.) and circular nodes represent interactions
between physical entities. The color
of physical entity nodes shows the type of the physical
entity, and the same is true for interaction nodes. For example,
biochemical reactions are shown as green circles, and physical
interactions are shown as orange circles. The primary name of each
physical entity in the graph is displayed inside its
node by default. Post-translational modifications and subcellular locations
of physical entities are shown in square brackets after the name, if
the user has chosen those as differentiation criteria for interaction
mapping when visualizing selected interactions of entities or pathways. Edges linking an
interaction participant with the corresponding interaction node also differ
in their style, arrow shape, and color. The edge style and arrow
shape indicate the role of the corresponding physical entity in the
interaction. The origin of interactions is indicated by the edge
color: multiple edges with the same style and arrow shape but
with different colors may exist between a physical entity and an
interaction, indicating that the interaction is present in multiple
databases. Encoding origin information as an edge attribute
(i.e. edge color) and not as an interaction attribute
allows more detailed attribution of origin information: think for
example of an interaction which is present in two different databases
but for which different enzymes are provided in each database.
A detailed graph legend is available in the visualization environment and is shown by
clicking on the graph legend link from the menu panel of the
visualization environment (Figure 4).
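The benefit of storing the origin on edges rather than on interaction nodes can be illustrated with a minimal sketch. It assumes the bipartite graph is held as a list of (entity, interaction, role, source) tuples; this layout, the interaction name `R1` and the source names are hypothetical.

```python
from collections import defaultdict

# One reaction reported by two hypothetical sources that agree on the
# substrate and the product but name different enzymes.
edges = [
    ("glucose", "R1", "substrate", "DB_A"),
    ("glucose", "R1", "substrate", "DB_B"),
    ("glucose-6-phosphate", "R1", "product", "DB_A"),
    ("glucose-6-phosphate", "R1", "product", "DB_B"),
    ("HK1", "R1", "enzyme", "DB_A"),
    ("GCK", "R1", "enzyme", "DB_B"),
]


def sources_per_participant(edges, interaction):
    """Collect, for each (entity, role) of an interaction, the reporting sources."""
    result = defaultdict(set)
    for entity, inter, role, source in edges:
        if inter == interaction:
            result[(entity, role)].add(source)
    return dict(result)


by_participant = sources_per_participant(edges, "R1")
```

Because every edge carries its own source, the sketch records that both sources support the substrate and the product, while each enzyme is supported by only one of them. A single source attribute on the interaction node could not express this.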

The visualization environments of ConsensusPathDB provide several functionalities that give detailed entity and interaction information and facilitate graph viewing. Synonyms of interactions and physical entities are shown in tooltips when the nodes are pointed at with the mouse cursor. For physical entities, external database identifiers are also displayed in the tooltip. Protein complex compositions are displayed instead of trivial complex names by clicking on the node representing the complex and selecting "switch to component view". For the static image-based visualization framework, we have implemented a JavaScript tool that mimics the hand tool of other well-known viewing software such as Ghostscript viewers and Adobe Reader. To scroll the graph using the hand tool, just click anywhere on a white space in the graph and move the mouse while holding down the left mouse button. Another facility which ensures convenient work with larger graphs is the locate tool: you can locate any molecule present in the graph by entering a name or an accession number in the input field at the top right corner of the visualization environment page.
Interactive modification of interaction networks
Interaction network graphs displayed in the visualization
environment are dynamic. Not only can the visibility of certain
interaction participants be toggled, but interactions can also be
added to or removed from the graph.
The names of physical entities that are normally shown inside the
rectangles representing the corresponding physical entities can be
hidden, which is especially useful when the user wishes to hide
the names of certain, less important entities in the interaction
graph, thus emphasizing the remaining objects. Moreover,
physical entities with a secondary role (e.g. enzymes, modifiers)
can be hidden, for example when the user wishes to visualize only
the mass flow in a biochemical pathway. The visibility of
names and of complete physical entity nodes can be toggled for
single physical entities by clicking on the corresponding node with
the left mouse button and selecting the appropriate option from
the tooltip menu, or for all physical entities in the
network-graph by choosing the appropriate options from the
visibility settings menu at the menu panel in the top
region of the visualization environment page.
Apart from manipulating existing physical entity nodes, the user
can hide single interactions, lists of interactions or connected
components of the graph (again by choosing the corresponding options on
the tooltip menus of nodes or the options from the menu
panel). New interactions can be added to the interaction graph by
expanding physical entity nodes with their further interactions
(left-click on an entity node -> expand node in the tooltip
menu) or by searching for physical entities or pathways that are
possibly not present in the graph, and selecting their
interactions of interest (menu panel -> misc functions ->
add interactions).
The user can always undo the last graph modification by
clicking on the "undo last graph changes" button on the menu
panel.
For interaction networks displayed in the visualization
interface, the user can view a summary of the number of physical
entity and interaction nodes of specific types, as well as an interaction
mapping summary. To access this feature, click on misc
functions in the menu panel -> network statistics.
Visualization graphs can be exported from the visualization
environment by clicking on misc functions (in the menu
panel) -> export network. Several possibilities are
provided: the interaction network-graph can be exported as a
graphics file in several formats, including png, jpg,
postscript and others, as a BioPAX file or as a ConsensusPathDB model
dump. BioPAX is an XML-based file format that carries information
on interaction systems and is currently supported by many
software packages for e.g. biochemical system modeling. The
ConsensusPathDB model dump can be imported into the visualization
environment (see Section File upload and mapping) and worked
with at a later time point. Currently, model
dumps can be imported only if the ConsensusPathDB version has not
changed in the meantime.
ConsensusPathDB offers two statistical approaches to analyze user-specified lists of genes obtained, e.g., by microarray experiments (commonly, a list of differentially expressed genes between two phenotypes). The first approach is over-representation analysis, where predefined lists of functionally associated genes (pathways, Gene Ontology (GO) categories and neighborhood-based entity sets, explained below) are tested for over-representation in the user-specified list based on the hypergeometric test. The second approach is based on the Wilcoxon signed-rank test and takes as input genes with exactly two measurement values, typically expression values in two distinct phenotypes. While the over-representation functionality typically takes as input a relatively short, non-weighted list of "special" (e.g., differentially expressed) genes, for the Wilcoxon enrichment analysis approach, genome-wide expression data containing measurements of possibly thousands of genes are the preferred input.
Over-representation analysis
The user can provide a list of identifiers of interesting genes or proteins, e.g. of genes that are significantly over-
or underexpressed in a certain phenotype compared to a control phenotype. The gene identifiers are mapped to physical entities in ConsensusPathDB.
Over-represented sets are searched among currently three categories of predefined gene sets: network neighborhood-based sets,
pathway-based sets and Gene Ontology-based sets. For each of the predefined sets, a p-value is calculated according to the hypergeometric test based on
the number of physical entities present in both the predefined set and user-specified list of physical entities. If no
background is uploaded by the user (corresponding to the list of IDs of all measured entities in the experiment), the
background parameter value for the hypergeometric test will depend on the type of the accession numbers used for the
input list: more precisely, the background size is the number of ConsensusPathDB entities that are annotated with an ID of
the type the user has provided, and participate in at least one pathway / GO category / neighborhood-based entity set (depending on
which of these predefined classes are considered by the user). The size of the tested predefined sets is also
corrected to the number of set members that are annotated with an ID of the user-specified ID type (since the rest of the
set members can never occur in the candidates list, using the specific namespace). This "effective" set size is provided in the analysis results in brackets after the absolute size. The p-values are corrected for
multiple testing using the false discovery rate method and are available as q-values on the results page. Sets whose hypergeometric p-value
passes the threshold defined by the user are listed, and for each set, the set size (the absolute size, as well as the corrected size), the over-representation p-value, the q-value and all
interaction resources contributing to the construction of the set are provided. The set definition, i.e. the set
centers in the case of neighborhood-based sets (see below), the pathway name in the case of pathway-based sets or the GO category name in the case of GO category over-representation, is also
displayed. For sets with a reasonable size, more details like the list of all set members and the list of interactions
connecting the physical entities can be viewed. In the case of neighborhood-based sets and pathways, bar charts are displayed that summarize the GO cellular
component, molecular function and biological process annotations (flattened to GO level 2) of the members of the corresponding set. Each of the three charts shows the five most common annotations from the corresponding GO category. For each entity set
of reasonable size, the underlying interaction network-graph can be viewed in the ConsensusPathDB visualization
interface, where physical entities from the uploaded list are marked with a red border.
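The over-representation calculation described above can be sketched as follows. This is a minimal illustration, not the ConsensusPathDB implementation: it assumes the p-value is the upper tail of the hypergeometric distribution, and it assumes the Benjamini-Hochberg procedure for the q-values, since the text only names "the false discovery rate method". All function names and the example data are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, list_size, background_size):
    """Upper-tail probability P(X >= overlap) of the hypergeometric distribution."""
    upper = min(set_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Convert p-values to q-values (assumed FDR method: Benjamini-Hochberg)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def over_representation(user_ids, gene_sets, background):
    """Test every predefined set for over-representation in the user list."""
    background = set(background)
    candidates = set(user_ids) & background
    names = list(gene_sets)
    rows = []
    for name in names:
        # "Effective" set size: only members that can occur in the input list.
        effective = set(gene_sets[name]) & background
        overlap = len(effective & candidates)
        p = hypergeom_pvalue(overlap, len(effective),
                             len(candidates), len(background))
        rows.append({"size": len(gene_sets[name]),
                     "effective_size": len(effective),
                     "overlap": overlap, "p": p})
    for row, q in zip(rows, benjamini_hochberg([r["p"] for r in rows])):
        row["q"] = q
    return dict(zip(names, rows))


# Hypothetical background of 20 genes and two predefined sets.
background = {f"g{i}" for i in range(1, 21)}
gene_sets = {
    "pathway_X": {"g1", "g2", "g3", "g4", "unmapped"},
    "pathway_Y": {"g10", "g11", "g12", "g13", "g14"},
}
results = over_representation({"g1", "g2", "g3"}, gene_sets, background)
```

In this toy example `pathway_X` has an absolute size of 5 but an effective size of 4, because one member is not part of the background and can therefore never appear in the input list.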
Neighborhood-based sets are sets of physical entities that are connected to each other with one or more functional interactions. Each neighborhood-based set has a central physical entity (or simply a center) and a radius denoting the maximal distance in terms of number of interactions separating the center from the rest of the entities in the set. The distance between physical entities involved in some functional interaction is 1. The same is true also for enzymes catalyzing neighboring biochemical reactions (i.e. reactions that have common primary participants, e.g. when a product of the first reaction is a substrate in the second). To avoid redundancy in sets, we allow sets to have more than one center, and merge sets with different centers that have the same members.
We have predefined a large number of neighborhood-based sets, according to different centers and radii. The user can choose to search for enriched neighborhood-based sets with a radius of 1 or 2 interactions and can additionally impose constraints on the sets, such as minimal size (i.e. minimal number of different proteins and genes contained in the set), minimal connectivity index, minimal overlap with the input list, and a p-value threshold. The connectivity index, which quantifies the level of interconnectedness within neighborhood-based sets, is defined as the number of actual interaction edges between the entities in the set divided by the number of all possible edges connecting them, which is k*(k-1)/2 for a set of size k.
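The construction of a neighborhood-based set and its connectivity index can be sketched as follows. The sketch assumes a symmetric adjacency dict in which two entities are neighbors if their distance, as defined above, is 1; the function names and the toy network are hypothetical.

```python
from collections import deque
from itertools import combinations


def neighborhood_set(adjacency, center, radius):
    """Return all entities within `radius` interactions of the center."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


def connectivity_index(adjacency, members):
    """Fraction of the k*(k-1)/2 possible edges that are actually present."""
    k = len(members)
    if k < 2:
        return 0.0
    actual = sum(
        1 for a, b in combinations(sorted(members), 2)
        if b in adjacency.get(a, ())
    )
    return actual / (k * (k - 1) / 2)


# Toy network; the adjacency dict must be symmetric.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
radius_1 = neighborhood_set(adjacency, "A", 1)
radius_2 = neighborhood_set(adjacency, "A", 2)
index_1 = connectivity_index(adjacency, radius_1)
index_2 = connectivity_index(adjacency, radius_2)
```

In the toy network the radius-1 set around A is fully connected, while the radius-2 set contains 4 of the 6 possible edges and therefore has a lower connectivity index.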

Apart from neighborhood-based entity sets, ConsensusPathDB contains pre-defined pathway-based sets from 9 pathway databases. A pathway-based set contains all the proteins and genes that are involved in a curated biochemical pathway. Gene Ontology (GO) based sets contain genes that are together annotated with a specific GO term. In ConsensusPathDB, GO categories from levels 2 and 3 of the GO hierarchy are available for pathway analysis.
Wilcoxon enrichment analysis
The Wilcoxon enrichment analysis method carries out a paired Wilcoxon signed-rank test for each neighborhood-based entity set (NEST) / GO category / pathway based on the measurement values of its members. The measurement values for every gene / protein, uploaded by the user, typically reflect genome-wide gene expression or proteome-wide protein abundance in two different phenotypes. For every uploaded gene or protein, exactly two values must be supplied in the input form (or uploaded file). The Wilcoxon test assigns a p-value to each functional set based on how probable it is that the combined measurement differences of genes in the functional set between the phenotypes have appeared by chance. Q-values are calculated using the same method as in the over-representation analysis approach.
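The paired test applied to the members of one functional set can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the normal approximation of the signed-rank statistic without tie or continuity correction and returns a two-sided p-value, which is rough for small sets. Whether ConsensusPathDB uses an exact or an approximate test is not stated here, and the function name and example values are hypothetical.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(pairs):
    """Paired Wilcoxon signed-rank test (normal approximation, two-sided).

    `pairs` holds one (value_in_phenotype_1, value_in_phenotype_2) tuple
    per gene of the functional set. Returns (W_plus, p_value).
    """
    diffs = [b - a for a, b in pairs if b != a]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return w_plus, erfc(abs(z) / sqrt(2))


# Hypothetical expression values of five set members in two phenotypes.
pairs = [(1.0, 2.0), (2.0, 3.5), (3.0, 5.0), (1.5, 4.0), (2.5, 5.5)]
w_plus, p_value = wilcoxon_signed_rank(pairs)
```

In a full analysis this function would be called once per functional set, and the resulting p-values would then be corrected for multiple testing in the same way as in the over-representation analysis.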
Through the web interface of ConsensusPathDB, users can upload interaction networks in BioPAX, PSI-MI or SBML format. Upon upload, physical entities and functional interactions are compared against the content of ConsensusPathDB. This is done based on identifiers (UniProt, KEGG, ChEBI, ...) in the case of physical entities and on participant composition in the case of interactions. The uploaded networks are visualized, and for interactions present in ConsensusPathDB, the source databases are displayed. Networks can be expanded in the context of the database content by adding further interactions, and edited.
SBML files should have physical entity annotations as in the BioModels model repository.
The use of ConsensusPathDB is free of charge for academic users. Commercial users should contact Dr. Ralf Herwig (herwig [at] molgen.mpg.de). Data stored in ConsensusPathDB is available under the license terms of each contributing database.
Disclaimer
Although best efforts are always applied, the developers of ConsensusPathDB do not assume any legal responsibility for correctness or usefulness of the information in ConsensusPathDB.
Acknowledgements
ConsensusPathDB is being developed by the Bioinformatics group of the Vertebrate Genomics Department at the Max-Planck-Institute for Molecular Genetics in Berlin, Germany. The project was supported by the EMBRACE and CARCINOGENOMICS projects that are funded by the European Commission within its 6th Framework Programme under the thematic area "Life Sciences, Genomics and Biotechnology for Health" (LSHG-CT-2004-512092 and LSHB-CT-2006-037712); 7th Framework Programme project APO-SYS (HEALTH-F4-2007-200767); German Federal Ministry of Education and Research within the NGFN-2 program (SMP-Protein, FKZ01GR0472); Max Planck Society within its International Research School program (IMPRS-CBSC).
For more information, please contact kamburov [at] molgen.mpg.de.