ConsensusPathDB tutorial

Content

   General information

   Introduction

   Database content information

   Search in ConsensusPathDB
     Interactions of specific molecules or pathways
     Shortest path of functional interactions between molecules

   Interaction network visualization environment
     Elements of ConsensusPathDB interaction networks
     Visualization facilities
     Interactive modification of interaction networks
     Miscellaneous functions

   Pathway analysis
     Over-representation analysis
     Wilcoxon enrichment analysis

   Network upload

   More information
     Licensing information
     Disclaimer
     Acknowledgements
  General information

Citation:

Atanas Kamburov, Christoph Wierling, Hans Lehrach, Ralf Herwig (2009) ConsensusPathDB--a database for integrating human functional interaction networks. Nucleic Acids Research 37(Database issue):D623-D628

Contact:

For questions and comments regarding ConsensusPathDB, please contact Atanas Kamburov: kamburov@molgen.mpg.de.


  Introduction

ConsensusPathDB is a database that integrates different types of functional interactions between physical entities in the cell, such as genes, RNA, proteins, protein complexes and metabolites, in order to assemble a more complete and less biased picture of cellular biology. Currently, ConsensusPathDB contains metabolic and signaling reactions, physical protein interactions and gene regulatory interactions in human, mouse and yeast. Physical entities from the external resources are mapped to each other on the basis of common identifiers like UniProt and Entrez. Interactions with matching primary participants (i.e., substrates and products in the case of biochemical reactions, interactors in the case of protein interactions and regulated genes in the case of gene regulatory interactions) are also mapped and grouped together according to similarity.
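The identifier-based mapping of physical entities described above can be pictured with a minimal sketch. It assumes that every source record is a (source, name, identifiers) tuple and that two records denote the same entity if they share at least one identifier. The record layout, the function name merge_entities and the example records are illustrative assumptions, not the actual ConsensusPathDB implementation.

```python
def merge_entities(records):
    """Group records that share at least one identifier (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # identifier -> index of the first record carrying it
    for idx, (_source, _name, identifiers) in enumerate(records):
        for identifier in identifiers:
            if identifier in first_seen:
                # Same identifier seen before: join the two groups.
                parent[find(idx)] = find(first_seen[identifier])
            else:
                first_seen[identifier] = idx

    groups = {}
    for idx, record in enumerate(records):
        groups.setdefault(find(idx), []).append(record)
    return list(groups.values())


# Hypothetical example records (layout and content are assumptions).
records = [
    ("Reactome", "TP53", {"uniprot:P04637"}),
    ("KEGG", "p53", {"uniprot:P04637", "entrez:7157"}),
    ("HumanCyc", "PFKL", {"uniprot:P17858"}),
]
merged = merge_entities(records)
for group in merged:
    print([f"{source}:{name}" for source, name, _ in group])
```

In this toy input the first two records end up in one merged entity because they share a UniProt accession, while the third record stays on its own.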
The database content is updated every three months with the newest available versions of the source databases.
The functionalities of the ConsensusPathDB web interface are described below.


  Database content information

Clicking on the link "content information" on the left panel of the ConsensusPathDB start page will take you to a summary of the current interaction content in the database. Most notably, the entity and interaction overlap tables summarize the number of unique physical entities and interactions present in each integrated resource, as well as the number of overlapping interactions and physical entities between resources. Figure 1 shows the overlap tables for Release 1 of ConsensusPathDB as an example.



Figure 1

Overlap tables showing the number of overlapping interactions and physical entities across functional interaction resources (ConsensusPathDB release 1, 12.12.2007). The numbers on the diagonals denote the numbers of unique interactions and physical entities present in each source database, respectively. Example: the HumanCyc database contains 1169 unique interactions involving 3613 unique physical entities overall. 113 of the HumanCyc interactions and 775 of the physical entities are also present in the Reactome database.
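How such an overlap table is read can be illustrated with a short sketch. It assumes that the content of each resource is available as a set of already mapped interaction (or entity) keys. The function name overlap_table and the toy data are hypothetical and much smaller than the real tables.

```python
def overlap_table(resources):
    """Return {(a, b): number of items shared by resources a and b}.

    The diagonal entries (a, a) equal the number of unique items in a.
    """
    names = sorted(resources)
    return {
        (a, b): len(resources[a] & resources[b])
        for a in names
        for b in names
    }


# Hypothetical toy content: each string stands for one mapped interaction.
resources = {
    "HumanCyc": {"r1", "r2", "r3"},
    "Reactome": {"r2", "r3", "r4", "r5"},
}
table = overlap_table(resources)
for (a, b), count in sorted(table.items()):
    print(a, b, count)
```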


  Search in ConsensusPathDB

One way to access the integrated interaction information in ConsensusPathDB is to search for specific interactions. Currently, the web interface to ConsensusPathDB offers two possibilities to do this.

Interactions of specific molecules or pathways

The ConsensusPathDB user can search for interactions of specific physical entities or pathways. First, search terms are provided in the form of trivial names or accession numbers (e.g., UniProt or KEGG accession numbers). It is possible to search for multiple physical entities or pathways simultaneously by entering query keywords in separate rows. Search results summarize the names of matching objects and, if the objects have been searched by accession numbers, the accession numbers matching the query. Additionally, external web links are provided for each hit that show the origin of the physical entity or pathway and take the user to the web site containing the original information about the corresponding item. Normally, pathways have only one external link (because pathways from different resources are not compared with each other), whereas physical entities may have multiple links (because matching physical entities from different resources are merged in ConsensusPathDB).
After selecting relevant pathways or physical entities, their functional interactions are listed. If the selected pathways contain subpathways or if the physical entities constitute groups, e.g. protein families, then the interactions of the subpathways or group members, respectively, are also listed. At this stage, each interaction that is displayed has a single source. The user is able to specify mapping criteria according to which similar interactions are to be merged. Interactions with matching primary participants are considered similar. Which interactions are considered identical depends on the user's settings: the user is able to specify whether the primary participants of similar interactions must have matching modification patterns, subcellular localization and/or stoichiometry (in the case of biochemical reactions) in order for the similar interactions to be considered identical. According to these settings, interactions are merged together and displayed in the visualization environment of ConsensusPathDB, which is described below.
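The effect of the mapping criteria can be sketched as follows. The sketch assumes that each interaction is a dictionary with a source and a list of primary participants, and that the three boolean options correspond to the settings named above. The data layout, the function name merge_interactions and the example interactions are hypothetical.

```python
from collections import defaultdict


def merge_interactions(interactions, match_modifications=False,
                       match_localization=False, match_stoichiometry=False):
    """Group interactions whose primary participants match under the chosen criteria."""

    def key(interaction):
        parts = []
        for participant in interaction["participants"]:
            item = [participant["entity"]]
            # Each enabled criterion makes the comparison stricter.
            if match_modifications:
                item.append(participant.get("modification"))
            if match_localization:
                item.append(participant.get("location"))
            if match_stoichiometry:
                item.append(participant.get("stoichiometry"))
            parts.append(tuple(item))
        return frozenset(parts)

    merged = defaultdict(list)
    for interaction in interactions:
        merged[key(interaction)].append(interaction["source"])
    return dict(merged)


# Hypothetical example: the same two entities, annotated with different locations.
interactions = [
    {"source": "Reactome",
     "participants": [{"entity": "A", "location": "cytosol"},
                      {"entity": "B", "location": "cytosol"}]},
    {"source": "HumanCyc",
     "participants": [{"entity": "A", "location": "nucleus"},
                      {"entity": "B", "location": "cytosol"}]},
]
loose = merge_interactions(interactions)
strict = merge_interactions(interactions, match_localization=True)
print(len(loose), len(strict))
```

With the default settings the two example interactions are merged into one interaction with two sources. When matching subcellular localization is required they remain separate.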

Shortest path of functional interactions between molecules

Apart from the standard search for interactions of physical entities and pathways, the web interface of ConsensusPathDB features searching for the shortest path(s) of functional interactions that link a pair of physical entities with each other. Here, the user specifies a path "start" and a path "end". One possible shortest path between both is calculated and displayed (Figure 2). The user is able to exclude particular physical entities from the shortest path, which is especially useful when non-specific hubs like ubiquitin and ATP are present in the path. Interaction paths of interest can be visualized in the visualization environment of ConsensusPathDB, which is described below.



Figure 2

One shortest path of functional interactions linking PUMA with PFKL, excluding ATP. The path involves 4 proteins: PUMA, P53, HSPA5 and PFKL. A dashed line between one physical entity and the next indicates that there is at least one functional interaction containing the two proteins. If P53 and/or HSPA5 are excluded, another shortest path will be calculated that does not involve these proteins. A detailed interaction network-graph showing the type and the origin of relevant interactions linking the physical entities is shown by clicking on "Visualize path".
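A shortest-path search with excluded entities can be sketched with a breadth-first search. The sketch assumes that the interaction network has been reduced to an adjacency dictionary in which two entities are neighbors if they take part in at least one common interaction. The function name shortest_path and the toy network are hypothetical: the network only mirrors the path of Figure 2 and adds an assumed shortcut via ATP. The algorithm actually used by ConsensusPathDB is not described here.

```python
from collections import deque


def shortest_path(neighbors, start, end, excluded=()):
    """Breadth-first search for one shortest path, skipping excluded entities."""
    blocked = set(excluded)
    if start in blocked or end in blocked:
        return None
    previous = {start: None}  # entity -> predecessor on the path from start
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            # Walk back along the predecessor links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, ()):
            if nxt not in previous and nxt not in blocked:
                previous[nxt] = node
                queue.append(nxt)
    return None  # no path exists under the given exclusions


# Hypothetical toy network (adjacency is an assumption for illustration).
neighbors = {
    "PUMA": ["ATP", "P53"],
    "P53": ["PUMA", "HSPA5"],
    "HSPA5": ["P53", "PFKL"],
    "ATP": ["PUMA", "PFKL"],
    "PFKL": ["ATP", "HSPA5"],
}
print(shortest_path(neighbors, "PUMA", "PFKL"))
print(shortest_path(neighbors, "PUMA", "PFKL", excluded=["ATP"]))
```

In the toy network the unrestricted search runs through the hub ATP, while excluding ATP yields the longer but more specific path via P53 and HSPA5.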


  Interaction network visualization environment

Interaction networks can be viewed in an interactive visualization environment. The web interface user has two choices for a visualization environment: an image/JavaScript-based visualization framework and a Java applet-based framework. Both frameworks display interaction networks in the same style, so switching between them requires no user acclimatization. While the latter framework requires a Java Runtime Environment to be installed on the client computer and thus has higher processor and workspace requirements than a simple computer image, it has several advantages, especially when it comes to visualizing larger networks: network nodes are movable and can be re-arranged using different layout methods. Network viewing is further facilitated through the zoom function controlled by the computer mouse wheel. In this Java-based visualization environment, gene / protein expression data can be overlaid on the nodes of a currently viewed network to enable the interaction network based interpretation of such data. Here, we describe the common properties of visualized networks.

Elements of ConsensusPathDB interaction networks



Figure 3

An example functional interaction network automatically generated by the visualization environment of ConsensusPathDB. The network contains three biochemical reactions (green circles), two protein-protein interactions (orange circles) and one gene regulatory interaction (purple circle) involving overall seven proteins (light blue rectangles), five protein complexes (darker blue rectangles), a metabolite (light brown rectangle) and a gene (purple rectangle). Multiple edges connecting the functional interaction nodes (circles) with the physical entity nodes (rectangles) show that the interactions have multiple sources coded by the edge colors. The role of each interaction participant is denoted by the edge style and arrow shape (see Figure 4 for details).

Functional interaction graphs displayed in ConsensusPathDB (example in Figure 3) contain two types of nodes (bipartite network) and several types of edges. Both nodes and edges differ in their shape and color. Rectangular nodes represent physical entities (genes, proteins, compounds, etc.) and circular nodes represent the interactions between them. The color of a physical entity node shows the type of the physical entity, and the same is true for interaction nodes. For example, biochemical reactions are shown as green circles, and physical interactions are shown as orange circles. The primary name of each physical entity in the graph is displayed inside the representing node by default. Post-translational modifications and subcellular locations of physical entities are shown in square brackets after the name, if the user has chosen those as differentiation criteria for interaction mapping when visualizing selected interactions of entities or pathways. Edges linking an interaction participant with the corresponding interaction node also differ in their style, arrow shape, and color. The edge style and arrow shape indicate the role of the corresponding physical entity in the interaction. The origin of interactions is indicated by the edge color: multiple edges with the same style and arrow shape but with different colors may exist between a physical entity and an interaction, indicating that the interaction is present in multiple databases. Encoding origin information as an edge attribute (i.e. edge color) and not as an interaction attribute allows more detailed attribution of origin information: think for example of an interaction which is present in two different databases but for which different enzymes are provided in each database.
A detailed graph legend is available in the visualization environment and is shown by clicking on the graph legend link from the menu panel of the visualization environment (Figure 4).




Figure 4

Legend of the interaction network-graphs generated by the visualization environment of ConsensusPathDB.

Visualization facilities

The visualization environments of ConsensusPathDB provide several functionalities that give detailed entity and interaction information and facilitate graph viewing. Synonyms of interactions and physical entities are shown in tooltips when the corresponding node is pointed at with the mouse cursor. For physical entities, external database identifiers are also displayed in the tooltip. Protein complex compositions are displayed instead of trivial complex names by clicking on the node representing the complex and selecting "switch to component view". For the static-image based visualization framework, we have implemented a JavaScript tool that mimics the hand-tool of other well-known viewing software such as Ghostscript viewers and Adobe Reader. To scroll the graph using the hand-tool, just click anywhere on a white space in the graph and move the mouse while holding down the left mouse button. Another facility which ensures convenient work with larger graphs is the locate tool: you can locate any molecule present in the graph by entering a name or an accession number in the input field at the top right corner of the visualization environment page.

Interactive modification of interaction networks

Interaction network-graphs displayed in the visualization environment are dynamic. Not only can the visibility of certain interaction participants be toggled, but interactions can also be added to or removed from the graph.
The names of physical entities that are normally shown inside the rectangles representing the corresponding physical entities can be hidden, which is especially useful when the user wishes to hide the names of certain, less important entities in the interaction graph and thus emphasize the remaining objects. Moreover, physical entities with a secondary role (e.g. enzymes, modifiers) can be hidden, for example when the user wishes to visualize only the mass flow in a biochemical pathway. The visibility of names and of complete physical entity nodes can be toggled for single physical entities by clicking on the corresponding node with the left mouse button and selecting the appropriate option from the tooltip menu, or for all physical entities in the network-graph by choosing the appropriate options from the visibility settings menu at the menu panel in the top region of the visualization environment page.
Apart from manipulating existing physical entity nodes, the user can hide single interactions, lists of interactions or connected components of the graph (again by choosing the corresponding options on the tooltip menus of nodes or the options from the menu panel). New interactions can be added to the interaction graph by expanding physical entity nodes with their further interactions (left-click on an entity node -> expand node in the tooltip menu) or by searching for physical entities or pathways that are possibly not present in the graph, and selecting their interactions of interest (menu panel -> misc functions -> add interactions).
The user can always undo the last graph modification by clicking on the undo last graph changes button on the menu panel.

Miscellaneous functions

For interaction networks displayed in the visualization interface, the user can view a summary of the number of physical entity and interaction nodes of specific types, as well as an interaction mapping summary. To access this feature, click on misc functions in the menu panel -> network statistics.
Visualization graphs can be exported from the visualization environment by clicking on misc functions (in the menu panel) -> export network. Several possibilities are provided: the interaction network-graph can be exported as a graphics file in several formats, including png, jpg, postscript and others, as a BioPAX file or as a ConsensusPathDB model dump. BioPAX is an XML-based file format that carries information on interaction systems and is currently supported by many software packages, e.g. for biochemical system modeling. The ConsensusPathDB model dump can be imported into the visualization environment (see Section Network upload) and worked with at a later time point. Currently, model dumps can be imported only if the ConsensusPathDB version has not changed in the meantime.


  Pathway analysis

ConsensusPathDB offers two statistical approaches to analyze user-specified lists of genes obtained, e.g., by microarray experiments (commonly, a list of differentially expressed genes between two phenotypes). The first approach is over-representation analysis, where predefined lists of functionally associated genes (pathways, Gene Ontology (GO) categories and neighborhood-based entity sets, explained below) are tested for over-representation in the user-specified list based on the hypergeometric test. The second approach is based on the Wilcoxon signed-rank test and takes as input genes with exactly two measurement values, typically expression values in two distinct phenotypes. While the over-representation functionality typically takes as input a relatively short, non-weighted list of "special" (e.g., differentially expressed) genes, for the Wilcoxon enrichment analysis approach, genome-wide expression data containing measurements of possibly thousands of genes are the preferred input.

Over-representation analysis

The user can provide a list of identifiers of interesting genes or proteins, e.g. of genes that are significantly over- or underexpressed in a certain phenotype compared to a control phenotype. The gene identifiers are mapped to physical entities in ConsensusPathDB. Over-represented sets are searched among currently three categories of predefined gene sets: network neighborhood-based sets, pathway-based sets and Gene Ontology-based sets. For each of the predefined sets, a p-value is calculated according to the hypergeometric test based on the number of physical entities present in both the predefined set and the user-specified list of physical entities. If no background is uploaded by the user (corresponding to the list of IDs of all measured entities in the experiment), the background parameter value for the hypergeometric test will depend on the type of the accession numbers used for the input list: more precisely, the background size is the number of ConsensusPathDB entities that are annotated with an ID of the type the user has provided, and participate in at least one pathway / GO category / neighborhood-based entity set (depending on which of these predefined classes are considered by the user). The size of the tested predefined sets is also corrected to the number of set members that are annotated with an ID of the user-specified ID type (since the rest of the set members can never occur in the candidates list, using the specific namespace). This "effective" set size is provided in the analysis results in brackets after the absolute size. The p-values are corrected for multiple testing using the false discovery rate method and are available as q-values on the results page.
Sets whose hypergeometric p-value passes the threshold defined by the user are listed, and for each set, the set size (the absolute size, as well as the corrected size), the over-representation p-value, the q-value and all interaction resources contributing to the construction of the set are provided. The set definition in terms of set centers in the case of neighborhood-based sets (see below), pathway name in the case of pathway-based sets or GO category name in the case of GO category over-representation is also displayed. For sets with a reasonable size, more details like the list of all set members and the list of interactions connecting the physical entities can be viewed. In the case of neighborhood-based sets and pathways, bar charts are displayed that summarize the GO cellular component, molecular function and biological process annotations (flattened to GO level 2) of the members of the corresponding set. Each of the three charts shows the five most common annotations from the corresponding GO category. For each entity set of reasonable size, the underlying interaction network-graph can be viewed in the ConsensusPathDB visualization interface, where physical entities from the uploaded list are marked with a red border.
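The over-representation p-value and its correction can be illustrated with a minimal sketch. The first function computes the upper tail of the hypergeometric distribution from the overlap, the effective set size, the input list size and the background size. The second function applies the Benjamini-Hochberg procedure, which is an assumption here: the text above only states that a false discovery rate method is used. All function names and example numbers are hypothetical.

```python
from math import comb


def hypergeom_p_value(overlap, set_size, list_size, background_size):
    """Probability of seeing at least `overlap` set members in the input list
    when `list_size` entities are drawn at random from the background."""
    upper = min(set_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Go from the largest to the smallest p-value and enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical numbers: (overlap, effective set size) for three predefined
# sets, an input list of 50 entities and a background of 5000 entities.
tested_sets = [(8, 40), (2, 100), (5, 20)]
p_values = [hypergeom_p_value(k, n, 50, 5000) for k, n in tested_sets]
q_values = benjamini_hochberg(p_values)
for (k, n), p, q in zip(tested_sets, p_values, q_values):
    print(f"overlap={k} set_size={n} p={p:.3g} q={q:.3g}")
```

The set size passed to the test should be the effective set size described above, since only members annotated in the user's identifier namespace can appear in the input list.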
Neighborhood-based sets are sets of physical entities that are connected to each other with one or more functional interactions. Each neighborhood-based set has a central physical entity (or simply a center) and a radius denoting the maximal distance in terms of number of interactions separating the center from the rest of the entities in the set. The distance between physical entities involved in some functional interaction is 1. The same is true also for enzymes catalyzing neighboring biochemical reactions (i.e. reactions that have common primary participants, e.g. when a product of the first reaction is a substrate in the second). To avoid redundancy in sets, we allow sets to have more than one center, and merge sets with different centers that have the same members.
We have predefined a large number of neighborhood-based sets, according to different centers and radii. The user can choose to search for enriched neighborhood-based sets with a radius of 1 or 2 interactions and can additionally impose constraints on the sets, such as minimal size (i.e. minimal number of different proteins and genes contained in the set), minimal connectivity index, minimal overlap with the input list, and p-value. The connectivity index, which quantifies the level of interconnectedness within neighborhood-based sets, is defined as the number of interaction edges actually present between the entities in the set divided by the number of all possible edges connecting them, which is k*(k-1)/2 for a set of size k.
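As a worked example of the connectivity index, the following sketch counts the distinct edges between set members and divides by k*(k-1)/2. It assumes that interactions are given as pairs of entities. The function name connectivity_index and the example data are hypothetical.

```python
def connectivity_index(members, edges):
    """Fraction of present edges among all k*(k-1)/2 possible member pairs."""
    k = len(members)
    if k < 2:
        return 0.0  # no pairs exist, so the index is defined as 0 in this sketch
    member_set = set(members)
    # Count each unordered pair once and ignore edges leaving the set.
    present = {
        frozenset((a, b))
        for a, b in edges
        if a != b and a in member_set and b in member_set
    }
    return len(present) / (k * (k - 1) / 2)


# Hypothetical example: 4 members, 3 of the 6 possible edges are present.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E")]
print(connectivity_index(["A", "B", "C", "D"], edges))  # 0.5
```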



Figure 6

An example for neighborhood-based sets. Assume that four interactions exist, the first between molecules A and B, the second between B and C, the third between B and D, and the fourth between C and E. These define several neighborhood-based sets according to the set center and the radius: for example, the set with center B and radius 1 contains A, B, C and D (i.e. the set center B and its next neighbors A, C and D), the set with the same center but with radius 2 contains A, B, C, D and E (the set center B, its next neighbors A, C and D, and B's second-next neighbor E).
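The example of Figure 6 can be reproduced with a breadth-first search that stops at the given radius. The sketch assumes that all interactions are binary pairs, and the function name neighborhood_set is hypothetical. It does not cover the special case of enzymes catalyzing neighboring biochemical reactions, nor the merging of sets with different centers.

```python
from collections import deque


def neighborhood_set(interactions, center, radius):
    """Entities at most `radius` interactions away from `center`."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the radius
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


# The four interactions from the example above.
interactions = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E")]
print(sorted(neighborhood_set(interactions, "B", 1)))  # ['A', 'B', 'C', 'D']
print(sorted(neighborhood_set(interactions, "B", 2)))  # ['A', 'B', 'C', 'D', 'E']
```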

Apart from neighborhood-based entity sets, ConsensusPathDB contains pre-defined pathway-based sets from 9 pathway databases. A pathway-based set contains all the proteins and genes that are involved in a curated biochemical pathway. Gene Ontology (GO) based sets contain genes that are together annotated with a specific GO term. In ConsensusPathDB, GO categories from levels 2 and 3 of the GO hierarchy are available for pathway analysis.

Wilcoxon enrichment analysis

The Wilcoxon enrichment analysis method carries out a paired Wilcoxon signed-rank test for each neighborhood-based entity set / GO category / pathway based on the measurement values of its members. The measurement values for every gene / protein, uploaded by the user, typically reflect genome-wide gene expression or proteome-wide protein abundance in two different phenotypes. For every uploaded gene or protein, exactly two values must be supplied in the input form (or uploaded file). The Wilcoxon test assigns a p-value to each functional set based on how probable it is that the combined measurement differences of genes in the functional set between the phenotypes have appeared by chance. Q-values are calculated using the same method as in the over-representation analysis approach.
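The per-set test can be sketched in plain Python. The function below computes an exact two-sided p-value of the paired Wilcoxon signed-rank test for the members of one set: zero differences are dropped, tied absolute differences receive average ranks, and the null distribution of the positive rank sum is enumerated. These conventions, the function name signed_rank_p_value and the example values are assumptions made for this sketch. How ConsensusPathDB handles ties and zero differences is not stated here.

```python
from collections import Counter


def signed_rank_p_value(pairs):
    """Exact two-sided p-value of the paired Wilcoxon signed-rank test."""
    diffs = [second - first for first, second in pairs if second != first]
    n = len(diffs)
    if n == 0:
        return 1.0  # no non-zero differences, nothing to test

    # Average ranks of the absolute differences, stored doubled to stay integral.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_ranks[order[k]] = i + j + 2  # 2 * mean of ranks i+1 .. j+1
        i = j + 1

    observed = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)

    # Null distribution: every rank is assigned a positive sign with probability 1/2.
    counts = Counter({0: 1})
    for r in doubled_ranks:
        updated = Counter()
        for rank_sum, count in counts.items():
            updated[rank_sum] += count
            updated[rank_sum + r] += count
        counts = updated

    mean = sum(doubled_ranks) / 2
    deviation = abs(observed - mean)
    extreme = sum(c for s, c in counts.items() if abs(s - mean) >= deviation)
    return extreme / 2 ** n


# Hypothetical measurements (phenotype 1, phenotype 2) for the members of one set.
set_values = {
    "gene1": (1.0, 2.0),
    "gene2": (1.0, 3.0),
    "gene3": (1.0, 4.0),
    "gene4": (1.0, 5.0),
    "gene5": (1.0, 6.0),
}
p_value = signed_rank_p_value(list(set_values.values()))
print(p_value)
```

The enumeration is only practical for small sets. For larger sets a library routine such as scipy.stats.wilcoxon, which can use a normal approximation, would be the usual choice.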


  Network upload

Through the web interface of ConsensusPathDB, users can upload interaction networks in BioPAX, PSI-MI or SBML format. Upon upload, physical entities and functional interactions are compared against the content of ConsensusPathDB. This is done based on identifiers (UniProt, KEGG, ChEBI, ...) in the case of physical entities and on participant composition in the case of interactions. The uploaded networks are visualized, and for interactions present in ConsensusPathDB, the source databases are displayed. Networks can be edited and expanded in the context of the database content by adding further interactions.
SBML files should have physical entity annotations as in the BioModels model repository.


  More information

Licensing information

The use of CPDB is free of charge for academic users. Commercial users should contact Dr. Ralf Herwig ( herwig [at] molgen.mpg.de ). Data stored in ConsensusPathDB is available under the license terms of each contributing database.

Disclaimer

Although best efforts are always applied, the developers of ConsensusPathDB do not assume any legal responsibility for correctness or usefulness of the information in ConsensusPathDB.

Acknowledgements

ConsensusPathDB is being developed by the Bioinformatics group of the Vertebrate Genomics Department at the Max-Planck-Institute for Molecular Genetics in Berlin, Germany. The project was supported by the EMBRACE and CARCINOGENOMICS projects that are funded by the European Commission within its 6th Framework Programme under the thematic area "Life Sciences, Genomics and Biotechnology for Health" (LSHG-CT-2004-512092 and LSHB-CT-2006-037712); 7th Framework Programme project APO-SYS (HEALTH-F4-2007-200767); German Federal Ministry of Education and Research within the NGFN-2 program (SMP-Protein, FKZ01GR0472); Max Planck Society within its International Research School program (IMPRS-CBSC).

For more information, please contact kamburov[at]molgen.mpg.de.