Annotation of the function of genes and proteins is the principal goal of genome analysis.
The basic computational steps are, first, to run an automatic annotation process that generates functional annotations for the complete genome, and second, to integrate the generated annotations into a platform that provides molecular biologists with a suitable environment to analyze the results.
We have recently developed a new version of the functional annotation algorithm FunCut (Abascal and Valencia, 2003; Del Pozo et al., 2008), whose first version was successfully applied to several functional annotation projects, including the functional annotation of the genome of Buchnera aphidicola (Roeland et al., 2003), the annotation of the proteins of human chromosome 21 (Biosapiens), and the Streptomyces coelicolor genome annotation.
FunCUT is an application based on the study of the annotations of homologous sequences, and it includes new features for the specific identification of protein subfamilies (orthologous groups).
The workflow of the method proceeds as follows:
A sequence similarity search is carried out to find proteins related to the query sequence. A clustering algorithm is then applied to identify groups of closely related sequences within the set of similar proteins, since more closely related sequences are more likely to share a common function. In some cases, recursive sequence similarity searches lead to a better representation of the related subfamilies, which facilitates the clustering.
The local alignments between the query protein and the closely related proteins clustered together with it are classified into different categories depending on the extent to which the alignments cover the length of the query and target sequences (alignment categories).
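The coverage-based classification described above can be illustrated with a minimal sketch. The category names, the single 0.8 coverage threshold, and the `LocalAlignment` fields are assumptions made for illustration; they are not taken from FunCut itself.

```python
from dataclasses import dataclass


@dataclass
class LocalAlignment:
    # Hypothetical record; coordinates are 1-based and inclusive.
    query_start: int
    query_end: int
    query_length: int
    target_start: int
    target_end: int
    target_length: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence spanned by the aligned region."""
    return (end - start + 1) / length


def alignment_category(aln: LocalAlignment, min_cov: float = 0.8) -> str:
    """Assign an illustrative category from query and target coverage.

    min_cov is an assumed threshold, not a documented FunCut parameter.
    """
    query_ok = coverage(aln.query_start, aln.query_end, aln.query_length) >= min_cov
    target_ok = coverage(aln.target_start, aln.target_end, aln.target_length) >= min_cov
    if query_ok and target_ok:
        return "both_covered"
    if query_ok:
        return "query_covered"
    if target_ok:
        return "target_covered"
    return "partial"


if __name__ == "__main__":
    print(alignment_category(LocalAlignment(1, 95, 100, 1, 190, 200)))
```

An alignment that spans most of both sequences suggests a similar overall domain architecture, which is why such a category would plausibly be treated as the most reliable source of annotations.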
Key functional annotations of the corresponding proteins are analyzed, including functional descriptions, enzymatic activity codes, and Swiss-Prot style keywords. Annotations from the new FunCut version are also based on GOA annotations, UniProt keywords and their GO terms.
Annotation transfer starts from the alignment categories with the best coverage. Each annotation is assigned a confidence level, which is derived from the alignment category it was transferred from.
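One possible reading of this transfer step is sketched below. The category names, the numeric confidence levels attached to them, and the rule that an annotation keeps the confidence of the best category in which it is seen are all assumptions for illustration; the text does not specify them.

```python
# Hypothetical mapping from alignment category to confidence level.
CATEGORY_CONFIDENCE = {
    "both_covered": 3,
    "query_covered": 2,
    "target_covered": 2,
    "partial": 1,
}


def transfer_annotations(hits):
    """Transfer annotations starting from the best-coverage categories.

    hits: iterable of (category, annotations) pairs, one per related protein.
    Returns a dict mapping each annotation to its confidence level.
    """
    ordered = sorted(hits, key=lambda hit: CATEGORY_CONFIDENCE[hit[0]], reverse=True)
    transferred = {}
    for category, annotations in ordered:
        for annotation in annotations:
            # The first (best) category that supplies an annotation fixes its level.
            transferred.setdefault(annotation, CATEGORY_CONFIDENCE[category])
    return transferred


if __name__ == "__main__":
    example_hits = [
        ("partial", {"kinase"}),
        ("both_covered", {"kinase", "EC 2.7.11.1"}),
        ("query_covered", {"ATP-binding"}),
    ]
    print(transfer_annotations(example_hits))
```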
The new FunCut version introduces the SIAM module (Statistically Inferred Annotation Method). It has been developed for integration with FunCut with a double goal:
- providing accurate Gene Ontology (GO) terms mapped from one of the annotation types originally generated by the tool (UniProt keywords);
- improving the predictions with the addition of new annotations based on a statistical and heuristic algorithm.
The FunCut system annotates query sequences based on sequence homology information. It originally generates annotations composed of a synthetic description, a set of EC codes and a set of keywords. These annotations are inferred from a pool of rules that decide which annotations should be transferred to the query sequence, based on the topology of the space of homologous sequences. Homologous sequences are found by intermediate sequence searches (ISS; Bino and Sali, 2004) over the input database, and sequence similarity results are clustered using the NCut algorithm (Abascal and Valencia, 2002). The original rules are parametrized to take into account the size of the cluster in which the query is placed ('winner cluster'), the distance to other homologous clusters, and the sizes of those clusters.
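To give an intuition for the clustering criterion, the sketch below computes the standard normalized-cut value of a bipartition of a weighted similarity graph, NCut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V). It illustrates the general criterion only and makes no claim about how the NCut module is implemented; the function name, the edge representation and the toy similarity weights are assumptions.

```python
def normalized_cut(edges, part_a):
    """Normalized-cut value of the bipartition (part_a, remaining nodes).

    edges: dict mapping an undirected pair (u, v), listed once, to a
           non-negative similarity weight.
    part_a: iterable of nodes forming one side of the partition.
    """
    part_a = set(part_a)
    cut = assoc_a = assoc_b = 0.0
    for (u, v), weight in edges.items():
        u_in_a, v_in_a = u in part_a, v in part_a
        if u_in_a != v_in_a:
            cut += weight  # edge crosses the partition
        # assoc(X, V) sums, for every node in X, the weights of its edges.
        assoc_a += weight * (int(u_in_a) + int(v_in_a))
        assoc_b += weight * (int(not u_in_a) + int(not v_in_a))
    if assoc_a == 0 or assoc_b == 0:
        raise ValueError("both sides of the partition need at least one edge")
    return cut / assoc_a + cut / assoc_b


if __name__ == "__main__":
    # Toy graph: the query and two close homologs, weakly linked to a second group.
    similarities = {
        ("query", "a"): 0.9,
        ("query", "b"): 0.8,
        ("a", "b"): 0.85,
        ("c", "d"): 0.9,
        ("b", "c"): 0.1,
    }
    print(normalized_cut(similarities, {"query", "a", "b"}))
    print(normalized_cut(similarities, {"query"}))
```

A low value indicates a partition that separates weakly connected groups without isolating single sequences, so in this toy graph the split that keeps the query with its two close homologs scores better than the split that isolates the query.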
SIAM uses these rule-based heuristics as a first source of annotation. Instead of dealing with UniProt keywords directly, SIAM uses their mappings to GO terms provided by the Gene Ontology Consortium. Then, the module takes the sequences of the 'winner cluster' and extracts all their non-IEA GO annotations from the GOA database, building a consensus list from the set of annotated terms.
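The extraction of non-IEA annotations can be sketched as follows. The input layout (sequence identifier mapped to `(GO id, evidence code)` pairs) and the choice to measure how often a term occurs as a fraction of the annotated sequences are assumptions; the actual SIAM data structures are not described here.

```python
from collections import Counter


def go_term_support(cluster_annotations):
    """Fraction of annotated cluster sequences carrying each non-IEA GO term.

    cluster_annotations: dict mapping a sequence id to a list of
    (go_id, evidence_code) pairs, as they might be read from GOA.
    """
    counts = Counter()
    annotated = 0
    for records in cluster_annotations.values():
        # Discard electronically inferred annotations (evidence code IEA).
        terms = {go_id for go_id, evidence in records if evidence != "IEA"}
        if terms:
            annotated += 1
            counts.update(terms)
    if annotated == 0:
        return {}
    return {go_id: n / annotated for go_id, n in counts.items()}


if __name__ == "__main__":
    winner_cluster = {
        "P1": [("GO:0005524", "IDA"), ("GO:0016301", "IEA")],
        "P2": [("GO:0005524", "TAS"), ("GO:0016301", "IMP")],
        "P3": [("GO:0016301", "IEA")],
    }
    print(go_term_support(winner_cluster))
```

Sequences that carry only IEA annotations are left out of the denominator in this sketch, so that electronically annotated cluster members do not dilute the evidence from curated ones; that is a design choice of the example, not a documented property of SIAM.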
Each term is included in the consensus list only if its support value in the annotated set satisfies a predefined minimum support threshold. In addition, the p-values of the assigned terms are calculated to measure the statistical significance of each annotation in the set. The consensus list can be validated with the Distance Model over Gene Ontology, which is based on the co-occurrence of terms within the same InterPro domain. A final term prediction list is built by voting over the terms derived from the rule-based heuristics and the statistical algorithm. If a term is provided by both methods and the number of homologous sequences in the winner cluster is greater than two, the predicted term receives a confidence value of 3; otherwise, the score is 2. Finally, given the list of predicted terms, we choose those terms that are closest to the leaves of the GO directed acyclic graph (DAG), under the constraint that their corresponding nodes form a 'nad' subset, that is, a subset of nodes that are 'not ancestors and not descendants' of each other in GO.
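The thresholding, voting and 'nad' selection steps can be sketched together. Several details are assumptions: the 0.5 default threshold, treating the vote as the union of both term sets, and selecting the 'nad' subset by discarding every predicted term that is an ancestor of another predicted term. The small DAG in the example is a simplified toy, not the real GO graph.

```python
def consensus_terms(support, min_support=0.5):
    """Keep terms whose support meets an (assumed) minimum threshold."""
    return {term for term, value in support.items() if value >= min_support}


def score_terms(rule_terms, stat_terms, cluster_size):
    """Score 3 if both methods agree and the cluster has more than two homologs, else 2."""
    scores = {}
    for term in set(rule_terms) | set(stat_terms):
        agreed = term in rule_terms and term in stat_terms
        scores[term] = 3 if agreed and cluster_size > 2 else 2
    return scores


def ancestors(term, parents):
    """All ancestors of a term in a DAG given as child -> list of parents."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def most_specific_nad(terms, parents):
    """Drop every term that is an ancestor of another predicted term.

    The remaining terms are pairwise neither ancestors nor descendants.
    """
    terms = set(terms)
    covered = set()
    for term in terms:
        covered |= ancestors(term, parents)
    return terms - covered


if __name__ == "__main__":
    toy_parents = {  # simplified toy DAG, child -> parents
        "ATP binding": ["nucleotide binding"],
        "nucleotide binding": ["binding"],
        "binding": ["molecular_function"],
        "kinase activity": ["catalytic activity"],
        "catalytic activity": ["molecular_function"],
    }
    scores = score_terms({"binding", "kinase activity"},
                         {"ATP binding", "kinase activity"}, cluster_size=4)
    print(scores)
    print(most_specific_nad(scores, toy_parents))
```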
Web Services Technologies for FunCUT-SIAM
FunCUT-SIAM has been designed to be portable, modular and flexible. It can be integrated into other bioinformatics systems, and it can be accessed in distributed systems in the form of web services.
The modules are:
ISS (Intermediate Sequence Search): Module that carries out the homolog search.
NCut (Normalized Cut): Clustering module that groups the homologous sequences into subfamilies and weights their closeness to the query sequence.
OFunCut: Module that analyzes the key functional annotations of the neighbor sequences and transfers the annotations.
SIAM (Statistically Inferred Annotation Method): Module that infers Gene Ontology terms for the FunCUT pipeline.
Those modules are implemented as different types of web services: Taverna workbench users (Oinn et al., 2004).
The web services pipeline runs on a battery of computers located at the Spanish National Institute for Bioinformatics (Instituto Nacional de Bioinformática-INB, http://www.inab.org).