Analysis of the Thermotoga maritima genome combining a variety of sequence similarity and genome context tools.

Kyrpides NC, Ouzounis CA, Iliopoulos I, Vonstein V, Overbeek R.

The proliferation of genome sequence data has led to the development of a number of tools and strategies that facilitate computational analysis. These methods include the identification of motif patterns, membership of the query sequences in family databases, metabolic pathway involvement and gene proximity. We re-examined the completely sequenced genome of Thermotoga maritima by employing the combined use of the above methods. By analyzing all 1877 proteins encoded in this genome, we identified 193 cases of conflicting annotations (10%), of which 164 are new function predictions and 29 are amendments of previously proposed assignments. These results suggest that the combined use of existing computational tools can resolve inconclusive sequence similarities and significantly improve the prediction of protein function from genome sequence.

Nucleic Acids Res. 2000 Nov 15;28(22):4573-6.

WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction.

Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E.

The WIT (What Is There) (http://wit.mcs.anl.gov/WIT2/) system has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. This system contains data derived from about 40 completed or nearly completed genomes. Sequence homologies, various ORF-clustering algorithms, relative gene positions on the chromosome and placement of gene products in metabolic pathways (metabolic reconstruction) can be used for the assignment of gene functions and for development of overviews of genomes within WIT. The integration of a large number of phylogenetically diverse genomes in WIT facilitates the understanding of the physiology of different organisms.

Nucleic Acids Res. 2000 Jan 1;28(1):123-5.

Genomics: what is realistically achievable?

Overbeek R.

We now have a large and growing number of sequenced genomes. It is widely understood that this presents research opportunities and promises to change the way biology advances, but the magnitude and nature of the opportunities are, for the most part, poorly understood. In this short piece, I wish to examine the following two questions: First, how quickly will sequence data be produced? Second, what impact will this have on our understanding of the sequenced organisms?

Since I am a computer scientist by training, I tend to think of the current situation, in which the field of genomics is being driven forward by rapid technological advances, as quite analogous to the sequence of events in computing that were triggered by advances in microcomputer and network technologies. I distinctly remember the early period in which it seemed clear to most computer scientists (including myself) that technical advances were very desirable and interesting, but could have little impact on either the fundamental research issues or the overall advance of the field. Most of us completely underestimated the impact of exponential price improvements in key enabling technologies. Certainly no one that I know of foresaw in any detail the current world of computing (although a few had rare insights into the potential). As we face the world generated by the web, we should remember that as late as the early 1990s common wisdom indicated that 'movies on demand' would be the application that drove increased network bandwidth.

Genome Biol. 2000; 1(2): comment2002.1–comment2002.3.
Published online 2000 Jul 28. doi: 10.1186/gb-2000-1-2-comment2002

Protein interaction maps for complete genomes based on gene fusion events.

Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA.

A large-scale effort to measure, detect and analyse protein-protein interactions using experimental methods is under way. These include biochemistry such as co-immunoprecipitation or crosslinking, molecular biology such as the two-hybrid system or phage display, and genetics such as unlinked noncomplementing mutant detection. Using the two-hybrid system, an international effort to analyse the complete yeast genome is in progress. Evidently, all these approaches are tedious, labour intensive and inaccurate. From a computational perspective, the question is how can we predict that two proteins interact from structure or sequence alone. Here we present a method that identifies gene-fusion events in complete genomes, solely based on sequence comparison. Because there must be selective pressure for certain genes to be fused over the course of evolution, we are able to predict functional associations of proteins. We show that 215 genes or proteins in the complete genomes of Escherichia coli, Haemophilus influenzae and Methanococcus jannaschii are involved in 64 unique fusion events. The approach is general, and can be applied even to genes of unknown function.

Nature 402, 86-90 (4 November 1999) | doi:10.1038/47056
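
The gene-fusion idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the input format (precomputed similarity hits as tuples), the function name `find_fusion_candidates`, and the `max_overlap` tolerance are all assumptions made for illustration. The sketch flags two component proteins as a candidate pair when both match different, essentially non-overlapping regions of the same composite protein in another genome, and when the two components are not themselves similar to each other.

```python
from itertools import combinations


def find_fusion_candidates(hits, similar_pairs=frozenset(), max_overlap=10):
    """Return sorted (component_a, component_b, composite) candidate triples.

    hits: iterable of (component_id, composite_id, start, end), where start/end
          are the matched residue range on the composite protein (hypothetical
          format; any similarity search could produce it).
    similar_pairs: set of frozenset({a, b}) for components that are similar to
          each other; such pairs are skipped because they look like paralogs
          rather than two distinct fused parts.
    max_overlap: residues by which the two matched regions may overlap
          (an illustrative threshold, not a published value).
    """
    by_composite = {}
    for component, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((component, start, end))

    candidates = []
    for composite, regions in by_composite.items():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(sorted(regions), 2):
            if a == b or frozenset((a, b)) in similar_pairs:
                continue
            # Overlap length of the two matched ranges (negative means a gap).
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            if overlap <= max_overlap:
                candidates.append((a, b, composite))
    return sorted(candidates)


example_hits = [
    ("A", "C", 1, 200),    # A matches the N-terminal part of composite C
    ("B", "C", 210, 400),  # B matches the C-terminal part of composite C
    ("D", "C", 1, 190),    # D matches the same region as A
]
example_similar = {frozenset(("A", "D"))}
fusion_candidates = find_fusion_candidates(example_hits, example_similar)
```

In this toy input, A and D cover the same region and are marked as similar, so they are not paired with each other, while each can still be paired with B.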

Universal protein families and the functional content of the last universal common ancestor.

Kyrpides N, Overbeek R, Ouzounis C.

The phylogenetic distribution of Methanococcus jannaschii proteins can provide, for the first time, an estimate of the genome content of the last common ancestor of the three domains of life. Relying on annotation and comparison with reference to the species distribution of sequence similarities results in 324 proteins forming the universal family set. This set is very well characterized and relatively small and nonredundant, containing 301 biochemical functions, of which 246 are unique. This universal function set contains mostly genes coding for energy metabolism or information processing. It appears that the Last Universal Common Ancestor was an organism with metabolic networks and genetic machinery similar to those of extant unicellular organisms.

J Mol Evol. 1999 Oct;49(4):413-23.
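
As a rough illustration of selecting a "universal" set by phylogenetic distribution, the sketch below keeps only those query proteins that have similarity hits in all three domains of life. The input dictionary, the function name `universal_proteins`, and the protein identifiers are hypothetical; how hits are detected and how annotation is used are outside the scope of this sketch.

```python
DOMAINS = frozenset({"Archaea", "Bacteria", "Eukarya"})


def universal_proteins(hit_domains):
    """Return sorted IDs of proteins with hits in every domain of life.

    hit_domains: dict mapping a protein ID to the collection of domains in
    which a similar sequence was found (assumed to be precomputed).
    """
    return sorted(
        protein
        for protein, domains in hit_domains.items()
        if DOMAINS <= set(domains)
    )


example_distribution = {
    "protein_1": ["Archaea", "Bacteria", "Eukarya"],  # found everywhere
    "protein_2": ["Archaea", "Bacteria"],             # no eukaryotic hit
    "protein_3": ["Archaea"],                         # archaea only
}
universal_set = universal_proteins(example_distribution)
```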

The use of gene clusters to infer functional coupling.

Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N.

Previously, we presented evidence that it is possible to predict functional coupling between genes based on conservation of gene clusters between genomes. With the rapid increase in the availability of prokaryotic sequence data, it has become possible to verify and apply the technique. In this paper, we extend our characterization of the parameters that determine the utility of the approach, and we generalize the approach in a way that supports detection of common classes of functionally coupled genes (e.g., transport and signal transduction clusters). Now that the analysis includes over 30 complete or nearly complete genomes, it has become clear that this approach will play a significant role in supporting efforts to assign functionality to the remaining uncharacterized genes in sequenced genomes.

Proc Natl Acad Sci U S A. 1999 Mar 16;96(6):2896-901.
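
The notion of gene clusters conserved between genomes, described in the abstract above, can be sketched as follows. This is a simplified illustration under stated assumptions, not the published procedure: gene order is given as a plain list per genome, orthologs are supplied as a precomputed mapping, and closeness is measured in gene positions with a hypothetical `max_gap` parameter rather than in base pairs or with strand information.

```python
def conserved_neighbor_pairs(order_a, order_b, orthologs, max_gap=1):
    """Return pairs of genome-A genes whose proximity is conserved in genome B.

    order_a, order_b: lists of gene IDs in chromosomal order.
    orthologs: dict mapping a genome-A gene to its genome-B counterpart
               (assumed input, e.g. from best-hit analysis).
    max_gap: maximum distance, in gene positions, for two genes to count as
             close (an illustrative threshold).
    """
    position_b = {gene: index for index, gene in enumerate(order_b)}
    pairs = []
    for i, gene_1 in enumerate(order_a):
        # Only look at the next max_gap genes downstream of gene_1.
        for gene_2 in order_a[i + 1 : i + 1 + max_gap]:
            partner_1 = orthologs.get(gene_1)
            partner_2 = orthologs.get(gene_2)
            if partner_1 not in position_b or partner_2 not in position_b:
                continue
            if partner_1 == partner_2:
                continue
            if abs(position_b[partner_1] - position_b[partner_2]) <= max_gap:
                pairs.append((gene_1, gene_2))
    return pairs


genome_a = ["a1", "a2", "a3", "a4"]
genome_b = ["b3", "b9", "b1", "b2"]
ortholog_map = {"a1": "b1", "a2": "b2", "a3": "b3"}
coupled_pairs = conserved_neighbor_pairs(genome_a, genome_b, ortholog_map)
```

Here a1 and a2 are adjacent in genome A and their counterparts b1 and b2 are adjacent in genome B, so the pair is reported; a2 and a3 are adjacent in A but their counterparts are far apart in B, so they are not. Pairs that recur across many genomes would be stronger candidates for functional coupling.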