Exact mapping of prokaryotic gene starts.

Baytaluk MV, Gelfand MS, Mironov AA.

It is known that while the programs used to find genes in prokaryotic genomes reliably map protein-coding regions, they often fail in the exact determination of gene starts. This problem is further aggravated by sequencing errors, most notably insertions and deletions leading to frame-shifts. Therefore, the exact mapping of gene starts and identification of frame-shifts are important problems of the computer-assisted functional analysis of newly sequenced genomes. Here we review methods of gene recognition and describe a new algorithm for correction of gene starts and identification of frame-shifts in prokaryotic genomes. The algorithm is based on the comparison of nucleotide and protein sequences of homologous genes from related organisms, using the assumption that the rate of evolutionary changes in protein-coding regions is lower than that in non-coding regions. A dynamic programming algorithm is used to align protein sequences obtained by formal translation of genomic nucleotide sequences. The possibility of frame-shifts is taken into account. The algorithm was tested on several groups of related organisms: gamma-proteobacteria, the Bacillus/Clostridium group, and three Pyrococcus genomes. The testing demonstrated that, depending on the genome, 1-10 per cent of genes have incorrect starts or contain frame-shifts. The algorithm is implemented in the program package Orthologator-GeneCorrector.

Brief Bioinform. 2002 Jun;3(2):181-94.
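
The abstract above describes aligning formally translated nucleotide sequences against homologous proteins while allowing for frame-shifts. The following is a minimal Python sketch of that general idea, not the Orthologator-GeneCorrector algorithm itself: it translates a sequence in each of the three forward frames and picks the frame whose translation scores best against a homolog under a plain Needleman-Wunsch global alignment. The function names, the scoring values (match 2, mismatch -1, gap -2), and the simplification of choosing one frame for the whole sequence (rather than switching frames inside the alignment) are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Formally translate dna starting at offset frame (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Unknown codons (e.g. containing N) become 'X'.
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (illustrative scoring)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_frame(dna, homolog_protein):
    """Return (frame, scores): the frame whose translation best matches the homolog."""
    scores = {f: global_score(translate(dna, f), homolog_protein) for f in range(3)}
    return max(scores, key=scores.get), scores


# Hypothetical example: one extra leading base moves the gene into frame 1.
frame, scores = best_frame("CATGGCTAAAGGTTGA", "MAKG*")
print(frame, scores)
```

A change in the best-scoring frame between the upstream and downstream parts of a gene would be one hint of a frame-shift; a real implementation would handle this inside the dynamic programming recursion.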

Complete reconstitution of the human coenzyme A biosynthetic pathway via comparative genomics.

Daugherty M, Polanuyer B, Farrell M, Scholle M, Lykidis A, de Crécy-Lagard V, Osterman A.


The biosynthesis of CoA from pantothenic acid (vitamin B5) is an essential universal pathway in prokaryotes and eukaryotes. The CoA biosynthetic genes in bacteria have all recently been identified, but their counterparts in humans and other eukaryotes remained mostly unknown. Using comparative genomics, we have identified human genes encoding the last four enzymatic steps in CoA biosynthesis: phosphopantothenoylcysteine synthetase (EC ), phosphopantothenoylcysteine decarboxylase (EC ), phosphopantetheine adenylyltransferase (EC ), and dephospho-CoA kinase (EC ). Biological functions of these human genes were verified using a complementation system in Escherichia coli based on transposon mutagenesis. The individual human enzymes were overexpressed in E. coli and purified, and the corresponding activities were experimentally verified. In addition, the entire pathway from phosphopantothenate to CoA was successfully reconstituted in vitro using a mixture of purified recombinant enzymes. Human recombinant bifunctional phosphopantetheine adenylyltransferase/dephospho-CoA kinase was kinetically characterized. This enzyme was previously suggested as a point of CoA biosynthesis regulation, and we have observed significant differences in mRNA levels of the corresponding human gene in normal and tumor cells by Northern blot analysis.

J Biol Chem. 2002 Jun 14;277(24):21431-9. Epub 2002 Mar 28.

Genome sequence and analysis of the oral bacterium Fusobacterium nucleatum strain ATCC 25586.

Kapatral V, Anderson I, Ivanova N, Reznik G, Los T, Lykidis A, Bhattacharyya A, Bartman A, Gardner W, Grechkin G, Zhu L, Vasieva O, Chu L, Kogan Y, Chaga O, Goltsman E, Bernal A, Larsen N, D'Souza M, Walunas T, Pusch G, Haselkorn R, Fonstein M, Kyrpides N, Overbeek R.

We present a complete DNA sequence and metabolic analysis of the dominant oral bacterium Fusobacterium nucleatum. Although not considered a major dental pathogen on its own, this anaerobe facilitates the aggregation and establishment of several other species including the dental pathogens Porphyromonas gingivalis and Bacteroides forsythus. The F. nucleatum strain ATCC 25586 genome was assembled from shotgun sequences and analyzed using the ERGO bioinformatics suite (http://www.integratedgenomics.com). The genome contains 2.17 Mb encoding 2,067 open reading frames, organized on a single circular chromosome with 27% GC content. Despite its taxonomic position among the gram-negative bacteria, several features of its core metabolism are similar to those of gram-positive Clostridium spp., Enterococcus spp., and Lactococcus spp. The genome analysis has revealed several key aspects of the pathways of organic acid, amino acid, carbohydrate, and lipid metabolism. Nine very-high-molecular-weight outer membrane proteins are predicted from the sequence, none of which has been reported in the literature. More than 137 transporters for the uptake of a variety of substrates such as peptides, sugars, metal ions, and cofactors have been identified. Biosynthetic pathways exist for only three amino acids: glutamate, aspartate, and asparagine. The remaining amino acids are imported as such or as di- or oligopeptides that are subsequently degraded in the cytoplasm. A principal source of energy appears to be the fermentation of glutamate to butyrate. Additionally, desulfuration of cysteine and methionine yields ammonia, H(2)S, methyl mercaptan, and butyrate, which are capable of arresting fibroblast growth, thus preventing wound healing and aiding penetration of the gingival epithelium. The metabolic capabilities of F. nucleatum revealed by its genome are therefore consistent with its specialized niche in the mouth.

J Bacteriol. 2002 Apr;184(7):2005-18.

The genome sequence of the facultative intracellular pathogen Brucella melitensis.

DelVecchio VG, Kapatral V, Redkar RJ, Patra G, Mujer C, Los T, Ivanova N, Anderson I, Bhattacharyya A, Lykidis A, Reznik G, Jablonski L, Larsen N, D'Souza M, Bernal A, Mazur M, Goltsman E, Selkov E, Elzer PH, Hagius S, O'Callaghan D, Letesson JJ, Haselkorn R, Kyrpides N, Overbeek R.

Brucella melitensis is a facultative intracellular bacterial pathogen that causes abortion in goats and sheep and Malta fever in humans. The genome of B. melitensis strain 16M was sequenced and found to contain 3,294,935 bp distributed over two circular chromosomes of 2,117,144 bp and 1,177,787 bp encoding 3,197 ORFs. By using the bioinformatics suite ERGO, 2,487 (78%) ORFs were assigned functions. The origins of replication of the two chromosomes are similar to those of other alpha-proteobacteria. Housekeeping genes, including those involved in DNA replication, transcription, translation, core metabolism, and cell wall biosynthesis, are distributed on both chromosomes. Type I, II, and III secretion systems are absent, but genes encoding sec-dependent, sec-independent, and flagella-specific type III, type IV, and type V secretion systems as well as adhesins, invasins, and hemolysins were identified. Several features of the B. melitensis genome are similar to those of the symbiotic Sinorhizobium meliloti.

Proc Natl Acad Sci U S A. 2002 Jan 8;99(1):443-8. Epub 2001 Dec 26.

Genomes OnLine Database (GOLD): a monitor of genome projects world-wide.

Bernal A, Ear U, Kyrpides N.

GOLD is a comprehensive resource for accessing information related to completed and ongoing genome projects world-wide. The database currently provides information on 350 genome projects, of which 48 have been completely sequenced and their analysis published. GOLD was created in 1997 and since April 2000 it has been licensed to Integrated Genomics. The database is freely available through the URL: http://igweb.integratedgenomics.com/GOLD/.

Nucleic Acids Res. 2001 Jan 1;29(1):126-7.

Archaeal shikimate kinase, a new member of the GHMP-kinase family.

Daugherty M, Vonstein V, Overbeek R, Osterman A.

Shikimate kinase (EC 2.7.1.71) is a committed enzyme in the seven-step biosynthesis of chorismate, a major precursor of aromatic amino acids and many other aromatic compounds. Genes for all enzymes of the chorismate pathway except shikimate kinase are found in archaeal genomes by sequence homology to their bacterial counterparts. In this study, a conserved archaeal gene (gi1500322 in Methanococcus jannaschii) was identified as the best candidate for the missing shikimate kinase gene by the analysis of chromosomal clustering of chorismate biosynthetic genes. The encoded hypothetical protein, with no sequence similarity to bacterial and eukaryotic shikimate kinases, is distantly related to homoserine kinases (EC 2.7.1.39) of the GHMP-kinase superfamily. The latter functionality in M. jannaschii is assigned to another gene (gi591748), in agreement with sequence similarity and chromosomal clustering analysis. Both archaeal proteins, overexpressed in Escherichia coli and purified to homogeneity, displayed activity of the predicted type, with steady-state kinetic parameters similar to those of the corresponding bacterial kinases: K(m,shikimate) = 414 +/- 33 microM, K(m,ATP) = 48 +/- 4 microM, and k(cat) = 57 +/- 2 s(-1) for the predicted shikimate kinase and K(m,homoserine) = 188 +/- 37 microM, K(m,ATP) = 101 +/- 7 microM, and k(cat) = 28 +/- 1 s(-1) for the homoserine kinase. No overlapping activity could be detected between shikimate kinase and homoserine kinase, both revealing a >1,000-fold preference for their own specific substrates. The case of archaeal shikimate kinase illustrates the efficacy of techniques based on reconstruction of metabolism from genomic data and analysis of gene clustering on chromosomes in finding missing genes.

J Bacteriol. 2001 Jan;183(1):292-300. doi: 10.1128/JB.183.1.292-300.2001
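
The kinetic parameters quoted in the abstract above can be combined into the catalytic efficiency k(cat)/K(m), a common way to compare enzymes. The abstract does not report this ratio; the short Python sketch below derives it from the quoted mean values, ignoring the error bars. The function name is hypothetical.

```python
def catalytic_efficiency(kcat_per_s, km_micromolar):
    """Return k_cat / K_m in M^-1 s^-1, converting K_m from microM to M."""
    return kcat_per_s / (km_micromolar * 1e-6)


# Mean values quoted in the abstract (error bars are ignored here).
shikimate_kinase = catalytic_efficiency(57, 414)   # roughly 1.4e5 M^-1 s^-1
homoserine_kinase = catalytic_efficiency(28, 188)  # roughly 1.5e5 M^-1 s^-1

print(f"shikimate kinase:  {shikimate_kinase:.2e} M^-1 s^-1")
print(f"homoserine kinase: {homoserine_kinase:.2e} M^-1 s^-1")
```

On these numbers the two archaeal kinases have efficiencies of the same order of magnitude for their own substrates with respect to the non-ATP substrate.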

Analysis of the Thermotoga maritima genome combining a variety of sequence similarity and genome context tools.

Kyrpides NC, Ouzounis CA, Iliopoulos I, Vonstein V, Overbeek R.

The proliferation of genome sequence data has led to the development of a number of tools and strategies that facilitate computational analysis. These methods include the identification of motif patterns, membership of the query sequences in family databases, metabolic pathway involvement and gene proximity. We re-examined the completely sequenced genome of Thermotoga maritima by employing the combined use of the above methods. By analyzing all 1877 proteins encoded in this genome, we identified 193 cases of conflicting annotations (10%), of which 164 are new function predictions and 29 are amendments of previously proposed assignments. These results suggest that the combined use of existing computational tools can resolve inconclusive sequence similarities and significantly improve the prediction of protein function from genome sequence.

Nucleic Acids Res. 2000 Nov 15;28(22):4573-6.

WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction.

Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E.

The WIT (What Is There) (http://wit.mcs.anl.gov/WIT2/) system has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. This system contains data derived from about 40 completed or nearly completed genomes. Sequence homologies, various ORF-clustering algorithms, relative gene positions on the chromosome and placement of gene products in metabolic pathways (metabolic reconstruction) can be used for the assignment of gene functions and for development of overviews of genomes within WIT. The integration of a large number of phylogenetically diverse genomes in WIT facilitates the understanding of the physiology of different organisms.

Nucleic Acids Res. 2000 Jan 1;28(1):123-5.

Genomics: what is realistically achievable?

Overbeek R.

We now have a large and growing number of sequenced genomes. It is widely understood that this presents research opportunities and promises to change the way biology advances, but the magnitude and nature of the opportunities are, for the most part, poorly understood. In this short piece, I wish to examine the following two questions: First, how quickly will sequence data be produced? Second, what impact will this have on our understanding of the sequenced organisms?

Since I am a computer scientist by training, I tend to think of the current situation in which the field of genomics is being driven forward by rapid technological advances as quite analogous to the sequence of events in computing that were triggered by advances in microcomputer and network technologies. I distinctly remember the early period in which it seemed clear to most computer scientists (including myself) that technical advances were very desirable and interesting, but could have little impact on either the fundamental research issues or the overall advance of the field. Most of us completely underestimated the impact of exponential price improvements in key-enabling technologies. Certainly no one that I know of foresaw in any detail the current world of computing (although a few had rare insights into the potential). As we face the world generated by the web, we should remember that as late as the early 1990s common wisdom indicated that 'movies on demand' would be the application that drove increased network bandwidth.

Genome Biol. 2000;1(2):comment2002.1-comment2002.3. Published online 2000 Jul 28. doi: 10.1186/gb-2000-1-2-comment2002

Protein interaction maps for complete genomes based on gene fusion events.

Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA.

A large-scale effort to measure, detect and analyse protein-protein interactions using experimental methods is under way. These include biochemistry such as co-immunoprecipitation or crosslinking, molecular biology such as the two-hybrid system or phage display, and genetics such as unlinked noncomplementing mutant detection. Using the two-hybrid system, an international effort to analyse the complete yeast genome is in progress. Evidently, all these approaches are tedious, labour intensive and inaccurate. From a computational perspective, the question is how can we predict that two proteins interact from structure or sequence alone. Here we present a method that identifies gene-fusion events in complete genomes, solely based on sequence comparison. Because there must be selective pressure for certain genes to be fused over the course of evolution, we are able to predict functional associations of proteins. We show that 215 genes or proteins in the complete genomes of Escherichia coli, Haemophilus influenzae and Methanococcus jannaschii are involved in 64 unique fusion events. The approach is general, and can be applied even to genes of unknown function.

Nature. 1999 Nov 4;402:86-90. doi: 10.1038/47056
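
The gene-fusion idea in the abstract above can be illustrated with a small Python sketch. It assumes that sequence-similarity hits have already been computed and are given as tuples of (query protein, reference protein, start, end), where start and end are positions on the reference protein. A reference protein is flagged as a candidate fusion when two query proteins that are not similar to each other match largely non-overlapping regions of it. The data layout, the function name, and the overlap threshold are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def fusion_candidates(hits, similar_pairs=frozenset(), max_overlap=10):
    """Return sorted (reference, query1, query2) candidate fusion triples.

    hits: iterable of (query, reference, start, end) similarity hits.
    similar_pairs: set of frozenset({q1, q2}) for queries similar to each other;
        such pairs are skipped, since they would match the same region.
    max_overlap: largest tolerated overlap (in residues) of the two hit regions.
    """
    by_ref = defaultdict(list)
    for query, ref, start, end in hits:
        by_ref[ref].append((query, start, end))

    found = set()
    for ref, regions in by_ref.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2 or frozenset((q1, q2)) in similar_pairs:
                continue
            # Negative values mean the two regions are separated by a gap.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                found.add((ref, *sorted((q1, q2))))
    return sorted(found)


# Hypothetical hits: A and C are paralogs, B matches the other half of AB.
hits = [("A", "AB", 1, 200), ("B", "AB", 210, 400), ("C", "AB", 1, 190)]
print(fusion_candidates(hits, similar_pairs={frozenset(("A", "C"))}))
```

Each reported triple predicts a functional association between the two query proteins, which is the inference the abstract describes.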

Universal protein families and the functional content of the last universal common ancestor.

Kyrpides N, Overbeek R, Ouzounis C.

The phylogenetic distribution of Methanococcus jannaschii proteins can provide, for the first time, an estimate of the genome content of the last common ancestor of the three domains of life. Relying on annotation and comparison with reference to the species distribution of sequence similarities results in 324 proteins forming the universal family set. This set is very well characterized and relatively small and nonredundant, containing 301 biochemical functions, of which 246 are unique. This universal function set contains mostly genes coding for energy metabolism or information processing. It appears that the Last Universal Common Ancestor was an organism with metabolic networks and genetic machinery similar to those of extant unicellular organisms.

J Mol Evol. 1999 Oct;49(4):413-23.

The use of gene clusters to infer functional coupling.

Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N.

Previously, we presented evidence that it is possible to predict functional coupling between genes based on conservation of gene clusters between genomes. With the rapid increase in the availability of prokaryotic sequence data, it has become possible to verify and apply the technique. In this paper, we extend our characterization of the parameters that determine the utility of the approach, and we generalize the approach in a way that supports detection of common classes of functionally coupled genes (e.g., transport and signal transduction clusters). Now that the analysis includes over 30 complete or nearly complete genomes, it has become clear that this approach will play a significant role in supporting efforts to assign functionality to the remaining uncharacterized genes in sequenced genomes.

Proc Natl Acad Sci U S A. 1999 Mar 16;96(6):2896-901.
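
The gene-cluster approach in the abstract above can be sketched in Python under strong simplifications. Each genome is represented as an ordered list of ortholog family labels, strand and intergenic distance are ignored, and a pair of families is reported when its members lie within a small window of each other in at least a minimum number of genomes. The representation, the function name, and both thresholds are assumptions for illustration, not the parameters characterized in the paper.

```python
from collections import Counter


def conserved_neighbors(genomes, window=2, min_genomes=2):
    """Return {(family_a, family_b): n_genomes} for conserved neighbor pairs.

    genomes: dict mapping genome name to its gene order, given as a list of
        ortholog family labels (hypothetical labels in the example below).
    window: how many positions downstream count as "close".
    min_genomes: minimum number of genomes in which the pair must be close.
    """
    counts = Counter()
    for order in genomes.values():
        seen = set()  # count each pair at most once per genome
        for i, family in enumerate(order):
            for other in order[i + 1:i + 1 + window]:
                if other != family:
                    seen.add(frozenset((family, other)))
        counts.update(seen)
    return {tuple(sorted(pair)): n for pair, n in counts.items() if n >= min_genomes}


# Hypothetical gene orders for three genomes.
genomes = {
    "g1": ["trpA", "trpB", "x", "y"],
    "g2": ["y", "q", "trpB", "trpA"],
    "g3": ["trpA", "z", "w", "trpB"],
}
print(conserved_neighbors(genomes, window=1, min_genomes=2))
```

Pairs that stay adjacent across many phylogenetically diverse genomes are the ones for which functional coupling would be inferred.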