Treponema pallidum subsp. pallidum, the causative agent of syphilis, a sexually transmitted disease, is a microaerophilic obligate parasite of humans. This bacterium is one of the few prominent infectious agents that has not been continuously cultured in vitro, and consequently relatively little is known about its mechanisms of virulence at the molecular level. Therefore, T. pallidum represented an attractive candidate for genome sequencing.
The genome sequence of T. pallidum has now been completed; it comprises 1,138,006 base pairs containing 1041 predicted protein-coding sequences. An important goal of this project is to identify potential virulence factors. Genome analysis indicates several potential virulence factors, including a family of 12 proteins related to the Treponema denticola Msp protein, several putative hemolysins, and several other protein classes of interest. The results of this analysis are reviewed in this article and indicate the value of whole-genome sequences in rapidly advancing the understanding of infectious agents.
Keywords: Molecular pathogenesis, Virulence factor, Treponema pallidum, Syphilis, Genome analysis
The DNA sequence of the T. pallidum genome
- General characteristics of the sequence
The genomic DNA sequence of T. pallidum subsp. pallidum (Nichols), as determined by the random whole genome sequencing method, comprises a circular chromosome of 1,138,006 bp with a G+C base composition of 52.8%. There are a total of 1041 predicted ORFs, with an average size of 1023 bp. The average size of these predicted proteins is 37,771 Da, with a range of 3,235 to 172,869 Da. The mean isoelectric point of the predicted proteins is 8.1, with a range of 3.9 to 12.3. These parameters are similar to those observed in other bacteria. These proteins are encoded by 92.9% of genomic DNA.
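As an illustration of how a base-composition figure such as the 52.8% G+C value is obtained, the following minimal Python sketch computes the G+C fraction of a DNA string. The function name `gc_fraction` and the toy sequence are assumptions made for illustration only; they are not taken from the sequencing project, and the toy sequence is not T. pallidum data.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / total


# Hypothetical toy sequence, used only to exercise the function.
toy_sequence = "ATGCGCGGTTAA"
print(f"G+C content: {gc_fraction(toy_sequence):.1%}")
```

Ambiguous bases such as N are excluded from the denominator here; that is one possible convention, and a real analysis pipeline may treat them differently.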
Biological functions have been suggested for 577 ORFs (55%) according to the Riley classification scheme, while 177 ORFs (17%) match hypothetical proteins from other species and 287 ORFs (28%) have no database match and may be novel genes. When compared to another spirochete, Borrelia burgdorferi, whose genome has also been sequenced, 90 T. pallidum ORFs of unknown function match chromosomally encoded proteins in B. burgdorferi, but no T. pallidum ORFs match B. burgdorferi plasmid-encoded proteins, suggesting that the plasmid proteins are unique to Borrelia.
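The three ORF categories above account for all 1041 predicted ORFs, and the quoted percentages follow from simple rounding. The short Python check below uses only the counts given in the text; the dictionary keys are descriptive labels chosen for this sketch.

```python
# ORF counts per category, as reported in the text.
orf_counts = {
    "assigned biological function": 577,
    "match hypothetical proteins": 177,
    "no database match": 287,
}

total_orfs = sum(orf_counts.values())

# Percentage of all predicted ORFs in each category, rounded to whole numbers.
orf_shares = {
    label: round(100 * count / total_orfs)
    for label, count in orf_counts.items()
}

print(total_orfs)
for label, share in orf_shares.items():
    print(f"{label}: {share}%")
```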
All 61 codon triplets are used in T. pallidum genes, with a bias for G or C at the third codon position. This is in contrast to the A or T skew at this position in B. burgdorferi. This observation is related to the higher composition of G+C bases in the genome of T. pallidum, being almost double that of B. burgdorferi. The disparate G+C composition between spirochete genomes is also related to a bias in overall codon usage and a concomitant difference in amino acid composition in the predicted coding sequences.
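The third-position bias described above can be quantified as the fraction of codons ending in G or C (often called GC3). The following is a minimal sketch, assuming an in-frame coding sequence given as a plain string; the function names `codon_usage` and `gc3_fraction` and the toy sequence are hypothetical and are not part of the original analysis.

```python
from collections import Counter


def codon_usage(cds: str) -> Counter:
    """Count in-frame codons in a coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def gc3_fraction(cds: str) -> float:
    """Return the fraction of codons whose third base is G or C."""
    usage = codon_usage(cds)
    total = sum(usage.values())
    if total == 0:
        raise ValueError("sequence contains no complete codon")
    gc3 = sum(n for codon, n in usage.items() if codon[2] in "GC")
    return gc3 / total


# Hypothetical toy coding sequence (ATG GCC CTG AAG GAC TAA), not real data.
# Note that the stop codon is counted here along with the sense codons.
toy_cds = "ATGGCCCTGAAGGACTAA"
print(codon_usage(toy_cds))
print(f"GC3: {gc3_fraction(toy_cds):.2f}")
```

Applied to every predicted ORF in a genome, the pooled codon counts would give a codon usage table, and a GC3 value well above 0.5 would correspond to the G or C preference noted for T. pallidum.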
Analysis of the predicted protein sequences indicates that 129 of the ORFs (12%) can be assigned to 42 families of paralogous genes. Among these, 15 families contain 44 genes that do not have an assigned biological role. The largest family, with 14 members, consists of ATP-binding cassette proteins in ABC transport systems, while 30 families have only 2 members. Among the 13 gene families, there are 16 adjacent gene clusters that may represent duplications in the T. pallidum genome.
- Analysis methods
After completing the DNA sequence, the coding regions were identified using GLIMMER and searched against a non-redundant database using the methods developed at TIGR. Furthermore, families of paralogs were analyzed by Pfam, transmembrane domains were predicted by TopPred, and signal peptides were predicted by Signal-P. Although this procedure predicted the vast majority of ORFs, there may be a small number of genes that are not yet represented in the T. pallidum database, either because they are too small to have been considered or because they have unusual features, e.g., different patterns of codon usage.
Following this analysis, a different search algorithm, PSI-BLAST, was used to search the database with each predicted ORF. In addition, the BLOCKS and ProDom databases of protein domains as well as the COG database of protein orthologous groups were searched. The results of these analyses of each putative ORF were used to make the predictions described in this review.
T. pallidum has been a major pathogen of the civilized world for more than 500 years. It has been one of the most resistant organisms to study and, in fact, it was only identified at the beginning of this century. However, as a result of the completion of the genomic sequence, there is now a wealth of clues to understanding, diagnosing, and treating syphilis. In this review, we have described a collection of 67 proteins that are of interest for future studies on the virulence of T. pallidum. Fewer than a third of these had been previously observed, and among the previously uncharacterized genes is the tpr gene family, which likely plays an important role in treponemal infections. Our future understanding of T. pallidum, as well as many other microorganisms, pathogenic and not, is profoundly altered by the availability of complete genome sequences.