"A Global Analysis of Caenorhabditis elegans Operons",
Thomas Blumenthal 1, Donald Evans 1, Christopher D. Link 2, Alessandro Guffanti 3, Daniel Lawson 3, Jean Thierry-Mieg 4, Danielle Thierry-Mieg 4, Wei Lu Chiu 5, Kyle Duke 6, Moni Kiraly 6 & Stuart K. Kim 6
1 Department of Biochemistry and Molecular Genetics, University
of Colorado School of Medicine, Box B121, 4200 E. 9th Avenue, Denver, Colorado
80262, USA
2 Institute of Behavioral Genetics, Box 447, University
of Colorado, Boulder, Colorado 80309, USA
3 The Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SA, UK
4 Gene Network Laboratory, National Institute of Genetics,
Mishima 411, Japan, and National Center for
Biotechnology Information, Bethesda, Maryland, USA
5 Department of Molecular Sciences and Technologies,
Pfizer Global Research & Development—Ann Arbor, 2800 Plymouth Road,
Ann Arbor, Michigan 48105, USA
6 Departments of Developmental Biology and Genetics,
Stanford University Medical Center, 279 Campus Drive, Stanford, California
94305, USA
Correspondence and requests for materials should be addressed to
T.B.
(e-mail: tom.blumenthal@uchsc.edu)
The nematode worm Caenorhabditis elegans and its relatives are unique among animals in having operons [1]. Operons are regulated multigene transcription units, in which polycistronic pre-messenger RNA (pre-mRNA coding for multiple peptides) is processed to monocistronic mRNAs. This occurs by 3' end formation and trans-splicing using the specialized SL2 small nuclear ribonucleoprotein particle [2] for downstream mRNAs [1]. Previously, the correlation between downstream location in an operon and SL2 trans-splicing has been strong, but anecdotal [3]. Although only 28 operons have been reported, the complete sequence of the C. elegans genome reveals numerous gene clusters [4]. To determine how many of these clusters represent operons, we probed full-genome microarrays for SL2-containing mRNAs. We found significant enrichment for about 1,200 genes, including most of a group of several hundred genes represented by complementary DNAs that contain SL2 sequence. Analysis of their genomic arrangements indicates that >90% are downstream genes, falling in 790 distinct operons. Our evidence indicates that the genome contains at least 1,000 operons, 2–8 genes long, that contain about 15% of all C. elegans genes. Numerous examples of co-transcription of genes encoding functionally related proteins are evident. Inspection of the operon list should reveal previously unknown functional relationships.
In order to search the genome for mRNAs that
contain SL2, we hybridized microarrays containing 17,817 predicted genes
(94% of known and predicted genes) with probe enriched for SL2-containing
mRNAs (see Methods). The results are presented in Fig. 1a.
Figure 1: SL2/poly(A)+ ratios
of 17,817 C. elegans genes. Genes were divided into bins according to
ratios, and plotted as log2(ratio) (line).
a, Distribution of confirmed SL2-accepting genes. Percentage of
319 genes shown to be SL2 trans-spliced on the basis of sequenced
cDNAs (bars).
b, Distribution of first genes in operons. First genes in the operons
identified by the 100 highest SL2/poly(A)+ ratios were distributed
into bins.
c, Genes in the leftmost peak and four control groups of 100 genes
were evaluated for location in operons. Genes whose trans-splice
sites were within 1 kb of the stop codon or 500 bp from the poly(A) site
of another gene were scored as downstream in operons. Percentage of genes
in each bin scored as downstream in operons is shown.
Having performed a global search for genes that produce SL2 mRNAs, we determined whether their genomic structure indicated that they are located within operons. Each gene was evaluated as to whether it was likely to be downstream in an operon by the criteria described in the Fig. 1 legend, using either WormBase [5] or the Intronerator website [6]. In the set of 1,200 SL2-enriched genes contained in the leftmost peak, 86% were scored as downstream in operons, and only 4.5% were scored as first genes in operons (Fig. 1c). From the set of genes that do not show significant SL2/poly(A)+ ratios, only 15–20% were scored as possibly downstream in operons. This analysis provides strong evidence that the microarray experiment effectively identified C. elegans genes that are in operons. These data show a robust correlation across the genome between SL2 trans-splicing and downstream location in an operon, confirming and extending previous data based on individual genes.
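The scoring rule from the Fig. 1 legend (a trans-splice site within 1 kb of the stop codon, or within 500 bp of the poly(A) site, of another gene) can be illustrated with a minimal Python sketch. The record type, the function name and the coordinate convention (positions increasing 5' to 3' along the transcribed strand, windows treated as inclusive) are assumptions made for illustration and are not taken from the analysis itself.

```python
from dataclasses import dataclass
from typing import Optional

STOP_CODON_WINDOW_BP = 1000  # "within 1 kb of the stop codon"
POLYA_WINDOW_BP = 500        # "500 bp from the poly(A) site"


@dataclass
class UpstreamGene:
    """Hypothetical record for the nearest upstream gene on the same strand.

    Coordinates are assumed to increase 5' to 3' along the transcribed strand.
    """
    stop_codon_pos: int
    polya_site_pos: Optional[int] = None  # None if no poly(A) site is mapped


def is_downstream_in_operon(trans_splice_pos: int, upstream: UpstreamGene) -> bool:
    """Apply the two distance criteria from the Fig. 1 legend."""
    # Criterion 1: trans-splice site lies within 1 kb after the stop codon.
    from_stop = trans_splice_pos - upstream.stop_codon_pos
    if 0 <= from_stop <= STOP_CODON_WINDOW_BP:
        return True
    # Criterion 2: trans-splice site lies within 500 bp after the poly(A) site.
    if upstream.polya_site_pos is not None:
        from_polya = trans_splice_pos - upstream.polya_site_pos
        if 0 <= from_polya <= POLYA_WINDOW_BP:
            return True
    return False
```

The second criterion matters for upstream genes with long 3' untranslated regions, where the trans-splice site can lie more than 1 kb from the stop codon yet still be close to the poly(A) site.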
We used three methods to estimate the number of operons in the genome. First, we collected all of the genes in operons, both from the microarray data and from the list of SL2-containing cDNAs. The combined list contains 2,291 genes in 881 operons (Supplementary Information Table 2). Second, we estimated the number of operons that were missed by the microarray data. The list of SL2 spliced genes identified in the microarray experiments contained 74% of the genes identified from cDNA clones, and thus presumably of all SL2 spliced genes. Therefore we estimate that the genome contains at least 1,068 operons (790/0.74). Third, genes can be predicted to be in operons on the basis of their gene structure. We formed a list of possible operons on the basis of gene orientation and a spacing of less than 1 kilobase (kb) between stop and start codons. There are >3,000 possible operons on this list, and 790 of these were found to be SL2-enriched in our microarray experiments. On average, the remaining genes express transcripts that are at comparable levels to the SL2-containing transcripts, making it unlikely that we missed many genes because they are expressed at too low a level to have been detected on the microarrays or by cDNA clones. Instead, the remaining genes may not be in operons at all, and may simply be fortuitously close together.
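The third method, listing possible operons from gene orientation and a stop-to-start spacing of less than 1 kb, could be sketched roughly as follows. This is an illustration only: the Gene record, the function name and the decision to break a cluster at overlapping genes are assumptions, and the procedure actually used to build the list is not described in more detail than above.

```python
from dataclasses import dataclass
from typing import List

MAX_GAP_BP = 1000  # "a spacing of less than 1 kilobase (kb)"


@dataclass
class Gene:
    """Hypothetical gene record; start/end are the leftmost and rightmost
    coding coordinates on the chromosome (start < end)."""
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int
    end: int


def candidate_operons(genes: List[Gene], max_gap: int = MAX_GAP_BP) -> List[List[str]]:
    """Group adjacent same-strand genes separated by less than max_gap bp.

    Returns gene names per cluster, ordered 5' to 3' along the transcribed strand.
    """
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start))
    clusters: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in ordered:
        if current:
            prev = current[-1]
            # Distance between neighbouring coding regions; on either strand
            # this is the stop-to-start codon spacing.
            gap = gene.start - prev.end
            linked = (
                gene.chrom == prev.chrom
                and gene.strand == prev.strand
                and 0 <= gap < max_gap  # assumption: overlapping genes are not linked
            )
            if not linked:
                if len(current) > 1:
                    clusters.append(current)
                current = []
        current.append(gene)
    if len(current) > 1:
        clusters.append(current)

    result = []
    for cluster in clusters:
        names = [g.name for g in cluster]
        if cluster[0].strand == "-":
            names.reverse()  # the first gene of a minus-strand operon is rightmost
        result.append(names)
    return result
```

A list built this way is deliberately permissive, which is consistent with only a subset of the >3,000 candidates being SL2-enriched; the SL2 data are what separate likely operons from genes that are merely close together.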
The average operon contains 2.6 genes, and the longest contains 8
genes (Table 1).
332 operons have more than two genes, and in 58% of these every downstream
gene was scored as SL2 trans-spliced. These data indicate that a large
percentage of SL2-accepting genes were identified, and provide strong support
for the conclusion that downstream genes in operons are usually or always
trans-spliced by SL2. If there are about 1,000 operons with 2.6 genes
per operon, there are 2,600 genes in operons. Thus the C. elegans
genome, which contains between 17,300 (estimated from expressed open reading
frames [7]) and 19,000 (all known and predicted open
reading frames [5]) genes, expresses at least 13–15%
of its genes as operons. These operons are not evenly distributed on the
C. elegans chromosomes (Fig. 2).
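The arithmetic behind these estimates can be checked with a few lines of Python. The function names are illustrative only; the input numbers (790 operons, a 74% recovery rate, about 1,000 operons of 2.6 genes each, and genome sizes of 17,300 and 19,000 genes) are those quoted in the text.

```python
def estimated_total_operons(observed_operons: int, recovery_rate: float) -> float:
    """Scale the observed operon count by the fraction of SL2 genes recovered."""
    return observed_operons / recovery_rate


def percent_genes_in_operons(n_operons: int, genes_per_operon: float,
                             genome_gene_count: int) -> float:
    """Percentage of all genes that lie in operons."""
    return 100.0 * n_operons * genes_per_operon / genome_gene_count


# 790 operons found with SL2-enriched genes, 74% recovery of known SL2 genes.
total_operons = estimated_total_operons(790, 0.74)

# About 1,000 operons of 2.6 genes each, against the two genome-size estimates.
percent_low = percent_genes_in_operons(1000, 2.6, 19000)   # all predicted ORFs
percent_high = percent_genes_in_operons(1000, 2.6, 17300)  # expressed ORFs

print(f"estimated operons: {total_operons:.0f}")
print(f"genes in operons: {percent_low:.1f}% to {percent_high:.1f}%")
```

Rounded, the two percentages are 13.7% and 15.0%, which is where the 13–15% range comes from.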
Figure 2: Chromosomal distribution of operons. Each chromosome
was divided into equal-sized bins of
665,230 bp. The x axis is in Mb from the left end of each chromosome.
The number of predicted genes in
each bin (right-hand y axis) is shown by the data points. Operons
(left-hand y axis) are shown as bars.
Figure 3: Operon intercistronic distances. Distances between
the 3' end formation sites of upstream genes and the
trans-splice sites of downstream genes are plotted for the
285 operons for which reliable data are
available (listed in Supplementary
Information Table 6).
The correlation between SL2 trans-splicing and downstream position in an operon is quite strong. Nonetheless some genes that appear to be downstream in operons do not have high SL2/poly(A)+ scores, perhaps because their mRNAs were not well represented in the probe RNA population. Some operons that are expressed at low levels may have been missed. Also, some downstream genes in operons may get trans-spliced to SL1 rather than SL2 [8]. Operons with long spacing might be missed because they have a tendency to be SL1 spliced [3]. Furthermore, some genes that do get SL2 trans-spliced appear not to be downstream in operons. Perhaps there is a rare mode of SL2 trans-splicing that does not require a gene to be downstream in an operon.
Operons are a common form of gene organization in bacteria and archaea,
but they are in general absent in eukaryotes (with the possible exception
of trypanosomes). Based on genome sequences of yeast, Arabidopsis,
Drosophila and humans, operons are very unlikely to be found in
this wide array of species. Trans-splicing appears to be an enabling
characteristic. Presumably operons exist only when trans-splicing
can provide a
cap to protect the downstream RNA following 3' end cleavage and
prevent the accompanying transcription termination. Operons have been reported
only in rhabditid nematodes [9], although recent work
suggests they are found elsewhere among the nematodes (D. G. Guiliano and
M. Blaxter, personal communication). Nevertheless, the fact that operon
organization in C. elegans is so common implies that the genome
may be
quite plastic, perhaps owing to chromosomal rearrangements producing
new gene juxtapositions [10]. Given the relatively compact
C. elegans genome, operon evolution may have been driven in part
by constraints on chromosomal structure or organization.
Caenorhabditis elegans operons appear to be a means to co-regulate functionally related proteins, like bacterial operons. Related genes do occur in operons [11-15]. Indeed, numerous additional examples are found in the list of operons reported here. For example, D1054.2, encoding a proteasome subunit, is in an operon with a ubiquitin ligase complex subunit. ZK856.9, which encodes a TFIIIC transcription factor, is in an operon with an RNA polymerase III subunit. C15H11.9, encoding a regulator of ribosome synthesis, is in an operon with an RNA polymerase I subunit. C15C7.1, encoding a vesicle docking and trafficking protein, is in an operon with a GRIP domain protein that also functions in the trans-Golgi. These and numerous other examples show that related genes are often found together in operons. Furthermore, such relationships occur far more frequently than would be expected by chance. For example, all seven genes with an RNA-binding domain of the 'RNA recognition motif' (RRM) type that are in operons with other genes with identified functions are in operons with other nucleic-acid-interacting proteins. In contrast, of seven proteins likely to be involved with the Golgi, only one operon contains a nucleic-acid-binding protein, whereas four contain proteins related to transport. Our results show that genes for mitochondrial proteins have a strong tendency to be in operons with genes for other mitochondrial proteins, and that this relationship is highly significant (P = 3.6 × 10⁻⁴; see Supplementary Information Tables 3 and 4). The same is true for genes encoding splicing proteins. However, whether operons usually contain genes of related function is not yet known.
Nonetheless, the presence of a gene in an operon with another gene
has recently been used to successfully predict a previously unknown functional
relationship [16], suggesting that the operons can be used to uncover related
genes. We note that many examples of genes in operons are apparent orthologues
of genes that cause disease in humans [17] (Table 2).
It may be possible to identify novel genes that are functionally related to the disease genes by investigating the other genes in these operons.
SL2-enriched cDNA was prepared by reverse transcribing 5 µg
of mixed stage poly(A)+ RNA primed with oligo(dT) [18].
The cDNA was denatured at 70 °C for 2 min, and annealed to a T7/SL2
primer (1 µM;
5'-TGAATTGTAATACGACTCACTATAGGGAGAGGTTTTAACCCAGTTACTCA-3') at 42
°C for 5 min, followed by extension with Escherichia coli DNA
polymerase I Klenow fragment in 100 µl at 37 °C for 30 min.
RNase H was destroyed by incubating with 0.5% SDS and 20 µg
proteinase K for 1 h at 55 °C. The cDNA was extracted with phenol,
phenol/chloroform, chloroform/isoamyl alcohol and ethanol precipitated.
SL2-enriched cRNA was prepared with T7 RNA polymerase following the manufacturer's
Megascript protocol (Ambion). DNA microarrays are described in ref. [19].
RNA preparation, cDNA synthesis, labelled cDNA preparation, microarray
hybridization and microarray scanning were performed as previously described
[18]. Cy3-dUTP was used to label SL2-enriched cDNA and
Cy5-dUTP was used to label cDNA from poly(A)+ RNA made from
a mixed stage population of wild-type worms. The SL2-enriched probe and
the probe from the starting poly(A)+ mRNA were simultaneously
hybridized to DNA microarrays. To ensure reproducibility,
this procedure was repeated five times. Ratios of Cy3/Cy5 (SL2/poly(A)+)
signals were calculated for each gene and converted to log2(ratio).
We then calculated the average log2(ratio) from the five repeats.
The full data set is available as Supplementary
Information Table 5. The results are presented by dividing the resulting
log2(ratios) into bins (Fig. 1a).
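The ratio calculation described above (a log2 Cy3/Cy5 ratio per hybridization, averaged over the five repeats, then binned) can be sketched in Python as follows. The function names, the example intensities and the bin width of 0.5 are assumptions for illustration; the bin width used for Fig. 1 is not stated here.

```python
import math
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple


def mean_log2_ratio(replicates: List[Tuple[float, float]]) -> float:
    """Average log2(Cy3/Cy5) over replicate (Cy3, Cy5) intensity pairs.

    Cy3 is the SL2-enriched signal and Cy5 the poly(A)+ signal; both
    intensities are assumed to be positive.
    """
    return mean(math.log2(cy3 / cy5) for cy3, cy5 in replicates)


def bin_genes(scores: Dict[str, float], bin_width: float = 0.5) -> Counter:
    """Count genes per bin, keyed by the lower edge of each bin."""
    counts: Counter = Counter()
    for value in scores.values():
        lower_edge = math.floor(value / bin_width) * bin_width
        counts[lower_edge] += 1
    return counts


# Hypothetical intensities for one gene across five hybridizations.
example = [(4.0, 1.0), (2.0, 1.0), (8.0, 1.0), (4.0, 1.0), (2.0, 1.0)]
score = mean_log2_ratio(example)
histogram = bin_genes({"geneA": score, "geneB": 1.6, "geneC": -0.2})
```

Averaging the log2 ratios, rather than taking the log of the averaged ratios, follows the order of operations given in the Methods and keeps a single outlying hybridization from dominating the score.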
Supplementary information accompanies this paper.
We thank J. Spieth, J. Kent, A. Zahler and L. Stein for help with
navigation of the C. elegans databases, Y. Kohara for cDNA data,
M. Huang for discussions, I. Shah for statistical advice, D. Guiliano and
M. Blaxter for communication of unpublished results, and P. MacMorris for
advice on the manuscript. This work was
supported by the NIH (T.B., C.D.L. and S.K.K.).
1. Spieth, J., Brooke, G., Kuersten, S., Lea, K. & Blumenthal, T. Operons in C. elegans: Polycistronic mRNA precursors are processed by trans-splicing of SL2 to downstream coding regions. Cell 73, 521-532 (1993)
2. Huang, X.-Y. & Hirsh, D. A second trans-spliced RNA leader sequence in the nematode Caenorhabditis elegans. Proc. Natl Acad. Sci. USA 86, 8640-8644 (1989)
3. Blumenthal, T. & Steward, K. in C. elegans II (eds D. L. Riddle et al.) 117-145 (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 1997)
4. Zorio, D. A. R., Cheng, N., Blumenthal, T. & Spieth, J. Operons represent a common form of chromosomal organization in C. elegans. Nature 372, 270-272 (1994)
5. Stein, L., Sternberg, P., Durbin, R., Thierry-Mieg, J. & Spieth, J. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29, 82-86 (2001)
6. Kent, W. J. & Zahler, A. M. The intronerator: exploring introns and alternative splicing in Caenorhabditis elegans. Nucleic Acids Res. 28, 91-93 (2000)
7. Reboul, J. et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nature Genet. 27, 332-336 (2001)
8. Williams, C., Xu, L. & Blumenthal, T. SL1 trans-splicing and 3' end formation in a unique class of Caenorhabditis elegans operon. Mol. Cell. Biol. 19, 376-383 (1999)
9. Evans, D. et al. Operons and SL2 trans-splicing exist in nematodes outside the genus Caenorhabditis. Proc. Natl Acad. Sci. USA 94, 9751-9756 (1997)
10. Huynen, M. A., Snel, B. & Bork, P. Inversions and the dynamics of eukaryotic gene order. Trends Genet. 17, 304-306 (2001)
11. Page, A. P. Cyclophilin and protein disulphide isomerase genes are co-transcribed in a functionally related manner in Caenorhabditis elegans. DNA Cell Biol. 16, 1335-1343 (1997)
12. Huang, L. S., Tzou, P. & Sternberg, P. W. The lin-15 locus encodes two negative regulators of Caenorhabditis elegans vulval development. Mol. Biol. Cell 5, 395-412 (1994)
13. Clark, S. G., Lu, X. & Horvitz, H. R. The Caenorhabditis elegans locus lin-15, a negative regulator of a tyrosine kinase signalling pathway, encodes two different proteins. Genetics 137, 987-997 (1994)
14. Treinin, M., Gillo, B., Liebman, L. & Chalfie, M. Two functionally dependent acetylcholine subunits are encoded in a single Caenorhabditis elegans operon. Proc. Natl Acad. Sci. USA 95, 15492-15495 (1998)
15. Mazroui, R., Puoti, A. & Kramer, A. Splicing factor SF1 from Drosophila and Caenorhabditis: presence of an N-terminal RS domain and requirement for viability. RNA 5, 1615-1631 (1999)
16. Furst, J. et al. ICln ion channel splice variants in Caenorhabditis elegans. Voltage dependence and interaction with an operon partner protein. J. Biol. Chem. 277, 4435-4445 (2002)
17. Culetto, E. & Sattelle, D. B. A role for Caenorhabditis elegans in understanding the function and interactions of human disease genes. Hum. Mol. Genet. 9, 869-877 (2000)
18. Reinke, V. et al. A global profile of germline gene expression in C. elegans. Mol. Cell 6, 605-616 (2000)
19. Jiang, M. et al. Genome-wide analysis of developmental and sex-regulated gene expression profiles in Caenorhabditis elegans. Proc. Natl Acad. Sci. USA 98, 218-223 (2001)