# Sizes of genomes: The C‑value paradox

The C-value is the amount of DNA in the haploid genome of an organism. It varies over a very wide range, with a general increase in C-value with complexity of organism from prokaryotes to invertebrates, vertebrates, and plants. The C-value paradox is basically this: how can we account for the amount of DNA in terms of known function? Very similar organisms can show a large difference in C-values (e.g. amphibians). The amount of genomic DNA in complex eukaryotes is much greater than the amount needed to encode proteins. For example, mammals have 30,000 to 50,000 genes, but their genome size (or C-value) is $$3 \times 10^9$$ bp.

($$3 \times 10^9$$ bp)/3000 bp (average gene size) = $$1 \times 10^6$$ (“gene capacity”).

Drosophila melanogaster has about 5000 mutable loci (~genes). If the average size of an insect gene is 2000 bp, then ($$>1 \times 10^8$$ bp)/($$2 \times 10^3$$ bp) = >50,000 “gene capacity”.
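The arithmetic behind these "gene capacity" estimates can be checked with a short calculation. This is a minimal sketch: the function name `gene_capacity` is introduced here only for illustration, and the genome and gene sizes are the round figures quoted in the text.

```python
def gene_capacity(genome_size_bp: float, average_gene_size_bp: float) -> float:
    """Number of genes that would fit if the entire genome consisted of genes."""
    return genome_size_bp / average_gene_size_bp


# Mammal: 3 x 10^9 bp genome, 3000 bp average gene (figures from the text)
mammal = gene_capacity(3e9, 3000)
# Drosophila: lower bound of 1 x 10^8 bp genome, 2000 bp average gene
fly = gene_capacity(1e8, 2000)

print(f"mammalian gene capacity:  {mammal:.0e}")
print(f"Drosophila gene capacity: {fly:.0f}")
```

Comparing these capacities with the estimated gene numbers (tens of thousands for mammals, about 5000 for Drosophila) shows the size of the excess that the paradox refers to.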

Figure 4.1.

Our current understanding of complex genomes reveals several factors that help explain the classic C-value paradox:

• Introns in genes
• Regulatory elements of genes
• Pseudogenes
• Multiple copies of genes
• Intergenic sequences
• Repetitive DNA

The facts that some of the genomic DNA from complex organisms is highly repetitive, and that some proteins are encoded by families of genes whereas others are encoded by single genes, mean that the genome can be considered to have several distinctive components. Analysis of the kinetics of DNA reassociation, largely in the 1970s, showed that such genomes have components that can be distinguished by their repetition frequency. The experimental basis for this will be reviewed in the first several sections of this chapter, along with application of hybridization kinetics to measurement of complexity and abundance of mRNAs. Advances in genomic sequencing have provided more detailed views of genome structure, and some of this information will be reviewed in the latter sections of this chapter.

### Table 4.1. Distinct components in complex genomes

| Component | Repetition frequency (R) | Information content and complexity |
| --- | --- | --- |
| Highly repeated DNA | R > 100,000 | Almost no information, low complexity |
| Moderately repeated DNA | 10 < R < 10,000 | Little information, moderate complexity |
| “Single copy” DNA | R = 1 or 2 | Much information, high complexity |

R = repetition frequency

### Reassociation kinetics measure sequence complexity

#### Low complexity DNA sequences reanneal faster than do high complexity sequences

The components of complex genomes differ not only in repetition frequency (highly repetitive, moderately repetitive, single copy) but also in sequence complexity. Complexity (denoted by N) is the number of base pairs of unique or nonrepeating DNA in a given segment of DNA, or component of the genome. This is different from the length (L) of the sequence if some of the DNA is repeated, as illustrated in this example.

E.g. consider 1000 bp DNA.

500 bp is sequence a, present in a single copy.

500 bp is sequence b (100 bp) repeated 5 times:

a            b  b   b   b   b

|___________|__|__|__|__|__|           L = length = 1000 bp = a + 5b

N = complexity = 600 bp = a + b
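The distinction between length and complexity can be made concrete in a few lines. This is a minimal sketch, assuming each component is described by a hypothetical `(unit_length_bp, copy_number)` pair and that different components share no sequence with each other.

```python
def length_and_complexity(components):
    """Return (L, N) for a DNA made of (unit_length_bp, copy_number) pairs.

    L counts every copy of every unit; N counts each distinct unit once.
    """
    total_length = sum(unit * copies for unit, copies in components)
    complexity = sum(unit for unit, _copies in components)
    return total_length, complexity


# Sequence a: 500 bp, single copy. Sequence b: 100 bp, repeated 5 times.
L, N = length_and_complexity([(500, 1), (100, 5)])
print(f"L = {L} bp, N = {N} bp")
```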

Some viral and bacteriophage genomes have almost no repeated DNA, and L is approximately equal to N. But for many genomes, repeated DNA occupies 0.1 to 0.5 of the genome, as in this simple example. The key result for genome analysis is that less complex DNA sequences renature faster than do more complex sequences. Thus determining the rate of renaturation of genomic DNA allows one to determine how many kinetic components (sequences of different complexity) are in the genome, what fraction of the genome each occupies, and the repetition frequency of each component.

Before investigating this in detail, let's look at an example to illustrate this basic principle, i.e. the inverse relationship between reassociation kinetics and sequence complexity.

### Inverse Relationship between Reassociation Kinetics and Sequence Complexity

Let a, b, ... z represent a string of base pairs in DNA that can hybridize (see Fig. 4.2.). For simplicity in arithmetic, we will use 10 bp per letter.

• DNA 1 = ab (This is very low sequence complexity, 2 letters or 20 bp)
• DNA 2 = cdefghijklmnopqrstuv.  (This is 10 times more complex (20 letters or 200 bp)).
• DNA 3 = izyajczkblqfreighttrainrunninsofastelizabethcottonqwftzxvbifyoudontbelieveimleavingyoujustcountthe  (This is 100 times more complex (200 letters or 2000 bp)).

A solution of 1 mg DNA/ml is 0.0015 M (in terms of moles of bp per L) or 0.003 M (in terms of nucleotides per L).  We'll use 0.003 M = 3 mM, i.e. 3 mmoles nts per L.  (nts = nucleotides).

Consider a 1 mg/ml solution of each of the three DNAs. For DNA 1, this means that the sequence ab (20 nts) is present at 0.15 mM or 150 µM (calculated from 3 mM / 20 nt in the sequence). Likewise, DNA 2 (200 nts) is present at 15 µM, and DNA 3 is present at 1.5 µM. Melt the DNA (i.e. dissociate into separate strands) and then allow the solution to reanneal, i.e. let the complementary strands reassociate.

Since the rate of reassociation is determined by the rate of the initial encounter between complementary strands, the higher the concentration of those complementary strands, the faster the DNA will reassociate. So for a given overall DNA concentration, the simple sequence (ab) in low complexity DNA 1 will reassociate 100 times faster than the more complex sequence (izyajczk ....trad) in the higher complexity DNA 3. Fast reassociating DNA is low complexity.
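The concentration argument can be reproduced numerically. This is a minimal sketch: the function name `sequence_concentration` is hypothetical, and it assumes, as in the text, that a 1 mg/ml solution corresponds to 3 mM nucleotides and that each DNA consists only of copies of its own sequence.

```python
TOTAL_NT_MOLAR = 3e-3  # 1 mg/ml DNA expressed as moles of nucleotides per liter


def sequence_concentration(complexity_nt: float,
                           total_nt_molar: float = TOTAL_NT_MOLAR) -> float:
    """Molar concentration of one particular sequence of the given complexity."""
    return total_nt_molar / complexity_nt


for name, complexity in [("DNA 1", 20), ("DNA 2", 200), ("DNA 3", 2000)]:
    micromolar = sequence_concentration(complexity) * 1e6
    print(f"{name}: N = {complexity:>4} nt, sequence concentration = {micromolar:.1f} uM")

# Ratio of concentrations, and hence of initial encounter rates, DNA 1 vs DNA 3
ratio = sequence_concentration(20) / sequence_concentration(2000)
print(f"DNA 1 is {ratio:.0f} times more concentrated than DNA 3")
```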

Fig. 4.2.

#### Kinetics of renaturation

In this section, we will develop the relationships among rates of renaturation, complexity, and repetition frequency more formally.

Figure 4.3.

The time required for half renaturation is inversely proportional to the rate constant. Let C = concentration of single-stranded DNA at time $$t$$ (expressed as moles of nucleotides per liter). The rate of loss of single-stranded (ss) DNA during renaturation is given by the following expression for a second-order rate process:

$\dfrac{-dC}{dt}= kC^2$ or $\dfrac{dC}{C^2}=-kdt$

Integration and some algebraic substitution shows that

$\dfrac{C}{C_o}=\dfrac{1}{1+kC_ot} \;\;\; (1)$
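The integration step can be written out explicitly. This is a sketch of the usual integration of a second-order rate law, taking as an assumption the initial condition $$C = C_o$$ at $$t = 0$$:

$\displaystyle\int_{C_o}^{C}\dfrac{dC}{C^2} = -k\int_{0}^{t}dt \;\;\Rightarrow\;\; \dfrac{1}{C_o}-\dfrac{1}{C} = -kt$

$\dfrac{1}{C} = \dfrac{1}{C_o}+kt \;\;\Rightarrow\;\; \dfrac{C_o}{C} = 1+kC_ot$

Taking the reciprocal of the last expression gives equation (1).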

Thus, at half renaturation, when

$\dfrac{C}{C_o}=0.5 \; \text{at} \; t=t_{1/2}$

one obtains:

$C_ot_{1/2}=\dfrac{1}{k} \; \;\; (2)$

where $$k$$ is the rate constant in liters (mole nt)$$^{-1}$$ sec$$^{-1}$$.

The rate constant for renaturation is inversely proportional to sequence complexity. The rate constant, k, shows the following proportionality:

$k \propto \dfrac{\sqrt{L}}{N} \;\;\; (3)$

where

• L = length and
• N = complexity.

Empirically, the rate constant k has been measured as

$k = 3 \times 10^5 \dfrac{\sqrt{L}}{N}$

in 1.0 M Na$$^+$$ at $$T = T_m - 25^\circ\text{C}$$

The time required for half renaturation (and thus $$C_0t_{1/2}$$) is directly proportional to sequence complexity.

From equations (2) and (3),

$C _0 t_{1/2} \propto \dfrac{N}{\sqrt{L}} \;\;\;\; (4)$

For a renaturation measurement, one usually shears DNA to a constant fragment length L (e.g. 400 bp). Then L is no longer a variable, and

$C_o t_{1/2}\propto N \;\;\;\; (5).$

The data for renaturation of genomic DNA are plotted as $$C_0 t$$ curves:

Figure 4.4.

Renaturation of a single component is complete (0.1 to 0.9) over 2 logs of $$C_0t$$ (e.g., 1 to 100 for E. coli DNA), as predicted by equation (1).
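The "2 logs" statement follows directly from equation (1), as a short calculation shows. This is a minimal sketch with hypothetical function names; the $$C_0t_{1/2}$$ of 10 is an illustrative value, and $$k$$ is replaced by $$1/C_0t_{1/2}$$ using equation (2).

```python
def fraction_single_stranded(c0t: float, c0t_half: float) -> float:
    """C/C0 from equation (1), with k = 1 / C0t_half from equation (2)."""
    return 1.0 / (1.0 + c0t / c0t_half)


def c0t_at_fraction_renatured(fraction_renatured: float, c0t_half: float) -> float:
    """Invert equation (1): the C0t at which a given fraction has renatured."""
    remaining = 1.0 - fraction_renatured  # this is C/C0
    return c0t_half * fraction_renatured / remaining


c0t_half = 10.0  # illustrative value for a single-component DNA
low = c0t_at_fraction_renatured(0.1, c0t_half)
high = c0t_at_fraction_renatured(0.9, c0t_half)

print(f"10% renatured at C0t = {low:.2f}")
print(f"90% renatured at C0t = {high:.2f}")
print(f"fold increase in C0t = {high / low:.0f}")
```

For an ideal second-order reaction the ratio is 81-fold regardless of the $$C_0t_{1/2}$$ chosen, which is the basis for describing a single component as renaturing over about 2 logs of $$C_0t$$.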

#### Sequence complexity is usually measured by a proportionality to a known standard

If you have a standard of known genome size, you can calculate $$N$$ from $$C_0t_{1/2}$$:

$\dfrac{N^{unknown}}{N^{known}} = \dfrac{C_0t_{1/2}^{unknown}}{C_0t_{1/2}^{known}} \;\;\;\; (6)$

A known standard could be

• E. coli with N = $$4.639 \times 10^6$$ bp
• pBR322 with N = 4362 bp

More complex DNA sequences renature more slowly than do less complex sequences. By measuring the rate of renaturation for each component of a genome, along with the rate for a known standard, one can measure the complexity of each component.
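Equation (6) translates into a one-line calculation. This is a minimal sketch: the function name is hypothetical, and the two $$C_0t_{1/2}$$ values in the example are made-up numbers chosen only to illustrate the proportionality, not measurements.

```python
def complexity_from_standard(c0t_half_unknown: float,
                             c0t_half_standard: float,
                             n_standard_bp: float) -> float:
    """Equation (6): N(unknown) = N(known) * C0t1/2(unknown) / C0t1/2(known)."""
    return n_standard_bp * c0t_half_unknown / c0t_half_standard


E_COLI_N_BP = 4.639e6  # complexity of the E. coli standard, from the text

# Hypothetical measurements made under identical conditions:
# standard C0t1/2 = 4.0, unknown C0t1/2 = 0.004 (1000-fold faster)
n_unknown = complexity_from_standard(0.004, 4.0, E_COLI_N_BP)
print(f"estimated complexity of unknown: {n_unknown:.0f} bp")
```

An unknown that renatures 1000 times faster than the standard is thus assigned a complexity 1000 times smaller.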

#### Analysis of $$C_ot$$ curves with multiple components

In this section, the analysis developed above is applied quantitatively in an example of renaturation of genomic DNA. If an unknown DNA has a single kinetic component, meaning that the fraction renatured increases from 0.1 to 0.9 as the value of $$C_0t$$ increases 100-fold, then one can calculate its complexity easily. Using equation (6), all one needs to know is its $$C_0t_{1/2}$$, plus the $$C_0t_{1/2}$$ and complexity of a standard renatured under identical conditions (initial concentration of DNA, salt concentration, temperature, etc.).

The same logic applies to the analysis of a genome with multiple kinetic components. Some genomes reanneal over a range of $$C_0t$$ values covering many orders of magnitude, e.g. from $$10^{-3}$$ to $$10^4$$. Some of the DNA renatures very fast; it has low complexity, and as we shall see, high repetition frequency. Other components in the DNA renature slowly; these have higher complexity and lower repetition frequency. The only new wrinkle to the analysis is to treat each kinetic component independently. This is a reasonable approach, since the DNA is sheared to short fragments, e.g. 400 bp, and it is unlikely that a fast-renaturing DNA will be part of the same fragment as a slow-renaturing DNA.

Some terms and abbreviations need to be defined here.

• f = fraction of genome occupied by a component
• $$C_0t_{1/2}$$ for pure component = (f) ($$C_0t_{1/2}$$ measured in the mixture of components)
• R = repetition frequency
• G = genome size. G can be measured chemically (e.g. amount of DNA per nucleus of a cell) or kinetically (see below).

One can read and interpret the $$C_0t$$ curve as follows. One has to estimate the number of components in the mixture that makes up the genome. In the hypothetical example in Fig. 4.5, three components can be seen, and another is inferred because 10% of the genome has renatured as quickly as the first assay can be done. The three observable components are the three segments of the curve, each with an inflection point at the center of a part of the curve that covers a 100-fold increase in $$C_0t$$ (sometimes called 2 logs of $$C_0t$$). The fraction of the genome occupied by a component, f, is measured as the fraction of the genome annealing in that component. The measured $$C_0t_{1/2}$$ is the value of $$C_0t$$ at which half the component has renatured. In Fig. 4.5, component 2 renatures between $$C_0t$$ values of $$10^{-3}$$ and $$10^{-1}$$, and the fraction of the genome renatured increases from 0.1 to 0.3 over this range. Thus f is 0.3 - 0.1 = 0.2. The $$C_0t$$ value at half-renaturation for this component is the value seen when the fraction renatured reaches 0.2 (i.e. half-way between 0.1 and 0.3); this $$C_0t$$ value is $$10^{-2}$$, and it is referred to as the $$C_0t_{1/2}$$ for component 2 (measured in the mixture of components). Values for the other components are tabulated in Fig. 4.5.

Figure 4.5.

All the components of the genome are present in the genomic DNA initially denatured. Thus the value for $$C_0$$ is for all the genomic DNA, not for the individual components. But once one knows the fraction of the genome occupied by a component, one can calculate the $$C_0$$ for each individual component, simply as $$C_0 \times f$$. Thus the $$C_0t_{1/2}$$ for the individual component is the $$C_0t_{1/2}$$ (measured in the mixture of components) $$\times$$ f. For example, the $$C_0t_{1/2}$$ for individual (pure) component 2 is $$10^{-2} \times 0.2 = 2 \times 10^{-3}$$.

Knowing the measured $$C_0t_{1/2}$$ for a DNA standard, one can calculate the complexity of each component.

$N_n = C_0t_{1/2}^{pure,\,n} \times \dfrac{N^{std}}{C_0t_{1/2}^{std}}$

• where n refers to the particular component, i.e. (1, 2, 3, or 4)

The repetition frequency of a given component is the total number of base pairs in that component divided by the complexity of the component. The total number of base pairs in that component is given by $$f_n \times G$$.

$R_n = \dfrac{f_n \times G}{N_n}$

For the data in Fig. 4.5, one can calculate the following values:

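| Component | f | $$C_0t_{1/2}$$, mix | $$C_0t_{1/2}$$, pure | N (bp) | R |
| --- | --- | --- | --- | --- | --- |
| 1  foldback | 0.1 | $$< 10^{-4}$$ | $$< 10^{-4}$$ | | |
| 2  fast | 0.2 | $$10^{-2}$$ | $$2 \times 10^{-3}$$ | 600 | $$10^5$$ |
| 3  intermediate | 0.1 | 1 | 0.1 | $$3 \times 10^4$$ | $$10^3$$ |
| 4  slow (single copy) | 0.6 | $$10^3$$ | 600 | $$1.8 \times 10^8$$ | 1 |
| std bacterial DNA | | | 10 | $$3 \times 10^6$$ | 1 |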

The genome size, G, can be calculated from the complexity of the single-copy component (for which R = 1) and the fraction of the genome it occupies.

$G = \dfrac{N_{sc}}{f_{sc}}$

E.g. If G = $$3 \times 10^8$$ bp, and component 2 occupies 0.2 of it, then component 2 contains $$6 \times 10^7$$ bp. But the complexity of component 2 is only 600 bp. Therefore it would take $$10^5$$ copies of that 600 bp sequence to comprise $$6 \times 10^7$$ bp, and we surmise that R = $$10^5$$.
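The values tabulated for Fig. 4.5 can be recomputed step by step. This is a minimal sketch under stated assumptions: the function and variable names are hypothetical, the slowest component is assumed to be single copy (R = 1) so that the genome size is its complexity divided by its fraction of the genome, and the foldback component is left out because only upper bounds are given for it.

```python
N_STD_BP = 3e6       # complexity of the bacterial DNA standard (from the table)
C0T_HALF_STD = 10.0  # C0t1/2 of the standard (from the table)

# (name, f, C0t1/2 measured in the mixture), read from the data for Fig. 4.5
COMPONENTS = [
    ("fast", 0.2, 1e-2),
    ("intermediate", 0.1, 1.0),
    ("slow (single copy)", 0.6, 1e3),
]


def analyze(components, n_std=N_STD_BP, c0t_half_std=C0T_HALF_STD):
    """Return (rows, genome_size) with pure C0t1/2, N and R for each component."""
    rows = []
    for name, f, c0t_half_mix in components:
        c0t_half_pure = f * c0t_half_mix                      # correct for fraction f
        complexity = c0t_half_pure * n_std / c0t_half_std     # scale to the standard
        rows.append({"name": name, "f": f, "pure": c0t_half_pure, "N": complexity})
    # Assumption: the slowest-renaturing component is single copy (R = 1)
    slowest = max(rows, key=lambda row: row["pure"])
    genome_size = slowest["N"] / slowest["f"]
    for row in rows:
        row["R"] = row["f"] * genome_size / row["N"]          # copies of each sequence
    return rows, genome_size


rows, G = analyze(COMPONENTS)
print(f"G = {G:.1e} bp")
for row in rows:
    print(f'{row["name"]:>20}: N = {row["N"]:.1e} bp, R = {row["R"]:.0e}')
```

Running the sketch should reproduce the complexities and repetition frequencies listed for components 2 to 4, which is a useful check on any table of this kind.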

Question 4.1.

If one substitutes the equation for $$N_n$$ and for G into the equation for $$R_n$$, a simple relationship for R can be derived in terms of $$C_0t_{1/2}$$ values measured for the mixture of components. What is it?

### Types of DNA in each kinetic component for complex genomes

Eukaryotic genomes usually have multiple components, which generates complex $$C_0t$$ curves. Fig. 4.6 shows a schematic $$C_0t$$ curve that illustrates the different kinetic components of human DNA, and the following table gives some examples of members of the different components.

Figure 4.6.

Table 4.2.  Four principal kinetic components of complex genomes

| Renaturation kinetics | $$C_0t$$ descriptor | Repetition frequency | Examples |
| --- | --- | --- | --- |
| too rapid to measure | "foldback" | not applicable | inverted repeats |
| fast renaturing | low $$C_0t$$ | highly repeated, > $$10^5$$ copies per cell | interspersed short repeats (e.g. human Alu repeats); tandem repeats of short sequences (centromeres) |
| intermediate renaturing | mid $$C_0t$$ | moderately repeated, 10 to $$10^4$$ copies per cell | families of interspersed repeats (e.g. human L1 long repeats); rRNA, 5S RNA, histone genes |
| slow renaturing | high $$C_0t$$ | low, 1-2 copies per cell, "single copy" | most structural genes (with their introns); much of the intergenic DNA |

N and R for repeated DNAs are averages for many families of repeats. Individual members of families of repeats are similar but not identical to each other.

The emerging picture of the human genome reveals approximately 30,000 genes encoding proteins and structural or functional RNAs. These are spread out over 22 autosomes and 2 sex chromosomes. Almost all have introns, some with a few short introns and others with very many long introns. Almost always a substantial amount of intergenic DNA separates the genes.

Several different families of repetitive DNA are interspersed throughout the intergenic and intronic sequences. Almost all of these repeats are vestiges of transposition events, and in some cases the source genes for these transposons have been found. Some of the most abundant families of repeats transposed via an RNA intermediate, and can be called retrotransposons. The most abundant repetitive family in humans is the Alu repeats, named for a common restriction endonuclease site within them. They are about 300 bp long, and about 1 million copies are in the genome. They are probably derived from a modified gene for a small RNA called 7SL RNA. (This RNA is involved in translation of secreted and membrane-bound proteins.) Genomes of species from other mammalian orders (and indeed all vertebrates examined) have roughly comparable numbers of short interspersed repeats independently derived from genes encoding other short RNAs, such as transfer RNAs.

Another prominent class of repetitive retrotransposons is the long L1 repeats. Full-length copies of L1 repeats are about 7000 bp long, although many copies are truncated from the 5' end. About 50,000 copies are in the human genome. Full-length copies of recently transposed L1s and their source genes have two open reading frames (i.e. can encode two proteins). One is a multifunctional protein similar to the product of the pol gene of retroviruses. It encodes a functional reverse transcriptase. This enzyme may play a key role in the transposition of all retrotransposons. Repeats similar to L1s are found in all mammals and in other species, although the L1s within each mammalian order have features distinctive to that order. Thus both short interspersed repeats (or SINEs) and the L1 long interspersed repeats (or LINEs) have expanded and propagated independently in different mammalian orders.

Both types of retrotransposons are currently active, generating de novo mutations in humans. A small subset of SINEs have been implicated as functional elements of the genome, providing post-transcriptional processing signals as well as protein-coding exons for a small number of genes.

Other classes of repeats, such as L2s (long repeats) and MIRS (short repeats named mammalian interspersed repeats), appear to predate the mammalian radiation, i.e. they appear to have been in the ancestral eutherian mammal. Other classes of repeats are transposable elements that move by a DNA intermediate.

Other common interspersed repeated sequences in humans

#### LTR-containing retrotransposons

MaLR: mammalian, LTR retrotransposons

Endogenous retroviruses

MER4 (MEdium Reiterated repeat, family 4)

MER1 and MER2

#### Mariner repeats

Some of the repeats are clustered into tandem arrays and make up distinctive features of chromosomes (Fig. 4.7). In addition to the interspersed repeats discussed above, another contributor to the moderately repetitive DNA fraction is the thousands of copies of rRNA genes. These are in extensive tandem arrays on a few chromosomes, and are condensed into heterochromatin. Other chromosomal structures with extensive arrays of tandem repeats are centromeres and telomeres.

Figure 4.7. Clustered repeated sequences in the human genome.

The common way of finding repeats now is by sequence comparison to a database of repetitive DNA sequences, RepBase (from J. Jurka). One of the best tools for finding matches to these repeats is RepeatMasker (from Arian Smit and P. Green, U. Wash.). A web server for RepeatMasker can be accessed at: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

Question 4.2. Try RepeatMasker on the INS gene sequence. You can get the INS sequence either from NCBI (GenBank accession gi|307071|gb|L15440.1 or one can use LocusLink, query on ) or from the course website.

#### Very little of the nonrepetitive DNA component is expressed as mRNA

Hybridization kinetic studies of RNA revealed several important insights. First, saturation experiments, in which an excess of unlabeled RNA was used to drive labeled, nonrepetitive DNA (tracer) into hybrid, showed that only a small fraction of the nonrepetitive DNA was present in mRNA. Classic experiments from Eric Davidson’s lab showed that only 2.70% of total nonrepetitive DNA corresponds to mRNA isolated from sea urchin gastrula (this is corrected for the fact that only one strand of DNA is copied into RNA; the actual amount driven into hybrid is half this, or 1.35%; Fig. 4.8). The complexity of this nonrepetitive fraction ($$N_{sc}$$) is $$6.1 \times 10^8$$ bp, so only $$1.64 \times 10^7$$ bp of this DNA is present as mRNA in the cell. If an "average" mRNA is 2000 bases long, there are ~8200 mRNAs present in gastrula.
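The estimate of the number of mRNAs follows from three multiplications and divisions. This is a minimal sketch with a hypothetical function name; all input values are the ones quoted in the text, and the 2000-base average mRNA length is the text's own working assumption.

```python
def mrna_estimate(fraction_hybridized: float,
                  single_copy_complexity_bp: float,
                  average_mrna_nt: float):
    """Return (fraction expressed, expressed bp, approximate number of mRNAs)."""
    # Only one DNA strand is copied into RNA, so the fraction of tracer driven
    # into hybrid is doubled to get the fraction of sequences represented.
    fraction_expressed = 2 * fraction_hybridized
    expressed_bp = fraction_expressed * single_copy_complexity_bp
    number_of_mrnas = expressed_bp / average_mrna_nt
    return fraction_expressed, expressed_bp, number_of_mrnas


fraction, expressed_bp, n_mrnas = mrna_estimate(0.0135, 6.1e8, 2000)
print(f"fraction of nonrepetitive DNA represented in mRNA: {fraction:.2%}")
print(f"complexity of the mRNA population: {expressed_bp:.2e} bp")
print(f"approximate number of different mRNAs: {n_mrnas:.0f}")
```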

In contrast, if the nonrepetitive DNA is hybridized to nuclear RNA from the same tissue, 28% of the nonrepetitive fraction corresponds to RNA (Fig. 4.8). The nuclear RNA is heterogeneous in size, and is sometimes referred to as heterogeneous nuclear RNA, or hnRNA. Some of it is quite large, much more so than most of the mRNA associated with ribosomes in the cytoplasm. The latter is called polysomal mRNA.

Figure 4.8.

These data show that a substantial fraction of the genome (over one-fourth of the nonrepetitive fraction) is transcribed in nuclei at the gastrula stage, but much of this RNA never gets out of the nucleus (or more formally, many more sequences from the DNA are represented in nuclear RNA than in cytoplasmic RNA). Thus much of the complexity in nuclear RNA stays in the nucleus; it is not processed into mRNA and is never translated into proteins.

#### Factors contributing to an explanation include:

1. Genes may be transcribed but the RNA is not stable. (Even the cytoplasmic mRNA from different genes can show different stabilities; this is one level of regulation of expression. But there could also be genes whose transcripts are so unstable in some tissues that they are never processed into cytoplasmic mRNA, and thus never translated. In this latter case, the gene is transcribed but not expressed into protein.)
2. Intronic RNA is transcribed and turns over rapidly after splicing.
3. Genes are transcribed well past the poly A addition site. These transcripts through the 3' flanking, intergenic regions are usually very unstable.
4. Not all of this "extra" RNA in the nucleus is unstable. For instance, some RNAs are used in the nucleus, e.g.:
    • U2-Un RNAs in splicing (small nuclear RNAs, or snRNAs).
    • RNA may be a structural component of the nuclear scaffold (S. Penman).

Thus, although 10 times as much RNA complexity is present in the nucleus compared to the cytoplasm, this does not mean that 10 times as many genes are being transcribed as are being translated. Some fraction (unknown presently) of this "excess" nuclear RNA may represent genes that are being transcribed but not expressed, but many other factors also contribute to this phenomenon.

#### mRNA populations in different tissues show considerable overlap:

• Housekeeping genes encode metabolic functions found in almost all cells.
• Specialized genes, or tissue-specific genes, are expressed in only 1 (or a small number of) tissues. These tissue-specific genes are sometimes expressed in large amounts.

#### Estimating numbers of genes expressed and mRNA abundance from the kinetics of RNA-driven reactions

Using principles similar to those for analysis of repetition classes in genomic DNA, one can determine from the kinetics of hybridization between a preparation of RNA and single copy DNA both the average number of genes represented in the RNA, as well as the abundance of the mRNAs. The details of the kinetic analysis will not be presented, but they are similar to those already discussed. Highly abundant RNAs (like high copy number DNA) will hybridize to genomic DNA faster than will low abundance RNA (like low copy number DNA). Only a few mRNAs are highly abundant, and they constitute a low complexity fraction. The bulk of the genes are represented by lower abundance mRNA, and these many mRNAs constitute a high complexity, slowly hybridizing fraction.

An example is summarized in Table 4.3. An excess of mRNA from chick oviduct was hybridized to a tracer of labeled cDNA (prepared from oviduct mRNA). Three principal components were found, ranging from the highly abundant ovalbumin mRNA to much rarer mRNAs from many genes.

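Table 4.3.

| Component | Kinetics of hybridization | N (nt) | # mRNAs | Abundance | Example |
| --- | --- | --- | --- | --- | --- |
| 1 | fast | 2,000 | 1 | 120,000 | Ovalbumin |
| 2 | medium | 15,000 | 7-8 | 4,800 | Ovomucoid, others |
| 3 | slow | $$2.6 \times 10^7$$ | 13,000 | 6-7 | Everything else |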
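The "# mRNAs" column in Table 4.3 is consistent with dividing each component's complexity by an average mRNA length. This is a minimal sketch: the 2000-nucleotide average is an assumption carried over from the sea urchin example earlier in the chapter, not a value stated for the oviduct data, and the names used are hypothetical.

```python
AVERAGE_MRNA_NT = 2000  # assumption: same "average" mRNA length used earlier

# Complexity (N, in nt) of each kinetic component, from Table 4.3
COMPLEXITY_NT = {"fast": 2_000, "medium": 15_000, "slow": 2.6e7}


def number_of_mrnas(complexity_nt: float,
                    average_mrna_nt: float = AVERAGE_MRNA_NT) -> float:
    """Approximate number of different mRNAs in a kinetic component."""
    return complexity_nt / average_mrna_nt


for component, complexity in COMPLEXITY_NT.items():
    print(f"{component:>6}: about {number_of_mrnas(complexity):g} different mRNAs")
```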

#### Preparation of normalized cDNA libraries for ESTs

Just like the mRNA populations used as the templates for reverse transcriptase, the cDNAs from a particular tissue or cell type will be composed of many copies of a very few, abundant mRNAs, a fairly large number of copies of the moderately abundant mRNAs, and a small number of copies of the rare mRNAs. Since most genes produce low abundance mRNA, a corresponding small number of cDNAs will be made from most genes. In an effort to obtain cDNAs from most genes, investigators have normalized the cDNA libraries to remove the most abundant mRNAs.

The cDNAs are hybridized to the template mRNA to a sufficiently high $$R_0t$$ (concentration of RNA $$\times$$ time) so that the moderately abundant mRNAs and cDNAs are in duplex, whereas the rare cDNAs are still single-stranded. The duplex mRNA-cDNA will stick to a hydroxyapatite column, and the desired single-stranded, low abundance cDNA will elute. This procedure can be repeated a few times to improve the separation. The low abundance, high complexity cDNA is then ligated into a cloning vector to construct the cDNA library.

This normalization is key to the success of a random sequencing approach. Random cDNA clones, hundreds of thousands of them, have been picked and sequenced. A single-pass sequence from one of these cDNA clones is called an expressed sequence tag, or EST (Fig. 4.9). It is called a “tag” because it is a sequence of only part of the cDNA, and since it is in cDNA, which is derived from mRNA, it is from an expressed gene. If the cDNA libraries reflected the normal abundance of the mRNAs, then this approach would result in re-sequencing the abundant cDNAs over and over, and most of the rare cDNAs would never be sequenced. However, the normalization has been successful, and many genes, even with rare mRNAs, are represented in the EST database.

As of May 2001, over 2,700,000 ESTs (individual sequences of human cDNA clones) have been deposited in dbEST. They are grouped into nonredundant sets (called UniGene clusters). Over 95,000 UniGene clusters have been assembled, and almost 20,000 of them contain known human genes. The estimated number of human genes is less than the number of UniGene clusters, presumably because some large genes are still represented in more than one UniGene cluster. It is likely that most human genes are represented in the EST databases. Exceptions include genes expressed only in tissues which have not been sampled in the cDNA libraries. For more information, see http://www.ncbi.nlm.nih.gov/UniGene/index.html

Figure 4.9. cDNA clones from normalized libraries are sequenced to generate ESTs.

### Databases for genomic analysis

• Nucleic acid sequences (genomic and mRNA, including ESTs)
• Protein sequences
• Protein structures
• Genetic and physical maps

Organism-specific databases

• MedLine (PubMed)
• Online Mendelian Inheritance in Man (OMIM)

Figure 4.15. Example of mapping information at NCBI. Genetic map around MYOD1, 11p15.4

Sequences and annotation of the human genome

Ensembl (European Bioinformatics Institute (EMBL) and Sanger Centre)

http://www.ensembl.org/

Figure 4.16. Sample views from servers displaying the human genome. (A) View from the Human Genome Browser. The region shown is part of chromosome 22 with the genes PNUTL1, TBX1 and others. Extensive annotation for exons, repeats, single nucleotide polymorphisms, homologous regions in mouse and other information is available for all the sequenced genome. (B) Comparable information in a different format is available at the ENSEMBL server.

#### Programs for sequence analysis

• BLAST to search rapidly through sequence databases
• PipMaker (to align 2 genomic DNA sequences)
• Gene finding by ab initio methods (GenScan, GRAIL, etc.)

Figure 4.18. Results of BLAST search, INS vs. nr

00:40, 21 Nov 2013