Biology LibreTexts

4.4: Genome Analysis by Large Scale Sequencing


    Whole genomes can be sequenced both by random shotgun sequencing and by a directed approach using mapped clones.

    Figure 4.11. Directed sequencing of BAC contigs.

    The results of the Celera and public collaboration on the fly sequence were published in early 2000, and descriptions of the human genome sequence were published separately by Celera and the International Human Genome Sequencing Consortium (IHGSC) in 2001. Neither genome was completely sequenced as of 2001, but both are highly sequenced and are stimulating a major revolution in the life sciences.

    Which approach to take is still a matter of debate, and the answer depends to some extent on how thoroughly one needs to sequence a complex genome. For instance, a publicly accessible sequence of the mouse genome at 3X coverage was recently generated by the shotgun approach. Other genomes will likely be “lightly sequenced” at a similar coverage. But a full, high-quality sequence of mouse will likely use aspects of the more directed approach. Also, the Celera assembly (primarily shotgun sequence) used the public data on the human genome sequence as well. Thus current efforts use both rapid shotgun sequencing and the sequencing of mapped clones.
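To give a feel for what "3X coverage" implies, the following minimal sketch uses the idealized Poisson (Lander-Waterman) model of random shotgun sequencing. That model is an assumption brought in here for illustration and is not part of the discussion above; under it, the expected fraction of bases covered by at least one read at coverage c is 1 - e^(-c). The function name is hypothetical, and real genomes (with repeats and cloning biases) deviate from this ideal.

```python
import math

def fraction_covered(coverage: float) -> float:
    """Expected fraction of the genome covered by at least one read,
    assuming reads land uniformly at random (idealized Poisson model)."""
    return 1.0 - math.exp(-coverage)

# Compare a light shotgun (3X) with lower and higher coverage.
for c in (1, 3, 8):
    print(f"{c}X coverage: about {fraction_covered(c):.1%} of bases sequenced at least once")
```

Under these assumptions a 3X shotgun leaves roughly 5% of the bases unsampled, which is one way to see why a "lightly sequenced" genome is a draft and not a finished sequence.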

    Survey of sequenced genomes

    Genome sequences are now available for many species, covering an impressive phylogenetic range. These include more than 28 eubacteria, at least 6 archaea, a fungus (the yeast Saccharomyces cerevisiae), a protozoan (Plasmodium falciparum), a worm (the nematode Caenorhabditis elegans), an insect (the fruit fly Drosophila melanogaster), two plants (Arabidopsis and, soon, rice), and two mammals (human, Homo sapiens, and mouse, Mus domesticus). Some information about these is listed in Table 4.4.

    Table 4.4. Sequenced genomes. This table is derived from the listing of “Complete Genomes Mapped on the KEGG Pathways (Kyoto Encyclopedia of Genes and Genomes)” at

    www.genome.ad.jp/kegg/java/org_list.html

    Additional genomes have been added, but only samples of the bacterial sequences are listed.

    | Species | Genome size (bp) | Genes encoding protein | Genes encoding RNA | Total enzymes | Category |
    |---|---|---|---|---|---|
    | Eubacteria | | | | | |
    | Escherichia coli | 4,639,221 | 4,289 | 108 | 1,254 | gram negative |
    | Haemophilus influenzae | 1,830,135 | 1,717 | 74 | 571 | gram negative |
    | Helicobacter pylori | 1,667,867 | 1,566 | 43 | 394 | gram negative |
    | Bacillus subtilis | 4,214,814 | 4,100 | 121 | 819 | gram positive |
    | Mycoplasma genitalium | 580,073 | 467 | 36 | 202 | gram positive |
    | Mycoplasma pneumoniae | 816,394 | 677 | 33 | 226 | gram positive |
    | Mycobacterium tuberculosis | 4,411,529 | 3,918 | 48 | - | gram positive |
    | Aquifex aeolicus | 1,551,335 | 1,522 | 50 | - | hyperthermophilic bacterium |
    | Borrelia burgdorferi | 1,230,663 | 1,256 | 23 | 176 | Lyme disease spirochete |
    | Synechocystis sp. | 3,573,470 | 3,166 | 49 | 702 | cyanobacterium |
    | Archaebacteria | | | | | |
    | Archaeoglobus fulgidus | 2,178,400 | 2,407 | 49 | 439 | S-metabolizing archaea |
    | Methanococcus jannaschii | 1,739,934 | 1,735 | 43 | 441 | archaea |
    | Methanobacterium thermoautotrophicum | 1,751,377 | 1,871 | 47 | 558 | archaea |
    | Eukaryotes | | | | | |
    | Saccharomyces cerevisiae | 12,069,313 | 6,064 | 262 | 861 | fungi |
    | Caenorhabditis elegans | 97,000,000 | 18,424 | - | | nematode |
    | Drosophila melanogaster | 180,000,000 | 13,601 | | | insect, fly, 120 Mb sequenced |
    | Arabidopsis thaliana | 115,500,000 | 25,706 | | | plant, complete |
    | Homo sapiens | 3,200,000,000 | 30,000-40,000 | | | human, draft + finished |
    | Mus domesticus | 3,000,000,000 | | | | mouse, draft |

    Genome size

    Bacterial genomes range in size from 0.58 to almost 5 million bp (Mb). E. coli and B. subtilis, two of the most intensively studied bacteria, have among the largest genomes and the largest numbers of genes. The genome of the yeast Saccharomyces cerevisiae is only 2.6 times as large as that of E. coli. The genome of humans is almost 700 times larger than that of E. coli. However, genome size is not a direct measure of genetic content over long phylogenetic distances. One needs to examine the fraction of the genome that codes for protein or contains other important information. Let’s look at sizes and numbers of genes in different genomes.
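The size comparisons above can be reproduced directly from the genome sizes in Table 4.4. The following is a small sketch; the dictionary and function names are hypothetical, and only four species are included for brevity.

```python
# Genome sizes in base pairs, taken from Table 4.4.
GENOME_SIZE_BP = {
    "Mycoplasma genitalium": 580_073,
    "Escherichia coli": 4_639_221,
    "Saccharomyces cerevisiae": 12_069_313,
    "Homo sapiens": 3_200_000_000,
}

def size_ratio(species: str, reference: str = "Escherichia coli") -> float:
    """Genome size of `species` relative to the `reference` species."""
    return GENOME_SIZE_BP[species] / GENOME_SIZE_BP[reference]

for name in GENOME_SIZE_BP:
    print(f"{name}: {size_ratio(name):.2f} x E. coli")
```

With these values the yeast genome comes out at about 2.6 times, and the human genome at a little under 700 times, the size of the E. coli genome.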

    Gene size and number

    The average gene size is similar among bacteria, around 1100 bp. Very little DNA separates most bacterial genes; in E. coli there is an average of only 118 bp between genes. Because gene size varies little, the number of genes varies over as wide a range as the genome size, from 467 genes in M. genitalium to 4289 in E. coli. Thus within bacteria, which have little noncoding DNA, the number of genes is proportional to the genome size.
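One way to check the proportionality between gene number and genome size is to divide each bacterial genome size by its number of protein-coding genes, using the values in Table 4.4. This is a rough sketch: the names are hypothetical, RNA genes are ignored, and the result is the average genome span per gene (gene plus its share of intergenic DNA), not the gene length itself.

```python
# (genome size in bp, protein-coding genes), taken from Table 4.4.
BACTERIA = {
    "Mycoplasma genitalium": (580_073, 467),
    "Haemophilus influenzae": (1_830_135, 1_717),
    "Bacillus subtilis": (4_214_814, 4_100),
    "Escherichia coli": (4_639_221, 4_289),
}

def bp_per_gene(genome_bp: int, genes: int) -> float:
    """Average genome span per gene, including intergenic DNA."""
    return genome_bp / genes

for species, (size, genes) in BACTERIA.items():
    print(f"{species}: {bp_per_gene(size, genes):.0f} bp per gene")
```

The four values fall between roughly 1000 and 1250 bp per gene even though the genome sizes differ about eightfold, which is what one expects if gene number scales with genome size.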

    Saccharomyces cerevisiae has one gene every 1900 bp on average, which could reflect both an increase in gene size and a somewhat greater distance between genes. Both bacteria and yeast show a much denser packing of genes than is seen in more complex genomes.

    Data on a large sample of human genes show that they are much larger than bacterial genes, with the median being about 14 times larger than the 1 kb bacterial genes. This is not because most human proteins are substantially larger; bacterial proteins average about 350 amino acids in length, which is similar to the median size of human proteins. The major difference is the large amount of intronic sequence in human genes.

    Table 4.5. Average size of human genes and parts of genes. This is based on information in the IHGSC paper in Nature, and derived from analysis of 1804 human genes.

    | Feature | Median | Mean |
    |---|---|---|
    | Internal exon | 122 bp | 145 bp |
    | Number of exons | 7 | 8.8 |
    | Length of each intron | 1023 bp | 3365 bp |
    | 3’ UTR | 400 bp | 770 bp |
    | 5’ UTR | 240 bp | 300 bp |
    | Coding sequence | 1100 bp | 1340 bp |
    | Length of protein encoded | 367 amino acids | 447 amino acids |
    | Genomic extent | 14,000 bp | 27,000 bp |
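The mean values in Table 4.5 allow a back-of-the-envelope check of how much of a typical human gene is intron. The sketch below is only a rough illustration: it assumes one fewer intron than exons, and it simply adds means that may have been computed on different subsets of genes, so the total is not expected to match the tabulated genomic extent exactly. All names are hypothetical.

```python
# Mean values from Table 4.5 (lengths in bp).
MEAN_EXONS = 8.8
MEAN_INTRON_BP = 3365
MEAN_CODING_BP = 1340
MEAN_UTR5_BP = 300
MEAN_UTR3_BP = 770

def estimate_gene_extent() -> dict:
    """Rough genomic extent of an 'average' gene built from the table means.
    Assumes one fewer intron than exons, which is a simplification."""
    intron_bp = (MEAN_EXONS - 1) * MEAN_INTRON_BP
    exon_bp = MEAN_CODING_BP + MEAN_UTR5_BP + MEAN_UTR3_BP
    total_bp = intron_bp + exon_bp
    return {
        "intron_bp": intron_bp,
        "exon_bp": exon_bp,
        "total_bp": total_bp,
        "intron_fraction": intron_bp / total_bp,
    }

result = estimate_gene_extent()
print(f"introns {result['intron_bp']:.0f} bp, exons {result['exon_bp']} bp, "
      f"total {result['total_bp']:.0f} bp, "
      f"intron fraction {result['intron_fraction']:.0%}")
```

The estimated total of about 28,700 bp is close to the mean genomic extent of 27,000 bp in the table, and introns account for roughly nine-tenths of it.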

    Figure 4.14. Genome size and number of genes in species ranging from bacteria to humans.

    Alternative splicing is common in human genes

    An earlier, lower estimate was that alternative splicing occurs in 35% of human genes. However, more recent data show that the fraction is larger.

    For Chromosome 22:

    • 642 transcripts cover 245 genes, 2.6 transcripts per gene
    • 2 or more transcripts for 145 (59%) of the genes

    For Chromosome 19:

    • 1859 transcripts cover 544 genes, 3.2 transcripts per gene

    This contrasts with the situation in worm, in which alternative splicing occurs in 22% of genes. The increased genetic diversity from alternative splicing may contribute considerably to the greater complexity of humans, not just the increase in the number of genes.

    Estimates of number of human genes

    The estimated number of human genes has varied greatly over recent years. Some of these numbers have been widely quoted, and it may be useful to list some of the sources of these estimates.

    • mRNA complexity (association kinetics): 40,000 genes
    • Average gene size of 30,000 bp: 100,000 genes
    • Number of CpG islands: 70,000 to 80,000
    • Unigene clusters of ESTs: 35,000 to 125,000
    • More rigorous EST clustering: 35,000 genes
    • Comparison to pufferfish: 30,000 genes
    • Extrapolate from gene counts on chromosomes 21 and 22 (which are finished): 30,000 to 35,500 genes
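The estimate in the list above that is based on average gene size is simple division. The sketch below assumes a round genome size of 3 billion bp, which is an assumption made here for illustration (the list does not state the genome size used), and it assumes that genes tile the genome end to end with no intergenic DNA. The function name is hypothetical.

```python
def genes_from_average_size(genome_bp: int, avg_gene_bp: int) -> int:
    """Number of genes that would fit if genes tiled the genome end to end."""
    return genome_bp // avg_gene_bp

# Assumed round genome size of 3 billion bp (illustrative, not from the text).
print(genes_from_average_size(3_000_000_000, 30_000))
```

Because real genomes contain large amounts of intergenic DNA, this kind of calculation gives an upper bound more than an estimate, which is consistent with it being the highest single figure among the sequence-based estimates listed.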

    Using the draft human sequence from July 2000, the IHGSC constructed an Initial Gene Index for human, using the Ensembl system at the Sanger Centre. They started with ab initio predictions by Genscan, then confirmed these by similarity to proteins, mRNAs, ESTs, and protein motifs (Pfam database) from any organism. This led to an initial set of 35,500 genes and 44,860 transcripts in the Ensembl database. After reducing fragmentation, merging with known genes, and removing contaminating bacterial sequences, they were left with 31,778 genes. After taking into account residual fragmentation, and the rate at which true genes are found by a similar analysis, the estimate remains about 32,000 genes. However, it is an estimate and is subject to change as more annotation is completed.

    Starting with this estimate that the human genome contains about 32,000 genes, one can calculate how much of the genome is coding and how much is transcribed. If the average coding length is 1400 bp, then 1.5% of the human genome consists of coding sequence. If the average genomic extent per gene is 30 kb, then 33% of the human genome is “transcribed”.
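The two percentages above follow from multiplying the gene count by a per-gene length and dividing by the genome size. The genome size used is not stated in the text; the sketch below tries 2.9 billion bp, which is an assumption chosen here because it reproduces the quoted figures, alongside the 3.2 billion bp listed in Table 4.4. The function name is hypothetical.

```python
def fraction_of_genome(n_genes: int, bp_per_gene: float, genome_bp: float) -> float:
    """Fraction of the genome occupied by n_genes segments of bp_per_gene each."""
    return n_genes * bp_per_gene / genome_bp

N_GENES = 32_000          # estimate from the Initial Gene Index
CODING_BP = 1_400         # average coding length per gene
EXTENT_BP = 30_000        # average genomic extent per gene

# 2.9e9 is an assumed genome size; 3.2e9 is the value in Table 4.4.
for genome_bp in (2.9e9, 3.2e9):
    coding = fraction_of_genome(N_GENES, CODING_BP, genome_bp)
    transcribed = fraction_of_genome(N_GENES, EXTENT_BP, genome_bp)
    print(f"genome {genome_bp:.1e} bp: coding {coding:.1%}, transcribed {transcribed:.0%}")
```

With 2.9 billion bp the result is about 1.5% coding and 33% transcribed; with 3.2 billion bp it is about 1.4% and 30%. Either way, only a small percentage of the genome is coding sequence.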

    Summary of number of genes in eukaryotic species:

    • Human: 32,000 (still uncertain)
    • Fly: 13,338
    • Worm: 18,266
    • Yeast: 6,144
    • Mustard weed: 25,706
    • Human: 2x number of genes in fly and worm
    • Human: more alternative splicing, perhaps 5x number of proteins as in fly or worm

    Assignment of functions to genes

    Genes encoding proteins and RNAs can be detected with considerable accuracy using computational tools. Note that even for an extensively studied organism like E. coli, the number of genes found by sequence analysis (4289 encoding proteins) is far greater than the number that can be assigned as encoding a particular enzyme (1254). The discrepancy between genes found in the sequence and those with known function (i.e. assigned as encoding an enzyme) is even greater for some poorly characterized organisms, such as the Lyme disease-causing spirochete Borrelia burgdorferi.
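The size of this gap can be put in numbers using the protein-coding gene counts and enzyme counts in Table 4.4. The sketch below treats "assigned as an enzyme" as a crude proxy for "function known", as the paragraph above does; genes with known non-enzymatic functions are not counted, so the fractions understate what is actually known. The names are hypothetical.

```python
# (protein-coding genes, genes assigned as enzymes), taken from Table 4.4.
ANNOTATION = {
    "Escherichia coli": (4_289, 1_254),
    "Bacillus subtilis": (4_100, 819),
    "Borrelia burgdorferi": (1_256, 176),
}

def enzyme_fraction(protein_genes: int, enzymes: int) -> float:
    """Fraction of protein-coding genes assigned as encoding an enzyme."""
    return enzymes / protein_genes

for species, (genes, enzymes) in ANNOTATION.items():
    print(f"{species}: {enzyme_fraction(genes, enzymes):.0%} of genes assigned as enzymes")
```

By this measure about 29% of E. coli genes have an enzyme assignment, compared with about 14% for B. burgdorferi.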

    The many genes with unassigned function present an exciting challenge both in bioinformatics and in biochemistry/cell biology/genetics. Large collaborations have been initiated for a comprehensive genetic and expression analysis of some organisms. For instance, projects are underway to make mutations in all detected genes in Saccharomyces cerevisiae and to quantify the level of stable RNA from each gene in a variety of growth conditions, through the cell cycle and in other conditions. Databases are already established that record the changes in RNA levels for all yeast genes when the organism is shifted from glucose to galactose as a carbon source. These large-scale expression analyses use high-density microchip arrays that contain characteristic sequences for all 6064 yeast genes. These gene arrays are then hybridized with fluorescently labeled RNA or cDNA from cells grown under the two different conditions. The hybridization signals are quantified, compared, and analyzed automatically. The plan is to store the results in public databases. Useful websites include:

    • SGD
    • MIPS: a database for genomes and protein sequences
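The comparison of hybridization signals between two growth conditions described above is commonly summarized as a log ratio per array spot. The following is a minimal sketch of that idea only: the gene labels and signal intensities are made up for illustration, the twofold cutoff is an arbitrary assumption, and real analyses also involve background subtraction and normalization, which are omitted here.

```python
import math

def log2_ratio(signal_galactose: float, signal_glucose: float) -> float:
    """log2 of the galactose/glucose signal ratio for one array spot."""
    return math.log2(signal_galactose / signal_glucose)

# Made-up signal intensities (arbitrary units), for illustration only.
# Each entry is (signal in galactose, signal in glucose).
spots = {
    "gene_A": (8000.0, 500.0),
    "gene_B": (1200.0, 1150.0),
    "gene_C": (300.0, 2400.0),
}

for gene, (gal, glu) in spots.items():
    ratio = log2_ratio(gal, glu)
    # Assumed cutoff: call a change only if it is at least twofold.
    if ratio >= 1:
        call = "higher in galactose"
    elif ratio <= -1:
        call = "higher in glucose"
    else:
        call = "little change"
    print(f"{gene}: log2 ratio {ratio:+.2f} ({call})")
```

Using a log scale makes increases and decreases symmetric: a 16-fold increase gives +4 and an 8-fold decrease gives -3.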

    This page titled 4.4: Genome Analysis by Large Scale Sequencing is shared under a All Rights Reserved (used with permission) license and was authored, remixed, and/or curated by Ross Hardison.
