Ensembl
Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, launched in 1999 in response to the imminent completion of the Human Genome Project. After 10 years in existence, Ensembl's aim remains to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information.

Similar databases and browsers are found at NCBI and the University of California, Santa Cruz (UCSC).

Background

The human genome consists of three billion base pairs, which code for approximately 20,000–25,000 genes. However, the genome alone is of little use unless the locations and relationships of individual genes can be identified. One option is manual annotation, whereby a team of scientists tries to locate genes using experimental data from scientific journals and public databases. However, this is a slow, painstaking task. The alternative, known as automated annotation, is to use the power of computers to do the complex pattern-matching of protein to DNA.
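
As a toy illustration of matching a protein to DNA, the following Python sketch translates a DNA string in all six reading frames using the standard genetic code and reports where a given peptide occurs exactly. It is not the Ensembl pipeline, which is written in Perl and whose matching methods are not described here; the function names and the exact-match strategy are assumptions made purely for illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order line up with AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon; unrecognised codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_protein(dna, protein):
    """Find exact matches of a peptide in the six reading frames (hypothetical helper).

    Returns a list of (strand, frame, offset) tuples. The offset is the
    0-based position of the first matching codon on the strand being read,
    so for the '-' strand it refers to the reverse-complemented sequence.
    """
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            peptide = translate(seq[frame:])
            pos = peptide.find(protein)
            while pos != -1:
                hits.append((strand, frame, frame + 3 * pos))
                pos = peptide.find(protein, pos + 1)
    return hits
```

A production annotation system would need to tolerate mismatches, introns and sequencing errors, which exact string matching does not handle.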

In the Ensembl project, sequence data are fed into a software "pipeline" (written in Perl), which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project are available to download, and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl website provides computer-generated visual displays of much of the data.

Over time the project has expanded to include additional species (including key model organisms such as mouse, fruitfly and zebrafish) as well as a wider range of genomic data, including genetic variations and regulatory features. Since April 2009, a sister project, Ensembl Genomes, has extended the scope of Ensembl into invertebrate metazoa, plants, fungi, bacteria and protists, whilst the original project continues to focus on vertebrates.

Displaying genomic data

Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes and other genomic data against a reference genome. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction.

Other displays show data at varying levels of resolution, from whole karyotypes down to text-based representations of DNA and amino acid sequences, or present other types of display such as trees of similar genes (homologues) across a range of species. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as FASTA.
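
To make the FASTA format concrete, here is a minimal Python sketch that reads and writes FASTA text: each record is a header line starting with ">" followed by one or more lines of sequence. The function names and the 60-character line width are illustrative choices, not part of any Ensembl tool.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(header, sequence, width=60):
    """Format one record, wrapping the sequence at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

A file exported from a sequence page could then be read with `list(parse_fasta(open(path).read()))`, where `path` is whatever location the file was saved to.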

Externally produced data can also be added to the display, either via a DAS (Distributed Annotation System) server on the internet, or by uploading a suitable file in one of the supported formats, such as BED or PSL.
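
As an illustration of what such an uploaded file contains, the sketch below parses the first columns of a BED file in Python. BED is a whitespace-delimited format whose first three columns give the chromosome, a 0-based start and an exclusive end, with an optional name in the fourth column. The class and function names are hypothetical, and the sketch ignores the remaining optional BED columns.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int                  # 0-based, inclusive
    end: int                    # exclusive
    name: Optional[str] = None


def parse_bed(text):
    """Parse BED text into a list of BedFeature records."""
    features = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and "track"/"browser" header lines.
        if not fields or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        if len(fields) < 3:
            raise ValueError("BED line needs at least 3 columns: " + line)
        name = fields[3] if len(fields) > 3 else None
        features.append(BedFeature(fields[0], int(fields[1]), int(fields[2]), name))
    return features


def overlapping(features, chrom, start, end):
    """Return features overlapping the half-open interval [start, end) on chrom."""
    return [f for f in features if f.chrom == chrom and f.start < end and f.end > start]
```

The `overlapping` helper mirrors, in a very simplified way, what a browser must do to decide which uploaded features fall inside the region currently on display.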

Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics display library.

Alternative access methods

In addition to its website, Ensembl provides a Perl API (Application Programming Interface) that models biological objects such as genes and proteins, allowing simple scripts to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided into sections such as the core API, the compara API (for comparative genomics data) and the variation API (for accessing SNPs, SNVs, CNVs, etc.).
The Ensembl website provides extensive information on how to install and use the API.

This software can be used to access the public MySQL database, avoiding the need to download enormous datasets. Users can even choose to retrieve data from the MySQL database with direct SQL queries, but this requires extensive knowledge of the current database schema.
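
The flavour of a direct SQL query can be sketched in Python with an in-memory SQLite database standing in for the public MySQL server. The table layout, column names and gene identifiers below are invented for illustration and are not the real Ensembl schema, which is considerably more complex and should be taken from the project's own documentation.

```python
import sqlite3


def build_toy_database():
    """Create an in-memory database with a made-up, simplified gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene ("
        " stable_id TEXT PRIMARY KEY,"
        " chrom TEXT, start_pos INTEGER, end_pos INTEGER, strand INTEGER)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [
            ("GENE0001", "1", 1000, 5000, 1),    # hypothetical identifiers
            ("GENE0002", "1", 7000, 9000, -1),
            ("GENE0003", "2", 2000, 3000, 1),
        ],
    )
    return conn


def genes_in_region(conn, chrom, start, end):
    """Return IDs of genes overlapping [start, end] on chrom, ordered by start."""
    rows = conn.execute(
        "SELECT stable_id FROM gene"
        " WHERE chrom = ? AND start_pos <= ? AND end_pos >= ?"
        " ORDER BY start_pos",
        (chrom, end, start),
    )
    return [row[0] for row in rows]
```

Against the real server, the same pattern would use a MySQL client library and the actual schema; parameterised queries, as used here, remain advisable in either case.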

Large datasets can be retrieved using the BioMart data-mining tool. It provides a web interface for downloading datasets using complex queries.

Lastly, there is an FTP server which can be used to download entire MySQL databases as well as some selected data sets in other formats.

Current species

The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes; there are no prokaryotes. These include:
  • Chordata
    • Mammalia
      • Primates: Bushbaby, Chimp, Human, Macaque, Mouse Lemur, Orangutan, Tarsier;
      • Scandentia: Tree shrew;
      • Glires (= Rodents + Lagomorphs): Guinea pig, Kangaroo rat, Mouse, Rat, Ground Squirrel, Pika, Rabbit;
      • Laurasiatheria: Cow, Dolphin, Alpaca, Pig, Cat, Dog, Horse, Megabat, Microbat, Hedgehog, Shrew;
      • Afrotheria: Elephant, Hyrax, Tenrec;
      • Xenarthra: Armadillo, Sloth;
      • Marsupialia: Opossum, Wallaby;
      • Monotremes: Platypus;
    • Birds: Chicken, Zebra Finch;
    • Lepidosauria: Anole Lizard (pre);
    • Lissamphibia: Xenopus tropicalis;
    • Teleost fishes: Takifugu rubripes (Fugu), Tetraodon nigroviridis (Green spotted pufferfish), Danio rerio (Zebrafish), Oryzias latipes (Medaka), Gasterosteus aculeatus (Stickleback);
    • Cyclostomata: Petromyzon marinus (Sea lamprey) (pre);
    • Tunicates: Ciona intestinalis, Ciona savignyi;
  • Non-vertebrates
    • Insects: Drosophila melanogaster (Fruitfly), Anopheles gambiae (Mosquito), Aedes aegypti (Mosquito)
    • Worm: Caenorhabditis elegans
  • Yeast: Saccharomyces cerevisiae (Baker's yeast)

See also

  • List of sequenced eukaryotic genomes
  • Sequence analysis
  • Sequence profiling tool
  • Sequence motif
  • UCSC Genome Browser


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 