Ensembl
Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, which was launched in 1999 in response to the imminent completion of the Human Genome Project. After 10 years in existence, Ensembl's aim remains to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well known genome browsers for the retrieval of genomic information.
Similar databases and browsers are found at NCBI and the University of California, Santa Cruz (UCSC).
Background
The human genome consists of three billion base pairs, which code for approximately 20,000–25,000 genes. However, the genome alone is of little use unless the locations and relationships of individual genes can be identified. One option is manual annotation, whereby a team of scientists tries to locate genes using experimental data from scientific journals and public databases. However, this is a slow, painstaking task. The alternative, known as automated annotation, is to use the power of computers to do the complex pattern-matching of protein to DNA.
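To make the idea of matching protein to DNA concrete, the following is a minimal sketch and not the method used by the Ensembl pipeline. It translates a DNA string in the three forward reading frames using the standard genetic code and reports where a given peptide occurs exactly. The function names `translate` and `find_protein` are illustrative assumptions, and real annotation tools must also handle the reverse strand, introns and inexact matches.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA codon by codon, ignoring a trailing partial codon.

    Codons containing letters other than A, C, G, T become 'X'.
    """
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_protein(dna, peptide):
    """Return (frame, start, end) for each exact hit in the three forward frames.

    start and end are 0-based, end-exclusive positions on `dna`.
    """
    hits = []
    for frame in range(3):
        protein = translate(dna[frame:])
        pos = protein.find(peptide)
        while pos != -1:
            start = frame + 3 * pos
            hits.append((frame, start, start + 3 * len(peptide)))
            pos = protein.find(peptide, pos + 1)
    return hits


# Toy input (hypothetical): the peptide MAK is encoded starting at offset 2.
print(find_protein("CCATGGCCAAATAG", "MAK"))
```

Exact string search keeps the sketch short; it only illustrates why the search has to consider every reading frame.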
In the Ensembl project, sequence data are fed into a software "pipeline" (written in Perl) which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project are available to download, and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl website provides computer-generated visual displays of much of the data.
Over time the project has expanded to include additional species (including key model organisms such as mouse, fruitfly and zebrafish) as well as a wider range of genomic data, including genetic variations and regulatory features. Since April 2009, a sister project, Ensembl Genomes, has extended the scope of Ensembl into invertebrate metazoa, plants, fungi, bacteria and protists, whilst the original project continues to focus on vertebrates.
Displaying genomic data
Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes and other genomic data against a reference genome. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction.
Other displays show data at varying levels of resolution, from whole karyotypes down to text-based representations of DNA and amino acid sequences, or present other types of display such as trees of similar genes (homologues) across a range of species. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as FASTA.
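As an illustration of working with a FASTA export, here is a minimal sketch of reading and writing the format: each record is a header line starting with ">" followed by one or more sequence lines. The function names `parse_fasta` and `format_fasta` and the 60-character line width are assumptions made for this example, not part of any Ensembl tool.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Format one record, wrapping the sequence at `width` characters per line."""
    lines = [">" + header]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Hypothetical two-record file, used only to exercise the functions.
example = ">seq1 demo\nACGT\nTTGA\n>seq2\nMKV\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))
```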
Externally produced data can also be added to the display, either via a DAS (Distributed Annotation System) server on the internet, or by uploading a suitable file in one of the supported formats, such as BED or PSL.
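A file prepared for upload in BED format might be generated along the following lines. This is a sketch under stated assumptions: it writes the three required tab-separated BED columns (chromosome, 0-based start, end-exclusive end) plus optional name, score and strand, and it assumes the caller's own coordinates are 1-based and inclusive. The helper names `to_bed_line` and `parse_bed_line` are hypothetical.

```python
def to_bed_line(chrom, start, end, name=None, score=0, strand="+"):
    """Format a 1-based, inclusive feature as a BED line (0-based, half-open)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # BED starts are 0-based; the end column is exclusive, so it stays unchanged.
    fields = [chrom, str(start - 1), str(end)]
    if name is not None:
        fields += [name, str(score), strand]
    return "\t".join(fields)


def parse_bed_line(line):
    """Read the first three BED columns back as (chrom, start, end), 1-based inclusive."""
    chrom, chrom_start, chrom_end = line.split()[:3]
    return chrom, int(chrom_start) + 1, int(chrom_end)


# Hypothetical feature covering bases 100 to 200 of chromosome 1.
print(to_bed_line("chr1", 100, 200, name="feat1"))
```

The off-by-one conversion is the usual source of errors when moving between 1-based displays and BED files, which is why the sketch isolates it in one place.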
Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics display library.
Alternative access methods
In addition to its website, Ensembl provides a Perl API (Application Programming Interface) that models biological objects such as genes and proteins, allowing simple scripts to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided into sections such as the core API, the compara API (for comparative genomics data) and the variation API (for accessing SNPs, SNVs, CNVs, etc.).
The Ensembl website provides extensive information on how to install and use the API.
This software can be used to access the public MySQL database, avoiding the need to download enormous datasets. Users can also retrieve data from the MySQL server with direct SQL queries, but this requires extensive knowledge of the current database schema.
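The kind of direct SQL query described above can be sketched as follows. To stay self-contained, the example uses Python's built-in `sqlite3` module with an in-memory toy table as a stand-in for the public MySQL server. The table layout, column names and gene identifiers are hypothetical simplifications for illustration and should not be read as the actual Ensembl schema.

```python
import sqlite3


def genes_overlapping(conn, region, start, end):
    """Return genes on `region` that overlap the interval [start, end]."""
    query = (
        "SELECT stable_id, seq_region_start, seq_region_end "
        "FROM gene "
        "WHERE seq_region_name = ? "
        "AND seq_region_start <= ? AND seq_region_end >= ? "
        "ORDER BY seq_region_start"
    )
    # Two intervals overlap when each one starts before the other ends.
    return conn.execute(query, (region, end, start)).fetchall()


# Toy stand-in database (hypothetical table and rows, not Ensembl data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (stable_id TEXT, seq_region_name TEXT, "
    "seq_region_start INTEGER, seq_region_end INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [("GENE_A", "1", 100, 500), ("GENE_B", "1", 450, 900), ("GENE_C", "2", 100, 500)],
)
print(genes_overlapping(conn, "1", 480, 600))
```

Against a real MySQL server the same query would be issued through a MySQL client library, whose parameter placeholder syntax typically differs from the `?` used by `sqlite3`.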
Large datasets can be retrieved using the BioMart data-mining tool. It provides a web interface for downloading datasets using complex queries.
Lastly, there is an FTP server which can be used to download entire MySQL databases as well as selected data sets in other formats.
Current species
The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes; there are no prokaryotes. These include:
- Chordata
  - Mammalia
    - Primates: Bushbaby, Chimp, Human, Macaque, Mouse Lemur, Orangutan, Tarsier
    - Scandentia: Tree shrew
    - Glires (= Rodents + Lagomorphs): Guineapig, Kangaroo rat, Mouse, Rat, Ground Squirrel, Pika, Rabbit
    - Laurasiatheria: Cow, Dolphin, Alpaca, Pig, Cat, Dog, Horse, Megabat, Microbat, Hedgehog, Shrew
    - Afrotheria: Elephant, Hyrax, Tenrec
    - Xenarthra: Armadillo, Sloth
    - Marsupialia: Opossum, Wallaby
    - Monotremes: Platypus
  - Birds: Chicken, Zebra Finch
  - Lepidosauria: Anole Lizard (pre)
  - Lissamphibia: Xenopus tropicalis
  - Teleost fishes: Takifugu rubripes (Fugu), Tetraodon nigroviridis (Green spotted pufferfish), Danio rerio (Zebrafish), Oryzias latipes (Medaka), Gasterosteus aculeatus (Stickleback)
  - Cyclostomata: Petromyzon marinus (Sea lamprey) (pre)
  - Tunicates: Ciona intestinalis, Ciona savignyi
- Non-vertebrates
  - Insects: Drosophila melanogaster (Fruitfly), Anopheles gambiae (Mosquito), Aedes aegypti (Mosquito)
  - Worm: Caenorhabditis elegans
- Yeast: Saccharomyces cerevisiae (Baker's yeast)
See also
- List of sequenced eukaryotic genomes
- Sequence analysis
- Sequence profiling tool
- Sequence motif
- UCSC Genome Browser