Thibaut (Genebuild)

Author: Thibaut (Genebuild)

Flagging the removed evidences of our databases

6th September 2011 by Thibaut (Genebuild)

All gene annotations in Ensembl are supported by biological sequence alignments. These sequences used as supporting evidence are downloaded from public databases, for example UniProt and ENA, at the start of the gene annotation process. Public sequence databases are updated regularly, meaning that sequences are added to and removed from them. We don’t update our gene sets every release; for some species the gene annotations may not be updated for several releases. It follows that gene annotations in the Ensembl gene sets may be supported by biological sequences that have been withdrawn from the public databases.

To indicate these changes, we now flag sequences that we have used as supporting evidence but which have since been withdrawn from the public databases. These flags are updated every release by checking all protein-coding transcripts and exons for all species against the most current sequence databases. Transcripts based on evidence that has been withdrawn are flagged and coloured grey (instead of yellow) on the Transcript’s Supporting Evidence page. Transcripts supported by only a grey protein sequence should be considered less well supported. One example is the gorilla transcript ENSGGOT00000034302, which was built using the human protein A6NKB4.2.

In addition to sequences from external public databases, Ensembl translations from well-annotated species may also be used as supporting evidence for annotation in other species, particularly primates and species with fragmented assemblies. Withdrawn Ensembl translations are flagged in the same way as described above.
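The check described above boils down to comparing the accessions we used as evidence against the accessions still present in the current public databases. Here is a minimal Python sketch of that comparison. It is an illustration only, not the Genebuild pipeline code: the function name, the dictionary layout, the `HYPOTHETICAL_*` identifiers and the contents of `current_accessions` are all assumptions made up for the example.

```python
def flag_withdrawn(evidence_by_transcript, current_accessions):
    """Map each transcript to the supporting accessions that are no longer
    present in the current sequence database (transcripts with none are omitted)."""
    flagged = {}
    for transcript_id, accessions in evidence_by_transcript.items():
        withdrawn = sorted(a for a in accessions if a not in current_accessions)
        if withdrawn:
            flagged[transcript_id] = withdrawn
    return flagged


def only_withdrawn_support(evidence_by_transcript, current_accessions):
    """Return transcripts whose every supporting accession has been withdrawn."""
    return sorted(
        transcript_id
        for transcript_id, accessions in evidence_by_transcript.items()
        if accessions and all(a not in current_accessions for a in accessions)
    )


# Toy input: the first pair comes from the post, the rest is hypothetical.
evidence_by_transcript = {
    "ENSGGOT00000034302": ["A6NKB4.2"],
    "HYPOTHETICAL_T1": ["HYPOTHETICAL_P1.1", "HYPOTHETICAL_P2.3"],
}
# Made-up snapshot of the accessions still available in the public databases.
current_accessions = {"HYPOTHETICAL_P1.1"}

flagged = flag_withdrawn(evidence_by_transcript, current_accessions)
weakly_supported = only_withdrawn_support(evidence_by_transcript, current_accessions)
```

In this toy data both transcripts get a flag, but only the first one is supported exclusively by withdrawn evidence, which is the case that should be treated as less well supported.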

You can also access these data programmatically using the API by looking for the transcript attribute “NoEvidence”.
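The Ensembl API itself is written in Perl, so the snippet below is not an API call. It is only a small Python sketch of the lookup, assuming you have already exported transcript attributes as `(transcript_id, attribute_code, value)` rows. That row layout, the function name, the `HYPOTHETICAL_T1` record and the attribute values are assumptions for illustration; only the attribute code “NoEvidence” comes from the post.

```python
NO_EVIDENCE_CODE = "NoEvidence"  # attribute code named in the post


def transcripts_with_attribute(attribute_rows, code=NO_EVIDENCE_CODE):
    """Return the sorted IDs of transcripts carrying an attribute with this code."""
    return sorted(
        {transcript_id for transcript_id, attrib_code, _value in attribute_rows
         if attrib_code == code}
    )


# Hypothetical export: values are placeholders, not real attribute contents.
attribute_rows = [
    ("ENSGGOT00000034302", "NoEvidence", "placeholder"),
    ("HYPOTHETICAL_T1", "SomeOtherCode", "placeholder"),
]

no_evidence_transcripts = transcripts_with_attribute(attribute_rows)
```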

Human BodyMap 2.0 data from Illumina

24th May 2011 by Thibaut (Genebuild)

I’d like to introduce you to an exciting new data set that we’ve added in Ensembl release 62: RNASeq data from Illumina’s Human BodyMap 2.0 project. The data, generated on HiSeq 2000 instruments in 2010, cover 16 human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. The raw reads are available for download. For each tissue, we have aligned the raw reads to the genome and then linked exons into tissue-specific transcript models using the reads that span an exon-exon boundary.

You can view these data in the Region in Detail view. Click on ‘Configure this page’ and choose ‘RNA-Seq’ at the left of the main panel. Enable any or all of the 32 tracks and then close the configuration panel. Out of 32 possible tracks you can draw, 16 are tissue ‘gene model’ tracks, and 16 are ‘intron’ tracks.

The ‘gene model’ track shows you a transcript model. The ‘intron’ track shows you how many raw reads aligned across an exon-exon junction. The higher the intron block, the more highly expressed the transcript isoform is.
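To make the idea behind the ‘intron’ track concrete, here is a minimal Python sketch that counts how many spliced reads support each exon-exon junction. It is not our alignment pipeline: the read representation (a list of aligned blocks with 1-based inclusive coordinates), the function name and the toy coordinates are all assumptions for the example.

```python
from collections import Counter


def count_junction_reads(spliced_reads):
    """Count reads per intron, where an intron is the gap between two
    consecutive aligned blocks of the same read.

    Each read is a list of (start, end) blocks, 1-based and inclusive.
    Returns a Counter keyed by (intron_start, intron_end).
    """
    counts = Counter()
    for blocks in spliced_reads:
        ordered = sorted(blocks)
        # Pair each block with the next one; unspliced reads yield no pairs.
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
            counts[(left_end + 1, right_start - 1)] += 1
    return counts


# Toy alignments (made-up coordinates).
reads = [
    [(100, 150), (301, 350)],
    [(120, 150), (301, 330)],
    [(140, 150), (301, 320), (501, 540)],  # spans two junctions
    [(200, 260)],                          # unspliced, supports no junction
]

junction_counts = count_junction_reads(reads)
```

In this toy input, three reads support the intron at 151-300 and one supports the intron at 321-500; in the browser, the first junction would be drawn as the taller intron block.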

In this example, the kidney gene model track shows a transcript (dark blue) with an exon structure that matches the gold-coloured Ensembl transcript AQP6-001. The kidney transcript model includes coding and noncoding exons (the empty box is UTR, and the filled boxes are coding exons).
Click on the kidney intron track to see that 192 raw reads were split between the first and second exons.

This example is interesting because it shows a gene with high expression in kidney tissue, and almost no expression in any other tissue.

The high read coverage for kidney means that the transcript’s exon-intron structure produced for the gene track has a good chance of being correct. When read coverage is very low, it is not always possible to build a full-length transcript model. Look at the colon and brain intron tracks to see that two colon reads and three brain reads have aligned across the transcript’s middle exon-exon junction. Although this read coverage is low, our pipeline has generated a transcript model for brain tissue. However, the pipeline was not able to predict the two splice junctions on either side, because no raw reads from brain aligned across them.

Below is a nice example of a gene that seems to be expressed in all 16 tissues, spermidine synthase (SRM).

Try dump_transcripts.pl as an example script to access the RNAseq-based transcript models. Have fun with these new data!

Ensembl Blog

News about the Ensembl Project and its genome browser
