Project ideas

Ensembl is part of the Genome Assembly and Annotation (GAA) section of EMBL-EBI. The GAA section contains other popular services such as MGnify, HGNC/VGNC and WormBase. Below are project ideas related to the activities of the section. GAA is able to host funded contributors on summer projects or internships at various times throughout the year.

Genome Quality Assessment Pipeline in NextFlow

Brief Explanation

Genome assemblies and annotations vary in quality, making it essential to have reliable tools for assessing their completeness and consistency. This project aims to develop a NextFlow pipeline that evaluates genome quality using widely accepted tools and comparative analysis techniques. The pipeline will integrate BUSCO for completeness assessment, OMArk for annotation validation, and a multivariate analysis module to compare key genome annotation metrics (e.g., number of protein-coding genes, transcripts, exons per transcript) against reference genomes from the same taxonomic group. The final output will include statistical summaries and visualisations to aid in interpreting genome quality.
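
As a sketch of what the multivariate-analysis module might look like, the snippet below projects per-genome annotation metrics onto principal components; the input file and column names are hypothetical placeholders.

```python
# Minimal sketch of the multivariate-analysis step. Assumes a tab-separated
# file "annotation_metrics.tsv" with one row per genome and columns such as
# n_protein_coding, n_transcripts, mean_exons_per_transcript (all hypothetical).
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

metrics = pd.read_csv("annotation_metrics.tsv", sep="\t", index_col="genome")

# Standardise so that large gene counts do not dominate smaller-scale metrics.
scaled = StandardScaler().fit_transform(metrics)

# Project genomes onto the first two principal components; an outlier here may
# indicate an annotation of unusual quality relative to its taxonomic group.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

summary = pd.DataFrame(components, columns=["PC1", "PC2"], index=metrics.index)
print(summary)
print("Explained variance:", pca.explained_variance_ratio_)
```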

Expected results

  • A NextFlow pipeline capable of processing a genome assembly and its annotation to assess quality.
  • Integration of BUSCO and OMArk for standard completeness and annotation checks.
  • A module that extracts key genome annotation metrics and performs multivariate analysis (PCA, clustering) to compare with reference Ensembl annotations.
  • Automated report generation including statistics and visualisations.
  • Well-documented code with clear installation and usage instructions.

Required knowledge

  • Python or R for data analysis and visualisation.
  • NextFlow or familiarity with workflow automation tools.
  • Basic understanding of genome annotations (FASTA, GFF3 formats).

Desirable knowledge

  • Experience with BUSCO and OMArk.
  • Understanding of multivariate statistical analysis (e.g., PCA, clustering).
  • Familiarity with Ensembl databases and programmatic access (e.g., Ensembl REST API).
  • Experience with containerisation (Docker/Singularity) for reproducibility.

Difficulty

Medium

Length

350h

Mentors

Leanne Haggerty, Simarpreet Kaur Bhurji

Transcriptomic Data Assessment Pipeline for Genome Annotation

Brief Explanation

This project aims to develop a NextFlow pipeline for assessing the usability of transcriptomic data for genome annotation. The pipeline will take raw FASTQ files along with associated metadata (e.g., read type, length, tissue source, species, reference genome, and gene set) and perform a series of analyses to evaluate transcript coverage, completeness, and usability for gene annotation.

Key steps in the assessment include:

  1. Alignment & Gene Model Construction
    • Align transcriptomic reads to the provided or selected reference genome.
    • Construct gene models.
  2. Completeness Analyses
    • Evaluate gene set completeness using BUSCO and OMArk.
    • If a reference gene set is available, assess splice junction saturation to determine whether the sequencing depth is sufficient to reconstruct full gene models.
    • Compare predicted exon-intron structures to the reference annotations to detect inconsistencies.
  3. K-mer Analysis for Unannotated Species
    • For species without a reference gene set, perform a K-mer analysis to estimate genome completeness and detect conserved sequences.
  4. Cross-Tissue Overlap Assessment
    • If multiple tissues from the same species are provided, report the proportion of unique transcripts contributed by each tissue, highlighting tissue-specific expression patterns and redundancy in sequencing efforts.

The final pipeline will generate statistical reports, visualisations, and recommendations regarding the adequacy of the transcriptomic data for genome annotation.
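
As an illustration of the splice-junction saturation check in step 2, here is a rough Python sketch. It assumes junctions come from a STAR-style SJ.out.tab file (tab-separated, with the count of uniquely mapped supporting reads in column 7); the parsing would need adapting for other aligners.

```python
import random

def load_junction_reads(path):
    """Expand each junction into one entry per supporting read."""
    reads = []
    with open(path) as fh:
        for line in fh:
            fields = line.split("\t")
            junction = (fields[0], int(fields[1]), int(fields[2]))
            unique_reads = int(fields[6])
            reads.extend([junction] * unique_reads)
    return reads

def saturation_curve(reads, fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Distinct junctions recovered at increasing subsampled depths."""
    curve = {}
    for frac in fractions:
        sample = random.sample(reads, int(len(reads) * frac))
        curve[frac] = len(set(sample))
    return curve

reads = load_junction_reads("SJ.out.tab")
for frac, n_junctions in saturation_curve(reads).items():
    print(f"{frac:.0%} of reads -> {n_junctions} distinct junctions")
# A plateau between 75% and 100% suggests depth is adequate for full gene models.
```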

Expected results

  • A NextFlow pipeline that can process RNA-Seq (short/long reads) and metadata to assess usability for genome annotation.
  • Gene model construction and completeness analysis with BUSCO, OMArk, and splice junction saturation checks.
  • K-mer analysis for species without a reference gene set.
  • Tissue overlap reports for multi-tissue datasets.
  • Automated report generation summarising findings and providing data quality recommendations.

Required knowledge

  • RNA-Seq analysis and alignment tools (STAR, Minimap2, StringTie, Scallop).
  • Genome annotation concepts (gene models, transcript assemblies, splice junctions).
  • NextFlow for workflow automation.
  • Basic command-line proficiency (Linux, Bash scripting).

Desirable knowledge

  • Familiarity with BUSCO, OMArk, and K-mer analyses.
  • Experience handling long-read (PacBio/Nanopore) vs. short-read (Illumina) data.
  • Knowledge of statistical methods for transcriptome completeness evaluation.
  • Experience with containerisation (Docker/Singularity) for workflow reproducibility.

Difficulty

Medium

Length

350h

Mentors

Leanne Haggerty, Francesca Tricomi

NextFlow Pipeline for Selenocysteine Protein Annotation in Ensembl Gene Sets

Brief Explanation

Selenocysteine-containing proteins (selenoproteins) play crucial biological roles, but their annotation remains challenging due to the unique incorporation of selenocysteine (Sec, U) at UGA codons. Currently, Ensembl uses Exonerate to align known selenoproteins to genomes and manually verifies models based on sequence identity and coverage. However, the existing approach is inefficient and outdated, requiring a more scalable and automated solution.

This project will develop a NextFlow pipeline to efficiently annotate selenoproteins in Ensembl gene sets by:

  1. Optimising the search for selenoprotein homologs
    • Aligning known selenoproteins against the genome using more efficient tools like MMseqs2, DIAMOND, or TBLASTN.
    • Filtering candidate regions based on sequence similarity, focusing on high-identity and high-coverage matches.
  2. Improving Selenocysteine Validation
    • Detecting UGA codons in aligned models and verifying the presence of SECIS elements (selenocysteine insertion sequences) in upstream/downstream regions.
    • Ensuring selenocysteine positions match the reference protein sequences.
  3. Automated Filtering and Quality Control
    • Retaining only models with ≥90% coverage and ≥95% sequence identity to known selenoproteins.
    • Removing false positives by integrating BUSCO-like completeness scoring.
    • Generating quality assessment reports.
  4. Deployability and Scalability
    • Implementing the pipeline in NextFlow to improve reproducibility and scalability across multiple genomes.
    • Providing Docker/Singularity containers for easy deployment in HPC and cloud environments.

The final pipeline will streamline selenoprotein annotation within Ensembl, improving accuracy, efficiency, and automation.
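
To make the validation step (step 2 above) concrete, the sketch below translates a candidate CDS with in-frame UGA read through as selenocysteine and compares Sec positions against the reference protein. It is a minimal illustration assuming ungapped, in-frame sequences; a real implementation would work from the alignment coordinates.

```python
from Bio.Seq import Seq

def translate_with_sec(cds):
    """Translate a CDS, reading in-frame UGA through as Sec ('U')."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3].upper()
        if codon == "TGA":
            protein.append("U")          # read UGA through as selenocysteine
        else:
            protein.append(str(Seq(codon).translate()))
    return "".join(protein)

def sec_positions(protein):
    return {i for i, aa in enumerate(protein) if aa == "U"}

candidate = translate_with_sec("ATGTGATTTTAA")   # toy CDS: M, U(Sec), F, stop
reference = "MUF*"
if sec_positions(candidate) == sec_positions(reference):
    print("Sec positions match the reference:", candidate)
```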

Expected results

  • A NextFlow-based selenoprotein annotation pipeline that aligns known selenoproteins and predicts valid selenocysteine-containing models.
  • Automated verification of UGA codons and SECIS elements.
  • Higher efficiency than the current Exonerate-based pipeline, with improved filtering for false positives.
  • Integration-ready outputs for Ensembl gene sets.
  • Containerised workflow for deployment on multiple computing environments.

Required knowledge

  • NextFlow or similar workflow automation tools.
  • Sequence alignment tools (DIAMOND, MMseqs2, TBLASTN, Exonerate).
  • Genome annotation formats (FASTA, GFF3).
  • Basic knowledge of selenoprotein biology and SECIS elements.

Desirable knowledge

  • Experience with gene annotation pipelines (e.g., Ensembl-anno, AUGUSTUS, BRAKER).
  • RNA structure analysis tools for SECIS detection (e.g., SECISearch, Infernal).
  • BUSCO or other completeness assessment tools.
  • Containerisation technologies (Docker, Singularity).

Difficulty

Medium

Length

350h

Mentors

Jose Perez-Silva

Expanding a Pipeline for Small Non-Coding RNA (sncRNA) Identification in Ensembl Genomes

Brief Explanation

This project aims to enhance an existing pipeline for identifying small non-coding RNAs (sncRNAs) in Ensembl genomes. Building on the current MirMachine modules, the pipeline will be expanded to incorporate additional analyses using RFAM and miRBase databases. Further improvements will include running sequence similarity searches with NCBI-BLAST and generating structural models using the Infernal software suite. The final pipeline will be optimised for flexibility, supporting various input sources, and containerised using Docker/Singularity to ensure reproducibility.
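
As a sketch of the Infernal step, the snippet below wraps cmscan and parses its tabular output. It assumes cmscan is on the PATH and that the covariance model file (e.g. built from RFAM) has been prepared with cmpress; file names are placeholders.

```python
import subprocess

def run_cmscan(cm_db, genome_fasta, tblout="cmscan_hits.tbl", cpus=4):
    """Scan a genome against covariance models, writing a tabular hits file."""
    subprocess.run(
        ["cmscan", "--cut_ga",           # use the curated gathering cutoffs
         "--tblout", tblout,
         "--cpu", str(cpus),
         cm_db, genome_fasta],
        check=True,
    )
    return tblout

def parse_tblout(path):
    """Yield (model_name, sequence_name, e_value) from cmscan --tblout output."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            yield fields[0], fields[2], float(fields[15])

hits = run_cmscan("Rfam.cm", "genome.fa")
for model, seq, evalue in parse_tblout(hits):
    print(model, seq, evalue)
```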

Expected Results

  • Integration of RFAM and miRBase data for improved sncRNA annotation.
  • Incorporation of NCBI-BLAST for sequence similarity searches.
  • Implementation of Infernal for RNA structural model generation.
  • Optimisation of pipeline scalability and flexibility for different input sources.
  • Containerisation of the pipeline using Docker/Singularity for easy deployment.
  • Documentation and testing to ensure usability and reproducibility.

Required Knowledge

  • NextFlow or other workflow management tools.
  • Python and/or Bash for pipeline scripting.
  • Basic RNA bioinformatics (FASTA, GFF3 formats, RNA databases).

Desirable Knowledge

  • Experience with RFAM, miRBase, and NCBI-BLAST.
  • Familiarity with Infernal for RNA secondary structure modelling.
  • Knowledge of Docker/Singularity for workflow containerisation.
  • Experience in workflow optimisation for large-scale genomic data.

Difficulty

Medium

Length

350h

Mentors

Jose Perez-Silva

Automated Pseudogene detection in Ensembl gene sets

Brief Explanation

This project aims to develop a Python module for identifying potential pseudogenes within Ensembl gene sets. The module will be designed as a standalone tool that can process GTF/GFF3 and FASTA files and will be integrated into the Ensembl-anno annotation toolkit. It will identify genes exhibiting pseudogene-like properties based on a combination of sequence similarity, structural abnormalities, and genomic context.

Pseudogenes can arise through gene duplication (non-processed pseudogenes) or retrotransposition (processed pseudogenes), and this tool will systematically detect both. Key features include:

  • Annotation-based filtering: Identifying models with high sequence similarity to known proteins but displaying abnormalities such as:
    • Missing start codon
    • Non-canonical splicing events
    • Unusually small introns (< 75bp)
    • Excessive repeat coverage
  • Processed pseudogene classification: Identifying single-exon models that have a multi-exon homolog elsewhere in the genome, suggesting retrotransposition.
  • Structural and sequence-based heuristics: Additional filters based on open reading frame (ORF) disruptions, premature stop codons, or truncations.

The final module will output a revised gene set with genes flagged or reclassified as pseudogenes based on the above criteria.
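
A minimal sketch of how the annotation-based filters above could be expressed in Python is shown below; the GeneModel structure and all thresholds other than the 75 bp intron cutoff are illustrative assumptions.

```python
from dataclasses import dataclass, field

MIN_INTRON_BP = 75

@dataclass
class GeneModel:
    gene_id: str
    has_start_codon: bool
    intron_lengths: list = field(default_factory=list)
    repeat_fraction: float = 0.0        # fraction of exonic bases in repeats
    exon_count: int = 1
    multi_exon_homolog: bool = False    # multi-exon copy elsewhere in genome

def pseudogene_flags(model: GeneModel):
    """Return the pseudogene-like properties a gene model exhibits."""
    flags = []
    if not model.has_start_codon:
        flags.append("missing_start_codon")
    if any(length < MIN_INTRON_BP for length in model.intron_lengths):
        flags.append("small_intron")
    if model.repeat_fraction > 0.5:     # threshold is an assumption
        flags.append("high_repeat_coverage")
    if model.exon_count == 1 and model.multi_exon_homolog:
        flags.append("processed_pseudogene_candidate")
    return flags

candidate = GeneModel("gene42", has_start_codon=False,
                      exon_count=1, multi_exon_homolog=True)
print(candidate.gene_id, pseudogene_flags(candidate))
```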

Expected results

  • A standalone Python module that can process GTF/GFF3 + FASTA input files to detect pseudogenes.
  • Integration into Ensembl-anno to work as part of the Ensembl annotation toolkit.
  • Detection of both processed (retrotransposed) and non-processed (duplicated) pseudogenes using multiple filtering criteria.
  • Scalable and optimised performance for large genomes.
  • Well-documented code with unit tests and clear integration instructions.

Required knowledge

  • Python (including bioinformatics libraries such as Biopython, Pandas).
  • GTF/GFF3 and FASTA file formats (basic understanding of genome annotations).
  • Basic genomics, particularly gene structures and pseudogenes.

Desirable knowledge

  • Experience with Ensembl gene annotation and tools like Ensembl-anno.
  • Familiarity with sequence alignment tools (BLAST, DIAMOND, HMMER) for homology detection.
  • Understanding of alternative splicing and transposable elements.
  • Knowledge of workflow optimisation for handling large datasets.

Difficulty

Medium

Length

Medium

Mentors

Leanne Haggerty

Expand automatic GFF3 validation for Ensembl genome loading

Brief Explanation

Ensembl Plants and Ensembl Metazoa import publicly available genome assemblies and their annotations from community contributors. Whilst assemblies are submitted to INSDC sequence archives, genome annotations, in the form of GFF3 files, are often submitted to generalist repositories such as Zenodo, where the files do not undergo further validation to ensure that they conform to community standards. Due to the flexibility of the GFF3 standard and the variety of ways in which gene models can be represented, bringing these annotations into Ensembl often requires manual intervention to ensure the GFF3 file is compatible with our loading processes. The goal of this project is to develop a Python module, integrated into our loading system, that validates (and even curates) these files before any compute is performed, allowing us to get back to our providers and request amendments or validate proposed changes as early as possible in our production system.
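
As a sketch of the kind of structural check the module could start from, the snippet below verifies that every Parent attribute references a known ID and that child features fall within their parent's coordinates; a full validator would cover many more of the GFF3 specification's rules (feature types, phase, canonical gene models, etc.).

```python
import sys
from collections import defaultdict

def validate_gff3(path):
    ids, children, errors = {}, defaultdict(list), []
    with open(path) as fh:
        for n, line in enumerate(fh, 1):
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9:
                errors.append(f"line {n}: expected 9 columns, got {len(cols)}")
                continue
            attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
            start, end = int(cols[3]), int(cols[4])
            if "ID" in attrs:
                ids[attrs["ID"]] = (start, end)
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    children[parent].append((n, start, end))
    for parent, kids in children.items():
        if parent not in ids:
            errors.extend(f"line {n}: unknown Parent '{parent}'" for n, *_ in kids)
            continue
        pstart, pend = ids[parent]
        errors.extend(
            f"line {n}: child [{s}-{e}] outside parent '{parent}' [{pstart}-{pend}]"
            for n, s, e in kids if s < pstart or e > pend
        )
    return errors

for err in validate_gff3(sys.argv[1]):
    print(err)
```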

Expected results

  • A standalone Python module that can process GFF3 files and validate them
  • The validation should determine whether the GFF3 file is ready to be loaded by Ensembl, or report any issues found: the more detailed this report is, the easier it will be for us to return the information to our providers so the issues can be addressed
  • The code will include documentation as well as type hints and unit testing
  • Desirable: add an option to correct the more straightforward errors reported above

Required knowledge

  • Python, Bash
  • Basic understanding of GFF3 files and gene models

Desirable knowledge

  • Advanced knowledge of GFF3 files and gene models of non-vertebrate/plant genomes
  • Familiarity with tools such as GFF3toolkit, AGAT (Another Gff Analysis Toolkit) or GenomeTools
  • Machine learning frameworks such as TensorFlow, which could potentially be useful here

Difficulty

Medium

Length

350h

Mentors

Jorge Alvarez, Sarah Dyer

Automatic detection of new genomes of interest

Brief Explanation

Ensembl Metazoa and Plants plan each release by generating a list of available species from INSDC resources and assessing their available information (e.g. taxonomic clade, assembly quality, annotation availability/quality, etc.) to select ~40 species that will be processed and loaded into the next Ensembl release. As an example, taxonomic information is used to highlight species that cover new clades not present in Ensembl, as well as those that bring novel information to existing clades, e.g. new locust genomes in the well-known Neoptera clade. This process has several manual steps.

As we continue to expand our Ensembl Metazoa and Plants resources we would like to automate the process described above to check for newly available species in INSDC resources, and to create a system that allows us to rank these assemblies using different criteria. This system should collect the data on a regular basis, e.g. monthly, and provide the required information in a format suitable for easy ingestion into our production loading system, for example listing the GCA identifier, species name, strain, common name, and taxonomy information. Integration with Google Sheets would be desirable.
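
One possible shape for the collection step, sketched below, shells out to NCBI's datasets CLI and extracts a few fields from the assembly reports; the CLI is assumed to be installed and on the PATH, and the JSON field names may need adjusting to the current report schema.

```python
import json
import subprocess

def new_assemblies(taxon):
    """Yield (GCA accession, organism name, assembly level) for a taxon."""
    result = subprocess.run(
        ["datasets", "summary", "genome", "taxon", taxon, "--as-json-lines"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        report = json.loads(line)
        yield (
            report.get("accession"),
            report.get("organism", {}).get("organism_name"),
            report.get("assembly_info", {}).get("assembly_level"),
        )

# Example: list assemblies for a clade of interest; a monthly cron/Slurm job
# would diff this output against the previous run to find new genomes.
for accession, organism, level in new_assemblies("Neoptera"):
    print(accession, organism, level)
```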

Expected results

  • Automatic system that can run monthly to provide a list of available new genomes from INSDC for ingestion into Ensembl

Required knowledge

  • Python and Bash
  • Slurm and cron jobs
  • Basic understanding of taxonomy information

Desirable knowledge

  • NCBI’s datasets CLI

Difficulty

Medium

Length

350h

Mentors

Jorge Alvarez, Sarah Dyer

Rust Tauri app for Ensembl’s Regulatory Activity Viewer

Brief Explanation

Ensembl’s Regulatory Activity Viewer is a new client-side JavaScript application at the MVP stage. It shows genomic annotation and regulatory activity (open chromatin, ChIP-seq) across cell types for a given region of the genome. It interacts with the backend through HTTP APIs that serve JSON. The backend endpoints are written in Rust, with the data being served from files in an S3-compatible Object Store. 

We would like to explore the possibility of building a Tauri client application, replacing the HTTP endpoints with calls directly to the Rust code. We would then test whether we can integrate local files with the files we provide through our HTTP APIs.

Expected results

  • A Tauri client app that can run the Regulatory Activity Viewer using local files.
  • Ideally, the app could also be used to access our HTTP APIs (either/or is OK in the first instance).
  • Integration of our HTTP API with local files is a stretch goal.
  • This is a proof of concept, so the code need not be production ready. However, learnings and limitations should be documented for subsequent iterations.

Required knowledge

  • Rust
  • JavaScript (familiarity with the language; JavaScript framework experience isn’t required)

Desirable knowledge

  • Understanding of Web Components with opinions on their strengths and weaknesses
  • Experience with front-end frameworks like React

Difficulty

Medium

Length

350h

Mentors

Garth Ilsley, Paulo Lins 

Exploring efficient techniques for fairer access to the MGnify Protein Database

Brief Explanation

The latest release of the MGnify Proteins Database contains over 2.4 billion non-redundant protein records split into ~717 million representative clusters. Along with the large volume of raw biological data on offer, the potential of the database lies in its links to the higher-level MGnify microbiome resource, connecting protein records to their metagenomic origins. The scale of this data currently necessitates querying through a high-performance computing cluster (to which the database’s flat files can be downloaded), or through the newly released Google BigQuery dataset. With recent developments in data science allowing the analysis of big data on personal workstations, there is exciting scope for more accessible ways of exploiting the MGnify Proteins Database.

In this project, we would like to explore computational techniques that will help open the MGnify Proteins Database to a larger audience. Current avenues for work include the transformation of the database into a “lite” version managed using lightweight, high-performance tools such as SQLite, DuckDB, and Polars. A set of utilities (e.g. CLI tools or Python utilities) that query this form of the database could then be developed, e.g. to enable the retrieval of biomes associated with a particular protein sequence.
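
As a small sketch of the “database-lite” idea, the snippet below uses DuckDB to ingest a flat-file subset and answer a protein-to-biome query; the file, table, and column names are assumptions for illustration.

```python
import duckdb

con = duckdb.connect("mgnify_lite.duckdb")

# Ingest a flat-file subset directly; DuckDB infers column types from the TSV.
con.execute("""
    CREATE TABLE IF NOT EXISTS protein_biomes AS
    SELECT * FROM read_csv_auto('protein_biomes_subset.tsv', delim='\t', header=true)
""")

# Example utility query: biomes associated with one protein accession.
rows = con.execute(
    "SELECT DISTINCT biome FROM protein_biomes WHERE protein_accession = ?",
    ["MGYP000000000001"],
).fetchall()
print(rows)
```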

Expected results

  • Sketch a schema for an MGnify Proteins Database-lite
  • Ingest subsets of database flat files into efficient relational database or dataframe systems like SQLite, DuckDB, Parquet
  • Test and benchmark common and simple queries to the database
  • Write command-line utilities implementing previously developed queries

Required knowledge

  • Familiar with relational DBMS concepts like ER models
  • Experience with SQL for creating and querying a database
  • Comfortable using a Unix shell (Bash)

Desirable knowledge

  • Python skills for data manipulation and development of simple command-line utilities
  • Basic git skills for version-control of work

Difficulty

Medium

Length

350h

Mentors

Christian Atallah, Jiawei Wang

Uncharacterized Yet Conserved: Functional Insights into Human Proteins Across Haplotypes

Brief Explanation

This project aims to identify patterns among conserved human proteins that have not yet been functionally characterized, by integrating genomic, structural, and multi-omics data across different haplotypes from the Human Pangenome Project. Many human proteins remain uncharacterized despite being evolutionarily conserved. By analyzing expression data, protein interactions, 3D structure, genetic variation, and taxonomic conservation, this project seeks to predict potential functions for these proteins.

The final output will be a pipeline that systematically analyzes conserved proteins with no known function, leading to potential discoveries in human biology, disease relevance, and novel drug targets.
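
As a sketch of the data-extraction component, the snippet below uses the Ensembl REST API's /xrefs endpoint to test whether a gene has an HGNC symbol assigned; the gene ID is a placeholder and error handling is omitted for brevity.

```python
import requests

SERVER = "https://rest.ensembl.org"

def has_hgnc_symbol(gene_id):
    """Return True if any cross-reference for the gene comes from HGNC."""
    response = requests.get(
        f"{SERVER}/xrefs/id/{gene_id}",
        headers={"Content-Type": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return any(xref.get("dbname") == "HGNC" for xref in response.json())

# Genes lacking an HGNC symbol would feed into the downstream analyses.
uncharacterised = [g for g in ["ENSG00000139618"] if not has_hgnc_symbol(g)]
print(uncharacterised)
```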

Expected Results

  • A comprehensive list of proteins across human haplotypes that lack a known/assigned HGNC symbol or GO annotation, categorized based on:
    • Expression across tissues (GTEx)
    • Population variation (Human Pangenome Reference Consortium Project, gnomAD)
    • Evolutionary conservation (OrthoDB, Ensembl Compara)
    • Protein structure-based functional prediction (AlphaFold2)
    • Disease and clinical relevance (GWAS, ClinVar)
  • A functional classification of these conserved proteins based on:
    • Predicted enzymatic/metabolic roles (ENZYME database)
    • Interaction with known proteins (STRING, BioGRID)
    • Protein motifs and domain analysis (Pfam)
  • A bioinformatics pipeline that can be reused for future studies in other species or diseases. The pipeline should have the following components:
    • Data Extraction
    • Multi-Omics Analysis
    • Structural & Functional Prediction
    • Evolutionary & Population-Level Insights
    • Comparative analysis (OrthoDB, OMA)
    • Disease & Clinical Relevance
  • A prioritised list of high-impact conserved proteins with strong functional predictions, plus an automated pipeline for future use.

Required Knowledge

  • Essential Skills & Tools:
    • Basic bioinformatics (FASTA, GFF file handling)
    • Experience with Python/R for data analysis
    • Understanding of genomic databases (Ensembl, GENCODE)
    • Familiarity with multi-omics analysis (GTEx, proteomics tools)
    • Ability to use AlphaFold2 for structural prediction
  • Basic machine learning (for functional clustering)
  • Programming in Python

Desirable Knowledge

  • Knowledge of human genomics
  • Experience in evolutionary biology 
  • Nextflow

Difficulty

Intermediate to Advanced

Length

350 hours

Mentor

Swati Sinha

A new store for variation phenotypes: prototyping a MongoDB phenotype store and GraphQL access

Brief Explanation

As part of development of the new Ensembl beta site, novel data structures, storage methods, and methods of accessing them are needed. As part of this effort we envision storing existing variant, transcript, and gene phenotype data in a MongoDB database, which will be accessed via an extension of our existing GraphQL API (Hypsipyle).
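
A minimal sketch of the prototype store is shown below, assuming a local MongoDB instance; the document shape is purely illustrative, as the real schema would follow the Variation Data Model (VDM).

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
phenotypes = client["variation"]["phenotypes"]

# Load a phenotype record linked to a variant (illustrative shape only).
phenotypes.insert_one({
    "feature_id": "rs699",
    "feature_type": "variant",
    "phenotype": "essential hypertension",
    "source": "ClinVar",
})

# A Hypsipyle resolver would run a query along these lines.
for doc in phenotypes.find({"feature_id": "rs699"}):
    print(doc["phenotype"], "-", doc["source"])
```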

Expected results

A functional database with methods to load in new data.

Associated functions added to Hypsipyle to retrieve phenotype data in line with the existing Variation Data Model (VDM).

Required knowledge

  • Programming in Python
  • Familiarity with NoSQL (preferably MongoDB)

Desirable knowledge

  • API development experience
  • Experience with GraphQL

Difficulty

Intermediate

Length

350 hours

Mentors

Likhitha Surapaneni, Syed Hossain

Assessment of available Linkage Disequilibrium (LD) tools and how to integrate them

Brief Explanation

As part of the new beta Ensembl website we are going to replicate the existing site’s functionality and allow users to run on-the-fly LD calculations and generate plots. For this, we need to evaluate existing tools (such as PLINK) and determine how best to interface with them (Python wrappers, APIs, etc.).
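
One candidate interface, sketched below, wraps PLINK 1.9 via a Python subprocess and parses its .ld output; the file names and variant ID are placeholders.

```python
import subprocess

def ld_for_variant(vcf, variant_id, window_kb=500, out_prefix="ld_results"):
    """Compute r2 between one variant and its neighbours, returning rows."""
    subprocess.run(
        ["plink", "--vcf", vcf,
         "--r2",
         "--ld-snp", variant_id,
         "--ld-window-kb", str(window_kb),
         "--out", out_prefix],
        check=True,
    )
    rows = []
    with open(f"{out_prefix}.ld") as fh:
        header = fh.readline().split()   # CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
        for line in fh:
            rows.append(dict(zip(header, line.split())))
    return rows

for row in ld_for_variant("region.vcf.gz", "rs699"):
    print(row["SNP_B"], row["R2"])
```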

Expected results

  1. Determination, for each tool, of whether it meets our requirements.
  2. Determination of the best/fastest/most reliable means of interfacing with each tool that passes step 1.
  3. Using the previous steps, put forward candidate solution(s) for LD on the beta site.

Required knowledge

  • Programming in Python
  • Some background knowledge of bioinformatics tools

Desirable knowledge

  • Some knowledge of statistical genetics / genomics

Difficulty

Intermediate to advanced

Length

350 hours

Mentors

Jamie Allen

Integrating European Variation Archive (EVA) data in the Ensembl beta website

Brief Explanation

There is variation data available for many species in the EVA, all of which can be accessed through their API. We would like to integrate this API access into our own GraphQL-based API that serves variation data to the new Ensembl beta website. This would support the detailed variant views provided by the entity viewer on our new site.
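
A hedged sketch of the wrap-and-reshape step is below; the EVA endpoint path and response field names are assumptions for illustration and would need checking against EVA's API documentation before use.

```python
import requests

EVA_BASE = "https://www.ebi.ac.uk/eva/webservices/rest"  # base URL; path below assumed

def fetch_eva_variants(species, chrom, start, end):
    """Fetch variants in a region from the EVA (endpoint path assumed)."""
    response = requests.get(
        f"{EVA_BASE}/v1/segments/{chrom}:{start}-{end}/variants",
        params={"species": species},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

def to_vdm(eva_variant):
    """Reshape one EVA record into a VDM-like dict (field names assumed)."""
    return {
        "name": (eva_variant.get("ids") or [None])[0],
        "alleles": [eva_variant.get("reference"), eva_variant.get("alternate")],
        "location": {
            "region": eva_variant.get("chromosome"),
            "start": eva_variant.get("start"),
            "end": eva_variant.get("end"),
        },
    }

# Illustrative record showing the reshaping; a GraphQL resolver would apply
# to_vdm to each record extracted from the fetch_eva_variants payload.
example = {"ids": ["rs123"], "reference": "A", "alternate": "G",
           "chromosome": "13", "start": 32315474, "end": 32315474}
print(to_vdm(example))
```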

Expected results

  1. Wrapped API call to the EVA added to our GraphQL API.
  2. Data reshaped appropriately to fit our Variation Data Model (VDM) and serve data to the web client.
  3. Potential for interaction with the EVA to request new / adjust end points if needed.

Required knowledge

  • Python programming
  • Experience with APIs

Desirable knowledge

  • Experience with GraphQL

Difficulty

Intermediate

Length

350 hours

Mentors

Syed Hossain, Likhitha Surapaneni

Slurpy: a TUI-based Python interface for HPC job management using Slurm

Brief Explanation

Slurm is a workload manager and job scheduler for High Performance Computing (HPC) clusters. At EMBL-EBI, we use Slurm to manage jobs on our in-house HPC cluster, called the CODON cluster. Slurm provides REST APIs that clients can use to interact with Slurm commands. These REST APIs are not directly internet-facing; that is, there are no endpoints immediately accessible over the internet. Rather, they are provided via a daemon that is expected to run within the cluster. Libraries like PySlurm currently provide Python bindings to Slurm’s C library; this, however, requires a full Slurm installation in order to be used. That poses a significant limitation for users who would like to interact with the Slurm controller without a working Slurm installation.

The aim of this project is to provide a simpler way to interact with the Slurm APIs via a Python wrapper. A basic prototype of Slurpy has already been created (https://github.com/SandyRogers/slurpy); the main goal of this project is to take it to a full working implementation.
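
As a sketch of the core idea, the snippet below calls a slurmrestd endpoint over HTTP and validates the response with a Pydantic model; the daemon address, API version segment, and token handling are all deployment-specific assumptions.

```python
import os
from typing import Any

import requests
from pydantic import BaseModel

SLURMRESTD = "http://localhost:6820"   # daemon address: site-specific assumption
API_PREFIX = "/slurm/v0.0.39"          # version segment depends on the installation

class Job(BaseModel):
    """Subset of a slurmrestd job record; extend as needed."""
    job_id: int
    name: str = ""
    job_state: Any = None              # string or list, depending on API version

def list_jobs(user: str, token: str) -> list[Job]:
    response = requests.get(
        f"{SLURMRESTD}{API_PREFIX}/jobs",
        headers={"X-SLURM-USER-NAME": user, "X-SLURM-USER-TOKEN": token},
        timeout=30,
    )
    response.raise_for_status()
    return [Job.model_validate(job) for job in response.json().get("jobs", [])]

for job in list_jobs(os.environ["USER"], os.environ["SLURM_JWT"]):
    print(job.job_id, job.name, job.job_state)
```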

Expected results

  • A locally working Docker environment that has Slurm running and is operable by the Slurpy library.
  • Using Pydantic schemas to validate requests and responses from the Slurm APIs
  • A TUI application based on Slurpy that allows users to interact with Slurm via a visually appealing command-line interface

Required knowledge

  • Python 
  • Docker

Desirable knowledge

  • Basic knowledge of HPC
  • Good coding practices
  • Experience contributing to open source projects

Difficulty

Medium

Length

350h

Mentors

Mahfouz Shehu