This month we’re meeting Dan Staines, who is part of the Ensembl Production team.
What is your job in Ensembl?
I lead the Production team, who look after all the processes and pipelines needed to transform the data provided by the other teams in Ensembl and Ensembl Genomes into what our users see via the website, FTP site, APIs and servers. We do the twin jobs of writing new automated pipelines and controlling infrastructure to build a release of Ensembl, and making sure those pipelines run as they should during the release process.
What do you enjoy about your job?
We’re in the business of solving problems to get data to our users, and you can’t beat the satisfaction of spotting a problem, figuring out the cause, working out a solution and then seeing things run smoothly. We’re also open to using whatever technology is best fitted to the problem in hand, and it’s always fun to explore new technology and approaches to problems.
What are you currently working on?
I’ve got a variety of projects on the go, ranging from an early proof-of-concept design for an automated platform for importing data during a release, to a new data access framework supporting customisable bulk access to data from across Ensembl and Ensembl Genomes. This last project is particularly exciting, since it has been a huge challenge to find the best approach to scalably search and retrieve data from different parts of Ensembl and combine it with data from our colleagues in different parts of the EBI. However, we’re making great progress and are really looking forward to sharing the finished product with our users.
What is your typical day?
I’m lucky enough to live close enough to campus to cycle most days, so that always sets me up for a productive day at work. The day starts with catching up with other people in Ensembl, including a daily standup session with other managers in the group so we know what’s going on in other teams or parts of the institute, and then a quick session with my team to see where we are in our work and to identify any problems that need attention. The rest of the day can vary enormously, ranging from troubleshooting running pipelines and liaising with our systems groups on the resources we need, to mentoring team members on their own development work, and spending quality time working on my development projects. And drinking coffee, obviously.
How did you end up here?
“As far back as I can remember, I always wanted to be a bioinformatician…” Well, not quite, but science was an early fascination, leading me eventually to study Molecular Biology at university. I also had a long-standing interest in computing, having owned a variety of early home computers, and these two interests collided whilst I was working on my PhD in Biochemistry, when I started writing my own tools for sequence manipulation.
At this point it became clear that this was much more fun than lab work (presumably much to the relief of my long-suffering supervisor), leading me to take a job with a startup working on data integration software for biological data. After a few fun-filled years roaming the world writing custom software solutions for pharmaceutical companies and designing data processing frameworks, I ended up at the EBI on a data integration project that eventually turned into Ensembl Genomes, Ensembl’s sister project for non-vertebrate genomes.
Over the years, the Ensembl and Ensembl Genomes projects have slowly become closer in how they operate, and reflecting this, I now head a team that looks after production processes and develops new infrastructure for both projects.
What surprised you most about Ensembl when you started working here?
Ensembl is a very long-lived project with a lot of history, so the code base and data schemas are never short of surprises, even for old hands like me.
What is the coolest tool or data type in Ensembl that you think everybody should know about?
One of the main reasons that our processing has managed to keep pace with the explosion of data is how effectively we’re able to process large amounts of data on our compute cluster. For this, we rely heavily throughout the project on eHive, an extremely capable system developed in the Ensembl team for creating and running pipelines that schedule and execute millions of parallel compute jobs. eHive means it’s easy to automate complex and time-consuming processes, leaving us with time to concentrate on the big problems.