The 7th International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2016)

4–9 Jul 2016

Europe/Moscow timezone

Optimization for Bioinformatics genome sequencing pipelines by means of HEP computing tools for Grid and Supercomputers

5 Jul 2016, 16:30

15m

310

Sectional reports 2. Operation, monitoring, optimization in distributed computing systems 1. Technologies, architectures, models of distributed computing systems

Speaker: Mr Alexander Novikov (National Research Centre "Kurchatov Institute")

Modern biology uses complex algorithms and sophisticated software toolkits for genome sequencing studies, and these computations are impossible without access to substantial computing resources. Recent advances in Next Generation Sequencing (NGS) technology have led to growing volumes of sequencing data that must be processed, analyzed and made available to bioinformaticians worldwide. Analysis of ancient genome sequencing data with the popular software pipeline PALEOMIX can occupy a powerful standalone computer for a few weeks. PALEOMIX includes a typical set of software used to process NGS data: adapter trimming, read filtering, sequence alignment, genotyping and phylogenetic or metagenomic analysis. Organizing the computation with a sophisticated workload management system (WMS) and using supercomputers efficiently can greatly speed up this pipeline, and using the associated storage systems facilitates subsequent analysis.

Bioinformatics and other compute-intensive sciences have taken note of the success of projects that use PanDA beyond HEP and the Grid. PanDA (Production and Distributed Analysis Workload Management System) was developed to address the data processing and analysis challenges of the ATLAS experiment at the LHC. Recently PanDA has been extended to run HEP and non-HEP scientific applications on Leadership Class Facilities and supercomputers.

In this paper we describe the adaptation of the PALEOMIX pipeline to a distributed computing environment powered by PanDA, applied to ancient mammoth DNA samples. We used PanDA to manage computational tasks on a multi-node parallel supercomputer. This was possible because we split the input files into chunks that could be processed in parallel on different nodes as separate PALEOMIX inputs, and then merged the outputs into a final result. The total computation time decreased dramatically thanks to job brokering, submission and automatic resubmission of failed jobs by PanDA, capabilities it had demonstrated earlier for HEP applications on the Grid. Thus software tools developed initially for HEP and the Grid can reduce the computation time of bioinformatics tasks such as the PALEOMIX pipeline for ancient mammoth DNA samples from weeks to days.
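
The split, process, resubmit and merge pattern described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and uses neither the PanDA nor the PALEOMIX API: a local thread pool stands in for job brokering across supercomputer nodes, the input is assumed to be FASTQ (four lines per read), and every function name (split_fastq, run_with_retries, process_in_chunks, read_lengths) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List

# Assumption: input is FASTQ, where each read occupies exactly four lines
# (header, sequence, '+' separator, quality string).
FASTQ_LINES_PER_RECORD = 4


def split_fastq(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks of lines, each holding at most records_per_chunk whole reads."""
    it = iter(lines)
    size = records_per_chunk * FASTQ_LINES_PER_RECORD
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_with_retries(job: Callable[[List[str]], list], chunk: List[str],
                     max_attempts: int = 3) -> list:
    """Run job on one chunk, resubmitting it if it raises (stand-in for auto resubmission)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job(chunk)
        except Exception as exc:  # a real system would distinguish error types
            last_error = exc
    raise RuntimeError(f"chunk failed after {max_attempts} attempts") from last_error


def process_in_chunks(lines: Iterable[str], job: Callable[[List[str]], list],
                      records_per_chunk: int, workers: int = 4) -> list:
    """Split the input, process chunks in parallel, and merge results in input order."""
    chunks = list(split_fastq(lines, records_per_chunk))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the submitted chunks
        results = list(pool.map(lambda c: run_with_retries(job, c), chunks))
    merged = []
    for partial in results:
        merged.extend(partial)
    return merged


def read_lengths(chunk: List[str]) -> List[int]:
    """Toy per-chunk job: length of every read (a placeholder for a real pipeline step)."""
    return [len(chunk[i].strip()) for i in range(1, len(chunk), FASTQ_LINES_PER_RECORD)]
```

Splitting on whole-record boundaries is what lets each chunk act as an independent input. A step whose result depends on all reads at once would need a real merge stage rather than the simple concatenation used here.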

Primary author: Mr Alexander Novikov (National Research Centre "Kurchatov Institute")

Co-authors:
Dr Alexei Klimentov (Brookhaven National Lab)
Mr Alexey Poyda (NRC Kurchatov Institute)
Mr Anthony Teslyuk (NRC Kurchatov Institute)
Mr Artem Nedoluzhko (NRC Kurchatov Institute)
Mr Daniel Drizhuk (NRC Kurchatov Institute)
Mr Fedor Sharko (NRC Kurchatov Institute)
Mr Ivan Tertychnyi (NRC Kurchatov Institute)
Mr Ruslan Mashinistov (NRC Kurchatov Institute)
Mr Vasiliy Aulov (NRC Kurchatov Institute)

Slides

Novikov_grid2016_optimization_for_bio_pipelines.pdf
