Development of Computational Pipeline Software for Genome/Exome Analysis on the K Computer
DOI:
https://doi.org/10.14529/jsfi200102
Abstract
Pipeline software, which chains tools and applications for a specific data-processing task, is widely used in bioinformatics research to analyze many types of data, including genomes. Genome analysis increasingly relies on pipeline software to make the best use of computational resources and to handle efficiently the large-scale biological data that accumulate daily. However, pipelines in bioinformatics are difficult to run because of their large memory and storage requirements, the growing number of job submissions, and their wide range of software dependencies. This paper presents massively parallel genome/exome analysis pipeline software that addresses these difficulties and can be executed on a large number of K computer nodes. The proposed pipeline incorporates workflow management functionality that accounts for the task-dependency graph of its internal executions, implemented by extending the dynamic task distribution framework. Evaluation experiments on an actual exome dataset show that the core pipeline functionality scales well on more than a thousand nodes. In addition, this study proposes several approaches to resolving pipeline performance bottlenecks, a major challenge in pipeline parallelization, by drawing on domain knowledge of the pipeline's internal executions.
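The idea of distributing tasks dynamically while respecting a task-dependency graph can be illustrated with a minimal sketch. This is not the paper's implementation, which runs on K computer nodes; it is a single-process analogy using Python threads. The function names `run_pipeline` and `demo_graph` and the stage names (`align`, `sort`, `call`, `annotate`) are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_pipeline(tasks, deps, max_workers=4):
    """Run the callables in `tasks`, honoring `deps` (task name -> set of prerequisites).

    Returns the task names in the order in which they completed.
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    finished = []   # completion order
    running = {}    # future -> task name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Dispatch every task whose prerequisites have all completed.
            ready = [name for name, pre in remaining.items() if not pre]
            for name in ready:
                del remaining[name]
                running[pool.submit(tasks[name])] = name
            if not running:
                raise ValueError("dependency cycle or unknown prerequisite")
            # Block until at least one task finishes, then release its dependents.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the task
                finished.append(name)
                for pre in remaining.values():
                    pre.discard(name)
    return finished


def demo_graph(samples=("s1", "s2")):
    """Build a hypothetical per-sample chain of stages (names are illustrative)."""
    stages = ["align", "sort", "call", "annotate"]
    tasks, deps = {}, {}
    for sample in samples:
        previous = None
        for stage in stages:
            name = f"{sample}:{stage}"
            tasks[name] = lambda: None  # placeholder for real work
            deps[name] = {previous} if previous else set()
            previous = name
    return tasks, deps
```

Calling `run_pipeline(*demo_graph())` returns a completion order in which each stage of a sample appears after the stage it depends on, while stages of different samples may interleave. Workers pick up new tasks as soon as prerequisites finish rather than following a fixed, precomputed assignment.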
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.