Research Article

Performance Analysis of Machine Learning and Bioinformatics Applications on High Performance Computing Systems

Year 2020, Volume: 8 Issue: 1, 1 - 14, 28.01.2020
https://doi.org/10.21541/apjes.547016

Abstract

It is becoming increasingly important to choose the most efficient and
suitable computational resources for algorithmic tools that extract
meaningful information from big data and make smart decisions. This paper
presents a comparative analysis of the performance of various machine
learning and bioinformatics software, including scikit-learn, TensorFlow,
WEKA, libSVM, ThunderSVM, GMTK, PSI-BLAST, and HHblits, in big data
applications on different high performance computing systems and
workstations. The programs are executed under a wide range of conditions,
such as single-core central processing unit (CPU), multi-core CPU, and
graphics processing unit (GPU), depending on the availability of an
implementation. The optimum number of CPU cores is obtained for selected
software. The running times are found to depend on many factors, including
the CPU/GPU version, the available RAM, the number of CPU cores allocated,
and the algorithm used. If parallel implementations are available for a
given software package, the best running times are typically obtained with
the GPU, followed by multi-core CPU and then single-core CPU. Although no
single system outperforms the others in all applications studied, the
results are expected to help researchers and practitioners select the most
appropriate computational resources for their machine learning and
bioinformatics projects.
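
The abstract does not spell out how the optimum number of CPU cores was measured. As a rough illustration only, a core-count sweep might be timed along the following lines. This is a minimal sketch using the Python standard library; the `workload` callable, the function names, and the best-of-N timing policy are all assumptions made for this example and are not taken from the paper.

```python
import time
from typing import Callable, Dict, Iterable


def time_workload(workload: Callable[[int], None], n_cores: int,
                  repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # `workload` is a hypothetical callable that runs the program
        # under test with `n_cores` CPU cores allocated.
        workload(n_cores)
        best = min(best, time.perf_counter() - start)
    return best


def sweep_core_counts(workload: Callable[[int], None],
                      core_counts: Iterable[int],
                      repeats: int = 3) -> Dict[int, float]:
    """Time the workload once per candidate core count."""
    return {n: time_workload(workload, n, repeats) for n in core_counts}


def fastest_core_count(timings: Dict[int, float]) -> int:
    """Core count with the smallest running time; ties go to fewer cores."""
    return min(timings, key=lambda n: (timings[n], n))


def speedups(timings: Dict[int, float],
             baseline_cores: int = 1) -> Dict[int, float]:
    """Speedup of each configuration relative to the baseline run."""
    base = timings[baseline_cores]
    return {n: base / t for n, t in timings.items()}
```

In practice `workload` would wrap whatever mechanism the program offers for setting its core count, for example a thread-count option or a job scheduler allocation. Preferring the smaller core count on ties is a design choice of this sketch: when extra cores no longer reduce the running time, requesting fewer of them leaves more of a shared system free.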



Abstract

Today, running algorithms that extract meaningful information from big
data and make smart decisions in the most efficient way and in the most
suitable computing environment is of ever-increasing importance. In this
paper, the performance of various machine learning and bioinformatics
programs with big data analysis applications, such as scikit-learn,
TensorFlow, WEKA, libSVM, ThunderSVM, GMTK, PSI-BLAST, and HHblits, is
examined on high performance computing systems and workstations. In
addition to a single CPU core, the programs were run on multiple CPU cores
and on GPU cores, depending on the availability of parallel and GPU
versions. The optimum number of CPU cores was determined for the selected
programs. The analyses led to the conclusion that speed performance depends
on many factors, among them the CPU/GPU versions, the amount of memory, the
number of cores selected, and the algorithm used. When a version of a
program that allows parallel processing is available, the fastest
computation was obtained with GPUs, followed by parallel CPU and single
CPU. Although the best-performing system differs across the applications
examined, this study will enable those doing research and development in
machine learning and bioinformatics to select the most suitable resources
for their projects.


Details

Primary Language: English
Subjects: Engineering
Section: Articles
Authors

Zafer Aydın 0000-0001-7686-6298

Publication Date: January 28, 2020
Submission Date: March 30, 2019
Published in Issue: Year 2020, Volume: 8, Issue: 1

Cite

IEEE Z. Aydın, “Performance Analysis of Machine Learning and Bioinformatics Applications on High Performance Computing Systems”, APJES, vol. 8, no. 1, pp. 1–14, 2020, doi: 10.21541/apjes.547016.