Research Article

Evaluation of Differences of Fast and High Accuracy Base Calling Models of Guppy on Variant Calling Using Low Coverage WGS Data

Year 2023, Volume: 6 Issue: 3, 276 - 287, 20.12.2023
https://doi.org/10.38001/ijlsb.1308355

Abstract

Long-read sequencing technologies such as those from Oxford Nanopore Technologies (ONT) have enabled researchers to sequence long reads quickly and cost-effectively. ONT sequencing uses nanopores integrated into semiconductor surfaces and sequences genomic material by measuring changes in current across the surface as each nucleotide passes through the nanopore. The default output of ONT sequencers is in FAST5 format. The first and one of the most important steps of ONT data analysis is the conversion of FAST5 files to FASTQ files using “base caller” tools. Generally, base caller tools use pre-trained deep learning models to transform electrical signals into reads. Guppy, the most commonly used base caller, offers two main model types, fast and high accuracy. Although the computation time differs significantly between these two models, their effect on the variant calling process is not fully understood. This study aims to evaluate the effect of the two model types on variant calling performance.
In this study, 15 low-coverage long-read sequencing runs of NA12878 (gold standard data), each from a different flow cell, were used to compare the variant calling results obtained with the two Guppy models.
The results indicate that the number of output FASTQ files, read counts, and average read lengths do not differ significantly between the fast and high accuracy models, whereas the pass/fail ratios of the base called datasets are significantly higher with the high accuracy models. The difference in pass/fail ratios in turn leads to a significant difference in the number of called Single Nucleotide Polymorphisms (SNPs) and insertions and deletions (InDels). Interestingly, the true positive rates of SNPs are not significantly different, so using fast models for SNP calling does not statistically affect the true positive rate. The primary observation of this study is that using fast models does not decrease the true positive rate but does decrease the number of called variants, owing to the altered pass/fail ratios. Fast models are not advised for InDel calling, because both the number of InDels and their true positive rates are significantly lower with fast models.
To the best of our knowledge, this is the first study to evaluate the effect of the different base calling models of Guppy, one of the most common and ONT-supported base callers, on variant calling.
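
The kind of comparison summarized in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the true positive rate is taken here as the share of called variants that also appear in the benchmark set, the per-flow-cell values for the two models are compared with a paired t statistic (each run being base called with both models), and Cohen's d is computed on the paired differences. The function names and the variant tuple layout are hypothetical and are not taken from the study's pipeline.

```python
from math import sqrt
from statistics import mean, stdev


def true_positive_rate(called, truth):
    """Share of called variants that are also in the benchmark (truth) set.

    Assumption: variants are hashable tuples such as (chrom, pos, ref, alt).
    """
    called = set(called)
    if not called:
        return 0.0
    return len(called & set(truth)) / len(called)


def paired_t_and_cohens_d(fast, hac):
    """Paired t statistic and Cohen's d for per-run values of the two models.

    Differences are taken as high accuracy minus fast. Cohen's d is the mean
    difference divided by the sample standard deviation of the differences,
    and the t statistic is d multiplied by sqrt(number of pairs).
    """
    if len(fast) != len(hac) or len(fast) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [h - f for f, h in zip(fast, hac)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    d = mean(diffs) / sd
    t = d * sqrt(len(diffs))
    return t, d


if __name__ == "__main__":
    # Made-up toy variants, for illustration only (not data from the study).
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
    called = {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")}
    print(true_positive_rate(called, truth))
```

The resulting t statistic would still need to be compared against a t distribution with one fewer degree of freedom than the number of pairs to obtain a p-value; that step is left out to keep the sketch dependency-free.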

Supporting Institution

TUBITAK

Project Number

20AG005

Thanks

We thank Dr. Pınar Pir for advice on statistical testing.


Investigation of the Effect of Fast and High Accuracy Guppy Base Calling Models on Variant Calling Using Low Coverage WGS Data


Abstract

Long-read sequencing technologies such as Oxford Nanopore Technologies (ONT) have given researchers the ability to sequence long genetic material quickly and affordably. ONT sequencing uses a nanopore structure integrated into a semiconductor surface and performs sequencing using the changes in current that arise as genetic material passes through the pore. The output of ONT sequencing platforms is in FAST5 format by default. The first and most important step in the bioinformatic analysis of sequencing results is converting these files to FASTQ format with algorithms called “base callers”. In general, base calling algorithms use pre-trained deep learning models to convert electrical signals into reads. Guppy is one of the best known of these tools, and its development is supported by ONT. Guppy includes models specific to different ONT sequencing protocols, with two separate model types for each protocol, known as “fast” and “high accuracy”. The computation times of these two model types differ greatly, but there is no detailed academic study of their effect on the accuracy of genetic variants at the variant calling stage. Our study aims to evaluate the effects of these two model types on variant calling.
In this study, variant calling results were compared using 15 low-coverage Whole Genome Sequencing (WGS) runs of NA12878, the gold standard for genomic studies. The results show no statistical difference between the two model types in the number of output FASTQ files, read counts, or average read lengths, but the pass/fail ratio of the base calling results is higher with the high accuracy models. The results further show that this difference leads to a significant difference in the numbers of Single Nucleotide Polymorphisms (SNPs) and of deletions and insertions (InDels). Interestingly, despite these differences, the numbers of true positive SNPs and the quality scores of SNPs shared between the two models do not differ statistically between the models. These results show that fast models do not affect the true positive rate, but that there is a loss in the number of variants arising from the pass/fail ratio. From the InDel perspective, the models differ in both count and accuracy, so fast models statistically reduce performance at the InDel calling stage.
To the best of our knowledge, our study is the first to statistically evaluate the different models of Guppy, one of the most widely used base calling tools.


Details

Primary Language: English
Subjects: Structural Biology, Engineering
Journal Section: Research Articles
Authors

Hamza Umut Karakurt 0000-0002-4072-3065

Hasan Ali Pekcan 0000-0002-2095-363X

Ayşe Kahraman 0009-0006-9507-9760

Muntadher Jihad 0000-0001-6697-3401

Bilçağ Akgün 0000-0002-5220-5652

Cuneyt Oksuz 0009-0001-4458-6774

Bahadır Onay 0009-0004-9560-9341

Early Pub Date: December 1, 2023
Publication Date: December 20, 2023
Published in Issue: Year 2023, Volume: 6, Issue: 3

Cite

Karakurt HU, Pekcan HA, Kahraman A, Jihad M, Akgün B, Oksuz C, Onay B (December 1, 2023) Evaluation of Differences of Fast and High Accuracy Base Calling Models of Guppy on Variant Calling Using Low Coverage WGS Data. International Journal of Life Sciences and Biotechnology 6(3), 276–287.


