Research Article


Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis

Year 2024, Volume: 13 Issue: 4, 1216 - 1223, 15.10.2024
https://doi.org/10.28948/ngumuh.1492129

Abstract

Large language models (LLMs) have gained popularity across healthcare and attracted the attention of researchers in various medical specialties. Determining which model performs well under which conditions is essential for obtaining accurate results. This study aims to compare the accuracy of recently developed LLMs on medical imaging systems and to evaluate the agreement between the models in terms of their correct responses. A total of 400 questions were divided into four categories: X-ray, ultrasound, magnetic resonance imaging, and nuclear medicine imaging. The LLMs’ responses were evaluated with a zero-shot prompting approach by measuring the percentage of correct answers. The McNemar test was used to assess the significance of differences between models, and Cohen’s kappa statistic was used to determine the reliability of the models. Gemini Advanced, GPT-4, Copilot, and GPT-3.5 achieved accuracy rates of 86.25%, 84.25%, 77.5%, and 59.75%, respectively. Compared with the other model pairs, Gemini Advanced and GPT-4 showed the strongest agreement, κ = 0.762. This study is the first to analyze the accuracy of the responses of the recently developed Gemini Advanced, GPT-4, Copilot, and GPT-3.5 to questions related to medical imaging systems. In addition, a comprehensive dataset of three question types on medical imaging systems, evenly drawn from various sources, was created.
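The two statistics named in the abstract can be sketched in plain Python. This is a minimal illustration, not the study's actual analysis: the per-question correctness vectors below are toy values, and the function names are chosen for this example only. Each model's answers are encoded as 1 (correct) or 0 (incorrect) per question.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two binary correctness vectors."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    pa1 = sum(a) / n                                  # P(model A correct)
    pb1 = sum(b) / n                                  # P(model B correct)
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)            # chance agreement
    return (po - pe) / (1 - pe)

def mcnemar_chi2(a, b):
    """McNemar chi-square (with continuity correction) on discordant pairs."""
    b01 = sum(x == 0 and y == 1 for x, y in zip(a, b))  # A wrong, B right
    b10 = sum(x == 1 and y == 0 for x, y in zip(a, b))  # A right, B wrong
    if b01 + b10 == 0:
        return 0.0
    return (abs(b01 - b10) - 1) ** 2 / (b01 + b10)

# Toy example with 8 questions (the study used 400 per model).
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 1, 0, 1, 0, 0, 1, 1]
print(round(cohen_kappa(model_a, model_b), 3))   # 0.714
print(round(mcnemar_chi2(model_a, model_b), 3))  # 0.0
```

A kappa near the study's reported 0.762 would, by the usual McHugh benchmarks cited in the references, indicate substantial agreement between two models' correct-answer patterns.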

References

  • S. R. Bowman, Eight things to know about large language models, arXiv preprint arXiv:2304.00612, 2023. https://doi.org/10.48550/arXiv.2304.00612
  • ChatGPT. https://chat.openai.com/, Accessed 27 Feb. 2024.
  • GPT-4. https://openai.com/research/gpt-4, Accessed 27 Feb. 2024.
  • Bing Chat: how to use Microsoft’s own version of ChatGPT, Digital Trends. https://www.digitaltrends.com/computing/how-to-use-microsoft-chatgpt-bing-edge/, Accessed 27 Feb. 2024.
  • Gemini - Google DeepMind. https://deepmind.google/technologies/gemini/#gemini-1.0, Accessed 28 Feb. 2024.
  • A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, Large language models in medicine, Nature Medicine, vol. 29, no. 8, pp. 1930–1940, 2023. https://doi.org/10.1038/s41591-023-02448-8
  • A. Rao, J. Kim, M. Kamineni, M. Pang, W. Lie, K. J. Dreyer, M. D. Succi, Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot, Journal of the American College of Radiology, vol. 20, no. 10, pp. 990–997, 2023. https://doi.org/10.1016/j.jacr.2023.05.003
  • H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, Capabilities of GPT-4 on medical challenge problems, arXiv preprint arXiv:2303.13375, 2023. https://doi.org/10.48550/arXiv.2303.13375
  • A. Gilson, C. W. Safranek, T. Huang, V. Socrates, L. Chi, R. A. Taylor, D. Chartash, How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment, JMIR Medical Education, vol. 9, no. 1, p. e45312, 2023. https://doi.org/10.2196/45312
  • T. H. Kung, M. Cheatham, A. Medenilla, C. Sillos, L. Leon, C. Elepaño, M. Madriaga, R. Aggabao, G. Diaz-Candido, J. Maningo, V. Tseng, Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models, PLOS Digital Health, vol. 2, no. 2, p. e0000198, 2023. https://doi.org/10.1371/journal.pdig.0000198
  • R. K. Sinha, A. D. Roy, N. Kumar, H. Mondal, and R. Sinha, Applicability of ChatGPT in assisting to solve higher order problems in pathology, Cureus, vol. 15, no. 2, 2023. https://doi.org/10.7759/cureus.35237
  • S. Huh, Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: A descriptive study, Journal of Educational Evaluation for Health Professions, vol. 20, 2023. https://doi.org/10.3352/jeehp.2023.20.1
  • X. Wang, Z. Gong, G. Wang, J. Jia, Y. Xu, J. Zhao, Q. Fan, S. Wu, W. Hu, X. Li, ChatGPT performs on the Chinese national medical licensing examination, Journal of Medical Systems, vol. 47, no. 1, p. 86, 2023. https://doi.org/10.1007/s10916-023-01961-0
  • M. F. Şahin, H. Ateş, A. Keleş, Ç. Doğan, M. Akgül, C. M. Yazıcı, R. Özcan, Responses of five different artificial intelligence chatbots to the top searched queries about erectile dysfunction: A comparative analysis, Journal of Medical Systems, vol. 48, no. 1, p. 38, 2024. https://doi.org/10.1007/s10916-024-02056-0
  • D. Brin, V. Sorin, Y. Barash, E. Konen, B. S. Glicksberg, G. N. Nadkarni, E. Klang, Assessing GPT-4 multimodal performance in radiological image analysis, medRxiv, pp. 2023–11, 2023. https://doi.org/10.1007/s00330-024-11035-5
  • J. L. Prince and J. M. Links, Medical imaging signals and systems, vol. 37. Pearson Prentice Hall Upper Saddle River, 2006.
  • E. Seeram, Medical Imaging Informatics, Digital Radiography: Review Questions, pp. 85–95, 2021.
  • K. H. Ng, J. H. D. Wong, and G. Clarke, Problems and solutions in medical physics: Diagnostic Imaging Physics. CRC Press, 2018.
  • W. R. Hendee and E. R. Ritenour, Medical imaging physics. John Wiley & Sons, 2003.
  • G. Sawhney, Fundamentals of Biomedical Engineering. New Age International, 2007.
  • A. P. Dhawan, Medical image analysis. John Wiley & Sons, 2011.
  • B. H. Brown, R. H. Smallwood, D. C. Barber, P. Lawford, and D. Hose, Medical physics and biomedical engineering. CRC Press, 2017.
  • J. A. Miller, Review Questions for Ultrasound: A Sonographer’s Exam Guide. Routledge, 2018.
  • C. K. Roth and W. H. Faulkner Jr, Review questions for MRI, 2013.
  • S. C. Bushong and G. Clarke, Magnetic resonance imaging: physical and biological principles. Elsevier Health Sciences, 2003.
  • H. Azhari, J. A. Kennedy, N. Weiss, and L. Volokh, From Signals to Image. Springer, 2020.
  • W. A. Worthoff, H. G. Krojanski, and D. Suter, Medical physics: exercises and examples. Walter de Gruyter, 2013.
  • M. Chappell, Principles of Medical Imaging for Engineers. Springer, 2019.
  • E. Mantel, J. S. Reddin, G. Cheng, and A. Alavi, Nuclear Medicine Technology: Review Questions for the Board Examinations. Cham: Springer International Publishing, 2023. https://link.springer.com/10.1007/978-3-031-26720-8, Accessed 20 Mar. 2024.
  • K. H. Ng, C. H. Yeong, and A. C. Perkins, Problems and Solutions in Medical Physics: Nuclear Medicine Physics, 1st ed. CRC Press, 2019. https://www.taylorfrancis.com/books/9780429629129, Accessed 20 Mar. 2024.
  • D. D. Feng, Biomedical information technology. Academic Press, 2011.
  • IBM SPSS Statistics for Windows. IBM Corp., Armonk, NY, Released 2015.
  • M. L. McHugh, Interrater reliability: the kappa statistic, Biochemia medica, vol. 22, no. 3, pp. 276–282, 2012.
There are 33 citations in total.

Details

Primary Language English
Subjects Natural Language Processing, Planning and Decision Making, Biomedical Sciences and Technology
Journal Section Research Articles
Authors

Alpaslan Koç 0000-0002-2000-7379

Ayşe Betül Öztiryaki 0009-0004-9973-3251

Early Pub Date September 11, 2024
Publication Date October 15, 2024
Submission Date May 29, 2024
Acceptance Date July 30, 2024
Published in Issue Year 2024 Volume: 13 Issue: 4

Cite

APA Koç, A., & Öztiryaki, A. B. (2024). Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, 13(4), 1216-1223. https://doi.org/10.28948/ngumuh.1492129
AMA Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NOHU J. Eng. Sci. October 2024;13(4):1216-1223. doi:10.28948/ngumuh.1492129
Chicago Koç, Alpaslan, and Ayşe Betül Öztiryaki. “Comparison of the Accuracy Performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 Models in Medical Imaging Systems: A Zero-Shot Prompting Analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13, no. 4 (October 2024): 1216-23. https://doi.org/10.28948/ngumuh.1492129.
EndNote Koç A, Öztiryaki AB (October 1, 2024) Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13 4 1216–1223.
IEEE A. Koç and A. B. Öztiryaki, “Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis”, NOHU J. Eng. Sci., vol. 13, no. 4, pp. 1216–1223, 2024, doi: 10.28948/ngumuh.1492129.
ISNAD Koç, Alpaslan - Öztiryaki, Ayşe Betül. “Comparison of the Accuracy Performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 Models in Medical Imaging Systems: A Zero-Shot Prompting Analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi 13/4 (October 2024), 1216-1223. https://doi.org/10.28948/ngumuh.1492129.
JAMA Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NOHU J. Eng. Sci. 2024;13:1216–1223.
MLA Koç, Alpaslan and Ayşe Betül Öztiryaki. “Comparison of the Accuracy Performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 Models in Medical Imaging Systems: A Zero-Shot Prompting Analysis”. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi, vol. 13, no. 4, 2024, pp. 1216-23, doi:10.28948/ngumuh.1492129.
Vancouver Koç A, Öztiryaki AB. Comparison of the accuracy performances of the Gemini Advanced, the GPT-4, the Copilot, and the GPT-3.5 models in medical imaging systems: A Zero-shot prompting analysis. NOHU J. Eng. Sci. 2024;13(4):1216-23.
