Comparative Evaluation of Multiplatform AI Performance on Practical Ophthalmology Exam Questions: Insights from the Brazilian Council of Ophthalmology Exam

Déborah Silva Nunes; Joacy Pedro Franco David; José Jesu Sisnando D'Araujo Filho; Kelly Cristina Costa Guedes Nascimento; Igor Jordan Barbosa Coutinho; Rebeca Andrade Ferraz; Maria Isabel Muniz Zemero; Syenne Pimentel Fayal; Ana Caroline Coelho dos Passos; Luis Eduardo de Carvalho Barros; Rodrigo Rodrigues Virgolino; George de Almeida Marques; Vitor Hugo Auzier Lima

doi:10.9734/jammr/2025/v37i85913


Published: 2025-08-19

Page: 159-170

Issue: 2025 - Volume 37 [Issue 8]


Déborah Silva Nunes

State University of Pará/UEPA, Brazil and Bettina Ferro de Souza University Hospital/UFPA, Brazil.

Joacy Pedro Franco David

Federal University of the State of Pará/UFPA, Brazil and Ophthalmology at the Bettina Ferro de Souza Hospital/UFPA, Brazil.

José Jesu Sisnando D'Araujo Filho

Federal University of Pará (UFPA), Brazil, Federal University of the State of Rio de Janeiro, Brazil and Bettina Ferro de Souza Hospital, UFPA, Brazil.

Kelly Cristina Costa Guedes Nascimento

Federal University of the State of Pará/UFPA, Brazil and Bettina Ferro de Souza Hospital/UFPA, Brazil.

Igor Jordan Barbosa Coutinho

Federal University of the State of Pará/UFPA, Brazil.

Rebeca Andrade Ferraz

State University of Pará/UEPA, Brazil and Bettina Ferro de Souza University Hospital/UFPA, Brazil.

Maria Isabel Muniz Zemero

State University of Pará/UEPA, Brazil and Bettina Ferro de Souza University Hospital/UFPA, Brazil.

Syenne Pimentel Fayal

State University of Pará/UEPA, Brazil and Bettina Ferro de Souza University Hospital/UFPA, Brazil.

Ana Caroline Coelho dos Passos

Pará State University Center, Bettina Ferro de Souza University Hospital / UFPA, Brazil.

Luis Eduardo de Carvalho Barros

Federal University of Pará / UFPA, Brazil and Bettina Ferro de Souza University Hospital / UFPA, Brazil.

Rodrigo Rodrigues Virgolino

Federal University of Pará / UFPA, Brazil and Institute of Biological Sciences of the Federal University of Pará / UFPA, Brazil.

George de Almeida Marques

Estácio de Sá University, Brazil and Medical Skills at the AFYA Faculty of Medical Sciences (AFYA), Brazil.

Vitor Hugo Auzier Lima *

Federal University of Pará / UFPA, Brazil and Faculty of Education and Technology of Pará /FAETE, Brazil.

*Author to whom correspondence should be addressed.

Abstract

In recent years, advances in artificial intelligence (AI), especially the emergence of natural language models and deep neural networks, have revolutionised medical practice, offering tools with the potential to assist in both diagnosis and specialised medical training. The main objective of this study was to evaluate the accuracy and agreement of different AI models in solving practical questions from the Brazilian Council of Ophthalmology (CBO) Exam. To this end, the performance of five AI models (ChatGPT, Gemini, DeepSeek, Google AI Studio, and GROK) was analysed on a set of 560 questions distributed across eight thematic blocks of ophthalmology (Cornea, Cataract, Retina, Glaucoma, Neuro-ophthalmology, Optics and Refraction, Strabismus, and Plastic Surgery/Lacrimal Duct/Orbit). The answers were compared with the official answer key by calculating the percentage of correct answers. Cohen's kappa coefficient was used to measure the agreement between each AI's responses and the official answer key, and Fleiss' kappa was used to measure the overall agreement among the different AIs. The most evident finding was that the Gemini model achieved the highest accuracy (77.6%) and the highest overall agreement with the official answer key. Performance also varied significantly between blocks, with higher accuracy in the Retina and Glaucoma blocks and lower accuracy in the Strabismus and Plastic Surgery blocks. The thematic analysis identified the pattern of correct answers by speciality, revealing weaknesses of the models in areas that depend more heavily on visual assessment and clinical subjectivity. Beyond their probable educational applicability, the AI models proved viable as a complementary tool in medical training, especially when used under supervision and with defined pedagogical objectives.
Therefore, it was concluded that, despite their limitations, the most up-to-date models trained on specific clinical data were able to faithfully reproduce diagnostic reasoning in several areas of ophthalmology, demonstrating their potential for integration into specialised education, provided they are used according to technical and ethical criteria. These findings suggest that AI can serve as a supplementary tool in ophthalmic education, with caution warranted in the more subjective specialities.
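
The three measures named in the abstract (percentage of correct answers, Cohen's kappa against the answer key, and Fleiss' kappa across models) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names are hypothetical, the formulas are the standard textbook definitions, and any answers passed in are assumed to be single-letter choices, one per question.

```python
from collections import Counter


def accuracy(answers, key):
    """Fraction of questions where the model's answer matches the key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def cohen_kappa(a, b):
    """Cohen's kappa between two label sequences of equal length.

    Undefined (division by zero) when chance agreement equals 1,
    e.g. when both sequences use a single identical label throughout.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa for several raters.

    `ratings` holds one list per question, containing the label given
    by each rater (here, each AI model); every question needs the same
    number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}
    # Overall share of all ratings that fell into each category.
    share = [sum(c[cat] for c in counts) / (n_items * n_raters)
             for cat in categories]
    # Per-question agreement among rater pairs.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    mean_agreement = sum(per_item) / n_items
    chance_agreement = sum(p * p for p in share)
    return (mean_agreement - chance_agreement) / (1 - chance_agreement)
```

In this sketch, `accuracy` and `cohen_kappa` would be called once per model against the answer key, while `fleiss_kappa` would take, for each question, the list of answers given by all five models. The example deliberately avoids third-party libraries so the arithmetic behind each coefficient stays visible.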

Keywords: Artificial intelligence, ophthalmology, medical education, board exam, computer-assisted diagnosis

How to Cite

Nunes, Déborah Silva, Joacy Pedro Franco David, José Jesu Sisnando D'Araujo Filho, Kelly Cristina Costa Guedes Nascimento, Igor Jordan Barbosa Coutinho, Rebeca Andrade Ferraz, Maria Isabel Muniz Zemero, et al. 2025. “Comparative Evaluation of Multiplatform AI Performance on Practical Ophthalmology Exam Questions: Insights from the Brazilian Council of Ophthalmology Exam”. Journal of Advances in Medicine and Medical Research 37 (8):159-70. https://doi.org/10.9734/jammr/2025/v37i85913.
