Research Article Open Access

MedFusion: A Unified Multimodal Framework for Visual Question Answering and Explainable Medical Recommendation

Satyajit Mahapatra1, Jibitesh Mishra1, Kumar Janardan Patra1, Sanjit Kumar Dash1 and Aliazar Deneke Deferisha2
  • 1 Schools of Computer Science, Odisha University of Technology and Research, Bhubaneswar, Odisha, India
  • 2 Faculty of Computing and Software Engineering, Arba Minch University, Arba Minch, Ethiopia

Abstract

In clinical decision-making, the ability to ask visual questions about medical images and receive accurate, personalized, and interpretable recommendations can significantly enhance practitioner support systems. This paper presents MedFusion, a unified multimodal framework that integrates Visual Question Answering (VQA), personalized medical recommendation, and explainability within a single architecture. The proposed model employs co-attention–based visual–textual fusion augmented with retrieval-enhanced reasoning to improve answer grounding, while personalized recommendations are generated using a shared multimodal representation supported by GAN-guided feature augmentation. To enhance transparency, the framework provides attention-based heatmaps and natural-language rationales for both answers and recommendations. Extensive experiments on VQA-RAD, EHRXQA, and Med-RecX demonstrate that MedFusion outperforms state-of-the-art medical VQA and recommendation baselines, achieving a 7.4% improvement in VQA accuracy, reducing RMSE to 0.91, and improving human-rated interpretability to 4.5/5. Ablation studies confirm the effectiveness of retrieval augmentation, GAN-guided enhancement, and joint multi-task learning. These results indicate that MedFusion offers a robust and explainable decision-support solution, advancing the deployment of trustworthy, user-adaptive AI systems in real-world healthcare environments.

Journal of Computer Science
Volume 22 No. 5, 2026, 1539-1551

DOI: https://doi.org/10.3844/jcssp.2026.1539.1551

Submitted On: 2 August 2025
Published On: 20 May 2026

How to Cite: Mahapatra, S., Mishra, J., Patra, K. J., Dash, S. K. & Deferisha, A. D. (2026). MedFusion: A Unified Multimodal Framework for Visual Question Answering and Explainable Medical Recommendation. Journal of Computer Science, 22(5), 1539-1551. https://doi.org/10.3844/jcssp.2026.1539.1551

Keywords

  • Multimodal Learning
  • VQA
  • Medical Recommendation
  • XAI
  • Co-Attention
  • Retrieval-Augmented Reasoning
  • CGAN
  • Healthcare Informatics