Multimodal Generative AI Agents for Biomedical Document Classification: Architecture, Ethical Boundaries, and Human-in-the-Loop Governance

Authors

  • Arun Kumar

Keywords

Multimodal Generative AI, Biomedical Document Classification, Large Vision-Language Models, AI Agents, Human-in-the-Loop (HITL), Bounded Autonomy, AI Governance, Ethical Artificial Intelligence

Abstract

The rapid growth of biomedical research publications has made it difficult for researchers, clinicians, and policymakers to review and interpret scientific information efficiently. Traditional manual review and rule-based automation tools can no longer manage the volume, complexity, and multimodal nature of modern biomedical literature, in which key insights are conveyed through both written text and visual elements such as figures and images. To address this challenge, this study proposes and evaluates a multimodal generative AI agent for biomedical document classification and image captioning that combines an instruction-tuned language model with a vision encoder to process abstract text and associated visual content jointly. The agent operates within a controlled framework with human oversight to ensure responsible and ethical use. To evaluate the agent, a mixed-methods approach was used, combining quantitative performance evaluation with qualitative expert review. The model was tested on open-access biomedical papers from arXiv across four subject areas. Results indicate that the multimodal approach outperforms text-only systems in classification accuracy and contextual understanding. However, the findings also show that human supervision remains essential for mitigating risks related to bias and incorrect outputs. The study therefore offers practical and theoretical guidance for developing ethical and reliable AI systems in biomedical research settings.
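The bounded-autonomy design described above can be illustrated with a minimal sketch: per-modality class probabilities (from the language model and the vision encoder) are fused, and a decision is accepted autonomously only when its confidence clears a threshold; otherwise the document is escalated to a human reviewer. All names, the subject-area labels, the fusion weight, and the 0.85 threshold below are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a human-in-the-loop (HITL) gate over late
# multimodal fusion. The four labels and all numbers are illustrative.
from dataclasses import dataclass

LABELS = ["oncology", "neuroscience", "genomics", "public_health"]  # assumed labels

@dataclass
class Decision:
    label: str
    confidence: float
    needs_human_review: bool

def fuse(text_probs, image_probs, alpha=0.6):
    """Late fusion: weighted average of per-modality class probabilities,
    renormalized to sum to 1."""
    fused = [alpha * t + (1 - alpha) * v for t, v in zip(text_probs, image_probs)]
    total = sum(fused)
    return [p / total for p in fused]

def classify(text_probs, image_probs, threshold=0.85):
    """Accept the top fused label autonomously only above the confidence
    threshold; otherwise flag the document for expert review."""
    probs = fuse(text_probs, image_probs)
    confidence = max(probs)
    label = LABELS[probs.index(confidence)]
    return Decision(label, confidence, needs_human_review=confidence < threshold)

# When the two modalities agree, the agent decides autonomously...
d1 = classify([0.90, 0.05, 0.03, 0.02], [0.80, 0.10, 0.05, 0.05])
# ...when they disagree, fused confidence drops and the case is escalated.
d2 = classify([0.60, 0.30, 0.05, 0.05], [0.20, 0.70, 0.05, 0.05])
```

The design choice here is that autonomy is bounded by calibration: the agent never overrides its own uncertainty, which operationalizes the paper's finding that human supervision remains essential for low-confidence or conflicting multimodal evidence.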

Published

2026-04-20