Research Article

GRAVITI: Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation

by Ahmad Khalil, Mahmoud Khalil, Alioune Ngom
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 61
Published: December 2025
DOI: 10.5120/ijca2025926005

Ahmad Khalil, Mahmoud Khalil, and Alioune Ngom. GRAVITI: Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation. International Journal of Computer Applications 187, 61 (December 2025), 17-24. DOI=10.5120/ijca2025926005

@article{10.5120/ijca2025926005,
  author    = {Ahmad Khalil and Mahmoud Khalil and Alioune Ngom},
  title     = {GRAVITI: Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {61},
  pages     = {17-24},
  doi       = {10.5120/ijca2025926005},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Ahmad Khalil
%A Mahmoud Khalil
%A Alioune Ngom
%T GRAVITI: Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation
%J International Journal of Computer Applications
%V 187
%N 61
%P 17-24
%R 10.5120/ijca2025926005
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Video-language models (VideoLLMs) excel at tasks such as video captioning and question answering, but they often produce hallucinations, i.e., content not grounded in the video or its metadata, which limits their reliability. To address this, GRAVITI (Grounded Retrieval GenerAtion framework for VideoLLM hallucInation miTIgation) is proposed: a model-agnostic, training-free, and API-free framework that integrates a dynamically constructed ad-hoc knowledge base with a retrieval-guided decoding process. This process, referred to as Grounded Retrieval Generation (GRG), conditions each generated token on evidence retrieved from video features and auxiliary metadata. GRAVITI reduces hallucinations while remaining compatible with diverse VideoLLMs. Evaluated on three benchmarks (VidHalluc, EventHallusion, and VideoHallucer), GRAVITI improves overall accuracy by 6–14% and substantially lowers hallucination rates compared to strong baselines. Ablation studies quantify the impact of retrieval size, detector thresholds, and grounding mechanisms, highlighting the effectiveness of GRG in producing reliable, multi-modal video descriptions.
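The abstract's description of GRG suggests a per-token decoding loop in which the next-token distribution is re-weighted by evidence retrieved from an ad-hoc knowledge base built from video features and metadata. The Python sketch below illustrates one plausible reading of that mechanism under stated assumptions; every name in it (AdHocKB, grg_decode_step, the grounding weight alpha, top_k) is an illustrative invention, not the paper's actual interface.

import numpy as np

class AdHocKB:
    """Hypothetical ad-hoc knowledge base over video-feature / metadata embeddings."""
    def __init__(self, evidence_embeddings: np.ndarray):
        # evidence_embeddings: (num_items, dim); rows are L2-normalized for cosine search
        norms = np.linalg.norm(evidence_embeddings, axis=1, keepdims=True)
        self.evidence = evidence_embeddings / norms

    def retrieve(self, query: np.ndarray, top_k: int = 5) -> np.ndarray:
        """Return the top_k evidence vectors most similar to the decoder state."""
        q = query / np.linalg.norm(query)
        scores = self.evidence @ q                      # cosine similarity per item
        idx = np.argsort(-scores)[:top_k]
        return self.evidence[idx]

def grg_decode_step(logits, hidden_state, token_embeddings, kb, alpha=1.0, top_k=5):
    """One evidence-conditioned (greedy) decoding step, per this reading of GRG.

    logits: (vocab,) raw next-token logits from the VideoLLM
    hidden_state: (dim,) decoder state, reused here as the retrieval query
    token_embeddings: (vocab, dim) one embedding per vocabulary token
    alpha: assumed grounding weight trading fluency against evidence support
    """
    evidence = kb.retrieve(hidden_state, top_k=top_k)   # (top_k, dim)
    sims = token_embeddings @ evidence.T                # (vocab, top_k)
    grounding_bias = sims.max(axis=1)                   # best evidence match per token
    return int(np.argmax(logits + alpha * grounding_bias))

# Toy usage with random tensors standing in for real video features and logits.
rng = np.random.default_rng(0)
kb = AdHocKB(rng.normal(size=(32, 64)))                 # 32 evidence items, dim 64
next_token = grg_decode_step(
    logits=rng.normal(size=1000),                       # vocabulary of 1,000 tokens
    hidden_state=rng.normal(size=64),
    token_embeddings=rng.normal(size=(1000, 64)),
    kb=kb,
)

Under this sketch, the ablations over retrieval size and detector thresholds mentioned in the abstract would correspond to sweeping top_k and alpha; the actual GRAVITI pipeline may score and apply evidence differently.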

References
  • Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs, 2024.
  • Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394, 2023.
  • Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025.
  • Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Wenliang Dai, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, November 2022.
  • Chaoyu Li, Eun Woo Im, and Pooyan Fazli. VidHalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding, 2025.
  • Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, Miami, Florida, USA, November 2024. Association for Computational Linguistics.
  • Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
  • Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.
  • Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, et al. VIGC: Visual instruction generation and correction. arXiv preprint arXiv:2308.12714, 2023.
  • Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. InternVideo2.5: Empowering video MLLMs with long and rich context modeling, 2025.
  • Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. VideoHallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models, 2024.
  • Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, and Afshin Dehghan. SlowFast-LLaVA-1.5: A family of token-efficient video large language models for long-form video understanding, 2025.
  • Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, and Siliang Tang. The best of both worlds: Integrating language models and diffusion models for video generation. arXiv preprint arXiv:2503.04606, 2025.
  • Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin, et al. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding. arXiv preprint arXiv:2501.07888, 2025.
  • Jiacheng Zhang, Yang Jiao, Shaoxiang Chen, Na Zhao, Zhiyu Tan, Hao Li, and Jingjing Chen. EventHallusion: Diagnosing event hallucinations in video LLMs, 2025.
Index Terms
Computer Science
Information Sciences
Keywords

VideoLLMs, Hallucination, GRAVITI, Grounded Retrieval Generation
