International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 61
Published: December 2025
Authors: Ahmad Khalil, Mahmoud Khalil, Alioune Ngom
DOI: 10.5120/ijca2025926005
Ahmad Khalil, Mahmoud Khalil, Alioune Ngom. GRAVITI: Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation. International Journal of Computer Applications 187, 61 (December 2025), 17-24. DOI=10.5120/ijca2025926005
@article{ 10.5120/ijca2025926005,
author = { Ahmad Khalil and Mahmoud Khalil and Alioune Ngom },
title = { GRAVITI: Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation },
journal = { International Journal of Computer Applications },
year = { 2025 },
volume = { 187 },
number = { 61 },
pages = { 17-24 },
doi = { 10.5120/ijca2025926005 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2025
%A Ahmad Khalil
%A Mahmoud Khalil
%A Alioune Ngom
%T GRAVITI: Grounded Retrieval Generation Framework for VideoLLM Hallucination Mitigation
%J International Journal of Computer Applications
%V 187
%N 61
%P 17-24
%R 10.5120/ijca2025926005
%I Foundation of Computer Science (FCS), NY, USA
Video-language models (VideoLLMs) excel at tasks such as video captioning and question answering, but they often produce hallucinations (content not grounded in the video or its metadata), which limits their reliability. To address this, GRAVITI (Grounded Retrieval GenerAtion framework for VideoLLM hallucInation miTIgation) is proposed: a model-agnostic, training-free, and API-free framework that integrates a dynamically constructed ad-hoc knowledge base with a retrieval-guided decoding process. This process, referred to as Grounded Retrieval Generation (GRG), conditions each generated token on evidence retrieved from video features and auxiliary metadata. GRAVITI reduces hallucinations while remaining compatible with diverse VideoLLMs. Evaluated on three benchmarks (VidHalluc, EventHallusion, and VideoHallucer), GRAVITI improves overall accuracy by 6-14% and substantially lowers hallucination rates compared to strong baselines. Ablation studies quantify the impact of retrieval size, detector thresholds, and grounding mechanisms, highlighting the effectiveness of GRG in producing reliable, multi-modal video descriptions.
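The abstract only outlines how Grounded Retrieval Generation (GRG) conditions each token on retrieved evidence; the minimal Python sketch below illustrates the general idea of retrieval-guided decoding under that description. It is not the authors' implementation: the knowledge-base entries, the toy hash-seeded embedding, the literal-match grounding score, and all function and parameter names (build_knowledge_base, retrieve_evidence, grounded_decode_step, alpha, top_k) are assumptions introduced purely for illustration.

# Illustrative sketch of retrieval-guided ("grounded") decoding.
# All names, scoring rules, and the toy embedding below are assumptions
# made for illustration; they are not GRAVITI's actual implementation.
import numpy as np

def build_knowledge_base(entries, embed):
    """Embed video-derived captions and auxiliary metadata into an ad-hoc vector store."""
    return np.stack([embed(e) for e in entries])

def retrieve_evidence(query_vec, kb_vectors, top_k=3):
    """Return indices of the top-k knowledge-base entries by cosine similarity."""
    sims = kb_vectors @ query_vec / (
        np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return np.argsort(-sims)[:top_k]

def grounded_decode_step(logits, vocab, kb_entries, kb_vectors, context_vec,
                         alpha=1.0, top_k=3):
    """Re-weight next-token logits toward tokens supported by retrieved evidence."""
    idx = retrieve_evidence(context_vec, kb_vectors, top_k)
    evidence = " ".join(kb_entries[i] for i in idx).lower()
    # Toy grounding score: boost tokens that literally appear in the retrieved evidence.
    support = np.array([1.0 if tok.lower() in evidence else 0.0 for tok in vocab])
    return logits + alpha * support

def toy_embed(text, dim=32):
    """Deterministic hash-seeded embedding, only to keep the sketch self-contained."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

# Ad-hoc knowledge base built from stand-in frame captions and metadata.
kb_entries = ["a dog runs in the park", "outdoor scene, daytime"]
kb_vectors = build_knowledge_base(kb_entries, toy_embed)

vocab = ["a", "dog", "cat", "runs", "sleeps", "park", "kitchen"]
logits = np.zeros(len(vocab))                   # stand-in for the VideoLLM's next-token logits
context_vec = toy_embed("the video shows a dog that")

grounded = grounded_decode_step(logits, vocab, kb_entries, kb_vectors, context_vec)
supported = [tok for tok, before, after in zip(vocab, logits, grounded) if after > before]
print("evidence-supported tokens:", supported)  # e.g. ['a', 'dog', 'runs', 'park']

In a real pipeline, the toy embedding and literal-match boost would presumably be replaced by the model's own representations and a proper similarity-based grounding score; the sketch is only meant to convey how per-token generation can be biased toward retrieved video and metadata evidence.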