Despite their outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) may generate hallucinated content that does not exist in the given images. Most existing LVLM hallucination benchmarks are constrained to object-related hallucinations. However, the potential hallucination of relations between two objects, i.e., relation hallucination, remains under-investigated. To remedy this, we design a unified framework that measures object and relation hallucination in LVLMs simultaneously. The core idea is to evaluate hallucination on the (object, relation, object) triplets extracted from LVLMs' responses. Building on this framework, we introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark that can be used to study both object and relation hallucination at the same time. Our comprehensive evaluations on Tri-HE show that relation hallucination is an even more serious issue than object hallucination among existing LVLMs, highlighting a previously neglected obstacle to reliable LVLMs. Moreover, based on our findings, we design a simple yet effective training-free approach to mitigate hallucinations in LVLMs, with which we outperform all open-source counterparts on Tri-HE and achieve performance comparable to the powerful GPT-4V.
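To make the triplet-level idea concrete, below is a minimal Python sketch (not the released Tri-HE code; the function name and the exact string-matching judge are illustrative assumptions, whereas the benchmark itself judges extracted triplets against the image more flexibly). It scores (object, relation, object) triplets extracted from an LVLM response against an image's ground-truth scene-graph triplets and reports separate object and relation hallucination rates.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (object, relation, object)

def hallucination_rates(
    response_triplets: List[Triplet],
    gt_triplets: List[Triplet],
) -> Tuple[float, float]:
    """Return (object_hallucination_rate, relation_hallucination_rate)."""
    # Entities and triplets supported by the image's scene graph.
    gt_objects = {o for s, _, t in gt_triplets for o in (s, t)}
    gt_set = set(gt_triplets)

    obj_halluc = rel_halluc = 0
    for subj, rel, obj in response_triplets:
        if subj not in gt_objects or obj not in gt_objects:
            # Object hallucination: a mentioned entity is absent from the image.
            obj_halluc += 1
        elif (subj, rel, obj) not in gt_set:
            # Relation hallucination: both objects exist, but the stated
            # relation between them is not supported by the scene graph.
            rel_halluc += 1

    n = max(len(response_triplets), 1)
    return obj_halluc / n, rel_halluc / n


if __name__ == "__main__":
    gt = [("man", "riding", "horse"), ("horse", "on", "grass")]
    pred = [
        ("man", "riding", "horse"),          # faithful triplet
        ("man", "holding", "umbrella"),      # object hallucination
        ("horse", "jumping over", "grass"),  # relation hallucination
    ]
    print(hallucination_rates(pred, gt))  # -> (0.333..., 0.333...)
```

Separating the two error types at the triplet level is what lets a single extraction pass measure object and relation hallucination jointly, rather than requiring two different benchmarks.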
@article{wu2024unified,
title={Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models},
author={Wu, Junjie and Chung, Tsz Ting and Chen, Kai and Yeung, Dit-Yan},
journal={arXiv preprint arXiv:2410.23114},
year={2024}
}