Are We Ready for Multi-Image Reasoning? Launching VHs: The Visual Haystacks Benchmark!

• Humans can sift through thousands of images, spotting subtle patterns, a skill AI still struggles to match.
• Traditional VQA systems answer questions about single images, missing cross‑image reasoning needed for real‑world tasks.
• Multi‑Image Question Answering (MIQA) tackles reasoning across large image collections, from medical scans to satellite mosaics.
• Visual Haystacks introduces the first visual‑centric Needle‑in‑Haystack benchmark for evaluating LMMs on long‑context visual data.
• The benchmark embeds a “needle” answer within a vast haystack of images, testing retrieval and inference capabilities.
• Applications span healthcare, environmental monitoring, urban planning, art analysis, and retail surveillance, demonstrating MIQA’s broad impact.

Article Summaries:

A new benchmark, Visual Haystacks (VHs), has been released to test large multimodal models (LMMs) on multi‑image reasoning. The VHs dataset contains roughly 1,000 binary question‑answer pairs, each tied to a set of between 1 and 10,000 uncorrelated images drawn from COCO. Unlike traditional visual question answering (VQA), which focuses on single images, VHs requires a model to locate a “needle” image within a large “haystack” of distractors and then answer a question about that image's visual content. The benchmark evaluates how well LMMs retrieve and reason over long‑context visual data, a setting where current VQA systems struggle as the image collection grows.
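
The needle-in-a-haystack setup described above can be illustrated with a minimal Python sketch. This is not the benchmark's official code: the names `HaystackItem`, `build_haystack`, and `accuracy`, the string image identifiers, and the `model(images, question) -> bool` interface are all hypothetical, chosen only to show how one needle image might be mixed with distractors and how binary answers could be scored.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class HaystackItem:
    """One evaluation example (hypothetical structure, not the VHs schema)."""
    images: List[str]  # image identifiers, e.g. file names
    question: str      # binary (yes/no) question about the needle image
    answer: bool       # ground-truth answer


def build_haystack(needle: str, distractors: Sequence[str], size: int,
                   rng: random.Random) -> List[str]:
    """Return `size` images: the needle at a random position among
    `size - 1` distractors sampled without replacement."""
    if size < 1:
        raise ValueError("haystack size must be at least 1")
    if size - 1 > len(distractors):
        raise ValueError("not enough distractor images")
    images = rng.sample(list(distractors), size - 1)
    images.insert(rng.randrange(size), needle)
    return images


def accuracy(model: Callable[[List[str], str], bool],
             items: Sequence[HaystackItem]) -> float:
    """Fraction of items where the model's yes/no answer matches the label."""
    if not items:
        raise ValueError("no items to evaluate")
    correct = sum(model(item.images, item.question) == item.answer
                  for item in items)
    return correct / len(items)


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [f"distractor_{i}.jpg" for i in range(100)]  # placeholder names
    items = [
        HaystackItem(build_haystack("needle.jpg", pool, 10, rng),
                     "Is there a dog in the needle image?", label)
        for label in (True, False, True, False)
    ]

    # Stand-in for a real LMM: always answers "yes".
    def always_yes(images: List[str], question: str) -> bool:
        return True

    print(f"accuracy of constant 'yes' baseline: {accuracy(always_yes, items):.2f}")
```

Because the questions are binary, a constant guess on a balanced set scores 50%, which is the baseline any model's accuracy should be read against as the haystack size grows.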

Sources:

http://bair.berkeley.edu/blog/2024/07/20/visual-haystacks/