Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3's Top Model

• Nemotron ColEmbed V2 introduces late‑interaction embeddings for unified text‑image retrieval.
• Three model sizes (3B, 4B, and 8B) deliver state‑of‑the‑art accuracy on ViDoRe V1‑V3.
• Late interaction keeps fine‑grained, token‑level representations, boosting semantic matching across modalities.
• Nemotron ColEmbed V2 tops the ViDoRe V3 leaderboard: the 8B model ranks 1st, the 4B 3rd, and the 3B 6th.
• The models map text, tables, charts, and figures into a shared representation space.
• The earlier Llama‑Nemotron‑Embed‑VL‑1B focused on efficiency; ColEmbed V2 prioritizes accuracy for enterprise search.
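
To make the late‑interaction point above concrete, here is a minimal sketch of ColBERT‑style MaxSim scoring, the usual way late‑interaction retrievers compare a query with a document. The summary does not spell out the exact scoring function used by Nemotron ColEmbed V2, so treat this as an assumption. The helper names (`normalize`, `maxsim_score`, `rank`) and the toy 2‑D embeddings are hypothetical and are not part of any NVIDIA API.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best cosine similarity with any document token."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    if not q or not d:
        return 0.0
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank(query_tokens: Sequence[Vector], docs: Sequence[Sequence[Vector]]) -> List[int]:
    """Return document indices sorted by descending MaxSim score."""
    scores = [maxsim_score(query_tokens, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy per-token embeddings, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [3.0, 0.0]]  # matches only the first query token

print(rank(query, [doc_b, doc_a]))
```

Because every query token is matched against every document token (or image‑patch embedding) before the scores are summed, a document that covers all parts of the query outranks one that matches only a single token strongly. A single pooled vector per document cannot make that distinction as directly.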

Article Summaries:

Modern search systems are increasingly designed to process heterogeneous document images that may contain text, tables, charts, figures, and other visual components. In this context, accurately retrieving relevant information across these diverse modalities is a central challenge. Multimodal embedding models built on top of foundational vision-language models (VLMs) map diverse content types into a shared representation space, enabling unified retrieval over text, images, and structured visual elements.

Sources:

https://huggingface.co/blog/nvidia/nemotron-colembed-v2