Closing the Gap Between Text and Speech Understanding in LLMs

Authors: Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text.

