• BioBridge fuses protein language models with general LLMs to enhance biological reasoning across diverse tasks.
• Domain‑Incremental Continual Pre‑training (DICP) injects domain knowledge while preventing catastrophic forgetting.
• A PLM‑Projector‑LLM pipeline aligns protein embeddings into the language model’s semantic space.
• End‑to‑end optimization supports protein property prediction, knowledge QA, and general understanding tasks.
• Performance matches leading PLMs on EC and BindingDB benchmarks, and rivals LLMs on MMLU and RACE.
• BioBridge demonstrates a novel synergy of domain‑specific adaptability and general‑purpose language competency.
Article Summaries:
- BioBridge is a new framework that merges protein‑specific knowledge with general‑purpose language models. It uses Domain‑Incremental Continual Pre‑training (DICP) to infuse protein domain data and broad reasoning corpora into a large language model while preventing catastrophic forgetting. A PLM‑Projector‑LLM pipeline aligns protein sequence embeddings with the language model’s semantic space, enabling end‑to‑end training across diverse tasks such as protein property prediction and knowledge‑based question answering. Benchmarks show BioBridge matches leading protein language models on EC and BindingDB datasets and performs comparably to mainstream LLMs on general tasks like MMLU and RACE, demonstrating a balanced blend of domain adaptability and general language competence.
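
To make the PLM‑Projector‑LLM idea concrete, here is a minimal sketch of the projection step in plain Python. It is not BioBridge's implementation: the summary does not say what form the projector takes, so a single linear layer is assumed here, and the toy dimensions, the function names (`make_projector`, `project`, `build_llm_input`), and the choice to prepend protein vectors to the text tokens are all hypothetical.

```python
import random

def make_projector(d_plm, d_llm, seed=0):
    """Return a randomly initialised weight matrix (d_llm x d_plm) and a zero bias.

    Assumption: the projector is one linear layer. The source does not specify this.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_plm)] for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weights, bias

def project(residue_embeddings, weights, bias):
    """Map each PLM embedding (length d_plm) to a vector of length d_llm."""
    return [
        [sum(w * x for w, x in zip(row, emb)) + b for row, b in zip(weights, bias)]
        for emb in residue_embeddings
    ]

def build_llm_input(protein_tokens, text_tokens):
    """Prepend projected protein vectors to the text token embeddings (assumed layout)."""
    return protein_tokens + text_tokens

# Toy sizes (hypothetical): 4-dim PLM states, 6-dim LLM embeddings.
D_PLM, D_LLM = 4, 6
rng = random.Random(1)
plm_output = [[rng.gauss(0, 1) for _ in range(D_PLM)] for _ in range(3)]       # 3 residues
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(5)]  # 5 text tokens

weights, bias = make_projector(D_PLM, D_LLM)
protein_tokens = project(plm_output, weights, bias)
llm_input = build_llm_input(protein_tokens, text_embeddings)

# Sequence length is 3 protein positions + 5 text positions, all in the LLM's dimension.
print(len(llm_input), len(llm_input[0]))  # prints "8 6"
```

In an end‑to‑end setup the projector weights would be trained together with the downstream task loss rather than left at their random initial values; the sketch only shows how the shapes line up so that protein and text positions can share one input sequence.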