BRIDGE (Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text)

HMS MGB Broad YLab

📜 Background

Recent advances in Large Language Models (LLMs) have demonstrated transformative potential in healthcare, yet concerns remain about their reliability and clinical validity across diverse clinical tasks, specialties, and languages. To support timely and trustworthy evaluation, and building upon our systematic review of global clinical text resources, we introduce BRIDGE, a multilingual benchmark comprising 87 real-world clinical text tasks spanning nine languages and more than one million samples. We further construct this leaderboard of LLM performance in clinical text understanding by systematically evaluating 52 state-of-the-art LLMs (as of 2025/04/28).

This project is led and maintained by the team of Prof. Jie Yang and Prof. Kueiyu Joshua Lin at Harvard Medical School and Brigham and Women's Hospital.

πŸ† BRIDGE Leaderboard

BRIDGE features three leaderboards, each evaluating LLM performance in clinical text tasks under a distinct inference strategy:

  • Zero-shot: Only the task instruction and input data are provided; the LLM is prompted to produce the target answer directly, without any additional support.
  • Chain-of-Thought (CoT): Task instructions explicitly guide the LLM to generate a step-by-step explanation of its reasoning process before providing the final answer, enhancing interpretability and reasoning transparency.
  • Few-shot: Five independent completed samples are provided as examples, leveraging the LLM's in-context learning capability to guide it through the task.
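As a rough illustration of how these strategies differ in practice, here is a minimal prompt-construction sketch in Python. It is not the official BRIDGE inference code, and the "instruction", "input", and "output" field names are illustrative assumptions about how a task sample might be structured; see the BRIDGE GitHub repo for the actual implementation.

    # Minimal sketch (not the official BRIDGE code) of how the three inference
    # strategies differ in prompt construction. The "instruction", "input", and
    # "output" field names are illustrative assumptions.
    def build_prompt(sample, strategy="zero-shot", few_shot_examples=None):
        parts = [sample["instruction"]]  # task instruction, shared by all strategies
        if strategy == "few-shot" and few_shot_examples:
            # Prepend the five completed samples as in-context demonstrations.
            for ex in few_shot_examples:
                parts.append(f"Input: {ex['input']}\nAnswer: {ex['output']}")
        parts.append(f"Input: {sample['input']}")
        if strategy == "cot":
            # Ask the model to lay out its reasoning before the final answer.
            parts.append("Explain your reasoning step by step, then give the final answer.")
        else:
            parts.append("Answer:")
        return "\n\n".join(parts)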

In addition, BRIDGE offers multiple model filters and task filters to enable users to explore LLM performance across different clinical contexts, empowering researchers and clinicians to make informed decisions and track model advancements over time.


🌍 Key Features

  • Real-world Clinical Text: All tasks are sourced from real-world medical settings, such as electronic health records (EHRs), clinical case reports, or healthcare consultations
  • Multilingual Context: 9 languages: English, Chinese, Spanish, Japanese, German, Russian, French, Norwegian, and Portuguese
  • Diverse Task Types: 8 task types: Text classification, Semantic similarity, Normalization and coding, Named entity recognition, Natural language inference, Event extraction, Question answering, and Text summarization
  • Broad Clinical Applications: 14 clinical specialties, 7 clinical document types, and 20 clinical applications covering 6 clinical stages of patient care
  • Advanced LLMs (52 models):
    • Proprietary models: GPT-4o, GPT-3.5, Gemini-2.0-Flash, Gemini-1.5-Pro ...
    • Open-source models: Llama 3/4, Qwen2.5, Mistral, Gemma ...
    • Medical models: Baichuan-M1-14B, meditron, MeLLaMA...
    • Reasoning models: DeepSeek-R1 (671B), QwQ-32B, DeepSeek-R1-Distill-Qwen/Llama ...
More details can be found in our BRIDGE paper and systematic review.

🛠️ How to Evaluate Your Model on BRIDGE?

📂 Dataset Access

All fully open-access datasets in BRIDGE are available in BRIDGE-Open. To ensure the fairness of this leaderboard, we publicly release the following data for each task: five completed samples that serve as few-shot examples, and all test samples with their instructions and input information.
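For a quick look at what a released task file contains, here is a minimal sketch, assuming each task ships as a JSON list of samples with "instruction" and "input" fields; the file name and field names are illustrative, so please check the actual layout in BRIDGE-Open.

    import json

    # Minimal sketch: inspect one BRIDGE-Open task file.
    # The file name and the "instruction"/"input" field names are assumptions;
    # consult the released BRIDGE-Open files for the actual layout.
    with open("example_task.json", encoding="utf-8") as f:
        samples = json.load(f)

    print(f"{len(samples)} samples loaded")
    print("Instruction:", samples[0]["instruction"][:200])
    print("Input:", samples[0]["input"][:200])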

Due to privacy and security considerations for clinical data, regulated-access datasets cannot be released directly. However, detailed descriptions of all tasks and their corresponding data sources are available in our BRIDGE paper. Importantly, all 87 datasets have been verified in our systematic review to be either fully open-access or publicly accessible upon reasonable request.

🔥 Result Submission and Model Evaluation

If you would like to see how a model that has not yet been evaluated performs on BRIDGE, please choose one of the following options:

  • If you want to run inference locally: Download the BRIDGE-Open dataset and run inference locally. Save the generated output for each sample in its "pred" field in each dataset file (see the sketch after this list), then send your results to us via the Google Form.
  • If you want us to run inference: Send us a link to the model via the Google Form.
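For the local-inference option, the sketch below shows one way to fill the "pred" field and save the file for submission. Only the "pred" field name comes from the instructions above; the file name, the other field names, and generate_answer (a stand-in for your own model call) are assumptions.

    import json

    def generate_answer(prompt: str) -> str:
        """Stand-in for your own model's inference call."""
        raise NotImplementedError

    # Minimal sketch: run local inference and store each model output in "pred".
    # Only the "pred" field name is specified above; the file name and the
    # "instruction"/"input" fields are illustrative assumptions.
    task_path = "example_task.json"
    with open(task_path, encoding="utf-8") as f:
        samples = json.load(f)

    for sample in samples:
        prompt = f"{sample['instruction']}\n\n{sample['input']}"
        sample["pred"] = generate_answer(prompt)

    with open(task_path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)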

We will review and evaluate your submission and update the leaderboard accordingly.

Code Reference: For LLM inference, result extraction, and the evaluation scheme, please refer to our BRIDGE GitHub repo.

🚨 Important: Due to computational resource constraints, our team may not be able to evaluate every submitted model, and there may be a delay between submission and the appearance of results on the leaderboard.

📢 Updates

  • 🗓️ 2025/04/28: BRIDGE Leaderboard V1.0.0 is now live!
  • 🗓️ 2025/04/28: Our paper BRIDGE is now available on arXiv!

🀝 Contributing

We welcome and greatly value contributions and collaborations from the community! If you have clinical text datasets that you would like to add to the BRIDGE benchmark, please fill in the Google Form and let us know!

We are committed to expanding BRIDGE while strictly adhering to appropriate data use agreements and ethical guidelines. Let's work together to advance the responsible application of LLMs in medicine!

🚀 Donation

BRIDGE is a non-profit, researcher-led benchmark that requires substantial resources (e.g., high-performance GPUs, a dedicated team) to sustain. To support open and impactful academic research that advances clinical care, we welcome your contributions. Please contact Prof. Jie Yang at jyang66@bwh.harvard.edu to discuss donation opportunities.

📬 Contact Information

If you have any questions about BRIDGE or the leaderboard, feel free to contact us!

📚 Citation

If you find this leaderboard useful for your research and applications, please cite the following papers:

@article{BRIDGE-benchmark,
    title={BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text},
    author={Wu, Jiageng and Gu, Bowen and Zhou, Ren and Xie, Kevin and Snyder, Doug and Jiang, Yixing and Carducci, Valentina and Wyss, Richard and Desai, Rishi J and Alsentzer, Emily and Celi, Leo Anthony and Rodman, Adam and Schneeweiss, Sebastian and Chen, Jonathan H. and Romero-Brufau, Santiago and Lin, Kueiyu Joshua and Yang, Jie},
    year={2025},
    journal={arXiv preprint arXiv:2504.19467},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2504.19467},
}
@article{clinical-text-review,
    title={Clinical text datasets for medical artificial intelligence and large language models—a systematic review},
    author={Wu, Jiageng and Liu, Xiaocong and Li, Minghui and Li, Wanxin and Su, Zichang and Lin, Shixu and Garay, Lucas and Zhang, Zhiyun and Zhang, Yujie and Zeng, Qingcheng and Shen, Jie and Yuan, Changzheng and Yang, Jie},
    journal={NEJM AI},
    volume={1},
    number={6},
    pages={AIra2400012},
    year={2024},
    publisher={Massachusetts Medical Society}
}

If you use the datasets in BRIDGE, please also cite the original papers of the datasets, which can be found in our BRIDGE paper.