feat: Add MIMIC-IV-Note-Ext-DI dataset and PatientSummaryGeneration task #975
Closed
dsimanis wants to merge 1 commit into sunlabuiuc:master
Add support for clinical text summarization in PyHealth:
- MimicIVNoteExtDIDataset: loads BHC-to-DI discharge instruction pairs from the PhysioNet data release (Hegselmann et al., 2024)
- PatientSummaryGeneration task: text-to-text schema for summarization
- 19 unit tests with synthetic data
- Example ablation script comparing Original vs Cleaned training data
- Documentation and index updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor: Deniss Simanis (denisss2@illinois.edu)
Contribution Type: Dataset + Task
Paper: Hegselmann, S., et al. (2024). A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models. PMLR 248, 339-379.
Paper Link: https://arxiv.org/abs/2402.15422
Overview
This PR adds the MIMIC-IV-Note-Ext-DI dataset and a PatientSummaryGeneration task to PyHealth, enabling research on faithful clinical text summarization. This is the first clinical NLP text-to-text summarization dataset in PyHealth.
The dataset maps Brief Hospital Course (BHC) clinical text to patient-facing Discharge Instructions (DI) in layperson language, derived from MIMIC-IV-Note. It supports 15 dataset variants including the full 100,175-example corpus, pre-split train/valid/test sets, and the paper's 100-example Original/Cleaned/Cleaned & Improved subsets used for hallucination-reduction experiments.
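As a rough illustration of the BHC-to-DI pairing described above, the sketch below reads paired rows from a CSV and emits samples with a `text` input and a `summary` target. It does not use PyHealth's actual classes. The column names, the `load_pairs` helper, the `sample_id` field, and the synthetic rows are all assumptions made for illustration, and the real data release may be laid out differently.

```python
import csv
import io

# Synthetic stand-in for a BHC -> DI file. The column names "text" and
# "summary" are assumptions, not the confirmed layout of the real release.
SYNTHETIC_CSV = """text,summary
"Patient admitted with chest pain. Troponins negative. Discharged on aspirin.","You came in with chest pain. Tests showed no heart attack. Take aspirin daily."
"Admitted for pneumonia, treated with IV antibiotics, improved.","You had a lung infection and got antibiotics. Finish your pills at home."
"""


def load_pairs(handle):
    """Read BHC/DI rows from a file-like object and return text-to-text samples."""
    samples = []
    for index, row in enumerate(csv.DictReader(handle)):
        text = (row.get("text") or "").strip()
        summary = (row.get("summary") or "").strip()
        if not text or not summary:
            continue  # skip rows missing either side of the pair
        samples.append(
            {
                "sample_id": str(index),  # hypothetical identifier field
                "text": text,             # model input: Brief Hospital Course
                "summary": summary,       # target: Discharge Instructions
            }
        )
    return samples


samples = load_pairs(io.StringIO(SYNTHETIC_CSV))
print(len(samples), sorted(samples[0]))
```

Keeping each sample as a flat dictionary with one input string and one target string makes it straightforward to hand the pairs to a sequence-to-sequence tokenizer later.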
Reproduction Results
We reproduced the paper's core finding using LED-base (162M params) on a T4 GPU — a much smaller setup than the paper's Llama 2 70B.
Ablation 1 — Original vs. Cleaned (100 training examples):
Ablation 2 — Sample efficiency (novel extension):
Key findings:
What Was Implemented
Dataset
- pyhealth/datasets/mimic4_note_ext_di.py: MimicIVNoteExtDIDataset, inheriting from BaseDataset
- pyhealth/datasets/configs/mimic4_note_ext_di.yaml: YAML schema config
Task
- pyhealth/tasks/patient_summary_generation.py: PatientSummaryGeneration, inheriting from BaseTask
- input_schema={"text": "text"}, output_schema={"summary": "text"}
Tests
- tests/core/test_mimic4_note_ext_di.py: 19 unit tests
Documentation
- docs/api/datasets/pyhealth.datasets.MimicIVNoteExtDIDataset.rst
- docs/api/tasks/pyhealth.tasks.PatientSummaryGeneration.rst
- docs/api/datasets.rst and docs/api/tasks.rst index files
Example / Ablation Script
- examples/mimic4noteextdi_patient_summary_led.py
- --demo mode for synthetic data (no GPU/real data needed)
Files to Review
Core implementation
- pyhealth/datasets/mimic4_note_ext_di.py
- pyhealth/datasets/configs/mimic4_note_ext_di.yaml
- pyhealth/tasks/patient_summary_generation.py
Registration
- pyhealth/datasets/__init__.py
- pyhealth/tasks/__init__.py
Tests
- tests/core/test_mimic4_note_ext_di.py
- test-resources/core/mimic4_note_ext_di/summaries.csv
Documentation
- docs/api/datasets/pyhealth.datasets.MimicIVNoteExtDIDataset.rst
- docs/api/tasks/pyhealth.tasks.PatientSummaryGeneration.rst
- docs/api/datasets.rst
- docs/api/tasks.rst
Example
- examples/mimic4noteextdi_patient_summary_led.py
Notes
- transformers, datasets, evaluate, rouge-score, bert-score
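
The package list above includes rouge-score, which suggests ROUGE is among the metrics used by the example script; the PR text does not say which variant. For intuition only, here is a minimal pure-Python sketch of a ROUGE-L style F1 based on the longest common subsequence. It assumes simple lowercase whitespace tokenization and no stemming, so its numbers will not necessarily match the rouge-score package. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1: whitespace tokens, lowercased, no stemming."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("take aspirin every day", "take aspirin daily")
print(round(score, 4))
```

For reported results, the maintained metric implementations are the safer choice; a sketch like this is mainly useful for sanity-checking on synthetic data.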