feat: Add MIMIC-IV-Note-Ext-DI dataset and PatientSummaryGeneration task #975

Closed
dsimanis wants to merge 1 commit into sunlabuiuc:master from dsimanis:feat/mimic4-note-ext-di-patient-summary

@dsimanis

Contributor: Deniss Simanis (denisss2@illinois.edu)

Contribution Type: Dataset + Task

Paper: Hegselmann, S., et al. (2024). A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models. PMLR 248, 339-379.
Paper Link: https://arxiv.org/abs/2402.15422


Overview

This PR adds the MIMIC-IV-Note-Ext-DI dataset and a PatientSummaryGeneration task to PyHealth, enabling research on faithful clinical text summarization. This is the first clinical NLP text-to-text summarization dataset in PyHealth.

The dataset maps Brief Hospital Course (BHC) clinical text to patient-facing Discharge Instructions (DI) in layperson language, derived from MIMIC-IV-Note. It supports 15 dataset variants including the full 100,175-example corpus, pre-split train/valid/test sets, and the paper's 100-example Original/Cleaned/Cleaned & Improved subsets used for hallucination-reduction experiments.

Reproduction Results

We reproduced the paper's core finding using LED-base (162M params) on a T4 GPU — a much smaller setup than the paper's Llama 2 70B.

Ablation 1 — Original vs. Cleaned (100 training examples):

| Metric | Original | Cleaned | Delta |
|---|---|---|---|
| ROUGE-1 | 40.35 | 40.17 | -0.18 |
| ROUGE-2 | 12.25 | 12.46 | +0.21 |
| ROUGE-L | 22.84 | 23.10 | +0.26 |
| BERTScore F1 | 86.27 | 86.41 | +0.14 |
| Mean generation length (words) | 128.9 | 114.9 | -14.0 |

Ablation 2 — Sample efficiency (novel extension):

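| N | Original ROUGE-1 | Cleaned ROUGE-1 | Delta | Original length (words) | Cleaned length (words) |
|---|---|---|---|---|---|
| 25 | 34.88 | 36.57 | +1.69 | 168.0 | 152.7 |
| 50 | 36.57 | 37.73 | +1.16 | 182.5 | 161.4 |
| 100 | 40.35 | 40.17 | -0.18 | 128.9 | 114.9 |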

Key findings:

  • Models trained on Cleaned data generate shorter summaries at every sample size (14-21 fewer words on average). Length is only a proxy, but this is consistent with fewer hallucinated details and with the paper's core claim.
  • At N=100, the standard metrics converge (all deltas within ±0.3), consistent with the paper's finding that ROUGE and BERTScore do not capture faithfulness.
  • The cleaning benefit on ROUGE-1 is largest at small sample sizes (+1.69 at N=25, +1.16 at N=50), a finding that goes beyond the original paper.
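The word-count range and ROUGE-1 deltas quoted in these findings can be rechecked from the Ablation 2 numbers. The sketch below uses only the values reported in that table; the variable and function names are illustrative and are not part of the PR.

```python
# Rows copied from the Ablation 2 table:
# N -> (Original ROUGE-1, Cleaned ROUGE-1, Original mean length, Cleaned mean length)
ablation2 = {
    25: (34.88, 36.57, 168.0, 152.7),
    50: (36.57, 37.73, 182.5, 161.4),
    100: (40.35, 40.17, 128.9, 114.9),
}


def summarize(rows):
    """Return {N: (ROUGE-1 delta, words saved by the Cleaned-trained model)}."""
    out = {}
    for n, (orig_r1, clean_r1, orig_len, clean_len) in rows.items():
        r1_delta = round(clean_r1 - orig_r1, 2)
        words_saved = round(orig_len - clean_len, 1)
        out[n] = (r1_delta, words_saved)
    return out


gaps = summarize(ablation2)
for n, (r1_delta, words_saved) in gaps.items():
    print(f"N={n}: ROUGE-1 delta {r1_delta:+.2f}, {words_saved:.1f} fewer words")
```

The length gap is positive in every row, while the ROUGE-1 delta changes sign at N=100, which is the pattern the findings describe.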

What Was Implemented

Dataset

  • pyhealth/datasets/mimic4_note_ext_di.py: MimicIVNoteExtDIDataset inheriting from BaseDataset
  • Handles automatic JSONL-to-CSV conversion from the PhysioNet data release
  • Supports 15 variants: BHC splits, full-context splits, and derived hallucination-reduction datasets
  • pyhealth/datasets/configs/mimic4_note_ext_di.yaml — YAML schema config
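To illustrate the JSONL-to-CSV step, here is a minimal standard-library sketch. It is not the PR's implementation: the function name `jsonl_to_csv`, the `patient_id` column derived from the row index, and the record keys `text` and `summary` (chosen to match the task schema in this PR) are all assumptions.

```python
import csv
import json
from pathlib import Path


def jsonl_to_csv(jsonl_path, csv_path, fields=("text", "summary")):
    """Convert a JSONL file (one JSON object per line) to CSV.

    Assumptions for this sketch: each record holds the keys in `fields`,
    and a synthetic `patient_id` is assigned from the row index.
    Returns the number of rows written.
    """
    jsonl_path, csv_path = Path(jsonl_path), Path(csv_path)
    n_rows = 0
    with jsonl_path.open(encoding="utf-8") as src, \
            csv_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["patient_id", *fields])
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            row = {"patient_id": n_rows}
            row.update({name: record[name] for name in fields})
            writer.writerow(row)
            n_rows += 1
    return n_rows
```

Writing through `csv.DictWriter` keeps commas and embedded newlines in clinical text intact, since such fields are quoted automatically.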

Task

  • pyhealth/tasks/patient_summary_generation.py: PatientSummaryGeneration inheriting from BaseTask
  • Text-to-text schema: input_schema={"text": "text"}, output_schema={"summary": "text"}
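The text-to-text schema can be pictured with a self-contained stand-in. The class below deliberately does not inherit from PyHealth's BaseTask, and the call signature and the event dictionaries are hypothetical; only the two schema dictionaries and the task name are taken from this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SummaryTaskSketch:
    """Stand-in for a text-to-text task (not PyHealth's BaseTask)."""

    task_name: str = "PatientSummaryGeneration"
    # Schemas as stated in the PR description.
    input_schema: Dict[str, str] = field(default_factory=lambda: {"text": "text"})
    output_schema: Dict[str, str] = field(default_factory=lambda: {"summary": "text"})

    def __call__(self, patient_id: str, events: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Turn one patient's events into (text, summary) samples.

        Hypothetical interface: each event is a dict with 'text' and
        'summary' keys; events with a blank field are skipped.
        """
        samples = []
        for event in events:
            text = event.get("text", "").strip()
            summary = event.get("summary", "").strip()
            if not text or not summary:
                continue
            samples.append({"patient_id": patient_id, "text": text, "summary": summary})
        return samples


task = SummaryTaskSketch()
samples = task("p0", [
    {"text": "Brief hospital course ...", "summary": "You were admitted for ..."},
    {"text": "", "summary": "Orphan summary without source text"},
])
```

Each sample carries exactly one key per schema entry plus the patient identifier, so a downstream sequence-to-sequence model can read `text` as the source and `summary` as the target.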

Tests

  • tests/core/test_mimic4_note_ext_di.py — 19 unit tests covering:
    • Dataset loading from CSV and JSONL
    • Patient/event parsing and attribute access
    • Task sample generation and schema validation
    • Data integrity (all patients produce samples, no empty fields)
    • Error handling (invalid variants, missing files)
    • All tests use synthetic data (5 patients) and complete in ~10 seconds
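As a rough picture of the data-integrity checks, the sketch below builds five synthetic rows and verifies that every patient is present and that no field is empty. It does not reproduce the PR's test file; the helper names, row layout, and test names are hypothetical.

```python
import unittest


def make_synthetic_rows(n_patients=5):
    """Build synthetic (text, summary) rows; no real MIMIC data involved."""
    return [
        {
            "patient_id": str(i),
            "text": f"Synthetic brief hospital course {i}.",
            "summary": f"Synthetic discharge instructions {i}.",
        }
        for i in range(n_patients)
    ]


def empty_fields(rows, required=("text", "summary")):
    """Return (patient_id, field) pairs whose value is missing or blank."""
    return [
        (row["patient_id"], name)
        for row in rows
        for name in required
        if not str(row.get(name, "")).strip()
    ]


class TestSyntheticIntegrity(unittest.TestCase):
    def setUp(self):
        self.rows = make_synthetic_rows()

    def test_every_patient_has_a_row(self):
        self.assertEqual({r["patient_id"] for r in self.rows}, {"0", "1", "2", "3", "4"})

    def test_no_empty_fields(self):
        self.assertEqual(empty_fields(self.rows), [])

    def test_blank_field_is_reported(self):
        self.rows[2]["summary"] = "   "
        self.assertEqual(empty_fields(self.rows), [("2", "summary")])


# Run the suite explicitly so the sketch also works when imported or pasted.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSyntheticIntegrity)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```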

Documentation

  • docs/api/datasets/pyhealth.datasets.MimicIVNoteExtDIDataset.rst
  • docs/api/tasks/pyhealth.tasks.PatientSummaryGeneration.rst
  • Updated docs/api/datasets.rst and docs/api/tasks.rst index files

Example / Ablation Script

  • examples/mimic4noteextdi_patient_summary_led.py
  • Demonstrates Original vs. Cleaned training comparison with LED-base
  • Includes --demo mode for synthetic data (no GPU/real data needed)
  • Results documented in module docstring
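A demo switch of this kind is commonly wired up with `argparse`. The following is a sketch of the pattern, not the example script itself: only the `--demo` flag comes from this PR, while `--data-root`, `build_parser`, `load_examples`, and the synthetic records are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Original vs. Cleaned summarization ablation (sketch)"
    )
    parser.add_argument(
        "--demo",
        action="store_true",
        help="use built-in synthetic examples; no GPU or real data needed",
    )
    # Hypothetical flag, not confirmed by the PR.
    parser.add_argument(
        "--data-root",
        default=None,
        help="directory holding the PhysioNet data release",
    )
    return parser


def load_examples(args):
    """Return training examples, using synthetic data in demo mode."""
    if args.demo:
        return [
            {"text": "Synthetic hospital course A.", "summary": "Synthetic instructions A."},
            {"text": "Synthetic hospital course B.", "summary": "Synthetic instructions B."},
        ]
    if args.data_root is None:
        raise SystemExit("pass --demo or --data-root")
    raise NotImplementedError("real-data loading is out of scope for this sketch")


demo_args = build_parser().parse_args(["--demo"])
demo_examples = load_examples(demo_args)
```

Keeping the synthetic path behind a single flag lets reviewers run the script end to end without credentialed PhysioNet access.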

Files to Review

Core implementation

  • pyhealth/datasets/mimic4_note_ext_di.py
  • pyhealth/datasets/configs/mimic4_note_ext_di.yaml
  • pyhealth/tasks/patient_summary_generation.py

Registration

  • pyhealth/datasets/__init__.py
  • pyhealth/tasks/__init__.py

Tests

  • tests/core/test_mimic4_note_ext_di.py
  • test-resources/core/mimic4_note_ext_di/summaries.csv

Documentation

  • docs/api/datasets/pyhealth.datasets.MimicIVNoteExtDIDataset.rst
  • docs/api/tasks/pyhealth.tasks.PatientSummaryGeneration.rst
  • docs/api/datasets.rst
  • docs/api/tasks.rst

Example

  • examples/mimic4noteextdi_patient_summary_led.py

Notes

  • Data requires credentialed PhysioNet access (https://doi.org/10.13026/m6hf-dq94)
  • Tests use synthetic data only — no real MIMIC data required
  • Example script requires transformers, datasets, evaluate, rouge-score, bert-score
  • Reproduction used Google Colab Pro with a T4 GPU; each training run took ~35 minutes

Add support for clinical text summarization in PyHealth:
- MimicIVNoteExtDIDataset: loads BHC-to-DI discharge instruction pairs
  from the PhysioNet data release (Hegselmann et al., 2024)
- PatientSummaryGeneration task: text-to-text schema for summarization
- 19 unit tests with synthetic data
- Example ablation script comparing Original vs Cleaned training data
- Documentation and index updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dsimanis dsimanis closed this Apr 13, 2026
@dsimanis dsimanis deleted the feat/mimic4-note-ext-di-patient-summary branch April 13, 2026 13:13