Local staging area for the inputs to the NER, AC, and ICD coding pipelines in this repository. The directory is default-deny under git: only the sample data and public metadata listed below are tracked. Anything else you place here may contain credentialed clinical text and must not be committed.
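As a rough illustration of the default-deny rule, the sketch below flags staged paths under data/ that fall outside a small allowlist. The allowlist is an assumption inferred from the entries this page describes as tracked, and `unexpected_data_paths` is a hypothetical helper, not a script in this repository; the repository's own ignore rules remain the authority.

```python
from pathlib import PurePosixPath

# Assumption: allowlist inferred from the entries this page describes as
# tracked. It is illustrative and may be incomplete.
TRACKED_PREFIXES = (
    "data/sample_data/",
    "data/code_descriptions/",
)
TRACKED_FILES = (
    "data/mimic-iv-ext-entitycoding/NOTE.md",
    "data/mimic-iv-note/README.md",
)


def unexpected_data_paths(staged_paths):
    """Return staged paths under data/ that are outside the allowlist."""
    flagged = []
    for raw in staged_paths:
        # Normalise "./data/x" and "data//x" to "data/x".
        path = PurePosixPath(raw).as_posix()
        if not path.startswith("data/"):
            continue  # Outside the staging area; not checked here.
        if path in TRACKED_FILES or path.startswith(TRACKED_PREFIXES):
            continue
        flagged.append(path)
    return flagged


if __name__ == "__main__":
    example = [
        "data/sample_data/sample_notes.csv",
        "data/mimic-iv-note/discharge.csv",
    ]
    for path in unexpected_data_paths(example):
        print("Refusing to commit:", path)
```

One way to use it is to pass in the output of `git diff --cached --name-only` before committing and stop if the returned list is non-empty.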
For first-time setup, start with the "What data do I need?" section; it walks through the smallest data set that matches what you want to run. This page maps where each downloaded artefact lands once you have it and which scripts pick it up.
Schemas, statistics, label sets, and dataset-level caveats live with the
data they describe in the PhysioNet release. Until those files are unpacked,
the tracked placeholder is mimic-iv-ext-entitycoding/NOTE.md.
| Path | What goes here | How to populate | Consumed by |
|---|---|---|---|
| mimic-iv-ext-entitycoding/ | The MIMIC-IV-Ext-EntityCoding PhysioNet release files (entity_annotations.csv, assertion_annotations.csv, assertion_sentences.csv, mimic-iv_notes_subset.csv, entity_ids.txt, assertion_ids.txt, annotation_guidelines.md). | Download the release from PhysioNet (see "Source datasets" below) and unpack the files directly into this directory. The tracked placeholder is mimic-iv-ext-entitycoding/NOTE.md; the detailed release README is included with the PhysioNet files. | ner/prepare_ner_training_data.py, ner/compute_release_stats.py, ac/prepare_our_annotations.py |
| mimic-iv-note/ | Ad hoc decompressed MIMIC-IV / MIMIC-IV-Note files used by top-level helpers (e.g. discharge.csv, services.csv). | Download from PhysioNet and decompress as needed. See mimic-iv-note/README.md. | ner/ner_dataset_creation.py (paper-time annotation sampling) |
| models/ | Downloaded NER, AC, and base RoBERTa-PM checkpoints. | python data_download.py (run from the repo root). The --models ner,ac,roberta selectors cover this directory. | ner/extract_entities.py, ner/ner_model_training.ipynb, ac/train_ac_model.py |
| ner/ | Local-only NER training and entity-extraction artefacts ({train,validation,test}_latest.csv, entity_annotations_conll.txt, ner_dataset_notes_repro.{csv,parquet}). | Generated by ner/extract_entities.py, ner/prepare_ner_training_data.py, and ner/ner_dataset_creation.py. Not committed. | ner/ner_model_training.ipynb, ner/create_train_input.py |
| code_descriptions/ | ICD-9 and ICD-10 long-description tables (d_icd_diagnoses.csv, d_icd_procedures.csv), snapshotted at MIMIC-IV-Note v2.2. Tracked. | Already present in the repo. | code_evidence/visualise_predictions_explanations.py |
| sample_data/ | Synthetic GPT-4o-generated discharge summaries (sample_notes.csv, columns note_id,text) for smoke testing. Tracked. | Already present in the repo. See sample_data/README.md. | run_pipeline.py, ner/extract_entities.py, ner/clean_documents.py |
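Before running the consumers in the mimic-iv-ext-entitycoding/ row, it can help to confirm that the release was unpacked completely. The sketch below checks for the file names listed in that row; `missing_release_files` is a hypothetical helper, and the assumption is that it runs from the repo root with the staging area at data/.

```python
from pathlib import Path

# File names copied from the mimic-iv-ext-entitycoding/ row above.
RELEASE_FILES = [
    "entity_annotations.csv",
    "assertion_annotations.csv",
    "assertion_sentences.csv",
    "mimic-iv_notes_subset.csv",
    "entity_ids.txt",
    "assertion_ids.txt",
    "annotation_guidelines.md",
]


def missing_release_files(release_dir):
    """Return the expected release files that are absent from release_dir."""
    release_dir = Path(release_dir)
    return [name for name in RELEASE_FILES if not (release_dir / name).is_file()]


if __name__ == "__main__":
    # Assumption: run from the repo root, so the staging directory is
    # data/mimic-iv-ext-entitycoding/.
    missing = missing_release_files(Path("data") / "mimic-iv-ext-entitycoding")
    if missing:
        print("Missing release files:", ", ".join(missing))
    else:
        print("All release files are in place.")
```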
A note on naming: the entity-only training parquets land at
external/plm_ca/data/processed/mimiciv_icd10/entity-only/ (with a hyphen),
while the trained models live at external/plm_ca/models/entityonly/
(without one). This is intentional and the loaders depend on it.
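Because the two spellings are easy to confuse, one option is to keep both paths as named constants and to check for directories created under the wrong spelling. This is a minimal sketch; the constant names and `misnamed_dirs` are hypothetical and do not describe how the loaders themselves resolve paths.

```python
from pathlib import Path

PLM_CA_ROOT = Path("external") / "plm_ca"

# Training parquets: "entity-only", with a hyphen.
ENTITY_ONLY_DATA_DIR = (
    PLM_CA_ROOT / "data" / "processed" / "mimiciv_icd10" / "entity-only"
)
# Trained models: "entityonly", without a hyphen.
ENTITY_ONLY_MODEL_DIR = PLM_CA_ROOT / "models" / "entityonly"


def misnamed_dirs(repo_root="."):
    """Return directories that exist under the swapped (wrong) spelling."""
    root = Path(repo_root)
    swapped = [
        root / ENTITY_ONLY_DATA_DIR.with_name("entityonly"),
        root / ENTITY_ONLY_MODEL_DIR.with_name("entity-only"),
    ]
    return [path for path in swapped if path.exists()]


if __name__ == "__main__":
    for path in misnamed_dirs():
        print("Unexpected spelling, loaders will not find:", path)
```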
PLM-CA's own downloads (the MIMIC files used by the vendored trainer, plus
the MDACE annotations) live under external/plm_ca/data/ and are
documented separately in docs/inference.md
and external/plm_ca/data/raw/MDace/README.md. Do not stage those files in
data/; the PLM-CA scripts assume their own directory tree.
The IAA reproduction inputs (annotator JSONs and source texts) live under
ner/inter_annotator_agreement/annotations/, not in data/. See
ner/inter_annotator_agreement/README.md.
All of the clinical sources below require PhysioNet credentialed access
(account, CITI human-subjects training, and a signed Data Use Agreement).
The i2b2 / n2c2 challenge corpora used to train the AC model are
registered separately via the DBMI portal; see ac/README.md.
| Dataset | Version | Where to get it | Lands in |
|---|---|---|---|
| MIMIC-IV-Note | 2.2 | https://physionet.org/content/mimic-iv-note/2.2/ | mimic-iv-note/ for top-level helpers; external/plm_ca/data/raw/physionet.org/files/mimic-iv-note/2.2/ for the PLM-CA trainer |
| MIMIC-IV | 2.2 | https://physionet.org/content/mimiciv/2.2/ | mimic-iv-note/ for top-level helpers (e.g. services.csv); external/plm_ca/data/raw/physionet.org/files/mimiciv/2.2/ for the PLM-CA trainer |
| MIMIC-IV-Ext-EntityCoding | 1.0.0 (PhysioNet release in review; URL pending publication) | PhysioNet (link will be added here once the project is published) | mimic-iv-ext-entitycoding/ |
| MIMIC-III | 1.4 | https://physionet.org/content/mimiciii/1.4/ | external/plm_ca/data/raw/physionet.org/files/mimiciii/1.4/ for the PLM-CA trainer; ac/sources/mimic_iii/NOTEEVENTS.csv (decompressed) for AC training |
The project's NER, AC, and ICD coding model weights are not on PhysioNet;
they are fetched from a credentialed Google Drive bundle by
python data_download.py. URLs and per-model selectors live in
config/download_config.yaml; see docs/inference.md
for usage.
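The download step can be wrapped as in the sketch below, which builds the `data_download.py` invocation and only runs it on request. The selector list covers just the three names mentioned on this page (`ner`, `ac`, `roberta`); whether subsets may be passed as a comma-separated value, and which selectors exist for the ICD coding weights, are assumptions to confirm against config/download_config.yaml. `build_download_command` and `download` are hypothetical helpers.

```python
import subprocess
import sys

# Selectors named on this page; config/download_config.yaml is the
# authoritative list and may define more.
KNOWN_SELECTORS = ("ner", "ac", "roberta")


def build_download_command(selectors):
    """Build the argv list for data_download.py with a --models value."""
    unknown = [name for name in selectors if name not in KNOWN_SELECTORS]
    if unknown:
        raise ValueError(f"Selectors not listed on this page: {unknown}")
    # Assumption: --models accepts a comma-separated list of selectors.
    return [sys.executable, "data_download.py", "--models", ",".join(selectors)]


def download(selectors, dry_run=True):
    """Print the command; run it from the repo root when dry_run is False."""
    cmd = build_download_command(selectors)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    download(["ner", "ac", "roberta"])  # Dry run: prints the command only.
```

The dry-run default is deliberate: the bundle is credentialed, so it seems safer to inspect the command before anything is fetched.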
- mimic-iv-note/README.md: what to drop into the local MIMIC-IV-Note staging directory and what to leave under the PLM-CA tree.
- sample_data/README.md: provenance and intended use of the synthetic notes.
- Top-level README.md: landing page (quick start, at-a-glance result, repo tour, install, license summary, citation). Detailed reproduction recipes live in docs/.