data/

Local staging area for the inputs to the NER, AC, and ICD coding pipelines in this repository. The directory is default-deny under git: only the sample data and public metadata listed below are tracked. Anything else you place here may contain credentialed clinical text and must not be committed.
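As a rough illustration of the default-deny rule, a pre-commit style check could compare candidate paths against an allowlist. The allowlist below is an assumption inferred from the entries this page describes as tracked; it is not a copy of the repository's actual ignore rules, and `is_tracked_candidate` and `flag_unsafe` are hypothetical helpers.

```python
from pathlib import PurePosixPath

# Assumption: inferred from the paths this page describes as tracked.
# The repository's real ignore rules may differ.
TRACKED_DIRS = ("code_descriptions", "sample_data")
TRACKED_FILES = (
    "README.md",
    "mimic-iv-ext-entitycoding/NOTE.md",
    "mimic-iv-note/README.md",
)


def is_tracked_candidate(rel_path: str) -> bool:
    """Return True if a path relative to data/ falls inside the assumed allowlist."""
    path = PurePosixPath(rel_path)
    if path.as_posix() in TRACKED_FILES:
        return True
    # Anything nested under a tracked directory counts as tracked.
    return len(path.parts) > 1 and path.parts[0] in TRACKED_DIRS


def flag_unsafe(rel_paths):
    """List the paths that should not be committed under the default-deny rule."""
    return [p for p in rel_paths if not is_tracked_candidate(p)]
```

For example, `flag_unsafe(["README.md", "mimic-iv-note/discharge.csv"])` would return only the `discharge.csv` path, since it sits outside the assumed allowlist.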

For first-time setup, start with the "What data do I need?" section, which walks through the smallest data set that matches what you want to run. This page is the map: it shows where each downloaded artefact lands once you have it and which scripts pick it up.

Schemas, statistics, label sets, and dataset-level caveats live with the data they describe in the PhysioNet release. Until those files are unpacked, the tracked placeholder mimic-iv-ext-entitycoding/NOTE.md stands in for them.

Directory map

| Path | What goes here | How to populate | Consumed by |
| --- | --- | --- | --- |
| `mimic-iv-ext-entitycoding/` | The MIMIC-IV-Ext-EntityCoding PhysioNet release files (`entity_annotations.csv`, `assertion_annotations.csv`, `assertion_sentences.csv`, `mimic-iv_notes_subset.csv`, `entity_ids.txt`, `assertion_ids.txt`, `annotation_guidelines.md`). | Download the release from PhysioNet (see "Source datasets" below) and unpack the files directly into this directory. The tracked placeholder is `mimic-iv-ext-entitycoding/NOTE.md`; the detailed release README is included with the PhysioNet files. | `ner/prepare_ner_training_data.py`, `ner/compute_release_stats.py`, `ac/prepare_our_annotations.py` |
| `mimic-iv-note/` | Ad hoc decompressed MIMIC-IV / MIMIC-IV-Note files used by top-level helpers (e.g. `discharge.csv`, `services.csv`). | Download from PhysioNet and decompress as needed. See `mimic-iv-note/README.md`. | `ner/ner_dataset_creation.py` (paper-time annotation sampling) |
| `models/` | Downloaded NER, AC, and base RoBERTa-PM checkpoints. | `python data_download.py` (run from the repo root). The `--models ner,ac,roberta` selectors cover this directory. | `ner/extract_entities.py`, `ner/ner_model_training.ipynb`, `ac/train_ac_model.py` |
| `ner/` | Local-only NER training and entity-extraction artefacts (`{train,validation,test}_latest.csv`, `entity_annotations_conll.txt`, `ner_dataset_notes_repro.{csv,parquet}`). | Generated by `ner/extract_entities.py`, `ner/prepare_ner_training_data.py`, and `ner/ner_dataset_creation.py`. Not committed. | `ner/ner_model_training.ipynb`, `ner/create_train_input.py` |
| `code_descriptions/` | ICD-9 and ICD-10 long-description tables (`d_icd_diagnoses.csv`, `d_icd_procedures.csv`), snapshotted at MIMIC-IV-Note v2.2. Tracked. | Already present in the repo. | `code_evidence/visualise_predictions_explanations.py` |
| `sample_data/` | Synthetic GPT-4o-generated discharge summaries (`sample_notes.csv`, columns `note_id,text`) for smoke testing. Tracked. | Already present in the repo. See `sample_data/README.md`. | `run_pipeline.py`, `ner/extract_entities.py`, `ner/clean_documents.py` |
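Before running the NER or AC preparation scripts, it can help to confirm that the release was unpacked directly into the expected directory. The sketch below is one way to do that; the file names are taken from the `mimic-iv-ext-entitycoding/` row above, while `missing_release_files` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path

# File names copied from the mimic-iv-ext-entitycoding/ row of the directory map.
RELEASE_FILES = [
    "entity_annotations.csv",
    "assertion_annotations.csv",
    "assertion_sentences.csv",
    "mimic-iv_notes_subset.csv",
    "entity_ids.txt",
    "assertion_ids.txt",
    "annotation_guidelines.md",
]


def missing_release_files(release_dir):
    """Return the release files that are not yet present in release_dir."""
    release_dir = Path(release_dir)
    return [name for name in RELEASE_FILES if not (release_dir / name).is_file()]
```

Run from the repo root, `missing_release_files("data/mimic-iv-ext-entitycoding")` returns an empty list once every file is in place. The check only looks at the top level of the directory, which matches the instruction to unpack the files directly into it.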

A note on naming: the entity-only training parquets land at external/plm_ca/data/processed/mimiciv_icd10/entity-only/ (with a hyphen), while the trained models live at external/plm_ca/models/entityonly/ (without one). This is intentional and the loaders depend on it.
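To make the hyphen difference harder to mistype, the two locations can be spelled out once as constants. This is only a sketch: the constant names are hypothetical, and the repository's loaders may construct these paths differently.

```python
from pathlib import Path

# Hypothetical constant names; only the path segments come from this page.
PLM_CA_ROOT = Path("external/plm_ca")

# Training parquets: the directory name is hyphenated ("entity-only").
ENTITY_ONLY_DATA_DIR = (
    PLM_CA_ROOT / "data" / "processed" / "mimiciv_icd10" / "entity-only"
)

# Trained models: the directory name has no hyphen ("entityonly").
ENTITY_ONLY_MODEL_DIR = PLM_CA_ROOT / "models" / "entityonly"
```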

PLM-CA's own downloads (the MIMIC files used by the vendored trainer, plus the MDACE annotations) live under external/plm_ca/data/ and are documented separately in docs/inference.md and external/plm_ca/data/raw/MDace/README.md. Do not stage those files in data/; the PLM-CA scripts assume their own directory tree.

The IAA reproduction inputs (annotator JSONs and source texts) live under ner/inter_annotator_agreement/annotations/, not in data/. See ner/inter_annotator_agreement/README.md.

Source datasets

All of the clinical sources below require PhysioNet credentialed access (account, CITI human-subjects training, and a signed Data Use Agreement). The i2b2 / n2c2 challenge corpora used to train the AC model are registered separately via the DBMI portal; see ac/README.md.

| Dataset | Version | Where to get it | Lands in |
| --- | --- | --- | --- |
| MIMIC-IV-Note | 2.2 | https://physionet.org/content/mimic-iv-note/2.2/ | `mimic-iv-note/` for top-level helpers; `external/plm_ca/data/raw/physionet.org/files/mimic-iv-note/2.2/` for the PLM-CA trainer |
| MIMIC-IV | 2.2 | https://physionet.org/content/mimiciv/2.2/ | `mimic-iv-note/` for top-level helpers (e.g. `services.csv`); `external/plm_ca/data/raw/physionet.org/files/mimiciv/2.2/` for the PLM-CA trainer |
| MIMIC-IV-Ext-EntityCoding | 1.0.0 (PhysioNet release in review; URL pending publication) | PhysioNet (link will be added here once the project is published) | `mimic-iv-ext-entitycoding/` |
| MIMIC-III | 1.4 | https://physionet.org/content/mimiciii/1.4/ | `external/plm_ca/data/raw/physionet.org/files/mimiciii/1.4/` for the PLM-CA trainer; `ac/sources/mimic_iii/NOTEEVENTS.csv` (decompressed) for AC training |
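The "Lands in" column can be turned into a quick staging report. The sketch below assumes this directory sits at `data/` under the repo root and that the `external/` and `ac/` paths are relative to that same root; `LANDING_PATHS` and `staging_report` are hypothetical names. Not every workflow needs every path, so a missing entry is not necessarily an error.

```python
from pathlib import Path

# Landing paths copied from the table above, relative to the repo root.
# Assumption: entries without a prefix in the table live under data/.
LANDING_PATHS = {
    "MIMIC-IV-Note 2.2": [
        "data/mimic-iv-note",
        "external/plm_ca/data/raw/physionet.org/files/mimic-iv-note/2.2",
    ],
    "MIMIC-IV 2.2": [
        "data/mimic-iv-note",
        "external/plm_ca/data/raw/physionet.org/files/mimiciv/2.2",
    ],
    "MIMIC-IV-Ext-EntityCoding 1.0.0": [
        "data/mimic-iv-ext-entitycoding",
    ],
    "MIMIC-III 1.4": [
        "external/plm_ca/data/raw/physionet.org/files/mimiciii/1.4",
        "ac/sources/mimic_iii/NOTEEVENTS.csv",
    ],
}


def staging_report(repo_root="."):
    """Map each dataset to (path, exists) pairs for its landing locations."""
    root = Path(repo_root)
    return {
        dataset: [(rel, (root / rel).exists()) for rel in rel_paths]
        for dataset, rel_paths in LANDING_PATHS.items()
    }
```

Calling `staging_report()` from the repo root gives a dictionary that can be printed or filtered for the paths a particular pipeline needs.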

The project's NER, AC, and ICD coding model weights are not on PhysioNet; they are fetched from a credentialed Google Drive bundle by running `python data_download.py` from the repo root. URLs and per-model selectors live in config/download_config.yaml; see docs/inference.md for usage.
