FMs for EHRs

This workflow can be used to reproduce the results in the accompanying manuscript.

Requirements & structure

The bash scripts can be run in a Slurm environment with the specified resource requirements. (For GPU-based work, we used compute nodes with 8x A100 GPUs paired with 2x 16-core 3.0-GHz AMD Milan processors.) Each bash script calls one or more Python scripts that depend on an environment described in the requirements.txt file:

python3 -m venv venv
source venv/bin/activate
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip3 install -r requirements.txt

The code is organized as follows, where the numerical prefixes correspond to the prefixes of the bash (.sh) files:

[Figure: diagram for running the code]

What the code does

Data wrangling & tokenization

The code operates on MIMIC tabular data converted to the CLIF-2.0.0 format. It gathers the data associated with a given hospitalization_id and generates a sequence of integers corresponding to the stay. Each sequence begins with a start token, followed by information about the patient, information about the stay itself, and then encoded category-value pairs corresponding to, inter alia, lab records, vitals, and medications. The sequence ends with information on discharge and an end token, like so:

[Figure: example timeline]
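As a rough illustration, a stay's timeline might be assembled along the following lines (a minimal sketch; the build_timeline helper, the vocabulary keys, and the column names are hypothetical and do not reflect the repository's actual API):

import pandas as pd

def build_timeline(stay: pd.DataFrame, vocab: dict[str, int]) -> list[int]:
    # Assemble one hospitalization's integer sequence from CLIF-style rows (illustrative only).
    tokens = [vocab["<start>"]]
    # Static patient and admission information comes first.
    tokens.append(vocab[f"sex_{stay.sex.iloc[0]}"])
    tokens.append(vocab[f"admission_type_{stay.admission_type.iloc[0]}"])
    # Time-ordered clinical events (labs, vitals, medications) follow as category-value pairs.
    for _, row in stay.sort_values("event_time").iterrows():
        tokens.append(vocab[row["category"]])            # category token, e.g. temperature in Celsius
        tokens.append(vocab[f"decile_{row['decile']}"])  # deciled measurement value
    # Discharge information and an end token close the stay.
    tokens.append(vocab[f"discharge_{stay.discharge_category.iloc[0]}"])
    tokens.append(vocab["<end>"])
    return tokens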

Category-value tokenization iterates over all categories present in a table and learns deciles for the values within each category. For example, the vital category corresponding to temperature in Celsius may be assigned the integer label ‘33’. All temperature measurements in the training set are then used to determine deciles for this category. For hospitalization 42, the token ‘33’ for the category followed by ‘0’ for the corresponding deciled measurement would be inserted into the timeline at ‘E1’:

[Figure: CatVal tokenization]
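A minimal sketch of this decile scheme (the function names are hypothetical; the repository's actual tokenizer may differ):

import numpy as np

def fit_decile_edges(train_values: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    # Learn the nine decile boundaries per category from training-set measurements only.
    return {cat: np.quantile(vals, np.linspace(0.1, 0.9, 9)) for cat, vals in train_values.items()}

def value_to_decile(category: str, value: float, edges: dict[str, np.ndarray]) -> int:
    # Map a raw measurement to its decile index (0-9) within its category.
    return int(np.searchsorted(edges[category], value))

For instance, if the training distribution of Celsius temperatures places 36.5 in the lowest decile, value_to_decile("temp_c", 36.5, edges) would return 0, which becomes the value token that follows the category token in the timeline.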

Self-supervised training

Our training process packs sequences together, allowing one sequence to bleed into the next example within a batch. The dark goldenrod boundary outlines tokens corresponding to two individual hospitalization events:

[Figure: training]

We insert a variable number of padding tokens between sequences to expose the model to padding. During this initial training phase, the model learns to predict the next token in a sequence given the previous tokens (the ‘context’).
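A minimal sketch of such packing (block_size, pad_id, and max_pad are illustrative parameters, not the repository's actual configuration):

import random

def pack_sequences(timelines: list[list[int]], block_size: int, pad_id: int, max_pad: int = 4) -> list[list[int]]:
    # Concatenate timelines with a random number of pad tokens between them,
    # then cut the resulting stream into fixed-length training blocks.
    stream: list[int] = []
    for seq in timelines:
        stream.extend(seq)
        stream.extend([pad_id] * random.randint(1, max_pad))  # expose the model to padding
    return [stream[i:i + block_size] for i in range(0, len(stream) - block_size + 1, block_size)]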

Objective-specific finetuning

We perform supervised fine-tuning with left-padded sequences. Each hospitalization event (truncated at 24 hours) occupies a single training instance and is paired with its associated subsequent outcome. In this way, fine-tuning is outcome-specific.

[Figure: supervised fine-tuning]
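The left-padded construction might look roughly like this (a sketch assuming each timeline has already been cut at the 24-hour mark; make_finetuning_example is a hypothetical helper):

def make_finetuning_example(timeline: list[int], label: int, max_len: int, pad_id: int) -> tuple[list[int], int]:
    # One hospitalization, capped at max_len tokens, paired with its subsequent outcome label.
    tokens = timeline[:max_len]
    # Left-pad so every instance ends at the same (most recent) position.
    return [pad_id] * (max_len - len(tokens)) + tokens, label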

Representation extraction and analysis

Our pipeline extracts model-specific representations for each hospitalization event that are useful for predicting a number of subsequent outcomes.
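One common way to obtain such a representation from a decoder-only model (a sketch assuming a Hugging Face-style interface with output_hidden_states; the repository's extraction scripts may use a different layer or pooling):

import torch

@torch.no_grad()
def extract_representation(model, input_ids: torch.Tensor) -> torch.Tensor:
    # Use the final layer's hidden state at the last (most recent) token as the stay's representation.
    outputs = model(input_ids, output_hidden_states=True)
    return outputs.hidden_states[-1][:, -1, :]  # shape: (batch, hidden_dim)

These fixed-length vectors can then be fed to lightweight downstream classifiers (e.g., logistic regression) for outcome prediction.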

About

Code supplement for "Foundation models for electronic health records: representation dynamics and transferability"
