Drug-induced toxicity is one of the leading reasons new drugs fail clinical trials. Machine learning models that predict drug toxicity from molecular structure could help researchers prioritize less toxic drug candidates. However, current toxicity datasets are typically small and limited to a single organ system (e.g., cardiac, renal, or liver). Creating these datasets has often involved time-intensive expert curation, with experts parsing drug label documents that can exceed 100 pages per drug. Here, we introduce UniTox, a unified dataset of 2,418 FDA-approved drugs with drug-induced toxicity summaries and ratings created by using GPT-4o to process FDA drug labels. UniTox spans eight types of toxicity: cardiotoxicity, liver toxicity, renal toxicity, pulmonary toxicity, hematological toxicity, dermatological toxicity, ototoxicity, and infertility. This is, to the best of our knowledge, the largest such systematic human in vivo database by number of drugs and toxicities, and the first covering nearly all FDA-approved medications for several of these toxicities.
We recruited clinicians to validate a random sample of our GPT-4o-annotated toxicities, and UniTox’s toxicity ratings agree with the clinicians' labels 87-96% of the time. Finally, we benchmark a graph neural network trained on UniTox to demonstrate the utility of this dataset for building molecular toxicity prediction models.
Data is available at Zenodo.
Below, please find our datasheet, following the model outlined in Datasheets for Datasets (Gebru et al. 2018).
UniTox was created as a unified toxicity dataset across eight types of drug toxicities (cardiotoxicity, liver toxicity, renal toxicity, pulmonary toxicity, hematological toxicity, dermatological toxicity, ototoxicity, and infertility). We generated information across all toxicities for the same set of 2,418 drugs with the same methodology of applying LLMs. For each drug, for each toxicity, we provide an LLM-generated summary of the relevant portions of the drug label, as well as ternary (No/Less/Most) predictions and binary (No/Yes) predictions for that toxicity.
The dataset was created by Jake Silberg, Kyle Swanson, Elana Simon, Angela Zhang, Zaniar Ghazizadeh, and James Zou at Stanford University, as well as Scott Ogden and Hisham Hamadeh at GenMab.
Chan-Zuckerberg Biohub
Each dataset instance is a single drug. For each drug, we provide information across eight toxicities, as well as a unique identifier in Structured Drug Labeling (SPL) format for the drug label used to create the toxicity information.
There are 2,418 drugs in the dataset and each drug has information on eight toxicities.
The dataset is a subset of all possible NDA, ANDA, and BLA drug labels for FDA-approved drugs (50,617 labels in total). We de-duplicated these drugs by unique generic names. Drugs that do not have a current FDA-approved label (e.g., withdrawn or discontinued drugs) are not included.
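As a rough illustration of de-duplicating labels by generic name, the sketch below keeps the first label seen for each name. The field names (`generic_name`, `spl_id`), the placeholder IDs, the case-insensitive comparison, and the keep-first rule are all assumptions made for this example. The authors' actual procedure is in their repository.

```python
def dedupe_by_generic_name(labels):
    """Keep the first label seen for each generic name.

    Assumption: names are compared case-insensitively after trimming
    whitespace. The real UniTox pipeline may use a different rule.
    """
    seen = set()
    unique = []
    for label in labels:
        key = label["generic_name"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(label)
    return unique


# Hypothetical label records; the SPL IDs are placeholders, not real IDs.
labels = [
    {"generic_name": "Abacavir", "spl_id": "spl-001"},
    {"generic_name": "ABACAVIR", "spl_id": "spl-002"},
    {"generic_name": "Abacavir Sulfate", "spl_id": "spl-003"},
]
unique_labels = dedupe_by_generic_name(labels)
```

Note that "Abacavir" and "Abacavir Sulfate" both survive, because exact-name de-duplication treats them as different drugs.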
Each instance is a single drug. For each instance, there are eight toxicities, and for each toxicity, there is an LLM-generated summary of the relevant sections of the drug label, a ternary prediction (No/Less/Most), and a binary prediction (No/Yes). Each instance also provides the unique SPL ID, allowing users to find the exact text used to generate the instance data.
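One way to picture an instance is as a small record type. The sketch below is not the published file schema: the class and field names are hypothetical, and only the eight toxicity types and the label vocabularies (No/Less/Most, No/Yes) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

TOXICITY_TYPES = (
    "cardiotoxicity", "liver toxicity", "renal toxicity", "pulmonary toxicity",
    "hematological toxicity", "dermatological toxicity", "ototoxicity",
    "infertility",
)
TERNARY_LABELS = ("No", "Less", "Most")
BINARY_LABELS = ("No", "Yes")


@dataclass
class ToxicityAnnotation:
    """LLM-generated annotation for one drug and one toxicity type."""
    summary: str   # summary of the relevant drug label sections
    ternary: str   # one of TERNARY_LABELS
    binary: str    # one of BINARY_LABELS

    def __post_init__(self):
        # Reject labels outside the documented vocabularies.
        if self.ternary not in TERNARY_LABELS:
            raise ValueError(f"unexpected ternary label: {self.ternary!r}")
        if self.binary not in BINARY_LABELS:
            raise ValueError(f"unexpected binary label: {self.binary!r}")


@dataclass
class DrugInstance:
    """One dataset instance: a single drug (hypothetical field names)."""
    generic_name: str
    spl_id: str                   # identifies the label text that was used
    toxicities: Dict[str, ToxicityAnnotation] = field(default_factory=dict)
    smiles: Optional[str] = None  # not every instance has a SMILES


# Illustrative values only; not taken from the dataset.
example = DrugInstance(
    generic_name="exampledrug",
    spl_id="spl-placeholder",
    toxicities={"liver toxicity": ToxicityAnnotation("(summary text)", "Less", "Yes")},
)
```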
Each instance has LLM-generated toxicity labels, both in ternary (No/Less/Most) and binary (No/Yes) form, for eight toxicity types.
All instances have a generic drug name, SPL ID, and LLM-generated toxicity summaries and labels. However, not all instances have SMILES. Only drugs that are small molecules whose generic name matches an entry in PubChem have a SMILES.
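Because SMILES strings are missing for some instances, structure-based models need to filter first. A minimal sketch follows, assuming each instance is a dictionary with an optional `smiles` key (a hypothetical name) and that missing values appear as `None` or an empty string.

```python
def instances_with_smiles(instances):
    """Return only the instances that carry a non-empty SMILES string."""
    return [inst for inst in instances if inst.get("smiles")]


# Illustrative records: aspirin is a small molecule with a SMILES,
# while adalimumab is a monoclonal antibody and has none.
instances = [
    {"generic_name": "aspirin", "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"},
    {"generic_name": "adalimumab", "smiles": None},
    {"generic_name": "unmatched-name", "smiles": ""},
]
trainable = instances_with_smiles(instances)
```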
No, the dataset does not record relationships between instances (e.g., shared drug classes or disease treatment classes).
There are no suggested data splits.
There are potential redundancies of two forms. First, because we de-duplicated drugs based on generic name, drugs with the same moiety may appear under different names (e.g., Abacavir and Abacavir Sulfate). Second, any typos or inconsistencies in a drug name would cause the drug to appear multiple times in the dataset (e.g., HCL vs. Hydrochloride).
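Users who want to flag such redundancies before training could group names by a normalized key. The sketch below is a heuristic and not part of the UniTox pipeline: the abbreviation map and the salt-word list are illustrative, far from exhaustive, and stripping salt words may merge entries that some analyses should keep separate.

```python
import re
from collections import defaultdict

# Illustrative lookup tables (assumptions, not from the dataset).
ABBREVIATIONS = {"hcl": "hydrochloride"}
SALT_WORDS = {"sulfate", "hydrochloride"}


def normalize_name(name):
    """Lowercase, expand known abbreviations, and drop salt-form words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    core = [tok for tok in tokens if tok not in SALT_WORDS]
    # Fall back to all tokens if the name consisted only of salt words.
    return " ".join(core or tokens)


def flag_possible_duplicates(names):
    """Group names sharing a normalized key; return groups of size > 1."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


names = ["Abacavir", "Abacavir Sulfate", "Metformin HCL",
         "Metformin Hydrochloride", "Aspirin"]
possible_duplicates = flag_possible_duplicates(names)
```

The flagged groups are candidates for manual review. Whether two salt forms should count as the same drug depends on the downstream task.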
The dataset is self-contained except for the FDA labels, which were used by the LLM to generate the toxicity summaries and labels but are not included directly in the dataset. The FDA labels can be found on the FDALabel website based on the SPL IDs included in the dataset.
No. All data in the dataset and all data used to generate it are publicly available and published in several forms on FDA websites.
These labels are created by the FDA in discussion and consultation with the drugmaker. Our LLM-generated summaries involved no data collection by anyone other than the authors.
We have not identified the oldest label still in the dataset. Labels are updated regularly as new information about a drug becomes available.
Ethical reviews of the LLM-processing steps were not conducted, as the risk was considered minimal.
The deduplication process has been described above. The HTML drug label was stripped of tags using the Beautiful Soup Python package. Figures in the drug label were not processed or considered.
The exact drug label text used to generate our summaries and predictions can be identified from the SPL ID. Additionally, the raw query results from FDALabel, prior to deduplication, are available in our GitHub repository.
All the code used for deduplication is available from our GitHub repository.
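To show what tag stripping does to a label, here is a dependency-free sketch. It swaps in the standard library's `html.parser` for Beautiful Soup so that it runs without extra packages. With Beautiful Soup, a comparable result would likely come from `BeautifulSoup(html, "html.parser").get_text()`, though the authors' exact call is not stated here. The HTML snippet is invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves (including
        # <img> figures) never reach this method.
        self.chunks.append(data)


def strip_tags(html):
    """Return the document text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Invented fragment standing in for part of a drug label.
html = ("<div><h1>WARNINGS</h1>"
        "<p>Hepatotoxicity has been reported.</p>"
        "<img src='fig1.png'></div>")
text = strip_tags(html)
```

The image tag contributes nothing to the output, which matches the note that figures were not processed.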
Yes, in the paper that published this dataset, a subset of the data was used to train a graph neural network to predict small molecule drug toxicities.
No, there is no central repository for all papers using this dataset.
This data could be used for other tasks related to predicting drug toxicity or understanding the relations between different types of toxicity.
The toxicity summaries and labels are LLM-generated and may contain errors. In our validation on a random sample, the ratings agreed with clinician labels 87-96% of the time.
The dataset should not be used for patients to decide whether a drug is safe to take. Patients should always consult medical experts about these drugs.
Yes, the dataset is publicly available on the internet.
The dataset is on GitHub and on Zenodo with a DOI.
The dataset was first distributed in June 2024.
The dataset is distributed under a CC BY 4.0 license.
No, since the FDA drug labels from which the dataset was generated are in the public domain.
No.
The Zou lab at Stanford University will be supporting and hosting UniTox.
Jake Silberg can be contacted at jsilberg at stanford.edu.
This will be posted on the dataset webpage.
N/A
To avoid providing inconsistent drug information, we do not anticipate hosting previous versions, though any errata will be made available.
We suggest contacting the authors.