Clinical language models hold tremendous promise for enhancing the effectiveness of healthcare professionals, and interest in developing them has surged recently. One recent effort in this direction is the creation of Clinical T5.

However, a notable challenge in developing large-scale clinical language models lies in data access. Clinical NLP relies primarily on MIMIC-III, which comprises clinical notes from a private hospital. Because this data is private and sensitive, it can only be used for non-commercial purposes.

There have been numerous calls for a collaborative approach to clinical model development through data sharing. Nonetheless, such collaborative efforts are typically long-term endeavours that require considerable time and coordination.

Alternatively, we can leverage the capabilities of large language models, which excel at generating text. One potential strategy is to use these models to generate synthetic clinical notes and then train language models on this synthetic data. The hope is that models trained this way will perform comparably to those trained on real clinical notes.

Generating Synthetic Notes

The authors of this paper make an astute observation: unlike clinical notes, case reports are not subject to privacy constraints. In inpatient healthcare, many clinical language models are trained on a specific type of document known as a discharge summary. These summaries cover a patient's entire stay and are structured, typically including details about allergies, tests conducted, post-discharge instructions, and more.

Case reports, however, lack this structure. This is where GPT comes into play: the authors use GPT-3.5 to turn case reports into synthetic clinical notes that resemble real ones.
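A minimal sketch of this generation step, assuming the `openai` Python client; the prompt wording and the case-report placeholder are my illustration, not the paper's exact setup:

```python
# Sketch: rewrite a public case report as a synthetic discharge summary
# with GPT-3.5. The prompt text is an illustrative guess, not the paper's.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

case_report = "..."  # a publicly available case report goes here

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You write realistic hospital discharge summaries."},
        {"role": "user",
         "content": "Rewrite the following case report as a structured "
                    "discharge summary with sections for allergies, tests "
                    f"performed, and discharge instructions:\n\n{case_report}"},
    ],
)
synthetic_note = response.choices[0].message.content
```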

<aside> 🤔 What strategies can be employed to ensure that synthetic notes closely resemble authentic clinical records? One option is to spot-check a subset of synthetic notes before deploying GPT-3.5 for large-scale generation. Another is to train a language model on real clinical notes and then measure its perplexity on the synthetic notes; lower perplexity indicates higher fidelity of the generated notes (see the sketch after this aside).

They could also have used something like MAUVE to compare human-written text with machine-generated text.

</aside>
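A minimal sketch of that perplexity check, using Hugging Face `transformers`; `gpt2` stands in here for a model trained on real clinical notes:

```python
# Sketch: score a synthetic note by perplexity under a reference LM.
# "gpt2" is a stand-in for an LM trained on real clinical notes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Lower perplexity suggests the synthetic note reads more like real notes.
print(perplexity("Discharge summary: 67-year-old male admitted for chest pain ..."))
```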

Using Clinical Notes to Generate Instruction-Based Data

Instruction data pairs an instruction with its corresponding answer, and instruction-tuned language models have seen recent success. In this paper, the clinical notes generated in the previous step are used to construct an instruction dataset. The study focuses on tasks deemed relevant in clinical settings, such as named entity recognition, paraphrasing, and coreference resolution, among others.
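For a concrete picture, one record in such a dataset might look like this; the fields are illustrative, not the paper's exact schema:

```python
# One illustrative instruction record built from a synthetic note.
example = {
    "note": "Discharge summary: 67-year-old male admitted for chest pain ...",
    "instruction": "Extract all medications mentioned in the note.",
    "answer": "aspirin, metoprolol, atorvastatin",
}
```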

Early efforts to assemble instruction data relied on human annotation. However, recent work such as Self-Instruct has shown that large language models can autonomously generate instructions and their corresponding answers. In line with these advances, this paper generates instructions for its tasks with Self-Instruct. The process involves the following steps (a code sketch follows the list):

  1. Experts provide initial seed instructions for each task type.
  2. These seed instructions are used to prompt ChatGPT, which generates additional instructions.
  3. The generated instruction is then paired with the clinical note and fed into ChatGPT once more to obtain the answer.
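A minimal sketch of this loop, again with the `openai` client; the prompt wording and seed instructions are illustrative assumptions:

```python
# Sketch of the three-step Self-Instruct-style loop described above.
from openai import OpenAI

client = OpenAI()

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Step 1: expert-written seed instructions for one task type (here, NER).
seeds = [
    "List every disease mentioned in the note.",
    "Extract all medication names from the note.",
]

# Step 2: prompt the model to propose new instructions in the same style.
new_instructions = chat(
    "Here are example instructions for clinical NER:\n"
    + "\n".join(seeds)
    + "\nWrite five more instructions of the same kind, one per line."
).splitlines()

# Step 3: pair each instruction with a clinical note to obtain the answer.
note = "Discharge summary: 67-year-old male admitted for chest pain ..."
dataset = [
    {"note": note, "instruction": ins,
     "answer": chat(f"{ins}\n\nNote:\n{note}")}
    for ins in new_instructions if ins.strip()
]
```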

That's it. Train LLaMA or any other model on this instruction data to obtain a task-specific, instruction-finetuned large language model, then evaluate it. The paper's main highlight is that the performance of the model trained on synthetic notes is $\sim$ that of the model trained on real notes. However, they use GPT-4 for evaluation.
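A minimal sketch of this fine-tuning step with Hugging Face `transformers`; the model name, prompt template, and hyperparameters are assumptions, not the paper's exact configuration:

```python
# Sketch: instruction fine-tuning a causal LM on the generated records.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

records = [{
    "note": "Discharge summary: 67-year-old male admitted for chest pain ...",
    "instruction": "Extract all medications mentioned in the note.",
    "answer": "aspirin, metoprolol, atorvastatin",
}]  # in practice, the full instruction dataset from the previous step

model_name = "huggyllama/llama-7b"  # any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_features(rec):
    text = (f"Note:\n{rec['note']}\n\nInstruction:\n{rec['instruction']}"
            f"\n\nAnswer:\n{rec['answer']}{tok.eos_token}")
    return tok(text, truncation=True, max_length=2048)

ds = Dataset.from_list(records).map(
    to_features, remove_columns=["note", "instruction", "answer"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical-sft",
                           per_device_train_batch_size=1,
                           num_train_epochs=3),
    train_dataset=ds,
    # mlm=False yields causal-LM labels with padding tokens ignored.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```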

<aside> 🤔 GPT-4 as the only evaluation is problematic. The paper appears to rely exclusively on GPT-4 for evaluation, which raises some concerns. For instance, it could benefit from additional automated metrics, such as ROUGE, to assess summarization quality (a minimal sketch follows this aside). While GPT-4 offers the convenience of a single scoring system across all tasks, whether it alone suffices here remains doubtful.

</aside>
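For the ROUGE suggestion above, a minimal sketch with the `evaluate` library (requires `pip install evaluate rouge_score`); the texts are placeholders:

```python
# Sketch: scoring summarization outputs with ROUGE as a complement
# to GPT-4 judgments.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["Patient discharged on aspirin after chest pain workup."],
    references=["The patient was discharged on aspirin following evaluation "
                "of chest pain."],
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```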