Written by: Peijin Han

Edited by: Frances Gu and Henry Ertl

Imagine you had the misfortune of waking up one day with a crippling headache. You have been experiencing neck pain and headaches for a couple of years, but doctors have always told you it was a neck strain and prescribed a muscle relaxer. Today, however, is your third day of suffering from this headache, which doesn’t seem to be alleviated by anything you have at home. In the past, you would have navigated the outside world to go to a clinic. But today, instead of dragging yourself out of bed and enduring onerous wait times, you opt to receive medical attention from the comfort of your home.

You open an online medical service app and type, “I have a headache,” and then proceed to answer questions about whether you have nausea or vomiting, saw lights before the headache started, have a history of sinusitis, or have taken any medications. Based on your answers, the app provides a list of possible diagnoses: migraine (92% probability), neck strain (5% probability), tension headache (2.5% probability), or sinus headache (0.5% probability). Based on the highest-probability diagnosis, the app recommends that you stay at home, rest in a dark room, stay hydrated, and take the medication it prescribes, all the while sending the recorded conversation and prescription to the pharmacy on file. An on-call physician supervises all of these activities: they verify the prescription and sign the note while watching for any abnormalities. You take the medication and the headache magically stops. You wish you had consulted the app sooner to find out the true culprit!

Illustration by Zoe Yeoh

This scenario might sound like something out of a science fiction novel, but believe it or not, it is a reality made possible by ChatGPT! ChatGPT, one of the emerging artificial intelligence (AI) technologies built on large language models, was first released by OpenAI (1) in 2022 and has been attracting enormous attention this year. ChatGPT can complete many language tasks, such as question answering, text generation, and summarization, using large language models from the generative pre-trained transformer (GPT) family. Unlike Siri or Alexa, ChatGPT better engages people in conversations because it can analyze and respond to a wide range of topics and subjects. You can ask ChatGPT anything, from general questions like, “How can I grow a plant?” to difficult academic questions like, “Can you help me with this math proof?” Although ChatGPT doesn’t always have a perfect answer, it still responds to questions with ease and wit. Shockingly, it passed the United States Medical Licensing Examination without prior training, albeit with scores near the minimum passing threshold (2), suggesting that there is untapped potential for the use of ChatGPT in medicine. But before delving into how ChatGPT might revolutionize healthcare, it is useful to understand its inner workings.

How does ChatGPT work?

To some extent, the way ChatGPT learns is similar to how humans do. When asked a question, we need to first understand and interpret the voice we hear or the text we read – in other words, “decode” information. Next, we decide on the best response using the knowledge that comes from what we’ve read, heard, or previously experienced. Finally, we organize our thoughts and speak or write our responses for other people to understand.

ChatGPT works similarly by analyzing the input question and then producing the string of words that it predicts will best answer it, based on accumulated knowledge or, for ChatGPT, the data it was trained on. First, ChatGPT “decodes” text data by breaking it into words, numbers, and punctuation. These elements are converted to numbers by a deep learning neural network, which is essentially a series of complex mathematical models inspired by signaling in the human brain. These complicated equations consist of billions of parameters, which are updated over thousands of training iterations. The resulting neural network generates probability scores to predict what text should come next based on its prior knowledge: the data it was trained on. Just as children who read a lot are generally good at writing, the more training data ChatGPT sees, the more patterns it learns from human language, and the more plausible the text it can generate. In fact, ChatGPT was trained on a massive corpus of data – basically, anything you can find on the open internet. This training corpus is on the scale of billions of language elements, which enables ChatGPT to accumulate prodigious amounts of knowledge. In this way, ChatGPT can acquire information about medicine through its pattern recognition process from open-access medical journals, websites of public health organizations like the Centers for Disease Control and Prevention, and other sources like WebMD and Medscape.
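The idea of predicting the next word from probability scores can be illustrated with a toy sketch. The snippet below is not ChatGPT’s actual architecture (which uses a transformer neural network with billions of parameters, not simple word counts); it is a minimal bigram model, invented here for illustration, that counts which word follows which in a tiny made-up “training corpus” and turns those counts into next-word probabilities:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for ChatGPT's web-scale training data.
corpus = (
    "the patient has a headache . "
    "the patient has a fever . "
    "the doctor treats the patient . "
).split()

# Count how often each word follows each preceding word (a "bigram" model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(word):
    """Turn raw counts into probability scores, analogous to the
    probability scores a neural network assigns to each possible next word."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("patient"))
# "has" gets the highest probability simply because it follows
# "patient" most often in the corpus - pattern matching, not understanding.
```

The model confidently continues “patient” with “has” only because that pattern is frequent in its data, which is exactly why a system like this can sound fluent while having no grasp of what a patient or a headache actually is.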

A ChatGPT Case Study

To test the strengths and weaknesses of ChatGPT in a medical setting, I asked ChatGPT a question from a popular question bank for United States Medical Licensing Examination preparation:

A 24-year-old woman is evaluated for a lump in her neck. She discovered the lump a week ago while showering. The patient feels well and has no associated pain, fever, night sweats, or weight changes. Her medical history is unremarkable, although she has had irregular menses for a year since insertion of an intrauterine device. The patient does not use tobacco, alcohol, or illicit drugs. Her mother died during surgery for thyroid cancer. Blood pressure is 133/80 mm Hg and pulse is 78/min. On examination, there is a palpable 2-cm nodule in the right thyroid lobe. The remainder of the examination, including chest, abdomen, extremities, and skin, is normal. Serum thyroid stimulating hormone (TSH) and calcium levels are normal and calcitonin is elevated. Ultrasound-guided aspiration biopsy reveals malignant cells. Which test is the best next step in the evaluation of this patient?

Here is the response from ChatGPT:

Based on the clinical presentation and findings, the patient may have thyroid cancer, and the best next step in evaluation would be to perform imaging studies to determine the extent and stage of the disease. Among the imaging modalities, the most appropriate test for initial evaluation of thyroid cancer is a neck ultrasound, which can provide information on the size and location of the thyroid nodule, as well as the presence of any suspicious cervical lymph nodes.

Other imaging tests that may be useful in the evaluation of thyroid cancer include computer tomography (CT) scan of the neck and chest, magnetic resonance imaging (MRI) of the neck, and radioactive iodine (RAI) scan. However, the initial imaging study of choice is a neck ultrasound, which is non-invasive, readily available, and has high sensitivity and specificity for detecting thyroid nodules and cervical lymph nodes.

In summary, the best next step in the evaluation of a 24-year-old woman with a palpable thyroid nodule and elevated calcitonin levels is to perform a neck ultrasound to determine the extent and stage of thyroid cancer.

In the scenario posed in the test bank question, the patient has all of the symptoms that would lead an experienced doctor to suspect an inherited form of medullary thyroid cancer, and possibly multiple endocrine neoplasia type 2 if certain other types of tumors are present (3). The standard next step would be to screen for one of those other tumor types, specifically pheochromocytoma, a non-cancerous tumor of the adrenal glands that makes and secretes hormones like norepinephrine and epinephrine. If left undetected, this condition can lead to life-threatening high blood pressure during thyroid tumor removal surgery, because the adrenal tumor dysregulates the release of norepinephrine and epinephrine.

However, ChatGPT instead describes how to diagnose a thyroid nodule with a “neck ultrasound” in the first paragraph and explains why this test is preferred in the second. We can see that ChatGPT makes three mistakes here. First, it doesn’t recognize that the diagnosis of thyroid cancer has already been confirmed. The answer it gave is essentially “how to determine whether this thyroid nodule is cancer,” yet the question stated that malignant cells had already been identified. Second, it doesn’t link the patient’s symptoms to the underlying cause – it fails to recognize this as an inherited form of medullary thyroid cancer, perhaps because it ignores the supporting evidence, specifically the elevated calcitonin and the family history of thyroid malignancy. Third, it misses other features associated with that underlying cause – it doesn’t link medullary thyroid cancer to multiple endocrine neoplasia type 2, and it ignores the fact that pheochromocytoma is associated with this condition. Instead, it regurgitates the sentences most likely to come next for “something related to thyroid nodules,” which are more common in the billions of data points it has seen. These three errors all point to a fundamental difficulty in applying AI to complex problems: it is challenging to teach ChatGPT the correct answers because it relies on guessing instead of understanding.

ChatGPT: the secretary instead of the boss

Although there is still much room for improvement in making diagnoses and treatment plans, ChatGPT can be a great “secretary” for physicians. For one, ChatGPT can help with administration and paperwork, such as generating notes and sending messages, on which physicians spend an average of 15.5 hours per week (4). In fact, many high-tech companies have been working on AI documentation tools (5), which generate clinical notes within seconds from conversations with patients conducted in person or via telehealth. ChatGPT can also help with patient-provider communication, like explaining complicated diseases or procedures, sending orders and prescriptions, and generating patient handouts and instructions.

In addition to reducing the administrative burden of practicing medicine, ChatGPT’s impact on the clinical workflow could be more complex and far-reaching. Healthcare providers are almost always working in a high-stress and fast-paced environment, where they are prone to mistakes because of burnout, not being up-to-date with the current medical guidelines, and missing information due to a lack of time to examine patients thoroughly. ChatGPT may be able to alleviate these issues by taking patient histories and exploring their symptoms comprehensively, supporting physicians’ education and continuous learning, and helping with preventive medicine, such as recognizing disease outbreaks by analyzing global health data. ChatGPT could also help physicians with the emotional side of the job: one study (6) showed that on average, ChatGPT’s responses were 9.8 times more likely to be empathetic than physicians’ when interacting with patients, which could help physicians have difficult conversations and deliver bad news to patients.

Illustration by Zoe Yeoh

Right now, ChatGPT can’t be the “boss” because it cannot fully automate the clinical workflow. As we saw in the case study, the most challenging part of automating clinical workflow is generating correct medical diagnoses and treatments, a complicated task that includes interpreting imaging and lab results, determining differential diagnoses with the likelihoods of each possibility, and deciding on the best treatments based on individual patient characteristics. This is a high-risk, high-reward problem: if ChatGPT can accurately diagnose and prescribe appropriate treatment, it would significantly change the landscape of medicine.

There is still much progress to be made before ChatGPT can reliably rival an experienced human physician in the clinical workflow. Instead of understanding the mechanisms of disease, ChatGPT simply infers the most plausible response from the probability scores produced by its neural network (7). In other words, ChatGPT will sometimes make up information, like an eager child who doesn’t seek to learn and understand but just wants to finish their homework so they can go out and play. As a result, ChatGPT is severely limited when it comes to rare and complex diseases: because it does not actually comprehend information, it cannot connect a patient’s symptoms to their underlying causes when the literature on a disease is sparse. Instead, it will likely default to the most probable diagnoses or management strategies and miss the correct ones.

The future of AI in medicine

Although scientists at OpenAI are working hard to increase the accuracy of its responses, ChatGPT is currently better suited to be a physician’s secretary than to replace them as the boss of the clinical workflow. The bottleneck AI research faces is how to design large language models that fundamentally comprehend knowledge instead of making educated guesses based on probability. Even if ChatGPT can be safely used in low-risk settings such as charting and patient communication, physicians will still need to act like teachers checking ChatGPT’s “homework.” In addition, ChatGPT can’t perform physical exams, so it will need to notify doctors to provide careful examinations when patients have symptoms like shortness of breath, cough, pain, or weakness. Lastly, ChatGPT may not be able to provide sufficient empathy – when one is sick and vulnerable, meeting a doctor in person, hearing their encouragement, and holding their hand can make a big difference that AI simply cannot replace.

Despite these limitations, ChatGPT can still have a major impact on the medical world. It is likely that AI “doctors,” such as an upgraded ChatGPT, will have human-comparable understanding, a large knowledge base, and powerful computing resources, all of which could revolutionize healthcare. If AI can make optimal medical decisions in various complex situations, human doctors will be freed from tedious administrative responsibilities and can spend their time on more meaningful tasks such as prioritizing their mental health and work-life balance or expanding their knowledge of new drugs, rare diseases, and improved surgical procedures. Until then, we must continue to develop ChatGPT so that it can realize its potential in the medical field and, perhaps in the future, be promoted from secretary to boss.


Peijin is a grad student in computational medicine and bioinformatics, where she researches how to predict disease development using patient health data. She was previously trained as a medical doctor in China and hopes to see how AI benefits healthcare providers, patients, and researchers in the near future.
