Generative AI Cheat Bot
Almost daily, we learn about new applications of artificial intelligence (AI) and large language models (LLMs) in research and healthcare. To assess the performance of a popular generative AI chatbot built on an LLM, I conducted a series of experiments, starting with a simple question. The response time and quality were impressive. However, as I increased the questions' complexity and specificity, the "thinking" period stretched and the answers became more general. To probe the depth of the chatbot's capabilities, I posed research questions from my recently published article, prompting it to respond in an academic writing format. The chatbot generated a paragraph with specific answers supported by citations. At first, I was disappointed in myself for not having found those references during the literature search for my manuscript.
Old Friend Google
When I googled the chatbot's generated paragraph, I found a newswire report that matched roughly 90% of the AI-generated text. The report summarized a series of interviews with experts on the subject. I was astounded when I examined the AI-generated citations: the chatbot had fabricated every reference, combining research question keywords with respected authors, peer-reviewed journals and a random digital object identifier (DOI).
Dissection of Reference Fabrication
I systematically searched each reference. First, I googled the whole reference, then its main parts separately, in the following order: title, author, journal and DOI. Here are the results of the analysis:
Citation #1: Tomblin Murphy G. et al. 2013. Nurse Practitioners, Physician Assistants and Certified Nurse Midwives in Canadian Neonatal Intensive Care Units: A Mixed Methods Study. Human Resources for Health 11(30). doi: 10.1186/1478-4491-11-30.
- The author is a renowned Canadian researcher and scholar with several papers in peer-reviewed journals.
- The title is a collage, built from the research question keywords, of the title of the following article:
- Carzoli R.P., M. Martinez-Cruz, S. Murphy and T. Chiu. 1994. Comparison of Neonatal Nurse Practitioners, Physician Assistants, and Residents in the Neonatal Intensive Care Unit. Archives of Pediatrics and Adolescent Medicine 148(12): 1271-76. doi: 10.1001/archpedi.1994.02170120033005.
- The journal is a peer-reviewed journal from a known publisher with a two-year impact factor of 4.83. Volume 11 was published in 2013. However, the cited article does not exist.
- The doi belongs to the following article:
- Faye A., P. Fournier, I. Diop, A. Philibert, F. Morestin and A. Dumont. 2013. Developing a Tool to Measure Satisfaction among Health Professionals in Sub-Saharan Africa. Human Resources for Health 11: 30. doi: 10.1186/1478-4491-11-30.
Citation #2: Dhawan S. et al. 2021. Impact of COVID-19 on Health Workforce and Health Systems: A Scoping Review. Journal of Primary Care & Community Health 12. doi: 10.1177/21501327211016941.
- The author: Google Scholar identified multiple researchers from India with the same surname and initials. However, none had publications on the health human resources subject.
- The title was not found on PubMed or Google Scholar. The closest published literature is:
- Khalil-Khan A. and M. Ab Khan. 2023. The Impact of COVID-19 on Primary Care: A Scoping Review. Cureus 15(1): e33241. doi: 10.7759/cureus.33241.
- The journal is a peer-reviewed journal from a known publisher with a two-year impact factor of 3.6. Volume 12 was published in 2021. However, the cited article does not exist.
- The DOI could not be identified on PubMed, Google Scholar or through a general Google search.
Citation #3:
- The author is an emergency medicine clinician scientist with multiple COVID-19 and health human resources publications.
- The title is borrowed from the following article mixed with keywords from the research question:
- Kasthuri A. 2018. Challenges to Healthcare in India – The Five A's. Indian Journal of Community Medicine 43(3): 141–43. doi: 10.4103/ijcm.IJCM_194_18.
- The journal is a peer-reviewed journal from a known publisher with a two-year impact factor of 0.3. However, the article does not exist in Vol. 10.
- The doi belongs to the following article:
- Dalal A., A. Kumar, K. Arivarasan, A. Dahale, S. Sachdeva, U. Sonika et al. 2021. Colonic Stenting Using Side-Viewing Endoscope: A Case Report. Journal of Digestive Endoscopy 12(4): 247-48. doi: 10.1055/s-0040-1713833.
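The manual checks above can be partially automated. As a minimal sketch (assuming network access, and using the public Crossref REST API at api.crossref.org, which resolves DOI metadata), one can test whether a cited DOI corresponds to a real record and compare the registered title against the citation's claimed title:

```python
import json
import urllib.error
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"

def crossref_url(doi: str) -> str:
    """Build the Crossref metadata lookup URL for a DOI string."""
    return CROSSREF_API + doi.strip()

def lookup_title(doi: str):
    """Return the title registered for a DOI, or None if no record exists."""
    try:
        with urllib.request.urlopen(crossref_url(doi), timeout=10) as resp:
            record = json.load(resp)
        titles = record["message"].get("title", [])
        return titles[0] if titles else None
    except urllib.error.HTTPError:
        return None  # DOI not registered with Crossref

# Example: looking up "10.1186/1478-4491-11-30" returns the Faye et al.
# title, not the title the chatbot attached to it in Citation #1.
```

A check like this catches both failure modes seen above: a DOI that resolves to nothing (Citation #2) and a DOI that resolves to an entirely different article (Citations #1 and #3).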
What is an LLM?
LLMs are next-word prediction statistical models built on machine-learning (ML) neural networks. They are typically trained on unlabelled data sets using self- or semi-supervised learning. After training, the model examines the prompt and fills in the gaps with the words that had the highest probability in its training data. However, LLMs are prone to bias: if an LLM's training data are biased or incomplete, its responses can be equally unreliable. When the model strays into confidently generating plausible but false output, the result is known as a "hallucination." Many believe that at this preliminary stage (i.e., today), an LLM has no concept of fact; it predicts the next word purely as a statistical probability based on what it has seen so far. The LLM is a system that spouts text without any grounding in context. Prompt engineering, the shaping and optimizing of text prompts, can coax better results from an LLM.

Despite their capabilities, generative AIs are just statistical models that analyze vast amounts of information and attempt to mimic human intelligence; they do not understand what they generate. In this case, the AI picked up keywords from the research question and, based on its training (e.g., publications on the same subject), stitched together a journal citation using the standard citation structure: author, title, journal and DOI. To a statistical model, the probability that one of the selected authors published a research article in the chosen journal is relatively high. The model therefore treated the citation as appropriate, even though it was false.
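The next-word prediction described above can be illustrated with a toy example. The sketch below is a deliberately tiny stand-in for a real neural LLM: it counts word bigrams in a small corpus and always emits the statistically most likely continuation, with no notion of whether the resulting sequence is factually true:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus: str):
    """Count, for each word, how often each next word follows it."""
    words = corpus.lower().split()
    model = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def predict_next(model, word: str):
    """Return the most probable next word seen in training, if any."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = (
    "nurse practitioners work in neonatal intensive care units "
    "nurse practitioners publish in peer-reviewed journals"
)
model = train_bigrams(corpus)
print(predict_next(model, "nurse"))  # "practitioners" follows "nurse" twice
```

A model like this will happily splice fragments from different sentences into fluent but unattested sequences, which is essentially how the chatbot spliced real authors, real journals and real DOIs into citations that never existed.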
The ramifications of this random stitching and data fabrication can be severe in the real world. In this controlled exercise, the AI generated a paragraph in response to a scientific research question that had already been answered in a published article; the fun factor trumped the risk. In real-world scenarios, however, the rush to use AI can be catastrophic.
Recently, a court fined a lawyer who had researched a legal case with a generative AI chatbot because the chatbot made up the cases he cited. Several organizations are using AI engines to identify trends and predict the future (e.g., testing biomarkers and identifying the risk of a disease); we still remember Elizabeth Holmes's story. Others are creating AI to design clinical trials, identify cohorts and analyze trial data. But what if the AI gives the best possible answer by generating its own patient data? Sadly, the rapid progress of AI engines has caught regulators and industry off guard, and they have yet to adopt strategies and laws enforcing the appropriate use of AI.
Recent attention to AI stems from public access to generative AI chatbots, but the principles behind predictive statistical models are not new. Different industries have used mathematical models (algorithms) for years to predict events and trigger actions. For example, advanced medical devices use predictive algorithms to monitor cardiac events and deliver lifesaving therapy (e.g., implantable cardiac defibrillators and resynchronization therapy). The difference is that medical technology companies constantly monitor, evaluate and enhance device algorithms, ensuring their safety and accuracy.
However, the proliferation of complex statistical models over the Internet, in the hands of every "Pythonist," creates a prodigious dilemma. At this stage of AI evolution, and as long as AI is prone to bias, hallucination and people's greed, generative LLMs may be fit only for grade-school homework or planning a weekend trip. It will take many more years for an LLM to understand what it produces and learn the concept of fact versus fiction. At that point, the machines could actually start thinking like humans and understanding reality. Alas, this reminds me of George Orwell and Arnold Schwarzenegger.
About the Author(s)
Hamid Sadri is a med-tech executive specializing in health economics and outcomes research and health technology assessment. Hamid has advanced degrees in health economics and healthcare administration from the University of Toronto and has authored 21 research articles in peer-reviewed journals. He is currently the director of Market Access at Medtronic ULC.