Evermore | InnovationS


A Primer on Knowledge Acquisition in Unstructured Medical Free Text

We define artificial intelligence enabled knowledge acquisition (AIKA) as AI systems gathering, extracting or discovering data, then indexing and contextualizing it so that it is available to support human processes.  Hence, AIKA is very broad and can be implemented in almost every knowledge management system (KMS).  To illustrate the requirements, techniques and limitations of AIKA, our discussion will center on natural language processing (NLP) information extraction (IE).

Ubiquitous in modern healthcare, the electronic health record (EHR) contains as much as 80% unstructured data in free text, representing a vast amount of patient data.  Although technically recorded as explicit knowledge, free text clinical narratives are functionally tacit because they remain within the confines of the provider-patient conversation.  NLP IE allows computational analysis of these conversations, enabling knowledge capture and discovery.  Within the science of AI, NLP employs methods that facilitate the analysis, manipulation, understanding, inference and modeling of characteristically human verbal or textual communication.  Large language models (LLMs) are a specific use case of NLP that adapt from inputs and external data via machine learning (ML).  (Yang et al., 2022)  LLMs are capable of extracting information from the EHR free text narratives that clinicians compose across the gamut of documents in the patient health record.  (Meystre et al., 2008)

Natural language processing information extraction requires 1.) unstructured text sources, 2.) training text, 3.) human annotation and supervision, and 4.) an LLM.  (Sharda et al., 2020)  Ideally, IE should follow the Cross-Industry Standard Process for Data Mining (CRISP-DM) as a quality and process framework and should output interoperable data classes, elements and structures.  (Henry et al., 1998; Sharda et al., 2020)
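To make the role of human annotation concrete, the following minimal sketch scores a set of extracted items against a human-annotated reference using precision, recall and F1.  It is an illustration under stated assumptions rather than a method taken from the cited works: the function name and the (note, concept) pair representation are hypothetical.

```python
from typing import Set, Tuple

Item = Tuple[str, str]  # hypothetical representation: (note_id, concept_label)


def precision_recall_f1(predicted: Set[Item], gold: Set[Item]) -> Tuple[float, float, float]:
    """Score extracted items against a human-annotated reference set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only; the labels are invented for this example.
gold = {("note1", "housing"), ("note1", "transport"), ("note2", "housing")}
predicted = {("note1", "housing"), ("note2", "housing"), ("note2", "transport")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Exact matching on pairs is the simplest possible choice; a real evaluation would likely need span-level or partial matching rules agreed upon with the annotators.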

Starting with a rules-based extraction model, a real world unstructured data extraction study used manual human extraction to train ML models to extract and curate pertinent information from free text medical records.  This hybrid rules-based model with human fine-tuning creates a recursive system that incrementally improves accuracy, and it performed best when compared to exclusively rules-based or purely supervised systems.  (Adamson et al., 2023)  Several approaches in NLP IE mirror this method, yielding bespoke LLMs trained on hundreds of millions of medical domain texts and billions of words.  (Yang et al., 2022)  Progress is slowed because real world texts with the potential to be used as training data are often inaccessible due to journal paywalls and healthcare systems’ concerns about protected health information (PHI).  (Meystre et al., 2008)  However, Gu et al. (2022) demonstrated that LLMs trained on just biomedical domain data outperformed mixed domain trained models, challenging the notion that more data always yields better performance.  Today, publicly available LLMs trained on significantly fewer (but more specific) examples are capable of accurate medical domain IE.  Chen (2021) demonstrated that an open source model trained with just 1000 documents outperformed an industrial pre-trained extraction solution.  Moreover, Consoli et al. (2025) demonstrated human comparable performance with OpenAI’s GPT 3.5 utilizing 2048 sample notes.  The GPT model also demonstrated a tenfold reduction in time and a twentyfold reduction in cost as compared to LLM manual fine-tuning.
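The rules-plus-human-review pattern described above can be sketched in a few lines.  This is a toy illustration, not the pipeline of any cited study: the rule patterns, the negation cues and the names `RULES`, `extract` and `triage` are all hypothetical and have not been clinically validated.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical rules: concept label -> pattern. Illustrative only.
RULES = {
    "housing_insecurity": re.compile(
        r"\b(?:homeless|unhoused|living in (?:a )?shelter)\b", re.IGNORECASE),
    "transportation_insecurity": re.compile(
        r"\b(?:no transportation|cannot get to appointments?)\b", re.IGNORECASE),
}
# Crude negation cues; a sentence containing one is routed to a human reviewer.
NEGATION = re.compile(r"\b(?:denies|not|no history of)\b", re.IGNORECASE)


@dataclass
class Extraction:
    concept: str
    matched_text: str
    needs_review: bool


def extract(note: str) -> List[Extraction]:
    """Apply every rule to every sentence of a free text note."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        for concept, pattern in RULES.items():
            match = pattern.search(sentence)
            if match:
                results.append(
                    Extraction(concept, match.group(0), bool(NEGATION.search(sentence))))
    return results


def triage(extractions: List[Extraction]) -> Tuple[List[Extraction], List[Extraction]]:
    """Split rule output into auto-accepted items and a human review queue."""
    accepted = [e for e in extractions if not e.needs_review]
    review_queue = [e for e in extractions if e.needs_review]
    return accepted, review_queue


note = ("Patient is currently homeless. Patient denies living in a shelter. "
        "Reports no transportation to clinic.")
accepted, review_queue = triage(extract(note))
```

In a hybrid system of the kind discussed, the reviewer's corrections to the queued items would be fed back as labeled examples for model training; that feedback step is omitted here.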

CRISP-DM requires data preparation, which entails significant effort to validate, clean and organize the raw data from our extraction model.  This is important for the computational analysis and data modeling required in the framework’s later stages.  (Sharda et al., 2020)  Central to this endeavor are defined data classes and fields.  The advent of Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR), the widespread use of the International Classification of Diseases, Tenth Revision (ICD-10) and the granularity of the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) provide structure for extracted data to be organized and retrieved.  (Han et al., 2022)  Additionally, the advent of data lakes and warehouses allows for scalable unstructured data ingestion and structured data output, respectively.  (Sharda et al., 2020)  A diagram of NLP IE’s general processes and elements is shown in Figure 1.
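As an illustration of what a defined data class can look like, the sketch below wraps one extracted finding in a dictionary shaped like an HL7 FHIR Observation resource with a SNOMED CT coding.  It is a minimal sketch under stated assumptions: the helper name is hypothetical, the concept code is a deliberate placeholder rather than a real SNOMED CT identifier, and the result is not validated against any FHIR profile.

```python
import json

# Code system URI that FHIR uses to identify SNOMED CT.
SNOMED_SYSTEM = "http://snomed.info/sct"


def to_fhir_observation(patient_id: str, code: str, display: str,
                        evidence_text: str) -> dict:
    """Build a minimal Observation-shaped dictionary for one extracted finding."""
    return {
        "resourceType": "Observation",
        # "preliminary" marks the finding as not yet confirmed by a human.
        "status": "preliminary",
        "code": {
            "coding": [{"system": SNOMED_SYSTEM, "code": code, "display": display}]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        # The source sentence is kept so a reviewer can trace the finding.
        "valueString": evidence_text,
    }


observation = to_fhir_observation(
    patient_id="example-123",            # hypothetical identifier
    code="PLACEHOLDER-0000",             # placeholder, NOT a real SNOMED CT code
    display="Housing insecurity (illustrative)",
    evidence_text="Patient is currently homeless.",
)
serialized = json.dumps(observation, indent=2)
```

Keeping the evidence sentence alongside the coded concept is one possible design choice; it preserves traceability from structured output back to the free text.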

We identified key barriers to effective NLP IE:  1.) irregularities in documentation style between providers and institutions, 2.) poor representation of some concepts within the training data, 3.) no medical domain consensus on bias evaluation and prevention, and 4.) no medical domain consensus on model misalignment evaluation and prevention.

Despite advances in NLP, it is important to acknowledge the shortcomings of the source data.  Clinical free texts exhibit far more language irregularities than other sources, posing inherent challenges to NLP.  Variations in abbreviations, shorthand, manners of speaking and other irregularities challenge NLP models attempting to contextualize, infer, and define meaning.  (Meystre et al., 2008)  In that same vein, not all concepts are represented in the training data equally in number, quality or completeness.  For example, social determinants of health like housing, financial or transportation insecurity are, in general, not well documented in the medical record.  (Cook et al., 2021)
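Uneven concept representation can be checked before training by counting how many annotated notes mention each concept of interest.  The sketch below is one simple way to do so; the function name, the input format (one list of labels per note) and the 5% default threshold are assumptions made for illustration, not values drawn from the cited literature.

```python
from collections import Counter
from typing import Dict, Iterable, List


def concept_coverage(annotated_notes: List[Iterable[str]],
                     concepts: Iterable[str],
                     min_share: float = 0.05) -> Dict[str, dict]:
    """Report, per concept, the number and share of notes that mention it."""
    total = len(annotated_notes)
    # Count each concept at most once per note.
    counts = Counter(label for labels in annotated_notes for label in set(labels))
    report = {}
    for concept in concepts:
        share = counts[concept] / total if total else 0.0
        report[concept] = {
            "notes": counts[concept],
            "share": share,
            "underrepresented": share < min_share,
        }
    return report


# Illustrative labels only; a real corpus would be far larger.
notes = [
    ["diabetes"],
    ["diabetes", "housing_insecurity"],
    ["diabetes", "diabetes"],
    ["hypertension"],
]
report = concept_coverage(
    notes,
    concepts=["diabetes", "housing_insecurity", "transportation_insecurity"],
    min_share=0.3,
)
```

Passing the list of concepts explicitly means a concept that never appears is still reported, with a count of zero, rather than silently omitted.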

As in the medical domain, the frameworks needed to address general domain AI concerns, like bias and misalignment, are in their infancy and remain controversial.  For example, the AI Bill of Rights, drafted in 2022 by the previous United States Presidential administration, created a high level federal framework directly addressing the necessity of human fallback and algorithmic discrimination protections.  (Blueprint for an AI Bill of Rights, memorandum, October 2022)  The current administration rolled back much of the language, citing the preservation of American innovation and competitiveness in the AI domain.  (R. T. Vought, memoranda, April 2025-a, April 2025-b)  Policy regarding these systems currently lies with developers, vendors and the institutions they serve.  At scale, if not vigilantly guarded against, bias can proliferate within the health system augmented by artificial intelligence.  (Ong et al., 2024)  Moreover, some research acknowledges the difficulty of detecting, predicting, remedying and preventing AI system behavior that runs contrary to human beneficial values, goals, intentions, and social norms.  Generative models, the basis of some IE methods, tend towards sycophancy and reward hacking, in which the system finds a loophole that makes it appear the goals have been met.  (Dung, 2023)

The requirements and limitations outlined in this work regarding AIKA resonate with general domain AI and are especially significant when AIKA is employed in a KMS solution aiding healthcare decision-making and delivery.  Resolving them is beyond the scope of this work.  However, it is important to understand these topics in order to inform our opinions on AIKA techniques and help shape policy (internal and external) regarding critical patient-facing implementations.

References 

Adamson, B., Waskom, M., Blarre, A., Kelly, J., Krismer, K., Nemeth, S., Gippetti, J., Ritten, J., Harrison, K., Ho, G., Linzmayer, R., Bansal, T., Wilkinson, S., Amster, G., Estola, E., Benedum, C. M., Fidyk, E., Estévez, M., Shapiro, W., & Cohen, A. B. (2023). Approach to machine learning for extraction of real-world data variables from electronic health records. Frontiers in Pharmacology, 14, 1180962.

Becerra-Fernandez, I., Sabherwal, R., & Kumi, R. (2024). Chapter 14. Knowledge Management: Systems and Processes in the AI Era (3rd ed.). Abingdon, Oxon.

Blueprint for an AI Bill of Rights. (2022, October). [Memorandum]

Chen, Z. (2021). Extraction of Social Determinants of Health from Electronic Health Records using Natural Language Processing.

Consoli, B., Wu, X., Wang, S., Zhao, X., Wang, Y., Rousseau, J., Shen, L., Xu, H., Peng, Y., Long, Q., Chen, T., & Ding, Y. (n.d.). SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH).

Cook, L. A., Sachs, J., & Weiskopf, N. G. (2021). The quality of social determinants data in the electronic health record: A systematic review. Journal of the American Medical Informatics Association, 29(1), 187–196.

Dung, L. (2023). Current cases of AI misalignment and their implications for future risks. Synthese, 202(5), 138.

Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., & Poon, H. (2022). Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Transactions on Computing for Healthcare, 3(1), 1–23.

Han, S., Zhang, R. F., Shi, L., Richie, R., Liu, H., Tseng, A., Quan, W., Ryan, N., Brent, D., & Tsui, F. R. (2022). Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. Journal of Biomedical Informatics, 127, 103984.

Henry, S. B., Warren, J. J., Lange, L., & Button, P. (1998). A Review of Major Nursing Vocabularies and the Extent to Which They Have the Characteristics Required for Implementation in Computer-based Systems. Journal of the American Medical Informatics Association, 5(4), 321–328.

Meystre, S. M., Savova, G. K., Kipper-Schuler, K. C., & Hurdle, J. F. (2008). Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. Yearbook of Medical Informatics, 17(01), 128–144.

Ong, J. C. L., Seng, B. J. J., Law, J. Z. F., Low, L. L., Kwa, A. L. H., Giacomini, K. M., & Ting, D. S. W. (2024). Artificial intelligence, ChatGPT, and other large language models for social determinants of health: Current state and future directions. Cell Reports Medicine, 5(1), 101356.

Sharda, R., Delen, D., & Turban, E. (2020). Chapter 7. Analytics, Data Science, and Artificial Intelligence: Systems for Decision Support (11th ed.). Pearson Education Limited.

Sheikh, H., Prins, C., & Schrijvers, E. (2023). Mission AI: The New System Technology. Springer International Publishing.

Vought, R. T. (n.d.-a). MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES [Personal communication].

Vought, R. T. (n.d.-b). MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES [Personal communication].

Yang, X., Chen, A., PourNejatian, N., Shin, H. C., Smith, K. E., Parisien, C., Compas, C., Martin, C., Costa, A. B., Flores, M. G., Zhang, Y., Magoc, T., Harle, C. A., Lipori, G., Mitchell, D. A., Hogan, W. R., Shenkman, E. A., Bian, J., & Wu, Y. (2022). A large language model for electronic health records. Npj Digital Medicine, 5(1), 194.

Figure 1. General processes and elements of NLP IE.