Knowledge Net: An Automated Clinical Knowledge Graph Generation Framework for Evidence Based Medicine
Abstract
To practice evidence-based medicine, clinicians need to find the research most suitable for a given clinical decision. Using knowledge graphs (KGs) and Neuro-Symbolic methods to integrate and analyze complex, heterogeneous healthcare data is critical to enabling evidence-based treatment in clinical decision support systems (CDSS). Healthcare generates a vast amount of data, including electronic health records (EHRs), medical images, genetic information, research papers, and clinical guidelines. Neuro-Symbolic AI can use its neural network component to process unstructured data while using symbolic reasoning to interpret the data and make logical inferences. This enables a deeper understanding of patient data, leading to more accurate diagnoses, personalized treatment plans, and improved patient outcomes. By incorporating symbolic reasoning, Neuro-Symbolic AI systems can also explain their outputs, making them more transparent and interpretable. Large-scale KGs play a pivotal role in enabling Neuro-Symbolic AI in healthcare because they can integrate large, heterogeneous healthcare data, including medical ontologies, clinical guidelines, drug databases, patient records, and research literature. Existing KG construction frameworks are not fully automated; construction is predominantly manual or semi-automated and requires substantial effort and expertise. The challenges include identifying knowledge sources, disambiguating concepts in context, enriching semantics, determining relationships, and conducting inferential reasoning. Automating the extraction of coherent knowledge and the construction of KGs from diverse forms of data remains a longstanding goal in AI research. Moreover, current KG construction frameworks fail to generate KGs that provide relevant information for evidence-based practitioners.
This is because the constructed subgraphs are organized neither by topic nor in a way that supports evidence-based PICO (Participants/Problem, P; Intervention, I; Comparison, C; Outcome, O) queries. KGs built through manual or semi-automated processes cannot adapt to new domains or incorporate constantly changing information into their knowledge base. Consequently, they gradually lose relevance and miss important evidence. Ignoring temporal information and the dynamic nature of entities and relations can therefore lead to erroneous information extraction and suboptimal decision-making. This dissertation proposes a fully automated knowledge graph curation framework that curates information and creates KGs for different clinical domains. The framework employs concept extraction, semantic enrichment, optimized clustering using a Neuro-Symbolic approach, and state-of-the-art recurrent neural networks (RNNs) with BioBERT-based encoded representations to categorize PICO elements and predict relationships between concepts, using a large corpus of publicly available literature on COVID-19 and cerebral aneurysms. The evaluation shows that the proposed framework achieves significant improvement over baseline models, with PICO classification accuracy of 93% on the aneurysm dataset and 82% on the COVID-19 dataset. The Neuro-Symbolic clustering approach outperforms traditional baseline models by 43% and achieves an average precision of 88% across all identified clusters. The relationship extraction module achieves an accuracy of 96%, with precision and recall of 92% and 90%, respectively. Incorporating domain-specific and language models has been shown to enhance the performance of machine learning models, particularly for Neuro-Symbolic clustering, PICO classification, and relation extraction.
Integrating deep learning with symbolic reasoning techniques yields significant improvements in clustering performance, especially in biomedical research domains. Using a BioBERT embedding layer with an LSTM model improves the accuracy of PICO classification by 11% on both the COVID-19 and cerebral aneurysm datasets. Likewise, combining BioBERT with Bi-LSTM and CNN substantially improves the performance of the relation extraction (RE) model. Future work will focus on parallelizing the data processing pipeline to improve the efficiency and scalability of the knowledge graph framework, and on developing an interactive user interface for visualization. Further effort will be dedicated to extending the framework's application to other domains, such as the food supply chain, dietary recommendations, agriculture, and fisheries, each of which poses its own challenges.
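
To make the notion of a topic-specific, PICO query-friendly graph more concrete, the following is a minimal sketch in Python. It is not the dissertation's implementation: the `Triple` and `PicoGraph` names, the field layout, and the sample triples are all hypothetical and chosen only for illustration. It assumes each extracted relationship has already been assigned a topic (for example, a cluster label) and a PICO category by upstream components, and it simply indexes triples so they can be retrieved by topic and PICO element.

```python
from collections import defaultdict
from dataclasses import dataclass

PICO_LABELS = {"P", "I", "C", "O"}  # Participants/Problem, Intervention, Comparison, Outcome


@dataclass(frozen=True)
class Triple:
    """Hypothetical record for one extracted relationship."""
    subject: str
    relation: str
    obj: str
    pico: str   # PICO category assigned by an upstream classifier (assumed)
    topic: str  # topic or cluster label assigned upstream (assumed)


class PicoGraph:
    """Toy index of triples keyed by (topic, PICO category)."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, triple):
        # Reject labels outside the four PICO categories.
        if triple.pico not in PICO_LABELS:
            raise ValueError(f"unknown PICO label: {triple.pico!r}")
        self._index[(triple.topic, triple.pico)].append(triple)

    def query(self, topic, pico):
        # Return a copy so callers cannot mutate the index.
        return list(self._index.get((topic, pico), []))


# Illustrative toy data only; not taken from the dissertation's corpus.
graph = PicoGraph()
graph.add(Triple("endovascular coiling", "treats", "cerebral aneurysm",
                 pico="I", topic="cerebral aneurysm"))
graph.add(Triple("surgical clipping", "compared_with", "endovascular coiling",
                 pico="C", topic="cerebral aneurysm"))

interventions = graph.query("cerebral aneurysm", "I")
print([t.subject for t in interventions])
```

Keying the index on the (topic, PICO category) pair is one simple way to keep lookups topic-specific; a production system would more likely use a graph database and a richer schema.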