Patient Identity Resolution in Healthcare Master Data Management Using Ensemble Machine Learning
Keywords:
Master Data Management, MDM, Healthcare, Patient Identity Resolution, Machine Learning, Ensemble Classification, Siamese Network, Graph Neural Network, Anomaly Detection
Abstract
Patient identity resolution is a cornerstone of healthcare Master Data Management (MDM), ensuring accurate linkage of records to the correct individual. Despite its importance for patient safety and care continuity, many organizations struggle with fragmented identities due to inconsistent data entry, the absence of a universal identifier, and increasing data heterogeneity across electronic health records (EHRs). Traditional deterministic and probabilistic matching methods, widely embedded in commercial Master Data Management tools, exhibit notable shortcomings such as high false positive and false negative rates, heavy reliance on manual stewardship, black-box implementations, and significant licensing costs. This paper examines machine learning techniques that address these challenges, including logistic regression, support vector machines, gradient boosting, bidirectional long short-term memory networks, and Siamese networks. Each model’s strengths and limitations are compared with respect to matching accuracy, interpretability, and training data requirements. Comparative analyses suggest that while deep learning models, particularly Siamese networks, excel in text-rich identity resolution tasks, methods like gradient boosting strike a balance between accuracy and operational efficiency. To address the complexity of healthcare data, the paper proposes a multi-model patient matching solution. It incorporates data preprocessing techniques, including anomaly detection via isolation forests and autoencoders, and transformer-based natural language processing for feature extraction, to improve pre-match data quality. An ensemble learning architecture then integrates the complementary strengths of multiple machine learning models to achieve robust, scalable, and explainable patient identity resolution. The findings underscore machine learning’s potential to reshape and modernize patient identity resolution across enterprise healthcare systems.
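As an illustration of the ensemble idea summarized in the abstract, the following minimal Python sketch scores a pair of patient records with several base scorers and combines them by weighted averaging (soft voting). It is a sketch under stated assumptions, not the paper's implementation: the field names, the three rule-like scorers (stand-ins for trained models such as gradient boosting or a Siamese network), the weights, and the soft-voting rule are all hypothetical choices made for illustration, since the abstract does not specify how the models are combined.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Sequence, Tuple

Record = Dict[str, str]
Features = Dict[str, float]

# Hypothetical demographic fields; a real MDM schema will differ.
FIELDS: Tuple[str, ...] = ("first_name", "last_name", "dob", "zip")


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; 0.0 if either side is empty."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def pair_features(r1: Record, r2: Record) -> Features:
    """One similarity feature per field for a candidate record pair."""
    return {f: field_similarity(r1.get(f, ""), r2.get(f, "")) for f in FIELDS}


# Base scorers: simple stand-ins for trained models (assumption).
def mean_scorer(x: Features) -> float:
    return sum(x.values()) / len(x)


def strict_scorer(x: Features) -> float:
    # Penalizes any single strongly disagreeing field.
    return min(x.values())


def key_field_scorer(x: Features) -> float:
    # Emphasizes date of birth combined with name agreement.
    return x["dob"] * (x["first_name"] + x["last_name"]) / 2


# (scorer, weight) pairs; the weights are illustrative, not tuned.
ENSEMBLE: Sequence[Tuple[Callable[[Features], float], float]] = (
    (mean_scorer, 0.5),
    (strict_scorer, 0.3),
    (key_field_scorer, 0.2),
)


def ensemble_score(r1: Record, r2: Record) -> float:
    """Soft-voting ensemble: weighted average of the base scorers' outputs."""
    x = pair_features(r1, r2)
    total_weight = sum(w for _, w in ENSEMBLE)
    return sum(w * scorer(x) for scorer, w in ENSEMBLE) / total_weight


# Synthetic example records (not real patient data).
rec_a = {"first_name": "Katherine", "last_name": "O'Neil",
         "dob": "1984-03-12", "zip": "60614"}
rec_b = {"first_name": "Kathrine", "last_name": "ONeil",
         "dob": "1984-03-12", "zip": "60614"}
rec_c = {"first_name": "Robert", "last_name": "Zhang",
         "dob": "1990-11-02", "zip": "10001"}
```

Calling `ensemble_score(rec_a, rec_b)` should yield a noticeably higher score than `ensemble_score(rec_a, rec_c)`, because the first pair differs only by small spelling variations. In practice, each base scorer would be replaced by a trained classifier's match probability, and the combination rule would be chosen and validated on labeled record pairs.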
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, users must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


