Patient Identity Resolution in Healthcare Master Data Management Using Ensemble Machine Learning

Authors

  • Somnath Banerjee

Keywords:

Master Data Management, MDM, Healthcare, Patient Identity Resolution, Machine Learning, Ensemble Classification, Siamese Network, Graph Neural Network, Anomaly Detection

Abstract

Patient identity resolution is a cornerstone of healthcare Master Data Management (MDM): it ensures that each record is linked to the correct individual. Despite its importance for patient safety and care continuity, many organizations struggle with fragmented identities caused by inconsistent data entry, the absence of a universal identifier, and increasing data heterogeneity across electronic health records (EHRs). Traditional deterministic and probabilistic matching methods, widely embedded in commercial MDM tools, have notable shortcomings: high false positive and false negative rates, heavy reliance on manual stewardship, black-box implementations, and significant licensing costs. This paper examines machine learning techniques that address these challenges, including logistic regression, support vector machines, gradient boosting, bidirectional long short-term memory networks, and Siamese networks. The strengths and limitations of each model are compared with respect to matching accuracy, interpretability, and training data requirements. Comparative analyses suggest that deep learning models, particularly Siamese networks, excel in text-rich identity resolution tasks, while methods such as gradient boosting balance accuracy against operational efficiency. To address the complexity of healthcare data, the paper proposes a multi-model patient matching solution. To improve pre-match data quality, it incorporates data preprocessing techniques, including anomaly detection via isolation forests and autoencoders, and transformer-based natural language processing for feature extraction. An ensemble learning architecture then integrates the complementary strengths of multiple machine learning models to achieve robust, scalable, and explainable patient identity resolution. The findings underscore the potential of machine learning to modernize patient identity resolution across enterprise healthcare systems.
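
The ensemble idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the field names, the three stand-in base models, their weights, and the decision thresholds are all hypothetical, and in practice the base models would be trained classifiers (for example logistic regression, gradient boosting, or a Siamese network) rather than hand-written functions. The sketch only shows the general shape of the approach: build similarity features for a candidate record pair, score the pair with several models, and combine the scores by soft voting.

```python
from difflib import SequenceMatcher
from statistics import fmean

# Hypothetical demographic fields; the paper does not prescribe a schema.
FIELDS = ("first_name", "last_name", "dob", "zip")


def pair_features(a: dict, b: dict) -> list[float]:
    """Per-field string similarity in [0, 1] for one candidate record pair."""
    return [
        SequenceMatcher(None, a.get(f, "").lower(), b.get(f, "").lower()).ratio()
        for f in FIELDS
    ]


# Stand-ins for trained base models (assumption): each maps a feature
# vector to a match score in [0, 1].
def mean_similarity_model(x: list[float]) -> float:
    return fmean(x)


def weakest_field_model(x: list[float]) -> float:
    return min(x)


def weighted_field_model(x: list[float]) -> float:
    field_weights = (0.2, 0.3, 0.4, 0.1)  # hypothetical field importances
    return sum(w * v for w, v in zip(field_weights, x))


MODELS = (mean_similarity_model, weakest_field_model, weighted_field_model)


def ensemble_score(x: list[float], models=MODELS, model_weights=None) -> float:
    """Soft voting: weighted average of the base-model scores."""
    weights = model_weights or [1.0] * len(models)
    return sum(w * m(x) for w, m in zip(weights, models)) / sum(weights)


def classify(score: float, lower: float = 0.6, upper: float = 0.85) -> str:
    """Route a pair to auto-match, manual review, or non-match (hypothetical thresholds)."""
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "review"


if __name__ == "__main__":
    rec_a = {"first_name": "John", "last_name": "Smith", "dob": "1980-02-03", "zip": "46202"}
    rec_b = {"first_name": "Jon", "last_name": "Smith", "dob": "1980-02-03", "zip": "46202"}
    score = ensemble_score(pair_features(rec_a, rec_b))
    print(round(score, 3), classify(score))
```

The middle "review" band reflects the manual stewardship step mentioned in the abstract: pairs the ensemble cannot decide confidently would be routed to a data steward rather than merged automatically.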

Published

20.01.2026

How to Cite

Somnath Banerjee. (2026). Patient Identity Resolution in Healthcare Master Data Management Using Ensemble Machine Learning. International Journal of Intelligent Systems and Applications in Engineering, 14(1), 37–45. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/8191

Section

Research Article