ML-Based Deduplication and Semantic Collapsing in High-Volume Communication Streams
Keywords:
Deduplication, Semantic Similarity, Notification Overload, Event Streams, Sentence Embeddings, Approximate Nearest Neighbors, Online Clustering, Personalization, Stream Processing, Knowledge Distillation.
Abstract
High-volume communication ecosystems generate large quantities of overlapping and redundant notifications that dilute user attention and inflate infrastructure costs. This article presents a machine learning-driven architecture for identifying semantic similarity across event streams and collapsing redundant notifications into concise, actionable updates. The proposed system combines deterministic deduplication, dense embedding inference, approximate nearest neighbor retrieval, and online clustering into a single streaming pipeline that operates within strict latency budgets and personalization requirements. The system is evaluated on offline clustering accuracy, online user engagement, and infrastructure cost savings. By transforming raw event floods into coherent, user-meaningful communication, the approach reduces notification overload, preserves semantic fidelity, and improves platform operational efficiency.
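The stages named in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: the class name `Collapser`, the helper `toy_embed`, and the 0.6 similarity threshold are hypothetical choices made for illustration. A sparse bag-of-words vector stands in for a dense sentence encoder, and a brute-force scan over cluster centroids stands in for approximate nearest neighbor retrieval. Latency budgets and per-user personalization are not modeled.

```python
import hashlib
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a dense sentence encoder: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", normalize(text)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Collapser:
    """Exact-hash deduplication followed by threshold-based online clustering."""

    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold  # assumed value, would be tuned offline
        self.seen_hashes = set()
        self.clusters = []  # each cluster: {"centroid": Counter, "members": [str]}

    def process(self, text: str) -> str:
        # Stage 1: deterministic deduplication on the normalized text.
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key in self.seen_hashes:
            return "exact_duplicate"
        self.seen_hashes.add(key)

        # Stage 2: embed, then find the most similar existing cluster.
        # A brute-force scan replaces the ANN index in this sketch.
        vec = toy_embed(text)
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim

        # Stage 3: online clustering. Join the nearest cluster or open a new one.
        if best is not None and best_sim >= self.threshold:
            # The centroid is kept as a running sum; cosine ignores its scale.
            best["centroid"].update(vec)
            best["members"].append(text)
            return "collapsed"
        self.clusters.append({"centroid": Counter(vec), "members": [text]})
        return "new_cluster"
```

Feeding events one at a time through `process` returns a label per event, and each entry in `clusters` then holds the notifications that would be collapsed into a single update. In a deployment closer to what the abstract describes, state would presumably be kept per user and expired over time, which this sketch omits.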
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license permits readers to share and adapt the material, provided they give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


