ML-Based Deduplication and Semantic Collapsing in High-Volume Communication Streams

Authors

  • Ankita Kamat

Keywords:

Deduplication, Semantic Similarity, Notification Overload, Event Streams, Sentence Embeddings, Approximate Nearest Neighbors, Online Clustering, Personalization, Stream Processing, Knowledge Distillation.

Abstract

High-volume communication ecosystems generate large quantities of overlapping and redundant notifications that dilute user attention and inflate infrastructure costs. This article presents a machine learning-driven architecture for identifying semantic similarity across event streams and collapsing redundant notifications into concise, actionable updates. The proposed system combines deterministic deduplication, dense embedding inference, approximate nearest neighbor retrieval, and online clustering into a single streaming pipeline that operates within strict latency budgets and respects personalization requirements. Evaluation covers offline clustering accuracy, online user engagement, and infrastructure cost savings. By transforming raw event floods into coherent, user-meaningful communication, the approach reduces notification overload, preserves semantic fidelity, and improves platform operational efficiency.
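The pipeline stages named in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: the class name `Collapser`, the `SIM_THRESHOLD` value, and the helper functions are hypothetical. A bag-of-words vector stands in for a dense sentence embedding, and a linear scan over cluster centroids stands in for approximate nearest neighbor retrieval, so that the example runs with the standard library alone. Latency budgets and personalization are out of scope here.

```python
import hashlib
import math
import re
from collections import Counter

SIM_THRESHOLD = 0.6  # hypothetical cut-off; a real system would tune this


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text):
    """Deterministic deduplication key: SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def embed(text):
    """Stand-in for dense embedding inference: sparse token-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Collapser:
    """Exact-match dedup followed by threshold-based online clustering."""

    def __init__(self, threshold=SIM_THRESHOLD):
        self.threshold = threshold
        self.seen = set()   # fingerprints of events already processed
        self.clusters = []  # each: {"centroid": Counter, "members": [str]}

    def process(self, text):
        """Return 'duplicate', 'merged', or 'new' for one incoming event."""
        fp = fingerprint(text)
        if fp in self.seen:
            return "duplicate"
        self.seen.add(fp)

        vec = embed(text)
        # Linear scan over centroids; an ANN index would replace this at scale.
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim

        if best is not None and best_sim >= self.threshold:
            # The centroid is kept as a running sum of member vectors;
            # cosine similarity is scale-invariant, so no division is needed.
            best["centroid"].update(vec)
            best["members"].append(text)
            return "merged"

        self.clusters.append({"centroid": Counter(vec), "members": [text]})
        return "new"
```

Feeding events one at a time through `Collapser().process(...)` mirrors the streaming setting: exact repeats are dropped at the hash stage before any similarity work is done, and only the remaining events are compared against existing clusters.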

Published

20.05.2026

How to Cite

Ankita Kamat. (2026). ML-Based Deduplication and Semantic Collapsing in High-Volume Communication Streams. International Journal of Intelligent Systems and Applications in Engineering, 14(1s), 978 –. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/8293

Section

Research Article