We Handled a 200GB/day Log Volume Without Breaking the Bank: A Practical Framework for Cost-Effective Observability at Scale

Authors

  • Rahul Bhatia

Keywords:

Log Management, Observability, Telemetry Pipeline, Data Tiering, Log Filtering, Cost Optimization, Distributed Systems, Indexing, Retention Policy, Platform Engineering

Abstract

Scaling observability in a cost-effective way is one of the most pressing challenges facing modern engineering teams. As distributed systems grow in complexity and traffic, daily log volumes can reach hundreds of gigabytes, creating a compounding burden on storage infrastructure, indexing engines, and operational budgets. This article presents a practitioner-driven case study documenting the architectural decisions, tooling evaluations, and pipeline optimizations used to manage a sustained high-volume logging workload without exhausting financial resources or degrading system visibility. The strategies explored include log filtering and structured enrichment at the collection edge, dynamic data tiering, field-selective indexing, retention policy governance, and license-aware tooling decisions. These strategies are contextualized against established literature in distributed systems observability, telemetry pipeline design, and cloud cost optimization. The result is a repeatable, layered framework applicable to platform engineers, site reliability engineers, and infrastructure architects responsible for managing large-scale telemetry in production environments.

DOI: https://doi.org/10.17762/ijisae.v14i1s.8234

References

B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, "Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade," ACM Queue, vol. 14, no. 1, pp. 70–93, Jan.–Feb. 2016. Available: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44843.pdf

Shekhar Jha, "Foundations of Observability Engineering," in International Journal of Multidisciplinary on Science and Management, 2024. Available: https://www.ijmsm.org/volume1-issue3/IJMSM-V1I3P104.pdf

Neal Leavitt, "Complex-event processing poised for growth," IEEE Computer, vol. 42, no. 4, pp. 17–20, Apr. 2009. Available: https://www.leavcom.com/pdf/CEP.pdf

Mark D. Syer, et al., "Continuous validation of performance test suites," in Proc. Int. Conf. Performance Engineering (ICPE), Prague, Czech Republic, 2014, pp. 197–208. Available: http://www.cse.yorku.ca/~zmjiang/publications/asej2016_syer.pdf

Adrian Jackson, et al., "Architectures for High Performance Computing and Data Systems using Byte-Addressable Persistent Memory," arXiv:1805.10041v1 [cs.DC] 25 May 2018. Available: https://arxiv.org/pdf/1805.10041

Hyeontaek Lim, et al., "SILT: A memory-efficient, high-performance key-value store," in Proc. 23rd ACM Symp. Operating Systems Principles (SOSP), Cascais, Portugal, 2011, pp. 1–13. Available: https://www.pdl.cmu.edu/PDL-FTP/Storage/sosp11_silt.pdf

Benjamin H. Sigelman et al., "Dapper, a large-scale distributed systems tracing infrastructure," Google, Mountain View, CA, Tech. Rep. Google-TR-2010-003, Apr. 2010. Available: https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf

Pinjia He, et al., "An evaluation study on log parsing and its use in log mining," in Proc. 46th IEEE/IFIP Int. Conf. Dependable Systems and Networks (DSN), Toulouse, France, 2016, pp. 654–661. Available: https://pinjiahe.github.io/files/pdf/research/DSN16.pdf

Valerio Persico, et al., "Measuring network throughput in the cloud: The case of Amazon EC2," Computer Networks, vol. 93, pp. 408–422, Dec. 2015. Available: http://wpage.unina.it/valerio.persico/pubs/tput_cloud_AWS_comnet.pdf

Seyed Ali Mirheidari, et al., "Alert correlation algorithms: A survey and taxonomy," in Proc. Int. Conf. Cyberspace Safety and Security (CSS), Zhangjiajie, China, 2013, pp. 183–197. Available: https://arxiv.org/pdf/1811.00921

Justin Zobel and Alistair Moffat, "Inverted files for text search engines," ACM Computing Surveys, vol. 38, no. 2, pp. 6–es, Jul. 2006. Available: https://dmice.ohsu.edu/bedricks/courses/cs506-problem-solving-with-large-clusters/articles/week1/zobel_invertedindex.pdf

Martin Kleppmann, "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems," Sebastopol, CA: O'Reilly Media, 2017. Available: https://unidel.edu.ng/focelibrary/books/Designing%20Data-Intensive%20Applications%20The%20Big%20Ideas%20Behind%20Reliable,%20Scalable,%20and%20Maintainable%20Systems%20by%20Martin%20Kleppmann%20(z-lib.org).pdf

Royal Borough of Kingston upon Thames, "Information Security and Governance Policy and Framework," Information Systems Frontiers, vol. 21, no. 4, pp. 935–949, Aug. 2019. Available: https://www.kingston.gov.uk/sites/default/files/2025-05/Information_Security_and_Governance_Policy_and_Framework___RBK__Approved_.pdf

Wei Xu, et al., "Detecting large-scale system problems by mining console logs," in Proc. 22nd ACM Symp. Operating Systems Principles (SOSP), Big Sky, MT, 2009, pp. 117–132. Available: https://www.sigops.org/s/conferences/sosp/2009/papers/xu-sosp09.pdf

Min Du, et al., "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in Proc. 2017 ACM SIGSAC Conf. Computer and Communications Security (CCS), Dallas, TX, 2017, pp. 1285–1298. Available: https://users.cs.utah.edu/~lifeifei/papers/deeplog.pdf

Published

14.02.2026

How to Cite

Rahul Bhatia. (2026). We Handled a 200GB/day Log Volume Without Breaking the Bank: A Practical Framework for Cost-Effective Observability at Scale. International Journal of Intelligent Systems and Applications in Engineering, 14(1s), 705–717. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/8234

Section

Research Article