From Signals to Root Cause: A Systems Architecture for Agentic AI in Observability

Akila Balasubramanian

Keywords:

Agentic artificial intelligence; Root cause analysis; Cloud observability; Large language models; Hypothesis refinement; Distributed systems; Iterative reasoning

Abstract

Modern distributed systems generate high-cardinality telemetry across metrics, logs, and traces, creating a combinatorial search space that makes manual root cause analysis (RCA) increasingly impractical at cloud scale. Existing approaches, including rule-based automation and prompt-driven large language model (LLM) systems, fail to support reliable RCA because they lack structured multi-step reasoning, persistent state management, and deterministic execution. This paper presents an agentic systems framework that models RCA as a closed-loop, sequential decision-making process over observability telemetry. The architecture comprises four layers: a control layer for state-machine-based orchestration, a memory layer for token-aware context management, a tooling layer for deterministic interaction with heterogeneous observability backends, and a governance layer for enforcing correctness, security, and auditability. RCA is executed through iterative hypothesis refinement, supported by algorithms for action selection, evidence aggregation, conflict resolution, and failure recovery. Empirical evaluation across 1,200 production-style troubleshooting tasks shows that the proposed system improves task success rates from 61.8% to 86.7%, reduces user intervention by 3.5×, decreases effective time-to-resolution by approximately 42%, and reduces token consumption by up to 4.8× through adaptive memory strategies. Robustness experiments show a nearly 2× improvement in failure recovery and significant gains in handling ambiguous inputs compared with prompt-only and static pipeline baselines. These results establish that agentic architectures can transform observability from passive telemetry monitoring into active, evidence-driven, automated reasoning.
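The closed-loop process described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the state names, the `Hypothesis` class, the `run_rca` function, the additive score update, and the example tool names and causes are all assumptions chosen for illustration. The sketch assumes each tool is a deterministic function that returns, per candidate cause, an evidence value in [-1, 1], and that the hypothesis list is non-empty.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class State(Enum):
    """Hypothetical control-layer states for one RCA run."""
    SELECT_ACTION = auto()
    GATHER_EVIDENCE = auto()
    UPDATE_HYPOTHESES = auto()
    DONE = auto()


@dataclass
class Hypothesis:
    cause: str
    score: float = 0.5  # confidence in [0, 1]; 0.5 is an assumed neutral prior


# Assumed tool contract: a deterministic call returning evidence per cause,
# positive values supporting the cause and negative values contradicting it.
Tool = Callable[[], Dict[str, float]]


def run_rca(
    hypotheses: List[Hypothesis],
    tools: Dict[str, Tool],
    threshold: float = 0.9,
    step_size: float = 0.25,
) -> Tuple[Hypothesis, List[str]]:
    """Iterate select -> gather -> update until one hypothesis reaches the
    threshold or no tools remain. Returns the best hypothesis and the
    ordered list of tools that were called (a simple audit trail)."""
    state = State.SELECT_ACTION
    pending = list(tools)          # tool names, in insertion order
    used: List[str] = []
    evidence: Dict[str, float] = {}

    while state is not State.DONE:
        if state is State.SELECT_ACTION:
            # Naive action selection: take the next unused tool, if any.
            state = State.GATHER_EVIDENCE if pending else State.DONE
        elif state is State.GATHER_EVIDENCE:
            name = pending.pop(0)
            evidence = tools[name]()
            used.append(name)
            state = State.UPDATE_HYPOTHESES
        else:  # State.UPDATE_HYPOTHESES
            # Additive update, clamped to [0, 1]; an illustrative stand-in
            # for evidence aggregation.
            for h in hypotheses:
                delta = step_size * evidence.get(h.cause, 0.0)
                h.score = min(1.0, max(0.0, h.score + delta))
            best = max(hypotheses, key=lambda h: h.score)
            state = State.DONE if best.score >= threshold else State.SELECT_ACTION

    return max(hypotheses, key=lambda h: h.score), used


# Usage with made-up tools and causes.
tools = {
    "metrics": lambda: {"db_pool_exhaustion": 1.0, "bad_deploy": -1.0},
    "traces": lambda: {"db_pool_exhaustion": 1.0},
    "logs": lambda: {"bad_deploy": 1.0},
}
best, used = run_rca(
    [Hypothesis("bad_deploy"), Hypothesis("db_pool_exhaustion")], tools
)
print(best.cause, used)
```

In this run the loop stops after two tool calls, because the second piece of evidence pushes one hypothesis past the threshold and the `logs` tool is never queried. Making the transitions explicit in a state machine, rather than leaving control flow to free-form prompting, is what keeps each run bounded and replayable.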
