From Signals to Root Cause: A Systems Architecture for Agentic AI in Observability
Keywords:
Agentic artificial intelligence; Root cause analysis; Cloud observability; Large language models; Hypothesis refinement; Distributed systems; Iterative reasoning
Abstract
Modern distributed systems generate high-cardinality telemetry across metrics, logs, and traces, creating a combinatorial search space that renders manual root cause analysis (RCA) increasingly impractical at cloud scale. Existing approaches—including rule-based automation and prompt-driven large language model (LLM) systems—fail to support reliable RCA due to the absence of structured multi-step reasoning, persistent state management, and deterministic execution. This paper presents an agentic systems framework that models RCA as a closed-loop, sequential decision-making process over observability telemetry. A layered architecture is introduced comprising a control layer for state-machine-based orchestration, a memory layer for token-aware context management, a tooling layer for deterministic interaction with heterogeneous observability backends, and a governance layer for enforcing correctness, security, and auditability. RCA is executed through iterative hypothesis refinement, supported by algorithms for action selection, evidence aggregation, conflict resolution, and failure recovery. Empirical evaluation across 1,200 production-style troubleshooting tasks demonstrates that the proposed system improves task success rates from 61.8% to 86.7%, reduces user intervention by 3.5×, decreases effective time-to-resolution by approximately 42%, and reduces token consumption by up to 4.8× through adaptive memory strategies. Robustness experiments show nearly 2× improvement in failure recovery and significant gains in handling ambiguous inputs compared to prompt-only and static pipeline baselines. These results establish that agentic architectures can transform observability from passive telemetry monitoring into active, evidence-driven, automated reasoning.
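The closed-loop process described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the state names, the `Tool` type, the additive scoring rule, the fixed tool order, and the `threshold` parameter are all assumptions introduced here for illustration. The sketch only shows how a state machine can alternate between selecting an action, gathering evidence from a deterministic tool, and refining hypothesis scores until one candidate is sufficiently supported or the tools are exhausted.

```python
from enum import Enum, auto
from typing import Callable, Dict, List, Optional, Tuple


class State(Enum):
    SELECT_ACTION = auto()
    GATHER_EVIDENCE = auto()
    UPDATE_HYPOTHESES = auto()
    DONE = auto()


# Hypothetical tool type: a deterministic query against a telemetry backend that
# returns a support delta per candidate cause (positive = supports that cause).
Tool = Callable[[], Dict[str, float]]


def run_rca(
    candidates: List[str],
    tools: Dict[str, Tool],
    threshold: float = 1.0,
) -> Tuple[Optional[str], List[str]]:
    """Loop until one hypothesis reaches the threshold or no tools remain."""
    if not candidates:
        return None, []
    scores = {cause: 0.0 for cause in candidates}
    pending = list(tools)      # tool names not yet called, in insertion order
    audit: List[str] = []      # ordered record of tool calls (stand-in for auditability)
    current = ""
    evidence: Dict[str, float] = {}
    state = State.SELECT_ACTION

    while state is not State.DONE:
        if state is State.SELECT_ACTION:
            # Assumed policy: call the remaining tools in order.
            if pending:
                current = pending.pop(0)
                state = State.GATHER_EVIDENCE
            else:
                state = State.DONE
        elif state is State.GATHER_EVIDENCE:
            evidence = tools[current]()
            audit.append(current)
            state = State.UPDATE_HYPOTHESES
        elif state is State.UPDATE_HYPOTHESES:
            # Assumed aggregation rule: add each delta to the matching hypothesis.
            for cause, delta in evidence.items():
                if cause in scores:
                    scores[cause] += delta
            best = max(scores, key=scores.get)
            state = State.DONE if scores[best] >= threshold else State.SELECT_ACTION

    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), audit


# Usage with made-up tools and made-up evidence values.
example_tools: Dict[str, Tool] = {
    "metrics": lambda: {"db_saturation": 0.6, "bad_deploy": 0.2},
    "traces": lambda: {"db_saturation": 0.5},
    "logs": lambda: {"bad_deploy": 0.3},
}
cause, audit_log = run_rca(["db_saturation", "bad_deploy"], example_tools)
print(cause, audit_log)
```

In this example the loop stops after the second tool call because the accumulated support for one candidate crosses the threshold, so the third tool is never queried. A real system would replace the fixed tool order and additive scoring with the action selection, evidence aggregation, and conflict resolution algorithms the abstract refers to.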
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


