Coordination Without Contracts: Toward Formally Grounded Agentic AI Systems
Keywords:
Multi-Agent Systems, Compositional Safety, Memory Architecture, Trust Propagation, Long-Horizon Evaluation
Abstract
Autonomous AI agents (systems that plan, invoke external tools, spawn sub-agents, and iterate toward long-horizon goals) are rapidly moving from research prototypes to production deployments. Yet the theoretical scaffolding needed to reason about agent behavior remains conspicuously thin. Unlike single-turn language models, which inherit decades of statistical learning theory and empirical benchmarking infrastructure, multi-agent LLM pipelines operate without formal contracts between participants, without verified memory semantics, and without evaluation protocols that reflect the temporal depth of real tasks. This paper argues that the central bottleneck in agentic AI research is not the capability of current frontier models, which are already impressive planners in isolation, but the absence of compositional safety guarantees that survive agent-to-agent delegation.
This work diagnoses four structural limits of the dominant paradigm: (1) context-window memory creates ephemeral, unverifiable state; (2) informal tool-calling interfaces lack precondition/postcondition semantics; (3) inter-agent trust is implicitly inherited rather than explicitly negotiated; and (4) existing benchmarks measure shallow reactive competence rather than long-horizon coherence under adversarial perturbation. Against this diagnosis, four technically detailed research directions are proposed: typed agent communication protocols with verifiable postconditions; hierarchical memory architectures grounded in external write-ahead logs; a trust-propagation algebra for multi-agent delegation chains; and a new benchmark family, Long Horizon Agent Bench (LHAB), designed to stress-test agents over multi-day, multi-session task horizons. The paper outlines proof-of-concept experiments feasible in 2026–2027 and closes with a 36-month research agenda for the community.
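The abstract names precondition/postcondition semantics for tool calls and a trust-propagation algebra for delegation chains without giving their definitions here. As a purely illustrative sketch of what such mechanisms might look like, the Python below wraps a tool call in explicit contract checks and attenuates trust along a delegation chain. Every name (`Contract`, `call_with_contract`, `propagate_trust`, the `transfer` tool) and the multiplicative trust rule are assumptions made for this example, not the paper's formalism.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence


class ContractViolation(Exception):
    """Raised when a tool call breaks its declared pre- or postcondition."""


@dataclass(frozen=True)
class Contract:
    # Hypothetical contract record: both predicates are supplied by the caller.
    name: str
    pre: Callable[[Dict[str, Any]], bool]
    post: Callable[[Dict[str, Any], Any], bool]


def call_with_contract(contract: Contract,
                       tool: Callable[[Dict[str, Any]], Any],
                       args: Dict[str, Any]) -> Any:
    """Run `tool` only if the precondition holds; verify the postcondition after."""
    if not contract.pre(args):
        raise ContractViolation(f"{contract.name}: precondition failed")
    result = tool(args)
    if not contract.post(args, result):
        raise ContractViolation(f"{contract.name}: postcondition failed")
    return result


def propagate_trust(hop_scores: Sequence[float]) -> float:
    """Assumed rule: trust along a chain is the product of per-hop scores in [0, 1],
    so delegation can only preserve or reduce trust, never increase it."""
    trust = 1.0
    for score in hop_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("trust scores must lie in [0, 1]")
        trust *= score
    return trust


# Hypothetical tool: debit an amount from a balance.
def transfer(args: Dict[str, Any]) -> Dict[str, Any]:
    return {"balance": args["balance"] - args["amount"]}


transfer_contract = Contract(
    name="transfer",
    pre=lambda a: 0 < a["amount"] <= a["balance"],
    post=lambda a, r: r["balance"] == a["balance"] - a["amount"],
)

if __name__ == "__main__":
    print(call_with_contract(transfer_contract, transfer,
                             {"balance": 100, "amount": 30}))
    print(propagate_trust([0.5, 0.5]))
```

Under the assumed multiplicative rule, a sub-agent two hops down a chain never holds more trust than any intermediate delegator, which is one simple way to make trust explicitly negotiated rather than implicitly inherited.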
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license lets others remix, transform, and build upon the material, provided they give appropriate credit, provide a link to the license, indicate if changes were made, and distribute their contributions under the same license as the original.


