Enterprise Distributed Infrastructure for Scalable Generative AI Workloads
Keywords:
Generative AI, Distributed Infrastructure, GPU Orchestration, Retrieval-Augmented Generation, Vector Databases, Latency Optimization, Kubernetes, Multi-Region Deployment
Abstract
Enterprise deployment of generative artificial intelligence (GenAI) at production scale exposes a category of infrastructure problems that classical data-center engineering does not anticipate. Unified graphics processing unit (GPU) pools suffer from chronic underutilization and unpredictable tail latency, while naive multi-tier partitioning sacrifices elasticity to obtain predictability. This article develops a dynamic multi-tier compute architecture that reallocates GPU capacity across inference, training, and operational tiers on a two-to-four-hour prediction horizon. The architecture is evaluated against unified and statically partitioned baselines using an eighteen-month observational deployment profile and published industry benchmarks. Dynamic allocation reduces infrastructure cost by 40–55 percent relative to unified pools while improving p95 latency consistency by 25–35 percent; GPU utilization rises from a 62–68 percent unified baseline to 78–81 percent under dynamic control, exceeding the 68–78 percent utilization band reported in vendor benchmarks. A complementary multi-vector-store knowledge fabric with intelligent query routing reduces retrieval latency by 40–55 percent and increases semantic recall to 99.2 percent, while a twelve-region active-active deployment with content-aware routing reduces p95 latency by 72 percent and compresses tail variance sixfold. The article formalizes a latency-cost Pareto frontier that lets enterprise operators reason explicitly about where to sit on the trade-off curve rather than pursuing unbounded cost reduction. Results hold across transformer families spanning 7 billion to 405 billion parameters and across mixed inference-training-operational workload profiles.
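The latency-cost Pareto frontier mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the article's formalization, so the code below is only one common way to compute a two-objective frontier, assuming that both p95 latency and hourly cost are to be minimized. The `Config` type, the `pareto_frontier` function, and every number in `CANDIDATES` are hypothetical and chosen purely for illustration; none of them come from the article's measurements.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """A hypothetical deployment configuration (illustrative fields only)."""
    name: str
    p95_latency_ms: float
    cost_per_hour: float


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates.

    A configuration is dominated when another one is at least as good on
    both latency and cost (lower is better on both axes).
    """
    # Sort by latency first, then cost, so that every earlier entry has
    # latency less than or equal to the current one.
    ordered = sorted(configs, key=lambda c: (c.p95_latency_ms, c.cost_per_hour))
    frontier: List[Config] = []
    best_cost = float("inf")
    for cfg in ordered:
        # Keep a configuration only if it is strictly cheaper than every
        # configuration that is at least as fast.
        if cfg.cost_per_hour < best_cost:
            frontier.append(cfg)
            best_cost = cfg.cost_per_hour
    return frontier


# Made-up candidates, not results from the article.
CANDIDATES = [
    Config("unified", 900.0, 100.0),
    Config("static", 600.0, 80.0),
    Config("dynamic-a", 500.0, 60.0),
    Config("dynamic-b", 400.0, 75.0),
    Config("overprovisioned", 350.0, 140.0),
]

if __name__ == "__main__":
    for cfg in pareto_frontier(CANDIDATES):
        print(f"{cfg.name}: {cfg.p95_latency_ms} ms, {cfg.cost_per_hour} per hour")
```

In this toy data set, the frontier keeps only the configurations where lower latency can be bought solely by paying more. An operator would then choose a point on that curve according to a latency target or a budget, rather than minimizing cost alone.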
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, readers may share and adapt the material provided they give appropriate credit, provide a link to the license, and indicate if changes were made. If they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


