Enterprise Distributed Infrastructure for Scalable Generative AI Workloads

Satish Chandra Guruvelli

Keywords:

Generative AI, Distributed Infrastructure, GPU Orchestration, Retrieval-Augmented Generation, Vector Databases, Latency Optimization, Kubernetes, Multi-Region Deployment

Abstract

Enterprise deployment of generative artificial intelligence (GenAI) at production scale exposes a category of infrastructure problems that classical data-center engineering does not anticipate. Unified graphics processing unit (GPU) pools suffer from chronic underutilization and unpredictable tail latency, while naive multi-tier partitioning sacrifices elasticity to obtain predictability. This article develops a dynamic multi-tier compute architecture that reallocates GPU capacity across inference, training, and operational tiers on a two-to-four-hour prediction horizon. The architecture is evaluated against unified and statically partitioned baselines using an eighteen-month observational deployment profile and published industry benchmarks. Dynamic allocation reduces infrastructure cost by 40–55 percent relative to unified pools while improving p95 latency consistency by 25–35 percent; GPU utilization rises from a 62–68 percent unified baseline to 78–81 percent under dynamic control, which reaches or exceeds the upper end of the 68–78 percent utilization band reported in vendor benchmarks. A complementary multi-vector-store knowledge fabric with intelligent query routing reduces retrieval latency by 40–55 percent and increases semantic recall to 99.2 percent, while a twelve-region active-active deployment with content-aware routing reduces p95 latency by 72 percent and reduces tail variance sixfold. The article formalizes a latency-cost Pareto frontier that lets enterprise operators choose explicitly where to sit on the trade-off curve rather than pursuing unbounded cost reduction. Results hold across transformer families spanning 7 billion to 405 billion parameters and across mixed inference-training-operational workload profiles.
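The latency-cost Pareto frontier mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the article's formalization; it only shows the general idea of keeping the configurations that no other configuration beats on both p95 latency and cost. The `Config` type, the `pareto_frontier` function, and every number in `candidates` are hypothetical and chosen purely for illustration.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """A hypothetical deployment configuration (illustrative values only)."""
    name: str
    p95_latency_ms: float
    cost_per_hour: float


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configs not dominated in (latency, cost), ordered by latency.

    A config is dominated if another config is no worse on both axes
    and strictly better on at least one.
    """
    frontier: List[Config] = []
    best_cost = float("inf")
    # Walk from lowest to highest latency; a config stays on the frontier
    # only if it is strictly cheaper than everything faster than it.
    for c in sorted(configs, key=lambda c: (c.p95_latency_ms, c.cost_per_hour)):
        if c.cost_per_hour < best_cost:
            frontier.append(c)
            best_cost = c.cost_per_hour
    return frontier


# Hypothetical candidates, not measurements from the article.
candidates = [
    Config("A", 120.0, 900.0),
    Config("B", 180.0, 600.0),
    Config("C", 200.0, 650.0),  # slower and more expensive than B
    Config("D", 300.0, 400.0),
    Config("E", 150.0, 950.0),  # slower and more expensive than A
]

frontier = pareto_frontier(candidates)
print([c.name for c in frontier])
```

Under these assumed numbers, an operator would choose among A, B, and D according to the latency budget, since C and E are each beaten on both axes by another option.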
