Cloud-Native AI Platforms for Scalable Enterprise Machine Learning: Architecture, Challenges, and Best Practices
Keywords:
cloud-native, machine learning platforms, MLOps, scalability, latency, cost-efficiency

Abstract
Cloud-native AI platforms are changing how enterprises build, deploy, and maintain machine learning (ML) at scale. This paper reviews the architectures of contemporary platforms such as AWS SageMaker and Kubeflow and examines their scalability, latency, and cost-efficiency for enterprise ML workloads. We survey the literature on ML operations (MLOps) and cloud-based ML services to identify prevalent challenges (distributed training, low-latency inference, and resource optimization) and the best practices that address them. We then describe a methodology for comparing these platforms on key parameters, and discuss the results in the context of real-world implementations. The findings indicate that cloud-native design, built on containers, microservices, and orchestration, enables highly scalable and portable ML workflows while achieving low latency through optimized serving infrastructure. Cost-efficiency is gained through pay-as-you-go resource management and automation, but careful architectural choices are required to avoid technical debt. We conclude that managed services combined with open-source tools can meet enterprise needs for scalable ML, provided that strong MLOps practices are in place to guarantee reliability, security, and governance. Best practices include auto-scaling, pipeline automation, and continuous monitoring to balance cost against performance.
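The cost-efficiency argument above can be illustrated with a back-of-the-envelope sketch comparing a fixed fleet sized for peak load against pay-as-you-go auto-scaling. The hourly rate and demand profile below are hypothetical illustration values, not measurements from any platform discussed in the paper:

```python
# Sketch: always-on fleet vs. auto-scaled pay-as-you-go cost for a
# bursty inference workload. All numbers are hypothetical.

HOURLY_RATE = 1.20        # assumed cost of one inference node per hour
HOURS_PER_MONTH = 730

def always_on_cost(nodes: int) -> float:
    """Fixed fleet sized for peak traffic, billed every hour of the month."""
    return nodes * HOURLY_RATE * HOURS_PER_MONTH

def autoscaled_cost(hourly_demand: list[int]) -> float:
    """Pay only for the nodes actually running in each sampled hour."""
    return sum(n * HOURLY_RATE for n in hourly_demand)

# One month of hourly samples: 8 peak hours/day need 10 nodes, the
# remaining 16 hours/day need 2.
demand = [10] * 8 * 30 + [2] * 16 * 30

fixed = always_on_cost(max(demand))   # fleet sized for the peak
elastic = autoscaled_cost(demand)
print(f"always-on: ${fixed:,.2f}  auto-scaled: ${elastic:,.2f}")
```

Under these assumptions the elastic fleet costs roughly half as much, which is the intuition behind the abstract's pairing of auto-scaling with cost control; real savings depend on how bursty the workload actually is and on scaling latency.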
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.