Cloud-Native AI Platforms for Scalable Enterprise Machine Learning: Architecture, Challenges, and Best Practices
Keywords:
cloud-native, machine learning platforms, MLOps, scalability, latency, cost-efficiency

Abstract
Cloud-native AI platforms are changing how enterprises build, deploy, and maintain machine learning (ML) at scale. This paper reviews the architectures of contemporary platforms such as AWS SageMaker and Kubeflow and examines their scalability, latency, and cost-efficiency for enterprise ML workloads. We survey the literature on ML operations (MLOps) and cloud-based ML services to identify prevalent challenges (distributed training, low-latency inference, and resource optimization) and the best practices that address them. We then describe a methodology for comparing these platforms on key parameters, and discuss the results in the context of real-world implementations. The findings indicate that cloud-native design, built on containers, microservices, and orchestration, enables highly scalable and portable ML workflows while achieving low latency through optimized serving infrastructure. Cost-efficiency is gained through pay-as-you-go resource management and automation, but careful architectural choices are required to avoid technical debt. We conclude that managed services combined with open-source tools can meet enterprise needs for scalable ML, provided that strong MLOps practices are in place to guarantee reliability, security, and governance. Best practices include auto-scaling, pipeline automation, and continuous monitoring to balance cost against performance.
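The cost-efficiency argument above can be illustrated with a back-of-the-envelope sketch comparing a fixed fleet sized for peak load against pay-as-you-go auto-scaling. The hourly rate and demand profile below are hypothetical illustration values, not measurements from any platform discussed in the paper:

```python
# Sketch: always-on fleet vs. auto-scaled pay-as-you-go cost for a
# bursty inference workload. All numbers are hypothetical.

HOURLY_RATE = 1.20        # assumed cost of one inference node per hour
HOURS_PER_MONTH = 730

def always_on_cost(nodes: int) -> float:
    """Fixed fleet sized for peak traffic, billed every hour of the month."""
    return nodes * HOURLY_RATE * HOURS_PER_MONTH

def autoscaled_cost(hourly_demand: list[int]) -> float:
    """Pay only for the nodes actually running in each sampled hour."""
    return sum(n * HOURLY_RATE for n in hourly_demand)

# One month of hourly samples: 8 peak hours/day need 10 nodes, the
# remaining 16 hours/day need 2.
demand = [10] * 8 * 30 + [2] * 16 * 30

fixed = always_on_cost(max(demand))   # fleet sized for the peak
elastic = autoscaled_cost(demand)
print(f"always-on: ${fixed:,.2f}  auto-scaled: ${elastic:,.2f}")
```

Under these assumptions the elastic fleet costs roughly half as much, which is the intuition behind the abstract's pairing of auto-scaling with cost control; real savings depend on how bursty the workload actually is and on scaling latency.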
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.