Optimizing Big Data Processing Workflows using PySpark and Google Cloud Platform: A Performance Evaluation of Data Locality and Caching Strategies

Authors

Thulasiram Yachamaneni, Amandeep Singh Arora, Uttam Kotadiya

Keywords:

Big Data, PySpark, data locality, caching strategies, Google Cloud Platform

Abstract

The increasing volume and complexity of Big Data have led to the development of distributed processing frameworks such as Apache Spark, particularly its Python interface, PySpark, which allows for large-scale data processing in cloud environments. This paper investigates the optimization of data locality and caching strategies to improve the performance and scalability of Big Data workflows running on Google Cloud Platform (GCP). Through a series of experiments, various configurations of data placement, replication, and caching techniques—both in-memory and disk-based—were evaluated for their impact on key performance metrics, including execution time, latency, and throughput. The study also assesses the scalability of workflows as data sizes increase, identifying the configurations that allow PySpark workflows to handle growing datasets efficiently. The results reveal that optimized data locality, combined with well-tuned caching strategies, can significantly improve performance and scalability, offering a pathway for businesses to enhance their cloud-based Big Data systems. Furthermore, the findings provide valuable insights for organizations seeking to reduce costs, accelerate decision-making, and improve the efficiency of their data processing workflows. This paper contributes to the ongoing efforts to optimize distributed Big Data processing frameworks in cloud environments and offers practical guidelines for configuring PySpark workflows for maximum performance.

ijisae