Optimizing Big Data Processing Workflows using PySpark and Google Cloud Platform: A Performance Evaluation of Data Locality and Caching Strategies
Keywords:
Big Data, PySpark, data locality, caching strategies, Google Cloud Platform
Abstract
The increasing volume and complexity of Big Data have led to the development of distributed processing frameworks such as Apache Spark, particularly its Python interface, PySpark, which allows for large-scale data processing in cloud environments. This paper investigates the optimization of data locality and caching strategies to improve the performance and scalability of Big Data workflows running on Google Cloud Platform (GCP). Through a series of experiments, various configurations of data placement, replication, and caching techniques—both in-memory and disk-based—were evaluated for their impact on key performance metrics, including execution time, latency, and throughput. The study also assesses the scalability of workflows as data sizes increase, identifying the configurations that allow PySpark workflows to handle growing datasets efficiently. The results reveal that optimized data locality, combined with well-tuned caching strategies, can significantly improve performance and scalability, offering a pathway for businesses to enhance their cloud-based Big Data systems. Furthermore, the findings provide valuable insights for organizations seeking to reduce costs, accelerate decision-making, and improve the efficiency of their data processing workflows. This paper contributes to the ongoing efforts to optimize distributed Big Data processing frameworks in cloud environments and offers practical guidelines for configuring PySpark workflows for maximum performance.
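As an illustration of the kind of comparison the abstract describes, the following is a minimal sketch, not the authors' experimental code. It assumes a Parquet dataset at a caller-supplied path (for example a Cloud Storage URI) and a working PySpark installation. The helper names `locality_conf`, `time_action`, `throughput`, and `compare_storage_levels` are hypothetical, and the locality wait value is an illustrative choice rather than a setting reported in the paper. `spark.locality.wait` and the `StorageLevel` constants are standard Spark settings.

```python
import time
from typing import Callable, Dict


def locality_conf(wait: str = "3s") -> Dict[str, str]:
    """Spark setting for how long a task waits for a data-local slot.

    The key is a real Spark configuration property; the value passed in
    is an assumption for illustration, not a figure from the paper.
    """
    return {"spark.locality.wait": wait}


def time_action(action: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a zero-argument callable."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def throughput(rows: int, seconds: float) -> float:
    """Rows processed per second; infinite if no time was measured."""
    return rows / seconds if seconds > 0 else float("inf")


def compare_storage_levels(path: str) -> Dict[str, float]:
    """Time a repeated count() under three cache storage levels.

    Requires PySpark; imports are local so the helpers above can be
    used without it. `path` is a hypothetical Parquet location.
    """
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("cache-comparison")
    for key, value in locality_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    levels = {
        "memory_only": StorageLevel.MEMORY_ONLY,
        "memory_and_disk": StorageLevel.MEMORY_AND_DISK,
        "disk_only": StorageLevel.DISK_ONLY,
    }
    timings: Dict[str, float] = {}
    for name, level in levels.items():
        df = spark.read.parquet(path).persist(level)
        df.count()  # first action materializes the cache
        timings[name] = time_action(df.count)  # second action reads from it
        df.unpersist()
    spark.stop()
    return timings
```

Calling `compare_storage_levels` with a dataset path returns one timing per storage level, which can be combined with a known row count through `throughput` to compare configurations. A real evaluation would repeat each measurement and vary the dataset size.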
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made. If they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.