Efficient Incremental Data Modeling in Apache Iceberg-Based Analytical Pipelines: Partitioning and Snapshot Optimization Strategies
Keywords:
Apache Iceberg, Incremental Data Modeling, Lakehouse Architecture, Partition Evolution, Metadata Optimization.
Abstract
Lakehouse architectures rely on Apache Iceberg to handle large-scale analytics reliably and scalably. However, inefficient incremental modeling can slow queries and raise storage costs over time. This paper presents a quantitative assessment of partitioning, snapshot retention, and compaction policies using Monte Carlo simulations. The findings indicate that daily partitioning increased the scan reduction ratio from 0.61 to 0.82 and reduced response time from 18.4 seconds to 13.4 seconds. Snapshot expiration policies lowered the metadata-to-data ratio from 0.18 to 0.07 and reduced overall query response time from 19.3 seconds to 15.8 seconds. Threshold-based and daily compaction kept average file sizes above 240 MB, and the overall efficiency score increased by 0.032 as compared to 0.051. Combined optimization reduced overall latency by 34 percent and storage fragmentation by 41 percent. The results offer practical recommendations for building robust Iceberg analytical pipelines.
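To put the reported before-and-after figures on a common scale, the short sketch below computes the relative change for each pair quoted in the abstract. The metric labels and the `relative_change` helper are illustrative names chosen here, not the paper's own terminology or method, and only the numeric pairs come from the abstract.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after` as a fraction."""
    return (after - before) / before


# (before, after) pairs as quoted in the abstract; the labels are assumed names.
reported = {
    "scan reduction ratio (partitioning)": (0.61, 0.82),
    "response time in seconds (partitioning)": (18.4, 13.4),
    "metadata-to-data ratio (snapshot expiration)": (0.18, 0.07),
    "query response in seconds (snapshot expiration)": (19.3, 15.8),
}

changes = {
    name: relative_change(before, after)
    for name, (before, after) in reported.items()
}

for name, change in changes.items():
    # A positive sign means the metric rose; a negative sign means it fell.
    print(f"{name}: {change:+.1%}")
```

Read this way, the partitioning result corresponds to roughly a 27 percent shorter response time, and the snapshot expiration result to roughly a 61 percent lower metadata-to-data ratio. These are per-policy changes and are not directly comparable to the 34 percent combined latency figure, which the abstract reports separately.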
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, readers must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


