Efficient Incremental Data Modeling in Apache Iceberg-Based Analytical Pipelines: Partitioning and Snapshot Optimization Strategies

Authors

  • Guruprasad Raghothama Rao

Keywords:

Apache Iceberg, Incremental Data Modeling, Lakehouse Architecture, Partition Evolution, Metadata Optimization.

Abstract

Lakehouse architectures rely on Apache Iceberg for reliable, scalable big data analytics. However, inefficient incremental modeling can slow queries and raise storage costs over time. This paper quantitatively assesses partitioning, snapshot retention, and compaction policies using Monte Carlo simulations. The findings indicate that daily partitioning raised the scan reduction ratio from 0.61 to 0.82 and lowered response time from 18.4 seconds to 13.4 seconds. Snapshot expiration policies reduced the metadata-to-data ratio from 0.18 to 0.07 and the overall query response time from 19.3 seconds to 15.8 seconds. Threshold-based and daily compaction kept average file sizes above 240 MB and improved the overall efficiency score (0.032 compared with 0.051). Combined optimization reduced overall latency by 34 percent and storage fragmentation by 41 percent. The results offer practical recommendations for building robust Iceberg analytical pipelines.
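
The before/after figures quoted in the abstract can be converted into relative changes with a few lines of arithmetic. The sketch below is illustrative only: the function name and the dictionary labels are assumptions made for this example, not terms from the paper, and only the numeric pairs are taken from the abstract.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent (one decimal)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return round((after - before) / before * 100.0, 1)


# Before/after pairs copied from the abstract.
# The keys are hypothetical labels chosen for readability.
REPORTED = {
    "scan_reduction_ratio": (0.61, 0.82),
    "response_seconds_partitioning": (18.4, 13.4),
    "metadata_to_data_ratio": (0.18, 0.07),
    "query_response_seconds_expiration": (19.3, 15.8),
}

if __name__ == "__main__":
    for name, (before, after) in REPORTED.items():
        print(f"{name}: {relative_change_pct(before, after):+.1f}%")
```

Read this way, the partitioning figures correspond to a response time roughly 27.2 percent lower, and the snapshot expiration figures to a metadata-to-data ratio about 61.1 percent lower and a query response time about 18.1 percent lower. These per-policy changes are separate from the 34 percent latency reduction reported for the combined optimization.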

Published

25.03.2026

How to Cite

Guruprasad Raghothama Rao. (2026). Efficient Incremental Data Modeling in Apache Iceberg-Based Analytical Pipelines: Partitioning and Snapshot Optimization Strategies. International Journal of Intelligent Systems and Applications in Engineering, 14(1s), 270–277. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/8171

Section

Research Article