NIOJS: A Novel Intelligent Model Based on Optimal Jumps for Creating Data Sampling from Big Dataset
Keywords:
Big data sampling, Cluster sampling, DBSCAN, NIOJS, Samples, Optimal Jump
Abstract
The pervasiveness of big data has reshaped information technology (IT), offering insights and opportunities across sectors such as healthcare, education, and the Internet of Things (IoT). However, the sheer volume and complexity of big data make it difficult to extract meaningful knowledge. To address this, we propose a novel model for optimal sample selection that enables efficient extraction of representative subsets from big data. The proposed model, based on optimal jumps, dynamically adapts the clustering process to make data sampling more efficient. We employ the Adjusted Rand Index (ARI) to evaluate the similarity between clusters and to guide the selection of new data in each iteration. This model has the potential to significantly improve the utilization of big data while reducing computational demands. The proposed model can run on big datasets, and the samples it takes are representative of the dataset.
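As an illustration of the similarity measure named in the abstract, the following is a minimal sketch of the standard pair-counting Adjusted Rand Index between two labelings of the same points. It is not the authors' implementation: the function name `adjusted_rand_index` and the convention of returning 1.0 in degenerate cases are assumptions made for this example, and the abstract does not specify how the score feeds into the jump selection.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same points.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumed convention: fewer than two points agree trivially

    # Contingency counts: points in cluster i of A that are also in cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumed convention for degenerate partitions
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    previous = [0, 0, 1, 1]   # cluster labels from one iteration (made-up data)
    current = [1, 1, 0, 0]    # same partition, different label names
    print(adjusted_rand_index(previous, current))
```

Because the index depends only on which points are grouped together, a partition that is merely relabeled scores 1.0, while partitions that agree no better than chance score near 0. A score close to 1 between the clusterings of successive samples would suggest that adding data no longer changes the cluster structure.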
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.