Efficient Machine Learning Model Training through Data Subsampling: Balancing Performance and Computational Cost

Authors

  • Vibhu Verma

Keywords

Machine Learning, Data Subsampling, Training Efficiency, Rich Representations, Computational Optimization.

Abstract

The increasing complexity of machine learning (ML) models creates a pressing need for effective strategies that reduce training time without sacrificing performance. This paper compares several approaches for generating data subsamples that serve as rich representations of the full dataset, enabling faster training while maintaining model accuracy. By leveraging ML techniques, the approach identifies and extracts representative subsets that preserve the most salient features of the original data. These subsets are then used to train various models, reducing computational cost and time requirements. Experimental results show that, across different ML tasks, the proposed approach yields consistent and significant reductions in training time while retaining comparable predictive performance. The method has the potential to improve the efficiency of large-scale ML workflows in data-intensive environments.
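
The abstract does not spell out the subsampling pipeline, so the following is a minimal illustrative sketch only, assuming a clustering-based representative sampling scheme validated with a Kolmogorov-Smirnov check against the full data. The dataset, model, and parameters (make_classification, RandomForestClassifier, frac, n_clusters) are stand-ins for illustration, not the paper's actual setup.

    # Sketch: cluster-based representative subsampling vs. training on the full data.
    # Assumptions: k-means clustering for representativeness, proportional draws per
    # cluster, and a per-feature KS test as a distribution-preservation check.
    import time
    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def cluster_subsample(X, frac=0.1, n_clusters=20, random_state=0):
        """Return indices of a subsample that keeps points from every k-means cluster."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
        rng = np.random.default_rng(random_state)
        keep = []
        for c in range(n_clusters):
            idx = np.flatnonzero(km.labels_ == c)
            n_keep = max(1, int(round(frac * idx.size)))  # never drop a cluster entirely
            keep.append(rng.choice(idx, size=n_keep, replace=False))
        return np.concatenate(keep)

    # Synthetic stand-in data; the paper's datasets are not reproduced here.
    X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    sub_idx = cluster_subsample(X_tr, frac=0.1)

    # Sanity check: each feature's subsample distribution should resemble the full data.
    p_values = [ks_2samp(X_tr[:, j], X_tr[sub_idx, j]).pvalue for j in range(X_tr.shape[1])]
    print(f"min KS p-value across features: {min(p_values):.3f}")

    # Compare training time and held-out accuracy for full data vs. the subsample.
    for name, (Xf, yf) in {"full": (X_tr, y_tr),
                           "subsample": (X_tr[sub_idx], y_tr[sub_idx])}.items():
        start = time.perf_counter()
        model = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xf, yf)
        elapsed = time.perf_counter() - start
        acc = accuracy_score(y_te, model.predict(X_te))
        print(f"{name:>9}: train time {elapsed:6.2f}s, test accuracy {acc:.3f}")

On the subsample, training time should drop substantially relative to the full data, while the per-feature KS p-values give a quick signal of whether any feature's distribution has drifted away from the original.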

Published

12.06.2024

How to Cite

Vibhu Verma. (2024). Efficient Machine Learning Model Training through Data Subsampling: Balancing Performance and Computational Cost. International Journal of Intelligent Systems and Applications in Engineering, 12(4), 4872 –. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/7227

Issue

Section

Research Article