Efficient Machine Learning Model Training through Data Subsampling: Balancing Performance and Computational Cost

Authors

  • Vibhu Verma

Keywords

Machine Learning, Data Subsampling, Training Efficiency, Rich Representations, Computational Optimization.

Abstract

The increasing complexity of machine learning (ML) models creates a pressing need for effective strategies that reduce training time without sacrificing performance. This paper compares several approaches for generating data subsamples that serve as rich representations of the full dataset, enabling faster training while maintaining model accuracy. By leveraging ML techniques, the approach identifies and extracts representative subsets that preserve the most salient features of the original data. These subsets are then used to train various models, reducing computational cost and time requirements. Experimental results show that, across different ML tasks, the proposed approach yields consistent and significant reductions in training time while retaining comparable predictive performance. The method has the potential to improve the efficiency of large-scale ML workflows in data-intensive environments.
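
The abstract does not spell out the subsampling pipeline, so the following is a minimal illustrative sketch only, assuming a clustering-based representative sampling scheme validated with a Kolmogorov-Smirnov check against the full data. The dataset, model, and parameters (make_classification, RandomForestClassifier, frac, n_clusters) are stand-ins for illustration, not the paper's actual setup.

    # Sketch: cluster-based representative subsampling vs. training on the full data.
    # Assumptions: k-means clustering for representativeness, proportional draws per
    # cluster, and a per-feature KS test as a distribution-preservation check.
    import time
    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def cluster_subsample(X, frac=0.1, n_clusters=20, random_state=0):
        """Return indices of a subsample that keeps points from every k-means cluster."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
        rng = np.random.default_rng(random_state)
        keep = []
        for c in range(n_clusters):
            idx = np.flatnonzero(km.labels_ == c)
            n_keep = max(1, int(round(frac * idx.size)))  # never drop a cluster entirely
            keep.append(rng.choice(idx, size=n_keep, replace=False))
        return np.concatenate(keep)

    # Synthetic stand-in data; the paper's datasets are not reproduced here.
    X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    sub_idx = cluster_subsample(X_tr, frac=0.1)

    # Sanity check: each feature's subsample distribution should resemble the full data.
    p_values = [ks_2samp(X_tr[:, j], X_tr[sub_idx, j]).pvalue for j in range(X_tr.shape[1])]
    print(f"min KS p-value across features: {min(p_values):.3f}")

    # Compare training time and held-out accuracy for full data vs. the subsample.
    for name, (Xf, yf) in {"full": (X_tr, y_tr),
                           "subsample": (X_tr[sub_idx], y_tr[sub_idx])}.items():
        start = time.perf_counter()
        model = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xf, yf)
        elapsed = time.perf_counter() - start
        acc = accuracy_score(y_te, model.predict(X_te))
        print(f"{name:>9}: train time {elapsed:6.2f}s, test accuracy {acc:.3f}")

On the subsample, training time should drop substantially relative to the full data, while the per-feature KS p-values give a quick signal of whether any feature's distribution has drifted away from the original.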

Published

12.06.2024

How to Cite

Vibhu Verma. (2024). Efficient Machine Learning Model Training through Data Subsampling: Balancing Performance and Computational Cost. International Journal of Intelligent Systems and Applications in Engineering, 12(4), 4872 –. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/7227

Issue

Section

Research Article