TALL ARRAY METHOD EFFICIENCY IN DATASET DIMENSIONALITY REDUCTION BY PRINCIPAL COMPONENT ANALYSIS
DOI: https://doi.org/10.20535/kpisn.2025.2.331279
Abstract
Background. Exploratory data analysis has been growing extensively since the early 2000s. As of 2025, most real-world datasets qualify as Big Data. The Big Data analytics workflow includes a data preprocessing step, which is the starting point of computational handling of Big Data. At this step, the data are simplified as much as possible. The main paradigm is dimensionality reduction, which makes high-dimensional datasets simpler and easier to visualize. Principal component analysis (PCA) is a linear dimensionality reduction technique. PCA can be sped up by applying Tall Arrays if the data are stored on disk. Tall Array PCA (TAPCA) computes principal components incrementally using a divide-and-conquer strategy.
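For illustration, on-disk data can be wrapped in a tall array and passed to PCA without loading the whole dataset into memory. This is a minimal sketch, assuming MATLAB with the Parallel Computing Toolbox; the file name is hypothetical.

```matlab
% Sketch: Tall Array PCA over an on-disk dataset (file name is illustrative).
ds = tabularTextDatastore('dataset.csv');  % datastore pointing at on-disk data
T  = tall(ds);                             % lazily evaluated tall table
X  = T{:, :};                              % tall numeric matrix (all columns)
[coeff, score] = pca(X);                   % deferred, divide-and-conquer PCA
score = gather(score);                     % gather triggers the actual computation
```

Because tall arrays are evaluated lazily, no principal components are computed until `gather` is called.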
Objective. The objective is to determine when TAPCA is actually efficient for dimensionality reduction. Two numeric types are studied: double and single precision.
Methods. To achieve this objective, random large datasets are generated as matrices of a specified numeric type. The computational time of the ordinary MATLAB PCA applied to the generated matrices is then measured. Next, the time needed to convert the in-memory arrays (the generated matrices) into tall arrays is measured. Finally, the computational time of TAPCA applied to the same matrices, to which the ordinary PCA has already been applied, is measured as well.
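The timing procedure described above can be sketched as follows; the dataset sizes and variable names are illustrative, not the paper's actual experimental grid.

```matlab
% Sketch of the timing procedure (sizes are illustrative).
m = 1e4;  n = 100;                 % observations x features
X = randn(m, n, 'double');         % random dataset of a specified numeric type

tic;  pca(X);  tOrdinary = toc;    % ordinary in-memory MATLAB PCA

tic;  tX = tall(X);  tConvert = toc;          % in-memory array -> tall array

tic;  [~, S] = pca(tX);  gather(S);  tTall = toc;  % TAPCA (runs on gather)
```

Averaging such measurements over repeated runs and over a grid of (m, n) pairs yields the comparative times analyzed below.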
Results. A comparative analysis of the averaged computational times reveals that the time complexity of both PCA and TAPCA is polynomial rather than strictly quadratic or cubic. In the plane spanned by the number of dataset observations and the number of dataset features, there is a nearly hyperbolic margin, which could alternatively be called the TAPCA efficiency threshold, along which TAPCA and the ordinary PCA take approximately the same time to compute principal components.
Conclusions. In computing principal components for dimensionality reduction of large datasets stored on disk, the Tall Array method becomes efficient with two parallel processor workers if a dataset has at least 5 to 6 million entries. The Tall Array method is more efficient on double-precision datasets, whose efficiency threshold is nearly 6 million entries, whereas the efficiency threshold for single-precision datasets lies between 5 and 15.2 million entries.
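The conclusion above amounts to a simple rule of thumb on the total number of entries. A minimal sketch, using the paper's empirical thresholds as assumed constants:

```matlab
% Rule of thumb from the conclusions: TAPCA is worthwhile once the total
% number of entries m*n reaches the empirical efficiency threshold.
isTallWorthwhile = @(m, n, threshold) m * n >= threshold;

thresholdDouble = 6e6;            % ~6 million entries for double precision
isTallWorthwhile(3e4, 200, thresholdDouble)   % 3e4 * 200 = 6e6 entries
```

For single precision, the threshold is less sharp and lies between 5 and 15.2 million entries, so the predicate should be read as an estimate rather than a guarantee.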
Copyright (c) 2025 Vadim Romanuke

This work is licensed under a Creative Commons Attribution 4.0 International License.