Correlation Analysis Algorithm for Massive Ultra-High-Dimensional Breast Ultrasound Radiomics Feature Data in a Distributed Environment

keywords: Radiomics, massive high-dimensional data, correlation analysis, distributed computing
Radiomics, the high-throughput extraction of large numbers of quantitative features from medical images, has become a focus of research. Combined with Big Data analysis algorithms, it can support disease diagnosis, therapy planning, and prognosis evaluation. Because radiomics can extract hundreds or even tens of thousands of quantitative features from medical images, the resulting datasets may no longer fit into the memory of a single machine. We therefore propose a distributed feature correlation analysis algorithm (DFCA), based on the MapReduce distributed computing framework, for breast ultrasound radiomics feature datasets. While DFCA computes the Pearson correlation coefficients of the radiomics features, each compute node produces a large volume of intermediate data, and the data transmission cost grows quadratically as the amount of feature data and the number of dimensions increase. To reduce this cost, we further propose a distributed correlation estimation algorithm (DFCEA) for radiomics features, built on DFCA. DFCEA estimates the Pearson correlation coefficient with an iterative method, which further reduces the I/O cost. Experiments show that our algorithms are more effective than comparable algorithms in the literature.
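To make the map and reduce roles concrete, the following is a minimal sketch of a generic sufficient-statistics approach to Pearson correlation over partitioned data. It is not the paper's DFCA or DFCEA, whose details are not given here; the function names (`map_partition`, `combine`, `pearson_matrix`, `distributed_pearson`) and the in-memory list-of-lists partitions are assumptions made purely for illustration. Each partition emits a d x d matrix of cross-products, which shows why intermediate data in a scheme of this kind grows quadratically with the number of features.

```python
from functools import reduce
from math import sqrt


def map_partition(rows):
    """Map step (illustrative): summarise one partition of samples.

    Returns (n, sums, cross), where sums[i] is the sum of feature i and
    cross[i][j] is the sum of feature_i * feature_j over the partition.
    """
    d = len(rows[0])
    sums = [0.0] * d
    cross = [[0.0] * d for _ in range(d)]
    for row in rows:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                cross[i][j] += row[i] * row[j]
    return len(rows), sums, cross


def combine(a, b):
    """Reduce step (illustrative): add two partition summaries element-wise."""
    n1, s1, c1 = a
    n2, s2, c2 = b
    sums = [x + y for x, y in zip(s1, s2)]
    cross = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
    return n1 + n2, sums, cross


def pearson_matrix(stats):
    """Turn the global summary into a Pearson correlation matrix.

    Assumes no feature is constant; a constant feature has zero variance
    and would cause a division by zero here.
    """
    n, sums, cross = stats
    d = len(sums)
    # n * sum(xy) - sum(x) * sum(y) is n^2 times the covariance.
    cov = [[n * cross[i][j] - sums[i] * sums[j] for j in range(d)]
           for i in range(d)]
    return [[cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(d)]
            for i in range(d)]


def distributed_pearson(partitions):
    """Run map over every partition, reduce the summaries, then finalise."""
    return pearson_matrix(reduce(combine, map(map_partition, partitions)))
```

In a real MapReduce job each call to `map_partition` would run on a separate node and only the summaries would be shuffled to the reducer; here the partitions are plain Python lists so the sketch runs on one machine.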
reference: Vol. 43, 2024, No. 3, pp. 756–776