Effect of Data Layout in the Evaluation Time of Non-Separable Functions on GPU
keywords: Non-separable function, GPU performance, parallel evaluation, Rosenbrock function, Rana function
GPUs provide tremendous computational power, but using them well requires optimizing memory access. The many available threads can hide long memory access latencies, but reaching peak performance usually demands a reorganization of both the data and the algorithm. This paper addresses the question of which data layout yields the fastest evaluation when population-based evolutionary algorithms optimize non-separable functions. This knowledge allows a more efficient design of evolutionary algorithms: depending on the fitness function and the problem size, the most suitable layout can be chosen at the design phase of the algorithm, avoiding costly later redesigns of code or data layout. In this paper, several non-separable functions, such as the Rosenbrock and Rana functions, are evaluated under several data layouts. The implemented layouts cover the main techniques for maximizing performance: coalesced access to global memory, intensive use of on-chip memory (shared memory and registers), and variable reuse to minimize global memory transactions. Conclusions are stated about the optimum data layout in relation to the characteristics of the fitness function and the problem size. These conclusions also ease the decision-making process for future implementations of other non-separable functions.
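To make the notion of data layout concrete, the following is a minimal CPU-side sketch in Python, not the paper's GPU implementation. It evaluates the standard Rosenbrock function, f(x) = sum over i of 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2, over a population stored in two flat layouts. The function names and the two layout names are assumptions introduced here for illustration. With one GPU thread per individual, the variable-major layout is the one in which neighbouring threads read neighbouring addresses, which is the condition for coalesced global memory access; the second function also carries `x_next` over to the next iteration, a simple form of variable reuse.

```python
def rosenbrock_individual_major(pop, num_individuals, dim):
    """Hypothetical layout A: variables of one individual are contiguous.

    Element i of individual p lives at pop[p * dim + i].
    """
    fitness = []
    for p in range(num_individuals):
        total = 0.0
        for i in range(dim - 1):
            x_i = pop[p * dim + i]
            x_next = pop[p * dim + i + 1]
            total += 100.0 * (x_next - x_i * x_i) ** 2 + (1.0 - x_i) ** 2
        fitness.append(total)
    return fitness


def rosenbrock_variable_major(pop, num_individuals, dim):
    """Hypothetical layout B: variable i of all individuals is contiguous.

    Element i of individual p lives at pop[i * num_individuals + p].
    Each element is read once and reused for the next term.
    """
    fitness = []
    for p in range(num_individuals):
        total = 0.0
        x_i = pop[p]  # variable 0 of individual p
        for i in range(dim - 1):
            x_next = pop[(i + 1) * num_individuals + p]
            total += 100.0 * (x_next - x_i * x_i) ** 2 + (1.0 - x_i) ** 2
            x_i = x_next  # reuse the loaded value instead of reading it again
        fitness.append(total)
    return fitness


def to_variable_major(pop, num_individuals, dim):
    """Transpose a flat individual-major population into variable-major order."""
    return [pop[p * dim + i] for i in range(dim) for p in range(num_individuals)]


# Two individuals of dimension 3: the global optimum (1, 1, 1) and the origin.
population = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
fitness_a = rosenbrock_individual_major(population, 2, 3)
fitness_b = rosenbrock_variable_major(to_variable_major(population, 2, 3), 2, 3)
```

Both layouts hold the same values and give the same fitness; only the address pattern differs, and that pattern is what determines the memory transaction count on a GPU.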
reference: Vol. 34, 2015, No. 4, pp. 725–745