MAPS: GPU Memory Abstraction and Optimization Framework

MAPS Framework Performance

Single GPU Matrix Multiplication

The following graph depicts the performance of single-precision matrix multiplication using MAPS, measured on a Tesla K40 GPU:

Multi-GPU Scaling

The incremental speedup of running three multi-GPU applications over MAPS (Game of Life, Histogram, and CUBLAS Matrix Multiplication) on various GPUs (GTX 780, Titan Black, Tesla K40m, GTX 980) is shown below:

The maximal speedup is 3.94x on 4 GPUs.

Instruction-Level Parallelism (ILP)

The figure below shows the performance of automatic ILP optimizations, i.e., computing multiple elements per thread, on various GPU architectures using the Game of Life code sample:

Kernel Fusion

The figure below demonstrates the performance of a single fused kernel that convolves an image and computes its histogram. The following graph compares the fused kernel with other libraries (NPP and CUB). The kernel was tested on both the Kepler architecture (NVIDIA Tesla K40c) and the Maxwell architecture (NVIDIA GeForce GTX 750 Ti).
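
To make concrete what fusing a convolution with a histogram saves, here is a minimal CPU-side sketch in plain C++. It is not MAPS code and does not use the MAPS, NPP, or CUB APIs. The 3x3 box filter, the border clamping, and every name in it (`box3x3`, `convolve_then_histogram`, `fused_convolve_histogram`) are assumptions chosen for illustration. The unfused version writes a full intermediate image and reads it back. The fused version bins each filtered pixel as soon as it is computed, so the intermediate image never exists in memory.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Image = std::vector<std::uint8_t>;            // row-major, 8-bit grayscale
using Histogram = std::array<std::uint32_t, 256>;   // one bin per intensity

// Clamp a coordinate into [lo, hi] (border handling is an assumption).
static int clamp_index(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

// Hypothetical filter: 3x3 box average of the pixel at (x, y).
static std::uint8_t box3x3(const Image& img, int w, int h, int x, int y) {
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const int cx = clamp_index(x + dx, 0, w - 1);
            const int cy = clamp_index(y + dy, 0, h - 1);
            sum += img[static_cast<std::size_t>(cy) * w + cx];
        }
    }
    return static_cast<std::uint8_t>(sum / 9);
}

// Unfused: pass 1 stores the filtered image, pass 2 reads it back to bin it.
static Histogram convolve_then_histogram(const Image& img, int w, int h) {
    assert(w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    Image filtered(img.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            filtered[static_cast<std::size_t>(y) * w + x] = box3x3(img, w, h, x, y);

    Histogram hist{};  // zero-initialized bins
    for (std::uint8_t v : filtered) ++hist[v];
    return hist;
}

// Fused: each filtered value goes straight into its bin; no intermediate image.
static Histogram fused_convolve_histogram(const Image& img, int w, int h) {
    assert(w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    Histogram hist{};
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            ++hist[box3x3(img, w, h, x, y)];
    return hist;
}
```

Both functions produce the same histogram. The fused one skips one full write and one full read of the image, which on a GPU corresponds to a round trip through global memory plus a second kernel launch. A GPU implementation would also need to handle concurrent bin updates, for example with atomic additions, which this sequential sketch does not model.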