OmpSs is an effort to integrate features from the StarSs programming model, developed by BSC, into a single programming model. In particular, our objective is to extend OpenMP with new directives that support asynchronous parallelism based on expressing task dependencies, and heterogeneity (devices like GPUs and the Intel MIC). OmpSs also provides new directives that extend accelerator-based APIs like CUDA and OpenCL and leverage hand-optimized CUDA and OpenCL kernels. Our OmpSs environment is built on top of our Mercurium compiler and Nanos++ runtime system. The work-in-progress OpenMP 4.0 specification is currently adopting similar extensions regarding task dependencies and targeting accelerators.
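As a rough illustration of parallelism expressed through task dependencies, the sketch below uses the standard OpenMP 4.0 depend clause rather than OmpSs's own directive spelling, which is not shown here. The function name sum_of_squares is made up for this example. The first two tasks touch different data and may run concurrently; the third waits for both because it reads what they write.

```cpp
// Minimal sketch using standard OpenMP 4.0 task dependencies.
// If the compiler is run without OpenMP support, the pragmas are ignored
// and the code runs sequentially with the same result.
int sum_of_squares(int x, int y) {
    int a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        // Produces a; independent of the task that produces b.
        #pragma omp task depend(out: a)
        a = x * x;

        // Produces b; may run concurrently with the task above.
        #pragma omp task depend(out: b)
        b = y * y;

        // Consumes a and b, so the runtime orders it after both producers.
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;

        // Wait for all three tasks before leaving the region.
        #pragma omp taskwait
    }
    return c;
}
```

Calling sum_of_squares(3, 4) yields 25 regardless of how the runtime schedules the first two tasks, since the ordering is derived from the declared data accesses rather than from explicit synchronization between tasks.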
GMAC is a user-level library that implements an Asymmetric Distributed Shared Memory model to be used by CUDA programs. An ADSM model allows CPU code to access data hosted in accelerator (GPU) memory.
GMAC-1 offered a unified address space to heterogeneous systems based on accelerators like GPUs in order to aggregate all the distributed memories present in the system. The visibility of this address space is different for host and device code. Host code can access the whole address space, while device code can only access the objects allocated by the host thread that invoked the kernel. One of the advantages over the model offered by CUDA is that memory objects allocated in GPU memory are directly accessible by the host code. Programmers do not need to maintain separate host and device allocations for each object, nor keep both allocations synchronized. In particular, GMAC-1 has been shown to significantly reduce the complexity of using CUDA devices from multi-threaded applications.
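The idea of a single object that host code can touch directly, with transfers handled behind the scenes, can be pictured with a toy model. This is only a sketch under strong assumptions: both "memories" are ordinary host vectors, the coherence rule is a simple last-writer flag, and the names SharedObject, host, launch and transfers are hypothetical and are not the GMAC API.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Toy model only: one logical object, two simulated copies, and a flag
// recording which side holds the newest data.
class SharedObject {
public:
    explicit SharedObject(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}

    // Host-side access: pull the data back first if the device wrote it last.
    std::vector<float>& host() {
        if (owner_ == Owner::Device) {
            host_ = device_;      // stands in for a device-to-host copy
            ++transfers_;
        }
        owner_ = Owner::Host;     // the host may now modify the data
        return host_;
    }

    // Simulated kernel launch: push host changes to the device, then run.
    void launch(const std::function<void(std::vector<float>&)>& kernel) {
        if (owner_ == Owner::Host) {
            device_ = host_;      // stands in for a host-to-device copy
            ++transfers_;
        }
        kernel(device_);
        owner_ = Owner::Device;   // the device copy is now the newest
    }

    // Number of simulated transfers performed so far.
    int transfers() const { return transfers_; }

private:
    enum class Owner { Host, Device };
    std::vector<float> host_, device_;
    Owner owner_ = Owner::Host;
    int transfers_ = 0;
};

// The host fills the object, a simulated kernel doubles it, the host reads it.
float demo() {
    SharedObject obj(4);
    for (float& x : obj.host()) x = 1.5f;
    obj.launch([](std::vector<float>& d) { for (float& x : d) x *= 2.0f; });
    return obj.host()[0];
}
```

In this model the caller never issues a copy: the object is written on the host, processed by the simulated kernel, and read back on the host, and the two transfers happen only when the other side actually needs the data.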
GMAC-2 is the evolution of the GMAC-1 library. Unlike its predecessor, GMAC-2 is a framework composed of several components that can be combined to provide different degrees of functionality to the programmer. This makes GMAC-2 not only an intuitive and efficient programming environment for applications, but also a great foundation for building new programming models and development frameworks for accelerator-based systems. The lowest-level component in GMAC-2 is called gmac-hal. It provides a unified API to manage the underlying hardware and hides the complex details of the interfaces provided by vendors to control devices (e.g. CUDA or OpenCL). With it, programmers can create/destroy address spaces in devices, allocate/free memory and copy data among address spaces, create/destroy execution contexts and execute code on devices. These operations transparently take advantage of the features offered by the hardware (e.g. full-duplex PCIe transfers, or direct GPU-to-GPU copies). This component is also being extended to support multi-node configurations, thus giving the ability to virtualize a whole cluster of machines.
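To make the role of such a hardware abstraction layer concrete, here is a small sketch of client code written once against an abstract address-space interface. Everything in it is an assumption made for illustration: the names AddressSpace, HostBackedSpace, copy_in, copy_out and round_trip are hypothetical and do not reflect the gmac-hal API, and the only backend shown keeps its data in host memory, where a real one would wrap vendor calls.

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Hypothetical interface: one set of operations for any kind of device.
class AddressSpace {
public:
    virtual ~AddressSpace() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void release(void* ptr) = 0;
    // Copy from host memory into this address space.
    virtual void copy_in(void* dst, const void* src, std::size_t bytes) = 0;
    // Copy from this address space back to host memory.
    virtual void copy_out(void* dst, const void* src, std::size_t bytes) = 0;
};

// Stand-in backend that keeps everything in host memory. A real backend
// would place vendor-specific calls behind the same four operations.
class HostBackedSpace : public AddressSpace {
public:
    void* allocate(std::size_t bytes) override { return ::operator new(bytes); }
    void release(void* ptr) override { ::operator delete(ptr); }
    void copy_in(void* dst, const void* src, std::size_t bytes) override {
        std::memcpy(dst, src, bytes);
    }
    void copy_out(void* dst, const void* src, std::size_t bytes) override {
        std::memcpy(dst, src, bytes);
    }
};

// Client code depends only on the interface, not on the backend.
int round_trip(AddressSpace& space, int value) {
    void* remote = space.allocate(sizeof(int));
    space.copy_in(remote, &value, sizeof(int));
    int back = 0;
    space.copy_out(&back, remote, sizeof(int));
    space.release(remote);
    return back;
}
```

Because round_trip only sees the AddressSpace interface, swapping in a different backend would not require changing it, which is the property the unified API is meant to provide.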
Built on top of gmac-hal, there are two components that provide complementary functionality. gmac-dsm keeps regions of memory synchronized between the host and several GPU address spaces. Moreover, this component uses the data location information to automatically optimize operations like memory copies at run-time. gmac-uvas targets those systems in which the hardware does not offer the mechanisms needed to build a unified virtual address space (UVAS). Benefiting from the new capabilities implemented in the other components, the programming environment for applications shipped in GMAC-2 (gmac-exec) extends the features offered by GMAC-1. These new features especially target multi-GPU environments. For example, memory objects can now be arbitrarily mapped and modified on different devices. Moreover, programmers can map different parts of the same logical object on different devices, thus allowing them to be processed by multiple devices in parallel while keeping a sequential representation in host memory.
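The mapping of different parts of one logical object onto different devices can be sketched as a simple partitioning step. The code below is a simulation under stated assumptions: the "devices" are temporary host buffers processed one after another, and the names Range, partition and process_on_devices are hypothetical rather than part of gmac-exec. It shows only the bookkeeping, that is, how one sequential host array is split into contiguous pieces, processed piece by piece, and reassembled in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous slice of the logical object.
struct Range {
    std::size_t offset;
    std::size_t length;
};

// Split n elements into `devices` contiguous, nearly equal ranges.
std::vector<Range> partition(std::size_t n, std::size_t devices) {
    assert(devices > 0);  // assumption: at least one device is present
    std::vector<Range> ranges;
    const std::size_t base = n / devices;
    const std::size_t extra = n % devices;  // the first `extra` ranges get one more
    std::size_t offset = 0;
    for (std::size_t d = 0; d < devices; ++d) {
        const std::size_t len = base + (d < extra ? 1 : 0);
        ranges.push_back({offset, len});
        offset += len;
    }
    return ranges;
}

// Map each range to its own simulated device buffer, apply the kernel there,
// and copy the result back into the single host array.
void process_on_devices(std::vector<int>& host, std::size_t devices,
                        int (*kernel)(int)) {
    for (const Range& r : partition(host.size(), devices)) {
        auto first = host.begin() + static_cast<std::ptrdiff_t>(r.offset);
        auto last = first + static_cast<std::ptrdiff_t>(r.length);
        std::vector<int> device_buf(first, last);  // stands in for a device mapping
        for (int& x : device_buf) x = kernel(x);
        std::copy(device_buf.begin(), device_buf.end(), first);
    }
}
```

With ten elements and three devices, partition assigns four elements to the first device and three to each of the others. The host array keeps its sequential layout throughout; only the per-range buffers would live in separate device memories in a real system.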