Summary of CUDA Support
This is the documentation for CUDA support in X10 version 2.1. For version 2.1.1, see X10 2.1.1 CUDA.
Using the X10/CUDA backend, one can identify fragments of an X10 program to run on the GPU. For ideal workloads, this can give a speedup of 30x or more. Any X10RT backend can be used with CUDA, but X10 must be built specially from source to enable this capability.
The idea behind the X10/CUDA implementation is to expose the low level CUDA fundamentals in as direct a fashion as possible, to maximise the ways in which the backend can be used, and to present as few surprises as possible to programmers. To support this, we have also striven to change X10 minimally, and 99% of the semantics of CUDA are represented with certain design patterns of X10 language features. For this reason, one needs to have a reasonably good understanding of the distributed/multicore features of the X10 language (places, async, finish, and the distributed object model, which are together commonly known as the APGAS model) before programming CUDA kernels in X10.
X10 2.1 is the first release to formally support CUDA, and there are many limitations in the current implementation that we wish to address in the coming months.
Prerequisites
You will need a CUDA-capable NVidia GPU. All recent NVidia GPUs support this. Anything from the '8000' series onwards should work. Check the documentation for your hardware to be sure. Lists of supported chipsets can be found on the NVidia website. If the specs for your card include the number of "CUDA cores" available, then you can be sure it supports CUDA.
You will need to install graphics drivers that support CUDA. We have tested with the latest versions of the drivers, so you should also use these if possible. Download CUDA drivers from the NVidia website http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html
You will also need the NVidia CUDA compiler, 'nvcc', and its runtime libraries. These are part of the 'toolkit' download from the same site. You should make sure nvcc is included in your path, and that the toolkit is installed in /usr/local/cuda. If x10c++ cannot find nvcc, it will print a warning message and produce an executable that will only run on the CPU. If you try to run it on your GPU you will get a message about not being able to find the cubin files, since these are output by nvcc.
Compiling
Unpack the source release tarball into a directory of your choice. Invoke ant as follows to build the whole X10 compiler and runtime:
cd x10.dist
ant dist -DX10RT_CUDA=true -DNO_CHECKS=true -Doptimize=true
The build process will expect to find the CUDA toolkit headers and libraries installed on the local system in /usr/local/cuda.
If you want to use a combination of CUDA and MPI (e.g. for an Infiniband cluster of hosts with GPUs attached), include the following extra argument:
-DX10RT_MPI=true
If you want to use gdb or valgrind to debug your generated programs, you might want to build the X10 runtime without optimization. To reduce optimization, disable GC (which interferes with valgrind) and enable extra assertions, unpack the tarball into a different directory and build it as follows (again, include the extra MPI argument if needed):
cd x10.dist
ant dist -DX10RT_CUDA=true -Doptimize=false -DDISABLE_GC=true -DNO_CHECKS=true
Note that your application will be built with optimization if and only if you give the -O argument to x10c++. The optimization argument above only affects the optimization of the X10 runtime. It is possible, and sometimes useful, to have an unoptimized application linked against an optimized X10 runtime, and vice versa.
Testing The Build with The CUDA Samples
Compiling
In the x10.dist/samples/cuda directory there are 3 CUDA-capable X10 programs you can now try. Change to that directory and compile each one as follows:
../../bin/x10c++ -O -NO_CHECKS -STATIC_CALLS CUDATopology.x10 -o CUDATopology
../../bin/x10c++ -O -NO_CHECKS -STATIC_CALLS CUDABlackScholes.x10 -o CUDABlackScholes
../../bin/x10c++ -O -NO_CHECKS -STATIC_CALLS KMeansCUDA.x10 -o KMeansCUDA
If you want to use MPI, you should add the following argument to x10c++ in each case:
-x10rt mpi
The -O parameter is optional here, but will give you the best performance for the parts of the code that run on the host. The argument -STATIC_CALLS is required to prevent the X10 compiler from inserting casts into the kernel, which are not supported. The optional argument -NO_CHECKS avoids null pointer and array bounds checks on the host, as well as constraint checks when casting to a constrained type.
These builds will try to execute nvcc using the $PATH, so please ensure it can be found.
Executing
All of the compiled samples above should be runnable as normal X10 programs. If one used -x10rt mpi then one should use mpirun to execute them in the ordinary way for MPI programs. Otherwise use ../../bin/runx10 to run a standalone execution of the X10 program, since the default x10rt backend is pgas_sockets.
However, the applications will not use GPUs by default. To make them use your local GPUs, define the environment variable X10RT_ACCELS, like this:
X10RT_ACCELS=ALL ../../bin/runx10 CUDATopology
X10RT_ACCELS=ALL ../../bin/runx10 CUDABlackScholes
X10RT_ACCELS=ALL ../../bin/runx10 KMeansCUDA -i 50
This allows all GPUs to be used by each X10 host place. In the above example, only one host place exists because runx10 is a standalone execution. If you run with mpirun, each MPI process looks at X10RT_ACCELS to choose how many GPUs to use. You will probably have to look into your mpirun documentation to see how to make it pass environment variables to the MPI processes.
If you run these samples without the environment variable set, that is equivalent to setting it to 'NONE', and no GPUs will be used. In this case the CUDABlackScholes and KMeansCUDA samples should run a lot slower, because they are written to use the host CPU for the computation if they do not find any GPUs available.
The CUDATopology sample just prints the GPU places available underneath every host place, so its output will change depending on the value of X10RT_ACCELS.
CUDABlackScholes is a reimplementation of the NVidia BlackScholes CUDA demo. It runs with similar performance to the NVidia version.
KMeansCUDA is an implementation of KMeansSPMD.x10, in the samples directory, that uses the GPU for acceleration. The -v parameter will make it print out the clusters at each iteration so you can watch them converge. It only works correctly when the number of clusters is a multiple of 20.
You can also specify a comma separated list of GPUs to use, and can oversubscribe each GPU several times. For instance, the following value for X10RT_ACCELS will use the first GPU 3 times and the third GPU twice:
X10RT_ACCELS=CUDA0,CUDA0,CUDA0,CUDA2,CUDA2
If the X10 package was compiled without support for CUDA, then no GPUs will ever be discovered at runtime, and X10RT_ACCELS will be silently ignored.
Writing Your Own CUDA Kernels
The X10/CUDA design utilizes existing X10 language features to express the various concepts defined by the CUDA system. The code that you write is therefore much like normal X10 code, but certain patterns are enforced because GPUs are not as capable as ordinary CPUs.
CUDA Places
A single GPU has its own memory that is distinct from the host memory and from other GPUs. Therefore we represent it with a place. In fact one can oversubscribe GPUs using the X10RT_ACCELS environment variable, and in this case many places will be created on the same GPU, but they will be distinct in terms of X10 semantics.
The Place API contains methods that can be used for finding out what GPUs are available, which in turn depends on the value of X10RT_ACCELS when the program was executed. In particular, isCUDA() and isHost() will identify whether or not a particular place is a GPU or CPU, and parent() will return the host of a GPU. For a given place p, one can iterate over p.children() in order to get the GPUs at that place. Also, p.numChildren() will return the number of children. Any GPU program ought to have a strategy for when there are no GPUs present, perhaps choosing to run on the CPU instead.
While the API contains calls that suggest some places may be Cell SPEs, there is currently no support for compiling X10 programs to run on the Cell. However, since other kinds of accelerators may well be supported in future (e.g. OpenCL), we do not guarantee that every child of a place is a CUDA GPU.
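As a rough illustration of the Place calls described above, the following sketch lists the children of the current host place and picks a place to run on, falling back to the host when no CUDA GPU is found. The class name GpuPlaces and the variable names are made up for this example, and it assumes, as described above, that iterating over children() yields places.

```x10
public class GpuPlaces {
    public static def main(args:Array[String]) {
        val host = here;
        Console.OUT.println("Host place " + host + " has "
                            + host.numChildren() + " child place(s)");

        // Default to the host so the program still works without GPUs.
        var target:Place = host;

        for (c in host.children()) {
            Console.OUT.println("  child " + c + ", parent " + c.parent()
                                + ", isCUDA: " + c.isCUDA());
            // Not every child is guaranteed to be a CUDA GPU, so check.
            if (c.isCUDA()) {
                target = c;
                break;
            }
        }

        Console.OUT.println("Kernels would be sent to " + target);
    }
}
```

Running this with different values of X10RT_ACCELS should change both the number of children reported and the place that is selected.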
Memory On CUDA Places
Just like during the execution of any X10 program, each place has its own heap and objects live in a particular place. Unlike normal X10 programs, CUDA requires that GPU memory is allocated and managed by the host place, rather than the GPU's own place. Therefore we provide some utility functions to represent this kind of remote memory management. Each of these utility functions also works if the remote place is a CPU instead of a GPU, since under the covers it is just using normal X10 language features to implement the needed semantics.
Allocating GPU Memory
The main datatype for doing work on GPUs is Array. However it cannot be made in the usual form:
val remote_gpu_array = at (gpu) new Array[T](sz, init);
Instead, we provide the following function to call from the host, in the class x10.util.CUDAUtilities, whose semantics is the same as the above:
val remote_gpu_array = CUDAUtilities.makeRemoteArray[Float](gpu, sz, init);
As with ordinary array construction, init can be a closure, a single value to initialise all elements, or an existing array of the same kind.
To free the array, since there is no garbage collection for CUDA objects yet, call the following:
CUDAUtilities.deleteRemoteArray(remote_gpu_array);
Copying Values Between GPU and Host Places
The usual X10 mechanisms for copying arrays can be used to copy to/from arrays in GPU memory. Note that the following is asynchronous so should be wrapped in a finish statement to ensure it has completed.
Array.asyncCopy(src_array, src_offset, dst_array, dst_offset, len);
Since X10 is a garbage collected language, we cannot allocate all our host arrays using cudaMallocHost. This means the arrays will not be pinned, so we are forced to spool the data through an intermediate pinned buffer (hidden in the X10 runtime). Because of this spooling, the bandwidth of DMAs takes up to a 30% hit. This is a known limitation of the design of CUDA, and has nothing to do with X10.
The size of the spooling buffer can be specified in bytes with the environment variable X10RT_CUDA_DMA_SLICE, and defaults to 1MB. By increasing this, one can often reduce the overhead of the spooling.
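The allocation, copy and free calls described in this section might be combined as in the following sketch, which sends an array to a GPU place and reads it back. The class name RemoteCopy, the variable names and the array size are illustrative only, and the sketch assumes that the first child of the host place, if there is one, is a CUDA GPU.

```x10
import x10.util.CUDAUtilities;

public class RemoteCopy {
    public static def main(args:Array[String]) {
        val n = 1024;

        // Assumption: the first child, if any, is a CUDA GPU; otherwise use the host.
        val gpu = here.numChildren() == 0 ? here : here.children()(0);

        // Host-side source and destination arrays.
        val h_in = new Array[Float](n, (i:Int) => i as Float);
        val h_out = new Array[Float](n, 0.0f);

        // The remote array is allocated from the host, not from the GPU place.
        val d_buf = CUDAUtilities.makeRemoteArray[Float](gpu, n, 0.0f);

        // asyncCopy is asynchronous, so each copy is wrapped in a finish.
        finish Array.asyncCopy(h_in, 0, d_buf, 0, n);   // host -> GPU
        finish Array.asyncCopy(d_buf, 0, h_out, 0, n);  // GPU -> host

        // There is no garbage collection for remote arrays, so free explicitly.
        CUDAUtilities.deleteRemoteArray(d_buf);

        Console.OUT.println("Last element after round trip: " + h_out(n-1));
    }
}
```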
Kernels
To run X10 code on the GPU, a particular pattern of X10 language constructs must be used. Since the GPU is a remote place, we have to use the 'at' construct to run code there. And since the underlying CUDA kernels are asynchronous, we currently require the form to be:
async at (gpu) { ... }
In future we will also allow "at (gpu) { ... }" but at present it is necessary to use "finish async at (gpu) { ... }" to synchronize on the completion of a kernel.
Since only certain kinds of code can be executed on the GPU, there is an annotation @CUDA, found in the package x10.compiler, that must be used to inform the compiler that a certain block of code is intended to run on the GPU. Since places are runtime quantities, it is not possible to statically know that the value of the 'gpu' variable is actually a GPU place and not just another host. So we use the @CUDA annotation to cause the block of code to be compiled for both CPU and GPU, instead of just for the CPU like the rest of the X10 program. In fact this behaviour is useful, as it allows CUDA code to run on the host if no GPUs can be found, through the usual X10 generated code.
async at (gpu) @CUDA { ... }
Once you add the @CUDA annotation, the compiler will insist that certain restrictions are obeyed within the annotated block. The annotation also causes the compiler to attempt to run 'nvcc' to generate a cubin file for each class that contains an occurrence of @CUDA. If you attempt to run code on a GPU place, and that code has not been marked with @CUDA, you will get a runtime error of the form "X10RT: async id X is not a CUDA kernel".
Kernel Structure
The basic execution pattern of a CUDA kernel is that there are a number of 'blocks', each of which spawns a number of 'threads', and each thread executes the same code. We represent this using X10 language constructs as follows:
async at (gpu) @CUDA {
    for ([block] in 0..239) {
        for ([thread] in 0..63) async {
            ...
        }
    }
}
Note the 'async' in the inner for loop. The '...' is what one would typically call 'the CUDA kernel', as it is the code that each thread runs. Naturally, the code denoted with ... may use the loop variables 'block' and 'thread' (which can be named arbitrarily). We also allow the total number of threads and number of blocks to be specified using variables captured from the enclosing scope, like so:
val blocks = 240;
val threads = 64;
async at (gpu) @CUDA {
    for ([block] in 0..blocks-1) {
        for ([thread] in 0..threads-1) async {
            ...
        }
    }
}
As in normal X10 programs, the code in '...' can access variables from the enclosing scope, i.e. the host. Such captured variables will be automatically copied to the GPU. Entire arrays will be copied, which can be slow, so if an array is large and doesn't change from one kernel to another, it may be better to create a remote array initially, using CUDAUtilities.makeRemoteArray(...). In this case, the kernel would capture the pointer to this specially created array instead of capturing a local array on the host. One must consider pre-allocating remote arrays when optimising the performance of regular X10 programs too, as the copying behaviour of captured arrays is the same for GPUs and CPUs.
Stack variables inside the '...' are compiled to use CUDA registers.
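Putting the pieces of this section together, a complete program might look like the following sketch, which doubles every element of an array using one GPU thread per element. The class name ScaleKernel and the variable names are hypothetical, the first child of the host place is assumed to be a CUDA GPU when one exists, and the program is assumed to be compiled with the x10c++ arguments shown earlier for the samples.

```x10
import x10.compiler.CUDA;
import x10.util.CUDAUtilities;

public class ScaleKernel {
    public static def main(args:Array[String]) {
        val blocks = 240;
        val threads = 64;
        val n = blocks * threads;   // exactly one element per GPU thread
        val factor = 2.0f;

        // Assumption: the first child, if any, is a CUDA GPU; otherwise use the host.
        val gpu = here.numChildren() == 0 ? here : here.children()(0);

        // Pre-allocate the data on the GPU, initialised from a host array,
        // so that the kernel captures a remote array rather than a host array.
        val h_data = new Array[Float](n, (i:Int) => i as Float);
        val d_data = CUDAUtilities.makeRemoteArray[Float](gpu, n, h_data);

        // finish waits for the asynchronous kernel to complete.
        finish async at (gpu) @CUDA {
            for ([block] in 0..blocks-1) {
                for ([thread] in 0..threads-1) async {
                    // Captured variables: blocks, threads, factor, d_data.
                    val i = block * threads + thread;
                    d_data(i) = d_data(i) * factor;
                }
            }
        }

        // Bring the result back to the host and release the GPU memory.
        finish Array.asyncCopy(d_data, 0, h_data, 0, n);
        CUDAUtilities.deleteRemoteArray(d_data);

        Console.OUT.println("h_data(3) = " + h_data(3));
    }
}
```

Because the block is compiled for both CPU and GPU, the same program should also run when no GPU is available, with the kernel body executing on the host.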
Limitations
There are currently many constructs that are not supported in GPU code:
- structs
- method calls (except in special cases, e.g. Array.apply and primitive arithmetic)
- new
- creating / calling closures
- up/down casts or instanceof tests
- throwing exceptions
- catching exceptions
- spawning more asyncs or using finish
- await blocks
- atomic blocks
- next
Some of these we plan to address in future versions (e.g. method calls, structs). We would in particular like to give access to the CUDA barrier construct via the X10 next construct and X10 clocks. However there are also fundamental limitations of CUDA hardware and runtimes that make it difficult for us to implement e.g. new, async, or dynamic dispatch for closures and method calls. In the latest CUDA GPUs, some of these issues have been addressed so it is hard to say with certainty what we will and will not allow in future. It is also unclear how many of these features will be useful for writing kernels, and what ultimately comprises an appropriate and well-rounded subset of language features.
Kernel Parameters
Variables defined outside of the kernel and used within the kernel (captured variables) are the X10/CUDA equivalent of CUDA kernel parameters. We use two strategies for implementing these kernel parameters. The default strategy is to create a struct containing all the variables and DMA this to the GPU before executing the kernel. This should always work, but is less efficient than the other technique, which is to directly use CUDA kernel parameters to hold the captured environment. In the latter case, if the environment is too large, it will not fit in the small area designated by CUDA for transferring parameters to the kernel.
The X10 programmer can currently control which mechanism is used, although in future we will automatically decide based on the size of the environment. To use CUDA kernel parameters instead of a separate DMA, add the annotation @CUDADirectParams after the @CUDA annotation. This annotation is also found in x10.compiler.
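For a kernel with a small captured environment, the placement of the annotation might look like the following sketch, which fills a remote array with a constant. The class name DirectParams and the variable names are invented for illustration, and the first child of the host place is assumed to be a CUDA GPU when one exists.

```x10
import x10.compiler.CUDA;
import x10.compiler.CUDADirectParams;
import x10.util.CUDAUtilities;

public class DirectParams {
    public static def main(args:Array[String]) {
        val blocks = 240;
        val threads = 64;
        val n = blocks * threads;
        val fill = 3.0f;

        // Assumption: the first child, if any, is a CUDA GPU; otherwise use the host.
        val gpu = here.numChildren() == 0 ? here : here.children()(0);
        val d_out = CUDAUtilities.makeRemoteArray[Float](gpu, n, 0.0f);

        // The captured environment is small (blocks, threads, fill, d_out),
        // so it is passed as CUDA kernel parameters instead of a separate DMA.
        finish async at (gpu) @CUDA @CUDADirectParams {
            for ([block] in 0..blocks-1) {
                for ([thread] in 0..threads-1) async {
                    d_out(block * threads + thread) = fill;
                }
            }
        }

        CUDAUtilities.deleteRemoteArray(d_out);
    }
}
```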
Shared Memory
Shared memory is a CUDA concept for memory that threads within a block share and can read/write to. This is distinct from registers, which are local to each individual thread. In KMeansCUDA.x10, shared memory is used as a cache, since global memory fetches are slow in CUDA.
In X10, we express these semantics by defining heap objects between the block and thread loops:
async at (gpu) @CUDA {
    for ([block] in 0..239) {
        val shm = new Array[Float](100, init);
        for ([thread] in 0..63) async {
            ...
        }
    }
}
Since the array is not actually allocated on the heap, but in shared memory that does not outlive the CUDA kernel, it is important not to leak the 'shm' reference by writing it to memory that will be accessed in a later kernel.
Limitations
- Only Arrays of Int / Float may be specified in shared memory
- Only one array can be specified at present.
- The requirement that shared memory objects not be leaked is not yet enforced.
- Since the CUDA memory model requires that shared memory writes be protected with barriers, and we have not yet implemented 'next' in terms of these barriers, writeable shared memory is currently not very useful.
We plan to address these limitations soon after the release.
Auto Blocks / Threads
The performance of CUDA kernels is very sensitive to the number of blocks/threads used, the particulars of the kernel code (e.g. the number of registers / shared memory used) and the hardware that the code runs on. This makes it very difficult to write portable code, since there is such variety in GPUs. To help in this situation, we provide a feature that will automatically choose blocks and threads following a strategy that seems to work well in most cases. The idea is to maximise utilization while using a multiple of 64 threads, and as few blocks/threads as possible. Utilization is a CUDA concept meaning the number of hardware thread slots that are occupied.
In order to use this functionality, you may optionally declare two variables in the following manner, within the @CUDA block, and use these variables to define the extents of the loops:
async at (gpu) @CUDA {
    val blocks = CUDAUtilities.autoBlocks();
    val threads = CUDAUtilities.autoThreads();
    for ([block] in 0..blocks-1) {
        for ([thread] in 0..threads-1) async {
            ...
        }
    }
}
The kinds of kernel that can make use of this are kernels whose correctness is not sensitive to the number of blocks/threads, and whose shared memory requirements do not depend on the number of blocks or threads, since the amount of shared memory required is actually used to determine how many blocks/threads there will be.
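One common way to make a kernel insensitive to the number of blocks/threads is to let each thread stride over the data, so that any (blocks, threads) pair covers the whole array. The following sketch applies this to a simple y = alpha*x + y computation. The class name AutoSaxpy and the variable names are hypothetical, and the first child of the host place is assumed to be a CUDA GPU when one exists.

```x10
import x10.compiler.CUDA;
import x10.util.CUDAUtilities;

public class AutoSaxpy {
    public static def main(args:Array[String]) {
        val n = 1000000;
        val alpha = 2.0f;

        // Assumption: the first child, if any, is a CUDA GPU; otherwise use the host.
        val gpu = here.numChildren() == 0 ? here : here.children()(0);

        val h_x = new Array[Float](n, (i:Int) => i as Float);
        val h_y = new Array[Float](n, 1.0f);
        val d_x = CUDAUtilities.makeRemoteArray[Float](gpu, n, h_x);
        val d_y = CUDAUtilities.makeRemoteArray[Float](gpu, n, h_y);

        finish async at (gpu) @CUDA {
            // Let the runtime choose the launch configuration.
            val blocks = CUDAUtilities.autoBlocks();
            val threads = CUDAUtilities.autoThreads();
            for ([block] in 0..blocks-1) {
                for ([thread] in 0..threads-1) async {
                    val tid = block * threads + thread;  // global thread index
                    val tids = blocks * threads;         // total number of threads
                    // Each thread handles elements tid, tid+tids, tid+2*tids, ...
                    // so the result does not depend on blocks or threads.
                    for (var i:Int = tid; i < n; i += tids) {
                        d_y(i) = alpha * d_x(i) + d_y(i);
                    }
                }
            }
        }

        finish Array.asyncCopy(d_y, 0, h_y, 0, n);
        CUDAUtilities.deleteRemoteArray(d_x);
        CUDAUtilities.deleteRemoteArray(d_y);

        Console.OUT.println("h_y(10) = " + h_y(10));
    }
}
```

This kernel uses no shared memory, so its shared memory requirements trivially do not depend on the chosen configuration.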
The exact algorithm used is an iteration down a list of pairs of (blocks,threads), with preferred pairs (with greater utilization for a given MP) at the beginning of the list. The selected pair is the first pair that is a valid way to instantiate the kernel in question on the GPU in question. The selected pair is used to spawn the kernel, except that the number of blocks is scaled by the number of MPs (what NVidia calls 'streaming multiprocessors' but would conventionally be called cores) in that GPU. Pairs will be stepped over if they require too many registers, too much shared memory, etc. The list contains 32 pairs and the curious can have a look at it in x10.runtime/src-cpp/x10aux/network.cc to see its definition.
Performance Notes
There is considerable performance transparency with the current implementation. However, this also means we have not attempted to hide any of the performance artefacts that are peculiar to CUDA devices and the CUDA programming model. Programmers need to be aware of coalesced memory accesses and bank conflicts, and have to choose the number of blocks/threads wisely. The NVidia CUDA profiler, accessible with the CUDA_PROFILE and CUDA_PROFILE_CONFIG environment variables, is very useful for debugging performance problems within X10/CUDA programs. The X10/CUDA programmer will be best equipped to deal with performance issues on the GPU after reading the relevant parts of the CUDA manual, where the performance model is discussed in detail.