<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="/blog/" rel="alternate" type="text/html" /><updated>2026-04-01T19:16:39+02:00</updated><id>/blog/feed.xml</id><title type="html">Programming and mathematics</title><entry><title type="html">CUDA and OpenGL interop</title><link href="/blog/cuda-opengl-interop" rel="alternate" type="text/html" title="CUDA and OpenGL interop" /><published>2026-04-01T00:00:00+02:00</published><updated>2026-04-01T00:00:00+02:00</updated><id>/blog/cuda-opengl-interop</id><content type="html" xml:base="/blog/cuda-opengl-interop"><![CDATA[<h3 id="introduction">Introduction</h3>

<p>I’d like to write a fluid simulator with CUDA and document every step of the way, in order to improve my writing skills.
I want to keep the simulator simple and just display the result with OpenGL.
The display will be done with <a href="https://www.raylib.com/">raylib</a>, a C library for quickly prototyping games, which makes it easy to display and manipulate the fluid.
On the CUDA side, I’ll use <a href="https://github.com/eyalroz/cuda-api-wrappers">this C++ wrapper</a> to simplify the CUDA code.</p>

<h3 id="cuda-and-opengl-interop">CUDA and OpenGL interop</h3>

<p>CUDA has facilities for interoperability with OpenGL which are pretty straightforward to use. We can simply map an OpenGL texture or buffer to a CUDA array or device pointer and use it directly.</p>

<p>Let’s have a look at how to do this with a texture (the process is very similar for buffers). We first need to register the OpenGL texture so it can be used in CUDA,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">resource</span><span class="p">;</span>

<span class="c1">// id is a GLuint of an OpenGL texture</span>
<span class="n">cudaGraphicsGLRegisterImage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">resource</span><span class="p">,</span> 
                            <span class="n">id</span><span class="p">,</span>  
                            <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> 
                            <span class="n">cudaGraphicsRegisterFlagsNone</span><span class="p">);</span>                                   </code></pre></figure>

<p>Following this, we can map the resource into a CUDA array,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaArray_t</span> <span class="n">array</span><span class="p">;</span>

<span class="n">cudaGraphicsMapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">cudaGraphicsSubResourceGetMappedArray</span><span class="p">(</span><span class="o">&amp;</span><span class="n">array</span><span class="p">,</span> <span class="n">resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span></code></pre></figure>

<p>Unfortunately, we cannot use CUDA arrays directly in a kernel. We have two options here: either copy to/from the array with,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// format is the number of bytes per pixel, e.g. 4 for an RGBA8 texture</span>
<span class="n">cudaMemcpy2DToArray</span><span class="p">(</span><span class="n">array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span>
                    <span class="n">format</span> <span class="o">*</span> <span class="n">width</span><span class="p">,</span>
                    <span class="n">format</span> <span class="o">*</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span>
                    <span class="n">cudaMemcpyDeviceToDevice</span><span class="p">);</span></code></pre></figure>

<p>Or we can use a surface object backed by the array. We first need to create it before we can use it,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaSurfaceObject_t</span> <span class="n">surface_object</span><span class="p">;</span>
<span class="n">cudaResourceDesc</span> <span class="n">texRes</span><span class="p">{};</span>
<span class="n">texRes</span><span class="p">.</span><span class="n">resType</span> <span class="o">=</span> <span class="n">cudaResourceTypeArray</span><span class="p">;</span>
<span class="n">texRes</span><span class="p">.</span><span class="n">res</span><span class="p">.</span><span class="n">array</span><span class="p">.</span><span class="n">array</span> <span class="o">=</span> <span class="n">array</span><span class="p">;</span>

<span class="n">cudaCreateSurfaceObject</span><span class="p">(</span><span class="o">&amp;</span><span class="n">surface_object</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">texRes</span><span class="p">);</span></code></pre></figure>

<p>This can now be used in a kernel. Of course, we mustn’t forget to destroy the surface object and unmap the resource afterwards,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaDestroySurfaceObject</span><span class="p">(</span><span class="n">surface_object</span><span class="p">);</span>
<span class="n">cudaGraphicsUnmapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span></code></pre></figure>

<h3 id="designing-a-nice-encapsulation">Designing a nice encapsulation</h3>

<p>The mapping and creation of the surface object, which later need to be unmapped and destroyed, fit very nicely with C++’s RAII idiom.
In other words, the mapping/creation can be done in a constructor and the unmapping/destruction in a destructor, so we can’t forget to unmap/destroy:</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="k">class</span> <span class="nc">OpenglTextureView</span>
<span class="p">{</span>
<span class="nl">public:</span>
  <span class="o">~</span><span class="n">OpenglTextureView</span><span class="p">()</span>
  <span class="p">{</span>
    <span class="n">cudaDestroySurfaceObject</span><span class="p">(</span><span class="n">_surface_object</span><span class="p">);</span>
    <span class="n">cudaGraphicsUnmapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">_resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="n">cudaSurfaceObject_t</span> <span class="n">view</span><span class="p">()</span> <span class="k">const</span>
  <span class="p">{</span>
    <span class="k">return</span> <span class="n">_surface_object</span><span class="p">;</span>
  <span class="p">}</span>

<span class="k">private</span><span class="o">:</span>
  <span class="n">OpenglTextureView</span><span class="p">(</span><span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">resource</span><span class="p">)</span><span class="o">:</span> <span class="n">_resource</span><span class="p">(</span><span class="n">resource</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="k">auto</span> <span class="n">status</span> <span class="o">=</span> <span class="n">cudaGraphicsMapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">_resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to map resources"</span><span class="p">);</span>

    <span class="n">cudaArray_t</span> <span class="n">array</span><span class="p">;</span>
    <span class="n">status</span> <span class="o">=</span> <span class="n">cudaGraphicsSubResourceGetMappedArray</span><span class="p">(</span><span class="o">&amp;</span><span class="n">array</span><span class="p">,</span> <span class="n">_resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to map array"</span><span class="p">);</span>

    <span class="n">cudaResourceDesc</span> <span class="n">texRes</span><span class="p">{};</span>
    <span class="n">texRes</span><span class="p">.</span><span class="n">resType</span> <span class="o">=</span> <span class="n">cudaResourceTypeArray</span><span class="p">;</span>
    <span class="n">texRes</span><span class="p">.</span><span class="n">res</span><span class="p">.</span><span class="n">array</span><span class="p">.</span><span class="n">array</span> <span class="o">=</span> <span class="n">array</span><span class="p">;</span>

    <span class="n">status</span> <span class="o">=</span> <span class="n">cudaCreateSurfaceObject</span><span class="p">(</span><span class="o">&amp;</span><span class="n">_surface_object</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">texRes</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to create surface object"</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="c1">// OpenglTexture needs access to the private constructor</span>
  <span class="k">friend</span> <span class="k">class</span> <span class="n">OpenglTexture</span><span class="p">;</span>

  <span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">_resource</span><span class="p">;</span>
  <span class="n">cudaSurfaceObject_t</span> <span class="n">_surface_object</span><span class="p">;</span>
<span class="p">};</span></code></pre></figure>

<p>Note that we’ve made the constructor private; this way we can also encapsulate the resource by wrapping it in another class that will also call the register function:</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="k">class</span> <span class="nc">OpenglTexture</span>
<span class="p">{</span>
<span class="nl">public:</span>
  <span class="n">OpenglTexture</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">id</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="k">auto</span> <span class="n">status</span> <span class="o">=</span> <span class="n">cudaGraphicsGLRegisterImage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">_resource</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> <span class="n">cudaGraphicsRegisterFlagsNone</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to register OpenGL texture"</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="n">OpenglTextureView</span> <span class="nf">map</span><span class="p">()</span> <span class="k">const</span>
  <span class="p">{</span>
    <span class="k">return</span> <span class="n">OpenglTextureView</span><span class="p">(</span><span class="n">_resource</span><span class="p">);</span>
  <span class="p">}</span>

<span class="k">private</span><span class="o">:</span>
  <span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">_resource</span><span class="p">;</span>
<span class="p">};</span></code></pre></figure>

<h3 id="game-of-life">Game of Life</h3>

<p>To make a nice little demo, I thought we could write a simple <a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">Game of Life</a> simulation.
This is pretty straightforward: we need three inputs, the previous grid, the next grid, and the texture to display the new state.
We set the dead/alive state on the next grid given the previous grid, and set a colour on the texture so we can display it.</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="k">__global__</span> <span class="kt">void</span> <span class="nf">game_of_life_step</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">uint8_t</span><span class="o">*</span> <span class="n">prev</span><span class="p">,</span>
                                  <span class="n">std</span><span class="o">::</span><span class="kt">uint8_t</span><span class="o">*</span> <span class="n">next</span><span class="p">,</span>
                                  <span class="n">cudaSurfaceObject_t</span> <span class="n">surface</span><span class="p">,</span>
                                  <span class="kt">int</span> <span class="n">width</span><span class="p">,</span>
                                  <span class="kt">int</span> <span class="n">height</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">const</span> <span class="kt">uchar4</span> <span class="n">lightgray</span>  <span class="o">=</span> <span class="n">make_uchar4</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">255</span><span class="p">);</span>
  <span class="k">const</span> <span class="kt">uchar4</span> <span class="n">darkgray</span> <span class="o">=</span> <span class="n">make_uchar4</span><span class="p">(</span><span class="mi">80</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">255</span><span class="p">);</span>

  <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">y</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span> <span class="o">&amp;&amp;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="kt">int</span> <span class="n">left</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">width</span><span class="p">)</span> <span class="o">%</span> <span class="n">width</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">right</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">width</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">top</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">%</span> <span class="n">height</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">bottom</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">height</span><span class="p">;</span>

    <span class="k">auto</span> <span class="n">i</span> <span class="o">=</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">width</span><span class="p">;</span> <span class="p">};</span>

    <span class="kt">int</span> <span class="n">total</span> <span class="o">=</span> <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">top</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">bottom</span><span class="p">)]</span> <span class="o">+</span>
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">top</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">top</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">bottom</span><span class="p">)]</span> <span class="o">+</span>
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">bottom</span><span class="p">)];</span>

    <span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">=</span> <span class="n">total</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">||</span> <span class="p">(</span><span class="n">total</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)])</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">auto</span> <span class="n">colour</span> <span class="o">=</span> <span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">?</span> <span class="n">lightgray</span> <span class="o">:</span> <span class="n">darkgray</span><span class="p">;</span>
    <span class="n">surf2Dwrite</span><span class="p">(</span><span class="n">colour</span><span class="p">,</span> <span class="n">surface</span><span class="p">,</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">4</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Each step, we map the texture, call the kernel, and then swap the previous and next grid pointers.</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="p">{</span>
  <span class="k">auto</span> <span class="n">view</span> <span class="o">=</span> <span class="n">glTexture</span><span class="p">.</span><span class="n">map</span><span class="p">();</span>
  <span class="n">cuda</span><span class="o">::</span><span class="n">launch</span><span class="p">(</span><span class="n">game_of_life_step</span><span class="p">,</span> <span class="n">launch_config_2d</span><span class="p">,</span> <span class="n">curr</span><span class="p">,</span> <span class="n">next</span><span class="p">,</span> <span class="n">view</span><span class="p">.</span><span class="n">view</span><span class="p">(),</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
  <span class="n">std</span><span class="o">::</span><span class="n">swap</span><span class="p">(</span><span class="n">curr</span><span class="p">,</span> <span class="n">next</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>

<p>The result looks like this:</p>

<p><img src="/blog/assets/game_of_life.gif" alt="Game of Life" /></p>

<h3 id="github">GitHub</h3>

<p>For clarity, I’ve left the details of setting up the grids for CUDA, creating the window, displaying the texture, etc. out of this post.
Details can be found on <a href="https://github.com/mmaldacker/Flow">GitHub</a>.</p>]]></content><author><name></name></author><category term="cuda" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Parallel reduce and scan on the GPU</title><link href="/blog/parallel-reduce-and-scan-on-the-GPU" rel="alternate" type="text/html" title="Parallel reduce and scan on the GPU" /><published>2018-12-11T00:00:00+01:00</published><updated>2018-12-11T00:00:00+01:00</updated><id>/blog/parallel-reduce-and-scan-on-the-GPU</id><content type="html" xml:base="/blog/parallel-reduce-and-scan-on-the-GPU"><![CDATA[<h3 id="introduction">Introduction</h3>

<p>GPUs are formidable parallel machines, capable of running thousands of threads simultaneously. They are excellent for embarrassingly parallel algorithms, but programming them is quite different from programming a CPU. You can’t just build and run an application: you need to interact with the GPU driver via one of the several available APIs (CUDA, OpenCL, Vulkan, DirectX, OpenGL, etc.), manage device memory, organize transfers between the host and the device, and dispatch the shaders that will run on the GPU.</p>

<p>We’ll have a look at two basic algorithms: reduce and scan. They are building blocks for more complex algorithms, e.g. solving linear equations or stream compaction.
We’ll use Vulkan with GLSL shaders compiled to SPIR-V, together with the subgroup features introduced in Vulkan 1.1. I chose Vulkan because it runs on many GPUs (NVIDIA, Intel, AMD, Mali, etc.) and on multiple platforms (Windows, Linux, Android, etc.), which makes it more portable than, say, CUDA.</p>

<p>To understand what a subgroup is, let’s review the abstract model used by Vulkan. Vulkan dispatches a number of shader invocations, which are divided into work groups, themselves divided into subgroups, each consisting of a number of invocations.</p>

<p><img src="/blog/assets/gpu.png" alt="GPU abstract model" /></p>

<p>Each work group has its own cache, called shared memory, that can be accessed directly by the shaders. Accessing this memory is much faster than accessing global memory, and algorithms are designed to make as much use of it as possible.
The next subdivision, subgroups, are essentially SIMD units that execute their invocations in lockstep. Invocations within a subgroup can communicate with special instructions, bypassing shared and global memory.</p>

<h3 id="vulkan-subgroups">Vulkan subgroups</h3>

<p>Vulkan mandates some minimum requirements for subgroups for all drivers supporting version 1.1. We can query those capabilities to get information such as the size of subgroups (i.e. how many invocations run per subgroup) and which operations are supported.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">auto</span> <span class="n">properties</span> <span class="o">=</span> 
  <span class="n">physicalDevice</span><span class="p">.</span><span class="n">getProperties2</span><span class="o">&lt;</span><span class="n">vk</span><span class="o">::</span><span class="n">PhysicalDeviceProperties2</span><span class="p">,</span> <span class="n">vk</span><span class="o">::</span><span class="n">PhysicalDeviceSubgroupProperties</span><span class="o">&gt;</span><span class="p">();</span>
<span class="k">auto</span> <span class="n">subgroupProperties</span> <span class="o">=</span> 
  <span class="n">properties</span><span class="p">.</span><span class="n">get</span><span class="o">&lt;</span><span class="n">vk</span><span class="o">::</span><span class="n">PhysicalDeviceSubgroupProperties</span><span class="o">&gt;</span><span class="p">();</span>

<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Subgroup size: "</span> 
          <span class="o">&lt;&lt;</span> <span class="n">subgroupProperties</span><span class="p">.</span><span class="n">subgroupSize</span> 
          <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Subgroup supported operations: "</span> 
          <span class="o">&lt;&lt;</span> <span class="n">vk</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">subgroupProperties</span><span class="p">.</span><span class="n">supportedOperations</span><span class="p">)</span> 
          <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span></code></pre></figure>

<p>On my machine with a Vega 56, the following is returned:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Subgroup size: 64
Subgroup supported operations: {Basic | Vote | Arithmetic | Ballot | Shuffle | ShuffleRelative | Quad}
</code></pre></div></div>

<p>Arithmetic is the type of operation we’ll need to implement scan and reduce. An introduction to the other operation types can be found in the <a href="https://www.khronos.org/blog/vulkan-subgroup-tutorial">Khronos Vulkan subgroup tutorial</a>.</p>

<h3 id="reduce">Reduce</h3>

<p>Reduce is very simple: it takes a list of elements \(x_0, x_1, x_2, ...\) and calculates their sum,</p>

\[x = \sum_{i=0}^n x_i\]

<p>C++17 has added it as <code class="language-plaintext highlighter-rouge">std::reduce</code> which can be run in parallel or sequentially. We’ll use it to compare the performance with the one running on the GPU.
The equivalent operation in Vulkan for subgroups is:</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupAdd</span><span class="p">(</span><span class="n">value</span><span class="p">);</span></code></pre></figure>

<p>Every invocation belonging to the subgroup receives the total sum.</p>

<p style="text-align: center;"><img src="/blog/assets/reduce.png" alt="Reduce" width="75%" /></p>

<p>On my machine, we can thus reduce up to 64 values at once. Since we’ll want to reduce many more elements than that, we use multiple subgroups, each reducing a part of the list, and save each subgroup’s partial sum in shared memory. Assuming the number of subgroups is at most the subgroup size, the first subgroup then loads those partial sums from shared memory and calls <code class="language-plaintext highlighter-rouge">subgroupAdd</code> again. We then choose one invocation to save the sum in global memory.</p>

<p><img src="/blog/assets/subgroup_reduce.png" alt="Subgroup reduce" /></p>

<p>Note that we’re still limited by the maximum size of a work group. Since work groups can’t synchronize with each other, we’ll need to use <code class="language-plaintext highlighter-rouge">atomicAdd</code> or simply run the entire algorithm in multiple passes. Multiple passes allow us to insert a barrier between them to synchronize the global memory on the device. At the end of the first pass we have <code class="language-plaintext highlighter-rouge">N</code> partial sums, one per work group; we then insert a memory barrier and dispatch the same shader again over those <code class="language-plaintext highlighter-rouge">N</code> elements, repeating until a single value remains.</p>

<p>The reduce shader then looks like this (omitting details about declaring the input, output, sizes, etc),</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="n">shared</span> <span class="kt">float</span> <span class="n">sdata</span><span class="p">[</span><span class="n">sumSubGroupSize</span><span class="p">];</span>

<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
  <span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">];</span>
  <span class="p">}</span>

  <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupAdd</span><span class="p">(</span><span class="n">sum</span><span class="p">);</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupInvocationID</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupID</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">memoryBarrierShared</span><span class="p">();</span>
  <span class="n">barrier</span><span class="p">();</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupID</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">gl_SubgroupInvocationID</span> <span class="o">&lt;</span> <span class="n">gl_NumSubgroups</span> <span class="o">?</span> 
      <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupInvocationID</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupAdd</span><span class="p">(</span><span class="n">sum</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_LocalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">outputs</span><span class="p">[</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Let’s see how fast this algorithm is, comparing it against <code class="language-plaintext highlighter-rouge">std::reduce</code> running both sequentially and in parallel. We also compare with a regular reduce using only shared memory, based on the excellent slides by Mark Harris: <a href="https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf">Optimizing Parallel Reduction in CUDA</a>.</p>
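<p>For reference, the sequential CPU baseline is simply a single accumulation pass (a minimal sketch; the parallel variant only differs by an execution policy):</p>

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential CPU baseline: one accumulation pass over the data.
float sequentialSum(const std::vector<float>& v)
{
    return std::reduce(v.begin(), v.end(), 0.0f);
}

// For the parallel baseline, include <execution> and pass a policy, e.g.:
//   std::reduce(std::execution::par, v.begin(), v.end(), 0.0f);
```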

<figure>
  <embed type="image/svg+xml" src="/blog/assets/reduce.svg" />
</figure>

<p>That’s rather disappointing: the subgroup based reduce is only slightly faster. It is, however, much easier to implement and to read than the shared memory version.</p>

<h3 id="scan">Scan</h3>

<p>Scan, or prefix sum, takes a list of elements \(x_0, x_1, x_2, ...\) and produces a sequence of elements \(y_0, y_1, y_2, ...\) such that,</p>

\[\begin{aligned}
y_0 &amp;= x_0 \\
y_1 &amp;= x_0 + x_1 \\
y_2 &amp;= x_0 + x_1 + x_2 \\
&amp;...
\end{aligned}\]

<p>Again, this is available in C++17 as <code class="language-plaintext highlighter-rouge">std::inclusive_scan</code>, which we’ll use as the CPU baseline to compare against the GPU version.
The Vulkan subgroup operation is,</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="kt">float</span> <span class="n">value</span> <span class="o">=</span> <span class="n">subgroupInclusiveAdd</span><span class="p">(</span><span class="n">value</span><span class="p">);</span></code></pre></figure>

<p>Similarly to reduce, each invocation in the subgroup receives the partial sum up to its own index (in increasing order).</p>

<p style="text-align: center;"><img src="/blog/assets/scan.png" alt="Scan" width="60%" /></p>
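<p>For a single subgroup, the same semantics can be reproduced on the CPU with <code class="language-plaintext highlighter-rouge">std::inclusive_scan</code>:</p>

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Inclusive scan on the CPU: element i of the output is the sum of inputs 0..i,
// matching what each invocation receives from subgroupInclusiveAdd.
std::vector<float> inclusiveScan(const std::vector<float>& v)
{
    std::vector<float> out(v.size());
    std::inclusive_scan(v.begin(), v.end(), out.begin());
    return out;
}
```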

<p>We’ll use a similar strategy as for reduce to scan more elements than the subgroup size. Each subgroup computes its partial scan, and we save its last element (i.e. the total sum of the subgroup) in shared memory. Assuming the number of subgroups is at most the subgroup size, the first subgroup loads those values from shared memory and calls <code class="language-plaintext highlighter-rouge">subgroupInclusiveAdd</code> again. Finally, every subgroup except the first adds the scanned total of the previous subgroup to each of its elements.</p>

<p><img src="/blog/assets/subgroup_scan.png" alt="Subgroup scan" /></p>

<p>This works because the scan within each subgroup equals that subgroup’s own partial scan plus the total sum of every element before it. If we look at the equations above and assume a subgroup size of 2, the calculation decomposes as follows,</p>

\[\begin{aligned}
y_0 &amp;= x_0 \\
y_1 &amp;= x_0 + x_1 \\
y_2 &amp;= x_0 + x_1 + x_2 &amp;=&amp; y_1 + x_2\\
y_3 &amp;= x_0 + x_1 + x_2 + x_3 &amp;=&amp; y_1 + x_2 + x_3\\
y_4 &amp;= x_0 + x_1 + x_2 + x_3 + x_4 &amp;=&amp; y_3 + x_4 \\
y_5 &amp;= x_0 + x_1 + x_2 + x_3 + x_4 + x_5 &amp;=&amp; y_3 + x_4 + x_5 \\
&amp;...
\end{aligned}\]

<p>which corresponds to the algorithm described.</p>
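<p>The same decomposition can be checked on the CPU. The sketch below emulates the in-workgroup scan with a deliberately small, hypothetical subgroup size; it illustrates the strategy, it is not the shader itself.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t SubgroupSize = 4;  // assumption: small size for illustration

// Emulates the in-workgroup scan: each subgroup scans its own lanes, the
// subgroup totals are scanned in turn, and the total of all earlier subgroups
// is added back as a per-subgroup offset.
std::vector<float> workGroupScan(std::vector<float> v)
{
    std::size_t numSubgroups = (v.size() + SubgroupSize - 1) / SubgroupSize;
    std::vector<float> totals(numSubgroups, 0.0f);  // the shared memory array

    // Each subgroup performs subgroupInclusiveAdd; the last lane's value is
    // the subgroup's total.
    for (std::size_t sg = 0; sg < numSubgroups; ++sg)
    {
        float running = 0.0f;
        for (std::size_t lane = 0; lane < SubgroupSize; ++lane)
        {
            std::size_t idx = sg * SubgroupSize + lane;
            if (idx >= v.size()) break;
            running += v[idx];
            v[idx] = running;
        }
        totals[sg] = running;
    }

    // The first subgroup scans the totals (the second subgroupInclusiveAdd).
    for (std::size_t sg = 1; sg < numSubgroups; ++sg)
        totals[sg] += totals[sg - 1];

    // Every subgroup except the first adds the scanned total of the
    // previous subgroup to each of its elements.
    for (std::size_t sg = 1; sg < numSubgroups; ++sg)
        for (std::size_t lane = 0; lane < SubgroupSize; ++lane)
        {
            std::size_t idx = sg * SubgroupSize + lane;
            if (idx < v.size()) v[idx] += totals[sg - 1];
        }

    return v;
}
```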

<p>Again as with reduce, this limits us to the maximum size of a work group. To go beyond it, we need multiple passes. In the first pass, we compute the per-workgroup scan and also save each work group’s total in an intermediate array. We then perform another scan on that intermediate array. Finally, we add those scanned intermediate values back to the original elements. Note that these passes mirror the operations already performed inside the shader, just one level up.</p>
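<p>The three passes can be emulated on the host as follows, a sketch under the assumption of a small work group size; <code class="language-plaintext highlighter-rouge">scanBlocks</code> stands in for the first shader dispatch.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t WorkGroupSize = 8;  // assumption: small for illustration

// Pass 1: scan each block in place and record each block's total (the value
// written to partial_sums by the last invocation of the work group).
std::vector<float> scanBlocks(std::vector<float>& v)
{
    std::size_t numBlocks = (v.size() + WorkGroupSize - 1) / WorkGroupSize;
    std::vector<float> partialSums(numBlocks);
    for (std::size_t b = 0; b < numBlocks; ++b)
    {
        std::size_t begin = b * WorkGroupSize;
        std::size_t end = std::min(v.size(), begin + WorkGroupSize);
        std::inclusive_scan(v.begin() + begin, v.begin() + end, v.begin() + begin);
        partialSums[b] = v[end - 1];
    }
    return partialSums;
}

std::vector<float> multiPassScan(std::vector<float> v)
{
    // Pass 1: per-block scan, saving each block's total.
    std::vector<float> partialSums = scanBlocks(v);
    // Pass 2: scan the block totals themselves.
    std::inclusive_scan(partialSums.begin(), partialSums.end(), partialSums.begin());
    // Pass 3: add the scanned total of all previous blocks to each element
    // (every block except the first).
    for (std::size_t i = WorkGroupSize; i < v.size(); ++i)
        v[i] += partialSums[i / WorkGroupSize - 1];
    return v;
}
```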

<p>The scan shader then looks like this (again, omitting declaration for inputs, sizes, etc),</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="n">shared</span> <span class="kt">float</span> <span class="n">sdata</span><span class="p">[</span><span class="n">sumSubGroupSize</span><span class="p">];</span>

<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
  <span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">];</span>
  <span class="p">}</span>

  <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupInclusiveAdd</span><span class="p">(</span><span class="n">sum</span><span class="p">);</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupInvocationID</span> <span class="o">==</span> <span class="n">gl_SubgroupSize</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupID</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">memoryBarrierShared</span><span class="p">();</span>
  <span class="n">barrier</span><span class="p">();</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupID</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="kt">float</span> <span class="n">warpSum</span> <span class="o">=</span> <span class="n">gl_SubgroupInvocationID</span> <span class="o">&lt;</span> <span class="n">gl_NumSubgroups</span> <span class="o">?</span> <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupInvocationID</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">warpSum</span> <span class="o">=</span> <span class="n">subgroupInclusiveAdd</span><span class="p">(</span><span class="n">warpSum</span><span class="p">);</span>
    <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupInvocationID</span><span class="p">]</span> <span class="o">=</span> <span class="n">warpSum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">memoryBarrierShared</span><span class="p">();</span>
  <span class="n">barrier</span><span class="p">();</span>

  <span class="kt">float</span> <span class="n">blockSum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupID</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">blockSum</span> <span class="o">=</span> <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupID</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
  <span class="p">}</span>

  <span class="n">sum</span> <span class="o">+=</span> <span class="n">blockSum</span><span class="p">;</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">outputs</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_LocalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="n">gl_WorkGroupSize</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">partial_sums</span><span class="p">[</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>The shader to add the partial scan back to the list of elements is,</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="n">shared</span> <span class="kt">float</span> <span class="n">sum</span><span class="p">;</span>

<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span>
      <span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">gl_LocalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
    <span class="p">{</span>
      <span class="n">sum</span> <span class="o">=</span> <span class="n">i</span><span class="p">.</span><span class="n">value</span><span class="p">[</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
    <span class="p">}</span>

    <span class="n">memoryBarrierShared</span><span class="p">();</span>
    <span class="n">barrier</span><span class="p">();</span>

    <span class="n">o</span><span class="p">.</span><span class="n">value</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">+=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Again, let’s see how this fares against a CPU implementation and a GPU implementation using shared memory only; for the latter we’ve used the implementation from <a href="https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html">GPU Gems 3</a>.</p>

<figure>
  <embed type="image/svg+xml" src="/blog/assets/scan.svg" />
</figure>

<p>This is very impressive: a much bigger improvement than we saw for the subgroup based reduce!</p>

<h3 id="github">Github</h3>

<p>The implementation of those shaders with Vulkan, along with the benchmarks, can be found on my <a href="https://github.com/mmaldacker/VulkanSubgroups">GitHub</a>. Note that it uses a basic Vulkan engine I wrote, <a href="https://github.com/mmaldacker/Vortex2D">Vortex2D</a>, which implements a 2D fluid engine where the reduce operation is used in a linear solver and the scan operation to remove unused particles.</p>]]></content><author><name></name></author><category term="vulkan" /><summary type="html"><![CDATA[Introduction]]></summary></entry></feed>