<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="/blog/" rel="alternate" type="text/html" /><updated>2026-04-01T19:16:39+02:00</updated><id>/blog/feed.xml</id><title type="html">Programming and mathematics</title><entry><title type="html">CUDA and OpenGL interop</title><link href="/blog/cuda-opengl-interop" rel="alternate" type="text/html" title="CUDA and OpenGL interop" /><published>2026-04-01T00:00:00+02:00</published><updated>2026-04-01T00:00:00+02:00</updated><id>/blog/cuda-opengl-interop</id><content type="html" xml:base="/blog/cuda-opengl-interop"><![CDATA[<h3 id="introduction">Introduction</h3>

<p>I’d like to write a fluid simulator with CUDA and document every step of the way, in order to improve my writing skills.
I want to keep the simulator simple and just display the result with OpenGL.
The display will be done with <a href="https://www.raylib.com/">raylib</a>, a C library for quickly prototyping games, which makes it easy to display and manipulate the fluid.
On the CUDA side, I’ll use <a href="https://github.com/eyalroz/cuda-api-wrappers">this C++ wrapper</a> to simplify the CUDA code.</p>

<h3 id="cuda-and-opengl-interop">CUDA and OpenGL interop</h3>

<p>CUDA has facilities for interoperability with OpenGL which are pretty straightforward to use. We can simply map an OpenGL texture or buffer to a CUDA array or device pointer and use it directly.</p>

<p>Let’s have a look at how to do this with a texture (the process is very similar for buffers). We first need to register the OpenGL texture so it can be used in CUDA,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">resource</span><span class="p">;</span>

<span class="c1">// id is a GLuint of an OpenGL texture</span>
<span class="n">cudaGraphicsGLRegisterImage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">resource</span><span class="p">,</span> 
                            <span class="n">id</span><span class="p">,</span>  
                            <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> 
                            <span class="n">cudaGraphicsRegisterFlagsNone</span><span class="p">);</span>                                   </code></pre></figure>

<p>Following this, we can map the resource into a CUDA array,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaArray_t</span> <span class="n">array</span><span class="p">;</span>

<span class="n">cudaGraphicsMapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">cudaGraphicsSubResourceGetMappedArray</span><span class="p">(</span><span class="o">&amp;</span><span class="n">array</span><span class="p">,</span> <span class="n">resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span></code></pre></figure>

<p>Unfortunately, we cannot use CUDA arrays directly in a kernel. We have two options here: either copy to/from the array with,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// format is the number of bytes per pixel, e.g. 4 for an RGBA8 texture</span>
<span class="n">cudaMemcpy2DToArray</span><span class="p">(</span><span class="n">array</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span>
                    <span class="n">format</span> <span class="o">*</span> <span class="n">width</span><span class="p">,</span>
                    <span class="n">format</span> <span class="o">*</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">,</span>
                    <span class="n">cudaMemcpyDeviceToDevice</span><span class="p">);</span></code></pre></figure>

<p>Or we can use a surface object backed by the array. We first need to create it before we can use it,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaSurfaceObject_t</span> <span class="n">surface_object</span><span class="p">;</span>
<span class="n">cudaResourceDesc</span> <span class="n">texRes</span><span class="p">{};</span>
<span class="n">texRes</span><span class="p">.</span><span class="n">resType</span> <span class="o">=</span> <span class="n">cudaResourceTypeArray</span><span class="p">;</span>
<span class="n">texRes</span><span class="p">.</span><span class="n">res</span><span class="p">.</span><span class="n">array</span><span class="p">.</span><span class="n">array</span> <span class="o">=</span> <span class="n">array</span><span class="p">;</span>

<span class="n">cudaCreateSurfaceObject</span><span class="p">(</span><span class="o">&amp;</span><span class="n">surface_object</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">texRes</span><span class="p">);</span></code></pre></figure>

<p>This can now be used in a kernel. Of course, we mustn’t forget to destroy the surface object and unmap the resource afterwards,</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">cudaDestroySurfaceObject</span><span class="p">(</span><span class="n">surface_object</span><span class="p">);</span>
<span class="n">cudaGraphicsUnmapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span></code></pre></figure>

<h3 id="designing-a-nice-encapsulation">Designing a nice encapsulation</h3>

<p>The mapping and creation of the surface object, which later need to be unmapped and destroyed, fit very nicely with C++’s RAII idiom.
In other words, the mapping/creation can be done in a constructor and the unmapping/destruction in a destructor, so we can’t forget to unmap/destroy:</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="k">class</span> <span class="nc">OpenglTextureView</span>
<span class="p">{</span>
<span class="nl">public:</span>
  <span class="o">~</span><span class="n">OpenglTextureView</span><span class="p">()</span>
  <span class="p">{</span>
    <span class="n">cudaDestroySurfaceObject</span><span class="p">(</span><span class="n">_surface_object</span><span class="p">);</span>
    <span class="n">cudaGraphicsUnmapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">_resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="n">cudaSurfaceObject_t</span> <span class="n">view</span><span class="p">()</span> <span class="k">const</span>
  <span class="p">{</span>
    <span class="k">return</span> <span class="n">_surface_object</span><span class="p">;</span>
  <span class="p">}</span>

<span class="k">private</span><span class="o">:</span>
  <span class="n">OpenglTextureView</span><span class="p">(</span><span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">resource</span><span class="p">)</span><span class="o">:</span> <span class="n">_resource</span><span class="p">(</span><span class="n">resource</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="k">auto</span> <span class="n">status</span> <span class="o">=</span> <span class="n">cudaGraphicsMapResources</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">_resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to map resources"</span><span class="p">);</span>

    <span class="n">cudaArray_t</span> <span class="n">array</span><span class="p">;</span>
    <span class="n">status</span> <span class="o">=</span> <span class="n">cudaGraphicsSubResourceGetMappedArray</span><span class="p">(</span><span class="o">&amp;</span><span class="n">array</span><span class="p">,</span> <span class="n">_resource</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to map array"</span><span class="p">);</span>

    <span class="n">cudaResourceDesc</span> <span class="n">texRes</span><span class="p">{};</span>
    <span class="n">texRes</span><span class="p">.</span><span class="n">resType</span> <span class="o">=</span> <span class="n">cudaResourceTypeArray</span><span class="p">;</span>
    <span class="n">texRes</span><span class="p">.</span><span class="n">res</span><span class="p">.</span><span class="n">array</span><span class="p">.</span><span class="n">array</span> <span class="o">=</span> <span class="n">array</span><span class="p">;</span>

    <span class="n">status</span> <span class="o">=</span> <span class="n">cudaCreateSurfaceObject</span><span class="p">(</span><span class="o">&amp;</span><span class="n">_surface_object</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">texRes</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to create surface object"</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="c1">// OpenglTexture needs access to the private constructor</span>
  <span class="k">friend</span> <span class="k">class</span> <span class="n">OpenglTexture</span><span class="p">;</span>

  <span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">_resource</span><span class="p">;</span>
  <span class="n">cudaSurfaceObject_t</span> <span class="n">_surface_object</span><span class="p">;</span>
<span class="p">};</span></code></pre></figure>

<p>Note that we’ve made the constructor private; this way we can also encapsulate the resource by wrapping it in another class that will also call the register function:</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="k">class</span> <span class="nc">OpenglTexture</span>
<span class="p">{</span>
<span class="nl">public:</span>
  <span class="n">OpenglTexture</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">id</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="k">auto</span> <span class="n">status</span> <span class="o">=</span> <span class="n">cudaGraphicsGLRegisterImage</span><span class="p">(</span><span class="o">&amp;</span><span class="n">_resource</span><span class="p">,</span> <span class="n">id</span><span class="p">,</span> <span class="n">GL_TEXTURE_2D</span><span class="p">,</span> <span class="n">cudaGraphicsRegisterFlagsNone</span><span class="p">);</span>
    <span class="n">throw_if_error_lazy</span><span class="p">(</span><span class="n">status</span><span class="p">,</span> <span class="s">"failed to register OpenGL texture"</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="n">OpenglTextureView</span> <span class="nf">map</span><span class="p">()</span> <span class="k">const</span>
  <span class="p">{</span>
    <span class="k">return</span> <span class="n">OpenglTextureView</span><span class="p">(</span><span class="n">_resource</span><span class="p">);</span>
  <span class="p">}</span>

<span class="k">private</span><span class="o">:</span>
  <span class="n">cudaGraphicsResource</span><span class="o">*</span> <span class="n">_resource</span><span class="p">;</span>
<span class="p">};</span></code></pre></figure>

<h3 id="game-of-life">Game of Life</h3>

<p>To make a nice little demo, I thought we could write a simple <a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">Game of Life</a> simulation.
This is pretty straightforward: we need three inputs, the previous grid, the next grid, and the texture to display the new state.
We set the dead/alive state on the next grid given the previous grid, and set a colour on the texture so we can display it.</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="k">__global__</span> <span class="kt">void</span> <span class="nf">game_of_life_step</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">uint8_t</span><span class="o">*</span> <span class="n">prev</span><span class="p">,</span>
                                  <span class="n">std</span><span class="o">::</span><span class="kt">uint8_t</span><span class="o">*</span> <span class="n">next</span><span class="p">,</span>
                                  <span class="n">cudaSurfaceObject_t</span> <span class="n">surface</span><span class="p">,</span>
                                  <span class="kt">int</span> <span class="n">width</span><span class="p">,</span>
                                  <span class="kt">int</span> <span class="n">height</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">const</span> <span class="kt">uchar4</span> <span class="n">lightgray</span>  <span class="o">=</span> <span class="n">make_uchar4</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">200</span><span class="p">,</span> <span class="mi">255</span><span class="p">);</span>
  <span class="k">const</span> <span class="kt">uchar4</span> <span class="n">darkgray</span> <span class="o">=</span> <span class="n">make_uchar4</span><span class="p">(</span><span class="mi">80</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">255</span><span class="p">);</span>

  <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">y</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">y</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">width</span> <span class="o">&amp;&amp;</span> <span class="n">y</span> <span class="o">&lt;</span> <span class="n">height</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="kt">int</span> <span class="n">left</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">width</span><span class="p">)</span> <span class="o">%</span> <span class="n">width</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">right</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">width</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">top</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">%</span> <span class="n">height</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">bottom</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">height</span><span class="p">;</span>

    <span class="k">auto</span> <span class="n">i</span> <span class="o">=</span> <span class="p">[</span><span class="o">=</span><span class="p">](</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">width</span><span class="p">;</span> <span class="p">};</span>

    <span class="kt">int</span> <span class="n">total</span> <span class="o">=</span> <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">top</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">bottom</span><span class="p">)]</span> <span class="o">+</span>
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">top</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">top</span><span class="p">)]</span> <span class="o">+</span> 
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">bottom</span><span class="p">)]</span> <span class="o">+</span>
                <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">right</span><span class="p">,</span> <span class="n">bottom</span><span class="p">)];</span>

    <span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">=</span> <span class="n">total</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">||</span> <span class="p">(</span><span class="n">total</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="n">prev</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)])</span> <span class="o">?</span> <span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">auto</span> <span class="n">colour</span> <span class="o">=</span> <span class="n">next</span><span class="p">[</span><span class="n">i</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)]</span> <span class="o">?</span> <span class="n">lightgray</span> <span class="o">:</span> <span class="n">darkgray</span><span class="p">;</span>
    <span class="n">surf2Dwrite</span><span class="p">(</span><span class="n">colour</span><span class="p">,</span> <span class="n">surface</span><span class="p">,</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">4</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Each step, we map the texture, call the kernel, and then swap the previous and next grid pointers.</p>

<figure class="highlight"><pre><code class="language-cuda" data-lang="cuda"><span class="p">{</span>
  <span class="k">auto</span> <span class="n">view</span> <span class="o">=</span> <span class="n">glTexture</span><span class="p">.</span><span class="n">map</span><span class="p">();</span>
  <span class="n">cuda</span><span class="o">::</span><span class="n">launch</span><span class="p">(</span><span class="n">game_of_life_step</span><span class="p">,</span> <span class="n">launch_config_2d</span><span class="p">,</span> <span class="n">curr</span><span class="p">,</span> <span class="n">next</span><span class="p">,</span> <span class="n">view</span><span class="p">.</span><span class="n">view</span><span class="p">(),</span> <span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
  <span class="n">std</span><span class="o">::</span><span class="n">swap</span><span class="p">(</span><span class="n">curr</span><span class="p">,</span> <span class="n">next</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>

<p>The result looks like this:</p>

<p><img src="/blog/assets/game_of_life.gif" alt="Game of Life" /></p>

<h3 id="github">GitHub</h3>

<p>For clarity, I’ve left the details of setting up the grids for CUDA, creating the window, displaying the texture, etc. out of this post.
Details can be found on <a href="https://github.com/mmaldacker/Flow">GitHub</a>.</p>]]></content><author><name></name></author><category term="cuda" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Parallel reduce and scan on the GPU</title><link href="/blog/parallel-reduce-and-scan-on-the-GPU" rel="alternate" type="text/html" title="Parallel reduce and scan on the GPU" /><published>2018-12-11T00:00:00+01:00</published><updated>2018-12-11T00:00:00+01:00</updated><id>/blog/parallel-reduce-and-scan-on-the-GPU</id><content type="html" xml:base="/blog/parallel-reduce-and-scan-on-the-GPU"><![CDATA[<h3 id="introduction">Introduction</h3>

<p>GPUs are formidable parallel machines, capable of running thousands of threads simultaneously. They are excellent for embarrassingly parallel algorithms, but programming them is quite different from programming a CPU. You can’t just build and run an application: you need to interact with the GPU driver via one of the several available APIs (CUDA, OpenCL, Vulkan, DirectX, OpenGL, etc.), manage device memory, organize transfers between the host and the device, and dispatch the shaders that will run on the GPU.</p>

<p>We’ll have a look at two basic algorithms: reduce and scan. They are building blocks for more complex algorithms, e.g. solving linear equations or stream compaction.
We’ll use Vulkan with GLSL shaders compiled to SPIR-V, together with the subgroup features introduced in Vulkan 1.1. I chose Vulkan because it runs on many GPUs (NVIDIA, Intel, AMD, Mali, etc.) and on multiple platforms (Windows, Linux, Android, etc.), which makes it more portable than, say, CUDA.</p>

<p>To understand what a subgroup is, let’s review the abstract model used by Vulkan. Vulkan dispatches a number of shader invocations, which are divided into work groups, themselves divided into subgroups, each consisting of a number of invocations.</p>

<p><img src="/blog/assets/gpu.png" alt="GPU abstract model" /></p>

<p>Each work group has its own cache, called shared memory, that can be accessed directly by the shaders. Accessing this memory is much faster than accessing global memory, and algorithms are designed to make as much use of it as possible.
The next subdivision, subgroups, are essentially SIMD units that execute their invocations in lockstep. Invocations within a subgroup can communicate with special instructions, bypassing shared and global memory.</p>

<h3 id="vulkan-subgroups">Vulkan subgroups</h3>

<p>Vulkan mandates some minimum requirements for subgroups for all drivers supporting version 1.1. We can query those capabilities to get information such as the size of subgroups (i.e. how many invocations run per subgroup) and which operations are supported.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">auto</span> <span class="n">properties</span> <span class="o">=</span> 
  <span class="n">physicalDevice</span><span class="p">.</span><span class="n">getProperties2</span><span class="o">&lt;</span><span class="n">vk</span><span class="o">::</span><span class="n">PhysicalDeviceProperties2</span><span class="p">,</span> <span class="n">vk</span><span class="o">::</span><span class="n">PhysicalDeviceSubgroupProperties</span><span class="o">&gt;</span><span class="p">();</span>
<span class="k">auto</span> <span class="n">subgroupProperties</span> <span class="o">=</span> 
  <span class="n">properties</span><span class="p">.</span><span class="n">get</span><span class="o">&lt;</span><span class="n">vk</span><span class="o">::</span><span class="n">PhysicalDeviceSubgroupProperties</span><span class="o">&gt;</span><span class="p">();</span>

<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Subgroup size: "</span> 
          <span class="o">&lt;&lt;</span> <span class="n">subgroupProperties</span><span class="p">.</span><span class="n">subgroupSize</span> 
          <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

<span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Subgroup supported operations: "</span> 
          <span class="o">&lt;&lt;</span> <span class="n">vk</span><span class="o">::</span><span class="n">to_string</span><span class="p">(</span><span class="n">subgroupProperties</span><span class="p">.</span><span class="n">supportedOperations</span><span class="p">)</span> 
          <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span></code></pre></figure>

<p>On my machine with a Vega 56, the following is returned:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Subgroup size: 64
Subgroup supported operations: {Basic | Vote | Arithmetic | Ballot | Shuffle | ShuffleRelative | Quad}
</code></pre></div></div>

<p>Arithmetic is the type of operation we’ll need to implement scan and reduce. An introduction to the other operation types can be found in the <a href="https://www.khronos.org/blog/vulkan-subgroup-tutorial">Khronos Vulkan subgroup tutorial</a>.</p>

<h3 id="reduce">Reduce</h3>

<p>Reduce is very simple: it takes a list of elements \(x_0, x_1, x_2, ...\) and calculates their sum,</p>

\[x = \sum_{i=0}^n x_i\]

<p>C++17 has added it as <code class="language-plaintext highlighter-rouge">std::reduce</code> which can be run in parallel or sequentially. We’ll use it to compare the performance with the one running on the GPU.
The equivalent operation in Vulkan for subgroups is:</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupAdd</span><span class="p">(</span><span class="n">value</span><span class="p">);</span></code></pre></figure>

<p>Every invocation belonging to the subgroup receives the total sum.</p>

<p style="text-align: center;"><img src="/blog/assets/reduce.png" alt="Reduce" width="75%" /></p>

<p>On my machine, we can thus reduce up to 64 values at once. Since we’ll want to reduce many more elements than that, we use multiple subgroups, each reducing a part of the list, and save each subgroup’s partial sum in shared memory. Assuming the number of subgroups is at most the subgroup size, the first subgroup then loads those partial sums from shared memory and calls <code class="language-plaintext highlighter-rouge">subgroupAdd</code> again. We then choose one invocation to save the sum in global memory.</p>

<p><img src="/blog/assets/subgroup_reduce.png" alt="Subgroup reduce" /></p>

<p>Note that we’re still limited by the maximum size of a work group. Since work groups can’t synchronize with each other, we’ll need to use <code class="language-plaintext highlighter-rouge">atomicAdd</code> or simply run the entire algorithm in multiple passes. Multiple passes allow us to insert a barrier between them to synchronize the global memory on the device. At the end of the first pass we have <code class="language-plaintext highlighter-rouge">N</code> partial sums, one per work group; we then insert a memory barrier and dispatch the same shader again over those <code class="language-plaintext highlighter-rouge">N</code> elements, repeating until a single value remains.</p>

<p>The reduce shader then looks like this (omitting details about declaring the input, output, sizes, etc),</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="n">shared</span> <span class="kt">float</span> <span class="n">sdata</span><span class="p">[</span><span class="n">sumSubGroupSize</span><span class="p">];</span>

<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
  <span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">];</span>
  <span class="p">}</span>

  <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupAdd</span><span class="p">(</span><span class="n">sum</span><span class="p">);</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupInvocationID</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupID</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">memoryBarrierShared</span><span class="p">();</span>
  <span class="n">barrier</span><span class="p">();</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupID</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">gl_SubgroupInvocationID</span> <span class="o">&lt;</span> <span class="n">gl_NumSubgroups</span> <span class="o">?</span> 
      <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupInvocationID</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupAdd</span><span class="p">(</span><span class="n">sum</span><span class="p">);</span>
  <span class="p">}</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_LocalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">outputs</span><span class="p">[</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Let’s see how fast this algorithm is, comparing it against <code class="language-plaintext highlighter-rouge">std::reduce</code> running both sequentially and in parallel. We also compare with a regular reduce using only shared memory, based on the excellent slides by Mark Harris: <a href="https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf">Optimizing Parallel Reduction in CUDA</a>.</p>
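<p>For reference, the sequential CPU baseline is simply a single accumulation pass (a minimal sketch; the parallel variant only differs by an execution policy):</p>

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential CPU baseline: one accumulation pass over the data.
float sequentialSum(const std::vector<float>& v)
{
    return std::reduce(v.begin(), v.end(), 0.0f);
}

// For the parallel baseline, include <execution> and pass a policy, e.g.:
//   std::reduce(std::execution::par, v.begin(), v.end(), 0.0f);
```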

<figure>
  <embed type="image/svg+xml" src="/blog/assets/reduce.svg" />
</figure>

<p>That’s rather disappointing: the subgroup based reduce is only slightly faster. It is, however, much easier to implement and to read than the shared memory version.</p>

<h3 id="scan">Scan</h3>

<p>Scan, or prefix sum, takes a list of elements \(x_0, x_1, x_2, ...\) and produces a sequence of elements \(y_0, y_1, y_2, ...\) such that,</p>

\[\begin{aligned}
y_0 &amp;= x_0 \\
y_1 &amp;= x_0 + x_1 \\
y_2 &amp;= x_0 + x_1 + x_2 \\
&amp;...
\end{aligned}\]

<p>Again, this is available in C++17 as <code class="language-plaintext highlighter-rouge">std::inclusive_scan</code>, which we’ll use as the CPU baseline to compare against the GPU version.
The Vulkan subgroup operation is,</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="kt">float</span> <span class="n">value</span> <span class="o">=</span> <span class="n">subgroupInclusiveAdd</span><span class="p">(</span><span class="n">value</span><span class="p">);</span></code></pre></figure>

<p>Similarly to reduce, each invocation in the subgroup receives the partial sum up to its own index (in increasing order).</p>

<p style="text-align: center;"><img src="/blog/assets/scan.png" alt="Scan" width="60%" /></p>
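<p>For a single subgroup, the same semantics can be reproduced on the CPU with <code class="language-plaintext highlighter-rouge">std::inclusive_scan</code>:</p>

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Inclusive scan on the CPU: element i of the output is the sum of inputs 0..i,
// matching what each invocation receives from subgroupInclusiveAdd.
std::vector<float> inclusiveScan(const std::vector<float>& v)
{
    std::vector<float> out(v.size());
    std::inclusive_scan(v.begin(), v.end(), out.begin());
    return out;
}
```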

<p>We’ll use a similar strategy as for reduce to scan more elements than the subgroup size. Each subgroup computes its partial scan, and we save its last element (i.e. the total sum of the subgroup) in shared memory. Assuming the number of subgroups is at most the subgroup size, the first subgroup loads those values from shared memory and calls <code class="language-plaintext highlighter-rouge">subgroupInclusiveAdd</code> again. Finally, every subgroup except the first adds the scanned total of the previous subgroup to each of its elements.</p>

<p><img src="/blog/assets/subgroup_scan.png" alt="Subgroup scan" /></p>

<p>This works because the scan within each subgroup equals that subgroup’s own partial scan plus the total sum of every element before it. If we look at the equations above and assume a subgroup size of 2, the calculation decomposes as follows,</p>

\[\begin{aligned}
y_0 &amp;= x_0 \\
y_1 &amp;= x_0 + x_1 \\
y_2 &amp;= x_0 + x_1 + x_2 &amp;=&amp; y_1 + x_2\\
y_3 &amp;= x_0 + x_1 + x_2 + x_3 &amp;=&amp; y_1 + x_2 + x_3\\
y_4 &amp;= x_0 + x_1 + x_2 + x_3 + x_4 &amp;=&amp; y_3 + x_4 \\
y_5 &amp;= x_0 + x_1 + x_2 + x_3 + x_4 + x_5 &amp;=&amp; y_3 + x_4 + x_5 \\
&amp;...
\end{aligned}\]

<p>which corresponds to the algorithm described.</p>
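<p>The same decomposition can be checked on the CPU. The sketch below emulates the in-workgroup scan with a deliberately small, hypothetical subgroup size; it illustrates the strategy, it is not the shader itself.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t SubgroupSize = 4;  // assumption: small size for illustration

// Emulates the in-workgroup scan: each subgroup scans its own lanes, the
// subgroup totals are scanned in turn, and the total of all earlier subgroups
// is added back as a per-subgroup offset.
std::vector<float> workGroupScan(std::vector<float> v)
{
    std::size_t numSubgroups = (v.size() + SubgroupSize - 1) / SubgroupSize;
    std::vector<float> totals(numSubgroups, 0.0f);  // the shared memory array

    // Each subgroup performs subgroupInclusiveAdd; the last lane's value is
    // the subgroup's total.
    for (std::size_t sg = 0; sg < numSubgroups; ++sg)
    {
        float running = 0.0f;
        for (std::size_t lane = 0; lane < SubgroupSize; ++lane)
        {
            std::size_t idx = sg * SubgroupSize + lane;
            if (idx >= v.size()) break;
            running += v[idx];
            v[idx] = running;
        }
        totals[sg] = running;
    }

    // The first subgroup scans the totals (the second subgroupInclusiveAdd).
    for (std::size_t sg = 1; sg < numSubgroups; ++sg)
        totals[sg] += totals[sg - 1];

    // Every subgroup except the first adds the scanned total of the
    // previous subgroup to each of its elements.
    for (std::size_t sg = 1; sg < numSubgroups; ++sg)
        for (std::size_t lane = 0; lane < SubgroupSize; ++lane)
        {
            std::size_t idx = sg * SubgroupSize + lane;
            if (idx < v.size()) v[idx] += totals[sg - 1];
        }

    return v;
}
```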

<p>Again as with reduce, this limits us to the maximum size of a work group. To go beyond it, we need multiple passes. In the first pass, we compute the per-workgroup scan and also save each work group’s total in an intermediate array. We then perform another scan on that intermediate array. Finally, we add those scanned intermediate values back to the original elements. Note that these passes mirror the operations already performed inside the shader, just one level up.</p>
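<p>The three passes can be emulated on the host as follows, a sketch under the assumption of a small work group size; <code class="language-plaintext highlighter-rouge">scanBlocks</code> stands in for the first shader dispatch.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t WorkGroupSize = 8;  // assumption: small for illustration

// Pass 1: scan each block in place and record each block's total (the value
// written to partial_sums by the last invocation of the work group).
std::vector<float> scanBlocks(std::vector<float>& v)
{
    std::size_t numBlocks = (v.size() + WorkGroupSize - 1) / WorkGroupSize;
    std::vector<float> partialSums(numBlocks);
    for (std::size_t b = 0; b < numBlocks; ++b)
    {
        std::size_t begin = b * WorkGroupSize;
        std::size_t end = std::min(v.size(), begin + WorkGroupSize);
        std::inclusive_scan(v.begin() + begin, v.begin() + end, v.begin() + begin);
        partialSums[b] = v[end - 1];
    }
    return partialSums;
}

std::vector<float> multiPassScan(std::vector<float> v)
{
    // Pass 1: per-block scan, saving each block's total.
    std::vector<float> partialSums = scanBlocks(v);
    // Pass 2: scan the block totals themselves.
    std::inclusive_scan(partialSums.begin(), partialSums.end(), partialSums.begin());
    // Pass 3: add the scanned total of all previous blocks to each element
    // (every block except the first).
    for (std::size_t i = WorkGroupSize; i < v.size(); ++i)
        v[i] += partialSums[i / WorkGroupSize - 1];
    return v;
}
```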

<p>The scan shader then looks like this (again, omitting declaration for inputs, sizes, etc),</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="n">shared</span> <span class="kt">float</span> <span class="n">sdata</span><span class="p">[</span><span class="n">sumSubGroupSize</span><span class="p">];</span>

<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
  <span class="kt">float</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">];</span>
  <span class="p">}</span>

  <span class="n">sum</span> <span class="o">=</span> <span class="n">subgroupInclusiveAdd</span><span class="p">(</span><span class="n">sum</span><span class="p">);</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupInvocationID</span> <span class="o">==</span> <span class="n">gl_SubgroupSize</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupID</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">memoryBarrierShared</span><span class="p">();</span>
  <span class="n">barrier</span><span class="p">();</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupID</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="kt">float</span> <span class="n">warpSum</span> <span class="o">=</span> <span class="n">gl_SubgroupInvocationID</span> <span class="o">&lt;</span> <span class="n">gl_NumSubgroups</span> <span class="o">?</span> <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupInvocationID</span><span class="p">]</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">warpSum</span> <span class="o">=</span> <span class="n">subgroupInclusiveAdd</span><span class="p">(</span><span class="n">warpSum</span><span class="p">);</span>
    <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupInvocationID</span><span class="p">]</span> <span class="o">=</span> <span class="n">warpSum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">memoryBarrierShared</span><span class="p">();</span>
  <span class="n">barrier</span><span class="p">();</span>

  <span class="kt">float</span> <span class="n">blockSum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_SubgroupID</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">blockSum</span> <span class="o">=</span> <span class="n">sdata</span><span class="p">[</span><span class="n">gl_SubgroupID</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
  <span class="p">}</span>

  <span class="n">sum</span> <span class="o">+=</span> <span class="n">blockSum</span><span class="p">;</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">outputs</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">gl_LocalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="n">gl_WorkGroupSize</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">partial_sums</span><span class="p">[</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>The shader to add the partial scan back to the list of elements is,</p>

<figure class="highlight"><pre><code class="language-glsl" data-lang="glsl"><span class="n">shared</span> <span class="kt">float</span> <span class="n">sum</span><span class="p">;</span>

<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span>
      <span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">&lt;</span> <span class="n">consts</span><span class="p">.</span><span class="n">n</span><span class="p">)</span>
  <span class="p">{</span>
    <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">gl_LocalInvocationID</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
    <span class="p">{</span>
      <span class="n">sum</span> <span class="o">=</span> <span class="n">i</span><span class="p">.</span><span class="n">value</span><span class="p">[</span><span class="n">gl_WorkGroupID</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
    <span class="p">}</span>

    <span class="n">memoryBarrierShared</span><span class="p">();</span>
    <span class="n">barrier</span><span class="p">();</span>

    <span class="n">o</span><span class="p">.</span><span class="n">value</span><span class="p">[</span><span class="n">gl_GlobalInvocationID</span><span class="p">.</span><span class="n">x</span><span class="p">]</span> <span class="o">+=</span> <span class="n">sum</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Again, let’s see how this fares against a CPU implementation and a GPU implementation using shared memory only; for the latter we’ve used the implementation from <a href="https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html">GPU Gems 3</a>.</p>

<figure>
  <embed type="image/svg+xml" src="/blog/assets/scan.svg" />
</figure>

<p>This is very impressive: a much bigger improvement than we saw for the subgroup based reduce!</p>

<h3 id="github">Github</h3>

<p>The implementation of those shaders with Vulkan, along with the benchmarks, can be found on my <a href="https://github.com/mmaldacker/VulkanSubgroups">GitHub</a>. Note that it uses a basic Vulkan engine I wrote, <a href="https://github.com/mmaldacker/Vortex2D">Vortex2D</a>, which implements a 2D fluid engine where the reduce operation is used in a linear solver and the scan operation to remove unused particles.</p>]]></content><author><name></name></author><category term="vulkan" /><summary type="html"><![CDATA[Introduction]]></summary></entry></feed>