Jul 28, 2021

Compute Shaders and You

A number of people have asked me about the differences between a compute (or kernel) shader and a pixel (or fragment) shader. If you're coming from a background of doing shader art (creating procedural textures), they can look quite similar.

Generally, I would naively say that they are the same except that a compute shader has no graphics pipeline. But that is not the only difference, and there is so much more to them that I decided to dive in and write it down for my own understanding. Hopefully you find this helpful!

No Graphics Pipeline

As I mentioned above, an obvious difference between kernel and fragment shaders is that a kernel shader does not sit within the graphics pipeline. A fragment shader processes the fragments that rasterization produces, and its inputs are values interpolated from the vertex shader's outputs. The compute shader has nothing to do with any of this. It reads data (if you want), and it writes data (if you want).

If you've only been making visuals with fragment and vertex shaders, there are some things you may be taking for granted as well. One of the major differences is thread visibility and your ability to define how thread groups will be executed, whereas in a fragment shader there is a guarantee that adjacent pixels will be grouped together in 2x2 groups called quads. This has consequences which I will get to next.

Fragment Shader Functions

When making visuals in a pixel shader, you generally have to manage anti-aliasing on your own. There are some functions which help with resolution-independent anti-aliasing. In Metal the derivative functions are dfdx, dfdy, and fwidth. dfdx and dfdy give you the partial derivative of a value along the x and y axes in screen space, approximated by differencing the value across neighboring pixels, and fwidth(p) = fabs(dfdx(p)) + fabs(dfdy(p)). HLSL has the equivalents ddx and ddy, and GLSL has dFdx and dFdy.
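To make the relationship between these functions concrete, here is a minimal sketch in Python rather than shader code. It assumes a simplified model in which the derivatives are taken as plain differences between neighboring pixels of one 2x2 quad. The names `quad_derivatives` and `ramp` are made up for illustration, and real hardware may compute the differences at a finer granularity.

```python
# Simplified model of screen-space derivatives inside one 2x2 quad.
# Assumption: one difference per axis, measured from the quad's top-left pixel.

def quad_derivatives(f, x, y):
    """Return (dfdx, dfdy, fwidth) for the quad whose top-left pixel is (x, y)."""
    top_left = f(x, y)
    top_right = f(x + 1, y)
    bottom_left = f(x, y + 1)
    dfdx = top_right - top_left      # change of f across one pixel in x
    dfdy = bottom_left - top_left    # change of f across one pixel in y
    fwidth = abs(dfdx) + abs(dfdy)   # same formula as in the text above
    return dfdx, dfdy, fwidth


def ramp(x, y):
    # Hypothetical procedural value: a linear ramp across the screen.
    return 0.25 * x + 0.5 * y


dfdx, dfdy, fwidth = quad_derivatives(ramp, 4, 6)
print(dfdx, dfdy, fwidth)
```

The point of the sketch is that every derivative needs the value from a neighboring pixel, which is why the neighbors have to be shaded together.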

The compute shader makes no guarantee to schedule work in quads, so you cannot rely on these functions there: depending on the API they are either unavailable or their results are undefined. This can be a real problem if you're doing visuals and don't have resources to spare for MSAA.
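One possible workaround, sketched here in Python as an assumption rather than a recommendation, is to evaluate the procedural function at neighboring positions yourself and build an fwidth-style value from the differences. The circle distance function, the `shade` helper, and the parameter values are all hypothetical. The two extra evaluations per thread are exactly the kind of cost that hurts on constrained hardware.

```python
import math


def circle_sdf(x, y, cx=8.0, cy=8.0, radius=5.0):
    # Hypothetical shape: signed distance to a circle (negative inside).
    return math.hypot(x - cx, y - cy) - radius


def smoothstep(edge0, edge1, v):
    # Standard smoothstep: clamp to [0, 1], then apply the cubic 3t^2 - 2t^3.
    t = min(max((v - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def shade(x, y):
    """Anti-aliased coverage of the circle at pixel (x, y), with no quad derivatives."""
    d = circle_sdf(x, y)
    # Two extra evaluations stand in for dfdx and dfdy.
    ddx = circle_sdf(x + 1, y) - d
    ddy = circle_sdf(x, y + 1) - d
    width = max(abs(ddx) + abs(ddy), 1e-6)  # avoid a zero-width edge
    return 1.0 - smoothstep(-width, width, d)


print(shade(8, 8), shade(13, 8), shade(0, 0))
```

Pixels well inside the circle get full coverage, pixels well outside get none, and a pixel exactly on the edge lands halfway between.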

The reasons for how all of this works are best described in "A trip through the Graphics Pipeline", which is absolutely recommended reading! But ultimately these functions are the reason there is a guarantee to run the pixel shader in quads.

Thread Groups

One of the most important concepts with compute shaders is the thread group. If you think about image processing, the total number of threads that need to be executed would be width * height. Those threads would be broken up into n thread groups. The threads in a thread group are scheduled together on the same compute unit. So, similar to how pixel shaders run per pixel but in quads, you can execute compute shader code in a thread group size of your choosing. Moreover, the threads in a thread group can share memory, allowing for further optimization.
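The arithmetic behind splitting width * height threads into thread groups can be sketched as follows. This is a CPU-side simulation in Python, not a real GPU API: `threadgroup_count` and `dispatch` are hypothetical helpers, and the nested loops stand in for work that a GPU would run in parallel. It assumes the common pattern of rounding the group count up and skipping threads that fall outside the image.

```python
def threadgroup_count(width, height, group_w, group_h):
    # Round up so that partially filled groups still cover the image edges.
    groups_x = (width + group_w - 1) // group_w
    groups_y = (height + group_h - 1) // group_h
    return groups_x, groups_y


def dispatch(width, height, group_w, group_h, kernel):
    """Call kernel(x, y) once per pixel, organized the way thread groups would be."""
    groups_x, groups_y = threadgroup_count(width, height, group_w, group_h)
    for gy in range(groups_y):
        for gx in range(groups_x):
            # Everything inside these two inner loops belongs to one thread group.
            for ty in range(group_h):
                for tx in range(group_w):
                    x = gx * group_w + tx  # position in the whole grid
                    y = gy * group_h + ty
                    if x >= width or y >= height:
                        continue  # thread lies outside the image, do nothing
                    kernel(x, y)


visited = []
dispatch(10, 6, 8, 8, lambda x, y: visited.append((x, y)))
print(threadgroup_count(10, 6, 8, 8), len(visited))
```

With a 10x6 image and 8x8 groups, two groups (128 threads) are launched but only 60 of those threads touch a pixel, which is why the bounds check matters.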

Summary

Compute shaders are great for so many types of tasks. You can use them for physics simulation, complex calculations, particle systems, image processing, post-processing a rendered image, deep learning, computational geometry, deferred rendering, etc. With shared memory you can store and use intermediate results directly on the GPU. And you can do all of this alongside the graphics pipeline if you need it.

There are many pros and cons to all of this. And while there will always be graphics programming best practices, the beauty of the programmable graphics pipeline is that we can get really creative with how we use the APIs and hardware.

I want to say that when doing visuals I would just use a fragment shader every time, but honestly it depends on the details of what you're trying to accomplish and the system constraints within which you are working. If I hadn't been working in a tightly constrained mobile environment, I wouldn't have run into this issue of not having fwidth, and calculating additional samples for anti-aliasing would have been totally fine.