Monday, January 28, 2013

Charlie & The Compute-Shader factory #1

Today’s topic is about an increasing phenomenon amongst (graphics) programmers; Chlamydia. Or did I mean to say Compute Shaders? As usual, I won't teach you the in-depth details, but just 'what it is'. Although the name may imply that Chlamydia, I mean Compute Shaders, are another typical game/graphics thing, they can be used on a much wider scale for computer applications. Plus knowing a bit about them, will also gain you a better understanding of what's-going-on under the hood of your computer.

Compute-Shader. Sounds as if it can be placed within the royal line of graphics shaders;
* Vertex Shader: processes (3D model) vertices
* Geometry Shader: Make 3D primitives(triangle, quad, point, line) from vertices
* Fragment shader: Draw pixels(fragments) on a primitive. Yippee, "Coloring".
Or, image it as follow. Pick a piece of paper, then
1- Put some dots (vertices)
2- Draw lines beetween the dots (geometry primitives)
3- Color the shapes with your pencils

Notice that these three steps have always been there in OpenGL, DirectX, or other (hardware/software) "rasterizers", even in the old Quake1 days. But as fixed, non-programmable stages. During the last ~10 years, these 3 stages have become programmable with the famous shaders. Basically a shader is a (small) program that calculates where a vertex has to be placed, how a primitive should be shaped, or how a pixel should be colored. Shaders can be used to compute different types of data as well, as a vertex or a pixel is just a bunch of numeric values well in the end. But the description above would illustrate the most common usage.

So... that Compute Shader… is a 4th step somewhere in this pipeline? Ehm, no. Not really. Yes, it is a program that "Computes" (hence the name!) *something*. But it's not a specific step in a 3D rendering pipeline, neither does it have to run on a graphics card, and neither is it automatically related to graphics at all. Although game-graphics are a useful place to implement Compute Shaders, they are more like an universal kind of program, giving access to deeper layers of the executing hardware. Basically they can compute anything. Pixel-colors, the positions of 100.000 particles, cloth physics, asteroid field simulations, ants, et cetera. As some of you may have noticed, they become handy when large numbers of similar things can be calculated in parallel. To put it short, they are an advanced, and relative new type of shader.

Eins zwei Polizei
Ok... vague. Should I really know more about Compute Shaders? Sure thing. But to truly understand the need of yet another programming-thingie, it might be good to know how massive computing, in particular on a GPU(videocard), works. First, a program is just a (gigantic) list of instructions, being executed by a processor. Second, and to put it simple, there are 2 ways of execution: sequential and parallel. Sequential means instructions get performed one by one. For example:
// Count till ten, then let a fart
1- x = 0 // initialize variable x
2- x = x + 1 // add 1
3- if (x > 10) then // compare
4- fart;
4- goto line2 // go back to line 2, add 1 again
Or to illustrate a very abstract game engine:
1- Read input (mouse, keyboard, ...)
2- Update physics (positions of entities, collisions, ...)
3- Update AI (monsters, NPC's, player, ...)
4- Play sound
5- Render graphics
* wait shortly, go back to step 1 and repeat the whole process again
This is how most program-events work, including realtime systems such as a PLC on a packaging machine. In a typical UI based (Windows) program, when clicking on a button, a sequence of actions will be executed. If those instructions are very costly, or just a lot, it will eventually stall the processor. Since it’s sequential, all the instructions have to complete first before you can click something else. The nice thing about sequential programming is that it guarantees the order of instructions goes as expected, and is therefore easy to predict. In the “game” example above, we already updated all of the 3D model positions before we draw them. No way an object suddenly moves while we are drawing it.

Paralel dimensions
The other way is parallel execution. As the name sais, 2 or more things get done simultaneously. Typically a program will split in two or more "threads" (sub-processes), each doing a specific portion of a bigger whole. Btw, each thread still executes its instructions sequentially, but an important detail is that one thread does not have to wait on another thread. In theory at least. This is particularly handy on multi-core systems. One processor could calculate the physics, another one audio, a third one renders the graphics, and so on. But also older single-core systems are familiar with threading for a long time. If you click a button that executes a super-heavy calculation, you can still move the mouse, operate another program, play a music, and download porn at the same time. This is because all these processes are split in separate threads, and your computer smartly swaps between the threads (based on priority) giving them all a chance to execute… or to lock up the computer completely and force a hard reset.
Note that in the parallel setup, the "reporting back" may happen at different times, in different orders.

Sounds smart, and potentially the fastest way to do things. But in practice working with multiple threads has a lot of caveats. There is no guarantee that something in thread-A has been finished before thread-B continues on it. So when threads have to co-operate, you’ll need some clear rules and smart tricks to “synchronize” them. It's like having construction worker A making a scaffolding before construction worker B arrives with his tools. If worker A was busy showing his buttcrack and whistling at passing girls, worker B falls down and breaks his neck if he didn't check whether A already installed the scaffolding or not. In programming terms, a thread can check if another thread did his work (or isn't busy on something) with the help of 'mutexes', 'semaphores' or perform 'atomic operations'. You can forget those terms, but to show an example:
This prevents (very, very, VERY) crazy unpredictable bugs. But at the same time it brings the whole parallel strength back to the ground. If construction worker B has to wait the whole day on worker A, he could just as well stay in bed that day. Waiting = a waste of resources. And in the worst case, we'll keep waiting forever if all tasks are dependant on each other (see Crossroad example below).

The magic key to work in parallel successfully, is to minimize the dependancy on other processes as much as possible. To less worker A has to deal with B, the less chance they are waiting on each other. Let them do separated tasks. A installs the toilet, B plasters the walls, C makes coffee and D whistles at the girls. Perfect.

Parallelism on GPU’s
You may wonder what the hell this all has to do with (Compute)Shaders, so let's get back to the videocard. One practical computer-world example of making very good use of parallel powers, is the GPU(videocard processor). As you all know, inside the videocard and monitor, Umpa Lumpa's are running around with pixels. First they paint the pixel bricks in the right color, then they climb up on each other and hold the brick in front of them - that's what you see on your monitor. Now, put your finger on your monitor, and count the pixels. 1,2,3,4... already know it? Well, if your screen resolution is 1920 x 1080 (HD TV), then you should count about 2.073.600 pixels.

As the shaders showed above, each freak'n pixel-color has to be calculated. The color is usually the result of several textures, nearby lights, reflections, shadows, and maybe some more techniques. Imagine one Umpa Lumpa had to draw all the pixel-bricks(one by one), and then stack them on each other one by one? Obviously, that is not how a videocard works. No. Videocards use multiple Umpa Lumpa's. Drawing pixels and putting them in place is something that can be done perfectly seperate. Each Umpa Lumpa would be responsible for a single pixel brick. The reason this can be done in parallel is because Umpa Lumpa's don't give a fuck about how other neighbor pixels are painted. They just focus on their own brick.

Warps / Wavefront -> Super Umpa Lumpa's
Unfortunately, the GPU doesn't have enough place to house up to 2 million Umpa Lumpa's that would be required for a HD screen. It would be physically impossible (so far) to manufacture such advanced GPU chips. Instead, the GPU bakeries came with some other smart tricks. Willy Wonka mutated his Umpa Lumpa's so each of them has 256, 512, or even more(!) arms. With 512 arms, you can paint 512 pixel-bricks at the same time of course. So a single Umpa Lumpa would become responsible for a whole bunch of pixels. In nVidia land they would call such a Umpa Lumpa a "Warp” . In the AMD(ATI) bakeries they are called "Wavefronts". Different names, but the same idea.

Your videocard card has several of those "super Umpa Lumpas" (Warps or Wavefronts) available to do ‘stuff’. They work entirely separated. For example, Warp Lumpa #1 could draw 16x6 pixels on the bottom left side of your screen, while Warp Lumpa #2 does 16x16 pixels on the top-right side. They don't bother with each other, so neither do they have to wait on each other. Typically, your screen would get separated in smaller square tiles, and each tile gets computed by a Warp/Wavefront/Umpa Lumpa. Parallelism mate.
Just look at them working!

There is one more important thing to tell about Umpa Lumpas. With 512 arms, you can handle 512 pixels at the same time. BUT. What if those pixels vary a lot? Let's say some of those pixels are using a water material with reflections, while the other half is made of simpler grass colors? 512 hands or not, the poor bastard still has 1 brain, so he moves all of his hands in the same way more or less. Having 100 hands painting water and the other hands grass is too confusing. So, he first paints the grass pixels, THEN the water pixels. Or vice-versa. Either way, when dealing with mixed orders, it takes more time to finish all the pixels.
Although processor instructions would look a bit more complicated than this, a Warp or Wavefront can apply an instruction on multiple tasks. The values may vary, but the workload keeps the same... that is, if all pixels are using the same program.

Now a bit more serious and technical. First, the whole description has been dumbed down for readability(really?!), so forgive errors (I’m not a GPU-architecture expert anyway). Second, a Warp or Wavefront can execute the same processor instructions for multiple of its "pixels" (note that a pixel could be something very different as well, so let’s name it “tasks” from now on) simultaneously. See the image above. Although the numbers (or texture values) may vary, the instructions remain the same. This way all the tasks for a Warp or Wavefront can be perfectly executed at the same time. Combined with having multiple Wavefronts or Warps, this is what makes GPU's so damn fast and able to render millions and millions of pixels (30 or more times per second!).

Typically when we apply one shader, all the pixels undergo the same instructions. However, if your shader contains a lot of branching (IF THEN ELSE), variable length loops, or contents using different shaders, the execution of the Warp and Wavefront may become slower, as it has to deal with multiple scenario's. An IF branch in your shader doesn’t have to be bad, but make sure the surrounding pixels will likely pick the same path. If all pixels within a Warp/Wavefront pick the same execution route, there Is no problem and the branch may actually make things faster. But if the contents are mixed, the processor will perform a worst case scenario. Basically it will make 1 long program with all instructions combined, executing everything even if the instructions do not apply on a pixel. In order to keep things fast, try to prevent these scenario's. This is also one of those reasons why we like to sort & batch things.

Ok. I still didn't tell what Compute Shaders themselves actually are, but let's call it quits for today. First make sure you have a slight understanding of how a videocard issues tasks amongst it's "workers". Without that knowledge, Compute Shaders can't be used efficiently anyway. See you next time. Umpa dumpa... OOps, almost forgot a screenshot.
Freak on a leash


  1. Nice work with the articles, can't wait to see the next one! This site is probably the first place I've seen, where some important details (such as branching on the GPU) are brought up in a simple way. Keep up the good work!

  2. Simplicity is key. Otherwise I wouldn't understand my own writing ;)