Monday, January 28, 2013

Charlie & The Compute-Shader factory #1

Today’s topic is about an increasing phenomenon amongst (graphics) programmers; Chlamydia. Or did I mean to say Compute Shaders? As usual, I won't teach you the in-depth details, but just 'what it is'. Although the name may imply that Chlamydia, I mean Compute Shaders, are another typical game/graphics thing, they can be used on a much wider scale for computer applications. Plus knowing a bit about them, will also gain you a better understanding of what's-going-on under the hood of your computer.

Compute-Shader. Sounds as if it can be placed within the royal line of graphics shaders;
* Vertex Shader: processes (3D model) vertices
* Geometry Shader: Make 3D primitives(triangle, quad, point, line) from vertices
* Fragment shader: Draw pixels(fragments) on a primitive. Yippee, "Coloring".
Or, image it as follow. Pick a piece of paper, then
1- Put some dots (vertices)
2- Draw lines beetween the dots (geometry primitives)
3- Color the shapes with your pencils

Notice that these three steps have always been there in OpenGL, DirectX, or other (hardware/software) "rasterizers", even in the old Quake1 days. But as fixed, non-programmable stages. During the last ~10 years, these 3 stages have become programmable with the famous shaders. Basically a shader is a (small) program that calculates where a vertex has to be placed, how a primitive should be shaped, or how a pixel should be colored. Shaders can be used to compute different types of data as well, as a vertex or a pixel is just a bunch of numeric values well in the end. But the description above would illustrate the most common usage.

So... that Compute Shader… is a 4th step somewhere in this pipeline? Ehm, no. Not really. Yes, it is a program that "Computes" (hence the name!) *something*. But it's not a specific step in a 3D rendering pipeline, neither does it have to run on a graphics card, and neither is it automatically related to graphics at all. Although game-graphics are a useful place to implement Compute Shaders, they are more like an universal kind of program, giving access to deeper layers of the executing hardware. Basically they can compute anything. Pixel-colors, the positions of 100.000 particles, cloth physics, asteroid field simulations, ants, et cetera. As some of you may have noticed, they become handy when large numbers of similar things can be calculated in parallel. To put it short, they are an advanced, and relative new type of shader.

Eins zwei Polizei
Ok... vague. Should I really know more about Compute Shaders? Sure thing. But to truly understand the need of yet another programming-thingie, it might be good to know how massive computing, in particular on a GPU(videocard), works. First, a program is just a (gigantic) list of instructions, being executed by a processor. Second, and to put it simple, there are 2 ways of execution: sequential and parallel. Sequential means instructions get performed one by one. For example:
// Count till ten, then let a fart
1- x = 0 // initialize variable x
2- x = x + 1 // add 1
3- if (x > 10) then // compare
4- fart;
4- goto line2 // go back to line 2, add 1 again
Or to illustrate a very abstract game engine:
1- Read input (mouse, keyboard, ...)
2- Update physics (positions of entities, collisions, ...)
3- Update AI (monsters, NPC's, player, ...)
4- Play sound
5- Render graphics
* wait shortly, go back to step 1 and repeat the whole process again
This is how most program-events work, including realtime systems such as a PLC on a packaging machine. In a typical UI based (Windows) program, when clicking on a button, a sequence of actions will be executed. If those instructions are very costly, or just a lot, it will eventually stall the processor. Since it’s sequential, all the instructions have to complete first before you can click something else. The nice thing about sequential programming is that it guarantees the order of instructions goes as expected, and is therefore easy to predict. In the “game” example above, we already updated all of the 3D model positions before we draw them. No way an object suddenly moves while we are drawing it.

Paralel dimensions
The other way is parallel execution. As the name sais, 2 or more things get done simultaneously. Typically a program will split in two or more "threads" (sub-processes), each doing a specific portion of a bigger whole. Btw, each thread still executes its instructions sequentially, but an important detail is that one thread does not have to wait on another thread. In theory at least. This is particularly handy on multi-core systems. One processor could calculate the physics, another one audio, a third one renders the graphics, and so on. But also older single-core systems are familiar with threading for a long time. If you click a button that executes a super-heavy calculation, you can still move the mouse, operate another program, play a music, and download porn at the same time. This is because all these processes are split in separate threads, and your computer smartly swaps between the threads (based on priority) giving them all a chance to execute… or to lock up the computer completely and force a hard reset.
Note that in the parallel setup, the "reporting back" may happen at different times, in different orders.

Sounds smart, and potentially the fastest way to do things. But in practice working with multiple threads has a lot of caveats. There is no guarantee that something in thread-A has been finished before thread-B continues on it. So when threads have to co-operate, you’ll need some clear rules and smart tricks to “synchronize” them. It's like having construction worker A making a scaffolding before construction worker B arrives with his tools. If worker A was busy showing his buttcrack and whistling at passing girls, worker B falls down and breaks his neck if he didn't check whether A already installed the scaffolding or not. In programming terms, a thread can check if another thread did his work (or isn't busy on something) with the help of 'mutexes', 'semaphores' or perform 'atomic operations'. You can forget those terms, but to show an example:
This prevents (very, very, VERY) crazy unpredictable bugs. But at the same time it brings the whole parallel strength back to the ground. If construction worker B has to wait the whole day on worker A, he could just as well stay in bed that day. Waiting = a waste of resources. And in the worst case, we'll keep waiting forever if all tasks are dependant on each other (see Crossroad example below).

The magic key to work in parallel successfully, is to minimize the dependancy on other processes as much as possible. To less worker A has to deal with B, the less chance they are waiting on each other. Let them do separated tasks. A installs the toilet, B plasters the walls, C makes coffee and D whistles at the girls. Perfect.

Parallelism on GPU’s
You may wonder what the hell this all has to do with (Compute)Shaders, so let's get back to the videocard. One practical computer-world example of making very good use of parallel powers, is the GPU(videocard processor). As you all know, inside the videocard and monitor, Umpa Lumpa's are running around with pixels. First they paint the pixel bricks in the right color, then they climb up on each other and hold the brick in front of them - that's what you see on your monitor. Now, put your finger on your monitor, and count the pixels. 1,2,3,4... already know it? Well, if your screen resolution is 1920 x 1080 (HD TV), then you should count about 2.073.600 pixels.

As the shaders showed above, each freak'n pixel-color has to be calculated. The color is usually the result of several textures, nearby lights, reflections, shadows, and maybe some more techniques. Imagine one Umpa Lumpa had to draw all the pixel-bricks(one by one), and then stack them on each other one by one? Obviously, that is not how a videocard works. No. Videocards use multiple Umpa Lumpa's. Drawing pixels and putting them in place is something that can be done perfectly seperate. Each Umpa Lumpa would be responsible for a single pixel brick. The reason this can be done in parallel is because Umpa Lumpa's don't give a fuck about how other neighbor pixels are painted. They just focus on their own brick.

Warps / Wavefront -> Super Umpa Lumpa's
Unfortunately, the GPU doesn't have enough place to house up to 2 million Umpa Lumpa's that would be required for a HD screen. It would be physically impossible (so far) to manufacture such advanced GPU chips. Instead, the GPU bakeries came with some other smart tricks. Willy Wonka mutated his Umpa Lumpa's so each of them has 256, 512, or even more(!) arms. With 512 arms, you can paint 512 pixel-bricks at the same time of course. So a single Umpa Lumpa would become responsible for a whole bunch of pixels. In nVidia land they would call such a Umpa Lumpa a "Warp” . In the AMD(ATI) bakeries they are called "Wavefronts". Different names, but the same idea.

Your videocard card has several of those "super Umpa Lumpas" (Warps or Wavefronts) available to do ‘stuff’. They work entirely separated. For example, Warp Lumpa #1 could draw 16x6 pixels on the bottom left side of your screen, while Warp Lumpa #2 does 16x16 pixels on the top-right side. They don't bother with each other, so neither do they have to wait on each other. Typically, your screen would get separated in smaller square tiles, and each tile gets computed by a Warp/Wavefront/Umpa Lumpa. Parallelism mate.
Just look at them working!

There is one more important thing to tell about Umpa Lumpas. With 512 arms, you can handle 512 pixels at the same time. BUT. What if those pixels vary a lot? Let's say some of those pixels are using a water material with reflections, while the other half is made of simpler grass colors? 512 hands or not, the poor bastard still has 1 brain, so he moves all of his hands in the same way more or less. Having 100 hands painting water and the other hands grass is too confusing. So, he first paints the grass pixels, THEN the water pixels. Or vice-versa. Either way, when dealing with mixed orders, it takes more time to finish all the pixels.
Although processor instructions would look a bit more complicated than this, a Warp or Wavefront can apply an instruction on multiple tasks. The values may vary, but the workload keeps the same... that is, if all pixels are using the same program.

Now a bit more serious and technical. First, the whole description has been dumbed down for readability(really?!), so forgive errors (I’m not a GPU-architecture expert anyway). Second, a Warp or Wavefront can execute the same processor instructions for multiple of its "pixels" (note that a pixel could be something very different as well, so let’s name it “tasks” from now on) simultaneously. See the image above. Although the numbers (or texture values) may vary, the instructions remain the same. This way all the tasks for a Warp or Wavefront can be perfectly executed at the same time. Combined with having multiple Wavefronts or Warps, this is what makes GPU's so damn fast and able to render millions and millions of pixels (30 or more times per second!).

Typically when we apply one shader, all the pixels undergo the same instructions. However, if your shader contains a lot of branching (IF THEN ELSE), variable length loops, or contents using different shaders, the execution of the Warp and Wavefront may become slower, as it has to deal with multiple scenario's. An IF branch in your shader doesn’t have to be bad, but make sure the surrounding pixels will likely pick the same path. If all pixels within a Warp/Wavefront pick the same execution route, there Is no problem and the branch may actually make things faster. But if the contents are mixed, the processor will perform a worst case scenario. Basically it will make 1 long program with all instructions combined, executing everything even if the instructions do not apply on a pixel. In order to keep things fast, try to prevent these scenario's. This is also one of those reasons why we like to sort & batch things.

Ok. I still didn't tell what Compute Shaders themselves actually are, but let's call it quits for today. First make sure you have a slight understanding of how a videocard issues tasks amongst it's "workers". Without that knowledge, Compute Shaders can't be used efficiently anyway. See you next time. Umpa dumpa... OOps, almost forgot a screenshot.
Freak on a leash

Saturday, January 12, 2013

A good start is half the work

A bit late but...
* starting Abba
Happy new year, Happy new year,
* shoots Abba *.

No one blew of his head during New Year's Eve? Good. 2013 started with mixed feelings here. Good because one of my friends finished their house after a lot of hard working, and the T22 blog reached 50+ readers. Nothing compared to truly popular sites, and if you really care about viewers you should piss on your little sister during her sleep and put it on Youtube. But still, a few years ago I would be happy there if maybe 5 readers would be interested in T22, so it feels like a little Milestone. Yet another good personal thing were promotions on work where I get more in charge of the technology/software used on all our machines. Which is certainly not that naturally anymore these days we're people lose their jobs or got to fight to maintain some work.

Mixed feelings, because that shitty crisis finally also struck my family. Little work in the construction sector means little or no more work for my father. It's not that he will starve now, but for someone who still likes to work and did his stinking best making big factories, stadiums, flats and other awesome buildings the past 40 years, this is not a honorable end of his working career. Neither a promising fate for my little brother who finished a complex study in the same sector, and can't find a proper job now. So as much luck I got with my jobs, so bad things can go with family. Or Spanish friends I met via this project, that are desperately looking for whatever they can do to make a little income.

So besides wishing you a good health and love, I especially wish you luck with keeping or finding work. This world was made by hard working men and women, and let's keep it that way. Crisis or not.

As for T22, 2013 gave a nice starting shot as well. Hopefully it wasn't some silly new-year fad, but suddenly several guys offered their help on T22 (after some months of silence). Not all of them passed, yet help offers are always heartwarming. And two of them, Mario & Fransisco (both Spanish, as most of the team members by now) are going to help on what we call "Global Mapping". Not global warming, but making empty maps. One of the reasons why our progress is slow, is because so far we focused on demo's and made just a few rooms in full detail. Making a room mesh isn't so much work, but decorating it with textures, furniture, lamps, decals, new shaders, sounds or whatever needed, eats a lot of time. With as a result that still little maps have been made... meaning there is no actual game to play. You can run through a few rooms, and then our flat world ends. In order to create & test gameplay such as exploration, solving puzzles or arm-wrestling with monsters, you need some maps at least. Then decorate them later.

So, in contrast to the real world, there are a lot of construction sites in T22, and hopefully Mario & Fransisco can help us. Further. "Demo4" is getting more shape. Federico & Diego made quite a lot of cool things last weeks to transform the ugly empty corridor to something gloomy, spooky, scary. I can't show too many pictures yet. First because their assets aren't 100% finished yet, and second because I need to reserve some "exclusive shots” for a magazine that likes to make an interview... But anyway, I’m hoping we can finally show a movie again somewhere half this year.
First results coming in. Though the textures aren't finished yet, it doesn't look bad for a first empty version. Notice the ceiling betting lit (realtime) by the new implemented GI solution.

As for my own progress, besides the realtime GI finally getting a bit successful, time was spend on upgrading the animations. As you may remember from the first demo, we already had an animated rusty garbage-can-player-robot, and the ability to pick up things. But during the last 2 years, I changed a lot of things in order to make it all better and more flexible / programmable. One of the open tasks was to switch over on a 3D-animation format processionals would use. Which isn't Milkshape (though I still like the tool for its clean simpleness). Now that we have an animator, Antonio, it was time to get his Blender / Maya work imported somehow....

But what file format to chose? Since I already had some Collada stuff implemented, I decided to extend the Collada importer. Which was a waste of time. I liked the format at first, but reading the rig alone was confusing as hell. I'm sure it's documented somehow, and I've seen tutorials as well. But it seems each 3D package delivers different output in this "universal" format, and the skeletons resulted in the typical Steven Seagal treatment: broken limbs. Problem with Collada is that it's so goddamn big that is has become too complex to quickly extract the fraction of data you need out of the file. And no, I'm don't have patience to sit, read, try, and learn for days. Writing a file reader shouldn't take more than a few hours. I'm making a game, don’t to care about file loaders. Bah, didn't like the XML overhead from Collada anyway.

So, I switched over to Autodesk FBX, another sort of "common" file format. The file looked simple at first when having a preview in Notepad. Yep, that would suit fine... Turns out that FBX is just as worse as Collada. Reading a mesh and some objects is childsplay, but getting the animation skeleton properly requires some 16th grade Tesla formula's. And of course, no simple small example with copy-paste code.

But wait! Autodesk also delivers C++ code so you don't have to bother reading the actual file. Since I'm using it would require to make a C++ batch program first to convert FBX files to another much more logical T22 format, but ok. I can live with that. But, the fuck. That stupid API doesn't just read FBX files and gives you some 3D objects. Hell no. It launches missiles to Moscow, it can be used to control Space shuttles, reads bedtime stories for your kids, and goes to bed with your wife. Yes, yet another overdone super-file-good-for-everything-and-therefore-waytoohard-to-learn platform.

But all right. After a few violent nights I finally extracted the few bits I actually needed from a FBX file. So, animations can be imported now, as well as assigning sound effects or special programmable events to keyframes. For example, at 0.3 seconds the weapon should fire a bullet & say “bang!”, at 0.31 show a nozzle flash sprite, and at 0.34 spit out an empty bullet case. Yep, time to make an animated, firing gun now!

And the moral of the story? Hire a script kid that makes a plug for Maya/Max/Blender/X that directly saves stuff into your own much cooler file format.
Radiator made by Diego (and no, the green block doesn't belong there).