The slow progress on this game is frustrating. We all know that making a game ain't easy. Imagine you had to shoot a movie like Jurassic Park all alone: writing the script, doing the camera work, composing the sound, putting on a Brontosaur costume and doing the dino moans. Fortunately, I do get some help from talented people, though with their small numbers and busy schedules, it's still like Steven Spielberg getting help from the neighbor kids once in a while.
Well, that's just something you have to accept when you plan a hobby game. Especially if it will be a bigger game with relatively high standards for the graphics, audio, design, and... pretty much everything. Yet, it's sometimes frustrating when you think you have a “Golden Egg”, but just not the resources to hatch it. Like trying to grab a $100 bill lying at the bottom of a well, with your arm being 3 cm too short. Like… well, you get the point.
Even more frustrating is the fact that a lot of *TRASH* on TV and game consoles does get a (big!) budget to produce its FOUL. I see "comedy" movies with jokes that could have been written by our dim-witted parrot on a daily basis. Action movies with scripts that aren't bad because the director intended them to be cheesy, but because the director IS BAD. And c'mon, why do idiots like Snooki and The Situation turn into millionaires? It's a disgrace. If only all that wasted money had been spent on more creative, ambitious, new, fresh initiatives...
The first green object in the grim dirty world of T22.
But that's how the world rolls apparently, another fact to accept. In the meantime, we slowly advance towards the next demo movie release, yard by yard, in a muddy trench war. But hey, cheer up! I actually intended to show you some fun & positive clips in this post.
If you ever made a game, or mod, or custom level, or anything creative, you are likely familiar with the Critical-Why?-Breakpoint. No? Yes you are. You start with a lot of energy on your new game, book, comic or Unreal Tournament level. After a few days (or months in case of a game), and after seeing some first results, you probably ask yourself why you are spending so many hours on it. Usually the new levels look ugly, and the game doesn't feel like a game at all. More like a machine playing some cheap effects when pressing a button. Making a creative product is about ups and downs. Sometimes you amaze yourself with good looking pictures, another time you’re about to throw in the towel because it all seems pointless. This is actually why 99% of the creative (hobby) projects fail.
The last few weeks I've been implementing the gun further. Not that Tower22 will become a shooter, but... gun + monster = game. It's really to give ourselves something "playable", which will encourage us to implement more gameplay-related stuff such as enemies to fight, a UI, climbing a ladder, or solving a puzzle. Anyway, while looking on the internet at how other games fire their guns (it's more than just spitting out a projectile and letting the speakers say "boom!"), I stumbled across some funny clips. How about these Doom alpha versions:
Doom early Alpha versions
You may remember Doom as an awesome, highly addictive, well working, and also graphically great game. But as this movie shows, the game wasn't born perfect. Even later alpha versions still sucked graphically and game wise. Shooting an ugly boomstick with bad sounds, letting non responsive monster sprites just disappear. It reminds me very much of how the half finished, broken gameplay feels in the games I did/do. And that cheered me up :) You would think game Gods like id Software would create cool stuff right from the start, but their alpha versions suck just as much as mine do hehe.
Speaking of early alphas. Aside from the realtime GI, and a new texture on the lower wood panels, this still isn't the type of picture to proudly show. But I promise you, this corridor will become a whole lot more special within a couple of months... Then compare again.
Doom was a long time ago, but when watching the movie, I suddenly remembered the fuss about Doom3 & Halflife2. While the entire game world was anxiously waiting, id and Valve were developing their (over?)hyped sequels at what seemed to us a slow pace. But then, Oops Poops, their alpha versions "leaked", one or two years before the actual releases. Maybe they leaked it on purpose, just to see what the audience would think. Well, let me tell you what I thought: IT SUCKED!
Although the leaked Doom3 alpha was already graphically appealing, the loose pieces of gameplay felt very stiff and scripted (duh, it was intended for an E3 demonstration) and... just not like an enjoyable game. The Halflife2 leak had more challenging enemies & allowed you to play with physics (new for that time!). Yet the level design was a mess and the graphics were dull compared to Doom3. Again, a half scripted mess, and the fun in shooting Combine soldiers and zombies didn't last long.
* Doom3 leaked version
* Halflife2 leaked version
Of course, I realized those versions weren't finished, nor intended for my dirty fingers on the keyboard yet. But I remember having serious doubts, especially about Halflife2. Would that game ever become fun? It was a classic showcase that good graphics and a well-known title aren’t going to save weak gameplay.
Implemented a simple backlighting method for the plant. Should evolve further when making more advanced SSS/translucent materials in the future.
Well, fortunately both Doom3 and Halflife2 also showed that you shouldn't judge a "WIP" (Work in Progress) product. Because the final versions polished out the bugs, improved the visuals (especially HL2), replaced bad audio, and maybe most important, gave an immersive world that invited exploration, and made shooting zombies fun again. Really, small tweaks can do miracles and "completeness" is a very important quality factor. In fact, after finishing the official HL2, I realized that the leaked version already contained most of the levels in rough form, but in such a poor state that I couldn’t make a consistent, story-driven game out of it.
* Halflife2: Alpha graphics versus Finished graphics
Remember those things once you're getting a "programmer's block" again, while having Snooki puking booze over jWowww on your TV in the background. Your game isn't bad, it just needs time. Little kids shit their pants for the first 2 or 3 years as well, lazy rockstars need 10 years for their next album, you didn't learn pleasing your girl in a single day either, and stew only tastes good if it's been simmering for at least 6 hours. And as for the audience: be patient! Merci beaucoup.
Saturday, February 23, 2013
Saturday, February 9, 2013
Charlie & The Compute-Shader factory #2
If you wondered what Compute Shaders are, and read the first post, your question probably still isn't answered. Parallel computing, GPUs, Oompa Loompas, what else? Yet, it's important to understand those fundamentals a bit. Sometimes you gotta know the whys before doing something. In the programming world, there are too many techniques and ways to accomplish things, so before just wasting time on yet another technique like these Compute Shaders, it's pretty useful to sort out why (or why not) you may need them. I suppose you don't just buy thermo-nuclear particle accelerators without really knowing what they are either.
I'll be honest with you, so far zero Compute Shaders are part of the Tower22 engine. I made several, but either my outdated hardware didn't support some specific features, or I could replace them with other (simpler) methods. Like Geometry Shaders, CS (Compute Shaders) aren't exactly required for each and every situation. Most of the rendering works just fine with the existing OpenGL, DirectX, Vertex/Geometry/Fragment shaders, so I wouldn't suddenly swap to another technique if not really needed. Certainly not as long as these CS are still a bit immature and (slightly) harder to write. Old fashioned shaders debug easier, and might even run a bit faster.
That said, now let's focus on Compute Shaders, and in particular on their advantage over traditional shaders. Yes, we're getting a bit more technical, so you can skip this dance if you don't give a damn about programming. Like most other programs, a CS takes input like numeric parameters or buffers, and it writes output back into buffers. A CS doesn't draw polygons or anything (remember I said a CS doesn’t have anything to do with rendering), it just fills buffers with numbers. That’s it. Writing buffers is not exactly the definition of Cool, but you must realize that common shaders basically do the same. The exception is that those shaders are tightly integrated in the graphics rendering pipeline (to save you work, and to protect you from screwing things up).
These in- or output buffers are usually:
* arrays of numbers or vectors (like a vertex array)
* arrays of structs (multiple attributes (per vertex))
* 1D, 2D or 3D texture (OpenGL / DirectX)
Those structs or numbers could be anything, but in a 3D context it makes sense to use OpenGL or DirectX buffers such as VBOs or Textures to work with, so the output is stored on the GPU in a way OpenGL or DirectX can proceed with. To give a practical example, you could do Vertex Skinning (animating with skeleton bones) in a Compute Shader:
- Make a VBO containing all vertices, texcoords, normals, weights and bone Indices in OpenGL
- Let the CPU update a skeleton (= an array of bone matrices)
- Pass the VBO and Skeleton arrays to a Compute Shader
- Let the CS calculate the updated vertex positions by multiplying them with the bone matrices
- Let the CS stream out the results to (another) VBO
- Later on, render the updated VBO (the one with the end result vertex positions / normals)
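To make those steps a bit more tangible, here is a minimal sketch in plain C of the work a single work-item would do per vertex. It is not code from the Tower22 engine: the struct layouts, the two-bones-per-vertex limit and all names (`SkinVertex`, `BoneMatrix`, `skin_one`, `skin_all`) are assumptions made up for this illustration. On the GPU, the body of `skin_one` would roughly be the kernel, and the loop in `skin_all` would be replaced by launching one work-item per vertex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data layouts, chosen for this sketch only. */
typedef struct { float x, y, z; } Vec3;
typedef struct { float m[12]; } BoneMatrix;  /* 3x4, row-major: rotation + translation */
typedef struct {
    Vec3  position;
    int   bone[2];    /* indices into the skeleton array */
    float weight[2];  /* expected to sum to 1 */
} SkinVertex;

/* Transform point p by a 3x4 bone matrix. */
static Vec3 transform_point(const BoneMatrix *b, Vec3 p)
{
    Vec3 r;
    r.x = b->m[0] * p.x + b->m[1] * p.y + b->m[2]  * p.z + b->m[3];
    r.y = b->m[4] * p.x + b->m[5] * p.y + b->m[6]  * p.z + b->m[7];
    r.z = b->m[8] * p.x + b->m[9] * p.y + b->m[10] * p.z + b->m[11];
    return r;
}

/* The work of ONE work-item: skin vertex number gid and store the result. */
static void skin_one(size_t gid, const SkinVertex *in,
                     const BoneMatrix *skeleton, Vec3 *out)
{
    Vec3 sum = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < 2; ++i) {
        Vec3  p = transform_point(&skeleton[in[gid].bone[i]], in[gid].position);
        float w = in[gid].weight[i];
        sum.x += w * p.x;
        sum.y += w * p.y;
        sum.z += w * p.z;
    }
    out[gid] = sum;  /* written to the (other) output buffer */
}

/* CPU stand-in for launching the kernel over all vertices. */
void skin_all(size_t count, const SkinVertex *in,
              const BoneMatrix *skeleton, Vec3 *out)
{
    assert(in && skeleton && out);
    for (size_t gid = 0; gid < count; ++gid)
        skin_one(gid, in, skeleton, out);
}
```

Note that each vertex only reads its own input slot and writes its own output slot, which is exactly why this particular job also fits Transform Feedback just fine.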
For those who did animations before, you can do the same with vertex Transform Feedback (OpenGL) or Streaming (DirectX), so why use a Compute Shader instead? Well, you don't have to. I would stick with OpenGL or DirectX actually. However, there are scenarios where a CS fits better, as they are more flexible. Down below I'll list some main features of CS that are different from common shaders. But first, and good to know, you can implement CS in your app by using either OpenCL (by Khronos, the team also behind OpenGL) or NVIDIA's CUDA. Possibly there are more APIs, but these two seem to be the best known ones. So far I only tried OpenCL, so let's focus on that one. But I guess CUDA isn't much different. Like OpenGL, OpenCL comes as a DLL with a bunch of functions to get system information, compile shaders, make buffers, share interop buffers/textures between OpenGL and OpenCL, and to launch the shaders.
........For OpenGL / Delphi fans, they didn’t forget about us, several libraries and examples were made:
.............http://code.google.com/p/delphi-opencl/
.............http://download.cnet.com/OpenCL-for-Borland-Delphi/3000-2070_4-11881405.html
.............http://www.brothersoft.com/opencl-for-borland-delphi-449951.html
........Also, make sure to print these papers and use them as wallpaper:
.............OpenCL function card
Some very basic code examples
OpenCL super powers
===============================================
* Simplicity
Can't speak for DirectX, but in GL, it often takes quite some steps to set up a buffer, create a rendering context, get a shader doing something in a buffer, and so on. The OpenCL API is minimal. Once you wrote the basic setup steps (by looking at an example) to support CS in your application, it's really simple to use them anywhere, anytime.
* More flexible shader coding
Although it's immature and comes with a shitty debugger (at least for OpenCL), the C-like code seems to allow more tricks. Where common shaders are still quite strict with dynamic loops or pointers, CS feels more like natural C. The disadvantage is that a lot of handy functions and syntax you're used to are missing or different in OpenCL, so your first attempts to write one are probably going to be frustrating.
* Let the CPU and GPU work in parallel
This already happens with common shaders, though I'm not sure how the two synchronize there. Anyway, with CS you can simply launch a task on the GPU (or another device), continue doing other stuff on the CPU, and check later if it's done. As said, OpenCL keeps things simple.
* Array indexing or Pointers
A powerful feature is that you can access any slot in an array via indexing or pointers (warning: indexing = slow, pointers = fast!). In common shaders, this is not possible. While processing vertex[123], you can't look in vertex[94] for some info. You’re forced to use textures or UBOs for data lookup then. Advanced data structures such as octrees can be accessed much more easily. This is one of the main reasons you may want to use a CS, if complex data access is needed.
* CS can also write in the same input buffer
In a shader, you will always need 2 buffers. One input, one output. By "ping-ponging" you could swap buffers each cycle:
cycle1: input from buf1 , output to buf2
cycle2: input from buf2 , output to buf1
...
This costs double the memory, as you need two buffers. With the help of ReadWrite buffers in CS, you don't have this problem. ReadWrite textures are pretty slow or not even supported on all hardware though.
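For reference, the ping-pong scheme could look roughly like this on the CPU side. This is only a sketch: the "blur with the right neighbor" step and the names `blur_step` and `ping_pong` are invented for the example, and two plain float arrays stand in for the GPU buffers.

```c
#include <assert.h>
#include <stddef.h>

/* One cycle: every element becomes the average of itself and its right
   neighbor. It reads only from src and writes only to dst, so no element
   can ever see a half-updated neighbor. */
static void blur_step(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        size_t right = (i + 1 < n) ? i + 1 : i;  /* clamp at the last element */
        dst[i] = 0.5f * (src[i] + src[right]);
    }
}

/* Run `cycles` steps, swapping the roles of the two buffers each cycle.
   Returns a pointer to whichever buffer holds the final result. */
float *ping_pong(float *buf1, float *buf2, size_t n, int cycles)
{
    assert(buf1 && buf2 && buf1 != buf2);
    float *in = buf1, *out = buf2;
    for (int c = 0; c < cycles; ++c) {
        blur_step(in, out, n);
        float *tmp = in;  /* swap: this cycle's output is next cycle's input */
        in = out;
        out = tmp;
    }
    return in;  /* after the swap, `in` is the most recently written buffer */
}
```

A single read-write buffer saves the second array, but a step like this one, which peeks at a neighbor, is then no longer safe without some form of synchronization, since the neighbor may or may not have been updated already.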
* CS in- and output don't have to be GPU hardware buffers
You can stream the results directly back to the CPU if you like. OpenGL or DX can do that as well, but it's A: slow, and B: it requires crazy tricks like reading pixels from a texture to push data back and forth between the CPU and GPU. Probably it's just as slow when using OpenCL, but at least it feels more natural as it can be coded easily.
* Shared variables
In a common shader, you can't declare a global variable like "myCounter" that is being incremented by each element being processed. But in CS, you actually can. This can become handy if you want to share the same data for a whole group of elements, count stuff, or filter out min/max values. I'll show an example later on (Tiled Deferred Rendering).
* Threading control / Synchronizing
Now this is the Nutty Professor part. And the reason why you have to know how Oompa Loompas roll. First, it's up to you how you launch a CS. If you have 10,000 elements in an input array, you could for example run 20 work-groups of 500 elements each, which the hardware in turn executes as Warps or Wavefronts.
Since standard Vertex/Geom/Fragment shaders cannot access their neighbors in their buffers, each “workitem” runs isolated from the big bad world outside. So you don't have to care about synchronizing, mutexes, locks, semaphores, or whatsoever. But as shown above, in CS you actually can bother the neighbors or variables via local or global memory. And not without risk. Your neighbor might attack you with a baseball bat if you interrupt him at the wrong time. Same troubles in CS land. If you read or write data being processed by another work-item, there is no guarantee that element already has been finished. Maybe it wasn't handled yet, or worse, maybe you caught it while it was being written. That's when you get the baseball bat in your face; corrupted values, tears and complete chaos.
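The bookkeeping behind such a launch is just integer math. The sketch below shows it in plain C; the struct and function names are hypothetical, but the three numbers correspond to what OpenCL hands a kernel through `get_global_id`, `get_group_id` and `get_local_id`.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item: which group it is in, and where in that group. */
typedef struct { size_t group_id; size_t local_id; } WorkItemId;

/* Split a flat element index into (group, position within the group). */
WorkItemId split_global_id(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    WorkItemId id = { global_id / local_size, global_id % local_size };
    return id;
}

/* The inverse: rebuild the flat element index. */
size_t make_global_id(WorkItemId id, size_t local_size)
{
    return id.group_id * local_size + id.local_id;
}

/* Number of groups needed to cover all elements, rounded up. */
size_t group_count(size_t element_count, size_t local_size)
{
    assert(local_size > 0);
    return (element_count + local_size - 1) / local_size;
}
```

With 10,000 elements and a group size of 500, `group_count` gives 20, and element 1234 ends up as item 234 of group 2. When the element count is not a neat multiple of the group size, the last group is only partly filled, so a kernel usually has to check its index against the element count before touching the buffer.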
Synchronizing the Multi-madness
===============================================
Fortunately, OpenCL provides some instructions to prevent this drama. But first of all, try to design your shaders in such a way that you don't have to read outside your comfort zone. Keep shared global variables or access to other elements to a minimum. You will learn that sometimes it's actually better to run a CS twice instead of having to screw around with mutexes to fit everything in a single program. And otherwise:
* Barriers
You can create a "waiting point" in your shader that ensures all work-items in a work-group have executed up to that point. Compare it with walking with your family; every 10 seconds you are a hundred yards ahead of grandpa, so you stop and wait till he catches up. Not sure why one task would finish later than another though. Maybe because of taking a different route through branching, yet to my understanding, all tasks would take that route then… Possibly it's because a big work-group gets spread over several Warps or Wavefronts that are scheduled independently. Anyhow, in OpenCL the Barrier instruction is barrier(CLK_LOCAL_MEM_FENCE) for local memory, or barrier(CLK_GLOBAL_MEM_FENCE) for global memory.
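A classic place where a barrier is needed is summing all values of a work-group in local memory. The sketch below is a CPU emulation in plain C, not real kernel code: the inner loop plays the role of the work-items that would run in parallel, and the comment marks the spot where an OpenCL kernel would call `barrier(CLK_LOCAL_MEM_FENCE)`. The function name `group_sum` and the power-of-two restriction are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Sum `local_size` values in place by repeatedly folding the upper half
   onto the lower half. `scratch` stands in for a work-group's local memory.
   local_size must be a power of two. The total ends up in scratch[0]. */
float group_sum(float *scratch, size_t local_size)
{
    assert(scratch && local_size > 0 && (local_size & (local_size - 1)) == 0);
    for (size_t stride = local_size / 2; stride > 0; stride /= 2) {
        /* On a GPU, these iterations are work-items running side by side. */
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
        /* Barrier point: every work-item must have finished this round
           before anyone starts the next one, because the next round reads
           slots that other work-items have just written. */
    }
    return scratch[0];
}
```

On the CPU the rounds are trivially in order because the inner loop finishes before the next round starts. On the GPU nothing guarantees that unless the barrier is there.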
* Semaphores
This is to ensure you don't execute a specific block of code (usually involving reads or writes) if another element in the Warp/Wavefront has entered the same block. If so, wait until the other element is done first. Compare it to a ticket window. At some point, people have to line up and pass one by one. This is tricky shit though, do it wrong and your video card driver may hang & time out!
* Atomic operations
Sounds dangerous. OpenCL provides a couple of atomic operations (add, decrement, min, max, xor...). These do the same as their common equivalents, except that an “atomic write” ensures that it won’t conflict with another operation that is also accessing the same variable. Sort of a built-in semaphore. Keep in mind that some older hardware (like my GPU) may not support atomic operations yet though! On OpenCL 1.0 you need extensions to enable them; from 1.1 on, the 32-bit ones are part of the core.
The next and last post will show a practical example with several techniques that wouldn’t be possible (or only with stinky workarounds) with traditional shaders, as well as some of the synchronizing tricks explained above.
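To give an idea of what such operations look like, here is a small sketch using C11's `<stdatomic.h>` on the CPU as a stand-in. It is not OpenCL code; in a kernel you would reach for the built-ins such as `atomic_inc` or `atomic_max` instead. The helper names `counter_increment` and `shared_max` are made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Bump a shared counter. Returns the value from before the increment,
   which is also what OpenCL's atomic_inc hands back. Handy for giving
   each element its own unique output slot. */
int counter_increment(atomic_int *counter)
{
    assert(counter);
    return atomic_fetch_add(counter, 1);
}

/* Raise *target to `value` if `value` is larger. C11 has no built-in
   atomic max, so this uses a compare-and-swap loop: try to swap in the
   new value, and retry if somebody else changed the variable meanwhile. */
void shared_max(atomic_int *target, int value)
{
    assert(target);
    int seen = atomic_load(target);
    while (seen < value &&
           !atomic_compare_exchange_weak(target, &seen, value)) {
        /* On failure, `seen` now holds the current value; the loop
           condition decides whether another attempt is still needed. */
    }
}
```

The compare-and-swap loop is also the usual building block for homemade semaphores, which is one more reason those are easy to get wrong.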