Posts by sHTiF

Texture packing to color channels.


It's been some time since I posted a research post, so I decided to share some optimization approaches I've been playing with. I mostly use Genome2D, but they can be applied to any GPU framework/platform/language.

I am often hired as an external specialist to optimize various in-house engines, and not too long ago, on one such mission, I was tasked with optimizing a 2D GPU engine. The difference this time was that the problem wasn't speed but the limited GPU resources available to store all the texture data. I can't go into much detail about the engine or the project involved, but they basically had too many simple, often quite large textures. The first thing that came to my mind was to move everything to vectors, as they are way less of a memory hog than textures. An example that comes to mind is Tiny Thief: they had the exact same problem, with so many graphics on the screen that there was simply no way they could have stored it all in GPU memory. To cut it short, this idea was declined for one reason or another, most probably asset creation.

So back to the drawing board. Focusing again on the simplicity of the textures, and after a few tests, I came up with an idea how to essentially pack 4 textures into a single one and use a simple pixel shader to draw the correct one of the four when rendering to the screen. Again, performance wasn't an issue here, so we could afford a slightly heavier pixel shader. It is often true that on high-end mobile devices you are more likely to run into memory problems than performance problems.

Enough of my chit chat, let's explain how I packed textures into separate channels, which allowed me to load 4 times the amount of assets I would normally be able to.

A little bit of warning here: the techniques and algorithms explained here are not always the most optimized, as I want to focus on the simplicity of the example code instead of explaining the nuances of optimization. I am going to use framework agnostic code just to illustrate what is going on.


First we are going to extract the palette information. This should be done externally; I wrote a little tool for them which I can't give away, but it's pretty simple. You can also do it at runtime, but that is a huge waste. So what we are going to do is go over all our textures and extract their pixel colors.

  // Build one palette per texture and replace each pixel with its palette index.
  for (i in 0...texturesCount) {
     var colors = new Array<Int>();   // palette of texture i
     for (x in 0...textures[i].width) {
        for (y in 0...textures[i].height) {
           var color = textures[i].getColorAt(x, y);
           var index = colors.indexOf(color);
           if (index == -1) {
              // New color: store it in row i of the palette texture.
              index = colors.length;
              colorTexture.setColorAt(index, i, color);
              colors.push(color);
           }
           // Overwrite the pixel with its palette index (0-255).
           textures[i].setColorAt(x, y, index);
        }
     }
  }
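To make the loop above easy to try out, here is a minimal runnable sketch of the same idea in plain Python. It is only an illustration under my own assumptions: a texture is a list of rows of integer colors, and `build_palette` is a made-up helper name, not part of any engine or of the tool mentioned above.

```python
def build_palette(pixels, max_colors=256):
    """Return (palette, index_map) for one texture given as rows of color ints."""
    palette = []      # palette[i] is the i-th distinct color encountered
    lookup = {}       # color -> palette index, avoids a linear search per pixel
    index_map = []
    for row in pixels:
        index_row = []
        for color in row:
            if color not in lookup:
                if len(palette) >= max_colors:
                    raise ValueError("texture uses more than %d colors" % max_colors)
                lookup[color] = len(palette)
                palette.append(color)
            index_row.append(lookup[color])
        index_map.append(index_row)
    return palette, index_map


# Hypothetical 3x2 texture with three distinct RGBA colors.
texture = [
    [0xFF0000FF, 0xFF0000FF, 0x00FF00FF],
    [0x00FF00FF, 0x000000FF, 0xFF0000FF],
]
palette, index_map = build_palette(texture)
print(palette)
print(index_map)
```

The dictionary is just a convenience so that the lookup does not scan the whole palette for every pixel; the result is the same as with `indexOf`.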

So what we are doing here is basically creating a palette from each texture and replacing the actual colors with their index in the palette. We end up with a bunch of textures of the same size as our original ones, just with colors replaced by indices, plus a new texture with the palettes, where the first line is the palette of the first texture, the second line is the palette of the second texture and so on. Those of you who didn't get lost have probably also spotted the limitation and where I am going with this. It is also why I said the idea came to my mind after running various tests on the textures in the project: what I discovered is that no single texture had more than 256 colors, which is why I decided to pack them into color channels. If we had just grayscale textures, for example, it is obvious we could pack them nicely, but since our textures use various colors, that's where the palette comes in.

The palette can store any colors as long as a single texture doesn't contain more than 256 of them, and even this limitation only applies because we are using a single 8-bit color channel to store the palette index. So our textures can now have any colors, in contrast to plain grayscale packing, and what's more, we are not limited to 256 colors across all the textures, as each texture can have an entirely different set of 256 colors, yes, within the same atlas, as we will see.


So now that we have all these new textures we need to pack them, but we are not going to pack them into an atlas of the same size. Let's say we were able to pack all our textures into a 1024×1024 atlas. Since we are going to use 4 channels, we are instead going to pack them into four 512×512 atlases, which together cover the same area.

No, we can't just slice the 1024×1024 atlas into 4 pieces, as each texture needs to sit in a single color channel, and if we just sliced the atlas we could potentially cut through a subtexture.

  var packers = [];
  for (i in 0...4) {
     // One 512x512 packer per color channel.
     var packer = new Packer(512, 512);
     var j:Int = 0;
     while (j < textures.length) {
        var texture = textures[j];
        if (packer.pack(texture)) {
           // Packed: remove it so the next packer skips it (do not advance j).
           textures.remove(texture);
        } else {
           j++;
        }
     }
     packers.push(packer);
  }

Again a simple example of how the packing would go: we create a 512×512 packer and try to pack as many textures into it as we can; once nothing more fits, we move on to the second one and so on.
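For anyone who wants to play with this, here is a small runnable sketch of that loop in plain Python. The `ShelfPacker` below is a deliberately naive stand-in I made up for the `Packer` used above (it fills rows left to right and is far from optimal), and the texture names and sizes are invented for the example.

```python
class ShelfPacker:
    """Naive shelf packer: fills rows left to right, then starts a new row."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.x = self.y = self.shelf_height = 0
        self.placed = {}  # texture name -> (x, y) inside this atlas

    def pack(self, name, w, h):
        x, y, shelf_h = self.x, self.y, self.shelf_height
        if x + w > self.width:                 # does not fit in the row: open a new shelf
            x, y, shelf_h = 0, y + shelf_h, 0
        if w > self.width or y + h > self.height:
            return False                       # atlas is full for this texture
        self.placed[name] = (x, y)
        self.x, self.y, self.shelf_height = x + w, y, max(shelf_h, h)
        return True


def pack_into_channels(sizes, atlas_size=512, channels=4):
    """Distribute textures over one atlas per channel; return (packers, leftovers)."""
    remaining = dict(sizes)
    packers = []
    for _ in range(channels):
        packer = ShelfPacker(atlas_size, atlas_size)
        for name, (w, h) in list(remaining.items()):
            if packer.pack(name, w, h):
                del remaining[name]            # packed, the next packer skips it
        packers.append(packer)
    return packers, remaining


# Hypothetical texture sizes (width, height).
sizes = {"a": (512, 300), "b": (512, 300), "c": (256, 256), "d": (256, 200)}
packers, leftover = pack_into_channels(sizes)
for channel, packer in zip("RGBA", packers):
    print(channel, packer.placed)
print("not packed:", leftover)
```

Anything still left in `leftover` after the fourth packer simply needs another atlas.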

Now we have four 512×512 textures and here comes the channel packing: we are going to create a single 512×512 texture that contains all four of them. We take the first one and, since it contains only values from 0 to 255, write it to the Red channel of the new texture. Then we take the second one and write it to the Green channel, then Blue and finally Alpha. We have now created a single 512×512 texture that essentially holds four times its own area, the equivalent of a 1024×1024 atlas. This is a simple example from my demo at the end:

It is shown without packing into the alpha channel, as you would see even less and this illustrates it better. It looks like quite a bit of noise, but don't worry, you will get your real textures out of it.

Both of these steps should definitely be preprocessed in your project instead of being generated at runtime, which can be obscenely costly.
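The channel merge itself is tiny. As a rough sketch in plain Python, assuming each of the four index atlases is a list of rows of values from 0 to 255 and all four have the same size (`merge_channels` is a name I made up for this example):

```python
def merge_channels(red, green, blue, alpha):
    """Merge four equally sized index maps into one map of (r, g, b, a) pixels."""
    height, width = len(red), len(red[0])
    return [
        [(red[y][x], green[y][x], blue[y][x], alpha[y][x]) for x in range(width)]
        for y in range(height)
    ]


# Four hypothetical 2x2 index atlases.
red = [[0, 1], [2, 3]]
green = [[4, 5], [6, 7]]
blue = [[8, 9], [10, 11]]
alpha = [[12, 13], [14, 15]]

packed = merge_channels(red, green, blue, alpha)
print(packed[0][1])
```

In a real pipeline you would write these tuples out as an RGBA image with non-premultiplied alpha, otherwise the alpha channel would corrupt the indices stored in the other three.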


Now that we have our assets ready we just need to render them. For each sprite we need to tell the GPU the palette index for its texture (its line in the palette texture) and also a color mask to eliminate the data in the other channels. So that's 5 additional bytes, RGBA + index. I am not going to write code here as this is very specific to your language/platform and even to how you upload your data to the GPU, but I think it's pretty self-explanatory.

Once you have this data on the GPU, just get it to the pixel shader. I assume you move it through the vertex shader first, as you are probably batching in some way instead of issuing a single draw call per sprite; otherwise you can send it to the pixel shader directly. And now comes the shader magic. I am going to use AGAL here as it's pretty self-explanatory.

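  tex ft0, v0, fs0 <2d,clamp,nearest>   // sample the packed index atlas
  dp4 ft0.x, ft0, v3                    // mask out the other channels, result is U
  mov ft0.y, v2.x                       // palette line goes to V
  tex ft2, ft0, fs1 <2d,clamp,nearest>  // look up the color in the palette texture
  mov oc, ft2                           // output it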

First we sample the atlas texture where the palette indices are now stored. Then we take a dot product with the color mask, which eliminates all the channels except the one holding the index for this texture, and store the result as the U coordinate for the UV lookup. Next we move the palette line index we sent ourselves into the V coordinate. Now we use these new UVs to look up the correct output color for this pixel in our palette texture and finally render it.
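If the shader is hard to follow, here is what it does to a single pixel, emulated on the CPU in plain Python. This is only a sketch under simplifying assumptions: channel values are kept as integers 0 to 255 instead of the normalized 0 to 1 values a real GPU works with, and `shade` and the sample palettes are invented for this example.

```python
def shade(packed_pixel, channel_mask, palette_row, palettes):
    """Emulate the pixel shader for one pixel of the packed atlas."""
    # dp4: multiply every channel by the mask and sum, only one channel survives
    index = sum(c * m for c, m in zip(packed_pixel, channel_mask))
    # second tex: U is the palette index, V is the palette line of this texture
    return palettes[palette_row][index]


# One palette line per original texture (hypothetical colors).
palettes = [
    [0xFF0000, 0x00FF00],
    [0x111111, 0x222222, 0x333333],
]
packed_pixel = (1, 2, 0, 0)  # R holds an index of texture 0, G an index of texture 1

RED_MASK = (1, 0, 0, 0)
GREEN_MASK = (0, 1, 0, 0)

color_a = shade(packed_pixel, RED_MASK, 0, palettes)
color_b = shade(packed_pixel, GREEN_MASK, 1, palettes)
print(hex(color_a), hex(color_b))
```

The same packed pixel yields a different color depending only on the mask and palette line sent with the sprite, which is the whole trick.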

If we just need grayscale packing without a palette lookup, all we would do is sample the color, take its dot product with the color mask and output the result.


Here is a working example. Even though I could just use grayscale packing here, it is meant to showcase the palette index packing: I pack a bitmap font into 4 channels and then render thousands of glyphs (SPACE to enable motion). At the top left of the screen is the packed texture; it's almost invisible, obviously because the alpha channel is being used for packing as well. The second texture to its right is the palette texture and the one at the top right is the original unpacked bitmap font.

Example HERE

If you are interested in a video about this, check out my Genome2D DevCast #3 on YouTube HERE

Direct GPU drawing in Genome2D


[Update] Added camera information

Hi guys, I know it's been some time since my last blog post and I was also scolded for my lack of posting, so here it goes.

Today I am going to talk about something that should be obvious to all Genome2D users, but it seems that a lot of them don't know about this feature, and what's more, it's a feature that seems to attract non-Genome2D users as well. I am talking about the ability to draw directly to the GPU without the need for any display list architecture. It's very similar to blitting and it may attract people who want to port their blitting engine to the GPU. And since there is no render graph overhead, this approach is VERY fast, as you are literally just pushing stuff onto the GPU.

The initialization of Genome2D is the same whether you want to use the render graph or just direct draw; you can even combine the two, which is very useful for drawing a large number of objects in a highly optimized way.

  var config:GContextConfig = new GContextConfig(stage, new Rectangle(0,0,stage.stageWidth,stage.stageHeight));

  // Initialize Genome2D
  genome = Genome2D.getInstance();
  genome.onInitialized.addOnce(genomeInitializedHandler);
  genome.init(config);

In the initialization handler, which is called once Genome2D and the GPU are initialized, you should create your textures or even texture atlases, as direct draw can work with both. So again no change here: there are no special textures for direct draw calls versus the GSprite/GMovieClip ones. We will also need to hook up a handler to the rendering pipeline where we can do our draw calls.

  // We will create a single texture from an embedded bitmap
  texture = GTextureFactory.createFromEmbedded("texture", TexturePNG);

  // Add a callback into the rendering pipeline
  genome.onPreRender.add(preRenderHandler);

There are two global callbacks for the rendering pipeline, onPreRender and onPostRender: the first is called before the node render graph is rendered and the second is called after. If we are going to use only direct draw calls, it doesn't matter which one we use.

Now in the render callback handler we can do our direct draw call.

  var context:GStage3DContext = genome.getContext();

  context.draw(texture, 100, 100);

This will draw our texture at the position 100,100. Simple as that: there is no render graph involved, just direct drawing to the GPU. There are also more options.

  context.draw(texture, 100, 100, 2, 2, 0.5);

This will draw the same texture at 100,100 but scaled by 2,2 and rotated by 0.5 radians.
You can also modify the color.

  context.draw(texture, 100, 100, 1, 1, 0, 0, 1, 1, .5);

This will render the texture at 100,100 without scale or rotation, but its red channel will be multiplied by 0 and it will be drawn at .5 alpha.
Additionally you can involve blend modes.

  context.draw(texture, 100, 100, 1, 1, 0, 0, 1, 1, .5, GBlendMode.MULTIPLY);

This will render the same as the previous call but with the multiply blend mode. And finally, for advanced users, you can even do direct draw calls with filters, which let you override the default GPU shader.

  context.draw(texture, 100, 100, 1, 1, 0, 0, 1, 1, .5, GBlendMode.NORMAL, filter);

This way you can do pretty much anything the GPU can do at the shader level.

Additionally there are further custom draw calls. As you can see, the examples so far draw quads, and what's more, there are only limited ways to transform such a quad (no skew, for example). This is mainly for performance reasons, as most draw calls don't need these additional transformations, which involve extra data/calculation overhead. For such scenarios there is a low level call using raw matrix data.

  context.drawMatrix(texture, a, b, c, d, tx, ty, red, green, blue, alpha, blendMode, filter);

As you can see, it offers all the modifications of the previous call but uses a matrix for the transformation, which allows unlimited manipulation.
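To show how the two kinds of calls relate, here is a small sketch in plain Python that turns a position, scale and rotation into the six matrix values. It assumes the usual Flash `flash.geom.Matrix` convention (x' = a*x + c*y + tx, y' = b*x + d*y + ty); I am not claiming this is exactly what Genome2D does internally, and both function names are invented for this example.

```python
import math


def transform_to_matrix(x, y, scale_x=1.0, scale_y=1.0, rotation=0.0):
    """Return (a, b, c, d, tx, ty) for scaling first, then rotating, then moving."""
    cos, sin = math.cos(rotation), math.sin(rotation)
    return (cos * scale_x, sin * scale_x, -sin * scale_y, cos * scale_y, x, y)


def apply_matrix(matrix, px, py):
    """Transform a single point with the Flash-style matrix convention."""
    a, b, c, d, tx, ty = matrix
    return (a * px + c * py + tx, b * px + d * py + ty)


# Same transform as drawing at 100,100 with scale 2,2 and no rotation.
scaled = transform_to_matrix(100, 100, 2, 2, 0)
print(apply_matrix(scaled, 1, 1))

# A horizontal skew, which position/scale/rotation alone cannot express.
skew = (1, 0, 0.5, 1, 0, 0)
print(apply_matrix(skew, 0, 1))
```

The skew matrix at the end is exactly the kind of transform that needs the raw matrix call.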

Next one is the polygon draw, which enables you to draw any shape; all you need is its triangulated vertex data and the corresponding UV coordinates.

  context.drawPoly(texture, vertices, uvs, x, y, scaleX, scaleY, rotation, red, green, blue, alpha, blendMode, filter);

Keep in mind that you need to consider all the GPU state changes that break batching, such as switching between alpha and non-alpha usage, a blend mode change, a filter change or a different type of draw call.
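To get a feeling for how much the order of your draw calls matters, here is a toy model in plain Python. It is not Genome2D's actual batcher, only an assumption-based sketch: every draw call is described by the four pieces of state listed above, and a new batch starts whenever any of them differs from the previous call.

```python
def count_batches(draw_calls):
    """Count how many batches a sequence of draw calls would be split into."""
    batches = 0
    previous_state = None
    for call in draw_calls:
        state = (call["kind"], call["uses_alpha"], call["blend"], call["filter"])
        if state != previous_state:   # any state change flushes the current batch
            batches += 1
            previous_state = state
    return batches


def quad(blend):
    """Hypothetical draw call description, only the blend mode varies here."""
    return {"kind": "quad", "uses_alpha": True, "blend": blend, "filter": None}


interleaved = [quad("normal"), quad("multiply"), quad("normal"), quad("multiply")]
grouped = sorted(interleaved, key=lambda call: call["blend"])

print(count_batches(interleaved), count_batches(grouped))
```

Grouping calls with the same state cuts the batch count, but keep in mind that reordering also changes what is drawn on top of what, so it is only an option where the draw order doesn't matter.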

Finally, for all those who still don't have enough, there are low level draw-from-source calls where you can specify your source rectangle, overriding the actual texture UV coordinates. This is especially handy for people who plan to port their blitting engines, which often involve a source->target blit.

  context.drawSource(texture, sourceX, sourceY, sourceWidth, sourceHeight, x, y, scaleX, scaleY, rotation, red, green, blue, alpha, blendMode, filter);
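For those coming from blitting, the relation between a source rectangle in pixels and UV coordinates is simple arithmetic. A minimal sketch in plain Python, assuming UVs run from 0 to 1 across the whole texture (`source_rect_to_uvs` is a made-up helper, not a Genome2D function):

```python
def source_rect_to_uvs(source_x, source_y, source_w, source_h, tex_w, tex_h):
    """Convert a pixel source rectangle into (u0, v0, u1, v1) in the 0..1 range."""
    u0 = source_x / tex_w
    v0 = source_y / tex_h
    u1 = (source_x + source_w) / tex_w
    v1 = (source_y + source_h) / tex_h
    return (u0, v0, u1, v1)


# A 64x32 frame starting at 64,0 inside a hypothetical 256x128 sprite sheet.
uvs = source_rect_to_uvs(64, 0, 64, 32, 256, 128)
print(uvs)
```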

Additionally there is drawMatrixSource which will come in the next build.

All these draw calls also support a custom camera through which they are rendered. Setting a camera in Genome2D is very simple.

  var contextCamera:GContextCamera = new GContextCamera();
  contextCamera.y = contextCamera.x = 100;

  context.setCamera(contextCamera);

This will set the context camera to position 100,100 for all subsequent draw calls. Keep in mind that the default camera looks at the center of the stage, so with an 800×600 stage the default camera used by Genome2D is at x = 400, y = 300.
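Here is the arithmetic behind that default in plain Python. This sketch assumes a camera without scale or rotation whose position ends up in the center of the stage; the function name is invented and this is not Genome2D code.

```python
def world_to_screen(world_x, world_y, camera_x, camera_y, stage_w, stage_h):
    """Map a world position to the screen for an unscaled, unrotated camera."""
    return (world_x - camera_x + stage_w / 2, world_y - camera_y + stage_h / 2)


# Default camera on an 800x600 stage: world and screen coordinates match.
default_view = world_to_screen(100, 100, 400, 300, 800, 600)

# Camera moved to 100,100: the same world point lands in the middle of the stage.
moved_view = world_to_screen(100, 100, 100, 100, 800, 600)

print(default_view, moved_view)
```

So with the default camera a draw call at 100,100 really lands at 100,100 on the screen, while a camera at 100,100 puts that same point in the middle of the stage.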

All draw calls also support custom rectangle masking. Similarly, in state machine fashion, you can set an axis-aligned masking rectangle for all subsequent draw calls like this.

  context.setMaskRect(new Rectangle(0,0,400,300));

This will discard anything rendered outside the 0,0,400,300 rectangle. However, the aforementioned setCamera method will reset this mask rectangle to the default camera viewport, so you need to set your custom rectangle explicitly after the setCamera call.
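The ordering pitfall is easier to see in a toy model. The class below is plain Python and only mimics the behaviour described above (setting the camera resets the mask); it is an assumption-based sketch and not the real context implementation.

```python
class ContextModel:
    """Tiny state machine mimicking the camera/mask interaction described above."""

    def __init__(self, stage_w, stage_h):
        self.viewport = (0, 0, stage_w, stage_h)
        self.mask = self.viewport
        self.camera = (stage_w / 2, stage_h / 2)

    def set_camera(self, x, y):
        self.camera = (x, y)
        self.mask = self.viewport      # side effect: the mask goes back to the viewport

    def set_mask_rect(self, x, y, w, h):
        self.mask = (x, y, w, h)

    def is_visible(self, px, py):
        x, y, w, h = self.mask
        return x <= px < x + w and y <= py < y + h


wrong = ContextModel(800, 600)
wrong.set_mask_rect(0, 0, 400, 300)
wrong.set_camera(100, 100)             # the mask is lost here

right = ContextModel(800, 600)
right.set_camera(100, 100)
right.set_mask_rect(0, 0, 400, 300)    # the mask is set after the camera

print(wrong.is_visible(500, 400), right.is_visible(500, 400))
```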

So guys, if this is something you didn't know about or were looking for in other Stage3D frameworks, just check it out. The latest Haxe build can be found here. If there are any questions feel free to ask.


Genome2D/Starling Benchmarks Part1


[UPDATE 22.8.2013] OK guys, the reason for the different and somewhat low benchmark numbers on iOS for both Genome2D and Starling was that I was using an incorrect build. This is now remedied and all the benchmarks were rerun across all devices for both Genome2D and Starling.

OK guys, I've finally had time to benchmark the newest Starling v1.4 against the latest Genome2Dhx version. So let's jump right to it. Since I am going off on vacation tomorrow, it will be divided into two parts, with the sources coming in the second part after the vacation since I need to polish them up.

All of the results here are averaged from 5 or more measurements.


Standard benchmark targeting 60 FPS with a step of 50 and assets scaled at 100%, using AS3 and the latest SWC of both frameworks.
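In case the terms "target FPS" and "step" are unclear, here is how I would sketch such a benchmark loop in plain Python. It is a simulation under assumptions, not the benchmark source: objects are added `step` at a time for as long as a frame still fits into the time budget of the target FPS, and the frame cost comes from a made-up linear cost model instead of real rendering.

```python
def run_benchmark(frame_time_ms, target_fps=60, step=50):
    """Return the highest object count (in multiples of step) that still hits target_fps."""
    budget_ms = 1000.0 / target_fps
    count = 0
    while frame_time_ms(count + step) <= budget_ms:
        count += step
    return count


def fake_frame_time(object_count):
    """Hypothetical cost model: 2 ms fixed overhead plus 0.001 ms per object."""
    return 2.0 + 0.001 * object_count


desktop_result = run_benchmark(fake_frame_time, target_fps=60, step=50)
mobile_result = run_benchmark(fake_frame_time, target_fps=30, step=10)
print(desktop_result, mobile_result)
```

A smaller step just means the count creeps up more slowly and the final number is more fine-grained.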

I recreated the standard Starling benchmark so that it can be rendered by Genome2D as well, to compare performance. I also changed the background to use BlendMode NONE in both frameworks to avoid unnecessary fillrate issues on low-end devices. I used the fastest way to render rotating objects: in Starling that should be using Images, in Genome2Dhx it is done by using a custom class for transforms and then rendering with a draw call. For those familiar with Genome2D: no, you should not use GNodes for the fastest rendering. Particle systems, for example, are a single node with smaller, faster transform instances rendering each particle.

The web benchmark will automatically show you whether you are using PPAPI and the debugger player; all the measurements were done using the release player.

Run browser benchmark: HERE


First up is Chrome, where we actually have to benchmark two different Flash Players, since there is the Adobe one (NPAPI) and then there is Google's PPAPI one.

Genome2D/NPAPI: 94750
Starling/NPAPI: 12850
Genome2D/PPAPI: 40200
Starling/PPAPI: 17350

As you can see there are major differences in performance between Starling and Genome2D; most of you already ran this benchmark when I posted it a week ago. So Genome2D is over 7 times faster than Starling in the Adobe Flash Player in Chrome. It's a really major difference.

As for Google's PPAPI, the difference is not that huge, but with all the PPAPI bugs I wouldn't recommend anyone use PPAPI except on Linux, and it's easy to force players to install Adobe's player with JS. The interesting part is that Starling is actually faster in PPAPI than in NPAPI. The main drop in performance for Genome2D comes from the PPAPI-specific pipeline for OpenGL calls, which results in worse performance specifically when using Vertex Shader batching. You can get better results in Genome2D using Vertex Buffer batching (which can be easily enabled) when targeting PPAPI. I didn't include this in these benchmarks since I wanted to use a single pipeline for Starling/Genome2D through the whole process.


Genome2D: 130900
Starling: 18850

Next up is another browser benchmark, and that is Firefox. As we can see, Firefox is even faster than Chrome.


We will use a slightly modified benchmark for mobile testing, as proposed by Daniel for Starling here: Starling 1.4 benchmarks. So we are targeting 30 FPS instead of the 60 we targeted on desktop. Also, all the objects are scaled down to .25 to minimize the impact of fillrate on the framework benchmark. Daniel also suggested enabling mipmaps, but in my tests it was clear that this does not have any impact on performance, so we are not going to generate mipmaps. Another thing is that the step size is 10 on mobile instead of 50, to increase the number of objects at a slower pace.


First Android devices.

Samsung Galaxy Tab (first version)

Genome2D: 3100
Starling: 2110

Quite an old tablet, one of the first Android tablets if my memory serves me correctly. It is also the only mobile device where Genome2D reaches less than 200% of Starling's result; maybe we are hitting a fillrate wall even with scaled down assets.

Samsung S2

Genome2D: 17370
Starling: 5370

The Samsung Galaxy S2, although a bit dated nowadays, still has incredible performance; actually I would say it outperforms most of today's mobiles with ease. Genome2D is more than three times faster here; it can actually render more objects at 30 FPS on this phone than Starling can at 60 FPS in Chrome on my i7 (GTX660) desktop. ;)

Nexus 7

Genome2D: 15470
Starling: 4590

The Nexus 7 tablet with the latest Android 4.3. This is actually one of the mobile devices Daniel benchmarked in his latest benchmarks; the funny thing is that I am getting higher numbers for Starling than he did for some reason. Genome2D is still over three times faster here.


Next, iOS devices, using an ad-hoc build.

iPad Mini

Genome2D: 12280
Starling: 3760

Genome2D is over 3 times faster than Starling in this instance.

iPad 2

Genome2D: 12490
Starling: 3780

No surprise here, since the iPad 2 has the same hardware as the iPad Mini.

iPhone 5

Genome2D: 23070
Starling: 7820

Almost twice as fast as the iPad Mini and iPad 2; both frameworks scale almost the same.

iPod 4

Genome2D: 4440
Starling: 1470

I was expecting a smaller difference here since it's a really old device, and I was expecting a fillrate bottleneck even with scaled assets. Genome2D is still around three times faster.
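If you want to check the "times faster" figures yourself, they are just the Genome2D count divided by the Starling count. A few lines of plain Python over the numbers reported in this post (only the dictionary layout and the variable names are mine):

```python
# (Genome2D, Starling) object counts as reported above.
results = {
    "Chrome NPAPI": (94750, 12850),
    "Chrome PPAPI": (40200, 17350),
    "Firefox": (130900, 18850),
    "Galaxy Tab": (3100, 2110),
    "Galaxy S2": (17370, 5370),
    "Nexus 7": (15470, 4590),
    "iPad Mini": (12280, 3760),
    "iPad 2": (12490, 3780),
    "iPhone 5": (23070, 7820),
    "iPod 4": (4440, 1470),
}

ratios = {name: genome / starling for name, (genome, starling) in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: Genome2D is {ratio:.2f}x Starling")
```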

So that's all for now, folks, due to time restrictions. After the vacation I will post the sources and maybe some additional information.

Cheers and any feedback is welcome.

Genome2D experiments Spriter/StencilShadows


Hi there guys, those of you who are not inside our small awesome community may be curious whether there is anything going on with Genome2D. Everything is going smoothly and I am working on it almost daily; there is just no time to blog. So today I decided to share two of my work-in-progress Genome2D experiments.

First is support for the Spriter format. Some of you already know Spriter; it's an upcoming awesome tool for 2D animation, and those who are not familiar with it should definitely check it out.

It already supports interpolation/tweening of the movement, and bone support is coming next; I am just waiting for the upcoming beta build of Spriter.

Another experiment I've been working on is stencil shadows. This involves additional shaders, low level draws, materials and components, so it's quite a major addition and I bet all of you will enjoy it. Here is a demo which is a clone of my old Flash Player 9 version of the Genome2D demo.

Also, the new Genome2D forum is up and we should all move there. I will not move the current forum db there as it would be tedious and most of the information isn't that valuable anymore anyway. I am looking for our most experienced Genome2D users to start the new forum up :P

That's all folks, as usual due to time constraints. I am going to Venice, and next week I am in Prague for the Geewa hackathon; once back I will dive into Genome2D again. Cheers.



Hi guys, just wanted to wish all of you a happy new year 2013, and here is a little fireworks demo. Enjoy.
(Press F for FULLSCREEN)

You can find the source code on GitHub here as well.
