I am working on memory measurement of a little cocos2d demo. The demo's task is to add one PNG image into the cocos2d texture cache, and the monitors I use are Allocations and Activity Monitor in Instruments.
The test device is an iPod touch 4.
In order to find the relationship between the app's memory and the live bytes shown in Allocations, I designed two test scenarios.
The first scenario adds a texture to the cache and removes it immediately. Test code is below:
CCTextureCache* cache = [CCTextureCache sharedTextureCache];
CCTexture2D* tex = [cache addImage:@"textureatlas_RGBA8888.png"];
[cache removeTexture:tex];
The second scenario adds a texture to the cache and leaves it there. Test code is below:
CCTextureCache* cache = [CCTextureCache sharedTextureCache];
CCTexture2D* tex = [cache addImage:@"textureatlas_RGBA8888.png"];
The PNG file is 270 KB on disk; the image is 1024 x 1024 with 32-bit colour depth, so its texture will take up 1024 * 1024 * 4 bytes = 4096 KB in the cache.
Test results are as follows:
The first scenario's result:
Live bytes rise to 6.16 MB while the image is being added to the cache, then after a short while drop back to a little over 2.08 MB.
Once the app has finished all operations, the live-bytes level stays stable. At that point, Real Mem in Activity Monitor is 11.61 MB.
The second scenario's result:
The graph of the second test is similar to the first one; however, once the live bytes are stable, Real Mem is 15.62 MB.
My first question is:
Why are the graphs almost the same even though I don't remove the texture from the cache in the second scenario? Doesn't Allocations count the texture as live bytes in memory?
The second question is:
As I learned from the article How to optimize memory usage and bundle size of a Cocos2D app (http://www.learn-cocos2d.com/2012/11/optimize-memory-usage-bundle-size-cocos2d-app/), the PNG file is first converted to a UIImage and then added to the texture cache, so a 1024*1024 texture uses 4 MB of memory but briefly uses 8 MB while it is being loaded. Why can't I observe this in the Allocations tracking graph?
Thanks for your help.
I have created a simple 2D image viewer in C++ using MFC and OpenGL. This image viewer allows a user to open an image, zoom in/out, pan around, and view the image in its different color layers (cyan, yellow, magenta, black). The program works wonderfully for reasonably sized images. However, I am doing some stress testing with very large images and I easily run out of memory. One such image is 16,700 x 15,700. My program runs out of memory before it can even draw anything, because I dynamically create a UCHAR[] of size height x width x 4. I multiply by 4 because there is one byte for each RGBA value when I feed this array to glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, (GLvoid*)myArray).
I've done some searching and have read a few things about splitting my image up into tiles, instead of one large texture on a single quad. Is this something that I should be doing? How will this help me with my memory? Or is there something better that I should be doing?
Your allocation is 16.7k * 15.7k * 4 bytes, which is ~1 GB. The rest of the answer depends on whether you are compiling a 32-bit or 64-bit executable and whether you are making use of Physical Address Extension (PAE). If you are unfamiliar with PAE, chances are you aren't using it, by the way.
Assuming 32 Bit
If you have a 32-bit executable, you can address 3 GB of memory, so one third of your address space is being used up by a single allocation. To add to the problem, when you allocate a chunk of memory, it must be available as a single contiguous range of free memory. You might easily have more than 1 GB of memory free, but only in chunks smaller than 1 GB, which is why people suggest splitting your texture into tiles. Splitting it into 32 x 32 smaller tiles means making 1024 allocations of roughly 1 MB each, for example (this is probably unnecessarily fine-grained).
Note: citation required, but some configurations of Linux allow only 2 GB.
Assuming 64 Bit
It seems unlikely that you are building a 64-bit executable, but if you were, the logically addressable memory is much higher. Typical limits are 2^42 or 2^48 bytes (4096 GB and 256 TB, respectively). This means that large allocations shouldn't fail under anything other than artificial stress tests, and you will exhaust your swap file before you exhaust the logical address space.
If your constraints / hardware allow, I'd suggest building 64-bit instead of 32-bit. Otherwise, see below.
Tiling vs. Subsampling
Tiling and subsampling are not mutually exclusive. You may only need one change to solve your problem, but you might choose to implement a more complex solution.
Tiling is a good idea if you are in a 32-bit address space. It complicates the code, but it removes the single ~1 GB contiguous-block problem that you seem to be facing. If you must build a 32-bit executable, I would prefer it over subsampling the image.
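For illustration, here is a rough C++ sketch of that tiling approach. It assumes you can fetch an arbitrary RGBA rectangle of the source image on demand (readRect below is a hypothetical callback, not something from your code), so nothing close to a 1 GB contiguous block is ever allocated:

#include <algorithm>
#include <functional>
#include <vector>
#include <GL/gl.h>

struct Tile { GLuint tex; int x, y, w, h; };

// Fills dstRGBA with the w x h rectangle of the source image starting at (x, y).
using ReadRect = std::function<void(int x, int y, int w, int h, unsigned char* dstRGBA)>;

std::vector<Tile> buildTiles(const ReadRect& readRect, int imgW, int imgH, int tileSize = 512)
{
    std::vector<Tile> tiles;
    std::vector<unsigned char> scratch(size_t(tileSize) * tileSize * 4);

    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
    for (int y = 0; y < imgH; y += tileSize) {
        for (int x = 0; x < imgW; x += tileSize) {
            const int w = std::min(tileSize, imgW - x);
            const int h = std::min(tileSize, imgH - y);
            readRect(x, y, w, h, scratch.data());   // fetch only this tile's pixels

            Tile t{0, x, y, w, h};
            glGenTextures(1, &t.tex);
            glBindTexture(GL_TEXTURE_2D, t.tex);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
            glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                         GL_RGBA, GL_UNSIGNED_BYTE, scratch.data());
            tiles.push_back(t);
        }
    }
    return tiles;
}

At draw time, each tile is then rendered on its own quad covering its (x, y, w, h) rectangle. A side benefit is that each tile stays well under GL_MAX_TEXTURE_SIZE, which a 16,700-pixel-wide texture would not.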
Sub-sampling the image means you keep an additional (albeit smaller) block of memory for the subsampled copy alongside the original image. It might have a performance advantage inside OpenGL, but set that against the additional memory pressure.
A third way, with additional complications, is to stream the image from disk when necessary. If you zoom out to show the whole image, you will be subsampling more than 100 source pixels per screen pixel on a 1920 x 1200 monitor. You might choose to create an image that is significantly subsampled by default and use that until you are zoomed in far enough to need a higher-resolution version of a subset of the image. If you are using SSDs this can give acceptable performance, but it adds a lot of additional complication.
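A minimal sketch of building such a subsampled overview (a plain box filter; in practice you would probably stream the source rows from disk rather than keep the full-resolution buffer resident, and the buffer is assumed to be tightly packed RGBA):

#include <vector>

std::vector<unsigned char> downsampleRGBA(const unsigned char* src,
                                          int srcW, int srcH, int factor)
{
    const int dstW = srcW / factor, dstH = srcH / factor;
    std::vector<unsigned char> dst(size_t(dstW) * dstH * 4);

    for (int dy = 0; dy < dstH; ++dy) {
        for (int dx = 0; dx < dstW; ++dx) {
            unsigned sum[4] = {0, 0, 0, 0};
            // Average a factor x factor block of source pixels per channel.
            for (int sy = dy * factor; sy < (dy + 1) * factor; ++sy)
                for (int sx = dx * factor; sx < (dx + 1) * factor; ++sx)
                    for (int c = 0; c < 4; ++c)
                        sum[c] += src[(size_t(sy) * srcW + sx) * 4 + c];

            for (int c = 0; c < 4; ++c)
                dst[(size_t(dy) * dstW + dx) * 4 + c] =
                    (unsigned char)(sum[c] / (factor * factor));
        }
    }
    return dst;
}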
I am trying to use an image object for image processing.
I compared image-object processing time with buffer-object processing time.
I found that the image object is slower than the buffer object for YUV420.
I process Y and UV separately because of their data sizes:
Y is the original size, UV is a quarter of the original image size.
So I used cl_image_format settings like this:
Y : image_channel_order = CL_R, image_channel_data_type = CL_UNSIGNED_INT8
UV : image_channel_order = CL_RG, image_channel_data_type = CL_UNSIGNED_INT8
I thought that an image object would be faster than a buffer object for image processing.
But it was a really unexpected result.
I don't know the reason.
I think the image object may take up more bits per element than the buffer object (24 bits?),
but I can't be sure.
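For reference, host-side creation of those two planes might look roughly like this (ctx, yPlane, uvPlane, the OpenCL 1.2 clCreateImage call, and the NV12-style interleaved UV layout are assumptions on my part; error handling is omitted):

#include <CL/cl.h>

cl_mem createYuvImages(cl_context ctx, size_t width, size_t height,
                       void* yPlane, void* uvPlane, cl_mem* uvOut)
{
    cl_int err = CL_SUCCESS;

    cl_image_format yFmt  = { CL_R,  CL_UNSIGNED_INT8 };  // one 8-bit channel
    cl_image_format uvFmt = { CL_RG, CL_UNSIGNED_INT8 };  // two interleaved 8-bit channels

    cl_image_desc yDesc = {};
    yDesc.image_type   = CL_MEM_OBJECT_IMAGE2D;
    yDesc.image_width  = width;
    yDesc.image_height = height;

    cl_image_desc uvDesc = {};
    uvDesc.image_type   = CL_MEM_OBJECT_IMAGE2D;
    uvDesc.image_width  = width / 2;    // quarter of the pixels overall
    uvDesc.image_height = height / 2;

    cl_mem yImg = clCreateImage(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                &yFmt, &yDesc, yPlane, &err);
    *uvOut = clCreateImage(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                           &uvFmt, &uvDesc, uvPlane, &err);
    return yImg;
}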
It all depends on your access pattern. For coalesced reads, a buffer is fastest, but for non-coalesced reads it is slower than using images. You can work around that by using shared local memory as a cache.
Images use the texture cache, which is optimized for spatial locality. So for nearby reads -- horizontally or vertically -- they can be faster than buffers.
You don't show any code, but if you're seeing better speed with buffers then your reads must be coalesced. If you changed your access pattern, it might be the other way around.
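To make the access-pattern point concrete, here is a hedged sketch of two kernels over the Y plane, embedded as source strings in the host code: one reads a plain buffer with coalesced accesses, the other samples the image object through the texture cache. Neither is your code; the names are purely illustrative.

// Kernel sources as a C++ raw string literal.
static const char* kKernelSrc = R"CLC(
__kernel void copy_buffer(__global const uchar* src, __global uchar* dst, int width)
{
    int x = get_global_id(0);          // consecutive work-items read consecutive bytes,
    int y = get_global_id(1);          // so these buffer loads are coalesced
    dst[y * width + x] = src[y * width + x];
}

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_image(__read_only image2d_t src, __global uchar* dst, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    uint4 px = read_imageui(src, smp, (int2)(x, y));   // goes through the texture cache
    dst[y * width + x] = (uchar)px.x;
}
)CLC";

With a simple row-major access like this, the buffer version tends to win; the image version starts paying off when neighbouring work-items read pixels that are close in 2D but far apart in linear memory.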
Here's the info from Profile -> Leaks in Xcode; I ran it on an iPad 2 for about 21 minutes 12 seconds before it crashed.
Live Bytes ---- 5.45 MB
Living ---- 13547
Transitory ---- 3845036
Overall Bytes ---- 720.31 MB
When the app is running on the device, it crashes after printing Received memory warning in the console.
I'm not very sure how this works.
But consider an app that runs for 21 minutes on a device and allocates around 720 MB overall during that run, yet whose live bytes never go beyond 7.0 MB.
I accept that the app starts at 3.25 MB of live bytes and reaches 5.45 MB during the run, though I'm not sure why live bytes keep creeping up like this.
But my question is:
Is the app bad enough to produce crashes while running on the device?
Or
Am I facing some other problem?
You are probably leaving tons of sprites in the CCTextureCache singleton. Every time you create a CCSprite, its texture is cached (silently) so that the next time you refer to it, loading and presentation are much faster. Run the Allocations profiling both on the device and in the simulator (see the two profiling screenshots described below).
The top image is from Allocations profiling on the device: max memory is about 4.4 MB.
The bottom image is the same app and the SAME gameplay sequence profiled in the simulator, peaking at around 78 MB. By running in the simulator, I can see in Allocations the memory used by my sprites; on the device, this memory is not accounted for by the Allocations tool.
You are looking for trends and discrete big jumps. If memory never comes back down, you are probably leaving behind unused sprites. In my case, I chose to free specific resources from the caches at specific points in the game's execution. Here is an example from the app controller:
- (void)applicationDidReceiveMemoryWarning:(UIApplication *)application {
    MPLOGERROR(@"Before purge");
    [[CCTextureCache sharedTextureCache] dumpCachedTextureInfo];

    // Drop everything that can be rebuilt on demand.
    [CCAnimationCache purgeSharedAnimationCache];
    [[CCSpriteFrameCache sharedSpriteFrameCache] removeSpriteFrames];
    [[CCDirector sharedDirector] purgeCachedData];

    MPLOGERROR(@"After purge");
    [[CCTextureCache sharedTextureCache] dumpCachedTextureInfo];
}
This is a last-ditch, brute-force cleanout, but you can also remove specific textures at different points during gameplay without impacting the perceived responsiveness of the app. Caches are generally sound in principle, but they can rapidly become tricky in the face of constrained resources. Learn about them, experiment, and eventually you will find the right mix of 'what stays / what goes' for smooth application performance.
P.S. Although the simulator is fine for certain tests, don't use its 'performance' as a benchmark. Simulator performance is meaningless when it comes to graphics; it does not use your computer's GPU (and that is why you can see the graphics memory in Allocations there :) ).
My goal is to see what happens when using more texture data than can fit in physical GPU memory. My first attempt was to load up to 40 DDS textures, resulting in a memory footprint way higher than the available GPU memory. However, my scene still rendered at 200+ fps on a 9500 GT.
My conclusion: the GPU/OpenGL is being smart and keeps only certain parts of the mipmaps in memory. I thought that shouldn't be possible with a standard configuration, but whatever.
Second attempt: disable mipmapping, so that the GPU always has to sample from the high-resolution textures. Once again, I loaded about 40 DDS textures into memory. I verified the texture memory usage with gDEBugger: 1.2 GB. Still, my scene rendered at 200+ fps.
The only thing I noticed was that when I look away with the camera and then re-center it on the scene, a serious lag occurs, as if only then are textures transferred from main memory to the GPU. (I have some basic frustum culling enabled.)
My question: what is going on? How does this 1 GB GPU manage to sample from 1.2 GB of texture data at 200+ fps?
OpenGL can page complete textures in and out of texture memory in between draw-calls (not just in between frames). Only those needed for the current draw-call actually need to be resident in graphics memory, the others can just reside in system RAM. It likely only does this with a very small subset of your texture data. It's pretty much the same as any cache - how can you run algorithms on GBs of data when you only have MBs of cache on your CPU?
Also, PCI-E buses have very high throughput, so you don't really notice that the driver is paging.
If you want to verify this, glAreTexturesResident might or might not help, depending on how well the driver is implemented.
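For what it's worth, a quick residency check might look like the sketch below (texIds is assumed to hold the texture names you created; as noted, many drivers give unreliable answers here):

#include <cstdio>
#include <vector>
#include <GL/gl.h>

void reportResidency(const std::vector<GLuint>& texIds)
{
    std::vector<GLboolean> resident(texIds.size());
    // Returns GL_TRUE only if every queried texture is resident; in that case
    // the per-texture array is left untouched, per the GL spec.
    GLboolean all = glAreTexturesResident((GLsizei)texIds.size(),
                                          texIds.data(), resident.data());
    if (all) {
        std::printf("all %zu textures resident\n", texIds.size());
        return;
    }
    for (size_t i = 0; i < texIds.size(); ++i)
        std::printf("texture %u: %s\n", texIds[i],
                    resident[i] ? "resident" : "paged out");
}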
Even if you were forcing texture thrashing in your test (discarding and re-uploading some textures from system memory to GPU memory every frame), which I'm not sure you are, modern GPUs and PCI-E have such huge bandwidth that some thrashing doesn't impact performance that much. One of the 9500 GT models is quoted at a memory bandwidth of 25.6 GB/s, and 16x PCI-E slots (500 MB/s x 16 = 8 GB/s) are the norm.
As for the lag, I would assume the GPU and CPU throttle down their power usage when you aren't drawing visible textures, and when you suddenly load them again they need a brief instant to power up. In real-life apps and games these sudden 0%-100% workload changes never happen, so a slight lag is totally understandable and expected, I guess.
I'm developing "remote screencasting" application (just like VNC but not exactly), where I transfer updated tiles of screen pixels over the network. I'd like to implement the caching mechanism, and I'd like to hear your recommendations...
Here is how I think it should be done. For each tile coordinate there is a fixed-size stack (cache) to which I add updated tiles. When saving, I calculate some kind of checksum (probably CRC-16 would suffice, right?) of the tile data (i.e. the pixels). When I get a new tile (from a new screenshot of the desktop), I calculate its checksum and compare it to the checksums of all items in the stack for that tile coordinate. If a checksum matches, instead of sending the tile I send a special message, e.g. "take the tile from cache stack position X". This means I need identical cache stacks on the server and on the client; a rough sketch follows.
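A minimal C++ sketch of that per-coordinate stack, under the assumptions above (Checksum is whatever hash you settle on, TileData is the raw or encoded tile bytes; names are illustrative):

#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

using TileData = std::vector<uint8_t>;

struct TileCacheStack {
    size_t depth;                                        // e.g. 5 or 10 entries per coordinate
    std::deque<std::pair<uint64_t, TileData>> entries;   // newest entry at the front

    explicit TileCacheStack(size_t d) : depth(d) {}

    // Returns the stack position of a matching checksum, or -1 if absent.
    int find(uint64_t checksum) const {
        for (size_t i = 0; i < entries.size(); ++i)
            if (entries[i].first == checksum) return (int)i;
        return -1;
    }

    // Push the newest tile; evict the oldest once 'depth' is exceeded.
    void push(uint64_t checksum, TileData tile) {
        entries.emplace_front(checksum, std::move(tile));
        if (entries.size() > depth) entries.pop_back();
    }
};

The server would keep one TileCacheStack per tile coordinate. On a find() hit it sends only the slot index; otherwise it sends the tile and both ends push() it, so the two stacks stay identical.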
Here come my questions:
What should the default stack size (depth) be? Say the stack size is 5; this means the last 5 tiles for a given coordinate are saved, and the total cache size is 5 times the screen's pixel data. For big screens the raw RGB buffer of the screen is approximately 5 MB, so a 10-level stack means a 50 MB cache, right? So what should the cache depth be? I'm thinking maybe 10, but I need your suggestions.
I compress the tiles into JPEG before sending them over the network. Should I cache the JPEG tiles, or the raw RGB tiles before compression? The logical choice would be caching raw tiles, as that avoids unnecessary JPEG encoding for tiles found in the cache, but saving raw RGB pixels requires a much bigger cache. So what's the better option: before or after compression?
Is a CRC-16 checksum alone enough for comparing new screen tiles with the tiles in the cache stack? I mean, should I additionally do a byte-by-byte comparison of the tiles when the CRC matches, or is that redundant? Is the collision probability low enough to be ignored?
In general, what do you think about the scheme I described? What would you change in it? Any kind of suggestions would be appreciated!
I like the way you explained everything, this is certainly a nice idea to implement.
I implemented a similar approach for a similar application a couple of months ago, and I'm now looking for different schemes that either work alongside it or replace it.
I used a cache stack size equal to the number of tiles present on the screen, and didn't restrict a tile to matching only the previous tile at the same location. I assume this is very helpful while the user is moving a window. Cache size is a trade-off between processing power, memory, and bandwidth: the more tiles you have in the cache, the more bandwidth you may save, at the cost of memory and processing.
I used CRC-16 too, but it is not ideal: when a cached tile matches on CRC but the pixels actually differ, you get a very odd image, which was quite annoying, though very rare. The best thing is to match pixel by pixel if you can afford it in terms of processing power; in my case I couldn't.
Caching JPEG is the better idea for saving memory, because if we create a bitmap from the JPEG the quality damage has already been done, and I assume the probability of hitting a wrong CRC is the same in both cases. In my case, I cached JPEG.
I'd use a faster hash algorithm, MurmurHash2 or Jenkins for example. They promise much better uniqueness than CRC-16.
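As a stand-in example (not MurmurHash2 or Jenkins themselves, but the same idea: a cheap non-cryptographic hash with a far larger output space than CRC-16), a 64-bit FNV-1a over the raw tile bytes looks like this:

#include <cstddef>
#include <cstdint>

uint64_t fnv1a64(const uint8_t* data, size_t len)
{
    uint64_t h = 14695981039346656037ULL;   // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;              // FNV-1a 64-bit prime
    }
    return h;
}

With 64-bit digests, accidental collisions between cached tiles become rare enough that most implementations skip the byte-by-byte confirmation.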
See the Spice remote protocol (www.spice-space.org) for an example of caching.
The cache should be as big as it can be (on the client, or in an intermediate proxy).
You might check out x11vnc's cache implementation.