Create a DirectX 9 texture using portions of an image - C++

I have an image (208x8) and I would like to copy 8x8 squares from different areas of it, then join all the squares to create one IDirect3DTexture9*.

Depending on exactly what you are trying to do IDirect3DDevice9::UpdateSurface or IDirect3DDevice9::StretchRect might help you.
For simple operations on very small textures like you are describing, it can be advantageous to manipulate them using the CPU (i.e. with IDirect3DTexture9::LockRect). With D3D9 this usually implies that the texture be re-uploaded to VRAM, so it is generally only useful for small or infrequently modified textures. But sometimes if you are render-bound and you are careful about where you update the texture within your loop, it's possible to hide the cost of operations like this and get them "for free".
To avoid the VRAM upload, you can use a POOL_MANAGED resource combined with the appropriate usage and lock flags to situate the resource within the AGP aperture, which allows for high-speed access from both the CPU and GPU.
If you are manipulating on the CPU, be aware of the tiling and alignment restrictions for the various texture formats. The best information about this is in the documentation that comes with the SDK (which includes several whitepapers); the online documentation is incomplete.
Here's a basic example:
IDirect3DTexture9* m_tex = getYourTexture();
RECT d3dRect = { 0, 0, 8, 8 };  // left, top, right, bottom of the region to lock
D3DLOCKED_RECT outRect;
if (SUCCEEDED(m_tex->LockRect(0, &outRect, &d3dRect, D3DLOCK_DISCARD)))
{
    // Stride depends on your texture format - this is the number of bytes per texel.
    // Note that this may be less than 1 for DXT1 textures in which case you'll need
    // some bit swizzling logic. Can be inferred from Pitch and width.
    int stride = 1;
    int rowPitch = outRect.Pitch;  // bytes between the starts of consecutive rows
    int width = d3dRect.right - d3dRect.left;
    int height = d3dRect.bottom - d3dRect.top;
    // Choose a pointer type that suits your stride.
    unsigned char* pixels = (unsigned char*)outRect.pBits;
    // Clear to black.
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width * stride; ++x)
            pixels[x + rowPitch * y] = 0x0;
    m_tex->UnlockRect(0);
}
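For the actual question (picking 8x8 squares and joining them), the CPU-side copy can be sketched independently of Direct3D. This is only a sketch under assumptions: copyTile and joinTiles are hypothetical helper names, the source image is assumed to be a single row of tiles in plain memory, and the pixel format is assumed to be uncompressed. With a locked texture, the destination pointer and pitch would be the pBits and Pitch values from the locked rectangle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies one tile x tile square between two images that are each described
// by a base pointer and a pitch (bytes from one row to the next).
void copyTile(const std::uint8_t* src, std::size_t srcPitch, int srcX, int srcY,
              std::uint8_t* dst, std::size_t dstPitch, int dstX, int dstY,
              int tile, int bytesPerPixel)
{
    assert(src && dst && tile > 0 && bytesPerPixel > 0);
    const std::size_t rowBytes = static_cast<std::size_t>(tile) * bytesPerPixel;
    for (int row = 0; row < tile; ++row) {
        const std::uint8_t* s = src + static_cast<std::size_t>(srcY + row) * srcPitch
                                    + static_cast<std::size_t>(srcX) * bytesPerPixel;
        std::uint8_t* d = dst + static_cast<std::size_t>(dstY + row) * dstPitch
                              + static_cast<std::size_t>(dstX) * bytesPerPixel;
        std::memcpy(d, s, rowBytes);
    }
}

// Picks the listed tiles from a one-tile-high atlas and lays them side by side.
std::vector<std::uint8_t> joinTiles(const std::vector<std::uint8_t>& atlas, int atlasWidth,
                                    const std::vector<int>& tileIndices,
                                    int tile, int bytesPerPixel)
{
    const std::size_t srcPitch = static_cast<std::size_t>(atlasWidth) * bytesPerPixel;
    const std::size_t dstPitch = tileIndices.size() * tile * bytesPerPixel;
    assert(atlas.size() >= srcPitch * tile);
    std::vector<std::uint8_t> out(dstPitch * tile);
    for (std::size_t i = 0; i < tileIndices.size(); ++i) {
        assert(tileIndices[i] >= 0 && (tileIndices[i] + 1) * tile <= atlasWidth);
        copyTile(atlas.data(), srcPitch, tileIndices[i] * tile, 0,
                 out.data(), dstPitch, static_cast<int>(i) * tile, 0,
                 tile, bytesPerPixel);
    }
    return out;
}
```

For the 208x8 image, joinTiles(atlas, 208, {25, 0, 3}, 8, bytesPerPixel) would produce a 24x8 strip made of tiles 25, 0 and 3, which could then be copied row by row into the locked texture.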


SDL_gpu: Why is blitting two images in two separate for loops so much faster?

I am currently trying out some things in SDL_gpu/C++ and have the following setup. Both images are 32 by 32 pixels, and the second image is transparent.
//..sdl init..//
GPU_Image* image = GPU_LoadImage("path");
GPU_Image* image2 = GPU_LoadImage("otherpath");
for (int i = 0; i < screenheight; i += 32) {
    for (int j = 0; j < screenwidth; j += 32) {
        GPU_Blit(image, NULL, screen, j, i);
        GPU_Blit(image2, NULL, screen, j, i);
    }
}
This code gets ~20 FPS on a WQHD-sized screen. However, when I do the following
for (int i = 0; i < screenheight; i += 32) {
    for (int j = 0; j < screenwidth; j += 32) {
        GPU_Blit(image, NULL, screen, j, i);
    }
}
for (int i = 0; i < screenheight; i += 32) {
    for (int j = 0; j < screenwidth; j += 32) {
        GPU_Blit(image2, NULL, screen, j, i);
    }
}
i.e. separate the two blit calls into two different for loops, I get 300 FPS.
Can someone try to explain this to me or has any idea what might be going on here?
While cache locality might have an impact, I don't think it is the main issue here, especially considering the drop of frame time from 50ms to 3.3ms.
The call of interest is of course GPU_Blit, which makes some checks and then calls _gpu_current_renderer->impl->Blit. This Blit function seems to refer to the same one, regardless of the renderer.
A lot of code in there makes use of the image parameter, but two functions in particular, prepareToRenderImage and bindTexture, call FlushBlitBuffer several times if you are not rendering the same thing as in the previous blit. That looks to me like an expensive operation. I haven't used SDL_gpu before, so I can't guarantee anything, but it necessarily makes more glDraw* calls if you render something other than what you rendered previously, than if you render the same thing again and again. And glDraw* calls are usually the most expensive API calls in an OpenGL application.
It's relatively well known in 3D graphics that making as few changes to the context (in this case, the image to blit) as possible can improve performance, simply because it makes better use of the bandwidth between CPU and GPU. A typical example is grouping together all the rendering that uses some particular set of textures (e.g. materials). In your case, it's grouping all the rendering of one image, and then of the other image.
While both examples render the same number of textures, the first one forces the GPU to make hundreds or thousands of texture binds (depending on screen size), while the second makes only 2 texture binds.
The cost of rendering a texture is very cheap on modern GPUs while texture binds (switching to use another texture) are quite expensive.
Note that you can use a texture atlas to alleviate the texture bind bottleneck while retaining the desired render order.
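The difference in bind counts can be illustrated with a small model. This is not SDL_gpu code: countTextureBinds, interleavedOrder and groupedOrder are hypothetical helpers that only count how often the "current texture" changes for a given blit order, assuming a bind happens whenever the texture differs from the previous blit.

```cpp
#include <cassert>
#include <vector>

// Counts how often the bound texture changes while walking a list of blits.
int countTextureBinds(const std::vector<int>& blitOrder)
{
    int binds = 0;
    int bound = -1; // -1 means nothing is bound yet
    for (int textureId : blitOrder) {
        assert(textureId >= 0);
        if (textureId != bound) {
            bound = textureId;
            ++binds;
        }
    }
    return binds;
}

// Blit order of the first version: image, image2, image, image2, ...
std::vector<int> interleavedOrder(int tiles)
{
    std::vector<int> order;
    for (int i = 0; i < tiles; ++i) {
        order.push_back(0);
        order.push_back(1);
    }
    return order;
}

// Blit order of the second version: all of image, then all of image2.
std::vector<int> groupedOrder(int tiles)
{
    std::vector<int> order(tiles, 0);
    order.insert(order.end(), tiles, 1);
    return order;
}
```

A WQHD screen (2560x1440) covered with 32x32 tiles has 80 * 45 = 3600 tiles. In this model the interleaved order switches textures 7200 times per frame, while the grouped order switches only twice.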

Fast, good quality pixel interpolation for extreme image downscaling

In my program, I am downscaling an image of 500px or larger to an extreme level of approx 16px-32px. The source image is user-specified so I do not have control over its size. As you can imagine, few pixel interpolations hold up and inevitably the result is heavily aliased.
I've tried bilinear, bicubic and square average sampling. The square average sampling actually provides the most decent results but the smaller it gets, the larger the sampling radius has to be. As a result, it gets quite slow - slower than the other interpolation methods.
I have also tried an adaptive square average sampling so that the smaller it gets the greater the sampling radius, while the closer it is to its original size, the smaller the sampling radius. However, it produces problems and I am not convinced this is the best approach.
So the question is: What is the recommended type of pixel interpolation that is fast and works well on such extreme levels of downscaling?
I do not wish to use a library so I will need something that I can code by hand and isn't too complex. I am working in C++ with VS 2012.
Here's some example code I've tried as requested (hopefully without errors from my pseudo-code cut and paste). This performs a 7x7 average downscale and although it's a better result than bilinear or bicubic interpolation, it also takes quite a hit:
// Sizing control
ctl(0): "Resize",Range=(0,800),Val=100
// Variables
float fracx,fracy;
int Xnew,Ynew,p,q,Calc;
int x,y,z,p1,q1,i,j;
//New image dimensions
for (y=0; y<image->height; y++){ // rows
    for (x=0; x<image->width; x++){ // columns
        for (z=0; z<3; z++){ // channels
            Calc = 0; // reset the sum for every output sample
            for (i=-3;i<=3;i++) {
                for (j=-3;j<=3;j++) {
                    Calc += (int)(src(p1-i,q1-j,z));
                } //j
            } //i
            Calc /= 49;
            pset(x, y, z, Calc);
        } // channels
    } // columns
} // rows
The first point is to use pointers to your data. Never use indexes at every pixel. When you write: src(p1-i,q1-j,z) or pset(x, y, z, Calc) how much computation is being made? Use pointers to data and manipulate those.
Second: your algorithm is wrong. You don't want an average filter, but you want to make a grid on your source image and for every grid cell compute the average and put it in the corresponding pixel of the output image.
The specific solution should be tailored to your data representation, but it could be something like this:
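std::vector<uint32_t> accum(Xnew);
std::vector<uint32_t> count(Xnew);
uint32_t *paccum, *pcount;
uint8_t* pin = /*pointer to input data*/;
uint8_t* pout = /*pointer to output data*/;
for (int dr = 0, sr = 0, w = image->width, h = image->height; sr < h; ++dr) {
    // Start a new output row: clear the per-column sums and counts.
    memset(paccum = accum.data(), 0, Xnew*4);
    memset(pcount = count.data(), 0, Xnew*4);
    // Accumulate every source row that falls into output row dr.
    while (sr < h && sr * Ynew / h == dr) {
        paccum = accum.data();
        pcount = count.data();
        for (int dc = 0, sc = 0; sc < w; ++sc) {
            *paccum += *pin++;
            *pcount += 1;
            // Advance once the next source column belongs to the next output column.
            if ((sc + 1) * Xnew / w > dc) {
                ++dc;
                ++paccum;
                ++pcount;
            }
        }
        ++sr;
    }
    // Divide each sum by its count to get the cell average.
    std::transform(begin(accum), end(accum), begin(count), pout, std::divides<uint32_t>());
    pout += Xnew;
}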
This was written using my own library (still in development) and it seems to work, but I later changed the variable names to make it simpler here, so I don't guarantee anything!
The idea is to have a local buffer of 32 bit ints which can hold the partial sum of all pixels in the rows which fall in a row of the output image. Then you divide by the cell count and save the output to the final image.
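The same grid-averaging idea can be written as a self-contained function. This is a sketch under assumptions, not the code from the library mentioned above: downscaleBox is a hypothetical name, the image is assumed to be single-channel 8-bit and tightly packed, and for simplicity it accumulates the whole output image at once instead of one output row at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Box-filter downscale: every source pixel is added to the output cell it
// falls into, then each cell is divided by the number of pixels it received.
std::vector<std::uint8_t> downscaleBox(const std::vector<std::uint8_t>& src,
                                       int w, int h, int newW, int newH)
{
    assert(newW > 0 && newH > 0 && newW <= w && newH <= h); // downscaling only
    assert(src.size() == static_cast<std::size_t>(w) * h);

    std::vector<std::uint32_t> accum(static_cast<std::size_t>(newW) * newH, 0);
    std::vector<std::uint32_t> count(accum.size(), 0);

    const std::uint8_t* pin = src.data(); // walks the source sequentially
    for (int sy = 0; sy < h; ++sy) {
        const std::size_t rowBase = static_cast<std::size_t>(sy * newH / h) * newW;
        for (int sx = 0; sx < w; ++sx) {
            const std::size_t cell = rowBase + static_cast<std::size_t>(sx * newW / w);
            accum[cell] += *pin++;
            ++count[cell];
        }
    }

    std::vector<std::uint8_t> out(accum.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        // count[i] is never zero when downscaling; add half for rounding.
        out[i] = static_cast<std::uint8_t>((accum[i] + count[i] / 2) / count[i]);
    }
    return out;
}
```

Every source pixel is read exactly once and in memory order, so the cost depends only on the source size and not on how extreme the scale factor is. For RGB data the same loop could be run per channel.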
The first thing you should do is to set up a performance evaluation system to measure how much any change impacts on the performance.
As said previously, you should use pointers rather than indexes for a (probably) substantial speed-up, and you should not simply average, since a basic averaging of pixels is basically a blur filter.
I would highly advise you to rework your code to be using "kernels". This is the matrix representing the ratio of each pixel used. That way, you will be able to test different strategies and optimize quality.
Example of kernels:
Upsampling/downsampling kernel:
Note: from the code it seems you apply a 3x3 kernel, although it was initially done with a 7x7 kernel. The equivalent 3x3 kernel as posted would be:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]
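Applying such a kernel at one pixel could look roughly like this. It is a sketch with assumed names (applyKernel and boxKernel are hypothetical), using a single-channel float image and clamping coordinates at the image border, which is one of several possible edge policies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted sum of the ksize x ksize neighbourhood centred on (x, y).
// Coordinates outside the image are clamped to the nearest edge pixel.
float applyKernel(const std::vector<float>& img, int w, int h, int x, int y,
                  const std::vector<float>& kernel, int ksize)
{
    assert(ksize % 2 == 1);
    assert(kernel.size() == static_cast<std::size_t>(ksize) * ksize);
    assert(img.size() == static_cast<std::size_t>(w) * h);
    const int r = ksize / 2;
    float sum = 0.0f;
    for (int j = -r; j <= r; ++j) {
        for (int i = -r; i <= r; ++i) {
            const int sx = std::min(std::max(x + i, 0), w - 1);
            const int sy = std::min(std::max(y + j, 0), h - 1);
            sum += img[static_cast<std::size_t>(sy) * w + sx]
                 * kernel[static_cast<std::size_t>(j + r) * ksize + (i + r)];
        }
    }
    return sum;
}

// The averaging kernel shown above: all ones, scaled by 1 / (ksize * ksize).
std::vector<float> boxKernel(int ksize)
{
    return std::vector<float>(static_cast<std::size_t>(ksize) * ksize,
                              1.0f / static_cast<float>(ksize * ksize));
}
```

Swapping boxKernel for a different weight table (for example one that weights the centre more heavily) changes the filter without touching the loop, which is what makes it easy to compare strategies.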

Unable to create image from compressed texture data (S3TC)

I've been trying to load compressed images with S3TC (BC/DXT) compression in Vulkan, but so far I haven't had much luck.
Here is what the Vulkan specification says about compressed images:
Compressed texture images stored using the S3TC compressed image formats are represented as a collection of 4×4 texel blocks, where each block contains 64 or 128 bits of texel data. The image is encoded as a normal 2D raster image in which each 4×4 block is treated as a single pixel.
For images created with linear tiling, rowPitch, arrayPitch and depthPitch describe the layout of the subresource in linear memory. For uncompressed formats, rowPitch is the number of bytes between texels with the same x coordinate in adjacent rows (y coordinates differ by one). arrayPitch is the number of bytes between texels with the same x and y coordinate in adjacent array layers of the image (array layer values differ by one). depthPitch is the number of bytes between texels with the same x and y coordinate in adjacent slices of a 3D image (z coordinates differ by one). Expressed as an addressing formula, the starting byte of a texel in the subresource has address:
// (x,y,z,layer) are in texel coordinates
address(x,y,z,layer) = layer*arrayPitch + z*depthPitch + y*rowPitch + x*texelSize + offset
For compressed formats, the rowPitch is the number of bytes between compressed blocks in adjacent rows. arrayPitch is the number of bytes between blocks in adjacent array layers. depthPitch is the number of bytes between blocks in adjacent slices of a 3D image.
// (x,y,z,layer) are in block coordinates
address(x,y,z,layer) = layer*arrayPitch + z*depthPitch + y*rowPitch + x*blockSize + offset;
arrayPitch is undefined for images that were not created as arrays. depthPitch is defined only for 3D images.
For color formats, the aspectMask member of VkImageSubresource must be VK_IMAGE_ASPECT_COLOR_BIT. For depth/stencil formats, aspect must be either VK_IMAGE_ASPECT_DEPTH_BIT or VK_IMAGE_ASPECT_STENCIL_BIT. On implementations that store depth and stencil aspects separately, querying each of these subresource layouts will return a different offset and size representing the region of memory used for that aspect. On implementations that store depth and stencil aspects interleaved, the same offset and size are returned and represent the interleaved memory allocation.
My image is a normal 2D image (0 layers, 1 mipmap), so there's no arrayPitch or depthPitch. Since S3TC compression is directly supported by the hardware, it should be possible to use the image data without decompressing it first. In OpenGL this can be done using glCompressedTexImage2D, and this has worked for me in the past.
In OpenGL I've used GL_COMPRESSED_RGBA_S3TC_DXT1_EXT as image format, for Vulkan I'm using VK_FORMAT_BC1_RGBA_UNORM_BLOCK, which should be equivalent.
Here's my code for mapping the image data:
auto dds = load_dds("");
auto *srcData = static_cast<uint8_t*>(dds.data());
auto *destData = static_cast<uint8_t*>(vkImageMapPtr); // Pointer to mapped memory of VkImage
destData += layout.offset(); // layout = VkSubresourceLayout of the image
assert((w %4) == 0);
assert((h %4) == 0);
assert(blockSize == 8); // S3TC BC1
auto wBlocks = w /4;
auto hBlocks = h /4;
for(auto y=decltype(hBlocks){0};y<hBlocks;++y)
{
    auto *rowDest = destData +y *layout.rowPitch(); // rowPitch is 0
    auto *rowSrc = srcData +y *(wBlocks *blockSize);
    for(auto x=decltype(wBlocks){0};x<wBlocks;++x)
    {
        auto *pxDest = rowDest +x *blockSize;
        auto *pxSrc = rowSrc +x *blockSize; // 4x4 image block
        memcpy(pxDest,pxSrc,blockSize); // 64Bit per block
    }
}
And here's the code for initializing the image:
vk::Device device = ...; // Initialization
vk::AllocationCallbacks allocatorCallbacks = ...; // Initialization
[...] // Load the dds data
uint32_t width = dds.width();
uint32_t height = dds.height();
auto format = dds.format(); // = vk::Format::eBc1RgbaUnormBlock;
vk::Extent3D extent(width,height,1);
vk::ImageCreateInfo imageInfo(
    [...]
    vk::ImageUsageFlagBits::eSampled | vk::ImageUsageFlagBits::eColorAttachment,
    [...]
);
vk::Image img = nullptr;
vk::MemoryRequirements memRequirements;
uint32_t typeIndex = 0;
get_memory_type(memRequirements.memoryTypeBits(),vk::MemoryPropertyFlagBits::eHostVisible,typeIndex); // -> typeIndex is set to 1
auto szMem = memRequirements.size();
vk::MemoryAllocateInfo memAlloc(szMem,typeIndex);
vk::DeviceMemory mem;
device.allocateMemory(&memAlloc,&allocatorCallbacks,&mem); // Note: Using the default allocation (nullptr) doesn't change anything
uint32_t mipLevel = 0;
vk::ImageSubresource resource([...]);
vk::SubresourceLayout layout;
auto *srcData = device.mapMemory(mem,0,szMem,vk::MemoryMapFlagBits(0));
[...] // Map the dds-data (See code from first post)
The code runs without issues, however the resulting image isn't correct. This is the source image:
And this is the result:
I'm certain that the problem lies in the first code snippet I've posted; however, in case it doesn't, I've written a small adaptation of the triangle demo from the Vulkan SDK which produces the same result. It can be downloaded here. The source code is included; all I've changed from the triangle demo are the "demo_prepare_texture_image" function in tri.c (Lines 803 to 903) and the "dds.cpp" and "dds.h" files. "dds.cpp" contains the code for loading the dds and mapping the image memory.
I'm using gli to load the dds-data (Which is supposed to "work perfectly with Vulkan"), which is also included in the download above. To build the project, the Vulkan SDK include directory has to be added to the "tri" project, and the path to the dds has to be changed (tri.c, Line 809).
The source image ("x64/Debug/" in the project) uses DXT1 compression. I've tested it on different hardware as well, with the same result.
Any example code for initializing/mapping compressed images would also help a lot.
Your problem is actually quite simple - in the demo_prepare_textures function, the first line, there is a variable tex_format, which is set to VK_FORMAT_B8G8R8A8_UNORM (which is what it is in the original sample). This eventually gets used to create the VkImageView. If you just change this to VK_FORMAT_BC1_RGBA_UNORM_BLOCK, it displays the texture correctly on the triangle.
As an aside - you can verify that your texture loaded correctly with RenderDoc, which comes with the Vulkan SDK installation. Doing a capture and looking in the TextureViewer tab, the Inputs tab shows that your texture looks identical to the one on disk, even with the incorrect format.
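Since example code for mapping compressed images was also requested, here is a hedged sketch of just the copy step, kept free of Vulkan calls so that it stands alone. blockCount and copyCompressedRows are hypothetical helpers; the source is assumed to be tightly packed, and dstRowPitch stands for the row pitch of the destination, which for a linear-tiled Vulkan image would be the rowPitch member of the VkSubresourceLayout returned by vkGetImageSubresourceLayout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Number of 4x4 blocks needed to cover `texels` texels (rounds up).
std::size_t blockCount(std::size_t texels)
{
    return (texels + 3) / 4;
}

// Copies tightly packed rows of compressed blocks into a destination whose
// block rows start dstRowPitch bytes apart. blockSize is 8 for BC1.
void copyCompressedRows(const std::uint8_t* src, std::uint8_t* dst,
                        std::size_t width, std::size_t height,
                        std::size_t blockSize, std::size_t dstRowPitch)
{
    const std::size_t rowBytes = blockCount(width) * blockSize;
    assert(src && dst);
    assert(dstRowPitch >= rowBytes); // a pitch of 0 would fail here
    for (std::size_t by = 0; by < blockCount(height); ++by) {
        std::memcpy(dst + by * dstRowPitch, src + by * rowBytes, rowBytes);
    }
}
```

Because all blocks in a row are contiguous in both source and destination, one memcpy per block row is enough; copying block by block gives the same result with more calls.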

Vertically flipping a char array: is there a more efficient way?

Let's start with some code:
QByteArray OpenGLWidget::modifyImage(QByteArray imageArray, const int width, const int height){
    if (vertFlip){
        /* Each pixel consists of four unsigned chars: Red Green Blue Alpha.
         * The field is normally 640*480, this means that the whole picture is in fact 640*4 uChars wide.
         * The whole ByteArray is one-dimensional, this means that 640*4 is the red of the first pixel of the second row
         * This function is EXTREMELY SLOW
         */
        QByteArray tempArray = imageArray;
        for (int h = 0; h < height; ++h){
            for (int w = 0; w < width/2; ++w){
                for (int i = 0; i < 4; ++i){
                    // 4*(width - 1 - w) is the mirrored pixel within the same row
                    imageArray.data()[h*width*4 + 4*w + i] = tempArray.data()[h*width*4 + 4*(width - 1 - w) + i];
                    imageArray.data()[h*width*4 + 4*(width - 1 - w) + i] = tempArray.data()[h*width*4 + 4*w + i];
                }
            }
        }
    }
    return imageArray;
}
This is the code I use right now to vertically flip an image which is 640*480 (The image is actually not guaranteed to be 640*480, but it mostly is). The color encoding is RGBA, which means that the total array size is 640*480*4. I get the images with 30 FPS, and I want to show them on the screen with the same FPS.
On an older CPU (Athlon X2) this code is just too much: the CPU is racing to keep up with the 30 FPS, so the question is: can I do this more efficiently?
I am also working with OpenGL; does that have a gimmick I am not aware of that can flip images with relatively low CPU/GPU usage?
According to this question, you can flip an image in OpenGL by scaling it by (1,-1,1). This question explains how to do transformations and scaling.
You can improve at least by doing it blockwise, making use of the cache architecture. In your example one of the accesses (either the read OR the write) will be off-cache.
For a start it can help to "capture scanlines" if you're using two loops to loop through the pixels of an image, like so:
for (int y = 0; y < height; ++y)
{
    // Capture scanline.
    char* scanline = imageArray.data() + y*width*4;
    for (int x = 0; x < width/2; ++x)
    {
        const int flipped_x = width - x-1;
        for (int i = 0; i < 4; ++i)
            std::swap(scanline[x*4 + i], scanline[flipped_x*4 + i]);
    }
}
Another thing to note is that I used swap instead of a temporary image. That'll tend to be more efficient since you can just swap using registers instead of loading pixels from a copy of the entire image.
It also generally helps to use a 32-bit integer instead of working one byte at a time if you're going to be doing anything like this. If you're working with pixels stored as 8-bit types but know that each pixel is 32 bits, as in your case, you can generally get away with a cast to uint32_t*, e.g.
for (int y = 0; y < height; ++y)
{
    uint32_t* scanline = (uint32_t*)imageArray.data() + y*width;
    std::reverse(scanline, scanline + width);
}
At this point you might parallelize the y loop. Flipping an image horizontally (it should be "horizontal" if I understood your original code correctly) in this way is a little bit tricky with the access patterns, but you should be able to get quite a decent boost using the above techniques.
"I am also working with OpenGL, does that have a gimmick I am not aware of that can flip images with relatively low CPU/GPU usage?"
Naturally the fastest way to flip images is to not touch their pixels at all and just save the flipping for the final part of the pipeline when you render the result. For this you might render a texture in OGL with negative scaling instead of modifying the pixels of a texture.
Another thing that's really useful in video and image processing is to represent an image to process like this for all your image operations:
struct Image32
{
    uint32_t* pixels;
    int32_t width;
    int32_t height;
    int32_t x_stride;
    int32_t y_stride;
};
The stride fields are what you use to get from one scanline (row) of an image to the next vertically and from one column to the next horizontally. When you use this representation, you can use negative values for the stride and offset the pixels pointer accordingly. You can also use the stride fields to, say, render only every other scanline of an image for fast interactive half-res scanline previews by using y_stride=width*2 and height/=2. You can quarter-res an image by setting the x stride to 2 and the y stride to 2*width and then halving the width and height. You can render a cropped image without making your blit functions accept a boatload of parameters by just modifying these fields and keeping the y stride at the full image width, so that it still takes you from one row of the cropped section to the next:
// Using the stride representation of Image32, this can now
// blit a cropped source, a horizontally flipped source,
// a vertically flipped source, a source flipped both ways,
// a half-res source, a quarter-res source, a quarter-res
// source that is horizontally flipped and cropped, etc,
// and all without modifying the source image in advance
// or having to accept all kinds of extra drawing parameters.
void blit(int dst_x, int dst_y, Image32 dst, Image32 src);
// We don't have to do things like this (and I think I lost
// some capabilities with this version below but it hurts my
// brain too much to think about what capabilities were lost):
void blit_gross(int dst_x, int dst_y, int dst_w, int dst_h, uint32_t* dst,
int src_x, int src_y, int src_w, int src_h,
const uint32_t* src, bool flip_x, bool flip_y);
By using negative values and passing it to an image operation (ex: a blit operation), the result will naturally be flipped without having to actually flip the image. It'll end up being "drawn flipped", so to speak, just as with the case of using OGL with a negative scaling transformation matrix.
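A minimal sketch of such flipped views, assuming strides are measured in pixels (not bytes) and that the helper names at, flippedX and flippedY are hypothetical. Nothing is copied; only the start pointer and the sign of one stride change.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Image32 {
    std::uint32_t* pixels;  // first pixel of the view
    std::int32_t width;
    std::int32_t height;
    std::int32_t x_stride;  // elements to step for x + 1
    std::int32_t y_stride;  // elements to step for y + 1
};

// Pixel access through the strides; works for normal and flipped views.
inline std::uint32_t& at(const Image32& img, int x, int y)
{
    assert(x >= 0 && x < img.width && y >= 0 && y < img.height);
    return img.pixels[x * img.x_stride + y * img.y_stride];
}

// View mirrored left-to-right: start at the last column and walk backwards.
inline Image32 flippedX(Image32 img)
{
    img.pixels += (img.width - 1) * img.x_stride;
    img.x_stride = -img.x_stride;
    return img;
}

// View mirrored top-to-bottom: start at the last row and walk upwards.
inline Image32 flippedY(Image32 img)
{
    img.pixels += (img.height - 1) * img.y_stride;
    img.y_stride = -img.y_stride;
    return img;
}
```

Any routine written in terms of at (or the strides directly) then draws flippedX(img) or flippedY(img) mirrored without ever touching the pixel data.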

Random access to buffer optimisation

I have colorBuffer Color[width*height] (most likely 800*600)
and during rasterization I call:
void setPixel(int x, int y, Color & color)
{
    colorBuffer[y * width + x] = color;
}
It turns out that this random access to the color buffer is really inefficient and slows my application down.
I think that it is caused by the way I use it. I calculate some pixel (with rasterization algorithms) and call setPixel.
So I think my buffer is not in cache and this is the main problem. When trying to write into the whole buffer at once, it is much much faster.
Is there any way to optimize this?
I do not use it to fill buffer with two for cycles.
I use it to paint "random" pixels.
E.g. when rasterizing a line I use it like this:
calculate next point
calculate next point
setPixel(next point)
The way I see it, the access pattern to the buffer depends on the order in which your algorithm processes the pixels. Can you not simply change that order so that it creates a sequential access scheme to your buffer?
Yes, you should try to be cache-friendly,
but the first thing I would do is find out what's taking time.
It's simple enough. Just pause it several times and see what it's doing.
If it's mostly in calculate next point, you should see what it's doing in there, because that's where the time is going.
(I assume you understand that by "in" I mean "on the stack".)
If it's mostly in setPixel, when you pause it, look at the disassembly window.
If it's spending much time in the prologue/epilogue of the routine, it should be inlined.
If it's spending much time in the actual move instruction into colorBuffer, then you're hitting the cache issue.
If it's spending much time in the code for the index calculation y * width + x, then you might want to see if you could somehow use an initialized pointer that you step along.
If you fix anything, you should do it all again, because you may have uncovered another opportunity to speed it up further.
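The "initialized pointer that you step along" idea might look like the sketch below. It is an illustration under assumptions: Color is replaced by a 32-bit stand-in, drawRun is a hypothetical helper, and it only handles runs whose per-pixel step (dx, dy) is constant, such as horizontal, vertical or 45-degree runs. The index y * width + x is computed once, and each further pixel costs one pointer addition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Color = std::uint32_t; // stand-in for the question's Color type

// Writes `count` pixels starting at (x, y), advancing by (dx, dy) per pixel.
void drawRun(std::vector<Color>& buffer, int width, int x, int y,
             int dx, int dy, int count, Color color)
{
    assert(width > 0 && count > 0);
    const int height = static_cast<int>(buffer.size()) / width;
    const int lastX = x + dx * (count - 1);
    const int lastY = y + dy * (count - 1);
    assert(x >= 0 && x < width && y >= 0 && y < height);
    assert(lastX >= 0 && lastX < width && lastY >= 0 && lastY < height);

    // Compute the start address once, then only step the pointer.
    Color* p = buffer.data() + static_cast<std::ptrdiff_t>(y) * width + x;
    const std::ptrdiff_t step = static_cast<std::ptrdiff_t>(dy) * width + dx;
    *p = color;
    for (int i = 1; i < count; ++i) {
        p += step;
        *p = color;
    }
}
```

Whether this helps in practice depends on what the profiling described above shows; if the time is in the memory write itself rather than the index arithmetic, the gain will be small.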
The first thing to notice is that the way you process your pixels makes a huge difference to speed. If you do
for (int x = 0; x < width;++x)
for (int y = 0; y < height; ++y)
this will be really bad for performance because you're literally jumping around in memory width-wise (note that you do y*width + x).
If you simply change the order of processing to
for (int y = 0; y < height;++y)
for (int x = 0; x < width; ++x)
you already should notice a performance gain as the processor now gets a chance to cache memory accesses (which it didn't before).
Furthermore you should check if you can determine that entire blocks of pixels will have the same color value before actually setting the memory. Then you can copy those constant color values block-wise to your image array which can save you also a good deal of performance.
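Writing constant-colour runs block-wise could be sketched as follows. The names fillSpan and fillRect are hypothetical, Color is again a 32-bit stand-in, and spans are half-open ranges [x0, x1). Each span is one contiguous write, which is the cache-friendly row-major order described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Color = std::uint32_t; // stand-in for the question's Color type

// Fills pixels [x0, x1) of row y with one colour in a single contiguous write.
void fillSpan(std::vector<Color>& buffer, int width, int y, int x0, int x1, Color color)
{
    assert(0 <= x0 && x0 <= x1 && x1 <= width);
    assert(y >= 0 && static_cast<std::size_t>(y + 1) * width <= buffer.size());
    const std::ptrdiff_t row = static_cast<std::ptrdiff_t>(y) * width;
    std::fill(buffer.begin() + row + x0, buffer.begin() + row + x1, color);
}

// Fills the rectangle [x0, x1) x [y0, y1) row by row.
void fillRect(std::vector<Color>& buffer, int width,
              int x0, int y0, int x1, int y1, Color color)
{
    for (int y = y0; y < y1; ++y) {
        fillSpan(buffer, width, y, x0, x1, color);
    }
}
```

A triangle or polygon rasterizer that produces one span per scanline can call fillSpan directly instead of calling setPixel for every pixel.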