I'm reading about loading DDS textures. I read this article and saw this posting. (I also read the wiki about S3TC)
I understood most of the code, but there are three lines I didn't quite get.
blockSize = (format == GL_COMPRESSED_RGBA_S3TC_DXT1_EXT) ? 8 : 16;
and:
size = ((width + 3) / 4) * ((height + 3) / 4) * blockSize;
and:
bufsize = mipMapCount > 1 ? linearSize * 2 : linearSize;
What is blockSize? and why are we using 8 for DXT1 and 16 for the rest?
What is happening exactly when we're calculating size? More specifically, why are we adding 3, dividing by 4, then multiplying by blockSize?
Why are we multiplying by 2 if mipMapCount > 1?
DXT1-5 formats are also called BCn formats (the numbering differs slightly), where BC stands for block compression. Pixels are not stored individually; the data is stored in blocks, each covering 4x4 pixels.
The 1st line checks if it's DXT1, because it has a size of 8 bytes per block. DXT3 and DXT5 use 16 bytes per block. (Note that newer formats exist and at least one of them is 8 bytes/block: BC4.)
The 2nd rounds up the dimensions of the texture to a multiple of the dimensions of a block. This is required since these formats can only store whole blocks, not individual pixels. For example, if you have a texture of 15x6 pixels, then since BCn blocks are 4x4 pixels, you will need to store 4 blocks per row and 2 blocks per column, even if the last column/row of blocks is only partially filled.
One way of rounding up a positive integer (let's call it i) to a multiple of another positive integer (let's call it m), is:
(i + m - 1) / m * m
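For example, rounding i = 15 up to a multiple of m = 4 gives (15 + 4 - 1) / 4 * 4 = 18 / 4 * 4 = 4 * 4 = 16, using integer division throughout.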
Here, we need to get the number of blocks in each dimension and then multiply by the size of a block to get the total size of the texture. To do that, we round up width and height to the next multiple of 4, divide each by 4 to get the number of blocks, and finally multiply by the size of a block:
size = ((width + 3) / 4 * 4) / 4 * ((height + 3) / 4 * 4) / 4 * blockSize;
//     round up to a multiple of 4, then divide by 4 to count blocks, per dimension
If you look closely, there's a * 4 followed by a / 4 in each dimension that can be simplified. If you do that, you'll get exactly the same code you had. The conclusion to all this: comment any code that's not perfectly obvious :P
The 3rd line may be an approximation to calculate a buffer size big enough to easily store the whole mipmap chain. But I'm not sure what this linearSize is; it corresponds to dwPitchOrLinearSize in the DDS header. In any case, you don't really need this value, since you can calculate the size of each level easily with the code above.
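To illustrate, here is a minimal sketch (hypothetical helper names, assuming the 4x4-block layout described above) that computes the size of each mip level directly instead of relying on linearSize:

#include <cstdint>

// Size in bytes of one mip level: round each dimension up to whole
// 4x4 blocks, then multiply by the per-block size (8 for DXT1, 16 for DXT3/DXT5).
std::uint32_t levelSize(std::uint32_t width, std::uint32_t height, std::uint32_t blockSize)
{
    return ((width + 3) / 4) * ((height + 3) / 4) * blockSize;
}

// Total size of the mip chain: each level halves both dimensions,
// which never drop below 1.
std::uint32_t mipChainSize(std::uint32_t width, std::uint32_t height,
                           std::uint32_t blockSize, std::uint32_t mipMapCount)
{
    std::uint32_t total = 0;
    for (std::uint32_t level = 0; level < mipMapCount; ++level) {
        total += levelSize(width, height, blockSize);
        width  = width  > 1 ? width  / 2 : 1;
        height = height > 1 ? height / 2 : 1;
    }
    return total;
}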
Related
Started working on screen capturing software specifically targeted for Windows. While looking through an example on MSDN for Capturing an Image, I found myself a bit confused.
Keep in mind that when I refer to the size of the bitmap, that does not include the headers and so forth associated with an actual file. I'm talking about raw pixel data. I would have thought that the formula would be (width*height)*bits-per-pixel. However, according to the example this is the proper way to calculate the size:
DWORD dwBmpSize = ((bmpScreen.bmWidth * bi.biBitCount + 31) / 32) * 4 * bmpScreen.bmHeight;
or equivalently: ((width * bits-per-pixel + 31) / 32) * 4 * height
I don't understand why there's the extra calculations involving 31, 32 and 4. Perhaps padding? I'm not sure but any explanations would be quite appreciated. I've already tried Googling and didn't find any particularly helpful results.
The bits representing the bitmap pixels are packed in rows. The size of each row is rounded up to a multiple of 4 bytes (a 32-bit DWORD) by padding.
(bits_per_row + 31) / 32 * 4 rounds the row size up to the next multiple of 32 bits. The answer is in bytes rather than bits, hence the * 4 rather than * 32.
See: https://en.wikipedia.org/wiki/BMP_file_format
Under Bitmap Header Types you'll find the following:
The scan lines are DWORD aligned [...]. They must be padded for scan line widths, in bytes, that are not evenly divisible by four [...]. For example, a 10- by 10-pixel 24-bpp bitmap will have two padding bytes at the end of each scan line.
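Plugging the quoted example into the formula checks out: ((10 * 24 + 31) / 32) * 4 = (271 / 32) * 4 = 8 * 4 = 32 bytes per scan line, which is 30 bytes of pixel data (10 pixels * 3 bytes) plus the 2 padding bytes mentioned above.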
The formula
((bmpScreen.bmWidth * bi.biBitCount + 31) / 32) * 4
establishes DWORD-alignment (in bytes). The trailing * 4 is really the result of * 32 / 8, where the multiplication with 32 produces a value that's a multiple of 32 (in bits), and the division by 8 translates it back to bytes.
Although this does produce the desired result, I prefer a different implementation. A DWORD is 32 bits, i.e. a power of 2. Rounding up to a multiple of a power of 2 can be implemented using the following formula:
(value + ((1 << n) - 1)) & ~((1 << n) - 1)
Adding (1 << n) - 1 adjusts the initial value to reach past the next multiple of 2^n (unless it already is a multiple of 2^n). (1 << n) - 1 evaluates to a value where the n least significant bits are set, and ~((1 << n) - 1) negates that, i.e. all bits but the n least significant bits are set. This serves as a mask to remove the n least significant bits of the adjusted initial value.
Applied to this specific case, where a DWORD is 32 bits, i.e. n is 5, and (1 << n) - 1 evaluates to 31. value is the raw scanline width in bits:
auto raw_scanline_width_in_bits{ bmpScreen.bmWidth * bi.biBitCount };
auto aligned_scanline_width_in_bits{ (raw_scanline_width_in_bits + 31) & ~31 };
auto aligned_scanline_width_in_bytes{ aligned_scanline_width_in_bits / 8 };
This produces the same results, but provides a different perspective, that may be more accessible to some.
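As a quick sanity check (a hypothetical test program, not part of the original answer), the division-based formula and the bitmask variant can be compared across a range of widths and bit depths; they should always agree:

#include <cassert>

int main()
{
    for (int width = 1; width <= 1024; ++width) {
        for (int bpp : { 1, 4, 8, 16, 24, 32 }) {
            int bits = width * bpp;
            int byDivision = (bits + 31) / 32 * 4;    // MSDN-style formula
            int byMasking  = ((bits + 31) & ~31) / 8; // bitmask alignment
            assert(byDivision == byMasking);
        }
    }
    return 0;
}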
I am having a problem with understanding/managing sizes of .vtu files in VTK. I need to write CFD output for hexahedral meshes with millions of cells and nodes. So, I am looking at ways to improve the efficiency of storage. I started with simple test cases.
Case1: 80x40x40 hexahedral mesh with 8 points for each hexahedron. So, 128000 cells and 1024000 points in total. Let's call it C1.vtu.
Case2: 80x40x40 hexahedral mesh with only unique points. So, 128000 cells and 136161 points in total. Let's call it C2.vtu.
I store one vector field (velocity) for each point in each case. I use vtkFloatArray for this data. The size of C1.vtu is 7.5 MB, and C2.vtu file is 3.0MB.
This is not what I expected when I created C2.vtu. As I store only about 13% of points (of Case1) in Case2, I expected that C2.vtu would be reduced accordingly (at least 5 times). However, the reduction is only 2.5 times.
I would like to understand what is going on internally. Also, I appreciate any insights on reducing the file size further.
I am using VTK 6.2 with C++ on Ubuntu 12.04.
It sounds like you have compression enabled in the writer; does writer->GetCompressor() return a non-NULL pointer? If so, then that is almost surely the reason for the difference in file sizes. Without compression, I would expect larger file sizes than you are reporting. As the comments above noted, unstructured storage adds connectivity overhead. Consider your meshes C1 and C2:
C1
connectivity size = 128000 * (1 cell type + 1 cell offset + 8 point IDs) * (4 or 8 bytes per integer)
point coordinate size = 1024000 * (3 coords) * (4 or 8 bytes per coord)
vector field size = 1024000 * (3 components per tuple) * (4 or 8 bytes per component)
that would be 28.32 MiB at a minimum (all int32/float32), yet you report 7.5 MB
C2
connectivity size = 128000 * (1 cell type + 1 cell offset + 8 point IDs) * (4 or 8 bytes per integer)
point coordinate size = 136161 * (3 coords) * (4 or 8 bytes per coord)
vector field size = 136161 * (3 components per tuple) * (4 or 8 bytes per component)
that would be 8 MiB at a minimum, but you report 3 MB.
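A minimal sketch of that compressor check, assuming a vtkXMLUnstructuredGridWriter and the VTK 6.x vtkXMLWriter API (adapt to whichever writer you actually use):

#include <vtkXMLUnstructuredGridWriter.h>
#include <iostream>

void reportCompression(vtkXMLUnstructuredGridWriter* writer)
{
    // A non-NULL compressor means the data arrays are compressed on write,
    // which would explain file sizes well below the raw-array estimates.
    if (writer->GetCompressor() != NULL)
        std::cout << "Compression is enabled." << std::endl;
    else
        std::cout << "No compression." << std::endl;

    // To compare file sizes against the raw estimates, disable it:
    writer->SetCompressorTypeToNone();
}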
I have the following problem I am unable to solve gracefully.
I have a data type that can take 3 possible values (0,1,2).
I have an array of 20 element of this data type.
As I want to encode the information on the least amount of memory, I did the following :
consider that each element can take up to 4 values (2 bits)
each char holds 8 bits, so I can put 4 times an element
5 char holds 40 bits, so I can store 20 elements.
I have done this and it works fine.
However, I'm interested in evaluating the space gained by using the fact that my element can only take 3 values and not 4.
Every possible combination gives us 3 to the 20th power, which is 3,486,784,401. However, 256 to the 4th power gives us 4,294,967,296, which is greater. This means I could encode my data in 4 chars.
Is there a generic method to do the 2nd idea here? The 1st idea is simple to implement with bit masks / bit shifts. However, since 3 values don't fit in an integer number of bits, I have no idea how to encode / decode any of these values into an array of 4 chars.
Do you have any idea or reference on how it's done? I think there must be a general method. If anything, I'm interested in the feasibility of this.
Edit: this could be simplified to: how to store 5 values from 0 to 2 into 1 byte only (as 256 >= 3^5 = 243).
You should be able to do what you said using 4 bytes. Assume that you store the 20 values in a single uint32_t called value (it must be unsigned, since 3^20 - 1 = 3,486,784,400 exceeds the range of a signed 32-bit integer); here is how you would extract any particular element:
element[0] = value % 3;
element[1] = (value / 3) % 3;
element[2] = (value / 9) % 3;
...
element[19] = (value / 1162261467) % 3; // 1162261467 = 3 ^ 19
Or as a loop:
for (i = 0; i < 20; i++) {
    element[i] = value % 3;
    value /= 3;
}
To build value from element, you would just do the reverse, something like this:
value = 0;
for (i = 19; i >= 0; i--)
    value = value * 3 + element[i];
There is a generic way to figure out how many bits you need:
If your data type has N different values, then you need log(N) / log(2) bits to store one value. For instance in your example, log(3) / log(2) equals 1.585 bits.
Of course, in reality you will want to pack a fixed number of values into an integer number of bits, so you have to multiply this 1.585 by that number and round up. For instance, if you pack 5 of them:
1.585 × 5 = 7.925, meaning that 5 of your values just fit in one 8-bit char.
The way to unpack the values has been shown in JS1's answer. The generic formula for unpacking is element[i] = (value / N^i) mod N, where ^ denotes exponentiation.
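As a concrete sketch of that formula, applied to the simplified problem from the edit (five base-3 values per byte, since 3^5 = 243 <= 256; the helper names are made up):

#include <cstdint>

// Pack 5 values in {0,1,2} into one byte: b = v[0] + 3*v[1] + 9*v[2] + 27*v[3] + 81*v[4].
// The maximum result is 2 * (1 + 3 + 9 + 27 + 81) = 242, which fits in a byte.
std::uint8_t pack5(const std::uint8_t v[5])
{
    std::uint8_t b = 0;
    for (int i = 4; i >= 0; --i)
        b = b * 3 + v[i];
    return b;
}

// Unpack one byte back into 5 values: v[i] = (b / 3^i) % 3.
void unpack5(std::uint8_t b, std::uint8_t v[5])
{
    for (int i = 0; i < 5; ++i) {
        v[i] = b % 3;
        b /= 3;
    }
}

Twenty elements then occupy 4 such bytes, matching the 4-char figure from the question.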
Final note: this is only meaningful if you really need to optimize memory usage. For comparison, here are some popular ways people pack these value types. Most of the time the extra space taken up is not a problem.
an array of bool: uses 8 bits to store one bool. And a lot of people really dislike the behavior of std::vector<bool>.
enum Bla { BLA_A, BLA_B, BLA_C}; an array or vector of Bla probably uses 32 bits per element (sizeof(Bla) == sizeof(int)).
So I am running through "OpenCV 2 Computer Vision Application Programming Cookbook" by Robert Laganiere. Around page 42 it talks about an image reduction algorithm. I understand the algorithm (I think), but I do not understand exactly why one part was put in. I think I know why, but if I am wrong I would like to be corrected. I am going to copy and paste a little bit of it in here:
"Color images are composed of 3-channel pixels. Each of these channels
corresponds to the intensity value of one of the three primary colors
(red, green, blue). Since each of these values is an 8-bit unsigned
char, the total number of colors is 256x256x256, which is more than 16
million colors. Consequently, to reduce the complexity of an analysis,
it is sometimes useful to reduce the number of colors in an image. One
simple way to achieve this goal is to simply subdivide the RGB space
into cubes of equal sizes. For example, if you reduce the number of
colors in each dimension by 8, then you would obtain a total of
32x32x32 colors. Each color in the original image is then assigned a
new color value in the color-reduced image that corresponds to the
value in the center of the cube to which it belongs. Therefore, the
basic color reduction algorithm is simple. If N is the reduction
factor, then for each pixel in the image and for each channel of this
pixel, divide the value by N (integer division, therefore the remainder
is lost). Then multiply the result by N, this will give you the
multiple of N just below the input pixel value. Just add N/2 and you
obtain the central position of the interval between two adjacent
multiples of N. If you repeat this process for each 8-bit channel
value, then you will obtain a total of 256/N x 256/N x 256/N possible
color values. How to do it... The signature of our color reduction
function will be as follows: void colorReduce(cv::Mat &image, int
div=64); The user provides an image and the per-channel reduction
factor. Here, the processing is done in-place, that is the pixel
values of the input image are modified by the function. See the
There's more... section of this recipe for a more general function
signature with input and output arguments. The processing is simply
done by creating a double loop that goes over all pixel values: "
void colorReduce(cv::Mat &image, int div=64) {
    int nl = image.rows;                    // number of lines
    int nc = image.cols * image.channels(); // total number of elements per line
    for (int j = 0; j < nl; j++) {
        // get the address of row j
        uchar* data = image.ptr<uchar>(j);
        for (int i = 0; i < nc; i++) {
            // process each pixel ------------------
            data[i] = data[i] / div * div + div / 2; // <- HERE IS WHERE I NEED UNDERSTANDING!!!
            // end of pixel processing -------------
        }
    }
}
So I get how I am reducing the 0:255 pixel value by the div amount. I then lose whatever remainder was left. Then by multiplying it by div again we scale it back up to keep it in the range of 0:255. Why are we then adding (div/2) back into the answer? The only reason I can think of is that this causes some values to be rounded down and some rounded up. If you don't use it then all your values are rounded down. So in a way it is giving a "better" average?
Don't know, so what do you guys/girls think?
The easiest way to illustrate this is using an example.
For simplicity, let's say we are processing a single channel of an image. There are 256 distinct colors, ranging from 0 to 255. We are also going to use N=64 in our example.
Using these numbers, we will reduce the number of colors from 256 to 256/64 = 4. Let's draw a graph of our color space:
|......|......|......|......|
0      63     127    191    255
The dotted line represents our colorspace, going from 0 to 255. We have split this interval into 4 parts, and the splits are represented by the vertical lines.
In order to reduce all 256 colors to 4 colors, we are going to divide each color by 64 (losing the remainder), and then multiply it by 64 again. Let's see how this goes:
[0 , 63 ] / 64 * 64 = 0
[64 , 127] / 64 * 64 = 64
[128, 191] / 64 * 64 = 128
[192, 255] / 64 * 64 = 192
As you can see, all the colors from the first part became 0, all the colors from the second part became 64, third part 128, fourth part 192. So our color space looks like this:
|......|......|......|......|
0      63     127    191    255
|______/|_____/|_____/|_____/
|       |      |      |
0       64     128    192
But this is not very useful. You can see that all our colors are slanted to the left of the intervals. It would be more helpful if they were in the middle of the intervals. And that's why we add 64/2 = 32 to the values. Adding half of the interval length shifts the colors to the center of the intervals. That's also what it says in the book: "Just add N/2 and you obtain the central position of the interval between two adjacent multiples of N."
So let's add 32 to our values and see how everything looks:
[0 , 63 ] / 64 * 64 + 32 = 32
[64 , 127] / 64 * 64 + 32 = 96
[128, 191] / 64 * 64 + 32 = 160
[192, 255] / 64 * 64 + 32 = 224
And the interval looks like this:
|......|......|......|......|
0 63 127 191 255
\______/\_____/\_____/\_____/
| | | |
32 96 160 224
This is a much better color reduction. The algorithm reduced our colorspace from 256 to 4 colors, and those colors are in the middle of the intervals that they reduce.
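A tiny hypothetical test that reproduces the tables above by printing the reduced value at each interval boundary:

#include <iostream>

int main()
{
    const int div = 64;
    for (int v : { 0, 63, 64, 127, 128, 191, 192, 255 })
        std::cout << v << " -> " << (v / div * div + div / 2) << "\n";
    // prints: 0 -> 32, 63 -> 32, 64 -> 96, 127 -> 96,
    //         128 -> 160, 191 -> 160, 192 -> 224, 255 -> 224
    return 0;
}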
It is done to map each value to the middle of its quantization interval, rather than to the floor of the interval.
For example, for N = 32, all data from 0 to 31 will give 16 instead of 0.
I'm currently studying in my 3rd year of university, preparing for my Computer Systems and Concurrency exam, and I'm confused about a past paper question. Nobody, not even the lecturer, has answered my question.
Question:
Consider the following GPU that consists of 8 multiprocessors clocked at 1.5 GHz, each of which contains 8 multithreaded single-precision floating-point units and integer processing units. It has a memory system that consists of 8 partitions of 1GHz Graphics DDR3DRAM, each 8 bytes wide and with 256 MB of capacity. Making reasonable assumptions (state them), and a naive matrix multiplication algorithm, compute how much time the computation C = A * B would take. A, B, and C are n * n matrices and n is determined by the amount of memory the system has.
Answer given in solutions:
> Assuming it has a single-precision FP multiply-add instruction,
> Single-precision FP multiply-add performance =
> #MPs * #SP/MP * #FLOPs/instr/SP * #instr/clock * #clocks/sec =
> 8 * 8 * 2 * 1 * 1.5 G = 192 GFlops / second
> Total DDR3 RAM memory size = 8 * 256 MB = 2048 MB
> The peak DDR3 bandwidth = #Partitions * #bytes/transfer * #transfers/clock * #clocks/sec = 8 * 8 * 2 * 1 G = 128 GB/sec
> Modern computers have 32-bit single precision. So, if we want 3 n*n SP matrices, the maximum n is given by
> 3n^2 * 4 <= 2048 * 1024 * 1024
> nmax = n = 13377
> The number of operations that a naive mm algorithm (triply nested loop) needs is calculated as follows:
> For each element of the result, we need n multiply-adds. For each row of the result, we need n * n multiply-adds. For the entire result matrix, we need n * n * n multiply-adds. Thus, approximately 2393 GFlops.
> Assuming no cache, we have loading of 2 matrices and storing of 1 to the graphics memory.
> That is 3 * n^2 = 512 GB of data. This process will take 512 / 128 = 4 seconds.
> Also, the processing will take 2393 / 192 = 12.46 seconds. Thus the entire matrix multiplication will take 16.46 seconds.
Now my question is: how does the calculation of 3 * (13377^2) = 536,832,387 translate to 512 GB?
That is 536.8 million values, each 4 bytes long. The memory interface is 8 bytes wide; assuming the GPU cannot fetch 2 values and split them, that effectively doubles the size of the reads and writes. Therefore the 2 GB of memory used is effectively read/written twice (because 8 bytes are read and 4 ignored), so only 4 GB of data is passed between the RAM and the GPU.
Can someone please tell me where I am going wrong? The only explanation I can think of is that the 536.8 million result is the number of memory operations in KB, which is not stated anywhere.