ZLIB returns different byte values after compression and decompression

I have a 3D ndarray that I want to compress and then decompress using zlib.
I use this part of the code to compress.
begin = 0
blockSize = 1024
compressImage = bytes('', 'utf-8')
while begin < len(nii_img1_data):
    compressImage = compressImage + compressImageObject.compress(nii_img1_data[begin:begin+blockSize])
    begin += blockSize
compressImage = compressImage + compressImageObject.flush()
f = open('compressed.dat', 'wb')
and my decompression part is
decompressedImageObject = zlib.decompressobj(wbits=+15)
my_file = open('compressed.dat', 'rb').read()
decompressedImage = zlib.decompress(my_file, bufsize=blockSize)
decompressedImage += decompressedImageObject.flush()
decompressedImage = np.frombuffer(decompressedImage, dtype=np.int8)
Before Compression the original image has a shape (90, 104, 72).
After decompression, the result is 5391360 bytes, which is larger than 90 × 104 × 72 = 673920 bytes.
I converted the decompressed into ndarray yielding a 1d array using
decompressedImage = np.frombuffer(decompressedImage, dtype=np.int8)
and trying to convert to the same shape of original image
decompressedImage = np.reshape(decompressedImage, newshape=(-1,104,72))
returns an array of shape = (720,104,72)
What am I doing wrong? How do I fetch the original image?

I can't reproduce the behaviour you are describing. However, I can observe some problems.
while begin < len(nii_img1_data):
    compressImage = compressImage + compressImageObject.compress(nii_img1_data[begin:begin+blockSize])
    begin += blockSize
You indicate that nii_img1_data is an ndarray with dimensions (90,104,72). However, this code appears to assume it's a bytes object or something very similar. len(nii_img1_data) will return the size of the first dimension, 90. Slicing it with nii_img1_data[0:1024] will return the whole thing because the first dimension has size smaller than 1024. Feeding the slice to zlib.Compress.compress accesses the array's underlying buffer, the size of which will depend on exactly how you acquired the array, and may contain more data than the array itself. I'm guessing in your case nii_img1_data is a slice of a larger array, since that might explain what happens next. Alternatively perhaps it has a dtype other than np.int8?
Once you've fed this buffer to zlib, it appears that the loop runs only once because len(nii_img1_data) is only 90. When you decompress, you rebuild the data that was in the buffer, it's just not the data that you expected.
I suggest you make sure you convert your array to bytes or a buffer first and that even before you try to compress or decompress it you verify that you are able to convert those bytes back into the ndarray that you expect. Only once you're satisfied should you then add compression into the mix.
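A minimal sketch of that advice, under stated assumptions: `original` is a hypothetical stand-in for nii_img1_data, and float64 is only a guess at its dtype (it would be consistent with 5391360 = 8 × 673920 bytes). The array is converted to bytes explicitly, compressed in blocks, and rebuilt with the same dtype and shape.

```python
import zlib
import numpy as np

# Hypothetical stand-in for nii_img1_data; the float64 dtype is an assumption.
shape = (90, 104, 72)
original = np.arange(np.prod(shape), dtype=np.float64).reshape(shape)

# Convert to bytes explicitly so that len() and slicing work on bytes, not rows.
raw = original.tobytes()

# Compress block by block, as in the question.
block_size = 1024
compressor = zlib.compressobj()
chunks = []
for begin in range(0, len(raw), block_size):
    chunks.append(compressor.compress(raw[begin:begin + block_size]))
chunks.append(compressor.flush())
compressed = b"".join(chunks)

# Decompress and rebuild the array with the ORIGINAL dtype and shape.
restored_bytes = zlib.decompress(compressed)
restored = np.frombuffer(restored_bytes, dtype=original.dtype).reshape(original.shape)

print(len(raw), restored.shape, np.array_equal(restored, original))
```

Reading the decompressed bytes with dtype=np.int8 instead would give eight times as many elements, which matches the (720, 104, 72) shape in the question if the data really is an 8-byte type.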


bit shift operation in parallel prefix sum

The code below, from OpenGL SuperBible 10, computes a prefix sum in parallel.
The shader shown has a local workgroup size of 1024, which means it will process arrays of 2048 elements, as each invocation computes two elements of the output array. The shared variable shared_data is used to store the data that is in flight. When execution starts, the shader loads two adjacent elements from the input arrays into the array. Next, it executes the barrier() function. This step ensures that all of the shader invocations have loaded their data into the shared array before the inner loop begins.
#version 450 core
layout (local_size_x = 1024) in;
layout (binding = 0) coherent buffer block1
{
    float input_data[gl_WorkGroupSize.x];
};
layout (binding = 1) coherent buffer block2
{
    float output_data[gl_WorkGroupSize.x];
};
shared float shared_data[gl_WorkGroupSize.x * 2];
void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;
    // The number of steps is the log base 2 of the
    // work group size, which should be a power of 2
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;
    // Each invocation is responsible for the content of
    // two elements of the output array
    shared_data[id * 2] = input_data[id * 2];
    shared_data[id * 2 + 1] = input_data[id * 2 + 1];
    // Synchronize to make sure that everyone has initialized
    // their elements of shared_data[] with data loaded from
    // the input arrays
    barrier();
    memoryBarrierShared();
    // For each step...
    for (step = 0; step < steps; step++)
    {
        // Calculate the read and write index in the
        // shared array
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);
        // Accumulate the read data into our element
        shared_data[wr_id] += shared_data[rd_id];
        // Synchronize again to make sure that everyone
        // has caught up with us
        barrier();
        memoryBarrierShared();
    }
    // Finally write our data back to the output image
    output_data[id * 2] = shared_data[id * 2];
    output_data[id * 2 + 1] = shared_data[id * 2 + 1];
}
How can I understand the bit shift operations for rd_id and wr_id intuitively? Why do they work?
When we say something is "intuitive" we usually mean that our understanding is deep enough that we are not aware of our own thought processes, and "know the answer" without consciously thinking about it. Here the author is using the binary representation of integers within a CPU/GPU to make the code shorter and (probably) slightly faster. The code will only be "intuitive" for someone who is very familiar with such encodings and binary operations on integers. I'm not, so had to think about what is going on.
I would recommend working through this code, since these kinds of operations do occur in high-performance graphics and other programming. If you find it interesting, it will eventually become intuitive. If not, that's OK as long as you can figure things out when necessary.
One approach is to just copy this code into a C/C++ program and print out the mask, rd_id, wr_id, etc. You wouldn't actually need the data arrays, or the calls to barrier() and memoryBarrierShared(). Make up values for invocation ID and workgroup size based on what the SuperBible example does. That might be enough for "Aha! I see."
If you aren't familiar with the << and >> shifts, I suggest writing some tiny programs and printing out the numbers that result. Python might actually be slightly easier, since its built-in binary formatting will show you the actual bits, whereas in C you can only print in hex.
To get you started, log2 of a power of two returns the exponent, which is one less than the number of bits needed to represent that integer. log2(256) will be 8, log2(4096) 12, etc. (Don't take my word for it, write some code.)
x << n multiplies x by 2 to the power n, so x << 1 is x * 2, x << 2 is x * 4, and so on. x >> n divides by 2 to the power n instead (x >> 1 is x / 2, x >> 2 is x / 4), discarding the remainder.
(Very important: only for non-negative integers! Again, write some code to find out what happens.)
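As a starting point for that kind of experiment, here is a small Python sketch (the value 13 is arbitrary). It prints the bits before and after shifting and checks the multiply/divide equivalence for a non-negative value. It also shows one negative case: Python's >> rounds toward minus infinity, which differs from truncating division.

```python
x = 13

# bin() shows the actual bits of an integer.
print(bin(x))       # bits of 13
print(bin(x << 2))  # the same bits moved two places to the left
print(bin(x >> 2))  # the two lowest bits are dropped

# For non-negative integers, shifting matches multiplying or dividing by 2**n.
shift_left_matches = all((x << n) == x * 2**n for n in range(8))
shift_right_matches = all((x >> n) == x // 2**n for n in range(8))
print(shift_left_matches, shift_right_matches)

# Negative values: Python's >> floors, which is not the same as truncation.
negative_shifted = -13 >> 1        # floor(-6.5)
negative_truncated = int(-13 / 2)  # truncates toward zero
print(negative_shifted, negative_truncated)
```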
The mask calculation is interesting. Try
mask = (1 << step);
first and see what values come out. This is a common pattern for selecting an individual bit. The extra -1 instead generates all the bits to the right.
ANDing (the & operator) with a mask that has zeroes on the left and ones on the right is a faster way to compute an integer modulo a power of 2.
Finally rd_id and wr_id array indexes need to start from base positions in the array, from the invocation ID and workgroup size, and increment according to the pattern explained in the Super Bible text.
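The "print it out" approach can be sketched in Python. This is a sequential simulation of the shader's index arithmetic, not the shader itself; the function names and the workgroup size of 4 are assumptions chosen to keep the output short. One way to read the indices: at each step the array is split into blocks of 2^(step+1) elements, (id >> step) << (step + 1) is the start of this invocation's block, adding mask lands on the last element of the block's left half, and id & mask selects an element in the right half.

```python
def scan_indices(invocation_id, step):
    """Reproduce the shader's mask, rd_id and wr_id for one invocation and step."""
    mask = (1 << step) - 1
    rd_id = ((invocation_id >> step) << (step + 1)) + mask
    wr_id = rd_id + 1 + (invocation_id & mask)
    return mask, rd_id, wr_id


def simulate_prefix_sum(data):
    """Run every step in order; each invocation handles two elements of data."""
    workgroup_size = len(data) // 2          # assumed to be a power of 2
    steps = workgroup_size.bit_length()      # equals log2(size) + 1 here
    shared = list(data)
    for step in range(steps):
        # Within one step, reads hit left halves and writes hit right halves,
        # so running the invocations one after another gives the same result.
        for invocation_id in range(workgroup_size):
            _, rd_id, wr_id = scan_indices(invocation_id, step)
            shared[wr_id] += shared[rd_id]
    return shared


for step in range(3):
    for invocation_id in range(4):
        mask, rd_id, wr_id = scan_indices(invocation_id, step)
        print(f"step={step} id={invocation_id} mask={mask:03b} rd_id={rd_id} wr_id={wr_id}")

print(simulate_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

The final list is the inclusive prefix sum of the input: after step 0 every pair is summed, after step 1 every group of four, and so on.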

How to set the frequency band of my array after fft

How can I assign a frequency to each element of the array I get from KissFFT? The sampling frequency is 44100 Hz and I need to apply it to my array realPartFFT. I have no idea how this works. I need to plot my spectrum chart to see whether it is computed correctly. When I plot it now, the x axis only shows the 513 indices, not the corresponding frequencies.
int windowCount = 1024;
float floatArray[windowCount], realPartFFT[(windowCount / 2) + 1];
kiss_fftr_cfg cfg = kiss_fftr_alloc(windowCount, 0, NULL, NULL);
kiss_fft_cpx cpx[(windowCount / 2) + 1];
kiss_fftr(cfg, floatArray, cpx);
for (int i = 0; i < (windowCount / 2) + 1; ++i)
    realPartFFT[i] = sqrtf(powf(cpx[i].r, 2.0) + powf(cpx[i].i, 2.0));
First of all: KissFFT doesn't know anything about the source of the data. You pass it an array of real numbers of a given size N, and you get in return an array of complex values of size N/2+1. The input array may be the weather forecast for the next N hours or the number of sunspots over the past N days. KissFFT doesn't care.
The mapping back to the real world needs to be done by you, so you have to interpret the data. In your code snippet, you are passing 1024 floats (I assume that floatArray contains the input data). You then get back an array of 513 (=1024/2+1) pairs of floats.
If you are sampling at 44.1 kHz and pass KissFFT chunks of 1024 samples (your window size), the highest frequency you get is 22.05 kHz and the lowest non-zero frequency is about 43 Hz (44,100 / 1024). That value is also the spacing between bins, so element i of your array corresponds to the frequency i × 44,100 / 1024 Hz. You can get even lower by passing bigger chunks to KissFFT, but keep in mind that processing time will grow as well (roughly as N log N for an FFT).
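The index-to-frequency mapping can be sketched in a few lines of Python; bin_frequencies is a hypothetical helper name, not part of KissFFT. The resulting list can be used as the x axis when plotting realPartFFT.

```python
def bin_frequencies(sample_rate, window_size):
    """Frequency in Hz for each of the window_size/2 + 1 bins of a real FFT."""
    return [i * sample_rate / window_size for i in range(window_size // 2 + 1)]


freqs = bin_frequencies(44100, 1024)
print(len(freqs))   # number of bins
print(freqs[1])     # spacing between bins, in Hz
print(freqs[-1])    # highest frequency (half the sample rate), in Hz
```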
Btw: You may consider making your windowCount variable const, to allow the compiler to do some optimizations. Optimizations are very valuable when doing number crunching. In this case the effect may be negligible, but it's a good starting point.

Fast data structure or algorithm to find mean of each pixel in a stack of images

I have a stack of images in which I want to calculate the mean of each pixel down the stack.
For example, let (x_n,y_n) be the (x,y) pixel in the nth image. Thus, the mean of pixel (x,y) for three images in the image stack is:
mean-of-(x,y) = (1/3) * ((x_1,y_1) + (x_2,y_2) + (x_3,y_3))
My first thought was to load all pixel intensities from each image into a data structure with a single linear buffer like so:
|All pixels from image 1| All pixels from image 2| All pixels from image 3|
To find the sum of a pixel down the image stack, I perform a series of nested for loops like so:
for(int col=0; col<img_cols; col++)
    for(int row=0; row<img_rows; row++)
        for(int img=0; img<num_of_images; img++)
            sum_of_px += px_buffer[(img*img_rows*img_cols)+col*img_rows+row];
Basically img*img_rows*img_cols gives the buffer element of the first pixel in the nth image and col*img_rows+row gives the (x,y) pixel that I want to find for each n image in the stack.
Is there a data structure or algorithm that will help me sum up pixel intensities down an image stack that is faster and more organized than my current implementation?
I am aiming for portability so I will not be using OpenCV and am using C++ on linux.
The problem with the nested loop in the question is that it's not very cache friendly. You go skipping through memory with a long stride, effectively rendering your data cache useless. You're going to spend a lot of time just accessing the memory.
If you can spare the memory, you can create an extra image-sized buffer to accumulate totals for each pixel as you walk through all the pixels in all the images in memory order. Then you do a single pass through the buffer for the division.
Your accumulation buffer may need to use a larger type than you use for individual pixel values, since it has to accumulate many of them. If your pixel values are, say, 8-bit integers, then your accumulation buffer might need 32-bit integers or floats.
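A minimal sketch of the accumulation-buffer idea, written in Python for brevity (the same loop structure carries over to C++). The function name and the tiny test buffer are made up for illustration; px_buffer uses the question's layout of all pixels of image 1, then image 2, and so on.

```python
def pixel_means(px_buffer, num_of_images, pixels_per_image):
    """Mean of each pixel down the stack, walking px_buffer in memory order."""
    # One accumulator per pixel; Python ints do not overflow, but in C++ this
    # buffer would need a wider type than the pixel type.
    totals = [0] * pixels_per_image
    for img in range(num_of_images):
        base = img * pixels_per_image
        for p in range(pixels_per_image):
            totals[p] += px_buffer[base + p]
    # Single pass over the accumulation buffer for the division.
    return [total / num_of_images for total in totals]


# Three hypothetical 2x2 images stored back to back.
stack = [1, 2, 3, 4,
         3, 4, 5, 6,
         5, 6, 7, 8]
means = pixel_means(stack, num_of_images=3, pixels_per_image=4)
print(means)
```

The inner loop touches consecutive addresses in both px_buffer and totals, which is the cache-friendly access pattern described above.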
Usually, the stack of pixels at one position (x, y) is conditionally independent of the stack at any other position.
And even if they weren't (assuming a particular dataset), modeling their interactions is a complex task and will give you only an estimate of the mean. So, if you want to compute the exact mean for each stack, you don't have any other choice but to iterate through the three loops that you supply. Languages such as Matlab/Octave and libraries such as Theano (Python) or Torch7 (Lua) all parallelize these iterations. If you are using C++, what you do is well suited for CUDA or OpenMP. As for portability, I think OpenMP is the easier solution.
A portable, fast data structure specifically for the average calculation could be:
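// Size the nested vectors up front: [img_cols][img_rows][num_of_images]
std::vector<std::vector<std::vector<sometype> > > VoVoV(
    img_cols, std::vector<std::vector<sometype> >(
        img_rows, std::vector<sometype>(num_of_images)));
for (int col = 0; col < img_cols; ++col)
    for (int row = 0; row < img_rows; ++row)
        for (int img = 0; img < num_of_images; ++img)
            // The values of all images at this pixel are stored contiguously,
            // therefore should be fast to access.
            VoVoV[col][row][img] = foo;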
As a side note, 1/3 in your example will evaluate to 0 which is not what you want.
For fast summation/averaging you can now do:
sometype sum = 0;
std::vector<sometype>::iterator it = VoVoV[col][row].begin();
std::vector<sometype>::iterator it_end = VoVoV[col][row].end();
for ( ; it != it_end ; ++it)
sum += *it;
sometype avg = sum / num_of_images; // or similar for integers; check for num_of_images==0
Basically, you should not rely on the compiler to optimize away the repeated calculation of the same offsets.

Using suffix array algorithm for Burrows Wheeler transform

I've successfully implemented a BWT stage (using regular string sorting) for a compression testbed I'm writing. I can apply the BWT and then the inverse BWT, and the output matches the input. Now I want to speed up creation of the BW index table using suffix arrays. I have found two relatively simple, supposedly fast O(n) algorithms for suffix array creation, DC3 and SA-IS, which both come with C++/C source code. I tried using the sources (out-of-the-box compiling SA-IS source can also be found here), but failed to get a proper suffix array / BWT index table out. Here's what I've done:
T=input data, SA=output suffix array, n=size of T, K=alphabet size, BWT=BWT index table
I work on 8-bit bytes, but both algorithms need a unique sentinel / EOF marker in form of a zero byte (DC3 needs 3, SA-IS needs one), thus I convert all my input data to 32-bit integers, increase all symbols by 1 and append the sentinel zero bytes. This is T.
I create an integer output array SA (of size n for DC3, n+1 for SA-IS) and apply the algorithms. I get results similar to my sorting BWT transform, but some values are odd (see UPDATE 1). Also the results of both algorithms differ slightly. The SA-IS algorithm produces an excess index value at the front, so all results need to be copied left by one index (SA[i]=SA[i+1]).
To convert the suffix array to the proper BWT indices, I subtract 1 from the suffix array values, do a modulo and should have the BWT indices (according to this): BWT[i]=(SA[i]-1)%n.
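The steps above can be sketched in Python for a string with a unique sentinel. This is only an illustration under assumptions: suffix_array is a naive sorting stand-in for DC3 or SA-IS, and the function names are made up. Symbols are shifted up by one and a single 0 is appended, then BWT[i] = T[(SA[i]-1) % n] is applied.

```python
def suffix_array(t):
    """Naive suffix array by sorting; a slow stand-in for DC3 / SA-IS."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt_with_sentinel(data):
    """Return (suffix array, BWT symbols) for data plus a unique 0 sentinel."""
    # Shift every byte up by one so that 0 is free to act as the sentinel.
    t = [b + 1 for b in data] + [0]
    n = len(t)
    sa = suffix_array(t)
    # The BWT symbol is the one just before each sorted suffix (wrapping around).
    return sa, [t[(i - 1) % n] for i in sa]


sa, symbols = bwt_with_sentinel(b"banana")
# Undo the +1 shift for display and show the sentinel as '$'.
readable = "".join("$" if s == 0 else chr(s - 1) for s in symbols)
print(sa)
print(readable)
```

Note that this produces the BWT of the input plus its sentinel, which is not necessarily the same as sorting the rotations of the input without a sentinel.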
This is my code to feed the SA algorithms and convert to BWT. You should be able to more or less just plug in the SA construction code from the papers:
std::vector<int32_t> SuffixArray::generate(const std::vector<uint8_t> & data)
{
    std::vector<int32_t> SA;
    if (data.size() >= 2)
    {
        //----- variant for DC3 -----
        //copy data over. we need to append 3 zero bytes,
        //as the algorithm expects T[n]=T[n+1]=T[n+2]=0
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 3, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 3), [](int32_t & n){ n++; });
        SA.resize(data.size()); //output array of size n for DC3
        SA_DC3(T.data(), SA.data(), data.size(), 256);

        //----- variant for SA-IS (used instead of the DC3 block above) -----
        //copy data over. we need to append a zero byte,
        //as the algorithm expects T[n-1]=0 (where n is the size of its input data)
        //also increase the symbol value by 1, because the algorithm alphabet is [1,K]
        //(0 is used as an EOF marker)
        std::vector<int32_t> T(data.size() + 1, 0);
        std::copy(data.cbegin(), data.cend(), T.begin());
        std::for_each(T.begin(), std::prev(T.end(), 1), [](int32_t & n){ n++; });
        SA.resize(data.size() + 1); //crashes if not one extra byte at the end
        SA_IS((unsigned char *)T.data(), SA.data(), data.size() + 1, 256, 4); //algorithm expects size including sentinel
        std::rotate(SA.begin(), std::next(SA.begin()), SA.end()); //rotate left by one to get same result as DC3
    }
    return SA;
}

void SuffixArray::toBWT(std::vector<int32_t> & SA)
{
    std::for_each(SA.begin(), SA.end(), [SA](int32_t & n){ n = ((n - 1) < 0) ? (n + SA.size() - 1) : (n - 1); });
}
What am I doing wrong?
When applying the algorithms to short amounts of test text data like "yabbadabbado" / "this is a test." / "abaaba" or a big text file (alice29.txt from the Canterbury corpus) they work fine. Actually the toBWT() function isn't even necessary.
When applying the algorithms to binary data from a file containing the full 8-bit byte alphabet (executable etc.), they don't seem to work correctly. Comparing the results of the algorithms to that of the regular BWT indices, I notice erroneous indices (4 in my case) at the front. The number of indices (incidentally?) corresponds to the recursion depth of the algorithms. The indices point to where the original source data had the last occurrences of 0s (before I converted them to 1s when building T)...
There are more differing values when I binary compare the regular BWT array and the suffix array. This might be expected, as afair sorting must not necessarily be the same as with a standard sort, BUT the resulting data transformed by the arrays should be the same. It is not.
I tried modifying a simple input string until both algorithms "failed". After changing two bytes of the string "this is a test." to 255 or 0 (from 74686973206973206120746573742Eh to e.g. 746869732069732061FF74657374FFh, the last byte has to be changed!) the indices and transformed string are not correct anymore. It also seems to be enough to change the last character of the string to a character already occurring in the string, e.g. "this is a tests" 746869732069732061207465737473h. Then two indices and two characters of the transformed strings will be swapped (comparing regular sorting BWT and BWT that uses SAs).
I find the whole process of having to convert the data to 32-bit a bit awkward. If somebody has a better solution (paper, better yet, some source code) to generate a suffix array DIRECTLY from a string with a 256-char alphabet, I'd be happy.
I have now figured this out. My solution was two-fold. Some people suggested using a library, which I did (SAIS-lite by Yuta Mori).
The real solution was to duplicate and concatenate the input string and run the SA-generation on this string. When saving the output string you need to filter out all SA indices above the original data size. This is not an ideal solution, because you need to allocate twice as much memory, copy twice and do the transform on the double amount of data, but it is still 50-70% faster than std::sort. If you have a better solution, I'd love to hear it.
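The doubling trick can be checked on small inputs with a Python sketch. Again, suffix_array is a naive stand-in for a real SA construction such as SAIS-lite, and the function names are made up. The sketch compares a BWT built by sorting rotations with one built from the suffix array of the doubled input, keeping only indices below the original size.

```python
def suffix_array(t):
    """Naive suffix array by sorting; a slow stand-in for SA-IS / DC3."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt_by_rotations(t):
    """Reference BWT: sort all rotations and take the last byte of each."""
    n = len(t)
    order = sorted(range(n), key=lambda i: t[i:] + t[:i])
    return bytes(t[(i - 1) % n] for i in order)


def bwt_by_doubled_suffix_array(t):
    """BWT from the suffix array of t + t, keeping only indices below len(t)."""
    n = len(t)
    order = [i for i in suffix_array(t + t) if i < n]
    return bytes(t[(i - 1) % n] for i in order)


samples = [b"abaaba", b"this is a tests", b"this is a\xfftest\xff", bytes([0, 255, 0, 1, 255, 0])]
results = [(bwt_by_rotations(s), bwt_by_doubled_suffix_array(s)) for s in samples]
for reference, doubled in results:
    print(reference == doubled, doubled)
```

The reason this works is that the first n symbols of a suffix of the doubled string that starts below n are exactly the corresponding rotation, so no sentinel is needed and inputs such as "this is a tests" sort as rotations rather than as plain suffixes.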
You can find the updated code here.

Huffman's Data compression filltable and invert code problems

I just began learning about the Huffman data compression algorithm and I need help with the following functions: fillTable() and invertCodes().
I don't understand why a codeTable array is needed.
while (n>0){
    copy = copy * 10 + n %10;
    n /= 10;
}
Please help me understand what is going on in this part of the function, and why n is divided by ten while it is larger than 0, since it seems it will always be greater than 0 no matter how many times you divide it.
Link for code: http://www.programminglogic.com/implementing-huffman-coding-in-c/
void fillTable(int codeTable[], Node *tree, int Code){
    if (tree->letter<27)
        codeTable[(int)tree->letter] = Code;
    else{
        fillTable(codeTable, tree->left, Code*10+1);
        fillTable(codeTable, tree->right, Code*10+2);
    }
}
void invertCodes(int codeTable[],int codeTable2[]){
    int i, n, copy;
    for (i=0;i<27;i++){
        n = codeTable[i];
        copy = 0;
        while (n>0){
            copy = copy * 10 + n %10;
            n /= 10;
        }
        codeTable2[i] = copy;
    }
}
** edit **
To make this question clearer: I don't need an explanation of Huffman encoding and decoding, but I need an explanation of how these two functions work and why code tables are necessary.
n is an int. Therefore, it will reduce to 0 over time. If n starts at 302 at the first iteration, it will be reduced to 30 after the first n /= 10;. At the end of the second iteration of the while loop, it will be reduced to 3. At the end of the third iteration, it will equal 0 (int 3 / int 10 = int 0).
It is integer math. No decimal bits to extend to infinity.
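The same loop can be traced in a short Python sketch, using // for integer division. The function names are made up for illustration. It records every value n takes and returns the digit-reversed copy, which is what the while loop in invertCodes computes.

```python
def trace_division(n):
    """List the values n takes as it is repeatedly integer-divided by 10."""
    values = [n]
    while n > 0:
        n //= 10          # integer division: the last decimal digit is dropped
        values.append(n)
    return values


def invert_code(n):
    """Reverse the decimal digits of n, mirroring the loop in invertCodes."""
    copy = 0
    while n > 0:
        copy = copy * 10 + n % 10   # append the last digit of n to copy
        n //= 10                    # then remove that digit from n
    return copy


print(trace_division(302))
print(invert_code(112))
```

In fillTable each code is built from the digits 1 (left) and 2 (right), so a code never contains a 0 digit and reversing it loses nothing.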
I made a minor update to the example program to include an end of data code. The original example code may append an extra letter to the end of the original data when decompressing. Also there's a lot of stuff "hard coded" in this code, such as the number of codes, which was 27, and which I changed to 28 to include the end of data code that I added, and also the output file names which I changed to "compress.bin" (if compressing) or "output.txt" (if decompressing). It's not an optimal implementation, but it's ok to use as a learning example. It would help if you follow the code with a source level debugger.
A more realistic Huffman program would use tables to do the encode and decode. The encode table is indexed with the input code, and each table entry contains two values: the number of bits in the code, and the code itself. The decode table is indexed with a code composed of the minimum number of bits from the input stream required to determine the code (it's at least 9 bits, but may need to be 10 bits), and each entry in that table contains two values: the actual number of bits, and the character (or end of data) represented by that code. Since the actual number of bits may be less than the number of bits used to determine the code, the leftover bits will need to be buffered and used before reading more data from the compressed file.
One variation of a Huffman like process is to have the length of the code determined by the leading bits of each code, to reduce the size of the decode table.