Shift an array to the right with 1


I'm trying to shift an array of unsigned char to the right with some binary 1.
Example: 0000 0000 | 0000 1111 that I shift 8 times will give me 0000 1111 | 1111 1111 (left shift in binary)
So in my array I will get: {0x0F, 0x00, 0x00, 0x00} => {0xFF, 0x0F, 0x00, 0x00} (right shift in the array)
I currently have this using the function memmove:
unsigned char dataBuffer[] = {0x0F, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
unsigned int shift = 4;
unsigned length = 8;
memmove(dataBuffer, dataBuffer - shift, length + shift);
for(int i = 0 ; i < 8 ; i++) printf("0x%X ", dataBuffer[i]);
Output: 0x0 0x0 0x0 0x0 0xF 0x0 0x0 0x0
Expected output: 0xFF 0x0 0x0 0x0 0x0 0x0 0x0 0x0
As you can see, I managed to shift my array only element by element and I don't know how to replace the 0 with 1. I guess that using memset could work but I can't use it correctly.
Thanks for your help!
EDIT: It's in order to fill a bitmap zone of an exFAT disk. When you write a cluster in a disk, you have to set the corresponding bit of the bitmap to 1 (first cluster is first bit, second cluster is second bit, ...).
A newly formatted drive will contain 0x0F in the first byte of the bitmap so the proposed example corresponds to my needs if I write 8 clusters, I'll need to shift the value 8 times and fill it with 1.
In the code, I write 4 clusters and need to shift the value by 4 bits, but it is shifted by 4 bytes.
Setting the question as solved, it isn't possible to do what I want. Instead of shifting the bits of an array, I need to shift each byte of the array separately.

Setting the question as solved, it isn't possible to do what I want. Instead of shifting the bits of an array, I need to edit each bit of the array separately.
Here's the code if it can help anyone else:
unsigned char dataBuffer[11] = {0x0F, 0x00, 0x00, 0x00, 0, 0, 0, 0};
unsigned int sizeCluster = 6;
unsigned int firstCluster = 4;
unsigned int bitIndex = firstCluster % 8;
unsigned int byteIndex = firstCluster / 8;
for(int i = 0 ; i < sizeCluster; i++){
dataBuffer[byteIndex] |= 1 << bitIndex;
//printf("%d ", bitIndex);
//printf("%d \n\r", byteIndex);
bitIndex++;
if(bitIndex % 8 == 0){
bitIndex = 0;
byteIndex++;
}
}
for(int i = 0 ; i < 10 ; i++) printf("0x%X ", dataBuffer[i]);
OUTPUT: 0xFF 0x3 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
sizeCluster is the number of clusters I want to add in the Bitmap
firstCluster is the first cluster where I can write my data (4 clusters are used: 0, 1, 2, and 3 so I start at 4).
bitIndex selects the bit to modify within the current byte => increments each time.
byteIndex selects the byte to modify in the array => increments each time bitIndex wraps around after bit 7.

In case you don't want to use C++ std::bitset for performance reasons, then your code can be rewritten like this:
#include <cstdio>
#include <cstdint>
// buffer definition
constexpr size_t clustersTotal = 83;
constexpr size_t clustersTotalBytes = (clustersTotal+7)>>3; //ceiling(n/8)
uint8_t clustersSet[clustersTotalBytes] = {0x07, 0};
// clusters 0, 1 and 2 are already set (for show)
// helper constant bit masks for faster bit setting
// could be extended to uint64_t and array of qwords on 64b architectures
// but I couldn't be bothered to write all masks by hand.
// also I wonder when these lookup tables would be large enough
// to disturb cache locality, so shifting in code would be faster.
const uint8_t bitmaskStarting[8] = {0xFF, 0xFE, 0xFC, 0xF8, 0xF0, 0xE0, 0xC0, 0x80};
const uint8_t bitmaskEnding[8] = {0x01, 0x03, 0x07, 0x0F, 0x1F, 0x3F, 0x7F, 0xFF};
constexpr uint8_t bitmaskFull = 0xFF;
// Input values
size_t firstCluster = 6;
size_t sizeCluster = 16;
// set bits (like "void setBits(size_t firstIndex, size_t count);" )
auto lastCluster = firstCluster + sizeCluster - 1;
printf("From cluster %zu, size %zu => last cluster is %zu\n",
firstCluster, sizeCluster, lastCluster);
if (0 == sizeCluster || clustersTotal <= lastCluster)
return 1; // Invalid input values
auto firstClusterByte = firstCluster>>3; // div 8
auto firstClusterBit = firstCluster&7; // remainder
auto lastClusterByte = lastCluster>>3;
auto lastClusterBit = lastCluster&7;
if (firstClusterByte < lastClusterByte) {
// Set the first byte of sequence (by mask from lookup table (LUT))
clustersSet[firstClusterByte] |= bitmaskStarting[firstClusterBit];
// Set bytes between first and last (simple 0xFF - all bits set)
while (++firstClusterByte < lastClusterByte)
clustersSet[firstClusterByte] = bitmaskFull;
// Set the last byte of sequence (by mask from ending LUT)
clustersSet[lastClusterByte] |= bitmaskEnding[lastClusterBit];
} else { //firstClusterByte == lastClusterByte special case
// Intersection of starting/ending LUT masks is set
clustersSet[firstClusterByte] |=
bitmaskStarting[firstClusterBit] & bitmaskEnding[lastClusterBit];
}
for(size_t i = 0 ; i < clustersTotalBytes; ++i)
printf("0x%X ", clustersSet[i]); // Your debug display of buffer
Unfortunately I didn't profile either version (yours vs. mine), so I have no idea how good the optimized compiler output is in each case. In the age of lame C compilers and 386-586 processors my version would have been much faster. With a modern compiler the LUT usage may be a bit counterproductive, but unless somebody proves me wrong with profiling results, I still think my version is much more efficient.
That said, since writing to a file system is probably involved ahead of this, setting bits will probably take about 0.1% of CPU time even with your variant; I/O waiting will be the major factor.
So I'm posting this more as an example of how things can be done in a different way.
Edit:
Also if you believe in the clib optimization, the:
// Set bytes between first and last (simple 0xFF - all bits set)
while (++firstClusterByte < lastClusterByte)
clustersSet[firstClusterByte] = bitmaskFull;
Can reuse clib memset magic:
//#include <cstring>
// Set bytes between first and last (simple 0xFF - all bits set)
if (++firstClusterByte < lastClusterByte)
memset(clustersSet + firstClusterByte, bitmaskFull, (lastClusterByte - firstClusterByte));

Related

What's the fastest way to perform an arbitrary 128/256/512 bit permutation using SIMD instructions?

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible.
For this I was looking into SIMD instructions. Does anyone know of a way to do this/a library that implements it? I'm using MSVC on Windows and GCC on Linux, and the host language is C or C++.
Thanks!
I'm given an arbitrary permutation and need to shuffle a large number of bit vectors/pairs of bit vectors/nibbles. I know how to do this for the bits within a 64 bit value, e.g. using a Benes network.
Or shuffling blocks of 8-bit and larger around on the wider SIMD registers, e.g. using Agner Fog's GPLed VectorClass library (https://www.agner.org/optimize/vectorclass.pdf) for a template metaprogramming function that builds shuffles out of AVX2 in-lane byte shuffles and/or larger-element lane-crossing shuffles, given the shuffle as template parameter.
A more granular subdivision for permutations - into 1, 2 or 4 bit blocks - seems to be hard to achieve across wide vectors, though.
I'm able to do pre-processing on the permutation, e.g. to extract bit masks, calculate indices as necessary e.g. for a Benes network, or whatever else - happy to do that in another high level language as well, so assume that the permutation is given in whatever format is most convenient to solve the problem; small-ish lookup tables included.
I would expect the code to be significantly faster than doing something like
// actually 1 bit per element, not byte. I want a 256-bit bit-shuffle
const uint8_t in[256] = get_some_vector(); // not a compile-time constant
const uint8_t perm[256] = ...; // compile-time constant
uint8_t out[256];
for (size_t i = 0; i < 256; i ++)
out[i] = in[perm[i]];
As I said, I have a solution for <= 64 bits (which would be 64 bits, 32 bit-pairs, and 16 nibbles). The problem is also solved for blocks of size 8, 16, 32 etc. on wider SIMD registers.
EDIT: to clarify, the permutation is a compile-time constant (but not just one particular one, I'll compile the program once per permutation given).

The AVX2 256 bit permutation case
I do not think it is possible to write an efficient generic SSE4/AVX2/AVX-512 algorithm
that works for all vector sizes (128, 256, 512 bits), and element granularities (bits,
bit pairs, nibbles, bytes). One problem is that many AVX2 instructions that exist
for, for example, byte size elements, do not exist for double word elements,
and vice versa.
Below the AVX2 256 bit permutation case is discussed.
It might be possible to recycle the ideas of this case for other cases.
The idea is to extract 32 (permuted) bits per step from input vector x.
In each step 32 bytes from permutation vector pos are read.
Bits 7..3 of these pos bytes determine which byte from x is needed.
The right byte is selected by an emulated 256 bits wide AVX2 lane crossing byte
shuffle coded here by Ermlg.
Bits 2..0 of the pos bytes determine which bit is sought.
With _mm256_movemask_epi8 the 32 bits are collected in one uint32_t.
This step is repeated 8 times, to get all the 256 permuted bits.
The code does not look very elegant. Nevertheless, I would be surprised
if a significantly faster, say two times faster, AVX2 method would exist.
/* gcc -O3 -m64 -Wall -mavx2 -march=skylake bitperm_avx2.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
static inline __m256i shuf_epi8_lc(__m256i value, __m256i shuffle);
int print_epi64(__m256i a);
uint32_t get_32_bits(__m256i x, __m256i pos){
__m256i pshufb_mask = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1, 0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1);
__m256i byte_pos = _mm256_srli_epi32(pos, 3); /* which byte within the 32 bytes */
byte_pos = _mm256_and_si256(byte_pos, _mm256_set1_epi8(0x1F)); /* mask off the unwanted bits */
__m256i bit_pos = _mm256_and_si256(pos, _mm256_set1_epi8(0x07)); /* which bit within the byte */
__m256i bit_pos_mask = _mm256_shuffle_epi8(pshufb_mask, bit_pos); /* get bit mask */
__m256i bytes_wanted = shuf_epi8_lc(x, byte_pos); /* get the right bytes */
__m256i bits_wanted = _mm256_and_si256(bit_pos_mask, bytes_wanted); /* apply the bit mask to get rid of the unwanted bits within the byte */
__m256i bits_x8 = _mm256_cmpeq_epi8(bits_wanted, bit_pos_mask); /* check if the bit is set */
return _mm256_movemask_epi8(bits_x8);
}
__m256i get_256_bits(__m256i x, uint8_t* pos){ /* glue the 32 bit results together */
uint64_t t0 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[0]));
uint64_t t1 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[32]));
uint64_t t2 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[64]));
uint64_t t3 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[96]));
uint64_t t4 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[128]));
uint64_t t5 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[160]));
uint64_t t6 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[192]));
uint64_t t7 = get_32_bits(x, _mm256_loadu_si256((__m256i*)&pos[224]));
uint64_t t10 = (t1<<32)|t0;
uint64_t t32 = (t3<<32)|t2;
uint64_t t54 = (t5<<32)|t4;
uint64_t t76 = (t7<<32)|t6;
return(_mm256_set_epi64x(t76, t54, t32, t10));
}
inline __m256i shuf_epi8_lc(__m256i value, __m256i shuffle){
/* Ermlg's lane crossing byte shuffle https://stackoverflow.com/a/30669632/2439725 */
const __m256i K0 = _mm256_setr_epi8(
0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0);
const __m256i K1 = _mm256_setr_epi8(
0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70);
return _mm256_or_si256(_mm256_shuffle_epi8(value, _mm256_add_epi8(shuffle, K0)),
_mm256_shuffle_epi8(_mm256_permute4x64_epi64(value, 0x4E), _mm256_add_epi8(shuffle, K1)));
}
int main(){
__m256i input = _mm256_set_epi16(0x1234,0x9876,0x7890,0xABCD, 0x3456,0x7654,0x0123,0x4567,
0x0123,0x4567,0x89AB,0xCDEF, 0xFEDC,0xBA98,0x7654,0x3210);
/* Example */
/* 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16 0 */
/* input 1234 9876 7890 ABCD | 3456 7654 0123 4567 | 0123 4567 89AB CDEF | FEDC BA98 7654 3210 */
/* output 0000 0000 0012 00FF | 90AB 3210 7654 ABCD | 8712 1200 FF90 AB32 | 7654 ABCD 1087 7654 */
uint8_t permutation[256] = {16,17,18,19, 20,21,22,23, 24,25,26,27, 28,29,30,31,
28,29,30,31, 32,33,34,35, 0,1,2,3, 4,5,6,7,
72,73,74,75, 76,77,78,79, 80,81,82,83, 84,85,86,87,
160,161,162,163, 164,165,166,167, 168,169,170,171, 172,173,174,175,
8,9,10,11, 12,13,14,15, 200,201,202,203, 204,205,206,207,
208,209,210,211, 212,213,214,215, 215,215,215,215, 215,215,215,215,
1,1,1,1, 1,1,1,1, 248,249,250,251, 252,253,254,255,
248,249,250,251, 252,253,254,255, 28,29,30,31, 32,33,34,35,
72,73,74,75, 76,77,78,79, 80,81,82,83, 84,85,86,87,
160,161,162,163, 164,165,166,167, 168,169,170,171, 172,173,174,175,
0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15,
200,201,202,203, 204,205,206,207, 208,209,210,211, 212,213,214,215,
215,215,215,215, 215,215,215,215, 1,1,1,1, 1,1,1,1,
248,249,250,251, 252,253,254,255, 1,1,1,1, 1,1,1,1,
1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1,
1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1};
printf("input = \n");
print_epi64(input);
__m256i x = get_256_bits(input, permutation);
printf("permuted input = \n");
print_epi64(x);
return 0;
}
int print_epi64(__m256i a){
uint64_t v[4];
int i;
_mm256_storeu_si256((__m256i*)v,a);
for (i = 3; i>=0; i--) printf("%016llX ",(unsigned long long)v[i]);
printf("\n");
return 0;
}
The output with the example permutation looks correct:
$ ./a.out
input =
123498767890ABCD 3456765401234567 0123456789ABCDEF FEDCBA9876543210
permuted input =
00000000001200FF 90AB32107654ABCD 87121200FF90AB32 7654ABCD10877654
Efficiency
If you look carefully at the algorithm, you will see that some operations only
depend on the permutation vector pos, and not on x. This means that applying the
permutation with a variable x, and a fixed pos, should be more efficient
than applying the permutation with both variable x and pos.
This is illustrated by the following code:
/* apply the same permutation several times */
int perm_array(__m256i* restrict x_in, uint8_t* restrict pos, __m256i* restrict x_out){
for (int i = 0; i<1024; i++){
x_out[i]=get_256_bits(x_in[i], pos);
}
return 0;
}
With clang and gcc this compiles to really
nice code: Loop .L5 at line 237 only contains 16
vpshufbs instead of 24. Moreover the vpaddbs are hoisted out of the loop.
Note that there is also only one vpermq inside the loop.
I do not know if MSVC will hoist that many instructions outside the loop.
If not, it might be possible
to improve the performance of the loop by modifying the code manually.
This should be done such that
the operations which only depend on pos, and not on x, are hoisted outside the loop.
With respect to the performance on Intel Skylake:
The throughput of this loop is likely limited by the
about 32 port 5 micro-ops per loop iteration. This means that the throughput
in a loop context such as perm_array is about 256 permuted bits per 32 CPU cycles,
or about 8 permuted bits per CPU cycle.
128 bit permutations using AVX2 instructions
This code is quite similar to the 256 bit permutation case.
Although only 128 bits are permuted, the full 256 bit width of the AVX2
registers is used to achieve the best performance.
Here the byte shuffles are not emulated.
This is because there exists
an efficient single instruction to do the byte shuffling
within the 128 bit lanes: vpshufb.
Function perm_array_128 tests the performance of the bit permutation
for a fixed permutation and a variable input x.
The assembly loop contains about 11 port 5 (p5) micro-ops, if we
assume an Intel Skylake CPU.
These 11 p5 micro-ops take at least 11 CPU cycles (throughput).
So, in the best case we get a throughput of about 12 permuted bits per cycle, which is about 1.5 times as fast as the 256 bit permutation case.
/* gcc -O3 -m64 -Wall -mavx2 -march=skylake bitperm128_avx2.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int print128_epi64(__m128i a);
uint32_t get_32_128_bits(__m256i x, __m256i pos){ /* extract 32 permuted bits out from 2x128 bits */
__m256i pshufb_mask = _mm256_set_epi8(0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1, 0,0,0,0, 0,0,0,0, 128,64,32,16, 8,4,2,1);
__m256i byte_pos = _mm256_srli_epi32(pos, 3); /* which byte do we need within the 16 byte lanes. bits 6,5,4,3 select the right byte */
byte_pos = _mm256_and_si256(byte_pos, _mm256_set1_epi8(0xF)); /* mask off the unwanted bits (unnecessary if _mm256_srli_epi8 had existed) */
__m256i bit_pos = _mm256_and_si256(pos, _mm256_set1_epi8(0x07)); /* which bit within the byte */
__m256i bit_pos_mask = _mm256_shuffle_epi8(pshufb_mask, bit_pos); /* get bit mask */
__m256i bytes_wanted = _mm256_shuffle_epi8(x, byte_pos); /* get the right bytes */
__m256i bits_wanted = _mm256_and_si256(bit_pos_mask, bytes_wanted); /* apply the bit mask to get rid of the unwanted bits within the byte */
__m256i bits_x8 = _mm256_cmpeq_epi8(bits_wanted, bit_pos_mask); /* set all bits if the wanted bit is set */
return _mm256_movemask_epi8(bits_x8); /* move most significant bit of each byte to 32 bit register */
}
__m128i permute_128_bits(__m128i x, uint8_t* pos){ /* get bit permutations in 32 bit pieces and glue them together */
__m256i x2 = _mm256_broadcastsi128_si256(x); /* broadcast x to the hi and lo lane */
uint64_t t0 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[0]));
uint64_t t1 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[32]));
uint64_t t2 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[64]));
uint64_t t3 = get_32_128_bits(x2, _mm256_loadu_si256((__m256i*)&pos[96]));
uint64_t t10 = (t1<<32)|t0;
uint64_t t32 = (t3<<32)|t2;
return(_mm_set_epi64x(t32, t10));
}
/* Test loop performance with the following loop (see assembly) -> 11 port5 uops inside the critical loop */
/* Use gcc -O3 -m64 -Wall -mavx2 -march=skylake -S bitperm128_avx2.c to generate the assembly */
int perm_array_128(__m128i* restrict x_in, uint8_t* restrict pos, __m128i* restrict x_out){
for (int i = 0; i<1024; i++){
x_out[i]=permute_128_bits(x_in[i], pos);
}
return 0;
}
int main(){
__m128i input = _mm_set_epi16(0x0123,0x4567,0xFEDC,0xBA98, 0x7654,0x3210,0x89AB,0xCDEF);
/* Example */
/* 112 96 80 64 48 32 16 0 */
/* input 0123 4567 FEDC BA98 7654 3210 89AB CDEF */
/* output 8FFF CDEF DCBA 08EF CDFF DCBA EFF0 89AB */
uint8_t permutation[128] = {16,17,18,19, 20,21,22,23, 24,25,26,27, 28,29,30,31,
32,32,32,32, 36,36,36,36, 0,1,2,3, 4,5,6,7,
72,73,74,75, 76,77,78,79, 80,81,82,83, 84,85,86,87,
0,0,0,0, 0,0,0,0, 8,9,10,11, 12,13,14,15,
0,1,2,3, 4,5,6,7, 28,29,30,31, 32,33,34,35,
72,73,74,75, 76,77,78,79, 80,81,82,83, 84,85,86,87,
0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15,
1,1,1,1, 1,1,1,1, 1,1,1,1, 32,32,32,1};
printf("input = \n");
print128_epi64(input);
__m128i x = permute_128_bits(input, permutation);
printf("permuted input = \n");
print128_epi64(x);
return 0;
}
int print128_epi64(__m128i a){
uint64_t v[2];
int i;
_mm_storeu_si128((__m128i*)v,a);
for (i = 1; i>=0; i--) printf("%016llX ",(unsigned long long)v[i]);
printf("\n");
return 0;
}
Example output for some arbitrary permutation:
$ ./a.out
input =
01234567FEDCBA98 7654321089ABCDEF
permuted input =
8FFFCDEFDCBA08EF CDFFDCBAEFF089AB

Extract set bytes position from SIMD vector

I run a bunch of computations using SIMD instructions. These instructions return a vector of 16 bytes as the result, named compare, with each byte being 0x00 or 0xff:
          0    1    2    3    4    5    6    7        15   16
compare : 0x00 0x00 0x00 0x00 0xff 0x00 0x00 0x00 ... 0xff 0x00
Bytes set to 0xff mean I need to run the function do_operation(i) with i being the position of the byte.
For instance, the above compare vector means I need to run this sequence of operations:
do_operation(4);
do_operation(15);
Here is the fastest solution I came up with until now :
for(...) {
//
// SIMD computations
//
__m128i compare = ... // Result of SIMD computations
// Extract high and low quadwords for compare vector
std::uint64_t cmp_low = (_mm_cvtsi128_si64(compare));
std::uint64_t cmp_high = (_mm_extract_epi64(compare, 1));
// Process low quadword
if (cmp_low) {
const std::uint64_t low_possible_positions = 0x0706050403020100;
const std::uint64_t match_positions = _pext_u64(
low_possible_positions, cmp_low);
const int match_count = _popcnt64(cmp_low) / 8;
const std::uint8_t* match_pos_array =
reinterpret_cast<const std::uint8_t*>(&match_positions);
for (int i = 0; i < match_count; ++i) {
do_operation(match_pos_array[i]);
}
}
// Process high quadword (similarly)
if (cmp_high) {
const std::uint64_t high_possible_positions = 0x0f0e0d0c0b0a0908;
const std::uint64_t match_positions = _pext_u64(
high_possible_positions, cmp_high);
const int match_count = _popcnt64(cmp_high) / 8;
const std::uint8_t* match_pos_array =
reinterpret_cast<const std::uint8_t*>(&match_positions);
for(int i = 0; i < match_count; ++i) {
do_operation(match_pos_array[i]);
}
}
}
I start with extracting the first and second 64 bits integers of the 128 bits vector (cmp_low and cmp_high). Then I use popcount to compute the number of bytes set to 0xff (number of bits set to 1 divided by 8). Finally, I use pext to get positions, without zeros, like this :
0x0706050403020100
0x000000ff00ff0000
|
PEXT
|
0x0000000000000402
I would like to find a faster solution to extract the positions of the bytes set to 0xff in the compare vector. More precisely, there are very often only 0, 1 or 2 bytes set to 0xff in the compare vector and I would like to use this information to avoid some branches.

Here's a quick outline of how you could reduce the number of tests:
First use a function to project the lsb or msb of each byte of your 128-bit integer into a 16-bit value (for instance, there's an SSE2 instruction for that on x86 CPUs: pmovmskb, which collects the msb of each byte and is supported on Intel and MS compilers with the _mm_movemask_epi8 intrinsic; gcc also has a builtin for it: __builtin_ia32_pmovmskb128);
Then split that value in 4 nibbles;
Define functions to handle each possible value of a nibble (from 0 to 15) and put these in an array;
Finally call the function indexed by each nibble (with extra parameters to indicate which nibble in the 16bits it is).

Since in your case very often only 0, 1 or 2 bytes are set to 0xff in the compare vector, a short
while-loop on the bitmask might be more efficient than a solution based on the pext
instruction. See also my answer on a similar question.
/*
gcc -O3 -Wall -m64 -mavx2 -march=broadwell esbsimd.c
*/
#include <stdio.h>
#include <immintrin.h>
int do_operation(int i){ /* some arbitrary do_operation() */
printf("i = %d\n",i);
return 0;
}
int main(){
__m128i compare = _mm_set_epi8(0xFF,0,0,0, 0,0,0,0, 0,0,0,0xFF, 0,0,0,0); /* Take some random value for compare */
int k = _mm_movemask_epi8(compare);
while (k){
int i=_tzcnt_u32(k); /* Count the number of trailing zero bits in k. BMI1 instruction set, Haswell or newer. */
do_operation(i);
k=_blsr_u32(k); /* Clear the lowest set bit in k. */
}
return 0;
}
/*
Output:
i = 4
i = 15
*/

How to use boost::crc_optimal with a binary array (array of 0's and 1's)

I need to use boost::crc_optimal, which calculates the crc of an array (of chars?).
Example use:
// This is "123456789" in ASCII
unsigned char const data[] = { 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39 };
std::size_t const data_len = sizeof( data ) / sizeof( data[0] );
// The expected CRC for the given data
boost::uint16_t const expected = 0x29B1;
boost::crc_optimal<16, 0x1021, 0xFFFF, 0, false, false> crc_ccitt2;
crc_ccitt2 = std::for_each( data, data + data_len, crc_ccitt2 );
assert( crc_ccitt2() == expected );
The problem is that the data I am working with is a sequence of 0's and 1's. A specific example:
int data [] = {1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0};
How do I apply the crc_optimal to this sequence?
Should I just convert each 0 to 0x30 and each 1 to 0x31? In that case, how do I get the resulting crc back into binary form?
Thank you.
Edit: changed array type from float to int, since that is not the essential part.
It looks like the challenge is that I am working with arrays with lengths that are not a multiple of 8.

To use a byte-wise CRC routine, you need to convert your sequence of bits into a sequence of bytes. The order of the bits depends on the order of the CRC, which in this case (CCITT-false) is not reflected, so you consider the stream of bits to be most significant bit first. Then the first eight bits of your sequence becomes 0x85. If it were a reflected CRC (e.g. the true CCITT 16-bit CRC), then the first eight bits becomes 0xa1.
If, as in the example shown, the number of bits is not a multiple of eight, then you will need to write your own CRC routine to handle the remaining one to seven bits. A bit-wise CRC in this case would look like the following for CCITT-false, where bit is the float value converted to an integer 0 or 1:
crc = ((bit << 15) ^ crc) & 0x8000 ? (crc << 1) ^ 0x1021 : crc << 1;
// ... repeat for remaining bits ...
crc &= 0xffff;
Had this been the true CCITT 16-bit CRC, which has a zero initialization value, you could do something different to handle the extra bits. In that case, you could prepend enough zeros to the front of the stream to make its length a multiple of eight. Leading zeros with a zero initialization leave the CRC as zero. So for the CCITT CRC-16, which is reflected, your 17 bits of data become 0x80, 0x50, 0x00.

c++ optimize array of ints

I have a 2D lookup table of int16_t.
int16_t my_array[37][73] = {{**DATA HERE**}}
I have a mixture of values that range from just above the range of int8_t to just below the range of int8_t and some of the values repeat themselves. I am trying to reduce the size of this lookup table.
What I have done so far is split each int16_t value into two int8_t values to visualize the wasted bytes.
int8_t part_1 = original_value >> 8;
int8_t part_2 = original_value & 0x00FF;
// If the upper 8 bits of the original_value were empty
if(part_1 == 0) wasted_bytes_count++;
I can easily remove the zero value int8_t that are wasting a byte of space and I can also remove the duplicate values, but my question is how do I do remove those values while retaining the ability to lookup based on the two indices?
I contemplated translating this into a 1D array and adding a number following each duplicated value that would represent the number of duplicates that were removed, but I am struggling with how I would then identify what is a lookup value and what is a duplicate count. Also, it is further complicated by stripping out the zero int8_t values that were wasted bytes.
EDIT: This array is already stored in ROM, because RAM is even more limited than ROM.
EDIT: I am going to post a bounty for this question as soon as I can. I need a complete answer of how to store the information AND retrieve it. It does not need to be a 2D array as long as I can get the same values.
EDIT: Adding the actual array below:
{150,145,140,135,130,125,120,115,110,105,100,95,90,85,80,75,70,65,60,55,50,45,40,35,30,25,20,15,10,5,0,-4,-9,-14,-19,-24,-29,-34,-39,-44,-49,-54,-59,-64,-69,-74,-79,-84,-89,-94,-99,104,109,114,119,124,129,134,139,144,149,154,159,164,169,174,179,175,170,165,160,155,150}, \
{143,137,131,126,120,115,110,105,100,95,90,85,80,75,71,66,62,57,53,48,44,39,35,31,27,22,18,14,9,5,1,-3,-7,-11,-16,-20,-25,-29,-34,-38,-43,-47,-52,-57,-61,-66,-71,-76,-81,-86,-91,-96,101,107,112,117,123,128,134,140,146,151,157,163,169,175,178,172,166,160,154,148,143}, \
{130,124,118,112,107,101,96,92,87,82,78,74,70,65,61,57,54,50,46,42,38,34,31,27,23,19,16,12,8,4,1,-2,-6,-10,-14,-18,-22,-26,-30,-34,-38,-43,-47,-51,-56,-61,-65,-70,-75,-79,-84,-89,-94,100,105,111,116,122,128,135,141,148,155,162,170,177,174,166,159,151,144,137,130}, \
{111,104,99,94,89,85,81,77,73,70,66,63,60,56,53,50,46,43,40,36,33,30,26,23,20,16,13,10,6,3,0,-3,-6,-9,-13,-16,-20,-24,-28,-32,-36,-40,-44,-48,-52,-57,-61,-65,-70,-74,-79,-84,-88,-93,-98,103,109,115,121,128,135,143,152,162,172,176,165,154,144,134,125,118,111}, \
{85,81,77,74,71,68,65,63,60,58,56,53,51,49,46,43,41,38,35,32,29,26,23,19,16,13,10,7,4,1,-1,-3,-6,-9,-13,-16,-19,-23,-26,-30,-34,-38,-42,-46,-50,-54,-58,-62,-66,-70,-74,-78,-83,-87,-91,-95,100,105,110,117,124,133,144,159,178,160,141,125,112,103,96,90,85}, \
{62,60,58,57,55,54,52,51,50,48,47,46,44,42,41,39,36,34,31,28,25,22,19,16,13,10,7,4,2,0,-3,-5,-8,-10,-13,-16,-19,-22,-26,-29,-33,-37,-41,-45,-49,-53,-56,-60,-64,-67,-70,-74,-77,-80,-83,-86,-89,-91,-94,-97,101,105,111,130,109,84,77,74,71,68,66,64,62}, \
{46,46,45,44,44,43,42,42,41,41,40,39,38,37,36,35,33,31,28,26,23,20,16,13,10,7,4,1,-1,-3,-5,-7,-9,-12,-14,-16,-19,-22,-26,-29,-33,-36,-40,-44,-48,-51,-55,-58,-61,-64,-66,-68,-71,-72,-74,-74,-75,-74,-72,-68,-61,-48,-25,2,22,33,40,43,45,46,47,46,46}, \
{36,36,36,36,36,35,35,35,35,34,34,34,34,33,32,31,30,28,26,23,20,17,14,10,6,3,0,-2,-4,-7,-9,-10,-12,-14,-15,-17,-20,-23,-26,-29,-32,-36,-40,-43,-47,-50,-53,-56,-58,-60,-62,-63,-64,-64,-63,-62,-59,-55,-49,-41,-30,-17,-4,6,15,22,27,31,33,34,35,36,36}, \
{30,30,30,30,30,30,30,29,29,29,29,29,29,29,29,28,27,26,24,21,18,15,11,7,3,0,-3,-6,-9,-11,-12,-14,-15,-16,-17,-19,-21,-23,-26,-29,-32,-35,-39,-42,-45,-48,-51,-53,-55,-56,-57,-57,-56,-55,-53,-49,-44,-38,-31,-23,-14,-6,0,7,13,17,21,24,26,27,29,29,30}, \
{25,25,26,26,26,25,25,25,25,25,25,25,25,26,25,25,24,23,21,19,16,12,8,4,0,-3,-7,-10,-13,-15,-16,-17,-18,-19,-20,-21,-22,-23,-25,-28,-31,-34,-37,-40,-43,-46,-48,-49,-50,-51,-51,-50,-48,-45,-42,-37,-32,-26,-19,-13,-7,-1,3,7,11,14,17,19,21,23,24,25,25}, \
{21,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,21,20,18,16,13,9,5,1,-3,-7,-11,-14,-17,-18,-20,-21,-21,-22,-22,-22,-23,-23,-25,-27,-29,-32,-35,-37,-40,-42,-44,-45,-45,-45,-44,-42,-40,-36,-32,-27,-22,-17,-12,-7,-3,0,3,7,9,12,14,16,18,19,20,21,21}, \
{18,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,18,17,16,14,10,7,2,-1,-6,-10,-14,-17,-19,-21,-22,-23,-24,-24,-24,-24,-23,-23,-23,-24,-26,-28,-30,-33,-35,-37,-38,-39,-39,-38,-36,-34,-31,-28,-24,-19,-15,-10,-6,-3,0,1,4,6,8,10,12,14,15,16,17,18,18}, \
{16,16,17,17,17,17,17,17,17,17,17,16,16,16,16,16,16,15,13,11,8,4,0,-4,-9,-13,-16,-19,-21,-23,-24,-25,-25,-25,-25,-24,-23,-21,-20,-20,-21,-22,-24,-26,-28,-30,-31,-32,-31,-30,-29,-27,-24,-21,-17,-13,-9,-6,-3,-1,0,2,4,5,7,9,10,12,13,14,15,16,16}, \
{14,14,14,15,15,15,15,15,15,15,14,14,14,14,14,14,13,12,11,9,5,2,-2,-6,-11,-15,-18,-21,-23,-24,-25,-25,-25,-25,-24,-22,-21,-18,-16,-15,-15,-15,-17,-19,-21,-22,-24,-24,-24,-23,-22,-20,-18,-15,-12,-9,-5,-3,-1,0,1,2,4,5,6,8,9,10,11,12,13,14,14}, \
{12,13,13,13,13,13,13,13,13,13,13,13,12,12,12,12,11,10,9,6,3,0,-4,-8,-12,-16,-19,-21,-23,-24,-24,-24,-24,-23,-22,-20,-17,-15,-12,-10,-9,-9,-10,-12,-13,-15,-17,-17,-18,-17,-16,-15,-13,-11,-8,-5,-3,-1,0,1,1,2,3,4,6,7,8,9,10,11,12,12,12}, \
{11,11,11,11,11,12,12,12,12,12,11,11,11,11,11,10,10,9,7,5,2,-1,-5,-9,-13,-17,-20,-22,-23,-23,-23,-23,-22,-20,-18,-16,-14,-11,-9,-6,-5,-4,-5,-6,-8,-9,-11,-12,-12,-12,-12,-11,-9,-8,-6,-3,-1,0,0,1,1,2,3,4,5,6,7,8,9,10,11,11,11}, \
{10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,7,6,3,0,-3,-6,-10,-14,-17,-20,-21,-22,-22,-22,-21,-19,-17,-15,-13,-10,-8,-6,-4,-2,-2,-2,-2,-4,-5,-7,-8,-8,-9,-8,-8,-7,-5,-4,-2,0,0,1,1,1,2,2,3,4,5,6,7,8,9,10,10,10}, \
{9,9,9,9,9,9,9,10,10,9,9,9,9,9,9,8,8,6,5,2,0,-4,-7,-11,-15,-17,-19,-21,-21,-21,-20,-18,-16,-14,-12,-10,-8,-6,-4,-2,-1,0,0,0,-1,-2,-4,-5,-5,-6,-6,-5,-5,-4,-3,-1,0,0,1,1,1,1,2,3,3,5,6,7,8,8,9,9,9}, \
{9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,7,5,4,1,-1,-5,-8,-12,-15,-17,-19,-20,-20,-19,-18,-16,-14,-11,-9,-7,-5,-4,-2,-1,0,0,1,1,0,0,-2,-3,-3,-4,-4,-4,-3,-3,-2,-1,0,0,0,0,0,1,1,2,3,4,5,6,7,8,8,9,9}, \
{9,9,9,8,8,8,9,9,9,9,9,8,8,8,8,7,6,5,3,0,-2,-5,-9,-12,-15,-17,-18,-19,-19,-18,-16,-14,-12,-9,-7,-5,-4,-2,-1,0,0,1,1,1,1,0,0,-1,-2,-2,-3,-3,-2,-2,-1,-1,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,8,9}, \
{8,8,8,8,8,8,9,9,9,9,9,9,8,8,8,7,6,4,2,0,-3,-6,-9,-12,-15,-17,-18,-18,-17,-16,-14,-12,-10,-8,-6,-4,-2,-1,0,0,1,2,2,2,2,1,0,0,-1,-1,-1,-2,-2,-1,-1,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,8}, \
{8,8,8,8,9,9,9,9,9,9,9,9,9,8,8,7,5,3,1,-1,-4,-7,-10,-13,-15,-16,-17,-17,-16,-15,-13,-11,-9,-6,-5,-3,-2,0,0,0,1,2,2,2,2,1,1,0,0,0,-1,-1,-1,-1,-1,0,0,0,0,-1,-1,-1,-1,-1,0,0,1,3,4,5,7,7,8}, \
{8,8,9,9,9,9,10,10,10,10,10,10,10,9,8,7,5,3,0,-2,-5,-8,-11,-13,-15,-16,-16,-16,-15,-13,-12,-10,-8,-6,-4,-2,-1,0,0,1,2,2,3,3,2,2,1,0,0,0,0,0,0,0,0,0,0,-1,-1,-2,-2,-2,-2,-2,-1,0,0,1,3,4,6,7,8}, \
{7,8,9,9,9,10,10,11,11,11,11,11,10,10,9,7,5,3,0,-2,-6,-9,-11,-13,-15,-16,-16,-15,-14,-13,-11,-9,-7,-5,-3,-2,0,0,1,1,2,3,3,3,3,2,2,1,1,0,0,0,0,0,0,0,-1,-1,-2,-3,-3,-4,-4,-4,-3,-2,-1,0,1,3,5,6,7}, \
{6,8,9,9,10,11,11,12,12,12,12,12,11,11,9,7,5,2,0,-3,-7,-10,-12,-14,-15,-16,-15,-15,-13,-12,-10,-8,-7,-5,-3,-1,0,0,1,2,2,3,3,4,3,3,3,2,2,1,1,1,0,0,0,0,-1,-2,-3,-4,-4,-5,-5,-5,-5,-4,-2,-1,0,2,3,5,6}, \
{6,7,8,10,11,12,12,13,13,14,14,13,13,11,10,8,5,2,0,-4,-8,-11,-13,-15,-16,-16,-16,-15,-13,-12,-10,-8,-6,-5,-3,-1,0,0,1,2,3,3,4,4,4,4,4,3,3,3,2,2,1,1,0,0,-1,-2,-3,-5,-6,-7,-7,-7,-6,-5,-4,-3,-1,0,2,4,6}, \
{5,7,8,10,11,12,13,14,15,15,15,14,14,12,11,8,5,2,-1,-5,-9,-12,-14,-16,-17,-17,-16,-15,-14,-12,-11,-9,-7,-5,-3,-1,0,0,1,2,3,4,4,5,5,5,5,5,5,4,4,3,3,2,1,0,-1,-2,-4,-6,-7,-8,-8,-8,-8,-7,-6,-4,-2,0,1,3,5}, \
{4,6,8,10,12,13,14,15,16,16,16,16,15,13,11,9,5,2,-2,-6,-10,-13,-16,-17,-18,-18,-17,-16,-15,-13,-11,-9,-7,-5,-4,-2,0,0,1,3,3,4,5,6,6,7,7,7,7,7,6,5,4,3,2,0,-1,-3,-5,-7,-8,-9,-10,-10,-10,-9,-7,-5,-4,-1,0,2,4}, \
{4,6,8,10,12,14,15,16,17,18,18,17,16,15,12,9,5,1,-3,-8,-12,-15,-18,-19,-20,-20,-19,-18,-16,-15,-13,-11,-8,-6,-4,-2,-1,0,1,3,4,5,6,7,8,9,9,9,9,9,9,8,7,5,3,1,-1,-3,-6,-8,-10,-11,-12,-12,-11,-10,-9,-7,-5,-2,0,1,4}, \
{4,6,8,11,13,15,16,18,19,19,19,19,18,16,13,10,5,0,-5,-10,-15,-18,-21,-22,-23,-22,-22,-20,-18,-17,-14,-12,-10,-8,-5,-3,-1,0,1,3,5,6,8,9,10,11,12,12,13,12,12,11,9,7,5,2,0,-3,-6,-9,-11,-12,-13,-13,-12,-11,-10,-8,-6,-3,-1,1,4}, \
{3,6,9,11,14,16,17,19,20,21,21,21,19,17,14,10,4,-1,-8,-14,-19,-22,-25,-26,-26,-26,-25,-23,-21,-19,-17,-14,-12,-9,-7,-4,-2,0,1,3,5,7,9,11,13,14,15,16,16,16,16,15,13,10,7,4,0,-3,-7,-10,-12,-14,-15,-14,-14,-12,-11,-9,-6,-4,-1,1,3}, \
{4,6,9,12,14,17,19,21,22,23,23,23,21,19,15,9,2,-5,-13,-20,-25,-28,-30,-31,-31,-30,-29,-27,-25,-22,-20,-17,-14,-11,-9,-6,-3,0,1,4,6,9,11,13,15,17,19,20,21,21,21,20,18,15,11,6,2,-2,-7,-11,-13,-15,-16,-16,-15,-13,-11,-9,-7,-4,-1,1,4}, \
{4,7,10,13,15,18,20,22,24,25,25,25,23,20,15,7,-2,-12,-22,-29,-34,-37,-38,-38,-37,-36,-34,-31,-29,-26,-23,-20,-17,-13,-10,-7,-4,-1,2,5,8,11,13,16,18,21,23,24,26,26,26,26,24,21,17,12,5,0,-6,-10,-14,-16,-16,-16,-15,-14,-12,-10,-7,-4,-1,1,4}, \
{4,7,10,13,16,19,22,24,26,27,27,26,24,19,11,-1,-15,-28,-37,-43,-46,-47,-47,-45,-44,-41,-39,-36,-32,-29,-26,-22,-19,-15,-11,-8,-4,-1,2,5,9,12,15,19,22,24,27,29,31,33,33,33,32,30,26,21,14,6,0,-6,-11,-14,-15,-16,-15,-14,-12,-9,-7,-4,-1,1,4}, \
{6,9,12,15,18,21,23,25,27,28,27,24,17,4,-14,-34,-49,-56,-60,-60,-60,-58,-56,-53,-50,-47,-43,-40,-36,-32,-28,-25,-21,-17,-13,-9,-5,-1,2,6,10,14,17,21,24,28,31,34,37,39,41,42,43,43,41,38,33,25,17,8,0,-4,-8,-10,-10,-10,-8,-7,-4,-2,0,3,6}, \
{22,24,26,28,30,32,33,31,23,-18,-81,-96,-99,-98,-95,-93,-89,-86,-82,-78,-74,-70,-66,-62,-57,-53,-49,-44,-40,-36,-32,-27,-23,-19,-14,-10,-6,-1,2,6,10,15,19,23,27,31,35,38,42,45,49,52,55,57,60,61,63,63,62,61,57,53,47,40,33,28,23,21,19,19,19,20,22}, \
{168,173,178,176,171,166,161,156,151,146,141,136,131,126,121,116,111,106,101,-96,-91,-86,-81,-76,-71,-66,-61,-56,-51,-46,-41,-36,-31,-26,-21,-16,-11,-6,-1,3,8,13,18,23,28,33,38,43,48,53,58,63,68,73,78,83,88,93,98,103,108,113,118,123,128,133,138,143,148,153,158,163,168}, \
Thanks for your time.

I see several options for your array compaction.
1. Separate 8-bit and 1-bit arrays
You can split your array into two parts: the first stores the 8 low-order bits of each original value, and the second stores '1' if the value does not fit in 8 bits and '0' otherwise. This takes 9 bits per value (the same space as nightcracker's approach, but a little simpler). To read a value from these two arrays, do the following:
int8_t array8[37*73] = {...};
uint16_t array1[(37*73+15)/16] = {...};
size_t offset = 37 * x + y;
int16_t item = static_cast<int16_t>(array8[offset]); // sign extend
int16_t overflow = ((array1[offset/16] >> (offset%16)) & 0x0001) << 7;
item ^= overflow;
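The reading code above round-trips if the 8-bit array holds each value unchanged when it fits in 8 bits, and the value with bit 7 flipped when it does not. That is only one reading of the scheme, not something stated in the answer, and `SplitTable`, `encodeSplit` and `decodeSplit` are hypothetical names. A minimal sketch under that assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the two arrays; the names are illustrative only.
struct SplitTable {
    std::vector<int8_t> array8;    // value as-is if it fits, else value with bit 7 flipped
    std::vector<uint16_t> array1;  // one "does not fit in 8 bits" flag per value
};

SplitTable encodeSplit(const std::vector<int16_t>& values) {
    SplitTable t;
    t.array8.resize(values.size());
    t.array1.assign((values.size() + 15) / 16, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        int v = values[i];
        assert(v >= -256 && v <= 255);  // the value must fit in 9 bits
        if (v < -128 || v > 127) {
            v ^= 0x80;                  // after the flip the value fits in int8_t
            t.array1[i / 16] |= static_cast<uint16_t>(1u << (i % 16));
        }
        t.array8[i] = static_cast<int8_t>(v);
    }
    return t;
}

// Same steps as the reading code in the answer.
int16_t decodeSplit(const SplitTable& t, std::size_t offset) {
    int16_t item = static_cast<int16_t>(t.array8[offset]);  // sign extend
    int16_t overflow = ((t.array1[offset / 16] >> (offset % 16)) & 0x0001) << 7;
    item ^= overflow;
    return item;
}
```

Encoding the table once (for example in a build script) and pasting the two arrays into the source keeps the run-time code down to `decodeSplit`.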
2. Approximation
If you can approximate your array with some efficiently computed function (such as a polynomial or an exponential), you can store in the array only the difference between each value and the approximation. This may require only 8 bits per value, or even fewer.
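As a rough illustration of the idea, the sketch below stores only the residuals against a straight line. The line `150 - 5 * col`, the sample values in the usage note and the names `approx`, `encodeResiduals` and `decodeResidual` are all assumptions made for the example, not something taken from the question's data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical approximation: a straight line (assumed for illustration).
int approx(int col) { return 150 - 5 * col; }

// Store only the difference between each value and the approximation.
std::vector<int8_t> encodeResiduals(const std::vector<int>& row) {
    std::vector<int8_t> residuals(row.size());
    for (std::size_t c = 0; c < row.size(); ++c) {
        int r = row[c] - approx(static_cast<int>(c));
        assert(r >= -128 && r <= 127);  // the residual must fit in 8 bits
        residuals[c] = static_cast<int8_t>(r);
    }
    return residuals;
}

// Recompute the approximation and add the stored residual back.
int decodeResidual(const std::vector<int8_t>& residuals, std::size_t col) {
    return approx(static_cast<int>(col)) + residuals[col];
}
```

For made-up sample values such as 150, 146, 139, 135, 131 the residuals are 0, 1, -1, 0, 1, so the closer the function tracks the data, the fewer bits each residual needs.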
3. Delta encoding
If your data is smooth enough, then in addition to applying either of the previous methods, you can store a shorter table with only some of the data values, plus a second table containing only the differences between the values absent from the first table and the values from the first table. This requires fewer bits for each value.
For example, you can store every fifth value and differences for other values:
Original array: 0 0 1 1 2 2 2 2 2 3 3 3 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7
Short array: 0 2 3 5 6 6
Difference array: 0 1 1 2 0 0 0 1 0 1 1 2 0 0 0 1 0 0 0 0 0 1 1 1
Alternatively, you can use differences from the previous value, which requires even fewer bits per value:
Original array: 0 0 1 1 2 2 2 2 2 3 3 3 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7
Short array: 0 2 3 5 6 6
Delta array: 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0
The delta-array approach may be implemented efficiently with bitwise operations if a group of delta values fits exactly in an int16_t.
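For the 1-bit deltas of the second example, a lookup reduces to a mask and a bit count. The sketch below packs the four deltas of each group into the low bits of one byte, first delta in bit 0; that layout and the names `kShort`, `kDeltaBits`, `countBits` and `valueAt` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Every fifth value of the example array.
const int16_t kShort[6] = {0, 2, 3, 5, 6, 6};
// The four 1-bit deltas that follow each stored value, first delta in bit 0.
const uint8_t kDeltaBits[6] = {0xA, 0x8, 0xA, 0x8, 0x0, 0x2};

// Count the set bits in x (portable stand-in for a popcount instruction).
int countBits(unsigned x) {
    int n = 0;
    for (; x != 0; x &= x - 1) ++n;
    return n;
}

// Value i = stored value of its group + sum of the deltas before position i.
int valueAt(std::size_t i) {
    assert(i < 30);
    std::size_t group = i / 5;
    unsigned k = static_cast<unsigned>(i % 5);
    unsigned mask = (1u << k) - 1u;  // selects the first k deltas of the group
    return kShort[group] + countBits(kDeltaBits[group] & mask);
}
```

With wider deltas the sum is no longer a plain bit count, but the same mask-and-shift pattern applies to each field.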
Initialization
For option #2, the preprocessor may be used. For the other options, the preprocessor is possible but may not be very convenient (it is not very good at processing long value lists). Some combination of the preprocessor and variadic templates may be better. Or it may be easier to use a text-processing script.
Update
After looking at the actual data, I can give some more details. Option #2 (Approximation) is not very convenient for your data. Option #1 seems to be better. Or you can use Mark Ransom's or nightcracker's approach. It doesn't matter which one: in all cases you save 7 bits out of 16.
Option #3 (Delta encoding) can save much more space. It cannot be used directly, because in some cells of the array the data changes abruptly. But, as far as I can tell, these large changes happen at most once in each row. They may be handled by one additional column with the full data value and one special value in the delta array.
I noticed that (ignoring these abrupt changes) the difference between neighboring values is never more than +/- 32. This requires 6 bits to encode each delta value. This means 6.6 bits per value. 58% compression. About 2400 bytes. (Not much, but a little bit better than 2464K in your comments).
The middle part of the array is much smoother. You'll need only 5 bits per value to encode it separately. This may save 300..400 bytes more. It is probably a good idea to split this array into several parts and encode each part differently.

As nightcracker has noted your values will fit into 9 bits. There's an easier way to store those values though. Put the absolute values into a byte array and put the sign bits into a separate packed bit array.
uint8_t my_array[37][73] = {{**DATA ABSOLUTE VALUES HERE**}}; // unsigned: absolute values go up to 179
uint8_t my_signs[37][10] = {{**SIGN BITS HERE**}};
int16_t my_value = my_array[i][j];
if (my_signs[i][j/8] & (1 << (j%8)))
my_value = -my_value;
This is about a 43% reduction in your original table size without too much effort.
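To fill the two tables, something like the following could be run once, offline. It is a minimal sketch for a single row; `SignMagnitude`, `encodeSignMagnitude` and `decodeSignMagnitude` are hypothetical names, and the bit order (bit `j % 8` of byte `j / 8`) is chosen to match the lookup above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SignMagnitude {
    std::vector<uint8_t> magnitude;  // |value|, one byte each
    std::vector<uint8_t> signs;      // one bit per value, 1 = negative
};

SignMagnitude encodeSignMagnitude(const std::vector<int16_t>& row) {
    SignMagnitude t;
    t.magnitude.resize(row.size());
    t.signs.assign((row.size() + 7) / 8, 0);
    for (std::size_t j = 0; j < row.size(); ++j) {
        int v = row[j];
        assert(v >= -255 && v <= 255);  // the magnitude must fit in one byte
        if (v < 0) {
            t.signs[j / 8] |= static_cast<uint8_t>(1u << (j % 8));
            v = -v;
        }
        t.magnitude[j] = static_cast<uint8_t>(v);
    }
    return t;
}

int16_t decodeSignMagnitude(const SignMagnitude& t, std::size_t j) {
    int16_t value = t.magnitude[j];
    if (t.signs[j / 8] & (1 << (j % 8)))
        value = static_cast<int16_t>(-value);
    return value;
}
```

A 73-entry row needs 73 magnitude bytes and 10 sign bytes, which is where the 37 x 10 sign table comes from.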

I know from experience that visualizing things can help find a good solution to a problem. Since it isn't very clear what your data actually represents (and so we know little or nothing about the problem domain), we might not come up with "the best" solution (if one exists at all, of course). So I took the liberty of visualizing the data; as the saying goes: a picture is worth a thousand words :-)
I am sorry I do not have a solution (yet) better than the ones already posted but I thought the plot might help someone (or myself) come up with a better solution.

You want the range +-179, so 360 distinct values are enough, and 360 distinct values can be expressed in 9 bits. This is an example of a 9-bit integer lookup table:
// size is ceil(37 * 73 * 9 / 16)
uint16_t my_array[1520];
int16_t get_lookup_item(int x, int y) {
// calculate bitoffset
size_t bitoffset = (37 * x + y) * 9;
// calculate difference with 16 bit array offset
size_t diff = bitoffset % 16;
uint16_t item;
// our item doesn't overlap a 16 bit boundary
if (diff < (16 - 9)) {
item = my_array[bitoffset / 16]; // get item
item >>= diff;
item &= (1 << 9) - 1;
// our item does overlap a 16 bit boundary
} else {
item = my_array[bitoffset / 16];
item >>= diff;
item &= (1 << (16 - diff)) - 1;
// the remaining high bits come from the next word and go above the low bits
item += (my_array[bitoffset / 16 + 1] & ((1 << (9 - 16 + diff)) - 1)) << (16 - diff);
}
// we now have the unsigned item, subtract 179 to bring it into the correct range
return item - 179;
}
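The table for such a getter has to be packed with the matching layout. The sketch below shows one way to write and read 9-bit items in 16-bit words, low bits first and with the same bias of 179. It takes a flat item index rather than x and y, and `put9`, `get9`, `kBits` and `kBias` are hypothetical names rather than part of the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const unsigned kBits = 9;  // bits per packed item
const int kBias = 179;     // shifts [-179, 179] to [0, 358]

// Write item number `index` into a zero-initialized word array.
void put9(std::vector<uint16_t>& words, std::size_t index, int value) {
    assert(value >= -kBias && value <= kBias);
    unsigned item = static_cast<unsigned>(value + kBias);
    std::size_t bitoffset = index * kBits;
    std::size_t word = bitoffset / 16;
    unsigned diff = static_cast<unsigned>(bitoffset % 16);
    words[word] |= static_cast<uint16_t>(item << diff);  // low part
    if (diff + kBits > 16)                                // item crosses a word boundary
        words[word + 1] |= static_cast<uint16_t>(item >> (16 - diff));
}

// Read item number `index` back.
int get9(const std::vector<uint16_t>& words, std::size_t index) {
    std::size_t bitoffset = index * kBits;
    std::size_t word = bitoffset / 16;
    unsigned diff = static_cast<unsigned>(bitoffset % 16);
    unsigned item = static_cast<unsigned>(words[word]) >> diff;
    if (diff + kBits > 16)
        item |= static_cast<unsigned>(words[word + 1]) << (16 - diff);
    return static_cast<int>(item & ((1u << kBits) - 1u)) - kBias;
}
```

The word array is sized as `(count * 9 + 15) / 16`, which gives 1520 words for 37 * 73 items.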

Here's another approach, totally different from my first one which is why it's a separate answer.
If the number of values that won't fit in 8 bits is less than 1/8 of the total, you can devote an entire extra byte to each and still wind up with a smaller result versus keeping another 1-bit array.
In the interest of simplicity and speed I wanted to stick with full byte values, rather than bit packing. You've never said if there are speed constraints to this problem, but decoding an entire file just to look up one value seems wasteful. If this really isn't a problem for you, your best results would probably come from implementing the decoding part of some readily available open-source compression utility.
For this implementation I kept to a very simple encoding. First I did a delta as suggested by Evgeny Kluev, starting over for each row; your data is uncommonly amenable to this approach. Each byte is then encoded via the following rules:
An absolute value >= 97 is given a leading byte of 97. This value was arrived at by trying different thresholds and choosing the one that generated the smallest result. This is followed by the value less 97.
The run length is only checked for values between -96 and 96. Run lengths between 3 and 32 are encoded as 98 to 127, and run lengths between 33 and 64 are encoded as -97 to -128.
Finally the values between -96 and 96 are output as is.
This results in an encoded array of 2014 bytes, plus another of 36 bytes for indexing to the start of each row for a total of 2050 bytes.
A full implementation can be found at http://ideone.com/SNdRI . The output is identical to the table posted in the question.

As others have suggested, you can save a lot of space by storing the absolute value of each entry in an array of 8-bit integers, and the sign bit in a separate packed bit array. Mark Ransom's solution is simple and will give good performance, and will reduce the size from 5,402 bytes to 3,071 bytes, saving 43.1%.
If you are really trying to squeeze every last bit of space, you can do a bit better still by exploiting the characteristics of this data set. In particular, note that the values are mostly positive, and that there are several runs of values with the same sign. Instead of tracking the sign for every value in the "my_signs" array, you could track only the runs of negative values as a start index (two bytes, for the range [0..2701]) and a run length (one byte, since the longest run is 36 entries long). For this data set, that reduces the size of the signs table from 370 bytes to 168 bytes. The total storage is then 2,869 bytes, a savings of 46.8% compared to the original (2,533 bytes less).
Here's code that implements this strategy:
uint8_t my_array[37][73] = {{ /* ABSOLUTE VALUES OF ORIGINAL ARRAY HERE */ }};
// Sign bits for the values in my_array. The data is arranged in groups of
// three bytes. The first two give the starting index of a run of negative
// values. The third gives the length of the run. To determine if a given
// value should be negated, compute its index as (row * 73) + col, then scan this
// table to see if that index appears in any of the runs. If it does, the value
// should be negated.
uint8_t my_signs[168] = {
0x00, 0x1f, 0x14, 0x00, 0x68, 0x15, 0x00, 0xb1, 0x16, 0x00, 0xfa, 0x18,
0x01, 0x42, 0x1a, 0x01, 0x8b, 0x1e, 0x01, 0xd2, 0x23, 0x02, 0x1a, 0x24,
0x02, 0x62, 0x24, 0x02, 0xaa, 0x25, 0x02, 0xf2, 0x25, 0x03, 0x3a, 0x25,
0x03, 0x83, 0x25, 0x03, 0xcb, 0x25, 0x04, 0x14, 0x24, 0x04, 0x5c, 0x24,
0x04, 0xa5, 0x23, 0x04, 0xee, 0x14, 0x05, 0x05, 0x0c, 0x05, 0x36, 0x14,
0x05, 0x50, 0x0a, 0x05, 0x7f, 0x13, 0x05, 0x9a, 0x09, 0x05, 0xc8, 0x12,
0x05, 0xe4, 0x07, 0x06, 0x10, 0x12, 0x06, 0x2f, 0x05, 0x06, 0x38, 0x05,
0x06, 0x59, 0x12, 0x06, 0x7f, 0x08, 0x06, 0xa2, 0x11, 0x06, 0xc7, 0x0b,
0x06, 0xeb, 0x11, 0x07, 0x10, 0x0c, 0x07, 0x34, 0x11, 0x07, 0x59, 0x0d,
0x07, 0x7c, 0x12, 0x07, 0xa2, 0x0d, 0x07, 0xc5, 0x12, 0x07, 0xeb, 0x0e,
0x08, 0x0e, 0x13, 0x08, 0x34, 0x0e, 0x08, 0x57, 0x13, 0x08, 0x7e, 0x0e,
0x08, 0x9f, 0x14, 0x08, 0xc7, 0x0e, 0x08, 0xe8, 0x14, 0x09, 0x10, 0x0e,
0x09, 0x30, 0x16, 0x09, 0x5a, 0x0d, 0x09, 0x78, 0x17, 0x09, 0xa4, 0x0c,
0x09, 0xc0, 0x18, 0x09, 0xef, 0x09, 0x0a, 0x04, 0x1d, 0x0a, 0x57, 0x14
};
int getSign(int row, int col)
{
int want = (row * 73) + col;
for (int i = 0 ; i < 168 ; i += 3) {
int16_t start = (my_signs[i] << 8) | my_signs[i + 1];
if (start > want) {
// Not going to find it, so may as well stop now.
break;
}
int runlength = my_signs[i + 2];
if (want < start + runlength) {
// Found this index in the signs array, so this entry is negative.
return -1;
}
}
return 1;
}
int16_t getValue(int row, int col)
{
return getSign(row, col) * my_array[row][col];
}
In fact you could even do a little bit better still, at the cost of more complex code, by recognizing that for the run-length encoded version of the signs table, you really need only 12 bits for the start index and 6 bits for the run length, for 18 bits total (compared to the 24 that the simple implementation above uses). That would cut the size another 42 bytes to 2,827 total, a 47.6% savings compared to the original (2,575 bytes less).
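A possible layout for those 18-bit records is sketched below: 12 bits of start index followed by 6 bits of run length, written most significant bit first into a byte array. The layout and the names `SignRun`, `packRuns` and `unpackRun` are assumptions made for illustration; the answer does not specify them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SignRun {      // hypothetical record type
    unsigned start;   // index of the first negative value, 12 bits
    unsigned length;  // number of negative values in the run, 6 bits
};

// Pack each run as 12 bits of start followed by 6 bits of length, MSB first.
std::vector<uint8_t> packRuns(const std::vector<SignRun>& runs) {
    std::vector<uint8_t> out((runs.size() * 18 + 7) / 8, 0);
    std::size_t bit = 0;
    for (const SignRun& r : runs) {
        assert(r.start < (1u << 12) && r.length < (1u << 6));
        uint32_t rec = (static_cast<uint32_t>(r.start) << 6) | r.length;
        for (int b = 17; b >= 0; --b, ++bit) {
            if ((rec >> b) & 1u)
                out[bit / 8] |= static_cast<uint8_t>(0x80u >> (bit % 8));
        }
    }
    return out;
}

// Read record number i back out of the packed bytes.
SignRun unpackRun(const std::vector<uint8_t>& packed, std::size_t i) {
    uint32_t rec = 0;
    std::size_t bit = i * 18;
    for (int b = 0; b < 18; ++b, ++bit)
        rec = (rec << 1) | ((packed[bit / 8] >> (7 - bit % 8)) & 1u);
    SignRun r;
    r.start = static_cast<unsigned>(rec >> 6);
    r.length = static_cast<unsigned>(rec & 0x3Fu);
    return r;
}
```

With 56 runs this comes to 126 bytes, which matches the 42 bytes saved relative to the 168-byte table.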

Investigating the actual array shows that the data is very smooth and may be compacted significantly. Simple methods do not give much more space reduction beyond encoding the 16-bit values in 9 bits. This is because the data characteristics vary from place to place in the array. Splitting the array into several pieces and encoding them differently may reduce the array size further, but this is more complicated and increases code size.
The approach described here encodes data blocks of variable length while still giving relatively quick access to the original values (though more slowly than the simple methods). At the price of speed, the compression ratio increases significantly.
The main idea is delta encoding. But in comparison to the simple algorithm in my previous post, variable block length and variable bit depth are possible. This allows, for example, a zero bit depth for the deltas of repeating values, which means only a fixed header and no delta values at all (similar to run-length encoding).
Also there is a single base value for all deltas in the block. This allows linearly changing data (which is quite common in the actual array) to be encoded with only the base value, again spending zero space on delta values. It also slightly decreases the average bit depth in other cases.
Compressed data is stored in an array of bitstreams, accessed by a bitstream reader. To give quick access to the start of each bitstream, an index table is used (just an array of 37 16-bit indexes).
Each bitstream starts with the number of blocks in the stream (5 bits), followed by the index of blocks and finally the data blocks. The index of blocks gives a way to skip unneeded data blocks during search. Each index entry contains: the number of elements in the block (4 bits encode from 9 to 24 delta values, plus the starting value), the size of the base value for all deltas (1 bit for a size of 4 or 6), and the size of the deltas (2 bits for sizes 0..3 if the base size is 4, or for sizes 2..5 if the base size is 6). These specific bit depths are probably close to optimal, but may be changed to trade some speed for some space or to adapt the algorithm to a different data array.
Each data block contains the starting value (9 bits), the base value for deltas (4 or 6 bits), and the delta values (0..3 or 2..5 bits each).
Here is the function, extracting original values from the compressed data:
int get(size_t row, unsigned col)
{
BitstreamReader bsr(indexTable[row]);
unsigned blocks = bsr.getUI(5);
unsigned block = 0;
unsigned start = 0;
unsigned nextStart = 0;
unsigned offset = 0;
unsigned nextOffset = 0;
unsigned blockSize = 0;
unsigned baseSize = 0;
unsigned deltaSize = 0;
while (col >= nextStart) // 3 iterations on average
{
start = nextStart;
offset = nextOffset;
++block;
blockSize = bsr.getUI(4) + 9;
nextStart += blockSize;
baseSize = bsr.getUI(1)*2 + 4;
deltaSize = bsr.getUI(2) + baseSize - 4;
nextOffset += deltaSize * blockSize + baseSize + 9;
}
--block;
bsr.skip((blocks - block) * 7 + offset);
int value = bsr.getI(9);
int base = bsr.getI(baseSize);
while(col-- > start) // 12 iterations on average
{
int delta = base + bsr.getUI(deltaSize);
value += delta;
}
return value;
}
Here is an implementation for bitstream reader:
class BitstreamReader
{
public:
BitstreamReader(size_t start): word_(start), bit_(0) {}
void skip(unsigned offset)
{
// advance by whole words, carrying any overflow of the bit position
word_ += (bit_ + offset) / 16;
bit_ = (bit_ + offset) % 16;
}
unsigned getUI(unsigned size)
{
unsigned old = bit_;
unsigned result = dataTable[word_] >> bit_;
result &= ((1 << size) - 1);
bit_ += size;
if (bit_ >= 16)
{
++word_;
bit_ -= 16;
if (bit_ > 0)
{
result += (dataTable[word_] & ((1 << bit_) - 1)) << (16 - old);
}
}
return result;
}
int getI(unsigned size)
{
int result = static_cast<int>(getUI(size));
return result | -(result & (1 << (size - 1)));
}
private:
size_t word_;
unsigned bit_;
};
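To try the reader out, a matching writer is needed. The sketch below is a simplified, bit-at-a-time pair that uses the same layout (fields stored low bit first in 16-bit words) but reads from an explicit vector instead of a global table. `BitWriter` and `BitReader` are hypothetical names, and the loop-based accessors favor clarity over speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical writer: appends fields low bit first into 16-bit words.
class BitWriter {
public:
    BitWriter() : pos_(0) {}
    void put(unsigned value, unsigned size) {
        assert(size <= 16);
        for (unsigned i = 0; i < size; ++i, ++pos_) {
            if (pos_ / 16 >= words_.size()) words_.push_back(0);
            if ((value >> i) & 1u)
                words_[pos_ / 16] |= static_cast<uint16_t>(1u << (pos_ % 16));
        }
    }
    const std::vector<uint16_t>& words() const { return words_; }
private:
    std::vector<uint16_t> words_;
    std::size_t pos_;
};

// Reader over an explicit word vector, one bit at a time.
class BitReader {
public:
    explicit BitReader(const std::vector<uint16_t>& w) : w_(w), pos_(0) {}
    void skip(unsigned bits) { pos_ += bits; }
    unsigned getUI(unsigned size) {
        unsigned result = 0;
        for (unsigned i = 0; i < size; ++i, ++pos_)
            result |= ((w_[pos_ / 16] >> (pos_ % 16)) & 1u) << i;
        return result;
    }
    int getI(unsigned size) {  // sign-extend a size-bit two's complement field
        assert(size > 0);
        int result = static_cast<int>(getUI(size));
        return result | -(result & (1 << (size - 1)));
    }
private:
    const std::vector<uint16_t>& w_;
    std::size_t pos_;
};
```

Writing a few fields and reading them back in the same order is a quick way to check field widths before committing to a format.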
I computed an estimate of the resulting data size. (I am not posting the code that did it because its quality is very low.) The result is 1250 bytes, which is larger than what the best compression programs can achieve, but significantly lower than any of the simple methods.
Update
1250 bytes is not the limit. This algorithm may be improved to compress the data further and to work faster.
I noticed that the number of blocks (5 bits) may be moved from the bitstream to unused bits of the row index table. This saves about 30 bytes.
To save 20 bytes more, you can store the bitstreams in bytes instead of uint16, which saves space on padding bits.
So we have about 1200 bytes, which is not exact. The size may be a little underestimated because I didn't take into account that not every bit depth can be encoded in the row index. It may also be overestimated, because the only heuristic assumed for the encoder was to calculate the bit depth for the first 9 values and to limit the block size only if this bit depth would need to be increased by more than 2 bits. Of course, an encoder may be smarter than this.
Decoding speed may also be increased. If we move the 9th bit of the original values to the row indexes, each element of the index is exactly 8 bits. This allows the bitstreams to start with a set of bytes, each of which may be decoded with faster methods than a general bitstream's accessors. The remaining 8 bits of the original value may be moved to the place just after the row index for the same purpose. Alternatively, they may be included in each index entry, so that the index consists of 16-bit values. After these modifications, the bitstreams contain only data fields of variable length.

1049 bytes
I noticed that most runs are linear. That is why I decided to encode not the delta value but the delta-of-delta. Think of it as a second derivative. This means I store the values -1, 0 and 1 most of the time, with some notable exceptions.
Secondly, I make the data 1-dimensional. Converting it back into 2 dimensions is easy, but having it in 1 dimension permits the compression to span several lines.
The compressed data is organized in varying-size chunks. Each chunk starts with a header:
9 bits - an absolute value, the value of input[x]
7 bits - a difference, the value of input[x+1]-input[x]
7 bits - a difference, the value of input[x+2]-input[x+1]
9 bits - the length of following data of second-derivative
2 bits each - an array of the second-derivative
The runs of second derivatives in this example are surprisingly long, even though only the values -2, -1, 0 and 1 can be stored.
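The header fields and the second differences can be illustrated without any bit packing. The sketch below encodes one run into the same three header values plus a list of second differences and rebuilds it. It is a simplified C++ illustration with hypothetical names (`Run`, `encodeRun`, `decodeRun`), and unlike the full code it does not start a new chunk when a second difference falls outside -2..1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Run {
    int absolute;         // input[x]
    int adiff;            // input[x+1] - input[x]
    int diff;             // input[x+2] - input[x+1]
    std::vector<int> d2;  // second differences of the remaining values
};

Run encodeRun(const std::vector<int>& v) {
    assert(v.size() >= 3);
    Run r;
    r.absolute = v[0];
    r.adiff = v[1] - v[0];
    r.diff = v[2] - v[1];
    int prevDiff = r.diff;
    for (std::size_t i = 3; i < v.size(); ++i) {
        int d = v[i] - v[i - 1];
        r.d2.push_back(d - prevDiff);  // the change of the difference
        prevDiff = d;
    }
    return r;
}

std::vector<int> decodeRun(const Run& r) {
    std::vector<int> v;
    v.push_back(r.absolute);
    v.push_back(r.absolute + r.adiff);
    int diff = r.diff;
    int value = r.absolute + r.adiff + r.diff;
    v.push_back(value);
    for (std::size_t i = 0; i < r.d2.size(); ++i) {
        diff += r.d2[i];  // integrate once to get the difference
        value += diff;    // integrate again to get the value
        v.push_back(value);
    }
    return v;
}
```

For the values 143, 137, 131, 126, 120, 115, 110 from the second row of the table, the header is 143, -6, -6 and the second differences are 1, -1, 1, 0, all of which fit in 2 bits.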
In the following piece of code I provide complete, compilable code. It contains:
C (GCC) code. No C++ constructs.
The input array you provided
Visualization function to print the contents of the array
Compression function (in case your input changes a bit)
Getter function - to fetch an element out from the array
In the main function: I compress, decompress and perform a check
Have fun!
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
typedef int16_t Arr[37][73];
typedef int16_t ArrFlat[37*73];
typedef int16_t* ArrPtr;
Arr input = { {150,145,140,135,130,125,120,115,110,105,100,95,90,85,80,75,70,65,60,55,50,45,40,35,30,25,20,15,10,5,0,-4,-9,-14,-19,-24,-29,-34,-39,-44,-49,-54,-59,-64,-69,-74,-79,-84,-89,-94,-99,104,109,114,119,124,129,134,139,144,149,154,159,164,169,174,179,175,170,165,160,155,150}, \
{143,137,131,126,120,115,110,105,100,95,90,85,80,75,71,66,62,57,53,48,44,39,35,31,27,22,18,14,9,5,1,-3,-7,-11,-16,-20,-25,-29,-34,-38,-43,-47,-52,-57,-61,-66,-71,-76,-81,-86,-91,-96,101,107,112,117,123,128,134,140,146,151,157,163,169,175,178,172,166,160,154,148,143}, \
{130,124,118,112,107,101,96,92,87,82,78,74,70,65,61,57,54,50,46,42,38,34,31,27,23,19,16,12,8,4,1,-2,-6,-10,-14,-18,-22,-26,-30,-34,-38,-43,-47,-51,-56,-61,-65,-70,-75,-79,-84,-89,-94,100,105,111,116,122,128,135,141,148,155,162,170,177,174,166,159,151,144,137,130}, \
{111,104,99,94,89,85,81,77,73,70,66,63,60,56,53,50,46,43,40,36,33,30,26,23,20,16,13,10,6,3,0,-3,-6,-9,-13,-16,-20,-24,-28,-32,-36,-40,-44,-48,-52,-57,-61,-65,-70,-74,-79,-84,-88,-93,-98,103,109,115,121,128,135,143,152,162,172,176,165,154,144,134,125,118,111}, \
{85,81,77,74,71,68,65,63,60,58,56,53,51,49,46,43,41,38,35,32,29,26,23,19,16,13,10,7,4,1,-1,-3,-6,-9,-13,-16,-19,-23,-26,-30,-34,-38,-42,-46,-50,-54,-58,-62,-66,-70,-74,-78,-83,-87,-91,-95,100,105,110,117,124,133,144,159,178,160,141,125,112,103,96,90,85}, \
{62,60,58,57,55,54,52,51,50,48,47,46,44,42,41,39,36,34,31,28,25,22,19,16,13,10,7,4,2,0,-3,-5,-8,-10,-13,-16,-19,-22,-26,-29,-33,-37,-41,-45,-49,-53,-56,-60,-64,-67,-70,-74,-77,-80,-83,-86,-89,-91,-94,-97,101,105,111,130,109,84,77,74,71,68,66,64,62}, \
{46,46,45,44,44,43,42,42,41,41,40,39,38,37,36,35,33,31,28,26,23,20,16,13,10,7,4,1,-1,-3,-5,-7,-9,-12,-14,-16,-19,-22,-26,-29,-33,-36,-40,-44,-48,-51,-55,-58,-61,-64,-66,-68,-71,-72,-74,-74,-75,-74,-72,-68,-61,-48,-25,2,22,33,40,43,45,46,47,46,46}, \
{36,36,36,36,36,35,35,35,35,34,34,34,34,33,32,31,30,28,26,23,20,17,14,10,6,3,0,-2,-4,-7,-9,-10,-12,-14,-15,-17,-20,-23,-26,-29,-32,-36,-40,-43,-47,-50,-53,-56,-58,-60,-62,-63,-64,-64,-63,-62,-59,-55,-49,-41,-30,-17,-4,6,15,22,27,31,33,34,35,36,36}, \
{30,30,30,30,30,30,30,29,29,29,29,29,29,29,29,28,27,26,24,21,18,15,11,7,3,0,-3,-6,-9,-11,-12,-14,-15,-16,-17,-19,-21,-23,-26,-29,-32,-35,-39,-42,-45,-48,-51,-53,-55,-56,-57,-57,-56,-55,-53,-49,-44,-38,-31,-23,-14,-6,0,7,13,17,21,24,26,27,29,29,30}, \
{25,25,26,26,26,25,25,25,25,25,25,25,25,26,25,25,24,23,21,19,16,12,8,4,0,-3,-7,-10,-13,-15,-16,-17,-18,-19,-20,-21,-22,-23,-25,-28,-31,-34,-37,-40,-43,-46,-48,-49,-50,-51,-51,-50,-48,-45,-42,-37,-32,-26,-19,-13,-7,-1,3,7,11,14,17,19,21,23,24,25,25}, \
{21,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,21,20,18,16,13,9,5,1,-3,-7,-11,-14,-17,-18,-20,-21,-21,-22,-22,-22,-23,-23,-25,-27,-29,-32,-35,-37,-40,-42,-44,-45,-45,-45,-44,-42,-40,-36,-32,-27,-22,-17,-12,-7,-3,0,3,7,9,12,14,16,18,19,20,21,21}, \
{18,19,19,19,19,19,19,19,19,19,19,19,19,19,19,19,18,17,16,14,10,7,2,-1,-6,-10,-14,-17,-19,-21,-22,-23,-24,-24,-24,-24,-23,-23,-23,-24,-26,-28,-30,-33,-35,-37,-38,-39,-39,-38,-36,-34,-31,-28,-24,-19,-15,-10,-6,-3,0,1,4,6,8,10,12,14,15,16,17,18,18}, \
{16,16,17,17,17,17,17,17,17,17,17,16,16,16,16,16,16,15,13,11,8,4,0,-4,-9,-13,-16,-19,-21,-23,-24,-25,-25,-25,-25,-24,-23,-21,-20,-20,-21,-22,-24,-26,-28,-30,-31,-32,-31,-30,-29,-27,-24,-21,-17,-13,-9,-6,-3,-1,0,2,4,5,7,9,10,12,13,14,15,16,16}, \
{14,14,14,15,15,15,15,15,15,15,14,14,14,14,14,14,13,12,11,9,5,2,-2,-6,-11,-15,-18,-21,-23,-24,-25,-25,-25,-25,-24,-22,-21,-18,-16,-15,-15,-15,-17,-19,-21,-22,-24,-24,-24,-23,-22,-20,-18,-15,-12,-9,-5,-3,-1,0,1,2,4,5,6,8,9,10,11,12,13,14,14}, \
{12,13,13,13,13,13,13,13,13,13,13,13,12,12,12,12,11,10,9,6,3,0,-4,-8,-12,-16,-19,-21,-23,-24,-24,-24,-24,-23,-22,-20,-17,-15,-12,-10,-9,-9,-10,-12,-13,-15,-17,-17,-18,-17,-16,-15,-13,-11,-8,-5,-3,-1,0,1,1,2,3,4,6,7,8,9,10,11,12,12,12}, \
{11,11,11,11,11,12,12,12,12,12,11,11,11,11,11,10,10,9,7,5,2,-1,-5,-9,-13,-17,-20,-22,-23,-23,-23,-23,-22,-20,-18,-16,-14,-11,-9,-6,-5,-4,-5,-6,-8,-9,-11,-12,-12,-12,-12,-11,-9,-8,-6,-3,-1,0,0,1,1,2,3,4,5,6,7,8,9,10,11,11,11}, \
{10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,7,6,3,0,-3,-6,-10,-14,-17,-20,-21,-22,-22,-22,-21,-19,-17,-15,-13,-10,-8,-6,-4,-2,-2,-2,-2,-4,-5,-7,-8,-8,-9,-8,-8,-7,-5,-4,-2,0,0,1,1,1,2,2,3,4,5,6,7,8,9,10,10,10}, \
{9,9,9,9,9,9,9,10,10,9,9,9,9,9,9,8,8,6,5,2,0,-4,-7,-11,-15,-17,-19,-21,-21,-21,-20,-18,-16,-14,-12,-10,-8,-6,-4,-2,-1,0,0,0,-1,-2,-4,-5,-5,-6,-6,-5,-5,-4,-3,-1,0,0,1,1,1,1,2,3,3,5,6,7,8,8,9,9,9}, \
{9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,7,5,4,1,-1,-5,-8,-12,-15,-17,-19,-20,-20,-19,-18,-16,-14,-11,-9,-7,-5,-4,-2,-1,0,0,1,1,0,0,-2,-3,-3,-4,-4,-4,-3,-3,-2,-1,0,0,0,0,0,1,1,2,3,4,5,6,7,8,8,9,9}, \
{9,9,9,8,8,8,9,9,9,9,9,8,8,8,8,7,6,5,3,0,-2,-5,-9,-12,-15,-17,-18,-19,-19,-18,-16,-14,-12,-9,-7,-5,-4,-2,-1,0,0,1,1,1,1,0,0,-1,-2,-2,-3,-3,-2,-2,-1,-1,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,8,9}, \
{8,8,8,8,8,8,9,9,9,9,9,9,8,8,8,7,6,4,2,0,-3,-6,-9,-12,-15,-17,-18,-18,-17,-16,-14,-12,-10,-8,-6,-4,-2,-1,0,0,1,2,2,2,2,1,0,0,-1,-1,-1,-2,-2,-1,-1,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,8}, \
{8,8,8,8,9,9,9,9,9,9,9,9,9,8,8,7,5,3,1,-1,-4,-7,-10,-13,-15,-16,-17,-17,-16,-15,-13,-11,-9,-6,-5,-3,-2,0,0,0,1,2,2,2,2,1,1,0,0,0,-1,-1,-1,-1,-1,0,0,0,0,-1,-1,-1,-1,-1,0,0,1,3,4,5,7,7,8}, \
{8,8,9,9,9,9,10,10,10,10,10,10,10,9,8,7,5,3,0,-2,-5,-8,-11,-13,-15,-16,-16,-16,-15,-13,-12,-10,-8,-6,-4,-2,-1,0,0,1,2,2,3,3,2,2,1,0,0,0,0,0,0,0,0,0,0,-1,-1,-2,-2,-2,-2,-2,-1,0,0,1,3,4,6,7,8}, \
{7,8,9,9,9,10,10,11,11,11,11,11,10,10,9,7,5,3,0,-2,-6,-9,-11,-13,-15,-16,-16,-15,-14,-13,-11,-9,-7,-5,-3,-2,0,0,1,1,2,3,3,3,3,2,2,1,1,0,0,0,0,0,0,0,-1,-1,-2,-3,-3,-4,-4,-4,-3,-2,-1,0,1,3,5,6,7}, \
{6,8,9,9,10,11,11,12,12,12,12,12,11,11,9,7,5,2,0,-3,-7,-10,-12,-14,-15,-16,-15,-15,-13,-12,-10,-8,-7,-5,-3,-1,0,0,1,2,2,3,3,4,3,3,3,2,2,1,1,1,0,0,0,0,-1,-2,-3,-4,-4,-5,-5,-5,-5,-4,-2,-1,0,2,3,5,6}, \
{6,7,8,10,11,12,12,13,13,14,14,13,13,11,10,8,5,2,0,-4,-8,-11,-13,-15,-16,-16,-16,-15,-13,-12,-10,-8,-6,-5,-3,-1,0,0,1,2,3,3,4,4,4,4,4,3,3,3,2,2,1,1,0,0,-1,-2,-3,-5,-6,-7,-7,-7,-6,-5,-4,-3,-1,0,2,4,6}, \
{5,7,8,10,11,12,13,14,15,15,15,14,14,12,11,8,5,2,-1,-5,-9,-12,-14,-16,-17,-17,-16,-15,-14,-12,-11,-9,-7,-5,-3,-1,0,0,1,2,3,4,4,5,5,5,5,5,5,4,4,3,3,2,1,0,-1,-2,-4,-6,-7,-8,-8,-8,-8,-7,-6,-4,-2,0,1,3,5}, \
{4,6,8,10,12,13,14,15,16,16,16,16,15,13,11,9,5,2,-2,-6,-10,-13,-16,-17,-18,-18,-17,-16,-15,-13,-11,-9,-7,-5,-4,-2,0,0,1,3,3,4,5,6,6,7,7,7,7,7,6,5,4,3,2,0,-1,-3,-5,-7,-8,-9,-10,-10,-10,-9,-7,-5,-4,-1,0,2,4}, \
{4,6,8,10,12,14,15,16,17,18,18,17,16,15,12,9,5,1,-3,-8,-12,-15,-18,-19,-20,-20,-19,-18,-16,-15,-13,-11,-8,-6,-4,-2,-1,0,1,3,4,5,6,7,8,9,9,9,9,9,9,8,7,5,3,1,-1,-3,-6,-8,-10,-11,-12,-12,-11,-10,-9,-7,-5,-2,0,1,4}, \
{4,6,8,11,13,15,16,18,19,19,19,19,18,16,13,10,5,0,-5,-10,-15,-18,-21,-22,-23,-22,-22,-20,-18,-17,-14,-12,-10,-8,-5,-3,-1,0,1,3,5,6,8,9,10,11,12,12,13,12,12,11,9,7,5,2,0,-3,-6,-9,-11,-12,-13,-13,-12,-11,-10,-8,-6,-3,-1,1,4}, \
{3,6,9,11,14,16,17,19,20,21,21,21,19,17,14,10,4,-1,-8,-14,-19,-22,-25,-26,-26,-26,-25,-23,-21,-19,-17,-14,-12,-9,-7,-4,-2,0,1,3,5,7,9,11,13,14,15,16,16,16,16,15,13,10,7,4,0,-3,-7,-10,-12,-14,-15,-14,-14,-12,-11,-9,-6,-4,-1,1,3}, \
{4,6,9,12,14,17,19,21,22,23,23,23,21,19,15,9,2,-5,-13,-20,-25,-28,-30,-31,-31,-30,-29,-27,-25,-22,-20,-17,-14,-11,-9,-6,-3,0,1,4,6,9,11,13,15,17,19,20,21,21,21,20,18,15,11,6,2,-2,-7,-11,-13,-15,-16,-16,-15,-13,-11,-9,-7,-4,-1,1,4}, \
{4,7,10,13,15,18,20,22,24,25,25,25,23,20,15,7,-2,-12,-22,-29,-34,-37,-38,-38,-37,-36,-34,-31,-29,-26,-23,-20,-17,-13,-10,-7,-4,-1,2,5,8,11,13,16,18,21,23,24,26,26,26,26,24,21,17,12,5,0,-6,-10,-14,-16,-16,-16,-15,-14,-12,-10,-7,-4,-1,1,4}, \
{4,7,10,13,16,19,22,24,26,27,27,26,24,19,11,-1,-15,-28,-37,-43,-46,-47,-47,-45,-44,-41,-39,-36,-32,-29,-26,-22,-19,-15,-11,-8,-4,-1,2,5,9,12,15,19,22,24,27,29,31,33,33,33,32,30,26,21,14,6,0,-6,-11,-14,-15,-16,-15,-14,-12,-9,-7,-4,-1,1,4}, \
{6,9,12,15,18,21,23,25,27,28,27,24,17,4,-14,-34,-49,-56,-60,-60,-60,-58,-56,-53,-50,-47,-43,-40,-36,-32,-28,-25,-21,-17,-13,-9,-5,-1,2,6,10,14,17,21,24,28,31,34,37,39,41,42,43,43,41,38,33,25,17,8,0,-4,-8,-10,-10,-10,-8,-7,-4,-2,0,3,6}, \
{22,24,26,28,30,32,33,31,23,-18,-81,-96,-99,-98,-95,-93,-89,-86,-82,-78,-74,-70,-66,-62,-57,-53,-49,-44,-40,-36,-32,-27,-23,-19,-14,-10,-6,-1,2,6,10,15,19,23,27,31,35,38,42,45,49,52,55,57,60,61,63,63,62,61,57,53,47,40,33,28,23,21,19,19,19,20,22}, \
{168,173,178,176,171,166,161,156,151,146,141,136,131,126,121,116,111,106,101,-96,-91,-86,-81,-76,-71,-66,-61,-56,-51,-46,-41,-36,-31,-26,-21,-16,-11,-6,-1,3,8,13,18,23,28,33,38,43,48,53,58,63,68,73,78,83,88,93,98,103,108,113,118,123,128,133,138,143,148,153,158,163,168} };
void visual(Arr arr) {
int row;
int col;
for (row=0; row<37; ++row) {
for (col=0; col<73; ++col)
printf("%3d",arr[row][col]);
printf("\n");
}
}
void visualFlat(ArrFlat arr) {
int cell;
for (cell=0; cell<37*73; ++cell) {
printf("%3d",arr[cell]);
}
printf("\n");
}
typedef struct {
int16_t absolute:9;
int16_t adiff:7;
int16_t diff:7;
unsigned short diff2_length:9;
} __attribute__((packed)) Header;
typedef union {
struct {
int16_t diff2_a:2;
int16_t diff2_b:2;
int16_t diff2_c:2;
int16_t diff2_d:2;
} __attribute__((packed));
unsigned char all;
} Chunk;
int16_t chunkGet(Chunk k, int16_t offset) {
switch (offset) {
case 0 : return k.diff2_a;
case 1 : return k.diff2_b;
case 2 : return k.diff2_c;
case 3 : return k.diff2_d;
default: return 0; /* not reached for offsets 0..3; avoids falling off the end */
}
}
void chunkSet(Chunk *k, int16_t offset, int16_t value) {
switch (offset) {
case 0 : k->diff2_a=value; break;
case 1 : k->diff2_b=value; break;
case 2 : k->diff2_c=value; break;
case 3 : k->diff2_d=value; break;
default: printf("Invalid offset %hd\n", offset);
}
}
unsigned char data[1049];
void compress (ArrFlat src) {
Chunk diffData;
int16_t headerIdx=0;
int16_t diffIdx;
int16_t currentDiffValue;
int16_t length=-3;
int16_t shift=0;
Header h;
int16_t position=0;
while (position<37*73) {
if (length==-3) { //encode the absolute value
h.absolute=currentDiffValue=src[position];
++position;
++length;
continue;
}
if (length==-2) { //encode the first diff value
h.adiff=currentDiffValue=src[position]-src[position-1];
if (currentDiffValue<-64 || currentDiffValue>+63)
printf("\nDIFF TOO BIG\n");
++position;
++length;
continue;
}
if (length==-1) { //encode the second diff value
h.diff=currentDiffValue=src[position]-src[position-1];
if (currentDiffValue<-64 || currentDiffValue>+63)
printf("\nDIFF TOO BIG\n");
++position;
++length;
diffData.all=0;
diffIdx=headerIdx+sizeof(Header);
shift=0;
continue;
}
//compute the diff2
int16_t diff=src[position]-src[position-1];
int16_t diff2=diff-currentDiffValue;
if (diff2>1 || diff2<-2) { //big change - restart with header
if (length>511)
printf("\nLENGTH TOO LONG\n");
if (shift!=0) { //store partial byte
data[diffIdx]=diffData.all;
diffData.all=0;
++diffIdx;
}
h.diff2_length=length;
memcpy(data+headerIdx,&h,sizeof(Header));
headerIdx=diffIdx;
length=-3;
continue;
}
chunkSet(&diffData,shift,diff2);
shift+=1;
currentDiffValue=diff;
++position;
++length;
if (shift==4) {
data[diffIdx]=diffData.all;
diffData.all=0;
++diffIdx;
shift=0;
}
}
if (shift!=0) { //finalize
data[diffIdx]=diffData.all;
++diffIdx;
}
h.diff2_length=length;
memcpy(data+headerIdx,&h,sizeof(Header));
headerIdx=diffIdx;
printf("Ending byte=%hd\n",headerIdx);
}
int16_t get(int row, int col) {
int idx=row*73+col;
int dataIdx=0;
int pos=0;
int16_t absolute;
int16_t diff;
Header h;
while (1) {
memcpy(&h, data+dataIdx, sizeof(Header));
if (idx==pos) return h.absolute;
absolute=h.absolute+h.adiff;
if (idx==pos+1) return absolute;
diff=h.diff;
absolute+=diff;
if (idx==pos+2) return absolute;
dataIdx+=sizeof(Header);
pos+=3;
if (pos+h.diff2_length <= idx) {
pos+=h.diff2_length;
dataIdx+=(h.diff2_length+3)/4;
} else break;
}
int shift=4;
Chunk diffData;
while (pos<=idx) {
if (shift==4) {
diffData.all=data[dataIdx];
++dataIdx;
shift=0;
}
diff+=chunkGet(diffData,shift);
absolute+=diff;
++shift;
++pos;
}
return absolute;
}
int main() {
printf("Input:\n");
visual(input);
int row;
int col;
ArrPtr flatInput=(ArrPtr)input;
printf("sizeof(Header)=%zu\n",sizeof(Header));
printf("sizeof(Chunk)=%zu\n",sizeof(Chunk));
compress(flatInput);
ArrFlat re;
for (row=0; row<37; ++row)
for (col=0; col<73; ++col) {
int cell=row*73+col;
re[cell]=get(row,col);
if (re[cell]!=flatInput[cell])
printf("ERROR DETECTED IN CELL %d\n",cell);
}
visual(re);
return 0;
}
A Visual Studio version (compiled with VS 2010)
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
typedef int16_t Arr[37][73];
typedef int16_t ArrFlat[37*73];
typedef int16_t* ArrPtr;
Arr input = { [... your array as above ...] };
void visual(Arr arr) {
int row;
int col;
for (row=0; row<37; ++row) {
for (col=0; col<73; ++col)
printf("%3d",arr[row][col]);
printf("\n");
}
}
void visualFlat(ArrFlat arr) {
int cell;
for (cell=0; cell<37*73; ++cell) {
printf("%3d",arr[cell]);
}
printf("\n");
}
#pragma pack(1)
typedef struct {
int16_t absolute:9;
int16_t adiff:7;
int16_t diff:7;
unsigned short diff2_length:9;
} Header;
#pragma pack(1)
typedef union {
struct {
char diff2_a:2;
char diff2_b:2;
char diff2_c:2;
char diff2_d:2;
};
unsigned char all;
} Chunk;
int16_t chunkGet(Chunk k, int16_t offset) {
switch (offset) {
case 0 : return k.diff2_a;
case 1 : return k.diff2_b;
case 2 : return k.diff2_c;
case 3 : return k.diff2_d;
default: return 0; /* invalid offset: avoid falling off the end without a return value */
}
}
void chunkSet(Chunk *k, int16_t offset, int16_t value) {
switch (offset) {
case 0 : k->diff2_a=value; break;
case 1 : k->diff2_b=value; break;
case 2 : k->diff2_c=value; break;
case 3 : k->diff2_d=value; break;
default: printf("Invalid offset %hd\n", offset);
}
}
unsigned char data[1049];
void compress (ArrFlat src) {
Chunk diffData;
int16_t headerIdx=0;
int16_t diffIdx;
int16_t currentDiffValue;
int16_t length=-3;
int16_t shift=0;
int16_t diff;
int16_t diff2;
Header h;
int16_t position=0;
while (position<37*73) {
if (length==-3) { //encode the absolute value
h.absolute=currentDiffValue=src[position];
++position;
++length;
continue;
}
if (length==-2) { //encode the first diff value
h.adiff=currentDiffValue=src[position]-src[position-1];
if (currentDiffValue<-64 || currentDiffValue>+63)
printf("\nDIFF TOO BIG\n");
++position;
++length;
continue;
}
if (length==-1) { //encode the second diff value
h.diff=currentDiffValue=src[position]-src[position-1];
if (currentDiffValue<-64 || currentDiffValue>+63)
printf("\nDIFF TOO BIG\n");
++position;
++length;
diffData.all=0;
diffIdx=headerIdx+sizeof(Header);
shift=0;
continue;
}
//compute the diff2
diff=src[position]-src[position-1];
diff2=diff-currentDiffValue;
if (diff2>1 || diff2<-2) { //big change - restart with header
if (length>511)
printf("\nLENGTH TOO LONG\n");
if (shift!=0) { //store partial byte
data[diffIdx]=diffData.all;
diffData.all=0;
++diffIdx;
}
h.diff2_length=length;
memcpy(data+headerIdx,&h,sizeof(Header));
headerIdx=diffIdx;
length=-3;
continue;
}
chunkSet(&diffData,shift,diff2);
shift+=1;
currentDiffValue=diff;
++position;
++length;
if (shift==4) {
data[diffIdx]=diffData.all;
diffData.all=0;
++diffIdx;
shift=0;
}
}
if (shift!=0) { //finalize
data[diffIdx]=diffData.all;
++diffIdx;
}
h.diff2_length=length;
memcpy(data+headerIdx,&h,sizeof(Header));
headerIdx=diffIdx;
printf("Ending byte=%hd\n",headerIdx);
}
int16_t get(int row, int col) {
int idx=row*73+col;
int dataIdx=0;
int pos=0;
int16_t absolute;
int16_t diff;
int shift;
Header h;
Chunk diffData;
while (1) {
memcpy(&h, data+dataIdx, sizeof(Header));
if (idx==pos) return h.absolute;
absolute=h.absolute+h.adiff;
if (idx==pos+1) return absolute;
diff=h.diff;
absolute+=diff;
if (idx==pos+2) return absolute;
dataIdx+=sizeof(Header);
pos+=3;
if (pos+h.diff2_length <= idx) {
pos+=h.diff2_length;
dataIdx+=(h.diff2_length+3)/4;
} else break;
}
shift=4;
while (pos<=idx) {
if (shift==4) {
diffData.all=data[dataIdx];
++dataIdx;
shift=0;
}
diff+=chunkGet(diffData,shift);
absolute+=diff;
++shift;
++pos;
}
return absolute;
}
int main() {
int row;
int col;
ArrPtr flatInput=(ArrPtr)input;
ArrFlat re;
printf("Input:\n");
visual(input);
printf("sizeof(Header)=%lu\n",(unsigned long)sizeof(Header));
printf("sizeof(Chunk)=%lu\n",(unsigned long)sizeof(Chunk));
compress(flatInput);
for (row=0; row<37; ++row)
for (col=0; col<73; ++col) {
int cell=row*73+col;
re[cell]=get(row,col);
if (re[cell]!=flatInput[cell])
printf("ERROR DETECTED IN CELL %d\n",cell);
}
visual(re);
return 0;
}

726 bytes
This algorithm encodes the difference between the actual value and the value produced by linear extrapolation from the previous values. In other words, it uses a first-order Taylor series or, as CygnusX1 calls it, delta-of-delta.
After this extrapolation encoding, most values are in the range [-1 .. 1]. This is a good reason to use arithmetic coding or range encoding. I've implemented the arithmetic coder by Arturo San Emeterio Campos; an algorithm for a range coder by the same author is also available.
Small values in the range [-2 .. 2] are compressed by the arithmetic coder, while larger values are packed in 4-bit nibbles.
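The extrapolation step can be sketched as follows. This is only an illustration of the residual computation, not the code from the linked implementation; the function names encode_residuals and decode_residuals are hypothetical, and the handling of the first two samples (raw value, then plain delta) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Residual of a linear (first-order) extrapolation: the prediction for
   src[i] continues the straight line through the two previous samples. */
static void encode_residuals(const int16_t *src, size_t n, int16_t *res)
{
    assert(src != NULL && res != NULL);
    for (size_t i = 0; i < n; i++) {
        int predicted;
        if (i == 0)
            predicted = 0;                           /* no history: keep the raw value */
        else if (i == 1)
            predicted = src[0];                      /* one sample: plain delta */
        else
            predicted = 2 * src[i - 1] - src[i - 2]; /* linear extrapolation */
        res[i] = (int16_t)(src[i] - predicted);
    }
}

/* Inverse transform: rebuild the samples from the residuals. */
static void decode_residuals(const int16_t *res, size_t n, int16_t *dst)
{
    assert(res != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        int predicted;
        if (i == 0)
            predicted = 0;
        else if (i == 1)
            predicted = dst[0];
        else
            predicted = 2 * dst[i - 1] - dst[i - 2];
        dst[i] = (int16_t)(res[i] + predicted);
    }
}
```

For the samples 10, 12, 14, 17, 19, 20 the residuals are 10, 2, 0, 1, -1, -1: after the first two entries everything is small, which is what makes the entropy coder effective.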
There are also several optimizations used to pack it a little bit tighter:
all values are compressed to one continuous stream
last column is not encoded at all, since it is equal to the first one
while encoding first column, history is updated only partially to improve results for second column
several cases, when value jumps from -100 to 100, are handled differently
This algorithm is slow: it uses up to 8000 32-bit integer divisions and lots of bit manipulation to extract a single value. But it packs the data into a 726-byte array, and the code size is not very big.
Speed may be optimized (to ~2800 32-bit integer divisions) if the frequency table is properly scaled. Using range encoding instead of arithmetic coding may also give some speed improvement. Space may be optimized if both the arithmetic coder data and the nibbles are packed in byte arrays instead of uint16 arrays (2 bytes) and if up to two starting zero bytes are aliased with the end of some other data structure (1..2 bytes). Using a second-order Taylor series did not gain any space, but possibly other methods of extrapolation will give some improvement.
A full implementation can be found here: encoder, decoder and a test. Tested on GCC.

There is another possibility:
Have two arrays: one main and one overflow.
Every element of the main array contains the 7 bits of actual data + 1 "status" bit.
If the status bit is reset, the value fits into the remaining 7 bits.
If the status bit is set, part of the value is still in these 7 bits, but the remaining bits are contained in the overflow array.
The index in the overflow array is found by counting all the preceding elements in the main array that have their status bit set.
This has the following advantages:
Very fast lookup of values that fit into 7 bits.
Can handle values of unlimited range (either by using suitably large elements in the overflow array, or by repeating the algorithm and stacking another overflow array on top etc...).
On the other hand, if you know the values will always fit into 9 bits, use 2-bit elements in the overflow array to save additional space (some bit-twiddling required, but can be done).
For some distributions of data, it may use less space than just using 9-bit elements (either in a single array or in 8-bit array + 1-bit array) - when most values fit into 7 bits.
Fairly simple to implement, so the code size won't eat away the savings made on the data.
Disadvantages:
Slow lookup of values that don't fit into 7 bits. Access to such a value requires linearly traversing all the elements left of it in the main array (and examining their status bits) to determine the index in the overflow array.
For some other distributions of data, it may use more space than 9-bit approach - when there are many values that don't fit into 7 bits.
Not as simple as 8-bit array + 1-bit array approach, so while still not very large, the code will be somewhat larger than that.
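A minimal sketch of the main/overflow scheme described above. The names pack_values and lookup_value are hypothetical, and the sketch assumes unsigned values below 2^15 with one overflow byte per large value; signed data would need a bias or sign handling on top of this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack values into a main array (7 data bits + 1 status bit per element)
   plus an overflow array holding the high bits of values that do not fit.
   Returns the number of overflow entries used.
   Assumption of this sketch: values are unsigned and below 2^15. */
static size_t pack_values(const uint16_t *values, size_t n,
                          uint8_t *main_arr, uint8_t *overflow)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        assert(values[i] < (1u << 15));
        main_arr[i] = (uint8_t)(values[i] & 0x7F);        /* low 7 bits */
        if (values[i] > 0x7F) {
            main_arr[i] |= 0x80;                          /* status bit: rest is in overflow */
            overflow[used++] = (uint8_t)(values[i] >> 7); /* remaining high bits */
        }
    }
    return used;
}

/* Look up element `index`. Small values return immediately; large values
   require counting the status bits to the left to find the overflow slot. */
static uint16_t lookup_value(const uint8_t *main_arr, const uint8_t *overflow,
                             size_t index)
{
    uint8_t m = main_arr[index];
    if ((m & 0x80) == 0)
        return m;                       /* fast path: value fits into 7 bits */
    size_t k = 0;
    for (size_t i = 0; i < index; i++)
        k += main_arr[i] >> 7;          /* count preceding status bits */
    return (uint16_t)((overflow[k] << 7) | (m & 0x7F));
}
```

The linear scan in lookup_value is the slow path mentioned in the disadvantages; caching a running count every few dozen elements would be one way to bound it, at the cost of extra data.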

Don't forget to check the size of the compiled code if the sum of code+data sizes is important. Here is an example that uses normal 8-bit encoding for the data (50% gain) and optimizes for code size.
We'll store 8-bit values for each row:
unsigned char *row_data = &compressed_data[row*73];
int value = row_data[column];
For the first rows, break them in two. The first value will be encoded directly. The next part will use a negative delta from the first value. The second part will be encoded as a positive delta from 100.
if (row <= 4) {
char split = break_point[row];
if (column >= split) return 100 + value;
if (column == 0) return value;
return row_data[0] - value;
}
The break_point would be the position of the 104, 101, 100, 103, 110 in the first five rows. I haven't checked if it can be computed rather than stored. Is it perhaps 51+row?
After the 5th row the values become smoother, so we can just store them in 8-bit two's complement. The exception is the last row.
if (row != 36) return (signed char) value;
The last row can be encoded like this, without any data (which saves 73 bytes):
value = 168+5*column;
if (value <= 178) return value;
value = 359 - x; /* 359 = 176 + 183 */
if (value >= 101) return value;
value = -x;
if (value > 0) x--;
return value;
This would require about 2640 bytes, but it would be very fast and compact to access.
The first row could be encoded similar to the last (with an increment at -5, a sign change at -104, and a 359-x "flip" at 184) saving 70 bytes of data at some cost in code size.

If the duplicates are contiguous and you have extra CPU, you could use a run-length encoding.
The dataset, sadly, looks too dense for a DFA... but you can totally get one working. It'll require preprocessing and be super fast. The assembly might exceed the 4K dataset, so it may not be an option.
Assuming your 16-bit values are infrequent, a hash might work for the extra large entries (see: google sparsehash)... there's a 1-bit+ overhead per entity.
You can also use 9-bit values and manage your memory byte boundaries manually, which is the same overhead as a separate bit array... maybe more.
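The 9-bit option can be sketched like this. The helper names put9 and get9 are hypothetical, the bit order within the buffer (least significant bit first) is an arbitrary choice, and values are assumed to be unsigned in the range 0..511; signed data could be stored with a fixed offset added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Densely packed 9-bit fields: element i occupies bits [9*i, 9*i + 8] of the
   buffer. A 9-bit field always touches exactly two consecutive bytes, so a
   16-bit window is enough. The buffer needs (9*n + 7) / 8 bytes for n values. */
static void put9(uint8_t *buf, size_t index, uint16_t value)
{
    assert(value < 512);
    size_t bit = index * 9;
    size_t byte = bit / 8;
    unsigned shift = (unsigned)(bit % 8);
    uint32_t window = (uint32_t)buf[byte] | ((uint32_t)buf[byte + 1] << 8);
    window &= ~((uint32_t)0x1FF << shift);   /* clear the old field */
    window |= (uint32_t)value << shift;      /* insert the new field */
    buf[byte] = (uint8_t)window;
    buf[byte + 1] = (uint8_t)(window >> 8);
}

static uint16_t get9(const uint8_t *buf, size_t index)
{
    size_t bit = index * 9;
    size_t byte = bit / 8;
    unsigned shift = (unsigned)(bit % 8);
    uint32_t window = (uint32_t)buf[byte] | ((uint32_t)buf[byte + 1] << 8);
    return (uint16_t)((window >> shift) & 0x1FF);
}
```

Eight values fit into exactly nine bytes this way, which matches the overhead of an 8-bit array plus a separate 1-bit array.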

Given an array of uint8_t what is a good way to extract any subsequence of bits as a uint32_t?

I have run into an interesting problem lately:
Let's say I have an array of bytes (uint8_t to be exact) of length at least one. Now I need a function that gets a subsequence of bits from this array, starting at bit X (zero-based index, inclusive) and having length L, and returns it as a uint32_t. If L is smaller than 32, the remaining high bits should be zero.
Although this is not very hard to solve, my current thoughts on how to do this seem a bit cumbersome to me. I'm thinking of a table of all the possible masks for a given byte (start with bit 0-7, take 1-8 bits) and then construct the number one byte at a time using this table.
Can somebody come up with a nicer solution? Note that I cannot use Boost or STL for this. And no, it is not homework; it's a problem I ran into at work, and we do not use Boost or STL in the code where this goes. You can assume that 0 < L <= 32 and that the byte array is large enough to hold the subsequence.
One example of correct input/output:
array: 00110011 1010 1010 11110011 01 101100
subsequence: X = 12 (zero based index), L = 14
resulting uint32_t = 00000000 00000000 00 101011 11001101

Only the first and last bytes in the subsequence will involve some bit slicing to get the required bits out, while the intermediate bytes can be shifted in whole into the result. Here's some sample code, absolutely untested -- it does what I described, but some of the bit indices could be off by one:
uint8_t bytes[];
int X, L;
uint32_t result;
int startByte = X / 8, /* starting byte number */
startBit = 7 - X % 8, /* index of the first wanted bit within starting byte, from LSB */
endByte = (X + L - 1) / 8, /* byte number that holds the last wanted bit */
endBit = 7 - (X + L - 1) % 8; /* index of the last wanted bit within ending byte, from LSB */
/* Special case where start and end are within same byte:
just get bits from startBit down to endBit, inclusive */
if (startByte == endByte) {
uint8_t byte = bytes[startByte];
result = (byte >> endBit) & ((1 << (startBit - endBit + 1)) - 1);
}
/* All other cases: get ending bits of starting byte,
all other bytes in between,
starting bits of ending byte */
else {
uint8_t byte = bytes[startByte];
result = byte & ((1 << (startBit + 1)) - 1);
for (int i = startByte + 1; i < endByte; i++)
result = (result << 8) | bytes[i];
byte = bytes[endByte];
result = (result << (8 - endBit)) | (byte >> endBit);
}
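Index arithmetic in a slicing routine like this is easy to get wrong, so a slow bit-at-a-time reference can be useful for checking it. The sketch below is not part of the original answer; the name extract_bits_ref is hypothetical, and it assumes the MSB-first bit numbering used in the question's example.

```c
#include <assert.h>
#include <stdint.h>

/* Reference implementation: read `count` bits starting at bit `start`,
   where bit 0 is the most significant bit of bytes[0]. One bit per
   iteration, so it is slow but has no boundary special cases. */
static uint32_t extract_bits_ref(const uint8_t *bytes, uint32_t start, uint32_t count)
{
    assert(count >= 1 && count <= 32);
    uint32_t result = 0;
    for (uint32_t i = 0; i < count; i++) {
        uint32_t pos = start + i;
        uint32_t bit = (bytes[pos / 8] >> (7 - pos % 8)) & 1u;
        result = (result << 1) | bit;   /* append the next bit on the right */
    }
    return result;
}
```

With the question's example bytes {0x33, 0xAA, 0xF3, 0x6C}, X = 12 and L = 14, this returns 0x2BCD, which is 00101011 11001101 in binary.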

Take a look at std::bitset and boost::dynamic_bitset.

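I would be thinking something like loading the relevant bytes into a uint64_t and then shifting left and right to lose the uninteresting bits. (Casting the byte pointer to uint64_t* directly would depend on alignment and byte order, so the bytes are loaded one at a time here.)
uint32_t extract_bits(const uint8_t* bytes, int start, int count)
{
int skip = start % 8; /* unwanted bits in front of the field */
int nbytes = (skip + count + 7) / 8; /* 1..5 bytes contain the field */
int shiftleft = 64 - 8 * nbytes + skip;
int shiftright = 64 - count;
uint64_t hold = 0;
for (int i = 0; i < nbytes; i++) /* load with the first byte in the highest position */
hold = (hold << 8) | bytes[start / 8 + i];
hold <<= shiftleft;
hold >>= shiftright;
return (uint32_t)hold;
}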

For the sake of completeness, I'm adding my solution, inspired by the comments and answers here. Thanks to all who bothered to think about the problem.
static const uint8_t firstByteMasks[8] = { 0xFF, 0x7F, 0x3F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
uint32_t getBits( const uint8_t *buf, const uint32_t bitoff, const uint32_t len, const uint32_t bitcount )
{
uint64_t result = 0;
int32_t startByte = bitoff / 8; // starting byte number
int32_t endByte = ((bitoff + bitcount) - 1) / 8; // ending byte number
int32_t rightShift = 16 - ((bitoff + bitcount) % 8 );
if ( endByte >= len ) return -1;
if ( rightShift == 16 ) rightShift = 8;
result = buf[startByte] & firstByteMasks[bitoff % 8];
result = result << 8;
for ( int32_t i = startByte + 1; i <= endByte; i++ )
{
result |= buf[i];
result = result << 8;
}
result = result >> rightShift;
return (uint32_t)result;
}
A few notes: I tested the code and it seems to work just fine; however, there may be bugs. If I find any, I will update the code here. Also, there are probably better solutions!
