What transformation solves this bit-twiddling pattern? - bit-manipulation

I have a series of nibbles (0x0 - 0xF) and a transformation whose results show a clear pattern. The transformation itself simply isn't coming to me (not for lack of trying). Is there an experienced bit twiddler out there who recognizes it? Thanks in advance.
0 -> 0
1 -> 1
2 -> 1
3 -> 0
----------
4 -> 2
5 -> 3
6 -> 3
7 -> 2
----------
8 -> 7
9 -> 6
A -> 6
B -> 7
----------
C -> 5
D -> 4
E -> 4
F -> 5

Looking at the individual bits it appears that one possible relationship between input and output would be:
Y0 = X0^X1^X3
Y1 = X2^X3
Y2 = X3
Y3 = 0
where X0..X3 are the input bits and Y0..Y3 are the output bits.
However this requires around 10 or more bitwise operations to implement, so you might be better off just using a lookup table.
Here is a test program in C which verifies that the bitwise logic is correct:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
static int convert_bitwise(int x)
{
    int x0 = x & 1;
    int x1 = (x & 2) >> 1;
    int x2 = (x & 4) >> 2;
    int x3 = x >> 3;
    int y0 = x0 ^ x1 ^ x3;
    int y1 = x2 ^ x3;
    int y2 = x3;
    return (y2 << 2) | (y1 << 1) | y0;
}

static int convert_lut(int x)
{
    const int LUT[16] = { 0, 1, 1, 0,
                          2, 3, 3, 2,
                          7, 6, 6, 7,
                          5, 4, 4, 5 };
    return LUT[x & 0x0f];
}

int main(int argc, char *argv[])
{
    int x;
    for (x = 0; x < 16; ++x)
    {
        int y_bitwise = convert_bitwise(x);
        int y_lut = convert_lut(x);
        printf("x = %2d, y (bitwise) = %d, y (LUT) = %d, (%s)\n", x, y_bitwise, y_lut, y_bitwise == y_lut ? "PASS" : "FAIL");
    }
    return 0;
}
Test:
$ gcc -Wall bits4.c && ./a.out
x = 0, y (bitwise) = 0, y (LUT) = 0, (PASS)
x = 1, y (bitwise) = 1, y (LUT) = 1, (PASS)
x = 2, y (bitwise) = 1, y (LUT) = 1, (PASS)
x = 3, y (bitwise) = 0, y (LUT) = 0, (PASS)
x = 4, y (bitwise) = 2, y (LUT) = 2, (PASS)
x = 5, y (bitwise) = 3, y (LUT) = 3, (PASS)
x = 6, y (bitwise) = 3, y (LUT) = 3, (PASS)
x = 7, y (bitwise) = 2, y (LUT) = 2, (PASS)
x = 8, y (bitwise) = 7, y (LUT) = 7, (PASS)
x = 9, y (bitwise) = 6, y (LUT) = 6, (PASS)
x = 10, y (bitwise) = 6, y (LUT) = 6, (PASS)
x = 11, y (bitwise) = 7, y (LUT) = 7, (PASS)
x = 12, y (bitwise) = 5, y (LUT) = 5, (PASS)
x = 13, y (bitwise) = 4, y (LUT) = 4, (PASS)
x = 14, y (bitwise) = 4, y (LUT) = 4, (PASS)
x = 15, y (bitwise) = 5, y (LUT) = 5, (PASS)
$

How does the Catch2 GENERATE macro work internally?

Recently I learned about the GENERATE macro in Catch2 (from this video). And now I am curious about how it works internally.
Naively one would think that for a test case with k generators (by a generator I mean one GENERATE call site), Catch2 just runs each test case n1 * n2 * ... * nk times, where ni is the number of elements in the i-th generator, each time specifying a different combination of values from those k generators. Indeed, this naive specification seems to hold for a simple test case:
TEST_CASE("Naive") {
    auto x = GENERATE(0, 1);
    auto y = GENERATE(2, 3);
    std::cout << "x = " << x << ", y = " << y << std::endl;
}
As expected, the output is:
x = 0, y = 2
x = 0, y = 3
x = 1, y = 2
x = 1, y = 3
which indicates the test case runs for 2 * 2 == 4 times.
However, it seems that Catch2 isn't implementing it naively, as shown by the following case:
TEST_CASE("Depends on if") {
    auto choice = GENERATE(0, 1);
    int x = -1, y = -1;
    if (choice == 0) {
        x = GENERATE(2, 3);
    } else {
        y = GENERATE(4, 5);
    }
    std::cout << "choice = " << choice << ", x = " << x << ", y = " << y << std::endl;
}
In the above case, the actual invocation (not callsite) of GENERATE depends on choice. If the logic were implemented naively, one would expect there to be 8 lines of output (since 2 * 2 * 2 == 8):
choice = 0, x = 2, y = -1
choice = 0, x = 2, y = -1
choice = 0, x = 3, y = -1
choice = 0, x = 3, y = -1
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 5
choice = 1, x = -1, y = 5
Notice the duplicate lines: the naive permutation still advances a generator's value even if it is not actually invoked. For example, y = GENERATE(4, 5) is only invoked when choice == 1; yet even when choice != 1, a naive implementation would still permute the values 4 and 5, although they are never used.
The actual output, though, is:
choice = 0, x = 2, y = -1
choice = 0, x = 3, y = -1
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 5
No duplicate lines. This leads me to suspect that Catch2 internally uses a stack to track the invoked generators and the order of their latest invocation. Each time the test case finishes one iteration, it traverses the invoked generators in reverse order and advances each generator's value. If the advancement fails (i.e. the generator's sequence of values is exhausted), that generator is reset to its initial state (ready to emit the first value in its sequence); otherwise, the traversal bails out.
In pseudocode it would look like:
for each generator invoked, in reverse order of latest invocation:
    bool success = generator.moveNext();
    if success: break;
    generator.reset();
This explains the previous cases perfectly. But it does not explain this (rather obscure) one:
TEST_CASE("Non structured generators") {
    int x = -1, y = -1;
    for (int i = 0; i <= 1; ++i) {
        x = GENERATE(0, 1);
        if (i == 1) break;
        y = GENERATE(2, 3);
    }
    std::cout << "x = " << x << ", y = " << y << std::endl;
}
One would expect this to run 4 == 2 * 2 times, with the output being:
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
(The x changes before y since x = GENERATE(0, 1) is the last generator invoked)
However, this is not what Catch2 actually does; here is what happens in reality:
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
8 lines of output, which is the first four lines repeated twice.
So my question is, how exactly is GENERATE in Catch2 implemented? I am not looking particularly for detailed code, but a high-level description that could explain what I have seen in the previous examples.
Maybe you can try looking at the code generated after the preprocessor runs, using GCC's -E option.
a.c:
GENERATE(0,1)
gcc -E -CC a.c
See also: How to make G++ preprocessor output a newline in a macro?

Using less matrices with BLAS

I'm quite new to BLAS (using OpenBLAS with C++ and Visual Studio).
I know dgemm performs C <- alpha * op(A) * op(B) + beta * C.
I was trying to save an allocation by doing B <- 1 * op(A) * op(B) + 0 * B,
in other words, putting the result in the B matrix.
BUT setting beta = 0 and repeating B in the position of C results in a zero answer.
Is there a way to make this work?
The code that I'm using:
double* A = new double [3*3]; //3 rows x 3 columns
A[0] = 8;
A[1] = 3;
A[2] = 4;
A[3] = 1;
A[4] = 5;
A[5] = 9;
A[6] = 6;
A[7] = 7;
A[8] = 2;
double* v = new double[3]; //3 rows x 1 column
v[0] = 3;
v[1] = 5;
v[2] = 2;
double* foo = new double[3]; //3 rows x 1 column
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
3, 1, 3,
1,
A, 3,
v, 3,
0,
foo, 3); // makes foo = [41 ; 48 ; 61], **right**
cblas_dgemm(CblasColMajor, CblasTrans, CblasTrans,
3, 1, 3,
1,
A, 3,
v, 3,
0,
v, 3); // makes v = [0 ; 0 ; 0], **wrong**
The BLAS dgemm documentation states that only the C matrix parameter is for both input and output, being overwritten by the result of the operation. Since B is defined as input-only, BLAS implementations are free to assume it is not modified (and, in particular, that it does not alias C).
Setting B and C to the same data pointer could be triggering some error check in the implementation you're using, which returns the zeroed result to indicate that.

Finding where an object is accessed/modified

I'm looking for something similar to call hierarchy for functions and type hierarchy for classes, but for understanding how an object is accessed and modified. For example, consider the following:
1 void F(int x, int &y, int* z) {
2 int a = y;
3 a++;
4 x++;
5 y++;
6 int* b = z;
7 int &c = *b;
8 c++;
9 }
10 int main() {
11 int x0 = 5;
12 int y0 = 9;
13 int z0 = 13;
14 F(x0, y0, &z0);
15 return 0;
16 }
I want to get something similar to the following:
x0 is modified in 11 and accessed in 11 and 14.
y0 is modified in 5, 12, 14 and accessed in 1, 2, 5, 12, 14.
z0 is modified in 8, 13, 14 and accessed in 1, 6, 7, 8, 13, 14.
Is there an Eclipse feature or plugin for this?

Bit manipulation: keeping the common part at the left of the last different bit

Consider two numbers written in binary (MSB at left):
X = x7 x6 x5 x4 x3 x2 x1 x0
and
Y = y7 y6 y5 y4 y3 y2 y1 y0
These numbers can have an arbitrary number of bits but both are of the same type. Now consider that x7 == y7, x6 == y6, x5 == y5, but x4 != y4.
How to compute:
Z = x7 x6 x5 0 0 0 0 0
or in other words, how do you efficiently compute a number that keeps the common part to the left of the highest differing bit?
template <typename T>
inline T f(const T x, const T y)
{
// Something here
}
For example, for:
x = 10100101
y = 10110010
it should return
z = 10100000
Note: this is for supercomputing purposes and the operation will be executed hundreds of billions of times, so scanning the bits one by one should be avoided...
My answer is based on @JerryCoffin's.
int d = x ^ y;
d = d | (d >> 1);
d = d | (d >> 2);
d = d | (d >> 4);
d = d | (d >> 8);
d = d | (d >> 16);
int z = x & (~d);
Part of this problem shows up semi-regularly in bit-manipulation: "parallel suffix with OR", or "prefix" (that is, depending on who you listen to, the low bits are either called a suffix or a prefix). Obviously once you have a way to do that, it's trivial to extend it to what you want (as shown in the other answers).
Anyway, the obvious way is:
x |= x >> 1
x |= x >> 2
x |= x >> 4
x |= x >> 8
x |= x >> 16
But you're probably not constrained to simple operators.
For Haswell, the fastest way I found was:
lzcnt rax, rax ; number of leading zeroes, sets carry if rax=0
mov edx, 64
sub edx, eax
mov rax, -1
bzhi rax, rax, rdx ; reset the bits in rax starting at position rdx
Other contenders were:
mov rdx, -1
bsr rax, rax ; position of the highest set bit, set Z flag if no bit
cmovz rdx, rax ; set rdx=rax iff Z flag is set
xor eax, 63
shrx rax, rdx, rax ; rax = rdx >> rax
And
lzcnt rax, rax
sbb rdx, rdx ; rdx -= rdx + carry (so 0 if no carry, -1 if carry)
not rdx
shrx rax, rdx, rax
But they were not as fast.
I've also considered
lzcnt rax, rax
mov rax, [table+rax*8]
But it's hard to compare it fairly, since it's the only one that spends cache space, which has non-local effects.
Benchmarking various ways to do this led to this question about some curious behaviour of lzcnt.
They all rely on some fast way to determine the position of the highest set bit, which you could do with a cast to float and exponent extraction if you really had to, so probably most platforms can use something like it.
A shift that gives zero if the shift-count is equal to or bigger than the operand size would be very nice to solve this problem. x86 doesn't have one, but maybe your platform does.
If you had a fast bit-reversal instruction, you could do something like: (this isn't intended to be ARM asm)
rbit r0, r0
neg r1, r0
or r0, r1, r0
rbit r0, r0
Comparing several algorithms leads to this ranking.
With an inner loop of 1 or 10 in the test below:
1. Utilizing a built-in bit scan function.
2. Filling least significant bits with OR and shift (the function of @Egor Skriptunoff).
3. Involving a lookup table.
4. Scanning the most significant bit (the second function of @Tomas).
InnerLoops = 10:
Timing 1: 0.101284
Timing 2: 0.108845
Timing 3: 0.102526
Timing 4: 0.191911
With an inner loop of 100 or greater:
1. Utilizing a built-in bit scan function.
2. Involving a lookup table.
3. Filling least significant bits with OR and shift (the function of @Egor Skriptunoff).
4. Scanning the most significant bit (the second function of @Tomas).
InnerLoops = 100:
Timing 1: 0.441786
Timing 2: 0.507651
Timing 3: 0.548328
Timing 4: 0.593668
The test:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <iostream>
#include <iomanip>
// Functions
// =========
inline unsigned function1(unsigned a, unsigned b)
{
    a ^= b;
    if (a) {
        int n = __builtin_clz(a);
        a = (~0u) >> n;
    }
    return ~a & b;
}
typedef std::uint8_t byte;
static byte msb_table[256] = {
0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
};
inline unsigned function2(unsigned a, unsigned b)
{
    a ^= b;
    if (a) {
        unsigned n = 0;
        if (a >> 24) n = msb_table[byte(a >> 24)] + 24;
        else if (a >> 16) n = msb_table[byte(a >> 16)] + 16;
        else if (a >> 8) n = msb_table[byte(a >> 8)] + 8;
        else n = msb_table[byte(a)];
        a = (~0u) >> (32 - n);
    }
    return ~a & b;
}
inline unsigned function3(unsigned a, unsigned b)
{
    unsigned d = a ^ b;
    d = d | (d >> 1);
    d = d | (d >> 2);
    d = d | (d >> 4);
    d = d | (d >> 8);
    d = d | (d >> 16);
    return a & (~d);
}
inline unsigned function4(unsigned a, unsigned b)
{
    const unsigned maxbit = 1u << (std::numeric_limits<unsigned>::digits - 1);
    unsigned msb = maxbit;
    a ^= b;
    while (!(a & msb))
        msb >>= 1;
    if (msb == maxbit) return 0;
    else {
        msb <<= 1;
        msb -= 1;
        return ~msb & b;
    }
}
// Test
// ====
inline double duration(
    std::chrono::system_clock::time_point start,
    std::chrono::system_clock::time_point end)
{
    return double((end - start).count())
        / std::chrono::system_clock::period::den;
}
int main() {
    typedef unsigned (*Function)(unsigned, unsigned);
    Function fn[] = {
        function1,
        function2,
        function3,
        function4,
    };
    const unsigned N = sizeof(fn) / sizeof(fn[0]);
    std::chrono::system_clock::duration timing[N] = {};
    const unsigned OuterLoops = 1000000;
    const unsigned InnerLoops = 100;
    const unsigned Samples = OuterLoops * InnerLoops;
    unsigned* A = new unsigned[Samples];
    unsigned* B = new unsigned[Samples];
    for (unsigned i = 0; i < Samples; ++i) {
        A[i] = std::rand();
        B[i] = std::rand();
    }
    unsigned F[N];
    for (unsigned f = 0; f < N; ++f) F[f] = f;
    unsigned result[N];
    for (unsigned i = 0; i < OuterLoops; ++i) {
        std::random_shuffle(F, F + N);
        for (unsigned f = 0; f < N; ++f) {
            unsigned g = F[f];
            auto start = std::chrono::system_clock::now();
            for (unsigned j = 0; j < InnerLoops; ++j) {
                unsigned index = i + j;
                unsigned a = A[index];
                unsigned b = B[index];
                result[g] = fn[g](a, b);
            }
            auto end = std::chrono::system_clock::now();
            timing[g] += (end - start);
        }
        for (unsigned f = 1; f < N; ++f) {
            if (result[0] != result[f]) {
                std::cerr << "Different Results\n" << std::hex;
                for (unsigned g = 0; g < N; ++g)
                    std::cout << "Result " << g+1 << ": " << result[g] << '\n';
                exit(-1);
            }
        }
    }
    for (unsigned i = 0; i < N; ++i) {
        std::cout
            << "Timing " << i+1 << ": "
            << double(timing[i].count()) / std::chrono::system_clock::period::den
            << "\n";
    }
}
Compiler:
g++ 4.7.2
Hardware:
Intel® Core™ i3-2310M CPU @ 2.10GHz × 4, 7.7 GiB
It's a little ugly, but assuming 8-bit inputs, you can do something like this:
int x = 0xA5; // 1010 0101
int y = 0xB2; // 1011 0010
unsigned d = x ^ y;
int mask = ~(d | (d >> 1) | (d >> 2) | (d >> 3) | (d >> 4) | (d >> 5) | (d >> 6) | (d >> 7));
int z = x & mask;
We start by computing the exclusive-or of the numbers, which will give a 0 where they're equal, and a 1 where they're different. For your example, that gives:
00010111
We then shift that right by each of the 7 possible bit positions and inclusive-OR it with itself:
00010111
00001011
00000101
00000010
00000001
That gives:
00011111
Which is 0's where the original numbers were equal, and 1's where they were different. We then invert that to get:
11100000
Then we and that with one of the original inputs (doesn't matter which) to get:
10100000
...exactly the result we wanted (and unlike a simple x & y, it'll also work for other values of x and y).
Of course, this can be extended out to an arbitrary width, but if you were working with (say) 64-bit numbers, the d | (d>>1) | ... | (d>>63); would be a little on the long and clumsy side.
You may reduce it to the much easier problem of finding the highest set bit (highest 1), which is the same as computing floor(log2 X).
unsigned int x, y, c, m;
int b;
c = x ^ y; // xor : 00010111
// now it comes: b = number of highest set bit in c
// perhaps some special operation or instruction exists for that
b = -1;
while (c) {
b++;
c = c >> 1;
} // b == 4
m = (1 << (b + 1)) - 1; // creates a mask: 00011111
return x & ~m; // x AND NOT M
return y & ~m; // should return the same result
In fact, if you can easily round c up to the next power of two (i.e. compute 2^(b+1) directly), then subtracting 1 gives you m without computing b in the loop above.
If you don't have such functionality, simple optimized code, which uses just basic assembly level operations (bit shifts by one bit: <<=1, >>=1) would look like this:
c = x ^ y; // c == 00010111 (xor)
m = 1;
while (c) {
m <<= 1;
c >>= 1;
} // m == 00100000
m--; // m == 00011111 (mask)
return x & ~m; // x AND NOT M
This can be compiled to a very fast code, mostly like one or two machine instructions per line.

Elegant way the find the Vertices of a Cube

Nearly every OpenGL tutorial has you implement drawing a cube, so the vertices of the cube are needed. In the example code I saw a long list defining every vertex. But I would like to compute the vertices of a cube rather than use an overlong list of precomputed coordinates.
A cube is made of eight vertices and twelve triangles. Vertices are defined by x, y, and z. Triangles are defined each by the indexes of three vertices.
Is there an elegant way to compute the vertices and the element indexes of a cube?
When I was "porting" the csg.js project to Java, I found some cute code which generates a cube with a chosen center point and radius. (I know it's JS, but anyway.)
// Construct an axis-aligned solid cuboid. Optional parameters are `center` and
// `radius`, which default to `[0, 0, 0]` and `[1, 1, 1]`. The radius can be
// specified using a single number or a list of three numbers, one for each axis.
//
// Example code:
//
// var cube = CSG.cube({
// center: [0, 0, 0],
// radius: 1
// });
CSG.cube = function(options) {
    options = options || {};
    var c = new CSG.Vector(options.center || [0, 0, 0]);
    var r = !options.radius ? [1, 1, 1] : options.radius.length ?
        options.radius : [options.radius, options.radius, options.radius];
    return CSG.fromPolygons([
        [[0, 4, 6, 2], [-1, 0, 0]],
        [[1, 3, 7, 5], [+1, 0, 0]],
        [[0, 1, 5, 4], [0, -1, 0]],
        [[2, 6, 7, 3], [0, +1, 0]],
        [[0, 2, 3, 1], [0, 0, -1]],
        [[4, 5, 7, 6], [0, 0, +1]]
    ].map(function(info) {
        return new CSG.Polygon(info[0].map(function(i) {
            var pos = new CSG.Vector(
                c.x + r[0] * (2 * !!(i & 1) - 1),
                c.y + r[1] * (2 * !!(i & 2) - 1),
                c.z + r[2] * (2 * !!(i & 4) - 1)
            );
            return new CSG.Vertex(pos, new CSG.Vector(info[1]));
        }));
    }));
};
I solved this problem with this piece of code (C#):
public CubeShape(Coord3 startPos, int size) {
    int l = size / 2;
    verts = new Coord3[8];
    for (int i = 0; i < 8; i++) {
        verts[i] = new Coord3(
            (i & 4) != 0 ? l : -l,
            (i & 2) != 0 ? l : -l,
            (i & 1) != 0 ? l : -l) + startPos;
    }
    tris = new Tris[12];
    int vertCount = 0;
    void AddVert(int one, int two, int three) =>
        tris[vertCount++] = new Tris(verts[one], verts[two], verts[three]);
    for (int i = 0; i < 3; i++) {
        int v1 = 1 << i;
        int v2 = v1 == 4 ? 1 : v1 << 1;
        AddVert(0, v1, v2);
        AddVert(v1 + v2, v2, v1);
        AddVert(7, 7 - v2, 7 - v1);
        AddVert(7 - (v1 + v2), 7 - v1, 7 - v2);
    }
}
If you want to understand more of what is going on, you can check out the github page I wrote that explains it.