Recently I learned about the GENERATE macro in Catch2 (from this video). And now I am curious about how it works internally.
Naively one would think that for a test case with k generators (by a generator I mean one GENERATE call site), Catch2 just runs each test case n1 * n2 * ... * nk times, where ni is the number of elements in the i-th generator, each time specifying a different combination of values from those k generators. Indeed, this naive specification seems to hold for a simple test case:
TEST_CASE("Naive") {
auto x = GENERATE(0, 1);
auto y = GENERATE(2, 3);
std::cout << "x = " << x << ", y = " << y << std::endl;
}
As expected, the output is:
x = 0, y = 2
x = 0, y = 3
x = 1, y = 2
x = 1, y = 3
which indicates that the test case runs 2 * 2 == 4 times.
However, it seems that Catch2 isn't implementing it naively, as shown by the following case:
TEST_CASE("Depends on if") {
auto choice = GENERATE(0, 1);
int x = -1, y = -1;
if (choice == 0) {
x = GENERATE(2, 3);
} else {
y = GENERATE(4, 5);
}
std::cout << "choice = " << choice << ", x = " << x << ", y = " << y << std::endl;
}
In the above case, the actual invocation (not callsite) of GENERATE depends on choice. If the logic were implemented naively, one would expect there to be 8 lines of output (since 2 * 2 * 2 == 8):
choice = 0, x = 2, y = -1
choice = 0, x = 2, y = -1
choice = 0, x = 3, y = -1
choice = 0, x = 3, y = -1
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 5
choice = 1, x = -1, y = 5
Notice the duplicate lines: the naive permutation still permutes the values of a generator even if it is not actually invoked. For example, y = GENERATE(4, 5) is only invoked if choice == 1. Yet even when choice != 1, the naive implementation would still permute the values 4 and 5, although they are never used.
The actual output, though, is:
choice = 0, x = 2, y = -1
choice = 0, x = 3, y = -1
choice = 1, x = -1, y = 4
choice = 1, x = -1, y = 5
No duplicate lines. This leads me to suspect that Catch2 internally uses a stack to track the generators invoked and the order of their latest invocation. Each time a test case finishes one iteration, it traverses the invoked generators in reverse order and advances each generator's value. If such an advancement fails (i.e. the sequence of values inside the generator is exhausted), that generator is reset to its initial state (i.e. ready to emit the first value in the sequence) and the traversal continues with the next generator; otherwise (the advancement succeeded), the traversal bails out.
In pseudocode it would look like:
for each generator that is invoked in reverse order of latest invocation:
bool success = generator.moveNext();
if success: break;
generator.reset();
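The guess above can be checked with a small C++ simulation. This is only a sketch of the hypothesized scheme, not Catch2's actual implementation: `Gen`, `advance` and `runDependsOnIf` are made-up names, and the body of the "Depends on if" test case is replayed by hand.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one GENERATE call site: a fixed list of values
// plus a cursor. This models the guess above, not Catch2's real classes.
struct Gen {
    std::vector<int> values;
    std::size_t pos = 0;

    int get() const { return values[pos]; }
    // Step to the next value; returns false when the sequence is exhausted.
    bool moveNext() { return ++pos < values.size(); }
    void reset() { pos = 0; }
};

// Advance the generators invoked during the last run, most recent first.
// Returns false once every generator has wrapped around, i.e. all
// combinations have been visited.
bool advance(std::vector<Gen*>& invoked) {
    for (auto it = invoked.rbegin(); it != invoked.rend(); ++it) {
        if ((*it)->moveNext()) return true;  // this generator had a next value
        (*it)->reset();                      // wrap around, carry to the previous one
    }
    return false;
}

// Replay the "Depends on if" test body under this model and collect
// {choice, x, y} for every run.
std::vector<std::vector<int>> runDependsOnIf() {
    Gen choice{{0, 1}}, gx{{2, 3}}, gy{{4, 5}};
    std::vector<std::vector<int>> runs;
    bool more = true;
    while (more) {
        std::vector<Gen*> invoked;  // rebuilt on every run, in invocation order
        int x = -1, y = -1;
        invoked.push_back(&choice);
        const int c = choice.get();
        if (c == 0) { invoked.push_back(&gx); x = gx.get(); }
        else        { invoked.push_back(&gy); y = gy.get(); }
        runs.push_back({c, x, y});
        more = advance(invoked);
    }
    return runs;
}
```

Under this model, `runDependsOnIf()` yields four runs, (0, 2, -1), (0, 3, -1), (1, -1, 4) and (1, -1, 5), matching the four lines Catch2 printed.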
This explains the previous cases perfectly. But it does not explain this (rather obscure) one:
TEST_CASE("Non structured generators") {
int x = -1, y = -1;
for (int i = 0; i <= 1; ++i) {
x = GENERATE(0, 1);
if (i == 1) break;
y = GENERATE(2, 3);
}
std::cout << "x = " << x << ", y = " << y << std::endl;
}
One would expect this to run 4 == 2 * 2 times, with the output being:
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
(x changes before y because x = GENERATE(0, 1) is the last generator invoked.)
However, this is not what Catch2 actually does. This is what happens in reality:
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
x = 0, y = 2
x = 1, y = 2
x = 0, y = 3
x = 1, y = 3
8 lines of output, which is the first four lines repeated twice.
So my question is, how exactly is GENERATE in Catch2 implemented? I am not looking particularly for detailed code, but a high-level description that could explain what I have seen in the previous examples.
Maybe you can try to see the code generated after the pre-processor using the -E option in GCC.
a.c:
GENERATE(0,1)
gcc -E -CC a.c
How to make G++ preprocessor output a newline in a macro?
Consider two numbers written in binary (MSB at left):
X = x7 x6 x5 x4 x3 x2 x1 x0
and
Y = y7 y6 y5 y4 y3 y2 y1 y0
These numbers can have an arbitrary number of bits but both are of the same type. Now consider that x7 == y7, x6 == y6, x5 == y5, but x4 != y4.
How to compute:
Z = x7 x6 x5 0 0 0 0 0
or in other words, how to efficiently compute a number that keeps only the common part to the left of the most significant differing bit?
template <typename T>
inline T f(const T x, const T y)
{
// Something here
}
For example, for:
x = 10100101
y = 10110010
it should return
z = 10100000
Note: this is for supercomputing purposes and the operation will be executed hundreds of billions of times, so scanning the bits one by one should be avoided...
My answer is based on @JerryCoffin's.
int d = x ^ y;
d = d | (d >> 1);
d = d | (d >> 2);
d = d | (d >> 4);
d = d | (d >> 8);
d = d | (d >> 16);
int z = x & (~d);
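To fit the template signature from the question, the same or-and-shift idea can be written generically. This is a minimal sketch assuming an unsigned integer type; the name `commonPrefix` is made up for illustration, and the loop simply doubles the shift until it reaches the type's width.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Keep the bits of x above the most significant bit where x and y differ,
// and clear everything from that bit downward.
template <typename T>
inline T commonPrefix(const T x, const T y) {
    static_assert(std::is_unsigned<T>::value, "use an unsigned integer type");
    T d = x ^ y;  // 1 wherever the inputs differ
    // Smear the highest set bit of d into every lower position.
    for (int shift = 1; shift < std::numeric_limits<T>::digits; shift *= 2) {
        d |= static_cast<T>(d >> shift);
    }
    return static_cast<T>(x & static_cast<T>(~d));
}
```

For the example from the question, `commonPrefix<std::uint8_t>(0xA5, 0xB2)` gives `0xA0`. The shift amounts are compile-time constants once `T` is known, so a compiler can unroll the loop into the straight-line form shown above.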
Part of this problem shows up semi-regularly in bit-manipulation: "parallel suffix with OR", or "prefix" (that is, depending on who you listen to, the low bits are either called a suffix or a prefix). Obviously once you have a way to do that, it's trivial to extend it to what you want (as shown in the other answers).
Anyway, the obvious way is:
x |= x >> 1
x |= x >> 2
x |= x >> 4
x |= x >> 8
x |= x >> 16
But you're probably not constrained to simple operators.
For Haswell, the fastest way I found was:
lzcnt rax, rax ; number of leading zeroes, sets carry if rax=0
mov edx, 64
sub edx, eax
mov rax, -1
bzhi rax, rax, rdx ; reset the bits in rax starting at position rdx
Other contenders were:
mov rdx, -1
bsr rax, rax ; position of the highest set bit, set Z flag if no bit
cmovz rdx, rax ; set rdx=rax iff Z flag is set
xor eax, 63
shrx rax, rdx, rax ; rax = rdx >> rax
And
lzcnt rax, rax
sbb rdx, rdx ; rdx -= rdx + carry (so 0 if no carry, -1 if carry)
not rdx
shrx rax, rdx, rax
But they were not as fast.
I've also considered
lzcnt rax, rax
mov rax, [table+rax*8]
But it's hard to compare it fairly, since it's the only one that spends cache space, which has non-local effects.
Benchmarking various ways to do this led to this question about some curious behaviour of lzcnt.
They all rely on some fast way to determine the position of the highest set bit, which you could do with a cast to float and exponent extraction if you really had to, so probably most platforms can use something like it.
A shift that gives zero if the shift-count is equal to or bigger than the operand size would be very nice to solve this problem. x86 doesn't have one, but maybe your platform does.
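In C++ the "highest set bit" approach can be sketched with a compiler builtin. This assumes GCC or Clang (`__builtin_clzll` is not standard C++), and the function name is made up. Shifting an all-ones word right by the leading-zero count keeps the shift count in the range 0 to 63, which sidesteps the full-width shift problem. The zero case still needs an explicit check, because the builtin is undefined for an input of 0.

```cpp
#include <cstdint>

// Assumes GCC or Clang: __builtin_clzll is a compiler builtin, not standard C++.
inline std::uint64_t commonPrefixClz(std::uint64_t x, std::uint64_t y) {
    const std::uint64_t d = x ^ y;      // 1 wherever the inputs differ
    if (d == 0) return x;               // identical inputs; clz of 0 is undefined
    const int lz = __builtin_clzll(d);  // number of leading zero bits in d
    // All bits at or below the highest set bit of d. lz is in [0, 63],
    // so the shift count is always valid.
    const std::uint64_t mask = ~std::uint64_t{0} >> lz;
    return x & ~mask;
}
```

Whether this compiles to `lzcnt` or `bsr` depends on the target flags, so timings may differ from the hand-written assembly above.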
If you had a fast bit-reversal instruction, you could do something like: (this isn't intended to be ARM asm)
rbit r0, r0
neg r1, r0
or r0, r1, r0
rbit r0, r0
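The same idea in portable C++, as a sketch for 32-bit values: `reverseBits` stands in for the hardware bit-reversal instruction and `smearRight` is a made-up name. After reversal, `r | -r` sets every bit at or above the lowest set bit, and reversing again turns that into every bit at or below the original highest set bit.

```cpp
#include <cstdint>

// Portable 32-bit bit reversal by swapping progressively larger groups.
inline std::uint32_t reverseBits(std::uint32_t v) {
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);
    return (v >> 16) | (v << 16);
}

// Set every bit at or below the highest set bit of v (0 stays 0).
inline std::uint32_t smearRight(std::uint32_t v) {
    std::uint32_t r = reverseBits(v);
    r |= (std::uint32_t{0} - r);  // sets every bit at or above the lowest set bit of r
    return reverseBits(r);
}
```

The result for the original problem is then `x & ~smearRight(x ^ y)`. Without a hardware bit reversal, the software reversal costs more than the plain or-and-shift sequence, so this is only interesting where such an instruction exists.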
Comparing several algorithms leads to the rankings below, fastest first. In the timings, "Timing k" is the time in seconds of functionk in the test below.
With an inner loop of 1 or 10 in the test below:
1. Utilizing a built-in bit scan function.
2. Filling the least significant bits with or and shift (the function of @Egor Skriptunoff).
3. Involving a lookup table.
4. Scanning for the most significant bit (the second function of @Tomas).
InnerLoops = 10:
Timing 1: 0.101284
Timing 2: 0.108845
Timing 3: 0.102526
Timing 4: 0.191911
With an inner loop of 100 or greater:
1. Utilizing a built-in bit scan function.
2. Involving a lookup table.
3. Filling the least significant bits with or and shift (the function of @Egor Skriptunoff).
4. Scanning for the most significant bit (the second function of @Tomas).
InnerLoops = 100:
Timing 1: 0.441786
Timing 2: 0.507651
Timing 3: 0.548328
Timing 4: 0.593668
The test:
#include <algorithm>
#include <chrono>
#include <cstdint>  // std::uint8_t
#include <cstdlib>  // std::rand, exit
#include <limits>
#include <iostream>
#include <iomanip>
// Functions
// =========
inline unsigned function1(unsigned a, unsigned b)
{
a ^= b;
if(a) {
int n = __builtin_clz (a);
a = (~0u) >> n;
}
return ~a & b;
}
typedef std::uint8_t byte;
static byte msb_table[256] = {
0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
};
inline unsigned function2(unsigned a, unsigned b)
{
a ^= b;
if(a) {
unsigned n = 0;
if(a >> 24) n = msb_table[byte(a >> 24)] + 24;
else if(a >> 16) n = msb_table[byte(a >> 16)] + 16;
else if(a >> 8) n = msb_table[byte(a >> 8)] + 8;
else n = msb_table[byte(a)];
a = (~0u) >> (32-n);
}
return ~a & b;
}
inline unsigned function3(unsigned a, unsigned b)
{
unsigned d = a ^ b;
d = d | (d >> 1);
d = d | (d >> 2);
d = d | (d >> 4);
d = d | (d >> 8);
d = d | (d >> 16);
return a & (~d);
}
inline unsigned function4(unsigned a, unsigned b)
{
const unsigned maxbit = 1u << (std::numeric_limits<unsigned>::digits - 1);
unsigned msb = maxbit;
a ^= b;
if(!a) return b; // identical inputs: without this check the loop below never terminates
while( ! (a & msb))
msb >>= 1;
if(msb == maxbit) return 0;
else {
msb <<= 1;
msb -= 1;
return ~msb & b;
}
}
// Test
// ====
inline double duration(
std::chrono::system_clock::time_point start,
std::chrono::system_clock::time_point end)
{
return double((end - start).count())
/ std::chrono::system_clock::period::den;
}
int main() {
typedef unsigned (*Function)(unsigned , unsigned);
Function fn[] = {
function1,
function2,
function3,
function4,
};
const unsigned N = sizeof(fn) / sizeof(fn[0]);
std::chrono::system_clock::duration timing[N] = {};
const unsigned OuterLoops = 1000000;
const unsigned InnerLoops = 100;
const unsigned Samples = OuterLoops * InnerLoops;
unsigned* A = new unsigned[Samples];
unsigned* B = new unsigned[Samples];
for(unsigned i = 0; i < Samples; ++i) {
A[i] = std::rand();
B[i] = std::rand();
}
unsigned F[N];
for(unsigned f = 0; f < N; ++f) F[f] = f;
unsigned result[N];
for(unsigned i = 0; i < OuterLoops; ++i) {
std::random_shuffle(F, F + N);
for(unsigned f = 0; f < N; ++f) {
unsigned g = F[f];
auto start = std::chrono::system_clock::now();
for(unsigned j = 0; j < InnerLoops; ++j) {
unsigned index = i + j;
unsigned a = A[index];
unsigned b = B[index];
result[g] = fn[g](a, b);
}
auto end = std::chrono::system_clock::now();
timing[g] += (end-start);
}
for(unsigned f = 1; f < N; ++f) {
if(result[0] != result[f]) {
std::cerr << "Different Results\n" << std::hex;
for(unsigned g = 0; g < N; ++g)
std::cout << "Result " << g+1 << ": " << result[g] << '\n';
exit(-1);
}
}
}
for(unsigned i = 0; i < N; ++i) {
std::cout
<< "Timing " << i+1 << ": "
<< double(timing[i].count()) / std::chrono::system_clock::period::den
<< "\n";
}
}
Compiler:
g++ 4.7.2
Hardware:
Intel® Core™ i3-2310M CPU @ 2.10GHz × 4, 7.7 GiB
It's a little ugly, but assuming 8-bit inputs, you can do something like this:
int x = 0xA5; // 1010 0101
int y = 0xB2; // 1011 0010
unsigned d = x ^ y;
int mask = ~(d | (d >> 1) | (d >> 2) | (d >> 3) | (d >> 4) | (d >> 5) | (d >> 6) | (d >> 7));
int z = x & mask;
We start by computing the exclusive-or of the numbers, which will give a 0 where they're equal, and a 1 where they're different. For your example, that gives:
00010111
We then shift that right and inclusive-or it with itself each of 7 possible bit positions:
00010111
00001011
00000101
00000010
00000001
That gives:
00011111
Which is 0's across the common leading part, and 1's from the first differing bit downward. We then invert that to get:
11100000
Then we and that with one of the original inputs (doesn't matter which) to get:
10100000
...exactly the result we wanted (and unlike a simple x & y, it'll also work for other values of x and y).
Of course, this can be extended out to an arbitrary width, but if you were working with (say) 64-bit numbers, the d | (d>>1) | ... | (d>>63); would be a little on the long and clumsy side.
You may reduce it to the much easier problem of finding the highest set bit (highest 1), which is the same as finding floor(log2 X).
unsigned int x, y, c, m;
int b;
c = x ^ y; // xor : 00010111
// now it comes: b = number of highest set bit in c
// perhaps some special operation or instruction exists for that
b = -1;
while (c) {
b++;
c = c >> 1;
} // b == 4
m = (1 << (b + 1)) - 1; // creates a mask: 00011111
return x & ~m; // x AND NOT M
return y & ~m; // should return the same result
In fact, if you can compute b = floor(log2 c) directly, then m = 2^(b+1) - 1 follows immediately, without the need for the loop above.
If you don't have such functionality, simple optimized code that uses just basic assembly-level operations (bit shifts by one bit: <<=1, >>=1) would look like this:
c = x ^ y; // c == 00010111 (xor)
m = 1;
while (c) {
m <<= 1;
c >>= 1;
} // m == 00100000
m--; // m == 00011111 (mask)
return x & ~m; // x AND NOT M
This can be compiled to very fast code, mostly one or two machine instructions per line.
Nearly every OpenGL tutorial has you draw a cube, which requires the cube's vertices. In the example code I saw a long list defining every vertex. But I would like to compute the vertices of a cube rather than use an overlong list of precomputed coordinates.
A cube is made of eight vertices and twelve triangles. Vertices are defined by x, y, and z. Triangles are defined each by the indexes of three vertices.
Is there an elegant way to compute the vertices and the element indexes of a cube?
When I was "porting" the csg.js project to Java, I found some cute code that generates a cube with a given center point and radius. (I know it's JS, but anyway.)
// Construct an axis-aligned solid cuboid. Optional parameters are `center` and
// `radius`, which default to `[0, 0, 0]` and `[1, 1, 1]`. The radius can be
// specified using a single number or a list of three numbers, one for each axis.
//
// Example code:
//
// var cube = CSG.cube({
// center: [0, 0, 0],
// radius: 1
// });
CSG.cube = function(options) {
options = options || {};
var c = new CSG.Vector(options.center || [0, 0, 0]);
var r = !options.radius ? [1, 1, 1] : options.radius.length ?
options.radius : [options.radius, options.radius, options.radius];
return CSG.fromPolygons([
[[0, 4, 6, 2], [-1, 0, 0]],
[[1, 3, 7, 5], [+1, 0, 0]],
[[0, 1, 5, 4], [0, -1, 0]],
[[2, 6, 7, 3], [0, +1, 0]],
[[0, 2, 3, 1], [0, 0, -1]],
[[4, 5, 7, 6], [0, 0, +1]]
].map(function(info) {
return new CSG.Polygon(info[0].map(function(i) {
var pos = new CSG.Vector(
c.x + r[0] * (2 * !!(i & 1) - 1),
c.y + r[1] * (2 * !!(i & 2) - 1),
c.z + r[2] * (2 * !!(i & 4) - 1)
);
return new CSG.Vertex(pos, new CSG.Vector(info[1]));
}));
}));
};
I solved this problem with this piece of code (C#):
public CubeShape(Coord3 startPos, int size) {
int l = size / 2;
verts = new Coord3[8];
for (int i = 0; i < 8; i++) {
verts[i] = new Coord3(
(i & 4) != 0 ? l : -l,
(i & 2) != 0 ? l : -l,
(i & 1) != 0 ? l : -l) + startPos;
}
tris = new Tris[12];
int vertCount = 0;
void AddVert(int one, int two, int three) =>
tris[vertCount++] = new Tris(verts[one], verts[two], verts[three]);
for (int i = 0; i < 3; i++) {
int v1 = 1 << i;
int v2 = v1 == 4 ? 1 : v1 << 1;
AddVert(0, v1, v2);
AddVert(v1 + v2, v2, v1);
AddVert(7, 7 - v2, 7 - v1);
AddVert(7 - (v1 + v2), 7 - v1, 7 - v2);
}
}
If you want to understand more of what is going on, you can check out the github page I wrote that explains it.
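Since the question is about OpenGL, here is a rough C++ port of the same bit-pattern idea that produces a vertex array and an index array. It is a sketch only: `Vec3`, `CubeMesh` and `makeCube` are made-up names, and the triangle order is copied from the C# loop above without checking that all faces end up with a consistent winding.

```cpp
#include <array>
#include <vector>

struct Vec3 { float x, y, z; };

struct CubeMesh {
    std::array<Vec3, 8> vertices;
    std::vector<unsigned> indices;  // three indices per triangle
};

// Build a cube centered at `center` with half-extent `h`.
// Vertex i takes +h on an axis when the matching bit of i is set
// (bit 2 -> x, bit 1 -> y, bit 0 -> z), otherwise -h.
inline CubeMesh makeCube(Vec3 center, float h) {
    CubeMesh mesh{};
    for (unsigned i = 0; i < 8; ++i) {
        mesh.vertices[i] = {
            center.x + ((i & 4u) ? h : -h),
            center.y + ((i & 2u) ? h : -h),
            center.z + ((i & 1u) ? h : -h)};
    }
    // For each pair of axis bits (v1, v2), emit the face on which the
    // remaining bit is clear and the opposite face on which it is set,
    // two triangles each.
    for (unsigned i = 0; i < 3; ++i) {
        const unsigned v1 = 1u << i;
        const unsigned v2 = (v1 == 4u) ? 1u : v1 << 1;
        const unsigned tris[4][3] = {
            {0, v1, v2}, {v1 + v2, v2, v1},
            {7, 7 - v2, 7 - v1}, {7 - (v1 + v2), 7 - v1, 7 - v2}};
        for (const auto& t : tris)
            mesh.indices.insert(mesh.indices.end(), {t[0], t[1], t[2]});
    }
    return mesh;
}
```

`makeCube({0.f, 0.f, 0.f}, 1.f)` returns 8 vertices and 36 indices (12 triangles), which can be uploaded as a vertex buffer and an element buffer. If back-face culling is enabled, verify the winding of each face first.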