I have an integer constant, let's say:
expr x = ctx.int_const("x");
What I'm trying to do is apply random constraints on the individual bits of x. However, it turns out you cannot use bit-wise operations with integer sorts, only bit-vectors. My initial approach before realizing this was this:
for(int i = 0; i < 32; i++){
int mask = 0x00000001 << i;
if(rand()%2)
solver.add((x & mask) == 0);
else
solver.add((x & mask) != 0);
}
This of course does not work, as Z3 throws an exception.
After a bit of digging through the API, I found the Z3_mk_int2bv function, and figured I'd give that a try:
for(int i = 0; i < 32; i++){
if(rand()%2)
solver.add(z3::expr(ctx(),Z3_mk_int2bv(ctx(), 32, v())).extract(i, i) == ctx().bv_val(0, 1));
else
solver.add(z3::expr(ctx(),Z3_mk_int2bv(ctx(), 32, v())).extract(i, i) != ctx().bv_val(0, 1));
}
While no assertion gets thrown on the above solver add calls, the actual solving time suddenly exploded. So much so that I have yet to see how long it actually takes. Adding similar expressions using bit-vectors does not take a major toll on the SAT solver, with the solver time being less than a second as far as I can tell.
I'm wondering what it is about the above expression that could cause the solver performance to degrade so badly, and whether there's a better approach?
int2bv is expensive. There are many reasons for this, but the bottom line is that the solver now has to negotiate between the theory of integers and the theory of bit-vectors, and the heuristics probably don't help much. Notice that to do a proper conversion the solver has to perform repeated divisions, which is quite costly. Furthermore, talking about the bits of a mathematical integer doesn't make much sense to begin with: What if it's a negative number? Do you assume some sort of an infinite-width 2's complement representation? Or is it some other mapping? All this makes it harder to reason with such conversions, and for a long time int2bv was uninterpreted in z3 for this and similar reasons. You can find many posts regarding this on Stack Overflow, for instance see here: Z3 : Questions About Z3 int2bv?
Your best bet would be to simply use bit-vectors from the start. If you're reasoning about machine arithmetic, why not model everything with bit-vectors?
If you're stuck with the Int type, my recommendation would be to simply stick to the mod function, making sure the second argument is a constant. This might avoid some of the complexity, but without looking at actual problems, it's hard to opine any further.
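To make the mod suggestion concrete: for a non-negative integer, bit i equals (x div 2^i) mod 2, where both divisors are constants once i is fixed. The sketch below is plain C++ that only checks this arithmetic identity; it is not Z3 code, and the helper name bit_via_mod is made up for illustration. The same shape would presumably carry over to constraints on the Int sort.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bit i of a non-negative integer, computed with
// division and modulo by constants only (no bit-wise operators).
int bit_via_mod(std::int64_t x, int i) {
    assert(x >= 0 && i >= 0 && i < 62);
    std::int64_t pow2 = 1;
    for (int k = 0; k < i; ++k)
        pow2 *= 2;                       // 2^i, a constant for a fixed i
    return static_cast<int>((x / pow2) % 2);
}
```

For example, bit_via_mod(5, 0) is 1 and bit_via_mod(5, 1) is 0, matching the binary representation 101.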
Often I convert some if statements into boolean expressions for code compactness. For instance, if I have something like
foo(int x)
{
if (x > 5) return 100 + 5;
return 100;
}
I'll do it like
foo(int x)
{
return 100 + (x > 5) * 5;
}
This is very simple so no problem, the thing is when I have multiple tests, I can greatly simplify them (at the expense of readability but that's a different issue).
So the question is if that (x > 5) evaluation is as onerous as explicitly branching with it.
In both cases the expression (x > 5) has to be checked to see whether it evaluates to true. And as demonstrated already, both versions compile to the same assembly even without any optimization enabled.
However, the Philosophy section of C++ Core Guidelines has these two rules you would do well to pay heed to:
P.1: Express ideas directly in code
P.3: Express intent
Though these rules cannot be enforced in any way, adhering to them will make you adopt the version with the if statement.
Doing so will make it less onerous for those who have to maintain the code after you and even yourself a few months later.
You seem to be conflating C++ language constructs with patterns in the assembly. It may have been viable to reason about code on this level given the compilers of the late eighties or early nineties. At this point, however, compilers apply a lot of optimizations and transformations whose correctness or utility is not even obvious to the average programmer. A very simple example is the common beginner's mistake of assuming the following equivalences:
std::uint16_t a = ...;
a *= 2; // a multiplication in assembly
a *= 17; // ditto
a /= 3; // a division in assembly
They may then be surprised to find out that their compiler of choice translates these into the assembly equivalent of e.g.:
a <<= 1u;
a = (a << 4u) + a; // or even (a << 4u) | a if a < 16
a *= 43691u;
Note that the last transformation is only allowed if a is known to be a multiple of the divisor, so you may not see this kind of optimization all too often. How does it even work? In mathematical terms, uint16_t can be thought of as the residue class ring Z/(2^16)Z, and in this ring, there exists a multiplicative inverse for any element that is coprime to 2^16 (i.e. not divisible by 2). If d (e.g. 3) is coprime to 2, it has such an inverse, and then dividing by d is simply equivalent to multiplying by the inverse of d if the remainder is known to be zero. (I won't go into how this inverse can be calculated here.)
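A small sketch of that inverse trick, under the stated precondition that the value is a multiple of 3. The function name div3_exact is made up for illustration; the constant comes from 3 * 43691 = 131073 = 2 * 65536 + 1, so 43691 is the inverse of 3 modulo 2^16.

```cpp
#include <cassert>
#include <cstdint>

// Exact division by 3 in uint16_t arithmetic, done as a multiplication by
// the multiplicative inverse of 3 modulo 2^16.
// Only valid when a is a multiple of 3.
std::uint16_t div3_exact(std::uint16_t a) {
    assert(a % 3 == 0);
    // The cast back to uint16_t performs the reduction modulo 2^16.
    return static_cast<std::uint16_t>(a * 43691u);
}
```

For instance, div3_exact(9) yields 3 and div3_exact(65535) yields 21845, the same results as a / 3.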
Here is another surprising optimization:
long arithsum(long n)
{
long result = 0;
for (long i=0; i<=n; ++i)
result += i;
return result;
}
GCC with -O3 rather mundanely translates this into an unrolled loop of additions. My version (9.0.0svn-something) of Clang, however, will pull a Gauss on you if you do this, and translate this into something like:
long arithsum(long n)
{
return (n * (n+1)) >> 1;
}
Anyway, the same caveats apply to if/switch etc. – while these are control flow structures, and so you'd think they correspond to branching, this may not be so. Likewise, what appears to be a non-branching operation might be translated to a branching operation if the compiler has an optimization rule under which this seems beneficial, or even if it is just unable to translate its own AST or intermediate representation into machine code without use of branching (on the given architecture).
TL;DR: Before you try to outsmart your compiler, figure out which assembly the compiler produces for the straightforward / readable code in the first place. If this assembly is good, there is no point in making the code more subtle / less readable.
Assuming by onerous you mean the 1/0 result of the comparison: sure, it works in C/C++ due to the implicit conversion from bool to int, but it might not in other languages. If that's what you want to achieve, why not use the ternary operator (? :), which also makes the code more readable:
foo(int x) {
return (x > 5) ? (100 + 5) : 100;
}
Also read this stackoverflow article -- bool to int conversion
Is there any difference in the performance of the following code snippets? Which one performs best and why?
int i = 1000000000;
while(i != 0) { i--; }
or
int i = 1000000000;
while(i) { i--; }
or
int i = 1000000000;
while(i > 0) { i--; }
I see a lot of people use the first example and wonder why. Easier to read?
They are all the same in this context and any decent compiler will generate equivalent code for all three.
In any case, trying to hand-optimize trivial things like this (integer comparisons) is pointless. Your compiler will figure it out and do a much better job during code-gen than you ever could. So just stop trying and instead just write the most readable code you can and then trust the compiler - in any case, none of this makes any performance difference.
Is there any difference in the performance of the following code snippets?
No.
First two are equivalent, and all three can be optimized to exactly same assembly.
I see a lot of people use the first example and wonder why. Easier to read?
It requires the reader to know fewer language rules than the second one. In particular, the second program requires the knowledge that conditional expression is converted to bool, and that the conversion from int has the same result as inequality with zero.
Note that if i were replaced with a floating point number, or if the decrement were modified to have more complexity (for example: decrement by 2), then the third option would be easiest to prove correct. With integers and single decrement, there is no difference.
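To illustrate the decrement-by-2 case: with an odd starting value the counter steps over 0, so a loop written with i != 0 would not stop there, while i > 0 still terminates. A minimal sketch (the function name count_iterations is made up for illustration):

```cpp
#include <cassert>

// Counts the iterations of a countdown loop that uses the `i > 0` condition.
// With step 2 and an odd start, i goes 7, 5, 3, 1, -1 and never equals 0,
// so only the `i > 0` form is guaranteed to stop.
int count_iterations(int start, int step) {
    assert(step > 0);
    int iterations = 0;
    for (int i = start; i > 0; i -= step)
        ++iterations;
    return iterations;
}
```

Here count_iterations(7, 2) runs the body for i = 7, 5, 3 and 1, i.e. four times.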
In my software I am using the input values from the user at run time and performing some mathematical operations. Consider for simplicity below example:
int multiply(const int a, const int b)
{
if(a >= INT_MAX || b >= INT_MAX)
return 0;
else
return a*b;
}
I can check if the input values are greater than the limits, but how do I check if the result will be out of limits? It is quite possible that a = INT_MAX - 1 and b = 2. Since the inputs are perfectly valid, it will execute the undefined code which makes my program meaningless. This means any code executed after this will be random and eventually may result in crash. So how do I protect my program in such cases?
This really comes down to what you actually want to do in this case.
For a machine where long or long long (or int64_t) is a 64-bit value, and int is a 32-bit value, you could do (I'm assuming long is 64 bit here):
long x = static_cast<long>(a) * b;
if (x > INT_MAX || x < INT_MIN)
return 0;
else
return static_cast<int>(x);
By casting one value to long, the other will have to be converted as well. You can cast both if that makes you happier. The overhead here, above a normal 32-bit multiply, is a couple of clock cycles on modern CPUs, and it's unlikely that you can find a safer solution that is also faster. [You can, in some compilers, add attributes to the if saying that it's unlikely, to encourage branch prediction "to get it right" for the common case of returning x]
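A self-contained variant of the same idea, as a sketch: it uses std::int64_t instead of long so the 64-bit assumption is explicit, and reports failure through a bool rather than returning 0. The name checked_multiply and the out-parameter interface are my own choices, not part of the question.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Multiplies in 64 bits and stores the product in `out` only if it fits in
// an int. Returns false (leaving `out` untouched) if it would overflow.
bool checked_multiply(int a, int b, int& out) {
    static_assert(sizeof(int) < sizeof(std::int64_t), "needs a wider type than int");
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    if (wide > std::numeric_limits<int>::max() ||
        wide < std::numeric_limits<int>::min())
        return false;
    out = static_cast<int>(wide);
    return true;
}
```

With a = INT_MAX - 1 and b = 2 from the question, the function returns false instead of executing the overflowing multiplication.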
Obviously, this won't work for values where the type is already as big as the biggest integer you can deal with. You could possibly use floating point, but it may still be a bit dodgy, since the precision of float is not sufficient; it could be done using some "safety margin" [e.g. compare against LONG_MAX / 2] if you don't need the entire range of integers. The penalty here is a bit worse, though, especially since transitions between float and integer aren't "pleasant".
Another alternative is to actually test the relevant code with "known invalid values", and check that the rest of the code is "ok" with the result. Make sure you test this with the relevant compiler settings, as changing the compiler options will change the behaviour. Note that your code then has to deal with "what do we do when 65536 * 100000 is a negative number", which your code didn't expect. Perhaps add something like:
int x = a * b;
if (x < 0) return 0;
[But this only works if you don't expect negative results, of course]
You could also inspect the assembly code generated and understand the architecture of the actual processor [the key here is to understand if "overflow will trap" - which it won't by default in x86, ARM, 68K, 29K. I think MIPS has an option of "trap on overflow"], and determine whether it's likely to cause a problem [1], and add something like
#if !(defined(__X86__) || defined(__ARM__))
#error This code needs inspecting for correct behaviour
#endif
return a * b;
One problem with this approach, however, is that even the slightest changes in code, or compiler version may alter the outcome, so it's important to couple this with the testing approach above (and make sure you test the ACTUAL production code, not some hacked up mini-example).
[1] The "undefined behaviour" is undefined to allow C to "work" on processors that have trapping overflows of integer math, as well as the fact that a * b when it overflows in a signed value is of course hard to determine unless you have a defined math system (two's complement, one's complement, distinct sign bit) - so to avoid "defining" the exact behaviour in these cases, the C standard says "It's undefined". It doesn't mean that it will definitely go bad.
Specifically for the multiplication of a by b the mathematically correct way to detect if it will overflow is to calculate log₂ of both values. If their sum is higher than the log₂ of the highest representable value of the result, then there is overflow.
log₂(a) + log₂(b) < log₂(UINT_MAX)
The difficulty is to quickly calculate the log₂ of an integer. For that, there are several bit twiddling hacks that can be used, like counting bits or counting leading zeros (some processors even have instructions for that). This site has several implementations:
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
The simplest implementation could be:
unsigned int log2(unsigned int v)
{
unsigned int r = 0;
while (v >>= 1)
r++;
return r;
}
In your program you only need to check then
if(log2(a) + log2(b) < MYLOG2UINTMAX)
return a*b;
else
printf("Overflow");
The signed case is similar but has to take care of the negative case specifically.
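EDIT: My solution is not complete and has an error which makes the test more severe than necessary. The equation works in reality if the log₂ function returns a floating point value. In the implementation I limited the value to unsigned integers. This means that completely valid multiplications get refused. Why? Because log2(UINT_MAX) is truncated:
log₂(UINT_MAX)=log₂(4294967295)≈31.9999999997 truncated to 31.
We therefore have to replace the constant we compare against with the one below. Be aware that with truncated integer logarithms the relaxed bound is no longer conservative: a sum of 31 occurs both for products that fit (65535 * 65537) and for products that overflow (98304 * 65535), so that borderline case still needs an exact check.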
#define MYLOG2UINTMAX (CHAR_BIT*sizeof (unsigned int))
You may try this:
if ( b > ULONG_MAX / a ) // Need to check a != 0 before this division
return 0; //a*b invoke UB
else
return a*b;
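Spelled out with the a != 0 check included, the division test might look like the sketch below. The function name mul_overflows is made up, and std::numeric_limits is used in place of ULONG_MAX; it assumes both operands are unsigned long.

```cpp
#include <cassert>
#include <limits>

// Returns true if a * b would not fit in an unsigned long.
bool mul_overflows(unsigned long a, unsigned long b) {
    if (a == 0)
        return false;  // 0 * b never overflows; also avoids dividing by zero
    return b > std::numeric_limits<unsigned long>::max() / a;
}
```

The caller would multiply only when mul_overflows(a, b) is false and handle the overflow case however the application requires.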
I have been told that the modulo operator "%" and the divide operator "/" are very inefficient in embedded C++.
How can I alternatively achieve the following expression:
a = b % c;
I understand that this can be achieved using the following logic:
a = b;
while (a >= c) {
a = a - c;
}
But my question is, is this code involving while loops efficient enough, compared to % operator?
Thanks,
Kirti
Division and modulus are indeed costly hardware operations, whatever you do (this is more related to hardware architecture than to languages or compilers), perhaps ten times slower than addition.
However, on current laptops or servers, and on high-end microcontrollers, cache misses are often much slower than divisions!
The GCC compiler is often able to optimize them, when the divisor is a constant.
Your naive loop is usually much slower than using the hardware division instruction (or the library routine doing it, if not provided by hardware). I believe you are wrong in avoiding the division and replacing it with your loop.
You might tune your algorithms -e.g. by having power of twos- but I don't recommend using your code. Remember that premature optimization is evil so first try to get your program correct, then profile it to find the trouble spots.
Nothing is going to be considerably more efficient than the % operator. If there was a better way to do it, then any reasonable compiler would automatically convert it. When you're told that % and / are inefficient, that's just because those are difficult operations - if you need to perform a modulo, then do that.
There may be special cases when there are better ways - for example, mod a power of two can be written as a bitwise and - but those are probably optimized by your compiler.
That code will almost certainly be slower than however your processor/compiler decides to perform the divide/mod. Generally, shortcuts are pretty hard to come by for basic arithmetic operators, since the mcu/cpu designers and compiler programmers are pretty good at optimizing this for almost all applications.
One common shortcut in embedded devices (where every cycle/byte can make a difference) is to keep everything in terms of base-2 to use the bit shift operators to perform multiplication and division, and the bitwise and (&) to perform modulo.
Examples:
unsigned int x = 100;
unsigned int y1 = x << 4; // same as x * 2^4 = x*16
unsigned int y2 = x >> 6; // same as x / 2^6 = x/64
unsigned int y3 = x & 0x07; // same as x % 8
If the divisor is known at compile time, the operation can be transformed into a multiplication by a reciprocal, with some shifts, adds, and other fast operations. This will be faster on any modern processor, even if it implements division in hardware. Embedded targets usually have highly optimized routines for divide / modulo, since these operations are required by the standard.
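As one concrete instance of the reciprocal transformation, here is a sketch for dividing a 32-bit unsigned value by the constant 3. The function name div3 is made up; the constant 0xAAAAAAAB is ceil(2^33 / 3), and a compiler may well pick a different but equivalent sequence.

```cpp
#include <cassert>
#include <cstdint>

// x / 3 for any 32-bit unsigned x, written as a 64-bit multiplication by
// ceil(2^33 / 3) = 0xAAAAAAAB followed by a right shift of 33 bits.
std::uint32_t div3(std::uint32_t x) {
    const std::uint64_t product = static_cast<std::uint64_t>(x) * 0xAAAAAAABull;
    return static_cast<std::uint32_t>(product >> 33);
}
```

For example, div3(100) gives 33 and div3(0xFFFFFFFF) gives 0x55555555, the same as the / operator.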
If you have profiled your code carefully and found that a modulo operator is the major cost in an inner loop then there is an optimisation that might help. You might be already familiar with the trick for determining the sign of an integer using arithmetic right shifts (for 32 bit values):
sign = ( x >> 31 ) | 1;
This extends the sign bit across the word, so negative values yield -1 and positive values 0. Then bit 0 is set so that positive values result in 1.
If we're only incrementing values by a quantity that is less than the modulo then this same trick can be used to wrap the result:
val += inc;
val -= modulo & ( static_cast< int32_t >( ( ( modulo - 1 ) - val ) ) >> 31 );
Alternatively, if you are decrementing by values less than the modulo then the relevant code is:
int32_t signedVal = static_cast< int32_t >( val - dec );
val = signedVal + ( modulo & ( signedVal >> 31 ) );
I've added the static_cast operators because I was passing in uint32_t, but you might not find them necessary.
Does this help much as opposed to a simple % operator? That depends on your compiler and CPU architecture. I found a simple loop ran 60% faster on my i3 processor when compiled under VS2012, however on the ARM11 chip in the Raspberry Pi and compiling with GCC I only got a 20% improvement.
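For reference, the increment case wrapped into a function that can be checked against the % operator. This is a sketch under assumptions: the name wrap_inc is made up, val and inc must already be less than modulo, modulo must not exceed 2^30, and the cast and the right shift of a negative value rely on the usual two's complement behaviour (implementation-defined before C++20).

```cpp
#include <cassert>
#include <cstdint>

// Computes (val + inc) % modulo without a division, assuming
// val < modulo, inc < modulo and modulo <= 2^30.
std::uint32_t wrap_inc(std::uint32_t val, std::uint32_t inc, std::uint32_t modulo) {
    assert(modulo > 0 && modulo <= 0x40000000u && val < modulo && inc < modulo);
    val += inc;
    // The difference is negative exactly when val >= modulo, so the
    // arithmetic shift yields all ones in that case and zero otherwise.
    const std::int32_t mask = static_cast<std::int32_t>((modulo - 1) - val) >> 31;
    val -= modulo & static_cast<std::uint32_t>(mask);
    return val;
}
```

For example, wrap_inc(5, 3, 7) gives 1 and wrap_inc(2, 3, 7) gives 5, matching (5 + 3) % 7 and (2 + 3) % 7.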
Division by a constant can be achieved by a shift if the constant is a power of 2, or by a mul/add/shift combination for other values.
http://masm32.com/board/index.php?topic=9937.0 has an x86 assembly version as well as C source in the download from the first post that generates this code for you.
In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?
Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
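As a sketch of the first two ideas combined (a cheap starting guess plus a fixed, small number of iterations), here is a Newton-Raphson square root. The name approx_sqrt and the choice of starting guess are my own assumptions, not a tuned implementation, and whether it beats the library sqrt has to be measured.

```cpp
#include <cassert>
#include <cmath>

// Approximate square root with a caller-chosen number of Newton-Raphson
// steps. Fewer iterations means less accuracy but less work.
float approx_sqrt(float x, int iterations) {
    assert(iterations >= 0);
    if (x <= 0.0f)
        return 0.0f;                               // treat non-positive input as 0
    int exponent = 0;
    std::frexp(x, &exponent);                      // x = m * 2^exponent, m in [0.5, 1)
    float guess = std::ldexp(1.0f, exponent / 2);  // rough start: 2^(exponent / 2)
    for (int k = 0; k < iterations; ++k)
        guess = 0.5f * (guess + x / guess);        // one Newton-Raphson step
    return guess;
}
```

With three iterations the result for inputs such as 2, 16 or 100 is already within about 1e-3 of the true square root.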
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.
There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE2's sqrtss is about 2x faster than FPU fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this still will only bring your execution time down by 2.5%. For most applications users wouldn't even notice this, except if they use a watch to measure.
How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.
Here is a fast way with a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().
// LUT for fast sqrt of floats. The table consists of 2 parts, half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11; // Use only the 11 most significant bits of the float's 23 mantissa bits. We can use 15 bits instead. It will produce less error but take more space in memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision; // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision+1)); // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE; // Once initialized will be true
// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
unsigned short i;
float f;
UINT32 *fi = (UINT32*)&f;
if (is_sqrttab_initialized)
return;
const int halfTableSize = (tableSize>>1);
for (i=0; i < halfTableSize; i++){
*fi = 0;
*fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127
// Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
f = sqrtf(f);
sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);
// Repeat the process, this time with an exponent of 1, stored as 128
*fi = 0;
*fi = (i << nUnusedBits) | (128 << 23);
f = sqrtf(f);
sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
}
is_sqrttab_initialized = TRUE;
}
// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
if (n <= 0.f)
return 0.f; // On 0 or negative return 0.
UINT32 *num = (UINT32*)&n;
short e; // Exponent
e = (*num >> 23) - 127; // In 'float' the exponent is stored with 127 added.
*num &= 0x7fffff; // leave only the mantissa
// If the exponent is odd, we have to look it up in the second half of the lookup table, so we set the high bit.
const int halfTableSize = (tableSize>>1);
const int secondHalphTableIdBit = halfTableSize << nUnusedBits;
if (e & 0x01)
*num |= secondHalphTableIdBit;
e >>= 1; // Divide the exponent by two (note that in C the shift operators are sign preserving for signed operands)
// Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
*num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
return n;
}
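The code above reads the bits of a float through a cast pointer, which is the "misuse" its comment mentions. A hedged alternative for that one step is to copy the bytes with std::memcpy, which compilers typically turn into a plain register move. The helper names float_bits and unbiased_exponent are made up for illustration, and the sketch assumes a 32-bit IEEE 754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the raw bit pattern of a float without pointer casts.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Extracts the exponent field and removes the bias of 127.
int unbiased_exponent(float f) {
    return static_cast<int>((float_bits(f) >> 23) & 0xFF) - 127;
}
```

For example, float_bits(1.0f) is 0x3F800000 and unbiased_exponent(8.0f) is 3.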