I need to perform a rather complex check over a vector, and I have to repeat it thousands or even millions of times. To make it more efficient, I translate the given formula into C++ source code and compile it into a heavily optimized binary, which I then call from my code. The formula is always purely Boolean: only &&, || and ! are used. Typical generated source code looks like this:
#include <assert.h>
#include <vector>
using DataType = std::vector<bool>;
static const char T = 1;
static const char F = 0;
const std::size_t maxidx = 300;
extern "C" bool check (const DataType& l);
bool check (const DataType& l) {
assert (l.size() == maxidx);
return (l[0] && l[1] && l[2]) || (l[3] && l[4] && l[5]); //etc, very large line with && and || everywhere
}
I compile it as follows:
g++ -std=c++11 -Ofast -march=native -fpic -c check.cpp
Performance of the resulting binary is crucial.
It worked perfectly until a recent test case with a large number of variables (300, as you can see above). With this test case, g++ consumes more than 100 GB of memory and effectively freezes forever.
My question is pretty straightforward: how can I simplify that code for the compiler? Should I introduce some additional variables, get rid of the vector, or something else?
EDIT1: OK, here is what the top utility shows: cc1plus is busy with my code. The check function depends on 584 variables (sorry for the imprecise number in the example above) and it contains 450'000 expressions.
I would agree with @akakatak's comment below. It seems that g++ is doing something that is O(N^2) here.
The obvious optimization here is to toss out the vector and pack the flags into bits of the fastest available integer type:
uint_fast8_t item [n];
You could write this as
#define ITEM_BYTES(items) ((items) / sizeof(uint_fast8_t))
#define ITEM_SIZE(items) ( ITEM_BYTES(items) / CHAR_BIT + (ITEM_BYTES(items)%CHAR_BIT!=0) )
...
uint_fast8_t item [ITEM_SIZE(n)];
Now you have a chunk of memory with n segments, where each segment is the ideal size for your CPU. In each such segment, set bits to 1=true or 0=false, using bitwise operators.
Depending on how you want to optimize, you would group the bits in different ways. I would suggest storing 3 bits of data in every segment, since you always wish to check 3 adjacent boolean values. This means that "n" in the above example will be the total number of booleans divided by 3.
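For illustration, a minimal sketch of that packing step, assuming the booleans arrive as a plain bool array (the function and parameter names here are mine, not part of the original code):

#include <stdint.h>
#include <stddef.h>

/* Pack three booleans per segment so that one bitwise test checks a whole
   triple; bits 0, 1, 2 of item[i] correspond to flags 3*i, 3*i+1, 3*i+2. */
void pack_triples (const bool* flags, size_t n, uint_fast8_t* item)
{
    for (size_t i = 0; i < n; i++)
    {
        item[i] = (uint_fast8_t)( (flags[3*i + 0] ? 1u : 0u)
                                | (flags[3*i + 1] ? 2u : 0u)
                                | (flags[3*i + 2] ? 4u : 0u) );
    }
}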
You can then simply iterate through the array like:
bool items_ok ()
{
for(size_t i=0; i<n; i++)
{
if( (item[i] & 0x7u) == 0x7u )
{
return true;
}
}
return false;
}
With the above method you optimize:
The data size in which comparisons are made, and with it possible alignment issues.
The overall memory use.
The number of branches needed for the comparisons.
This also rules out any risk of inefficiency caused by the usual C++ metaprogramming. I would never trust std::vector, std::array or std::bitset to produce optimal code.
Once you have the above working, you can always test whether std::bitset and similar containers yield the very same, efficient machine code. If you find that they spawn any form of unrelated madness in your machine code, then don't use them.
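For such a comparison, a hypothetical std::bitset version of the same check (not part of the answer above; the triple count N is an assumed compile-time constant), which you can compile with -O2 -S next to the array version and diff the output:

#include <bitset>
#include <cstddef>

constexpr std::size_t N = 100;   // assumed number of boolean triples

bool items_ok_bitset (const std::bitset<3 * N>& items)
{
    // same "all three adjacent bits set" check as items_ok() above
    for (std::size_t i = 0; i < N; ++i)
    {
        if (items[3*i] && items[3*i + 1] && items[3*i + 2])
            return true;
    }
    return false;
}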
This is a bit of a necro-post, but I should still share my results.
The solution proposed by Thilo in the comments above is the best one. It's very simple and it gives a measurable compile-time improvement: just split your expression into chunks of the same size (see the sketch below). But, in my experience, you have to choose the sub-expression length carefully; with a very large number of sub-expressions you can see a significant drop in execution performance, because the compiler is no longer able to optimize the whole expression perfectly.
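For illustration, a hedged sketch of what that chunking can look like in the generated source; the chunk names and the two-term bodies are placeholders, and the real chunks would each hold a fixed number of the generated sub-expressions:

#include <vector>
using DataType = std::vector<bool>;

static bool chunk_0 (const DataType& l) {
    return (l[0] && l[1] && l[2]) || (l[3] && l[4] && l[5]); // ...more terms
}
static bool chunk_1 (const DataType& l) {
    return (l[6] && l[7] && l[8]) || (l[9] && l[10] && l[11]); // ...more terms
}

extern "C" bool check (const DataType& l) {
    return chunk_0(l) || chunk_1(l); // ...one call per chunk
}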
Related
My prof once said that if-statements are rather slow and should be avoided as much as possible. I'm making a game in OpenGL, where I need a lot of them.
In my tests, replacing an if-statement with a short-circuiting && worked, but is it actually faster?
#include <cstdlib>
#include <iostream>

bool doSomething();
int main()
{
int randomNumber = std::rand() % 10;
randomNumber == 5 && doSomething();
return 0;
}
bool doSomething()
{
std::cout << "function executed" << std::endl;
return true;
}
My intention is to use this inside the draw function of my renderer. My models are supposed to have flags, if a flag is true, a certain function should execute.
if-statements are rather slow and should be avoided as much as possible.
This is wrong and/or misleading. Most simplified statements about slowness of a program are wrong. There's probably something wrong with this answer too.
C++ statements don't have a speed that can be attributed to them. It's the speed of the compiled program that matters. And that consists of assembly language instructions; not of C++ statements.
What would probably be more correct is to say that branch instructions can be relatively slow (on modern, superscalar CPU architectures) (when the branch cannot be predicted well) (depending on what you are comparing to; there are many things that are much more expensive).
randomNumber == 5 && doSomething();
An if-statement is often compiled into a program that uses a branch instruction. A short-circuiting logical-and operation is also often compiled into a program that uses a branch instruction. Replacing if-statement with a logical-and operator is not a magic bullet that makes the program faster.
If you were to compare the program produced by the logical-and and the corresponding program where it is replaced with if (randomNumber == 5), you would find that the optimiser sees through your trick and produces the same assembly in both cases.
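A minimal pair to inspect yourself (for example with g++ -O2 -S); doSomething is the function declared in the question, and at -O2 both functions typically disassemble to the same compare, conditional jump and call:

bool doSomething();

void viaIf (int randomNumber)
{
    if (randomNumber == 5)
        doSomething();
}

void viaAnd (int randomNumber)
{
    // cast to void only to silence "value computed is not used" warnings
    (void)(randomNumber == 5 && doSomething());
}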
My models are supposed to have flags, if a flag is true, a certain function should execute.
In order to avoid the branch, you must change the premise. Instead of iterating through a sequence of all models, checking the flag, and conditionally calling a function, you could maintain a sequence of all models for which the function should be called, iterate that, and call the function unconditionally -> no branching. Is this alternative faster? There is certainly some overhead in maintaining the data structure, and the branch predictor may have made it unnecessary anyway. The only way to know for sure is to measure the program.
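A hedged sketch of that alternative layout (Model, draw() and the container name are assumptions, not taken from the question):

#include <vector>

struct Model {
    void draw() const { /* render this model */ }
};

// Maintained elsewhere: a model is added here when its flag becomes true
// and removed when it becomes false.
std::vector<const Model*> modelsToDraw;

void drawAll()
{
    for (const Model* m : modelsToDraw)
        m->draw();   // called unconditionally: no per-model flag test here
}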
I agree with the comments above that in almost all practical cases, it's OK to use ifs as much as you need without hesitation.
I also agree that this is not an issue a beginner should waste energy optimizing, and that using logical operators will likely emit code similar to ifs.
However - there is a valid issue here related to branching in general, so those who are interested are welcome to read on.
Modern CPUs use what we call Instruction pipelining.
Without getting too deep into the technical details:
Within each CPU core there is a level of parallelism.
Each assembly instruction is composed of several stages, and while the current instruction is executed, the next instructions are prepared to a certain degree.
This is called instruction pipelining.
This concept is broken with any kind of branching in general, and conditionals (ifs) in particular.
It's true that there is a mechanism of branch prediction, but it works only to some extent.
So although in most cases ifs are totally OK, there are cases where this should be taken into account.
As always when it comes to optimizations, one should carefully profile.
Take the following piece of code as an example (similar things are common in image processing and other implementations):
unsigned char * pData = ...; // get data from somewhere
int dataSize = 100000000; // something big
bool cond = ...; // initialize some condition for relevant for all data
for (int i = 0; i < dataSize; ++i, ++pData)
{
if (cond)
{
*pData = 2; // imagine some small calculation
}
else
{
*pData = 3; // imagine some other small calculation
}
}
It might be better to do it like this (even though it contains duplication, which is evil from a software engineering point of view):
if (cond)
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 2; // imagine some small calculation
}
}
else
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 3; // imagine some other small calculation
}
}
We still have an if, but now it causes at most one branch.
In certain [rare] cases (requires profiling as mentioned above) it will be more efficient to do even something like this:
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = (2 * cond + 3 * (!cond));
}
I know it's not common, but some years ago I encountered specific HW on which the cost of 2 multiplications, 1 addition and a negation was less than the cost of branching (due to the reset of the instruction pipeline). Also, this "trick" supports using different condition values for different parts of the data.
Bottom line: ifs are usually OK, but it's good to be aware that sometimes there is a cost.
I'm writing a piece of code using GCC's vector extensions (__attribute__((vector_size(x)))) that needs several constant masks. These masks are simple enough to fill in sequentially, but adding them as vector literals is tedious and error prone, not to mention limiting potential changes in vector size.
Is it possible to generate the constants using a constexpr function?
I've tried generating the values like this:
using u8v = uint8_t __attribute__((vector_size(64)));
auto generate_mask = []() constexpr {
u8v ret;
for (size_t i = 0; i < sizeof(u8v); ++i) {
ret[i] = i & 0xff;
}
return ret;
};
constexpr auto mask = generate_mask();
but GCC says modification of u8v is not a constant expression.
Is there some workaround, using G++ with C++20 features?
Technically possible, but complicated to the point of being unusable. An example unrelated to SIMD.
Still, the workarounds aren’t that bad.
SIMD vectors can't be embedded into the instruction stream anyway. It doesn't matter that your code says constexpr auto mask; in reality the compiler will generate the complete vector, place the data in the read-only segment of your binary, and load it from there as needed.
The exception is when your vector can be generated faster than a RAM load. A vector of all zeroes can be made without RAM access with vpxor, a vector with all bits set can be made with vpcmpeqd, and vectors with the same value in all lanes can be made with broadcast instructions. Anything else will result in a full-vector load from the binary.
This means you can replace your constexpr with const and the binary will be the same. If that code is in a header file, ideally you'd also want __declspec(selectany) when building with VC++; in your case, GCC's equivalent is __attribute__((weak)).
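A minimal sketch of that const replacement, reusing the question's type alias; whether the bytes end up in the read-only segment or are computed once at static initialization is up to the compiler:

#include <cstdint>
#include <cstddef>

using u8v = uint8_t __attribute__((vector_size(64)));

// Built once, before main() runs; the loop writes every element of ret.
static const u8v mask = [] {
    u8v ret;
    for (std::size_t i = 0; i < sizeof(u8v); ++i)
        ret[i] = static_cast<uint8_t>(i & 0xff);
    return ret;
}();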
I am not sure whether it makes sense in this case for bits to be a non-type template parameter:
template< int bits >
inline bool FitsInBits(int x )
{
static_assert( bits < 32, "'bits' should be less than 32" );
return x >= -((int)1<<(bits-1)) && x < ((int)1<<(bits-1));
}
Rather than:
inline bool FitsInBits(int x, int bits )
{
return x >= -((int)1<<(bits-1)) && x < ((int)1<<(bits-1));
}
As I understand it, the compiler will create many variants of FitsInBits in the first case at compile time. But I don't see how this would optimize the calculation.
The template isn't necessarily more efficient, but it assures you that the values of: -((int)1<<(bits-1)) and ((int)1<<(bits-1)) can be computed at compile time, so all that happens at run-time is basically:
return x >= constant_a && x < constant_b;
This, in turn, is sufficiently trivial that there's a pretty good chance the compiler can/will generate inline-code for it.
Assuming that bits is a value known at compile time, the non-template version can (and probably will) do the same--but since bits is a normal parameter to a normal function, you could (perhaps accidentally) pass some value for bits that wasn't known until run time (e.g., based on data entered by the user).
If you do so, the compiler can't/won't give you a warning about having done so--and in this case, the expressions probably won't reduce to constants as above, so you'd very likely end up with code to load 1 into one register, load bits into another, decrement the second, shift the first left by the number of places specified in the second, compare to x, conditionally jump to a label, negate the first, compare again, do another conditional jump, etc. This is still only around a dozen instructions (or so), but undoubtedly slower than when the values are constants.
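A small self-contained illustration of that difference (the T/RT suffixes are mine, just to keep both forms visible side by side):

#include <cstdio>

template< int bits >
inline bool FitsInBitsT( int x )
{
    static_assert( bits > 0 && bits < 32, "bits must be in (0, 32)" );
    // both bounds are compile-time constants here
    return x >= -((int)1 << (bits - 1)) && x < ((int)1 << (bits - 1));
}

inline bool FitsInBitsRT( int x, int bits )
{
    // the bounds are computed at run time unless the optimizer can prove
    // that 'bits' is a constant at the call site
    return x >= -((int)1 << (bits - 1)) && x < ((int)1 << (bits - 1));
}

int main()
{
    std::printf( "%d\n", FitsInBitsT<12>( 2047 ) );   // prints 1
    std::printf( "%d\n", FitsInBitsRT( 2048, 12 ) );  // prints 0
}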
I have written optimized code for an algorithm that computes a vector of quantities. I have timed it before and after various attempts at getting the data computed in the function out of the function. I think that the specific nature of the computation or the nature of the vector of quantities is not relevant. An outline of the code, timings, and details follow.
All code was compiled with the following flags:
g++ -Wall -Wextra -Werror -std=c++11 -pedantic -O3
I have a class like this:
#ifndef C_H
#define C_H
#include <array>
#include <iostream>
#include <iterator>
#include <vector>
class C {
public:
void doWork( int param1, int param2 ) const {
std::array<unsigned long,40> counts = {{0}};
// LOTS of branches and inexpensive operations:
// additions, subtractions, incrementations, and dereferences
for( /* loop 1 */ ) {
// LOTS MORE branches and inexpensive operations
counts[ /* data dependent */ ] += /* data dependent */;
for( /* loop 2 */ ) {
// YET MORE branches inexpensive operations
counts[ /* data dependent */ ] += /* data dependent */;
}
}
counts [ /* data dependent */ ] = /* data dependent */;
/* exclude for profiling
std::copy( counts.begin(), counts.end(), std::ostream_iterator<unsigned long>( std::cout, "," ) );
std::cout << "\n";
*/
}
private:
// there is private data here that is processed above
// the results get added into the array/vector as they are computed
};
#endif
And a main like this:
#include <iostream>
#include "c.h"
int main( int argc, char * argv[] ) {
C c( /* set the private data of c by passing data in */ );
int param1;
int param2;
while( std::cin >> param1 >> param2 ) {
c.doWork( param1, param2 );
}
}
Here are some relevant details about the data:
20 million pairs read at standard input (redirected from a file)
20 million calls to c.doWork
60 million TOTAL iterations through the outer loop in c.doWork
180 million TOTAL iterations through the inner loop in c.doWork
All of this requires exactly 5 minutes and 48 seconds to run. Naturally I can print the array within the class function, and that is what I have been doing, but I am going to release the code publicly, and some use cases may include wanting to do something other than printing the vector. In that case, I need to change the function signature to actually get the data to the user. This is where the problem arises. Things that I have tried:
Creating a vector in main and passing it in by reference:
std::vector<unsigned long> counts( 40 );
while( std::cin >> param1 >> param2 ) {
c.doWork( param1, param2, counts );
std::fill( counts.begin(), counts.end(), 0 );
}
This requires 7 minutes 30 seconds. Removing the call to std::fill only reduces this by 15 seconds, so that doesn't account for the discrepancy.
Creating a vector within the doWork function and returning it, taking advantage of move semantics.
Since this requires a dynamic allocation for each result, I didn't expect this to be fast. Strangely it's not a lot slower. 7 minutes 40 seconds.
Returning the std::array currently in doWork by value.
Naturally this has to copy the data upon return since the stack array does not support move semantics. 7 minutes 30 seconds
Passing a std::array in by reference.
while( std::cin >> param1 >> param2 ) {
std::array<unsigned long,40> counts = {{0}};
c.doWork( param1, param2, counts );
}
I would expect this to be roughly equivalent to the original. The data is placed on the stack in the main function, and it is passed by reference to doWork, which fills it. 7 minutes 20 seconds. This one really stymies me.
I have not tried passing pointers in to doWork, because this should be equivalent to passing by reference.
One solution is naturally to have two versions of the function: one that prints locally and one that returns. The roadblock is that I would have to duplicate ALL code, because the entire issue here is that I cannot efficiently get the results out of a function.
So I am mystified. I understand that any of these solutions requires an extra dereference for every access to the array/vector inside doWork, but these extra dereferences are trivial compared to the huge number of other fast operations and the more troublesome data-dependent branches.
I welcome any ideas to explain this. My only thought is that the code is being optimized by the compiler so that some otherwise necessary components of computation are being omitted in the original case, because the compiler realizes that it is not necessary. But this seems to be contraindicated on several counts:
Making changes to the code inside the loops does change the timings.
The original timings are 5 minutes 50 seconds, whereas just reading the pairs from the file takes 12 seconds, so a lot is being done.
Maybe only operations involving counts are being optimized away, but that seems like a strangely selective optimization, given that if those are being optimized away, the compiler could realize that the supporting computations in doWork are also unnecessary.
If operations involving counts ARE being optimized away, why are they not optimized in the other cases. I am not actually using them in main.
Is it the case that doWork is compiled and optimized independently of main, and thus if the function has any obligation to return the data in any form it cannot be certain of whether it will be used or not?
Is my method of profiling without printing, which was to avoid the cost of the printing to emphasize the relative differences in various methods, flawed?
I am grateful for any light you can shed.
What I would do is pause it a few times and see what it's doing most of the time. Looking at your code, I would suspect most of the time goes into either a) the innermost loop, especially the index calculation, or b) the allocation of the std::array.
If the size of counts is always 40, I would just do
long counts[40];
memset(counts, 0, sizeof(counts));
That allocates on the stack, which takes no time, and memset takes no time compared to whatever else you're doing.
If the size varies at runtime, then what I do is some static allocation, like this:
void myRoutine(){
/* this does not claim to be pretty. it claims to be efficient */
static int nAlloc = 0;
static long* counts = NULL;
/* this initially allocates the array, and makes it bigger if necessary */
if (nAlloc < /* size I need */){
if (counts) delete [] counts;
nAlloc = /* size I need */;
counts = new long[nAlloc];
}
memset(counts, 0, sizeof(long)*nAlloc);
/* do the rest of the stuff */
}
This way, counts is always big enough, and the point is to 1) do new as few times as possible, and 2) keep the indexing into counts as simple as possible.
But first I would do the pauses, just to be sure.
After fixing it, I would do that again to see what's the next thing I could fix.
Compiler optimizations are one place to look, but there is another place you need to look as well: the changes you made to the code can disturb the cache layout. If the memory allocated for the array is in a different part of memory each time, the number of cache misses in your system can increase, which in turn degrades performance. You can take a look at the hardware performance counters on your CPU to make a better guess about it.
There are times when unorthodox solutions are applicable, and this may be one. Have you considered making the array a global?
Still, the one crucial benefit that local variables have is that the optimizer can find all accesses to them, using information from the function only. That makes register assignment a whole lot easier.
A static variable inside the function is almost the same, but in your case the address of that array would escape, beating the optimizer once again.
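For reference, a rough sketch of the global-array variant suggested above, reusing the outline from the question (the name g_counts is mine):

#include <array>

std::array<unsigned long, 40> g_counts;   // results of the most recent call

class C {
public:
    void doWork( int param1, int param2 ) const {
        g_counts.fill( 0 );
        // ... the same loops as before, accumulating into g_counts ...
    }
    // private data as before
};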
While answering another question I got curious about this. I'm well aware that
if( __builtin_expect( !!a, 0 ) ) {
// not likely
} else {
// quite likely
}
will make the "quite likely" branch faster (in general) by doing something along the lines of hinting to the processor / changing the assembly code order / some kind of magic. (if anyone can clarify that magic that would also be great).
But does this work for a) inline ifs, b) variables and c) values other than 0 and 1? i.e. will
__builtin_expect( !!a, 0 ) ? /* unlikely */ : /* likely */;
or
int x = __builtin_expect( t / 10, 7 );
if( x == 7 ) {
// likely
} else {
// unlikely
}
or
if( __builtin_expect( a, 3 ) ) {
// likely
// uh-oh, what happens if a is 2?
} else {
// unlikely
}
have any effect? And does all of this depend on the architecture being targeted?
Did you read the GCC documentation?
Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch
prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral
expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
indicates that we do not expect to call foo, since we expect x to be zero. Since you are limited to integral expressions for exp, you should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
foo (*ptr);
when testing pointer or floating-point values.
To explain this a bit... __builtin_expect is specifically useful for communicating which branch you think the program's likely to take. You ask how the compiler can use that insight - well, consider this code:
if (x == 0)
return 10 * y;
else
return 39;
In machine code, the CPU can typically be asked to "goto" another line (which takes time, and depending on the CPU may prevent other execution optimisations - i.e. beneath the level of machine code - for example, see the Branches heading under http://en.wikipedia.org/wiki/Instruction_pipeline), or to call some other code, but there's not really an if/else concept where both true and false code are equal... you have to branch away to find the code for one or the other. The way that's done is basically, in pseudo-code:
test whether x is 0
if it was goto else_return_39
return 10 * y
else_return_39:
return 39
Given most CPUs are slower following the goto down to the else_return_39: label than to just fall through to return 10 * y, code for the "true" branch will be reached faster than for the false branch. Of course, the machine code could test whether x is not 0, put the "false" code (return 39) first and thereby reverse the performance characteristics.
This is what __builtin_expect controls - you can tell the compiler to put the true or the false branch where less branching is needed to reach it, thereby getting a tiny performance boost.
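In practice this is usually wrapped in a pair of macros; a small sketch (the wrapper names are the conventional ones, not something GCC defines for you), reusing the x/y example above:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int f (int x, int y)
{
    if (unlikely(x == 0))
        return 39;       // expected cold path, moved out of the fall-through
    return 10 * y;       // expected hot path, reached without taking a branch
}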
But does this work for a) inline ifs, b) variables and c) values other than 0 and 1?
a) Whether the surrounding function is inlined or not doesn't change the need for branching where the if statement appears (unless the optimiser sees that the condition the if statement tests is always true or false, so that only one of the branches could ever run). So, it's equally applicable to inlined code.
[ Your comment shows you were interested in conditional expressions - a ? b : c - I'm not sure - there's a disputed answer to that question at https://stackoverflow.com/questions/14784481/can-i-use-gccs-builtin-expect-with-ternary-operator-in-c that might prove insightful one way or the other, or the basis for further exploration ]
b) variables - you postulated:
int x = __builtin_expect( t / 10, 7 );
if( x == 7 ) {
That won't work - the compiler's not obliged to associate such expectations with variables and remember them the next time an if is seen. You can verify this (as I did for gcc 3.4.4) using gcc -S to produce assembly language output: the assembly doesn't change regardless of the expected value.
c) values other than 0 and 1
It works for integral (long) values, so yes. The last paragraph of the documentation quoted above address this, specifically:
you should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
foo (*ptr);
when testing pointer or floating-point values.
Why? Well, if the pointer type is larger than long, then calling __builtin_expect(long, long) would effectively slice off some of the less-significant bits for the test, and fail to incorporate the (high-order) rest in the test. Similarly, floating-point values might be larger than a long, and the conversion might not produce the result you expect. By using a boolean expression such as ptr != NULL (given true converts to 1 and false to 0) you're sure to get the intended results.
But does this work for a) inline ifs, b) variables and c) values other than 0 and 1?
It works for an expression context that is used to determine branching.
So, a) Yes. b) No. c) Yes.
And does all of this depend on the architecture being targeted?
Yes!
It leverages architectures that use instruction pipelining, which allow a CPU to begin working on upcoming instructions before the current instruction has been completed.
(if anyone can clarify that magic that would also be great).
("Branch prediction" complicates this description, so I'm intentionally omitting it)
Any code resembling an if statement implies that an expression may result in the CPU jumping to a different location in the program. These jumps invalidate what's in the CPU's instruction pipeline.
__builtin_expect allows (without guarantee) gcc to try to assemble the code so the likely scenario involves fewer jumps than the alternate.