My prof once said that if-statements are rather slow and should be avoided as much as possible. I'm making a game in OpenGL, where I need a lot of them.
In my tests, replacing an if-statement with AND via short-circuiting worked, but is it faster?
#include <cstdlib>
#include <iostream>

bool doSomething();

int main()
{
    int randomNumber = std::rand() % 10;
    randomNumber == 5 && doSomething(); // doSomething() runs only if the comparison is true
    return 0;
}

bool doSomething()
{
    std::cout << "function executed" << std::endl;
    return true;
}
My intention is to use this inside the draw function of my renderer. My models are supposed to have flags, if a flag is true, a certain function should execute.
if-statements are rather slow and should be avoided as much as possible.
This is wrong and/or misleading. Most simplified statements about slowness of a program are wrong. There's probably something wrong with this answer too.
C++ statements don't have a speed that can be attributed to them. It's the speed of the compiled program that matters. And that consists of assembly language instructions; not of C++ statements.
What would probably be more correct is to say that branch instructions can be relatively slow (on modern, superscalar CPU architectures) (when the branch cannot be predicted well) (depending on what you are comparing to; there are many things that are much more expensive).
randomNumber == 5 && doSomething();
An if-statement is often compiled into a program that uses a branch instruction. A short-circuiting logical-and operation is also often compiled into a program that uses a branch instruction. Replacing an if-statement with a logical-and operator is not a magic bullet that makes the program faster.
If you were to compare the program produced by the logical-and and the corresponding program where it is replaced with if (randomNumber == 5), you would find that the optimiser sees through your trick and produces the same assembly in both cases.
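As a minimal sketch of that comparison, the two functions below express the same logic both ways. The names callWithIf, callWithAnd and g_calls are invented here for illustration. The sketch only shows that the observable behaviour is identical; to check the claim about identical assembly for your own compiler, build both with optimisation enabled and compare the output.

```cpp
#include <cassert>

// Hypothetical counter standing in for the work done by doSomething().
static int g_calls = 0;

bool doSomething()
{
    ++g_calls;
    return true;
}

// Version 1: plain if-statement.
void callWithIf(int randomNumber)
{
    if (randomNumber == 5) {
        doSomething();
    }
}

// Version 2: short-circuiting logical AND; doSomething() runs only
// when the left-hand side is true. The cast to void just silences
// "unused value" warnings.
void callWithAnd(int randomNumber)
{
    (void)(randomNumber == 5 && doSomething());
}
```

Both versions call doSomething() exactly when randomNumber is 5, so there is no behavioural reason to prefer the less readable one.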
My models are supposed to have flags, if a flag is true, a certain function should execute.
In order to avoid the branch, you must change the premise. Instead of iterating through a sequence of all models, checking the flag, and conditionally calling a function, you could create a sequence of all models for which the function should be called, iterate that, and call the function unconditionally -> no branching. Is this alternative faster? There is certainly some overhead in maintaining the data structure, and the branch predictor may have made this unnecessary. The only way to know for sure is to measure the program.
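A rough sketch of that restructuring, assuming a hypothetical Model type with a castsShadow flag and a drawShadow function (none of these names come from the question):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model type; the field names are illustrative only.
struct Model {
    int id;
    bool castsShadow;   // the flag that used to be tested per model
    int shadowDraws;    // counts calls, so the effect is observable
};

void drawShadow(Model& m) { ++m.shadowDraws; }

// Built once (or whenever a flag changes): indices of flagged models.
std::vector<std::size_t> collectFlagged(const std::vector<Model>& models)
{
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < models.size(); ++i) {
        if (models[i].castsShadow) {
            flagged.push_back(i);
        }
    }
    return flagged;
}

// Per frame: no flag test inside the loop body.
void drawAllShadows(std::vector<Model>& models,
                    const std::vector<std::size_t>& flagged)
{
    for (std::size_t i : flagged) {
        assert(i < models.size());
        drawShadow(models[i]);
    }
}
```

The flag test has not disappeared; it has moved to collectFlagged, which runs far less often than the draw loop. Whether that trade pays off is exactly what has to be measured.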
I agree with the comments above that in almost all practical cases, it's OK to use ifs as much as you need without hesitation.
I also agree that this is not something a beginner should spend energy optimizing, and that using logical operators will likely emit code similar to ifs.
However - there is a valid issue here related to branching in general, so those who are interested are welcome to read on.
Modern CPUs use what we call Instruction pipelining.
Without getting too deep into the technical details:
Within each CPU core there is a level of parallelism.
Each assembly instruction is composed of several stages, and while the current instruction is executed, the next instructions are prepared to a certain degree.
This is called instruction pipelining.
This pipelining is disrupted by any kind of branching in general, and by conditionals (ifs) in particular.
It's true that there is a mechanism of branch prediction, but it works only to some extent.
So although in most cases ifs are totally OK, there are cases where this should be taken into account.
As always when it comes to optimizations, one should carefully profile.
Take the following piece of code as an example (similar things are common in image processing and other implementations):
unsigned char * pData = ...; // get data from somewhere
int dataSize = 100000000; // something big
bool cond = ...; // initialize some condition relevant for all data
for (int i = 0; i < dataSize; ++i, ++pData)
{
    if (cond)
    {
        *pData = 2; // imagine some small calculation
    }
    else
    {
        *pData = 3; // imagine some other small calculation
    }
}
It might be better to do it like this (even though it contains duplication, which is evil from a software engineering point of view):
if (cond)
{
    for (int i = 0; i < dataSize; ++i, ++pData)
    {
        *pData = 2; // imagine some small calculation
    }
}
else
{
    for (int i = 0; i < dataSize; ++i, ++pData)
    {
        *pData = 3; // imagine some other small calculation
    }
}
We still have an if, but it causes a branch potentially only once.
In certain [rare] cases (requires profiling as mentioned above) it will be more efficient to do even something like this:
for (int i = 0; i < dataSize; ++i, ++pData)
{
    *pData = (2 * cond + 3 * (!cond));
}
I know it's not common, but some years ago I encountered specific HW on which the cost of 2 multiplications and 1 addition with negation was less than the cost of branching (due to the reset of the instruction pipeline). This "trick" also supports using different condition values for different parts of the data.
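The per-element variant mentioned above can be sketched as follows. The function name fillBranchless and the conds array are assumptions made for illustration; the idea is only that each element carries its own condition and the loop body contains no if-statement. An optimising compiler may still turn either form into the same machine code, so this needs profiling like everything else here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Writes 2 where conds[i] is non-zero and 3 where it is zero, with no
// if-statement in the loop body. A bool converts to 0 or 1 in arithmetic.
void fillBranchless(std::vector<unsigned char>& data,
                    const std::vector<unsigned char>& conds)
{
    assert(data.size() == conds.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        const bool c = conds[i] != 0;
        data[i] = static_cast<unsigned char>(2 * c + 3 * !c);
    }
}
```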
Bottom line: ifs are usually OK, but it's good to be aware that sometimes there is a cost.
So say I have this construct where two Tasks run at the same time (pseudo code):
int a, b, c;
std::atomic<bool> flag;

TaskA()
{
    while (1) {
        a = 5;
        b = 2;
        c = 3;
        flag.store(true, std::memory_order_release);
    }
}

TaskB()
{
    while (1) {
        flag.load(std::memory_order_acquire);
        b = a;
        c = 2*b;
    }
}
The memory barrier is supposed to be at the flag variable. As I understand it, this means that the operations in TaskB (b = a and c = 2*b) are executed AFTER the assignments in TaskA (a = 5, b = 2, c = 3). But does this also mean that there's no way that TaskA could already loop around and execute b = 2 a second time while TaskB is still at c = 2*b? Is this somehow prevented, or would I need a second barrier at the start of the loop?
No amount of barriers can help you avoid data-race UB if you begin another write of the non-atomic variables right after the release-store.
It will always be possible (and likely) for some non-atomic writes to a, b, and c to be "happening" while your reader is reading those variables, therefore in the C++ abstract machine you have data-race UB. (In your example, from unsynced write+read of a, unsynced write+write of b, and the write+read of b, and write+write of c.)
Also, even without loops, your example would still not safely avoid data-race UB, because your TaskB accesses a, b, and c unconditionally after the flag.load. So you do that stuff whether or not you observe the flag == true signal from the writer saying that the vars are ready to be read.
Of course in practice on real implementations, repeatedly writing the same data is unlikely to cause problems here, except that the value read for b will depend on how the compiler optimizes. But that's because your example also writes.
Mainstream CPUs don't have hardware race detection, so it won't actually fault or something, and if you did actually wait for flag==1 and then just read, you would see the expected values even if the writer was running more assignments of the same values. (A DeathStation 9000 could implement those assignments by storing something else in that space temporarily so the bytes in memory are actually changing, not stable copies of the values before the first release-store, but that's not something that you'd expect a real compiler to do. I wouldn't bet on it, though, and this seems like an anti-pattern).
This is why lock-free queues use multiple array elements, or why a seqlock doesn't work this way. (A seqlock can't be implemented both safely and efficiently in ISO C++ because it relies on reading maybe-torn data and then detecting tearing; if you use narrow-enough relaxed atomics for the chunks of data, you're hurting efficiency.)
The whole idea of wanting to write again, maybe before a reader has finished reading, sounds a lot like you should be looking into the idea of a SeqLock. https://en.wikipedia.org/wiki/Seqlock and see the other links in my linked answer in the last paragraph.
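For contrast with the looping version in the question, a one-shot handoff that does avoid the data race can be sketched like this. It is a minimal sketch, not a SeqLock: the writer fills the payload exactly once and never touches it again, and the reader touches it only after it has actually observed the flag as true. The function names taskA, taskB and runHandoff are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int a = 0, b = 0, c = 0;            // plain (non-atomic) payload
std::atomic<bool> flag{false};

// Writer: fills the payload exactly once, then publishes it.
void taskA()
{
    a = 5;
    b = 2;
    c = 3;
    flag.store(true, std::memory_order_release);
    // No further writes to a, b, c after this point.
}

// Reader: only reads the payload, and only after observing flag == true.
int taskB()
{
    while (!flag.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return a + b + c;
}

// Hypothetical driver running both tasks once.
int runHandoff()
{
    int sum = 0;
    std::thread reader([&sum] { sum = taskB(); });
    std::thread writer(taskA);
    writer.join();
    reader.join();
    assert(flag.load());
    return sum;
}
```

The acquire load that returns true synchronizes with the release store, so the reads of a, b and c are ordered after the writes. As soon as the writer is allowed to write again while a reader may still be reading, this guarantee is gone and a different design (multiple slots, or a SeqLock) is needed.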
In all likelihood, a lockless implementation is already overkill for the purposes of my application, but I wanted to look into memory barriers and lockless-ness anyways in case I ever actually need to use these concepts in the future.
From what I can tell:
an "InterlockedAcquire" function performs an atomic operation while preventing the compiler from moving code statements after the InterlockedAcquire to before the InterlockedAcquire.
an "InterlockedRelease" function performs an atomic operation while preventing the compiler from moving code statements before the InterlockedRelease to after the InterlockedRelease.
a vanilla "Interlocked" function performs an atomic operation while preventing the compiler from moving code statements in either direction across the Interlocked call.
My question is, if a function is structured such that the compiler can't reorder any of the code anyways because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
For a more concrete example, here's my current application - the produce() function as part of what will eventually be a multiple producer, single consumer queue built using a circular buffer:
#include <windows.h>
#include <cstdlib>
#include <cstring>

// MAXQUEUESIZE is assumed to be defined elsewhere.
template <typename T>
class Queue {
private:
    long headIndex;
    long tailIndex;
    T* array[MAXQUEUESIZE];
public:
    Queue() {
        headIndex = 0;
        tailIndex = 0;
        memset(array, 0, MAXQUEUESIZE * sizeof(T*));
    }
    ~Queue() {
    }
    bool produce(T value) {
        //1) prevents concurrent calls to produce() from causing corruption:
        long indexRetVal;
        long reservedIndex;
        do {
            reservedIndex = tailIndex;
            // long is 32 bits on Windows, so the 32-bit variant is the one that matches tailIndex
            indexRetVal = InterlockedCompareExchange(&tailIndex, (reservedIndex + 1) % MAXQUEUESIZE, reservedIndex);
        } while (indexRetVal != reservedIndex);
        //2) allocates the node.
        T* newValPtr = (T*) malloc(sizeof(T));
        if (newValPtr == nullptr) {
            OutputDebugStringA("Queue: malloc returned null");
            return false;
        }
        *newValPtr = value;
        //3) prevents a concurrent call to consume from causing corruption by atomically replacing the old pointer:
        T* valPtrRetVal = (T*) InterlockedCompareExchangePointer((PVOID volatile*) (array + reservedIndex), newValPtr, nullptr);
        //if the previous value wasn't null, then our circular buffer overflowed:
        if (valPtrRetVal != nullptr) {
            OutputDebugStringA("Queue: circular buffer overflowed");
            free(newValPtr); //as pointed out by RbMm
            return false;
        }
        //otherwise, everything worked fine
        return true;
    }
};
As I understand it, 3) will occur after 1) and 2) regardless of what I do anyways, but I should change 1) to an InterlockedRelease because I don't care whether it occurs before or after 2) and I should let the compiler decide.
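For readers not on Windows, step 1) can be sketched in standard C++ with std::atomic in place of the Interlocked functions. This is an assumption-laden illustration, not the code from the question: kQueueSize stands in for MAXQUEUESIZE, reserveSlot is a made-up name, and the memory orders shown are one plausible choice rather than a verified requirement of the finished queue.

```cpp
#include <atomic>
#include <cassert>

constexpr long kQueueSize = 8;      // hypothetical stand-in for MAXQUEUESIZE
std::atomic<long> tailIndex{0};

// Claims the current tail slot and advances the tail, wrapping around.
// Returns the index this caller now owns.
long reserveSlot()
{
    long reserved = tailIndex.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `reserved` with the
    // current value, so the loop simply retries with fresh data.
    while (!tailIndex.compare_exchange_weak(
               reserved, (reserved + 1) % kQueueSize,
               std::memory_order_acq_rel,
               std::memory_order_relaxed)) {
    }
    assert(reserved >= 0 && reserved < kQueueSize);
    return reserved;
}
```

The explicit memory_order arguments play the role that the Acquire/Release suffixes play on the Interlocked functions: they state which reorderings around the atomic operation are still allowed.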
My question is, if a function is structured such that the compiler can't reorder any of the code anyways because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
You may be confusing C++ statements with instructions. Your question isn't CPU specific, so you have to pretend you have no idea what the CPU instructions look like.
Consider this code:
if (a == 2)
{
    b = 5;
}
Now, here's an example of a re-ordering of this code that doesn't affect a single thread:
int c = b;
b = 5;
if (a != 2)
    b = c;
This performs the same operations but in a different order. It has no effect on single-threaded code. But, of course, if another thread was accessing b, it could see a value of 5 from this code even if a was never 2.
Thus it could also see a value of 5 from the original code even if a is never 2!
Why, because the two bits of code perform the same from the point of view of a single thread. And unless you use operations with guaranteed threading semantics, that's all the compiler, CPU, caches, and other platform components need to preserve.
So your belief that reordering any of the code would affect single-threaded behavior is most likely incorrect. There are lots of ways to reorder and optimize code that don't affect single-threaded behavior.
There is a document on MSDN that explains the difference: Acquire and Release Semantics.
For the sample:
a++;
b++;
c++;
If we use acquire semantics to increment a, other processors would always see the increment of a before the increments of b and c;
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
By default, the InterlockedXxx routines have both acquire and release semantics.
More specifically, for four values:
a++;
b++;
c++;
d++;
If we use acquire semantics to increment b, other processors would always see the increment of b before the increments of c and d;
The order may be a->b->c,d or b->a,c,d.
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The order may be a,b->c->d or a,b,d->c.
To quote from this answer by @antiduh:
Acquire says "only worry about stuff after me". Release says "only
worry about stuff before me". Combining those both is a full memory
barrier.
All three versions prevent the compiler from moving code across the function call, but the compiler is not the only place that reordering takes place.
Modern CPUs have "out-of-order execution" and even "speculative execution". Acquire and release semantics cause the code to compile to instructions with flags or prefixes that control reordering within the CPU.
I have a few questions about using a lock to protect my shared data structure. I am using C/C++/ObjC/ObjC++.
For example, I have a counter class that is used in a multi-threaded environment:
class MyCounter {
private:
    int counter;
    std::mutex m;
public:
    int getCount() const {
        return counter;
    }
    void increase() {
        std::lock_guard<std::mutex> lk(m);
        counter++;
    }
};
Do I need to use std::lock_guard<std::mutex> lk(m); in the getCount() method to make it thread-safe?
What happens if there are only two threads, a reader thread and a writer thread? Do I have to protect it at all? Since only one thread modifies the variable, I think no lost update can happen.
If there are multiple writers/readers for a shared primitive-type variable (e.g. int), what disaster may happen if I only lock in the write method but not in the read method? Will an 8-bit type make any difference compared to a 64-bit type?
Are any primitive types atomic by default? For example, is a write to a char always atomic? (I know this is true in Java but don't know about C++; I am using the LLVM compiler on Mac, if the platform matters.)
Yes, unless you can guarantee that changes to the underlying variable counter are atomic, you need the mutex.
Classic example, say counter is a two-byte value that's incremented in (non-atomic) stages:
(a) add 1 to lower byte
if lower byte is 0:
(b) add 1 to upper byte
and the initial value is 255.
If another thread comes in anywhere between the lower byte change a and the upper byte change b, it will read 0 rather than the correct 255 (pre-increment) or 256 (post-increment).
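The interleaving can be replayed deterministically in a single thread. The SplitCounter type below is purely hypothetical; it just makes steps (a) and (b) separately callable so the intermediate state can be inspected.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-byte counter updated one byte at a time.
struct SplitCounter {
    std::uint8_t lo;
    std::uint8_t hi;

    int read() const { return hi * 256 + lo; }
    void stepA() { ++lo; }                    // (a) add 1 to lower byte
    void stepB() { if (lo == 0) ++hi; }       // (b) carry into upper byte
};
```

Starting from lo = 255, hi = 0, a read() before stepA() gives 255, a read() after both steps gives 256, and a read() that lands between the two steps gives 0, which is the value a second thread could observe without the mutex.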
In terms of what data types are atomic, the latest C++ standard defines them in the <atomic> header.
If you don't have C++11 capabilities, then it's down to the implementation what types are atomic.
Yes, you would need to lock the read as well in this case.
There are several alternatives -- a lock is quite heavy here. Atomic operations are the most obvious (lock-free). There are also other approaches to locking in this design -- the read write lock is one example.
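A minimal sketch of the locked read, keeping the class from the question. The only additions are the lock in getCount(), the mutable keyword (an assumption needed so that a const member function can lock the mutex), and an initial value for counter.

```cpp
#include <cassert>
#include <mutex>

class MyCounter {
private:
    int counter = 0;
    mutable std::mutex m;   // mutable so the const getter can lock it
public:
    int getCount() const {
        std::lock_guard<std::mutex> lk(m);
        return counter;
    }
    void increase() {
        std::lock_guard<std::mutex> lk(m);
        counter++;
    }
};
```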
Yes, I believe that you do need to lock the read as well. But since you are using C++11 features, why don't you use std::atomic<int> counter; instead?
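That suggestion could look roughly like the following. AtomicCounter and countWithThreads are names invented for this sketch; the helper exists only to exercise the counter from several threads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class AtomicCounter {   // hypothetical name
    std::atomic<int> counter{0};
public:
    int getCount() const { return counter.load(); }
    void increase() { counter.fetch_add(1); }
};

// Hypothetical stress helper: several threads increment concurrently.
int countWithThreads(int threads, int perThread)
{
    assert(threads > 0 && perThread >= 0);
    AtomicCounter c;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&c, perThread] {
            for (int i = 0; i < perThread; ++i) c.increase();
        });
    }
    for (auto& th : pool) th.join();
    return c.getCount();
}
```

No mutex is needed here because both the read and the increment are single atomic operations; once an update involves more than one variable, a lock is usually the simpler choice again.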
As a rule of thumb, you should lock the read too.
Reads and writes of an aligned int are atomic at the hardware level on most architectures (int is usually no wider than the machine's word size, although the C++ standard does not guarantee this), so you should almost never see a corrupted int.
Yet the answer from @paxdiablo is correct, and the problem it describes will happen if someone does this:
#pragma pack(push, 1)
struct MyObj
{
    char a;
    MyCounter cnt;
};
#pragma pack(pop)
In that specific case, cnt will not be aligned to a word boundary, and the int MyCounter::counter will/might be emulated in multiple operations in CPU supporting unaligned access (like x86). Thus, you could get this sequence of operations:
Thread A: [...] set counter to 255 (counter is 0x000000FF)
          getCount() => CPU reads low byte: lo = 255
<interrupted here>
Thread B: increase() => counter is incremented, leading to counter = 256 (0x00000100)
<interrupted here>
Thread A: CPU reads high bytes: 0x000001, concatenates: 0x000001FF, returns 511!
Now, let's say you never use unaligned access. Yet, if you are doing something like this:
ThreadA.cpp:
int g = clientCounter.getCount();
while (g > 0)
{
    processFirstClient();
    g = clientCounter.getCount();
}
ThreadB.cpp:
if (acceptClient()) clientCounter.increase();
The compiler is completely allowed to replace the loop in Thread A by this:
if (clientCounter.getCount() > 0)
    while (true) processFirstClient();
Why? Because the compiler evaluates the side effects of each expression. getCount() is so simple that the compiler can deduce: it's a read of a single variable, and that variable is not modified anywhere in ThreadA.cpp, thus it's constant. Because it's constant, let's simplify this.
If you add a mutex, the mutex code will insert a memory barrier telling the compiler "hey, don't expect anything after this barrier is crossed".
Thus, the "optimization" above can not happen since getCount might have been modified.
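Sure, you could have written volatile int counter instead of int counter, and the compiler would have avoided this optimization too. Note, however, that in C++ volatile neither makes the access atomic nor establishes any ordering between threads; std::atomic<int> is the tool for that.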
In the end, if you have to write a ton of code just to avoid a mutex, you're doing it wrong (and probably will get wrong results).
You can't guarantee that multiple threads won't modify your variable at the same time, and if such a situation occurs your variable will be garbled or the program might crash. To avoid such cases, it's always better and safer to make the program thread-safe.
You can use the available synchronization techniques, such as: Mutex, Lock, Synchronization attribute (available for MS C++).
I don't know if this is true, but when I was reading the FAQ on one of the problem-providing sites, I found something that caught my attention:
Check your input/output methods. In C++, using cin and cout is too slow. Use these, and you will guarantee not being able to solve any problem with a decent amount of input or output. Use printf and scanf instead.
Can someone please clarify this? Is using scanf() in C++ programs really faster than using cin >> something? If yes, is it good practice to use it in C++ programs? I thought that it was C-specific, though I am just learning C++...
Here's a quick test of a simple case: a program to read a list of numbers from standard input and XOR all of the numbers.
iostream version:
#include <iostream>

int main(int argc, char **argv) {
    int parity = 0;
    int x;
    while (std::cin >> x)
        parity ^= x;
    std::cout << parity << std::endl;
    return 0;
}
scanf version:
#include <stdio.h>

int main(int argc, char **argv) {
    int parity = 0;
    int x;
    while (1 == scanf("%d", &x))
        parity ^= x;
    printf("%d\n", parity);
    return 0;
}
Results
Using a third program, I generated a text file containing 33,280,276 random numbers. The execution times are:
iostream version: 24.3 seconds
scanf version: 6.4 seconds
Changing the compiler's optimization settings didn't seem to change the results much at all.
Thus: there really is a speed difference.
EDIT: User clyfish points out below that the speed difference is largely due to the iostream I/O functions maintaining synchronization with the C I/O functions. We can turn this off with a call to std::ios::sync_with_stdio(false);:
#include <iostream>

int main(int argc, char **argv) {
    int parity = 0;
    int x;
    std::ios::sync_with_stdio(false);
    while (std::cin >> x)
        parity ^= x;
    std::cout << parity << std::endl;
    return 0;
}
New results:
iostream version: 21.9 seconds
scanf version: 6.8 seconds
iostream with sync_with_stdio(false): 5.5 seconds
C++ iostream wins! It turns out that this internal syncing / flushing is what normally slows down iostream i/o. If we're not mixing stdio and iostream, we can turn it off, and then iostream is fastest.
The code: https://gist.github.com/3845568
http://www.quora.com/Is-cin-cout-slower-than-scanf-printf/answer/Aditya-Vishwakarma
The performance of cin/cout can be slow because they need to keep themselves in sync with the underlying C library. This is essential if both C IO and C++ IO are going to be used.
However, if you are only going to use C++ IO, then simply use the line below before any IO operations.
std::ios::sync_with_stdio(false);
For more info on this, look at the corresponding libstdc++ docs.
Probably scanf is somewhat faster than using streams. Although streams provide a lot of type safety and do not have to parse format strings at runtime, scanf usually has the advantage of not requiring excessive memory allocations (this depends on your compiler and runtime). That said, unless performance is your only end goal and you are in the critical path, you should really favour the safer (slower) methods.
There is a very good article by Herb Sutter, "The String Formatters of Manor Farm", which goes into a lot of detail on the performance of string formatters like sscanf and lexical_cast and on what kinds of things were making them run slowly or quickly. This is probably analogous to the kinds of things that would affect performance between C-style IO and C++-style IO. The main difference between the formatters tended to be the type safety and the number of memory allocations.
I just spent an evening working on a problem on UVa Online (Factovisors, a very interesting problem, check it out):
http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&category=35&page=show_problem&problem=1080
I was getting TLE (time limit exceeded) on my submissions. On these problem solving online judge sites, you have about a 2-3 second time limit to handle potentially thousands of test cases used to evaluate your solution. For computationally intensive problems like this one, every microsecond counts.
I was using the suggested algorithm (read about in the discussion forums for the site), but was still getting TLEs.
I changed just "cin >> n >> m" to "scanf( "%d %d", &n, &m )" and the few tiny "couts" to "printfs", and my TLE turned into "Accepted"!
So, yes, it can make a big difference, especially when time limits are short.
If you care about both performance and string formatting, do take a look at Matthew Wilson's FastFormat library.
edit -- link to accu publication on that library: http://accu.org/index.php/journals/1539
The statements cin and cout in general use seem to be slower than scanf and printf in C++, but actually they are FASTER!
The thing is: In C++, whenever you use cin and cout, a synchronization process takes place by default that makes sure that if you use both scanf and cin in your program, then they both work in sync with each other. This sync process takes time. Hence cin and cout APPEAR to be slower.
However, if the synchronization process is set to not occur, cin is faster than scanf.
To skip the sync process, include the following code snippet in your program right in the beginning of main():
std::ios::sync_with_stdio(false);
Visit this site for more information.
There are stdio implementations (libio) which implement FILE* as a C++ streambuf, and fprintf as a runtime format parser. IOstreams don't need runtime format parsing; that's all done at compile time. So, with the backends shared, it's reasonable to expect that iostreams is faster at runtime.
Yes iostream is slower than cstdio.
Yes you probably shouldn't use cstdio if you're developing in C++.
Having said that, there are even faster ways to get I/O than scanf if you don't care about formatting, type safety, blah, blah, blah...
For instance this is a custom routine to get a number from STDIN:
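#include <stdio.h> // getchar_unlocked() is POSIX, not standard C or C++

inline int get_number()
{
    int c;
    int n = 0;
    // accumulate digits until the first non-digit (or EOF)
    while ((c = getchar_unlocked()) >= '0' && c <= '9')
    {
        // n = 10 * n + (c - '0');
        n = (n << 3) + ( n << 1 ) + c - '0';
    }
    return n;
}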
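The same digit-accumulation loop can be sketched over an in-memory buffer, which avoids the POSIX-only getchar_unlocked and makes the routine easy to test. parseInts is a hypothetical helper, not part of any library; it additionally accepts a leading minus sign and, like the routine above, does not check for overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Extracts every decimal integer from `text`, skipping any other
// characters. A '-' directly before a digit makes the number negative.
std::vector<int> parseInts(const std::string& text)
{
    std::vector<int> out;
    std::size_t i = 0;
    while (i < text.size()) {
        bool negative = false;
        if (text[i] == '-' && i + 1 < text.size() &&
            text[i + 1] >= '0' && text[i + 1] <= '9') {
            negative = true;
            ++i;                         // now positioned on the first digit
        }
        if (text[i] < '0' || text[i] > '9') {
            ++i;                         // not part of a number, skip it
            continue;
        }
        int n = 0;
        while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
            n = 10 * n + (text[i] - '0');
            ++i;
        }
        out.push_back(negative ? -n : n);
    }
    return out;
}
```

In a contest setting the buffer would typically be filled by reading all of standard input in one go, but how much that helps depends on the platform and should be measured.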
The problem is that cin has a lot of overhead involved because it gives you an abstraction layer above scanf() calls. You shouldn't use scanf() over cin if you are writing C++ software because that is what cin is for. If you want performance, you probably wouldn't be writing I/O in C++ anyway.
Of course it's ridiculous to use cstdio over iostream, at least when you develop software (if you are already using C++ over C, then go all the way and use its benefits instead of only suffering from its disadvantages).
But in the online judge you are not developing software, you are creating a program that should be able to do things Microsoft software takes 60 seconds to achieve in 3 seconds!!!
So, in this case, the golden rule goes like this (of course, if you don't get into even more trouble by using Java):
Use C++ and use all of its power (and heaviness/slowness) to solve the problem.
If you get time limited, then change the cins and couts for printfs and scanfs
(if you get screwed up by using the class string, print like this: printf("%s", mystr.c_str());).
If you still get time limited, then try to make some obvious optimizations (like avoiding too many nested for/while/do-while loops or recursive functions). Also make sure to pass objects that are too big by reference...
If you still get time limited, then try changing std::vectors and sets for C arrays.
If you still get time limited, then go on to the next problem...
#include <stdio.h>
#include <unistd.h>
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int scanuint(unsigned int* x)
{
    int c; // int, not char, so that EOF can be told apart from every valid byte
    *x = 0;
    // skip everything up to the first digit
    do
    {
        c = getchar_unlocked();
        if (unlikely(c==EOF)) return 1;
    } while(c<'0' || c>'9');
    // accumulate digits
    do
    {
        //*x = (*x<<3)+(*x<<1) + c - '0';
        *x = 10 * (*x) + c - '0';
        c = getchar_unlocked();
        if (unlikely(c==EOF)) return 1;
    } while ((c>='0' && c<='9'));
    return 0;
}

int main(int argc, char **argv) {
    int parity = 0;
    unsigned int x;
    while (1 != (scanuint(&x))) {
        parity ^= x;
    }
    parity ^=x;
    printf("%d\n", parity);
    return 0;
}
There's a bug at the end of the file, but this C code is dramatically faster than the faster C++ version.
paradox@scorpion 3845568-78602a3f95902f3f3ac63b6beecaa9719e28a6d6 ▶ make test
time ./xor-c < rand.txt
360589110
real 0m11,336s
user 0m11,157s
sys 0m0,179s
time ./xor2-c < rand.txt
360589110
real 0m2,104s
user 0m1,959s
sys 0m0,144s
time ./xor-cpp < rand.txt
360589110
real 0m29,948s
user 0m29,809s
sys 0m0,140s
time ./xor-cpp-noflush < rand.txt
360589110
real 0m7,604s
user 0m7,480s
sys 0m0,123s
The original C++ took 30 sec; the C code took 2 sec.
Even if scanf were faster than cin, it wouldn't matter. The vast majority of the time, you will be reading from the hard drive or the keyboard. Getting the raw data into your application takes orders of magnitude more time than it takes scanf or cin to process it.