Why do std::atomic_compare_exchange and all its brothers and sisters update the passed expected value?
I am wondering if there are any reasons besides the obvious convenience in loops, e.g.: is there an intrinsic function which can do that in one operation to improve performance?
The processor has to load the current value, in order to do the "compare" part of the operation. When the comparison fails the caller needs to know the new value, to retry the compare-exchange (you almost always use it in a loop), so if it wasn't returned (e.g. by modifying the expected value that is passed by reference) then the caller would need to do another atomic load to get the new value. That's wasteful, because the processor has already loaded the value. You should only be messing about with low-level atomic operations when extreme performance is the only option, so in that case you do not want to perform two operations when one will do.
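To illustrate, here is a minimal sketch of the usual retry loop (not from the question; the function name is hypothetical). Because compare_exchange_weak writes the freshly read value back into expected on failure, the loop can retry without a separate load:

#include <atomic>

// Atomically double a counter with the usual CAS loop. On failure,
// compare_exchange_weak stores the value it actually saw into
// `expected`, so no extra reload is needed before retrying.
void atomic_double(std::atomic<int>& counter)
{
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected * 2))
    {
        // `expected` already holds the current value; just loop.
    }
}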
is there an intrinsic function which can do that in one operation to improve performance
That can do what, specifically? The instruction has to load the current value to do the comparison, so on a mismatch yielding the current value costs nothing and is pretty much guaranteed to be useful.
I'm discussing with some colleagues the efficiency of if statements and which is best in terms of memory and CPU cost; at this stage the language used doesn't matter.
The two conditionals are the following:
If value is present then
skip
If value = "1234" then
execute
So, the first checks whether the value is null, in which case the routine exits (skips); the second statement compares the variable to a specific value.
What I'm thinking is that the first uses more CPU and the second more RAM; what do you think about it?
Do I have to use both so that if the value is null the second statement is skipped? Or is it better to use only the second, which compares the two values?
Thank you
Can you elaborate on why the second uses more RAM? "1234" will be placed in memory only once, as a const value. The code which does the comparison is also compiled and generated only once. In fact the second if might be more CPU-consuming because it compares strings, but I don't think you can do much about that. So I'm not really sure how you arrived at your conclusions. Am I missing something?
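As an aside, in most languages short-circuit evaluation lets the two checks collapse into one condition. A minimal C++ sketch (the function name and pointer-based interface are illustrative):

#include <string>

// && short-circuits: the string comparison is skipped entirely
// when the pointer is null.
bool should_execute(const std::string* value)
{
    return value != nullptr && *value == "1234";
}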
I was thinking about reference counting based on atomic integers that would be safe from overflow. How to do it?
Please let's not focus on whether such overflow is a realistic problem or not. The task itself got my interest even if not practically important.
Example
An example implementation of reference counting is shown in the Boost.Atomic documentation. Based on that example we can extract the following sample code:
#include <boost/atomic.hpp>
#include <boost/cstdint.hpp>

struct T
{
    boost::atomic<boost::uintmax_t> counter;
};

void add_reference(T* ptr)
{
    ptr->counter.fetch_add(1, boost::memory_order_relaxed);
}

void release_reference(T* ptr)
{
    // only the thread that drops the last reference deletes the object
    if (ptr->counter.fetch_sub(1, boost::memory_order_release) == 1) {
        boost::atomic_thread_fence(boost::memory_order_acquire);
        delete ptr;
    }
}
In addition, the following explanation is given:
Increasing the reference counter can always be done with memory_order_relaxed: New references to an object can only be formed from an existing reference, and passing an existing reference from one thread to another must already provide any required synchronization.
It is important to enforce any possible access to the object in one thread (through an existing reference) to happen before deleting the object in a different thread. This is achieved by a "release" operation after dropping a reference (any access to the object through this reference must obviously have happened before), and an "acquire" operation before deleting the object.
It would be possible to use memory_order_acq_rel for the fetch_sub operation, but this results in unneeded "acquire" operations when the reference counter does not yet reach zero and may impose a performance penalty.
EDIT >>>
It seems that the Boost.Atomic documentation might be wrong here; acq_rel might be needed after all.
At least that is how boost::shared_ptr is implemented when built on std::atomic (there are other implementations as well); see the file boost/smart_ptr/detail/sp_counted_base_std_atomic.hpp.
Herb Sutter also mentions it in his lecture C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2 (the reference counting part starts at 1:19:51). He also seems to discourage the use of fences in this talk.
Thanks to user 2501 for pointing that out in comments below.
<<< END EDIT
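For illustration only, a sketch of the alternative release path the edit above refers to, with a single acq_rel decrement and no standalone fence (this is not the actual Boost source):

void release_reference_acq_rel(T* ptr)
{
    // acq_rel on the decrement itself replaces the release ordering
    // plus the separate acquire fence from the earlier version.
    if (ptr->counter.fetch_sub(1, boost::memory_order_acq_rel) == 1) {
        delete ptr;
    }
}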
Initial attempts
Now the problem is that add_reference as written could (at some point) overflow, and it would do so silently. That obviously could lead to problems: once the counter wraps around and add_reference is called once again (so the count reaches 1), the matching release_reference would prematurely destroy the object.
I was thinking how to make add_reference detect overflow and fail gracefully without risking anything.
Comparing to 0 after the fetch_add returns will not do, since between the two operations some other thread could call add_reference again (reaching 1) and then release_reference (in effect erroneously destroying the object).
Checking first (with load) will not help either. This way some other thread could add its own reference between our calls to load and fetch_add.
Is this the solution?
Then I thought that maybe we could start with a load, but only if we then follow it with a compare_exchange.
So first we do a load and get a local value. If it is std::numeric_limits<boost::uintmax_t>::max() then we fail: add_reference cannot add another reference, as all possible ones are already taken.
Otherwise we compute another local value, which is the previous local reference count plus 1.
And now we do a compare_exchange, providing as the expected value the original local reference count (this ensures that no other thread modified the reference count in the meantime) and as the desired value the incremented local reference count.
Since compare_exchange can fail, we have to do all of this (including the load) in a loop, until it either succeeds or the max value is detected.
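A minimal sketch of that scheme (the name and the bool-returning interface are mine, and std::atomic is used for brevity):

#include <atomic>
#include <cstdint>
#include <limits>

// Load, check for max, then try to publish count + 1.
// Returns false instead of silently overflowing.
bool try_add_reference(std::atomic<std::uintmax_t>& counter)
{
    std::uintmax_t current = counter.load(std::memory_order_relaxed);
    do {
        if (current == std::numeric_limits<std::uintmax_t>::max())
            return false;  // all possible references already taken
        // on failure, compare_exchange_weak refreshes `current`
    } while (!counter.compare_exchange_weak(current, current + 1,
                                            std::memory_order_relaxed));
    return true;
}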
Some questions
Is such solution correct?
What memory ordering would be required to make it valid?
Which compare_exchange should be used? _weak or _strong?
Would it affect release_reference function?
Is it used in practice?
The solution is correct, though it could perhaps be improved in one respect. Currently, if the value reaches the max on the local CPU, it can be decreased by another CPU while the current CPU still sees the cached old value. It would be worth doing a dummy compare_exchange with the same expected and desired value to confirm the max is really still there, and only then throw an exception (or whatever you want).
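A hedged sketch of that confirmation step (the helper name is mine):

#include <atomic>
#include <cstdint>
#include <limits>

// expected == desired, so the counter is never modified; the
// compare_exchange merely forces a fresh read confirming the max.
bool confirm_saturated(std::atomic<std::uintmax_t>& counter)
{
    std::uintmax_t max = std::numeric_limits<std::uintmax_t>::max();
    return counter.compare_exchange_strong(max, max);
}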
For the rest:
It doesn't matter whether you use _weak or _strong, as it will run in a loop anyway and therefore the next load will quite reliably get the latest value.
For add_reference and release_reference: who would then check whether the reference was really added or not? Would it throw an exception? If yes, it would probably work. But generally it's better not to let such low-level things fail, and rather to use uintptr_t for the reference counter, so it can never overflow: it's big enough to cover the address space and therefore any number of references existing at the same time.
No, it's not used in practice for the above reasons.
Quick math: say uint is 32 bits, so max uint is 4G (4 billion and something). Each reference/pointer is at least 4 bytes (8 if you are on a 64-bit system), so to overflow you need 16 GB of memory dedicated to storing references pointing to the same object, which would point to a serious design flaw.
I would say it's not a problem today, nor in the foreseeable future.
This question is moot. Even assuming an atomic increment takes 1 CPU cycle (it does not!), on a 4 GHz CPU it would take roughly 146 years to wrap around a 64-bit integer (2^64 increments at 4x10^9 per second is about 4.6 billion seconds), provided the CPU does nothing else but keep incrementing.
Taking into account realities of an actual program, I find it hard to believe this is a real issue which can pester you.
Concerning objects (especially strings), call by reference is faster than call by value because the function call does not need to create a copy of the original object. Using const, one can also ensure that the reference is not abused.
My question is whether const call-by-reference is also faster if using primitive types, like bool, int or double.
void doSomething(const string & strInput, unsigned int iMode);
void doSomething(const string & strInput, const unsigned int & iMode);
My suspicion is that it is advantageous to use call-by-reference as soon as the primitive type's size in bytes exceeds the size of the address value. Even if the difference is small, I'd like to take the advantage because I call some of these functions quite often.
Additional question: Does inlining have an influence on the answer to my question?
My suspicion is that it is advantageous to use call-by-reference as soon as the primitive type's size in bytes exceeds the size of the address value. Even if the difference is small, I'd like to take the advantage because I call some of these functions quite often.
Performance tweaking based on hunches works about 0% of the time in C++ (that's a gut feeling I have about statistics; it usually works...).
It is correct that a const T& will be smaller than a T if sizeof(T) > sizeof(ptr), where a pointer is usually 32 or 64 bits, depending on the system.
Now ask yourself:
1) How many built-in types are bigger than 64 bits?
2) Is not copying 32 bits worth making the code less clear? If your function becomes significantly faster because you didn't copy a 32-bit value to it, maybe it doesn't do much?
3) Are you really that clever? (Spoiler alert: no.) See this great answer for the reason why it is almost always a bad idea:
https://stackoverflow.com/a/4705871/1098041
Ultimately, just pass by value. If after (thorough) profiling you identify that some function is a bottleneck, and all of the other optimizations you tried weren't enough (and you should try most of them before this), pass by const reference.
Then see that it doesn't change anything, roll over, and cry.
In addition to the other answers, I would like to note that when you pass a reference and use (i.e. dereference) it a lot in your function, it could be slower than making a copy.
This is because the local variables of a function (usually) get loaded into the cache together, but when one of them is a pointer/reference and the function uses it, that can result in a cache miss: the CPU needs to go to the (slower) main memory to get the pointed-to variable, which can be slower than making the copy that is loaded into the cache together with the function's other locals.
So even for 'small objects' it could be potentially faster to just pass by value.
(I read this in the very good book Computer Systems: A Programmer's Perspective.)
Some more interesting discussion on the whole cache hit/miss topic: How does one write code that best utilizes the CPU cache to improve performance?
On a 64-bit architecture, there is (long double aside) no primitive type, at least not in C++11, which is larger than a pointer/reference. You should test this, but intuitively, there should be the same amount of data shuffled around for a const T& as for an int64_t, and less for any primitive where sizeof(T) < sizeof(int64_t). Therefore, insofar as you can measure any difference, passing primitives by value should be faster if your compiler is doing the obvious thing, which is why I stress that if you need certainty here, you should write a test case.
Another consideration is that primitive function parameters can end up in CPU registers, which makes accessing them as fast as access gets. You might find there are more instructions being generated for your const T& parameters than for your T parameters when T is a primitive. You could test this by checking the assembler output of your compiler.
I was taught:
Pass by value when an argument variable is one of the fundamental built-in types, such as bool, int, or float. Objects of these types are so small that passing by reference doesn't result in any gain in efficiency. Also pass by value if you want the function to work on a copy of the variable.
Pass a constant reference when you want to efficiently pass a value that you don't need to change.
Pass a (non-const) reference only when you want to alter the value of the argument variable. But try to avoid changing argument variables whenever possible.
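A minimal sketch of signatures following those three guidelines (the function names are illustrative):

#include <string>

void setCount(int count);                 // small built-in type: pass by value
void printName(const std::string& name);  // read-only object: const reference
void normalize(std::string& text);        // argument to be changed: reference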
const is a keyword which is evaluated at compile time. It does not have any impact on runtime performance. You can read some more about this here: https://isocpp.org/wiki/faq/const-correctness
On an x86 processor, I am not sure of the difference between the compare-and-swap atomic operation and the load-link/store-conditional operation. Is the latter safer than the former? Is it the case that the first is better than the second?
There are three common styles of atomic primitive: Compare-Exchange, Load-Linked/Store-Conditional, and Compare-And-Swap.
A CompareExchange operation will atomically read a memory location and, if it matches a compare value, store a specified new value. If the value that was read does not match the compare value, no store takes place. In any case, the operation will report the original value read.
A Compare-And-Swap operation is similar to CompareExchange, except that it does not report what value was read, merely whether that value matched the compare value. Note that a CompareExchange may be used to implement Compare-And-Swap by having it report whether the value read from memory matched the specified compare value.
The LL/SC combination allows a store operation to be conditioned upon whether some outside influence might have affected the target since its value was loaded. In particular, it guarantees that if the store succeeds, the location has not been written at all by outside code. Even if outside code wrote a new value and then re-wrote the original value, that would be guaranteed to cause the conditional store to fail. Conceptually, this might make LL/SC seem more powerful than the other methods, since it wouldn't have the "ABA" problem. Unfortunately, LL/SC semantics allow stores to fail spuriously, and the probability of spurious failure may rise rapidly as the complexity of the code between the load and the store increases. While using LL/SC to implement something like an atomic increment directly would be more efficient than using it to implement a compare-and-swap and then building the atomic increment on top of that compare-and-swap, in situations where one would need to do much between the load and the store, one should generally use LL/SC to implement a compare-and-swap, and then use that compare-and-swap method in a load-modify-CompareAndSwap loop.
Of the three primitives, the Compare-And-Swap is the least powerful, but it can be implemented in terms of either of the other two. CompareAndSwap can do a pretty good job of emulating CompareExchange, though there are some corner cases where such emulation might live-lock. Neither CompareExchange nor Compare-And-Swap can offer guarantees quite as strong as LL-SC, though the limited amount of code one can reliably place within an LL/SC loop limits the usefulness of its guarantees.
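For example, a minimal sketch of a compare-and-swap (success/failure only) built on C++'s compare-exchange, which also reports the value read:

#include <atomic>

// compare_exchange_strong writes the observed value into `expected`,
// but a plain compare-and-swap surfaces only the bool.
bool compare_and_swap(std::atomic<int>& target, int expected, int desired)
{
    return target.compare_exchange_strong(expected, desired);
}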
x86 does not provide LL/SC instructions. Check out wikipedia for platforms that do. Also see this SO question.
I have to do an extensive calculation on a big vector of integers. The vector size is not changed during the calculation, but the size of the vector is frequently accessed by the code. What is faster in general: using the vector::size() function, or using a helper constant vectorSize storing the size of the vector?
I know that compilers are usually able to inline the size() function when the proper compiler flags are set; however, making a function inline is something that a compiler may do but cannot be forced to do.
Interesting question.
So, what's going to happen? Well, if you debug with gdb you'll see something like 3 member variables (names are not accurate):
_M_begin: pointer to the first element of the dynamic array
_M_end: pointer one past the last element of the dynamic array
_M_capacity: pointer one past the last element that could be stored in the dynamic array
The implementation of vector<T,Alloc>::size() is thus usually reduced to:
return _M_end - _M_begin; // Note: _Mylast - _Myfirst in VC 2008
Now, there are 2 things to consider regarding the actual optimizations possible:
will this function be inlined? Probably: I am no compiler writer, but it's a good bet, since the overhead of a function call would dwarf the actual work here, and since it's templated we have all the code available in the translation unit
will the result be cached (i.e. sort of kept in an unnamed local variable)? It could well be, but you won't know unless you disassemble the generated code
In other words:
If you store the size yourself, there is a good chance it will be as fast as the compiler could get it.
If you do not, it will depend on whether the compiler can establish that nothing else is modifying the vector; if not, it cannot cache the variable, and will need to perform memory reads (L1) every time.
It's a micro-optimization. In general, it will be unnoticeable, either because the performance does not matter or because the compiler will perform it regardless. In a critical loop where the compiler does not apply the optimization, it can be a significant improvement.
As I understand the 1998 C++ specification, vector<T>::size() takes constant time, not linear time. So, this question likely boils down to whether it's faster to read a local variable than calling a function that does very little work.
I'd therefore claim that storing your vector's size() in a local variable will speed up your program by a small amount, since you'll only call that function (and therefore the small constant amount of time it takes to execute) once instead of many times.
Performance of vector::size(): is it as fast as reading a variable?
Probably not.
Does it matter?
Probably not.
Unless the work you're doing per iteration is tiny (like one or two integer operations) the overhead will be insignificant.
In every implementation I've seen, vector::size() performs a subtraction of end() and begin(), i.e. it's not as fast as reading a variable.
When implementing a vector, the implementer has to choose which of end() and size() shall be fastest, i.e. whether to store the number of initialized elements or a pointer/iterator to the element after the last initialized one.
In other words: iterate by using iterators.
If you are worried about the performance of size(), write your index-based for loop like this:
for (size_t i = 0, i_end = container.size(); i < i_end; ++i) {
    // do something performance critical
}
I always save vector.size() in a local variable (if the size doesn't change inside the loop!).
Saving it in a local variable instead of calling it on each iteration can be faster.
At least, that's what I experienced.
I can't give you any real numbers, as I tested this a very long time ago. However, from what I can recall, it made a noticeable difference (though potentially only in debug mode), especially when nesting loops.
And to all the people complaining about micro-optimization:
It's a single additional line of code that introduces no downsides.
You could write yourself a functor for your loop body and call it via std::for_each. It does the iteration for you, and then your question becomes moot. However, you're introducing a function call (that may or may not get inlined) for every loop iteration, so you'd best profile it if you're not getting the performance you expect.
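A minimal sketch of that suggestion, using a lambda as the functor (illustrative; it assumes the loop body works element by element):

#include <algorithm>
#include <vector>

// std::for_each drives the iteration itself, so size() is never
// called inside the loop.
void process(std::vector<int>& values)
{
    std::for_each(values.begin(), values.end(), [](int& v) {
        v *= 2;  // stand-in for the performance-critical body
    });
}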
Always get a profile of your application before looking at this sort of micro-optimization. Remember that even if it performs a subtraction, the compiler can still easily optimize it in many ways that would negate any performance loss.