Why is a Compare-And-Swap operation limited by Amdahl's law? - concurrency

Martin Thompson asserts that a STM that relies on a ref that relies on CAS will ultimately be limited by Amdahl's law. Amdahl's law being that the maximum performance of a parallel program is limited by the sequential (non-parallel) part of the program. Is Martin Thompson saying that CAS is by its nature non-parallel?

I would think that's exactly his point. The swap must come after the results of the compare are known, so ultimately you cannot run any faster than "compare, then swap, then next compare, then next swap, the next compare, ...".
Of course, in most realistic cases, you won't get anywhere close to reaching that limit -- and you would be unbelievably thrilled with the performance if you did. It's kind of like saying that cars can never go faster than the speed of light. It's almost undoubtedly true, but car manufacturers needn't worry about it.

Related

Why parallel computing helps the use of non-local resources?

I've been reading about parallel computing and it's obvious how it helps saving time and solving larger problems, but I don't get how it takes advantage of the use of non-local resources.
Better ask IF, before next asking WHY:
With all due respect, the answer for the IF is not granted at all.
Any, and here let me emphasise that before ANY attempt to change a pure-[SERIAL] code-execution scheduling into any form of either "just"-[CONCURRENT] or even true-[PARALLEL] first collect sufficient quantitative evidence that such efforts will bring an expected set of performance benefits.
This is not obvious and not every professional designer starts with comparing apples to apples.
There are costs to be paid, ALWAYS:
If speaking about using resources, always start with accounting all the costs, that such a use accrues - be it a local or non-local resource. NUMA-architectures have brought many surprises once costs-of-use are indeed accounted for.
In modern criticism of the original Amdahl's Law formulation one has to decode a principal must to also include all the add-on overhead costs, as the original -- overhead-ignorant or overhead-naive if one may wish to -- Amdahl's Law formulation was immensely misleading.
The next cardinal vector of the contemporary criticism of the original law formulation comes from the fact that the block of the yet [PAR]-convertible part of the problem processing is not infinitely divisible, but exhibits some sort of further indivisible "atomic"-part, that will never benefit from the next available free resources ( even if such resources are in almost infinite capacity ), that could be used for parallel processing, but the "atomically"-indivisible part cannot be further split, as its internal dependency-chain cannot be split and/or distributed.
There are not many problems, that enjoy above 90% [PARALLEL]-fraction, so always keep in mind the even overhead-naive formulation demonstration glass-ceiling for the expected acceleration - reaching < 10x for even infinite amount of processors ( plus check both the parts of the criticism of the original law formulation ). On the very contrary, each and every improvement in the [SERIAL]-fraction of the process will always 1:1 influence the gain in performance.
Answer:
So, IF has been validated first that some non-local resource can help more, than has to be and always will be paid ( in a sum of all the types of the add-on costs ) for a chance to use of such a distributed-system parallel-processing non-local resource, then WHY NOT.
Understanding the Economy-of-Use rules the game.

I could not understand the meaning of the last paragraph in the the text below

The section below can be found on this article by Hans Boehm (BTW a very good article, as far as I could read it):
The problem with pure sequential consistency
Unfortunately, it is too expensive to guarantee sequentially
consistent executions in all cases. For example, reconsider the
program above. Most processors do not actually wait for the initial
assignments to complete before executing the second statements in each
thread. The initial assignments just cause the value 1 to be saved in
a store buffer, where it waits to be copied to memory, but is not yet
visible to another processor. If the two threads were to execute in
lock step on different processors, it is quite likely that both r1 and
r2 will be zero, because the reads of both x and y occur before the
writes become visible to the other thread.
Similarly, compilers routinely transform code in ways that violate
sequential consistency. For example, if our initial example occurred
as part of a larger program, the compiler might decide to move one or
both of the r2 = x and r1 = y "load" operations to the beginning of
their respective threads, to give them more time to complete before
the values in r1 and r2 are needed. Causing the load to happen early
is essentially equivalent to the hardware delaying the store; it can
again cause both loads to read zero. Since the two operations in each
thread touch independent variables, there is no reason for the
compiler to believe that they are not interchangeable. This kind of
compiler transformation can produce significant performance
improvements, and we do not want to disallow it.
In essence, both hardware and compilers may perform very similar
optimizations that reorder memory references as perceived by other
threads.
There has been a significant amount of research on reducing the cost
of sequential consistency, both in hardware, and through more complete
compiler analysis. But most experts feel that the hardware costs are
still too high, and the compiler optimizations require too much
information about the entire program to be generally feasible. And, as
we will see below, insisting on pure sequential consistency generally
does not provide us with any real advantage.
After a few trials I was able to understand the article up to, but not including this last paragraph.
"pure sequential consistency" is equivalent to declaring every single variable in a C++ program to be std::atomic. The compiler and processor are completely locked down in terms of reordering: they simply may not reorder memory accesses at all. Multithreaded programs would have very nice predictable semantics, and be slower than molasses. The "cost" referred to in the first sentence is the loss of performance necessitated by the requirements of pure sequential consistency. Research into finding ways to reduce that cost - either in hardware or by making compilers smarter to find more optimization opportunities that do not violate pure sequential consistency - has not been fruitful.
The last sentence "And, as we will see below, insisting on pure sequential consistency generally does not provide us with any real advantage." suggests that relaxing the memory ordering constraints from pure sequential consistency can greatly increase performance, hopefully without rendering the memory semantics incomprehensible.

Optimization Techniques for C++

In his talk a few days ago at Facebook - slides, video, Andrei Alexandrescu talks about common intuitions that might prove us wrong. For me one very interesting point came up on Slide 7 where he states that the assumption "Fewer instructions = faster code" is not true and more instructions will not necessarily mean slower code.
Here comes my problem: The audio quality of his talk (around 6:20min) is not that well and I don't understand the explanation very well, but from what I get is that he is comparing retired instructions with optimality of an algorithm on a performance level.
However, from my understanding this cannot be done because these are two independent structural levels. Instructions (especially actually retired instructions) are one very important measure and basically, gives you an idea about performance to achieve a goal. If we leave out the latency of an instruction, we can generalize that fewer retired instructions = faster code. Now, of course there are cases where an algorithm that performs complex calculations inside a loop will yield better performance even though it is performed inside the loop, because it will break the loop earlier (think graph traversal). But wouldn't it be more useful to compare to algorithms on a complexity level rather than saying this loop has more instructions and is better than the other? From my point of view, the better algorithm will have less retired instructions in the end.
Can someone please help me to understand where he was going with his example, and how can there be a case where (significantly) more retired instructions lead to better performance?
The quality is indeed bad, but I think he leads to the fact that CPUs are good for calculations, but suffer from bad performance for memory seek (RAM is much slower then CPU), and branches (because CPU works as a pipeline, and branches might cause the pipeline to break).
Here are some cases where more instructions are faster:
Branch prediction - even if we need to do more instructions, but it causes for a better branch prediction, the pipeline of the CPU will be full more time, and less ops will be "thrown out" of it, which ultimately leads to better performance. This thread for example, shows how doing the same thing, but first sorting - improves performnce.
CPU Cache - If your code is more cache optimized, and follows the principle of locality - it is more likely to be faster then a code who doesn't, even if the code that doesn't do half the amount of instructions. This thread gives an example for a small cache optimization - that the same number of instructions might result in much slower code if it is not cache optimized.
It also matters which instructions are done. Sometimes - some instructions might be slower to perform then others, for example - divide might be slower then integer addition.
Note: All of the above are machine dependent and how/if they actually change the performance might vary from one architecture to the other.
The number of instructions is not a good measure in itself.
Fewer retired instructions (because there is nothing more to do) = faster code.
Fewer retired instructions (because they have to wait for dependencies) = slower code.
It can sometimes be that more instructions in the code also means more retired instructions, because they can use up execution slots that would otherwise be wasted in case 2.

Speed of C++ operators/ simple math

I'm working on a physics engine and feel it would help having a better understanding of the speed and performance effects of performing many simple or complex math operations.
A large part of a physics engine is weeding out the unnecessary computations, but at what point are the computations small enough that a comparative checks aren't necessary?
eg: Testing if two line segments intersect. Should there be check on if they're near each other before just going straight into the simple math, or would the extra operation slow down the process in the long run?
How much time do different mathematical calculations take
eg: (3+8) vs (5x4) vs (log(8)) etc.
How much time do inequality checks take?
eg: >, <, =
You'll have to do profiling.
Basic operations, like additions or multiplications should only take one asm instructions.
EDIT: As per the comments, although taking one asm instruction, multiplications can expand to microinstructions.
Logarithms take longer.
Also one asm instruction.
Unless you profile your code, there's no way to tell where your bottlenecks are.
Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will results in a bigger speed gain than optimizing the small stuff.
You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance then, start optimizing - first high-level, and only afterwards low-level.
You might also want to try dynamic programming or caching.
As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions.
However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.
More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
However, as #Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.
As for 1, in a general case I'd say that if your algorithm works in such a way that it has some adjustable thresholds for when to execute certain tests, you could do some profiling and print out a performance graph of some kind, and determine the optimal values for those thresholds.
Well, this depends on your hardware. Very nice tables with instruction latency are http://www.agner.org/optimize/instruction_tables.pdf
1. it depends on the code a lot. Also don't forget it doesn't depend only on computations, but how well the comparison results can be predicted.
2. Generally addition/subtraction is very fast, multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it). The library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example sqrt() and 1/sqrt() can be computed using one SSE instruction.
3. From about one cycle to several dozens of cycles. The current processors does the prediction on conditions. If the prediction is right right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.
That means if you have a code, where a condition is met most of the time, it will be fast. Similarly if you have code where the condition is not met most the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around alot, or are they static? You will have to run tests. Possible data-structures for fast checking of proximity are kd-trees or locality-sensitive hashes (there may be others). I am not sure if these are appropriate for your application, you'd have to check if the maintenance of the data-structure and the lookup-cost are OK for you.
You will have to run tests. Consider checking if you can use vectorization, or if you can even run some of the computations in a GPU using CUDA or something like that.
Same as above - you have to test.
You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.
Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.
As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
as much as I know all "inequality checks" take the same time.
regarding the rest calculations, I would advice you to run some tests like
take time stamp A
make 1,000,000 "+" calculation (or any other).
take time stamp B
calculate the diff between A and B.
then you can compare the calculations.
take in mind:
using different mathematical lib may change it (some math lib are more performance oriented and some more precision oriented)
the compiler optimization may change it.
each processor is doing it differently.

Performance of C++ Operators

Is there any sort of performance difference between the arithmetic operators in c++, or do they all run equally fast? E.g. is "++" faster than "+=1"? What about "+=10000"? Does it make a significant difference if the numbers are floats instead of integers? Does "*" take appreciably longer than "+"?
I tried performing 1 billion each of "++", "+=1", and "+=10000". The strange thing is that the number of clock cycles (according to time.h) is actually counterintuitive. One might expect that if any of them are the fastest, it is "++", followed by "+=1", then "+=10000", but the data shows a slight trend in the opposite direction. The difference is more pronounced on 10 billion operations. This is all for integers.
I am dabbling in scientific computing, so I wanted to test the performance of operators. If any of the operators operated in time that was linear in terms of the inputs, for example.
About your edit, the language says nothing about the architecture it's running on. Your question is platform dependent.
That said, typically all fundamental data-type operations have a one-to-one correspondence to assembly.
x86 for example has an instruction which increments a value by 1, which i++ or i += 1 would translate into. Addition and multiplication also have single instructions.
Hardware-wise, it's fairly obvious that adding or multiplying numbers is at least linear in the number of bits in the numbers. Because the hardware has a constant number of bits, it's O(1).
Floats have their own processing unit, usually, which also has single instructions for operations.
Does it matter?
Why not write the code that does what you need it to do. If you want to add one, use ++. If you want to add a large number, add a large number. If you need floats, use floats. If you need to multiply two numbers, then multiply them.
The compiler will figure out the best way to do what you want, so instead of trying to be tricky, do what you need and let it do the hard work.
After you've written your working code, and you decide it's too slow, profile it and find out why. You'll find it's not silly things like multiplying versus adding, but rather going about the entire (sub-)problem in the wrong way.
Practically, all of the operators you listed will be done in a single CPU instruction anyway, on desktop platforms.
No, no, yes*, yes*, respectively.
* but do you really care?
EDIT: to give some kind of idea with a modern processor, you may be able to do 200 integer additions in the time it takes to make one memory access, and only 50 integer multiplications. If you think about it, you're still going to be bound by the memory accesses most of the time.
What you are asking is: What basic operations get transformed into which assembly instructions and what is the performance of those instructions on my specific architecture. And this is also your answer: The code they get translated to is dependant on your compiler and it's knowledge of your architecture, their performance depends on your architecture.
Mind you: in C++ operators can be overloaded for user defined types. They can behave differently from built-in types and the implementation of the overload can be non-trivial (no just one instruction).
Edit: A hint for testing. Most compilers support outputting the generated assembly code. The option for gcc is -S. If you use some other compiler have a look at their documentation.
The best answer is to time it with your compiler.
Look up the optimization manuals for your CPU. That's the only place you're going to find answers.
Get your compiler to output the generated assembly. Download the manuals for your CPU. Look up the instructions used by the compiler in the manual, and you know how they perform.
Of course, this presumes that you already know the basics of how a pipelined, superscalar out-of-order CPU operates, what branch prediction, instruction and data cache and everything else means. Do your homework.
Performance is a ridiculously complicated subject. Depending on context, floating-point code may be as fast as (or faster than) integer code, or it may be four times slower. Usually branches carry almost no penalty, but in special cases, they can be crippling. Sometimes, recomputing data is more efficient than caching it, and sometimes not.
Understand your programming language. Understand your compiler. Understand your CPU. And then examine exactly what the compiler is doing in your case, by profiling/timing, and on when necessary by examining the individual instructions. (and when timing your code, be aware of all the caveats and gotchas that can invalidate your benchmarks: Make sure optimizations are enabled, but also that the code you're trying to measure isn't optimized away. Take the cache into account (if the data is already in the CPU cache, it'll run much faster. If it has to read from physical memory to begin with, it'll take extra time. Both can invalidate your measurements if you're not careful. Keep in mind what you want to measure exactly)
For your specific examples, why should ++i be faster than i += 1? They do the exact same thing? Sometimes, it may make a difference whether you're adding a constant or a variable, but in this case, you're adding the constant one in both cases.
And in general, instructions take a fixed constant time regardless of their operands. adding one to something takes just as long as adding -2000 or 1772051912. The same goes for multiplication or division.
But if you care about performance, you need to understand how the entire technology stack works, not just rely on a few simple rules of thumb like "integer is faster than floating point, and ++ is faster than +=" (Apart from anything else, such simple rules of thumb are almost never true, at least not in every case)
Here is a twist on your evaluations: try Loop Unrolling. Loop unrolling is where you repeat the same statements in a loop to reduce the number of iterations in the loop.
Most modern processors hate branch instructions. The processors have a queue of pre-fetched instructions, which speeds up processing. They really hate branch instructions, because the processor has to clear out the queue and reload it after a branch. This takes more time than just processing sequential instructions.
When coding for processing time, try to minimize the number of branches, which can occur in loop constructs and decision constructs.
Depends on architecture, the built in operators for integer arithmetic translate directly to assembly (as I understand it) ++, +=1, and += 10000 are probably equally fast, multiplication would depend on the platform, overloaded operators would depend on you
Donald Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
unless you are writing extremely time critical software, you should probably worry about other things
Short answer: you should turn optimizations on before measuring.
The long answer: If you turned optimizations on, you're performing the operations on integers, and still you get different times for ++i; and i += 1;, then it's probably time to get a better compiler -- the two statements have exactly the same semantics and a competent compiler should translate them into the same instruction sequence.
"Does it make a significant difference if the numbers are floats instead of integers?"
-It depends on what kind of processor you are running on. Integer operations are faster on current x86 compatible CPUs.
About i++ and i+=1: there shouldn't be a difference with any good compiler, while you may expect i+=10000 to be slightly slower on x86 CPUs.
"Does "*" take appreciably longer than "+"?"
-Typically yes.
Note that you may run into all sorts of bottlenecks, in which case the speed difference between the operations doesn't show up. Eg. memory bandwidth, CPU pipeline stall due to data dependencies, etc...
The performance problems caused by C++ operators do not come from the operators and not from the operators implementation. It comes from the syntax, from hidden code being run without you knowing.
The best example, is implementing quick sort, on an object which has the operator[] implemented, but internally it's using a linked list. Now instead of O(nlogn) [1] you will get O(n^2logn).
The problem with performance is that you cannot know exactly what your code will eventually be.
[1] I know that quick sort is actually O(n^2), but it rarely gets to it, the average distribution will give you O(nlogn).