I am using intel VTune to profile my program.
The CPU I am using is IVY Bridge.
All the hardware instruction event can be found here:
https://software.intel.com/en-us/node/589933
FP_COMP_OPS_EXE.X87
Number of FP Computational Uops Executed this
cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULsand IMULs,
FDIVs, FPREMs, FSQRTS, integer DIVs, and IDIVs. This event does not
distinguish an FADD used in the middle of a transcendental flow from a
s
FP_COMP_OPS_EXE.X87 seems to include Integer Multiplication and Integer Division; however, there is no Integer Addition and Integer Subtraction there. I can not find those two kinds of instruction either from the above website.
Can anyone tell me what is the event that counts integer addition and integer subtraction instructions?
I'm reading a lot into your question, but here goes:
It's possible that if your code is computationally bound you could find ways to infer the significance of integer adds and subs without measuring them directly. For example, UOPS_RETIRED.ALL - FP_COMP_OPS_EXE.ALL would give you a very rough estimate of adds and subs, assuming that you've already done something to establish that your code is compute bound.
Have you? If not, it might help to start with VTune's basic analysis and then eliminate memory, cache and front end bottlenecks. If you've already done this, you have a few more options:
Cross-reference UOPS_DISPATCHED_PORT with an Ivy Bridge block diagram, or even better, a list of which specific types of arithmetic can execute on which ports (which I can't find).
Modify your program source, compiler flags or assembly, rerun a coarser-grained profile like basic analysis, and see whether you see an impact at the level of a measure like INST_RETIRED.ANY / CPU_CLK_UNHALTED.
Sorry there doesn't seem to be a more direct answer.
Related
I am looking to implement KL Divergence in C++ efficiently. (CPU Only for now).
Much like AES or FTT (Fast Fourier transform) whereby use of a common function has lead to hardware level optimizations (Intel AES and Intel FTT).
Is there anything similar for natural log, or slightly higher level efficiencies (ASM/C) that prevent bottlenecks of executing many Natural log functions in success (If they exist)?
Same use case examples:
.Many parallel and independent node executions; each one performing 20~ KL calculations from localized (not shared or pointer reffed) memory.
.Scheduled KL: executing in stepped-parallel where the hardware is setup expecting uses of for example tables (assuming tables are used at this lower level - AES implement ions indicated this is a high probability) .
I have found a possible candiate, but I'm unsure: ALTFP_LOG
You can use SSE instructions to calculate the logarithm of many values in parallel. But whether you can actually make use of those instructions depends heavily on how the rest of the calculations you are going to do depend on the logarithms you calculate, so it is not possible to give a more specific answer.
I have a very complicated and sophisticated data fitting program which uses the Levenverg-Marquardt algorithm to do fitting in double precision (basically the fitting class is templatized, but I use instantiate it to doubles). The fitting process involves:
Calculating an error function (chi-square)
Solving a system of linear equations (I use lapack for that)
calculating the derivatives of a function with respect to the parameters, which I want to fit to the data (usually 20+ parameters)
calculating the function value continuously: the function is a complicated combination of a sinusoidal and exponential functions with a few harmonics.
A colleague of mine has suggested that I use integers for at least 10 times faster at least. My questions are:
Is that true that I will get that kind of improvement?
Is it safe to convert everything to integers? And what are the drawbacks to this?
What advice would you have for this whole issue? What would you do?
The program is developed to calculate some parameters from the signal online, which means that the program must be as fast as possible, but I'm wondering whether it's worth it to start the project of converting everything to integers.
The amount of improvement depends on your platform. For example, if your platform has a fast floating point coprocessor, performing arithmetic in floating point may be faster than integral arithmetic.
You may be able to get more performance gain by optimizing your algorithms rather than switching to integer arithmetic.
Another method for boosting performance is to reduce data cache hits and also reducing branches and loops.
I would measure performance of the program to find out where the bottlenecks are and then review the sections that where most of the performance takes place. For example, in my embedded system, micro-optimizations like what you are suggesting, saved 3 microseconds. This gain is not worth the effort to retest the entire system. If it works, don't fix it. Concentrate on correctness and robustness first.
The bottom line here is that you have to test it and decide for yourself. Profile a release build using real data.
1- Is that true that I will get that kind of improvement?
Maybe yes, maybe no. It depends on a number of factors, such as
How long it takes to convert from double to int
How big a word is on your machine
What platform/toolset you're using and what optimizations you have enabled
(Maybe) how big a cache line is on your platform
How fast your memory is
How fast your platform computes floating-point versus integer.
And who knows what else. In short, too many complex variables for anyone to be able to say for sure if you will or will not improve performance.
But I would be highly skeptical about your friend's claim, "at least 10 times faster at least."
2- Is it safe to convert everything to integers? And what are the
drawbacks to this?
It depends on what you're converting and how. Obviously converting a value like 123.456 to an integer is decidedly unsafe.
Drawbacks include loss of precision, loss of accuracy, and the expense in terms of space and time to actually do the conversions. Another significant drawback is the fact that you have to write a substantial amount of code, and every line of code you write is a probable source of new bugs.
3- What advice would you have for this whole issue? What would you do?
I would step back & take a deep breath. Profile your code under real-world conditions. Identify the sources of the bottlenecks. Find out what the real problems are, and if there even are any.
Identify inefficiencies in your algorithms, and fix them.
Throw hardware at the problem.
Then you can endeavor to start micro-optimizing. This would be my last resort, especially if the optimization technique you are considering would require writing a lot of code.
First, this reeks of attempting to optimize unnecessarily.
Second, doubles are a minimum of 64-bits. ints on most systems are 32-bits. So you have a couple of choices: truncate the double (which reduces your precision to a single), or store it in the space of 2 integers, or store it as an unsigned long long (which is at least 64-bits as well). For the first 2 options, you are facing a performance hit as you must convert the numbers back and forth between the doubles you are operating on and the integers you are storing it as. For the third option, you are not gaining any performance increase (in terms of memory usage) as they are basically the same size - so you'd just be converting them to integers for no reason.
So, to get to your questions:
1) Doubtful, but you can try it to see for yourself.
2) The problem isn't storage as the bits are just bits when they get into memory. The problem is the arithmetic. Since you stated you need double precision, attempting to do those operations on an integer type will not give you the results you are looking for.
3) Don't optimize until it has been proven something needs to have a performance improvement. And always remember Amdahl's Law: Make the common case fast and the rare case correct.
What I would do is:
First tune it in single-thread mode (by the random-pausing method) until you can't find any way to reduce cycles. The kinds of things I've found are:
a large fraction of time spent in library functions like sin, cos, exp, and log where the arguments were often unchanged, so the answers would be the same. The solution for that is called "memoizing", where you figure out a place to store old values of arguments and results, and check there first before calling the function.
In calling library functions like DGEMM (lapack matrix-multiply) that one would assume are optimized to the teeth, they are actually spending a large fraction of time calling a function to determine if the matrices are upper or lower triangle, square, symmetric, or whatever, rather than actually doing the multiplication. If so, the answer is obvious - write a special routine just for your situation.
Don't say "but I don't have those problems". Of course - you probably have different problems - but the process of finding them is the same.
Once you've made it as fast as possible in single-thread, then figure out how to parallelize it. Multi-threading can have high overhead, so it's best not to tightly-couple the threads.
Regarding your question about converting from doubles to integers, the other answers are right on the money. It only makes sense in very particular situations.
I am having an issue with the simple code bellow. I am trying to use OpenMP with GFortran. The Results of the code bellow for x should be the same with AND without !$OMP statements, since the parallel code and serial code should output the same result.
program test
implicit none
!INCLUDE 'omp_lib.h'
integer i,j
Real(8) :: x,t1,t2
x=0.0d0
!$OMP PARALLEL DO PRIVATE(i,j) shared(X)
Do i=1,3
Write(*,*) I
!pause
Do j=1,10000000
!$OMP ATOMIC
X=X+2.d0*Cos(i*j*1.0d0)
end do
end do
!$OMP END PARALLEL Do
write(*,*) x
end program test
But strangely I am getting the following results for x:
Parallel:-3.17822355415XXXXX
Serial: -3.1782235541569084
where XXXXX is some random digits. Every time I run the serial code, I get the same result (-3.1782235541569084). How can i fix it? Is this problem due to some OpenMP working precision option?
Floating-point arithmetic is not strictly associative. In f-p arithmetic neither a+(b+c)==(a+b)+c nor a*(b*c)==(a*b)*c is always true, as they both are in real arithmetic. This is well-known, and extensively explained in answers to other questions here on SO and at other reputable places on the web. I won't elaborate further on that point here.
As you have written your program the order of operations by which the final value of X is calculated is non-deterministic, that is it may (and probably does) vary from execution to execution. The atomic directive only permits one thread at a time to update X but it doesn't impose any ordering constraints on the threads reaching the directive.
Given the nature of the calculation in your program I believe that the difference you see between serial and parallel executions may be entirely explained by this non-determinism.
Before you think about 'fixing' this you should first be certain that it is a problem. What makes you think that the serial code's answer is the one true answer ? If you were to run the loops backwards (still serially) and get a different answer (quite likely) which answer is the one you are looking for ? In a lot of scientific computing, which is probably the core domain for OpenMP, the data available and the numerical methods used simply don't support assertions of the accuracy of program results beyond a handful of significant figures.
If you still think that this is a problem that needs to be fixed, the easiest approach is to simply take out the OpenMP directives.
To add to what High Performance Mark said, another source of discrepancy is that the compiler might have emitted x87 FPU instructions to do the math. x87 uses 80-bit internal precision and an optimised serial code would only use register arithmetic before it actually writes the final value to the memory location of X. In the parallel case, since X is a shared variable, at each iteration the memory location is being updated. This means that the 80-bit x87 FPU register is flushed to a 64-bit memory location and then read back, and some bits of precision are thus lost on each iteration, which then adds up to the observed discrepancy.
This effect is not present if modern 64-bit CPU is being used together with a compiler that emits SIMD instructions, e.g. SSE2+ or AVX. Those only work with 64-bit internal precision and then using only register addressing does not result in better precision than if the memory value is being flushed and reloaded in each iteration. In this case the difference comes from the non-associativity as explained by High Performance Mark.
Those effects are pretty much expected and usually accounted for. They are well studied and understood, and if your CFD algorithm breaks down when run in parallel, then the algorithm is highly numerically unstable and I would in no way trust the results it gives, even in the serial case.
By the way, a better way to implement your loop would be to use reduction:
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:X)
Do i=1,3
Write(*,*) I
!pause
Do j=1,10000000
X=X+2.d0*Cos(i*j*1.0d0)
end do
end do
This would allow the compiler to generate register-optimised code for each thread and then the loss of precision would only occur at the very end when the threads sum their local partial values in order to obtain the final value of X.
I USED THE CLAUSE ORDERED WITH YOU CODE AND THIS WORK. BUT WORK WITH THIS CLAUSE IS THE SAME THAT RUN THE CODE IN SERIAL.
How do I measure FLOPS or IOPS? If I do measure time for ordinary floating point addition / multiplication , is it equivalent to FLOPS?
FLOPS is floating point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all floating point operations and divide that over the measured wall time. You should count all ordinary operations like additions,subtractions,multiplications,divisions (yes, even though they are slower and better avoided, they are still FLOPs..). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure you will likely have to look at the assembly..
FLOPS is not the same as Operations per second. So even though some architectures have a single MAD (multiply-and-add) instruction, those still count as two FLOPs. Similarly the SSE instructions. You count them as one instruction, though they perform more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to sb. elses FLOPS, especially the hardware vendors. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations. So unless your code has those, you will not ever get this performance. Either rethink the algorithm, or modify the peak hardware FLOPS by a correct factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplication, you would divide it by 2. Counting right might get your code from suboptimal to quite efficient without changing a single line of code..
You can use the CPU performance counters to get the CPU to itself count the number of floating point operations it uses for your particular program. Then it is the simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily, I have a writeup on the details of this on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOP's are not well defined. mul FLOPS are different than add FLOPS. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "drystone MIPS" and floating point in "Linpack megaFLOPS". In these, "drystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations. They're much the same, though in this case, there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture specific question, for a naive/basic/start start I would recommend to find out how many Operations 1 multiplication take's on your specific hardware then do a large matrix multiplication , and see how long it takes. Then you can eaisly estimate the FLOP of your particular hardware
the industry standard of measuring flops is the well known Linpack or HPL high performance linpack, try looking at the source or running those your self
I would also refer to this answer as an excellent reference
Is there any sort of performance difference between the arithmetic operators in c++, or do they all run equally fast? E.g. is "++" faster than "+=1"? What about "+=10000"? Does it make a significant difference if the numbers are floats instead of integers? Does "*" take appreciably longer than "+"?
I tried performing 1 billion each of "++", "+=1", and "+=10000". The strange thing is that the number of clock cycles (according to time.h) is actually counterintuitive. One might expect that if any of them are the fastest, it is "++", followed by "+=1", then "+=10000", but the data shows a slight trend in the opposite direction. The difference is more pronounced on 10 billion operations. This is all for integers.
I am dabbling in scientific computing, so I wanted to test the performance of operators. If any of the operators operated in time that was linear in terms of the inputs, for example.
About your edit, the language says nothing about the architecture it's running on. Your question is platform dependent.
That said, typically all fundamental data-type operations have a one-to-one correspondence to assembly.
x86 for example has an instruction which increments a value by 1, which i++ or i += 1 would translate into. Addition and multiplication also have single instructions.
Hardware-wise, it's fairly obvious that adding or multiplying numbers is at least linear in the number of bits in the numbers. Because the hardware has a constant number of bits, it's O(1).
Floats have their own processing unit, usually, which also has single instructions for operations.
Does it matter?
Why not write the code that does what you need it to do. If you want to add one, use ++. If you want to add a large number, add a large number. If you need floats, use floats. If you need to multiply two numbers, then multiply them.
The compiler will figure out the best way to do what you want, so instead of trying to be tricky, do what you need and let it do the hard work.
After you've written your working code, and you decide it's too slow, profile it and find out why. You'll find it's not silly things like multiplying versus adding, but rather going about the entire (sub-)problem in the wrong way.
Practically, all of the operators you listed will be done in a single CPU instruction anyway, on desktop platforms.
No, no, yes*, yes*, respectively.
* but do you really care?
EDIT: to give some kind of idea with a modern processor, you may be able to do 200 integer additions in the time it takes to make one memory access, and only 50 integer multiplications. If you think about it, you're still going to be bound by the memory accesses most of the time.
What you are asking is: What basic operations get transformed into which assembly instructions and what is the performance of those instructions on my specific architecture. And this is also your answer: The code they get translated to is dependant on your compiler and it's knowledge of your architecture, their performance depends on your architecture.
Mind you: in C++ operators can be overloaded for user defined types. They can behave differently from built-in types and the implementation of the overload can be non-trivial (no just one instruction).
Edit: A hint for testing. Most compilers support outputting the generated assembly code. The option for gcc is -S. If you use some other compiler have a look at their documentation.
The best answer is to time it with your compiler.
Look up the optimization manuals for your CPU. That's the only place you're going to find answers.
Get your compiler to output the generated assembly. Download the manuals for your CPU. Look up the instructions used by the compiler in the manual, and you know how they perform.
Of course, this presumes that you already know the basics of how a pipelined, superscalar out-of-order CPU operates, what branch prediction, instruction and data cache and everything else means. Do your homework.
Performance is a ridiculously complicated subject. Depending on context, floating-point code may be as fast as (or faster than) integer code, or it may be four times slower. Usually branches carry almost no penalty, but in special cases, they can be crippling. Sometimes, recomputing data is more efficient than caching it, and sometimes not.
Understand your programming language. Understand your compiler. Understand your CPU. And then examine exactly what the compiler is doing in your case, by profiling/timing, and on when necessary by examining the individual instructions. (and when timing your code, be aware of all the caveats and gotchas that can invalidate your benchmarks: Make sure optimizations are enabled, but also that the code you're trying to measure isn't optimized away. Take the cache into account (if the data is already in the CPU cache, it'll run much faster. If it has to read from physical memory to begin with, it'll take extra time. Both can invalidate your measurements if you're not careful. Keep in mind what you want to measure exactly)
For your specific examples, why should ++i be faster than i += 1? They do the exact same thing? Sometimes, it may make a difference whether you're adding a constant or a variable, but in this case, you're adding the constant one in both cases.
And in general, instructions take a fixed constant time regardless of their operands. adding one to something takes just as long as adding -2000 or 1772051912. The same goes for multiplication or division.
But if you care about performance, you need to understand how the entire technology stack works, not just rely on a few simple rules of thumb like "integer is faster than floating point, and ++ is faster than +=" (Apart from anything else, such simple rules of thumb are almost never true, at least not in every case)
Here is a twist on your evaluations: try Loop Unrolling. Loop unrolling is where you repeat the same statements in a loop to reduce the number of iterations in the loop.
Most modern processors hate branch instructions. The processors have a queue of pre-fetched instructions, which speeds up processing. They really hate branch instructions, because the processor has to clear out the queue and reload it after a branch. This takes more time than just processing sequential instructions.
When coding for processing time, try to minimize the number of branches, which can occur in loop constructs and decision constructs.
Depends on architecture, the built in operators for integer arithmetic translate directly to assembly (as I understand it) ++, +=1, and += 10000 are probably equally fast, multiplication would depend on the platform, overloaded operators would depend on you
Donald Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
unless you are writing extremely time critical software, you should probably worry about other things
Short answer: you should turn optimizations on before measuring.
The long answer: If you turned optimizations on, you're performing the operations on integers, and still you get different times for ++i; and i += 1;, then it's probably time to get a better compiler -- the two statements have exactly the same semantics and a competent compiler should translate them into the same instruction sequence.
"Does it make a significant difference if the numbers are floats instead of integers?"
-It depends on what kind of processor you are running on. Integer operations are faster on current x86 compatible CPUs.
About i++ and i+=1: there shouldn't be a difference with any good compiler, while you may expect i+=10000 to be slightly slower on x86 CPUs.
"Does "*" take appreciably longer than "+"?"
-Typically yes.
Note that you may run into all sorts of bottlenecks, in which case the speed difference between the operations doesn't show up. Eg. memory bandwidth, CPU pipeline stall due to data dependencies, etc...
The performance problems caused by C++ operators do not come from the operators and not from the operators implementation. It comes from the syntax, from hidden code being run without you knowing.
The best example, is implementing quick sort, on an object which has the operator[] implemented, but internally it's using a linked list. Now instead of O(nlogn) [1] you will get O(n^2logn).
The problem with performance is that you cannot know exactly what your code will eventually be.
[1] I know that quick sort is actually O(n^2), but it rarely gets to it, the average distribution will give you O(nlogn).