How is FLOPS/IOPS calculated and what is its use? - c++

I have been following some tutorials for OpenCL and a lot of times people speak in terms of FLOPS. Wikipedia explains the formula but does not say what it actually means. For example, 1 light year = 9.4605284 × 10^15 meters, but what it means is the distance traveled by light in a year. Similarly, what does FLOP mean?
Answer to a similar question says 100 IOPS for the code
for(int i = 0; i < 100; ++i)
Ignoring the initialisation, I see 100 increment operations, so there's 100 IOPS. But I also see 100 comparison operations. So why isn't it 200 IOPS? What types of operators are included in the FLOPS/IOPS calculation?
Secondly, I want to know what you would do with the FLOPS figure once you have calculated it for your algorithm.
I am asking this because the value is specific to a CPU clock speed and number of cores.
Any guidance on this arena would be very helpful.

"FLOPS" stands for "Floating Point Operations Per Second" and it's exactly that. It's used as a measure of the computing speed of large, number based (usually scientific) operations. Measuring it is a matter of knowing two things:
1.) The precise execution time of your algorithm
2.) The precise number of floating point operations involved in your algorithm
You can get pretty good approximations of the first one from profiling tools. For the second one, well, you might be on your own there. You can look through the source for floating point operations like "1.0 + 2.0" or look at the generated assembly code, but those can both be misleading. There's probably a profiling tool out there that will give you FLOPS directly.
It's important to understand that there is a theoretical maximum FLOPS value for the system you're running on, and then there is the actual achieved FLOPS of your algorithm. The ratio of these two can give you a sense of the efficiency of your algorithm. Hope this helps.
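As a minimal sketch of the two ingredients above (a measured time and a counted number of operations), the following times a simple multiply-add loop. The names `FlopsResult`, `measureAxpy` and `efficiency` are made up for illustration, the 2-operations-per-element count is a hand model rather than something the hardware reports, and the peak figure is an input you would have to take from your hardware's documentation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// One measurement: operations counted by hand, time measured by the clock.
struct FlopsResult {
    double flopCount;   // floating-point operations, from the model below
    double seconds;     // measured wall time
    double flopsPerSecond() const { return flopCount / seconds; }
};

// y[i] = a * x[i] + y[i] is one multiply and one add per element,
// so one pass over n elements is modelled as 2 * n operations.
FlopsResult measureAxpy(std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 0.5;

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    const auto stop = std::chrono::steady_clock::now();

    // Reading a result discourages the compiler from discarding the loop.
    volatile double sink = y[n / 2];
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    return {2.0 * static_cast<double>(n) * repeats, seconds};
}

// Achieved FLOPS divided by the theoretical peak (supplied by the caller).
double efficiency(const FlopsResult& r, double peakFlopsPerSecond) {
    assert(peakFlopsPerSecond > 0.0);
    return r.flopsPerSecond() / peakFlopsPerSecond;
}
```

For a real measurement you would pick `n` and `repeats` large enough that the run takes well over a millisecond, otherwise the timer resolution dominates the result.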

Related

Calculating FLOPS (Floating-point Operations per Seconds)

How can I calculate FLOPS of my application?
If I have the total number of executed instructions, I can divide it by the execution time. But, how to count the number of executed instructions?
My question is general and answer for any language is highly appreciated. But I am looking to find a solution for my application which is developed by C/C++ and CUDA.
I do not know whether the tags are proper, please correct me if I am wrong.
What I do if the number of floating point operations is not easily modeled is to produce two executables: One that is the production version and gives me the execution time, and an instrumented one that counts all floating point operations while performing them (surely that will be slow, but that doesn't matter for our purpose). Then I can compute the FLOP/s value by dividing the number of floating point ops from the second executable by the time from the first one.
This could probably even be automated, but I haven't had a need for this so far.
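One way the instrumented build could look, as a rough sketch: write the kernel as a template and instantiate it once with `double` (for timing) and once with a counting wrapper type. `CountedDouble`, `opCount` and `kernel` are hypothetical names invented here, and the wrapper only counts the four basic arithmetic operators.

```cpp
#include <cassert>
#include <cstdint>

// Global operation counter, wrapped in a function to avoid a global variable.
inline std::uint64_t& opCount() {
    static std::uint64_t count = 0;
    return count;
}

// Behaves like a double but bumps the counter on every arithmetic operation.
struct CountedDouble {
    double value;
};

inline CountedDouble operator+(CountedDouble a, CountedDouble b) { ++opCount(); return {a.value + b.value}; }
inline CountedDouble operator-(CountedDouble a, CountedDouble b) { ++opCount(); return {a.value - b.value}; }
inline CountedDouble operator*(CountedDouble a, CountedDouble b) { ++opCount(); return {a.value * b.value}; }
inline CountedDouble operator/(CountedDouble a, CountedDouble b) { ++opCount(); return {a.value / b.value}; }

// The same source serves both builds: T = double for the timed run,
// T = CountedDouble for the counting run.
template <typename T>
T kernel(T x, T y, T z, T w, int iterations) {
    assert(iterations >= 0);
    for (int i = 0; i < iterations; ++i)
        y = x * T{2.0} * (y + z * w);   // 3 multiplies + 1 add per iteration
    return y;
}
```

Dividing the counter from the `CountedDouble` run by the wall time of the `double` run gives the FLOP/s figure described above. Note that this counts operations as written in the source, which may differ from what the optimiser emits.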
You should mathematically model what's done with your data. Isolate one loop iteration. Then count all simple floating-point additions, multiplications, divisions, etc. For example,
y = x * 2 * (y + z*w) is 4 floating-point operations (three multiplications and one addition). Multiply that number by the number of iterations. The result is the number of floating-point operations you're looking for.

How to measure FLOPS

How do I measure FLOPS or IOPS? If I do measure time for ordinary floating point addition / multiplication , is it equivalent to FLOPS?
FLOPS is floating point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all floating point operations and divide that by the measured wall time. You should count all ordinary operations like additions, subtractions, multiplications and divisions (yes, even though divisions are slower and better avoided, they are still FLOPs). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure, you will likely have to look at the assembly.
FLOPS is not the same as instructions per second. Even though some architectures have a single MAD (multiply-and-add) instruction, it still counts as two FLOPs. The same goes for SSE instructions: you count each as one instruction, though it performs more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to somebody else's FLOPS, especially the hardware vendors'. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations. So unless your code has those, you will never get this performance. Either rethink the algorithm, or scale the peak hardware FLOPS by a correction factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplications, you would divide the peak by 2. Counting right might get your code from suboptimal to quite efficient without changing a single line of code.
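The correction factor can be written down as a small formula. This is only a sketch under a simplifying assumption that is mine, not the vendor's: every floating-point instruction is either a multiply-add worth 2 FLOPs or a plain add/multiply worth 1 FLOP, and both issue at the same rate. `attainablePeak` and `fmaFraction` are hypothetical names.

```cpp
#include <cassert>

// vendorPeakFlops: quoted peak, assumed to count every instruction as a
//                  multiply-add (2 FLOPs).
// fmaFraction:     share of the code's floating-point instructions that
//                  really are multiply-adds, between 0 and 1.
// With the rest of the instructions worth 1 FLOP each, the average is
// (1 + fmaFraction) FLOPs per instruction instead of 2.
double attainablePeak(double vendorPeakFlops, double fmaFraction) {
    assert(fmaFraction >= 0.0 && fmaFraction <= 1.0);
    return vendorPeakFlops * (1.0 + fmaFraction) / 2.0;
}
```

With `fmaFraction = 0` (only multiplications, say) this reproduces the divide-by-2 case above; with `fmaFraction = 1` the quoted peak is unchanged.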
You can use the CPU performance counters to get the CPU to itself count the number of floating point operations it uses for your particular program. Then it is the simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily, I have a writeup on the details of this on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOPs are not well defined. A multiply FLOP is not the same as an add FLOP. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "Dhrystone MIPS" and floating point in "Linpack megaFLOPS". In these, "Dhrystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations per second. They're much the same, though in this case there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture-specific question. For a naive/basic start, I would recommend finding out how many operations one multiplication takes on your specific hardware, then doing a large matrix multiplication and seeing how long it takes. Then you can easily estimate the FLOPS of your particular hardware.
The industry standard for measuring FLOPS is the well-known Linpack, or HPL (High Performance Linpack). Try looking at the source or running those yourself.
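A rough sketch of the matrix-multiplication estimate, assuming the usual convention that a naive n x n product costs 2·n³ floating-point operations. `matmulFlopCount` and `timedMatmul` are illustrative names, and a naive triple loop like this will land far below what Linpack/HPL reports on the same machine.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Conventional operation count for a naive n x n matrix product:
// each of the n*n outputs takes n multiplies and n adds.
double matmulFlopCount(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Multiplies two row-major n x n matrices into c; returns elapsed seconds.
double timedMatmul(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<double>& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The estimate is then `matmulFlopCount(n) / timedMatmul(a, b, c, n)` for an `n` large enough that the run takes a measurable amount of time.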
I would also refer to this answer as an excellent reference

Typical time of execution for elementary functions

It is well-known that the processor instruction for multiplication takes several times more time than addition, division is even worse (UPD: which is not true any more, see below). What about more complex operations like exponent? How difficult are they?
Motivation. I am interested because it would help in algorithm design to estimate performance-critical parts of algorithms on early stage. Suppose I want to apply a set of filters to an image. One of them operates on 3×3 neighborhood of each pixel, sums them and takes atan. Another one sums more neighbouring pixels, but does not use complicated functions. Which one would execute longer?
So, ideally I want to have approximate relative times of elementary operations execution, like multiplication typically takes 5 times more time than addition, exponent is about 100 multiplications. Of course, it is a deal of orders of magnitude, not the exact values. I understand that it depends on the hardware and on the arguments, so let's say we measure average time (in some sense) for floating-point operations on modern x86/x64. For operations that are not implemented in hardware, I am interested in typical running time for C++ standard libraries.
Have you seen any sources where such a thing was analyzed? Does this question make sense at all? Or can no rules of thumb like this be applied in practice?
First off, let's be clear. This:
It is well-known that processor instruction for multiplication takes
several times more time than addition
is no longer true in general. It hasn't been true for many, many years, and needs to stop being repeated. On most common architectures, integer multiplies are a couple cycles and integer adds are single-cycle; floating-point adds and multiplies tend to have nearly equal timing characteristics (typically around 4-6 cycles latency, with single-cycle throughput).
Now, to your actual question: it varies with both the architecture and the implementation. On a recent architecture, with a well written math library, simple elementary functions like exp and log usually require a few tens of cycles (20-50 cycles is a reasonable back-of-the-envelope figure). With a lower-quality library, you will sometimes see these operations require a few hundred cycles.
For more complicated functions, like pow, typical timings range from high tens into the hundreds of cycles.
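If you want numbers for your own machine and math library rather than back-of-the-envelope figures, a small timing harness is enough for a first impression. This is a sketch only: `nanosecondsPerCall` and `sampleInputs` are names invented here, the result includes loop and memory overhead, and it measures throughput over many calls rather than the latency of a single one.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Average wall time in nanoseconds for one call of f, over all inputs.
template <typename F>
double nanosecondsPerCall(F f, const std::vector<double>& inputs, int repeats) {
    assert(!inputs.empty() && repeats > 0);
    double sum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (double x : inputs)
            sum += f(x);
    const auto stop = std::chrono::steady_clock::now();

    // Using the accumulated value discourages the compiler from removing the calls.
    volatile double sink = sum;
    (void)sink;

    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / (static_cast<double>(inputs.size()) * repeats);
}

// n evenly spaced arguments in [0, 1).
std::vector<double> sampleInputs(std::size_t n) {
    assert(n > 0);
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = static_cast<double>(i) / static_cast<double>(n);
    return v;
}
```

Calling it with `[](double x) { return std::exp(x); }` and again with `[](double x) { return x + 1.0; }` on the same inputs gives a relative cost. Treat the ratio as an order-of-magnitude hint, since the compiler may vectorise the cheap case.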
You shouldn't be concerned about this. If I tell you that a typical C library implementation of a transcendental function tends to take around 10 times as long as a single floating point addition/multiplication (or 50 times as long), and around 5 times as long as a floating point division, this wouldn't be useful to you.
Indeed, the way your processor schedules memory accesses will interfere badly with any premature optimization you'd do.
If after profiling you find that a particular implementation using transcendental functions is too slow, you can contemplate setting up a polynomial interpolation scheme. This will include a table and therefore will incur extra cache issues, so make sure to measure and not guess.
This will likely involve Chebyshev approximation. Read up on it; it is a particularly useful technique in this kind of domain.
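To give an idea of what such a scheme looks like, here is a minimal sketch of a Chebyshev fit on the interval [-1, 1], evaluated with the Clenshaw recurrence. `chebyshevFit` and `chebyshevEval` are names chosen for this example; a production version would also map the real argument range onto [-1, 1] and choose the number of terms from the accuracy actually required.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Chebyshev coefficients of f on [-1, 1], from samples at the n Chebyshev nodes.
template <typename F>
std::vector<double> chebyshevFit(F f, std::size_t n) {
    assert(n > 0);
    const double pi = std::acos(-1.0);
    std::vector<double> samples(n), coeffs(n);
    for (std::size_t k = 0; k < n; ++k)
        samples[k] = f(std::cos(pi * (k + 0.5) / n));
    for (std::size_t j = 0; j < n; ++j) {
        double sum = 0.0;
        for (std::size_t k = 0; k < n; ++k)
            sum += samples[k] * std::cos(pi * j * (k + 0.5) / n);
        coeffs[j] = 2.0 * sum / n;
    }
    return coeffs;
}

// Evaluates the series c[0]/2 + sum_{j>=1} c[j] * T_j(x) by Clenshaw's recurrence,
// which needs only multiplies and adds (no cosines) at evaluation time.
double chebyshevEval(const std::vector<double>& c, double x) {
    assert(!c.empty() && x >= -1.0 && x <= 1.0);
    double d = 0.0, dd = 0.0;
    for (std::size_t j = c.size() - 1; j >= 1; --j) {
        const double previous = d;
        d = 2.0 * x * d - dd + c[j];
        dd = previous;
    }
    return x * d - dd + 0.5 * c[0];
}
```

The coefficient vector is the "table" mentioned above: it is computed once, and each evaluation then costs a couple of floating-point operations per coefficient.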
I have been told that compilers are quite bad at optimizing floating point code. You may want to write custom assembly code.
Also, Intel Performance Primitives (if you are on Intel CPU) is something good to own if you are ready to trade off some accuracy for speed.
You could always start a second thread and time the operations. Most elementary operations don't differ that much in execution time. The big difference is how many times they are executed. The algorithmic complexity, O(n), is generally what you should be thinking about.

How expensive is fmod in terms of processor time?

In my game, I need to ensure that angles do not exceed 2 pi. so I use fmod(angle,TWO_PI);
Is this noticeably expensive to do about 100 times per second?
100 times per second? That's almost zero, you shouldn't trouble yourself.
Even if fmod takes 100 clock cycles, that's 10,000 cycles/sec. For a 1 GHz CPU, that's 0.001% of the CPU.
BTW: why do you want to do fmod of TWO_PI? If you're going to take sin() or cos() - you can skip it.
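The estimate above is just cycles per call times calls per second, divided by the clock rate. As a tiny sketch (the function name `cpuFraction` is made up, and the 100-cycle cost is the deliberately pessimistic guess from the answer, not a measurement):

```cpp
#include <cassert>

// Fraction of one core's cycles spent in a routine, given its cost per
// call, how often it is called, and the core's clock frequency in Hz.
double cpuFraction(double cyclesPerCall, double callsPerSecond, double clockHz) {
    assert(clockHz > 0.0);
    return cyclesPerCall * callsPerSecond / clockHz;
}
```

`cpuFraction(100.0, 100.0, 1.0e9)` gives 0.00001, i.e. 0.001% of one core.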
If you want to ensure that angles do not exceed 2pi radians, you should use angle < TWO_PI. Using fmod will give you the remainder, which is useful if you want to find the actual angle and ignore multiple revolutions, but doesn't give you any information about which is greater.
Using < is very efficient, and as long as you aren't doing it 100,000+ times a second or don't have a lot of other code involved in the loop you should be fine. fmod is a fair bit more expensive as it involves division AND floating-point arithmetic, but 100 times per second is still almost negligible on most modern hardware, so I doubt you'll have much trouble at all. If you're still worried, do some tests. If you need help interpreting the tests or have other specific questions, post the code and we'll help you analyze them. :D
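For reference, a minimal sketch of wrapping an angle with `std::fmod`. The helper name `wrapAngle` is hypothetical, and whether negative angles should map into [0, 2*pi) is an assumption about what the game needs.

```cpp
#include <cassert>
#include <cmath>

// Maps any finite angle in radians into the range [0, 2*pi).
double wrapAngle(double angle) {
    assert(std::isfinite(angle));
    const double twoPi = 2.0 * std::acos(-1.0);
    double wrapped = std::fmod(angle, twoPi);   // result keeps the sign of angle
    if (wrapped < 0.0)
        wrapped += twoPi;                       // shift negatives into range
    if (wrapped >= twoPi)
        wrapped = 0.0;                          // guard against rounding up to 2*pi
    return wrapped;
}
```

If angles only ever drift slightly past the limit, a plain `if (angle >= twoPi) angle -= twoPi;` avoids the fmod call entirely.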

benchmarking trig lookup tables performance gains vs cpp implementation

We are developing a real-time system that will be performing sin/cos calculations during a time critical period of operation. We're considering using a lookup table to help with performance, and I'm trying to benchmark the benefit/cost of implementing a table. Unfortunately we don't yet know what degree of accuracy we will need, but probably around 5-6 decimal points.
I figure that a thorough comparison of C++ trig functions to lookup approaches has already been done. I was hoping that someone could provide me with a link to a site documenting any such benchmarking. If such results don't exist, I would appreciate any suggestions for how I can determine how much memory is required for a lookup table assuming a given minimum accuracy, and how I can determine the potential speed benefits.
Thanks!
I can't answer all your questions, but instead of trying to determine theoretical speed benefits you would almost certainly be better off profiling it in your actual application. Then you get an accurate picture of what sort of improvement you stand to gain in your specific problem domain, which is the most useful information for your needs.
What accuracy is your degree input? (Let's use degrees over radians to keep the discussion "simpler".) Tenths of a degree? Hundredths of a degree? If your angle precision is not great, then your trig result cannot be any better.
I've seen this implemented as an array indexed by hundredths of a degree (keeping the angle as an integer with two implied decimal places also helps with the calculation: no need to use high precision float/double radian angles).
Storing SIN values for 0.00 to 90.00 degrees takes 9001 32-bit float values.
SIN[0] = 0.0
...
SIN[4500] = 0.7071068
...
SIN[9000] = 1.0
If you have SIN, the trig property of COS(a) = SIN(90-a)
just means you do
SIN[9000-a]
to get COS(a)
If you need more precision but don't have the memory for more table space, you could do linear interpolation between the two entries in the array, e.g. SIN of 45.00123 would be
SIN[4500] + 0.123 * (SIN[4501] - SIN[4500])
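Put together, the table described above might look like the following sketch. The class name `SineTable` and its method names are invented for this example, and it only covers the first quadrant (0 to 90 degrees); extending it to other quadrants by symmetry is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

class SineTable {
public:
    // 9001 entries: sin of 0.00, 0.01, ..., 90.00 degrees.
    SineTable() : table_(9001) {
        const double pi = std::acos(-1.0);
        for (std::size_t i = 0; i < table_.size(); ++i)
            table_[i] = static_cast<float>(std::sin(i / 100.0 * pi / 180.0));
    }

    // Angle as an integer number of hundredths of a degree, 0..9000.
    float sinHundredths(int a) const {
        assert(a >= 0 && a <= 9000);
        return table_[static_cast<std::size_t>(a)];
    }

    // COS(a) = SIN(90 - a), so the same table serves both functions.
    float cosHundredths(int a) const { return sinHundredths(9000 - a); }

    // Linear interpolation between neighbouring entries, 0 <= degrees <= 90.
    float sinInterpolated(double degrees) const {
        assert(degrees >= 0.0 && degrees <= 90.0);
        const double scaled = degrees * 100.0;
        const std::size_t index = static_cast<std::size_t>(scaled);
        if (index >= 9000)
            return table_[9000];                 // last entry has no right neighbour
        const double fraction = scaled - static_cast<double>(index);
        return static_cast<float>(table_[index] +
                                  fraction * (table_[index + 1] - table_[index]));
    }

private:
    std::vector<float> table_;
};
```

At 4 bytes per entry the table is about 36 KB, which may or may not fit in the L1 data cache of your target, so it is worth measuring against `std::sin` in the real application.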
The only way to know the performance characteristics of the two approaches is to try them.
Yes, there are probably benchmarks of this made by others, but they didn't run in the context of your code, and they weren't running on your hardware, so they're not very applicable to your situation.
One thing you can do, however, is to look up the instruction latencies in the manuals for your CPU. (Intel and AMD have this information available in PDF form on their websites, and most other CPU manufacturers have similar documents)
Then you can at least find out how fast the actual trig instructions are, giving you a baseline that the lookup table will have to beat to be worthwhile.
But that only gives you a rough estimate of one side of the equation. You might be able to make a similar rough estimate of the cost of a lookup table as well, if you know the latencies of the CPU's caches, and you have a rough idea of the latency of memory accesses.
But the only way to get accurate information is to try it. Implement both, and see what happens in your application. Only then will you know which is better in your case.