How expensive is fmod in terms of processor time? - c++

In my game, I need to ensure that angles do not exceed 2 pi, so I use fmod(angle, TWO_PI);
Is this noticeably expensive to do about 100 times per second?

100 times per second? That's almost zero, you shouldn't trouble yourself.
Even if fmod takes 100 clock cycles, that's 10,000 cycles per second. For a 1 GHz CPU, that's 0.001% of the CPU.
BTW: why do you want to do fmod of TWO_PI? If you're going to take sin() or cos() - you can skip it.

If you want to ensure that angles do not exceed 2pi radians, you should use angle < TWO_PI. Using fmod will give you the remainder, which is useful if you want to find the actual angle and ignore multiple revolutions, but doesn't give you any information about which is greater.
Using < is very efficient, and as long as you aren't doing it 100,000+ times a second or don't have a lot of other code involved in the loop you should be fine. fmod is a fair bit more expensive as it involves division AND floating-point arithmetic, but 100 times per second is still almost negligible on most modern hardware, so I doubt you'll have much trouble at all. If you're still worried, do some tests. If you need help interpreting the tests or have other specific questions, post the code and we'll help you analyze them. :D
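For reference, here is a minimal sketch of wrapping an angle with fmod. The value of TWO_PI and the helper name wrap_angle are assumptions for illustration; the question does not show how TWO_PI is defined. Note that std::fmod keeps the sign of its first argument, so negative angles need one extra step.

```cpp
#include <cmath>

// Assumed definition; the question uses TWO_PI without showing it.
constexpr double TWO_PI = 6.283185307179586;

// Hypothetical helper: wrap an angle in radians into [0, TWO_PI).
double wrap_angle(double angle) {
    // std::fmod returns a remainder with the same sign as `angle`.
    double r = std::fmod(angle, TWO_PI);
    if (r < 0.0) {
        r += TWO_PI;  // shift negative remainders into the positive range
    }
    if (r >= TWO_PI) {
        r = 0.0;      // guard against rounding up to exactly TWO_PI
    }
    return r;
}
```

If the angle only ever drifts slightly past the range each frame, a plain comparison followed by a single subtraction of TWO_PI would also work and avoids the fmod call entirely.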

Related

Recommended samples for performance benchmarks?

I'm writing performance benchmarks for some of my code. This is both to compare my own implementations as I develop/experiment, and to compare against "competing" implementations. I have no problem writing these, and getting usable results.
It's very well established that more samples are a good thing, as it reduces the impact of erroneous data and gives a more true result.
So, if I'm profiling a given function/procedure/whatever, how many samples does it seem reasonable to get?
I'm currently doing about 1 million samples for each test. These are individual operations, the results rarely take longer than 10s per item, even on an old laptop. Most are under a hundredth of a second.
Actually, it is not well established that more samples are a good thing.
It is nothing more than common wisdom.
I think you are sharing in a general confusion about the reason for profiling, whether the purpose is to measure performance or to find speedups.
For measuring performance, you don't need samples at all.
What you need is a stopwatch, whether in software or not.
If your process runs too quickly for the resolution of the stopwatch, just run your process 10^3 or 10^6 times, measure it, and divide by that number.
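The "software stopwatch" idea could be sketched in C++ roughly as follows. The function name seconds_per_call is hypothetical, and the sketch assumes std::chrono::steady_clock is a good enough timer on your platform.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: run `work` `reps` times and return the
// average wall-clock seconds per call.
template <typename F>
double seconds_per_call(F&& work, std::size_t reps) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < reps; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    // Divide the total by the repetition count, as described above.
    return elapsed.count() / static_cast<double>(reps);
}
```

One caveat: if `work` has no observable side effects, an optimizing compiler may remove it, so it usually helps to write the result somewhere the compiler cannot ignore, such as a volatile variable.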
For finding speedups, sampling the call stack is very effective, provided the samples contain line-level or instruction-level call site information.
How many samples do you need?
Well, if you see it doing something that could be removed on one sample, that probably doesn't mean much.
But if you see it on two samples, that estimates it's costing time fraction F of about 2/N where N is the number of samples.
Example: if you see it twice in 10 samples, that means it costs roughly 20% of time.
In general, if the speedup is going to save you fraction F of time, it takes on average 2/F samples to see it twice.
Example: if it is going to save 30% of time (F = 0.3) you need on average 2/0.3 = 6.67 samples to see it twice.
Of course, if you see it more than twice, all the better.
Bottom line, for finding speedups, you don't need a lot of samples.
What you do need is to examine each one for activity that could be removed.
What you don't need is to mush them together into "statistics" (like most profilers do).
Many people understand this.
If you want a bit more rigorous explanation, look here.

How is FLOPS/IOPS calculated and what is its use?

I have been following some tutorials for OpenCL and a lot of times people speak in terms of FLOPS. Wikipedia does explain the formula but does not tell what it actually means? For example, 1 light year = 9.4605284 × 10^15 meters but what it means is the distance traveled by light in a year. Similarly what does FLOP mean?
Answer to a similar question says 100 IOPS for the code
for(int i = 0; i < 100; ++i)
Ignoring the initialisation, I see 100 increment operations, so there's 100IOPS. But I also see 100 comparison operations. So why isn't it 200IOPS? So what types of operators are included in FLOPS/IOPS calculation?
Secondly I want to know what would you do by calculating the FLOPS of your algorithm?
I am asking this because the value is specific to a CPU's clock speed and number of cores.
Any guidance on this arena would be very helpful.
"FLOPS" stands for "Floating Point Operations Per Second" and it's exactly that. It's used as a measure of the computing speed of large, number based (usually scientific) operations. Measuring it is a matter of knowing two things:
1.) The precise execution time of your algorithm
2.) The precise number of floating point operations involved in your algorithm
You can get pretty good approximations of the first one from profiling tools, and the second one from...well you might be on your own there. You can look through the source for floating point operations like "1.0 + 2.0" or look at the generated assembly code, but those can both be misleading. There's probably a debugger out there that will give you FLOPS directly.
It's important to understand that there is a theoretical maximum FLOPS value for the system you're running on, and then there is the actual achieved FLOPS of your algorithm. The ratio of these two can give you a sense of the efficiency of your algorithm. Hope this helps.

How to measure FLOPS

How do I measure FLOPS or IOPS? If I do measure time for ordinary floating point addition / multiplication , is it equivalent to FLOPS?
FLOPS is floating point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all floating point operations and divide that by the measured wall time. You should count all ordinary operations like additions, subtractions, multiplications and divisions (yes, even though divisions are slower and better avoided, they are still FLOPs). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure, you will likely have to look at the assembly.
FLOPS is not the same as Operations per second. So even though some architectures have a single MAD (multiply-and-add) instruction, those still count as two FLOPs. Similarly the SSE instructions. You count them as one instruction, though they perform more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to somebody else's FLOPS, especially the hardware vendors'. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations. So unless your code has those, you will never get this performance. Either rethink the algorithm, or scale the peak hardware FLOPS by a correct factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplication, you would divide it by 2. Counting right might get your code from suboptimal to quite efficient without changing a single line of code.
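As a rough illustration of "count the operations by hand, measure the time, divide", here is a sketch built around a dot product, which performs one multiply and one add per element. All function names are hypothetical, and the hand count of 2*n FLOPs is an assumption about what the compiler actually emits.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Dot product: one multiply and one add per element, so 2*n FLOPs by hand count.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Achieved FLOPS = counted floating point operations / measured seconds.
double achieved_flops(double flop_count, double seconds) {
    return flop_count / seconds;
}

// Ratio of achieved FLOPS to a (possibly adjusted) peak figure.
double efficiency(double achieved, double peak) {
    return achieved / peak;
}

// Time `reps` dot products of length `n` and return the estimated FLOPS.
// The optimizer may still hoist or vectorize the work, which is exactly
// the counting problem described above.
double measure_dot_flops(std::size_t n, std::size_t reps) {
    std::vector<double> a(n, 1.5), b(n, 2.0);
    volatile double sink = 0.0;  // keeps the result observable
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < reps; ++r) {
        sink = dot(a, b);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    const double flop_count = 2.0 * static_cast<double>(n) * static_cast<double>(reps);
    return achieved_flops(flop_count, elapsed.count());
}
```

To compare against a vendor figure that assumes MAD instructions, the peak passed to efficiency would first be scaled by whatever factor fits your own instruction mix.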
You can use the CPU performance counters to get the CPU to itself count the number of floating point operations it uses for your particular program. Then it is the simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily, I have a writeup on the details of this on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOPs are not well defined. Multiply FLOPs are different from add FLOPs. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "Dhrystone MIPS" and floating point in "Linpack megaFLOPS". In these, "Dhrystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations. They're much the same, though in this case, there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture-specific question. For a naive, basic start I would recommend finding out how many operations one multiplication takes on your specific hardware, then doing a large matrix multiplication and seeing how long it takes. Then you can easily estimate the FLOPS of your particular hardware.
The industry standard for measuring FLOPS is the well-known Linpack, or HPL (High Performance Linpack). Try looking at the source or running those yourself.
I would also refer to this answer as an excellent reference

benchmarking trig lookup tables performance gains vs cpp implementation

We are developing a real-time system that will be performing sin/cos calculations during a time critical period of operation. We're considering using a lookup table to help with performance, and I'm trying to benchmark the benefit/cost of implementing a table. Unfortunately we don't yet know what degree of accuracy we will need, but probably around 5-6 decimal points.
I figure that a thorough comparison of C++ trig functions to lookup approaches has already been done. I was hoping that someone could provide me with a link to a site documenting any such benchmarking. If such results don't exist, I would appreciate any suggestions for how I can determine how much memory is required for a lookup table assuming a given minimum accuracy, and how I can determine the potential speed benefits.
Thanks!
I can't answer all your questions, but instead of trying to determine theoretical speed benefits you would almost certainly be better off profiling it in your actual application. Then you get an accurate picture of what sort of improvement you stand to gain in your specific problem domain, which is the most useful information for your needs.
What accuracy is your degree input? (Let's use degrees over radians to keep the discussion "simpler".) Tenths of a degree? Hundredths of a degree? If your angle precision is not great, then your trig result cannot be any better.
I've seen this implemented as an array indexed by hundredths of a degree (keeping the angle as an integer with two implied decimal places also helps with the calculation: no need to use high precision float/double radian angles).
Storing SIN values for 0.00 to 90.00 degrees would take 9001 32-bit float values.
SIN[0] = 0.0
...
SIN[4500] = 0.7071068
...
SIN[9000] = 1.0
If you have SIN, the trig property of COS(a) = SIN(90-a)
just means you do
SIN[9000-a]
to get COS(a)
If you need more precision but don't have the memory for more table space, you could do linear interpolation between the two entries in the array, e.g. SIN of 45.00123 would be
SIN[4500] + 0.123 * (SIN[4501] - SIN[4500])
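The table described above could be sketched in C++ like this. The function names are hypothetical, the table covers only the first quadrant (0.00 to 90.00 degrees), and callers are assumed to pass angles inside that range.

```cpp
#include <cmath>
#include <vector>

// Build SIN[0..9000]: sine of 0.00 .. 90.00 degrees in 0.01-degree steps.
std::vector<float> build_sin_table() {
    const double pi = std::acos(-1.0);
    std::vector<float> table(9001);
    for (int i = 0; i <= 9000; ++i) {
        table[i] = static_cast<float>(std::sin(i * 0.01 * pi / 180.0));
    }
    return table;
}

// Angle given as an integer number of hundredths of a degree, 0..9000.
float table_sin(const std::vector<float>& table, int hundredths) {
    return table[hundredths];
}

// COS(a) = SIN(90 - a), so read the same table from the other end.
float table_cos(const std::vector<float>& table, int hundredths) {
    return table[9000 - hundredths];
}

// Linear interpolation between neighbouring entries, for 0 <= degrees <= 90.
float table_sin_interp(const std::vector<float>& table, double degrees) {
    const double scaled = degrees * 100.0;      // position in table units
    const int idx = static_cast<int>(scaled);   // lower neighbour
    if (idx >= 9000) {
        return table[9000];                     // no upper neighbour at 90.00
    }
    const double frac = scaled - idx;           // e.g. 0.123 for 45.00123
    return static_cast<float>(table[idx] + frac * (table[idx + 1] - table[idx]));
}
```

Whether this beats the standard library on a given target depends on cache behaviour and the required accuracy, so it is worth timing both in the real application.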
The only way to know the performance characteristics of the two approaches is to try them.
Yes, there are probably benchmarks of this made by others, but they didn't run in the context of your code, and they weren't running on your hardware, so they're not very applicable to your situation.
One thing you can do, however, is to look up the instruction latencies in the manuals for your CPU. (Intel and AMD have this information available in PDF form on their websites, and most other CPU manufacturers have similar documents)
Then you can at least find out how fast the actual trig instructions are, giving you a baseline that the lookup table will have to beat to be worthwhile.
But that only gives you a rough estimate of one side of the equation. You might be able to make a similar rough estimate of the cost of a lookup table as well, if you know the latencies of the CPU's caches, and you have a rough idea of the latency of memory accesses.
But the only way to get accurate information is to try it. Implement both, and see what happens in your application. Only then will you know which is better in your case.

How do you measure the effect of branch misprediction?

I'm currently profiling an implementation of binary search. Using some special instructions to measure this I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS based architecture.
You're losing 0.2 * N cycles per iteration, where N is the number of cycles that it takes to flush the pipelines after a mispredicted branch. Suppose N = 10 then that means you are losing 2 clocks per iteration on aggregate. Unless you have a very small inner loop then this is probably not going to be a significant performance hit.
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.
On an in-order CPU you may be able to calculate the approximate mispredict cost as a product of the number of mispredicts and the mispredict cost (which is generally a function of some part of the pipeline)
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight [1], only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc., or it may be somewhere in between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic; one found a penalty ranging from 9 to 35 cycles averaged across entire benchmarks. If you look at some small piece of code the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the hundreds of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well, a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark input to have a range of behavior, starting with always following the same branch pattern, all the way to having a random pattern, you can plot the misprediction count versus runtime degradation. If you do, share your result!
[1] Hundreds of instructions in flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
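The "control the mispredictions and measure the difference" idea could be sketched as below. This is not a binary search; it is a simpler stand-in loop whose branch is easy to predict on sorted input and hard to predict on shuffled input. All names, the value range 0..255 and the threshold 128 are assumptions for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Branchy loop: whether the `if` is taken depends entirely on the data.
std::int64_t sum_above(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        if (v >= threshold) {
            sum += v;
        }
    }
    return sum;
}

struct BranchTiming {
    double predictable_seconds;    // sorted input, regular branch pattern
    double random_seconds;         // same values in random order
    std::int64_t predictable_sum;  // checksums, should be equal
    std::int64_t random_sum;
};

// Time `reps` passes over `data`; the accumulated sum is written to `out`.
static double time_sum(const std::vector<int>& data, int threshold,
                       int reps, std::int64_t& out) {
    const auto start = std::chrono::steady_clock::now();
    std::int64_t total = 0;
    for (int r = 0; r < reps; ++r) {
        total += sum_above(data, threshold);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    out = total;
    return elapsed.count();
}

BranchTiming compare_branch_patterns(std::size_t n, int reps, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> random_data(n);
    for (int& v : random_data) {
        v = dist(rng);
    }
    std::vector<int> sorted_data = random_data;
    std::sort(sorted_data.begin(), sorted_data.end());

    BranchTiming t{};
    t.predictable_seconds = time_sum(sorted_data, 128, reps, t.predictable_sum);
    t.random_seconds = time_sum(random_data, 128, reps, t.random_sum);
    return t;
}
```

Be aware that a compiler may replace the branch with a conditional move or vectorize the loop, in which case both timings come out about the same. Checking the generated assembly, or the misprediction counter the question mentions, confirms whether the branch is really there.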
Look at your specs for that info and, if that fails, run it a billion times and time it externally to your program (with a stopwatch or something). Then run it without a miss and compare.