How to get very precise elapsed time in C++ - c++

I'm running my code on Ubuntu, and I need to measure the elapsed time of a function in my program. I need very accurate timing, down to nanoseconds or at least microseconds.
I read about <chrono>, but it uses system time, and I would prefer to use CPU time.
Is there a way to do that and get that granularity (nanoseconds)?

std::chrono does have a high_resolution_clock, though please bear in mind that the precision is limited by the processor.
If you want to use functions directly from libc, you can use gettimeofday, but as before there is no guarantee that this will be nanosecond-accurate (it only has microsecond resolution).
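For illustration, a minimal sketch of timing a function this way - doSomething() is just a placeholder workload, and the reported nanosecond count is only as good as the underlying clock:

    #include <chrono>
    #include <cmath>
    #include <iostream>

    // Placeholder workload standing in for the function you want to time.
    double doSomething() {
        double sum = 0.0;
        for (int i = 1; i <= 100000; ++i) sum += std::sqrt(static_cast<double>(i));
        return sum;
    }

    int main() {
        auto start = std::chrono::high_resolution_clock::now();
        volatile double result = doSomething();  // volatile keeps the call from being optimized away
        auto end = std::chrono::high_resolution_clock::now();

        // The count is reported in nanoseconds, but the effective resolution
        // depends on the clock the platform actually provides.
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
        std::cout << "elapsed: " << ns.count() << " ns (result " << result << ")\n";
    }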

The achievable precision of a clock is one of those hardware/OS properties that still leak into virtually every language, and, to be honest, having been in the same situation, I find that building your own abstraction that is good enough for your case is often the only choice.
That being said, I would avoid the STL for high-precision timing. Since it is a library standard with no one true implementation, it has to create an abstraction, which implies one of:
use a least common denominator
leak hardware/OS details through platform-dependent behavior
In the second case you are essentially back to where you started, if you want to have uniform behavior. If you can afford the possible loss of precision or the deviations of a standard clock, then by all means use it. Clocks are hard and subtle.
If you know your target environment, you can choose the appropriate clocks the old-school way (#ifdef PLATFORM_ID...), e.g. clock_gettime() on POSIX or QueryPerformanceCounter (QPC) on Windows, and implement the most precise abstraction you can get. Of course you are limited by the same choices the STL has to make, but by reducing the set of platforms, you can generally improve on the least-common-denominator requirement.
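As an illustration, a minimal sketch of such a platform-selected abstraction (the helper name now_ns() is made up here, and error handling is omitted):

    #include <cstdint>
    #include <iostream>

    #if defined(_WIN32)
      #include <windows.h>
      // Windows: QueryPerformanceCounter / QueryPerformanceFrequency.
      // The frequency is constant at runtime; caching it is left out for brevity.
      inline std::int64_t now_ns() {
          LARGE_INTEGER freq, count;
          QueryPerformanceFrequency(&freq);
          QueryPerformanceCounter(&count);
          // Split into whole seconds and remainder to avoid 64-bit overflow.
          return (count.QuadPart / freq.QuadPart) * 1000000000LL
               + (count.QuadPart % freq.QuadPart) * 1000000000LL / freq.QuadPart;
      }
    #else
      #include <time.h>
      // POSIX: clock_gettime with the monotonic clock.
      inline std::int64_t now_ns() {
          timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
      }
    #endif

    int main() { std::cout << now_ns() << " ns since an arbitrary start point\n"; }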
If you need a more theoretical way to convince yourself of this argument, consider a set of clocks with their maximum precisions and a sequence of accesses to the current time. For clocks advancing uniformly in uniform steps, if two accesses happen faster than the maximum precision of one clock but slower than the maximum precision of another, you are bound to observe different behavior between the two clocks. If, on the other hand, you ensure that two accesses are at least the maximum precision of the slowest clock apart, the behavior is the same. Of course, real clocks neither advance uniformly (clock drift) nor in unit steps.

While there is a standard function that should return the CPU time (std::clock), in reality there's no portable way to do this.
On POSIX systems (which Linux attempts to be), std::clock should do the right thing. Just don't expect it to work the same on non-POSIX platforms if you ever want to make your application portable.
The values returned by std::clock are also approximate, and the precision and resolution are system-dependent.
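For example, a minimal sketch of measuring CPU time with std::clock (the loop is just a placeholder workload):

    #include <ctime>
    #include <iostream>

    int main() {
        std::clock_t start = std::clock();

        // Placeholder workload.
        volatile double sum = 0.0;
        for (long i = 0; i < 10000000L; ++i) sum = sum + i * 0.5;

        std::clock_t end = std::clock();

        // CLOCKS_PER_SEC is the scale of clock_t values, not the resolution.
        double cpu_seconds = static_cast<double>(end - start) / CLOCKS_PER_SEC;
        std::cout << "CPU time: " << cpu_seconds << " s\n";
    }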

Related

Is there a standard library implementation where high_resolution_clock is not a typedef?

The C++ draft, paragraph 20.12.7.3, reads:
high_resolution_clock may be a synonym for system_clock or steady_clock
Of course this "may" mandates nothing, but I wonder:
Is there any point in making high_resolution_clock something other than a typedef?
Are there such implementations?
If a clock with a shorter tick period is devised, it can be either steady or not steady. So if such a mechanism exists, wouldn't we want to "improve" system_clock or steady_clock as well, defaulting to the typedef solution once more?
The reason that specs use wording such as "may" and "can", and other vague words that allow for other possibilities, is that the spec writers do not want to (unnecessarily) limit the implementation of a "better" solution.
Imagine a system where time in general is counted in seconds, and the system_clock is just that - the system_clock::period will return 1 second. This time is stored as a single 64-bit integer.
Now, in the same system, there is also a time in nanoseconds, but it's stored as a 128-bit integer. The resulting time calculations are slightly more complex due to this large integer format, and for someone who only needs 1 s precision (in a system where a large number of calculations on time are made), you wouldn't want to pay the extra penalty of using high_resolution_clock when the system doesn't need it.
As to whether such things exist in real life, I'm not sure. The key is that it's not a violation of the standard if you care to implement it that way.
Note that steady is very much a property of "what happens when the system time changes" (e.g. if the outside network has been down for several days, and the internal clock in the system has drifted off from the atomic clock that network time updates it to). Using steady_clock guarantees that time doesn't go backwards or suddenly jump forward by 25 seconds. Likewise, there is no problem when there is a "leap second" or similar time adjustment in the computer system. On the other hand, a system_clock is guaranteed to give you the correct new time if you give it a forward duration past a daylight-savings change, or some such, where steady_clock will just tick along hour after hour, regardless. So choosing the right one of those will affect the recording of your favourite program in the digital TV recorder - steady_clock would record at the wrong time [my DTV recorder did this wrong a few years back, but they appear to have fixed it now].
system_clock should also take into account the user (or sysadmin) changing the clock in the system, steady_clock should NOT do so.
Again, high_resolution_clock may or may not be steady - it's up to the implementor of the C++ library to give the appropriate response to is_steady.
In the 4.9.2 version of <chrono>, we find this: using high_resolution_clock = system_clock; so in this case it's a direct typedef (by a different name). But the spec doesn't REQUIRE this.
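You can check what your own standard library does with a small probe along these lines (the output is, of course, implementation-dependent):

    #include <chrono>
    #include <iostream>

    // Print the tick period and steadiness of a clock type.
    template <typename Clock>
    void describe(const char* name) {
        using period = typename Clock::period;
        std::cout << name << ": tick period = " << period::num << "/" << period::den
                  << " s, is_steady = " << std::boolalpha << Clock::is_steady << "\n";
    }

    int main() {
        describe<std::chrono::system_clock>("system_clock");
        describe<std::chrono::steady_clock>("steady_clock");
        describe<std::chrono::high_resolution_clock>("high_resolution_clock");
    }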

Difference between clock() and MPI_Wtime()

Quick question: for the MPI implementation of my code, I am getting a huge difference between the two. I know MPI_Wtime is the real time elapsed on each processor and clock() gives a rough idea of the expected time. Would anyone like to add some clarification?
The clock function is utterly useless. It measures CPU time, not real time/wall time, and moreover it has the following serious issues:
On most implementations, the resolution is extremely bad, for example, 1/100 of a second. CLOCKS_PER_SEC is not the resolution, just the scale.
With typical values of CLOCKS_PER_SEC (Unix standards require it to be 1 million, for example), clock will overflow in a matter of minutes on 32-bit systems. After overflow, it returns -1.
Most historical implementations don't actually return -1 on overflow, as the C standard requires, but instead wrap. As clock_t is usually a signed type, attempting to perform arithmetic with the wrapped values will produce either meaningless results or undefined behavior.
On Windows it does the completely wrong thing and measures elapsed real time, rather than cpu time.
The official definition of clock is that it gives you CPU time. On Windows, for historical reasons - changing it to reflect CPU time now would break some applications - the value returned is just elapsed time.
MPI_Wtime gives, as you say, the wall-clock time on this processor, which is quite different. If you do something that sleeps for 1 minute, MPI_Wtime will move 60 seconds forward, whereas clock (except on Windows) would be pretty much unchanged.
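A small sketch that makes the difference visible (a process that sleeps for a second accumulates wall time but almost no CPU time):

    #include <mpi.h>
    #include <chrono>
    #include <ctime>
    #include <iostream>
    #include <thread>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        double w0 = MPI_Wtime();
        std::clock_t c0 = std::clock();

        std::this_thread::sleep_for(std::chrono::seconds(1));  // no CPU work, just waiting

        double w1 = MPI_Wtime();
        std::clock_t c1 = std::clock();

        std::cout << "MPI_Wtime (wall): " << (w1 - w0) << " s\n";
        std::cout << "clock()   (CPU):  " << static_cast<double>(c1 - c0) / CLOCKS_PER_SEC << " s\n";

        MPI_Finalize();
    }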

Portable good precision double timestamp in C++?

Here's what I'd need to do:
double now=getdoubletimestampsomehow();
Where getdoubletimestampsomehow() should be a straightforward, easy-to-use function returning a double value representing the number of seconds elapsed from a given date. I need it to be quite precise, but I don't really need more precision than a few milliseconds. Portability is quite important; if it isn't possible to port it directly anywhere, could you please tell me both a Unix and a Windows way to do it?
Have you looked at Boost, and particularly its Date_Time library? Here is the seconds-since-epoch example.
You will be hard-pressed to find something more portable, and of higher resolution.
Portable good precision double timestamp in C++?
There is no portable way to get a high-precision (millisecond) timestamp without using 3rd-party libraries. The maximum precision you'll get is 1 second, using time/localtime/gmtime.
If you're fine with 3rd party libraries, use either Boost or Qt 4.
both an unix and a windows way to do it?
GetSystemTime on Windows and gettimeofday on Linux.
Please note that if you're planning to use timestamps to determine the order of some events, that might be a bad idea. The system clock might have very limited precision (10 milliseconds on the Windows platform), in which case several operations performed consecutively can produce the same timestamp. To determine the order of events you would need "logical timestamps" (a "vector clock" is one example).
On the Windows platform, there are high-precision functions that can be used to determine how much time has passed since some point in the past (QueryPerformanceCounter), but they aren't connected to timestamps.
C++11 introduced the <chrono> header containing quite a few portable clocks. The highest resolution clock among them is the std::chrono::high_resolution_clock.
It provides the current time as a std::chrono::time_point object which has a time_since_epoch member. This might contain what you want.
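For example, a minimal sketch of the getdoubletimestampsomehow() from the question built on <chrono> (the epoch is whatever system_clock uses, which is typically the Unix epoch):

    #include <chrono>
    #include <iostream>

    // Seconds since the system_clock epoch, as a double with sub-second precision.
    double getdoubletimestampsomehow() {
        using namespace std::chrono;
        return duration<double>(system_clock::now().time_since_epoch()).count();
    }

    int main() {
        double now = getdoubletimestampsomehow();
        std::cout.precision(17);
        std::cout << now << "\n";
    }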
Reference:
Prior to the release of the C++11 standard, there was no standard way in which one could accurately measure the execution time of a piece of code. The programmer was forced to use external libraries like Boost, or routines provided by each operating system.
The C++11 chrono header file provides three standard clocks that could be used for timing one’s code:
system_clock - this is the real-time clock used by the system;
high_resolution_clock - this is a clock with the shortest tick period possible on the current system;
steady_clock - this is a monotonic clock that is guaranteed to never be adjusted.
If you want to measure the time taken by a certain piece of code for execution, you should generally use the steady_clock, which is a monotonic clock that is never adjusted by the system. The other two clocks provided by the chrono header can be occasionally adjusted, so the difference between two consecutive time moments, t0 < t1, is not always positive.
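A minimal sketch of timing a section of code with steady_clock along those lines:

    #include <chrono>
    #include <iostream>

    int main() {
        auto t0 = std::chrono::steady_clock::now();

        // ... code being timed goes here (placeholder loop below) ...
        volatile long dummy = 0;
        for (long i = 0; i < 5000000L; ++i) dummy = dummy + i;

        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> elapsed = t1 - t0;
        std::cout << "elapsed: " << elapsed.count() << " ms\n";
    }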
Doubles are not precise - therefore your idea of double now=getdoubletimestampsomehow(); falls down at the first hurdle.
Others have mentioned other possibilities. I would explore those.

How to measure FLOPS

How do I measure FLOPS or IOPS? If I measure the time taken by an ordinary floating-point addition/multiplication, is that equivalent to FLOPS?
FLOPS is floating-point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all the floating-point operations and divide that by the measured wall time. You should count all ordinary operations: additions, subtractions, multiplications, divisions (yes, even though they are slower and better avoided, they are still FLOPs). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure you will likely have to look at the assembly.
FLOPS is not the same as operations per second. So even though some architectures have a single MAD (multiply-and-add) instruction, it still counts as two FLOPs. Similarly with SSE instructions: you count them as one instruction, though they perform more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to somebody else's FLOPS, especially the hardware vendors'. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations. So unless your code has those, you will never reach this peak. Either rethink the algorithm, or scale the peak hardware FLOPS by a correction factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplications, you would divide the peak by 2. Counting right might take your code from looking suboptimal to looking quite efficient without changing a single line of code.
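A rough sketch of that approach on a trivial kernel - the FLOP count here is estimated from the loop structure rather than measured, and the compiler may still fuse or reorder operations, so treat the result as an estimate:

    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 22;  // ~4 million elements
        std::vector<double> x(n, 1.5), y(n, 2.5);
        double acc = 0.0;

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            acc += x[i] * y[i];          // 1 multiply + 1 add = 2 FLOPs per iteration
        auto t1 = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        double flops = 2.0 * static_cast<double>(n) / seconds;  // estimated, not measured
        std::cout << "~" << flops / 1e9 << " GFLOP/s (acc = " << acc << ")\n";
    }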
You can use the CPU performance counters to get the CPU itself to count the number of floating-point operations it uses for your particular program. Then it is a simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily; I have a write-up on the details of this on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOPs are not well defined. mul FLOPS are different from add FLOPS. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "Dhrystone MIPS" and floating point in "Linpack megaFLOPS". Here, "Dhrystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations. They're much the same, though in this case, there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture-specific question. For a naive/basic start, I would recommend finding out how many operations one multiplication takes on your specific hardware, then doing a large matrix multiplication and seeing how long it takes. From that you can easily estimate the FLOPS of your particular hardware.
The industry standard for measuring FLOPS is the well-known Linpack, or HPL (High-Performance Linpack); try looking at the source or running it yourself.
I would also refer to this answer as an excellent reference

Overhead of casting double to float?

So I have megabytes of data stored as doubles that need to be sent over a network... now I don't need the precision that a double offers, so I want to convert these to a float before sending them over the network. What is the overhead of simply doing:
float myFloat = (float)myDouble;
I'll be doing this operation several million times every few seconds and don't want to slow anything down. Thanks
EDIT: My platform is x64 with Visual Studio 2008.
EDIT 2: I have no control over how they are stored.
As Michael Burr said, while the overhead strongly depends on your platform, the overhead is definitely less than the time needed to send them over the wire.
A rough estimate:
800 MBit/s of payload on an excellent gigabit wire is about 25M floats/second.
On a 2 GHz single core, that gives you a whopping 80 clock cycles per converted value to break even - anything less, and you will save time. That should be more than enough on all architectures :)
A simple load-store cycle (barring all caching delays) should be below 5 cycles per value. With instruction interleaving, SIMD extensions and/or parallelizing on multiple cores, you are likely to do multiple conversions in a single cycle.
Also, the receiver will be happy having to handle only half the data. Remember that memory access time is nonlinear.
The only argument against the conversion would be if the transfer should have minimal CPU load: a modern architecture can transfer the data from disk/memory to the bus without CPU intervention. However, with the above numbers I'd say that doesn't matter in practice.
[edit]
I checked some numbers: the 387 coprocessor would indeed have taken around 70 cycles for a load-store cycle. On the original Pentium, you are down to 3 cycles without any parallelization.
So, unless you run a gigabit network on a 386...
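For reference, the conversion being discussed is just a narrowing loop like the one below, which modern compilers can usually auto-vectorize (the buffer and function names here are made up):

    #include <cstddef>
    #include <vector>

    // Narrow a buffer of doubles to floats before sending it over the network.
    // With optimizations enabled, this loop usually compiles down to packed
    // cvtpd2ps-style SIMD conversions on x86-64.
    std::vector<float> narrow(const std::vector<double>& src) {
        std::vector<float> dst(src.size());
        for (std::size_t i = 0; i < src.size(); ++i)
            dst[i] = static_cast<float>(src[i]);
        return dst;
    }

    int main() {
        std::vector<double> samples(1000000, 3.14159);
        std::vector<float> packed = narrow(samples);
        return packed.size() == samples.size() ? 0 : 1;
    }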
It's going to depend on your implementation of C++ libraries. Test it and see.
Even if it does take time this will not be the slow point in your application.
Your FPU can do the conversion a lot quicker than it can send network traffic (so the bottleneck here will more than likely be the write to the socket).
But as with all things like this, measure it and see.
Personally I don't think any time spent here will affect the real time spent sending the data.
Assuming that you're talking about a significant number of packets to ship the data (a reasonable assumption if you're sending millions of values) casting the doubles to float will likely reduce the number of network packets by about half (assuming sizeof(double)==8 and sizeof(float)==4).
Almost certainly the savings in network traffic will dominate whatever time is spent performing the conversion. But as everyone says, measuring some tests will be the proof of the pudding.
Bearing in mind that most compilers deal with doubles a lot more efficiently than floats -- many promote float to double before performing operations on them -- I'd consider taking the block of data, ZIPping/compressing it, then sending the compressed block across. Depending on what your data looks like, you could get 60-90% compression, vs. the 50% you'd get converting 8-byte values to four bytes.
You don't have any choice but to measure it yourself and see. You could use timers to measure it. It looks like someone has already implemented a neat C++ timer class.
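If you'd rather not pull in an external class, a minimal sketch based on <chrono> could look like this (the name ScopedTimer is made up):

    #include <chrono>
    #include <iostream>

    // Prints the elapsed wall time of the enclosing scope when it is destroyed.
    class ScopedTimer {
    public:
        explicit ScopedTimer(const char* label)
            : label_(label), start_(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            auto end = std::chrono::steady_clock::now();
            std::chrono::duration<double, std::milli> ms = end - start_;
            std::cout << label_ << ": " << ms.count() << " ms\n";
        }
    private:
        const char* label_;
        std::chrono::steady_clock::time_point start_;
    };

    int main() {
        ScopedTimer t("double-to-float conversion");
        // ... conversion loop being measured goes here ...
    }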
I think this cast is a lot cheaper than you think, since it doesn't really involve much calculation. On x64 it is essentially a single hardware instruction (cvtsd2ss) that narrows the exponent and rounds away the extra mantissa bits.
It will also depend on the CPU and what floating point support it has. In the bad old days (1980s), processors supported integer operations only. Floating point math had to be emulated in software. A separate chip for floating point (a coprocessor) could be bought separately.
Modern CPUs now have SIMD instructions, so large amounts of floating point data can be processed at once. These instructions include MMX, SSE, 3DNow! and the like. Your compiler may know how to make use of these instructions, but you may need to write your code in a particular way, and turn on the right options.
Finally, the fastest way to process floating point data is in a video card. A fairly new language called OpenCL lets you send tasks to the video card to be processed there.
It all depends on how much performance you need.