Two consequent std::chrono::high_resolution_clock::now() gives ~270ns difference

Two consequent std::chrono::high_resolution_clock::now() gives ~270ns difference - c++

I want to measure duration of a piece of code with a std::chrono clock, but it seems too heavy to measure something that lasts nanoseconds. That program:
#include <cstdio>
#include <chrono>
int main() {
using clock = std::chrono::high_resolution_clock;
// try several times
for (int i = 0; i < 5; i++) {
// two consequent now() here, one right after another without anything in between
printf("%dns\n", (int)std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - clock::now()).count());
}
return 0;
}
Always gives me around 100-300ns. Is this because of two syscalls? Is it possible to have less duration between two now()? Thanks!
Environment: Linux Ubuntu 18.04, kernel 4.18, load average is low, stdlib is linked dynamically.

Use rdtsc instruction to measure times with the highest resolution and the least overhead possible:
#include <iostream>
#include <cstdint>
int main() {
uint64_t a = __builtin_ia32_rdtsc();
uint64_t b = __builtin_ia32_rdtsc();
std::cout << b - a << " cpu cycles\n";
}
Output:
19 cpu cycles
To convert the cycles to nanoseconds divide cycles by the base CPU frequency in GHz. For example, for a 4.2GHz i7-7700k divide by 4.2.
TSC is a global counter in the CPU shared across all cores.
Modern CPUs have a constant TSC that ticks at the same rate regardless of the current CPU frequency and boost. Look for constant_tsc in /proc/cpuinfo, flags field.
Also note, that __builtin_ia32_rdtsc is more effective than the inline assembly, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48877

If you want to measure the duration of very fast code snippets it is generally a good idea to run them multiple times and take the average time of all runs, the ~200ns that you mention will be negligible then because they are distributed over all runs.
Example:
#include <cstdio>
#include <chrono>
using clock = std::chrono::high_resolution_clock;
auto start = clock::now();
int n = 10000; // adjust depending on the expected runtime of your code
for (unsigned int i = 0; i < n; ++i)
functionYouWantToTime();
auto result =
std::chrono::duration_cast<std::chrono::nanoseconds>(start - clock::now()).count() / n;

Just do not use time clocks for nanoseconds benchmark. Instead, use CPU ticks - on any hardware modern enough to worry about nanoseconds, CPU ticks are monotonic, steady and synchronized between cores.
Unfortunately, C++ does not expose CPU tick clock, so you'd have to use RDTSC instruction directly (it can be nicely wrapped in the inline function or you can use compiler's intrinsics). The difference in CPU ticks could also be converted into time if you so desire (by using CPU frequency), but normally for such a low-latency benchmarks it is not necessary.

Related

C++ Linux fastest way to measure time (faster than std::chrono) ? Benchmark included

#include <iostream>
#include <chrono>
using namespace std;
class MyTimer {
private:
std::chrono::time_point<std::chrono::steady_clock> starter;
std::chrono::time_point<std::chrono::steady_clock> ender;
public:
void startCounter() {
starter = std::chrono::steady_clock::now();
}
double getCounter() {
ender = std::chrono::steady_clock::now();
return double(std::chrono::duration_cast<std::chrono::nanoseconds>(ender - starter).count()) /
1000000; // millisecond output
}
// timer need to have nanosecond precision
int64_t getCounterNs() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - starter).count();
}
};
MyTimer timer1, timer2, timerMain;
volatile int64_t dummy = 0, res1 = 0, res2 = 0;
// time run without any time measure
void func0() {
dummy++;
}
// we're trying to measure the cost of startCounter() and getCounterNs(), not "dummy++"
void func1() {
timer1.startCounter();
dummy++;
res1 += timer1.getCounterNs();
}
void func2() {
// start your counter here
dummy++;
// res2 += end your counter here
}
int main()
{
int i, ntest = 1000 * 1000 * 100;
int64_t runtime0, runtime1, runtime2;
timerMain.startCounter();
for (i=1; i<=ntest; i++) func0();
runtime0 = timerMain.getCounter();
cout << "Time0 = " << runtime0 << "ms\n";
timerMain.startCounter();
for (i=1; i<=ntest; i++) func1();
runtime1 = timerMain.getCounter();
cout << "Time1 = " << runtime1 << "ms\n";
timerMain.startCounter();
for (i=1; i<=ntest; i++) func2();
runtime2 = timerMain.getCounter();
cout << "Time2 = " << runtime2 << "ms\n";
return 0;
}
I'm trying to profile a program where certain critical parts have execution time measured in < 50 nanoseconds. I found that my timer class using std::chrono is too expensive (code with timing takes 40% more time than code without). How can I make a faster timer class?
I think some OS-specific system calls would be the fastest solution. The platform is Linux Ubuntu.
Edit: all code is compiled with -O3. It's ensured that each timer is only initialized once, so the measured cost is due to the startMeasure/stopMeasure functions only. I'm not doing any text printing.
Edit 2: the accepted answer doesn't include the method to actually convert number-of-cycles to nanoseconds. If someone can do that, it'd be very helpful.

What you want is called "micro-benchmarking". It can get very complex. I assume you are using Ubuntu Linux on x86_64. This is not valid form ARM, ARM64 or any other platforms.
std::chrono is implemented at libstdc++ (gcc) and libc++ (clang) on Linux as simply a thin wrapper around the GLIBC, the C library, which does all the heavy lifting. If you look at std::chrono::steady_clock::now() you will see calls to clock_gettime().
clock_gettime() is a VDSO, ie it is kernel code that runs in userspace. It should be very fast but it might be that from time to time it has to do some housekeeping and take a long time every n-th call. So I would not recommend for microbenchmarking.
Almost every platform has a cycle counter and x86 has the assembly instruction rdtsc. This instruction can be inserted in your code by crafting asm calls or by using the compiler-specific builtins __builtin_ia32_rdtsc() or __rdtsc().
These calls will return a 64-bit integer representing the number of clocks since the machine power up. rdtsc is not immediate but fast, it will take roughly 15-40 cycles to complete.
It is not guaranteed in all platforms that this counter will be the same for each core so beware when the process gets moved from core to core. In modern systems this should not be a problem though.
Another problem with rdtsc is that compilers will often reorder instructions if they find they don't have side effects and unfortunately rdtsc is one of them. So you have to use fake barriers around these counter reads if you see that the compiler is playing tricks on you - look at the generated assembly.
Also a big problem is cpu out of order execution itself. Not only the compiler can change the order of execution but the cpu can as well. Since the x86 486 the Intel CPUs are pipelined so several instructions can be executed at the same time - roughly speaking. So you might end up measuring spurious execution.
I recommend you to get familiar with the quantum-like problems of micro-benchmarking. It is not straightforward.
Notice that rdtsc() will return the number of cycles. You have to convert to nanoseconds using the timestamp counter frequency.
Here is one example:
#include <iostream>
#include <cstdio>
void dosomething() {
// yada yada
}
int main() {
double sum = 0;
const uint32_t numloops = 100000000;
for ( uint32_t j=0; j<numloops; ++j ) {
uint64_t t0 = __builtin_ia32_rdtsc();
dosomething();
uint64_t t1 = __builtin_ia32_rdtsc();
uint64_t elapsed = t1-t0;
sum += elapsed;
}
std::cout << "Average:" << sum/numloops << std::endl;
}
This paper is a bit outdated (2010) but it is sufficiently up to date to give you a good introduction to micro-benchmarking:
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures

How can I obtain consistently high throughput in this loop?

In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking there is one gigantic array which is divided up into 16 word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#define ARRAY_LEN 16
// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
uint64_t c0 = 0;
for (int i = 0; i < ARRAY_LEN; ++i) {
uint64_t t0;
__asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
c0 += t0;
}
return c0;
}
// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
clock_t t = clock();
for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
counts[i] = clz_array(arrays);
t = clock() - t;
// Convert clock time to milliseconds
return t * 1e3 / (double)CLOCKS_PER_SEC;
}
void print_stats(double t_ms, long n, double total_MiB) {
double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}
int main(int argc, char *argv[]) {
long n = 1 << 20;
if (argc > 1)
n = atol(argv[1]);
long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
uint64_t *buf = malloc(total_bytes);
uint32_t *counts = malloc(sizeof(uint32_t) * n);
double t_ms, total_MiB = total_bytes / (double)(1 << 20);
printf("Total size: %.1f MiB\n", total_MiB);
// Warm up
t_ms = clz_arrays(counts, buf, n);
//print_stats(t_ms, n, total_MiB); // (1)
// Run it
t_ms = clz_arrays(counts, buf, n); // (2)
print_stats(t_ms, n, total_MiB);
// Write something into buf
for (long i = 0; i < n*ARRAY_LEN; ++i)
buf[i] = i;
// And again...
(void) clz_arrays(counts, buf, n); // (3)
t_ms = clz_arrays(counts, buf, n); // (4)
print_stats(t_ms, n, total_MiB);
free(counts);
free(buf);
return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU # 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of 1 operation per second (see Agner Fog's Skylake instruction tables) so, with 8 byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GiB/s, which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GiB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 # 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!

The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this in; it's much worse on many-core Xeon chips than on quad-core desktop/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you were adding extra print_stats to show the warm-up run performance, that would have made the run after slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.

Measuring CPU time in c++

If I had the following code
clock_t t;
t = clock();
//algorithm
t = clock() - t;
t would equal the number of ticks to run the program. Is this the same is CPU time? Are there any other ways to measure CPU time in C++?
OS -- Debian GNU/Linux
I am open to anything that will work. I am wanting to compare the CPU time of two algorithms.

clock() is specified to measure CPU time however not all implementations do this. In particular Microsoft's implementation in VS does not count additional time when multiple threads are running, or count less time when the program's threads are sleeping/waiting.
Also note that clock() should measure the CPU time used by the entire program, so while CPU time used by multiple threads in //algorithm will be measured, other threads that are not part of //algorithm also get counted.
clock() is the only method specified in the standard to measure CPU time, however there are certainly other, platform specific, methods for measuring CPU time.
std::chrono does not include any clock for measuring CPU time. It only has a clock synchronized to the system time, a clock that advances at a steady rate with respect to real time, and a clock that is 'high resolution' but which does not necessarily measure CPU time.

How to mesure cpu-time used ?
#include <ctime>
std::clock_t c_start = std::clock();
// your_algorithm
std::clock_t c_end = std::clock();
long_double time_elapsed_ms = 1000.0 * (c_end-c_start) / CLOCKS_PER_SEC;
std::cout << "CPU time used: " << time_elapsed_ms << " ms\n";
Of course, if you display time in seconds :
std::cout << "CPU time used: " << time_elapsed_ms / 1000.0 << " s\n";
Source : http://en.cppreference.com/w/cpp/chrono/c/clock

A non-standard way to do this using pthreads is:
#include <pthread.h>
#include <time.h>
timespec cpu_time()
{
thread_local bool initialized(false);
thread_local clockid_t clock_id;
if (!initialized)
{
pthread_getcpuclockid(pthread_self(), &clock_id);
initialized = true;
}
timespec result;
clock_gettime(clock_id, &result);
return result;
}
Addition: A more standard way is:
#include <sys/resource.h>
#include <stdio.h>
int main(int argc, char* argv[])
{
struct rusage r;
getrusage(RUSAGE_SELF, &r);
printf("CPU usage: %d.%06d\n", r.ru_utime.tv_sec, r.ru_utime.tv_usec);
}

C++ has a chrono library. See http://en.cppreference.com/w/cpp/chrono. There are also often platform dependent ways to get at high resolution timers (obviously varies by platform).

c++ thread overhead

I'm playing around with threads in C++, in particular using them to parallelize a map operation.
Here's the code:
#include <thread>
#include <iostream>
#include <cstdlib>
#include <vector>
#include <math.h>
#include <stdio.h>
double multByTwo(double x){
return x*2;
}
double doJunk(double x){
return cos(pow(sin(x*2),3));
}
template <typename T>
void map(T* data, int n, T (*ptr)(T)){
for (int i=0; i<n; i++)
data[i] = (*ptr)(data[i]);
}
template <typename T>
void parallelMap(T* data, int n, T (*ptr)(T)){
int NUMCORES = 3;
std::vector<std::thread> threads;
for (int i=0; i<NUMCORES; i++)
threads.push_back(std::thread(&map<T>, data + i*n/NUMCORES, n/NUMCORES, ptr));
for (std::thread& t : threads)
t.join();
}
int main()
{
int n = 1000000000;
double* nums = new double[n];
for (int i=0; i<n; i++)
nums[i] = i;
std::cout<<"go"<<std::endl;
clock_t c1 = clock();
struct timespec start, finish;
double elapsed;
clock_gettime(CLOCK_MONOTONIC, &start);
// also try with &doJunk
//parallelMap(nums, n, &multByTwo);
map(nums, n, &doJunk);
std::cout << nums[342] << std::endl;
clock_gettime(CLOCK_MONOTONIC, &finish);
printf("CPU elapsed time is %f seconds\n", double(clock()-c1)/CLOCKS_PER_SEC);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
printf("Actual elapsed time is %f seconds\n", elapsed);
}
With multByTwo the parallel version is actually slightly slower (1.01 seconds versus .95 real time), and with doJunk its faster (51 versus 136 real time). This implies to me that
the parallelization is working, and
there is a REALLY large overhead with declaring
new threads. Any thoughts as to why the overhead is so large, and how I can avoid it?

Just a guess: what you're likely seeing is that the multByTwo code is so fast that you're achieving memory saturation. The code will never run any faster no matter how much processor power you throw at it, because it's already going as fast as it can get the bits to and from RAM.

You did not specify the hardware that you test your program nor the compiler version and the operating system. I did try your code on our four-socket Intel Xeon systems under 64-bit Scientific Linux with g++ 4.7 compiled from source.
First on an older Xeon X7350 system I got the following timings:
multByTwo with map
CPU elapsed time is 6.690000 seconds
Actual elapsed time is 6.691940 seconds
multByTwo with parallelMap on 3 cores
CPU elapsed time is 7.330000 seconds
Actual elapsed time is 2.480294 seconds
The parallel speedup is 2.7x.
doJunk with map
CPU elapsed time is 209.250000 seconds
Actual elapsed time is 209.289025 seconds
doJunk with parallelMap on 3 cores
CPU elapsed time is 220.770000 seconds
Actual elapsed time is 73.900960 seconds
The parallel speedup is 2.83x.
Note that X7350 is from the quite old pre-Nehalem "Tigerton" family with FSB bus and a shared memory controller located in the north bridge. This is a pure SMP system with no NUMA effects.
Then I run your code on a four-socket Intel X7550. These are Nehalem ("Beckton") Xeons with memory controller integrated into the CPU and hence a 4-node NUMA system. Threads running on one socket and accessing memory located on another socket will run somewhat slower. The same is also true for a serial process that might get migrated to another socket by some stupid scheduler decision. Binding in such a system is very important as you may see from the timings:
multByTwo with map
CPU elapsed time is 4.270000 seconds
Actual elapsed time is 4.264875 seconds
multByTwo with map bound to NUMA node 0
CPU elapsed time is 4.160000 seconds
Actual elapsed time is 4.160180 seconds
multByTwo with map bound to NUMA node 0 and CPU socket 1
CPU elapsed time is 5.910000 seconds
Actual elapsed time is 5.912319 seconds
mutlByTwo with parallelMap on 3 cores
CPU elapsed time is 7.530000 seconds
Actual elapsed time is 3.696616 seconds
Parallel speedup is only 1.13x (relative to the fastest node-bound serial execution). Now with binding:
multByTwo with parallelMap on 3 cores bound to NUMA node 0
CPU elapsed time is 4.630000 seconds
Actual elapsed time is 1.548102 seconds
Parallel speedup is 2.69x - as much as for the Tigerton CPUs.
multByTwo with parallelMap on 3 cores bound to NUMA node 0 and CPU socket 1
CPU elapsed time is 5.190000 seconds
Actual elapsed time is 1.760623 seconds
Parallel speedup is 2.36x - 88% of the previous case.
(I was too impatient to wait for the doJunk code to finish on the relatively slower Nehalems but I would expect somewhat better performance as was in Tigerton case)
There is one caveat with NUMA binding though. If you force e.g. binding to NUMA node 0 with numactl --cpubind=0 --membind=0 ./program this will limit memory allocation to this node only and on your particular system the memory attached to CPU 0 might not be enough and a run-time failure will most likely occur.
As you can see there are factors, other than the overhead from creating threads, that can significantly influence your code execution time. Also on very fast systems the overhead can be too high compared to the computational work done by each thread. That's why when asking questions concerning parallel performance, one should always include as much details as possible about the hardware and the environment used to measure the performance.

Multiple threads can only do more work in less time on a multi-core machine.
Other wise they are just taking turns in a Round-Robin fashion.

Spawning new threads can be an expensive operation depending on the platform. The easiest way to avoid this overhead is to spawn a few threads at the launch of the program and have some sort of job queue. I believe std::async will do this for you.

c++ get milliseconds since some date

I need some way in c++ to keep track of the number of milliseconds since program execution. And I need the precision to be in milliseconds. (In my googling, I've found lots of folks that said to include time.h and then multiply the output of time() by 1000 ... this won't work.)

clock has been suggested a number of times. This has two problems. First of all, it often doesn't have a resolution even close to a millisecond (10-20 ms is probably more common). Second, some implementations of it (e.g., Unix and similar) return CPU time, while others (E.g., Windows) return wall time.
You haven't really said whether you want wall time or CPU time, which makes it hard to give a really good answer. On Windows, you could use GetProcessTimes. That will give you the kernel and user CPU times directly. It will also tell you when the process was created, so if you want milliseconds of wall time since process creation, you can subtract the process creation time from the current time (GetSystemTime). QueryPerformanceCounter has also been mentioned. This has a few oddities of its own -- for example, in some implementations it retrieves time from the CPUs cycle counter, so its frequency varies when/if the CPU speed changes. Other implementations read from the motherboard's 1.024 MHz timer, which does not vary with the CPU speed (and the conditions under which each are used aren't entirely obvious).
On Unix, you can use GetTimeOfDay to just get the wall time with (at least the possibility of) relatively high precision. If you want time for a process, you can use times or getrusage (the latter is newer and gives more complete information that may also be more precise).
Bottom line: as I said in my comment, there's no way to get what you want portably. Since you haven't said whether you want CPU time or wall time, even for a specific system, there's not one right answer. The one you've "accepted" (clock()) has the virtue of being available on essentially any system, but what it returns also varies just about the most widely.

See std::clock()

Include time.h, and then use the clock() function. It returns the number of clock ticks elapsed since the program was launched. Just divide it by "CLOCKS_PER_SEC" to obtain the number of seconds, you can then multiply by 1000 to obtain the number of milliseconds.

Some cross platform solution. This code was used for some kind of benchmarking:
#ifdef WIN32
LARGE_INTEGER g_llFrequency = {0};
BOOL g_bQueryResult = QueryPerformanceFrequency(&g_llFrequency);
#endif
//...
long long osQueryPerfomance()
{
#ifdef WIN32
LARGE_INTEGER llPerf = {0};
QueryPerformanceCounter(&llPerf);
return llPerf.QuadPart * 1000ll / ( g_llFrequency.QuadPart / 1000ll);
#else
struct timeval stTimeVal;
gettimeofday(&stTimeVal, NULL);
return stTimeVal.tv_sec * 1000000ll + stTimeVal.tv_usec;
#endif
}

The most portable way is using the clock function.It usually reports the time that your program has been using the processor, or an approximation thereof. Note however the following:
The resolution is not very good for GNU systems. That's really a pity.
Take care of casting everything to double before doing divisions and assignations.
The counter is held as a 32 bit number in GNU 32 bits, which can be pretty annoying for long-running programs.
There are alternatives using "wall time" which give better resolution, both in Windows and Linux. But as the libc manual states: If you're trying to optimize your program or measure its efficiency, it's very useful to know how much processor time it uses. For that, calendar time and elapsed times are useless because a process may spend time waiting for I/O or for other processes to use the CPU.

Here is a C++0x solution and an example why clock() might not do what you think it does.
#include <chrono>
#include <iostream>
#include <cstdlib>
#include <ctime>
int main()
{
auto start1 = std::chrono::monotonic_clock::now();
auto start2 = std::clock();
sleep(1);
for( int i=0; i<100000000; ++i);
auto end1 = std::chrono::monotonic_clock::now();
auto end2 = std::clock();
auto delta1 = end1-start1;
auto delta2 = end2-start2;
std::cout << "chrono: " << std::chrono::duration_cast<std::chrono::duration<float>>(delta1).count() << std::endl;
std::cout << "clock: " << static_cast<float>(delta2)/CLOCKS_PER_SEC << std::endl;
}
On my system this outputs:
chrono: 1.36839
clock: 0.36
You'll notice the clock() method is missing a second. An astute observer might also notice that clock() looks to have less resolution. On my system it's ticking by in 12 millisecond increments, terrible resolution.
If you are unable or unwilling to use C++0x, take a look at Boost.DateTime's ptime microsec_clock::universal_time().

This isn't C++ specific (nor portable), but you can do:
SYSTEMTIME systemDT;
In Windows.
From there, you can access each member of the systemDT struct.
You can record the time when the program started and compare the current time to the recorded time (systemDT versus systemDTtemp, for instance).
To refresh, you can call GetLocalTime(&systemDT);
To access each member, you would do systemDT.wHour, systemDT.wMinute, systemDT.wMilliseconds.
To get more information on SYSTEMTIME.

Do you want wall clock time, CPU time, or some other measurement? Also, what platform is this? There is no universally portable way to get more precision than time() and clock() give you, but...
on most Unix systems, you can use gettimeofday() and/or clock_gettime(), which give at least microsecond precision and access to a variety of timers;
I'm not nearly as familiar with Windows, but one of these functions probably does what you want.

You can try this code (get from StockFish chess engine source code (GPL)):
#include <iostream>
#include <stdio>
#if !defined(_WIN32) && !defined(_WIN64) // Linux - Unix
# include <sys/time.h>
typedef timeval sys_time_t;
inline void system_time(sys_time_t* t) {
gettimeofday(t, NULL);
}
inline long long time_to_msec(const sys_time_t& t) {
return t.tv_sec * 1000LL + t.tv_usec / 1000;
}
#else // Windows and MinGW
# include <sys/timeb.h>
typedef _timeb sys_time_t;
inline void system_time(sys_time_t* t) { _ftime(t); }
inline long long time_to_msec(const sys_time_t& t) {
return t.time * 1000LL + t.millitm;
}
#endif
struct Time {
void restart() { system_time(&t); }
uint64_t msec() const { return time_to_msec(t); }
long long elapsed() const {
return long long(current_time().msec() - time_to_msec(t));
}
static Time current_time() { Time t; t.restart(); return t; }
private:
sys_time_t t;
};
int main() {
sys_time_t t;
system_time(&t);
long long currentTimeMs = time_to_msec(t);
std::cout << "currentTimeMs:" << currentTimeMs << std::endl;
Time time = Time::current_time();
for (int i = 0; i < 1000000; i++) {
//Do something
}
long long e = time.elapsed();
std::cout << "time elapsed:" << e << std::endl;
getchar(); // wait for keyboard input
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js