System call cost - c++

I'm currently studying operating system overheads, specifically the cost of making a system call, and I've written a simple C++ program to measure it.
#include <iostream>
#include <unistd.h>
#include <sys/time.h>

uint64_t
rdtscp(void) {
    uint32_t eax, edx;
    __asm__ __volatile__("rdtscp"                 //! rdtscp instruction
                         : "+a" (eax), "=d" (edx) //! output
                         :                        //! input
                         : "%ecx");               //! registers
    return (((uint64_t)edx << 32) | eax);
}

int main(void) {
    uint64_t before;
    uint64_t after;
    struct timeval t;
    for (unsigned int i = 0; i < 10; ++i) {
        before = rdtscp();
        gettimeofday(&t, NULL);
        after = rdtscp();
        std::cout << after - before << std::endl;
        std::cout << t.tv_usec << std::endl;
    }
    return 0;
}
This program is quite straightforward.
The rdtscp function is just a wrapper around the RDTSCP instruction (a processor instruction which loads the 64-bit cycle count into two 32-bit registers). This function is used to take the timings.
I iterate 10 times. At each iteration I call gettimeofday and determine the time it took to execute (as a number of CPU cycles).
The results are quite unexpected:
8984
64008
376
64049
164
64053
160
64056
160
64060
160
64063
164
64067
160
64070
160
64073
160
64077
Odd lines in the output are the number of cycles needed to execute the system call. Even lines are the value contained in t.tv_usec (which is set by gettimeofday, the system call that I'm studying).
I don't really understand how that is possible: the number of cycles drastically decreases, from nearly 10,000 to around 150! Yet the timeval struct is still updated at each call!
I've tried on different operating systems (Debian and macOS) and the results are similar.
Even if caching is involved, I don't see how this is possible: making a system call should require a switch from user to kernel mode, and we still need to read the clock in order to update the time.
Does anyone have an idea?

The answer: try another system call. There are vsyscalls/vDSO on Linux, and they accelerate certain system calls:
What are vdso and vsyscall?
The short version: the system call is not actually performed; instead the kernel maps a region of memory into the process where it can read the time information directly. The cost? Not much, since there is no switch into kernel mode.
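To see the vDSO effect directly on Linux, one hedged experiment is to time an ordinary gettimeofday() call against the same call forced through the real syscall path via syscall(2). This is only a sketch (the cycle counts are machine-dependent), and the rdtscp wrapper is the same idea as in the question:

#include <cstdint>
#include <cstdio>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

static uint64_t rdtscp() {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi) : : "%ecx");
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main() {
    timeval t;
    for (int i = 0; i < 5; ++i) {
        uint64_t a = rdtscp();
        gettimeofday(&t, NULL);              // usually served from the vDSO, no kernel entry
        uint64_t b = rdtscp();
        syscall(SYS_gettimeofday, &t, NULL); // forces a real transition into the kernel
        uint64_t c = rdtscp();
        std::printf("vDSO path: %llu cycles, forced syscall: %llu cycles\n",
                    (unsigned long long)(b - a), (unsigned long long)(c - b));
    }
    return 0;
}

On a machine where gettimeofday() goes through the vDSO, the second column should stay in the thousands of cycles while the first drops to a few hundred, matching the pattern in the question.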

Related

How can I obtain consistently high throughput in this loop?

In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking there is one gigantic array which is divided up into 16 word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#define ARRAY_LEN 16
// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
uint64_t c0 = 0;
for (int i = 0; i < ARRAY_LEN; ++i) {
uint64_t t0;
__asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
c0 += t0;
}
return c0;
}
// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
clock_t t = clock();
for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
counts[i] = clz_array(arrays);
t = clock() - t;
// Convert clock time to milliseconds
return t * 1e3 / (double)CLOCKS_PER_SEC;
}
void print_stats(double t_ms, long n, double total_MiB) {
double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}
int main(int argc, char *argv[]) {
long n = 1 << 20;
if (argc > 1)
n = atol(argv[1]);
long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
uint64_t *buf = malloc(total_bytes);
uint32_t *counts = malloc(sizeof(uint32_t) * n);
double t_ms, total_MiB = total_bytes / (double)(1 << 20);
printf("Total size: %.1f MiB\n", total_MiB);
// Warm up
t_ms = clz_arrays(counts, buf, n);
//print_stats(t_ms, n, total_MiB); // (1)
// Run it
t_ms = clz_arrays(counts, buf, n); // (2)
print_stats(t_ms, n, total_MiB);
// Write something into buf
for (long i = 0; i < n*ARRAY_LEN; ++i)
buf[i] = i;
// And again...
(void) clz_arrays(counts, buf, n); // (3)
t_ms = clz_arrays(counts, buf, n); // (4)
print_stats(t_ms, n, total_MiB);
free(counts);
free(buf);
return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of one operation per cycle (see Agner Fog's Skylake instruction tables), so, with 8 byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GiB/s, which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GiB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this; it's much worse on many-core Xeon chips than on quad-core desktops/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. Adding an extra print_stats to show the warm-up run's performance would therefore make the run after it slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
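One hedged way to do that, assuming it is added to the clz-timing.c from the question (which already defines clz_array, ARRAY_LEN and includes <time.h>), is to wrap the sweep in an extra repeat loop; repeats is a hypothetical new parameter, not something from the original code:

// Sketch only: time `repeats` sweeps over the same buffer so page faults and
// page walks are amortized across the timed region.
double clz_arrays_repeated(uint32_t *counts, const uint64_t *arrays,
                           int narrays, int repeats) {
    clock_t t = clock();
    for (int r = 0; r < repeats; ++r) {
        const uint64_t *p = arrays;
        for (int i = 0; i < narrays; ++i, p += ARRAY_LEN)
            counts[i] = clz_array(p);
    }
    t = clock() - t;
    // Milliseconds per sweep, averaged over all repeats.
    return t * 1e3 / (double)CLOCKS_PER_SEC / repeats;
}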
clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.

c++ pass value by reference vs by copy of POD

I heard that passing a variable by reference is not always faster than passing by value. Passing by reference is faster for big variables, but for small ones the question is trickier.
Passing by value requires time to create a copy, but accessing a local value should be faster.
Passing by reference does not waste time creating a copy, but there is an extra indirection through the pointer to reach the data.
I am aware that this detail is not so important for optimization, but it was interesting for me to measure it (I know that -O0 is passé for optimization, but this code is too simple; after optimization I was not sure what I was measuring).
g++ -std=c++14 -O0 -g3 -DSIZE_OF_DATA_ARRAY=16 main.cpp && ./a.out
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
SIZE_OF_DATA_ARRAY | copy time [s] | reference time [s]
                 4 | 0.04          | 0.045
                 8 | 0.04          | 0.46
                16 | 0.04          | 0.05
                17 | 0.07          | 0.05
                24 | 0.07          | 0.05
My questions:
Why is the execution time for copying roughly constant regardless of the struct size?
Why is there a threshold between 16 and 17 for copying?
My guess: it is connected with the cache.
My code:
#include <iostream>
#include <vector>
#include <limits>
#include <chrono>
#include <iomanip>
#include <vector>
#include <algorithm>
struct Data {
double x[SIZE_OF_DATA_ARRAY];
};
double workOnData(Data &data) {
for (auto i = 0; i < 10; ++i) {
data.x[0] -= 0.5 * (data.x[0] - 1);
}
return data.x[0];
}
void runTestSuite() {
auto queries = 1000000;
Data data;
for (auto i = 0; i < queries; ++i) {
data.x[0] = i;
auto val = workOnData(data);
if (val == -357)
data.x[0] = 1;
}
}
int main() {
std::cout << "sizeof(Data) = " << sizeof(Data) << "\n";
size_t numberOfTests = 99;
std::vector<std::chrono::duration<double>> timeMeasurements{numberOfTests};
std::chrono::time_point<std::chrono::system_clock> startTime, endTime;
for (auto i = 0; i < numberOfTests; ++i) {
startTime = std::chrono::system_clock::now();
runTestSuite();
endTime = std::chrono::system_clock::now();
timeMeasurements[i] = endTime - startTime;
}
std::sort(timeMeasurements.begin(), timeMeasurements.end());
std::chrono::system_clock::time_point now = std::chrono::system_clock::now();
std::time_t now_c = std::chrono::system_clock::to_time_t(now);
std::cout << std::put_time(std::localtime(&now_c), "%F %T")
<< ": median time = " << timeMeasurements[numberOfTests * 0.5].count() << "s\n";
return 0;
}
Why is the execution time for copying roughly constant regardless of the struct size?
The best way to understand this is to look at the assembly the compiler emitted. What optimizations apply here depends on the compiler's optimization settings and whether you build a release or debug configuration.
Also depends on the processor. For example, some processors may have specialized instructions for copying large blocks of memory. Other processors may copy data in parallel chunks, depending on the size of the structure. Also, some platforms may have hardware assistance, such as a DMA controller.
Alas, sometimes unrolling may be faster than using special instructions or hardware assistance (depends on the data size).
Why is there a threshold between 16 and 17 for copying?
The threshold may be between alignment boundaries and non-alignment.
Let's take a 32-bit processor. It likes to access (fetch) 4 bytes at a time. Accessing 24 bytes requires 6 fetches, while accessing 16 bytes takes 4 fetches.
However, accessing 17, 18, or 19 bytes requires 5 fetches: it may fetch another 4 bytes to get those remaining bytes.
Another scenario is the implementation of the copy function. Some copy functions use 32-bit copies for the first set of 4-byte quantities, then switch to byte copying for the remainder. Others may switch to byte copying for all the bytes, depending on the size of the data. Many possibilities.
The truth for your system lies in debugging the assembly language or function for copying the data.
Cache Hits & Misses
Your performance metrics may be skewed by processor cache operations. If the processor has your data in cache, the loop will be a lot faster. Usually there will be a performance hit for the first access of the data. There may be more time wasted reloading the data cache if your data is too big for the cache or lies outside the domain of the data cache size.
Instruction Cache Issues
A lot of processors have large pipelines and caches for instructions and data. When they encounter a branch (such as the end of a for loop), the processor may have to reset the instruction cache and reload from another location in the program. This takes time. You can demonstrate this by unrolling the loop into different size chunks and measuring the performance.
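If you want to try that demonstration, here is a hedged sketch (not part of the original question) that times a plain summation loop against a manually 4x-unrolled version of the same loop with <chrono>; the exact numbers depend heavily on compiler settings and hardware:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v(1 << 22, 1.0);   // size is a multiple of 4
    auto time_it = [&](const char *label, auto &&fn) {
        auto t0 = std::chrono::steady_clock::now();
        double s = fn();
        auto t1 = std::chrono::steady_clock::now();
        std::cout << label << ": sum=" << s << " in "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms\n";
    };
    time_it("rolled", [&] {
        double s = 0;
        for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
        return s;
    });
    time_it("unrolled x4", [&] {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (std::size_t i = 0; i < v.size(); i += 4) {
            s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
        }
        return s0 + s1 + s2 + s3;
    });
    return 0;
}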

How to Detect the Number of Physical Processors / Cores on Windows, Mac and Linux

I have a multi threaded c++ application that runs on Windows, Mac and a few Linux flavors.
To make a long story short: in order for it to run at maximum efficiency, I have to be able to instantiate a single thread per physical processor/core. Creating more threads than there are physical processors/cores degrades the performance of my program considerably. I can already detect the number of logical processors/cores correctly on all three of these platforms. To be able to detect the number of physical processors/cores correctly, I'll have to detect whether hyper-threading is supported AND active.
My question therefore is if there is a way to detect whether Hyper Threading is supported and enabled? If so, how exactly.
EDIT: This is no longer 100% correct due to Intel's ongoing befuddlement.
The way I understand the question is that you are asking how to detect the number of CPU cores vs. CPU threads which is different from detecting the number of logical and physical cores in a system. CPU cores are often not considered physical cores by the OS unless they have their own package or die. So an OS will report that a Core 2 Duo, for example, has 1 physical and 2 logical CPUs and an Intel P4 with hyper-threads will be reported exactly the same way even though 2 hyper-threads vs. 2 CPU cores is a very different thing performance wise.
I struggled with this until I pieced together the solution below, which I believe works for both AMD and Intel processors. As far as I know, and I could be wrong, AMD does not yet have CPU threads but they have provided a way to detect them that I assume will work on future AMD processors which may have CPU threads.
In short here are the steps using the CPUID instruction:
Detect CPU vendor using CPUID function 0
Check for HTT bit 28 in CPU features EDX from CPUID function 1
Get the logical core count from EBX[23:16] from CPUID function 1
Get actual non-threaded CPU core count
If vendor == 'GenuineIntel' this is 1 plus EAX[31:26] from CPUID function 4
If vendor == 'AuthenticAMD' this is 1 plus ECX[7:0] from CPUID function 0x80000008
Sounds difficult but here is a, hopefully, platform independent C++ program that does the trick:
#include <iostream>
#include <string>
using namespace std;
void cpuID(unsigned i, unsigned regs[4]) {
#ifdef _WIN32
__cpuid((int *)regs, (int)i);
#else
asm volatile
("cpuid" : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
: "a" (i), "c" (0));
// ECX is set to zero for CPUID function 4
#endif
}
int main(int argc, char *argv[]) {
unsigned regs[4];
// Get vendor
char vendor[12];
cpuID(0, regs);
((unsigned *)vendor)[0] = regs[1]; // EBX
((unsigned *)vendor)[1] = regs[3]; // EDX
((unsigned *)vendor)[2] = regs[2]; // ECX
string cpuVendor = string(vendor, 12);
// Get CPU features
cpuID(1, regs);
unsigned cpuFeatures = regs[3]; // EDX
// Logical core count per CPU
cpuID(1, regs);
unsigned logical = (regs[1] >> 16) & 0xff; // EBX[23:16]
cout << " logical cpus: " << logical << endl;
unsigned cores = logical;
if (cpuVendor == "GenuineIntel") {
// Get DCP cache info
cpuID(4, regs);
cores = ((regs[0] >> 26) & 0x3f) + 1; // EAX[31:26] + 1
} else if (cpuVendor == "AuthenticAMD") {
// Get NC: Number of CPU cores - 1
cpuID(0x80000008, regs);
cores = ((unsigned)(regs[2] & 0xff)) + 1; // ECX[7:0] + 1
}
cout << " cpu cores: " << cores << endl;
// Detect hyper-threads
bool hyperThreads = cpuFeatures & (1 << 28) && cores < logical;
cout << "hyper-threads: " << (hyperThreads ? "true" : "false") << endl;
return 0;
}
I haven't actually tested this on Windows or OS X yet, but it should work as the CPUID instruction is valid on i686 machines. Obviously, this won't work on PowerPC, but then PowerPC doesn't have hyper-threads either.
Here is the output on a few different Intel machines:
Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz:
logical cpus: 2
cpu cores: 2
hyper-threads: false
Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz:
logical cpus: 4
cpu cores: 4
hyper-threads: false
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (w/ x2 physical CPU packages):
logical cpus: 16
cpu cores: 8
hyper-threads: true
Intel(R) Pentium(R) 4 CPU 3.00GHz:
logical cpus: 2
cpu cores: 1
hyper-threads: true
Note: this does not give the number of physical cores as intended, but the number of logical cores.
If you can use C++11 (thanks to alfC's comment beneath):
#include <iostream>
#include <thread>
int main() {
std::cout << std::thread::hardware_concurrency() << std::endl;
return 0;
}
Otherwise maybe the Boost library is an option for you. Same code but different include as above. Include <boost/thread.hpp> instead of <thread>.
A Windows-only solution is described here:
GetLogicalProcessorInformation
For Linux, use the /proc/cpuinfo file. I am not running Linux right now so I can't give you more detail. You can count physical/logical processor instances; if the logical count is twice the physical count, then you have HT enabled (true only for x86).
The current highest voted answer using CPUID appears to be obsolete. It reports both the wrong number of logical and physical processors. This appears to be confirmed from this answer cpuid-on-intel-i7-processors.
Specifically, using CPUID.1.EBX[23:16] to get the logical processors or CPUID.4.EAX[31:26]+1 to get the physical ones with Intel processors does not give the correct result on any Intel processor I have.
For Intel, CPUID leaf 0Bh should be used (see Intel's documentation on thread/core and cache topology). The solution does not appear to be trivial. For AMD a different solution is necessary.
Here is source code by Intel which reports the correct number of physical and logical cores as well as the correct number of sockets: https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/. I tested this on an 80 logical core, 40 physical core, 4 socket Intel system.
Here is source code for AMD: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid/. It gave the correct result on my single socket Intel system but not on my four socket system. I don't have an AMD system to test.
I have not dissected the source code yet to find a simple answer (if one exists) with CPUID. It seems that if the solution can change (as it seems to have) that the best solution is to use a library or OS call.
Edit:
Here is a solution for Intel processors with CPUID leaf 11 (Bh). The way to do this is to loop over the logical processors, get the x2APIC ID of each logical processor from CPUID, and count the number of x2APIC IDs where the least significant bit is zero. For systems without hyper-threading the x2APIC ID will always be even. For systems with hyper-threading each x2APIC ID will have an even and an odd version.
// input: eax = functionnumber, ecx = 0
// output: eax = output[0], ebx = output[1], ecx = output[2], edx = output[3]
//static inline void cpuid (int output[4], int functionnumber)
int getNumCores(void) {
//Assuming an Intel processor with CPUID leaf 11
int cores = 0;
#pragma omp parallel reduction(+:cores)
{
int regs[4];
cpuid(regs,11);
if(!(regs[3]&1)) cores++;
}
return cores;
}
The threads must be bound for this to work. OpenMP by default does not bind threads. Setting export OMP_PROC_BIND=true will bind them or they can be bound in code as shown at thread-affinity-with-windows-msvc-and-openmp.
I tested this on my 4 core/8 HT system and it returned 4 both with hyper-threading enabled and with it disabled in the BIOS. I also tested it on a 4 socket system with each socket having 10 cores / 20 HT and it returned 40 cores.
AMD processors or older Intel processors without CPUID leaf 11 have to do something different.
Gathering ideas and concepts from some of the answers above, I have come up with this solution. Please critique.
//EDIT INCLUDES
#ifdef _WIN32
#include <windows.h>
#elif MACOS
#include <sys/param.h>
#include <sys/sysctl.h>
#else
#include <unistd.h>
#endif
For almost every OS, the standard "Get core count" feature returns the logical core count. But in order to get the physical core count, we must first detect if the CPU has hyper threading or not.
uint32_t registers[4];
unsigned logicalcpucount;
unsigned physicalcpucount;
#ifdef _WIN32
SYSTEM_INFO systeminfo;
GetSystemInfo( &systeminfo );
logicalcpucount = systeminfo.dwNumberOfProcessors;
#else
logicalcpucount = sysconf( _SC_NPROCESSORS_ONLN );
#endif
We now have the logical core count. In order to get the intended result, we first must check whether hyper-threading is being used or even available.
__asm__ __volatile__ ("cpuid " :
"=a" (registers[0]),
"=b" (registers[1]),
"=c" (registers[2]),
"=d" (registers[3])
: "a" (1), "c" (0));
unsigned CPUFeatureSet = registers[3];
bool hyperthreading = CPUFeatureSet & (1 << 28);
There is no Intel CPU with hyper-threading that will hyper-thread only some of its cores (at least not from what I have read), which lets us detect this in a really painless way: if hyper-threading is available, the logical processor count will be exactly double the physical count. Otherwise, the operating system detects one logical processor per core, meaning the logical and physical core counts will be identical.
if (hyperthreading){
physicalcpucount = logicalcpucount / 2;
} else {
physicalcpucount = logicalcpucount;
}
fprintf (stdout, "LOGICAL: %i\n", logicalcpucount);
fprintf (stdout, "PHYSICAL: %i\n", physicalcpucount);
To follow on from math's answer, as of boost 1.56 there exists the physical_concurrency attribute which does exactly what you want.
From the documentation - http://www.boost.org/doc/libs/1_56_0/doc/html/thread/thread_management.html#thread.thread_management.thread.physical_concurrency
The number of physical cores available on the current system. In contrast to hardware_concurrency() it does not return the number of virtual cores, but it counts only physical cores.
So an example would be
#include <iostream>
#include <boost/thread.hpp>
int main()
{
std::cout << boost::thread::physical_concurrency();
return 0;
}
I know this is an old thread, but no one mentioned hwloc. The hwloc library is available on most Linux distributions and can also be compiled on Windows. The following code returns the number of physical processors (4 in the case of an i7 CPU).
#include <hwloc.h>
int nPhysicalProcessorCount = 0;
hwloc_topology_t sTopology;
if (hwloc_topology_init(&sTopology) == 0 &&
hwloc_topology_load(sTopology) == 0)
{
nPhysicalProcessorCount =
hwloc_get_nbobjs_by_type(sTopology, HWLOC_OBJ_CORE);
hwloc_topology_destroy(sTopology);
}
if (nPhysicalProcessorCount < 1)
{
#ifdef _OPENMP
nPhysicalProcessorCount = omp_get_num_procs();
#else
nPhysicalProcessorCount = 1;
#endif
}
It is not sufficient to test if an Intel CPU has hyperthreading, you also need to test if hyperthreading is enabled or disabled. There is no documented way to check this. An Intel guy came up with this trick to check if hyperthreading is enabled: Check the number of programmable performance counters using CPUID[0xa].eax[15:8] and assume that if the value is 8, HT is disabled, and if the value is 4, HT is enabled (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/831551).
There is no problem on AMD chips: The CPUID reports 1 or 2 threads per core depending on whether simultaneous multithreading is disabled or enabled.
You also have to compare the thread count from the CPUID with the thread count reported by the operating system to see if there are multiple CPU chips.
I have made a function that implements all of this. It reports both the number of physical processors and the number of logical processors. I have tested it on Intel and AMD processors in Windows and Linux. It should work on Mac as well. I have published this code at
https://github.com/vectorclass/add-on/tree/master/physical_processors
On OS X, you can read these values from sysctl(3) (the C API, or the command line utility of the same name). The man page should give you usage information. The following keys may be of interest:
$ sysctl hw
hw.ncpu: 24
hw.activecpu: 24
hw.physicalcpu: 12 <-- number of cores
hw.physicalcpu_max: 12
hw.logicalcpu: 24 <-- number of cores including hyper-threaded cores
hw.logicalcpu_max: 24
hw.packages: 2 <-- number of CPU packages
hw.ncpu = 24
hw.availcpu = 24
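For the C API route mentioned above, here is a hedged sketch using sysctlbyname with the hw.physicalcpu and hw.logicalcpu keys listed above (error handling kept minimal):

#include <cstdio>
#include <sys/types.h>
#include <sys/sysctl.h>

int main() {
    int physical = 0, logical = 0;
    size_t len = sizeof(physical);
    if (sysctlbyname("hw.physicalcpu", &physical, &len, NULL, 0) != 0)
        std::perror("hw.physicalcpu");
    len = sizeof(logical);
    if (sysctlbyname("hw.logicalcpu", &logical, &len, NULL, 0) != 0)
        std::perror("hw.logicalcpu");
    std::printf("physical: %d, logical: %d\n", physical, logical);
    return 0;
}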
On Windows, there are GetLogicalProcessorInformation and GetLogicalProcessorInformationEx, available since Windows XP SP3 and Windows 7 respectively. The difference is that GetLogicalProcessorInformation doesn't support setups with more than 64 logical cores, which might matter for server setups, but you can always fall back to it if you're on XP. Example usage for GetLogicalProcessorInformationEx (source):
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
BOOL rc;
DWORD length = 0;
DWORD offset = 0;
DWORD ncpus = 0;
DWORD prev_processor_info_size = 0;
for (;;) {
rc = psutil_GetLogicalProcessorInformationEx(
RelationAll, buffer, &length);
if (rc == FALSE) {
if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
if (buffer) {
free(buffer);
}
buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(length);
if (NULL == buffer) {
return NULL;
}
}
else {
goto return_none;
}
}
else {
break;
}
}
ptr = buffer;
while (offset < length) {
// Advance ptr by the size of the previous
// SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX struct.
ptr = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*)\
(((char*)ptr) + prev_processor_info_size);
if (ptr->Relationship == RelationProcessorCore) {
ncpus += 1;
}
// When offset == length, we've reached the last processor
// info struct in the buffer.
offset += ptr->Size;
prev_processor_info_size = ptr->Size;
}
free(buffer);
if (ncpus != 0) {
return ncpus;
}
else {
return NULL;
}
return_none:
if (buffer != NULL)
free(buffer);
return NULL;
On Linux, parsing /proc/cpuinfo might help.
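As a hedged illustration of that, the sketch below counts unique (physical id, core id) pairs in /proc/cpuinfo to approximate the physical core count; these field names vary by kernel and architecture (they are missing on some ARM systems, for example), so treat it as a starting point rather than a portable solution:

#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <utility>

int main() {
    std::ifstream in("/proc/cpuinfo");
    std::set<std::pair<int, int>> cores;  // unique (physical id, core id) pairs
    std::string line;
    int physical_id = -1, core_id = -1, logical = 0;
    while (std::getline(in, line)) {
        if (line.rfind("processor", 0) == 0)
            ++logical;
        else if (line.rfind("physical id", 0) == 0)
            physical_id = std::stoi(line.substr(line.find(':') + 1));
        else if (line.rfind("core id", 0) == 0)
            core_id = std::stoi(line.substr(line.find(':') + 1));
        if (physical_id >= 0 && core_id >= 0) {
            cores.insert(std::make_pair(physical_id, core_id));
            physical_id = core_id = -1;
        }
    }
    std::cout << "logical: " << logical
              << ", physical cores: " << cores.size() << "\n";
    return 0;
}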
I don't know that all three expose the information in the same way, but if you can safely assume that the NT kernel will report device information according to the POSIX standard (which NT supposedly has support for), then you could work off that standard.
However, differences in device management are often cited as one of the stumbling blocks to cross-platform development. I would at best implement this as three strands of logic; I wouldn't try to write one piece of code to handle all platforms evenly.
OK, all that assumes C++. For ASM, I presume you'll only be running on x86 or amd64 CPUs? You'll still need two branch paths, one for each architecture, and you'll need to test Intel separately from AMD (IIRC), but by and large you just check for the CPUID. Is that what you're trying to find? The CPUID from ASM on Intel/AMD family CPUs?
OpenMP should do the trick:
// test.cpp
#include <omp.h>
#include <iostream>
using namespace std;
int main(int argc, char** argv) {
int nThreads = omp_get_max_threads();
cout << "Can run as many as: " << nThreads << " threads." << endl;
}
Most compilers support OpenMP. If you are using a GCC-based compiler (*nix, macOS), you need to compile using:
$ g++ -fopenmp -o test.o test.cpp
(you might also need to tell your compiler to use the stdc++ library):
$ g++ -fopenmp -o test.o -lstdc++ test.cpp
As far as I know OpenMP was designed to solve this kind of problems.
This is very easy to do in Python:
$ python -c "import psutil; print(psutil.cpu_count(logical=False))"
4
Maybe you could look at the psutil source code to see what is going on?
You may use the library libcpuid (Also on GitHub - libcpuid).
As can be seen in its documentation page:
#include <stdio.h>
#include <libcpuid.h>
int main(void)
{
    struct cpu_raw_data_t raw; // contains the raw CPUID data
    struct cpu_id_t data;      // contains the decoded CPU information
    if (!cpuid_present()) { // check for CPUID presence
        printf("Sorry, your CPU doesn't support CPUID!\n");
        return -1;
    }
    if (cpuid_get_raw_data(&raw) < 0) { // obtain the raw CPUID data
        printf("Sorry, cannot get the CPUID raw data.\n");
        printf("Error: %s\n", cpuid_error()); // cpuid_error() gives the last error description
        return -2;
    }
    if (cpu_identify(&raw, &data) < 0) { // identify the CPU, using the given raw data.
        printf("Sorry, CPU identification failed.\n");
        printf("Error: %s\n", cpuid_error());
        return -3;
    }
    printf("Found: %s CPU\n", data.vendor_str); // print out the vendor string (e.g. `GenuineIntel')
    printf("Processor model is `%s'\n", data.cpu_codename); // print out the CPU code name (e.g. `Pentium 4 (Northwood)')
    printf("The full brand string is `%s'\n", data.brand_str); // print out the CPU brand string
    printf("The processor has %dK L1 cache and %dK L2 cache\n",
           data.l1_data_cache, data.l2_cache); // print out cache size information
    printf("The processor has %d cores and %d logical processors\n",
           data.num_cores, data.num_logical_cpus); // print out CPU cores information
    return 0;
}
As can be seen, data.num_cores holds the number of physical cores of the CPU.

c++ get milliseconds since some date

I need some way in c++ to keep track of the number of milliseconds since program execution. And I need the precision to be in milliseconds. (In my googling, I've found lots of folks that said to include time.h and then multiply the output of time() by 1000 ... this won't work.)
clock has been suggested a number of times. This has two problems. First of all, it often doesn't have a resolution even close to a millisecond (10-20 ms is probably more common). Second, some implementations of it (e.g., Unix and similar) return CPU time, while others (E.g., Windows) return wall time.
You haven't really said whether you want wall time or CPU time, which makes it hard to give a really good answer. On Windows, you could use GetProcessTimes. That will give you the kernel and user CPU times directly. It will also tell you when the process was created, so if you want milliseconds of wall time since process creation, you can subtract the process creation time from the current time (GetSystemTime). QueryPerformanceCounter has also been mentioned. This has a few oddities of its own -- for example, in some implementations it retrieves time from the CPU's cycle counter, so its frequency varies when/if the CPU speed changes. Other implementations read from the motherboard's 1.024 MHz timer, which does not vary with the CPU speed (and the conditions under which each is used aren't entirely obvious).
On Unix, you can use gettimeofday to get the wall time with (at least the possibility of) relatively high precision. If you want times for a process, you can use times or getrusage (the latter is newer and gives more complete information that may also be more precise).
Bottom line: as I said in my comment, there's no way to get what you want portably. Since you haven't said whether you want CPU time or wall time, even for a specific system, there's not one right answer. The one you've "accepted" (clock()) has the virtue of being available on essentially any system, but what it returns also varies just about the most widely.
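For the GetProcessTimes route mentioned above, here is a minimal hedged sketch; FILETIME values are in 100-nanosecond units, so dividing by 10,000 gives milliseconds:

#include <windows.h>
#include <iostream>

// Convert a FILETIME duration (100 ns units) to milliseconds.
static double filetime_to_ms(const FILETIME &ft) {
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 10000.0;
}

int main() {
    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime,
                        &kernelTime, &userTime)) {
        std::cout << "kernel CPU: " << filetime_to_ms(kernelTime) << " ms, "
                  << "user CPU: " << filetime_to_ms(userTime) << " ms\n";
    }
    return 0;
}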
See std::clock()
Include time.h, and then use the clock() function. It returns the number of clock ticks elapsed since the program was launched. Just divide it by "CLOCKS_PER_SEC" to obtain the number of seconds, you can then multiply by 1000 to obtain the number of milliseconds.
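A minimal sketch of that recipe (the busy loop is just a stand-in for whatever you want to time):

#include <ctime>
#include <iostream>

int main() {
    std::clock_t start = std::clock();
    for (volatile long i = 0; i < 100000000; ++i) {} // work being measured
    double ms = 1000.0 * (std::clock() - start) / CLOCKS_PER_SEC;
    std::cout << ms << " ms of CPU time (resolution limited by the clock() tick)\n";
    return 0;
}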
A cross-platform solution. This code was used for some kind of benchmarking:
#ifdef WIN32
LARGE_INTEGER g_llFrequency = {0};
BOOL g_bQueryResult = QueryPerformanceFrequency(&g_llFrequency);
#endif
//...
long long osQueryPerfomance()
{
#ifdef WIN32
LARGE_INTEGER llPerf = {0};
QueryPerformanceCounter(&llPerf);
return llPerf.QuadPart * 1000ll / ( g_llFrequency.QuadPart / 1000ll);
#else
struct timeval stTimeVal;
gettimeofday(&stTimeVal, NULL);
return stTimeVal.tv_sec * 1000000ll + stTimeVal.tv_usec;
#endif
}
The most portable way is using the clock function. It usually reports the time that your program has been using the processor, or an approximation thereof. Note however the following:
The resolution is not very good for GNU systems. That's really a pity.
Take care of casting everything to double before doing divisions and assignments.
The counter is held as a 32-bit number on 32-bit GNU systems, which can be pretty annoying for long-running programs.
There are alternatives using "wall time" which give better resolution, both in Windows and Linux. But as the libc manual states: If you're trying to optimize your program or measure its efficiency, it's very useful to know how much processor time it uses. For that, calendar time and elapsed times are useless because a process may spend time waiting for I/O or for other processes to use the CPU.
Here is a C++0x solution (std::chrono::monotonic_clock was renamed steady_clock in the final C++11 standard) and an example of why clock() might not do what you think it does.
#include <chrono>
#include <iostream>
#include <cstdlib>
#include <ctime>
int main()
{
auto start1 = std::chrono::monotonic_clock::now();
auto start2 = std::clock();
sleep(1);
for( int i=0; i<100000000; ++i);
auto end1 = std::chrono::monotonic_clock::now();
auto end2 = std::clock();
auto delta1 = end1-start1;
auto delta2 = end2-start2;
std::cout << "chrono: " << std::chrono::duration_cast<std::chrono::duration<float>>(delta1).count() << std::endl;
std::cout << "clock: " << static_cast<float>(delta2)/CLOCKS_PER_SEC << std::endl;
}
On my system this outputs:
chrono: 1.36839
clock: 0.36
You'll notice the clock() method is missing a second. An astute observer might also notice that clock() looks to have less resolution. On my system it's ticking by in 12 millisecond increments, terrible resolution.
If you are unable or unwilling to use C++0x, take a look at Boost.DateTime's ptime microsec_clock::universal_time().
This isn't C++ specific (nor portable), but you can do:
SYSTEMTIME systemDT;
In Windows.
From there, you can access each member of the systemDT struct.
You can record the time when the program started and compare the current time to the recorded time (systemDT versus systemDTtemp, for instance).
To refresh, you can call GetLocalTime(&systemDT);
To access each member, you would do systemDT.wHour, systemDT.wMinute, systemDT.wMilliseconds.
See the Windows documentation for more information on SYSTEMTIME.
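A hedged sketch of that approach follows; note that simply subtracting members ignores hour and day rollover, so it is only suitable for short intervals:

#include <windows.h>
#include <iostream>

int main() {
    SYSTEMTIME startDT, nowDT;
    GetLocalTime(&startDT);
    // ... work to measure ...
    GetLocalTime(&nowDT);
    int elapsedMs = (nowDT.wMinute - startDT.wMinute) * 60000
                  + (nowDT.wSecond - startDT.wSecond) * 1000
                  + (nowDT.wMilliseconds - startDT.wMilliseconds);
    std::cout << "elapsed: " << elapsedMs << " ms\n";
    return 0;
}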
Do you want wall clock time, CPU time, or some other measurement? Also, what platform is this? There is no universally portable way to get more precision than time() and clock() give you, but...
on most Unix systems, you can use gettimeofday() and/or clock_gettime(), which give at least microsecond precision and access to a variety of timers;
I'm not nearly as familiar with Windows, but one of these functions probably does what you want.
You can try this code (taken from the Stockfish chess engine source code, which is GPL):
#include <iostream>
#include <cstdint>
#if !defined(_WIN32) && !defined(_WIN64) // Linux - Unix
# include <sys/time.h>
typedef timeval sys_time_t;
inline void system_time(sys_time_t* t) {
gettimeofday(t, NULL);
}
inline long long time_to_msec(const sys_time_t& t) {
return t.tv_sec * 1000LL + t.tv_usec / 1000;
}
#else // Windows and MinGW
# include <sys/timeb.h>
typedef _timeb sys_time_t;
inline void system_time(sys_time_t* t) { _ftime(t); }
inline long long time_to_msec(const sys_time_t& t) {
return t.time * 1000LL + t.millitm;
}
#endif
struct Time {
void restart() { system_time(&t); }
uint64_t msec() const { return time_to_msec(t); }
long long elapsed() const {
return (long long)(current_time().msec() - time_to_msec(t));
}
static Time current_time() { Time t; t.restart(); return t; }
private:
sys_time_t t;
};
int main() {
sys_time_t t;
system_time(&t);
long long currentTimeMs = time_to_msec(t);
std::cout << "currentTimeMs:" << currentTimeMs << std::endl;
Time time = Time::current_time();
for (int i = 0; i < 1000000; i++) {
//Do something
}
long long e = time.elapsed();
std::cout << "time elapsed:" << e << std::endl;
getchar(); // wait for keyboard input
}

Timer function to provide time in nano seconds using C++

I wish to calculate the time it took for an API to return a value.
The time taken for such an action is on the order of nanoseconds. As the API is a C++ class/function, I am using ctime to measure it:
#include <ctime>
#include <iostream>
using namespace std;
int main(int argc, char** argv) {
clock_t start;
double diff;
start = clock();
diff = ( std::clock() - start ) / (double)CLOCKS_PER_SEC;
cout<<"printf: "<< diff <<'\n';
return 0;
}
The above code gives the time in seconds. How do I get the same in nano seconds and with more precision?
What others have posted about running the function repeatedly in a loop is correct.
For Linux (and BSD) you want to use clock_gettime().
#include <time.h> // clock_gettime is declared in <time.h>; older glibc also needs -lrt
int main()
{
timespec ts;
// clock_gettime(CLOCK_MONOTONIC, &ts); // Works on FreeBSD
clock_gettime(CLOCK_REALTIME, &ts); // Works on Linux
}
For Windows you want to use QueryPerformanceCounter. And here is more on QPC.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally some dual-core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install the AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).
EDIT 2013/07/16:
It looks like there is some controversy on the efficacy of QPC under certain circumstances as stated in http://msdn.microsoft.com/en-us/library/windows/desktop/ee417693(v=vs.85).aspx
...While QueryPerformanceCounter and QueryPerformanceFrequency typically adjust for
multiple processors, bugs in the BIOS or drivers may result in these routines returning
different values as the thread moves from one processor to another...
However this StackOverflow answer https://stackoverflow.com/a/4588605/34329 states that QPC should work fine on any MS OS after Win XP service pack 2.
This article shows that Windows 7 can determine if the processor(s) have an invariant TSC and falls back to an external timer if they don't. http://performancebydesign.blogspot.com/2012/03/high-resolution-clocks-and-timers-for.html Synchronizing across processors is still an issue.
Other fine reading related to timers:
https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks
http://lwn.net/Articles/209101/
http://performancebydesign.blogspot.com/2012/03/high-resolution-clocks-and-timers-for.html
QueryPerformanceCounter Status?
See the comments for more details.
This new answer uses C++11's <chrono> facility. While there are other answers that show how to use <chrono>, none of them shows how to use <chrono> with the RDTSC facility mentioned in several of the other answers here. So I thought I would show how to use RDTSC with <chrono>. Additionally I'll demonstrate how you can templatize the testing code on the clock, so that you can rapidly switch between RDTSC and your system's built-in clock facilities (which will likely be based on clock(), clock_gettime() and/or QueryPerformanceCounter).
Note that the RDTSC instruction is x86-specific. QueryPerformanceCounter is Windows only. And clock_gettime() is POSIX only. Below I introduce two new clocks: std::chrono::high_resolution_clock and std::chrono::system_clock, which, if you can assume C++11, are now cross-platform.
First, here is how you create a C++11-compatible clock out of the Intel rdtsc assembly instruction. I'll call it x::clock:
#include <chrono>
namespace x
{
struct clock
{
typedef unsigned long long rep;
typedef std::ratio<1, 2'800'000'000> period; // My machine is 2.8 GHz
typedef std::chrono::duration<rep, period> duration;
typedef std::chrono::time_point<clock> time_point;
static const bool is_steady = true;
static time_point now() noexcept
{
unsigned lo, hi;
asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
return time_point(duration(static_cast<rep>(hi) << 32 | lo));
}
};
} // x
All this clock does is count CPU cycles and store it in an unsigned 64-bit integer. You may need to tweak the assembly language syntax for your compiler. Or your compiler may offer an intrinsic you can use instead (e.g. now() {return __rdtsc();}).
To build a clock you have to give it the representation (storage type). You must also supply the clock period, which must be a compile time constant, even though your machine may change clock speed in different power modes. And from those you can easily define your clock's "native" time duration and time point in terms of these fundamentals.
If all you want to do is output the number of clock ticks, it doesn't really matter what number you give for the clock period. This constant only comes into play if you want to convert the number of clock ticks into some real-time unit such as nanoseconds. And in that case, the more accurate you are able to supply the clock speed, the more accurate will be the conversion to nanoseconds, (milliseconds, whatever).
Below is example code which shows how to use x::clock. Actually I've templated the code on the clock as I'd like to show how you can use many different clocks with the exact same syntax. This particular test is showing what the looping overhead is when running what you want to time under a loop:
#include <iostream>
template <class clock>
void
test_empty_loop()
{
// Define real time units
typedef std::chrono::duration<unsigned long long, std::pico> picoseconds;
// or:
// typedef std::chrono::nanoseconds nanoseconds;
// Define double-based unit of clock tick
typedef std::chrono::duration<double, typename clock::period> Cycle;
using std::chrono::duration_cast;
const int N = 100000000;
// Do it
auto t0 = clock::now();
for (int j = 0; j < N; ++j)
asm volatile("");
auto t1 = clock::now();
// Get the clock ticks per iteration
auto ticks_per_iter = Cycle(t1-t0)/N;
std::cout << ticks_per_iter.count() << " clock ticks per iteration\n";
// Convert to real time units
std::cout << duration_cast<picoseconds>(ticks_per_iter).count()
<< "ps per iteration\n";
}
The first thing this code does is create a "real time" unit to display the results in. I've chosen picoseconds, but you can choose any units you like, either integral or floating point based. As an example there is a pre-made std::chrono::nanoseconds unit I could have used.
As another example I want to print out the average number of clock cycles per iteration as a floating point, so I create another duration, based on double, that has the same units as the clock's tick does (called Cycle in the code).
The loop is timed with calls to clock::now() on either side. If you want to name the type returned from this function it is:
typename clock::time_point t0 = clock::now();
(as clearly shown in the x::clock example, and is also true of the system-supplied clocks).
To get a duration in terms of floating point clock ticks one merely subtracts the two time points, and to get the per iteration value, divide that duration by the number of iterations.
You can get the count in any duration by using the count() member function. This returns the internal representation. Finally I use std::chrono::duration_cast to convert the duration Cycle to the duration picoseconds and print that out.
To use this code is simple:
int main()
{
std::cout << "\nUsing rdtsc:\n";
test_empty_loop<x::clock>();
std::cout << "\nUsing std::chrono::high_resolution_clock:\n";
test_empty_loop<std::chrono::high_resolution_clock>();
std::cout << "\nUsing std::chrono::system_clock:\n";
test_empty_loop<std::chrono::system_clock>();
}
Above I exercise the test using our home-made x::clock, and compare those results with using two of the system-supplied clocks: std::chrono::high_resolution_clock and std::chrono::system_clock. For me this prints out:
Using rdtsc:
1.72632 clock ticks per iteration
616ps per iteration
Using std::chrono::high_resolution_clock:
0.620105 clock ticks per iteration
620ps per iteration
Using std::chrono::system_clock:
0.00062457 clock ticks per iteration
624ps per iteration
This shows that each of these clocks has a different tick period, as the ticks per iteration is vastly different for each clock. However when converted to a known unit of time (e.g. picoseconds), I get approximately the same result for each clock (your mileage may vary).
Note how my code is completely free of "magic conversion constants". Indeed, there are only two magic numbers in the entire example:
The clock speed of my machine in order to define x::clock.
The number of iterations to test over. If changing this number makes your results vary greatly, then you should probably make the number of iterations higher, or empty your computer of competing processes while testing.
At that level of accuracy, it would be better to reason in CPU ticks rather than in system calls like clock(). And do not forget that if it takes more than one nanosecond to execute an instruction... having nanosecond accuracy is pretty much impossible.
Still, something like that is a start:
Here's the actual code to retrieve the number of 80x86 CPU clock ticks elapsed since the CPU was last started. It will work on Pentium and above (386/486 not supported). This code is actually MS Visual C++ specific, but can probably be ported very easily to anything else, as long as it supports inline assembly.
inline __int64 GetCpuClocks()
{
// Counter
struct { __int32 low, high; } counter;
// Use RDTSC instruction to get clocks count
__asm push EAX
__asm push EDX
__asm __emit 0fh __asm __emit 031h // RDTSC
__asm mov counter.low, EAX
__asm mov counter.high, EDX
__asm pop EDX
__asm pop EAX
// Return result
return *(__int64 *)(&counter);
}
This function also has the advantage of being extremely fast; it usually takes no more than 50 CPU cycles to execute.
Using the Timing Figures:
If you need to translate the clock counts into true elapsed time, divide the results by your chip's clock speed. Remember that the "rated" GHz is likely to be slightly different from the actual speed of your chip. To check your chip's true speed, you can use several very good utilities or the Win32 call, QueryPerformanceFrequency().
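As a hedged illustration of that conversion (the tick count and the 3.5 GHz figure below are made-up values; substitute your own measured ticks and your chip's actual clock speed):

#include <cstdint>
#include <iostream>

int main() {
    const uint64_t ticks = 3500000000ULL; // assumed difference of two RDTSC readings
    const double cpu_hz = 3.5e9;          // assumed clock speed of the chip
    const double seconds = ticks / cpu_hz;
    std::cout << seconds << " s, i.e. " << seconds * 1e9 << " ns elapsed\n";
    return 0;
}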
To do this correctly you can use one of two ways, either go with RDTSC or with clock_gettime().
The second is about 2 times faster and has the advantage of giving the right absolute time. Note that for RDTSC to work correctly you need to use it as indicated (other comments on this page have errors, and may yield incorrect timing values on certain processors)
inline uint64_t rdtsc()
{
uint32_t lo, hi;
__asm__ __volatile__ (
"xorl %%eax, %%eax\n"
"cpuid\n"
"rdtsc\n"
: "=a" (lo), "=d" (hi)
:
: "%ebx", "%ecx" );
return (uint64_t)hi << 32 | lo;
}
and for clock_gettime: (I chose microsecond resolution arbitrarily)
#include <time.h>
#include <sys/timeb.h>
// needs -lrt (real-time lib)
// 1970-01-01 epoch UTC time, 1 mcs resolution (divide by 1M to get time_t)
uint64_t ClockGetTime()
{
timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
return (uint64_t)ts.tv_sec * 1000000LL + (uint64_t)ts.tv_nsec / 1000LL;
}
the timing and values produced:
Absolute values:
rdtsc = 4571567254267600
clock_gettime = 1278605535506855
Processing time: (10000000 runs)
rdtsc = 2292547353
clock_gettime = 1031119636
I am using the following to get the desired results:
#include <time.h>
#include <iostream>
using namespace std;
int main (int argc, char** argv)
{
// reset the clock
timespec tS;
tS.tv_sec = 0;
tS.tv_nsec = 0;
clock_settime(CLOCK_PROCESS_CPUTIME_ID, &tS);
...
... <code to check for the time to be put here>
...
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tS);
cout << "Time taken is: " << tS.tv_sec << " " << tS.tv_nsec << endl;
return 0;
}
For C++11, here is a simple wrapper:
#include <iostream>
#include <chrono>
class Timer
{
public:
Timer() : beg_(clock_::now()) {}
void reset() { beg_ = clock_::now(); }
double elapsed() const {
return std::chrono::duration_cast<second_>
(clock_::now() - beg_).count(); }
private:
typedef std::chrono::high_resolution_clock clock_;
typedef std::chrono::duration<double, std::ratio<1> > second_;
std::chrono::time_point<clock_> beg_;
};
Or for C++03 on *nix,
class Timer
{
public:
Timer() { clock_gettime(CLOCK_REALTIME, &beg_); }
double elapsed() {
clock_gettime(CLOCK_REALTIME, &end_);
return end_.tv_sec - beg_.tv_sec +
(end_.tv_nsec - beg_.tv_nsec) / 1000000000.;
}
void reset() { clock_gettime(CLOCK_REALTIME, &beg_); }
private:
timespec beg_, end_;
};
Example of usage:
int main()
{
Timer tmr;
double t = tmr.elapsed();
std::cout << t << std::endl;
tmr.reset();
t = tmr.elapsed();
std::cout << t << std::endl;
return 0;
}
From https://gist.github.com/gongzhitaao/7062087
In general, for timing how long it takes to call a function, you want to do it many more times than just once. If you call your function only once and it takes a very short time to run, you still have the overhead of actually calling the timer functions and you don't know how long that takes.
For example, if you estimate your function might take 800 ns to run, call it in a loop ten million times (which will then take about 8 seconds). Divide the total time by ten million to get the time per call.
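A hedged sketch of that pattern with <chrono> (work() is a hypothetical stand-in for the API call being measured, and the sink variable keeps the loop from being optimized away):

#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical function under test.
static uint64_t work(uint64_t x) { return x * 2654435761ULL + 1; }

int main() {
    const long iterations = 10000000;
    uint64_t sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        sink += work(i);
    auto t1 = std::chrono::steady_clock::now();
    auto total_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::cout << "total: " << total_ns << " ns, per call: "
              << (double)total_ns / iterations << " ns (sink=" << sink << ")\n";
    return 0;
}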
You can use the following function with gcc running under x86 processors:
unsigned long long rdtsc()
{
    // Read the time-stamp counter into EDX:EAX and combine the two halves.
    unsigned int low, high;
    __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
    return ((unsigned long long)high << 32) | low;
}
with Digital Mars C++:
unsigned long long rdtsc()
{
_asm
{
rdtsc
}
}
which reads the high performance timer on the chip. I use this when doing profiling.
If you need subsecond precision, you need to use system-specific extensions, and will have to check with the documentation for the operating system. POSIX supports up to microseconds with gettimeofday, but nothing more precise since computers didn't have frequencies above 1GHz.
If you are using Boost, you can check boost::posix_time.
I'm using Borland; here is the code. ti_hund sometimes gives me a negative number, but the timing is fairly good.
#include <dos.h>
#include <stdio.h>
#include <conio.h>
void main()
{
struct time t;
int Hour,Min,Sec,Hun;
gettime(&t);
Hour=t.ti_hour;
Min=t.ti_min;
Sec=t.ti_sec;
Hun=t.ti_hund;
printf("Start time is: %2d:%02d:%02d.%02d\n",
t.ti_hour, t.ti_min, t.ti_sec, t.ti_hund);
....
your code to time
...
// read the time here remove Hours and min if the time is in sec
gettime(&t);
printf("\nTid Hour:%d Min:%d Sec:%d Hundreds:%d\n",t.ti_hour-Hour,
t.ti_min-Min,t.ti_sec-Sec,t.ti_hund-Hun);
printf("\n\nAlt Ferdig Press a Key\n\n");
getch();
} // end main
Using Brock Adams's method, with a simple class:
#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int get_cpu_ticks()
{
LARGE_INTEGER ticks;
QueryPerformanceFrequency(&ticks);
return ticks.LowPart;
}
__int64 get_cpu_clocks()
{
struct { __int32 low, high; } counter;
__asm push EAX
__asm cpuid
__asm push EDX
__asm rdtsc
__asm mov counter.low, EAX
__asm mov counter.high, EDX
__asm pop EDX
__asm pop EAX
return *(__int64 *)(&counter);
}
class cbench
{
public:
cbench(const char *desc_in)
: desc(strdup(desc_in)), start(get_cpu_clocks()) { }
~cbench()
{
printf("%s took: %.4f ms\n", desc, (float)(get_cpu_clocks()-start)/get_cpu_ticks());
if(desc) free(desc);
}
private:
char *desc;
__int64 start;
};
Usage Example:
int main()
{
{
cbench c("test");
... code ...
}
return 0;
}
Result:
test took: 0.0002 ms
It has some function-call overhead, but should still be more than fast enough :)
You can use Embedded Profiler (free for Windows and Linux) which has an interface to a multiplatform timer (in a processor cycle count) and can give you a number of cycles per seconds:
EProfilerTimer timer;
timer.Start();
... // Your code here
const uint64_t number_of_elapsed_cycles = timer.Stop();
const uint64_t nano_seconds_elapsed =
number_of_elapsed_cycles / (double) timer.GetCyclesPerSecond() * 1000000000;
Recalculation of cycle count to time is possibly a dangerous operation with modern processors where CPU frequency can be changed dynamically. Therefore to be sure that converted times are correct, it is necessary to fix processor frequency before profiling.
If this is for Linux, I've been using the function "gettimeofday", which returns a struct that gives the seconds and microseconds since the Epoch. You can then use timersub to subtract the two to get the difference in time, and convert it to whatever precision of time you want. However, you specify nanoseconds, and it looks like the function clock_gettime() is what you're looking for. It puts the time in terms of seconds and nanoseconds into the structure you pass into it.
What do you think about that:
int iceu_system_GetTimeNow(long long int *res)
{
static struct timespec buffer;
//
#ifdef __CYGWIN__
if (clock_gettime(CLOCK_REALTIME, &buffer))
return 1;
#else
if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &buffer))
return 1;
#endif
*res=(long long int)buffer.tv_sec * 1000000000LL + (long long int)buffer.tv_nsec;
return 0;
}
Here is a nice Boost timer that works well:
//Stopwatch.hpp
#ifndef STOPWATCH_HPP
#define STOPWATCH_HPP
//Boost
#include <boost/chrono.hpp>
//Std
#include <cstdint>
class Stopwatch
{
public:
Stopwatch();
virtual ~Stopwatch();
void Restart();
std::uint64_t Get_elapsed_ns();
std::uint64_t Get_elapsed_us();
std::uint64_t Get_elapsed_ms();
std::uint64_t Get_elapsed_s();
private:
boost::chrono::high_resolution_clock::time_point _start_time;
};
#endif // STOPWATCH_HPP
//Stopwatch.cpp
#include "Stopwatch.hpp"
Stopwatch::Stopwatch():
_start_time(boost::chrono::high_resolution_clock::now()) {}
Stopwatch::~Stopwatch() {}
void Stopwatch::Restart()
{
_start_time = boost::chrono::high_resolution_clock::now();
}
std::uint64_t Stopwatch::Get_elapsed_ns()
{
boost::chrono::nanoseconds nano_s = boost::chrono::duration_cast<boost::chrono::nanoseconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(nano_s.count());
}
std::uint64_t Stopwatch::Get_elapsed_us()
{
boost::chrono::microseconds micro_s = boost::chrono::duration_cast<boost::chrono::microseconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(micro_s.count());
}
std::uint64_t Stopwatch::Get_elapsed_ms()
{
boost::chrono::milliseconds milli_s = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(milli_s.count());
}
std::uint64_t Stopwatch::Get_elapsed_s()
{
boost::chrono::seconds sec = boost::chrono::duration_cast<boost::chrono::seconds>(boost::chrono::high_resolution_clock::now() - _start_time);
return static_cast<std::uint64_t>(sec.count());
}
Minimalistic copy&paste-struct + lazy usage
If the idea is to have a minimalistic struct that you can use for quick tests, then I suggest you just copy and paste anywhere in your C++ file right after the #include's. This is the only instance in which I sacrifice Allman-style formatting.
You can easily adjust the precision in the first line of the struct. Possible values are: nanoseconds, microseconds, milliseconds, seconds, minutes, or hours.
#include <chrono>
#include <iostream>
#include <vector>
struct MeasureTime
{
using precision = std::chrono::microseconds;
std::vector<std::chrono::steady_clock::time_point> times;
std::chrono::steady_clock::time_point oneLast;
void p() {
std::cout << "Mark "
<< times.size()/2
<< ": "
<< std::chrono::duration_cast<precision>(times.back() - oneLast).count()
<< std::endl;
}
void m() {
oneLast = times.back();
times.push_back(std::chrono::steady_clock::now());
}
void t() {
m();
p();
m();
}
MeasureTime() {
times.push_back(std::chrono::steady_clock::now());
}
};
Usage
MeasureTime m; // first time is already in memory
doFnc1();
m.t(); // Mark 1: next time, and print difference with previous mark
doFnc2();
m.t(); // Mark 2: next time, and print difference with previous mark
doStuff = doMoreStuff();
andDoItAgain = doStuff.aoeuaoeu();
m.t(); // prints 'Mark 3: 123123' etc...
Standard output result
Mark 1: 123
Mark 2: 32
Mark 3: 433234
If you want summary after execution
If you want the report afterwards, because for example your code in between also writes to standard output. Then add the following function to the struct (just before MeasureTime()):
void s() { // summary
int i = 0;
std::chrono::steady_clock::time_point tprev;
for(auto tcur : times)
{
if(i > 0)
{
std::cout << "Mark " << i << ": "
<< std::chrono::duration_cast<precision>(tcur - tprev).count()
<< std::endl;
}
tprev = tcur;
++i;
}
}
So then you can just use:
MeasureTime m;
doFnc1();
m.m();
doFnc2();
m.m();
doStuff = doMoreStuff();
andDoItAgain = doStuff.aoeuaoeu();
m.m();
m.s();
Which will list all the marks just like before, but then after the other code is executed. Note that you shouldn't use both m.s() and m.t().
plf::nanotimer is a lightweight option for this, works in Windows, Linux, Mac and BSD etc. Has ~microsecond accuracy depending on OS:
#include "plf_nanotimer.h"
#include <iostream>
int main(int argc, char** argv)
{
plf::nanotimer timer;
timer.start();
// Do something here
double results = timer.get_elapsed_ns();
std::cout << "Timing: " << results << " nanoseconds." << std::endl;
return 0;
}