Is there an equivalent instruction to rdtsc in ARM? - c++

For my project I must use inline assembly instructions such as rdtsc to calculate the execution time of some C/C++ instructions.
The following code seems to work on Intel but not on ARM processors:
{ unsigned a, d; asm volatile("rdtsc" : "=a" (a), "=d" (d)); t0 = ((unsigned long)a) | (((unsigned long)d) << 32); }
// The C++ statement to measure its execution time
{ unsigned a, d; asm volatile("rdtsc" : "=a" (a), "=d" (d)); t1 = ((unsigned long)a) | (((unsigned long)d) << 32); }
time = t1-t0;
My question is:
How do I write inline assembly similar to the above (to measure the elapsed execution time of an instruction) that works on ARM processors?

You should read the PMCCNTR register of a co-processor p15 (not an actual co-processor, just an entry point for CPU functions) to obtain a cycle count. Note that it is available to an unprivileged app only if:
Unprivileged PMCCNTR reads are allowed:
Bit 0 of PMUSERENR register must be set to 1 (official docs)
PMCCNTR is actually counting cycles:
Bit 31 of PMCNTENSET register must be set to 1 (official docs)
This is a real-world example of how it's done.
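For ARMv7, the read itself is a single MRC instruction. A minimal sketch, assuming the kernel has already set the PMUSERENR/PMCNTENSET bits described above (for example via a small module like the one linked):
#include <cstdint>

// Read the ARMv7 cycle counter (PMCCNTR) from user space.
// Only works if user-space access has been enabled as described above.
static inline uint32_t read_pmccntr(void)
{
    uint32_t value;
    // PMCCNTR: co-processor p15, opc1 = 0, CRn = c9, CRm = c13, opc2 = 0
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (value));
    return value;
}
Note that PMCCNTR is only 32 bits wide (unless the 64-bit variant is configured), so it wraps much faster than the x86 TSC.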

For Arm64, the system register CNTVCT_EL0 can be used to retrieve the counter from
user space.
// SPDX-License-Identifier: GPL-2.0
u64 rdtsc(void)
{
u64 val;
/*
* According to ARM DDI 0487F.c, from Armv8.0 to Armv8.5 inclusive, the
* system counter is at least 56 bits wide; from Armv8.6, the counter
* must be 64 bits wide. So the system counter could be less than 64
* bits wide, in which case it is reported with the flag
* 'cap_user_time_short' set to true.
*/
asm volatile("mrs %0, cntvct_el0" : "=r" (val));
return val;
}
Please refer to this patch https://lore.kernel.org/patchwork/patch/1305380/ for more details.
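To turn the raw counter value into wall-clock time, the counter frequency can be read from CNTFRQ_EL0. A minimal sketch (the helper name ticks_to_ns is just for illustration, not from the patch above):
#include <cstdint>

// Counter frequency in Hz, as reported by the CPU.
static inline uint64_t cntfrq(void)
{
    uint64_t freq;
    asm volatile("mrs %0, cntfrq_el0" : "=r" (freq));
    return freq;
}

// Convert a CNTVCT_EL0 tick count to nanoseconds.
// (The multiplication may overflow for very long intervals.)
static inline uint64_t ticks_to_ns(uint64_t ticks)
{
    return ticks * 1000000000ull / cntfrq();
}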

Related

How to convert RDTSC Clock ticks to Real Time in C or C++?

This is the code I am using in C to convert RDTSC clock ticks to time in usec, along with the assembly code to read the TSC. To convert RDTSC clock ticks to time, I need to divide by the CPU clock frequency in GHz; the CPU frequency here is 0.963 GHz. I don't know where I am going wrong in the conversion.
static inline uint64_t RDTSC()
{
unsigned int hi, lo;
__asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
start = RDTSC();
end = RDTSC();
printf("\n-%.4f us-",((double)(end-start)/(double)0.963)/1000);

How to verify CPU cache line size with c++ code?

I read an article from Igor's blog. The article said:
... today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!
The article also provides C# code to verify the above conclusion:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1 (step = 1)
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2 (step = 16)
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The two for-loops take about the same time: 80 and 78 ms respectively on Igor's machine, so the cache-line mechanism is verified.
I then used the above idea to implement a C++ version to verify the cache line size, as follows:
#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <math.h>
using namespace std::chrono;
const int total_buff_count = 16;
const int buff_size = 32 * 1024 * 1024;
int testCacheHit(int * pBuffer, int size, int step)
{
int result = 0;
for (int i = 0; i < size;) {
result += pBuffer[i];
i += step;
}
return result;
}
int main()
{
int * pBuffer = new int[buff_size*total_buff_count];
for (int i = 0; i < total_buff_count; ++i) {
int step = (int)pow(2, i);
auto start = std::chrono::system_clock::now();
volatile int result = testCacheHit(pBuffer + buff_size*i, buff_size, step);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "step: " << step << ", elapsed time: " << elapsed_seconds.count() * 1000 << "ms\n";
}
delete[] pBuffer;
}
But my test result is totally different from the one in Igor's article. If the step is 1, the time cost is about 114 ms; if the step is 16, the time cost is about 78 ms. The test application is built in release configuration, there is 32 GB of memory on my machine, and the CPU is an Intel Xeon E5 2420 v2 at 2.2 GHz.
The interesting finding is that the time cost decreases significantly at step 2 and at step 2048. My question is: how can the gaps at step 2 and step 2048 in my test be explained? Why is my result totally different from Igor's result? Thanks.
My own explanation for the first question is that the time cost of the code contains two parts: one is "memory read/write", which covers the memory read/write cost, and the other is "other cost", which covers the loop and calculation cost. If the step is 2, the "memory read/write" cost stays almost the same (because of the cache line), but the calculation and loop cost is halved, so we see an obvious gap. And I guess the cache line on my CPU is 4096 bytes (1024 * 4 bytes) rather than 64 bytes, which is why we get another gap at step 2048. But it's just my guess. Any help from you guys is appreciated, thanks.
Drop between 1024 and 2048
Note that you are using an uninitialized array. This basically means that
int * pBuffer = new int[buff_size*total_buff_count];
does not cause your program to actually ask for any physical memory. Instead, just some virtual address space is reserved.
Then, when you first touch some array element, a page fault is triggered and the OS maps the page to physical memory. This is a relatively slow operation which may significantly influence your experiment. Since the page size on your system is likely 4 kB, it can hold 1024 4-byte integers. When you go for a step of 2048, only every second page is actually accessed, and the runtime drops proportionally.
You can avoid the negative effect of this mechanism by "touching" the memory in advance:
int * pBuffer = new int[buff_size*total_buff_count]{};
When I tried that, I got almost linear decrease of time between 64 and 8192 step sizes.
Drop between 1 and 2
A cache line size on your system is definitely not 2048 bytes, it's very likely 64 bytes (generally, it may have different values and even different values for different cache levels).
As for the first part, for step being 1, there are simply much more arithmetic operations involved (addition of array elements and increments of i).
Difference from Igor's experiment
We can only speculate about why Igor's experiment gave practically the same times in both cases. I would guess that the runtime of the arithmetic is negligible there, since there is only a single loop-counter increment involved and he writes into the array, which requires an additional transfer of cache lines back to memory. (We can say that the byte/op ratio is much higher than in your experiment.)
How to verify CPU cache line size with c++ code?
There is std::hardware_destructive_interference_size in C++17, which should provide the smallest cache line size. Note that it is a compile-time value, and the compiler relies on your input about which machine is targeted. When targeting an entire architecture, the number may be inaccurate.
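A minimal usage sketch, assuming a compiler and standard library that actually implement this C++17 feature (e.g. a recent MSVC, or GCC 12+):
#include <iostream>
#include <new>  // std::hardware_destructive_interference_size

int main()
{
    // Compile-time constant chosen by the compiler for the targeted CPU family.
    std::cout << "destructive interference (cache line) size: "
              << std::hardware_destructive_interference_size << " bytes\n";
}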
How to verify CPU cache line size with c++ code?
You cannot do so reliably.
And you should write portable C++ code. Read n3337.
Imagine that you did not enable compiler optimizations in your C++ compiler. And imagine that you run your program in some emulator (like these).
On Linux specifically, you could parse the /proc/cpuinfo pseudo file and get the CPU cache line size from it.
For example:
% head -20 /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 8
model name : AMD Ryzen Threadripper 2970WX 24-Core Processor
stepping : 2
microcode : 0x800820b
cpu MHz : 1776.031
cache size : 512 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
BTW, there are many different organizations and levels of caches.
You could imagine a C++ application on Linux parsing the output of /proc/cpuinfo and then making HTTP requests (using libcurl) to the Web to find out more about it.
See also this answer.
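If you only need Linux, two programmatic shortcuts exist besides parsing /proc/cpuinfo. A sketch, with the caveat that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and the sysfs file may be absent on some kernels:
#include <unistd.h>
#include <fstream>
#include <iostream>

int main()
{
    // glibc extension: L1 data cache line size in bytes (0 if unknown).
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::cout << "sysconf says: " << line << " bytes\n";

    // sysfs exposes the same information per cache level.
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size");
    int sysfs_line = 0;
    if (f >> sysfs_line)
        std::cout << "sysfs says: " << sysfs_line << " bytes\n";
}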

Get bits from byte

I have the following function:
int GetGroup(unsigned bitResult, int iStartPos, int iNumOfBites)
{
return (bitResult >> (iStartPos + 1- iNumOfBites)) & ~(~0 << iNumOfBites);
}
The function returns a group of bits from a byte.
i.e if bitResult=102 (01100110)2, iStartPos=5, iNumOfBites=3
Output: 2 (10)2
For iStartPos=7, iNumOfBites=4
Output: 3 (0110)2
I'm looking for a better / more "friendly" way to do that, e.g. with bitset or something like that. Any suggestions?
(src >> start) & ((1UL << len)-1) // or 1ULL << if you need a 64-bit mask
is one way to express extraction of len bits, starting at start. (In this case, start is the LSB of the range you want. Your function requires the MSB as input.) This expression is from Wikipedia's article on the x86 BMI1 instruction set extensions.
Both ways of producing the mask look risky in case len is the full width of the type, though (the corner case of extracting all the bits). Shifts by the full width of the type can produce either zero or the value unchanged. (It actually invokes undefined behaviour, but in practice one of those is what happens if the compiler can't see the count at compile time; x86, for example, masks the shift count down to the 0-31 range for 32-bit shifts.) With 32-bit ints:
If 1 << 32 produces 1, then 1-1 = 0, so the result will be zero.
If ~0 << 32 produces ~0, rather than 0, the mask will be zero.
Remember that 1<<len is undefined behaviour for too-large len, too: unlike writing it as 0x3ffffffffff or whatever, no automatic promotion to long long happens, so the type of the 1 matters.
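If you do need to handle the full-width case, one way is simply to special-case it. A sketch (not part of the original expression; helper name made up for illustration):
#include <climits>

// Mask with the low `len` bits set, safe for len == 0 .. 32 (no UB shifts).
static inline unsigned low_mask(unsigned len)
{
    return len >= sizeof(unsigned) * CHAR_BIT ? ~0u : (1u << len) - 1;
}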
I think from your examples you want the bits [iStartPos : iStartPos - iNumOfBites], where bits are numbered from zero.
The main thing I'd change in your function is the naming of the function and variables, and add a comment.
bitResult is the input to the function; don't use "result" in its name.
iStartPos ok, but a little verbose
iNumOfBites Computers have bits and bytes. If you're dealing with bites, you need a doctor (or a dentist).
Also, the return type should probably be unsigned.
// extract bits [msb : msb-len] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
return (input >> (msb-len + 1)) & ~(~0 << len);
}
If your start-position parameter was the lsb, rather than msb, the expression would be simpler, and the code would be smaller and faster (unless that just makes extra work for the caller). With LSB as a param, BitExtract is 7 instructions, vs. 9 if it's MSB (on x86-64, gcc 5.2).
There's also a machine instruction (introduced with Intel Haswell, and AMD Piledriver) that does this operation. You will get somewhat smaller and slightly faster code by using it. It also uses the LSB, len position convention, not MSB, so you get shorter code with LSB as an argument.
Intel CPUs only know the version that would require loading an immediate into a register first, so when the values are compile-time constants, it doesn't save much compared to simply shifting and masking. e.g. see this post about using it or pextr for RGB32 -> RGB16. And of course it doesn't matter whether the parameter is the MSB or LSB of the desired range, if start and len are both compile time constants.
Only AMD implements a version of bextr that can have the control mask as an immediate constant, but unfortunately it seems gcc 5.2 doesn't use the immediate version for code that uses the intrinsic (even with -march=bdver2 (i.e. bulldozer v2 aka piledriver). (It will generate bextr with an immediate argument on its own in some cases with -march=bdver2.)
I tested it out on godbolt to see what kind of code you'd get with or without bextr.
#include <immintrin.h>
// Intel ICC uses different intrinsics for bextr
// extract bits [msb : msb-len] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
#ifdef __BMI__ // probably also need to check for __GNUC__
return __builtin_ia32_bextr_u32(input, (len<<8) | (msb-len+1) );
#else
return (input >> (msb-len + 1)) & ~(~0 << len);
#endif
}
It would take an extra instruction (a movzx) to implement a (msb-len+1)&0xff safety check to avoid the start byte from spilling into the length byte. I left it out because it's nonsense to ask for a starting bit outside the 0-31 range, let alone the 0-255 range. Since it won't crash, just return some other nonsense result, there's not much point.
Anyway, bextr saves quite a few instructions (if BMI2 shlx / shrx isn't available either! -march=native on godbolt is Haswell, and thus includes BMI2 as well.)
But bextr on Intel CPUs decodes to 2 uops (http://agner.org/optimize/), so it's not very useful at all compared to shrx / and, except for saving some code size. pext is actually better for throughput (1 uop / 3c latency), even though it's a way more powerful instruction. It is worse for latency, though. And AMD CPUs run pext very slowly, but bextr as a single uop.
I would probably do something like the following in order to provide additional protections around errors in arguments and to reduce the amount of shifting.
I am not sure if I understood the meaning of the arguments you are using so this may require a bit of tweaking.
And I am not sure if this is necessarily any more efficient since there are a number of decisions and range checks made in the interests of safety.
/*
* Arguments: const unsigned bitResult byte containing the bit field to extract
* const int iStartPos zero based offset from the least significant bit
* const int iNumOfBites number of bits to the right of the starting position
*
* Description: Extract a bitfield beginning at the specified position for the specified
* number of bits returning the extracted bit field right justified.
*/
int GetGroup(const unsigned bitResult, const int iStartPos, const int iNumOfBites)
{
// masks to remove any leading bits that need to disappear.
// we change starting position to be one based so the first element is unused.
const static unsigned bitMasks[] = {0x01, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff};
int iStart = (iStartPos > 7) ? 8 : iStartPos + 1;
int iNum = (iNumOfBites > 8) ? 8 : iNumOfBites;
unsigned retVal = (bitResult & bitMasks[iStart]);
if (iStart > iNum) {
retVal >>= (iStart - iNum);
}
return retVal;
}
#pragma pack(push, 1)
struct Bit
{
union
{
uint8_t _value;
struct {
uint8_t _bit0:1;
uint8_t _bit1:1;
uint8_t _bit2:1;
uint8_t _bit3:1;
uint8_t _bit4:1;
uint8_t _bit5:1;
uint8_t _bit6:1;
uint8_t _bit7:1;
};
};
};
#pragma pack(pop)
typedef Bit bit;
struct B
{
union
{
uint32_t _value;
bit bytes[1]; // 1 for Single Byte
};
};
With a struct and union you can set struct B's _value to your result, then access bytes[0]._bit0 through bytes[0]._bit7 to read each 0 or 1, and vice versa: set each bit individually, and the combined result will be in _value.
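A hedged usage sketch, using the struct definitions above; note that bit-field layout and reading a union member other than the last one written are implementation-defined/unspecified, so this only works in practice on typical compilers, it is not guaranteed by the standard:
#include <cstdint>
#include <iostream>

int main()
{
    B b;
    b._value = 102;                 // 0b01100110
    // Read individual bits of the low byte (ordering is compiler-dependent).
    std::cout << int(b.bytes[0]._bit1) << int(b.bytes[0]._bit2) << "\n"; // typically "11"
    // Or go the other way: set bits and read back the combined value.
    b._value = 0;
    b.bytes[0]._bit0 = 1;
    b.bytes[0]._bit2 = 1;
    std::cout << b._value << "\n";  // typically 5
}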

Is the way I am using rdtsc: atomic

I am using an x64 machine with C++ code
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
where lo and hi are unsigned ints and store the two 32-bit halves of the clock counter. Will lo and hi be read atomically, or is it possible that, in the case where lo rolls over and increments hi, I get one of the values from before the roll-over and one from after, which would result in a count that is either 2^32 too low or 2^32 too high (depending on which one was captured first)?

what is the performance impact of using int64_t instead of int32_t on 32-bit systems?

Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year-2038 problem in some places. So I'm thinking about completely switching to a single Time class with an underlying int64_t value, to replace the time_t value in all places.
Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. IIUC the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to use a more differentiated way for dealing with time values, which might make the software more difficult to maintain.
What I'm interested in:
which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
does anyone have own experience about this performance impact?
I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems; but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).
which factors influence performance of these operations? Probably the
compiler and compiler version; but does the operating system or the
CPU make/model influence this as well?
Mostly the processor architecture (and model - please read "model" wherever I mention processor architecture in this section). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.
The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler which changes what the compiler does" in some cases - but that's probably a small effect).
Will a normal 32-bit system use the 64-bit registers of modern CPUs?
This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system; the extra 32 bits of the registers are completely invisible, just as they would be if the system were actually a "true 32-bit system".
which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
Addition and subtraction are worse, as these have to be done as a sequence of two operations, and the second operation requires the first to have completed - this is not the case if the compiler is just producing two add operations on independent data.
Multiplication will get a lot worse if the input parameters are actually 64 bits - so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is due to the fact that the processor can produce a 32 x 32 bit multiply into a 64-bit result pretty well - some 5-10 clock cycles. But a 64 x 64 bit multiply requires a fair bit of extra code, so will take longer.
Division is a similar problem to multiplication - but here it's OK to take a 64-bit input on the one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work, the 64-bit divide is probably nearly always slow.
The data will also take twice as much cache-space, which may impact the results. And as a similar consequence, general assignment and passing data around will take twice as long as a minimum, since there is twice as much data to operate on.
The compiler will also need to use more registers.
are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
Probably, but I'm not aware of any. And even if there are, it would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.
If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter if Benchmark X gives 5%, 25% or 103% slower results, if your code is some completely different amount slower or faster under the same circumstances.
does anyone have own experience about this performance impact?
I've recompiled some code that uses 64-bit integers for a 64-bit architecture, and found the performance improved by a substantial amount - as much as 25% on some bits of code.
Changing your OS to a 64-bit version of the same OS would perhaps help?
Edit:
Because I like to find out what the difference is in these sorts of things, I have written a bit of code, with some primitive templates (I'm still learning that bit - templates aren't exactly my hottest topic, I must say; give me bit-fiddling and pointer arithmetic, and I'll (usually) get it right...).
Here's the code I wrote, trying to replicate a few common functions:
#include <iostream>
#include <cstdint>
#include <ctime>
using namespace std;
static __inline__ uint64_t rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}
template<typename T>
static T add_numbers(const T *v, const int size)
{
T sum = 0;
for(int i = 0; i < size; i++)
sum += v[i];
return sum;
}
template<typename T, const int size>
static T add_matrix(const T v[size][size])
{
T sum[size] = {};
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
sum[i] += v[i][j];
}
T tsum=0;
for(int i = 0; i < size; i++)
tsum += sum[i];
return tsum;
}
template<typename T>
static T add_mul_numbers(const T *v, const T mul, const int size)
{
T sum = 0;
for(int i = 0; i < size; i++)
sum += v[i] * mul;
return sum;
}
template<typename T>
static T add_div_numbers(const T *v, const T mul, const int size)
{
T sum = 0;
for(int i = 0; i < size; i++)
sum += v[i] / mul;
return sum;
}
template<typename T>
void fill_array(T *v, const int size)
{
for(int i = 0; i < size; i++)
v[i] = i;
}
template<typename T, const int size>
void fill_array(T v[size][size])
{
for(int i = 0; i < size; i++)
for(int j = 0; j < size; j++)
v[i][j] = i + size * j;
}
uint32_t bench_add_numbers(const uint32_t v[], const int size)
{
uint32_t res = add_numbers(v, size);
return res;
}
uint64_t bench_add_numbers(const uint64_t v[], const int size)
{
uint64_t res = add_numbers(v, size);
return res;
}
uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
{
const uint32_t c = 7;
uint32_t res = add_mul_numbers(v, c, size);
return res;
}
uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
{
const uint64_t c = 7;
uint64_t res = add_mul_numbers(v, c, size);
return res;
}
uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
{
const uint32_t c = 7;
uint32_t res = add_div_numbers(v, c, size);
return res;
}
uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
{
const uint64_t c = 7;
uint64_t res = add_div_numbers(v, c, size);
return res;
}
template<const int size>
uint32_t bench_matrix(const uint32_t v[size][size])
{
uint32_t res = add_matrix(v);
return res;
}
template<const int size>
uint64_t bench_matrix(const uint64_t v[size][size])
{
uint64_t res = add_matrix(v);
return res;
}
template<typename T>
void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
{
fill_array(v, size);
uint64_t t = rdtsc();
T res = func(v, size);
t = rdtsc() - t;
cout << "result = " << res << endl;
cout << name << " time in clocks " << dec << t << endl;
}
template<typename T, const int size>
void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
{
fill_array(v);
uint64_t t = rdtsc();
T res = func(v);
t = rdtsc() - t;
cout << "result = " << res << endl;
cout << name << " time in clocks " << dec << t << endl;
}
int main()
{
// spin up CPU to full speed...
time_t t = time(NULL);
while(t == time(NULL)) ;
const int vsize=10000;
uint32_t v32[vsize];
uint64_t v64[vsize];
uint32_t m32[100][100];
uint64_t m64[100][100];
runbench(bench_add_numbers, "Add 32", v32, vsize);
runbench(bench_add_numbers, "Add 64", v64, vsize);
runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);
runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);
runbench2(bench_matrix, "Matrix 32", m32);
runbench2(bench_matrix, "Matrix 64", m64);
}
Compiled with:
g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x
And the results are (note: see the 2016 results below - these results are slightly optimistic due to SSE instructions being used in 64-bit mode but not in 32-bit mode):
result = 49995000
Add 32 time in clocks 20784
result = 49995000
Add 64 time in clocks 30358
result = 349965000
Add Mul 32 time in clocks 30182
result = 349965000
Add Mul 64 time in clocks 79081
result = 7137858
Add Div 32 time in clocks 60167
result = 7137858
Add Div 64 time in clocks 457116
result = 49995000
Matrix 32 time in clocks 22831
result = 49995000
Matrix 64 time in clocks 23823
As you can see, addition and multiplication aren't that much worse. Division gets really bad. Interestingly, the matrix addition hardly differs at all.
And is it faster on 64-bit I hear some of you ask:
Using the same compiler options, just -m64 instead of -m32 - yupp, a lot faster:
result = 49995000
Add 32 time in clocks 8366
result = 49995000
Add 64 time in clocks 16188
result = 349965000
Add Mul 32 time in clocks 15943
result = 349965000
Add Mul 64 time in clocks 35828
result = 7137858
Add Div 32 time in clocks 50176
result = 7137858
Add Div 64 time in clocks 50472
result = 49995000
Matrix 32 time in clocks 12294
result = 49995000
Matrix 64 time in clocks 14733
Edit, update for 2016:
Four variants, with and without SSE, in 32- and 64-bit mode of the compiler.
I'm typically using clang++ as my usual compiler these days. I tried compiling with g++ (but it would still be a different version than the one above, as I've updated my machine - and I have a different CPU too). Since g++ failed to compile the no-SSE version in 64-bit, I didn't see the point in that. (g++ gives similar results anyway.)
As a short table:
Test name | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
----------------------------------------------------------
Add uint32_t | 20837 | 10221 | 3701 | 3017 |
----------------------------------------------------------
Add uint64_t | 18633 | 11270 | 9328 | 9180 |
----------------------------------------------------------
Add Mul 32 | 26785 | 18342 | 11510 | 11562 |
----------------------------------------------------------
Add Mul 64 | 44701 | 17693 | 29213 | 16159 |
----------------------------------------------------------
Add Div 32 | 44570 | 47695 | 17713 | 17523 |
----------------------------------------------------------
Add Div 64 | 405258 | 52875 | 405150 | 47043 |
----------------------------------------------------------
Matrix 32 | 41470 | 15811 | 21542 | 8622 |
----------------------------------------------------------
Matrix 64 | 22184 | 15168 | 13757 | 12448 |
Full results with compile options.
$ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 20837
result = 49995000
Add 64 time in clocks 18633
result = 349965000
Add Mul 32 time in clocks 26785
result = 349965000
Add Mul 64 time in clocks 44701
result = 7137858
Add Div 32 time in clocks 44570
result = 7137858
Add Div 64 time in clocks 405258
result = 49995000
Matrix 32 time in clocks 41470
result = 49995000
Matrix 64 time in clocks 22184
$ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3701
result = 49995000
Add 64 time in clocks 9328
result = 349965000
Add Mul 32 time in clocks 11510
result = 349965000
Add Mul 64 time in clocks 29213
result = 7137858
Add Div 32 time in clocks 17713
result = 7137858
Add Div 64 time in clocks 405150
result = 49995000
Matrix 32 time in clocks 21542
result = 49995000
Matrix 64 time in clocks 13757
$ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3017
result = 49995000
Add 64 time in clocks 9180
result = 349965000
Add Mul 32 time in clocks 11562
result = 349965000
Add Mul 64 time in clocks 16159
result = 7137858
Add Div 32 time in clocks 17523
result = 7137858
Add Div 64 time in clocks 47043
result = 49995000
Matrix 32 time in clocks 8622
result = 49995000
Matrix 64 time in clocks 12448
$ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 10221
result = 49995000
Add 64 time in clocks 11270
result = 349965000
Add Mul 32 time in clocks 18342
result = 349965000
Add Mul 64 time in clocks 17693
result = 7137858
Add Div 32 time in clocks 47695
result = 7137858
Add Div 64 time in clocks 52875
result = 49995000
Matrix 32 time in clocks 15811
result = 49995000
Matrix 64 time in clocks 15168
More than you ever wanted to know about doing 64-bit math in 32-bit mode...
When you use 64-bit numbers in 32-bit mode (even on a 64-bit CPU, if the code is compiled for 32-bit), they are stored as two separate 32-bit numbers, one storing the higher bits and one storing the lower bits. The impact of this depends on the instruction. (tl;dr - generally, doing 64-bit math on a 32-bit CPU is in theory 2 times slower, as long as you don't divide/modulo; in practice the difference is going to be smaller (1.3x would be my guess), because usually programs don't just do math on 64-bit integers, and also, because of pipelining, the difference may be much smaller in your program.)
Addition/subtraction
Many architectures support a so-called carry flag. It's set when the result of an addition overflows, or when the result of a subtraction doesn't underflow. The behaviour of this bit can be shown with long addition and long subtraction. C in this example shows either a bit higher than the highest representable bit (during the operation), or the carry flag (after the operation).
  C 7 6 5 4 3 2 1 0        C 7 6 5 4 3 2 1 0
  0 1 1 1 1 1 1 1 1        1 0 0 0 0 0 0 0 0
+   0 0 0 0 0 0 0 1      -   0 0 0 0 0 0 0 1
= 1 0 0 0 0 0 0 0 0      = 0 1 1 1 1 1 1 1 1
Why is carry flag relevant? Well, it just so happens that CPUs usually have two separate addition and subtraction operations. In x86, the addition operations are called add and adc. add stands for addition, while adc for addition with carry. The difference between those is that adc considers a carry bit, and if it is set, it adds one to the result.
Similarly, subtraction with carry subtracts 1 from the result if carry bit is not set.
This behaviour allows easily implementing arbitrary-size addition and subtraction on integers. The result of adding x and y (assuming those are 8-bit) is never bigger than 0x1FE. If you add 1, you get 0x1FF. 9 bits are therefore enough to represent the result of any 8-bit addition. If you start the addition with add, and then add any bits beyond the initial ones with adc, you can do addition on data of any size you like.
Addition of two 64-bit values on 32-bit CPU is as follows.
Add first 32 bits of b to first 32 bits of a.
Add with carry later 32 bits of b to later 32 bits of a.
Analogically for subtraction.
This gives 2 instructions; however, because of instruction pipelining, it may be slower than that, as one calculation depends on the other to finish, so if the CPU doesn't have anything else to do than the 64-bit addition, it may wait for the first addition to be done.
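In C, what the compiler generates for a 64-bit add on a 32-bit target looks conceptually like this sketch (the carry is recovered with a comparison, which corresponds to the hardware carry flag):
#include <cstdint>

// Conceptual 64-bit addition built from two 32-bit halves (what add/adc do).
void add64(uint32_t a_lo, uint32_t a_hi,
           uint32_t b_lo, uint32_t b_hi,
           uint32_t &r_lo, uint32_t &r_hi)
{
    r_lo = a_lo + b_lo;                 // "add": low halves
    uint32_t carry = (r_lo < a_lo);     // carry out of the low word
    r_hi = a_hi + b_hi + carry;         // "adc": high halves plus carry
}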
Multiplication
It so happens that on x86, imul and mul can be used in such a way that the upper 32 bits of the result are stored in the edx register. Therefore, multiplying two 32-bit values to get a 64-bit value is really easy. Such a multiplication is one instruction, but to make use of it, one of the multiplicands must be stored in eax.
Anyway, for the more general case of multiplying two 64-bit values, the result can be calculated using the following approach (where each partial result is truncated to 32 bits).
First of all, it's easy to notice that the lower 32 bits of the result will be the product of the lower 32 bits of the multiplied variables. This is due to the congruence relation:
a1 ≡ b1 (mod n)
a2 ≡ b2 (mod n)
a1a2 ≡ b1b2 (mod n)
Therefore, the task is reduced to just determining the higher 32 bits. To calculate the higher 32 bits of the result, the following values should be added together:
Higher 32 bits of the multiplication of both lower 32 bits (the overflow which the CPU can store in edx)
Higher 32 bits of the first variable multiplied by the lower 32 bits of the second variable
Lower 32 bits of the first variable multiplied by the higher 32 bits of the second variable
This gives about 5 instructions; however, because of the relatively limited number of registers in x86 (ignoring extensions to the architecture), they cannot take too much advantage of pipelining. Enable SSE if you want to improve the speed of multiplication, as this increases the number of registers.
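A C sketch of the scheme just described; it computes the low 64 bits of the full product, which is exactly what a 64 x 64 -> 64 multiply needs (the helper name is made up for illustration):
#include <cstdint>

// 64x64 -> 64 multiply expressed through 32-bit halves.
uint64_t mul64_from_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low = (uint64_t)a_lo * b_lo;          // 32x32 -> 64 (one mul, result in edx:eax)
    uint32_t high = (uint32_t)(low >> 32)          // overflow of the low product
                  + a_hi * b_lo                    // cross terms, truncated to 32 bits
                  + a_lo * b_hi;

    return ((uint64_t)high << 32) | (uint32_t)low;
}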
Division/Modulo (both are similar in implementation)
I don't know how it works, but it's much more complex than addition, subtraction or even multiplication. It's likely to be ten times slower than division on a 64-bit CPU, however. Check "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", page 257, for more details if you can understand it (I can't, in a way that I could explain it, unfortunately).
If you divide by a power of 2, please refer to the shifting section, because that is essentially what the compiler can optimize division to (plus adding the most significant bit before shifting for signed numbers).
Or/And/Xor
Considering these are single-bit operations, nothing special happens here; the bitwise operation is just done twice.
Shifting left/right
Interestingly, x86 actually has an instruction to perform a 64-bit left shift, called shld, which, instead of filling the least significant bits of the value with zeros, fills them with the most significant bits of a different register. Similarly, there is the shrd instruction for right shifts. This would easily make 64-bit shifting a two-instruction operation.
However, that's only the case for constant shifts. When a shift is not constant, things get trickier, as the x86 architecture only supports shifts with 0-31 as the count. Anything beyond that is, according to the official documentation, undefined, and in practice a bitwise AND with 0x1F is performed on the count. Therefore, when the shift count is higher than 31, one of the value halves is erased entirely (for a left shift that's the lower half, for a right shift the higher half); the other one gets the value that was in the erased register, and then the shift operation is performed. As a result, this depends on the branch predictor making good predictions, and is a bit slower because the count needs to be checked.
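A sketch of the non-constant case; the n >= 32 branch is what the text above describes, and n == 0 is special-cased to avoid an out-of-range shift in C:
#include <cstdint>

// Conceptual 64-bit left shift by n (0..63) on a 32-bit machine.
void shl64(uint32_t &lo, uint32_t &hi, unsigned n)
{
    if (n >= 32) {           // low half is erased, high half takes its (shifted) value
        hi = lo << (n - 32);
        lo = 0;
    } else if (n != 0) {     // shld-style: high half receives the bits shifted out of the low half
        hi = (hi << n) | (lo >> (32 - n));
        lo <<= n;
    }
}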
__builtin_popcount[ll]
__builtin_popcount(lower) + __builtin_popcount(higher)
Other builtins
I'm too lazy to finish the answer at this point. Does anyone even use those?
Unsigned vs signed
Addition, subtraction, multiplication, or, and, xor and shift left generate exactly the same code. Shift right uses only slightly different code (arithmetic shift vs. logical shift), but structurally it's the same. Division likely does generate different code, however, and signed division is likely to be slower than unsigned division.
Benchmarks
Benchmarks? They are mostly meaningless, as instruction pipelining will usually lead to things being faster when you don't constantly repeat the same operation. Feel free to consider division slow, but nothing else really is, and when you get outside of benchmarks, you may notice that because of pipelining, doing 64-bit operations on 32-bit CPU is not slow at all.
Benchmark your own application, don't trust micro-benchmarks that don't do what your application does. Modern CPUs are quite tricky, so unrelated benchmarks can and will lie.
Your question sounds pretty weird in its environment. You use time_t, which is 32 bits. You need additional info, which means more bits, so you are forced to use something bigger than int32. It doesn't matter what the performance is, right? The choice is between using, say, just 40 bits or going straight to int64. Unless millions of instances must be stored, the latter is the sensible choice.
As others pointed out, the only way to know the true performance is to measure it with a profiler (in some gross samples a simple clock will do), so just go ahead and measure. It can't be hard to globally replace your time_t usage with a typedef, redefine it to 64 bits and patch up the few instances where a real time_t was expected.
My bet would be on "unmeasurable difference" unless your current time_t instances take up at least a few megs of memory. On current Intel-like platforms the cores spend most of their time waiting for external memory to get into the cache: a single cache miss stalls for hundred(s) of cycles, which makes calculating one-tick differences on instructions infeasible. Your real performance may drop due to things like your current structure fitting exactly in a cache line while the bigger one needs two. And if you never measured your current performance, you might discover that you could gain an extreme speedup of some functions just by adding some alignment, or by exchanging the order of some members in a structure. Or use pack(1) on the structure instead of the default layout...
Addition/subtraction basically becomes two cycles each; multiplication and division depend on the actual CPU. The general performance impact will be rather low.
Note that Intel Core 2 supports EM64T.