Variance in RDTSC overhead - C++

I'm constructing a micro-benchmark to measure performance changes as I experiment with the use of SIMD instruction intrinsics in some primitive image processing operations. However, writing useful micro-benchmarks is difficult, so I'd like to first understand (and if possible eliminate) as many sources of variation and error as possible.
One factor that I have to account for is the overhead of the measurement code itself. I'm measuring with RDTSC, and I'm using the following code to find the measurement overhead:
#include <vector>

extern inline unsigned long long __attribute__((always_inline)) rdtsc64() {
    unsigned int hi, lo;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        : /* no inputs */
        : "rbx", "rcx");
    return ((unsigned long long)hi << 32ull) | (unsigned long long)lo;
}

unsigned int find_rdtsc_overhead() {
    const int trials = 1000000;

    std::vector<unsigned long long> times;
    times.resize(trials, 0);

    for (int i = 0; i < trials; ++i) {
        unsigned long long t_begin = rdtsc64();
        unsigned long long t_end = rdtsc64();
        times[i] = (t_end - t_begin);
    }

    // print frequencies of cycle counts
}
When running this code, I get output like this:
Frequency of occurrence (for 1000000 trials):
234 cycles (counted 28 times)
243 cycles (counted 875703 times)
252 cycles (counted 124194 times)
261 cycles (counted 37 times)
270 cycles (counted 2 times)
693 cycles (counted 1 times)
1611 cycles (counted 1 times)
1665 cycles (counted 1 times)
... (a bunch of larger times each only seen once)
My questions are these:
What are the possible causes of the bi-modal distribution of cycle counts generated by the code above?
Why does the fastest time (234 cycles) only occur a handful of times—what highly unusual circumstance could reduce the count?
Further Information
Platform:
Linux 2.6.32 (Ubuntu 10.04)
g++ 4.4.3
Core 2 Duo (E6600); this has constant rate TSC.
SpeedStep has been turned off (processor is set to performance mode and is running at 2.4GHz); if running in 'ondemand' mode, I get two peaks at 243 and 252 cycles, and two (presumably corresponding) peaks at 360 and 369 cycles.
I'm using sched_setaffinity to lock the process to one core. If I run the test on each core in turn (i.e., lock to core 0 and run, then lock to core 1 and run), I get similar results for the two cores, except that the fastest time of 234 cycles tends to occur slightly fewer times on core 1 than on core 0.
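For reference, the pinning looks roughly like this (a minimal sketch, not the exact code; error handling omitted):
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // for cpu_set_t / CPU_ZERO / CPU_SET / sched_setaffinity
#endif
#include <sched.h>

void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                        // allow only the given core
    sched_setaffinity(0, sizeof(set), &set);    // pid 0 = calling process; return value not checked here
}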
Compile command is:
g++ -Wall -mssse3 -mtune=core2 -O3 -o test.bin test.cpp
The code that GCC generates for the core loop is:
.L105:
#APP
# 27 "test.cpp" 1
xorl %eax, %eax
cpuid
rdtsc
# 0 "" 2
#NO_APP
movl %edx, %ebp
movl %eax, %edi
#APP
# 27 "test.cpp" 1
xorl %eax, %eax
cpuid
rdtsc
# 0 "" 2
#NO_APP
salq $32, %rdx
salq $32, %rbp
mov %eax, %eax
mov %edi, %edi
orq %rax, %rdx
orq %rdi, %rbp
subq %rbp, %rdx
movq %rdx, (%r8,%rsi)
addq $8, %rsi
cmpq $8000000, %rsi
jne .L105

RDTSC can return inconsistent results for a number of reasons:
On some CPUs (especially certain older Opterons), the TSC isn't synchronized between cores. It sounds like you're already handling this by using sched_setaffinity -- good!
If the OS timer interrupt fires while your code is running, there'll be a delay introduced while it runs. There's no practical way to avoid this; just throw out unusually high values.
Pipelining artifacts in the CPU can sometimes throw you off by a few cycles in either direction in tight loops. It's perfectly possible to have some loops that run in a non-integer number of clock cycles.
Cache! Depending on the vagaries of the CPU cache, memory operations (like the write to times[]) can vary in speed. In this case, you're fortunate that the std::vector implementation being used is just a flat array; even so, that write can throw things off. This is probably the most significant factor for this code.
I'm not enough of a guru on the Core2 microarchitecture to say exactly why you're getting this bimodal distribution, or how your code ran faster those 28 times, but it probably has something to do with one of the reasons above.

The Intel Programmer's manual recommends you use lfence;rdtsc or rdtscp if you want to ensure that instructions prior to the rdtsc have actually executed. This is because rdtsc isn't a serializing instruction by itself.
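A minimal sketch of both variants with GNU inline asm (assuming a CPU that supports rdtscp; the "memory" clobber keeps the compiler from reordering surrounding statements across the measurement):
static inline unsigned long long rdtscp64() {
    unsigned int hi, lo, aux;
    __asm__ __volatile__("rdtscp"
                         : "=a"(lo), "=d"(hi), "=c"(aux)   // aux receives IA32_TSC_AUX (core/socket id)
                         : : "memory");
    return ((unsigned long long)hi << 32) | lo;
}

static inline unsigned long long lfence_rdtsc64() {
    unsigned int hi, lo;
    __asm__ __volatile__("lfence\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         : : "memory");
    return ((unsigned long long)hi << 32) | lo;
}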

You should make sure that frequency throttling and other power-saving functionality is disabled at the OS level. Restart the machine. You may otherwise have a situation where the cores have unsynchronized time stamp counter values.
The 243 reading is by far the most common, which is one reason for using it. On the other hand, suppose you get an elapsed time < 243: you subtract the overhead and get an underflow. Since the arithmetic is unsigned, you end up with an enormous result. This fact speaks for using the lowest reading (234) instead. It is extremely difficult to accurately measure sequences that are only a few cycles long. On a typical x86 at a few GHz I'd recommend against timing sequences shorter than 10 ns, and even at that length they would typically be far from rock-solid.
The rest of my answer here is what I do, how I handle the results and my reasoning on the subject matter.
As to overhead the easiest way is to use code such as this
unsigned __int64 rdtsc_inline (void);
unsigned __int64 rdtsc_function (void);
The first form emits the rdtsc instruction into the generated code (as in your code). The second will cause the function to be called, the rdtsc executed, and a return instruction executed; it may also generate a stack frame. Obviously the second form is much slower than the first.
The (C) code for overhead calculation can then be written
unsigned __int64 start_cycle,end_cycle; /* place these at the module level */
unsigned __int64 overhead;
/* place this code inside a function */
start_cycle=rdtsc_inline();
end_cycle=rdtsc_inline();
overhead=end_cycle-start_cycle;
If you are using the inline variant you will get a low(er) overhead. You also run the risk of calculating an overhead which is greater than it "should" be (especially for the function form), which in turn means that if you measure very short/fast sequences, the previously calculated overhead may be greater than the measurement itself. When you attempt to adjust for the overhead you will then get an underflow, which leads to messy conditions. The best way to handle this is to
time the overhead several times and always use the smallest value achieved (see the sketch after this list),
not measure really short code sequences, as you may run into pipelining effects which will require messy synchronising instructions before the rdtsc instruction, and
if you must measure very short sequences, regard the results as indications rather than as facts.
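A minimal sketch of that calibration (names are mine; it uses the rdtsc_inline() wrapper and the unsigned __int64 type from above):
unsigned __int64 calibrate_overhead(int trials) {
    unsigned __int64 best = ~0ull;                    // start with the largest possible value
    for (int i = 0; i < trials; ++i) {
        unsigned __int64 start_cycle = rdtsc_inline();
        unsigned __int64 end_cycle   = rdtsc_inline();
        unsigned __int64 overhead    = end_cycle - start_cycle;
        if (overhead < best)
            best = overhead;                          // keep the smallest value achieved
    }
    return best;
}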
I've previously discussed what I do with the results in this thread.
Another thing that I do is to integrate the measurement code into the application. The overhead is insignificant. After a result has been computed I send it to a special structure where I count the number of measurements, sum the x and x^2 values, and track the min and max measurements. Later on I can use the data to calculate the average and standard deviation. The structure itself is indexed, and I can measure different performance aspects such as individual application functions ("functional performance"), time spent in cpu, disk reading/writing, network reading/writing ("non-functional performance") etc.
If an application is instrumented in this manner and monitored from the very beginning I expect that the risk of it having performance problems during its lifetime will be greatly reduced.
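A minimal sketch of such a structure (names are mine, not from any particular library):
#include <cmath>

struct PerfBucket {
    unsigned long long count = 0;
    double sum_x  = 0.0;        // sum of measurements
    double sum_x2 = 0.0;        // sum of squared measurements
    double min_x  = 1e300;
    double max_x  = 0.0;

    void add(double x) {
        ++count;
        sum_x  += x;
        sum_x2 += x * x;
        if (x < min_x) min_x = x;
        if (x > max_x) max_x = x;
    }
    double mean() const { return count ? sum_x / count : 0.0; }
    double stddev() const {
        if (count < 2) return 0.0;
        double m = mean();
        return std::sqrt((sum_x2 - count * m * m) / (count - 1));   // sample standard deviation
    }
};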


Performance of a string operation with strlen vs stop on zero

While I was writing a string class in C++, I found a strange behavior regarding the speed of execution.
I'll take as an example the following two implementations of the upper method:
class String {
    char* str;
    ...
    forceinline void upperStrlen();
    forceinline void upperPtr();
};

void String::upperStrlen()
{
    INDEX length = strlen(str);

    for (INDEX i = 0; i < length; i++) {
        str[i] = toupper(str[i]);
    }
}

void String::upperPtr()
{
    char* ptr_char = str;

    for (; *ptr_char != '\0'; ptr_char++) {
        *ptr_char = toupper(*ptr_char);
    }
}
INDEX is simply a typedef of uint_fast32_t.
Now I can test the speed of those methods in my main.cpp:
#define TEST_RECURSIVE(_function)                            \
    {                                                        \
        bool ok = true;                                      \
        clock_t before = clock();                            \
        for (int i = 0; i < TEST_RECURSIVE_TIMES; i++) {     \
            if (!(_function()) && ok)                        \
                ok = false;                                  \
        }                                                    \
        char output[TEST_RECURSIVE_OUTPUT_STR];              \
        sprintf(output, "[%s] Test %s %s: %ld ms\n",         \
                ok ? "OK" : "Failed",                        \
                TEST_RECURSIVE_BUILD_TYPE,                   \
                #_function,                                  \
                (clock() - before) * 1000 / CLOCKS_PER_SEC); \
        fprintf(stdout, output);                             \
        fprintf(file_log, output);                           \
    }
String a;
String b;

bool stringUpperStrlen()
{
    a.upperStrlen();
    return true;
}

bool stringUpperPtr()
{
    b.upperPtr();
    return true;
}

int main(int argc, char** argv) {
    ...
    a = "Hello World!";
    b = "Hello World!";

    TEST_RECURSIVE(stringUpperPtr);
    TEST_RECURSIVE(stringUpperStrlen);
    ...
    return 0;
}
Then I can compile and test with cmake in Debug or Release with the following results.
[OK] Test RELEASE stringUpperPtr: 21 ms
[OK] Test RELEASE stringUpperStrlen: 12 ms
[OK] Test DEBUG stringUpperPtr: 27 ms
[OK] Test DEBUG stringUpperStrlen: 33 ms
So in Debug the behavior is what I expected: the pointer version is faster than the strlen version, but in Release strlen is faster.
So I looked at the GCC assembly, and the number of instructions is much smaller in stringUpperPtr than in stringUpperStrlen.
The stringUpperStrlen assembly:
_Z17stringUpperStrlenv:
.LFB72:
.cfi_startproc
pushq %r13
.cfi_def_cfa_offset 16
.cfi_offset 13, -16
xorl %eax, %eax
pushq %r12
.cfi_def_cfa_offset 24
.cfi_offset 12, -24
pushq %rbp
.cfi_def_cfa_offset 32
.cfi_offset 6, -32
xorl %ebp, %ebp
pushq %rbx
.cfi_def_cfa_offset 40
.cfi_offset 3, -40
pushq %rcx
.cfi_def_cfa_offset 48
orq $-1, %rcx
movq a@GOTPCREL(%rip), %r13
movq 0(%r13), %rdi
repnz scasb
movq %rcx, %rdx
notq %rdx
leaq -1(%rdx), %rbx
.L4:
cmpq %rbp, %rbx
je .L3
movq 0(%r13), %r12
addq %rbp, %r12
movsbl (%r12), %edi
incq %rbp
call toupper@PLT
movb %al, (%r12)
jmp .L4
.L3:
popq %rdx
.cfi_def_cfa_offset 40
popq %rbx
.cfi_def_cfa_offset 32
popq %rbp
.cfi_def_cfa_offset 24
popq %r12
.cfi_def_cfa_offset 16
movb $1, %al
popq %r13
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE72:
.size _Z17stringUpperStrlenv, .-_Z17stringUpperStrlenv
.globl _Z14stringUpperPtrv
.type _Z14stringUpperPtrv, @function
The stringUpperPtr assembly:
_Z14stringUpperPtrv:
.LFB73:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movq b@GOTPCREL(%rip), %rax
movq (%rax), %rbx
.L9:
movsbl (%rbx), %edi
testb %dil, %dil
je .L8
call toupper@PLT
movb %al, (%rbx)
incq %rbx
jmp .L9
.L8:
movb $1, %al
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE73:
.size _Z14stringUpperPtrv, .-_Z14stringUpperPtrv
.section .rodata.str1.1,"aMS",@progbits,1
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
So how do you explain this difference in performance?
Thanks in advance.
EDIT:
CMake generates something like these commands to compile:
/bin/g++-8 -Os -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o xpp-tests libxpp.so
/bin/g++-8 -O3 -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o Release/xpp-tests Release/libxpp.so
# CMAKE generated file: DO NOT EDIT!
# Generated by "Unix Makefiles" Generator, CMake Version 3.16
# compile CXX with /bin/g++-8
CXX_FLAGS = -O3 -DNDEBUG -Wall -pipe -fPIC -march=native -fno-strict-aliasing
CXX_DEFINES = -DPLATFORM_UNIX=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1
The define TEST_RECURSIVE will call _function 1000000 times in my examples.
You have several misconceptions about performance. You need to dispel these misconceptions.
Now I can test the speed of those methods in my main.cpp: (…)
Your benchmarking code calls the benchmarked functions directly. So you're measuring the benchmarked functions as optimized for the specific case of how they're used by the benchmarking code: to call them repeatedly on the same input. This is unlikely to have any relevance to how they behave in a realistic environment.
I think the compiler didn't do anything earth-shattering because it doesn't know what toupper does. If the compiler had known that toupper doesn't transform a nonzero character into zero, it might well have hoisted the strlen call outside the benchmarked loop. And if it had known that toupper(toupper(x)) == toupper(x), it might well have decided to run the loop only once.
To make a somewhat realistic benchmark, put the benchmarked code and the benchmarking code in separate source files, compile them separately, and disable any kind of cross-module or link-time optimization.
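For example, an arrangement like this (file names and commands are just an illustration, not your project's actual layout):
// to_upper.cpp -- the code under test, compiled on its own
#include <cstring>
#include <cctype>
void upper_strlen(char* str) {
    std::size_t length = std::strlen(str);
    for (std::size_t i = 0; i < length; i++)
        str[i] = std::toupper(str[i]);
}

// bench.cpp -- only sees a declaration, so the call can't be inlined or specialized
void upper_strlen(char* str);
// ... timing loop calls upper_strlen(buffer) here ...

// build, e.g.:
//   g++ -O3 -c to_upper.cpp
//   g++ -O3 -c bench.cpp
//   g++ to_upper.o bench.o -o bench      (no -flto anywhere)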
Then I can compile and test with cmake in Debug or Release
Compiling in debug mode rarely has any relevance to microbenchmarks (benchmarking the speed of an implementation of a small fragment of code, as opposed to benchmarking the relative speed of algorithms in terms of how many elementary functions they call). Compiler optimizations have a significant effect on microbenchmarks.
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
No, absolutely not.
First of all, fewer instructions total is completely irrelevant to the speed of the program. Even on a platform where executing one instruction takes the same amount of time regardless of what the instruction is, which is unusual, what matters is how many instructions are executed, not how many instructions there are in the program. For example, a loop with 100 instructions that is executed 10 times is 10 times faster than a loop with 10 instructions that is executed 1000 times, even though it's 10 times larger. Inlining is a common program transformation that usually makes the code larger and makes it faster often enough that it's considered a common optimization.
Second, on many platforms, such as any PC or server made in the 21st century, any smartphone, and even many lower-end devices, the time it takes to execute an instruction can vary so widely that it's a poor indication of performance. Cache is a major factor: a read from memory can be more than 1000 times slower than a read from cache on a PC. Other factors with less impact include pipelining, which causes the speed of an instruction to depend on the surrounding instructions, and branch prediction, which causes the speed of a conditional instruction to depend on the outcome of previous conditional instructions.
Third, that's just considering processor instructions — what you see in assembly code. Compilers for C, C++ and most other languages optimize programs in such a way that it can be hard to predict what the processor will be doing exactly.
For example, how long does the instruction ++x; take on a PC?
If the compiler has figured out that the addition is unnecessary, for example because nothing uses x afterwards, or because the value of x is known at compile time and therefore so is the value of x+1, it'll optimize it away. So the answer is 0.
If the value of x is already in a register at this point and the value is only needed in a register afterwards, the compiler just needs to generate an addition or increment instruction. So the simplistic, but not quite correct answer is 1 clock cycle. One reason this is not quite correct is that merely decoding the instruction takes many cycles on a high-end processor such as what you find in a 21st century PC or smartphone. However “one cycle” is kind of correct in that while it takes multiple clock cycles from starting the instruction to finishing it, the instruction only takes one cycle in each pipeline stage. Furthermore, even taking this into account, another reason this is not quite correct is that ++x; ++y; might not take 2 clock cycles: modern processors are sophisticated enough that they may be able to decode and execute multiple instructions in parallel (for example, a processor with 4 arithmetic units can perform 4 additions at the same time). Yet another reason this might not be correct is if the type of x is larger or smaller than a register, which might require more than one assembly instruction to perform the addition.
If the value of x needs to be loaded from memory, this takes a lot more than one clock cycle. Anything other than the innermost cache level dwarfs the time it takes to decode the instruction and perform the addition. The amount of time is very different depending on whether x is found in the L3 cache, in the L2 cache, in the L1 cache, or in the “real” RAM. And even that gets more complicated when you consider that x might be part of a cache prefetch (hardware- or software- triggered).
It's even possible that x is currently in swap, so that reading it requires reading from a disk.
And writing the result exhibits somewhat similar variations to reading the input. However the performance characteristics are different for reads and for writes because when you need a value, you need to wait for the read to be complete, whereas when you write a value, you don't need to wait for the write to be complete: a write to memory writes to a buffer in cache, and the time when the buffer is flushed to a higher-level cache or to RAM depends on what else is happening on the system (what else is competing for space in the cache).
Ok, now let's turn to your specific example and look at what happens in their inner loop. I'm not very familiar with x86 assembly but I think I get the gist.
For stringUpperStrlen, the inner loop starts at .L4. Just before entering the inner loop, %rbx is set to the length of the string. Here's what the inner loop contains:
cmpq %rbp, %rbx: Compare the current index to the length, both obtained from registers.
je .L3: conditional jump, to exit the loop if the index is equal to the length.
movq 0(%r13), %r12: Read from memory to get the address of the beginning of the string. (I'm surprised that the address isn't in a register at this point.)
addq %rbp, %r12: an arithmetic operation that depends on the value that was just read from memory.
movsbl (%r12), %edi: Read the current character from the string in memory.
incq %rbp: Increment the index. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
call toupper@PLT
movb %al, (%r12): Write the value returned by the function to the current character of the string in memory.
jmp .L4: Unconditional jump to the beginning of the loop.
For stringUpperPtr, the inner loop starts at .L9. Here's what the inner loop contains:
movsbl (%rbx), %edi: Read the current character of the string from memory.
testb %dil, %dil: test if %dil is zero. %dil is the least significant byte of %edi which was just read from memory.
je .L8: conditional jump, to exit the loop if the character is zero.
call toupper@PLT
movb %al, (%rbx): Write the value returned by the function to the current character of the string in memory.
incq %rbx: Increment the pointer. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
jmp .L9: Unconditional jump to the beginning of the loop.
The differences between the two loops are:
The loops have slightly different lengths, but both are small enough that they fit in a single cache line (or two, if the code happens to straddle a line boundary). So after the first iteration of the loop, the code will be in the innermost instruction cache. Not only that, but if I understand correctly, on modern Intel processors, there is a cache of decoded instructions, which the loop is small enough to fit in, and so no decoding needs to take place.
The stringUpperStrlen loop has one more read. The extra read is from a constant address which is likely to remain in the innermost cache after the first iteration.
The conditional instruction in the stringUpperStrlen loop depends only on values that are in registers. On the other hand, the conditional instruction in the stringUpperPtr loop depends on a value which was just read from memory.
So the difference boils down to an extra data read from the innermost cache, vs having a conditional instruction whose outcome depends on a memory read. An instruction whose outcome depends on the result of another instruction leads to a hazard: the second instruction is blocked until the first instruction is fully executed, which prevents taking advantage from pipelining, and can render speculative execution less effective. In the stringUpperStrlen loop, the processor essentially runs two things in parallel: the load-call-store cycle, which doesn't have any conditional instructions (apart from what happens inside toupper), and the increment-test cycle, which doesn't access memory. This lets the processor work on the conditional instruction while it's waiting for memory. In the stringUpperPtr loop, the conditional instruction depends on a memory read, so the processor can't start working on it until the read is complete. I'd typically expect this to be slower than the extra read from the innermost cache, although it might depend on the processor.
Of course, the stringUpperStrlen does need to have a load-test hazard to determine the end of the string: no matter how it does it, it needs to fetch characters in memory. This is hidden inside repnz scasb. I don't know the internal architecture of an x86 processor, but I suspect that this case (which is extremely common since it's the meat of strlen) is heavily optimized inside the processor, probably to an extent that is impossible to reach with generic instructions.
You may see different results if the string was longer and the two memory accesses in stringUpperStrlen weren't in the same cache line, although possibly not because this only costs one more cache line and there are several. The details would depend on how the caches work and how toupper uses them.

Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

While I experimented with measuring time of execution of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block, but with two arithmetic operations in the for loop body. Here is the code I ended up testing:
#include <iostream>
#include <chrono>

#define NUM_ITERATIONS 100000000

int main()
{
    // Block 1: one operation in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();
        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    // Block 2: two operations in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();
        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    return 0;
}
I tested this with different levels of code optimization (-O0, -O1, -O2, -O3), with different online compilers (for example onlinegdb.com), on my work machine, on my home PC and laptop, on a Raspberry Pi and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), changed integer types. But I always got a similar result: the block with one line in the loop is SLOWER than the block with two lines:
1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000
I checked the assembly output on https://godbolt.org/ but everything looked like I expected: second block just had one more operation in assembly output.
Three operations always behaved as expected: they are slower than one and faster than four. So why do two operations produce such an anomaly?
Edit:
Let me repeat: I see this behaviour on all of my Windows and Unix machines when the code is not optimized. I looked at the assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. In any case, if the loop is optimized away, there is nothing left in the code to ask about. I added the note about optimizations to the question to avoid "do not measure unoptimized code" answers, because optimizations are not what I am asking about. The question is really why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in execution time is 5-25% in my tests (quite noticeable).
This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into the loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.
On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.
This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.
This is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.
Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
See @rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.
With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it does just stay in a register. Unfortunately @SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower-latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as the desktop version of the same generation.)
Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.
I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result
I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.
(later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.
So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.
See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.
The loop with optimization enabled:
As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:
orig_main:
...
call std::chrono::_V2::system_clock::now() # demangled C++ symbol name
mov rbp, rax # save the return value = start
call std::chrono::_V2::system_clock::now()
# end in RAX
So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.
With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:
#include <benchmark/benchmark.h>

static void TargetFunc(benchmark::State& state) {
    uint64_t x2 = 0, y2 = 0;
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        benchmark::DoNotOptimize(x2 += 31);
        benchmark::DoNotOptimize(y2 += 31);
    }
}
// Register the function as a benchmark
BENCHMARK(TargetFunc);
# just the main loop, from gcc10.1 -O3
.L7: # do{
add rax, 31 # x2 += 31
add rdx, 31 # y2 += 31
sub rbx, 1
jne .L7 # }while(--count != 0)
I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.
For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.
static void TargetFunc(benchmark::State& state) {
    uint64_t x2 = 0, y2 = 0;
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        x2 += 16;
        y2 += 17;
        asm volatile("" : "+r"(x2), "+r"(y2));
    }
}
All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see @rcgldr's answer.
Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.
You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g. x + (y<<2) + 4 can compile to a single LEA instruction for x86.
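For instance, a small illustration (the exact asm depends on the compiler and flags):
int f(int x, int y) {
    return x + (y << 2) + 4;   // gcc/clang at -O2 typically emit a single
                               //   lea eax, [rdi + rsi*4 + 4]   (x in edi, y in esi)
}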
The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away
TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.
Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).
ETA: This was a guess, and Peter Cordes has made a very good argument about why it's incorrect. Go upvote Peter's answer.
I'm leaving my answer here because some found the information useful. Though this doesn't correctly explain the behavior seen in the OP, it highlights some of the issues that make it infeasible (and meaningless) to try to measure the speed of a particular instruction on a modern processor.
Educated guess:
It's the combined effect of pipelining, powering down portions of a core, and dynamic frequency scaling.
Modern processors pipeline so that multiple instructions can be executing at the same time. This is possible because the processor actually works on micro-ops rather than the assembly-level instructions we usually think of as machine language. Processors "schedule" micro-ops by dispatching them to different portions of the chip while keeping track of the dependencies between the instructions.
Suppose the core running your code has two arithmetic/logic units (ALUs). A single arithmetic instruction repeated over and over requires only one ALU. Using two ALUs doesn't help because the next operation depends on completion of the current one, so the second ALU would just be waiting around.
But in your two-expression test, the expressions are independent. To compute the next value of y, you do not have to wait for the current operation on x to complete. Now, because of power-saving features, that second ALU may be powered down at first. The core might run a few iterations before realizing that it could make use of the second ALU. At that point, it can power up the second ALU and most of the two-expression loop will run as fast as the one-expression loop. So you might expect the two examples to take approximately the same amount of time.
Finally, many modern processors use dynamic frequency scaling. When the processor detects that it's not running hard, it actually slows its clock a little bit to save power. But when it's used heavily (and the current temperature of the chip permits), it might increase the actual clock speed as high as its rated speed.
I assume this is done with heuristics. In the case where the second ALU stays powered down, the heuristic may decide it's not worth boosting the clock. In the case where two ALUs are powered up and running at top speed, it may decide to boost the clock. Thus the two-expression case, which should already be just about as fast as the one-expression case, actually runs at a higher average clock frequency, enabling it to complete twice as much work in slightly less time.
Given your numbers, the difference is about 14%. My Windows machine idles at about 3.75 GHz, and if I push it a little by building a solution in Visual Studio, the clock climbs to about 4.25GHz (eyeballing the Performance tab in Task Manager). That's a 13% difference in clock speed, so we're in the right ballpark.
I split up the code into C++ and assembly. I just wanted to test the loops, so I didn't return the sum(s). I'm running on Windows, the calling convention is rcx, rdx, r8, r9, the loop count is in rcx. The code is adding immediate values to 64 bit integers on the stack.
I'm getting similar times for both loops, less than 1% variation, same or either one up to 1% faster than the other.
There is an apparent dependency factor here: each add to a memory location has to wait for the prior add to the same location to complete, so two adds to different memory locations can be performed essentially in parallel.
Changing test2 to do 3 adds to memory ends up about 6% slower; with 4 adds to memory, 7.5% slower.
My system is Intel 3770K 3.5 GHz CPU, Intel DP67BG motherboard, DDR3 1600 9-9-9-27 memory, Win 7 Pro 64 bit, Visual Studio 2015.
.code
public test1
align 16
test1 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst10: add qword ptr[rsp+8],17
dec rcx
jnz tst10
add rsp,16
ret
test1 endp
public test2
align 16
test2 proc
sub rsp,16
mov qword ptr[rsp+0],0
mov qword ptr[rsp+8],0
tst20: add qword ptr[rsp+0],17
add qword ptr[rsp+8],-37
dec rcx
jnz tst20
add rsp,16
ret
test2 endp
end
I also tested with add immediate to register, 1 or 2 registers within 1% (either could be faster, but we'd expect them both to execute at 1 iteration / clock on Ivy Bridge, given its 3 integer ALU ports; What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?).
3 registers 1.5 times as long, somewhat worse than the ideal 1.333 cycles / iterations from 4 uops (including the loop counter macro-fused dec/jnz) for 3 back-end ALU ports with perfect scheduling.
4 registers, 2.0 times as long, bottlenecked on the front-end: Is performance reduced when executing loops whose uop count is not a multiple of processor width?. Haswell and later microarchitectures would handle this better.
.code
public test1
align 16
test1 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst10: add rdx,17
dec rcx
jnz tst10
ret
test1 endp
public test2
align 16
test2 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst20: add rdx,17
add r8,-37
dec rcx
jnz tst20
ret
test2 endp
public test3
align 16
test3 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst30: add rdx,17
add r8,-37
add r9,47
dec rcx
jnz tst30
ret
test3 endp
public test4
align 16
test4 proc
xor rdx,rdx
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
tst40: add rdx,17
add r8,-37
add r9,47
add r10,-17
dec rcx
jnz tst40
ret
test4 endp
end
@PeterCordes proved this answer to be wrong in many of its assumptions, but it could still be useful as a blind research attempt at the problem.
I set up some quick benchmarks, thinking it may somehow be connected to code memory alignment, truly a crazy thought.
But it seems that @Adrian McCarthy got it right with the dynamic frequency scaling.
Anyway, the benchmarks tell that inserting some NOPs could help with the issue, with 15 NOPs after the x+=31 in Block 1 leading to nearly the same performance as Block 2. Truly mind-blowing how 15 NOPs in a single-instruction loop body increase performance.
http://quick-bench.com/Q_7HY838oK5LEPFt-tfie0wy4uA
I also tried -Ofast, thinking compilers might be smart enough to throw away such NOPs, but that seems not to be the case.
http://quick-bench.com/so2CnM_kZj2QEWJmNO2mtDP9ZX0
Edit: Thanks to @PeterCordes it was made clear that optimizations were never working quite as expected in the benchmarks above (as the global variable required add instructions to access memory); the new benchmark http://quick-bench.com/HmmwsLmotRiW9xkNWDjlOxOTShE clearly shows that Block 1 and Block 2 performance is equal for stack variables. But NOPs could still help in a single-threaded application with a loop accessing a global variable, which you probably should not do in that case anyway; just assign the global variable to a local variable after the loop.
Edit 2: Actually, optimizations never worked, because the quick-benchmark macros make variable access volatile, preventing important optimizations. It is only logical to load the variable once, as we are only modifying it in the loop, so the volatile access (or disabled optimizations) is the bottleneck. So this answer is basically wrong, but at least it shows how NOPs could speed up unoptimized code execution, if that makes any sense in the real world (there are better ways, like bucketing counters).
Processors are so complex these days that we can only guess.
The assembly emitted by your compiler is not what is really executed. The microcode/firmware/whatever of your CPU will interpret it and turn it into instructions for its execution engine, much like JIT languages such as C# or Java do.
One thing to consider here is that for each loop, there is not 1 or 2 instructions, but n + 2, as you also increment and compare i to your number of iterations. In the vast majority of cases it wouldn't matter, but here it does, as the loop body is so simple.
Let's see the assembly:
Some defines:
#define NUM_ITERATIONS 1000000000ll
#define X_INC 17
#define Y_INC -31
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM :
mov QWORD PTR [rbp-32], 0
.L13:
cmp QWORD PTR [rbp-32], 999999999
jg .L12
add QWORD PTR [rbp-24], 17
add QWORD PTR [rbp-32], 1
jmp .L13
.L12:
C/C++ :
for (long i = 0; i < NUM_ITERATIONS; i++) {x+=X_INC; y+=Y_INC;}
ASM:
mov QWORD PTR [rbp-80], 0
.L21:
cmp QWORD PTR [rbp-80], 999999999
jg .L20
add QWORD PTR [rbp-64], 17
sub QWORD PTR [rbp-72], 31
add QWORD PTR [rbp-80], 1
jmp .L21
.L20:
So both assembly listings look pretty similar. But then let's think twice: modern CPUs have ALUs which operate on values which are wider than their register size. So there is a chance that in the first case, the operations on x and i are done on the same computing unit. But then you have to read i again, as you put a condition on the result of this operation. And reading means waiting.
So, in the first case, to iterate on x, the CPU might have to be in sync with the iteration on i.
In the second case, maybe x and y are treated on a different unit than the one dealing with i. So in fact, your loop body runs in parallel with the condition driving it. And there goes your CPU computing and computing until someone tells it to stop. It doesn't matter if it goes too far, going back a few loops is still fine compared to the amount of time it just gained.
So, to compare what we want to compare (one operation vs two operations), we should try to get i out of the way.
One solution is to completely get rid of it by using a while loop:
C/C++:
while (x < (X_INC * NUM_ITERATIONS)) { x+=X_INC; }
ASM:
.L15:
movabs rax, 16999999999
cmp QWORD PTR [rbp-40], rax
jg .L14
add QWORD PTR [rbp-40], 17
jmp .L15
.L14:
Another one is to use the antiquated "register" C keyword:
C/C++:
register long i;
for (i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }
ASM:
mov ebx, 0
.L17:
cmp rbx, 999999999
jg .L16
add QWORD PTR [rbp-48], 17
add rbx, 1
jmp .L17
.L16:
Here are my results:
x1 for: 10.2985 seconds. x,y = 17000000000,0
x1 while: 8.00049 seconds. x,y = 17000000000,0
x1 register-for: 7.31426 seconds. x,y = 17000000000,0
x2 for: 9.30073 seconds. x,y = 17000000000,-31000000000
x2 while: 8.88801 seconds. x,y = 17000000000,-31000000000
x2 register-for: 8.70302 seconds. x,y = 17000000000,-31000000000
Code is here: https://onlinegdb.com/S1lAANEhI

How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)

The idea is that I'd like to collect returned double values into a vector register for processing, a machine vector width at a time, without storing back into memory first.
The particular processing is a vfma with two other operands that are all constexpr, so that they can simply be summoned by _mm256_setr_pd or an aligned/unaligned memory load from a constexpr array.
Is there a way to store a double into %ymm at a particular position directly from a value in %rax, for this collecting purpose?
The target machine is Kaby Lake. More efficient or future vector instructions are welcome also.
Inline-assembly is usually a bad idea: modern compilers do a good job with x86 intrinsics.
Putting the bit-pattern for a double into RAX is usually also not useful, and smells like you've probably already gone down the wrong path into sub-optimal territory. Vector shuffle instructions are usually better: element-wise insert/extract instructions already cost a shuffle uop on Intel hardware, except for vmovq %xmm0, %rax to get the low element.
Also, if you're going to insert it into another vector, you should shuffle/immediate-blend. (vpermpd / vblendpd).
L1d and store-forwarding cache is fast, and even store-forwarding stalls are not a disaster. Choose wisely between ALU vs. memory when gathering or scattering data into / from SIMD vectors. Also remember that insert/extract instructions need an immediate index, so if you have a runtime index for a vector, you definitely want to store the vector and index the element. (See https://agner.org/optimize/ and other performance links in https://stackoverflow.com/tags/x86/info)
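For example, a sketch of the store-and-index approach for a runtime element index (the function name is mine; assumes AVX):
#include <immintrin.h>

double get_element(__m256d v, unsigned idx) {   // idx is only known at run time
    alignas(32) double tmp[4];
    _mm256_store_pd(tmp, v);    // spill the whole vector
    return tmp[idx & 3];        // reload just the element we want
}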
Lots of insert/extract can quickly bottleneck on port 5 on Haswell and later. See Loading an xmm from GP regs for more details, and some links to gcc bug reports where I went into more detail about strategies for different element widths on different uarches and with SSE4.1 vs. without SSE4.1, etc.
There's no PD version of extractps r/m32, xmm, imm, and insertps is a shuffle between XMM vectors.
To read/write the low lane of a YMM, you'll have to use integer vpextrq $1, %xmm0, %rax / pinsrq $1, %rax, %xmm0. Those aren't available in YMM width, so you need multiple instructions to read/write the high lane.
The VEX version vpinsrq $1, %rax, %xmm0 will zero the high lane(s) of the destination vector's full YMM or ZMM width, which is why I suggested pinsrq. On Skylake and later, it won't cause an SSE/AVX transition stall. See Using ymm registers as a "memory-like" storage location for an example (NASM syntax), and also Loading an xmm from GP regs
For the low element, use vmovq %xmm0, %rax to extract, it's cheaper than vpextrq (1 uop instead of 2).
For ZMM, my answer on that linked XMM from GP regs question shows that you can use AVX512F to merge-mask an integer register into a vector, given a mask register with a single bit set.
vpbroadcastq %rax, %zmm0{%k1}
Similarly, vpcompressq can move an element selected by a single-bit mask to the bottom for vmovq.
But for extract, if you have an index instead of 1<<index to start with, you may be better off with vmovd %ecx, %zmm1 / vpermd %zmm0, %zmm1, %zmm2 / vmovq %zmm2, %rax. This trick even works with vpshufb for byte elements (within a lane at least). For lane crossing, maybe shuffle + vmovd with the high bits of the byte index, then scalar right-shift using the low bits of the index as the byte-within-word offset. See also How to use _mm_extract_epi8 function? for intrinsics for a variable-index emulation of pextrb.
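A sketch of both the merge-masked insert and a variable-index extract with AVX-512F intrinsics (function names are mine; the value is treated as a raw 64-bit pattern):
#include <immintrin.h>
#include <cstdint>

// insert: broadcast the GP-register value, merging it only into lane `idx`
__m512i insert_q(__m512i v, uint64_t x, unsigned idx) {
    return _mm512_mask_set1_epi64(v, (__mmask8)(1u << idx), (long long)x);  // vpbroadcastq zmm{k}, reg
}

// extract: permute the wanted qword down to lane 0, then vmovq it out
uint64_t extract_q(__m512i v, unsigned idx) {
    __m512i low = _mm512_permutexvar_epi64(_mm512_set1_epi64(idx), v);      // vpermq with a variable index
    return (uint64_t)_mm_cvtsi128_si64(_mm512_castsi512_si128(low));        // vmovq of the low element
}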
High lane of a YMM, with AVX2
I think your best bet to write an element in the high lane of a YMM with AVX2 needs a scratch register:
vmovq %rax, %xmm0 (copy to a scratch vector)
shuffle into position with vinsertf128 (AVX1) or vpbroadcastq/vbroadcastsd. it's faster than vpermq/vpermpd on AMD. (But the reg-reg version is still AVX2-only)
vblendpd (FP) or vpblendd (integer) into the target YMM reg. Immediate blends with dword or larger element width are very cheap (1 uop for any vector ALU port on Intel).
This is only 3 uops, but 2 of them need port 5 on Intel CPUs. (So it costs the same as a vpinsrq + a blend). Only the blend is on the critical path from vector input to vector output, setting up ymm0 from rax is independent.
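In intrinsics, that recipe might look roughly like this, starting from a double value rather than from the raw bits in %rax (the function name is mine):
#include <immintrin.h>

// insert scalar double d into element 2 of v (low element of the high lane), AVX2
__m256d insert_elem2(__m256d v, double d) {
    __m256d bcast = _mm256_broadcastsd_pd(_mm_set_sd(d));  // vbroadcastsd ymm, xmm: d in all 4 elements
    return _mm256_blend_pd(v, bcast, 1 << 2);              // vblendpd: take only element 2 from bcast
}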
To read the highest element, vpermpd or vpermq $3, %ymm1, %ymm0 (AVX2), then vmovq from xmm0.
To read the 2nd-highest element, vextractf128 $1, %ymm1, %xmm0 (AVX1) and vmovq. vextractf128 is faster than vpermq/pd on AMD CPUs.
A bad alternative to avoid a scratch reg for insert would be vpermq or vperm2i128 to shuffle the qword you want to replace into the low lane, pinsrq (not vpinsrq), then vpermq to put it back in the right order. That's all shuffle uops, and pinsrq is 2 uops. (And causes an SSE/AVX transition stall on Haswell, but not Skylake). Plus all those operations are part of a dependency chain for the register you're modifying, unlike setting up a value in another register and blending.

VM interpreter - weighting performance benefits and drawbacks of larger instruction set / dispatch loop

I am developing a simple VM and I am at a crossroads.
My initial goal was to use byte-long instructions, and therefore a small loop and a quick computed-goto dispatch.
However, it turns out reality could not be further from that: 256 opcodes are nowhere near enough to cover signed and unsigned 8, 16, 32 and 64-bit integers, floats and doubles, pointer operations, and the different combinations of addressing. One option was to not implement bytes and shorts, but the goal is to make a VM that supports the full C subset as well as vector operations, since they are pretty much everywhere anyway, albeit in different implementations.
So I switched to 16-bit instructions, and now I am also able to add portable SIMD intrinsics and more compiled common routines that really save on performance by not being interpreted. There is also caching of global addresses, initially compiled as base-pointer offsets: the first time an address is used, it simply overwrites the offset and instruction so that the next time it is a direct jump, at the cost of an extra instruction in the set for each use of a global by an instruction.
Since I am not yet at the profiling stage, I am in a dilemma: are the extra instructions worth the extra flexibility? Will the presence of more instructions, and therefore the absence of back-and-forth copying instructions, make up for the increased dispatch loop size? Keep in mind the instructions are just a few assembly instructions each, e.g.:
.globl __Z20assign_i8u_reg8_imm8v
.def __Z20assign_i8u_reg8_imm8v; .scl 2; .type 32; .endef
__Z20assign_i8u_reg8_imm8v:
LFB13:
.cfi_startproc
movl _ip, %eax
movb 3(%eax), %cl
movzbl 2(%eax), %eax
movl _sp, %edx
movb %cl, (%edx,%eax)
addl $4, _ip
ret
.cfi_endproc
LFE13:
.p2align 2,,3
.globl __Z18assign_i8u_reg_regv
.def __Z18assign_i8u_reg_regv; .scl 2; .type 32; .endef
__Z18assign_i8u_reg_regv:
LFB14:
.cfi_startproc
movl _ip, %edx
movl _sp, %eax
movzbl 3(%edx), %ecx
movb (%ecx,%eax), %cl
movzbl 2(%edx), %edx
movb %cl, (%eax,%edx)
addl $4, _ip
ret
.cfi_endproc
LFE14:
.p2align 2,,3
.globl __Z24assign_i8u_reg_globCachev
.def __Z24assign_i8u_reg_globCachev; .scl 2; .type 32; .endef
__Z24assign_i8u_reg_globCachev:
LFB15:
.cfi_startproc
movl _ip, %eax
movl _sp, %edx
movl 4(%eax), %ecx
addl %edx, %ecx
movl %ecx, 4(%eax)
movb (%ecx), %cl
movzwl 2(%eax), %eax
movb %cl, (%eax,%edx)
addl $8, _ip
ret
.cfi_endproc
LFE15:
.p2align 2,,3
.globl __Z19assign_i8u_reg_globv
.def __Z19assign_i8u_reg_globv; .scl 2; .type 32; .endef
__Z19assign_i8u_reg_globv:
LFB16:
.cfi_startproc
movl _ip, %eax
movl 4(%eax), %edx
movb (%edx), %cl
movzwl 2(%eax), %eax
movl _sp, %edx
movb %cl, (%edx,%eax)
addl $8, _ip
ret
.cfi_endproc
This example contains the instructions to:
assign unsigned byte from immediate value to register
assign unsigned byte from register to register
assign unsigned byte from global offset to register, and cache and change to the direct instruction
assign unsigned byte from global offset to register (the now cached previous version)
... and so on...
Naturally, when I produce a compiler for it, I will be able to test the instruction flow in production code and optimize the arrangement of the instructions in memory to pack together the frequently used ones and get more cache hits.
I just have a hard time figuring out whether such a strategy is a good idea; the bloat buys flexibility, but what about performance? Will more compiled routines make up for a larger dispatch loop? Is it worth caching global addresses?
I would also like someone decent in assembly to give an opinion on the quality of the code generated by GCC - are there any obvious inefficiencies and room for optimization? To make the situation clear: there is an sp pointer, which points to the stack that implements the registers (there is no other stack), ip is logically the current instruction pointer, and gp is the global pointer (not referenced directly, accessed as an offset).
EDIT: Also, this is the basic format I am implementing the instructions in:
INSTRUCTION assign_i8u_reg16_glob() { // assign unsigned byte to reg from global offset
    FETCH(globallAddressCache);
    REG(quint8, i.d16_1) = GLOB(quint8);
    INC(globallAddressCache);
}
FETCH returns a reference to the instruction struct the handler uses, selected based on the opcode
REG returns a reference to a register value of type T at the given offset
GLOB returns a reference to a global value from a cached global offset (effectively an absolute address)
INC just increments the instruction pointer by the size of the instruction.
Some people will probably suggest against the usage of macros, but with templates it is much less readable. This way the code is pretty obvious.
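Roughly, they expand to something like this (a simplified sketch of the idea, not the exact definitions; ip and sp are assumed to be byte pointers here, and cachedAddress is a placeholder member name):
#define FETCH(T)        const T& i = *reinterpret_cast<const T*>(ip)   // view the bytes at ip as instruction struct T
#define REG(T, offset)  (*reinterpret_cast<T*>(sp + (offset)))         // register slot on the register stack
#define GLOB(T)         (*reinterpret_cast<T*>(i.cachedAddress))       // cached absolute address of the global
#define INC(T)          (ip += sizeof(T))                              // advance to the next instruction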
EDIT: I would like to add a few points to the question:
I could go for a "register operations only" solution which can only move data between registers and "memory" - be that global or heap. In this case, every "global" and heap access will have to copy the value, modify or use it, and move it back to update. This way I have a shorter dispatch loop, but a few extra instructions for each instruction that addresses non-register data. So the dilemma is a few times more native code with longer direct jumps, or a few times more interpreted instructions with a shorter dispatch loop. Will a short dispatch loop give me enough performance to make up for the extra and costly memory operations? Maybe the delta between the shorter and longer dispatch loop is not enough to make a real difference, in terms of cache hits and in terms of the cost of assembly jumps?
I could go for additional decoding and only 8-bit wide instructions; however, this may add another jump - jump to wherever this instruction is handled, then waste time on either jumping to the case where the particular addressing scheme is handled, or decoding operations and a more complex execution method. And in the first case, the dispatch loop still grows, plus adding yet another jump. In the second option, register operations can be used to decode the addressing, but a more complex instruction with more compile-time unknowns will be needed in order to address anything. I am not really sure how this will stack up against a shorter dispatch loop; once again, I am uncertain how my "shorter and longer dispatch loop" relates to what is considered short or long in terms of assembly instructions, the memory they need and the speed of their execution.
I could go for the "many instructions" solution - the dispatch loop is a few times larger, but it still uses pre-computed direct jumping. Complex addressing is specific and optimized for each instruction and compiled to native code, so the extra memory operations that would be needed by the "register only" approach will be compiled and mostly executed on registers, which is good for performance. Generally, the idea is to add more to the instruction set but also add to the amount of work that can be compiled in advance and done in a single "instruction". A longer instruction set also means a longer dispatch loop, longer jumps (although that can be optimized to minimize them) and fewer cache hits, but the question is BY HOW MUCH? Considering every "instruction" is just a few assembly instructions, is an assembly snippet of about 7-8k instructions considered normal, or too much? Considering the average instruction size varies around 2-3 bytes, this should not be more than 20k of memory, enough to completely fit in most L1 caches. But this is not concrete math, just stuff I came up with while googling around, so maybe my "calculations" are off? Or maybe it doesn't work that way? I am not that experienced with caching mechanisms.
To me, as I currently weigh the arguments, the "many instructions" approach appears to have the best chance of the best performance, provided of course my theory about fitting the "extended dispatch loop" in the L1 cache holds. So here is where your expertise and experience come into play. Now that the context is narrowed and a few supporting ideas presented, maybe it will be easier to give a more concrete answer on whether the benefits of a larger instruction set prevail over the size increase of native code by decreasing the amount of the slower, interpreted code.
My instruction size data is based on those stats.
You might want to consider separating the VM ISA and its implementation.
For instance, in a VM I wrote I had a "load value direct" instruction. The next value in the instruction stream wasn't decoded as an instruction, but loaded as a value into a register. You can consider this one macro instruction or two separate values.
Another instruction I implemented was a "load constant value", which loaded a constant from memory (using a base address for the table of constants and an offset). A common pattern in the instruction stream was therefore load value direct (index); load constant value. Your VM implementation may recognize this pattern and handle the pair with a single optimized implementation.
Obviously, if you have enough bits, you can use some of them to identify a register. With 8 bits it may be necessary to have a single register for all operations. But again, you could add another instruction with register X which modifies the next operation. In your C++ code, that instruction would merely set the currentRegister pointer which the other instructions use.
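As a toy illustration of that kind of dispatch (GNU computed goto; obviously not your encoding or instruction set), where "load value direct" consumes the next stream word as data rather than as an opcode:
#include <cstdint>

enum : uint16_t { OP_LOAD_DIRECT, OP_ADD, OP_HALT };

int64_t run(const uint16_t* ip, int64_t* regs) {
    static const void* dispatch[] = { &&op_load_direct, &&op_add, &&op_halt };
    int64_t acc = 0;
    goto *dispatch[*ip++];

op_load_direct:                       // next stream word is data, not an opcode
    acc = (int16_t)*ip++;
    goto *dispatch[*ip++];
op_add:
    acc += regs[*ip++];               // next stream word names a register
    goto *dispatch[*ip++];
op_halt:
    return acc;
}
A stream like { OP_LOAD_DIRECT, 5, OP_ADD, 0, OP_HALT } would then load 5 and add regs[0] before returning.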
Will more compiled routines make up for a larger dispatch loop?
I take it you didn't fancy having single byte instructions with a second byte of extra opcode for certain instructions? I think a decode for 16-bit opcodes may be less efficient than 8-bit + extra byte(s), assuming the extra byte(s) aren't too common or too difficult to decode in themselves.
If it was me, I'd work on getting the compiler (not necessarily a full-fledged compiler with "everything", but a basic model) going with a fairly limited set of "instructions". Keep the code generation part fairly flexible so that it'll be easy to alter the actual encoding later. Once you have that working, you can experiment with various encodings and see what the result is in performance, and other aspects.
A lot of your minor question points are very hard to answer for anyone who hasn't tried both of the choices. I have never written a VM in this sense, but I have worked on several disassemblers, instruction set simulators and such things, and I have implemented a couple of interpreted languages of different kinds.
You probably also want to consider a JIT approach, where instead of interpreting the bytecode, you translate it into native machine code for the architecture in question.
The GCC-generated code doesn't look terrible, but there are several places where an instruction depends on the result of the immediately preceding one - which is not great on modern processors. Unfortunately, I don't see any solution to that - it's a "code too short to shuffle things around" problem - adding more instructions obviously won't help.
I do see one little problem: loading a 32-bit constant will require that it's 32-bit aligned for best performance. I have no idea how (or if) Java VMs deal with that.
I think you are asking the wrong question, and not because it is a bad question; on the contrary, it is an interesting subject, and I suspect many people are interested in the results, just as I am.
However, so far no one is sharing similar experience, so I guess you may have to do some pioneering. Instead of wondering which approach to use and wasting time on the implementation of boilerplate code, focus on creating a "reflection" component that describes the structure and properties of the language. Create a nice polymorphic structure with virtual methods, without worrying about performance, and create modular components you can assemble at runtime; there is even the option to use a declarative language once you have established the object hierarchy. Since you appear to use Qt, you have half the work cut out for you. Then you can use the tree structure to analyze and generate a variety of different code: C code to compile, or bytecode for a specific VM implementation, of which you can create several. You can even use it to programmatically generate the C code for your VM instead of typing it all by hand.
I think this advice will be more beneficial if you end up pioneering on the subject without a concrete answer in advance; it will allow you to easily test all the scenarios and make up your mind based on actual performance rather than on your own assumptions and those of others. Then maybe you can share the results and answer your question with performance data.
Instruction length in bytes has been handled the same way for quite a while. Obviously, being limited to 256 instructions isn't a good thing when there are so many types of operations you wish to perform.
This is why there's a prefix value. Back in the Game Boy architecture, there wasn't enough room to include the needed 256 bit-manipulation instructions, so one opcode was used as a prefix. This kept the original 256 opcodes and added 256 more that start with that prefix byte.
For example:
One operation might look like this: D6 FF = SUB A, 0xFF
But a prefixed instruction would be presented as: CB D6 = SET 2, (HL)
If the processor reads CB, it immediately starts looking in another instruction set of 256 opcodes.
The same goes for the x86 architecture today, where any instruction prefixed with 0F is essentially part of another instruction set.
With the sort of execution you're using for your emulator, this is the best way of extending your instruction set. 16-bit opcodes would take up far more space than necessary, and the prefix only costs an extra lookup for the instructions that actually use it.
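As a rough sketch (the table and function names below are mine, not from any particular emulator), prefix dispatch can be just a second 256-entry handler table reached through one extra lookup:
// Prefix-based dispatch in the spirit of the Game Boy's CB prefix and
// x86's 0F escape byte; the tables must be filled in before use.
#include <cstdint>

struct Cpu {
    const uint8_t *pc;          // points into the loaded bytecode
};

typedef void (*Handler)(Cpu &cpu);

Handler base_table[256];        // primary opcode map
Handler prefixed_table[256];    // secondary opcode map, reached via the prefix

// Handler installed at base_table[0xCB]: fetch one more byte and dispatch
// into the second table, so only prefixed instructions pay the extra lookup.
void prefix_cb(Cpu &cpu) {
    uint8_t opcode = *cpu.pc++;
    prefixed_table[opcode](cpu);
}

// One step of the main dispatch loop.
void step(Cpu &cpu) {
    uint8_t opcode = *cpu.pc++;
    base_table[opcode](cpu);
}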
One thing you should decide is what balance you wish to strike between code-file size efficiency, cache efficiency, and raw execution-speed efficiency. Depending upon the coding patterns of the code you're interpreting, it may be helpful to have each instruction, regardless of its length in the code file, get translated into a structure containing a pointer and an integer. The pointer would point to a function that takes a pointer to the instruction-info structure as well as to the execution context. The main execution loop would thus be something like:
do
{
    pc = pc->func(pc, &context);
} while (pc);
The function associated with an "add short immediate" instruction would be something like:
INSTRUCTION *add_short_immediate(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    // The immediate fits in this instruction's operand slot.
    context->op_stack[0].asInt64 += pc->operand;
    return pc + 1;
}
while "add long immediate" would be:
INSTRUCTION *add_long_immediate(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    // A 64-bit immediate is split across two slots: low 32 bits here,
    // high 32 bits in the next instruction slot.
    context->op_stack[0].asInt64 += (uint32_t)pc->operand + ((int64_t)(pc[1].operand) << 32);
    return pc + 2;
}
and the function associated with an "add local" instruction would be:
INSTRUCTION *add_local(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    // The operand selects which stack/local slot to add to the top entry.
    CONTEXT_ITEM *op_stack = context->op_stack;
    op_stack[0].asInt64 += op_stack[pc->operand].asInt64;
    return pc + 1;
}
Your "executables" would consist of compressed bytecode format, but they would then get translated into a table of instructions, eliminating a level of indirection when decoding the instructions at run-time.

Is multiplication of two numbers a constant time algorithm?

Suppose I write,
int a = 111;
int b = 509;
int c = a * b;
So what is the time complexity of computing 'a * b'? How is the multiplication operation executed?
Compiling this function:
int f(int a, int b) {
    return a * b;
}
with gcc -O3 -march=native -m64 -fomit-frame-pointer -S gives me the following assembly:
f:
movl %ecx, %eax
imull %edx, %eax
ret
The first instruction (movl) copies the first argument into the return register, the second instruction (imull) multiplies it by the second argument in place - then the result gets returned.
The actual multiplication is done with imull, which - depending on your CPU type - will take a certain amount of CPU cycles.
If you look at Agner Fog's instruction timing tables you can see how much time each instruction will take. On most x86 processors it seems to be a small constant; however, the imul instruction on the AMD K8 with a 64-bit argument and result is listed as 4-5 CPU cycles. I don't know whether that's a measurement issue or genuinely variable time.
Also note that there are factors involved other than just the execution time. The integer has to be moved through the processor and get into the right place to be multiplied. All of this contributes to latency, which is also noted in Agner Fog's tables. There are other issues, such as caching effects, which also make life more difficult - it's not that easy to simply say how fast something will run without running it.
x86 isn't the only architecture, and it's actually not inconceivable there are CPU's and architectures out there that have non-constant time multiplication. This is especially important for cryptography where algorithms using multiplication might be susceptible to timing attacks on those platforms.
Multiplication itself on most common architectures will be constant time. The time to load the registers may vary depending on the location of the variables (L1, L2, RAM, etc.), but the number of cycles the operation itself takes will be constant. This is in contrast to operations like sqrt that may require additional cycles to achieve a certain precision.
You can get instruction costs for AMD, Intel and VIA processors here: http://www.agner.org/optimize/instruction_tables.pdf
By time complexity, I presume you mean whether it depends on the number of digits in a and b? So whether the number of CPU clock cycles would vary depending on whether you multiplied, say, 2*3 or 111*509. I think yes, it could vary, and it would depend on how the architecture implements the multiplication operation and how the intermediate results are stored.
Although there can be many ways to do this, one simple/primitive way is to implement multiplication using the binary adder/subtractor circuit.
Multiplication of a*b is adding a to itself b times using n-digit binary adders. Similarly, division a/b is subtracting b from a until it reaches 0, although this will take more space to store the quotient and remainder.
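As a rough software illustration (my own code, a shift-and-add variant rather than literally adding a to itself b times), the work depends on the operand width rather than on the particular values, which is why a fixed-width hardware multiplier can finish in a bounded number of cycles:
// Shift-and-add multiply built from nothing but adds and shifts.
// The loop runs once per bit of b.
#include <cstdint>

uint64_t shift_add_multiply(uint32_t a, uint32_t b) {
    uint64_t result = 0;
    uint64_t shifted_a = a;        // a shifted left by the current bit position
    while (b != 0) {
        if (b & 1)                 // if this bit of b is set,
            result += shifted_a;   // add the correspondingly shifted a
        shifted_a <<= 1;
        b >>= 1;
    }
    return result;
}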
void myfun()
{
    int a = 111;
    int b = 509;
    int c = a * b;
}
Disassembled part:
movl $111, -4(%ebp)
movl $509, -8(%ebp)
movl -4(%ebp), %eax
imull -8(%ebp), %eax
So as you can see, it all depends on the imull instruction - specifically, on the fetch, decode, and execute cycle of the CPU.
In your example, the compiler would do the multiplication and your code would look like
int c = 56499;
If you changed your example to look like
int c = a * 509;
then the compiler MIGHT decide to rewrite your code like
int c = a * ( 512 - 2 - 1 );
int c = (a << 9) - (a << 1) - a;
I said might because the compiler will compare the cost of using shift to the cost of a multiply instruction and pick the best option.
Given a fast multiply instruction, usually only a sequence of 1 or maybe 2 shifts will be faster than the multiply.
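A quick way to convince yourself the shift form is equivalent (my own check, for non-negative values that don't overflow int):
// Sanity check that (a << 9) - (a << 1) - a really equals a * 509.
#include <cassert>

int main() {
    for (int a = 0; a <= 100000; ++a)
        assert((a << 9) - (a << 1) - a == a * 509);
    return 0;
}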
If your numbers are too large to fit in an integer (32 bits), then the arbitrary-precision math routines use between O(n log n) and O(n^2) time, where n is the number of 32-bit parts needed to hold the numbers.
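For concreteness, here is a simplified sketch (mine, not a production bignum routine) of the schoolbook method on 32-bit parts, which is the O(n^2) end of that range:
// Schoolbook multiplication of two numbers stored as vectors of 32-bit
// parts, least significant part first. The nested loops over the parts
// are what make it O(n^2) in the number of parts.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> multiply(const std::vector<uint32_t> &a,
                               const std::vector<uint32_t> &b) {
    std::vector<uint32_t> result(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size(); ++j) {
            uint64_t cur = (uint64_t)a[i] * b[j] + result[i + j] + carry;
            result[i + j] = (uint32_t)cur;   // keep the low 32 bits here
            carry = cur >> 32;               // propagate the high 32 bits
        }
        result[i + b.size()] = (uint32_t)carry;
    }
    return result;
}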