How can adding a variable be faster than adding a constant? [duplicate] - c++

I find an interesting phenomenon:
#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p; //Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}
I compiled the code with GCC 7.3.0 on an i5-5257U running Mac OS, without any optimization. Here is the average run time over 10 runs:
There are also other people who test the case on other Intel platforms and get the same result.
I post the assembly generated by GCC here. The only difference between two assembly codes is that before addl $1, -12(%rbp) the faster one has two more operations:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the idea that the store-forwarding speedup is specific to some Intel CPUs.
BeeOnRope's comment and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it is worth discussing.

TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
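To make the loop-carried store/reload chain concrete, here is a minimal C++ sketch. It assumes that a volatile counter is a fair stand-in for what gcc -O0 does to k (every increment goes through memory); the function names and the timing helper are hypothetical, not part of the question's code.

```cpp
#include <cassert>
#include <chrono>

// Counter forced through memory on every iteration. volatile makes the
// compiler emit a load, an add, and a store each time, which approximates
// the store/reload that gcc -O0 generates for a loop variable.
long count_through_memory(long n) {
    volatile long k = 0;
    while (k < n) {
        k = k + 1;  // load k, add 1, store k: the loop-carried chain
    }
    return k;
}

// Same loop with an ordinary local. An optimizing compiler can keep k in
// a register or remove the loop entirely.
long count_in_register(long n) {
    long k = 0;
    while (k < n) {
        k = k + 1;
    }
    return k;
}

// Hypothetical timing helper: seconds taken by one call of f(n).
template <typename F>
double seconds_for(F f, long n) {
    auto start = std::chrono::steady_clock::now();
    volatile long sink = f(n);  // keep the result observable
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Built with optimization, comparing seconds_for(count_through_memory, n) against seconds_for(count_in_register, n) for a large n should show the memory version bound by store-forwarding latency, while the register version costs close to nothing. The exact ratio depends on the CPU.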
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time-optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs and maybe make things worse on some other CPUs. See discussion in comments

Related

SIMD intrinsics slower for cross products over an array of points than whatever GCC -O3 -march=native does on its own? [duplicate]

I heard there is an Intel book online that describes the CPU cycles needed for a specific assembly instruction, but I cannot find it (after trying hard). Could anyone show me how to find the CPU cycle counts, please?
Here is an example, in the below code, mov/lock is 1 CPU cycle, and xchg is 3 CPU cycles.
// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress,
                                    int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}
#endif // WIN32
BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx
Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!
While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get a (often) highly accurate analysis of the behavior of some piece of code (especially a loop) as described below and in other linked resources.
Instruction Timings
First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no less than thirty different microarchitectures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:
Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN's and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.
So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:
add eax, eax
add eax, eax
add eax, eax
add eax, eax # total latency of 4 cycles for these 4 adds
Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute in parallel in a single cycle
Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind
in the same thread.
For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).
The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.
This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
So with this information, you can start to see how to analyze instruction timings on modern CPUs.
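As a rough illustration of the latency/throughput distinction, here is a C++ sketch with one dependent multiply chain and four interleaved independent chains. The function names are hypothetical, and a compiler is free to vectorize or otherwise transform these loops, so treat this as a model of the dependency structure rather than a guaranteed benchmark.

```cpp
#include <cassert>
#include <cstdint>

// One dependency chain: each multiply needs the previous result, so the
// loop is bound by multiply latency (about 3 cycles per step for imul).
std::uint64_t multiply_dependent(std::uint64_t seed, std::uint64_t factor, int steps) {
    std::uint64_t x = seed;
    for (int i = 0; i < steps; ++i) {
        x *= factor;
    }
    return x;
}

// Four independent chains interleaved: multiplies from different chains
// can overlap in a pipelined multiplier, approaching 1 multiply per cycle.
std::uint64_t multiply_independent(const std::uint64_t seeds[4], std::uint64_t factor, int steps) {
    std::uint64_t a = seeds[0], b = seeds[1], c = seeds[2], d = seeds[3];
    for (int i = 0; i < steps; ++i) {
        a *= factor;
        b *= factor;
        c *= factor;
        d *= factor;
    }
    return a + b + c + d;
}

// Back-of-the-envelope estimates matching the two table columns.
double dependent_chain_cycles(int count, double latency) {
    return count * latency;
}

double independent_cycles(int count, double reciprocal_throughput) {
    return count * reciprocal_throughput;
}
```

With the imul numbers quoted above, dependent_chain_cycles(100, 3.0) gives 300 cycles for 100 dependent multiplies, while independent_cycles(100, 1.0) gives 100 cycles for 100 independent ones.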
Detailed Analysis
Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.
Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.
Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Assembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops" which starts on page 95 in the current version of the PDF.
The basic idea is that you create a table, with one row per instruction and mark the execution resources each uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).
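The table-based approach can be sketched in a few lines of C++. Everything here is an assumption for illustration: the 4-port machine, the InstructionRow type, and the per-port uop counts are hypothetical, not real data for any CPU. Real numbers would come from the instruction tables.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kPorts = 4;  // hypothetical machine with 4 execution ports

// One table row: the uops an instruction places on each port per iteration.
struct InstructionRow {
    const char* name;
    std::array<double, kPorts> port_uops;
};

// Throughput bound: the busiest port limits how fast iterations can issue.
double throughput_bound(const std::vector<InstructionRow>& rows) {
    std::array<double, kPorts> total{};
    for (const auto& row : rows) {
        for (std::size_t p = 0; p < kPorts; ++p) {
            total[p] += row.port_uops[p];
        }
    }
    return *std::max_element(total.begin(), total.end());
}

// The loop runs no faster than the slower of the two limits: port
// pressure or the latency of the loop-carried dependency chain.
double cycles_per_iteration(const std::vector<InstructionRow>& rows, double carried_latency) {
    return std::max(throughput_bound(rows), carried_latency);
}
```

For a made-up loop with a load, a multiply, an add split across two ports, and a store, the busiest port carries 1.5 uops, so the loop is throughput-bound at 1.5 cycles per iteration unless a carried dependency is longer than that.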
If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
Other sources
Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.
You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because they are more complete, often arrive before the Intel manual is updated, and are easier to use as they provide a spreadsheet and PDF version.
Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.
If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.
Given pipelining, out-of-order processing, microcode, multi-core processors, etc., there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycles/whatever cycles.
If such a reference exists, it will only be able to provide broad generalizations given a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different than the Core 2 Duo which is different than the AMD dual core, etc.
Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing - PIII, PII, and the original pentium were easier, and the texts referenced were probably based on those earlier processors that had a more well-defined instruction timing.
These days people generally use statistical analysis for code timing estimation.
What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.
The exact latencies for Intel's and AMD's processors are listed in Agner Fog's instruction tables. See also Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has PDF manuals on their own website with their official values.
For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).
Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, number of execution units and so on. And of course, these numbers vary for different CPU's. But you can often come up with a reasonable average that more or less works for all CPU's.
It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.
Edit
Looking in Intel's optimization manual, table C-13:
The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.
The throughput columns show how many of this type of instructions can be executed per cycle.
Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. More importantly, lock xchg has hugely variable latency and impact on surrounding code (it is much slower when there's contention with another core), so looking only at the best case is a mistake. (I haven't looked up what each CPUID means, but I assume the .5 values are for Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles.)
I don't really see what you plan to use this information for, but if you know the exact CPU family the code is running on, then adding up the latencies tells you the minimum number of cycles required to execute this sequence of instructions.
Measuring and counting CPU-cycles does not make sense on the x86 anymore.
First off, ask yourself which CPU you're counting cycles for. Core 2? An Athlon? Pentium M? Atom? All these CPUs execute x86 code, but all of them have different execution times. The execution time even varies between different steppings of the same CPU.
The last x86 where cycle-counting made sense was the Pentium-Pro.
Also consider that inside the CPU most instructions are translated into micro-operations (uops) and executed out of order by an internal execution unit that does not even remotely look like an x86. The performance of a single CPU instruction depends on how many resources in the internal execution unit are available.
So the time for an instruction depends not only on the instruction itself but also on the surrounding code.
Anyway: You can estimate the throughput-resource usage and latency of instructions for different processors. The relevant information can be found at the Intel and AMD sites.
Agner Fog has a very nice summary on his website. See the instruction tables for latency, throughput, and uop count. See the microarchitecture PDF to learn how to interpret those.
http://www.agner.org/optimize
But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache line already hot in L1D cache, being a full memory barrier means its impact depends a lot on loads and stores to other addresses in the surrounding code.
Btw - since your example-code is a lock-free datastructure basic building block: Have you considered using the compiler built-in functions? On win32 you can include intrin.h and use functions such as _InterlockedExchange.
That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.
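If you'd rather not depend on a compiler-specific intrinsic, a portable alternative (my substitution, not something the linked article uses) is std::atomic from C++11. This is a minimal sketch; test_and_set and SpinLock are hypothetical names, and on x86 compilers typically implement the exchange with an xchg instruction.

```cpp
#include <atomic>
#include <cassert>

// Atomically stores new_value into target and returns the previous value.
// This plays the role of the inline-asm TestAndSet, but the compiler can
// inline it and optimize the surrounding code.
inline int test_and_set(std::atomic<int>& target, int new_value) {
    return target.exchange(new_value, std::memory_order_seq_cst);
}

// Minimal spinlock built on the exchange, for illustration only.
class SpinLock {
public:
    void lock() {
        while (test_and_set(state_, 1) == 1) {
            // spin until the previous value was 0 (unlocked)
        }
    }
    bool try_lock() { return test_and_set(state_, 1) == 0; }
    void unlock() { state_.store(0, std::memory_order_release); }

private:
    std::atomic<int> state_{0};
};
```

A production lock would normally add a pause/backoff in the spin loop, or simply use std::mutex.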
lock xchg eax, dword ptr [edx]
Note that the lock prefix locks the memory access for all cores; this can take 100 cycles on some multi-core systems, and a cache line will also need to be flushed. It will also stall the pipeline. So I wouldn't worry about the rest.
So optimal performance comes back to tuning your algorithm's critical regions.
Note that on a single core you can optimize this by removing the lock, but it is needed for multi-core.

How to keep input-dependent hot data in registers when using SIMD intrinsics

I am trying to use Intel SIMD intrinsics to accelerate a query-answer program. Suppose query_cnt is input dependent but is always smaller than the SIMD register count (i.e. there are enough SIMD registers to hold the queries). Since queries are the hot data in my application, instead of loading them each time they are needed, may I load them once and keep them in registers?
Suppose queries are float type, and AVX256 is supported. Now I have to use something like:
std::vector<__m256> vec_queries(query_cnt / 8);
for (int i = 0; i < query_cnt / 8; ++i) {
    vec_queries[i] = _mm256_loadu_ps((float const *)(curr_query_ptr));
    curr_query_ptr += 8;
}
I know it is not a good practice since there is potential load/store overhead, but at least there is a slight chance that vec_queries[i] can be optimized so that they can be kept in registers, but I still think it is not a good way.
Any better ideas?
From the code sample you posted, it looks like you're just doing a variable-length memcpy. Depending on what the compiler does, and the surrounding code, you might get better results from just actually calling memcpy. E.g. for aligned copies with a size that's a multiple of 16B, the break-even point between a vector loop and rep movsb is maybe as low as ~128 bytes on Intel Haswell. Check Intel's optimization manual for some implementation notes on memcpy, and a graph of size vs. cycles for a couple of different strategies. (Links in the x86 tag wiki).
You didn't say what CPU, so I'm just assuming recent Intel.
I think you're too worried about registers. Loads that hit in L1 cache are extremely cheap. Haswell (and Skylake) can do two __m256 loads per clock (and a store in the same cycle). Previous to that, Sandybridge/IvyBridge can do two memory operations per clock, with a max of one of them being a store. Or under ideal conditions (256b loads/stores), they can manage 2x 16B loaded and 1x 16B stored per clock. So loading/storing 256b vectors is more expensive than on Haswell, but still very cheap if they're aligned and hot in L1 cache.
I mentioned in comments that GNU C global register variables might be a possibility, but mostly in a "this is technically possible in theory" sense. You probably don't want multiple vector registers dedicated to this purpose for the entire run-time of your program (including library function calls, so you'd have to recompile them).
In reality, just make sure the compiler can inline (or at least see while optimizing) the definitions for every function you use inside any important loops. That way it can avoid having to spill/reload vector regs across function calls (since both the Windows and System V x86-64 ABIs have no call-preserved YMM (__m256) registers).
See Agner Fog's microarch pdf to learn even more about the microarchitectural details of modern CPUs, at least the details that are possible to measure by experiment and tune for.

nested for loop faster after profile guided optimization but with higher cache misses

I have a program that has at its heart a 2D array in the form of a
std::vector<std::vector< int > > grid
And there's a simple double for loop going on that goes somewhat like this:
for(int i=1; i<N-1; ++i)
    for(int j=1; j<N-1; ++j)
        sum += grid[i][j-1] + grid[i][j+1] + grid[i-1][j] + grid[i+1][j] + grid[i][j]*some_float;
With g++ -O3 it runs pretty fast, but for further optimization I profiled with callgrind and see a L1 Cache miss of about 37%, and 33% for LL which is a lot but not too surprising considering the random-ish nature of the computation. So I do a profile-guided optimization a la
g++ -fprofile-generate -O3 ...
./program
g++ -fprofile-use -O3 ...
and the program runs about 48% faster! But the puzzling part: The cache misses have even increased! L1 data cache miss is now 40%, LL same.
How can that be? There are no conditionals in the loop for which prediction could have been optimised and the cache misses are even higher. Yet it is faster.
edit: Alright, here's the sscce: http://pastebin.com/fLgskdQG . Play around with the N for different runtime. Compiled via
g++ -O3 -std=c++11 sscce.cpp
on gcc 4.8.1 under linux.
profile-guided optimization with the commands above. The Callgrind stuff is done with a g++ -g switch and valgrind --tool=callgrind --simulate-cache=yes ./sscce
I noticed only one significant difference between the assembly generated with and without PGO. Without PGO, the sum variable is spilled from a register to memory once per inner-loop iteration. Writing the variable to memory and loading it back could in theory slow things down very significantly. Fortunately, modern processors optimize this with store-to-load forwarding, so the slowdown is not so big. Still, Intel's optimization manual recommends against spilling floating-point variables to memory, especially when they are computed by long-latency operations like floating-point multiplication.
What is really puzzling here is why GCC needs PGO to avoid spilling the register to memory. There are enough unused floating-point registers, and even without PGO the compiler could get all the information necessary for proper optimization from the single source file...
These unnecessary load/store operations explain not only why the PGO code is faster, but also why it has a higher percentage of cache misses. Without PGO, the register is always spilled to the same location in memory, so this additional memory access increases both the number of memory accesses and the number of cache hits, while it does not change the number of cache misses. With PGO we have fewer memory accesses but the same number of cache misses, so their percentage increases.
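The arithmetic behind that last point is easy to check with a small C++ sketch. The access counts below are hypothetical numbers chosen to land near the percentages in the question; they are not measurements from the actual program.

```cpp
#include <cassert>
#include <cmath>

// Fraction of memory accesses that miss the cache.
double miss_rate(long accesses, long misses) {
    return static_cast<double>(misses) / static_cast<double>(accesses);
}

// Miss rate after removing spill accesses that always hit the cache:
// the total goes down, the miss count stays the same.
double miss_rate_without_spills(long accesses, long misses, long spill_accesses) {
    return miss_rate(accesses - spill_accesses, misses);
}
```

With 1000 accesses and 370 misses, the rate is 37%. Removing 75 spill accesses that always hit leaves 925 accesses and the same 370 misses, which is 40%, even though less work is being done.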

Huge performance difference of a C++ program (compiled with GCC) under Mac and Linux

Recently I wrote a small program in C++ (well, to be really honest it's more C plus classes) and tested the performance on both a Mac and Linux machine.
Even though the hardware is comparable, the performance is so different that I really think there is something strange going on.
First of all some details:
Input: about 200MB compressed data
Operations of the program: it decompresses the data, then loads it into memory, and performs many data accesses to compute joins between the data. The program is sequential (no additional threads or processes).
Output: some strings to be displayed on the screen
The code is compiled using GCC 4.8.1 in the Linux machine and GCC 4.8.2 in the Mac machine. In both cases the compiler is called with the arguments:
gcc -c -O3 -fPIC -MD -MF $(patsubst %.o,%.d,$#) //The last three arguments are to create the dependencies between the files
The Mac (OS=mac mavericks 10.9) machine is a macbook pro equipped with a 2,3 GHz Intel core I7 (it's a quadcore) 256KB L2 cache, 6MB L3 cache, 8GB of DDR3 1600Mhz, and a 256 GB SSD disk.
The Linux machine (kernel 2.6.32-358) has a Intel E5-2620 2.0 GHz (it's a sixcore) 16MB cache, 64GB of DDR3 1600Mhz, and a 256 GB SSD disk. Both machines should use the Sandy Bridge architecture (maybe the Mac is ivy bridge, but anyway this shouldn't make a big difference).
Now if I launch the program on the linux machine then it takes 217ms to finish while if I launch it in the Mac machine it takes 132ms: this makes the linux code 1.6 times slower!!
Now, I understand that the two machines have different OSes and hardware, but I find such a slowdown too large to be justified by these factors, and I feel that there must be some other reason behind it.
Notice that these timings were taken after all the data was loaded in memory, and I'm sure the program does not swap to disk during this time. Therefore, I can exclude the SSD as the cause of the problem.
Now, I really don't know what could have caused such a slowdown. The memory is basically equivalent, while the CPU is only a bit slower.
Could it be that GCC produced significantly worse code on Linux than on the Mac?
Could it be that the Linux OS is significantly worse than the Mac?
I find both things hard to believe. Any help?
EDIT:
I realized that I didn't mention how I do the timings: well, I use the boost chrono library, and I measure only the time necessary to invoke the main function. Something like:
time = now();
function();
duration = now() - time;
print(duration);
EDIT2:
After some tests, we managed to reproduce the difference of performance with a much simpler (and silly) program:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

char in1[10000000];
char in2[10000000];

static inline uint64_t rdtscp (void) {
    uint64_t low, high;
    uint64_t aux;
    __asm__ __volatile__ (
        ".byte 0x0f,0x01,0xf9"
        : "=a" (low), "=d" (high), "=c" (aux)
    );
    return low | (high << 32);
}

int main(int argc, char** argv) {
    uint64_t counter = rdtscp();
    for(int i = 0; i < 10000000; ++i) {
        in1[i] = (char)i * 200;
        in2[i] = (char)i * 100;
    }
    int joins = 0;
    for(int j = 0; j < 10000000; ++j) {
        int el = in1[j];
        for(int m = 0; m < 10000000; m++) {
            if (in2[m] == el) {
                joins++;
                break;
            }
        }
    }
    /* cast so the format specifier matches on every platform */
    printf("Joins %d Cycles total %llu\n", joins, (unsigned long long)(rdtscp() - counter));
    return 0;
}
Please don't look at the operations of the program. They make little sense. What we tried to reproduce is a sequence of access to memory and simple operations with them.
We launched this program on the Mac and the output was:
Joins 10000000 Cycles total 589015641
While on the linux machine it was:
Joins 10000000 Cycles total 838198832
Clearly the linux version requires many more CPU cycles, which are probably needed to access the memory. Now the question is: why is the memory access slower?
One reason could be that in1 and in2 don't fit in the CPU caches, and this requires some RAM accesses. As pointed out by Roy Longbottom, the memory in the Linux machine is indeed ECC, and this could be the reason behind the lower performance. If we combine this with the slightly lower CPU speed and the difference between Sandy Bridge and Ivy Bridge, then we probably have a good explanation for such a difference.
Anyway, thanks all for the tips!
Both systems follow the System V AMD64 ABI, so gcc shouldn't make a difference there. Unfortunately, random effects in system performance are rather prevalent nowadays, so you can sometimes get significant performance differences through things as silly as re-ordering link order (cf. Mytkowicz et al., "Producing wrong data without doing anything obviously wrong", http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.163.8395)
Here are some suggestions for how to analyse this that come to mind:
Do more than one run. Personally I take at least 11 runs and compare the median (as well as the various quartiles, but that's probably more than you may care about there). This avoids some of the random effects.
Pipe all output into a file to minimise UI effects.
Check your performance counters. On Linux you can use the perf tool. Check for major-faults, which suggest that you have page faults that need to go to disk (unlikely on multiple runs, of course). Only then can you exclude that the disk makes a difference there. Unfortunately OS X doesn't (to the best of my knowledge) have as easy a way to collect performance counters.
You can experiment with -march to force the same target instruction set (on x86, GCC's -mcpu is only a deprecated synonym for -mtune).
Compare actual cache sizes. dmidecode -t cache does that on the Linux side, but you must be root. Your machines may have relevant differences there.
If your program runs through multiple phases, try benchmarking them individually.
Good luck!
Looking at it another way, the runtime difference is just 85 milliseconds, which is tiny.
What exactly are you measuring? If it's the whole programme runtime including startup and teardown (e.g. using the Unix time command) then the difference might easily be due to the dynamic linkers involved: on Linux at least, your programme will be linked to the system libstdc++ before it's actually executed. If the MacOS dynamic linker is a tiny bit faster (or the programme gets statically linked on the Mac?), this could easily explain the difference.
Or it might even be the time taken to write to the terminal. On Linux, gnome-terminal for example is often seen as "slow" due to its use of anti-aliased fonts and full Unicode support. Does your programme run faster if you use xterm instead? What happens if you redirect the output to /dev/null?
Actually, if you account for the different frequency (which may be critical if your program is CPU-bound rather than memory-bound; you haven't told us what your code does), then the difference is reduced to ~1.43.
However, if one of the CPUs is Ivy Bridge-based, there may be quite some differences. It's true that the architecture didn't change dramatically, but there are some changes that may not be apparent when benchmarking over a large set of applications, yet could be critical on specific ones. In your case you haven't shown any code, but since you're dealing with large memory structures it could be related to one of these:
Adaptive fill policy described here
Dynamic prefetch throttling, mentioned here.
Next-page prefetching, mentioned here
There aren't many details about the actual implementations, but the reverse engineering done on the first is pretty impressive, and the second and third names speak for themselves (you could verify whether this is the problem by disabling the prefetchers on both machines and comparing again). These features may be very critical on some memory-consuming workloads (especially latency-critical ones), but it's hard to tell without knowing how much you rely on your L3 cache.
I'd also suggest making sure that you don't use OS-specific library versions or compiler-version-specific intrinsics; the Apple folks may have done a better job of optimizing some basic operations.
I compiled that code on Linux Ubuntu on my Core 2 Duo PC. I could not get rdtscp to work and used a CPU time counter instead. The program was compiled with just the -O3 option. The key part of the C program and the assembly listing are shown below. This PC can be set to 2.4 GHz or 1.6 GHz, with the default on-demand setting producing varying performance (anything between 1.6 and 2.4 GHz). Results at 1.6 and 2.4 GHz are shown below. I added an extra count (floating point) to discover what was happening. The speeds in joins per second were no different with it.
The result in joins per second was proportional to CPU MHz, which would be unlikely if it depended on main memory speed. Increasing the array and loop counts by 10 and 100 times produced the same joins per second, suggesting that memory speed can be ignored.
So, we are left with relative GHz under Turbo Boost, whether the same machine code is generated (note the alignment directives) and the effects of Sandy Bridge vs Ivy Bridge. With the extra counter, it is possible to count the number of assembly instructions executed, but I got lost counting.
for(j = 0; j < 10000000; ++j) {
    int el = in1[j];
    for(m = 0; m < 10000000; m++) {
        count = count + 1;
        if (in2[m] == el)
        {
            joins++;
            break;
        }
    }
}
.L6:
movzbl in1(%ecx), %edx
xorl %eax, %eax
jmp .L5
.p2align 4,,7
.p2align 3
.L3:
addl $1, %eax
cmpl $10000000, %eax
je .L4
.L5:
cmpb in2(%eax), %dl
fadd %st, %st(1)
jne .L3
addl $1, %ebx
.L4:
addl $1, %ecx
cmpl $10000000, %ecx
jne .L6
Result 2400 MHz
Count 320000000 Joins 10000000 0.4920310 seconds 20.32M joins per second
Result 1600 MHz
Count 320000000 Joins 10000000 0.7400470 seconds 13.51M joins per second
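
The proportionality claim can be checked directly from the two result lines. A minimal sketch of the arithmetic follows; the function names are made up for illustration, and the inputs are the figures reported in the results.

```cpp
#include <cassert>

// Throughput in joins per second for one measured run.
double joins_per_second(double joins, double seconds) {
    assert(seconds > 0.0);
    return joins / seconds;
}

// Measured speedup divided by the clock ratio. A value close to 1.0 means
// throughput scales with the core clock, as expected for CPU-bound code.
double scaling_vs_clock(double fast_rate, double slow_rate,
                        double fast_mhz, double slow_mhz) {
    assert(slow_rate > 0.0 && slow_mhz > 0.0);
    return (fast_rate / slow_rate) / (fast_mhz / slow_mhz);
}
```

With 10000000 joins in 0.4920310 s at 2400 MHz and in 0.7400470 s at 1600 MHz, the measured speedup is about 1.504 against a clock ratio of 1.5, so scaling_vs_clock comes out within half a percent of 1.0.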

Array C[]=A[]*B[] in high-performance calculation

I believe it is usual to have such code in C++
for(size_t i=0;i<ARRAY_SIZE;++i)
A[i]=B[i]*C[i];
One commonly advocated alternative is:
double* pA=A,pB=B,pC=C;
for(size_t i=0;i<ARRAY_SIZE;++i)
*pA++=(*pB++)*(*pC++);
What I am wondering is what the best way of improving this code is, since IMO the following things need to be considered:
CPU cache. How do CPUs fill their caches to get the best hit rate?
I suppose SSE could improve this?
The other thing is, what if the code could be parallelized, e.g. using OpenMP? In this case, the pointer trick may not be available.
Any suggestions would be appreciated!
My g++ 4.5.2 produces absolutely identical code for both loops (having fixed the error in the declaration so that it reads double *pA=A, *pB=B, *pC=C;), and it is
.L3:
movapd B(%rax), %xmm0
mulpd C(%rax), %xmm0
movapd %xmm0, A(%rax)
addq $16, %rax
cmpq $80000, %rax
jne .L3
(where my ARRAY_SIZE was 10000)
The compiler authors know these tricks already. OpenMP and other concurrent solutions are worth investigating, though.
The rules for performance are
not yet
get a target
measure
get an idea of how much improvement is possible and verify it is worthwhile to spend time to get it.
This is even more true for modern processors. About your questions:
simple index-to-pointer mapping is often done by the compilers, and when they don't do it they may have good reasons.
processors are already often optimized for sequential access to the cache: simple code generation will often give the best performance.
SSE can perhaps improve this. But not if you are already bandwidth limited. So we are back to the measure and determine bounds stage.
parallelization: same thing as SSE. Using the multiple cores of a single processor won't help if you are bandwidth limited. Using different processors may help depending on the memory architecture.
manual loop unrolling (suggested in a now deleted answer) is often a bad idea. Compilers know how to do this when it is worthwhile (for instance if they can do software pipelining), and with modern OOO processors it is often not the case (it increases the pressure on instruction and trace caches, while OOO execution, speculation over jumps and register renaming will automatically bring most of the benefit of unrolling and software pipelining).
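To make the "determine bounds" step concrete, here is a rough sketch of a bandwidth bound for A[i]=B[i]*C[i]. It assumes the arrays are too large for the caches and counts only two loads and one store per element, ignoring any extra write-allocate traffic. The function name and the bandwidth figure you pass in are assumptions; the real number has to be measured on the target machine.

```cpp
#include <cassert>
#include <cstddef>

// Lower bound on the run time of A[i] = B[i] * C[i] over n doubles when
// memory bandwidth is the limit: two loads and one store of 8 bytes each.
// Assumption: write-allocate traffic and caching effects are ignored.
double min_seconds_bandwidth_bound(std::size_t n, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    const double bytes_moved = 3.0 * sizeof(double) * static_cast<double>(n);
    return bytes_moved / bytes_per_second;
}
```

For example, with 10 million elements and a hypothetical sustained 10 GB/s, the loop cannot finish in less than about 24 ms, no matter how many SSE lanes or cores are used. If the measured time is already close to that bound, neither SSE nor threads on the same memory bus will help much.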
The first form is exactly the sort of structure that your compiler will recognize and optimize, almost certainly emitting SSE instructions automatically.
For this kind of trivial inner loop, cache effects are irrelevant, because you are iterating through everything. If you have nested loops, or a sequence of operations (like g(f(A,B),C)), then you might try to arrange to access small blocks of memory repeatedly to be more cache-friendly.
Do not unroll the loop by hand. Your compiler will already do that, too, if it is a good idea (which it may not be on a modern CPU).
OpenMP can maybe help if the loop is huge and the operations within are complicated enough that you are not already memory-bound.
In general, write your code in a natural and straightforward way, because that is what your optimizing compiler is most likely to understand.
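If measurements show the loop is not memory-bound, an OpenMP version of the straightforward indexed form could look like the sketch below. The name array_mul_omp is made up for illustration. The pragma is guarded so the code still builds as a serial loop when OpenMP is not enabled (with gcc it is enabled by -fopenmp), and a signed index is used because older OpenMP versions require one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise product a[i] = b[i] * c[i] using plain indexing.
// Each iteration is independent, so the loop can be split across threads.
void array_mul_omp(std::vector<double>& a,
                   const std::vector<double>& b,
                   const std::vector<double>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    double* pa = a.data();
    const double* pb = b.data();
    const double* pc = c.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        pa[i] = pb[i] * pc[i];
    }
}
```

Note that the indexed form is what makes the parallel loop easy to express; the incrementing-pointer variant carries a dependency from one iteration to the next.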
When to start considering SSE or OpenMP? If both of these are true:
If you find that code similar to yours appears 20 times or more in your project:
for (size_t i = 0; i < ARRAY_SIZE; ++i)A[i] = B[i] * C[i];
or some similar operations
If ARRAY_SIZE is routinely bigger than 10 million, or if a profiler tells you that this operation is becoming a bottleneck
Then,
First, make it into a function: void array_mul(double* pa, const double* pb, const double* pc, size_t count){ for (...) }
Second, if you can afford to find a suitable SIMD library, change your function to use it.
Good portable SIMD library
SIMD C++ library
As a side note, if you have a lot of operations that are only slightly more complicated than this, e.g. A[i] = B[i] * C[i] + D[i], then a library which supports expression templates will be useful too.
You can use some easy parallelization method. CUDA is hardware dependent, but SSE is available on almost every CPU. You can also use multiple threads. With multiple threads you can still use the pointer trick, which is not very important. Those simple optimizations can be done by the compiler as well. If you are using Visual Studio 2010 you can use parallel_invoke to execute functions in parallel without dealing with Windows threads. On Linux the pthreads library is quite easy to use.
I think valarrays are specialised for such calculations. I am not sure if it will improve the performance.
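As an illustration of what sits behind such a function, here is a sketch of array_mul written directly with SSE2 intrinsics instead of a SIMD library. This is one possible implementation, not the one any particular library uses. It falls back to a scalar loop when the compiler does not define __SSE2__, and it uses unaligned loads and stores so the caller has no alignment obligations.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// pa[i] = pb[i] * pc[i] for i in [0, count).
void array_mul(double* pa, const double* pb, const double* pc,
               std::size_t count) {
    assert(count == 0 || (pa != nullptr && pb != nullptr && pc != nullptr));
    std::size_t i = 0;
#if defined(__SSE2__)
    // Two doubles per 128-bit register, using unaligned loads and stores.
    for (; i + 2 <= count; i += 2) {
        const __m128d b = _mm_loadu_pd(pb + i);
        const __m128d c = _mm_loadu_pd(pc + i);
        _mm_storeu_pd(pa + i, _mm_mul_pd(b, c));
    }
#endif
    // Scalar tail (and the whole loop when SSE2 is not available).
    for (; i < count; ++i) {
        pa[i] = pb[i] * pc[i];
    }
}
```

Keeping the operation behind one function means the body can later be swapped for a library call or left to the compiler's auto-vectorizer without touching the call sites.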
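For reference, the valarray form of the same operation is a one-liner. The wrapper below is only a sketch with a made-up name, and whether it is faster than the plain loop depends on the standard library implementation, so it needs measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Element-wise product using std::valarray's overloaded operator*.
// The operands must have the same size.
std::valarray<double> multiply(const std::valarray<double>& b,
                               const std::valarray<double>& c) {
    assert(b.size() == c.size());
    std::valarray<double> a = b * c;
    return a;
}
```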