SSE-copy, AVX-copy and std::copy performance - c++

I tried to improve the performance of a copy operation using SSE and AVX:
#include <immintrin.h>
const int sz = 1024;
float *mas = (float *)_mm_malloc(sz*sizeof(float), 32);//32-byte alignment needed for the aligned AVX loads/stores below
float *tar = (float *)_mm_malloc(sz*sizeof(float), 32);
float a=0;
std::generate(mas, mas+sz, [&](){return ++a;});
const int nn = 1000;//Number of iterations in the tester loops
std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3;
//std::copy testing
start1 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
std::copy(mas, mas+sz, tar);
end1 = std::chrono::system_clock::now();
float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
//SSE-copy testing
start2 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=4, _tar+=4)
{
__m128 buffer = _mm_load_ps(_mas);
_mm_store_ps(_tar, buffer);
}
}
end2 = std::chrono::system_clock::now();
float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
//AVX-copy testing
start3 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=8, _tar+=8)
{
__m256 buffer = _mm256_load_ps(_mas);
_mm256_store_ps(_tar, buffer);
}
}
end3 = std::chrono::system_clock::now();
float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
_mm_free(mas);
_mm_free(tar);
It works. However, as the number of iterations in the tester loops (nn) increases, the performance gain of the SIMD copy decreases:
nn=10: SSE-gain=3, AVX-gain=6;
nn=100: SSE-gain=0.75, AVX-gain=1.5;
nn=1000: SSE-gain=0.55, AVX-gain=1.1;
Can anybody explain the reason for this performance drop, and is it advisable to vectorize the copy operation manually?

The problem is that your test does a poor job of mitigating some hardware factors that make benchmarking hard. To check this, I've made my own test case. Something like this:
for blah blah:
sleep(500ms)
std::copy
sse
avx
output:
SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy
So in this case, AVX is a good deal faster than std::copy. What happens when I change the test case to..
for blah blah:
sleep(500ms)
sse
avx
std::copy
Notice that absolutely nothing changed, except the order of the tests.
SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy
Whoa! How is that possible? The CPU takes a while to ramp up to full speed, so tests that run later have an advantage. This question has 3 answers now, including an 'accepted' answer, but only the one with the fewest upvotes was on the right track.
This is one of the reasons why benchmarking is hard and why you should never trust anyone's micro-benchmarks unless they've included detailed information about their setup. It isn't just the code that can go wrong; power-saving features and weird drivers can completely mess up your benchmark. I once measured a factor-of-7 difference in performance by toggling a switch in the BIOS that less than 1% of notebooks offer.
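If you still want a one-shot comparison, a partial mitigation (a sketch only, reusing mas, tar, sz and nn from the question; it does not address turbo or scheduler effects) is an untimed warm-up pass before the first timed section, plus rotating the order of the tests between runs:
//untimed warm-up so the CPU reaches its steady clock before measurement
for(int i=0; i<nn; ++i)
std::copy(mas, mas+sz, tar);
//then start the real measurements as before
start1 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
std::copy(mas, mas+sz, tar);
end1 = std::chrono::system_clock::now();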

This is a very interesting question, but I believe none of the answers so far is correct, because the question itself is misleading.
The title should be changed to "How does one reach the theoretical memory I/O bandwidth?"
No matter what instruction set is used, the CPU is so much faster than RAM that a pure block memory copy is 100% I/O bound. This explains why there is little difference between SSE and AVX performance once the buffers are larger than the caches.
For small buffers hot in L1D cache, AVX can copy significantly faster than SSE on CPUs like Haswell where 256b loads/stores really do use a 256b data path to L1D cache instead of splitting into two 128b operations.
Ironically, the ancient x86 string-copy instruction rep movsb/movsq performs much better than explicit SSE and AVX loops for bulk memory copies!
The article here explains how to saturate memory bandwidth really well, and it has rich references to explore further.
See also Enhanced REP MOVSB for memcpy here on SO, where @BeeOnRope's answer discusses NT stores (and the non-RFO stores done by rep stosb/stosq) vs. regular stores, and how single-core memory bandwidth is often limited by max concurrency / latency, not by the memory controller itself.
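For illustration, here is a minimal sketch of a copy loop using non-temporal (streaming) stores, which avoid the read-for-ownership traffic that regular stores generate. It assumes 32-byte-aligned buffers (as with the question's _mm_malloc, provided 32-byte alignment is requested) and a count that is a multiple of 8 floats; the helper name is made up for this example.
#include <immintrin.h>

//Hypothetical helper: copy with streaming stores (no RFO); src/dst must be 32-byte aligned, n a multiple of 8.
void copy_nt(const float *src, float *dst, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(src + i);
        _mm256_stream_ps(dst + i, v);   //vmovntps: bypasses the cache, no RFO
    }
    _mm_sfence();                       //make the NT stores globally visible before returning
}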

Writing fast SSE is not as simple as using SSE operations in place of their non-parallel equivalents. In this case I suspect your compiler cannot usefully unroll the load/store pair and your time is dominated by stalls caused by using the output of one low-throughput operation (the load) in the very next instruction (the store).
You can test this idea by manually unrolling one notch:
//SSE-copy testing
start2 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=8, _tar+=8)
{
__m128 buffer1 = _mm_load_ps(_mas);
__m128 buffer2 = _mm_load_ps(_mas+4);
_mm_store_ps(_tar, buffer1);
_mm_store_ps(_tar+4, buffer2);
}
}
Normally when using intrinsics I disassemble the output and make sure nothing crazy is going on (you could try this to verify if/how the original loop got unrolled). For more complex loops the right tool to use is the Intel Architecture Code Analyzer (IACA). It's a static analysis tool which can tell you things like "you have pipeline stalls".
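For reference, a rough sketch of how IACA is typically used (hypothetical function name; iacaMarks.h ships with the tool, and you then point the iaca analyzer at the compiled binary): place the start marker inside the loop body and the end marker after the loop, so the analyzed region covers one iteration including the loop branch.
#include <immintrin.h>
#include "iacaMarks.h"   //from the IACA distribution; defines IACA_START / IACA_END

void copy_sse(const float *src, float *dst, int n)
{
    for (int i = 0; i < n; i += 4) {
        IACA_START                        //begin analysed region (one loop iteration)
        __m128 v = _mm_load_ps(src + i);
        _mm_store_ps(dst + i, v);
    }
    IACA_END                              //end analysed region, after the loop branch
}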

I think this is because the measurement is not accurate for such short operations.
When measuring performance on an Intel CPU:
Disable "Turbo Boost" and "SpeedStep". You can do this in the system BIOS.
Change Process/Thread priority to High or Realtime. This will keep your thread running.
Set the process CPU affinity mask to a single core. CPU masking combined with higher priority will minimize context switching.
Use the __rdtsc() intrinsic. Intel Core-series CPUs return the internal clock counter with __rdtsc(); you will get about 3,400,000,000 counts/second from a 3.4 GHz CPU. Note that __rdtsc() by itself does not serialize the pipeline, so pair it with a serializing instruction (e.g. __cpuid()) or use __rdtscp() if you need the timed code to have finished executing.
This is my test-bed startup code for testing SSE/AVX code (Windows-specific; it needs <windows.h>, <intrin.h>, <stdio.h> and <tchar.h>):
#include <windows.h>
#include <intrin.h>
#include <stdio.h>
#include <tchar.h>
int GetMSB(DWORD_PTR dwordPtr)
{
if(dwordPtr)
{
int result = 1;
#if defined(_WIN64)
if(dwordPtr & 0xFFFFFFFF00000000) { result += 32; dwordPtr &= 0xFFFFFFFF00000000; }
if(dwordPtr & 0xFFFF0000FFFF0000) { result += 16; dwordPtr &= 0xFFFF0000FFFF0000; }
if(dwordPtr & 0xFF00FF00FF00FF00) { result += 8; dwordPtr &= 0xFF00FF00FF00FF00; }
if(dwordPtr & 0xF0F0F0F0F0F0F0F0) { result += 4; dwordPtr &= 0xF0F0F0F0F0F0F0F0; }
if(dwordPtr & 0xCCCCCCCCCCCCCCCC) { result += 2; dwordPtr &= 0xCCCCCCCCCCCCCCCC; }
if(dwordPtr & 0xAAAAAAAAAAAAAAAA) { result += 1; }
#else
if(dwordPtr & 0xFFFF0000) { result += 16; dwordPtr &= 0xFFFF0000; }
if(dwordPtr & 0xFF00FF00) { result += 8; dwordPtr &= 0xFF00FF00; }
if(dwordPtr & 0xF0F0F0F0) { result += 4; dwordPtr &= 0xF0F0F0F0; }
if(dwordPtr & 0xCCCCCCCC) { result += 2; dwordPtr &= 0xCCCCCCCC; }
if(dwordPtr & 0xAAAAAAAA) { result += 1; }
#endif
return result;
}
else
{
return 0;
}
}
int _tmain(int argc, _TCHAR* argv[])
{
// Set Core Affinity
DWORD_PTR processMask, systemMask;
GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR)1 << (GetMSB(processMask) - 1) );
// Set Process Priority. you can use REALTIME_PRIORITY_CLASS.
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
DWORD64 start, end;
start = __rdtsc();
// your code here.
end = __rdtsc();
printf("%I64d\n", end - start);
return 0;
}

I think that your main problem/bottleneck is your _mm_malloc.
I highly suggest using std::vector as your main data structure if you are concerned about locality in C++.
Intrinsics are not exactly a "library"; they are more like built-in functions provided by your compiler, so you should be familiar with your compiler's internals/docs before using these functions.
Also note that AVX being newer than SSE doesn't make AVX faster. Whatever you are planning to use, the number of cycles taken by a function is probably more important than the "AVX vs SSE" argument; for example, see this answer.
Try with a POD int array[] or a std::vector.
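A minimal sketch of what that might look like, reusing sz and nn from the question (illustration only; std::vector's default allocator does not guarantee 32-byte alignment, so stick to std::copy or unaligned loads here):
#include <algorithm>
#include <chrono>
#include <numeric>
#include <vector>

std::vector<float> mas(sz), tar(sz);
std::iota(mas.begin(), mas.end(), 1.0f); //1, 2, 3, ...

auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i < nn; ++i)
    std::copy(mas.begin(), mas.end(), tar.begin());
auto t1 = std::chrono::steady_clock::now();
float elapsed = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();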

Related

no speedup using openmp + SIMD

I am new to OpenMP and am now trying to use OpenMP + SIMD intrinsics to speed up my program, but the results are far from what I expected.
In order to simplify the case without losing much essential information, I wrote a simpler toy example:
#include <omp.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <sys/time.h>
#include "immintrin.h" // for SIMD intrinsics
int main() {
int64_t size = 160000000;
std::vector<int> src(size);
// generating random src data
for (int i = 0; i < size; ++i)
src[i] = (rand() / (float)RAND_MAX) * size;
// to store the final results, so size is the same as src
std::vector<int> dst(size);
// get pointers for vector load and store
int * src_ptr = src.data();
int * dst_ptr = dst.data();
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
omp_set_num_threads(4); // you can change thread count here
// only measure the parallel part
struct timeval one, two;
double get_time;
gettimeofday (&one, NULL);
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
// load needed data
vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
// computation part
vec_dst = _mm256_add_epi32(vec_src, vec_op);
vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
vec_dst = _mm256_slli_epi32(vec_dst, 1);
vec_dst = _mm256_add_epi32(vec_dst, vec_src);
vec_dst = _mm256_sub_epi32(vec_dst, vec_src);
// store results
_mm256_storeu_si256((__m256i *)(dst_ptr + i), vec_dst);
}
gettimeofday(&two, NULL);
double oneD = one.tv_sec + (double)one.tv_usec * .000001;
double twoD = two.tv_sec + (double)two.tv_usec * .000001;
get_time = 1000 * (twoD - oneD);
std::cout << "took time: " << get_time << std::endl;
// output one element in case the computation is optimized out
int64_t idx = (int64_t)((rand() / (float)RAND_MAX) * (size - 1));
std::cout << idx << ": " << dst[idx] << std::endl;
return 0;
}
It is compiled using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp test.cpp -o test and the elapsed time of the parallel part is measured. The result is as follows (the median value is picked out of 5 runs each):
1 thread: 92.519
2 threads: 89.045
4 threads: 90.361
The computations seem embarrassingly parallel, as different threads can load the data they need simultaneously given different indices, and the same holds for writing the results, so why is there no speedup?
More information:
I checked the assembly code using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp -S test.cpp and found that vectorized instructions are generated;
To check whether it is memory-bound, I commented out the computation part in the loop, and the measured time decreased to around 60 ms, but it does not change much when I change the thread count from 1 -> 2 -> 4.
Any advice or clue is welcome.
EDIT-1:
Thanks to @JerryCoffin for pointing out the possible cause. I did a Memory Access analysis using VTune. Here are the results:
1-thread: Memory Bound: 6.5%, L1 Bound: 0.134, L3 Latency: 0.039
2-threads: Memory Bound: 18.0%, L1 Bound: 0.115, L3 Latency: 0.015
4-threads: Memory Bound: 21.6%, L1 Bound: 0.213, L3 Latency: 0.003
It is an Intel Core i7-4770 processor with 25.6 GB/s (23 GB/s measured by VTune) max bandwidth. The memory-bound fraction does increase, but I am still not sure if that is the cause. Any advice?
EDIT-2 (just trying to give thorough information, so the appended material may be long but hopefully not tedious):
Thanks for the suggestions from @PaulR and @bazza. I tried 3 ways for comparison. One thing to note is that the processor has 4 cores and 8 hardware threads. Here are the results:
(1) just initialize dst as all zeros in advance: 1 thread: 91.922; 2 threads: 93.170; 4 threads: 93.868 --- seems not effective;
(2) without (1), put the parallel part in an outer loop over 100 iterations, and measure the time of the 100 iterations: 1 thread: 9109.49; 2 threads: 4951.20; 4 threads: 2511.01; 8 threads: 2861.75 --- quite effective except for 8 threads;
(3) based on (2), put one more iteration before the 100 iterations, and measure the time of the 100 iterations: 1 thread: 9078.02; 2 threads: 4956.66; 4 threads: 2516.93; 8 threads: 2088.88 --- similar with (2) but more effective for 8 threads.
It seems more iterations can expose the advantages of OpenMP + SIMD, but the computation / memory-access ratio is unchanged regardless of the loop count, and locality does not seem to be the reason either, since src and dst are too large to stay in any cache, so there is no data reuse between consecutive iterations.
Any advice?
EDIT 3:
To avoid being misleading, one thing needs to be clarified: in (2) and (3), the OpenMP directive is outside the added outer loop:
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int k = 0; k < 100; ++k) {
for (int64_t i = 0; i < size; i += 8) {
......
}
}
i.e. the 100 outer iterations are distributed across the threads, and the inner loop is still processed serially within each thread. So the effective speedup in (2) and (3) might be achieved by enhanced locality among threads.
I did another experiment in which the OpenMP directive is put inside the outer loop:
for (int k = 0; k < 100; ++k) {
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
......
}
}
and the speedup is still not good: 1 thread: 9074.18; 2 threads: 8809.36; 4 threads: 8936.89; 8 threads: 9098.83.
Problem still exists. :(
EDIT-4:
If I replace the vectorized part with scalar operations like this (the same calculations, but in a scalar way):
#pragma omp parallel for
for (int64_t i = 0; i < size; i++) { // not i += 8
int query = src[i];
int res = src[i] + 2;
res = res * query;
res = res << 1;
res = res + query;
res = res - query;
dst[i] = res;
}
The timings are 1 thread: 92.065; 2 threads: 89.432; 4 threads: 88.864. May I conclude that this seemingly embarrassingly parallel problem is actually memory bound (the bottleneck being the load / store operations)? If so, why can't the load / store operations be parallelized well?
May I conclude that this seemingly embarrassingly parallel problem is actually memory bound (the bottleneck being the load / store operations)? If so, why can't the load / store operations be parallelized well?
Yes, this problem is embarrassingly parallel in the sense that it is easy to parallelize due to the lack of dependencies. That doesn't imply it will scale perfectly. You can still have a bad initialization-overhead-to-work ratio or shared resources limiting your speedup.
In your case, you are indeed limited by memory bandwidth. A practical consideration first: when compiled with icpc (16.0.3 or 17.0.1), the "scalar" version yields better code when size is made constexpr. This is not because it optimizes away these two redundant lines:
res = res + query;
res = res - query;
It does, but that makes no difference. Mainly, the compiler uses exactly the same instructions that you do with the intrinsics, except for the store. For the store, it uses vmovntdq instead of vmovdqu, making use of sophisticated knowledge about the program, the memory and the architecture. Not only does vmovntdq require aligned memory and can therefore be more efficient, it also gives the CPU a non-temporal hint, preventing this data from being cached during the write to memory.
This improves performance because writing to the cache requires loading the rest of the cache line from memory first. So while your initial SIMD version requires three memory operations (reading the source, reading the destination cache line, writing the destination), the compiler version with the non-temporal store requires only two. On my i7-4770 system, the compiler-generated version reduces the runtime at 2 threads from ~85.8 ms to 58.0 ms, an almost perfect 1.5x speedup. The lesson here is to trust your compiler unless you know the architecture and instruction set extremely well.
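For reference, a hand-written counterpart of what the compiler does could look roughly like the sketch below (an illustration, not the compiler's actual output). _mm256_stream_si256 needs a 32-byte-aligned destination, which std::vector does not guarantee, so a real version would need an aligned allocator or a scalar prologue/epilogue; vec_op, src_ptr, dst_ptr and size are the names from the question.
#pragma omp parallel for
for (int64_t i = 0; i < size; i += 8) {
    __m256i v = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
    __m256i d = _mm256_add_epi32(v, vec_op);
    d = _mm256_mullo_epi32(d, v);
    d = _mm256_slli_epi32(d, 1);                      //the +src and -src steps cancel, as noted above
    _mm256_stream_si256((__m256i *)(dst_ptr + i), d); //vmovntdq: no RFO read of the destination line
}
_mm_sfence();                                         //order the NT stores before later memory operations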
Considering peak performance: 58 ms for transferring 2*160000000*4 bytes corresponds to 22.07 GB/s (read and write combined), which is about the same as your VTune result. (Funnily enough, 85.8 ms corresponds to about the same bandwidth for two reads and one write.) There isn't much more direct room for improvement.
To further improve performance, you would have to do something about the operation / byte ratio of your code. Remember that your processor can perform 217.6 GFLOP/s (I guess either the same or twice that for integer ops), but can only read & write 3.2 G ints/s. That gives you an idea of how many operations you need to perform per element so as not to be limited by memory. So if you can, work on the data in blocks so that you can reuse data from the caches, as sketched below.
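A generic illustration of that idea (a hypothetical two-pass workload, not this particular kernel, which touches each element only once): fuse the passes per cache-sized block so the second pass finds the block still in cache instead of re-reading it from DRAM.
#include <algorithm>
#include <cstdint>

void two_passes_blocked(const int *src, int *dst, int64_t size)
{
    const int64_t BLOCK = 1 << 15;                               //32768 ints = 128 KiB, roughly L2-sized
    for (int64_t b = 0; b < size; b += BLOCK) {
        const int64_t end = std::min(size, b + BLOCK);
        for (int64_t i = b; i < end; ++i) dst[i] = src[i] + 2;   //pass 1 over the block
        for (int64_t i = b; i < end; ++i) dst[i] *= src[i];      //pass 2 reuses the still-cached block
    }
}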
I cannot reproduce your results for (2) and (3). When I loop around the inner loop, the scaling behaves the same. The results look fishy, particularly in light of the results otherwise being so consistent with peak performance. Generally, I recommend doing the measurement inside the parallel region and leveraging omp_get_wtime, like so:
double one, two;
#pragma omp parallel
{
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
#pragma omp master
one = omp_get_wtime();
#pragma omp barrier
for (int kk = 0; kk < 100; kk++)
#pragma omp for
for (int64_t i = 0; i < size; i += 8) {
...
}
#pragma omp master
{
two = omp_get_wtime();
std::cout << "took time: " << (two-one) * 1000 << std::endl;
}
}
A final remark: desktop processors and server processors have very different characteristics regarding memory performance. On contemporary server processors, you need many more active threads to saturate the memory bandwidth, while on desktop processors a single core can often almost saturate it.
Edit: one more thought about VTune not classifying it as memory-bound. This may be caused by the short computation time compared with the initialization. Try to see what VTune says about the code when it is run in a loop.

Why is std::fill(0) slower than std::fill(1)?

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:
5.8 GiB/s vs 7.5 GiB/s
However, the results are different for smaller data sizes, where fill(0) is faster:
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
This raises the secondary question, why the peak bandwidth of fill(1) is so much lower.
The test system for this was a dual-socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and Intel compiler 17.0.1 (-fast); both give identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. STREAM add with 24 threads gets 85 GiB/s on the system.
I was able to reproduce this effect on a different Haswell dual socket server system, but not any other architecture. For example on Sandy Bridge EP, memory performance is identical, while in cache fill(0) is much faster.
Here is the code to reproduce:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>
using value = int;
using vector = std::vector<value>;
constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;
void __attribute__((noinline)) fill0(vector& v) {
std::fill(v.begin(), v.end(), 0);
}
void __attribute__((noinline)) fill1(vector& v) {
std::fill(v.begin(), v.end(), 1);
}
void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
{
vector v(data_size / (sizeof(value) * nthreads));
auto repeat = write_size / data_size;
#pragma omp barrier
auto t0 = omp_get_wtime();
for (auto r = 0; r < repeat; r++)
fill0(v);
#pragma omp barrier
auto t1 = omp_get_wtime();
for (auto r = 0; r < repeat; r++)
fill1(v);
#pragma omp barrier
auto t2 = omp_get_wtime();
#pragma omp master
std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
<< write_size / (t2 - t1) << "\n";
}
}
int main(int argc, const char* argv[]) {
std::cout << "size,nthreads,fill0,fill1\n";
for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
bench(bytes, 1);
}
for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
bench(bytes, omp_get_max_threads());
}
for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
bench(max_data_size, nthreads);
}
}
Presented results compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.
From your question + the compiler-generated asm from your answer:
fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.
Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
Note that rep movsd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell.
Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only the b variant is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.
gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep movsd for fill(1). Probably not, since I assume the code starts out as a loop, rather than a memset.
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB stores use a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever).
@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and about the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.
movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)
Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).
As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).
movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line fill buffers, not on sending those line-full buffers to DRAM (until you have enough threads to saturate memory bandwidth).
@osgx posted some interesting links in comments:
Agner Fog's asm optimization guide, instruction tables, and microarch guide: http://agner.org/optimize/
Intel optimization guide: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.
NUMA snooping: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
See also other stuff in the x86 tag wiki.
I'll share my preliminary findings, in the hope of encouraging more detailed answers. I just felt this would be too much as part of the question itself.
The compiler optimizes fill(0) to an internal memset. It cannot do the same for fill(1), since memset only works on bytes.
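(A tiny illustration of why, assuming the usual 4-byte little-endian int: an int of value 0 is four 0x00 bytes, which memset can produce, while an int of value 1 is the bytes 0x01 0x00 0x00 0x00, which no repeated single byte can produce.)
#include <cstring>
#include <vector>

std::vector<int> v(1024);
std::memset(v.data(), 0, v.size() * sizeof(int));  //equivalent to std::fill(v.begin(), v.end(), 0)
//std::memset(v.data(), 1, v.size() * sizeof(int)) would set every int to 0x01010101 == 16843009, not 1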
Specifically, both glibc's __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction:
rep stos %al,%es:(%rdi)
Whereas the manual loop compiles down to an actual 128-bit store instruction:
add $0x1,%rax
add $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp %rax,%r8
ja 400f41
Interestingly, while there is a template/header optimization that implements std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop.
Strangely, for a std::vector<char>, gcc also optimizes fill(1). The Intel compiler does not, despite the memset template specialization.
Since this happens only when the code is actually working in memory rather than in cache, it appears the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.
I would appreciate any further insight into the issue and the related micro-architecture details. In particular it is unclear to me why this behaves so differently for four or more threads and why memset is so much faster in cache.
Update:
Here are results comparing with:
fill(1) using -march=native (AVX2 vmovdq %ymm0) - it works better in L1, but is similar to the movaps %xmm0 version for the other memory levels.
Variants with 32-, 128- and 256-bit non-temporal stores. They perform consistently, with the same performance regardless of the data size. All outperform the other variants in memory, especially for small numbers of threads. 128-bit and 256-bit perform exactly the same; for low numbers of threads, 32-bit performs significantly worse.
For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.
Single threaded bandwidth:
Aggregate bandwidth in memory:
Here is the code used for the additional tests with their respective hot-loops:
void __attribute__ ((noinline)) fill1(vector& v) {
std::fill(v.begin(), v.end(), 1);
}
┌─→add $0x1,%rax
│ vmovdq %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rdi,%rax
└──jb e0
void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
for (auto& elem : v) {
_mm_stream_si32(&elem, 1);
}
}
┌─→movnti %ecx,(%rax)
│ add $0x4,%rax
│ cmp %rdx,%rax
└──jne 18
void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
assert((long)v.data() % 32 == 0); // alignment
const __m128i buf = _mm_set1_epi32(1);
size_t i;
int* data;
int* end4 = &v[v.size() - (v.size() % 4)];
int* end = &v[v.size()];
for (data = v.data(); data < end4; data += 4) {
_mm_stream_si128((__m128i*)data, buf);
}
for (; data < end; data++) {
*data = 1;
}
}
┌─→vmovnt %xmm0,(%rdx)
│ add $0x10,%rdx
│ cmp %rcx,%rdx
└──jb 40
void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
assert((long)v.data() % 32 == 0); // alignment
const __m256i buf = _mm256_set1_epi32(1);
size_t i;
int* data;
int* end8 = &v[v.size() - (v.size() % 8)];
int* end = &v[v.size()];
for (data = v.data(); data < end8; data += 8) {
_mm256_stream_si256((__m256i*)data, buf);
}
for (; data < end; data++) {
*data = 1;
}
}
┌─→vmovnt %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rcx,%rdx
└──jb 40
Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsic confusing the optimizer.

For loop performance and multithreaded performance questions

I was kind of bored, so I wanted to try using std::thread and eventually measure the performance of a single- and multithreaded console application. This is a two-part question. I started with a single-threaded sum of a massive vector of ints (800,000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range-based and iterator-based for loops and measured them the same way with chrono::high_resolution_clock.
for (auto& val : ints)
sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug build, so I switched to release and the difference was ~1 ms in favor of the index-based for loop. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference of 1 ms in release mode?
I moved on to thread creation and tried to do a parallel sum of the array with this lambda (capturing everything by reference so I could use the vector of ints and a previously declared mutex), using an index-based for loop.
auto func = [&](int start, int total, int index)
{
int partial_sum = 0;
auto s = chrono::high_resolution_clock::now();
for (int i = start; i < start + total; ++i)
partial_sum += ints[i];
auto e = chrono::high_resolution_clock::now();
auto d = e - s;
m.lock();
cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
sum += partial_sum;
m.unlock();
};
for (int i = 0; i < 8; ++i)
threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? A solution with 2 threads also ended up at ~30 ms. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 hardware threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry, I forgot to mention this was on Windows 7 with Visual Studio 2013 and its v120 toolset.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but actually called. For template code, you really need to turn on optimisation to avoid all the code being taken literally. For example, it's likely that your iterator code will call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which calls operator== and so on and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, switch into turbo mode, switch into heat-protection mode. And rounding the times to milliseconds isn't really helpful: one thread could take 6.49 milliseconds and another 6.51, and they get rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls, and function calls have a certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
doesn't use the result of the loop at all, so when optimizing, the compiler may simply choose to throw away the code, resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1 ms may be caused by the coarse granularity of high_resolution_clock in the standard library implementation used, and by differences in process scheduling during execution. I measured the index-based for loop as 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows, I would like to draw your attention to your available execution units and to context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS, admittedly most of which are asleep, that need time and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check whether Visual Studio has iterator checking enabled for debug builds. When it's enabled, every time an iterator is used it is bounds-checked, and so on. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the /openmp compiler switch. If you enable it and apply OpenMP #pragmas to your for loops, you can do away with all the manual thread creation; see the sketch below. I toyed around with similar threading tests (because it's cool :) ) and OpenMP's performance is pretty damn good.
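A minimal sketch of what that could look like for the sum (hypothetical; ints is the vector from the question, and the build needs OpenMP enabled):
long long sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < static_cast<int>(ints.size()); ++i)
    sum += ints[i];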
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much of what would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing, and the 8-thread time totals being so much greater, which as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they are instantiated (I see no thread.start() or threads.StartAll() or the like). So, while the first thread instance becomes active, the main thread might (or might not) be preempted. There is also no guarantee that the threads are spawned on individual cores (thread affinity).
If you have a closer look at the POSIX APIs, there is the notion of an "application context" and a "system context", which basically implies that there might be an OS policy in place that does not use all cores for one application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out that, typically, going parallel starts to pay off when the computation time is well beyond the length of a time slice (rule of thumb).
Here is my "native" Windows implementation, which - for a large enough array - finally makes the threads win over a single-threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here is the output of this code on my machine:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel urged to mention that, the way this code is written, it is more a memory I/O performance benchmark than a core CPU computation benchmark.
Better computation benchmarks would use small amounts of data that are local, fit into the CPU caches, etc.
Maybe it would be interesting to experiment with how the data is split into ranges. What if each thread "jumped" over the data from start to end with a stride of ncores, i.e. thread 1: 0, 8, 16, ...; thread 2: 1, 9, 17, ...; etc.? Maybe then the "locality" of the memory accesses could gain extra speed. A sketch of that split follows below.
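A minimal sketch of that speculative interleaved split, in the style of the Sum() helper above (whether it actually helps depends on the hardware prefetchers; note that with this layout adjacent threads touch the same cache lines, which could also hurt):
//Thread t sums elements t, t + ncores, t + 2*ncores, ...
static int32_t SumStrided(const int32_t *data, size_t length, size_t t, size_t ncores)
{
    int32_t sum = 0;
    for (size_t i = t; i < length; i += ncores)
    {
        sum += data[i];
    }
    return sum;
}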

Improve OpenMP/SSE parallelization effect

I'm trying to improve the performance of a routine via OpenMP (parallel for) and SSE intrinsics:
void Tester::ProcessParallel()//ProcessParallel is member of Tester class
{
//Initialize
auto OutMapLen = this->_OutMapLen;
auto KernelBatchLen = this->_KernelBatchLen;
auto OutMapHeig = this->_OutMapHeig;
auto OutMapWid = this->_OutMapWid;
auto InpMapWid = this->_InpMapWid;
auto NumInputMaps = this->_NumInputMaps;
auto InpMapLen = this->_InpMapLen;
auto KernelLen = this->_KernelLen;
auto KernelHeig = this->_KernelHeig;
auto KernelWid = this->_KernelWid;
auto input_local = this->input;
auto output_local = this->output;
auto weights_local = this->weights;
auto biases_local = this->biases;
auto klim = this->_klim;
#pragma omp parallel for firstprivate(OutMapLen,KernelBatchLen,OutMapHeig,OutMapWid,InpMapWid,NumInputMaps,InpMapLen,KernelLen,KernelHeig,KernelWid,input_local,output_local,weights_local,biases_local,klim)
for(auto i=0; i<_NumOutMaps; ++i)
{
auto output_map = output_local + i*OutMapLen;
auto kernel_batch = weights_local + i*KernelBatchLen;
auto bias = biases_local + i;
for(auto j=0; j<OutMapHeig; ++j)
{
auto output_map_row = output_map + j*OutMapWid;
auto inp_row_idx = j*InpMapWid;
for(auto k=0; k<OutMapWid; ++k)
{
auto output_nn = output_map_row + k;
*output_nn = *bias;
auto inp_cursor_idx = inp_row_idx + k;
for(int _i=0; _i<NumInputMaps; ++_i)
{
auto input_cursor = input_local + _i*InpMapLen + inp_cursor_idx;
auto kernel = kernel_batch + _i*KernelLen;
for(int _j=0; _j<KernelHeig; ++_j)
{
auto kernel_row_idx = _j*KernelWid;
auto inp_row_cur_idx = _j*InpMapWid;
int _k=0;
for(; _k<klim; _k+=4)//unroll and vectorize
{
float buf;
__m128 wgt = _mm_loadu_ps(kernel+kernel_row_idx+_k);
__m128 inp = _mm_loadu_ps(input_cursor+inp_row_cur_idx+_k);
__m128 prd = _mm_dp_ps(wgt, inp, 0xf1);
_mm_store_ss(&buf, prd);
*output_nn += buf;
}
for(; _k<KernelWid; ++_k)//residual loop
*output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
}
}
}
}
}
}
Pure unrolling and SSE vectorization (without OpenMP) of the last nested loop improves total performance ~1.3 times - a pretty nice result. However, pure OpenMP parallelization (without unrolling/vectorization) of the outer loop gives only a ~2.1x performance gain on a 4-core / 8-thread processor (Core i7-2600K). In total, SSE vectorization and OpenMP parallel for together show a 2.3-2.7x performance gain. How can I boost the OpenMP parallelization effect in the code above?
Interesting: if I replace the klim variable - the bound of the unrolled last loop - with a literal constant, say 4, the total performance gain rises to 3.5x.
Vectorisation and threading do not work orthogonally (with respect to speeding up the calculations) in most cases, i.e. their speed-ups do not necessarily add up. What's worse is that this happens mostly in cases like yours, where data is being processed in a streaming fashion. The reason for that is simple - finite memory bandwidth. A very simple measure of whether this is the case is the so-called computational intensity (CI), defined as the amount of data processing (usually in FLOPs) performed per byte of input data.
In your case, you load two XMM registers, which makes 32 bytes of data in total, then perform one dot-product operation. Let's have your code running on a 2 GHz Sandy Bridge CPU. Although DPPS takes a full 12 cycles to complete on SNB, the CPU is able to overlap several such instructions and retire one every 2 cycles. Therefore at 2 GHz each core could perform 1 billion dot products per second in a tight loop. It would require 32 GB/s of memory bandwidth to keep such a loop busy. The actual bandwidth needed in your case is less, since there are other instructions in the loop, but the main idea remains - the processing rate of the loop is limited by the amount of data that the memory is able to feed to the core.
As long as all the data fits into the last-level cache (LLC), performance more or less scales with the number of threads, as the LLC usually provides fairly high bandwidth (e.g. 300 GB/s on Xeon 7500's, as stated here). This is not the case once the data grows big enough not to fit into the cache, as main memory usually provides an order of magnitude less bandwidth per memory controller. In the latter case all cores have to share the limited memory speed, and once it is saturated, adding more threads does not increase the speed-up. Only adding more bandwidth, e.g. having a system with several CPU sockets, would result in increased processing speed.
There is a theoretical model, called the Roofline model, that captures this in a more formal way. You can see some explanations and applications of the model in this presentation.
The bottom line is: both vectorisation and multiprocessing (e.g. threading) increase the performance but also increase the memory pressure. As long as the memory bandwidth is not saturated, both result in increased processing rate. Once the memory becomes the bottleneck, performance does not increase any more. There are even cases when multithreaded performance drops because of the additional pressure put by vectorisation.
Possibly an optimisation hint: the store to *output_nn might not get optimised since output_nn ultimately points inside a shared variable. Therefore you might try something like:
for(auto k=0; k<OutMapWid; ++k)
{
auto output_nn = output_map_row + k;
auto _output_nn = *bias;
auto inp_cursor_idx = inp_row_idx + k;
for(int _i=0; _i<NumInputMaps; ++_i)
{
...
for(int _j=0; _j<KernelHeig; ++_j)
{
...
for(; _k<klim; _k+=4)//unroll and vectorize
{
...
_output_nn += buf;
}
for(; _k<KernelWid; ++_k)//residual loop
_output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
}
}
*output_nn = _output_nn;
}
But I guess your compiler is smart enough to figure this out by itself. Anyway, this would only matter in the single-threaded case. Once you are in the saturated-memory-bandwidth region, no such optimisations would matter.

Optimize check for a bit-vector being a proper subset of another?

I would like some help optimizing the most computationally intensive function of my program.
Currently, I am finding that the basic (non-SSE) version is significantly faster (up to 3x). I would thus request your help in rectifying this.
The function looks for subsets in unsigned integer vectors, and reports if they exist or not. For your convenience I have included the relevant code snippets only.
First up is the basic variant. It checks to see if blocks_ is a proper subset of x.blocks_. (Not exactly equal.) These are bitmaps, aka bit vectors or bitsets.
//Check for self comparison
if (this == &x)
return false;
//A subset is equal to or smaller.
if (no_bits_ > x.no_bits_)
return false;
int i;
bool equal = false;
//Pointers should not change.
const unsigned int *tptr = blocks_;
const unsigned int *xptr = x.blocks_;
for (i = 0; i < no_blocks_; i++, tptr++, xptr++) {
if ((*tptr & *xptr) != *tptr)
return false;
if (*tptr != *xptr)
equal = true;
}
return equal;
Then comes the SSE variant, which alas does not perform according to my expectations. Both of these snippets should look for the same things.
//starting pointers.
const __m128i* start = (__m128i*)&blocks_;
const __m128i* xstart = (__m128i*)&x.blocks_;
__m128i block;
__m128i xblock;
//Unsigned ints are 32 bits, meaning 4 can fit in a register.
for (i = 0; i < no_blocks_; i+=4) {
block = _mm_load_si128(start + i);
xblock = _mm_load_si128(xstart + i);
//Equivalent to (block & xblock) != block
if (_mm_movemask_epi8(_mm_cmpeq_epi32(_mm_and_si128(block, xblock), block)) != 0xffff)
return false;
//Equivalent to block != xblock
if (_mm_movemask_epi8(_mm_cmpeq_epi32(block, xblock)) != 0xffff)
equal = true;
}
return equal;
Do you have any suggestions as to how I may improve upon the performance of the SSE version? Am I doing something wrong? Or is this a case where optimization should be done elsewhere?
I have not yet added in the leftover calculations for no_blocks_ % 4 != 0, but there is little purpose in doing so until the performance increases, and it would only clutter up the code at this point.
There are three possibilities I see here.
First, your data might not suit wide comparisons. If there's a high chance that (*tptr & *xptr) != *tptr within the first few blocks, the plain C++ version will almost certainly always be faster. In that instance, your SSE will run through more code & data to accomplish the same thing.
Second, your SSE code may be incorrect. It's not totally clear here. If no_blocks_ is identical between the two samples, then start + i is probably having the unwanted behavior of indexing into 128-bit elements, not 32-bit as the first sample.
Third, SSE really likes it when instructions can be pipelined, and this is such a short loop that you might not be getting that. You can reduce branching significantly here by processing more than one SSE block at once.
Here's a quick untested shot at processing 2 SSE blocks at once. Note I've removed the block != xblock branch entirely by keeping the state outside of the loop and only testing at the end. In total, this moves things from 1.3 branches per int to 0.25.
bool equal(unsigned const *a, unsigned const *b, unsigned count)
{
__m128i eq1 = _mm_setzero_si128();
__m128i eq2 = _mm_setzero_si128();
for (unsigned i = 0; i != count; i += 8)
{
__m128i xa1 = _mm_load_si128((__m128i const*)(a + i));
__m128i xb1 = _mm_load_si128((__m128i const*)(b + i));
eq1 = _mm_or_si128(eq1, _mm_xor_si128(xa1, xb1));
xa1 = _mm_cmpeq_epi32(xa1, _mm_and_si128(xa1, xb1));
__m128i xa2 = _mm_load_si128((__m128i const*)(a + i + 4));
__m128i xb2 = _mm_load_si128((__m128i const*)(b + i + 4));
eq2 = _mm_or_si128(eq2, _mm_xor_si128(xa2, xb2));
xa2 = _mm_cmpeq_epi32(xa2, _mm_and_si128(xa2, xb2));
if (_mm_movemask_epi8(_mm_packs_epi32(xa1, xa2)) != 0xFFFF)
return false;
}
return _mm_movemask_epi8(_mm_or_si128(eq1, eq2)) != 0;
}
If you've got enough data and a low probability of failure within the first few SSE blocks, something like this should be at least somewhat faster than your SSE.
It seems that your problem is memory-bandwidth bound:
Asymptotically, you need about 2 operations to process each pair of integers scanned from memory. There is not enough arithmetic complexity to take advantage of the higher arithmetic throughput of the CPU's SSE instructions. In fact, your CPU spends a lot of time waiting for data transfers.
Moreover, using SSE instructions in your case adds instruction overhead, and the resulting code is not well optimized by the compiler.
There are some alternative strategies to improve performance in a bandwidth-bound problem:
Multi-threading: hide memory-access latency behind concurrent arithmetic operations, e.g. in a hyper-threading context (a sketch follows below).
Fine-tune the amount of data loaded at a time to improve memory-bandwidth utilization.
Improve pipeline continuity by adding extra independent operations to the loop (e.g. scan two different sets of data at each step of your "for" loop).
Keep more data in cache or in registers (some iterations of your code may need the same data many times).
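A minimal sketch of the first strategy, assuming a plain chunked split is acceptable and the bitmaps are large enough to amortize thread start-up (the function name and helper are hypothetical, not from your code):
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

//Returns true if a is a proper subset of b, scanning n words with nthreads threads.
bool is_proper_subset_mt(const unsigned *a, const unsigned *b, int n, int nthreads)
{
    std::atomic<bool> subset{true}, equal{false};
    std::vector<std::thread> pool;
    const int chunk = (n + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([=, &subset, &equal] {
            const int end = std::min(n, (t + 1) * chunk);
            for (int i = t * chunk; i < end && subset.load(std::memory_order_relaxed); ++i) {
                if ((a[i] & b[i]) != a[i]) { subset.store(false); return; } //not a subset: stop all threads
                if (a[i] != b[i]) equal.store(true);                        //at least one differing word
            }
        });
    for (auto &th : pool) th.join();
    return subset.load() && equal.load();
}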