In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking there is one gigantic array which is divided up into 16 word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}

// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}

void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}

int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);
    printf("Total size: %.1f MiB\n", total_MiB);
    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB); // (1)
    // Run it
    t_ms = clz_arrays(counts, buf, n); // (2)
    print_stats(t_ms, n, total_MiB);
    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;
    // And again...
    (void) clz_arrays(counts, buf, n); // (3)
    t_ms = clz_arrays(counts, buf, n); // (4)
    print_stats(t_ms, n, total_MiB);
    free(counts);
    free(buf);
    return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz" which has a maximum turbo frequency of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of 1 operation per cycle (see Agner Fog's Skylake instruction tables) so, with 8-byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GB/s, which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this; it's much worse on many-core Xeon chips than on quad-core desktop/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you were adding extra print_stats to show the warm-up run performance, that would have made the run after slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
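For instance, here is a minimal sketch of both suggestions, reusing the variables from main above (the memset requires #include <string.h>; the pass count of 8 is an arbitrary choice):

// Touch every page so each one is backed by its own physical page (breaking the
// copy-on-write mapping to the shared zero page), then time several passes over
// the same buffer inside one measurement.
memset(buf, 0xab, total_bytes);

const int passes = 8;            // arbitrary; enough to amortise page-walk costs
double t_ms_total = 0.0;
for (int p = 0; p < passes; ++p)
    t_ms_total += clz_arrays(counts, buf, n);
print_stats(t_ms_total / passes, n, total_MiB);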
clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
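For example, a minimal sketch of such a timer using clock_gettime(CLOCK_MONOTONIC) (my own illustration, not part of the original program); on Linux this is normally serviced by the vDSO via RDTSC, so no switch to kernel mode happens:

#include <time.h>

// Wall-clock timestamp in milliseconds, from the monotonic clock.
static double wall_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

// Usage: double t0 = wall_ms(); /* ... work ... */ double elapsed_ms = wall_ms() - t0;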
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.
Related
The program
I have a C++ program that looks something like the following:
<load data from disk, etc.>
// Get some buffers aligned to 4 KiB
double* const x_a = static_cast<double*>(std::aligned_alloc(......));
double* const p = static_cast<double*>(std::aligned_alloc(......));
double* const m = static_cast<double*>(std::aligned_alloc(......));
double sum = 0.0;
const auto timerstart = std::chrono::steady_clock::now();
for(uint32_t i = 0; i<reps; i++){
    uint32_t pos = 0;
    double factor;
    if((i%2) == 0) factor = 1.0; else factor = -1.0;
    for(uint32_t j = 0; j<xyzvec.size(); j++){
        pos = j*basis::ndist; //ndist is a compile-time constant == 36
        for(uint32_t k =0; k<basis::ndist; k++) x_a[k] = distvec[k+pos];
        sum += factor*basis::energy(x_a, &coeff[0], p, m);
    }
}
const auto timerstop = std::chrono::steady_clock::now();
<free memory, print stats, etc.>
where reps is a single digit number, xyzvec has ~15k elements, and a single call to basis::energy(...) takes about 100 µs to return. The energy function is huge in terms of code size (~5 MiB of source code that looks something like this, it's from a code generator).
Edit: The m array is somewhat large, ~270 KiB for this test case.
Edit 2: Source code of the two functions responsible for ~90% of execution time
All of the pointers entering energy are __restrict__-qualified and declared to be aligned via __assume_aligned(...), the object files are generated with -Ofast -march=haswell to allow the compiler to optimize and vectorize at will. Profiling suggests the function is currently frontend-bound (L1i cache miss, and fetch/decode).
energy does no dynamic memory allocation or IO, and mostly reads/writes x_a, m and p (x_a is const), all of which are aligned to 4k page boundaries. Its execution time ought to be pretty consistent.
The strange timing behaviour
Running the program many times, and looking at the time elapsed between the timer start/stop calls above, I have found it to have a strange bimodal distribution.
Calls to energy are either "fast" or "slow", fast ones take ~91 µs, slow ones take ~106 µs on an Intel Skylake-X 7820X.
All calls to energy in a given process are either fast or slow; the metaphorical coin is flipped once, when the process starts.
The outcome is not quite random, and can be heavily biased towards the "fast" case by purging all kernel caches via echo 3 | sudo tee /proc/sys/vm/drop_caches immediately before execution.
The random effect may be CPU dependent. Running the same executable on a Ryzen 1700X yields both faster and much more consistent execution. The "slow" runs either don't happen or their prominence is much reduced. Both machines are running the same OS. (Ubuntu 20.04 LTS, 5.11.0-41-generic kernel, mitigations=off)
What could be the cause?
Data alignment (dubious, the arrays intensively used are aligned)
Code alignment (maybe, but I have tried printing the function pointer of energy, no correlation with speed)
Cache aliasing?
JCC erratum?
Interrupts, scheduler activity?
Some cores turbo boosting higher? (probably not, tried launching it bound to a core with taskset and tried all cores one by one, could not find one that was always "fast")
???
Edit
Zero-filling x_a, p and m before first use appears to make no difference to the timing pattern.
Replacing (i % 2) with factor *= -1.0 appears to make no difference to the timing pattern.
I want to measure duration of a piece of code with a std::chrono clock, but it seems too heavy to measure something that lasts nanoseconds. That program:
#include <cstdio>
#include <chrono>
int main() {
    using clock = std::chrono::high_resolution_clock;
    // try several times
    for (int i = 0; i < 5; i++) {
        // two consecutive now() calls, one right after the other with nothing in between
        printf("%dns\n", (int)std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - clock::now()).count());
    }
    return 0;
}
Always gives me around 100-300ns. Is this because of two syscalls? Is it possible to have a smaller duration between two now() calls? Thanks!
Environment: Linux Ubuntu 18.04, kernel 4.18, load average is low, stdlib is linked dynamically.
Use rdtsc instruction to measure times with the highest resolution and the least overhead possible:
#include <iostream>
#include <cstdint>
int main() {
    uint64_t a = __builtin_ia32_rdtsc();
    uint64_t b = __builtin_ia32_rdtsc();
    std::cout << b - a << " cpu cycles\n";
}
Output:
19 cpu cycles
To convert the cycles to nanoseconds divide cycles by the base CPU frequency in GHz. For example, for a 4.2GHz i7-7700k divide by 4.2.
TSC is a global counter in the CPU shared across all cores.
Modern CPUs have a constant TSC that ticks at the same rate regardless of the current CPU frequency and boost. Look for constant_tsc in /proc/cpuinfo, flags field.
Also note that __builtin_ia32_rdtsc is more efficient than inline assembly, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48877
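If you do want nanoseconds rather than cycles, one option (a sketch of mine, not from the answer above) is to calibrate the TSC once against std::chrono::steady_clock and reuse the ratio; the 10 ms calibration window is an arbitrary choice and assumes a constant TSC:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Estimate TSC ticks per nanosecond by counting ticks over a short steady_clock interval.
static double tsc_ticks_per_ns() {
    using namespace std::chrono;
    const auto t0 = steady_clock::now();
    const uint64_t c0 = __builtin_ia32_rdtsc();
    std::this_thread::sleep_for(milliseconds(10));   // arbitrary calibration window
    const uint64_t c1 = __builtin_ia32_rdtsc();
    const auto t1 = steady_clock::now();
    return double(c1 - c0) / duration_cast<nanoseconds>(t1 - t0).count();
}

int main() {
    const double ticks_per_ns = tsc_ticks_per_ns();
    const uint64_t a = __builtin_ia32_rdtsc();
    const uint64_t b = __builtin_ia32_rdtsc();
    std::cout << (b - a) / ticks_per_ns << " ns (approx.)\n";
}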
If you want to measure the duration of very fast code snippets, it is generally a good idea to run them multiple times and take the average time over all runs; the ~200ns you mention will then be negligible because it is amortized over all runs.
Example:
#include <cstdio>
#include <chrono>
using clock = std::chrono::high_resolution_clock;

auto start = clock::now();
int n = 10000; // adjust depending on the expected runtime of your code
for (int i = 0; i < n; ++i)
    functionYouWantToTime();
auto result =
    std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start).count() / n;
Just do not use time-of-day clocks for nanosecond benchmarks. Instead, use CPU ticks - on any hardware modern enough to worry about nanoseconds, CPU ticks are monotonic, steady and synchronized between cores.
Unfortunately, C++ does not expose a CPU tick clock, so you'd have to use the RDTSC instruction directly (it can be nicely wrapped in an inline function, or you can use the compiler's intrinsics). The difference in CPU ticks could also be converted into time if you so desire (by using the CPU frequency), but normally for such low-latency benchmarks it is not necessary.
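A minimal sketch of such a wrapper (my own, using the GCC/Clang intrinsics from <x86intrin.h>; the fences keep the out-of-order core from moving the measured work outside the two timestamp reads):

#include <cstdint>
#include <x86intrin.h>

// Read the TSC with a fence on either side so the measured code cannot be
// reordered across the timestamp read.
static inline uint64_t fenced_rdtsc() {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

// Usage sketch:
//   uint64_t start  = fenced_rdtsc();
//   code_under_test();
//   uint64_t cycles = fenced_rdtsc() - start;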
In a self-educational project I measure the bandwidth of the memory with help of the following code (here paraphrased, the whole code follows at the end of the question):
unsigned int doit(const std::vector<unsigned int> &mem){
    const size_t BLOCK_SIZE=16;
    size_t n = mem.size();
    unsigned int result=0;
    for(size_t i=0;i<n;i+=BLOCK_SIZE){
        result+=mem[i];
    }
    return result;
}
//... initialize mem, result and so on
int NITER = 200;
//... measure time of
for(int i=0;i<NITER;i++)
    result += doit(mem);
BLOCK_SIZE is chosen in such a way that a whole 64-byte cache line is fetched per single integer addition. My machine (an Intel Broadwell) needs about 0.35 nanoseconds per integer addition, so the code above could saturate a bandwidth as high as 182 GB/s (this value is just an upper bound and is probably quite off; what is important is the ratio of bandwidths for different sizes). The code is compiled with g++ and -O3.
Varying the size of the vector, I can observe expected bandwidths for L1(*)-, L2-, L3-caches and the RAM-memory:
However, there is an effect I'm really struggling to explain: the collapse of the measured bandwidth of L1-cache for sizes around 2 kB, here in somewhat higher resolution:
I could reproduce the results on all machines I have access to (which have Intel-Broadwell and Intel-Haswell processors).
My question: What is the reason for the performance-collapse for memory-sizes around 2 KB?
(*) I hope I understand correctly that for the L1 cache not 64 bytes but only 4 bytes per addition are read/transferred (there is no further, faster cache where a cache line must be filled), so the plotted bandwidth for L1 is only an upper limit and not the bandwidth itself.
Edit: When the step size in the inner for-loop is chosen to be
8 (instead of 16) the collapse happens for 1KB
4 (instead of 16) the collapse happens for 0.5KB
i.e. when the inner loop consists of about 31-35 steps/reads. That means the collapse isn't due to the memory-size but due to the number of steps in the inner loop.
It can be explained with branch misses as shown in @user10605163's great answer.
Listing for reproducing the results
bandwidth.cpp:
#include <vector>
#include <chrono>
#include <iostream>
#include <algorithm>
//returns minimal time needed for one execution in seconds:
template<typename Fun>
double timeit(Fun&& stmt, int repeat, int number)
{
std::vector<double> times;
for(int i=0;i<repeat;i++){
auto begin = std::chrono::high_resolution_clock::now();
for(int i=0;i<number;i++){
stmt();
}
auto end = std::chrono::high_resolution_clock::now();
double time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9/number;
times.push_back(time);
}
return *std::min_element(times.begin(), times.end());
}
const int NITER=200;
const int NTRIES=5;
const size_t BLOCK_SIZE=16;
struct Worker{
std::vector<unsigned int> &mem;
size_t n;
unsigned int result;
void operator()(){
for(size_t i=0;i<n;i+=BLOCK_SIZE){
result+=mem[i];
}
}
Worker(std::vector<unsigned int> &mem_):
mem(mem_), n(mem.size()), result(1)
{}
};
double PREVENT_OPTIMIZATION=0.0;
double get_size_in_kB(int SIZE){
return SIZE*sizeof(int)/(1024.0);
}
double get_speed_in_GB_per_sec(int SIZE){
std::vector<unsigned int> vals(SIZE, 42);
Worker worker(vals);
double time=timeit(worker, NTRIES, NITER);
PREVENT_OPTIMIZATION+=worker.result;
return get_size_in_kB(SIZE)/(1024*1024)/time;
}
int main(){
int size=BLOCK_SIZE*16;
std::cout<<"size(kB),bandwidth(GB/s)\n";
while(size<10e3){
std::cout<<get_size_in_kB(size)<<","<<get_speed_in_GB_per_sec(size)<<"\n";
size=(static_cast<int>(size+BLOCK_SIZE)/BLOCK_SIZE)*BLOCK_SIZE;
}
//ensure that nothing is optimized away:
std::cerr<<"Sum: "<<PREVENT_OPTIMIZATION<<"\n";
}
create_report.py:
import sys
import pandas as pd
import matplotlib.pyplot as plt
input_file=sys.argv[1]
output_file=input_file[0:-3]+'png'
data=pd.read_csv(input_file)
labels=list(data)
plt.plot(data[labels[0]], data[labels[1]], label="my laptop")
plt.xlabel(labels[0])
plt.ylabel(labels[1])
plt.savefig(output_file)
plt.close()
Building/running/creating report:
>>> g++ -O3 -std=c++11 bandwidth.cpp -o bandwidth
>>> ./bandwidth > report.txt
>>> python create_report.py report.txt
# image is in report.png
I changed the values slightly: NITER = 100000 and NTRIES=1 to get a less noisy result.
I don't have a Broadwell available right now; however, I tried your code on my Coffee Lake and got a performance drop, not at 2KB, but around 4.5KB. In addition, I find erratic behavior of the throughput slightly above 2KB.
The blue line in the graph corresponds to your measurement (left axis):
The red line here is the result from perf stat -e branch-instructions,branch-misses, giving the fraction of branches that were not correctly predicted (in percent, right axis). As you can see there is a clear anti-correlation between the two.
Looking into the more detailed perf report, I found that basically all of these branch mispredictions happen in the innermost loop in Worker::operator(). If the taken/not-taken pattern for the loop branch becomes too long, the branch predictor will not be able to keep track of it, so the exit branch of the inner loop will be mispredicted, leading to the sharp drop in throughput. With a further increasing number of iterations, the impact of this single mispredict becomes less significant, leading to the slow recovery of the throughput.
For further information on the erratic behavior before the drop see the comments made by #PeterCordes below.
In any case, the best way to avoid branch mispredictions is to avoid branches, and so I manually unrolled the loop in Worker::operator(), e.g.:
void operator()(){
    for(size_t i=0;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4){
        result+=mem[i];
        result+=mem[i+BLOCK_SIZE];
        result+=mem[i+2*BLOCK_SIZE];
        result+=mem[i+3*BLOCK_SIZE];
    }
}
Unrolling 2, 3, 4, 6 or 8 iterations gives the results below. Note that I did not correct for the blocks at the end of the vector which were ignored due to the unrolling. Therefore the periodic peaks in the blue line should be ignored; the lower-bound baseline of the periodic pattern is the actual bandwidth.
As you can see, the fraction of branch mispredictions didn't really change, but because the total number of branches is reduced by the unroll factor, they no longer contribute strongly to the performance.
There is also an additional benefit: the processor is freer to do the calculations out of order if the loop is unrolled.
If this is supposed to have practical application, I would suggest trying to give the hot loop a compile-time-fixed number of iterations or some guarantee on divisibility, so that (maybe with some extra hints) the compiler can decide on the optimal number of iterations to unroll.
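One way to do that (a sketch of my own, not code from the answer) is to make the chunk length a template parameter, so the compiler sees a constant trip count for the inner loop and can fully unroll it; the caller has to guarantee that the vector length is a multiple of CHUNK * block:

#include <cstddef>
#include <vector>

// Sketch: compile-time CHUNK gives the inner loop a constant trip count,
// which the compiler can fully unroll. Assumes mem.size() is a multiple of
// CHUNK * block (otherwise handle the tail separately).
template <std::size_t CHUNK>
unsigned int sum_strided(const std::vector<unsigned int>& mem, std::size_t block) {
    unsigned int result = 0;
    const std::size_t n = mem.size();
    for (std::size_t i = 0; i < n; i += CHUNK * block) {
        for (std::size_t k = 0; k < CHUNK; ++k)   // trip count known at compile time
            result += mem[i + k * block];
    }
    return result;
}

// e.g. sum_strided<8>(vals, BLOCK_SIZE);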
This might be unrelated, but your Linux machine might be playing with the CPU frequency. I know Ubuntu 18 has a governor that balances power and performance. You also want to play with the process affinity to make sure it does not get migrated to a different core while running.
I want to generate a random number, and hash that with SHA256 on my GPU using OpenCL with this base code (instead of hashing those pre-given plain-texts, it hashes the random numbers).
I got all the hashing to work on my GPU, but there is one problem:
the number of hashes done per second is lower when using OpenCL?
Yes, you heard that correctly, at the moment it's faster to use only the CPU over only using the GPU.
My GPU runs at only ~10% while my CPU runs at ~100%
My question is: how can this be possible and more importantly, how do I fix it?
This is the code I use for generating a Pseudo-Random Number (which doesn't change at all between the 2 runs):
long Miner::Rand() {
std::mt19937 rng;
// initialize the random number generator with time-dependent seed
uint64_t timeSeed = std::chrono::high_resolution_clock::now().time_since_epoch().count();
std::seed_seq ss{ uint32_t(timeSeed & 0xffffffff), uint32_t(timeSeed >> 32) };
rng.seed(ss);
// initialize a uniform distribution between 0 and 1
std::uniform_real_distribution<double> unif(0, 1);
double rnd = unif(rng);
return floor(99999999 * rnd);
}
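(As an aside, not part of the question: constructing and seeding a fresh std::mt19937 on every call is itself fairly expensive. A hedged sketch of an alternative that seeds one engine per thread once and reuses it, producing the same 0..99999998 range:)

#include <chrono>
#include <cstdint>
#include <random>

// Sketch: one engine per thread, seeded once, instead of re-seeding per call.
long RandFast() {
    thread_local std::mt19937 rng([] {
        uint64_t timeSeed = std::chrono::high_resolution_clock::now().time_since_epoch().count();
        std::seed_seq ss{ uint32_t(timeSeed & 0xffffffff), uint32_t(timeSeed >> 32) };
        return std::mt19937(ss);
    }());
    std::uniform_int_distribution<long> dist(0, 99999998);
    return dist(rng);
}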
Here is the code that calculates the hashrate for me:
void Miner::ticker() {
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
while (true) {
Sleep(1000);
HashesPerSecond = hashes;
hashes = 0;
PrintInfo();
}
}
which gets called from here:
void Miner::Start() {
std::chrono::system_clock::time_point today = std::chrono::system_clock::now();
startTime = std::chrono::system_clock::to_time_t(today);
std::thread tickT(&Miner::ticker, this);
PostHit();
GetAPIBalance();
while (true) {
std::thread t[32]; //max 32
hashFound = false;
if (RequestNewBlock()) {
for (int i = 0; i < numThreads; ++i) {
t[i] = std::thread(&Miner::JSEMine, this);
}
for (auto& th : t)
if (th.joinable())
th.join();
}
}
}
which in turn gets called like this:
Miner m(threads);
m.Start();
CPUs have far better latency characteristics than GPUs. That is to say, CPUs can do one operation way, way WAAAAYYYY faster than a GPU can. That's not even taking into account the CPU -> Main RAM -> PCIe bus -> GDDR5 "Global" GPU -> GPU Registers -> "Global GPU" -> PCIe bus back -> Main RAM -> CPU round trip time (and I'm skipping a few steps here, like pinning and L1 Cache)
GPUs have better bandwidth characteristics than CPUs (provided that the dataset can fit inside of the GPU's limited local memory). GPUs can perform Billions of SHA256 hashes faster than a CPU can perform billions of SHA256 hashes.
Bitcoin requires millions, billions, or even trillions of hashes to achieve a competitive hash rate. Furthermore, computations can take place on the GPU without much collaboration with the CPU (removing the need for the slow round-trip through PCIe).
It's an issue of fundamental design. CPUs are designed to minimize latency, but GPUs are designed to maximize bandwidth. It seems like your problem is latency-bound (you're calculating too few SHA256 hashes for the GPU to be effective). 32 is... really really small in the scale we're talking about.
The AMD GCN architecture doesn't even perform at full speed until you have at LEAST 64 work items, and arguably you really need 256 work items to maximize just one of the 44 compute units of, say, an R9 290x.
I guess what I'm trying to say is: try it again with 11264 work items (or more); that's the scale of work-item counts that GPUs are designed to work with, not 32. I got this number from 44 compute units on the R9 290x x 4 vector units per compute unit x 64 work items per vector unit.
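A hedged host-side sketch of what that looks like in practice (my own names; it assumes a command queue, a built SHA256 kernel, and input/output buffers already exist, as they do somewhere in your miner): one large NDRange per launch instead of 32 CPU threads.

#include <CL/cl.h>

// Sketch: enqueue one big batch of hashes per kernel launch.
// 'queue', 'kernel', 'in_buf' and 'out_buf' are assumed to be set up elsewhere.
void enqueue_hash_batch(cl_command_queue queue, cl_kernel kernel,
                        cl_mem in_buf, cl_mem out_buf) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);

    size_t global_work_size = 65536;   // tens of thousands of work items, not 32
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                           &global_work_size, nullptr, 0, nullptr, nullptr);
    clFinish(queue);                   // wait for the whole batch, then read results back
}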
I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:
5.8 GiB/s vs 7.5 GiB/s
However, the results are different for smaller data sizes, where fill(0) is faster:
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
This raises the secondary question, why the peak bandwidth of fill(1) is so much lower.
The test system for this was a dual socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and Intel compiler 17.0.1 (-fast), both get identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. Stream/add/24 threads gets 85 GiB/s on the system.
I was able to reproduce this effect on a different Haswell dual socket server system, but not any other architecture. For example on Sandy Bridge EP, memory performance is identical, while in cache fill(0) is much faster.
Here is the code to reproduce:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>
using value = int;
using vector = std::vector<value>;
constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;
void __attribute__((noinline)) fill0(vector& v) {
std::fill(v.begin(), v.end(), 0);
}
void __attribute__((noinline)) fill1(vector& v) {
std::fill(v.begin(), v.end(), 1);
}
void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
{
vector v(data_size / (sizeof(value) * nthreads));
auto repeat = write_size / data_size;
#pragma omp barrier
auto t0 = omp_get_wtime();
for (auto r = 0; r < repeat; r++)
fill0(v);
#pragma omp barrier
auto t1 = omp_get_wtime();
for (auto r = 0; r < repeat; r++)
fill1(v);
#pragma omp barrier
auto t2 = omp_get_wtime();
#pragma omp master
std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
<< write_size / (t2 - t1) << "\n";
}
}
int main(int argc, const char* argv[]) {
std::cout << "size,nthreads,fill0,fill1\n";
for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
bench(bytes, 1);
}
for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
bench(bytes, omp_get_max_threads());
}
for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
bench(max_data_size, nthreads);
}
}
Presented results compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.
From your question + the compiler-generated asm from your answer:
fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.
Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
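For illustration only (my sketch, not the code benchmarked here): a 256-bit store loop unrolled by 4, assuming the buffer is 32-byte aligned and its length is a multiple of 32 ints, compiled with -mavx2 or -march=haswell:

#include <cstddef>
#include <immintrin.h>

// Sketch: fill an int buffer with 1s using aligned 256-bit stores, unrolled 4x.
// Assumes ptr is 32-byte aligned and count is a multiple of 32.
void fill1_avx2_unrolled(int* ptr, std::size_t count) {
    const __m256i ones = _mm256_set1_epi32(1);
    for (std::size_t i = 0; i < count; i += 32) {
        _mm256_store_si256(reinterpret_cast<__m256i*>(ptr + i),      ones);
        _mm256_store_si256(reinterpret_cast<__m256i*>(ptr + i + 8),  ones);
        _mm256_store_si256(reinterpret_cast<__m256i*>(ptr + i + 16), ones);
        _mm256_store_si256(reinterpret_cast<__m256i*>(ptr + i + 24), ones);
    }
}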
Note that rep movsd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell.
Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only b is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.
gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep movsd for fill(1). Probably not, since I assume the code starts out as a loop, rather than a memset.
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB stores use a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever).
@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.
movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)
Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).
As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).
movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line-fill buffers, not on sending those full line-fill buffers to DRAM (until you have enough threads to saturate memory bandwidth).
@osgx posted some interesting links in comments:
Agner Fog's asm optimization guide, instruction tables, and microarch guide: http://agner.org/optimize/
Intel optimization guide: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.
NUMA snooping: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
See also other stuff in the x86 tag wiki.
I'll share my preliminary findings, in the hope to encourage more detailed answers. I just felt this would be too much as part of the question itself.
The compiler optimizes fill(0) to an internal memset. It cannot do the same for fill(1), since memset only works on bytes.
Specifically, both glibc's __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction:
rep stos %al,%es:(%rdi)
Whereas the manual loop compiles down to an actual 128-bit instruction:
add $0x1,%rax
add $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp %rax,%r8
ja 400f41
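Concretely (my own illustration, not from the original post): fill(0) can legally become memset because int 0 is four 0x00 bytes, whereas int 1 is the byte pattern 01 00 00 00 on a little-endian machine, which memset cannot replicate:

#include <cstring>
#include <vector>

// Legal: the object representation of int(0) is four 0x00 bytes.
void fill0_as_memset(std::vector<int>& v) {
    std::memset(v.data(), 0, v.size() * sizeof(int));
}

// There is no memset equivalent for fill(1): memset can only replicate a
// single byte value, and no byte repeated four times equals int(1).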
Interestingly, while there is a template/header optimization to implement std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop.
Strangely, for a std::vector<char>, gcc begins to optimize fill(1) as well. The Intel compiler does not, despite the memset template specialization.
Since this happens only when the code is actually working in memory rather than in cache, it appears that the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.
I would appreciate any further insight into the issue and the related micro-architecture details. In particular it is unclear to me why this behaves so differently for four or more threads and why memset is so much faster in cache.
Update:
Here is a result in comparison with
fill(1) that uses -march=native (avx2 vmovdq %ymm0) - it works better in L1, but similar to the movaps %xmm0 version for other memory levels.
Variants of 32-, 128- and 256-bit non-temporal stores. They perform consistently with the same performance regardless of the data size. All outperform the other variants in memory, especially for small numbers of threads. 128-bit and 256-bit perform almost identically; for low numbers of threads, 32-bit performs significantly worse.
For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.
Single threaded bandwidth:
Aggregate bandwidth in memory:
Here is the code used for the additional tests with their respective hot-loops:
void __attribute__ ((noinline)) fill1(vector& v) {
std::fill(v.begin(), v.end(), 1);
}
┌─→add $0x1,%rax
│ vmovdq %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rdi,%rax
└──jb e0
void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
for (auto& elem : v) {
_mm_stream_si32(&elem, 1);
}
}
┌─→movnti %ecx,(%rax)
│ add $0x4,%rax
│ cmp %rdx,%rax
└──jne 18
void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
assert((long)v.data() % 32 == 0); // alignment
const __m128i buf = _mm_set1_epi32(1);
size_t i;
int* data;
int* end4 = &v[v.size() - (v.size() % 4)];
int* end = &v[v.size()];
for (data = v.data(); data < end4; data += 4) {
_mm_stream_si128((__m128i*)data, buf);
}
for (; data < end; data++) {
*data = 1;
}
}
┌─→vmovnt %xmm0,(%rdx)
│ add $0x10,%rdx
│ cmp %rcx,%rdx
└──jb 40
void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
assert((long)v.data() % 32 == 0); // alignment
const __m256i buf = _mm256_set1_epi32(1);
size_t i;
int* data;
int* end8 = &v[v.size() - (v.size() % 8)];
int* end = &v[v.size()];
for (data = v.data(); data < end8; data += 8) {
_mm256_stream_si256((__m256i*)data, buf);
}
for (; data < end; data++) {
*data = 1;
}
}
┌─→vmovnt %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rcx,%rdx
└──jb 40
Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsic confusing the optimizer.