I am modifying RNNLM (a recurrent neural network language model toolkit) to study language models. However, given the size of my corpus, it runs very slowly. I tried to optimize the matrix*vector routine, which accounts for 63% of the total time on a small data set (I would expect it to be worse on larger sets). Right now I am stuck on the intrinsics.
for (b = 0; b < (to - from) / 8; b++)
{
    val = _mm256_setzero_ps();
    for (a = from2; a < to2; a++)
    {
        t1 = _mm256_set1_ps(srcvec.ac[a]);
        t2 = _mm256_load_ps(&(srcmatrix[a + (b*8 + from + 0) * matrix_width].weight));
        //val = _mm256_fmadd_ps(t1, t2, t3)
        t3 = _mm256_mul_ps(t1, t2);
        val = _mm256_add_ps(val, t3);
    }
    t4 = _mm256_load_ps(&(dest.ac[b*8 + from + 0]));
    t4 = _mm256_add_ps(t4, val);
    _mm256_store_ps(&(dest.ac[b*8 + from + 0]), t4);
}
This example crashes on:
_mm256_store_ps(&(dest.ac[b*8+from+0]), t4);
However, if I change it to
_mm256_storeu_ps(&(dest.ac[b*8+from+0]), t4);
(with u for unaligned, I suppose), everything works as intended. My question is: why does the load work (when it is not supposed to, if the data is unaligned) while the store doesn't? (Furthermore, both operate on the same address.)
dest.ac has been allocated using
void *_aligned_calloc(size_t nelem, size_t elsize, size_t alignment = 64)
{
    size_t max_size = (size_t)-1;
    // Watch out for overflow
    if (elsize == 0 || nelem >= max_size / elsize)
        return NULL;
    size_t size = nelem * elsize;
    void *memory = _mm_malloc(size + 64, alignment);
    if (memory != NULL)
        memset(memory, 0, size);
    return memory;
}
and it's at least 50 elements long.
(BTW, with VS2012 I get an illegal instruction on some random assignment, so I use Linux.)
thank you in advance,
Arkantus.
TL:DR: in optimized code, loads will fold into memory operands for other operations, which don't have alignment requirements in AVX. Stores won't.
Your sample code doesn't compile by itself, so I can't easily check what instruction _mm256_load_ps compiles to.
I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps at all for _mm256_load_ps, since I only used the result of the load as an input to one other instruction. It generates that instruction with a memory operand. AVX instructions have no alignment requirements for their memory operands. (There is a performance hit for crossing a cache line, and a bigger hit for crossing a page boundary, but your code still works.)
The store, on the other hand, does generate a vmov... instruction. Since you used the alignment-required version, it faults on unaligned addresses. Simply use the unaligned version; it'll be just as fast when the address is aligned, and still work when it isn't.
I didn't check your code carefully to see whether all the accesses SHOULD be aligned. I assume not, from the way you phrased it, just asking why you weren't also getting faults for the unaligned loads. Like I said, your code probably just didn't compile to any vmovaps load instructions; the loads were folded into memory operands, which have no alignment requirement in AVX.
Are you running AVX (without AVX2 or FMA?) on a Sandy Bridge/Ivy Bridge CPU? I assume that's why your FMA intrinsics are commented out.
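For reference, here is a minimal sketch of the same inner loop with the unaligned store, and with the FMA form guarded behind a feature macro since your FMA line is commented out. Variable names are taken from your snippet, and the surrounding declarations are assumed to exist as in your code:
for (b = 0; b < (to - from) / 8; b++)
{
    val = _mm256_setzero_ps();
    for (a = from2; a < to2; a++)
    {
        t1 = _mm256_set1_ps(srcvec.ac[a]);
        t2 = _mm256_loadu_ps(&(srcmatrix[a + (b*8 + from) * matrix_width].weight));
#ifdef __FMA__
        val = _mm256_fmadd_ps(t1, t2, val);              // fused multiply-add where available
#else
        val = _mm256_add_ps(val, _mm256_mul_ps(t1, t2));
#endif
    }
    t4 = _mm256_loadu_ps(&(dest.ac[b*8 + from]));
    _mm256_storeu_ps(&(dest.ac[b*8 + from]), _mm256_add_ps(t4, val));   // unaligned store: same speed when the address happens to be aligned
}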
Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit data types than on smaller types like short, due to implicit promotion? Yet while testing my bitset implementation (where the majority of the time is spent on bitwise operations), I got a ~40% improvement using uint8_t over uint32_t. I'm especially surprised because there is hardly any copying going on that would justify the difference. The same thing occurred regardless of the clang optimisation level.
8-bit:
#define mod8(x) x&7
#define div8(x) x>>3

template<unsigned long bits>
struct bitset{
private:
    uint8_t fill[8] = {};
    uint8_t clear[8];
    uint8_t band[(bits/8)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div8(ind)]&fill[mod8(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div8(ind)] |= fill[mod8(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div8(ind)] &= clear[mod8(ind)];
    }
    bitset(){
        for(uint8_t ii = 0, val = 1; ii < 8; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val *= 2;
        }
    }
};
32-bit:
#define mod32(x) x&31
#define div32(x) x>>5

template<unsigned long bits>
struct bitset{
private:
    uint32_t fill[32] = {};
    uint32_t clear[32];
    uint32_t band[(bits/32)+1] = {};
public:
    template<typename T>
    inline bool operator[](const T ind) const{
        return band[div32(ind)]&fill[mod32(ind)];
    }
    template<typename T>
    inline void store_high(const T ind){
        band[div32(ind)] |= fill[mod32(ind)];
    }
    template<typename T>
    inline void store_low(const T ind){
        band[div32(ind)] &= clear[mod32(ind)];
    }
    bitset(){
        for(uint32_t ii = 0, val = 1; ii < 32; ++ii){
            fill[ii] = val;
            clear[ii] = ~fill[ii];
            val *= 2;
        }
    }
};
And here is the benchmark I used (it just moves a single 1 bit from position 0 to the end, iteratively):
const int len = 1000000;
bitset<len> bs;
{
    auto start = std::chrono::high_resolution_clock::now();
    bs.store_high(0);
    for (int ii = 1; ii < len; ++ii) {
        bs.store_high(ii);
        bs.store_low(ii-1);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>((stop-start)).count() << std::endl;
}
TL:DR: large "buckets" for a bitset mean you access the same one repeatedly when you iterate linearly, creating longer dependency chains that out-of-order exec can't overlap as effectively.
Smaller buckets give instruction-level parallelism, making operations on bits in separate bytes independent of each other.
One possible reason is that you iterate linearly over bits, so all the operations on the same band[] element form one long dependency chain of &= and |= operations, plus store and reload (if the compiler doesn't manage to optimize that away with loop unrolling).
For uint32_t band[], that's a chain of 2x 32 operations, since ii>>5 will give the same index for that long.
Out-of-order exec can only partially overlap execution of these long chains if their latency and instruction count are too large for the ROB (ReOrder Buffer) and RS (Reservation Station, aka scheduler). With 64 operations, probably including store/reload latency (4 or 5 cycles on modern x86), that's a dep-chain length of probably 6 x 64 = 384 cycles, composed of probably at least 128 uops, with some parallelism available for loading (or better, calculating) the 1U<<(n&31) or rotl(-1U, n&31) masks, which can "use up" some of the otherwise-wasted execution slots in the pipeline.
But for uint8_t band[], you're moving to a new element 4x as frequently, after only 2x 8 = 16 operations, so the dep chains are 1/4 the length.
See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another case of a modern x86 CPU overlapping two long dependency chains (a simple chain of imul with no other instruction-level parallelism), especially the part about a single dep chain becoming longer than the RS (scheduler for un-executed uops) being the point at which we start to lose some of the overlap of execution of the independent work. (For the case without lfence to artificially block overlap.)
See also Modern Microprocessors: A 90-Minute Guide! and https://www.realworldtech.com/sandy-bridge/ for some background on how modern OoO exec CPUs decode and look at instructions.
Small vs. large buckets
Large buckets are only useful when scanning through for the first non-zero bit, or filling the whole thing or something. Of course, really you'd want to vectorize that with SIMD, checking 16 or 32 bytes at once to see if there's a non-zero element anywhere in that. Current compilers will vectorize for you in loops that fill the whole array, but not search loops (or anything with a trip-count that can't be calculated ahead of the first iteration), except for ICC which can handle that. Re: using fast operations over bit-vectors, see Howard Hinnant's article (in the context of vector<bool>, which is an unfortunate name for a sometimes-useful data structure.)
C++ unfortunately doesn't make it easy in general to use different sized accesses to the same data, unless you compile with g++ -O3 -fno-strict-aliasing or something like that.
Although unsigned char can always alias anything else, so you could use that for your single-bit accesses and only use uintptr_t (which is likely to be as wide as a register, except on ILP32-on-64-bit ISAs) for init or whatever. Or in this case, uint_fast32_t being a 64-bit type on many x86-64 C++ implementations would actually make it useful here, unlike the usual case where that choice just wastes cache footprint when you only need the value range of a 32-bit number, and is slower for non-constant division on some CPUs.
On x86 CPUs, a byte store is naturally fully efficient, but even on an ARM or something, coalescing in the store buffer could still make adjacent byte RMWs fully efficient. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). And you'd still gain ILP; a slower commit to cache is still not as bad as coupling loads to stores that could have been independent if they were narrower. This is especially important on lower-end CPUs with smaller out-of-order scheduler buffers.
(x86 byte loads need to use movzx to zero-extend to avoid false dependencies, but most compilers know that. Clang is reckless about it which can occasionally hurt.)
(Different sized accesses close to each other can lead to store-forwarding stalls, e.g. a byte store and an unsigned long reload that overlaps that byte will have extra latency: What are the costs of failed store-to-load forwarding on x86?)
Code review:
Storing an array of masks is probably worse than just computing 1u << (n & 31) as needed, on most CPUs. If you're really lucky, a smart compiler might manage constant propagation from the constructor into the benchmark loop, and realize that it can rotate or shift inside the loop to generate the bitmask instead of indexing memory in a loop that already does other memory operations.
(Some non-x86 ISAs have better bit-manipulation instructions and can materialize 1<<n cheaply, although x86 can do that in 2 instructions as well if compilers are smart. xor eax,eax / bts eax, esi, with the BTS implicitly masking the shift count by the operand-size. But that only works so well for 32-bit operand-size, not 8-bit. Without BMI2 shlx, x86 variable-count shifts run as 3-uops on Intel CPUs, vs. 1 on AMD.)
Almost certainly not worth it to store both fill[] and clear[] constants. Some ISAs even have an andn instruction that can NOT one of the operands on the fly, i.e. implements (~x) & y in one instruction. For example, x86 with BMI1 extensions has andn. (gcc -march=haswell).
Also, your macros are unsafe: wrap the expression in () so operator precedence doesn't bite you if you write foo[div8(x) - 1]. As in #define div8(x) (x>>3).
But really, you shouldn't be using CPP macros for stuff like this anyway. Even in modern C, just define shift counts and masks as static const int shift = 3; constants. In C++, do that inside the struct/class scope and use band[idx >> shift] or something. (When I was typing ind, my fingers wanted to type int; idx is probably a better name.)
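Putting those review points together, here is a minimal illustrative sketch (not the question's code) of an 8-bit-bucket bitset that computes its masks on the fly instead of storing fill[]/clear[] arrays:
#include <cstdint>
#include <cstddef>

template<unsigned long bits>
struct bitset_sketch {
private:
    static constexpr int shift = 3;          // divide by 8
    static constexpr unsigned mask = 7;      // modulo 8
    uint8_t band[(bits / 8) + 1] = {};
public:
    bool operator[](size_t idx) const {
        return band[idx >> shift] & (uint8_t)(1u << (idx & mask));
    }
    void store_high(size_t idx) {
        band[idx >> shift] |= (uint8_t)(1u << (idx & mask));
    }
    void store_low(size_t idx) {
        band[idx >> shift] &= (uint8_t)~(1u << (idx & mask));
    }
};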
Isn't it common knowledge that math operations on 64-bit systems run faster on 32/64-bit data types than on smaller types like short, due to implicit promotion?
This isn't a universal truth. As always, it depends on the details.
Why does this piece of code written using uint_8 run faster than analogous code written with uint_32 or uint_64 on a 64bit machine?
The title doesn't match the question. There are no such types as uint_X, and you aren't using uintX_t. You are using uint_fastX_t. uint_fastX_t is an alias for an integer type that is at least X bits wide and that is deemed by the language implementers to provide the fastest operations.
If we were to take your earlier-mentioned assumption for granted, it should logically follow that the language implementers would have chosen a 32/64-bit type as uint_fast8_t. That said, you cannot assume that they have done so, and whatever generic measurement (if any) was used to make that choice doesn't necessarily apply to your case.
That said, regardless of which type uint_fast8_t is an alias of, your test isn't fair for comparing the relative speeds of calculation of potentially different integer types:
uint_fast8_t fill[8] = {};
uint_fast8_t clear[8];
uint_fast8_t band[(bits/8)+1] = {};
uint_fast32_t fill[32] = {};
uint_fast32_t clear[32];
uint_fast32_t band[(bits/32)+1] = {};
Not only are the types (potentially) different, but the sizes of the arrays are too. This can certainly have an effect on the efficiency.
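As an illustration of how implementation-dependent this is, you can simply print the sizes your implementation picked (the exact results vary; on x86-64 glibc, uint_fast8_t is typically 1 byte while the wider fast types are 8 bytes):
#include <cstdint>
#include <cstdio>

int main() {
    std::printf("uint_fast8_t : %zu bytes\n", sizeof(uint_fast8_t));
    std::printf("uint_fast16_t: %zu bytes\n", sizeof(uint_fast16_t));
    std::printf("uint_fast32_t: %zu bytes\n", sizeof(uint_fast32_t));
    std::printf("uint_fast64_t: %zu bytes\n", sizeof(uint_fast64_t));
}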
I'm attempting to use vector intrinsics to speed up a trivial piece of code (as a test), and I'm not getting a speed up - in fact, it runs slower by a bit sometimes. I'm wondering two things:
Do vectorized instructions speed up simple load from one region / store to another type operations in any way?
Division intrinsics aren't yielding anything faster either, and in fact, I started getting segfaults when I introduced the _mm256_div_pd intrinsic. Is my usage correct?
constexpr size_t VECTORSIZE{ (size_t)1024 * 1024 * 64 }; //large array to force main memory accesses

void normal_copy(const fftw_complex* in, fftw_complex* copyto, size_t copynum)
{
    for (size_t i = 0; i < copynum; i++)
    {
        copyto[i][0] = in[i][0] / 128.0;
        copyto[i][1] = in[i][1] / 128.0;
    }
}

#if defined(_WIN32) || defined(_WIN64)
void avx2_copy(const fftw_complex* __restrict in, fftw_complex* __restrict copyto, size_t copynum)
#else
void avx2_copy(const fftw_complex* __restrict__ in, fftw_complex* __restrict__ copyto, size_t copynum)
#endif
{ //avx2 supports 256 bit vectorized instructions
    constexpr double zero = 0.0;
    constexpr double dnum = 128.0;
    __m256d tmp = _mm256_broadcast_sd(&zero);
    __m256d div = _mm256_broadcast_sd(&dnum);
    for (size_t i = 0; i < copynum; i += 2)
    {
        tmp = _mm256_load_pd(&in[i][0]);
        tmp = _mm256_div_pd(tmp, div);
        _mm256_store_pd(&copyto[i][0], tmp);
    }
}
int main()
{
    fftw_complex* invec = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec1 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
    fftw_complex* outvec3 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));

    //some initialization stuff for invec

    //some timing stuff (wall clock)
    normal_copy(invec, outvec1, VECTORSIZE);

    //some timing stuff (wall clock)
    avx2_copy(invec, outvec3, VECTORSIZE);
    return 0;
}
fftw_complex is a datatype equivalent to std::complex<double>. I've tested using both g++ (with -O3 and -ftree-vectorize) on Linux and Visual Studio on Windows, with the same results: the AVX2 copy-and-divide is slower, and it segfaults for certain array sizes. Tested array sizes are always powers of 2, so anything related to reading invalid memory (from _mm256_load_pd) doesn't seem to be the issue. Any thoughts?
To put it shortly: using SIMD instructions does not help much here, except for the use of non-temporal stores.
Do vectorized instructions speed up simple load from one region / store to another type operations in any way?
This depends on the type of data being copied and on the target processor, as well as the target RAM. That being said, in your case a modern x86-64 processor should nearly saturate the memory hierarchy with scalar code, because modern processors can load and store 8 bytes in parallel per cycle and most processors run at at least 2.5 GHz. That means 37.2 GiB/s for a core at this minimum frequency. While this is generally not enough to saturate the L1 or L2 cache, it is enough to saturate the RAM of most PCs.
In practice, it is significantly more complex, and the estimate above understates the required bandwidth. Indeed, Intel x86-64 processors and AMD Zen ones use a write-allocate cache policy, which causes written cache lines to be read from memory first before being written back. This means the memory traffic actually needed would be 37.2 * 1.5 = 56 GiB/s. This is not achievable: even if the RAM were able to sustain such a high throughput, cores often cannot, because of the very high latency of the RAM compared to the size of the caches and the capability of the hardware prefetchers (see this related post for more information). To reduce the wasted memory throughput, and so increase the real throughput, you can use non-temporal streaming instructions (aka NT stores) like _mm256_stream_pd. Note that such instructions require the data pointer to be aligned.
Note that NT stores are only useful for data that is not reused soon or that is too big to fit in caches. Note also that memcpy should already use NT stores on x86-64 processors for relatively big inputs. Note also that working in-place does not run into the write-allocate issue, since the cache line has to be read anyway.
Division intrinsics aren't yielding anything faster either, and in fact, I started getting segfaults when I introduced the _mm256_div_pd intrinsic. Is my usage correct?
Because of the possible address misalignment (mentioned in the comments), you need to use a scalar loop to operate on some items until the address is aligned. As also mentioned in the comments, using a multiplication (_mm256_mul_pd) by 1./128. is much more efficient than a division. The multiplication still adds some latency but does not impact the throughput.
PS: do not forget to free the allocated memory.
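For illustration, here is a minimal sketch combining the reciprocal multiply and NT stores. Names are mine, it operates on the underlying double array (ndoubles = 2 * number of complex values), and it assumes both pointers are 32-byte aligned, which fftw_malloc usually but not always provides; _mm256_load_pd and _mm256_stream_pd both need that here:
#include <immintrin.h>
#include <cstddef>

void scale_copy_nt(const double* __restrict in, double* __restrict out, size_t ndoubles)
{
    const __m256d scale = _mm256_set1_pd(1.0 / 128.0);       // multiply by the reciprocal instead of dividing
    size_t i = 0;
    for (; i + 4 <= ndoubles; i += 4) {
        __m256d v = _mm256_load_pd(in + i);                  // aligned load (see assumption above)
        _mm256_stream_pd(out + i, _mm256_mul_pd(v, scale));  // NT store: avoids the write-allocate read
    }
    _mm_sfence();                                            // order NT stores before later memory operations
    for (; i < ndoubles; ++i)                                // scalar tail for leftover elements
        out[i] = in[i] * (1.0 / 128.0);
}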
Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
bool f1(const char *ptr)
{
    for (int i = 0; i < 8; i++)
        if (ptr[i])
            return false;
    return true;
}

bool f2(const char *ptr)
{
    bool res = true;
    for (int i = 0; i < 8; i++)
        res &= (ptr[i] == 0);
    return res;
}

bool f3(const char *ptr)
{
    static const char tmp[8]{};
    return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly output with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for GCC, Clang, and the Intel compiler with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with an additional unaligned load)? It seems to be a pretty straightforward optimization to me.
In f1 the loop stops as soon as ptr[i] is non-zero, so it is not always equivalent to looking at all 8 elements as the two other functions do (or as a direct 8-byte word comparison would) when the array is shorter than 8 bytes (the compiler does not know the size of the array):
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2, I agree that it can be replaced by an 8-byte comparison, provided the CPU allows reading an 8-byte word from any address alignment, which is the case on x64; but that can introduce unusual situations, as explained in Unusual situations where this wouldn't be safe in x86 asm.
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as #bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same way as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2; thanks #Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need to trade off against when considering writing more code to look for such patterns (maintenance cost of the code, and compile-time cost).
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
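A sketch of that memcpy approach (illustrative, not from the original answer) would look like this; compilers typically turn the memcpy into a single unaligned qword load with no UB:
#include <cstdint>
#include <cstring>

bool f_memcpy(const char *ptr)
{
    uint64_t v;
    std::memcpy(&v, ptr, sizeof(v));   // aliasing-safe, alignment-safe 8-byte load
    return v == 0;
}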
Or in GNU C++, use a typedef to express unaligned may-alias loads:
#include <cstdint>

bool f4(const char *ptr) {
    typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val != 0;
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
    cmp     QWORD PTR [rdi], 0
    setne   al
    ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case, because implicit-length strings could end at any point, so you need to do aligned loads to make sure the chunk containing at least 1 valid string byte doesn't cross a page boundary).
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>

inline bool EightBytesNull(const char *ptr)
{
    return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.
The short question: if I have a function that takes two vectors, one input and one output (they don't alias), and I can only align one of them, which one should I choose?
The longer version: consider a function,
void func(size_t n, void *in, void *out)
{
    __m256i *in256 = reinterpret_cast<__m256i *>(in);
    __m256i *out256 = reinterpret_cast<__m256i *>(out);
    while (n >= 32) {
        __m256i data = _mm256_loadu_si256(in256++);
        // process data
        _mm256_storeu_si256(out256++, data);
        n -= 32;
    }
    // process the remaining n % 32 bytes;
}
If in and out are both 32-byte aligned, then there's no penalty for using vmovdqu instead of vmovdqa. The worst-case scenario is that both are unaligned, and one in four loads/stores will cross a cache-line boundary.
In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?
At the risk of stating the obvious: there is no "right answer" except "you need to benchmark both with actual code and actual data". Which variant is faster depends strongly on the CPU you are using, the amount of computation you do on each chunk of data, and many other things.
As noted in the comments, you should also try non-temporal stores. What can also sometimes help is to load the input of the following chunk inside the current loop iteration, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for (...) {
    __m256i data = next;                    // usually 0 cost (register copy)
    next = _mm256_loadu_si256(in256++);     // start the next chunk's load early
    // do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider processing two chunks interleaved (though this uses twice as many registers).
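Regarding the peeling idea from the question, here is a minimal sketch (with names of my own choosing and a plain byte-copy body as a stand-in for the real processing) that handles elements scalar until the output pointer reaches 32-byte alignment, then uses aligned stores with possibly-unaligned loads. Whether aligning the store side is the better choice still has to be benchmarked, as noted above:
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void func_aligned_out(std::size_t n, const std::uint8_t *in, std::uint8_t *out)
{
    // Scalar prologue: copy byte by byte until out is 32-byte aligned.
    while (n > 0 && (reinterpret_cast<std::uintptr_t>(out) & 31) != 0) {
        *out++ = *in++;
        --n;
    }
    // Main loop: aligned stores, possibly-unaligned loads.
    while (n >= 32) {
        __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in));
        // process data
        _mm256_store_si256(reinterpret_cast<__m256i *>(out), data);
        in += 32;
        out += 32;
        n -= 32;
    }
    // Scalar tail for the remaining n % 32 bytes.
    for (; n > 0; --n)
        *out++ = *in++;
}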
I am relatively new to C++ (I moved from Java for performance in my scientific app) and I know nothing about SSE. Still, I need to improve the following very simple code:
int myMax = INT_MAX;
int size = 18000003;
vector<int> nodeCost(size);

/* init part */
for (int k = 0; k < size; k++){
    nodeCost[k] = myMax;
}
I have measured the time for the initialization part and it takes 13 ms, which is far too long for my scientific app (the entire algorithm runs in 22 ms, which means the initialization takes about 1/2 of the total time). Keep in mind that the initialization part will be repeated multiple times for the same vector.
As you can see, the size of the vector is not divisible by 4. Is there a way to accelerate the initialization with SSE? Can you suggest how? Do I need to use arrays, or can SSE be used with vectors as well?
Please, since I need your help, let's avoid a) "how did you measure the time" and b) "premature optimization is the root of all evil", both of which are reasonable to bring up, but a) the measured time is correct and b) I agree, yet I have no other choice. I do not want to parallelize the code with OpenMP, so SSE is the only fallback.
Thanks for your help
Use the vector's constructor:
std::vector<int> nodeCost(size, myMax);
This will most likely use an optimized "memset"-type of implementation to fill the vector.
Also tell your compiler to generate architecture-specific code (e.g. -march=native -O3 on GCC). On my x86_64 machine, this produces the following code for filling the vector:
.L5:
    add     r8, 1                      ;; increment counter
    vmovdqa YMMWORD PTR [rax], ymm0    ;; magic, ymm0 contains the data, and rax...
    add     rax, 32                    ;; ... the "end" pointer for the vector
    cmp     r8, rdi                    ;; loop condition, rdi holds the total size
    jb      .L5
The vmovdqa instruction, here operating on a 256-bit ymm register, copies 32 bytes to memory at once; it is part of the AVX instruction set.
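Since the question mentions that the initialization is repeated for the same vector, here is a small illustrative example (not from the answer) showing that assign() or std::fill can be used for the later re-fills; they typically compile to the same optimized fill loop:
#include <vector>
#include <climits>
#include <algorithm>

int main() {
    const int size = 18000003;
    std::vector<int> nodeCost(size, INT_MAX);              // first fill happens in the constructor

    // Later re-initializations of the same vector:
    nodeCost.assign(nodeCost.size(), INT_MAX);             // or equivalently:
    std::fill(nodeCost.begin(), nodeCost.end(), INT_MAX);
}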
Try std::fill first as already suggested, and then if that's still not fast enough you can go to SIMD if you really need to. Note that, depending on your CPU and memory sub-system, for large vectors such as this you may well hit your DRAM's maximum bandwidth and that could be the limiting factor. Anyway, here's a fairly simple SSE implementation:
#include <emmintrin.h>

const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k;
for (k = 0; k < size - 3; k += 4)
{
    _mm_storeu_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)
{
    pNodeCost[k] = myMax;
}
This should work well on modern CPUs - for older CPUs you might need to handle the potential data misalignment better, i.e. use _mm_store_si128 rather than _mm_storeu_si128. E.g.
#include <emmintrin.h>

const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k;
for (k = 0; k < size && (((intptr_t)&pNodeCost[k] & 15ULL) != 0); ++k)
{                                       // initial scalar loop until we
    pNodeCost[k] = myMax;               // hit 16 byte alignment
}
for ( ; k < size - 3; k += 4)           // 16 byte aligned SIMD loop
{
    _mm_store_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)                  // scalar loop to take care of any
{                                       // remaining elements at end of vector
    pNodeCost[k] = myMax;
}
This is an extension of the ideas in Mats Petersson's comment.
If you really care about this, you need to improve your referential locality. Plowing through 72 megabytes of initialization, only to come back later to overwrite it, is extremely unfriendly to the memory hierarchy.
I do not know how to do this in straight C++, since std::vector always initializes itself. But you might try (1) using calloc and free to allocate the memory; and (2) interpreting the elements of the array as "0 means myMax and n means n-1". (I am assuming "cost" is non-negative. Otherwise you need to adjust this scheme a bit. The point is to avoid the explicit initialization.)
On a Linux system, this can help because calloc of a sufficiently large block does not need to explicitly zero the memory, since pages acquired directly from the kernel are already zeroed. Better yet, they only get mapped and zeroed the first time you touch them, which is very cache-friendly.
(On my Ubuntu 13.04 system, Linux calloc is smart enough not to explicitly initialize. If yours is not, you might have to do an mmap of /dev/zero to use this approach...)
Yes, this does mean every access to the array will involve adding/subtracting 1. (Although not for operations like "min" or "max".) Main memory is pretty darn slow by comparison, and simple arithmetic like this can often happen in parallel with whatever else you are doing, so there is a decent chance this could give you a big performance win.
Of course whether this helps will be platform dependent.
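A minimal sketch of that encoding (illustrative names, assuming costs are non-negative and strictly less than INT_MAX so the +1 cannot overflow):
#include <cstdlib>
#include <cstddef>
#include <climits>

struct LazyCostArray {
    int *raw;   // calloc'd: zeroed pages are mapped lazily by the OS
    explicit LazyCostArray(std::size_t n)
        : raw(static_cast<int*>(std::calloc(n, sizeof(int)))) {}
    ~LazyCostArray() { std::free(raw); }

    int get(std::size_t i) const { return raw[i] == 0 ? INT_MAX : raw[i] - 1; }  // 0 encodes myMax
    void set(std::size_t i, int cost) { raw[i] = cost + 1; }                     // n encodes n-1
};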