I have an enormous array:
int* arr = new int[BIGNUMBER];
How can I fill it with a single value really fast? Normally I would do
for(int i = 0; i < BIGNUMBER; i++)
arr[i] = 1;
but I think it would take long.
Can I use memcpy or similar?
You could try using the standard function std::uninitialized_fill_n:
#include <memory>
// ...
std::uninitialized_fill_n(arr, BIGNUMBER, 1);
In any case, when it comes to performance, the rule is to always make measurements to back up your assumptions - especially if you are going to abandon a clear, simple design to embrace a more complex one because of an alleged performance improvement.
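As a rough illustration of what such a measurement could look like, here is a minimal timing sketch. The helper names (elapsed_ms, time_plain_loop, time_fill_n) are made up for this example. A single run over a freshly allocated buffer mostly measures page faults, so treat any numbers as a starting point, not a verdict.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` once and returns the elapsed wall time in ms.
template <class F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Times the plain indexed loop from the question.
inline double time_plain_loop(std::vector<int>& buf, int value) {
    return elapsed_ms([&] {
        for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = value;
    });
}

// Times std::fill_n over the same buffer.
inline double time_fill_n(std::vector<int>& buf, int value) {
    return elapsed_ms([&] { std::fill_n(buf.data(), buf.size(), value); });
}
```

Running each variant several times on the same buffer, with optimizations enabled, and comparing the best times gives a fairer picture than one cold run.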
EDIT:
Notice that - as Benjamin Lindley mentioned in the comments - for trivial types std::uninitialized_fill_n does not bring any advantage over the more obvious std::fill_n. The advantage would exist for non-trivial types, since std::uninitialized_fill would allow you to allocate a memory region and then construct objects in place.
However, one should not fall into the trap of calling std::uninitialized_fill_n for a memory region that is not uninitialized. The following, for instance, would give undefined behavior:
my_object* v = new my_object[BIGNUMBER];
std::uninitialized_fill_n(v, BIGNUMBER, my_object(42)); // UB!
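To make the distinction concrete, here is a small sketch of both cases. Widget is a hypothetical stand-in for my_object, and the function names are invented for illustration: std::fill_n assigns to elements that already exist, while std::uninitialized_fill_n constructs elements in raw storage obtained from operator new.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical non-trivial type, standing in for my_object.
struct Widget {
    int value;
    explicit Widget(int v) : value(v) {}
};

// Already-constructed elements (e.g. from new int[n]): assign with std::fill_n.
inline void fill_constructed(int* arr, std::size_t n, int value) {
    std::fill_n(arr, n, value);
}

// Raw storage: construct n copies in place, read one back, then clean up.
inline int fill_raw_and_read_last(std::size_t n, int value) {
    if (n == 0) return 0;
    void* raw = ::operator new(n * sizeof(Widget));          // no constructors run here
    Widget* first = static_cast<Widget*>(raw);
    std::uninitialized_fill_n(first, n, Widget(value));      // copy-constructs each element
    const int last = first[n - 1].value;
    for (std::size_t i = 0; i < n; ++i) first[i].~Widget();  // destroy before releasing storage
    ::operator delete(raw);
    return last;
}
```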
An alternative to a dynamic array is std::vector<int>, using the constructor that accepts an initial value for each element:
std::vector<int> v(BIGNUMBER, 1); // 'BIGNUMBER' elements, all with value 1.
As already stated, performance would need to be measured. This approach provides the additional benefit that the memory will be freed automatically.
Some possible alternatives to Andy Prowl's std::uninitialized_fill_n() solution, just for posterity:
If you are lucky and your value is composed of all the same bytes, memset will do the trick.
Some implementations offer a 16-bit version memsetw, but that's not everywhere.
GCC has an extension for Designated Initializers that can fill ranges.
I've worked with a few ARM systems that had libraries that had accelerated CPU and DMA variants of word-fill, hand coded in assembly -- you might look and see if your platform offers any of this, if you aren't terribly concerned about portability.
Depending on your processor, even looking into loops around SIMD intrinsics may provide a boost; some of the SIMD units have load/store pipelines that are optimized for moving data around like this. On the other hand you may take severe penalties for moving between register types.
Last but definitely not least, to echo some of the commenters: you should test and see. Compilers tend to be pretty good at recognizing and optimizing patterns like this -- you probably are just trading off portability or readability with anything other than the simple loop or uninitialized_fill_n.
You may be interested in prior questions:
Is there memset() that accepts integers larger than char?
initializing an array of ints
How to initialize all members of an array to the same value?
Under Linux/x86 gcc with optimizations turned on, your code will compile to the following:
rax = arr
rdi = BIGNUMBER
400690: c7 04 90 01 00 00 00 movl $0x1,(%rax,%rdx,4)
Store immediate int(1) at address rax + rdx*4
400697: 48 83 c2 01 add $0x1,%rdx
Increment register rdx
40069b: 48 39 fa cmp %rdi,%rdx
Cmp rdi to rdx
40069e: 75 f0 jne 400690 <main+0xa0>
If BIGNUMBER has not been reached yet, jump back to the start.
It takes about 1 second per gigabyte on my machine, but most of that I bet is paging in physical memory to back the uninitialized allocation.
Just unroll the loop by, say, 8 or 16 times. Functions like memcpy are fast, but they're really there for convenience, not to be faster than anything you could possibly write:
for (i = 0; i < BIGNUMBER-8; i += 8){
a[i+0] = 1; // this gets rid of the test against BIGNUMBER, and the increment, on 7 out of 8 items.
a[i+1] = 1; // the compiler should be able to see that a[i] is being calculated repeatedly here
...
a[i+7] = 1;
}
for (; i < BIGNUMBER; i++) a[i] = 1;
The compiler might be able to unroll the loop for you, but why take the chance?
Use memset or memcpy
memset(arr, 0, BIGNUMBER);
Try using memset?
memset(arr, 1, BIGNUMBER);
http://www.cplusplus.com/reference/cstring/memset/
memset(arr, 1, sizeof(int) * BIGNUMBER);
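One caveat worth spelling out for the memset suggestions: memset writes a single byte value into every byte, and its count argument is in bytes. The sketch below (function names are made up for illustration) shows that a byte of 1 does not produce the int value 1, whereas std::fill_n does. It assumes a 4-byte int where noted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// memset writes the same *byte* everywhere; the count is in bytes, not elements.
inline void fill_bytes(int* arr, std::size_t count, unsigned char byte) {
    std::memset(arr, byte, count * sizeof(int));
}

// Element-wise fill for values that are not a repeated byte pattern.
inline void fill_values(int* arr, std::size_t count, int value) {
    std::fill_n(arr, count, value);
}
```

With a 4-byte int, fill_bytes(arr, n, 1) leaves every element equal to 0x01010101. Only values such as 0 or -1, whose bytes are all identical, come out as intended.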
Related
Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
bool f1(const char *ptr)
{
for (int i = 0; i < 8; i++)
if (ptr[i])
return false;
return true;
}
bool f2(const char *ptr)
{
bool res = true;
for (int i = 0; i < 8; i++)
res &= (ptr[i] == 0);
return res;
}
bool f3(const char *ptr)
{
static const char tmp[8]{};
return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for GCC, Clang, and the Intel compiler alike with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with additional unaligned load)? It seems to be a pretty straightforward optimization to me.
In f1 the loop stops as soon as ptr[i] is non-zero, so it is not always equivalent to examining all 8 elements (as the other two functions do) or to directly comparing an 8-byte word. The array may be shorter than 8 bytes, and the compiler does not know its size:
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2, I agree that it can be replaced by an 8-byte comparison, provided the CPU allows reading an 8-byte word from any address alignment, which is the case on x64. But that can introduce unusual situations, as explained in Unusual situations where this wouldn't be safe in x86 asm
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if you made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2; thanks @Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need to trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
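A minimal sketch of that memcpy approach follows. The function name is made up for this example. It has the same semantics as f3 (true when all 8 bytes are zero) and, like f3, assumes the caller guarantees that 8 bytes are readable.

```cpp
#include <cstdint>
#include <cstring>

// Copies 8 bytes into a local integer (no alignment or strict-aliasing issues),
// then compares the integer. Compilers typically turn the memcpy into one load.
inline bool eight_bytes_zero(const char* ptr) {
    std::uint64_t value;
    std::memcpy(&value, ptr, sizeof value);
    return value == 0;
}
```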
Or in GNU C++, use a typedef to express unaligned may-alias loads:
bool f4(const char *ptr) {
typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
auto val = *(const aliasing_unaligned_u64*)ptr;
return val != 0;
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
cmp QWORD PTR [rdi], 0
setne al
ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk that contains at least 1 valid string byte doesn't cross a page boundary.) Also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>
inline bool EightBytesNull(const char *ptr)
{
return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.
In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
b += a;
unsigned c = b < a;
return c + b + X;
}
Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky and the (g++ -O3, version 9.1) generated code is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect measurements (not very scientific, though). Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?
For what it's worth, clang -O3, version 8.8.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1 never does.
I've tried using the intrinsic _addcarry_u32 but it didn't help.
unsigned f2(unsigned a, unsigned b) {
b += a;
unsigned char c = b < a;
_addcarry_u32(c, b, X, &b);
return b;
}
I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0 I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1 I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs (see Footnote 1).
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.
I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, and you know something about GCC's internal representations, you could look at the GIMPLE or RTL output after optimization passes and see what happens. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
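To illustrate why the general carry-in / carry-out case is awkward in pure C or C++, here is a sketch with made-up names. The naive version uses a single sum < a check after adding everything. The checked version tests each of the two additions separately, since either one can wrap. Whether a given compiler turns the checked version into adc is not something this sketch claims.

```cpp
#include <cassert>
#include <cstdint>

struct AddResult {
    std::uint32_t sum;
    unsigned carry_out;
};

// Naive: one comparison after adding all three inputs. Misses a carry when
// b + carry_in itself wraps to 0, because then sum == a.
inline AddResult add_carry_naive(std::uint32_t a, std::uint32_t b, unsigned carry_in) {
    assert(carry_in <= 1);
    const std::uint32_t sum = a + b + carry_in;
    const unsigned carry = (sum < a) ? 1u : 0u;
    return AddResult{sum, carry};
}

// Checked: detect wrap-around of each addition separately and combine.
inline AddResult add_carry_checked(std::uint32_t a, std::uint32_t b, unsigned carry_in) {
    assert(carry_in <= 1);
    const std::uint32_t partial = a + b;
    const unsigned c1 = (partial < a) ? 1u : 0u;        // a + b wrapped
    const std::uint32_t sum = partial + carry_in;
    const unsigned c2 = (sum < partial) ? 1u : 0u;      // adding carry_in wrapped
    return AddResult{sum, c1 | c2};                     // both cannot be 1 at once
}
```

For a = 5, b = 0xFFFFFFFF, carry_in = 1, the true result is 5 with a carry out of 1: the checked version reports that, while the naive one reports no carry.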
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32 correctly.
The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.
If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
#include <immintrin.h>
// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
unsigned char carry;
carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
Footnote 1: or for uop count: equal on Intel Haswell and earlier where adc is 2 uops, except with a zero immediate where Sandybridge-family's decoders special case that as 1 uop.
But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc would still be a better choice.
https://agner.org/optimize/
I'm trying to debug a rather large program with many variables. The code is set up in this way:
while (condition1) {
//non timing sensitive code
while (condition2) {
//timing sensitive code
//many variables that change each iteration
}
}
I have many variables on the inner loop that I want to save for viewing. I want to write them to a text file each outer loop iteration. The inner loop executes a different number of times each iteration. It can be just 2 or 3, or it can be several thousands.
I need to see all the variables values from each inner iteration, but I need to keep the inner loop as fast as possible.
Originally, I tried just storing each data variable in its own vector where I just appended a value at each inner loop iteration. Then, when the outer loop iteration came, I would read from the vectors and write the data to a debug file. This quickly got out of hand as variables were added.
I thought about using a string buffer to store the information, but I'm not sure if this is the fastest way given strings would need to be created multiple times within the loop. Also, since I don't know the number of iterations, I'm not sure how large the buffer would grow.
With the information stored being in formats such as:
"Var x: 10\n
Var y: 20\n
.
.
.
Other Text: Stuff\n"
So, is there a cleaner option for writing large amounts of debug data quickly?
If it's really time-sensitive, then don't format strings inside the critical loop.
I'd go for appending records to a log buffer of binary records inside the critical loop. The outer loop can either write that directly to a binary file (which can be processed later), or format text based on the records.
This has the advantage that the loop only needs to track a couple extra variables (pointers to the end of used and allocated space of one std::vector), rather than two pointers for a std::vector for every variable being logged. This will have much lower impact on register allocation in the critical loop.
In my testing, it looks like you just get a bit of extra loop overhead to track the vector, and a store instruction for every variable you want to log. I didn't write a big enough test loop to expose any potential problems from keeping all the variables "alive" until the emplace_back(). If the compiler does a bad job with bigger loops where it needs to spill registers, see the section below about using a simple array without any size checking. That should remove any constraint on the compiler that makes it try to do all the stores into the log buffer at the same time.
Here's an example of what I'm suggesting. It compiles and runs, writing a binary log file which you can hexdump.
See the source and asm output with nice formatting on the Godbolt compiler explorer. It can even colourize source and asm lines so you can more easily see which asm comes from which source line.
#include <vector>
#include <cstdint>
#include <cstddef>
#include <iostream>
struct loop_log {
// Generally sort in order of size for better packing.
// Use as narrow types as possible to reduce memory bandwidth.
// e.g. logging an int loop counter into a short log record is fine if you're sure it always in-practice fits in a short, and has zero performance downside
int64_t x, y, z;
uint64_t ux, uy, uz;
int32_t a, b, c;
uint16_t t, i, j;
uint8_t c1, c2, c3;
// isn't there a less-repetitive way to write this?
loop_log(int64_t x, int32_t a, int outer_counter, char c1)
: x(x), a(a), i(outer_counter), c1(c1)
// leaves other members *uninitialized*, not zeroed.
// note lack of gcc warning for initializing uint16_t i from an int
// and for not mentioning every member
{}
};
static constexpr size_t initial_reserve = 10000;
// take some args so gcc can't count the iterations at compile time
void foo(std::ostream &logfile, int outer_iterations, int inner_param) {
std::vector<struct loop_log> log;
log.reserve(initial_reserve);
int outer_counter = outer_iterations;
while (--outer_counter) {
//non timing sensitive code
int32_t a = inner_param - outer_counter;
while (a != 0) {
//timing sensitive code
a <<= 1;
int64_t x = outer_counter * (100LL + a);
char c1 = x;
// much more efficient code with gcc 5.3 -O3 than push_back( a struct literal );
log.emplace_back(x, a, outer_counter, c1);
}
const auto logdata = log.data();
const size_t bytes = log.size() * sizeof(*logdata);
// write group size, then a group of records
logfile.write( reinterpret_cast<const char *>(&bytes), sizeof(bytes) );
logfile.write( reinterpret_cast<const char *>(logdata), bytes );
// you could format the records into strings at this point if you want
log.clear();
}
}
#include <fstream>
int main() {
std::ofstream logfile("dbg.log", std::ios::binary); // binary mode: no newline translation of raw records
foo(logfile, 100, 10);
}
gcc's output for foo() pretty much optimizes away all the vector overhead. As long as the initial reserve() is big enough, the inner loop is just:
## gcc 5.3 -masm=intel -O3 -march=haswell -std=gnu++11 -fverbose-asm
## The inner loop from the above C++:
.L59:
test rbx, rbx # log // IDK why gcc wants to check for a NULL pointer inside the hot loop, instead of doing it once after reserve() calls new()
je .L6 #,
mov QWORD PTR [rbx], rbp # log_53->x, x // emplace_back the 4 elements
mov DWORD PTR [rbx+48], r12d # log_53->a, a
mov WORD PTR [rbx+62], r15w # log_53->i, outer_counter
mov BYTE PTR [rbx+66], bpl # log_53->c1, x
.L6:
add rbx, 72 # log, // struct size is 72B
mov r8, r13 # D.44727, log
test r12d, r12d # a
je .L58 #, // a != 0
.L4:
add r12d, r12d # a // a <<= 1
movsx rbp, r12d # D.44726, a // x = ...
add rbp, 100 # D.44726, // x = ...
imul rbp, QWORD PTR [rsp+8] # x, %sfp // x = ...
cmp r14, rbx # log$D40277$_M_impl$_M_end_of_storage, log
jne .L59 #, // stay in this tight loop as long as we don't run out of reserved space in the vector
// fall through into code that allocates more space and copies.
// gcc generates pretty lame copy code, using 8B integer loads/stores, not rep movsq. Clang uses AVX to copy 32B at a time
// anyway, that code never runs as long as the reserve is big enough
// I guess std::vector doesn't try to realloc() to avoid the copy if possible (e.g. if the following virtual address region is unused) :/
An attempt to avoid repetitive constructor code:
I tried a version that uses a braced initializer list to avoid having to write a really repetitive constructor, but got much worse code from gcc:
#ifdef USE_CONSTRUCTOR
// much more efficient code with gcc 5.3 -O3.
log.emplace_back(x, a, outer_counter, c1);
#else
// Put the mapping from local var names to struct member names right here in with the loop
log.push_back( (struct loop_log) {
.x = x, .y =0, .z=0, // C99 designated-initializers are a GNU extension to C++,
.ux=0, .uy=0, .uz=0, // but gcc doesn't support leaving uninitialized elements before the last initialized one:
.a = a, .b=0, .c=0, // without all the ...=0, you get "sorry, unimplemented: non-trivial designated initializers not supported"
.t=0, .i = outer_counter, .j=0,
.c1 = (uint8_t)c1
} );
#endif
This unfortunately stores a struct onto the stack and then copies it 8B at a time with code like:
mov rax, QWORD PTR [rsp+72]
mov QWORD PTR [rdx+8], rax // rdx points into the vector's buffer
mov rax, QWORD PTR [rsp+80]
mov QWORD PTR [rdx+16], rax
... // total of 9 loads/stores for a 72B struct
So it will have more impact on the inner loop.
There are a few ways to push_back() a struct into a vector, but using a braced-initializer-list unfortunately seems to always result in a copy that doesn't get optimized away by gcc 5.3. It would be nice to avoid writing a lot of repetitive code for a constructor. And with designated initializer lists ({.x = val}), the code inside the loop wouldn't have to care much about what order the struct actually stores things. You could just write them in easy-to-read order.
BTW, .x= val C99 designated-initializer syntax is a GNU extension to C++. Also, you can get warnings for forgetting to initialize a member in a braced-list with gcc's -Wextra (which enables -Wmissing-field-initializers).
For more on syntax for initializers, have a look at Brace-enclosed initializer list constructor and the docs for member initialization.
This was a fun but terrible idea:
// Doesn't compile. Worse: hard to read, probably easy to screw up
while (outerloop) {
int64_t x=0, y=1;
struct loop_log {int64_t logx=x, logy=y;}; // loop vars as default initializers
// error: default initializers can't be local vars with automatic storage.
while (innerloop) { x+=y; y+=x; log.emplace_back(loop_log()); }
}
Lower overhead from using a flat array instead of a std::vector
Perhaps trying to get the compiler to optimize away any kind of std::vector operation is less good than just making a big array of structs (static, local, or dynamic) and keeping a count yourself of how many records are valid. std::vector checks to see if you've used up the reserved space on every iteration, but you don't need anything like that if there is a fixed upper-bound you can use to allocate enough space to never overflow. (Depending on the platform and how you allocate the space, a big chunk of memory that's allocated but never written isn't really a problem. e.g. on Linux, malloc uses mmap(MAP_ANONYMOUS) for big allocations, and that gives you pages that are all copy-on-write mapped to a zeroed physical page. The OS doesn't need to allocate physical pages until you write them. The same should apply to a large static array.)
So in your loop, you could just have code like
loop_log *current_record = logbuf;
while(inner_loop) {
int64_t x = ...;
current_record->x = x;
...
current_record->i = (short)outer_counter;
...
// or maybe
// *current_record = { .x = x, .i = (short)outer_counter };
// compilers will probably have an easier time avoiding any copying with a braced initializer list in this case than with vector.push_back
current_record++;
}
size_t record_bytes = (current_record - logbuf) * sizeof(logbuf[0]);
// or size_t record_bytes = reinterpret_cast<char*>(current_record) - reinterpret_cast<char*>(logbuf);
logfile.write((const char*)logbuf, record_bytes);
Scattering the stores throughout the inner loop will require the array pointer to be live all the time, but OTOH doesn't require all the loop variables to be live at the same time. IDK if gcc would optimize an emplace_back to store each variable into the vector once the variable was no longer needed, or if it might spill variables to the stack and then copy them all into the vector in one group of instructions.
Using log[records++].x = ... might lead to the compiler keeping the array and counter tying up two registers, since we'd use the record count in the outer loop. We want the inner loop to be fast, and can take the time to do the subtraction in the outer loop, so I wrote it with pointer increments to encourage the compiler to only use one register for that piece of state. Besides register pressure, base+index store instructions are less efficient on Intel SnB-family hardware than single-register addressing modes.
You could still use a std::vector for this, but it's hard to get std::vector not to write zeroes into memory it allocates. reserve() just allocates without zeroing, but calling .data() and using the reserved space without telling vector about it with .resize() kind of defeats the purpose. And of course .resize() will initialize all the new elements. So std::vector is a bad choice for getting your hands on a large allocation without dirtying it.
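For a self-contained version of the flat-array idea, here is a sketch using invented names (LogRecord, FlatLog). It assumes a fixed upper bound on the number of records, so the hot path is just stores plus a pointer increment with no capacity check, and it assumes the record type is trivially constructible so that new[] does not touch the memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical, trivially constructible log record.
struct LogRecord {
    std::int64_t x;
    std::int32_t a;
    std::uint16_t i;
};

class FlatLog {
public:
    // new LogRecord[n] without "()" leaves the elements uninitialized.
    explicit FlatLog(std::size_t capacity)
        : buf_(new LogRecord[capacity]), next_(buf_.get()) {}

    // Hot path: no bounds check. The caller must guarantee the upper bound.
    void append(std::int64_t x, std::int32_t a, std::uint16_t i) {
        next_->x = x;
        next_->a = a;
        next_->i = i;
        ++next_;
    }

    // The count is recovered by pointer subtraction, outside the hot loop.
    std::size_t size() const { return static_cast<std::size_t>(next_ - buf_.get()); }
    std::size_t bytes() const { return size() * sizeof(LogRecord); }
    const LogRecord* data() const { return buf_.get(); }
    void clear() { next_ = buf_.get(); }  // reuse the buffer for the next outer iteration

private:
    std::unique_ptr<LogRecord[]> buf_;
    LogRecord* next_;
};
```

In the outer loop, data() and bytes() can be handed to logfile.write() before calling clear().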
It sounds like what you really want is to look at your program from within a debugger. You haven't specified a platform, but if you build with debug information (-g using gcc or clang) you should be able to step through the loop when starting the program from within the debugger (gdb on linux.) Assuming you are on linux, tell it to break at the beginning of the function (break ) and then run. If you tell the debugger to display all the variables you want to see after each step or breakpoint hit, you'll get to the bottom of your problem in no time.
Regarding performance: unless you do something fancy like set conditional breakpoints or watch memory, running the program through the debugger will not dramatically affect perf as long as the program is not stopped. You may need to turn down the optimization level to get meaningful information though.
I am relatively new to C++ (moved from Java for performance for my scientific app) and I know nothing about SSE. Still, I need to improve the following very simple code:
int myMax=INT_MAX;
int size=18000003;
vector<int> nodeCost(size);
/* init part */
for (int k=0;k<size;k++){
nodeCost[k]=myMax;
}
I have measured the time for the initialization part and it takes 13ms which is way too big for my scientific app (the entire algorithm runs in 22ms which means that the initialization takes 1/2 of the total time). Keep in mind that the initialization part will be repeated multiple times for the same vector.
As you see the size of the vector is not divisible by 4. Is there a way to accelerate the initialization with SSE? Can you suggest how? Do I need to use arrays, or can SSE be used with vectors as well?
Please, since I need your help let's all avoid a) "how did you measure the time" or b) "premature optimization is the root of all evil" which are both reasonable for you to ask but a) the measured time is correct b) I agree with it but I have no other choice. I do not want to parallelize the code with OpenMP, so SSE is the only fallback.
Thanks for your help
Use the vector's constructor:
std::vector<int> nodeCost(size, myMax);
This will most likely use an optimized "memset"-type of implementation to fill the vector.
Also tell your compiler to generate architecture-specific code (e.g. -march=native -O3 on GCC). On my x86_64 machine, this produces the following code for filling the vector:
.L5:
add r8, 1 ;; increment counter
vmovdqa YMMWORD PTR [rax], ymm0 ;; magic, ymm0 contains the data, and rax...
add rax, 32 ;; ... the "end" pointer for the vector
cmp r8, rdi ;; loop condition, rdi holds the total size
jb .L5
The vmovdqa instruction with a ymm register operand stores 32 bytes to memory at once; it is part of the AVX instruction set.
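Since the question says the initialization is repeated for the same vector, the constructor only covers the first fill. A minimal sketch for the repeated case, with a made-up function name, is to refill the existing storage with std::fill, which avoids reallocating and gives the compiler the same chance to vectorize:

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// Resets every element of an existing vector without reallocating.
inline void reset_costs(std::vector<int>& nodeCost, int value = INT_MAX) {
    std::fill(nodeCost.begin(), nodeCost.end(), value);
}
```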
Try std::fill first as already suggested, and then if that's still not fast enough you can go to SIMD if you really need to. Note that, depending on your CPU and memory sub-system, for large vectors such as this you may well hit your DRAM's maximum bandwidth and that could be the limiting factor. Anyway, here's a fairly simple SSE implementation:
#include <emmintrin.h>
const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k; // declared outside the loops so the scalar tail can continue from it
for (k = 0; k < size - 3; k += 4)
{
_mm_storeu_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)
{
pNodeCost[k] = myMax;
}
This should work well on modern CPUs - for older CPUs you might need to handle the potential data misalignment better, i.e. use _mm_store_si128 rather than _mm_storeu_si128. E.g.
#include <emmintrin.h>
#include <stdint.h> // uintptr_t
const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
int k; // shared by all three loops
for (k = 0; k < size && (((uintptr_t)&pNodeCost[k] & 15) != 0); ++k)
{ // initial scalar loop until we
pNodeCost[k] = myMax; // hit 16 byte alignment
}
for ( ; k < size - 3; k += 4) // 16 byte aligned SIMD loop
{
_mm_store_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k) // scalar loop to take care of any
{ // remaining elements at end of vector
pNodeCost[k] = myMax;
}
This is an extension of the ideas in Mats Petersson's comment.
If you really care about this, you need to improve your referential locality. Plowing through 72 megabytes of initialization, only to come back later to overwrite it, is extremely unfriendly to the memory hierarchy.
I do not know how to do this in straight C++, since std::vector always initializes itself. But you might try (1) using calloc and free to allocate the memory; and (2) interpreting the elements of the array as "0 means myMax and n means n-1". (I am assuming "cost" is non-negative. Otherwise you need to adjust this scheme a bit. The point is to avoid the explicit initialization.)
On a Linux system, this can help because calloc of a sufficiently large block does not need to explicitly zero the memory, since pages acquired directly from the kernel are already zeroed. Better yet, they only get mapped and zeroed the first time you touch them, which is very cache-friendly.
(On my Ubuntu 13.04 system, Linux calloc is smart enough not to explicitly initialize. If yours is not, you might have to do an mmap of /dev/zero to use this approach...)
Yes, this does mean every access to the array will involve adding/subtracting 1. (Although not for operations like "min" or "max".) Main memory is pretty darn slow by comparison, and simple arithmetic like this can often happen in parallel with whatever else you are doing, so there is a decent chance this could give you a big performance win.
Of course whether this helps will be platform dependent.
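A rough sketch of the "0 means myMax and n means n-1" encoding on top of calloc follows. The class name `LazyMaxCosts` and its interface are invented for illustration, and it assumes costs are unsigned and never exceed the maximum. Whether calloc really skips the explicit zeroing is up to the C library and the OS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical wrapper: memory holds cost + 1, and a stored 0 means "max cost".
class LazyMaxCosts {
public:
    LazyMaxCosts(std::size_t n, unsigned max_cost)
        : data_(static_cast<unsigned*>(std::calloc(n, sizeof(unsigned)))),
          size_(n), max_(max_cost) {}
    ~LazyMaxCosts() { std::free(data_); }
    LazyMaxCosts(const LazyMaxCosts&) = delete;
    LazyMaxCosts& operator=(const LazyMaxCosts&) = delete;

    // False if the allocation failed.
    bool ok() const { return data_ != nullptr; }

    unsigned get(std::size_t i) const {
        assert(i < size_);
        const unsigned raw = data_[i];
        return raw == 0 ? max_ : raw - 1;  // untouched (zero) cells read as max
    }

    void set(std::size_t i, unsigned cost) {
        assert(i < size_ && cost <= max_);
        // Storing the maximum maps back to 0, so cost + 1 can never overflow.
        data_[i] = (cost == max_) ? 0u : cost + 1u;
    }

private:
    unsigned* data_;
    std::size_t size_;
    unsigned max_;
};
```

Nothing is written at construction time, so only the elements that are actually set or read ever touch memory.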
Is there a performance difference between using myarray[ i ] and storing the address of myarray[ i ] in a pointer?
Edit: The pointers are all calculated during an unimportant step in my program where performance is not a criterion. During the critical parts the pointers remain static and are not modified. Now the question is whether these static pointers are faster than using myarray[ i ] all the time.
For this code:
int main() {
int a[100], b[100];
int * p = b;
for ( unsigned int i = 0; i < 100; i++ ) {
a[i] = i;
*p++ = i;
}
return a[1] + b[2];
}
when built with -O3 optimisation in g++, the statement:
a[i] = i;
produced the assembly output:
mov %eax,(%ecx,%eax,4)
and this statement:
*p++ = i;
produced:
mov %eax,(%edx,%eax,4)
So in this case there was no difference between the two. However, this is not and cannot be a general rule - the optimiser might well generate completely different code for even a slightly different input.
It will probably make no difference at all. The compiler will usually be smart enough to know when you are using an expression more than once and create a temporary itself, if appropriate.
Compilers can do surprising optimizations; the only way to know is to read the generated assembly code.
With GCC, use -S, with -masm=intel for Intel syntax.
With VC++, use /FA (IIRC).
You should also enable optimizations: -O2 or -O3 with GCC, and /O2 with VC++.
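A small pair of functions like the following (names are made up for this example) is convenient for that kind of inspection: put them in a file, compile with the flags above, and compare the two loop bodies in the generated assembly. This is only a sketch; the output will differ between compilers, versions and targets.

```cpp
#include <cassert>
#include <cstddef>

// Indexed form: a[i] is defined as *(a + i).
void fill_indexed(int* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<int>(i);
    }
}

// Pointer form: advance a moving pointer instead of indexing.
void fill_pointer(int* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    int* p = a;
    for (std::size_t i = 0; i < n; ++i) {
        *p++ = static_cast<int>(i);
    }
}
```

Keeping each variant in its own function makes it easy to find the corresponding block in the assembly listing.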
I prefer using myarray[ i ] since it is clearer and the compiler has an easier time compiling it to optimized code.
When using pointers it is more complex for the compiler to optimize this code since it's harder to know exactly what you're doing with the pointer.
There should not be much difference, but by using indexing you avoid a range of pitfalls that the compiler's optimizer is prone to (aliasing being the most important one), so I'd say the indexing case should be easier for the compiler to handle. This doesn't mean that you should take care of the aforementioned things before the loop, but pointers in a loop generally just add to the complexity.
Yes. With a pointer, the address does not have to be calculated from the initial address of the array; the element is accessed directly. So you get a small performance improvement if you save the address in a pointer.
But the compiler will usually optimize the code and use the pointer in both cases (if you have static arrays).
For dynamic arrays (created with new) the pointer will offer you more performance as the compiler cannot optimize array accesses at compile time.
There will be no substantial difference. Premature optimization is the root of all evil - get a profiler before checking micro-optimizations like this. Also, myarray[i] is more portable to other types, such as a std::vector.
Okay, so your question is, what's faster:
#include <stdio.h>
int main(int argc, char **argv)
{
int array[20];
array[0] = 0;
array[1] = 1;
int *value_1 = &array[1];
printf("%d", *value_1);
printf("%d", array[1]);
printf("%d", *(array + 1));
}
As someone else already pointed out, compilers can do clever optimizations. Of course this depends on where an expression is used, but normally you shouldn't care about those subtle differences. All your assumptions can be proven wrong by the compiler. Today you shouldn't need to care about such differences.
For example the above code produces the following (only snippet):
mov [ebp+var_54], 1 #store 1
lea eax, [ebp+var_58] # load the address of array[0]
add eax, 4 # add 4 (size of int)
mov [ebp+var_5C], eax
mov eax, [ebp+var_5C]
mov eax, [eax]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000 # points to %d
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
mov eax, [ebp+var_54]
mov [esp+88h+var_84], eax
mov [esp+88h+var_88], offset unk_403000
call printf
Short answer: the only way to know for sure is to code up both versions and compare performance. I would personally be surprised if there was a measurable difference unless you were doing a lot of array accesses in a really tight loop. If this is something that happens once or twice over the lifetime of the program, or depends on user input, it's not worth worrying about.
Remember that the expression a[i] is evaluated as *(a+i), which is an addition plus a dereference, whereas *p is just a dereference. Depending on how the code is structured, though, it may not make a difference. Assume the following:
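int a[N]; // for any arbitrary N > 1
int *p = a;
size_t i;
for (i = 0; i < N; i++)
printf("a[%zu] = %d\n", i, a[i]); // %zu is the conversion for size_t
for (i = 0; i < N; i++)
{
void *addr = (void*) p; // read p before the increment; using p and p++ in one argument list is undefined
printf("*(%p) = %d\n", addr, *p++);
}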
Now we're comparing a[i] to *p++, which is a dereference plus a postincrement (in addition to the i++ in the loop control); that may turn out to be a more expensive operation than the array subscript. Not to mention we've introduced another variable that's not strictly necessary; we're trading a little space for what may or may not be an improvement in speed. It really depends on the compiler, the structure of the code, optimization settings, OS, and CPU.
Worry about correctness first, then worry about readability/maintainability, then worry about safety/reliability, then worry about performance. Unless you're failing to meet a hard performance requirement, focus on making your intent clear and easy to understand. It doesn't matter how fast your code is if it gives you the wrong answer or performs the wrong action, or if it crashes horribly at the first hint of bad input, or if you can't fix bugs or add new features without breaking something.
Yes. Storing a pointer to myarray[i] will perform better (if used on a large scale).
Why?
It will save you an addition and maybe a multiplication (or a shift).
Many compilers may optimize that for you in the case of static memory allocation.
If you are using dynamic memory allocation, the compiler will not optimize it, because the address is only known at runtime!