memcmp in DMD vs. GDC and std.parallelism: parallel - d

I'm implementing a struct with a pointer to some manually managed memory. It all works great with DMD, but when I test it with GDC, it fails on the opEquals operator overload. I've narrowed it down to memcmp. In opEquals I compare the pointed-to memory with memcmp, which behaves as I expect in DMD but fails with GDC.
If I go back and write the opEquals method by comparing each value stored in the manually managed memory one at a time using == on the built-in types, it works in both compilers. I prefer the memcmp route because it was shorter to write and seems like it should be faster (less indirection, iteration, etc.).
Why? Is this a bug?
(My experience with C was 10 years ago, been using python/java since, I never had this kind of problem in C, but I didn't use it that much.)
Edit:
The memory I'm comparing represents a 2-D array of 'real' values, I just wanted it to be allocated in one chunk so I didn't have to deal with jagged arrays. I'll be using the structs a lot in tight loops. Basically I'm rolling my own matrix struct that will (eventually) cache some frequently used values (trace, determinant) and offers an alternate read only view into the transpose that doesn't require copying it. I plan to work with matrices of about 10x10 to about 1000x1000 (though not always square).
I also plan on implementing a version that allocates memory with the GC via a ubyte[] and profiling the two implementations.
Edit 2:
Ok, I tried a couple of things. I also have some parallel loops, and I had a hunch that might be the problem. So I added some version statements to make a parallel and non-parallel version. In order to get it to work with GDC, I had to use the non-parallel version AND change real to double.
All cases compiled under GDC. But the unit tests failed, not always consistently on the same line, but consistently at an opEquals call when I used real or parallel. In DMD all cases compiled and ran no problem.
Thanks,

real has a bit of a strange size: it is 80 bits of data, but if you check real.sizeof, you'll see it is bigger than that (at least on Linux; I think it is 10 bytes on Windows, so I bet you wouldn't see this bug there). The reason is to make sure each element is aligned on a word boundary, a multiple of four bytes, so the processor can load array elements more efficiently.
The bytes between each data element are called padding, and their content is not always defined. I haven't confirmed this myself, but @jpf's comment on the question said the same thing my gut does, so I'm posting it as an answer now.
The is operator in D does the same as memcmp(&data1, &data2, data.sizeof), so @jpf's suggestion and your memcmp amount to the same thing. It checks the data AND the padding, whereas == only checks the data (== also treats floating-point types specially because of NaN, which never compares equal even when the bit patterns match; my first guess when I saw the question title was that this was NaN related, but that is not the case).
Anyway, apparently dmd initializes the padding bytes as well, whereas gdc doesn't, leaving it as garbage which doesn't always match.
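To illustrate the difference between the two comparisons, here is a minimal C++ sketch (not the asker's D code). It assumes long double plays the role of D's real, and the helper names equal_by_value, equal_by_bytes and fill are made up for this example. On targets where long double carries padding bytes, the bytewise comparison can report a mismatch even though every value is equal; whether it actually does depends on the compiler and platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Compare two buffers element by element; padding bytes are never inspected.
bool equal_by_value(const long double* a, const long double* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return false;
    }
    return true;
}

// Compare the raw storage, which includes any padding inside each element.
bool equal_by_bytes(const long double* a, const long double* b, std::size_t n) {
    return std::memcmp(a, b, n * sizeof(long double)) == 0;
}

// Pre-fill the storage with a junk byte, then store the same values on top.
// Stores of long double are not required to overwrite the padding bytes.
void fill(long double* p, std::size_t n, unsigned char junk) {
    std::memset(p, junk, n * sizeof(long double));
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = static_cast<long double>(i) + 0.5L;
    }
}
```

Filling two buffers with different junk bytes and then calling both helpers shows the effect: equal_by_value reports equality, while the result of equal_by_bytes is unspecified wherever padding exists.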

Related

Are there memset function implementations that fill the buffer in reverse order?

I know that there are implementations of memcpy that copy memory in reverse order to optimize for some processors. At one time, a bug "Strange sound on mp3 flash website" was connected with that. Well, it was an interesting story, but my question is about another function.
I am wondering whether there is a memset implementation anywhere that fills the buffer starting from the end. It is clear that in theory nothing prevents such an implementation. But I am interested specifically in whether someone, somewhere, has done this in practice. I would be especially grateful for a link to a library with such a function.
P.S. I understand that for application programming it makes no difference whether the buffer is filled in ascending or descending order. However, it is important for me to find out whether there was any "reverse" function implementation. I need it for writing an article.
The Linux kernel's memset for the SuperH architecture has this property:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/sh/lib/memset.S?id=v4.14
Presumably it's done this way because the store form of the mov instruction exists in predecrement form (mov.l Rm,@-Rn) but not postincrement form. See:
http://shared-ptr.com/sh_insns.html
If you want something that's not technically kernel internals on a freestanding implementation, but an actual hosted C implementation that application code could get linked to, musl libc also has an example:
https://git.musl-libc.org/cgit/musl/tree/src/string/memset.c?id=v1.1.18
Here, the C version of memset (used on many but not all target archs) does not actually fill the whole buffer backwards, but rather starts from both the beginning and end in a manner that reduces the number of conditional branches and makes them all predictable for very small memsets. See the commit message where it was added for details:
https://git.musl-libc.org/cgit/musl/commit/src/string/memset.c?id=a543369e3b06a51eacd392c738fc10c5267a195f
Some of the arch-specific asm versions of memset also have this property:
https://git.musl-libc.org/cgit/musl/tree/src/string/x86_64/memset.s?id=v1.1.18
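As a rough illustration of the two fill orders discussed above, here is a small C++ sketch. It is not the SuperH or musl code: memset_backward and fill_small_both_ends are hypothetical names, and the second function only covers buffers of up to 8 bytes to show the "work inward from both ends" idea.

```cpp
#include <cassert>
#include <cstddef>

// Fill n bytes starting from the last byte and walking toward the first,
// mirroring what a predecrement store would do.
void* memset_backward(void* dest, int value, std::size_t n) {
    unsigned char* p = static_cast<unsigned char*>(dest) + n;
    while (n--) {
        *--p = static_cast<unsigned char>(value);
    }
    return dest;
}

// Fill a tiny buffer (n <= 8) from both ends at once. Stores may overlap
// in the middle, which keeps the number of branches small.
void fill_small_both_ends(unsigned char* s, unsigned char c, std::size_t n) {
    assert(n <= 8);  // precondition of this sketch only
    if (n == 0) return;
    s[0] = s[n - 1] = c;
    if (n <= 2) return;
    s[1] = s[n - 2] = c;
    s[2] = s[n - 3] = c;
    if (n <= 6) return;
    s[3] = s[n - 4] = c;
}
```

The observable result is the same as a forward fill; only the order of the stores differs.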

strict aliasing and memory alignment

I have performance critical code and there is a huge function that allocates like 40 arrays of different size on the stack at the beginning of the function. Most of these arrays have to have certain alignment, because they are accessed somewhere else down the chain using CPU instructions that require memory alignment (on Intel and ARM CPUs).
Since some versions of gcc simply fail to align stack variables properly (notably for arm code), or even sometimes it says that maximum alignment for the target architecture is less than what my code actually requests, I simply have no choice but to allocate these arrays on the stack and align them manually.
So, for each array I need to do something like that to get it aligned properly:
short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));
This way, history is now aligned on 32-byte boundary. Doing the same is tedious for all 40 arrays, plus this part of code is really cpu intensive and I simply cannot do the same alignment technique for each of the arrays (this alignment mess confuses the optimizer and different register allocation slows down the function big time, for better explanation see explanation at the end of the question).
So... obviously, I want to do that manual alignment only once and assume that these arrays are located one right after the other. I also added extra padding to these arrays so that they are always multiple of 32 bytes. So, then I simply create a jumbo char array on the stack and cast it to a struct that has all these aligned arrays:
struct tmp
{
short history[HIST_SIZE];
short history2[2*HIST_SIZE];
...
int energy[320];
...
};
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
Something like that. Maybe not the most elegant, but it produced really good results, and manual inspection of the generated assembly shows that the generated code is more or less adequate and acceptable. Then the build system was updated to use a newer GCC and suddenly we started to have some artifacts in the generated data (e.g. output from the validation test suite is not bit exact anymore, even in a pure C build with asm code disabled). It took a long time to debug the issue, and it appeared to be related to aliasing rules and newer versions of GCC.
So, how can I get it done? Please, don't waste time trying to explain that it's not standard, not portable, undefined etc (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC as well to fix the issue, but not refactoring the code)... basically, all I want is to apply some black magic spell so that newer GCC produces the functionally same code for this type of code without disabling optimizations?
Edit:
I used this code on multiple OSes/compilers, but started to have issues when I switched to newer NDK which is based on GCC 4.6. I get the same bad result with GCC 4.7 (from NDK r8d)
I mention 32 byte alignment. If it hurts your eyes, substitute it with any other number that you like, for example 666 if it helps. There is absolutely no point to even mention that most architectures don't need that alignment. If I align 8KB of local arrays on stack, I lose 15 bytes for 16 byte alignment and I lose 31 for 32 byte alignment. I hope it's clear what I mean.
I say that there are like 40 arrays on the stack in performance critical code. I probably also need to say that it's a third party old code that has been working well and I don't want to mess with it. No need to say if it's good or bad, no point for that.
This code/function has well tested and defined behavior. We have exact numbers for the requirements of that code, e.g. it allocates X KB of RAM, uses Y KB of static tables, and consumes up to Z KB of stack space, and this cannot change, since the code won't be changed.
By saying that "alignment mess confuses the optimizer" I mean that if I try to align each array separately, the optimizer allocates extra registers for the alignment code, and performance critical parts of the code suddenly don't have enough registers and start spilling to the stack instead, which slows the code down. This behavior was observed on ARM CPUs (I'm not worried about Intel at all, by the way).
By artifacts I meant that the output becomes non-bitexact, there is some noise added. Either because of this type aliasing issue or there is some bug in the compiler that results eventually in wrong output from the function.
In short, the point of the question: how can I allocate an arbitrary amount of stack space (using char arrays or alloca), then align a pointer to that stack space and reinterpret this chunk of memory as a structure with a well-defined layout that guarantees alignment of certain variables as long as the structure itself is aligned properly? I've tried casting the memory using all kinds of approaches, and I moved the big stack allocation to a separate function, but I still get bad output and stack corruption. I'm really starting to think more and more that this huge function hits some kind of bug in gcc. It's quite strange that I can't get this cast to work no matter what I try. By the way, I disabled all optimizations that require any alignment, so it's pure C-style code now, and I still get bad results (non-bit-exact output and occasional stack corruption crashes). The simple fix that fixes it all is to write, instead of:
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
this code:
tmp buf;
tmp * X = &buf;
then all bugs disappear! The only problem is that this code doesn't do proper alignment for the arrays and will crash with optimizations enabled.
Interesting observation:
I mentioned that this approach works well and produces expected output:
tmp buf;
tmp * X = &buf;
In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:
struct tmp * to_struct_tmp(void * buffer32)
{
return (struct tmp *)buffer32;
}
Initially, I thought that if I cast alloc'ed memory using to_struct_tmp it will trick gcc to produce results that I expected to get, yet, it still produces invalid output. If I try to modify working code this way:
tmp buf;
tmp * X = to_struct_tmp(&buf);
then I get the same bad result! WOW, what else can I say? Perhaps, based on the strict-aliasing rule, gcc assumes that tmp * X isn't related to tmp buf and removes tmp buf as an unused variable right after the return from to_struct_tmp? Or it does something strange that produces an unexpected result. I also tried to inspect the generated assembly; however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so somehow that aliasing rule affects code generation big time.
Conclusion:
After all kinds of testing, I have an idea why possibly I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack for it. Then, local variables that also use stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares the same stack memory as other variables of the function. Only this could explain why it always results in the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.
First I'd like to say I'm definitely with you when you ask not to buzz about "standard violation", "implementation-dependent" and etc. Your question is absolutely legitimate IMHO.
Your approach to pack all the arrays within one struct also makes sense, that's what I'd do.
It's unclear from the question which "artifacts" you observe. Is there any unneeded code generated? Or data misalignment? If the latter is the case, you may (hopefully) use things like STATIC_ASSERT to ensure at compile time that things are aligned properly. Or at least have some run-time ASSERT in debug builds.
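A compile-time check along those lines could look like the following C++ sketch. The struct is a cut-down stand-in for the one in the question; kAlign, kHistSize and is_aligned are names and values assumed for this example, not taken from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlign = 32;     // assumed alignment requirement
constexpr std::size_t kHistSize = 64;  // hypothetical value for HIST_SIZE

struct Tmp {
    short history[kHistSize];
    short history2[2 * kHistSize];
    int energy[320];
};

// If the struct itself starts on a kAlign boundary, each member does too,
// provided its offset is a multiple of kAlign. Checked at compile time.
static_assert(offsetof(Tmp, history) % kAlign == 0, "history is misaligned");
static_assert(offsetof(Tmp, history2) % kAlign == 0, "history2 is misaligned");
static_assert(offsetof(Tmp, energy) % kAlign == 0, "energy is misaligned");

// Run-time counterpart, suitable for a debug-build ASSERT on the base pointer.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

The static checks catch a member whose padding was forgotten; the run-time helper catches a base pointer that was not rounded up correctly.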
As Eric Postpischil proposed, you may consider declaring this structure as global (if this is applicable for the case, I mean multi-threading and recursion are not an option).
One more point that I'd like to notice is so-called stack probes. When you allocate a lot of memory from the stack in a single function (more than 1 page to be exact) - on some platforms (such as Win32) the compiler adds an extra initialization code, known as stack probes. This may also have some performance impact (though likely to be minor).
Also, if you don't need all the 40 arrays simultaneously you may arrange some of them in a union. That is, you'll have one big struct, inside which some sub-structs will be grouped into union.
There are a number of issues here.
Alignment: There is little that requires 32-byte alignment. 16-byte alignment is beneficial for SIMD types on current Intel and ARM processors. With AVX on current Intel processors, the performance cost of using addresses that are 16-byte aligned but not 32-byte aligned is generally mild. There may be a large penalty for 32-byte stores that cross a cache line, so 32-byte alignment can be helpful there. Otherwise, 16-byte alignment may be fine. (On OS X and iOS, malloc returns 16-byte aligned memory.)
Allocation in critical code: You should avoid allocating memory in performance critical code. Generally, memory should be allocated at the start of the program, or before performance critical work begins, and reused during the performance critical code. If you allocate memory before performance critical code begins, then the time it takes to allocate and prepare the memory is essentially irrelevant.
Large, numerous arrays on the stack: The stack is not intended for large memory allocations, and there are limits to its use. Even if you are not encountering problems now, apparently unrelated changes in your code in the future could interact with using a lot of memory on the stack and cause stack overflows.
Numerous arrays: 40 arrays is a lot. Unless these are all in use for different data at the same time, and necessarily so, you should seek to reuse some of the same space for different data and purposes. Using different arrays unnecessarily can cause more cache thrashing than necessary.
Optimization: It is not clear what you mean by saying that the “alignment mess confuses the optimizer and different register allocation slows down the function big time”. If you have multiple automatic arrays inside a function, I would generally expect the optimizer to know they are different, even if you derive pointers from the arrays by address arithmetic. E.g., given code such as a[i] = 3; b[i] = c[i]; a[i] = 4;, I would expect the optimizer to know that a, b, and c are different arrays, and therefore c[i] cannot be the same as a[i], so it is okay to eliminate a[i] = 3;. Perhaps an issue you have is that, with 40 arrays, you have 40 pointers to arrays, so the compiler ends up moving pointers into and out of registers?
In that case, reusing fewer arrays for multiple purposes might help reduce that. If you have an algorithm that is actually using 40 arrays at one time, then you might look at restructuring the algorithm so it uses fewer arrays at a time. If an algorithm has to point to 40 different places in memory, then you essentially need 40 pointers, regardless of where or how they are allocated, and 40 pointers is more than there are registers available.
If you have other concerns about optimization and register use, you should be more specific about them.
Aliasing and artifacts: You report there are some aliasing and artifact problems, but you do not give sufficient details to understand them. If you have one large char array that you reinterpret as a struct containing all your arrays, then there is no aliasing within the struct. So it is not clear what issues you are encountering.
Just disable alias-based optimization and call it a day
If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.
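For cases where the code can be touched, one alternative to relying on the flag is to create the struct object inside the raw buffer instead of casting the buffer pointer. The following is only a C++ sketch of that idea (the question's code is C), using std::align and placement new; Tmp, kAlign, kHistSize and make_tmp are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

constexpr std::size_t kAlign = 32;     // assumed alignment requirement
constexpr std::size_t kHistSize = 64;  // hypothetical value for HIST_SIZE

struct Tmp {
    short history[kHistSize];
    short history2[2 * kHistSize];
    int energy[320];
};

// Find a kAlign-aligned address inside raw storage of `space` bytes and
// start the lifetime of a Tmp there. Returns nullptr if it does not fit.
inline Tmp* make_tmp(void* raw, std::size_t space) {
    void* p = raw;
    if (std::align(kAlign, sizeof(Tmp), p, space) == nullptr) {
        return nullptr;
    }
    return new (p) Tmp;  // members are left uninitialized, as with a local
}
```

A caller would declare unsigned char buf[sizeof(Tmp) + kAlign]; on the stack and use the pointer returned by make_tmp(buf, sizeof buf). Because a Tmp object really lives in the buffer, the compiler has no grounds to treat accesses through that pointer as unrelated to the storage.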
Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing would solve the problem, not cost much, and they all must have had other fish to fry.)
32 byte alignment sounds as if you are pushing it too far. No CPU instruction should require an alignment as large as that. Basically an alignment as wide as the largest data type of your architecture should suffice.
C11 has the concept of max_align_t, which is a dummy type of maximum alignment for the architecture. If your compiler doesn't have it yet, you can easily simulate it with something like
union maxalign0 {
long double a;
long long b;
/* ... perhaps a 128-bit integer type here ... */
};
typedef union maxalign1 maxalign1;
union maxalign1 {
unsigned char bytes[sizeof(union maxalign0)];
union maxalign0 align;
};
Now you have a data type that has the maximal alignment of your platform and that is default initialized with all bytes set to 0.
maxalign1 history_[someSize];
short * history = (short *)history_[0].bytes;
This avoids the awful address computations that you do currently; you'd only have to adapt someSize to take into account that you always allocate multiples of sizeof(maxalign1).
Also be assured that this has no aliasing problems. First of all, unions in C are made for this, and then character pointers (of any version) are always allowed to alias any other pointer.

Is there any way to create a 16-byte aligned class that can be passed as a param

We have a (Numeric 3 float) vector class that I would love to align to 16 bytes in order to allow SIMD operations. Using declspec to 16-byte align it causes a slew of C2719 errors ('parameter': formal parameter with __declspec(align('#')) won't be aligned). If I can't pass around a vector aligned, what's the point? Even using a const reference to the vector is causing the compiler error, which really annoys me.
Is there a way to do what I want here - get 16-byte class alignment while allowing struct passing without having to do some silly trickery to __m128 types?
You're not likely to get much of a benefit from using SIMD unless you're operating on a bunch of these 3-dimensional vector structures at a time, in which case you would probably pass them in an array, which you could align as you need to. The other case where you might obtain some benefit from SIMD is if you're doing a lot of computations on each vector and you can parallelize the operations on the three channels. In that case, then doing some manual manipulation at the beginning of a function to coax it into a __m128 type might still afford you some benefit.
If I can't pass around a vector aligned, what's the point?
__declspec(align(#)) does seem rather useless. C++11 has support for what you want; alignas appears to work in all the ways that __declspec(align(#)) is broken. For example, using alignas to declare your type will cause parameters of that type to be aligned.
Unfortunately Microsoft's compiler doesn't support standard alignment specifiers yet, and the only compiler I know of that does is Clang, which has limited support for Windows.
Anyway, I just wanted to point out that C++ has this feature and it will probably be available to you eventually. Unless you can move to another platform, for now you're probably best off not passing parameters by value, as others have mentioned.
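For reference, declaring the type with the standard specifier looks roughly like this. Vec3 and dot are hypothetical names for this sketch, and the functions take const references so that the example does not depend on how a particular compiler handles over-aligned by-value parameters.

```cpp
#include <cassert>
#include <cstdint>

// A three-float vector whose alignment (and therefore size) is raised to 16.
struct alignas(16) Vec3 {
    float x, y, z;
};

static_assert(alignof(Vec3) == 16, "Vec3 should be 16-byte aligned");
static_assert(sizeof(Vec3) == 16, "size is padded up to the alignment");

// Pass by const reference; the referenced object keeps its 16-byte alignment.
inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

Every Vec3 object, including locals and array elements, then starts on a 16-byte boundary, which is the precondition for aligned SIMD loads.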
Surely you don't need to pass the array by value? Pass a pointer to the 16-byte-aligned array instead. Or have I misunderstood something?
There is a __declspec(passinreg) that's supported on Xbox360, but not in Visual Studio for Windows at the moment.
You can vote for the request to support the feature here:
http://connect.microsoft.com/VisualStudio/feedback/details/381542/supporting-declspec-passinreg-in-windows
For vector arguments in our engine we use a VectorParameter typedef'ed to either const Vector or const Vector& depending on whether the platform supports passing by register.
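A minimal sketch of that typedef switch, assuming a made-up configuration macro ENGINE_PASS_VECTORS_IN_REGISTERS and a plain four-float Vector (neither comes from the engine mentioned above):

```cpp
#include <cassert>

struct Vector {
    float x, y, z, w;
};

// Pick the parameter-passing style once, per platform.
#if defined(ENGINE_PASS_VECTORS_IN_REGISTERS)
typedef const Vector VectorParameter;   // by value, where that is cheap
#else
typedef const Vector& VectorParameter;  // by reference everywhere else
#endif

// Functions are written against VectorParameter and never need to change.
inline float length_squared(VectorParameter v) {
    return v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
}
```

Call sites look identical in both configurations; only the generated calling sequence differs.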
While the question is old, situation with VC++ compiler hasn't changed much, so perhaps, these notes will be of value to someone.
1) The simple fix to allow classes or structs with __declspec(align(X)) to be passed to functions is to pass by reference. Use consts as needed.
2) There is definitely a reason to use SIMD for vector algebra. I was able to speed up the animation and skinning pass in our engine by 20% by switching just quat multiply and quat rotate functions to SIMD. No alignment, no arrays. Just two functions that took float[4] params. For something that wasn't poorly written to begin with and to result in measurable FPS improvement, this is nothing to sneeze at. And since these are the sort of things that can be hard to optimize later, there is really no such thing as premature optimization for vector algebra.
3) If you make your vectors into a class, all of the excessive _mm_store_ps and _mm_load_ps instructions on the stack optimize out under /O2. So while gain of having a single add via SIMD might be negligible, if you have cases where you run several operations back to back, the resulting code is blazing fast.

How to use bit values instead of chars in c++ program?

I got some code that I'd like to improve. It's a simple app for one of the variations of 2DBPP and you can take a look at the source at https://gist.github.com/892951
Here's an outline of things that I use chars for (I'd like to switch to binary values instead). Initialize a block of memory with '0's:
...
char* bin;
bin = new (nothrow) char[area];
memset(bin, '\0', area);
sometimes I check particular values:
if (!bin[j*height+k]) {...}
or blocks:
if (memchr(bin+i*height+pos.y, '\1', pos.height)) {...}
or set values to '1's:
memset(bin+i*height+best.y,'\1',best.height);
I don't know of any standard types or methods to work with binary values. How do I get to use the bits instead of bytes?
There's a related question that you might be interested in -
C++ performance: checking a block of memory for having specific values in specific cells
Thank you!
Edit: There's still a bigger question - would it be an improvement? I'm only concerned with time.
For starters, you can refer to this post:
How do you set, clear, and toggle a single bit?
Also, try looking into the C++ Std Bitset, or bit field.
I recommend reading up on boost.dynamic_bitset, which is a runtime-sized version of std::bitset.
Alternatively, if you don't want to use boost for some reason, consider using a std::vector<bool>. Quoting cppreference.com:
Note that a boolean vector (std::vector<bool>) is a specialization of the vector template that is designed to use less memory. A normal boolean variable usually uses 1-4 bytes of memory, but a boolean vector uses only one bit per boolean value.
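As a rough idea of how the three char-based operations from the question (clear everything, check a block, set a block) might map onto std::vector<bool>, here is a sketch. The names Bin, make_bin, any_set and set_range are invented for the example, and it makes no claim about being faster than the memset/memchr version.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Bin = std::vector<bool>;

// Replacement for new char[area] + memset(bin, '\0', area).
inline Bin make_bin(std::size_t area) {
    return Bin(area, false);
}

// Replacement for memchr(bin + first, '\1', count): is any bit in the run set?
inline bool any_set(const Bin& bin, std::size_t first, std::size_t count) {
    auto begin = bin.begin() + static_cast<std::ptrdiff_t>(first);
    auto end = begin + static_cast<std::ptrdiff_t>(count);
    return std::find(begin, end, true) != end;
}

// Replacement for memset(bin + first, '\1', count).
inline void set_range(Bin& bin, std::size_t first, std::size_t count) {
    std::fill_n(bin.begin() + static_cast<std::ptrdiff_t>(first), count, true);
}
```

Individual cells are still read as bin[j * height + k], exactly like the char array.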
Unless memory space is an issue, I would stay away from bit twiddling. You may save some memory space but increase run time. Packing and unpacking bits takes time and extra code.
Get the code more robust and correct before attempting bit twiddling. Play with different (high level) designs that can improve performance and memory usage.
If you are going to the bit level, study up on boolean arithmetic and logic. Redesign your data to be easier to manipulate at the bit level.
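For a feel of what working at the bit level involves, here is a small hand-rolled sketch that packs flags into 64-bit words. PackedBits and its member names are hypothetical, and there is no bounds checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PackedBits {
    std::vector<std::uint64_t> words;

    // Round the bit count up to whole 64-bit words, all cleared.
    explicit PackedBits(std::size_t bit_count)
        : words((bit_count + 63) / 64, 0) {}

    // OR with a one-bit mask sets the bit.
    void set(std::size_t i) { words[i / 64] |= mask(i); }
    // AND with the inverted mask clears it.
    void clear(std::size_t i) { words[i / 64] &= ~mask(i); }
    // XOR with the mask flips it.
    void toggle(std::size_t i) { words[i / 64] ^= mask(i); }
    // Shift the bit down to position 0 and isolate it.
    bool test(std::size_t i) const { return ((words[i / 64] >> (i % 64)) & 1u) != 0; }

private:
    static std::uint64_t mask(std::size_t i) {
        return std::uint64_t{1} << (i % 64);
    }
};
```

Each access costs a division, a shift and a mask on top of the memory access, which is the time-for-space trade described above.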

C++: delete vs. free and performance

Consider:
char *p=NULL;
free(p); // or
delete p;
What will happen if I use free and delete on p?
If a program takes a long time to execute, say 10 minutes, is there any way to reduce its running time to 5 minutes?
Some performance notes about new/delete and malloc/free:
malloc and free do not call the constructor and destructor, respectively. This means your classes won't get initialized or deinitialized automatically, which could be bad (e.g. uninitialized pointers)! This doesn't matter for POD data types like char and double, though, since they don't really have a ctor.
new and delete do call the constructor and destructor. This means your class instances are initialized and deinitialized automatically. Normally there's a performance hit (compared to plain allocation), but that's for the better.
I suggest staying consistent with new/malloc usage unless you have a reason not to (e.g. realloc). This way, you have fewer dependencies, reducing your code size and load time (only by a smidgen, though). Also, you won't mess up by free'ing something allocated with new, or deleting something allocated with malloc. (This will most likely cause a crash!)
Answer 1: Both free(p) and delete p work fine with a NULL pointer.
Answer 2: Impossible to answer without seeing the slow parts of the code. You should profile the code! If you are using Linux I suggest using Callgrind (part of Valgrind) to find out what parts of the execution takes the most time.
Question one: nothing will happen.
From the current draft of ISO/IEC 14882 (or: C++):
20.8.15 C Library [c.malloc]
The contents [of <cstdlib>, that is: where free lives,] are the same as the Standard C library [(see intro.refs for that)] header <stdlib.h>, with the following changes: [nothing that affects this answer].
So, from ISO/IEC 9899:1999 (or: C):
7.20.3.2 The free function
If [the] ptr [parameter] is a null pointer, no action occurs.
From the C++ standard again, for information about delete this time:
3.7.4.2 Deallocation functions [basic.stc.dynamic.deallocation]
The value of the first argument supplied to a deallocation function may be a null pointer value; if so, and if the deallocation function is one supplied in the standard library, the call has no effect.
See also:
What is the difference between new/delete and malloc/free?
What happens when you try to free() already freed memory in c?
Nothing will happen if you call free with a NULL argument, or delete with a NULL operand. Both are defined to accept NULL and perform no action.
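A tiny sketch of that guarantee; release_null_pointers is just a name chosen for the example:

```cpp
#include <cassert>
#include <cstdlib>

// Both calls are no-ops on a null pointer, so no guard is needed.
inline bool release_null_pointers() {
    char* p = nullptr;
    std::free(p);  // C library: no action occurs for a null pointer
    delete p;      // C++: deleting a null pointer has no effect
    return p == nullptr;
}
```

This is why an explicit if (p) check before free or delete is redundant.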
There are any number of things you can change in C++ code which can affect how fast it runs. Often the most useful (in approximate order) are:
Use good algorithms. This is a big topic, but for example, I recently halved the running time of a bit of code by using a std::vector instead of a std::list, in a case where elements were being added and removed only at the end.
Avoid repeating long calculations.
Avoid creating and copying objects unnecessarily (but anything which happens less than 10 million times per minute won't make any significant difference, unless you're handling something really big, like a vector of 10 million items).
Compile with optimisation.
Mark commonly-used functions (again, anything called more than 100 million times in your 10 minute runtime), as inline.
Having said that, divideandconquer is absolutely right - you can't effectively speed your program up unless you know what it spends its time doing. Sometimes this can be guessed correctly when you know how the code works, other times it's very surprising. So it's best to profile. Even if you can't profile the code to see exactly where the time is spent, if you measure what effect your changes have you can often figure it out eventually.
On question 2:
Previous answers are excellent. But I just wanted to add something about pre-optimization. Assuming a program of moderate complexity, the 90/10 rule usually applies - i.e. 90% of the execution time is spent in 10% of the code. "Optimized" code is often harder to read and to maintain. So, always solve the problems first, then see where the bottlenecks are (profiling is a good tool).
As others have pointed out deleting (or freeing) a NULL pointer will not do anything. However if you had allocated some memory then whether to use free() or delete depends upon the method you used to allocate them. For example, if you had used malloc() to allocate memory then you should free() and if you had used new to allocate then you should use delete. However, be careful not to mix the memory allocations. Use a single way to allocate and deallocate them.
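The pairing rule can be sketched as follows; roundtrip_sum is a hypothetical helper whose only job is to allocate three ways and release each block with its matching call.

```cpp
#include <cassert>
#include <cstdlib>

inline int roundtrip_sum() {
    int* one = new int(1);              // pairs with delete
    int* many = new int[3]{2, 3, 4};    // pairs with delete[]
    int* raw = static_cast<int*>(std::malloc(sizeof(int)));  // pairs with free
    if (raw == nullptr) {
        delete one;
        delete[] many;
        return -1;  // allocation failed
    }
    *raw = 5;

    int sum = *one + many[0] + many[1] + many[2] + *raw;

    delete one;      // allocated with new
    delete[] many;   // allocated with new[]
    std::free(raw);  // allocated with malloc
    return sum;
}
```

Swapping any of the three release calls for another one is undefined behavior, even if it happens to work on a given platform.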
For the second question, it will be very difficult to generalize without seeing the actual code. It should be taken on a case by case basis.
Good answers all.
On the performance issue, this provides a method that most can't imagine will work, but a few know it does, surprisingly well.
The 90/10 rule is true. In my experience, there usually are multiple trouble spots, and they usually are at mid-levels of the call stack. They often are caused by using over-general data structures, but you should never fix something unless you've proven that it actually is a problem. Performance problems are amazingly unpredictable.
Fixing any single performance problem may not give you the speedup you need, but each one you fix makes the remaining ones take a larger percentage of the remaining time, so they become easier to find. The speedups combine in a compounded fashion, so you may be surprised at the final result.
When you can no longer find significant problems that you can fix, you've done about as well as you can. Sometimes, at that point, a redesign (such as using code generation) can set off a further round of speedups.