Does this violate strict aliasing or pointer alignment rules? - c++

I'm swapping bytes in a char buffer:
char* data; // some char buffer
uint16_t* data16 = reinterpret_cast<uint16_t*>(data);
// For debugging:
int align = reinterpret_cast<uintptr_t>(data) % alignof(uint16_t);
std::cout << "aligned? " << align << "\n";
for (size_t i = 0; i < length_of_data16; i++) {
#if defined(__GNUC__) || defined(__clang__)
data16[i] = __builtin_bswap16(data16[i]);
#elif defined(_MSC_VER)
data16[i] = _byteswap_ushort(data16[i]);
#endif
}
I'm casting from char* to uint16_t*, which raises a flag because it's casting to a more strictly aligned type.
However, the code runs correctly (on x86), even when the debugging code prints 1 (as in, not aligned). In the assembly I see MOVDQU, which I take to mean that the compiler recognizes that this might not be aligned.
This looks similar to this question, where the answer was "this is not safe." Does the above code only work on certain architectures and with certain compilers, or is there a subtle difference between these two questions that makes the above code valid?
(Less important: consistent with what I've read online, there's also no noticeable perf difference between aligned and unaligned execution of this code.)

If alignof(uint16_t) != 1 then this line may cause undefined behaviour due to alignment:
uint16_t* data16 = reinterpret_cast<uint16_t*>(data);
Putting an alignment check after this is no good: a compiler may assume the pointer is aligned, because correct code couldn't reach that point otherwise, and fold the check into a constant.
In Standard C++, for this check to be meaningful it must occur before the cast, and the cast must not be performed if the check fails. (UB can time travel.)
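A minimal sketch of doing the check before the cast. The helper names is_aligned_for and try_view_u16 are hypothetical, not from the question, and the sketch addresses alignment only: any later dereference still has to satisfy the aliasing rules.

```cpp
#include <cstdint>

// Hypothetical helper: true when p satisfies the alignment requirement of T.
// The pointer-to-integer mapping is implementation-defined, but on mainstream
// platforms the integer value is the address.
template <typename T>
bool is_aligned_for(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Hypothetical helper: performs the cast only after the check has passed,
// and returns nullptr otherwise so the caller can take a byte-wise path.
inline std::uint16_t* try_view_u16(char* data) {
    if (!is_aligned_for<std::uint16_t>(data)) {
        return nullptr;  // the cast is never performed on a misaligned pointer
    }
    return reinterpret_cast<std::uint16_t*>(data);
}
```

A caller would test the result of try_view_u16(data) against nullptr and fall back to a byte-wise routine when it fails.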
Of course, individual compilers may choose to define behaviour that is not defined by the Standard, e.g. perhaps g++ targeting x86 or x64 includes a definition that you're allowed to form unaligned pointers and dereference them.
There is no strict aliasing violation, as __builtin_bswap16 is not covered by the standard and we presume g++ implements it in such a way that is consistent with itself. MSVC doesn't do strict aliasing optimizations anyway.
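One way to sidestep both the alignment and the aliasing question is to never form a uint16_t* at all and move the bytes with memcpy instead. The following is only a sketch: swap_bytes16 and count16 are hypothetical names, and the swap is written with shifts rather than the compiler builtins used in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Swap the two bytes of every 16-bit unit in a char buffer.
// count16 is the number of 16-bit units, not the number of bytes.
inline void swap_bytes16(char* data, std::size_t count16) {
    for (std::size_t i = 0; i < count16; ++i) {
        std::uint16_t v;
        std::memcpy(&v, data + 2 * i, sizeof v);  // load, valid at any alignment
        v = static_cast<std::uint16_t>((v << 8) | (v >> 8));  // exchange the bytes
        std::memcpy(data + 2 * i, &v, sizeof v);  // store, valid at any alignment
    }
}
```

Mainstream optimizing compilers typically turn fixed-size memcpy calls like these into plain load and store instructions, so this form usually costs nothing compared with the cast.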

Related

Test of 8 subsequent bytes isn't translated into a single compare instruction

Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
bool f1(const char *ptr)
{
for (int i = 0; i < 8; i++)
if (ptr[i])
return false;
return true;
}
bool f2(const char *ptr)
{
bool res = true;
for (int i = 0; i < 8; i++)
res &= (ptr[i] == 0);
return res;
}
bool f3(const char *ptr)
{
static const char tmp[8]{};
return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for all of GCC, Clang, and the Intel compiler with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with additional unaligned load)? It seems to be a pretty straightforward optimization to me.
In f1 the loop stops as soon as ptr[i] is non-zero, so it is not always equivalent to examining all 8 elements, as the two other functions do, or to directly comparing an 8-byte word. The difference shows when the array is shorter than 8 bytes (the compiler does not know the size of the array):
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2, I agree that it could be replaced by an 8-byte comparison, provided the CPU can read an 8-byte word from any address regardless of alignment, which is the case on x64. But that can still introduce unusual situations, as explained in Unusual situations where this wouldn't be safe in x86 asm
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if you made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks @Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need to trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
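A sketch of that memcpy variant. The name all_zero8 is hypothetical, and the caller is assumed to guarantee that 8 bytes are readable at ptr, which is the same assumption f2 and f3 make.

```cpp
#include <cstdint>
#include <cstring>

// Returns true when the 8 bytes starting at ptr are all zero.
// The memcpy is an aliasing-safe load that is valid at any alignment.
inline bool all_zero8(const char* ptr) {
    std::uint64_t val;
    std::memcpy(&val, ptr, sizeof val);
    return val == 0;
}
```

Unlike a cast to uint64_t*, this makes no claim about the alignment or the dynamic type of the pointed-to bytes.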
Or in GNU C++, use a typedef to express unaligned may-alias loads:
bool f4(const char *ptr) {
typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
auto val = *(const aliasing_unaligned_u64*)ptr;
return val != 0;
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
cmp QWORD PTR [rdi], 0
setne al
ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk that contains at least 1 valid string byte doesn't cross a page boundary.) Also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>
inline bool EightBytesNull(const char *ptr)
{
return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.

How to force an alignment error when casting a u8[] in a u16?

While answering a question, I was trying to warn the OP about alignment problems.
But when writing a snippet to show the OP how this can happen, I was unable to make it happen.
When running this code (C/C++) on an online compiler, I would expect it to fail.
Why is it not?
#include <cstdint>
#include <cstddef>
#include <iostream>
#define SIZE 20
int main()
{
uint8_t in[20];
in[0] = 0;
in[1] = 1;//8bit
in[2] = 1;
in[3] = 1;//16bit
in[4] = 1;
in[5] = 1;
in[6] = 1;
in[7] = 1;//32bit
in[8] = 1;
in[9] = 1;
in[10] = 1;
in[11] = 1;
in[12] = 1;
in[13] = 1;
in[14] = 1;
in[15] = 1;//64bit
in[16] = 1;
in[17] = 1;
in[18] = 1;
in[19] = 1;
uint16_t out;
for (int i =0; i < SIZE - 2; i++)
{
out = *((uint16_t*)&in[i+1]);
std::cout << "&in: " << (void*)&in[i+1] << "\n out: " << out << "\n in: " << in[i+2]*256 + in[i+1]<< std::endl;
}
return 0;
}
When running this code, I would expect it to fail. Why is it not?
Because:
The behaviour of the program is undefined [1]. There is no guarantee of failure [2].
You may be using a system whose CPU supports misaligned access. As far as I understand, x86 for example performs misaligned reads and writes; they are merely slower than aligned ones (this does not apply to SIMD instructions though).
C++ standard says (quoting the latest draft):
[1]
[basic.lval]
If a program attempts to access the stored value of an object through a glvalue whose type is not similar ([conv.qual]) to one of the following types the behavior is undefined:
the dynamic type of the object,
a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
a char, unsigned char, or std::byte type.
uint16_t is none of those listed exceptional types in this case (well, it could be on some system that has 16 bit byte, but not in general, and probably not on the server that runs the online compiler, and such system probably wouldn't provide uint8_t).
[2]
[defns.undefined]
behavior for which this document imposes no requirements
Note the lack of any guarantees.
When running this code, I would expect it to fail. Why is it not?
The program has undefined behavior as a result of strict-aliasing violations, but that doesn't mean it is obligated to fail (see "undefined"). From an alignment perspective, it is nowhere required that accessing a value through a pointer that does not have the natural alignment for its target type must fail, although that is one case that falls largely under the umbrella of the strict aliasing rule. Whether such an access attempt actually does fail typically depends on the hardware on which the program runs.
What exactly happens depends on the platform used (cpu architecture and operating system).
There are several possibilities:
The architecture does not have a natural word alignment at all, so all accesses are considered aligned.
The CPU handles the unaligned access internally by performing several aligned accesses and constructing the result (slow).
The CPU detects the unaligned access and throws an exception. The operating system catches this exception and emulates the unaligned access in software (slower!).
Linux, for example, has this option for several ARM architectures; you can even choose whether the unaligned access should be ignored, fixed, or signalled, optionally accompanied by a warning in the kernel log (see kernel source file arch/arm/mm/alignment.c).
The unaligned access results in a CPU exception and the process is signalled.
On Linux, the process is usually terminated with a SIGBUS in that case.
Summary: Avoiding unaligned access is the safe side, but on most platforms, it will still work one way or the other.
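To stay on the safe side on every one of those platforms, the 16-bit value can be assembled from individual bytes, which is what the in[i+2]*256 + in[i+1] expression in the question already does. A sketch, with load_u16_le as a hypothetical helper name:

```cpp
#include <cstdint>

// Reads a 16-bit little-endian value starting at p.
// Only single-byte accesses are performed, so p may have any alignment,
// and the result does not depend on the byte order of the host.
inline std::uint16_t load_u16_le(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
```

With the question's buffer, load_u16_le(&in[i+1]) could replace the *((uint16_t*)&in[i+1]) expression without relying on how the CPU or the operating system treats misaligned accesses.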

Additional questions on memory alignment

There have previously been some great answers on memory alignment, but I feel they don't completely answer some questions.
E.g.:
What is data alignment? Why and when should I be worried when typecasting pointers in C?
What is aligned memory allocation?
I have an example program:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>
int32_t cast_1(int offset) {
std::vector<char> x = {1,2,3,4,5};
return reinterpret_cast<int32_t*>(x.data()+offset)[0];
}
int32_t cast_2(int offset) {
std::vector<char> x = {1,2,3,4,5};
int32_t y;
std::memcpy(reinterpret_cast<char*>(&y), x.data() + offset, 4);
return y;
}
int main() {
std::cout << cast_1(1) << std::endl;
std::cout << cast_2(1) << std::endl;
return 0;
}
The cast_1 function outputs a ubsan alignment error (as expected) but cast_2 does not. However, cast_2 looks much less readable to me (requires 3 lines). cast_1 looks perfectly clear on the intent, even though it is UB.
Questions:
1) Why is cast_1 UB, when the intent is perfectly clear? I understand that there may be performance issues with alignment.
2) Is cast_2 a correct approach to fixing the UB of cast_1?
1) Why is cast_1 UB?
Because the language rules say so. Multiple rules in fact.
The offset where you access the object does not meet the alignment requirements of int32_t (except on systems where the alignment requirement is 1). No objects can be created without conforming to the alignment requirement of the type.
A char pointer may not be aliased by an int32_t pointer.
2) Is cast_2 a correct approach to fixing the UB of cast_1?
cast_2 has well defined behaviour. The reinterpret_cast in that function is redundant, and it is bad to use magic constants (use sizeof).
WRT the first question, it would be trivial for the compiler to handle that for you, true. All it would have to do is pessimize every other non-char load in the program.
The alignment rules were written precisely so the compiler can generate code that performs well on the many platforms where aligned memory access is a fast native op, and misaligned access is the equivalent of your memcpy. Except where it could prove alignment, the compiler would have to handle every load the slow & safe way.
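If the three lines of cast_2 feel unreadable, they can be wrapped once in a small helper. A sketch, where load_unaligned is a hypothetical name; it uses sizeof instead of a magic constant and drops the redundant reinterpret_cast:

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Copies sizeof(T) bytes from src into a properly aligned T and returns it.
// The caller must guarantee that sizeof(T) bytes are readable at src.
template <typename T>
T load_unaligned(const char* src) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    T value;
    std::memcpy(&value, src, sizeof(T));
    return value;
}
```

With this helper, the body of cast_2 reduces to return load_unaligned<int32_t>(x.data() + offset); which states the intent about as clearly as cast_1 does.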

How to use alignas to replace pragma pack?

I am trying to understand how alignas should be used, and I wonder whether it can be a replacement for pragma pack. I have tried hard to verify this, but with no luck. Using gcc 4.8.1 (http://ideone.com/04mxpI) I always get 8 bytes for STestAlignas below, while with pragma pack it is 5 bytes. What I would like to achieve is to make sizeof(STestAlignas) return 5. I tried running this code on clang 3.3 (http://gcc.godbolt.org/) but I got an error:
!!error: requested alignment is less than minimum alignment of 8 for type 'long' - just below alignas usage.
So maybe there is a minimum alignment value for alignas?
below is my test code:
#include <iostream>
#include <cstddef>
using namespace std;
#pragma pack(1)
struct STestPragmaPack {
char c;
long d;
} datasPP;
#pragma pack()
struct STestAttributPacked {
char c;
long d;
} __attribute__((packed)) datasAP;
struct STestAlignas {
char c;
alignas(char) long d;
} datasA;
int main() {
cout << "pragma pack = " << sizeof(datasPP) << endl;
cout << "attribute packed = " << sizeof(datasAP) << endl;
cout << "alignas = " << sizeof(datasA) << endl;
}
results for gcc 4.8.1:
pragma pack = 5
attribute packed = 5
alignas = 8
[26.08.2019]
It appears there is some standardisation movement on this topic. The p1112 proposal - Language support for class layout control - suggests adding (among others) a [[layout(smallest)]] attribute which would reorder class members so as to make the alignment cost as small as possible (a common technique among programmers, but one that often kills class definition readability). But this is not equal to what #pragma pack does!
alignas cannot replace #pragma pack.
GCC accepts the alignas declaration, but still keeps the member properly aligned: satisfying the strictest alignment requirement (in this case, the alignment of long) also satisfies the requirement you specified.
However, GCC is too lenient, as the standard actually explicitly forbids this in §7.6.2, paragraph 5:
The combined effect of all alignment-specifiers in a declaration shall not specify an alignment that is less strict than the alignment that would be required for the entity being declared if all alignment-specifiers were omitted (including those in other declarations).
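The asymmetry can be sketched with a few static_asserts: alignas may only raise an alignment, while packing lowers it. The struct names below are hypothetical, and #pragma pack(push, 1) is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++.

```cpp
#include <cstddef>

#pragma pack(push, 1)
struct Packed {          // padding removed by the pragma
    char c;
    long d;
};
#pragma pack(pop)

struct Natural {         // default layout: d is placed at a multiple of alignof(long)
    char c;
    long d;
};

struct Over {            // alignas asking for a stricter alignment is allowed
    char c;
    alignas(16) long d;
};

static_assert(sizeof(Packed) == sizeof(char) + sizeof(long), "no padding when packed");
static_assert(sizeof(Natural) >= sizeof(Packed), "default layout is never smaller");
static_assert(alignof(Over) >= 16, "alignas raised the alignment of the struct");
static_assert(offsetof(Over, d) % 16 == 0, "d sits on a 16-byte boundary");
```

Writing alignas(1) long d; in place of alignas(16) is the ill-formed case that the quoted paragraph describes.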
I suppose you know that working with unaligned or misaligned data has risks and costs.
For instance, retrieving a misaligned data structure of 5 bytes is more expensive than retrieving an aligned 8-byte one. This is because, if your 5 "... byte data does not start on one of those 4 byte boundaries, the computer must read the memory twice, and then assemble the 4 bytes to a single register internally" (1).
Working with unaligned data requires more operations and ends in more time (and power) consumption by the CPU.
Please consider that both C and C++ are conceived as "hardware friendly" languages, which means not only "minimum memory usage" languages, but principally languages focused on efficiency and processing speed. Data alignment (when it is not strictly required for "what I need to store") is a concept that implies another one: "many times, software and hardware are similar to life: you require sacrifices to reach better results!".
Please also ask yourself whether you are making a wrong assumption, something like: "smaller structures => faster processing". If that is the case, you might be (totally) wrong.
But if we suppose that you do not care at all about the efficiency, power consumption and speed of your software, and are only interested (because of hardware limitations or just out of theoretical interest) in "minimum memory usage", then perhaps you might find the following readings useful:
(1) Declare, manipulate and access unaligned memory in C++
(2) C Avoiding Alignment Issues
BUT, please, be sure to read the following ones:
(3) What does the standard say about unaligned memory access?
Which redirects to this snippet of the Standard:
(4) http://eel.is/c++draft/basic.life#1
(5) Unaligned memory access: is it defined behavior or not? [Which is duplicated but, maybe, with some extra information].
Unfortunately, alignment is guaranteed in neither C++11 nor C++14.
But it is effectively guaranteed in C++17.
Please, check this excellent work from Bartlomiej Filipek:
https://www.bfilipek.com/2019/08/newnew-align.html

Is there any guarantee of alignment of address returned by C++'s new operation?

Most experienced programmers know that data alignment is important for a program's performance. I have seen some programmers allocate a buffer bigger than they need and use an aligned pointer into it as the start. I am wondering whether I should do that in my program; I have no idea whether there is any guarantee on the alignment of the address returned by C++'s new operation. So I wrote a little program to test:
for(size_t i = 0; i < 100; ++i) {
char *p = new char[123];
if(reinterpret_cast<size_t>(p) % 4) {
cout << "*";
system("pause");
}
cout << reinterpret_cast<void *>(p) << endl;
}
for(size_t i = 0; i < 100; ++i) {
short *p = new short[123];
if(reinterpret_cast<size_t>(p) % 4) {
cout << "*";
system("pause");
}
cout << reinterpret_cast<void *>(p) << endl;
}
for(size_t i = 0; i < 100; ++i) {
float *p = new float[123];
if(reinterpret_cast<size_t>(p) % 4) {
cout << "*";
system("pause");
}
cout << reinterpret_cast<void *>(p) << endl;
}
system("pause");
The compiler I am using is Visual C++ Express 2008. It seems that all addresses the new operation returned are aligned. But I am not sure. So my question is: is there any guarantee? If there is, I don't have to align myself; if not, I have to.
The alignment has the following guarantee from the standard (3.7.3.1/2):
The pointer returned shall be suitably aligned so that it can be converted to a
pointer of any complete object type and then used to access the object or array in the
storage allocated (until
the storage is explicitly deallocated by a call to a corresponding deallocation function).
EDIT: Thanks to timday for highlighting a bug in gcc/glibc where the guarantee does not hold.
EDIT 2: Ben's comment highlights an interesting edge case. The requirements on the allocation routines are for those provided by the standard only. If the application has its own version, then there's no such guarantee on the result.
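A sketch of checking that guarantee against the strictest fundamental alignment instead of the hard-coded 4 used in the question. The helper names are hypothetical, and the check is expected to pass only with a conforming default allocation function, per the caveats in the two edits above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Allocates the same char[123] as the question and tests the result against
// alignof(std::max_align_t), the strictest fundamental alignment.
inline bool new_char_array_is_max_aligned() {
    char* p = new char[123];
    const bool ok = is_aligned(p, alignof(std::max_align_t));
    delete[] p;
    return ok;
}
```

A single passing run proves nothing about the guarantee itself, but a failing one would point to a non-conforming or replaced allocator.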
This is a late answer but just to clarify the situation on Linux - on 64-bit systems
memory is always 16-byte aligned:
http://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html
The address of a block returned by malloc or realloc in the GNU system is always a
multiple of eight (or sixteen on 64-bit systems).
The new operator calls malloc internally
(see ./gcc/libstdc++-v3/libsupc++/new_op.cc)
so this applies to new as well.
The implementation of malloc which is part of the glibc basically defines
MALLOC_ALIGNMENT to be 2*sizeof(size_t) and size_t is 32bit=4byte and 64bit=8byte
on a x86-32 and x86-64 system, respectively.
$ cat ./glibc-2.14/malloc/malloc.c:
...
#ifndef INTERNAL_SIZE_T
#define INTERNAL_SIZE_T size_t
#endif
...
#define SIZE_SZ (sizeof(INTERNAL_SIZE_T))
...
#ifndef MALLOC_ALIGNMENT
#define MALLOC_ALIGNMENT (2 * SIZE_SZ)
#endif
C++17 changes the requirements on the new allocator, such that it is required to return a pointer whose alignment is equal to the macro __STDCPP_DEFAULT_NEW_ALIGNMENT__ (which is defined by the implementation, not by including a header).
This is important because this size can be larger than alignof(std::max_align_t). In Visual C++ for example, the maximum regular alignment is 8-byte, but the default new always returns 16-byte aligned memory.
Also, note that if you override the default new with your own allocator, you are required to abide by the __STDCPP_DEFAULT_NEW_ALIGNMENT__ as well.
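A sketch of what this means in practice, assuming a C++17 compiler whose library supports aligned allocation for the requested alignment. CacheLine and the two function names are hypothetical; the fallback branch for pre-C++17 compilers is an assumption, not something the standard specifies.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical over-aligned type: its alignment exceeds the default new alignment
// on common platforms, so C++17 routes the allocation to the aligned operator new.
struct alignas(64) CacheLine {
    char bytes[64];
};

inline bool overaligned_new_is_aligned() {
    CacheLine* p = new CacheLine;
    const bool ok = reinterpret_cast<std::uintptr_t>(p) % alignof(CacheLine) == 0;
    delete p;
    return ok;
}

// Reports the implementation's default new alignment.
inline std::size_t default_new_alignment() {
#ifdef __STDCPP_DEFAULT_NEW_ALIGNMENT__
    return __STDCPP_DEFAULT_NEW_ALIGNMENT__;
#else
    return alignof(std::max_align_t);  // assumed fallback when the macro is absent
#endif
}
```

Before C++17, new CacheLine was not required to honour the 64-byte requirement, which is why hand-written aligned allocators were common.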
Incidentally the MS documentation mentions something about malloc/new returning addresses which are 16-byte aligned, but from experimentation this is not the case. I happened to need the 16-byte alignment for a project (to speed up memory copies with enhanced instruction set), in the end I resorted to writing my own allocator...
The platform's new/new[] operator will return pointers with sufficient alignment so that it'll perform well with basic datatypes (double, float, etc.). At least any sensible C++ compiler+runtime should do that.
If you have special alignment requirements like for SSE, then it's probably a good idea to use special aligned_malloc functions, or roll your own.
I worked on a system where they used the alignment to free up the odd bit for their own use!
They used the odd bit to implement a virtual memory system.
When a pointer had the odd bit set they used that to signify that it pointed (minus the odd
bit) to the information to get the data from the database not the data itself.
I thought this a particularly nasty bit of coding which was far too clever for its own good!!
Tony