How can I avoid violating the strict aliasing rule while trying to modify the char* result of a sha256 function?
Compute hash value:
std::string sha = sha256("some text");
const char* sha_result = sha.c_str();
const unsigned long* mod_args = reinterpret_cast<const unsigned long*>(sha_result);
then XORing the words down to two 32-bit pieces:
unsigned long a = mod_args[1] ^ mod_args[3] ^ mod_args[5] ^ mod_args[7];
unsigned long b = mod_args[0] ^ mod_args[2] ^ mod_args[4] ^ mod_args[6];
then getting the result by concatenating those two pieces:
unsigned long long result = (((unsigned long long)a) << 32) | b;
As depressing as it might sound, the only truly portable, standard-conforming and efficient way of doing this is through memcpy(). Using reinterpret_cast violates the strict aliasing rule, and using a union (as often suggested) triggers undefined behaviour when you read from a member other than the one you last wrote to.
However, since most compilers will optimize away memcpy() calls, this is not as depressing as it sounds.
For example, the following code with two memcpy()s:
char* foo() {
    char* sha = sha256("some text");
    unsigned int mod_args[8];
    memcpy(mod_args, sha, sizeof(mod_args));
    mod_args[5] = 0;
    memcpy(sha, mod_args, sizeof(mod_args));
    return sha;
}
produces the following optimized assembly:
foo(): # #foo()
pushq %rax
movl $.L.str, %edi
callq sha256(char const*)
movl $0, 20(%rax)
popq %rdx
retq
It is easy to see that no memcpy() remains - the value is modified 'in place'.
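Applied to the question's code, a minimal sketch of the same idea (assuming, as the question seems to, that sha256() returns a std::string whose first 32 bytes are the raw digest - adjust if it returns hex text):

#include <cstdint>
#include <cstring>
#include <string>

std::uint64_t fold_digest(const std::string& sha) {
    std::uint32_t w[8];                         // 8 x 32-bit words = 256 bits
    std::memcpy(w, sha.data(), sizeof(w));      // well-defined: copies bytes, no aliasing
    std::uint32_t a = w[1] ^ w[3] ^ w[5] ^ w[7];
    std::uint32_t b = w[0] ^ w[2] ^ w[4] ^ w[6];
    return (static_cast<std::uint64_t>(a) << 32) | b;
}

Using fixed-width std::uint32_t instead of unsigned long also sidesteps the LP64 problem where unsigned long is 64 bits wide and the original indexing would read past the end of the digest.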
Related
Background
This was inspired by this question/answer and the ensuing discussion in the comments: Is the definition of “volatile” this volatile, or is GCC having some standard compliancy problems?. Based on others' and my interpretation of what should be happening, as discussed in the comments, I've submitted it to GCC Bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71793 Other relevant responses are still welcome.
Also, that thread has since given rise to this question: Does accessing a declared non-volatile object through a volatile reference/pointer confer volatile rules upon said accesses?
Intro
I know volatile isn't what most people think it is and is an implementation-defined nest of vipers. And I certainly don't want to use the below constructs in any real code. That said, I'm totally baffled by what's going on in these examples, so I'd really appreciate any elucidation.
My guess is this is due to either highly nuanced interpretation of the Standard or (more likely?) just corner-cases for the optimiser used. Either way, while more academic than practical, I hope this is deemed valuable to analyse, especially given how typically misunderstood volatile is. Some more data points - or perhaps more likely, points against it - must be good.
Input
Given this code:
#include <cstddef>

void f(void *const p, std::size_t n)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    volatile unsigned char const x = 42;
    // N.B. Yeah, const is weird, but it doesn't change anything
    while (n--) {
        *y++ = x;
    }
}

void g(void *const p, std::size_t n, volatile unsigned char const x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}

void h(void *const p, std::size_t n, volatile unsigned char const &x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}

int main(int, char **)
{
    int y[1000];
    f(&y, sizeof y);
    volatile unsigned char const x{99};
    g(&y, sizeof y, x);
    h(&y, sizeof y, x);
}
Output
g++ from gcc (Debian 4.9.2-10) 4.9.2 (Debian stable a.k.a. Jessie) with the command line g++ -std=c++14 -O3 -S test.cpp produces the below ASM for main(). Version Debian 5.4.0-6 (current unstable) produces equivalent code, but I just happened to run the older one first, so here it is:
main:
.LFB3:
.cfi_startproc
# f()
movb $42, -1(%rsp)
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L21:
subq $1, %rax
movzbl -1(%rsp), %edx
jne .L21
# x = 99
movb $99, -2(%rsp)
movzbl -2(%rsp), %eax
# g()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L22:
subq $1, %rax
jne .L22
# h()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L23:
subq $1, %rax
movzbl -2(%rsp), %edx
jne .L23
# return 0;
xorl %eax, %eax
ret
.cfi_endproc
Analysis
All 3 functions are inlined, and both of the volatile local variables are allocated on the stack, for fairly obvious reasons. But that's about the only thing they share...
f() makes sure to read from x on each iteration, presumably due to its volatility - but just dumps the result into edx, presumably because the destination y isn't declared volatile and is never read afterwards, meaning changes to it can be nixed under the as-if rule. OK, makes sense.
Well, I mean... kinda. Like, not really, because volatile is really for hardware registers, and clearly a local value can't be one of those - and can't otherwise be modified in a volatile way unless its address is passed out... which it's not. Look, there's just not a lot of sense to be had out of volatile local values. But C++ lets us declare them and tries to do something with them. And so, confused as always, we stumble onwards.
g(): What. By moving the volatile source into a pass-by-value parameter, which is still just another local variable, GCC somehow decides it's not volatile, or less so, and doesn't need to read it on every iteration... but it still carries out the loop, despite its body now doing nothing.
h(): By taking the passed volatile by reference, the same effective behaviour as f() is restored, so the loop does volatile reads.
This case alone actually makes practical sense to me, for reasons outlined above against f(). To elaborate: Imagine x refers to a hardware register, of which every read has side-effects. You wouldn't want to skip any of those.
Adding #define volatile /**/ leads to main() being a no-op, as you'd expect. So, when present, even on a local variable volatile does do something... I just have no idea what in the case of g(). What on Earth is going on there?
Questions
Why does a local value declared in-body produce different results from a by-value parameter, with only the latter having its reads optimised away? Both are declared volatile. Neither has its address passed out - and neither has a static address, ruling out any inline-ASM POKEry - so they can never be modified outwith the function. The compiler can see that each is constant, need never be re-read, and volatile just ain't true -
so (A) is either allowed to be elided under such constraints? (acting as-if they weren't declared volatile) -
and (B) why does only one get elided? Are some volatile local variables more volatile than others?
Setting aside that inconsistency for just a moment: After the read was optimised away, why does the compiler still generate the loop? It does nothing! Why doesn't the optimiser elide it as-if no loop was coded?
Is this a weird corner case due to order of optimising analyses or such? As the code is a daft thought-experiment, I wouldn't chastise GCC for this, but it'd be good to know for sure. (Or is g() the manual timing loop people have dreamt of all these years?) If we conclude there's no Standard bearing on any of this, I'll move it to their Bugzilla just for their information.
And of course, the more important question from a practical perspective, though I don't want that to overshadow the potential for compiler geekery... Which, if any of these, are well-defined/correct according to the Standard?
For f: GCC eliminates the non-volatile stores (but not the volatile loads, which can have side effects if the source location is a memory-mapped hardware register). There is really nothing surprising here.
For g: Because of the x86_64 ABI, the parameter x of g is passed in a register (namely rdx) and does not have a location in memory. Reading a general-purpose register has no observable side effects, so the dead read gets eliminated.
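One way to see the register-vs-memory distinction in action (a hypothetical variation of mine, not from the thread): force the parameter to have a memory location by taking its address, and read through the pointer; volatile reads through a pointer are observable behaviour, so the per-iteration loads should come back:

// Hypothetical g2(): like g(), but x is forced into a stack slot.
void g2(void *const p, std::size_t n, volatile unsigned char const x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    volatile unsigned char const *xp = &x; // taking the address forces a memory location
    while (n--) {
        *y++ = *xp; // a volatile read on each iteration
    }
}

Whether a given GCC version actually re-materialises the loads here is, as ever, something to verify against its output.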
I wonder what the recommended way is to convert integers to/from little-endian in a portable way.
Is there any library for that?
We can use htonl and ntohl and then do another big-endian-to-little-endian conversion on top, but that's not efficient.
The portable way is to use bit shifts and masks into an appropriately sized string. Notice I say string, because this is really the only time you need to concern yourself with endianness -- when transferring bytes between systems.
If you want to avoid unnecessary conversions (e.g. converting to little-endian on a little-endian architecture), there is no completely portable way to do it at compile-time. But you can check at runtime to dynamically select the set of conversion functions.
This has the disadvantage that the dynamically selected code can't be inlined. It might be more efficient to write the conversions in the portable way and use templates or inlining. Combined with semi-portable compile-time checks, that's about as good as you'll get.
Further reading: Detecting Endianness at compile-time.
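As a concrete illustration of the shift-and-mask approach (a minimal sketch; the function names are mine):

#include <cstdint>

// Serialize a 32-bit value into little-endian byte order,
// regardless of the host's endianness.
void store_le32(std::uint32_t v, unsigned char out[4]) {
    out[0] = static_cast<unsigned char>(v);
    out[1] = static_cast<unsigned char>(v >> 8);
    out[2] = static_cast<unsigned char>(v >> 16);
    out[3] = static_cast<unsigned char>(v >> 24);
}

// Deserialize a little-endian byte sequence back into a 32-bit value.
std::uint32_t load_le32(const unsigned char in[4]) {
    return static_cast<std::uint32_t>(in[0])
         | (static_cast<std::uint32_t>(in[1]) << 8)
         | (static_cast<std::uint32_t>(in[2]) << 16)
         | (static_cast<std::uint32_t>(in[3]) << 24);
}

Because these functions only ever talk about byte positions within the serialized buffer, they behave identically on big- and little-endian hosts.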
This is a great question. It prompted me to see if there was any way to determine endianness at compile time using constexpr expression.
It turns out that without preprocessor tricks it's not possible, because there's no way to turn an integer into a sequence of bytes (via casts or unions) when evaluating in a constexpr context.
However it turns out that in gcc, a simple run-time check gets optimised away when compiled with -O2, so this is actually optimally efficient:
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <memory>

bool is_little_endian()
{
    union endian_tester {
        std::uint16_t n;
        std::uint8_t p[sizeof(std::uint16_t)];
    };
    const endian_tester sample = {0x0102};
    return sample.p[0] == 0x02;
}

template<class Int>
Int to_little_endian(Int in)
{
    if (is_little_endian()) {
        return in;
    }
    else {
        Int out = 0;
        std::uint8_t* p = reinterpret_cast<std::uint8_t*>(std::addressof(out));
        for (std::size_t byte = 0; byte < sizeof(in); ++byte) {
            auto part = (in >> (byte * 8)) & 0xff;
            *p++ = std::uint8_t(part);
        }
        return out;
    }
}

int main()
{
    auto x = to_little_endian(10);
    std::cout << x << std::endl;
}
Here's the assembler output when compiling on an Intel (little-endian) platform:
main:
subq $8, %rsp
#
# here is the call to to_little_endian()
#
movl $10, %esi
#
# that was it - it's been completely optimised away
#
movl std::cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)
xorl %eax, %eax
addq $8, %rsp
ret
Does a conversion like:
int a[3];
char i=1;
a[ static_cast<unsigned char>(i) ];
introduce any overhead, such as an actual conversion instruction, or can the compiler optimize everything away?
I'm asking because I want to get rid of -Wchar-subscripts warnings but still want to use a char as the index (for other reasons).
I ran a test on Clang 3.4.1 with this code:
int ival(signed char c) {
    int a[] = {0,1,2,3,4,5,6,7,8,9};
    unsigned char u = static_cast<unsigned char>(c);
    return a[u];
}
Here is the relevant part of the assembly file generated with c++ -S -O3:
_Z4ivala: # #_Z4ivala
# BB#0:
pushl %ebp
movl %esp, %ebp
movzbl 8(%ebp), %eax
movl .L_ZZ4ivalaE1a(,%eax,4), %eax
popl %ebp
ret
There is no trace of the conversion.
On most modern architectures char and unsigned char have the same size and alignment (indeed, the standard guarantees it), hence unsigned char can represent all non-negative values of char, and casting one to the other does not require any CPU instructions.
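If the cast shows up in many places, a tiny helper keeps it in one spot (a sketch; the helper name uidx is mine):

#include <cstddef>

// Hypothetical helper: index with a char without triggering -Wchar-subscripts.
inline std::size_t uidx(char c) {
    return static_cast<unsigned char>(c); // zero-extending a byte costs nothing on common ABIs
}

// usage:
// int a[3];
// char i = 1;
// a[uidx(i)] = 42;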
I understand that, in C++, when I convert a float/double into an int, whereby the floating-point number is beyond the range that the int can hold, the result is not defined as part of the C++ language. The result depends on the implementation/compiler. What are some strategies common compilers use to deal with this?
For example, converting 7.2E12 to an int can yield the value 1634811904 or 2147483647. Does anyone know what the compiler is doing in each of these cases?
The compiler generates sequences of instructions that produce the correct result for all inputs that do not cause overflow. This is all it has to worry about (because overflow in the conversion from floating-point to integer is undefined behavior). The compiler does not “deal with” overflows so much as completely ignore them. If the underlying assembly instruction(s) on the platform raise an exception, fine. If they wrap around, fine. If they produce nonsensical results, again, fine.
As an example, constant expressions may be converted to integers at compile-time with rules that differ from the behavior of the assembly instructions generated on the platform. My blog post gives the example:
int printf(const char *, ...);

volatile double v = 0;

int main()
{
    int i1 = 2147483648.0;
    int i2 = 2147483648.0 + v;
    printf("%d %d\n", i1, i2);
}
which produces a program that prints two different values for i1 and i2. This is because the conversion in the computation of i1 was applied at compile-time, whereas the conversion in the computation of i2 was applied at run-time.
As another example, in the particular case of the conversion from double to 32-bit unsigned int on the x86-64 platform, the results can be funny:
There are no instructions in the x86 instruction sets to convert from floating-point to unsigned integer.
On Mac OS X for Intel, compiling a 64-bit program, the conversion from double to 32-bit unsigned int is compiled to a single instruction: the instruction for 64-bit conversions, cvttsd2siq, with a 64-bit destination register of which only the bottom 32 bits will subsequently be used as the 32-bit unsigned integer they represent:
$ cat t.c
#include <stdio.h>
#include <stdlib.h>
int main(int c, char **v)
{
unsigned int i = 4294967296.0 + strtod(v[1], 0);
printf("%u\n", i);
}
$ gcc -m64 -S -std=c99 -O t.c && cat t.s
…
addsd LCPI1_0(%rip), %xmm0 ; this is the + from the C program
cvttsd2siq %xmm0, %rsi ; one-instruction conversion
…
This explains how, on that platform, a result modulo 2^32 can be obtained for doubles that are small enough (specifically, small enough to fit in a signed 64-bit integer).
In the old IA-32 instruction set, there is no instruction to convert a double to a 64-bit signed integer (and there is no instruction to convert a double to a 32-bit unsigned int either). The conversion to 32-bit unsigned int has to be done by combining a few of the instructions that do exist, including two instructions cvttsd2si to convert from double to 32-bit signed integer:
$ gcc -m32 -S -std=c99 -O t.c && cat t.s
…
addsd LCPI1_0-L1$pb(%esi), %xmm0 ; this is the + from the C program
movsd LCPI1_1-L1$pb(%esi), %xmm1 ; conversion to unsigned int starts here
movapd %xmm0, %xmm2
subsd %xmm1, %xmm2
cvttsd2si %xmm2, %eax
xorl $-2147483648, %eax
ucomisd %xmm1, %xmm0
cvttsd2si %xmm0, %edx
cmovael %eax, %edx
…
Two alternative results are computed, respectively in %eax and in %edx. Each alternative is correct on a different definition domain. If the number to convert, in %xmm0, is larger than the constant 2^31 in %xmm1, then one alternative is chosen; otherwise, the other one is. The high-level algorithm, using only conversions from double to int, would be:
if (d < 2^31)
then (unsigned int)(int)d
else (2^31 + (unsigned int)(int)(d - 2^31))
This translation of the C conversion from double to unsigned int gives the same saturating behavior as the 32-bit conversion instruction that it relies on:
$ gcc -m32 -std=c99 -O t.c && ./a.out 123456
0
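For reference, the two-alternative algorithm above written out as compilable code (a sketch of the idea, not the compiler's literal output; it inherits the out-of-range quirks of the underlying signed conversions):

#include <cstdint>

// Convert double -> uint32_t using only double -> int32_t conversions,
// mirroring the IA-32 instruction sequence shown above.
std::uint32_t double_to_u32(double d)
{
    const double two31 = 2147483648.0; // 2^31
    if (d < two31)
        return static_cast<std::uint32_t>(static_cast<std::int32_t>(d));
    return 0x80000000u
         + static_cast<std::uint32_t>(static_cast<std::int32_t>(d - two31));
}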
Assume I have guarantees that float is IEEE 754 binary32. Given a bit pattern that corresponds to a valid float, stored in std::uint32_t, how does one reinterpret it as a float in a most efficient standard compliant way?
float reinterpret_as_float(std::uint32_t ui) {
    return /* apply sorcery to ui */;
}
I've got a few ways that I know/suspect/assume have some issues:
Via reinterpret_cast,
float reinterpret_as_float(std::uint32_t ui) {
    return reinterpret_cast<float&>(ui);
}
or equivalently
float reinterpret_as_float(std::uint32_t ui) {
    return *reinterpret_cast<float*>(&ui);
}
which suffers from aliasing issues.
Via union,
float reinterpret_as_float(std::uint32_t ui) {
    union {
        std::uint32_t ui;
        float f;
    } u = {ui};
    return u.f;
}
which is not actually legal, as one is only allowed to read from the most recently written-to member. Yet it seems some compilers (gcc) allow this.
Via std::memcpy,
float reinterpret_as_float(std::uint32_t ui) {
    float f;
    std::memcpy(&f, &ui, 4);
    return f;
}
which AFAIK is legal, but a function call to copy a single word seems wasteful, though it might get optimized away.
Via reinterpret_casting to char* and copying,
float reinterpret_as_float(std::uint32_t ui) {
    char* uip = reinterpret_cast<char*>(&ui);
    float f;
    char* fp = reinterpret_cast<char*>(&f);
    for (int i = 0; i < 4; ++i) {
        fp[i] = uip[i];
    }
    return f;
}
which AFAIK is also legal, as char pointers are exempt from aliasing issues and the manual byte-copying loop saves a possible function call. The loop will most certainly be unrolled, yet the 4 possibly separate one-byte loads/stores are worrisome; I have no idea whether this is optimizable to a single four-byte load/store.
Option 4 is the best I've been able to come up with.
Am I correct so far? Is there a better way to do this, particularly one that will guarantee a single load/store?
AFAIK, there are only two approaches that comply with strict aliasing rules: memcpy() and casting to char* with copying. All others read a float from memory that belongs to a uint32_t, and the compiler is allowed to perform the read before the write to that memory location. It might even optimize away the write altogether, as it can prove the stored value will never be used according to strict aliasing rules, resulting in a garbage return value.
It really depends on the compiler/optimizer whether memcpy() or the char* copy is faster. In both cases, an intelligent compiler might be able to figure out that it can just load and copy a uint32_t, but I would not trust any compiler to do so before I had seen it in the resulting assembly code.
Edit:
After some testing with gcc 4.8.1, I can say that the memcpy() approach is the best for this particular compiler; see below for details.
Compiling
#include <stdint.h>

float foo(uint32_t a) {
    float b;
    char* aPointer = (char*)&a, *bPointer = (char*)&b;
    for (int i = sizeof(a); i--; ) bPointer[i] = aPointer[i];
    return b;
}
with gcc -S -std=gnu11 -O3 foo.c yields this assembly code:
movl %edi, %ecx
movl %edi, %edx
movl %edi, %eax
shrl $24, %ecx
shrl $16, %edx
shrw $8, %ax
movb %cl, -1(%rsp)
movb %dl, -2(%rsp)
movb %al, -3(%rsp)
movb %dil, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is not optimal.
Doing the same with
#include <stdint.h>
#include <string.h>

float foo(uint32_t a) {
    float b;
    char* aPointer = (char*)&a, *bPointer = (char*)&b;
    memcpy(bPointer, aPointer, sizeof(a));
    return b;
}
yields (with all optimization levels except -O0):
movl %edi, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is optimal.
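As an aside that goes beyond the compilers tested above: if C++20 is available, std::bit_cast expresses this conversion directly, with the same defined behaviour as the memcpy() version:

#include <bit>
#include <cstdint>

float reinterpret_as_float(std::uint32_t ui) {
    return std::bit_cast<float>(ui); // C++20; specified to behave like memcpy of the bytes
}

In my experience it compiles to the same optimal single-move sequence, but as above, verify against the generated assembly.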
If the bit pattern in the integer variable is the same as a valid float value, then a union is probably the best and most compliant way to go. And it's actually legal if you read the specification (I don't remember the section at the moment).
memcpy is always safe but does involve a copy
casting may lead to problems
union - seems to be allowed in C99 and C11, not sure about C++
Take a look at:
What is the strict aliasing rule?
and
Is type-punning through a union unspecified in C99, and has it become specified in C11?
float reinterpret_as_float(std::uint32_t ui) {
    return *((float *)&ui);
}
As a plain function, its code is translated into assembly like this (Pelles C for Windows):
fld [esp+4]
ret
If defined as an inline function, then code like this (n being unsigned, x being float):
x = reinterpret_as_float (n);
is translated to assembly like this:
fld [ebp-4]            ;RHS of assignment. Read n as float
fstp dword ptr [ebp-8] ;LHS of assignment