I wonder what the recommended way is to convert integers to/from little-endian, portably.
Is there any library for that?
We could use htonl and ntohl and then do another big-endian to/from little-endian conversion, but that's not efficient.
The portable way is to use bit shifts and masks into an appropriately sized string. Notice I say string, because this is really the only time you need to concern yourself with endianness -- when transferring bytes between systems.
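For illustration, here is a minimal sketch of that approach (the names write_le32/read_le32 are mine, not from any library):

#include <cstdint>

// Serialize a 32-bit value into bytes in little-endian order,
// regardless of the host's own byte order.
void write_le32(unsigned char* out, std::uint32_t v)
{
    out[0] = static_cast<unsigned char>(v & 0xff);
    out[1] = static_cast<unsigned char>((v >> 8) & 0xff);
    out[2] = static_cast<unsigned char>((v >> 16) & 0xff);
    out[3] = static_cast<unsigned char>((v >> 24) & 0xff);
}

// Reassemble the value from little-endian bytes.
std::uint32_t read_le32(const unsigned char* in)
{
    return static_cast<std::uint32_t>(in[0])
         | (static_cast<std::uint32_t>(in[1]) << 8)
         | (static_cast<std::uint32_t>(in[2]) << 16)
         | (static_cast<std::uint32_t>(in[3]) << 24);
}

Because the code never inspects the object representation of the integer, it behaves identically on big- and little-endian hosts.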
If you want to avoid unnecessary conversions (e.g. converting to little-endian on a little-endian architecture), there is no completely portable way to do it at compile time. But you can check at run time and dynamically select the set of conversion functions, as in the sketch below.
This does have the disadvantage that the code can't be inlined. It might be more efficient to write the conversions in the portable way and use templates or inlining. Combined with semi-portable compile-time checks, that's about as good as you'll get.
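The run-time check can be as simple as this sketch (the function name is illustrative; it uses memcpy to inspect the lowest-addressed byte):

#include <cstdint>
#include <cstring>

bool runtime_is_little_endian()
{
    std::uint16_t probe = 1;
    unsigned char first_byte;
    std::memcpy(&first_byte, &probe, 1);  // read the lowest-addressed byte
    return first_byte == 1;               // 1 means the low byte comes first
}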
Further reading: Detecting Endianness at compile-time.
This is a great question. It prompted me to see if there was any way to determine endianness at compile time using a constexpr expression.
It turns out that without preprocessor tricks it's not possible, because there's no way to turn an integer into a sequence of bytes (via casts or unions) when evaluating in a constexpr context.
However, it turns out that in gcc a simple run-time check gets optimised away when compiled with -O2, so this is actually optimally efficient:
#include <cstdint>
#include <iostream>
#include <memory>   // for std::addressof

constexpr bool is_little_endian()
{
    union endian_tester {
        std::uint16_t n;
        std::uint8_t p[sizeof(std::uint16_t)];
    };
    constexpr endian_tester sample = {0x0102};
    return sample.p[0] == 0x02;
}

template<class Int>
Int to_little_endian(Int in)
{
    if (is_little_endian()) {
        return in;
    }
    else {
        Int out = 0;
        std::uint8_t* p = reinterpret_cast<std::uint8_t*>(std::addressof(out));
        for (std::size_t byte = 0; byte < sizeof(in); ++byte)
        {
            auto part = (in >> (byte * 8)) & 0xff;
            *p++ = std::uint8_t(part);
        }
        return out;
    }
}

int main()
{
    auto x = to_little_endian(10);
    std::cout << x << std::endl;
}
Here's the assembler output when compiling on an Intel (little-endian) platform:
main:
subq $8, %rsp
#
# here is the call to to_little_endian()
#
movl $10, %esi
#
# that was it - it's been completely optimised away
#
movl std::cout, %edi
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
movq %rax, %rdi
call std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)
xorl %eax, %eax
addq $8, %rsp
ret
Related
How can I avoid violating the strict aliasing rule when trying to modify the char* result of a sha256 function?
Compute the hash value:
std::string sha = sha256("some text");
const char* sha_result = sha.c_str();
const unsigned long* mod_args = reinterpret_cast<const unsigned long*>(sha_result);
then getting 2 pieces of 64 bits:
unsigned long a = mod_args[1] ^ mod_args[3] ^ mod_args[5] ^ mod_args[7];
unsigned long b = mod_args[0] ^ mod_args[2] ^ mod_args[4] ^ mod_args[6];
then getting the result by concatenating those two pieces:
unsigned long long result = (((unsigned long long)a) << 32) | b;
As depressing as it might sound, the only truly portable, standard-conforming and efficient way of doing this is through memcpy(). Using reinterpret_cast is a violation of the strict aliasing rule, and using a union (as often suggested) triggers undefined behaviour when you read from the member you didn't write to.
However, since most compilers will optimize away memcpy() calls, this is not as depressing as it sounds.
For example, the following code with two memcpy()s:
char* foo() {
    char* sha = sha256("some text"); // sha256() assumed to return a writable 32-byte buffer
    unsigned int mod_args[8];
    memcpy(mod_args, sha, sizeof(mod_args));
    mod_args[5] = 0;
    memcpy(sha, mod_args, sizeof(mod_args));
    return sha;
}
produces the following optimized assembly:
foo(): # #foo()
pushq %rax
movl $.L.str, %edi
callq sha256(char const*)
movl $0, 20(%rax)
popq %rdx
retq
It is easy to see that no memcpy() remains -- the value is modified 'in place'.
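Applied to the original question, a compliant version might look like this sketch (assuming, as in the question, that sha256() returns a std::string holding 32 raw hash bytes, and that the pieces are meant to be 32 bits each):

#include <cstdint>
#include <cstring>
#include <string>

unsigned long long hash_to_u64(const std::string& sha)
{
    std::uint32_t mod_args[8];
    std::memcpy(mod_args, sha.data(), sizeof(mod_args)); // no aliasing violation
    std::uint32_t a = mod_args[1] ^ mod_args[3] ^ mod_args[5] ^ mod_args[7];
    std::uint32_t b = mod_args[0] ^ mod_args[2] ^ mod_args[4] ^ mod_args[6];
    return (static_cast<unsigned long long>(a) << 32) | b;
}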
Does a conversion like:
int a[3];
char i=1;
a[ static_cast<unsigned char>(i) ];
introduce any overhead, such as conversion instructions, or can the compiler optimize everything away?
I am interested because I want to get rid of -Wchar-subscripts warnings but still want to use a char as the index (for other reasons).
I did one test on Clang 3.4.1 for this code :
int ival(signed char c) {
    int a[] = {0,1,2,3,4,5,6,7,8,9};
    unsigned char u = static_cast<unsigned char>(c);
    return a[u];
}
Here is the relevant part of the assembly file generated with c++ -S -O3:
_Z4ivala: # #_Z4ivala
# BB#0:
pushl %ebp
movl %esp, %ebp
movzbl 8(%ebp), %eax
movl .L_ZZ4ivalaE1a(,%eax,4), %eax
popl %ebp
ret
There is no trace of the conversion.
char and unsigned char always have the same size and alignment, hence unsigned char can represent all non-negative values of char, and on common architectures casting one to the other requires no CPU instructions.
I have written a hooking library that examines a PE executable's DLL import table, to create a library that enables changing parameters and return values. I have a few questions about how the return value is passed from a function.
I have learned that the return value of a function is saved in the accumulator register. Is this always the case? If not, how does the compiler know where to look for the function result?
What about the return type's size? An integer will easily fit, but what about a bigger structure? Does the caller reserve stack space so the function it calls can write the result onto the stack?
It's all specific to the calling convention.
For most calling conventions, floating point numbers are returned either on the FPU stack or in XMM registers.
A call to a function returning a structure,
some_struct foo(int arg1, int arg2);
some_struct s = foo(1, 2);
will be compiled into some equivalent of:
some_struct* foo(some_struct* ret_val, int arg1, int arg2);
some_struct s; // constructor isn't called
foo(&s, 1, 2); // constructor will be called in foo
Edit (additional info):
Just to clarify: this works for all structs and classes, even when sizeof(some_struct) <= 4. So if you define a small helper class like ip4_type with a single unsigned field and some constructors/converters to/from unsigned, in_addr and char*, it will lack efficiency compared to using a raw unsigned value.
If the function gets inlined, the result is not saved in eax; likewise, if results are passed by reference/pointer, that register won't be used.
Look at what happens with a function that returns a double (compiled here for x86-64):
double func() {
    volatile double val = 5.0;
    return val;
}

int main() {
    double val = func();
    return 0;
}
The double is returned in %xmm0, not in eax:
func():
pushq %rbp
movq %rsp, %rbp
movabsq $4617315517961601024, %rax
movq %rax, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, -24(%rbp)
movsd -24(%rbp), %xmm0
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $24, %rsp
call func()
movsd %xmm0, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -8(%rbp)
movl $0, %eax
leave
ret
It really depends on the calling convention used, but typically EAX is used for 32-bit and smaller integral data types, floating point values tend to use the x87 FPU or SSE registers, and 64-bit integral types tend to use a combination of EAX and EDX instead. Then there is the issue of complex class/struct types, in which case the compiler may decide to optimize away the return value and use an extra output parameter on the call stack to pass the returned object by reference to the caller.
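As a hedged illustration of those cases (the comments describe what common x86 conventions typically do, not a guarantee from any particular compiler):

int ret_int() { return 42; }        // integral result: EAX (RAX on x86-64)

long long ret_ll()
{
    return 0x1122334455667788LL;    // 32-bit x86: split across EDX:EAX
}                                   // x86-64: fits in RAX alone

double ret_dbl() { return 1.5; }    // x87 st(0) on 32-bit x86, XMM0 on x86-64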
You are asking questions about the ABI (Application Binary Interface). This varies depending on the operating system. You should look it up. You can find some good info and links to other documents at http://en.wikipedia.org/wiki/X86_calling_conventions
To answer your question, yes, as far as I know, all of the popular operating systems use the A register to return the result.
Assume I have guarantees that float is IEEE 754 binary32. Given a bit pattern that corresponds to a valid float, stored in std::uint32_t, how does one reinterpret it as a float in a most efficient standard compliant way?
float reinterpret_as_float(std::uint32_t ui) {
    return /* apply sorcery to ui */;
}
I've got a few ways that I know/suspect/assume have some issues:
Via reinterpret_cast,
float reinterpret_as_float(std::uint32_t ui) {
    return reinterpret_cast<float&>(ui);
}
or equivalently
float reinterpret_as_float(std::uint32_t ui) {
    return *reinterpret_cast<float*>(&ui);
}
which suffers from aliasing issues.
Via union,
float reinterpret_as_float(std::uint32_t ui) {
    union {
        std::uint32_t ui;
        float f;
    } u = {ui};
    return u.f;
}
which is not actually legal, as one is only allowed to read from the most recently written-to member. Yet it seems some compilers (gcc) allow this.
Via std::memcpy,
float reinterpret_as_float(std::uint32_t ui) {
    float f;
    std::memcpy(&f, &ui, 4);
    return f;
}
which AFAIK is legal, but a function call to copy a single word seems wasteful, though it might get optimized away.
Via reinterpret_casting to char* and copying,
float reinterpret_as_float(std::uint32_t ui) {
    char* uip = reinterpret_cast<char*>(&ui);
    float f;
    char* fp = reinterpret_cast<char*>(&f);
    for (int i = 0; i < 4; ++i) {
        fp[i] = uip[i];
    }
    return f;
}
which AFAIK is also legal, as char pointers are exempt from aliasing issues and the manual byte-copying loop saves a possible function call. The loop will most definitely be unrolled, yet four possibly separate one-byte loads/stores are worrisome; I have no idea whether this is optimizable to a single four-byte load/store.
Option 4 is the best I've been able to come up with.
Am I correct so far? Is there a better way to do this, particularly one that will guarantee a single load/store?
AFAIK, there are only two approaches that are compliant with the strict aliasing rules: memcpy() and casting to char* with copying. All others read a float from memory that belongs to a uint32_t, and the compiler is allowed to perform the read before the write to that memory location. It might even optimize away the write altogether, as it can prove that the stored value will never be used according to the strict aliasing rules, resulting in a garbage return value.
It really depends on the compiler/optimizer whether memcpy() or the char* copy is faster. In both cases, an intelligent compiler might be able to figure out that it can just load and copy a uint32_t, but I would not trust any compiler to do so before I have seen it in the resulting assembler code.
Edit:
After some testing with gcc 4.8.1, I can say that the memcpy() approach is the best for this particular compiler; see below for details.
Compiling
#include <stdint.h>

float foo(uint32_t a) {
    float b;
    char* aPointer = (char*)&a, *bPointer = (char*)&b;
    for (int i = sizeof(a); i--; ) bPointer[i] = aPointer[i];
    return b;
}
with gcc -S -std=gnu11 -O3 foo.c yields this assembly code:
movl %edi, %ecx
movl %edi, %edx
movl %edi, %eax
shrl $24, %ecx
shrl $16, %edx
shrw $8, %ax
movb %cl, -1(%rsp)
movb %dl, -2(%rsp)
movb %al, -3(%rsp)
movb %dil, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is not optimal.
Doing the same with
#include <stdint.h>
#include <string.h>

float foo(uint32_t a) {
    float b;
    char* aPointer = (char*)&a, *bPointer = (char*)&b;
    memcpy(bPointer, aPointer, sizeof(a));
    return b;
}
yields (with all optimization levels except -O0):
movl %edi, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is optimal.
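For what it's worth, if a C++20 compiler is available (an assumption beyond the question's constraints), std::bit_cast expresses exactly this intent and typically compiles to the same single move:

#include <bit>
#include <cstdint>

float reinterpret_as_float(std::uint32_t ui)
{
    return std::bit_cast<float>(ui);  // well-defined type punning, C++20
}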
If the bit pattern in the integer variable is the same as a valid float value, then a union is probably the best and most compliant way to go. And it's actually legal in C if you read the specification (I don't remember the section at the moment).
- memcpy is always safe but does involve a copy
- casting may lead to problems
- union seems to be allowed in C99 and C11, not sure about C++
Take a look at:
What is the strict aliasing rule?
and
Is type-punning through a union unspecified in C99, and has it become specified in C11?
float reinterpret_as_float(std::uint32_t ui) {
    return *((float *)&ui);
}
As a plain function, its code is translated into assembly like this (Pelles C for Windows):
fld [esp+4]
ret
If it is defined as an inline function, then code like this (n being unsigned, x being float):
x = reinterpret_as_float(n);
is translated to assembler as this:
fld [esp-4]             ;RHS of assignment: read n as float
fstp dword ptr [ebp-8]  ;LHS of assignment
I've gotten myself into a confused mess regarding multithreaded programming and was hoping someone could come and slap some understanding into me.
After doing quite a bit of reading, I've come to the understanding that I should be able to set the value of a 64-bit int atomically on a 64-bit system [1].
I found a lot of this reading difficult, though, so I thought I would try to make a test to verify it. So I wrote a simple program with one thread which would set a variable to one of two values:
bool switcher = false;
while (true)
{
    if (switcher)
        foo = a;
    else
        foo = b;
    switcher = !switcher;
}
And another thread which would check the value of foo:
while (true)
{
    __uint64_t blah = foo;
    if ((blah != a) && (blah != b))
    {
        cout << "Not atomic! " << blah << endl;
    }
}
I set a = 1844674407370955161; and b = 1144644202170355111;. I run this program and get no output warning me that blah is neither a nor b.
Great, looks like it probably is an atomic write... but then I changed the first thread to set a and b directly, like so:
bool switcher = false;
while (true)
{
    if (switcher)
        foo = 1844674407370955161;
    else
        foo = 1144644202170355111;
    switcher = !switcher;
}
I re-run, and suddenly:
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
What's changed? Either way I'm assigning a large number to foo - does the compiler handle a constant number differently, or have I misunderstood everything?
Thanks!
[1]: Intel CPU documentation, section 8.1, Guaranteed Atomic Operations
[2]: GCC development list discussing that GCC doesn't guarantee it in the documentation, but that the kernel and other programs rely on it
Disassembling the loop, I get the following code with gcc:
.globl _switcher
_switcher:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $0, -4(%rbp)
L2:
cmpl $0, -4(%rbp)
je L3
movq _foo@GOTPCREL(%rip), %rax
movl $-1717986919, (%rax)
movl $429496729, 4(%rax)
jmp L5
L3:
movq _foo@GOTPCREL(%rip), %rax
movl $1486032295, (%rax)
movl $266508246, 4(%rax)
L5:
cmpl $0, -4(%rbp)
sete %al
movzbl %al, %eax
movl %eax, -4(%rbp)
jmp L2
LFE2:
So it would appear that gcc uses the 32-bit movl instruction with 32-bit immediate values. There is an instruction, movq, that can move a 64-bit register to memory (or memory to a 64-bit register), but it does not seem to be able to move an immediate value to a memory address, so the compiler is forced either to use a temporary register and then move the value to memory, or to use two movl instructions. You can try to force it to use a register by using a temporary variable, but this may not work.
References:
mov
movq
http://www.x86-64.org/documentation/assembly.html
immediate values inside instructions remain 32 bits.
There is no way for the compiler to assign a 64-bit constant atomically, except by first filling a register and then moving that register to the variable. That is probably more costly than assigning directly to the variable, and since atomicity is not required by the language, the atomic solution is not chosen.
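If atomicity is actually required, a sketch along these lines (using C++11 std::atomic, which postdates parts of this discussion) forces a single 8-byte store:

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> foo{0};

void writer()
{
    bool switcher = false;
    while (true)
    {
        // a single atomic 8-byte store: the constant is first loaded
        // into a register, then stored to memory with one movq
        foo.store(switcher ? 1844674407370955161ULL
                           : 1144644202170355111ULL,
                  std::memory_order_relaxed);
        switcher = !switcher;
    }
}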
The Intel CPU documentation is right: aligned 8-byte reads/writes are always atomic on recent hardware (even on 32-bit operating systems).
What you don't tell us is whether you are using 64-bit hardware on a 32-bit system. If so, the 8-byte write will most likely be split into two 4-byte writes by the compiler.
Just have a look at the relevant section in the object code.
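For example, with a GNU toolchain, something like this (the file names are placeholders) shows the generated stores:

g++ -O2 -S atomic_test.cpp    # inspect atomic_test.s directly, or
g++ -O2 -c atomic_test.cpp
objdump -d atomic_test.o      # disassemble the object code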