portable ntohl and friends - c++

I'm writing a small program that will save and load data. It'll be command-line (and not interactive), so there's no point in pulling in libraries I don't actually need.
When using sockets directly, I get ntohl and friends just by including the socket headers; here, however, I don't need sockets. I'm not using wxWidgets either, so I don't get to use its byte-ordering functions.
In C++ there are a lot of newly standardised things: look at timers and regex, for example (regex isn't fully supported everywhere yet, but timers certainly are).
Is there a standardised way to convert things to network byte order?
Naturally I've tried searching "c++ network byte order cppreference" and similar things, but nothing comes up.
BTW, in this little project the program will manipulate files that may be shared across computers, so it'd be wrong to assume "always x86_64".

Is there a standardised way to convert things to network byte order?
No. There isn't.
Boost ASIO has equivalents, but that somewhat violates your requirements.

GCC provides __BYTE_ORDER__, which is about as good as it gets. It's easy to detect whether the compiler is GCC and test this macro, or detect whether it is Clang and test the same thing, then stick the byte ordering in a config header and use the preprocessor to conditionally compile the relevant bits of code.
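For example, a minimal sketch of that approach (assuming GCC or Clang, which both predefine __BYTE_ORDER__; the wrapper name my_htonl and the MY_LITTLE_ENDIAN macro are just illustrative):
#include <cstdint>

// Hypothetical project-local byte-order switch derived from the compiler macro.
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#  define MY_LITTLE_ENDIAN 1
#endif

inline std::uint32_t my_htonl(std::uint32_t x)
{
#ifdef MY_LITTLE_ENDIAN
    return __builtin_bswap32(x);  // GCC/Clang builtin: reverse the four bytes
#else
    return x;                     // big-endian host already matches network order
#endif
}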

There are no C++ standard functions for byte-order conversion as such, but you can compose the required functionality from standard (and Boost) building blocks.
Big-endian-to-host byte-order conversion can be implemented as follows:
#include <boost/detail/endian.hpp>
#include <boost/utility/enable_if.hpp>
#include <boost/type_traits/is_integral.hpp>
#include <algorithm>
#include <cstddef>

#ifdef BOOST_LITTLE_ENDIAN
# define BE_TO_HOST_COPY std::reverse_copy
#elif defined(BOOST_BIG_ENDIAN)
# define BE_TO_HOST_COPY std::copy
#endif

// Copy n bytes from src to dst, reversing them if the host is little-endian.
inline void be_to_host(void* dst, void const* src, std::size_t n) {
    char const* csrc = static_cast<char const*>(src);
    BE_TO_HOST_COPY(csrc, csrc + n, static_cast<char*>(dst));
}

// Convert a big-endian integral value to host byte order.
template<class T>
typename boost::enable_if<boost::is_integral<T>, T>::type
be_to_host(T const& big_endian) {
    T host;
    be_to_host(&host, &big_endian, sizeof(T));
    return host;
}
Host-to-big-endian byte-order conversion can be implemented in the same manner.
Usage:
uint64_t big_endian_piece_of_data;
uint64_t host_piece_of_data = be_to_host(big_endian_piece_of_data);
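The host-to-big-endian direction is a mirror image; a sketch using the same headers and macros as above (host_to_be is just an illustrative name):
#ifdef BOOST_LITTLE_ENDIAN
# define HOST_TO_BE_COPY std::reverse_copy
#elif defined(BOOST_BIG_ENDIAN)
# define HOST_TO_BE_COPY std::copy
#endif

// Copy n bytes from src to dst, reversing them on little-endian hosts.
inline void host_to_be(void* dst, void const* src, std::size_t n) {
    char const* csrc = static_cast<char const*>(src);
    HOST_TO_BE_COPY(csrc, csrc + n, static_cast<char*>(dst));
}

// Convert a host-order integral value to big-endian.
template<class T>
typename boost::enable_if<boost::is_integral<T>, T>::type
host_to_be(T const& host) {
    T big_endian;
    host_to_be(&big_endian, &host, sizeof(T));
    return big_endian;
}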

The following should work correctly on a platform of any endianness:
#include <cassert>
#include <cstdint>
#include <cstddef>

// Assemble a 32-bit value from four big-endian (network-order) bytes,
// independent of the platform's own byte order.
int32_t getPlatformInt(const uint8_t* bytes, size_t num)
{
    assert(num == 4);
    uint32_t ret = static_cast<uint32_t>(bytes[0]) << 24;
    ret |= static_cast<uint32_t>(bytes[1]) << 16;
    ret |= static_cast<uint32_t>(bytes[2]) << 8;
    ret |= static_cast<uint32_t>(bytes[3]);
    return static_cast<int32_t>(ret);
}
Your network integer can easily be viewed as an array of bytes using:
uint8_t* p = reinterpret_cast<uint8_t*>(&network_byte_order_int);
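The reverse direction works the same way; a minimal sketch (putPlatformInt is just an illustrative name, not part of the code above):
// Write a host integer out as four big-endian (network-order) bytes,
// again independent of the platform's own byte order.
void putPlatformInt(uint8_t* bytes, uint32_t value)
{
    bytes[0] = static_cast<uint8_t>(value >> 24);
    bytes[1] = static_cast<uint8_t>(value >> 16);
    bytes[2] = static_cast<uint8_t>(value >> 8);
    bytes[3] = static_cast<uint8_t>(value);
}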

The code from Doron that should work on any platform did not work for me on a big-endian system (a Power7 CPU).
Using a compiler built-in is much cleaner and worked great for me with gcc on both Windows and *nix (AIX):
uint32_t getPlatformInt(const uint32_t* bytes)
{
    uint32_t ret = __builtin_bswap32(*bytes);
    return ret;
}
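If the same source has to run unchanged on both little- and big-endian hosts, the builtin can be combined with the compile-time check mentioned earlier; a sketch assuming GCC or Clang:
#include <cstdint>

// Convert a big-endian (file/network order) word to host order:
// swap on little-endian hosts, pass through on big-endian ones.
inline uint32_t be32_to_host(uint32_t big_endian)
{
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    return __builtin_bswap32(big_endian);
#else
    return big_endian;
#endif
}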
See also How can I reorder the bytes of an integer in c?

Related

Optimizing away memcpy

In an attempt to avoid breaking strict aliasing rules, I introduced memcpy in a couple of places in my code, expecting it to be a no-op. The following example produces a call to memcpy (or equivalent) on gcc and clang. Specifically, fool<40> always produces a call, foo produces one on gcc but not clang, and fool<2> produces one on clang but not gcc. When and how can this be optimized away?
#include <cstdint>
#include <cstring>

uint64_t bar(const uint16_t *buf) {
    uint64_t num[2];
    memcpy(&num, buf, 16);
    return num[0] + num[1];
}

uint64_t foo(const uint16_t *buf) {
    uint64_t num[3];
    memcpy(&num, buf, sizeof(num));
    return num[0] + num[1];
}

template <int SZ>
uint64_t fool(const uint16_t *buf) {
    uint64_t num[SZ];
    memcpy(&num, buf, sizeof(num));
    uint64_t ret = 0;
    for (int i = 0; i < SZ; ++i)
        ret += num[i];
    return ret;
}

template uint64_t fool<2>(const uint16_t*);
template uint64_t fool<40>(const uint16_t*);
And a link to the compiled output (godbolt).
I can't really tell you exactly why the respective compilers fail to optimize the code in the way you'd hope in these specific cases. I guess each compiler is either unable to track the relationship established by memcpy between the target array and the source memory (as we can see, they do recognize this relationship in at least some cases), or some heuristic simply tells them not to make use of it.
Anyway, since the compilers don't behave as we'd hope when we rely on them to track the entire array, what we can try is to make the relationship more obvious by doing the memcpy on an element-per-element basis. This seems to produce the desired result on both compilers. Note that I had to manually unroll the initialization in bar and foo, as clang will otherwise emit a copy again.
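A sketch of what that element-per-element variant might look like (the names here are illustrative, not the ones from the linked godbolt output):
#include <cstdint>
#include <cstring>

// Manually unrolled: one memcpy per uint64_t element.
uint64_t bar_elementwise(const uint16_t* buf) {
    uint64_t a, b;
    std::memcpy(&a, buf, sizeof(a));
    std::memcpy(&b, buf + 4, sizeof(b));  // 4 uint16_t == one uint64_t further on
    return a + b;
}

// Loop version: each element is copied separately inside the loop.
template <int SZ>
uint64_t fool_elementwise(const uint16_t* buf) {
    uint64_t ret = 0;
    for (int i = 0; i < SZ; ++i) {
        uint64_t elem;
        std::memcpy(&elem, buf + 4 * i, sizeof(elem));
        ret += elem;
    }
    return ret;
}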
Apart from that, note that in C++ you should use std::memcpy, std::uint64_t, etc. since the standard headers are not guaranteed to also introduce these names into the global namespace (though I'm not aware of any implementation that doesn't do that).

Header with restrictions using bytes for a UDP socket

I am writing a header for a UDP socket that has byte-level restrictions:
| Packet ID (1 byte) | Packet Size (2 bytes) | Subpacket ID (1 Byte) | etc
I made a struct to store these attributes, like:
typedef struct WHEATHER_STRUCT
{
    unsigned char packetID[1];
    unsigned char packetSize[2];
    unsigned char subPacketID[1];
    unsigned char subPacketOffset[2];
    ...
} wheather_struct;
I initialize this struct using new and update the values. The question is about using only 2 bytes for the Packet Size attribute. Which of the two forms I wrote below is the correct one?
*weather_struct->packetSize = '50';
or
*weather_struct->packetSize = 50;
If you can use C++11 and gcc (or clang) then I would do this:
typedef struct WHEATHER_STRUCT
{
    uint8_t packetID;
    uint16_t packetSize;
    uint8_t subPacketID;
    uint16_t subPacketOffset;
    // ...
} __attribute__((packed)) wheather_struct;
If you can't use C++11 then you can use unsigned char and unsigned short instead.
If you're using Visual C then you can do:
#pragma pack (push, 1)
typedef struct ...
#pragma pack (pop)
Beware also byte ordering issues, depending on what architectures you need to support. You can use htons() and ntohs() to overcome this problem.
Live demo at Wandbox
Packing and unpacking data from IP packets is a problem as old as the internet itself (indeed, older).
Different machine architectures have different layouts for representing integers, which can cause problems when communicating between machines.
For this reason, the IP stack standardises on encoding integers in 'network byte order' (which basically means most significant byte first).
Standard functions exist to convert values in network byte order to native types and vice versa. I urge you to consider using these as your code will then be more portable.
Furthermore, it makes sense to abstract the data representation from the program's point of view; C++ compilers can perform the conversions very efficiently.
Example:
#include <arpa/inet.h>
#include <cstring>
#include <cstdint>
typedef struct WEATHER_STRUCT
{
    std::int8_t packetID;
    std::uint16_t packetSize;
    std::uint8_t subPacketID;
    std::uint16_t subPacketOffset;
} weather_struct;

// Decode a packet from a byte stream, converting 16-bit fields from network order.
const std::int8_t* populate(weather_struct& target, const std::int8_t* source)
{
    auto get16 = [&source]
    {
        std::uint16_t buf16;
        std::memcpy(&buf16, source, 2);
        source += 2;
        return ntohs(buf16);
    };
    target.packetID = *source++;
    target.packetSize = get16();
    target.subPacketID = *source++;
    target.subPacketOffset = get16();
    return source;
}

// Encode a packet into a byte stream, converting 16-bit fields to network order.
uint8_t* serialise(uint8_t* target, weather_struct const& source)
{
    auto write16 = [&target](std::uint16_t val)
    {
        val = htons(val);
        std::memcpy(target, &val, 2);
        target += 2;
    };
    *target++ = source.packetID;
    write16(source.packetSize);
    *target++ = source.subPacketID;
    write16(source.subPacketOffset);
    return target;
}
https://linux.die.net/man/3/htons
Here's a link to a C++17 version of the above:
https://godbolt.org/z/oRASjI
A further note on conversion costs:
Data arriving into or leaving your program is an event that happens once per payload. Suffering a conversion cost here incurs a negligible penalty.
Once the data has arrived in your program, or before it leaves, it may be manipulated many times by your code.
Some processor architectures suffer huge performance penalties when accessing data that is not aligned on natural word boundaries. This is why attributes such as packed exist - the compiler normally does all it can to avoid misaligned data. Using a packed attribute is tantamount to deliberately telling the compiler to produce very suboptimal code.
For this reason, I would recommend not using packed structures (e.g. __attribute__((packed)) etc) for data that will be referred to by program logic.
Compared to RAM, networks are many orders of magnitude slower. A minuscule performance hit (literally nanoseconds) at the point of encoding or decoding a network packet is inconsequential compared to the cost of actually transmitting it.
Packing structures can cause horrible performance issues in program code and often leads to portability headaches.
Neither is correct; you need to treat the two bytes as a single 16-bit number. You probably also need to take the endianness mismatch between the network stream and your processor architecture into account (depending on the protocol; most are big-endian).
The correct code would therefore be:
*((uint16_t*)weather_struct->packetSize) = htons(50);
It would be simpler if packetSize were uint16_t to start with:
weather_struct->packetSize = htons(50);

Convert between little-endian and big-endian floats effectively

I have a working software, which currently runs on a little-endian architecture. I would like to make it run in big-endian mode too. I would like to write little-endian data into files, regardless of the endianness of the underlying system.
To achieve this, I decided to use the boost endian library. It can convert integers efficiently. But it cannot handle floats (and doubles).
The documentation states that "Floating point types will be supported in Boost 1.59.0", but they are still not supported in 1.62.
I can assume that the floats are valid IEEE 754 floats (or doubles), but their endianness may vary with the underlying system. As far as I know, using the htonl and ntohl functions on floats is not recommended. How can it be done, then? Is there any header-only library which can handle floats too? I was not able to find any.
I could convert the floats to strings and write those to the file, but I would like to avoid that approach for many reasons (performance, disk space, ...).
Here:
float f = 1.2f;
auto it = reinterpret_cast<uint8_t*>(&f);
std::reverse(it, it + sizeof(f)); //f is now in the reversed endianness
No need for anything fancy.
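The same idea can be wrapped in a small helper that works on a copy and sidesteps aliasing questions by going through a byte buffer; a sketch (byteswapped is just an illustrative name):
#include <algorithm>
#include <cstring>

template <typename T>
T byteswapped(T value)
{
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));   // copy out the object representation
    std::reverse(bytes, bytes + sizeof(T));  // flip the byte order
    std::memcpy(&value, bytes, sizeof(T));   // copy it back
    return value;
}
Call it only at the point where data enters or leaves the program, and keep values in native order everywhere else.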
Unheilig: you are correct, but
#include <boost/endian/conversion.hpp>

template <typename T>
inline T endian_cast(const T & t)
{
#ifdef BOOST_LITTLE_ENDIAN
    return boost::endian::endian_reverse(t);
#else
    return t;
#endif
}
or, when you are working through a pointer and want to reverse in place, use:
template <typename T>
inline void endian_cast(T *t)
{
#ifdef BOOST_LITTLE_ENDIAN
    boost::endian::endian_reverse_inplace(*t);
#endif
}
and use it instead of manually (and perhaps error-prone) reversing the contents yourself.
Example:
std::uint16_t start_address() const
{
    std::uint16_t address;
    std::memcpy(&address, &data()[1], 2);
    return endian_cast(address);
}

void start_address(std::uint16_t i)
{
    endian_cast(&i);
    std::memcpy(&data()[1], &i, 2);
}
Good luck.
When serializing float/double values, I make the following three assumptions:
The machine representation follows IEEE 754
The endianness of float/double matches the endianness of integers
The behavior of reinterpret_cast-ing between double&/int64_t& or float&/int32_t& is well-defined (i.e. the cast behaves as if the types were similar).
None of these assumptions is guaranteed by the standard. Under these assumptions, the following code will ensure doubles are written in little-endian:
std::ofstream out;   // opened elsewhere, in binary mode
double someVal;
...
static_assert(sizeof(someVal) == sizeof(int64_t),
              "Endian conversion requires 8-byte doubles");
boost::endian::native_to_little_inplace(reinterpret_cast<int64_t&>(someVal));
out.write(reinterpret_cast<char*>(&someVal), sizeof(someVal));
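If assumption 3 makes you uneasy, the reinterpret_cast can be replaced by a memcpy into an integer of the same size; a sketch using the same Boost.Endian call (still relying on assumptions 1 and 2):
#include <boost/endian/conversion.hpp>
#include <cstdint>
#include <cstring>
#include <ostream>

void write_little_endian(std::ostream& out, double value)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "Endian conversion requires 8-byte doubles");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));        // grab the object representation
    boost::endian::native_to_little_inplace(bits);   // the integer conversion Boost does support
    out.write(reinterpret_cast<const char*>(&bits), sizeof(bits));
}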

what is the benefit of detecting endian at runtime?

I've searched for macros to determine endianness on a machine and didn't find any standard preprocessor macros for this, but I found a lot of solutions that do the detection at runtime. Why should I detect endianness at runtime?
If I do something like this:
#ifdef LITTLE_ENDIAN
inline int swap(int& x) {
    // do swap anyhow
    return swapped;
}
#elif BIG_ENDIAN
inline int& swap(int& x) { return x; }
#else
#error "some blabla"
#endif

int main() {
    int x = 0x1234;
    int y = swap(x);
    return 0;
}
the compiler will generate only one function.
But if I do it like this (see predef.endian):
enum {
    ENDIAN_UNKNOWN,
    ENDIAN_BIG,
    ENDIAN_LITTLE,
    ENDIAN_BIG_WORD,   /* Middle-endian, Honeywell 316 style */
    ENDIAN_LITTLE_WORD /* Middle-endian, PDP-11 style */
};

int endianness(void)
{
    uint8_t buffer[4];
    buffer[0] = 0x00;
    buffer[1] = 0x01;
    buffer[2] = 0x02;
    buffer[3] = 0x03;
    switch (*((uint32_t *)buffer)) {
    case 0x00010203: return ENDIAN_BIG;
    case 0x03020100: return ENDIAN_LITTLE;
    case 0x02030001: return ENDIAN_BIG_WORD;
    case 0x01000302: return ENDIAN_LITTLE_WORD;
    default:         return ENDIAN_UNKNOWN;
    }
}

int swap(int& x) {
    switch (endianness()) {
    case ENDIAN_BIG:
        return x;
    case ENDIAN_LITTLE:
        // do swap
        return swapped;
    default:
        // error blabla
    }
    // do swap anyhow
}
the compiler generates code for the detection.
I don't get it: why should I do this? If my code is compiled for a little-endian machine, the whole binary is generated for little-endian; and if I try to run such code on a big-endian machine (say on a bi-endian machine like ARM, wiki: bi-endian), the whole binary was still compiled for a little-endian machine, so all other declarations of e.g. int are also LE.
// compiled on little endian
uint32_t x = 0x1234; // constant literal
// stored as 34 12 00 00 in memory; read back on a BE machine this would look like 0x34120000
There are actually systems where SOFTWARE can set whether the system is (currently running in) little or big endian mode. Most systems only support switching that under special circumstances, and not (fortunately for system programmers and such) switching back and forth arbitrarily. But it would be conceivable to support that an executable file defines whether that particular executable runs in LE or BE mode. In that case, you can't rely on picking out what OS and processor model it is...
On the other hand, if the hardware only EVER supports one endianness (e.g. x86 in its different forms), then I don't see a need to check at runtime. You know it's little endian, and that's it. It is wasteful (in terms of performance and code-size) to have the system contain code to check which endianness it is, and carry around conversion methods to convert from big endian to little endian.
Robust endianness detection at compile time isn't necessarily possible. There are platforms where endianness can change even between runs of the same binary.
http://gcc.gnu.org/ml/gcc-help/2007-07/msg00343.html
I think the only benefit of detecting endianness at runtime is that you don't have to mess around with macros. As you have noticed yourself, there is no standard macro saying what the endianness of the machine you are compiling on is, so you must define something yourself and pass it to the compiler, or define it conditionally depending on other flags indicating the architecture/operating system, something like:
#ifdef _this_system_
#define LITTLE_ENDIAN
#endif
#ifdef _that_system_
#define BIG_ENDIAN
#endif
but repeated many times, for every possible architecture, which is messy and error-prone. It is easier and safer to check it at runtime. I know it seems silly, but it is really more practical.
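A minimal runtime check along those lines might look like this (it only distinguishes little- from big-endian, not the middle-endian cases above):
#include <cstdint>
#include <cstring>

inline bool host_is_little_endian()
{
    const std::uint32_t probe = 1;
    unsigned char first_byte;
    std::memcpy(&first_byte, &probe, 1);  // read the lowest-addressed byte
    return first_byte == 1;               // the 1 lands first only on little-endian
}
Optimizing compilers typically fold this into a constant, so the runtime cost is usually negligible anyway.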

lock-free memory reclamation with 64bit pointers

Herlihy and Shavit's book (The Art of Multiprocessor Programming) solves memory reclamation using Java's AtomicStampedReference<T>.
To write one in C++ for x86_64, I imagine, requires at least a 12-byte atomic swap operation: 8 bytes for a 64-bit pointer and 4 for the int.
Is there x86 hardware support for this and if not, any pointers on how to do wait-free memory reclamation without it?
Yes, there is hardware support, though I don't know if it is exposed by C++ libraries. Anyway, if you don't mind doing some low-level unportable assembly language trickery - look up the CMPXCHG16B instruction in Intel manuals.
Windows gives you a bunch of Interlocked functions that are atomic and can probably be used to do what you want. Similar functions exist for other platforms, and I believe Boost has an interlocked library as well.
Your question isn't super clear and I don't have a copy of Herlihy and Shavit lying around. Perhaps if you elaborated or gave pseudocode outlining what you want to do, we could give you a more specific answer.
OK, I have the book, so hopefully this helps.
For others who may provide answers, the point is to implement this class:
template <class T>
class AtomicReference {
public:
    void set(T *ref, int stamp) { ... }
    T *get(int *stamp) { ... }
private:
    T *_ref;
    int _stamp;
};
in a lock-free way, so that:
set() updates the reference and the stamp, atomically.
get() returns the reference and sets *stamp to the stamp corresponding to the reference.
JDonner, please correct me if I am wrong.
Now my answer: I don't think you can do it without a lock somewhere (the lock can be a while(test_and_set() != ...) loop), therefore there is no lock-free algorithm for this. If there were, it would mean it is possible to build an N-byte atomic register in a lock-free way for any N.
If you look at section 9.8.1 of the book, the AtomicMarkableReference is the same thing with a single bit instead of an integer stamp. The authors suggest "stealing" a bit from a pointer so that the mark and the pointer can be extracted from a single word (almost a quote). This obviously means that they want to use a single atomic register to do it.
However, there may be a way to build wait-free memory reclamation without it. I don't know.
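For the single-bit AtomicMarkableReference case, the bit-stealing trick can be done with one word-sized atomic; a sketch (C++11, assuming T objects are aligned to at least 2 bytes so the low pointer bit is free):
#include <atomic>
#include <cstdint>

template <class T>
class AtomicMarkableReference {
public:
    void set(T* ref, bool mark) {
        word_.store(pack(ref, mark), std::memory_order_release);
    }
    T* get(bool* mark) const {
        std::uintptr_t w = word_.load(std::memory_order_acquire);
        *mark = (w & 1u) != 0;
        return reinterpret_cast<T*>(w & ~std::uintptr_t(1));
    }
    bool compare_and_set(T* expected_ref, T* new_ref,
                         bool expected_mark, bool new_mark) {
        std::uintptr_t expected = pack(expected_ref, expected_mark);
        return word_.compare_exchange_strong(expected, pack(new_ref, new_mark));
    }
private:
    static std::uintptr_t pack(T* ref, bool mark) {
        return reinterpret_cast<std::uintptr_t>(ref) | (mark ? 1u : 0u);
    }
    std::atomic<std::uintptr_t> word_{0};
};
The integer-stamp version needs more bits than a pointer leaves free, which is why the question about a wider compare-and-swap comes up at all.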
Yes, x64 supports this; you need to use CMPXCHG16B.
Alternatively, you can save a bit of memory by relying on the fact that the pointer uses fewer than 64 bits, so a pointer plus tag fits into a single 64-bit word. First, define a compare-and-set function (this asm works in GCC and ICC):
inline bool CAS_ (volatile uint64_t* mem, uint64_t old_val, uint64_t new_val)
{
    uint32_t old_high = old_val >> 32, old_low = old_val;
    uint32_t new_high = new_val >> 32, new_low = new_val;
    char res = 0;
    asm volatile("lock; cmpxchg8b (%5);"
                 "setz %2;"
                 : "+a" (old_low),  // %0: expected low half, in EAX
                   "+d" (old_high), // %1: expected high half, in EDX
                   "=q" (res)       // %2: set to 1 if the exchange happened
                 : "b" (new_low),   // %3: replacement low half, in EBX
                   "c" (new_high),  // %4: replacement high half, in ECX
                   "r" (mem)        // %5: address of the 8-byte target
                 : "cc", "memory");
    return res;
}
You'll then need to build a tagged-pointer type. I'm assuming a 40-bit pointer with a cacheline-width of 128-bytes (like Nehalem). Aligning to the cache-line will give enormous speed improvements by reducing false-sharing, contention, etc.; this has the obvious trade-off of using a lot more memory, in some situations.
template <typename pointer_type, typename tag_type, int PtrAlign=7, int PtrWidth=40>
struct tagged_pointer
{
    typedef unsigned long raw_value_type;
    // Pointers are PtrWidth bits wide and aligned to 2^PtrAlign bytes, so the low
    // PtrAlign bits are always zero and can be shifted away; the rest of the word
    // holds the tag.
    static const raw_value_type PtrMask = (raw_value_type(1) << (PtrWidth - PtrAlign)) - 1;
    static const raw_value_type TagMask = ~PtrMask;

    raw_value_type raw_m_;

    tagged_pointer () : raw_m_(0) {}
    tagged_pointer (pointer_type ptr) { this->pack(ptr, 0); }
    tagged_pointer (pointer_type ptr, tag_type tag) { this->pack(ptr, tag); }

    void pack (pointer_type ptr, tag_type tag)
    {
        this->raw_m_ = 0;
        this->raw_m_ |= ((reinterpret_cast<raw_value_type>(ptr) >> PtrAlign) & PtrMask);
        this->raw_m_ |= ((raw_value_type(tag) << (PtrWidth - PtrAlign)) & TagMask);
    }

    pointer_type ptr () const
    {
        raw_value_type p = (this->raw_m_ & PtrMask) << PtrAlign;
        return reinterpret_cast<pointer_type>(p);
    }

    tag_type tag () const
    {
        raw_value_type t = (this->raw_m_ & TagMask) >> (PtrWidth - PtrAlign);
        return static_cast<tag_type>(t);
    }
};
I haven't had a chance to debug this code, so you'll need to do that, but this is the general idea.
Note: on the x86_64 architecture with gcc you can enable 128-bit CAS. It is enabled with the -mcx16 gcc option.
int main()
{
    __int128_t x = 0;
    __sync_bool_compare_and_swap(&x, 0, 10);
    return 0;
}
Compile with:
gcc -mcx16 file.c
The cmpxchg16b instruction provides the operation you want, but beware that some older x86-64 processors don't have it.
You then just need to build an entity with the counter and the pointer plus the inline-asm code. I've written a blog post on the subject here: Implementing Generic Double Word Compare And Swap.
Nevertheless, you don't need this operation if you just want to prevent early-free and ABA issues. Hazard pointers are simpler and don't require specific asm code (as long as you use C++11 atomics). I've got a repo on Bitbucket with experimental implementations of various lock-free algorithms: Lock Free Experiment (beware: all these implementations are toys for experimentation, not reliable and tested code for production).
Nevertheless, you don't need this operation if you just want to prevent early-free and ABA issues. The hazard pointer is more simpler and doesn't require specific asm code (as long as you use C++11 atomic values.) I've got a repo on bitbucket with experimental implementations of various lock-free algorithms: Lock Free Experiment (beware all these implementations are toys for experimentation, not reliable and tested code for production.)