Standard-compliant host to network endianness conversion - C++

I am amazed at how many topics on StackOverflow deal with finding out the endianness of the system and converting endianness. I am even more amazed that there are hundreds of different answers to these two questions. All proposed solutions that I have seen so far are based on undefined behaviour, non-standard compiler extensions or OS-specific header files. In my opinion, this question is only a duplicate if an existing answer gives a standard-compliant, efficient (e.g., uses x86-bswap), compile-time-enabled solution.
Surely there must be a standard-compliant solution available that I am unable to find in the huge mess of old "hacky" ones. It is also somewhat strange that the standard library does not include such a function. Perhaps the attitude towards such issues is changing, since C++20 introduced a way to detect endianness into the standard (via std::endian), and C++23 will probably include std::byteswap, which flips endianness.
In any case, my questions are these:
Starting at what C++ standard is there a portable standard-compliant way of performing host to network byte order conversion?
I argue below that it's possible in C++20. Is my code correct and can it be improved?
Should such a pure-c++ solution be preferred to OS specific functions such as, e.g., POSIX-htonl? (I think yes)
I think I can give a C++23 solution that is OS-independent, efficient (no system call, uses x86-bswap) and portable to little-endian and big-endian systems (but not portable to mixed-endian systems):
// requires C++23. see https://gcc.godbolt.org/z/6or1sEvKn
#include <type_traits>
#include <utility>
#include <bit>
#include <concepts> // std::integral
constexpr inline auto host_to_net(std::integral auto i) {
static_assert(std::endian::native == std::endian::big || std::endian::native == std::endian::little);
if constexpr (std::endian::native == std::endian::big) {
return i;
} else {
return std::byteswap(i);
}
}
Since std::endian is available in C++20, one can give a C++20 solution for host_to_net by implementing byteswap manually. A solution is described here, quote:
// requires C++17
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>
template<class T, std::size_t... N>
constexpr T bswap_impl(T i, std::index_sequence<N...>) {
return ((((i >> (N * CHAR_BIT)) & (T)(unsigned char)(-1)) <<
((sizeof(T) - 1 - N) * CHAR_BIT)) | ...);
}; // ^~~~~ fold expression
template<class T, class U = typename std::make_unsigned<T>::type>
constexpr U bswap(T i) {
return bswap_impl<U>(i, std::make_index_sequence<sizeof(T)>{});
}
The linked answer also provides a C++11 byteswap, but that one seems to be less efficient (not compiled to x86-bswap). I think there should be an efficient C++11 way of doing this, too (using either less template-nonsense or even more) but I don't care about older C++ and didn't really try.
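For standards older than C++17, a plain loop is one alternative to the fold expression. The following is a minimal sketch, assuming C++14 (relaxed constexpr) and 8-bit bytes; bswap_loop is a hypothetical name that does not come from the linked answer, and whether a given compiler reduces it to a single bswap instruction would have to be checked in the generated assembly.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not from the linked answer): reverses the byte order
// of an unsigned integer with a plain loop. Needs C++14 relaxed constexpr.
template <class T>
constexpr T bswap_loop(T value) noexcept {
    static_assert(std::is_unsigned<T>::value, "use an unsigned integer type");
    static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");
    T result = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        // Append the lowest byte of `value` to `result`; bytes copied earlier
        // move towards the high end.
        result = static_cast<T>((result << 8) | (value & 0xFFu));
        value = static_cast<T>(value >> 8);
    }
    return result;
}

// Evaluated at compile time.
static_assert(bswap_loop<std::uint16_t>(0x1122u) == 0x2211u, "");
static_assert(bswap_loop<std::uint32_t>(0x11223344u) == 0x44332211u, "");
```

Signed inputs would first have to be converted to the corresponding unsigned type, as the bswap wrapper above does with std::make_unsigned.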
Assuming I am correct, the remaining question is: can one determine system endianness before C++20 at compile time in a standard-compliant and compiler-agnostic way? None of the answers here seem to achieve this. They use reinterpret_cast (not compile time), OS-headers, union aliasing (which I believe is UB in C++), etc. Also, for some reason, they try to do it "at runtime" although a compiled executable will always run under the same endianness.
One could do it outside of constexpr context and hope it's optimized away. On the other hand, one could use system-defined preprocessor definitions and account for all platforms, as seems to be the approach taken by Boost. Or maybe (although I would guess the other way is better?) use macros and pick platform-specific htonl-style functions from networking libraries (done, e.g., here (GitHub))?
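To illustrate the "outside of constexpr context" option: copying the object representation with std::memcpy is well-defined in every C++ standard, so a run-time check can be written without reinterpret_cast or unions. This is only a sketch; the function names are hypothetical, and whether the optimizer folds the result into a constant is up to the compiler.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the byte stored at the lowest address of the
// object representation of 0x01020304, copied out with std::memcpy.
inline unsigned char first_byte_of_probe() noexcept {
    const std::uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0];
}

// Little endian: the least significant byte (0x04) comes first.
inline bool is_little_endian_runtime() noexcept {
    return first_byte_of_probe() == 0x04;
}

// Big endian: the most significant byte (0x01) comes first.
inline bool is_big_endian_runtime() noexcept {
    return first_byte_of_probe() == 0x01;
}
```

On a mixed-endian system both functions would return false, which mirrors the static_assert in host_to_net above.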

compile time-enabled solution.
Consider whether this is a useful requirement in the first place. The program isn't going to be communicating with another system at compile time. What is the case where you would need to use the serialised integer in a compile-time constant context?
Starting at what C++ standard is there a portable standard-compliant way of performing host to network byte order conversion?
It's possible to write such a function in standard C++ since C++98. That said, later standards bring tasty template goodies that make this nicer.
There is no such function in the standard library as of the latest standard.
Should such a pure-c++ solution be preferred to OS specific functions such as, e.g., POSIX-htonl? (I think yes)
The advantage of POSIX is that it's less important to write tests to make sure that it works correctly.
The advantage of a pure C++ function is that you don't need platform-specific alternatives for systems that don't conform to POSIX.
Also, the POSIX htonX functions are only for 16-bit and 32-bit integers. You could instead use the htobeXX functions that are available in some *BSD systems and in Linux (glibc).
Here is what I have been using since C++17. Some notes beforehand:
Since endianness conversion is always [1] for purposes of serialisation, I write the result directly into a buffer. When converting to host endianness, I read from a buffer.
I don't use CHAR_BIT because network doesn't know my byte size anyway. Network byte is an octet, and if your CPU is different, then these functions won't work. Correct handling of non-octet byte is possible but unnecessary work unless you need to support network communication on such system. Adding an assert might be a good idea.
I prefer to call it big endian rather than "network" endian. There's a chance that a reader isn't aware of the convention that de-facto endianness of network is big.
Instead of checking "if native endianness is X, do Y else do Z", I prefer to write a function that works with all native endianness. This can be done with bit shifts.
Yeah, it's constexpr. Not because it needs to be, but just because it can be. I haven't been able to produce an example where dropping constexpr would produce worse code.
// requires C++17
#include <cstddef>
#include <type_traits>
#include <utility>
// helper to promote an integer type
template <class T>
using promote_t = std::decay_t<decltype(+std::declval<T>())>;
template <class T, std::size_t... I>
constexpr void
host_to_big_impl(
unsigned char* buf,
T t,
[[maybe_unused]] std::index_sequence<I...>) noexcept
{
using U = std::make_unsigned_t<promote_t<T>>;
constexpr U lastI = sizeof(T) - 1u;
constexpr U bits = 8u;
U u = t;
( (buf[I] = u >> ((lastI - I) * bits)), ... );
}
template <class T>
constexpr void
host_to_big(unsigned char* buf, T t) noexcept
{
using Indices = std::make_index_sequence<sizeof(T)>;
return host_to_big_impl<T>(buf, t, Indices{});
}
[1] In all use cases I've encountered. Conversions from integer to integer can be implemented by delegating to these functions if you have such a case, although they cannot be constexpr due to the need for reinterpret_cast.
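The notes above mention reading from a buffer when converting back to host endianness, and suggest asserting the octet assumption. A minimal sketch of that reverse direction could look like the following; big_to_host and example_octets are hypothetical names, and the code assumes C++17 and integer types no wider than std::uintmax_t.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <type_traits>

static_assert(CHAR_BIT == 8, "this sketch assumes octet-sized bytes");

// Hypothetical counterpart to host_to_big: reads sizeof(T) big-endian octets
// from buf and returns them as an integer in host representation.
template <class T>
constexpr T big_to_host(const unsigned char* buf) noexcept {
    static_assert(std::is_integral_v<T> && !std::is_same_v<T, bool>);
    using U = std::make_unsigned_t<T>;
    std::uintmax_t acc = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        acc = (acc << 8) | buf[i];  // the most significant octet comes first
    }
    // For signed T, converting an out-of-range unsigned value is
    // implementation-defined before C++20 and modular since C++20.
    return static_cast<T>(static_cast<U>(acc));
}

// Example input, used only for the compile-time checks below.
inline constexpr unsigned char example_octets[] = {0x12, 0x34, 0x56, 0x78};
static_assert(big_to_host<std::uint32_t>(example_octets) == 0x12345678u);
static_assert(big_to_host<std::uint16_t>(example_octets) == 0x1234u);
```

Like host_to_big, this relies only on shifts, so it does not need to know the native endianness.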

I made a benchmark comparing my C++ solution from the question and the solution by eeroika from the accepted answer.
Looking at this is a complete waste of time, but now that I did it, I thought I might as well share it. The result is that (in the specific not-quite-realistic use case I look at) they seem to be equivalent in terms of performance. This is despite my solution being compiled to use x86-bswap, while the solution by eeroika does it by just using mov.
The performance seems to differ a lot (!!) when using different compilers and the main thing I learned from these benchmarks is, again, that I'm just wasting my time...
// benchmark to compare two C++20-stand-alone host-to-big-endian endianness conversions.
// Run at quick-bench.com! This is not a complete program. (https://quick-bench.com/q/2qnr4xYKemKLZupsicVFV_09rEk)
// To run locally, include Google benchmark header and a main method as required by the benchmarking library.
// Adapted from https://stackoverflow.com/a/71004000/9988487
#include <bit>
#include <climits>
#include <concepts>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <ranges>
#include <type_traits>
#include <utility>
#include <vector>
/////////////////////////////// Solution 1 ////////////////////////////////
template <typename T> struct scalar_t { T t{}; /* no begin/end */ };
static_assert(not std::ranges::range< scalar_t<int> >);
template<class T, std::size_t... N>
constexpr T bswap_impl(T i, std::index_sequence<N...>) noexcept {
constexpr auto bits_per_byte = 8u;
static_assert(bits_per_byte == CHAR_BIT);
return ((((i >> (N * bits_per_byte)) & (T)(unsigned char)(-1)) <<
((sizeof(T) - 1 - N) * bits_per_byte)) | ...);
}; // ^~~~~ fold expression
template<class T, class U = typename std::make_unsigned<T>::type>
constexpr U bswap(T i) noexcept {
return bswap_impl<U>(i, std::make_index_sequence<sizeof(T)>{});
}
constexpr inline auto host_to_net(std::integral auto i) {
static_assert(std::endian::native == std::endian::big || std::endian::native == std::endian::little);
if constexpr (std::endian::native == std::endian::big) {
return i;
} else {
return bswap(i); // replace by `std::byteswap` once it's available!
}
}
/////////////////////////////// Solution 2 ////////////////////////////////
// helper to promote an integer type
template <class T>
using promote_t = std::decay_t<decltype(+std::declval<T>())>;
template <class T, std::size_t... I>
constexpr void
host_to_big_impl(
unsigned char* buf,
T t,
[[maybe_unused]] std::index_sequence<I...>) noexcept {
using U = std::make_unsigned_t<promote_t<T>>;
constexpr U lastI = sizeof(T) - 1u;
constexpr U bits = 8u;
U u = t;
( (buf[I] = u >> ((lastI - I) * bits)), ... );
}
template <class T, std::size_t... I>
constexpr void
host_to_big(unsigned char* buf, T t) noexcept {
using Indices = std::make_index_sequence<sizeof(T)>;
return host_to_big_impl<T>(buf, t, Indices{});
}
//////////////////////// Benchmarks ////////////////////////////////////
template<std::integral T>
std::vector<T> get_random_vector(std::size_t length, unsigned int seed) {
// NOTE: IT IS VERY SLOW TO RECREATE RNG EVERY TIME. Don't use in production code!
std::mt19937_64 rng{seed};
std::uniform_int_distribution<T> distribution(
std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
std::vector<T> result(length);
for (auto && val : result) {
val = distribution(rng);
}
return result;
}
template<>
std::vector<bool> get_random_vector<bool>(std::size_t length, unsigned int seed) {
// NOTE: IT IS VERY SLOW TO RECREATE RNG EVERY TIME. ONLY USE FOR TESTING!
std::mt19937_64 rng{seed};
std::bernoulli_distribution distribution{0.5};
std::vector<bool> vec(length);
for (auto && val : vec) {
val = distribution(rng);
}
return vec;
}
constexpr std::size_t n_ints{1000};
// NOTE: despite the name, this measures host_to_big, i.e. the code under "Solution 2" above.
static void solution1(benchmark::State& state) {
std::vector<int> intvec = get_random_vector<int>(n_ints, 0);
std::vector<std::uint8_t> buffer(sizeof(int)*intvec.size());
for (auto _ : state) {
for (std::size_t i{}; i < intvec.size(); ++i) {
host_to_big(buffer.data() + sizeof(int)*i, intvec[i]);
}
benchmark::DoNotOptimize(buffer);
benchmark::ClobberMemory();
}
}
BENCHMARK(solution1);
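// NOTE: despite the name, this measures host_to_net, i.e. the code under "Solution 1" above.
static void solution2(benchmark::State& state) {
std::vector<int> intvec = get_random_vector<int>(n_ints, 0);
std::vector<std::uint8_t> buffer(sizeof(int)*intvec.size());
for (auto _ : state) {
for (std::size_t i{}; i < intvec.size(); ++i) {
// NOTE: the assignment narrows to std::uint8_t, so only the low byte of the converted value is stored.
buffer[sizeof(int)*i] = host_to_net(intvec[i]);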
}
benchmark::DoNotOptimize(buffer);
benchmark::ClobberMemory();
}
}
BENCHMARK(solution2);

Related

How to write a portable constexpr std::copysign()?

In particular, it must work with NaNs as std::copysign does. Similarly, I need a constexpr std::signbit.
constexpr double copysign(double mag, double sgn)
{
// how?
}
constexpr bool signbit(double arg)
{
// how?
}
// produce the two types of NaNs
constexpr double nan_pos = copysign(std::numeric_limits<double>::quiet_NaN(), +1);
constexpr double nan_neg = copysign(std::numeric_limits<double>::quiet_NaN(), -1);
// must pass the checks
static_assert(signbit(nan_pos) == false);
static_assert(signbit(nan_neg) == true);
The story behind is that I need two types of NaNs at compile time, as well as ways to distinguish between them. The most straightforward way I can think of is to manipulate the sign bit of the NaNs. It does work at run time; now I just want to move some computations to compile time, and this is the last hurdle.
Notes: at the moment, I'm relying on GCC, as it has built-in versions of these functions and they are indeed constexpr, which is nice. But I want my codebase to compile on Clang and perhaps other compilers too.
P0533: constexpr for <cmath> and <cstdlib> is now accepted.
Starting in C++23 you can just use std::copysign and std::signbit.
Use of __builtin... is not really portable, but it works in the compilers mentioned as targets. __builtin_copysign is constexpr, but __builtin_signbit apparently is not on Clang, so signbit is implemented with __builtin_copysign:
#include <limits>
constexpr double copysign(double mag, double sgn)
{
return __builtin_copysign(mag, sgn);
}
constexpr bool signbit(double arg)
{
return __builtin_copysign(1, arg) < 0;
}
// produce the two types of NaNs
constexpr double nan_pos = copysign(std::numeric_limits<double>::quiet_NaN(), +1);
constexpr double nan_neg = copysign(std::numeric_limits<double>::quiet_NaN(), -1);
// must pass the checks
static_assert(signbit(nan_pos) == false);
static_assert(signbit(nan_neg) == true);
int main() {}
https://godbolt.org/z/8Wafaj4a4
If you can use std::bit_cast, you can manipulate floating point types cast to integer types. The portability is limited to the representation of double, but if you can assume the IEEE 754 double-precision binary floating-point format, casting to uint64_t and using the sign bit should work.

What is the default hash function used in C++ std::unordered_map?

I am using
unordered_map<string, int>
and
unordered_map<int, int>
What hash function is used in each case and what is chance of collision in each case?
I will be inserting unique string and unique int as keys in each case respectively.
I am interested in knowing the algorithm of hash function in case of string and int keys and their collision stats.
The function object std::hash<> is used.
Standard specializations exist for all built-in types, and some other standard library types
such as std::string and std::thread. See the link for the full list.
For other types to be used in a std::unordered_map, you will have to specialize std::hash<> or create your own function object.
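As an illustration of specializing std::hash<> for your own key type, here is a minimal sketch. Point and make_example_map are hypothetical names, and the way the two member hashes are combined is just one common idiom, not something the standard prescribes.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical key type, used only for illustration.
struct Point {
    int x;
    int y;
    // unordered_map also needs equality to resolve collisions.
    bool operator==(const Point& other) const noexcept {
        return x == other.x && y == other.y;
    }
};

namespace std {
// Specialization of std::hash for the user-defined type.
template <>
struct hash<Point> {
    size_t operator()(const Point& p) const noexcept {
        const size_t hx = hash<int>{}(p.x);
        const size_t hy = hash<int>{}(p.y);
        // One common way to mix two hash values (an assumption of this sketch).
        return hx ^ (hy + 0x9e3779b9u + (hx << 6) + (hx >> 2));
    }
};
}  // namespace std

// Small example map keyed by Point.
inline std::unordered_map<Point, int> make_example_map() {
    return {{Point{1, 2}, 3}, {Point{4, 5}, 6}};
}
```

The alternative mentioned above, a separate function object, would be passed as the third template argument of std::unordered_map instead.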
The chance of collision is completely implementation-dependent, but considering the fact that integers are limited to a defined range, while strings are theoretically infinitely long, I'd say there is a much better chance for collision with strings.
As for the implementation in GCC, the specialization for built-in types just returns the bit pattern. Here's how they are defined in bits/functional_hash.h:
/// Partial specializations for pointer types.
template<typename _Tp>
struct hash<_Tp*> : public __hash_base<size_t, _Tp*>
{
size_t
operator()(_Tp* __p) const noexcept
{ return reinterpret_cast<size_t>(__p); }
};
// Explicit specializations for integer types.
#define _Cxx_hashtable_define_trivial_hash(_Tp) \
template<> \
struct hash<_Tp> : public __hash_base<size_t, _Tp> \
{ \
size_t \
operator()(_Tp __val) const noexcept \
{ return static_cast<size_t>(__val); } \
};
/// Explicit specialization for bool.
_Cxx_hashtable_define_trivial_hash(bool)
/// Explicit specialization for char.
_Cxx_hashtable_define_trivial_hash(char)
/// ...
The specialization for std::string is defined as:
#ifndef _GLIBCXX_COMPATIBILITY_CXX0X
/// std::hash specialization for string.
template<>
struct hash<string>
: public __hash_base<size_t, string>
{
size_t
operator()(const string& __s) const noexcept
{ return std::_Hash_impl::hash(__s.data(), __s.length()); }
};
Some further search leads us to:
struct _Hash_impl
{
static size_t
hash(const void* __ptr, size_t __clength,
size_t __seed = static_cast<size_t>(0xc70f6907UL))
{ return _Hash_bytes(__ptr, __clength, __seed); }
...
};
...
// Hash function implementation for the nontrivial specialization.
// All of them are based on a primitive that hashes a pointer to a
// byte array. The actual hash algorithm is not guaranteed to stay
// the same from release to release -- it may be updated or tuned to
// improve hash quality or speed.
size_t
_Hash_bytes(const void* __ptr, size_t __len, size_t __seed);
_Hash_bytes is an external function from libstdc++. A bit more searching led me to this file, which states:
// This file defines Hash_bytes, a primitive used for defining hash
// functions. Based on public domain MurmurHashUnaligned2, by Austin
// Appleby. http://murmurhash.googlepages.com/
So the default hashing algorithm GCC uses for strings is MurmurHashUnaligned2.
GCC C++11 uses "MurmurHashUnaligned2", by Austin Appleby
Though the hashing algorithms are compiler-dependent, I'll present it for GCC C++11. @Avidan Borisov astutely discovered that the GCC hashing algorithm used for strings is "MurmurHashUnaligned2," by Austin Appleby. I did some searching and found a mirrored copy of GCC on GitHub. Therefore:
The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.
Thanks to Avidan Borisov for his background research on the question of which hash functions GCC C++11 uses, stating that GCC uses an implementation of "MurmurHashUnaligned2", by Austin Appleby (see http://murmurhash.googlepages.com/ and https://github.com/aappleby/smhasher).
In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017)
Code:
// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
const size_t m = 0x5bd1e995;
size_t hash = seed ^ len;
const char* buf = static_cast<const char*>(ptr);
// Mix 4 bytes at a time into the hash.
while (len >= 4)
{
size_t k = unaligned_load(buf);
k *= m;
k ^= k >> 24;
k *= m;
hash *= m;
hash ^= k;
buf += 4;
len -= 4;
}
// Handle the last few bytes of the input array.
switch (len)
{
case 3:
hash ^= static_cast<unsigned char>(buf[2]) << 16;
[[gnu::fallthrough]];
case 2:
hash ^= static_cast<unsigned char>(buf[1]) << 8;
[[gnu::fallthrough]];
case 1:
hash ^= static_cast<unsigned char>(buf[0]);
hash *= m;
};
// Do a few final mixes of the hash.
hash ^= hash >> 13;
hash *= m;
hash ^= hash >> 15;
return hash;
}
The latest version of Austin Appleby's hashing functions is "MurmurHash3", which is released into the public domain!
Austin states in his readme:
The SMHasher suite also includes MurmurHash3, which is the latest version in the series of MurmurHash functions - the new version is faster, more robust, and its variants can produce 32- and 128-bit hash values efficiently on both x86 and x64 platforms.
For MurmurHash3's source code, see here:
MurmurHash3.h
MurmurHash3.cpp
And the great thing is!? It's public domain software. That's right! The tops of the files state:
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.
So, if you'd like to use MurmurHash3 in your open source software, personal projects, or proprietary software, including for implementing your own hash tables in C, go for it!
If you'd like build instructions to build and test his MurmurHash3 code, I've written some here: https://github.com/ElectricRCAircraftGuy/smhasher/blob/add_build_instructions/build/README.md. Hopefully this PR I've opened gets accepted and then they will end up in his main repo. But, until then, refer to the build instructions in my fork.
For additional hashing functions, including djb2, and the 2 versions of the K&R hashing functions...
...(one apparently terrible, one pretty good), see my other answer here: hash function for string.
See also:
https://en.wikipedia.org/wiki/MurmurHash
Further study to do: take a look at these hash function speed benchmarks: https://github.com/fredrikwidlund/hash-function-benchmark (thanks @lfmunoz for pointing this out)

C/C++ faster calls from array of functions

Dear all,
I've a program in C++ that should be as fast as possible.
In particular, there is a crucial part that works as follows:
There is a variable D ranging in [0,255], based on its value
it calls the function whose pointer is stored in a array F (with 255 elements).
i.e., F[D] gives the pointer to the function to call.
The functions are very simple: they execute a few expressions and/or assignments (no loops).
How can I do this faster?
I can replicate the code in the functions. I do not need the features of function calls (I'm using them since that was the simplest way to do it).
My goal is to remove the inefficiencies due to function calls.
I am considering using switch/case.
The code of each case is the code of the corresponding function.
Is there a faster way to do it?
A switch/case may be faster, as the compiler can make more specific optimizations, and there are other things like code locality compared to calling a function pointer. However, it's unlikely that you'll notice a substantial performance increase.
Convert it to a switch
Make sure the tiny functions can be inlined (how to do this is compiler dependent - but explicitly marking them inline and placing them in the compilation unit is a good bet)
Finally, you might try profile-guided optimizations if (as seems quite possible) your switch statement calls the branches in a non-uniform fashion.
The aim is to avoid function-call overhead and have the compiler reorder the switch statement to reduce the number of branches typically encountered.
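The first two steps can be sketched as follows. The three operations are hypothetical stand-ins for the real functions, and only a few of the 256 cases are written out; the point is that every callee is visible in the same translation unit, so the compiler is free to inline it into the switch.

```cpp
// Hypothetical tiny operations standing in for the real functions.
inline unsigned op_inc(unsigned v) { return v + 1u; }
inline unsigned op_dbl(unsigned v) { return v * 2u; }
inline unsigned op_not(unsigned v) { return ~v; }

// Dispatch on d with a switch instead of a function-pointer table.
inline unsigned dispatch_switch(unsigned char d, unsigned v) {
    switch (d) {
        case 0: return op_inc(v);
        case 1: return op_dbl(v);
        case 2: return op_not(v);
        // ... one case per remaining value of d in the real program ...
        default: return v;
    }
}
```

Whether this beats the function-pointer table depends on the compiler and the workload, so it is worth measuring both.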
Edit: since there are some skeptical voices as to how much this helps, I gave it a shot. In the following program, the function-pointer loop needs 1.07 seconds, and the inlined switch statement takes 0.79 seconds on my machine - YMMV:
template<unsigned n0,unsigned n> struct F { static inline unsigned func(unsigned val); };
template<unsigned n> struct F<0,n> { static inline unsigned func(unsigned val) { return val + n;} };
template<unsigned n> struct F<1,n> { static inline unsigned func(unsigned val) { return val - n; } };
template<unsigned n> struct F<2,n> { static inline unsigned func(unsigned val) { return val ^ n; } };
template<unsigned n> struct F<3,n> { static inline unsigned func(unsigned val) { return val * n; } };
template<unsigned n> struct F<4,n> { static inline unsigned func(unsigned val) { return (val << ( n %16)) + n*(n&0xff); } };
template<unsigned n> struct F<5,n> { static inline unsigned func(unsigned val) { return (val >> ( n %16)) + (n*(n&0xff) << 16); } };
template<unsigned n> struct F<6,n> { static inline unsigned func(unsigned val) { return val / (n|1) + val; } };
template<unsigned n> struct F<7,n> { static inline unsigned func(unsigned val) { return (val <<16) + (val>>16); } };
template<unsigned n> struct f { static inline unsigned func(unsigned val) { return F<n%8,n>::func(val); } };
typedef unsigned (*fPtr)(unsigned);
fPtr funcs[256];
template<unsigned n0,unsigned n1> inline void fAssign() {
if(n0==n1-1 || n0==n1) //||n0==n1 just to avoid compiler warning
funcs[n0] = f<n0>::func;
else {
fAssign<n0,(n0 + n1)/2>();
fAssign<(n0 + n1)/2,n1>();
}
}
__forceinline unsigned funcSwitch(unsigned char type,unsigned val);//huge function elided
__declspec(noinline) unsigned doloop(unsigned val,unsigned start,unsigned end) {
for(unsigned x=start;x<end;++x)
val = funcs[x*37&0xff](val);
return val;
}
__declspec(noinline) unsigned doloop2(unsigned val,unsigned start,unsigned end) {
for(unsigned x=start;x<end;++x)
val = funcSwitch(x*37&0xff,val);
return val;
}
I verified that all function calls are inlined except those to doloop, doloop2 and funcs[?] to ensure I'm not measuring odd compiler choices.
So, on this machine, with MSC 10, this (thoroughly artificial) benchmark shows that the huge-switch version is a third faster than the function-pointer lookup-based version. PGO slowed both versions down; probably because they're too small to exhibit cache effects and the program is small enough to be fully inlined/optimized even without PGO.
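You're not going to beat the lookup-table-based indirect call with any other method, including switch. For a dense range of case values like this one, compilers typically lower a switch to a jump table, which is itself an indirect jump through a table; otherwise a switch takes at best logarithmic time (well, it might hash to get constant time, but that'll likely be slower than logarithmic time with ints for most input sizes).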
This is assuming your code looks like this:
typedef void (*myProc_t)();
myProc_t functionArray[255] = { ... };
void CallSomepin(unsigned char D)
{
functionArray[D]();
}
If you're creating the array each time the function is called however, it might be a good idea to amortize the cost of the construction by doing the initialization of the function array once, rather than every time.
EDIT: That said, the best way to avoid the inefficiency of the indirect call is simply to not do it. Look at your code and see if there are places you can replace the indirect lookup based call with a direct call.
There is a variable D ranging in [0,255], based on its value it calls the function whose pointer is stored in a array F (with 255 elements)
Warning, this will result in a buffer overflow if D has the value 255.
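To make the point concrete: an index of type unsigned char can take 256 distinct values (0 to 255), so the table needs 256 entries. A minimal sketch, with hypothetical handler names and a default handler filling the unused slots:

```cpp
#include <array>

using Handler = unsigned (*)(unsigned);

// Hypothetical handlers standing in for the real functions.
inline unsigned handler_identity(unsigned v) { return v; }
inline unsigned handler_add_one(unsigned v) { return v + 1u; }
inline unsigned handler_twice(unsigned v) { return v * 2u; }

// Builds a table with one entry for every possible value of an unsigned char.
inline std::array<Handler, 256> make_table() {
    std::array<Handler, 256> table{};
    table.fill(&handler_identity);  // default for slots without a dedicated handler
    table[0] = &handler_add_one;
    table[255] = &handler_twice;    // index 255 is valid only with 256 entries
    return table;
}

inline unsigned dispatch(unsigned char d, unsigned v) {
    // Initialized once, on first use, rather than on every call.
    static const std::array<Handler, 256> table = make_table();
    return table[d](v);
}
```

With this size, no value of d can index past the end of the table.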

constexpr and endianness

A common question that comes up from time to time in the world of C++ programming is compile-time determination of endianness. Usually this is done with barely portable #ifdefs. But does the C++11 constexpr keyword along with template specialization offer us a better solution to this?
Would it be legal C++11 to do something like:
constexpr bool little_endian()
{
const static unsigned num = 0xAABBCCDD;
return reinterpret_cast<const unsigned char*> (&num)[0] == 0xDD;
}
And then specialize a template for both endian types:
template <bool LittleEndian>
struct Foo
{
// .... specialization for little endian
};
template <>
struct Foo<false>
{
// .... specialization for big endian
};
And then do:
Foo<little_endian()>::do_something();
New answer (C++20)
C++20 has introduced a new standard library header <bit>.
Among other things it provides a clean, portable way to check the endianness.
Since my old method relies on some questionable techniques, I suggest that anyone who uses it switch to the check provided by the standard library.
Here's an adapter which allows using the new way of checking endianness without having to update the code that relies on the interface of my old class:
#include <bit>
class Endian
{
public:
Endian() = delete;
static constexpr bool little = std::endian::native == std::endian::little;
static constexpr bool big = std::endian::native == std::endian::big;
static constexpr bool middle = !little && !big;
};
Old answer
I was able to write this:
#include <cstdint>
class Endian
{
private:
static constexpr uint32_t uint32_ = 0x01020304;
static constexpr uint8_t magic_ = (const uint8_t&)uint32_;
public:
static constexpr bool little = magic_ == 0x04;
static constexpr bool middle = magic_ == 0x02;
static constexpr bool big = magic_ == 0x01;
static_assert(little || middle || big, "Cannot determine endianness!");
private:
Endian() = delete;
};
I've tested it with g++ and it compiles without warnings. It gives a correct result on x64.
If you have any big-endian or middle-endian processor, please, confirm that this works for you in a comment.
It is not possible to determine endianness at compile time using constexpr (before C++20). reinterpret_cast is explicitly forbidden by [expr.const]p2, as is iain's suggestion of reading from a non-active member of a union. Casting to a different reference type is also forbidden, as such a cast is interpreted as a reinterpret_cast.
Update:
This is now possible in C++20. One way (live):
#include <bit>
#include <climits>  // CHAR_BIT
#include <concepts> // std::integral
template<std::integral T>
constexpr bool is_little_endian() {
for (unsigned bit = 0; bit != sizeof(T) * CHAR_BIT; ++bit) {
unsigned char data[sizeof(T)] = {};
// In little-endian, bit i of the raw bytes ...
data[bit / CHAR_BIT] = 1 << (bit % CHAR_BIT);
// ... corresponds to bit i of the value.
if (std::bit_cast<T>(data) != T(1) << bit)
return false;
}
return true;
}
static_assert(is_little_endian<int>());
(Note that C++20 guarantees two's complement integers -- with an unspecified bit order -- so we just need to check that every bit of the data maps to the expected place in the integer.)
But if you have a C++20 standard library, you can also just ask it:
#include <bit>
constexpr bool is_little_endian = std::endian::native == std::endian::little;
Assuming N2116 is the wording that gets incorporated, then your example is ill-formed (notice that there is no concept of "legal/illegal" in C++). The proposed text for [decl.constexpr]/3 says
its function-body shall be a compound-statement of the form
{ return expression; }
where expression is a potential constant expression (5.19);
Your function violates the requirement in that it also declares a local variable.
Edit: This restriction could be overcome by moving num outside of the function. The function still wouldn't be well-formed, then, because expression needs to be a potential constant expression, which is defined as
An expression is a potential constant expression if it is a constant
expression when all occurrences of function parameters are replaced
by arbitrary constant expressions of the appropriate type.
IOW, reinterpret_cast<const unsigned char*> (&num)[0] == 0xDD would have to be a constant expression. However, it is not: &num would be an address constant-expression (5.19/4). Accessing the value of such a pointer is, however, not allowed for a constant expression:
The subscripting operator [] and the class member access . and ->
operators, the & and * unary operators, and pointer casts (except dynamic_casts, 5.2.7) can be used in the creation of an
address constant expression, but the value of an object shall not be accessed by the use of these operators.
Edit: The above text is from C++98. Apparently, C++0x is more permissive in what it allows for constant expressions. The expression involves an lvalue-to-rvalue conversion of the array reference, which is banned from constant expressions unless
it is applied to an lvalue of effective integral type that refers
to a non-volatile const variable or static data member initialized
with constant expressions
It's not clear to me whether (&num)[0] "refers to" a const variable, or whether only a literal num "refers to" such a variable. If (&num)[0] refers to that variable, it is then unclear whether reinterpret_cast<const unsigned char*> (&num)[0] still "refers to" num.
There is std::endian in the upcoming C++20.
#include <bit>
constexpr bool little_endian() noexcept
{
return std::endian::native == std::endian::little;
}
My first post. Just wanted to share some code that I'm using.
//Some handy defines magic, thanks overflow
#define IS_LITTLE_ENDIAN ('ABCD'==0x41424344UL) //41 42 43 44 = 'ABCD' hex ASCII code
#define IS_BIG_ENDIAN ('ABCD'==0x44434241UL) //44 43 42 41 = 'DCBA' hex ASCII code
#define IS_UNKNOWN_ENDIAN (IS_LITTLE_ENDIAN == IS_BIG_ENDIAN)
//Next in code...
struct Quad
{
union
{
#if IS_LITTLE_ENDIAN
struct { std::uint8_t b0, b1, b2, b3; };
#elif IS_BIG_ENDIAN
struct { std::uint8_t b3, b2, b1, b0; };
#elif IS_UNKNOWN_ENDIAN
#error "Endianness not implemented!"
#endif
std::uint32_t dword;
};
};
Constexpr version:
namespace Endian
{
namespace Impl //Private
{
//41 42 43 44 = 'ABCD' hex ASCII code
static constexpr std::uint32_t LITTLE_{ 0x41424344u };
//44 43 42 41 = 'DCBA' hex ASCII code
static constexpr std::uint32_t BIG_{ 0x44434241u };
//Converts chars to uint32 on current platform
static constexpr std::uint32_t NATIVE_{ 'ABCD' };
}
//Public
enum class Type : size_t { UNKNOWN, LITTLE, BIG };
//Compare
static constexpr bool IS_LITTLE = Impl::NATIVE_ == Impl::LITTLE_;
static constexpr bool IS_BIG = Impl::NATIVE_ == Impl::BIG_;
static constexpr bool IS_UNKNOWN = IS_LITTLE == IS_BIG;
//Endian type on current platform
static constexpr Type NATIVE_TYPE = IS_LITTLE ? Type::LITTLE : IS_BIG ? Type::BIG : Type::UNKNOWN;
//Uncomment for test.
//static_assert(!IS_LITTLE, "This platform has little endian.");
//static_assert(!IS_BIG, "This platform has big endian.");
//static_assert(!IS_UNKNOWN, "Error: Unsupported endian!");
}
That is a very interesting question.
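I am not a language lawyer, but you might be able to replace the reinterpret_cast with a union.
const union {
unsigned int int_value;
unsigned char char_value[4];
} Endian = { 0xAABBCCDD };
constexpr bool little_endian()
{
// Note: this reads a union member other than the one that was initialized,
// which a compiler is not required to accept in a constant expression.
return Endian.char_value[0] == 0xDD;
}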
This may seem like cheating, but you can always include endian.h... BYTE_ORDER == BIG_ENDIAN is a valid constexpr...
Here is a simple C++11 compliant version, inspired by #no-name answer:
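constexpr bool is_system_little_endian(int value = 1) {
// Caution: a static_cast to a reference of another type binds to a converted temporary,
// so this compares the converted value of `value`, not its first byte in memory.
return static_cast<const unsigned char&>(value) == 1;
}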
Using a default argument to squeeze everything into one line meets the C++11 requirement on constexpr functions: they must contain only a single return statement.
The good thing with doing it (and testing it!) in a constexpr context is that it makes sure that there is no undefined behavior in the code.
On compiler explorer here.
If your goal is to ensure that the compiler optimizes little_endian() into a constant true or false at compile time, without any of its contents winding up in the executable or being executed at runtime, and only generating code from the "correct" one of your two Foo templates, I fear you're in for a disappointment.
I also am not a language lawyer, but it looks to me like constexpr is like inline or register: a keyword that alerts the compiler writer to the presence of a potential optimization. Then it's up to the compiler writer whether or not to take advantage of that. Language specs typically mandate behaviors, not optimizations.
Also, have you actually tried this on a variety of C++0x-compliant compilers to see what happens? I would guess most of them would choke on your dual templates, since they won't be able to figure out which one to use if invoked with false.
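For what it is worth, a call to a constexpr function is a constant expression when its arguments are, so it can appear as a template argument, and the choice between two specializations is then made during compilation. A minimal sketch, with hypothetical names (char_is_signed, Foo, SelectedFoo) that are not taken from the question:

```cpp
#include <limits>

// Hypothetical compile-time predicate; its result depends on the platform.
constexpr bool char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Primary template is only declared; the two specializations do the work.
template <bool Flag>
struct Foo;

template <>
struct Foo<true> {
    static constexpr int id() { return 1; }
};

template <>
struct Foo<false> {
    static constexpr int id() { return 2; }
};

// The template argument is evaluated by the compiler, so SelectedFoo
// names exactly one of the two specializations.
using SelectedFoo = Foo<char_is_signed()>;
```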

Templatized branchless int max/min function

I'm trying to write a branchless function to return the MAX or MIN of two integers without resorting to if (or ?:). Using the usual technique I can do this easily enough for a given word size:
inline int32 imax( int32 a, int32 b )
{
// signed for arithmetic shift
int32 mask = a - b;
// mask < 0 means MSB is 1.
return a + ( ( b - a ) & ( mask >> 31 ) );
}
Now, assuming arguendo that I really am writing the kind of application on the kind of in-order processor where this is necessary, my question is whether there is a way to use C++ templates to generalize this to all sizes of int.
The >>31 step only works for int32s, of course, and while I could copy out overloads on the function for int8, int16, and int64, it seems like I should use a template function instead. But how do I get the size of a template argument in bits?
Is there a better way to do it than this? Can I force the mask T to be signed? If T is unsigned the mask-shift step won't work (because it'll be a logical rather than arithmetic shift).
template< typename T >
inline T imax( T a, T b )
{
// how can I force this T to be signed?
T mask = a - b;
// I hope the compiler turns the math below into an immediate constant!
mask = mask >> ( (sizeof(T) * 8) - 1 );
return a + ( ( b - a ) & mask );
}
And, having done the above, can I prevent it from being used for anything but an integer type (eg, no floats or classes)?
EDIT: This answer is from before C++11. Since then, C++11 and later have offered make_signed<T> and much more as part of the standard library.
Generally, looks good, but for 100% portability, replace that 8 with CHAR_BIT from <climits> (or use std::numeric_limits<unsigned char>::digits), since it isn't guaranteed that characters are 8-bit.
Any good compiler will be smart enough to merge all of the math constants at compile time.
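The bit count can be written down once as a compile-time helper. A minimal sketch, assuming C++11; storage_bits and top_bit_index are hypothetical names, not part of any library:

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// Number of bits in the object representation of T.
template <typename T>
constexpr int storage_bits() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Index of the most significant bit of an integer type T.
// numeric_limits<T>::digits excludes the sign bit for signed types,
// so it already equals the sign bit's index; for unsigned types it
// counts every bit, so the top index is one less.
template <typename T>
constexpr int top_bit_index() {
    return std::numeric_limits<T>::digits -
           (std::numeric_limits<T>::is_signed ? 0 : 1);
}
```

With these, the shift amount in the question would be top_bit_index<T>() instead of a hard-coded 31.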
You can force it to be signed by using a type traits library, which would usually look something like this (assuming your library is called numeric_traits):
typename numeric_traits<T>::signed_type x;
An example of a manually rolled numeric_traits header could look like this: http://rafb.net/p/Re7kq478.html (there is plenty of room for additions, but you get the idea).
or better yet, use boost:
typename boost::make_signed<T>::type x;
EDIT: IIRC, signed right shifts don't have to be arithmetic. It is common, and certainly the case with every compiler I've used. But I believe that the standard leaves it up to the compiler whether right shifts are arithmetic or not on signed types. In my copy of the draft standard, the following is written:
The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
But as I said, it will work on every compiler I've seen :-p.
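If relying on an arithmetic right shift is a concern, the mask can be built from the comparison result instead. A minimal sketch, assuming two's complement integers and C++11; imax_select and imin_select are hypothetical names. Compilers typically turn the comparison into a 0/1 value without a branch, but that is not guaranteed:

```cpp
#include <type_traits>

// -T(a < b) is all ones when a < b and zero otherwise, for signed and
// unsigned T alike. a ^ ((a ^ b) & mask) then yields b when the mask is
// all ones and a when it is zero. No subtraction of a and b is involved,
// so a - b cannot overflow.
template <typename T>
constexpr T imax_select(T a, T b) {
    static_assert(std::is_integral<T>::value, "imax_select requires an integer type");
    return static_cast<T>(a ^ ((a ^ b) & static_cast<T>(-static_cast<T>(a < b))));
}

// Same mask, opposite selection: yields a when a < b, otherwise b.
template <typename T>
constexpr T imin_select(T a, T b) {
    static_assert(std::is_integral<T>::value, "imin_select requires an integer type");
    return static_cast<T>(b ^ ((a ^ b) & static_cast<T>(-static_cast<T>(a < b))));
}
```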
Here's another approach for branchless max and min. What's nice about it is that it doesn't use any bit tricks and you don't have to know anything about the type.
template <typename T>
inline T imax (T a, T b)
{
return (a > b) * a + (a <= b) * b;
}
template <typename T>
inline T imin (T a, T b)
{
return (a > b) * b + (a <= b) * a;
}
tl;dr
To achieve your goals, you're best off just writing this:
template<typename T> T max(T a, T b) { return (a > b) ? a : b; }
Long version
I implemented both the "naive" implementation of max() and your branchless implementation. Neither was templated; I used int32 just to keep things simple. As far as I can tell, not only did Visual Studio 2017 make the naive implementation branchless, it also produced fewer instructions.
Here is the relevant Godbolt (and please, check the implementation to make sure I did it right). Note that I'm compiling with /O2 optimizations.
Admittedly, my assembly-fu isn't all that great. NaiveMax() had 5 fewer instructions and no apparent branching (as for inlining, I'm honestly not sure what's happening), but I wanted to run a test case to show definitively whether the naive implementation was faster or not.
So I built a test. Here's the code I ran. Visual Studio 2017 (15.8.7) with "default" Release compiler options.
#include <iostream>
#include <chrono>
#include <cstdio>   // getchar
#include <cstdlib>  // rand
using int32 = long;
using uint32 = unsigned long;
constexpr int32 NaiveMax(int32 a, int32 b)
{
return (a > b) ? a : b;
}
constexpr int32 FastMax(int32 a, int32 b)
{
int32 mask = a - b;
mask = mask >> ((sizeof(int32) * 8) - 1);
return a + ((b - a) & mask);
}
int main()
{
int32 resInts[1000] = {};
int32 lotsOfInts[1'000];
for (uint32 i = 0; i < 1000; i++)
{
lotsOfInts[i] = rand();
}
auto naiveTime = [&]() -> auto
{
auto start = std::chrono::high_resolution_clock::now();
for (uint32 i = 1; i < 1'000'000; i++)
{
const auto index = i % 1000;
const auto lastIndex = (i - 1) % 1000;
resInts[lastIndex] = NaiveMax(lotsOfInts[lastIndex], lotsOfInts[index]);
}
auto finish = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
}();
auto fastTime = [&]() -> auto
{
auto start = std::chrono::high_resolution_clock::now();
for (uint32 i = 1; i < 1'000'000; i++)
{
const auto index = i % 1000;
const auto lastIndex = (i - 1) % 1000;
resInts[lastIndex] = FastMax(lotsOfInts[lastIndex], lotsOfInts[index]);
}
auto finish = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
}();
std::cout << "Naive Time: " << naiveTime << std::endl;
std::cout << "Fast Time: " << fastTime << std::endl;
getchar();
return 0;
}
And here's the output I get on my machine:
Naive Time: 2330174
Fast Time: 2492246
I've run it several times getting similar results. Just to be safe, I also changed the order in which I conduct the tests, just in case it's the result of a core ramping up in speed, skewing the results. In all cases, I get similar results to the above.
Of course, depending on your compiler or platform, these numbers may all be different. It's worth testing yourself.
The Answer
In brief, it would seem that the best way to write a branchless templated max() function is probably to keep it simple:
template<typename T> T max(T a, T b) { return (a > b) ? a : b; }
There are additional upsides to the naive method:
It works for unsigned types.
It even works for floating types.
It expresses exactly what you intend, rather than needing to comment up your code describing what the bit-twiddling is doing.
It is a well known and recognizable pattern, so most compilers will know exactly how to optimize it, making it more portable. (This is a gut hunch of mine, only backed up by personal experience of compilers surprising me a lot. I'll be willing to admit I'm wrong here.)
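The first two points can be checked at compile time. A minimal sketch, assuming C++11; the function is named simple_max here (a hypothetical name) only to avoid clashing with std::max:

```cpp
// Same body as the naive max above, marked constexpr so it can be
// evaluated inside static_assert.
template <typename T>
constexpr T simple_max(T a, T b) {
    return (a > b) ? a : b;
}

// Unsigned: no sign-bit trick is involved, so large values compare correctly.
static_assert(simple_max<unsigned>(1u, ~0u) == ~0u, "works for unsigned types");

// Floating point: the comparison is all that is needed.
static_assert(simple_max(1.5, -2.0) == 1.5, "works for floating types");
```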
You may want to look at the Boost.TypeTraits library. For detecting whether a type is signed you can use the is_signed trait. You can also look into enable_if/disable_if for removing overloads for certain types.
I don't know the exact conditions under which this bit mask trick works, but you can do something like
#include<type_traits>
template<typename T, typename = std::enable_if_t<std::is_integral<T>{}> >
inline T imax( T a, T b )
{
...
}
Other useful candidates are std::is_[un]signed, std::is_fundamental, etc. https://en.cppreference.com/w/cpp/types
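To see the constraint at work, the following sketch pairs a constrained function with a small detection trait. It assumes C++11; imax_integral_only and has_imax are hypothetical names, and the function body is only a placeholder for whichever implementation is preferred:

```cpp
#include <type_traits>
#include <utility>

// Participates in overload resolution only when T is an integer type.
template <typename T,
          typename = typename std::enable_if<std::is_integral<T>::value>::type>
constexpr T imax_integral_only(T a, T b) {
    return (a > b) ? a : b;  // placeholder body
}

// Detection trait: true when imax_integral_only(T, T) is a valid call.
template <typename T, typename = void>
struct has_imax : std::false_type {};

template <typename T>
struct has_imax<T, decltype(void(imax_integral_only(std::declval<T>(),
                                                    std::declval<T>())))>
    : std::true_type {};
```

Calling imax_integral_only with a float or a class type then fails to compile because no viable overload exists, rather than silently producing a wrong result.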
In addition to tloch14's "tl;dr" answer, one can also use an index into an array. This avoids the unwieldy bit shuffling of the "branchless min/max", and it generalizes to all types.
template<typename T> constexpr T OtherFastMax(const T &a, const T &b)
{
const T (&p)[2] = {a, b};
return p[a>b];
}