The man pages for htonl() seem to suggest that you can only use it for up to 32 bit values. (In reality, ntohl() is defined for unsigned long, which on my platform is 32 bits. I suppose if the unsigned long were 8 bytes, it would work for 64 bit ints).
My problem is that I need to convert 64 bit integers (in my case, this is an unsigned long long) from big endian to little endian. Right now, I need to do that specific conversion. But it would be even nicer if the function (like ntohl()) would NOT convert my 64 bit value if the target platform WAS big endian. (I'd rather avoid adding my own preprocessor magic to do this).
What can I use? I would like something that is standard if it exists, but I am open to implementation suggestions. I have seen this type of conversion done in the past using unions. I suppose I could have a union with an unsigned long long and a char[8]. Then swap the bytes around accordingly. (Obviously would break on platforms that were big endian).
Documentation: man htobe64 on Linux (glibc >= 2.9) or FreeBSD.
Unfortunately, an attempt in 2009 by OpenBSD, FreeBSD, and glibc (Linux) to agree on a single (non-kernel-API) libc standard for this did not quite work out.
Currently, this short bit of preprocessor code:
#if defined(__linux__)
# include <endian.h>
#elif defined(__FreeBSD__) || defined(__NetBSD__)
# include <sys/endian.h>
#elif defined(__OpenBSD__)
# include <sys/types.h>
# define be16toh(x) betoh16(x)
# define be32toh(x) betoh32(x)
# define be64toh(x) betoh64(x)
#endif
(tested on Linux and OpenBSD) should hide the differences. It gives you the Linux/FreeBSD-style macros on those 4 platforms.
Use example:
#include <stdint.h> // For 'uint64_t'
uint64_t host_int = 123;
uint64_t big_endian;
big_endian = htobe64( host_int );
host_int = be64toh( big_endian );
It's the most "standard C library"-ish approach available at the moment.
I would recommend reading this: http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

uint64_t
ntoh64(const uint64_t *input)
{
    uint64_t rval;
    uint8_t *data = (uint8_t *)&rval;

    data[0] = *input >> 56;
    data[1] = *input >> 48;
    data[2] = *input >> 40;
    data[3] = *input >> 32;
    data[4] = *input >> 24;
    data[5] = *input >> 16;
    data[6] = *input >> 8;
    data[7] = *input >> 0;

    return rval;
}

uint64_t
hton64(const uint64_t *input)
{
    return (ntoh64(input));
}

int
main(void)
{
    uint64_t ull;

    ull = 1;
    printf("%"PRIu64"\n", ull);
    ull = ntoh64(&ull);
    printf("%"PRIu64"\n", ull);
    ull = hton64(&ull);
    printf("%"PRIu64"\n", ull);

    return 0;
}
Will show the following output:
1
72057594037927936
1
You can test this with ntohl() if you drop the upper 4 bytes.
You can also turn this into a nice templated function in C++ that will work on integers of any size:
template <typename T>
static inline T
hton_any(const T &input)
{
    T output(0);
    const std::size_t size = sizeof(input);
    uint8_t *data = reinterpret_cast<uint8_t *>(&output);

    for (std::size_t i = 0; i < size; i++) {
        data[i] = input >> ((size - i - 1) * 8);
    }

    return output;
}
Now you're 128-bit safe too!
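For example, a minimal usage sketch (assuming the hton_any template above is in scope; the commented result assumes a little endian host):

#include <cstdint>
#include <cstdio>

int main()
{
    std::uint64_t v = hton_any(std::uint64_t(0x1122334455667788ULL));
    std::printf("%llx\n", (unsigned long long)v); // 8877665544332211 on a little endian host
    return 0;
}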
Quick answer
#include <endian.h> // __BYTE_ORDER __LITTLE_ENDIAN
#include <byteswap.h> // bswap_64()
uint64_t value = 0x1122334455667788;
#if __BYTE_ORDER == __LITTLE_ENDIAN
value = bswap_64(value); // Compiler builtin GCC/Clang
#endif
Header file
As reported by zhaorufei (see their comment), endian.h is not a standard C++ header, and the macros __BYTE_ORDER and __LITTLE_ENDIAN may be undefined. The #if statement is therefore not predictable, because undefined macros are treated as 0.
Please edit this answer if you want to share your C++ elegant trick to detect endianness.
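(Not from the original answer, but one option: since C++20, the <bit> header provides std::endian, which turns the detection into a compile-time constant.)

#include <bit> // C++20: std::endian

// Compile-time endianness check, no macros involved.
constexpr bool host_is_little_endian = (std::endian::native == std::endian::little);

static_assert(std::endian::native == std::endian::little
           || std::endian::native == std::endian::big,
              "mixed endian platforms need special handling");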
Portability
Moreover, the macro bswap_64() is available with the GCC and Clang compilers but not with the Visual C++ compiler. To write portable source code, you may take inspiration from the following snippet:
#ifdef _MSC_VER
#include <stdlib.h>
#define bswap_16(x) _byteswap_ushort(x)
#define bswap_32(x) _byteswap_ulong(x)
#define bswap_64(x) _byteswap_uint64(x)
#else
#include <byteswap.h> // bswap_16 bswap_32 bswap_64
#endif
See also a more portable source code: Cross-platform _byteswap_uint64
C++14 constexpr template function
Generic hton() for 16 bits, 32 bits, 64 bits and more...
#include <endian.h>  // __BYTE_ORDER __LITTLE_ENDIAN
#include <algorithm> // std::reverse()

template <typename T>
constexpr T htonT (T value) noexcept
{
#if __BYTE_ORDER == __LITTLE_ENDIAN
    char* ptr = reinterpret_cast<char*>(&value);
    std::reverse(ptr, ptr + sizeof(T));
#endif
    return value;
}
C++11 constexpr template function
C++11 does not permit local variables in constexpr functions, so the trick is to use an argument with a default value.
Moreover, a C++11 constexpr function must consist of a single expression, so the body is a single return statement containing comma-separated expressions.
template <typename T>
constexpr T htonT (T value, char* ptr = 0) noexcept
{
    return
#if __BYTE_ORDER == __LITTLE_ENDIAN
        ptr = reinterpret_cast<char*>(&value),
        std::reverse(ptr, ptr + sizeof(T)),
#endif
        value;
}
No compilation warnings with either clang-3.5 or GCC-4.9 using -Wall -Wextra -pedantic
(see the compilation and run output on coliru).
C++11 constexpr template SFINAE functions
However, the above version does not allow creating a constexpr variable such as:
constexpr int32_t hton_six = htonT( int32_t(6) );
Finally, we need to separate (specialize) the functions for 16-, 32-, and 64-bit values, while still keeping them generic.
(see the full snippet on coliru)
The C++11 snippet below uses the trait std::enable_if to exploit Substitution Failure Is Not An Error (SFINAE).
template <typename T>
constexpr typename std::enable_if<sizeof(T) == 2, T>::type
htonT (T value) noexcept
{
    return ((value & 0x00FF) << 8)
         | ((value & 0xFF00) >> 8);
}

template <typename T>
constexpr typename std::enable_if<sizeof(T) == 4, T>::type
htonT (T value) noexcept
{
    return ((value & 0x000000FF) << 24)
         | ((value & 0x0000FF00) <<  8)
         | ((value & 0x00FF0000) >>  8)
         | ((value & 0xFF000000) >> 24);
}

template <typename T>
constexpr typename std::enable_if<sizeof(T) == 8, T>::type
htonT (T value) noexcept
{
    return ((value & 0xFF00000000000000ull) >> 56)
         | ((value & 0x00FF000000000000ull) >> 40)
         | ((value & 0x0000FF0000000000ull) >> 24)
         | ((value & 0x000000FF00000000ull) >>  8)
         | ((value & 0x00000000FF000000ull) <<  8)
         | ((value & 0x0000000000FF0000ull) << 24)
         | ((value & 0x000000000000FF00ull) << 40)
         | ((value & 0x00000000000000FFull) << 56);
}
Or an even shorter version, based on built-in compiler macros and the C++14 alias std::enable_if_t<xxx>, a shortcut for typename std::enable_if<xxx>::type:
template <typename T>
constexpr std::enable_if_t<sizeof(T) == 2, T>
htonT (T value) noexcept
{
    return bswap_16(value); // __bswap_constant_16
}

template <typename T>
constexpr std::enable_if_t<sizeof(T) == 4, T>
htonT (T value) noexcept
{
    return bswap_32(value); // __bswap_constant_32
}

template <typename T>
constexpr std::enable_if_t<sizeof(T) == 8, T>
htonT (T value) noexcept
{
    return bswap_64(value); // __bswap_constant_64
}
Test code of the first version
std::uint8_t uc = 'B'; std::cout <<std::setw(16)<< uc <<'\n';
uc = htonT( uc ); std::cout <<std::setw(16)<< uc <<'\n';
std::uint16_t us = 0x1122; std::cout <<std::setw(16)<< us <<'\n';
us = htonT( us ); std::cout <<std::setw(16)<< us <<'\n';
std::uint32_t ul = 0x11223344; std::cout <<std::setw(16)<< ul <<'\n';
ul = htonT( ul ); std::cout <<std::setw(16)<< ul <<'\n';
std::uint64_t uL = 0x1122334455667788; std::cout <<std::setw(16)<< uL <<'\n';
uL = htonT( uL ); std::cout <<std::setw(16)<< uL <<'\n';
Test code of the second version
constexpr uint8_t a1 = 'B'; std::cout<<std::setw(16)<<a1<<'\n';
constexpr auto b1 = htonT(a1); std::cout<<std::setw(16)<<b1<<'\n';
constexpr uint16_t a2 = 0x1122; std::cout<<std::setw(16)<<a2<<'\n';
constexpr auto b2 = htonT(a2); std::cout<<std::setw(16)<<b2<<'\n';
constexpr uint32_t a4 = 0x11223344; std::cout<<std::setw(16)<<a4<<'\n';
constexpr auto b4 = htonT(a4); std::cout<<std::setw(16)<<b4<<'\n';
constexpr uint64_t a8 = 0x1122334455667788;std::cout<<std::setw(16)<<a8<<'\n';
constexpr auto b8 = htonT(a8); std::cout<<std::setw(16)<<b8<<'\n';
Output
B
B
1122
2211
11223344
44332211
1122334455667788
8877665544332211
Code generation
The online compiler at gcc.godbolt.org shows the generated code.
g++-4.9.2 -std=c++14 -O3
std::enable_if<(sizeof (unsigned char))==(1), unsigned char>::type htonT<unsigned char>(unsigned char):
    movl %edi, %eax
    ret
std::enable_if<(sizeof (unsigned short))==(2), unsigned short>::type htonT<unsigned short>(unsigned short):
    movl %edi, %eax
    rolw $8, %ax
    ret
std::enable_if<(sizeof (unsigned int))==(4), unsigned int>::type htonT<unsigned int>(unsigned int):
    movl %edi, %eax
    bswap %eax
    ret
std::enable_if<(sizeof (unsigned long))==(8), unsigned long>::type htonT<unsigned long>(unsigned long):
    movq %rdi, %rax
    bswap %rax
    ret
clang++-3.5.1 -std=c++14 -O3
std::enable_if<(sizeof (unsigned char))==(1), unsigned char>::type htonT<unsigned char>(unsigned char):
    movl %edi, %eax
    retq
std::enable_if<(sizeof (unsigned short))==(2), unsigned short>::type htonT<unsigned short>(unsigned short):
    rolw $8, %di
    movzwl %di, %eax
    retq
std::enable_if<(sizeof (unsigned int))==(4), unsigned int>::type htonT<unsigned int>(unsigned int):
    bswapl %edi
    movl %edi, %eax
    retq
std::enable_if<(sizeof (unsigned long))==(8), unsigned long>::type htonT<unsigned long>(unsigned long):
    bswapq %rdi
    movq %rdi, %rax
    retq
Note: my original answer was not C++11-constexpr compliant.
This answer is in Public Domain CC0 1.0 Universal
To detect your endian-ness, use the following union:
union {
    unsigned long long ull;
    char c[8];
} x;
x.ull = 0x0123456789abcdef; // may need special suffix for ULL.
Then you can check the contents of x.c[] to detect where each byte went.
To do the conversion, I would use that detection code once to see what endian-ness the platform is using, then write my own function to do the swaps.
You could make it dynamic so that the code will run on any platform (detect once then use a switch inside your conversion code to choose the right conversion) but, if you're only going to be using one platform, I'd just do the detection once in a separate program then code up a simple conversion routine, making sure you document that it only runs (or has been tested) on that platform.
Here's some sample code I whipped up to illustrate it. It's been tested though not in a thorough manner, but should be enough to get you started.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TYP_INIT 0
#define TYP_SMLE 1
#define TYP_BIGE 2

static unsigned long long cvt(unsigned long long src) {
    static int typ = TYP_INIT;
    unsigned char c;
    union {
        unsigned long long ull;
        unsigned char c[8];
    } x;

    if (typ == TYP_INIT) {
        x.ull = 0x01;
        typ = (x.c[7] == 0x01) ? TYP_BIGE : TYP_SMLE;
    }

    /* Big endian is already network order: nothing to do. */
    if (typ == TYP_BIGE)
        return src;

    x.ull = src;
    c = x.c[0]; x.c[0] = x.c[7]; x.c[7] = c;
    c = x.c[1]; x.c[1] = x.c[6]; x.c[6] = c;
    c = x.c[2]; x.c[2] = x.c[5]; x.c[5] = c;
    c = x.c[3]; x.c[3] = x.c[4]; x.c[4] = c;
    return x.ull;
}

int main (void) {
    unsigned long long ull = 1;
    ull = cvt (ull);
    printf ("%llu\n", ull);
    return 0;
}
Keep in mind that this just checks for pure big/little endian. If you have some weird variant where the bytes are stored in, for example, {5,2,3,1,0,7,6,4} order, cvt() will be a tad more complex. Such an architecture doesn't deserve to exist, but I'm not discounting the lunacy of our friends in the microprocessor industry :-)
Also keep in mind that this is technically undefined behaviour in C++ (C99 and later explicitly allow type-punning through a union), as you're not supposed to read a union member other than the one last written. It will probably work with most implementations but, from the purist point of view, you should probably just bite the bullet and use macros to define your own routines, something like:
// Assumes 64-bit unsigned long long.
unsigned long long switchOrderFn (unsigned long long in) {
    in = (in & 0xff00000000000000ULL) >> 56
       | (in & 0x00ff000000000000ULL) >> 40
       | (in & 0x0000ff0000000000ULL) >> 24
       | (in & 0x000000ff00000000ULL) >>  8
       | (in & 0x00000000ff000000ULL) <<  8
       | (in & 0x0000000000ff0000ULL) << 24
       | (in & 0x000000000000ff00ULL) << 40
       | (in & 0x00000000000000ffULL) << 56;
    return in;
}
#ifdef ULONG_IS_NET_ORDER
#define switchOrder(n) (n)
#else
#define switchOrder(n) switchOrderFn(n)
#endif
Some BSD systems have betoh64, which does what you need.
A one-line macro for a 64-bit swap on little endian machines:
#define bswap64(y) (((uint64_t)ntohl((uint32_t)(y))) << 32 | ntohl((uint32_t)((y) >> 32)))
How about a generic version that doesn't depend on the input size? (Some of the implementations above assume that unsigned long long is 64 bits, which is not necessarily always true.)
// Converts an arbitrarily large integer (preferably >= 64 bits) from big endian to host machine endian.
template<typename T> static inline T bigen2host(const T& x)
{
    static const int one = 1;
    static const char sig = *(char*)&one;
    if (sig == 0) return x; // for a big endian machine, just return the input

    T ret;
    int size = sizeof(T);
    char* src = (char*)&x + sizeof(T) - 1;
    char* dst = (char*)&ret;
    while (size-- > 0) *dst++ = *src--;
    return ret;
}
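For illustration, a hypothetical caller (the function name and packet layout are made up for this sketch): decoding a big-endian 64-bit field from a buffer.

#include <cstdint>
#include <cstring>

uint64_t read_be64_field(const unsigned char* packet)
{
    uint64_t raw;
    std::memcpy(&raw, packet, sizeof raw); // raw now holds the big endian bytes
    return bigen2host(raw);                // identity on big endian hosts, byte swap otherwise
}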
uint16_t SwapShort(uint16_t a)
{
    a = ((a & 0x00FF) << 8) | ((a & 0xFF00) >> 8);
    return a;
}

uint32_t SwapWord(uint32_t a)
{
    a = ((a & 0x000000FF) << 24) |
        ((a & 0x0000FF00) <<  8) |
        ((a & 0x00FF0000) >>  8) |
        ((a & 0xFF000000) >> 24);
    return a;
}

uint64_t SwapDWord(uint64_t a)
{
    a = ((a & 0x00000000000000FFULL) << 56) |
        ((a & 0x000000000000FF00ULL) << 40) |
        ((a & 0x0000000000FF0000ULL) << 24) |
        ((a & 0x00000000FF000000ULL) <<  8) |
        ((a & 0x000000FF00000000ULL) >>  8) |
        ((a & 0x0000FF0000000000ULL) >> 24) |
        ((a & 0x00FF000000000000ULL) >> 40) |
        ((a & 0xFF00000000000000ULL) >> 56);
    return a;
}
How about:
#define ntohll(x) ( ( (uint64_t)(ntohl( (uint32_t)(((x) << 32) >> 32) )) << 32) | \
                    ntohl( ((uint32_t)((x) >> 32)) ) )
#define htonll(x) ntohll(x)
I like the union answer, pretty neat. Typically I just bit shift to convert between little and big endian, although I think the union solution has fewer assignments and may be faster:
// Note: UINT64_C_LITERAL is a macro that appends the correct suffix
// for the literal on that platform.
inline void endianFlip(unsigned long long& Value)
{
    Value =
        ((Value & UINT64_C_LITERAL(0x00000000000000FF)) << 56) |
        ((Value & UINT64_C_LITERAL(0x000000000000FF00)) << 40) |
        ((Value & UINT64_C_LITERAL(0x0000000000FF0000)) << 24) |
        ((Value & UINT64_C_LITERAL(0x00000000FF000000)) <<  8) |
        ((Value & UINT64_C_LITERAL(0x000000FF00000000)) >>  8) |
        ((Value & UINT64_C_LITERAL(0x0000FF0000000000)) >> 24) |
        ((Value & UINT64_C_LITERAL(0x00FF000000000000)) >> 40) |
        ((Value & UINT64_C_LITERAL(0xFF00000000000000)) >> 56);
}
Then, to detect whether you even need to do the flip without macro magic, you can do a similar thing to Pax's answer: when a short is assigned 0x0001, it will read back as 0x0100 on an opposite-endian system.
So:
unsigned long long numberToSystemEndian
(
    unsigned long long In,
    unsigned short SourceEndian
)
{
    if (SourceEndian != 1)
    {
        // from an opposite endian system
        endianFlip(In);
    }
    return In;
}
So to use this, you'd need SourceEndian to be an indicator to communicate the endianness of the input number. This could be stored in the file (if this is a serialization problem), or communicated over the network (if it's a network serialization issue).
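For instance, the sender side might emit the marker like this (a minimal sketch under the answer's convention that the marker is the value 1 written in the sender's native byte order; the function name is made up):

#include <cstddef>
#include <cstring>

std::size_t writeWithEndianMarker(unsigned char* buffer, unsigned long long payload)
{
    unsigned short marker = 1; // reads back as 0x0100 on an opposite-endian receiver
    std::memcpy(buffer, &marker, sizeof marker);
    std::memcpy(buffer + sizeof marker, &payload, sizeof payload);
    return sizeof marker + sizeof payload; // receiver calls numberToSystemEndian(payload, marker)
}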
An easy way would be to use ntohl on the two halves separately:
unsigned long long htonll(unsigned long long v) {
    // uint32_t rather than unsigned long: on LP64 platforms unsigned long is
    // 64 bits, which would break the layout of this union.
    union { uint32_t lv[2]; unsigned long long llv; } u;
    u.lv[0] = htonl(v >> 32);
    u.lv[1] = htonl(v & 0xFFFFFFFFULL);
    return u.llv;
}

unsigned long long ntohll(unsigned long long v) {
    union { uint32_t lv[2]; unsigned long long llv; } u;
    u.llv = v;
    return ((unsigned long long)ntohl(u.lv[0]) << 32) | (unsigned long long)ntohl(u.lv[1]);
}
htonll can be done with the steps below:

1. If it is a big endian system, return the value directly. No conversion is needed. If it is a little endian system, do the conversion below.
2. Take the LSB 32 bits, apply htonl, and shift the result left 32 bits.
3. Take the MSB 32 bits (by shifting the uint64_t value right 32 bits) and apply htonl.
4. Bitwise-OR the values obtained in steps 2 and 3.

Similarly for ntohll:
#define HTONLL(x) ((1==htonl(1)) ? (x) : (((uint64_t)htonl((x) & 0xFFFFFFFFUL)) << 32) | htonl((uint32_t)((x) >> 32)))
#define NTOHLL(x) ((1==ntohl(1)) ? (x) : (((uint64_t)ntohl((x) & 0xFFFFFFFFUL)) << 32) | ntohl((uint32_t)((x) >> 32)))
You can declare the above two definitions as functions as well.
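For example, as inline functions (a sketch of the same logic; it assumes htonl comes from <arpa/inet.h> as on POSIX systems, and the _fn suffix is mine, to avoid clashing with the macros above):

#include <stdint.h>
#include <arpa/inet.h>

static inline uint64_t htonll_fn(uint64_t x)
{
    if (htonl(1) == 1)
        return x; // big endian host: already in network order
    return (((uint64_t)htonl((uint32_t)(x & 0xFFFFFFFFUL))) << 32) | htonl((uint32_t)(x >> 32));
}

static inline uint64_t ntohll_fn(uint64_t x)
{
    return htonll_fn(x); // the swap is its own inverse
}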
#include <algorithm> // std::reverse
#include <climits>   // CHAR_BIT

template <typename T>
static T ntoh_any(T t)
{
    static const unsigned char int_bytes[sizeof(int)] = {0xFF};
    static const int msb_0xFF = 0xFF << (sizeof(int) - 1) * CHAR_BIT;
    static bool host_is_big_endian = (*(reinterpret_cast<const int *>(int_bytes)) & msb_0xFF) != 0;
    if (host_is_big_endian) { return t; }

    unsigned char * ptr = reinterpret_cast<unsigned char *>(&t);
    std::reverse(ptr, ptr + sizeof(t));
    return t;
}
Works for 2 bytes, 4 bytes, 8 bytes, and 16 bytes (if you have a 128-bit integer). Should be OS/platform independent.
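A quick usage sketch (assuming the ntoh_any template above is in scope; the commented value assumes a little endian host):

#include <cstdint>
#include <cstdio>

int main()
{
    std::uint16_t s = ntoh_any<std::uint16_t>(0x1122); // 0x2211 on a little endian host
    std::uint64_t q = ntoh_any<std::uint64_t>(0x1122334455667788ULL);
    std::printf("%x %llx\n", s, (unsigned long long)q);
    return 0;
}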
This assumes you are coding on Linux with a 64-bit OS; most systems have htole64(x), htobe64(x), etc. These are typically macros around the various bswaps.
#include <endian.h>
#include <byteswap.h>

unsigned long long htonll(unsigned long long val)
{
    if (__BYTE_ORDER == __BIG_ENDIAN) return (val);
    else return __bswap_64(val);
}

unsigned long long ntohll(unsigned long long val)
{
    if (__BYTE_ORDER == __BIG_ENDIAN) return (val);
    else return __bswap_64(val);
}
Side note: these are just functions you call to swap the byte ordering. If you are on a little endian host talking to a big endian network, they do what you want; but if the host is already big endian, an unconditional swap would unnecessarily reverse the byte ordering, which is why the little "__BYTE_ORDER == __BIG_ENDIAN" check is there. It makes the code more portable, depending on your needs.
Update: Edited to show example of endian check
A universal function for any value size:
#include <cstddef> // size_t

template <typename T>
T swap_endian (T value)
{
    union {
        T src;
        unsigned char dst[sizeof(T)];
    } source, dest;

    source.src = value;
    for (size_t k = 0; k < sizeof(T); ++k)
        dest.dst[k] = source.dst[sizeof(T) - k - 1];
    return dest.src;
}
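A minimal usage sketch (assuming swap_endian above is in scope); note that it always swaps, regardless of host endianness:

#include <cstdint>
#include <cstdio>

int main()
{
    std::uint64_t v = swap_endian<std::uint64_t>(0x1122334455667788ULL);
    std::printf("%llx\n", (unsigned long long)v); // 8877665544332211
    return 0;
}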
union help64
{
    unsigned char byte[8];
    uint64_t quad;
};

uint64_t ntoh64(uint64_t src)
{
    help64 tmp;
    tmp.quad = src;
    uint64_t dst = 0;
    for (int i = 0; i < 8; ++i)
        dst = (dst << 8) + tmp.byte[i];
    return dst;
}
It isn't in general necessary to know the endianness of a machine to convert a host integer into network order. Unfortunately that only holds if you write out your net-order value in bytes, rather than as another integer:
static inline void short_to_network_order(unsigned char *output, uint16_t in)
{
    output[0] = (in >> 8) & 0xff;
    output[1] = in & 0xff;
}
(extend as required for larger numbers).
This will (a) work on any architecture, because at no point do I use special knowledge about the way an integer is laid out in memory and (b) should mostly optimise away in big-endian architectures because modern compilers aren't stupid.
The disadvantage is, of course, that this is not the same, standard interface as htonl() and friends (which I don't see as a disadvantage, because the design of htonl() was a poor choice imo).
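For a 64-bit value, the extension mentioned above might look like this (a sketch following the same pattern; the function name is my own):

#include <stdint.h>

static inline void longlong_to_network_order(unsigned char *output, uint64_t in)
{
    // Peel off bytes from most significant to least significant.
    for (int i = 0; i < 8; i++)
        output[i] = (in >> (56 - 8 * i)) & 0xff;
}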
Related
I need to extract Unicode strings from a PE file. While extracting, I need to detect the encoding first. For UTF-8 characters I used the following link - How to easily detect utf8 encoding in the string?. Is there any similar way to detect UTF-16 characters? I have tried the following code. Is this right? Please help or provide suggestions. Thanks in advance!!!
BYTE temp1 = buf[offset];
BYTE temp2 = buf[offset+1];
while (!(temp1 == 0x00 && temp2 == 0x00) && offset <= bufSize)
{
    if ((temp1 >= 0x00 && temp1 <= 0xFF) && (temp2 >= 0x00 && temp2 <= 0xFF))
    {
        tmp += 2;
    }
    else
    {
        break;
    }
    offset += 2;
    temp1 = buf[offset];
    temp2 = buf[offset+1];
    if (temp1 == 0x00 && temp2 == 0x00)
    {
        break;
    }
}
I just implemented a function for you, DecodeUtf16Char(). Basically it can do two things: either just check whether the input is valid UTF-16 (when check_only = true), or check and return the decoded Unicode code point (32-bit). It also supports either big endian (the default, when big_endian = true) or little endian (big_endian = false) byte order within each two-byte UTF-16 word. bad_skip is the number of bytes to skip when a character fails to decode (invalid UTF-16); bad_value is the value used to signal that the UTF-16 couldn't be decoded (was invalid), -1 by default.
Examples of usage/tests are included after the function definition. Basically you just pass a starting pointer (ptr) and an ending pointer to this function and check the return value: if it is -1, there was an invalid UTF-16 sequence at the starting pointer; otherwise the return value contains a valid 32-bit Unicode code point. The function also advances ptr, by the number of decoded bytes for valid UTF-16, or by bad_skip bytes if the input was invalid.
The function should be very fast, because it contains only a few ifs (plus a bit of arithmetic when you ask it to actually decode chars). Always place it in a header so it gets inlined into the calling function, producing very fast code. Also pass only compile-time constants for check_only and big_endian; this lets the C++ optimizer remove the extra decoding code.
If, for example, you just want to detect long runs of UTF-16 bytes, you iterate in a loop calling this function: the first position where it returns something other than -1 is a possible beginning of text, and the last non -1 value marks the end of the text. It is important to pass bad_skip = 1 when searching for UTF-16 bytes, because a valid char may start at any byte.
For testing I used a variety of characters: English ASCII, Russian chars (two-byte UTF-16), plus two 4-byte chars (two UTF-16 words each). My tests append each converted line to the file test.txt; this file is UTF-8 encoded so it is easily viewable, e.g. in Notepad. All of the code after the decoding function is just testing code and is not needed for decoding.
The decoder needs two functions: _DecodeUtf16Char_ReadWord() (a helper) plus DecodeUtf16Char() (the main decoder). I include only one standard header, <cstdint>; if you're not allowed to include anything, just define uint8_t, uint16_t, and uint32_t yourself, as those are the only type definitions I use from that header.
Also, for reference, see my other post, which implements all types of conversions between UTF-8 <-> UTF-16 <-> UTF-32, both from scratch and using the standard C++ library!
Try it online!
#include <cstdint>

static inline bool _DecodeUtf16Char_ReadWord(
    uint8_t const * & ptrc, uint8_t const * end,
    uint16_t & r, bool const big_endian
) {
    if (ptrc + 1 >= end) {
        // No data left.
        if (ptrc < end)
            ++ptrc;
        return false;
    }
    if (big_endian) {
        r  = uint16_t(*ptrc) << 8; ++ptrc;
        r |= uint16_t(*ptrc)     ; ++ptrc;
    } else {
        r  = uint16_t(*ptrc)     ; ++ptrc;
        r |= uint16_t(*ptrc) << 8; ++ptrc;
    }
    return true;
}

static inline uint32_t DecodeUtf16Char(
    uint8_t const * & ptr, uint8_t const * end,
    bool const check_only = true, bool const big_endian = true,
    uint32_t const bad_skip = 1, uint32_t const bad_value = -1
) {
    auto ptrs = ptr, ptrc = ptr;
    uint32_t c = 0;
    uint16_t v = 0;
    if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
        // No data left.
        c = bad_value;
    } else if (v < 0xD800 || v > 0xDFFF) {
        // Correct single-word symbol.
        if (!check_only)
            c = v;
    } else if (v >= 0xDC00) {
        // Unallowed UTF-16 sequence!
        c = bad_value;
    } else { // Possibly double-word sequence.
        if (!check_only)
            c = (v & 0x3FF) << 10;
        if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
            // No data left.
            c = bad_value;
        } else if ((v < 0xDC00) || (v > 0xDFFF)) {
            // Unallowed UTF-16 sequence!
            c = bad_value;
        } else {
            // Correct double-word symbol.
            if (!check_only) {
                c |= v & 0x3FF;
                c += 0x10000;
            }
        }
    }
    if (c == bad_value)
        ptr = ptrs + bad_skip; // Skip bytes.
    else
        ptr = ptrc; // Skip all eaten bytes.
    return c;
}
// --------- The code below is only for testing and is not needed for decoding ------------

#include <iostream>
#include <string>
#include <codecvt>
#include <fstream>
#include <locale>

static std::u32string DecodeUtf16Bytes(uint8_t const * ptr, uint8_t const * end) {
    std::u32string res;
    while (true) {
        if (ptr >= end)
            break;
        uint32_t c = DecodeUtf16Char(ptr, end, false, false, 2);
        if (c != uint32_t(-1))
            res.append(1, c);
    }
    return res;
}

#if (!_DLL) && (_MSC_VER >= 1900 /* VS 2015 */) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif

template <typename CharT = char>
static std::basic_string<CharT> U32ToU8(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv;
    auto res = utf_8_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return res;
}

template <typename WCharT = wchar_t>
static std::basic_string<WCharT> U32ToU16(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffffUL, std::little_endian>, char32_t> utf_16_32_conv;
    auto res = utf_16_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return std::basic_string<WCharT>((WCharT*)(res.c_str()), (WCharT*)(res.c_str() + res.length()));
}

template <typename StrT>
void OutputString(StrT const & s) {
    std::ofstream f("test.txt", std::ios::binary | std::ios::app);
    f.write((char*)s.c_str(), size_t((uint8_t*)(s.c_str() + s.length()) - (uint8_t*)s.c_str()));
    f.write("\n\x00", sizeof(s.c_str()[0]));
}

int main() {
    std::u16string a = u"привет|мир|hello|𐐷|world|𤭢|again|русский|english";
    *((uint8_t*)(a.data() + 12) + 1) = 0xDD; // Introduce a bad utf-16 byte.
    // Also truncate by 1 byte ("... - 1" in the next line).
    OutputString(U32ToU8(DecodeUtf16Bytes((uint8_t*)a.c_str(), (uint8_t*)(a.c_str() + a.length()) - 1)));
    return 0;
}
Output:
привет|мир|hllo|𐐷|world|𤭢|again|русский|englis
For many purposes, short strings/char arrays packed into an unsigned 32-bit integer are pretty useful, since they can be compared at one go with a simple integer comparison and be used in switch statements, while still maintaining a bit of human readability.
The most common way to convert these short strings to 32-bit integers is to shift/or:
#include <stddef.h> // size_t
#include <stdint.h>

uint32_t quadchar( const char* _str )
{
    uint32_t result = 0;
    for( size_t i = 0; i < 4; i++ )
    {
        if( _str[i] == 0 )
            return result;
        result = (result << 8) | _str[i];
    }
    return result;
}
Strings that are too long are truncated.
So far so good, but this has to be done at runtime, which costs a bit of time.
Would it also be possible to do this at compile time?
There is no need for a detail helper function: you can use default values.
And there is no need for a double ternary operator: you can do it all with a single test.
std::uint32_t inline constexpr quadchar (char const * input,
                                         std::size_t idx = 0U,
                                         std::uint32_t result = 0U)
{
    return (idx < 4U) && *input
        ? quadchar(input+1, idx+1U, (result << 8) | *input)
        : result;
}
But, to make it a little more portable and generic, I suggest:
1) use sizeof(I) instead of 4 for the idx limit
2) use CHAR_BIT instead of 8 for the result shift (remember to include <climits>)
3) use a template type (defaulted to std::uint32_t, if you want) for the result type.
Something like
template <typename I = std::uint32_t>
constexpr inline I ichar (char const * input,
                          I result = 0U,
                          std::size_t idx = 0U)
{
    // Note the argument order in the recursive call matches the signature:
    // result comes before idx.
    return (idx < sizeof(I)) && *input
        ? ichar(input+1, (result << CHAR_BIT) | *input, idx+1U)
        : result;
}
that you can call
constexpr auto u32 = ichar(ptr);
when you want a std::uint32_t, or (for example)
constexpr auto u64 = ichar<std::uint64_t>(ptr);
for other returned types.
As of C++11, it's possible to do this at compile time with zero runtime cost, using the constexpr specifier.
namespace Internal
{
    uint32_t inline constexpr quadchar( char const *_input,
                                        uint8_t _idx, uint32_t _result )
    {
        return _idx == 4 ? _result
             : *_input   ? quadchar( _input+1, _idx + 1, (_result << 8) | *_input )
             : _result;
    }
}

uint32_t inline constexpr quadchar( char const *_input ) {
    return Internal::quadchar( _input, 0, 0 );
}
I have placed the implementation overload in an internal namespace to hide it from the user. The syntax is not as nice as in the runtime example above, since you can't use if in a C++11 constexpr function, but I think it's worth it.
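For illustration, a minimal usage sketch (the switch context and expected constant are my own; the constant assumes ASCII): because the function is constexpr, the packed value can even be checked at compile time.

#include <cstdint>

static_assert(quadchar("ABCD") == 0x41424344u, "packed at compile time");

void dispatch(std::uint32_t tag)
{
    switch (tag) {
    case quadchar("WAVE"): /* handle a WAVE chunk */ break;
    case quadchar("fmt "): /* handle an fmt chunk */ break;
    }
}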
I have a class that facilitates encoding/decoding raw memory. I ultimately store a void pointer to point to the memory and the number of bytes being referenced. I'm concerned about aliasing issues as well as the bit-shifting operations to get the encoding correct. Essentially, for WHAT_TYPE should I use char, unsigned char, int8_t, uint8_t, int_fast8_t, uint_fast8_t, int_least8_t, or uint_least8_t? Is there a definitive answer within the spec?
class sample_buffer {
    size_t index; // For illustrative purposes
    void *memory;
    size_t num_bytes;
public:
    sample_buffer(size_t n) :
        index(0),
        memory(malloc(n)),
        num_bytes(memory == nullptr ? 0 : n) {
    }

    ~sample_buffer() {
        if (memory != nullptr) free(memory);
    }

    void put(uint32_t const value) {
        WHAT_TYPE *bytes = static_cast<WHAT_TYPE *>(memory);
        bytes[index] = value >> 24;
        bytes[index + 1] = (value >> 16) & 0xFF;
        bytes[index + 2] = (value >> 8) & 0xFF;
        bytes[index + 3] = value & 0xFF;
        index += 4;
    }

    void read(uint32_t &value) {
        WHAT_TYPE const *bytes = static_cast<WHAT_TYPE const *>(memory);
        value = (static_cast<uint32_t>(bytes[index]) << 24) |
                (static_cast<uint32_t>(bytes[index + 1]) << 16) |
                (static_cast<uint32_t>(bytes[index + 2]) << 8) |
                static_cast<uint32_t>(bytes[index + 3]);
        index += 4;
    }
};
In C++17: std::byte. This type was created precisely for this purpose and conveys all the right semantic meaning. Moreover, it has all the operators you would need on raw data (like the << in your example), but none of the operators that you wouldn't.
Before C++17: unsigned char. The standard defines object representation as a sequence of unsigned char, so it's just a good type to use. Furthermore, as Mooing Duck rightly suggests, using unsigned char* would prevent many bugs caused by mistakenly using your char* that refers to raw bytes as if it were a string and passing it into a function like strlen.
If you really cannot use unsigned char, then you should use char. Both unsigned char and char are the types you're allowed to alias through, so either is preferable to any of the other integer types.
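For illustration, here is how the question's put() logic might look as a free function over std::byte (C++17); the function name is mine:

#include <cstddef> // std::byte
#include <cstdint>

// Write value into the buffer in big endian order.
void put_u32(std::byte *bytes, std::size_t index, std::uint32_t value)
{
    bytes[index]     = static_cast<std::byte>(value >> 24);
    bytes[index + 1] = static_cast<std::byte>((value >> 16) & 0xFF);
    bytes[index + 2] = static_cast<std::byte>((value >> 8) & 0xFF);
    bytes[index + 3] = static_cast<std::byte>(value & 0xFF);
}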
1) I have a big buffer
2) I have a lot of variables of almost every type
I use this buffer to send to multiple destinations, with different byte orders.
When I send to network byte order, I usually use htons or htonl plus a customized function for specific data types.
So, my issue: every time I construct the buffer, I change the byte order for each variable and then memcpy it. Does anyone know a better way? I was wishing for an efficient memcpy with a specific intended byte order.
An example:
UINT32 dwordData = 0x01234567;
UINT32 dwordTmp = htonl(dwordData);
memcpy(&buffer[loc], &dwordTmp, sizeof(UINT32));
loc += sizeof(UINT32);
(This is just an example I randomly wrote, btw.)
I hope for a function that looks like:
memcpyToNetwork(&buffer[loc], &dwordData, sizeof(UINT32));
If you know what I mean; the naming is just descriptive. Depending on the data type, it would handle the byte order for that specific data type, so I don't have to keep changing orders manually or keep a temp variable to copy to, saving the double copy.
There is no standard solution, but it is fairly easy to write yourself.
Off the top of my head, an outline could look like this:
// Macro to be able to switch easily between encodings. Just for convenience.
#define WriteBuffer WriteBufferBE

// Generic templates as interface specification. Not implemented themselves.
// Take a buffer (of sufficient size) and a value, return the number of bytes written.
template <typename T>
size_t WriteBufferBE(char* buffer, const T& value);

template <typename T>
size_t WriteBufferLE(char* buffer, const T& value);

// Specializations for specific types
template <>
size_t WriteBufferBE(char* buffer, const UINT32& value)
{
    buffer[0] = (value >> 24) & 0xFF;
    buffer[1] = (value >> 16) & 0xFF;
    buffer[2] = (value >> 8) & 0xFF;
    buffer[3] = (value) & 0xFF;
    return 4;
}

template <>
size_t WriteBufferBE(char* buffer, const UINT16& value)
{
    buffer[0] = (value >> 8) & 0xFF;
    buffer[1] = (value) & 0xFF;
    return 2;
}

template <>
size_t WriteBufferLE(char* buffer, const UINT32& value)
{
    buffer[0] = (value) & 0xFF;
    buffer[1] = (value >> 8) & 0xFF;
    buffer[2] = (value >> 16) & 0xFF;
    buffer[3] = (value >> 24) & 0xFF;
    return 4;
}

template <>
size_t WriteBufferLE(char* buffer, const UINT16& value)
{
    buffer[0] = (value) & 0xFF;
    buffer[1] = (value >> 8) & 0xFF;
    return 2;
}

// Other types left as an exercise. Can use the existing functions!

// Usage:
loc += WriteBuffer(&buffer[loc], dwordData);
Background
When designing binary file formats, it's generally recommended to write integers in network byte order. For that, there are macros like htonl(). But for a format such as WAV, the little endian format is actually used.
Question
How do you portably write little endian values, regardless of whether the CPU your code runs on is a big endian or little endian architecture? (Ideas: can the standard macros ntohl() and htonl() be used "in reverse" somehow? Or should the code just test at runtime whether it's running on a little or big endian CPU and choose the appropriate code path?)
So the question is not really about file formats, file formats were just an example. It could be any kind of serialization where little endian "on the wire" is required, such as a (heretic) network protocol.
Warning: This only works on unsigned integers, because signed right shift is implementation defined and can lead to vulnerabilities (https://stackoverflow.com/a/7522498/395029)
C already provides an abstraction over the host's endianness: the number† or int†.
Producing output in a given endianness can be done portably by not trying to be clever: simply interpret the numbers as numbers and use bit shifts to extract each byte:
uint32_t value;
uint8_t lolo = (value >> 0) & 0xFF;
uint8_t lohi = (value >> 8) & 0xFF;
uint8_t hilo = (value >> 16) & 0xFF;
uint8_t hihi = (value >> 24) & 0xFF;
Then you just write the bytes in whatever order you desire.
When you are taking byte sequences with some endianness as input, you can reconstruct them in the host's endianness by again constructing numbers with bit operations:
uint32_t value = ((uint32_t)hihi << 24)  // casts avoid signed overflow after integer promotion
               | ((uint32_t)hilo << 16)
               | ((uint32_t)lohi << 8)
               | ((uint32_t)lolo << 0);
† Only the representations of numbers as byte sequences have endianness; numbers (i.e. quantities) don't.
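Putting this together for the little endian (WAV-style) case, a minimal sketch (the function names are mine):

#include <cstdint>
#include <cstdio>

// Write a 32-bit value in little endian order, regardless of host endianness.
void write_le32(std::FILE* f, std::uint32_t value)
{
    unsigned char bytes[4] = {
        static_cast<unsigned char>(value >> 0),
        static_cast<unsigned char>(value >> 8),
        static_cast<unsigned char>(value >> 16),
        static_cast<unsigned char>(value >> 24),
    };
    std::fwrite(bytes, 1, sizeof bytes, f);
}

// Read it back, reconstructing the value in host endianness.
std::uint32_t read_le32(std::FILE* f)
{
    unsigned char b[4] = {};
    std::fread(b, 1, sizeof b, f);
    return std::uint32_t(b[0])
         | (std::uint32_t(b[1]) << 8)
         | (std::uint32_t(b[2]) << 16)
         | (std::uint32_t(b[3]) << 24);
}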
Here's a template based version:
#include <cstdint>
#include <iostream>
#include <iomanip>

enum endianness_t {
    BIG,    // 0x44332211 => 0x44 0x33 0x22 0x11
    LITTLE, // 0x44332211 => 0x11 0x22 0x33 0x44
    UNKNOWN
};

const uint32_t test_value = 0x44332211;
const bool is_little_endian = (((char *)&test_value)[0] == 0x11) && (((char *)&test_value)[1] == 0x22);
const bool is_big_endian = (((char *)&test_value)[0] == 0x44) && (((char *)&test_value)[1] == 0x33);

const endianness_t endianness =
    is_big_endian ? BIG :
    (is_little_endian ? LITTLE : UNKNOWN);

template <typename T>
T identity(T v){
    return v;
}

// 16 bit values ------
uint16_t swap_(uint16_t v){
    return ((v & 0xFF) << 8) | ((v & 0xFF00) >> 8);
}

// 32 bit values ------
uint32_t swap_(uint32_t v){
    return ((v & 0xFF) << 24) | ((v & 0xFF00) << 8) | ((v & 0xFF0000) >> 8) | ((v & 0xFF000000) >> 24);
}

template <typename T, endianness_t HOST, endianness_t REMOTE>
struct en_swap{
    static T conv(T v){
        return swap_(v);
    }
};

template <typename T>
struct en_swap<T, BIG, BIG>{
    static T conv(T v){
        return v;
    }
};

template <typename T>
struct en_swap<T, LITTLE, LITTLE> {
    static T conv(T v){
        return v;
    }
};

template <typename T>
T to_big(T v) {
    switch (endianness){
    case LITTLE:
        return en_swap<T, LITTLE, BIG>::conv(v);
    case BIG:
        return en_swap<T, BIG, BIG>::conv(v);
    default: // UNKNOWN: no safe conversion possible, pass through
        return v;
    }
}

template <typename T>
T to_little(T v) {
    switch (endianness){
    case LITTLE:
        return en_swap<T, LITTLE, LITTLE>::conv(v);
    case BIG:
        return en_swap<T, BIG, LITTLE>::conv(v);
    default: // UNKNOWN: no safe conversion possible, pass through
        return v;
    }
}

int main(){
    using namespace std;
    uint32_t x = 0x0ABCDEF0;
    uint32_t y = to_big(x);
    uint32_t z = to_little(x);
    cout << hex << setw(8) << setfill('0') << x << " "
         << setw(8) << setfill('0') << y << " "
         << setw(8) << setfill('0') << z << endl;
}
In fact, the MSDN functions ntohl() and htonl() are the inverse of each other:
The htonl function converts a u_long from host to TCP/IP network byte
order (which is big-endian).
The ntohl function converts a u_long from TCP/IP network order to host
byte order (which is little-endian on Intel processors).
Yes, detecting endianness at runtime is a very sane thing to do, and basically what any ready-to-use macro/function would do at some point anyway.
And if you want to do little/big endian conversions yourself, see the answer by R. Martinho Fernandes above.