I am working on translating a system from python to c++. I need to be able to perform actions in c++ that are generally performed by using Python's struct.unpack (interpreting binary strings as numerical values). For integer values, I am able to get this to (sort of) work, using the data types in stdint.h:
struct.unpack("i", str) ==> *(int32_t*) str; //str is a char* containing the data
This works properly for little-endian binary strings, but fails on big-endian binary strings. Basically, I need an equivalent of the > specifier in struct.unpack:
struct.unpack(">i", str) ==> ???
Please note, if there is a better way to do this, I am all ears. However, I cannot use c++11, nor any 3rd party libraries other than Boost. I will also need to be able to interpret floats and doubles, as in struct.unpack(">f", str) and struct.unpack(">d", str), but I'll get to that when I solve this.
NOTE I should point out that the endianness of my machine is irrelevant in this case. I know that the bitstream I receive in my code will ALWAYS be big-endian, and that's why I need a solution that will always cover the big-endian case. The article pointed out by BoBTFish in the comments seems to offer a solution.
For 32 and 16-bit values:
This is exactly the problem you have with network data, which is big-endian. You can use ntohl to convert a 32-bit value from network order to host order, little-endian in your case.
The ntohl() function converts the unsigned integer netlong from network byte order to
host byte order.
int res = ntohl(*((int32_t*) str));
This will also take care of the case where your host is big-endian, in which case the conversion does nothing.
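Putting that together for struct.unpack(">i"), here is a minimal sketch (unpack_be32 is a name I made up; memcpy is used to sidestep the alignment and aliasing problems discussed further down):
#include <arpa/inet.h>  // ntohl; on Windows use <winsock2.h>
#include <stdint.h>
#include <cstring>

// Rough equivalent of struct.unpack(">i", str) for a 4-byte buffer.
int32_t unpack_be32(const char* str)
{
    uint32_t tmp;
    std::memcpy(&tmp, str, sizeof(tmp)); // safe even if str is unaligned
    return (int32_t) ntohl(tmp);         // big-endian (network) -> host
}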
For 64-bit values
Non-standardly, on Linux/BSD you can take a look at 64 bit ntohl() in C++?, which points to htobe64:
These functions convert the byte encoding of integer values from the byte order that
the current CPU (the "host") uses, to and from little-endian and big-endian byte
order.
For windows try: How do I convert between big-endian and little-endian values in C++?
which points to _byteswap_uint64, as well as a 16- and 32-bit solution and a gcc-specific __builtin_bswap(32/64) call.
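If you would rather avoid the platform-specific intrinsics entirely, a byte-at-a-time decode is portable everywhere (a sketch; unpack_be64 is a name of my choosing):
#include <stdint.h>

// Assemble a big-endian 64-bit value one byte at a time, so the
// host's byte order never enters the picture.
uint64_t unpack_be64(const unsigned char* p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v = (v << 8) | p[i];
    return v;
}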
Other Sizes
Most systems don't have values that aren't 16/32/64 bits long. If you do hit one, I might try to store it in a 64-bit value, shift it, and then translate. I'd write some good tests. I suspect it is an uncommon situation and more details would help.
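For example, a 24-bit big-endian field could be widened like this (a sketch, assuming the field is signed; unpack_be24 is a hypothetical name):
#include <stdint.h>

// Widen a 3-byte big-endian field into 32 bits, then sign-extend.
int32_t unpack_be24(const unsigned char* p)
{
    uint32_t v = ((uint32_t)p[0] << 16) | ((uint32_t)p[1] << 8) | p[2];
    if (v & 0x800000)     // sign bit of the 24-bit field set?
        v |= 0xFF000000;  // extend it into the top byte
    return (int32_t)v;
}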
Unpack the string one byte at a time.
unsigned char *str;
unsigned int result;
result  = (unsigned int)*str++ << 24; // cast keeps the shift well defined
result |= (unsigned int)*str++ << 16;
result |= (unsigned int)*str++ << 8;
result |= (unsigned int)*str++;
First, the cast you're doing:
char *str = ...;
int32_t i = *(int32_t*)str;
results in undefined behavior due to the strict aliasing rule (unless str was initialized with something like int32_t x; char *str = (char*)&x;). In practical terms, that cast can result in an unaligned read, which causes a bus error (a crash) on some platforms and hurts performance on others.
Instead you should be doing something like:
int32_t i;
std::memcpy(&i, str, sizeof(i));
There are a number of functions for swapping bytes between the host's native byte ordering and a host-independent ordering: ntoh*() and hton*(), where * is l or s for the different types supported. Since different hosts may have different byte orderings, this may be what you want to use if the data you're reading uses a consistent serialized form on all platforms.
i = ntohl(i);
You can also manually move bytes around in str before copying it into the integer.
std::swap(str[0],str[3]);
std::swap(str[1],str[2]);
std::memcpy(&i,str,sizeof(i));
Or you can manually manipulate the integer's value using shifts and bitwise operators.
std::memcpy(&i,str,sizeof(i));
// (use a uint32_t here so the shifts stay well defined)
i = (i&0xFFFF0000)>>16 | (i&0x0000FFFF)<<16;
i = (i&0xFF00FF00)>>8 | (i&0x00FF00FF)<<8;
This falls in the realm of bit twiddling.
for (i=0;i<sizeof(struct foo);i++) dst[i] = src[i ^ mask];
where mask == (sizeof type -1) if the stored and native endianness differ.
With this technique one can convert a struct to bit masks:
struct foo {
byte a,b; // mask = 0,0
short e; // mask = 1,1
int g; // mask = 3,3,3,3,
double i; // mask = 7,7,7,7,7,7,7,7
} s; // notice that all units must be aligned according to their native size
Again, these masks can be encoded with two bits per symbol, as (1<<n)-1, meaning that on a 64-bit machine one can encode the necessary masks for a 32-byte struct in a single constant (with 1-, 2-, 4- and 8-byte alignments).
unsigned int mask = 0xffffaa50; // or zero if the endianness matches
for (i=0;i<16;i++) {
dst[i]=src[i ^ ((1<<(mask & 3))-1)]; mask>>=2;
}
If the values you receive are truly strings (char* or std::string) and you know their format, sscanf() and the ato*() family (atoi(), atof(), and friends) will be your friends. They take well-formatted strings and convert them per passed-in formats (a kind of reverse printf).
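A minimal sketch, assuming the incoming data really is formatted text rather than raw bytes (the input string here is hypothetical):
#include <cstdio>

const char* text = "123 4.56"; // hypothetical well-formatted input
int i;
float f;
if (std::sscanf(text, "%d %f", &i, &f) == 2)
{
    // i == 123, f == 4.56f
}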
Related
I have a question really similar to this:
Building a 32-bit float out of its 4 composite bytes.
Specifically I have an array of unsigned char composed by 8 elements:
unsigned char c[8] = {0b01001000, 0b11100001, 0b00100110, 0b01000001, 0b01111011, 0b00010100, 0b10000110, 0b01000000};
This, with a little-endian convention, corresponds to two floats, namely { 10.4300f, 4.19000f }.
I know that I could obtain the latter with:
float f[2];
memcpy(&f, &c, sizeof(f));
//f = { 10.4300f, 4.19000f }
But this involves a copy.
Is there a way to cast the c array inplace, changing its type so that I can avoid copying?
Is there a way to cast the c array inplace
No. However, if the array is sufficiently aligned to hold a float, what you can do after the memcpy is to placement-new a copy of that float back onto the array.
Optimisers are smart and typically recognise that you copied the same value back; sometimes two copies in the abstract machine result in zero copies for the CPU.
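A sketch of that idea, assuming C++11 alignas is available (the byte values are the ones from the question):
#include <cstring>
#include <new>

alignas(float) unsigned char c[8] =
    {0x48, 0xE1, 0x26, 0x41, 0x7B, 0x14, 0x86, 0x40};

float* reuse_as_float()
{
    float tmp;
    std::memcpy(&tmp, c, sizeof tmp);  // read out the first float's bytes
    return new (c) float(tmp);         // re-create a float object in place
    // both copies can typically be optimised away by the compiler
}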
This, with a little endianness convention corresponds
I know that I could obtain the latter with
Note that memcpy always yields native byte order, so you only get the little-endian interpretation on little-endian systems. Assuming the data can be read as little-endian is therefore not portable.
If you want to avoid assuming native endianness, you'll need to read bytes in correct order, shift / mask them into an unsigned integer, then memcpy (or bit_cast) that integer into float.
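A sketch of that, assuming the host's float is IEEE 754 (the helper name is mine):
#include <stdint.h>
#include <cstring>

// Decode a little-endian IEEE 754 float regardless of host byte order.
float read_le_float(const unsigned char* p)
{
    uint32_t bits = (uint32_t)p[0]
                  | ((uint32_t)p[1] << 8)
                  | ((uint32_t)p[2] << 16)
                  | ((uint32_t)p[3] << 24);
    float f;
    std::memcpy(&f, &bits, sizeof f);  // or std::bit_cast in C++20
    return f;
}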
Is there a way in C/C++ to cast a char array to an int at any position?
I tried the following, but it automatically aligns to the nearest 32 bits (on a 32-bit architecture) if I try to use pointer arithmetic with non-const offsets:
unsigned char data[8];
data[0] = 0; data[1] = 1; ... data[7] = 7;
int32_t p = 3;
int32_t d1 = *((int*)(data+3)); // = 0x03040506 CORRECT
int32_t d2 = *((int*)(data+p)); // = 0x00010203 WRONG
Update:
As stated in the comments, the input comes in tuples of 3 and I cannot change that. I want to convert 3 values to an int for further processing and this conversion should be as fast as possible. The solution does not have to be cross-platform; I am working with a very specific compiler and processor, so it can be assumed that it is a 32-bit architecture with big endian.
The lowest byte of the result does not matter to me (see above).
My main questions at the moment are: Why does d1 have the correct value while d2 does not? Is this also true for other compilers? Can this behavior be changed?
No you can't do that in a portable way.
The behaviour encountered when attempting a cast from char* to int* is undefined in both C and C++ (possibly for the very reasons that you've spotted: ints are possibly aligned on 4 byte boundaries and data is, of course, contiguous.)
(The fact that data+3 works but data+p doesn't is possibly due to compile-time vs. runtime evaluation.)
Also note that the signed-ness of char is not specified in either C or C++ so you should use signed char or unsigned char if you're writing code like this.
Your best bet is to use the bitwise shift operators (>> and <<) and bitwise | and & to absorb char values into an int, as in the sketch below. Also consider using int32_t in case you build for targets with 16- or 64-bit ints.
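For the 3-byte, big-endian tuples described in the update (where the lowest result byte doesn't matter), a minimal sketch, with read_tuple being a name of my choosing:
#include <stdint.h>

// Pack data[p], data[p+1], data[p+2] into the top three bytes of an
// int32_t, matching the big-endian layout described in the question.
int32_t read_tuple(const unsigned char* data, int32_t p)
{
    uint32_t v = ((uint32_t)data[p]     << 24)
               | ((uint32_t)data[p + 1] << 16)
               | ((uint32_t)data[p + 2] << 8);
    return (int32_t)v;
}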
There is no way; converting a pointer to a wrongly aligned one is undefined.
You can use memcpy to copy the char array into an int32_t.
int32_t d = 0;
memcpy(&d, data+3, sizeof(d)); // d is an int32_t, so this copies exactly 4 bytes
Most compilers have built-in functions for memcpy with a constant size argument, so it's likely that this won't produce any runtime overhead.
Even though a cast like you've shown is allowed for correctly aligned pointers, dereferencing such a pointer is a violation of strict aliasing. An object with an effective type of char[] must not be accessed through an lvalue of type int.
In general, type-punning is endianness-dependent, and converting a char array representing RGB colours is probably easier to do in an endianness-agnostic way, something like
int32_t d = (int32_t)data[2] << 16 | (int32_t)data[1] << 8 | data[0];
Is it possible to store data in integer form from 0 to 255 rather than as 8-bit characters? Although both are the same thing, how can we do it, for example, with the write() function?
Is it ok to directly cast any integer to char and vice versa? Does something like
{
int a[1] = {213};
write((char*)a,1);
}
and
{
int a[1];
read((char*)a,1);
cout << a[0];
}
work to get 213 from the same location in the file? It may work on this computer, but is it portable; in other words, is it suitable for cross-platform projects? If I create a file format for each game level (which will store objects' coordinates in the current level's file) using this principle, will it work on other computers/systems/platforms so that the same level loads correctly?
The code you show would write the first (lowest-address) byte of a[0]'s object representation, which may or may not be the byte with the value 213. The particular object representation of an int is implementation-defined.
The portable way of writing one byte with the value of 213 would be
unsigned char c = a[0];
write(&c, 1);
You have the right idea, but it could use a bit of refinement.
{
int intToWrite = 213;
unsigned char byteToWrite = 0;
if ( intToWrite > 255 || intToWrite < 0 )
{
doError();
return;
}
// since your range is 0-255, you really want the low order byte of the int.
// Just reading the 1st byte may or may not work for your architecture. I
// prefer to let the compiler handle the conversion via casting.
byteToWrite = (unsigned char) intToWrite;
write( &byteToWrite, sizeof(byteToWrite) );
// you can hard code the size, but I try to be in the habit of using sizeof
// since it is better when dealing with multibyte types
}
{
int a = 0;
unsigned char toRead = 0;
// just like the write, the byte ordering of the int will depend on your
// architecture. You could write code to explicitly handle this, but it's
// easier to let the compiler figure it out via implicit conversions
read( &toRead, sizeof(toRead) );
a = toRead;
cout<<a;
}
If you need to minimize space or otherwise can't afford the extra char sitting around, then it's definitely possible to read/write a particular byte of your integer. However, it can mean pulling in extra headers (e.g. for htons/ntohs) or annoying platform #defines.
It will work, with some caveats:
Use reinterpret_cast<char*>(x) instead of (char*)x to be explicit that you’re performing a cast that’s ordinarily unsafe.
sizeof(int) varies between platforms, so you may wish to use a fixed-size integer type from <cstdint> such as int32_t.
Endianness can also differ between platforms, so you should be aware of the platform byte order and swap byte orders to a consistent format when writing the file. You can detect endianness at runtime and swap bytes manually, or use htonl and ntohl to convert between host and network (big-endian) byte order.
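Putting those caveats together, a minimal sketch (write_be32 is a hypothetical helper; it always writes big-endian regardless of host order):
#include <stdint.h>
#include <fstream>

// Write an int32_t in a fixed (big-endian) byte order so the file
// reads back identically on any platform.
void write_be32(std::ofstream& out, int32_t value)
{
    uint32_t v = (uint32_t)value;
    char buf[4];
    buf[0] = (char)((v >> 24) & 0xFF);
    buf[1] = (char)((v >> 16) & 0xFF);
    buf[2] = (char)((v >> 8)  & 0xFF);
    buf[3] = (char)(v & 0xFF);
    out.write(buf, sizeof buf);
}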
Also, as a practical matter, I recommend you prefer text-based formats—they’re less compact, but far easier to debug when things go wrong, since you can examine them in any text editor. If you determine that loading and parsing these files is too slow, then consider moving to a binary format.
Say I have a binary protocol, where the first 4 bits represent a numeric value which can be less than or equal to 10 (ten in decimal).
In C++, the smallest data type available to me is char, which is 8 bits long. So, within my application, I can hold the value represented by 4 bits in a char variable. My question is: if I have to pack the char's value back into 4 bits for network transmission, how do I do it?
You do bitwise operations on the char;
so
unsigned char packedvalue = 0;
packedvalue |= 0xF0 & (7 <<4);
packedvalue |= 0x0F & (10);
This sets the upper 4 bits to 7 and the lower 4 bits to 10.
Unpacking works the same way in reverse:
int upper, lower;
upper = (packedvalue & 0xF0) >>4;
lower = packedvalue & 0x0F;
As an extra answer to the question -- you may also want to look at protocol buffers for a way of encoding and decoding data for binary transfers.
Sure, just use one char for your value:
std::ofstream outfile("thefile.bin", std::ios::binary);
unsigned int n; // at most 10!
char c = char(n << 4); // the value goes in the upper 4 bits
outfile.write(&c, 1); // we wrote the value "10"
The lower 4 bits will be left at zero. If they're also used for something, you'll have to populate c fully before writing it. To read:
infile.read(&c, 1);
unsigned int n = static_cast<unsigned char>(c) >> 4; // avoid right-shifting a possibly negative char
Well, there are the popular but non-portable "bit fields". They're standard-compliant, but may create a different packing order on different platforms. So don't use them.
Then, there are the highly portable bit shifting and bitwise AND and OR operators, which you should prefer. Essentially, you work on a larger field (usually 32 bits, for TCP/IP protocols) and extract or replace subsequences of bits. See Martin's link and Soren's answer for those.
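A sketch of that extract/replace pattern on a 32-bit field (the helper names are mine):
#include <stdint.h>

// Extract `width` bits starting `shift` bits from the least significant end.
uint32_t get_field(uint32_t word, int shift, int width)
{
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (word >> shift) & mask;
}

// Replace the same field with a new value.
uint32_t set_field(uint32_t word, int shift, int width, uint32_t value)
{
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (word & ~(mask << shift)) | ((value & mask) << shift);
}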
Are you familiar with C's bitfields? You simply write
struct my_bits {
unsigned v1 : 4;
...
};
Be warned, various operations are slower on bitfields because the compiler must unpack them for things like addition. I'd imagine unpacking a bitfield will still be faster than the addition operation itself, even though it requires multiple instructions, but it's still overhead. Bitwise operations should remain quite fast. Equality too.
You must also take care with endianness and threads (see the Wikipedia article I linked for details, but the issues are kinda obvious). You should learn about endianness anyway, since you said "binary protocol" (see this previous question).
I want to transmit data over the network, but I don't want to use any foreign libraries (Standard C/C++ is ok).
for example:
unsigned int x = 123;
char y[3] = {'h', 'i', '\0'};
float z = 1.23f;
I want this in an
char xyz[11];
array.
Note:
To transmit it over the network, I need network byte order for the unsigned int (htonl function), then I need to somehow serialize the float to be in IEEE 754 form (there are many functions for this on the internet), and I know it.
How do I get them into the xyz array, nicely lined up end to end, so I can use it as a buffer for my socket's send() function? Obviously I have the reverse functions (ntohl, and a reverse IEEE 754 conversion) to get them out, but I need a technique there too, preferably the same one...
It would be something like this:
xyz in binary:
00000000 00000000 00000000 01111011 | 01101000 | 01101001 | 00000000 | 00111111 10011101 01110000 10100100
- big-endian repr. of u. int 123 -  | - 'h' -  | - 'i' -  | - '\0' - | - IEEE 754 repr. of float 1.23 -
How can I accomplish this without external libraries and minimal use of standard library functions? This isn't so much for my program as for me to learn from.
Ah, you want to serialize primitive data types! In principle, there are two approaches. The first is that you just grab the internal, in-memory binary representation of the data you want to serialize, reinterpret it as characters, and use those as your representation:
So if you have a:
double d;
you take the address of that, reinterpret that pointer as a pointer to character, and then use these characters:
double *pd=&d;
char *pc = reinterpret_cast<char*>(pd);
for(size_t i=0; i<sizeof(double); i++)
{
char ch = *pc;
DoSomethingWith(ch);
pc++;
}
This works with all primitive data types. The main problem here is that the binary representation is implementation-dependent (mainly CPU-dependent). (And you will run into subtle bugs when you try doing this with IEEE NaNs...)
All in all, this approach is not portable at all, as you have no control at all over the representation of your data.
The second approach is to use a higher-level representation that you yourself have under control. If performance is not an issue, you could use std::stringstream and the >> and << operators to stream primitive C types into std::strings. This is slow, but easy to read and debug, and very portable on top of it.
Something like the code below would do it. Watch out for problems where sizeof(unsigned int) is different on different systems, those will get you. For things like this you're better off using types with well-defined sizes, like int32_t. Anyway...
unsigned int x = 123;
char y[3] = {'h', 'i', '\0'};
float z = 1.23f;
// The buffer we will be writing bytes into
unsigned char outBuf[sizeof(x)+sizeof(y)+sizeof(z)];
// A pointer we will advance whenever we write data
unsigned char * p = outBuf;
// Serialize "x" into outBuf
uint32_t neX = htonl(x);
memcpy(p, &neX, sizeof(neX));
p += sizeof(neX);
// Serialize "y" into outBuf
memcpy(p, y, sizeof(y));
p += sizeof(y);
// Serialize "z" into outBuf
uint32_t zBits;
memcpy(&zBits, &z, sizeof(zBits)); // avoids the strict-aliasing cast
uint32_t neZ = htonl(zBits);
memcpy(p, &neZ, sizeof(neZ));
p += sizeof(neZ);
int resultCode = send(mySocket, outBuf, p-outBuf, 0);
[...]
... and of course the receiving code would do something similar, except in reverse.
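A sketch of that receive path, mirroring the names above (recv details elided; this assumes the full buffer has already arrived):
unsigned char inBuf[sizeof(uint32_t) + 3 + sizeof(uint32_t)];
// ... recv() into inBuf until it is full ...
const unsigned char* q = inBuf;

// Deserialize "x"
uint32_t neX2;
memcpy(&neX2, q, sizeof(neX2));
unsigned int x2 = ntohl(neX2);
q += sizeof(neX2);

// Deserialize "y"
char y2[3];
memcpy(y2, q, sizeof(y2));
q += sizeof(y2);

// Deserialize "z"
uint32_t neZ2;
memcpy(&neZ2, q, sizeof(neZ2));
uint32_t zBits2 = ntohl(neZ2);
float z2;
memcpy(&z2, &zBits2, sizeof(z2));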
This discussion seems relevant to your question, but it uses the Boost serialization API.
What exactly is your goal? And what exactly are the means you're willing to use?
If you just want to get the job done with one particular compiler on one particular computer, then the fastest and easiest, but also dirtiest, solution is to use a union. You define a struct that has your items as members, and put that struct in a union with the character array. You need to tell the compiler to pack the members really tightly, something along the lines of #pragma pack(1), and your problem is solved: you just store the three values in the members, and then look at the whole thing as a character array.
If the machine is little endian, and you need big endian ints / floats, you just swap the relevant characters.
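A non-portable sketch of that union trick, assuming a compiler that understands #pragma pack (reading the inactive union member is technically undefined in C++, which is part of why this is the "dirty" option):
#include <cstring>

#pragma pack(push, 1)   // pack members with no padding (compiler-specific)
struct Payload {
    unsigned int x;     // 4 bytes on this assumed platform
    char         y[3];
    float        z;     // 4 bytes, host byte order
};
#pragma pack(pop)

union Overlay {
    Payload payload;
    char    bytes[sizeof(Payload)]; // the same 11 bytes, viewed as chars
};

void fill(Overlay& o)
{
    o.payload.x = 123;
    std::memcpy(o.payload.y, "hi", 3);
    o.payload.z = 1.23f;
    // o.bytes is now an 11-byte buffer ready for send(), in host order.
}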
But there are at least another dozen solutions that come to mind if you have other goals, like portability, non-standard byte order, sizeof(int) !=4, float not stored in IEEE format internally, etc.