C++: memcpy a struct into a byte array

I have a problem copying the data of a struct into my byte array. This byte array is used to pass information through an interface. For normal data types I have to byte-swap the values.
But now I have a struct. When I use memcpy, the values of the struct end up in the wrong byte order.
How can I copy the struct easily and "correctly" into the byte array?
memcpy(byteArray, &stData, sizeof(stData));
stData contains a simple integer; the value 0x0001 ends up in the byte array as the bytes 0x01 0x00.

If you are on an x86 architecture machine, then integers are stored in "Little Endian" order with the least significant bytes first. That is why 0x0001 will appear as 0x01 0x00 in a byte array. As long as you also unpack on a machine with the same architecture, this will work OK, but this is one of the (many) reasons that binary serialization is non-trivial.
If you need to exchange binary data between machines in a safe manner, then you can either decide on a standard (e.g. convert all binary data to little-endian or big-endian; network wire protocols generally convert to big-endian, though many high-performance proprietary systems stick with little-endian since today this is the native format on most machines), or look for a portable binary file format such as HDF or BSON (these store metadata about the binary data being stored). Finally, you can convert to ASCII (XML, JSON). (Also, note that "big" and "little" aren't the only choices - "every machine" is a tall order since they haven't all been invented yet. :) )
See Wikipedia or search for "endian" on SO for many examples.
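If you decide on a fixed byte order, one option is to serialize the struct field by field. Here is a minimal sketch, assuming the struct holds a single 32-bit integer member (the member name value is made up purely for illustration):

#include <cstdint>

// Hypothetical layout: the real struct and member names will differ.
struct StData { std::uint32_t value; };

// Write a 32-bit value into the buffer in big-endian order, independent of
// the host's native byte order.
void put_u32_be(unsigned char* out, std::uint32_t v)
{
    out[0] = static_cast<unsigned char>(v >> 24);
    out[1] = static_cast<unsigned char>(v >> 16);
    out[2] = static_cast<unsigned char>(v >> 8);
    out[3] = static_cast<unsigned char>(v);
}

void serialize(unsigned char* byteArray, const StData& stData)
{
    // One call per field: struct padding and the host byte order never reach the wire.
    put_u32_be(byteArray, stData.value);
}

The same idea extends to more fields: each one gets written at a known offset in a known byte order.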

Your problem is that you are on a little-endian machine and you want to store the data as big-endian.
The standard C library has functions to do this:
htons, htonl: host (your little-endian machine) to network byte order (big-endian).
s is for 16 bits and l for 32 bits (http://linux.die.net/man/3/htons).
For a 4-byte integer you can do:
#include <arpa/inet.h>
#include <stdint.h>
...
*(uint32_t*)byteArray = htonl(stData.value);  /* assuming the struct's integer member is named "value" */
For an 8-byte integer you can use bswap_64 (https://www.gnu.org/software/gnulib/manual/html_node/bswap_005f64.html).
But it only exists in the GNU C library. Otherwise you have to swap manually; there are lots of examples on the web.
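If bswap_64 isn't available, the manual swap is short enough to write yourself; a sketch in plain C++:

#include <cstdint>

// Portable 64-bit byte swap: just shifts and masks, no compiler intrinsics.
std::uint64_t bswap64(std::uint64_t x)
{
    return ((x & 0x00000000000000FFULL) << 56)
         | ((x & 0x000000000000FF00ULL) << 40)
         | ((x & 0x0000000000FF0000ULL) << 24)
         | ((x & 0x00000000FF000000ULL) <<  8)
         | ((x & 0x000000FF00000000ULL) >>  8)
         | ((x & 0x0000FF0000000000ULL) >> 24)
         | ((x & 0x00FF000000000000ULL) >> 40)
         | ((x & 0xFF00000000000000ULL) >> 56);
}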

Related

C++: How to read and write multi-byte integer values in a platform-independent way?

I'm developing a simple protocol that is used to read/write integer values from/to a buffer. The vast majority of integers are below 128, but much larger values are possible, so I'm looking at some form of multi-byte encoding to store the values in a concise way.
What is the simplest and fastest way to read/write multi-byte values in a platform-independent (i.e. byte order agnostic) way?
XDR format might help you there. If I had to summarize it in one sentence, it's a kind of binary UTF-8 for integers.
Edit: As mentioned in my comment below, I "know" XDR because I use several XDR-related functions in my office job. Only after your comment did I realize that the "packed XDR" format I use every day isn't even part of the official XDR docs, so I'll describe it separately.
The idea is thus:
Inspect the most significant bit of the byte.
If it is 0, that byte is the value.
If it is 1, the next three bits give the "byte count", i.e. the number of bytes in the value.
Mask out the top nibble (flag bit plus byte count), concatenate the appropriate number of bytes, and you've got the value.
I have no idea if this is a "real" format or my (former) coworker created this one himself (which is why I don't post code).
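Purely as an illustration of the scheme described above (and since the format may well be home-grown, treat every detail here as a guess), a decoder might look roughly like this:

#include <cstdint>
#include <cstddef>

// Decode one value from the buffer; "count" is interpreted here as the
// number of additional bytes that follow the first one.
std::uint64_t decode(const unsigned char* in, std::size_t* consumed)
{
    const unsigned char first = in[0];
    if ((first & 0x80) == 0) {                      // MSB clear: the byte is the value
        *consumed = 1;
        return first;
    }
    const std::size_t count = (first >> 4) & 0x07;  // next three bits: byte count
    std::uint64_t value = first & 0x0F;             // low nibble holds the top bits
    for (std::size_t i = 0; i < count; ++i)
        value = (value << 8) | in[1 + i];           // concatenate the following bytes
    *consumed = 1 + count;
    return value;
}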
You might be interested in the following functions:
htonl, htons, ntohl, ntohs - convert values between host and network byte order
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
man byteorder
Text would be my first choice. If you want a variable-length binary encoding you have two basic choices:
a length indication
an end marker
You can obviously combine either of those with some value bits.
For a length indication, that gives you something where the length and some value bits are stored together (see for instance UTF-8).
For an end marker, you can for instance state that a set MSB indicates the last byte, and thus have 7 data bits per byte.
Other variants are obviously possible.
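For example, the end-marker variant (MSB set on the last byte, 7 data bits per byte; putting the least significant group first is my own choice here, not something specified above) could be encoded like this:

#include <cstdint>
#include <cstddef>

// Each output byte carries 7 value bits; the MSB is set only on the final byte.
std::size_t encode_end_marked(std::uint64_t value, unsigned char* out)
{
    std::size_t n = 0;
    while (value >= 0x80) {
        out[n++] = static_cast<unsigned char>(value & 0x7F);   // more bytes follow
        value >>= 7;
    }
    out[n++] = static_cast<unsigned char>(value | 0x80);       // last byte: MSB set
    return n;   // number of bytes written
}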
You could try Network Byte Order
Google's protocol buffers provide a pre-made implementation that uses variable-width encodings.

big endian little endian conversion

I saw a question on Stack Overflow about how to convert from one endianness to another. And the solution was like this:
#include <algorithm>  // std::reverse

template <typename T>
void swap_endian(T& pX)
{
    // Treat the object as a sequence of bytes and reverse it in place.
    char& raw = reinterpret_cast<char&>(pX);
    std::reverse(&raw, &raw + sizeof(T));
}
My question is: will this solution swap everything correctly? It swaps the bytes in the correct order, but it does not swap the bits.
Yes it will, because there is no need to swap the bits.
Edit:
Endianness has effect on the order in which the bytes are written for values of 2 bytes or more. Little endian means the least significant byte comes first, big-endian the other way around.
If you receive a big-endian stream of bytes written by a little-endian system, there is no debate about what the most significant bit is within each byte. If the bit order were affected, you could not read each other's byte streams reliably (even if it was just plain 8-bit ASCII).
This cannot be automatically determined for 2-byte or bigger values, as the file system (or network layer) does not know whether you are sending data a byte at a time or sending ints that are (e.g.) 4 bytes long.
If you have a direct 1-bit serial connection with another system, you will have to agree on little or big endian bit ordering at the transport layer.
Big-endian vs. little-endian concerns how bytes are ordered within a larger unit, such as an int, long, etc. The ordering of bits within a byte is the same.
"Endianness" generally refers to byte order, not the order of the bits within those bytes. In this case, you don't have to reverse the bits.
You are correct, that function would only swap the byte order, not individual bits. This is usually sufficient for networking. Depending on your needs, you may also find the htons() family of functions useful.
From Wikipedia:
Most modern computer processors agree on bit ordering "inside" individual bytes (this was not always the case). This means that any single-byte value will be read the same on almost any computer one may send it to.

How to guarantee bits of char and short for communication to external device

Hello, I am writing a library for communicating with an external device over an RS-232 serial connection.
Often I have to send a command that includes an 8-bit (1-byte) character or a 16-bit (2-byte) number. How do I do this in a portable way?
Main problem
From reading other questions it seems that the standard does not guarantee 1 byte = 8 bits
(defined in the standard, §1.7/1):
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and is composed of a contiguous sequence of bits, the number of which is implementation-defined.
How can I guarantee the number of bits in a char? My device expects exactly 8 bits, rather than at least 8 bits.
I realise that almost all implementations have 1 byte = 8 bits, but I am curious as to how to guarantee it.
Short->2 byte check
I hope you don't mind, I would also like to run my proposed solution for the short -> 2 byte conversion by you. I am new to byte conversions and cross platform portability.
To guarantee the number of bytes in the short, I guess I am going to need to:
Do a sizeof(short). If sizeof(short) == 2, convert to bytes and check the byte ordering (as here).
If sizeof(short) > 2, convert the short to bytes, check the byte ordering (as here), then check that the most significant bytes are empty and remove them?
Is this the right thing to do? Is there a better way?
Many thanks
AFAIK, communication with the serial port is somewhat platform/OS dependent, so when you write the low-level part of it you will know the platform, its endianness and its CHAR_BIT very well. Seen that way, the question doesn't really arise.
Also, don't forget that UART hardware is able to transmit 7 or 8 bit words, so it does not depend on the system architecture.
EDIT: As I mentioned, the UART's word size is fixed (let's consider mode 3 with 8 bits, as the most common): the hardware itself won't send more than 8 bits, so with one send command it transmits exactly 8 bits, regardless of the machine's CHAR_BIT. So by using one send per byte, as in
unsigned short i;
send(i);        // low 8 bits (assuming send() transmits one 8-bit UART word)
send(i >> 8);   // next 8 bits
you can be sure it will do the right thing.
Also, a good idea would be to see what exactly boost.asio is doing.
This thread seems to suggest you can use CHAR_BIT from <climits>. This page even suggests 8 is the minimum number of bits in a char... I don't know how the quote from the standard relates to this.
For fixed-size integer types, if using MSVC2010 or GCC, you can rely on C99's <stdint.h> (even in C++) to define (u)int8_t and (u)int16_t which are guaranteed to be exactly 8 and 16 bits wide respectively.
CHAR_BIT from the <climits> header tells you the number of bits in a char. This is at least 8. Also, a short int uses at least 16 bits for its value representation. This is guaranteed by the minimum value ranges:
type             can at least represent
---------------------------------------
unsigned char    0...255
signed char      -127...127
unsigned short   0...65535
signed short     -32767...32767
unsigned int     0...65535
signed int       -32767...32767
see here
Regarding portability, whenever I write code that relies on CHAR_BIT==8 I simply write this:
#include <climits>
#if CHAR_BIT != 8
#error "I expect CHAR_BIT==8"
#endif
As you said, this is true for almost all platforms and if it's not in a particular case, it won't compile. That's enough portability for me. :-)

What platforms have something other than 8-bit char?

Every now and then, someone on SO points out that char (aka 'byte') isn't necessarily 8 bits.
It seems that 8-bit char is almost universal. I would have thought that for mainstream platforms, it is necessary to have an 8-bit char to ensure its viability in the marketplace.
Both now and historically, what platforms use a char that is not 8 bits, and why would they differ from the "normal" 8 bits?
When writing code, and thinking about cross-platform support (e.g. for general-use libraries), what sort of consideration is it worth giving to platforms with non-8-bit char?
In the past I've come across some Analog Devices DSPs for which char is 16 bits. DSPs are a bit of a niche architecture I suppose. (Then again, at the time hand-coded assembler easily beat what the available C compilers could do, so I didn't really get much experience with C on that platform.)
char is also 16 bit on the Texas Instruments C54x DSPs, which turned up for example in OMAP2. There are other DSPs out there with 16 and 32 bit char. I think I even heard about a 24-bit DSP, but I can't remember what, so maybe I imagined it.
Another consideration is that POSIX mandates CHAR_BIT == 8. So if you're using POSIX you can assume it. If someone later needs to port your code to a near-implementation of POSIX, that just so happens to have the functions you use but a different size char, that's their bad luck.
In general, though, I think it's almost always easier to work around the issue than to think about it. Just type CHAR_BIT. If you want an exact 8 bit type, use int8_t. Your code will noisily fail to compile on implementations which don't provide one, instead of silently using a size you didn't expect. At the very least, if I hit a case where I had a good reason to assume it, then I'd assert it.
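As a concrete illustration of that approach, a compile-time guard (a C++11 static_assert here, though a preprocessor #error works just as well) might look like:

#include <climits>
#include <cstdint>

// Fails to compile, rather than silently misbehaving, on platforms where
// the 8-bit-byte assumption does not hold.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// std::int8_t is only defined when an exact 8-bit type exists, so simply
// using it gives the same "noisy failure to compile" described above.
using octet = std::int8_t;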
When writing code, and thinking about cross-platform support (e.g. for general-use libraries), what sort of consideration is it worth giving to platforms with non-8-bit char?
It's not so much that it's "worth giving consideration" to something as it is playing by the rules. In C++, for example, the standard says all bytes will have "at least" 8 bits. If your code assumes that bytes have exactly 8 bits, you're violating the standard.
This may seem silly now -- "of course all bytes have 8 bits!", I hear you saying. But lots of very smart people have relied on assumptions that were not guarantees, and then everything broke. History is replete with such examples.
For instance, most early-90s developers assumed that a particular no-op CPU timing delay taking a fixed number of cycles would take a fixed amount of clock time, because most consumer CPUs were roughly equivalent in power. Unfortunately, computers got faster very quickly. This spawned the rise of boxes with "Turbo" buttons -- whose purpose, ironically, was to slow the computer down so that games using the time-delay technique could be played at a reasonable speed.
One commenter asked where in the standard it says that char must have at least 8 bits. It's in section 5.2.4.2.1. This section defines CHAR_BIT, the number of bits in the smallest addressable entity, and gives it a minimum value of 8. It also says:
Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
So any number equal to 8 or higher is suitable for substitution by an implementation into CHAR_BIT.
Machines with 36-bit architectures have 9-bit bytes. According to Wikipedia, machines with 36-bit architectures include:
Digital Equipment Corporation PDP-6/10
IBM 701/704/709/7090/7094
UNIVAC 1103/1103A/1105/1100/2200,
A few of which I'm aware:
DEC PDP-10: variable, but most often 7-bit chars packed 5 per 36-bit word, or else 9 bit chars, 4 per word
Control Data mainframes (CDC-6400, 6500, 6600, 7600, Cyber 170, Cyber 176 etc.) 6-bit chars, packed 10 per 60-bit word.
Unisys mainframes: 9 bits/byte
Windows CE: simply doesn't support the `char` type at all -- requires 16-bit wchar_t instead
There is no such thing as a completely portable code. :-)
Yes, there may be various byte/char sizes. Yes, there may be C/C++ implementations for platforms with highly unusual values of CHAR_BIT and UCHAR_MAX. Yes, sometimes it is possible to write code that does not depend on char size.
However, almost any real code is not standalone. E.g. you may be writing code that sends binary messages to a network (the protocol is not important). You may define structures that contain the necessary fields. Then you have to serialize them. Just binary-copying a structure into an output buffer is not portable: generally you know neither the byte order for the platform nor the structure members' alignment, so the structure just holds the data but does not describe the way the data should be serialized.
OK. You may perform byte-order transformations and move the structure members (e.g. uint32_t or similar) into the buffer using memcpy. Why memcpy? Because there are a lot of platforms where it is not possible to write a 32-bit value (16-bit, 64-bit -- no difference) when the target address is not aligned properly.
So, you have already done a lot to achieve portability.
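Concretely, the per-member copy described above might look like this (the struct and field names are invented for the sketch):

#include <arpa/inet.h>   // htonl
#include <cstdint>
#include <cstring>       // std::memcpy

// Hypothetical message; the real structure will of course differ.
struct Message {
    std::uint32_t id;
    std::uint32_t length;
};

void serialize(const Message& msg, unsigned char* buffer)
{
    std::uint32_t be;

    be = htonl(msg.id);                                  // fix the byte order first
    std::memcpy(buffer, &be, sizeof(be));                // memcpy copes with any alignment

    be = htonl(msg.length);
    std::memcpy(buffer + sizeof(be), &be, sizeof(be));
}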
And now the final question. We have a buffer. The data from it is sent to a TCP/IP network. Such a network assumes 8-bit bytes. The question is: what type should the buffer be? What if your chars are 9 bits? 16? 24? Maybe each char corresponds to one 8-bit byte sent to the network, and only 8 of its bits are used? Or maybe multiple network bytes are packed into 24/16/9-bit chars? That's a real question, and it is hard to believe there is a single answer that fits all cases. A lot depends on the socket implementation for the target platform.
So, here is what I am getting at. Usually code can be made portable to a certain extent relatively easily. It's very important to do so if you expect to use the code on different platforms. However, improving portability beyond that point requires a lot of effort and often gives little, as real code almost always depends on other code (the socket implementation in the example above). I am sure that for about 90% of code, the ability to work on platforms with bytes other than 8 bits is almost useless, because it runs in an environment that is bound to 8-bit bytes. Just check the byte size and add a compile-time assertion. You will almost certainly have to rewrite a lot for a highly unusual platform anyway.
But if your code is highly "standalone" -- why not? You may write it in a way that allows different byte sizes.
It appears that you can still buy an IM6100 (i.e. a PDP-8 on a chip) out of a warehouse. That's a 12-bit architecture.
Many DSP chips have 16- or 32-bit char. TI routinely makes such chips for example.
The C and C++ programming languages, for example, define byte as "addressable unit of data large enough to hold any member of the basic character set of the execution environment" (clause 3.6 of the C standard). Since the C char integral data type must contain at least 8 bits (clause 5.2.4.2.1), a byte in C is at least capable of holding 256 different values. Various implementations of C and C++ define a byte as 8, 9, 16, 32, or 36 bits
Quoted from http://en.wikipedia.org/wiki/Byte#History
Not sure about other languages though.
http://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_Formats
Defines a byte on that machine to be variable length
The DEC PDP-8 family had a 12-bit word, although you usually used 8-bit ASCII for output (on a Teletype, mostly). However, there was also a 6-bit character code that allowed you to encode 2 chars in a single 12-bit word.
For one, Unicode characters are longer than 8 bits. As someone mentioned earlier, the C spec defines data types by their minimum sizes. Use sizeof and the values in limits.h if you want to interrogate your data types and discover exactly what size they are for your configuration and architecture.
For this reason, I try to stick to data types like uint16_t when I need a data type of a particular bit length.
Edit: Sorry, I initially misread your question.
The C spec says that a char object is "large enough to store any member of the execution character set". limits.h lists a minimum size of 8 bits, but the definition leaves the max size of a char open.
Thus, a char is at least as long as the largest character from your architecture's execution set (typically rounded up to the nearest 8-bit boundary). If your architecture has longer opcodes, your char size may be longer.
Historically, the x86 platform's opcode was one byte long, so char was initially an 8-bit value. Current x86 platforms support opcodes longer than one byte, but the char is kept at 8 bits in length since that's what programmers (and the large volumes of existing x86 code) are conditioned to.
When thinking about multi-platform support, take advantage of the types defined in stdint.h. If you use (for instance) a uint16_t, then you can be sure that this value is an unsigned 16-bit value on whatever architecture, whether that 16-bit value corresponds to a char, short, int, or something else. Most of the hard work has already been done by the people who wrote your compiler/standard libraries.
If you need to know the exact size of a char because you are doing some low-level hardware manipulation that requires it, I typically use a data type that is large enough to hold a char on all supported platforms (usually 16 bits is enough) and run the value through a convert_to_machine_char routine when I need the exact machine representation. That way, the platform-specific code is confined to the interface function and most of the time I can use a normal uint16_t.
what sort of consideration is it worth giving to platforms with non-8-bit char?
magic numbers occur e.g. when shifting;
most of these can be handled quite simply
by using CHAR_BIT and e.g. UCHAR_MAX instead of 8 and 255 (or similar).
hopefully your implementation defines those :)
those are the "common" issues.....
another indirect issue is say you have:
typedef unsigned char uchar;   // shorthand used in this example

struct xyz {
    uchar baz;
    uchar blah;
    uchar buzz;
};
this might "only" take (best case) 24 bits on one platform,
but might take e.g. 72 bits elsewhere.....
if each uchar held "bit flags" and each uchar only had 2 "significant" bits or flags that
you were currently using, and you only organized them into 3 uchars for "clarity",
then it might be relatively "more wasteful" e.g. on a platform with 24-bit uchars.....
nothing bitfields can't solve, but they have other things to watch out
for ....
in this case, just a single enum might be a way to get the "smallest"
sized integer you actually need....
perhaps not a real example, but stuff like this "bit" me when porting / playing with some code.....
just the fact that if a uchar is thrice as big as what is "normally" expected,
100 such structures might waste a lot of memory on some platforms.....
where "normally" it is not a big deal.....
so things can still be "broken", or in this case "waste a lot of memory very quickly", due
to an assumption that a uchar is "not very wasteful" relative to the RAM available on one platform, compared with another platform.....
the problem might be more prominent e.g. for ints as well, or other types,
e.g. you have some structure that needs 15 bits, so you stick it in an int,
but on some other platform an int is 48 bits or whatever.....
"normally" you might break it into 2 uchars, but e.g. with a 24-bit uchar
you'd only need one.....
so an enum might be a better "generic" solution ....
depends on how you are accessing those bits though :)
so, there might be "design flaws" that rear their head....
even if the code might still work/run fine regardless of the
size of a uchar or uint...
there are things like this to watch out for, even though there
are no "magic numbers" in your code ...
hope this makes sense :)
The weirdest ones I saw were the CDC computers: 6-bit characters, but with 65 encodings. [There was also more than one character set -- you chose the encoding when you installed the OS.]
If a 60-bit word ended with 12, 18, 24, 30, 36, 40, or 48 bits of zero, that was the end-of-line character (e.g. '\n').
Since the 00 (octal) character was : in some code sets, that meant BNF that used ::= was awkward if the :: fell in the wrong column. [This long preceded C++ and other common uses of ::.]
ints used to be 16 bits (PDP-11, etc.). Going to 32-bit architectures was hard. People are getting better: hardly anyone assumes a pointer will fit in a long any more (you don't, right?). Or file offsets, or timestamps, or ...
8-bit characters are already somewhat of an anachronism. We already need 32 bits to hold all the world's character sets.
The Univac 1100 series had two operational modes: 6-bit FIELDATA and 9-bit 'ASCII' packed 6 or 4 characters respectively into 36-bit words. You chose the mode at program execution time (or compile time.) It's been a lot of years since I actually worked on them.

Byte swap of a byte array into a long long

I have a program where I simply copy a byte array into a long long array. There are a total of 20 bytes, so I just needed an array of 3 long longs. The reason I copied the bytes into long longs was to make the code portable on 64-bit systems.
I just need to byte-swap before I populate that array, so that the values that go into it are reversed.
There is a byteswap.h which has an _int64 bswap_64(_int64) function that I think I can use. I was hoping for some help with the usage of that function given my long long array. Would I just pass in the name of the long long array and read it out into another long long array?
I am using C++, not .NET or C#.
Update:
Clearly there are issues I am still confused about. For example, working with byte arrays that just happen to be populated from a 160-bit hex string, which then has to be output in decimal form, made me think that if I just did a simple assignment to a long (4-byte) array my worries would be over. Then I found out that this code would have to run on a 64-bit Sun box. Then I thought that since the sizes of data types can change from one environment to another, a simple assignment would not cut it. That made me think about just using a long long to make the code sort of immune to that size issue. However, then I read about endianness and how 64-bit reads MSB vs 32-bit which is LSB. So, taking my data and reversing it so that it is stored in my long long MSB-first was the only solution that came to mind. Of course, there is the matter of the 4 extra bytes, which in this case does not matter; I will simply take the decimal output and display any six digits I choose. But programmatically, I guess it would be better to just work with 4-byte longs and not deal with that whole wasted-4-bytes issue.
Between this and your previous questions, it sounds like there are several fundamental confusions here:
If your program is going to be run on a 64-bit machine, it sounds like you should compile and unit-test it on a 64-bit machine. Running unit tests on a 32-bit machine can give you confidence the program is correct in that environment, but doesn't necessarily mean the code is correct for a 64-bit environment.
You seem to be confused about how 32- and 64-bit architectures relate to endianness. 32-bit machines are not always little-endian, and 64-bit machines are not always big-endian. They are two separate concepts and can vary independently.
Endianness only matters for single values consisting of multiple bytes; for example, the integer 305,419,896 (0x12345678) requires 4 bytes to represent, or a UTF-16 character (usually) requires 2 bytes to represent. For these, the order of storage matters because the bytes are interpreted as a single unit. It sounds like what you are working with is a sequence of raw bytes (like a checksum or hash). Values like this, where multiple bytes are not interpreted in groups, are not affected by the endianness of the processor. In your case, casting the byte array to a long long * actually creates a potential endianness problem (on a little-endian architecture, your bytes will now be interpreted in the opposite order), not the other way around.
Endianness also doesn't matter unless the little-endian and big-endian versions of your program actually have to communicate with each other. For example, if the little-endian program writes a file containing multi-byte integers without swapping and the big-endian program reads it in, the big-endian program will probably misinterpret the data. It sounds like you think your code that works on a little-endian platform will suddenly break on a big-endian platform even if the two never exchange data. You generally don't need to be worried about the endianness of the architecture if the two versions don't need to talk to each other.
Another point of confusion (perhaps a bit pedantic). A byte does not store a "hex value" versus a "decimal value," it stores an integer. Decimal and hexadecimal are just two different ways of representing (printing) a particular integer value. It's all binary in the computer's memory anyway, hexadecimal is just an easy conversion to and from binary and decimal is convenient to our brains since we have ten fingers.
Assuming what you're trying to do is print the value of each byte of the array as decimal, you could do this:
#include <cstdio>

unsigned char bytes[] = {0x12, 0x34, 0x56, 0x78};
for (size_t i = 0; i < sizeof(bytes) / sizeof(unsigned char); ++i)
{
    printf("%u ", (unsigned int)bytes[i]);   // print each byte as a decimal number
}
printf("\n");
Output should be something like:
18 52 86 120
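And if you really do need to pack those bytes into 64-bit integers in a byte-order-independent way, building each value with explicit shifts avoids the endianness pitfall of casting the array to a long long*; a sketch:

#include <cstdint>
#include <cstddef>

// Assemble a 64-bit value from 8 bytes, most significant byte first. The
// result is identical on little- and big-endian hosts because the byte
// order is spelled out explicitly instead of coming from a pointer cast.
std::uint64_t load_be64(const unsigned char* p)
{
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i)
        v = (v << 8) | p[i];
    return v;
}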
I think you should look at htonl() and family:
http://beej.us/guide/bgnet/output/html/multipage/htonsman.html
This family of functions is used to encode/decode integers for transport between machines that have different sizes/endianness of integers.
Write your program in the clearest, simplest way. You shouldn't need to do anything to make it "portable."
Byte-swapping is done to translate data of one endianness to another. bswap_64 is for resolving incompatibility between different 64-bit systems such as Power and X86-64. It isn't for manipulating your data.
If you want to reverse bytes in C++, try searching the STL for "reverse." You will find std::reverse, a function which takes pointers or iterators to the first and one-past-last bytes of your 20-byte sequence and reverses it. It's in the <algorithm> header.
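For example, reversing a 20-byte buffer in place comes down to a single call:

#include <algorithm>   // std::reverse

void reverse_bytes(unsigned char (&bytes)[20])
{
    // After this call, bytes[0] holds what was in bytes[19], and so on.
    std::reverse(bytes, bytes + 20);
}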