How do I cast from one slice type to another? [duplicate]

I have a [u8; 16384] and a u16. How would I "temporarily transmute" the array so I can set the two u8s at once, the first to the least significant byte and the second to the most significant byte?

The obvious, safe and portable way is to just use math.
fn set_u16_le(a: &mut [u8], v: u16) {
    a[0] = v as u8;
    a[1] = (v >> 8) as u8;
}
If you want a higher-level interface, there's the byteorder crate which is designed to do this.
You should definitely not use transmute to turn a [u8] into a [u16], because that doesn't guarantee anything about the byte order.

slice::align_to and slice::align_to_mut are stable as of Rust 1.30. These functions handle the alignment concerns that sellibitze brings up.
The big- and little- endian problems are still yours to worry about. You may be able to use methods like u16::to_le to help with that. I don't have access to a big-endian computer to test with, however.
fn example(blob: &mut [u8; 16], value: u16) {
    // I copied this example from Stack Overflow without providing
    // rationale why my specific case is safe.
    let (head, body, tail) = unsafe { blob.align_to_mut::<u16>() };
    // This example simply does not handle the case where the input data
    // is misaligned such that there are bytes that cannot be correctly
    // reinterpreted as u16.
    body[0] = value;
}

fn main() {
    let mut data = [0; 16];
    example(&mut data, 500);
    println!("{:?}", data);
}

As DK suggests, you probably shouldn't really use unsafe code to reinterpret the memory... but you can if you want to.
If you really want to go that route, you should be aware of a couple of gotchas:
You could have an alignment problem. If you just take a &mut [u8] from somewhere and convert it to a &mut [u16], it could refer to some memory region that is not properly aligned to be accessed as a u16. Depending on what computer you run this code on, such an unaligned memory access might be illegal. In this case, the program would probably abort somehow. For example, the CPU could generate some kind of signal which the operating system responds to in order to kill the process.
It'll be non-portable. Even without the alignment issue, you'll get different results on different machines (little- versus big-endian machines).
If you can switch it around (creating a u16 array and temporarily dealing with it on a byte level), you would solve the potential memory alignment problem:
/// warning: The resulting byte view is system-specific
unsafe fn raw_byte_access(s16: &mut [u16]) -> &mut [u8] {
    use std::slice;
    slice::from_raw_parts_mut(s16.as_mut_ptr() as *mut u8, s16.len() * 2)
}
On a big-endian machine, this function will not do what you want; you want a little-endian byte order. You can only use this as an optimization for little-endian machines and need to stick with a solution like DK's for big- or mixed-endian machines.


C/C++ Little/Big Endian handler

There are two systems that communicate via TCP. One uses little endian and the other big endian. The ICD between the systems contains a lot of structs (fields). Swapping bytes for each field does not look like the best solution.
Is there any generic solution/practice for handling communication between systems with different endianness?
Each system may have a different architecture, but endianness should be defined by the communication protocol. If the protocol says "data must be sent as big endian", then that's how the system sends it and how the other system receives it.
I am guessing the reason why you're asking is because you would like to cast a struct pointer to a char* and just send it over the wire, and this won't work.
That is generally a bad idea. It's far better to create an actual serializer, so that your internal data is decoupled from the actual protocol, which also means you can easily add support for different protocols in the future, or different versions of the protocols. You also don't have to worry about struct padding, aliasing, or any implementation-defined issues that casting brings along.
So generally, you would have something like:
void Serialize(const struct SomeStruct *s, struct BufferBuilder *bb)
{
    BufferBuilder_append_u16_le(bb, s->SomeField);
    BufferBuilder_append_s32_le(bb, s->SomeOther);
    BufferBuilder_append_u08(bb, s->SomeOther);
}
Where you would already have all these methods written in advance, like
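// append unsigned 16-bit value, little endian
void BufferBuilder_append_u16_le(struct BufferBuilder *bb, uint16_t value)
{
    if (bb->remaining < sizeof(value))
        return; // or some error handling, whatever
    // write low byte first; assumes bb->buffer is a uint8_t* write cursor
    bb->buffer[0] = (uint8_t)(value & 0xFF);
    bb->buffer[1] = (uint8_t)(value >> 8);
    bb->buffer += sizeof(value);
    bb->remaining -= sizeof(value);
}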
We use this approach because it's simpler to unit test these "appending" methods in isolation, and writing (de)serializers is then a matter of just calling them in succession.
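As a rough illustration of "calling them in succession", here is a minimal C++ sketch. The `BufferBuilder` layout (a write cursor plus a remaining-byte count), the `bool` return values, and the `SomeStruct` fields are all assumptions made for this example; the answer does not show them.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical builder: a write cursor and the number of bytes still free.
struct BufferBuilder {
    std::uint8_t* cursor;
    std::size_t remaining;
};

// Appends one byte. Returns false (and writes nothing) if the buffer is full.
inline bool append_u8(BufferBuilder& bb, std::uint8_t value) {
    if (bb.remaining < 1) {
        return false;
    }
    *bb.cursor++ = value;
    bb.remaining -= 1;
    return true;
}

// Appends a 16-bit value in little-endian order, whatever the host byte order.
inline bool append_u16_le(BufferBuilder& bb, std::uint16_t value) {
    if (bb.remaining < sizeof(value)) {
        return false;
    }
    bb.cursor[0] = static_cast<std::uint8_t>(value & 0xFF);  // low byte first
    bb.cursor[1] = static_cast<std::uint8_t>(value >> 8);    // then high byte
    bb.cursor += sizeof(value);
    bb.remaining -= sizeof(value);
    return true;
}

// Hypothetical message type, for illustration only.
struct SomeStruct {
    std::uint16_t some_field;
    std::uint8_t flag;
};

// The serializer is just the append calls in wire order. It stops at the
// first failure, so a failed call may leave a partial message in the buffer.
inline bool serialize(const SomeStruct& s, BufferBuilder& bb) {
    return append_u16_le(bb, s.some_field) && append_u8(bb, s.flag);
}
```

Each append function can be unit tested on its own with a small stack buffer, and the serializer only needs a test for field order.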
But of course, if you can pick any protocol and implement both systems, then you could simply use protobuf and avoid doing a bunch of plumbing.
Generally speaking, values transmitted over a network should be in network byte order, i.e. big endian. So values should be converted from host byte order to network byte order for transmission and converted back when received.
The functions htons and ntohs do this for 16 bit integer values and htonl and ntohl do this for 32 bit integer values. On little endian systems these functions essentially reverse the bytes, while on big endian systems they're a no-op.
So for example if you have the following struct:
struct mystruct {
    char f1[10];
    uint32_t f2;
    uint16_t f3;
};
Then you would serialize the data like this:
// s points to the struct to serialize
// p should be large enough to hold the serialized struct
void serialize(struct mystruct *s, unsigned char *p)
{
    memcpy(p, s->f1, sizeof(s->f1));
    p += sizeof(s->f1);
    uint32_t f2_tmp = htonl(s->f2);
    memcpy(p, &f2_tmp, sizeof(f2_tmp));
    p += sizeof(s->f2);
    uint16_t f3_tmp = htons(s->f3);
    memcpy(p, &f3_tmp, sizeof(f3_tmp));
}
And deserialize it like this:
// s points to a struct which will store the deserialized data
// p points to the buffer received from the network
void deserialize(struct mystruct *s, unsigned char *p)
{
    memcpy(s->f1, p, sizeof(s->f1));
    p += sizeof(s->f1);
    uint32_t f2_tmp;
    memcpy(&f2_tmp, p, sizeof(f2_tmp));
    s->f2 = ntohl(f2_tmp);
    p += sizeof(s->f2);
    uint16_t f3_tmp;
    memcpy(&f3_tmp, p, sizeof(f3_tmp));
    s->f3 = ntohs(f3_tmp);
}
While you could use compiler specific flags to pack the struct so that it has a known size, allowing you to memcpy the whole struct and just convert the integer fields, doing so means that certain fields may not be aligned properly which can be a problem on some architectures. The above will work regardless of the overall size of the struct.
You mention one problem with struct fields. Transmitting structs also requires taking care of the alignment of fields, which causes gaps between them; this is usually controlled with compiler flags.
For binary data one can use Abstract Syntax Notation One (ASN.1) where you define the data format. There are some alternatives. Like Protocol Buffers.
In C, one can use macros to determine endianness and field offsets inside a struct, and hence use such a struct description as the basis for a generic bytes-to-struct conversion. This would work independently of endianness and alignment.
You would need to create such a descriptor for every struct.
Alternatively a parser might generate code for bytes-to-struct conversion.
But then again you could use a language neutral solution like ASN.1.
C and C++ of course have no introspection/reflection capabilities like Java has, so those are the only solutions.
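A minimal sketch of the descriptor idea, written in C++ but relying only on the `offsetof` macro. The `Message` struct, the `FieldDescriptor` type, and the function names are hypothetical, and the sketch assumes the wire format is a sequence of big-endian unsigned integers with no gaps.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical message layout, used only for this illustration.
struct Message {
    std::uint32_t id;
    std::uint16_t length;
};

// One entry per field: where it lives in the struct and how wide it is.
struct FieldDescriptor {
    std::size_t offset;
    std::size_t size;  // 1, 2, 4 or 8 bytes
};

// Descriptor table for Message, listed in wire order.
static const FieldDescriptor kMessageFields[] = {
    {offsetof(Message, id), sizeof(std::uint32_t)},
    {offsetof(Message, length), sizeof(std::uint16_t)},
};

// Narrows the decoded value to T and copies it into the struct member.
template <typename T>
void store_field(unsigned char* dst, std::uint64_t value) {
    const T narrowed = static_cast<T>(value);
    std::memcpy(dst, &narrowed, sizeof(T));
}

// Decodes consecutive big-endian unsigned fields from `wire` into `out`.
// Returns the number of wire bytes consumed.
inline std::size_t decode_big_endian(const unsigned char* wire,
                                     const FieldDescriptor* fields,
                                     std::size_t field_count, void* out) {
    unsigned char* base = static_cast<unsigned char*>(out);
    std::size_t consumed = 0;
    for (std::size_t i = 0; i < field_count; ++i) {
        std::uint64_t value = 0;
        for (std::size_t b = 0; b < fields[i].size; ++b) {
            value = (value << 8) | wire[consumed++];
        }
        unsigned char* dst = base + fields[i].offset;
        switch (fields[i].size) {
            case 1: store_field<std::uint8_t>(dst, value); break;
            case 2: store_field<std::uint16_t>(dst, value); break;
            case 4: store_field<std::uint32_t>(dst, value); break;
            case 8: store_field<std::uint64_t>(dst, value); break;
            default: break;  // unsupported width: field left untouched
        }
    }
    return consumed;
}
```

Because each value is rebuilt with shifts and stored through its struct offset, the same table works regardless of host endianness or struct padding.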
The fastest and most portable way is to use bit shifts.
These have the big advantage that you only need to know the network endianness, never the CPU endianness.
uint8_t buf[4] = { MS_BYTE, ... LS_BYTE}; // some buffer from TCP/IP = Big Endian
uint32_t my_u32 = ((uint32_t)buf[0] << 24) |
((uint32_t)buf[1] << 16) |
((uint32_t)buf[2] << 8) |
((uint32_t)buf[3] << 0) ;
Do not use (bit-field) structs/type punning directly on the input. They are poorly standardized, may contain padding/alignment requirements, and depend on endianness. It is fine to use structs if you have proper serialization/deserialization routines in between. A deserialization routine may contain the above bit shifts, for example.
Do not use pointer arithmetic to iterate across the input, or plain memcpy(). Neither one of these solves the endianness issue.
Do not use htons etc bloat libs. Because they are non-portable. But more importantly because anyone who can't write a simple bit shift like above without having some lib function holding their hand, should probably stick to writing high level code in a more family-friendly programming language.
There is no point in writing code in C if you don't have a clue about how to do efficient, close to the hardware programming, also known as the very reason you picked C for the task to begin with.
Helping hand for people who are confused over how C code gets translated to asm: As we can see, the machine code is identical on x86 Linux. The htonl won't compile on a number of embedded targets, nor on MSVC, while leading to worse performance on Mips64.
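The shift-based approach from this answer can be wrapped in a pair of small helpers. This is only a sketch; the function names `load_u32_be` and `store_u32_be` are made up for the example.

```cpp
#include <cstdint>

// Reads a 32-bit big-endian (network order) value from four bytes.
inline std::uint32_t load_u32_be(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Writes a 32-bit value as four big-endian bytes.
inline void store_u32_be(std::uint8_t* p, std::uint32_t v) {
    p[0] = static_cast<std::uint8_t>(v >> 24);
    p[1] = static_cast<std::uint8_t>(v >> 16);
    p[2] = static_cast<std::uint8_t>(v >> 8);
    p[3] = static_cast<std::uint8_t>(v);
}
```

Neither function needs to know the byte order of the CPU it runs on, and both work on buffers at any alignment because they only touch individual bytes.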

Why does using reinterpret_cast to convert from char* to a structure seem to work normally?

People say it's not good to trust reinterpret_cast to convert from raw data (like char*) to a structure. For example, for the structure
struct A
{
    unsigned int a;
    unsigned int b;
    unsigned char c;
    unsigned int d;
};
sizeof(A) = 16 and __alignof(A) = 4, exactly as expected.
Suppose I do this:
char *data = new char[sizeof(A) + 1];
A *ptr = reinterpret_cast<A*>(data + 1); // +1 is to ensure it doesn't point to 4-byte aligned data
Then copy some data to ptr:
memcpy_s(ptr, sizeof(A),
         "\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00", sizeof(A));
Then ptr->a is 1, ptr->b is 2, ptr->c is 3 and ptr->d is 4.
Okay, seems to work. Exactly what I was expecting.
But the data pointed to by ptr is not 4-byte aligned like A should be. What problems may this cause on an x86 or x64 platform? Performance issues?
For one thing, your initialization string assumes that the underlying integers are stored in little endian format. But another architecture might use big endian, in which case your string will produce garbage. (Some huge numbers.) The correct string for that architecture would be
Then, of course, there is the issue of alignment.
Certain architectures won't even allow you to assign the address of data + 1 to a non-character pointer, they will issue a memory alignment trap.
But even architectures which will allow this (like x86) will perform miserably, having to perform two memory accesses for each integer in the structure. (For more information, see this excellent answer:
Finally, I am not completely sure about this, but I think that C and C++ do not even guarantee to you that an array of characters will contain characters packed in bytes. (I hope someone who knows more might clarify this.) Conceivably, there can be architectures which are completely incapable of addressing non-word-aligned data, so in such architectures each character would have to occupy an entire word. This would mean that it would be valid to take the address of data + 1, because it would still be aligned, but your initialization string would be unsuitable for the intended job, as the first 4 characters in it would cover your entire structure, producing a=1, b=0, c=0 and d=0.
The problem is that you can not be sure if this code will run on another platform, with the next version of Visual Studio, etc. When running on another processor, it may cause a hardware exception.
There was a time when you could read out arbitrary memory locations, but all those programs crash with an "access violation" exception nowadays. Something similar could happen to this program in the future.
However, what you can do, and what any compiler that calls itself "C++ standard compliant" must compile correctly, is this:
You can reinterpret_cast a pointer to something else, and then back to the original type. The value of the type, when read before and after, must stay the same.
I don't know what exactly you want to do, but you might get away with, for example
allocating a struct A
reinterpret_casting it to chars
saving the memory content to a file
and restore everything later:
allocate a struct A
reinterpret_cast it to chars
load the content to memory
reinterpret_cast it back to a struct A
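The save and restore steps above could look roughly like the following sketch. It uses `std::memcpy` on the object's bytes rather than dereferencing a cast pointer, and a byte vector stands in for the file; `save_bytes` and `restore_bytes` are hypothetical names.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct A {
    unsigned int a;
    unsigned int b;
    unsigned char c;
    unsigned int d;
};

// Copies the object representation of `value` into a byte vector
// (a stand-in for "saving the memory content to a file").
inline std::vector<unsigned char> save_bytes(const A& value) {
    std::vector<unsigned char> bytes(sizeof(A));
    std::memcpy(bytes.data(), &value, sizeof(A));
    return bytes;
}

// Restores into a real A object, so the destination is correctly aligned.
// Returns a zero-initialized A if the stored size does not match.
inline A restore_bytes(const std::vector<unsigned char>& stored) {
    A result{};
    if (stored.size() == sizeof(A)) {
        std::memcpy(&result, stored.data(), sizeof(A));
    }
    return result;
}
```

The bytes are only meaningful to a program built with the same compiler, settings and platform, since padding and byte order are part of what gets saved.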

Is explicit alignment necessary?

After some reading, I understand that the compiler pads structs or classes so that each member can be accessed on its naturally aligned boundary. So under what circumstances is it necessary for programmers to specify explicit alignment to achieve better performance? My question arises from here:
Intel 64 and IA-32 Architectures Optimization Reference Manual:
For best performance, align data as follows:
Align 8-bit data at any address.
Align 16-bit data to be contained within an aligned 4-byte word.
Align 32-bit data so that its base address is a multiple of four.
Align 64-bit data so that its base address is a multiple of eight.
Align 80-bit data so that its base address is a multiple of sixteen.
Align 128-bit data so that its base address is a multiple of sixteen.
So suppose I have a struct:
struct A
{
    int a;
    int b;
    int c;
};
// size = 12;
// aligned on boundary of: 4
By creating an array of type A, even if I do nothing, it is properly aligned. Then what's the point to follow the guide and make the alignment stronger?
Is it because of cache line split? Assuming the cache line is 64 bytes. With the 6th access of object in the array, the byte starts from 61 to 72, which slows down the program??
BTW, is there a macro in standard library that tells me the alignment requirement based on the running machine by returning a value of std::size_t?
Let me answer your question directly: No, there is no need to explicitly align data in C++ for performance.
Any decent compiler will properly align the data for underlying system.
The problem would come (variation on above) if you had:
struct A
{
    int w ;
    char x ;
    int y ;
    char z ;
};
This illustrates the two common structure alignment problems.
A compiler would likely insert 3 alignment bytes after both x and z. (1) If there is no padding after x, y is unaligned. (2) If there is no padding after z, w and y will be unaligned in arrays.
The instructions you are reading in the manual are targeted towards assembly language programmers and compiler writers.
When data is unaligned, on some systems (not Intel) it causes an exception and on others it takes multiple processor cycles to fetch and write the data.
The only time I can think of when you want explicit alignment is when you are directly copying/casting data between your struct and a char* for serialization in some type of binary protocol.
Here unexpected padding may cause problems with a remote user of your protocol.
In pseudocode:
struct Data PACKED
{
    char code[3];
    int val;
};
Data data = { "AB", 24 };
char buf[20];
memcpy(buf, &data, sizeof(data));
send(buf, sizeof(data));
Now if our protocol expects 3 octets of code followed by a 4 octet integer value for val, we will run into problems if we use the above code, because padding may be inserted between code and val. The only way to get this to work is for the struct above to be packed (alignment 1).
There is indeed a facility in the language (it's not a macro, and it's not from the standard library) to tell you the alignment of an object or type. It's alignof (see also: std::alignment_of).
To answer your question: In general you should not be concerned with alignment. The compiler will take care of it for you, and in general/most cases it knows much, much better than you do how to align your data.
The only case where you'd need to fiddle with alignment (see alignas specifier) is when you're writing some code which allows some possibly less aligned data type to be the backing store for some possibly more aligned data type.
Examples of things that do this under the hood are std::experimental::optional and boost::variant. There are also facilities in the standard library explicitly for creating such a backing store, namely std::aligned_storage and std::aligned_union.
By creating an array of type A, even if I do nothing, it is properly aligned. Then what's the point to follow the guide and make the alignment stronger?
The ABI only describes how to use the data elements it defines. The guideline doesn't apply to your struct.
Is it because of cache line split? Assuming the cache line is 64 bytes. With the 6th access of object in the array, the byte starts from 61 to 72, which slows down the program??
The cache question could go either way. If your algorithm randomly accesses the array and touches all of a, b, and c then alignment of the entire structure to a 16-byte boundary would improve performance, because fetching any of a, b, or c from memory would always fetch the other two. However if only linear access is used or random accesses only touch one of the members, 16-byte alignment would waste cache capacity and memory bandwidth, decreasing performance.
Exhaustive analysis isn't really necessary. You can just try and see what alignas does for performance. (Or add a dummy member, pre-C++11.)
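To try this out, one could compare the plain struct with an over-aligned variant, along the lines of this sketch. The name `AlignedA` and the two helper functions are made up here, and the sketch assumes `alignof(int)` is no greater than 16.

```cpp
#include <cstddef>

struct A {
    int a;
    int b;
    int c;
};

// Same members, but every object starts on a 16-byte boundary.
struct alignas(16) AlignedA {
    int a;
    int b;
    int c;
};

static_assert(alignof(AlignedA) == 16, "alignas sets the alignment requirement");
static_assert(sizeof(AlignedA) % 16 == 0, "size is rounded up to the alignment");

// Bytes occupied by an array of n elements of each variant.
constexpr std::size_t plain_array_bytes(std::size_t n) { return n * sizeof(A); }
constexpr std::size_t aligned_array_bytes(std::size_t n) { return n * sizeof(AlignedA); }
```

The over-aligned array never lets one element straddle two 16-byte blocks, at the cost of the extra bytes per element; whether that trade pays off is what the measurement has to show.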
BTW, is there a macro in standard library that tells me the alignment requirement based on the running machine by returning a value of std::size_t?
C++11 (and C11) have an alignof operator.
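A short sketch of how `alignof` can be used in C++11 and later. The struct `Mixed` and the constant names are illustrative only, and the exact values are implementation-defined, so the checks below only state relationships that hold everywhere.

```cpp
#include <cstddef>

struct Mixed {
    int w;
    char x;
    int y;
    char z;
};

// alignof is evaluated at compile time and yields a std::size_t.
constexpr std::size_t kIntAlign = alignof(int);
constexpr std::size_t kMixedAlign = alignof(Mixed);

// A struct is at least as strictly aligned as its most demanding member,
// and its size is always a multiple of its alignment.
static_assert(kMixedAlign >= kIntAlign, "struct alignment covers its members");
static_assert(sizeof(Mixed) % alignof(Mixed) == 0, "size is a multiple of alignment");
static_assert(alignof(char) == 1, "char has the weakest alignment");
```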

Correct way to serialize binary data in C++

After having read the following 1 and 2 Q/As and having used the technique discussed below for many years on x86 architectures with GCC and MSVC without seeing any problems, I'm now very confused as to what is supposed to be the correct, but just as importantly the "most efficient", way to serialize and then deserialize binary data using C++.
Given the following "wrong" code:
int main()
{
    std::ifstream strm("file.bin");
    char buffer[sizeof(int)] = {0};
    strm.read(buffer, sizeof(int));
    int i = 0;
    // Experts seem to think doing the following is bad and
    // could crash entirely when run on ARM processors:
    i = *reinterpret_cast<int*>(buffer);
    return 0;
}
Now as I understand things, the reinterpret cast indicates to the compiler that it can treat the memory at buffer as an integer and subsequently is free to issue integer compatible instructions which require/assume certain alignments for the data in question - with the only overhead being the extra reads and shifts when the CPU detects the address it is trying to execute alignment oriented instructions is actually not aligned.
That said the answers provided above seem to indicate as far as C++ is concerned that this is all undefined behavior.
Assuming that the alignment of the location in buffer from which cast will occur is not conforming, then is it true that the only solution to this problem is to copy the bytes 1 by 1? Is there perhaps a more efficient technique?
Furthermore I've seen over the years many situations where a struct made up entirely of pods (using compiler specific pragmas to remove padding) is cast to a char* and subsequently written to a file or socket, then later on read back into a buffer and the buffer cast back to a pointer of the original struct, (ignoring potential endian and float/double format issues between machines), is this kind of code also considered undefined behaviour?
The following is more complex example:
int main()
{
    std::ifstream strm("file.bin");
    char storage[1000] = {0};
    char* buffer = storage;
    const std::size_t size = sizeof(int) + sizeof(short) + sizeof(float) + sizeof(double);
    const std::size_t weird_offset = 3;
    buffer += weird_offset;
    strm.read(buffer, size);
    int i = 0;
    short s = 0;
    float f = 0.0f;
    double d = 0.0;
    // Experts seem to think doing the following is bad and
    // could crash entirely when run on ARM processors:
    i = *reinterpret_cast<int*>(buffer);
    buffer += sizeof(int);
    s = *reinterpret_cast<short*>(buffer);
    buffer += sizeof(short);
    f = *reinterpret_cast<float*>(buffer);
    buffer += sizeof(float);
    d = *reinterpret_cast<double*>(buffer);
    buffer += sizeof(double);
    return 0;
}
First, you can correctly, portably, and efficiently solve the alignment problem using, e.g., std::aligned_storage<sizeof(int), std::alignment_of<int>::value>::type instead of char[sizeof(int)] (or, if you don't have C++11, there may be similar compiler-specific functionality).
Even if you're dealing with a complex POD, aligned_storage and alignment_of will give you a buffer that you can memcpy the POD into and out of, construct it into, etc.
In some more complex cases, you need to write more complex code, potentially using compile-time arithmetic and template-based static switches and so on, but so far as I know, nobody came up with a case during the C++11 deliberations that wasn't possible to handle with the new features.
However, just using reinterpret_cast on a random char-aligned buffer is not enough. Let's look at why:
the reinterpret cast indicates to the compiler that it can treat the memory at buffer as an integer
Yes, but you're also indicating that it can assume that the buffer is aligned properly for an integer. If you're lying about that, it's free to generate broken code.
and subsequently is free to issue integer compatible instructions which require/assume certain alignments for the data in question
Yes, it's free to issue instructions that either require those alignments, or that assume they're already taken care of.
with the only overhead being the extra reads and shifts when the CPU detects the address it is trying to execute alignment oriented instructions is actually not aligned.
Yes, it may issue instructions with the extra reads and shifts. But it may also issue instructions that don't do them, because you've told it that it doesn't have to. So, it could issue a "read aligned word" instruction which raises an interrupt when used on non-aligned addresses.
Some processors don't have a "read aligned word" instruction, and just "read word" faster with alignment than without. Others can be configured to suppress the trap and instead fall back to a slower "read word". But others—like ARM—will just fail.
Assuming that the alignment of the location in buffer from which cast will occur is not conforming, then is it true that the only solution to this problem is to copy the bytes 1 by 1? Is there perhaps a more efficient technique?
You don't need to copy the bytes 1 by 1. You could, for example, memcpy each variable one by one into properly-aligned storage. (That would only be copying bytes 1 by 1 if all of your variables were 1-byte long, in which case you wouldn't be worried about alignment in the first place…)
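A minimal sketch of that memcpy approach. The helper name `read_value` is made up for the example; it copies from any byte position into a properly aligned local and advances the read cursor.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Copies sizeof(T) bytes from an arbitrarily aligned position into a
// properly aligned T, then advances the cursor past those bytes.
template <typename T>
T read_value(const char*& cursor) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types can be copied byte-wise");
    T value;
    std::memcpy(&value, cursor, sizeof(T));
    cursor += sizeof(T);
    return value;
}
```

With a `const char* p` pointing at offset 3 of a buffer, `read_value<int>(p)` followed by `read_value<short>(p)` reads both values without ever forming a misaligned `int*` or `short*`. Compilers commonly turn a fixed-size memcpy like this into a single load where the target allows it.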
As for casting a POD to char* and back using compiler-specific pragmas… well, any code that relies on compiler-specific pragmas for correctness (rather than for, say, efficiency) is obviously not correct, portable C++. Sometimes "correct with g++ 3.4 or later on any 64-bit little-endian platform with IEEE 64-bit doubles" is good enough for your use cases, but that's not the same thing as actually being valid C++. And you certainly can't expect it to work with, say, Sun cc on a 32-bit big-endian platform with 80-bit doubles and then complain that it doesn't.
For the example you added later:
// Experts seem to think doing the following is bad and
// could crash entirely when run on ARM processors:
buffer += weird_offset;
i = *reinterpret_cast<int*>(buffer);
buffer += sizeof(int);
Experts are right. Here's a simple example of the same thing:
int i[2];
char *c = reinterpret_cast<char *>(i) + 1;
int *j = reinterpret_cast<int *>(c);
int k = *j;
The variable i will be aligned at some address divisible by 4, say, 0x01000000. So, j will be at 0x01000001. So the line int k = *j will issue an instruction to read a 4-byte-aligned 4-byte value from 0x01000001. On, say, PPC64, that will just take about 8x as long as int k = *i, but on, say, ARM, it will crash.
So, if you have this:
int i = 0;
short s = 0;
float f = 0.0f;
double d = 0.0;
And you want to write it to a stream, how do you do it?
How do you read back from a stream?
Presumably whatever kind of stream you're using (whether ifstream, FILE*, whatever) has a buffer in it, so readFromStream(&f) is going to check whether there are sizeof(float) bytes available, read the next buffer if not, then copy the first sizeof(float) bytes from the buffer to the address of f. (In fact, it may even be smarter—it's allowed to, e.g., check whether you're just near the end of the buffer, and if so issue an asynchronous read-ahead, if the library implementer thought that would be a good idea.) The standard doesn't say how it has to do the copy. Standard libraries don't have to run anywhere but on the implementation they're part of, so your platform's ifstream could use memcpy, or *(float*), or a compiler intrinsic, or inline assembly—and it will probably use whatever's fastest on your platform.
So, how exactly would unaligned access help you optimize this or simplify it?
In nearly every case, picking the right kind of stream, and using its read and write methods, is the most efficient way of reading and writing. And, if you've picked a stream out of the standard library, it's guaranteed to be correct, too. So, you've got the best of both worlds.
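One way this could look with the standard streams is sketched below. `write_to_stream` and `read_from_stream` are hypothetical helper names, not functions from the standard library, and the sketch writes the host's own byte order and value sizes.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

// Writes the object representation of a trivially copyable value.
template <typename T>
void write_to_stream(std::ostream& out, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value, "raw bytes only");
    out.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Reads sizeof(T) bytes straight into an existing, properly aligned object.
// Returns false if the stream could not supply that many bytes.
template <typename T>
bool read_from_stream(std::istream& in, T& value) {
    static_assert(std::is_trivially_copyable<T>::value, "raw bytes only");
    in.read(reinterpret_cast<char*>(&value), sizeof(T));
    return static_cast<bool>(in);
}
```

Because the destination is a real `int`, `float` or `double` object, it is always correctly aligned, and the stream's internal buffer can sit at any address. The helpers work the same way with a `std::stringstream` or with a file stream opened using `std::ios::binary`.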
If there's something peculiar about your application that makes something different more efficient—or if you're the guy writing the standard library—then of course you should go ahead and do that. As long as you (and any potential users of your code) are aware of where you're violating the standard and why (and you actually are optimizing things, rather than just doing something because it "seems like it should be faster"), this is perfectly reasonable.
You seem to think that it would help to be able to put them into some kind of "packed struct" and just write that, but the C++ standard does not have any such thing as a "packed struct". Some implementations have non-standard features that you can use for that. For example, both MSVC and gcc will let you pack the above into 18 bytes on i386, and you can take that packed struct and memcpy it, reinterpret_cast it to char * to send over the network, whatever. But it won't be compatible with the exact same code compiled by a different compiler that doesn't understand your compiler's special pragmas. It won't even be compatible with a related compiler, like gcc for ARM, which will pack the same thing into 20 bytes. When you use non-portable extensions to the standard, the result is not portable.

Which of these two methods of converting this array to an integer would you suggest?

Consider the following array of bytes that is intended to be converted into a single unsigned integer:
unsigned char arr[3] = {0x23, 0x45, 0x67};
Each byte represents the equivalent byte in the integer. Which one of the following methods would you suggest, especially performance-wise?
unsigned int val1 = arr[2] << 16 | arr[1] << 8 | arr[0];
unsigned int val2=arr[0];
*((char *)&val2+1)=arr[1];
*((char *)&val2+2)=arr[2];
I prefer the first method because it is portable. The second isn't due to endianness issues.
This depends on your specific processor, a lot.
For example, on the PowerPC, the second form -- writing through the character pointers -- runs into a tricky implementation detail called a load-hit-store. This is a CPU stall that occurs when you store to a location in memory, then read it back again before the store has completed. The load op cannot complete until the store has finished (most PPCs do not have memory store-forwarding), and the store may take many cycles to make it from the CPU out to the memory cache.
Because of the way the store and arithmetic units are arranged in the pipeline, the CPU will have to flush the pipeline completely until the store completes: this can be a stall of twenty cycles or more during which the CPU has stopped dead. In general, writing to memory and then reading it back immediately is very bad on this platform. So on this case, the sequential bitshifts will be much faster, as they all occur on registers, and will not incur a pipeline stall.
On the Pentium series, the situation may be entirely reversed, because that chipset does have store forwarding and a fast stack architecture, and relatively few architectural registers. On the Core Duos and i7s, it may reverse yet again, because their pipelines are very deep.
Remember: it is not the case that every opcode takes one cycle. CPUs are not simple, and things like superscalar pipes and data hazards may cause instructions to take many cycles, or even many instructions to occur per cycle, depending on just how you arrange your code.
All of this just to underscore the point: this sort of optimization is extremely specific to a particular compiler and chipset. So you must compile, test and measure.
The first is faster when translated to x86 asm, though it depends on your architecture anyway. Compilers are usually able to optimize the first expression very well, and it is more portable too.
The performance depends on the compiler and the machine. For example, in my experiment with gcc 4.4.5 on x64 the second was marginally faster, while others report the first as being faster. Therefore I recommend to stick with the first one because it is cleaner (no casts) and safer (no endianness issues).
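For reference, the first method can be written as a small function. This is a sketch with a made-up name, `bytes_to_u32_le`; the explicit casts keep every shift in unsigned arithmetic.

```cpp
#include <cstdint>

// Combines three bytes into one value, treating arr[0] as least significant.
inline std::uint32_t bytes_to_u32_le(const unsigned char arr[3]) {
    return (static_cast<std::uint32_t>(arr[2]) << 16) |
           (static_cast<std::uint32_t>(arr[1]) << 8) |
           static_cast<std::uint32_t>(arr[0]);
}
```

For the array in the question, `{0x23, 0x45, 0x67}`, the result is `0x674523` on every platform, because the byte positions are fixed by the shifts and not by the CPU's byte order.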
I believe bit shifting will be the fastest solution. In my mind the CPU can just slide in the values, but by going directly to the address, like your second example, it will have to use many temporary storage locations.
I would suggest a solution with union :
union color {
    // first representation (member of union)
    struct s_color {
        unsigned char a, b, g, r;
    } uc_color;
    // second representation (member of union)
    unsigned int int_color;
};

int main()
{
    color a;
    a.int_color = 0x23567899;
}
Take care: this is platform dependent (the result depends on endianness).