Optimizing away memcpy - c++

In an attempt to avoid breaking strict aliasing rules, I introduced memcpy in a couple of places in my code, expecting it to be a no-op. The following example produces a call to memcpy (or equivalent) on gcc and clang. Specifically, fool<40> always does; foo does on gcc but not clang, and fool<2> does on clang but not gcc. When / how can this be optimized away?
#include <cstdint>
#include <cstring>

uint64_t bar(const uint16_t *buf) {
    uint64_t num[2];
    memcpy(&num, buf, 16);
    return num[0] + num[1];
}

uint64_t foo(const uint16_t *buf) {
    uint64_t num[3];
    memcpy(&num, buf, sizeof(num));
    return num[0] + num[1];
}

template <int SZ>
uint64_t fool(const uint16_t *buf) {
    uint64_t num[SZ];
    memcpy(&num, buf, sizeof(num));
    uint64_t ret = 0;
    for (int i = 0; i < SZ; ++i)
        ret += num[i];
    return ret;
}

template uint64_t fool<2>(const uint16_t*);
template uint64_t fool<40>(const uint16_t*);
And a link to the compiled output (godbolt).

I can't really tell you why exactly the respective compilers fail to optimize the code the way you'd hope in these specific cases. I guess each compiler is either unable to track the relationship that the memcpy establishes between the target array and the source memory (as we can see, they do recognize this relationship in at least some cases), or some heuristic tells them not to make use of it.
Anyway, since the compilers do not behave as we would hope when we rely on them tracking the entire array, we can try to make the relationship more obvious by doing the memcpy on an element-by-element basis. This seems to produce the desired result on both compilers. Note that I had to manually unroll the initialization in bar and foo, as clang will otherwise do a copy again.
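The element-wise version of the template might look like this (a sketch; fool2 is my name, and each uint64_t element corresponds to four uint16_t values in the source buffer):

```cpp
#include <cstdint>
#include <cstring>

// One fixed-size memcpy per element instead of one big memcpy over the
// whole array; compilers lower these to single 8-byte loads.
template <int SZ>
uint64_t fool2(const uint16_t *buf) {
    uint64_t num[SZ];
    for (int i = 0; i < SZ; ++i)
        std::memcpy(&num[i], buf + i * 4, sizeof(num[i]));
    uint64_t ret = 0;
    for (int i = 0; i < SZ; ++i)
        ret += num[i];
    return ret;
}
```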
Apart from that, note that in C++ you should use std::memcpy, std::uint64_t, etc. since the standard headers are not guaranteed to also introduce these names into the global namespace (though I'm not aware of any implementation that doesn't do that).

Related

Passing pointers to arrays of unrelated, but compatible, types without copying?

(Disclaimer: At this point, this is mostly academic interest.)
Imagine I have an external interface like the following, i.e. I do not control its code:
// Provided externally: Cannot (easily) change this:
// fill buffer with n floats:
void data_source_external(float* pDataOut, size_t n);
// send n data words from pDataIn:
void data_sink_external(const uint32_t* pDataIn, size_t n);
Is it possible within standard C++ to "move" / "stream" data between these two interfaces without copying?
That is, is there any way to make the following be non-UB, without copying of the data between two correctly typed buffers?
#include <cstdint>
#include <limits>

int main()
{
    constexpr size_t n = 64;
    float fbuffer[n];
    data_source_external(fbuffer, n);
    // These hold and can be checked statically:
    static_assert(sizeof(float) == sizeof(uint32_t), "same size");
    static_assert(alignof(float) == alignof(uint32_t), "same alignment");
    static_assert(std::numeric_limits<float>::is_iec559 == true, "IEEE 754");
    // This is clearly UB. Any way to make this work without copying the data?
    const uint32_t* buffer_alias = static_cast<uint32_t*>(static_cast<void*>(fbuffer));
    // **Note**:
    // + reinterpret_cast would also be UB.
    data_sink_external(buffer_alias, n);
    // ...
As far as I can tell the following would be defined behavior, at least with regard to strict aliasing:
...
uint32_t ibuffer[n];
std::memcpy(ibuffer, fbuffer, n * sizeof(uint32_t));
data_sink_external(ibuffer, n);
but given that the ibuffer will have exactly the same bits as the fbuffer this seems quite insane.
Or would we expect optimizing compilers to optimize even this copy away? (In a now deleted comment-like answer a user posted a godbolt link that seems to indicate, at least on first glance, that clang 11 indeed would be able to optimize out the memcpy.)
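For what it's worth, the defined-behavior memcpy reinterpretation can be wrapped in a small helper (bits_of is my own name), and the is_iec559 assumption can be sanity-checked against a known bit pattern: on an IEEE 754 platform, 1.0f must come out as 0x3F800000:

```cpp
#include <cstdint>
#include <cstring>

// Well-defined reinterpretation of one float's object representation.
inline std::uint32_t bits_of(float f) {
    std::uint32_t u;
    static_assert(sizeof f == sizeof u, "same size");
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```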
I didn't test and can't comment yet (because I don't have enough reputation), but reinterpret_cast may help in this situation.
Documentation
Basically it tells the compiler: treat this pointer as if it pointed at the type specified in the cast. (As the question already notes, though, actually accessing the floats through such a pointer still violates strict aliasing.)

Auto-vectorization with SSE2 movemask for bytes to bitmap with gcc

With properly constructed C/C++ code one can hint gcc to generate efficient SIMD assembler on its own, without use of intrinsics, e.g. https://locklessinc.com/articles/vectorize/
I am trying to achieve a similar effect for movemask operation (PMOVMSKB / *_movemask_epi8 family), but so far without success.
The simplest code I could think of:
#include <cstdint>
alignas(128) int8_t arr[32];
uint32_t foo()
{
    uint32_t rv = 0;
    for (int it = 0; it < 32; ++it)
    {
        rv |= (arr[it] < 0) << it;
    }
    return rv;
}
leads to assembly that fails to utilize move mask instruction: https://godbolt.org/z/3XimYc
Does anyone have an idea if there's a way to do that with gcc without explicitly using intrinsics?
I haven't looked into MD files and associated implementation in gcc yet (https://github.com/gcc-mirror/gcc/tree/master/gcc/config/i386).
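For reference, the intrinsic version one would hope the auto-vectorizer matches can be written with baseline SSE2 (assuming an x86 target; foo_movemask and arr2 are my own names, combining two 16-byte movemasks to cover the 32-byte array):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

alignas(128) int8_t arr2[32];  // same layout as arr above

// PMOVMSKB collects the sign bit of each byte into an integer mask.
uint32_t foo_movemask() {
    __m128i lo = _mm_load_si128(reinterpret_cast<const __m128i*>(arr2));
    __m128i hi = _mm_load_si128(reinterpret_cast<const __m128i*>(arr2 + 16));
    uint32_t m0 = static_cast<uint32_t>(_mm_movemask_epi8(lo));
    uint32_t m1 = static_cast<uint32_t>(_mm_movemask_epi8(hi));
    return m0 | (m1 << 16);
}
```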

How to read sequence of bytes from pointer in C++ as long?

I have a pointer to a char array, and I need to go along and XOR each byte with a 64 bit mask. I thought the easiest way to do this would be to read each 8 bytes as one long long or uint64_t and XOR with that, but I'm unsure how. Maybe casting to a long long* and dereferencing? I'm still quite unsure about pointers in general, so any example code would be much appreciated as well. Thanks!
EDIT: Example code (just to show what I want, I know it doesn't work):
void encrypt(char* in, uint64_t len, uint64_t key) {
    for (int i = 0; i < (len>>3); i++) {
        (uint64_t*)in ^= key;
        in += 8;
    }
}
The straightforward way to do your XOR-masking is by bytes:
void encrypt(uint8_t* in, size_t len, const uint8_t key[8])
{
    for (size_t i = 0; i < len; i++) {
        in[i] ^= key[i % 8];
    }
}
Note: here the key is an array of 8 bytes, not a 64-bit number. This code is straightforward - no tricks needed, easy to debug. Measure its performance, and be done with it if the performance is good enough.
Some (most?) compilers optimize such simple code by vectorizing it. That is, all the details (casting to uint64_t and such) are performed by the compiler. However, if you try to be "clever" in your code, you may inadvertently prevent the compiler from doing the optimization. So try to write simple code.
P.S. You should probably also use the restrict qualifier, which is non-standard in C++ (most compilers spell it __restrict), but may be required for best performance. I have no experience with using it, so I didn't add it to my example.
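As a sketch of what that would look like with the non-standard __restrict spelling (encrypt_r is my own name; verify the spelling with your compiler):

```cpp
#include <cstdint>
#include <cstddef>

// __restrict promises the compiler that in and key do not overlap,
// which can enable more aggressive vectorization.
void encrypt_r(std::uint8_t* __restrict in, std::size_t len,
               const std::uint8_t* __restrict key) {
    for (std::size_t i = 0; i < len; i++)
        in[i] ^= key[i % 8];
}
```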
If you have a bad compiler, cannot enable the vectorization option, or just want to play around, you can use this version with casting:
void encrypt(uint8_t* in, size_t len, uint64_t key)
{
    uint64_t* in64 = reinterpret_cast<uint64_t*>(in);
    for (size_t i = 0; i < len / 8; i++) {
        in64[i] ^= key;
    }
}
It has some limitations:
Requires the length to be divisible by 8
Requires the pointer to be suitably aligned, or the processor to support unaligned access (x86 does, with at most a small penalty)
Accessing a byte buffer through a uint64_t* violates strict aliasing, so formally this is undefined behavior
Compiler may refuse to vectorize this one, leading to worse performance
As noted by Hurkyl, the order of the 8 bytes in the mask is not clear (on x86, little-endian, the least significant byte will mask the first byte of the input array)
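A middle ground that avoids the alignment and strict-aliasing problems is to memcpy each 8-byte word in and out; compilers typically lower these fixed-size memcpys to single loads and stores. A sketch (encrypt64 is my own name; it still assumes the length is divisible by 8):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

void encrypt64(std::uint8_t* in, std::size_t len, std::uint64_t key) {
    for (std::size_t i = 0; i + 8 <= len; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, in + i, 8);   // load 8 bytes
        word ^= key;
        std::memcpy(in + i, &word, 8);   // store them back
    }
}
```

The byte order of the mask relative to the key is still endian-dependent, exactly as in the casting version.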

How to get the size the trailing padding of a struct or class?

sizeof can be used to get the size of a struct or class. offsetof can be used to get the byte offset of a field within a struct or class.
Similarly, is there a way to get the size of the trailing padding of a struct or class? I'm looking for a way that doesn't depend on the layout of the struct, e.g. requiring the last field to have a certain name.
For the background, I'm writing out a struct to disk but I don't want to write out the trailing padding, so the number of bytes I need to write is the sizeof minus the size of the trailing padding.
CLARIFICATION: The motivation for not writing the trailing padding is to save output bytes. I'm not trying to save the internal padding as I'm not sure about the performance impact of non-aligned access and I want the writing code to be low-maintenance such that it doesn't need to change if the struct definition changes.
How a compiler pads fields in a structure is not strictly defined by the standard, so it is implementation-dependent.
If a data aggregate has to be interchanged, the only solution is to avoid any padding.
This is normally accomplished using #pragma pack(1). This pragma instructs the compiler to pack all fields together on 1-byte boundaries. It will slow down access on some processors, but it makes the structure compact and well defined on any system, and, of course, free of any padding.
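As a sketch of the effect (the exact sizes are ABI-dependent; the values in the comments are what gcc/clang on x86-64 typically produce):

```cpp
#include <cstddef>

struct Normal { int i; char c; };   // typically sizeof == 8: 3 trailing padding bytes

#pragma pack(push, 1)
struct Packed { int i; char c; };   // sizeof == 5: no padding at all
#pragma pack(pop)
```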
pragma pack or equivalent is the canonical way to do that. Apart from that I can only think of a macro, if the number of members is fixed or the maximum number is low, like
$ cat struct-macro.c && echo
#include <stdio.h>
#include <stddef.h>
#define ASSEMBLE_STRUCT3(Sname, at, a, bt, b, ct, c) struct Sname {at a; bt b; ct c; }; \
int Sname##_trailingbytes() { return sizeof(struct Sname) - offsetof(struct Sname, c) - sizeof(ct); }
ASSEMBLE_STRUCT3(S, int, i1, int, i2, char, c)
int main()
{
    printf("%d\n", S_trailingbytes());
}
$ g++ -Wall -o struct-macro struct-macro.c && ./struct-macro
3
$
I wonder if something fancy can be done with a variadic template class in C++. But I can't quite see how the class/structure could be defined and the offset function/constant provided without a macro again -- which would defeat the purpose.
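One macro-free sketch (my own; it computes the offset at run time via pointer arithmetic over the object representation, which is reliable in practice for standard-layout structs) takes a pointer-to-member of the last field:

```cpp
#include <cstddef>

// Trailing padding = total size - offset of last member - size of last member.
template <class S, class M>
std::size_t trailing_bytes(M S::*last) {
    S s{};
    const char* base = reinterpret_cast<const char*>(&s);
    const char* memb = reinterpret_cast<const char*>(&(s.*last));
    return sizeof(S) - static_cast<std::size_t>(memb - base) - sizeof(M);
}

struct S3 { int i1; int i2; char c; };
// trailing_bytes(&S3::c) gives 3 on typical ABIs with 4-byte int alignment.
```

The caller still has to name the last member, so it shares that weakness with the offsetof-based macros.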
The padding could be anywhere inside the struct, except at the very beginning. There is no standard way to disable padding, although some flavour of #pragma pack is a common non-standard extension.
What you actually should do if you want a robust, portable solution, is to write a serialization/de-serialization routine for your struct.
Something like this:
typedef struct
{
    int x;
    int y;
    ...
} mytype_t;

void mytype_serialize (uint8_t* restrict dest, const mytype_t* restrict src)
{
    memcpy(dest, &src->x, sizeof(src->x)); dest += sizeof(src->x);
    memcpy(dest, &src->y, sizeof(src->y)); dest += sizeof(src->y);
    ...
}
And similarly for the other way around.
Please note that padding is there for a reason. If you get rid of it, you sacrifice execution speed in favour of memory size.
EDIT
The weird way to do it, just by skipping trailing padding:
size_t mytype_serialize (uint8_t* restrict dest, const mytype_t* restrict src)
{
    size_t size = offsetof(mytype_t, y); // assuming y is the last member
    memcpy(dest, src, size);
    memcpy(dest+size, &src->y, sizeof(src->y));
    size += sizeof(src->y);
    return size;
}
You need to know the size and do something meaningful with it, because otherwise you can't know the size of the stored data when you need to read it back.
This is a possibility:
#define BYTES_AFTER(st, last) (sizeof (st) - offsetof(st, last) - sizeof ((st*)0)->last)
As is this (C99) approach:
#define BYTES_AFTER(st, last) (sizeof (st) - offsetof(st, last) - sizeof (st){0}.last)
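For example, using the first form (the C99 compound-literal form won't compile as C++):

```cpp
#include <cstddef>

#define BYTES_AFTER(st, last) (sizeof (st) - offsetof(st, last) - sizeof ((st*)0)->last)

struct S { int i1; int i2; char c; };
// BYTES_AFTER(S, c) evaluates to 3 on a typical ABI with 4-byte int alignment.
```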
Another way is just declaring your structs packed via some non-standard #pragma or similar. This would also take care of padding in the middle.
Neither of those two is pretty, though. Sharing between different systems might not work because of different alignment requirements. And using non-standard extensions is, well, non-standard.
Just do the serialization yourself. Maybe something like this:
unsigned char buf[64];
mempcpy(mempcpy(mempcpy(buf,
&st.member_1, sizeof st.member_1),
&st.member_2, sizeof st.member_2),
&st.member_3, sizeof st.member_3);
mempcpy is a GNU extension, if it's not available, just define it yourself:
static inline void * mempcpy (void *dest, const void *src, size_t len) {
    return (char*)memcpy(dest, src, len) + len;
}
IMO, it makes code like that easier to read.

portable ntohl and friends

I'm writing a small program that will save and load data, it'll be command line (and not interactive) so there's no point in including libraries I need not include.
When using sockets directly, I get the ntohl functions just by including sockets, however here I don't need sockets. I'm not using wxWidgets, so I don't get to use its byte ordering functions.
In C++ a lot of things have been standardized recently -- for example, look at timers and regex (although regex is not yet fully supported everywhere), but certainly timers!
Is there a standardised way to convert things to network-byte ordered?
Naturally I've tried searching "c++ network byte order cppreference" and similar things, nothing comes up.
BTW in this little project, the program will manipulate files that may be shared across computers, it'd be wrong to assume "always x86_64"
Is there a standardised way to convert things to network-byte ordered?
No. There isn't.
Boost ASIO has equivalents, but that somewhat violates your requirements.
GCC has __BYTE_ORDER__ which is as good as it will get! It's easy to detect if the compiler is GCC and test this macro, or detect if it is Clang and test that, then stick the byte ordering in a config file and use the pre-processor to conditionally compile bits of code.
There are no C++ standard functions for that, but you can compose the required functionality from the C++ standard functions.
Big-endian-to-host byte-order conversion can be implemented as follows:
#include <boost/detail/endian.hpp>
#include <boost/utility/enable_if.hpp>
#include <boost/type_traits/is_integral.hpp>
#include <algorithm>

#ifdef BOOST_LITTLE_ENDIAN
# define BE_TO_HOST_COPY std::reverse_copy
#elif defined(BOOST_BIG_ENDIAN)
# define BE_TO_HOST_COPY std::copy
#endif

inline void be_to_host(void* dst, void const* src, size_t n) {
    char const* csrc = static_cast<char const*>(src);
    BE_TO_HOST_COPY(csrc, csrc + n, static_cast<char*>(dst));
}

template<class T>
typename boost::enable_if<boost::is_integral<T>, T>::type
be_to_host(T const& big_endian) {
    T host;
    be_to_host(&host, &big_endian, sizeof(T));
    return host;
}
Host-to-big-endian byte-order conversion can be implemented in the same manner.
Usage:
uint64_t big_endian_piece_of_data;
uint64_t host_piece_of_data = be_to_host(big_endian_piece_of_data);
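For the reverse direction, here is a Boost-free sketch along the same lines (host_to_be and the run-time endianness probe are my own names):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Detect host endianness at run time by inspecting the first byte of a 1.
inline bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char b;
    std::memcpy(&b, &probe, 1);
    return b == 1;
}

// Write value into dst in big-endian (network) byte order.
template <class T>
void host_to_be(unsigned char* dst, T value) {
    unsigned char tmp[sizeof(T)];
    std::memcpy(tmp, &value, sizeof(T));
    if (host_is_little_endian())
        std::reverse_copy(tmp, tmp + sizeof(T), dst);
    else
        std::copy(tmp, tmp + sizeof(T), dst);
}
```

The run-time check optimizes out on any real compiler, but a compile-time detection (as in the Boost version above) avoids relying on that.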
The following should work correctly on any endian platform
int32_t getPlatformInt(const uint8_t* bytes, size_t num)
{
    uint32_t ret;
    assert(num == 4);
    ret  = (uint32_t)bytes[0] << 24;  // unsigned shift avoids signed overflow
    ret |= (uint32_t)bytes[1] << 16;
    ret |= (uint32_t)bytes[2] << 8;
    ret |= bytes[3];
    return (int32_t)ret;
}
Your network integer can easily be viewed as an array of bytes using:
uint8_t* p = reinterpret_cast<uint8_t*>(&network_byte_order_int);
The code from Doron that should work on any platform did not work for me on a big-endian system (Power7 CPU architecture).
Using a compiler built-in is much cleaner and worked great for me using gcc on both Windows and *nix (AIX):
uint32_t getPlatformInt(const uint32_t* bytes)
{
    uint32_t ret;
    ret = __builtin_bswap32(*bytes);
    return ret;
}
See also How can I reorder the bytes of an integer in c?