memmove vs copying backwards - C++

I understand that memmove in C (the cstring library) handles overlaps nicely, "at the cost of slower runtime" (see this post). I was wondering where this additional runtime cost comes from. It seems to me that any overlap problem could be fixed by copying backwards instead of forwards; am I wrong?
As a toy example, here are two versions of a "right-shift" function, that shifts the contents of an array by one element on the right:
// Using memmove
#include <cstring> // memmove
#include <new>     // placement new

template <typename T>
void shift_right( T *data, unsigned n )
{
    if (n)
    {
        data[n-1].~T();                           // destroy the last element
        memmove( data+1, data, (n-1)*sizeof(T) ); // shift the raw bytes one slot right
        new (data) T();                           // default-construct a fresh first element
    }
}
// Using copy_backward
#include <algorithm> // std::copy_backward

template <typename Iterator>
void shift_right( Iterator first, Iterator last )
{
    Iterator it = last;
    std::copy_backward( first, --it, last ); // copy [first, last-1) one slot to the right
}
Are they equivalent? Performance-wise, which one is best to use?
Note: judging by the comment of @DieterLücking, and despite the precautions taken, the version above using memmove is unsafe in this situation.

Assuming a good implementation, the only "extra cost" of memmove is the initial check (an add and a compare-and-branch) to decide whether to copy front-to-back or back-to-front. This cost is so completely negligible (the add and compare will be hidden by ILP and the branch is perfectly predictable under normal circumstances) that on some platforms, memcpy is just an alias of memmove.
In anticipation of your next question ("if memcpy isn't significantly faster than memmove, why does it exist?"), there are a few good reasons to keep memcpy around. The best one, to my mind, is that some CPUs essentially implement memcpy as a single instruction (rep/movs on x86, for example). These HW implementations often have a preferred (fast) direction of operation (or they may only support copying in one direction). A compiler may freely replace memcpy with the fastest instruction sequence without worrying about these details; it cannot do the same for memmove.
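For illustration, here is roughly what that initial check amounts to; a minimal sketch, byte-by-byte for clarity, not any particular library's actual code:

#include <cstddef>

// Sketch of memmove's direction dispatch. Real implementations copy in wide,
// aligned blocks; a real libc may also compare unrelated pointers like this
// because it can rely on implementation-specific behavior.
void* my_memmove(void* dest, const void* src, std::size_t n)
{
    unsigned char* d = static_cast<unsigned char*>(dest);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    if (d < s) {
        for (std::size_t i = 0; i < n; ++i)   // front-to-back is safe here
            d[i] = s[i];
    } else {
        for (std::size_t i = n; i != 0; --i)  // back-to-front avoids clobbering
            d[i - 1] = s[i - 1];
    }
    return dest;
}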

Memmove figures out for you whether to copy backward or forward; it is also highly optimized for this task (e.g. copying in SSE-sized blocks as much as feasible).
It is unlikely that you can do any better by calling a generic STL algorithm (the best it could do is call memcpy or memmove behind the scenes), but of course you could answer this question simply by running your code and timing it.
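If you do want to time it, a rough harness along these lines would do (the sizes and setup here are assumptions; adjust to your use case):

#include <algorithm>
#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>

int main()
{
    // Two identical buffers; shift each right by one element.
    std::vector<int> a(10000000, 42), b = a;

    auto t0 = std::chrono::steady_clock::now();
    std::copy_backward(a.begin(), a.end() - 1, a.end());
    auto t1 = std::chrono::steady_clock::now();
    std::memmove(b.data() + 1, b.data(), (b.size() - 1) * sizeof(int));
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "copy_backward: " << ms(t1 - t0).count() << " ms\n"
              << "memmove:       " << ms(t2 - t1).count() << " ms\n";
}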

From the post you actually linked (emphasis mine):

memcpy just loops, while memmove performs a test to determine which direction to loop in to avoid corrupting the data. These implementations are rather simple. Most high-performance implementations are more complicated (involving copying word-size blocks at a time rather than bytes).

The appropriate ways to copy or move are std::copy, std::copy_n, std::copy_backward and std::move. A proper C++ library will use memcpy or memmove where applicable. Hence there is no need to invite undefined behavior when the copied or moved sequence holds non-trivial data.
Note: here std::move is the algorithm template 'OutputIterator move(InputIterator first, InputIterator last, OutputIterator result);' (for @Void)
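For instance, the right shift from the question can be written safely for arbitrary types with these algorithms; a sketch using std::move_backward:

#include <algorithm>
#include <iterator>

// Safe right shift for any movable T: the last element is overwritten,
// the first is left in its moved-from state.
template <typename Iterator>
void shift_right(Iterator first, Iterator last)
{
    if (first != last)
        std::move_backward(first, std::prev(last), last);
}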

Related

Type-Safe C++ wrapper for memcpy?

Given that std::copy (for Trivial Types obviously) can only be implemented as a wrapper around memmove(*), I'm wondering:
Is there a Standard C++ type-safe wrapper for the times you need memcpy? (I can't count the number of times I forgot to multiply by sizeof.)
If there's nothing in the standard, have there been any proposals for this? If not, why not?
Are there any specific obstacles in providing a memcpy wrapper that does the sizeof multiplication automatically?
(*): C++ Standard Library implementations (from as far back as MSVC 2005(!) up to modern MSVC 2015, libc++, etc.) decay std::copy over TriviallyCopyable types to memmove. But not to memcpy. Because:
std::copy(src_first, src_last, destination_first) defines that:
The behavior is undefined if d_first is within the range [first, last).
Only the beginning of the destination range MUST NOT be within the source range. The destination range is allowed to extend into the source range. That is, d_first can be "to the left" of the source range, and the destination range can extend into the source range.
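A small example of the distinction (a sketch, assuming int elements):

#include <algorithm>

void overlap_demo()
{
    int buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};

    // Allowed: d_first (buf) is not inside the source range [buf+2, buf+6),
    // even though the destination [buf, buf+4) extends into the source.
    std::copy(buf + 2, buf + 6, buf);  // buf becomes 2,3,4,5,4,5,6,7

    // Undefined: d_first (buf+2) lies inside the source range [buf, buf+4):
    // std::copy(buf, buf + 4, buf + 2);
}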
For std::memcpy the definition is that
If the objects overlap, the behavior is undefined.
That is, the full ranges must not overlap: this is what allows memcpy to be the fastest variant, because it can just assume that the memory of source and destination is completely disjoint.
For std::memmove, the definition is:
The objects may overlap: copying takes place as if the characters were copied to a temporary character array and then the characters were copied from the array to dest.
That is, the source and destination range may arbitrarily overlap, there is no restriction.
Given this, it is clear that you can use std::memmove to implement std::copy for TriviallyCopyable types, because memmove doesn't impose any restrictions and the dispatch to the correct implementation can be done at compile time via type traits --
but it's hard to implement std::copy in terms of memcpy, because (a) the check whether the pointer ranges overlap would have to be done at run time, and (b) even implementing that runtime check for unrelated memory ranges could be quite a mess.
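A sketch of that compile-time dispatch, restricted to raw pointers for brevity (the names are assumptions; real libraries use their own internal traits):

#include <cstring>
#include <type_traits>

// Trivially copyable types: forward to memmove, which tolerates overlap.
template <typename T>
T* copy_impl(const T* first, const T* last, T* d_first, std::true_type)
{
    const auto n = static_cast<std::size_t>(last - first);
    std::memmove(d_first, first, n * sizeof(T));
    return d_first + n;
}

// Everything else: plain element-by-element assignment (forward order,
// which is exactly what std::copy's precondition permits).
template <typename T>
T* copy_impl(const T* first, const T* last, T* d_first, std::false_type)
{
    while (first != last)
        *d_first++ = *first++;
    return d_first;
}

template <typename T>
T* my_copy(const T* first, const T* last, T* d_first)
{
    return copy_impl(first, last, d_first, std::is_trivially_copyable<T>{});
}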
So, this leaves us with
void* memcpy( void* dest, const void* src, std::size_t count );
a function with a less-than-stellar interface, where you constantly need to multiply the element count by sizeof and which is totally untyped.
But memcpy is fastest (and by quite a margin; measure it yourself), and when you need fast copies of TriviallyCopyable types, you reach for memcpy. Which superficially should be easy to wrap in a type-safe wrapper like:
#include <cstring>

template <typename T>
T* trivial_copy(T* dest, const T* src, std::size_t n)
{
    return static_cast<T*>(std::memcpy(dest, src, sizeof(T) * n));
}
but then it's unclear whether you should do compile-time checks via std::is_trivial or some such, and of course there may be some discussion whether to keep the exact memcpy signature order, yadda yadda.
So do I really have to reinvent this wheel myself? Was it discussed for the standard? Etc.
To clarify the difference between memcpy and memmove, according to the docs:
memmove can copy memory to a location that overlaps the source memory; for memcpy this is undefined behavior.
"The objects may overlap: copying takes place as if the characters were copied to a temporary character array and then the characters were copied from the array to dest."
Is there a Standard C++ type-safe wrapper for the times you need memcpy? (I can't count the number of times I forgot to multiply by sizeof.)
Yes, std::copy (maybe, explained below)
If there's nothing in the standard, have there been any proposals for this? If not, why not?
As far as I know, the standard does not enforce the usage of memmove/memcpy in std::copy for trivial types, so it's up to the implementation. For example, in Visual Studio 2015 Update 2 they used memmove to speed things up:
"Increased the speed of std::vector reallocation and std::copy(); they are up to 9x faster as they call memmove() for trivially copyable types (including user-defined types)."
Are there any specific obstacles in providing a memcpy wrapper that does the sizeof multiplication automatically?
No; in fact you can implement this yourself by using std::is_trivial.
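For example, a minimal sketch; note that it gates on std::is_trivially_copyable, the trait that actually matches memcpy's precondition (the function name is made up):

#include <cstring>
#include <type_traits>

template <typename T>
T* checked_memcpy(T* dest, const T* src, std::size_t n)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    std::memcpy(dest, src, sizeof(T) * n);  // the sizeof multiplication lives here, once
    return dest;
}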
Edit:
According to this document, section 25.3.1, there are no restrictions on the std::copy implementation, only on its complexity:
Complexity: Exactly last - first assignments.
And this makes perfect sense when you consider that memcpy uses CPU-specific instructions (which are not available on all CPUs) to speed up memory copies.

Is it cheap to construct a const std::string from a const char* + length?

How expensive is it to execute
const std::string s(my_const_char_ptr, my_length);
? Is there copying involved? If not, how many instructions can I expect from typical standard library implementations? Few enough to have this in performance-critical code?
... or must I get a GSL implementation and use string_view?
You need to measure it on your target system and compilation settings if you want to know the exact answer. But for what's happening under the hood:
If my_length is large enough, or if the standard library's std::string implementation doesn't use the small string optimization (which would be rare), then there will be a dynamic memory allocation.
In any case, there will be an O(n) character-by-character copy from *my_const_char_ptr to the std::string's buffer.
How expensive is it to execute...
pretty cheap if you do it once
? Is there copying involved?
Yes, but that's the least of your worries. A linear copy is about the cheapest thing you can do on a modern architecture (because of pipelining, prefetching, etc.).
how many instructions can I expect from typical standard library implementations?
fewer than you'd think - particularly in the copy part. The implementation (built with -O2) will seek to loop-unroll, vectorise, and transfer large words at once where possible. Memory alignment will actually be the biggest arbiter of performance.
Few enough to have this in performance-critical code?
If it's performance-critical, pre-allocate the string and re-use it (see std::string::assign(first, last)). Again, the copying bit is as cheap as chips. It's the memory allocation which will kill you, so do it once.
... or must I get a GSL implementation and use string_view?
Not unless the strings are absolutely enormous.
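A sketch of the pre-allocate-and-reuse pattern suggested above (the hot loop and the process function are hypothetical):

#include <cstddef>
#include <string>

void process(const char* data, std::size_t len);  // hypothetical consumer

void hot_loop(const char* const* ptrs, const std::size_t* lens, std::size_t count)
{
    std::string s;
    s.reserve(1024);                 // one allocation up front (pick a sensible bound)
    for (std::size_t i = 0; i < count; ++i) {
        s.assign(ptrs[i], lens[i]);  // reuses the buffer while it fits; only the O(n) copy remains
        process(s.data(), s.size());
    }
}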
Per the standard, Table 66:
basic_string(const charT*, size_type, const Allocator&) effects
data(): points at the first element of an allocated copy of the array whose first element is pointed at by s
[...]
So we are going to have an allocation and O(N) copies, where N is the size passed to the function.

Performance of std::copy of portion of std::vector

I want to copy part of a vector to itself, e.g.
size_t offset; /* some offset */
std::vector<T> a = { /* blah blah blah */};
std::copy(a.begin() + offset, a.begin() + (offset*2), a.begin());
However, I'm concerned about the performance of this approach. I'd like to have this boil down to a single memmove (or equivalent) when the types in question allow it, but still behave as one would expect when given a non-trivially-copyable type.
When the template type T is trivially copyable (in particular int64_t, if it matters), does this result in one memmove of length sizeof(T) * offset, or offset distinct memmoves of length sizeof(T)? I assume the latter would give noticeably worse performance because it requires many separate memory reads. Or should I just assume that caching will make the performance in these situations effectively equivalent for relatively small offsets (<100)?
In cases where the template type T is not trivially copyable, is it guaranteed to result in offset distinct calls to the copy assignment operator T::operator=, or will something stranger happen?
If std::copy doesn't yield the result I'm looking for, is there some alternative approach that would satisfy my performance constraints without just writing template-specializations of the copy code for all the types in question?
Edit: GCC 5.1.0, compiling with -O3
There are no guarantees about how standard library functions are implemented, other than the guarantees which explicitly appear in the standard, which cover:
the observable effect of valid invocations, and
space and time complexity (in this case: strictly linear in the number of objects to copy, assuming that copying an object is O(1)).
So std::copy might or might not do the equivalent of memmove. It might do an element-by-element copy in a simple loop. Or it might unroll the loop. Or it might call memmove. Or it might find an even faster solution, based on the compiler's knowledge about the alignment of the datatypes, possibly using a vectorizing optimization.
<rant>Contrary to what seems to be popular opinion, the authors of the standard C++ library are not in a conspiracy to slow down your code, nor are they so incompetent that anyone with a couple of months of coding experience could easily generate faster code. For particular use cases, you might be able to leverage your knowledge about the data being moved around to find a faster solution, but in general -- and particularly without profiling real code -- your best bet is to assume that the standard library authors are excellent coders dedicated to making your programmes as efficient as possible.
</rant>
If the question is about the standard, the answer is 'anything can happen'. It might do memmove(), it might not. On the other hand, if the question is about a particular implementation, then you should not ask, but instead check your implementation.
On my implementation, it is a memmove() call.
By the way, it is hard to imagine an implementation doing offset separate memory moves. It would be either a single call to memmove() or a looped element-by-element copy; calling memmove() once per element just makes no sense.

Is accessing the elements of a char* or std::string faster?

I have seen char* vs std::string in c++, but am still wondering if accessing the elements of a char* is faster than std::string.
If you need to know, the char*/std::string will contain less than 80 characters, but I would like to know a cutoff if there is one.
I would also like to know the answer to this question for different compilers and different Operating Systems, if there is a difference.
Thanks in advance!
Edit: I would be accessing the elements using array[n], and would set the values once.
(Note: If this doesn't meet the help center, please let me know how I can reword it before down-voting)
They should be equivalent in general, though std::string might be a tiny bit slower. Why? Because of short-string optimization.
Short-string optimization is a trick some implementations use to store short strings in std::string without allocating any memory. Usually this is done by doing something like this (though different variations exist):
union {
    char* data_ptr;
    char short_string[sizeof(char*)];
};
Then std::string can use the short_string array to store the data, but only if the size of the string is short enough to fit in there. If not, then it will need to allocate memory and use data_ptr to store that pointer.
Depending on how short-string optimization is implemented, whenever you access data in a std::string, it needs to check its length and determine whether it's using short_string or data_ptr. This check is not totally free: it takes at least a couple of instructions and might cause a branch misprediction or inhibit prefetching in the CPU.
libc++ uses a short-string optimization roughly like this, which requires checking whether the string is short or long on every access.
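A hypothetical sketch of that per-access branch (not libc++'s actual layout or code):

#include <cstddef>

struct sso_branchy
{
    bool is_short;
    union {
        char* data_ptr;
        char  short_string[sizeof(char*)];
    };

    char& operator[](std::size_t i)
    {
        // The short/long check happens on every element access.
        return is_short ? short_string[i] : data_ptr[i];
    }
};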
libstdc++ also uses short-string optimization, but implements it slightly differently and actually avoids any extra access cost. Its union is between a short_string array and an allocated_capacity integer, which means its data_ptr can always point at the real data (whether it lives in short_string or in an allocated buffer), so no extra steps are needed when accessing it.
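A sketch of that layout (simplified, with assumed member names; not the real libstdc++ class):

#include <cstddef>

struct sso_branchless
{
    char* data_ptr;  // always points at the live buffer, short or long
    std::size_t size;
    union {
        char short_string[16];           // in-object storage for short strings
        std::size_t allocated_capacity;  // heap capacity when long
    };

    char& operator[](std::size_t i)
    {
        return data_ptr[i];  // no branch: data_ptr is always valid
    }
};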
If std::string doesn't use short-string optimization (or if it's implemented like in libstdc++), then it should be the same as using a char*. I disagree with black's statement that there is an extra level of indirection in this situation. The compiler should be able to inline operator[] and it should be the same as directly accessing the internal data pointer in the std::string.
Since you don't have direct access to the underlying CharT sequence, accessing it will require an extra layer through the public interface. So it could be slower, probably requiring 20-30 cycles more. Even then, you might only see a difference in a tight loop.
However, it's extremely easy to optimize this out considering the large range of techniques a compiler can employ (caching, inlining, non-standard function calls and so on) if you instruct it to.

Why is 'unbounded_array' more efficient than 'vector'?

It says here that
The unbounded array is similar to a std::vector in that it can grow in size beyond any fixed bound. However unbounded_array is aimed at optimal performance. Therefore unbounded_array does not model a Sequence like std::vector does.
What does this mean?
As a Boost developer myself, I can tell you that it's perfectly fine to question the statements in the documentation ;-)
From reading those docs, and from reading the source code (see storage.hpp), I can say that the claim is somewhat correct, given some assumptions about the implementation of std::vector at the time that code was written. That code dates initially to 2000, and perhaps to as late as 2002, which means that at the time many standard library implementations did not do a good job of optimizing the destruction and construction of objects in containers.
The claim about non-resizing is easily refuted by using a vector with an initially large capacity. The claim about speed, I think, comes entirely from the fact that unbounded_array has special code for eliding dtors & ctors when the stored objects have trivial implementations of them. Hence it can avoid calling them when it has to rearrange things, or when it's copying elements. Compared to really recent standard library implementations it's not going to be faster, as new implementations tend to take advantage of things like move semantics to do even more optimizations.
It appears to lack insert and erase methods. As these may be "slow", i.e. their performance depends on size() in the vector implementation, they were omitted to prevent the programmer from shooting himself in the foot.
insert and erase are required by the standard for a container to be called a Sequence, so unlike vector, unbounded_array is not a sequence.
No efficiency is gained by failing to be a sequence, per se.
However, it is more efficient in its memory allocation scheme, by avoiding a concept of vector::capacity and always having the allocated block exactly the size of the content. This makes the unbounded_array object smaller and makes the block on the heap exactly as big as it needs to be.
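A sketch of the exact-fit growth described above (assumed semantics, not Boost's actual code):

#include <algorithm>
#include <cstddef>

// Exact-fit resize: the new block holds exactly new_n elements, with no
// spare capacity kept around (unlike std::vector's geometric growth).
// Assumes T is default-constructible and copy-assignable.
template <typename T>
void exact_resize(T*& block, std::size_t old_n, std::size_t new_n)
{
    T* fresh = new T[new_n];
    std::copy(block, block + std::min(old_n, new_n), fresh);
    delete[] block;
    block = fresh;
}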
As I understood it from the linked documentation, it is all about allocation strategy. std::vector, as far as I know, postpones allocation until necessary and then might allocate some reasonable chunk of memory, whereas unbounded_array seems to allocate more memory early and therefore might allocate less often. But this is only a guess from the statement in the documentation that it allocates more memory than might be needed and that the allocation is more expensive.