Performance of std::copy of portion of std::vector - c++

I want to copy part of a vector to itself, e.g.
size_t offset; /* some offset */
std::vector<T> a = { /* blah blah blah */};
std::copy(a.begin() + offset, a.begin() + (offset*2), a.begin());
However, I'm concerned about the performance of this approach. I'd like to have this boil down to a single memmove (or equivalent) when the types in question allow it, but still behave as one would expect when given a non-trivially-copyable type.
When the template type T is trivially copyable (in particular int64_t, if it matters), does this result in one memmove of length sizeof(T) * offset, or offset distinct memmoves of length sizeof(T)? I assume the latter would give noticeably worse performance because it requires many separate memory reads. Or should I just assume that caching will make the performance in these situations effectively equivalent for relatively small offsets (<100)?
In cases where the template type T is not trivially copyable, is it guaranteed to result in offset distinct calls to the copy assignment operator T::operator=, or will something stranger happen?
If std::copy doesn't yield the result I'm looking for, is there some alternative approach that would satisfy my performance constraints without just writing template-specializations of the copy code for all the types in question?
Edit: GCC 5.1.0, compiling with -O3

There are no guarantees about how standard library functions are implemented, other than the guarantees which explicitly appear in the standard which cover:
the observable effect of valid invocations, and
space and time complexity (in this case: strictly linear in the number of objects to copy, assuming that copying an object is O(1)).
So std::copy might or might not do the equivalent of memmove. It might do an element-by-element copy in a simple loop. Or it might unroll the loop. Or it might call memmove. Or it might find an even faster solution, based on the compiler's knowledge about the alignment of the datatypes, possibly using a vectorizing optimization.
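To illustrate the kind of compile-time dispatch an implementation is free to perform, here is a minimal sketch (assuming C++17 for if constexpr; copy_sketch is a made-up name, not any library's actual code):

#include <cstring>
#include <type_traits>

template <typename T>
T* copy_sketch(const T* first, const T* last, T* d_first) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if constexpr (std::is_trivially_copyable_v<T>) {
        std::memmove(d_first, first, n * sizeof(T));   // one bulk move
    } else {
        for (std::size_t i = 0; i < n; ++i)
            d_first[i] = first[i];                     // n copy assignments
    }
    return d_first + n;
}

Either branch satisfies the standard's requirement of exactly last - first assignments (or their bulk equivalent); which one you get is purely a quality-of-implementation matter.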
<rant>Contrary to what seems to be popular opinion, the authors of the standard C++ library are not in a conspiracy to slow down your code, nor are they so incompetent that anyone with a couple of months of coding experience could easily generate faster code. For particular use cases, you might be able to leverage your knowledge about the data being moved around to find a faster solution, but in general -- and particularly without profiling real code -- your best bet is to assume that the standard library authors are excellent coders dedicated to making your programmes as efficient as possible.
</rant>

If the question is about the standard, the answer is 'anything can happen'. It might do memmove(), it might not. On the other hand, if the question is about a particular implementation, then you should not ask, but instead check your implementation.
On my implementation, it is a memmove() call.
By the way, it is hard to imagine an implementation doing offset separate memory moves. It would be either a single call to memmove() or a looped element-by-element copy. Calling memmove() once per element just makes no sense.
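One easy way to do that check, since the question specifies GCC 5.1.0 with -O3: compile a minimal translation unit with those exact flags (e.g. g++ -O3 -S) and look for a call to memmove in the generated assembly. check is just an illustrative name:

#include <algorithm>
#include <cstddef>
#include <cstdint>

void check(std::int64_t* a, std::size_t offset) {
    // mirrors the question's self-overlapping copy
    std::copy(a + offset, a + 2 * offset, a);
}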

Related

Type-Safe C++ wrapper for memcpy?

Given that std::copy (for Trivial Types obviously) can only be implemented as a wrapper around memmove(*), I'm wondering:
Is there a Standard C++ type-safe wrapper for the times you need memcpy? (I can't count the number of times I forgot to multiply by sizeof.)
If there's nothing in the standard, have there been any proposals for this? If not, why not?
Are there any specific obstacles in providing a memcpy wrapper that does the sizeof multiplication automatically?
(*): C++ Standard Library implementations (from as far back as MSVC 2005(!) up to modern MSVC 2015, libc++, etc.) decay std::copy of TriviallyCopyable types to memmove. But not to memcpy. Because:
std::copy(src_first, src_last, destination_first) defines that:
The behavior is undefined if d_first is within the range [first, last).
Only the beginning of the destination range MUST NOT be within the source range. The destination range is allowed to extend into the source range. That is, d_first can be "to the left" of the source range, and the destination range can extend into the source range.
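To make the allowed overlap concrete, here is a small example mirroring the original question (legal_overlap is an illustrative name):

#include <algorithm>
#include <vector>

void legal_overlap() {
    std::vector<int> a = {0, 1, 2, 3, 4, 5, 6, 7};
    // d_first (a.begin()) lies left of the source range [begin()+2, begin()+6),
    // and the destination range extends into the source. Well-defined for
    // std::copy; the same overlap would be undefined for memcpy.
    std::copy(a.begin() + 2, a.begin() + 6, a.begin());
    // a is now {2, 3, 4, 5, 4, 5, 6, 7}
}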
For std::memcpy the definition is that
If the objects overlap, the behavior is undefined.
That is, the full ranges must not overlap: This is what allows memcpy to be the fastest variant, because it can just assume that the memory of source and destination is completely disjoint.
For std::memmove, the definition is:
The objects may overlap: copying takes place as if the characters were copied to a temporary character array and then the characters were copied from the array to dest.
That is, the source and destination range may arbitrarily overlap, there is no restriction.
Given this, it is clear that you can use std::memmove to implement std::copy for TriviallyCopyable types, because memmove doesn't impose any restrictions and the dispatch to the correct implementation can be done at compile time via type traits --
but it's hard to implement std::copy in terms of memcpy because (a) the check whether the pointer ranges overlap would have to be done at run time, and (b) even implementing the runtime check for unrelated memory ranges could be quite a mess.
So, this leaves us with
void* memcpy( void* dest, const void* src, std::size_t count );
a function with a less than stellar interface, where you constantly need to multiply the input count of non-char objects by their sizeof, and that is totally untyped.
But memcpy is fastest (and by quite a margin; measure it yourself), and when you need fast copies of TriviallyCopyable types, you reach for memcpy. Which superficially should be easy to wrap in a type-safe wrapper like:
#include <cstring>

template<typename T>
T* trivial_copy(T* dest, const T* src, std::size_t n) {
    // memcpy works in bytes, so the sizeof multiplication happens in one place
    return static_cast<T*>(std::memcpy(dest, src, sizeof(T) * n));
}
but then, it's unclear whether you should do compile-time checks via std::is_trivial or some such, and of course there may be some discussion whether to go with the exact memcpy signature order, yadda yadda.
So do I really have to reinvent this wheel myself? Was it discussed for the standard? Etc.
To clarify the difference between memcpy and memmove, according to the docs:
memmove can copy memory to a location that overlaps the source memory; for memcpy this is undefined behavior.
"The objects may overlap: copying takes place as if the characters were copied to a temporary character array and then the characters were copied from the array to dest."
Is there a Standard C++ type-safe wrapper for the times you need memcpy? (I can't count the number of times I forgot to multiply by sizeof.)
Yes, std::copy (maybe, explained below)
If there's nothing in the standard, have there been any proposals for this? If not, why not?
As far as I know the standard does not enforce the usage of memmove/memcpy for std::copy for trivial types, so it's up to the implementation. For example, in Visual Studio 2015 Update 2 they did use memmove to speed things up:
"Increased the speed of std::vector reallocation and std::copy(); they are up to 9x faster as they call memmove() for trivially copyable types (including user-defined types)."
Are there any specific obstacles in providing a memcpy wrapper that does the sizeof multiplication automatically?
No, in fact you can implement this yourself by using std::is_trivial
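For instance, here is a hedged sketch of such a wrapper (checked_memcpy is a made-up name; note that std::is_trivially_copyable is the precise requirement for byte-wise copies, std::is_trivial being strictly stronger):

#include <cstring>
#include <type_traits>

template <typename T>
T* checked_memcpy(T* dest, const T* src, std::size_t n) {
    // reject types for which a raw byte copy would be undefined behavior
    static_assert(std::is_trivially_copyable<T>::value,
                  "checked_memcpy requires a trivially copyable type");
    std::memcpy(dest, src, sizeof(T) * n);  // the sizeof multiplication lives here
    return dest;
}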
Edit:
According to this document, section 25.3.1, there are no restrictions on the implementation of std::copy, only on its complexity:
Complexity: Exactly last - first assignments.
And this makes perfect sense when you consider that memcpy uses CPU-specific instructions (that are not available on all CPUs) to speed up memory copying.

Can splitting a one-liner into multiple lines with temporary variables impact performance, e.g. by inhibiting some optimizations?

It is a very general C++ question. Consider the following two blocks (they do the same thing):
v_od=((x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint()).cwiseAbs2().rowwise().sum()).array().sqrt();
and
MatrixXd wtemp=(x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint());
v_od=(wtemp.cwiseAbs2().rowwise().sum()).array().sqrt();
Now the first construct feels more efficient. But is it true, or would the C++ compiler compile them down to the same thing? (I'm assuming the compiler is a good one and has all the safe optimization flags turned on. For argument's sake, wtemp is mildly sized, say a matrix with 100k elements all told.)
I know the generic answer to this is 'benchmark it and come back to us', but I want a general answer.
There are two cases where your second expression could be fundamentally less efficient than your first.
The first case is where the writer of the MatrixXd class did rvalue reference to this overloads on cwiseAbs2(). In the first code, the value we call the method on is a temporary, in the second it is not. We can fix this by simply changing the second expression to:
v_od=(std::move(wtemp).cwiseAbs2().rowwise().sum()).array().sqrt();
which casts wtemp into an rvalue reference, and basically tells cwiseAbs2() that the matrix it is being called on can be reused as scratch space. This only matters if the writers of the MatrixXd class implemented this particular feature.
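For illustration, a minimal sketch of such ref-qualified overloads (Mat is a made-up stand-in, not Eigen's actual MatrixXd interface):

#include <utility>
#include <vector>

struct Mat {
    std::vector<double> data;

    Mat cwiseAbs2() const & {   // called on lvalues: must allocate a fresh result
        Mat r{data};
        for (double& d : r.data) d *= d;
        return r;
    }
    Mat cwiseAbs2() && {        // called on rvalues: square in place, steal storage
        for (double& d : data) d *= d;
        return std::move(*this);
    }
};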
The second possible way it could be fundamentally slower is if the writers of the MatrixXd class used expression templates for pretty much every operation listed. This technique builds the parse tree of the operations, and only finalizes all of them when you assign the result to a value at the end.
Some expression templates are written to handle being able to be stored in an intermediate object like this:
auto&& wtemp=(x-wOut*svd.matrixV().topLeftCorner(p,Q).adjoint());
v_od=(std::move(wtemp).cwiseAbs2().rowwise().sum()).array().sqrt();
where the first stores the expression template wtemp rather than evaluating it into a matrix, and the second line consumes the first intermediate result. Other expression template implementations break horribly if you try to do something like the above.
Expression templates are also something that the matrix class writers would have to have specifically implemented. And it is again a somewhat obscure technique -- it would mainly be of use in situations where extending a buffer is done by seemingly cheap operations, like string append.
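A bare-bones sketch of the idea, far simpler than what a real matrix library does (Vec, AddExpr and evaluate are made-up names):

#include <cstddef>
#include <vector>

struct Vec {
    std::vector<double> data;
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// operator+ builds a lightweight node; no arithmetic happens yet
struct AddExpr {
    const Vec& l;
    const Vec& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

AddExpr operator+(const Vec& l, const Vec& r) { return {l, r}; }

// assignment/evaluation is where the loop finally runs
Vec evaluate(const AddExpr& e) {
    Vec out{std::vector<double>(e.size())};
    for (std::size_t i = 0; i < e.size(); ++i)
        out.data[i] = e[i];
    return out;
}

Note that an AddExpr stored in a local variable holds references into its operands: fine when they are named vectors, dangling when they were temporaries. That is precisely how naive expression templates "break horribly" when an intermediate is stored.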
Barring those two cases, any difference in performance is going to be purely "noise" -- there would be no reason, a priori, to expect the compiler to be confused by one or the other more or less.
And both of these are relatively advanced/modern techniques.
Neither of them will be implemented "by the compiler" without explicitly being done by the library writer.
In general, the second case is much more readable, and that's why it's preferred. It clearly names the temporary variable, which helps readers understand the code better. Moreover, it's much easier to debug! That's why I would strongly recommend going for the second option.
I would not worry much about the performance difference: I think a good compiler will produce identical code from both examples.
The most important aspects of code in order, most important -> less important:
Correct code
Readable code
Fast code
Of course, this can change (e.g. on embedded devices where you have to squeeze out every last bit of performance in limited memory space) but this is the general case.
Therefore, you want the code that is easier to read over a possibly negligible performance increase.
I wouldn't expect a performance hit for storing temporaries - at least not in the general case. In fact, in some cases you can expect it to be faster, e.g. caching the result of strlen() when working with C strings (the first example that comes to mind).
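A sketch of that classic case (to_upper_ascii is an illustrative name): without the temporary, a compiler that cannot prove the string is unmodified may re-evaluate strlen on every iteration, turning a linear pass into a quadratic one.

#include <cstring>

void to_upper_ascii(char* s) {
    const std::size_t len = std::strlen(s);  // computed once, cached in a local
    for (std::size_t i = 0; i < len; ++i)
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 'a' - 'A';
}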
Once you have written the code, verified that it is correct code, and found a performance problem, only then should you worry about profiling and making it faster, at which point you'll probably find that having more maintainable / readable code actually helps you isolate the problem.

memmove vs copying backwards

I understand that memmove in C (cstring library) handles overlaps nicely "at the cost of slower runtime" (see this post). I was wondering why there is this additional runtime cost. It seems to me that any overlap problem could be fixed by copying backwards instead of forward; am I wrong?
As a toy example, here are two versions of a "right-shift" function, that shifts the contents of an array by one element on the right:
// Using memmove
#include <cstring>  // std::memmove
#include <new>      // placement new

template <typename T>
void shift_right( T *data, unsigned n )
{
    if (n)
    {
        data[n-1].~T();                                 // destroy the last element
        std::memmove( data+1, data, (n-1)*sizeof(T) );  // shift the raw bytes right
        new (data) T();                                 // default-construct the first slot
    }
}
// Using copy_backward
#include <algorithm>  // std::copy_backward

template <typename Iterator>
void shift_right( Iterator first, Iterator last )
{
    Iterator it = last;
    std::copy_backward( first, --it, last );  // copy [first, last-1) so it ends at last
}
Are they equivalent? Performance-wise, which one is best to use?
Note: judging by the comment of @DieterLücking, and despite the precautions taken, the version above using memmove is unsafe in this situation.
Assuming a good implementation, the only "extra cost" of memmove is the initial check (an add and a compare-and-branch) to decide whether to copy front-to-back or back-to-front. This cost is so completely negligible (the add and compare will be hidden by ILP and the branch is perfectly predictable under normal circumstances) that on some platforms, memcpy is just an alias of memmove.
In anticipation of your next question ("if memcpy isn't significantly faster than memmove, why does it exist?"), there are a few good reasons to keep memcpy around. The best one, to my mind, is that some CPUs essentially implement memcpy as a single instruction (rep/movs on x86, for example). These HW implementations often have a preferred (fast) direction of operation (or they may only support copying in one direction). A compiler may freely replace memcpy with the fastest instruction sequence without worrying about these details; it cannot do the same for memmove.
Memmove figures out for you whether to copy backward or forward; it is also highly optimized for this task (e.g. copying in SSE-optimized blocks as much as feasible).
It is unlikely that you can do any better by calling any generic STL algorithm (the best they could do is to call memcpy or memmove behind the scenes), but of course you could answer this question simply by running your code and timing it.
From the post you actually linked (emphasis mine):
memcpy just loops, while memmove performs a test to determine which direction to loop in to avoid corrupting the data. These implementations are rather simple. Most high-performance implementations are more complicated (involving copying word-size blocks at a time rather than bytes).
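A byte-wise sketch of that direction test (illustrative only; real implementations copy in word-sized or vectorized blocks, and a library's own memmove may rely on pointer comparisons that portable code cannot):

#include <cstddef>

void* memmove_sketch(void* dst, const void* src, std::size_t n) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    if (d < s)
        for (std::size_t i = 0; i < n; ++i) d[i] = s[i];        // front-to-back
    else
        for (std::size_t i = n; i > 0; --i) d[i-1] = s[i-1];    // back-to-front
    return dst;
}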
The appropriate ways to copy or move are std::copy, std::copy_n, std::copy_backward and std::move. A proper C++ library will use memcpy or memmove if applicable. Hence there is no need to go for undefined results when the copied or moved sequence holds non-trivial data.
Note: here std::move is the algorithm template 'OutputIterator move(InputIterator first, InputIterator last, OutputIterator result);' (for @Void).

Is std::vector::size() allowed to require non-trivial computations? When would it make sense?

I'm reviewing a piece of code and see a class where an std::vector is stored as a member variable and the size of that std::vector is stored as a separate member variable. Both the std::vector and its "stored copy" of the size never change during the containing object's lifetime, and the comments say the size is stored separately "for convenience and for cases when an implementation computes the size each time".
My first reaction was "WT*? Shouldn't it always be trivial to extract a std::vector's size?"
Now I've carefully read 23.2.4 of the C++ Standard and can't see anything saying whether such implementations are allowed in the first place, and I can't imagine why it would be necessary to implement std::vector in such a way that its current size needs non-trivial computations.
Is such implementation that std::vector::size() requires some non-trivial actions allowed? When would having such implementation make sense?
C++03 says in Table 65, found in §23.1, that size() should have constant complexity. (In C++0x, this is required for all containers.) You'd be hard-pressed to find a std::vector<> where it's not.
Typically, as Steve says, this is just the difference between two pointers, a simple operation.
I would guess that your definition of "trivial" doesn't match that of the author of the code.
If size isn't stored, I'd expect begin and end to be stored, and size to be computed as the difference of the two, and that code to be inlined. So we're basically talking two (nearby) memory accesses and a subtraction, instead of one memory access.
For most practical purposes, both of those are trivial, and if the standard library author thinks that the result of that computation isn't worth caching, then personally I am happy to accept their opinion. But the author of that code comment might think otherwise.
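A sketch of that usual layout (member names are made up): size() boils down to a pointer subtraction, which inlines to two loads and a subtract.

#include <cstddef>

template <typename T>
struct vector_sketch {
    T* begin_;
    T* end_;
    T* capacity_end_;

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
};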
IIRC the standard says somewhere that size() "should" be O(1); I'm not sure if that's in the text for sequences or for containers. I don't think it anywhere specifies that it must be for vector. But even if we read that as a non-requirement, there's a fundamental quality-of-implementation (QOI) issue here: what on earth am I doing optimizing my code for such a poor implementation at the expense of normal implementations?
If someone uses such an implementation, presumably that's because they want their code to run slowly. Who am I to judge otherwise? ;-)
It's also possible that the author of the code has profiled using a number of end-begin implementations, and measured a significant improvement by caching the size. But I think that's less likely than that the author is being too pessimistic about the worst case their code needs to handle.

Why is 'unbounded_array' more efficient than 'vector'?

It says here that
The unbounded array is similar to a std::vector in that it can grow in size beyond any fixed bound. However unbounded_array is aimed at optimal performance. Therefore unbounded_array does not model a Sequence like std::vector does.
What does this mean?
As a Boost developer myself, I can tell you that it's perfectly fine to question the statements in the documentation ;-)
From reading those docs, and from reading the source code (see storage.hpp), I can say that it's somewhat correct, given some assumptions about the implementation of std::vector at the time that code was written. That code dates to 2000 initially, and perhaps as late as 2002. Which means at the time many STD implementations did not do a good job of optimizing destruction and construction of objects in containers.
The claim about the non-resizing is easily refuted by using an initially large capacity vector. The claim about speed, I think, comes entirely from the fact that unbounded_array has special code for eliding dtors & ctors when the stored objects have trivial implementations of them. Hence it can avoid calling them when it has to rearrange things, or when it's copying elements.
Compared to really recent STD implementations it's not going to be faster, as newer STD implementations tend to take advantage of things like move semantics to do even more optimizations.
It appears to lack insert and erase methods. As these may be "slow," i.e. their performance depends on size() in the vector implementation, they were omitted to prevent the programmer from shooting himself in the foot.
insert and erase are required by the standard for a container to be called a Sequence, so unlike vector, unbounded_array is not a sequence.
No efficiency is gained by failing to be a sequence, per se.
However, it is more efficient in its memory allocation scheme, by avoiding a concept of vector::capacity and always having the allocated block exactly the size of the content. This makes the unbounded_array object smaller and makes the block on the heap exactly as big as it needs to be.
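A quick way to see the difference in allocation policy (output is implementation-specific; the capacity value in the comment assumes a doubling growth policy):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 17; ++i)
        v.push_back(i);
    // vector grows geometrically, so capacity() usually exceeds size();
    // an unbounded_array-style container would have allocated exactly 17.
    std::cout << v.size() << " elements, capacity " << v.capacity() << '\n';
    // prints e.g. "17 elements, capacity 32"
}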
As I understood it from the linked documentation, it is all about allocation strategy. std::vector, AFAIK, postpones allocation until necessary and then might allocate some reasonable chunk of memory, whereas unbounded_array seems to allocate more memory early and therefore might allocate less often. But this is only a guess from the statement in the documentation that it allocates more memory than might be needed and that the allocation is more expensive.