Slow memcpy Performance


This may seem like a stupid/obvious question to some of you, but I'm still learning so please be gentle haha.
I'm writing an application without the CRT, so I have to implement my own memcpy function. After getting everything working, I noticed the application was performing significantly slower than its CRT counterpart. After a while I tracked it down to my custom memcpy function.
void* _memcpy(void* destination, void* source, size_t num)
{
    char* d = (char*)destination;
    char* s = (char*)source;
    while (num--)
        *d++ = *s++;
    return destination;
}
My friend told me this was a complete sh*t implementation, so I'm posting here to ask how I could improve it to at least match the performance of its CRT counterpart, and also to get an explanation of why it's so slow.

First things first. Computers handle things in words. The typical word size is 4 or 8 bytes (except on some 8-bit micros). If you can copy a word at a time, things will be much faster.
There are complications though. Many processors don't like misaligned access, so each copy should be on a word boundary.
Other optimizations might include prefetching data, but these start becoming more complicated.
Take a look at newlib-nano's implementation for inspiration. https://github.com/eblot/newlib/blob/master/newlib/libc/string/memcpy.c
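To make the word-at-a-time idea concrete, here is a minimal sketch, not a tuned implementation. The name word_memcpy and the word_t typedef are made up for illustration, and the may_alias attribute assumes GCC or Clang. It only takes the fast path when source and destination share the same misalignment; real library versions go further.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang. may_alias lets word-sized accesses legally alias byte data.
typedef std::uintptr_t __attribute__((__may_alias__)) word_t;

// Hypothetical name; sketch of a byte/word/byte copy.
void* word_memcpy(void* destination, const void* source, std::size_t num)
{
    unsigned char* d = static_cast<unsigned char*>(destination);
    const unsigned char* s = static_cast<const unsigned char*>(source);

    const std::uintptr_t da = reinterpret_cast<std::uintptr_t>(d);
    const std::uintptr_t sa = reinterpret_cast<std::uintptr_t>(s);

    // Word copies are only possible if both pointers are misaligned by the same amount.
    if (((da ^ sa) % sizeof(word_t)) == 0) {
        // Copy leading bytes until both pointers sit on a word boundary.
        while (num != 0 && (reinterpret_cast<std::uintptr_t>(d) % sizeof(word_t)) != 0) {
            *d++ = *s++;
            --num;
        }
        // Copy one aligned word at a time.
        while (num >= sizeof(word_t)) {
            *reinterpret_cast<word_t*>(d) = *reinterpret_cast<const word_t*>(s);
            d += sizeof(word_t);
            s += sizeof(word_t);
            num -= sizeof(word_t);
        }
    }
    // Copy the trailing bytes (or everything, if the alignments differ).
    while (num--)
        *d++ = *s++;
    return destination;
}
```

One thing to watch in a CRT-free build: some optimizers recognise copy loops and replace them with a call to memcpy, so you may need your compiler's freestanding or no-builtin options for this to link.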

Related

How to effectively apply bitwise operation to (large) packed bit vectors?

I want to implement
void bitwise_and(
    char* __restrict__ result,
    const char* __restrict__ lhs,
    const char* __restrict__ rhs,
    size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementation, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bitset, and I don't want to spend the time constructing one of those.
So do I... Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume it's uint64_ts rather than chars (that doesn't really matter: if the char array is unaligned we can just treat the leading and trailing chars separately).

This answer is going to assume you want the fastest possible way and are happy to use platform-specific things. Your optimising compiler may be able to produce similar code to the below from normal C, but in my experience across a few compilers something as specific as this is still best hand-written.
Obviously, like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down your architecture to x86 with at least SSE2 (which is all the intrinsics below need) you would do:
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

void bitwise_and(
    char* result,
    const char* lhs,
    const char* rhs,
    size_t length)
{
    while(length >= 16)
    {
        // Load in 16byte registers
        auto lhsReg = _mm_loadu_si128((const __m128i*)lhs);
        auto rhsReg = _mm_loadu_si128((const __m128i*)rhs);
        // do the op
        auto res = _mm_and_si128(lhsReg, rhsReg);
        // save off again
        _mm_storeu_si128((__m128i*)result, res);
        // book keeping
        length -= 16;
        result += 16;
        lhs += 16;
        rhs += 16;
    }
    // do the tail end. Assuming that the array is large the
    // most that the following code can be run is 15 times so I'm not
    // bothering to optimise. You could do it in 64 bit then 32 bit
    // then 16 bit then char chunks if you wanted...
    while (length)
    {
        *result = *lhs & *rhs;
        length -= 1;
        result += 1;
        lhs += 1;
        rhs += 1;
    }
}
This compiles to ~10 asm instructions per 16 bytes (plus change for the leftover and a little overhead).
The great thing about doing intrinsics like this (over hand-rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) on top of what you write. It also handles register allocation.
If you could guarantee aligned data you could save an asm instruction (use _mm_load_si128 instead and the compiler will be clever enough to avoid a second load and use it as a direct memory operand to the 'pand').
If you could guarantee AVX2+ then you could use the 256-bit versions and handle 32 bytes with ~10 asm instructions.
On ARM there are similar NEON instructions.
If you wanted to do multiple ops just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure with a decent processor you don't need any additional cache control.

Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight-up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and then read again a lot later (in the next kernel), even though the high-level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both times as an operand to one operation), it will be streamed in twice, instead of the data being reused for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.
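As a rough illustration of "one loop with an expression in it", here is a sketch of a fused three-input AND. The function name bitwise_and3 is made up, and I've used plain 64-bit words (loaded via memcpy) rather than intrinsics so it stays portable; the same shape applies if you swap in SSE/AVX loads and ops.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fused kernel: out = a & b & c in a single pass over memory.
// The intermediate (a & b) never leaves a register, unlike two chained
// calls to a two-input bitwise_and().
void bitwise_and3(char* out, const char* a, const char* b, const char* c,
                  std::size_t length)
{
    std::size_t i = 0;
    // Main loop: 8 bytes per iteration; memcpy keeps the loads alignment-safe.
    for (; i + 8 <= length; i += 8) {
        std::uint64_t x, y, z;
        std::memcpy(&x, a + i, 8);
        std::memcpy(&y, b + i, 8);
        std::memcpy(&z, c + i, 8);
        const std::uint64_t r = x & y & z;
        std::memcpy(out + i, &r, 8);
    }
    // Tail: at most 7 leftover bytes.
    for (; i < length; ++i)
        out[i] = static_cast<char>(a[i] & b[i] & c[i]);
}
```

Compared with calling a two-input kernel twice, this reads each input once and writes the output once, which is the point about arithmetic intensity made above.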

Do data type ranges matter as a memory-saving measure anymore?

I was always taught to use the appropriate data type depending on the specific needs of the class/method/function/member/variable/what-have-you. That said, does it even matter anymore?
Hypothetically, if I have a class that has a data member that will never be negative and will never be more than the maximum value of unsigned char, does storing it as an unsigned char (1 byte) versus an int (4 bytes) even matter anymore due to implicit type promotion/demotion, internal representation, register size and the often quoted "CPUs are more efficient when working with int"?
Example:
#include <limits>

class Foo {
public:
    Foo() : _status(0) { /* DO NOTHING */ }
    void AddTo(unsigned char value) {
        // Clamp so that _status saturates at the maximum instead of wrapping.
        if(std::numeric_limits<unsigned char>::max() - _status < value) {
            value = std::numeric_limits<unsigned char>::max() - _status;
        }
        _status += value;
    }
    void Increment() {
        if(_status == std::numeric_limits<unsigned char>::max()) return;
        ++_status;
    }
private:
    unsigned char _status;
};

A main effect of generally using "right-sized" types is that you and others waste a lot of time on it.
If you have a zillion values stored, e.g. a very large picture, or if you absolutely need a 64-bit range, say, then sure, in such cases it makes sense to right-size.
But using right-sizing as a general guideline produces no significant gain and much pain.
Authority argument: Bjarne Stroustrup, who created the language, generally just uses a few types, e.g. int for integers.

"Premature optimization is the root of all evil" Donald Knuth.
Is this one data member's size going to significantly impact the size of the class? Are you serializing the class? Is the serialization representation seeing any reduction? Are you making the code harder to read worrying about this when your boss doesn't care?
Y2K, IPv4 32-bit addresses, ASCII: yes, the future will look back at your code and laugh. Remember Moore's law, write something that works, and expect that something will be wrong. Until it is, you'll never know what. Write testable, maintainable, and refactorable code and it might just stay in production long enough for someone to care.
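On the first question (does this one member change the size of the class?), a quick sizeof check often settles it. The structs below are hypothetical, purely for illustration; because of alignment padding, swapping a lone int for an unsigned char next to a wider member usually saves nothing, while packing several small members together does.

```cpp
#include <cstddef>

// Hypothetical examples: a small member sitting next to a wider one.
struct WithChar { double weight; unsigned char status; };
struct WithInt  { double weight; int status; };

// Several small members packed together: here the narrow type does pay off.
struct FourChars { unsigned char a, b, c, d; };
struct FourInts  { int a, b, c, d; };

// Bytes saved by the narrower member; on mainstream ABIs the struct is padded
// up to the alignment of double either way, so this is typically zero.
constexpr std::size_t kSavedNextToDouble = sizeof(WithInt) - sizeof(WithChar);

// Bytes saved when four small members sit side by side.
constexpr std::size_t kSavedWhenPacked = sizeof(FourInts) - sizeof(FourChars);
```

Printing or static_assert-ing these on your own target is the reliable way to know, since padding rules are ABI-specific.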

For most use cases when targeting PCs and servers, you're not going to need to worry about using chars vs using ints to hold numeric values. Just use an int or, if you need a larger range, a long.
However, if you're targeting a platform with 16 bytes of RAM which has less than 1 KB to store your program, you may need to carefully consider whether that loop counter really has to take up more than 1 byte.

Unless there's a particular reason for choosing some other variable type, just stick with int. A large part of modern programming is managing complexity and there's no reason to start sprinkling your code with a whole variety of types if it doesn't actually help anything. Sure, if you have 5,000 copies of a particular class or working on a system with a tightly constrained memory footprint then it might be important. But on a multigigabyte system this isn't generally going to be a concern. In that case it's more about writing something understandable and maintainable.

You are hitting one of the problems of C-style languages. They deprive you of the range checking that other languages offer. If your value should be within a specific range, the ability to say a type can be, say, 1..64 is a big help for error tracking. I have found so many bugs in C/C++ code by converting it to Pascal or Ada.
I like to use typedefs in the situation you describe--
COLORCOMPONENT
DEGREES
RADIANS
--for documentation purposes. Even if the compiler does not do the checking for me, I can usually spot when I am using degrees when I should be using radians.

Text iteration, Assembly versus C++

I am making a program that frequently reads chunks of text received from the web, looking for specific characters and parsing the data accordingly. I am becoming fairly skilled with C++ and have made it work well. However, is Assembly going to be faster than a loop like this?
for(size_t len = 0;len != tstring.length();len++) {
    if(tstring[len] == ',')
        stuff();
}
Would an inline-assembly routine using cmp and jz/jnz be faster? I don't want to waste my time working with asm just to be able to say I used it, only for true speed purposes.
Thank you,

No way. Your loop is so simple, the cost of the optimizer losing the ability to reason about your code is going to be way higher than any performance you could gain. This isn't SSE intrinsics or a bootloader, it's a trivial loop.

An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:
you're retrieving tstring.length() once per loop iteration; that's unnecessary.
you're using random indexing, tstring[len] which might be a more-expensive operation than using a forward iterator.
you're calling stuff() during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whatever stuff() does), and only afterwards iterate over those results.
There's already a standard library function that is likely to be optimized at a low level, strchr(), for exactly that kind of scanning. The C++ standard library's std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
In particular, strchr() has SSE2 (using pcmpeqb, maskmov... and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).
My advice: Check your library implementation / documentation, and/or the actual generated assembly code for your program. You might well be using fast code already ... or would be if you'd switch from your hand-grown character-by-character simple search to just using std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...
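To illustrate the "scan first, process afterwards" suggestion, here is a minimal sketch. The function name find_all is made up, and I've swapped in std::memchr (the length-bounded relative of strchr) because a std::string carries its own length and may contain embedded NULs.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: collect the index of every occurrence of `needle`,
// so the caller can run its (possibly cache-unfriendly) processing afterwards.
std::vector<std::size_t> find_all(const std::string& text, char needle)
{
    std::vector<std::size_t> positions;
    const char* const begin = text.data();
    const char* const end = begin + text.size();
    const char* p = begin;
    while (p < end) {
        // memchr is typically one of the most heavily optimized library routines.
        const void* hit = std::memchr(p, needle, static_cast<std::size_t>(end - p));
        if (!hit)
            break;
        const char* q = static_cast<const char*>(hit);
        positions.push_back(static_cast<std::size_t>(q - begin));
        p = q + 1; // resume just past the match
    }
    return positions;
}
```

The caller would then loop over the returned positions and call stuff() for each one. Whether the extra vector pays for itself depends on what stuff() does, so benchmark both shapes.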

Checking characters one by one is not the fastest thing to do. Maybe you should try something like this and find out if it's faster.
string s("xxx,xxxxx,x,xxxx");
string::size_type pos = s.find(',');
while(pos != string::npos){
    do_stuff(pos);
    pos = s.find(',', pos+1);
}
Each iteration of the loop gives you the position of the next ',' character, so the program needs only a few iterations to finish the job.

Would an inline-assembly routine using cmp and jz/jnz be faster?
Maybe, maybe not. It depends upon what stuff() does, what the type and scope of tstring is, and what your assembly looks like.
First, measure the speed of the maintainable C++ code. Only if this loop dominates your program's speed should you consider rewriting it.
If you choose to rewrite it, keep both implementations available, and comparatively measure them. Only use the less maintainable version if it is faster, and if the speed increase matters. Also, since you have the original version in place, future readers will be able to understand your intent even if they don't know asm that well.
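For the "keep both and measure" advice, something along these lines is usually enough to get a first impression. All the names here (time_ms, count_by_loop, count_by_find) are made up for the sketch, and a serious benchmark would also need warm-up runs, realistic input, and a way to stop the optimizer discarding the work.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper: wall-clock time in milliseconds for `repetitions` calls of f.
template <typename F>
double time_ms(F&& f, int repetitions)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Candidate 1: the straightforward loop, with the length hoisted out.
std::size_t count_by_loop(const std::string& text, char needle)
{
    std::size_t count = 0;
    for (std::size_t i = 0, n = text.length(); i != n; ++i)
        if (text[i] == needle)
            ++count;
    return count;
}

// Candidate 2: let std::string::find do the scanning.
std::size_t count_by_find(const std::string& text, char needle)
{
    std::size_t count = 0;
    for (std::string::size_type pos = text.find(needle);
         pos != std::string::npos;
         pos = text.find(needle, pos + 1))
        ++count;
    return count;
}
```

Check first that both candidates return the same result on your data, then compare time_ms for each with the same input and repetition count.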

Why are standard string functions faster than my custom string functions?

I decided to find the speeds of 2 functions :
strcmp - The standard comparison function defined in string.h
xstrcmp - A function that takes the same parameters and does the same thing, except that I wrote it.
Here is my xstrcmp function :
int xstrlen(char *str)
{
    int i;
    for(i=0;;i++)
    {
        if(str[i]=='\0')
            break;
    }
    return i;
}
int xstrcmp(char *str1, char *str2)
{
    int i, k;
    if(xstrlen(str1)!=xstrlen(str2))
        return -1;
    k=xstrlen(str1)-1;
    for(i=0;i<=k;i++)
    {
        if(str1[i]!=str2[i])
            return -1;
    }
    return 0;
}
I didn't want to depend on strlen, since I want everything user-defined.
So, I found the results. strcmp did 364 comparisons per millisecond and my xstrcmp did just 20 comparisons per millisecond (at least on my computer!)
Can anyone tell me why this is so? What does the strcmp function do to make itself so fast?

if(xstrlen(str1)!=xstrlen(str2)) //computing length of str1
return -1;
k=xstrlen(str1)-1; //computing length of str1 AGAIN!
You're computing the length of str1 TWICE. That is one reason why your function loses the game.
Also, your implementation of xstrcmp is very naive compared to the ones defined in (most) Standard libraries. For example, your xstrcmp compares one byte at a time, when in fact it could compare multiple bytes in one go, taking advantage of proper alignment as well, or do a little preprocessing to align the memory blocks before the actual comparison.

strcmp and other library routines are written in assembly, or specialized C code, by experienced engineers and use a variety of techniques.
For example, the assembly implementation might load four bytes at a time into a register, and compare that register (as a 32-bit integer) to four bytes from the other string. On some machines, the assembly implementation might load eight bytes or even more. If the comparison shows the bytes are equal, the implementation moves on to the next four bytes. If the comparison shows the bytes are unequal, the implementation stops.
Even with this simple optimization, there are a number of issues to be dealt with. If the string addresses are not multiples of four bytes, the processor might not have an instruction that will load four bytes (many processors require four-byte loads to use addresses that are aligned to multiples of four bytes). Depending on the processor, the implementation might have to use slower unaligned loads or to write special code for each alignment case that does aligned loads and shifts bytes in registers to align the bytes to be compared.
When the implementation loads four bytes at once, it must ensure it does not load bytes beyond the terminating null character if those bytes might cause a segmentation fault (an error because you tried to load an address that is not readable).
If the four bytes do contain the terminating null character, the implementation must detect it and not continue comparing further bytes, even if the current four are equal in the two strings.
Many of these issues require detailed assembly instructions, and the required control over the exact instructions used is not available in C. The exact techniques used vary from processor model to processor model and vary greatly from architecture to architecture.
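To give a flavour of the word-at-a-time approach and the null-detection step, here is a portable sketch in C++. It is not how any particular library does it: the names has_zero_byte and str_equal_padded are made up, and it sidesteps the "reading past the terminator" problem by assuming both strings live in buffers of a known capacity, so every 8-byte load stays inside the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Classic bit trick: true if any of the eight bytes in v is zero.
inline bool has_zero_byte(std::uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Hypothetical equality check for NUL-terminated strings stored in buffers
// that are both at least `capacity` bytes long (assumed precondition).
bool str_equal_padded(const char* a, const char* b, std::size_t capacity)
{
    std::size_t i = 0;
    // Compare eight bytes per step while the words match and hold no terminator.
    for (; i + 8 <= capacity; i += 8) {
        std::uint64_t x, y;
        std::memcpy(&x, a + i, 8);
        std::memcpy(&y, b + i, 8);
        if (x != y || has_zero_byte(x))
            break; // let the byte loop sort out this chunk
    }
    // Byte-wise finish: finds the first difference or the terminator.
    for (; i < capacity; ++i) {
        if (a[i] != b[i])
            return false;
        if (a[i] == '\0')
            return true;
    }
    return true; // no terminator within capacity and all bytes equal
}
```

Real strcmp implementations cannot assume a capacity, which is exactly why they need the alignment and page-boundary care described above.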

Faster implementation of strlen:
//Return difference in addresses - 1 as we don't count null terminator in strlen.
int xstrlen(char *str)
{
    char* ptr = str;
    while (*str++);
    return str - ptr - 1;
}
//Pretty nifty strcmp from here:
//http://vijayinterviewquestions.blogspot.com/2007/07/implement-strcmpstr1-str2-function.html
int mystrcmp(const char *s1, const char *s2)
{
    while (*s1==*s2)
    {
        if(*s1=='\0')
            return(0);
        ++s1;
        ++s2;
    }
    return(*s1-*s2);
}
I'll do the other one later if I have time. You should also note that most of these are written in assembly language or with other optimized means, which will be faster than the best straight C implementation you can write.

Aside from the problems in your code (which have been pointed out already): at least in the gcc C libraries, the str- and mem- functions are faster by a margin in most cases because their memory access patterns are highly optimized.
There were some discussions on the topic on SO already.

Try this:
int xstrlen(const char* s){
    const char* s0 = s;
    while(*s) s++;
    return(s - s0);
}
int xstrcmp(const char* a, const char* b){
    while(*a && *a==*b){a++; b++;}
    return *a - *b;
}
This could probably be sped up with some loop unrolling.

1. Algorithm
Your implementation of strcmp could have a better algorithm. There should be no need to call strlen at all, each call to strlen will iterate over the whole length of the string again. You can find simple but effective implementations online, probably the place to start is something like:
// Adapted from http://vijayinterviewquestions.blogspot.co.uk
int xstrcmp(const char *s1, const char *s2)
{
    for (;*s1==*s2;++s1,++s2)
    {
        if(*s1=='\0') return(0);
    }
    return(*s1-*s2);
}
That doesn't do everything, but should be simple and work in most cases.
2. Compiler optimisation
It's a stupid question, but make sure you turned on all the optimisation switches when you compile.
3. More sophisticated optimisations
People writing libraries will often use more advanced techniques, such as loading a 4-byte or 8-byte int at once and comparing it, and only comparing individual bytes when the words differ or may contain the terminator. You'd need to be an expert to know what's appropriate for this case, but you can find people discussing the most efficient implementation on Stack Overflow (link?)
Some standard library functions for some platforms may be hand-written in assembly if the coder knows there's a more efficient implementation than the compiler can find. That's increasingly rare now, but may be common on some embedded systems.
4. Linker "cheating" with standard library
With some standard library functions, the linker may be able to make your program call them with less overhead than calling functions in your code because it was designed to know more about the specific internals of the functions (link?) I don't know if that applies in this case, it probably doesn't, but it's the sort of thing you have to think about.
5. OK, ok, I get that, but when SHOULD I implement my own strcmp?
Off the top of my head, the only reasons to do this are:
You want to learn how. This is a good reason.
You are writing for a platform which doesn't have a good enough standard library. This is very unlikely.
The string comparison has been measured to be a significant bottleneck in your code, and you know something specific about your strings that mean you can compare them more efficiently than a naive algorithm. (Eg. all strings are allocated 8-byte aligned, or all strings have an N-byte prefix.) This is very, very unlikely.
6. But...
OK, WHY do you want to avoid relying on strlen? Are you worried about code size? About portability of code or of executables?
If there's a good reason, open another question and there may be a more specific answer. So I'm sorry if I'm missing something obvious, but relying on the standard library is usually much better, unless there's something specific you want to improve on.

Assignment vs memcpy - which will be faster in this case

Which of the two is faster?
1.
char* _pos ..;
short value = ..;
*((short*)_pos) = value;
2.
char* _pos ..;
short value = ..;
memcpy(_pos, &value, sizeof(short));

As with all "which is faster?" questions, you should benchmark it to see for yourself. And if it matters, then ask why and pick which you want.
In any case, your first example is technically undefined behavior since you are violating strict-aliasing. So if you had to choose without benchmarking, go with the second one.
To answer the actual question, which is faster will probably depend on the alignment of _pos. If it's aligned properly, then 1 will probably be faster. If not, then 2 might be faster depending on how it's optimized by the compiler. (1 might even crash if the hardware doesn't support misaligned access.)
But this is all guess-work. You really need to benchmark it to know for sure.
At the very least, you should look at the compiled assembly:
: *(short *)_pos = value;
mov WORD PTR [rcx], dx
vs.
: memcpy(_pos, &value, sizeof(short));
mov WORD PTR [rcx], dx
Which in this case (in MSVC) shows the exact same assembly with default optimizations. So you can expect the performance to be the same.

With gcc at an optimization level of -O1 or higher, the following two functions compile to exactly the same machine code on x86:
void foo(char *_pos, short value)
{
    memcpy(_pos, &value, sizeof(short));
}
void bar(char *_pos, short value)
{
    *(short *)_pos = value;
}
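Since the memcpy form compiles down to the same store, it can be wrapped once and reused. A minimal sketch, with store_unaligned and load_unaligned as made-up helper names:

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical helpers: write/read a trivially copyable value at any byte
// address without violating aliasing or alignment rules.
template <typename T>
void store_unaligned(char* pos, T value)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    std::memcpy(pos, &value, sizeof(T));
}

template <typename T>
T load_unaligned(const char* pos)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    T value;
    std::memcpy(&value, pos, sizeof(T));
    return value;
}
```

With optimizations on, a call such as store_unaligned<short>(_pos, value) would normally be inlined to a single store, but check the generated assembly for your compiler rather than taking that on trust.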

The compiler might implement them both the same way.
If it does it naively, assignment will be faster.
For any practical purpose, they'll both be done in no time, and you don't need to worry.
Also note that you may have alignment problems (_pos may not be aligned on 2 bytes, which may crash on some processors), and type punning problems (the compiler may assume that what _pos points to isn't changed, because you wrote using a short *).

Does it matter? It might be that the first case will save you some cycles (depending on the compiler's sophistication and optimizations). But is it worth the readability and maintainability hit?
Many bugs are introduced because of premature optimization. You should first identify the bottleneck, and if this assignment is that bottleneck - benchmark each of the options (taking care of alignment and other issues mentioned here by others already).

The question is implementation-dependent. In practice, for doing nothing but copying sizeof(short) bytes, if one is going to be slower, it's going to be memcpy. For considerably larger data sets, if one is going to be faster, it's generally going to be memcpy.
As pointed out, #1 invokes undefined behavior.
We can see that simple assignment is certainly easier to read and write and less error prone than both. Clarity and correctness should come first, even in performance-critical areas for the simple reason that it's easier to optimize correct code than it is to fix optimized, incorrect code. If this is really a C++ question, the need for such code (casts or memcpy that bulldoze over the type system to x-ray and copy around bits) should be very, very rare.

If you are certain that there won't be an alignment issue, and you really find this is a bottleneck situation then go ahead and do the first.
If you are unhappy calling memcpy then do something like:
*pos = static_cast<char>(value & 0xff );
*(pos+1) = static_cast<char>(value >> 8 );
although if you are going to do that then use unsigned values.
The above code ensures you get little-endian too. (Obviously reverse the order of the assignments if you want big-endian). You might want a consistent endian-ness if the data is passed around as some kind of binary blob, which is, I assume, what you are trying to create.
You might wish to use something like Google Protocol Buffers if you want to create binary blobs. There is also Boost.Serialization, which includes binary serialization.
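Following the "use unsigned values" advice, the byte-by-byte version might look like this. The names store_le16 and load_le16 are made up for the sketch.

```cpp
#include <cstdint>

// Hypothetical helpers: write/read a 16-bit value in little-endian byte order,
// independent of the host's endianness and of the buffer's alignment.
void store_le16(unsigned char* pos, std::uint16_t value)
{
    pos[0] = static_cast<unsigned char>(value & 0xffu); // low byte first
    pos[1] = static_cast<unsigned char>(value >> 8);    // then high byte
}

std::uint16_t load_le16(const unsigned char* pos)
{
    return static_cast<std::uint16_t>(pos[0] | (pos[1] << 8));
}
```

For big-endian output, swap the two indices in both functions. Working with unsigned types avoids sign-extension surprises in the shifts.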

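You can avoid the explicit cast and the function call by using a union. Be aware, though, that in C++ reading a union member other than the one last written is itself undefined behavior (some compilers allow it as an extension), and the store still goes through a short*, so the alignment and aliasing caveats above still apply: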
union {
    char* c;
    short* s;
} _pos;
short value = ...
*_pos.s = value;
