Which of the two is faster?
1.
char* _pos = ..;
short value = ..;
*((short*)_pos) = value;
2.
char* _pos = ..;
short value = ..;
memcpy(_pos, &value, sizeof(short));
As with all "which is faster?" questions, you should benchmark it to see for yourself. And if it matters, then ask why and pick which you want.
In any case, your first example is technically undefined behavior since you are violating strict-aliasing. So if you had to choose without benchmarking, go with the second one.
To answer the actual question: which is faster will probably depend on the alignment of _pos. If it's aligned properly, then 1 will probably be faster. If not, then 2 might be faster depending on how it's optimized by the compiler. (1 might even crash if the hardware doesn't support misaligned access.)
But this is all guess-work. You really need to benchmark it to know for sure.
At the very least, you should look at the compiled assembly:
; *(short *)_pos = value;
mov WORD PTR [rcx], dx
vs.
; memcpy(_pos, &value, sizeof(short));
mov WORD PTR [rcx], dx
Which in this case (in MSVC) shows the exact same assembly with default optimizations. So you can expect the performance to be the same.
With gcc at an optimization level of -O1 or higher, the following two functions compile to exactly the same machine code on x86:
void foo(char *_pos, short value)
{
memcpy(_pos, &value, sizeof(short));
}
void bar(char *_pos, short value)
{
*(short *)_pos = value;
}
The compiler might implement them both the same way. If it implements memcpy naively (as a real function call), the assignment will be faster.
For any practical purpose, they'll both be done in no time, and you don't need to worry.
Also note that you may have alignment problems (_pos may not be aligned on 2 bytes, which may crash on some processors), and type punning problems (the compiler may assume that what _pos points to isn't changed, because you wrote through a short *).
Does it matter? It might be that the first case will save you some cycles (depends on the compiler sophistication and optimizations). But is it worth the readability and maintainability hit?
Many bugs are introduced because of premature optimization. You should first identify the bottleneck, and if this assignment is that bottleneck - benchmark each of the options (taking care of alignment and other issues mentioned here by others already).
The answer is implementation-dependent. In practice, for doing nothing but copying sizeof(short) bytes, if one is going to be slower, it's going to be memcpy. For considerably larger data sets, if one is going to be faster, it's generally going to be memcpy.
As pointed out, #1 invokes undefined behavior.
We can see that simple assignment is certainly easier to read and write and less error prone than both. Clarity and correctness should come first, even in performance-critical areas for the simple reason that it's easier to optimize correct code than it is to fix optimized, incorrect code. If this is really a C++ question, the need for such code (casts or memcpy that bulldoze over the type system to x-ray and copy around bits) should be very, very rare.
If you are certain that there won't be an alignment issue, and you really find this is a bottleneck situation then go ahead and do the first.
If you are unhappy calling memcpy then do something like:
*pos = static_cast<char>(value & 0xff );
*(pos+1) = static_cast<char>(value >> 8 );
although if you are going to do that then use unsigned values.
The above code ensures you get little-endian too. (Obviously reverse the order of the assignments if you want big-endian). You might want a consistent endian-ness if the data is passed around as some kind of binary blob, which is, I assume, what you are trying to create.
You might wish to use something like Google protocol buffers if you want to create binary blobs. There is also Boost.Serialization, which includes binary serialization.
You can avoid the explicit cast and the function call by using a union of pointers:
union {
char* c;
short* s;
} _pos;
_pos.c = ..; // the original char* position
short value = ..;
*_pos.s = value;
Note, though, that the write through _pos.s still stores a short into memory declared as char, so strictly speaking this does not sidestep the aliasing rules; it only hides the cast.
I started to study IT and I am discussing with a friend right now whether this code is inefficient or not.
// const char *pName
// char *m_pName = nullptr;
for (int i = 0; i < strlen(pName); i++)
m_pName[i] = pName[i];
He claims that, for example, memcpy would do the same as the for loop above. I wonder if that's true; I don't believe it.
If there are more efficient ways or if this is inefficient, please tell me why!
Thanks in advance!
I took a look at actual g++ -O3 output for your code, to see just how bad it was.
char* can alias anything, so even the __restrict__ GNU C++ extension can't help the compiler hoist the strlen out of the loop.
I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName even has to be re-loaded every time, because the aliasing rules allow m_pName[i] to alias this->m_pName. The compiler can't assume that storing to m_pName[i] won't change class member variables, or the src string, or anything else.
#include <string.h>
class foo {
char *__restrict__ m_pName = nullptr;
void set_name(const char *__restrict__ pName);
void alloc_name(size_t sz) { m_pName = new char[sz]; }
};
// g++ will only emit a non-inline copy of the function if there's a non-inline definition.
void foo::set_name(const char * __restrict__ pName)
{
// char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
//char *__restrict__ dst = m_pName; // a local avoids the reload of m_pName, but still can't hoist strlen
#define dst m_pName
for (unsigned int i = 0; i < strlen(pName); i++)
dst[i] = pName[i];
}
Compiles to this asm (g++ -O3 for x86-64, SysV ABI):
...
.L7:
movzx edx, BYTE PTR [rbp+0+rbx] ; byte load from src. clang uses mov al, byte ..., instead of movzx. The difference is debatable.
mov rax, QWORD PTR [r12] ; reload this->m_pName
mov BYTE PTR [rax+rbx], dl ; byte store
add rbx, 1
.L3: ; first iteration entry point
mov rdi, rbp ; function arg for strlen
call strlen
cmp rbx, rax
jb .L7 ; compare-and-branch (unsigned)
Using an unsigned int loop counter introduces an extra mov ebx, ebp copy of the loop counter, which you don't get with either int i or size_t i, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i could produce an infinite loop.
So obviously this is horrible:
a strlen call for every byte copied
copying one byte at a time
reloading m_pName every time through the loop (can be avoided by loading it into a local).
Using strcpy avoids all these problems, because strcpy is allowed to assume that its src and dst don't overlap. Don't use strlen + memcpy unless you want to know strlen yourself. If the most efficient implementation of strcpy is strlen + memcpy, the library function will internally do that. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy for x86-64. (There is an SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great in microbenchmarks, but it pollutes the instruction cache, uop cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64-bit, 32-bit, and smaller chunks in the startup/cleanup sections.
Using strcpy of course also avoids bugs like forgetting to store a trailing '\0' character in the destination. If your input strings are potentially gigantic, using int for the loop counter (instead of size_t) is also a bug. Using strncpy is generally better, since you often know the size of the dest buffer, but not the size of the src.
memcpy can be more efficient than strcpy, since rep movs is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy when you already know the length of your data.
At best it's somewhat inefficient. At worst, it's quite inefficient.
In the good case, the compiler recognizes that it can hoist the call to strlen out of the loop. In this case, you end up traversing the input string once to compute the length, and then again to copy to the destination.
In the bad case, the compiler calls strlen every iteration of the loop, in which case the complexity becomes quadratic instead of linear.
As far as how to do it efficiently, I'd tend to do something like this:
char *dest = m_pName;
for (char const *in = pName; *in; ++in)
*dest++ = *in;
*dest++ = '\0';
This traverses the input only once, so it's potentially about twice as fast as the first, even in the better case (and in the quadratic case, it can be many times faster, depending on the length of the string).
Of course, this is doing pretty much the same thing as strcpy would. That may or may not be more efficient still--I've certainly seen cases where it was. Since you'd normally assume strcpy is going to be used quite a lot, it can be worthwhile to spend more time optimizing it than some random guy on the internet typing in an answer in a couple minutes.
Yes, your code is inefficient. Your code takes what is called "O(n^2)" time. Why? You have the strlen() call in your loop, so your code recalculates the length of the string on every single iteration. You can make it faster by doing this:
size_t len = strlen(pName);
for (size_t i = 0; i < len; i++)
m_pName[i] = pName[i];
Now, you calculate the string length only once, so this code takes "O(n)" time, which is much faster than O(n^2). This is now about as efficient as you can get. However, a memcpy call would still be 4-8 times faster, because this code copies 1 byte at a time, whereas memcpy will use your system's word length.
Depends on your interpretation of efficiency. I'd claim using memcpy() or strcpy() is more efficient, because you don't write such loops every time you need a copy.
He is claiming that for example memcpy would do the same as the for loop above.
Well, not exactly the same. memcpy() takes the size once, while strlen(pName) is potentially called on every loop iteration. So from a performance standpoint memcpy() would be better.
BTW from your commented code:
// char *m_pName = nullptr;
Writing through m_pName initialized like that, without allocating memory for it, leads to undefined behavior. You need something like:
char *m_pName = new char[strlen(pName) + 1];
Why the +1? Because you have to consider putting a '\0' indicating the end of the c-style string.
Yes, it's inefficient, not because you're using a loop instead of memcpy but because you're calling strlen on each iteration. strlen loops over the entire array until it finds the terminating zero byte.
Also, it's very unlikely that the strlen will be optimized out of the loop condition, see In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing).
So memcpy(m_pName, pName, strlen(pName)) would indeed be faster.
Even faster would be strcpy, because it avoids the strlen loop:
strcpy(m_pName, pName);
strcpy does the same as the loop in #JerryCoffin's answer.
For simple operations like that you should almost always say what you mean and nothing more.
In this instance if you had meant strcpy() then you should have said that, because strcpy() will copy the terminating NUL character, whereas that loop will not.
Neither one of you can win the debate. A modern compiler has seen a thousand different memcpy() implementations and there's a good chance it's just going to recognise yours and replace your code either with a call to memcpy() or with its own inlined implementation of the same.
It knows which one is best for your situation. Or at least it probably knows better than you do. When you second-guess that you run the risk of the compiler failing to recognise it and your version being worse than the collected clever tricks the compiler and/or library knows.
Here are a few considerations that you have to get right if you want to run your own code instead of the library code:
What's the largest read/write chunk size that is efficient (it's rarely bytes).
For what range of loop lengths is it worth the trouble of pre-aligning reads and writes so that larger chunks can be copied?
Is it better to align reads, align writes, do nothing, or to align both and perform permutations in arithmetic to compensate?
What about using SIMD registers? Are they faster?
How many reads should be performed before the first write? How much register file needs to be used for the most efficient burst accesses?
Should a prefetch instruction be included?
How far ahead?
How often?
Does the loop need extra complexity to avoid preloading over the end?
How many of these decisions can be resolved at run-time without causing too much overhead? Will the tests cause branch prediction failures?
Would inlining help, or is that just wasting icache?
Does the loop code benefit from cache line alignment? Does it need to be packed tightly into a single cache line? Are there constraints on other instructions within the same cache line?
Does the target CPU have dedicated instructions like rep movsb which perform better? Does it have them but they perform worse?
Going further; because memcpy() is such a fundamental operation it's possible that even the hardware will recognise what the compiler's trying to do and implement its own shortcuts that even the compiler doesn't know about.
Don't worry about the superfluous calls to strlen(). Compiler probably knows about that, too. (Compiler should know in some instances, but it doesn't seem to care) Compiler sees all. Compiler knows all. Compiler watches over you while you sleep. Trust the compiler.
Oh, except the compiler might not catch that null pointer reference. Stupid compiler!
This code is confused in various ways.
Just do m_pName = pName; because you're not actually copying the string.
You're just pointing to the one you've already got.
If you want to copy the string m_pName = strdup(pName); would do it.
If you already have storage, strcpy or memcpy would do it.
In any case, get strlen out of the loop.
This is the wrong time to worry about performance.
First get it right.
If you insist on worrying about performance, it's hard to beat strcpy.
What's more, you don't have to worry about it being right.
As a matter of fact, why do you need to copy at all (either with the loop or memcpy)?
If you want to duplicate a memory block, that's a different question. But since it's a pointer, all you need is &pName[0] (the address of the first element of the array) and its size. You can reference any element of the array by offsetting from the first byte, and you know the limit from the size value, so why have all these pointers? (Let me know if there is more to this than theoretical debate.)
I want to implement
void bitwise_and(
char* __restrict__ result,
const char* __restrict__ lhs,
const char* __restrict__ rhs,
size_t length);
or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details - alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions etc.
I'm sure this has (more than one) fast existing implementations, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bit_set - but I don't want to spend the time constructing one of those.
So what should I do? Copy-paste from an existing library? Find a library which can 'wrap' a raw packed bits array in memory with a nice object? Roll my own implementation anyway?
Notes:
I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
Obviously, making copies of the input arrays is out of the question - that would probably nearly-double the execution time.
I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
If it helps you any, assume it's uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treat the heading and trailing chars separately).
This answer is going to assume you want the fastest possible way and are happy to use platform specific things. Your optimising compiler may be able to produce similar code to the below from normal C, but in my experience across a few compilers something as specific as this is still best hand-written.
Obviously like all optimisation tasks, never assume anything is better/worse and measure, measure, measure.
If you could lock down your architecture to x86 with at least SSE2 (which is where these intrinsics come from) you would do:
#include <emmintrin.h> // SSE2 intrinsics

void bitwise_and(
char* result,
const char* lhs,
const char* rhs,
size_t length)
{
while(length >= 16)
{
// Load in 16byte registers
auto lhsReg = _mm_loadu_si128((__m128i*)lhs);
auto rhsReg = _mm_loadu_si128((__m128i*)rhs);
// do the op
auto res = _mm_and_si128(lhsReg, rhsReg);
// save off again
_mm_storeu_si128((__m128i*)result, res);
// book keeping
length -= 16;
result += 16;
lhs += 16;
rhs += 16;
}
// do the tail end. Assuming that the array is large the
// most that the following code can be run is 15 times so I'm not
// bothering to optimise. You could do it in 64 bit then 32 bit
// then 16 bit then char chunks if you wanted...
while (length)
{
*result = *lhs & *rhs;
length -= 1;
result += 1;
lhs += 1;
rhs += 1;
}
}
This compiles to ~10 asm instructions per 16 bytes (+ change for the leftover and a little overhead).
The great thing about doing intrinsics like this (over hand rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) ontop of what you write. It also handles register allocation.
If you could guarantee aligned data you could save an asm instruction (use _mm_load_si128 instead, and the compiler will be clever enough to avoid a second load and use it as a direct mem operand to the 'pand').
If you could guarantee AVX2+ then you could use the 256 bit version and handle 10 asm instructions per 32 bytes.
On ARM there are similar NEON instructions.
If you wanted to do multiple ops just add the relevant intrinsic in the middle and it'll add 1 asm instruction per 16 bytes.
I'm pretty sure with a decent processor you don't need any additional cache control.
Don't do it this way. The individual operations will look great, sleek asm, nice performance .. but a composition of them will be terrible. You cannot make this abstraction, nice as it looks. The arithmetic intensity of those kernels is almost the worst possible (the only worse one is doing no arithmetic, such as a straight up copy), and composing them at a high level will retain that awful property. In a sequence of operations each using the result of the previous one, the results are written and read again a lot later (in the next kernel), even though the high level flow could be transposed so that the result the "next operation" needs is right there in a register. Also, if the same argument appears twice in an expression tree (and not both as operands to one operation), they will be streamed in twice, instead of reusing the data for two operations.
It doesn't have that nice warm fuzzy feeling of "look at all this lovely abstraction" about it, but what you should do is find out at a high level how you're combining your vectors, and then try to chop that in pieces that make sense from a performance perspective. In some cases that may mean making big ugly messy loops that will make people get an extra coffee before diving in, that's just too bad then. If you want performance, you often have to sacrifice something else. Usually it's not so bad, it probably just means you have a loop that has an expression consisting of intrinsics in it, instead of an expression of vector-operations that each individually have a loop.
After having read the following 1 and 2 Q/As, and having used the technique discussed below for many years on x86 architectures with GCC and MSVC without seeing any problems, I'm now very confused as to what is supposed to be the correct, and just as importantly the "most efficient", way to serialize then deserialize binary data using C++.
Given the following "wrong" code:
int main()
{
std::ifstream strm("file.bin");
char buffer[sizeof(int)] = {0};
strm.read(buffer,sizeof(int));
int i = 0;
// Experts seem to think doing the following is bad and
// could crash entirely when run on ARM processors:
i = *reinterpret_cast<int*>(buffer);
return 0;
}
Now as I understand things, the reinterpret cast indicates to the compiler that it can treat the memory at buffer as an integer and subsequently is free to issue integer compatible instructions which require/assume certain alignments for the data in question - with the only overhead being the extra reads and shifts when the CPU detects the address it is trying to execute alignment oriented instructions is actually not aligned.
That said the answers provided above seem to indicate as far as C++ is concerned that this is all undefined behavior.
Assuming that the alignment of the location in buffer from which cast will occur is not conforming, then is it true that the only solution to this problem is to copy the bytes 1 by 1? Is there perhaps a more efficient technique?
Furthermore I've seen over the years many situations where a struct made up entirely of pods (using compiler specific pragmas to remove padding) is cast to a char* and subsequently written to a file or socket, then later on read back into a buffer and the buffer cast back to a pointer of the original struct, (ignoring potential endian and float/double format issues between machines), is this kind of code also considered undefined behaviour?
The following is more complex example:
int main()
{
std::ifstream strm("file.bin");
char storage[1000] = {0};
const std::size_t size = sizeof(int) + sizeof(short) + sizeof(float) + sizeof(double);
const std::size_t weird_offset = 3;
char* buffer = storage + weird_offset;
strm.read(buffer,size);
int i = 0;
short s = 0;
float f = 0.0f;
double d = 0.0;
// Experts seem to think doing the following is bad and
// could crash entirely when run on ARM processors:
i = *reinterpret_cast<int*>(buffer);
buffer += sizeof(int);
s = *reinterpret_cast<short*>(buffer);
buffer += sizeof(short);
f = *reinterpret_cast<float*>(buffer);
buffer += sizeof(float);
d = *reinterpret_cast<double*>(buffer);
buffer += sizeof(double);
return 0;
}
First, you can correctly, portably, and efficiently solve the alignment problem using, e.g., std::aligned_storage<sizeof(int), std::alignment_of<int>::value>::type instead of char[sizeof(int)] (or, if you don't have C++11, there may be similar compiler-specific functionality).
Even if you're dealing with a complex POD, aligned_storage and alignment_of will give you a buffer that you can memcpy the POD into and out of, construct it into, etc.
In some more complex cases, you need to write more complex code, potentially using compile-time arithmetic and template-based static switches and so on, but so far as I know, nobody came up with a case during the C++11 deliberations that wasn't possible to handle with the new features.
However, just using reinterpret_cast on a random char-aligned buffer is not enough. Let's look at why:
the reinterpret cast indicates to the compiler that it can treat the memory at buffer as an integer
Yes, but you're also indicating that it can assume that the buffer is aligned properly for an integer. If you're lying about that, it's free to generate broken code.
and subsequently is free to issue integer compatible instructions which require/assume certain alignments for the data in question
Yes, it's free to issue instructions that either require those alignments, or that assume they're already taken care of.
with the only overhead being the extra reads and shifts when the CPU detects the address it is trying to execute alignment oriented instructions is actually not aligned.
Yes, it may issue instructions with the extra reads and shifts. But it may also issue instructions that don't do them, because you've told it that it doesn't have to. So, it could issue a "read aligned word" instruction which raises an interrupt when used on non-aligned addresses.
Some processors don't have a "read aligned word" instruction, and just "read word" faster with alignment than without. Others can be configured to suppress the trap and instead fall back to a slower "read word". But others—like ARM—will just fail.
Assuming that the alignment of the location in buffer from which cast will occur is not conforming, then is it true that the only solution to this problem is to copy the bytes 1 by 1? Is there perhaps a more efficient technique?
You don't need to copy the bytes 1 by 1. You could, for example, memcpy each variable one by one into properly-aligned storage. (That would only be copying bytes 1 by 1 if all of your variables were 1-byte long, in which case you wouldn't be worried about alignment in the first place…)
As for casting a POD to char* and back using compiler-specific pragmas… well, any code that relies on compiler-specific pragmas for correctness (rather than for, say, efficiency) is obviously not correct, portable C++. Sometimes "correct with g++ 3.4 or later on any 64-bit little-endian platform with IEEE 64-bit doubles" is good enough for your use cases, but that's not the same thing as actually being valid C++. And you certainly can't expect it to work with, say, Sun cc on a 32-bit big-endian platform with 80-bit doubles and then complain that it doesn't.
For the example you added later:
// Experts seem to think doing the following is bad and
// could crash entirely when run on ARM processors:
buffer += weird_offset;
i = *reinterpret_cast<int*>(buffer);
buffer += sizeof(int);
Experts are right. Here's a simple example of the same thing:
int i[2];
char *c = reinterpret_cast<char *>(i) + 1;
int *j = reinterpret_cast<int *>(c);
int k = *j;
The variable i will be aligned at some address divisible by 4, say, 0x01000000. So, j will be at 0x01000001. So the line int k = *j will issue an instruction to read a 4-byte-aligned 4-byte value from 0x01000001. On, say, PPC64, that will just take about 8x as long as int k = *i, but on, say, ARM, it will crash.
So, if you have this:
int i = 0;
short s = 0;
float f = 0.0f;
double d = 0.0;
And you want to write it to a stream, how do you do it?
writeToStream(&i);
writeToStream(&s);
writeToStream(&f);
writeToStream(&d);
How do you read back from a stream?
readFromStream(&i);
readFromStream(&s);
readFromStream(&f);
readFromStream(&d);
Presumably whatever kind of stream you're using (whether ifstream, FILE*, whatever) has a buffer in it, so readFromStream(&f) is going to check whether there are sizeof(float) bytes available, read the next buffer if not, then copy the first sizeof(float) bytes from the buffer to the address of f. (In fact, it may even be smarter—it's allowed to, e.g., check whether you're just near the end of the buffer, and if so issue an asynchronous read-ahead, if the library implementer thought that would be a good idea.) The standard doesn't say how it has to do the copy. Standard libraries don't have to run anywhere but on the implementation they're part of, so your platform's ifstream could use memcpy, or *(float*), or a compiler intrinsic, or inline assembly—and it will probably use whatever's fastest on your platform.
So, how exactly would unaligned access help you optimize this or simplify it?
In nearly every case, picking the right kind of stream, and using its read and write methods, is the most efficient way of reading and writing. And, if you've picked a stream out of the standard library, it's guaranteed to be correct, too. So, you've got the best of both worlds.
If there's something peculiar about your application that makes something different more efficient—or if you're the guy writing the standard library—then of course you should go ahead and do that. As long as you (and any potential users of your code) are aware of where you're violating the standard and why (and you actually are optimizing things, rather than just doing something because it "seems like it should be faster"), this is perfectly reasonable.
You seem to think that it would help to be able to put them into some kind of "packed struct" and just write that, but the C++ standard does not have any such thing as a "packed struct". Some implementations have non-standard features that you can use for that. For example, both MSVC and gcc will let you pack the above into 18 bytes on i386, and you can take that packed struct and memcpy it, reinterpret_cast it to char * to send over the network, whatever. But it won't be compatible with the exact same code compiled by a different compiler that doesn't understand your compiler's special pragmas. It won't even be compatible with a related compiler, like gcc for ARM, which will pack the same thing into 20 bytes. When you use non-portable extensions to the standard, the result is not portable.
I decided to find the speeds of 2 functions :
strcmp - The standard comparison function defined in string.h
xstrcmp- A function that has same parameters and does the same, just that I created it.
Here is my xstrcmp function :
int xstrlen(char *str)
{
int i;
for(i=0;;i++)
{
if(str[i]=='\0')
break;
}
return i;
}
int xstrcmp(char *str1, char *str2)
{
int i, k;
if(xstrlen(str1)!=xstrlen(str2))
return -1;
k=xstrlen(str1)-1;
for(i=0;i<=k;i++)
{
if(str1[i]!=str2[i])
return -1;
}
return 0;
}
I didn't want to depend on strlen, since I want everything user-defined.
So, I found the results. strcmp did 364 comparisons per millisecond while my xstrcmp did just 20 comparisons per millisecond (at least on my computer!)
Can anyone tell me why this is so? What does the strcmp function do to make it so fast?
if(xstrlen(str1)!=xstrlen(str2)) //computing length of str1
return -1;
k=xstrlen(str1)-1; //computing length of str1 AGAIN!
You're computing the length of str1 TWICE. That is one reason why your function loses the game.
Also, your implementation of xstrcmp is very naive compared to the ones defined in (most) Standard libraries. For example, your xstrcmp compares one byte at a time, when in fact it could compare multiple bytes in one go, taking advantage of proper alignment as well, or could do a little preprocessing so as to align memory blocks before the actual comparison.
strcmp and other library routines are written in assembly, or specialized C code, by experienced engineers and use a variety of techniques.
For example, the assembly implementation might load four bytes at a time into a register, and compare that register (as a 32-bit integer) to four bytes from the other string. On some machines, the assembly implementation might load eight bytes or even more. If the comparison shows the bytes are equal, the implementation moves on to the next four bytes. If the comparison shows the bytes are unequal, the implementation stops.
Even with this simple optimization, there are a number of issues to be dealt with. If the string addresses are not multiples of four bytes, the processor might not have an instruction that will load four bytes (many processors require four-byte loads to use addresses that are aligned to multiples of four bytes). Depending on the processor, the implementation might have to use slower unaligned loads or to write special code for each alignment case that does aligned loads and shifts bytes in registers to align the bytes to be compared.
When the implementation loads four bytes at once, it must ensure it does not load bytes beyond the terminating null character if those bytes might cause a segmentation fault (an error because you tried to load from an address that is not readable).
If the four bytes do contain the terminating null character, the implementation must detect it and not continue comparing further bytes, even if the current four are equal in the two strings.
Many of these issues require detailed assembly instructions, and the required control over the exact instructions used is not available in C. The exact techniques used vary from processor model to processor model and vary greatly from architecture to architecture.
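The word-at-a-time idea described above can be sketched in portable code. This is an illustrative sketch, not the actual library code: the function name is ours, it uses 8-byte words, and it assumes it is safe to read a few bytes past the terminator (real implementations guard against crossing into an unreadable page, as discussed above).

```cpp
#include <stdint.h>
#include <string.h>

/* Returns nonzero if any byte of v is zero.  Classic bit trick:
   (v - 0x01..01) borrows out of a byte only if that byte was zero
   (or had its high bit set, which the "& ~v" filters out). */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Illustrative word-at-a-time strcmp: compares eight bytes per
   iteration, and falls back to byte-by-byte comparison as soon as a
   difference or a possible terminator shows up. */
int word_strcmp(const char *s1, const char *s2)
{
    for (;;) {
        uint64_t a, b;
        memcpy(&a, s1, 8);   /* memcpy avoids strict-aliasing problems */
        memcpy(&b, s2, 8);
        if (a != b || has_zero_byte(a))
            break;           /* a difference or a terminator is near */
        s1 += 8;
        s2 += 8;
    }
    /* Finish byte by byte to produce the exact answer. */
    while (*s1 == *s2 && *s1 != '\0') { s1++; s2++; }
    return (unsigned char)*s1 - (unsigned char)*s2;
}
```

Note how the fast loop never has to identify *which* byte differs; it only has to notice that something differs, then hands off to the simple loop.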
Faster implementation of strlen:
//Return difference in addresses - 1, as we don't count the null terminator in strlen.
int xstrlen(char *str)
{
    char *ptr = str;
    while (*str++);
    return str - ptr - 1;
}
//Pretty nifty strcmp from here:
//http://vijayinterviewquestions.blogspot.com/2007/07/implement-strcmpstr1-str2-function.html
int mystrcmp(const char *s1, const char *s2)
{
    while (*s1 == *s2)
    {
        if (*s1 == '\0')
            return 0;
        ++s1;
        ++s2;
    }
    return *s1 - *s2;
}
I'll do the other one later if I have time. You should also note that most of these are done in assembly language, or using other optimized means, which will be faster than the best straight C implementation you can write.
Aside from the problems in your code (which have been pointed out already): at least in the gcc C libraries, the str- and mem-functions are faster by a margin in most cases because their memory access patterns are highly optimized.
There were some discussions on the topic on SO already.
Try this:
int xstrlen(const char* s){
    const char* s0 = s;
    while(*s) s++;
    return s - s0;
}

int xstrcmp(const char* a, const char* b){
    while(*a && *a==*b){ a++; b++; }
    return *a - *b;
}
This could probably be sped up with some loop unrolling.
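For what it's worth, a hand-unrolled version of the length loop might look like the sketch below (the name is illustrative). Whether it actually wins depends on the compiler, which may well unroll the simple loop by itself at higher optimization levels.

```cpp
/* Four terminator tests per iteration instead of one, which reduces
   the loop-control overhead (one branch back per four bytes). */
int xstrlen_unrolled(const char *s)
{
    const char *s0 = s;
    for (;;) {
        if (!s[0]) return (int)(s - s0);
        if (!s[1]) return (int)(s - s0 + 1);
        if (!s[2]) return (int)(s - s0 + 2);
        if (!s[3]) return (int)(s - s0 + 3);
        s += 4;
    }
}
```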
1. Algorithm
Your implementation of strcmp could use a better algorithm. There should be no need to call strlen at all; each call to strlen iterates over the whole string again. You can find simple but effective implementations online; probably the place to start is something like:
// Adapted from http://vijayinterviewquestions.blogspot.co.uk
int xstrcmp(const char *s1, const char *s2)
{
    for (; *s1 == *s2; ++s1, ++s2)
    {
        if (*s1 == '\0') return 0;
    }
    return *s1 - *s2;
}
That doesn't do everything, but should be simple and work in most cases.
2. Compiler optimisation
It's a stupid question, but make sure you turned on all the optimisation switches when you compile.
3. More sophisticated optimisations
People writing libraries will often use more advanced techniques, such as loading a 4-byte or 8-byte int at once, and comparing it, and only comparing individual bytes if the whole matches. You'd need to be an expert to know what's appropriate for this case, but you can find people discussing the most efficient implementation on stack overflow (link?)
Some standard library functions for some platforms may be hand-written in assembly if the coder knows there's a more efficient implementation than the compiler can find. That's increasingly rare now, but may be common on some embedded systems.
4. Linker "cheating" with standard library
With some standard library functions, the linker may be able to make your program call them with less overhead than calling functions in your own code, because it was designed to know more about the specific internals of the functions (link?). I don't know if that applies in this case (it probably doesn't), but it's the sort of thing you have to think about.
5. OK, ok, I get that, but when SHOULD I implement my own strcmp?
Off the top of my head, the only reasons to do this are:
You want to learn how. This is a good reason.
You are writing for a platform which doesn't have a good enough standard library. This is very unlikely.
The string comparison has been measured to be a significant bottleneck in your code, and you know something specific about your strings that mean you can compare them more efficiently than a naive algorithm. (Eg. all strings are allocated 8-byte aligned, or all strings have an N-byte prefix.) This is very, very unlikely.
6. But...
OK, WHY do you want to avoid relying on strlen? Are you worried about code size? About portability of code or of executables?
If there's a good reason, open another question and there may be a more specific answer. So I'm sorry if I'm missing something obvious, but relying on the standard library is usually much better, unless there's something specific you want to improve on.
[This question is related to but not the same as this one.]
My compiler warns about implicitly converting or casting certain types to bool whereas explicit conversions do not produce a warning:
long t = 0;
bool b = false;
b = t; // performance warning: forcing long to bool
b = (bool)t; // performance warning
b = bool(t); // performance warning
b = static_cast<bool>(t); // performance warning
b = t ? true : false; // ok, no warning
b = t != 0; // ok
b = !!t; // ok
This is with Visual C++ 2008 but I suspect other compilers may have similar warnings.
So my question is: what is the performance implication of casting/converting to bool? Does explicit conversion have better performance in some circumstance (e.g., for certain target architectures or processors)? Does implicit conversion somehow confuse the optimizer?
Microsoft's explanation of their warning is not particularly helpful. They imply that there is a good reason but they don't explain it.
I was puzzled by this behaviour, until I found this link:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=99633
Apparently, coming from the Microsoft Developer who "owns" this warning:
This warning is surprisingly helpful, and found a bug in my code just yesterday. I think Martin is taking "performance warning" out of context.

It's not about the generated code, it's about whether or not the programmer has signalled an intent to change a value from int to bool. There is a penalty for that, and the user has the choice to use "int" instead of "bool" consistently (or more likely vice versa) to avoid the "boolifying" codegen. [...]

It is an old warning, and may have outlived its purpose, but it's behaving as designed here.
So it seems to me the warning is more about style and avoiding some mistakes than anything else.
Hope this will answer your question...
:-p
The performance is identical across the board. It involves a couple of instructions on x86, maybe 3 on some other architectures.
On x86 / VC++, they all do
cmp DWORD PTR [whatever], 0
setne al
GCC generates the same thing, but without the warnings (at any warning-level).
The performance warning does actually make a little bit of sense. I've had it as well, and my curiosity led me to investigate with the disassembler. It is trying to tell you that the compiler has to generate some code to coerce the value to either 0 or 1. Because you are insisting on a bool, the old-school C idea of "0 or anything else" doesn't apply.
You can avoid that tiny performance hit if you really want to. The best way is to avoid the cast altogether and use a bool from the start. If you must have an int, you can just test it directly with if (i) instead of converting it first with if ((bool)i). The code generated will simply check whether the int is 0 or not; no extra code to force the value to exactly 1 is generated.
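To make the normalization concrete, here is a small sketch (function names are ours, purely for illustration) of what the compiler must do in each case:

```cpp
// Going through bool forces an extra compare-and-set so the result
// is exactly 0 or 1; a plain int copy preserves the value as-is.
inline int to_int_via_bool(long t)
{
    bool b = t != 0;   // explicit test: no warning, same machine code
    return b;          // always 0 or 1
}

inline int to_int_directly(long t)
{
    return static_cast<int>(t);  // plain copy, value preserved
}
```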
Sounds like premature optimization to me. Are you expecting the performance of the cast to seriously affect the performance of your app? Maybe if you are writing kernel code or device drivers, but in most cases it should all be OK.
As far as I know, there is no warning on any other compiler for this. The only way I can think this would cause a performance loss is that the compiler has to compare the entire integer to 0 and then assign the bool appropriately. That's unlike a conversion such as char to bool, where the result can simply be copied over, because a bool is one byte and so the two are effectively the same; or an integral conversion, which involves copying some or all of the source to the destination, possibly after zeroing the destination if it's bigger than the source (in terms of memory).
It's yet another one of Microsoft's useless and unhelpful ideas as to what constitutes good code, and leads us to have to put up with stupid definitions like this:
template <typename T>
inline bool to_bool (const T& t)
{ return t ? true : false; }
long t;
bool b;
int i;
signed char c;
...
You get a warning when you do anything that would be "free" if bool wasn't required to be 0 or 1. b = !!t effectively assigns the result of applying the (language built-in, non-overridable) logical-not operator to t twice.
You shouldn't expect the ! or != operators to cost zero asm instructions, even with an optimizing compiler. It is usually true that int i = t is optimized away completely. Or even signed char c = t; (on x86/amd64, if t is in the %eax register, then after c = t, using c just means using %al. amd64 has byte addressing for every register, BTW. IIRC, in x86 some registers don't have byte addressing.)
Anyway, b = t; i = b; isn't the same as c = t; i = c;: the former amounts to i = !!t; while the latter amounts to i = t & 0xff;
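A small sketch of that difference (illustrative names; unsigned char is used here because its truncation is well-defined): for t = 256 the bool route yields 1 while the byte route yields 0, since the low byte of 256 is zero.

```cpp
// Converting through bool normalizes any non-zero value to 1;
// converting through a narrow character type keeps only the low byte.
inline int via_bool(long t)
{
    bool b = t != 0;   // boolification: non-zero -> 1
    return b;
}

inline int via_uchar(long t)
{
    unsigned char c = static_cast<unsigned char>(t);  // low byte only
    return c;
}
```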
Err, I guess everyone already knows all that from the previous replies. My point was, the warning made sense to me, since it caught cases where the compiler had to do things you didn't really tell it to, like !!BOOL on return because you declared the function bool, but are returning an integral value that could be true and != 1. e.g. a lot of windows stuff returns BOOL (int).
This is one of MSVC's few warnings that G++ doesn't have. I'm a lot more used to g++, and it definitely warns about stuff MSVC doesn't, but that I'm glad it told me about. I wrote a portab.h header file with stubs for the MFC/Win32 classes/macros/functions I used. This got the MFC app I'm working on to compile on my GNU/Linux machine at home (and with cygwin). I mainly wanted to be able to compile-test what I was working on at home, but I ended up finding g++'s warnings very useful. It's also a lot stricter about e.g. templates...
On bool in general, I'm not sure it makes for better code when used for return values and parameter passing. Even for locals, g++ 4.3 doesn't seem to figure out that it doesn't have to coerce the value to 0 or 1 before branching on it. If it's a local variable and you never take its address, the compiler should keep it in whatever size is fastest. If it has to spill it from registers to the stack, it could just as well keep it in 4 bytes, since that may be slightly faster. (It uses a lot of movsx (sign-extension) instructions when loading/storing (non-local) bools, but I don't really remember what it did for automatic (local stack) variables. I do remember seeing it reserve an odd amount of stack space (not a multiple of 4) in functions that had some bool locals.)
Using bool flags was slower than int with the Digital Mars D compiler as of last year:
http://www.digitalmars.com/d/archives/digitalmars/D/opEquals_needs_to_return_bool_71813.html
(D is a lot like C++, but abandons full C backwards compat to define some nice new semantics, and good support for template metaprogramming. e.g. "static if" or "static assert" instead of template hacks or cpp macros. I'd really like to give D a try sometime. :)
For data structures, it can make sense, e.g. if you want to pack a couple flags before an int and then some doubles in a struct you're going to have quite a lot of.
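For example, under a typical 64-bit ABI (1-byte bool, 4-byte int, 8-byte double, with the usual padding rules), two bool flags in front of an int pack into alignment padding that int flags would not, so the first struct below is usually smaller than the second. The struct names are illustrative; the exact sizes are ABI-dependent.

```cpp
#include <cstddef>

struct PackedFlags {
    bool   a;   // 1 byte
    bool   b;   // 1 byte, then 2 bytes padding before n
    int    n;
    double d;
};  // typically 16 bytes

struct IntFlags {
    int    a;   // 4 bytes per flag
    int    b;
    int    n;   // then 4 bytes padding before d
    double d;
};  // typically 24 bytes
```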
Based on your link to MS's explanation, it appears that if the value is merely 1 or 0, there is no performance hit, but for any other non-zero value a comparison must be generated at compile time?
In C++, a bool behaves like an int restricted to two values: 0 = false, 1 = true. When branching on a bool, the compiler only has to test a single byte. To be perfectly clear, any non-zero value converts to true, so converting an int to bool costs processing cycles to normalize it.
By using a long as in the code sample, you are forcing more bits to be checked, which will cause a performance hit.
No, this is not premature optimization; it is quite crazy to use code that takes more processing time across the board. This is simply good coding practice.
Unless you're writing code for a really critical inner loop (simulator core, ray-tracer, etc.) there is no point in worrying about any performance hits in this case. There are other more important things to worry about in your code (and other more significant performance traps lurking, I'm sure).
Microsoft's explanation seems to be that what they're trying to say is:
Hey, if you're using an int, but are only storing true or false information in it, make it a bool!
I'm skeptical about how much would be gained performance-wise, but MS may have found that there was some gain (for their use, anyway). Microsoft's code does tend to run an awful lot, so maybe they've found the micro-optimization to be worthwhile. I believe that a fair bit of what goes into the MS compiler is to support stuff they find useful themselves (only makes sense, right?).
And you get rid of some dirty, little casts to boot.
I don't think performance is the issue here. The reason you get a warning is that information is lost during conversion from int to bool.