Why are standard string functions faster than my custom string functions? - c++

I decided to measure the speed of two functions:
strcmp - The standard comparison function declared in string.h
xstrcmp - A function that takes the same parameters and does the same job, except that I wrote it myself.
Here is my xstrcmp function:
int xstrlen(char *str)
{
    int i;
    for(i=0;;i++)
    {
        if(str[i]=='\0')
            break;
    }
    return i;
}
int xstrcmp(char *str1, char *str2)
{
    int i, k;
    if(xstrlen(str1)!=xstrlen(str2))
        return -1;
    k=xstrlen(str1)-1;
    for(i=0;i<=k;i++)
    {
        if(str1[i]!=str2[i])
            return -1;
    }
    return 0;
}
I didn't want to depend on strlen, since I want everything to be user-defined.
Here are the results: strcmp did 364 comparisons per millisecond and my xstrcmp did just 20 comparisons per millisecond (at least on my computer!)
Can anyone tell me why this is so? What does the strcmp function do to make itself so fast?

if(xstrlen(str1)!=xstrlen(str2)) //computing length of str1
    return -1;
k=xstrlen(str1)-1; //computing length of str1 AGAIN!
You're computing the length of str1 TWICE. That is one reason why your function loses the game.
Also, your implementation of xstrcmp is very naive compared to the ones defined in (most) standard libraries. For example, your xstrcmp compares one byte at a time, when it could compare multiple bytes in one go by taking advantage of proper alignment, or do a little preprocessing to align the memory blocks before the actual comparison.

strcmp and other library routines are written in assembly, or specialized C code, by experienced engineers and use a variety of techniques.
For example, the assembly implementation might load four bytes at a time into a register, and compare that register (as a 32-bit integer) to four bytes from the other string. On some machines, the assembly implementation might load eight bytes or even more. If the comparison shows the bytes are equal, the implementation moves on to the next four bytes. If the comparison shows the bytes are unequal, the implementation stops.
Even with this simple optimization, there are a number of issues to be dealt with. If the string addresses are not multiples of four bytes, the processor might not have an instruction that will load four bytes (many processors require four-byte loads to use addresses that are aligned to multiples of four bytes). Depending on the processor, the implementation might have to use slower unaligned loads or to write special code for each alignment case that does aligned loads and shifts bytes in registers to align the bytes to be compared.
When the implementation loads four bytes at once, it must ensure it does not load bytes beyond the terminating null character if those bytes might cause a segmentation fault (an error raised because you tried to load from an address that is not readable).
If the four bytes do contain the terminating null character, the implementation must detect it and not continue comparing further bytes, even if the current four are equal in the two strings.
Many of these issues require detailed assembly instructions, and the required control over the exact instructions used is not available in C. The exact techniques used vary from processor model to processor model and vary greatly from architecture to architecture.
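The chunked comparison described above can be sketched in portable C++. This is only an illustration, not how any particular library does it. The names padded_strcmp and has_zero_byte are made up for the example, and the sketch sidesteps the over-read problem by assuming both strings live in readable buffers whose size is a multiple of 8 bytes. The loads go through memcpy so that unaligned addresses are not a problem.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True if any of the eight bytes in w is zero (the classic bit trick).
static bool has_zero_byte(std::uint64_t w) {
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

// ASSUMPTION (hypothetical contract): a and b each point to the start of a
// readable buffer whose size is a multiple of 8 bytes, so an 8-byte load
// never touches memory outside the buffer.
int padded_strcmp(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    for (;;) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a, 8);  // memcpy keeps unaligned loads well-defined
        std::memcpy(&wb, b, 8);
        if (wa != wb || has_zero_byte(wa)) {
            // Slow path: the answer lies somewhere in these eight bytes.
            for (int i = 0; i < 8; ++i) {
                unsigned char ca = static_cast<unsigned char>(a[i]);
                unsigned char cb = static_cast<unsigned char>(b[i]);
                if (ca != cb) return ca < cb ? -1 : 1;
                if (ca == 0) return 0;  // both strings ended here
            }
        }
        a += 8;  // all eight bytes equal and non-zero: next chunk
        b += 8;
    }
}
```

The slow path always returns, because it is only entered when the chunk contains either a difference or the terminator.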

Faster implementation of strlen:
//Return difference in addresses - 1 as we don't count null terminator in strlen.
int xstrlen(char *str)
{
    char* ptr = str;
    while (*str++);
    return str - ptr - 1;
}
//Pretty nifty strcmp from here:
//http://vijayinterviewquestions.blogspot.com/2007/07/implement-strcmpstr1-str2-function.html
int mystrcmp(const char *s1, const char *s2)
{
    while (*s1==*s2)
    {
        if(*s1=='\0')
            return(0);
        ++s1;
        ++s2;
    }
    return(*s1-*s2);
}
I'll do the other one later if I have time. You should also note that most of these are written in assembly language or use other optimized techniques, which will be faster than the best straight C implementation you can write.

Aside from the problems in your code (which have been pointed out already): at least in the gcc C libraries, the str- and mem-functions are faster by a margin in most cases because their memory access patterns are highly optimized.
There were some discussions on the topic on SO already.

Try this:
int xstrlen(const char* s){
    const char* s0 = s;
    while(*s) s++;
    return(s - s0);
}
int xstrcmp(const char* a, const char* b){
    while(*a && *a==*b){a++; b++;}
    return *a - *b;
}
This could probably be sped up with some loop unrolling.

1. Algorithm
Your implementation of strcmp could use a better algorithm. There should be no need to call strlen at all; each call to strlen iterates over the whole length of the string again. You can find simple but effective implementations online. A good place to start is something like:
// Adapted from http://vijayinterviewquestions.blogspot.co.uk
int xstrcmp(const char *s1, const char *s2)
{
    for (;*s1==*s2;++s1,++s2)
    {
        if(*s1=='\0') return(0);
    }
    return(*s1-*s2);
}
That doesn't do everything, but should be simple and work in most cases.
2. Compiler optimisation
It's a stupid question, but make sure you turned on all the optimisation switches when you compile.
3. More sophisticated optimisations
People writing libraries will often use more advanced techniques, such as loading a 4-byte or 8-byte int at once and comparing it, and only comparing individual bytes when the whole word does not match. You'd need to be an expert to know what's appropriate for this case, but you can find people discussing the most efficient implementation on Stack Overflow (link?)
Some standard library functions for some platforms may be hand-written in assembly if the coder knows there's a more efficient implementation than the compiler can find. That's increasingly rare now, but may be common on some embedded systems.
4. Linker "cheating" with standard library
With some standard library functions, the linker may be able to make your program call them with less overhead than calling functions in your code because it was designed to know more about the specific internals of the functions (link?) I don't know if that applies in this case, it probably doesn't, but it's the sort of thing you have to think about.
5. OK, ok, I get that, but when SHOULD I implement my own strcmp?
Off the top of my head, the only reasons to do this are:
You want to learn how. This is a good reason.
You are writing for a platform which doesn't have a good enough standard library. This is very unlikely.
The string comparison has been measured to be a significant bottleneck in your code, and you know something specific about your strings that means you can compare them more efficiently than a naive algorithm. (E.g. all strings are allocated 8-byte aligned, or all strings have an N-byte prefix.) This is very, very unlikely.
6. But...
OK, WHY do you want to avoid relying on strlen? Are you worried about code size? About portability of code or of executables?
If there's a good reason, open another question and there may be a more specific answer. So I'm sorry if I'm missing something obvious, but relying on the standard library is usually much better, unless there's something specific you want to improve on.

Related

fastest method to replace strlen in c++ [duplicate]

Assume that you're working on a 32-bit x86 system. Your task is to implement strlen as fast as possible.
There are two problems you have to take care of:
1. address alignment.
2. reading memory in units of the machine word length (4 bytes).
It's not hard to find the first aligned address in the given string.
Then we can read memory 4 bytes at a time and add each read to the total length. But we should stop once there's a zero byte among the 4 bytes, and count the remaining bytes before the zero byte. To check for the zero byte quickly, there's a code snippet from glibc:
unsigned long int longword, himagic, lomagic;
himagic = 0x80808080L;
lomagic = 0x01010101L;
// There is a zero byte among the 4 bytes.
if (((longword - lomagic) & ~longword & himagic) != 0) {
    // handle the remaining bytes...
}
I used it in Visual C++ to compare it with the CRT's implementation. The CRT's version is much faster than the one above.
I'm not familiar with CRT's implementation, did they use a faster way to check the zero byte?
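For reference, the two steps described above (walk to an aligned address, then test four bytes at a time) might be put together roughly as follows. This is a sketch rather than the glibc or CRT code. The name word_strlen is made up, and it assumes that the whole aligned 4-byte word containing the terminator is readable, for example because the buffer is 4-byte aligned and its size is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// ASSUMPTION: the aligned 4-byte word that holds the terminator is fully
// readable (e.g. the buffer is 4-byte aligned with a size multiple of 4).
std::size_t word_strlen(const char* s) {
    assert(s != nullptr);
    const char* p = s;
    // Step 1: advance byte by byte until p sits on a 4-byte boundary.
    while (reinterpret_cast<std::uintptr_t>(p) % 4 != 0) {
        if (*p == '\0') return static_cast<std::size_t>(p - s);
        ++p;
    }
    // Step 2: test one aligned word per iteration.
    for (;;) {
        std::uint32_t w;
        std::memcpy(&w, p, sizeof w);  // avoids strict-aliasing problems
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0) {
            // Some byte in this word is zero: find which one.
            for (int i = 0; i < 4; ++i)
                if (p[i] == '\0') return static_cast<std::size_t>(p + i - s);
        }
        p += 4;
    }
}
```

Using unsigned 32-bit constants (the u suffix) keeps the arithmetic at word width on every platform.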
You could save the length of the string along with the string when creating it, as is done in Pascal.
First, the CRT's version is written directly in assembler. You can see its source code here: C:\Program Files\Microsoft Visual Studio 9.0\VC\crt\src\intel\strlen.asm (this is for VS 2008)
It depends. Microsoft's library really has two different versions of strlen. One is a portable version in C that's about the most trivial version of strlen possible, pretty close (and probably equivalent) to:
size_t strlen(char const *str) {
    char const *pos = str; // declared outside the loop so it is still in scope at the return
    for (; *pos; ++pos)
        ;
    return pos-str;
}
The other is in assembly language (used only for Intel x86), and quite similar to what you have above, at least as far as loading 4 bytes, checking whether one of them is zero, and reacting appropriately. The only obvious difference is that instead of subtracting, they basically pre-negate the constant and add. I.e. instead of word - 0x01010101, they use word + 0x7efefeff.
There are also compiler intrinsic versions which use the REPNE SCAS instruction pair. These are generally found on older compilers, but they can still be pretty fast. There are also SSE2 versions of strlen, such as the implementation in Dr Agner Fog's performance library, or something such as this
Remove those 'L' suffixes and see... You are promoting all calculations to "long"! On my 32-bits tests, that alone doubles the cost.
I also do two micro-optimizations:
Since most strings we scan consist of ASCII chars in the range 0~127, the high bit is (almost) never set, so only check for it in a second test.
Increment an index rather than a pointer, which is cheaper on some architectures (notably x86) and gives you the length for 'free'...
uint32_t gatopeich_strlen32(const char* str)
{
    uint32_t *u32 = (uint32_t*)str, u, abcd, i=0;
    while(1)
    {
        u = u32[i++];
        abcd = (u-0x01010101) & 0x80808080;
        if (abcd && // If abcd is not 0, we have NUL or a non-ASCII char > 127...
            (abcd &= ~u)) // ... Discard non-ASCII chars
        {
#if BYTE_ORDER == BIG_ENDIAN
            return 4*i - (abcd&0xffff0000 ? (abcd&0xff000000?4:3) : abcd&0xff00?2:1);
#else
            return 4*i - (abcd&0xffff ? (abcd&0xff?4:3) : abcd&0xff0000?2:1);
#endif
        }
    }
}
Assuming you know the maximum possible length, and you've initialized the memory to \0 before use, you could do a binary split and go left or right depending on the value (\0: continue in the left half, else continue in the right half). That way you'd dramatically decrease the number of checks you need to find the length. Not optimal (it requires some setup), but it should be really fast.
// Eric
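A minimal sketch of that binary split, under the stated assumptions: the buffer has a known capacity, it was zero-filled before the string was written, and the string contains no embedded \0. The function name zero_filled_strlen is invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTIONS: buf holds `capacity` readable bytes, every byte after the
// string is '\0' (the buffer was zero-filled first), and the string itself
// contains no '\0'. Under those conditions the bytes look like
// "non-zero ... non-zero, zero ... zero", so binary search applies.
std::size_t zero_filled_strlen(const char* buf, std::size_t capacity) {
    assert(buf != nullptr);
    std::size_t lo = 0;         // every index below lo holds a non-zero byte
    std::size_t hi = capacity;  // every index from hi upwards holds zero
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (buf[mid] == '\0')
            hi = mid;           // first zero is at mid or to its left
        else
            lo = mid + 1;       // first zero is to the right of mid
    }
    return lo;                  // index of the first zero byte
}
```

This needs about log2(capacity) probes instead of one per character, but it gives wrong answers as soon as stale non-zero bytes are left behind after the terminator.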
Obviously, crafting a tight loop like this in assembler would be fastest. However, if you want or need to keep it more human-readable and/or portable in C(++), you can still try to increase the speed of the standard function by using the register keyword.
The register keyword asks the compiler to keep the counter in a CPU register instead of in memory, which can speed up the loop.
Note, however, that the register keyword is only a suggestion and the compiler is free to ignore it if it thinks it can do better, especially if certain optimization options are used. It was also deprecated in C++11 and removed in C++17, so the code below is for C or older C++ only. That said, while it is almost certainly going to be ignored for a local class variable in a triple for-loop, it is likely to be honored for the code below, which can improve performance quite a bit (nearly on par with the assembler version):
size_t strlen ( const char* s ) {
    register const char* i = s; // declared outside the loop so it is still in scope at the return
    for (; *i; ++i);
    return (i-s);
}

Implementing memcmp

The following is the Microsoft CRT implementation of memcmp:
int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if(!count)
        return(0);
    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }
    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}
It basically performs a byte-by-byte comparison.
My question is in two parts:
Is there any reason not to alter this to an int-by-int comparison until count < sizeof(int), then do a byte-by-byte comparison for what remains?
If I were to do 1, are there any potential/obvious problems?
Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.
You could do it as an int-by-int comparison or an even wider data type if you wish.
The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.
Some processors run slower if you access values without following their alignment rules (some even crash if you try it).
So your code could probably do char comparisons up to an int alignment area, then int comparisons, then char comparisons again but, again, the alignments of both areas will probably matter.
Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.
It's also more complicated, and thus more likely to contain a bug.
Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
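The points raised so far (word-sized comparisons, alignment, and locating the first differing byte after a mismatch) can be sketched like this. It is an illustration only, with a made-up name, chunked_memcmp. It loads through memcpy so that neither buffer needs to be aligned, and it falls back to a byte loop both for the tail and for the chunk where a mismatch was found. The C standard only specifies the sign of memcmp's result, so the sketch returns -1, 0 or 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

int chunked_memcmp(const void* buf1, const void* buf2, std::size_t count) {
    const unsigned char* a = static_cast<const unsigned char*>(buf1);
    const unsigned char* b = static_cast<const unsigned char*>(buf2);
    if (a == b) return 0;  // same address: nothing to compare

    // Compare eight bytes per step while at least eight remain.
    while (count >= sizeof(std::uint64_t)) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a, sizeof wa);  // safe for any alignment
        std::memcpy(&wb, b, sizeof wb);
        if (wa != wb) break;  // mismatch is inside this chunk
        a += sizeof wa;
        b += sizeof wb;
        count -= sizeof wa;
    }

    // Byte loop: finds the first differing byte in the mismatching chunk,
    // or handles the final count % 8 bytes.
    for (; count > 0; --count, ++a, ++b) {
        if (*a != *b) return *a < *b ? -1 : 1;  // compared as unsigned char
    }
    return 0;
}
```

Breaking out to the byte loop on a mismatch avoids any endianness-dependent logic for working out which byte differed first.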
If you compare as int, you will need to check alignment and check if count is divisible by sizeof(int) (to compare the last bytes as char).
Is that really their implementation? I have other issues besides not doing it int-wise:
casting away constness.
does that return statement work? unsigned char - unsigned char = signed int?
int-at-a-time only works if the pointers are aligned, or if you can read a few bytes from the front of each and end up with both aligned. If both are 1 byte before an alignment boundary, you can read one char from each and then go int-at-a-time. But if they are aligned differently, e.g. one is aligned and one is not, there is no way to do this.
memcmp is at its most inefficient (i.e. it takes the longest) when the buffers do actually compare equal (it has to go to the end) and the data is long.
I would not write my own, but if you are going to be comparing large portions of data you could ensure alignment and even pad the ends, then compare word-at-a-time if you want.
Another idea is to optimize for the processor cache and fetching. Processors like to fetch in large chunks rather than individual bytes at random times. Although the internal workings may already account for this, it would be a good exercise anyway. Always profile to determine the most efficient solution.
Pseudocode:
while bytes remaining > (cache size) / 2 do // Half the cache for source, the other half for dest.
    fetch source bytes
    fetch destination bytes
    perform comparison using fetched bytes
end-while
perform byte-by-byte comparison for the remainder.
For more information, search the web for "Data Driven Design" and "data oriented programming".
Some processors, such as the ARM family, allow conditional execution of instructions (in 32-bit, non-Thumb mode). The processor fetches the instructions but only executes them if the conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
See also loop unrolling.
See also assembly language.
You can gain a lot of performance by tailoring the algorithm to a specific processor, but you lose portability.
The code you found is just a debug implementation of memcmp, it's optimized for simplicity and readability, not for performance.
The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible.
Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.
Finally, even when you know exactly on which platform you'll be running, the perfect implementation is still the less generic one as it depends on a bunch of different factors that are specific to the rest of your program:
What is the minimum guaranteed buffer alignment?
Can you read any padding bytes past the end of a buffer without triggering an access violation?
May the buffer parameters be identical?
May the buffer size be 0?
Do you only need to compare buffer contents for equality? Or do you also need to know which one is larger (return value < 0 or > 0)?
...
If performance is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly listing that they generate for a source file. You could take that code and adapt it to your needs.
Many processors implement this as a single instruction. If you can guarantee which processor you're running on, it can be implemented with a single line of inline assembler.

would you propose this method for copying strings?

To achieve higher performance, would you propose using the method below when copying strings, especially when there are a lot of characters in the string, a lot more than 12?
unsigned char one[12]={1,2,3,4,5,6,7,8,9,10,11,12};
unsigned char two[12];
unsigned int (& three)[3]=reinterpret_cast<unsigned int (&)[3]>(one);
unsigned int (& four)[3]=reinterpret_cast<unsigned int (&)[3]>(two);
for (unsigned int i=0;i<3;i++)
    four[i]=three[i];
No, (almost) never. Use std::strcpy (although not in this case, since your “strings” aren’t zero terminated), or std::copy, or the std::string copy constructor.
These methods are optimized to do the job for you. If your code (or something similar) happens to be faster than naive character by character copying, rest assured that strcpy will use it underneath. In fact, that is what happens (depending on the architecture).
Don’t try to outsmart modern compilers and frameworks, unless you’re a domain expert (and usually not even then).
Perhaps memcpy / std::copy? Wouldn't those be optimized anyway?
I believe most of today's compilers already optimize string copies. Anyway, you should benchmark this and also compare it with memcpy, but I don't think the optimization is worth the loss of readability.
I agree with the other replies here. Attempts to optimize block copying more often than not end up being slower than what your target OS provides. For example, memcpy(), memmove() and the like usually implement some variation of this algorithm: copy words/halfwords/bytes using GP registers until you hit 16-byte alignment, then use SSE to copy 4 words at a time (that's 16 chars at a time, since sizeof(char) is always 1).
Then again, you can also test the performance of your implementation vs memcpy()/strcpy() and see what you get.
I'm not sure why you're using references here at all.
Do this:
memcpy(two, one, sizeof(two));
Note that your usage is more of a "byte array", especially with it being unsigned. Furthermore, if you do feel the need to "group" the bytes like that, you'd have more luck grouping them in 4s or 8s, given that those match typical register sizes.
If you have an issue with string copying, there is always the llvm::StringRef way: provide a reference to the underlying string that cannot alter it. The class attributes are limited to a char const* and a size_t.
Of course, the downside is that YOU have to ensure that the underlying buffer stays allocated for as long as the StringRef is in use.
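To make the idea concrete, here is a minimal sketch of such a non-owning reference. It is not the LLVM class: StrRef, make_ref and drop_front are hypothetical names, and the struct only mirrors the two members mentioned above (a char const* and a size_t).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a StringRef-like type: it points into a buffer
// owned by someone else and never copies or modifies the characters.
struct StrRef {
    const char* data;
    std::size_t size;
};

// Wrap a null-terminated string; the length is computed once, here.
StrRef make_ref(const char* s) {
    assert(s != nullptr);
    return StrRef{s, std::strlen(s)};
}

// "Substring" without copying: just move the pointer and shrink the size.
StrRef drop_front(StrRef r, std::size_t n) {
    if (n > r.size) n = r.size;
    return StrRef{r.data + n, r.size - n};
}
```

Passing a StrRef around costs two machine words regardless of how long the string is, which is the whole point of the approach. In C++17 and later, std::string_view provides the same idea in the standard library.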
Using str* is possibly the simplest way to build null-terminated strings, in C at least. The performance issue is that before a copy is actually possible, the write position in the destination string has to be calculated (i.e. the position of the null byte has to be found). Then the length of the source string must be calculated to ensure that you have enough memory in the destination. This adds overhead (more the longer the string) compared to using memcpy, where it is up to you to have a large enough buffer and to keep track of how many bytes you have used.
(Then you may have additional complexity if your compiler settings specify 2-byte characters)
So if your string is 3000 bytes long and you append the strings "a" and then "b", the two appends will scan through 3000 and 3001 bytes respectively before being able to write the two bytes each of "a" and "b" ('a' + null and 'b' + null). Try to optimize that! Appending "b" to "a" before appending the result to the 3000-byte string would be much faster.
I personally would use memcpy for destination strings larger than 50 bytes or so. The code becomes a bit more complex but once you've done it a few times it's easy.
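The bookkeeping involved is small. The following sketch keeps the number of bytes used next to the buffer, so each append writes at a known position instead of rescanning for the null byte. The Appender struct and the append function are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: a caller-owned buffer plus the number of bytes used.
struct Appender {
    char* buf;
    std::size_t capacity;  // total size of buf, including room for '\0'
    std::size_t used;      // characters written so far, excluding '\0'
};

// Appends len bytes of piece; returns false if they would not fit.
// No scan of the existing contents is needed: the write position is `used`.
bool append(Appender& out, const char* piece, std::size_t len) {
    assert(out.used < out.capacity);
    if (len >= out.capacity - out.used) return false;  // keep room for '\0'
    std::memcpy(out.buf + out.used, piece, len);
    out.used += len;
    out.buf[out.used] = '\0';  // keep the buffer usable as a C string
    return true;
}
```

With this layout, appending "a" and then "b" to a 3000-byte string touches only the new bytes, whereas two strcat calls would walk the existing 3000 bytes each time.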

C++ string comparison in one clock cycle

Is it possible to compare whole memory regions in a single processor cycle? More precisely is it possible to compare two strings in one processor cycle using some sort of MMX assembler instruction? Or is strcmp-implementation already based on that optimization?
EDIT:
Or is it possible to instruct C++ compiler to remove string duplicates, so that strings can be compared simply by their memory location? Instead of memcmp(a,b) compared by a==b (assuming that a and b are both native const char* strings).
Just use the standard C strcmp() or C++ std::string::operator==() for your string comparisons.
Their implementations are reasonably good and are probably compiled to very highly optimized assembly that even talented assembly programmers would find challenging to match.
So don't sweat the small stuff. I'd suggest looking at optimizing other parts of your code.
You can use the Boost Flyweight library to intern your immutable strings. String equality/inequality tests then become very fast since all it has to do at that point is compare pointers (pun not intended).
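The interning idea itself can be sketched without Boost. The code below is not the Boost.Flyweight API. It is a hypothetical intern function backed by a std::unordered_set, relying on the fact that pointers to elements of an unordered_set stay valid when the set grows. It is not thread-safe as written.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical intern pool: returns one canonical pointer per distinct
// string value. Equal contents always yield the same pointer.
const std::string* intern(const std::string& s) {
    static std::unordered_set<std::string> pool;
    // insert() returns the existing element if s is already present.
    return &*pool.insert(s).first;
}
```

After interning, an equality test between two strings is a single pointer comparison such as intern(a) == intern(b). The cost is moved to intern itself, which hashes the string and may compare it once.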
Not really. Your typical 1-byte compare instruction takes 1 cycle.
Your best bet would be to use the MMX 64-bit compare instructions (see this page for an example). However, those operate on registers, which have to be loaded from memory. The memory loads will significantly hurt your timing, because you'll be going out to L1 cache at best, which adds roughly a 10x slowdown*. If you are doing some heavy string processing, you can probably get a nifty speedup there, but again, it's going to hurt.
Other people suggest pre-computing strings. Maybe that'll work for your particular app, maybe it won't. Do you have to compare strings? Can you compare numbers?
Your edit suggests comparing pointers. That's a dangerous situation unless you can specifically guarantee that you won't be doing substring compares (i.e. you are comparing some two-byte strings: [0x40, 0x50] with [0x40, 0x42]. Those are not "equal", but a pointer compare would say they are).
Have you looked at the gcc strcmp() source? I would suggest that doing that would be the ideal starting place.
* Loosely speaking, if a cycle takes 1 unit, a L1 hit takes 10 units, an L2 hit takes 100 units, and an actual RAM hit takes really long.
It's not possible to perform general-purpose string operations in one cycle, but there are many optimizations you can apply with extra information.
If your problem domain allows the use of an aligned, fixed-size buffer for strings that fits in a machine register, you can perform single-cycle comparisons (not counting the load instructions).
If you always keep track of the lengths of your strings, you can compare lengths and use memcmp, which is faster than strcmp. If your application is multi-cultural, keep in mind that this only works for ordinal string comparison.
It appears you are using C++. If you only need equality comparisons with immutable strings, you can use a string interning solution (copy/paste link since I'm a new user) to guarantee that equal strings are stored at the same memory location, at which point you can simply compare pointers. See en.wikipedia.org/wiki/String_interning
Also, take a look at the Intel Optimization Reference Manual, Chapter 10 for details on the SSE 4.2's instructions for text processing. www.intel.com/products/processor/manuals/
Edit: If your problem domain allows the use of an enumeration, that is your single-cycle comparison solution. Don't fight it.
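Two of the suggestions above are easy to sketch: packing short strings into a single machine word, and checking lengths before calling memcmp. Both function names (pack8 and equal_counted) are made up, and pack8 assumes strings of at most 8 characters with no embedded null bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Packs a string of at most 8 chars into one 64-bit word, zero-padded.
// Two such strings are equal exactly when their packed words are equal,
// so the comparison itself is one integer compare.
std::uint64_t pack8(const char* s) {
    std::size_t n = std::strlen(s);
    assert(n <= 8);  // ASSUMPTION: the problem domain guarantees this
    std::uint64_t w = 0;
    std::memcpy(&w, s, n);
    return w;
}

// Equality for strings whose lengths are already tracked: different
// lengths are rejected without touching the characters at all.
bool equal_counted(const char* a, std::size_t alen,
                   const char* b, std::size_t blen) {
    return alen == blen && std::memcmp(a, b, alen) == 0;
}
```

The packed words are only meaningful for equality tests. Their numeric order depends on the byte order of the machine, so they should not be used for sorting.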
If you're optimizing for string comparisons, you may want to employ a string table (then you only need to compare the indexes of the two strings, which can be done in a single machine instruction).
If that's not feasible, you can also create a hashed string object that contains the string and a hash. Then most of the time you only have to compare the hashes if the strings aren't equal. If the hashes do match you'll have to do a full comparison though to make sure it wasn't a false positive.
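A hashed string object along those lines could look like the sketch below. HashedString is a hypothetical type, and std::hash is used only because it is readily available. Any hash function would do.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical type: the hash is computed once, when the object is built.
struct HashedString {
    std::string text;
    std::size_t hash;  // declared after text, so text is initialised first

    explicit HashedString(std::string s)
        : text(std::move(s)), hash(std::hash<std::string>{}(text)) {}
};

// Different hashes prove the strings differ, so the character comparison
// only runs when the hashes match (a true match or a rare collision).
bool operator==(const HashedString& a, const HashedString& b) {
    return a.hash == b.hash && a.text == b.text;
}
```

This mainly speeds up the "not equal" case, which is often the common one when searching for a string among many candidates.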
It depends on how much preprocessing you do. C# and Java both have a process called interning strings which makes every string map to the same address if they have the same contents. Assuming a process like that, you could do a string equality comparison with one compare instruction.
Ordering is a bit harder.
EDIT: Obviously this answer is sidestepping the actual issue of attempting to do a string comparison within a single cycle. But it's the only way to do it unless you happen to have a sequence of instructions that can look at an unbounded amount of memory in constant time to determine the equivalent of a strcmp. Which is improbable, because if you had such an architecture the person who sold it to you would say "Hey, here's this awesome instruction that can do a string compare in a single cycle! How awesome is that?" and you wouldn't need to post a question on stackoverflow.
But that's just my reasoned opinion.
"Or is it possible to instruct c++ compiler to remove string duplicates, so that strings can be compared simply by their memory location?"
No. The compiler may remove duplicates internally, but I know of no compiler that guarantees or provides facilities for accessing such an optimisation (except possibly to turn it off). Certainly the C++ standard has nothing to say in this area.
Assuming you mean x86 ... Here is the Intel documentation.
But off the top of my head, no, I don't think you can compare more than the size of a register at a time.
Out of curiosity, why do you ask? I'm the last to invoke Knuth prematurely, but ... strcmp usually does a pretty good job.
Edit: Link now points to the modern documentation.
You can certainly compare more than one byte in a cycle. If we take the example of x86-64, you can compare up to 64 bits (8 bytes) in a single instruction (cmps). This isn't necessarily one cycle, but it will normally be in the low single digits (the exact speed depends on the specific processor version).
However, this doesn't mean you'll be able to do all the work of comparing two arrays in memory much faster than strcmp:
There's more than just the compare: you need to compare the two values, check if they are the same, and if so move to the next chunk.
Most strcmp implementations will already be highly optimised, including checking if a and b point to the same address, and any suitable instruction-level optimisations.
Unless you're seeing a lot of time spent in strcmp, I wouldn't worry about it. Have you got a specific problem or use case you are trying to improve?
Even if both strings were cached, it wouldn't be possible to compare (arbitrarily long) strings in a single processor cycle. The implementation of strcmp in a modern compiler environment should be pretty much optimized, so you shouldn't bother to optimize too much.
EDIT (in reply to your EDIT):
You can't instruct the compiler to unify ALL duplicate strings - most compilers can do something like this, but it's best-effort only (and I don't know any compiler where it works across compilation units).
You might get better performance by adding the strings to a map and comparing iterators after that... the comparison itself might be one cycle (or not much more) then
If the set of strings to use is fixed, use enumerations - that's what they're there for.
Here's one solution that uses enum-like values instead of strings. It supports enum-value-inheritance and thus supports comparison similar to substring comparison. It also uses special character "¤" for naming, to avoid name collisions. You can take any class, function, or variable name and make it into enum-value (SomeClassA will become ¤SomeClassA).
#include <algorithm> // find
#include <vector>
using namespace std;
struct MultiEnum
{
    vector<MultiEnum*> enumList;
    MultiEnum()
    {
        enumList.push_back(this);
    }
    MultiEnum(MultiEnum& base)
    {
        enumList.assign(base.enumList.begin(),base.enumList.end());
        enumList.push_back(this);
    }
    MultiEnum(const MultiEnum* base1,const MultiEnum* base2)
    {
        enumList.assign(base1->enumList.begin(),base1->enumList.end());
        // insert, not assign: a second assign would discard base1's entries
        enumList.insert(enumList.end(),base2->enumList.begin(),base2->enumList.end());
    }
    bool operator !=(const MultiEnum& other)
    {
        return find(enumList.begin(),enumList.end(),&other)==enumList.end();
    }
    bool operator ==(const MultiEnum& other)
    {
        return find(enumList.begin(),enumList.end(),&other)!=enumList.end();
    }
    bool operator &(const MultiEnum& other)
    {
        return find(enumList.begin(),enumList.end(),&other)!=enumList.end();
    }
    MultiEnum operator|(const MultiEnum& other)
    {
        return MultiEnum(this,&other);
    }
    MultiEnum operator+(const MultiEnum& other)
    {
        return MultiEnum(this,&other);
    }
};
MultiEnum
    ¤someString,
    ¤someString1(¤someString), // link to "someString" because it is a substring of "someString1"
    ¤someString2(¤someString);
void Test()
{
    MultiEnum a = ¤someString1|¤someString2;
    MultiEnum b = ¤someString1;
    if(a!=¤someString2){}
    if(b==¤someString2){}
    if(b&¤someString2){}
    if(b&¤someString){} // will result in true, because someString is substring of someString1
}
PS. I had definitely too much free time on my hands this morning, but reinventing the wheel is just too much fun sometimes... :)