Performance std::strstr vs. std::string::find [duplicate] - c++

Possible Duplicate:
C++ string::find complexity
Recently I noticed that the function std::string::find is an order of magnitude slower than the function std::strstr - in my environment with GCC 4.7 on Linux. The performance difference depends on the lengths of the strings and on the hardware architecture.
There seems to be a simple reason for the difference: std::string::find basically calls std::memcmp in a loop - with time complexity O(m * n). In contrast, std::strstr is highly optimized for the hardware architecture (e.g. with SSE instructions) and uses a more sophisticated string matching algorithm (apparently Knuth-Morris-Pratt).
I was also surprised not to find the time complexities of these two functions in the language documents (i.e. drafts N3290 and N1570). I only found time complexities for char_traits. But that doesn't help, because there is no function for substring search in char_traits.
I would expect that std::strstr and memmem contain similar optimizations, with almost identical performance. And until recently, I assumed that std::string::find uses memmem internally.
The questions are: Is there any good reason, why std::string::find does not use std::memmem? And does it differ in other implementations?
The question is not: What is the best implementation of this function? It is really difficult to argue for C++ if it is slower than C. It wouldn't matter if both implementations were slow; it is the performance difference that really hurts.
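For reference, a minimal sketch of the kind of measurement behind these numbers (the haystack size, contents, and single-shot timing are illustrative assumptions; real results depend heavily on the data and the hardware):

```cpp
// Hypothetical micro-benchmark comparing std::strstr and std::string::find.
// Haystack size/contents and single-shot timing are assumptions for illustration.
#include <chrono>
#include <cstring>
#include <iostream>
#include <string>

int main() {
    std::string haystack(10000000, 'a');
    haystack += "needle";                      // single match at the very end
    const std::string needle = "needle";

    auto t0 = std::chrono::steady_clock::now();
    const char* p = std::strstr(haystack.c_str(), needle.c_str());
    auto t1 = std::chrono::steady_clock::now();
    std::size_t pos = haystack.find(needle);
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "strstr found at "       << (p - haystack.c_str()) << " in "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "string::find found at " << pos << " in "
              << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}
```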

First, what's memmem? I can't find it in the C++ standard, nor in the POSIX standard (which contains all of the standard C functions).

Second, any measured values will depend on the actual data. Using KMP, for example, will be a pessimisation in a lot of cases, probably most of the cases where the member functions of std::string are used: the time to set up the necessary tables will often be more than the total time of the straightforward algorithm. Things like O(m*n) don't mean much when the typical length of the string is short.
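To make that setup cost concrete, here is a minimal sketch of the KMP failure-table construction (not any particular library's code): building the table is O(m) extra time and memory before the first haystack character is even examined, which can easily exceed the cost of a naive scan when the strings involved are short.

```cpp
// Sketch of the KMP failure-table (prefix-function) construction.
// This work is pure overhead that happens before the search proper begins.
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::size_t> kmp_failure_table(const std::string& pattern) {
    std::vector<std::size_t> fail(pattern.size(), 0);
    std::size_t k = 0;                          // length of the current matched prefix
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k])
            k = fail[k - 1];                    // fall back to a shorter prefix
        if (pattern[i] == pattern[k])
            ++k;
        fail[i] = k;
    }
    return fail;
}
```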

Related

About time/space complexity in C/C++ standards

Recently I've read about the abstract machine and the as-if rule (What exactly is the "as-if" rule?), and about the requirements on the time complexity of the standard library (like this one: Is list::size() really O(n)?).
Are the time/space complexity requirements on standard library in terms of abstract machine or in terms of real concrete machine?
If these are in terms of abstract machine, it seems an implementation can actually generate less efficient code in terms of complexity even though it seems not to be practical.
Did the standards mention anything about time/space complexity for non-standard-library code?
e.g. I may write custom sorting code and expect O(n log n) time, but if an implementation just treats this as code running on the abstract machine, it is allowed to generate slower sorting in assembly and machine code, like changing it to an O(n^2) sort, even though it is unlikely to do that in a real situation.
Or maybe I missed something about the transformation requirements between abstract machine and real concrete machine. Can you help me to clarify? :)
Even though I mainly read the C++ standard, I also want to know the situation for the C standard, so this question has both tags.
Are the time/space complexity requirements on standard library in terms of abstract machine or in terms of real concrete machine?
The complexity requirements are in terms of the abstract machine:
[intro.abstract] The semantic descriptions in this document define a parameterized nondeterministic abstract machine...
Did the standards mention anything about time/space complexity for non-standard-library code?
No. The only complexity requirements in the standard are for standard containers and algorithms.
if an implementation just treats this as code in abstract machine, it is allowed to generate a slower sorting in assembly and machine code, like changing it to O(n^2) sort
That's not the worst thing it can do. An implementation can put the CPU to sleep for a year between every instruction. As long as you're patient enough, the program would have the same observable behaviour as the abstract machine, so it would be conforming.
Many of the complexity requirements in the C++ standard are in terms of specific counts of particular operations. These do constrain the implementation.
E.g. std::find_if
At most last - first applications of the predicate.
This is more specific than "O(N), where N = std::distance(first, last)", as it specifies a constant factor of 1.
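A quick sketch of how that reads in practice, using an instrumented predicate to count the applications (the data below is just an example):

```cpp
// Sketch: std::find_if performs at most (last - first) applications of the predicate.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v = {1, 3, 5, 6, 7, 9};
    std::size_t calls = 0;
    auto it = std::find_if(v.begin(), v.end(),
                           [&calls](int x) { ++calls; return x % 2 == 0; });
    // The predicate ran 4 times here, and can never run more than v.size() times.
    std::cout << "found " << *it << " after " << calls << " predicate calls\n";
}
```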
And there are others that have Big-O bounds, defining what operation(s) are counted
E.g. std::sort
O(N·log(N)), where N = std::distance(first, last) comparisons.
What this doesn't constrain includes how slow a comparison is, nor how many swaps occur. If your model of computation has fast comparison and slow swapping, you don't get a very useful analysis.
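An instrumented comparator makes visible exactly what the bound counts, and what it does not (a sketch, not a conformance test):

```cpp
// Sketch: the O(N·log N) requirement for std::sort counts comparisons only;
// it says nothing about how expensive each comparison is, or how many swaps occur.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v = {5, 3, 1, 4, 2, 9, 7, 8, 6, 0};
    std::size_t comparisons = 0;
    std::sort(v.begin(), v.end(),
              [&comparisons](int a, int b) { ++comparisons; return a < b; });
    std::cout << comparisons << " comparisons for " << v.size() << " elements\n";
}
```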
As you've been told in comments, the standards don't have any requirements regarding time or space complexity. And addressing your additional implicit question, yes, a compiler can change your O(n log n) code to run in O(n²) time. Or in O(n!) if it wants to.
The underlying explanation is that the standard defines correct programs, and a program is correct regardless of how long it takes to execute or how much memory it uses. These details are left to the implementation.
Specific implementations can compile your code in whichever way achieves correct behavior. It would be completely permissible, for instance, for an implementation to add a five-second delay between every line of code you wrote; the program is still correct. It would also be permissible for the compiler to figure out a better way of doing what you wrote and rewrite your entire program, as long as the observable behavior is the same.
However, the fact that an implementation is compliant doesn't mean it is perfect. Adding five-second delays wouldn't affect the implementation's compliance, but nobody would want to use that implementation. Compilers don't do these things because they are ultimately tools, and as such, their writers expect them to be useful to those who use them, and making your code intentionally worse is not useful.
TL;DR: bad performance (time complexity, memory complexity, etc.) doesn't affect compliance, but it will make you look for a new compiler.

What string search algorithm does strstr use?

I was reading through the String searching algorithm Wikipedia article, and it made me wonder what algorithm strstr uses in Visual Studio. Should I try to use another implementation, or is strstr fairly fast?
Thanks!
The implementation of strstr in Visual Studio is not known to me, and I am uncertain if it is known to anyone. However, I found these interesting sources and an example implementation. The latter shows that the algorithm runs in worst-case quadratic time with respect to the size of the searched string; the aggregate cost on typical inputs should be less than that, and quadratic time is about the algorithmic limit for non-stochastic solutions.
What is actually the case is that, depending on the size of the input, different algorithms may be used, mainly optimized to the metal. However, one cannot really bet on that. If you are doing DNA sequencing, strstr and family are very important and you will most probably have to write your own customized version. Usually, standard implementations are optimized for the general case, but on the other hand the people working on compilers know their stuff. At any rate, you should not bet your own skills against the pros.
But really, all this discussion about development time is hurting the effort to write good software. Be certain that the benefit of rewriting a custom strstr outweighs the effort that is going to be needed to maintain and tune it for your specific case before you embark on this task.
As others have recommended: Profile. Perform valid performance tests.
Without the profile data, you could be optimizing a part of the code that runs 20% of the time, a waste of ROI.
Development costs are the prime concern with modern computers, not execution time. The best use of time is to develop the program to operate correctly with few errors before entering System Test. This is where the focus should be. Also due to this reasoning, most people don't care how Visual Studio implements strstr as long as the function works correctly.
Be aware that there is a line or point where a linear search outperforms other searches. This line depends on the size of the data or the search criteria. For example, a linear search using a processor with branch prediction and a large instruction cache may outperform other techniques for small and medium data sizes. A more complicated algorithm may have more branches that cause reloading of the instruction cache or data cache (wasting execution time).
Another method for optimizing your program is to make the data organization easier for searching. For example, making the string small enough to fit into a cache line. This also depends on the quantity of searching. For a large amount of searches, optimizing the data structure may gain some performance.
In summary, optimize if and only if the program is not working correctly, the user is complaining about speed, it is missing timing constraints, or it doesn't fit in the allocated memory. The next step is then to profile and optimize the areas where most of the time is spent. Any other optimization is futile.
The C++ standard refers to the C standard for the description of what strstr does. The C standard doesn't seem to put any restrictions on the complexity, so pretty much any algorithm that finds the first instance of the substring would be compliant.
Thus different implementations may choose different algorithms. You'd have to look at your particular implementation to determine which it uses.
The simple, brute-force approach is likely O(m×n) where m and n are the lengths of the strings. If you need better than that, you can try other libraries, like Boost, or implement one of the sub-linear searches yourself.
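A sketch of that brute-force approach (not the code any particular runtime ships), to show where the O(m×n) worst case comes from:

```cpp
// Naive substring search: for every haystack position, compare the needle
// character by character. Worst case O(m*n), e.g. needle "aaab" in "aaaa...a".
#include <cstddef>

const char* naive_strstr(const char* haystack, const char* needle) {
    if (*needle == '\0') return haystack;       // empty needle matches immediately
    for (; *haystack != '\0'; ++haystack) {
        const char* h = haystack;
        const char* n = needle;
        while (*h != '\0' && *n != '\0' && *h == *n) { ++h; ++n; }
        if (*n == '\0') return haystack;        // whole needle matched
    }
    return nullptr;                             // no match found
}
```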

How can I implement Python sets in another language (maybe C++)?

I want to translate some Python code that I have already written to C++ or another fast language because Python isn't quite fast enough to do what I want to do. However, the code in question abuses some of the impressive features of Python sets, specifically the average O(1) membership testing which I spam within performance-critical loops, and I am unsure of how to implement Python sets in another language.
In Python's Time Complexity Wiki Page, it states that sets have O(1) membership testing on average and in worst-case O(n). I tested this personally using timeit and was astonished by how blazingly fast Python sets do membership testing, even with large N. I looked at this Stack Overflow answer to see how C++ sets compare when using find operations to see if an element is a member of a given set and it said that it is O(log(n)).
I hypothesize that the time complexity for find is logarithmic because C++ standard library sets are implemented with some sort of binary tree. I think that because Python sets have average O(1) membership testing and worst case O(n), they are probably implemented with some sort of associative array with buckets, which can just look up an element with ease and test it for some dummy value which indicates that the element is not part of the set.
The thing is, I don't want to slow down any part of my code by switching to another language (since that is the problem I'm trying to fix in the first place), so how could I implement my own version of Python sets (specifically just the fast membership testing) in another language? Does anybody know anything about how Python sets are implemented, and if not, could anyone give me any general hints to point me in the right direction?
I'm not looking for source code, just general ideas and links that will help me get started.
I have done a bit of research on Associative Arrays and I think I understand the basic idea behind their implementation but I'm unsure of their memory usage. If Python sets are indeed just really associative arrays, how can I implement them with a minimal use of memory?
Additional note: The sets in question that I want to use will have up to 50,000 elements and each element of the set will be in a large range (say [-999999999, 999999999]).
The theoretical difference betwen O(1) and O(log n) means very little in practice, especially when comparing two different languages. log n is small for most practical values of n. Constant factors of each implementation are easily more significant.
C++11 has unordered_set and unordered_map now. Even if you cannot use C++11, there are always the Boost versions and the TR1 versions (in namespace std::tr1); older pre-TR1 extensions were named hash_* instead of unordered_*.
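A minimal sketch of that, mapped onto the question's use case (the element count and values are simply the figures mentioned in the question):

```cpp
// Average O(1) membership testing with std::unordered_set, roughly what a
// Python set of large integers translates to in C++11.
#include <cstdint>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<std::int64_t> members;
    members.reserve(50000);                     // question mentions up to 50,000 elements
    members.insert(-999999999);
    members.insert(123456789);

    // count() returns 0 or 1 for a set: a cheap membership test.
    std::cout << std::boolalpha
              << (members.count(123456789) > 0) << '\n'   // true
              << (members.count(42) > 0) << '\n';         // false
}
```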
Several points: you have, as has been pointed out, std::set and std::unordered_set (the latter only in C++11, but most compilers have offered something similar as an extension for many years now). The first is implemented by some sort of balanced tree (usually a red-black tree), the second as a hash table. Which one is faster depends on the data type: the first requires some sort of ordering relationship (e.g. < if it is defined on the type, but you can define your own); the second an equivalence relationship (==, for example) and a hash function compatible with this equivalence relationship. The first is O(lg n), the second O(1), if you have a good hash function. Thus:

If comparison for order is significantly faster than hashing, std::set may actually be faster, at least for "smaller" data sets, where "smaller" depends on how large the difference is: for strings, for example, the comparison will often resolve after the first couple of characters, whereas the hash code will look at every character. In one experiment I did (many years back), with strings of 30-50 characters, I found the break-even point to be about 100000 elements.

For some data types, simply finding a good hash function which is compatible with the type may be difficult. Python uses a hash table for its set, and if you define a type whose __hash__ function always returns 1, it will be very, very slow. Writing a good hash function isn't always obvious.

Finally, both are node-based containers, which means they use a lot more memory than e.g. std::vector, with very poor locality. If lookup is the predominant operation, you might want to consider std::vector, keeping it sorted and using std::lower_bound for the lookup. Depending on the type, this can result in a significant speed-up, and much less memory use.
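A minimal sketch of that sorted-vector alternative (names and data are illustrative):

```cpp
// Membership testing against a sorted std::vector with std::lower_bound:
// O(log n) lookups, contiguous storage, much better locality than node-based sets.
#include <algorithm>
#include <iostream>
#include <vector>

bool contains(const std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    return it != sorted.end() && *it == value;
}

int main() {
    std::vector<int> data = {7, 3, 9, 1, 5};
    std::sort(data.begin(), data.end());        // sort once, up front
    std::cout << std::boolalpha
              << contains(data, 5) << ' '       // true
              << contains(data, 4) << '\n';     // false
}
```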

Time Complexity of find operation [duplicate]

Possible Duplicate:
C++ string::find complexity
What is the time complexity of the find operation that comes built-in with the string library in STL?
The Standard, §21.4.7.2, doesn't give any guarantees as to the complexity.
You can reasonably assume std::basic_string::find takes linear time in the length of the string being searched in, though, as even the naïve algorithm (check each substring for equality) has that complexity, and it's unlikely that the std::string constructor will build a fancy index structure to enable anything faster than that.
The complexity in terms of the pattern being searched for may reasonably vary between linear and constant, depending on the implementation.
As pointed out in comments, the standard doesn't specify that.
However, since std::string is a generalized container and can't make any assumptions about the nature of the string it holds, you can reasonably assume that the complexity will be O(n) in the case where you search for a single char.
At most, performs as many comparisons as the number of elements in the range [first,last).
http://cplusplus.com/reference/algorithm/find/

What are the functions in the standard library that can be implemented faster with programming hacks? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have recently read an article about fast sqrt calculation. Therefore, I have decided to ask SO community and its experts to help me find out, which STL algorithms or mathematical calculations can be implemented faster with programming hacks?
It would be great if you can give examples or links.
Thanks in advance.
System library developers have more concerns than just performance in mind:
Correctness and standards compliance: Critical!
General use: No optimisations are introduced, unless they benefit the majority of users.
Maintainability: Good hand-written assembly code can be faster, but you don't see much of it. Why?
Portability: Decent libraries should be portable to more than just Windows/x86/32bit.
Many optimisation hacks that you see around violate one or more of the requirements above.
In addition, optimisations that will be useless or even break when the next generation CPU comes around the corner are not a welcome thing.
If you don't have profiler evidence on it being really useful, don't bother optimising the system libraries. If you do, work on your own algorithms and code first, anyway...
EDIT:
I should also mention a couple of other all-encompassing concerns:
The cost/effort to profit/result ratio: Optimisations are an investment. Some of them are seemingly-impressive bubbles. Others are deeper and more effective in the long run. Their benefits must always be considered in relation to the cost of developing and maintaining them.
The marketing people: No matter what you think, you'll end up doing whatever they want - or think they want.
Probably all of them can be made faster for a specific problem domain.
Now the real question is, which ones should you hack to make faster? None, until the profiler tells you to.
Several of the algorithms in <algorithm> can be optimized for vector<bool>::[const_]iterator. These include:
find
count
fill
fill_n
copy
copy_backward
move // C++0x
move_backward // C++0x
swap_ranges
rotate
equal
I've probably missed some. But all of the above algorithms can be optimized to work on many bits at a time instead of just one bit at a time (as would a naive implementation).
This is an optimization that I suspect is sorely missing from most STL implementations. It is not missing from this one:
http://libcxx.llvm.org/
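The gist of the optimisation, sketched outside of any particular library: process a whole machine word of bits per step instead of one bool at a time (shown here for a count-like operation over a raw word buffer):

```cpp
// Word-at-a-time bit counting: one std::bitset::count per 64-bit word instead
// of one branch per bit, which is what a naive bit-by-bit loop would do.
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t count_set_bits(const std::vector<std::uint64_t>& words) {
    std::size_t total = 0;
    for (std::uint64_t w : words)
        total += std::bitset<64>(w).count();    // typically compiles to a popcount
    return total;
}
```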
This is where you really need to listen to project managers and MBAs. What you're suggesting is re-implementing parts of the STL and/or the standard C library. There is an associated cost in terms of time to implement and the maintenance burden of doing so, so you shouldn't do it unless you really, genuinely need to, as John points out. The rule is simple: is the calculation you're doing slowing you down (i.e. are you bound by the CPU)? If not, don't create your own implementation just for the sake of it.
Now, if you're really interested in fast maths, there are a few places you can start. The GNU multi-precision library (GMP) implements many algorithms from modern computer arithmetic and seminumerical algorithms that are all about doing maths on arbitrary-precision integers and floats insanely fast. The people who write it optimise in assembly per build platform; it is about as fast as you can get in single-core mode. This is the most general case I can think of for optimised maths, i.e. one that isn't specific to a certain domain.
Bringing my first and second paragraphs together with what thkala has said, consider that GMP/MPIR have optimised assembly versions for each CPU architecture and OS they support. Really. It's a big job, but it is what makes those libraries so fast on a specific small subset of programming problems.
Sometimes domain-specific enhancements can be made. This is about understanding the problem in question. For example, when doing arithmetic in Rijndael's finite field you can, based on the knowledge that the field has characteristic 2 and its elements are polynomials with 8 binary coefficients, assume that your integers are of size uint8_t and that addition/subtraction are equivalent to XOR operations. How does this work? Basically, each coefficient of the two polynomials being added or subtracted is either zero or one. If a pair of coefficients are both zero or both one, the result is zero; if they are different, the result is one. Term by term, that is equivalent to XOR across an 8-bit binary string, where each bit represents a term in the polynomial. Multiplication is also relatively efficient. You can bet that Rijndael was designed to take advantage of this kind of result.
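A sketch of that arithmetic, using the Rijndael/AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x1B):

```cpp
// GF(2^8) arithmetic as used by Rijndael: addition is XOR, multiplication is
// shift-and-XOR with reduction by the field polynomial.
#include <cstdint>

std::uint8_t gf_add(std::uint8_t a, std::uint8_t b) {
    return a ^ b;                               // term-by-term addition modulo 2
}

std::uint8_t gf_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t result = 0;
    while (b != 0) {
        if (b & 1) result ^= a;                 // add this multiple of a
        bool overflow = (a & 0x80) != 0;        // would shifting produce an x^8 term?
        a <<= 1;
        if (overflow) a ^= 0x1B;                // reduce by x^8 + x^4 + x^3 + x + 1
        b >>= 1;
    }
    return result;
}
```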
That's a very specific result. It depends entirely on what you're doing to make things efficient. I can't imagine many STL functions are purely optimised for CPU speed, because amongst other things the STL provides collections via templates (which are about memory), file access (which is about storage), exception handling, and so on. In short, being really fast is a narrow subset of what the STL does and what it aims to achieve. Also, you should note that optimisation can be viewed from different angles. For example, if your app is heavy on IO, you are IO bound; having a massively efficient square-root calculation isn't really helpful, since "slowness" really means waiting on the disk/OS/your file-parsing routine.
In short, you as a developer of an STL library are trying to build an "all round" library for many different use cases.
But, since these things are always interesting, you might well be interested in bit twiddling hacks. I can't remember where I saw that, but I've definitely stolen that link from somebody else on here.
Almost none. The standard library is designed the way it is for a reason.
Taking sqrt, which you mention as an example, the standard library version is written to be as fast as possible, without sacrificing numerical accuracy or portability.
The article you mention is really beyond useless. There are some good articles floating around the net describing more efficient ways to implement square roots, but this article isn't among them (it doesn't even measure whether the described algorithms are faster!). Carmack's trick is slower than std::sqrt on a modern CPU, as well as being less accurate.
It was used in a game something like 12 years ago, when CPUs had very different performance characteristics. It was faster then, but CPUs have changed, and today it's both slower and less accurate than the CPU's built-in sqrt instruction.
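For reference, the trick under discussion looks roughly like this (the well-known constant from the Quake III source; the type punning is done via std::memcpy to keep it well-defined):

```cpp
// The classic "fast inverse square root": a bit-level initial guess followed
// by one Newton-Raphson step. Shown for historical interest only; std::sqrt
// (or 1.0f / std::sqrt(x)) is faster and more accurate on modern CPUs.
#include <cstdint>
#include <cstring>

float fast_inv_sqrt(float x) {
    float half = 0.5f * x;
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);        // reinterpret the float's bits
    bits = 0x5f3759df - (bits >> 1);            // magic-constant initial guess
    std::memcpy(&x, &bits, sizeof x);
    x = x * (1.5f - half * x * x);              // one Newton-Raphson refinement
    return x;
}
```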
You can implement a square root function which is faster than std::sqrt without losing accuracy, but then you lose portability, as it'll rely on CPU features not present on older CPUs.
Speed, accuracy, portability: choose any two. The standard library tries to balance all three, which means that the speed isn't as good as it could be if you were willing to sacrifice accuracy or portability, and accuracy is good, but not as good as it could be if you were willing to sacrifice speed, and so on.
In general, forget any notion of optimizing the standard library. The question you should be asking is whether you can write more specialized code.
The standard library has to cover every case. If you don't need that, you might be able to speed up the cases that you do need. But then it is no longer a suitable replacement for the standard library.
Now, there are no doubt parts of the standard library that could be optimized. The C++ IOStreams library in particular comes to mind. It is often naively, and very inefficiently, implemented. The C++ committee's technical report on C++ performance has an entire chapter dedicated to exploring how IOStreams could be implemented to be faster.
But that's I/O, where performance is often considered to be "unimportant".
For the rest of the standard library, you're unlikely to find much room for optimization.
