O(N) Identification of Permutations - C++

This answer determines if two strings are permutations by comparing their contents. If they contain the same number of each character, they are obviously permutations. This is accomplished in O(N) time.
I don't like the answer though because it reinvents what is_permutation is designed to do. That said, is_permutation has a complexity of:
At most O(N^2) applications of the predicate, or exactly N if the sequences are already equal, where N = std::distance(first1, last1)
So I cannot advocate the use of is_permutation where it is orders of magnitude slower than a hand-spun algorithm. But surely the implementer of the standard would not miss such an obvious improvement? So why is is_permutation O(N^2)?

is_permutation works on almost any data type. The algorithm in your link works for data types with a small number of values only.
It's the same reason why std::sort is O(N log N) but counting sort is O(N).

It was I who wrote that answer.
When the string's value_type is char, the number of elements required in a lookup table is 256. For a two-byte encoding, 65536. For a four-byte encoding, the lookup table would have just over 4 billion entries, at a likely size of 16 GB! And most of it would be unused.
So the first thing is to recognize that even if we restrict the types to char and wchar_t, it may still be untenable. Likewise if we want to do is_permutation on sequences of type int.
We could have a specialization of std::is_permutation<> for integral types of size 1 or 2 bytes. But this is somewhat reminiscent of std::vector<bool> which not everyone thinks was a good idea in retrospect.
We could also use a lookup table based on std::map<T, size_t>, but this is likely to be allocation-heavy so it might not be a performance win (or at least, not always). It might be worth implementing one for a detailed comparison though.
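For comparison purposes, a minimal sketch of such a map-based counting check might look like the following (untested; the function name and interface are made up for illustration):

#include <cstddef>
#include <iterator>
#include <map>

// Counting-based permutation check for arbitrary (less-than comparable) types.
// Builds a frequency table of the first range, then decrements it while
// walking the second range. Roughly O(N log N) because of the map operations,
// plus one allocation per distinct value.
template <class It>
bool is_permutation_counting(It first1, It last1, It first2, It last2) {
    using T = typename std::iterator_traits<It>::value_type;
    if (std::distance(first1, last1) != std::distance(first2, last2))
        return false;
    std::map<T, std::size_t> counts;
    for (It it = first1; it != last1; ++it)
        ++counts[*it];
    for (It it = first2; it != last2; ++it) {
        auto pos = counts.find(*it);
        if (pos == counts.end() || pos->second == 0)
            return false;
        --pos->second;
    }
    return true;
}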
In summary, I don't fault the C++ standard for not including a high-performance version of is_permutation for char. First because in the real world I'm not sure it's the most common use of the template, and second because the STL is not the be-all and end-all of algorithms, especially where domain knowledge can be used to accelerate computation for special cases.
If it turns out that is_permutation for char is quite common in the wild, C++ library implementors would be within their rights to provide a specialization for it.

The answer you cite works on chars. It assumes they are 8 bit (not necessarily the case) and so there are only 256 possibilities for each value, and that you can cheaply go from each value to a numeric index to use for a lookup table of counts (for char in this case, the value and the index are the same thing!)
It generates a count of how many times each char value occurs in each string; then, if these distributions are the same for both strings, the strings are permutations of each other.
What is the time complexity?
you have to walk each character of each string, so M+N steps for two inputs of lengths M and N
each of these steps involves incrementing a count in a fixed-size table at an index given by the char, so it is constant time
So the overall time complexity is O(N+M): linear, as you describe.
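For concreteness, a sketch of that counting approach for 8-bit chars might look like this (assuming std::string inputs; this is an illustration of the idea, not the cited answer's exact code):

#include <array>
#include <string>

// O(M + N) permutation check for strings of 8-bit chars: count occurrences
// of each byte value in one string, subtract the counts for the other,
// and check that everything balances out to zero.
bool is_char_permutation(const std::string& a, const std::string& b) {
    if (a.size() != b.size())
        return false;
    std::array<int, 256> counts{};  // zero-initialized lookup table
    for (unsigned char c : a) ++counts[c];
    for (unsigned char c : b) --counts[c];
    for (int n : counts)
        if (n != 0) return false;
    return true;
}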
Now, std::is_permutation makes no such assumptions about its input. It doesn't know that there are only 256 possibilities, or indeed that they are bounded at all. It doesn't know how to go from an input value to a number it can use as an index, never mind how to do that in constant time. The only thing it knows is how to compare two values for equality, because the caller supplies that information.
So, the time complexity:
we know it has to consider each element of each input at some point
we know that, for each element it hasn't seen before (I'll leave the discussion of how that's determined, and why it doesn't affect the big-O complexity, as an exercise), it cannot turn the element into any kind of index or key for a table of counts, so the only way it has of counting how many occurrences of that element exist is a linear walk through both inputs to see how many elements match
so the complexity is going to be quadratic at best.


What are the best sorting algorithms when 'n' is very small?

In the critical path of my program, I need to sort an array (specifically, a C++ std::vector<int64_t>, using the GNU C++ standard library). I am using the standard library's sorting algorithm (std::sort), which in this case is introsort.
I was curious about how well this algorithm performs, and when doing some research on the various sorting algorithms that different standard and third-party libraries use, I found that almost all of them care about cases where 'n' tends to be the dominant factor.
In my specific case though, 'n' is going to be on the order of 2-20 elements. So the constant factors could actually be dominant. And things like cache effects might be very different when the entire array we are sorting fits into a couple of cache lines.
What are the best sorting algorithms for cases like this where the constant factors likely overwhelm the asymptotic factors? And do there exist any vetted C++ implementations of these algorithms?
Introsort takes your concern into account, and switches to an insertion sort implementation for short sequences.
Since your STL already provides it, you should probably use that.
Insertion sort or selection sort are both typically faster for small arrays (i.e., fewer than 10-20 elements).
Watch https://www.youtube.com/watch?v=FJJTYQYB1JQ
A simple linear insertion sort is really fast. Making a heap first can improve it a bit.
Sadly the talk doesn't compare that against the hardcoded solutions for <= 15 elements.
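For reference, the kind of simple linear insertion sort being discussed is only a few lines (a sketch, not tuned for any particular hardware):

#include <cstddef>
#include <cstdint>
#include <vector>

// Straightforward insertion sort; typically very fast for ~2-20 elements
// because of its tiny constant factor and sequential memory access.
void insertion_sort(std::vector<std::int64_t>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        std::int64_t key = v[i];
        std::size_t j = i;
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];  // shift larger elements one slot to the right
            --j;
        }
        v[j] = key;
    }
}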
It's impossible to know the fastest way to do anything without knowing exactly what the "anything" is.
Here is one possible set of assumptions:
We don't have any knowledge of the element structure except that elements are comparable. We have no useful way to group them into bins (for radix sort), we must implement a comparison-based sort, and comparison takes place in an opaque manner.
We have no information about the initial state of the input; any input order is equally likely.
We don't have to care about whether the sort is stable.
The input sequence is a simple array. Accessing elements is constant-time, as is swapping them. Furthermore, we will benchmark the function purely according to the expected number of comparisons - not number of swaps, wall-clock time or anything else.
With that set of assumptions (and possibly some other sets), the best algorithms for small numbers of elements will be hand-crafted sorting networks, tailored to the exact length of the input array. (These always perform the same number of comparisons; it isn't feasible to "short-circuit" these algorithms conditionally because the "conditions" would depend on detecting data that is already partially sorted, which still requires comparisons.)
For a network sorting four elements (in the known-optimal five comparisons), this might look like (I did not test this):
#include <utility>  // std::swap

template<class RandomIt, class Compare>
void compare_and_swap(RandomIt first, Compare comp, int x, int y) {
    // Ensure first[x] does not compare greater than first[y].
    if (comp(first[y], first[x])) {
        std::swap(first[x], first[y]);
    }
}

// Assume there are exactly four elements available at the `first` iterator.
template<class RandomIt, class Compare>
void network_sort_4(RandomIt first, Compare comp) {
    compare_and_swap(first, comp, 0, 2);
    compare_and_swap(first, comp, 1, 3);
    compare_and_swap(first, comp, 0, 1);
    compare_and_swap(first, comp, 2, 3);
    compare_and_swap(first, comp, 1, 2);
}
In real-world environments, of course, we will have different assumptions. For small numbers of elements, with real data (but still assuming we must do comparison-based sorts) it will be difficult to beat naive implementations of insertion sort (or bubble sort, which is effectively the same thing) that have been compiled with good optimizations. It's really not feasible to reason about these things by hand, considering both the complexity of the hardware level (e.g. the steps it takes to pipeline instructions and then compensate for branch mis-predictions) and the software level (e.g. the relative cost of performing the swap vs. performing the comparison, and the effect that has on the constant-factor analysis of performance).

Does std::string::find() use KMP or a suffix tree? [duplicate]

Why doesn't the C++ implementation of string::find() use the KMP algorithm (and run in O(N + M)) instead of running in O(N * M)? Is that corrected in C++0x?
If the complexity of the current find is not O(N * M), what is it?
So what algorithm is implemented in GCC? Is it KMP? If not, why not?
I've tested it, and the running time shows that it runs in O(N * M).
Why doesn't the C++ implementation of string::substr() use the KMP algorithm (and run in O(N + M)) instead of running in O(N * M)?
I assume you mean find(), rather than substr() which doesn't need to search and should run in linear time (and only because it has to copy the result into a new string).
The C++ standard doesn't specify implementation details, and only specifies complexity requirements in some cases. The only complexity requirements on std::string operations are that size(), max_size(), operator[], swap(), c_str() and data() are all constant time. The complexity of anything else depends on the choices made by whoever implemented the library you're using.
The most likely reason for choosing a simple search over something like KMP is to avoid needing extra storage. Unless the string to be found is very long, and the string to search contains a lot of partial matches, the time taken to allocate and free that would likely be much more than the cost of the extra complexity.
Is that corrected in c++0x?
No, C++11 doesn't add any complexity requirements to std::string, and certainly doesn't add any mandatory implementation details.
If the complexity of the current substr is not O(N * M), what is it?
That's the worst-case complexity, when the string to search contains a lot of long partial matches. If the characters have a reasonably uniform distribution, then the average complexity would be closer to O(N). So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower.
FYI,
string::find in both gcc/libstdc++ and llvm/libcxx was very slow. I improved both of them quite significantly (by ~20x in some cases). You might want to check the new implementations:
GCC: PR66414 optimize std::string::find
https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65
LLVM: https://reviews.llvm.org/D27068
The new algorithm is simpler and uses the hand-optimized memchr and memcmp functions.
Where do you get the impression that std::string::substr() doesn't use a linear algorithm? In fact, I can't even imagine how to implement it in a way which has the complexity you quoted. Also, there isn't much of an algorithm involved: is it possible that you think this function does something other than what it does? std::string::substr() just creates a new string starting at its first argument, using either the number of characters specified by the second parameter or the characters up to the end of the string.
You may be referring to std::string::find(), which doesn't have any complexity requirements, or std::search(), which is indeed allowed to do O(n * m) comparisons. However, this gives implementers the freedom to choose between an algorithm which has the best theoretical complexity and one which doesn't need additional memory. Since allocation of arbitrary amounts of memory is generally undesirable unless specifically requested, this seems a reasonable thing to do.
The C++ standard does not dictate the performance characteristics of substr (or many other parts, including the find you're most likely referring to with an M*N complexity).
It mostly dictates functional aspects of the language (with some exceptions like the non-legacy sort functions, for example).
Implementations are even free to implement qsort as a bubble sort (but only if they want to be ridiculed and possibly go out of business).
For example, there are only seven (very small) sub-points in section 21.4.7.2 basic_string::find of C++11, and none of them specify performance parameters.
Let's look into the CLRS book. On page 989 of the third edition we have the following exercise:
Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet Σ_d = {0, 1, ..., d-1}, where d >= 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is
(n - m + 1) * (1 - d^{-m}) / (1 - d^{-1}) <= 2(n - m + 1)
over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.
NAIVE-STRING-MATCHER(T, P)
1  n = T.length
2  m = P.length
3  for s = 0 to n - m
4      if P[1..m] == T[s+1..s+m]
5          print "Pattern occurs with shift" s
Proof:
For a single shift, the expected number of comparisons is 1 + 1/d + ... + 1/d^{m-1} = (1 - d^{-m}) / (1 - d^{-1}), which is at most d/(d-1) <= 2 for d >= 2. Now multiply by the number of valid shifts, which is n - m + 1, to get the claimed bound. □
Where do you get your information about the C++ library? If you do mean string::find and it really doesn't use the KMP algorithm, then I suggest that it is because that algorithm isn't generally faster than a simple linear search, due to having to build a partial match table before the search can proceed.
If you are going to be searching for the same pattern in multiple texts, the Boyer-Moore algorithm is a good choice, because the pattern tables need only be computed once but are used every time a text is searched. If you are only going to search for a pattern once in one text, though, the overhead of computing the tables, along with the overhead of allocating memory, slows you down too much, and std::string::find() will beat you since it does not allocate any memory and has no setup overhead.
Boost has multiple string searching algorithms. I found that Boyer-Moore was an order of magnitude slower than std::string::find() for a single pattern search in one text.
For my cases, Boyer-Moore was rarely faster than std::string::find(), even when searching multiple texts with the same pattern.
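For what it's worth, C++17 later added std::boyer_moore_searcher, which makes the "build the tables once, search many texts" pattern straightforward to express; a sketch (assuming a C++17 standard library):

#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Build the Boyer-Moore tables for the pattern once, then reuse the
// searcher across many texts via std::search.
std::vector<bool> contains_pattern(const std::string& pattern,
                                   const std::vector<std::string>& texts) {
    std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
    std::vector<bool> results;
    for (const std::string& text : texts)
        results.push_back(std::search(text.begin(), text.end(), searcher)
                          != text.end());
    return results;
}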

What is the most efficient way to tell if a string of unicode characters has unique characters in C++?

If it were just ASCII characters, I would just use an array of bool of size 256. But Unicode has so many characters.
1. Wikipedia says Unicode has more than 110,000 characters. So bool[110000] might not be a good idea?
2. Lets say the characters are coming in a stream and I just want to stop whenever a duplicate is detected. How do I do this?
3. Since the set is so big, I was thinking of a hash table. But how do I tell when a collision happens? I do not want to continue once a collision is detected. Is there a way to do this with the STL implementation of a hash table?
Efficient in terms of speed and memory utilization.
Your Options
There are a couple of possible solutions:
1. bool[0x110000] (note that this is a hex constant, not a decimal constant as in the question)
2. vector<bool> with a size of 0x110000
3. sorted vector<uint32_t> or list<uint32_t> containing every encountered codepoint
4. map<uint32_t, bool> or unordered_map<uint32_t, bool> containing a mapping of codepoints to whether they have been encountered
5. set<uint32_t> or unordered_set<uint32_t> containing every encountered codepoint
6. A custom container, e.g. a bloom filter which provides high-density probabilistic storage for exactly this kind of problem
Analysis
Now, let's perform a basic analysis of the 6 variants:
1. Requires exactly 0x110000 bytes = 1.0625 MiB, plus whatever overhead a single allocation adds. Both setting and testing are extremely fast.
2. While this would seem to be pretty much the same solution, it only requires roughly 1/8 of the memory, since it stores the bools in one bit each instead of one byte each. Both setting and testing are extremely fast; performance relative to the first solution may be better or worse, depending on things like CPU cache size, memory performance and of course your test data.
3. While potentially taking the least amount of memory (4 bytes per encountered codepoint, so it requires less memory as long as the input stream contains at most 0x110000 / 8 / 4 = 34816 distinct codepoints), performance for this solution will be abysmal: testing takes O(log(n)) for the vector (binary search) and O(n) for the list (binary search requires random access), while inserting takes O(n) for the vector (all following elements must be moved) and O(1) for the list (assuming that you kept the result of your failed test). This means that a test + insert cycle takes O(n), so your total runtime will be O(n^2)...
4. Without doing a lot of talking about this, it should be obvious that we do not need the bool, but would rather just test for existence, leading to solution 5.
5. Both sets are fairly similar in performance; set is usually implemented with a binary tree and unordered_set with a hash map. This means that both are fairly inefficient memory-wise: both contain additional overhead (non-leaf nodes in trees, and the actual tables containing the hashes in hash maps), meaning they will probably take 8-16 bytes per entry. Testing and inserting are O(log(n)) for the set and O(1) amortized for the unordered_set.
To answer the comment: testing whether uint32_t const x is contained in unordered_set<uint32_t> data is done like so: if(data.count(x)) or if(data.find(x) != data.end()).
6. The major drawback here is the significant amount of work that the developer has to invest. Also, the bloom filter that was given as an example is a probabilistic data structure, meaning false positives are possible (false negatives are not, in this specific case).
Conclusion
Since your test data is not actual textual data, using a set of either kind is actually very memory inefficient (with high probability). Since this is also the only possibility to achieve better performance than the naive solutions 1 and 2, it will in all probability be significantly slower as well.
Taking into consideration the fact that it is easier to handle and that you seem to be fairly conscious of memory consumption, the final verdict is that vector<bool> seems the most appropriate solution.
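For illustration, a sketch of that vector<bool> approach, assuming the stream has already been decoded into 32-bit codepoints (the function name and interface are invented):

#include <cstdint>
#include <vector>

// Returns true as soon as a repeated codepoint is seen, false otherwise.
// The bitmap uses 0x110000 bits (roughly 136 KiB) regardless of input length.
bool has_duplicate_codepoint(const std::vector<std::uint32_t>& codepoints) {
    std::vector<bool> seen(0x110000, false);
    for (std::uint32_t cp : codepoints) {
        if (cp >= seen.size())
            continue;          // not a valid codepoint; skip or treat as an error
        if (seen[cp])
            return true;       // duplicate detected: stop here
        seen[cp] = true;
    }
    return false;
}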
Notes
Unicode is intended to represent text. This means that your test case and any analysis following from this is probably highly flawed.
Also, while it is simple to detect duplicate code points, the idea of a character is far more ambiguous, and may require some kind of normalization (e.g. "ä" can be either one codepoint or an "a" and a diacritical mark, or even be based on the cyrillic "a").

Creating a hash code from a string in C++

I have very long strings that I need to compare for equality. Since comparing them character by character is very time consuming, I'd like to create a hash for each string.
I'd like the generated hash code to be unique (or the chance that two strings generate the same hash to be very small). I think creating an int from a string as the hash is not strong enough to eliminate having two different strings with the same hash code, so I am looking for a string hash code.
Am I right in the above assumption?
To clarify, assume that I have a string of, say, 1K length and I create a hash code of 10 chars; then comparing hash codes speeds things up by a factor of 100.
The question that I have is how to create such hash code in c++?
I am developing on windows using visual studio 2012.
To be useful in this case, the hash code must be quick to calculate. Using anything larger than the largest words supported by the hardware (usually 64 bits) may be counterproductive. Still, you can give it a try. I've found the following to work fairly well:
#include <string>

unsigned long long
hash( std::string const& s )
{
    unsigned long long results = 12345;  // anything but 0 is probably OK.
    for ( auto current = s.begin(); current != s.end(); ++ current ) {
        results = 127 * results + static_cast<unsigned char>( *current );
    }
    return results;
}
Using a hash like this will probably not be advantageous, however, unless most of the comparisons are with strings that aren't equal but have long common initial sequences. Remember that if the hashes are equal, you still have to compare the strings, and that comparison only has to go until the first characters which aren't equal. (In fact, most of the comparison functions I've seen start by comparing length, and only compare characters if the strings are of equal length.)
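To make the intended use concrete, one would typically cache the hash next to the string and only fall back to a character comparison when the hashes match; a rough sketch (the HashedString type is invented for illustration):

#include <string>
#include <utility>

// Store the hash alongside the string; equality checks compare the cached
// hashes first and only compare characters when the hashes agree.
struct HashedString {
    std::string text;
    unsigned long long h;

    explicit HashedString(std::string s) : text(std::move(s)), h(hash(text)) {}

    static unsigned long long hash(const std::string& s) {
        unsigned long long results = 12345;
        for (unsigned char c : s)
            results = 127 * results + c;
        return results;
    }
};

bool operator==(const HashedString& a, const HashedString& b) {
    return a.h == b.h && a.text == b.text;  // full compare only on hash match
}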
There are a lot of hashing algorithms available which you may use.
If you want to implement one yourself, then a simple one could be to take the ASCII value of each character, offset so that a = 1, b = 2, ..., and multiply it by the character's (1-based) index in the string. Keep adding these values and store the sum as the hash value for a particular string.
For example, the hash value for "abc" would be:
HASH("abc") = 1*1 + 2*2 + 3*3 = 14;
The probability of collision lowers as the string length increases (considering that your strings will be lengthy).
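Expressed as code, that scheme is only a few lines (a sketch, assuming lowercase ASCII input as in the example above):

#include <cstddef>
#include <string>

// Hash as described above: map 'a'..'z' to 1..26 and weight each value by
// its 1-based position, e.g. simple_hash("abc") = 1*1 + 2*2 + 3*3 = 14.
unsigned long long simple_hash(const std::string& s) {
    unsigned long long h = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        h += static_cast<unsigned long long>(s[i] - 'a' + 1) * (i + 1);
    return h;
}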
There are many known hash algorithms available. For example MD5, SHA1, etc. You should not need to implement your own algorithm but use one of the available ones. Use the search engine of your choice to find implementations like this one.
It really depends what your hard requirements are. If you have hard requirements such as "search may never take more than so and so long", then it's possible that no solution is applicable. If your intent is simply to speed up a large number of searches, then a simple, short hash will do fine.
While it is generally true that hashing a 1000-character string to an integer (a single 32-bit or 64-bit number) can, and eventually will, produce collisions, this is not something to be concerned about.
A 10-character hash will also produce collisions. This is a necessary consequence of the fact that 1000 > 10. For every possible 10-character hash, there exist a huge number of distinct 1000-character strings [1].
The important question is whether you will actually see collisions, how often you will see them, and whether it matters at all. Whether (or how likely) you see a collision depends not on the length of the strings, but on the number of distinct strings.
If you hash 77,100 strings (of more than 4 characters each) using a 32-bit hash, you have a 50% chance that two strings somewhere in the set share a hash value. At 25,000 strings, the likelihood is only somewhere around 5-6%. At 1,000 strings, the likelihood is approximately 0.01%.
Note that when I say "50% at 77,100 strings", this does not mean that your chance of actually encountering a collision is that high. It's merely the chance of having two strings with the same hash value. Unless that's the case for the majority of strings, the chance of actually hitting one is again a lot lower.
Which means, no more and no less, that for most usage cases it simply doesn't matter. Unless you want to hash hundreds of thousands of strings, stop worrying now and use a 32-bit hash.
Otherwise, unless you want to hash billions of strings, stop worrying here and use a 64-bit hash.
Thing is, you must be prepared to handle collisions in any case, because as long as you have at least 2 strings, the likelihood of a collision is never exactly zero. Even hashing only 2 or 3 1000-character strings into a 500-byte hash could in principle produce a collision (very unlikely, but possible).
Which means you must do a string comparison if the hash matches in either case, no matter how long (or how good or bad) your hash is.
If collisions don't happen every time, they're entirely irrelevant. If you have a lot of collisions in your table and encounter one, say, on 1 in 10,000 lookups (which is a lot!), it has no practical impact. Yes, you will have to do a useless string comparison once in 10,000 lookups, but the other 9,999 work by comparing a single integer alone. Unless you have a hard realtime requirement, the measurable impact is exactly zero.
Even if you totally screw up and encounter a collision on every 5th search (a pretty disastrous case: it would mean that roughly 800 million string pairs collide, which is only possible at all with at least 1.6 billion strings), this still means that 4 out of 5 searches don't hit a collision, so you still discard 80% of non-matches without doing a comparison.
On the other hand, generating a 10-character hash is cumbersome and slow, and you are likely to create a hash function that has more collisions (because of bad design) than a readily existing 32- or 64-bit hash.
Cryptographic hash functions are certainly better, but they run slower than their non-cryptographic counterparts too, and the storage required to store their 16- or 32-byte hash values is much larger as well (at virtually no benefit to most people). This is a space/time tradeoff.
Personally, I would just use something like djb2, which can be implemented in 3 lines of C code, works well, and runs very fast. There exist of course many other hash functions that you could use, but I like djb2 for its simplicity.
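For reference, djb2 in its usual form (seed 5381, multiplier 33), written here as a small C++ function:

#include <string>

// djb2: hash = hash * 33 + c, starting from 5381.
unsigned long long djb2(const std::string& s) {
    unsigned long long hash = 5381;
    for (unsigned char c : s)
        hash = hash * 33 + c;
    return hash;
}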
Funnily, after reading James Kanze's answer, the code posted there seems to be a variation of djb2, only with a different seed and a different multiplier (djb2 uses 5381 and 33, respectively).
In the same answer, the remark about comparing string lengths first is a good tip as well. It's noteworthy that you can consider a string's length a form of "hash function" too (albeit a rather weak one, but one that often comes "for free").
[1] However, strings are not some "random binary garbage" as the hash is. They are structured, low-entropy data. To that extent, the comparison does not really hold true.
Well, I would first compare string lengths. If they match, then I'd start comparing using an algorithm that tests character equality at random positions, stopping at the first difference.
The random positions would be obtained from a vector of size stringLength, filled with random ints ranging from 0 to stringLength-1. I haven't measured this method, though; it's just an idea. But it would save you the concerns of hash collisions, while reducing comparison times.

Caching of floating point values in C++

I would like to assign a unique object to a set of floating point values. To do so, I am exploring two different options:
The first option is to maintain a static hash map (std::unordered_map<double, Foo*>) in the class and to prevent duplicates from being created in the first place. This means that instead of calling the constructor, I would check whether the value is already in the hash map and, if so, reuse it. I would also need to remove the value from the hash map in the destructor.
The second option is to allow duplicate values during creation, and only afterwards sort them all at once and detect duplicates, after all values have been created. I guess I would need a hash map for that sorting as well. Or would an ordered map (std::map) work just as well then?
Is there some reason to expect that the first option (which I like better) would be considerably slower in any situation? That is, would finding duplicate entries be much faster if I perform it on all entries at once rather than one entry at a time?
I am aware of the pitfalls of caching floating point numbers and will prevent NaNs and infinities from being added to the map. A few duplicate entries for the same constant are also not a problem, should this occur for a few values; it would only result in a very small speed penalty.
Depending on the source and the possible values of the floating point numbers, a bigger problem might be defining a hash function which respects equality. (0, Inf and NaN are the problem values: most floating point formats have two representations for 0, +0.0 and -0.0, which compare equal; I think the same thing holds for Inf. And two NaN always compare unequal, even when they have exactly the same bit pattern.)
Other than that, in all questions of performance, you have to measure.
You don't indicate how big the set is likely to become. Unless it is enormous, if all values are inserted up front, the fastest solution is often to use push_back on an std::vector, then std::sort and, if desired, std::unique after the vector has been filled. In many cases, using an std::vector and keeping it sorted is faster even when insertions and removals are frequent. (When you get a new request, use std::lower_bound to find the entry point; if the value at the location found is not equal, insert a new entry at that point.) The improved locality of std::vector largely offsets any additional costs due to moving the objects during insertion and deletion, and often even the fact that access is O(lg n) rather than O(1). (In one particular case, I found that the break-even point between a hash table and a sorted std::vector was around 100,000 entries.)
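To illustrate that scheme, a rough sketch (untested; Foo is the object type from the question, and the other names are invented):

#include <algorithm>
#include <vector>

struct Foo;  // the caller's object type, as in the question

// Keep (value, object) pairs sorted by value; look up with std::lower_bound
// and insert at the returned position when the value is not present yet.
struct Entry {
    double value;
    Foo*   object;
};

Foo* find_or_insert(std::vector<Entry>& entries, double value, Foo* candidate) {
    auto pos = std::lower_bound(entries.begin(), entries.end(), value,
                                [](const Entry& e, double v) { return e.value < v; });
    if (pos != entries.end() && pos->value == value)
        return pos->object;                        // reuse the existing object
    entries.insert(pos, Entry{value, candidate});  // keeps the vector sorted
    return candidate;
}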
Have you considered actually measuring it?
None of us can tell you how the code you're considering will actually perform. Write the code, compile it, run it and measure how fast it runs.
Spending time trying to predict which solution will be faster is (1) a waste of your time, and (2) likely to yield incorrect results.
But if you want an abstract answer, it is that it depends on your use case.
If you can collect all the values, and sort them once, that can be done in O(n lg n) time.
If you insert the elements one at a time into a data structure with the performance characteristics of std::map, then each insertion will take O(lg n) time, and so, performing n insertions will also take O(n lg n) time.
Inserting into a hash map (std::unordered_map) takes expected constant time, so n insertions can be done in O(n) expected time. So in theory, for sufficiently large values of n, a hash map will be faster.
In practice, in your case, no one knows. Which is why you should measure it if you're actually concerned about performance.
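For completeness, a rough sketch of the first option from the question (the interning hash map), including the NaN check the question mentions; the names and details here are invented, and ownership/cleanup is left out:

#include <cmath>
#include <unordered_map>

struct Foo {
    explicit Foo(double v) : value(v) {}
    double value;
};

// Returns the unique Foo for this value, creating it on first use.
// NaN is rejected because it never compares equal to itself and so
// could never be found again in the map.
Foo* get_unique(double value) {
    static std::unordered_map<double, Foo*> cache;
    if (std::isnan(value))
        return nullptr;  // or signal an error, per the question's constraints
    auto it = cache.find(value);
    if (it != cache.end())
        return it->second;
    Foo* object = new Foo(value);  // the destructor would erase from the cache, as the question notes
    cache.emplace(value, object);
    return object;
}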