Efficient extraction of elements from a C++ unordered set

In C++ suppose you have an unordered set (https://en.cppreference.com/w/cpp/container/unordered_set) of strings - is there a way to efficiently extract all strings from that set that meet a certain criterion (e.g. find all strings in the set that begin with the letter "a") using a method other than iterating through the entire set with a for loop and checking the first character of every single string?

For an arbitrary criterion this is not possible; see this answer for more details.
Depending on your other needs, a sorted std::vector is highly likely to be the most efficient for the extraction part alone. Use algorithms like std::lower_bound to work with a sorted std::vector. In the end, your actual use case is what determines which container is best suited overall performance-wise - although std::vector comes close to a one-size-fits-all answer where performance is concerned (thanks to the internal optimizations that contiguous storage allows).
That being said, in general it's advisable to use the container that seems best suited for the problem at hand and only do clever optimizations if there's an actual performance bottleneck.
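For illustration, a minimal sketch of the sorted-std::vector approach, assuming the vector is kept sorted at all times; the helper name startingWithA is made up for this example:

#include <algorithm>
#include <string>
#include <vector>

// Collect all strings starting with 'a' from an already-sorted vector.
std::vector<std::string> startingWithA(const std::vector<std::string>& sorted)
{
    // "a" is the smallest string beginning with 'a'; "b" is the first string past that group.
    auto first = std::lower_bound(sorted.begin(), sorted.end(), std::string("a"));
    auto last  = std::lower_bound(sorted.begin(), sorted.end(), std::string("b"));
    return std::vector<std::string>(first, last); // O(log n) to find the range, O(k) to copy it out
}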

For the general case of any criteria, you can't do better than iterating over every element.
Each container has specific criteria that it can do better with, e.g.
std::set<std::string> strings = /* something */;
auto first = strings.lower_bound("a"); // O(log(strings)), "a" is the least string that starts with 'a'
auto last = strings.lower_bound("b"); // O(log(strings)), "b" is the first string after those that start with 'a'
strings.erase(first, last); // O(log(strings) + distance(first, last)), all the strings starting with 'a' are removed
Here we remove elements starting with 'a', with a complexity of O(log(strings) + distance(first, last)) which is a O(alphabet) improvement over iterating all elements.
Or the more contrived
std::unordered_set<std::string> strings = /* something */;
auto bucket = strings.bucket("Any collision will do"); // O(1), the bucket that key falls into
std::vector<std::string> colliding(strings.begin(bucket), strings.end(bucket)); // O(bucket size), copied out via local iterators
for (const auto& s : colliding) strings.erase(s); // average O(1) per erase
Here we remove the elements that land in the same bucket as "Any collision will do", with an average complexity of O(bucket size). (unordered_set::erase won't accept the local iterators that begin(bucket)/end(bucket) return, hence the copy into a temporary vector before erasing by key.)

Instead of using an unordered set, consider adapting your data structure to something like a trie.
For this kind of prefix query it is likely to be more useful to you.
For more details please check: https://en.wikipedia.org/wiki/Trie
Implementation: https://www.geeksforgeeks.org/trie-insert-and-search/.
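For a rough idea of what that looks like, here is a minimal trie sketch (not the linked implementation; all type and function names are made up) supporting insertion and collecting every stored string with a given prefix:

#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;
    std::map<char, std::unique_ptr<TrieNode>> children;
};

struct Trie {
    TrieNode root;

    void insert(const std::string& word) {
        TrieNode* node = &root;
        for (char c : word) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->isWord = true;
    }

    // Collect every stored string that begins with the given prefix.
    std::vector<std::string> withPrefix(const std::string& prefix) const {
        const TrieNode* node = &root;
        for (char c : prefix) {                      // walk down to the prefix node
            auto it = node->children.find(c);
            if (it == node->children.end()) return {};
            node = it->second.get();
        }
        std::vector<std::string> out;
        collect(node, prefix, out);                  // only the matching subtree is visited
        return out;
    }

private:
    static void collect(const TrieNode* node, std::string current, std::vector<std::string>& out) {
        if (node->isWord) out.push_back(current);
        for (const auto& [c, child] : node->children)
            collect(child.get(), current + c, out);
    }
};

Walking down to the prefix node costs O(prefix length), and afterwards only the subtree of matching strings is visited, so strings that don't share the prefix are never touched.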
Depending on your needs you might also look at other algorithms such as Aho-Corasick or suffix arrays. You may need to do some research on the right data structure based on the amount of data you have, how much recomputation you need, and how many queries you make.
I hope this helps.

Related

How to quickly search a large vector many times?

I have a std::vector<std::string> that has 43,000 dictionary words. I have about 315,000 maybe-words and for each one I need to determine if it's a valid word or not. This takes a few seconds and I need to complete the task as fast as possible.
Any ideas on the best way to complete this? Currently I iterate through on each attempt:
for (const std::string& word : words) {
    if (std::find(dictionary.begin(), dictionary.end(), word) == dictionary.end()) {
        // The word is not in the dictionary
        return false;
    }
}
return true;
Is there a better way to iterate multiple times? I have a few hypotheses, such as
Create a cache of invalid words, since the 315,000 list probably has 25% duplicates
Only compare with words of the same length
Is there a better way to do this? I'm interested in an algorithm or idea.
Is there a better way to iterate multiple times?
Yes. Convert the vector to another data structure that supports faster lookups. The standard library comes with std::set and std::unordered_set which are both likely to be faster than repeated linear search. Other data structures may be even more efficient.
If your goal is to create a range of words or non-words in the maybe set, then another efficient approach would be to sort both vectors, and use std::(ranges::)set_intersection or std::(ranges::)set_difference.
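As a rough sketch of the first suggestion, assuming the words and dictionary vectors from the question (the function name allWordsValid is made up):

#include <string>
#include <unordered_set>
#include <vector>

// Build the set once, then each lookup is O(1) on average instead of a linear scan.
bool allWordsValid(const std::vector<std::string>& words,
                   const std::vector<std::string>& dictionaryVec)
{
    std::unordered_set<std::string> dictionary(dictionaryVec.begin(), dictionaryVec.end());
    for (const std::string& word : words) {
        if (dictionary.find(word) == dictionary.end())
            return false; // word is not in the dictionary
    }
    return true;
}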

In a low-latency application, Is unordered_map ever a better solution over vector?

Is it advisable to use unordered_map in place of vector while developing a low-latency application ?
I recently appeared for an interview with a financial company which worked on low-latency trading applications. I was asked a question for which I answered using an unordered_map, which seemed pretty good efficiency-wise (O(n)) compared to if I had used a vector (O(n*n)). However, I know that it is advisable to use vector as much as possible and avoid unordered_map in order to benefit from cache coherence. I just wanted to see if there is a better solution possible for this problem. The problem I was asked was to check if two strings are a permutation of each other.
bool isPermutation(const std::string& first, const std::string& second) {
    std::unordered_map<char, int> charDict;
    if (first.length() != second.length())
        return false;
    for (char c : first) {
        ++charDict[c];
    }
    for (char c : second) {
        auto it = charDict.find(c);
        if (it == charDict.end() || it->second == 0)
            return false; // second has more of this character than first
        --it->second;
    }
    return true;
}
You can assume that both strings are of equal length, and the function should return true only if every character occurs exactly as many times in the second string as in the first.
Sure, but it really depends on the problem you are trying to solve. If the domain of your key space is unknown, it would be difficult to come up with a generic solution that is faster than unordered_map.
In this case, the domain of your key space is known: it is limited to ASCII characters. This is convenient because you can instantly convert from item (char) to vector index (std::size_t). So you could just use the value of each character as an index into a vector rather than hashing it for every lookup.
But in general, don't optimize prematurely. If unordered_map is the most natural solution, I would start there, then profile, and if you find that performance does not meet your requirements, look at reworking your solution. (This isn't always the best advice; if you know you are working on a highly critical piece of code, there are certain design decisions you will want to take into consideration from the beginning. Coming back and refactoring later may be much more difficult if you start with an incompatible design.)
Since there are only 256 possible keys, you can use a stack-allocated array of 256 counts, which will be faster than a vector or an unordered_map. If first.size() + second.size() < 128, then only initialize the counts to 0 for keys that actually occur; otherwise memset the whole array.
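A minimal sketch of that array-based version (for simplicity it just zero-initializes the whole array instead of applying the partial-initialization trick described above):

#include <array>
#include <string>

// Count characters of `first` up and of `second` down in a fixed-size array;
// the strings are permutations of each other iff no count ever goes negative.
bool isPermutation(const std::string& first, const std::string& second)
{
    if (first.length() != second.length())
        return false;
    std::array<int, 256> counts{};       // zero-initialized, lives on the stack
    for (unsigned char c : first)
        ++counts[c];
    for (unsigned char c : second) {
        if (--counts[c] < 0)             // more of this character in second than in first
            return false;
    }
    return true;
}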

Efficient removal of a set of integers from another set

I have a (large) set of integers S, and I want to run the following pseudocode:
set result = {};
while (S isn't empty)
{
    int i = S.getArbitraryElement();
    result.insert(i);
    set T = elementsToDelete(i);
    S = S \ T; // set difference
}
The function elementsToDelete is efficient (sublinear in the initial size of S) and the size of T is small (assume it's constant). T may contain integers no longer in S.
Is there a way of implementing the above that is faster than O(|S|^2)? I suspect I should be able to get O(|S| k), where k is the time complexity of elementsToDelete. I can of course implement the above in a straightforward way using std::set_difference but my understanding is that set_difference is O(|S|).
Using a std::set<int> S;, you can do:
for (auto k : elementsToDelete(i)) {
    S.erase(k);
}
Of course the lookup for erase is O(log(S.size())), not the O(1) you're asking for. That can be achieved with std::unordered_set, assuming not too many collisions (which is a big assumption in general but very often true in particular).
Despite the name, the std::set_difference algorithm doesn't have much to do with std::set. It works on anything you can iterate in order. Anyway it's not for in-place modification of a container. Since T.size() is small in this case, you really don't want to create a new container each time you remove a batch of elements. In another example where the result set is small enough, it would be more efficient than repeated erase.
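For illustration, a sketch of the whole loop from the pseudocode using std::unordered_set; elementsToDelete is the function from the question, and the sketch erases the chosen element explicitly in case T doesn't contain it:

#include <unordered_set>
#include <vector>

std::vector<int> elementsToDelete(int i); // provided elsewhere, as in the question

std::vector<int> process(std::unordered_set<int> S)
{
    std::vector<int> result;
    while (!S.empty()) {
        int i = *S.begin();              // an arbitrary element
        result.push_back(i);
        S.erase(S.begin());              // remove the chosen element itself
        for (int k : elementsToDelete(i))
            S.erase(k);                  // average O(1) per key, so O(|T|) per batch
    }
    return result;
}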
std::set_difference in the C++ library has a time complexity of O(|S|), so it is not good for your purposes. I'd advise you to use S.erase() to delete elements from S; since std::set is implemented as a balanced BST, each erase is O(log N), so your overall time complexity reduces to O(N log N).

Fast search algorithm with std::vector<std::string>

for (std::vector<std::string>::const_iterator it = serverList.begin(); it != serverList.end(); ++it)
{
    // found a match, store the location
    if (index == *it) // index is a string
    {
        indexResult.push_back(std::distance(serverList.begin(), it)); // std::vector<unsigned int>
    }
}
I have written the above code to look through a vector of strings and return another vector with the locations of any "hits".
Is there a way to do the same, but faster? (If I have 10,000 items in the container, it will take a while).
Please note that I have to check ALL of the items for matches and store its position in the container.
Bonus Kudos: Anyone know any way/links on how I can make the search so that it finds partial results (Example: search for "coolro" and store the location of variable "coolroomhere")
Use binary search after sorting the vector:
std::sort(serverList.begin(), serverList.end());
std::lower_bound(serverList.begin(), serverList.end(), valuetoFind) finds the first matching element.
Use std::equal_range if you want to find all matching elements.
Because lower_bound and equal_range do a binary search, they are logarithmic, compared to your search which is O(N).
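A sketch of that, reusing serverList, index and indexResult from the question; note that sorting rearranges the elements, so the stored positions refer to the sorted vector, not to the original order:

// requires <algorithm>
std::sort(serverList.begin(), serverList.end());
auto range = std::equal_range(serverList.begin(), serverList.end(), index);
for (auto it = range.first; it != range.second; ++it) {
    indexResult.push_back(it - serverList.begin()); // index into the *sorted* vector
}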
Basically, you're asking if it's possible to check all elements for a match, without checking all elements. If there is some sort of external meta-information (e.g. the data is sorted), it might be possible (e.g. using binary search). Otherwise, by its very nature, to check all elements, you have to check all elements.
If you're going to do many such searches on the list and the list doesn't vary, you might consider calculating a second table with a good hash code of the entries; again depending on the type of data being looked up, it could be more efficient to calculate the hash code of the index, and compare hash codes first, only comparing the strings if the hash codes were equal. Whether this is an improvement or not largely depends on the size of the table and the type of data in it. You might also be able to leverage knowledge about the data in the strings; if they are all URLs, for example, mostly starting with "http://www.", starting the comparison at the tenth character, and only coming back to compare the first 10 if all of the rest are equal, could end up with a big win.
With regards to finding substrings, you can use std::search for each element:
for (auto iter = serverList.begin(); iter != serverList.end(); ++iter) {
    if (std::search(iter->begin(), iter->end(),
                    index.begin(), index.end()) != iter->end()) {
        indexResult.push_back(iter - serverList.begin());
    }
}
Depending on the number of elements being searched and the lengths of the strings involved, it might be more efficient to use something like Boyer-Moore search, precompiling the search string into the necessary tables before entering the loop.
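For reference, C++17 later added std::boyer_moore_searcher to <functional>, which precompiles the pattern tables once and plugs straight into std::search; a sketch reusing the question's variables:

// C++17: build the Boyer-Moore tables for the pattern once, then reuse them for every element
std::boyer_moore_searcher searcher(index.begin(), index.end());
for (auto iter = serverList.begin(); iter != serverList.end(); ++iter) {
    if (std::search(iter->begin(), iter->end(), searcher) != iter->end()) {
        indexResult.push_back(iter - serverList.begin());
    }
}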
If you make the container a std::map instead of a std::vector, the underlying data structure used will be one that is optimized for doing keyword searches like this.
If you instead use a std::multimap, the member function equal_range() will return a pair of iterators covering every match in the map. That sounds to me like what you want.
A smart commenter below points out that if you don't actually store any more information than the name (the search key), then you should probably instead use a std::multiset.

How can I increase the performance in a map lookup with key type std::string?

I'm using a std::map (VC++ implementation) and it's a little slow for lookups via the map's find method.
The key type is std::string.
Can I increase the performance of this std::map lookup via a custom key compare override for the map? For example, maybe std::string < compare doesn't take into consideration a simple string::size() compare before comparing its data?
Any other ideas to speed up the compare?
In my situation the map will always contain < 15 elements, but it is being queried non stop and performance is critical. Maybe there is a better data structure that I can use that would be faster?
Update: The map contains file paths.
Update2: The map's elements are changing often.
First, turn off all the profiling and DEBUG switches. These can slow down STL immensely.
If that's not it, part of the problem may be that your strings are identical for the first 80-90% of the string. This isn't bad for map, necessarily, but it is for string comparisons. If this is the case, your search can take much longer.
For example, in this code find() will likely result in a couple of string compares; each compare returns after looking at just the first character, until "david" is reached, where the first three characters are checked. So at most 5 characters will be checked per call.
map<string,int> names;
names["larry"] = 1;
names["david"] = 2;
names["juanita"] = 3;
map<string,int>::iterator iter = names.find("daniel");
On the other hand, in the following code, find() will likely check 135+ characters:
map<string,int> names;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/wilma"] = 1;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/fred"] = 2;
names["/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/barney"] = 3;
map<string,int>::iterator iter = names.find("/usr/local/lib/fancy-pants/share/etc/doc/foobar/longpath/yadda/yadda/betty");
That's because the string comparisons have to search deeper to find a match since the beginning of each string is the same.
Using size() in your comparison for equality won't help you much here since your data set is so small. A std::map is kept sorted so its elements can be searched with a binary search. Each call to find should result in less than 5 string comparisons for a miss, and an average of 2 comparisons for a hit. But it does depend on your data. If most of your path strings are of different lengths, then a size check like Motti describes could help a lot.
Something to consider when thinking of alternative algorithms is how many "hits" you get. Are most of your find() calls returning end() or a hit? If most of your find()s return end() (misses) then you are searching the entire map every time (2logn string compares).
Hash_map is a good idea; it should cut your search time in about half for hits; more for misses.
A custom algorithm may be called for because of the nature of path strings, especially if your data set has common ancestry like in the above code.
Another thing to consider is how you get your search strings. If you are reusing them, it may help to encode them into something that is easier to compare. If you use them once and discard them, then this encoding step is probably too expensive.
I used something like a Huffman coding tree once (a long time ago) to optimize string searches. A binary string search tree like that may be more efficient in some cases, but it's pretty expensive for small sets like yours.
Finally, look into alternative std::map implementations. I've heard bad things about some of VC's stl code performance. The DEBUG library in particular is bad about checking you on every call. StlPort used to be a good alternative, but I haven't tried it in a few years. I've always loved Boost too.
As Even said, the operator used in a set is <, not ==.
If you don't care about the order of the strings in your set you can pass the set a custom comparator that performs better than the regular less-than.
For example if a lot of your strings have similar prefixes (but they vary in length) you can sort by string length first (since string::length() is constant time).
If you do so, beware of a common mistake:
struct comp {
    bool operator()(const std::string& lhs, const std::string& rhs) const
    {
        if (lhs.length() < rhs.length())
            return true;
        return lhs < rhs;
    }
};
This operator does not maintain a strict weak ordering, as it can treat two strings as each less than the other.
string a = "z";
string b = "aa";
Follow the logic and you'll see that comp(a, b) == true and comp(b, a) == true.
The correct implementation is:
struct comp {
    bool operator()(const std::string& lhs, const std::string& rhs) const
    {
        if (lhs.length() != rhs.length())
            return lhs.length() < rhs.length();
        return lhs < rhs;
    }
};
The first thing is to try using a hash_map if that's possible - you are right that the standard string compare doesn't first check for size (since it compares lexicographically), but writing your own map code is something you'd be better off avoiding. From your question it sounds like you do not need to iterate over ranges; in that case map doesn't have anything hash_map doesn't.
It also depends on what sort of keys you have in your map. Are they typically very long? Also what does "a little slow" mean? If you have not profiled the code it's quite possible that it's a different part taking time.
Update: Hmm, the bottleneck in your program is a map::find, but the map always has less than 15 elements. This makes me suspect that the profile was somehow misleading, because a find on a map this small should not be slow, at all. In fact, a map::find should be so fast, just the overhead of profiling could be more than the find call itself. I have to ask again, are you sure this is really the bottleneck in your program? You say the strings are paths, but you're not doing any sort of OS calls, file system access, disk access in this loop? Any of those should be orders of magnitude slower than a map::find on a small map. Really any way of getting a string should be slower than the map::find.
You can try to use a sorted vector (here's one sample), this may turn out to be faster (you'll have to profile it to make sure of-course).
Reasons to think it'll be faster:
Less memory allocations and deallocations (the vector will expand to the maximal size used and then reuse freed memory).
Binary search with random access should be faster than tree traversal (especially due to data locality).
Reasons to think it'll be slower:
Deletions and additions will mean moving strings around in memory; since string's swap is efficient and the size of the data set is small, this may not be an issue.
std::map's comparator isn't std::equal_to, it's std::less; I'm not sure of the best way to short-circuit a < compare so that it would be faster than the built-in one.
If there are always < 15 elems, perhaps you could use a key besides std::string?
Motti has a good solution. However, I'm pretty sure that for your < 15 elements a map isn't the right way because its overhead will always be greater than that of a simple lookup table with an appropriate hashing scheme. In your case, it might even be enough to hash by length alone, and if that still produces collisions, use a linear search through all entries of the same length.
To establish if I'm right, a benchmark is of course required but I'm quite sure of its outcome.
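A minimal sketch of such a length-keyed lookup table (the names and the mapped type int are made up for the example):

#include <string>
#include <utility>
#include <vector>

// Bucket the (few) entries by string length; a lookup compares only against
// entries of the same length, linearly.
struct PathTable {
    // index = string length; each bucket holds (key, value) pairs of that length
    std::vector<std::vector<std::pair<std::string, int>>> buckets;

    void insert(const std::string& key, int value) {
        if (key.size() >= buckets.size()) buckets.resize(key.size() + 1);
        buckets[key.size()].emplace_back(key, value);
    }

    const int* find(const std::string& key) const {
        if (key.size() >= buckets.size()) return nullptr;
        for (const auto& kv : buckets[key.size()])
            if (kv.first == key) return &kv.second;
        return nullptr;   // not found
    }
};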
You might consider pre-computing a hash for a string, and saving that in your map. Doing so gives the advantage of hash compares instead of string compares during the search through the std::map tree.
class HashedString
{
    unsigned m_hash;
    std::string m_string;

public:
    HashedString(const std::string& str)
        : m_hash(HashString(str))   // HashString is your string-hashing function of choice
        , m_string(str)
    {}
    // ... copy constructor and etc...
    unsigned GetHash() const { return m_hash; }
    const std::string& GetString() const { return m_string; }
};
This has the benefits of computing a hash of the string once, on construction. After this, you could implement a comparison function:
struct comp
{
    bool operator()(const HashedString& lhs, const HashedString& rhs) const
    {
        if (lhs.GetHash() < rhs.GetHash()) return true;
        if (lhs.GetHash() > rhs.GetHash()) return false;
        return lhs.GetString() < rhs.GetString();
    }
};
Since hashes are now computed on HashedString construction, they are stored that way in the std::map, and so the compare can happen very quickly (an integer compare) in an astronomically high percentage of the time, falling back on standard string compares when the hashes are equal.
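A possible usage sketch (the path literal is made up):

std::map<HashedString, int, comp> lookup;
lookup[HashedString("/some/path")] = 42;                   // hash computed once, here
auto it = lookup.find(HashedString("/some/path"));         // then mostly integer compares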
Maybe you could reverse the strings prior to using them as keys in the map? That could help if the first few letters of each string are identical.
Here are some things you can consider:
0) Are you sure this is where the performance bottleneck is? Like the results from Quantify, Cachegrind, gprof or something like that? Because lookups on such a small map should be fairly fast...
1) You can override the functor used to compare the keys in std::map<>, there is a second template parameter to do that. I doubt you can do much better than operator<, however.
2) Are the contents of the map changing a lot? If not, and given the very small size of your map, maybe using a sorted vector and binary search could yield better results (for example because you can exploit memory locality better).
3) Are the elements known at compile time? You could use a perfect hash function to improve lookup times if that is the case. Search for gperf on the web.
4) Do you have a lot of lookups that fail to find anything? If so, maybe comparing with the first and last elements in the collection may eliminate many mismatches quicker than a full search every time.
These have been suggested already, but in more detail:
5) Since you have so few strings, maybe you could use a different key. For example, are your keys all the same size? Can you use a class containing a fixed-length array of characters? Can you convert your strings to numbers or some data structure with only numbers?
Depending on the usage cases, there are some other techniques you can use. For example, we had an application that needed to keep up with over a million different file paths. The problem was that there were thousands of objects that needed to keep small maps of these file paths.
Since adding new file paths to the data set was an infrequent operation, when a path was added to the system, a master map was searched. If the path was not found, then it was added and a new sequenced integer (starting at 1) was returned. If the path already existed, then the previously assigned integer was returned. Then each map maintained by each object was converted from a string-based map to an integer map. Not only did this greatly improve performance, it reduced memory usage by not having so many duplicate copies of the strings.
Sure, this is a very specific optimization. But when it comes to performance improvements, you often find yourself having to make tailored solutions to specific problems.
And I hate strings :) Not only are they slow to compare, but they can really trash your CPU caches in high-performance software.
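A rough sketch of that interning scheme (class and member names are made up):

#include <string>
#include <unordered_map>

// The master map hands out a sequential id per distinct path;
// per-object maps then key on the int instead of the string.
class PathInterner {
    std::unordered_map<std::string, int> master;
    int nextId = 1;                               // ids start at 1, as described above
public:
    int idFor(const std::string& path) {
        auto it = master.find(path);
        if (it != master.end())
            return it->second;                    // path already known: return its id
        master.emplace(path, nextId);
        return nextId++;                          // new path: assign the next id
    }
};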
Try std::tr1::unordered_map (found in the header <tr1/unordered_map>). This is a hash map, and, while it doesn't maintain a sorted order of elements, will likely be far faster than a regular map.
If your compiler doesn't support TR1, get a newer version. MSVC and gcc both support TR1, and I believe the newest versions of most other compilers also have support. Unfortunately, a lot of the library reference sites haven't been updated, so TR1 remains a largely-unknown piece of technology.
I hope C++0x isn't the same way.
EDIT: Note that the default hashing method for tr1::unordered_map is tr1::hash, which needs to be specialized to work on a UDT, probably.
Where you have long common substrings, a trie might be a better data structure than a map or a hash_map. I said "might", though - a hash_map already only traverses the key once per lookup, so should be fairly fast. I won't discuss it further since others already have.
You could also consider a splay tree if some keys are more frequently looked up than others, but of course this makes the worst-case lookup worse than a balanced tree, and lookups are mutating operations, which may matter to you if you're using e.g. a reader-writer lock.
If you care about the performance of lookups more than modifications, you might do better with an AVL tree than a red-black, which I think is what STL implementations generally use for map. An AVL tree is typically better balanced and so will on average require fewer comparisons per lookup, but the difference is marginal.
Finding an implementation of these that you're happy with might be an issue. A search on the Boost main page suggests they have a splay and AVL tree but not a trie.
You mentioned in a comment that you never have a lookup that fails to find anything. So you could in theory skip the final comparison, which in a tree of 15 < 2^4 elements could give you something like a 20-25% speedup without doing anything else. In fact, maybe more than that, since equal strings are the slowest to compare. Whether it's worth writing your own container just for this optimisation is another question.
You might also consider locality of reference - I don't know whether you could avoid the occasional page miss by allocating the keys and the nodes out of a small heap. If you only need about 15 entries at a time, then assuming a file name limit below 256 bytes you could ensure that everything accessed during a lookup fits into a single 4k page (apart from the key being looked up, of course). It may be that comparing the strings is insignificant compared with a couple of page loads. However, if this is your bottleneck there must be an enormous number of lookups going on, so I'd guess that everything is reasonably close to the CPU. Worth checking, maybe.
Another thought: if you are using pessimistic locking on a structure where there's a lot of contention (you said in a comment the program is massively multi-threaded) then regardless of what the profiler tells you (what code the CPU cycles are spent in), it might be costing you more than you think by effectively limiting you to 1 core. Try a reader-writer lock?
hash_map is not standard, try using unordered_map available in tr1 (which is available in boost if your tool chain doesn't already have it).
For small numbers of strings you might be better off using vector, as map is typically implemented as a tree.
Why don't you use a hash table instead? boost::unordered_map could do. Or you can roll your own solution, and store the CRC of a string instead of the string itself. Or better yet, define a named constant for each string and use those for lookup, e.g.,
#define STRING_1 "STRING_1"