Fast string search? - c++

I have a vector of strings and have to check if each element in the vector is present in a given list of 5000 words.
Besides the mundane method of two nested loops, is there any faster way to do this in C++?

You should put the list of strings into an std::set. It's a data structure optimized for searching: checking whether a given element is in the set is much faster than iterating over all entries.
If you are already using C++11, you can also use std::unordered_set, which is even faster for lookups because it's implemented as a hash table.
If this is for school/university: be prepared to explain how these data structures manage to be faster. When your instructor asks why you used them, "some guys on the internet told me" is unlikely to earn you a sticker in the class book.

You could put the list of words in an std::unordered_set. Then, for each element in the vector, you just have to test if it is in the unordered_set in O(1). You would have an expected complexity of O(n) (expected, because a hash lookup is O(1) only on average; collisions can make it slower).
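For illustration, a minimal sketch of that approach (the function name and the use of std::string are just for the example):

#include <string>
#include <unordered_set>
#include <vector>

// Returns the elements of `words` that also appear in `dictionary`.
std::vector<std::string> findKnownWords(const std::vector<std::string>& words,
                                        const std::vector<std::string>& dictionary)
{
    // Build the set once: O(m) expected for m dictionary entries.
    std::unordered_set<std::string> dict(dictionary.begin(), dictionary.end());

    std::vector<std::string> found;
    for (const std::string& w : words)      // n lookups, O(1) expected each
        if (dict.count(w) != 0)
            found.push_back(w);
    return found;                           // O(n + m) expected overall
}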

You could sort the vector, then you can solve this with one "loop" (given that your dictionary is sorted too), which means O(n) not counting the cost of the sort.

So you have a vector of strings, with each string having one or more words, and you have a vector that's a dictionary, and you're supposed to determine which words in the vector of strings are also in the dictionary? The vector of strings is an annoyance, since you need to look at each word. I'd start by creating a new vector, splitting each string into words, and pushing each word into the new vector. Then sort the new vector and run it through the std::unique algorithm to eliminate duplicates. Then sort the dictionary. Then run both ranges through std::set_intersection to write the result.
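A sketch of that pipeline (whitespace splitting via std::istringstream is an assumption; adapt the tokenizing to your data):

#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> wordsInDictionary(const std::vector<std::string>& texts,
                                           std::vector<std::string> dictionary) // copied so we can sort it
{
    // 1. Split every string into whitespace-delimited words.
    std::vector<std::string> words;
    for (const std::string& text : texts) {
        std::istringstream iss(text);
        std::string word;
        while (iss >> word)
            words.push_back(word);
    }

    // 2. Sort and deduplicate the words.
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());

    // 3. Sort the dictionary, then intersect the two sorted ranges.
    std::sort(dictionary.begin(), dictionary.end());
    std::vector<std::string> result;
    std::set_intersection(words.begin(), words.end(),
                          dictionary.begin(), dictionary.end(),
                          std::back_inserter(result));
    return result;
}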

Related

slow std::map for large entries

We have 4,816,703 entries in this format.
1 abc
2 def
...
...
4816702 blah
4816703 blah_blah
Since the number of entries is quite big, I am worried that std::map would take a lot of time during insertion, since it needs to rebalance on each insertion.
Only inserting these entries into the map takes a lot of time. I am doing
map[first] = second;
Two questions:
1. Am I correct in using std::map for this kind of case?
2. Am I correct in inserting the above way, or should I use map.insert()?
I am sorry for not doing the experiments and writing the absolute numbers, but we want a general consensus on whether we are doing the right thing or not.
Also, the keys are not always consecutive.
P.S. Of course, later we will also need to access the map to get the values corresponding to the keys.
If you don’t need to insert into the map afterwards, you can construct an unsorted vector of your data, sort it according to the key, and then search using functions like std::equal_range.
It’s the same complexity as std::map, but far fewer allocations.
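A minimal sketch of that, assuming the data is (int key, std::string value) pairs as in the sample:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;

// Compare entries by key only.
bool keyLess(const Entry& a, const Entry& b) { return a.first < b.first; }

// Fill `entries` unsorted, then sort once by key.
void prepare(std::vector<Entry>& entries)
{
    std::sort(entries.begin(), entries.end(), keyLess);
}

// O(log n) lookups afterwards via std::equal_range.
const std::string* lookup(const std::vector<Entry>& entries, int key)
{
    auto range = std::equal_range(entries.begin(), entries.end(),
                                  Entry{key, std::string()}, keyLess);
    return range.first == range.second ? nullptr : &range.first->second;
}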
Use an std::unordered_map, which has much better insertion time complexity than std::map, as the reference mentions:
Complexity
Single element insertions:
Average case: constant.
Worst case: linear in container size.
Multiple elements insertion:
Average case: linear in the number of elements inserted.
Worst case: N*(size+1): number of elements inserted times the container size plus one.
May trigger a rehash (not included in the complexity above).
That's better than the logarithmic time complexity of std::map's insertion.
Note: std::map's insertion can enjoy "amortized constant if a hint is given and the position given is the optimal.". If that's the case for you, then use a map (if a vector is not applicable).
@n.m. provides a representative live demo.
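A sketch of the bulk load (assuming the rows are already parsed into pairs; reserve() pre-sizes the buckets so the rehashes mentioned above never trigger mid-load):

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::unordered_map<int, std::string>
buildMap(std::vector<std::pair<int, std::string>>& rows)
{
    std::unordered_map<int, std::string> m;
    m.reserve(rows.size());       // pre-size the buckets: no rehash during the load
    for (auto& row : rows)
        m.emplace(row.first, std::move(row.second)); // average O(1) per insertion
    return m;
}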

How to efficiently compare two vectors in C++, whose content can't be meaningfully sorted?

I read many posts but most talk about first sorting the two vectors and then comparing the elements. But in my case I can't sort the vectors in a meaningful way that can help the comparison. That means I will have to do an O(N^2) operation rather than O(N): I will have to compare each element from the first vector and try to find a unique match for it in the second vector. So I will have to match up all the elements, and if I can find a unique match for each then the vectors are equal.
Is there an efficient and simple way to do this? Will I have to code it myself?
Edit: by meaningful sorting I mean a way to sort them so that later you can compare them in linear fashion.
Thank you
If the elements can be hashed in some meaningful way, you can get expected O(n) performance by using a hashmap: Insert all elements from list A into the hashmap, and for each element in list B, check if it exists in the hashmap.
In C++, I believe that unordered_map is the standard hashmap implementation (though I haven't used it myself).
Put all elements of vector A into a hash table, where the element is the key, and the value is a counter of how many times you’ve added the element. [O(n) expected]
Iterate over vector B and decrease the counters in the hash table for each element. [O(n) expected]
Iterate over the hash table and check that each counter is 0. [O(n) expected]
= O(n) expected runtime.
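Those three steps, sketched with std::string elements (any type with a std::hash specialization works the same way). The equal-size check up front makes the final "all counters are zero" pass implicit:

#include <string>
#include <unordered_map>
#include <vector>

// True iff A and B contain the same elements with the same multiplicities.
bool sameElements(const std::vector<std::string>& A,
                  const std::vector<std::string>& B)
{
    if (A.size() != B.size())
        return false;
    std::unordered_map<std::string, int> counts;
    for (const auto& a : A)
        ++counts[a];                 // step 1: count occurrences in A
    for (const auto& b : B)
        if (--counts[b] < 0)         // step 2: b occurs more often in B than in A
            return false;
    // Equal sizes and no counter went negative => every counter is exactly
    // zero, which is step 3 folded into the size check.
    return true;
}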
No reason to use a map, since there are no values to associate with, just keys (the elements themselves). In this case you should look at using a set or an unordered_set. Put all elements from A into a new set, and then loop through B. For each element, if( set.find(element) == set.end() ) return false;
If you're set on sorting the array against some arbitrary value, you might want to look at C++11's std::hash, which returns a size_t. Using this you can write a comparator which hashes the two objects and compares them, which you can use with std::sort to perform an O(n log n) sort on it.
If you really can't sort the vectors you could try this. C++ gurus please feel free to point out flaws, obvious failures to exploit the STL and other libraries, failures to comprehend previous answers, etc, etc :) Apologies in advance as necessary.
Have a vector of ints, 0..n, called C. These ints are the indices of each element in vector B. For each element in vector A, compare it against the elements of B according to the B indices that are in C. If you find a match, remove that index from C. C is now one shorter. For the next A you're again searching B according to the indices in C, which, being one shorter, takes less time. And if you get lucky it will be quite quick.
Or you could build up a vector of B indices that you have already checked, so that you ignore those B's next time round the loop. Saves building a complete C first.
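A rough sketch of that index-elimination idea (still quadratic in the worst case; a std::list keeps the index removal cheap, and the names are just illustrative):

#include <cstddef>
#include <list>
#include <string>
#include <vector>

// True iff every element of A can be paired with a distinct, equal element of B.
bool matchedPairwise(const std::vector<std::string>& A,
                     const std::vector<std::string>& B)
{
    if (A.size() != B.size())
        return false;
    std::list<std::size_t> C;               // indices of B not yet matched
    for (std::size_t i = 0; i < B.size(); ++i)
        C.push_back(i);
    for (const auto& a : A) {
        bool matched = false;
        for (auto it = C.begin(); it != C.end(); ++it) {
            if (B[*it] == a) {
                C.erase(it);                // this B element is used up
                matched = true;
                break;
            }
        }
        if (!matched)
            return false;
    }
    return true;                            // every A found its own B
}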

Sorting a std::vector which has parts of it sorted (formed by concatenation of sorted vectors)

I have an std::vector as one of the inputs for an API I am exposing.
I know that the user of this API can send a huge vector, but that vector was formed by concatenation of sorted vectors. This means that the vector that I get is formed from a number of sorted vectors.
I need to sort this vector.
I would like to know which sorting algorithm is best suited.
I would prefer an in-place sorting algorithm like merge or quick sort, as I don't want to take up more memory (the vector is already a huge one).
Also, would it be better to change the API interface to accept N sorted vectors and then do the N-way merging myself? I don't want to go with this unless the saving is really huge.
Also, while doing the N-way merge I would want to do it in-place if possible.
So ideally I would prefer the approach where I use some ready-made sort algorithm on the big vector (as that would be simpler, I feel).
Take a look at std::inplace_merge. You can use the mergesort idea and merge each pair of runs, then the next pairs, and so on until only one run remains.
You can search the vector to find the concatenation points of the smaller vectors. Then, using these iterators, you can merge the runs one by one.
To find the concatenation points, look for the first element that violates the sorting criteria starting from the beginning, then from that position to the next, and so on.
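A sketch combining both suggestions (int elements for brevity; std::is_sorted_until finds each run boundary, and adjacent runs are then merged pairwise):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts a vector known to be a concatenation of sorted runs.
void sortConcatenated(std::vector<int>& v)
{
    using Iter = std::vector<int>::iterator;

    // 1. Collect the run boundaries: each boundary is the first element
    //    that violates the ordering of the run before it.
    std::vector<Iter> bounds{v.begin()};
    for (Iter it = v.begin(); it != v.end(); ) {
        it = std::is_sorted_until(it, v.end());
        bounds.push_back(it);
    }

    // 2. Merge adjacent runs pairwise until a single run remains.
    while (bounds.size() > 2) {
        std::vector<Iter> next{v.begin()};
        for (std::size_t i = 0; i + 2 < bounds.size(); i += 2) {
            std::inplace_merge(bounds[i], bounds[i + 1], bounds[i + 2]);
            next.push_back(bounds[i + 2]);
        }
        if (bounds.size() % 2 == 0)    // odd run count: carry the last run over
            next.push_back(bounds.back());
        bounds = std::move(next);
    }
}

Note that despite its name, std::inplace_merge may allocate a temporary buffer; if no buffer is available it falls back to a slower truly-in-place merge, so memory use stays bounded either way.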
Timsort looks to be just what you need -- it is an adaptive sort that looks for presorted runs in the data, and merges them as it goes. It has worst-case O(n log n) performance, and I expect it will do much better than that if the runs (presorted subarrays) are long.

Checking for duplicates in a vector [duplicate]

Possible Duplicate:
Determining if an unordered vector<T> has all unique elements
I have to check a vector for duplicates. What is the best way to approach this:
I take the first element, compare it against all other elements in the vector. Then take the next element and do the same and so on.
Is this the best way to do it, or is there a more efficient way to check for dups?
If the elements of your vector can be ordered, the solution is easy:
std::sort(myvec.begin(), myvec.end());
myvec.erase(std::unique(myvec.begin(), myvec.end()), myvec.end());
According to cppreference (https://en.cppreference.com/w/cpp/algorithm/unique), the elements are shifted around so that the values from myvec.begin() to the return value of std::unique are all unique. The elements after the iterator returned by std::unique are unspecified (useless in every use-case I've seen) so remove them from the std::vector<A> using std::vector<A>::erase.
Use a hash table in which you insert each element. Before you insert an element, check if it's already there. If it is, you have yourself a duplicate. This is O(n) on average, but the worst case is just as bad as your current method.
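A minimal sketch of that check (std::string elements just for illustration; insert() tells you whether the element was already present):

#include <string>
#include <unordered_set>
#include <vector>

// True iff `v` contains at least one duplicate element.
bool hasDuplicates(const std::vector<std::string>& v)
{
    std::unordered_set<std::string> seen;
    for (const auto& x : v)
        if (!seen.insert(x).second)    // insert() reports whether x was new
            return true;
    return false;                      // O(n) expected, O(n^2) worst case
}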
Alternatively, you can use a set to do the same thing in O(n log n) worst case. This is as good as the sorting solution, except it doesn't change the order of the elements (uses more memory though since you create a set).
Another way is to copy your vector to another vector, sort that and check the adjacent elements there. I'm not sure if this is faster than the set solution, but I think sorting adds less overhead than the balanced search trees a set uses so it should be faster in practice.
Of course, if you don't care about keeping the original order of the elements, just sort the initial vector.
If you don't care about an occasional false positive, you can use a Bloom Filter to detect probable duplicates in the collection. If false positives can't be accepted, take the values that fail the filter and run a second detection pass on those. The list of failed values should be fairly small, although they will need to be checked against the full input.
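For illustration only, a toy Bloom filter (the fixed 1 Mbit table, four probes, and the double-hashing mix are arbitrary choices here, not tuned values; a real filter would be sized from the expected element count):

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class BloomFilter {
    static const std::size_t kBits = 1 << 20;   // 1 Mbit table (arbitrary size)
    static const int kProbes = 4;               // arbitrary probe count

    std::bitset<kBits> bits_;

    // Derive the i-th probe index from two hashes (Kirsch-Mitzenmacher style).
    static std::size_t probe(const std::string& s, int i) {
        std::size_t h1 = std::hash<std::string>()(s);
        std::size_t h2 = h1 ^ ((h1 >> 17) * 0x9E3779B9u);
        return (h1 + i * h2) % kBits;
    }

public:
    void add(const std::string& s) {
        for (int i = 0; i < kProbes; ++i)
            bits_.set(probe(s, i));
    }
    // false: definitely never added; true: probably added (may be a false positive).
    bool maybeContains(const std::string& s) const {
        for (int i = 0; i < kProbes; ++i)
            if (!bits_.test(probe(s, i)))
                return false;
        return true;
    }
};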
Sorting and then comparing adjacent elements is the way to go. A sort takes O(n log n) comparisons, and then an additional n-1 to compare adjacent elements.
The scheme in the question would take (n^2)/2 comparisons.
You can also use binary_search.
Here are two good examples that will help you:
http://www.cplusplus.com/reference/algorithm/binary_search/
http://www.cplusplus.com/reference/algorithm/unique_copy/

Should I use an insertion sort or construct a heap to improve performance?

We have large (100,000+ elements) ordered vectors of structs (operator < overloaded to provide ordering):
std::vector<MyType> vectorMyTypes;
std::sort(vectorMyTypes.begin(), vectorMyTypes.end());
My problem is that we're seeing performance problems when adding new elements to these vectors while preserving sort order. At the moment we're doing something like:
for ( a very large set )
{
    vectorMyTypes.push_back(newType);
    std::sort(vectorMyTypes.begin(), vectorMyTypes.end());
    ...
    ValidateStuff(vectorMyTypes); // this method expects the vector to be ordered
}
This isn't exactly what our code looks like since I know this example could be optimised in different ways, however it gives you an idea of how performance could be a problem because I'm sorting after every push_back.
I think I essentially have two options to improve performance:
Use a (hand crafted?) insertion sort instead of std::sort to improve the sort performance (insertion sorts on a partially sorted vector are blindingly quick)
Create a heap by using std::make_heap and std::push_heap to maintain the sort order
My questions are:
Should I implement an insertion sort? Is there something in Boost that could help me here?
Should I consider using a heap? How would I do this?
Edit:
Thanks for all your responses. I understand that the example I gave was far from optimal and it doesn't fully represent what I have in my code right now. It was simply there to illustrate the performance bottleneck I was experiencing - perhaps that's why this question isn't seeing many up-votes :)
Many thanks to you Steve, it's often the simplest answers that are the best, and perhaps it was my over analysis of the problem that blinded me to perhaps the most obvious solution. I do like the neat method you outlined to insert directly into a pre-ordered vector.
As I've commented, I'm constrained to using vectors right now, so std::set, std::map, etc aren't an option.
Ordered insertion doesn't need boost:
vectorMyTypes.insert(
    std::upper_bound(vectorMyTypes.begin(), vectorMyTypes.end(), newType),
    newType);
upper_bound provides a valid insertion point provided that the vector is sorted to start with, so as long as you only ever insert elements in their correct place, you're done. I originally said lower_bound, but if the vector contains multiple equal elements, then upper_bound selects the insertion point which requires less work.
This does have to copy O(n) elements, but you say insertion sort is "blindingly fast", and this is faster. If it's not fast enough, you have to find a way to add items in batches and validate at the end, or else give up on contiguous storage and switch to a container which maintains order, such as set or multiset.
A heap does not maintain order in the underlying container, but is good for a priority queue or similar, because it makes removal of the maximum element fast. You say you want to maintain the vector in order, but if you never actually iterate over the whole collection in order then you might not need it to be fully ordered, and that's when a heap is useful.
According to item 23 of Meyers' Effective STL, you should use a sorted vector if your application uses its data structures in 3 phases. From the book, they are:
Setup. Create a new data structure by inserting lots of elements into it. During this phase, almost all operations are insertions and erasures. Lookups are rare or nonexistent.
Lookup. Consult the data structure to find specific pieces of information. During this phase, almost all operations are lookups. Insertions and erasures are rare or nonexistent. There are so many lookups that the performance of this phase makes the performance of the other phases incidental.
Reorganize. Modify the contents of the data structure, perhaps by erasing all the current data and inserting new data in its place. Behaviorally, this phase is equivalent to phase 1. Once this phase is completed, the application returns to phase 2.
If your use of your data structure resembles this, you should use a sorted vector and then use binary_search as mentioned. If not, a typical associative container should do it; that means a set, multiset, map or multimap, as those structures are ordered by default.
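A minimal sketch of the phase-2 lookups on a vector sorted in phase 1 (int elements for brevity; a MyType with operator< works the same way):

#include <algorithm>
#include <vector>

// Membership test on an already-sorted vector: O(log n).
bool contains(const std::vector<int>& sortedVec, int value)
{
    return std::binary_search(sortedVec.begin(), sortedVec.end(), value);
}

// When the matching element itself is needed, use lower_bound instead.
const int* find(const std::vector<int>& sortedVec, int value)
{
    auto it = std::lower_bound(sortedVec.begin(), sortedVec.end(), value);
    return (it != sortedVec.end() && *it == value) ? &*it : nullptr;
}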
Why not just use a binary search to find where to insert the new element? Then you will insert exactly into the required position.
If you need to insert a lot of elements into a sorted sequence, use std::merge, potentially sorting the new elements first:
void add( std::vector<Foo> & oldFoos, const std::vector<Foo> & newFoos ) {
    std::vector<Foo> merged;
    // precondition: oldFoos _and newFoos_ are sorted
    merged.reserve( oldFoos.size() + newFoos.size() ); // only for std::vector
    std::merge( oldFoos.begin(), oldFoos.end(),
                newFoos.begin(), newFoos.end(),
                std::back_inserter( merged ) );
    // apply std::unique, if wanted, here
    merged.erase( std::unique( merged.begin(), merged.end() ), merged.end() );
    oldFoos.swap( merged ); // commit changes
}
Using a binary search to find the insertion location isn't going to speed up the algorithm much because it will still be O(N) to do the insertion (consider inserting at the beginning of a vector - you have to move every element down one to create the space).
A tree (a heap, for example) will be O(log N) to insert, which is much better performance.
See http://www.sgi.com/tech/stl/priority_queue.html
Note that a tree will still have worst case O(N) performance for insert unless it is balanced, e.g. an AVL tree.
Why not use boost::multi_index?
NOTE: boost::multi_index does not provide memory contiguity, a property of std::vectors by which elements are stored adjacent to one another in a single block of memory.
There are a few things you need to do.
You may want to consider making use of reserve() to avoid excessive re-allocation of the entire vector. If you have knowledge of the size it will grow to, you may gain some performance by doing the reserve() calls yourself (rather than having the implementation do them automatically using the built-in heuristic).
Do a binary search to find the insertion location. Then resize and shift everything following the insertion point up by one to make room.
Consider: do you really want to use a vector? Perhaps a set or map are better.
The advantage of binary search over lower_bound is that if the insertion point is close to the end of the vector you don't have to pay the Θ(n) complexity.
If you want to insert an element into the "right" position, why do you plan on using sort? Find the position using lower_bound and insert using, well, the `insert' method of the vector. That will still be O(N) to insert a new item.
A heap is not going to help you, because a heap is not sorted. It allows you to get at the smallest element quickly, and then quickly remove it and get the next smallest element. However, the data in a heap is not stored in sorted order, so if you have algorithms that must iterate over the data in order, it will not help.
I am afraid your description skimmed over too much detail, but it seems like the container you have is just not right for the task. std::deque is much better suited for insertion in the middle, and you might also consider std::set. I suggest you explain why you need to keep the data sorted to get more helpful advice.
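To illustrate the heap point above, a tiny sketch (a std::priority_queue wraps the heap algorithms; elements come out in priority order only as you pop them, while the underlying storage is never fully sorted):

#include <functional>
#include <initializer_list>
#include <queue>
#include <vector>

void heapDemo()
{
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq; // min-heap
    for (int x : {5, 1, 4, 2, 3})
        pq.push(x);                  // O(log n) each
    while (!pq.empty()) {
        int smallest = pq.top();     // O(1) access to the minimum
        pq.pop();                    // O(log n) removal
        (void)smallest;              // yields 1, 2, 3, 4, 5 in this order
    }
}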
You might want to consider using a BTree or a Judy Trie.
You don't want to use contiguous memory for large collections, insertions should not take O(n) time;
You want to use at least binary insertion for single elements, multiple elements should be presorted so you can make the search boundaries smaller;
You do not want your data structure wasting memory, so nothing with left and right pointers for each data element.
As others have said, I'd probably have created a BTree out of a linked list instead of using a vector. Even if you got past the sorting issue, vectors have the problem of fully reallocating when they need to grow, assuming you don't know your maximum size beforehand.
If you are worried about a list allocating on different memory pages and causing cache related performance issues, preallocate your nodes in an array, (pool the objects) and insert these into the list.
You can add a value in your data type that denotes if it is allocated off the heap or from a pool. This way, if you detect that your pool runs out of room, you can start allocating off the heap and throw an assert or something to yourself so you know to bump up the pool size (or make this a command-line option to set).
Hope this helps, as I see you already have lots of great answers.