How to quickly search a large vector many times? - c++

I have a std::vector<std::string> that has 43,000 dictionary words. I have about 315,000 maybe-words and for each one I need to determine if it's a valid word or not. This takes a few seconds and I need to complete the task as fast as possible.
Any ideas on the best way to complete this? Currently I iterate through on each attempt:
for (const std::string& word : words) {
    if (std::find(dictionary.begin(), dictionary.end(), word) == dictionary.end()) {
        // The word is not in the dictionary
        return false;
    }
}
return true;
Is there a better way to iterate multiple times? I have a few hypotheses, such as:
Create a cache of invalid words, since the 315,000 list probably has 25% duplicates
Only compare with words of the same length
Is there a better way to do this? I'm interested in an algorithm or idea.

Is there a better way to iterate multiple times?
Yes. Convert the vector to another data structure that supports faster lookups. The standard library comes with std::set and std::unordered_set which are both likely to be faster than repeated linear search. Other data structures may be even more efficient.
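For example, a minimal sketch of the unordered_set approach (function and variable names are illustrative):
#include <string>
#include <unordered_set>
#include <vector>

bool all_valid(const std::vector<std::string>& dictionary,
               const std::vector<std::string>& words)
{
    // Build the set once: O(N) construction, then average O(1) per lookup.
    std::unordered_set<std::string> lookup(dictionary.begin(), dictionary.end());
    for (const std::string& word : words) {
        if (lookup.find(word) == lookup.end())
            return false; // word is not in the dictionary
    }
    return true;
}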
If your goal is to create a range of words or non-words in the maybe set, then another efficient approach would be to sort both vectors, and use std::(ranges::)set_intersection or std::(ranges::)set_difference.
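A sketch of that second approach, here using std::set_difference to collect the maybe-words that are not in the dictionary (names are illustrative):
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

std::vector<std::string> invalid_words(std::vector<std::string> dictionary,
                                       std::vector<std::string> words)
{
    std::sort(dictionary.begin(), dictionary.end());
    std::sort(words.begin(), words.end());
    std::vector<std::string> invalid;
    // Both inputs must be sorted; outputs elements of 'words' absent from 'dictionary'.
    std::set_difference(words.begin(), words.end(),
                        dictionary.begin(), dictionary.end(),
                        std::back_inserter(invalid));
    return invalid;
}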

Related

efficient extraction of elements from a C++ unordered set

In C++, suppose you have an unordered set (https://en.cppreference.com/w/cpp/container/unordered_set) of strings - is there a way to efficiently extract all strings from that set that meet a certain criterion (e.g. find all strings in the set that begin with the letter "a") using a method other than iterating through the entire set with a for loop and checking the first character of every single string?
For an arbitrary criterion this is not possible; see this answer for some more details.
Depending on your other needs, a sorted std::vector is highly likely to be the most efficient for the extraction part alone. Use algorithms like std::lower_bound to work with a sorted std::vector. In the end, your actual use case is what determines which container is best suited performance-wise - although std::vector comes close to a one-size-fits-all when considering performance (because of all the internal optimizations that contiguous storage enables).
That being said, in general it's advisable to use the container that seems best suited for the problem at hand and only do clever optimizations if there's an actual performance bottleneck.
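For illustration, a minimal sketch of the sorted-vector extraction for the prefix criterion (sample data is made up; it relies on "b" being the first string after every string that starts with "a"):
#include <algorithm>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> strings = {"ant", "apple", "bee", "cat"};
    std::sort(strings.begin(), strings.end());
    // All strings with prefix "a" form one contiguous range in the sorted vector.
    auto first = std::lower_bound(strings.begin(), strings.end(), std::string("a"));
    auto last  = std::lower_bound(strings.begin(), strings.end(), std::string("b"));
    std::vector<std::string> starting_with_a(first, last); // {"ant", "apple"}
}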
For the general case of an arbitrary criterion, you can't do better than iterating over every element.
Each container has specific criteria that it can do better with, e.g.
std::set<std::string> strings = /* something */;
auto first = strings.lower_bound("a"); // O(log(strings)), "a" is the least string that starts with 'a'
auto last = strings.lower_bound("b"); // O(log(strings)), "b" is the first string after those that start with 'a'
strings.erase(first, last); // O(log(strings) + distance(first, last)), all the strings starting with 'a' are removed
Here we remove elements starting with 'a', with a complexity of O(log(strings) + distance(first, last)), which is an O(alphabet) improvement over iterating all elements.
Or the more contrived
std::unordered_set<std::string> strings = /* something */;
auto bucket = strings.bucket("Any collision will do"); // O(1), the bucket this key falls into
// erase() does not accept local (bucket) iterators, so copy the bucket's contents first
std::vector<std::string> same_bucket(strings.begin(bucket), strings.end(bucket));
for (const auto& s : same_bucket)
    strings.erase(s); // O(1) on average per element
Here we remove the elements that land in the same bucket as "Any collision will do", with a complexity proportional to the size of that bucket.
Instead of an unordered set, consider adapting your data structure to something like a trie, which may be more useful in this case.
For more details please check: https://en.wikipedia.org/wiki/Trie
Implementation: https://www.geeksforgeeks.org/trie-insert-and-search/.
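For a rough idea, here is a minimal trie sketch, assuming words consist only of lowercase ASCII letters:
#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children; // one slot per letter 'a'..'z'
    bool is_word = false;
};

class Trie {
    TrieNode root_;
public:
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            auto& child = node->children[c - 'a'];
            if (!child)
                child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->is_word = true;
    }
    bool contains(const std::string& word) const {
        const TrieNode* node = &root_;
        for (char c : word) {
            node = node->children[c - 'a'].get();
            if (!node)
                return false; // no entry continues with this character
        }
        return node->is_word;
    }
};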
Depending on your needs, you might also consider other algorithms such as Aho-Corasick or suffix arrays. You may need to research which data structure fits best based on the amount of data you have, how often it must be recomputed, and how many queries you make.
I hope this helps.

In a low-latency application, is unordered_map ever a better solution over vector?

Is it advisable to use unordered_map in place of vector while developing a low-latency application?
I recently appeared for an interview with a financial company that works on low-latency trading applications. I was asked a question for which I answered using an unordered_map, which seemed pretty good efficiency-wise (O(n)) compared to if I had used a vector (O(n*n)). However, I know that it is advisable to use vector as much as possible and to avoid unordered_map in order to utilize the benefits of cache coherence. I just wanted to see if there is a better solution possible for this problem. The problem I was asked was to check if two strings are permutations of each other.
#include <string>
#include <unordered_map>

bool isPermutation(const std::string& first, const std::string& second) {
    if (first.length() != second.length())
        return false;
    std::unordered_map<char, int> charDict;
    for (char c : first) {
        charDict[c]++;
    }
    for (char c : second) {
        if (charDict[c] > 0) { // check the remaining count; count() only tests key presence
            --charDict[c];
        } else {
            return false;
        }
    }
    return true; // outside the loop, so every character gets checked
}
You can assume that both strings are of equal length, and the function should only return true if each character occurs exactly as many times in the second string as in the first.
Sure, but it really depends on the problem you are trying to solve. If the domain of your key space is unknown, it would be difficult to come up with a generic solution that is faster than unordered_map.
In this case, the domain of your key space is known: it is limited to ASCII characters. This is convenient because you can instantly convert from item (char) to vector index (std::size_t). So you could just use the value of each character as an index into a vector rather than hashing it for every lookup.
But in general, don't optimize prematurely. If unordered_map is the most natural solution, I would start there, then profile, and if you find that performance does not meet your requirements, look at reworking your solution. (This isn't always the best advice; if you know you are working on a highly critical piece of code, there are certain design decisions you will want to take into consideration from the beginning. Coming back and refactoring later may be much more difficult if you start with an incompatible design.)
Since there are only 256 possible keys, you can use a stack-allocated array of 256 counts, which will be faster than a vector or an unordered_map. If first.size() + second.size() < 128, then only initialize the counts to 0 for keys that actually occur; otherwise memset the whole array.
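A sketch of that idea, covering both initialization variants (the function name mirrors the question's code; 8-bit chars are assumed):
#include <cstring>
#include <string>

bool isPermutation(const std::string& first, const std::string& second)
{
    if (first.size() != second.size())
        return false;
    int counts[256]; // stack-allocated, one count per possible byte value
    if (first.size() + second.size() < 128) {
        // Few characters: zero only the slots that will actually be touched.
        for (unsigned char c : first)  counts[c] = 0;
        for (unsigned char c : second) counts[c] = 0;
    } else {
        std::memset(counts, 0, sizeof counts); // many characters: bulk-zero once
    }
    for (unsigned char c : first)
        ++counts[c];
    for (unsigned char c : second)
        if (--counts[c] < 0) // 'second' has more of c than 'first'
            return false;
    return true;
}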

Can I check for a word inside of an array/vector without looping through it?

I believe that I can say...
for (std::size_t i = 0; i < noun.size(); i++)
{
    if (word[0] == noun[i])
    {
        // do something
    }
}
The problem is that I need to do this many times with many words, and sometimes I'd like to check many words at once, like if (words[0] == noun[/*any*/] && words[1] == verb[/*any*/]). Is there any way to just tell it to check every element inside the array/vector against it?
Or maybe there is some container similar to an array that allows quick lookup?
I was only able to find a way to do it in python, but I never programmed in python so I'd like a way to do it in C++.
In python I found something like this:
a = [1,2,3,4,5]
b = 4
if b in a:
    print("True!")
else:
    print("False")
from here: Check if a value exists in an array in Cython
Unless there is some rule whereby the position of one element in the vector implies the position of another (if present), any algorithm for detecting presence must be O(N).
If the vector is sorted, for example, then a good positioning rule holds, and there are plenty of O(log(N)) algorithms out there: std::lower_bound is one such function.
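As a sketch, a presence test on a sorted vector (assuming the caller keeps it sorted):
#include <algorithm>
#include <string>
#include <vector>

bool contains(const std::vector<std::string>& sorted_words, const std::string& w)
{
    // O(log N) binary search; requires sorted_words to be sorted.
    return std::binary_search(sorted_words.begin(), sorted_words.end(), w);
}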
The vector container isn't optimized for lookups. What you probably need is a set. I recommend you check the answer to this question.
Considering your examples include verb and noun, you'll be trying to look up words in a (practically) fixed dictionary of sorts. There are many optimized containers for this purpose.
The Standard Library contains std::set<>, and a std::set<std::string> nouns will work. It has O(log N) lookup. Slightly more efficient, there's also std::unordered_set<>, which is still sufficient for your needs - you only need to know whether a word occurs in the list of nouns, not what the next noun would be.
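As a minimal sketch, the C++ equivalent of Python's "if b in a" using an unordered set (sample data is made up):
#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    std::unordered_set<std::string> nouns = {"cat", "dog", "tree"};
    std::string word = "dog";
    if (nouns.count(word)) // average O(1) membership test
        std::cout << "True!\n";
    else
        std::cout << "False\n";
}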
Outside the Standard Library, there are more data structures. A trie aka prefix tree shares the prefixes of its entries, which is space-efficient.

Storing a big text file into vectors and looping over it

I am making a university project and I had a question before I proceed. I have to import a 'dictionary.txt' into the program so it can correct another file's text.
Right now my .txt file is 20 MB with 2 million words inside. I am storing it in a vector as soon as the program starts. It takes 2 seconds to load all the words in.
My question: is this the right way to import so many words into a program? The logic behind it is that every word from the "essay" will be compared against the 2 million words in a loop until it is found, then break.
Before I make this possible I would like to know if this is a bad or good way to do it and why.
If you only want to store the words themselves, std::vector is a good choice, but you should be aware of the reallocation which takes place.
If the dictionary is meant to stay at the same size you should consider reserving memory for the vector.
basically you want to do something like this:
void from_file(std::vector<std::string>& content, std::string pathAndFilename = "")
{
    content.reserve(1000000); // roughly the number of words your dictionary has
    std::fstream readContent;
    if (pathAndFilename.empty())
    {
        readContent.open("file.txt", std::ios_base::in);
    }
    else
    {
        readContent.open(pathAndFilename.c_str(), std::ios_base::in);
    }
    std::string currentLine;
    while (std::getline(readContent, currentLine))
    {
        content.push_back(currentLine);
    }
    readContent.close(); // flushing is only meaningful for output streams
}
The problem with a vector is that searching takes really long. If you can ensure that every word in your dictionary.txt is unique, a set is the way to go, since a set is a tree-based container and searching it is a lot faster.
You can improve the performance of the vector search by sorting the vector, but you won't reach anything near the map/set's performance. Nevertheless, you will have to prepare the dictionary for use with a set.
Furthermore, you could split the dictionary into smaller subvectors (one for Aa, one for Bb, etc.). This will improve your performance even more, since you can start right at the first letter of the word you are trying to correct.
A map is not suited for this case, since a map is meant to store a key and a value, and the key has to be unique as well. Granted, you could use your word as the key and just use an int id as the value, but then you can use a set as well.
Overall a vector is no bad choice and the performance should be fine for this case (even if the set will do a better job), but if you want to achieve the best performance, a set is what you are looking for.
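As a minimal sketch of the set-based approach (file name and lookup word are made up):
#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main()
{
    std::set<std::string> dictionary;
    std::ifstream in("dictionary.txt"); // assumed file name
    std::string word;
    while (std::getline(in, word))
        dictionary.insert(word); // duplicates are silently ignored by the set

    // Each lookup is O(log N) rather than a linear scan over 2 million entries.
    std::cout << (dictionary.count("example") ? "valid" : "not a word") << '\n';
}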
Usually a vector should be the default container and if you run into performance issues, then you can consider using a different container.

Fastest way to search for a string

I have 300 strings to be stored and searched for, and most of them are identical in terms of characters and length. For example, I have strings "ABC1", "ABC2", "ABC3" and so on, and another set like sample1, sample2, sample3. So I am kind of confused as to how to store them - should I use an array or a hash table? My main concern is the time it takes to search for a string when I need to get one out of storage. If I use an array, I will have to do a string compare on every index to arrive at the right one. If I implement a hash table, I will have to take care of collisions (obviously) and implement chaining for storing strings that collide.
So I am looking for suggestions weighing the pros and cons of each, to arrive at the best practice.
Because the keys are short and tend to have a common prefix, you should consider radix data structures such as the Patricia trie and the ternary search tree (google these, you'll find lots of examples). Search time in these structures tends to be O(1) with respect to the number of entries and O(n) with respect to the length of the keys. Beware, however, that long strings can use lots of memory.
Search time is similar to that of hash maps if you don't consider collision resolution, which is not a problem in a radix search. Note that I am counting the time to compute the hash as part of the cost of a hash map; people tend to forget it.
One downside is that radix structures are not cache-friendly if your keys tend to show up in random order. As someone mentioned, if search time is really important, measure the performance of some alternative approaches.
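For a rough idea of the ternary search tree mentioned above, a minimal sketch (not a drop-in library implementation):
#include <memory>
#include <string>

struct TSTNode {
    char ch;
    bool is_word = false;
    std::unique_ptr<TSTNode> lo, eq, hi; // less-than, equal, greater-than branches
    explicit TSTNode(char c) : ch(c) {}
};

class TST {
    std::unique_ptr<TSTNode> root_;

    static void insert(std::unique_ptr<TSTNode>& node, const std::string& s, std::size_t i) {
        if (!node)
            node = std::make_unique<TSTNode>(s[i]);
        if (s[i] < node->ch)
            insert(node->lo, s, i);
        else if (s[i] > node->ch)
            insert(node->hi, s, i);
        else if (i + 1 < s.size())
            insert(node->eq, s, i + 1);
        else
            node->is_word = true;
    }
public:
    void insert(const std::string& s) { if (!s.empty()) insert(root_, s, 0); }
    bool contains(const std::string& s) const {
        if (s.empty())
            return false;
        const TSTNode* node = root_.get();
        std::size_t i = 0;
        while (node) {
            if (s[i] < node->ch)
                node = node->lo.get();
            else if (s[i] > node->ch)
                node = node->hi.get();
            else if (++i == s.size())
                return node->is_word;
            else
                node = node->eq.get();
        }
        return false;
    }
};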
This depends on how much your data changes. By that I mean: if you have 300 index strings which reference another string, how often do those 300 index strings change?
You can use a std::map for quick lookups, but the map will require more resources when it is first created (compared to an array, vector, or list).
I use maps mostly for dynamic lookup tables of one kind or another (for example: IP to socket).
So in your case it will look like this:
std::map<std::string, std::string> my_map;
my_map["ABC1"] = "sample1";
my_map["ABC2"] = "sample2";
// Note: operator[] inserts a default-constructed value when the key is missing;
// prefer find() when you only want to look a key up.
std::string looked_up = my_map["ABC1"];