efficient is_in(vector<string>& S, string P) function - c++

Given a set S of strings { S0, S1, S2, ..., Sn-1 } and a string P, how can I implement the function bool is_in( vector, string ) without doing the obvious loop?
Meaning that I don't want to do this:
bool is_in(vector<string>& S, string P)
{
for(int i=0; i<S.size(); i++)
if(P == S[i]) return true;
return false;
}
Ideally, I would like to have a sort of hash function, that I could compute a priori. Something like this:
bool is_in(vector<string>& S, string P)
{
someHashType h = hash( S );
if( someFunction( h, P ) ) return true;
return false;
}
Note:
S is a static vector (in my case, size 1000, unsorted).
P is an entry of a collection of strings I'm testing against S (also unsorted; in my case, 10M entries).
That's why this needs to be fast.
This is NOT a homework problem; it is part of a large-scale piece of software.

The problem with "I want this function to be faster" is that it nearly always involves SOME extra work somewhere else, and that may or may not mean the improvement is "worth it". It all depends on what your collection of strings is used for in the rest of the code. If it's just "is the word in this list, then do X" (e.g. a bad-word check for commit messages, which must not contain swear words or company names), then I would change the vector to an unordered_set. That has O(1) search time on average, and would look something like:
bool is_in(unordered_set<string>& S, string P)
{
auto it = S.find(P);
return (it != S.end());
}
But this will of course have consequences elsewhere, and if you rely on the list being a vector so that for example iterating over it is fast somewhere else in the code, this will probably slow that part down.
Edit: You have, I take it, profiled your code in a real use-case and found this particular function to take a significant amount of time. Otherwise, you'd be better off measuring that FIRST.
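If the rest of the code really needs S to stay a vector, one option is to build the set once and keep it next to the vector. The following is only a sketch of that idea; make_lookup and count_hits are names made up for the illustration, not part of the question's code.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Build the lookup table once, outside the hot loop (hypothetical helper).
std::unordered_set<std::string> make_lookup(const std::vector<std::string>& S)
{
    return std::unordered_set<std::string>(S.begin(), S.end());
}

// Membership test: constant time on average, no copy of P.
bool is_in(const std::unordered_set<std::string>& lookup, const std::string& P)
{
    return lookup.find(P) != lookup.end();
}

// Hypothetical driver: counts how many candidates occur in S.
std::size_t count_hits(const std::vector<std::string>& S,
                       const std::vector<std::string>& candidates)
{
    const auto lookup = make_lookup(S);   // paid once, not per candidate
    std::size_t hits = 0;
    for (const auto& P : candidates)
        if (is_in(lookup, P)) ++hits;
    return hits;
}
```

With 1000 strings in S and 10M candidates, the one-off cost of building the set should be negligible next to the lookups, but that is worth confirming with a profiler.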

Finally I found what I was looking for:
There is a data structure called a Bloom filter, which acts as a pre-computed hash of a collection of strings.
I developed my solution around the code located at C++Bloom Filter Library
The code would go like this:
insert all strings into the Bloom filter
check whether a given string is in the filter.
The advantage is that the strings themselves don't need to be stored in memory, as they would be in a set, unordered_set or any similar container.
In my particular case, I had a table with 10M strings (800MB).
The size of the filter in memory is around 20MB, and the search is considerably faster.
A Bloom filter is a probabilistic data structure, so it can report a few false positives, but the probability of that is quite low (controlled by a parameter).
Note that there are no false negatives.
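To make the idea concrete, here is a toy Bloom filter. It is a minimal sketch and NOT the API of the library mentioned above: the class name BloomSketch, the use of std::hash, and the salted second hash are all assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter: each string sets hash_count bits in a fixed bit array.
class BloomSketch {
    std::vector<bool> bits_;
    std::size_t hash_count_;

    static std::size_t hash1(const std::string& s)
    {
        return std::hash<std::string>{}(s);
    }
    // Assumption: hashing a salted copy is a good-enough second hash for a demo.
    // The result is forced odd so the probe step is never zero.
    static std::size_t hash2(const std::string& s)
    {
        return std::hash<std::string>{}(s + '#') | 1u;
    }
    // Bit position of probe i (double hashing).
    std::size_t position(std::size_t h1, std::size_t h2, std::size_t i) const
    {
        return (h1 + i * h2) % bits_.size();
    }

public:
    BloomSketch(std::size_t bit_count, std::size_t hash_count)
        : bits_(bit_count, false), hash_count_(hash_count)
    {
        assert(bit_count > 0 && hash_count > 0);
    }

    void insert(const std::string& s)
    {
        const std::size_t h1 = hash1(s), h2 = hash2(s);
        for (std::size_t i = 0; i < hash_count_; ++i)
            bits_[position(h1, h2, i)] = true;
    }

    // false: definitely not inserted; true: probably inserted.
    bool maybe_contains(const std::string& s) const
    {
        const std::size_t h1 = hash1(s), h2 = hash2(s);
        for (std::size_t i = 0; i < hash_count_; ++i)
            if (!bits_[position(h1, h2, i)]) return false;
        return true;
    }
};
```

Only the bit array is kept, never the strings, which is where the memory saving comes from. A query for a string that was never inserted can still return true if all of its bits happen to have been set by other strings; a larger bit array lowers that probability.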


C++ - how to buffer calc results faster than using unordered_map

I read a lot about unordered_map not being very fast but I wonder what's the best alternative to do this:
I need to buffer calculation results for a function of an integer argument. I don't know ahead of time what range or interval will be requested. Storing in a vector with maximal resolution would cost way too much memory.
So I'm using
unordered_map<unsigned long, pair<T, long>>
Where the key is the argument of the function to be computed, the first of the pair the result of the computation of type T, and the second of the pair a version information for that computation.
Only if the unordered_map does not contain the element or it contains it but the version is outdated, the computation is carried out and then added to the unordered_map. The lookup function looks something like this:
template<typename T> class BufferClass{
long MyVersion;
unordered_map<unsigned long, pair<T,long>> Buffer;
public:
BufferClass(): MyVersion{1} {};
T* GetIfValid(unsigned long index)
{
if (!Buffer.count(index)) return nullptr;
pair <T,long> &x{Buffer.at(index)};
if (x.second!=MyVersion) return nullptr;
return &x.first;
}
/* ...Functions to set elements...*/
};
As you can see, I combined element validity check and retrieval in one function, so that I only need one lookup for both.
The profiler shows most of the computation time is used up in the hash function __constrain_hash related to unordered_map.
What would be the fastest way to store and retrieve values like that? The list of stored indices is expected to be non-continuous (there will be a lot of "holes") and first and last index are also mostly unknown.
T will generally be a "small" data type (like double or complex).
Thanks!
Martin
In your code, one query can trigger two hash lookups: one in count() and another in at(). That is redundant; use unordered_map::find instead.
Sample code:
const auto iter = Buffer.find(index);
if(iter != Buffer.end() && iter->second.second == MyVersion) //Found an entry that is up to date
{
return &(iter->second.first); //iter->first is the key; the stored result is iter->second.first
}
else return nullptr;
In my opinion, unordered_map is slow, but not that slow; for 99.9% of uses it is fast enough. You may want to check whether you are calling this function more often than necessary. Switching to another, faster implementation is not free: it can bloat your code base, hurt your application's portability across host systems, and so on. If you think std::unordered_map is unreasonably slow, it is almost always because something is wrong elsewhere (either in your estimate or in your implementation).
BTW, another thing to mention: you said T is a small data type, right? Then return its value instead of a pointer to it; it is faster and safer.
One thing that strikes me as odd about your implementation is the following two lines:
if (!Buffer.count(index)) return nullptr;
pair <T,long> &x{Buffer.at(index)};
This code is checking if the key exists, then throws away the result and searches for the same key again with bounds checking to boot. I think you'll find searching once with std::unordered_map<unsigned long, std::pair<T, long>>::find and reusing the result to be preferable:
auto it = Buffer.find(index);
if (it == Buffer.end()) return nullptr;
auto& x = it->second; // *it is the key/value pair; the stored pair<T, long> is it->second
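Put together, a single-lookup version of the class might look like the sketch below. Set and Invalidate are hypothetical stand-ins for the "functions to set elements" that the question elides, and Set assumes T is default-constructible.

```cpp
#include <unordered_map>
#include <utility>

template <typename T>
class BufferClass {
    long MyVersion;
    std::unordered_map<unsigned long, std::pair<T, long>> Buffer;

public:
    BufferClass() : MyVersion{1} {}

    // Hypothetical setter: stores a result tagged with the current version.
    void Set(unsigned long index, const T& value)
    {
        Buffer[index] = std::make_pair(value, MyVersion);
    }

    // Hypothetical: bumping the version makes every stored entry stale.
    void Invalidate() { ++MyVersion; }

    // One hash lookup: find() yields either the entry or end().
    T* GetIfValid(unsigned long index)
    {
        const auto it = Buffer.find(index);
        if (it == Buffer.end()) return nullptr;              // never computed
        if (it->second.second != MyVersion) return nullptr;  // outdated
        return &it->second.first;
    }
};
```

The returned pointer stays valid across later insertions, because unordered_map does not move its elements on rehash, but it is invalidated if the entry is erased.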

Why is vector faster than unordered_map?

I am solving a problem on LeetCode, but nobody has yet been able to explain my issue.
The problem is as such:
Given an arbitrary ransom note string and another string containing letters from all the magazines, write a function that will return true if the ransom note can be constructed from the magazines ; otherwise, it will return false.
Each letter in the magazine string can only be used once in your ransom note.
Note:
You may assume that both strings contain only lowercase letters.
canConstruct("a", "b") -> false
canConstruct("aa", "ab") -> false
canConstruct("aa", "aab") -> true
My code (which takes 32ms):
class Solution {
public:
bool canConstruct(string ransomNote, string magazine) {
if(ransomNote.size() > magazine.size()) return false;
unordered_map<char, int> m;
for(int i = 0; i < magazine.size(); i++)
m[magazine[i]]++;
for(int i = 0; i < ransomNote.size(); i++)
{
if(m[ransomNote[i]] <= 0) return false;
m[ransomNote[i]]--;
}
return true;
}
};
The other code (I don't know why it is faster; it takes 19ms):
bool canConstruct(string ransomNote, string magazine) {
int lettersLeft = ransomNote.size(); // Remaining # of letters to be found in magazine
int arr[26] = {0};
for (int j = 0; j < ransomNote.size(); j++) {
arr[ransomNote[j] - 'a']++; // letter - 'a' gives a value of 0 - 25 for each lower case letter a-z
}
int i = 0;
while (i < magazine.size() && lettersLeft > 0) {
if (arr[magazine[i] - 'a'] > 0) {
arr[magazine[i] - 'a']--;
lettersLeft--;
}
i++;
}
if (lettersLeft == 0) {
return true;
} else {
return false;
}
}
Both of these have the same complexity and use the same kind of structure to solve the problem, but I don't understand why one takes almost twice as long as the other. The time to query a vector is O(1), but it's the same for an unordered_map. Same story with adding an entry/key to either of them.
Please, could someone explain why the run time varies so much?
The first thing to note is that, although the average time to query an unordered_map is constant, the worst case is not O(1). It actually rises to O(N), N denoting the size of the container.
Secondly, as a vector allocates one sequential block of memory, access to that memory is highly efficient and really is constant-time, even in the worst case (i.e. simple pointer arithmetic, as opposed to computing the result of a more complex hash function). Sequential memory also benefits from the various levels of caching (depending on the platform your code is running on), which may make code using a vector even faster compared to code using an unordered_map.
In essence, in terms of complexity, the worst-case performance of a vector is better than that of an unordered_map. On top of that, most hardware offers features such as caching which give a vector an even bigger edge (i.e. smaller constant factors in its O(1) operations).
Your second approach uses a plain C array, where accessing an element is a simple pointer dereference. That is not the case with unordered_map. There are two points to note:
First, accessing an element is not a simple pointer dereference. The container has to do other work to maintain its internal structure. An unordered_map is a hash table under the hood, and the bucket interface and reference-stability guarantees in the C++ standard in practice require separate chaining (each bucket refers to a list of nodes), which is far more work than a simple array access.
Second, O(1) access holds on average but not in the worst case.
For these reasons it is no wonder that the array version performs better than the unordered_map version, even though they have the same run-time complexity. This is another example where two pieces of code with the same run-time complexity perform differently.
You will see the benefit of unordered_map only when you have a large number of possible keys (as opposed to the fixed 26 here).
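To see the constant-factor difference on your own machine, you could time the two variants side by side. This is a rough sketch, assuming lowercase-only input as the problem states; time_us is a hypothetical helper, and a real measurement would need longer inputs, optimization enabled, and several runs.

```cpp
#include <array>
#include <chrono>
#include <string>
#include <unordered_map>

bool canConstructMap(const std::string& ransomNote, const std::string& magazine)
{
    std::unordered_map<char, int> counts;
    for (char c : magazine) ++counts[c];        // hash, bucket lookup, maybe a node allocation
    for (char c : ransomNote)
        if (--counts[c] < 0) return false;      // letter used more often than available
    return true;
}

bool canConstructArray(const std::string& ransomNote, const std::string& magazine)
{
    std::array<int, 26> counts{};               // zero-initialized, contiguous
    for (char c : magazine) ++counts[c - 'a'];  // plain index arithmetic
    for (char c : ransomNote)
        if (--counts[c - 'a'] < 0) return false;
    return true;
}

// Hypothetical helper: wall-clock time of `reps` calls, in microseconds.
template <typename F>
long long time_us(F f, const std::string& note, const std::string& mag, int reps)
{
    const auto start = std::chrono::steady_clock::now();
    bool sink = false;
    for (int i = 0; i < reps; ++i) sink ^= f(note, mag);  // keep the result "used"
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Both functions do the same number of container operations per character; any gap that time_us reports comes from the cost of each operation, not from the complexity class.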
"O(1)" means "constant time" -- that is, an algorithm that is (truly) O(1) will not get slower when there is more data (in this case, when there are more items in the map or array). It does not indicate how fast the algorithm runs -- it only indicates that it won't slow down if there is more data. Seeing different times for one O(1) algorithm vs. another does not mean that they are not O(1). You should not expect that one O(1) algorithm will run exactly as fast as another. But, if there is a difference, you should see the same difference if the maps/arrays have more data in them.

C++ code performance strings compare

I have an array of structs (arrBoards), each of which holds some integer values, a vector and a string.
I want to check whether a certain string in the struct is equal to an input parameter (string p1).
Which approach is faster: comparing the input string with the string of every element in the array, or first checking that string.length() of the current array element is greater than 0 and only then comparing the strings?
if (p1.length())
{
transform(p1.begin(), p1.end(), p1.begin(), ::tolower); //to lowercase
for (int i=0; i<arrSize; i++) //check if string element already exists
if ( rdPtr->arrBoards[i].sName == p1 )
{
/* some code */
break;
}
}
if (p1.length())
{
transform(p1.begin(), p1.end(), p1.begin(), ::tolower); //to lowercase
for (int i=0; i<arrSize; i++) //check if string element already exists
if ( rdPtr->arrBoards[i].sName.length() ) //check length of the string in the element of the array
if ( rdPtr->arrBoards[i].sName == p1 )
{
/* some code */
break;
}
}
I think the second idea is better because it doesn't need to compare the name every time, but I could be wrong, because the extra if could slow the code down.
Thanks for the answers
I'm sure the comparison operator (==) of the string class is already optimized enough. Just use it.
operator==(...) returns a bool based on a short-circuit comparison
return __x.size() == __n && _Traits::compare(__x.data(), __s, __n) == 0;
It checks the size of the strings before calling compare(), so, there is no need for further optimization.
Always remember one of the principles of Software Engineering: KISS :P
What you want to do is play percentages.
Since the strings are highly likely to be different, you want to find that out as quickly as possible.
You're comparing length first, but don't assume length is cheap to compute, compared to whatever else you're doing.
Here's the kind of thing I've done (in C):
if (a[0]==b[0] && strcmp(a, b)==0)
so if the leading characters are different, it never gets to the string compare.
If the dataset is such that the leading characters are likely to be different, it saves a lot of time.
(strcmp also has this kind of optimization, but you still have to pay the price of setting up the arguments and getting in and out of the function. We're talking about small numbers of cycles here.)
If you do something like that, then you may find the loop iteration overhead is costing a significant fraction of time.
If so, you might consider unrolling it.
(The compiler might unroll it for you, but I wouldn't depend on it.)
Comparing a number is faster than comparing a string. Try comparing the strings' lengths before comparing the strings themselves.
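For reference, the simplest form of the search, relying on the plain == comparison, could be sketched like this. Board, to_lower and find_board are hypothetical stand-ins for the question's arrBoards code, not the original definitions.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

struct Board {            // hypothetical stand-in for the struct in the question
    int id;
    std::string sName;
};

// Lowercases a copy; going through unsigned char avoids undefined behaviour
// when std::tolower is handed a negative char value.
std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns the index of the first board whose name equals lowercased p1, or -1.
int find_board(const std::vector<Board>& boards, const std::string& p1)
{
    if (p1.empty()) return -1;
    const std::string key = to_lower(p1);   // converted once, not per element
    const auto it = std::find_if(boards.begin(), boards.end(),
                                 [&key](const Board& b) { return b.sName == key; });
    return it == boards.end() ? -1 : static_cast<int>(it - boards.begin());
}
```

An empty sName can never equal a non-empty key, so the extra length check from the second variant in the question does not change the result here.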

Is it possible to process equality in a std::set comparator?

I am sorry if the title isn't very descriptive, I was having a hard time figuring out how to name this question. This is pretty much the first time I need to use a set, though I've been using maps forever.
I don't think it is possible, but I need to ask. I would like to perform a specific action on a struct when I add it to my std::set, but only if equality is true.
For example, I can use a list and then sort() and unique() the list. In my predicate, I can do as I wish, since I will get the result if 2 values are equal.
Here is a quick example of what my list predicate looks like:
bool markovWeightOrdering (unique_ptr<Word>& w1, unique_ptr<Word>& w2) {
if (w1->word_ == w2->word_) {
w1->weight_++;
return true;
}
return false;
}
Does anyone have an idea how to achieve a similar result, while using a std::set for the obvious gain in performance (and simplicity), since my container needs to be unique anyways? Thank you for any help or guidance, it is much appreciated.
Elements in a set are immutable, so you cannot modify them.
If you use a set of pointers (or similar), the pointed-to object may be modified (but take care not to change anything the ordering depends on). std::set::insert returns a pair of an iterator and a boolean telling whether the element was inserted, so you may do something like:
auto p = s.insert(make_unique<Word>("test"));
if (p.second == false) {
(*p.first)->weight_ += 1;
}
Manipulating a compare operator is likely a bad idea.
You might use a std::set with a predicate, instead:
struct LessWord
{
bool operator () (const std::unique_ptr<Word>& w1, const std::unique_ptr<Word>& w2) const {
return w1->word_ < w2->word_;
}
};
typedef std::set<std::unique_ptr<Word>, LessWord> word_set;
Then you test at insert whether the word already exists, and increment the weight:
word_set words;
std::unique_ptr<Word> word_ptr; // assumed to point at the Word to add
auto insert = words.insert(std::move(word_ptr)); // unique_ptr is move-only
if( ! insert.second)
++(insert.first->get()->weight_);
Note: Doing this is breaking const correctness, logically. A set element is immutable, but the unique_ptr enables modifications (even a fatal modification of key values).
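A self-contained version of this approach might look as follows. It is a sketch only: the Word layout (word_, weight_), the starting weight of 1, and the helpers add_word and weight_of are assumptions based on the names used in the question, and std::make_unique needs C++14.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>

struct Word {
    std::string word_;
    int weight_;
    explicit Word(std::string w) : word_(std::move(w)), weight_(1) {}
};

// Strict weak ordering on the key only; weight_ must never take part in it.
struct LessWord {
    bool operator()(const std::unique_ptr<Word>& a,
                    const std::unique_ptr<Word>& b) const
    {
        return a->word_ < b->word_;
    }
};

using word_set = std::set<std::unique_ptr<Word>, LessWord>;

// Inserts the word, or bumps the weight of the equivalent element already present.
void add_word(word_set& words, const std::string& text)
{
    auto result = words.insert(std::make_unique<Word>(text));
    if (!result.second)
        ++(*result.first)->weight_;   // the pointer is const, the pointee is not
}

// Hypothetical lookup helper: weight of a word, or 0 if it is absent.
int weight_of(const word_set& words, const std::string& text)
{
    const auto it = words.find(std::make_unique<Word>(text));
    return it == words.end() ? 0 : (*it)->weight_;
}
```

Changing weight_ through the set is safe because the comparator never looks at it. Changing word_ the same way would silently corrupt the set's ordering.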

C++ functor advantage - holding the state [duplicate]

I did study the whole idea of functors, unfortunately I can't understand the real advantage of functors over typical functions.
According to some academic scripts, functors can hold state unlike functions.
Can anyone elaborate on this one with some simple, easy to understand example ?
I really can't understand why typical, regular functions are not able to do the same. I'm really sorry for this kind of novice question.
As a really trivial demonstration, let's consider a Quick sort. We choose a value (usually known as the "pivot") and separate the input collection into those that compare less than the pivot, and those that compare greater than or equal to the pivot [1].
The standard library already has std::partition that can do the partitioning itself--separate a collection into those items that satisfy a specified condition, and those that don't. So, to do our partitioning, we just have to supply a suitable predicate.
In this case, we need a simple comparison something like: return x < pivot;. Passing the pivot value every time becomes difficult though. std::partition just passes a value from the collection and asks: "does this pass your test or not?" There's no way for you to tell std::partition what the current pivot value is, and have it pass that to your routine when it's invoked. That could be done, of course (e.g., many enumeration functions in Windows work this way), but it gets pretty clumsy.
When we invoke std::partition we've already chosen the pivot value. What we want is a way to...bind that value to one of the parameters that will be passed to the comparison function. One really ugly way to do that would be to "pass" it via a global variable:
int pivot;
bool pred(int x) { return x < pivot; }
void quick_sort(int *begin, int *end) {
if (end - begin < 2)
return;
pivot = choose_pivot(begin, end);
int *pos = std::partition(begin, end, pred);
quick_sort(begin, pos);
quick_sort(pos, end);
}
I really hope I don't have to point out that we'd rather not use a global for this if we can help it. One fairly easy way to avoid it is to create a function object. We pass the current pivot value when we create the object, and it stores that value as state in the object:
class pred {
int pivot;
public:
pred(int pivot) : pivot(pivot) {}
bool operator()(int x) { return x < pivot; }
};
void quick_sort(int *begin, int *end) {
if (end-begin < 2)
return;
int pivot = choose_pivot(begin, end);
int *pos = std::partition(begin, end, pred(pivot));
quick_sort(begin, pos);
quick_sort(pos, end);
}
This has added a tiny bit of extra code, but in exchange we've eliminated a global--a fairly reasonable exchange.
Of course, as of C++11 we can do quite a bit better still--the language added "lambda expressions" that can create a class pretty much like that for us. Using this, our code looks something like this:
void quick_sort(int *begin, int *end) {
if (end-begin < 2)
return;
int pivot = choose_pivot(begin, end);
auto pos = std::partition(begin, end, [pivot](int x) { return x < pivot; });
quick_sort(begin, pos);
quick_sort(pos, end);
}
This changes the syntax we use to specify the class/create the function object, but it's still pretty much the same basic idea as the preceding code: the compiler generates a class with a constructor and an operator(). The values we enclose in the square brackets are passed to the constructor, and the (int x) { return x < pivot; } basically becomes the body of the operator() for that class [2].
This makes code much easier to write and much easier to read--but it doesn't change the basic fact that we're creating an object, "capturing" some state in the constructor, and using an overloaded operator() for the comparison.
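For anyone who wants to actually run this, here is a complete sketch. Two things in it are my assumptions rather than part of the code above: choose_pivot simply takes the middle element, and the partitioning uses the three-group split mentioned at the end of this answer, because the two-group version can fail to make progress when the pivot happens to be the smallest element (the "less than" group is then empty and the second recursive call gets the whole range again).

```cpp
#include <algorithm>

// Hypothetical pivot choice: the value of the middle element.
int choose_pivot(const int* begin, const int* end)
{
    return begin[(end - begin) / 2];
}

void quick_sort(int* begin, int* end)
{
    if (end - begin < 2)
        return;
    const int pivot = choose_pivot(begin, end);
    // Both lambdas capture the pivot as state, exactly like the pred class does.
    // After these two calls:
    //   [begin, lower) < pivot, [lower, upper) == pivot, [upper, end) > pivot
    int* lower = std::partition(begin, end, [pivot](int x) { return x < pivot; });
    int* upper = std::partition(lower, end, [pivot](int x) { return x == pivot; });
    quick_sort(begin, lower);
    quick_sort(upper, end);
}
```

Because the pivot value itself is in the range, the middle group is never empty, so each recursive call works on a strictly smaller range.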
Of course, a comparison just happens to be what we need for things like sorting. It is a common use of lambda expressions and function objects more generally, but we're certainly not restricted to it. Just for another example, let's consider "normalizing" a collection of doubles. We want to find the largest one, then divide every value in the collection by that, so each item is in the range 0.0 to 1.0, but all retaining the same ratios to each other as they previously had:
double largest = * std::max_element(begin, end);
std::for_each(begin, end, [largest](double &d) { d /= largest; }); // take d by reference so the element itself is modified
Here again we have pretty much the same pattern: create a function object that stores some relevant state, then repeatedly apply that function object's operator() to do the real work.
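The same normalization written with a hand-rolled function object makes the stored state explicit. This is a sketch; DivideBy and normalize are names invented for the example, and it assumes a non-empty collection whose largest value is positive.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Function object whose state is the divisor it was constructed with.
class DivideBy {
    double divisor_;
public:
    explicit DivideBy(double divisor) : divisor_(divisor) {}
    double operator()(double d) const { return d / divisor_; }
};

// Scales the values in place so that the largest one becomes 1.0.
void normalize(std::vector<double>& values)
{
    assert(!values.empty());
    const double largest = *std::max_element(values.begin(), values.end());
    assert(largest > 0.0);
    // std::transform writes each returned value back into the collection.
    std::transform(values.begin(), values.end(), values.begin(), DivideBy(largest));
}
```

std::transform is used here because it stores what the function object returns; std::for_each ignores the return value, so with it the function object has to modify its argument through a reference instead.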
[1] We could separate into less than or equal to, and greater than instead. Or we could create three groups: less than, equal to, greater than. The latter can improve efficiency in the presence of many duplicates, but for the moment we really don't care.
[2] There's a lot more to know about lambda expressions than just this--I'm simplifying some things, and completely ignoring others that we don't care about at the moment.