Fast search algorithm with std::vector<std::string> - c++

for (std::vector<std::string>::const_iterator it = serverList.begin(); it != serverList.end(); ++it)
{
    // found a match, store the location
    if (index == *it) // index is a string
    {
        indexResult.push_back(std::distance(serverList.begin(), it)); // indexResult is a std::vector<unsigned int>
    }
}
I have written the above code to look through a vector of strings and return another vector with the locations of any "hits".
Is there a way to do the same, but faster? (With 10,000 items in the container, it takes a while.)
Please note that I have to check ALL of the items for matches and store their positions in the container.
Bonus Kudos: Anyone know any way/links on how I can make the search so that it finds partial results (Example: search for "coolro" and store the location of variable "coolroomhere")

Use a binary search after sorting the vector:
std::sort(serverList.begin(), serverList.end());
std::lower_bound(serverList.begin(), serverList.end(), valueToFind); finds the first matching element.
Use std::equal_range if you want to find all matching elements.
Because lower_bound and equal_range are binary searches, they are logarithmic, compared to your search, which is O(N).

Basically, you're asking if it's possible to check all elements for a
match, without checking all elements. If there is some sort of external
meta-information (e.g. the data is sorted), it might be possible (e.g.
using binary search). Otherwise, by its very nature, to check all
elements, you have to check all elements.
If you're going to do many such searches on the list and the list
doesn't vary, you might consider calculating a second table with a good
hash code of the entries; again depending on the type of data being
looked up, it could be more efficient to calculate the hash code of the
index, and compare hash codes first, only comparing the strings if the
hash codes were equal. Whether this is an improvement or not largely
depends on the size of the table and the type of data in it. You might
also be able to leverage knowledge about the data in the strings; if
they are all URLs, for example, mostly starting with "http://www.",
starting the comparison at the tenth character, and only coming back to
compare the first 10 if all of the rest are equal, could end up with a big
win.
With regards to finding substrings, you can use std::search for each
element:
for ( auto iter = serverList.begin();
      iter != serverList.end();
      ++iter ) {
    if ( std::search( iter->begin(), iter->end(),
                      index.begin(), index.end() ) != iter->end() ) {
        indexResult.push_back( iter - serverList.begin() );
    }
}
Depending on the number of elements being searched and the lengths of
the strings involved, it might be more efficient to use something like a
Boyer-Moore (BM) search, precompiling the search string into the necessary
tables before entering the loop.

If you make the container a std::map instead of a std::vector, the underlying data structure used will be one that is optimized for doing keyword searches like this.
If you instead use a std::multimap, the member function equal_range() will return a pair of iterators covering every match in the map. That sounds to me like what you want.
A smart commenter below points out that if you don't actually store any more information than the name (the search key), then you should probably instead use a std::multiset.

Related

What is the fastest way to check if value is exists in std::map?

What is the fastest way to check if a value exists in a std::map<int, int>? Should I use an unordered map? In this task I can't use any libraries other than std.
Right now, I don't know any way to do this without checking all values.
The fastest way is to not do it. Don't look for values in maps, look for keys in maps.
If you need to search for a value, use another data structure (or a separate map).
The only way to search for a value in a map is linear (O(N)), and due to the caching overhead of iterating over the map's node-based structure, it's going to be even slower than iterating over e.g. a vector.
Unless you have very big data sets (over 100,000 or so), access times into maps should not bother you, since they are going to be minuscule in either case, especially as you already have an int key.
To check whether a value exists in a map, just use iterators or std::find_if. It doesn't really matter which you choose; it's going to be linear (O(N)) either way.
// both examples assume C++20 (auto function parameters are C++20, not C++17)
// simple loop variant
bool is_value_exists(auto const &map, int val) {
    for (auto const &[key, value] : map) {
        if (value == val) return true;
    }
    return false;
}
// find_if version
#include <algorithm>
bool is_value_exists2(auto const &map, int val) {
    return std::find_if(
        map.begin(),
        map.end(),
        [val](auto const &kv) { return kv.second == val; }
    ) != map.end();
}
You can find an element using the find method. Maps are usually implemented as red-black trees, which have logarithmic search complexity.
If you need to search by value, you could create a reverse map: a map whose keys are the values of the initial map and whose values are the corresponding keys. You can then search for the value by key in the second map, which will yield the original key. However, building the inverted map takes resources, both time and storage, so you should only do it if you are going to search multiple times.
Regarding the values, a map is not really different from a list or a vector. So exhaustive (linear) search is the fastest way.
Whenever you insert a key-value pair into the std::map instance, also store the iterator to that element in a std::vector of map iterators, indexed by the value:
myVec[value] = myMap.find(key); // at time of inserting a new key
This way, you can use the value as an index into the vector and directly access the map element through that iterator, after checking it against myMap.end().
The biggest downside is the extra book-keeping required when you remove keys from the map. If removals are frequent enough, you may want to use another map in place of the vector, because if the expected value range is too big (like all of 32 bits), a vector is not memory-efficient.
You can also use the map-iteration search as the backing store of a direct-mapped cache (which works fastest for integer keys, i.e. the values here). All cache hits would be served at the cost of just a bitwise & operation with some value like 8191 or 4095 (O(1)). All cache misses would still require a full iteration of the map elements, which is slow (O(N)).
So, if the cache-hit ratio is close to 100%, it can approach O(1); otherwise it will be O(N), which is slow.

efficient extraction of elements from a C++ unordered set

In C++ suppose you have an unordered set (https://en.cppreference.com/w/cpp/container/unordered_set) of strings - is there a way to efficiently extract all strings from that set that meet a certain criteria (e.g. find all strings in the set that begin with letter "a") using a method other than iterating through the entire set with a for loop and checking the first character of every single string?
For arbitrary criteria this is not possible; see this answer for some more details.
Depending on your other needs, a sorted std::vector is highly likely to be the most efficient for the extraction part alone. Use algorithms like std::lower_bound to work with a sorted std::vector. In the end, your actual use case is what determines which container is best suited performance-wise overall, although std::vector comes close to a one-size-fits-all when considering performance (because of all the internal optimizations of contiguous storage).
That being said, in general it's advisable to use the container that seems best suited for the problem at hand and only do clever optimizations if there's an actual performance bottleneck.
For the general case of any criteria, you can't do better than iterating over every element.
Each container has specific criteria that it can do better with, e.g.
std::set<std::string> strings = /* something */;
auto first = strings.lower_bound("a"); // O(log(strings)), "a" is the least string that starts with 'a'
auto last = strings.lower_bound("b"); // O(log(strings)), "b" is the first string after those that start with 'a'
strings.erase(first, last); // O(log(strings) + distance(first, last)), all the strings starting with 'a' are removed
Here we remove elements starting with 'a', with a complexity of O(log(strings) + distance(first, last)) which is a O(alphabet) improvement over iterating all elements.
Or the more contrived
std::unordered_set<std::string> strings = /* something */;
auto bucket = strings.bucket("Any collision will do"); // O(1)
// Local iterators can't be passed to erase, so collect the bucket's
// elements first, then erase them by value.
std::vector<std::string> victims(strings.begin(bucket), strings.end(bucket));
for (auto const & s : victims)
    strings.erase(s);
Here we remove the elements that land in the same bucket as "Any collision will do" (i.e. whose hashes collide with it), with a complexity proportional to the bucket size.
Instead of using an unordered set, adapt your data structure to something like a trie, which might be more useful in this case.
For more details please check: https://en.wikipedia.org/wiki/Trie
Implementation: https://www.geeksforgeeks.org/trie-insert-and-search/.
Depending on your needs you might think of some other algorithms, like Aho-Corasick or suffix arrays. You might need to do some research on the data structure you need, based on the amount of data that you have, the recomputation that you need, and the number of queries that you do.
I hope this helps.

How to find a pair from set using only the second value?

I want to find a pair using only the second element; the first element could be anything. Also, all of the second elements are unique.
Here is code using std::find_if, but this takes linear time:
set<pair<int,int>> s;
int value = /* the second element to search for */;
s.insert(make_pair(3,1));
s.insert(make_pair(1,0));
auto it = find_if(s.begin(), s.end(), [value](const pair<int,int>& p){ return p.second == value; });
if (it == s.end())
    s.insert(make_pair(1, value));
else {
    int v = it->first;
    s.erase(it);
    s.insert(make_pair(v+1, value));
}
I want to use std::find function of set so that it takes logarithmic time.
There is no data structure that does exactly what you want.
However databases do something similar. They call it Index Skip Scanning. To implement the same without starting from scratch, you could implement a std::map from the first thing in the pair to a std::map of the second thing in the pair. And now a lookup of a single pair is logarithmic in time, lookup of the things with a given first entry is also logarithmic in time (though iterating through those things may be slower), and lookup of the things with the second entry is linear in the number of first values you have, times logarithmic in the number of second values that you have.
Do note that this is only worthwhile if you have a very large number of pairs, and relatively few values for the first entry in the pair. And furthermore you are constantly changing data (so maintaining multiple indexes is a lot of overhead), and only rarely doing a lookup on the second value in the pair. Break any of those assumptions and the overhead is not worth it.
That is a rather specific set of assumptions to satisfy. It comes up far more often for databases than C++ programmers. Which is why most databases support the operation, and the standard library of C++ does not.

C++ : How can I push values into a vector only if that value isn't stored in the vector already?

If for example, I was just pushing 200 random numbers into a vector, how can I ensure that duplicates will not be pushed in?
It seems like a map could be a helpful structure instead of a vector.
If you must stick with a vector, then you need to divide your task into two parts: duplicate detection and then insertion. Alternatively, you could insert into a map and then read that out into the vector.
In either case the problem is, intrinsically, two problems. Good luck!
You need to check if the vector already contains the value, and push only if not, i.e.
std::vector<int>::iterator it;
it = find(myvector.begin(), myvector.end(), newvalue);
if (it == myvector.end()) {
    myvector.push_back(newvalue); // newvalue is not found, so insert it
}
But this could be costly, since find checks every value inside myvector.
Instead using set or map data structure can be more efficient.
If the random numbers are integers within a relatively small range, you can try this:
You want N unique random numbers from M possible values, where M >= N.
Create a container containing one of each of the possible values.
Shuffle the container.
Take the first N from the container and insert them into your vector.
If M is much bigger than N (like between 0 and RAND_MAX), then you should instead check for repetition before each insert and repeat until your container size reaches 200. If using a vector is not mandatory, I suggest std::set instead, since it ensures unique values by default.

Efficient frequency counter

I have 15,000,000 std::vectors of 6 integers each.
Those 15M vectors contain duplicates.
Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of unique sequences with their associated counts. (A sequence that is only present once would have a count of 1.)
Is there an algorithm in the C++ standard library (C++11 is fine) that does this in one shot?
Windows, 4GB RAM, 30+GB hdd
There is no such algorithm in the standard library which does exactly this, however it's very easy with a single loop and by choosing the proper data structure.
For this you want to use std::unordered_map, which is typically a hash map. It has expected constant time per access (insert and look-up) and is thus the first choice for huge data sets.
The following access-and-increment trick will automatically insert a new entry in the counter map if it's not yet there; it then increments and writes back the count.
typedef std::vector<int> VectorType; // Please consider std::array<int,6>!
// Note: std::hash has no specialization for std::vector<int>, so a custom
// hash functor (e.g. one combining std::hash<int> over the elements) must
// be supplied as the third template argument:
std::unordered_map<VectorType, int, VectorHash> counters;
for (VectorType const & vec : vectors) {
    counters[vec]++;
}
For further processing, you most probably want to sort the entries by the number of occurrence. For this, either write them out in a vector of pairs (which encapsulates the number vector and the occurrence count), or in an (ordered) map which has key and value swapped, so it's automatically ordered by the counter.
In order to reduce the memory footprint of this solution, try this:
If you don't need to get the keys back from this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use std::size_t as the key type, a hand-written identity function as the internal hash function (so the precomputed hash is used as-is), and access it with a manual call to a hash functor over VectorType (again, the standard library provides no std::hash for std::vector<int>, so you have to write one).
struct IdentityHash {
    std::size_t operator()(std::size_t h) const { return h; }
};
std::unordered_map<std::size_t, int, IdentityHash> counters;
VectorHash hashFunc; // hand-written hash over the vector's elements
for (VectorType const & vec : vectors) {
    counters[hashFunc(vec)]++;
}
This reduces memory but requires an additional effort to interpret the results, as you have to loop over the original data structure a second time in order to find the original vectors (then look-up them in your hash map by hashing them again).
Yes: first std::sort the list (std::vector's operator< uses lexicographic ordering, with the first element the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again, but with an inverted comparator, to find the first non-duplicate.
Alternatively, you could use std::unique with a custom comparator that flags when a duplicate is found and maintains a count through the successive calls. This also gives you a deduplicated list.
The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates: you don't have to copy the entire original dataset or add a seldom-used field for the duplicate count.
You could convert each vector to a string one by one, like "4,3,2,0,4,23".
Then add them into a new string vector, checking for existence with the find() function.
If you need the original vectors, convert the string vector back into vectors of integer sequences.
If you do not, delete the duplicated elements while building the string vector.