C++ Search Performance

C++ Search Performance - c++

What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text which will be obtained from miscellaneous sources. That is, this file's contents will change each time the program is executed (~0.5MB). Essentially, I want to be able to paste some text into a text file and see which names from my list are found. Kind of like the find function (CTR + F) but with 70,000 keywords.
In any case, what I have thus far is:
int main()
{
ifstream namesfile("names.txt"); //names list
ifstream miscfile("misc.txt"); //misc text
vector<string> vecnames; //vector to hold names
vector<string> vecmisc; //vector to hold misc text
size_t found;
string s;
string t;
while (getline(namesfile,s))
veccomp.push_back(s);
while (getline(miscfile,t))
vectenk.push_back(t);
//outer loop iterates through names list
for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
//inner loop iterates through the lines of the mist text file
for (vector<string>::size_type j = 0;j != vecmisc.size(); ++j) {
found=vecmisc[j].find(vecnames[i]);
if (found!=string::npos) {
cout << vecnames[i] << endl;
break;
}
}
}
cout << "SEARCH COMPLETE";
//to keep console application from exiting
getchar();
return 0;
}
Now this works great as far as extracting the data I need, however, it is terribly slow and obviously inefficient since each name requires that I potentially search the entire file again which gives (75000 x # of lines in misc text file) iterations. If anyone could help, I would certainly appreciate it. Some sample code is most welcomed. Additionally, I'm using Dev C++ if that makes any difference. Thanks.

Use a std::hash_set. Insert all your keywords into the set, then traverse the large document and each time you come to a word, test whether the set includes that word.

Using a vector, the best-case search time you're going to get is O(log N) complexity using a binary search algorithm, and that's only going to work for a sorted list. If you include the time it will take to make sorted insertions into a list, the final amortized complexity for a sorted linear container (arrays, lists), as well as non-linear containers such as binary-search trees, O(N log N). So that basically means that if you add more elements to the list, the time it will take to both add those elements to the list, as well as find them later on, will increase at a rate a little faster than the linear growth rate of the list (i.e., if you double the size of the list, it will take a little over double the time to sort the list, and then any searches on the list will be pretty quick ... in order to double the search time, the list would have to grow by the square of the existing amount of elements).
A good hash-table implementation on the other-hand (such as std::unordered_map) along with a good hash-algorithm that avoids too many collisions, has an amortized complexity of O(1) ... that means overall there's a constant look-up time for any given element, no matter how many elements there are, making searches very fast. The main penalty over a linear list or binary-search tree for the hash-table is the actual memory footprint of the hash table. A good hash-table, in order to avoid too many collisions, will want to have a size equal to some large prime number that is at least greater than 2*N, where N is the total number of elements you plan on storing in the array. But the "wasted space" is the trade-off for efficient and extremely fast look-ups.

While a map of any kind is the simplest solution, Scott Myers makes a good case for sorted vector and binary_search from algorithm (in Effective STL).
Using a sorted vector, your code would look something like
#include <algorithm>
...
int vecsize = vecnames.size();
sort(vecnames.begin(), vecnames.begin() + vecsize );
for (vector<string>::size_type j = 0;j != vecmisc.size(); ++j)
{
bool found= binary_search(vecnames.begin(), vecnames.begin()+vecsize,vecmisc[j]);
if (found) std::cout << vecmisc[j] << std::endl;
}
The advantages of using a sorted vector and binary_search are
1) There is no tree to traverse, the binary_search begins at (end-start)/2, and keeps dividing by 2. It will take at most log(n) to search the range.
2) There is no key,value pair. You get the simplicity of a vector without the overhead of a map.
3) The vector's elements are in a contiguous range (which is why you should use reserve before populating the vector, inserts are faster), and so searching through the vector's elements rarely crosses page boundaries (slightly faster).
4) It's cool.

Related

Confusion about time complexity with hash maps

On leetcode I find it is common to "ignore" the worst-case time complexity involving hash maps. I thought in software interviews that it was standard to assume "worst case" as they often do. Below is my solution to a simple problem. The problem is to find the first non repeating char in a string. I understand that hash maps are on average O(1) lookup.. but when iterating over the string, and looking up the hash map, why is the time complexity not O(N^2) and instead is O(N)?
#include <unordered_map>
class Solution {
public:
unordered_map<char, int> m;
int firstUniqChar(string s) {
for(char c : s) {
m[c]++;
}
for(int i =0; i < s.length(); i++) {
if(m[s[i]] == 1) {
return i;
}
}
return -1;
}
};

It is on average O(N) because hash map is on average O(1) per lookup and you do O(N) of them.
On average means by averaging over all possible inputs. That means there might exists an input array that breaks a particular hash and achieves O(N) or much worse on every lookup.
Worst-case is heavily implementation specific - e.g. hashing into buckets depends on how are elements stored in each bucket. If they are in a simple list, then lookup is O(<duplicates>), binary tree will bring that down to O(log<duplicates>). There might also be a difference between searching for keys present and missing.
Also there is a big assumption that all hashed containers can grow with the number of elements stored. I.e. keeping the occupancy of buckets low.
It does not hurt to mention their worst-cases in interviews, it demonstrates you know they can have limits.

The time-complexity of the given problem is O(N). You may provide a perfect hash function for it, that is no collision ever happens. This perfect hash function here is static_cast<size_t>(256+c). Well, if you look at the fastest solutions to this problem on leetcode you see that guys use plain arrays.

Which container is most efficient for multiple insertions / deletions in C++?

I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
// Insert one entry with value x
void Insert(int x);
// Erase one entry with value x, if one exists
void Erase(int x);
// Erase all entries, x, from <= x < to
void Erase(int from, int to);
// Return the count of all entries, x, from <= x < to
size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
std::vector<int> integerCollection;
};
void IntegerCollection::Insert(int x) {
/* using lower_bound to find the right place for x to be inserted
keeps the vector sorted and makes life much easier */
auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
integerCollection.insert(it, x);
}
void IntegerCollection::Erase(int x) {
// find the location of the first element containing x and delete if it exists
auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
if (it != integerCollection.end()) {
integerCollection.erase(it);
}
}
void IntegerCollection::Erase(int from, int to) {
if (integerCollection.empty()) return;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
/* std::vector::erase deletes entries between the two pointers
fromBound (included) and toBound (not indcluded) */
integerCollection.erase(fromBound, toBound);
}
size_t IntegerCollection::Count(int from, int to) const {
if (integerCollection.empty()) return 0;
int count = 0;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
// increment pointer until fromBound == toBound (we don't count elements of value = to)
while (fromBound != toBound) {
++count; ++fromBound;
}
return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is? deque only makes sense if I can guarantee insertion or deletion to the ends of the container and list hogs memory. Is there something else that I'm completely overlooking?

We cannot know what would make the company happy. If they reject std::vector without concise reasoning I wouldn't want to work for them anyway. Moreover, we dont really know the precise requirements. Were you asked to provide one reasonably well performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the first you can either
roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
std::vector for data locality. See eg here for Bjarne himself advocating std::vector rather than linked lists.
std::set for ease of implementation. It seems like you want the container sorted and the interface you have to implement fits that of std::set quite well.
Let's compare only isertion and erasure assuming the container needs to stay sorted:
operation std::set std::vector
insert log(N) N
erase log(N) N
Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality data can be far away in memory (std::set) leading to many cache misses or it can be local as with std::vector. The log(N) only wins for huge N. To get an idea of the difference 500000/log(500000) is roughly 26410 while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for considerably small container sizes, but at some point the log(N) wins over cache. The exact location of this turning point depends on many factors and can only reliably determined by profiling and measuring.

Nobody knows which container is MOST efficient for multiple insertions / deletions. That is like asking what is the most fuel-efficient design for a car engine possible. People are always innovating on the car engines. They make more efficient ones all the time. However, I would recommend a splay tree. The time required for a insertion or deletion is a splay tree is not constant. Some insertions take a long time and some take only a very a short time. However, the average time per insertion/deletion is always guaranteed to be be O(log n), where n is the number of items being stored in the splay tree. logarithmic time is extremely efficient. It should be good enough for your purposes.

The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.
The integer value can be hashed to compute an index in to an array of bools or bits, used to tell if the integer value is in the container or not.
Counting and and deleting large ranges could be sped up from there, by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, that each stored ints from 0 to 0xFFFF and were using 32 bit integers you could then mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.
IntHashTable containers[0x10000];
u_int32 hashIndex = (u_int32)value / 0x10000;
u_int32int valueInTable = (u_int32)value - (hashIndex * 0x10000);
containers[hashIndex].insert(valueInTable);
Count for example could be implemented as so, if each hash table kept count of the number of elements it contained:
indexStart = startRange / 0x10000;
indexEnd = endRange / 0x10000;
int countTotal = 0;
for (int i = indexStart; i<=indexEnd; ++i) {
countTotal += containers[i].count();
}

Not sure if using sorting really is a requirement for removing the range. It might be based on position. Anyway, here is a link with some hints which STL container to use.
In which scenario do I use a particular STL container?
Just FYI.
Vector maybe a good choice, but it does a lot of re allocation, as you know. I prefer deque instead, as it doesn't require big chunk of memory to allocate all items. For such requirement as you had, list probably fit better.

Basic solution for this problem might be std::map<int, int>
where key is the integer you are storing and value is the number of occurences.
Problem with this is that you can not quickly remove/count ranges. In other words complexity is linear.
For quick count you would need to implement your own complete binary tree where you can know the number of nodes between 2 nodes(upper and lower bound node) because you know the size of tree, and you know how many left and right turns you took to upper and lower bound nodes. Note that we are talking about complete binary tree, in general binary tree you can not make this calculation fast.
For quick range remove I do not know how to make it faster than linear.

What is the fastest way to compare set of strings to one string?

I have set of strings and I need to find if one specific string is in it. I need to do this only one time (next time strings are different).
I'm thinking to sort strings with bucket sort and then do binary search.
Time complexity: O(n+k)+O(log n)
Is there any faster/better solution?
With set I mean more strings not std::set.

To summarize the comments above in an answer. If you are loading strings to be compared on the fly and do not need them to be in a specific order, then std::unordered_set is by far the fastest.
unordered_set is a hash set and will punch your string through a hash function and find if it is already in the set in constant time O(1).
If you need to retain the order of the elements then it becomes a question what is faster of retaining a vector and doing a linear search though it, or whether it is still worth to build the hash set.
Code:
std::unordered_set<std::string> theSet;
// Insert a few elements.
theSet.insert("Mango");
theSet.insert("Grapes");
theSet.insert("Bananas");
if ( theSet.find("Hobgoblins") == theSet.end() ) {
cout << "Could not find any hobgoblins in the set." << endl;
}
if ( theSet.find("Bananas") != theSet.end() ) {
cout << "But we did find bananas!!! YAY!" << endl;
}
For comparison:
If you use std::vector you will need O(n) time building the vector and then O(n) time finding an element.
If you use std::unordered_set you will still need O(n) time to build the vector, but afterwards you can find an element in constant time O(1).

Is there any array-like data structure that can grow in size on both sides?

I'm a student working on a small project for an high performance computing course, hence efficiency it's a key issue.
Let say that I have a vector of N floats and I want to remove the smallest n elements and the biggest n elements. There are two simple ways of doing this:
A
sort in ascending order // O(NlogN)
remove the last n elements // O(1)
invert elements order // O(N)
remove the last n elements // O(1)
B
sort in ascending order // O(NlogN)
remove the last n elements // O(1)
remove the first n elements // O(N)
In A inverting the elements order require swapping all the elements, while in B removing the first n elements require moving all the others to occupy the positions left empty. Using std::remove would give the same problem.
If I could remove the first n elements for free then solution B would be cheaper. That should be easy to achieve, if instead of having a vector, i.e. an array with some empty space after vector::end(), I would have a container with some free space also before vector::begin().
So the question is: does exist already an array-like (i.e. contiguous memory, no linked lists) in some libraries (STL, Boost) that allows for O(1) inserting/removing on both sides of the array?
If not, do you think that there are better solutions than creating such a data structure?

Have you thought of using std::partition with a custom functor like the example below:
#include <iostream>
#include <vector>
#include <algorithm>
template<typename T>
class greaterLess {
T low;
T up;
public:
greaterLess(T const &l, T const &u) : low(l), up(u) {}
bool operator()(T const &e) { return !(e < low || e > up); }
};
int main()
{
std::vector<double> v{2.0, 1.2, 3.2, 0.3, 5.9, 6.0, 4.3};
auto it = std::partition(v.begin(), v.end(), greaterLess<double>(2.0, 5.0));
v.erase(it, v.end());
for(auto i : v) std::cout << i << " ";
std::cout << std::endl;
return 0;
}
This way you would erase elements from your vector in O(N) time.

Try boost::circular_buffer:
It supports random access iterators, constant time insert and erase operations at the beginning or the end of the buffer and interoperability with std algorithms.
Having looked at the source, it seems (and is only logical) that data is kept as a continuous memory block.
The one caveat is that the buffer has fixed capacity and after exhausting it elements will get overwritten. You can either detect such cases yourself and resize the buffer manually, or use boost::circular_buffer_space_optimized with a humongous declared capacity, since it won't allocate it if not needed.

To shrink & grow a vector at both ends, you can use idea of slices, reserving extra memory to expand into ahead of time at front and back, if efficient growth is needed.
Simply, make a class with not only a length but indices for first & last elements and a suitably sized vector, to create a window of data on the underlying block of stored floats. A C++ class can provide inlined functions, for things like deleting items, address into the array, find the nth largest value, shift the slice values down or up to insert new elements maintaining sorted order. Should no spare elements be available, then dynamic allocation of a new larger float store, permits continuing growth at the cost of an array copy.
A circular buffer is designed as a FIFO, with new elements added at end, removal at front, and not allowing insertion in the middle, a self defined class can also (trivially) support array subscript values different from 0..N-1
Due to memory locality, avoiding excessive indirection due to pointer chains, and the pipelining of subscript calculations on a modern processor, a solution based on an array (or a vector), is likely to be most efficicent, despite element copying on insertion. Deque would be suitable but it fails to guarantee contiguous storage.
Additional supplementary info. Researching classes providing slices, finds some plausible alternatives to evaluate :
A) std::slice which uses slice_arrays
B) Boost Class Range
Hope this is the kind of specific information you were hoping for, in general a simpler clearer solution is more maintainable, than a tricky one. I would expect slices and ranges on sorted data sets, being quite common, for example filtering experimental data where "outliers" are excluded as faulty readings.
I think a good solution, should actually be - O(NlogN), 2xO(1), with any binary searches O(logN +1) for filtering on outlying values, in place of deleting a fixed number of small or large values; it matters that the "O" is relatively fast to, sometimes an O(1) algorithmn can be in practice slower for practical values of N than an O(N) one.

as a complementary to #40two 's answer, before partitioning the array, you will need to find the partitioning pivot, which is you will need to find the nth smallest number, and the nth greatest number in an unsorted array.
There is a discussion on that in SO: How to find the kth largest number in unsorted array
There are several algorithms to solve this problem. Some are deterministic O(N) - on of them is a variation on finding the median (median of medians). There are some non-deterministic algorithms with O(N) average-case.
A good source book to find those algorithms is Introduction to algorithms.
Also in books like
So eventually, your code will run in an O(N) time

How to get a sorted subvector out of a sorted vector, fast

I have a data structure like this:
struct X {
float value;
int id;
};
a vector of those (size N (think 100000), sorted by value (stays constant during the execution of the program):
std::vector<X> values;
Now, I want to write a function
void subvector(std::vector<X> const& values,
std::vector<int> const& ids,
std::vector<X>& out /*,
helper data here */);
that fills the out parameter with a sorted subset of values, given by the passed ids (size M < N (about 0.8 times N)), fast (memory is not an issue, and this will be done repeatedly, so building lookuptables (the helper data from the function parameters) or something else that is done only once is entirely ok).
My solution so far:
Build lookuptable lut containing id -> offset in values (preparation, so constant runtime)
create std::vector<X> tmp, size N, filled with invalid ids (linear in N)
for each id, copy values[lut[id]] to tmp[lut[id]] (linear in M)
loop over tmp, copying items to out (linear in N)
this is linear in N (as it's bigger than M), but the temporary variable and repeated copying bugs me. Is there a way to do it quicker than this? Note that M will be close to N, so things that are O(M log N) are unfavourable.
Edit: http://ideone.com/xR8Vp is a sample implementation of mentioned algorithm, to make the desired output clear and prove that it's doable in linear time - the question is about the possibility of avoiding the temporary variable or speeding it up in some other way, something that is not linear is not faster :).

An alternative approach you could try is to use a hash table instead of a vector to look up ids in:
void subvector(std::vector<X> const& values,
std::unordered_set<int> const& ids,
std::vector<X>& out) {
out.clear();
out.reserve(ids.size());
for(std::vector<X>::const_iterator i = values.begin(); i != values.end(); ++i) {
if(ids.find(i->id) != ids.end()) {
out.push_back(*i);
}
}
}
This runs in linear time since unordered_set::find is constant expected time (assuming that we have no problems hashing ints). However I suspect it might not be as fast in practice as the approach you described initially using vectors.

Since your vector is sorted, and you want a subset of it sorted the same way, I assume we can just slice out the chunk you want without rearranging it.
Why not just use find_if() twice. Once to find the start of the range you want and once to find the end of the range. This will give you the start and end iterators of the sub vector. Construct a new vector using those iterators. One of the vector constructor overloads takes two iterators.
That or the partition algorithm should work.

If I understood your problem correctly, you actually try to create a linear time sorting algorithm (subject to the input size of numbers M).
That is NOT possible.
Your current approach is to have a sorted list of possible values.
This takes linear time to the number of possible values N (theoretically, given that the map search takes O(1) time).
The best you could do, is to sort the values (you found from the map) with a quick sorting method (O(MlogM) f.e. quicksort, mergesort etc) for small values of M and maybe do that linear search for bigger values of M.
For example, if N is 100000 and M is 100 it is much faster to just use a sorting algorithm.
I hope you can understand what I say. If you still have questions I will try to answer them :)
edit: (comment)
I will further explain what I mean.
Say you know that your numbers will range from 1 to 100.
You have them sorted somewhere (actually they are "naturally" sorted) and you want to get a subset of them in sorted form.
If it would be possible to do it faster than O(N) or O(MlogM), sorting algorithms would just use this method to sort.
F.e. by having the set of numbers {5,10,3,8,9,1,7}, knowing that they are a subset of the sorted set of numbers {1,2,3,4,5,6,7,8,9,10} you still can't sort them faster than O(N) (N = 10) or O(MlogM) (M = 7).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js