Optimizing boost unordered_map and unordered_set, C++

I will be parsing 60 GB of text and doing a lot of inserts and lookups in maps.
I just started using boost::unordered_set and boost::unordered_map.
As my program fills these containers they keep growing bigger and bigger, and I was wondering whether it would be a good idea to preallocate memory for them.
Something like
mymap::get_allocator().allocate(N); ?
Or should I just leave them to allocate memory and figure out growth factors by themselves?
The code looks like this:
the codes look like this
boost::unordered_map<string, long>  words_vs_frequency, wordpair_vs_frequency;
boost::unordered_map<string, float> word_vs_probability, wordpair_vs_probability,
                                    wordpair_vs_MI;
//... ... ...
N = words_vs_frequency.size();
long y = 0;
float MIWij = 0.0f, maxMI = -999999.0f;
string Wij, Wi, Wj;  // (presumably declared in the elided code above)
for (boost::unordered_map<string, long>::iterator i = wordpair_vs_frequency.begin();
     i != wordpair_vs_frequency.end(); ++i) {
    if (i->second >= BIGRAM_OCCURANCE_THRESHOLD)
    {
        y++;
        Wij = i->first;
        WordPairToWords(Wij, Wi, Wj);
        MIWij = log( wordpair_vs_probability[Wij] /
                     (word_vs_probability[Wi] * word_vs_probability[Wj]) );
        // keep only the pairs whose MI value exceeds the threshold
        if (MIWij > MUTUAL_INFORMATION_THRESHOLD)
            wordpair_vs_MI[Wij] = MIWij;
        if (MIWij > maxMI)
            maxMI = MIWij;
    }
}
Thanks in advance

According to the documentation, both unordered_set and unordered_map have a method
void rehash(size_type n);
that regenerates the hash table so that it contains at least n buckets (it sounds like it does what reserve() does for the other STL containers).
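For example, a minimal sketch of pre-sizing the table before the big insert loop (the bucket count below is an arbitrary placeholder; newer Boost releases also provide reserve(n), which additionally accounts for the maximum load factor):

boost::unordered_map<std::string, long> words_vs_frequency;

// ask for at least this many buckets up front, so the table is not
// repeatedly rehashed while the 60 GB of text streams in
words_vs_frequency.rehash(10 * 1000 * 1000);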

I would try it both ways, which will let you generate hard data showing whether one method works better than the other. We can speculate all day about which method will be optimal, but as with most performance questions, the best thing to do is try it out and see what happens (and then fix the parts that actually need fixing).
That being said, the Boost authors seem to be very smart, so it quite possibly will work fine as-is. You'll just have to test and see.

Honestly, I think you would be best off writing your own allocator. You could, for instance, make an allocator with a method called preallocate(int N) which would reserve N bytes, then use unordered_map::get_allocator() for all your fun. In addition, with your own allocator, you could tell it to grab huge chunks at a time.
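A minimal sketch of that idea, using the C++11 allocator interface (Pool and PoolAllocator are invented names, not a real library; this is a bump allocator that never frees individual nodes, so it only fits containers that are filled and then discarded wholesale):

#include <cstddef>
#include <new>

// One shared block, carved up sequentially and freed only as a whole.
struct Pool {
    char*       data;
    std::size_t size;
    std::size_t used;
    explicit Pool(std::size_t n) : data(new char[n]), size(n), used(0) {}
    ~Pool() { delete[] data; }
};

template <typename T>
struct PoolAllocator {
    typedef T value_type;
    template <typename U> struct rebind { typedef PoolAllocator<U> other; };

    Pool* pool;  // pointer to shared state, so rebound copies carve from the same block

    explicit PoolAllocator(Pool* p) : pool(p) {}
    template <typename U>
    PoolAllocator(const PoolAllocator<U>& other) : pool(other.pool) {}

    T* allocate(std::size_t n) {
        // align the cursor for T, then bump it past the requested bytes
        std::size_t aligned = (pool->used + alignof(T) - 1) & ~(alignof(T) - 1);
        if (aligned + n * sizeof(T) > pool->size)
            throw std::bad_alloc();  // could also fall back to ::operator new here
        pool->used = aligned + n * sizeof(T);
        return reinterpret_cast<T*>(pool->data + aligned);
    }
    void deallocate(T*, std::size_t) {}  // no-op: memory dies with the Pool
};

template <typename T, typename U>
bool operator==(const PoolAllocator<T>& a, const PoolAllocator<U>& b)
{ return a.pool == b.pool; }
template <typename T, typename U>
bool operator!=(const PoolAllocator<T>& a, const PoolAllocator<U>& b)
{ return !(a == b); }

You would then instantiate the map as boost::unordered_map<std::string, long, boost::hash<std::string>, std::equal_to<std::string>, PoolAllocator<std::pair<const std::string, long> > > and construct it with an allocator pointing at a suitably sized Pool.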

Related

unordered_map to find indices of an array

I want to find the indices of a set's elements efficiently. I am using unordered_map and building the inverse map like this:
std::unordered_map<int, int> myHash(size);  // 'size' buckets requested up front
int i = 0;
for (auto it = someSet.begin(); it != someSet.end(); ++it)
{
    myHash.insert({*it, i++});
}
It works, but it is not efficient. I did this so that any time I need the indices I can access them in O(1). Performance analysis shows that this part has become the hotspot of my code.
VTune tells me that operator new is the hotspot. I guess something is happening inside the unordered_map.
It seems to me that this case should be handled efficiently, but I couldn't find a good way yet. Is there a better solution? A more suitable constructor?
Maybe I should pass more information to the constructor. I looked at the initializer list, but it is not exactly what I want.
Update: let me add some more information. The set itself is not that important; I save the set into a sorted array. Later I need to find the indices of the values, which are unique. I can do it in O(log n), but that is not fast enough, which is why I decided to use a hash. The size of the set (the columns of the submatrix) doesn't change after this point.
This arises from sparse-matrix computation, where I need to find the indices of submatrices within a bigger matrix. Therefore the size and the pattern of the lookups depend on the input matrix. It works reasonably on smaller problems. I could use a lookup table, but since I plan to run this in parallel, a lookup table per thread could be expensive. I know the exact size of the hash at creation time, and I thought that passing it to the constructor would stop the reallocating. I really don't understand why it is reallocating this much.
The problem is that std::unordered_map, usually implemented as a vector of linked lists (one list of nodes per bucket), is extremely cache-unfriendly and performs especially poorly with small keys/values (like int, int in your case), not to mention requiring tons of (re-)allocations.
As an alternative, you can try a third-party hash map implementing open addressing with linear probing (a mouthful, but the underlying structure is simply a vector, i.e. much more cache-friendly). For example, Google's dense_hash_map or this: flat_hash_map. Both can be used as a drop-in replacement for unordered_map, and only additionally require you to designate one int value as the "empty" key.
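A minimal sketch with Google's dense_hash_map (the header path and the set_empty_key requirement are per the sparsehash documentation; -1 is only a valid choice if it never occurs as a real key, and 'size' and 'someSet' are taken from the question):

#include <sparsehash/dense_hash_map>

google::dense_hash_map<int, int> myHash(size);
myHash.set_empty_key(-1);  // required: a sentinel value that is never a real key

int i = 0;
for (auto it = someSet.begin(); it != someSet.end(); ++it)
    myHash[*it] = i++;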
std::unordered_map<int, int> is often implemented as if it were
std::vector<std::list<std::pair<int, int>>>
which causes a lot of allocations and deallocations of individual nodes, and each (de-)allocation takes a lock, which causes contention.
You can help it a bit by using emplace instead of insert, or you can jump into the fantastic new world of pmr allocators. If the creation and destruction of the pmr::unordered_map is single-threaded, you should be able to get a lot of extra performance out of it. See Jason Turner's C++ Weekly - Ep 222 - 3.5x Faster Standard Containers With PMR!; his example is a bit on the small side, but it gives you the general idea.
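A minimal sketch of the pmr route (C++17; 'size' and 'someSet' are from the question):

#include <memory_resource>
#include <unordered_map>

// All nodes are bump-allocated out of one buffer and nothing is freed until
// the resource itself is destroyed, so the per-node (de)allocation cost and
// the lock contention disappear.
std::pmr::monotonic_buffer_resource pool;
std::pmr::unordered_map<int, int> myHash(&pool);
myHash.reserve(size);  // pre-size the bucket array as well

int i = 0;
for (int v : someSet)
    myHash.emplace(v, i++);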

In a low-latency application, Is unordered_map ever a better solution over vector?

Is it advisable to use unordered_map in place of vector while developing a low-latency application?
I recently interviewed with a financial company that worked on low-latency trading applications. I was asked a question for which I answered using an unordered_map, which seemed pretty good efficiency-wise (O(n)) compared to the O(n²) I would get with a vector. However, I know that it is advisable to use vector as much as possible and avoid unordered_map in order to benefit from cache coherence. I just wanted to see if there is a better solution for this problem. The problem I was asked was to check whether two strings are permutations of each other.
bool isPermutation(const std::string& first, const std::string& second) {
    if (first.length() != second.length())
        return false;
    std::unordered_map<char, int> charDict;
    for (char c : first) {
        charDict[c]++;
    }
    for (char c : second) {
        auto found = charDict.find(c);
        if (found == charDict.end() || found->second == 0) {
            return false;  // second has more of c than first
        }
        --found->second;
    }
    return true;
}
You can assume that both strings are of equal length, and the function should return true only if each character occurs exactly as many times in the second string as in the first.
Sure, but it really depends on the problem you are trying to solve. If the domain of your key space is unknown, it would be difficult to come up with a generic solution that is faster than unordered_map.
In this case, the domain of your key space is known: it is limited to ASCII characters. This is convenient because you can instantly convert from item (char) to vector index (std::size_t). So you could just use the value of each character as an index into a vector rather than hashing it for every lookup.
But in general, don't optimize prematurely. If unordered_map is the most natural solution, I would start there, then profile, and if you find that performance does not meet your requirements, look at reworking your solution. (This isn't always the best advice; if you know you are working on a highly critical piece of code, there are certain design decisions you will want to take into consideration from the beginning. Coming back and refactoring later may be much more difficult if you start with an incompatible design.)
Since there are only 256 possible keys, you can use a stack-allocated array of 256 counts, which will be faster than a vector or an unordered_map. If first.size() + second.size() < 128, only initialize the counts to 0 for keys that actually occur; otherwise memset the whole array.
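A minimal sketch of that approach (this is the simpler variant that always zero-initializes the array, and it assumes 8-bit chars):

#include <string>

bool isPermutation(const std::string& first, const std::string& second) {
    if (first.size() != second.size())
        return false;
    int counts[256] = {0};               // one slot per possible char value
    for (unsigned char c : first)
        ++counts[c];
    for (unsigned char c : second)
        if (--counts[c] < 0)             // second has more of c than first
            return false;
    return true;                          // equal lengths => all counts back to 0
}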

Sort a vector in which the n first elements have been already sorted?

Consider a std::vector v of N elements, where the first n elements have already been sorted, with n < N and (N-n)/N very small:
Is there a clever way, using the STL algorithms, to sort this vector more rapidly than with a complete std::sort(std::begin(v), std::end(v))?
EDIT: a clarification: the (N-n) unsorted elements should be inserted at the right positions within the first n already-sorted elements.
EDIT 2: bonus question: how do you find n? (i.e., the index of the first unsorted element)
Sort the unsorted range only, and then use std::inplace_merge.
void foo(std::vector<int>& tab, int n) {
    std::sort(begin(tab) + n, end(tab));                       // sort the tail
    std::inplace_merge(begin(tab), begin(tab) + n, end(tab));  // merge the two runs
}
For EDIT 2:
// adjacent_find with greater<> locates the first "descent", i.e. the
// boundary between the sorted prefix and the unsorted tail
auto it = std::adjacent_find(begin(tab), end(tab), std::greater<int>());
if (it != end(tab)) {
    ++it;
    std::sort(it, end(tab));
    std::inplace_merge(begin(tab), it, end(tab));
}
The optimal solution would be to sort the tail portion independently and then perform an in-place merge, as described here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.5750
The algorithm is quite convoluted and is usually regarded as not worth the effort.
Of course, with C++ you can use the readily available std::inplace_merge. However, the name of that algorithm is highly misleading. Firstly, there's no guarantee that std::inplace_merge actually works in place. And when it is genuinely in place, there's no guarantee that it is not implemented as a full-blown sort. In practice it boils down to trying it and seeing whether it is good enough for your purposes.
But if you really want it to be in place and formally more efficient than a full sort, then you will have to implement it manually. The STL might help with a few utility algorithms, but it does not offer any solid solution of the "just a few calls to standard functions" kind.
Using insertion sort on the last N-n elements:
template <typename IT>
void mysort(IT begin, IT end) {
    // start at the first out-of-order element
    for (IT it = std::is_sorted_until(begin, end); it != end; ++it) {
        // binary-search the sorted prefix for the insertion point,
        // then rotate *it into place
        IT insertPos = std::lower_bound(begin, it, *it);
        std::rotate(insertPos, it, std::next(it));
    }
}
The Timsort sorting algorithm is a hybrid algorithm developed by Pythonista Tim Peters. It makes optimal use of already-sorted subsegments anywhere inside the array, including in the beginning. Although you may find a faster algorithm if you know for sure that in particular the first n elements are already sorted, this algorithm should be useful for the overall class of problems involved. Wikipedia describes it as:
The algorithm finds subsets of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently.
In Tim Peters' own words,
It has supernatural performance on many kinds of partially ordered arrays (less than lg(N!) comparisons needed, and as few as N-1), yet as fast as Python's previous highly tuned samplesort hybrid on random arrays.
Full details are described in this undated text document by Tim Peters. The examples are in Python, but Python should be quite readable even to people not familiar with its syntax.
Use std::partition_point (or std::is_sorted_until) to find n. Then, if N-n is small, do an insertion sort (linear search + std::rotate).
I assume that your question has two aims:
improve runtime (using a clever way)
with little effort (restricting yourself to the STL)
Considering these aims, I'd strongly recommend against this specific optimization unless you are sure that the effort is worth the benefit.
As far as I remember, std::sort() implements an introsort (a quicksort variant), which on presorted input is nearly as fast as it would be to first determine whether / how much of the input is already sorted.
Instead of meddling with std::sort you can try changing the data structure to a sorted/prioritized queue.

SQL-Like Selects in Imperative Languages

I'm doing some coding at work in C++, and a lot of the things that I work on involve analyzing sets of data. Very often I need to select some elements from an STL container, and very frequently I write code like this:
using std::vector;

vector<int> numbers;
for (int i = -10; i <= 10; ++i) {
    numbers.push_back(i);
}

vector<int> positive_numbers;
for (vector<int>::const_iterator it = numbers.begin(), end = numbers.end();
     it != end; ++it) {
    if (*it > 0) {
        positive_numbers.push_back(*it);
    }
}
Over time this for loop and the logic contained within it get a lot more complicated and unreadable. Code like this is less satisfying than the analogous SELECT statement in SQL, assuming I have a table called numbers with a column named "num" rather than a std::vector<int>:
SELECT * INTO positive_numbers FROM numbers WHERE num > 0
That's a lot more readable to me, and it also scales better; over time, a lot of the if-statement logic in our codebase has become complicated, order-dependent, and unmaintainable. If we could write SQL-like statements in C++ without having to go to a database, I think the state of the code might be better.
Is there a simpler way that I can implement something like a SELECT statement in C++ where I can create a new container of objects by only describing the characteristics of the objects that I want? I'm still relatively new to C++, so I'm hoping that there's something magic with either template metaprogramming or clever iterators that would solve this. Thanks!
Edit based on first two answers. Thanks, I had no idea that's what LINQ actually was. I program on Linux and OSX systems primarily, and am interested in something cross-platform across OSX, Linux and Windows. So a more educated version of this question would be - is there a cross-platform implementation of something like LINQ for C++?
You've almost exactly described LINQ. It's a .NET 3.5 feature, so you should be able to use it from C++/CLI.
The functionality you're describing is commonly found in functional languages that support concepts such as closures, predicates, functors, etc.
The problem with the code above is that it combines:
Logic for iterating over collection (the for loop)
Condition that must be satisfied for an element to be copied to another collection
Logic for copying an element from one collection to another
In reality (1) and (3) are boilerplate, insofar as every time you need to iterate over a collection copying some elements to another collection, it's probably only the conditional code that will change each time. Languages with support for functional programming eliminate this boilerplate. For example, in Groovy you can replace your for loop above with just
def positive_numbers = numbers.findAll{it > 0}
Even though C++ is not a functional language, there are libraries that provide support for functional-style programming with STL collections. For example, the Apache Commons Collections library (and also possibly Google's collections library) provides support for functional-style programming with Java collections, even though Java itself is not a functional language.
I think you have described LINQ (a C# and .NET 3.5 feature). Have you looked into that?
LINQ is the obvious answer for .NET (or Mono on non-Windows platforms), but in C++ it shouldn't be that difficult to write something like it yourself on top of the STL.
Use the Boost.Iterator library to write a "select" iterator, for example, one which skips all elements that do not satisfy a given predicate.
Boost already has a few relevant examples in their documentation I believe.
Or http://www.boost.org/doc/libs/1_39_0/libs/iterator/doc/filter_iterator.html might actually do what you need out of the box.
In any case, in C++, you could achieve the same effect basically by layering iterators.
If you have a regular iterator, which visits every element in the sequence, you can wrap that in a filter iterator, which increments the underlying iterator until it finds a value satisfying the condition. Then you could even wrap that in a "select" iterator transforming the value to the desired format.
It seems like a fairly obvious idea, but I'm not aware of any complete implementations of it.
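A minimal sketch of the layered-iterator idea above, using boost::filter_iterator from Boost.Iterator (numbers is the vector from the question):

#include <boost/iterator/filter_iterator.hpp>
#include <vector>

struct IsPositive {
    bool operator()(int x) const { return x > 0; }
};

typedef boost::filter_iterator<IsPositive, std::vector<int>::iterator> FilterIt;

FilterIt first = boost::make_filter_iterator(IsPositive(), numbers.begin(), numbers.end());
FilterIt last  = boost::make_filter_iterator(IsPositive(), numbers.end(),   numbers.end());

// the range [first, last) visits only the elements satisfying the predicate
std::vector<int> positive_numbers(first, last);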
You're using STL containers. I would recommend using STL algorithms, which are largely straight out of set theory. A SQL select is translated to repeated applications of std::find_if, or a combination of std::lower_bound and std::upper_bound (on sorted containers). The performance will be about the same as looping, but the syntax is a little more declarative.
LINQ will give you similar syntax and operations, but unless used over IQueryables (i.e., data in a database) you're not going to get any performance gains either.
Your best bet after that is putting things into files for this sort of thing. Whether that's BerkelyDB, NetCDF, HDF5, STXXL, etc. File access is slow, but doing this allows you to work on more data than fits in memory.
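For the concrete SELECT in the question, the single-algorithm version is one call (std::copy_if is C++11; pre-C++11 you can use std::remove_copy_if with the negated predicate):

#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> positive_numbers;
std::copy_if(numbers.begin(), numbers.end(),
             std::back_inserter(positive_numbers),
             [](int n) { return n > 0; });  // the WHERE clause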
For what you're describing, std::vector isn't a terribly good choice; it is the SQL equivalent of a table with no indexes. On top of that, filling one container with the contents of another container is possibly a reasonable performance optimization, but not very readable, and not quite idiomatic either. There are a number of ways of solving this portably (i.e., without relying on .NET managed code).
First choice is to choose a better container. If you don't need stable iteration, you should use std::set or std::multiset. These containers use a balanced search tree to store the values in order, which is equivalent to a simple SQL index on all columns.
std::set<int> numbers;
for (int i = -10; i <= 10; ++i) {
    numbers.insert(i);
}

std::set<int>::iterator first = numbers.lower_bound(1);  // first element >= 1
std::set<int>::iterator end = numbers.end();
Now you can iterate from first to end without wasting any extra effort beyond the O(n log n) fill and the O(log n) seek; incrementing a std::set::iterator is amortized O(1).
If, for some reason you must use a vector, you can get more idiomatic C++ using std::find_if (see Max Lybbert's answer)
bool isPositive(int n) { return n > 0; }

std::vector<int> numbers;
for (int i = -10; i <= 10; ++i) {
    numbers.push_back(i);
}

for (std::vector<int>::const_iterator end = numbers.end(),
     iter = std::find_if(numbers.begin(), end, isPositive);  // <- first positive value
     iter != end;
     iter = std::find_if(iter + 1, end, isPositive))         // <- advance past iter to the next positive
{
    // *iter is guaranteed to be positive here, do something with it!
}
If you want something even more evocative of SQL without actually connecting to a database, you should look at Boost, particularly the boost::multi_index container and boost iterators.
Check out Mono if you want to try out LINQ on Linux / OS X. It's a port of the .NET Framework, and LINQ is included now, I believe.

Searching fast through a sorted list of strings in C++

I have a list of about a hundred unique strings in C++. I need to check if a value exists in this list, preferably lightning fast.
I am currently using a hash_set with std::strings (since I could not get it to work with const char*), like so:
stdext::hash_set<const std::string> _items;
_items.insert("LONG_NAME_A_WITH_SOMETHING");
_items.insert("LONG_NAME_A_WITH_SOMETHING_ELSE");
_items.insert("SHORTER_NAME");
_items.insert("SHORTER_NAME_SPECIAL");

stdext::hash_set<const std::string>::const_iterator it = _items.find("SHORTER_NAME");
if (it != _items.end()) {
    std::cout << "item exists" << std::endl;
}
Does anybody have a good idea for a faster search method, without building a complete hashtable myself?
The list is a fixed list of strings that will not change. It contains the names of elements that are affected by a certain bug and should be repaired on-the-fly when opened with a newer version.
I've built hashtables before using Aho-Corasick, but I'm not really willing to add too much complexity.
I was amazed by the number of answers. I ended up testing a few methods for their performance and ended up using a combination of kirkus and Rob K.'s answers. I had tried a binary search before, but I guess I had a small bug implementing it (how hard can it be...).
The results were shocking... I thought I had a fast implementation using a hash_set... well, it turns out I did not. Here are some statistics (and the eventual code):
Random lookup of 5 existing keys and 1 non-existent key, 50,000 times:
My original algorithm took on average 18.62 seconds
A linear search took on average 2.49 seconds
A binary search took on average 0.92 seconds
A search using a perfect hashtable generated by gperf took on average 0.51 seconds
Here's the code I use now:
bool searchWithBinaryLookup(const std::string& strKey) {
    static const char arrItems[NUM_ITEMS][MAX_ITEM_LEN] = { /* list of items */ };

    /* Binary lookup */
    int low = 0;
    int high = NUM_ITEMS;
    while (low < high) {
        int mid = (low + high) / 2;
        if (arrItems[mid] > strKey) {       // key is in the lower half
            high = mid;
        }
        else if (arrItems[mid] < strKey) {  // key is in the upper half
            low = mid + 1;
        }
        else {
            return true;
        }
    }
    return false;
}
NOTE: This is Microsoft VC++, so I'm not using the SGI std::hash_set.
I did some tests this morning using gperf, as VardhanDotNet suggested, and it is indeed quite a bit faster.
If your list of strings is fixed at compile time, use gperf
http://www.gnu.org/software/gperf/
QUOTE:
gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
The output of gperf is not governed by the GPL or LGPL, AFAIK.
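A minimal sketch of the workflow (the input file format is gperf's; Perfect_Hash and in_word_set are gperf's default names for its C++ output, per its documentation):

// keywords.gperf -- input file listing the fixed strings:
//   %%
//   LONG_NAME_A_WITH_SOMETHING
//   LONG_NAME_A_WITH_SOMETHING_ELSE
//   SHORTER_NAME
//   SHORTER_NAME_SPECIAL
//   %%
//
// generate the lookup code with:  gperf -L C++ keywords.gperf > keyword_hash.hpp

#include "keyword_hash.hpp"
#include <string>

bool contains(const std::string& key) {
    // a non-null return means the string is in the list; at most one
    // string comparison is performed
    return Perfect_Hash::in_word_set(key.c_str(), key.length()) != 0;
}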
You could try a PATRICIA Trie if none of the standard containers meet your needs.
Worst-case lookup is bounded by the length of the string you're looking up. Also, strings share common prefixes, so it is really easy on memory. So if you have lots of relatively short strings, this could be beneficial.
Check it out here.
Note: PATRICIA = Practical Algorithm to Retrieve Information Coded in Alphanumeric
If it's a fixed list, sort the list and do a binary search? I can't imagine, with only a hundred or so strings on a modern CPU, you're really going to see any appreciable difference between algorithms, unless your application is doing nothing but searching said list 100% of the time.
What's wrong with std::vector? Load it, sort(v.begin(), v.end()) once and then use lower_bound() to see if the string is in the vector. lower_bound is guaranteed to be O(log2 N) on a sorted random access iterator. I can't understand the need for a hash if the values are fixed. A vector takes less room in memory than a hash and makes fewer allocations.
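A minimal sketch of that suggestion (std::binary_search is the existence-only convenience wrapper over the same lower_bound logic):

#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> items;  // filled once with the ~100 fixed strings
// ...
std::sort(items.begin(), items.end());  // once, at startup

// per query: O(log n) string comparisons
bool exists = std::binary_search(items.begin(), items.end(),
                                 std::string("SHORTER_NAME"));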
I doubt you'd come up with a better hashtable; if the list varies from time to time you've probably got the best way.
The fastest way would be to construct a finite state machine to scan the input. I'm not sure what the best modern tools are (it's been over ten years since I did anything like this in practice), but Lex/Flex was the standard Unix constructor.
A FSM has a table of states, and a list of accepting states. It starts in the beginning state, and does a character-by-character scan of the input. Each state has an entry for each possible input character. The entry could either be to go into another state, or to abort because the string isn't in the list. If the FSM gets to the end of the input string without aborting, it checks the final state it's in, which is either an accepting state (in which case you've matched the string) or it isn't (in which case you haven't).
Any book on compilers should have more detail, or you can doubtless find more information on the web.
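A minimal sketch of such a table-driven FSM, built as a trie over the fixed list (the names here are invented for illustration, not from any particular tool's output):

#include <array>
#include <string>
#include <vector>

struct Fsm {
    std::vector<std::array<int, 256>> next;  // per-state transition table, -1 = abort
    std::vector<bool> accept;                // accepting states

    Fsm() { addState(); }                    // state 0 is the start state

    int addState() {
        next.emplace_back();
        next.back().fill(-1);
        accept.push_back(false);
        return static_cast<int>(next.size()) - 1;
    }

    void add(const std::string& word) {      // called once per listed string
        int s = 0;
        for (unsigned char c : word) {
            if (next[s][c] == -1) {
                int t = addState();          // may reallocate 'next', so assign after
                next[s][c] = t;
            }
            s = next[s][c];
        }
        accept[s] = true;
    }

    bool match(const std::string& input) const {
        int s = 0;
        for (unsigned char c : input) {
            s = next[s][c];
            if (s == -1)                      // no transition: string not in the list
                return false;
        }
        return accept[s];                     // accepting state => exact match
    }
};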
If the set of strings to check numbers in the hundreds, as you say, and this happens while doing I/O (loading a file, which I assume comes from disk), then I'd say: profile what you've got before looking for more exotic/complex solutions.
Of course, it could be that your "documents" contain hundreds of millions of these strings, in which case I guess it really starts to take time... Without more detail, it's hard to say for sure.
What I'm saying boils down to "consider the use case and typical scenarios before (over)optimizing", which I guess is just a specialization of that old saying about the root of all evil... :)
100 unique strings? If this isn't called frequently, and the list doesn't change dynamically, I'd probably use a straightforward const char array with a linear search. Unless you search it a lot, something that small just isn't worth the extra code. Something like this:
const char _items[][MAX_ITEM_LEN] = { /* ... */ };

// plain linear scan; check the bounds before touching _items[i]
int i = 0;
while (i < NUM_ITEMS && strcmp(a, _items[i]) != 0)
    ++i;
bool found = i < NUM_ITEMS;
For a list that small, I think the implementation and maintenance costs of anything more complex would probably outweigh the runtime costs, and you're not really going to get cheaper space costs than this. To gain a little more speed, you could add a hash table from first char to list index to set the initial value of i.
For a list this small, you probably won't get much faster.
You're using binary search, which is O(log n). You could look at interpolation search, which is not as good in the worst case but has a better average case: O(log log n).
I don't know which kind of hashing function MS uses for strings, but maybe you could come up with something simpler (= faster) that works in your special case. The container should allow you to use a custom hashing class.
If it's an implementation issue of the container, you can also try whether Boost's std::tr1::unordered_set gives better results.
A hash table is a good solution, and by using a pre-existing implementation you are likely to get good performance. An alternative, though, is something I believe is called "indexing".
Keep some pointers around to convenient locations. E.g., if the sort order is alphabetical, keep a pointer to everything starting with aa, ab, ac, ... ba, bb, bc, ... That's a few hundred pointers, but it means you can skip to a part of the list quite near the result before continuing. E.g., if an entry is "afunctionname", you can binary search between the pointers for af and ag, which is much faster than searching the whole lot; if you have a million records in total, you will likely only have to binary search a list of a few thousand. A sketch of the idea follows below.
I re-invented this particular wheel, but there may be plenty of implementations out there already, which will save you the headache of implementing it and are likely faster than any code I could paste in here. :)
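A minimal sketch of the two-letter index over a sorted vector (assumes lowercase-ASCII keys; sorted_items, bucket_start, buildIndex, and lookup are invented names for illustration):

#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> sorted_items;        // filled and sorted once, up front
std::size_t bucket_start[26 * 26 + 1];        // start offset of each 2-letter prefix

void buildIndex() {
    std::size_t pos = 0;
    for (int b = 0; b < 26 * 26; ++b) {
        std::string prefix = { char('a' + b / 26), char('a' + b % 26) };
        // first element not less than the prefix; prefixes are increasing,
        // so the search can resume from the previous position
        pos = std::lower_bound(sorted_items.begin() + pos, sorted_items.end(),
                               prefix) - sorted_items.begin();
        bucket_start[b] = pos;
    }
    bucket_start[26 * 26] = sorted_items.size();
}

bool lookup(const std::string& s) {
    if (s.size() < 2)  // too short for the index: fall back to the full range
        return std::binary_search(sorted_items.begin(), sorted_items.end(), s);
    int b = (s[0] - 'a') * 26 + (s[1] - 'a');
    // binary search only within the (much smaller) bucket for this prefix
    return std::binary_search(sorted_items.begin() + bucket_start[b],
                              sorted_items.begin() + bucket_start[b + 1], s);
}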