We have 4,816,703 entries in this format:
1 abc
2 def
...
...
4816702 blah
4816703 blah_blah
Since the number of entries is quite large, I am worried that std::map would take a lot of time during insertion, since it also needs to rebalance on every insertion.
Only inserting these entries into the map takes a lot of time. I am doing
map[first] = second;
Two questions:
1. Am I correct in using std::map for this kind of case?
2. Am I correct in inserting the above way, or should I use map.insert()?
I am sorry for not running the experiments and posting absolute numbers, but we want a general consensus on whether we are doing the right thing or not.
Also, the keys are not always consecutive.
P.S. Of course, later we will also need to access the map to get the values corresponding to the keys.
If you don’t need to insert into the map afterwards, you can construct an unsorted vector of your data, sort it according to the key, and then search using functions like std::equal_range.
It’s the same complexity as std::map, but far fewer allocations.
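A minimal sketch of that approach, assuming int keys and string values (the Entry typedef and by_key comparator are just illustrative names):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<int, std::string> Entry;  // key -> value

static bool by_key(const Entry& a, const Entry& b) { return a.first < b.first; }

int main() {
    std::vector<Entry> data;
    data.reserve(4816703);                 // avoid repeated reallocation during the bulk load
    // ... push_back all entries read from the input here ...
    data.push_back(Entry(1, "abc"));
    data.push_back(Entry(2, "def"));

    std::sort(data.begin(), data.end(), by_key);   // sort once, by key

    // Later: look up a key in O(log n) with std::equal_range.
    std::pair<std::vector<Entry>::iterator, std::vector<Entry>::iterator> r =
        std::equal_range(data.begin(), data.end(), Entry(2, ""), by_key);
    if (r.first != r.second) {
        // r.first->second holds the value for key 2
    }
}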
Use an std::unordered_map, which has much better insertion time complexity than std::map, as the reference mentions:
Complexity
Single element insertions:
Average case: constant.
Worst case: linear in container size.
Multiple elements insertion:
Average case: linear in the number of elements inserted.
Worst case: N*(size+1): number of elements inserted times the container size plus one.
May trigger a rehash (not included in the complexity above).
That's better than the logarithmic time complexity of std::map's insertion.
Note: std::map's insertion can enjoy "amortized constant if a hint is given and the position given is the optimal.". If that's the case for you, then use a map (if a vector is not applicable).
@n.m. provides a representative live demo.
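For reference, a rough sketch of the unordered_map approach (calling reserve up front is optional, but it avoids rehashing during the bulk insert):

#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> map;
    map.reserve(4816703);          // pre-size the buckets so the bulk insert never rehashes

    // ... for each (first, second) pair read from the input ...
    int first = 1;
    std::string second = "abc";
    map[first] = second;           // average O(1) per insertion
}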
One thing I can do is allocate a vector of size n and store all the data,
and then sort it using sort(begin(), end()). Alternatively, I can keep putting
the data into a map or set, which are ordered themselves, so I don't have to
sort afterwards. But in that case inserting an element may be more costly
due to the rearrangements (I guess).
So which is the optimal choice for minimal time over a wide range of n (the number of objects)?
It depends on the situation.
map and set are usually red-black trees; they have to do a lot of work to stay balanced, otherwise operations on them would be very slow. They also don't support random access. So if you only want to sort once, you shouldn't use them.
However, if you want to keep inserting elements into the container while maintaining order, map and set take O(log N) per insertion, while a sorted vector takes O(N). The latter is much slower, so if you insert and delete frequently, you should use map or set.
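To illustrate the O(N) cost of keeping a vector sorted while inserting, a small sketch (insert_sorted is just an illustrative helper):

#include <algorithm>
#include <vector>

// Insert `value` while keeping `v` sorted: the search is O(log N),
// but shifting the tail of the vector makes the whole insert O(N).
void insert_sorted(std::vector<int>& v, int value) {
    std::vector<int>::iterator pos = std::lower_bound(v.begin(), v.end(), value);
    v.insert(pos, value);
}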
The difference between the two is noticeable!
Using a set, you get O(log(N)) complexity for each element you insert. So in total you get O(N log(N)), i.e. the cost of sorting by repeated insertion.
Appending each element to a vector is amortized O(1), and sorting it is guaranteed O(N log(N)) since C++11 (before that, std::sort was only O(N log(N)) on average).
Once sorted, you can use std::binary_search to get the same lookup complexity as a set.
The API of using a vector as a set isn't the friendliest, although it does give nice performance benefits. This of course is only useful when you can bulk insert the data, or when the number of lookups is much larger than the number of modifications of the content. There are algorithms able to sort a partially sorted vector efficiently, if you have to extend it later on.
Finally, one has to remark that you don't have the same guarantees of iterator invalidation.
So, why are vectors better? Cache locality!
A vector has all data in a single memory block, hence the processor can do prefetching, while for a set the memory is scattered around, requiring a pointer dereference to find the next element. This makes vector a better set implementation than std::set for large data when you can live with the limitations.
To give you an idea, on the codebase I'm working on, we have several set and map implementations based on vectors which have their own constraints to operate within (for example: no erase, or no operator[]).
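A minimal sketch of such a vector-backed set (illustrative only, not the actual classes from that codebase): bulk-load, sort and deduplicate once, then answer membership queries with std::binary_search.

#include <algorithm>
#include <vector>

class VectorSet {
public:
    void bulk_insert(const std::vector<int>& values) {   // bulk inserts only, no erase
        data_.insert(data_.end(), values.begin(), values.end());
        std::sort(data_.begin(), data_.end());
        data_.erase(std::unique(data_.begin(), data_.end()), data_.end());
    }
    bool contains(int value) const {                      // O(log N), cache-friendly
        return std::binary_search(data_.begin(), data_.end(), value);
    }
private:
    std::vector<int> data_;
};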
As I read in the documentation, std::map is implemented with a binary search tree, and it keeps the elements sorted too.
I need to insert rapidly and retrieve rapidly elements. I also need to get the first lowest N elements from time to time.
I was thinking about using a std::map, is it a good choice? If it is, what is the time I would need to retrieve the lowest N elements? O(n*logn)?
Given you need both retrieval and the n smallest elements, I would say std::map is a reasonable choice. But depending on the exact access pattern, a sorted std::vector might be a good choice too.
I am not sure what you mean by retrieve. Reading k elements takes O(k) time (provided you do it sequentially using an iterator); removing them takes O(k log n) (n is the total number of elements), even if you do it sequentially using iterators.
You can use iterators to rapidly read through the lowest N elements. Going from begin() to the (N-1)th element takes O(N) time (getting the next element is amortized constant time for a std::map).
I'd note, however, that it is often actually faster to use a sorted std::vector with a binary chop search method to implement what it sounds like you are doing, so depending on your exact requirements this might be worth investigating.
The C++ standard requires that all required iterator operations (including iterator increment) be amortized constant time. Consequently, getting the first N items in a container must take amortized O(N) time.
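For example, a hedged sketch of reading out the values of the N smallest keys from a std::map (assuming int keys and double values; lowest_n is an illustrative name):

#include <cstddef>
#include <map>
#include <vector>

// Copy the values of the n smallest keys; sequential iteration is amortized O(n).
std::vector<double> lowest_n(const std::map<int, double>& m, std::size_t n) {
    std::vector<double> out;
    std::map<int, double>::const_iterator it = m.begin();
    for (std::size_t i = 0; i < n && it != m.end(); ++i, ++it)
        out.push_back(it->second);
    return out;
}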
I would say yes to both questions.
What would be an efficient implementation for a std::set insert member function? Because the data structure sorts elements based on std::less (operator < needs to be defined for the element type), it is conceptually easy to detect a duplicate.
How does it actually work internally? Does it make use of a red-black tree data structure (an implementation detail mentioned in Josuttis's book)?
Implementations of the standard data structures may vary...
I have a problem where I am forced to have (generally speaking) sets of integers whose elements must be unique. The length of the sets varies, so I need a dynamic data structure (based on my limited knowledge, this narrows things down to list or set). The elements do not necessarily need to be sorted, but there may be no duplicates. Since the candidate sets always contain a lot of duplicates (the sets are small, up to 64 elements), will trying to insert duplicates into a std::set with the insert member function cause a lot of overhead compared to a std::list plus some other algorithm that does not rely on the elements being sorted?
Additional: the output set has a fixed size of 27 elements. Sorry, I forgot this... this holds for a special case of the problem. For other cases, the length is arbitrary (smaller than the input set).
If you're creating the entire set all at once, you could try using std::vector to hold the elements, std::sort to sort them, and std::unique to prune out the duplicates.
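Roughly like this (a sketch for plain ints; make_unique_set is an illustrative name):

#include <algorithm>
#include <vector>

// Build the set in one go: sort, then erase adjacent duplicates.
std::vector<int> make_unique_set(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    values.erase(std::unique(values.begin(), values.end()), values.end());
    return values;
}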
The complexity of std::set::insert is O(log n), or amortized O(1) if you use the "positional" insert and get the position correct (see e.g. http://cplusplus.com/reference/stl/set/insert/).
The underlying mechanism is implementation-dependent. It's often a red-black tree, but this is not mandated. You should look at the source code for your favourite implementation to find out what it's doing.
For small sets, it's possible that e.g. a simple linear search on a vector will be cheaper, due to spatial locality. But the insert itself will require all the following elements to be copied. The only way to know for sure is to profile each option.
When you only have 64 possible values known ahead of time, just take a bit field and flip on the bits for the elements actually seen. That works in n+O(1) steps, and you can't get less than that.
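A sketch of that bit-field idea, assuming the 64 possible values can be mapped to bit positions 0..63:

#include <cstddef>
#include <cstdint>
#include <vector>

// Flip on one bit per value seen; duplicates cost nothing extra.
std::uint64_t to_bitset(const std::vector<int>& values) {
    std::uint64_t seen = 0;
    for (std::size_t i = 0; i < values.size(); ++i)
        seen |= std::uint64_t(1) << values[i];   // values[i] must be in [0, 63]
    return seen;
}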
Inserting into a std::set of size m takes O(log(m)) time and comparisons, meaning that using an std::set for this purpose will cost O(n*log(n)) and I wouldn't be surprised if the constant were larger than for simply sorting the input (which requires additional space) and then discarding duplicates.
Doing the same thing with an std::list would take O(n^2) average time, because finding the insertion place in a list needs O(n).
Inserting one element at a time into an std::vector would also take O(n^2) average time – finding the insertion place is doable in O(log(m)), but elements need to be moved to make room. If the number of elements in the final result is much smaller than the input, that drops down to O(n*log(n)), with close to no space overhead.
If you have a C++11 compiler or use boost, you could also use a hash table. I'm not sure about the insertion characteristics, but if the number of elements in the result is small compared to the input size, you'd only need O(n) time – and unlike the bit field, you don't need to know the potential elements or the size of the result a priori (although knowing the size helps, since you can avoid rehashing).
I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++; which container will have the lowest complexity on retrieval? In this case, the data does not grow after initialization. In C# I would use a Dictionary; in C++, I could use either a hash_map or a set. If the data were unordered, I would use Boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to expand a bit on what has already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees in the STL versions I have looked at) because of the requirements on the insertion, retrieval and deletion operations (and notably because of the iterator invalidation requirements).
However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:
minimal overhead (in terms of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken down into 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
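Spelled out, the two steps might look roughly like this (using the sample ints from the question):

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> values;
    // Step 1: push all values, then sort once.
    values.push_back(12346);
    values.push_back(12345);
    values.push_back(12454);
    std::sort(values.begin(), values.end());

    // Step 2: O(log n) membership test on contiguous memory.
    bool found = std::binary_search(values.begin(), values.end(), 12345);
    (void)found;
}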
If you don't want to reinvent the wheel, you can also check Alexandrescu's AssocVector. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is fairly big, then an unordered_set will be faster, especially because integers are trivial to hash: size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
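A sketch of that alternative, shown here with the C++11 std::unordered_set (boost::unordered_set offers essentially the same interface):

#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> data;          // the couple of hundred ordered ints
    data.push_back(12345);
    data.push_back(12346);
    data.push_back(12454);

    // One-time setup; integers hash trivially.
    std::unordered_set<int> set(data.begin(), data.end());

    bool present = set.count(12346) != 0;   // average O(1) per probe
    (void)present;
}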
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the distribution of its elements.
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(log n) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault, etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.
I need to find a number of objects from a large container.
The only way I can think of to do that seems to be to just search the container for one item at a time in a loop; however, even with an efficient search with an average case of, say, "log n" (where n is the size of the container), this gives me "m log n" (where m is the number of items I'm looking for) for the entire operation.
That seems highly suboptimal to me, and as it's something that I am likely to need to do on a frequent basis, it's something I'd definitely like to improve if possible.
Neither part has been implemented yet, so I'm open for suggestions on the format of the main container, the "list" of items I'm looking for, etc, as well as the actual search algorithm.
The items are complex objects, however the search key is just a simple integer.
Hash tables have basically O(1) lookup. This gives you O(m) to look up m items; obviously you can't look up m items faster than O(m), because you need to get the results out.
If you're purely doing look-up (you don't require ordered elements) and can give up some memory, try unordered_map (it's TR1, also implemented in Boost), which has constant-time amortized look-up.
In a game engine, we tested std::map and unordered_map, and while map was faster for insertions (if I recall), unordered_map blew it out of the water for retrieval. We had greater than 1000 elements in the map, for scale, which is fairly low compared to some other tasks you may be doing.
If you require elements to be ordered, your next bet is std::map, which has the look-up times you've posted, and keeps the elements ordered. In general, it also uses less memory than an unordered_map.
If your container is a vector and the elements are sorted, you can use std::lower_bound to search in O(log n) time. If your search items are also sorted, you can do a small optimization by always using the last found iterator as the start of the search for the next one, e.g.
std::vector<stuff> container;              // sorted by operator<
std::vector<stuff>::iterator it = container.begin();
for (std::size_t i = 0; i < search_items.size() && it != container.end(); ++i)
{
    it = std::lower_bound(it, container.end(), search_items[i]);
    // make sure the found item is a match
    if (it != container.end() && search_items[i] < *it)
        it = container.end(); // not present: break out early
}
if (it != container.end()) // found it!
boost/TR1 unordered_map and unordered_set are containers backed by a hash table, which gives you search in amortized constant time [O(1)].
Boost Unordered documentation.
I suppose if you have a sorted container and a uniform distribution of items, then the most efficient method would be a recursive bisection search with an execution path somewhat like a tree - calling itself twice whenever the objects being searched for fall into both halves of the bisection.
However, if you choose a container based on a hash-table (boost unordered set, I think?), or something similar, then lookup can be O(1), so searching in a loop really doesn't matter.
EDIT:
note that std::map and std::set are normally (always?) implemented using red-black trees, so lookups are only O(log n).
Are you sure that m log2(n) is actually going to be a problem? If you are using a std::map that is even relatively large, the number of actual comparisons is still pretty small - if you are looking up 10,000 elements in a map of 1,000,000, the number of comparisons should be about 200,000, or about 20 comparisons per target element. This really isn't bad if your key is just a simple integer.
If you were hashing something that didn't already have a nice key, then I would say go with boost::unordered_map. I would implement it with std::map first, profile it, and then decide if you want to make the next jump to Boost.
If you're frequently performing the same projections on your collection, such as extracting elements with a key of "42", you could consider maintaining these subsets in buckets. You'd internally maintain a hashmap from keys to vectors of elements with that key, and add elements to the appropriate bucket as well as your primary collection representing "everything". Extracting a subgroup of elements is constant time (because the respective collections have already been built), and the memory overhead of maintaining the buckets scales primarily with the number of unique keys in your dataset.
This technique is decidedly less effective if you have a large number of unique key values, and it makes insertions and removals more expensive, but it's good for some situations - I thought it was at least worth mentioning.
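A sketch of that bucketing idea, assuming the objects are keyed by a simple int and owned elsewhere (Item and BucketedIndex are illustrative names):

#include <unordered_map>
#include <vector>

struct Item { int key; /* other fields */ };

// Primary collection plus per-key buckets kept in sync on insertion.
class BucketedIndex {
public:
    void add(Item* item) {
        all_.push_back(item);
        buckets_[item->key].push_back(item);   // maintain the bucket for this key
    }
    // "Everything with this key" in constant time: the bucket was built up front.
    const std::vector<Item*>& with_key(int key) const {
        static const std::vector<Item*> empty;
        std::unordered_map<int, std::vector<Item*> >::const_iterator it = buckets_.find(key);
        return it == buckets_.end() ? empty : it->second;
    }
private:
    std::vector<Item*> all_;                               // the "everything" collection
    std::unordered_map<int, std::vector<Item*> > buckets_; // key -> elements with that key
};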