std::set<T>::insert, duplicate elements - c++

What would be an efficient implementation for a std::set insert member function? Because the data structure sorts elements based on std::less (operator < needs to be defined for the element type), it is conceptually easy to detect a duplicate.
How does it actually work internally? Does it make use of the red-black tree data structure (mentioned as an implementation detail in Josuttis's book)?
Implementations of the standard data structures may vary...
I have a problem where I need sets of integers whose elements must be unique. The length of the sets varies, so I need a dynamic data structure (based on my limited knowledge, this narrows things down to list and set). The elements do not need to be sorted, but there may be no duplicates. Since the candidate sets always contain a lot of duplicates (the sets are small, up to 64 elements), will trying to insert duplicates into std::set with the insert member function cause a lot of overhead compared to std::list and another algorithm that does not rely on the elements being sorted?
Additional: the output set has a fixed size of 27 elements. Sorry, I forgot this... this works for a special case of the problem. For other cases, the length is arbitrary (lower than the input set).

If you're creating the entire set all at once, you could try using std::vector to hold the elements, std::sort to sort them, and std::unique to prune out the duplicates.
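A minimal sketch of that approach, assuming the elements are plain ints and the whole input is available up front. The helper name unique_sorted is made up for illustration:

```cpp
#include <algorithm>
#include <vector>

// Returns the distinct values of `input` in ascending order.
// `unique_sorted` is a hypothetical helper name, not a standard function.
std::vector<int> unique_sorted(std::vector<int> input) {
    std::sort(input.begin(), input.end());
    // std::unique keeps the first element of each run of equal values and
    // returns the new logical end; erase() then drops the leftover tail.
    input.erase(std::unique(input.begin(), input.end()), input.end());
    return input;
}
```

Calling unique_sorted({3, 1, 3, 2, 1}) should yield a vector holding 1, 2, 3.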

The complexity of std::set::insert is O(log n), or amortized O(1) if you use the "positional" insert and get the position correct (see e.g. http://cplusplus.com/reference/stl/set/insert/).
The underlying mechanism is implementation-dependent. It's often a red-black tree, but this is not mandated. You should look at the source code for your favourite implementation to find out what it's doing.
For small sets, it's possible that e.g. a simple linear search on a vector will be cheaper, due to spatial locality. But the insert itself will require all the following elements to be copied. The only way to know for sure is to profile each option.
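As a rough illustration of the two insert forms discussed here: the plain overload reports through its return value whether the element was a duplicate, and the positional overload takes a hint. The helper names are hypothetical, and the hinted version assumes the input is already sorted in ascending order and that the set starts out empty, so that end() is the right hint each time:

```cpp
#include <set>
#include <vector>

// Inserts every value and counts how many were rejected as duplicates.
// `count_rejected` is a hypothetical helper for illustration.
int count_rejected(const std::vector<int>& input, std::set<int>& out) {
    int rejected = 0;
    for (int v : input) {
        // insert() returns {iterator, bool}; the bool is false if v was already present.
        if (!out.insert(v).second) ++rejected;
    }
    return rejected;
}

// Positional insert. ASSUMPTION: `sorted` is ascending and `out` starts empty,
// so each new value belongs at (or duplicates the element just before) end().
void fill_from_sorted(const std::vector<int>& sorted, std::set<int>& out) {
    for (int v : sorted) out.insert(out.end(), v);
}
```

If the hint is wrong the insert is still correct, it just falls back to the logarithmic cost.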

When you only have 64 possible values known ahead of time, just take a bit field and flip on the bits for the elements actually seen. That works in n+O(1) steps, and you can't get less than that.
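A sketch of the bit field idea, under the assumption that every value lies in the range 0 to 63 so that one 64-bit word is enough. The function names are invented for this example:

```cpp
#include <cstdint>
#include <vector>

// Sets one bit per value seen. ASSUMPTION: every value is in [0, 64).
std::uint64_t seen_mask(const std::vector<int>& input) {
    std::uint64_t mask = 0;
    for (int v : input) mask |= (std::uint64_t{1} << v);
    return mask;
}

// Expands the mask back into the distinct values, in ascending order.
std::vector<int> values_in(std::uint64_t mask) {
    std::vector<int> out;
    for (int v = 0; v < 64; ++v)
        if (mask & (std::uint64_t{1} << v)) out.push_back(v);
    return out;
}
```

Duplicates cost nothing extra here, since setting an already-set bit is a no-op.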
Inserting into a std::set of size m takes O(log(m)) time and comparisons, meaning that using an std::set for this purpose will cost O(n*log(n)) and I wouldn't be surprised if the constant were larger than for simply sorting the input (which requires additional space) and then discarding duplicates.
Doing the same thing with an std::list would take O(n^2) average time, because finding the insertion place in a list needs O(n).
Inserting one element at a time into an std::vector would also take O(n^2) average time – finding the insertion place is doable in O(log(m)), but elements need to be moved to make room. If the number of elements in the final result is much smaller than the input, that drops down to O(n*log(n)), with close to no space overhead.
If you have a C++11 compiler or use boost, you could also use a hash table. I'm not sure about the insertion characteristics, but if the number of elements in the result is small compared to the input size, you'd only need O(n) time – and unlike the bit field, you don't need to know the potential elements or the size of the result a priori (although knowing the size helps, since you can avoid rehashing).
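For the hash table variant, something along these lines might do, assuming C++11 and int elements. The name unique_unordered is hypothetical, and reserving input.size() buckets is only an upper-bound guess at the result size:

```cpp
#include <unordered_set>
#include <vector>

// Returns the distinct values of `input` in order of first appearance.
// `unique_unordered` is a hypothetical helper name.
std::vector<int> unique_unordered(const std::vector<int>& input) {
    std::unordered_set<int> seen;
    seen.reserve(input.size());  // upper bound on the result size; avoids rehashing
    std::vector<int> out;
    for (int v : input)
        if (seen.insert(v).second) out.push_back(v);  // true only for the first occurrence
    return out;
}
```

Unlike the sort-based approaches, this keeps the input order of the surviving elements.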

Related

Is there a data structure like a C++ std set which also quickly returns the number of elements in a range?

In a C++ std::set (often implemented using red-black binary search trees), the elements are automatically sorted, and key lookups and deletions in arbitrary positions take time O(log n) [amortised, i.e. ignoring reallocations when the size gets too big for the current capacity].
In a sorted C++ std::vector, lookups are also fast (actually probably a bit faster than std::set), but insertions are slow (since maintaining sortedness takes time O(n)).
However, sorted C++ std::vectors have another property: they can find the number of elements in a range quickly (in time O(log n)).
i.e., a sorted C++ std::vector can quickly answer: how many elements lie between given x,y?
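For concreteness, the sorted-vector range count can be sketched like this, assuming int elements and an inclusive range. The function name is made up:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of elements e with x <= e <= y in a vector sorted ascending.
// Two binary searches, then a constant-time iterator subtraction.
std::size_t count_in_range(const std::vector<int>& sorted, int x, int y) {
    if (y < x) return 0;
    auto first = std::lower_bound(sorted.begin(), sorted.end(), x);
    auto last = std::upper_bound(sorted.begin(), sorted.end(), y);
    return static_cast<std::size_t>(last - first);
}
```

The subtraction is what makes this cheap: vector iterators are random-access, so the distance is O(1).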
std::set can quickly find iterators to the start and end of the range, but gives no clue how many elements are within.
So, is there a data structure that allows all the speed of a C++ std::set (fast lookups and deletions), but also allows fast computation of the number of elements in a given range?
(By fast, I mean time O(log n), or maybe a polynomial in log n, or maybe even sqrt(n). Just as long as it's faster than O(n), since O(n) is almost the same as the trivial O(n log n) to search through everything).
(If not possible, even an estimate of the number to within a fixed factor would be useful. For integers a trivial upper bound is y-x+1, but how to get a lower bound? For arbitrary objects with an ordering there's no such estimate).
EDIT: I have just seen the related question, which essentially asks whether one can compute the number of preceding elements. (Sorry, my fault for not seeing it before). This is clearly trivially equivalent to this question (to get the number in a range, just compute the start/end elements and subtract, etc.)
However, that question also allows the data to be computed once and then be fixed, unlike here, so that question (and the sorted vector answer) isn't actually a duplicate of this one.
The data structure you're looking for is an Order Statistic Tree.
It's typically implemented as a binary search tree in which each node additionally stores the size of its subtree.
Unfortunately, I'm pretty sure the STL doesn't provide one.
All data structures have their pros and cons, which is why the standard library offers a bunch of containers.
And the rule is that there is often a balance between quickness of modifications and quickness of data extraction. Here you would like to easily access the number of elements in a range. A possibility in a tree based structure would be to cache in each node the number of elements of its subtree. That would add an average log(N) operations (the height of the tree) on each insertion or deletion, but would highly speedup the computation of the number of elements in a range. Unfortunately, few classes from the C++ standard library are tailored for derivation (and AFAIK std::set is not) so you will have to implement your tree from scratch.
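To make the subtree-size idea concrete, here is a deliberately simplified sketch, not a production container. It assumes distinct int keys and C++14, and it does no rebalancing, so the O(log n) bounds only hold while the tree happens to stay shallow. A real version would add the same size bookkeeping to a balanced tree and would also support erase. All names are invented for the example:

```cpp
#include <cstddef>
#include <memory>

// Node of an UNBALANCED binary search tree that caches its subtree size.
struct Node {
    int key;
    std::size_t size = 1;  // number of nodes in the subtree rooted here
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

class CountingTree {
public:
    // Inserts key; returns false (and changes nothing) if it is already present.
    bool insert(int key) { return insert_at(root_, key); }

    // Number of stored keys strictly less than `key`.
    std::size_t count_less(int key) const { return count(key, false); }

    // Number of stored keys k with x <= k <= y.
    std::size_t count_in_range(int x, int y) const {
        if (y < x) return 0;
        return count(y, true) - count(x, false);
    }

private:
    static std::size_t size_of(const Node* n) { return n ? n->size : 0; }

    // Walks one root-to-leaf path, adding whole left subtrees via cached sizes.
    std::size_t count(int key, bool inclusive) const {
        std::size_t n = 0;
        for (const Node* cur = root_.get(); cur != nullptr;) {
            bool go_right = inclusive ? !(key < cur->key) : (cur->key < key);
            if (go_right) {
                n += 1 + size_of(cur->left.get());  // cur and its left subtree qualify
                cur = cur->right.get();
            } else {
                cur = cur->left.get();
            }
        }
        return n;
    }

    static bool insert_at(std::unique_ptr<Node>& slot, int key) {
        if (!slot) { slot = std::make_unique<Node>(key); return true; }
        if (key == slot->key) return false;
        bool added = insert_at(key < slot->key ? slot->left : slot->right, key);
        if (added) ++slot->size;  // update cached sizes on the way back up
        return added;
    }

    std::unique_ptr<Node> root_;
};
```

The range query touches only two root-to-leaf paths, which is where the speedup over walking the range comes from.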
Maybe you are looking for a C++ alternative to Java's LinkedHashSet: https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashSet.html.

slow std::map for large entries

We have 4,816,703 entries in this format.
1 abc
2 def
...
...
4816702 blah
4816703 blah_blah
Since the number of entries is quite big, I am worried that std::map would take a long time during insertion, since it needs to rebalance on each insertion as well.
Only inserting these entries into the map takes a lot of time. I am doing
map[first] = second;
Two questions:
1. Am I correct in using std::map for this kind of case?
2. Am I correct in inserting in the way shown above, or should I use map.insert()?
I am sorry for not doing the experiments and giving absolute numbers, but we want a general consensus on whether we are doing the right thing or not.
Also, the keys are not always consecutive.
P.S. Of-course, later we will need to access that map as well to get the values corresponding to the keys.
If you don’t need to insert into the map afterwards, you can construct an unsorted vector of your data, sort it according to the key, and then search using functions like std::equal_range.
It’s the same complexity as std::map, but with far fewer allocations.
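A sketch of what that could look like for int keys and string values. The Entry alias and both function names are assumptions made for illustration. Since the keys are unique here, the example uses std::lower_bound rather than std::equal_range, which finds the same element with one search instead of two:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;  // hypothetical alias: key, value

// Sort by key once, after all entries have been appended.
void sort_by_key(std::vector<Entry>& entries) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
}

// Binary search in the sorted entries; returns nullptr if the key is absent.
const std::string* find_value(const std::vector<Entry>& entries, int key) {
    auto it = std::lower_bound(entries.begin(), entries.end(), key,
                               [](const Entry& e, int k) { return e.first < k; });
    if (it == entries.end() || it->first != key) return nullptr;
    return &it->second;
}
```

The returned pointer is only valid as long as the vector is not modified afterwards.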
Use an std::unordered_map, which has much better insertion time complexity than std::map, as the reference mentions:
Complexity
Single element insertions:
Average case: constant.
Worst case: linear in container size.
Multiple elements insertion:
Average case: linear in the number of elements inserted.
Worst case: N*(size+1): number of elements inserted times the container size plus one.
May trigger a rehash (not included in the complexity above).
That's better than the logarithmic time complexity of std::map's insertion.
Note: std::map's insertion can enjoy "amortized constant if a hint is given and the position given is the optimal.". If that's the case for you, then use a map (if a vector is not applicable).
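Both options might look roughly as follows. This is a sketch with invented names (Entry, build_hashed, build_ordered). The ordered version assumes the entries arrive sorted by ascending key, which is what makes end() the optimal hint:

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;  // hypothetical alias: key, value

// Hash map: reserve up front so the loop itself should not trigger a rehash.
std::unordered_map<int, std::string> build_hashed(const std::vector<Entry>& entries) {
    std::unordered_map<int, std::string> m;
    m.reserve(entries.size());
    for (const Entry& e : entries) m.emplace(e.first, e.second);
    return m;
}

// Ordered map with a hint. ASSUMPTION: `entries` is sorted ascending by key,
// so every new element belongs immediately before end().
std::map<int, std::string> build_ordered(const std::vector<Entry>& entries) {
    std::map<int, std::string> m;
    for (const Entry& e : entries) m.emplace_hint(m.end(), e.first, e.second);
    return m;
}
```

If the input turns out not to be sorted, build_ordered still produces a correct map; it only loses the amortized constant-time benefit.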
@n.m. provides a representative live demo.

Storing in std::map/std::set vs sorting a vector after storing all data

Language: C++
One thing I can do is allocate a vector of size n and store all data
and then sort it using sort(begin(),end()). Else, I can keep putting
the data in a map or set which are ordered itself so I don't have to
sort afterwards. But in this case inserting an element may be more
costly due to rearrangements(I guess).
So which is the optimal choice for minimal time for a wide range of n(no. of objects)
It depends on the situation.
map and set are usually red-black trees. They have to do a lot of work to stay balanced, otherwise operations on them would be very slow, and they don't support random access. So if you only want to sort once, you shouldn't use them.
However, if you want to keep inserting elements into the container while maintaining order, map and set take O(logN) time per insertion, while the sorted vector takes O(N). The latter is much slower, so if you insert and delete frequently, you should use map or set.
The difference between the 2 is noticeable!
Using a set, you get O(log(N)) complexity for each element you insert. So in total you get O(N log(N)), which is the same complexity as a comparison sort (this is essentially a tree sort).
Appending each element to a vector is amortized O(1), so O(N) in total, and sorting it afterwards is O(N log(N)) since C++11 (before that, std::sort was only required to be O(N log(N)) on average).
Once sorted, you could use binary_search to have the same complexity as in a set.
The API of using a vector as a set isn't the friendliest, although it does give nice performance benefits. This is of course only useful when you can do a bulk insert of the data, or when the number of lookups is much larger than the number of modifications. There are algorithms that can sort a partially sorted vector, for when you have to extend it later on.
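A small sketch of the vector-as-set idea, with hypothetical helper names. For the later extension step it uses std::inplace_merge, which is one possible choice and not necessarily the algorithm the answer had in mind. Duplicates are kept, so this behaves like a multiset unless std::unique is applied as well:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Bulk build: append everything first, then sort once.
void finalize(std::vector<int>& v) { std::sort(v.begin(), v.end()); }

// Lookup in O(log N), like std::set::count() != 0.
bool contains(const std::vector<int>& sorted, int x) {
    return std::binary_search(sorted.begin(), sorted.end(), x);
}

// Extends an already sorted vector with a new batch: sort only the new tail,
// then merge the two sorted halves in place.
void extend_sorted(std::vector<int>& sorted, const std::vector<int>& batch) {
    auto old_size = sorted.size();
    sorted.insert(sorted.end(), batch.begin(), batch.end());
    auto middle = sorted.begin() + static_cast<std::ptrdiff_t>(old_size);
    std::sort(middle, sorted.end());
    std::inplace_merge(sorted.begin(), middle, sorted.end());
}
```

Note that extend_sorted invalidates all iterators into the vector, which is one of the guarantees std::set gives and this approach does not.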
Finally, one has to remark that you don't have the same guarantees of iterator invalidation.
So, why are vectors better? Cache locality!
A vector has all its data in a single memory block, so the processor can prefetch it, while for a set the memory is scattered all over the place and each node has to be loaded before the address of the next one is known. This makes a vector a better set implementation than std::set for large data, when you can live with the limitations.
To give you an idea, on the codebase I'm working on, we have several set and map implementations based on vectors which have their own narratives to function in. (For example: no erase or no operator[])

C++: container replacement for vector/deque for huge sizes

So my application has containers with 100 million and more elements.
I'm on the hunt for a container which behaves - time-wise - better than std::deque (let alone std::vector) with respect to frequent insertions and deletions all over the container ... including near the middle. Access time to the n-th element does not need to be as fast as vector, but should definitely be better than full traversal like in std::list (which has a huge memory overhead per element anyway).
Elements should be treated ordered by index (like vector, deque, list), so std::set or std::unordered_set also do not work well.
Before I sit down and code such a container myself: has anyone seen such a beast already? I'm pretty sure the STL hasn't anything like this, looking to BOOST I did not find something I could use but I may be wrong.
Any hints?
There's a whole STL replacement for big data, in case your app is centric to such data:
STXXL - http://stxxl.sourceforge.net/
edit: I was actually a bit fast to answer. 100 million is not really a large number. E.g., if each element is one byte, you could save it in a 96MiB array. So for STXXL to be of any use, the size of an element would have to be significantly bigger.
I think you can get the performance characteristics that you want with a skip list:
https://en.wikipedia.org/wiki/Skip_list#Indexable_skiplist
It's the "indexable" part that you're interested in, of course -- you don't actually want the items to be sorted. So some modification is needed that I leave as an exercise.
You might find that 100 million list nodes begins to strain a 32 bit address space, but probably not an issue in 64 bits.
1) If the data is highly sparse, i.e. has lots of zeroes or can be expressed as such, I would highly recommend a data structure that takes advantage of that:
sparselib++ for matrices
sparsehash for hash maps
2) Hash maps should do O(1) for all the operations you describe and the sparsehash implementation I mentioned earlier is particularly space-efficient; it also includes a sparsetable type which is a bit more low-level and can be used in place of an array.
3) If the strict ordering is not that important (it probably is, because you mentioned elements should be treated ordered by index), you can swap the elements you want to erase to the end of the vector and then resize to do removal in O(1). Insertion would just be push_back.
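The swap-to-the-end removal from point 3 can be sketched as below. The name unordered_erase is made up, and the function assumes index is valid. It overwrites the erased slot with the last element instead of literally swapping, which has the same effect:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Removes v[index] in O(1) by moving the last element into its place.
// Element order is NOT preserved. ASSUMPTION: index < v.size().
template <typename T>
void unordered_erase(std::vector<T>& v, std::size_t index) {
    if (index + 1 != v.size()) v[index] = std::move(v.back());  // skip self-move
    v.pop_back();
}
```

After unordered_erase(v, 1) on a vector holding 10, 20, 30, 40, the contents are 10, 40, 30.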
Try a hash map. The STL has several, all with the unordered naming prefix, such as unordered_map, etc. They have constant-time insertion and lookup given a good hashing algorithm. With your 'huge' data set the hash map would most likely cover your needs. Making a slight change to the application to cover the differences in the interfaces is trivial.

Why is inserting multiple elements into a std::set simultaneously faster?

I'm reading through:
"The C++ Standard Library: A Tutorial and Reference by Nicolai M.
Josuttis"
and I'm in the section about Sets & Multisets. I came across a line regarding Inserting and Removing elements:
"Inserting and removing happens faster if, when working with multiple
elements, you use a single call for all elements rather than multiple
calls."
I'm far from a data structures master, but I'm aware that they're implemented with red-black trees. What I don't understand from that is how would the STL implementers write an algorithm to insert multiple elements at once in a faster manner?
Can anyone shed some light on why this quote is true for me?
My first thought was that it might rebalance the tree only after inserting/erasing the whole range. Since the whole operation is inlined in practice, that seems more likely than the number of function calls.
Checking the GCC headers on my local machine, this doesn't seem to be the case - and anyway, I don't know how the tradeoff between reduced rebalancing activity, and potentially increased search times for intermediate inserts to an unbalanced tree, would work out.
Maybe it's considered a QoI issue, but in any case, using the most expressive method is probably best, not just because it saves you writing a for loop and shows your intention most clearly, but because it leaves library writers latitude to optimise more aggressively in the future, without you needing to know and change your code.
There are two reasons:
1) Making a single call for multiple elements instead of N times more calls.
2) The insertion operation checks for each element inserted whether another element exists already in the container with the same value. This can be optimized when inserting multiple elements together.
What you quoted is wrong. Inserting into a std::set is O(log n), unless you use the insert() overload with the position iterator, in which case it is amortized O(1) when the position is correct. But if you use the range overload with sorted elements, then you get O(n) insertion for the whole range.
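For reference, the two ways of inserting several elements look like this. The function names are hypothetical. Both produce the same set; whether the range call is actually faster depends on the implementation and on whether the input is sorted, so this only shows the two call shapes:

```cpp
#include <set>
#include <vector>

// One insert() call per element: each call searches the tree from the root.
std::set<int> from_loop(const std::vector<int>& values) {
    std::set<int> s;
    for (int v : values) s.insert(v);
    return s;
}

// Single call with the range overload; the library is free to exploit
// an already sorted input here.
std::set<int> from_range(const std::vector<int>& values) {
    std::set<int> s;
    s.insert(values.begin(), values.end());
    return s;
}
```

Timing both on your own data and standard library is the only reliable way to see whether the difference matters.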
Memory management could be a good reason. In this case it could allocate the memory just once. If all elements are inserted separately, each call tries to allocate memory separately. As far as I know, most set and map implementations try to keep the memory in the same page, or in pages close together, to minimize page faults.
I'm not sure about this, but I think that if the number of elements inserted is smaller than the number of elements in the set, then it can be more efficient to sort the inserted range before performing the insertions. This way, all values can be inserted in a single pass over the tree, and duplicates in the inserted range can be easily eliminated (or inserted very fast in the case of a multiset).
Of course, this optimization is only possible if the input iterators allow sorting the input range (i.e. if they are random-access iterators).