Insert a sorted range into std::set with hint - c++

Assume I have a std::set (which is by definition sorted), and I have another range of sorted elements (for the sake of simplicity, in a different std::set object). Also, I have a guarantee that all values in the second set are larger than all the values in the first set.
I know I can efficiently insert one element into std::set - if I pass a correct hint, this will be amortized O(1). I know I can insert any range into std::set, but as no hint is passed, this will be O(k log N) (where k is the number of new elements, and N the number of old elements).
Can I insert a range in a std::set and provide a hint? The only way I can think of is to do k single inserts with a hint, which does push the complexity of the insert operations in my case down to O(k):
std::set<int> bigSet{1,2,5,7,10,15,18};
std::set<int> biggerSet{50,60,70};
for (auto bigElem : biggerSet)
    bigSet.insert(bigSet.end(), bigElem);

First of all, to do the merge you're talking about, you probably want to use set's (or map's) merge member function, which lets you merge an existing set into this one. The advantage of doing this (and the reason you might not want to, depending on your usage pattern) is that the items being merged in are actually moved from one set to the other, so you don't have to allocate new nodes (which can save a fair amount of time). The disadvantage is that the nodes then disappear from the source set, so if you need each local histogram to remain intact after being merged into the global histogram, you don't want to do this.
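As a rough illustration of that member function (C++17), reusing the sets from the question:

#include <set>

int main() {
    std::set<int> bigSet{1,2,5,7,10,15,18};
    std::set<int> biggerSet{50,60,70};

    // C++17: splices the nodes of biggerSet into bigSet; no new nodes are
    // allocated. Elements whose keys already exist in bigSet would stay
    // behind in biggerSet (none do here, so biggerSet ends up empty).
    bigSet.merge(biggerSet);
}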
You can typically do better than O(log N) when searching a sorted vector. Assuming a reasonably predictable distribution, you can use an interpolating search to do a search in (typically) around O(log log N), often called "pseudo-constant" complexity.
Given that you only do insertions relatively infrequently, you might also consider a hybrid structure. This starts with a small chunk of data that you don't keep sorted. When you reach an upper bound on its size, you sort it and insert it into a sorted vector. Then you go back to adding items to your unsorted area. When it reaches the limit, again sort it and merge it with the existing sorted data.
Assuming you limit the unsorted chunk to no larger than log(N) elements, search complexity is still O(log N): one O(log N) binary search (or O(log log N) interpolating search) on the sorted chunk, plus a linear search over the at-most-log(N) elements of the unsorted chunk. Once you've verified that an item doesn't exist yet, adding it has constant complexity (just tack it onto the end of the unsorted chunk). The big advantage is that this can still easily use a contiguous structure such as a vector, so it's much more cache friendly than a typical tree structure.
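A rough sketch of such a hybrid structure, with illustrative names and a fixed staging size instead of the log(N) bound described above:

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

class HybridSet {
    std::vector<int> sorted_;    // kept sorted at all times
    std::vector<int> unsorted_;  // small staging area
    std::size_t maxUnsorted_ = 32;

public:
    bool contains(int value) const {
        // Binary search in the sorted part, linear scan over the small tail.
        return std::binary_search(sorted_.begin(), sorted_.end(), value) ||
               std::find(unsorted_.begin(), unsorted_.end(), value) != unsorted_.end();
    }

    void insert(int value) {
        if (contains(value)) return;   // keep elements unique
        unsorted_.push_back(value);    // constant cost: tack it onto the end
        if (unsorted_.size() >= maxUnsorted_) {
            // Sort the staging area and merge it into the sorted part.
            std::sort(unsorted_.begin(), unsorted_.end());
            std::vector<int> merged;
            merged.reserve(sorted_.size() + unsorted_.size());
            std::merge(sorted_.begin(), sorted_.end(),
                       unsorted_.begin(), unsorted_.end(),
                       std::back_inserter(merged));
            sorted_ = std::move(merged);
            unsorted_.clear();
        }
    }
};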
Since your global histogram is (apparently) only ever populated with data coming from the local histograms, it might be worth considering just keeping it in a vector, and when you need to merge in the data from one of the local chunks, just use std::merge to take the existing global histogram and the local histogram, and merge them together into a new global histogram. This has O(N + M) complexity (N = size of global histogram, M = size of local histogram). Depending on the typical size of a local histogram, this could pretty easily work out as a win.
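A minimal sketch of that merge step, assuming both histograms are kept as sorted vectors (the names are illustrative):

#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> mergeHistograms(const std::vector<int>& global,
                                 const std::vector<int>& local) {
    std::vector<int> merged;
    merged.reserve(global.size() + local.size());
    std::merge(global.begin(), global.end(),   // O(N + M) single pass
               local.begin(), local.end(),
               std::back_inserter(merged));
    return merged;
}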

Merging two sorted containers is much quicker than sorting. Its complexity is O(N), so in theory what you say makes sense. It's the reason why merge sort is one of the quickest sorting algorithms: if you look up its pseudo-code, what you are doing is just one pass of the main loop.
You will also find the algorithm implemented in the STL as std::merge. It takes iterator pairs, so it works with any container; I would suggest std::vector as the default container for the new elements. Sorting a vector is a very fast operation. You may even find it better to use a sorted vector instead of a set for the output. You can always use std::lower_bound to get O(log N) lookups on a sorted vector.
Vectors have many advantages compared with set/map, not least of which is that they are very easy to visualise in a debugger :-)
(The example code on the std::merge reference page shows how to use it with vectors.)
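For example, a lookup on a sorted vector might look like this (a sketch, not part of the original answer):

#include <algorithm>
#include <vector>

bool containsSorted(const std::vector<int>& sortedVec, int value) {
    // O(log N) per lookup; sortedVec must already be sorted.
    auto it = std::lower_bound(sortedVec.begin(), sortedVec.end(), value);
    return it != sortedVec.end() && *it == value;
}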

You can merge the sets more efficiently using special functions for that.
In case you insist, insert returns information about the inserted location.
iterator insert( const_iterator hint, const value_type& value );
Code:
std::set<int> bigSet{1,2,5,7,10,15,18};
std::set<int> biggerSet{50,60,70};
auto hint = bigSet.cend();
for (auto& bigElem : biggerSet)
    hint = bigSet.insert(hint, bigElem);
This assumes, of course, that you are inserting new elements that will end up together or close in the final set. Otherwise there is not much to gain, only the fact that since the source is a set (it is ordered), about half of the tree will not be looked up.
There is also a member function
template< class InputIt > void insert( InputIt first, InputIt last );
That might or might not do something like this internally.
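For reference, using that overload with the sets from the question is a one-liner (whether the implementation exploits the sortedness of the input is up to it):

bigSet.insert(biggerSet.begin(), biggerSet.end());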

Related

Efficient way to generate sorted vector of pairs in C++

I am attempting to generate a sorted vector std::vector<std::pair<int,int>> from a set of integers. While iterating over the integer set, each integer becomes the first value of a pair and the second value of the pair is computed by another function. All pairs should be sorted according to the second value in my final vector. To this end I have two plans for generating the vector: either I "insert" each pair at the correct position (by iterating from the vector's begin and comparing the second values), or I simply push_back all pairs and sort the whole vector at the end (with std::sort and a suitable compare function). So which plan would be more efficient? (Or is there an even better approach?)
Generating the whole vector and sorting as the final step is almost certainly the way to go. Assuming the generated values are semi-random, attempting to insert to maintain sorted order would involve a continuously growing O(n) insertion cost (the O(log n) binary search cost is less of a problem in theory, but given the semi-random access, might be worse than the insert cost in practice), making the overall construction time O(n²) (n insertions costing O(n) work each).
By contrast, generating the whole thing and sorting at the end is O(n) to build the vector, and O(n log n) to sort it.
The only time to consider inserting into it preserving order is when you have a small number of items to insert into a large existing vector. Of course, if you're in that scenario, you're probably better off using a std::set or std::multiset (or in this case, a std::multimap<int, int> mapping your generated values to the int that generated them) to make the modification work consistently O(log n) per operation.
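A hedged sketch of the build-then-sort plan; computeSecond is a placeholder for whatever function produces the pair's second value:

#include <algorithm>
#include <set>
#include <utility>
#include <vector>

int computeSecond(int x) { return x * x % 97; }   // illustrative only

std::vector<std::pair<int, int>> buildSortedPairs(const std::set<int>& input) {
    std::vector<std::pair<int, int>> pairs;
    pairs.reserve(input.size());                  // O(n) to build
    for (int value : input)
        pairs.emplace_back(value, computeSecond(value));

    // O(n log n): sort once at the end, by the pair's second value.
    std::sort(pairs.begin(), pairs.end(),
              [](const auto& a, const auto& b) { return a.second < b.second; });
    return pairs;
}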

Storing in std::map/std::set vs sorting a vector after storing all data

Language: C++
One thing I can do is allocate a vector of size n, store all the data, and then sort it using sort(begin(), end()). Alternatively, I can keep putting the data into a map or set, which are ordered themselves, so I don't have to sort afterwards. But in this case inserting an element may be more costly due to the rearranging (I guess).
So which is the optimal choice for minimal time over a wide range of n (number of objects)?
It depends on the situation.
map and set are usually red-black trees; they have to do a lot of work to stay balanced, otherwise operations on them would be very slow. They also don't support random access, so if you only want to sort once, you shouldn't use them.
However, if you want to keep inserting elements into the container while maintaining order, map and set take O(log N) per insertion, while a sorted vector takes O(N). The latter is much slower, so if you insert and delete frequently, you should use map or set.
The difference between the two is noticeable!
Using a set, you get O(log N) complexity for each element you insert, so inserting all N elements costs O(N log N).
Appending everything to a vector is amortized O(1) per element, and sorting it is guaranteed O(N log N) since C++11 (before that, std::sort was only required to be O(N log N) on average).
Once sorted, you can use std::binary_search (or std::lower_bound) to get the same lookup complexity as with a set.
The API of using a vector as a set isn't the friendliest, although it does give nice performance benefits. This is of course only useful when you can do a bulk insert of the data, or when the number of lookups is much larger than the number of modifications to the content. There are also algorithms that can efficiently sort a partially sorted vector, for when you have to extend it later on.
Finally, one has to remark that you don't have the same guarantees of iterator invalidation.
So, why are vectors better? Cache locality!
A vector has all its data in a single memory block, so the processor can prefetch it, while for a set the memory is scattered around the place, requiring a pointer chase to find the next node's address. This makes a vector a better set implementation than std::set for large data, when you can live with the limitations.
To give you an idea: in the codebase I'm working on, we have several set and map implementations based on vectors, each with its own restrictions on how it may be used (for example: no erase, or no operator[]).
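A rough sketch of what such a vector-backed set can look like (illustrative only, not the poster's code):

#include <algorithm>
#include <vector>

class VectorSet {
    std::vector<int> data_;   // always kept sorted and unique

public:
    bool contains(int value) const {
        return std::binary_search(data_.begin(), data_.end(), value);
    }

    // O(log N) to find the position, O(N) to shift elements for the insert.
    void insert(int value) {
        auto it = std::lower_bound(data_.begin(), data_.end(), value);
        if (it == data_.end() || *it != value)
            data_.insert(it, value);
    }
};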

STL priority_queue<pair> vs. map

I need a priority queue that will store a value for every key, not just the key. I think the viable options are std::multimap<K,V>, since it iterates in key order, or std::priority_queue<std::pair<K,V>>, since it sorts on K before V. Is there any reason I should prefer one over the other, other than personal preference? Are they really the same, or did I miss something?
A priority queue is heapified initially, in O(N) time, and then popping all the elements in decreasing order takes O(N log N) time. It is stored in a std::vector behind the scenes, so there's only a small constant factor on top of the big-O behaviour. Part of that cost, though, is moving the elements around inside the vector; if sizeof(K) or sizeof(V) is large, it will be a bit slower.
std::map is a red-black tree (in universal practice), so it takes O(N log N) time to insert the elements, keeping them sorted after each insertion. They are stored as linked nodes, so each item incurs malloc and free overhead. Then it takes O(N) time to iterate over them and destroy the structure.
The priority queue overall should usually have better performance, but it's more constraining on your usage: the data items will move around during iteration, and you can only iterate once.
If you don't need to insert new items while iterating, you can use std::sort with a std::vector, of course. This should outperform the priority_queue by some constant factor.
As with most things in performance, the only way to judge for sure is to try it both ways (with real-world testcases) and measure.
By the way, to maximize performance, you can define a custom comparison function to ignore the V and compare only the K within the pair<K,V>.
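For instance, a comparator that orders only by the key might look like this (K = int and V = std::string are illustrative choices):

#include <queue>
#include <string>
#include <utility>
#include <vector>

struct CompareKeyOnly {
    bool operator()(const std::pair<int, std::string>& a,
                    const std::pair<int, std::string>& b) const {
        return a.first < b.first;   // ignore the value, compare keys only
    }
};

using KeyValueQueue =
    std::priority_queue<std::pair<int, std::string>,
                        std::vector<std::pair<int, std::string>>,
                        CompareKeyOnly>;

int main() {
    KeyValueQueue q;
    q.emplace(2, "two");
    q.emplace(5, "five");
    q.emplace(1, "one");
    // q.top() is {5, "five"}: the largest key comes out first.
}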

keep std vector/list sorted while insert, or sort all

Let's say I have 30000 objects in my vector/list, which I add one by one.
I need them sorted.
Is it faster to sort them all at once (e.g. with std::sort), or to keep the vector/list sorted while I add the objects one by one?
The vector/list WILL NOT be modified later.
When you keep your vector sorted while inserting elements one by one, you are basically performing an insertion sort, which theoretically runs in O(n^2) in the worst case. The average case is also quadratic, which makes insertion sort impractical for sorting large arrays.
With your input of ~30000 elements, it will be better to take all the inputs and then sort them with a faster sorting algorithm.
EDIT:
As #Veritas pointed out, we can use a faster algorithm (such as binary search) to find the position for each element, so the searching part takes O(n log n) time in total.
However, it should also be pointed out that inserting the elements is a factor to be taken into account. The worst case for the insertions is O(n^2), which is still the overall running time if we want to keep the array sorted.
Sorting after all the input is collected is still by far the better method than keeping the array sorted after each insertion.
Keeping the vector sorted during insertion would result in quadratic performance since on average you'll have to shift down approximately half the vector for each item inserted. Sorting once at the end would be n log(n), rather faster.
Depending on your needs it's also possible that set or map may be more appropriate.
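A small sketch of the two strategies being compared (illustrative only):

#include <algorithm>
#include <vector>

// Strategy 1: keep the vector sorted on every insertion.
// Each call is an O(log n) search plus an O(n) shift, so n calls cost O(n^2).
void insertSorted(std::vector<int>& v, int value) {
    v.insert(std::upper_bound(v.begin(), v.end(), value), value);
}

// Strategy 2: append everything, then sort once at the end: O(n log n) total.
void appendAllThenSort(std::vector<int>& v, const std::vector<int>& items) {
    v.insert(v.end(), items.begin(), items.end());
    std::sort(v.begin(), v.end());
}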

std::set<T>::insert, duplicate elements

What would be an efficient implementation for a std::set insert member function? Because the data structure sorts elements based on std::less (operator < needs to be defined for the element type), it is conceptually easy to detect a duplicate.
How does it actually work internally? Does it make use of the red-black tree data structure (an implementation detail mentioned in Josuttis's book)?
Implementations of the standard data structures may vary...
I have a problem where I am forced to have (generally speaking) sets of integers which should be unique. The length of the sets varies, so I need a dynamic data structure (based on my narrow knowledge, this narrows things down to list and set). The elements do not necessarily need to be sorted, but there may be no duplicates. Since the candidate sets always have a lot of duplicates (the sets are small, up to 64 elements), will trying to insert duplicates into a std::set with the insert member function cause a lot of overhead compared to a std::list plus some other algorithm that doesn't need to keep the elements sorted?
Additional: the output set has a fixed size of 27 elements. Sorry, I forgot this... this holds for a special case of the problem. For other cases, the length is arbitrary (smaller than the input set).
If you're creating the entire set all at once, you could try using std::vector to hold the elements, std::sort to sort them, and std::unique to prune out the duplicates.
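A minimal sketch of that sort/unique approach:

#include <algorithm>
#include <vector>

std::vector<int> makeUnique(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    // std::unique compacts the unique elements to the front; erase trims the leftovers.
    values.erase(std::unique(values.begin(), values.end()), values.end());
    return values;
}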
The complexity of std::set::insert is O(log n), or amortized O(1) if you use the "positional" insert and get the position correct (see e.g. http://cplusplus.com/reference/stl/set/insert/).
The underlying mechanism is implementation-dependent. It's often a red-black tree, but this is not mandated. You should look at the source code for your favourite implementation to find out what it's doing.
For small sets, it's possible that e.g. a simple linear search on a vector will be cheaper, due to spatial locality. But the insert itself will require all the following elements to be copied. The only way to know for sure is to profile each option.
When you only have 64 possible values known ahead of time, just take a bit field and flip on the bits for the elements actually seen. That works in n+O(1) steps, and you can't get less than that.
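A hedged sketch of the bit-field idea, assuming each value already maps to an index in [0, 64):

#include <cstdint>
#include <vector>

std::uint64_t collectSeen(const std::vector<int>& input) {
    std::uint64_t seen = 0;
    for (int v : input)
        seen |= std::uint64_t{1} << v;   // flip on the bit for this value
    return seen;                         // bit i is set iff value i was seen
}

bool wasSeen(std::uint64_t seen, int v) {
    return (seen >> v) & 1u;
}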
Inserting into a std::set of size m takes O(log(m)) time and comparisons, meaning that using an std::set for this purpose will cost O(n*log(n)) and I wouldn't be surprised if the constant were larger than for simply sorting the input (which requires additional space) and then discarding duplicates.
Doing the same thing with an std::list would take O(n^2) average time, because finding the insertion place in a list needs O(n).
Inserting one element at a time into an std::vector would also take O(n^2) average time – finding the insertion place is doable in O(log(m)), but elements need to be moved to make room. If the number of elements in the final result is much smaller than the input, that drops down to O(n*log(n)), with close to no space overhead.
If you have a C++11 compiler or use boost, you could also use a hash table. I'm not sure about the insertion characteristics, but if the number of elements in the result is small compared to the input size, you'd only need O(n) time – and unlike the bit field, you don't need to know the potential elements or the size of the result a priori (although knowing the size helps, since you can avoid rehashing).
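For example, with std::unordered_set (a sketch, assuming int elements):

#include <unordered_set>
#include <vector>

std::unordered_set<int> uniqueValues(const std::vector<int>& input) {
    std::unordered_set<int> unique;
    unique.reserve(input.size());                // an upper bound avoids rehashing
    unique.insert(input.begin(), input.end());   // duplicates are simply ignored
    return unique;
}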