sorting an STL List after all pushbacks or just use Multimap? - list

We were using a multimap<int,string> to store several hundred thousand items (>300K), when we realized we needed to add more data for analysis. So we created a class that held a few items and the necessary overridden operators for stl and used a multimap<ourStruct,String>. This worked fine and didn't take much longer than before (with some test data), when we then realized an stl <list> would do just fine, as long as we sorted it after we finished adding all the items. To our surprise, we found that adding all items to multimap still easily beats the total time to add all items to list, and then sort.
This doesn't make sense to us EE types, since by our thinking every insert to multimap would have to traverse the list then tack it on to the end, where as with list we would just add on to the end (via push back), then hopefully the sort wouldn't take as long.
One more factoid: we intially did the comparison test with out sorting the list and were thrilled to see significant speed ups in speed using list. Then we added the sort, and were a bit stunned...
Any of the CS gurus out there care to weigh in?

std::multimap uses a balanced tree1, so it does not traverse the entire list when you insert an item. The number of items traversed for an insert is approximately the base 2 logarithm of the number of items in the collection.
Based on what you've said, your best bet would probably be to put your data in a vector, and then sort.
1 Technically, the standard doesn't directly require a balanced tree, but it requires ability to traverse in sorted order, and logarithmic complexity for insertions and deletions in the worst case, and I'm not aware of many other data structures that can meet that requirement.

Removed ref to hash .. Balanced tree is why only an n2 traverse is required.

Related

std::list sort algorithm runtime

I have a list of elements that is almost in the correct order and the elements are just off by a relatively small amount of places compared to their correct position (e.g. no element that is supposed to be in the front of the list is in the end).
< TL;DR >
Practical origin: I have an incoming stream of UDP-Packages that contain signals all marked with a certain timestamp. Evaluating the data has shown, that the packages have not been send (or received) in the right order, so that the timestamp is not constantly increasing but jittering a bit. To export the data I need to sort it in advance.
< /TL;DR >
I want to use std::list.sort() to sort this list.
What is the sorting algorithm used by std::list.sort() and how is it affected by the fact that the list is almost sorted. I have a "feeling", that a divide-and-conquer based algorithm might profit from it.
Is there a more efficient algorithm for my quite specific problem?
If every element is within k positions of its proper place, then insertion sort will take less than kN comparisons and swaps/moves. It's also very easy to implement.
Compare this to the N*log(N) operations required by merge sort or quick sort to see if that will work better for you.
It is not defined which algorithm uses, allthough it should be around N log N on average, such as quicksort.
If you are appending packets to the end of the "queue" as you consume them, so you want the queue always sorted, then you can expect any new packets correct position to nearly always be near the "end".
Therefore, rather than sorting the entire queue, just insert the packet into the correct position. Start at the back of the queue and compare the existing packets timestamp with those already there, insert it after the first packet with a smaller timestamp (likely to always be the end) or the front if there is no such packet in the event things are greatly out of order.
Alternatively, if you want to add all packets in order and then sort it, Bubble Sort should be fairly optimal because the list should still be nearly sorted already.
On mostly-sorted data Insertion Sort and Bubble Sort are among the most common ones that perform the best.
Live demo
Note also that having a list structure puts an extra constraint on indexed access, so algorithms that require indexed access will perform extra poorly. Therefore insertion sort is an even better fit since it needs only sequential access.
In the case of Visual Studio prior to 2015, a bottom up merge sorting using 26 internal lists is used, following the algorithm shown in this wiki article:
https://en.wikipedia.org/wiki/Merge_sort#Bottom-up_implementation_using_lists
Visual Studio 2015 added support for not having a default allocator. The 26 internal lists initializers could have been expanded to 26 instances of initializers with user specified allocators, specifically: _Myt _Binlist[_MAXBINS+1] = { _Myt(get_allocator()), ... , _Myt(get_allocator())};, but instead someone at Microsoft switched to a top down merge sort based on iterators. This is slower, but has the advantage that it doesn't need special recovery if the compare throws an exception. The author of that change pointed out the fact that if performance is a goal, then it's faster to copy the list to an array or vector, sort the array or vector, then create a new list from the array or vector.
There's a prior thread about this.
`std::list<>::sort()` - why the sudden switch to top-down strategy?
In your case something like an insertion sort on a doubly linked list should be faster, if an out order node is found, remove the from the list, scan backwards or forwards to the proper spot and insert the node back into the list. If you want to used std::list, you can use iterators, erase, and emplace to "move" nodes, but that involves freeing and reallocating a node for each move. It would be faster to implement this with your own doubly linked list, in which case you can just manipulate the links, avoiding the freeing and reallocation of memory if using std::list.

Stream of Integers arriving at specified interval need to look sorted

Interview question: There is a stream of Integers that arrives at specified intervals (say every 20 sec). Which Container of STL would you use to store them so that the Integers look sorted? My reply was map/set when there is no duplicate or multimap/multiset when there is duplicate. Any better answer if exists?
Use a multiset if you want to preserve duplicates. If you don't want to preserve duplicates, use a set.
If it's only being updated every 20 seconds, it probably doesn't matter a whole lot (unless it goes for so long that the set of integers becomes tremendously huge).
If you had data coming in a lot faster, there are alternatives that might be worth considering. One would be to use a couple of vectors. As data arrives, just push it onto one of the vectors. When you need to do an in-order traversal, sort that newly arrived data, and merge with the other vector of existing (already-sorted data). That'll give you results in order, which you can then write out to another vector, and start the same cycle again.
The big advantage here is that you're dealing with contiguous data instead of individually allocated nodes. Even with a possibility of three vectors in use at a time, your total memory usage is likely to be about equal (or possibly even less than) that of using a set or multiset.
Another possibility to consider (that's a bit of a hybrid between the two) would be something like a B+ tree. This is still a tree, so you can do in-order insertions with logarithmic complexity, but you have all the data in the leaf nodes (which are fairly large) so you get at least a reasonable amount of contiguous access as well.
To maintain a sorted list of integers streaming I would use std::priority_queue with any underlying container (vector or deque depending on the particular use).
You can keep push() ing to the priority_queue and use top() and pop() to retrieve in the sorted order.
Answer should be std::set . std::map<key, value> has to consider when there is a pairs of data as <key, value> and it need to be sorted according to the value of key
In same way if you have to consider duplicates, use std::multiset and std::multimap according to type of data.

C++ Data Structure that would be best to hold a large list of names

Can you share your thoughts on what the best STL data structure would be for storing a large list of names and perform searches on these names?
Edit:
The names are not unique and the list can grow as new names can continuously added to it. And by large I am talking of from 1 million to 10 million names.
Since you want to search names, you want a structure that support fast random access. That means vector, deque and list are all out of the question. Also, vector/array are slow on random adds/inserts for sorted sets because they have to shift items to make room for each inserted item. Adding to end is very fast, though.
Consider std::map, std::unordered_map or std::unordered_multimap (or their siblings std::set, std::unordered_set and std::unordered_multiset if you are only storing keys).
If you are purely going to do unique, random access, I'd start with one of the unordered_* containers.
If you need to store an ordered list of names, and need to do range searches/iteration and sorted operations, a tree based container like std::map or std::set should do better with the iteration operation than a hash based container because the former will store items adjacent to their logical predecessors and successors. For random access, it is O(log N) which is still decent.
Prior to std::unordered_*, I used std::map to hold large numbers of objects for an object cache and though there are faster random access containers, it scaled well enough for our uses. The newer unordered_map has O(1) access time so it is a hashed structure and should give you the near best access times.
You can consider the possibility of using concatenation of those names using a delimiter but the searching might take a hit. You would need to come up with a adjusted binary searching.
But you should try the obvious solution first which is a hashmap which is called unordered_map in stl. See if that meets your needs. Searching should be plenty fast there but at a cost of memory.

Best data structure/ container in C++ for insertion and deletion

I am looking for the best data structure for C++ in which insertion and deletion can take place very efficiently and fast.
Traversal should also be very easy for this data structure. Which one should i go with?
What about SET in C++??
A linked list provides efficient insertion and deletion of arbitrary elements. Deletion here is deletion by iterator, not by value. Traversal is quite fast.
A dequeue provides efficient insertion and deletion only at the ends, but those are faster than for a linked list, and traversal is faster as well.
A set only makes sense if you want to find elements by their value, e.g. to remove them. Otherwise the overhead of checking for duplicate as well as that of keeping things sorted will be wasted.
It depends on what you want to put into this data structure. If the items are unordered or you care about their order, list<> could be used. If you want them in a sorted order, set<> or multiset<> (the later allows multiple identical elements) could be an alternative.
list<> is typically a double-linked list, so insertion and deletion can be done in constant time, provided you know the position. traversal over all elements is also fast, but accessing a specified element (either by value or by position) could become slow.
set<> and its family are typically binary trees, so insertion, deletion and searching for elements are mostly in logarithmic time (when you know where to insert/delete, it's constant time). Traversal over all elements is also fast.
(Note: boost and C++11 both have data structures based on hash-tables, which could also be an option)
I would say a linked list depending on whether or not you're deletions are specific and often. Iterator about it.
It occurs to me, that you need a tree.
I'm not sure about the exact structure (since you didnt provide in-detail info), but if you can put your data into a binary tree, you can achieve decent speed at searching, deleting and inserting elements ( O(logn) average and O(n) worst case).
Note that I'm talking about the data structure here, you can implement it in different ways.

Is the linked list only of limited use?

I was having a nice look at my STL options today. Then I thought of something.
It seems a linked list (a std::list) is only of limited use. Namely, it only really seems
useful if
The sequential order of elements in my container matters, and
I need to erase or insert elements in the middle.
That is, if I just want a lot of data and don't care about its order, I'm better off using an std::set (a balanced tree) if I want O(log n) lookup or a std::unordered_map (a hash map) if I want O(1) expected lookup or a std::vector (a contiguous array) for better locality of reference, or a std::deque (a double-ended queue) if I need to insert in the front AND back.
OTOH, if the order does matter, I am better off using a std::vector for better locality of reference and less overhead or a std::deque if a lot of resizing needs to occur.
So, am I missing something? Or is a linked list just not that great? With the exception of middle insertion/erasure, why on earth would someone want to use one?
Any sort of insertion/deletion is O(1). Even std::vector isn't O(1) for appends, it approaches O(1) because most of the time it is, but sometimes you are going to have to grow that array.
It's also very good at handling bulk insertion, deletion. If you have 1 million records and want to append 1 million records from another list (concat) it's O(1). Every other structure (assuming stadard/naive implementations) are at least O(n) (where n is the number of elements added).
Order is important very often. When it is, linked lists are good. If you have a growing collection, you have the option of linked lists, array lists (vector in C++) and double-ended queues (deque). Use linked lists if you want to modify (add, delete) elements anywhere in the list often. Use array lists if fast retrieval is important. Use double-ended queues if you want to add stuff to both ends of the data structure and fast retrieval is important. For the deque vs vector question: use vector unless inserting/removing things from the beginning is important, in which case use deque. See here for an in-depth look at this.
If order isn't important, linked lists aren't normally ideal.
std::list is notable for its splice() method, which allows you to move one more more elements from one list to another in constant time, without copying or allocating any elements or list nodes.
This question reminds me of this infamous one. Read it for the parallels as to why such simple data structures are important.
Linked List is a fundamental data structure. Other data structures, like hash maps, may use linked lists internally.
Two different algorithms may have O(1) time complexity, for a look up, but that doesn't mean they have the same performance. For example the first one may be 10 or 100 times faster than the second.
Whenever you need to store, iterate and do something with a bunch of data, the normal (and fast) data stucture for that task is the Linked List. More complex data structures are for special cases, ie Set is suitable when you don't want repeated values.
std::list has the following properties:
Sequence
Front Seuqence
Back Seuqence
Forward Container
Reverse Container
Of these properties std::vector does not have (Back Seuqence)
While std::set does not support any sequence properties or (Reverse Container)
So what does this mean?
Will a back sequence supports O(1) for rend() and rbegin() etc
For full information see:
What are the complexity guarantees of the standard containers?
Linked lists are immutable and recursive datastructures whereas arrays are mutable and imperative (=change-based). In functional programming, there are usually no arrays - You don't change elements but transform lists into new lists. While linked lists don't even need additional memory here, this isn't possible efficiently with arrays.
You can easily build or decompose lists without having to change any value.
double [] = []
double (head:rest) = (2 * head):(double rest)
In C++, which is an imperative language, you won't use lists that often. One example could be a list of spaceships in a game from which you can easily remove all spaceships that have been destroyed since the previous frame.