C++/STL Structure for Indexed Linked List (Indices in Hash Table) - c++

I'm looking for a way to remember locations in a doubly-linked list (in hash tables or other data structures).
In C, I would add prev and next pointers to my struct. Then, I could store references to elements of my struct wherever I wanted, and refer to them later. I need only maintain these prev/next pointers to manipulate my linked list, and stored references to locations in the list will stay updated.
What is the C++ approach to this problem?
The end goal is a data structure that is sequenced but not ordered (i.e. no comparison function exists, but elements are sequenced relative to one another based on where they are inserted). I need to cheaply insert, delete and move objects as the structure grows. But I also need to cheaply look up each element by some key unrelated to the ordering, and to look up meaningful locations (like head, tail, and various checkpoints in the structure called slices). I need to be able to traverse the sequenced list after looking up a starting place by key or by slice.
Head and tail will be free. I was planning a hash table that maps the keys to list elements, and another hash table that maps slices to list elements.
I asked a more specific question related to this here:
Using Both Map and List for Same Objects
The conclusion I made was that I would need to maintain both a List and various Maps pointing to the same data to get the performance I need. But doing this by storing iterators in C++ seemed subpar. Instead it seemed easier to reimplement linked list (building it into my class) and using STL maps to point to data.
I was hoping for some input about which is a more fruitful route, or if there is some third plan that better meets my needs. My assumption is that the STL implementation of unordered_map is faster than anything I would implement, but I could match or beat the performance of list since I'm only using a subset of its functionality.
Thanks!
More precise description of my data/performance requirements:
Data will come in with a unique key. I will add it into a queue.
I'll need to update/move/remove/delete this data in O(1) based on its unique key.
I'll need to insert new data/read data based on metadata stored in other data structures.
I was speaking imprecisely when I said very large list above. The list will definitely fit into memory. Space is cheap enough that it is worth using other data structures to index this list.

I understand your requirements as being:
the data has a unique key
update/move/remove/delete this data in constant time, using its unique key
According to this, the best fit would be std::unordered_map: it works with a key and uses a hash table to access the elements. On average, insert, find and update take constant time (thanks to the hash table), unless the hash function is inappropriate (in the worst case, if all elements yield the same hash value, you get linear time, as in a list, because of the collisions).
This seems also to match your initial intention:
Head and tail will be free. I was planning a hash table that maps the
keys to list elements, and another hash table that maps slices to list
elements.
Edit: If you also need to control the sequencing of elements independently of their key, you'd need to build a combined container based on a list and an unordered_map that associates each key with an iterator to the element in the list. You'd then have to keep the two synchronised, for example:
insert element: get iterator by inserting element into list, then add the iterator to the unordered_map using the element's key.
remove element: find iterator to element by searching for the key in the unordered_map, erase element in the list using this iterator, and finally erase the key in the unordered_map.
find element: find iterator to element by searching for the key in the unordered_map
sequential iteration: start from the list's begin() iterator.
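The synchronisation steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the class name KeyedSequence and its member names are made up for illustration, and it assumes keys are unique and hashable with std::hash. It relies on the fact that std::list iterators stay valid when other elements are inserted, erased or spliced.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical combined container: a std::list holds the sequence and a
// std::unordered_map maps each unique key to that element's list iterator.
template <typename Key, typename Value>
class KeyedSequence {
public:
    using Item = std::pair<const Key, Value>;
    using iterator = typename std::list<Item>::iterator;

    // Append at the tail; returns false if the key is already present.
    bool push_back(const Key& key, Value value) {
        if (index_.count(key)) return false;
        items_.emplace_back(key, std::move(value));
        index_.emplace(key, std::prev(items_.end()));
        return true;
    }

    // Look up by key in O(1) on average; returns end() if absent.
    iterator find(const Key& key) {
        auto it = index_.find(key);
        return it == index_.end() ? items_.end() : it->second;
    }

    // Erase from the list first (via the stored iterator), then from the map.
    bool erase(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        items_.erase(it->second);
        index_.erase(it);
        return true;
    }

    // Move an element so that it sits just before `pos`. splice() relinks
    // the node in place, so the iterator stored in the map remains valid.
    bool move_before(const Key& key, iterator pos) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        items_.splice(pos, items_, it->second);
        return true;
    }

    iterator begin() { return items_.begin(); }
    iterator end() { return items_.end(); }
    std::size_t size() const {
        assert(items_.size() == index_.size());  // the two must stay in sync
        return items_.size();
    }

private:
    std::list<Item> items_;
    std::unordered_map<Key, iterator> index_;
};
```

A second map from slice identifiers to iterators could be added in the same way, as long as it is updated whenever an element it refers to is erased.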

I'd point you to the STL containers to browse... but once you write the words 'very large' (and I currently work in Big Data), everything changes.
People rarely give good advice on scalability, but here are some points.
What is 'very large' in your case? Does std::list fit your needs? Everything before your third paragraph looks suitable as long as you are not too large. Does your structure fit in memory?
How does your structure interact with the memory manager? A simple C-like list with 'prev' and 'next' has a serious disadvantage: every element is usually allocated separately from the memory manager. If you are large, this matters and leads to memory overhead.
What do you expect to use as an external reference to an element? If you use pointers, you lose the ability to optimize the layout of your structure. But you probably don't need that.
You definitely need to consider some kind of 'pool' management if you are really large, and indices into such pools can be pretty good references if you modify your structure intensively.
Please think twice about what 'large' means. If you mean really large, you need a special solution, especially if your data is larger than your memory. If you are not that large, why not start with just std::list? Once you answer this question, your life will probably be much easier ;-).

Related

Should I use std::list or is there a better method?

I am currently coding in a 2D geometry editor in c++. I am having the user place nodes. Lines and arcs can be drawn by selecting 2 nodes.
Right now, I am storing the nodes in a std::deque container (same thing for the lines and arcs) because I would like to store the address of the node in each line/arc. This makes things very convenient coding-wise when I implement a feature to move a node. If I were to store the actual node inside each line/arc, then whenever I wanted to move a node, I would have to iterate through the entire line and arc structure to find the node that I just moved and update the parameters. That option is off the table. Hence the need to be able to store the address of the node inside each line/arc.
However, I am running into an issue when I need to delete a node. According to the reference manual, all pointers are invalidated when you erase an element from the deque (unless that element is at the beginning or the end; for the sake of discussion, I will not consider that case). This causes problems with erasing because all of my lines/arcs now reconnect themselves to different nodes or are not drawn at all when a node is erased, and the program eventually crashes.
Continuing to look online, I come across std::list which (from my understanding of reading the documentation) does not invalidate any pointers or references when one of the elements is erased. This seems to be a very nice solution to my problem.
However, I have been looking a little on Stack Overflow to see the benefits and disadvantages of using a list vs. a deque, and there seems to be more of a preference for a deque than a list. It seems that a list is slower to access than a deque. This is not good because I am not sure how many nodes a user would like to draw. For all I know, there could be 10,000+ nodes in the geometry, and if the user wants to move a node, I don't want the user to have to wait 30 seconds for the program to iterate through all of the elements to find the node(s) to erase.
So on one hand, deques are a lot faster, but as soon as an element is removed, all of the pointers and references are invalidated. On the other hand, std::list allows me to erase whatever element I want without invalidating any of the pointers and references, but is slower compared to a deque.
I am considering to switch to a list because even if the list is slower, if I can't erase an element without invalidating the pointers and references, then there isn't much of a benefit speed wise if the program doesn't work.
However, is using a list the best choice in my situation? Is there any way to use a deque? Or is there a third option that I haven't considered?
Edit:
I forgot to mention: one thing that I am not too fond of with lists is the inability to get an element's data directly (in std::deque and vector, I can use the at function to access elements). This isn't a huge deal breaker with my code, but it does make things convenient. For example, when a user selects a node in order to create a line/arc, the code iterates over the entire node list to find out which one was selected and then, for the first selection, stores the index in a variable (called firstNodeIndex). For the second node, it does the same thing, but when both variables (firstNodeIndex and secondNodeIndex) hold valid numbers, the function for creating the line/arc is called, and that function uses the two stored indexes to re-access the node list and grab the address of each node. If I were to use the list, I would have to store the addresses of the two nodes in variables and then create some additional logic to make sure that the two variables containing those addresses are valid.
Another alternate solution would be to reiterate through the entire node list again to grab the nodes that are selected (I would have a variable inside each node to indicate that it is selected). But I am afraid that this might not be a good idea given std::list limitations.
I am kind of in favor of my first way but I am open to change if need be or if there is a better method
So your problem is that you don't want your iterators invalidated when you insert or erase an element, but you want your data structure to be fast.
A linked list is only slow when you have to iterate over all elements frequently. It does not take advantage of contiguous data access like vector or deque. Linear search in a list is also slow.
I had similar situations. Here are some options:
Use list and try to avoid linear searches. See if memory access speed of linked list affect your performance significantly and if it doesn't - use it.
Use map or set. Same cons as list except for search, which is O(log n). Or you can use the unordered versions if you don't care about sorting elements.
Use non-standard data structure like plf::colony. If you don't care about order of insertion, this is probably your best option.
Create your own deque-like data structure that does not invalidate iterators (using skipfields or storing free elements somewhere). I wouldn't recommend it since you will probably end up writing something like plf::colony anyway.
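As a small illustration of the first option, here is a sketch showing that pointers to std::list elements stay valid when a different element is erased. The Node and Line structs and the function name are hypothetical stand-ins for the editor's types, not taken from the question's code.

```cpp
#include <cassert>
#include <iterator>
#include <list>

struct Node { double x, y; };                 // hypothetical editor node
struct Line { Node* first; Node* second; };   // refers to its nodes by address

// Erasing from a std::list only invalidates pointers to the erased element;
// pointers to every other element remain usable.
inline bool pointersSurviveErase() {
    std::list<Node> nodes;
    nodes.push_back({0.0, 0.0});
    nodes.push_back({1.0, 1.0});
    nodes.push_back({2.0, 2.0});
    assert(nodes.size() == 3);

    // The line connects the first and the last node.
    Line line{&nodes.front(), &nodes.back()};

    // Erase the middle node, which the line does not use.
    nodes.erase(std::next(nodes.begin()));

    // Moving a node through the stored pointer still affects the list element.
    line.first->x = 5.0;
    return nodes.size() == 2 && nodes.front().x == 5.0 && line.second->y == 2.0;
}
```

Erasing a node that a line does point to would still leave that line with a dangling pointer, so lines and arcs attached to a deleted node have to be removed or updated separately.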
A rule of thumb:
will I want to add and delete items at random?
set, list, map, multimap and unordered versions of same
will I want to be able to name individual items and find them quickly?
map, set, multimap and unordered versions of same
does the thing I am storing have mutable data, or is it more detailed than just its name (key)?
map, multimap, unordered versions thereof
do I need the items to stay in order?
yes: map, no: unordered_map

Is there a linked hash set in C++?

Java has a LinkedHashSet, which is a set with a predictable iteration order. What is the closest available data structure in C++?
Currently I'm duplicating my data by using both a set and a vector. I insert my data into the set. If the data inserted successfully (meaning data was not already present in the set), then I push_back into the vector. When I iterate through the data, I use the vector.
If you can use it, then a Boost.MultiIndex with sequenced and hashed_unique indexes is the same data structure as LinkedHashSet.
Failing that, keep an unordered_set (or hash_set, if that's what your implementation provides) of some type with a list node in it, and handle the sequential order yourself using that list node.
The problems with what you're currently doing (set and vector) are:
Two copies of the data (might be a problem when the data type is large, and it means that your two different iterations return references to different objects, albeit with the same values. This would be a problem if someone wrote some code that compared the addresses of the "same" elements obtained in the two different ways, expecting the addresses to be equal, or if your objects have mutable data members that are ignored by the order comparison, and someone writes code that expects to mutate via lookup and see changes when iterating in sequence).
Unlike LinkedHashSet, there is no fast way to remove an element in the middle of the sequence. And if you want to remove by value rather than by position, then you have to search the vector for the value to remove.
set has different performance characteristics from a hash set.
If you don't care about any of those things, then what you have is probably fine. If duplication is the only problem then you could consider keeping a vector of pointers to the elements in the set, instead of a vector of duplicates.
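The last suggestion (a vector of pointers into the set rather than a vector of duplicates) might look roughly like this. It is a sketch under assumptions: the class name InsertionOrderedSet is invented for illustration, there is no removal, and copying is disabled because a copied vector would still point into the original set.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical wrapper: the std::set owns the values and the vector records
// insertion order as pointers to the set's elements. Pointers to std::set
// elements stay valid until that particular element is erased.
template <typename T>
class InsertionOrderedSet {
public:
    InsertionOrderedSet() = default;
    InsertionOrderedSet(const InsertionOrderedSet&) = delete;
    InsertionOrderedSet& operator=(const InsertionOrderedSet&) = delete;

    // Returns false (and changes nothing) if the value is already present.
    bool insert(const T& value) {
        auto result = values_.insert(value);
        if (!result.second) return false;
        order_.push_back(&*result.first);
        return true;
    }

    bool contains(const T& value) const { return values_.count(value) != 0; }
    std::size_t size() const { return order_.size(); }

    // i-th element in insertion order.
    const T& at(std::size_t i) const {
        assert(i < order_.size());
        return *order_[i];
    }

private:
    std::set<T> values_;
    std::vector<const T*> order_;
};
```

This removes the duplication but not the other two problems listed above: removal from the middle of the sequence is still linear, and std::set still has tree rather than hash performance.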
To replicate Java's LinkedHashSet in C++, I think you will need two plain std::map objects (note that this gives you a LinkedTreeSet rather than a real LinkedHashSet, with O(log n) insert and delete).
One uses the actual value as key and the insertion order (usually an int or long int) as value.
The other is the reverse: it uses the insertion order as key and the actual value as value.
When you insert, you use std::map::find on the first std::map to make sure that no identical object already exists in it.
If one already exists, ignore the new one.
If it does not, you add this object with the incremented insertion order to both of the std::maps mentioned above.
When you iterate in order of insertion, you iterate through the second std::map, since it is sorted by insertion order (anything that goes into a std::map or std::set is sorted automatically).
When you remove an element, you use std::map::find on the first std::map to get its insertion order. Use this insertion order to remove the element from the second std::map, then remove the object from the first one.
Please note that this solution is not perfect: if you plan to use it on a long-term basis, you will need to "compact" the insertion order after a certain number of removals, since you will eventually run out of insertion-order values (2^32 for unsigned int or 2^64 for unsigned long long int).
To do this, put all the value objects into a vector, clear both maps, and then re-insert the values from the vector into both maps. This procedure takes O(n log n) time.
If you're using C++11, you can replace the first std::map with std::unordered_map to improve efficiency. You won't be able to replace the second one, though, because std::unordered_map indexes by hash code, so iteration would not follow the insertion order.
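A minimal sketch of the two-map scheme described above, assuming a 64-bit counter and leaving out the compaction step. The class and member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical "LinkedTreeSet": one map from value to insertion order and
// one map from insertion order back to value. All operations are O(log n).
template <typename T>
class TwoMapOrderedSet {
public:
    // Insert only if no equal value exists yet.
    bool insert(const T& value) {
        if (order_of_.find(value) != order_of_.end()) return false;
        const std::uint64_t order = next_order_++;
        order_of_.emplace(value, order);
        value_at_.emplace(order, value);
        return true;
    }

    // Look up the insertion order in the first map, then erase from both.
    bool erase(const T& value) {
        auto it = order_of_.find(value);
        if (it == order_of_.end()) return false;
        const std::size_t removed = value_at_.erase(it->second);
        assert(removed == 1);  // both maps must always agree
        (void)removed;
        order_of_.erase(it);
        return true;
    }

    // The second map is sorted by insertion order, so a plain traversal
    // yields the values in the order they were inserted.
    std::vector<T> in_insertion_order() const {
        std::vector<T> out;
        for (const auto& entry : value_at_) out.push_back(entry.second);
        return out;
    }

private:
    std::map<T, std::uint64_t> order_of_;   // value -> insertion order
    std::map<std::uint64_t, T> value_at_;   // insertion order -> value
    std::uint64_t next_order_ = 0;          // never reused in this sketch
};
```

Note that this keeps two copies of every value, which is one of the drawbacks raised earlier in the thread for the set-plus-vector approach.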
Note that std::map does not give you constant-time lookup; lookups are O(log n). And using the std::tr1::unordered containers is risky business because they give up any ordering to get constant lookup time.
Try a Boost multi-index container if you want more freedom here.
The way you described your combination of std::set and std::vector sounds like what you should be doing, except by using std::unordered_set (equivalent to Java's HashSet) and std::list (doubly-linked list). You could also use std::unordered_map to store the key (for lookup) along with an iterator into the list where to find the actual objects you store (if the keys are different from the objects (or only a part of them)).
The boost library does provide a number of these types of combinations of containers and look-up indices. For example, this bidirectional list with fast look-ups example.

Queue-like data structure with random access element removal

Is there a data structure like a queue which also supports removal of elements at arbitrary points? Enqueueing and dequeueing occur most frequently, but mid-queue element removal must be similar in speed terms since there may be periods where that is the most common operation. Consistency of performance is more important than absolute speed. Time is more important than memory. Queue length is small, under 1,000 elements at absolute peak load. In case it's not obvious I'll state it explicitly: random insertion is not required.
Have tagged C++ since that is my implementation language, but I'm not using (and don't want to use) any STL or Boost. Pure C or C++ only (I will convert C solutions to a C++ class.)
Edit: I think what I want is a kind of dictionary that also has a queue interface (or a queue that also has a dictionary interface) so that I can do things like this:
Container.enqueue(myObjPtr1);
MyObj *myObjPtr2 = Container.dequeue();
Container.remove(myObjPtr3);
I think that a doubly-linked list is exactly what you want (assuming you do not want a priority queue):
Easy and fast adding of elements at both ends
Easy and fast removal of elements from anywhere
You can use the std::list container, but (in your case) it is difficult to remove an element from the middle of the list if you only have a pointer (or reference) to the element (wrapped in STL's list element) and do not have an iterator. If using iterators (e.g. storing them) is not an option, then implementing a doubly-linked list (even with an element counter) should be pretty easy. If you implement your own list, you can operate directly on pointers to elements (each of which contains pointers to both of its neighbours). If you do not want to use Boost or STL, this is probably the best option (and the simplest), and you have control of everything (you can even write your own block allocator for list elements to speed things up).
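A hand-rolled list of this kind could be sketched as below. It assumes the links live inside the element itself (an intrusive list); MyObj, its id field and the IntrusiveQueue class are hypothetical names, and only the C library headers <cassert> and <cstddef> are used, with no STL containers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type: the prev/next links are stored in the object,
// so a plain pointer to the object is enough to unlink it in O(1).
struct MyObj {
    int id;
    MyObj* prev;
    MyObj* next;
    explicit MyObj(int i) : id(i), prev(nullptr), next(nullptr) {}
};

class IntrusiveQueue {
public:
    IntrusiveQueue() : head_(nullptr), tail_(nullptr), count_(0) {}

    // Append at the tail. The object must not already be in a queue.
    void enqueue(MyObj* obj) {
        assert(obj && !obj->prev && !obj->next && obj != head_);
        obj->prev = tail_;
        if (tail_) tail_->next = obj; else head_ = obj;
        tail_ = obj;
        ++count_;
    }

    // Remove and return the head, or nullptr if the queue is empty.
    MyObj* dequeue() {
        MyObj* obj = head_;
        if (obj) remove(obj);
        return obj;
    }

    // Unlink an element anywhere in the queue in constant time.
    // The caller must guarantee that obj is currently in this queue.
    void remove(MyObj* obj) {
        assert(obj && count_ > 0);
        if (obj->prev) obj->prev->next = obj->next; else head_ = obj->next;
        if (obj->next) obj->next->prev = obj->prev; else tail_ = obj->prev;
        obj->prev = nullptr;
        obj->next = nullptr;
        --count_;
    }

    std::size_t size() const { return count_; }

private:
    MyObj* head_;
    MyObj* tail_;
    std::size_t count_;
};
```

All three operations are a fixed number of pointer updates with no allocation, which fits the requirement that performance be consistent. The queue does not own the objects; allocating and freeing them is left to the caller.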
One option is to use an order statistic tree, an augmented tree structure that supports O(log n) random access to each element, along with O(log n) insertion and deletion at arbitrary points. Internally, the order statistic tree is implemented as a balanced binary search tree with extra information associated with it. As a result, lookups are slower than in a standard dynamic array, but insertions are much faster.
Hope this helps!
You can use a combination of a linked list and a hash table. In Java it is called a LinkedHashSet.
The idea is simple, have a linked list of elements, and also maintain a hash map of (key,nodes), where node is a pointer to the relevant node in the linked list, and key is the key representing this node.
Note that the basic implementation is a set, and some extra work will be needed to make this data structure allow dupes.
This data structure gives you both O(1) head/tail access and O(1) access to any element in the list. [all average or amortized]

Searchable stack

I'm looking for a stack-like data structure that allows efficient searching of the contents. Effectively I want a structure that both maintains the order in which elements are inserted, but is also searchable faster than O(n) by value of the elements (in order to prevent duplicates).
The elements are small (pointers), and my primary concern is memory efficiency, so simply using two complementary data structures (one to maintain the order and one to search) is definitely not ideal.
Don't underestimate the memory-efficiency of two data structures. You should try the straightforward boost multi-index container library first, and see if its memory footprint is sufficient.
The first less usual data structure I have thought of as an answer was a skip list; however, this list won't do because you are searching for a different key than the one you are ordering on. Just noting for others who have the same idea.
If your primary concern really is memory efficiency, then you had better use a primitive linked list. Linear search complexity is not so bad unless you have proven otherwise.
Or you may try to use any data structure which provides an efficient search with two small upgrades: each element should contain a link to the previously added element, so making a reversed list, and you should store a link to the head of this list, i.e. last added element. These upgrades are required to ease pushing and popping elements.
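One way to read the suggestion above, sketched with std::unordered_map as the searchable structure: each entry stores a pointer to the previously pushed element, and the stack keeps a pointer to the last one. SearchableStack and its members are hypothetical names, and this is only one possible interpretation of the answer. It relies on pointers to unordered_map keys staying valid across rehashing; copying is disabled because copied pointers would refer to the original map.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical searchable stack: the hash map gives O(1) average lookup, and
// the mapped value links each element to the one pushed just before it.
template <typename T>
class SearchableStack {
public:
    SearchableStack() = default;
    SearchableStack(const SearchableStack&) = delete;
    SearchableStack& operator=(const SearchableStack&) = delete;

    // Push unless the value is already on the stack (prevents duplicates).
    bool push(const T& value) {
        auto result = prev_.emplace(value, top_);
        if (!result.second) return false;
        top_ = &result.first->first;   // address of the key stored in the map
        return true;
    }

    bool contains(const T& value) const { return prev_.count(value) != 0; }
    bool empty() const { return top_ == nullptr; }
    std::size_t size() const { return prev_.size(); }

    const T& top() const {
        assert(top_);
        return *top_;
    }

    // Pop the most recently pushed element and step back along the links.
    void pop() {
        assert(top_);
        auto it = prev_.find(*top_);
        top_ = it->second;
        prev_.erase(it);
    }

private:
    std::unordered_map<T, const T*> prev_;  // element -> previously pushed
    const T* top_ = nullptr;                // last pushed element, if any
};
```

The ordering costs one extra pointer per element on top of the hash table's own overhead. Whether that is acceptable for the stated memory concern would need to be measured.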

Is there a data structure that doesn't allow duplicates and also maintains order of entry?

Duplicate: Choosing a STL container with uniqueness and which keeps insertion ordering
I'm looking for a data structure that acts like a set in that it doesn't allow duplicates to be inserted, but also knows the order in which the items were inserted. It would basically be a combination of a set and list/vector.
I would just use a list/vector and check for duplicates myself, but we need that duplicate verification to be fast as the size of the structure can get quite large.
Take a look at Boost.MultiIndex. You may have to write a wrapper over this.
A Boost.Bimap with the insertion order as an index should work (e.g. boost::bimap < size_t, Foo > ). If you are removing objects from the data structure, you will need to track the next insertion order value separately.
Writing your own class that wraps a vector and a set would seem the obvious solution - there is no C++ standard library container that does what you want.
Java has this in the form of an ordered set. I don't think C++ has this, but it is not that difficult to implement yourself. What the Sun guys did with the Java class was to extend the hash table such that each item was simultaneously inserted into a hash table and kept in a doubly-linked list. There is very little overhead in this, especially if you preallocate the items that are used to construct the linked list.
If I were you, I would write a class that either uses a private vector to store the items or implements a hash table itself. When any item is to be inserted into the set, check to see if it is in the hash table and optionally replace the item in there if such an item is in it. Then find the old item in the hash table, update the list to point to the new element and you are done.
To insert a new element you do the same, except you have to use a new element in the list - you can't reuse the old ones.
To delete an item, you reorder the list to point around it, and free the list element.
Note that it should be possible for you to get the part of the linked list where the element you are interested in is directly from the element so that you don't have to walk the chain each time you have to move or change an element.
If you anticipate having a lot of these items changed during the program run, you might want to keep a list of the list items, such that you can merely take the head of this list, rather than allocating memory each time you have to add a new element.
You might want to look at the dancing links algorithm.
I'd just use two data structures, one for order and one for identity. (One could point into the other if you store values, depending on which operation you want the fastest)
Sounds like a job for an OrderedDictionary.
Duplicate verification that's fast seems to be the critical part here. I'd use some type of a map/dictionary maybe, and keep track of the insertion order yourself as the actual data. So the key is the "data" you're shoving in (which is then hashed, and you don't allow duplicate keys), and put in the current size of the map as the "data". Of course this only works if you don't have any deletions. If you need that, just have an external variable you increment on every insertion, and the relative order will tell you when things were inserted.
Not necessarily pretty, but not that hard to implement either.
Assuming that you're talking ANSI C++ here, I'd either write my own or use composition and delegation to wrap a map for data storage and a vector of the keys for order of insertion. Depending on the characteristics of the data, you might be able to use the insertion index as your map key and avoid using the vector.