Design Pattern for Data Structure - c++

This question has been answered, so what follows below is an explanation of what I wanted to achieve.
I wanted to create a tablular data structure designed to allow efficient access to any row entry through a primary column that could possibly be hashed. I thought that the best way to go about this would be to maintain a vector of doubly-linked lists, each of which would represent one column, and a map that would contain mappings of primary column entry hashes to nodes. Now, the first mistake I made is in thinking that I would need to create my own implementation of a doubly-linked list in order to be able to store pointers to nodes, when in fact the standard states that iterators to std::list do not get invalidated as a result of insertion or splicing (see larsmans's answer). Here's some pseudocode to illustrate what I wanted to do previously. Assume the existence of a typename T representing the entry type and the existence of a dlist and node class, as described previously.
typedef dlist<T> column_type;
typedef vector<T> row_type;
typedef ptr_unordered_map<int32_t, row_type> hash_type;
shared_ptr<ptr_vector<column_type> > columns;
shared_ptr<hash_type> hashes;
Now, after reading larsmans's answer, I learned that I wouldn't need any of this since Boost.MultiIndex fulfills all of my needs as it is. Even if I did, Boost.Intrusive offers more efficient data structures to accomplish what I describe.
Thanks to all who took interest in the question or offered help! If you have any more questions, add another comment and I'll do my best to clarify the question further.

front() should return a reference to a node containing the value_type
Sounds like your thinking of begin instead of front, in STL/Boost terms, except that begin methods usually return iterators instead of references.
How would I be able to use a map of key hashes to std::list::iterator types and allow for addition of rows without having the entries in the map get outdated
Just do; "lists have the important property that insertion and splicing do not invalidate iterators to list elements, and that even removal invalidates only the iterators that point to the elements that are removed" (STL docs).
If you wanted, you could maintain a single std::list for the entire table and a vector of iterators into it to represent the starting points of rows.
Besides, have you looked at Boost.Intrusive and Boost.MultiIndex? And did you know that an std::map (red-black tree) of hashes is a very suboptimal way of representing a hash table?

Related

C++/STL Structure for Indexed Linked List (Indices in Hash Table)

I'm looking for a way to remember locations in a doubly-linked list (in hash tables or other data structures).
In C, I would add prev and next pointers to my struct. Then, I could store references to elements of my struct wherever I wanted, and refer to them later. I need only maintain these prev/next pointers to manipulate my linked list, and stored references to locations in the list will stay updated.
What is the C++ approach to this problem?
The end goal is an data structure (which is sequenced, but not ordered, i.e. no comparison function exists, but they are relatively sequenced based on where they are inserted). I need to cheaply insert, delete, move objects as the structure grows. But I also need to cheaply look up each element by some key unrelated to the ordering, and I look up meaningful locations (like head, tail, and various checkpoints in the structure called slices). I need to be able to traverse the sequenced list after looking up a starting place by key or by slice.
Head and tail will be free. I was planning a hash table that maps the keys to list elements, and another hash table that maps slices to list elements.
I asked a more specific question related to this here:
Using Both Map and List for Same Objects
The conclusion I made was that I would need to maintain both a List and various Maps pointing to the same data to get the performance I need. But doing this by storing iterators in C++ seemed subpar. Instead it seemed easier to reimplement linked list (building it into my class) and using STL maps to point to data.
I was hoping for some input about which is a more fruitful route, or if there is some third plan that better meets my needs. My assumption is that the STL implementation of unordered_map is faster than anything I would implement, but I could match or beat the performance of list since I'm only using a subset of its functionality.
Thanks!
More precise description of my data/performance requirements:
Data will come in with a unique key. I will add it into a queue.
I'll need to update/move/remove/delete this data in O(1) based on its unique key.
I'll need to insert new data/read data based on metadata stored in other data structures.
I was speaking imprecisely when I said very large list above. The list will definitely fit into memory. Space is cheap enough that it is worth using other data structures to index this list.
I understand your requirements as being:
the data has a unique key
update/move/remove/delete this data in constant time, using its unique key
According to this the best fit would be the unodered_map: It works with a key, and uses a hash table to access the elements. In average insert, find, update is constant time (thanks to the hash table), unless the hash function is not appropriate (i.e. worst case if all elements would yield the same hash value, you would have linear time, as in a list, due to the colisions).
This seems also to match your initial intention:
Head and tail will be free. I was planning a hash table that maps the
keys to list elements, and another hash table that maps slices to list
elements.
Edit: If you need also to master sequencing of elements, independently of their key, you'd need to build a combined container, based on a list and an unordered_map which associates the key to an iterator to the element in the list. You'd then have to manage synchronisation, for example:
insert element: get iterator by inserting element into list, then add the iterator to the unordered_map using the element's key.
remove element: find iterator to element by searching for the key in the unordered_map, erase element in the list using this iterator, and finally erase the key in the unordered_map.
find element: find iterator to element by searching for the key in the unordered_map
sequential iteration: use the iterator to the begin of the list.
I'd route you to STL containers to browse... but when you write word 'very large' (and I'm currently Big Data professional) everything changes.
Nobody usually gives you good advice for scalability but ... here are points.
What is 'very large' in your case? Does std::list fit your needs? Before 3rd paragraph everything looks suitable if you are not too large. Do your structure fits in memory?
How about your structure aligned to memory manager? Simply C-like list with 'prev' and 'next' has serious disadvantage - every element usually is allocated from memory manager. If you are large, this matters and gives your memory over-usage.
What do you expect to be element external reference? If you use pointers - you loose ability to perform optimization on your structure. But probably you don't need it.
Actually you definitely need to consider some 'pools' management if you are really large and indices in such pools can be pretty good references if you modify your structure intensively.
Please consider about large twice. If you mean really large - you need special solution. Especially if your data is larger than your memory. If you are not so large - why not start with just std:list? When you answer to this question, probably your life could be much more easy ;-).

How to retrieve the elements from map in the order of insertion?

I have a map which stores <int, char *>. Now I want to retrieve the elements in the order they have been inserted. std::map returns the elements ordered by the key instead. Is it even possible?
If you are not concerned about the order based on the int (IOW you only need the order of insertion and you are not concerned with the normal key access), simply change it to a vector<pair<int, char*>>, which is by definition ordered by the insertion order (assuming you only insert at the end).
If you want to have two indices simultaneously, you need, well, Boost.MultiIndex or something similar. You'd probably need to keep a separate variable that would only count upwards (would be a steady counter), though, because you could use .size()+1 as new "insertion time key" only if you never erased anything from your map.
Now I want to retrieve the elements in the order they have been inserted. [...] Is it even possible?
No, not with std::map. std::map inserts the element pair into an already ordered tree structure (and after the insertion operation, the std::map has no way of knowing when each entry was added).
You can solve this in multiple ways:
use a std::vector<std::pair<int,char*>>. This will work, but not provide the automatic sorting that a map does.
use Boost stuff (#BartekBanachewicz suggested Boost.MultiIndex)
Use two containers and keep them synchronized: one with a sequential insert (e.g. std::vector) and one with indexing by key (e.g. std::map).
use/write a custom container yourself, so that supports both type of indexing (by key and insert order). Unless you have very specific requirements and use this a lot, you probably shouldn't need to do this.
A couple of options (other than those that Bartek suggested):
If you still want key-based access, you could use a map, along with a vector which contains all the keys, in insertion order. This gets inefficient if you want to delete elements later on, though.
You could build a linked list structure into your values: instead of the values being char*s, they're structs of a char* and the previously- and nextly-inserted keys**; a separate variable stores the head and tail of the list. You'll need to do the bookkeeping for that on your own, but it gives you efficient insertion and deletion. It's more or less what boost.multiindex will do.
** It would be nice to store map iterators instead, but this leads to circular definition problems.

C++ map allocator stores items in a vector?

Here is the problem I would like to solve: in C++, iterators for map, multimap, etc are missing two desirable features: (1) they can't be checked at run-time for validity, and (2) there is no operator< defined on them, which means that they can't be used as keys in another associative container. (I don't care whether the operator< has any relationship to key ordering; I just want there to be some < available at least for iterators to the same map.)
Here is a possible solution to this problem: convince map, multimap, etc to store their key/data pairs in a vector, and then have the iterators be a small struct that contain a pointer to the vector itself and a subscript index. Then two iterators, at least for the same container, could be compared (by comparing their subscript indices), and it would be possible to test at run time whether an iterator is valid.
Is this solution achievable in standard C++? In particular, could I define the 'Allocator' for the map class to actually put the items in a vector, and then define the Allocator::pointer type to be the small struct described in the last paragraph? How is the iterator for a map related to the Allocator::pointer type? Does the Allocator::pointer have to be an actual pointer, or can it be anything that supports a dereference operation?
UPDATE 2013-06-11: I'm not understanding the responses. If the (key,data) pairs are stored in a vector, then it is O(1) to obtain the items given the subscript, only slightly worse than if you had a direct pointer, so there is no change in the asymptotics. Why does a responder say map iterators are "not kept around"? The standard says that iterators remain valid as long as the item to which they refer is not deleted. As for the 'real problem': say I use a multimap for a symbol table (variable name->storage location; it is a multimap rather than map because the variables names in an inner scope may shadow variables with the same name), and say now I need a second data structure keyed by variables. The apparently easiest solution is to use as key for the second map an iterator to the specific instance of the variable's name in the first map, which would work if only iterators had an operator<.
I think not.
If you were somehow able to "convince" map to store its pairs in a vector, you would fundamentally change certain (at least two) guarantees on the map:
insert, erase and find would no longer be logarithmic in complexity.
insert would no longer be able to guarantee the validity of unaffected iterators, as the underlying vector would sometimes need to be reallocated.
Taking a step back though, two things suggest to me that you are trying to "solve" the wrong problem.
First, it is unusual to need to have a vector of iterators.
Second, it is unusual to need to check an iterator for validity, as iterators are not generally kept around.
I wonder what the real problem is that you are trying to solve?

Simple and efficient container in C++ with characteristics of map and list containers

I'm looking for a C++ container that will enjoy both map container and list container benefits.
map container advantages I would like to maintain:
O(log(n)) access
operator[] ease of use
sparse nature
list container advantages I would like to maintain:
having an order between the items
being able to traverse the list easily UPDATE: by a sorting order based on the key or value
A simple example application would be to hold a list of certain valid dates (business dates, holidays, some other set of important dates...), once given a specific date, you could find it immediately "map style" and then find the next valid date "list style".
std::map is already a sorted container where you can iterate over the contained items in order. It only provides O(log(n)) access, though.
std::tr1::unordered_map (or std::unordered_map in C++0x) has O(1) access but is unsorted.
Do you really need O(1) access? You have to use large datasets and do many lookups for O(log(n)) not being fast enough.
If O(log(n)) is enough, std::map provides everything you are asking for.
If you don't consider the sparse nature, you can take a look at the Boost Multi-Index library. For the sparse nature, you can take a look at the Boost Flyweight library, but I guess you'll have to join both approaches by yourself. Note that your requirements are often contradictory and hard to achieve. For instance, O(1) and order between the items is difficult to maintain efficiently.
Maps are generally implemented as trees and thus have logarithmic look up time, not O(1), but it sounds like you want a sorted associative container. Hash maps have O(1) best case, O(N) worst case, so perhaps that is what you mean, but they are not sorted, and I don't think they are part of the standard library yet.
In the C++ standard library, map, set, multimap, and multiset are sorted associative containers, but you have to give up the O(1) look up requirement.
According to Stroustrup, the [] operator for maps is O(log(n)). That is much better than the O(n) you'd get if you were to try such a thing with a list, but it is definitely not O(1). The only container that gives you that for the [] operator is vector.
That aside, you can already do all your listy stuff with maps. Iterators work fine on them. So if I were you, I'd stick with map.
having an order between the items
being able to traverse the list easily
Maps already do both. They are sorted, so you start at begin() and traverse until you hit end(). You can, of course, start at any map iterator; you may find map's find, lower_bound, and related methods helpful.
You can store data in a list and have a map to iterators of your list enabling you to find the actual list element itself. This kind of thing is something I often use for LRU containers, where I want a list because I need to move the accessed element to the end to make it the most recently accessed. You can use the splice function to do this, and since the 2003 standard it does not invalidate the iterator as long as you keep it in the same list.
How about this one: all dates are stored in std::list<Date>, but you look it up with helper structure stdext::hash_map<Date, std::list<Date>::iterator>. Once you have iterator for the list access to the next element is simple. In your STL implementation it could be std::tr1::unordered_map instead of stdext::hash_map, and there is boost::unordered_map as well.
You will never find a container that satisfies both O(log n) access and an ordered nature. The reason is that if a container is ordered then inherently it must support an arbitrary order. That's what an ordered nature means: you get to decide exactly where any element is positioned. So to find any element you have to guess where it is. It can be anywhere, because you can place it anywhere!
Note that an ordered sequence is not the same as a sorted sequence. A sorted nature means there is one particular ordering relation between any two elements. An ordered nature means there may be more than one ordering relation among the elements.

index or position in std::set

I have a std::set of std::string. I need the "index" or "position" of each string in the set, is this a meaningful concept in the context?
I guess find() will return an iterator to the string, so my question might be better phrased as : "How do I convert an iterator to a number?".
std::distance is what you need. You will want, I guess std::distance(set.begin(), find_result)
I don't think it is meaningful - set's are 'self keyed' and sorted thus the 'index' would be invalidated when the set is modified.
Of course it depends upon how you intend to use the index and if the set is essentially static (say a dictionary).
Despite what others have written here, I don't think that "index" or "position" has meaning with respect to a set. In mathematical terms, a set exposes only its members and maybe its cardinality. The only meaningful operations involve testing whether an item is a member of the set, and combining or subtracting sets to yield new sets.
Some people talk about sets as data structures in looser terms, by facets of being "ordered" or "unordered", and whether they permit duplicates or enforce uniqueness. The former facet distinguishes an array with an O(n) insertion guard, where an attempt to insert an item first scans the existing members to see if the new item exists and, if not, inserts the new item at the end, and a hash table, that might retain such order only within a bucket's chain. A tree such as the Red-Black Tree used by std::set is somewhere in between; its traversal order is deterministic with respect to the strict weak order imposed by the comparator predicate, but, unlike the array sketched above, it doesn't retain insertion order.
The other facet — whether the set permits duplicate elements — is meaningless in mathematics, and is more accurately described as a bag. Such a structure acknowledges the difference between identity and value-based "sameness."
Your problem may involve caring about some position; it's not clear what that position means, but I expect you're going to need some data structure separate from std::set to model this properly. Perhaps a std::map mapping from your set of elements to each position would do. That would not guarantee that the positions are unique.
It may also help clarify the problem to think how you'd model it as relations, such as in a relational database. What comprises the key? What portions of the entities can vary independently?