C++ data structure to perform indexed list

I am looking for the most efficient data structure to maintain an indexed list. You can easily view it in terms of an STL map:
std::map<int,std::vector<int> > eff_ds;
I am using this as an example because it is my current setup. The operations that I would like to perform are:
Insert values based on a key: similar to eff_ds[key].push_back(..);
Print the contents of the data structure in terms of each key.
I am also trying an unordered map with a forward list:
std::unordered_map<int,std::forward_list<int> > eff_ds;
Is this the best I can do in terms of time in C++, or are there other options?
UPDATE:
I can do insertion either way - front/back as long as I do the same for all the keys. To make my problem more clear, consider the following:
At each iteration of my algorithm, an external block gives me a (key, value) pair as output, where both the key and the value are single integers. I then have to insert this value under the corresponding key. At different iterations, the same key might be returned with different values. At the end, my output data (written to a file) should look something like this:
k1: v1 v2 v3 v4
k2: v5 v6 v7
k3: v8
.
.
.
kn: vm
The number of these iterations is pretty large, around 1M.

There are two dimensions to your problem:
What is the best container when you want to look up items using a numeric key, the number of keys is large, and the keys are sparse?
A numeric key might lend itself to a vector for this, however if the keys are sparsely populated that would waste a lot of memory.
Assuming you do not want to iterate through the keys in order (which you did not state as a requirement), then an unordered_map might be the best bet.
What is the best container for a list of numbers, allowing insertion at either end and retrieval of the numbers in order (the value type of the outer map)?
The answer to this will depend on how frequently you want to insert elements at the front. If that is commonly occurring then you might want to consider a forward_list. If you are mainly inserting on the end then a vector would be lower overhead.
Based on your updated question, since you can limit yourself to adding values to the end of the lists, and since you are not concerned with duplicate entries in the lists, I would recommend std::unordered_map<int, std::vector<int>>.
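Under that recommendation, a minimal sketch of the layout (the helper names are illustrative, not from the question):

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

using IndexedList = std::unordered_map<int, std::vector<int>>;

// Append a value to the list for `key`; amortized O(1) per insertion.
inline void insert(IndexedList& ds, int key, int value) {
    ds[key].push_back(value);
}

// Print "key: v1 v2 ..." per key. Note that unordered_map iterates in
// an unspecified order; copy the keys into a std::map (or sort them)
// first if the output must be ordered by key.
inline void print(const IndexedList& ds, std::FILE* out) {
    for (const auto& [key, values] : ds) {
        std::fprintf(out, "k%d:", key);
        for (int v : values) std::fprintf(out, " %d", v);
        std::fprintf(out, "\n");
    }
}
```

With ~1M iterations, the average O(1) insertion of the hash map is the main attraction; the per-key vectors preserve insertion order, which matches the required output.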

Related

C++ Find in a vector of <int, pair>

So previously I only had 1 key I needed to look up, so I was able to use a map:
std::map <int, double> freqMap;
But now I need to look up 2 different keys. I was thinking of using a vector with std::pair i.e.:
std::vector <int, std::pair<int, double>> freqMap;
Eventually I need to look up both keys to find the correct value. Is there a better way to do this, or will this be efficient enough (there will be ~3k entries)? Also, I am not sure how to search using the second key (the first element of the std::pair). Is there a find for the pair based on its first element? Essentially I can access the first key by:
freqMap[key1]
But not sure how to iterate and find the second key in the pair.
Edit: Ok adding the use case for clarification:
I need to look up a val based on 2 keys, a mux selection and a frequency selection. The raw data looks something like this:
Mux, Freq, Val
0, 1000, 1.1
0, 2000, 2.7
0, 10e9, 1.7
1, 1000, 2.2
1, 2500, 0.8
6, 2000, 2.2
The blanket answer to "which is faster" is generally "you have to benchmark it".
But besides that, you have a number of options. A std::map is more efficient than other data structures on paper, but not necessarily in practice. If you truly are in a situation where this is performance critical (i.e. avoid premature optimisation) try different approaches, as sketched below, and measure the performance you get (memory-wise and cpu-wise).
Instead of using a std::map, consider throwing your data into a struct, give it proper names and store all values in a simple std::vector. If you modify the data only seldom, you can optimise retrieval cost at the expense of additional insertion cost by sorting the vector according to the key you are typically using to find an entry. This will allow you to do binary search, which can be much faster than linear search.
However, linear search can be surprisingly fast on a std::vector because of both cache locality and branch prediction. Both of which you are likely to lose when dealing with a map, unordered_map or (binary searched) sorted vector. So, although O(n) sounds much more scary than, say, O(log n) for map or even O(1) for unordered_map, it can still be faster under the right conditions.
Especially if you discover that you don't have a discernible index member you can use to sort your entries, you will have to either stick to linear search in contiguous memory (i.e. vector) or invest into a doubly indexed data structure (effectively something akin to two maps or two unordered_maps). Having two indexes usually prevents you from using a single map/unordered_map.
If you can pack your data more tightly (do you need an int, or would a std::uint8_t do the job? do you need a double? etc.) you will amplify cache locality, and with only 3k entries there is a good chance that an unsorted vector will perform best. Although operations on a std::size_t are typically faster than on smaller types, iterating over contiguous memory usually offsets this effect.
Conclusion: Try an unsorted vector, a sorted vector (+ binary search), a map and an unordered_map. Do proper benchmarking (with several repetitions) and pick the fastest one. If it doesn't make a difference, pick the one that is the most straightforward to understand.
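The "struct in a sorted vector + binary search" option could look like this; the field names are taken from the question's raw data, everything else is an illustrative sketch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Entry {
    std::uint8_t mux;     // small domain, fits in a byte
    std::uint64_t freq;
    double val;
};

// Order by (mux, freq) so that entries with equal mux are adjacent.
inline bool byKey(const Entry& a, const Entry& b) {
    if (a.mux != b.mux) return a.mux < b.mux;
    return a.freq < b.freq;
}

// Binary search over a vector sorted with byKey; returns nullptr if absent.
inline const Entry* find(const std::vector<Entry>& table,
                         std::uint8_t mux, std::uint64_t freq) {
    Entry probe{mux, freq, 0.0};
    auto it = std::lower_bound(table.begin(), table.end(), probe, byKey);
    if (it != table.end() && it->mux == mux && it->freq == freq) return &*it;
    return nullptr;
}
```

The vector is sorted once after loading (or re-sorted after rare modifications), and each lookup is O(log n) over contiguous memory.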
Edit: Given your example data, it sounds like the first key has an extremely small domain. As far as I can tell, "Mux" seems to be limited to a small number of values near each other; in such a situation you may consider using a std::array as your primary indexing structure and a suitable lookup structure as your secondary one. For example:
std::array<std::vector<std::pair<std::uint64_t,double>>,10>
std::array<std::unordered_map<std::uint64_t,double>,10>
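The second variant could be wired up as follows; the dimension 10 and the helper names are illustrative assumptions:

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

// "Mux" indexes the array directly in O(1); the inner map goes from
// frequency to value.
using FreqTable = std::array<std::unordered_map<std::uint64_t, double>, 10>;

inline void insert(FreqTable& t, std::size_t mux,
                   std::uint64_t freq, double val) {
    t.at(mux)[freq] = val;          // at() bounds-checks the mux index
}

inline double lookup(const FreqTable& t, std::size_t mux,
                     std::uint64_t freq) {
    return t.at(mux).at(freq);      // throws if either key is missing
}
```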

At what point does an std::map make more sense for grouping objects compared to two vectors and a linear search?

I am trying to sort a large collection of objects into a series of groups, which represent some kind of commonality between them.
There seems to be two ways I can go about this:
1) I can manage everything by hand, sorting out all the objects into a vector of vectors. However, this means that I have to iterate over all the upper level vectors every time I want to try and find an existing group for an ungrouped object. I imagine this will become very computationally expensive very quickly as the number of disjoint groups increases.
2) I can use the identifiers of each object that I'm using to classify them as a key for an std::map, where the value is a vector. At that point, all I have to do is iterate over all the input objects once, calling myMap[object.identifier].push_back(object) each time. The map will sort everything out into the appropriate vector, and then I can just iterate over the resulting values afterwards.
My question is...
Which method would be best to use? It seems like a vector of vectors would be faster initially, but it's going to slow down as more and more groups are created. AFAIK, std::map uses RB trees internally, which means that finding the appropriate vector to add the object to should be faster, but you're going to pay for that when the tree inevitably needs to be rebalanced.
The additional memory consumption from an std::map doesn't matter. I'm dealing with anywhere from 12000 to 80000 individual objects that need to be grouped together, and I expect there to be anywhere from 12000 to 20000 groups once everything is said and done.
Instead of using either of your mentioned approaches directly, I suggest you evaluate std::unordered_map for your use case. It uses buckets and hashed keys internally and has average constant complexity for search, insertion and removal.
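The grouping pass described in option 2, with unordered_map substituted for map, can be sketched like this (the Object type is a hypothetical stand-in for the question's objects):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical object type; `identifier` is the grouping key.
struct Object {
    int identifier;
    std::string name;
};

using Groups = std::unordered_map<int, std::vector<Object>>;

// One pass over the input: average O(1) per lookup, so O(n) overall,
// versus a linear scan over group heads for the vector-of-vectors approach.
inline Groups groupByIdentifier(const std::vector<Object>& objects) {
    Groups groups;
    for (const Object& o : objects) groups[o.identifier].push_back(o);
    return groups;
}
```

With 12k-20k expected groups, avoiding both the linear scan and the red-black tree rebalancing is exactly where the hash map pays off.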

Can I reinterpret a memory mapped file of key-value pairs as a map in order to sort them?

I have a memory-mapped file that contains key-value pairs. Both the key and the value are uint32_t, and all the keys and values are stored in the file in binary, where a key immediately precedes its value. The file contains only these pairs, no delimiters.
I want to be able to sort all of these key-value pairs by increasing key.
The following just compiled in my code:
struct FileAsMap { map<uint32_t, uint32_t> keyValueMap; };
const FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
but I don't really know what to do from here, since by definition the map container keeps a strict weak ordering of the pairs by key. If I just reinterpret the mapped file as a map, how can I get the pairs to order?
This is not an answer as such, but the explanation does not fit into the comment length limit.
The keys in a map are usually unique (at least in std::map they are). But maps in general differ from one another in the method they use to organize the stored keys. For example, std::map is based on a balanced binary tree, with an average complexity of O(log n) for retrieving a given key, where n is the number of elements in the map. std::unordered_map, on the other hand, is internally a hash map with an average access time of O(1): it looks up a key in constant time regardless of the number of elements inside.
In any case, all these containers demand a dedicated internal in-memory structure, which practically never looks like a simple stream of key-value pairs. That is why, as noted in the first comment, it is almost impossible to reuse one of the standard maps as a convenient accessor for mmap-ed data without first reading and unpacking the data stream.
But you can create your own map-like class which iterates over the data in the mmap-ed area and checks, in its operator[](size_t i), whether a stored key matches the requested one. I guess the simplest implementation would take a single screen of code.
But beware: a sequential scan is a relatively expensive operation, so with enough elements in the file it could become unacceptably slow. In that case you will need some optimized indexing; for example, all keys are read at the start of processing and an index array is built. But all these questions depend heavily on the details of the task, so it is better to stop the explanation here.
If you have any further questions, feel free to ask. Of course, a good question assumes that you have already studied the subject and have encountered a particular problem that you cannot solve yourself.
There are a lot of reasons why the answer is no. The two simplest are:
Maps are a structure that stores data in a form in which it's already sorted. Your data isn't already sorted, so it's simply not a map.
The map class has its own internal data structure that it uses to store maps. Unless your file replicates this internal structure perfectly (which it almost certainly can't since it likely includes pointers into memory) the map class will misunderstand the data in the file.
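What you can do instead is copy the pairs out of the mapping and sort that copy; a minimal sketch, assuming the mapped region holds nothing but tightly packed uint32_t pairs:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

struct Pair { std::uint32_t key, value; };

// Copy the raw pairs out of the mmap-ed region and sort by key.
// `bytes` is the size of the region; it must be a multiple of sizeof(Pair).
std::vector<Pair> sortedPairs(const void* mmappedData, std::size_t bytes) {
    std::vector<Pair> pairs(bytes / sizeof(Pair));
    std::memcpy(pairs.data(), mmappedData, pairs.size() * sizeof(Pair));
    std::sort(pairs.begin(), pairs.end(),
              [](const Pair& a, const Pair& b) { return a.key < b.key; });
    return pairs;
}
```

memcpy avoids aliasing problems with the raw mapping, and sorting a flat vector of 8-byte structs is cheap; if a map interface is then needed, the pairs can be inserted into a std::map afterwards.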
How did you serialize the data to the file?
Assuming that you serialized a struct consisting of maps, you'd de-serialize as below:
FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
Gives access to entire structure (blob).
(*fileAsMap).keyValueMap gives access to map.

Implementing a mutable ranking table in c++

In an event-driven simulator, I need to keep track of the popularity of a large number of content elements from a catalog. More specifically, I am interested in knowing the rank of any given content, i.e. its position in a list sorted by descending number of requests. I know that the number of requests per content is only going to increase by one at a time, so there is no dramatic shift in the ranking. Furthermore, elements are inserted into or deleted from the catalog only on rare occasions, while requests are much more numerous and frequent. What is the best data structure to implement this?
These are the options that I have considered:
a std::map<ContentElement, unsigned int> mapping contents to the number of requests they received. Not a good choice, as it requires me to dump everything to a list and sort it whenever I want to know the ranking of a content, which is very often.
a boost::multi_index_container with two indexes, a hashed_unique for the ContentElement and an ordered_non_unique for the number of requests. This allows me to quickly retrieve a content element to update its number of requests, and to keep the container sorted while doing so through a modify call, but my understanding of the ordered index is that it still forces me to iterate through all its elements to figure out the rank of a content element; I could not find a simple way of extracting the ranking position from the ordered iterator.
a boost::bimap between content elements and ranking positions, supported by an external sorted vector storing the number of requests per content. Essentially, the rank of a content element would also be the index of the vector element holding its number of requests. This allows me to do everything I want (e.g., easily go from content to rank and vice versa), and re-sorting the vector after a new request should require at most two swaps in the bimap. However, it feels clumsy and error-prone, as I could easily lose sync between the map and the vector, and then everything would fall apart.
My guts tell me there must be a much simpler and more elegant way of handling this, but I could not find it. Can anyone help?
There is no need to do a full sort. The key insight here is that a ranking can only change by +1 or -1 when it is accessed. I would do the following...
Keep the element in a container of your choice, e.g.
map< elementId, elementInstance >
Maintain a linked list of element rankings, something like this:
list< rankingInstance >
The rankingInstance has a pointer to an elementInstance and the value of the current rank and current number of accesses. On access, you:
access the element in the map
get its current rank, and access count
update the count
using the current rank, access the linked list
check the neighbors
swap position in list if necessary
if swapping occurred, go back and update the two elements whose rank changed
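Those steps can be sketched as follows; here the linked list is replaced by a rank-indexed vector, which keeps the same adjacent-swap idea, and all names are illustrative:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: elements ranked by access count, kept ordered with adjacent
// swaps on each access (counts only ever grow by one).
class RankTable {
    std::vector<std::string> byRank_;  // index = rank (0 = most accessed)
    std::unordered_map<std::string, std::size_t> rankOf_;
    std::unordered_map<std::string, unsigned> count_;
public:
    void insert(const std::string& id) {
        rankOf_[id] = byRank_.size();   // new elements start at the bottom
        byRank_.push_back(id);
        count_[id] = 0;
    }
    std::size_t rank(const std::string& id) const { return rankOf_.at(id); }
    void access(const std::string& id) {
        ++count_[id];
        std::size_t r = rankOf_[id];
        // Bubble up past neighbors with a strictly smaller count.
        while (r > 0 && count_[byRank_[r - 1]] < count_[id]) {
            std::swap(byRank_[r - 1], byRank_[r]);
            rankOf_[byRank_[r]] = r;
            rankOf_[byRank_[r - 1]] = r - 1;
            --r;
        }
    }
};
```

With ties, an access can move an element past several equal-count neighbors, so the loop runs until the order is restored; each individual step is still an adjacent swap.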
It may seem too simple, but my suggestion is to use Bubble Sort on your list, since Bubble Sort compares and swaps only adjacent elements, which matches your case: a single move up or down in the ranking. Your vector can keep the 'Rank' as key and the 'ContentHash' as value. A map containing the 'Content' (or a reference to it) will also be needed. I hope this very simple approach gives some insight into your problem.

STL Map versus Static Array

I have to store information about contents in a lookup table such that it can be accessed very quickly. I might need to refer to some of the elements in the lookup table recursively to get complete information about contents. Which will be the better data structure to use:
A map with one of the parameters, which is unique across all entries in the lookup table, as the key, and the rest of the information as the value
A static array for each unique entry, accessed when needed according to the key (the same one used in the map)
I want my software to be robust, as any crash would be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly ( O(1) ). As a dumb example, for a substitution cipher, you could have a char cipher[256] and simply index with the ASCII code of a character to get the substitution character. If the keys are complex objects or simply too many, you're probably stuck with a map.
You might also consider a hashtable (see unordered_map).
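The directly-indexed table mentioned above might look like this; the rotation rule is made up for the example:

```cpp
#include <array>
#include <string>

// A toy substitution cipher: a 256-entry table indexed directly by the
// character's unsigned value, giving O(1) lookup per character.
inline std::string encode(const std::string& text) {
    std::array<char, 256> cipher{};
    for (int c = 0; c < 256; ++c)
        cipher[c] = static_cast<char>(c);          // identity by default
    for (int c = 'A'; c <= 'Z'; ++c)               // rotate A..Z by one
        cipher[c] = static_cast<char>(c == 'Z' ? 'A' : c + 1);
    std::string out;
    for (unsigned char c : text) out += cipher[c];
    return out;
}
```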
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If however your keys are themselves between say 0..10000, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of your elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, O(log n) would mean around ~12 accesses; a hash table should be pretty much one or two accesses tops.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
The lookup time in a std::map is O(log n), while a linear search in a static array is O(n) in the worst case.
I'd strongly opt for a std::map even if it has a larger memory footprint (which should not matter, in the most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;