Finding top elements of a map using Heap - c++

I have a unordered_hashmap that maps a string (say personName or SSN) to a struct Attributes that has many attributes of that person including annualIncome. There are many such hash maps corresponding to different organizations such as mapOrganizationA, mapOrganizationB etc. I need to find the people (with attributes) with the top-k annual incomes. I was thinking of using a min-heap with k-nodes (with the minimum salary as root), so that I can scan the maps one by one, of the current element has income more than the root of the min-heap, the root can be updated. Is this the right approach to get top-k from different maps? Is there a min-heap datastucture in STL I can make use of.

You can use make_heap, push_heap, pop_heap, sort_heap, is_heap to treat any non-associative container (or sequence, really) as a heap.
That would not fit you map nicely, but I assume nothing would prevent you from storing the values (or pointers/references to those) inside, say, a list for this purpose?
Also, perhaps look at Boost.MultiIndex which is a library precisely focused on providing multiple (efficient!) indexes on the same data

Related

At what point does an std::map make more sense for grouping objects compared to two vectors and a linear search?

I am trying to sort a large collection of objects into a series of groups, which represent some kind of commonality between them.
There seems to be two ways I can go about this:
1) I can manage everything by hand, sorting out all the objects into a vector of vectors. However, this means that I have to iterate over all the upper level vectors every time I want to try and find an existing group for an ungrouped object. I imagine this will become very computationally expensive very quickly as the number of disjoint groups increases.
2) I can use the identifiers of each object that I'm using to classify them as a key for an std::map, where the value is a vector. At that point, all I have to do is iterate over all the input objects once, calling myMap[object.identifier].push_back(object) each time. The map will sort everything out into the appropriate vector, and then I can just iterate over the resulting values afterwards.
My question is...
Which method would be best to use? It seems like a vector of vectors would be faster initially, but it's going to slow down as more and more groups are created. AFAIK, std::map uses RB trees internally, which means that finding the appropriate vector to add the object to should be faster, but you're going to pay for that when the tree inevitably needs to be rebalanced.
The additional memory consumption from an std::map doesn't matter. I'm dealing with anywhere from 12000 to 80000 individual objects that need to be grouped together, and I expect there to be anywhere from 12000 to 20000 groups once everything is said and done.
Instead of using either of your mentioned approaches directly, I suggest you evaluate the use of std::unordered_map (docs here) for your use case. It uses maps with buckets and hashed values internally and has average constant complexity for search, insertion and removal.

Fast, sorted data structure with random access?

I want to add several objects in a structure which allows this:
Insertion of objects, immediately ordering the entire structure on add so I have a descending ordering of an int;
Being able to change the int by which the objects are ordered (I mean: say that object number 2, now has a int of 5, so it reorders the structure);
Fast structure, because it will be completely iterated 60 times a second;
Being able to directly access the objects by position;
Only needs to be iterated from top to bottom: higher INT to lower INT
No deletion required, but could become useful later on.
Some indications on how to use the structure would be great, since I don't know much about the C++ standard libraries.
All of the operations that you've listed (except for lookup by index) can be supported by a standard binary search tree, keyed by integer values. This gives you the ability to iterate over the elements in sorted order and to keep the objects sorted during any insertion. As #njr mentioned, you can also update priorities by removing objects from the binary search tree, changing their priority, then reinserting them into the binary search tree.
To support random access by index, you should consider looking into order statistic trees, a variant on binary search trees that in addition to all other operations supports very fast (O(log n)) lookup of an element by its index. That is, you could very efficiently query for the 15th element in the sorted sequence, or the 17th, etc. Order statistic trees aren't part of the C++ standard libraries, but this older question contains answers that can link you to an implementation.
Use a set or a map
For requirement 1 - provide a custom sorting function
For 2 - remove the item and add it again (or provide a wrapper that does that)
3 doesn't make sense (How big is the list, how fast is the processor/ram)
For 4 - Are you sure you need that? It seems to be kind of weird to try to access it by position when the position can change suddenly (some item was added or removed)
5 - same as 1

STL Map versus Static Array

I have to store information about contents in a lookup table such that it can be accessed very quickly.I might need to refer some of the elements in look up table recursively to get complete information about contents. What will be better data structure to use:
Map with one of parameter, which will be unique to all the entries in look up table, as key and rest of the information as value
Use static array for each unique entries and access them when needed according to key(which will be same as the one used in MAP).
I want my software to be robust as if we have any crash it will be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly ( O(1) ). As a dumb example, for a substitution cipher, you could have a char cipher[256] and simply index with the ASCII code of a character to get the substitution character. If the keys are complex objects or simply too many, you're probably stuck with a map.
You might also consider a hashtable (see unordered_map).
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If however your keys are themselves between say 0..10000, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of your elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, you'd get O(log n) to mean around ~12 accesses, a hash table should be pretty much one or two accesses tops.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
The lookup times in a std::map should be O=ln(n), with a linear search in a static array in the worst case O=n.
I'd strongly opt for a std::map even if it has a larger memory footprint (which should not matter, in the most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;

suitable data structure for set (graph) partition

I need to store data grouping nodes of a graph partition, something like:
[node1, node2] [node3] [node4, node5, node6]
My first idea was to have just a simple vector or array of ints, where the position in the array denoted the node_id and it's value is some kind of group_id
The problem is many partition algorithms rely on operating on pairs of nodes within a group. With this method, I think I would waste a lot of computation searching through the vector to find out which nodes belong to the same group.
I could also store as a stl set of sets, which seems closer to the mathematical definition of a partition, but I am getting the impression nested sets are not advised or unnecessary, and I would need to modify the inner sets which I am not sure is possible.
Any suggestions?
Depending on what exactly you want to do with the sets, you could try a disjoint set data structure. In this structure, each element has a method find that returns the "representative" of the set it belongs to.
A C++ implementation is available in Boost.
There are two good data structures that come to mind.
The first data structure (and one that's been mentioned here before) is the disjoint-set forest, which gives extraordinarily efficient implementations of "merge these two sets" and "what set is x in?". However, it does not support the operation of splitting groups apart from one another.
The other structure I'd recommend is a link/cut tree. This structure lets you build up partitions of a graph that can be joined together into trees. Unlike the disjoint set forest, the tree describing the partition can be cut into smaller trees, allowing you to break partitions into smaller groups. This structure is a bit less efficient than the union/find structure, but it still supports all operations in amortized O(lg n).

How to implement an associative array/map/hash table data structure (in general and in C++)

Well I'm making a small phone book application and I've decided that using maps would be the best data structure to use but I don't know where to start. (Gotta implement the data structure from scratch - school work)
Tries are quite efficient for implementing maps where the keys are short strings. The wikipedia article explains it pretty well.
To deal with duplicates, just make each node of the tree store a linked list of duplicate matches
Here's a basic structure for a trie
struct Trie {
struct Trie* letter;
struct List *matches;
};
malloc(26*sizeof(struct Trie)) for letter and you have an array. if you want to support punctuations, add them at the end of the letter array.
matches can be a linked list of matches, implemented however you like, I won't define struct List for you.
Simplest solution: use a vector which contains your address entries and loop over the vector to search.
A map is usually implemented either as a binary tree (look for red/black trees for balancing) or as a hash map. Both of them are not trivial: Trees have some overhead for organisation, memory management and balancing, hash maps need good hash functions, which are also not trivial. But both structures are fun and you'll get a lot of insight understanding by implementing one of them (or better, both :-)).
Also consider to keep the data in the vector list and let the map contain indices to the vector (or pointers to the entries): then you can easily have multiple indices, say one for the name and one for the phone number, so you can look up entries by both.
That said I just want to strongly recommend using the data structures provided by the standard library for real-world-tasks :-)
A simple approach to get you started would be to create a map class that uses two vectors - one for the key and one for the value. To add an item, you insert a key in one and a value in another. To find a value, you just loop over all the keys. Once you have this working, you can think about using a more complex data structure.