Find common values in two maps without iterating - C++

I have these two maps, each storing 10,000+ entries:
std::map<std::string,ObjectA> mapA;
std::map<std::string,ObjectB> mapB;
I want to retrieve only those values from the maps whose keys are present in both maps.
For example, if key "10001" is found in both mapA and mapB, then I want the corresponding objects from both maps, something like doing a join on SQL tables. The easiest way would be to iterate over the smaller map and call the other map's find(iter->first) in each iteration to collect the keys that qualify, but that would also be expensive.
Instead, I am considering maintaining a set like this:
std::set<std::string> common;
1) Every time I insert into one of the maps, I check whether the key exists in the other map. If it does, I add the key to the common set above.
2) Every time I remove an entry from one of the maps, I remove the key from the common set, if it is there.
The common set always maintains the keys that are in both maps. When I want to do the join, I already have the qualifying keys. Is there a faster/better way?
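For concreteness, the bookkeeping described in 1) and 2) might look like the following sketch, reusing the declarations above (insertA/eraseA are illustrative helper names; the mapB side is symmetric):
#include <map>
#include <set>
#include <string>

std::set<std::string> common; // keys currently present in both maps

void insertA(const std::string& key, const ObjectA& value) {
    mapA[key] = value;
    if (mapB.count(key)) common.insert(key); // key is now in both maps
}

void eraseA(const std::string& key) {
    mapA.erase(key);
    common.erase(key); // no-op if the key was not common
}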

The algorithm is pretty simple. First, you treat the two maps as sequences (using iterators).
If either remaining sequence is empty, you're done.
If the keys at the front of both sequences are the same, you have found a match.
If the keys differ, discard the lower (according to the map's sorting order) of the two.
You'll be iterating over both maps, which means a complexity of O(n+m), which is significantly better than the naive approach with its O(n log m) or O(m log n) complexity.
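As a hedged sketch, the merge-style join over the question's two maps could look like this (the onMatch callback is an illustrative name, not part of any library):
#include <map>
#include <string>

template <typename A, typename B, typename F>
void joinByKey(const std::map<std::string, A>& mapA,
               const std::map<std::string, B>& mapB,
               F onMatch)
{
    auto itA = mapA.begin();
    auto itB = mapB.begin();
    while (itA != mapA.end() && itB != mapB.end()) {
        if (itA->first < itB->first)
            ++itA;            // discard the lower key
        else if (itB->first < itA->first)
            ++itB;
        else {
            onMatch(itA->first, itA->second, itB->second); // keys match
            ++itA;
            ++itB;
        }
    }
}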

Related

C++ data structure to perform indexed list

I am looking for the most efficient data structure to maintain an indexed list. You can easily view it in terms of an STL map:
std::map<int,std::vector<int> > eff_ds;
I am using this as an example because it is the setup I am currently using. The operations that I would like to perform are:
Insert values based on key : similar to eff_ds[key].push_back(..);
Print the contents of the data structure in terms of each key.
I am also trying to use an unordered map and a forward list,
std::unordered_map<int,std::forward_list<int> > eff_ds;
Is this the best I could do in terms of time if I use C++ or are there other options ?
UPDATE:
I can do insertion either way (front or back), as long as I do the same for all the keys. To make my problem more clear, consider the following:
At each iteration of my algorithm, an external block gives me a (key, value) pair as output, where both key and value are single integers. Of course, I have to insert this value under the corresponding key. Also, at different iterations, the same key might be returned with different values. At the end, my output data (written to a file) should look something like this:
k1: v1 v2 v3 v4
k2: v5 v6 v7
k3: v8
.
.
.
kn: vm
The number of these iterations is pretty large, around 1 million.
There are two dimensions to your problem:
1) What is the best container when you want to look up items by a numeric key, the number of keys is large, and the keys are sparse?
A numeric key might lend itself to a vector, but if the keys are sparsely populated that would waste a lot of memory.
Assuming you do not need to iterate through the keys in order (which you did not state as a requirement), an unordered_map is probably the best bet.
2) What is the best container for a list of numbers (the value type of the outer map), allowing insertion at either end and retrieval of the numbers in order?
The answer depends on how frequently you insert elements at the front. If that is common, consider a forward_list. If you mainly insert at the end, a vector has lower overhead.
Based on your updated question, since you can limit yourself to adding the values to the end of the lists, and since you are not concerned with duplicate entries in the lists, I would recommend using std::unordered_map<int, std::vector<int> >.
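A minimal sketch of that recommendation under the stated assumptions (the dummy (key, value) pairs stand in for the external block's output):
#include <cstdio>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<int, std::vector<int>> eff_ds;

    // Insert each (key, value) pair as it arrives; values for the same
    // key accumulate at the end of that key's vector.
    eff_ds[1].push_back(10);
    eff_ds[1].push_back(20);
    eff_ds[2].push_back(30);

    // Print the contents per key (key order is unspecified here).
    for (const auto& kv : eff_ds) {
        std::printf("k%d:", kv.first);
        for (int v : kv.second) std::printf(" v%d", v);
        std::printf("\n");
    }
    return 0;
}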

Efficient frequency counter

I have 15,000,000 std::vectors of 6 integers each.
Those 15M vectors contain duplicates.
Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of the unique sequences with their associated counts. (A sequence that is present only once would have a count of 1.)
Is there an algorithm in standard C++ (C++11 is fine) that does that in one shot?
Windows, 4GB RAM, 30+GB hdd
There is no such algorithm in the standard library which does exactly this, however it's very easy with a single loop and by choosing the proper data structure.
For this you want to use std::unordered_map, which is typically implemented as a hash map. It has expected constant time per access (insert and look-up) and is thus the first choice for huge data sets.
The following access-and-increment trick will automatically insert a new entry in the counter map if it's not yet there; then it will increment and write back the count.
typedef std::vector<int> VectorType; // Please consider std::array<int,6>!
struct VectorHash { // needed: std::hash has no std::vector specialization
    std::size_t operator()(const VectorType& v) const {
        std::size_t h = 0;
        for (int x : v) h = 31u * h + static_cast<std::size_t>(x);
        return h;
    }
};
std::unordered_map<VectorType, int, VectorHash> counters;
for (const VectorType& vec : vectors) counters[vec]++;
For further processing, you most probably want to sort the entries by the number of occurrences. For this, either write them out into a vector of pairs (which encapsulates the number vector and the occurrence count), or into an (ordered) map which has key and value swapped, so it's automatically ordered by the counter.
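For instance, one way to get the counts in descending order, building on the counters map above (the byCount name is illustrative):
#include <algorithm>
#include <utility>
#include <vector>

std::vector<std::pair<VectorType, int>> byCount(counters.begin(), counters.end());
std::sort(byCount.begin(), byCount.end(),
          [](const std::pair<VectorType, int>& a,
             const std::pair<VectorType, int>& b) { return a.second > b.second; });
// byCount.front() is now the most frequent sequence.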
In order to reduce the memory footprint of this solution, try this:
If you don't need to get the keys back from this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use std::size_t as the key type and an identity function as the internal hash function (C++11 has no std::identity, so define a trivial functor), then call the vector hash manually on each access:
struct IdentityHash { // the key already is a hash; pass it through
    std::size_t operator()(std::size_t h) const { return h; }
};
std::unordered_map<std::size_t, int, IdentityHash> counters;
VectorHash hashFunc; // the functor defined above
for (const VectorType& vec : vectors) counters[hashFunc(vec)]++;
This reduces memory but requires additional effort to interpret the results, as you have to loop over the original data structure a second time in order to find the original vectors (and then look them up in your hash map by hashing them again).
Yes: first std::sort the list (std::vector uses lexicographic ordering, the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again but with an inverted comparator to find the first non-duplicate.
Alternately, you could use std::unique with a custom comparator that flags when a duplicate is found, and maintains a count through the successive calls. This also gives you a deduplicated list.
The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.
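To illustrate, a minimal sketch of a sort-then-count-runs variant (a simpler loop than adjacent_find, same idea; the parameter name vectors and the output layout are assumptions):
#include <algorithm>
#include <utility>
#include <vector>

std::vector<std::pair<std::vector<int>, int>>
countUnique(std::vector<std::vector<int>>& vectors) // sorts in place
{
    std::sort(vectors.begin(), vectors.end()); // lexicographic order
    std::vector<std::pair<std::vector<int>, int>> counts;
    for (const std::vector<int>& v : vectors) {
        if (!counts.empty() && counts.back().first == v)
            ++counts.back().second;                 // still in the same run
        else
            counts.push_back(std::make_pair(v, 1)); // new unique sequence
    }
    return counts;
}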
You could convert each vector to a string, one by one, like this: "4,3,2,0,4,23".
Then add the strings to a new string vector, checking for prior existence with find().
If you need the original vectors, convert the string vector back into vectors of integers.
If you do not, simply drop the duplicated elements while building the string vector.

finding items to de-duplicate

I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.
The result I need is, for example:
X1 equals X3 and X6
X2 is unique
X4 equals X5
(Order of the lines, or order within a line, doesn't matter).
How can I implement that with pair-wise comparisons?
Here's what I have so far:
Compare all pairs (Xi, Xk) with i < k, and exploit transitivity: if I already found X1==X3 and X1==X6, I don't need to compare X3 and X6.
So I could use the following data structures:
map: index --> group
multimap: group --> indices
where group is arbitrarily assigned (e.g. "line number" in the output).
For a pair (Xi, Xk) with i < k :
if both i and k already have a group assigned, skip
if they compare equal:
if i already has a group assigned, put k in that group
otherwise, create a new group for i and put k in it
if they are not equal:
if i has no group assigned yet, assign a new group for i
same for k
That should work if I'm careful with the order of items, but I wonder if this is the best / least surprising way to solve this, as this problem seems to be somewhat common.
Background/More info: purpose is deduplicating storage of the items. They already have a hash, in case of a collision we want to guarantee a full comparison. The size of the data in question has a very sharp long tail distribution.
An iterative algorithm (find any two duplicates, share them, repeat until there are no duplicates left) might be easier, but we want non-modifying diagnostics.
Code base is C++, something that works with STL / boost containers or algorithms would be nice.
[edit] Regarding the hash: For the purpose of this question, please assume a weak hash function that cannot be replaced.
This is required for a one-time deduplication of existing data, and it needs to deal with hash collisions. The original choice was "fast hash, and compare on collision"; the hash chosen turned out to be a little weak, but changing it would break backward compatibility. Even then, I sleep better with the simple statement "in case of a collision, you won't get the wrong data" than with blogging about wolf attacks.
Here's another, maybe simpler, data structure for exploiting transitivity. Make a queue of the comparisons that you need to do. For example, in the case of 4 items, it will be [ (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) ]. Also keep an array of the comparisons you've already done. Before each comparison, check whether that comparison has been done before, and every time you find a match, go through the queue and replace the matching item index with its lower-index equivalent.
For example, suppose we pop (1,2), compare, they're not equal, push (1,2) to the array of already_visited and continue. Next, pop (1,3) and find that they are equal. At this point, go through the queue and replace all 3's with 1's. The queue will be [(1,4), (2,1), (2,4), (1,4)], and so on. When we reach (2,1), it has already been visited, so we skip it, and the same with (1,4).
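A rough sketch of that queue bookkeeping; compareEqual stands in for the expensive comparison and is an assumption (recording the resulting groups is left out):
#include <deque>
#include <set>
#include <utility>

void scheduleComparisons(int n, bool (*compareEqual)(int, int))
{
    std::deque<std::pair<int, int>> queue;
    for (int i = 1; i <= n; ++i)
        for (int k = i + 1; k <= n; ++k)
            queue.push_back(std::make_pair(i, k));

    std::set<std::pair<int, int>> visited;
    while (!queue.empty()) {
        std::pair<int, int> p = queue.front();
        queue.pop_front();
        if (p.first > p.second) std::swap(p.first, p.second); // e.g. (2,1) -> (1,2)
        if (p.first == p.second || visited.count(p)) continue; // already done
        visited.insert(p);
        if (compareEqual(p.first, p.second)) {
            // p.second duplicates p.first: redirect all pending comparisons
            // to the lower index, as in the (1,3) example above.
            for (std::pair<int, int>& q : queue) {
                if (q.first == p.second) q.first = p.first;
                if (q.second == p.second) q.second = p.first;
            }
        }
    }
}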
But I do agree with the previous answers. Since comparisons are computationally expensive, you probably want to build a fast, reliable hash table first, and only then apply this method to the collisions.
So... you already have a hash? How about this:
sort and group on hash
print all groups with size 1 as unique
compare collisions
Tip for comparing collisions: why not just rehash them with a different algorithm? Rinse, repeat.
(I am assuming you are storing files/blobs/images here and have hashes of them and that you can slurp the hashes into memory, also, that the hashes are like sha1/md5 etc., so collisions are very unlikely)
(also, I'm assuming that two different hashing algorithms will not both collide on the same pair of distinct inputs, but this is probably safe to assume...)
Make a hash of each item. Make a list of pair<hash, item_index>. You can find groups by sorting this list by hash, or by putting it into a std::multimap.
When you output the group list, you still need to compare the items to rule out hash collisions.
So for each item you will do one hash calculation and about one comparison, plus the sorting of the hash list.
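A small sketch of this, assuming the items' precomputed hashes are already in a vector (indices into the original pool identify the items):
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::vector<std::size_t>>
groupByHash(const std::vector<std::size_t>& hashes)
{
    std::vector<std::pair<std::size_t, std::size_t>> byHash; // (hash, item_index)
    for (std::size_t i = 0; i < hashes.size(); ++i)
        byHash.push_back(std::make_pair(hashes[i], i));
    std::sort(byHash.begin(), byHash.end()); // equal hashes become adjacent

    std::vector<std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < byHash.size(); ++i) {
        if (i == 0 || byHash[i].first != byHash[i - 1].first)
            groups.push_back(std::vector<std::size_t>()); // new hash, new group
        groups.back().push_back(byHash[i].second);
    }
    return groups; // size-1 groups are unique; larger groups need full compares
}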
I agree with the idea of using a second (hopefully improved) hash function to resolve some of your weak hash's collisions without costly pairwise comparisons. Since you say you are running into memory limitations, hopefully you can fit the entire hash table (with secondary keys) in memory, where each entry stores a list of indices of the records on disk that share that key pair. The question is then whether, for each key pair, you can load all of its records into memory at once. If so, you can simply iterate over the key pairs: for each one, free the previous key pair's records, load the current key pair's records, and compare among them as you already outlined. If some key pair's records do not all fit into memory, you will have to load partial subsets, but you should still be able to keep in memory all the groups (with one unique record representative per group) found so far for that key pair, since the number of unique records will be small if the secondary hash is good.

What's the best way to search across several map<key,value>?

I have created a vector which contains several map<>.
vector<map<key,value>*> v;
v.push_back(&map1);
// ...
v.push_back(&map2);
// ...
v.push_back(&map3);
At any point in time, if a value has to be retrieved, I iterate through the vector and look the key up in every map element (i.e. v[0], v[1], etc.) until it's found. Is this the best way? I am open to any suggestions. This is just an idea; I am yet to implement it (please point out any mistakes).
Edit: It's not important in which map the element is found. Different maps are prepared in different modules and added one by one as the code progresses. Whenever a key is searched, all the maps added up to that point should be searched together.
Without more information on the purpose and use, it is a little difficult to answer. For example, is it necessary to have multiple map objects? If not, you could store all of the items in a single map and eliminate the vector altogether, which would make lookups more efficient. If there are duplicate entries across the maps, the key for each value could incorporate the differentiating information that currently determines which map a value goes into.
If you need to know which submap the key was found in, try:
unordered_map<key, pair<mapid, value>>
(an unordered_map, not an unordered_set, since you are mapping keys to tagged values). This has much better complexity for searching.
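A hedged sketch of that layout (Key, Value, MapId, and the helper names are placeholders):
#include <string>
#include <unordered_map>
#include <utility>

typedef std::string Key;
typedef int Value;
typedef int MapId;

std::unordered_map<Key, std::pair<MapId, Value>> combined;

// Instead of pushing another map into the vector, tag each entry with
// the id of the module that produced it.
void addEntry(MapId id, const Key& k, const Value& v) {
    combined.insert(std::make_pair(k, std::make_pair(id, v)));
}

// One expected-O(1) probe replaces the scan over all maps.
const std::pair<MapId, Value>* lookup(const Key& k) {
    auto it = combined.find(k);
    return it == combined.end() ? nullptr : &it->second;
}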
If the keys do not overlap, i.e., are unique throughout all maps, then I'd advise a set or unordered_set with a custom comparison functor, as this will help with the lookup. Or even extend the first map with the new maps, if profiling shows that is fast enough / faster.
If the keys are not unique, go with a multiset or unordered_multiset, again with a custom comparison functor.
You could also sort your vector manually and search it with binary_search. In any case, I advise using a single tree to store the contents of all the maps.
It depends on how your maps are "independently created", but if it's an option, I'd make just one global map (or multimap) object and pass that to all your creators. If you have lots of small maps all over the place, you can just call insert on the global one to merge your maps into it.
That way you have only a single object in which to perform lookup, which is reasonably efficient (O(log n) for multimap, expected O(1) for unordered_multimap).
This also saves you from having to pass raw pointers to containers around and having to clean up!
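For example, merging each module's map into the global multimap might look like this sketch (placeholder key/value types; map1, map2 are the question's names):
#include <map>
#include <string>

std::multimap<std::string, int> global;

void merge(const std::map<std::string, int>& m) {
    global.insert(m.begin(), m.end()); // insert the whole range at once
}
// merge(map1); merge(map2); ... then search only `global`.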

C++: multiple keyed map

I am searching for a (multi)map where the values can be looked up via different key types. Basically what was asked here for Java, but for C++. Does something like this already exist, or do I have to implement it myself?
Another, simpler case (the above would already solve this, but there may be a simpler solution, especially for this case):
I want a multimap where my values are all unique and ordered (the keys are of course ordered too), and I want to be able to search the map for a specific value in O(log n) time. That way I can get the key associated with a value in O(log n) time, and the value associated with a key in O(log n) time as well.
If you want to be able to search both by key and by value use boost.bimap.
If you need multiple keys use boost.multi-index.
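A minimal sketch of the bimap case, assuming int keys and string values (requires Boost):
#include <boost/bimap.hpp>
#include <string>

typedef boost::bimap<int, std::string> Bimap;

int main() {
    Bimap bm;
    bm.insert(Bimap::value_type(1, "one")); // one insert fills both views

    // Both lookups are O(log n) with the default ordered set views:
    auto byKey   = bm.left.find(1);      // byKey->second   == "one"
    auto byValue = bm.right.find("one"); // byValue->second == 1
    return (byKey != bm.left.end() && byValue != bm.right.end()) ? 0 : 1;
}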
Boost Multi-Index.