finding items to de-duplicate - c++

I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.
The result I need is, for example:
X1 equals X3 and X6
X2 is unique
X4 equals X5
(Order of the lines, or order within a line, doesn't matter).
How can I implement that with pair-wise comparisons?
Here's what I have so far:
Compare all pairs (Xi, Xk) with i < k, and exploit transitivity: if I already found X1==X3 and X1==X6, I don't need to compare X3 and X6.
So I could use the following data structures:
map: index --> group
multimap: group --> indices
where group is arbitrarily assigned (e.g. "line number" in the output).
For a pair (Xi, Xk) with i < k:
if both i and k already have a group assigned, skip
if they compare equal:
if i already has a group assigned, put k in that group
otherwise, create a new group for i and put k in it
if they are not equal:
if i has no group assigned yet, assign a new group for i
same for k
That should work if I'm careful with the order of items, but I wonder if this is the best / least surprising way to solve this, as this problem seems to be somewhat common.
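A rough sketch of that bookkeeping, literally transcribing the rules above (untested; equalItems() stands in for the expensive comparison, and the caveat about pair order still applies):

#include <cstddef>
#include <map>

std::map<std::size_t, int> groupOf;        // index --> group
std::multimap<int, std::size_t> members;   // group --> indices
int nextGroup = 0;

bool equalItems(std::size_t i, std::size_t k);   // the expensive comparison, defined elsewhere

void assignNewGroup(std::size_t i)
{
    groupOf[i] = nextGroup;
    members.insert({nextGroup, i});
    ++nextGroup;
}

// Handle one pair (i, k) with i < k.
void handlePair(std::size_t i, std::size_t k)
{
    if (groupOf.count(i) && groupOf.count(k))
        return;                                  // both already grouped: skip the comparison

    if (equalItems(i, k)) {
        if (!groupOf.count(i))
            assignNewGroup(i);                   // open a new group for i if needed
        groupOf[k] = groupOf[i];                 // put k into i's group
        members.insert({groupOf[i], k});
    } else {
        if (!groupOf.count(i)) assignNewGroup(i);
        if (!groupOf.count(k)) assignNewGroup(k);
    }
}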
Background/More info: the purpose is deduplicating storage of the items. They already have a hash; in case of a collision we want to guarantee a full comparison. The size of the data in question has a very sharp long-tail distribution.
An iterative algorithm (find any two duplicates, share them, repeat until there are no duplicates left) might be easier, but we want non-modifying diagnostics.
Code base is C++, something that works with STL / boost containers or algorithms would be nice.
[edit] Regarding the hash: For the purpose of this question, please assume a weak hash function that cannot be replaced.
This is required for a one-time deduplication of existing data, and it needs to deal with hash collisions. The original choice was "fast hash, and compare on collision"; the hash chosen turns out to be a little weak, but changing it would break backward compatibility. Even then, I sleep better with a simple statement ("in case of a collision, you won't get the wrong data") instead of blogging about wolf attacks.

Here's another, maybe simpler, data structure for exploiting transitivity. Make a queue of the comparisons you need to do. For example, in the case of 4 items, it will be [(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)]. Also keep an array of the comparisons you've already done. Before each comparison, check whether that comparison has been done before, and every time you find a match, go through the queue and replace the matching item index with its lower-index equivalent.
For example, suppose we pop (1,2), compare, they're not equal, push (1,2) to the array of already_visited and continue. Next, pop (1,3) and find that they are equal. At this point, go through the queue and replace all 3's with 1's. The queue will be [(1,4), (2,1), (2,4), (1,4)], and so on. When we reach (2,1), it has already been visited, so we skip it, and the same with (1,4).
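A rough sketch of this queue-plus-substitution bookkeeping (untested; equalItems() is a stand-in for the expensive comparison, and pairs are normalized to (low, high) order so the "already done" check works):

#include <cstddef>
#include <deque>
#include <set>
#include <utility>

bool equalItems(std::size_t a, std::size_t b);   // the expensive comparison, defined elsewhere

// Returns the pairs found equal; items are numbered 1..n as in the example above.
std::set<std::pair<std::size_t, std::size_t>> findDuplicatePairs(std::size_t n)
{
    std::deque<std::pair<std::size_t, std::size_t>> queue;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t k = i + 1; k <= n; ++k)
            queue.push_back({i, k});

    std::set<std::pair<std::size_t, std::size_t>> done, equal;
    while (!queue.empty()) {
        auto p = queue.front();
        queue.pop_front();
        if (p.first == p.second || done.count(p))
            continue;                            // trivial or already visited: skip
        done.insert(p);
        if (equalItems(p.first, p.second)) {
            equal.insert(p);
            // rewrite the rest of the queue: every occurrence of the higher
            // index is replaced by its lower, equivalent index
            for (auto& q : queue) {
                if (q.first == p.second)  q.first  = p.first;
                if (q.second == p.second) q.second = p.first;
                if (q.first > q.second)   std::swap(q.first, q.second);
            }
        }
    }
    return equal;
}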
But I do agree with the previous answers. Since comparisons are computationally expensive, you probably want to compute a fast, reliable, hash table first, and only then apply this method to the collisions.

So... you already have a hash? How about this:
sort and group on hash
print all groups with size 1 as unique
compare collisions
Tip for comparing collisions: why not just rehash them with a different algorithm? Rinse, repeat.
(I am assuming you are storing files/blobs/images here, that you have hashes of them, that you can slurp the hashes into memory, and that the hashes are like sha1/md5 etc., so collisions are very unlikely)
(also, I'm assuming that two different hashing algorithms will not both collide on the same pair of different inputs, but this is probably safe to assume...)
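A sketch of steps 1-3, assuming the stored hashes fit in memory; reportUnique() and compareCollisions() are placeholders for whatever reporting and comparison you already have:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = std::uint64_t;                       // assumed hash type

void reportUnique(std::size_t index);             // placeholder reporting hook
void compareCollisions(const std::vector<std::size_t>& indices);   // placeholder full-compare step

// item_hash[i] is the stored (weak) hash of item i
void groupByHash(const std::vector<Hash>& item_hash)
{
    std::vector<std::size_t> order(item_hash.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;

    // 1. sort item indices by hash
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return item_hash[a] < item_hash[b]; });

    // 2./3. walk the sorted list; each run of equal hashes is one candidate group
    for (std::size_t i = 0; i < order.size();) {
        std::size_t j = i + 1;
        while (j < order.size() && item_hash[order[j]] == item_hash[order[i]]) ++j;
        if (j - i == 1)
            reportUnique(order[i]);               // group of size 1: unique
        else
            compareCollisions(std::vector<std::size_t>(order.begin() + i, order.begin() + j));
        i = j;
    }
}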

Make a hash of each item. Make a list of pair<hash, item_index>. You can find the groups by sorting this list by hash, or by putting it into a std::multimap.
When you output the group list, you still need to compare the items within a group to rule out hash collisions.
So for each item you do one hash calculation and roughly one comparison, plus the sort of the hash list.
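As a sketch, the std::multimap flavour of this (computeHash() is a placeholder for however you obtain the stored hash of an item):

#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Hash = std::uint64_t;              // assumed hash type
Hash computeHash(std::size_t index);     // placeholder: stored hash of item 'index'

void findGroups(std::size_t itemCount)
{
    std::multimap<Hash, std::size_t> byHash;          // hash --> item indices
    for (std::size_t i = 0; i < itemCount; ++i)
        byHash.insert({computeHash(i), i});

    for (auto it = byHash.begin(); it != byHash.end();) {
        auto range = byHash.equal_range(it->first);   // one run of equal hashes = one candidate group
        std::vector<std::size_t> group;
        for (auto g = range.first; g != range.second; ++g)
            group.push_back(g->second);
        // group.size() == 1 : item is unique
        // group.size() >  1 : same hash, confirm with full comparisons
        it = range.second;
    }
}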

I agree with the idea of using a second (hopefully improved) hash function, so you can resolve some of your weak hash's collisions without needing to do costly pairwise comparisons. Since you say you are having memory limitation issues, hopefully you can fit the entire hash table (with secondary keys) in memory, where each entry in the table stores a list of record indices for the on-disk records that correspond to that key pair. The question then is whether, for each key pair, you can load into memory all the records that have that key pair. If so, you can just iterate over key pairs: for each key pair, free any records in memory for the previous key pair, load the records for the current key pair, and do the comparisons among them as you already outlined. If you have a key pair whose records don't all fit in memory, you'll have to load partial subsets, but you should definitely be able to keep in memory all the groups (with one unique representative record per group) found so far for that key pair, since the number of unique records will be small if you have a good secondary hash.
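As a sketch of that scheme (keysFor, loadRecord, equalRecords and Record are all placeholders, and it assumes each key-pair bucket fits in memory):

#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using KeyPair = std::pair<std::uint32_t, std::uint64_t>;   // (weak hash, secondary hash)
struct Record { std::vector<char> bytes; };                // placeholder for the on-disk payload

KeyPair keysFor(std::size_t recordIndex);                  // both hashes of a record
Record loadRecord(std::size_t recordIndex);                // read one record from disk
bool equalRecords(const Record& a, const Record& b);       // the expensive full comparison

void deduplicate(std::size_t recordCount)
{
    // In-memory index: (weak hash, secondary hash) -> indices of records on disk
    std::map<KeyPair, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < recordCount; ++i)
        buckets[keysFor(i)].push_back(i);

    // Only records sharing *both* keys can be equal, so buckets are independent.
    for (const auto& bucket : buckets) {
        const std::vector<std::size_t>& indices = bucket.second;
        if (indices.size() == 1)
            continue;                                      // unique by key pair

        std::vector<Record> records;
        for (std::size_t idx : indices)
            records.push_back(loadRecord(idx));            // assumes this bucket fits in memory

        for (std::size_t a = 0; a < records.size(); ++a)   // pairwise comparison within the bucket
            for (std::size_t b = a + 1; b < records.size(); ++b)
                if (equalRecords(records[a], records[b]))
                { /* indices[a] and indices[b] are true duplicates */ }
    }
}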

Related

Find common value in two maps without iterating

I have these two maps, each storing 10,000+ entries:
std::map<std::string,ObjectA> mapA;
std::map<std::string,ObjectB> mapB;
I want to retrieve only those values from the maps whose keys are present in both maps.
For example, if key "10001" is found in both mapA and mapB, then I want the corresponding objects from both maps. Something like doing a join on SQL tables. The easiest way would be to iterate over the smaller map and call find(iter->first) on the other map in each iteration to pick out the keys that qualify, but that would also be very expensive.
Instead, I am considering maintaining a set like this:
std::set<std::string> common;
1) Every time I insert into one of the maps, I check whether the key exists in the other map. If it does, I add the key to the above common set.
2) Every time I remove an entry from one of the maps, I remove the key from the common set, if it exists.
The common set always maintains the keys that are in both maps. When I want to do the join, I already have the qualifying keys. Is there a faster/better way?
The algorithm is pretty simple. First, you treat the two maps as sequences (using iterators).
If either remaining sequence is empty, you're done.
If the keys at the front of each sequence are the same, you have found a match.
If the keys differ, discard the lower (according to the map's sorting order) of the two.
You'll be iterating over both maps, which means a complexity of O(n+m), which is significantly better than the naive approach with its O(n log m) or O(m log n) complexity.
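A minimal sketch of that merge as a free function (the callback is just one way to hand back the matching pair; ObjectA/ObjectB as in the question):

#include <map>
#include <string>

// Walk both maps in step. Both are sorted by key, so every match is found
// in a single O(n + m) pass (the same idea as std::set_intersection).
template <typename Key, typename A, typename B, typename OnMatch>
void joinMaps(const std::map<Key, A>& mapA, const std::map<Key, B>& mapB, OnMatch onMatch)
{
    auto a = mapA.begin();
    auto b = mapB.begin();
    while (a != mapA.end() && b != mapB.end()) {
        if (a->first < b->first)
            ++a;                                        // discard the lower key
        else if (b->first < a->first)
            ++b;
        else {
            onMatch(a->first, a->second, b->second);    // key present in both maps
            ++a;
            ++b;
        }
    }
}

// usage:
// joinMaps(mapA, mapB, [](const std::string& key, const ObjectA& a, const ObjectB& b) {
//     /* use the joined pair */
// });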

Data structure for names and pseudo-IDs: use a hash table or a BST?

I'm tasked with building a data structure that stores a mapping from integer "pseudo-IDs" to names. I can insert new names into the table, where each name is associated with a number of pseudo-IDs, provided that none of the pseudo-IDs is already taken. I need to support lookups by ID and deletion by ID, where if I delete any pseudo-ID for a person, it removes all the pseudo-IDs for that person.
This program runs on a script that looks something like this:
I JackSmart 3 9 1009 1000009
L 1000009
I TedPumpkinhead 1 19
I PeterMeter 1 9
L 19
D 19
L 19
I JohnCritic 2 1 19
L 19
L 1
L 9
Here, the first character of each line determines how to interpret it.
A line starting with I is an insertion. The rest of the line will consist of a name, followed by a number of pseudo-IDs for that name, then each of the pseudo-IDs. The name should be inserted unless any of the pseudo-IDs is already in use.
A line starting with an L is a lookup. The line contains a pseudo-ID to look up. I need to print the name associated with it, or report that no such name exists.
A line starting with a D is a deletion. The line contains the pseudo-ID to be deleted. I need to then remove the person associated with that pseudo-ID from the table, such that looking them up by any of their pseudo-IDs now fails.
The output of this task (according to the sample file at top) would be:
ok
JackSmart
ok
no
TedPumpkinhead
ok
no
ok
JohnCritic
JohnCritic
JackSmart
Which is the best approach here? Which data structure should I use for this task? As there's insertion and deletion, I think a BST fits. Any ideas?
Additionally, this needs to run efficiently. Each task should run in worst-case time O(log n).
Let's go through a progression of different approaches for trying to solve this problem.
One option for solving this problem would be to store some kind of associative array, where each key is a pseudo-ID and each value is a name. To do an insertion, you check whether each pseudo-ID is free; if so, you insert a key/value pair for each of them. To do a lookup, you just look up by ID and return what you find. To do a deletion, you look up the name associated with the given pseudo-ID, then iterate over the table and delete every key/value pair that has that name as its value.
This approach is pretty fast for insertions and lookups, but it's really slow for deletions. Specifically, inserting a name with k pseudo-IDs requires O(k) associative array operations (the runtime of each depends on how we implement the associative array), looking up a name requires O(1) associative array operations, but deletion requires O(n) associative array operations.
To speed things up, one trick we can use would be to have each value store, in addition to the actual name, a list of all the pseudo-IDs associated with that name. That way, when we perform a deletion, we can look up which pseudo-IDs need to be deleted without having to literally look through every entry in the associative array. This is a lot faster - it drops the cost of a deletion to O(k) associative array operations, where k is the number of elements deleted.
To make things even faster, one clever idea would be to have each value in the associative array be a pointer to a structure that stores two pieces of information: the actual name associated with the entry, plus a flag indicating whether the entry has been deleted. Each pseudo-ID key then stores a pointer to this shared structure. Whenever you do a deletion, you can effectively delete all copies of the key/value pair by just setting the "deleted" flag on the structure to true. The key/value mappings still exist in the table, though, so whenever you do a lookup or an insertion and find a key/value pair stored somewhere, you check the flag before concluding that the pseudo-ID is already in use. If it's marked as deleted, you can treat the ID as free. If it's not marked as deleted, the entry is live and the ID really is in use.
Here's some pseudocode for this:
Maintain an associative array "table."
To do a lookup(pseudoID):
if table[pseudoID] doesn't exist, return null
if table[pseudoID] exists, but table[pseudoID].deleted is true, return null
return table[pseudoID].value
To do insertion(name, id1, id2, ..., idk):
for (each id):
if (lookup(id) != null) return false
create a new structure:
entry.value = name
entry.deleted = false
for (each id):
table[id] = entry
return true
To do delete(id):
if lookup(id) == null, return false
table[id].deleted = true
Overall, this new approach requires O(k) associative array operations for an insertion of a name with k pseudo-IDs, O(1) associative array operations for a lookup, and O(1) associative array operations for a deletion.
The question now is what the best structure would be for representing the associative array. Notice that we never need to access anything in sorted order, so while we could use a BST, a hash table will likely be a lot faster (in expectation). If you need to guarantee worst-case efficiency, though, you could use the BST.
With a BST, insertion runs in time O(k log n) for k IDs and both lookups and deletions will run in worst-case time O(log n). For a hash table, insertion runs in expected time O(k) for k IDs and both lookups and deletions will run in expected O(1) time.
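A minimal C++ sketch of the whole design with the hash-table choice (assumptions of the sketch: pseudo-IDs fit in a long long, an empty string means "no such name"; swap std::unordered_map for std::map if you need the worst-case O(log n) bound):

#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// One shared record per person; every one of their pseudo-IDs points at it.
struct Entry {
    std::string name;
    bool deleted;
};

class NameTable {
    std::unordered_map<long long, std::shared_ptr<Entry>> table;  // pseudo-ID -> shared entry

public:
    // L: return the name, or "" if the ID is free or its entry was deleted
    std::string lookup(long long id) const {
        auto it = table.find(id);
        if (it == table.end() || it->second->deleted) return "";
        return it->second->name;
    }

    // I: insert only if none of the pseudo-IDs is currently live
    bool insert(const std::string& name, const std::vector<long long>& ids) {
        for (long long id : ids)
            if (!lookup(id).empty()) return false;
        auto entry = std::make_shared<Entry>();
        entry->name = name;
        entry->deleted = false;
        for (long long id : ids)
            table[id] = entry;            // overwrites stale (deleted) mappings, if any
        return true;
    }

    // D: flip the shared flag; all of this person's IDs go dead at once
    bool erase(long long id) {
        auto it = table.find(id);
        if (it == table.end() || it->second->deleted) return false;
        it->second->deleted = true;
        return true;
    }
};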
Another option, depending on the types of numbers that you can use for pseudo-IDs, might be to use a binary trie. That would depend on how many bits can be in the keys and how dense you expect everything to be.

Not sure which data structure to use

Assuming I have the following text:
today was a good day and today was a sunny day.
I break this text up into words, separated by white space, one per line:
Today
was
a
good
etc.
Now I use the vector data structure to simply count the number of words in the text via .size(). That's done.
However, I also want to check if a word comes up more than once, and if so, how many times. In my example, "today" comes up 2 times.
I want to store "today" and append a count of 2 (or x, depending on how often it comes up in a larger text). That's not just for "today" but for every word in the text. I want to look up how often a word appears, append a counter, and sort the words with their counters in descending order (that's another thing, and not important right now).
I'm not sure which data structure to use here. A map perhaps? But I can't add counters to a map.
Edit: This is what I've done so far: http://pastebin.com/JncR4kw9
You should use a map. In fact, you should use an unordered_map.
unordered_map<string,int> gives you a hash table keyed by strings, with the mapped integer keeping the count.
unordered_map has the advantage of O(1) lookup and insertion over the O(log n) lookup and insertion of a map. This is because the former is backed by an array of buckets, whereas the latter is typically implemented as a tree (red-black, usually).
The only disadvantage of an unordered_map is that, as its name suggests, you can't iterate over the elements in lexical order. This should be clear from the description of their structure above. However, you don't seem to need such a traversal, so it shouldn't be an issue.
unordered_map<string,int> mymap;
mymap[word]++; // will increment the counter associated with the count of a word.
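A small self-contained version of that counting loop, assuming the words have already been split into your vector:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // 'words' stands in for the vector you already fill while counting words
    std::vector<std::string> words = {"today", "was", "a", "good", "day",
                                      "and", "today", "was", "a", "sunny", "day"};

    std::unordered_map<std::string, int> counts;
    for (const std::string& w : words)
        ++counts[w];                      // operator[] creates the entry with count 0 on first use

    for (const auto& entry : counts)
        std::cout << entry.first << ": " << entry.second << '\n';   // unspecified order
}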
Why not use two data structures? The vector you have now, plus a map with the string as the key and an integer as the data, which will then hold the number of times the word was found in the text.
Sort the vector in alphabetical order.
Scan it and compare every word to those that follow until you find a different one, and so on.
a, a, and, day, day, good, sunny, today, today, was, was
a: 2, and: 1, day: 2, good: 1, sunny: 1, today: 2, was: 2
A better option to consider is a radix tree: https://en.wikipedia.org/wiki/Radix_tree
It is quite memory efficient, and for large text inputs it can perform better than the alternative data structures. You can store the frequency of each word in the tree's nodes, and it also benefits from locality of reference for typical text.

chained hash table keys with universal hashing, does it need a rehash?

I am implementing a chained hash table using a vector<list>. I resized my vector to a prime size, let's say 5. To compute the key I am using universal hashing.
My question is: do I need to rehash my vector? I mean, this code will always generate a key in the range 0 to 4, because the key depends on the size of my hash table. That causes collisions, of course, but new strings are simply appended to the list at each position of the vector... so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?
Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.
do I need to rehash my vector?
Your container could continue to function without rehashing, but searching, insertion and erasure will perform more and more like a plain list instead of a hash table: for example, if you've inserted 10,000 elements you can expect each list in your vector to have roughly 2,000 elements, and you may have to search all 2,000 to see if a value you're about to insert is a duplicate, to find a value to erase, or simply to return an iterator to it. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash table implementation. Your non-resizing implementation is still "O(N)".
Is this a mistake?
Yes, a fundamental one.
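For illustration, a minimal sketch of what the missing rehash step looks like (std::hash stands in for your universal hash, and the grow-when-load-factor-exceeds-1 policy is just one possible choice):

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

class ChainedTable {
    std::vector<std::list<std::string>> buckets;
    std::size_t count = 0;

    std::size_t bucketFor(const std::string& key, std::size_t nbuckets) const {
        return std::hash<std::string>{}(key) % nbuckets;   // stand-in for the universal hash
    }

    // The step under discussion: grow the bucket vector and re-insert every
    // element under its new index (the modulus changed, so positions change).
    void rehash(std::size_t newSize) {
        std::vector<std::list<std::string>> fresh(newSize);
        for (const auto& chain : buckets)
            for (const auto& key : chain)
                fresh[bucketFor(key, newSize)].push_back(key);
        buckets.swap(fresh);
    }

public:
    ChainedTable() : buckets(5) {}                         // initial prime size

    void insert(const std::string& key) {
        if (count + 1 > buckets.size())                    // keep the load factor <= 1
            rehash(buckets.size() * 2 + 1);                // a real table would pick the next prime
        buckets[bucketFor(key, buckets.size())].push_back(key);
        ++count;
    }
};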

How can we benefit from vs2010 hash_map's less?

See this if you don't know that VS2010's hash_map actually requires a total ordering, and hence a user-defined less.
One of the answers said it makes binary search possible, but I don't think so, because:
The hash function should be uniform, and it is better to keep the load factor below 1; that means, in most cases, one element per hash slot, i.e. no need for binary search.
Obviously, it will slow down insertion because the appropriate position has to be located.
How does hash-map benefit from this design? and how do we utilize this design?
thanks
The hash function should be uniform, and it is better to keep the load factor below 1; that means, in most cases, one element per hash slot, i.e. no need for binary search.
There won't be just one element per hash slot; some buckets will have to hold more than one key. Unless the input comes from a pre-determined, restricted set of values (i.e. perfect hashing), the hash function has to deal with more inputs than the outputs it can produce, so there will be collisions; this is unavoidable in an implementation as generic as this one. However, a good hash function produces well-distributed hashes, which keeps the number of elements per hash slot low.
Obviously, it will slow down insertion because the appropriate position has to be located.
Assuming a good hash function and non-degenerate input (degenerate input being input crafted so that many elements end up with the same hash), there will always be only a few keys per bucket. Inserting into such a small binary search tree isn't a big cost, and that little cost may bring benefits elsewhere (searches may be faster than with a linked-list implementation). And in the case of degenerate input, the hash map degenerates into a binary search tree, which is much better than a simple linked list.
Your question is largely irrelevant in practice, because C++ now supplies unordered_map etc. which use an Equal predicate rather than a less-than comparator.
However, consider a hash_map<string, ...>. Clearly, the value space of string is larger than that of size_t, so for any hash function there will be values that have the same hash and so are placed in the same bucket. In the pathological situation where all the items in the hash table are placed in the same bucket, exploiting ordering among keys will result in improved speed of access, insertion and removal.
Note that search on an ordered list (or binary tree) is O(log n) as opposed to O(n).
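This is not how hash_map is implemented internally, but as a toy sketch of the idea: if each bucket is itself kept ordered (here simply a std::map), a lookup inside a bucket costs O(log n) even in the pathological case where every key lands in the same bucket:

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

template <typename Value>
class OrderedBucketTable {
    std::vector<std::map<std::string, Value>> buckets;   // each bucket is an ordered structure

public:
    explicit OrderedBucketTable(std::size_t n) : buckets(n) {}   // n > 0 bucket count

    void insert(const std::string& key, const Value& v) {
        buckets[std::hash<std::string>{}(key) % buckets.size()][key] = v;
    }

    Value* find(const std::string& key) {
        auto& bucket = buckets[std::hash<std::string>{}(key) % buckets.size()];
        auto it = bucket.find(key);                 // ordered lookup, O(log bucket size)
        return it == bucket.end() ? nullptr : &it->second;
    }
};

// usage: OrderedBucketTable<int> table(5); table.insert("key", 42);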