C++ Standard Library hash code sample - c++

I solved a problem to find duplicates in a list.
I used the property of a set that it contains only unique members:
set<int> s;
// insert the new item into the set
s.insert(nums[index]);
// if size does not increase there is a duplicate
if (s.size() == previousSize)
{
DuplicateFlag = true;
break;
}
Now I am trying to solve the same problem with hash functions in the Standard Library. I have sample code like this
#include <functional>
#include <iostream>
using namespace std;
hash<int> hash_fn2;
int x = 34567672;
size_t int_hash2 = hash_fn2(x);
cout << x << " " << int_hash2 << '\n';
x and int_hash2 are always the same.
Am I missing something here?

For std::hash<int>, it's ok to directly return the original int value. From the specification, it only needs to ensure that for two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max(). Clearly returning the original value satisfies the requirement for std::hash<int>.

x and int_hash2 are always the same. Am I missing something here?
Yes. You say "I am trying to solve the same problem with hash functions", but hash functions are not functional alternatives to std::set<>s, and cannot, by themselves, be used to solve your problem. You probably want to use a std::unordered_set<>, which internally uses a hash table and (by default) the std::hash<> function to help it map elements to "buckets". For the purposes of a hash table, a hash function for integers that returns its input is usually good enough, and if it's not, the programmer is expected to provide their preferred alternative as a template parameter.
Anyway, all you have to do to try a hash table approach is change std::set<int> s; to std::unordered_set<int> s; in your original code (and include <unordered_set>).
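As a minimal sketch of that change, the duplicate check can be written around std::unordered_set. The function name has_duplicate is a hypothetical helper, not something from the question; it relies on insert() reporting whether the element was actually added, which replaces the size comparison used in the original loop.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns true if any value occurs more than once in nums.
bool has_duplicate(const std::vector<int>& nums) {
    std::unordered_set<int> seen;
    for (int n : nums) {
        // insert() returns a pair; .second is false when n was already present.
        if (!seen.insert(n).second) {
            return true;
        }
    }
    return false;
}
```

Swapping std::unordered_set for std::set in this sketch gives the tree-based version with the same logic.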

Related

How to remove duplicates of type vector<string> in C++?

I know that a good way to prevent duplicates is to use an unordered_set. However, this method does not seem to work when I want to have an unordered_set<vector<string>>. How can I go about doing this? For example, I want to prevent <"a", "b", "c"> from being duplicated in my unordered_set<vector<string>>.
Can this unordered_set<vector<string>> be used outside the defined class as well?
Code:
unordered_set<vector<string>> abc({"apple", "ball", "carrot"});
abc.insert({"apple", "ball", "carrot"});
cout << abc.size() << endl; //abc.size() should be 1
There are a number of ways to get rid of duplicates; building a set out of your objects is one of them. Whether it is going to be std::set or std::unordered_set is up to you to decide, and the decision usually depends on how good a hash function you can come up with.
This in turn requires knowledge of the domain, e.g. what your vectors of strings represent and what values they can have. If you do come up with a good hash, you can implement it like this:
struct MyHash
{
std::size_t operator()(std::vector<std::string> const& v) const
{
// your hash code here
return 0; // return your hash value instead of 0
}
};
Then you just declare your unordered_set with that hash:
std::unordered_set<std::vector<std::string>, MyHash> abc;
I would say it's a safe bet to just go with a std::set at first though, unless you have a good hash function in mind.
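If no domain-specific hash comes to mind, one generic option is to fold the per-string hashes together in order. The sketch below is an assumption, not part of the answer above: the names VectorStringHash and StringVectorSet are made up for illustration, and the mixing step is modelled on the widely used hash_combine formula rather than tuned for any particular data.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical hash functor for std::vector<std::string>.
struct VectorStringHash {
    std::size_t operator()(const std::vector<std::string>& v) const {
        std::hash<std::string> string_hash;
        std::size_t seed = v.size();
        for (const std::string& s : v) {
            // Order-sensitive mixing: {"a", "b"} and {"b", "a"} will usually hash differently.
            seed ^= string_hash(s) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Equality of the vectors is still decided by operator==, so duplicates are rejected
// even if two different vectors happen to share a hash value.
using StringVectorSet = std::unordered_set<std::vector<std::string>, VectorStringHash>;
```

With this in place, inserting {"apple", "ball", "carrot"} twice into a StringVectorSet leaves a single element.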

Hash value for a std::unordered_map

According to the standard there's no support for containers (let alone unordered ones) in the std::hash class. So I wonder how to implement that. What I have is:
std::unordered_map<std::wstring, std::wstring> _properties;
std::wstring _class;
I thought about iterating the entries, computing the individual hashes for keys and values (via std::hash<std::wstring>) and concatenate the results somehow.
What would be a good way to do that and does it matter if the order in the map is not defined?
Note: I don't want to use boost.
A simple XOR was suggested, so it would be like this:
size_t MyClass::GetHashCode()
{
std::hash<std::wstring> stringHash;
size_t mapHash = 0;
for (auto property : _properties)
mapHash ^= stringHash(property.first) ^ stringHash(property.second);
return ((_class.empty() ? 0 : stringHash(_class)) * 397) ^ mapHash;
}
?
I'm really unsure if that simple XOR is enough.
Response
If by enough, you mean whether or not your function is injective, the answer is No. The reasoning is that the set of all hash values your function can output has cardinality 2^64, while the space of your inputs is much larger. However, this is not really important, because you can't have an injective hash function given the nature of your inputs. A good hash function has these qualities:
It's not easily invertible. Given the output k, it's not computationally feasible within the lifetime of the universe to find m such that h(m) = k.
The range is uniformly distributed over the output space.
It's hard to find two inputs m and m' such that h(m) = h(m')
Of course, the extents of these really depend on whether you want something that's cryptographically secure, or you want to take some arbitrary chunk of data and just map it to some arbitrary 64-bit integer. If you want something cryptographically secure, writing it yourself is not a good idea. In that case, you'd also need the guarantee that the function is sensitive to small changes in the input. The std::hash function object is not required to be cryptographically secure. It exists for use cases isomorphic to hash tables. cppreference says:
For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
I'll show below how your current solution doesn't really guarantee this.
Collisions
I'll give you a few of my observations on a variant of your solution (I don't know what your _class member is).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= h(p.first) ^ h(p.second);
}
return result;
}
It's easy to generate collisions. Consider the following maps:
std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';
On my machine, compiling with g++ 4.9.1, this outputs:
1225586629984767119
1225586629984767119
Whether this matters depends on how often you're going to have maps where keys and values are reversed. More generally, these collisions will occur between any two maps whose combined collections of key and value strings are the same, because every string hash is XORed into the result without regard to whether it came from a key or a value.
Order of Iteration
Two unordered_map instances having exactly the same key-value pairs will not necessarily have the same order of iteration. cppreference says:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2).
This is a trivial requirement for a hash function. Your solution meets it despite the unspecified iteration order, because XOR is commutative and so the order in which the pairs are combined doesn't matter.
A Possible Solution
If you don't need something that's cryptographically secure, you can modify your solution slightly to kill the symmetry. This approach is okay in practice for hash tables and the like. This solution is also independent of the fact that order in an unordered_map is undefined. It uses the same property your solution used (Commutativity of XOR).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
const std::size_t prime = 19937;
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= prime*h(p.first) + h(p.second);
}
return result;
}
All you need in a hash function in this case is a way to map a key-value pair to an arbitrary good hash value, and a way to combine the hashes of the key-value pairs using a commutative operation. That way, order does not matter. In the example hash_code I wrote, the key-value pair hash value is just a linear combination of the hash of the key and the hash of the value. You can construct something a bit more intricate, but there's no need for that.
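Carried back to the members in the question (a std::wstring class name plus a map of std::wstring properties), the same idea might look like the sketch below. The names PropertyMap, hash_properties and hash_object are hypothetical; the multiplier 397 is taken from the question's code and 19937 from the answer above, and neither is claimed to be an optimal choice.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

using PropertyMap = std::unordered_map<std::wstring, std::wstring>;

// Order-independent hash of the map: each pair is hashed asymmetrically
// (key weighted by a prime), then the pair hashes are combined with XOR.
std::size_t hash_properties(const PropertyMap& properties) {
    const std::size_t prime = 19937;
    std::hash<std::wstring> h;
    std::size_t result = 0;
    for (const auto& p : properties) {
        result ^= prime * h(p.first) + h(p.second);
    }
    return result;
}

// Combines the class name with the map hash, following the question's formula.
std::size_t hash_object(const std::wstring& class_name, const PropertyMap& properties) {
    std::hash<std::wstring> h;
    return (h(class_name) * 397) ^ hash_properties(properties);
}
```

Because XOR is commutative, two maps holding the same pairs produce the same value even when their bucket layout and iteration order differ.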

What's the difference between set<pair> and map in C++?

There are two ways in which I can easily make a key,value attribution in C++ STL: maps and sets of pairs. For instance, I might have
map<key_class,value_class>
or
set<pair<key_class,value_class> >
In terms of algorithm complexity and coding style, what are the differences between these usages?
They are semantically different. Consider:
#include <set>
#include <map>
#include <utility>
#include <iostream>
using namespace std;
int main() {
pair<int, int> p1(1, 1);
pair<int, int> p2(1, 2);
set< pair<int, int> > s;
s.insert(p1);
s.insert(p2);
map<int, int> m;
m.insert(p1);
m.insert(p2);
cout << "Set size = " << s.size() << endl;
cout << "Map size = " << m.size() << endl;
}
http://ideone.com/cZ8Vjr
Output:
Set size = 2
Map size = 1
Set elements cannot be modified while they are in the set. set's iterator and const_iterator are equivalent. Therefore, with set<pair<key_class,value_class> >, you cannot modify the value_class in-place. You must remove the old value from the set and add the new value. However, if value_class is a pointer, this doesn't prevent you from modifying the object it points to.
With map<key_class,value_class>, you can modify the value_class in-place, assuming you have a non-const reference to the map.
map<key_class,value_class> will sort on key_class and allow no duplicates of key_class.
set<pair<key_class,value_class> > will sort on key_class and then value_class if the key_class instances are equal, and will allow multiple values for key_class
The basic difference is that for the set the key is the pair, whereas for the map the key is key_class - this makes looking things up by key_class, which is what you want to do with maps, difficult for the set.
Both are typically implemented with the same data structure (normally a red-black balanced binary tree), so the complexity for the two should be the same.
std::map acts as an associative data structure. In other words, it allows you to query and modify values using its associated key.
A std::set<pair<K,V> > can be made to work like that, but you have to write extra code for the query using a key and more code to modify the value (i.e. remove the old pair and insert another with the same key and a different value). You also have to make sure there is no more than one value with the same key (you guessed it, more code).
In other words, you can try to shoe-horn a std::set to work like a std::map, but there is no reason to.
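To make the "extra code" concrete, here is a rough sketch of what key lookup and value replacement could look like on a set of pairs. It assumes int keys and values, and the names PairSet, find_by_key and replace_value are invented for illustration.

```cpp
#include <limits>
#include <set>
#include <utility>

using PairSet = std::set<std::pair<int, int>>;

// Finds the first pair whose .first equals key, or s.end() if there is none.
// Pairs sort by key and then by value, so the smallest possible value is used as the probe.
PairSet::const_iterator find_by_key(const PairSet& s, int key) {
    auto it = s.lower_bound(std::make_pair(key, std::numeric_limits<int>::min()));
    return (it != s.end() && it->first == key) ? it : s.end();
}

// Set elements are immutable, so "modifying" a value means erase plus insert.
bool replace_value(PairSet& s, int key, int new_value) {
    auto it = find_by_key(s, key);
    if (it == s.end()) {
        return false;
    }
    s.erase(it);
    s.insert(std::make_pair(key, new_value));
    return true;
}
```

With std::map<int, int> the same two operations are simply m.find(key) and m[key] = new_value.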
To understand algorithmic complexity, you first need to understand the implementation.
std::map is implemented using an RB-tree, whereas hash_map is implemented using arrays of linked lists. std::map provides O(log(n)) insert/delete/search operations; hash_map is O(1) in the best case and O(n) in the worst case, depending on hash collisions.
Visualising that semantic difference Philipp mentioned: after stepping through the code, note how the map key is a const int and how p2 was not inserted into m.

What is the best way to use two keys with a std::map?

I have a std::map that I'm using to store values for x and y coordinates. My data is very sparse, so I don't want to use arrays or vectors, which would result in a massive waste of memory. My data ranges from -250000 to 250000, but I'll only have a few thousand points at the most.
Currently I'm creating a std::string with the two coordinates (i.e. "12x45") and using it as a key. This doesn't seem like the best way to do it.
My other thoughts were to use an int64 and shove the two int32s into it and use it as a key.
Or to use a class with the two coordinates. What are the requirements on a class that is to be used as the key?
What is the best way to do this? I'd rather not use a map of maps.
Use std::pair<int32,int32> for the key:
std::map<std::pair<int,int>, int> myMap;
myMap[std::make_pair(10,20)] = 25;
std::cout << myMap[std::make_pair(10,20)] << std::endl;
I usually solve this kind of problem like this:
struct Point {
int x;
int y;
};
inline bool operator<(const Point& p1, const Point& p2) {
if (p1.x != p2.x) {
return p1.x < p2.x;
} else {
return p1.y < p2.y;
}
}
Boost has a map container that uses one or more indices: Multi Index Map (Boost.MultiIndex).
What are the requirements on a class that is to be used as the key?
The map needs to be able to tell whether one key's value is less than another key's value: by default this means that (key1 < key2) must be a valid boolean expression, i.e. that the key type should implement the 'less than' operator.
The map template also implements an overloaded constructor which lets you pass-in a reference to a function object of type key_compare, which can implement the comparison operator: so that alternatively the comparison can be implemented as a method of this external function object, instead of needing to be baked in to whatever type your key is of.
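As a sketch of the second option, the comparison can live in a separate function object that is named as the map's third template parameter, leaving the key type itself without an operator<. The names Point, PointLess and PointMap are illustrative assumptions.

```cpp
#include <map>

// Key type with no operator< of its own.
struct Point {
    int x;
    int y;
};

// External comparison object: orders by x first, then by y.
struct PointLess {
    bool operator()(const Point& a, const Point& b) const {
        if (a.x != b.x) {
            return a.x < b.x;
        }
        return a.y < b.y;
    }
};

// The comparator type is the third template parameter of std::map.
using PointMap = std::map<Point, int, PointLess>;
```

The comparison must define a strict weak ordering; two keys are treated as equal when neither compares less than the other.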
This will stuff multiple integer keys into a large integer, in this case, an _int64. It compares as an _int64, AKA long long (The ugliest type declaration ever. short short short short, would only be slightly less elegant. 10 years ago it was called vlong. Much better. So much for "progress"), so no comparison function is needed.
#define ULNG unsigned long
#define BYTE unsigned char
#define LLNG long long
#define ULLNG unsigned long long
// --------------------------------------------------------------------------
ULLNG PackGUID(ULNG SN, ULNG PID, BYTE NodeId) {
ULLNG CompKey=0;
// Append NodeId below PID in the low part of the key
PID = (PID << 8) + NodeId;
// SN goes into the upper 32 bits
CompKey = ((ULLNG)SN << 32) + PID;
return CompKey;
}
Having provided this answer, I doubt this is going to work for you, as you need two separate and distinct keys to navigate with in 2 dimensions, X and Y.
On the other hand, if you already have the XY coordinate, and just want to associate a value with that key, then this works spectacularly, because an _int64 compare takes the same time as any other integer compare on Intel X86 chips - 1 clock.
In this case, the compare is 3X as fast on this synthetic key, vs a triple compound key.
If using this to create a sparsely populated spreadsheet, I would recommend using 2 distinct trees, one nested inside the other. Make the Y dimension "the boss", and search Y space first to resolution before proceeding to the X dimension. Spreadsheets are taller than they are wide, and you always want the 1st dimension in any compound key to have the largest number of unique values.
This arrangement would create a map for the Y dimension that would have a map for the X dimension as its data. When you get to a leaf in the Y dimension, you start searching its X dimension for the column in the spreadsheet.
If you want to create a very powerful spreadsheet system, add a Z dimension in the same way, and use that for, as an example, organizational units. This is the basis for a very powerful budgeting/forecasting/accounting system, one which allows admin units to have lots of gory detail accounts to track admin expenses and such, and not have those accounts take up space for line units which have their own kinds of detail to track.
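For the coordinate case in the question, the packing idea might be sketched as follows, assuming both coordinates fit in signed 32-bit integers. The function names pack_xy, unpack_x and unpack_y are hypothetical. The casts through std::uint32_t matter: without them a negative y would sign-extend and overwrite the half that holds x.

```cpp
#include <cstdint>

// Packs two signed 32-bit coordinates into one 64-bit key (x in the high half).
inline std::uint64_t pack_xy(std::int32_t x, std::int32_t y) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(x)) << 32)
         | static_cast<std::uint32_t>(y);
}

// Recovers x from the high half. Converting an out-of-range uint32_t back to int32_t
// is implementation-defined before C++20, but wraps as expected on mainstream compilers.
inline std::int32_t unpack_x(std::uint64_t key) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(key >> 32));
}

// Recovers y from the low half.
inline std::int32_t unpack_y(std::uint64_t key) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(key & 0xFFFFFFFFu));
}
```

The packed value works as a std::map<std::uint64_t, int> key with the built-in integer comparison. Note that the resulting order is not the numeric order of the coordinates when negative values are involved, which only matters if the map is iterated in order.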
I think for your use case, std::pair, as suggested in David Norman's answer, is the best solution. However, since C++11 you can also use std::tuple. Tuples are useful if you have more than two keys, for example if you have 3D coordinates (i.e. x, y, and z). Then you don't have to nest pairs or define a comparator for a struct. But for your specific use case, the code could be written as follows:
int main() {
using tup_t = std::tuple<int, int>;
std::map<tup_t, int> m;
m[std::make_tuple(78, 26)] = 476;
tup_t t = { 12, 45 }; m[t] = 102;
for (auto const &kv : m)
std::cout << "{ " << std::get<0>(kv.first) << ", "
<< std::get<1>(kv.first) << " } => " << kv.second << std::endl;
return 0;
}
Output:
{ 12, 45 } => 102
{ 78, 26 } => 476
Note: Since C++17 working with tuples has become easier, especially if you want to access multiple elements simultaneously.
For example, if you use structured binding, you can print the tuple as follows:
for (auto const &[k, v] : m) {
auto [x, y] = k;
std::cout << "{ " << x << ", " << y << " } => " << v << std::endl;
}
Code on Coliru
Use std::pair. Better yet, use QHash<QPair<int,int>,int> if you have many such mappings.
Hope you will find it useful:
map<int, map<int, int>> troyka = { {4, {{5,6}} } };
troyka[4][5] = 7;
An alternative for the top result that is slightly less performant but allows for easier indexing
std::map<int, std::map<int,int>> myMap;
myMap[10][20] = 25;
std::cout << myMap[10][20] << std::endl;
First and foremost, ditch the string and use 2 ints, which you may well have done by now. Kudos for figuring out that a tree is the best way to implement a sparse matrix. Usually a magnet for bad implementations it seems.
FYI, a triple compound key works too, and I assume a pair of pairs as well.
It makes for some ugly sub-scripting though, so a little macro magic will make your life easier. I left this one general purpose, but type-casting the arguments in the macro is a good idea if you create macros for specific maps. The TresKeys12 is tested and running fine. QuadKeys should also work.
NOTE: As long as your key parts are basic data types you DON'T need to write anything more. AKA, no need to fret about comparison functions. The STL has you covered. Just code it up and let it rip.
using namespace std; // save some typing
#define DosKeys(x,y) std::make_pair(x,y)
#define TresKeys12(x,y,z) std::make_pair(x,std::make_pair(y,z))
#define TresKeys21(x,y,z) std::make_pair(std::make_pair(x,y),z)
#define QuadKeys(w,x,y,z) std::make_pair(std::make_pair(w,x),std::make_pair(y,z))
map<pair<INT, pair<ULLNG, ULLNG>>, pIC_MESSAGE> MapMe;
MapMe[TresKeys12(Part1, Part2, Part3)] = new fooObject;
If someone wants to impress me, show me how to make a compare operator for TresKeys that doesn't rely on nesting pairs so I can use a single struct with 3 members and use a comparison function.
PS: TresKeys12 gave me problems with a map declared with a (pair, z) key, as it makes an (x, pair) key, and those two don't play nice. Not a problem for DosKeys, or QuadKeys. If it's a hot summer Friday though, you may find an unexpected side-effect of typing in DosEquis
... err.. DosKeys a bunch of times, is a thirst for Mexican beer. Caveat Emptor. As Sheldon Cooper says, "What's life without whimsy?".
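One possible response to the challenge in this answer, assuming C++11 is available: a plain struct with three members can get a lexicographic operator< from std::tie, with no nested pairs involved. The struct name TripleKey and its member names are made up for illustration; the member types loosely mirror the INT and ULLNG parts used above.

```cpp
#include <map>
#include <tuple>

// Hypothetical three-part key held in a single flat struct.
struct TripleKey {
    int a;
    unsigned long long b;
    unsigned long long c;
};

// std::tie builds tuples of references; tuple comparison is lexicographic,
// so this compares a first, then b, then c.
inline bool operator<(const TripleKey& lhs, const TripleKey& rhs) {
    return std::tie(lhs.a, lhs.b, lhs.c) < std::tie(rhs.a, rhs.b, rhs.c);
}
```

A std::map<TripleKey, T> then works with the default std::less comparator, and the members keep readable names instead of .first.second chains.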

container for quick name lookup

I want to store strings and issue each with a unique ID number (an index would be fine). I would only need one copy of each string and I require quick lookup. I check whether a string exists in the table often enough that I notice a performance hit. What's the best container to use for this, and how do I look up whether a string exists?
I would suggest tr1::unordered_map. It is implemented as a hashmap so it has an expected complexity of O(1) for lookups and a worst case of O(n). There is also a boost implementation if your compiler doesn't support tr1.
#include <string>
#include <iostream>
#include <tr1/unordered_map>
using namespace std;
int main()
{
tr1::unordered_map<string, int> table;
table["One"] = 1;
table["Two"] = 2;
cout << "find(\"One\") == " << boolalpha << (table.find("One") != table.end()) << endl;
cout << "find(\"Three\") == " << boolalpha << (table.find("Three") != table.end()) << endl;
return 0;
}
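On a C++11 compiler the same container is available as std::unordered_map. A minimal sketch of the full requirement in the question (one copy per lookup direction, a unique index per string, fast existence checks) could pair it with a vector. The class name StringTable and its member functions are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical string table: each distinct string gets the next free index as its ID.
class StringTable {
public:
    // Returns the ID of s, assigning a new one the first time s is seen.
    std::size_t intern(const std::string& s) {
        auto result = ids_.emplace(s, names_.size());
        if (result.second) {
            names_.push_back(s);  // newly inserted, so record the reverse mapping
        }
        return result.first->second;
    }

    // Expected O(1) existence check.
    bool contains(const std::string& s) const {
        return ids_.find(s) != ids_.end();
    }

    // Reverse lookup from ID to string; throws std::out_of_range for unknown IDs.
    const std::string& name(std::size_t id) const {
        return names_.at(id);
    }

private:
    std::unordered_map<std::string, std::size_t> ids_;
    std::vector<std::string> names_;
};
```

This keeps two copies of each string (one per direction); if only string-to-ID lookup is needed, the vector can be dropped.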
Try std::map.
First and foremost you must be able to quantify your options. You have also told us that the main usage pattern you're interested in is lookup, not insertion.
Let N be the number of strings that you expect to be having in the table, and let C be the average number of characters in any given string present in the said table (or in the strings that are checked against the table).
In the case of a hash-based approach, for each lookup you pay the following costs:
O(C) - calculating the hash for the string you are about to look up
between O(1 x C) and O(N x C), where 1..N is the cost you expect from traversing the bucket based on hash key, here multiplied by C to re-check the characters in each string against the lookup key
total time: between O(2 x C) and O((N + 1) x C)
In the case of a std::map-based approach (which uses red-black trees), for each lookup you pay the following costs:
total time: between O(1 x C) and O(log(N) x C) - where O(log(N)) is the maximal tree traversal cost, and O(C) is the time that std::map's generic less<> implementation takes to recheck your lookup key during tree traversal
In the case of large values for N and in the absence of a hash function that guarantees less than log(N) collisions, or if you just want to play it safe, you're better off using a tree-based (std::map) approach. If N is small, by all means, use a hash-based approach (while still making sure that hash collision is low.)
Before making any decision, though, you should also check:
http://meshula.net/wordpress/?p=183
http://wyw.dcweb.cn/mstring.htm
Are the Strings to be searched available statically? You might want to look at a perfect hashing function
Sounds like an array would work just fine, where the ID is the index into the array. To check whether an ID exists, just make sure the index is within the bounds of the array and that its entry isn't NULL.
EDIT: if you sort the list, you could always use a binary search which should have fast lookup.
EDIT: Also, if you want to search for a string, you can always use a std::map<std::string, int> as well. This should have some decent lookup speeds.
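The sorted-list suggestion can be sketched with std::lower_bound, assuming the strings are kept in a std::vector that is already sorted. The function name find_sorted is hypothetical, and the returned index is only a stable ID as long as no further strings are inserted.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the index of s in sorted_names, or -1 if it is not present.
// Precondition: sorted_names is sorted in ascending order.
long find_sorted(const std::vector<std::string>& sorted_names, const std::string& s) {
    auto it = std::lower_bound(sorted_names.begin(), sorted_names.end(), s);
    if (it != sorted_names.end() && *it == s) {
        return static_cast<long>(it - sorted_names.begin());
    }
    return -1;
}
```

Each lookup takes O(log N) string comparisons, the same order as std::map, but with contiguous storage.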
Easiest is to use std::map.
It works like this:
#include <iostream>
#include <map>
#include <string>
using namespace std;
...
map<string, int> myContainer;
myContainer["foo"] = 5; // map string "foo" to id 5
// Now check if "foo" has been added to the container:
if (myContainer.find("foo") != myContainer.end())
{
// Yes!
cout << "The ID of foo is " << myContainer["foo"];
}
// Let's get "foo" out of it
myContainer.erase("foo");
Google sparse hash maybe