I need a hash function for 3D vectors with no collisions between close key values.
The key is a 3d vector of integers. I want no collisions within roughly a 64 * 64 * 64 "area" or larger.
Does anyone know of any hashing functions suited for this purpose, or even better, how would you go about designing a hash for this?
If it's necessary to know, I will be implementing it in C++.
Why not create a map<int,map<int,map<int,Object>>> for your objects, where each int is x, y, or z (or whatever you're labeling your axes)?
Here's an example of how you could use it.
int x,y,z;
map<int,map<int,map<int,string>>> Vectors = map<int,map<int,map<int,string>>>();
/*give x, y and z a real value*/
Vectors[x][y][z] = "value";
/*more code*/
string ValueAtXYZ = Vectors[x][y][z];
Just to explain, because it's not super obvious:
The Vectors[x] returns a map<int,map<int,string>>.
I then immediately use that map's [] operator with [y].
That then returns (you guessed it) a map<int,string>.
I immediately use that map's [] operator with [z] and can now set the string.
Note: Just be sure to loop through it using iterators and not a for(int x = 0; /*bad code*/; x++) loop, because [] adds an element at every location it's used to look up. Here's an example of a loop and here's an example of an unexpected add.
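To make the "unexpected add" concrete, here is a minimal sketch. The names Grid, lookup, and count_values are made up for illustration; the only things relied on are the standard behaviors of std::map::find (never inserts) and std::map::operator[] (inserts a default value when the key is missing).

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical alias for the nested map used in this answer.
using Grid = std::map<int, std::map<int, std::map<int, std::string>>>;

// Read-only lookup built on find(): returns nullptr when (x, y, z) is absent
// and never inserts anything into the maps.
const std::string* lookup(const Grid& grid, int x, int y, int z) {
    auto xi = grid.find(x);
    if (xi == grid.end()) return nullptr;
    auto yi = xi->second.find(y);
    if (yi == xi->second.end()) return nullptr;
    auto zi = yi->second.find(z);
    if (zi == yi->second.end()) return nullptr;
    return &zi->second;
}

// Counts the stored strings by walking the nested maps with iterators,
// so no new elements are created while looping.
std::size_t count_values(const Grid& grid) {
    std::size_t n = 0;
    for (const auto& xEntry : grid)
        for (const auto& yEntry : xEntry.second)
            n += yEntry.second.size();
    return n;
}
```

Calling lookup(grid, 7, 8, 9) on a grid without that key leaves the grid unchanged, whereas merely evaluating grid[7][8][9] adds an empty string at that position.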
Edit:
If you want to make sure that you're not overwriting an existing value, you could do this.
string saveOldValue;
if(Vectors[x][y][z] != ""/*this is the default value of a string*/)
{
/*There was a string in that vector so store the old Value*/
saveOldValue = Vectors[x][y][z];
}
Vectors[x][y][z] = "Value";
If you use [] on a key that isn't in the map the map creates a default object there. For strings this would be the empty string "".
Or
if( Vectors.find(x)!=Vectors.end()
&& Vectors[x].find(y)!=Vectors[x].end()
&& Vectors[x][y].find(z)!=Vectors[x][y].end())
{
/* Vectors[x][y][z] has something in it*/
}else
{
/*Theres nothing at Vectors[x][y][z] so go for it*/
Vectors[x][y][z] ="value";
}
This uses the find(key) function, which returns an iterator to the location of the key OR an iterator equal to map::end() if that key is not in the current map.
If you don't have a default value for the thing being stored, then use the second check to do your inserts. This greatly increases the usability of this answer and unclutters your code.
The insert function has its place, but in this example it would be very hard to use.
I am interested in performing the following operations on a set of data.
First, we are given a set of keys, as an example:
vector<int> keys{1,2,3,4,5,6};
Each of these keys is understood to be pointing to a unique entry (which is not important to specify, rather what is important is the relation whether each key is pointing to a separate entry, or some keys are pointing to the same entry). Initially, we do not know whether any key is pointing to the same entry or not, so we start out with a data structure that treats all entries as separate for each key:
surjectiveData<int> data;
data.populateUnique(keys.begin(),keys.end());
Graphically, we can illustrate the current state of data as
1 -> a, 2 -> b, 3 -> c, 4 -> d, 5 -> e, 6 -> f
where we use labels a,b,c,d,e,f to keep track of the unique entries in data. Now, consider adding additional information on which keys are pointing to the same entry. For example:
vector<pair<int,int>> identifications{make_pair(1,2),make_pair(3,4),make_pair(2,4),make_pair(5,6)};
data.couple(identifications.begin(),identifications.end());
The couple method of surjectiveData goes through the pairs provided and makes them point to the same unique entry. Graphically, the four identifications would in turn change data as follows: {ab, c, d, e, f}, then {ab, cd, e, f}, then {abcd, e, f}, then {abcd, ef},
and now there are only two unique entries in data, which here we denote abcd and ef. Note that once two or more keys point to the same entry, it does not matter which of these keys is identified with which of the separate keys; all of them point to the same entry after identification.
Now that we are done with specifying key identifications, we could think of using data as follows. For example, we could ask what is the effective number of unique remaining entries
cout<<data.size()<<endl; // 2
Or, we could iterate through the entries and check how many keys point to each of them
for(auto it=data.begin();it!=data.end();it++){
cout<<it->size()<<" ";// 4 2
}
Ideally, internally the structure should take constant time for each identification, if possible.
I tried to search for such a data structure in the standard library, but could not find any. Did I miss it? Perhaps there is a smart way to implement it based on more basic objects? If so, what would be a minimal example for integers?
The operations you describe can be supported with a disjoint set data structure: https://en.wikipedia.org/wiki/Disjoint-set_data_structure
This is a linked data structure that supports 3 operations:
makeSet() creates a new singleton set and returns its element
union(a,b) given two elements, merges the sets that contain them. One element of each set will be the "representative" of that set
find(a) returns the representative of the set that contains a.
All operations take pretty much constant amortized time.
I usually implement this data structure in a single vector, where each array index denotes a set element. If its value is > 0, then it's a set representative and the value is the size of the set. If its value is < 0, then its value is ~p, where p is its "parent" element in the same set. Sometimes I use the 0 value for "uninitialized".
It's not hard to keep track of the number of sets.
In C++, my usual implementation would look like this:
#include <vector>

class DisjointSets {
    unsigned num_sets = 0;
    std::vector<int> sets;
public:
    // Create a new singleton set and return its element
    unsigned make_set() {
        unsigned ret = (unsigned)sets.size();
        sets.push_back(1);
        ++num_sets;
        return ret;
    }
    // Find the representative element of an element's set
    unsigned find(unsigned x) {
        int p = sets[x];
        if (p >= 0) {
            return x;
        }
        unsigned root = find((unsigned)~p);
        sets[x] = ~(int)root; // path compression; might be the same
        return root;
    }
    // Merge the sets that contain two elements
    // returns true if a merge was done
    // (called unite because union is a reserved word in C++)
    bool unite(unsigned a, unsigned b) {
        a = find(a);
        b = find(b);
        if (a == b) {
            return false;
        }
        if (sets[a] > sets[b]) {
            sets[a] += sets[b]; // add sizes
            sets[b] = ~(int)a;
        } else {
            sets[b] += sets[a]; // add sizes
            sets[a] = ~(int)b;
        }
        --num_sets;
        return true;
    }
    // get the size of an element's set
    unsigned set_size(unsigned x) {
        return (unsigned)sets[find(x)];
    }
    // get the number of sets
    unsigned set_count() {
        return num_sets;
    }
};
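To tie this back to the question's example (keys 1 to 6 with the identifications (1,2), (3,4), (2,4), (5,6)), here is a self-contained sketch of the same idea. The struct name Groups and its layout are illustrative assumptions, not the class above; it stores parents and sizes in two separate vectors and expects the keys to be mapped to indices 0..n-1 beforehand (key k becomes index k-1 here).

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical compact union-find over the indices 0..n-1.
struct Groups {
    std::vector<std::size_t> parent; // parent[i] == i marks a representative
    std::vector<std::size_t> size;   // only meaningful at representatives
    std::size_t count;               // number of distinct entries left

    explicit Groups(std::size_t n) : parent(n), size(n, 1), count(n) {
        std::iota(parent.begin(), parent.end(), std::size_t{0});
    }

    // Returns the representative of x, shortening the path on the way.
    std::size_t find(std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    // Makes a and b point to the same entry (union by size).
    void couple(std::size_t a, std::size_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (size[a] < size[b]) std::swap(a, b);
        parent[b] = a;
        size[a] += size[b];
        --count;
    }
};
```

After coupling (0,1), (2,3), (1,3), and (4,5) on a Groups of six elements, count is 2 and the two groups have sizes 4 and 2, matching the abcd and ef entries in the question.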
So I'm just learning (or trying to learn) a bit about hashing. I'm attempting to make a hashing function, but I'm confused about where I save the data. I'm trying to calculate the number of collisions and print that out. I have made 3 different files, one with 10,000 words, one with 20,000 words and one with 30,000 words. Each word is just 10 random numbers/letters.
long hash(char* s){
long h;
for(int i = 0; i < 10; i++){
h = h + (int)s[i];
}
//A lot of examples then mod h by the table size
//I'm a bit confused what this table is... Is it an array of
//10,000 (or however many words)?
//h % TABLE_SIZE
return h;
}
int main (int argc, char* argv[]){
fstream input(argv[1]);
char* nextWord;
while(!input.eof()){
input >> nextWord;
hash(nextWord);
}
}
So that's what I currently have, but I can't figure out what the table is exactly, as I said in the comments above... Is it a predefined array in my main with the number of words in it? For example, if I have a file of 10 words, do I make an array of size 10 in my main? Then if/when I return h, let's say the results come out in the order 3, 7, 2, 3.
The 4th word is a collision, correct? When that happens, do I add 1 to the collision count and then add 1 to the index to check whether slot 4 is also full?
Thanks for the help!
The point of hashing is to have constant-time access to every element you store. I'll try to explain with a simple example below.
First, you need to know how much data you'll have to store. Say you want to store numbers and you know that you will only store numbers from 0 to 9. The simplest solution is to create an array with 10 elements. That array is your "table", where you store your numbers. So how do you achieve that amazing constant-time access? The hashing function! Its job is to return an index into your array. Let's create a simple one: if you'd like to store 7, you just save it in the array at position 7. Every time you'd like to look for element 7, you just pass it to your hashing function and bzaah! You get the position of your element in constant time! But what if you'd like to store more elements with value 7? Your simple hashing function returns 7 for every one of them, and now that position is already occupied! How to solve that? Well, there aren't many solutions; the simplest are:
1: Open addressing (linear probing) - you simply save the element in the first free position after the occupied one. This has a significant drawback. Imagine you want to delete some element ... (this is the method you're describing in the question)
2: Chaining (linked lists) - if you create an array of pointers to linked lists, you can easily add your new element at the end of the linked list that sits at position 7!
Both of these simple solutions have their drawbacks. I guess you can see them. As @rwols has said, you don't have to use an array. You can also use a tree, or be a real C++ master and use unordered_map and unordered_set with a custom hash function, which is quite cool. There is also a structure named a trie, which is useful when you'd like to create some sort of dictionary (where it is really hard to know how many words you will need to store).
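As a rough sketch of the unordered_map route, the block below plugs the question's additive hash into std::unordered_map through its third template parameter. The names AdditiveHash, WordCounts, and count_words are invented for this illustration. Even with such a weak hash the container stays correct, because it resolves collisions internally; a poor hash only costs speed.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical functor wrapping the additive hash from the question.
struct AdditiveHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 0;
        for (unsigned char c : s) h += c;
        return h;
    }
};

// The third template argument replaces the default std::hash<std::string>.
using WordCounts = std::unordered_map<std::string, int, AdditiveHash>;

// Counts how often each word occurs.
WordCounts count_words(const std::vector<std::string>& words) {
    WordCounts counts;
    for (const std::string& w : words) ++counts[w];
    return counts;
}
```

For the input {"ab", "ba", "ab"} the two distinct words hash to the same value, yet count_words still reports "ab" twice and "ba" once.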
To sum it up: you have to know how many things you want to store and then create an ideal hashing function that covers an array of appropriate size. In a perfect world it would have a uniform index distribution with no collisions. (Achieving this is pretty hard, and in the real world I guess it is impossible, so the fewer collisions, the better.)
Your hash function is pretty bad. It will have a lot of collisions (like the strings "ab" and "ba"), and you also need to take the result mod m, with m being the size of your array (a.k.a. the table), so that you can use it as an index into the array. The modulus is a way of making the hash function "fit" the table you specified in the beginning, because you can't save an element at position 11, 12, ... if you have an array of 10.
What should a good hashing function look like? Well, there are better sources than me. Some example (Alert! It's in Java)
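For a C++ flavour of the same point, here is a small sketch comparing the additive hash with a position-sensitive one. The function names are made up, and the multiplier 31 is just a common illustrative choice (it is the one Java's String.hashCode uses), not a recommendation from this answer.

```cpp
#include <cstddef>
#include <string>

// The question's approach: add up the character codes.
// Any two words with the same letters in a different order collide.
std::size_t additive_hash(const std::string& s) {
    std::size_t h = 0;
    for (unsigned char c : s) h += c;
    return h;
}

// Position-sensitive alternative: each step multiplies the running
// value before adding the next character, so order matters.
std::size_t polynomial_hash(const std::string& s) {
    std::size_t h = 0;
    for (unsigned char c : s) h = h * 31 + c;
    return h;
}

// Reduces a hash to a valid index of a table with m slots (m must be > 0).
std::size_t table_index(std::size_t hash, std::size_t m) {
    return hash % m;
}
```

With ASCII input, additive_hash gives 195 for both "ab" and "ba", while polynomial_hash gives 3105 and 3135. Note that distinct hashes can still share a slot after the modulo: with m = 10 both of those end up at index 5, which is why a table needs a collision strategy regardless of the hash.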
To your example: you simply can't save 10k or even more words into a table of size 10. That'll create a lot of collisions and you lose the main benefit of hashing - constant-time access to the elements you saved.
And how would your code look? Something like this:
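int main (int argc, char* argv[]){
    fstream input(argv[1]);
    string nextWord; // a string owns its storage; an uninitialized char* points nowhere
    TypeOfElement table[size_of_table];
    while(input >> nextWord){ // stops cleanly at end of input, unlike testing eof() first
        // hash() is the function from the question; it reads the first 10 characters
        table[hash(&nextWord[0]) % size_of_table] = /* desired element which you want to save */;
    }
}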
But I guess your goal isn't to save something somewhere, but to count the number of collisions. Also note that the code above doesn't resolve collisions. If you'd like to count collisions, create an array table of ints and initialize it to zero. Then just increment the value stored at the index returned by your hash function, reduced mod the table size (afterwards, a slot with count c accounts for c - 1 collisions), like this:
table[hash(nextWord) % size_of_table]++;
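Put together, the counting idea could look roughly like the sketch below. The function name count_collisions is an assumption for illustration, and the hash is the additive one from the question; table_size must be greater than zero.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts how many words land in a slot that an earlier word already used.
std::size_t count_collisions(const std::vector<std::string>& words,
                             std::size_t table_size) {
    std::vector<int> table(table_size, 0); // one counter per slot, all zero
    std::size_t collisions = 0;
    for (const std::string& word : words) {
        std::size_t h = 0;
        for (unsigned char c : word) h += c; // additive hash, as in the question
        int& slot = table[h % table_size];   // mod keeps the index inside the table
        if (slot > 0) ++collisions;          // somebody was here before
        ++slot;
    }
    return collisions;
}
```

For example, with a table of 10 slots and ASCII input, the words "ab", "ba", "cd" produce one collision, because "ab" and "ba" both map to slot 5 while "cd" maps to slot 9.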
I hope I helped. Please specify, what else you want to know.
If a hash table is required then, as others have stated, std::unordered_map will work in most cases. If you need something more powerful because of a large number of entries, I would suggest looking into tries. Tries combine the concepts of array indexing, hashing, and linked lists. The run time of a lookup is close to O(M), where M is the number of characters in the string. It removes the chance of collisions. And the more you add to a trie, the less work has to be done, because the nodes for shared prefixes have already been created. The one drawback is that tries require more memory.
Now, the size of the array in each trie node may vary depending on what you are storing, but the overall concept and construction are the same. If you were doing a word-definition lookup, you might want an array of 26, or a few more, with one slot for each possible character.
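A minimal sketch of such a trie, assuming words consist only of the letters 'a' to 'z' in an ASCII-like encoding (the class name Trie and its two methods are illustrative, not a standard library type):

```cpp
#include <array>
#include <memory>
#include <string>

// Minimal trie sketch for words made of the letters 'a'..'z' only.
class Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> child; // one slot per letter
        bool is_word = false;                        // a word ends at this node
    };
    Node root_;

public:
    // Inserts a word; returns false (and stores nothing) if the word
    // contains a character outside 'a'..'z'.
    bool insert(const std::string& word) {
        for (char c : word)
            if (c < 'a' || c > 'z') return false;
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& next = node->child[c - 'a'];
            if (!next) next.reset(new Node()); // create the node on first use
            node = next.get();
        }
        node->is_word = true;
        return true;
    }

    // Looks a word up in at most M steps, M being the word length.
    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            if (c < 'a' || c > 'z') return false;
            node = node->child[c - 'a'].get();
            if (!node) return false;
        }
        return node->is_word;
    }
};
```

After inserting "cat" and "car", the two words share the nodes for "ca"; contains("ca") is still false because no word ends there.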
To count the number of words that have the same hash, we need to know the hashes of all previous words. When you compute the hash of a word, you should write it down, for example in an array. So you need an array with size equal to the number of words.
Then you should compare the new hash with all previous ones. The method of counting depends on what you need: the number of colliding pairs or the number of identical elements.
A hash function should not be responsible for storing data. Normally you would have a container that uses the hash function internally.
From what you wrote, I understood that you want to create a hash table. One way you could do that (probably not the most efficient one, but it should give you an idea):
#include <fstream>
#include <vector>
#include <string>
#include <map>
#include <memory>
#include <utility>
using namespace std;
namespace example {
    // Same additive hash as in the question, but h starts at 0 and the loop
    // covers the whole word instead of assuming exactly 10 characters.
    long hash(const string& s){
        long h = 0;
        for(char c : s){
            h = h + (int)c;
        }
        return h;
    }
}
int main (int argc, char* argv[]){
    if (argc < 2) return 1;
    fstream input(argv[1]);
    string nextWord; // owns its storage, unlike an uninitialized char*
    std::map<long, std::unique_ptr<std::vector<std::string>>> hashtable;
    while(input >> nextWord){ // stops cleanly at end of input
        long newHash = example::hash(nextWord);
        auto it = hashtable.find(newHash);
        // First word with this hash? Then start a new bucket.
        if (it == hashtable.end()) {
            hashtable.insert(std::make_pair(newHash, std::unique_ptr<std::vector<std::string>>(new std::vector<std::string> { nextWord } )));
        }
        else {
            // Collision: another word already produced this hash.
            it->second->push_back(nextWord);
        }
    }
}
I used some C++11 features to write the example faster.
I am not sure that I understand what you do not understand. The explanations below might help you.
A hash table is a kind of associative array. It is used to map keys to values in a similar manner an array is used to map indexes (keys) to values. For instance, an array of three numbers, { 11, -22, 33 }, associates index 0 to 11, index 1 to -22 and index 2 to 33.
Now, let us assume that we would like to associate 1 to 11, 2 to -22 and 3 to 33. The solution is simple: we keep the same array, only we transform the key by subtracting one from it, thus obtaining the original index (key 2 becomes index 1, which holds -22).
This is fine until we realize that this is just a particular case. What if the keys are not so “predictable”? A solution would be to put the associations in a list of {key, value} pairs and when someone is asking for a key, just search the list: { 123, 11}, {3, -22}, {0, 33} If the value associated to 3 is asked, we simply search the keys in list for a match and find -22. That’s fine, but if the list is large we’re in trouble. We could speed the search if we sort the array by keys and use binary search, but still the search may take some time if the list is large.
The search speed may be further enhanced if we break the list in sub-lists (or buckets) made of related pairs. This is what a hash function does: puts together pairs by related keys (an ideal hash function would associate one key to one value).
A hash table is a two-column table (an array):
The first column is the hash key (the index computed by a hash function). The size of the hash table is given by the number of distinct values the hash function can return. If, for instance, the last step in computing the hash function is modulo 10, the size of the table will be 10; the list of pairs will be broken into 10 sub-lists.
The second column is a list (bucket) of key/value pairs (the sub-list I was talking about).
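The two-column picture can be sketched in a few lines of C++. Everything here (the class name BucketTable, int keys and values, modulo as the hash, a default of 10 buckets) is an assumption chosen to mirror the example above, not a production design.

```cpp
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

// Hypothetical minimal hash table: the vector index is "column one" (the
// hash key) and each list is "column two" (the bucket of key/value pairs).
class BucketTable {
    std::vector<std::list<std::pair<int, int>>> buckets_;

    // Hash function: key modulo the number of buckets, kept non-negative.
    std::size_t index_of(int key) const {
        long long n = static_cast<long long>(buckets_.size());
        return static_cast<std::size_t>(((key % n) + n) % n);
    }

public:
    // bucket_count must be greater than zero.
    explicit BucketTable(std::size_t bucket_count = 10) : buckets_(bucket_count) {}

    // Stores the pair, overwriting the value if the key already exists.
    void put(int key, int value) {
        for (auto& kv : buckets_[index_of(key)])
            if (kv.first == key) { kv.second = value; return; }
        buckets_[index_of(key)].emplace_back(key, value);
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    // Only the one bucket selected by the hash is searched.
    const int* get(int key) const {
        for (const auto& kv : buckets_[index_of(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    // Number of pairs sharing the bucket that this key hashes to.
    std::size_t bucket_size(int key) const {
        return buckets_[index_of(key)].size();
    }
};
```

With the pairs {123, 11}, {3, -22}, {0, 33} from above and 10 buckets, keys 123 and 3 share bucket 3 and key 0 sits alone in bucket 0, so looking up 3 only has to scan a two-element list.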
Can I use popFront() and then eventually push back what was popped? The number of calls to popFront() might be more than one (but not much greater, say < 10, if that matters). This is also the number of times the imaginary pushBack() function would be called.
for example:
string s = "Hello, World!";
int n = 5;
foreach(i; 0 .. n) {
// do something with s.front
s.popFront();
}
if(some_condition) {
foreach(i; 0 .. n) {
s.pushBack();
}
}
writeln(s); // should output "Hello, World!" since the number popped is the same as the number pushed back.
I think popFront() uses .ptr, but I'm not sure whether that makes any difference in D or can help me reach my goal easily (i.e., in D's way, without writing my own circular buffer or similar).
A completely different approach to reach it is very welcome too.
A range is either generative (e.g. if it's a list of random numbers), or it's a view into a container. In neither case does it make sense to push anything onto it. As you call popFront, you're iterating through the list and shrinking your view of the container. If you think of a range being like two C++ iterators for a moment, and you have something like
struct IterRange(T)
{
@property bool empty() { return iter == end; }
@property T front() { return *iter; }
void popFront() { ++iter; }
private Iterator iter;
private Iterator end;
}
then it will be easier to understand. If you called popFront, it would move the iterator forward by one, thereby changing which element you're looking at, but you can't add elements in front of it. That would require doing something like an insertion on the container itself, and maybe the iterator or range could be used to tell the container where you want an element inserted, but the iterator or range can't do that itself. The same goes if you have a generative range like
struct IncRange(T)
{
@property bool empty() { return value == T.max; }
@property T front() { return value; }
void popFront() { ++value; }
private T value;
}
It keeps incrementing the value, and there is no container backing it. So, it doesn't even have anywhere that you could push a value onto.
Arrays are a little bit funny because they're ranges but they're also containers (sort of). They have range semantics when popping elements off of them or slicing them, but they don't own their own memory, and once you append to them, you can get a completely different chunk of memory with the same values. So, it is sort of a range that you can add and remove elements from - but you can't do it using the range API. So, you could do something like
str = newChar ~ str;
but that's not terribly efficient. You could make it more efficient by creating a new array at the target size and then filling in its elements rather than concatenating repeatedly, but regardless, pushing something onto the front of an array is not a particularly idiomatic or efficient thing to be doing.
Now, if what you're looking to do is just reset the range so that it once again refers to the elements that were popped off rather than really push elements onto it - that is, open up the window again so that it shows what it showed before - that's a bit different. It's still not supported by the range API at all (you can never unpop anything that was popped off). However, if the range that you're dealing with is a forward range (and arrays are), then you can save the range before you pop off the elements and then use that to restore the previous state. e.g.
string s = "Hello, World!";
int n = 5;
auto saved = s.save;
foreach(i; 0 .. n)
s.popFront();
if(some_condition)
s = saved;
So, you have to explicitly store the previous state yourself in order to restore it instead of having something like unpopFront, but having the range store that itself (as would be required for unpopFront) would be very inefficient in most cases (much as it might work in the iterator case if the range kept track of where the beginning of the container was).
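Since the explanation above leans on the analogy with a pair of C++ iterators, here is the same save-and-restore idea as a C++ sketch. This is an analogy only, not D code: CharRange and pop_then_maybe_restore are hypothetical names, and the struct merely imitates the empty/front/popFront/save interface.

```cpp
#include <string>

// Hypothetical C++ analogue of a D forward range over characters:
// a view described by two pointers into someone else's storage.
struct CharRange {
    const char* first;
    const char* last;

    bool empty() const { return first == last; }
    char front() const { return *first; }
    void popFront() { ++first; }             // shrinks the view; nothing to undo it
    CharRange save() const { return *this; } // independent copy of the view
    std::string str() const { return std::string(first, last); }
};

// Pops up to n characters, then optionally restores the saved view.
std::string pop_then_maybe_restore(const std::string& text, int n, bool restore) {
    CharRange r{text.data(), text.data() + text.size()};
    CharRange saved = r.save(); // remember the state before popping
    for (int i = 0; i < n && !r.empty(); ++i) r.popFront();
    if (restore) r = saved;     // "unpop" by going back to the copy
    return r.str();
}
```

Popping 5 characters from "Hello, World!" leaves ", World!"; assigning the saved copy back yields "Hello, World!" again, and the underlying characters are never copied or modified.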
No, there is no standard way to "unpop" a range or a string.
If you were to pass a slice of a string to a function:
fun(s[5..10]);
You'd expect that that function would only be able to see those 5 characters. If there was a way to "unpop" the slice, the function would be able to see the entire string.
Now, D is a system programming language, so expanding a slice is possible using pointer arithmetic and GC queries. But there is nothing in the standard library to do this for you.
I have a data structure in sparse compressed column format.
For my given algorithm, I need to iterate over all the values in a "column" of data and do a bunch of stuff. Currently, it is working nicely using a regular for loop. The boss wants me to re-code this as a for_each loop for future parallelization.
For those not familiar with the sparse compressed column format, it uses 2 (or 3) vectors to represent the data. One vector is just a long list of values. The second vector holds the index of where each column starts (the third, used below, holds the row index of each value).
The current version
// for processing data in column 5
vector values;
vector colIndex;
vector rowIndex;
int column = 5;
for(int i = colIndex[column]; i != colIndex[column + 1]; i++){
value = values[i];
row = rowIndex[i];
// do stuff
}
The key is that I need to know the location(as an integer) in my values column in order to lookup the row position (And a bunch of other stuff I'm not bothering to list here.)
If I use the std::for_each() function, I get the value at the position, not the position. I need the position itself.
One thought, and clearly not efficient, would be to create a vector of integers the same length as my data. That way, I could pass an iterator over that dummy vector to the function in for_each, and the value passed to my function would be the position. However, this seems like the least efficient way.
Any thoughts?
My challenge is that I need to know the position in the vector. for_each takes an iterator and sends the value of that iterator to the function.
Use boost::counting_iterator<int>, or implement your own.
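If Boost is not available, "implement your own" could look roughly like the sketch below. CountingIterator is a hypothetical, deliberately minimal input iterator (boost::counting_iterator offers much more), and dense_column is just an illustrative stand-in for the "do stuff" loop body; the element types double and int are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical minimal stand-in for boost::counting_iterator<int>:
// dereferencing yields the current index itself, not a stored element.
class CountingIterator {
    int value_;
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    explicit CountingIterator(int value) : value_(value) {}
    int operator*() const { return value_; }
    CountingIterator& operator++() { ++value_; return *this; }
    CountingIterator operator++(int) { CountingIterator old = *this; ++value_; return old; }
    bool operator==(const CountingIterator& other) const { return value_ == other.value_; }
    bool operator!=(const CountingIterator& other) const { return value_ != other.value_; }
};

// Expands one column of a compressed-column matrix into a dense vector.
// The lambda receives the position i, so it can index both arrays.
std::vector<double> dense_column(const std::vector<double>& values,
                                 const std::vector<int>& rowIndex,
                                 const std::vector<int>& colIndex,
                                 int column, int numRows) {
    std::vector<double> dense(numRows, 0.0);
    std::for_each(CountingIterator(colIndex[column]),
                  CountingIterator(colIndex[column + 1]),
                  [&](int i) { dense[rowIndex[i]] = values[i]; });
    return dense;
}
```

No dummy index vector is allocated: the iterator produces the positions colIndex[column] up to, but not including, colIndex[column + 1] on the fly.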
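@n.m.'s answer is probably the best, but it is possible with only what the standard library provides, though painfully slow, I assume:
void your_loop_func(const T& val){
    // std::find is a linear search and returns the first match, so this
    // only gives the right index if the stored values are unique
    auto it = std::find(values.begin(), values.end(), val);
    std::ptrdiff_t index = it - values.begin();
    value = val;
    row = rowIndex[index];
}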
And after writing that, I really can only recommend the Boost counting_iterator version. ;)
I have an integral position-based algorithm. (That is, the output of the algorithm is based on a curvilinear position, and each result is influenced by the values of the previous results).
To avoid recalculating each time, I would like to pre-calculate at a given sample rate, and subsequently perform a lookup and either return a pre-calculated result (if I land directly on one), or interpolate between two adjacent results.
This would be trivial for me in F# or C#, but my C++ is very rusty, (and wasn't even ever that good).
Is map the right construct to use? And could you be so kind as to give me an example of how I'd perform the lookup? (I'm thinking of precalculating in millimetres, which means the key could be an int, the value would be a double).
UPDATE OK, maybe what I need is a sorted dictionary. (Rolls up sleeves), pseudocode:
//Initialisation
fun MyFunction(int position, double previousresult) returns double {/*etc*/};
double lastresult = 0.0;
for(int s = startposition to endposition by sampledist)
{
lastresult = MyFunction(s, lastresult);
MapOrWhatever.Add(s, lastresult);
}
//Using for lookup
fun GetValueAtPosition(int position) returns double
{
CheckPositionIsInRangeElseException(position);
if(MapOrWhatever.ContainsKey(position))
return MapOrWhatever[position];
else
{
int i = 0;
//or possibly something clever with position % sampledist...
while(MapOrWhatever.Keys[i] < position) i+=sampledist;
return Interpolate(MapOrWhatever, i, i+sampledist, position);
}
}
Thinks... maybe if I keep a constant sampledist, I could just use an array and index it...
A std::map sounds reasonable for memoization here provided your values are guaranteed not to be contiguous.
#include <map>
// ...
std::map<int, double> memo;
memo.insert(std::make_pair(5, 0.5));
double x = memo[5]; // x == 0.5
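For the interpolating lookup the question asks about, std::map::lower_bound finds the first sample at or after a position, and the element before it is the other neighbour. A minimal sketch, assuming linear interpolation (the question does not say which kind is wanted) and a hypothetical function name value_at:

```cpp
#include <iterator>
#include <map>
#include <stdexcept>

// Looks a position up in the memo table. Returns the stored result on an
// exact hit, otherwise interpolates linearly between the two neighbours.
double value_at(const std::map<int, double>& memo, int position) {
    if (memo.empty() || position < memo.begin()->first ||
        position > memo.rbegin()->first)
        throw std::out_of_range("position outside the sampled range");

    auto upper = memo.lower_bound(position);            // first key >= position
    if (upper->first == position) return upper->second; // landed on a sample

    auto lower = std::prev(upper);                      // last key < position
    double t = static_cast<double>(position - lower->first) /
               static_cast<double>(upper->first - lower->first);
    return lower->second + t * (upper->second - lower->second);
}
```

Because the function takes the map by const reference and uses lower_bound rather than operator[], a lookup can never add entries to the table.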
If you consider a map, always consider a vector, too. For values that don't change much (or at all) while the application is running, a pre-sorted std::vector< std::pair<Key,Value> > searched with binary search often performs faster for lookups than a std::map<Key,Value>, even though both are O(log N) in theory, because the vector's elements are contiguous in memory.
You need to try and measure.
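As a starting point for such a measurement, a lookup in a pre-sorted vector can be sketched with std::lower_bound. The alias Sample and the function find_sample are illustrative names; the vector must already be sorted by position.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Sample = std::pair<int, double>; // (position, value)

// Binary search in a vector that is sorted by position.
// Returns a pointer to the value, or nullptr if the position is not stored.
const double* find_sample(const std::vector<Sample>& sorted, int position) {
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), position,
        [](const Sample& s, int pos) { return s.first < pos; });
    if (it == sorted.end() || it->first != position) return nullptr;
    return &it->second;
}
```

When the position is missing, the iterator returned by std::lower_bound still points at the next larger sample, which is exactly the neighbour an interpolating version would need.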
std::map is probably fine as long as speed is not too critical. If the speed of the lookup is critical you could try a vector as mentioned above where you go straight to the element you need (don't use a binary search since you can compute the index from the position). Something like:
vector<double> stored;
// store the values in the vector
double lastresult = 0.0;
for(int s = startposition; s <= endposition; s += sampledist)
{
    lastresult = MyFunction(s, lastresult);
    // the vector starts out empty, so append instead of writing to stored[index]
    stored.push_back(lastresult);
}
//then to lookup
double GetValueAtPosition(int position)
{
    int index = (position - startposition) / sampledist;
    double lower = stored[index];
    // index+1 must exist: handle a position on the last sample separately
    double upper = stored[index+1];
    return interpolate(lower, upper, position);
}
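A fuller version of this idea, with range checks and the last-sample case handled, might look like the sketch below. SampledTable is a hypothetical name, and linear interpolation is an assumption, since the interpolate function above is left unspecified.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical wrapper around evenly spaced, precomputed samples.
class SampledTable {
    int start_;
    int step_;
    std::vector<double> samples_; // samples_[i] is the value at start_ + i * step_

public:
    SampledTable(int start, int step, std::vector<double> samples)
        : start_(start), step_(step), samples_(std::move(samples)) {
        if (step_ <= 0 || samples_.empty())
            throw std::invalid_argument("need a positive step and at least one sample");
    }

    // Computes the index directly from the position; no search is needed.
    double value_at(int position) const {
        int last = start_ + step_ * static_cast<int>(samples_.size() - 1);
        if (position < start_ || position > last)
            throw std::out_of_range("position outside the sampled range");
        std::size_t index = static_cast<std::size_t>((position - start_) / step_);
        int offset = (position - start_) % step_;
        if (offset == 0) return samples_[index]; // landed exactly on a sample
        // offset != 0 implies position < last, so index + 1 is valid here
        double t = static_cast<double>(offset) / step_;
        return samples_[index] + t * (samples_[index + 1] - samples_[index]);
    }
};
```

With samples {0.0, 1.0, 4.0} taken at positions 0, 10, and 20, value_at(5) gives 0.5 and value_at(15) gives 2.5.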
Please see my comment, but here is the map documentation:
http://www.cplusplus.com/reference/stl/map/
An important note that another poster did not mention: if you use [] to search for a key that doesn't exist in the map, the map will create an object so that there's something there.
Edit: see the docs here for this info: http://msdn.microsoft.com/en-us/library/fe72hft9%28VS.80%29.aspx
Instead, use find(), which returns an iterator. Then test this iterator against map.end(); if it is equal, there was no match.
Refer to: http://www.cplusplus.com/reference/stl/map/
You can use a map:
typedef std::map<int,const double> mapType;
The performance of map::find is logarithmic in the size of the map.
Beware of operator[] in map:
If x matches the key of an element in the container, the function returns a reference to its mapped value.
If x does not match the key of any element in the container, the function inserts a new element with that key and returns a reference to its mapped value. Notice that this always increases the map size by one, even if no mapped value is assigned to the element (the element is constructed using its default constructor).
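A hash map (std::unordered_map since C++11; hash_map in older, non-standard library extensions) is usually the fastest standard container for lookups. Filling it takes a little more time than a map or vector, and it is not sorted. Searching for any value takes constant time on average.
std::unordered_map<int, double> memo;
memo.insert(std::make_pair(5, 0.5));
memo.insert(std::make_pair(7, 0.8));
.
.
.
// The container is not sorted, so stepping an iterator with ++ or -- does not
// visit the neighbouring keys. Look the neighbours up by key instead
// (sampledist is the spacing between samples).
std::unordered_map<int,double>::iterator cur = memo.find(5);
std::unordered_map<int,double>::iterator prev = memo.find(5 - sampledist);
std::unordered_map<int,double>::iterator next = memo.find(5 + sampledist);
Interpolate the current value with the next->second and prev->second values, after checking both iterators against memo.end().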