Suppose that for two different inputs ("tomas", "peter") a hash function yields the same key (3).
Please correct my assumption about how this works under the hood with separate chaining:
In a hash table, index 3 contains a pointer to a linked list header. The list contains two nodes implemented for example like this:
struct node{
char value_name[32]; // fixed-size buffer; an array member needs a size here
int value;
node* ptr_to_next_node;
};
The search mechanism remembers the input name ("peter") and compares it with value_name in each node. When a node's value_name equals "peter", the mechanism returns that node's value.
Is this correct? I've learned that an ordinary linked list doesn't contain the name of the node, so I didn't know how I could find the corresponding value for different names ("tomas", "peter") in a list with nodes like this:
struct node{
int value;
node* ptr_to_next_node;
};
Yes, it's correct that this is a possible implementation of part of a hash table.
When you say an "ordinary linked list doesn't contain the name of the node", I'd expect a linked list to be generic if the implementation language allows that. In C++ it would be a template.
It would be a linked list of a certain type: each element would hold an instance of, or a handle to, that type and a pointer to the next element, as your second code snippet shows, except with int replaced by the type.
In this case the type would most likely be a key-value pair.
So the linked list in that case doesn't directly contain the name; it contains an object that contains the name (and the value).
This is just one possible implementation, though. There are other options.
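To make the idea concrete, here is a minimal sketch of one bucket's chain as a generic singly linked list whose element type is a key-value pair. The names ListNode, Entry and find_in_chain are made up for illustration; a real hash table would differ in the details.

```cpp
#include <string>
#include <utility>

// Generic singly linked list node: T is whatever the list stores.
template <typename T>
struct ListNode {
    T element;
    ListNode* next;
};

// For a hash table bucket, the stored type is a key-value pair.
using Entry = std::pair<std::string, int>;

// Walk one bucket's chain and compare each stored key with the wanted key.
// Returns a pointer to the value, or nullptr if the key is not in this chain.
const int* find_in_chain(const ListNode<Entry>* head, const std::string& key) {
    for (const ListNode<Entry>* n = head; n != nullptr; n = n->next) {
        if (n->element.first == key) {
            return &n->element.second;
        }
    }
    return nullptr;
}
```

The list itself knows nothing about names; only the element type does, which is why the same node template works for any payload.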
Yes, basically: a table of structures, with the hash function limited to returning a value between 0 and table size - 1, which is used as an index into the table.
This gets you to the top of the chain of elements for that index, which may hold zero or more elements.
To save time and space, usually the table element is itself a chain list element. Since you specified strings as the data being stored in the hash table, usually the structure would be more like:
struct hash_table_element {
unsigned int length;
char *data_string;
struct hash_table_element *next;
};
with the string space allocated dynamically, perhaps with one of the standard library functions. There are a number of ways to manage string tables that may suit your particular use case better, of course. Checking the length first can often speed up the search if you rely on short-circuit evaluation:
if (length == element->length &&
    memcmp(string, element->data_string, length) == 0)
{
    found = TRUE;
}
This does not waste time comparing the strings unless they are the same length.
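The pieces above can be put together roughly as follows. This is only a sketch: hash_index uses a simple multiply-and-add hash that I picked for illustration, and chain_lookup is a hypothetical helper name, not something from a library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct hash_table_element {
    std::size_t length;
    const char* data_string;
    hash_table_element* next;
};

// Illustrative hash, reduced to the range 0 .. table_size - 1.
std::size_t hash_index(const char* s, std::size_t length, std::size_t table_size) {
    assert(table_size > 0);
    std::size_t h = 5381;
    for (std::size_t i = 0; i < length; ++i) {
        h = h * 33 + static_cast<unsigned char>(s[i]);
    }
    return h % table_size;
}

// Search one chain. The length test comes first, so memcmp only runs
// for candidates that have the same length as the wanted string.
const hash_table_element* chain_lookup(const hash_table_element* head,
                                       const char* string, std::size_t length) {
    for (const hash_table_element* e = head; e != nullptr; e = e->next) {
        if (length == e->length &&
            std::memcmp(string, e->data_string, length) == 0) {
            return e;
        }
    }
    return nullptr;
}
```

Note that memcmp returns 0 on a match, so the comparison with 0 is what makes the condition mean "equal".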
I am interested in performing the following operations on a set of data.
First, we are given a set of keys, as an example:
vector<int> keys{1,2,3,4,5,6};
Each of these keys is understood to be pointing to a unique entry (which is not important to specify, rather what is important is the relation whether each key is pointing to a separate entry, or some keys are pointing to the same entry). Initially, we do not know whether any key is pointing to the same entry or not, so we start out with a data structure that treats all entries as separate for each key:
surjectiveData<int> data;
data.populateUnique(keys.begin(),keys.end());
Graphically, we can illustrate the current state of data with each of the six keys pointing to its own entry,
where we use labels a,b,c,d,e,f to keep track of the unique entries in data. Now, consider adding additional information on which keys are pointing to the same entry. For example:
vector<pair<int,int>> identifications{make_pair(1,2),make_pair(3,4),make_pair(2,4),make_pair(5,6)};
data.couple(identifications.begin(),identifications.end());
The couple method of surjectiveData goes through the pairs provided and makes them point to the same unique entry. Graphically, the four identifications would in turn change data as follows:
and now there are only two unique entries in data, which here we denote abcd and ef. Note that once two or more keys point to the same entry, it does not matter which of these keys is identified with which of separate keys, all of them point to the same entry after identification.
Now that we are done with specifying key identifications, we could think of using data as follows. For example, we could ask what is the effective number of unique remaining entries
cout<<data.size()<<endl; // 2
Or, we could iterate through the entries and check how many keys point to each of them
for(auto it=data.begin();it!=data.end();it++){
cout<<it->size()<<" ";// 4 2
}
Ideally, internally the structure should take constant time for each identification, if possible.
I tried to search for such a data structure in the standard library, but could not find any. Did I miss it? Perhaps there is a smart way to implement it based on more basic objects? If so, what would be a minimal example for integers?
The operations you describe can be supported with a disjoint set data structure: https://en.wikipedia.org/wiki/Disjoint-set_data_structure
This is a linked data structure that supports 3 operations:
makeSet() creates a new singleton set and returns its element
union(a,b) given two elements, merges the sets that contain them. One element of each set will be the "representative" of that set
find(a) returns the representative of the set that contains a.
All operations take pretty much constant amortized time.
I usually implement this data structure in a single vector, where each index denotes a set element. If its value is > 0, then it's a set representative and the value is the size of the set. If its value is < 0, then the value is ~p, where p is its "parent" element in the same set. Sometimes I use the 0 value for "uninitialized".
It's not hard to keep track of the number of sets.
In C++, my usual implementation would look like this:
#include <vector>
class DisjointSets {
    unsigned num_sets = 0;
    std::vector<int> sets;
public:
    // Create a new singleton set and return its element
    unsigned make_set() {
        unsigned ret = (unsigned)sets.size();
        sets.push_back(1);
        ++num_sets;
        return ret;
    }
    // Find the representative element of an element's set
    unsigned find(unsigned x) {
        int p = sets[x];
        if (p >= 0) {
            return x;
        }
        unsigned root = find((unsigned)~p);
        sets[x] = ~(int)root; // path compression; might be the same
        return root;
    }
    // Merge the sets that contain two elements
    // returns true if a merge was done
    // ("union" is a reserved word in C++, hence the name)
    bool union_sets(unsigned a, unsigned b) {
        a = find(a);
        b = find(b);
        if (a == b) {
            return false;
        }
        if (sets[a] > sets[b]) {
            sets[a] += sets[b]; // add sizes
            sets[b] = ~(int)a;
        } else {
            sets[b] += sets[a]; // add sizes
            sets[a] = ~(int)b;
        }
        --num_sets;
        return true;
    }
    // get the size of an element's set
    unsigned set_size(unsigned x) {
        return (unsigned)sets[find(x)];
    }
    // get the number of sets
    unsigned set_count() const {
        return num_sets;
    }
};
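As a quick check against the example in the question, here is a self-contained sketch that applies the four identifications to the keys 1 to 6. It uses a deliberately compact union-find with separate parent and size vectors (a common variant, not the single-vector layout above); the names UnionFind, unite and couple_example are mine.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Compact union-find over elements 0 .. n-1.
struct UnionFind {
    std::vector<unsigned> parent, size;
    unsigned count; // number of disjoint sets

    explicit UnionFind(unsigned n) : parent(n), size(n, 1), count(n) {
        std::iota(parent.begin(), parent.end(), 0u); // every element is its own root
    }
    unsigned find(unsigned x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }
    bool unite(unsigned a, unsigned b) {
        a = find(a);
        b = find(b);
        if (a == b) return false;
        if (size[a] < size[b]) std::swap(a, b); // attach smaller under larger
        parent[b] = a;
        size[a] += size[b];
        --count;
        return true;
    }
};

// Keys 1..6 from the question are mapped to elements 0..5.
UnionFind couple_example() {
    UnionFind uf(6);
    const std::vector<std::pair<unsigned, unsigned>> identifications{
        {1, 2}, {3, 4}, {2, 4}, {5, 6}};
    for (const auto& p : identifications) {
        uf.unite(p.first - 1, p.second - 1);
    }
    return uf;
}
```

After couple_example(), count is the number of remaining entries, and size[find(k)] gives how many keys share the entry of element k.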
I need to use a data structure which supports constant time lookups on average. I think that using a std::unordered_map is a good way to do it. My data is a "collection" of numbers.
|115|190|380|265|
These numbers do not have to be in a particular order. I need to have about O(1) time to determine whether or not a given number exists in this data structure. I have the idea of using a std::unordered_map, which is actually a hash table (am I correct?). So the numbers will be keys, and then I would just have dummy values.
So basically I first need to determine if the key matching a given number exists in the data structure, and I run some algorithm based on that condition. And independently of that condition I also want to update a particular key. Let's say 190, and I want to add 20 to it, so now the key would be 210.
And now the data structure would look like this:
|115|210|380|265|
The reason I want to do this is because I have a recursive algorithm which traverses a binary search tree. Each node has an int value, and two pointers to the left and right nodes. When a leaf node is reached, I need to create a new field in the "hash table" data structure holding the current_node->value. Then when I go back up the tree in the recursion, I need to successively add each of the node's value to the previous sum stored in the key. And the reason why my data structure (which I suggest should be a std::unordered_map) has multiple fields of numbers is because each one of them represents a unique path going from a leaf node up the tree to a certain node in the middle. I check if the sum of all the values of the nodes on the path from the leaf going up to a given node is equal to the value of that node. So basically into each key is added the current value of the node, storing the sum of all the nodes on that path. I need to scan that data structure to determine if any one of the fields or keys is equal to the value of the current node. Also I want to insert new values into the data structure in near constant time. This is for competitive programming, and I would hesitate to use a std::vector because looking up an element and inserting an element takes linear time, I think. That would screw up my time complexity. Maybe I should use another data structure other than a std::unordered_map?
You can use unordered_map::erase and unordered_map::insert to update a key. The average time complexity is O(1) (the worst case is O(n)). If you are using C++17, you can also use unordered_map::extract to update a key. The time complexity is the same.
However, since you only need a set of numbers, I think unordered_set is more suitable for your algorithm.
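With std::unordered_set, the "add 20 to 190" update from the question could look roughly like this. The helper name shift_value is made up for the example.

```cpp
#include <unordered_set>

// Replace old_value with old_value + delta, if old_value is present.
// Both the lookup and the update are O(1) on average.
// Returns true if the set contained old_value.
bool shift_value(std::unordered_set<int>& s, int old_value, int delta) {
    auto it = s.find(old_value);
    if (it == s.end()) {
        return false;
    }
    s.erase(it);                 // elements of a set are immutable, so remove...
    s.insert(old_value + delta); // ...and insert the changed number
    return true;
}
```

One thing to keep in mind: if old_value + delta is already in the set, the two entries merge into one, because a set stores each number only once.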
#include <unordered_map>
#include <iostream>
int main()
{
std::unordered_map<int, int> m;
m[42]; // add
m[69]; // some
m[90]; // keys
int value = 90; // value to check for
auto it = m.find(value);
if (it != m.end()) {
m.erase(it); // remove it
m[value + 20]; // add an altered value
}
}
#include <unordered_map>
#include <string>
int main() {
// replace same key but other instance
std::unordered_map<std::string, int> eden;
std::string k1("existed key");
std::string k2("existed key");
const auto &[it, first] = eden.try_emplace(k1, 1);
if (!first) {
auto hint = eden.erase(it); // "it" is invalid after erase; use the returned iterator
eden.emplace_hint(hint, k2, 123);
}
}
Since C++17, you can also use its extract function as follows:
std::unordered_map<int, int> map = make_map();
auto node = map.extract(some_key);
node.key() = new_key;
map.insert(std::move(node));
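A slightly more defensive version of the extract approach, as a sketch (requires C++17; rekey is a hypothetical helper name). It checks that the old key was actually found before touching the node handle.

```cpp
#include <unordered_map>
#include <utility>

// Change the key of an existing element without copying its mapped value.
// Returns false if old_key is absent or if new_key is already taken.
bool rekey(std::unordered_map<int, int>& m, int old_key, int new_key) {
    auto node = m.extract(old_key); // empty node handle if old_key is absent
    if (node.empty()) {
        return false;
    }
    node.key() = new_key;
    auto result = m.insert(std::move(node));
    // If new_key already existed, the element stays in result.node and is
    // destroyed when result goes out of scope, so the old entry is dropped.
    return result.inserted;
}
```

Calling key() on an empty node handle is undefined behaviour, which is why the empty() check comes first.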
I am reading about hashing in Robert Sedgewick's book Algorithms in C++:
We might be using a header node to streamline the code for insertion
into an ordered list, but we might not want to use M header nodes for
individual lists in separate chaining. Indeed, we could even eliminate
the M links to the lists by having the first nodes in the lists
comprise the table.
class ST
{
struct node
{
Item item;
node* next;
node(Item x, node* t)
{ item = x; next = t; }
};
typedef node *link;
private:
link* heads;
int N, M;
Item searchR(link t, Key v)
{
if (t == 0) return nullItem;
if (t->item.key() == v) return t->item;
return searchR(t->next, v);
}
public:
ST(int maxN)
{
N = 0; M = maxN/5;
heads = new link[M];
for (int i = 0; i < M; i++) heads[i] = 0;
}
Item search(Key v)
{ return searchR(heads[hash(v, M)], v); }
void insert(Item item)
{ int i = hash(item.key(), M);
heads[i] = new node(item, heads[i]); N++; }
};
My two questions on the above text are about what the author means by:
"We could even eliminate the M links to the lists by having the first nodes in the lists comprise the table." How can we modify the above code for this?
"we might not want to use M header nodes for individual lists in separate chaining." What does this statement mean?
"We could even eliminate the M links to the lists by having the first nodes in the lists comprise the table."
Consider Node* x[n] vs Node x[n]. The former needs an extra pointer per bucket, a memory allocation on insertion for the head Node of every non-empty bucket, and an extra indirection for every hash table operation. The latter eliminates the n pointers but requires that unused elements can be put in some discernible not-in-use state (tracking which may or may not require extra memory), and if sizeof(Node) is greater than sizeof(Node*), it may be more wasteful of memory anyway. The difference in memory use can also affect cache efficiency: if the table has a high element-to-bucket ratio, then Node[] packs the Node data into fewer contiguous memory pages, and iterating (in unsorted order) is very cache efficient, whereas Node*[] will jump to separate memory allocations that might be all over the place (or, on the other hand, might actually be quite close together in some useful way, e.g. if both access patterns and dynamic memory allocation addresses correlate with the chronological order of object creation).
How can we modify above code for this?
First, a note on your existing code: heads[i] = new node(item, heads[i]); does not lose whatever is already in the bucket. The new node's next pointer is set to the old heads[i], so the item is prepended to the chain. It does not check whether the key is already present, though, so inserting the same key twice leaves two nodes in the list.
The design change discussed needs:
link* heads;
...changed to...
node* head;
You'd initialise it like this:
head = new node[M];
Which needs an extra node constructor (if item has an equivalent default constructor, you can leave out its initialisation below)
node() : item(nullItem), next(nullptr) { }
Then there are some knock-on changes to the rest of your code that are easy to work through. Basically, you're getting rid of a layer of pointers.
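For comparison, here is a self-contained sketch of a table whose slots are the first nodes of the chains. It is not the book's code: it uses std::string keys, int values and a "used" flag as the not-in-use marker, and the names Node and InlineHeadTable are my own.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Node {
    bool used = false;    // marks whether this slot holds an item
    std::string key;
    int value = 0;
    Node* next = nullptr; // overflow chain for colliding keys
};

class InlineHeadTable {
    std::vector<Node> table; // the first node of every chain lives in the table

    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % table.size();
    }

public:
    explicit InlineHeadTable(std::size_t m) : table(m) { assert(m > 0); }
    InlineHeadTable(const InlineHeadTable&) = delete;
    InlineHeadTable& operator=(const InlineHeadTable&) = delete;
    ~InlineHeadTable() {
        for (Node& head : table) { // only overflow nodes were heap-allocated
            Node* n = head.next;
            while (n != nullptr) { Node* dead = n; n = n->next; delete dead; }
        }
    }

    // No duplicate check, as in the original insert.
    void insert(const std::string& key, int value) {
        Node& head = table[index(key)];
        if (!head.used) { // empty bucket: no allocation needed
            head.used = true; head.key = key; head.value = value;
            return;
        }
        Node* extra = new Node;
        extra->used = true; extra->key = key; extra->value = value;
        extra->next = head.next; // link in right behind the in-table node
        head.next = extra;
    }

    const int* search(const std::string& key) const {
        const Node& head = table[index(key)];
        if (!head.used) return nullptr;
        for (const Node* n = &head; n != nullptr; n = n->next) {
            if (n->key == key) return &n->value;
        }
        return nullptr;
    }
};
```

The first item in each bucket costs no allocation and no pointer chase, at the price of one Node-sized slot per bucket even when the bucket is empty.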
"we might not want to use M header nodes for individual lists in separate chaining." What does this statement mean.
I didn't write it so can't say authoritatively, but it appears to be saying that when designing the list code, a decision might have been made to have an initial Node even in an empty list, as this simplifies the code for several list operations. While the extra data-less Node might seem a reasonable price when contemplating "usual" uses of a list, hash tables are unusual in that you want most of the lists chained off the buckets to have 0 or 1 elements, with exponentially fewer lists at each greater length. So such a list implementation is poorly suited to use in a hash table.
Which one would be more efficient?
I want to keep a list of items, but I am required to sort the list
by id,
by name
by course credits
by the user
Would it be best to add items to the list by id and then sort by the others, or to add items without any order and sort in whichever order the user needs, whenever it is needed?
If you're really required to keep the list sorted -- as opposed to using other data structures to give sorted access to the list -- then you could simply make a list whose elements have different pointers for different sort criteria.
In other words, instead of keeping just previous and next pointers, have previousById, nextById, previousByName, nextByName, previousByCredits and nextByCredits. Likewise, you would have three head and/or tail pointers, instead of just one.
Please note that this approach has the drawback of being inflexible when it comes to implementing additional sort criteria. I'm assuming that you're trying to solve a homework-type problem, which is why I tried to tailor the answer to what seem to be the homework requirements.
You can use three maps (or hashmaps):
One mapping the id to the item, one mapping name to an item reference (or pointer) and one mapping course credits to item reference again.
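A rough sketch of the three-map idea follows. The Course struct and the CourseIndex class are assumptions made for the example; multimaps are used for name and credits because those fields need not be unique.

```cpp
#include <map>
#include <string>
#include <vector>

struct Course {
    int id;
    std::string name;
    int credits;
};

class CourseIndex {
    std::map<int, Course> by_id;                       // owns the items
    std::multimap<std::string, const Course*> by_name; // points into by_id
    std::multimap<int, const Course*> by_credits;      // points into by_id

public:
    CourseIndex() = default;
    CourseIndex(const CourseIndex&) = delete; // copies would keep stale pointers
    CourseIndex& operator=(const CourseIndex&) = delete;

    void add(const Course& c) {
        auto res = by_id.emplace(c.id, c);
        if (!res.second) return; // id already present: ignore
        const Course* p = &res.first->second; // stable: std::map never moves nodes
        by_name.emplace(p->name, p);
        by_credits.emplace(p->credits, p);
    }

    // Each traversal just walks the matching map in its natural order.
    std::vector<int> ids_by_id() const {
        std::vector<int> out;
        for (const auto& kv : by_id) out.push_back(kv.first);
        return out;
    }
    std::vector<int> ids_by_name() const {
        std::vector<int> out;
        for (const auto& kv : by_name) out.push_back(kv.second->id);
        return out;
    }
    std::vector<int> ids_by_credits() const {
        std::vector<int> out;
        for (const auto& kv : by_credits) out.push_back(kv.second->id);
        return out;
    }
};
```

Each insertion costs O(log N) per map, and no sorting is needed when the user asks for a particular order. Hash maps would not work for this purpose, since they do not iterate in key order.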
It would be more efficient to keep the list sorted in whichever order you expect to need most often. For example, if you know you're going to be retrieving by id most often, keep it sorted by id; otherwise pick one of the others, though id would be the easiest if it is just an integer field.
To do that, on insert you would find the position where newid is less than nextid but greater than previousid, then allocate a new node with new and set the pointers appropriately.
Keeping the linked list sorted in some way is better than keeping it unsorted. You're adding some time to each insertion, but it's negligible compared to how long it would take to sort the whole list that particular way.
The most efficient approach would be to store the nodes as they are and keep 4 different indexes up to date. This way, when one order is required, you just pick the right index and that's all. The cost is O(log N) per insertion and O(1) per element when traversing.
Of course, keeping 4 indexes at once, with perhaps different requirements on uniqueness, and in the face of possible exceptions, is relatively difficult, but then, there's a Boost library for this: Boost MultiIndex
One example is to generate a set that can be sorted either by ID or by Name.
Since you can add as many indexes as you wish, it should get you going :)
Keep your linked list objects in the linked list, in any order. To sort the list by any key, use this code sketch:
#include <cstddef>
#include <string>
using std::string;
struct LinkedList {
    string name;
    LinkedList *prev;
    LinkedList *next;
};
void FillArray(LinkedList *first, LinkedList **&output, size_t &size) {
    // creates an array of pointers to every LinkedList object;
    // the caller owns the array and must delete[] it
    LinkedList *now;
    size_t i; // you may use int instead of size_t
    // count how many objects there are in the linked list
    size = 0;
    now = first;
    while (now != NULL) {
        size++;
        now = now->next;
    }
    // if the linked list is empty
    if (size == 0) {
        output = NULL;
        return;
    }
    // create the array of pointers
    output = new LinkedList*[size];
    // fill the array
    i = 0;
    now = first;
    while (now != NULL) {
        output[i++] = now;
        now = now->next;
    }
}
void SortByName(LinkedList **arrayOfPointers, size_t size) {
    // your function to sort the pointers by name here
}
void TemporarySort(LinkedList *first, LinkedList **&output, size_t &size) {
    // this function will create the array of pointers to your linked list,
    // sort this array, and return the sorted array. However, the linked
    // list will stay as it is. It's good for example when your linked list
    // is sorted by ID, but you need to print it sorted by names only once.
    FillArray(first, output, size);
    SortByName(output, size);
}
void PermanentSort(LinkedList *&first) {
    // This function will sort the linked list and save the new order
    // permanently.
    LinkedList **sorted;
    size_t size;
    TemporarySort(first, sorted, size);
    if (size == 0) {
        return; // empty list: nothing to relink
    }
    sorted[0]->prev = NULL;
    for (size_t i = 1; i < size; i++) {
        sorted[i-1]->next = sorted[i];
        sorted[i]->prev = sorted[i-1];
    }
    sorted[size-1]->next = NULL;
    first = sorted[0]; // the list now starts at the smallest element
    delete[] sorted;
}
I hope this actually helps you. If you don't understand any line of the code, simply leave a comment on this "answer".
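If the standard library is allowed, the "temporary sort" idea can be written more compactly with std::vector and std::sort. This is a sketch under that assumption; Item and sorted_by_name are names chosen for the example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Item {
    std::string name;
    Item* prev;
    Item* next;
};

// Collect pointers to every node, then sort the pointers by name.
// The linked list itself is left untouched.
std::vector<Item*> sorted_by_name(Item* first) {
    std::vector<Item*> view;
    for (Item* n = first; n != nullptr; n = n->next) {
        view.push_back(n);
    }
    std::sort(view.begin(), view.end(),
              [](const Item* a, const Item* b) { return a->name < b->name; });
    return view;
}
```

Sorting by another field only needs a different comparison lambda, and the vector cleans up after itself, so there is no array to delete.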
I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it different: Is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
bool operator()(const T& v1, const T& v2) const {
return hash<T>()(v1) < hash<T>()(v2);
}
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly, STL will complain here because there may be keys k1,k2 and neither k1 < k2 nor k2 < k1. You would need to use multimap and overwrite the equal-check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?
In answer to your edited question, no those two snippets are not equivalent at all. std::map stores nodes in a tree structure, unordered_map stores them in a hashtable*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets" where each bucket corresponds to a range of hash values. Basically, the implementation goes like this:
function add_value(object key, object value) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
buckets[bucket_index] = new linked_list();
}
buckets[bucket_index].add(new key_value(key, value));
}
function get_value(object key) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
return null;
}
foreach(key_value kv in buckets[bucket_index]) {
if (kv.key == key) {
return kv.value;
}
}
return null; // key not found in its bucket
}
Obviously that's a serious simplification and a real implementation would be much more advanced (for example, supporting resizing the buckets array, maybe using a tree structure instead of a linked list for the buckets, and so on), but that should give an idea of why you can't get back the values in any particular order. See Wikipedia for more information.
* Technically, the internal implementations of std::map and unordered_map are implementation-defined, but the standard requires certain Big-O complexity for operations, which implies those internal implementations.
"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.
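If you are curious where the keys currently sit, std::unordered_map exposes its buckets through bucket_count(), bucket() and the local iterators begin(n) and end(n). The helper below is a small sketch (keys_by_bucket is a made-up name) that groups the keys by bucket; the layout it reports depends on the implementation and can change after any insertion that triggers a rehash.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Group the keys of m by the bucket the container currently keeps them in.
// The result has one inner vector per bucket; many of them may be empty.
std::vector<std::vector<int>> keys_by_bucket(const std::unordered_map<int, int>& m) {
    std::vector<std::vector<int>> layout(m.bucket_count());
    for (std::size_t b = 0; b < m.bucket_count(); ++b) {
        for (auto it = m.begin(b); it != m.end(b); ++it) {
            layout[b].push_back(it->first);
        }
    }
    return layout;
}
```

Printing this layout before and after inserting a batch of new keys is a simple way to see why no particular iteration order can be relied on.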
As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.
If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.
You are right, unordered_map is actually hash ordered. Note that most current implementations (pre TR1) call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, so this means that the order is not so unordered...
Now, what does it mean that it is hash ordered? As a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it was renamed in TR1: the old name suggested an order. Now we know that an order is actually used, but you can disregard it as it is unpredictable.