Implementing surjective data structure? - c++

I am interested in performing the following operations on a set of data.
First, we are given a set of keys, as an example:
vector<int> keys{1,2,3,4,5,6};
Each of these keys is understood to be pointing to a unique entry (which is not important to specify, rather what is important is the relation whether each key is pointing to a separate entry, or some keys are pointing to the same entry). Initially, we do not know whether any key is pointing to the same entry or not, so we start out with a data structure that treats all entries as separate for each key:
surjectiveData<int> data;
data.populateUnique(keys.begin(),keys.end());
Graphically, the current state of data is 1 -> a, 2 -> b, 3 -> c, 4 -> d, 5 -> e, 6 -> f, where we use labels a,b,c,d,e,f to keep track of the unique entries in data. Now, consider adding additional information on which keys are pointing to the same entry. For example:
vector<pair<int,int>> identifications{make_pair(1,2),make_pair(3,4),make_pair(2,4),make_pair(5,6)};
data.couple(identifications.begin(),identifications.end());
The couple method of surjectiveData goes through the pairs provided and makes them point to the same unique entry. Graphically, the four identifications would in turn change data as follows: after (1,2) the entries are ab, c, d, e, f; after (3,4) they are ab, cd, e, f; after (2,4) they are abcd, e, f; and after (5,6) they are abcd, ef.
Now there are only two unique entries in data, which here we denote abcd and ef. Note that once two or more keys point to the same entry, it does not matter which of these keys is identified with a key from another group: all of them point to the same entry after identification.
Now that we are done with specifying key identifications, we could think of using data as follows. For example, we could ask what is the effective number of unique remaining entries
cout<<data.size()<<endl; // 2
Or, we could iterate through the entries and check how many keys point to each of them
for(auto it=data.begin();it!=data.end();it++){
cout<<it->size()<<" ";// 4 2
}
Ideally, internally the structure should take constant time for each identification, if possible.
I tried to search for such a data structure in the standard library, but could not find any. Did I miss it? Perhaps there is a smart way to implement it based on more basic objects? If so, what would be a minimal example for integers?

The operations you describe can be supported with a disjoint set data structure: https://en.wikipedia.org/wiki/Disjoint-set_data_structure
This is a linked data structure that supports 3 operations:
makeSet() creates a new singleton set and returns its element
union(a,b) given two elements, merges the sets that contain them. One element of each set will be the "representative" of that set
find(a) returns the representative of the set that contains a.
With path compression and union by size, all operations take amortized time proportional to the inverse Ackermann function, which is effectively constant.
I usually implement this data structure in a single vector, where each index denotes a set element. If its value is > 0, then it's a set representative and the value is the size of the set. If its value is < 0, then its value is ~p, where p is its "parent" element in the same set. Sometimes I use the 0 value for "uninitialized".
It's not hard to keep track of the number of sets.
In C++, my usual implementation would look like this:
#include <vector>

class DisjointSets {
    unsigned num_sets = 0;
    std::vector<int> sets;
public:
    // Create a new singleton set and return its element
    unsigned make_set() {
        unsigned ret = (unsigned)sets.size();
        sets.push_back(1);
        ++num_sets;
        return ret;
    }
    // Find the representative element of an element's set
    unsigned find(unsigned x) {
        int p = sets[x];
        if (p >= 0) {
            return x;
        }
        unsigned root = find((unsigned)~p);
        sets[x] = ~(int)root; // path compression; might be the same
        return root;
    }
    // Merge the sets that contain two elements
    // returns true if a merge was done
    // (named unite because union is a reserved word in C++)
    bool unite(unsigned a, unsigned b) {
        a = find(a);
        b = find(b);
        if (a == b) {
            return false;
        }
        if (sets[a] > sets[b]) {
            sets[a] += sets[b]; // add sizes
            sets[b] = ~(int)a;
        } else {
            sets[b] += sets[a]; // add sizes
            sets[a] = ~(int)b;
        }
        --num_sets;
        return true;
    }
    // get the size of an element's set
    unsigned set_size(unsigned x) {
        return (unsigned)sets[find(x)];
    }
    // get the number of sets
    unsigned set_count() const {
        return num_sets;
    }
};
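To connect this back to the interface in the question, here is a minimal sketch that maps arbitrary int keys onto a union-find forest. The class name KeyedDisjointSets and the methods add and set_sizes are made up for illustration (only couple and size come from the question), and it uses separate parent and size vectors rather than the single-vector encoding above.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical wrapper (not a standard library type): maps int keys onto
// a union-find forest with path halving and union by size.
class KeyedDisjointSets {
    std::unordered_map<int, std::size_t> index_of;  // key -> element index
    std::vector<std::size_t> parent;                // parent[i] == i for a root
    std::vector<std::size_t> size_of;               // meaningful only for roots
    std::size_t num_sets = 0;

    std::size_t find(std::size_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }

public:
    // Register a key as its own singleton entry (no-op if already present).
    void add(int key) {
        if (index_of.count(key)) return;
        index_of[key] = parent.size();
        parent.push_back(parent.size());
        size_of.push_back(1);
        ++num_sets;
    }

    // Make two registered keys point to the same entry.
    void couple(int a, int b) {
        assert(index_of.count(a) && index_of.count(b));
        std::size_t ra = find(index_of[a]);
        std::size_t rb = find(index_of[b]);
        if (ra == rb) return;
        if (size_of[ra] < size_of[rb]) std::swap(ra, rb);
        parent[rb] = ra;             // attach the smaller tree under the larger
        size_of[ra] += size_of[rb];
        --num_sets;
    }

    // Number of unique entries remaining.
    std::size_t size() const { return num_sets; }

    // Number of keys per entry, ordered by the root's element index.
    std::vector<std::size_t> set_sizes() const {
        std::vector<std::size_t> out;
        for (std::size_t i = 0; i < parent.size(); ++i)
            if (parent[i] == i) out.push_back(size_of[i]);
        return out;
    }
};
```

Adding keys 1 to 6 and then coupling (1,2), (3,4), (2,4), (5,6) leaves size() at 2 and set_sizes() at {4, 2}, matching the output expected in the question.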

Related

When is a multiset sorted? Insertion, iteration, both?

I have a multi-set containing pointers to custom types. I have provided a custom sorter to the multi-set that compares on a particular attribute of the custom type.
If I change the value of the attribute on any given item (in a way that would influence the sorting order), do I have to remove the item from the set and re-insert it to guarantee ordering? Or will I still get the items in order any time I create an iterator (or a foreach loop)?
I can make a quick test for myself, but I wanted to know if the behavior would be consistent on any platform and compiler or if it is standard.
Edit: Here is an example I tried. I noticed two things.
In a multi-set if I change the value that is used to compare before removing the key, I can no longer remove it. Otherwise, my original thought of removing and reinserting seems the best way for this to work.
#include <stdio.h>
#include <set>
struct NodePointerCompare;
struct Node {
int priority;
};
struct NodePointerCompare {
bool operator()(const Node* lhs, const Node* rhs) const {
return lhs->priority < rhs->priority;
}
};
int main()
{
Node n1{1};
Node n2{2};
Node n3{3};
std::multiset<Node*, NodePointerCompare> nodes;
nodes.insert(&n1);
nodes.insert(&n2);
nodes.insert(&n3);
printf("First round\n");
for(Node* n : nodes) {
printf("%d\n", n->priority);
}
n1.priority = 10;
printf("Second round\n");
for(Node* n : nodes) {
printf("%d\n", n->priority);
}
n1.priority = 1;
printf("Third round\n");
nodes.erase(&n1);
n1.priority = 10;
nodes.insert(&n1);
for(Node* n : nodes) {
printf("%d\n", n->priority);
}
return 0;
}
This is the output I get
First round
1
2
3
Second round
10
2
3
Third round
2
3
10
http://eel.is/c++draft/associative.reqmts#general-3
For any two keys k1 and k2 in the same container, calling comp(k1, k2) shall always return the same value.
It is simply illegal to change the object in a way that affects how it compares to other objects within the associative container.
If you want to do that, you have to get the object out of the container, apply the change to it, and put it back in. Have a look at https://en.cppreference.com/w/cpp/container/multiset/extract if that's what you want to do.
When is a multiset sorted? Insertion, iteration, both?
The standard doesn't specify explicitly, but practically speaking the ordering must be established on insertion.
If I change the value of the attribute on any given item (in a way that would influence the sorting order). Do I have to remove the item from the set and re-insert it to guarantee ordering?
You may not change the ordering of an element while it is in the set.
However, instead of erase + insert of an element with a different value, you can extract + modify + re-insert, which should be slightly more efficient (or significantly, depending on the element type).
Here is an example I tried.
The behaviour of the example is undefined.
The container must remain sorted at all times because begin has constant complexity. Changing the comparison order of elements in the container is undefined behavior per [associative.reqmts.general]/3 (and [res.on.functions]/2.3):
For any two keys k1 and k2 in the same container, calling comp(k1, k2) shall always return the same value.
You can use node handles to efficiently modify elements by temporarily removing them from the container, although for elements that are just pointers the only efficiency is avoiding a memory (de)allocation.
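As a rough sketch of the extract + modify + re-insert approach (C++17), reusing Node and NodePointerCompare from the question: the helpers reprioritize and priorities are names invented here. Because the set stores pointers, the helper first locates the exact pointer among the entries with equal priority, and only changes the priority while the node is outside the container.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

struct Node {
    int priority;
};

struct NodePointerCompare {
    bool operator()(const Node* lhs, const Node* rhs) const {
        return lhs->priority < rhs->priority;
    }
};

using NodeSet = std::multiset<Node*, NodePointerCompare>;

// Hypothetical helper: change n's priority without mutating it while it is
// still an element of the container. Returns false if n is not in the set.
bool reprioritize(NodeSet& nodes, Node* n, int new_priority) {
    // equal_range yields every entry whose priority compares equal to n's;
    // scan that range for the exact pointer.
    auto [first, last] = nodes.equal_range(n);
    for (auto it = first; it != last; ++it) {
        if (*it == n) {
            auto handle = nodes.extract(it);  // unlink the node, no deallocation
            n->priority = new_priority;       // safe: n is not in the set now
            nodes.insert(std::move(handle));  // relink at the new position
            return true;
        }
    }
    return false;
}

// Hypothetical helper: priorities in iteration order.
std::vector<int> priorities(const NodeSet& nodes) {
    std::vector<int> out;
    for (const Node* n : nodes) out.push_back(n->priority);
    return out;
}
```

With the three nodes from the question inserted, reprioritize(nodes, &n1, 10) gives the iteration order 2, 3, 10 without ever breaking the ordering invariant.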

How to change the key in an unordered_map?

I need to use a data structure which supports constant time lookups on average. I think that using a std::unordered_map is a good way to do it. My data is a "collection" of numbers.
|115|190|380|265|
These numbers do not have to be in a particular order. I need to have about O(1) time to determine whether or not a given number exists in this data structure. I have the idea of using a std::unordered_map, which is actually a hash table (am I correct?). So the numbers will be keys, and then I would just have dummy values.
So basically I first need to determine if the key matching a given number exists in the data structure, and I run some algorithm based on that condition. And independently of that condition I also want to update a particular key. Let's say 190, and I want to add 20 to it, so now the key would be 210.
And now the data structure would look like this:
|115|210|380|265|
The reason I want to do this is because I have a recursive algorithm which traverses a binary search tree. Each node has an int value, and two pointers to the left and right nodes. When a leaf node is reached, I need to create a new field in the "hash table" data structure holding the current_node->value. Then when I go back up the tree in the recursion, I need to successively add each of the node's value to the previous sum stored in the key. And the reason why my data structure (which I suggest should be a std::unordered_map) has multiple fields of numbers is because each one of them represents a unique path going from a leaf node up the tree to a certain node in the middle. I check if the sum of all the values of the nodes on the path from the leaf going up to a given node is equal to the value of that node. So basically into each key is added the current value of the node, storing the sum of all the nodes on that path. I need to scan that data structure to determine if any one of the fields or keys is equal to the value of the current node. Also I want to insert new values into the data structure in near constant time. This is for competitive programming, and I would hesitate to use a std::vector because looking up an element and inserting an element takes linear time, I think. That would screw up my time complexity. Maybe I should use another data structure other than a std::unordered_map?
You can use unordered_map::erase and unordered_map::insert to update a key. The average time complexity is O(1) (the worst case is O(n)). If you are using C++17, you can also use unordered_map::extract to update a key; the time complexity is the same.
However, since you only need a set of numbers, I think unordered_set is more suitable for your algorithm.
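A minimal sketch of the unordered_set variant, using the numbers from the question. The function name replace_value is made up here. It does the membership test and the "update" (erase the old number, insert the new one) in average O(1) each.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical helper: if old_value is present, replace it with new_value.
// Returns whether old_value was found.
bool replace_value(std::unordered_set<int>& numbers, int old_value, int new_value) {
    auto it = numbers.find(old_value);  // average O(1) lookup
    if (it == numbers.end()) return false;
    numbers.erase(it);                  // average O(1)
    numbers.insert(new_value);          // average O(1); no-op if already present
    return true;
}
```

Note that if new_value is already in the set, the set ends up one element smaller, since a set holds each number at most once. If distinct paths can produce the same sum and must be kept apart, std::unordered_multiset would be the closer fit.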
#include <unordered_map>
#include <iostream>
int main()
{
std::unordered_map<int, int> m;
m[42]; // add
m[69]; // some
m[90]; // keys
int value = 90; // value to check for
auto it = m.find(value);
if (it != m.end()) {
m.erase(it); // remove it
m[value + 20]; // add an altered value
}
}
#include <unordered_map>
#include <string>
int main() {
// replace same key but other instance
std::unordered_map<std::string, int> eden;
std::string k1("existed key");
std::string k2("existed key");
auto [it, inserted] = eden.try_emplace(k1, 1);
if (!inserted) {
// erase invalidates it, so use the iterator erase returns as the hint
auto next = eden.erase(it);
eden.emplace_hint(next, k2, 123);
}
}
Since C++17, you can also use its extract function as follows:
std::unordered_map<int, int> map = make_map();
auto node = map.extract(some_key);
node.key() = new_key;
map.insert(std::move(node));

Implementing ranged-loop in custom hashed set: accessing only entries not marked as empty

Container basic setup
I have implemented a simple custom unordered set container, that uses hashing. Internally, it stores data like this:
template <typename T>
class Set
{
    T *data = nullptr;
    bool *emptyList = nullptr;
    int size = 0;
    // ... (inner methods go here)
};
That is, it stores two arrays. One called data with the actual values of templated type T, and another called emptyList with bool values that mark whether at that position the set is considered empty or not.
This way, linear probing to store new values and erasing entries both become much cheaper: storing is just a matter of finding the next index where emptyList[index] is true, and erasing is just a matter of setting emptyList[index] to true.
Problem with the ranged for-loop
Currently, I allow iteration over the values stored in the set like for(auto i : set_instance) by having the following public member functions in the set class:
T* begin() const { return data; }
T* end() const { return data + size; }
The problem with that, of course, is that a ranged for-loop also accesses the entries in data that should not be accessed since they are marked in emptyList as being empty.
Is there a way for me to make it so that when the user tries to iterate over the set with ranged loops, only the entries in data that correspond to the entries in emptyList that are not marked as true are actually accessed/processed by the ranged loop?
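One common way to get this behaviour is to return a small custom iterator from begin() and end() instead of a raw pointer, and have that iterator skip slots flagged as empty. The sketch below is an assumption-laden illustration, not the asker's class: SlotSet, put and is_empty are made-up names, std::vector replaces the raw arrays, and put writes straight to a slot instead of hashing.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for the hashed set: a slot array plus empty flags.
// T must be default-constructible because unused slots hold a default T.
template <typename T>
class SlotSet {
    std::vector<T> data;
    std::vector<bool> is_empty;  // true = slot holds no value

public:
    explicit SlotSet(std::size_t capacity)
        : data(capacity), is_empty(capacity, true) {}

    // Stand-in for the real hashing insert: writes directly to a slot.
    void put(std::size_t slot, const T& value) {
        assert(slot < data.size());
        data[slot] = value;
        is_empty[slot] = false;
    }

    class const_iterator {
        const SlotSet* owner;
        std::size_t pos;

        // Advance pos to the next occupied slot, or to the end.
        void skip_empty() {
            while (pos < owner->data.size() && owner->is_empty[pos]) ++pos;
        }

    public:
        using iterator_category = std::forward_iterator_tag;
        using value_type = T;
        using difference_type = std::ptrdiff_t;
        using pointer = const T*;
        using reference = const T&;

        const_iterator(const SlotSet* o, std::size_t p) : owner(o), pos(p) {
            skip_empty();
        }
        reference operator*() const { return owner->data[pos]; }
        pointer operator->() const { return &owner->data[pos]; }
        const_iterator& operator++() {
            ++pos;
            skip_empty();
            return *this;
        }
        const_iterator operator++(int) {
            const_iterator tmp = *this;
            ++*this;
            return tmp;
        }
        bool operator==(const const_iterator& other) const { return pos == other.pos; }
        bool operator!=(const const_iterator& other) const { return pos != other.pos; }
    };

    const_iterator begin() const { return const_iterator(this, 0); }
    const_iterator end() const { return const_iterator(this, data.size()); }
};
```

A range-based for loop only needs begin(), end(), operator!=, operator++ and operator*, so for (auto v : set_instance) then visits occupied slots only.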

3D Hash with no collisions on close values

I need a hash function for 3D vectors with no collisions between close key values.
The key is a 3d vector of integers. I want no collisions within roughly a 64 * 64 * 64 "area" or larger.
Does anyone know of any hashing functions suited for this purpose, or even better, how would you go about designing a hash for this?
If it's necessary to know, I will be implementing it in C++.
Why not create a Map<int,Map<int,Map<int,Object>>> for your objects? Where each int is x,y,z or whatever you're labeling your axis.
Here's an example of how you could use it.
int x,y,z;
map<int,map<int,map<int,string>>> Vectors = map<int,map<int,map<int,string>>>();
/*give x, y and z a real value*/
Vectors[x][y][z] = "value";
/*more code*/
string ValueAtXYZ = Vectors[x][y][z];
Just to explain, because it's not super obvious.
The Vectors[x] returns a map<int,map<int,string>>.
I then immediately use that maps [] operator with [y].
That then returns (you guessed it) a map<int,string>.
I immediately use that maps [] operator with [z] and can now set the string.
Note: Be sure to loop through it using iterators and not a for(int x = 0; /*bad code*/;x++) loop, because [] adds an element at every location it's used to look up.
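To make the "unexpected add" concrete, here is a small sketch. Grid, lookup and count_values are names invented for this illustration. lookup reads through the nested maps with find and never creates entries, and count_values walks them with iterators (via range-based for).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

using Grid = std::map<int, std::map<int, std::map<int, std::string>>>;

// Read-only lookup: returns nullptr instead of creating missing entries.
const std::string* lookup(const Grid& g, int x, int y, int z) {
    auto ix = g.find(x);
    if (ix == g.end()) return nullptr;
    auto iy = ix->second.find(y);
    if (iy == ix->second.end()) return nullptr;
    auto iz = iy->second.find(z);
    if (iz == iy->second.end()) return nullptr;
    return &iz->second;
}

// Count the stored strings by iterating, without touching operator[].
std::size_t count_values(const Grid& g) {
    std::size_t n = 0;
    for (const auto& plane : g)
        for (const auto& row : plane.second)
            n += row.second.size();
    return n;
}
```

After g[1][2][3] = "value", calling lookup(g, 9, 9, 9) returns nullptr and leaves count_values(g) at 1, whereas merely evaluating g[9][9][9] inserts an empty string and raises the count to 2.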
Edit:
If you want to make sure that you're not overriding an existing value you could do this.
string saveOldValue;
if(Vectors[x][y][z] != ""/*this is the default value of a string*/)
{
/*There was a string in that vector so store the old Value*/
saveOldValue = Vectors[x][y][z];
}
Vectors[x][y][z] = "Value";
If you use [] on a key that isn't in the map the map creates a default object there. For strings this would be the empty string "".
Or
if( Vectors.find(x)!=Vectors.end()
&& Vectors[x].find(y)!=Vectors[x].end()
&& Vectors[x][y].find(z)!=Vectors[x][y].end())
{
/* Vectors[x][y][z] has something in it*/
}else
{
/*Theres nothing at Vectors[x][y][z] so go for it*/
Vectors[x][y][z] ="value";
}
This uses the find(value) function, which returns an iterator to the location of the key "value" OR an iterator that points to map::end() if that key is not in the current map.
If you don't have a default value for the thing being stored, then use the second check to do your inserts. This greatly increases the usability of this answer and unclutters your code.
The insert function has its place, but in this example it would be very hard to use.
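If an actual hash value is wanted rather than nested maps, one possible sketch is to pack the low 6 bits of each coordinate into an 18-bit number. Any 64 consecutive integers have distinct remainders modulo 64, so two different points inside the same 64 * 64 * 64 cube always get different values. Vec3i, Vec3iBlockHash and cube_is_collision_free are names invented here, and the guarantee only holds when the value indexes a table of at least 2^18 slots directly: std::unordered_map reduces the hash modulo its bucket count, so it does not preserve it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical key type for the illustration.
struct Vec3i {
    int x, y, z;
};

bool operator==(const Vec3i& a, const Vec3i& b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

// Packs (x mod 64, y mod 64, z mod 64) into the range [0, 2^18).
struct Vec3iBlockHash {
    std::size_t operator()(const Vec3i& v) const {
        // Conversion to unsigned is modular, so negative values work too.
        auto low6 = [](int c) { return static_cast<std::uint32_t>(c) & 63u; };
        return static_cast<std::size_t>(low6(v.x) | (low6(v.y) << 6) | (low6(v.z) << 12));
    }
};

// Brute-force check: hash all 64^3 points of the cube starting at origin
// and report whether every hash value was distinct.
bool cube_is_collision_free(Vec3i origin) {
    Vec3iBlockHash h;
    std::vector<bool> seen(std::size_t{1} << 18, false);
    for (int dx = 0; dx < 64; ++dx)
        for (int dy = 0; dy < 64; ++dy)
            for (int dz = 0; dz < 64; ++dz) {
                std::size_t idx = h(Vec3i{origin.x + dx, origin.y + dy, origin.z + dz});
                if (seen[idx]) return false;
                seen[idx] = true;
            }
    return true;
}
```

The trade-off is that points exactly 64 apart along an axis, such as (0,0,0) and (64,0,0), collide by construction, so the table still needs a collision strategy for keys outside one cube.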

implementing hash table using vector c++

I've tried to implement hash table using vector. My table size will be defined in the constructor, for example lets say table size is 31, to create hash table I do followings:
vector<string> entries; // it is filled with entries that I'll put into hash table;
vector<string> hashtable;
hashtable.resize(31);
for(int i=0;i<entries.size();i++){
int index=hashFunction(entries[i]);
// now I need to know whether I've already put an entry into hashtable[index] or not
}
Can anyone help me with how I could do that?
Each cell in your hashtable comes with a bit of extra packaging.
If your hash allows deletions you need a state such that a cell can be marked as "deleted". This enables your search to continue looking even if it encounters this cell which has no actual value in it.
So a cell can have 3 states, occupied, empty and deleted.
You might also wish to store the hash value in the cell. This is useful when you come to resize the table, as you don't need to rehash all the entries.
In addition, it can serve as a cheap first comparison, because comparing two numbers is likely to be quicker than comparing two objects.
These are considerations if this is an exercise, or if you find that std::unordered_map / std::unordered_set is not adequate for your purpose or if those are not available to you.
For practical purpose, at least try using those first.
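As an exercise-level sketch of the three cell states described above, here is a fixed-size table with linear probing. CellState, Cell and StringTable are names made up for this illustration, std::hash stands in for the question's hashFunction, and there is no resizing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

enum class CellState { Empty, Occupied, Deleted };

struct Cell {
    CellState state = CellState::Empty;
    std::string value;
};

// Hypothetical fixed-size string table using open addressing.
class StringTable {
    std::vector<Cell> cells;

    // First slot to probe for s.
    std::size_t home(const std::string& s) const {
        return std::hash<std::string>{}(s) % cells.size();
    }

public:
    explicit StringTable(std::size_t size) : cells(size) { assert(size > 0); }

    bool contains(const std::string& s) const {
        std::size_t i = home(s);
        for (std::size_t n = 0; n < cells.size(); ++n, i = (i + 1) % cells.size()) {
            if (cells[i].state == CellState::Empty) return false;  // probe chain ends
            if (cells[i].state == CellState::Occupied && cells[i].value == s) return true;
            // Deleted cells are skipped so the search keeps going.
        }
        return false;
    }

    // Returns false if s is already present or the table is full.
    bool insert(const std::string& s) {
        if (contains(s)) return false;
        std::size_t i = home(s);
        for (std::size_t n = 0; n < cells.size(); ++n, i = (i + 1) % cells.size()) {
            if (cells[i].state != CellState::Occupied) {  // Empty or Deleted: reuse
                cells[i].state = CellState::Occupied;
                cells[i].value = s;
                return true;
            }
        }
        return false;
    }

    // Marks the cell as Deleted rather than Empty so later probes continue past it.
    bool erase(const std::string& s) {
        std::size_t i = home(s);
        for (std::size_t n = 0; n < cells.size(); ++n, i = (i + 1) % cells.size()) {
            if (cells[i].state == CellState::Empty) return false;
            if (cells[i].state == CellState::Occupied && cells[i].value == s) {
                cells[i].state = CellState::Deleted;
                cells[i].value.clear();
                return true;
            }
        }
        return false;
    }
};
```

The state field answers the question directly: hashtable[index] has already been used exactly when its state is Occupied.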
It is possible to have several items for the same hash value
You just need to define your hash-table like this:
vector<vector<string>> hashtable;
hashtable.resize(32); //0-31
for(int i=0;i<entries.size();i++){
int index=hashFunction(entries[i]);
hashtable[index].push_back(entries[i]);
}
A simple implementation of a hash table uses a vector of pointers to the actual entries:
class hash_map {
public:
    iterator find(const key_type& key);
    //...
private:
    struct Entry { // representation
        key_type key;
        mapped_type val;
        bool erased; // set when the entry has been removed
        Entry* next; // hash overflow link
        Entry(key_type k, mapped_type v, Entry* n)
            : key(k), val(v), erased(false), next(n) {}
    };
    vector<Entry> v; // the actual entries
    vector<Entry*> b; // the hash table, pointers into v
};
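To find a value, operator[] uses a hash function to find an index in the hash table for the key:
mapped_type& hash_map::operator[](const key_type& k) {
    size_type i = hash(k) % b.size(); // hash
    for (Entry* p = b[i]; p; p = p->next) // search among entries hashed to i
        if (eq(k, p->key)) { // found
            if (p->erased) { // re-insert
                p->erased = false;
                no_of_erased--;
                return p->val = default_value;
            }
            return p->val;
        }
    // not found: grow the table if it is too full, then retry the lookup
    // (too_full() and grow() are placeholder names; the original elides this step)
    if (too_full()) {
        grow();
        return operator[](k);
    }
    // assumes v has spare capacity, so push_back does not reallocate
    // and invalidate the pointers stored in b
    v.push_back(Entry(k, default_value, b[i])); // add Entry
    b[i] = &v.back(); // point to new element
    return b[i]->val;
}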