B-Tree Node Splitting Techniques - c++

I've stumbled upon a problem while doing my DSA (Data Structures and Algorithms) homework. I've been asked to implement a B-Tree with insertion and search algorithms. So far the search works correctly, but I'm having trouble implementing the insertion function, specifically the logic behind the B-Tree node-splitting algorithm. The pseudocode/C-style code I came up with is the following:
#define D 2
#define DD (2*D)
typedef struct node btreenode; //forward declaration so that "btree" can be defined first
typedef btreenode* btree;
struct node
{
    int keys[DD]; //D == 2 and DD == 2*D;
    btree pointers[DD+1];
    int index; //used to iterate through the "keys" array
};
void splitNode(btree* parent, btree* child1, btree* child2)
{
    //Copies the content from the split node to the children
    (*child1)->keys[0] = (*parent)->keys[0];
    (*child1)->keys[1] = (*parent)->keys[1];
    (*child2)->keys[0] = (*parent)->keys[2];
    (*child2)->keys[1] = (*parent)->keys[3];
    (*child1)->index = 1;
    (*child2)->index = 1;
    //"Clears" the parent node from any data
    for(int i = 0; i<DD; i++) (*parent)->keys[i] = -1;
    for(int i = 0; i<DD+1; i++) (*parent)->pointers[i] = NULL;
    //Fixes the pointers to the children
    (*parent)->index = 0;
    //the line below was taken out for creating a new node that didn't have to be there.
    //(*parent)->key[(*parent)->index] = newNode(); // The newNode() function allocs and inserts the new key that I need to insert.
    (*parent)->pointers[(*parent)->index] = (*child1);
    (*parent)->pointers[(*parent)->index+1] = (*child2);
}
I'm almost sure that I'm messing up something with the pointers, but I'm not sure what. Any help is appreciated. Maybe I need a little bit more study on the B-Tree subject? I must add that while I can use basic input/output from C++, I need to use C-style structs.

You don't need to create a new node here. You've apparently already created the two new child nodes. All you have to do here after populating the children is make the parent now point to the two children, via a copy of the first key in each of them, and adjust its key count to two. You don't need to set the parent keys to -1 either.
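As a rough illustration of the suggestion above, here is a minimal sketch of splitting a full node in place. It assumes a node layout modelled on the struct in the question, with a hypothetical `count` field (number of keys in use) in place of `index`, and the helper names `newEmptyNode` and `splitFullLeaf` are made up for this example. It follows the scheme described in the answer, where the parent keeps a copy of the first key of each child; a textbook B-tree would usually promote a median key instead.

```cpp
#include <cassert>

#define D 2
#define DD (2 * D)

// Hypothetical node layout, modelled on the struct in the question.
// `count` is an assumed field holding the number of keys in use.
struct BTreeNode {
    int keys[DD];
    BTreeNode* pointers[DD + 1];
    int count;
};

// Allocates an empty node: no keys, all child pointers null.
BTreeNode* newEmptyNode() {
    BTreeNode* n = new BTreeNode;
    n->count = 0;
    for (int i = 0; i < DD + 1; i++) n->pointers[i] = nullptr;
    return n;
}

// Splits a full, childless node in place: its keys move into two new
// children, and the node keeps a copy of the first key of each child.
void splitFullLeaf(BTreeNode* parent) {
    assert(parent->count == DD);
    BTreeNode* left = newEmptyNode();
    BTreeNode* right = newEmptyNode();
    for (int i = 0; i < D; i++) {
        left->keys[i] = parent->keys[i];       // lower half
        right->keys[i] = parent->keys[D + i];  // upper half
    }
    left->count = D;
    right->count = D;
    // No need to overwrite the old keys with -1: only the first
    // `count` entries are considered valid.
    parent->keys[0] = left->keys[0];
    parent->keys[1] = right->keys[0];
    parent->count = 2;
    parent->pointers[0] = left;
    parent->pointers[1] = right;
}
```

Because the split happens in place, any pointer that already referred to the parent stays valid, which avoids the extra level of indirection (`btree*`) used in the question.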

Related

Trie structure, lock-free inserting

I tried to implement a lock-free trie structure, but I am stuck on inserting nodes. At first I believed it was easy (my trie would not have any delete methods), but even swapping one pointer atomically can be tricky.
I want to atomically swap a pointer so that it points to a structure (TrieNode) only when it was nullptr, to be sure that I do not lose nodes that another thread could insert in between.
struct TrieNode{
int t =0;
std::shared_ptr<TrieNode> child{nullptr};
};
std::shared_ptr<TrieNode> root;
auto p = std::atomic_load(&root);
auto node = std::make_shared<TrieNode>();
node->t=1;
auto tmp = std::shared_ptr<TrieNode>{nullptr};
std::cout<<std::atomic_compare_exchange_strong( &(p->child), &tmp,node)<<std::endl;
std::cout<<node->t;
With this code I get exit code -1073741819 (0xC0000005).
EDIT: Thank you for all your comments. Maybe I did not specify my problem clearly, so I want to address it now. After around 10 hours of coding yesterday I changed a few things. Now I use ordinary pointers and for now it is working. I have not yet tested whether it is race-free with multiple threads inserting words. I plan to do that today.
const int ALPHABET_SIZE =4;
enum Alphabet {A,T,G,C,END};
class LFTrie{
private:
    struct TrieNode{
        std::atomic<TrieNode*> children[ALPHABET_SIZE+1];
    };
    std::atomic<TrieNode*> root = new TrieNode();
public:
    void Insert(std::string word){
        auto p =root.load();
        int index;
        for(int i=0; i<=word.size();i++){
            if(i==word.size())
                index = END;
            else
                index = WhatIndex(word[i]);
            auto expected = p->children[index].load();
            if(!expected){
                auto node = new TrieNode();
                if(! p->children[index].compare_exchange_strong(expected,node))
                    delete node;
            }
            p = p->children[index];
        }
    }
};
Now I believe it will work with many threads inserting different words. And yes, in this solution I discard the node if the next pointer is not null. Sorry for the trouble (I am not a native speaker).
CAS pattern should be something like:
auto expected = p->child;
while( !expected ){
    if (success at CAS(&p->child, &expected, make_null_replace() ))
        break;
}
If you don't check the return value (or expected) and verify, using the locally stored copy, that you are replacing null, you are in trouble.
On failure, you need to throw away the new node you made.
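The pattern above can be sketched with `std::atomic<T*>::compare_exchange_strong`. The `Node` type and the `getOrCreateChild` helper below are hypothetical names used only for illustration, and the sketch assumes nodes are never deleted while other threads may still use them.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-child node, used only to illustrate the pattern.
struct Node {
    int value;
    std::atomic<Node*> child;
    explicit Node(int v) : value(v), child(nullptr) {}
};

// Returns the child of `parent`, installing a new node holding `value`
// only if the slot is still null. An existing child is never overwritten.
Node* getOrCreateChild(Node* parent, int value) {
    Node* expected = parent->child.load();
    if (expected != nullptr) return expected;  // slot already filled

    Node* fresh = new Node(value);
    // Succeeds only if the slot still holds nullptr (the value stored
    // locally in `expected`). On failure, `expected` is updated to the
    // node another thread installed in the meantime.
    if (parent->child.compare_exchange_strong(expected, fresh)) {
        return fresh;
    }
    delete fresh;     // lost the race: throw away our node
    return expected;  // continue with the winner's node
}
```

Returning the pointer that ends up in the slot lets the caller keep descending without a second load, whichever thread won the race.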

Machine Learning Algorithm using recursion

I am currently working on a very basic version of the ID3 machine learning algorithm. I am stuck on how to recursively call my build_tree function to actually build the rest of the decision tree and output it in a nice format. I have calculated gains, entropies, gain ratios, etc., but I have no clue how to integrate recursion into my function.
I am given a data set which, after doing all the calculations mentioned above, I have split into two data sets. Now I need to call the function recursively until both the left and right data sets become pure (which can easily be checked by a function I wrote called dataset.is_pure()), all while keeping track of the threshold at each node. I know that all my calculations and split methods work, as I have tested them individually. It is just the recursive part that I am having trouble with.
Here is my build_tree function, which is giving me a recursion nightmare. I am working in a Linux environment with the g++ compiler. The code I have right now compiles, but when run it gives me a segmentation fault. Any and all help would be greatly appreciated!
struct node
{
    vector<vector<string>> data;
    double atrb;
    node* parent;
    node* left = NULL;
    node* right = NULL;
    node(node* parent) : parent(parent) {}
};
node* root = new node(NULL);
void build_tree(node* current, dataset data_set)
{
    vector<vector<string>> l_d;
    vector<vector<string>> r_d;
    double global_entropy = calc_entropy(data_set.get_col(data_set.n_col()-1));
    int best_col = this->get_best_col(data_set, global_entropy);
    hash_map selected_atrb(data_set.n_row(), data_set.truncate(best_col));
    double threshold = get_threshold(selected_atrb, global_entropy);
    cout << threshold << "\n";
    split_data(threshold, best_col, data_set, l_d, r_d);
    dataset right_data(r_d);
    dataset left_data(l_d);
    right_data.delete_col(best_col);
    left_data.delete_col(best_col);
    if(left_data.is_pure())
        return;
    else
    {
        node* new_left = new node(current);
        new_left->atrb = threshold;
        current->left = new_left;
        new_left->data = l_d;
        return build_tree(new_left, left_data);
    }
    if(right_data.is_pure())
        return;
    else
    {
        node* new_right = new node(current);
        new_right->atrb = threshold;
        current->right = new_right;
        new_right->data = r_d;
        return build_tree(new_right, right_data);
    }
}
id3(dataset data)
{
    build_tree(root, data);
}
};
This is only a part of my class. If you wish to see any other code, just let me know!
Regards,
I will explain with pseudocode how the recursive function works, and I will also leave you the code I wrote in JavaScript that implements this algorithm.
Before going into detail, I will mention certain concepts and classes I use.
Attribute: Characteristic of the data set, it is usually the name of a column of the data set.
Class: Decision characteristic, it is generally of binary value and usually it is always the last column of the data set.
Value: Possible value of the attribute in the data set, for example (Sunny, Cloudy, Rainy)
Tree: classes that have a number of nodes associated with each other.
Node: Entity in charge of storing the attribute (question), also has a list with the arcs.
Arc: Contains the value of an attribute and has an attribute that will contain the following child node.
Leaf : Contains a class. This node is the result of a decision, for example (Yes or No).
Best feature: Attribute with the highest information gain.
Function to create the tree from a set of data:
1. Obtain the values of the class.
2. Evaluate whether there is only one type of class in the data set, for example (Yes). If true, create a Leaf object and return it.
3. Obtain the information gain of each current attribute.
4. Choose the attribute with the highest information gain.
5. Create a node with the best feature.
6. Obtain the values of the best feature.
7. Iterate over the list of those values:
     - Filter the list so that only records with the value being iterated remain (save it in a temporary variable).
     - Create an Arc with this value.
     - Assign the following node to the Arc: (here comes the recursion) call the same function again, passing (the filtered list of records, the class, the list of attributes without the best feature, the list of general attributes without the attributes of the best feature).
     - Add the arc to the node.
8. Return the node.
This would be the segment of code that is responsible for creating the tree
let crearArbol = (ejemplosLista, clase, atributos, valores) => {
    let valoresClase = obtenerValoresAtributo(ejemplosLista, clase);
    if (valoresClase.length == 1) {
        autoIncremental++;
        return new Hoja(valoresClase[0], autoIncremental);
    }
    if (atributos.length == 0) {
        let claseDominante = claseMayoritaria(ejemplosLista);
        return new Atributo();
    }
    let gananciaAtributos = obtenerGananciaAtributos(ejemplosLista, valores, atributos);
    let atributoMaximo = atributos[maximaGanancia(gananciaAtributos)];
    autoIncremental++;
    let nodo = new Atributo(atributoMaximo, [], autoIncremental);
    let valoresLista = obtenerValoresAtributo(ejemplosLista, atributoMaximo);
    valoresLista.forEach((valor) => {
        let ejemplosFiltrados = arrayDistincAtributos(ejemplosLista, atributoMaximo, valor);
        let arco = new Arco(valor);
        arco.sigNodo = crearArbol(ejemplosFiltrados, clase, [...eliminarAtributo(atributoMaximo, atributos)], [...eliminarValores(atributoMaximo, valores)]);
        nodo.hijos.push(arco);
    });
    return nodo;
};
Unfortunately, the code is only in Spanish.
This is the repository that contains my project with this implementation: Source code of id3
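For readers working in C++, the recursive shape described above can be sketched as follows. One thing worth noting about the build_tree function in the question: each `return build_tree(new_left, left_data);` (and the `return` taken when the left side is pure) leaves the function before the right-hand data set is ever examined, so only one branch can be built. The sketch below is not ID3 itself. It assumes a hypothetical `Row` with a single numeric feature and uses the mean of that feature as a placeholder split rule where a real implementation would use information gain. All type and function names are made up for illustration.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical row: one numeric feature and a class label.
struct Row { double x; int label; };

struct TreeNode {
    bool is_leaf = false;
    int label = -1;          // meaningful only for leaves
    double threshold = 0.0;  // meaningful only for internal nodes
    std::unique_ptr<TreeNode> left, right;
};

// True when every row carries the same label.
bool isPure(const std::vector<Row>& rows) {
    for (const Row& r : rows)
        if (r.label != rows.front().label) return false;
    return true;
}

// Placeholder split rule (an assumption, not ID3's gain criterion).
double pickThreshold(const std::vector<Row>& rows) {
    double sum = 0.0;
    for (const Row& r : rows) sum += r.x;
    return sum / rows.size();
}

std::unique_ptr<TreeNode> buildTree(const std::vector<Row>& rows) {
    assert(!rows.empty());
    std::unique_ptr<TreeNode> node(new TreeNode);
    if (isPure(rows)) {  // base case: pure data set becomes a leaf
        node->is_leaf = true;
        node->label = rows.front().label;
        return node;
    }
    double t = pickThreshold(rows);
    std::vector<Row> l, r;
    for (const Row& row : rows) (row.x <= t ? l : r).push_back(row);
    if (l.empty() || r.empty()) {
        // The split made no progress; stop here to guarantee termination.
        // Labelling with the first row is a simplification.
        node->is_leaf = true;
        node->label = rows.front().label;
        return node;
    }
    node->threshold = t;
    node->left = buildTree(l);   // recurse into BOTH halves;
    node->right = buildTree(r);  // nothing returns in between
    return node;
}
```

Returning the subtree from the function, instead of attaching it through a `current` pointer, keeps each call responsible for exactly one node and makes the two recursive calls independent of each other.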

Developing dynamic branching factor trees in c++

struct avail
{
int value;
avail **child;
};
avail *n = new avail;
n->child = new avail*[25];
for (int i = 0; i < 25; i++)
n->child[i] = new avail;
This is my solution for generating dynamic trees, but I need to specify the number of children at the start (25). In later code I want this to be done dynamically, something along the lines of
push(avail(n->child[newindex]))
Or
n->child[29]=new avail;
I want to add nodes on an as-needed basis and create a proper hierarchy. I would have used stacks for this, but I want a parent-child relation between the nodes. I want to avoid using vectors so as not to complicate the code.
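One possible way to get this behaviour without vectors is to let each node track how many child slots it has allocated and grow the array when it fills up. The sketch below is only an illustration under that assumption: the `count` and `capacity` fields and the `makeNode` and `addChild` helpers are not part of the original struct.

```cpp
#include <cassert>
#include <cstddef>

struct avail {
    int value;
    avail** child;         // array of child pointers, grown on demand
    std::size_t count;     // children in use (assumed extra field)
    std::size_t capacity;  // allocated slots (assumed extra field)
};

// Creates a node with no children and no child array yet.
avail* makeNode(int value) {
    avail* n = new avail;
    n->value = value;
    n->child = nullptr;
    n->count = 0;
    n->capacity = 0;
    return n;
}

// Appends a new child to `parent`, doubling the array when it is full.
avail* addChild(avail* parent, int value) {
    if (parent->count == parent->capacity) {
        std::size_t newCap = parent->capacity == 0 ? 4 : parent->capacity * 2;
        avail** bigger = new avail*[newCap];
        for (std::size_t i = 0; i < parent->count; i++)
            bigger[i] = parent->child[i];  // keep the existing children
        delete[] parent->child;            // deleting a null pointer is a no-op
        parent->child = bigger;
        parent->capacity = newCap;
    }
    avail* c = makeNode(value);
    parent->child[parent->count++] = c;
    return c;
}
```

This is essentially the bookkeeping that `std::vector<avail*>` performs internally, so switching to a vector later would remove the manual copying without changing how the tree is used.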

BFS using adjacency lists in STL

I am trying to write a program for implementing BFS in C++ using STL. I am representing the adjacency list using nested vector where each cell in vector contains a list of nodes connected to a particular vertex.
while(myQ.size()!=0)
{
    int j=myQ.front();
    myQ.pop();
    int len=((sizeof(adjList[j]))/(sizeof(*adjList[j])));
    for (int i=0;i<len;i++)
    {
        if (arr[adjList[j][i]]==0)
        {
            myQ.push(adjList[j][i]);
            arr[adjList[j][i]]=1;
            dist(v)=dist(w)+1;
        }
    }
}
myQ is the queue I am using to keep the nodes along whose edges I will be exploring the graph. In this notation, adjList[j] represents the vector pointing to the list and adjList[j][i] represents a particular node in that list. I record whether I have explored a particular node by storing 1 in the array arr. Also, dist(v)=dist(w)+1 is not part of the code, but I want to know how to write it in correct syntax, where v is the new vertex and w is the old one that discovers v, i.e. w=myQ.front().
If I have understood your problem, then you want a data structure to store the distances of the graph nodes.
This can be easily done using map.
Use this:
typedef std::map <GraphNode*, int> NodeDist;
NodeDist node_dist;
Replace dist(v)=dist(w)+1; with:
NodeDist::iterator fi = node_dist.find (w);
if (fi == node_dist.end())
{
    // Assuming 0 distance of node w.
    node_dist[v] = 1;
}
else
{
    int w_dist = (*fi).second;
    node_dist[v] = w_dist + 1;
}
Please let me know if I have misunderstood your problem or if the given solution does not work for you. We can work on that.
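When the vertices are plain integer indices, as in the question, a vector of distances can play the same role as the map. The sketch below assumes the adjacency list is a `std::vector<std::vector<int>>` and uses -1 to mark vertices that have not been reached, which also replaces the separate `arr` array. The function name `bfsDistances` is made up for illustration. Note that the `sizeof`-based length computation in the question only works for built-in arrays; a `std::vector` reports its length through `size()`.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Returns dist, where dist[v] is the number of edges on a shortest path
// from `source` to v, or -1 if v cannot be reached.
std::vector<int> bfsDistances(const std::vector<std::vector<int> >& adjList,
                              int source) {
    std::vector<int> dist(adjList.size(), -1);  // -1 means "not explored"
    std::queue<int> myQ;
    dist[source] = 0;
    myQ.push(source);
    while (!myQ.empty()) {
        int w = myQ.front();  // the vertex doing the discovering
        myQ.pop();
        for (std::size_t i = 0; i < adjList[w].size(); i++) {
            int v = adjList[w][i];  // a neighbour of w
            if (dist[v] == -1) {
                dist[v] = dist[w] + 1;  // the dist(v)=dist(w)+1 step
                myQ.push(v);
            }
        }
    }
    return dist;
}
```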

Pointer comparison issue

I'm having a problem with a pointer and can't get around it.
In a hash table implementation, I keep an ordered list of nodes in each bucket. The problem is in the insert function, in the comparison that checks whether the next node is greater than the one being inserted (so that the new node can be inserted at that position), which is meant to keep the list ordered.
You might find this hash implementation strange, but I need to do tons of lookups (though sometimes very few) and count the number of repetitions if a value is already inserted, so I need fast lookups, hence the hash. I've thought about self-balancing trees such as AVL or red-black trees, but I don't know them, so I went with the solution I knew how to implement. Are they faster for this type of problem? I also need to retrieve the values in order when I've finished.
Before, I had a simple list and I'd retrieve the array, then do a QuickSort, but I think I might be able to improve things by keeping the lists ordered.
What I have to map is a 27-bit unsigned int (more exactly, three 9-bit numbers that I combine into a 27-bit number with (Sr << 18 | Sg << 9 | Sb)), and that value also serves as the hash_value. If you know a good function to map that 27-bit int to a 12-, 13-, or 14-bit table, let me know; I currently just use the typical mod-prime solution.
This is my hash_node struct:
class hash_node {
public:
    unsigned int hash_value;
    int repetitions;
    hash_node *next;
    hash_node( unsigned int hash_val,
               hash_node *nxt);
    ~hash_node();
};
And this is the source of the problem
void hash_table::insert(unsigned int hash_value) {
    unsigned int p = hash_value % tableSize;
    if (table[p]!=0) { //The bucket has some elements already
        hash_node *pred; //node to keep the last valid position on the list
        for (hash_node *aux=table[p]; aux!=0; aux=aux->next) {
            pred = aux; //last valid position
            if (aux->hash_value == hash_value ) {
                //It's already inserted, so we increment its repetition counter
                aux->repetitions++;
            } else if (hash_value < (aux->next->hash_value) ) { //The problem
                //If the next one is greater than the one to insert, we
                //create a node in the middle of both.
                aux->next = new hash_node(hash_value,aux->next);
                colisions++;
                numElem++;
            }
        }//We have arrived at the end of the list without luck, so we insert it after
        //the last valid position
        pred->next = new hash_node(hash_value,0);
        colisions++;
        numElem++;
    }else { //The bucket is empty, insert it right away.
        table[p] = new hash_node(hash_value, 0);
        numElem++;
    }
}
This is what gdb shows:
Program received signal SIGSEGV, Segmentation fault.
0x08050b4b in hash_table::insert (this=0x806a310, hash_value=3163181) at ht.cc:132
132 } else if (hash_value < (aux->next->hash_value) ) {
Which effectively indicates I'm comparing a memory address with a value, right?
Hope it was clear. Thanks again!
aux->next->hash_value
There's no check whether "next" is NULL.
aux->next might be NULL at that point? I can't see where you have checked whether aux->next is NULL.
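To make the missing check concrete, here is a minimal sketch of an ordered insert into a single bucket that never dereferences a null `next`. It is a standalone illustration, not the poster's class: the simplified `hash_node` below initialises `repetitions` to 1 (an assumption, since the original constructor body is not shown), and `insertSorted` is a hypothetical free function. It walks a pointer to the link being examined, so an empty bucket, the head, the middle and the tail are all handled by the same code.

```cpp
#include <cassert>

// Simplified stand-in for the hash_node class in the question.
// Starting `repetitions` at 1 is an assumption.
struct hash_node {
    unsigned int hash_value;
    int repetitions;
    hash_node* next;
    hash_node(unsigned int v, hash_node* nxt)
        : hash_value(v), repetitions(1), next(nxt) {}
};

// Inserts `hash_value` into the ascending list starting at `head`.
// Returns true if a new node was created, false if an existing node's
// repetition counter was incremented instead.
bool insertSorted(hash_node*& head, unsigned int hash_value) {
    hash_node** link = &head;  // the pointer we may need to rewrite
    // Advance while there IS a node and it is smaller than the new value.
    // The null test comes first, so nothing is read through a null pointer.
    while (*link != nullptr && (*link)->hash_value < hash_value)
        link = &(*link)->next;

    if (*link != nullptr && (*link)->hash_value == hash_value) {
        (*link)->repetitions++;  // already present
        return false;
    }
    // *link is either null (end of list) or the first larger node.
    *link = new hash_node(hash_value, *link);
    return true;
}
```

Stopping as soon as the position is found also avoids continuing the loop after an insertion, which in the original code would fall through to the append at the end of the list.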