Machine Learning Algorithm using recursion - c++

I am currently working on a very beginners version of the ID3 machine learning algorithm. I am stuck on how to recursively call my build_tree function to actually make the rest of the decision tree and output it in a nice format. I have calculated gains, entropies, gain ratios, etc. but I have no clue how to integrate recursion into my function.
I am given a data set, which after doing all the calculations mentioned above, have split it into two datasets. Now I need to be able to recursively call it until both the left and right data sets become pure [which can easily be checked by a function i wrote called dataset.is_pure()], all while keeping track of the threshold at each node. I know that all my calculations and split methods are working as I have done individuual testing on them. It is just the recursive part that I am having trouble with.
Here is my build_tree function that I am having a recursion nightmare with. I am currently working in a linux environment with the g++ compiler. The code I have right now compiles, but when run gives me a segmentation error. Any and all help would be greatly appreciated!
struct node
{
vector<vector<string>> data;
double atrb;
node* parent;
node* left = NULL;
node* right = NULL;
node(node* parent) : parent(parent) {}
};
node* root = new node(NULL);
void build_tree(node* current, dataset data_set)
{
vector<vector<string>> l_d;
vector<vector<string>> r_d;
double global_entropy = calc_entropy(data_set.get_col(data_set.n_col()-1));
int best_col = this->get_best_col(data_set, global_entropy);
hash_map selected_atrb(data_set.n_row(), data_set.truncate(best_col));
double threshold = get_threshold(selected_atrb, global_entropy);
cout << threshold << "\n";
split_data(threshold, best_col, data_set, l_d, r_d);
dataset right_data(r_d);
dataset left_data(l_d);
right_data.delete_col(best_col);
left_data.delete_col(best_col);
if(left_data.is_pure())
return;
else
{
node* new_left = new node(current);
new_left->atrb = threshold;
current->left = new_left;
new_left->data = l_d;
return build_tree(new_left, left_data);
}
if(right_data.is_pure())
return;
else
{
node* new_right = new node(current);
new_right->atrb = threshold;
current->right = new_right;
new_right->data = r_d;
return build_tree(new_right, right_data);
}
}
id3(dataset data)
{
build_tree(root, data);
}
};
This is only a part of my class. If you wish to see any other code, just let me know!

Regards,
I will explain to you with pseudocodigo how the reclusive function works, I will also leave you the code that you make in javascript for the implementation of said algorithm.
Before going into detail, I will mention certain concepts and classes you use.
Attribute: Characteristic of the data set, it is usually the name of a column of the data set.
Class: Decision characteristic, it is generally of binary value and usually it is always the last column of the data set.
Value: Possible value of the attribute in the data set, for example (Sunny, Cloudy, Rainy)
Tree: classes that have a number of nodes associated with each other.
Node: Entity in charge of storing the attribute (question), also has a list with the arcs.
Arc: Contains the value of an attribute and has an attribute that will contain the following child node.
Leaf : Contains a class. This node is the result of a decision, for example (Yes or No).
Best feature: Attribute with the highest information gain.
Function to create the tree from a set of data:
Obtain the values ​​of a class.
Evaluate if there is only one type of class in the data set, for example (Yes).   
If true, then we create a Leaf object and return this object
Obtain the information gain of each current attribute.
Choose the attribute with the highest information gain.
Create a node with the best feature.
Obtain the values ​​of the best feature.
Iterate the list of those values.
Filter the list, so that there are only records with the value that we are iterating (save it in a variable temporary)
Create an Arc with this value.
     - Assign the following attribute to the Arc: (Here comes the recursion) call again the same only function that you send (the filtered list of records, the class, the list of attributes without the best feature, the list of general attributes without the attributes of the best feature)
Add the arc to the node.
Return the node.
This would be the segment of code that is responsible for creating the tree
let crearArbol = (ejemplosLista, clase, atributos, valores) => {
let valoresClase = obtenerValoresAtributo(ejemplosLista, clase);
if (valoresClase.length == 1) {
autoIncremental++;
return new Hoja(valoresClase[0], autoIncremental);
}
if (atributos.length == 0) {
let claseDominante = claseMayoritaria(ejemplosLista);
return new Atributo();
}
let gananciaAtributos = obtenerGananciaAtributos(ejemplosLista, valores, atributos);
let atributoMaximo = atributos[maximaGanancia(gananciaAtributos)];
autoIncremental++;
let nodo = new Atributo(atributoMaximo, [], autoIncremental);
let valoresLista = obtenerValoresAtributo(ejemplosLista, atributoMaximo);
valoresLista.forEach((valor) => {
let ejemplosFiltrados = arrayDistincAtributos(ejemplosLista, atributoMaximo, valor);
let arco = new Arco(valor);
arco.sigNodo = crearArbol(ejemplosFiltrados, clase, [...eliminarAtributo(atributoMaximo, atributos)], [...eliminarValores(atributoMaximo, valores)]);
nodo.hijos.push(arco);
});
return nodo;
};
Unfortunately, the code is only in Spanish.
This is the repository that contains my project with this implementation Source code of id3

Related

B-Tree Node Splitting Techniques

I've stumbled upon a problem whilst doing my DSA (Data Structures and Algorithms) homework. I'm said to implement a B-Tree with Insertion and Search algorithms. As far as it goes, the search is working correctly, but I'm having trouble implementing the insertion function. Specifically the logic behind the B-Tree node-splitting algorithm. A pseudocode/C-style I could come up with is the following:
#define D 2
#define DD 2*D
typedef btreenode* btree;
typedef struct node
{
int keys[DD]; //D == 2 and DD == 2*D;
btree pointers[DD+1];
int index; //used to iterate throught the "keys" array
}btreenode;
void splitNode(btree* parent, btree* child1, btree* child2)
{
//Copies the content from the splitted node to the children
(*child1)->key[0] = (*parent)->key[0];
(*child1)->key[1] = (*parent)->key[1];
(*child2)->key[0] = (*parent)->key[2];
(*child2)->key[1] = (*parent)->key[3];
(*child1)->index = 1;
(*child2)->index = 1;
//"Clears" the parent node from any data
for(int i = 0; i<DD; i++) (*parent)->key[i] = -1;
for(int i = 0; i<DD+1; i++) (*parent)->pointers[i] = NULL
//Fixed the pointers to the children
(*parent)->index = 0;
//the line bellow was taken out for creating a new node that didn't have to be there.
//(*parent)->key[(*parent)->index] = newNode(); // The newNode() function allocs and inserts a the new key that I need to insert.
(*parent)->pointers[index] = (*child1);
(*parent)->pointers[index+1] = (*child2);
}
I'm almost sure that I'm messing up something with the pointers, but I'm not sure what. Any help is appreciated. Maybe I need a little bit more study on the B-Tree subject? I must add that while I can use basic input/output from C++, I need to use C-style structs.
You don't need to create a new node here. You've apparently already created the two new child nodes. All you have to do here after populating the children is make the parent now point to the two children, via a copy of the first key in each of them, and adjust its key count to two. You don't need to set the parent keys to -1 either.

priority_queue becomes extremely slow in debug mode

I am currently writing an A* pathfinding algorithm for a game and came across a very strange performance problem regarding priority_queue's.
I am using a typical 'open nodes list', where I store found, but yet unprocessed nodes. This is implemented as an STL priority_queue (openList) of pointers to PathNodeRecord objects, which store information about a visited node. They are sorted by the estimated cost to get there (estimatedTotalCost).
Now I noticed that whenever the pathfinding method is called, the respective AI thread gets completely stuck and takes several (~5) seconds to process the algorithm and calculate the path. Subsequently I used the VS2013 profiler to see, why and where it was taking so long.
As it turns out, the pushing to and popping from the open list (the priority_queue) takes up a very large amount of time. I am no expert in STL containers, but I never had problems with their efficiency before and this is just weird to me.
The strange thing is that this only occurs while using VS's 'Debug' build configuration. The 'Release' conf. works fine for me and the times are back to normal.
Am I doing something fundamentally wrong here or why is the priority_queue performing so badly for me? The current situation is unacceptable to me, so if I cannot resolve it soon, I will need to fall back to using a simpler container and inserting it to the right place manually.
Any pointers to why this might be occuring would be very helpful!
.
Here is a snippet of what the profiler shows me:
http://i.stack.imgur.com/gEyD3.jpg
.
Code parts:
Here is the relevant part of the pathfinding algorithm, where it loops the open list until there are no open nodes:
// set up arrays and other variables
PathNodeRecord** records = new PathNodeRecord*[graph->getNodeAmount()]; // holds records for all nodes
std::priority_queue<PathNodeRecord*> openList; // holds records of open nodes, sorted by estimated rest cost (most promising node first)
// null all record pointers
memset(records, NULL, sizeof(PathNodeRecord*) * graph->getNodeAmount());
// set up record for start node and put into open list
PathNodeRecord* startNodeRecord = new PathNodeRecord();
startNodeRecord->node = startNode;
startNodeRecord->connection = NULL;
startNodeRecord->closed = false;
startNodeRecord->costToHere = 0.f;
startNodeRecord->estimatedTotalCost = heuristic->estimate(startNode, goalNode);
records[startNode] = startNodeRecord;
openList.push(startNodeRecord);
// ### pathfind algorithm ###
// declare current node variable
PathNodeRecord* currentNode = NULL;
// loop-process open nodes
while (openList.size() > 0) // while there are open nodes to process
{
// retrieve most promising node and immediately remove from open list
currentNode = openList.top();
openList.pop(); // ### THIS IS, WHERE IT GETS STUCK
// if current node is the goal node, end the search here
if (currentNode->node == goalNode)
break;
// look at connections outgoing from this node
for (auto connection : graph->getConnections(currentNode->node))
{
// get end node
PathNodeRecord* toNodeRecord = records[connection->toNode];
if (toNodeRecord == NULL) // UNVISITED -> path record needs to be created and put into open list
{
// set up path node record
toNodeRecord = new PathNodeRecord();
toNodeRecord->node = connection->toNode;
toNodeRecord->connection = connection;
toNodeRecord->closed = false;
toNodeRecord->costToHere = currentNode->costToHere + connection->cost;
toNodeRecord->estimatedTotalCost = toNodeRecord->costToHere + heuristic->estimate(connection->toNode, goalNode);
// store in record array
records[connection->toNode] = toNodeRecord;
// put into open list for future processing
openList.push(toNodeRecord);
}
else if (!toNodeRecord->closed) // OPEN -> evaluate new cost to here and, if better, update open list entry; otherwise skip
{
float newCostToHere = currentNode->costToHere + connection->cost;
if (newCostToHere < toNodeRecord->costToHere)
{
// update record
toNodeRecord->connection = connection;
toNodeRecord->estimatedTotalCost = newCostToHere + (toNodeRecord->estimatedTotalCost - toNodeRecord->costToHere);
toNodeRecord->costToHere = newCostToHere;
}
}
else // CLOSED -> evaluate new cost to here and, if better, put back on open list and reset closed status; otherwise skip
{
float newCostToHere = currentNode->costToHere + connection->cost;
if (newCostToHere < toNodeRecord->costToHere)
{
// update record
toNodeRecord->connection = connection;
toNodeRecord->estimatedTotalCost = newCostToHere + (toNodeRecord->estimatedTotalCost - toNodeRecord->costToHere);
toNodeRecord->costToHere = newCostToHere;
// reset node to open and push into open list
toNodeRecord->closed = false;
openList.push(toNodeRecord); // ### THIS IS, WHERE IT GETS STUCK
}
}
}
// set node to closed
currentNode->closed = true;
}
Here is my PathNodeRecord with the 'less' operator overloading to enable sorting in priority_queue:
namespace AI
{
struct PathNodeRecord
{
Node node;
NodeConnection* connection;
float costToHere;
float estimatedTotalCost;
bool closed;
// overload less operator comparing estimated total cost; used by priority queue
// nodes with a higher estimated total cost are considered "less"
bool operator < (const PathNodeRecord &otherRecord)
{
return this->estimatedTotalCost > otherRecord.estimatedTotalCost;
}
};
}
std::priority_queue<PathNodeRecord*> openList
I think the reason is that you have a priority_queue of pointers to PathNodeRecord.
and there is no ordering defined for the pointers.
try changing it to std::priority_queue<PathNodeRecord> first, if it makes a difference then all you need is passing on your own comparator that knows how to compare pointers to PathNodeRecord, it will just dereference the pointers first and then do the comparison.
EDIT:
taking a wild guess about why did you get an extremely slow execution time, I think the pointers were compared based on their address. and the addresses were allocated starting from one point in memory and going up.
and so that resulted in the extreme case of your heap (the heap as in data structure not the memory part), so your heap was actually a list, (a tree where each node had one children node and so on).
and so you operation took a linear time, again just a guess.
You cannot expect a debug build to be as fast as a release optimized one, but you seems to do a lot of dynamic allocation that may interact badly with the debug runtime.
I suggest you to add _NO_DEBUG_HEAP=1 in the environment setting of the debug property page of your project.

Store link list node address

Im currently writing a descending sort function for a doubly link list. I have a flag for the largest value but wondering if there is a way to store the address of a node pointer so i can set its flag outside the loop when operations are done.
Thanks
In this case, our data is relevance
float findLargest(DoublyLinkList largestdata)
{
ListPlayHolder *findbiggest = largestdata.lhead;
float largest = findbiggest ->relevance;
while (findbiggest ->next != NULL)
{
if (findbiggest ->relevance > largest && findbiggest ->largestFlag != true)
{
largest = findbiggest ->relevance;
}
findbiggest = findbiggest->next;
}
return largest;
}
This is no fancy sort, just trying to make a simplistic descending sort of my data. Once i find the largest, i want to set its nodes flag to true. Just need a way to store the address.
If you used std::list you could use std::sort with different comparison functions and not have to change the node structure for every different ordering sequence.
Another idea is to place your items into a std::vector and create std::list<item *> for each item. This would allow you to access the items in the vector in various orders. For example, one list could be for ascending by title. Another could be descending by relevance.
As I understand it, you only need to keep a pointer to the largest element, so nothing fancy, just another ListPlayHolder*(which seems like the data type of a pointer to a node - a quite confusing name if you ask me, but whatever).
Also, I would recommend not to initalize largest with something pointed to by findbiggest - haven't seen your other list code, but I guess that pointer might be NULL if the list is empty.
You actually don't need to store the relevance value separately if you instead hold on to a pointer of the currently largest object (thanks #WhozCraig). Here's the modified code:
ListPlayHolder* findbiggest = largestdata.lhead;
ListPlayHolder* largest = findbiggest;
while (findbiggest && findbiggest ->next != NULL)
{
if (findbiggest ->relevance > largest->relevance && findbiggest->largestFlag != true)
{
largest = findbiggest;
}
findbiggest = findbiggest->next;
}
// here do whatever modifications you need to do to flags? or maybe return largestPtr?
largest->largestFlag = true;
return largest->relevance;

Low Memory Shortest Path Algorithm

I have a global unique path table which can be thought of as a directed un-weighted graph. Each node represents either a piece of physical hardware which is being controlled, or a unique location in the system. The table contains the following for each node:
A unique path ID (int)
Type of component (char - 'A' or 'L')
String which contains a comma separated list of path ID's which that node is connected to (char[])
I need to create a function which given a starting and ending node, finds the shortest path between the two nodes. Normally this is a pretty simple problem, but here is the issue I am having. I have a very limited amount of memory/resources, so I cannot use any dynamic memory allocation (ie a queue/linked list). It would also be nice if it wasn't recursive (but it wouldn't be too big of an issue if it was as the table/graph itself if really small. Currently it has 26 nodes, 8 of which will never be hit. At worst case there would be about 40 nodes total).
I started putting something together, but it doesn't always find the shortest path. The pseudo code is below:
bool shortestPath(int start, int end)
if start == end
if pathTable[start].nodeType == 'A'
Turn on part
end if
return true
else
mark the current node
bool val
for each node in connectedNodes
if node is not marked
val = shortestPath(node.PathID, end)
end if
end for
if val == true
if pathTable[start].nodeType == 'A'
turn on part
end if
return true
end if
end if
return false
end function
Anyone have any ideas how to either fix this code, or know something else that I could use to make it work?
----------------- EDIT -----------------
Taking Aasmund's advice, I looked into implementing a Breadth First Search. Below I have some c# code which I quickly threw together using some pseudo code I found online.
pseudo code found online:
Input: A graph G and a root v of G
procedure BFS(G,v):
create a queue Q
enqueue v onto Q
mark v
while Q is not empty:
t ← Q.dequeue()
if t is what we are looking for:
return t
for all edges e in G.adjacentEdges(t) do
u ← G.adjacentVertex(t,e)
if u is not marked:
mark u
enqueue u onto Q
return none
C# code which I wrote using this code:
public static bool newCheckPath(int source, int dest)
{
Queue<PathRecord> Q = new Queue<PathRecord>();
Q.Enqueue(pathTable[source]);
pathTable[source].markVisited();
while (Q.Count != 0)
{
PathRecord t = Q.Dequeue();
if (t.pathID == pathTable[dest].pathID)
{
return true;
}
else
{
string connectedPaths = pathTable[t.pathID].connectedPathID;
for (int x = 0; x < connectedPaths.Length && connectedPaths != "00"; x = x + 3)
{
int nextNode = Convert.ToInt32(connectedPaths.Substring(x, 2));
PathRecord u = pathTable[nextNode];
if (!u.wasVisited())
{
u.markVisited();
Q.Enqueue(u);
}
}
}
}
return false;
}
This code runs just fine, however, it only tells me if a path exists. That doesn't really work for me. Ideally what I would like to do is in the block "if (t.pathID == pathTable[dest].pathID)" I would like to have either a list or a way to see what nodes I had to pass through to get from the source and destination, such that I can process those nodes there, rather than returning a list to process elsewhere. Any ideas on how i could make that change?
The most effective solution, if you're willing to use static memory allocation (or automatic, as I seem to recall that the C++ term is), is to declare a fixed-size int array (of size 41, if you're absolutely certain that the number of nodes will never exceed 40). By using two indices to indicate the start and end of the queue, you can use this array as a ring buffer, which can act as the queue in a breadth-first search.
Alternatively: Since the number of nodes is so small, Bellman-Ford may be fast enough. The algorithm is simple to implement, does not use recursion, and the required extra memory is only a distance (int, or even byte in your case) and a predecessor id (int) per node. The running time is O(VE), alternatively O(V^3), where V is the number of nodes and E is the number of edges.

For Looping Link List using Templates

Having used the various search engines (and the wonderful stackoverflow database), I have found some similar situations, but they are either far more complex, or not nearly as complex as what I'm trying to accomplish.
C++ List Looping
Link Error Using Templates
C++:Linked List Ordering
Pointer Address Does Not Change In A Link List
I'm trying to work with Link List and Node templates to store and print non-standard class objects (in this case, a collection of categorized contacts). Particularly, I want to print multiple objects that have the same category, out of a bunch of objects with different categories. When printing by category, I compare an sub-object tmpCategory (= "business") with the category part of a categorized contact.
But how to extract this data for comparison in int main()?
Here's what I'm thinking. I create a GetItem member function in LinkList.tem This would initialize the pointer cursor and then run a For loop until the function input matches the iteration number. At which point, GetItem returns object Type using (cursor -> data).
template <class Type>
Type LinkList<Type>::GetItem(int itemNumber) const
{
Node<Type>* cursor = NULL;
for(cursor = first;
cursor != NULL;
cursor = (cursor -> next))
{
for(int i = 0; i < used; i++)
{
if(itemNumber == i)
{
return(cursor -> data);
}
}
}
}
Here's where int main() comes in. I set my comparison object tmpCategory to a certain value (in this case, "Business"). Then, I run a For loop that iterates for cycles equal to the number of Nodes I have (as determined by a function GetUsed()). Inside that loop, I call GetItem, using the current iteration number. Theoretically, this would let the int main loop return the corresponding Node from LinkList.tem. From there, I call the category from the object inside that Node's data (which currently works), which would be compared with tmpCategory. If there's a match, the loop will print out the entire Node's data object.
tmpCategory = "Business";
for(int i = 0; i < myCategorizedContact.GetUsed(); i++)
{
if(myCategorizedContact.GetItem(i).getCategory() == tmpCategory)
cout << myCategorizedContact.GetItem(i);
}
The problem is that the currently setup (while it does run), it returns nothing at all. Upon further testing ( cout << myCategorizedContact.GetItem(i).getCategory() ), I found that it's just printing out the category of the first Node over and over again. I want the overall scheme to evaluate for every Node and print out matching data, not just spit out the same Node.
Any ideas/suggestions are greatly appreciated.
Please look at this very carefully:
template <class Type>
Type LinkList<Type>::GetItem(int itemNumber) const
{
Node<Type>* cursor = NULL;
// loop over all items in the linked list
for(cursor = first;
cursor != NULL;
cursor = (cursor -> next))
{
// for each item in the linked list, run a for-loop
// counter from 0 to (used-1).
for(int i = 0; i < used; i++)
{
// if the passed in itemNumber matches 'i' anytime
// before we reach the end of the for-loop, return
// whatever the current cursor is.
if(itemNumber == i)
{
return(cursor -> data);
}
}
}
}
You're not walking the cursor down the list itemNumber times. The very first item cursor references will kick off the inner-for-loop. The moment that loop index reaches itemNumber you return. You never advance your cursor if the linked list has at least itemNumber items in the list.. In fact, the two of them (cursor and itemNumber) are entirely unrelated in your implementation of this function. And to really add irony, since used and cursor are entirely unrelated, if used is ever less than itemNumber, it will ALWAYS be so, since used doesn't change when cursor advances through the outer loop. Thus cursor eventually becomes NULL and the results of this function are undefined (no return value). In summary, as written you will always either return the first item (if itemNumber < used), or undefined behavior since you have no return value.
I believe you need something like the following instead:
template< class Type >
Type LinkList<Type>::GetItem(int itemNumber) const
{
const Node<Type>* cursor = first;
while (cursor && itemNumber-- > 0)
cursor = cursor->next;
if (cursor)
return cursor->data;
// note: this is here because you're definition is to return
// an item COPY. This case would be better off returning a
// `const Type*` instead, which would at least allow you to
// communicate to the caller that there was no item at the
// proposed index (because the list is undersized) and return
// NULL, which the caller could check.
return Type();
}