Poor tree add performance - c++

I am writing a tree container at the moment (just for understanding and training), and so far I have a first, very basic approach to adding elements to the tree.
This is my tree code so far. No destructor, no cleanup and no element access yet.
template <class T> class set
{
public:
struct Node
{
Node(const T& val)
: left(0), right(0), value(val)
{}
Node* left;
Node* right;
T value;
};
set()
: m_Root(nullptr) // start with an empty tree
{}
void add(const T& value)
{
if (m_Root == nullptr)
{
m_Root = new Node(value);
return; // the root now holds the value; do not insert it a second time
}
Node* next = nullptr;
Node* current = m_Root;
do
{
if (next != nullptr)
{
current = next;
}
next = value >= current->value ? current->left : current->right;
} while (next != nullptr);
value >= current->value ? current->left = new Node(value) : current->right = new Node(value);
}
private:
Node* m_Root;
};
Well, I have now tested the add performance against the insert performance of a std::set with unique and balanced (low and high) values and came to the conclusion that the performance is simply awful.
Is there a reason why the set inserts values that much faster and what would be a decent way of improving the insert performance of my approach? (I know that there might be better tree models, but as far as I know, the insert performance should be close together between most tree models).
On an i5 4570 at stock clock,
the std::set needs 0.013s to add 1000000 int16 values.
My set needs 4.5s to add the same values.
Where does this big difference come from?
Update:
All right, here is my test code:
int main()
{
int n = 1000000;
test::set<test::int16> mset; //my set
std::set<test::int16> sset; //std set
std::timer timer; //simple wrapper for clock()
test::random_engine engine(0, 500000); //simple wrapper for rand() and yes, it's seeded, and yes I am aware that an int16 will overflow
std::set<test::int16> values; //Set of values to ensure unique values
bool flip = false;
for (int i = 0; n > i; ++i)
{
values.insert(flip ? engine.generate() : 0 - engine.generate());
flip = !flip; //ensure that we get high and low values and no straight line, but at least 2 paths
}
timer.start();
for (std::set<test::int16>::iterator it = values.begin(); values.end() != it; ++it)
{
mset.add(*it);
}
timer.stop();
std::cout << timer.totalTime() << "s for mset\n";
timer.reset();
timer.start();
for (std::set<test::int16>::iterator it = values.begin(); values.end() != it; ++it)
{
sset.insert(*it);
}
timer.stop();
std::cout << timer.totalTime() << "s for std\n";
}
The set won't store every value due to duplicates, but both containers will get a large number of values, the same values in the same order, to ensure representative results. I know the test could be more accurate, but it should give some comparable numbers.

std::set implementations usually use a red-black tree. It is a self-balancing binary search tree, and its insert operation is guaranteed to have O(log(n)) time complexity in the worst case (that is required by the standard). You use a simple binary search tree with an O(n) worst-case insert operation.
If you insert unique random values, such a big difference looks suspicious. But don't forget that randomness will not make your tree balanced, and the height of the tree could be much bigger than log(n).
Edit
It seems I found the main problem with your code. You store all generated values in a std::set first. After that, you add them to both sets in increasing order. That degrades your set to a linked list.

The two obvious differences are:
the red-black tree (probably) used in std::set rebalances itself to put an upper bound on worst-case behaviour, exactly as DAle says.
If this is the problem, you should see it when plotting N (number of nodes inserted) against time-per-insert. You could also keep track of tree depth (at least for debugging purposes), and plot that against N.
the standard containers use an allocator which probably does something smarter than newing each node individually. You could try using std::allocator in your own container to see if that makes a significant improvement.
Edit 1 if you implemented a pool allocator, that's relevant information that should have been in the question.
Edit 2 now that you've added your test code, there's an obvious problem which means your set will always have the worst-case performance for insertion. You pre-sorted your input values! std::set is an ordered container, so putting your values in there first guarantees you always insert in increasing value order, so your tree (which does not self-balance) degenerates to an expensive linked list, and your inserts are always linear rather than logarithmic time.
You can verify this by storing your values in a vector instead (just using the set to detect collisions), or using an unordered_set to deduplicate without pre-sorting.
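To make the effect of pre-sorted input concrete, here is a minimal sketch that measures how deep a non-balancing tree gets for sorted versus shuffled input. The names DepthNode, naive_bst_depth, sorted_input and shuffled_input are made up for this illustration and are not part of the question's code; the tree also orders smaller values to the left, which does not change the depth argument.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <numeric>
#include <random>
#include <vector>

// Minimal non-balancing BST node, used only to measure depth (hypothetical helper).
struct DepthNode {
    int value;
    std::unique_ptr<DepthNode> left, right;
    explicit DepthNode(int v) : value(v) {}
};

// Inserts every value iteratively and returns the deepest level reached
// (a single node counts as depth 1). Duplicates are ignored.
// Keep the input small: the chain of unique_ptr is destroyed recursively.
std::size_t naive_bst_depth(const std::vector<int>& values) {
    std::unique_ptr<DepthNode> root;
    std::size_t max_depth = 0;
    for (int v : values) {
        std::unique_ptr<DepthNode>* slot = &root;
        std::size_t depth = 1;
        while (*slot) {
            if (v == (*slot)->value) break;  // already present
            slot = (v < (*slot)->value) ? &(*slot)->left : &(*slot)->right;
            ++depth;
        }
        if (!*slot) {
            *slot = std::make_unique<DepthNode>(v);
            max_depth = std::max(max_depth, depth);
        }
    }
    return max_depth;
}

// The values 1..n in increasing order, like iterating over a std::set.
std::vector<int> sorted_input(int n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 1);
    return v;
}

// The same values in a pseudo-random order (fixed seed for repeatability).
std::vector<int> shuffled_input(int n, unsigned seed) {
    std::vector<int> v = sorted_input(n);
    std::mt19937 rng(seed);
    std::shuffle(v.begin(), v.end(), rng);
    return v;
}
```

With sorted input the depth equals the number of elements, so every insert walks the whole chain; with shuffled input the depth should stay far smaller, which is the difference between linear and roughly logarithmic inserts.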

Related

How to create the insert function for a binary search tree built with a vector?

I am trying to build a binary search tree; however, it is vital for the algorithm I am implementing that it is stored in a vector, to reduce cache misses. My original idea was to adapt something similar to the heap insertion technique, since the data placement is the same and, once you add an item, you need to bubble it up the branch to make sure the properties of each data structure are respected (thus the O(log n) complexity).
However, adapting the insert function has proven trickier than anticipated.
This is the original working code for the binary heap:
template <typename DataType>
void BinHeap<DataType>:: Insert(const DataType& value)
{
data.push_back(value);
if(data.size() > 1)
{
BubbleUp(data.size() -1);
}
}
template <typename DataType>
void BinHeap<DataType>::BubbleUp(unsigned pos)
{
if(pos == 0) return; // the root has no parent
int parentPos = Parent(pos);
if(data[parentPos] < data[pos]) // parentPos may be 0: children of the root must be compared too
{
std::swap(data[parentPos], data[pos]);
BubbleUp(parentPos);
}
}
And here is my attempt to adapt it into a vector based Binary Search Tree (please do not mind the odd naming of the class, as this is still not the final version):
template <typename DataType>
void BinHeap<DataType>:: Insert(const DataType& value)
{
data.push_back(value);
if(data.size() > 1)
{
BubbleUp(data.size() -1);
}
}
template <typename DataType>
void BinHeap<DataType>::BubbleUp(unsigned pos)
{
int parentPos = Parent(pos);
bool isLeftSon = LeftSon(parentPos) == pos;
if(parentPos >= 0)
{
if(isLeftSon && ( data[parentPos] < data[pos] ) )
{
std::swap(data[parentPos] , data[pos]);
}
else if (data[parentPos] > data[pos])// RightSon
{
std::swap(data[parentPos] , data[pos]);
}
BubbleUp(parentPos-1);
BubbleDown(parentPos-1);
}
}
template <typename DataType>
void BinHeap<DataType>::BubbleDown(unsigned pos)
{
int leftChild = LeftSon(pos);
int rightChild = RightSon(pos);
bool leftExists = leftChild < data.size() && leftChild > 0;
bool rightExists = rightChild < data.size() && rightChild > 0;
// No children
if(!leftExists && !rightExists)
{
return;
}
if(leftExists && data[pos] < data[leftChild])
{
std::swap(data[leftChild] , data[pos]);
}
else if (rightExists && data[pos] > data[rightChild])
{
std::swap(data[rightChild] , data[pos]);
}
}
This approach is able to guarantee that the properties of the BST are respected locally, but not across siblings or ancestors (grandparents, etc.). For example, if every number from 1 to 16 is inserted in order, 12 will have a left child of 6 and a right child of 14. However, its parent 16 will have a left child of 8 and a right child of 12 (thus 6 is in the right subtree of 16). I feel my current approach is overcomplicating the process, but I am not sure how to rearrange it to make the necessary changes in an efficient manner. Any insight would be greatly appreciated.
The realistic answer to the question title (which is at the time I composed this answer "How to create the insert function for a binary search tree built with a vector?") is: Don't do that!
It is clear from your code that you are trying to preserve the compact storage and self-balancing properties of a heap while also wishing it to be searchable via classic left/right child tree navigation. But, the heap trick of using (index-1)/2 to locate the parent node only works for a "perfectly balanced" tree. That is, the N element array is perfectly packed from 0 to N-1. And then, you expect an in-order walk of this tree to be sorted (if you didn't, then your binary left/right search navigation would not be able to find the right node).
Thus, you are maintaining a sorted set of elements in your array. Except, you have some strange rules for how to navigate the array to get the sorted order.
There is no way that your scheme can maintain a binary sorted array any simpler than a scheme that maintains a plain sorted array. The node manipulations only lead to a complicated piece of software that is difficult to understand, to maintain, and reason about correctly. A sorted array, on the other hand, is easy to understand and maintain, and is easy to see how it leads to a correct result. The binary search (or optionally, dictionary search) is fast.
While maintaining a sorted array requires a linear insertion logic, your scheme must be at least as complex, because it is also maintaining a sorted set of elements in the array.
If you want a data structure that is hardware data cache friendly, and provides logarithmic insertion and search, use a B+-tree. It is a little more complex than your average data structure, but this is a case where the complexity can be worth it. Especially if regular trees just cause too much data cache thrash. As a bit of advice, optimal performance usually results if an interior node (with keys) is sized to fit within a cache line or two.
I really don't understand the Big Picture, or overall view, of what you are trying to accomplish. There are many existing functions, and libraries that perform the functionality that I think you want.
Efficient Data Search
Since you are using a vector, placing a B-Tree into a vector seems moot. The general situation is that you maintain a sorted vector and perform a binary_search, upper_bound, or lower_bound on the array. Provided that your program is performing more searches than inserts, this should be faster than traversing a binary tree inside an array.
The maintenance is much easier using an array of sorted values than performing maintenance on a Balanced Binary Tree. The maintenance consists of appending a value, then sorting.
The sorted vector technique also uses less memory. There is no need for child links so you save 2 indices for every slot in the vector.
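A minimal sketch of the sorted-vector idea, assuming int elements and no duplicates. The helper names sorted_insert and sorted_contains are invented for this example. Instead of appending and re-sorting, it places each value directly at its sorted position with std::lower_bound, which is a small variation on the maintenance described above.

```cpp
#include <algorithm>
#include <vector>

// Inserts value at its sorted position; returns false if it was already present.
// Finding the position is O(log n), shifting the tail of the vector is O(n).
bool sorted_insert(std::vector<int>& data, int value) {
    auto it = std::lower_bound(data.begin(), data.end(), value);
    if (it != data.end() && *it == value) return false;
    data.insert(it, value);
    return true;
}

// Binary search over the sorted vector: O(log n), no child links needed.
bool sorted_contains(const std::vector<int>& data, int value) {
    return std::binary_search(data.begin(), data.end(), value);
}
```

The vector stays sorted after every call, so lookups never need a separate sort step.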
Using A Binary Tree
There are many examples on the 'net for using Binary Trees. Looks like you want to maintain a balanced tree, so you should search the web for "c++ example balanced binary tree array". If the examples use pointers, replace the pointers by an index.
Trees are complex data structures and have maintenance overhead. I'm not sure if balancing a tree in a vector is slower than sorting a vector; but it is usually more work and more complex.
Usage Criteria
With modern computers executing instructions in the nanosecond range, differences in search performance only become noticeable with huge amounts of data. For small amounts of data, a linear search may even be faster than a binary search, due to the overhead costs of a binary search.
The same applies to a binary tree versus a sorted array. The overhead of node processing may be more than the overhead of sorting a vector, and that overhead only becomes negligible for large amounts of data.
Development time is crucial. The time spent developing a specialized binary tree in a vector is definitely more than using an std::vector, std::sort, and std::lower_bound. These items are already implemented and tested. So while you are developing this specialized algorithm and data structure, another person using a sorted vector could be finished and onto another project by the time you finish your development.

Don't understand this code principle

// set::insert (C++98)
#include <iostream>
#include <set>
int main ()
{
std::set<int> myset;
std::set<int>::iterator it;
std::pair<std::set<int>::iterator,bool> ret;
// set some initial values:
for (int i=1; i<=5; ++i) myset.insert(i*10); // set: 10 20 30 40 50
ret = myset.insert(20); // no new element inserted
if (ret.second==false) it=ret.first; // "it" now points to element 20
myset.insert (it,25); // max efficiency inserting
myset.insert (it,24); // max efficiency inserting
myset.insert (it,26); // no max efficiency inserting
int myints[]= {5,10,15}; // 10 already in set, not inserted
myset.insert (myints,myints+3);
std::cout << "myset contains:";
for (it=myset.begin(); it!=myset.end(); ++it)
std::cout << ' ' << *it;
std::cout << '\n';
return 0;
}
I saw this code as an example on the cplusplus reference site. It says that
myset.insert (it,25); // max efficiency inserting
myset.insert (it,24); // max efficiency inserting
are max efficiency inserts, but I don't get it.
Can anybody tell me why?
std::set uses a balanced tree structure. When you call insert, you are allowed to provide a hint to the implementation - which it can use to speed up insertion.
Think of how general insert methods into a regular binary search tree work. You start at the root node, and you must progress down using the usual checks:
void insert(node*& current, const T& value)
{
if(current == nullptr) current = new node(value); // Construct our new node here
else if(value < current->value) insert(current->left, value);
else if(value > current->value) insert(current->right, value);
// an equal value is already in the tree, so nothing is inserted
}
void insert(const T& value)
{
insert(root, value);
}
In a balanced tree, this must perform (on average) O(log n) comparisons to insert a given value.
However, suppose that, instead of starting at the root node, we give the implementation a starting node that is where the actual insert will happen. For example, in the above, we know that 24 and 25 will become children of the node containing 20. Hence, if we start at that node, we don't need to do our O(log n) comparisons - we can simply insert our nodes straight away. This is what is meant by "maximum efficiency" insertion.
Have you read the notes?
position
Hint for the position where the element can be inserted.
c++98
The function optimizes its insertion time if position points to the
element that will precede the inserted element.
it points to 20 and precedes 25.
In general, std::set has to look up where it inserts; this
look up is O(lg n). If you provide a "hint" (the additional
iterator) where the insertion should take place, the
implementation will first check if this hint is correct (which
can be done in O(1)), and if it is, insert there, thus skipping
the O(lg n) look up. If the hint isn't correct, of course, it
then reverts to the insertion as if it hadn't gotten the hint.
There are two cases where you regularly use the hint: The first
is when you are inserting a sequence of already sorted data. In
this case, the hint is the end iterator, since if the data is
already sorted, each new value will in fact be inserted at the
end. The second is when copying into the set using an insertion
iterator. In this case, the "hint" (which can be anything)
is only there for syntactical reasons: you can't use
a std::back_insert_iterator or a std::front_insert_iterator,
because std::set doesn't have push_back or push_front, and
the simple std::insert_iterator requires an iterator to tell it
where to insert; this iterator will be passed to insert.
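The two regular uses of the hint can be sketched as follows. The function names build_from_sorted and copy_with_inserter are invented for this example; the calls themselves (the hinted insert overload and std::inserter) are standard library.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Case 1: the input is already sorted, so every new element belongs at the end.
// Passing end() as the hint lets the set skip the O(lg n) lookup when the hint is right.
std::set<int> build_from_sorted(const std::vector<int>& sorted) {
    std::set<int> s;
    for (int v : sorted) {
        s.insert(s.end(), v);
    }
    return s;
}

// Case 2: copying through an insert iterator. std::inserter needs some iterator
// to construct the std::insert_iterator; the set still places each element correctly.
std::set<int> copy_with_inserter(const std::vector<int>& values) {
    std::set<int> s;
    std::copy(values.begin(), values.end(), std::inserter(s, s.end()));
    return s;
}
```

If the hint is wrong, for instance because the input was not sorted after all, the result is still correct; only the speed-up is lost.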

How can I use binary heap in the Dijkstra algorithm?

I am writing code for Dijkstra's algorithm. For the part where we are supposed to find the node with minimum distance from the node currently being used, I am using an array and traversing it fully to find the node.
This part can be replaced by a binary heap, and we can find the node in O(1) time, but we also update the distances of nodes in further iterations. How do I incorporate that into the heap?
In the case of an array, all I have to do is go to the (i-1)th index and update the value of that node, but the same thing can't be done in a binary heap: I would have to do a full search to find the position of the node and then update it.
What is the workaround for this problem?
This is just some information I found while doing this in a class, that I shared with my classmates. I thought I'd make it easier for folks to find it, and I had left this post up so that I could answer it when I found a solution.
Note: I'm assuming for this example that your graph's vertices have an ID to keep track of which is which. This could be a name, a number, whatever, just make sure you change the type in the struct below.
If you have no such means of distinction, then you can use pointers to the vertices and compare their pointed-to addresses.
The problem you are faced with here is the fact that, in Dijkstra's algorithm, we are asked to store the graph's vertices and their keys in this priority queue, then update the keys of the ones left in the queue.
But... Heap data-structures have no way of getting at any particular node that is not the minimum or the last node!
The best we'd be able to do is traverse the heap in O(n) time to find it, then update its key and bubble-it-up, at O(Logn). That makes updating all vertices O(n) for every single edge, making our implementation of Dijkstra O(mn), way worse than the optimal O(mLogn).
Bleh! There has to be a better way!
So, what we need to implement isn't exactly a standard min-heap-based priority queue. We need one more operation than the standard 4 pq operations:
IsEmpty
Add
PopMin
PeekMin
and DecreaseKey
In order to DecreaseKey, we need to:
find a particular vertex inside the Heap
lower its key-value
"heap-up" or "bubble-up" the vertex
Essentially, since you were (I'm assuming it has been implemented sometime in the past 4 months) probably going to use an "array-based" heap implementation,
this means that we need the heap to keep track of each vertex and its index in the array in order for this operation to be possible.
Devising a struct like: (c++)
struct VertLocInHeap
{
int vertex_id;
int index_in_heap;
};
would allow you to keep track of it, but storing those in an array would still give you O(n) time for finding the vertex in the heap. No complexity improvement, and it's more complicated than before. >.<
My suggestion (if optimization is the goal here):
Store this info in a Binary Search Tree whose key value is the `vertex_id`
do a binary-search to find the vertex's location in the Heap in O(Logn)
use the index to access the vertex and update its key in O(1)
bubble-up the vertex in O(Logn)
I actually used a std::map declared as:
std::map<int, int> m_locations; // vertex_id -> index in the heap's array
in the heap instead of using the struct. The first parameter (Key) is the vertex_id, and the second parameter (Value) is the index in the heap's array.
Since std::map guarantees O(Logn) searches, this works nicely out-of-the-box. Then whenever you insert or bubble, you just m_locations[vertexID] = newLocationInHeap;
Easy money.
Analysis:
Upside: we now have O(Logn) for finding any given vertex in the p-q. For the bubble-up we do O(Log(n)) movements, for each swap doing an O(Log(n)) search in the map of array indexes, resulting in an O(Log^2(n)) operation for bubble-up.
So, we have a Log(n) + Log^2(n) = O(Log^2(n)) operation for updating the key values in the Heap for a single edge. That makes our Dijkstra alg take O(mLog^2(n)). That's pretty close to the theoretical optimum, at least as close as I can get it. Awesome Possum!
Downside: We are storing literally twice as much information in memory for the heap. Is it a "modern" problem? Not really; my desktop can store over 8 billion integers, and many modern computers come with at least 8GB of RAM; however, it is still a factor. If you did this implementation with a graph of 4 billion vertices, which can happen a lot more often than you'd think, then it causes a problem. Also, all those extra reads/writes, which may not affect the complexity in analysis, may still take time on some machines, especially if the information is being stored externally.
I hope this helps someone in the future, because I had a devil of a time finding all this information, then piecing the bits I got from here, there, and everywhere together to form this. I'm blaming the internet and lack of sleep.
The problem I ran into with using any form of heap is that, you need to reorder the nodes in the heap. In order to do that, you would have to keep popping everything from the heap until you found the node you need, then change the weight, and push it back in (along with everything else you popped). Honestly, just using an array would probably be more efficient and easier to code than that.
The way I got around this was I used a Red-Black tree (in C++ it's just the set<> data type of the STL). The data structure contained a pair<> element which had a double (cost) and string (node). Because of the tree structure, it is very efficient to access the minimum element (I believe C++ makes it even more efficient by maintaining a pointer to the minimum element).
Along with the tree, I also kept an array of doubles that contained the distance for a given node. So, when I needed to reorder a node in the tree, I simply used the old distance from the dist array along with the node name to find it in the set. I would then remove that element from the tree and re-insert it into the tree with the new distance. Searching for a node is O(log n) and inserting a node is O(log n), so the cost to reorder a node is O(2 * log n) = O(log n). A binary heap also has O(log n) for both insert and delete (and doesn't support search), but on top of that comes the cost of deleting all of the nodes until you find the node you want, changing its weight, and then inserting all the nodes back in. Once the node has been reordered, I would then change the distance in the array to reflect the new distance.
I honestly can't think of a way to modify a heap in such a way to allow it to dynamically change the weights of a node, because the whole structure of the heap is based on the weights the nodes maintain.
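A rough sketch of the set-based reordering described in this answer. The alias CostQueue and the function reorder_node are names made up for the example, and the distances are kept in a std::map keyed by node name rather than in a plain array, which is an assumption made only to keep the snippet self-contained.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Ordered by cost first, then by node name; begin() is always the cheapest node.
using CostQueue = std::set<std::pair<double, std::string>>;

// Moves `node` to its new position in the queue.
// `dist` must hold the cost under which the node is currently stored, if any.
void reorder_node(CostQueue& queue, std::map<std::string, double>& dist,
                  const std::string& node, double new_cost) {
    auto old = dist.find(node);
    if (old != dist.end()) {
        // The old (cost, name) pair is the exact key, so this erase is O(log n).
        queue.erase(std::make_pair(old->second, node));
    }
    queue.insert(std::make_pair(new_cost, node));  // O(log n)
    dist[node] = new_cost;                         // keep the lookup table in sync
}
```

Because the old cost is part of the key, the distance table has to be updated on every reorder; otherwise the next erase would look for the wrong pair.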
I would do this using a hash table in addition to the Min-Heap array.
The hash table's keys are the node objects (via their hash codes), and its values are the indices of where those nodes are in the min-heap array.
Then anytime you move something in the min-heap you just need to update the hash table accordingly. Since at most 2 elements will be moved per operation in the min-heap (that is they are exchanged), and our cost per move is O(1) to update the hash table, then we will not have damaged the asymptotic bound of the min-heap operations. For example, minHeapify is O(lgn). We just added 2 O(1) hash table operations per minHeapify operation. Therefore the overall complexity is still O(lgn).
Keep in mind you would need to modify any method that moves your nodes in the min-heap to do this tracking! For example, minHeapify() requires a modification that looks like this using Java:
Node[] nodes;
Map<Node, Integer> indexMap = new HashMap<>(); // map values must be objects, so Integer rather than int
private void minHeapify(Node[] nodes, int i) {
int smallest;
int l = 2*i; // left child index (1-based heap layout)
int r = 2*i + 1; // right child index
if(l <= heapSize && nodes[l].getTime() < nodes[i].getTime()) {
smallest = l;
}
else {
smallest = i;
}
if(r <= heapSize && nodes[r].getTime() < nodes[smallest].getTime()) {
smallest = r;
}
if(smallest != i) {
Node temp = nodes[smallest];
nodes[smallest] = nodes[i];
nodes[i] = temp;
// After the swap each node records the slot it now occupies.
indexMap.put(nodes[smallest], smallest); // Added index tracking in O(1)
indexMap.put(nodes[i], i); // Added index tracking in O(1)
minHeapify(nodes,smallest);
}
}
buildMinHeap, heapExtract should be dependent on minHeapify, so that one is mostly fixed, but you do need the extracted key to be removed from the hash table as well. You'd also need to modify decreaseKey to track these changes as well. Once that's fixed then insert should also be fixed since it should be using the decreaseKey method. That should cover all your bases and you will not have altered the asymptotic bounds of your algorithm and you still get to keep using a heap for your priority queue.
Note that a Fibonacci Min Heap is actually preferred to a standard Min Heap in this implementation, but that's a totally different can of worms.
Another solution is "lazy deletion". Instead of a decrease-key operation, you simply insert the node into the heap once again with the new priority. So there will be another copy of the node in the heap, but that copy will sit higher in the heap than any previous copy. Then, when getting the next minimum node, you simply check whether the node has already been accepted. If it has, skip the rest of the loop body and continue (lazy deletion).
This has slightly worse performance and higher memory usage due to the copies inside the heap. But it is still bounded (by the number of connections) and may be faster than other implementations for some problem sizes.
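Lazy deletion can be sketched with a plain std::priority_queue. The Graph alias and the function name dijkstra_lazy are assumptions for this illustration: vertices are numbered 0..n-1, edges are stored as (neighbour, weight) pairs, and weights are non-negative.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// graph[u] holds (neighbour, edge weight) pairs; weights must be non-negative.
using Graph = std::vector<std::vector<std::pair<int, long long>>>;

// Returns the shortest distance from source to every vertex
// (the maximum long long value for unreachable vertices).
std::vector<long long> dijkstra_lazy(const Graph& graph, int source) {
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(graph.size(), INF);
    std::vector<bool> done(graph.size(), false);

    using Entry = std::pair<long long, int>;  // (distance, vertex)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    dist[source] = 0;
    heap.push({0, source});
    while (!heap.empty()) {
        int u = heap.top().second;
        heap.pop();
        if (done[u]) continue;  // stale copy of an already accepted vertex: skip it
        done[u] = true;
        for (const auto& edge : graph[u]) {
            int v = edge.first;
            long long candidate = dist[u] + edge.second;
            if (candidate < dist[v]) {
                dist[v] = candidate;
                heap.push({candidate, v});  // re-insert instead of decrease-key
            }
        }
    }
    return dist;
}
```

Each edge can add at most one extra entry to the heap, which is why the number of copies stays bounded by the number of connections.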
This algorithm: http://algs4.cs.princeton.edu/44sp/DijkstraSP.java.html works around this problem by using "indexed heap": http://algs4.cs.princeton.edu/24pq/IndexMinPQ.java.html which essentially maintains the list of mappings from key to array index.
I believe the main difficulty is being able to achieve O(log n) time complexity when we have to update vertex distance. Here are the steps on how you could do that:
For heap implementation, you could use an array.
For indexing, use a Hash Map, with Vertex number as the key and its index in heap as the value.
When we want to update a vertex, search its index in the Hash Map in O(1) time.
Reduce the vertex's distance in the heap and then keep traversing up (check its new distance against its parent; if the parent's value is greater, swap the parent and the current vertex). This step takes O(log n).
Update the vertex's index in Hash Map as you make changes while traversing up the heap.
I think this should work and the overall time complexity would be O((E+V)*log V), just as the theory implies.
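The steps above can be sketched as a small array-based min-heap with a hash map from vertex number to heap index. The class name IndexedMinHeap and its member names are made up for this example, and vertices are assumed to be plain ints.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

class IndexedMinHeap {
public:
    bool empty() const { return heap_.empty(); }

    // Adds a vertex that is not yet in the heap.
    void push(int vertex, long long dist) {
        heap_.push_back({dist, vertex});
        pos_[vertex] = heap_.size() - 1;
        sift_up(heap_.size() - 1);
    }

    // Lowers the distance of a vertex that is already in the heap:
    // O(1) average lookup in the hash map, then O(log n) traversal upwards.
    void decrease_key(int vertex, long long new_dist) {
        std::size_t i = pos_.at(vertex);
        heap_[i].first = new_dist;
        sift_up(i);
    }

    // Removes and returns the vertex with the smallest distance. Requires !empty().
    int pop_min() {
        int top = heap_.front().second;
        swap_entries(0, heap_.size() - 1);
        heap_.pop_back();
        pos_.erase(top);
        if (!heap_.empty()) sift_down(0);
        return top;
    }

private:
    // Every move inside the array also updates the hash map.
    void swap_entries(std::size_t a, std::size_t b) {
        std::swap(heap_[a], heap_[b]);
        pos_[heap_[a].second] = a;
        pos_[heap_[b].second] = b;
    }
    void sift_up(std::size_t i) {
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;
            if (heap_[parent].first <= heap_[i].first) break;
            swap_entries(i, parent);
            i = parent;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t smallest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap_.size() && heap_[l].first < heap_[smallest].first) smallest = l;
            if (r < heap_.size() && heap_[r].first < heap_[smallest].first) smallest = r;
            if (smallest == i) break;
            swap_entries(i, smallest);
            i = smallest;
        }
    }

    std::vector<std::pair<long long, int>> heap_;  // (distance, vertex)
    std::unordered_map<int, std::size_t> pos_;     // vertex -> index in heap_
};
```

If the vertices are numbered densely from 0, a std::vector of indices could replace the hash map and avoid hashing altogether.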
I am using the following approach. Whenever I insert something into the heap, I pass a pointer to an integer (this memory location is owned by me, not the heap) which should contain the position of the element in the array managed by the heap. So if the sequence of elements in the heap is rearranged, the heap is supposed to update the values pointed to by these pointers.
So for Dijkstra's algorithm I am creating a posInHeap array of size N.
Hopefully, the code will make it clearer.
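#include <cstddef>
#include <functional>
#include <limits>
#include <utility>
#include <vector>
using namespace std; // the class below uses swap, vector, pair, etc. unqualified
template <typename T, class Comparison = std::less<T>> class cTrackingHeap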
{
public:
cTrackingHeap(Comparison c) : m_c(c), m_v() {}
cTrackingHeap(const cTrackingHeap&) = delete;
cTrackingHeap& operator=(const cTrackingHeap&) = delete;
void DecreaseVal(size_t pos, const T& newValue)
{
m_v[pos].first = newValue;
while (pos > 0)
{
size_t iPar = (pos - 1) / 2;
if (m_c(newValue, m_v[iPar].first)) // use the heap's comparison, as Insert and Delete do
{
swap(m_v[pos], m_v[iPar]);
*m_v[pos].second = pos;
*m_v[iPar].second = iPar;
pos = iPar;
}
else
break;
}
}
void Delete(size_t pos)
{
*(m_v[pos].second) = numeric_limits<size_t>::max();// indicate that the element is no longer in the heap
m_v[pos] = m_v.back();
m_v.resize(m_v.size() - 1);
if (pos == m_v.size())
return;
*(m_v[pos].second) = pos;
bool makingProgress = true;
while (makingProgress)
{
makingProgress = false;
size_t exchangeWith = pos;
if (2 * pos + 1 < m_v.size() && m_c(m_v[2 * pos + 1].first, m_v[pos].first))
exchangeWith = 2 * pos + 1;
if (2 * pos + 2 < m_v.size() && m_c(m_v[2 * pos + 2].first, m_v[exchangeWith].first))
exchangeWith = 2 * pos + 2;
if (pos > 0 && m_c(m_v[pos].first, m_v[(pos - 1) / 2].first))
exchangeWith = (pos - 1) / 2;
if (exchangeWith != pos)
{
makingProgress = true;
swap(m_v[pos], m_v[exchangeWith]);
*m_v[pos].second = pos;
*m_v[exchangeWith].second = exchangeWith;
pos = exchangeWith;
}
}
}
void Insert(const T& value, size_t* posTracker)
{
m_v.push_back(make_pair(value, posTracker));
*posTracker = m_v.size() - 1;
size_t pos = m_v.size() - 1;
bool makingProgress = true;
while (makingProgress)
{
makingProgress = false;
if (pos > 0 && m_c(m_v[pos].first, m_v[(pos - 1) / 2].first))
{
makingProgress = true;
swap(m_v[pos], m_v[(pos - 1) / 2]);
*m_v[pos].second = pos;
*m_v[(pos - 1) / 2].second = (pos - 1) / 2;
pos = (pos - 1) / 2;
}
}
}
const T& GetMin() const
{
return m_v[0].first;
}
const T& Get(size_t i) const
{
return m_v[i].first;
}
size_t GetSize() const
{
return m_v.size();
}
private:
Comparison m_c;
vector< pair<T, size_t*> > m_v;
};

Hashing to Calculate Frequencies can be improved?

I'm currently working on building a hash table in order to calculate word frequencies, choosing the data structure based on its running time: O(1) insertion, O(n) worst-case lookup time, etc.
I've asked a few people about the difference between std::map and a hash table, and the answer I received was:
"std::map adds the element as a binary tree thus causes O(log n) where with the hash table you implement it will be O(n)."
Thus I've decided to implement a hash table using an array of linked lists (separate chaining). In the code below I've assigned two values to each node, one being the key (the word) and the other being the value (the frequency). It works like this: when a word is added and the list at its index is empty, it is inserted directly as the first element of the linked list with a frequency of 0. If the word is already in the list (which unfortunately takes O(n) time to search), its frequency is incremented by 1. If it is not found, it is simply added to the beginning of the list.
I know there are a lot of flaws in the implementation, so I would like to ask the experienced people here: in order to calculate frequencies efficiently, how can this implementation be improved?
Code I've written so far;
#include <iostream>
#include <cstdio>
#include <string>
using namespace std;
struct Node {
string word;
int frequency;
Node *next;
};
class linkedList
{
private:
friend class hashTable;
Node *firstPtr;
Node *lastPtr;
int size;
public:
linkedList()
{
firstPtr=lastPtr=NULL;
size=0;
}
void insert(string word,int frequency)
{
Node* newNode=new Node;
newNode->word=word;
newNode->frequency=frequency;
newNode->next=firstPtr; //NULL when the list is empty, so the chain is always terminated
if(firstPtr==NULL)
lastPtr=newNode;
firstPtr=newNode;
size++;
}
int sizeOfList()
{
return size;
}
void print()
{
if(firstPtr!=NULL)
{
Node *temp=firstPtr;
while(temp!=NULL)
{
cout<<temp->word<<" "<<temp->frequency<<endl;
temp=temp->next;
}
}
else
printf("%s","List is empty");
}
};
class hashTable
{
private:
linkedList* arr;
int index,sizeOfTable;
public:
hashTable(int size) //Forced initalizer
{
sizeOfTable=size;
arr=new linkedList[sizeOfTable];
}
int hash(string key)
{
int hashVal=0;
for(int i=0;i<key.length();i++)
hashVal=37*hashVal+key[i];
hashVal=hashVal%sizeOfTable;
if(hashVal<0)
hashVal+=sizeOfTable;
return hashVal;
}
void insert(string key)
{
index=hash(key);
//Search for the key throughout the linked list at this index.
for(Node* n=arr[index].firstPtr;n!=NULL;n=n->next)
{
if(n->word==key)
{
n->frequency++; //found: increment its value by 1
return;
}
}
//not found (or the list is empty): add the node to the beginning
arr[index].insert(key, 0);
}
};
Do you care about the worst case? If no, use a std::unordered_map (it handles collisions, and you don't want a multimap) or a trie/critbit tree (depending on the keys, it may be more compact than a hash table, which may lead to better caching behavior). If yes, use a std::map or a trie.
If you want, e.g., online top-k statistics, keep a priority queue in addition to the dictionary. Each dictionary value contains the number of occurrences and whether the word belongs to the queue. The queue duplicates the top-k frequency/word pairs but keyed by frequency. Whenever you scan another word, check whether it's both (1) not already in the queue and (2) more frequent than the least element in the queue. If so, extract the least queue element and insert the one you just scanned.
You can implement your own data structures if you like, but the programmers who work on STL implementations tend to be pretty sharp. I would make sure that's where the bottleneck is first.
1- The time complexity of search in std::map and std::set is O(log(n)). The average time complexity for std::unordered_map and std::unordered_set is O(1), with O(n) in the worst case. However, the constant cost of hashing can be large, and for small n it can exceed the cost of log(n) comparisons. I always take this fact into account.
2- If you want to use std::unordered_map, you need to make sure that std::hash is defined for your type. Otherwise you have to define it yourself.
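For comparison, counting word frequencies with std::unordered_map takes only a few lines. This is a minimal sketch; the function name count_words is made up, and it assumes words are separated by whitespace. std::hash is already defined for std::string, so no custom hash is needed here.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Splits text on whitespace and counts how often each word occurs.
std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> freq;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++freq[word];  // operator[] creates the entry with 0 on first sight
    }
    return freq;
}
```

Swapping std::unordered_map for std::map in this function trades average O(1) updates for guaranteed O(log n) ones without changing anything else.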

Binary Search Tree Implementation in C++ STL?

Do you know, please, if the C++ STL contains a Binary Search Tree (BST) implementation, or if I should construct my own BST object?
In case the STL contains no implementation of a BST, are there any libraries available?
My goal is to be able to find the desired record as quickly as possible: I have a list of records (it should be no more than a few thousand), and I do a per-frame (it's a computer game) search in that list. I use an unsigned int as the identifier of the record I am interested in. Whatever way is the fastest will work best for me.
What you need is a way to look up some data given a key. With the key being an unsigned int, this gives you several possibilities. Of course, you could use a std::map:
typedef std::map<unsigned int, record_t> my_records;
However, there are other possibilities as well. For example, it's quite likely that a hash map would be even faster than a binary tree. Hash maps are called unordered_map in C++ and are part of the C++11 standard, likely already supported by your compiler/std lib (check your compiler version and documentation). They were first available in C++ TR1 (std::tr1::unordered_map).
If your keys are rather closely distributed, you might even use a simple array and use the key as an index. When it comes to raw speed, nothing would beat indexing into an array. OTOH, if your key distribution is too random, you'd be wasting a lot of space.
If you store your records as pointers, moving them around is cheap, and an alternative might be to keep your data sorted by key in a vector:
typedef std::vector< std::pair<unsigned int, record_t*> > my_records;
Due to its better data locality, which presumably plays nice with processor cache, a simple std::vector often performs better than other data structures which theoretically should have an advantage. Its weak spot is inserting into/removing from the middle. However, in this case, on a 32bit system, this would require moving entries of 2*32bit POD around, which your implementation will likely perform by calling CPU intrinsics for memory move.
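A possible lookup for the sorted vector of key/pointer pairs, as a sketch. The record_t definition here is a placeholder (the real record type is not shown in the question), and find_record is a name invented for this example; the vector must be kept sorted by key.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct record_t { int payload; };  // placeholder for the game's real record type

typedef std::vector< std::pair<unsigned int, record_t*> > my_records;

// Binary search by key; returns nullptr when the key is absent.
// Precondition: records is sorted by .first in ascending order.
record_t* find_record(const my_records& records, unsigned int key) {
    auto it = std::lower_bound(
        records.begin(), records.end(), key,
        [](const std::pair<unsigned int, record_t*>& entry, unsigned int k) {
            return entry.first < k;
        });
    return (it != records.end() && it->first == key) ? it->second : nullptr;
}
```

With a few thousand records this touches about a dozen contiguous entries per lookup, which is where the cache-friendliness of the vector comes from.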
std::set and std::map are usually implemented as red-black trees, which are a variant of binary search trees. The specifics are implementation dependent, though.
A clean and simple BST implementation in C++:
#include <iostream>
using namespace std; // for cin in main
struct node {
int val;
node* left;
node* right;
};
node* createNewNode(int x)
{
node* nn = new node;
nn->val = x;
nn->left = nullptr;
nn->right = nullptr;
return nn;
}
void bstInsert(node* &root, int x)
{
if(root == nullptr) {
root = createNewNode(x);
return;
}
if(x < root->val)
{
if(root->left == nullptr) {
root->left = createNewNode(x);
return;
} else {
bstInsert(root->left, x);
}
}
if( x > root->val )
{
if(root->right == nullptr) {
root->right = createNewNode(x);
return;
} else {
bstInsert(root->right, x);
}
}
}
int main()
{
node* root = nullptr;
int x;
while(cin >> x) {
bstInsert(root, x);
}
return 0;
}
STL's set class is typically implemented as a BST. It's not guaranteed (the only thing that is guaranteed is its signature, template < class Key, class Compare = less<Key>, class Allocator = allocator<Key> > class set;), but it's a pretty safe bet.
Your post says you want speed (presumably for a tighter game loop).
So why waste time on these slow-as-molasses O(lg n) structures rather than go for a hash map implementation?