Hashing to calculate frequencies - can it be improved? (C++)

I'm currently working on building a hash table in order to calculate word frequencies, with an eye on the running time of the data structure: O(1) insertion, O(n) worst-case lookup time, and so on.
I've asked a few people about the difference between std::map and a hash table, and I received an answer along these lines:
"std::map adds the element to a binary tree, thus causing O(log n), whereas with the hash table you implement it will be O(n)."
So I've decided to implement a hash table using an array of linked lists (for separate chaining). In the code below each node holds two values: the key (the word) and the value (the frequency). Insertion works like this: if the index is empty, the word is inserted directly as the first element of the linked list with a frequency of 0; if the word is already in the list (which unfortunately takes O(n) time to find), its frequency is incremented by 1; if it is not found, it is simply added to the beginning of the list.
I know there are a lot of flaws in this implementation, so I would like to ask the experienced people here: in order to calculate frequencies efficiently, how can this implementation be improved?
Code I've written so far:
#include <iostream>
#include <stdio.h>
#include <string>
using namespace std;

struct Node {
    string word;
    int frequency;
    Node *next;
};

class linkedList
{
private:
    friend class hashTable;
    Node *firstPtr;
    Node *lastPtr;
    int size;
public:
    linkedList()
    {
        firstPtr = lastPtr = NULL;
        size = 0;
    }
    void insert(string word, int frequency)
    {
        Node *newNode = new Node;
        newNode->word = word;
        newNode->frequency = frequency;
        newNode->next = NULL;
        if (firstPtr == NULL)
            firstPtr = lastPtr = newNode;
        else {
            newNode->next = firstPtr;   // prepend to the list
            firstPtr = newNode;
        }
        size++;
    }
    int sizeOfList()
    {
        return size;
    }
    void print()
    {
        if (firstPtr != NULL)
        {
            Node *temp = firstPtr;
            while (temp != NULL)
            {
                cout << temp->word << " " << temp->frequency << endl;
                temp = temp->next;
            }
        }
        else
            printf("%s", "List is empty");
    }
};

class hashTable
{
private:
    linkedList *arr;
    int index, sizeOfTable;
public:
    hashTable(int size) // forced initializer
    {
        sizeOfTable = size;
        arr = new linkedList[sizeOfTable];
    }
    int hash(string key)
    {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++)
            hashVal = 37 * hashVal + key[i];
        hashVal = hashVal % sizeOfTable;
        if (hashVal < 0)
            hashVal += sizeOfTable;
        return hashVal;
    }
    void insert(string key)
    {
        index = hash(key);
        if (arr[index].sizeOfList() < 1)
            arr[index].insert(key, 0);
        else {
            // Search for the key throughout the linked list.
            // If found, increment its frequency by 1;
            // otherwise add the node to the beginning.
        }
    }
};

Do you care about the worst case? If not, use a std::unordered_map (it handles collisions, and you don't want a multimap) or a trie/critbit tree (depending on the keys, it may be more compact than a hash table, which may lead to better caching behaviour). If yes, use a std::set or a trie.
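For the counting itself, a minimal sketch with std::unordered_map (assuming words are read from standard input) could look like this:

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> freq;   // word -> number of occurrences
    std::string word;
    while (std::cin >> word)
        ++freq[word];        // operator[] value-initializes the count to 0 on first use

    for (const auto &entry : freq)
        std::cout << entry.first << " " << entry.second << "\n";
}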
If you want, e.g., online top-k statistics, keep a priority queue in addition to the dictionary. Each dictionary value contains the number of occurrences and whether the word belongs to the queue. The queue duplicates the top-k word/frequency pairs, but keyed by frequency. Whenever you scan another word, check whether it is both (1) not already in the queue and (2) more frequent than the least element in the queue. If so, extract the least element of the queue and insert the one you just scanned.
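The online bookkeeping takes some care (entries already in the queue keep the frequency they had when pushed); a simpler offline variant of the same idea, assuming counting is already finished, is to run a size-k min-heap over the final counts:

#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns the k most frequent words (ascending by frequency).
std::vector<std::pair<int, std::string>>
topK(const std::unordered_map<std::string, int> &freq, std::size_t k)
{
    // Min-heap keyed by frequency: the top is the weakest current candidate.
    std::priority_queue<std::pair<int, std::string>,
                        std::vector<std::pair<int, std::string>>,
                        std::greater<std::pair<int, std::string>>> heap;
    for (const auto &entry : freq) {
        heap.push(std::make_pair(entry.second, entry.first));
        if (heap.size() > k)
            heap.pop();                    // evict the least frequent candidate
    }
    std::vector<std::pair<int, std::string>> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}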
You can implement your own data structures if you like, but the programmers who work on STL implementations tend to be pretty sharp. I would make sure that's where the bottleneck is first.

1- The time complexity for search in std::map and std::set is O(log n). The average (amortized) time complexity for std::unordered_map and std::unordered_set is O(1). However, the constant cost of hashing can be quite large, and for small n it can exceed log(n). I always keep this fact in mind.
2- If you want to use std::unordered_map, you need to make sure that std::hash is defined for your type. Otherwise you have to define it yourself.
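For point 2, the usual pattern is to specialize std::hash for your type (the Point type here, and the way the member hashes are combined, are just an illustration):

#include <cstddef>
#include <functional>
#include <unordered_map>

struct Point { int x, y; };

bool operator==(const Point &a, const Point &b) { return a.x == b.x && a.y == b.y; }

namespace std {
    template <> struct hash<Point> {
        size_t operator()(const Point &p) const noexcept
        {
            // Combine the member hashes; any reasonable mixing will do.
            return hash<int>()(p.x) ^ (hash<int>()(p.y) << 1);
        }
    };
}

// Now this compiles: the map can hash Point keys.
std::unordered_map<Point, int> counts;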

Related

Poor tree add performance

I am writing a tree container at the moment (just for understanding and training), and by now I have a first and very basic approach to adding elements to the tree.
This is my tree code so far. No destructor, no cleanup, and no element access yet.
template <class T> class set
{
public:
    struct Node
    {
        Node(const T& val)
            : left(0), right(0), value(val)
        {}
        Node* left;
        Node* right;
        T value;
    };
    set()
        : m_Root(nullptr)
    {}
    void add(const T& value)
    {
        if (m_Root == nullptr)
        {
            m_Root = new Node(value);
            return;
        }
        Node* next = nullptr;
        Node* current = m_Root;
        do
        {
            if (next != nullptr)
            {
                current = next;
            }
            next = value >= current->value ? current->left : current->right;
        } while (next != nullptr);
        value >= current->value ? current->left = new Node(value)
                                : current->right = new Node(value);
    }
private:
    Node* m_Root;
};
Well, now I tested the add performance against the insert performance of std::set with unique and balanced (low and high) values, and came to the conclusion that the performance is simply awful.
Is there a reason why std::set inserts values that much faster, and what would be a decent way of improving the insert performance of my approach? (I know that there might be better tree models, but as far as I know, the insert performance should be close between most tree models.)
On an i5 4570 at stock clock:
std::set needs 0.013s to add 1000000 int16 values.
My set needs 4.5s to add the same values.
Where does this big difference come from?
Update:
Alright, here is my test code:
int main()
{
    int n = 1000000;
    test::set<test::int16> mset;           // my set
    std::set<test::int16> sset;            // std set
    std::timer timer;                      // simple wrapper for clock()
    test::random_engine engine(0, 500000); // simple wrapper for rand(); yes, it's seeded,
                                           // and yes, I am aware that an int16 will overflow
    std::set<test::int16> values;          // set of values to ensure unique values
    bool flip = false;
    for (int i = 0; n > i; ++i)
    {
        values.insert(flip ? engine.generate() : 0 - engine.generate());
        flip = !flip; // ensure we get high and low values and no straight line, but at least 2 paths
    }
    timer.start();
    for (std::set<test::int16>::iterator it = values.begin(); values.end() != it; ++it)
    {
        mset.add(*it);
    }
    timer.stop();
    std::cout << timer.totalTime() << "s for mset\n";
    timer.reset();
    timer.start();
    for (std::set<test::int16>::iterator it = values.begin(); values.end() != it; ++it)
    {
        sset.insert(*it);
    }
    timer.stop();
    std::cout << timer.totalTime() << "s for std\n";
}
The set won't store every value due to duplicates, but both containers will get a high number of values, and the same values in the same order, to ensure representative results. I know the test could be more accurate, but it should give some comparable numbers.
std::set implementations usually use a red-black tree data structure. It's a self-balancing binary search tree, and its insert operation is guaranteed to be O(log n) time complexity in the worst case (that is required by the standard). You use a simple binary search tree with an O(n) worst-case insert operation.
If you insert unique random values, such a big difference looks suspicious. But don't forget that randomness will not make your tree balanced, and the height of the tree could be much bigger than log(n).
Edit
It seems I found the main problem with your code. You store all generated values in a std::set first, and then add them to both sets in increasing order. That degrades your tree to a linked list.
The two obvious differences are:
the red-black tree (probably) used in std::set rebalances itself to put an upper bound on worst-case behaviour, exactly as DAle says.
If this is the problem, you should see it when plotting N (number of nodes inserted) against time-per-insert. You could also keep track of tree depth (at least for debugging purposes), and plot that against N.
the standard containers use an allocator which probably does something smarter than newing each node individually. You could try using std::allocator in your own container to see if that makes a significant improvement.
Edit 1 if you implemented a pool allocator, that's relevant information that should have been in the question.
Edit 2 now that you've added your test code, there's an obvious problem which means your set will always have the worst-case performance for insertion. You pre-sorted your input values! std::set is an ordered container, so putting your values in there first guarantees you always insert in increasing value order, so your tree (which does not self-balance) degenerates to an expensive linked list, and your inserts are always linear rather than logarithmic time.
You can verify this by storing your values in a vector instead (just using the set to detect collisions), or using an unordered_set to deduplicate without pre-sorting.
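For example, the value-generation part of the test could deduplicate with an unordered_set while keeping the random order (plain int and std::mt19937 stand in here for the question's test:: wrappers):

#include <random>
#include <unordered_set>
#include <vector>

int main()
{
    const int n = 1000000;
    std::mt19937 engine(0);                              // stand-in for test::random_engine
    std::uniform_int_distribution<int> dist(-500000, 500000);

    std::unordered_set<int> seen;                        // detects duplicates, imposes no order
    std::vector<int> values;                             // keeps the original random order
    for (int i = 0; i < n; ++i) {
        int v = dist(engine);
        if (seen.insert(v).second)                       // .second is true only for new values
            values.push_back(v);
    }
    // Feed `values` (unique, unsorted) to both containers under test.
}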

Implementing a Set Using a Sorted Array

This is a data structures question, but also regarding implementation. A set is typically implemented using a BST, but my professor wants us to know how to implement some data structures when only given limited options. So he wants us to be able to understand how to create a set using only an array.
Using a standard (unsorted) array I understand the implementation/complexity...
void add(Student[] arr, Student findstu)
{
    Student stu;
    int i = 0;
    boolean found = false;
    // walk the (null-terminated) array looking for the student
    while ((stu = arr[i]) != null)
    {
        if (stu == findstu)
        {
            found = true;
            break;
        }
        i++;
    }
    if (found == false)
    {
        arr[i] = findstu;   // i now indexes the first empty slot
    }
}
The add/remove/contains methods are all pretty much the same code; all of them contain that first while loop, which makes them O(n).
But if we used a sorted array, why would contains be O(log n) while add/remove stay O(n)?
Searching would be O(log n) because the array is sorted, so you can apply binary search, which has O(log n) complexity.
Insertion and erasure would be O(n) complexity (i.e., linear time), because every time you insert or erase an element in the sorted array you have to shift the elements of your array over by one position, which is O(n).
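A minimal sketch with a plain int array (capacity assumed sufficient) shows both costs:

// contains: O(log n) binary search over the sorted range arr[0..count).
bool contains(const int *arr, int count, int key)
{
    int lo = 0, hi = count - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] == key) return true;
        if (arr[mid] < key) lo = mid + 1;
        else                hi = mid - 1;
    }
    return false;
}

// add: finding the slot can be done in O(log n), but shifting the tail is O(n).
void add(int *arr, int &count, int key)
{
    int pos = 0;
    while (pos < count && arr[pos] < key) ++pos;   // could also use binary search
    if (pos < count && arr[pos] == key) return;    // already in the set
    for (int i = count; i > pos; --i)
        arr[i] = arr[i - 1];                       // shift everything after pos right by one
    arr[pos] = key;
    ++count;
}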

separate chaining in hashing

I am reading about hashing in Robert Sedgewick's book Algorithms in C++:
"We might be using a header node to streamline the code for insertion into an ordered list, but we might not want to use M header nodes for individual lists in separate chaining. Indeed, we could even eliminate the M links to the lists by having the first nodes in the lists comprise the table."
class ST
{
    struct node
    {
        Item item;
        node* next;
        node(Item x, node* t)
        { item = x; next = t; }
    };
    typedef node *link;
private:
    link* heads;
    int N, M;
    Item searchR(link t, Key v)
    {
        if (t == 0) return nullItem;
        if (t->item.key() == v) return t->item;
        return searchR(t->next, v);
    }
public:
    ST(int maxN)
    {
        N = 0; M = maxN/5;
        heads = new link[M];
        for (int i = 0; i < M; i++) heads[i] = 0;
    }
    Item search(Key v)
    { return searchR(heads[hash(v, M)], v); }
    void insert(Item item)
    {
        int i = hash(item.key(), M);
        heads[i] = new node(item, heads[i]); N++;
    }
};
My two questions about the above text:
"We could even eliminate the M links to the lists by having the first nodes in the lists comprise the table." What does the author mean by this, and how can we modify the above code for it?
"We might not want to use M header nodes for individual lists in separate chaining." What does this statement mean?
"We could even eliminate the M links to the lists by having the first nodes in the lists comprise the table."
Consider Node* x[n] vs Node x[n]: the former needs an extra pointer per bucket and a memory allocation (on insertion) for the head Node of every non-empty element, plus an extra indirection for every hash table operation. The latter eliminates the n pointers, but requires that any unused elements can be put into some discernible not-in-use state (tracking of which may or may not require extra memory), and if sizeof(Node) is greater than sizeof(Node*), it may be more wasteful of memory anyway. The difference in memory use can also affect cache efficiency: if the table has a high element-to-bucket ratio, then a Node[] gets the Node data into fewer contiguous memory pages, and if you're iterating (in unsorted order) it's very cache efficient, whereas Node*[] will jump to separate memory allocations that might be scattered all over the place (or, on the other hand, might actually be quite close together in some actually useful way, e.g. if both access patterns and dynamic memory allocation addresses correlate with the chronological order of object creation).
How can we modify above code for this?
First, your existing code has a problem: heads[i] = new node(item, heads[i]); always prepends a new node without checking whether the key is already present, so inserting the same key twice leaves two nodes for it... if the key might already be there, you should be searching the list first rather than blindly adding.
The design change discussed needs:
link* heads;
...changed to...
node* head;
You'd initialise it like this:
head = new node[M];
Which needs an extra node constructor (if item has an equivalent default constructor, you can leave out its initialisation below)
node() : item(nullItem), next(nullptr) { }
Then there's some knock on changes to the rest of your code that are easy to work through. Basically, you're getting rid of a layer of pointers.
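To make that concrete, here is a rough, untested sketch of how the class might end up, reusing Item/Key/nullItem/hash from the book's code; treating a head whose key equals nullItem.key() as "not in use" (and the extra constructors) are assumptions of this sketch:

class ST
{
    struct node
    {
        Item item;
        node* next;
        node() : item(nullItem), next(0) { }        // "not in use" state
        node(Item x, node* t) : item(x), next(t) { }
    };
private:
    node* head;        // M embedded head nodes instead of M head pointers
    int N, M;
    Item searchR(node* t, Key v)
    {
        if (t == 0) return nullItem;
        if (t->item.key() == v) return t->item;
        return searchR(t->next, v);
    }
public:
    ST(int maxN)
    {
        N = 0; M = maxN/5;
        head = new node[M];                          // every bucket starts out empty
    }
    Item search(Key v)
    {
        node& h = head[hash(v, M)];
        if (h.item.key() == nullItem.key()) return nullItem;   // empty bucket
        return searchR(&h, v);
    }
    void insert(Item item)
    {
        node& h = head[hash(item.key(), M)];
        if (h.item.key() == nullItem.key()) {        // unused: store in the head itself
            h.item = item;
        } else {                                     // used: push the old head item down the chain
            h.next = new node(h.item, h.next);
            h.item = item;
        }
        N++;
    }
};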
"we might not want to use M header nodes for individual lists in separate chaining." What does this statement mean.
I didn't write it, so I can't say authoritatively, but it appears to be saying that when designing the list code, a decision might have been made to have an initial Node even in an empty list, as this simplifies the code for several list operations. While the extra data-less Node might seem a reasonable price when contemplating "usual" uses of a list, hash tables are unusual in that you want most of the lists chained off the buckets to have 0 or 1 elements, with exponentially fewer lists being longer and longer. So, such a list implementation is poorly suited to use in a hash table.

doubly linked list implementation

Which one would be more efficient?
I want to keep a list of items, but I'm required to be able to sort the list
by id,
by name
by course credits
by the user
Would it be best to add items to the list in order by id and then sort by the other criteria when needed, or just add items without any order and sort in whatever order is requested by the user?
If you're really required to keep the list sorted -- as opposed to using other data structures to give sorted access to the list -- then you could simply make a list whose elements have different pointers for different sort criteria.
In other words, instead of keeping just previous and next pointers, have previousById, nextById, previousByName, nextByName, previousByCredits and nextByCredits. Likewise, you would have three head and/or tail pointers instead of just one.
Please note that this approach has the drawback of being inflexible when it comes to implementing additional sort criteria. I'm assuming that you're trying to solve a homework-type problem, which is why I tried to tailor the answer to what seem to be the homework requirements.
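A sketch of what such a node could look like (field and pointer names are just illustrative):

#include <string>

struct Student
{
    int id;
    std::string name;
    int credits;

    // One pair of links per ordering the list has to support.
    Student *prevById,      *nextById;
    Student *prevByName,    *nextByName;
    Student *prevByCredits, *nextByCredits;
};

// One head pointer per ordering; traversing "by name" simply follows nextByName.
Student *headById      = nullptr;
Student *headByName    = nullptr;
Student *headByCredits = nullptr;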
You can use three maps (or hashmaps):
One maps the id to the item, one maps the name to an item reference (or pointer), and one maps course credits to an item reference again.
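Roughly like this (type and variable names are made up; name and credits may repeat, hence multimap):

#include <map>
#include <string>

struct Item { int id; std::string name; int credits; };

std::map<int, Item>               byId;       // owns the items, keyed by unique id
std::multimap<std::string, Item*> byName;     // non-owning references, names may repeat
std::multimap<int, Item*>         byCredits;  // same for course credits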
It would be most efficient to keep the list sorted in whichever order you know it will be needed most often; for example, if you know you're going to be retrieving by id most often, keep it sorted by id. Otherwise pick one of the others, though id would be the easiest if it is just an integer field.
To do that, on insert you would search for the position where newid is less than the next node's id but greater than the previous node's id, then allocate a new node with new and set the pointers appropriately (see the sketch below).
Keeping the linked list sorted in some way is better than keeping it unsorted. You add a little time to each insert, but it's negligible compared to how long it would take to sort the whole list that particular way on demand.
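A sketch of that insert for a doubly linked list kept sorted by id (the node layout is made up for illustration):

#include <string>

struct Node
{
    int id;
    std::string name;
    Node *prev;
    Node *next;
};

void insertById(Node *&head, Node *newNode)
{
    // Walk forward until we hit the first node whose id is not smaller.
    Node *cur = head, *last = nullptr;
    while (cur != nullptr && cur->id < newNode->id) {
        last = cur;
        cur = cur->next;
    }
    newNode->prev = last;
    newNode->next = cur;
    if (last != nullptr) last->next = newNode;
    else                 head = newNode;           // inserting at the front
    if (cur != nullptr)  cur->prev = newNode;
}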
The most efficient approach would be to store the nodes as-is and keep 4 different indexes up to date. This way, when one order is required, you just pick the right index and that's all. The cost is O(log N) per insertion and O(1) for traversal.
Of course, keeping 4 indexes at once, perhaps with different uniqueness requirements, and in the face of possible exceptions, is relatively difficult. But then, there's a Boost library for this: Boost.MultiIndex.
One of its examples generates a set that can be sorted either by ID or by name.
Since you can add as many indexes as you wish, it should get you going :)
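A sketch of what that looks like with Boost.MultiIndex (the Item struct and field names are made up; the container keeps all three orderings up to date on every insert):

#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>

namespace bmi = boost::multi_index;

struct Item
{
    int id;
    std::string name;
    int credits;
};

typedef bmi::multi_index_container<
    Item,
    bmi::indexed_by<
        bmi::ordered_unique<     bmi::member<Item, int,         &Item::id>      >,
        bmi::ordered_non_unique< bmi::member<Item, std::string, &Item::name>    >,
        bmi::ordered_non_unique< bmi::member<Item, int,         &Item::credits> >
    >
> ItemSet;

int main()
{
    ItemSet items;
    items.insert(Item{1, "Alice", 30});
    items.insert(Item{2, "Bob",   20});

    // Index 0 iterates by id, index 1 by name, index 2 by credits.
    for (const Item &i : items.get<1>())
        (void)i;   // traverse in name order
}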
Keep your linked list objects in the linked list, in random order. To sort the list by any key, use this pseudocode:
#include <cstddef>
#include <string>
using namespace std;

struct LinkedList {
    string name;
    LinkedList *prev;
    LinkedList *next;
};

void FillArray(LinkedList *first, LinkedList ***output, size_t &size) {
    // This function creates an array of pointers to every LinkedList object.
    LinkedList *now;
    size_t i; // you may use int instead of size_t
    // Check how many objects there are in the linked list.
    size = 0;
    now = first;
    while (now != NULL) {
        size++;
        now = now->next;
    }
    // If the linked list is empty...
    if (size == 0) {
        *output = NULL;
        return;
    }
    // Create the array of pointers...
    *output = new LinkedList*[size];
    // ...and fill it.
    i = 0;
    now = first;
    while (now != NULL) {
        (*output)[i++] = now;
        now = now->next;
    }
}

void SortByName(LinkedList **arrayOfPointers, size_t size) {
    // your function to sort the pointer array by name here
}

void TemporarySort(LinkedList *first, LinkedList ***output, size_t &size) {
    // This function creates the array of pointers to your linked list nodes,
    // sorts that array, and returns it. The linked list itself stays as it
    // is. Good, for example, when your linked list is sorted by ID but you
    // need to print it sorted by names only once.
    FillArray(first, output, size);
    SortByName(*output, size);
}

void PermanentSort(LinkedList *&first) {
    // This function sorts the linked list and saves the new order
    // permanently by relinking the nodes.
    LinkedList **sorted;
    size_t size;
    TemporarySort(first, &sorted, size);
    if (size == 0) {
        return;
    }
    sorted[0]->prev = NULL;
    for (size_t i = 1; i < size; i++) {
        sorted[i - 1]->next = sorted[i];
        sorted[i]->prev = sorted[i - 1];
    }
    sorted[size - 1]->next = NULL;
    first = sorted[0]; // the list now starts with the first node in sorted order
    delete[] sorted;
}
I hope I actually did help you. If you don't understand any line of the code, simply put a comment on this "answer".

Binary Search Tree Implementation in C++ STL?

Do you know whether the C++ STL contains a Binary Search Tree (BST) implementation, or whether I should construct my own BST object?
In case the STL contains no implementation of a BST, are there any libraries available?
My goal is to be able to find the desired record as quickly as possible: I have a list of records (it should not be more than a few thousand), and I do a per-frame search in that list (it's a computer game). I use an unsigned int as the identifier of the record of my interest. Whatever way is fastest will work best for me.
What you need is a way to look up some data given a key. With the key being an unsigned int, this gives you several possibilities. Of course, you could use a std::map:
typedef std::map<unsigned int, record_t> my_records;
However, there are other possibilities as well. For example, it's quite likely that a hash map would be even faster than a binary tree. Hash maps are called unordered_map in C++ and are part of the C++11 standard, likely already supported by your compiler/standard library (check your compiler version and documentation). They were first available in TR1 (std::tr1::unordered_map).
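For example (assuming C++11; with TR1 it would be std::tr1::unordered_map):

#include <unordered_map>

struct record_t { /* ... */ };
typedef std::unordered_map<unsigned int, record_t> my_records;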
If your keys are rather closely distributed, you might even use a simple array and use the key as an index. When it comes to raw speed, nothing would beat indexing into an array. OTOH, if your key distribution is too random, you'd be wasting a lot of space.
If you store your records as pointers, moving them around is cheap, and an alternative might be to keep your data sorted by key in a vector:
typedef std::vector< std::pair<unsigned int, record_t*> > my_records;
Due to its better data locality, which presumably plays nice with processor cache, a simple std::vector often performs better than other data structures which theoretically should have an advantage. Its weak spot is inserting into/removing from the middle. However, in this case, on a 32bit system, this would require moving entries of 2*32bit POD around, which your implementation will likely perform by calling CPU intrinsics for memory move.
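A sketch of the lookup for such a sorted vector, using std::lower_bound (record_t and the kept-sorted-by-key invariant are assumed):

#include <algorithm>
#include <utility>
#include <vector>

struct record_t { /* ... */ };

typedef std::vector< std::pair<unsigned int, record_t*> > my_records;

struct by_key
{
    bool operator()(const std::pair<unsigned int, record_t*> &entry, unsigned int key) const
    { return entry.first < key; }
};

// Assumes `records` is kept sorted by key (the pair's first member).
record_t* find_record(const my_records &records, unsigned int key)
{
    // Binary search over contiguous memory: O(log n) and cache friendly.
    my_records::const_iterator it =
        std::lower_bound(records.begin(), records.end(), key, by_key());
    if (it != records.end() && it->first == key)
        return it->second;
    return 0;   // not found
}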
std::set and std::map are usually implemented as red-black trees, which are a variant of binary search trees. The specifics are implementation-dependent, though.
A clean and simple BST implementation in C++:
#include <iostream>
using namespace std;

struct node {
    int val;
    node* left;
    node* right;
};

node* createNewNode(int x)
{
    node* nn = new node;
    nn->val = x;
    nn->left = nullptr;
    nn->right = nullptr;
    return nn;
}

void bstInsert(node* &root, int x)
{
    if (root == nullptr) {
        root = createNewNode(x);
        return;
    }
    if (x < root->val)
    {
        if (root->left == nullptr) {
            root->left = createNewNode(x);
            return;
        } else {
            bstInsert(root->left, x);
        }
    }
    if (x > root->val)
    {
        if (root->right == nullptr) {
            root->right = createNewNode(x);
            return;
        } else {
            bstInsert(root->right, x);
        }
    }
}

int main()
{
    node* root = nullptr;
    int x;
    while (cin >> x) {
        bstInsert(root, x);
    }
    return 0;
}
The STL's set class is typically implemented as a BST. It's not guaranteed (the only thing that is guaranteed is its signature: template <class Key, class Compare = less<Key>, class Allocator = allocator<Key>> class set;), but it's a pretty safe bet.
Your post says you want speed (presumably for a tighter game loop).
So why waste time on these slow-as-molasses O(lg n) structures? Go for a hash map implementation instead.