Fast search-insert-delete algorithm for low power processor - c++

We have an application that runs on a low-power processor and needs a fast response to incoming data. The data comes in with an associated key. Each key ranges from 0x00 to 0xFE (max 0xFF). The data itself ranges from 1 kB to 4 kB in size. The system processes data like this:
data in
key lookup -> insert key & data if not found
buffer data in existing key
After some event, a key and its associated data are destroyed.
We are trialing a couple of solutions:
A pre-allocated std::vector<std::pair<int, unsigned char *>> that looks up a key by index position.
std::map<int, unsigned char *>
Red-Black tree
std::vector<...> kept sorted, with binary search on the keys
Are there any other algorithms that could be fast at insert-search-delete?
Thanks.

std::map itself uses a balanced tree (such as a red-black tree), so there is no point in re-implementing one.
A sorted std::vector with binary search has the same lookup performance as a balanced binary tree. The difference is that inserting a key into the middle of the vector is costly.
Since your keys have a very limited range, your best choice is similar to your first suggestion:
std::vector<unsigned char *> data(0xFF); // no need to have a pair
This way, a simple check of data[key] == NULL tells you whether data for this key exists. If it were me, I would make it even simpler:
unsigned char *data[0xFF] = {}; // zero-initialize so unused slots start out NULL
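Fleshed out a bit, the direct-indexed approach could look like the sketch below. The `SlotTable` name and the use of `std::vector<unsigned char>` for the buffered data are illustrative assumptions for this sketch, not part of the original suggestion:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Keys 0x00-0xFE index directly into a fixed array; nullptr means "not present".
constexpr std::size_t kMaxKeys = 0xFF;

struct SlotTable {
    std::array<std::vector<unsigned char>*, kMaxKeys> slots{}; // value-initialized to nullptr

    // O(1) presence check: a non-null slot means the key exists.
    bool contains(unsigned key) const { return key < kMaxKeys && slots[key] != nullptr; }

    // Insert-or-append: allocates only on the first insert for a key,
    // then appends incoming bytes to the existing buffer.
    void buffer(unsigned key, const unsigned char* data, std::size_t len) {
        if (key >= kMaxKeys) return;
        if (!slots[key]) slots[key] = new std::vector<unsigned char>;
        slots[key]->insert(slots[key]->end(), data, data + len);
    }

    // Destroy the key and its data after the triggering event.
    void destroy(unsigned key) {
        if (key < kMaxKeys) { delete slots[key]; slots[key] = nullptr; }
    }
};
```

Every operation is a direct array index, so there is no searching at all, which matches the low-power constraint.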

If the key is in the range [0, 0xFF), then you could use this:
std::vector<std::string> lut(0xFF); // lookup table

// insert
lut[key] = data; // insert data at position 'key'

// delete
lut[key].clear(); // cleared means data empty

// search
if (lut[key].empty()) // empty string means no key, no data!
{
    // key not found
}
else
{
    std::string &data = lut[key]; // found
}
Note that I used an empty string to indicate that data doesn't exist.

Related

Broadcast STL Map using MPI

I have a variable that looks like this:
map< string, vector<double> > a_data;
Long story short, a_data can be filled only by node 0, hence broadcasting it using MPI_Bcast() is necessary.
As we know, we can only broadcast primitive data types. So how should I broadcast an STL datatype like map using MPI_Bcast()?
One approach you can take is to:
first broadcast the number of keys to every process, so that every process knows how many keys it will have to handle;
broadcast an array encoding the size of each of those keys;
broadcast another array encoding the size of each array of values;
loop over the keys;
broadcast first the key string (as an array of chars);
broadcast next the values as an array of doubles.
So in pseudo-code it would look like:
// number_of_keys <- get number of keys from a_data
// MPI_Bcast() number_of_keys
// int key_sizes[number_of_keys];
// int value_sizes[number_of_keys];
//
// if (node == 0) { // the root process
//     for every key i in a_data do
//         key_sizes[i] = the size of the key
//         value_sizes[i] = the size of the vector of values associated with the key
// }
//
// MPI_Bcast() the array key_sizes
// MPI_Bcast() the array value_sizes
//
// for (int i = 0; i < number_of_keys; i++) {
//     key <- get the key in position i from a_data
//     values <- get the values associated with that key
//
//     MPI_Bcast() the key, using the size stored in key_sizes[i]
//     MPI_Bcast() the values, using the size stored in value_sizes[i]
//
//     // Non-root processes
//     if (node != 0) {
//         add the key to the a_data of the process
//         add the values to the corresponding key
//     }
// }
You just need to adapt the code to C++ (in which I am not an expert), so you might have to adjust a bit, but the big picture is there. Once the approach is working you can optimize further by reducing the number of broadcasts needed. That can be done by packing more information into each broadcast. For instance, you can broadcast first the number of items, then the sizes of the keys and values, and finally the keys and values together. For the latter you would need to create your own custom MPI datatype, similar to the example showcased here.
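As a rough illustration of the packing idea (the FlatMap type and the pack/unpack names are invented for this sketch), the root can flatten the map into a few contiguous buffers that match the broadcasts described above; each FlatMap member is then a single MPI_Bcast of a primitive array, and receivers rebuild the map afterwards:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Flat buffers suitable for broadcasting with MPI_Bcast (one call per member).
struct FlatMap {
    std::vector<int> key_sizes;   // length of each key string
    std::vector<int> value_sizes; // length of each vector<double>
    std::string keys;             // all keys concatenated
    std::vector<double> values;   // all values concatenated
};

// Root side: flatten the map into the buffers.
FlatMap pack(const std::map<std::string, std::vector<double>>& m) {
    FlatMap f;
    for (const auto& kv : m) {
        f.key_sizes.push_back((int)kv.first.size());
        f.value_sizes.push_back((int)kv.second.size());
        f.keys += kv.first;
        f.values.insert(f.values.end(), kv.second.begin(), kv.second.end());
    }
    return f;
}

// Receiver side: rebuild the map from the buffers.
std::map<std::string, std::vector<double>> unpack(const FlatMap& f) {
    std::map<std::string, std::vector<double>> m;
    std::size_t kpos = 0, vpos = 0;
    for (std::size_t i = 0; i < f.key_sizes.size(); ++i) {
        std::string key = f.keys.substr(kpos, f.key_sizes[i]);
        kpos += f.key_sizes[i];
        m[key].assign(f.values.begin() + vpos,
                      f.values.begin() + vpos + f.value_sizes[i]);
        vpos += f.value_sizes[i];
    }
    return m;
}
```

This reduces the per-key broadcasts in the pseudo-code above to a fixed number of broadcasts regardless of map size.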

Multiple Hash Tables for the Word Count Project

I already wrote a working project, but it is way slower than what I aimed for in the first place, so I have some ideas about how to improve it, but I don't know how to implement them, or whether I should implement them at all.
The topic of my project is reading a CSV (Excel) file full of tweets, counting every single word in it, and then displaying the most used words.
(Every row of the file contains information about the tweet plus the tweet itself; I only care about the tweet.)
Instead of sharing the whole code I will simply describe what I did so far and only ask about the part I am struggling with.
First of all, I want to apologize because it will be a long question.
Important note: the only thing I should focus on is speed; storage or size is not a problem.
All the steps:
Read a new line from the CSV file.
Find the "tweet" part of the line and store it as a string.
Split the tweet string into words and store them in an array.
For every word stored in the array, calculate the ASCII value of the word.
(To find the ASCII value of a word I simply sum the ASCII values of its letters.)
Put the word in the hash table with the ASCII value as its key.
(Example: the word "hello" has ASCII value 104+101+108+108+111 = 532, so it is stored with key 532 in the hash table.)
In the hash table only the word (as a string) and the key (as an int) are stored; the count of each word (how many times it is used) is stored in a separate array.
I will share the Insert function (for inserting something into the hash table) because I believe this part would be confusing if I tried to explain it without code.
void Insert(int key, string value) // Key (where we want to add), Value (what we want to add)
{
    if (key < 0) key = 0; // If key is somehow less than 0, clamp it to 0 to avoid errors.
    if (table[key] != NULL) // If there is already something in the hash table
    {
        if (table[key]->value == value) // If the existing value is the same as the value we want to add
        {
            countArray[key][0]++;
        }
        else // If the value is different,
        {
            Insert(key + 100, value); // Call this function again with a key 100 larger than before.
        }
    }
    else // There is nothing saved in this place, so save this value
    {
        table[key] = new HashEntry(key, value);
        countArray[key][1] = key;
        countArray[key][0]++;
    }
}
So the Insert function has three parts.
Add the value to the hash table if the slot with the given key is empty.
If the slot with the given key is not empty, that means we already put a word with this ASCII value there, because different words can have the exact same ASCII value.
The program first checks whether this is the same word.
If it is, the count increases (in the count array).
If not, Insert is called again with a key value 100 larger, until an empty slot or the same word is found.
After all lines are read and every word is stored in the hash table:
Sort the count array.
Print the first 10 elements.
This is the end of the program, so what is the problem?
My biggest problem is that I am reading a very large CSV file with thousands of rows, so every unnecessary thing increases the time noticeably.
My second problem is that there are a lot of words with the same ASCII value. My method of probing 100 slots further does work, but to find an empty slot or the same word, the Insert function can call itself hundreds of times per word
(which causes the biggest problem).
So I thought about using multiple Hashtables.
For example, I can check the first letter of the word and if it is
Between A and E, store in the first hash table
Between F and J, store in the second hash table
...
Between V and Z, store in the last hash table.
Important note again: the only thing I should focus on is speed; storage or size is not a problem.
So conflicts should mostly be minimized this way.
I could even create an absurd number of hash tables and use a different hash table for every different letter.
But I am not sure if that is the logical thing to do, or whether there are much simpler methods I could use.
If it is okay to use multiple hash tables, instead of creating hash tables one by one, is it possible to create an array which stores a hash table in every location?
(Same as an array of arrays, but this time the array stores hash tables.)
If that is possible and logical, can someone show how to do it?
This is the hash table I have:
class HashEntry
{
public:
    int key;
    string value;
    HashEntry(int key, string value)
    {
        this->key = key;
        this->value = value;
    }
};

class HashMap
{
private:
    HashEntry **table;
public:
    HashMap()
    {
        table = new HashEntry *[TABLE_SIZE];
        for (int i = 0; i < TABLE_SIZE; i++)
        {
            table[i] = NULL;
        }
    }
    // Functions
};
I am very sorry for such a long question, and I am again very sorry if I couldn't explain every part clearly enough; English is not my mother language.
Also, one last note: I am doing this for a school project, so I shouldn't use any third-party software or include any different libraries, because that is not allowed.
You are using a very bad hash function (adding all characters); that's why you get so many collisions, and why your Insert method calls itself so many times as a result.
For a detailed overview of different hash functions, see the answer to this question. I suggest you try DJB2 or FNV-1a (which is used in some implementations of std::unordered_map).
You should also use more localized "probes" for the empty place to improve cache locality, and use a loop instead of recursion in your Insert method.
But first I suggest you tweak your HashEntry a little:
class HashEntry
{
public:
    string key;   // the word is actually the key; no need to store the hash value
    size_t value; // the word count is the value
    HashEntry(string key)
        : key(std::move(key)), value(1) // move the string to avoid unnecessary copying
    { }
};
Then let's try to use a better hash function:
// DJB2 hash function
size_t Hash(const string &key)
{
    size_t hash = 5381;
    for (auto &&c : key)
        hash = ((hash << 5) + hash) + c;
    return hash;
}
Then rewrite the Insert function:
void Insert(string key)
{
    size_t index = Hash(key) % TABLE_SIZE;
    while (table[index] != nullptr) {
        if (table[index]->key == key) {
            ++table[index]->value;
            return;
        }
        ++index;
        if (index == TABLE_SIZE) // "wrap around" if we've reached the end of the hash table
            index = 0;
    }
    table[index] = new HashEntry(std::move(key));
}
To find the hash table entry by key you can use a similar approach:
HashEntry *Find(const string &key)
{
    size_t index = Hash(key) % TABLE_SIZE;
    while (table[index] != nullptr) {
        if (table[index]->key == key) {
            return table[index];
        }
        ++index;
        if (index == TABLE_SIZE)
            index = 0;
    }
    return nullptr;
}
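For reference, here is the answer's Hash/Insert/Find combined into one self-contained, compilable unit (the TABLE_SIZE value of 101 and the aggregate HashEntry are arbitrary choices for this sketch):

```cpp
#include <cstddef>
#include <string>
#include <utility>

constexpr std::size_t TABLE_SIZE = 101; // illustrative size; prime helps distribution

struct HashEntry {
    std::string key;   // the word
    std::size_t value; // its count
};

HashEntry *table[TABLE_SIZE] = {}; // all slots start as nullptr

// DJB2 hash, as suggested above.
std::size_t Hash(const std::string &key) {
    std::size_t h = 5381;
    for (char c : key)
        h = ((h << 5) + h) + (unsigned char)c;
    return h;
}

// Linear probing with wrap-around, looped instead of recursive.
void Insert(std::string key) {
    std::size_t i = Hash(key) % TABLE_SIZE;
    while (table[i] != nullptr) {
        if (table[i]->key == key) { ++table[i]->value; return; }
        if (++i == TABLE_SIZE) i = 0;
    }
    table[i] = new HashEntry{std::move(key), 1};
}

HashEntry *Find(const std::string &key) {
    std::size_t i = Hash(key) % TABLE_SIZE;
    while (table[i] != nullptr) {
        if (table[i]->key == key) return table[i];
        if (++i == TABLE_SIZE) i = 0;
    }
    return nullptr;
}
```

After inserting all words, the top-10 step reduces to scanning the table for the largest `value` fields.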

Reading from unordered_multiset results in crash

While refactoring some old code, a cumbersome multilevel map developed in-house was replaced by an std::unordered_multiset.
The multilevel map was something like [string_key1, string_val]. A complex algorithm was applied to derive the keys from string_val, and it resulted in duplicate string_val values being stored in the map, but with different keys.
Eventually, at some point in the application, the multilevel map was iterated to get each string_val and its number of occurrences.
It was replaced by an std::unordered_multiset, and string_val values are just inserted into it. This seems much simpler than having an std::map<std::string, int> and checking-retrieving-updating the counter for every insertion.
What I want to do is retrieve the number of occurrences of each inserted element, but I do not have the keys beforehand. So I iterate over the buckets, but my program crashes upon creation of the string.
// hash map declaration
std::unordered_multiset<std::string> clevel;

// get element and occurrences
for (size_t cbucket = clevel.bucket_count() - 1; cbucket != 0; --cbucket)
{
    std::string cmsg(*clevel.begin(cbucket));
    cmsg += t_str("times=") +
            std::to_string(clevel.bucket_size(cbucket));
}
I do not understand what is going on here; I tried to debug it, but I am somehow stuck (stack overflown? :) ). The program crashes at std::string cmsg(*it);
You should consider how a multiset actually works as a hash table. For example, reading this introduction you will notice that hash maps preallocate their internal buckets, and the number of buckets is optimized.
Therefore, if you insert the element "hello", you will probably get a number of buckets already created, but only the one corresponding to hash("hello") will actually have an element that you may dereference. The rest will be, let's say, empty.
Dereferencing the begin iterator of an empty bucket is what produces the SEGV in your case.
To remedy this situation, you should check each time that begin is not equal to end:
for (size_t cbucket = clevel.bucket_count(); cbucket-- > 0; ) // note: this also visits bucket 0
{
    auto it = clevel.begin(cbucket);
    if (it != clevel.end(cbucket)) // skip empty buckets
    {
        std::string cmsg(*it);
        cmsg += t_str("times=") +
                std::to_string(clevel.bucket_size(cbucket));
    }
}
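A sketch of the same idea wrapped in a helper (count_occurrences is an invented name for this example): walk every bucket, including bucket 0, and skip empty ones. One caveat worth noting: distinct keys can collide into the same bucket, so bucket_size is the number of elements in the bucket, not necessarily the count of one key; counting elements explicitly (or using std::unordered_multiset::count(key)) is safer.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>

// Count occurrences of every element by iterating the buckets safely.
std::map<std::string, std::size_t>
count_occurrences(const std::unordered_multiset<std::string>& ms) {
    std::map<std::string, std::size_t> counts;
    for (std::size_t b = 0; b < ms.bucket_count(); ++b) {
        // For an empty bucket, begin(b) == end(b), so the inner loop is a no-op
        // and we never dereference an invalid iterator.
        for (auto it = ms.begin(b); it != ms.end(b); ++it)
            ++counts[*it];
    }
    return counts;
}
```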

Can I put a lua_table in another table to build a multi-dimensional array via C++?

For my game I'm trying to build the following array structure in C++, because the data comes from an external source and should be available in a Lua script.
The array structure should look like this. (The data are in a map; the map contains the name of the variable and a list of pairs, each pair being a key/value pair considered to be one element of one subarray.)
The data prepared in the map are complete and the structure is definitely okay.
So basically I have
// The map's key is the array name (e.g. "sword"); each entry holds a list of
// pairs, and each pair contains two strings (one key/value pair of the subarray).
typedef std::map<std::string, std::list<std::pair<std::string, std::string> > > ArrayMap;
items = {
["sword"] = {item_id = 1294, price = 500},
["axe"] = {item_id = 1678, price = 200},
["red gem"] = {item_id = 1679, price = 2000},
}
What I got so far now is:
for (ArrayMap::iterator it = this->npc->arrayMap.begin(); it != this->npc->arrayMap.end(); it++) {
    std::string arrayName = (*it).first;
    if ((*it).second.size() > 0) {
        lua_newtable(luaState);
        for (ArrayEntryList::iterator itt = (*it).second.begin(); itt != (*it).second.end(); itt++) {
            LuaScript::setField(luaState, (*itt).first.c_str(), (*itt).second.c_str());
        }
        lua_setglobal(luaState, arrayName.c_str());
    }
}
But this will only generate the following structure:
(table)
[item_id] = (string) 2000
[name] = (string) sword
[price] = (string) 500
The problem is that the table can of course only contain each index once.
That's why I need something like "a table in a table"; is that possible?
Is there a way to achieve this? I'm glad for any hints.
So from what I understand, if you have two "sword" entries, then you cannot store the second one? If that's the case, you are doing it wrong. The keys of a map must be unique, and if you decide to use std::map to store your items, then your external source should provide unique keys. I used std::string as the key in my previous game. Example:
"WeakSword" -> { some more data }
"VeryWeakSword" -> { some more data }
or, with your data (assuming item_ids are unique), you could get something like the following from the external source:
1294 -> { some more data }
1678 -> { some more data }
I'm not sure how efficient this is, but I wasn't programming a hardware-hungry, bleeding-edge 3D game, so it did a fine job.
The data structure you use also depends on how you are going to use it. For example, if you always iterate through this structure, why don't you store it as follows:
class Item { public: ... private: std::string name; int id; int value; };
std::vector<Item> items; // be careful though: std::vector copies each item before it pushes
Extract (or parse?) the actual values you want from each entity in the external source and store them in the std::vector. Inserting into the middle of a std::vector is expensive; however, if your intention is not instant access but rather iterating over the data, why use a map? But if your intention is actually reaching a specific key/value pair, you should alter your external data and use unique keys.
Finally, there is also another associative container that stores non-unique keys, called std::multimap, but I really doubt you need it here.
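For completeness, a minimal sketch of std::multimap holding non-unique keys (the item names and ids here are made up for the example):

```cpp
#include <map>
#include <string>

// A multimap happily stores several entries under the same key.
std::multimap<std::string, int> items = {
    {"sword", 1294},
    {"sword", 1295}, // duplicate key is fine in a multimap
    {"axe",   1678},
};
```

All entries for one key can then be visited with `items.equal_range("sword")`, or counted with `items.count("sword")`.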

hash table for strings in c++

I've done a small exercise about hash tables in the past, but the user was giving the array size, and the struct looked like this (so the user gave a number and a word as input each time):
struct data
{
    int key;
    char c[20];
};
So it was quite simple, since I knew the array size, and the user also said how many items he would give as input. The way I did it was:
Hash the keys the user gave me.
Find the position array[hashed(key)] in the array.
If it was empty, I would put the data there.
If it wasn't, I would put the data in the next free position I could find.
But now I have to build an inverted index, and I am researching how to make a hash table for it. The words will be collected from around 30 txt files, and there will be very many of them.
So in this case, how long should the array be? How can I hash words? Should I use hashing with open addressing or with chaining? The exercise says that we could use a hash table as-is if we find one online, but I prefer to understand it and create it on my own. Any clues will help me :)
In this exercise (inverted index using a hash table) the structs look like this;
data is the type of the hash table I will create.
struct posting
{
    string word;
    posting *next;
};
struct data
{
    string word;
    posting *ptrpostings;
    data *next;
};
Hashing can be done any way you choose. Suppose the string is ABC. You could employ hashing as A=1, B=2, C=3, hash = (1+2+3)/(length = 3) = 2. But this is very primitive.
The size of the array will depend on the hash algorithm you deploy, but it is better to choose an algorithm that returns a hash of fixed length for every string. For example, if you chose SHA-1, you could safely allocate 40 bytes per hash. Refer to Storing SHA1 hash values in MySQL and read up on the algorithm at http://en.wikipedia.org/wiki/SHA-1. I believe it can be safely used.
On the other hand, if it is just for a simple exercise, you can also use an MD5 hash. I wouldn't recommend it for practical purposes, as its rainbow tables are easily available :)
---------EDIT-------
You can try an implementation like this:
#include <iostream>
#include <string>
#include <stdlib.h>
#include <stdio.h>

#define MAX_LEN 30

using namespace std;

typedef struct
{
    string name; // for the filename
    ... change this to your specification
} hashd;

hashd hashArray[MAX_LEN]; // tentative

int returnHash(string s)
{
    // A simple hash, collisions not handled
    int sum = 0, index = 0;
    for (string::size_type i = 0; i < s.length(); i++)
    {
        sum += s[i];
    }
    index = sum % MAX_LEN;
    return index;
}

int main()
{
    string fileName;
    int index;
    cout << "Enter filename ::\t";
    cin >> fileName;
    cout << "Entered filename is ::\t" + fileName << "\n";
    index = returnHash(fileName);
    cout << "Generated index is ::\t" << index << "\n";
    hashArray[index].name = fileName;
    cout << "Filename in array ::\t" << hashArray[index].name;
    return 0;
}
Then, to achieve O(1) lookup, any time you want to fetch a filename's contents, just run the returnHash(filename) function. It will directly return the index into the array :)
A hash table can be implemented as a simple 2-dimensional array. The question is how to compute a unique key for each item to be stored. Some things have keys built into the data; for other things you'll have to compute one: MD5, as suggested above, is probably fine for your needs.
The next problem you need to solve is how to lay out, or size, your hash table. That's something you'll ultimately need to tune to your own needs through some testing. You might start by setting up the first dimension of your array with 256 entries -- one for each combination of the first two hex digits of the MD5 hash. Whenever you have a collision, you add another entry along the second dimension of your array at that first-dimension index. This means you'll statically define a 1-dimensional array while dynamically allocating the second-dimension entries as needed. Hopefully that makes as much sense to you as it does to me.
When doing lookups, you can immediately find the right first-dimension index using the first two hex digits of the MD5 hash. Then a relatively short linear search along the second dimension will quickly bring you to the item you seek.
You might find from experimentation that it's more efficient to use a larger first dimension (use the first three hex digits of the MD5 hash) if your data set is sufficiently large. Depending on the size of the texts involved and the distribution of their use of the lexicon, your results will probably dictate some of your architecture.
On the other hand, you might just start small and build in some intelligence to automatically resize and layout your table. If your table gets too long in either direction, performance will suffer.
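A minimal sketch of that two-dimensional layout, substituting a simple FNV-1a byte hash for MD5 (which is not in the standard library); the TwoDimTable name and its members are invented for the example:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct TwoDimTable {
    // First dimension: fixed 256 slots, one per possible leading hash byte.
    // Second dimension: grows dynamically as collisions occur.
    std::vector<std::string> buckets[256];

    // FNV-1a hash reduced to one byte, standing in for "first 2 digits of MD5".
    static std::uint8_t firstByte(const std::string& s) {
        std::uint32_t h = 2166136261u;
        for (unsigned char c : s) { h ^= c; h *= 16777619u; }
        return (std::uint8_t)(h & 0xFF);
    }

    void insert(const std::string& word) {
        buckets[firstByte(word)].push_back(word);
    }

    bool contains(const std::string& word) const {
        // Short linear scan along the second dimension, as described above.
        for (const auto& w : buckets[firstByte(word)])
            if (w == word) return true;
        return false;
    }
};
```

Widening the first dimension (e.g. keying on two hash bytes for 65536 slots) shortens the linear scans at the cost of more static storage, which matches the tuning trade-off described above.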