Optimizating my code simulating a database (2)

Optimizating my code simulating a database (2) - c++

Some days ago I made you a question and I got some really useful answers. I will make a summary to those of you who didn't read and I will explain my new doubts and where I have problems now.
Explanation
I have been working on a program, simulating a small database, that first of all read information from txt files and store them in the computer memory and then, I can make queries taking normal tables and/or transposed tables. The problem is that the performance is not good enough yet. It works slower than what I expect. I have improved it but I think I should improve it more. I have specific points where my program doesn't have a good performance.
Current problem
The first problem that I have now (where my program is slower) is that I spend more time to, for example table with 100,000 columns & 100 rows (0.325 min, I've improved this thanks to your help) than 100,000 rows & 100 columns (1.61198 min, the same than before). But on the other hand, access time to some data is better in the second case (in a determined example, 47 seconds vs. 6079 seconds in the first case) any idea why??
Explanation
Now let me remind you how my code works (with an atached summary of my code)
First of all I have a .txt file simulating a database table with random strings separated with "|". Here you have an example of table (with 7 rows and 5 columns). I also have the transposed table
NormalTable.txt
42sKuG^uM|24465\lHXP|2996fQo\kN|293cvByiV|14772cjZ`SN|
28704HxDYjzC|6869xXj\nIe|27530EymcTU|9041ByZM]I|24371fZKbNk|
24085cLKeIW|16945TuuU\Nc|16542M[Uz\|13978qMdbyF|6271ait^h|
13291_rBZS|4032aFqa|13967r^\\`T|27754k]dOTdh|24947]v_uzg|
1656nn_FQf|4042OAegZq|24022nIGz|4735Syi]\|18128klBfynQ|
6618t\SjC|20601S\EEp|11009FqZN|20486rYVPR|7449SqGC|
14799yNvcl|23623MTetGw|6192n]YU\Qe|20329QzNZO_|23845byiP|
TransposedTable.txt (This is new from the previous post)
42sKuG^uM|28704HxDYjzC|24085cLKeIW|13291_rBZS|1656nn_FQf|6618t\SjC|14799yNvcl|
24465\lHXP|6869xXj\nIe|16945TuuU\Nc|4032aFqa|4042OAegZq|20601S\EEp|23623MTetGw|
2996fQo\kN|27530EymcTU|16542M[Uz\|13967r^\\`T|24022nIGz|11009FqZN|6192n]YU\Qe|
293cvByiV|9041ByZM]I|13978qMdbyF|27754k]dOTdh|4735Syi]\|20486rYVPR|20329QzNZO_|
14772cjZ`SN|24371fZKbNk|6271ait^h|24947]v_uzg|18128klBfynQ|7449SqGC|23845byiP|
Explanation
This information in a .txt file is read by my program and stored in the computer memory. Then, when making queries, I will access to this information stored in the computer memory. Loading the data in the computer memory can be a slow process, but accessing to the data later will be faster, what really matters me.
Here you have the part of the code that read this information from a file and store in the computer.
Code that reads data from the Table.txt file and store it in the computer memory
int h;
do
{
cout<< "Do you want to query the normal table or the transposed table? (1- Normal table/ 2- Transposed table):" ;
cin>>h;
}while(h!=1 && h!=2);
string ruta_base("C:\\Users\\Raul Velez\\Desktop\\Tables\\");
if(h==1)
{
ruta_base +="NormalTable.txt"; // Folder where my "Table.txt" is found
}
if(h==2)
{
ruta_base +="TransposedTable.txt";
}
string temp; // Variable where every row from the Table.txt file will be firstly stored
vector<string> buffer; // Variable where every different row will be stored after separating the different elements by tokens.
vector<ElementSet> RowsCols; // Variable with a class that I have created, that simulated a vector and every vector element is a row of my table
ifstream ifs(ruta_base.c_str());
while(getline( ifs, temp )) // We will read and store line per line until the end of the ".txt" file.
{
size_t tokenPosition = temp.find("|"); // When we find the simbol "|" we will identify different element. So we separate the string temp into tokens that will be stored in vector<string> buffer
// --- NEW PART ------------------------------------
const char* p = temp.c_str();
char* p1 = strdup(p);
char* pch = strtok(p1, "|");
while(pch)
{
buffer.push_back(string(pch));
pch = strtok(NULL,"|");
}
free(p1);
ElementSet sss(0,buffer);
buffer.clear();
RowsCols.push_back(sss); // We store all the elements of every row (stores as vector<string> buffer) in a different position in "RowsCols"
// --- NEW PART END ------------------------------------
}
Table TablesStorage(RowsCols); // After every loop we will store the information about every .txt file in the vector<Table> TablesDescriptor
vector<Table> TablesDescriptor;
TablesDescriptor.push_back(TablesStorage); // In the vector<Table> TablesDescriptor will be stores all the different tables with all its information
DataBase database(1, TablesDescriptor);
Information already given in the previous post
After this, comes the access to the information part. Let's suppose that I want to make a query, and I ask for input. Let's say that my query is row "n", and also the consecutive tuples "numTuples", and the columns "y". (We must say that the number of columns is defined by a decimal number "y", that will be transformed into binary and will show us the columns to be queried, for example, if I ask for columns 54 (00110110 in binary) I will ask for columns 2, 3, 5 and 6). Then I access to the computer memory to the required information and store it in a vector shownVector. Here I show you the part of this code.
Problem
In the loop if(h == 2) where data from the transposed tables are accessed, performance is poorer ¿why?
Code that access to the required information upon my input
int n, numTuples;
unsigned long long int y;
cout<< "Write the ID of the row you want to get more information: " ;
cin>>n; // We get the row to be represented -> "n"
cout<< "Write the number of followed tuples to be queried: " ;
cin>>numTuples; // We get the number of followed tuples to be queried-> "numTuples"
cout<<"Write the ID of the 'columns' you want to get more information: ";
cin>>y; // We get the "columns" to be represented ' "y"
unsigned int r; // Auxiliar variable for the columns path
int t=0; // Auxiliar variable for the tuples path
int idTable;
vector<int> columnsToBeQueried; // Here we will store the columns to be queried get from the bitset<500> binarynumber, after comparing with a mask
vector<string> shownVector; // Vector to store the final information from the query
bitset<5000> mask;
mask=0x1;
clock_t t1, t2;
t1=clock(); // Start of the query time
bitset<5000> binaryNumber = Utilities().getDecToBin(y); // We get the columns -> change number from decimal to binary. Max number of columns: 5000
// We see which columns will be queried
for(r=0;r<binaryNumber.size();r++) //
{
if(binaryNumber.test(r) & mask.test(r)) // if both of them are bit "1"
{
columnsToBeQueried.push_back(r);
}
mask=mask<<1;
}
do
{
for(int z=0;z<columnsToBeQueried.size();z++)
{
ElementSet selectedElementSet;
int i;
i=columnsToBeQueried.at(z);
Table& selectedTable = database.getPointer().at(0); // It simmulates a vector with pointers to different tables that compose the database, but our example database only have one table, so don't worry ElementSet selectedElementSet;
if(h == 1)
{
selectedElementSet=selectedTable.getRowsCols().at(n);
shownVector.push_back(selectedElementSet.getElements().at(i)); // We save in the vector shownVector the element "i" of the row "n"
}
if(h == 2)
{
selectedElementSet=selectedTable.getRowsCols().at(i);
shownVector.push_back(selectedElementSet.getElements().at(n)); // We save in the vector shownVector the element "n" of the row "i"
}
n=n+1;
t++;
}
}while(t<numTuples);
t2=clock(); // End of the query time
showVector().finalVector(shownVector);
float diff ((float)t2-(float)t1);
float microseconds = diff / CLOCKS_PER_SEC*1000000;
cout<<"Time: "<<microseconds<<endl;
Class definitions
Here I attached some of the class definitions so that you can compile the code, and understand better how it works:
class ElementSet
{
private:
int id;
vector<string> elements;
public:
ElementSet();
ElementSet(int, vector<string>&);
const int& getId();
void setId(int);
const vector<string>& getElements();
void setElements(vector<string>);
};
class Table
{
private:
vector<ElementSet> RowsCols;
public:
Table();
Table(vector<ElementSet>&);
const vector<ElementSet>& getRowsCols();
void setRowsCols(vector<ElementSet>);
};
class DataBase
{
private:
int id;
vector<Table> pointer;
public:
DataBase();
DataBase(int, vector<Table>&);
const int& getId();
void setId(int);
const vector<Table>& getPointer();
void setPointer(vector<Table>);
};
class Utilities
{
public:
Utilities();
static bitset<500> getDecToBin(unsigned long long int);
};
Summary of my problems
Why the load of the data is different depending on the table format???
Why the access to the information also depends on the table (and the performance is in the opposite way than the table data load?
Thank you very much for all your help!!! :)

One thing I see that may explain both your problems is that you are doing many allocations, a lot of which appear to be temporary. For example, in your loading you:
Allocate a temporary string per row
Allocate a temporary string per column
Copy the row to a temporary ElementSet
Copy that to a RowSet
Copy the RowSet to a Table
Copy the Table to a TableDescriptor
Copy the TableDescriptor to a Database
As far as I can tell, each of these copies is a complete new copy of the object. If you only had a few 100 or 1000 records that might be fine but in your case you have 10 million records so the copies will be time consuming.
Your loading times may differ due to the number of allocations done in the loading loop per row and per column. Memory fragmentation may also contribute at some point (when dealing with a large number of small allocations the default memory handler sometimes takes a long time to allocate new memory). Even if you removed all your unnecessary allocations I would still expect the 100 column case to be slightly slower than the 100,000 case due to how your are loading and parsing by line.
Your information access times may be different as you are creating a full copy of a row in selectedElementSet. When you have 100 columns this will be fast but when you have 100,000 columns it will be slow.
A few specific suggestions to improving your code:
Reduce the number of allocations and copies you make. The ideal case would be to make one allocation for reading the file and then another allocation per record when stored.
If you're going to store the data in a Database then put it there from the beginning. Don't make half-a-dozen complete copies of your data to go from a temporary object to the Database.
Make use of references to the data instead of actual copies when possible.
When profiling make sure you get times when running a new instance of the program. Memory use and fragmentation may have a significant impact if you test both cases in the same instance and the order in which you do the tests will matter.
Edit: Code Suggestion
To hopefully improve your speed in the search loop try something like:
for(int z=0;z<columnsToBeQueried.size();z++)
{
int i;
i=columnsToBeQueried.at(z);
Table& selectedTable = database.getPointer().at(0);
if(h == 1)
{
ElementSet& selectedElementSet = selectedTable.getRowsCols().at(n);
shownVector.push_back(selectedElementSet.getElements().at(i));
}
else if(h == 2)
{
ElementSet& selectedElementSet = selectedTable.getRowsCols().at(i);
shownVector.push_back(selectedElementSet.getElements().at(n));
}
n=n+1;
t++;
}
I've just changed the selectedElementSet to use a reference which should complete eliminate the row copies taking place and, in theory, it should have a noticeable impact in performance. For even more performance gain you can change shownVector to be a reference/pointer to avoid yet another copy.
Edit: Answer Comment
You asked where you were making copies. The following lines in your original code:
ElementSet selectedElementSet;
selectedElementSet = selectedTable.getRowsCols().at(n);
creates a copy of the vector<string> elements member in ElementSet. In the 100,000 column case this will be a vector containing 100,000 strings so the copy will be relatively expensive time wise. Since you don't actually need to create a new copy changing selectedElementSet to be a reference, like in my example code above, will eliminate this copy.

Related

Multiple Hash Tables for the Word Count Project

I already wrote a working project but my problem is, it is way slower than what I aimed in the first place so I have some ideas about how to improve it but I don't know how to implement these ideas or should I actually implement these ideas in the first place?
The topic of my project is, reading a CSV (Excel) file full of tweets and counting every single word of it, then displaying most used words.
(Every row of the Excel there is information about the tweet and the tweet itself, I should only care about the tweet)
Instead of sharing the whole code I will just simply wrote what I did so far and only ask about the part I am struggling.
First of all, I want to apologize because it will be a long question.
Important note: Only thing I should focus is speed, storage or size is not a problem.
All the steps:
Read a new line from Excel file.
Find the "tweet" part from the whole line and store it as a string.
Split the tweet string into words and store it in the array.
For every word stored in an array, calculate the ASCII value of the word.
(For finding ascii value of the word I simply sum the ascii value of each letter it has)
Put the word in Hash Table with the key of ASCII value.
(Example: Word "hello" has ascii value of 104+101+108+108+111 = 532, so it stored with key 532 in the hast table)
In Hash Table only the word "as a string" and the key value "as an int" is stored and count of the words (how much the same word is used) is stored in a separated array.
I will share the Insert function (for inserting something to the Hashtable) because I believe it will be confusing if I will try to explain this part without a code.
void Insert(int key, string value) //Key (where we want to add), Value (what we want to add)
{
if (key < 0) key = 0; //If key is somehow less than 0, for not taking any error key become 0.
if (table[key] != NULL) //If there is already something in hast table
{
if (table[key]->value == value) //If existing value is same as the value we want to add
{
countArray[key][0]++;
}
else //If value is different,
{
Insert(key + 100, value); //Call this function again with the key 100 more than before.
}
}
else //There is nothing saved in this place so save this value
{
table[key] = new HashEntry(key, value);
countArray[key][1] = key;
countArray[key][0]++;
}
}
So "Insert" function has three-part.
Add the value to hash table if hast table with the given key is empty.
If hast table with the given key is not empty that means we already put a word with this ascii value.
Because different words can have exact same ascii value.
The program first checks if this is the same word.
If it is, count increase (In the count array).
If not, Insert function is again called with the key value of (same key value + 100) until empty space or same value is found.
After whole lines are read and every word is stored in Hashtable ->
Sort the Count array
Print the first 10 element
This is the end of the program, so what is the problem?
Now my biggest problem is I am reading a very huge CSV file with thousands of rows, so every unnecessary thing increases the time noticeably.
My second problem is there is a lot of values with the same ASCII value, my method of checking hundred more than normal ascii value methods work, but ? for finding the empty space or the same word, Insert function call itself hundred times per word.
(Which caused the most problem).
So I thought about using multiple Hashtables.
For example, I can check the first letter of the word and if it is
Between A and E, store in the first hash table
Between F and J, store in the second hash table
...
Between V and Z, store in the last hash table.
Important note again: Only thing I should focus is speed, storage or size is not a problem.
So conflicts should minimize mostly in this way.
I can even create an absurd amount of hash tables and for every different letter, I can use a different hash table.
But I am not sure if it is the logical thing to do or maybe there are much simpler methods I can use for this.
If it is okay to use multiple hash tables, instead of creating hash tables, one by one, is it possible to create an array which stores a Hashtable in every location?
(Same as Array of Arrays but this time array store Hashtables)
If it is possible and logical, can someone show how to do it?
This is the hash table I have:
class HashEntry
{
public:
int key;
string value;
HashEntry(int key, string value)
{
this->key = key;
this->value = value;
}
};
class HashMap
{
private:
HashEntry * *table;
public:
HashMap()
{
table = new HashEntry *[TABLE_SIZE];
for (int i = 0; i < TABLE_SIZE; i++)
{
table[i] = NULL;
}
}
//Functions
}
I am very sorry for such a long question I asked and I am again very sorry if I couldn't explain every part clear enough, English is not my mother language.
Also one last note, I am doing this for a school project so I shouldn't use any third party software or include any different libraries because it is not allowed.

You are using a very bad hash function (adding all characters), that's why you get so many collisions and your Insert method calls itself so many times as a result.
For a detailed overview of different hash functions see the answer to this question. I suggest you try DJB2 or FNV-1a (which is used in some implementations of std::unordered_map).
You should also use more localized "probes" for the empty place to improve cache-locality and use a loop instead of recursion in your Insert method.
But first I suggest you tweak your HashEntry a little:
class HashEntry
{
public:
string key; // the word is actually a key, no need to store hash value
size_t value; // the word count is the value.
HashEntry(string key)
: key(std::move(key)), value(1) // move the string to avoid unnecessary copying
{ }
};
Then let's try to use a better hash function:
// DJB2 hash-function
size_t Hash(const string &key)
{
size_t hash = 5381;
for (auto &&c : key)
hash = ((hash << 5) + hash) + c;
return hash;
}
Then rewrite the Insert function:
void Insert(string key)
{
size_t index = Hash(key) % TABLE_SIZE;
while (table[index] != nullptr) {
if (table[index]->key == key) {
++table[index]->value;
return;
}
++index;
if (index == TABLE_SIZE) // "wrap around" if we've reached the end of the hash table
index = 0;
}
table[index] = new HashEntry(std::move(key));
}
To find the hash table entry by key you can use a similar approach:
HashEntry *Find(const string &key)
{
size_t index = Hash(key) % TABLE_SIZE;
while (table[index] != nullptr) {
if (table[index]->key == key) {
return table[index];
}
++index;
if (index == TABLE_SIZE)
index = 0;
}
return nullptr;
}

Writing 2-D array int[n][m] to HDF5 file using Visual C++

I'm just getting started with HDF5 and would appreciate some advice on the following.
I have a 2-d array: data[][] passed into a method. The method looks like:
void WriteData( int data[48][100], int sizes[48])
The size of the data is not actually 48 x 100 but rather 48 x sizes[i]. I.e. each row could be a different length! In one simple case I'm dealing with, all rows are the same size (but not 100), so you can say that the array is 48 X sizes[0].
How best to write this to HDF5?
I have some working code where I loop through 0 to 48 and create a new dataset for each row.
Something like:
for (int i = 0; i < 48; i++)
{
hsize_t dsSize[2];
dsSize[0] = 48;
dsSize[1] = sizes[0]; // use sizes[i] in most general case
// Create the Data Space
DataSpace dataSpace = DataSpace(2, dsSize);
DataSet dataSet = group.createDataSet(dataSetName, intDataType, dataSpace);
dataSet.write(data[i], intDataType);
}
Is there a way to write the data all at once in one DataSet? Perhaps one solution for the simpler case of all rows the same length, and another for the ragged rows?
I've tried a few things to no avail. I called dataSet.write(data, intDataType), i.e. I threw the whole array at it. I seemed to get garbage in the file, I suspect because the array the data is stored in is actually 48x100 and I only need a small part of that.
It occurred to me that I could maybe use double ptrs int** or vector> but I'm stuck on that. As far as I can tell, "write" need a void* ptr. Also, I'd like the file to "look correct". I.e. one giant row with all rows of data is not desirable, if I must go that route, someone would need to communicate a slick way to store the info that would allow me to read the data back in from file (perhaps store row lengths as attributes?).
Perhaps my real problem is finding C++ examples of non-trivial use cases.
Any help is much appreciated.
Dave

Here is how you can do it using variable length arrays if your data is a vector of vectors (which seems to make sense for your use case):
void WriteData(const std::vector< std::vector<int> >& data)
{
hsize_t dim(data.size());
H5::DataSpace dspace(1, &dim);
H5::VarLenType dtype(H5::PredType::NATIVE_INT);
H5::DataSet dset(group.createDataSet(dataSetName, dtype, dspace));
hvl_t vl[dim];
for (hsize_t i = 0; i < dim; ++i)
{
vl[i].len = data[i].size();
vl[i].p = &data[i][0];
}
dset.write(vl, dtype);
}

hash table for strings in c++

i've done in the past a small exercise about hashtable but the user was giving array size and also the struct was like this (so the user was giving a number and a word each time as input)
struct data
{
int key;
char c[20];
};
So it was quite simple since i knew the array size and also the user was saying how many items he will be give as input. The way i did it was
Hashing the keys the user gave me
find the position array[hashed(key)] in the array
if it was empty i would put the data there
if it wasn't i would put it in the next free position i would find.
But now i have to make inverted index and i am reasearching so i can make a hashtable for it. So the words will be collected from around 30 txts and they will be so many.
So in this case how long should the array be? How can i hash words? Should i use hasing with open adressing or with chaining. The exercise sais that we could use a hash table as it is if we find it online. but i prefer to understand and create it by my own. Any clues will help me :)
In this exerice(inverted index using hash table) the structs looks like this.
data type is the type of the hash table i will create.
struct posting
{
string word;
posting *next
}
struct data
{
string word;
posting *ptrpostings;
data *next
};

Hashing can be done anyway you choose. Suppose that the string is ABC. You can employ hashing as A=1, B=2, C=3, Hash = 1+2+3/(length = 3) = 2. But, this is very primitive.
The size of the array will depend on the hash algorithm that you deploy, but it is better to choose an algorithm that returns a definite length hash for every string. For example, if you chose to go with SHA1, you can safely allocate 40 bytes per hash. Refer Storing SHA1 hash values in MySQL Read up on the algorithm http://en.wikipedia.org/wiki/SHA-1. I believe that it can be safely used.
On the other hand, if it just for a simple exercise, you can also use MD5 hash. I wouldn't recommend using it in practical purposes as its rainbow tables are available easily :)
---------EDIT-------
You can try to implement like this ::
#include <iostream>
#include <string>
#include <stdlib.h>
#include <stdio.h>
#define MAX_LEN 30
using namespace std;
typedef struct
{
string name; // for the filename
... change this to your specification
}hashd;
hashd hashArray[MAX_LEN]; // tentative
int returnHash(string s)
{
// A simple hashing, no collision handled
int sum=0,index=0;
for(string::size_type i=0; i < s.length(); i++)
{
sum += s[i];
}
index = sum % MAX_LEN;
return index;
}
int main()
{
string fileName;
int index;
cout << "Enter filename ::\t" ;
cin >> fileName;
cout << "Enter filename is ::\t" + fileName << "\n";
index = returnHash(fileName);
cout << "Generated index is ::\t" << index << "\n";
hashArray[index].name = fileName;
cout << "Filename in array ::\t" <<hashArray[index].name ;
return 0;
}
Then, to achieve O(1), anytime you want to fetch the filename's contents, just run the returnHash(filename) function. It will directly return the index of the array :)

A hash table can be implemented as a simple 2-dimensional array. The question is how to compute the unique key for each item to be stored. Some things have keys built into the data, and for other things you'll have to compute one: MD5 as suggested above is probably just fine for your needs.
The next problem you need to solve is how to lay out, or size, your hash table. That's something that you'll ultimately need to tune to your own needs through some testing. You might start by setting up the 1st dimension of your array with 255 entries -- one for each combination of the first 2 digits of the MD5 hash. Whenever you have a collision, you add another entry along the 2nd dimension of your array at that 1st dimension index. This means that you'll statically define a 1-dimensional array while dynamically allocating the 2nd dimension entries as needed. Hopefully that makes as much sense to you as it does to me.
When doing lookups, you can immediately find the right 1st dimension index using the 1st 2-digits of the MD5 hash. Then a relativley short linear search along the 2nd dimension will quickly bring you to the item you seek.
You might find from experimentation that it's more efficient to use a larger 1st dimension (use the fisrt 3 digits of the MD5 hash) if your data set is sufficiently large. Depending on the size of texts involved and the distribution of their use of the lexicon, your results will probably dictate some of your architecture.
On the other hand, you might just start small and build in some intelligence to automatically resize and layout your table. If your table gets too long in either direction, performance will suffer.