hash table for strings in c++ - c++

i've done in the past a small exercise about hashtable but the user was giving array size and also the struct was like this (so the user was giving a number and a word each time as input)
struct data
{
int key;
char c[20];
};
So it was quite simple since i knew the array size and also the user was saying how many items he will be give as input. The way i did it was
Hashing the keys the user gave me
find the position array[hashed(key)] in the array
if it was empty i would put the data there
if it wasn't i would put it in the next free position i would find.
But now i have to make inverted index and i am reasearching so i can make a hashtable for it. So the words will be collected from around 30 txts and they will be so many.
So in this case how long should the array be? How can i hash words? Should i use hasing with open adressing or with chaining. The exercise sais that we could use a hash table as it is if we find it online. but i prefer to understand and create it by my own. Any clues will help me :)
In this exerice(inverted index using hash table) the structs looks like this.
data type is the type of the hash table i will create.
struct posting
{
string word;
posting *next
}
struct data
{
string word;
posting *ptrpostings;
data *next
};

Hashing can be done anyway you choose. Suppose that the string is ABC. You can employ hashing as A=1, B=2, C=3, Hash = 1+2+3/(length = 3) = 2. But, this is very primitive.
The size of the array will depend on the hash algorithm that you deploy, but it is better to choose an algorithm that returns a definite length hash for every string. For example, if you chose to go with SHA1, you can safely allocate 40 bytes per hash. Refer Storing SHA1 hash values in MySQL Read up on the algorithm http://en.wikipedia.org/wiki/SHA-1. I believe that it can be safely used.
On the other hand, if it just for a simple exercise, you can also use MD5 hash. I wouldn't recommend using it in practical purposes as its rainbow tables are available easily :)
---------EDIT-------
You can try to implement like this ::
#include <iostream>
#include <string>
#include <stdlib.h>
#include <stdio.h>
#define MAX_LEN 30
using namespace std;
typedef struct
{
string name; // for the filename
... change this to your specification
}hashd;
hashd hashArray[MAX_LEN]; // tentative
int returnHash(string s)
{
// A simple hashing, no collision handled
int sum=0,index=0;
for(string::size_type i=0; i < s.length(); i++)
{
sum += s[i];
}
index = sum % MAX_LEN;
return index;
}
int main()
{
string fileName;
int index;
cout << "Enter filename ::\t" ;
cin >> fileName;
cout << "Enter filename is ::\t" + fileName << "\n";
index = returnHash(fileName);
cout << "Generated index is ::\t" << index << "\n";
hashArray[index].name = fileName;
cout << "Filename in array ::\t" <<hashArray[index].name ;
return 0;
}
Then, to achieve O(1), anytime you want to fetch the filename's contents, just run the returnHash(filename) function. It will directly return the index of the array :)

A hash table can be implemented as a simple 2-dimensional array. The question is how to compute the unique key for each item to be stored. Some things have keys built into the data, and for other things you'll have to compute one: MD5 as suggested above is probably just fine for your needs.
The next problem you need to solve is how to lay out, or size, your hash table. That's something that you'll ultimately need to tune to your own needs through some testing. You might start by setting up the 1st dimension of your array with 255 entries -- one for each combination of the first 2 digits of the MD5 hash. Whenever you have a collision, you add another entry along the 2nd dimension of your array at that 1st dimension index. This means that you'll statically define a 1-dimensional array while dynamically allocating the 2nd dimension entries as needed. Hopefully that makes as much sense to you as it does to me.
When doing lookups, you can immediately find the right 1st dimension index using the 1st 2-digits of the MD5 hash. Then a relativley short linear search along the 2nd dimension will quickly bring you to the item you seek.
You might find from experimentation that it's more efficient to use a larger 1st dimension (use the fisrt 3 digits of the MD5 hash) if your data set is sufficiently large. Depending on the size of texts involved and the distribution of their use of the lexicon, your results will probably dictate some of your architecture.
On the other hand, you might just start small and build in some intelligence to automatically resize and layout your table. If your table gets too long in either direction, performance will suffer.

Related

Recursive String Transformations

EDIT: I've made the main change of using iterators to keep track of successive positions in the bit and character strings and pass the latter by const ref. Now, when I copy the sample inputs onto themselves multiple times to test the clock, everything finishes within 10 seconds for really long bit and character strings and even up to 50 lines of sample input. But, still when I submit, CodeEval says the process was aborted after 10 seconds. As I mention, they don't share their input so now that "extensions" of the sample input work, I'm not sure how to proceed. Any thoughts on an additional improvement to increase my recursive performance would be greatly appreciated.
NOTE: Memoization was a good suggestion but I could not figure out how to implement it in this case since I'm not sure how to store the bit-to-char correlation in a static look-up table. The only thing I thought of was to convert the bit values to their corresponding integer but that risks integer overflow for long bit strings and seems like it would take too long to compute. Further suggestions for memoization here would be greatly appreciated as well.
This is actually one of the moderate CodeEval challenges. They don't share the sample input or output for moderate challenges but the output "fail error" simply says "aborted after 10 seconds," so my code is getting hung up somewhere.
The assignment is simple enough. You take a filepath as the single command-line argument. Each line of the file will contain a sequence of 0s and 1s and a sequence of As and Bs, separated by a white space. You are to determine whether the binary sequence can be transformed into the letter sequence according to the following two rules:
1) Each 0 can be converted to any non-empty sequence of As (e.g, 'A', 'AA', 'AAA', etc.)
2) Each 1 can be converted to any non-empty sequences of As OR Bs (e.g., 'A', 'AA', etc., or 'B', 'BB', etc) (but not a mixture of the letters)
The constraints are to process up to 50 lines from the file and that the length of the binary sequence is in [1,150] and that of the letter sequence is in [1,1000].
The most obvious starting algorithm is to do this recursively. What I came up with was for each bit, collapse the entire next allowed group of characters first, test the shortened bit and character strings. If it fails, add back one character from the killed character group at a time and call again.
Here is my complete code. I removed cmd-line argument error checking for brevity.
#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
using namespace std;
//typedefs
typedef string::const_iterator str_it;
//declarations
//use const ref and iterators to save time on copying and erasing
bool TransformLine(const string & bits, str_it bits_front, const string & chars, str_it chars_front);
int main(int argc, char* argv[])
{
//check there are at least two command line arguments: binary executable and file name
//ignore additional arguments
if(argc < 2)
{
cout << "Invalid command line argument. No input file name provided." << "\n"
<< "Goodybe...";
return -1;
}
//create input stream and open file
ifstream in;
in.open(argv[1], ios::in);
while(!in.is_open())
{
char* name;
cout << "Invalid file name. Please enter file name: ";
cin >> name;
in.open(name, ios::in);
}
//variables
string line_bits, line_chars;
//reserve space up to constraints to reduce resizing time later
line_bits.reserve(150);
line_chars.reserve(1000);
int line = 0;
//loop over lines (<=50 by constraint, ignore the rest)
while((in >> line_bits >> line_chars) && (line < 50))
{
line++;
//impose bit and char constraints
if(line_bits.length() > 150 ||
line_chars.length() > 1000)
continue; //skip this line
(TransformLine(line_bits, line_bits.begin(), line_chars, line_chars.begin()) == true) ? (cout << "Yes\n") : (cout << "No\n");
}
//close file
in.close();
return 0;
}
bool TransformLine(const string & bits, str_it bits_front, const string & chars, str_it chars_front)
{
//using iterators so store current length as local const
//can make these const because they're not altered here
int bits_length = distance(bits_front, bits.end());
int chars_length = distance(chars_front, chars.end());
//check success rule
if(bits_length == 0 && chars_length == 0)
return true;
//Check fail rules:
//1. next bit is 0 but next char is B
//2. bits length is zero (but char is not, by previous if)
//3. char length is zero (but bits length is not, by previous if)
if((*bits_front == '0' && *chars_front == 'B') ||
bits_length == 0 ||
chars_length == 0)
return false;
//we now know that chars_length != 0 => chars_front != chars.end()
//kill a bit and then call recursively with each possible reduction of front char group
bits_length = distance(++bits_front, bits.end());
//current char group tracker
const char curr_char_type = *chars_front; //use const so compiler can optimize
int curr_pos = distance(chars.begin(), chars_front); //position of current front in char string
//since chars are 0-indexed, the following is also length of current char group
//start searching from curr_pos and length is relative to curr_pos so subtract it!!!
int curr_group_length = chars.find_first_not_of(curr_char_type, curr_pos)-curr_pos;
//make sure this isn't the last group!
if(curr_group_length < 0 || curr_group_length > chars_length)
curr_group_length = chars_length; //distance to end is precisely distance(chars_front, chars.end()) = chars_length
//kill the curr_char_group
//if curr_group_length = char_length then this will make chars_front = chars.end()
//and this will mean that chars_length will be 0 on next recurssive call.
chars_front += curr_group_length;
curr_pos = distance(chars.begin(), chars_front);
//call recursively, adding back a char from the current group until 1 less than starting point
int added_back = 0;
while(added_back < curr_group_length)
{
if(TransformLine(bits, bits_front, chars, chars_front))
return true;
//insert back one char from the current group
else
{
added_back++;
chars_front--; //represents adding back one character from the group
}
}
//if here then all recursive checks failed so initial must fail
return false;
}
They give the following test cases, which my code solves correctly:
Sample input:
1| 1010 AAAAABBBBAAAA
2| 00 AAAAAA
3| 01001110 AAAABAAABBBBBBAAAAAAA
4| 1100110 BBAABABBA
Correct output:
1| Yes
2| Yes
3| Yes
4| No
Since a transformation is possible if and only if copies of it are, I tried just copying each binary and letter sequences onto itself various times and seeing how the clock goes. Even for very long bit and character strings and many lines it has finished in under 10 seconds.
My question is: since CodeEval is still saying it is running longer than 10 seconds but they don't share their input, does anyone have any further suggestions to improve the performance of this recursion? Or maybe a totally different approach?
Thank you in advance for your help!
Here's what I found:
Pass by constant reference
Strings and other large data structures should be passed by constant reference.
This allows the compiler to pass a pointer to the original object, rather than making a copy of the data structure.
Call functions once, save result
You are calling bits.length() twice. You should call it once and save the result in a constant variable. This allows you to check the status again without calling the function.
Function calls are expensive for time critical programs.
Use constant variables
If you are not going to modify a variable after assignment, use the const in the declaration:
const char curr_char_type = chars[0];
The const allows compilers to perform higher order optimization and provides safety checks.
Change data structures
Since you are perform inserts maybe in the middle of a string, you should use a different data structure for the characters. The std::string data type may need to reallocate after an insertion AND move the letters further down. Insertion is faster with a std::list<char> because a linked list only swaps pointers. There may be a trade off because a linked list needs to dynamically allocate memory for each character.
Reserve space in your strings
When you create the destination strings, you should use a constructor that preallocates or reserves room for the largest size string. This will prevent the std::string from reallocating. Reallocations are expensive.
Don't erase
Do you really need to erase characters in the string?
By using starting and ending indices, you overwrite existing letters without have to erase the entire string.
Partial erasures are expensive. Complete erasures are not.
For more assistance, post to Code Review at StackExchange.
This is a classic recursion problem. However, a naive implementation of the recursion would lead to an exponential number of re-evaluations of a previously computed function value. Using a simpler example for illustration, compare the runtime of the following two functions for a reasonably large N. Lets not worry about the int overflowing.
int RecursiveFib(int N)
{
if(N<=1)
return 1;
return RecursiveFib(N-1) + RecursiveFib(N-2);
}
int IterativeFib(int N)
{
if(N<=1)
return 1;
int a_0 = 1, a_1 = 1;
for(int i=2;i<=N;i++)
{
int temp = a_1;
a_1 += a_0;
a_0 = temp;
}
return a_1;
}
You would need to follow a similar approach here. There are two common ways of approaching the problem - dynamic programming and memoization. Memoization is the easiest way of modifying your approach. Below is a memoized fibonacci implementation to illustrate how your implementation can be speeded up.
int MemoFib(int N)
{
static vector<int> memo(N, -1);
if(N<=1)
return 1;
int& res = memo[N];
if(res!=-1)
return res;
return res = MemoFib(N-1) + MemoFib(N-2);
}
Your failure message is "Aborted after 10 seconds" -- implying that the program was working fine as far as it went, but it took too long. This is understandable, given that your recursive program takes exponentially more time for longer input strings -- it works fine for the short (2-8 digit) strings, but will take a huge amount of time for 100+ digit strings (which the test allows for). To see how your running time goes up, you should construct yourself some longer test inputs and see how long they take to run. Try things like
0000000011111111 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAA
00000000111111110000000011111111 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAA
and longer. You need to be able to handle up to 150 digits and 1000 letters.
At CodeEval, you can submit a "solution" that just outputs what the input is, and do that to gather their test set. They may have variations so you may wish to submit it a few times to gather more samples. Some of them are too difficult to solve manually though... the ones you can solve manually will also run very quickly at CodeEval too, even with inefficient solutions, so there's that to consider.
Anyway, I did this same problem at CodeEval (using VB of all things), and my solution recursively looked for the "next index" of both A and B depending on what the "current" index is for where I was in a translation (after checking stoppage conditions first thing in the recursive method). I did not use memoization but that might've helped speed it up even more.
PS, I have not run your code, but it does seem curious that the recursive method contains a while loop within which the recursive method is called... since it's already recursive and should therefore encompass every scenario, is that while() loop necessary?

Optimizating my code simulating a database (2)

Some days ago I made you a question and I got some really useful answers. I will make a summary to those of you who didn't read and I will explain my new doubts and where I have problems now.
Explanation
I have been working on a program, simulating a small database, that first of all read information from txt files and store them in the computer memory and then, I can make queries taking normal tables and/or transposed tables. The problem is that the performance is not good enough yet. It works slower than what I expect. I have improved it but I think I should improve it more. I have specific points where my program doesn't have a good performance.
Current problem
The first problem that I have now (where my program is slower) is that I spend more time to, for example table with 100,000 columns & 100 rows (0.325 min, I've improved this thanks to your help) than 100,000 rows & 100 columns (1.61198 min, the same than before). But on the other hand, access time to some data is better in the second case (in a determined example, 47 seconds vs. 6079 seconds in the first case) any idea why??
Explanation
Now let me remind you how my code works (with an atached summary of my code)
First of all I have a .txt file simulating a database table with random strings separated with "|". Here you have an example of table (with 7 rows and 5 columns). I also have the transposed table
NormalTable.txt
42sKuG^uM|24465\lHXP|2996fQo\kN|293cvByiV|14772cjZ`SN|
28704HxDYjzC|6869xXj\nIe|27530EymcTU|9041ByZM]I|24371fZKbNk|
24085cLKeIW|16945TuuU\Nc|16542M[Uz\|13978qMdbyF|6271ait^h|
13291_rBZS|4032aFqa|13967r^\\`T|27754k]dOTdh|24947]v_uzg|
1656nn_FQf|4042OAegZq|24022nIGz|4735Syi]\|18128klBfynQ|
6618t\SjC|20601S\EEp|11009FqZN|20486rYVPR|7449SqGC|
14799yNvcl|23623MTetGw|6192n]YU\Qe|20329QzNZO_|23845byiP|
TransposedTable.txt (This is new from the previous post)
42sKuG^uM|28704HxDYjzC|24085cLKeIW|13291_rBZS|1656nn_FQf|6618t\SjC|14799yNvcl|
24465\lHXP|6869xXj\nIe|16945TuuU\Nc|4032aFqa|4042OAegZq|20601S\EEp|23623MTetGw|
2996fQo\kN|27530EymcTU|16542M[Uz\|13967r^\\`T|24022nIGz|11009FqZN|6192n]YU\Qe|
293cvByiV|9041ByZM]I|13978qMdbyF|27754k]dOTdh|4735Syi]\|20486rYVPR|20329QzNZO_|
14772cjZ`SN|24371fZKbNk|6271ait^h|24947]v_uzg|18128klBfynQ|7449SqGC|23845byiP|
Explanation
This information in a .txt file is read by my program and stored in the computer memory. Then, when making queries, I will access to this information stored in the computer memory. Loading the data in the computer memory can be a slow process, but accessing to the data later will be faster, what really matters me.
Here you have the part of the code that read this information from a file and store in the computer.
Code that reads data from the Table.txt file and store it in the computer memory
int h;
do
{
cout<< "Do you want to query the normal table or the transposed table? (1- Normal table/ 2- Transposed table):" ;
cin>>h;
}while(h!=1 && h!=2);
string ruta_base("C:\\Users\\Raul Velez\\Desktop\\Tables\\");
if(h==1)
{
ruta_base +="NormalTable.txt"; // Folder where my "Table.txt" is found
}
if(h==2)
{
ruta_base +="TransposedTable.txt";
}
string temp; // Variable where every row from the Table.txt file will be firstly stored
vector<string> buffer; // Variable where every different row will be stored after separating the different elements by tokens.
vector<ElementSet> RowsCols; // Variable with a class that I have created, that simulated a vector and every vector element is a row of my table
ifstream ifs(ruta_base.c_str());
while(getline( ifs, temp )) // We will read and store line per line until the end of the ".txt" file.
{
size_t tokenPosition = temp.find("|"); // When we find the simbol "|" we will identify different element. So we separate the string temp into tokens that will be stored in vector<string> buffer
// --- NEW PART ------------------------------------
const char* p = temp.c_str();
char* p1 = strdup(p);
char* pch = strtok(p1, "|");
while(pch)
{
buffer.push_back(string(pch));
pch = strtok(NULL,"|");
}
free(p1);
ElementSet sss(0,buffer);
buffer.clear();
RowsCols.push_back(sss); // We store all the elements of every row (stores as vector<string> buffer) in a different position in "RowsCols"
// --- NEW PART END ------------------------------------
}
Table TablesStorage(RowsCols); // After every loop we will store the information about every .txt file in the vector<Table> TablesDescriptor
vector<Table> TablesDescriptor;
TablesDescriptor.push_back(TablesStorage); // In the vector<Table> TablesDescriptor will be stores all the different tables with all its information
DataBase database(1, TablesDescriptor);
Information already given in the previous post
After this, comes the access to the information part. Let's suppose that I want to make a query, and I ask for input. Let's say that my query is row "n", and also the consecutive tuples "numTuples", and the columns "y". (We must say that the number of columns is defined by a decimal number "y", that will be transformed into binary and will show us the columns to be queried, for example, if I ask for columns 54 (00110110 in binary) I will ask for columns 2, 3, 5 and 6). Then I access to the computer memory to the required information and store it in a vector shownVector. Here I show you the part of this code.
Problem
In the loop if(h == 2) where data from the transposed tables are accessed, performance is poorer ¿why?
Code that access to the required information upon my input
int n, numTuples;
unsigned long long int y;
cout<< "Write the ID of the row you want to get more information: " ;
cin>>n; // We get the row to be represented -> "n"
cout<< "Write the number of followed tuples to be queried: " ;
cin>>numTuples; // We get the number of followed tuples to be queried-> "numTuples"
cout<<"Write the ID of the 'columns' you want to get more information: ";
cin>>y; // We get the "columns" to be represented ' "y"
unsigned int r; // Auxiliar variable for the columns path
int t=0; // Auxiliar variable for the tuples path
int idTable;
vector<int> columnsToBeQueried; // Here we will store the columns to be queried get from the bitset<500> binarynumber, after comparing with a mask
vector<string> shownVector; // Vector to store the final information from the query
bitset<5000> mask;
mask=0x1;
clock_t t1, t2;
t1=clock(); // Start of the query time
bitset<5000> binaryNumber = Utilities().getDecToBin(y); // We get the columns -> change number from decimal to binary. Max number of columns: 5000
// We see which columns will be queried
for(r=0;r<binaryNumber.size();r++) //
{
if(binaryNumber.test(r) & mask.test(r)) // if both of them are bit "1"
{
columnsToBeQueried.push_back(r);
}
mask=mask<<1;
}
do
{
for(int z=0;z<columnsToBeQueried.size();z++)
{
ElementSet selectedElementSet;
int i;
i=columnsToBeQueried.at(z);
Table& selectedTable = database.getPointer().at(0); // It simmulates a vector with pointers to different tables that compose the database, but our example database only have one table, so don't worry ElementSet selectedElementSet;
if(h == 1)
{
selectedElementSet=selectedTable.getRowsCols().at(n);
shownVector.push_back(selectedElementSet.getElements().at(i)); // We save in the vector shownVector the element "i" of the row "n"
}
if(h == 2)
{
selectedElementSet=selectedTable.getRowsCols().at(i);
shownVector.push_back(selectedElementSet.getElements().at(n)); // We save in the vector shownVector the element "n" of the row "i"
}
n=n+1;
t++;
}
}while(t<numTuples);
t2=clock(); // End of the query time
showVector().finalVector(shownVector);
float diff ((float)t2-(float)t1);
float microseconds = diff / CLOCKS_PER_SEC*1000000;
cout<<"Time: "<<microseconds<<endl;
Class definitions
Here I attached some of the class definitions so that you can compile the code, and understand better how it works:
class ElementSet
{
private:
int id;
vector<string> elements;
public:
ElementSet();
ElementSet(int, vector<string>&);
const int& getId();
void setId(int);
const vector<string>& getElements();
void setElements(vector<string>);
};
class Table
{
private:
vector<ElementSet> RowsCols;
public:
Table();
Table(vector<ElementSet>&);
const vector<ElementSet>& getRowsCols();
void setRowsCols(vector<ElementSet>);
};
class DataBase
{
private:
int id;
vector<Table> pointer;
public:
DataBase();
DataBase(int, vector<Table>&);
const int& getId();
void setId(int);
const vector<Table>& getPointer();
void setPointer(vector<Table>);
};
class Utilities
{
public:
Utilities();
static bitset<500> getDecToBin(unsigned long long int);
};
Summary of my problems
Why the load of the data is different depending on the table format???
Why the access to the information also depends on the table (and the performance is in the opposite way than the table data load?
Thank you very much for all your help!!! :)
One thing I see that may explain both your problems is that you are doing many allocations, a lot of which appear to be temporary. For example, in your loading you:
Allocate a temporary string per row
Allocate a temporary string per column
Copy the row to a temporary ElementSet
Copy that to a RowSet
Copy the RowSet to a Table
Copy the Table to a TableDescriptor
Copy the TableDescriptor to a Database
As far as I can tell, each of these copies is a complete new copy of the object. If you only had a few 100 or 1000 records that might be fine but in your case you have 10 million records so the copies will be time consuming.
Your loading times may differ due to the number of allocations done in the loading loop per row and per column. Memory fragmentation may also contribute at some point (when dealing with a large number of small allocations the default memory handler sometimes takes a long time to allocate new memory). Even if you removed all your unnecessary allocations I would still expect the 100 column case to be slightly slower than the 100,000 case due to how your are loading and parsing by line.
Your information access times may be different as you are creating a full copy of a row in selectedElementSet. When you have 100 columns this will be fast but when you have 100,000 columns it will be slow.
A few specific suggestions to improving your code:
Reduce the number of allocations and copies you make. The ideal case would be to make one allocation for reading the file and then another allocation per record when stored.
If you're going to store the data in a Database then put it there from the beginning. Don't make half-a-dozen complete copies of your data to go from a temporary object to the Database.
Make use of references to the data instead of actual copies when possible.
When profiling make sure you get times when running a new instance of the program. Memory use and fragmentation may have a significant impact if you test both cases in the same instance and the order in which you do the tests will matter.
Edit: Code Suggestion
To hopefully improve your speed in the search loop try something like:
for(int z=0;z<columnsToBeQueried.size();z++)
{
int i;
i=columnsToBeQueried.at(z);
Table& selectedTable = database.getPointer().at(0);
if(h == 1)
{
ElementSet& selectedElementSet = selectedTable.getRowsCols().at(n);
shownVector.push_back(selectedElementSet.getElements().at(i));
}
else if(h == 2)
{
ElementSet& selectedElementSet = selectedTable.getRowsCols().at(i);
shownVector.push_back(selectedElementSet.getElements().at(n));
}
n=n+1;
t++;
}
I've just changed the selectedElementSet to use a reference which should complete eliminate the row copies taking place and, in theory, it should have a noticeable impact in performance. For even more performance gain you can change shownVector to be a reference/pointer to avoid yet another copy.
Edit: Answer Comment
You asked where you were making copies. The following lines in your original code:
ElementSet selectedElementSet;
selectedElementSet = selectedTable.getRowsCols().at(n);
creates a copy of the vector<string> elements member in ElementSet. In the 100,000 column case this will be a vector containing 100,000 strings so the copy will be relatively expensive time wise. Since you don't actually need to create a new copy changing selectedElementSet to be a reference, like in my example code above, will eliminate this copy.

Optimizating my code simulating a database

I have been working on a program, simulating a small database where I could make queries, and after writing the code, I have executed it, but the performance is quite bad. It works really slow. I have tried to improve it, but I started with C++ on my own a few months ago, so my knowledge is still very low. So I would like to find a solution to improve the performance.
Let me explain how my code works. Here I have atached a summarized example of how my code works.
First of all I have a .txt file simulating a database table with random strings separated with "|". Here you have an example of table (with 5 rows and 5 columns).
Table.txt
0|42sKuG^uM|24465\lHXP|2996fQo\kN|293cvByiV
1|14772cjZ`SN|28704HxDYjzC|6869xXj\nIe|27530EymcTU
2|9041ByZM]I|24371fZKbNk|24085cLKeIW|16945TuuU\Nc
3|16542M[Uz\|13978qMdbyF|6271ait^h|13291_rBZS
4|4032aFqa|13967r^\\`T|27754k]dOTdh|24947]v_uzg
This information in a .txt file is read by my program and stored in the computer memory. Then, when making queries, I will access to this information stored in the computer memory. Loading the data in the computer memory can be a slow process, but accessing to the data later will be faster, what really matters me.
Here you have the part of the code that read this information from a file and store in the computer.
Code that reads data from the Table.txt file and store it in the computer memory
string ruta_base("C:\\a\\Table.txt"); // Folder where my "Table.txt" is found
string temp; // Variable where every row from the Table.txt file will be firstly stored
vector<string> buffer; // Variable where every different row will be stored after separating the different elements by tokens.
vector<ElementSet> RowsCols; // Variable with a class that I have created, that simulated a vector and every vector element is a row of my table
ifstream ifs(ruta_base.c_str());
while(getline( ifs, temp )) // We will read and store line per line until the end of the ".txt" file.
{
size_t tokenPosition = temp.find("|"); // When we find the simbol "|" we will identify different element. So we separate the string temp into tokens that will be stored in vector<string> buffer
while (tokenPosition != string::npos)
{
string element;
tokenPosition = temp.find("|");
element = temp.substr(0, tokenPosition);
buffer.push_back(element);
temp.erase(0, tokenPosition+1);
}
ElementSet ss(0,buffer);
buffer.clear();
RowsCols.push_back(ss); // We store all the elements of every row (stores as vector<string> buffer) in a different position in "RowsCols"
}
vector<Table> TablesDescriptor;
Table TablesStorage(RowsCols);
TablesDescriptor.push_back(TablesStorage);
DataBase database(1, TablesDescriptor);
After this, comes the IMPORTANT PART. Let's suppose that I want to make a query, and I ask for input. Let's say that my query is row "n", and also the consecutive tuples "numTuples", and the columns "y". (We must say that the number of columns is defined by a decimal number "y", that will be transformed into binary and will show us the columns to be queried, for example, if I ask for columns 54 (00110110 in binary) I will ask for columns 2, 3, 5 and 6). Then I access to the computer memory to the required information and store it in a vector shownVector. Here I show you the part of this code.
Code that access to the required information upon my input
int n, numTuples;
unsigned long long int y;
clock_t t1, t2;
cout<< "Write the ID of the row you want to get more information: " ;
cin>>n; // We get the row to be represented -> "n"
cout<< "Write the number of followed tuples to be queried: " ;
cin>>numTuples; // We get the number of followed tuples to be queried-> "numTuples"
cout<<"Write the ID of the 'columns' you want to get more information: ";
cin>>y; // We get the "columns" to be represented ' "y"
unsigned int r; // Auxiliar variable for the columns path
int t=0; // Auxiliar variable for the tuples path
int idTable;
vector<int> columnsToBeQueried; // Here we will store the columns to be queried get from the bitset<500> binarynumber, after comparing with a mask
vector<string> shownVector; // Vector to store the final information from the query
bitset<500> mask;
mask=0x1;
t1=clock(); // Start of the query time
bitset<500> binaryNumber = Utilities().getDecToBin(y); // We get the columns -> change number from decimal to binary. Max number of columns: 5000
// We see which columns will be queried
for(r=0;r<binaryNumber.size();r++) //
{
if(binaryNumber.test(r) & mask.test(r)) // if both of them are bit "1"
{
columnsToBeQueried.push_back(r);
}
mask=mask<<1;
}
do
{
for(int z=0;z<columnsToBeQueried.size();z++)
{
int i;
i=columnsToBeQueried.at(z);
vector<int> colTab;
colTab.push_back(1); // Don't really worry about this
//idTable = colTab.at(i); // We identify in which table (with the id) is column_i
// In this simple example we only have one table, so don't worry about this
const Table& selectedTable = database.getPointer().at(0); // It simmulates a vector with pointers to different tables that compose the database, but our example database only have one table, so don't worry ElementSet selectedElementSet;
ElementSet selectedElementSet;
selectedElementSet=selectedTable.getRowsCols().at(n);
shownVector.push_back(selectedElementSet.getElements().at(i)); // We save in the vector shownVector the element "i" of the row "n"
}
n=n+1;
t++;
}while(t<numTuples);
t2=clock(); // End of the query time
float diff ((float)t2-(float)t1);
float microseconds = diff / CLOCKS_PER_SEC*1000000;
cout<<"The query time is: "<<microseconds<<" microseconds."<<endl;
Class definitions
Here I attached some of the class definitions so that you can compile the code, and understand better how it works:
class ElementSet
{
private:
int id;
vector<string> elements;
public:
ElementSet();
ElementSet(int, vector<string>);
const int& getId();
void setId(int);
const vector<string>& getElements();
void setElements(vector<string>);
};
class Table
{
private:
vector<ElementSet> RowsCols;
public:
Table();
Table(vector<ElementSet>);
const vector<ElementSet>& getRowsCols();
void setRowsCols(vector<ElementSet>);
};
class DataBase
{
private:
int id;
vector<Table> pointer;
public:
DataBase();
DataBase(int, vector<Table>);
const int& getId();
void setId(int);
const vector<Table>& getPointer();
void setPointer(vector<Table>);
};
class Utilities
{
public:
Utilities();
static bitset<500> getDecToBin(unsigned long long int);
};
So the problem that I get is that my query time is very different depending on the table size (it has nothing to do a table with 100 rows and 100 columns, and a table with 10000 rows and 1000 columns). This makes that my code performance is very low for big tables, what really matters me... Do you have any idea how could I optimizate my code????
Thank you very much for all your help!!! :)
Whenever you have performance problems, the first thing you want to do is to profile your code. Here is a list of free tools that can do that on windows, and here for linux. Profile your code, identify the bottlenecks, and then come back and ask a specific question.
Also, like I said in my comment, can't you just use SQLite? It supports in-memory databases, making it suitable for testing, and it is lightweight and fast.
One obvious issue is that your get-functions return vectors by value. Do you need to have a fresh copy each time? Probably not.
If you try to return a const reference instead, you can avoid a lot of copies:
const vector<Table>& getPointer();
and similar for the nested get's.
I have not done the job, but you may analyse the complexity of your algorithm.
The reference says that access an item is in constant time, but when you create loops, the complexity of your program increases:
for (i=0;i<1000; ++i) // O(i)
for (j=0;j<1000; ++j) // O(j)
myAction(); // Constant in your case
The program complexity is O(i*j), so how big may be i an j?
What if myAction is not constant in time?
No need to reinvent the wheel again, use FirebirdSQL embedded database instead. That combined with IBPP C++ interface gives you a great foundation for any future needs.
http://www.firebirdsql.org/
http://www.ibpp.org/
Though I advise you to please use a profiler to find out which parts of your code are worth optimizing, here is how I would write your program:
Read the entire text file into one string (or better, memory-map the file.) Scan the string once to find all | and \n (newline) characters. The result of this scan is an array of byte offsets into the string.
When the user then queries item M of row N, retrieve it with code something like this:
char* begin = text+offset[N*items+M]+1;
char* end = text+offset[N*items+M+1];
If you know the number of records and fields before the data is read, the array of byte offsets can be a std::vector. If you don't know and must infer from the data, it should be a std::deque. This is to minimize costly memory allocation and deallocation, which I imagine is the bottleneck in such a program.

Char* vs String Speed in C++

I have a C++ program that will read in data from a binary file and originally I stored data in std::vector<char*> data. I have changed my code so that I am now using strings instead of char*, so that std::vector<std::string> data. Some changes I had to make was to change from strcmp to compare for example.
However I have seen my execution time dramatically increase. For a sample file, when I used char* it took 0.38s and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine with execution time increasing from 0.59s to 1.05s.
I believe this function is causing the slow down. It is part of the converter class, note private variables designated with_ at the end of variable name. I clearly am having memory problems here and stuck in between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.
I access ids_ and names_ many times in another function too, so access speed is very important. Through the use of creating a map instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!
Example NewList.Txt
2515 ABC 23.5 32 -99 1875.7 1
1676 XYZ 12.5 31 -97 530.82 2
279 FOO 45.5 31 -96 530.8 3
OLD Code:
void converter::updateNewList(){
FILE* NewList;
char lineBuffer[100];
char* id = 0;
char* name = 0;
int l = 0;
int n;
NewList = fopen("NewList.txt","r");
if (NewList == NULL){
std::cerr << "Error in reading NewList.txt\n";
exit(EXIT_FAILURE);
}
while(!feof(NewList)){
fgets (lineBuffer , 100 , NewList); // Read line
l = 0;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
id = new char[l];
switch (l){
case 1:
n = sprintf (id, "%c", lineBuffer[0]);
break;
case 2:
n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
break;
case 3:
n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);
break;
case 4:
n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing ids from NewList.txt\n";
exit(EXIT_FAILURE);
}
l = l + 1;
int s = l;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
name = new char[l-s];
switch (l-s){
case 2:
n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
break;
case 3:
n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
break;
case 4:
n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing short name from NewList.txt\n";
exit(EXIT_FAILURE);
}
ids_.push_back ( std::string(id) );
names_.push_back(std::string(name));
}
bool isFound = false;
for (unsigned int i = 0; i < siteNames_.size(); i ++) {
isFound = false;
for (unsigned int j = 0; j < names_.size(); j ++) {
if (siteNames_[i].compare(names_[j]) == 0){
isFound = true;
}
}
}
fclose(NewList);
delete [] id;
delete [] name;
}
C++ CODE
void converter::updateNewList(){
std::ifstream NewList ("NewList.txt");
while(NewList.good()){
unsigned int id (0);
std::string name;
// get the ID and name
NewList >> id >> name;
// ignore the rest of the line
NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
info_.insert(std::pair<std::string, unsigned int>(name,id));
}
NewList.close();
}
UPDATE: Follow up question: Bottleneck from comparing strings and thanks for the very useful help! I will not be making these mistakes in the future!
My guess it that it should be tied to the vector<string>'s performance
About the vector
A std::vector works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array, and copy the strings one by one, which means a copy-construction and a destruction of string which had the same contents, which is counter-productive...
To confirm this easily, then use a std::vector<std::string *> and see if there is a difference in performance.
If this is the case, they you can do one of those four things:
if you know (or have a good idea) of the final size of the vector, use its method reserve() to reserve enough space in the internal array, to avoid useless reallocations.
use a std::deque, which works almost like a vector
use a std::list (which doesn't give you random access to its items)
use the std::vector<char *>
About the string
Note: I'm assuming that your strings\char * are created once, and not modified (through a realloc, an append, etc.).
If the ideas above are not enough, then...
The allocation of the string object's internal buffer is similar to a malloc of a char *, so you should see little or no differences between the two.
Now, if your char * are in truth char[SOME_CONSTANT_SIZE], then you avoid the malloc (and thus, will go faster than a std::string).
Edit
After reading the updated code, I see the following problems.
if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use reserve() on ids_ and and names_
consider making ids_ and names_ deque, or lists.
faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search currently is two for loops, which is quite costly and inneficient.
Consider comparing the length of the strings before comparing its contents. In C++, the length of a string (i.e. std::string::length()) is a zero cost operation)
Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm (I don't know if there is already one, see http://www.cplusplus.com/reference/algorithm/), but I believe this search could be made a lot more efficient just by thinking on it.
Other comments:
Forget the use of int for sizes and lengths in STL. At very least, use size_t. In 64-bit, size_t will become 64-bit, while int will remain 32-bits, so your code is not 64-bit ready (in the other hand, I see few cases of incoming 8 Go strings... but still, better be correct...)
Edit 2
The two (so called C and C++) codes are different. The "C code" expects ids and names of length lesser than 5, or the program exists with an error. The "C++ code" has no such limitation. Still, this limitation is ground for massive optimization, if you confirm names and ids are always less then 5 characters.
Before fixing something make sure that it is bottleneck. Otherwise you are wasting your time. Plus this sort of optimization is microoptimization. If you are doing microoptimization in C++ then consider using bare C.
Resize vector to large enough size before you start populating it. Or, use pointers to strings instead of strings.
The thing is that the strings are being copied each time the vector is being auto-resized. For small objects such as pointers this cost nearly nothing, but for strings the whole string is copied in full.
And id and name should be string instead of char*, and be initialized like this (assuming that you still use string instead of string*):
id = string(lineBuffer, lineBuffer + l);
...
name = string(lineBuffer + s, lineBuffer + s + l);
...
ids_.push_back(id);
names_.push_back(name);
Except for std::string, this is a C program.
Try using fstream, and use the profiler to detect the bottle neck.
You can try to reserve a number of vector values in order to reduce the number of allocations (which are costly), as said Dialecticus (probably from the ancient Roma?).
But there is something that may deserve some observation: how do you store the strings from the file, do you perform concatenations etc...
In C, strings (which do not exist per say - they don't have a container from a library like the STL) need more work to deal with, but at least we know what happens clearly when dealing with them. In the STL, each convenient operation (meaning requiring less work from the programmer) may actually require a lot of operations behind the scene, within the string class, depending on how you use it.
So, while the allocations / freeings are a costly process, the rest of the logic, especially the strings process, may / should probably be looked at as well.
I believe the main issue here is that your string version is copying things twice -- first into dynamically allocated char[] called name and id, and then into std::strings, while your vector<char *> version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies
streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:
void converter::updateNewList(){
std::ifstream NewList ("NewList.txt");
while(NewList.good()){
int id (0);
std::string name;
// get the ID and name
NewList >> id >> name;
// ignore the rest of the line
NewList.ignore( numeric_limits<streamsize>::max(), '\n');
ids_.push_back (id);
names_.push_back(name);
}
NewList.close();
}
There's no need to do the whitespace-tokenizing manually.
Also, you may find this site a helpful reference:
http://www.cplusplus.com/reference/iostream/ifstream/
You can use a profiler to find out where your code consumes most time. If you are for example using gcc, you can compile your program with -pg. When you run it, it saves profiling results in a file. You can the run gprof on the binary to get human readable results. Once you know where most time is consumed you can post that piece of code for further questions.

How to keep only the last duplicate when iterating through rows

Following code iterates through many data-rows, calcs some score per row and then sorts the rows according to that score:
unsigned count = 0;
score_pair* scores = new score_pair[num_rows];
while ((row = data.next_row())) {
float score = calc_score(data.next_feature())
scores[count].score = score;
scores[count].doc_id = row->docid;
count++;
}
assert(count <= num_rows);
qsort(scores, count, sizeof(score_pair), score_cmp);
Unfortunately, there are many duplicate rows with the same docid but different score. Now i like to keep the last score for any docid only. The docids are unsigned int, but usually big (=> no lookup-array) - using a HashMap to lookup the last count for a docid would probably be too slow (many millions of rows, should only take seconds not minutes...).
Ok, i modified my code to use a std:map:
map<int, int> docid_lookup;
unsigned count = 0;
score_pair* scores = new score_pair[num_rows];
while ((row = data.next_row())) {
float score = calc_score(data.next_feature())
map<int, int>::iterator iter;
iter = docid_lookup.find(row->docid);
if (iter != docid_lookup.end()) {
scores[iter->second].score = score;
scores[iter->second].doc_id = row->docid;
} else {
scores[count].score = score;
scores[count].doc_id = row->docid;
docid_lookup[row->docid] = count;
count++;
}
}
It works and the performance hit is not as bad as i expected - now it runs a minute instead of 16 seconds, so it's about a factor of 3. Memory usage has also gone up from about 1Gb to 4Gb.
The first thing I'd try would be a map or unordered_map: I'd be surprised if performance is a factor of 60 slower than what you did without any unique-ification. If the performance there isn't acceptable, another option is something like this:
// get the computed data into a vector
std::vector<score_pair>::size_type count = 0;
std::vector<score_pair> scores;
scores.reserve(num_rows);
while ((row = data.next_row())) {
float score = calc_score(data.next_feature())
scores.push_back(score_pair(score, row->docid));
}
assert(scores.size() <= num_rows);
// remove duplicate doc_ids
std::reverse(scores.begin(), scores.end());
std::stable_sort(scores.begin(), scores.end(), docid_cmp);
scores.erase(
std::unique(scores.begin(), scores.end(), docid_eq),
scores.end()
);
// order by score
std::sort(scores.begin(), scores.end(), score_cmp);
Note that the use of reverse and stable_sort is because you want the last score for each doc_id, but std::unique keeps the first. If you wanted the first score you could just use stable_sort, and if you didn't care what score, you could just use sort.
The best way of handling this is probably to pass reverse iterators into std::unique, rather than a separate reverse operation. But I'm not confident I can write that correctly without testing, and errors might be really confusing, so you get the unoptimised code...
Edit: just for comparison with your code, here's how I'd use the map:
std::map<int, float> scoremap;
while ((row = data.next_row())) {
scoremap[row->docid] = calc_score(data.next_feature());
}
std::vector<score_pair> scores(scoremap.begin(), scoremap.end());
std::sort(scores.begin(), scores.end(), score_cmp);
Note that score_pair would need a constructor taking a std::pair<int,float>, which makes it non-POD. If that's not acceptable, use std::transform, with a function to do the conversion.
Finally, if there is much duplication (say, on average 2 or more entries per doc_id), and if calc_score is non-trivial, then I would be looking to see whether it's possible to iterate the rows of data in reverse order. If it is, then it will speed up the map/unordered_map approach, because when you get a hit for the doc_id you don't need to calculate the score for that row, just drop it and move on.
I'd go for a std::map of docids. If you could create an appropriate hashing function, a hash-map would be preferable. But I guess it's too difficult. And no - the std::map ist not too slow. Access is O(log n), which is nearly as good as O(1). O(1) is array access time (and Hashmap btw).
Btw, if std::map is too slow, qsort O(n log n) is too slow as well. And, using a std::map and iterating over it's contents, you can perhaps save your qsort.
Some additions for the comment (by onebyone):
I did not go for the implementation
details, since there wasn't enough
information on that.
qsort may behave bad with sorted data
(depending on the implementation).
Std::map may not. This is a real
advantage, especially if you read the
values from a database that might
output them ordered by key.
There was no word on the memory allocation strategy. Changing to a memory allocator with fast allocation of small objects may improve the performance.
Still - the fastest would be a hash map with an appropriate hash function. Since there's not enough information about the distribution of the keys, presenting one in this answer is not possible.
Short - if you ask general questions, you get general answers. This means - at least for me, looking at the time complexity in the O-Notation. Still you were right, depending on different factors, the std::map may be too slow while qsort is still fast enough - it may also be the other way round in the worst case of qsort, where it has n^2 complexity.
Unless I've misunderstood the question, the solution can be simplified considerably. At least as I understand it, you have a few million docid's (which are of type unsigned int) and for each unique docid, you want to store one 'score' (which is a float). If the same docid occurs more than once in the input, you want to keep the score from the last one. If that's correct, the code can be reduced to this:
std::map<unsigned, float> scores;
while ((row = data.next_row()))
scores[row->docid] = calc_score(data.next_feature());
This will probably be somewhat slower than your original version since it allocates a lot of individual blocks rather than one big block of memory. Given your statement that there are a lot of duplicates in the docid's, I'd expect this to save quite a bit of memory, since it only stores data for each unique docid rather than for every row in the original data.
If you wanted to optimize this, you could almost certainly do so -- since it uses a lot of small blocks, a custom allocator designed for that purpose would probably help quite a bit. One possibility would be to take a look at the small-block allocator in Andrei Alexandrescu's Loki library. He's done more work on the problem since, but the one in Loki is probably sufficient for the task at hand -- it'll almost certainly save a fair amount of memory and run faster as well.
If your C++ implementation has it, and most do, try hash_map instead of std::map (it's sometimes available under std::hash_map).
If the lookups themselves are your computational bottleneck, this could be a significant speedup over std::map's binary tree.
Why not sort by doc id first, calculate scores, then for any subset of duplicates use the max score?
On re-reading the question; I'd suggest a modification to how scores are read in. Keep in mind C++ isn't my native language, so this won't quite be compilable.
unsigned count = 0;
pair<int, score_pair>* scores = new pair<int, score_pair>[num_rows];
while ((row = data.next_row())) {
float score = calc_score(data.next_feature())
scores[count].second.score = score;
scores[count].second.doc_id = row->docid;
scores[count].first = count;
count++;
}
assert(count <= num_rows);
qsort(scores, count, sizeof(score_pair), pair_docid_cmp);
//getting number of unique scores
int scoreCount = 0;
for(int i=1; i<num_rows; i++)
if(scores[i-1].second.docId != scores[i].second.docId) scoreCount++;
score_pair* actualScores=new score_pair[scoreCount];
int at=-1;
int lastId = -1;
for(int i=0; i<num_rows; i++)
{
//if in first entry of new doc id; has the last read time by pair_docid_cmp
if(lastId!=scores[i].second.docId)
actualScores[++at]=scores[i].second;
}
qsort(actualScores, count, sizeof(score_pair), score_cmp);
Where pair_docid_cmp would compare first on docid; grouping same docs together, then second by reverse order read; such that the last item read is the first in the sublist of items with the same docid. Should only be ~5/2x memory usage, and ~double the execution speed.