Large Qt SQLite SELECT statement - C++

I have a database with a table containing ~150 million rows. The columns are just:
id (INTEGER), value_one (INTEGER), value_two (INTEGER), value_3 (INTEGER)
I need to import all this data into a QList, but I'm running into a problem where Qt asserts with qAllocMore: 'Requested size is too large!', file tools\qbytearray.cpp, line 73 while running a SELECT query. The same code works without error on a table containing ~7 million rows.
This is my SELECT statement:
bool e = query.exec("SELECT * FROM DocumentTerms");
if (!e) {
    qWarning() << __FUNCTION__ << "\tError: " << query.lastError().text();
}
while (query.next()) {
    int docId = query.value(1).toInt();
    int termId = query.value(2).toInt();
    int frequency = query.value(3).toInt();
    //store it in a QHash<int, QPair<int, int>>
}
It looks like it's iterating through the query.next loop, but the assert pops up after ~16 million iterations. Any idea what's causing it?

My previous answer was nonsense; a silly calculation bug. However, I think I now have the solution: it is not memory in general that you are missing, but contiguous memory.
I have tried the following:
QList<int> testlist;
for (int i = 0; i < 150000000; ++i) {
    testlist << i << i << i << i;
}
A silly little piece of code; it does nothing but put 4 ints into a list 150,000,000 times.
I get after a few seconds:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
More or less: Out of memory.
Now I change the code above:
QList<int> testlist;
testlist.reserve(150000000 * 4);
for (int i = 0; i < 150000000; ++i) {
    testlist << i << i << i << i;
}
This code does nothing different from the previous version, and the QList ends up exactly the same size. However, all of the memory is reserved before the loop starts, so the list never needs to grow and repeatedly request a larger contiguous block. With this version I had no problem at all: I got my list.
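Applied back to the original Qt loop, the same idea would look roughly like this. This is only a sketch: QSqlQuery::setForwardOnly() and QHash::reserve() are real Qt calls, but the key choice (docId) is my assumption from the question, and I have not verified that this alone avoids the qAllocMore assert at 150 million rows.

#include <QtSql>
#include <QHash>
#include <QPair>
#include <QDebug>

void loadTerms()   // hypothetical helper, not from the question
{
    QSqlQuery query;                 // uses the default database connection
    query.setForwardOnly(true);      // let the driver discard rows it has already handed over
    QHash<int, QPair<int, int> > terms;
    terms.reserve(150000000);        // reserve buckets up front, same idea as reserving the QList above
    if (!query.exec("SELECT * FROM DocumentTerms")) {
        qWarning() << __FUNCTION__ << "\tError: " << query.lastError().text();
        return;
    }
    while (query.next()) {
        int docId     = query.value(1).toInt();
        int termId    = query.value(2).toInt();
        int frequency = query.value(3).toInt();
        terms.insert(docId, qMakePair(termId, frequency));   // assumed key: docId
    }
}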

Build different number of vectors/maps during runtime, to insert into BST - C++

The problem: Earlier, the requirement for this C++ program was to deal with just one file's input (each line represents the average of 10 minutes of weather detail, about 50k lines). The end user wanted to be able to find the average of the weather attributes for: a) a specified month and year, b) each month of a specified year, c) the total for each month of a specified year, and d) the average for each month of a specified year, output to a .csv file.
Example: (First 4 lines of input csv)
WAST,DP,Dta,Dts,EV,QFE,QFF,QNH,RF,RH,S,SR,ST1,ST2,ST3,ST4,Sx,T
1/01/2010 9:00,8,151,23,0.1,1014.6,1018.1,1018.2,0,43.3,7,813,23.6,27,26.9,25.4,10,20.98
1/01/2010 9:10,8.7,161,28,0.1,1014.6,1018.1,1018.2,0,44.4,6,845,23.7,26.9,26.9,25.4,10,21.37
1/01/2010 9:20,8.9,176,21,0.2,1014.6,1018.1,1018.2,0,43.4,6,877,23.8,26.9,26.9,25.4,9,21.96
Solution: Since not all the data from each line was required, each line is read and parsed, and the relevant fields are built into an instance of a 'Weather' object, which consists of:
Date m_dateObj;
Time m_timeObj;
float m_windSpeed;
float m_solarRadiation;
float m_airTemperature;
A vector of Weather objects was made to hold this information.
Now the problem has expanded to multiple files (150K-500K lines of data). Reading in multiple files is fine; all the data is retrieved and converted to Weather objects with no problems. I'm just having trouble with the design (more specifically the syntax aspect of it; I know what I want to do). Additionally, a new option has been introduced where the user enters dd/mm/yy and the instances of highest solarRadiation for that day are output (this requires access to each specific Weather object, so I can't just store aggregates).
BST and maps are mandatory, so what I thought was: data is read in line by line, and each line is converted into a Weather object and stored in a vector specific to that month+year, so for every month of every year there is a different vector, e.g. jan2007, feb2007, jan2008, etc., and each of these vectors is stored in a map:
map<pair<int, int>, vector<Weather> > monthMap;
So it looks like
<pair<3,2007>, march2007Vec>
and these maps are stored in the BST (which I would need to randomize, since it's sorted data, to avoid turning my BST into a linked list; any tips on how to do that? I found snippets for self-balancing trees that I might implement). This should work, as the key for each map is unique, thus making all BST nodes unique.
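For the "randomize the insertion order" part, one cheap approach is to shuffle the keys before inserting them. A minimal sketch follows; the Bst type and its insert() are placeholders for whatever tree class the assignment requires.

#include <algorithm>
#include <random>
#include <utility>
#include <vector>

// Placeholder BST; only the insert() signature matters for this sketch.
template <typename Key>
struct Bst { void insert(const Key&) { /* ... */ } };

void buildTree(Bst<std::pair<int, int> >& tree,
               const std::vector<std::pair<int, int> >& monthYearKeys)
{
    std::vector<std::pair<int, int> > keys(monthYearKeys);   // copy so the original order is untouched
    std::mt19937 gen(std::random_device{}());
    std::shuffle(keys.begin(), keys.end(), gen);             // break the sorted order
    for (const auto& k : keys)
        tree.insert(k);                                      // random insertion order keeps a plain BST roughly balanced on average
}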
So it would look like this -
User runs program
Program opens files (there is a txt file with file names in it)
For each file
Open file
For each line
Convert into weather Object
Check month+year,
if map for combination exists,
add to that vector (eg march2007)
else
create new vector store in new map
Close file
add all maps to BST
BST will self sort
Provide user with menu to choose from
The actual computation of what the user needs is pretty simple. I just need help figuring out how to make it so there are n maps and vectors (n = number of maps = number of vectors, I think), since I don't know in advance how many months/years there will be.
Heres a snippet of my code to get a better understanding of what I'm trying to do:
int main()
{
    vector<Weather> monthVec;
    map<pair<int, int>, vector<Weather> > monthMap;
    map<pair<int, int>, vector<Weather> >::iterator itr;
    int count = 0;
    bool found = false;
    Weather weatherObj;
    ifstream weatherFileList;

    weatherFileList.open("data/met_index.txt");
    if (weatherFileList.is_open())
    {
        cout << "Success";
        while (!weatherFileList.eof())
        {
            string data;
            string fileName;
            getline(weatherFileList, fileName);
            cout << fileName << endl;
            fileName = "data/" + fileName;
            cout << fileName << endl;

            ifstream weatherFile;
            weatherFile.open(fileName.c_str());
            getline(weatherFile, data);   // skip the CSV header line
            while (!weatherFile.eof())
            {
                getline(weatherFile, data);
                if (!data.empty())
                {
                    weatherObj = ConvertData(data);
                    //cout << count << " " << weatherObj.GetTime().ToString() << endl;
                    //monthVec.push_back(weatherObj);
                    // for (itr = monthMap.begin(); itr != monthMap.end(); ++itr)
                    // {
                    //
                    // }
                    int month = weatherObj.GetDate().GetMonth();
                    int year = weatherObj.GetDate().GetYear();
                    itr = monthMap.find(make_pair(month, year));
                    if (itr != monthMap.end())
                    {
                        monthVec = itr->second;
                        monthVec.push_back(weatherObj);
                    }
                    else
                    {
                        // this is where I'm stuck: create a new vector for this
                        // month+year combination and store it in the map
                    }
                    count++;
                }
                //cout << data << endl;
            }
            weatherFile.close();
        }
        listOptions();
    }
    else
    {
        cout << "Not open";
    }
    cout << count << endl;
    cout << monthVec.size() << "/" << monthVec.capacity();
    return 0;
}
Apologies for the untidy code. I was thinking about how to make it so that for every new month+year combination a new vector is placed in the map, but because of my inexperience I don't know how to write the syntax for it, or even how to search for it effectively.
TLDR: I need to map an unknown number of <pair<month, year>, vector<Weather>> combinations.
Would one make a switch case with 12 hardcoded vectors, one for each month, and just store all February data (2007, 2008, 2009, etc.) in the same vector? That would mean a lot of unnecessary processing.
How would one create different vectors without actually giving each a unique name to reference in the code (<3,2007>, March2007)?
How would one retrieve the contents of a vector whose name we don't know? Sure, we know the key is 03 2007, a.k.a. March 2007, but wouldn't we need an explicit name to access the vector (march2007.find()), which is inside a map? (See the sketch below.)
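For what it's worth, here is a minimal sketch of the pattern the last two questions describe: std::map::operator[] default-constructs an empty vector the first time a (month, year) key is seen, so no vector ever needs its own name. The Weather struct below is a hypothetical stand-in for the real class.

#include <map>
#include <utility>
#include <vector>

struct Weather { int month; int year; /* other fields */ };   // stand-in for the real class

int main()
{
    std::map<std::pair<int, int>, std::vector<Weather> > monthMap;

    Weather w;
    w.month = 3;
    w.year  = 2007;

    // operator[] creates the empty vector for a new (month, year) key automatically.
    monthMap[std::make_pair(w.month, w.year)].push_back(w);

    // Retrieval also works by key; no named vector such as "march2007" is needed.
    std::vector<Weather>& march2007 = monthMap[std::make_pair(3, 2007)];
    return march2007.empty() ? 1 : 0;
}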
Thanks for the read, and potential help!
Please do Direct Message me if you'd like to see the problem in more detail, I would be grateful!

Displaying content with a for loop

I'm trying to write a program that randomly selects three items from an array containing five different fruits or vegetables and then displays this randomly selected content to the user. I'm having trouble understanding why my output is not consistent, because sometimes when I run it I'll get something like this:
This bundle contains the following:
Broccoli
With the first two items missing and sometimes I'll get this:
This bundle contains the following:
Tomato
Tomato
Tomato
This is the portion of the code I'm currently having trouble with:
void BoxOfProduce::output() {
    cout << "This bundle contains the following: " << endl;
    // Use this so it will be truly random
    srand(time(0));
    for (int f = 0; f < 3; f++) {
        // Here we're making the random number
        int boxee = rand() % 5;
        // Adding the content to the box
        Box[f] = Box[boxee];
        // Now we're calling the function to display the content
        displayWord(boxee);
    } // End of for loop
} // End of output()

void BoxOfProduce::displayWord(int boxee) {
    cout << Box[boxee] << endl;
}

int main() {
    BoxOfProduce b1;
    b1.input();
    b1.output();
}
Can someone help me understand why I'm getting this output? Thanks!
Don't do it like you are doing it :)
As #John3136 pointed out, you are messing up your Box variable.
void BoxOfProduce::output()
{
    srand(time(NULL)); // keep in mind that this is NOT entirely random!
    int boxee = rand() % 5;
    int i = 0;
    while (i < 5)
    {
        boxee = rand() % 5;
        cout << Box[boxee] << endl; // this line might be wrong, my point is to print the element of your array at index boxee
        i++;
    }
}
Box[f] = Box[boxee]; is changing the contents of the "Box" you are picking things out of. If the first random number is 3, item 3 gets copied to item 0, so now you have twice as much chance of getting that item the next time through the loop...
You are overwriting the elements of your array with randomly selected item.
Box[f] = Box[boxee];
For example: if boxee=1 and f=0, it will overwrite the element at index 0 with the element at index 1, while the element at index 1 stays the same, leaving two copies of the same item.
Use std::random_shuffle instead.
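For reference, std::random_shuffle was deprecated in C++14 and removed in C++17, so std::shuffle with a <random> engine is the modern spelling of the same idea. A minimal sketch (the box contents are placeholders, not the asker's actual data):

#include <algorithm>
#include <iostream>
#include <random>
#include <string>

int main()
{
    // Placeholder stand-in for the Box member from the question.
    std::string box[5] = { "Broccoli", "Tomato", "Kiwi", "Kale", "Carrot" };

    std::random_device rd;
    std::mt19937 gen(rd());
    std::shuffle(std::begin(box), std::end(box), gen);   // reorder all five items

    std::cout << "This bundle contains the following:" << std::endl;
    for (int f = 0; f < 3; f++)   // the first three are now a random, duplicate-free pick
        std::cout << box[f] << std::endl;
    return 0;
}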

Large vector "Segmentation fault" error

I have gathered a large amount of extremely useful information from other peoples' questions and answers on SO, and have searched duly for an answer to this one as well. Unfortunately I have not found a solution to this problem.
The following function to generate a list of primes:
void genPrimes (std::vector<int>* primesPtr, int upperBound = 10)
{
    std::ofstream log;
    log.open("log.txt");
    std::vector<int>& primesRef = *primesPtr;

    // Populate primes with non-neg reals
    for (int i = 2; i <= upperBound; i++)
        primesRef.push_back(i);
    log << "Generated reals successfully." << std::endl;
    log << primesRef.size() << std::endl;

    // Eratosthenes sieve to remove non-primes
    for (int i = 0; i < primesRef.size(); i++) {
        if (primesRef[i] == 0) continue;
        int jumpStart = primesRef[i];
        for (int jump = jumpStart; jump < primesRef.size(); jump += jumpStart) {
            if (primesRef[i+jump] == 0) continue;
            primesRef[i+jump] = 0;
        }
    }
    log << "Executed Eratosthenes Sieve successfully.\n";

    for (int i = 0; i < primesRef.size(); i++) {
        if (primesRef[i] == 0) {
            primesRef.erase(primesRef.begin() + i);
            i--;
        }
    }
    log << "Cleaned list.\n";
    log.close();
}
is called by:
const int SIZE = 500;
std::vector<int>* primes = new std::vector<int>[SIZE];
genPrimes(primes, SIZE);
This code works well. However, when I change the value of SIZE to a larger number (say, 500000), the program crashes with a segmentation fault. I'm not familiar enough with vectors to understand the problem. Any help is much appreciated.
You are accessing primesRef[i + jump] where i could be primesRef.size() - 1 and jump could be primesRef.size() - 1, leading to an out of bounds access.
It is happening with the 500 limit as well; you just happen not to see any bad side effects from the out-of-bounds access at the moment.
Also note that using a vector here is a bad choice as every erase will have to move all of the following entries in memory.
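If the vector is kept anyway, the per-element erase loop at the end can be collapsed into a single pass with the erase-remove idiom, which moves each surviving element at most once (a sketch, not the original poster's code):

#include <algorithm>
#include <vector>

void removeZeros(std::vector<int>& primes)
{
    // std::remove packs the non-zero values to the front and returns the new
    // logical end; erase then drops the leftover tail in one operation.
    primes.erase(std::remove(primes.begin(), primes.end(), 0), primes.end());
}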
Are you sure you wanted to do
new std::vector<int> [500];
and not
new std::vector<int> (500);
In the latter case, you are specifying the size of the vector, whose location is available to you via the variable named 'primes'.
In the former, you are requesting space for 500 vectors, each sized to the default that the STL library wants.
That would be something like 24*500 bytes on my system. In the latter case, a single vector of length 500 is what you are asking for.
EDIT: look at the usage - he needs just one vector.
std::vector<int>& primesRef = *primesPtr;
The problem lies here:
// Populate primes with non-neg reals
for (int i = 2; i <= upperBound; i++)
primesRef.push_back(i);
You only have N-2 elements in your vector pushed back, but then try to access an element at N-1 (i+jump). The fact that it did not fail on 500 is just dumb luck that the memory being overwritten was not catastrophic.
This code works well. However, when I change the value of SIZE to a larger number (say, 500000), ...
That may blow your memory, being too big to allocate like that. You need dynamic memory allocation for all of the std::vector<int> instances you believe you need.
To achieve that, simply use a nested std::vector like this
std::vector<std::vector<int>> primes(SIZE);
instead.
But to be clear, I seriously doubt you need SIZE vector instances to store all of the prime numbers found; you just need a single one, initialized like this:
std::vector<int> primes(SIZE);
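Putting the suggestions together (a single vector plus strictly in-bounds indexing), a corrected sieve might look roughly like this. It keeps one flag per number instead of erasing entries; this is only a sketch, not the original poster's exact design.

#include <fstream>
#include <vector>

std::vector<int> genPrimes(int upperBound = 10)
{
    // One flag per candidate number; true means "still considered prime".
    std::vector<bool> isPrime(upperBound + 1, true);
    std::vector<int> primes;

    for (int i = 2; i <= upperBound; ++i) {
        if (!isPrime[i])
            continue;
        primes.push_back(i);
        // Mark multiples of i, never indexing past upperBound.
        for (long long j = static_cast<long long>(i) * i; j <= upperBound; j += i)
            isPrime[j] = false;
    }
    return primes;
}

int main()
{
    std::vector<int> primes = genPrimes(500000);   // all indexing stays in range, so no segfault
    std::ofstream log("log.txt");
    log << primes.size() << " primes found\n";
    return 0;
}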

C++, Postgres , libpqxx huge query

I have to execute an SQL query against Postgres using the following code. The query returns a huge number of rows (40M or more) and has 4 integer fields. When I use a workstation with 32 GB everything works, but on a 16 GB workstation the query is very slow (due to swapping, I guess). Is there any way to tell the C++ code to load rows in batches, without waiting for the entire dataset? With Java I never had these issues, probably thanks to the better JDBC driver.
try {
    work W(*Conn);
    result r = W.exec(sql[sqlLoad]);
    W.commit();
    for (int rownum = 0; rownum < r.size(); ++rownum) {
        const result::tuple row = r[rownum];
        vid1 = row[0].as<int>();
        vid2 = row[1].as<int>();
        vid3 = row[2].as<int>();
        .....
} catch (const std::exception &e) {
    std::cerr << e.what() << std::endl;
}
I am using PostgreSQL 9.3, where I see this: http://www.postgresql.org/docs/9.3/static/libpq-single-row-mode.html, but I do not know how to use it from my C++ code. Your help will be appreciated.
EDIT: This query runs only once, to create the necessary main-memory data structures, so the query itself cannot be optimized away. Also, pgAdminIII could easily fetch those rows in under one minute on the same (or smaller-RAM) PCs, and Java could easily handle twice the number of rows (with Statement.setFetchSize(), http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize%28int%29). So it really is an issue with the libpqxx library and not an application issue. Is there a way to get this functionality in C++, without explicitly setting limits/offsets manually?
Use a cursor?
See also FETCH. The cursor will use it for you behind the scenes, I gather, but just in case, you can always code the streaming retrieval manually with the FETCH.
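A rough sketch of that manual route, issuing plain SQL DECLARE/FETCH through a pqxx transaction; the connection string, table, and column names are placeholders, and the batch size is arbitrary:

#include <iostream>
#include <pqxx/pqxx>

int main()
{
    pqxx::connection conn("dbname=mydb user=me");   // placeholder connection string
    pqxx::work txn(conn);

    // A non-holdable cursor lives inside the transaction and hands rows out on demand.
    txn.exec("DECLARE bigcur CURSOR FOR SELECT vid1, vid2, vid3 FROM big_table");

    for (;;) {
        pqxx::result chunk = txn.exec("FETCH 100000 FROM bigcur");   // one batch at a time
        if (chunk.empty())
            break;                                                   // no rows left
        for (const auto &row : chunk) {
            int vid1 = row[0].as<int>();
            int vid2 = row[1].as<int>();
            int vid3 = row[2].as<int>();
            // ... build the in-memory structures here ...
            (void)vid1; (void)vid2; (void)vid3;                      // silence unused warnings in this sketch
        }
    }

    txn.exec("CLOSE bigcur");
    txn.commit();
    return 0;
}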
To answer my own question, I adapted How to use pqxx::stateless_cursor class from libpqxx?
try {
    work W(*Conn);
    pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
        cursor(W, sql[sqlLoad], "mycursor", false);
    /* Assume you know total number of records returned */
    for (size_t idx = 0; idx < countRecords; idx += 100000) {
        /* Fetch 100,000 records at a time */
        result r = cursor.retrieve(idx, idx + 100000);
        for (int rownum = 0; rownum < r.size(); ++rownum) {
            const result::tuple row = r[rownum];
            vid1 = row[0].as<int>();
            vid2 = row[1].as<int>();
            vid3 = row[2].as<int>();
            .............
        }
    }
} catch (const std::exception &e) {
    std::cerr << e.what() << std::endl;
}
Cursors are a good place to start. Here's another cursor example, using a do-while()
const std::string connStr("user=" + opt::dbUser + " password=" + opt::dbPasswd + " host=" + opt::dbHost + " dbname=" + opt::dbName);
pqxx::connection conn(connStr);
pqxx::work txn(conn);

std::string selectString = "SELECT id, name FROM table_name WHERE condition";
pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
    cursor(txn, selectString, "myCursor", false);

// cursor variables
size_t idx = 0;       // starting location
size_t step = 10000;  // number of rows for each chunk
pqxx::result result;

do {
    // get the next cursor chunk and update the index
    result = cursor.retrieve(idx, idx + step);
    idx += step;
    size_t records = result.size();
    cout << idx << ": records pulled = " << records << endl;

    for (const auto &row : result) {
        // iterate over cursor rows
    }
}
while (result.size() == step);   // if result.size() != step, we're on the last chunk

cout << "Done!" << endl;
I'm iterating over approximately 33 million rows in my application. In addition to using a cursor, I used the following approach:
- Split the data into smaller chunks. For me, that was using bounding boxes to grab data in a given area.
- Construct a query to grab that chunk, and use a cursor to iterate over it.
- Store the chunks on the heap and free them once you're done processing the data from a given chunk.
I know this is a very late answer to your question, but I hope this might help someone!

Why do i get a repeated QStringList?

I am writing a Qt application that deals with scheduling employees. The header data for the main QTableView is a pointer to a QStringList. The headerData() function works correctly, but when I add a string to the list elsewhere, it appends the entire list, including the new string, to the end of the list.
For example, if I have the list 1,2,3 and I append 4 to it, then iterating through the list via the pointer gives the result 1,2,3,1,2,3,4. I don't know a better way than using pointers to let multiple classes access the same data. Does anyone know how to fix the repeating list?
Example Code
// function to save a new employee in memory
bool EmployeeViewDialog::saveEmployee(Employee *e)
{
    employees->insert(e->name, e);
    *employeeNames << e->name;
    for (int i = 0; i < employeeNames->length(); i++) {
        qDebug() << employeeNames->at(i);
    }

    QList<QStandardItem*> items;
    items << new QStandardItem(e->name);
    items << new QStandardItem(e->id);
    items << new QStandardItem(e->phone);
    items << new QStandardItem(e->email);
    model->appendRow(items);

    return true;
}
The append was just changed to the << method; it is the *employeeNames << e->name; line.
The for loop iterates through the list and does the same thing as what happens in the external class.