I have to execute an SQL query against Postgres with the following code. The query returns a huge number of rows (40M or more), each with 4 integer fields. On a workstation with 32GB everything works, but on a 16GB workstation the query is very slow (due to swapping, I guess). Is there any way to tell the C++ code to load rows in batches, without waiting for the entire dataset? With Java I never had these issues, presumably because of the better JDBC driver.
try {
    work W(*Conn);
    result r = W.exec(sql[sqlLoad]);
    W.commit();
    for (int rownum = 0; rownum < r.size(); ++rownum) {
        const result::tuple row = r[rownum];
        vid1 = row[0].as<int>();
        vid2 = row[1].as<int>();
        vid3 = row[2].as<int>();
        .....
    }
} catch (const std::exception &e) {
    std::cerr << e.what() << std::endl;
}
I am using PostgreSQL 9.3, and I found the single-row mode documented at http://www.postgresql.org/docs/9.3/static/libpq-single-row-mode.html, but I do not know how to use it from my C++ code. Your help will be appreciated.
EDIT: This query runs only once, to build the necessary in-memory data structures; as such, it cannot be avoided or optimized away. Also, pgAdminIII could easily fetch those rows in under a minute on the same PCs (including ones with less RAM), and Java could easily handle twice the number of rows (via Statement.setFetchSize(), http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize%28int%29). So it really looks like an issue with the libpqxx library, not an application issue. Is there a way to get this behaviour in C++, without explicitly setting limits/offsets manually?
Use a cursor?
See also FETCH. The cursor will use it for you behind the scenes, I gather, but just in case, you can always code the streaming retrieval manually with FETCH.
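For instance, a minimal sketch of that manual approach with libpqxx might look like the following; the cursor name, the batch size of 10000, and the table/column names are my own placeholders, not anything from the question:

// A sketch only (not the answerer's code): batching manually with DECLARE/FETCH
// through libpqxx. "mycur", the batch size and the table/column names are
// assumptions for illustration.
#include <pqxx/pqxx>

void stream_in_batches(pqxx::connection &conn)
{
    pqxx::work W(conn);
    // The cursor only exists inside this transaction.
    W.exec("DECLARE mycur NO SCROLL CURSOR FOR "
           "SELECT vid1, vid2, vid3, vid4 FROM mytable");
    for (;;) {
        pqxx::result r = W.exec("FETCH 10000 FROM mycur");  // one batch of rows
        if (r.empty())
            break;                                          // no rows left
        for (const auto &row : r) {
            int vid1 = row[0].as<int>();
            (void)vid1;  // ... process the remaining fields here ...
        }
    }
    W.exec("CLOSE mycur");
    W.commit();
}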
To answer my own question, I adapted "How to use pqxx::stateless_cursor class from libpqxx?":
try {
    work W(*Conn);
    pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
        cursor(W, sql[sqlLoad], "mycursor", false);
    /* Assume you know the total number of records returned */
    for (size_t idx = 0; idx < countRecords; idx += 100000) {
        /* Fetch 100,000 records at a time */
        result r = cursor.retrieve(idx, idx + 100000);
        for (size_t rownum = 0; rownum < r.size(); ++rownum) {
            const result::tuple row = r[rownum];
            vid1 = row[0].as<int>();
            vid2 = row[1].as<int>();
            vid3 = row[2].as<int>();
            .............
        }
    }
} catch (const std::exception &e) {
    std::cerr << e.what() << std::endl;
}
Cursors are a good place to start. Here's another cursor example, using a do-while() loop:
const std::string connStr("user=" + opt::dbUser + " password=" + opt::dbPasswd + " host=" + opt::dbHost + " dbname=" + opt::dbName);
pqxx::connection conn(connStr);
pqxx::work txn(conn);

std::string selectString = "SELECT id, name FROM table_name WHERE condition";

pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
    cursor(txn, selectString, "myCursor", false);

// cursor variables
size_t idx = 0;       // starting location
size_t step = 10000;  // number of rows for each chunk
pqxx::result result;

do {
    // get the next cursor chunk and update the index
    result = cursor.retrieve(idx, idx + step);
    idx += step;

    size_t records = result.size();
    std::cout << idx << ": records pulled = " << records << std::endl;

    for (const auto &row : result) {
        // iterate over the cursor rows here
    }
} while (result.size() == step);  // if result.size() != step, we're on our last loop

std::cout << "Done!" << std::endl;
I'm iterating over approximately 33 million rows in my application. In addition to using a cursor, I used the following approach:
1. Split the data into smaller chunks. For me, that was using bounding boxes to grab data in a given area.
2. Construct a query to grab that chunk, and use a cursor to iterate over it (a rough sketch follows this list).
3. Store the chunks on the heap and free them once you're done processing the data from a given chunk.
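A rough sketch of that pattern, assuming a hypothetical BoundingBox type and a "points" table with x/y columns (and using a plain exec() per chunk instead of a cursor, for brevity):

// Sketch only: BoundingBox, the "points" table and its columns are hypothetical;
// only the chunk -> query -> process -> free pattern comes from the answer above.
#include <pqxx/pqxx>
#include <string>
#include <vector>

struct BoundingBox { double xmin, ymin, xmax, ymax; };

void processInChunks(pqxx::connection &conn, const std::vector<BoundingBox> &boxes)
{
    pqxx::work txn(conn);
    for (const BoundingBox &b : boxes) {
        // Construct a query that grabs only the rows inside this bounding box.
        std::string sql =
            "SELECT id, x, y FROM points"
            " WHERE x BETWEEN " + std::to_string(b.xmin) + " AND " + std::to_string(b.xmax) +
            " AND y BETWEEN " + std::to_string(b.ymin) + " AND " + std::to_string(b.ymax);

        pqxx::result chunk = txn.exec(sql);  // this chunk's row data lives on the heap
        for (const auto &row : chunk) {
            // ... process one row of the chunk ...
        }
        // chunk goes out of scope here, freeing its memory before the next box.
    }
    txn.commit();
}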
I know this is a very late answer to your question, but I hope this might help someone!
Related
I've downloaded the Google diff-match-patch library for C++ Qt.
https://code.google.com/archive/p/google-diff-match-patch/
But I don't really understand how to use it for a simple comparison of two strings.
Let's assume I have two QStrings:
QString str1 = "Stackoverflow";
QString str2 = "Stackrflow";
As I understand it, I need to create a dmp object of the diff_match_patch class and then call the comparison method.
So what do I do to get, for example, "ove has been deleted from position 5"?
Usage is explained in the API wiki and diff_match_patch.h.
The position isn’t contained in the Diff object. To obtain it, you could iterate over the list and calculate the change position:
Unchanged substrings and deletes increment the position by the length of the unchanged/deleted substring.
Insertions do not alter positions in the original string.
Deletes followed by inserts are actually replacements. In that case the insert operation happens at the same position where the delete occurred, so that last delete should not increment the position.
i.e. something like this (untested):
// assumes #include "diff_match_patch.h" and <QDebug>, with str1/str2 from the question
diff_match_patch dmp;
auto diffResult = dmp.diff_main(str1, str2);
int equalLength = 0;
int deleteLength = 0;
int lastDeleteLength = 0; // for undoing the position offset for replacements
for (const auto &diff : diffResult) {
    if (diff.operation == Operation::EQUAL) {
        equalLength += diff.text.length();
        lastDeleteLength = 0;
    }
    else if (diff.operation == Operation::INSERT) {
        int pos = equalLength + deleteLength - lastDeleteLength;
        qDebug() << diff.toString() << "at position" << pos;
        lastDeleteLength = 0;
    }
    else if (diff.operation == Operation::DELETE) {
        qDebug() << diff.toString() << "at position" << equalLength + deleteLength;
        deleteLength += diff.text.length();
        lastDeleteLength = diff.text.length();
    }
}
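For the example strings in the question ("Stackoverflow" vs. "Stackrflow"), this loop should report a single DELETE of "ove" at position 5, which is exactly the kind of output you asked for (assuming the diff comes back as EQUAL "Stack", DELETE "ove", EQUAL "rflow").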
Hi, I am having trouble implementing a striping algorithm. I am also having a problem loading 30,000 records into one vector; I tried the code below, but it is not working.
The program should declare variables to store ONE RECORD at a time. It should read a record and process it, then read another record, and so on. Each process should ignore records that "belong" to another process. This can be done by keeping track of the record count and determining if the current record should be processed or ignored. For example, if there are 4 processes (numProcs = 4), process 0 should work on records 0, 4, 8, 12, ... (assuming we count from 0) and ignore all the other records in between.
Residence res;
int numProcs = 4;
int linesNum = 0;
int recCount = 0;
int count = 0;
while(count <= numProcs)
{
while(!residenceFile.eof())
{
++recCount;
//distancess.push_back(populate_distancesVector(res,foodbankData));
if(recCount % processIS == linesNum)
{
residenceFile >> res.x >>res.y;
distancess.push_back(populate_distancesVector(res,foodbankData));
}
++linesNum;
}
++count;
}
Updated code:
Residence res;
int numProcs = 1;
int recCount = 0;
while(!residenceFile.eof())
{
residenceFile >> res.x >>res.y;
//distancess.push_back(populate_distancesVector(res,foodbankData));
if ( recCount == processId)//process id
{
distancess.push_back(populate_distancesVector(res,foodbankData));
}
++recCount;
if(recCount == processId )
recCount = 0;
}
Updated pseudocode:
while(!residenceFile.eof())
{
residenceFile >> res.x >>res.y;
if ( recCount % numProcs == numLines)
{
distancess.push_back(populate_distancesVector(res,foodbankData));
}
else
++numLines
++recCount
}
You have tagged your post with MPI, but I don't see any place where you are checking a processor ID to see which record it should process.
Pseudocode for a solution to what I think you're asking:
While(there are more records){
If record count % numProcs == myID
ProcessRecord
else
Increment file stream pointer forward one record without processing
Increment Record Count
}
If you know the # of records you will be processing beforehand, then you can come up with a cleverer solution to move the filestream pointer ahead by numprocs records until that # is reached or surpassed.
A process that will act on records 0 and 4 must still read records 1, 2 and 3 (in order to get to 4).
Also, while(!residenceFile.eof()) isn't a good way to iterate through a file; it will run one extra iteration past the end. Do something like while(residenceFile >> res.x >> res.y) instead (see the sketch below).
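Here is a minimal sketch of the whole striping loop, combining that reading idiom with the modulo check from the pseudocode above. Residence, residenceFile, distancess, foodbankData and populate_distancesVector come from your code; processId and numProcs are assumed to have been obtained from MPI (e.g. MPI_Comm_rank / MPI_Comm_size):

// Sketch only: assumes processId and numProcs were set elsewhere,
// e.g. via MPI_Comm_rank(MPI_COMM_WORLD, &processId) and
// MPI_Comm_size(MPI_COMM_WORLD, &numProcs).
Residence res;
long recCount = 0;
while (residenceFile >> res.x >> res.y)   // stops cleanly at end of file
{
    // every process reads every record, but only processes "its own" records
    if (recCount % numProcs == processId)
        distancess.push_back(populate_distancesVector(res, foodbankData));
    ++recCount;
}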
As for making a vector that contains 30,000 records, it sounds like a memory limitation. Are you sure you need that many in memory at once?
EDIT:
Look carefully at the updated code. If the process ID (processId) is zero, the process will act on the first record and no other; if it is anything else, it will act on none of them.
EDIT:
Alas, I do not know Arabic. I will try to explain clearly in English.
You must learn a simple technique, before you attempt a difficult technique. If you guess at the algorithm, you will fail.
First, write a loop that iterates {0,1,2,3,...} and prints out all of the numbers:
int i=0;
while(i<10)
{
cout << i << endl;
++i;
}
Understand this before going farther. Then write a loop that iterates the same way, but prints out only {0,4,8,...}:
int i=0;
while(i<10)
{
if(i%4==0)
cout << i << endl;
++i;
}
Understand this before going farther. Then write a loop that prints out only {1,5,9,...}. Then write a loop that reads the file, and reports on every record. Then combine that with the logic from the previous exercise, and report on only one record out of every four.
Start with something small and simple. Add complexity in small measures. Develop new techniques in isolation. Test every step. Never add to code that doesn't work. This is the way to write code that works.
I have a database with a table containing ~150 million rows. The columns are just:
id (INTEGER), value_one (INTEGER), value_two (INTEGER), value_3 (INTEGER)
I need to import all this data into a QList, but I'm running into a problem where Qt asserts qAllocMore: 'Requested size is too large!', file tools\qbytearray.cpp, line 73 when I run a SELECT query. I'm able to run the same code on a table containing ~7 million entries, and it works without error.
This is my SELECT statement:
bool e = query.exec("SELECT * FROM DocumentTerms");
if (!e) {
qWarning() << __FUNCTION__ << "\tError: " << query.lastError().text();
}
while (query.next()) {
int docId = query.value(1).toInt();
int termId = query.value(2).toInt();
int frequency = query.value(3).toInt();
//store it in a QHash<int, QPair<int, int>>
}
It looks like it's iterating through the query.next loop, but the assert pops up after ~16 million iterations. Any idea what's causing it?
My previous answer was nonsense; stupid calculation bug. However, I think I now have the solution. What you are running out of is not memory in general, but contiguous memory.
I have tried the following:
QList<int> testlist;
for(int i = 0; i < 150000000;++i){
testlist << i << i << i << i;
}
A stupid little piece of code that does nothing but put 4 ints into a list 150,000,000 times.
I get after a few seconds:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
More or less: Out of memory.
Now I change the code above:
QList<int> testlist;
testlist.reserve(150000000*4);
for(int i = 0; i < 150000000;++i){
testlist << i << i << i << i;
}
This code does nothing different from the one before; the QList ends up exactly the same size. However, I reserve all the memory before starting the loop. The result? The list does not need to grow and constantly request more memory. With this version I had no problem at all; I got my list.
I'm learning libpqxx, the C++ API to PostgreSQL. I'd like to use the pqxx::stateless_cursor class, but 1) I find the Doxygen output unhelpful in this case, and 2) the pqxx.org website has been down for some time now.
Anyone know how to use it?
I believe this is how I construct one:
pqxx::stateless_cursor <pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
cursor( work, "SELECT * FROM mytable", ?, ? );
The last two parameters are called cname and hold, but they are not documented.
And once the cursor is created, how would I go about using it in a for() loop to get each row, one at a time?
Thanks @Eelke for the comments on cname and hold.
I figured out how to make pqxx::stateless_cursor work. I have no idea if there is a cleaner or more obvious way but here is an example:
pqxx::work work( conn );
pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
cursor( work, "SELECT * FROM mytable", "mycursor", false );
for ( size_t idx = 0; true; idx ++ )
{
pqxx::result result = cursor.retrieve( idx, idx + 1 );
if ( result.empty() )
{
// nothing left to read
break;
}
// Do something with "result" which contains a single
// row in this example since we told the cursor to
// retrieve row #idx (inclusive) to idx+1 (exclusive).
std::cout << result[ 0 ][ "name" ].as<std::string>() << std::endl;
}
I do not know the pqxx library, but based on the underlying DECLARE command of PostgreSQL I would guess:
That cname is the name of the cursor, so it can be anything PostgreSQL normally accepts as a cursor name.
That hold refers to the WITH HOLD option of a cursor, from the docs:
WITH HOLD specifies that the cursor can continue to be used after the transaction that created it successfully commits. WITHOUT HOLD specifies that the cursor cannot be used outside of the transaction that created it. If neither WITHOUT HOLD nor WITH HOLD is specified, WITHOUT HOLD is the default.
Here's another cursor example, using a do-while() loop:
const std::string connStr("user=" + opt::dbUser + " password=" + opt::dbPasswd + " host=" + opt::dbHost + " dbname=" + opt::dbName);
pqxx::connection conn(connStr);
pqxx::work txn(conn);

std::string selectString = "SELECT id, name FROM table_name WHERE condition";

pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
    cursor(txn, selectString, "myCursor", false);

// cursor variables
size_t idx = 0;       // starting location
size_t step = 10000;  // number of rows for each chunk
pqxx::result result;

do {
    // get the next cursor chunk and update the index
    result = cursor.retrieve(idx, idx + step);
    idx += step;

    size_t records = result.size();
    std::cout << idx << ": records pulled = " << records << std::endl;

    for (const auto &row : result) {
        // iterate over the cursor rows here
    }
} while (result.size() == step);  // if result.size() != step, we're on our last loop

std::cout << "Done!" << std::endl;
I am using C++ with ADO to connect to a MySQL database. I am using the standard ADO/C++ method to create the connection, and recordset is the pointer to the first retrieved record:
_RecordsetPtr recordset;
recordset->Open("Select * from table",p_connection_.GetInterfacePtr(),adOpenForwardOnly,adLockReadOnly,adCmdText);
My concern is that if the table contains too many records and I query all of them, it will consume a lot of memory.
I want to retrieve only, say, 100 records at a time and process them. Is that possible? The table does not have an id or index column, so "Select * from table where id >= 1 and id <= 100" does not work.
You will want to use LIMIT/OFFSET on the query and cycle through the pages.
// Pages through the table 100 rows at a time: SELECT * FROM table LIMIT 100 OFFSET n
// (both recordsets are assumed to have been created with CreateInstance elsewhere)
_RecordsetPtr recordset, countSet;

countSet->Open("SELECT COUNT(*) FROM table", p_connection_.GetInterfacePtr(),
               adOpenForwardOnly, adLockReadOnly, adCmdText);
long totalRows = countSet->Fields->GetItem((long)0)->Value;  // total number of records
countSet->Close();

for (long offset = 0; offset < totalRows; offset += 100)
{
    std::stringstream sstm;  // fresh stream each pass, so queries don't get appended together
    sstm << "SELECT * FROM table LIMIT 100 OFFSET " << offset;
    std::string query = sstm.str();

    recordset->Open(query.c_str(), p_connection_.GetInterfacePtr(),
                    adOpenForwardOnly, adLockReadOnly, adCmdText);
    // suggest passing the recordset to a function to do whatever you want with it here
    recordset->Close();
}
Note that if you are not using a database that starts its records off at 1 you will have to modify that algorithm a bit.