QSqlQuery using hundreds of MB of memory - c++

I'm using a QSqlQuery object to do a linear walk of a 10-million-row SQL table, in sorted order, with a very complex six-column primary key. Because of the key (which I CANNOT change), breaking up my query SELECT * FROM table1 with < or > plus LIMIT causes a huge number of issues with the algorithm I'm using.
My problem is as follows: for whatever reason, QSqlQuery seems to cache the entire result set in memory until it hits a bad_alloc and kills the application. So I may read a couple hundred rows, seek() over a couple hundred thousand, and by that point QSqlQuery is using 300 MB of memory and my application dies. I read the docs and it seems the only thing that can be done is to use setForwardOnly(); however, I often need previous() (which is why breaking up the query with LIMIT is a PITA).
Is there no way to cap the cache for QSqlQuery?

Why don't you store the previous rows yourself? There's QContiguousCache, which seems ideal for this.
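Something along these lines, as a minimal sketch only (the ORDER BY columns k1..k6 and the 10,000-row capacity are placeholders, not part of your actual schema):

// Forward-only walk so the driver stops buffering the whole result set,
// combined with a bounded client-side cache to stand in for previous().
#include <QSqlQuery>
#include <QSqlRecord>
#include <QVariantList>
#include <QContiguousCache>

void walkTable()
{
    QSqlQuery query;
    query.setForwardOnly(true);                    // must be set before exec()
    query.exec("SELECT * FROM table1 ORDER BY k1, k2, k3, k4, k5, k6");

    QContiguousCache<QVariantList> recent(10000);  // keeps only the last 10,000 rows
    while (query.next()) {
        QVariantList row;
        const int cols = query.record().count();
        for (int i = 0; i < cols; ++i)
            row << query.value(i);
        recent.append(row);                        // older rows fall off the front

        // Instead of query.previous(), look back into the cache, e.g.:
        //   if (recent.containsIndex(recent.lastIndex() - 1))
        //       useRow(recent.at(recent.lastIndex() - 1));
    }
}

With setForwardOnly(true), drivers that support it stop buffering the full result set, so memory use is bounded by whatever you choose to keep in the cache.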

Related

Is it worth dropping stale columns on a large data set?

I have a relatively large data set in a table with about 60 columns, of which about 20 have gone stale. I've found a few posts on dropping multiple columns and the performance of DROP COLUMN, but nothing on whether or not dropping a bunch of columns would result in a noticeable performance increase.
Any insight as to whether or not something like this could have a perceptible impact?
Dropping one or more columns can be done in a single statement and is very fast. All it needs is a short ACCESS EXCLUSIVE lock on the table, so long running queries would block it.
The table is not rewritten during this operation, and it will not shrink. Subsequent rewrites (with VACUUM (FULL) or similar) will get rid of the column data.

Best way to store mappings in a database

Suppose I have an employees table (with around a million employees) and a tasks table (with a few hundred tasks).
Now, I have a mechanism to predict how probable (as a percentage) it is that an employee will complete a task -- let's say I have four such mechanisms, and each mechanism outputs its own probability.
Putting it all together, I now have n1 (employees) times n2 (tasks) times n3 (mechanisms) results to store.
I was wondering what would be the best way to store these results.
I have a few options and thoughts:
Maintain a column (JSONField) in either the employees or the tasks table -- Concern: I have to update the whole column value if one of the values changes
Maintain a third table, predictions, with foreign keys to employee and task and a column to store the predicted_probability -- Concern: I will have to store n1 * n2 * n3 records; I'm worried about scalability and performance
Thanks for any help.
PS: I'm using Django with postgres
The predictions table is the correct way to go. Depending on how you access the data, the size of the table won't matter; for example, I would expect that reading the predictions for a single employee has fairly constant performance. Large tables tend to be a problem only when you need to process all (or a large fraction) of the rows. If you hit a performance problem once you test this, you could, for example, partition that table by task, or by task and mechanism (depending on how your queries are structured).
- Credit to a_horse_with_no_name.

Why is Amazon Redshift UNLOAD performance much better for fresh data?

I wonder why unloading from a big table (>100 billion rows), when filtering by a column that is NOT the sort key or part of the sort key, is immensely faster for newly added data. How does Redshift understand that it is time to stop the sequential scan in the second scenario?
The query spent 39m 37.02s executing:
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-01-15\\' AND \\'2017-01-16\\'') TO ...
vs.
The query spent 23.01s executing:
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-06-24\\' AND \\'2017-06-25\\'') TO ...
Thanks!
Amazon Redshift uses zone maps to identify the minimum and maximum value stored in each 1 MB block on disk. Each block only stores data related to a single column (e.g. daytime).
If the SORTKEY is not set to daytime, then the data is unsorted and any particular date could appear in many different blocks. If SORTKEY is used, then a particular date will only appear in a minimum number of blocks.
Your second query possibly executes faster, even without a SORTKEY, because you are querying data that was probably added recently and is therefore all stored together in just a few blocks. The historical data might be spread across many blocks because a VACUUM probably reordered the data based upon the table's actual SORTKEY. In fact, if you did a VACUUM now, you might find that your second query becomes slower.

SQL Server CE Delete Performance

I am using SQL Server CE 4.0 and am getting poor DELETE query performance.
My table has 300,000 rows in it.
My query is:
DELETE from tableX
where columnName1 = '<some text>' AND columnName2 = '<some other text>'
I am using a non-clustered index on the 2 fields columnName1 and columnName2.
I noticed that when the number of rows to delete is small (say < 2000), the index can help performance by 2-3X. However, when the number of rows to delete is larger (say > 15000), the index does not help at all.
My theory for this behavior is that when the number of rows is large, the index maintenance is killing the gains achieved by using the index (index seek instead of table scan). Is this correct?
Unfortunately, I can't get rid of the index because it significantly helps non-mutating query performance.
Also, what else can I do to improve the delete performance for the > 15,000 row case?
I am using SQL Server CE 4.0 on Windows 7 (32-bit).
My application is written in C++ and uses the OLE DB interface to manipulate the database.
There is something known as "the tipping point" where the cost of locating individual rows using a seek is not worth it, and it is easier to just perform a single scan instead of thousands of seeks.
A couple of things you may consider for performance:
have a filtered index, if those are supported in CE (I honestly have no idea)
instead of deleting 15,000 rows at once, batch the deletes into chunks (see the sketch after this list).
consider a "soft delete" - where you simply update an active column to 0. Then you can actually delete the rows in smaller batches in the background. I mean, is a user really sitting around and waiting actively for you to delete 15,000+ rows? Why?

How to optimize writing this data to a postgres database

I'm parsing poker hand histories and storing the data in a Postgres database.
I'm getting relatively bad performance, and parsing files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization would make this a lot quicker.
The way I have it set up now is as follows:
1. Read the next file into a string.
2. Parse one game and store it in a GameData object.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, and store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round and store the moveids in an array.
6. Using the moveids array, insert a movesequence and store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger, coarser chunks to avoid repeating lots of work, and, most importantly, to do it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
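A rough shape of that, sketched with plain libpq (the import_checkpoint table, the batch size, and the surrounding structure are all made up for illustration, and error handling is minimal):

// Commit every N games and record a resumable checkpoint in the same transaction.
#include <libpq-fe.h>
#include <cstdio>

static void run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        std::fprintf(stderr, "%s", PQerrorMessage(conn));
    PQclear(res);
}

void importGames(PGconn *conn, long totalGames)
{
    const long batch = 1000;                  // games per transaction; tune to taste
    run(conn, "BEGIN");
    for (long g = 0; g < totalGames; ++g) {
        // ... parse game g and issue its INSERTs here ...

        if ((g + 1) % batch == 0) {
            char sql[128];
            std::snprintf(sql, sizeof sql,
                          "UPDATE import_checkpoint SET last_game = %ld", g);
            run(conn, sql);                   // progress marker commits atomically with the data
            run(conn, "COMMIT");
            run(conn, "BEGIN");
        }
    }
    run(conn, "COMMIT");
}

On restart you read last_game back and skip everything up to it, so a crash only costs you the current batch.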
It'll help to use multi-valued inserts where you're inserting multiple rows into the same table, e.g.:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.
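To make the last two suggestions concrete, here's a hedged sketch using libpq's COPY functions (PQputCopyData / PQputCopyEnd). The staging-table names follow the example above, the CSV formatting of each row is left to you, and error handling is deliberately thin:

// Stage rows in an UNLOGGED table created with LIKE, bulk-load it through
// COPY ... FROM STDIN, then move everything into the real table.
#include <libpq-fe.h>
#include <string>
#include <vector>

bool bulkLoad(PGconn *conn, const std::vector<std::string> &csvLines)
{
    PQclear(PQexec(conn, "CREATE UNLOGGED TABLE the_table_import (LIKE the_table)"));

    PGresult *res = PQexec(conn, "COPY the_table_import FROM STDIN WITH (FORMAT csv)");
    if (PQresultStatus(res) != PGRES_COPY_IN) { PQclear(res); return false; }
    PQclear(res);

    for (const std::string &line : csvLines) {
        const std::string row = line + "\n";
        if (PQputCopyData(conn, row.data(), (int)row.size()) != 1)
            return false;                      // stream one CSV row at a time
    }
    if (PQputCopyEnd(conn, nullptr) != 1)
        return false;
    PQclear(PQgetResult(conn));                // collect the COPY command's result

    PQclear(PQexec(conn, "INSERT INTO the_table SELECT * FROM the_table_import"));
    PQclear(PQexec(conn, "DROP TABLE the_table_import"));
    return true;
}

Whether you stage through UNLOGGED tables or through CSV files on disk, the point is the same: one COPY or one big INSERT ... SELECT is far cheaper than hundreds of thousands of single-row INSERTs.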