We are facing a performance issue when reading several thousand documents. We have a Ruby on Rails application that reads data stored in MongoDB; we use Mongoid. Each stored document contains 17 fields (15 floats and 2 integers). We execute a query that is supported by an index. The cursor returns very fast (<50ms), but reading all the documents takes more than 500ms.
To find the bottleneck we ran the same query in the Mongo shell: it took <50ms to complete and iterate over all rows in the result set. We then tested the Mongo Ruby driver, and the query took 250ms to complete. We got the same result using Moped. Finally, we wrote a C++ application using the Mongo C++ driver; the time to iterate over the whole result set was <20ms. But if we unpack the received BSON objects (to output them to the console), the time rises to 120ms.
Does extraction from BSON take that much time?
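For reference, here is a minimal sketch of that kind of measurement using the legacy Mongo C++ driver (the "test.docs" namespace, the match-all query, and the summing of numeric fields are placeholders, not the actual schema):

#include <mongo/client/dbclient.h>
#include <iostream>
#include <ctime>

int main() {
    mongo::client::initialize();                 // not needed on very old driver versions
    mongo::DBClientConnection conn;
    conn.connect("localhost");

    std::clock_t start = std::clock();
    std::auto_ptr<mongo::DBClientCursor> cursor =
        conn.query("test.docs", mongo::Query()); // placeholder namespace, match-all query

    double sum = 0.0;
    while (cursor->more()) {
        mongo::BSONObj doc = cursor->next();     // cheap: only advances over the raw BSON
        // Extraction: walking the elements forces every BSON value to be decoded.
        mongo::BSONObjIterator it(doc);
        while (it.more()) {
            mongo::BSONElement e = it.next();
            if (e.isNumber())
                sum += e.numberDouble();
        }
    }
    std::cout << "sum=" << sum
              << " elapsed(ms)=" << 1000.0 * (std::clock() - start) / CLOCKS_PER_SEC
              << std::endl;
    return 0;
}

Comparing runs with and without the inner extraction loop separates cursor/network time from BSON decoding time.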
First of all, I'm very new to Informatica PowerCenter and PowerExchange.
We are using Informatica PowerCenter and PowerExchange to receive CDC data from our source DB2 into a PostgreSQL DB. For this we have one workflow in which 7 tables are mapped, and the results arrive in our PostgreSQL database. It works fine so far, but performance is lacking. The size of the data is not the problem; it's the delay before I see the results in the target DB.
When I insert or delete some data in DB2 (say, 10 rows in one database), I usually see the results in our PostgreSQL in about 10-30 seconds (very rarely in less than 5 seconds).
My goal is to reduce this delay. Is this possible? What would I need to do for that?
I played a little with the commit interval and the DTM buffer size, but neither helped much.
I also have the feeling that when I configure the workflow to run continuously, it's even slower than when I execute the workflow manually after making the inserts/deletes.
Thanks in advance
I am reading this paper: "Need for Speed - Boost Performance in Data Processing with SAS/Access® Interface to Oracle". I would like to know how to clear the cache/buffer in SAS so that my repeated query tests accurately reflect my changes.
I noticed that the same query takes 10 seconds the first time it runs, and running it again immediately afterwards (without changes) takes much less time (say 1-2 seconds). Is there a command or instruction to clear the cache/buffer, so I can have a clean test for my new changes?
I am using SAS Enterprise Guide with data hosted on an Oracle server. Thanks!
In order to flush caches on the Oracle side, you need both DBA privileges (to run alter system flush buffer_cache; in Oracle) and OS-level access (to flush the OS' buffer cache - echo 3 > /proc/sys/vm/drop_caches on common filesystems under Linux).
If you're running against a production database, you probably don't have those permissions -- you wouldn't want to run those commands on a production database anyways, since it would degrade the performance for all users of the database, and other queries would affect the time it takes to run yours.
Instead of trying to accurately measure the time it takes to run your query, I would suggest paying attention to how the query is executed:
what part of it is 'pushed down' to the DB and how much data flows between SAS and Oracle
what is Oracle's explain plan for the query -- does it have obvious inefficiencies
When a query is executed in a clearly suboptimal way, you will find (more often than not) that the fixed version will run faster both with cold and hot caches.
To apply this to the case you mention (10 seconds vs 2 seconds) - before thinking how to measure this accurately, start by looking
if your query gets correctly pushed down to Oracle (it probably does),
and whether it requires a full table (partition) scan of a sufficiently large table (depending on how slow the IO in your DB is - on the order of 1-10 GB).
If you find that the query needs to read 1 GB of data and your typical (in-database) read speed is 100MB/s, then 10s with cold cache is the expected time to run it.
I'm no Oracle expert, but I doubt there's any way you can 'clear' the Oracle cache (and if there were, you would probably need to be a DBA to do so).
Typically what I do is I change the parameters of the query slightly so that the exact query no longer matches anything in the cache. For example, you could change the date range you are querying against.
It won't give you an exact performance comparison (because you're pulling different results) but it will give you a pretty good idea if one query performs significantly better than the other.
I'm parsing poker hand histories, and storing the data in a postgres database. Here's a quick view of that:
I'm getting relatively bad performance, and parsing the files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization there would make this a lot quicker.
The way I have it set-up now is as follows:
1. Read next file into a string.
2. Parse one game and store it into object GameData.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round, store the moveids in an array.
6. Using the moveids array, insert a movesequence, store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger, coarser batches to avoid repeating lots of work, and, most importantly, to do it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
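A rough sketch of that pattern with libpq (the batch size, checkpointing, and error handling are simplified placeholders):

#include <libpq-fe.h>
#include <cstdio>

static void exec_or_report(PGconn *conn, const char *sql) {
    PGresult *res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        std::fprintf(stderr, "%s\n", PQerrorMessage(conn));
    PQclear(res);
}

void import_games(PGconn *conn, int total_games) {
    const int games_per_txn = 1000;              // commit in coarse batches
    exec_or_report(conn, "BEGIN");
    for (int g = 0; g < total_games; ++g) {
        // ... issue the INSERTs for one game here ...
        if ((g + 1) % games_per_txn == 0) {
            exec_or_report(conn, "COMMIT");      // record g in a checkpoint file/table
            exec_or_report(conn, "BEGIN");       // so a crash can resume from here
        }
    }
    exec_or_report(conn, "COMMIT");
}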
It'll help to use multi-valued inserts where you're inserting multiple rows to the same table. Eg:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
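If you're building those statements from C++ with libpq, PQescapeLiteral keeps the values safe to interpolate; this sketch assumes a hypothetical players(name) table, not your real schema:

#include <libpq-fe.h>
#include <string>
#include <vector>

void insert_players(PGconn *conn, const std::vector<std::string> &names) {
    if (names.empty()) return;
    std::string sql = "INSERT INTO players(name) VALUES ";
    for (size_t i = 0; i < names.size(); ++i) {
        char *lit = PQescapeLiteral(conn, names[i].c_str(), names[i].size());
        if (i > 0) sql += ",";
        sql += "(";
        sql += lit;               // lit already includes the surrounding quotes
        sql += ")";
        PQfreemem(lit);
    }
    PGresult *res = PQexec(conn, sql.c_str());  // check PQresultStatus(res) in real code
    PQclear(res);
}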
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option is to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when the session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.
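A minimal sketch of the COPY path with libpq (the moves table and CSV payload are placeholders); the same calls work whether the CSV comes from a file you just wrote or straight from a buffer in memory:

#include <libpq-fe.h>
#include <cstdio>

bool copy_rows(PGconn *conn, const char *csv_data, int len) {
    PGresult *res = PQexec(conn, "COPY moves FROM STDIN WITH (FORMAT csv)");
    if (PQresultStatus(res) != PGRES_COPY_IN) {
        std::fprintf(stderr, "%s\n", PQerrorMessage(conn));
        PQclear(res);
        return false;
    }
    PQclear(res);

    PQputCopyData(conn, csv_data, len);   // may be called many times with chunks
    PQputCopyEnd(conn, NULL);             // NULL = no error, finish the COPY

    res = PQgetResult(conn);              // final status of the COPY command
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok;
}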
I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also allows pressing a 'Freeze' button which suspends updating the GUI (while it keeps receiving data in the background). As soon as the Freeze option is disabled, the data that has accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in the memory. The customer might even keep the GUI running over night, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
It shouldn't be too much work to use it on a desktop system
It should be fast at appending new data at the end. I never need to touch previously written data anymore, so writing into anywhere but the end is not needed.
It should be possible to randomly access records in the data. This is because scrolling around in my GUI will make it necessary to quickly display the N to N+20 (or whatever the height of my table is) entries in the data stream.
The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share his experience with such scenarios.
I think that SQLite might do the trick. It seems to be fast. Unfortunately I don't have a data flow like yours, but it works well as a backend for a log recorder; I have a GUI where you can view logs n to n+k.
You can also try SOCI as a C++ database access API; it seems to work fine with SQLite (I haven't used it yet, but I plan to).
my2c
I would recommend a simple file based solution.
If you can use fixed-size records: if you receive the data continuously at a constant sample rate, random access to the data is easy and very fast once you know the timestamp of the first data point and the sample rate. If the sample rate varies, write a timestamp with each data point; random access then requires a binary search, but it is still fast enough.
If you have variable-size records: write the variable-size data to one file, and write fixed-size index entries (offsets into the data file) to another file. If the sample rate varies, write timestamps too. Now you can do fast random access using the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
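Here's a minimal sketch of the variable-size case (file names and the QByteArray payload are placeholders); each record's start offset is appended as a fixed-size entry to the index file, so reading record n costs two seeks:

#include <QFile>
#include <QDataStream>
#include <QByteArray>

class RecordLog {
public:
    RecordLog()
        : dataW_("records.dat"), idxW_("records.idx"),
          dataR_("records.dat"), idxR_("records.idx") {
        dataW_.open(QIODevice::WriteOnly | QIODevice::Append);
        idxW_.open(QIODevice::WriteOnly | QIODevice::Append);
        dataR_.open(QIODevice::ReadOnly);
        idxR_.open(QIODevice::ReadOnly);
    }

    void append(const QByteArray &record) {
        qint64 offset = dataW_.pos();      // where this record starts
        QDataStream out(&dataW_);
        out << record;                     // QDataStream stores length + bytes
        QDataStream idx(&idxW_);
        idx << offset;                     // fixed-size (8-byte) index entry
    }

    QByteArray read(qint64 n) {            // random access to record n
        dataW_.flush();                    // flush writers before any random access
        idxW_.flush();
        idxR_.seek(n * qint64(sizeof(qint64)));
        qint64 offset = 0;
        QDataStream idx(&idxR_);
        idx >> offset;
        dataR_.seek(offset);
        QDataStream in(&dataR_);
        QByteArray record;
        in >> record;
        return record;
    }

private:
    QFile dataW_, idxW_;   // writers (append only)
    QFile dataR_, idxR_;   // readers (random access)
};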
And a note about performance: don't flush the file after every data point write. But remember to flush the file before doing any random access to it.
I am programming on Windows, and I store my information in SQLite.
However, I find that getting all the items is a bit slow.
I am using the following way:
select * from XXX;
Retrieving all items from a 1.7 MB SQLite DB takes about 200-400 ms.
It is too slow. Can anyone help?
Many Thanks!
Thanks for your answers!
I have to do a complex operation on the data, so every time I open the app I need to read all the information from the DB.
I would try the following:
Vacuum your database by running the "vacuum" command
SQLite starts with a default cache size of 2000 pages (run the command "pragma cache_size" to be sure). Each page is 512 bytes, so it looks like you have about 1 MByte of cache, which is not quite enough to contain your database. Increase your cache size by running "pragma default_cache_size=4000"; that should give you 2 MBytes of cache, which is enough to get your entire database into the cache. You can run these pragma commands from the sqlite3 command line, or through your program as if they were ordinary queries (see the sketch after this list).
Add an index to your table on the field you are ordering with.
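A minimal sketch of running the vacuum and the pragma from code, assuming the sqlite3 C API ("infos.db" is a placeholder path):

#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3 *db = NULL;
    if (sqlite3_open("infos.db", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    char *err = NULL;
    // Reclaim free pages and defragment the file.
    if (sqlite3_exec(db, "VACUUM;", NULL, NULL, &err) != SQLITE_OK) {
        std::fprintf(stderr, "%s\n", err);
        sqlite3_free(err);
    }
    // Raise the page cache so the whole ~1.7 MB database fits in memory.
    if (sqlite3_exec(db, "PRAGMA default_cache_size=4000;", NULL, NULL, &err) != SQLITE_OK) {
        std::fprintf(stderr, "%s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}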
You could possibly speed it up slightly by selecting only those columns you want, but otherwise nothing will beat an unordered select with no where clause for getting all the data.
Other than that a faster disk/cpu is your only option.
What type of hardware is this on?