Inserting into a WStandardItemModel is too slow - c++

I am working on an application built upon Wt.
We have a performance problem, as it must display a lot of data in a WTableView associated with a WStandardItemModel.
For each new item to be added to the table it calls:
model->setData(row, column, data)
(which happens a few thousand times).
Is there some way to make this faster? Some other way to add data to the table?
It can take 2 seconds to generate the data and several minutes to display it ...

WStandardItemModel is a general-purpose model that is easy to use, but it's not optimal for very large models. Try subclassing WAbstractTableModel instead; you only need to reimplement three methods, and you can read your data from wherever it resides or compute it on the fly.
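For illustration, here is a minimal read-only model sketch, assuming Wt 3's API (Wt 4 returns Wt::cpp17::any and takes an ItemDataRole instead of an int role); the Row typedef and the rows_ member are placeholders for wherever your data actually lives:

#include <Wt/WAbstractTableModel>
#include <Wt/WString>
#include <boost/any.hpp>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;   // placeholder row type

class BigTableModel : public Wt::WAbstractTableModel
{
public:
    BigTableModel(const std::vector<Row>& rows, Wt::WObject* parent = 0)
        : Wt::WAbstractTableModel(parent), rows_(rows) {}

    virtual int rowCount(const Wt::WModelIndex& parent = Wt::WModelIndex()) const {
        return parent.isValid() ? 0 : static_cast<int>(rows_.size());
    }

    virtual int columnCount(const Wt::WModelIndex& parent = Wt::WModelIndex()) const {
        if (parent.isValid() || rows_.empty())
            return 0;
        return static_cast<int>(rows_.front().size());
    }

    virtual boost::any data(const Wt::WModelIndex& index, int role = Wt::DisplayRole) const {
        if (role == Wt::DisplayRole)
            return Wt::WString::fromUTF8(rows_[index.row()][index.column()]);
        return boost::any();   // no data for other roles
    }

private:
    const std::vector<Row>& rows_;   // the data stays wherever you already keep it
};

Attach it with tableView->setModel(model) just like the WStandardItemModel; the view only asks data() for the cells it renders, so you skip the thousands of setData() calls entirely.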
It's not normal that a view takes minutes to display; I've used views on tables with many thousands of entries without performance problems. Was your system swapping because of the memory consumed by an (extremely large) WStandardItemModel?

Related

Time required to open very small to very large leveldb databases

I have to give some background first. I want to implement an optimized storage engine for OSM planet data (50GB+). The purpose of this engine is to enable map area extractions as fast as possible, while also retaining the ability to apply minutely updates. The design I've chosen, for several reasons (not all of them mentioned here), is to use one database per grid cell. E.g. think of all cells on a map being distinct files or databases: http://3.bp.blogspot.com/_CntRFtGsdQo/TTU5UMlLkTI/AAAAAAAAARk/_hW8n33t4Ok/s1600/utmworld.gif
(Just to get the idea, though; this is not the actual cell grid I'll be using.)
I have never used leveldb before, but settled on it for its bulk insert and update performance. However, I'd like to know about the "performance characteristics" of opening many very small and very large leveldb databases: very small meaning just a few kB, very large meaning a few hundred MB.
I expect that I will have to open/close somewhere between 10 and 100 databases per minute. I'd rule out leveldb if it needs significant initialization time.
An answer to this question could be either concrete figures, or insight into what leveldb does during initialization and how it relates to data/index size.
PS. I'll also do my own measurements of course. But as with all tests, I may draw wrong conclusions from my sample data.
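Since you mention doing your own measurements anyway: a minimal sketch for timing a single open with the stock leveldb C++ API could look like the following (the database path is a placeholder):

#include <chrono>
#include <iostream>
#include "leveldb/db.h"

int main() {
    leveldb::Options options;
    options.create_if_missing = true;

    // Time only DB::Open, which is where log/MANIFEST recovery happens.
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    leveldb::DB* db = 0;
    leveldb::Status status = leveldb::DB::Open(options, "/tmp/cell_0001", &db);
    std::chrono::steady_clock::duration elapsed = std::chrono::steady_clock::now() - start;

    if (!status.ok()) {
        std::cerr << status.ToString() << std::endl;
        return 1;
    }
    std::cout << "open took "
              << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()
              << " us" << std::endl;

    delete db;   // closing is just deleting the DB object
    return 0;
}

Running this against both a freshly written few-kB cell and a few-hundred-MB cell, with a cold and a warm file cache, should give a reasonable first answer to the 10-100 opens per minute question.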

Displaying BIG CSV File as Table

I want to read a big file with CSV data (>1 GB, an export from an ERP system) and provide a table interface for the data.
In fact, I already have a table class that works well. It works in this (abstract) fashion:
a table row which is a vector for the column data
a vector for the rows.
Reading the big files with this runs into memory problems, I think because each vector needs its whole memory block at once on the heap. So I created a new class which only holds pointers to the strings in the columns, like this:
a table row which is a vector<string *> for the column data
a vector<row> for the rows.
This works better. It has about one third less memory footprint on the heap; I think the separately allocated string data fits into some holes on the heap ;-)
But as the data gets bigger, the memory problem is still there.
To read the file and convert it takes about 2 minutes.
I tried SQLite, but the import is very slow: reading the big file (about 3,000,000 lines) and inserting the rows takes about 15 hours. I know I can speed this up a lot, but I do not really know whether this is the right solution. BTW: the SQLite browser crashes when importing such a file!
Does anyone else have such problems, or do you know a good way to manage the memory for such big data tables? The table is a lookup table for some tasks, so it should fit into memory all at once, if possible.
Currently I am working with Visual Studio C++ 2012.
Without knowing too much about your problem, here is what I did when I had a similar situation 10 years ago. It took 36 hours to dump into an Oracle database, and this approach more than halved that to 16:
Create a bunch of buffers (say 10,000 lines of data each) and have one thread read the data into these in a circular fashion.
Then have another thread start actually working on the data (a sketch follows below).
Admittedly this only works if each row is independent of the others.
Edit: This link about memory locality may help. Essentially, use plain arrays ([]) instead of vectors.
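A rough sketch of that reader/worker split, using a queue of line buffers rather than a strict ring of buffers for brevity; parse_line() and the file name are placeholders:

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

static const std::size_t kLinesPerBuffer = 10000;

static std::queue<std::vector<std::string> > queue_;
static std::mutex mutex_;
static std::condition_variable cv_;
static bool done_ = false;

// Placeholder: convert one CSV line into your row structure here.
static void parse_line(const std::string& line) { (void)line; }

static void reader(const std::string& path) {
    std::ifstream in(path.c_str());
    std::vector<std::string> buffer;
    buffer.reserve(kLinesPerBuffer);
    std::string line;
    while (std::getline(in, line)) {
        buffer.push_back(line);
        if (buffer.size() == kLinesPerBuffer) {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(buffer));
            buffer.clear();
            buffer.reserve(kLinesPerBuffer);
            cv_.notify_one();
        }
    }
    std::lock_guard<std::mutex> lock(mutex_);
    if (!buffer.empty())
        queue_.push(std::move(buffer));
    done_ = true;
    cv_.notify_one();
}

static void worker() {
    for (;;) {
        std::vector<std::string> buffer;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, []{ return !queue_.empty() || done_; });
            if (queue_.empty())
                return;              // done_ is set and nothing is left to process
            buffer = std::move(queue_.front());
            queue_.pop();
        }
        for (std::size_t i = 0; i < buffer.size(); ++i)
            parse_line(buffer[i]);   // runs concurrently with the next read
    }
}

int main() {
    std::thread r(reader, "big_export.csv");   // hypothetical file name
    std::thread w(worker);
    r.join();
    w.join();
    return 0;
}

With 10,000-line buffers the synchronization overhead is negligible compared to the parsing work, so the disk and the CPU stay busy at the same time.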

GAE ndb design, performance and use of repeated properties

Say I have a picture gallery and a picture could potentially have 100k+ fans. Which ndb design is more efficient?
class picture(ndb.Model):
    fanIds = ndb.StringProperty(repeated=True)
    ... [other picture properties]
or
class picture(ndb.Model):
    ... [other picture properties]
class fan(ndb.Model):
    pictureId = ndb.StringProperty()
    fanId = ndb.StringProperty()
Is there any limit on the number of items you can add to an ndb repeated property, and is there any performance hit when storing a large number of items in a repeated property? If it is less efficient to use repeated properties, what is their intended use?
Do not use repeated properties if you have more than 100-1000 values. (1000 is probably already pushing it.) They weren't designed for such use.
Generally v1 would be much cheaper.
In terms of read/write costs, you pay per entity fetched/written, so you want to reduce the number of entities. Version 1 will be cheaper. Significantly cheaper if you fetch every fan every time you fetch a picture.
However, each entity is limited to 1 MB. If you have 100k+ fans, you could hit that limit depending on the size of your fanIds, and that's not counting your other picture data, so you could blow the 1 MB limit. You'd have to add more complex code to handle the overflow cases.
Large entities take longer to fetch than small entities. If you're going to fetch all the fans at once all the time, v1 will be better. If you're only going to fetch, say, 5 fans at any one point, v2 might be faster (only might). If, on the other hand, you try to pull 100k fan entities... that's going to take forever.

How to optimize writing this data to a postgres database

I'm parsing poker hand histories and storing the data in a Postgres database. Here's a quick view of the setup:
I'm getting relatively bad performance, and parsing the files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization there would make this a lot quicker.
The way I have it set up now is as follows:
1. Read the next file into a string.
2. Parse one game and store it in a GameData object.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, and store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round and store the moveids in an array.
6. Using the moveids array, insert a movesequence and store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger, coarser chunks so you avoid repeating lots of work and, most importantly, to do it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
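As a rough sketch of that batching with libpq (the GameData type and the import_one_game() helper stand in for your existing per-game insert code):

#include <libpq-fe.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

struct GameData { /* your parsed game (placeholder) */ };

// Hypothetical: whatever INSERTs you currently issue for one game.
void import_one_game(PGconn* conn, const GameData& game);

static void exec_or_die(PGconn* conn, const char* sql) {
    PGresult* res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        std::fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
        std::exit(1);
    }
    PQclear(res);
}

void import_games(PGconn* conn, const std::vector<GameData>& games) {
    exec_or_die(conn, "BEGIN");
    for (std::size_t i = 0; i < games.size(); ++i) {
        import_one_game(conn, games[i]);
        if ((i + 1) % 1000 == 0) {        // checkpoint every 1000 games
            exec_or_die(conn, "COMMIT");
            // record a resume point (file name / game id) here if you need one
            exec_or_die(conn, "BEGIN");
        }
    }
    exec_or_die(conn, "COMMIT");
}

Even with no other change, going from one transaction per statement to one per thousand games removes most of the per-commit fsync cost.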
It'll help to use multi-valued inserts where you're inserting multiple rows into the same table, e.g.:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++, I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY API for bulk loading, so your app wouldn't need to call out to psql to load the CSV data once it had produced it.
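For instance, a rough sketch of streaming pre-formatted CSV lines through the COPY API with plain libpq (table and column names are placeholders):

#include <libpq-fe.h>
#include <string>
#include <vector>

// Bulk-load CSV lines (without trailing newlines) into a staging table.
bool copy_rows(PGconn* conn, const std::vector<std::string>& csvLines) {
    PGresult* res = PQexec(conn,
        "COPY moves_import (game_id, player_id, action, amount) "
        "FROM STDIN WITH (FORMAT csv)");
    if (PQresultStatus(res) != PGRES_COPY_IN) {
        PQclear(res);
        return false;
    }
    PQclear(res);

    for (std::size_t i = 0; i < csvLines.size(); ++i) {
        std::string line = csvLines[i] + "\n";
        if (PQputCopyData(conn, line.c_str(), static_cast<int>(line.size())) != 1)
            return false;
    }
    if (PQputCopyEnd(conn, NULL) != 1)
        return false;

    // COPY leaves a final result on the connection that must be consumed.
    res = PQgetResult(conn);
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok;
}

The staging table itself can be created with something along the lines of CREATE UNLOGGED TABLE moves_import (LIKE moves) before the load, matching the approach above.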

What portable data backends are there which have fast append and random access?

I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also allows pressing a 'Freeze' button which suspends updating the GUI (but it will keep receiving data in the background). As soon as the Freeze option is disabled, the data which has accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in memory. The customer might even keep the GUI running over night, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
It shouldn't be too much work to use it on a desktop system
It should be fast at appending new data at the end. I never need to touch previously written data again, so writing anywhere but the end is not needed.
It should be possible to randomly access records in the data. This is because scrolling around in my GUI will make it necessary to quickly display the N to N+20 (or whatever the height of my table is) entries in the data stream.
The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share his experience with such scenarios.
I think that sqlite might do the trick; it seems to be fast. I don't have a data flow like yours, but it works well as the backend for a log recorder: I have a GUI where you can view logs n to n+k.
You can also try SOCI as a C++ database access API; it seems to work fine with sqlite (I have not used it yet, but plan to).
my2c
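As a minimal sketch of that with the sqlite3 C API directly (SOCI would wrap the same idea; the records table and payload column are placeholders), fetching the window of rows n .. n+count-1 for the view:

#include <sqlite3.h>
#include <string>
#include <vector>

std::vector<std::string> fetch_window(sqlite3* db, int n, int count) {
    std::vector<std::string> rows;
    sqlite3_stmt* stmt = 0;
    sqlite3_prepare_v2(db,
        "SELECT payload FROM records ORDER BY id LIMIT ?1 OFFSET ?2",
        -1, &stmt, 0);
    sqlite3_bind_int(stmt, 1, count);
    sqlite3_bind_int(stmt, 2, n);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        const unsigned char* text = sqlite3_column_text(stmt, 0);
        rows.push_back(text ? reinterpret_cast<const char*>(text) : "");
    }
    sqlite3_finalize(stmt);
    return rows;
}

On the write side, wrapping many INSERTs in a single transaction goes a long way towards keeping up with a fast incoming stream.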
I would recommend a simple file based solution.
If you can use fixed-size records: if you get the data continuously at a constant sample rate, random access to the data is easy and very fast once you know the timestamp of the first data point and the sample rate. If the sample rate varies, then write a timestamp with each data point. Random access then requires a binary search, but it is still fast enough.
If you have variable-size records: write the variable-size data to one file, and to another file write fixed-size indexes into the data file. If the sample rate varies, write timestamps too. Now you can do the random access quickly using the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
And a note about performance: don't flush the file after every data point write. But remember to flush the file before doing any random access to it.
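A rough sketch of that two-file layout with QFile/QDataStream, assuming records arrive as QByteArrays; the class name, file names and the offset bookkeeping (which relies on QDataStream's default QByteArray format of a 32-bit length prefix followed by the raw bytes) are illustration only:

#include <QByteArray>
#include <QDataStream>
#include <QFile>
#include <QString>

class RecordLog
{
public:
    RecordLog(const QString& dataPath, const QString& indexPath)
        : dataWrite_(dataPath), indexWrite_(indexPath),
          dataRead_(dataPath), indexRead_(indexPath)
    {
        // Separate writer and reader instances, as suggested above.
        dataWrite_.open(QIODevice::WriteOnly | QIODevice::Append);
        indexWrite_.open(QIODevice::WriteOnly | QIODevice::Append);
        dataRead_.open(QIODevice::ReadOnly);
        indexRead_.open(QIODevice::ReadOnly);
        nextOffset_ = dataWrite_.size();
    }

    // Append one variable-size record; the index stores its start offset.
    void append(const QByteArray& record)
    {
        QDataStream idx(&indexWrite_);
        idx << nextOffset_;                 // fixed 8 bytes per index entry
        QDataStream out(&dataWrite_);
        out << record;                      // quint32 length prefix + bytes
        nextOffset_ += qint64(sizeof(quint32)) + record.size();
    }

    // Random access to record n via the index file.
    QByteArray read(qint64 n)
    {
        dataWrite_.flush();                 // flush only before random access
        indexWrite_.flush();
        indexRead_.seek(n * qint64(sizeof(qint64)));
        QDataStream idx(&indexRead_);
        qint64 offset = 0;
        idx >> offset;
        dataRead_.seek(offset);
        QDataStream in(&dataRead_);
        QByteArray record;
        in >> record;
        return record;
    }

private:
    QFile dataWrite_, indexWrite_;
    QFile dataRead_, indexRead_;
    qint64 nextOffset_;
};

Scrolling to rows N .. N+20 in the table then costs one index seek and twenty small reads, independent of how many gigabytes have accumulated behind them.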