SQLite has a reputation for being small, fast and flexible. I used it in one of my C++ projects to save simple statistics to a file. Every 15 minutes, 3-5 new simple records (5 rows of integers) were saved into the database. After a few weeks of using SQLite this way I noticed clearly visible disk activity. I wasn't expecting that, because the amount of data written was very small; if I wrote it to a plain text file, the disk activity would be hardly noticeable. Is SQLite really such a light database, or was my problem too simple for a relational database?
VACUUM may solve your problem.
http://www.sqlite.org/lang_vacuum.html
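For reference, a minimal sketch of running it from C++ through the sqlite3 C API (the file name is just a placeholder and error handling is kept short):

// Sketch: running VACUUM through the sqlite3 C API.
// "stats.db" is a placeholder file name.
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("stats.db", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }

    char* err = nullptr;
    // VACUUM rebuilds the database file and drops unused pages.
    if (sqlite3_exec(db, "VACUUM;", nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "VACUUM failed: %s\n", err);
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}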
Well, I have used SQLite to store a table with the contents of an English dictionary with 100,000 entries, and it occupied about 20 MB, so I don't think the problem lies with SQLite. But it would be good if you provided more clues in order to get a more accurate answer.
I have to give some background first. I want to implement an optimized storage engine for OSM planet data (50 GB+). The purpose of this engine is to enable map area extractions as fast as possible, while also retaining the ability to apply minutely updates. The design I've chosen, for several reasons (not all of them mentioned here), is to use one database per grid cell. E.g. think of all cells on a map being distinct files or databases: http://3.bp.blogspot.com/_CntRFtGsdQo/TTU5UMlLkTI/AAAAAAAAARk/_hW8n33t4Ok/s1600/utmworld.gif
(Just to get the idea, though; this is not the actual cell grid I'll be using.)
I have never used leveldb before, but settled on it for its bulk insert and update performance. However, I'd like to know about the "performance characteristics" of opening many very small and very large leveldb databases: very small meaning just a few kB, very large meaning a few hundred MB.
I expect that I have to open / close somewhere between 10-100 dbs per minute. I'd rule out leveldb if it needs significant initialization time.
An answer to this question could be either concrete figures, or insight to what leveldb does during initialization and how it relates to data / index size.
PS. I'll also do my own measurements of course. But as with all tests, I may draw wrong conclusions from my sample data.
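For reference, the kind of measurement I have in mind is simply timing leveldb::DB::Open; a rough sketch (the path is just a placeholder):

// Sketch: time how long leveldb takes to open an existing database.
// "cells/cell_0042" is a placeholder path.
#include <leveldb/db.h>
#include <chrono>
#include <iostream>

int main() {
    leveldb::Options options;
    options.create_if_missing = false;  // only measure opening existing cells

    auto start = std::chrono::steady_clock::now();

    leveldb::DB* db = nullptr;
    leveldb::Status status = leveldb::DB::Open(options, "cells/cell_0042", &db);

    auto elapsed = std::chrono::steady_clock::now() - start;

    if (!status.ok()) {
        std::cerr << "open failed: " << status.ToString() << "\n";
        return 1;
    }
    std::cout << "DB::Open took "
              << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
              << " ms\n";

    delete db;  // closes the database
    return 0;
}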
I have a 200 GB CSV file that represents locations (points) around the globe. Each entry (row) has 64 columns and contains redundant information. I made some calculations and the size is approx. 800 million rows. My first approach was to push all the data into Postgres + PostGIS. The data is not very clean and some rows do not conform to the expected datatypes, so I wrote an ORM implementation to first validate and fix the datatype inconsistencies and handle exceptions.
The ORM I used was Django > 1.5 and it took approx. 3 hours to process less than 0.1% of the total dataset.
I also tried to partition the dataset into different files so that I can process them and push them into the database little by little. I used common unix commands like sed, cat, awk and head to do this, but it takes so much time!
My questions are the following:
Does using the Django ORM sound like a good approach?
What about SQLAlchemy? Could it help make the insertions faster?
How can I split the dataset in shorter time?
I recently came across Pandas (a Python library for data analysis). Could it help with this task, maybe by making the queries easier once the data is stored in the database?
Which other tools would you recommend to work with this massive amount of data?
Thank you for your help and reading the long post.
I want to read a big file with CSV data (>1 GB, an export from an ERP system) and provide a table interface for the data.
In fact, I have a good working table class. It works in this (abstract) fashion:
a table row which is a vector for the column data
a vector for the rows.
Reading the big files this way runs into memory problems, I think because the vector needs the whole memory at once on the heap. So I created a new class which has only pointers to the strings in the columns, like this:
a table row which is a vector<string *> for the column data
a vector<row> for the rows.
This works better. It has about 1/3 less memory footprint on the heap. I think the separated string data fits in some holes on the heap ;-)
But if the data gets bigger, the memory problem comes back.
To read the file and convert it takes about 2 minutes.
I tried SQLite, but the import is very slow. Reading the big file (about 3,000,000 lines) and inserting the rows takes about 15 hours. I know I can speed this up a lot, but I don't really know whether this is the right solution. By the way: the SQLite browser crashes when importing such a file!
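(To be clear, by "speed up" I mean things like wrapping the inserts in one transaction and using a prepared statement. A rough sketch of what I have in mind, with a placeholder table and columns:)

// Sketch: bulk insert into SQLite with one transaction and a prepared statement.
// Table and column names are placeholders.
#include <sqlite3.h>
#include <string>
#include <utility>
#include <vector>

void bulkInsert(sqlite3* db, const std::vector<std::pair<std::string, std::string>>& rows) {
    sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO import(col1, col2) VALUES(?, ?);", -1, &stmt, nullptr);

    for (const auto& row : rows) {
        sqlite3_bind_text(stmt, 1, row.first.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, row.second.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);   // execute the insert
        sqlite3_reset(stmt);  // reuse the same statement for the next row
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
}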
Does anyone else have such problems, or do you know a good way to manage the memory for such big data tables? The table is a lookup table for some tasks, so it should fit into memory at once, if possible.
Currently I am working with Visual Studio C++ 2012.
Without knowing too much about your problem, here is what I did when I faced a similar situation 10 years ago and it took 36 hours to dump into an Oracle database; this more than halved it, to 16:
Create a bunch of buffers (say, 10,000 lines of data each) and have one thread read the data into them in a circular fashion.
Then have another thread start actually working on the data.
Admittedly this only works if each row is independent of others.
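A rough sketch of the idea (buffer size, names and the parsing step are placeholders):

// Sketch of the two-thread idea: one thread fills line buffers, the other consumes them.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

std::queue<std::vector<std::string>> buffers;   // each element holds ~10,000 lines
std::mutex m;
std::condition_variable cv;
bool done = false;

void reader(const std::string& path) {
    std::ifstream in(path);
    std::vector<std::string> buf;
    std::string line;
    while (std::getline(in, line)) {
        buf.push_back(line);
        if (buf.size() == 10000) {
            { std::lock_guard<std::mutex> lock(m); buffers.push(std::move(buf)); }
            cv.notify_one();
            buf.clear();
        }
    }
    { std::lock_guard<std::mutex> lock(m);
      if (!buf.empty()) buffers.push(std::move(buf));
      done = true; }
    cv.notify_one();
}

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !buffers.empty() || done; });
        if (buffers.empty()) return;           // reader finished and queue drained
        std::vector<std::string> buf = std::move(buffers.front());
        buffers.pop();
        lock.unlock();
        // ... parse the lines and insert them into the database here ...
    }
}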
Edit: This link about memory locality may help. Essentially, use plain arrays ([]) instead of vectors.
I'm parsing poker hand histories and storing the data in a Postgres database.
I'm getting relatively bad performance, and parsing files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization would make this a lot quicker.
The way I have it set up now is as follows:
1. Read the next file into a string.
2. Parse one game and store it in a GameData object.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, and store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round and store the moveids in an array.
6. Using the moveids array, insert a movesequence and store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger coarser batches to avoid repeating lots of work, and, most importantly, by doing it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
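Assuming you're using libpq (more on that below), that's just a couple of extra PQexec calls around the batch. A rough sketch, where insert_game stands in for your existing per-game insert code:

// Sketch: wrap a batch of game inserts in one transaction with libpq.
#include <libpq-fe.h>
#include <vector>

struct GameData { /* parsed fields of one game (placeholder) */ };

// Stands in for your existing per-game insert logic.
void insert_game(PGconn* conn, const GameData& game);

void importGames(PGconn* conn, const std::vector<GameData>& games) {
    PQclear(PQexec(conn, "BEGIN"));
    for (const GameData& g : games)
        insert_game(conn, g);        // all per-game INSERTs happen inside one transaction
    PQclear(PQexec(conn, "COMMIT"));
}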
It'll help to use multi-valued inserts where you're inserting multiple rows into the same table. E.g.:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.
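A rough sketch of the COPY path with plain libpq (table and column names are placeholders, error handling trimmed):

// Sketch: stream rows into PostgreSQL with the libpq COPY API instead of per-row INSERTs.
#include <libpq-fe.h>
#include <string>
#include <vector>

bool copyMoves(PGconn* conn, const std::vector<std::string>& csvLines) {
    PGresult* res = PQexec(conn,
        "COPY moves (game_id, player_id, action) FROM STDIN WITH (FORMAT csv)");
    if (PQresultStatus(res) != PGRES_COPY_IN) { PQclear(res); return false; }
    PQclear(res);

    for (const std::string& line : csvLines) {
        std::string withNewline = line + "\n";
        if (PQputCopyData(conn, withNewline.c_str(), (int)withNewline.size()) != 1)
            return false;
    }
    if (PQputCopyEnd(conn, nullptr) != 1) return false;

    res = PQgetResult(conn);                     // final status of the COPY
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok;
}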
I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also allows pressing a 'Freeze' button which suspends updating the GUI (but it keeps receiving data in the background). As soon as the Freeze option is disabled, the data which has accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in the memory. The customer might even keep the GUI running over night, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
It shouldn't be too much work to use it on a desktop system
It should be fast at appending new data at the end. I never need to touch previously written data anymore, so writing into anywhere but the end is not needed.
It should be possible to randomly access records in the data. This is because scrolling around in my GUI will make it necessary to quickly display the N to N+20 (or whatever the height of my table is) entries in the data stream.
The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share his experience with such scenarios.
I think that sqlite might do the trick. It seems to be fast. Unfortunately, I have no data flow like yours, but it works well as the backend for a log recorder. I have a GUI where you can view logs n to n+k.
You can also try SOCI as a C++ database access API; it seems to work fine with sqlite (I have not used it yet, but plan to).
my2c
I would recommend a simple file based solution.
If you can use fixed-size records: if you get the data continuously at a constant sample rate, random access to the data is easy and very fast once you know the time stamp of the first data point and the sample rate. If the sample rate varies, then write a time stamp with each data point. Random access then requires a binary search, but it is still fast enough.
If you have variable-size records: write the variable-size data to one file, and write fixed-size index entries pointing into the data file to a second file. If the sample rate varies, write time stamps as well. Now you can do fast random access using the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
And a note about performance: don't flush the file after writing every data point, but remember to flush it before doing any random access to it.
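A minimal sketch of the data-file-plus-index-file idea with QFile and QDataStream (the record format and helper names are made up, and the files are assumed to be opened elsewhere):

// Sketch: variable-size records go into a data file, fixed-size offsets into an index file.
#include <QByteArray>
#include <QDataStream>
#include <QFile>

// Append one variable-size record: payload to the data file, its offset to the index file.
void appendRecord(QFile& dataFile, QDataStream& dataOut,
                  QDataStream& indexOut, const QByteArray& record) {
    indexOut << (qint64)dataFile.pos();   // fixed-size index entry: where this record starts
    dataOut << record;                    // QDataStream writes the length followed by the bytes
}

// Random access: look up record n's offset in the index file, then read it from the data file.
QByteArray readRecord(QFile& dataFile, QFile& indexFile, qint64 n) {
    indexFile.seek(n * (qint64)sizeof(qint64));   // each index entry is one qint64
    QDataStream indexIn(&indexFile);
    qint64 offset = 0;
    indexIn >> offset;

    dataFile.seek(offset);
    QDataStream dataIn(&dataFile);
    QByteArray record;
    dataIn >> record;
    return record;
}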