What is the fastest way to retrieve all items in SQLite? - c++

I am programming on Windows, and I store my data in SQLite.
However, I find that retrieving all items is a bit slow.
I am using the following way:
select * from XXX;
Retrieving all items in 1.7MB SQLite DB takes about 200-400ms.
It is too slow. Can anyone help?
Many Thanks!
Thanks for your answers!
I have to do a complex operation on the data, so every time I open the app, I need to read all of the information from the DB.

I would try the following:
Vacuum your database by running the VACUUM command.
SQLite starts with a default cache size of 2000 pages (run "PRAGMA cache_size" to be sure). Each page is 512 bytes, so it looks like you have about 1 MB of cache, which is not quite enough to hold your database. Increase your cache size by running "pragma default_cache_size=4000". That should get you a 2 MB cache, which is enough to fit your entire database. You can run these PRAGMA commands from the sqlite3 command line, or through your program as if they were ordinary queries.
Add an index to your table on the column you are ordering by.
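The advice above can be sketched in a few lines; here it is with Python's built-in sqlite3 module (the same PRAGMA statements work from C++ via sqlite3_exec()). The database and table names are placeholders.

```python
import sqlite3

# Sketch of the pragma advice above; "items" is a placeholder table and an
# in-memory database stands in for your file on disk.
conn = sqlite3.connect(":memory:")  # use your database file here
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)", [("a",), ("b",)])
conn.commit()

# Check the current page-cache size (in pages).
cache_pages = conn.execute("PRAGMA cache_size").fetchone()[0]

# Enlarge the cache so the whole database can stay resident in memory.
conn.execute("PRAGMA cache_size = 4000")

# Reclaim free pages and defragment (meaningful for on-disk databases).
conn.execute("VACUUM")

# Fetch only the columns you actually need instead of SELECT *.
rows = conn.execute("SELECT id, name FROM items").fetchall()
```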

You could possibly speed it up slightly by selecting only the columns you need, but otherwise nothing will beat an unordered SELECT with no WHERE clause for getting all the data.
Other than that, a faster disk/CPU is your only option.
What type of hardware is this on?

Related

Dynamic spool space allocation in teradata

I have a huge amount of data and am importing it from Teradata to HDFS. While doing so, the spool space is usually insufficient and the job therefore fails. Is there any solution for this? Can spool space be allocated dynamically according to the data size? Or can we load the data from the Sqoop import into a temporary buffer and then write it to HDFS?
If you're running out of SPOOL, it could be any of these scenarios:
Incorrectly written query (e.g. an unintended CROSS JOIN)
Do an EXPLAIN on your query and check for a product join or anything that looks like it will take a long time
Inefficient query plan
Run an EXPLAIN and see if there are any long estimates. Also, you can try DIAGNOSTIC HELPSTATS ON FOR SESSION. When you enable this flag, any time you run an EXPLAIN, at the bottom you will get a bunch of recommended statistics to collect. Some of these suggestions may be useful
Tons of data
Not much you can do here. Maybe try to do the import in batches.
Also, you can check to see what the MaxSpool parameter is for the user running the query. You could try to increase the MaxSpool value to see if that helps. Keep in mind, the actual spool available will be capped by the amount of unallocated PERM space.

SAS PROC SQL: How to clear cache between testing

I am reading this paper: "Need for Speed - Boost Performance in Data Processing with SAS/Access® Interface to Oracle". I would like to know how to clear the cache / buffer in SAS, so that my repeated test queries accurately reflect my changes.
I noticed that the same query takes 10 seconds the first time it runs, but running it again immediately afterwards (without changes) takes much less time (say 1-2 seconds). Is there a command / instruction to clear the cache / buffer, so I can get a clean test of my new changes?
I am using SAS Enterprise Guide with data hosted on an Oracle server. Thanks!
In order to flush caches on the Oracle side, you need both DBA privileges (to run alter system flush buffer_cache; in Oracle) and OS-level access (to flush the OS's buffer cache - echo 3 > /proc/sys/vm/drop_caches under Linux).
If you're running against a production database, you probably don't have those permissions -- and you wouldn't want to run those commands on a production database anyway, since doing so would degrade performance for all users of the database, and other queries would affect the time it takes to run yours.
Instead of trying to accurately measure the time it takes to run your query, I would suggest paying attention to how the query is executed:
what part of it is 'pushed down' to the DB and how much data flows between SAS and Oracle
what is Oracle's explain plan for the query -- does it have obvious inefficiencies
When a query is executed in a clearly suboptimal way, you will find (more often than not) that the fixed version will run faster both with cold and hot caches.
To apply this to the case you mention (10 seconds vs 2 seconds) - before thinking about how to measure this accurately, start by checking:
whether your query gets correctly pushed down to Oracle (it probably does),
and whether it requires a full table (or partition) scan of a sufficiently large table (depending on how slow your DB's IO is - on the order of 1-10 GB).
If you find that the query needs to read 1 GB of data and your typical (in-database) read speed is 100MB/s, then 10s with cold cache is the expected time to run it.
I'm no Oracle expert, but I doubt there's any way you can 'clear' the Oracle cache (and if there were, you would probably need to be a DBA to do so).
Typically what I do is I change the parameters of the query slightly so that the exact query no longer matches anything in the cache. For example, you could change the date range you are querying against.
It won't give you an exact performance comparison (because you're pulling different results) but it will give you a pretty good idea if one query performs significantly better than the other.

Gradually increasing cpu usage without memory increase. Ideas?

So I have an app written in C++, running on Ubuntu 12.04, that initially reads some data from the db, then watches a directory for files. When they show up, it processes them, then writes some data back to the db. Over time, the cpu usage gradually increases, on the order of about 5% per day, but the memory usage stays the same. Logically it looks like this:
-open db connect
-while(keep_running())
- check dir for new files (I know - it should use the watch system and callbacks, but..)
- process files
- (possibly) update db
-end while
-close db connect
Where keep_running() returns true until the process receives SIGINT.
The code is not that complicated, so I'm at a loss for the cpu usage - callgrind looks right. I suspect the db connection, but that hasn't exhibited this behavior in other similar apps. My next step is attaching valgrind to a process and letting it run for a few days - in the mean time, anything else I could try?
This isn't surprising. As you describe the application, the database tables are getting larger.
Queries on larger tables probably take a bit more CPU. You don't describe the tables, indexes, or queries, but the behavior is reasonable.
You won't necessarily see an increase in space used by the database, because databases typically reserve extra space on disk for growing tables.
It turned out not to be the DB portion - someone was using a .find() on a huge map object. It's one call buried in a bunch of DB sections, which is why I was leaning toward the DB.
Nothing to see here, carry on :)

How to optimize writing this data to a postgres database

I'm parsing poker hand histories and storing the data in a PostgreSQL database.
I'm getting relatively bad performance, and parsing the files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization would make this a lot quicker.
The way I have it set up now is as follows:
1. Read the next file into a string.
2. Parse one game and store it in a GameData object.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, and store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round and store the moveids in an array.
6. Using the moveids array, insert a movesequence and store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger coarser batches to avoid repeating lots of work, and, most importantly, by doing it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
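The batching idea is independent of the database; here is a minimal sketch using Python's built-in sqlite3 module so it runs without a server (with PostgreSQL the same BEGIN/COMMIT pair would go through libpq or psycopg2). Table and column names are placeholders.

```python
import sqlite3

# Commit in coarse batches instead of once per statement; "games" is a
# placeholder table standing in for the poker-hand schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (id INTEGER PRIMARY KEY, data TEXT)")

BATCH = 1000  # commit every 1000 games rather than per insert
games = [("game-%d" % i,) for i in range(2500)]

for start in range(0, len(games), BATCH):
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.executemany("INSERT INTO games (data) VALUES (?)",
                         games[start:start + BATCH])
```

The `with conn:` block is sqlite3's shorthand for BEGIN/COMMIT; checkpointing between batches (as suggested above) would go right after it.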
It'll also help to use multi-valued inserts where you're inserting multiple rows into the same table. E.g.:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY api for bulk-loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.
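A minimal sketch of the CSV-staging step in Python: the rows are stand-ins for parsed hand data, and the final bulk-load (e.g. psycopg2's cur.copy_expert) needs a live connection, so it is only indicated as a comment.

```python
import csv
import io

# Buffer the converted rows as CSV in memory, then hand the buffer to COPY.
# The rows below are placeholders for parsed hand-history data.
buf = io.StringIO()
writer = csv.writer(buf)
for row in [(1, "fold"), (2, "raise")]:
    writer.writerow(row)

buf.seek(0)
# With psycopg2 and a real connection, the bulk-load would be one call:
# cur.copy_expert("COPY moves FROM STDIN WITH CSV", buf)
```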

Efficient update of SQLite table with many records

I am trying to use SQLite (sqlite3) for a project to store hundreds of thousands of records (I'd like SQLite so users of the program don't have to run a [My]SQL server).
I sometimes have to update hundreds of thousands of records to enter left/right values (they are hierarchical), but I have found the standard
update table set left_value = 4, right_value = 5 where id = 12340;
to be very slow. I have tried surrounding every thousand or so with
begin;
....
update...
update table set left_value = 4, right_value = 5 where id = 12340;
update...
....
commit;
but again, very slow. Odd, because when I populate it with a few hundred thousand rows (with INSERTs), it finishes in seconds.
I am currently testing the speed in Python (the slowness shows up both at the command line and in Python) before I move to the C++ implementation, but right now this is way too slow and I need to find a new solution unless I am doing something wrong. Thoughts? (I would also accept a portable open-source alternative to SQLite.)
Create an index on table.id
create index table_id_index on table(id)
Other than making sure you have an index in place, you can check out the SQLite Optimization FAQ.
Using transactions can give you a very big speed increase, as you mentioned, and you can also try turning off journaling.
Example 1:
2.2 PRAGMA synchronous
The Boolean synchronous value controls whether or not the library will wait for disk writes to be fully written to disk before continuing. This setting can be different from the default_synchronous value loaded from the database. In typical use the library may spend a lot of time just waiting on the file system. Setting "PRAGMA synchronous=OFF" can make a major speed difference.
Example 2:
2.3 PRAGMA count_changes
When the count_changes setting is ON, the callback function is invoked once for each DELETE, INSERT, or UPDATE operation. The argument is the number of rows that were changed. If you don't use this feature, there is a small speed increase from turning this off.
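Putting the answers above together - an index on the lookup column, one transaction per batch, and synchronous=OFF - might look like this sketch with Python's built-in sqlite3 module (table and column names are placeholders for the hierarchical table):

```python
import sqlite3

# Sketch only: "tree" and its columns stand in for the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA synchronous = OFF")  # don't wait for the disk on every write

conn.execute("CREATE TABLE tree (node_id INT, left_value INT, right_value INT)")
conn.executemany("INSERT INTO tree (node_id) VALUES (?)",
                 [(i,) for i in range(10000)])
conn.commit()

# The index recommended above, so each WHERE node_id = ? is an index lookup
# instead of a full table scan.
conn.execute("CREATE INDEX tree_id_idx ON tree(node_id)")

# One transaction for the whole batch of updates instead of one per statement.
updates = [(2 * i, 2 * i + 1, i) for i in range(10000)]
with conn:
    conn.executemany(
        "UPDATE tree SET left_value = ?, right_value = ? WHERE node_id = ?",
        updates)
```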