Qt SQLite search query speed - c++

I'm creating a database in SQLite as follows:
QSqlQuery create_address;
create_address.prepare("CREATE TABLE addresses (addressid INTEGER PRIMARY KEY AUTOINCREMENT, address TEXT UNIQUE)");
QSqlQuery create_devices;
create_devices.prepare("CREATE TABLE devices (ch TEXT PRIMARY KEY, addressid INTEGER REFERENCES addresses(addressid))");
create_devices.exec();
create_address.exec();
I need to access this database a lot of times (~660'000) passing ch and retrieving the corresponding address, the ch passed could not be in the database (empty string is returned).
To do so i made the following query
//outside loop
QSqlQuery find_address;
find_address.prepare("SELECT address FROM addresses,devices WHERE devices.addressid = addresses.addressid AND devices.ch = :chcode");
//in loop
find_address.bindValue(":chcode",QString::fromStdString(ch_code));
find_address.exec();
The problem is that this process is very slow (it takes almost 12 minutes to finish all the 660'000 searches).
Before this i tried with an INNER JOIN but the performance was pretty much the same.
Is there a better way to write the query and/or structure the DB to get a faster execution time?

Since you have a loop with SQL query, you can wrap it into transaction which may improve the performance:
QSqlDatabase::database().transaction();
.........
// your loop
.........
QSqlDatabase::database().commit();
Also the performance may be improved by adding indexes. In your case index can be created on the fields devices.addressid and devices.ch. In sqlite console do the following:
CREATE INDEX devices_index ON devices(ch, addressid);

Without any measurement or insights how the tables addresses and devices look like it's hard to give a precise advice.
Maybe the join is the bottleneck so you can try to create a view first. This would avoid joining two tables 659000 times. See here on how to create a view.
Next (shot in the dark), instead of executing 660000 queries, make batches. For example replace AND devices.ch = :chcode by AND devices.ch IN(:chcodelist) and glue multiple chcodes together. Depending on the content, take care of their escaping yourself.

Related

Insert many items from list into SQLite

I have a list of lots of data (will be near 1000). I want to add it all in one go to a row. Is this straight forward like a for loop over list with multiple inserts?multiple commits? Is this bad practice?thanks
I haven’t tried yet as just setting up table columns which is many so need to know if feasible thanks
If you're using SQL to insert:
INSERT INTO 'tablename' ('column1', 'column2') VALUES
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2');
If you're using code... generate that above query using a for loop then run it.
For a more efficient approach consider a union as shown in: Is it possible to insert multiple rows at a time in an SQLite database?
insert into 'tablename' ('column1','column2')
select data1 as 'column1',data2 as 'column2'
union select data3,data4
union...
In sqlite you don't have network latency, so it does not really matter performance wise to issue many small requests toward the engine. For more reference about that you can read this page from the official documentation: https://www.sqlite.org/np1queryprob.html
But in write mode (insert or update), each individual query will have to pay the cost of an implicit transaction. To avoid that you need to gather your insert queries in an explicit transaction. Depending of your programming language, how you do that may vary. Here is a code sample on how to do that in go. I've simplified error code management, to have a better view of the gist.
tx, _ := db.Begin()
for _, item := range items {
tx.Exec(`INSERT INTO testtable (col1, col2) VALUES (?, ?)`, item.Field1, item.Field2)
}
tx.Commit()
If you detect an error in your loop instead calling tx.Commit() you need to call tx.Rollback() in order to cancel all previous writes to your database so that the final state is as if no insert query has been issued at all.

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Energy outage is possible, data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, datatable);
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice if this has any chances of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them. Can I create views and/or triggers on the database (as main schema) that will work later when I connect to the database via ATTACH?
From what it looks like, a trigger AFTER INSERT will fire after every single line of insert. If it inserts stuff into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow down things horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the sqlite3 databases in multi-threading mode, handle the insert/update/query/delete functions by separate threads. I prefer to transfer query result to a stl container for processing.

Partitioning a table in sybase-select query

My main concern:
I have an existing table with huge data.It is having a clustered index.
My c++ process has a list of many keys with which it checks whether the key exists in the table,
and if yes, it will then check the row in the table and the new row are similar. if there is a change the new row is updated in the table.
In general there will less changes. But its huge data in the table.
S it means there will be lot of select queries but not many update queries.
What I would I like to achieve:
I just read about partitioning a table in sybase here.
I just wanted to know will this be helpful for me, as I read in the article it mentions about the insert queries only. But how can I improve my select query performance.
Could anyone please suggest what should I look for in this case?
Yes it will improve your query (read) performance so long as your query is based on the partition keys defined. Indexes can also be partitioned and it stands to reason that a smaller index will mean faster read performance.
For example if you had a query like select * from contacts where lastName = 'Smith' and you have partitioned your table index based on first letter of lastName, then the server only has to search one partition "S" to retrieve its results.
Be warned that partitioning your data can be difficult if you have a lot of different query profiles. Queries that do not include the index partition key (e.g. lastName) such as select * from staff where created > [some_date] will then have to hit every index partition in order to retrieve it's result set.
No one can tell you what you should/shouldn't do as it is very application specific and you will have to perform your own analysis. Before meddling with partitions, my advice is to ensure you have the correct indexes in place, they are being hit by your queries (i.e. no table scans), and your server is appropriately resourced (i.e got enough fast disk and RAM), and you have tuned your server caches to suit your queries.

Getting generatedauto-increment ID without second query (MySQL)

I have been searching for a while on how to get the generated auto-increment ID from an "INSERT . INTO ... (...) VALUES (...)". Even on stackoverflow, I only find the answer of using a "SELECT LAST_INSERT_ID()" in a subsequent query. I find this solution unsatisfactory for a number of reasons:
1) This will effectively double the queries sent to the database, especially since it is mostly handling inserts.
2) What will happen if more than one thread access the database at the same time? What if more than one application accesses the database at the same time? It seems to me the values are bound to become erroneous.
It's hard for me to believe that the MySQL C++ Connector wouldn't offer the feature that the Java Connector as well as the PHP Connector offer.
An example taken from http://forums.mysql.com/read.php?167,294960,295250
sql::Statement* stmt = conn->createStatement();
sql::ResultSet* res = stmt->executeQuery("SELECT ##identity AS id");
res->next();
my_ulong retVal = res->getInt64("id");
In nutshell, if your ID column is not an auto_increment column then you can as well use
SELECT ##identity AS id
EDIT:
Not sure what do you mean by second query/round trip. First I thought you are trying to know a different way to get the ID of the last inserted row but it looks like you are more interested in knowing whether you can save the round trip or not?
If that's the case, then I am completely agree with #WhozCraig; you can punch in both your queries in a single statement like inser into tab value ....;select last_inserted_id() which will be a single call
OR
you can have stored procedure like below to do the same and save the round trip
create procedure myproc
as
begin
insert into mytab values ...;
select last_inserted_id();
end
Let me know if this is not what you are trying to achieve.

MySQL Performance issues with large amounts of data

I have a software project that I am working on at work that has been driving me crazy. Here's our problem: we have a series data contacts that need to be logged every second. It needs to include time, bearing (array of 360-1080 bytes), range, and a few other fields. Our system also needs the capability to store this data for up to 30 days. In practice, there can be up to 100 different contacts, so at a maximum, there can be anywhere from around 150,000,000 points to about 1,000,000,000 different points in 30 days.
I'm trying to think of the best method for storing all of this data and retrieving later on. My first thought was to use some RDBMS like MySQL. Being a embedded C/C++ programmer, I have very little experience working with MySQL with such large data sets. I've dabbled with it on small datasets, but nothing nearly as large. I generated the below schema for two tables that will store some of the data:
CREATE TABLE IF NOT EXISTS `HEADER_TABLE` (
`header_id` tinyint(3) unsigned NOT NULL auto_increment,
`sensor` varchar(10) NOT NULL,
`bytes` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`header_id`),
UNIQUE KEY `header_id_UNIQUE` (`header_id`),
UNIQUE KEY `sensor_UNIQUE` (`sensor`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `RAW_DATA_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`transverse` bit(1) NOT NULL default b'0',
`data` varbinary(1080) NOT NULL,
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
UNIQUE KEY `internal_id_UNIQUE` (`internal_id`),
KEY `time` (`time_sec`)
KEY `internal_id` (`internal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `rel_RASTER_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`raster_id` int(10) unsigned NOT NULL,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`header_id` tinyint(3) unsigned NOT NULL,
`data_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
The header table only contains 10 rows and is static. It just tells what sensor the raw data came from, and the number of bytes output by that type of sensor. The RAW_DATA_TABLE essentially stores the raw bearing data (an array of 360-1080 bytes, it represents up to three samples per degree). The rel_RASTER_TABLE holds meta data for the RAW_DATA_TABLE, there can be multiple contacts that refer to the same raw data row. The data_id found in rel_RASTER_TABLE points to the internal_id of some row in the RAW_DATA_TABLE, I did this to decrease the amount of writes needed.
Obviously, as you can probably tell, I'm having performance issues when reading and deleting from this database. An operator to our software can see real time data as it comes across and also go into reconstruction mode and overlay a data range from the past, the past week for example. Our backend logging server grabs the history rows and sends them to a display via a CORBA interface. While all of this is happening, I have a worker thread that deletes 1000 rows at a time for data greater than 30 days. This is there in case a session runs longer than 30 days, which can happen.
The system we currently have implemented works well for smaller sets of data, but not for large sets. Our select and delete statements can take upwards of 2 minutes to return results. This completely kills the performance of our real time consumer thread. I suspect we're not designing our schemas correctly, picking the wrong keys, not optimizing our SQL queries correctly, or some subset of each. Our writes don't see to be affected unless the other operations take too long to run.
Here is an example SQL Query we use to get history data:
SELECT
rel_RASTER_TABLE.time_sec,
rel_RASTER_TABLE.time_nsec,
RAW_DATA_TABLE.transverse,
HEADER_TABLE.bytes,
RAW_DATA_TABLE.data
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
I apologize in advance for this being such a long question, but I've tapped out other resources and this is my last resort. I figure I'd try to be as descriptive as possible Do you guys see of any way I can improve upon our design at first glance? Or, anyway we can optimize our select and delete statements for such large data sets? We're currently running RHEL as the OS and unfortunately can't change our hardware configuration on the server (4 GB RAM, Quad Core). We're using C/C++ and the MySQL API. ANY speed improvements would be EXTREMELY beneficial. If you need me to clarify anything, please let me know. Thanks!
EDIT: BTW, if you can't provide specific help, maybe you can link me to some excellent tutorials you've come across for optimizing SQL queries, schema design, or MySQL tuning?
First thing you could try is de-normalizing the data. On a data set of that size, doing a join, even if you have indexes is going to require very intense computation. Turn those three tables into 1 table. Sure there will be duplicate data, but without joins it will be much easier to work with. Second thing, see if you can get a machine with enough memory to fit the whole table in memory. It doesn't cost much ($1000 or less) for a machine with 24GB of RAM. I'm not sure if that will hold your entire data set, but it will help tremendously Get an SSD as well. For anything that isn't stored in memory, an SSD should help you access it with high speed. And thirdly, look into other data storage technologies such as BigTable that are designed to deal with very large data sets.
I would say partitioning is an absolute must in a case like this:
large amount of data
new data coming in continuously
implicit: old data getting deleted continuously.
Check out this for mySQL.
Looking at your select stmt (which filters on time), I'll say partition on the time column.
Of course you might wanna add a few indexes based on the frequent queries you want to use.
--edit--
I see that many have suggested indexes. My experiences have been that having an index on a table with really large num of rows either kills the performance (eventually) or requires lot of resources (CPU, memory,...) to keep the indexes up to date.
So although I also suggest addition of indexes, please note that it's absolutely useless unless you partition the table first.
Finally, follow symcbean's advise (optimize your indexes in number and keys) when you add indexes.
--edit end--
A quickie on partitioning if you're new to it.
Usually a single table translates to a single data file. A partitioned table translates to one file per partition.
Advantages
insertions are faster as physically it's inserted into a smaller file (partition).
deletion of large number of rows would usually translate to dropping a partition (much much much much cheaper than 'delete from xxx where time > 100 and time < 200');
queries with a where clause on the key by which the table is partitioned is much much faster.
Index building is faster.
I don't have much experience with MySQL, but here are some a priori thoughts that jump to mind.
Is your select in a stored procedure?
The select's predicate is usually searched in the order its asked in. If the data on the disk is reordered to match the primary key, then doing raster id first is fine. You would be paying the cost of reordering on every insert though. If the data is stored in time order on disk, you would probably want to search on time_sec before raster_id.
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
Your indexes don't follow the search predicates.
It will create indexes based on the keys, generally.
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
It may not be using the primary index because you aren't using internal_id. You may want to set internal_id as the primary key and create a separate index based on your search parameters. At least on raster_id and time_sec.
Are the joins too loose?
This may be my inexperience with MySQL, but I expect to see conditions on the joins. Does using FROM here do a natural join? I don't see any foreign keys specified, so I don't know how it would join these tables rationally.
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
Usually when developing something like this I would work with a smaller set and remove predicates to makes sure that each step meets what I expect. If you accidentally cast a wide net up front, then narrow down later you may mask some inefficiencies.
Most query optimizers have a way to output how the optimized, make sure it meets your expectations. One of the comments mention Explain plans, I assume that is what it is called.
Without knowing what all the queries are its difficult to give specific advice, however looking at the single query you have provided, there are no indexes which are idealy suited to resolving this.
In fact the structure is a bit messy - if internal_id is an auto-increment value then it is unique - why add other stuff in the primary key? It looks as if a more sensible structure for rel_RASTER_TABLE would be:
PRIMARY KEY (`internal_id`),
KEY (`raster_id`,`time_sec`,`time_nsec`),
And as for RAW_DATA_TABLE, it should be blindingly obvious that its indexes are far from optimal. And should probably be:
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
KEY `time` (`time_sec`, `time_nsec`)
Note that removing redundant indexes will speed up inserts/updates.
Capturing slow queries should help - and learn how to use 'explain' to see what indexes are redundant / needed.
You may also get a performance boost by tuning the mysql instance - particularly increasing the sort and join buffers - try running mysqltuner
First, I would try to create a view with only the necessary info that needs to be selected between the different tables.
By the way, MySQL is not necessarily the most optimized database system for what you are trying to accomplish... Look into other solutions such Oracle, Microsoft SQL, PostgreSQL etc. Also, the performance will vary depending on the server being used.