I am using SQL Server CE 4.0 and am getting poor DELETE query performance.
My table has 300,000 rows in it.
My query is:
DELETE from tableX
where columnName1 = '<some text>' AND columnName2 = '<some other text>'
I am using a non-clustered index on the 2 fields columnName1 and columnName2.
I noticed that when the number of rows to delete is small (say < 2000), the index can help performance by 2-3X. However, when the number of rows to delete is larger (say > 15000), the index does not help at all.
My theory for this behavior is that when the number of rows is large, the index maintenance is killing the gains achieved by using the index (index seek instead of table scan). Is this correct?
Unfortunately, I can't get rid of the index because it significantly helps non-mutating query performance.
Also, what else can I do to improve the delete performance for the > 15,000 row case?
I am using SQL Server CE 4.0 on Windows 7 (32-bit).
My application is written in C++ and uses the OLE DB interface to manipulate the database.
There is something known as "the tipping point" where the cost of locating individual rows using a seek is not worth it, and it is easier to just perform a single scan instead of thousands of seeks.
A couple of things you may consider for performance:
have a filtered index, if those are supported in CE (I honestly have no idea)
instead of deleting 15,000 rows at once, batch the deletes into chunks.
consider a "soft delete" - where you simply update an active column to 0. Then you can actually delete the rows in smaller batches in the background. I mean, is a user really sitting around and waiting actively for you to delete 15,000+ rows? Why?
Related
I need to create an application that lets me get phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table with a specific region and income. Just running a SQL query won't help, because it takes a significant amount of time. The database updates once per day, and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers with specific conditions as fast as possible, O(1) in the best scenario? Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number, create something like a bitset: 0 if the particular condition is false and 1 if the condition is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers, iterate over the second vector and compare each bitset with the required one.
It's not O(1) at all, and I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique) or improve my idea with the vectors and masks.
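A minimal sketch of this idea, assuming the non-boolean columns are bucketed so each (column, bucket) pair gets its own bit (the names, bucket widths, and 512-bit budget here are made up for illustration):

#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kBits = 512;   // >= total number of (column, bucket) predicates
using Mask = std::bitset<kBits>;

struct Row {
    std::string phone;
    Mask        bits;                // 1 wherever the predicate holds for this row
};

// Example encoding: bits 0..99 one-hot region id, bits 100..131 income bucket, ...
// (assumes fewer than 100 regions and incomes small enough to fit the bit budget)
Mask encode(int region, long income)
{
    Mask m;
    m.set(region);
    m.set(100 + static_cast<std::size_t>(income / 10000));   // 10k-wide income buckets
    return m;
}

// Return phones whose bits contain every bit of the required mask.
std::vector<std::string> query(const std::vector<Row> &rows, const Mask &required)
{
    std::vector<std::string> out;
    for (const Row &r : rows)
        if ((r.bits & required) == required)   // r.bits is a superset of required
            out.push_back(r.phone);
    return out;                                // still O(n), but a cache-friendly O(n)
}

This still scans every row; to get closer to O(1) for the most popular (region, income) combinations, you could additionally precompute a map from those combinations to the matching phone-number vectors, along the lines of the std::unordered_map you mention.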
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
If you really want it to be fast, I think you should consider Elasticsearch. Think of every phone number in the DB as a document with properties (your columns).
You will need to reindex the table once a day (or in real time), but when it's time to search you just use an Elasticsearch filter to find the results.
Another option is to have an index on every column; in this case the engine will do an Index Merge to increase performance. I would also consider using MEMORY tables. If you write to this table, consider having a read replica just for reads.
To optimize your table, save your queries somewhere and add a composite index (over multiple columns) just for the top X most popular searches, depending on your memory limitations.
You can also use an NVMe drive as your DB disk (if you can't load the data into memory).
I am using Google Datastore to store multiple objects. Millions. At some point, I no longer want to keep storing rows in the database. The criterion for deletion: delete all rows older than 10 days.
I saw that Google provides two options for this job:
Send delete commands in batches. Of course, you have to GET all the IDs first. That sounds very slow when you have to remove millions of rows, and it's also expensive.
Use the Google Dataflow product, which provides an option to bulk-delete from Datastore. The problem here is just the price, which is high.
The problem with both options above is the pricing. I calculated that deleting 16M rows in a month would cost $480 (Datastore read operations + delete operations), which is too much money for a small task. On top of that you have to add the Dataflow operation costs.
It seems that there is no cheap option to delete data from Datastore. Am I wrong?
You don't have to read to delete. Deletes are based on keys, so all you need is to identify the keys. For this you can do a keys-only query, which is much cheaper (just one operation for the entire projection, although there may be a limit on how many keys can be fetched at a time with a projection query).
Also, how did you compute $480? As per
https://cloud.google.com/datastore/pricing
for a Multi-region, it costs $0.06 for 100,000 reads and $0.02 for 100,000 deletes. Using these numbers, I get the following for 16M.
16*10^6 * ( (1/1000) * 0.06/10^5 + 0.02 / 10^5) = $3.2096
Here the 1/1000 factor reflects a single read operation covering 1000 keys read using a keys-only query.
I'm using a QSqlQuery object to do a linear walk of a 10-million-row SQL table with a very complex six-column primary key, in sorted order. Because of the key (which I CANNOT change), breaking up my query SELECT * FROM table1 with < or > plus LIMIT causes a huge number of issues with the algorithm I'm using.
My problem is as follows: for whatever reason QSqlQuery seems to be caching the entire result set in memory until it hits a bad_alloc and kills the application. So I may read a couple of hundred rows, seek() over a couple of hundred thousand, and by that point QSqlQuery is using 300 MB of memory and my application dies. I read the docs, and it seems the only thing that can be done is to use setForwardOnly(); however, I often need previous() (which is why breaking up the query with LIMIT is a PITA).
Is there no way to cap the cache for QSqlQuery?
Why don't you store the previous rows yourself? There's QContiguousCache, which seems ideal for this.
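A rough sketch of that approach, combining setForwardOnly() with a sliding QContiguousCache window (WindowedWalker and the 10,000-row window are made-up names and numbers; size the cache to however far back previous() actually needs to reach):

#include <QContiguousCache>
#include <QSqlQuery>
#include <QSqlRecord>
#include <QString>

class WindowedWalker
{
public:
    explicit WindowedWalker(const QString &sql, int windowSize = 10000)
        : m_cache(windowSize), m_nextRow(0)
    {
        m_query.setForwardOnly(true);   // stop QSqlQuery from buffering the whole result set
        m_query.exec(sql);
    }

    // Fetch row 'row' if it is still inside the cached window; rows that have
    // scrolled out of the window are gone for good.
    bool record(int row, QSqlRecord &out)
    {
        while (!m_cache.containsIndex(row) && row >= m_nextRow) {
            if (!m_query.next())
                return false;                                   // past the end of the result set
            m_cache.insert(m_nextRow++, m_query.record());      // oldest rows are evicted at capacity
        }
        if (!m_cache.containsIndex(row))
            return false;                                       // already evicted from the window
        out = m_cache.at(row);
        return true;
    }

private:
    QSqlQuery m_query;                      // uses the default database connection
    QContiguousCache<QSqlRecord> m_cache;
    int m_nextRow;
};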
I'm parsing poker hand histories and storing the data in a PostgreSQL database.
I'm getting relatively bad performance, and parsing the files takes several hours. I can see that the database part takes 97% of the total program time, so even a little optimization would make this a lot quicker.
The way I have it set-up now is as follows:
1. Read the next file into a string.
2. Parse one game and store it in a GameData object.
3. For every player, check if we have his name in the std::map. If so, store the playerids in an array and go to 5.
4. Insert the player, add it to the std::map, and store the playerids in an array.
5. Using the playerids array, insert the moves for this betting round and store the moveids in an array.
6. Using the moveids array, insert a movesequence and store the movesequenceids in an array.
7. If this isn't the last round played, go to 5.
8. Using the movesequenceids array, insert a game.
9. If this was not the final game, go to 2.
10. If this was not the last file, go to 1.
Since I'm sending queries for every move, for every movesequence, for every game, I'm obviously doing too many queries. How should I bundle them for best performance? I don't mind rewriting a bit of code, so don't hold back. :)
Thanks in advance.
CX
It's very hard to answer this without any queries, schema, or a Pg version.
In general, though, the answer to these problems is to batch the work into bigger coarser batches to avoid repeating lots of work, and, most importantly, by doing it all in one transaction.
You haven't said anything about transactions, so I'm wondering if you're doing all this in autocommit mode. Bad plan. Try wrapping the whole process in a BEGIN and COMMIT. If it's a seriously long-running process, then COMMIT every few minutes / tens of games / whatever, write a checkpoint file or DB entry your program can use to resume the import from that point, and open a new transaction to carry on.
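For example, a checkpointed version of that loop might look roughly like this (libpq assumed, per the note further down; the checkpoint file name and 1000-game batch size are arbitrary):

#include <libpq-fe.h>
#include <cstdio>

void import_games(PGconn *conn, int ngames)
{
    PQclear(PQexec(conn, "BEGIN"));
    for (int g = 0; g < ngames; ++g) {
        // ... insert players / moves / movesequences / the game itself ...

        if ((g + 1) % 1000 == 0) {                  // checkpoint every 1000 games
            PQclear(PQexec(conn, "COMMIT"));
            if (FILE *ck = std::fopen("import.checkpoint", "w")) {
                std::fprintf(ck, "%d\n", g + 1);    // resume point if the import dies
                std::fclose(ck);
            }
            PQclear(PQexec(conn, "BEGIN"));
        }
    }
    PQclear(PQexec(conn, "COMMIT"));
}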
It'll help to use multi-valued inserts where you're inserting multiple rows to the same table. Eg:
INSERT INTO some_table(col1, col2, col3) VALUES
('a','b','c'),
('1','2','3'),
('bork','spam','eggs');
You can improve commit rates with synchronous_commit=off and a commit_delay, but that's not very useful if you're batching work into bigger transactions.
One very good option will be to insert your new data into UNLOGGED tables (PostgreSQL 9.1 or newer) or TEMPORARY tables (all versions, but lost when session disconnects), then at the end of the process copy all the new rows into the main tables and drop the import tables with commands like:
INSERT INTO the_table
SELECT * FROM the_table_import;
When doing this, CREATE TABLE ... LIKE is useful.
Another option - really a more extreme version of the above - is to write your results to CSV flat files as you read and convert them, then COPY them into the database. Since you're working in C++ I'm assuming you're using libpq - in which case you're hopefully also using libpqtypes. libpq offers access to the COPY API for bulk loading, so your app wouldn't need to call out to psql to load the CSV data once it'd produced it.
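A hedged sketch of that COPY path with libpq, streaming CSV lines straight from memory so there's no intermediate file (the moves_import table and its columns are invented for illustration):

#include <libpq-fe.h>
#include <stdexcept>
#include <string>
#include <vector>

void copy_moves(PGconn *conn, const std::vector<std::string> &csvLines)
{
    PGresult *res = PQexec(conn,
        "COPY moves_import (game_id, player_id, street, action, amount) "
        "FROM STDIN (FORMAT csv)");
    if (PQresultStatus(res) != PGRES_COPY_IN)
        throw std::runtime_error(PQerrorMessage(conn));
    PQclear(res);

    for (const std::string &line : csvLines) {
        const std::string row = line + "\n";
        if (PQputCopyData(conn, row.data(), static_cast<int>(row.size())) != 1)
            throw std::runtime_error(PQerrorMessage(conn));
    }

    if (PQputCopyEnd(conn, nullptr) != 1)           // nullptr = finish the COPY normally
        throw std::runtime_error(PQerrorMessage(conn));

    res = PQgetResult(conn);                        // final status of the COPY command
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        throw std::runtime_error(PQerrorMessage(conn));
    PQclear(res);
}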
I am trying to use SQLite (sqlite3) for a project to store hundreds of thousands of records (I'd like SQLite so users of the program don't have to run a [my]SQL server).
I sometimes have to update hundreds of thousands of records to enter left/right values (the data is hierarchical), but I have found the standard
update table set left_value = 4, right_value = 5 where id = 12340;
to be very slow. I have tried surrounding every thousand or so with
begin;
....
update...
update table set left_value = 4, right_value = 5 where id = 12340;
update...
....
commit;
but again, very slow. Odd, because when I populate it with a few hundred thousand (with inserts), it finishes in seconds.
I am currently testing the speed in Python (the slowness shows up both at the command line and in Python) before I move it to the C++ implementation, but right now this is way too slow and I need to find a new solution unless I am doing something wrong. Thoughts? (I would also take an open-source alternative to SQLite that is portable.)
Create an index on table.id
create index table_id_index on table(id)
Other than making sure you have an index in place, you can check out the SQLite Optimization FAQ.
Using transactions can give you a very big speed increase, as you mentioned, and you can also try turning off journaling.
Example 1:
2.2 PRAGMA synchronous
The Boolean synchronous value controls whether or not the library will wait for disk writes to be fully written to disk before continuing. This setting can be different from the default_synchronous value loaded from the database. In typical use the library may spend a lot of time just waiting on the file system. Setting "PRAGMA synchronous=OFF" can make a major speed difference.
Example 2:
2.3 PRAGMA count_changes
When the count_changes setting is ON, the callback function is invoked once for each DELETE, INSERT, or UPDATE operation. The argument is the number of rows that were changed. If you don't use this feature, there is a small speed increase from turning this off.
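Putting those pieces together in the sqlite3 C API: the index, the two PRAGMAs quoted above, one big transaction, and a reused prepared statement (fetch_next_node() is a hypothetical stand-in for wherever your left/right values come from):

#include <sqlite3.h>

struct Node { int id, left_value, right_value; };
bool fetch_next_node(Node &n);   // hypothetical source of the updated values

void update_tree(sqlite3 *db)
{
    sqlite3_exec(db, "PRAGMA synchronous=OFF;",   nullptr, nullptr, nullptr);
    sqlite3_exec(db, "PRAGMA count_changes=OFF;", nullptr, nullptr, nullptr);
    sqlite3_exec(db, "CREATE INDEX IF NOT EXISTS table_id_index ON \"table\"(id);",
                 nullptr, nullptr, nullptr);

    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "UPDATE \"table\" SET left_value = ?, right_value = ? WHERE id = ?;",
        -1, &stmt, nullptr);

    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    Node n;
    while (fetch_next_node(n)) {
        sqlite3_bind_int(stmt, 1, n.left_value);
        sqlite3_bind_int(stmt, 2, n.right_value);
        sqlite3_bind_int(stmt, 3, n.id);
        sqlite3_step(stmt);      // execute the UPDATE for this row
        sqlite3_reset(stmt);     // reuse the compiled statement for the next row
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_finalize(stmt);
}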