I use ADO components in C++ Builder and I need to add about 200 000 records to my MS Access database. If I add those records one by one it takes a lot of time, so I wanted to use threads. Each thread would create a TADOTable, connect to the database and insert its own rows. But when I run the application, it is even slower than using only one thread!
So, how should I do it? I need to add many records to my Access database but want to avoid one-by-one insertion. A code example would be useful.
Thank you.
First of all, multithreading will not increase the speed of the inserts; it will slow them down because of context switching and related overhead. What you need is a way to do bulk inserts, that is, to send multiple rows in a single transaction.
Try searching for bulk inserts into Access tables. There is a lot of information out there.
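The question uses C++ Builder and ADO, but the single-transaction idea is easy to sketch in Python with pyodbc; the driver name, file path, table and columns below are only examples, not part of the original setup:

import pyodbc

# Example connection string; adjust the driver name and file path for your machine.
conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\mydb.accdb;"
)
cursor = conn.cursor()

# Sample data; replace with your real 200 000 records.
rows = [(i, f"name {i}") for i in range(200_000)]

# pyodbc keeps autocommit off by default, so all the inserts below
# go into one transaction and are committed together.
cursor.executemany("INSERT INTO mytable (id, name) VALUES (?, ?)", rows)
conn.commit()
conn.close()

The point is the single commit at the end: one transaction for the whole batch instead of one per row.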
We have a use case where we need to execute multiple UPDATE statements (each updating a different set of rows) on the same table in Snowflake, without the latency caused by the queueing of the UPDATE queries. Currently, a single UPDATE statement takes about 1 minute to execute, and all the other concurrent UPDATE statements (about 40 of them) are locked and queued, so the total time including the wait time is around 1 hour. The expected time is around 2 minutes, assuming all UPDATE statements execute at the same time and the warehouse size supports 40 queries running concurrently without any queueing.
What is the best solution to avoid this lock time? We've considered the following two options:
Make changes in the application code to batch all the update statements and execute them as one query - Not possible for our use case.
Have a separate table for each client (each update statement updates rows for a different client) - This way, all the update statements will execute on separate tables and there won't be any locks.
Is the second approach the best way to go, or is there any other workaround that would help us reduce the queueing latency?
This scenario is expected, since Snowflake locks the table during an update.
Option 1 is the ideal way to scale at the data-model level, but since you can't make it happen, you can go with option 2.
You can also put all the updates in one staging table and do the upsert in bulk - a delete and insert instead of an update, as sketched below. Check whether you can afford the delay.
But if you ask me, Snowflake should not be used for atomic updates; it works better with bulk upserts (delete and insert, and in bulk at that). Atomic updates will have limitations. Try a row-based store like Citus or MySQL if your use case allows it.
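If the staging-table route is an option, a minimal sketch using the Snowflake Python connector might look like this; the connection details, table and column names are all hypothetical:

import snowflake.connector

# Hypothetical connection parameters; fill in your own account details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# 1. All clients write their changed rows into one staging table
#    instead of issuing individual UPDATE statements.
# 2. A single bulk job then replaces the affected rows in the target table.
cur.execute("BEGIN")
cur.execute("""
    DELETE FROM target_table t
    USING staging_table s
    WHERE t.id = s.id
""")
cur.execute("""
    INSERT INTO target_table
    SELECT * FROM staging_table
""")
cur.execute("COMMIT")
cur.close()
conn.close()

This assumes the staging table has the same column layout as the target table; the lock is then taken once for the bulk job instead of once per client update.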
I am building a dashboard-like web app in Django and my view takes forever to load due to a relatively large database (a single table with 60,000 rows... and growing), the complexity of the queries, and quite a lot of number crunching and data manipulation in Python. According to the Django Debug Toolbar, the page needs 12 seconds to load.
To speed up the page loading time I thought about the following solution:
Build a view that is called automatically every night, completes all the complex queries, number crunching and data manipulation, and saves the results in a small lookup table in the database
Build a second view that returns the dashboard but retrieves the data from the small lookup table via a very simple query and hence loads much faster
Since the queries from the first view are executed every night, the data in the lookup table is always up to date
My questions: Does my idea make sense, and if yes, does anyone have any experience with such an approach? How can I write a view that gets called automatically every night?
I also read about caching, but with caching the first load of the page after a database update would still take a very long time, and the data in the database gets updated on a regular basis.
Yes, it is common practice.
We are pre-calculating some things and using Celery to run those tasks around midnight daily. For some of them we have a dedicated new model, but usually we add database columns to the existing model that contain the pre-calculated information.
This approach basically has nothing to do with views - you use them normally and just access the data differently.
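A rough sketch of the nightly pre-calculation with Celery beat; the models, fields, module names and aggregation are all made up, and a Celery app is assumed to be configured for the project already:

# tasks.py
from celery import shared_task
from django.db.models import Sum

from myapp.models import BigTable, DashboardSummary  # hypothetical models


@shared_task
def rebuild_dashboard_summary():
    """Run the heavy queries once and store the results in a small lookup table."""
    totals = (
        BigTable.objects
        .values("category")
        .annotate(total=Sum("amount"))
    )
    DashboardSummary.objects.all().delete()
    DashboardSummary.objects.bulk_create(
        DashboardSummary(category=row["category"], total=row["total"])
        for row in totals
    )


# celery.py - "app" is the project's Celery() instance; run the task around midnight
from celery.schedules import crontab

app.conf.beat_schedule = {
    "rebuild-dashboard-summary": {
        "task": "myapp.tasks.rebuild_dashboard_summary",
        "schedule": crontab(hour=0, minute=0),
    },
}

The dashboard view then only reads DashboardSummary, which is the "very simple query" from the question.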
Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a range of entries specified at that time will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
A power outage is possible; data loss in that case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases, etc.) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a pre-created statement:
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, datatable);
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice on whether this has any chance of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them: can I create views and/or triggers on the database (as the main schema) that will still work later when I connect to the database via ATTACH?
From what it looks like, an AFTER INSERT trigger will fire after every single inserted row. If it inserts rows into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow things down horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the SQLite databases in multi-threading mode and handle the insert/update/query/delete functions in separate threads. I prefer to transfer the query results to an STL container for processing.
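The answer above is phrased in C++/STL terms, but the pattern is language-independent. A minimal Python sqlite3 sketch with per-thread connections and a single writer thread (the table name and schema are made up) could look like this:

import queue
import sqlite3
import threading

DB_PATH = "storage.db"          # example path
write_q = queue.Queue()

def writer():
    """Single writer thread: owns its own connection and applies writes sequentially."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS datatable (idx INTEGER, payload TEXT)")
    while True:
        item = write_q.get()
        if item is None:                     # shutdown signal
            break
        sql, rows = item
        with conn:                           # one transaction per batch of rows
            conn.executemany(sql, rows)
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")   # keep the WAL file small
    conn.close()

def read_range(min_id, max_id):
    """Readers use their own connections; in WAL mode they do not block the writer."""
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT * FROM datatable WHERE rowid BETWEEN ? AND ?", (min_id, max_id)
    ).fetchall()
    conn.close()
    return rows

t = threading.Thread(target=writer)
t.start()
write_q.put(("INSERT INTO datatable VALUES (?, ?)", [(1, "a"), (2, "b")]))
write_q.put(None)
t.join()
print(read_range(1, 2))

Funneling all writes through one queue keeps insert order strictly sequential, which is roughly what the strand-based access in the question already guarantees.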
I am working on an API, and I have a question. I was looking into the usage of select_related() in order to save myself some database queries, and indeed it does help in reducing the number of database queries performed, at the expense of bigger and more complex queries.
My question is, does using select_related() cause heavier memory usage? Running some experiments I noticed that indeed this is the case, but I'm wondering why. Regardless of whether I use select_related(), the response will contain the exact same data, so why does the use of select_related() cause more memory to be used?
Is it because of caching? Maybe separate data objects are used to cache the same model instances? I don't know what else to think.
It's a tradeoff. It takes time to send a query to the database, for the database to prepare the results, and then to send those results back. select_related works off the principle that the most expensive part of this process is the request and response cycle, not the actual query, so it lets you combine what would otherwise have been distinct queries into just one, so that there is only one request and response instead of several.
However, if your database server is under-powered (not enough RAM, processing power, etc.), the larger query could actually end up taking longer than the request and response cycle. If that's the case, you probably need to upgrade the server, though, rather than not use select_related.
The rule of thumb is that if you need related data, you use select_related. If it's not actually faster, then that's a sign that you need to optimize your database.
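For illustration, a minimal example with made-up models showing what select_related changes:

from django.db import models

# Hypothetical models for illustration.
class Author(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    title = models.CharField(max_length=100)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)

# Without select_related: 1 query for the books plus 1 query per book for its author.
for book in Book.objects.all():
    print(book.author.name)

# With select_related: a single JOINed query. Every Book instance arrives with its
# Author already populated, which is also why the result set (and memory use) grows.
for book in Book.objects.select_related("author"):
    print(book.author.name)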
UPDATE (adding more explanation)
Querying a database actually involves multiple steps:
Application generates the query (negligible)
Query is sent to the database server (milliseconds to seconds)
Database processes the query (milliseconds to seconds)
Query results are sent back to application (milliseconds to seconds)
In a well-tuned environment (sufficient server resources, fast connections) the entire process finishes in mere milliseconds. However, steps 2 and 4 still usually take more time overall than step 3. This is why it makes more sense to send fewer, more complex queries rather than multiple simpler ones: the bottleneck is usually the transport layer, not the processing.
However, a poorly optimized database on an under-powered machine with large and complex tables could take a very long time to run the query, becoming the bottleneck itself. That would end up negating the decrease in time gained from sending one complex query instead of multiple simpler ones, i.e. the database would have responded more quickly to the simpler queries and the whole process would have taken less net time.
Nevertheless, if this is the case, the proper response is to fix the database side: optimize the database and its configuration, add more server resources, and so on, rather than reverting to sending multiple simple queries.
I think I've read somewhere that Django's ORM lazily loads objects. Let's say I want to update a large set of objects (say 500,000) in a batch-update operation. Would it be possible to simply iterate over a very large QuerySet, loading, updating and saving objects as I go?
Similarly, if I wanted to allow a paginated view of all of these thousands of objects, could I use the built-in pagination facility, or would I have to manually run a window over the data set with a query each time, because of the size of the QuerySet of all objects?
If you evaluate a 500,000-result queryset, which is big, it will all get cached in memory. Instead, you can use the iterator() method on your queryset, which will return the results as requested, without the huge memory consumption.
Also, use update() and F() objects to do simple batch updates in a single query.
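For example (MyModel, its fields and compute_new_value() are hypothetical):

from django.db.models import F

# A simple batch update pushed down to the database as a single UPDATE statement.
MyModel.objects.filter(active=True).update(counter=F("counter") + 1)

# When each object really needs Python-side processing, stream the rows instead of
# caching the whole queryset in memory.
for obj in MyModel.objects.all().iterator(chunk_size=2000):
    obj.counter = compute_new_value(obj)   # hypothetical helper
    obj.save(update_fields=["counter"])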
If the batch update is possible using a single SQL query, then I think using raw SQL queries or the Django ORM will not make a major difference. But if the update actually requires loading each object, processing the data and then updating it, you can either use the ORM or write your own SQL query and run update queries on each of the processed records; the overhead depends completely on the code logic.
The built-in pagination facility runs a LIMIT/OFFSET query (if you are doing it correctly), so I don't think there is major overhead in the pagination either.
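A quick sketch of how that looks with Django's built-in paginator (MyModel is hypothetical):

from django.core.paginator import Paginator

# Always paginate an ordered queryset so the window is stable between requests.
qs = MyModel.objects.order_by("id")
paginator = Paginator(qs, per_page=100)

page = paginator.page(3)        # runs a LIMIT 100 OFFSET 200 style query
for obj in page.object_list:    # only these 100 rows are loaded into memory
    print(obj.pk)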
I benchmarked this for my current project with a dataset of 2.5M records in one table.
I was reading information and counting records; for example, I needed to find the IDs of records whose "name" field was updated more than once in a certain timeframe. The Django benchmark used the ORM to retrieve all records and then iterate through them. The data was saved in a list for later processing. There was no debug output, except for printing the result at the end.
On the other side, I used MySQLdb, which executed the same queries (taken from Django) and built the same structure, using classes to store the data and saving the instances in a list for later processing. Again there was no debug output, except for printing the result at the end.
I found that:
                      without Django    with Django
execution time              x               10x
memory consumption          y               25y
And I was only reading and counting, without performing update/insert queries.
Try to investigate this question for yourself; the benchmark isn't hard to write and run.