I need to update 10,000 records and am wondering what the quickest way is. This is what I am doing currently:
I get all the records I need to update as a big list of SKUs, then loop through that list and update them. In the code below, in_both is my list of SKUs. I loop through it, get the matching records in ShopReconciliation, and update each one:
for line_item, sku in enumerate(in_both):
    got_products = ShopReconciliation.objects.filter(sku=sku, run_id=self.reconciliation_id)
    for got_product in got_products:
        got_product.in_es = True
        got_product.save()
This works, but it strikes me as not very quick, and I could have 60k records.
Is there a better way to handle that many records?
Thanks
cursor.executemany is supported by most data access libraries, such as cx_Oracle or pyodbc. Check your database and data access library for availability of the function.
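As a rough illustration of that pattern (not your actual schema), the sketch below uses the standard-library sqlite3 driver; pyodbc and cx_Oracle expose the same DB-API 2.0 call, though the parameter placeholder style varies per driver. The database file, table and column names are assumptions:

import sqlite3  # cx_Oracle and pyodbc expose the same executemany() call

# Placeholder values standing in for the question's data:
in_both = ["SKU-1", "SKU-2", "SKU-3"]
reconciliation_id = 42

conn = sqlite3.connect("reconciliation.db")  # assumed database file
cur = conn.cursor()

# One parameterised UPDATE sent with all parameter sets at once,
# instead of one round trip per record. Names are illustrative.
cur.executemany(
    "UPDATE shop_reconciliation SET in_es = ? WHERE sku = ? AND run_id = ?",
    [(True, sku, reconciliation_id) for sku in in_both],
)
conn.commit()
conn.close()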
In the end I used transaction.atomic (with the with statement).
That seemed to work; I just need to be careful if an error occurs before the commit.
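A minimal sketch of that approach, keeping the original loop (and its variable names, including self.reconciliation_id from the question's method) but wrapping it in a single transaction so all the saves share one commit:

from django.db import transaction

with transaction.atomic():  # everything inside commits (or rolls back) together
    for sku in in_both:
        got_products = ShopReconciliation.objects.filter(
            sku=sku, run_id=self.reconciliation_id
        )
        for got_product in got_products:
            got_product.in_es = True
            got_product.save()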
Question
What is the best way to update a column of a table with tens of millions of rows?
1) I have seen creating a new table and renaming the old one when finished.
2) I have seen updating in batches using a temp table.
3) I have seen a single transaction (I don't like this one, though).
4) I have never seen a cursor solution recommended for a problem like this, and I don't think it's worth trying.
5) I have read about loading data from a file (using BCP), but I haven't read whether the performance is better or not. It wasn't clear whether it is just for copying, or whether it would allow joining a big table with something and then bulk copying.
I would really like some advice here. Priority is performance.
At the moment I'm testing solution 2) and exploring solution 5).
Additional Information (UPDATE)
Thank you for the critical thinking here.
The operation will be done during downtime.
The UPDATE will not cause row forwarding.
All the tables have indexes, on average 5 per table, although a few tables have around 13 indexes.
The probability that the target column is present in one of the table's indexes is around 50%.
Some tables can be rebuilt and replaced; others can't, because they are part of a software solution and we might lose support for them.
Some of those tables have triggers.
I'll need to do this for more than 600 tables, of which ~150 range from 0.8 million to 35 million rows.
The update is always to the same column across the various tables.
References
BCP for data transfer
Actually it depends:
on the number of indexes the table contains
the size of the row before and after the UPDATE operation
type of UPDATE - would it be in place? does it need to modify the row length
does the operation cause row forwarding?
how big is the table?
how big would the transaction log of the UPDATE command be?
does the table contain triggers?
can the operation be done in downtime?
will the table be modified during the operation?
are minimal logging operations allowed?
would the whole UPDATE transaction fit in the transaction log?
can the table be rebuilt & replaced with a new one?
what was the timing of the operation on the test environment?
what about free space in the database - is there enough space for a copy of the table?
what kind of UPDATE operation is to be performed? do additional SELECT commands have to be run to calculate the new value for every row, or is it a static change?
Depending on the answers and on the results of the operation in the test environment, we could consider the fastest operations to be:
a minimally logged copy of the table
an in-place UPDATE operation, preferably in batches
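As a hedged sketch of the batched in-place approach (solution 2 above, without the temp table): the loop below drives SQL Server through pyodbc, uses UPDATE TOP (available since SQL Server 2005), and assumes illustrative connection, table and column names, plus the assumption that rows still holding the old value can be identified by comparing against the new one. Committing after each batch keeps any single transaction, and its log usage, small.

import pyodbc  # assumed driver; any SQL Server DB-API driver works similarly

# Connection string, table and column names are illustrative assumptions.
conn = pyodbc.connect("DSN=MyWarehouse", autocommit=False)
cur = conn.cursor()

NEW_VALUE = "new value"
while True:
    # Touch at most 50,000 rows per statement, then commit the batch.
    cur.execute(
        """
        UPDATE TOP (50000) dbo.BigTable
        SET TargetColumn = ?
        WHERE TargetColumn <> ? OR TargetColumn IS NULL
        """,
        NEW_VALUE, NEW_VALUE,
    )
    affected = cur.rowcount
    conn.commit()
    if affected == 0:  # nothing left to update
        break

conn.close()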
I have a fairly large production database system, based on a large hierarchy of nodes, each with 10+ associated models. If someone deletes a node fairly high in the tree, there can be thousands of models deleted, and if that deletion was a mistake, restoring them can be very difficult. I'm looking for a way to give me an easy 'undo' option.
I've tried using django-reversion, but it seems like in order to get the functionality I want (easily reverting a large cascade delete) it needs to store a bunch of information with each revision. When I created the initial revisions, the process was less than 10% done and already using 8GB in my database, which is not going to work for me.
So, is there a standard solution for this problem? Or a way to customize Django-reversions to fit my use case?
What you're looking for is called a soft delete. Add a column named deleted, defaulting to false, to the table. Now when you want to "delete" a row, set its deleted column to true instead. Update all the code not to show rows marked as deleted (or rename the database table and replace it with a view that filters them out). Change all the unique constraints to filtered ones (WHERE deleted = false) so users aren't blocked from adding something that collides only with a row they can't even see in the system.
As for the cascades you have two options. Either do an ON UPDATE trigger that will update the child rows or add the deleted column to the FK and define it as ON UPDATE CASCADE.
You'll get the whole revert functionality at the cost of one extra column (and of not being able to delete data to save space unless you do it manually).
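If you prefer to keep this at the Django level rather than using the trigger/FK-cascade approach described above, a minimal sketch of the same idea might look like the following; the model, manager and field names are illustrative, and the cascade is done in Python here:

from django.db import models

class NotDeletedManager(models.Manager):
    # Hide soft-deleted rows from normal queries.
    def get_queryset(self):
        return super().get_queryset().filter(deleted=False)

class Node(models.Model):
    parent = models.ForeignKey("self", null=True, blank=True,
                               related_name="children",
                               on_delete=models.CASCADE)
    deleted = models.BooleanField(default=False)

    objects = NotDeletedManager()   # default: only live rows
    all_objects = models.Manager()  # includes soft-deleted rows

    def soft_delete(self):
        # Flag this node and its whole subtree instead of deleting rows.
        self.deleted = True
        self.save(update_fields=["deleted"])
        for child in Node.all_objects.filter(parent=self):
            child.soft_delete()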
I have a Qt QAbstractItemModel, and the underlying information is inside an sqlite database. I want to incrementally add rows from a database query to the model as they are needed. The database fetch may be slow, however. My problem is that beginInsertRows() needs to be called before the model is modified, but it needs to know how many rows will be added. I won't know that until after I do the query. This means I seem to have the following alternatives, all of which are unattractive.
Do two database queries: "SELECT COUNT(*) …" to get the number of result rows, then call beginInsertRows(), and do the real query. The downside, of course, is that a potentially expensive query has to be done twice.
Do my entire query, buffer the results, count the rows, then call beginInsertRows() and insert them into the model. The downside is all the extra buffering.
Call beginInsertRows()/endInsertRows() once for each row in the result set. This is going to cause a whole bunch of unnecessary view updates.
This seems like a general problem to me. Is there a general solution? For instance, is there a way to tell beginInsertRows() one thing and then change your mind?
Thanks.
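For what it's worth, here is a minimal sketch of option 2 (buffer the results, then announce them with a single beginInsertRows() call), assuming PyQt5 and a simple list-backed model; the class and method names are illustrative:

import sqlite3
from PyQt5.QtCore import Qt, QAbstractListModel, QModelIndex

class QueryListModel(QAbstractListModel):
    def __init__(self, db_path, parent=None):
        super().__init__(parent)
        self._conn = sqlite3.connect(db_path)
        self._rows = []

    def rowCount(self, parent=QModelIndex()):
        return len(self._rows)

    def data(self, index, role=Qt.DisplayRole):
        if role == Qt.DisplayRole and index.isValid():
            return self._rows[index.row()]
        return None

    def append_from_query(self, sql, params=()):
        # Option 2: run the query once, buffer everything,
        # then notify the views with a single insert range.
        new_rows = [row[0] for row in self._conn.execute(sql, params)]
        if not new_rows:
            return
        first = len(self._rows)
        last = first + len(new_rows) - 1
        self.beginInsertRows(QModelIndex(), first, last)
        self._rows.extend(new_rows)
        self.endInsertRows()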
My Django application retrieves an RSS feed every day. I would like to persist the time the feed was last updated somewhere in the app. I'm only retrieving one feed, it will never grow to be multiple feeds. How can I persist the last updated time?
My ideas so far
Create a model and add a datetime field to it. This seems like overkill, as it adds another table to the database in which there will only ever be one row. Other than that, it's the most obvious and straightforward solution.
Create a settings object which just stores key/value mappings. The last updated date would just be a row in this table. This is essentially a generic version of the previous solution.
Use dbsettings/django-values, which allows you to store settings in the database. The last updated date would just be a 'setting'.
Any other ideas that I'm missing?
In spite of the fact that databases regularly store many rows in any given table, having a table with only one row is not especially costly, so long as you don't have (m)any indexes, which would waste space. In fact, most databases create many single-row tables to implement some features, like monotonic sequences used for generating primary keys. I encourage you to create a regular model for this.
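A hedged sketch of what that single-row model might look like; the model and field names are just one possible choice, and pinning pk=1 is one way to keep the table to a single row:

from django.db import models

class FeedStatus(models.Model):
    # Single-row table holding the last time the RSS feed was fetched.
    last_updated = models.DateTimeField()

    @classmethod
    def set_last_updated(cls, when):
        # Keep exactly one row: update it if it exists, create it otherwise.
        obj, _created = cls.objects.update_or_create(
            pk=1, defaults={"last_updated": when}
        )
        return obj

# Usage, e.g. after a successful fetch:
#   FeedStatus.set_last_updated(timezone.now())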
RAM is volatile, thus not persistent: memcached is not what you asked for.
XML is not the right technology to store a single value.
An RDBMS is not the right technology to store a single value.
The Django cache framework will not answer your question unless CACHE_BACKEND is set to file://...
The filesystem is the right technology to "persist a single value".
In settings.py:
import os

RSS_FETCH_DATETIME_PATH = os.path.join(
    os.path.abspath(os.path.dirname(__file__)),
    'rss_fetch_datetime'
)
In your RSS fetch script:
import time

from django.conf import settings

handler = open(settings.RSS_FETCH_DATETIME_PATH, 'w+')
handler.write(str(int(time.time())))
handler.close()
Wherever you need to read it:
from django.conf import settings

handler = open(settings.RSS_FETCH_DATETIME_PATH, 'r')
timestamp = int(handler.read())
handler.close()
But cron is the right tool if you want to "run a command every day", for example at 5AM:
0 5 * * * /path/to/manage.py runscript /path/to/retrieve/script
Of course, you can still write the last-update timestamp to a file at the end of the retrieval script and use it somewhere else, if that makes sense to you.
Concluding by quoting Ken Thompson:
One of my most productive days was throwing away 1000 lines of code.
One solution I've used in the past is to use Django's cache feature. You set a value to True with an expiration time of one day (in your case). If the value is not set, you fetch the feed; otherwise you don't do anything.
You can see my solution here: Importing your Flickr photos with Django
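A hedged sketch of that pattern with Django's cache API; the cache key and the fetch function are assumptions, not part of the answer above:

from django.core.cache import cache

ONE_DAY = 60 * 60 * 24

def fetch_feed():
    # Placeholder for your existing RSS retrieval code.
    pass

def fetch_feed_if_due():
    # If the flag is still cached, we fetched within the last day.
    if cache.get("rss_feed_fetched"):
        return
    fetch_feed()
    cache.set("rss_feed_fetched", True, timeout=ONE_DAY)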
If you need it only for caching purposes, why not store it in memcached?
On the other hand, if you use this data for other purposes (e.g. display it on the page, or to make some calculation, etc.), then I would store it in a new model - in Django, all persistence is built on top of the database, via models, and I would not try to use other "clever" solutions.
One thing I used to do when I was developing with PHP was to store the XML somewhere, but with a new tag inserted to hold the timestamp of the latest retrieval. It wasn't great, but it was quick and simple.
Keeping it simple would lead to the idea of just storing it in the file system ... why can't you do that? You could, for example, have a siteconfig module in one of your apps which held these sorts of data. This could load up data from a specific file, which could be text, JSON, ConfigParser, pickle or any suitable format. Just import siteconfig somewhere, and it can load the data and make it available to the other modules in your site. You could easily extend this to hold a dict-like object with a number of settings (e.g., if you ever have multiple feeds, but don't want to have a model just for 2-3 rows, you could easily hold the last-retrieved time for each feed in a dict keyed by feed URL).
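A rough sketch of such a siteconfig module, assuming JSON as the storage format; the module, file and key names are illustrative:

# siteconfig.py -- tiny file-backed key/value store (illustrative)
import json
import os

_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "siteconfig.json")

def load():
    # Return the stored settings, or an empty dict if the file doesn't exist yet.
    if not os.path.exists(_PATH):
        return {}
    with open(_PATH) as fh:
        return json.load(fh)

def save(data):
    with open(_PATH, "w") as fh:
        json.dump(data, fh)

# Usage elsewhere, e.g. keyed by feed URL:
#   import siteconfig
#   cfg = siteconfig.load()
#   cfg["http://example.com/feed"] = time.time()
#   siteconfig.save(cfg)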
Create a session key that persists forever, and update the feed timestamp every time you access it.