Optimization of bulk update/insert - django

I'm writing a web application that is going to show player statistics for an online game, using Django 1.6 and PostgreSQL 9.1. I've created a script using django-extensions' "runscript" which fetches all players that are online and inserts/updates them into my table. This script is executed 4 times per hour using cron. I need to either insert or update, since a player could already be in the table (and thus should be updated) or not be in it yet.
To my problem: there are around 25,000 players online at peak hours and I'm not really sure how I should optimize this (minimize disk I/O). This is how I've done it so far:
from datetime import date

from django.db import transaction

@transaction.commit_manually
def run():
    for fetched_player in FetchPlayers():
        defaults = {
            'level': fetched_player['level'],
            'server': fetched_player['server'],
            'last_seen': date.today(),
        }
        player, created = Player.objects.get_or_create(name=fetched_player['name'], defaults=defaults)
        if not created:
            player.level = fetched_player['level']
            if player.server != fetched_player['server']:
                # I save this info to another table
                player.server = fetched_player['server']
            player.last_seen = date.today()
            player.save()
    transaction.commit()
Would it be (considerably) faster to bypass Django and access the database using psycopg2 or similar? Would Django be confused when 'someone else' is modifying the database? Note that Django only reads the database; all writes are done by this script.
What about (using either Django or psycopg2) bulk fetching players from the database, updating those that were found, and then inserting those that were not found? If this is even possible? The query would get huge: 'SELECT * FROM player WHERE name = name[0] OR name = name[1] OR ... OR name[25000]'. :)

If you want to reduce the number of queries, here is what I suggest:
Call update() directly for each player, which returns the number of rows updated. If the count is 0 (meaning the player is new), put the player data in a temporary list. When you are done with all fetched players, use bulk_create() to insert all new players with one SQL statement. A sketch of this pattern is shown after the query counts below.
Assuming you have M+N players (M new, N updated), the number of queries is:
Before: (M+N) selects + M inserts + N updates
After: (M+N) updates + 1 bulk insert.
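A minimal sketch of that approach, reusing the Player model and fetched_player dicts from the question (it omits the 'server changed' bookkeeping for brevity):
new_players = []
for fetched_player in FetchPlayers():
    updated = Player.objects.filter(name=fetched_player['name']).update(
        level=fetched_player['level'],
        server=fetched_player['server'],
        last_seen=date.today(),
    )
    if updated == 0:
        # Not in the table yet; queue it for one bulk insert at the end.
        new_players.append(Player(
            name=fetched_player['name'],
            level=fetched_player['level'],
            server=fetched_player['server'],
            last_seen=date.today(),
        ))
Player.objects.bulk_create(new_players)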

Related

Django. Moving a redis query outside of a view. Making the query results available to all views

My Django app has read-only "dashboard" views of data in a Pandas DataFrame. The DataFrame is built from a redis database query.
Code snippet below:
import json

import pandas as pd
import redis

# Part 1. Get values from the redis database and load them into a DataFrame.
r = redis.StrictRedis(**redisconfig)
keys = r.keys(pattern="*")
keys.sort()
values = r.mget(keys)
values = [x for x in values if x is not None]
redisDataFrame = pd.DataFrame(map(json.loads, values))

# Part 2. Manipulate the DataFrame for display.
myViewData = redisDataFrame
# Manipulation of myViewData.
# Exact steps vary from one view to the next.
fig = myViewData.plot()
The code for part 1 (query redis) is inside every single view that displays that data. And the views have an update interval of 1 second. If I have 20 users viewing dashboards, the redis database is getting queried 20 times a second.
Because the query sometimes takes several seconds, Django spawns multiple threads, many of which hang and slow down the whole system.
I want to put part 1 (querying the redis database) into its own code block. Django will query redis (and build the redisDataFrame object) once per second. Each view will copy redisDataFrame into its own object, but it won't query the redis database over and over again. I think this will help performance.
I see some options for this, but I'm not sure what's the best option. Can you point me in the right direction?
-Custom context processor. I could put the 'Part 1' code into a custom context processor, using sched to execute once per second.
import time, sched

schedule = sched.scheduler(time.time, time.sleep)
r = redis.StrictRedis(**redisconfig)

def query_redis():
    keys = r.keys(pattern="*")
    keys.sort()
    values = r.mget(keys)
    values = [x for x in values if x is not None]
    return pd.DataFrame(map(json.loads, values))

# Build the module-level DataFrame and schedule the next refresh in one second.
redisDataFrame = query_redis()
schedule.enter(1, 1, query_redis, ())
from mysite.context_processors import redisDataFrame
...
myViewData = redisDataFrame
-Celery. I'm not familiar with this, but it's often recommended. That said, Celery uses redis as a "broker" between Python apps. If Celery writes to a redis database, that doesn't help my issue of improving access to redis.
I feel like this issue (multiple users accessing read-only DataFrames) is a common task that's easily solved. I just don't know how to solve it. Can you help?
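For what it's worth, a minimal sketch of the "query once per second, share with every view" idea using Django's cache framework; the cache key and the 1-second timeout here are assumptions, not something from the question:
from django.core.cache import cache

def get_redis_dataframe():
    # Any view can call this; redis is only re-queried when the cached
    # copy is more than one second old.
    df = cache.get('redis_dataframe')
    if df is None:
        df = query_redis()  # the Part 1 code from above
        cache.set('redis_dataframe', df, timeout=1)
    return df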

Django bulk_create with ignore rows that cause IntegrityError?

I am using bulk_create to load thousands of rows into a PostgreSQL DB. Unfortunately some of the rows are causing IntegrityError and stopping the bulk_create process. I was wondering if there was a way to tell Django to ignore such rows and save as much of the batch as possible?
This is now possible as of Django 2.2.
Django 2.2 adds a new ignore_conflicts option to the bulk_create method, from the documentation:
On databases that support it (all except PostgreSQL < 9.5 and Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. Enabling this parameter disables setting the primary key on each model instance (if the database normally supports it).
Example:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
], ignore_conflicts=True)
One quick-and-dirty workaround for this that doesn't involve manual SQL and temporary tables is to just attempt to bulk insert the data. If it fails, revert to serial insertion.
objs = [(Event), (Event), (Event)...]

try:
    Event.objects.bulk_create(objs)
except IntegrityError:
    for obj in objs:
        try:
            obj.save()
        except IntegrityError:
            continue
If you have lots and lots of errors this may not be so efficient (you'll spend more time serially inserting than doing so in bulk), but I'm working through a high-cardinality dataset with few duplicates so this solves most of my problems.
(Note: I don't use Django, so there may be more suitable framework-specific answers)
It is not possible for Django to do this by simply ignoring INSERT failures because PostgreSQL aborts the whole transaction on the first error.
Django would need one of these approaches:
1. INSERT each row in a separate transaction and ignore errors (very slow);
2. Create a SAVEPOINT before each insert (can have scaling problems);
3. Use a procedure or query to insert only if the row doesn't already exist (complicated and slow); or
4. Bulk-insert or (better) COPY the data into a TEMPORARY table, then merge that into the main table server-side.
The upsert-like approach (3) seems like a good idea, but upsert and insert-if-not-exists are surprisingly complicated.
Personally, I'd take (4): I'd bulk-insert into a new separate table, probably UNLOGGED or TEMPORARY, then I'd run some manual SQL to:
LOCK TABLE realtable IN EXCLUSIVE MODE;

INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
    SELECT 1 FROM realtable WHERE temptable.id = realtable.id
);
The LOCK TABLE ... IN EXCLUSIVE MODE prevents a concurrent insert from creating a row that conflicts with one inserted by the statement above and making it fail. It does not block concurrent SELECTs, only SELECT ... FOR UPDATE, INSERT, UPDATE and DELETE, so reads from the table carry on as normal.
If you can't afford to block concurrent writes for too long you could instead use a writable CTE to copy ranges of rows from temptable into realtable, retrying each block if it failed.
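A rough sketch of approach 4 driven from Django; the table names realtable and temptable are the placeholders from the SQL above, and the staging table is assumed to already hold the bulk-loaded rows:
from django.db import connection, transaction

def merge_staged_rows():
    # Merge rows from the staging table into the real table server-side,
    # skipping any row whose id already exists.
    with transaction.atomic():
        with connection.cursor() as cursor:
            cursor.execute("LOCK TABLE realtable IN EXCLUSIVE MODE")
            cursor.execute("""
                INSERT INTO realtable
                SELECT * FROM temptable WHERE NOT EXISTS (
                    SELECT 1 FROM realtable WHERE temptable.id = realtable.id
                )
            """)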
Or 5. Divide and conquer
I didn't test or benchmark this thoroughly, but it performs pretty well for me. YMMV, depending in particular on how many errors you expect to get in a bulk operation.
def psql_copy(records):
    count = len(records)
    if count < 1:
        return True
    try:
        pg.copy_bin_values(records)
        return True
    except IntegrityError:
        if count == 1:
            # Found the culprit!
            msg = "Integrity error copying record:\n%r"
            logger.error(msg % records[0], exc_info=True)
            return False
    finally:
        connection.commit()
    # There was an integrity error but we had more than one record.
    # Divide and conquer.
    mid = count // 2
    return psql_copy(records[:mid]) and psql_copy(records[mid:])
    # or just return False
Even in Django 1.11 there is no way to do this. I found a better option than using raw SQL: django-query-builder, which has an upsert method.
from querybuilder.query import Query
q = Query().from_table(YourModel)
# replace with your real objects
rows = [YourModel() for i in range(10)]
q.upsert(rows, ['unique_fld1', 'unique_fld2'], ['fld1_to_update', 'fld2_to_update'])
Note: the library only supports PostgreSQL.
Here is a gist that I use for bulk insert that supports ignoring IntegrityErrors and returns the records inserted.
Late answer for pre-Django 2.2 projects:
I ran into this situation recently and found my way out with a second list to check for uniqueness.
In my case, the model has a unique_together constraint, and bulk_create was throwing an IntegrityError because the list passed to bulk_create contained duplicate data.
So I decided to keep a check list alongside the bulk_create objects list. Here is the sample code; the unique keys are owner and brand, where owner is a user object instance and brand is a string:
create_list = []
create_list_check = []

for brand in brands:
    if (owner.id, brand) not in create_list_check:
        create_list_check.append((owner.id, brand))
        create_list.append(ProductBrand(owner=owner, name=brand))

if create_list:
    ProductBrand.objects.bulk_create(create_list)
This works for me. I use this function in a thread; my CSV file contains 120,907 rows.
def products_create():
    full_path = os.path.join(settings.MEDIA_ROOT, 'productcsv')
    filename = os.listdir(full_path)[0]
    logger.debug(filename)
    logger.debug(len(Product.objects.all()))
    if len(Product.objects.all()) > 0:
        logger.debug("Products Data Erasing")
        Product.objects.all().delete()
        logger.debug("Products Erasing Done")
    csvfile = os.path.join(full_path, filename)
    csv_df = pd.read_csv(csvfile, sep=',')
    csv_df['HSN Code'] = csv_df['HSN Code'].fillna(0)
    row_iter = csv_df.iterrows()
    logger.debug(row_iter)
    logger.debug("New Products Creating")
    for index, row in row_iter:
        Product.objects.create(part_number=row[0],
                               part_description=row[1],
                               mrp=row[2],
                               hsn_code=row[3],
                               gst=row[4],
                               )
    # products_list = [
    #     Product(
    #         part_number=row[0],
    #         part_description=row[1],
    #         mrp=row[2],
    #         hsn_code=row[3],
    #         gst=row[4],
    #     )
    #     for index, row in row_iter
    # ]
    # logger.debug(products_list)
    # Product.objects.bulk_create(products_list)
    logger.debug("Products uploading done")

How many rows were deleted?

Is it possible to check how many rows were deleted by a query?
queryset = MyModel.objects.filter(foo=bar)
queryset.delete()
deleted = ...
Or should I use transactions for that?
@transaction.commit_on_success
def delete_some_rows():
    queryset = MyModel.objects.filter(foo=bar)
    deleted = queryset.count()
    queryset.delete()
PHP + MySQL example:
mysql_query('DELETE FROM mytable WHERE id < 10');
printf("Records deleted: %d\n", mysql_affected_rows());
There are many situations where you want to know how many rows were deleted, for example if you do something based on how many rows were deleted. Checking it by performing a COUNT creates extra database load and is not atomic.
The queryset.delete() method immediately deletes the objects and returns the number of objects deleted along with a dictionary giving the number of deletions per object type.
Check the docs for more details: https://docs.djangoproject.com/en/stable/topics/db/queries/#deleting-objects
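For example (this return-value shape is what Django 1.9+ documents; the model and filter are taken from the question):
deleted, details = MyModel.objects.filter(foo=bar).delete()
# deleted -> total number of rows removed, e.g. 3
# details -> per-model breakdown, e.g. {'myapp.MyModel': 3}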
You can view the actual number of rows affected with SELECT ROW_COUNT().
First of all, qs.count() and cursor.rowcount are not the same thing!
In MySQL with InnoDB under REPEATABLE READ (the default mode), READ queries and WRITE queries see different snapshots of the data!
READ queries read from an old snapshot, while WRITE queries see the actual committed data, as if they were running in READ COMMITTED mode.
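A minimal sketch of reading the affected-row count from a raw cursor; the table name myapp_mymodel is only a guess at what Django would generate for MyModel:
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("DELETE FROM myapp_mymodel WHERE foo = %s", [bar])
    affected = cursor.rowcount  # rows the database reports as deleted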

fast lookup for the last element in a Django QuerySet?

I've a model called Valor. Valor has a Robot. I'm querying like this:
Valor.objects.filter(robot=r).reverse()[0]
to get the last Valor for the r robot. Valor.objects.filter(robot=r).count() is about 200000, and getting the last item takes about 4 seconds on my PC.
How can I speed it up? Am I querying the wrong way?
The optimal MySQL syntax for this problem would be something along the lines of:
SELECT * FROM table WHERE x=y ORDER BY z DESC LIMIT 1
The django equivalent of this would be:
Valor.objects.filter(robot=r).order_by('-id')[:1][0]
Notice how this solution utilizes django's slicing method to limit the queryset before compiling the list of objects.
If none of the earlier suggestions are working, I'd suggest taking Django out of the equation and run this raw sql against your database. I'm guessing at your table names, so you may have to adjust accordingly:
SELECT * FROM valor v WHERE v.robot_id = [robot_id] ORDER BY id DESC LIMIT 1;
Is that slow? If so, make your RDBMS (MySQL?) explain the query plan to you. This will tell you if it's doing any full table scans, which you obviously don't want with a table that large. You might also edit your question and include the schema for the valor table for us to see.
Also, you can see the SQL that Django is generating by doing this (using the query set provided by Peter Rowell):
qs = Valor.objects.filter(robot=r).order_by('-id')[:1]
print qs.query
Make sure that SQL is similar to the 'raw' query I posted above. You can also make your RDBMS explain that query plan to you.
It sounds like your data set is going to be big enough that you may want to denormalize things a little bit. Have you tried keeping track of the last Valor object in the Robot object?
class Robot(models.Model):
    # ...
    last_valor = models.ForeignKey('Valor', null=True, blank=True)
And then use a post_save signal to make the update.
from django.db.models.signals import post_save

def record_last_valor(sender, **kwargs):
    if kwargs.get('created', False):
        instance = kwargs.get('instance')
        instance.robot.last_valor = instance
        instance.robot.save()

post_save.connect(record_last_valor, sender=Valor)
You will pay the cost of an extra db transaction when you create the Valor objects but the last_valor lookup will be blazing fast. Play with it and see if the tradeoff is worth it for your app.
Well, there's no order_by clause so I'm wondering about what you mean by 'last'. Assuming you meant 'last added',
Valor.objects.filter(robot=r).order_by('-id')[0]
might do the job for you.
django 1.6 introduces .first() and .last():
https://docs.djangoproject.com/en/1.6/ref/models/querysets/#last
So you could simply do:
Valor.objects.filter(robot=r).last()
This should also be quite fast:
qs = Valor.objects.filter(robot=r)  # <-- doesn't hit the database
count = qs.count()                  # <-- first hit: compute the count
last_item = qs[count - 1]           # <-- second hit: fetch the specified row
So, in practice you execute only 2 SQL queries ;)
Model_Name.objects.first()  # to get the first element
Model_Name.objects.last()   # to get the last element
In my case, last() did not work because there was only one row in the database.
Maybe this is helpful for you too :)
Is there a limit clause in Django? This way you can have the DB simply return a single record.
mysql
select * from table where x = y limit 1
sql server
select top 1 * from table where x = y
oracle
select * from table where x = y and rownum = 1
I realize this isn't translated into django, but someone can come back and clean this up.
The correct way of doing this is to use the built-in QuerySet method latest() and feed it whichever column (field name) it should sort by. The drawback is that it can only sort by a single db column.
The current implementation looks like this and is optimized in the same sense as #Aaron's suggestion.
def latest(self, field_name=None):
    """
    Returns the latest object, according to the model's 'get_latest_by'
    option or optional given field_name.
    """
    latest_by = field_name or self.model._meta.get_latest_by
    assert bool(latest_by), "latest() requires either a field_name parameter or 'get_latest_by' in the model"
    assert self.query.can_filter(), \
        "Cannot change a query once a slice has been taken."
    obj = self._clone()
    obj.query.set_limits(high=1)
    obj.query.clear_ordering()
    obj.query.add_ordering('-%s' % latest_by)
    return obj.get()
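Usage is then simply (sorting by id here, on the assumption that higher ids are newer):
last_valor = Valor.objects.filter(robot=r).latest('id')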

Lots of queries from django foreignkey fields

I've been drooling over Django all day while coding up an internal website in record time, but now I'm noticing that something is very inefficient with my ForeignKeys in the model.
I have a model which has 6 ForeignKeys, which are basically lookup tables. When I query all objects and display them in a template, it's running about 10 queries per item. Here's some code, which ought to explain it better:
class Website(models.Model):
    domain_name = models.CharField(max_length=100)
    registrant = models.ForeignKey('Registrant')
    account = models.ForeignKey('Account')
    registrar = models.ForeignKey('Registrar')
    server = models.ForeignKey('Server', related_name='server')
    host = models.ForeignKey('Host')
    target_server = models.ForeignKey('Server', related_name='target')

class Registrant(models.Model):
    name = models.CharField(max_length=100)
...and 5 more simple tables. There are 155 Website records, and in the view I'm using:
Website.objects.all()
It ends up executing 1544 queries. In the template, I'm using all of the foreign fields, as in:
<span class="value">Registrant:</span> {{ website.registrant.name }}<br />
So I know it's going to run a lot of queries...but it seems like this is excessive. Is this normal? Should I not be doing it this way?
I'm pretty new to Django, so hopefully I'm just doing something stupid. It's definitely a pretty amazing framework.
You should use the select_related function, e.g.
Website.objects.select_related()
so that it will automatically do a join and follow all of those foreign keys when the query is performed, instead of loading them on demand as they are used. Django loads data from the database lazily, so by default you get the following behavior:
# one database query
website = Website.objects.get(id=123)
# first time account is referenced, so another query
print website.account.username
# account has already been loaded, so no new query
print website.account.email_address
# first time registrar is referenced, so another query
print website.registrar.name
and so on. If you use select_related, then a join is performed behind the scenes and all of these foreign keys are automatically followed and loaded on the first query, so only one database query is performed. So in the above example, you'd get
# one database query with a join and all foreign keys followed
website = Website.objects.select_related().get(id=123)
# no additional query is needed because the data is already loaded
print website.account.username
print website.account.email_address
print website.registrar.name
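If you prefer to be explicit about which relations to follow, select_related also accepts the relation names; a sketch using the fields from the model above:
websites = Website.objects.select_related(
    'registrant', 'account', 'registrar', 'server', 'host', 'target_server',
)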