Django Mass Update/Insert Performance

I'm receiving financial data for approximately 5000 instruments every 5 seconds, and need to update the respective entries in the database. The model looks as follows:
class Market(models.Model):
    market = models.CharField(max_length=200)
    exchange = models.ForeignKey(Exchange, on_delete=models.CASCADE)
    ask = models.FloatField()
    bid = models.FloatField()
    lastUpdate = models.DateTimeField(default=timezone.now)
What needs to happen is the following:
After new financial data is received, check if an entry exists in the
database.
If the entry exists, update the ask, bid and lastUpdate fields
If the entry does not exist, create a new entry
My code looks as follows:
bi_markets = []
for item in dbMarkets:
    eItem = Market.objects.filter(exchange=item.exchange, market=item.market)
    if len(eItem) > 0:
        eItem.update(ask=item.ask, bid=item.bid)
    else:
        bi_markets.append(item)

# Bulk insert items that do not exist
Market.objects.bulk_create(bi_markets)
However, executing this takes way too long: approximately 30 seconds. I need to get it down to about 1 second. I know this is possible, as I do the same with custom SQL code in .NET in under 100 ms. Any ideas on how to improve the performance in Django?

If it’s this kind of performance you’re going for, I don’t see why you wouldn’t just break out into raw SQL. Bulk creating things that don’t exist yet sounds like the advanced SQL querying that Django isn’t really made for.
https://docs.djangoproject.com/en/2.0/topics/db/sql/
You can also do (sorry on mobile):
bi_markets = []
for item in dbMarkets:
    rows = Market.objects.filter(exchange=item.exchange, market=item.market).update(ask=item.ask, bid=item.bid)
    if rows == 0:
        bi_markets.append(item)

Market.objects.bulk_create(bi_markets)
Maybe that combination will generate better SQL, and it sidesteps the existence check as well (update() returns how many rows it changed).
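If you are on Django 2.2 or newer, QuerySet.bulk_update is another option worth trying: fetch the existing rows once, update them in memory, and write everything back in a handful of queries instead of one query per market. A rough sketch, assuming dbMarkets holds unsaved Market instances built from the incoming feed:
# Sketch, assuming Django 2.2+ (QuerySet.bulk_update) and that dbMarkets holds
# unsaved Market instances built from the feed.
from django.utils import timezone

incoming = {(m.exchange_id, m.market): m for m in dbMarkets}

existing = Market.objects.filter(
    exchange_id__in={key[0] for key in incoming},
    market__in={key[1] for key in incoming},
)

to_update = []
for row in existing:
    src = incoming.pop((row.exchange_id, row.market), None)
    if src is not None:
        row.ask, row.bid, row.lastUpdate = src.ask, src.bid, timezone.now()
        to_update.append(row)

Market.objects.bulk_update(to_update, ['ask', 'bid', 'lastUpdate'])
Market.objects.bulk_create(incoming.values())  # anything left over is new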

I've decided to split the update and create functionality. The create only happens when the app starts; from there on I do updates using a custom SQL script. See below. Working great.
updateQ = []
updateQ.append("BEGIN TRANSACTION;")
for dbItem in dbMarkets:
    eItem = tickers[dbItem.market]
    qStr = "UPDATE app_market SET ask = " + str(eItem['ask']) + ",bid = " + str(eItem['bid']) + " WHERE exchange_id = " + str(e.dbExchange.pk) + " AND market = " + '"' + dbItem.market + '";'
    updateQ.append(qStr)
updateQ.append("COMMIT;")

updateQFinal = ''.join(map(str, updateQ))
with connection.cursor() as cursor:
    cursor.executescript(updateQFinal)
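One caveat with this approach: building the statements by string concatenation is brittle (quoting breaks, and it is open to SQL injection if a market name ever contains unexpected characters), and executescript is SQLite-specific. A sketch of the same bulk UPDATE using parameter binding via executemany, which works across backends, assuming the same tickers/dbMarkets/e.dbExchange names from the snippet above:
# Sketch: same per-row UPDATE, but with parameter binding instead of string concatenation.
from django.db import connection, transaction

params = [
    (tickers[dbItem.market]['ask'],
     tickers[dbItem.market]['bid'],
     e.dbExchange.pk,
     dbItem.market)
    for dbItem in dbMarkets
]

with transaction.atomic(), connection.cursor() as cursor:
    cursor.executemany(
        "UPDATE app_market SET ask = %s, bid = %s"
        " WHERE exchange_id = %s AND market = %s",
        params,
    )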


How to add weeks to a datetime column, depending on a django model/dictionary?

Context
There is a dataframe of customer invoices and their due dates (identified by customer code).
Week(s) need to be added depending on the customer code.
A model is created to persist the list of customers and the week(s) to be added.
What I have done so far:
models.py
class BpShift(models.Model):
    bp_name = models.CharField(max_length=50, default='')
    bp_code = models.CharField(max_length=15, primary_key=True, default='')
    weeks = models.IntegerField(default=0)
helper.py
import datetime

from .models import BpShift

# used in views later
def week_shift(self, df):
    df['DueDateRange'] = df['DueDate'] + datetime.timedelta(
        weeks=BpShift.objects.get(pk=df['BpCode']).weeks)
I realised my understanding of DataFrames is seriously flawed: df['A'] and df['B'] return Series, so of course timedelta won't work like this (weeks=BpShift.objects.get(pk=df['BpCode']).weeks).
Dataframe
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d)
Customer List csv
BP Name,BP Code,Week(s)
Customer1,CA0023MY,1
Customer2,CA0064SG,1
Error
BpShift matching query does not exist.
Commentary
I used these methods in the hope that I would be able to change the dataframe all at once, instead of
using df.iterrows(). I have recently been avoiding for loops like the plague and am wondering if this
is the "correct" mentality. Is there any recommended way of doing this? Thanks in advance for any guidance!
This question, Python & Pandas: series to timedelta, will help take you from Series to timedelta. And although
pandas.Series(
    BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('weeks', flat=True)
)
will give you a Series of integers, I doubt the order is the same as in df['BpCode'], because that depends on the Django model's default ordering and the database backend.
So you might be better off explicitly creating not a Series but a DataFrame with pk and weeks columns, so you can use df.join. Something like this
pandas.DataFrame(
    BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('pk', 'weeks'),
    columns=['BpCode', 'weeks'],
)
should give you a DataFrame that you can join with.
So combined this should be the gist of your code:
django_response = [('customer1', 1), ('customer2', '2')]
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d).set_index('BpCode').join(
    pd.DataFrame(django_response, columns=['BpCode', 'weeks']).set_index('BpCode')
)
df['DueDate'] = pd.to_datetime(df['DueDate'])
df['weeks'] = pd.to_numeric(df['weeks'])
df['new_duedate'] = df['DueDate'] + df['weeks'] * pd.Timedelta('1W')
print(df)
             DueDate  weeks new_duedate
BpCode
customer1 2020-05-30      1  2020-06-06
customer2 2020-04-30      2  2020-05-14
You were right to want to avoid looping. This approach gets all the data in one SQL query from your Django model (using filter), then does a left join with the DataFrame you already have, casts the dates and weeks to the right types, and finally computes the new due date on whole columns instead of looping over them.
NB the left join will give NaN and NaT for customers that don't exist in your Django database. You can either avoid those rows by passing how='inner' to df.join, or handle them however you like.
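For example, the inner-join variant of the combined snippet above looks like this:
# Inner join: customers missing from the Django data are dropped instead of
# producing NaN/NaT rows.
df = pd.DataFrame(data=d).set_index('BpCode').join(
    pd.DataFrame(django_response, columns=['BpCode', 'weeks']).set_index('BpCode'),
    how='inner',
)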

Django - export excel make it faster with openpyxl

I am using Django REST framework and I am trying to export to Excel. My issue is that the process takes a long time to generate the Excel file.
The final file is about 1 MB with 20k rows, and generation takes about 8 minutes, which does not seem right.
Here is the view:
class GenerateExcelView(APIView):
    def get(self, request):
        filename = 'AllHours.xlsx'
        wb = Workbook()
        ws = wb.active
        ws.title = "Workbook"
        data = Report.objects.all()
        row_counter = 2
        for line in data:
            first_name = line.employee_id
            second_name = line.employee_name
            age = line.description
            ...
            ws['A{}'.format(row_counter)] = first_name
            ws['B{}'.format(row_counter)] = second_name
            ws['C{}'.format(row_counter)] = age
            ...
            row_counter += 1
        response = HttpResponse(save_virtual_workbook(wb), content_type='application/ms-excel')
        response["Content-Disposition"] = 'attachment; filename="' + filename + '"'
        return response
There are a few more columns... Is it possible to change the process so that it is a bit faster?
EDIT: I had the wrong indentation on the loop.
It tends to help a lot with performance to use prefetch_related on the queryset.
Given a table with 100 rows, each row having a foreign key to another table (in your example, the employee), your loop would fetch the report and then, for each of the 100 rows, the related objects. This is due to the lazy nature of the Django ORM. As you can see, we are already at over 100 queries... not so great.
If you use:
data = Report.objects.all().prefetch_related('employee')
it would do one extra query for all the related rows instead of one per report row.
That should improve the speed of your solution by quite a bit already.
See more: https://docs.djangoproject.com/en/3.1/ref/models/querysets/#prefetch-related
I have been wrestling with the same problem, and even after refactoring into raw SQL there is little improvement. The issue is the speed of openpyxl.
Their documentation suggests that using write-only mode helps, but I found it to be a small improvement at best: my benchmark on a report with 2 tabs, 18k rows on the second tab, showed a 50% reduction after the query refactor to SQL plus an openpyxl refactor to use write-only mode (which is a pain if you are doing cell formatting or special rows like headers and totals).
You can check their performance page here: https://openpyxl.readthedocs.io/en/stable/performance.html
... but I wouldn't get your hopes up.
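For reference, here is a rough sketch of what the write-only variant can look like when you only need plain rows (no per-cell formatting), combined with values_list so the ORM isn't instantiating full model objects. build_report is just a hypothetical helper name, and the column list is assumed from the question:
# Sketch: write-only workbook + values_list(); rows are streamed with append()
# instead of addressed cell by cell.
from openpyxl import Workbook
from openpyxl.writer.excel import save_virtual_workbook

def build_report():
    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title="Workbook")
    ws.append(["Employee ID", "Employee name", "Description"])  # header row

    rows = Report.objects.values_list('employee_id', 'employee_name', 'description')
    for row in rows.iterator():
        ws.append(row)

    return save_virtual_workbook(wb)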

Django compare values of two objects

I have a Django model that looks something like this:
class Response(models.Model):
    transcript = models.TextField(null=True)

class Coding(models.Model):
    qid = models.CharField(max_length=30)
    value = models.CharField(max_length=200)
    response = models.ForeignKey(Response)
    coder = models.ForeignKey(User)
For each Response object, there are two coding objects with qid = "risk", one for coder 3 and one for coder 4. What I would like to be able to do is get a list of all Response objects for which the difference in value between coder 3 and coder 4 is greater than 1. The value field stores numbers 1-7.
I realize in hindsight that setting up value as a CharField may have been a mistake, but hopefully I can get around that.
I believe something like the following SQL would do what I'm looking for, but I'd rather do this with the ORM:
SELECT DISTINCT c1.response_id FROM coding c1, coding c2
WHERE c1.coder_id = 3 AND
      c2.coder_id = 4 AND
      c1.qid = "risk" AND
      c2.qid = "risk" AND
      c1.response_id = c2.response_id AND
      c1.value - c2.value > 1
from django.db.models import F

qset = Coding.objects.filter(
    response__coding__value__gt=F('value') + 1,
    qid='risk', coder=4
).extra(
    where=['T3.qid = %s', 'T3.coder_id = %s'],
    params=['risk', 3]
)
responses = [c.response for c in qset.select_related('response')]
When you join to a table already in the query, the ORM will assign the second one an alias, in this case T3, which you can use in parameters to extra(). To find out what the alias is, you can drop into the shell and print qset.query.
See Django documentation on F objects and extra
Update: It seems you actually don't have to use extra(), or figure out what alias Django uses, because every time you refer to response__coding in your lookups, Django will use the alias created initially. Here's one way to look for differences in either direction:
from django.db.models import Q, F
gt = Q(response__coding__value__gt=F('value') + 1)
lt = Q(response__coding__value__lt=F('value') - 1)
match = Q(response__coding__qid='risk', response__coding__coder=4)
qset = Coding.objects.filter(match & (gt | lt), qid='risk', coder=3)
responses = [c.response for c in qset.select_related('response')]
See Django documentation on Q objects
BTW, if you are going to want both Coding instances, you have an N + 1 queries problem here, because Django's select_related() won't follow reverse FK relationships. But since you have the data in the query already, you could retrieve the required information using the T3 alias as described above and extra(select={'other_value': 'T3.value'}). The value from the corresponding Coding record would then be accessible as an attribute on the retrieved Coding instance, i.e. as c.other_value.
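In rough form (assuming the second Coding table really does get the T3 alias in your query, which you can verify by printing qset.query):
# Sketch: pull the other coder's value into each row via the T3 alias,
# reusing the match/gt/lt Q objects defined above.
qset = (Coding.objects
        .filter(match & (gt | lt), qid='risk', coder=3)
        .extra(select={'other_value': 'T3.value'})
        .select_related('response'))
# each Coding instance c in qset now exposes the joined value as c.other_value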
Incidentally, your question is general enough, but it looks like you have an entity-attribute-value schema, which in an RDB scenario is generally considered an anti-pattern. You might be better off long-term (and this query would be simpler) with a risk field:
class Coding(models.Model):
    response = models.ForeignKey(Response)
    coder = models.ForeignKey(User)
    risk = models.IntegerField()
    # other fields for other qid 'attribute' names...
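With a field like that, the comparison is still a self-join through the response, but the lookups get much simpler; a rough sketch assuming the flattened risk field above:
# Sketch assuming the flattened `risk` IntegerField above.
from django.db.models import F, Q

differs = (Q(response__coding__risk__gt=F('risk') + 1) |
           Q(response__coding__risk__lt=F('risk') - 1))
qset = Coding.objects.filter(differs, response__coding__coder=4, coder=3)
responses = [c.response for c in qset.select_related('response')]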

Improving Django performance with 350000+ records and a complex query

I have a model like this:
class Stock(models.Model):
    product = models.ForeignKey(Product)
    place = models.ForeignKey(Place)
    date = models.DateField()
    quantity = models.IntegerField()
I need to get the latest (by date) quantity for every product for every place,
with almost 500 products, 100 places and 350000 stock records on the database.
My current code is below; it worked in testing, but it takes so long with the real data that it's useless:
stocks = Stock.objects.filter(product__in=self.products,
                              place__in=self.places, date__lt=date_at)
stock_values = {}
for prod in self.products:
    for place in self.places:
        key = u'%s%s' % (prod.id, place.id)
        stock = stocks.filter(product=prod, place=place, date=date_at)
        if len(stock) > 0:
            stock_values[key] = stock[0].quantity
        else:
            try:
                stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
            except IndexError:
                stock_values[key] = 0
            else:
                stock_values[key] = stock.quantity
return stock_values
How would you make it faster?
Edit:
Rewrote the code as this:
stock_values = {}
for product in self.products:
    for place in self.places:
        try:
            stock_value = Stock.objects.filter(product=product, place=place, date__lte=date_at)\
                .order_by('-date').values('quantity')[0]['quantity']
        except IndexError:
            stock_value = 0
        stock_values[u'%s%s' % (product.id, place.id)] = stock_value
return stock_values
It works better (from 256 seconds down to 64), but I still need to improve it. Maybe some custom SQL, I don't know...
Arthur's right, the len(stock) isn't the most efficient way to do that. You could go further along the "easier to ask for forgiveness than permission" route with something like this inside the inner loop:
key = u'%s%s' % (prod.id, place.id)
try:
    stock = stocks.filter(product=prod, place=place, date=date_at)[0]
    quantity = stock.quantity
except IndexError:
    try:
        stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
        quantity = stock.quantity
    except IndexError:
        quantity = 0
stock_values[key] = quantity
I'm not sure how much that would improve it compared to just changing the length check, though I think this should at least restrict it to two queries with LIMIT 1 on them (see Limiting QuerySets).
Mind you, this is still performing a lot of database hits since you could run through that loop almost 50000 times. Optimize how you're looping and you're in a better position still.
Maybe the trick is in that len() method! From the docs:
Note: Don't use len() on QuerySets if all you want to do is determine
the number of records in the set. It's much more efficient to handle a
count at the database level, using SQL's SELECT COUNT(*), and Django
provides a count() method for precisely this reason. See count()
below.
So try changing the len() to count(), and see if it makes things faster!
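Applied to the snippet above, the existence check becomes something like this (exists() is an even lighter option when all you need is a yes/no answer):
# Let the database decide whether a matching row exists instead of
# pulling the whole result set into Python with len().
stock = stocks.filter(product=prod, place=place, date=date_at)
if stock.exists():          # or: if stock.count() > 0:
    stock_values[key] = stock[0].quantity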

fast lookup for the last element in a Django QuerySet?

I've a model called Valor. Valor has a Robot. I'm querying like this:
Valor.objects.filter(robot=r).reverse()[0]
to get the last Valor for the r robot. Valor.objects.filter(robot=r).count() is about 200000, and getting the last item takes about 4 seconds on my PC.
How can I speed it up? Am I querying the wrong way?
The optimal MySQL syntax for this problem would be something along the lines of:
SELECT * FROM table WHERE x=y ORDER BY z DESC LIMIT 1
The Django equivalent of this would be:
Valor.objects.filter(robot=r).order_by('-id')[:1][0]
Notice how this solution uses Django's queryset slicing to limit the queryset before compiling the list of objects.
If none of the earlier suggestions are working, I'd suggest taking Django out of the equation and running this raw SQL against your database. I'm guessing at your table names, so you may have to adjust accordingly:
SELECT * FROM valor v WHERE v.robot_id = [robot_id] ORDER BY id DESC LIMIT 1;
Is that slow? If so, make your RDBMS (MySQL?) explain the query plan to you. This will tell you if it's doing any full table scans, which you obviously don't want with a table that large. You might also edit your question and include the schema for the valor table for us to see.
Also, you can see the SQL that Django is generating by doing this (using the query set provided by Peter Rowell):
qs = Valor.objects.filter(robot=r).order_by('-id')[:1]
print qs.query
Make sure that SQL is similar to the 'raw' query I posted above. You can also make your RDBMS explain that query plan to you.
It sounds like your data set is going to be big enough that you may want to denormalize things a little bit. Have you tried keeping track of the last Valor object in the Robot object?
class Robot(models.Model):
    # ...
    last_valor = models.ForeignKey('Valor', null=True, blank=True)
And then use a post_save signal to make the update.
from django.db.models.signals import post_save

def record_last_valor(sender, **kwargs):
    if kwargs.get('created', False):
        instance = kwargs.get('instance')
        instance.robot.last_valor = instance
        instance.robot.save()

post_save.connect(record_last_valor, sender=Valor)
You will pay the cost of an extra db transaction when you create the Valor objects but the last_valor lookup will be blazing fast. Play with it and see if the tradeoff is worth it for your app.
Well, there's no order_by clause so I'm wondering about what you mean by 'last'. Assuming you meant 'last added',
Valor.objects.filter(robot=r).order_by('-id')[0]
might do the job for you.
Django 1.6 introduces .first() and .last():
https://docs.djangoproject.com/en/1.6/ref/models/querysets/#last
So you could simply do:
Valor.objects.filter(robot=r).last()
This should also be quite fast:
qs = Valor.objects.filter(robot=r)  # <-- doesn't hit the database
count = qs.count()                  # <-- first hit to the database, computes the count
last_item = qs[count - 1]           # <-- second hit to the database, fetches the specified row
So, in practice you execute only 2 SQL queries ;)
Model_Name.objects.first()  # to get the first element
Model_Name.objects.last()   # to get the last element
In my case, last() did not work because there was only one row in the database.
Maybe this is helpful for you too :)
Is there a limit clause in Django? That way you can have the db simply return a single record.
MySQL
select * from table where x = y limit 1
SQL Server
select top 1 * from table where x = y
Oracle
select * from table where x = y and rownum = 1
I realize this isn't translated into Django, but someone can come back and clean this up.
The correct way of doing this is to use the built-in QuerySet method latest(), feeding it whichever column (field name) it should sort by. The drawback is that it can only sort by a single db column.
The current implementation looks like this and is optimized in the same sense as #Aaron's suggestion.
def latest(self, field_name=None):
    """
    Returns the latest object, according to the model's 'get_latest_by'
    option or optional given field_name.
    """
    latest_by = field_name or self.model._meta.get_latest_by
    assert bool(latest_by), "latest() requires either a field_name parameter or 'get_latest_by' in the model"
    assert self.query.can_filter(), \
        "Cannot change a query once a slice has been taken."
    obj = self._clone()
    obj.query.set_limits(high=1)
    obj.query.clear_ordering()
    obj.query.add_ordering('-%s' % latest_by)
    return obj.get()
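Usage then looks like this (passing the field to order on explicitly, or setting get_latest_by in the model's Meta):
# Single query with ORDER BY ... DESC LIMIT 1 under the hood.
last_valor = Valor.objects.filter(robot=r).latest('id')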