Django - export excel make it faster with openpyxl - django

I am using Django restframework and I am trying to export excel. My issue is the process is take a lot of time till it generates the excel file.
The final file have about 1MB with 20k lines and the generation time is about 8 minutes and this does not seem right.
here is the view:
class GenerateExcelView(APIView):
filename = 'AllHours.xlsx'
wb = Workbook()
ws = wb.active
ws.title = "Workbook"
data = Report.objects.all()
row_couter = 2
for line in data:
first_name = line.employee_id
second_name = line.employee_name
age = line.description
...
ws['A{}'.format(row_counter)] = first_name
ws['B{}'.format(row_counter)] = second_name
ws['C{}'.format(row_counter)] = age
...
row_counter +=1
response = HttpResponse(save_virtual_workbook(wb), content_type='application/ms-excel')
response["Content-Disposition"] = 'attachment; filename="' + filename + '"'
return response
There are few more columns... Is it possible to change the process so it is a bit faster?
EDIT: I had wrong indentation of the loop.

It tends to help a lot with performance to use prefetch_related on the queryset.
Given a Table with a 100 rows each row having a foreign key to another table in you example the employee. Your loop would fetch the report then for each of the 100 rows the used relations. This is due to the lazy nature of the django ORM. As you can see we are already on at least 100 Queries... not so great.
If you would use:
data = Report.objects.all().prefetch_related('employee')
It would use one db query in stead of a hundred.
That should improve the speed of your solution by quite a bit already.
see more: https://docs.djangoproject.com/en/3.1/ref/models/querysets/#prefetch-related

I have been wrestling with the same problem, and even after refactoring into raw SQL there is little improvement. The issue is the speed of openpyxl.
Their documentation suggests that using write-only mode helps, but I found it to be a small improvement at best: my benchmark on a report with 2 tabs, 18k rows on the second tab, showed a 50% reduction after the query refactor to SQL plus an openpyxl refactor to use write-only mode (which is a pain if you are doing cell formatting or special rows like headers and totals).
You can check their performance page here: https://openpyxl.readthedocs.io/en/stable/performance.html
... but I wouldn't get your hopes up.

Related

is it possible to add db_index = True to a field that is not unique (django)

I have a model that has some fields like:
current_datetime = models.TimeField(auto_now_add=True)
new_datetime = models.DateTimeField(null=True, db_index=True)
and data would be like :
currun_date_time = 2023-01-22T09:42:00+0330 new_datetime =2023-01-22T09:00:00+0330
currun_date_time = 2023-01-22T09:52:00+0330 new_datetime =2023-01-22T09:00:00+0330
currun_date_time = 2023-01-22T10:02:00+0330 new_datetime =2023-01-22T10:00:00+0330
is it possible new_datetime to have db_index = True ?
the reason i want this index is there are many rows (more than a 200,000 and keep adding every day) and there is a place that user can choose datetime range and see the results(it's a statistical website). i want to send a query with that filtered datetime range so it should be done fast. by the way i am using postgresql
also if you have tips for handling data or sth. like that for such websites i would be glad too hear
thanks.
Yes, It is possible to have datetime field to be true. This could upgrade the performance of queries that sort or screen by the given field.
Other better ways to have an index in datetime field is:
To evaluate the query plan and detect any sluggish processes or
missing indexes, take advantage of the "explain" command of your
database.
Employ the "limit" and "offset" parameters within your queries to
get only the necessary data.
For retrieving associated data in a single query, rather than
numerous queries, incorporate the "select_related" and
"prefetch_related" methods in your Django queries.
To store the outcomes of elaborate queries and dodge running the
same query multiple times, make use of caching systems such as
Redis or Memcached.
Moreover, if there are too many rows and the data is not required
for a long period of time, you can contemplate filing the
information in another table or database.

how to get latest foreign key value in models.py

I have a little problem with getting latest foreign key value in my django app. Here are my two models:
class Stock(models.Model):
...
class Dividend(models.Model):
date = models.DateField('pay date')
stock = models.ForeignKey(Stock, related_name="dividends")
class Meta:
ordering = ["date"]
I would like to get latest dividend from stock object. So basically this - stock.dividends.latest('date'). However, everytime I call stock.dividends.latest('date'), it fires up sql query to get latest dividend. I have latest() method in for cycle for every stock I have. I would like to avoid these sql queries. May I somehow define new method in class Stock that would get latest dividend within sql query for stock object?
I cannot change default ordering from "date" to "-date".
Using select_related('dividends') loads dividends objects with stock, but latest probably uses order_by and it requires sql query anyway. :(
EDIT1: To make more clear what I want, here is an example. Let's say I have 100 symbols in shares.keys():
for stock in Stock.objects.filter(symbol__in=shares.keys()): # 1 sql query
latest_dividend = stock.dividends.latest('date') # 100 sql queries
... #do something with latest dividend
Well and in some cases I might have 500 symbols in shares.keys(). That is why I need to avoid making sql queries on getting latest dividend for stock.
I have the same problem with you, so I tested many Django queries. Finally, I found out that we can use this:
Stock.objects.all().annotate(latest_date=Max('dividends__date')).filter(dividends__date=F('latest_date')).values('dividends')
I'm not sure my solution is the best, but here it is (works only with PostgreSQL):
stocks = list(Stock.objects.filter(**something))
dividends = Dividend.objects.filter(
stock__in=stocks,
).order_by(
'stock_id',
'-date'
).distinct(
'stock_id',
)
dividends_dict = {d.stock_id: d for d in dividends}
for stock in stocks:
stock.latest_dividend = dividends_dict.get(stock.id)
I'm a little confused by your question, I'm assuming you are trying to access the dividends from your stock object in order to limit your queries to the database. I believe that is the least number queries of possible.
stock_options = stock.objects.get(pk=your_query)
order_options = stock.dividend_set.order_by('-date')[:5]
likeon: Thanks for your answer. But I think I can avoid initializing that large dictionary (I have 5000 stocks and 280 000 dividends). But your list gave me an idea. Your code requires 2 sql queries. Here is my example (EDIT1).
for stock in Stock.objects.filter(symbol__in=shares.keys())\
.prefetch_related('dividends'): # 2 sql queries
latest_dividend = list(stock.dividends.all())[-1] # 0 sql queries
... #do something with latest_dividend
My code also requires 2 sql queries, but I do not have to reorder it and create list from stocks and all 280 000 dividends (I only create dict from current stock dividends every cycle). May be creating one dict is quicker than creating len(shares.keys()) dicts, not sure.
I thought there would be easier solution (avoid creating list/dictionary from dividends), but this is good enough for now. Thanks for answers!
As long as I understood you can do it this way:
stock.dividends.last()
as implementation in Django is like this:
def first(self):
"""Return the first object of a query or None if no match is found."""
for obj in (self if self.ordered else self.order_by('pk'))[:1]:
return obj
Also, you can use .latest(*fields, field_name=None) too.

Django queryset aggregate by time interval

Hi I am writing a Django view which ouputs data for graphing on the client side (High Charts). The data is climate data with a given parameter recorded once per day.
My query is this:
format = '%Y-%m-%d'
sd = datetime.datetime.strptime(startdate, format)
ed = datetime.datetime.strptime(enddate, format)
data = Climate.objects.filter(recorded_on__range = (sd, ed)).order_by('recorded_on')
Now, as the range is increased the dataset obviously gets larger and this does not present well on the graph (aside from slowing things down considerably).
Is there an way to group my data as averages in time periods - specifically average for each month or average for each year?
I realize this could be done in SQL as mentioned here: django aggregation to lower resolution using grouping by a date range
But I would like to know if there is a handy way in Django itself.
Or is it perhaps better to modify the db directly and use a script to populate month and year fields from the timestamp?
Any help much appreciated.
Have you tried using django-qsstats-magic (https://github.com/kmike/django-qsstats-magic)?
It makes things very easy for charting, here is a timeseries example from their docs:
from django.contrib.auth.models import User
import datetime, qsstats
qs = User.objects.all()
qss = qsstats.QuerySetStats(qs, 'date_joined')
today = datetime.date.today()
seven_days_ago = today - datetime.timedelta(days=7)
time_series = qss.time_series(seven_days_ago, today)
print 'New users in the last 7 days: %s' % [t[1] for t in time_series]

Bulk updating a table

I want to update a customer table with a spreadsheet from our accounting system. Unfortunately I can't just clear out the data and reload all of it, because there are a few records in the table that are not in the imported data (don't ask).
For 2000 records this is taking about 5 minutes, and I wondered if there was a better way of doing it.
for row in data:
try:
try:
customer = models.Retailer.objects.get(shared_id=row['Customer'])
except models.Retailer.DoesNotExist:
customer = models.Retailer()
customer.shared_id = row['Customer']
customer.name = row['Name 1']
customer.address01 = row['Street']
customer.address02 = row['Street 2']
customer.postcode = row['Postl Code']
customer.city = row['City']
customer.save()
except:
print formatExceptionInfo("Error with Customer ID: " + str(row['Customer']))
Look at my answer here: Django: form that updates X amount of models
The QuerySet has update() method - rest is explained in above link.
I've had some success using this bulk update snippet:
http://djangosnippets.org/snippets/446/
It's a bit outdated, but it worked on django 1.1, so I suppose you can still make it work. If you are looking for a quick way to do a one time bulk insert, this is the quickest (I'm not sure I'd trust it for regular use without seriously testing performance).
I've made a terribly crude attempt on a solution for this problem, but it's not finished yet and it doesn`t support working with django orm objects directly - yet.
http://pypi.python.org/pypi/dse/0.1.0
It`s not been properly testet and let me know if you have any suggestions on how to improve it. Using the django orm to do stuff like this is terrible.
Thomas

fast lookup for the last element in a Django QuerySet?

I've a model called Valor. Valor has a Robot. I'm querying like this:
Valor.objects.filter(robot=r).reverse()[0]
to get the last Valor the the r robot. Valor.objects.filter(robot=r).count() is about 200000 and getting the last items takes about 4 seconds in my PC.
How can I speed it up? I'm querying the wrong way?
The optimal mysql syntax for this problem would be something along the lines of:
SELECT * FROM table WHERE x=y ORDER BY z DESC LIMIT 1
The django equivalent of this would be:
Valor.objects.filter(robot=r).order_by('-id')[:1][0]
Notice how this solution utilizes django's slicing method to limit the queryset before compiling the list of objects.
If none of the earlier suggestions are working, I'd suggest taking Django out of the equation and run this raw sql against your database. I'm guessing at your table names, so you may have to adjust accordingly:
SELECT * FROM valor v WHERE v.robot_id = [robot_id] ORDER BY id DESC LIMIT 1;
Is that slow? If so, make your RDBMS (MySQL?) explain the query plan to you. This will tell you if it's doing any full table scans, which you obviously don't want with a table that large. You might also edit your question and include the schema for the valor table for us to see.
Also, you can see the SQL that Django is generating by doing this (using the query set provided by Peter Rowell):
qs = Valor.objects.filter(robot=r).order_by('-id')[0]
print qs.query
Make sure that SQL is similar to the 'raw' query I posted above. You can also make your RDBMS explain that query plan to you.
It sounds like your data set is going to be big enough that you may want to denormalize things a little bit. Have you tried keeping track of the last Valor object in the Robot object?
class Robot(models.Model):
# ...
last_valor = models.ForeignKey('Valor', null=True, blank=True)
And then use a post_save signal to make the update.
from django.db.models.signals import post_save
def record_last_valor(sender, **kwargs):
if kwargs.get('created', False):
instance = kwargs.get('instance')
instance.robot.last_valor = instance
post_save.connect(record_last_valor, sender=Valor)
You will pay the cost of an extra db transaction when you create the Valor objects but the last_valor lookup will be blazing fast. Play with it and see if the tradeoff is worth it for your app.
Well, there's no order_by clause so I'm wondering about what you mean by 'last'. Assuming you meant 'last added',
Valor.objects.filter(robot=r).order_by('-id')[0]
might do the job for you.
django 1.6 introduces .first() and .last():
https://docs.djangoproject.com/en/1.6/ref/models/querysets/#last
So you could simply do:
Valor.objects.filter(robot=r).last()
Quite fast should also be:
qs = Valor.objects.filter(robot=r) # <-- it doesn't hit the database
count = qs.count() # <-- first hit the database, compute a count
last_item = qs[ count-1 ] # <-- second hit the database, get specified rownum
So, in practice you execute only 2 SQL queries ;)
Model_Name.objects.first()
//To get the first element
Model_name.objects.last()
//For get last()
in my case, the last is not work because there is only one row in the database
maybe help full for you too :)
Is there a limit clause in django? This way you can have the db, simply return a single record.
mysql
select * from table where x = y limit 1
sql server
select top 1 * from table where x = y
oracle
select * from table where x = y and rownum = 1
I realize this isn't translated into django, but someone can come back and clean this up.
The correct way of doing this, is to use the built-in QuerySet method latest() and feeding it whichever column (field name) it should sort by. The drawback is that it can only sort by a single db column.
The current implementation looks like this and is optimized in the same sense as #Aaron's suggestion.
def latest(self, field_name=None):
"""
Returns the latest object, according to the model's 'get_latest_by'
option or optional given field_name.
"""
latest_by = field_name or self.model._meta.get_latest_by
assert bool(latest_by), "latest() requires either a field_name parameter or 'get_latest_by' in the model"
assert self.query.can_filter(), \
"Cannot change a query once a slice has been taken."
obj = self._clone()
obj.query.set_limits(high=1)
obj.query.clear_ordering()
obj.query.add_ordering('-%s' % latest_by)
return obj.get()