Flask: streaming file with stream_with_context is very slow

The following code streams a Postgres BYTEA column to the browser:

from flask import Response, stream_with_context

@app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
    file = ZFile.query.filter_by(id=file_id).first()
    return Response(stream_with_context(file.data), mimetype=file.mime_type)

It is extremely slow (approx. 6 minutes for 5 MB).
I am downloading with curl from the same host, so the network is not the issue.
I can also extract the file from the psql console in less than a second, so the database side does not seem to be to blame either:

COPY (select f.data from z_file f where f.id = '4ec3rf') TO 'zazX.pdf' (FORMAT binary)
Update:
I have further evidence that the "fetch from the DB" step is not slow: if I write file.data to a file using

with open("/vagrant/zoz.pdf", 'wb') as output:
    output.write(file.data)

it also takes a fraction of a second. So the slowness is caused by the way Flask does the streaming.

I had this issue while using Flask to proxy a stream from another URL using python-requests.
In this use case, the trick is setting the chunk_size parameter in iter_content:

def flask_view():
    ...
    req = requests.get(url, stream=True, params=args)
    return Response(
        stream_with_context(req.iter_content(chunk_size=1024)),
        content_type=req.headers['content-type']
    )

Otherwise it will use chunk_size=1, which can slow things down quite a bit. In my case the streaming went from a couple of KB/s to several MB/s after increasing the chunk_size.

Flask can be given a generator that yields the whole data in a single chunk and will "know" how to deal with it; this returns in milliseconds:

from flask import Response, stream_with_context

@app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
    file = ZFile.query.filter_by(id=file_id).first()

    def single_chunk_generator():
        yield file.data

    return Response(stream_with_context(single_chunk_generator()), mimetype=file.mime_type)

stream_with_context, when given a bytes object (or any other sequence) instead of a generator, creates a generator that iterates through it element by element and runs various checks on every element, which causes a huge performance hit.
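If you do want to stream the column in pieces rather than as one big yield (for example, to bound the size of each chunk on very large files), a minimal sketch along these lines avoids the per-element overhead; the chunked_generator helper and the 64 KB chunk size are illustrative choices, not part of the original answer:

from flask import Response, stream_with_context

@app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
    file = ZFile.query.filter_by(id=file_id).first()

    def chunked_generator(data, chunk_size=64 * 1024):
        # Yield fixed-size slices of the bytes object instead of single bytes.
        for offset in range(0, len(data), chunk_size):
            yield data[offset:offset + chunk_size]

    return Response(stream_with_context(chunked_generator(file.data)),
                    mimetype=file.mime_type)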

Related

Django: Script that executes many queries runs massively slower when executed from Admin view than when executed from shell

I have a script that loops through the rows of an external csv file (about 12,000 rows) and executes a single Model.objects.get() query to retrieve each item from the database (final product will be much more complicated but right now it's stripped down to the barest functionality possible to try to figure this out).
For right now the path to the local csv file is hardcoded into the script. When I run the script through the shell using py manage.py runscript update_products_from_csv it runs in about 6 seconds.
The ultimate goal is to be able to upload the csv through the admin and then have the script run from there. I've already been able to accomplish that, but the runtime when I do it that way takes more like 160 seconds. The view for that in the admin looks like...
from .scripts import update_products_from_csv

class CsvUploadForm(forms.Form):
    csv_file = forms.FileField(label='Upload CSV')

@admin.register(Product)
class ProductAdmin(admin.ModelAdmin):
    # list_display, list_filter, fieldsets, etc.

    def changelist_view(self, request, extra_context=None):
        extra_context = extra_context or {}
        extra_context['csv_upload_form'] = CsvUploadForm()
        return super().changelist_view(request, extra_context=extra_context)

    def get_urls(self):
        urls = super().get_urls()
        new_urls = [path('upload-csv/', self.upload_csv),]
        return new_urls + urls

    def upload_csv(self, request):
        if request.method == 'POST':
            # csv_file = request.FILES['csv_file'].file
            # result_string = update_products_from_csv.run(csv_file)
            # I commented out the above two lines and added the below line to rule out
            # the possibility that the csv upload itself was the problem. Whether I execute
            # the script using the uploaded file or let it use the hardcoded local path,
            # the results are the same. It works, but takes more than 20 times longer
            # than executing the same script from the shell.
            result_string = update_products_from_csv.run()
            print(result_string)
            messages.success(request, result_string)
        return HttpResponseRedirect(reverse('admin:products_product_changelist'))
Right now the actual running parts of the script are about as simple as this...
import csv
from time import time

from apps.products.models import Product

CSV_PATH = 'path/to/local/csv_file.csv'

def run():
    csv_data = get_csv_data()
    update_data = build_update_data(csv_data)
    update_handler(update_data)
    return 'Finished'

def get_csv_data():
    with open(CSV_PATH, 'r') as f:
        return [d for d in csv.DictReader(f)]

def build_update_data(csv_data):
    update_data = []
    # Code that loops through csv data, applies some custom logic, and builds a list of
    # dicts with the data cleaned and formatted as needed
    return update_data

def update_handler(update_data):
    query_times = []
    for upd in update_data:
        iter_start = time()
        product_obj = Product.objects.get(external_id=upd['external_id'])
        # external_id is not the primary key but is an indexed field in the Product model
        query_times.append(time() - iter_start)
    # Code to export query_times to an external file for analysis
update_handler() has a bunch of other code checking field values to see if anything needs to be changed, and building the objects when a match does not exist, but that's all commented out right now. As you can see, I'm also timing each query and logging those values. (I've been dropping time() calls in various places all day and have determined that the query is the only part that's noticeably different.)
When I run it from the shell, the average query time is 0.0005 seconds and the total of all query times comes out to about 6.8 seconds every single time.
When I run it through the admin view and then check the queries in Django Debug Toolbar it shows the 12,000+ queries as expected, and shows a total query time of only about 3900ms. But when I look at the log of query times gathered by the time() calls, the average query time is 0.013 seconds (26 times longer than when I run it through the shell), and the total of all query times always comes out at 156-157 seconds.
The queries in Django Debug Toolbar when I run it through the admin all look like SELECT ••• FROM "products_product" WHERE "products_product"."external_id" = 10 LIMIT 21, and according to the toolbar they are mostly all 0-1ms. I'm not sure how I would check what the queries look like when running it from the shell, but I can't imagine they'd be different? I couldn't find anything in django-extensions runscript docs about it doing query optimizations or anything like that.
One additional interesting facet is that when running it from the admin, from the time I see result_string print in the terminal, it's another solid 1-3 minutes before the success message appears in the browser window.
I don't know what else to check. I'm obviously missing something fundamental, but I don't know what.
Somebody on Reddit suggested that running the script from the shell might be automatically spinning up a new thread where the logic can run unencumbered by the other Django server processes, and this seems to be the answer. If I run the script in a new thread from the admin view, it runs just as fast as it does when I run it from the shell.
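A minimal sketch of what the threaded version of the admin view might look like, building on the upload_csv view above; the message text and daemon=True flag are assumptions, since the poster only described the approach:

import threading

def upload_csv(self, request):
    # Replacement for the ProductAdmin.upload_csv method shown above.
    if request.method == 'POST':
        # Run the long import in its own thread so it is not slowed down by
        # the admin request/response machinery.
        worker = threading.Thread(target=update_products_from_csv.run, daemon=True)
        worker.start()
        messages.success(request, 'CSV import started in the background.')
    return HttpResponseRedirect(reverse('admin:products_product_changelist'))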

Flask - Generated PDF can be viewed but cannot be downloaded

I recently started learning Flask and created a simple webapp which randomly generates kids' math worksheets in PDF based on user input.
The PDF opens automatically in a browser and can be viewed, but when I try downloading it, both on a PC and in Chrome on iOS, I get error messages (Chrome PC: Failed - Network error / Chrome iOS: the file could not be downloaded at this time).
You can try it out here: kidsmathsheets.com
I suspect it has something to do with the way I'm generating and returning the PDF file. FYI I'm using ReportLab to generate the PDF. My code below (hosted on pythonanywhere):
from reportlab.lib.pagesizes import A4, letter
from reportlab.pdfgen import canvas
from reportlab.platypus import Table
from flask import Flask, render_template, request, Response
import io
from werkzeug import FileWrapper

# (inside the Flask view; the request-handling code this snippet sits in is elided)
# Other code to take in input and generate data
filename = io.BytesIO()
if letter_size:
    c = canvas.Canvas(filename, pagesize=letter)
else:
    c = canvas.Canvas(filename, pagesize=A4)
pdf_all(c, p_set, answer=answers, letter=letter_size)
c.save()
filename.seek(0)
wrapped_file = FileWrapper(filename)
return Response(wrapped_file, mimetype="application/pdf", direct_passthrough=True)
else:
    # belongs to an `if` in the elided request-handling code above
    return render_template('index.html')
Any idea what's causing the issue? Help is much appreciated!
Please check whether you are using an Ajax POST request to invoke the endpoint that generates your data and displays the PDF. If so, that quite probably causes the behaviour you observe. You might want to try invoking the endpoint with a GET request to /my-endpoint/some-hashed-non-reusable-id-of-my-document, where some-hashed-non-reusable-id-of-my-document tells the endpoint which document to serve without allowing users to play around with guesstimates about what other documents you might have. You might try it first like:
@app.route('/display-document/<document_id>')
def display_document(document_id):
    document = get_my_document_from_wherever_it_is(document_id)
    binary = get_binary_data_from_document(document)
    # .........
    # Prepare response here
    # .......
    return send_file(binary, mimetype="application/pdf")
Kind note: a right click and 'print to pdf' will work but this is not the solution we want
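For the ReportLab/BytesIO setup from the question, a minimal sketch of serving the buffer with send_file could look like the following; as_attachment and the worksheet.pdf download name are assumptions about the desired behaviour, not details from the original post (download_name requires Flask 2.0+; older versions use attachment_filename):

from io import BytesIO
from flask import send_file

@app.route('/worksheet/<sheet_id>')
def download_worksheet(sheet_id):
    buffer = BytesIO()
    # ... generate the PDF into `buffer` with ReportLab, as in the question ...
    buffer.seek(0)
    # as_attachment=True asks the browser to download instead of displaying inline.
    return send_file(buffer, mimetype='application/pdf',
                     as_attachment=True, download_name='worksheet.pdf')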

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a CSV file using the Python code below, with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 accounts in one lump sum, that is, without having to loop, or as a single batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain all ~119,000 accounts successfully (it takes about 10 minutes to download). I would like to minimize communication errors. Please advise if a better method exists or if a non-looping method is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
Since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of 5. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields,
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
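Putting the maxResults and fields suggestions back into the paging loop from the question, a sketch might look roughly like this (the field list is just the example set above, and error handling is omitted):

def get_accounts(self):
    students = []
    page_token = None
    params = {
        'customer': 'my_customer',
        'maxResults': 500,  # larger pages than the default of 100
        'fields': 'nextPageToken,users(primaryEmail,name,suspended)',
    }
    while True:
        if page_token:
            params['pageToken'] = page_token
        current_page = self.dir_api.users().list(**params).execute()
        students.extend(current_page.get('users', []))
        page_token = current_page.get('nextPageToken')
        if not page_token:
            break
    return students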

how to make flask pass a generator to task such as celery

I have a bunch of code that works correctly in Flask, but these requests can take over 30 minutes to finish. I am using chained generators with yields so my existing code can return output to the browser.
Since these tasks take 30 minutes or more to complete, I want to offload them but am at a loss. I have not successfully gotten Celery/RabbitMQ/Redis or any other combination to work correctly, and am looking for how I can accomplish this so my page returns right away and I can check in the background whether the task is complete.
Here is example code that works for now but takes 4 seconds of processing for the page to return.
I am looking for advice on how to get around this problem, can celery/redis or rabbitmq deal with generators like this? should I be looking at a different solution?
Thanks!
import time
import flask
from itertools import chain

class TestClass(object):
    def __init__(self):
        self.a = 4

    def first_generator(self):
        b = self.a + 2
        yield str(self.a) + '\n'
        time.sleep(1)
        yield str(b) + '\n'

    def second_generator(self):
        time.sleep(1)
        yield '5\n'

    def third_generator(self):
        time.sleep(1)
        yield '6\n'

    def application(self):
        return chain(tc.first_generator(),
                     tc.second_generator(),
                     tc.third_generator())

tc = TestClass()
app = flask.Flask(__name__)

@app.route('/')
def process():
    return flask.Response(tc.application(), mimetype='text/plain')

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)
Firstly, it's not clear what it would even mean to "pass a generator to Celery". The whole point of Celery is that it is not directly linked to your app: it's a completely separate thing, maybe even running on a separate machine, to which you pass some fixed data. You can of course pass the initial parameters and get Celery itself to call the functions that create the generators for processing, but you can't drip-feed data to Celery.
Secondly, this is not at all an appropriate use for Celery in any case. Celery is for offline processing. You can't get it to return stuff to a waiting request. The only thing you could do is have it save the results somewhere accessible to Flask, and then have your template fire an Ajax request to fetch those results when they are available.
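As a rough illustration of that pattern (a Celery task that consumes the generators offline, plus endpoints the page can poll via Ajax), a minimal sketch might look like this; the broker URL, task name, import path, and routes are all made up for the example:

from itertools import chain

from celery import Celery
from flask import Flask, jsonify

# TestClass is the class from the question; this import path is hypothetical.
from myapp.generators import TestClass

app = Flask(__name__)
celery = Celery(__name__, broker='redis://localhost:6379/0',
                backend='redis://localhost:6379/0')

@celery.task
def run_long_processing():
    # Consume the generators inside the worker and return plain, serialisable data.
    tc = TestClass()
    chunks = chain(tc.first_generator(), tc.second_generator(), tc.third_generator())
    return ''.join(chunks)

@app.route('/start')
def start():
    result = run_long_processing.delay()
    return jsonify({'task_id': result.id})

@app.route('/status/<task_id>')
def status(task_id):
    result = run_long_processing.AsyncResult(task_id)
    return jsonify({'ready': result.ready(),
                    'value': result.result if result.ready() else None})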

Django caching a large list

My django application deals with 25MB binary files. Each of them has about 100,000 "records" of 256 bytes each.
It takes me about 7 seconds to read the binary file from disk and decode it using python's struct module. I turn the data into a list of about 100,000 items, where each item is a dictionary with values of various types (float, string, etc.).
My django views need to search through this list. Clearly 7 seconds is too long.
I've tried using django's low-level caching API to cache the whole list, but that won't work because there's a maximum size limit of 1MB for any single cached item. I've tried caching the 100,000 list items individually, but that takes a lot more than 7 seconds - most of the time is spent unpickling the items.
Is there a convenient way to store a large list in memory between requests? Can you think of another way to cache the object for use by my django app?
To raise the item size limit from the default 1 MB to 10 MB, add

-I 10m

to /etc/memcached.conf and restart memcached.
Also edit the MemcachedCache class in memcached.py, located in /usr/lib/python2.7/dist-packages/django/core/cache/backends, to look like this:

class MemcachedCache(BaseMemcachedCache):
    "An implementation of a cache binding using python-memcached"
    def __init__(self, server, params):
        import memcache
        memcache.SERVER_MAX_VALUE_LENGTH = 1024 * 1024 * 10  # raised limit to accept 10 MB
        super(MemcachedCache, self).__init__(server, params,
                                             library=memcache,
                                             value_not_found_exception=ValueError)
I'm not able to add comments yet, but I wanted to share my quick fix for this problem, since I had the same issue with python-memcached behaving strangely when you change SERVER_MAX_VALUE_LENGTH at import time.
Besides the __init__ edit that FizxMike suggests, you can also override the _cache property in the same class. Doing so, you can instantiate the python-memcached Client and pass server_max_value_length explicitly, like this:
from django.core.cache.backends.memcached import BaseMemcachedCache

DEFAULT_MAX_VALUE_LENGTH = 1024 * 1024

class MemcachedCache(BaseMemcachedCache):
    def __init__(self, server, params):
        # options from the settings['CACHES'][connection] entry
        self._options = params.get("OPTIONS", {})
        import memcache
        memcache.SERVER_MAX_VALUE_LENGTH = self._options.get('SERVER_MAX_VALUE_LENGTH', DEFAULT_MAX_VALUE_LENGTH)
        super(MemcachedCache, self).__init__(server, params,
                                             library=memcache,
                                             value_not_found_exception=ValueError)

    @property
    def _cache(self):
        if getattr(self, '_client', None) is None:
            server_max_value_length = self._options.get("SERVER_MAX_VALUE_LENGTH", DEFAULT_MAX_VALUE_LENGTH)
            # one could optionally pass more parameters here through the OPTIONS settings;
            # simplified here for brevity
            self._client = self._lib.Client(self._servers,
                                            server_max_value_length=server_max_value_length)
        return self._client
I also prefer to create another backend that inherits from BaseMemcachedCache and use it instead of editing django code.
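If you go the custom-backend route, your settings point at that class instead of the stock one; a minimal sketch, assuming the subclass above is saved as myproject/cache_backends.py (the module path is hypothetical, and the SERVER_MAX_VALUE_LENGTH option name matches the code above):

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'myproject.cache_backends.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
        'OPTIONS': {
            'SERVER_MAX_VALUE_LENGTH': 1024 * 1024 * 10,  # allow items up to 10 MB
        },
    },
}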
here's the django memcached backend module for reference:
https://github.com/django/django/blob/master/django/core/cache/backends/memcached.py
Thanks for all the help on this thread!