Large file upload to Django Rest Framework - django

I'm trying to upload a big file (4 GB) with a PUT to a DRF viewset.
During the upload, memory usage is stable. Once the upload reaches 100%, the Python runserver process takes more and more RAM and is killed by the kernel. I have a logging line in the put method of this APIView, but the process is killed before that method is ever called.
I use this setting to force uploads to a temporary file: FILE_UPLOAD_HANDLERS = ["django.core.files.uploadhandler.TemporaryFileUploadHandler"]
Where does this memory peak come from? I guess something tries to load the whole file content into memory, but why (and where)?
More information:
I tried with DEBUG both True and False
The runserver is in a Docker container behind Traefik, but AFAIK there is no size limit in Traefik, and the upload does reach 100%
I do not know yet if I would get the same behavior with Daphne instead of runserver
EDIT: the frontend uses a Content-Type of multipart/form-data
EDIT: I have tried FileUploadParser and (FormParser, MultiPartParser) for parser_classes in my APIView

TL;DR:
This is neither a DRF nor a Django issue; it's a Daphne issue that has been known for 2.5 years. The solution is to use uvicorn, hypercorn, or something else for the time being.
Explanations
What you're seeing here is not coming from Django Rest Framework, because:
The FileUploadParser is meant to handle large file uploads, as it reads the file chunk by chunk;
Your view not being executed rules out the parsers, which aren't run until you access the request.FILES property.
The fact that you're mentioning Daphne reminds me of this SO answer, which mentions a similar problem and points to code showing that Daphne doesn't handle large file uploads: it loads the whole request body into RAM before passing it to the view. (The code is still present in their master branch at the time of writing.)
You're seeing the same behavior with runserver because, when installed, Daphne replaces the initial runserver command with itself to provide WebSocket support for dev purposes.
To make sure that it's the real culprit, try to disable Channels / run the default Django runserver and see for yourself whether your app is still killed by the OOM killer.
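For reference, here is a minimal sketch of a DRF view that accepts a large upload and writes it to disk chunk by chunk; the view name, the "file" form field, and the target directory are illustrative, not taken from the question:

import os
from rest_framework.parsers import MultiPartParser
from rest_framework.response import Response
from rest_framework.views import APIView

class LargeUploadView(APIView):
    # MultiPartParser defers body parsing until request.FILES is accessed
    parser_classes = [MultiPartParser]

    def put(self, request, filename, format=None):
        upload = request.FILES["file"]  # a TemporaryUploadedFile with TemporaryFileUploadHandler
        destination = os.path.join("/tmp/uploads", filename)  # illustrative path
        with open(destination, "wb") as out:
            for chunk in upload.chunks():  # streams from the temp file, never the whole body in RAM
                out.write(chunk)
        return Response({"received_bytes": upload.size}, status=201)

With a working server in front (uvicorn, hypercorn, gunicorn, ...), a view like this keeps memory flat regardless of file size.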

I don't know if it works with Django REST Framework, but you can try to write the file in chunks.
[...]
anexo_files = request.FILES.getlist('anexo_file_' + str(k))
index = 0
for file in anexo_files:
    index = index + 1
    extension = os.path.splitext(str(file))[1]
    nome_arquivo_anexo = ('media/uploads/' + os.path.splitext(str(file))[0] + "_" + str(index)
                          + datetime.datetime.now().strftime("%m%d%Y%H%M%S") + extension)
    handle_uploaded_file(file, nome_arquivo_anexo)
    AnexoProjeto.objects.create(
        projeto=projeto,
        arquivo_anexo=nome_arquivo_anexo
    )
[...]
Where handle_uploaded_file is
def handle_uploaded_file(f, nome_arquivo):
    with open(nome_arquivo, 'wb+') as destination:
        for chunk in f.chunks():
            destination.write(chunk)

Related

GCP Composer - Airflow webserver shutdown constantly

I'm using GCP Composer with the newest image version, composer-1.16.1-airflow-1.10.15.
My webservers are dying from time to time because of some missing cache files:
{cli.py:1050} ERROR - [Errno 2] No such file or directory
Does anybody know how to solve it?
Additional info:
Workers:
Node count: 3, Disk size: 20 GB, Machine type: n1-standard-1
Web server configuration:
Machine type composer-n1-webserver-8 (8 vCPU, 7.6 GB memory)
Configuration overrides:
UPDATE 27.04.2021
I've managed to find the place responsible for killing the webserver:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1032-L1138
GCP Composer uses the Celery executor underneath, so during the check it tries to read some cache files that have already been removed by the workers?
I've found it! And I'll report the bug to the GCP Composer team.
So if the config webserver.reload_on_plugin_change=True is set, then the CLI goes into this section:
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/bin/cli.py#L1118-L1138
# if we should check the directory with the plugin,
if self.reload_on_plugin_change:
    # compare the previous and current contents of the directory
    new_state = self._generate_plugin_state()
    # If changed, wait until its content is fully saved.
    if new_state != self._last_plugin_state:
        self.log.debug(
            '[%d / %d] Plugins folder changed. The gunicorn will be restarted the next time the '
            'plugin directory is checked, if there is no change in it.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = True
        self._last_plugin_state = new_state
    elif self._restart_on_next_plugin_check:
        self.log.debug(
            '[%d / %d] Starts reloading the gunicorn configuration.',
            num_ready_workers_running, num_workers_running
        )
        self._restart_on_next_plugin_check = False
        self._last_refresh_time = time.time()
        self._reload_gunicorn()

def _generate_plugin_state(self):
    """
    Generate dict of filenames and last modification time of all files in settings.PLUGINS_FOLDER
    directory.
    """
    if not settings.PLUGINS_FOLDER:
        return {}

    all_filenames = []
    for (root, _, filenames) in os.walk(settings.PLUGINS_FOLDER):
        all_filenames.extend(os.path.join(root, f) for f in filenames)
    plugin_state = {f: self._get_file_hash(f) for f in sorted(all_filenames)}
    return plugin_state
It generates the list of files to check by calling os.walk(settings.PLUGINS_FOLDER).
At the same time, gcsfuse decides to delete some of these files,
and an error happens: the file is not found.
So disabling webserver.reload_on_plugin_change does the trick, but this option is really convenient, so I'll create a bug ticket for Google.
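As an illustration of the race (this is a sketch, not Airflow's actual code), a tolerant version of the plugin-state scan could simply skip files that gcsfuse removes between the os.walk() listing and the hashing step:

import os

def generate_plugin_state(plugins_folder):
    # Sketch only: Airflow hashes each file, here we just take the mtime.
    state = {}
    for root, _, filenames in os.walk(plugins_folder):
        for name in filenames:
            path = os.path.join(root, name)
            try:
                state[path] = os.path.getmtime(path)
            except OSError:
                # The file vanished (e.g. removed by gcsfuse) after os.walk()
                # listed it; skip it and let the next check pick up the change.
                continue
    return state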

Process Several Pcap Files Simultaneously - Django

In essence, the following function, called by the user of the Django application that I am developing, uses the Scapy library to process 80-odd fairly large pcaps in order to initially parse their destination IP addresses.
I was wondering whether it would be possible to process several pcaps simultaneously, as the CPU is not being utilised to its full capacity, ideally using multithreading.
def analyseall(request):
    allpcaps = Pcaps.objects.all()
    for individualpcap in allpcaps:
        strfilename = str(individualpcap.filename)
        print(strfilename)
        pcapuuid = individualpcap.uuid
        print(pcapuuid)
        packets = rdpcap(strfilename)
        print("hokay")
        for packet in packets:
            if packet.haslayer(IP):
                # print(packet[IP].src)
                # print(packet[IP].dst)
                dstofpacket = packet[IP].dst
                PcapsIps.objects.update_or_create(ip=dstofpacket, uuid=individualpcap)
    return render(request, 'about.html', {"list": list})
You can use the answer above (multiprocessing), and also improve Scapy's reading speed by using the PcapReader generator rather than rdpcap:
with PcapReader(filename) as fdesc:
    for pkt in fdesc:
        [actions on the pkt]
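Since the multiprocessing answer referred to above isn't reproduced here, a minimal sketch combining multiprocessing.Pool with PcapReader might look like the following (the worker count and the way results are handled are illustrative):

from multiprocessing import Pool
from scapy.all import IP, PcapReader

def extract_dst_ips(filename):
    # Runs in a worker process: stream the pcap and collect destination IPs.
    dsts = set()
    with PcapReader(filename) as fdesc:
        for pkt in fdesc:
            if pkt.haslayer(IP):
                dsts.add(pkt[IP].dst)
    return filename, dsts

def analyse_in_parallel(filenames, workers=4):
    with Pool(processes=workers) as pool:
        for filename, dsts in pool.imap_unordered(extract_dst_ips, filenames):
            # Keep the Django ORM writes in the parent process: sharing database
            # connections across forked workers is a common source of trouble.
            print(filename, len(dsts))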
I consider mixing multiprocessing and Django tricky. I worked on such a solution once, and finally I decided to use Celery and RabbitMQ.
Using Celery you can easily define a task that processes a single pcap. Then you can start a few independent workers to process files in the background. Such a solution results in a slightly more complicated architecture (you need to provide a message queue, e.g. RabbitMQ, and the Celery workers), but you gain much simpler code.
http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html
In my case Celery saved a lot of time.
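A rough sketch of that approach, assuming a standard Celery setup and reusing the model names from the question (the import path is hypothetical):

from celery import shared_task
from django.shortcuts import render
from scapy.all import IP, PcapReader

from .models import Pcaps, PcapsIps  # hypothetical import path

@shared_task
def analyse_pcap(pcap_id):
    # One task per pcap; a pool of Celery workers processes them in parallel.
    pcap = Pcaps.objects.get(pk=pcap_id)
    with PcapReader(str(pcap.filename)) as fdesc:
        for pkt in fdesc:
            if pkt.haslayer(IP):
                PcapsIps.objects.update_or_create(ip=pkt[IP].dst, uuid=pcap)

def analyseall(request):
    # The view only queues the work and returns immediately.
    for pcap in Pcaps.objects.all():
        analyse_pcap.delay(pcap.pk)
    return render(request, 'about.html', {})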
You can also check this question and answers:
How to use python multiprocessing module in django view

Large File PUT requests on Heroku using Django

I'm creating an API for a Django application, and one of the PUT requests allows for a (very) large file upload. I'm trying to host the application on Heroku, but I'm running into some issues with the file upload given the 30-second request limit and the ephemeral filesystem. I'm now trying to pass the file off to an S3 bucket (via boto, abusing the multipart_upload function), but the process is dying before the upload can complete. I found this extremely helpful answer early on, which was excellent advice for the API in general, but doesn't solve the Heroku issue.
In my view
for i, handler in enumerate(upload_handlers):
    chunk = request.read(handler.chunk_size)
    while chunk:
        handler.write_multipart_upload(chunk, counters[i], filename=file_name)
        counters[i] += len(chunk)
        chunk = request.read(handler.chunk_size)

for i, handler in enumerate(upload_handlers):
    file_obj = handler.file_complete(counters[i])
In my handler
def write_multipart_upload(self, raw_data, start, filename=None):
    if not self.mp:
        self._validate_file(filename=filename)
        self.mp = self.bucket.initiate_multipart_upload('feeds/' + self.file_name)
    self.buffer.write(raw_data)
    if self.buffer.len:
        self.buffer.seek(0)
        self.mp.upload_part_from_file(self.buffer, self.total_chunks)
        self.buffer.close()
        self.buffer = StringIO()

def file_complete(self, file_size):
    self.mp.complete_upload()
    return self.file
Maybe this has become too complicated. At this point, I'm looking for something simple and maintainable. If anyone has a solidly-working example of a PUT-based API for large file uploads, I'll scrap Heroku and be happy with that. Thanks in advance.
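Not an answer to the Heroku timeout itself, but for comparison, the S3 side can be simplified with the newer boto3 library, whose managed transfer handles the multipart bookkeeping for you (bucket and key names below are made up):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(multipart_chunksize=8 * 1024 * 1024)  # 8 MB parts

def stream_request_to_s3(request, file_name):
    # Django's request object exposes read(), so it can be passed as a
    # file-like object; boto3 splits it into parts and retries failures.
    s3.upload_fileobj(request, "my-feeds-bucket", "feeds/" + file_name, Config=config)

This removes the hand-rolled buffering, but the request still has to fit within Heroku's 30-second window, so for truly large files a direct-to-S3 upload from the client (e.g. via a presigned URL) is the usual workaround.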

Django: Gracefully restart nginx + fastcgi sites to reflect code changes?

Common situation: I have a client on my server who may update some of the code in his Python project. He can SSH into his shell and pull from his repository and all is fine -- but the code is held in memory (as far as I know), so I need to actually kill the FastCGI process and restart it for the code change to take effect.
I know I can gracefully restart fcgi, but I don't want to have to do this manually. I want my client to update the code and, within 5 minutes or so, have the new code running under the fcgi process.
Thanks
First off, if uptime is important to you, I'd suggest making the client do it. It can be as simple as giving him a command called deploy-code. With your method, if there is an error in his code, fixing it requires up to a 10-minute turnaround (read: downtime), assuming he gets it right.
That said, if you actually want to do this, you should create a daemon that looks for files modified within the last 5 minutes. If it detects one, it executes the restart command.
Code might look something like:
import os, time

CODE_DIR = '/tmp/foo'

restarted = False
while True:
    if restarted:
        restarted = False
        time.sleep(5 * 60)
    for root, dirs, files in os.walk(CODE_DIR):
        if restarted:
            break
        for filename in files:
            if restarted:
                break
            updated_on = os.path.getmtime(os.path.join(root, filename))
            current_time = time.time()
            if current_time - updated_on <= 6 * 60:  # 6 min
                # 6 min could offer false negatives, but that's better
                # than false positives
                restarted = True
                print("We should execute the restart command here.")
    if not restarted:
        time.sleep(30)  # avoid a busy loop between checks

Apache lags when responding to gzipped requests

For an application I'm developing, the user submits a gzipped HTTP POST request (Content-Encoding: gzip) with multipart form data (Content-Type: multipart/form-data). I use mod_deflate as an input filter to decompress the body, and the web request is processed in Django via mod_wsgi.
Generally, everything is fine. But for certain (deterministic) requests, there is almost a minute of lag between request and response. Investigation shows that the processing in Django finishes immediately, but the response from the server stalls. If the request is not gzipped, everything works well.
Note that to deal with a glitch in mod_wsgi, I set the Content-Length to the uncompressed message size.
Has anyone run into this problem? Is there a way to easily debug apache as it processes responses?
What glitch do you believe exists in mod_wsgi?
The simple fact of the matter is that WSGI 1.0 doesn't support mutating input filters which change the content length of the request content. Thus, technically, you can't use mod_deflate in Apache for request content when using WSGI 1.0. Setting the content length to a value other than the actual size is most likely going to stuff up the operation of mod_deflate.
If you want to be able to handle compressed request content you need to step outside of WSGI 1.0 specification and use non standard code.
I suggest you have a read of:
http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html
This explains this problem and the suggestions about it.
I'd very much suggest you take this issue over to the official mod_wsgi mailing list for discussion about how you need to write your code. If you are using one of the Python frameworks, however, you are probably going to be restricted in what you can do, as they implement WSGI 1.0, where you can't do this.
UPDATE 1
From discussion on the mod_wsgi list, the original WSGI application should be wrapped in the following WSGI middleware. This will only work on WSGI adapters that actually provide an empty string as the end sentinel for input, something which WSGI 1.0 doesn't require. It should probably only be used for small uploads, as everything is read into memory. If you need large compressed uploads, then the accumulated data should be written out to a file instead.
import cStringIO

class Wrapper(object):
    def __init__(self, application):
        self.__application = application

    def __call__(self, environ, start_response):
        if environ.get('HTTP_CONTENT_ENCODING', '') == 'gzip':
            buffer = cStringIO.StringIO()
            input = environ['wsgi.input']
            blksize = 8192
            length = 0

            data = input.read(blksize)
            buffer.write(data)
            length += len(data)

            while data:
                data = input.read(blksize)
                buffer.write(data)
                length += len(data)

            buffer = cStringIO.StringIO(buffer.getvalue())

            environ['wsgi.input'] = buffer
            environ['CONTENT_LENGTH'] = length

        return self.__application(environ, start_response)

application = Wrapper(original_wsgi_application_callable)
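Following the note above about large uploads, here is a sketch (my variant, not code from the mod_wsgi list) of the same wrapper spooling the decompressed body to a temporary file instead of keeping it all in memory:

import tempfile

class FileBackedWrapper(object):
    def __init__(self, application):
        self.__application = application

    def __call__(self, environ, start_response):
        if environ.get('HTTP_CONTENT_ENCODING', '') == 'gzip':
            # Accumulate the already-decompressed body in a temp file so a
            # large upload doesn't sit entirely in RAM.
            spool = tempfile.TemporaryFile()
            input = environ['wsgi.input']
            length = 0
            data = input.read(8192)
            while data:
                spool.write(data)
                length += len(data)
                data = input.read(8192)
            spool.seek(0)
            environ['wsgi.input'] = spool
            environ['CONTENT_LENGTH'] = str(length)
        return self.__application(environ, start_response)

application = FileBackedWrapper(original_wsgi_application_callable)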