Apache lags when responding to gzipped requests - django

For an application I'm developing, the user submits a gzipped HTTP POST request (content-encoding: GZIP) with multipart form data (content-type: multipart/form-data). I use mod_deflate as an input filter to decompress and the web request is processed in Django via mod_wsgi.
Generally, everything is fine. But for certain requests (deterministic), there is almost a minute lag from request to response. Investigation shows that the processing in django is done immediately, but the response from the server stalls. If the request is not GZIPed, all works well.
Note that to deal with a glitch in mod_wsgi, I set content-length to the uncompressed message size.
Has anyone run into this problem? Is there a way to easily debug apache as it processes responses?

What glitch do you believe exists in mod_wsgi?
The simple fact of the matter is that WSGI 1.0 doesn't support mutating input filters which change the content length of the request content. Thus, technically, you can't use mod_deflate in Apache for request content when using WSGI 1.0. Setting the content length to a value other than the actual size is most likely going to break the operation of mod_deflate.
If you want to be able to handle compressed request content, you need to step outside of the WSGI 1.0 specification and use non-standard code.
I suggest you have a read of:
http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html
This explains the problem and the suggestions for addressing it.
I'd very much suggest you take this issue over to the official mod_wsgi mailing list for discussion about how you need to write your code. If you are using one of the Python frameworks, however, you are probably going to be restricted in what you can do, as they implement WSGI 1.0, where you can't do this.
UPDATE 1
From the discussion on the mod_wsgi list, the original WSGI application should be wrapped in the following WSGI middleware. This will only work on WSGI adapters that actually provide an empty string as an end sentinel for input, something which WSGI 1.0 doesn't require. This should probably only be used for small uploads, as everything is read into memory. If you need large compressed uploads, the accumulated data should be written out to a file instead.
import cStringIO  # Python 2 only

class Wrapper(object):

    def __init__(self, application):
        self.__application = application

    def __call__(self, environ, start_response):
        # Only rebuffer requests whose body was sent gzip-compressed.
        if environ.get('HTTP_CONTENT_ENCODING', '') == 'gzip':
            buffer = cStringIO.StringIO()
            input = environ['wsgi.input']
            blksize = 8192
            length = 0
            # Read until the adapter returns the empty-string end sentinel.
            data = input.read(blksize)
            buffer.write(data)
            length += len(data)
            while data:
                data = input.read(blksize)
                buffer.write(data)
                length += len(data)
            # Re-create the buffer so the application reads from the start.
            buffer = cStringIO.StringIO(buffer.getvalue())
            environ['wsgi.input'] = buffer
            # CGI-style environ values must be strings, not integers.
            environ['CONTENT_LENGTH'] = str(length)
        return self.__application(environ, start_response)

application = Wrapper(original_wsgi_application_callable)
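
For the large-upload case mentioned above, where the accumulated data should go to a file rather than memory, a minimal sketch might look like the following. It is not from the mailing list discussion: the class name `SpoolingWrapper` and the `max_memory` parameter are made up for illustration, and it uses the standard library's `tempfile.SpooledTemporaryFile`, which keeps small bodies in memory and rolls over to a temporary file once they grow past the threshold. It relies on the same empty-string end sentinel as the middleware above.

```python
import tempfile


class SpoolingWrapper(object):
    """Hypothetical variant of the middleware above that spills to disk."""

    def __init__(self, application, max_memory=1024 * 1024):
        self._application = application
        # Bodies larger than this many bytes are moved to a temporary file.
        self._max_memory = max_memory

    def __call__(self, environ, start_response):
        if environ.get('HTTP_CONTENT_ENCODING', '') == 'gzip':
            spool = tempfile.SpooledTemporaryFile(max_size=self._max_memory)
            stream = environ['wsgi.input']
            length = 0
            while True:
                # Assumes the adapter signals end of input with an empty read.
                data = stream.read(8192)
                if not data:
                    break
                spool.write(data)
                length += len(data)
            # Rewind so the wrapped application reads from the beginning.
            spool.seek(0)
            environ['wsgi.input'] = spool
            environ['CONTENT_LENGTH'] = str(length)
        return self._application(environ, start_response)
```

The sketch never closes the spooled file explicitly; it is released when the object is garbage collected, which may or may not be acceptable for a long-running server.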

Related

Large file upload to Django Rest Framework

I try to upload a big file (4GB) with a PUT on a DRF viewset.
During the upload my memory is stable. At 100%, the python runserver process takes more and more RAM and is killed by the kernel. I have a logging line in the put method of this APIView but the process is killed before this method call.
I use this setting to force file usage: FILE_UPLOAD_HANDLERS = ["django.core.files.uploadhandler.TemporaryFileUploadHandler"]
Where does this memory peak come from? I guess it tries to load the file content into memory, but why (and where)?
More information:
I tried DEBUG true and false
The runserver is in a docker behind a traefik but there is no limitation in traefik AFAIK and the upload reaches 100%
I do not know yet if I would get the same behavior with daphne instead of runserver
EDIT: the front end uses Content-Type multipart/form-data
EDIT: I have tried FileUploadParser and (FormParser, MultiPartParser) for parser_classes in my APIView
TL;DR:
Neither a DRF nor a Django issue; it's a Daphne issue that has been known for 2.5 years. The solution is to use uvicorn, hypercorn, or something else for the time being.
Explanations
What you're seeing here is not coming from Django Rest Framework as:
The FileUploadParser is meant to handle large file uploads, as it reads the file chunk by chunk;
The fact that your view is not executed rules out the parsers, which aren't executed until you access the request.FILES property.
The fact that you're mentioning Daphne reminds me of this SO answer, which mentions a similar problem and points to code showing that Daphne doesn't handle large file uploads well, as it loads the whole body into RAM before passing it to the view. (The code is still present in their master branch at the time of writing.)
You're seeing the same behavior with runserver because when installed, Daphne replaces the initial runserver command with itself to provide WebSockets support for dev purposes.
To make sure that it's the real culprit, try to disable Channels/run the default Django runserver and see for yourself if your app is killed by the OOM Killer.
I don't know if it works with Django REST framework, but you can try to chunk the file.
[...]
anexo_files = request.FILES.getlist('anexo_file_'+str(k))
index = 0
for file in anexo_files:
    index = index + 1
    extension = os.path.splitext(str(file))[1]
    nome_arquivo_anexo = 'media/uploads/' + os.path.splitext(str(file))[0] + "_" + str(index) + datetime.datetime.now().strftime("%m%d%Y%H%M%S") + extension
    handle_uploaded_file(file, nome_arquivo_anexo)
    AnexoProjeto.objects.create(
        projeto=projeto,
        arquivo_anexo = nome_arquivo_anexo
    )
[...]
Where handle_uploaded_file is
def handle_uploaded_file(f, nome_arquivo):
    # Write the upload to disk chunk by chunk instead of reading it all at once.
    with open(nome_arquivo, 'wb+') as destination:
        for chunk in f.chunks():
            destination.write(chunk)

Process Several Pcap Files Simultaneously - Django

In essence, the following function, called by the user of the django application that I am developing, uses the Scapy library to process 80-odd fairly large pcaps in order to initially parse their destination IP addresses.
I was wondering whether it would be possible to process several pcaps simultaneously, as the CPU is not being utilised to its full capacity, ideally using multi-threading.
def analyseall(request):
    allpcaps = Pcaps.objects.all()
    for individualpcap in allpcaps:
        strfilename = str(individualpcap.filename)
        print(strfilename)
        pcapuuid = individualpcap.uuid
        print(pcapuuid)
        packets = rdpcap(strfilename)
        print("hokay")
        for packet in packets:
            if packet.haslayer(IP):
                # print(packet[IP].src)
                # print(packet[IP].dst)
                dstofpacket = packet[IP].dst
                PcapsIps.objects.update_or_create(ip=dstofpacket, uuid=individualpcap)
    return render(request, 'about.html', {"list": list})
You can use the answer above (multiprocessing), and also improve Scapy's reading speed by using the PcapReader generator rather than rdpcap, which loads the whole capture into memory before you can iterate over it:
from scapy.all import PcapReader

with PcapReader(filename) as fdesc:
    for pkt in fdesc:
        pass  # actions on the pkt
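
Combining the two ideas (one worker per file, and PcapReader inside each worker) could be sketched as follows. This is an illustration under assumptions, not the multiprocessing answer referred to above: the function names `collect_destinations` and `analyse_in_parallel`, and the `worker`, `max_workers` and `executor_cls` parameters, are hypothetical. The workers only return plain data; the database writes (`update_or_create`) are left to the calling process, since sharing Django database connections across processes is a common source of trouble.

```python
from concurrent.futures import ProcessPoolExecutor


def collect_destinations(filename):
    """Return (filename, set of destination IPs) for one capture file."""
    # Imported here so the module can be loaded without Scapy installed.
    from scapy.all import IP, PcapReader

    destinations = set()
    with PcapReader(filename) as reader:
        for packet in reader:
            if packet.haslayer(IP):
                destinations.add(packet[IP].dst)
    return filename, destinations


def analyse_in_parallel(filenames, worker=collect_destinations,
                        max_workers=4, executor_cls=ProcessPoolExecutor):
    """Run `worker` on every file concurrently and collect the results.

    Returns a dict mapping each filename to whatever set the worker produced.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return dict(pool.map(worker, filenames))
```

In the view, the returned mapping could then be looped over to call `PcapsIps.objects.update_or_create(...)` once per distinct address, which also avoids one database round trip per packet. Processes rather than threads are the default here because packet parsing is CPU-bound Python code.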
I consider mixing multiprocessing and Django tricky. I worked on such a solution once and finally decided to use Celery and RabbitMQ.
Using Celery, you can easily define a task that processes a single pcap. Then you can start a few independent workers to process files in the background. This results in a slightly more complicated architecture (you need to provide a message queue, e.g. RabbitMQ, and the Celery workers), but you gain much simpler code.
http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html
In my case Celery saved a lot of time.
You can also check this question and answers:
How to use python multiprocessing module in django view

Python 2.7 - Having trouble downloading large files

I'm trying to download some decently large files in python 2.7 (between 300 and 700 MB each), and I'm running into the problem of the connection getting reset in the middle of retrieving the files. Specifically, I was using urllib.urlretrieve(url, file_name), and every so often I get socket.error: [Errno 104] Connection reset by peer.
Now, I'm very unfamiliar with how sockets and web protocols work, so I tried the following, not really knowing if it would help:
response = urllib.urlopen(url)
CHUNK_SIZE = 16 * 1024
with open(file_name, 'wb') as f:
    for chunk in iter(lambda: response.read(CHUNK_SIZE), ''):
        f.write(chunk)
Edit: Guess I should credit the author of this code: https://stackoverflow.com/a/1517728/3002473
It sounds reasonable that we're only downloading a little bit at a time, so it should be "less susceptible" to this Errno 104, but again I know basically nothing about how all of this works so I don't know if this actually makes a difference.
After testing a bit it seems like it works slightly better? But that might just be coincidence. Generally, I'm able to download one, maybe two files before this error gets thrown.
Why am I getting Errno 104, and how can I go about preventing this? Out of curiosity, should I be using urllib2 instead of urllib?
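
No answer is recorded for this question here. Reading in chunks, as in the snippet above, does not by itself recover from a reset: the file is still lost when the connection drops. One common approach, offered only as a sketch and not as a confirmed fix for this case, is to retry and ask the server to resume with an HTTP Range header. The code below targets Python 3 (`urllib.request`) rather than the Python 2.7 `urllib` used in the question, it assumes the server honours Range requests (it restarts from zero if not), and the function name and all parameters are made up for illustration.

```python
import http.client
import time
import urllib.request


def download_with_resume(url, file_name, chunk_size=16 * 1024,
                         max_retries=5, retry_delay=2.0,
                         opener=urllib.request.urlopen):
    """Download url to file_name, resuming after connection errors.

    Returns the number of bytes written. `opener` can be swapped for testing.
    """
    written = 0
    for attempt in range(max_retries):
        request = urllib.request.Request(url)
        if written:
            # Ask the server to continue where the last attempt stopped.
            request.add_header('Range', 'bytes=%d-' % written)
        try:
            response = opener(request)
            try:
                if written and response.getcode() != 206:
                    # The server ignored the Range header: start over.
                    written = 0
                with open(file_name, 'r+b' if written else 'wb') as f:
                    f.seek(written)
                    f.truncate()
                    for chunk in iter(lambda: response.read(chunk_size), b''):
                        f.write(chunk)
                        written += len(chunk)
            finally:
                response.close()
            return written
        except (OSError, http.client.HTTPException):
            # OSError covers socket errors such as "Connection reset by peer".
            if attempt == max_retries - 1:
                raise
            time.sleep(retry_delay)
```

A call would look like `download_with_resume(url, file_name)`. If the resets come from the server throttling or dropping long transfers, increasing `retry_delay` may matter more than the chunk size.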

Simple libtorrent Python client

I tried creating a simple libtorrent Python client (for magnet URIs) and failed: the program never continues past "downloading metadata".
If you could help me write a simple client, it would be amazing.
P.S. When I choose a save path, is the save path the folder in which I want my data to be saved, or the path for the data itself?
(I used code someone posted here)
import libtorrent as lt
import time

ses = lt.session()
ses.listen_on(6881, 6891)
params = {
    'save_path': '/home/downloads/',
    'storage_mode': lt.storage_mode_t(2),
    'paused': False,
    'auto_managed': True,
    'duplicate_is_error': True}
link = "magnet:?xt=urn:btih:4MR6HU7SIHXAXQQFXFJTNLTYSREDR5EI&tr=http://tracker.vodo.net:6970/announce"
handle = lt.add_magnet_uri(ses, link, params)
ses.start_dht()

print 'downloading metadata...'
while (not handle.has_metadata()):
    time.sleep(1)
print 'got metadata, starting torrent download...'
while (handle.status().state != lt.torrent_status.seeding):
    s = handle.status()
    state_str = ['queued', 'checking', 'downloading metadata', \
        'downloading', 'finished', 'seeding', 'allocating']
    # The last conversion was written as '%.3', which is an incomplete format.
    print '%.2f%% complete (down: %.1f kb/s up: %.1f kB/s peers: %d) %s %.3f' % \
        (s.progress * 100, s.download_rate / 1000, s.upload_rate / 1000, \
        s.num_peers, state_str[s.state], s.total_download/1000000)
    time.sleep(5)
What happens is that the first while loop becomes infinite because the state does not change.
You have to add s = handle.status() inside the loop; once the metadata has been fetched, the status changes and the loop stops. Alternatively, move the first while loop inside the second one so that the same thing happens.
Yes, the save path you specify is the one that the torrents will be downloaded to.
As for the metadata downloading part, I would add the following extensions first:
ses.add_extension(lt.create_metadata_plugin)
ses.add_extension(lt.create_ut_metadata_plugin)
Second, I would add a DHT bootstrap node:
ses.add_dht_router("router.bittorrent.com", 6881)
Finally, I would begin debugging the application by seeing if my network interface is binding or if any other errors come up (my experience with BitTorrent download problems, in general, is that they are network related). To get an idea of what's happening I would use libtorrent-rasterbar's alert system:
ses.set_alert_mask(lt.alert.category_t.all_categories)
And make a thread (with the following code) to collect the alerts and display them:
while True:
    ses.wait_for_alert(500)
    # Use the same session object (ses) that was created earlier.
    alert = ses.pop_alert()
    if not alert:
        continue
    print "[%s] %s" % (type(alert), str(alert))
Even with all this working correctly, make sure that the torrent you are trying to download actually has peers. Even if there are a few peers, none may be configured correctly or support metadata exchange (exchanging metadata is not a standard BitTorrent feature). Try to load a torrent file (which doesn't require downloading metadata) and see if you can download successfully (to rule out some network issues).

Large File PUT requests on Heroku using Django

I'm creating an API for a Django application and one of the PUT requests allows for a (very) large file upload. I'm trying to host the application on Heroku, but I'm running into some issues with the file upload and 30 second limit and ephemeral filesystem. I'm now trying to pass the file off onto an s3 bucket (via boto and abusing the multipart_upload function), but the process is dying before the upload can complete. I found this extremely helpful answer early on, which was excellent advice for the API in general, but doesn't solve the Heroku issue.
In my view
for i, handler in enumerate(upload_handlers):
    chunk = request.read(handler.chunk_size)
    while chunk:
        handler.write_multipart_upload(chunk, counters[i], filename=file_name)
        counters[i] += len(chunk)
        chunk = request.read(handler.chunk_size)
for i, handler in enumerate(upload_handlers):
    file_obj = handler.file_complete(counters[i])
In my handler
def write_multipart_upload(self, raw_data, start, filename=None):
    if not self.mp:
        self._validate_file(filename=filename)
        self.mp = self.bucket.initiate_multipart_upload('feeds/' + self.file_name)
    self.buffer.write(raw_data)
    if self.buffer.len:
        self.buffer.seek(0)
        self.mp.upload_part_from_file(self.buffer, self.total_chunks)
        self.buffer.close()
        self.buffer = StringIO()

def file_complete(self, file_size):
    self.mp.complete_upload()
    self.file
Maybe this has become too complicated. At this point, I'm looking for something simple and maintainable. If anyone has a solidly-working example of a PUT-based API for large file uploads, I'll scrap Heroku and be happy with that. Thanks in advance.
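
One thing that may be worth checking in the handler above, independent of Heroku's time limit: S3 multipart uploads require every part except the last to be at least 5 MiB, and the handler sends each request chunk as its own part as soon as the buffer is non-empty. A minimal sketch of a buffer that only emits a part once enough data has accumulated is shown below. The class `PartBuffer` and the `upload_part` callable are hypothetical; `upload_part` stands in for something like the `self.mp.upload_part_from_file(fileobj, part_number)` call in the handler.

```python
import io

# S3 rejects multipart uploads whose non-final parts are smaller than this.
MIN_PART_SIZE = 5 * 1024 * 1024


class PartBuffer(object):
    """Accumulate small chunks and hand them off as full-sized parts."""

    def __init__(self, upload_part, part_size=MIN_PART_SIZE):
        # upload_part(fileobj, part_number) is supplied by the caller.
        self._upload_part = upload_part
        self._part_size = part_size
        self._buffer = io.BytesIO()
        self._part_number = 0

    def write(self, raw_data):
        self._buffer.write(raw_data)
        if self._buffer.tell() >= self._part_size:
            self._flush()

    def finish(self):
        """Upload whatever is left and return the number of parts sent."""
        if self._buffer.tell():
            self._flush()
        return self._part_number

    def _flush(self):
        # Part numbers start at 1 and increase by one per uploaded part.
        self._part_number += 1
        self._buffer.seek(0)
        self._upload_part(self._buffer, self._part_number)
        self._buffer = io.BytesIO()
```

In the view loop, each `request.read(...)` chunk would go to `write()`, and `finish()` would be called once before completing the upload. This does not address the 30-second request limit by itself.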