I am writing a web service in Django to handle image/video streams, but most of the work is done by external programs. For instance:
The client requests /1.jpg?size=300x200
Python code parses 300x200 in Django (or another WSGI app)
Python calls convert (part of ImageMagick) via the subprocess module, passing the 300x200 parameter
convert reads 1.jpg from local disk and resizes it accordingly,
writing the result to a temp file
Django builds an HttpResponse() and reads the whole temp file content as the body
As you can see, the whole write-temp-file-then-read-it-back round trip is inefficient. I need a generic way to handle external programs like this: not only convert, but others such as cjpeg, ffmpeg, and even proprietary binaries.
I want to implement it in this way:
Python gets the stdout fd of the convert child process,
and chains it to the WSGI socket fd for output.
I've done my homework: Google says this kind of zero-copy can be done with the splice() system call, but it isn't available in Python. So how can I maximize performance in Python for this kind of scenario?
Call splice() using ctypes? (See the sketch below.)
Hack memoryview() or buffer()?
subprocess gives me the child's stdout, which has readinto(); could that be utilized somehow?
How can we get the socket fd for an arbitrary WSGI app?
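A rough, untested sketch of that ctypes idea, assuming Linux (splice() requires at least one of the two fds to be a pipe, which the child's stdout pipe conveniently is):

import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)

def splice_all(fd_in, fd_out, chunk=64 * 1024):
    # Repeatedly splice() from fd_in to fd_out entirely in kernel space.
    while True:
        n = libc.splice(fd_in, None, fd_out, None, chunk, 0)
        if n == 0:      # EOF on fd_in
            return
        if n < 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))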
I'm kind of a newbie at this; any suggestion is appreciated, thanks!
If the goal is to increase performance, you ought to examine the bottlenecks on a case-by-case basis, rather than taking a "one solution fits all" approach.
For the convert case, assuming the images aren't insanely large, the bottleneck there will most likely be spawning a subprocess for each request.
I'd suggest avoiding the subprocess and the temporary file altogether and doing the whole thing in the Django process using PIL, with something like this...
import os
from PIL import Image
from django.http import HttpResponse

IMAGE_ROOT = '/path/to/images'

# A Django view which returns a resized image
# Example parameters: image_filename='1.jpg', width=300, height=200
def resized_image_view(request, image_filename, width, height):
    full_path = os.path.join(IMAGE_ROOT, image_filename)
    source_image = Image.open(full_path)
    resized_image = source_image.resize((width, height))
    response = HttpResponse(content_type='image/jpeg')
    resized_image.save(response, 'JPEG')
    return response
You should be able to get results identical to ImageMagick by using the correct scaling algorithm, which, in general, is ANTIALIAS when the rescaled image is less than 50% of the original size, and BICUBIC in all other cases.
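For instance, a minimal sketch of that filter choice (using the older PIL constant names from this answer; newer Pillow renamed ANTIALIAS to LANCZOS, and the helper names here are just illustrative):

from PIL import Image

def pick_resample_filter(source_image, width, height):
    # ANTIALIAS (LANCZOS in newer Pillow) when shrinking below 50% of the
    # original size, BICUBIC otherwise, as described above.
    src_w, src_h = source_image.size
    scale = min(width / float(src_w), height / float(src_h))
    return Image.ANTIALIAS if scale < 0.5 else Image.BICUBIC

def smart_resize(source_image, width, height):
    return source_image.resize(
        (width, height),
        pick_resample_filter(source_image, width, height))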
For the case of videos, if you're returning a transcoded video stream, the bottleneck will likely be either CPU time or network bandwidth.
I found that WSGI can actually accept a file object (backed by a pipe fd) as the iterable response.
Example WSGI app:
import subprocess

def image_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'image/jpeg'), ('Connection', 'Close')])
    proc = subprocess.Popen([
        'convert',
        '1.jpg',
        '-thumbnail', '200x150',
        '-',  # write the result to stdout
    ], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout
This returns the child's stdout pipe directly as the HTTP response body.
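If you want to avoid iterating the pipe line by line, a variation on the same app can hand the pipe to the optional wsgi.file_wrapper hook from PEP 3333, or fall back to fixed-size binary chunks. A sketch, not tested against any particular server:

import subprocess

def image_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'image/jpeg')])
    proc = subprocess.Popen(
        ['convert', '1.jpg', '-thumbnail', '200x150', '-'],
        stdout=subprocess.PIPE,
    )
    # wsgi.file_wrapper lets the server choose its own transmission strategy
    # for a file-like object (usually a plain read loop for a pipe,
    # sendfile for real files).
    wrapper = environ.get('wsgi.file_wrapper')
    if wrapper is not None:
        return wrapper(proc.stdout, 64 * 1024)
    # Fallback: fixed-size binary chunks instead of line-by-line iteration.
    return iter(lambda: proc.stdout.read(64 * 1024), b'')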
The user records audio; the audio gets saved into an audio Blob and sent to the backend. I want to take the audio file and send it to the OpenAI Whisper API.
files = request.FILES.get('audio')
audio = whisper.load_audio(files)
I've tried different ways to send the audio file, but none of them seemed to work, and I don't understand how it should be sent. I would prefer not to save the file. I want the user-recorded audio sent to the Whisper API from the backend.
Edit: The answer by AKX seems to work, but now there is another error.
Edit 2: He has edited his answer and everything works perfectly now. Thanks a lot to @AKX!
load_audio() requires a file on disk, so you'll need to cater to it, but you can use a temporary file that's automagically cleaned up when you're done with it. (On Windows, you may need to use delete=False and remove the file yourself, because of file-sharing restrictions.)
import os
import tempfile

file = request.FILES.get('audio')

with tempfile.NamedTemporaryFile(suffix=os.path.splitext(file.name)[1], delete=False) as f:
    for chunk in file.chunks():
        f.write(chunk)
    f.seek(0)
    try:
        audio = whisper.load_audio(f.name)
    finally:
        os.unlink(f.name)
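For context, a sketch of how that snippet might sit inside a complete Django view with the local openai-whisper package (the view name and the "base" model choice are just placeholders):

import os
import tempfile

import whisper
from django.http import JsonResponse

model = whisper.load_model("base")  # assumed: local openai-whisper package

def transcribe_audio(request):      # hypothetical view name
    upload = request.FILES.get('audio')
    if upload is None:
        return JsonResponse({'error': 'no audio uploaded'}, status=400)
    # Same temp-file trick as above: whisper/ffmpeg need a real path on disk.
    with tempfile.NamedTemporaryFile(suffix=os.path.splitext(upload.name)[1],
                                     delete=False) as f:
        for chunk in upload.chunks():
            f.write(chunk)
        f.seek(0)
        try:
            audio = whisper.load_audio(f.name)
            result = model.transcribe(audio)
        finally:
            os.unlink(f.name)
    return JsonResponse({'text': result['text']})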
In my Django application I have to deal with huge files. Instead of uploading them via the web app, the users may place them into a folder (called .dump) on a Samba share and then can choose the file in the Django app to create a new model instance from it. The view looks roughly like this:
class AddDumpedMeasurement(View):
    def get(self, request, *args, **kwargs):
        filename = request.GET.get('filename', None)
        dump_dir = os.path.join(settings.MEDIA_ROOT, settings.MEASUREMENT_DATA_DUMP_PATH)
        in_file = os.path.join(dump_dir, filename)
        if isfile(in_file):
            try:
                with open(in_file, 'rb') as f:
                    object = NCFile.objects.create(sample=sample, created_by=request.user, file=File(f))
                    return JsonResponse(data={'redirect': object.get_absolute_url()})
            except:
                return JsonResponse(data={'error': 'Couldn\'t read file'}, status=400)
        else:
            return JsonResponse(data={'error': 'File not found'}, status=400)
As MEDIA_ROOT and .dump are on the same Samba share (which is mounted by the web server), why is moving the file to its new location so slow? I would have expected it to be almost instantaneous. Is it because I open() it and stream the bytes to the file object? If so, is there a better way to move the file to its correct destination and create the model instance?
Using a placeholder temporary file and then replacing it with the original file lets one use os.rename, which is fast: it is essentially instantaneous when source and destination are on the same filesystem, as they are here.
tmp_file = NamedTemporaryFile()
object = NCFile.objects.create(..., file=File(tmp_file))
tmp_file.close()
if isfile(object.file.path):
    os.remove(object.file.path)
new_relative_path = os.path.join(os.path.dirname(object.file.name), filename)
new_relative_path = object.file.storage.get_available_name(new_relative_path)
os.rename(in_file, os.path.join(settings.MEDIA_ROOT, new_relative_path))
object.file.name = new_relative_path
object.save()
Is it because I open() it and stream the bytes to the file object?
I would argue that it is. A simple move operation on a file-system object just updates a record in the file system's internal bookkeeping; that would indeed be almost instantaneous.
Opening a file and streaming its bytes into a new one, by contrast, is effectively a copy operation, which can be slow depending on the file size. Additionally, you are doing this at a very high level, while an OS-level move happens much lower down.
But that's not the real cause of the problem. You said the files are on a Samba share, which I presume means you have mounted a remote folder locally. So when you read the file in question, you are actually fetching it over the network, which is slower than a local disk read. Then, when you write the destination file, you are sending the data back over the network, again slower than a local disk write.
I know about this question, but you can't write to the filesystem in App Engine (shutil and zipfile require creating files).
So basically I need to archive something like /base/nacl using zip or tar, and write the output to the web browser requesting the page (the output will never exceed 32 MB).
It just happened that I had to solve the exact same problem tonight :) This worked for me:
import StringIO
import tarfile

fd = StringIO.StringIO()
with tarfile.open(mode="w:gz", fileobj=fd) as tgz:
    tgz.add('dir_to_download')

self.response.headers['Content-Type'] = 'application/octet-stream'
self.response.headers['Content-Disposition'] = 'attachment; filename="archive.tgz"'
self.response.write(fd.getvalue())
Key points:
used StringIO to fake a file in memory
used fileobj to pass the in-memory file object directly to tarfile.open() (gzip.GzipFile() also supports this, if you prefer gzip over tarfile)
set headers to present the response as a downloadable file
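The snippet above is Python 2 (StringIO); on Python 3 the same in-memory trick works with io.BytesIO. A minimal sketch of just that part, with the headers and response handling left exactly as in the answer:

import io
import tarfile

buf = io.BytesIO()                      # in-memory binary "file"
with tarfile.open(mode="w:gz", fileobj=buf) as tgz:
    tgz.add('dir_to_download')

archive_bytes = buf.getvalue()          # write this to the response body,
                                        # with the same two headers as above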
I have an application that stores uploaded CSV files using the Paperclip gem.
Once uploaded, I would like to be able to stream the data from the uploaded file into code that reads it line-by-line and loads it into a data-staging table in Postgres.
I've gotten this far in my efforts, where data_file.upload is a Paperclip CSV Attachment
io = StringIO.new(Paperclip.io_adapters.for(data_file.upload).read, 'r')
Even though ^^ works, the problem is that - as you can see - it loads the entire file into memory as a honkin' Ruby String, and Ruby String garbage is notoriously bad for app performance.
Instead, I want a Ruby IO object that supports use of e.g., io.gets so that the IO object handles buffering and cleanup, and the whole file doesn't sit as one huge string in memory.
Thanks in advance for any suggestions!
With some help (from Stack Overflow, of course), I was able to suss this out myself.
In my PaperClip AR model object, I now have the following:
# Done this way so we get auto-closing of the File object
def yielding_upload_as_readable_file
  # It's quite annoying that there's not 1 method that works for both filesystem and S3 storage
  open(filesystem_storage? ? upload.path : upload.url) { |file| yield file }
end

def filesystem_storage?
  Paperclip::Attachment.default_options[:storage] == :filesystem
end
... and, I consume it in another model like so:
data_file.yielding_upload_as_readable_file do |file|
  while line = file.gets
    next if line.strip.size == 0
    # ... process line ...
  end
end
I was assuming that sys.stdout would be referencing the same physical stream as iostreams::cout running in the same process, but this doesn't seem to be the case.
The following code, which calls a C++ function (exposed through a Python wrapper named "write") that writes to cout:
from cStringIO import StringIO
import sys
orig_stdout = sys.stdout
sys.stdout = stringout = StringIO()
write("cout") # wrapped C++ function that writes to cout
print "-" * 40
print "stdout"
sys.stdout = orig_stdout
print stringout.getvalue()
immediately writes "cout" to the console, then the separator "---...", and finally, as the return value of stringout.getvalue(), the string "stdout".
My intention was to capture in stringout also the string written to cout from C++.
Does anyone know what is going on, and if so, how I can capture what is written to cout in a python string?
Thanks in advance.
sys.stdout is a Python object that writes to standard output. It is not actually the standard output file handle; it wraps that file handle. Altering the object that sys.stdout points to in Python-land does not in any way affect the stdout handle or the std::cout stream object in C++.
With help from comp.lang.python and after some searching on this site:
As cdhowie pointed out, the standard output file handle has to be accessed at a lower level. In fact, its file descriptor can be obtained as sys.stdout.fileno() (which should be 1), and then os.dup and os.dup2 can be used.
I found this answer to a similar question very helpful.
What I really wanted was to capture the output in a string, not a file. The Python StringIO class, however, doesn't have a file descriptor and cannot be used in place of an actual file, so I came up with the not-fully-satisfactory workaround in which a temporary file is written and subsequently read back.
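Roughly, that workaround looks like the sketch below; here write stands for the wrapped C++ function from the question, and the C++ side still has to flush std::cout before fd 1 is restored, or trailing output can be lost:

import os
import sys
import tempfile

def capture_fd1(func, *args):
    """Redirect the real stdout (fd 1) into a temp file around func()."""
    sys.stdout.flush()                  # flush Python-level buffering first
    saved_fd = os.dup(1)                # keep a copy of the real stdout fd
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), 1)        # fd 1 now points at the temp file
        try:
            func(*args)                 # C++ writes to std::cout -> fd 1
        finally:
            os.dup2(saved_fd, 1)        # restore the original stdout
            os.close(saved_fd)
        tmp.seek(0)
        return tmp.read()

# e.g. captured = capture_fd1(write, "cout")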
It cannot possibly be the same stream, as Python is written in C, not C++, and has no access to std::cout. Whether it uses stdout or implements its own stream on top of fd 1, I don't know, but in any case you'd be advised to flush between writes using the two objects (the Python one and the C++ one).