Downloading large files in Python - python-2.7

In Python 2.7.3, I am trying to create a script to download a file over the Internet, using the urllib2 module.
Here is what I have done:
import urllib2
HTTP_client = urllib2.build_opener()
#### Here I can modify HTTP_client headers
URL = 'http://www.google.com'
data = HTTP_client.open(URL)
with open('file.txt', 'wb') as f:
    f.write(data.read())
OK. That works perfectly.
The problem is when I want to save big files (hundreds of MB). I think that when I call the 'open' method, it downloads the whole file into memory. But what about large files? It will not hold 1 GB of data in memory! And if I lose the connection, everything downloaded so far is lost.
How can I download large files in Python the way wget does? wget writes the file 'directly' to the hard disk, and we can see the file growing in size.
I'm surprised there is no method 'retrieve' for doing stuff like
HTTP_client.retrieve(URL, 'filetosave.ext')

To resolve this, you can read a chunk at a time and write it to the file.
req = urllib2.urlopen(url)
CHUNK = 16 * 1024
with open(file, 'wb') as fp:
    while True:
        chunk = req.read(CHUNK)
        if not chunk:
            break
        fp.write(chunk)
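As an aside, Python 2 does ship a one-call helper close to the 'retrieve' method the question wishes for: urllib.urlretrieve (in urllib, not urllib2). A minimal sketch, assuming the same URL and target filename as above; it streams to disk for you, but gives less control over headers than build_opener:

import urllib

def report(block_count, block_size, total_size):
    # optional progress callback, called as each chunk is written to disk
    print block_count * block_size, 'of', total_size, 'bytes'

urllib.urlretrieve('http://www.google.com', 'filetosave.ext', reporthook=report)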

Related

How to pass InMemoryUploadedFile as a file?

The user records audio, the audio gets saved into an audio Blob and sent to the backend. I want to get the audio file and send it to the OpenAI Whisper API.
files = request.FILES.get('audio')
audio = whisper.load_audio(files)
I've tried different ways to send the audio file, but none of them seemed to work, and I don't understand how it should be sent. I would prefer not to save the file. I want the user-recorded audio sent to the Whisper API from the backend.
Edit*
The answer by AKX seems to work but now there is another error
Edit 2*
He has edited his answer and everything works perfectly now. Thanks a lot to @AKX!
load_audio() requires a file on disk, so you'll need to cater to it – but you can use a temporary file that's automagically deleted outside the with block. (On Windows, you may need to use delete=False because of sharing permission reasons.)
import os
import tempfile

file = request.FILES.get('audio')  # the uploaded InMemoryUploadedFile

with tempfile.NamedTemporaryFile(suffix=os.path.splitext(file.name)[1], delete=False) as f:
    # write the upload to disk in chunks so it never has to fit in a single read
    for chunk in file.chunks():
        f.write(chunk)
    f.seek(0)
    try:
        audio = whisper.load_audio(f.name)
    finally:
        # delete=False means we must remove the temporary file ourselves
        os.unlink(f.name)
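For completeness, once load_audio() has returned the waveform, the local whisper package would typically transcribe it along these lines (the 'base' model size is my assumption, not something specified in the thread):

import whisper

# load a local Whisper model once; the model size here is an assumption
model = whisper.load_model('base')

# 'audio' is the NumPy array returned by whisper.load_audio() above
result = model.transcribe(audio)
print(result['text'])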

How do I make excel spreadsheets downloadable in Django?

I'm writing a web application that generates reports from a local database. I want to generate an Excel spreadsheet and immediately have the user download it. However, when I try to return the file via HttpResponse, I cannot open the downloaded file. If I open the file where it is stored, it opens perfectly fine.
This is using Django 2.1 (for database reasons, I'm not using 2.2) and I'm generating the file with xlrd. There is another Excel spreadsheet that will need to be generated and downloaded that uses the openpyxl library (both libraries serve very distinct purposes IMO).
This spreadsheet is not very large (5x6, columns x rows).
I've looked at other similar stack overflow questions and followed their instructions. Specifically, I am talking about this answer:
https://stackoverflow.com/a/36394206/6411417
As you can see in my code, the logic is nearly the same and yet I can not open the downloaded excel spreadsheets. The only difference is that my file name is generated when the file is generated and returned into the file_name variable.
def make_lrm_summary_file(request):
    file_path = make_lrm_summary()
    if os.path.exists(file_path):
        with open(file_path, 'rb') as fh:
            response = HttpResponse(fh.read(), content_type="application/vnd.ms-excel")
            response['Content-Disposition'] = f'inline; filename="{ os.path.basename(file_path) }"'
            return response
    raise Http404
Again, the file is properly generated and stored on my server but the download itself is providing an excel file that can not be opened. Specifically, I get the error message:
EXCEL.EXE - Application Error | The application was unable to start correctly (0x0000005). Click OK to close the application.
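No accepted answer is quoted above, but for comparison, Django 2.1 also provides FileResponse, which streams the file in binary mode and builds the download headers itself; whether it resolves the corrupted-download symptom here is untested. A minimal sketch under the same assumption (make_lrm_summary() returning a path on disk):

import os

from django.http import FileResponse, Http404

def make_lrm_summary_file(request):
    file_path = make_lrm_summary()  # helper assumed from the question
    if not os.path.exists(file_path):
        raise Http404
    # as_attachment and filename were added to FileResponse in Django 2.1;
    # the content type is guessed from the filename
    return FileResponse(
        open(file_path, 'rb'),
        as_attachment=True,
        filename=os.path.basename(file_path),
    )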

How do I read a zip (which in fact is in bytes form) without creating a temporary copy?

I am uploading a zip (which in turn contains PDF files to be read) as multipart/form-data.
I am handling the upload as below:
import io

file = request.FILES["zipfile"].read()  # gives a bytes object
bytes_io = io.BytesIO(file)  # gives an IO stream object
What I intend to do is to read the pdf files inside the zip, but I am stuck as to how to proceed from here. I am confused, what do I do with either the bytes object from the request or the IO object after conversion.
Found the answer just after asking the question.
Simply use the zipfile package as below:
import io
from zipfile import ZipFile

file = request.FILES["zipfile"].read()
bytes_io = io.BytesIO(file)
zipfile = ZipFile(bytes_io, 'r')
And then refer to the docs for further operations on the zip file.
Hope it helps!
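To go one step further and actually reach the PDFs inside the archive, ZipFile.namelist() and ZipFile.open() give you a file-like object for each member without extracting anything to disk. A small sketch; the PyPDF2 call mentioned in the comment is just one assumed way to consume those streams:

from io import BytesIO
from zipfile import ZipFile

bytes_io = BytesIO(request.FILES["zipfile"].read())

with ZipFile(bytes_io, 'r') as zf:
    for name in zf.namelist():
        if not name.lower().endswith('.pdf'):
            continue
        with zf.open(name) as member:
            pdf_stream = BytesIO(member.read())  # still entirely in memory
        # any PDF library that accepts a file object can read pdf_stream here,
        # e.g. PyPDF2.PdfReader(pdf_stream) (assumed to be installed)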

How to zip or tar a static folder without writing anything to the filesystem in python?

I know about this question. But you can't write to the filesystem on App Engine (shutil or zipfile require creating files).
So basically I need to archive something like /base/nacl using zip or tar, and write the output to the web browser requesting the page (the output will never exceed 32 MB).
It just happened that I had to solve the exact same problem tonight :) This worked for me:
import StringIO
import tarfile

fd = StringIO.StringIO()
with tarfile.open(mode="w:gz", fileobj=fd) as tgz:
    tgz.add('dir_to_download')

self.response.headers['Content-Type'] = 'application/octet-stream'
self.response.headers['Content-Disposition'] = 'attachment; filename="archive.tgz"'
self.response.write(fd.getvalue())
Key points:
used StringIO to fake a file in memory
used fileobj to pass directly the fake file's object to tarfile.open() (also supported by gzip.GzipFile() if you prefer gzip instead of tarfile)
set headers to present the response as a downloadable file
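Since the question also mentions zip, here is the analogous in-memory approach with zipfile, assuming the same Python 2 / webapp2-style handler as the answer above; the os.walk loop is needed because ZipFile has no recursive add:

import os
import StringIO
import zipfile

fd = StringIO.StringIO()
with zipfile.ZipFile(fd, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, files in os.walk('dir_to_download'):
        for filename in files:
            path = os.path.join(root, filename)
            zf.write(path)  # the archive entry is named after this path

self.response.headers['Content-Type'] = 'application/zip'
self.response.headers['Content-Disposition'] = 'attachment; filename="archive.zip"'
self.response.write(fd.getvalue())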

How do I get a Ruby IO stream for a Paperclip Attachment?

I have an application that stores uploaded CSV files using the Paperclip gem.
Once uploaded, I would like to be able to stream the data from the uploaded file into code that reads it line-by-line and loads it into a data-staging table in Postgres.
I've gotten this far in my efforts, where data_file.upload is a Paperclip CSV Attachment
io = StringIO.new(Paperclip.io_adapters.for(data_file.upload).read, 'r')
Even though ^^ works, the problem is that - as you can see - it loads the entire file into memory as a honkin' Ruby String, and Ruby String garbage is notoriously bad for app performance.
Instead, I want a Ruby IO object that supports use of e.g., io.gets so that the IO object handles buffering and cleanup, and the whole file doesn't sit as one huge string in memory.
Thanks in advance for any suggestions!
With some help (from StackOverflow, of course), I was able to suss this myself.
In my PaperClip AR model object, I now have the following:
# Done this way so we get auto-closing of the File object
def yielding_upload_as_readable_file
  # It's quite annoying that there's not 1 method that works for both filesystem and S3 storage
  open(filesystem_storage? ? upload.path : upload.url) { |file| yield file }
end

def filesystem_storage?
  Paperclip::Attachment.default_options[:storage] == :filesystem
end
... and, I consume it in another model like so:
data_file.yielding_upload_as_readable_file do |file|
  while line = file.gets
    next if line.strip.size == 0
    # ... process line ...
  end
end