Processing large files with django celery tasks - django

My goal is to process a large CSV file, uploaded through a Django form, using Celery. When the file's size is less than settings.FILE_UPLOAD_MAX_MEMORY_SIZE, I can pass the form's cleaned_data variable to a Celery task and read the file with:
@task
def taskFunction(cleaned_data):
    for line in csv.reader(cleaned_data['upload_file']):
        MyModel.objects.create(field=line[0])
However, when the file's size is greater than the above setting, I get the following error:
expected string or Unicode object, NoneType found
Where the stack trace shows the error occurring during pickle:
return dumper(obj, protocol=pickle_protocol)
It appears that when the uploaded file is read from a temporary file, pickle fails.
The simple solution to this problem is to increase FILE_UPLOAD_MAX_MEMORY_SIZE. However, I am curious whether there is a better way to manage this issue.

Save it to a temp file and pass the file name to celery instead. Delete after processing.
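A minimal sketch of that approach, assuming a Celery shared_task and the MyModel from the question (the helper names handle_upload and import_csv are made up for illustration): the view streams the upload to a named temporary file and the task receives only the path, so nothing large or unpicklable ever goes through the broker.

import csv
import os
import tempfile

from celery import shared_task

from .models import MyModel  # the model from the question; adjust the import path


@shared_task
def import_csv(path):
    # Read the CSV from disk instead of passing file contents through pickle.
    try:
        with open(path, newline='') as f:
            for line in csv.reader(f):
                MyModel.objects.create(field=line[0])
    finally:
        # Delete the temporary file after processing, as suggested above.
        os.remove(path)


def handle_upload(upload_file):
    # Called from the form view: stream the uploaded file to disk in chunks
    # and hand only the file name to the task.
    fd, path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(fd, 'wb') as tmp:
        for chunk in upload_file.chunks():
            tmp.write(chunk)
    import_csv.delay(path)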

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

I am doing distributed training on the GCP Vertex AI platform. The model is trained in parallel on 4 GPUs using PyTorch and Hugging Face. After training, when I save the model from the local container to a GCP bucket, it throws the following error.
Here is the code:
I launch the train.py this way:
python -m torch.distributed.launch --nproc_per_node 4 train.py
After training is complete I save the model files with the code below; there are 3 files that need to be saved.
trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP
Error:
ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded
And sometimes I get this error:
ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.
As per the documentation, this is a name conflict: you are trying to overwrite a file that has already been created.
So I would recommend changing the destination location to include a unique identifier per training run so you don't receive this type of error, for example by appending a timestamp string to your bucket path like:
- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
I would also like to mention that this kind of error is retryable, as mentioned in the error docs.
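A minimal sketch of that suggestion, reusing the subprocess call from the question; the bucket name here is a placeholder (the real one was elided in the question), and the only change is the timestamp suffix on the destination path:

import subprocess
from datetime import datetime

# Placeholder bucket; substitute your own destination.
BUCKET = "gs://my-bucket"

# Unique suffix per training run, so a retry never collides with an object
# created by an earlier attempt.
run_id = datetime.utcnow().strftime("%Y%m%d%H%M%S")
destination = "%s/model_mlm/%s" % (BUCKET, run_id)

subprocess.call(
    "gsutil -o GSUtil:parallel_composite_upload_threshold=0 "
    "cp -r /pythonPackage/trainer/model_mlm " + destination,
    shell=True,
    stdout=subprocess.PIPE,
)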

"Resource temporarily unavailable" error on reading CSV file in web2py app on PythonAnywhere

I have a Python web2py app hosted on PythonAnywhere. The app is working fine. I want to read a CSV file placed in a folder alongside my app and import it into a MySQL table. When I try to read that CSV file, I get the error "[Errno 11] Resource temporarily unavailable".
I am new to Python as well as PythonAnywhere, and I can't figure out how to overcome this error and read the CSV file successfully on the server.
Note: I can run this code successfully on my local machine.
What I am doing is this:
path = '/home/user123/web2py/files/'
file_ = path + filename
print file_
with open(file_, "r") as f_obj:
    reader = csv.reader(f_obj)
    fields = reader.next()
    print fields
    self.create_new_table(tablename, fields)
I will appreciate any help in this regard. Thanks in advance.
I opened the server.log file in the Web tab and found that the print statement "print fields" was causing the error. It tried to print all the column names, and partway through those column names it produced this error and stopped execution. I removed the print statements that were printing long output and the error was gone!
It seems to be a limit on print output or something similar; I don't know exactly what.
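For what it's worth, a minimal sketch of one way to avoid printing long output on the server at all, using the standard logging module instead of print (the log file path, CSV file name, and logger name are assumptions, not from the original answer):

import csv
import logging

# Log to a file inside the app instead of the server's stdout, which is
# where print writes.
logging.basicConfig(
    filename='/home/user123/web2py/files/import.log',
    level=logging.INFO,
)
logger = logging.getLogger('csv_import')

with open('/home/user123/web2py/files/data.csv', 'r') as f_obj:
    reader = csv.reader(f_obj)
    fields = next(reader)
    # Log the (potentially long) header row instead of printing it.
    logger.info("CSV header fields: %s", fields)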

sorl-thumbnail with graphicsmagick backend pdf/image conversion error on Windows7

I'm trying to set up the sorl-thumbnail django app to provide thumbnails of pdf-files for a web site - running on Windows Server 2008 R2 with an Apache web server.
I've had sorl-thumbnail functional with the PIL backend for thumbnail generation of jpeg images - which was working fine.
Since PIL cannot read pdf-files I wanted to switch to the graphicsmagick backend.
I've installed and tested the graphicsmagick/ghostscript combination. From the command line
gm convert foo.pdf -resize 400x400 bar.jpg
generates the expected jpg thumbnail. It also works for jpg to jpg thumbnail generation.
However, when called from sorl-thumbnail, ghostscript crashes.
From django python shell (python manage.py shell) I use the low-level command described in the sorl docs and pass in a FieldFile instance (ff) pointing to foo.pdf and get the following error:
In [8]: im = get_thumbnail(ff, '400x400', quality=95)
**** Warning: stream operator isn't terminated by valid EOL.
**** Warning: stream Length incorrect.
**** Warning: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** Error: Trailer is not found.
GPL Ghostscript 9.07: Unrecoverable error, exit code 1
Note that ff is pointing to the same file that converts fine when using gm convert from command line.
I've tried also passing an ImageFieldFile instance (iff) and get the following error:
In [5]: im = get_thumbnail(iff, '400x400', quality=95)
identify.exe: Corrupt JPEG data: 1 extraneous bytes before marker 0xdb `c:\users\thin\appdata\local\temp\tmpxs7m5p' # warning/jpeg.c/JPEGWarningHandler/348.
identify.exe: Corrupt JPEG data: 1 extraneous bytes before marker 0xc4 `c:\users\thin\appdata\local\temp\tmpxs7m5p' # warning/jpeg.c/JPEGWarningHandler/348.
identify.exe: Corrupt JPEG data: 1 extraneous bytes before marker 0xda `c:\users\thin\appdata\local\temp\tmpxs7m5p' # warning/jpeg.c/JPEGWarningHandler/348.
Invalid Parameter - -auto-orient
Changing back sorl settings to use the default PIL backend and repeating the command for jpg to jpg conversion, the thumbnail image is generated without errors/warnings and available through the cache.
It seems that sorl is copying the source file to a temporary file before passing it to gm - and that the problem originates in this copy operation.
I've found what I believe to be the copy operation in the sources of sorl_thumbnail-11.12-py2.7.egg\sorl\thumbnail\engines\convert_engine.py lines 47-55:
class Engine(EngineBase):
    ...
    def get_image(self, source):
        """
        Returns the backend image objects from a ImageFile instance
        """
        handle, tmp = mkstemp()
        with open(tmp, 'w') as fp:
            fp.write(source.read())
        os.close(handle)
        return {'source': tmp, 'options': SortedDict(), 'size': None}
Could the problem be here - I don't see it!
Any suggestions of how to overcome this problem would be greatly appreciated!
I'm using django 1.4, sorl-thumbnail 11.12 with memcached and ghostscript 9.07.
After some trial and error, I found that the problem could be solved by changing the write mode from 'w' to 'wb', so that the sources of sorl_thumbnail-11.12-py2.7.egg\sorl\thumbnail\engines\convert_engine.py lines 47-55 now read:
class Engine(EngineBase):
    ...
    def get_image(self, source):
        """
        Returns the backend image objects from a ImageFile instance
        """
        handle, tmp = mkstemp()
        with open(tmp, 'wb') as fp:
            fp.write(source.read())
        os.close(handle)
        return {'source': tmp, 'options': SortedDict(), 'size': None}
There are I believe two other locations in the convert_engine.py file, where the same change should be made.
After that, the gm convert command was able to process the file.
However, since my PDFs are fairly large multipage files, I then ran into other problems, the most important being that the get_image method makes a full copy of the file before the thumbnail is generated. With file sizes around 50 MB this turns out to be a very slow process, so in the end I opted for bypassing sorl and calling gm directly, as sketched below. The thumbnail is then stored in a standard ImageField. Not so elegant, but much faster.
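For reference, a minimal sketch of what that bypass looks like, assuming the gm binary from the question is on the PATH and a hypothetical model instance with an ImageField called thumbnail; this is a plain subprocess call, not part of the sorl API:

import os
import subprocess
import tempfile

from django.core.files import File


def make_pdf_thumbnail(pdf_path, instance):
    # Same command that worked on the command line; the [0] suffix asks
    # GraphicsMagick for the first page only (an assumption for multipage PDFs).
    tmp = tempfile.NamedTemporaryFile(suffix='.jpg', delete=False)
    tmp.close()
    try:
        subprocess.check_call([
            'gm', 'convert', '%s[0]' % pdf_path, '-resize', '400x400', tmp.name,
        ])
        # Store the result in a plain ImageField (assumed here to be called
        # instance.thumbnail) instead of going through sorl's cache.
        with open(tmp.name, 'rb') as fp:
            instance.thumbnail.save('thumbnail.jpg', File(fp), save=True)
    finally:
        os.remove(tmp.name)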

ZMQ IOLoop instance write/read workflow

I am seeing some weird system behavior when using PyZMQ's IOLoop instance:
import time

import zmq
from zmq.eventloop import ioloop, zmqstream


def main():
    context = zmq.Context()
    s = context.socket(zmq.REP)
    s.bind('tcp://*:12345')
    stream = zmqstream.ZMQStream(s)
    stream.on_recv(on_message)
    io_loop = ioloop.IOLoop.instance()
    io_loop.add_handler(some_file.fileno(), on_file_data_ready_read_and_then_write, io_loop.READ)
    io_loop.add_timeout(time.time() + 10, another_handler)
    io_loop.start()

def on_file_data_ready_read_and_then_write(fd, events):
    # Read content of the file and then write back
    some_file.read()
    print "Read content"
    some_file.write("blah")
    print "Wrote content"

def on_message(msg):
    # Do something...
    pass

if __name__ == '__main__':
    main()
Basically the event loop listens on zmq port 12345 for JSON requests, and reads content from a file when it becomes available (and when it does, it manipulates the content and writes it back; the file is a special /proc/ entry exposed by a kernel module built for this purpose).
Everything works well BUT, for some reason when looking at the strace I see the following:
...
1. read(\23424) <--- Content read from file
2. write("read content")
3. write("Wrote content")
4. POLLING
5. write(\324324) # <---- THIS is the content that was sent using some_file.write()
...
So it seems the write to the file was not done in the order of the Python script: the write system call to that file happened AFTER the polling, even though it should have happened between lines 2 and 3.
Any ideas?
Looks like you're running into a buffering problem. If some_file is a file-like object, you can try explicitly calling .flush() on it; the same goes for the ZMQ socket, which can hold messages for efficiency reasons as well.
As it stands, the file's contents are only being flushed when the some_file reference is garbage collected.
Additional:
Use the context-manager support that newer versions of Python provide with open():
with open("my_file", "w") as some_file:
    some_file.write("blah")
As soon as the with block exits, some_file will automatically be flushed and closed.
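A minimal sketch of the flush suggestion applied to the handler from the question (some_file is assumed to be the same readable and writable file object used there):

def on_file_data_ready_read_and_then_write(fd, events):
    # Read the current content of the special file.
    data = some_file.read()
    # ... manipulate the data as needed, then write the response back.
    some_file.write("blah")
    # Push the buffered write out to the file descriptor now, instead of
    # whenever the file object happens to be flushed or garbage collected.
    some_file.flush()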

Django1.1 file based session backend multi-threaded solution

I read django.contrib.sessions.backends.file today; in the save method of SessionStore there is something like the following that's used to achieve save integrity under concurrent access:
output_file_fd, output_file_name = tempfile.mkstemp(dir=dir,
                                                    prefix=prefix + '_out_')
renamed = False
try:
    try:
        os.write(output_file_fd, self.encode(session_data))
    finally:
        os.close(output_file_fd)
    os.rename(output_file_name, session_file_name)
    renamed = True
finally:
    if not renamed:
        os.unlink(output_file_name)
I don't quite understand how this solve the integrity problem.
Technically this doesn't solve the integrity problem completely; ticket #9084 addresses this issue.
Essentially this works by using tempfile.mkstemp, which is guaranteed to create the temporary file atomically and uniquely, and writing the session data to that file. It then calls os.rename(), which renames the temp file to the real session file. On Unix this replaces any existing file in a single step; on Windows it raises an error if the destination already exists. This should be fixed for Django 1.1.
If you look in the revision history you'll see that they previously used locks, but changed to this method for various reasons.