Timing issue with image upload on Heroku + Django Rest Framework + S3

OK, this is a mix of an architecture-design and a code question. I've had this problem on a couple of applications and I'm not sure how to handle it.
The problem is fairly basic: an app that allows users to attach files/images to an object.
Let's say I have the following Django models: Document and Attachment.
from django.db import models

class Document(models.Model):
    body = models.TextField(blank=True, help_text='Plain text.')

class Attachment(models.Model):
    document = models.ForeignKey(Document, on_delete=models.CASCADE, related_name='attachments')
    attachment = models.FileField(upload_to=get_attachment_path, max_length=255, null=False)
    filesize = models.IntegerField(default=0)
    filetype = models.CharField(blank=True, max_length=100)
    orientation = models.IntegerField(default=0)
The get_attachment_path callable simply builds a path based on the document pk and the user.
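For illustration, it might look something like this (the exact path layout and the user lookup are assumptions, not the original code):

def get_attachment_path(instance, filename):
    # upload_to callables receive the model instance and the original filename.
    # Hypothetical layout: attachments/<user_pk>/<document_pk>/<filename>
    return 'attachments/{}/{}/{}'.format(
        instance.document.user_id, instance.document_id, filename)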
When a user creates a document, they can attach files to it. Since this is a modern app, uploads should be possible before the document itself exists, so I have a TempUpload model and a direct connection from the web application to S3 (using pre-signed URLs).
When the user clicks save on the document form, I get an array of all the TempUpload objects that I need to attach to the new document.
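For reference, generating such a pre-signed upload URL with boto3 might look like this (bucket name and expiry are placeholders):

import boto3

s3 = boto3.client('s3')

def presign_upload(key):
    # The browser PUTs the file straight to S3 with this URL,
    # bypassing the Heroku dyno entirely.
    return s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': 'my-bucket', 'Key': key},
        ExpiresIn=3600,  # one hour
    )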
Now here's the problem: within Heroku's 30-second request timeout, I need to (a rough sketch of these steps follows the list):
create the document (quite fast)
iterate over the array of TempUpload objects and create an Attachment for each
copy each file to its final destination (using boto3 copy_object)
get the filetype (using the magic library; I only query the first 128 bytes, but that's still one round trip to S3)
get the file size (another round trip to S3)
get the orientation (only if the attachment is an image; read from the EXIF data, using streaming)
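Roughly, the per-file work looks like this (a sketch: the helper name, bucket, and the TempUpload.key attribute are mine; error handling and the EXIF/orientation step are omitted):

import boto3
import magic

s3 = boto3.client('s3')
BUCKET = 'my-bucket'  # placeholder

def finalize_attachment(document, temp_upload):
    # Hypothetical helper that mirrors get_attachment_path for a copy within S3.
    final_key = build_attachment_key(document, temp_upload)
    # Copy to the final destination; the bytes never leave S3.
    s3.copy_object(Bucket=BUCKET, Key=final_key,
                   CopySource={'Bucket': BUCKET, 'Key': temp_upload.key})
    # Round trip 1: fetch the first 128 bytes and sniff the MIME type.
    head = s3.get_object(Bucket=BUCKET, Key=final_key,
                         Range='bytes=0-127')['Body'].read()
    filetype = magic.from_buffer(head, mime=True)
    # Round trip 2: HEAD request for the full object size.
    filesize = s3.head_object(Bucket=BUCKET, Key=final_key)['ContentLength']
    return Attachment.objects.create(document=document, attachment=final_key,
                                     filesize=filesize, filetype=filetype)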
I've already moved thumbnail generation to an external service. But with bigger files (standard camera pictures can easily be 6-10 MB), the processing delay can approach 1 second per file, meaning that if the user uploads more than 20 images, I get very, very close to the Heroku timeout...
I'm currently using Celery and Redis to move most of the processing outside the response lifecycle, but it's not really nice: the document can be requested before the async task completes, in which case some info (size/orientation) is not available yet.
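The async version is essentially the same routine wrapped in a task; a minimal sketch (assuming the finalize_attachment helper above):

from celery import shared_task

@shared_task
def finalize_attachment_task(document_id, temp_upload_id):
    # Runs outside the request/response cycle. Until it finishes,
    # filesize/filetype/orientation keep their defaults, which is
    # exactly the consistency gap described above.
    document = Document.objects.get(pk=document_id)
    temp_upload = TempUpload.objects.get(pk=temp_upload_id)
    finalize_attachment(document, temp_upload)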
This should actually be a fairly standard feature, so I'd be interested to know how you implement it.
Edit:
Just to add some strategies I'm thinking of:
get the info at the initial upload instead of when saving the document. This would work, but the problem is that I don't know when the upload has completed (it happens directly between the web app and S3)
use a Lambda function? (see the sketch after this list)
use S3 object tags instead of moving the file, to easily differentiate the "saved" files from the ones that should be deleted?
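On the first two ideas: S3 can fire an s3:ObjectCreated notification when the browser's direct upload finishes, so an event-triggered Lambda function knows exactly when the upload completed, and the event payload already carries the object size. A minimal handler sketch (the callback endpoint is a placeholder, not an existing API):

import json
import urllib.request
from urllib.parse import unquote_plus

def handler(event, context):
    for record in event['Records']:
        key = unquote_plus(record['s3']['object']['key'])  # keys arrive URL-encoded
        size = record['s3']['object']['size']  # no extra round trip needed
        # Notify the Django app so it can mark the TempUpload as complete.
        body = json.dumps({'key': key, 'size': size}).encode()
        req = urllib.request.Request(
            'https://example.com/api/upload-complete/', data=body,
            headers={'Content-Type': 'application/json'})
        urllib.request.urlopen(req)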
For testing and dev I'd like to keep as much logic as possible within the Django app... but hey, maybe that's not possible.

Related

AWS S3 filename

I'm trying to build an application with a backend in Java that allows users to create a text with images in it (something like a personal blog). I'm planning to store these images in an S3 bucket. When uploading image files to the bucket, I hash the original name and store the hashed one in the bucket. Images are for display purposes only; no user will be able to download them. The frontend displays these images by getting a path to them from the server. So the question is: is there any need to store the original name of the image file in the database? And what are the reasons, if any, for doing so?
I guess in general it is not needed, because what matters more is how these resources are used and managed in the system.
Assuming your service is something like data access (similar to Google Drive), I don't think it's necessary to store it in the DB, unless you want to make search queries faster.

Django: Insert image to PostgreSQL (PgAdmin4)

I'm currently working on an e-shop. My idea is to store images with Django models in PostgreSQL (administered through pgAdmin 4). As I saw in older posts, methods like bytea('D:\image.jpg') just convert the string constant to its binary representation.
So my question is whether there is a newer method to store the actual image, or whether it is possible to grab the image via a path?
models.py
image = models.ImageField(null=True, blank=True)
PgAdmin4
INSERT INTO product_images (id, image)
VALUES (SERIAL, ?);  -- how to insert the image?
There are several options for storing images. The first is to use a storage service like S3, which I recommend; you can read this article for more detailed information. There is also a ready-made third-party package for using S3 with Django that I can recommend from experience. If you use this option, the ImageField will keep the path to the file in S3.
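Assuming the package meant here is django-storages (a common choice; my guess, as the answer doesn't name it), wiring ImageField to S3 is mostly configuration:

# settings.py -- sketch, assumes django-storages and boto3 are installed
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
AWS_STORAGE_BUCKET_NAME = 'my-bucket'   # placeholder
AWS_S3_REGION_NAME = 'eu-west-1'        # placeholder
# Credentials are picked up from environment variables or an IAM role as usual.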
Another option, if you are using only one server, is to keep the pictures on that server's local disk. Again, the ImageField will keep the path.
If you insist on keeping them directly in the database, you can follow this link. Currently, there is no newer method for it.
But I have to say that I think using a storage service like S3 is the best way under all circumstances.

Connection issues implementing Django Custom Storage backend with Box Cloud Storage API

I am working on a legacy Django application that has a lot of third party dependencies, one of which is the storage backend for file uploads. I was recently tasked with replacing our legacy third party cloud storage vendor with a newer cloud storage vendor (Box).
The cloud storage is implemented as a custom storage backend and used as the "storage" parameter in FileFields in models throughout the app. I'm basically trying to figure out what exactly happens in storage if you have a FileField in a model and you create a ModelForm based on that model, then you call "save" on the form.
It seems that a lot of stuff is going on and some of it is causing connection problems with the cloud storage API.
I tried reading the Django source to follow it and got all the way down to where the model is deciding if it should do an update (by doing an "exists" check in storage) or an insert.
Once it decides to do an insert, I noticed that a call to my cloud storage backend occurs to upload the file (presumably non-blocking?) as the insert SQL is being generated.
Somewhere in here, connections to the cloud storage begin to hang and become unresponsive. At least, all I see in logs is
INFO requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): upload.box.com
and no further info or response.
Unlike the previous cloud storage, the new one issues JWTs per session instead of having a static auth token that you simply pass every time. If I do not use Django's ModelForm with its magical "save" method, but instead call methods directly on the models with the FileFields, I do not encounter the connection problem; I get responses just fine from the cloud storage API.
So I'm thinking there must be some kind of concurrency issue when calling "save" on a form that affects a model with a FileField? I'm a little stumped. The code is involved, so it is hard to copy here, but basically it comes down to:
class CustomStorage(Storage):
    def __init__(self):
        # set up the storage API client,
        # authenticate the client instance, etc.
        ...

    def _save(self, name):
        # call storage API methods to upload the file;
        # includes a retry loop with a file-renaming algorithm
        # to avoid name conflicts, as the cloud API does not
        # allow duplicate file names
        ...

    def exists(self, name):
        # call storage API methods to determine if a file name conflict exists
        ...

    def _open(self, name):
        # call storage API methods to download the file
        ...

custom_storage = CustomStorage()

class ExampleModel(models.Model):
    name = models.CharField(max_length=255)
    # FPFileField because we also have a dependency on FilePicker, now called Filestack
    file_ref = FPFileField(upload_to="uploads", storage=custom_storage)

class ExampleModelForm(ModelForm):
    file_ref = CustomFilePickerField()

    class Meta:
        model = ExampleModel
        fields = ('name', 'file_ref')

form = ExampleModelForm()
model = form.save()  # --> connection problem with cloud storage API starts here;
                     # if I were to call ExampleModel.objects.create(...),
                     # the storage upload process would work fine
Is there some gotcha I'm not aware of that Django experts would know about implementing custom storage backends for Django based on cloud storage APIs?
Turns out the problem was our use of a deprecated version of a FilePicker field for the uploaded-document model field, combined with an old bug in a stale version of Requests. Occasionally, this field would return a Django File instance wrapped around a cStringIO.StringIO object instead of a vanilla file object. That ran into a bug in the Requests library which caused stalled responses when chunking a multi-part POST whose payload is a StringIO instance. Because upgrading is not an option, I solved the issue by detecting whether the underlying Django File is not a real file object, re-wrapping it in a ContentFile instance, and seeking to 0, in order to play nice with the older version of Requests. If anyone knows of a better alternative, by all means, please let me know.
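The workaround described above, roughly (a sketch only; the legacy app is clearly Python 2, given cStringIO, so the Python 2 built-in file type is used for the check):

from django.core.files.base import ContentFile

def ensure_plain_file(field_file):
    # Older Requests stalls when chunking a multipart POST whose body
    # is backed by (c)StringIO, so re-wrap anything that isn't a real
    # file object in a ContentFile and rewind it.
    if not isinstance(field_file.file, file):  # Python 2 'file' built-in
        wrapped = ContentFile(field_file.read(), name=field_file.name)
        wrapped.seek(0)
        return wrapped
    return field_file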

How to use Django ImageField, and why use it at all?

Up until now, I've been storing my image filenames in a CharField and saving the actual file directly to S3. This was a fine solution for my own usage. I'd like to reconsider using an ImageField, since now there will be other users and file input validation would be appropriate.
I have a couple of questions that weren't exactly answered after reading the docs and the source code for FileField (which appears to be essentially ImageField minus the Pillow check and dimension field updating functionality).
1) Why use an ImageField at all? Or rather, why use a FileField? Sure, it's convenient for quick-and-easy forms and for inserting into Django templates. But are there any substantial reasons? E.g. is it demonstrably secured against exploits and malicious uploads?
2) How to write to the field file? Given that the file can be read via instance.imagefield (or is it instance.imagefield.file?), if I want to write to it, can I simply do the following?
@receiver(pre_save, sender=Image)
def pre_save_image(sender, instance, *args, **kwargs):
    instance.imagefield = process_image(instance.imagefield)
3) How to try saving with a specific filename, then try again with a new filename if that randomly generated filename already exists? For example, with my code right now I do the following; how can it be done with ImageField? I want to do it at the model layer, because if I do repeated tries at the view layer then the pre_save processing would run again, which is ugly (even though it's unlikely a second try will ever be needed in the lifetime of the service).
for i in range(tries):
    try:
        name = generate_random_name()
        media_storage.save(name + '.jpg', ContentFile(final_bytes))
        break
    except:
        pass
4) In the models.py pre_save and post_save signals and in the actual model's save(), how can I tell if a file came in with the request? i.e. I want to know if a new image is incoming to be saved, or if there is no image (some other field in the object is being updated and the image itself remains unchanged).
I don't see any advantage of FileField or ImageField over what you are doing today. In fact, as I see it, the proper/modern/scalable way to deal with uploads is to have the client (browser) upload files directly to S3.
If done correctly (from a security standpoint), this scheme allows you to scale incredibly well without adding more computing power on your side. As an example, consider 100 people uploading a picture at the same time: your server needs to receive all that data, only to upload it again to S3. On the other hand, you can have 1,000 people uploading at the same time, and I can assure you AWS can handle it. Your server only needs to handle the signing of the URL, which is a lot less work.
Take a look at fine-uploader, as a good technology to use to handle the efficient upload to s3 (loading in chunks, error checking, etc): http://docs.fineuploader.com/endpoint_handlers/amazon-s3.html. Google "django fineuploader" to find a sample application for Django.
In my case, I use a Model with a couple of CharFields (bucket, key) plus a few other things specific to my application. My data flow is as follows:
Django serves a page with the fine-uploader widget, configured based on my settings.
Fine Uploader requests a signed URL from the Django server (endpoint), and uses that to upload to S3 directly.
When the upload is complete, Fine Uploader makes another request to my server to register the completion of the upload, at which time I create my object in the database. This way, if the upload fails, I never create an object in the database.
On the AWS side, S3 triggers a Lambda function, which I use to create a thumbnail and store it back to S3 (a minimal sketch of such a function follows). So I don't even use my own CPU (e.g. Celery) for resizing. So you see, not only can I have thousands of users uploading at the same time, but I can resize those thousands of pictures in parallel, for less than what an EC2 worker would cost me.
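The resize function can be quite small; a sketch (assumes a Lambda deployment package that bundles Pillow, and that the trigger excludes the thumbs/ prefix to avoid recursive invocations):

import boto3
from io import BytesIO
from urllib.parse import unquote_plus
from PIL import Image

s3 = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])
        data = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        img = Image.open(BytesIO(data)).convert('RGB')  # JPEG has no alpha
        img.thumbnail((300, 300))  # bounded resize, keeps aspect ratio
        out = BytesIO()
        img.save(out, format='JPEG')
        s3.put_object(Bucket=bucket, Key='thumbs/' + key, Body=out.getvalue())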
My Django Model is also used as a wrapper to manage the business logic (e.g. functions like get_original_url() and get_thumbnail_url()), so after the uploads it is easy for my templates to get the signed read-only URLs.
In short, you can implement your own version of Fine Uploader if you want, or use one of the many alternatives, but assuming you follow the recommended security best practices on the AWS side (e.g. create a special IAM identity with write-only permission for the client, even if you are using signed URLs), this, IMO, is the best practice for dealing with uploads, especially if you are using S3 or similar to store these files.
Sorry if I am only really answering question 1, but questions 2 and 3 don't apply if you accept my answer for 1.
1) Why use an ImageField at all? Or rather, why use a FileField? It's convenient for quick-and-easy forms and convenient for inserting into Django templates. But are there any substantial reasons, e.g. is it demonstrably secured against exploits and malicious uploads?
Yes. I daresay your own code probably does it too, but for a newbie, using the FileField will probably ensure that your important system files do not get overwritten by a malicious upload.
2) How to write to the field file?
In your situation you would need to use a special storage backend that makes it possible to write directly to Amazon S3. As you know, the storage backends for FileField and ImageField are pluggable. Here is one example plugin: http://django-storages.readthedocs.io/en/latest/backends/amazon-S3.html. It includes sample code that demonstrates how the field can be written to, so I will not go into that here.
3) How to try saving with a specific filename, then try again with a new filename if that randomly generated filename already exists?
ImageField and FileField take care of this for you automatically: a new filename is created if the old one exists. The code in my answer here did that automatically when I called it over and over again. Here are some sample filenames produced (the input being bada.png):
"4", "media/bada.png"
"5", "media/bada_aH0gV7t.png"
"7", "media/bada_XkzthgK.png"
"8", "media/bada_YzZuwDi.png"
"9", "media/bada_wpkasI3.png"
4) In the models.py pre_save and post_save signals and in the actual model's save(), how can I tell if a file came in with the request?
In pre_save, instance.pk will be None if this is a brand-new upload; if this is a modification to an existing record, the pk will already be set.
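Concretely, a common pattern for telling whether the file itself changed (a sketch, reusing the Image model and imagefield names from the question's own snippet):

from django.db.models.signals import pre_save
from django.dispatch import receiver

@receiver(pre_save, sender=Image)
def detect_new_upload(sender, instance, **kwargs):
    if instance.pk is None:
        return  # brand-new row: the file is definitely new
    old = sender.objects.get(pk=instance.pk)
    if old.imagefield.name != instance.imagefield.name:
        # a new file is being saved over the old one
        pass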
It took me forever to learn how to save an image using ImageField. It turns out to be crazy easy once you know how to do it; it all comes together sensibly after you see it.
So basically, you're working with a FileField. I already looked into the differences between ImageField and FileField: ImageField takes everything FileField takes in terms of attributes, but ImageField also accepts width and height attributes if indicated, and, unlike FileField, it validates the upload, making sure it's an image.
Using ImageField comes down to most of the same constructs as FileField. The biggest thing to remember:
request.FILES['name_of_field']
So a form is generated from something in forms.py (or wherever your forms are) like this:
imgfile = forms.ImageField(label='Choose your image',
                           help_text='The image should be cool.')
In the model, you might have this in correspondence:
imgfile = models.ImageField(upload_to='images/%m/%d')
So there will be a POST request from the user (when the user completes the form). That request will contain, basically, a dictionary of data. The dictionary holds the submitted files. To pick out the file from our field (in our case, an ImageField), you would use:
request.FILES['imgfile']
You would use that when you construct the model object (instantiating your model class):
newPic = ImageModel(imgfile=request.FILES['imgfile'])
To save that the simple way, you'd just use the save() method bestowed upon your object (because Django is that awesome):
if form.is_valid():
    newPic = ImageModel(imgfile=request.FILES['imgfile'])
    newPic.save()
Your image will be stored, by default, to the directory you indicate for MEDIA_ROOT in settings.py.
The tough part, which isn't really so tough when you catch on, is accessing the image.
In your template, you could have something like this:
<img src="{{ MEDIA_URL }}{{ image.imgfile.name }}">
Where {{ MEDIA_URL }} is something like /media/, as indicated in settings.py, and {{ image.imgfile.name }} is the name of the file plus the subdirectory you indicated in the model. "image" in this case is just the current image in a loop you might write to access each image in the database:
{% for image in images %}
    <img src="{{ MEDIA_URL }}{{ image.imgfile.name }}">
{% endfor %}
Make SURE you configure your urls properly to handle the image or the image won't work. Add this to your urls:
urlpatterns += patterns('',
    url(r'^media/(?P<path>.*)$', 'django.views.static.serve', {
        'document_root': settings.MEDIA_ROOT,
    }),
)

Django model for storing multiple files in a single user profile [duplicate]

This question already has answers here:
How can javascript upload a blob?
(6 answers)
Closed 8 years ago.
When a user logs in, they generate a WAV audio file (in the browser) which has to be stored in the database. The problem is that everything is fine when I store the file for the first time, but when I store one after that, I get an IntegrityError. Does anyone have a solution to this?
My current models.py is the following:
class InputFile(models.Model):
    audio_file = models.FileField(upload_to='audio_files')
    input_user = models.ForeignKey(User)
    rec_date = models.DateTimeField('date recorded', auto_now_add=True)
I would still recommend storing the user media as files. This way you avoid storing large files in the database, which is not only inefficient but also slow.
Furthermore, I would not recommend performing audio processing within the Django request-response cycle. Use an asynchronous solution like Celery instead.
Audio processing is quite similar to Video processing. You can get a lot of ideas from how video sharing sites like YouTube work. Once you upload a file, they notify you when the uploading and encoding is completed. This happens in a separate task queue.
The intermediate and final media files are again stored as files and served directly from the filesystem. By avoiding the database layer you can get a much more responsive application.