Django File object and S3

So I have added S3 support to one of my Django projects (django-storages and boto3).
I have a model with a FileField that holds a zip archive containing images.
At some point I need to access this zip archive and parse it to create instances of another model from the images in the archive. It looks something like this:
I access the archive data with zipfile
Get an image from it
Put this image into a Django File object
Assign this File object to the model field
Save the model
It works perfectly fine without S3, however with it I get an UnsupportedOperation: seek error.
My guess is that boto3/django-storages does not support uploading files to S3 from in-memory files. Is that the case? If so, how do I fix or avoid this in this kind of situation?
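A common way around that error, sketched roughly here (ImageModel, zip_file, and image_file are placeholder names, not from the question), is to read each archive member into memory and wrap it in a ContentFile, which is seekable and therefore acceptable to the S3 backend:

import zipfile
from io import BytesIO

from django.core.files.base import ContentFile

def import_images(archive_instance):
    # Read the whole zip from the (possibly remote) FileField into memory,
    # so the archive itself is seekable regardless of the storage backend.
    archive_instance.zip_file.open('rb')
    buffer = BytesIO(archive_instance.zip_file.read())

    with zipfile.ZipFile(buffer) as archive:
        for name in archive.namelist():
            data = archive.read(name)  # raw bytes of one image
            image = ImageModel()  # placeholder model for the extracted images
            # ContentFile wraps the bytes in a seekable in-memory file,
            # which the S3 storage backend can upload without the seek error.
            image.image_file.save(name, ContentFile(data), save=True)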

Related

S3 trigger to perform a file conversion for a multi-part file type

I am working on converting shapefiles to geojson. Shapefiles are composed of at least 3 required files and as many as 8 separate files all residing in a folder. To convert to geojson you need all the constituent parts. Right now I have a batch conversion process that goes through all the shapefiles stored in an s3 bucket, downloads all the separate file parts and performs the conversion. What I'm trying to figure out now is how to run the file conversion process based on the upload of a single shapefile folder, hopefully using an s3 bucket trigger.
I have reviewed this answer (AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function) but in my case there is no frontend client (the answer presented in that question appears to be to signal a final event, but that is done from the client interface). Maybe I need to build one, but I was trying to handle this only in the backend (there is no frontend and no plans to have one). The 'user' would be dropping the files right into s3 directly without a file upload interface.
Of course when someone uploads a folder with all the shapefile parts in it, it triggers the s3 trigger for each part but each part cannot produce a shapefile alone.
A few solutions I thought of but with their own problems:
1. I am converting the shapefiles to geojson and storing the geojson in a separate s3 bucket using a naming convention for the geojson based on the s3 file name. In theory you could always check if the geojson exists in the destination s3 bucket already and if not, run the conversion. But this still doesn't take care of the timing aspect of all the multiple parts of the file being uploaded. I could check the name but it would be triggered multiple times, fail on some and then ultimately (probably) succeed after all the parts are in place.
1a. Maybe some type of try/except error checking on the conversion mentioned above? Meaning, for each file part uploaded, go ahead and try to download and convert (see the sketch after this list). This seems fragile and potentially error-prone. Also, I believe that a certain subset of all the files will likely produce a geojson without error but without all the metadata or the complete set of data, so a "successful" conversion may not actually be a success.
2. Using a database to track which files have been converted, which would basically be the same solution as 1 above.
3. Partly a question as a solution: in the S3 web console there is 'file' upload and 'folder' upload. To upload the shapefile folder containing all the component parts, you'd have to use the 'folder' option. The question then is: is there any way to know, from the event-trigger perspective, that the operation was a folder upload rather than a file upload, and therefore wait until all the parts of the folder are uploaded? Or is there any event data in AWS that counts the underlying file parts when a folder is uploaded (1 of 6, 2 of 6, etc.) and could send an event after all the parts of the folder have been uploaded?
I am also aware of 'multipart' upload, which would, I think, do what I proposed in #3 above, but that multipart 'tag' only applies if you upload via the SDK or CLI. Unless the S3 console folder upload is a multipart upload underneath?
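As a concrete illustration of options 1/1a above, here is a rough sketch of a Lambda handler that only runs the conversion once all required parts are present; the required-extension set and the convert_to_geojson helper are assumptions, not part of the question:

import boto3

REQUIRED_EXTENSIONS = {'.shp', '.shx', '.dbf'}  # assumed minimum set of parts

s3 = boto3.client('s3')

def handler(event, context):
    # Fires once per uploaded object; only convert when every required
    # sibling file for the same shapefile prefix is already in the bucket.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']          # e.g. "parcels/parcels.shp"
    prefix = key.rsplit('.', 1)[0]         # "parcels/parcels"

    listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    present = {obj['Key'][len(prefix):] for obj in listing.get('Contents', [])}

    if REQUIRED_EXTENSIONS.issubset(present):
        convert_to_geojson(bucket, prefix)  # placeholder for the existing conversion logic

This still triggers once per uploaded part, but only the invocation that sees the last required part does any actual work.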

Correct way to fetch data from an aws server into a flutter app?

I have a general understanding question. I am building a Flutter app that relies on a content library containing text files, LaTeX equations, images, PDFs, videos, etc.
The content lives on an AWS Amplify backend. Depending on the user's navigation in the app, the corresponding data is fetched and displayed.
I am not sure about the correct way of fetching the data. The current method (which works) is that the data is stored in an S3 bucket. When data is requested, the data is downloaded to a temporary directory and then opened and processed in the app. This is actually not slow, but I feel that it is not the way it should be done.
When data is downloaded, a file transfer notification pops up, which bothers me because it is shown all the time. Also, I would like to read the data directly with something like a GET request, without downloading the file first (especially for text files, which I would like to read directly into a String). But here I don't know how it works, because I don't see that you can store data in a file system with the other Amplify services like DataStore or the REST API. The S3 bucket is an intuitive way of storing data that is easy to use for the content creators at my company, so to me S3 seems like the way to go. However, with S3 I have only figured out the download method for fetching data.
Could someone give me a hint on what is the correct approach for this use case? Thank you very much!
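One pattern that avoids the temp-file download, sketched here only on the backend side with boto3 (bucket and key names are placeholders), is to hand the app a short-lived presigned URL so it can fetch the object body with an ordinary HTTP GET and, for text files, read the response straight into a String:

import boto3

s3 = boto3.client('s3')

def presigned_get_url(bucket, key, expires=300):
    # Time-limited URL the client can GET directly; no file download needed.
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expires,
    )

# e.g. url = presigned_get_url('content-library-bucket', 'articles/intro.txt')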

Django how to upload file directly to 3rd-party storage server, like Cloudinary, S3

Now, I have realized the uploading process works like this:
1. Django builds the HTTP request object and populates request.FILES using an upload handler.
2. In views.py, the FieldFile instance (the file-facing counterpart of the FileField) calls storage.save() to upload the file.
So, as you see, Django always uses the cache or disk to pass the data; if your file is very large, it costs too much time.
The design I have in mind to solve this problem is a custom upload handler that calls storage.save() with the raw input data as it arrives. The only question is: how can I modify the behaviour of FileField?
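A rough, untested skeleton of the kind of upload handler described above might look like this (open_remote_stream and finalize are placeholders, not a real S3 integration):

from django.core.files.uploadhandler import FileUploadHandler

class DirectToStorageUploadHandler(FileUploadHandler):
    # Hypothetical handler that forwards chunks to remote storage as they
    # arrive instead of spooling the whole upload to memory or disk first.

    def new_file(self, *args, **kwargs):
        super().new_file(*args, **kwargs)
        self.remote = open_remote_stream(self.file_name)  # placeholder

    def receive_data_chunk(self, raw_data, start):
        self.remote.write(raw_data)
        return None  # returning None keeps later handlers from seeing the data

    def file_complete(self, file_size):
        # Placeholder: must return an UploadedFile-like object for request.FILES
        return self.remote.finalize(file_size)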
Thanks for any help.
You can use this package:
Add direct uploads to AWS S3 functionality with a progress bar to file input fields.
https://github.com/bradleyg/django-s3direct
You can use one of the following packages:
https://github.com/cloudinary/pycloudinary
http://django-storages.readthedocs.io/en/latest/backends/amazon-S3.html
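If you go the django-storages route, the S3 backend is mostly a settings change; a minimal sketch with placeholder values:

# settings.py -- minimal django-storages S3 configuration (placeholder values)
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
AWS_STORAGE_BUCKET_NAME = 'my-upload-bucket'
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'

Note that this still streams uploads through Django; of the options above, django-s3direct is the one that sends the file from the browser straight to S3.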

How to mix Django, Uploadify, and S3Boto Storage Backend?

Background
I'm doing fairly big file uploads on Django. File size is generally 10MB-100MB.
I'm on Heroku and I've been hitting the request timeout of 30 seconds.
The Beginning
In order to get around the limit, Heroku's recommendation is to upload from the browser DIRECTLY to S3.
Amazon documents this by showing you how to write an HTML form to perform the upload.
Since I'm on Django, rather than write the HTML by hand, I'm using django-uploadify-s3 (example). This provides me with an SWF object, wrapped in JS, that performs the actual upload.
This part is working fine! Hooray!
The Problem
The problem is in tying that data back to my Django model in a sane way.
Right now the data comes back as a simple URL string, pointing to the file's location.
However, I was previously using S3 Boto from django-storages to manage all of my files as FileFields, backed by the delightful S3BotoStorageFile.
To reiterate, S3 Boto is working great in isolation, Uploadify is working great in isolation, the problem is in putting the two together.
My understanding is that the only way to populate the FileField is by providing both the filename AND the file content. When you're uploading files from the browser to Django, this is no problem, as Django has the file content in a buffer and can do whatever it likes with it. However, when doing direct-to-S3 uploads like me, Django only receives the file name and URL, not the binary data, so I can't properly populate the FieldFile.
Cry For Help
Anyone know a graceful way to use S3Boto's FileField in conjunction with direct-to-S3 uploading?
Else, what's the best way to manage an S3 file just based on its URL? Including setting expiration, key id, etc.
Many thanks!
Use a URLField.
I had a similar issue where I wanted to either store a file to S3 directly using a FileField or give the user the option to enter a URL directly. To handle that, I used two fields in my model, one FileField and one URLField, and in the template I could use 'or' to pick whichever one exists, like {{ instance.filefield or instance.url }}.
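A rough sketch of that two-field model (names are made up):

from django.db import models

class Document(models.Model):
    # Either the file itself (stored through the configured storage backend)
    # or just a URL pointing at an object that was uploaded directly to S3.
    filefield = models.FileField(upload_to='docs/', blank=True)
    url = models.URLField(blank=True)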
This is untested, but you should be able to use:
from django.core.files.storage import default_storage
f = default_storage.open('name_you_expect_in_s3', 'r')
# f is an instance of S3BotoStorageFile, and can be assigned to a field
obj, created = YourObject.objects.get_or_create(**stuff_you_know)
obj.s3file_field = f
obj.save()
I think this should set up the local pointer to S3 and save it, without overwriting the content.
ETA: You should do this only after the upload completes on S3 and you know the key in s3.
Checkout django-filetransfers. Looks like it plays nice with django-storages.
I've never used Django, so YMMV :) but why not just write a single byte to populate the content? That way, you can still use FieldFile.
I'm thinking that writing actual SQL may be the easiest solution here. Alternatively you could subclass S3BotoStorage, override the _save method and allow for an optional kwarg of filepath which sidesteps all the other saving stuff and just returns the cleaned_name.
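A hedged sketch of that subclassing idea (untested; the already_on_s3 flag is made up, and note that Django's Storage.save() won't pass extra kwargs to _save on its own):

from storages.backends.s3boto import S3BotoStorage

class SkipUploadS3BotoStorage(S3BotoStorage):
    # Hypothetical: when the object is already in S3 (direct browser upload),
    # skip re-uploading the content and just return the cleaned key so the
    # FileField records the right name.
    def _save(self, name, content):
        if getattr(content, 'already_on_s3', False):
            return self._clean_name(name)
        return super(SkipUploadS3BotoStorage, self)._save(name, content)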

Is there a way to modify the storage backend used by a Django FileField?

I am saving some uploaded files with a Django FileField set to use the DefaultStorage backend. At some point after the file has been uploaded I'd like to move them to a different storage backend, i.e. change the FileField's storage attribute (obviously after saving the contents of the source file to the new storage location). Simply changing the FileField instance's storage doesn't seem to work.
Is this possible without the use of a second FileField model attr which has been told to use a different storage backend? Ideally I'd like to not have to double up on the fields and put switches in all the templates that reference the files.
Thanks!
It seems that the storage associated with a FileField is not written to the database, so setting it on a particular field instance won't persist. Instead it is read from the FileField instance associated with the model (so if you have file = models.FileField(..., storage=some_storage), it is reset to some_storage every time the models are set up by Django).
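If the goal is to migrate an already-uploaded file, one rough sketch (the field name and storage instance are placeholders) is to copy the bytes into the new backend and repoint the field's name; for later reads to resolve correctly, the model field itself still has to be configured with the new storage, for the reason described above:

from django.core.files.base import ContentFile

def move_to_storage(instance, new_storage):
    # Copy the file's bytes into the new backend, then repoint the field.
    field_file = instance.upload          # placeholder FileField attribute
    field_file.open('rb')
    new_name = new_storage.save(field_file.name, ContentFile(field_file.read()))
    instance.upload.name = new_name
    instance.save()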