Kubernetes share temporary storage to upload file async - django

Following this post and this, here's my situation:
Users upload images to my backend, setup like so: LB -> Nginx Ingress Controller -> Django (Uwsgi). The image eventually will be uploaded to Object Storage. Therefore, Django will temporarily write the image to the disk, then delegate the upload task to a async service (DjangoQ), since the upload to Object Storage can be time consuming. Here's the catch: since my Django replicas and DjangoQ replicas are all separate pods, the file is not available in the DjangoQ pod. Like usual, the task queue is managed by a redis broker and any random DjangoQ pod may consume that task.
I need a way to share the disk file created by Django with DjangoQ.
The above mentioned posts basically mention two solutions:
-solution 1: NFS to mount the disk on all pods. It kind of seems like an overkill since the shared volume only stores the file for a few seconds until upload to Object Storage is completed.
-solution 2: the Django service should make the file available via an API, which DjangoQ would use to access the file from another pod. This seems nice but I have no idea how to proceed... should I create a second Django/uwsgi app as a side container which would listen to another port and send an HTTPResponse with the file? Can the file be streamed?

Third option: don't move the file data through your app at all. Have the user upload it directly to object storage. This usually means making an API which returns a pre-signed upload URL that's valid for a few minutes, user uploads the file, then makes another call to let you know the upload is finished. Then your async task can download it and do whatever.
Otherwise you have the two options correctly. For option 2, and internal Minio server is pretty common since again, Django is very slow for serving large file blobs.

Related

Reliable File transfer from Django to external FTP

I have a Django application that needs to upload files generated upon the update of a model objects. More concretely, once a database entry is modified, a function has to be fired to generate a tiny CSV file and upload it to an (S)FTP server. My question is on how to make the uploads reliable and performant (possibly couple of thousands transactions a day). That is, make sure the files are always uploaded with possibly few duplicates without overloading the Django instance.
One option I explored was to send these tiny CSV files to a cloud queue (e.g. AWS SNS/SQS) and have another CRON job (e.g. AWS Lambda) to fetch the queue and upload the files.
Any ideias for the architecture?

Heroku doesn't update github file system when an image is uploaded from website

I ran into the problem where Heroku doesn't update my GitHub repository (or say static filesystem) when a blog post (including pictures) is created from the website.
Other images survive, whilst the ones saved in my filesystem with the server running on heroku, disapper.
I found this on their documentation.
The Heroku filesystem is ephemeral - that means that any changes to the filesystem whilst the dyno is running only last until that dyno is shut down or restarted.
I'm still confused why not all the pictures disappear and only those added later do.
Is AWS S3 a solution for this? If it is, how can I represent my filesystem using buckets?
Say, for the Blog Post 1 I have 2 picture resolutions, which means storing the files in different folders corresponding to those resolutions.
---1920x1920
-----picture.jpg
---800x800
-----picture.jpg
Does that mean I have to create 2 buckets named 1920x1920 and 800x800 or is there a better way of handling them?
Is AWS S3 a solution for this?
S3 is the recommended solution for this, and the configuration is documented in Heroku DevCentre with specfic instructions for uploading from Python.
Note these Python instructions use the Direct Upload approch: Have the flask app generate a pre-signed URL, which is then passed back to the client Javascript code, so that the user's browser can make the upload to S3 directly. The resulting S3 URL of the image, is then put into a hidden element in the form, which is then received by your app on form submit.
The fact that you have separate image sizes suggests your app does some processing (maybe with PIL) to get these thumbnails. In which case it may be easier to use the Pass-Through approach where your app implements its own upload mechanism, does the processing and then uploads the thumbnails to S3 (The upload to S3 part is well document, such as in this SO thread).
The Pass-Through method carries the warning that this may cause blocking of a single threaded worker. If your site gets a volume of requests that causes this to be an issue, you may need to increase the number of gunicorn workers, or change to a worker type that supports concurrency (This github post has some useful commands/info on concurrent worker types).
The best way to implement this whole thing (although the requirement for a redisgo dyno and worker dyno may push you into the paid teir) may be with Background Tasks using rq. You use the Direct-Upload approach above to upload the original image, then have a background job download that, do the resizing, and put the resulting thumbnails back onto S3.
Does that mean I have to create 2 buckets named 1920x1920 and 800x800 or is there a better way of handling them?
Have one Bucket for the entire app, and just include forward slashes in the object's key to mimic a subdirectory structure.

Update wowza StreamPublisher schedule via REST API (or alternative)

Just getting started with Wowza Streaming Engine.
Objective:
Set up a streaming server which live streams existing video (from S3) at a pre-defined schedule (think of a tv channel that linearly streams - you're unable to seek through).
Create a separate admin app that manages that schedule and updates the streaming app accordingly.
Accomplish this with as a little custom Java as possible.
Questions:
Is it possible to fetch / update streamingschedule.smil with the Wowza Streaming Engine REST API?
There are methods to retrieve and update specific SMIL files via the REST API, but they only seem to be applicable to those created through the manager. After all, streamingschedule.smil needs to be created manually by hand
Alternatively, is it possible to reference a streamingschedule.smil that exists on an S3 bucket? (In a similar way footage can be linked from S3 buckets with the use of the MediaCache module)
A comment here (search for '3a') seems to indicate it's possible, but there's a lot of noise in that thread.
What I've done:
Set up Wowza Streaming Engine 4.4.1 on EC2
Enabled REST API documentation
Created a separate S3 bucket and filled it with pre-recorded footage
Enabled MediaCache on the server which points to the above S3 bucket
Created a customised VOD edge application, with AppType set to Live and StreamType set to live in order to be able to point to the above (as suggested here)
Created a StreamPublisher module with a streamingschedule.smil file
The above all works and I have a working schedule with linearly streaming content pulled from an S3 bucket. Just need to be able to easily manipulate that schedule without having to manually edit the file via SSH.
So close! TIA
To answer your questions:
No. However, you can update it by creating an http provider and having it handle the modifications to that schedule. Should you want more flexibility here you can even extend the scheduler module to not require that file at all.
Yes. You would have to modify the ServerListenerStreamPublisher solution to accomplish it. Currently it solely looks a the local filesystem to read teh streamingschedule.smil file.
Thanks,
Matt

Bypassing SNS notifications in development for Elastic Transcoder

I have an Elastic Transcoder job that sends out SNS Notifications to an endpoint in my app when the job has finished. Obviously, it is non-trivial to be getting the SNS to be hitting endpoints on my local machine when in the development environment. I can't quite work out how to overcome this.
When a video is uploaded it create a database row with the mezzanine url and some empty columns for transcoded urls. It then creates an Elastic Transcoder job.
When the notification arrives at the endpoint from the job it adds the URLs of the transcoded files to the row of the database in which the url of the mezzanine file is located. I know what these URLs will be but don't want to add them to the database until the job is complete (so I can serve placeholders if the database column is NULL, or if indeed it fails).
It feels wrong to check for the ENV in the controller and just add the URLs to the database immediately if we're in development, so I wondered if there was a better way?
Typically, your development environment should be completely separate from production's moving parts -- including SNS topics. If you duplicate your whole stack into a development tier ("dev db", "dev sqs", "dev rails") and regularly copy data over to it, you can isolate your changes' impact on production data while making assurances about behavior.
You could also then use a simple K/V pair in your Rails config (environments.rb) to tell it which environment to live in.

Heroku ephemeral storage, Sendgrid, and attachments

On occasion I need to send emails with attachments to users of my site. I am using SendGrid and python-sendgrid 0.1.4 to do the send. Email sending is queued through Redis.
Here's the issue -- where do I put the attachment, which is currently generated as part of the web process? I tried putting it /tmp, which didn't work -- presumably because the file was deleted when the web process shut down and was no longer available when the worker process came by? I tried /app/media, which also didn't work -- I think because /app/media is read-only (though, oddly, I did not get any errors attempting to write to this directory)?
I think the answer may be that I have to refactor my code to generate the attachment in the same process as the email is sent, but as that is a pretty significant refactor, I thought I'd ask the community first. Thanks!
Heroku's /tmp directories are unique to each dyno. So your Web Dyno saves a file in its /tmp directory, then your worker looks in its /tmp directory and cannot find it.
The best option is likely refactoring your code (that way you aren't clogging up your Web Dyno's resources creating and writing files to disk). However, if you really want to avoid it, you could store your files temporarily on S3 [tutorial] or some other external storage mechanism.
You always need to use an external storage like for example S3, to store files that need to be available to every server instance/dyno.
Interesting to know is, if you don't want to store those attachements forever. You can attach a lifecycle event to your S3 bucket that will automatically delete a file if it's older then x days.