How to correctly handle images with Django and CloudFiles? - django

In my particular case I'm using Rackspace CloudFiles with sorl-thumbnails. It seems to download images from CloudFiles slowly. I have 1 worker for handling requests and another one for celery tasks.
I looked for existing solutions, but there does not seem to be one at the moment.
Maybe I missed something? How should it be done the right way?

This isn't going to solve your problem, but there are some things to note/think about:
Remote object storage (Amazon's S3, Rackspace's CloudFiles) is going to be slower than local filesystem access. This depends on what you're doing of course and who's fetching the thumbnail. For users, downloading from a CDN is going to be faster than from a server. It may serve you well to do the thumbnail creation locally on an SSD backed server then upload to CloudFiles, distributing it over the CDN. Rackspace now has beefier SSD based instances with much greater IOPS.
The sorlery module takes great care to queue thumbnail creation with Celery (for use with remote object storage) and avoid filesystem access.
On another note, sorl-thumbnail hasn't seen development in over a year with LOTS of pull requests and issues sitting out on GitHub. Have you thought about using easy-thumbnails with django-cumulus?
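A minimal sketch of the "create the thumbnail locally, then upload" idea from the answer above. The `Storage` interface and the `make_thumbnail` callable are hypothetical stand-ins (for example a django-cumulus storage backend and a Pillow-based resizer); neither is an API of sorl-thumbnail or CloudFiles.

```python
from typing import Callable, Dict, Protocol


class Storage(Protocol):
    """Hypothetical minimal interface for a remote object store."""

    def save(self, name: str, data: bytes) -> None: ...


def store_with_thumbnail(
    name: str,
    data: bytes,
    storage: Storage,
    make_thumbnail: Callable[[bytes], bytes],
) -> Dict[str, str]:
    """Resize from the bytes already in memory, then upload both objects.

    The original is never read back from remote storage, which is the slow
    step when thumbnails are generated lazily on first request.
    """
    thumb = make_thumbnail(data)      # local CPU work only, no remote read
    original_key = f"originals/{name}"
    thumb_key = f"thumbs/{name}"
    storage.save(original_key, data)  # one upload per object
    storage.save(thumb_key, thumb)
    return {"original": original_key, "thumbnail": thumb_key}
```

In a real project the two `save` calls would probably run inside a Celery task so the request worker is not blocked by the uploads.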

Related

Optimal way to use AWS S3 for a backend application

In order to learn how to connect backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to traditional WYSIWYG. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server that handles the image upload. In its response, the server has to send the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone for example adds and removes the same file over and over again? Each time, there will be a new request to aws.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronizing with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for any mistakes in my English; I am doing my best.
Thank you for help!
I think you should upload them to S3 as soon as they are available. This way you ensure their availability and their resilience to a failure of your instance. S3 stores files across multiple availability zones (AZs), ensuring reliable long-term storage. On the other hand, an instance operates within only one AZ, and if something happens to it, all the data on the instance is lost. So you could potentially lose an entire batch of images if you delay the uploads.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, the good practice of developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.
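One possible way to upload immediately while keeping repeated add/remove cycles of the same image cheap is to derive the object key from the file contents, so the same bytes always land on the same key. This is a sketch under assumptions, not something the answer prescribes: the `uploads/` prefix and the function names are hypothetical, and `s3_client` is assumed to be a boto3 S3 client (only its `put_object` call is used).

```python
import hashlib
from typing import Any


def content_key(data: bytes, extension: str) -> str:
    """Build a deterministic key from the SHA-256 of the file contents."""
    digest = hashlib.sha256(data).hexdigest()
    return f"uploads/{digest}.{extension.lstrip('.')}"


def upload_image(
    s3_client: Any,
    bucket: str,
    data: bytes,
    extension: str,
    content_type: str,
) -> str:
    """Upload right away; identical bytes overwrite the same object.

    Assumes `s3_client` behaves like boto3.client("s3").
    """
    key = content_key(data, extension)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType=content_type,
    )
    return key
```

Each upload still costs one request, but re-adding the same image does not create a new object, and nothing needs to be kept on the instance.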

Migrating Django application from heroku (celery/redis) to aws fargate/lambda

Apologies in advance for my little knowledge of AWS
I'm trying to draw parallels between my current setup on Heroku to a move to AWS. I've run into some memory issues on Heroku because of some machine learning models I'm running and Heroku seems too expensive for my needs.
I was recommended to move to AWS using Fargate, which would be a better fit for my app. Below is my whole architecture; I'm hoping for some guidance on what I have and where I plan to go.
A django application running on heroku.
The core functionality is that the user uploads a video from their mobile device to S3. An SNS message is sent to my Heroku server when the upload is completed. The server kicks off a Celery task that downloads the video from S3 and uses a machine learning model to do some natural language processing, then saves the results to my PostgreSQL database. Obviously this is very compute intensive, so I've run into some memory issues and can foresee scaling issues to come.
After lots of tweaking and attempts to no avail, I've decided to move over to AWS and leverage some of the cost benefits that I've seen in comparison to heroku of running more memory intensive tasks.
I should also mention there is a web interface involved with this django project and it isn't just a REST Api.
As far as AWS goes, I'm looking for a bit of direction. Possibly just a rough outline of the architecture I should look deeper into.
My first plan is to dockerize my application and go from there...but I'm a bit stuck on how my application fits (website, rest api, worker threads) into the AWS ecosystem.
AWS is a great fit for the application you describe. AWS Fargate / RDS will host your Django application. You have the option of using AWS Batch to handle your processing. One huge advantage is the ability to scale according to the needs of your application.
This image is one possible way to structure your application. It's a lot of work to get to this point, but AWS offers a lot of power and flexibility for reasonable costs IMO.
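A rough sketch of how the existing "SNS message, then Celery task" step could map onto AWS Batch. The function names, the job name, and the environment variable names are hypothetical. `batch_client` is assumed to be a boto3 Batch client, and the job queue and job definition are assumed to already exist.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def uploaded_objects(sns_body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an SNS-delivered S3 event notification."""
    envelope = json.loads(sns_body)
    # SNS wraps the S3 event as a JSON string in the "Message" field.
    event = json.loads(envelope["Message"])
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def submit_processing_job(
    batch_client: Any,
    bucket: str,
    key: str,
    job_queue: str,
    job_definition: str,
) -> Dict[str, Any]:
    """Hand one video to a Batch job instead of a Celery worker.

    Assumes `batch_client` behaves like boto3.client("batch").
    """
    return batch_client.submit_job(
        jobName="process-video",  # hypothetical job name
        jobQueue=job_queue,
        jobDefinition=job_definition,
        containerOverrides={
            "environment": [
                {"name": "VIDEO_BUCKET", "value": bucket},
                {"name": "VIDEO_KEY", "value": key},
            ]
        },
    )
```

The container behind the job definition would then read the two environment variables, download the video, run the model, and write the results to the database, much as the Celery task does today.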

Sync data between EC2 instances

While I'm looking to move our servers to AWS, I'm trying to figure out how to sync data between our web nodes.
I would like to mount a disk on every web node and have a local cache of the entire share.
Are there any preferred ways to do this?
It sounds like you should consider storing your files in S3 in the first place and, if performance is key, running a sync job that pulls copies of the files to local disk on your EC2 instances. S3 is fast, durable and cheap, maybe even fast enough without keeping a local cache. If you do need a local copy, there are tools such as the AWS CLI and other third-party tools.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
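As a sketch of such a sync job, the snippet below shells out to `aws s3 sync` from Python. It assumes the AWS CLI is installed and has credentials on each web node. The helper names are hypothetical, and the bucket, prefix, and directory in the usage note are placeholders.

```python
import subprocess
from typing import List


def build_sync_command(
    bucket: str,
    prefix: str,
    local_dir: str,
    delete: bool = False,
) -> List[str]:
    """Build the `aws s3 sync` argument list for pulling a prefix to disk."""
    source = f"s3://{bucket}/{prefix.strip('/')}"
    command = ["aws", "s3", "sync", source, local_dir]
    if delete:
        # --delete removes local files that no longer exist in the bucket.
        command.append("--delete")
    return command


def sync_from_s3(
    bucket: str,
    prefix: str,
    local_dir: str,
    delete: bool = False,
) -> None:
    """Run the sync; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_sync_command(bucket, prefix, local_dir, delete), check=True)
```

For example, `sync_from_s3("my-bucket", "media", "/var/cache/media")` could be called from a cron job or a periodic task on every node.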
Depending on what you are trying to sync - take a look at
http://aws.amazon.com/elasticache/
It is an extremely fast and efficient method for sharing data.
One very easy solution is to install the Dropbox sync client on both machines and keep your files in Dropbox. This is by far the easiest approach.
With this approach, you can "load" data onto the machines by adding files to your Dropbox account from outside AWS, either from another machine or from the Dropbox browser interface.

What factors should be considered to move to Amazon Storage?

We have a Django application running on Webfaction. In this application, users upload a lot of images. So far, we have not had any issues. Soon, we expect about 10,000 users. I was wondering whether we should move to a cloud solution like S3. How would the move help us?
thanks
Some of the advantages of moving to a remote storage such as S3 are:
Central storage location: You don't need to worry about managing a shared NFS mount as you bring up new webservers to handle additional load.
Offloading requests: Your servers will not take on the load of serving the media.
Some disadvantages are:
Additional cost: You pay for the storage and the bandwidth.
More moving parts: A file system is fairly easy to understand, manage and test. Remote APIs aren't perfect and some of the problems are out of your control.
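For a sense of what the move involves on the Django side, here is a minimal settings sketch. It assumes the third-party django-storages package with its boto3 backend and Django 4.2 or later (older versions use `DEFAULT_FILE_STORAGE` instead of `STORAGES`). The bucket name and region are placeholders.

```python
# settings.py fragment (assumes django-storages and boto3 are installed)

STORAGES = {
    # Uploaded media goes to S3 instead of the local file system.
    "default": {
        "BACKEND": "storages.backends.s3boto3.S3Boto3Storage",
    },
    # Static files keep Django's default local storage in this sketch.
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}

AWS_STORAGE_BUCKET_NAME = "my-media-bucket"  # placeholder bucket name
AWS_S3_REGION_NAME = "us-east-1"             # placeholder region
```

With this in place, existing `ImageField` and `FileField` uploads go through the S3 backend without changes to the models.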

Cloud hosting - shared storage with direct access

We have an application deployed on AWS using the EC2 and EBS services.
The infrastructure is divided into layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
The application and database instances use a number of EBS block storage devices (root, data), which lets us attach and detach them and take EBS snapshots to S3. This is pretty much the default way AWS works.
But an EBS volume must be located in a specific zone and can be attached to only one instance at a time.
The media server is one of our bottlenecks, so we'd like to scale it with a master/slave scheme. For the media server storage we'd like to try a distributed file system that can be attached to multiple servers. What do you advise?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
One thing to note: since GlusterFS is a layer above the other storage layers (particularly with Amazon), you might not get impressive disk performance. It might actually be much worse than what you have now, for the sake of scalability. You could be better off designing your application to serve streaming media from a CDN that already has the right infrastructure for your type of application. It's something to think about.
HBase/Hadoop
Cassandra
MogileFS
A good question on the same topic (if I understand correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems; just find the one that fits your needs.
The ones listed above are only those I personally know of (I haven't tested them).