What are the downsides of using filesystem for storing uploaded files in django? - django

I know about s3 storage but I am trying to see if things can work out by only using filesystem

The main reason to use a service like S3 is scalability. Imagine that you use a the file system of a simple server to store files. Then it means that everyone that visits your site and wants to access a file, has to visit the same server. If there are enough visitors, then this will eventually render the system unresponsive.
Scalable storage services will store the same data on multiple servers to allow serving the content when the number of requests increases. Furthermore one normally hits a server that is close to the location of that user which minimizes the delay to fetch a file.
Finally such storage services are more reliable. If you use a single disk to store all the files, it is possible that eventually the disk fails losing all the data. By storing the data on multiple locations, it is less likely that the files are completely lost.

Related

Optimal way to use AWS S3 for a backend application

In order to learn how to connect backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to traditional WYSIWYG. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the api endpoint of the server to upload the image. The server in response have to send the url to the saved file. My server saves the data to s3 and returns the link.
But what if someone for example adds and removes the same file over and over again? Each time, there will be a new request to aws.
And here is the main part of the question, should I optimize it somehow in practice? I'm thinking of saving the files temporarily on my server first, and only doing a synchronization with aws from time to time. How this is done in practice? I would be very grateful if you could share with me any tips or resources that I may have missed.
I am sorry for possible mistakes in my English, i do my best.
Thank you for help!
I think you should upload them to S3 as soon as they are available. This way you are ensuring their availability and resistance to failure of you instance. S3 store files across multiple availability zones (AZs) ensuring reliable long-term storage. On the other hand, an instance operates only within one AZ and if something happens to it, all your data on the instance is lost. So potentially you can lost entire batch of images if you wait with the uploads.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, the good practice of developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.

best practice for streaming images in S3 to clients through a server

I am trying to find the best practice for streaming images from s3 to client's app.
I created a grid-like layout using flutter on a mobile device (similar to instagram). How can my client access all its images?
Here is my current setup: Client opens its profile screen (which contains the grid like layout for all images sorted by timestamp). This automatically requests all images from the server. My python3 backend server uses boto3 to access S3 and dynamodb tables. Dynamodb table has a list of all image paths client uploaded, sorted by timestamp. Once I get the paths, I use that to download all images to my server first and then send it to the client.
Basically my server is the middleman downloading the sending the images back to the client. Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe. Plus I don't know how I can give clients access to S3 without giving them aws credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, w/o worrying too much about waste of server resources, unnecessary computation, and if you don't have scalability concerns.
However, if you're worrying about scalability and lower latency, as well as secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use that to download all images to my server first and then send it to the client.
This part is the first part I would try to get rid of as you don't really need your backend to download these images, and stream them itself. However, it seems still necessary to control the access to resources based on who owns them. I would consider switching this to below setup to improve on latency, and spend less server resources to make this work:
Once I get the paths in your backend service, generate Presigned urls for s3 objects which will give your client temporary access to these resources (depending on your needs, you can adjust the time frame of how long you want a URL access to work).
Then, send these links to your client so that it can directly stream the URLs from S3, rather than your server becoming the middle man for this.
Once you have this setup working, I would try to consider using Amazon CloudFront to improve access to your objects though the CDN capabilities that CloudFront gives you, especially if your clients distributed in different geographical regions. AFA I can see, you can also make CloudFront work with presigned URLs.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs is your way of mitigating the uncontrolled access to your S3 objects. You probably need to worry about edge cases though (e.g. how the clients should act when their access to an S3 object has expired, so that users won't notice this, etc.). All of these are costs of making something working in scale, if you have that scalability concerns.

Uploading large files to server

The project I'm working on logs data on distributed devices that needs to be joined in a single database on a remote server.
The logs cannot be streamed as they are recorded (network may not be available etc) so they must be sent in bulky 0.5-1GB text based csv files occasionally.
As far as I understand this means having a web service receive the data in form of post requests is out of the question because of file sizes.
So far I've come up with this approach: Use some file transfer protocol (ftp or similar) to upload files from device to server. Devices would have to figure out a unique filename to do this with. Have the server periodically check for new files, process them by committing them to the database and deleting them afterwards.
It seems like a very naive way to go about it, but simple to implement.
However, I want to avoid any pitfalls before I implement any specifics. Is this approach scaleable (more devices, larger files)? Implementation will either be done using a private/company owned server or a cloud service (Azure for instance) - will it work for different platforms?
You could actually do this through web/http as well, after setting a higher value for post request in the web server (post_max_size andupload_max_filesize for PHP). This will allow devices to interact regardless of platform. Should't be too hard to make a POST request server from any device. A simple cURL request could get this job done.
FTP is also possible. Or SCP, to make it safer.
Either way, I think this does need some application on the server to be able to fetch and manage these files using a database. Perhaps a small web application? ;)
As for the unique name, you could use a combination of the device's unique ID/name along with current unix time. You could even hash this (md5/sh1) afterwards if you like.

Share data across Amazon Elastic Beanstalk nodes

I have a spring boot application which downloads around 300 MB of data at start up and saves it to a path /app/local/mydata. Currently, I have just one dev environment with a single node and it is not a problem. However, once I create a prod instance with (say) 10 nodes, it would be a waste of data bandwidth for each node to individually download the same 300 MB data. It will put a lot of stress on the service it is downloading the data from. And there is cost associated with data flowing in/out of EC2.
I can build a logic using a touchfile to make sure that only one box downloads the data and others just wait until the download is complete. However, I don't know where to download these data such that the other nodes can read it too.
Any suggestions?
Download it to S3 if you want to keep it in a file, but it sounds like you might need to put the data in a database (RDS) or maybe cache it in Redis (ElastiCache).
I'm not sure what a "touchfile" is but I assume you mean some sort of file lock mechanism. I don't see that as the best option for coordinating this across multiple servers. I would probably use a DynamoDB table with consistent reads and conditional writes as a distributed locking mechanism.
How often does the data you are downloading change? Perhaps you could just schedule a Lambda function to refresh the data periodically and update a database or something?
In general, you need to stop thinking about using the web server's local file system for this sort of thing.

Where to store user file uploads?

In my compojure app, where should I store user upload files? Do I just make a user-upload dir in my project root and stick everything in there? Is there anything special I should do (classpath, permissions, etc)?
To properly answer your question, you need to think of the lifecycle of the uploaded files. I would start answering questions such as:
how big are the files going to be?
what storage options will hold enough data to store all the uploads?
how about SLAs, redundancy and disaster avoidance?
how and who to monitor the free space and health of the storage?
In general, the file system location is much less relevant than the block device sitting behind it: as long as your data is stored safely enough for your application, user-upload can be anywhere and be anything from a regular disk to an S3 bucket e.g. via s3fs-fuse.
Putting such folder in your classpath sounds odd to me. It gives no essential benefit, as you will always need to go through a configuration entry to state where to store and read files from.
Permission wise, your application will require at least write access to the upload storage (most likely read access as well). Granting such permissions depends on the physical device you choose: if you opt for the local file system as you suggest in your question, you need to make sure the Clojure app is run by a user with chmod +rw, but in case of S3, you will need to configure API keys.
For anything other than a practice problem, I would suggest using a database such as Postgres or Datomic. This way, you get the reliability of a DB with real transactions, along with the ability to access the files across a network from any location.