Optimal way to use AWS S3 for a backend application - amazon-web-services

In order to learn how to connect a backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to a traditional WYSIWYG editor. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server to upload the image to. In response, the server has to send the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone, for example, adds and removes the same file over and over again? Each time, there will be a new request to AWS.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronizing with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English, I am doing my best.
Thank you for help!

I think you should upload them to S3 as soon as they are available. This way you ensure their availability and protect them against a failure of your instance. S3 stores files across multiple availability zones (AZs), ensuring reliable long-term storage. An instance, on the other hand, operates within only one AZ, and if something happens to it, all your data on the instance is lost. So you could potentially lose an entire batch of images if you wait with the uploads.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, a good practice when developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time. This is achieved by not storing any user data on the instances. It enables you to auto-scale your application and makes it fault tolerant.
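As a rough sketch of what "upload as soon as it arrives" could look like, assuming a Python backend with boto3 (the question does not say which language the server uses): the helper names, the `images/` prefix and the content-hash key below are my own illustrative choices, not something prescribed by S3. Deriving the key from the file contents simply means that re-adding the same file overwrites one object instead of creating a new one each time.

```python
import hashlib
import os


def object_key_for(data: bytes, filename: str) -> str:
    """Build an S3 key from the SHA-256 of the content plus the original extension."""
    ext = os.path.splitext(filename)[1].lower()
    return f"images/{hashlib.sha256(data).hexdigest()}{ext}"


def upload_image(s3_client, bucket: str, data: bytes, filename: str) -> str:
    """Upload the image right away and return the URL to hand back to the editor."""
    key = object_key_for(data, filename)
    # put_object is the boto3 S3 client call for a single-request upload.
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)
    # Virtual-hosted-style URL; assumes the object is readable by the browser.
    return f"https://{bucket}.s3.amazonaws.com/{key}"
```

With boto3 installed, the upload endpoint would call something like `upload_image(boto3.client("s3"), "my-notepad-bucket", file_bytes, "photo.png")`, where the bucket name is a placeholder.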

Related

best practice for streaming images in S3 to clients through a server

I am trying to find the best practice for streaming images from S3 to a client's app.
I created a grid-like layout using Flutter on a mobile device (similar to Instagram). How can my client access all its images?
Here is my current setup: the client opens its profile screen (which contains the grid-like layout for all images, sorted by timestamp). This automatically requests all images from the server. My Python 3 backend server uses boto3 to access S3 and DynamoDB tables. The DynamoDB table has a list of all image paths the client uploaded, sorted by timestamp. Once I get the paths, I use them to download all images to my server first and then send them to the client.
Basically my server is the middleman, downloading the images and sending them back to the client. Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster, but I'm not sure if that is safe. Plus I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, without worrying too much about wasted server resources or unnecessary computation, and if you don't have scalability concerns.
However, if you're worried about scalability and latency, as well as secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use that to download all images to my server first and then send it to the client.
This is the first part I would try to get rid of, as you don't really need your backend to download these images and stream them itself. However, it still seems necessary to control access to resources based on who owns them. I would consider switching to the setup below to improve latency and spend fewer server resources:
Once you get the paths in your backend service, generate presigned URLs for the S3 objects, which will give your client temporary access to these resources (depending on your needs, you can adjust how long a URL stays valid).
Then, send these links to your client so that it can directly stream the URLs from S3, rather than your server becoming the middle man for this.
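A minimal sketch of those two steps, assuming a boto3 S3 client is passed in; the function name and the 15-minute default lifetime are placeholders I picked, while `generate_presigned_url` is the regular boto3 client method:

```python
def presigned_image_urls(s3_client, bucket, keys, expires_in=900):
    """Return one temporary GET URL per object key, preserving the input order."""
    return [
        s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_in,  # lifetime of each URL in seconds
        )
        for key in keys
    ]
```

The backend would look up the keys in DynamoDB as before, call `presigned_image_urls(boto3.client("s3"), "my-bucket", keys)` (bucket name hypothetical), and return the resulting list to the app as JSON, so the image bytes never pass through the server.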
Once you have this setup working, I would consider using Amazon CloudFront to improve access to your objects through the CDN capabilities that CloudFront gives you, especially if your clients are distributed across different geographical regions. As far as I can see, you can also make CloudFront work with presigned URLs.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs are your way of preventing uncontrolled access to your S3 objects. You probably need to think about edge cases though (e.g. how the clients should act when their access to an S3 object has expired, so that users won't notice it). All of these are costs of making something work at scale, if you have those scalability concerns.

How do I transfer images from public database to Google Cloud Bucket without downloading locally

I have a CSV file that has over 10,000 URLs pointing to images on the internet. I want to perform some machine learning task on them. I am using Google Cloud Platform infrastructure for this task. My first task is to transfer all these images from the URLs to a GCP bucket, so that I can access them later via Docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; instead I want to transfer them directly to the bucket. I have looked at Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed? Is this even a possible option?
If yes, how do I generate the MD5 hash that is mentioned here for each URL in my list, and also get the number of bytes of the image for each URL?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers may provide you with the MD5 of an object without requiring that you download it. Issuing an HTTP HEAD request may result in the server providing you with a Content-MD5 header in its response, which may not be in the form that Storage Transfer Service requires, but it can be converted into that form.
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
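Here is a hedged sketch of that check using only the Python standard library. It assumes the URL list wants tab-separated rows of URL, size in bytes and base64-encoded MD5 (please verify against the current Storage Transfer Service documentation), and all function names are made up for illustration. The hex conversion helper covers servers that expose the MD5 as a hex string rather than base64.

```python
import base64
import urllib.request


def head_headers(url: str) -> dict:
    """Issue an HTTP HEAD request and return the headers with lower-cased names."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def md5_hex_to_base64(hex_digest: str) -> str:
    """Convert a hex MD5 digest into its base64 form."""
    return base64.b64encode(bytes.fromhex(hex_digest)).decode("ascii")


def url_list_row(url: str, headers: dict):
    """Build 'URL<TAB>size<TAB>base64-MD5', or None if the server omits either value."""
    size = headers.get("content-length")
    md5 = headers.get("content-md5")
    if size is None or md5 is None:
        return None
    return f"{url}\t{int(size)}\t{md5.strip()}"
```

For each URL in the CSV you would call `url_list_row(url, head_headers(url))` and collect the URLs that come back as `None`, since those are the ones that need another approach.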
Another option worth considering is to set up one or more GCE instances and run a script from there to download the objects to your GCE instance and from there upload them into GCS. This still involves downloading them "locally," but locally no longer means a place off of Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1000 objects each in them, and setting up 10 GCE instances to do the work.
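Dividing up the work could be sketched like this, assuming the URL sits in the first column of the CSV (the column index and function names are assumptions on my part):

```python
import csv


def read_urls(lines, column=0):
    """Read URLs from an iterable of CSV lines, skipping empty rows."""
    return [row[column].strip() for row in csv.reader(lines) if row and row[column].strip()]


def split_into_shards(urls, shard_count):
    """Distribute URLs round-robin so that shard sizes differ by at most one."""
    return [urls[i::shard_count] for i in range(shard_count)]
```

Each shard can then be written to its own file and handed to one GCE instance, e.g. `split_into_shards(read_urls(open("urls.csv")), 10)` for the ten-instance setup mentioned above (the file name is a placeholder).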

What is the best way to transfer data from AWS SQS to S3?

Here is the case: I have a large dataset, temporarily retained in AWS SQS (around 200 GB).
My main goal is to store the data so I can access it for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket. And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there? Amazon has so many different solutions and ways of integration, so it is kind of confusing.
Thanks for your help!
for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket.
IMHO a good idea. Indeed, S3 is the best option to retain data and be able to reuse it (unlike SQS). AWS tools (SageMaker, ML) can directly use content stored in S3. Most machine learning frameworks can read files, and you can easily copy files from S3 or mount a bucket as a filesystem (not my favourite option, but possible).
And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
It depends on what data you have and how you want to store and process the data files.
If you plan to have a file for each SQS message, I'd suggest creating a Lambda function (assuming you can read and store the message reasonably fast).
If you want to aggregate and/or concatenate the source messages or processing a message would take too long, you may rather write a script to read and process the data on a server.
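For the one-file-per-message case, a Lambda function triggered by the queue might look roughly like the sketch below. It relies on the `Records`, `messageId` and `body` fields that Lambda passes for SQS events; the key prefix and the `BUCKET_NAME` environment variable are hypothetical names I chose.

```python
import os


def store_records(event, s3_client, bucket, prefix="sqs-messages/"):
    """Write each SQS record body to its own S3 object and return the keys written."""
    keys = []
    for record in event.get("Records", []):
        key = f"{prefix}{record['messageId']}"
        s3_client.put_object(Bucket=bucket, Key=key, Body=record["body"].encode("utf-8"))
        keys.append(key)
    return keys


def lambda_handler(event, context):
    """Entry point for an SQS-triggered Lambda; the bucket comes from the environment."""
    import boto3  # imported lazily; boto3 ships with the AWS Lambda Python runtime

    return store_records(event, boto3.client("s3"), os.environ["BUCKET_NAME"])
```

Keeping the S3 client as a parameter of `store_records` makes the same function reusable from a plain script on EC2 if you later decide to aggregate messages there instead.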
There is no way I can do it locally on my laptop, is it? So, do I create a ec2 instance and process the data there?
Well, in theory you can do it on your laptop, but it would mean downloading 200 GB and uploading 200 GB (not counting the overhead and latency).
Your intuition is IMHO good: having an EC2 instance in the same region would be the most feasible option, accessing all the data almost locally.
Amazon has so many different solutions and ways of integration so it is kinda confusing.
You have many options that suit different use cases and often overlap, so indeed it may look confusing.

Amazon EC2 scaling and upload temporary folder

I have a PHP-based application on one Amazon instance for uploading and transcoding audio files. The application first uploads the file, then transcodes it, and finally puts it in an S3 bucket. At the moment the application shows the progress of the upload and transcoding through repeated AJAX requests that monitor the file size in a temporary folder.
I keep wondering what happens if users rush to my service tomorrow and I need to scale it in whatever way AWS allows.
A: What will happen to my upload and transcoding technique?
B: If I add more instances does it mean I have different files on different temporary conversion folders in different physical places?
C: If I want to get the file size via AJAX from http://www.example.com/filesize until the process finishes, do I need the real address of each EC2 instance (I mean IP, DNS), or of all the instances' folders (or folder)?
D: When we scale what will happen for temporary folder is it correct that all of instances except their lamp stack locate to one root folder of main instance?
I have some basic knowledge of scaling with other hosting approaches, but with Amazon these questions are on my mind.
Thanks for advice.
It is difficult to answer your questions without knowing considerably more about your application architecture, but given that you're using temporary files, here's a guess:
A: Your ability to scale depends entirely on your architecture, and of course having a wallet deep enough to pay.
B: Yes. If you're generating temporary files on individual machines, they won't be stored in a shared place the way you currently describe it.
C: Yes. You need some way to know where the files are stored. You might be able to get around this with an ELB stickiness policy (i.e. traffic through the ELB gets routed to the same instances), but they are kind of a pain and won't necessarily solve your problem.
D: Not quite sure what the question is here.
As it sounds like you're in the early days of your application, give this tutorial and this tutorial a peek. The first one describes a thumbnailing service built on Amazon SQS, the second a video processing one. They'll help you design with best AWS practices in mind, and help you avoid many of the issues you're worried about now.
One way you could get around scaling and session stickiness is to have the transcoding update a database with the current progress. Any user returning checks the database to see the progress of their upload. No need to keep track of where the transcoding is taking place since the progress gets stored in a single place.
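To illustrate the idea (in Python rather than the PHP your application uses, and with SQLite standing in for whatever shared database all instances can reach), here is a minimal sketch; the table and function names are invented for the example:

```python
import sqlite3


def init_db(conn):
    """Create the progress table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcode_progress ("
        "job_id TEXT PRIMARY KEY, percent INTEGER NOT NULL)"
    )
    conn.commit()


def set_progress(conn, job_id, percent):
    """Called by whichever instance is transcoding; overwrites the previous value."""
    conn.execute(
        "INSERT OR REPLACE INTO transcode_progress (job_id, percent) VALUES (?, ?)",
        (job_id, percent),
    )
    conn.commit()


def get_progress(conn, job_id):
    """Called by the AJAX endpoint on any instance; returns None for unknown jobs."""
    row = conn.execute(
        "SELECT percent FROM transcode_progress WHERE job_id = ?", (job_id,)
    ).fetchone()
    return None if row is None else row[0]
```

In a real multi-instance deployment the connection would point at a shared database server rather than a local SQLite file, since a local file has the same problem as the temporary folder.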
However, like Christopher said, we don't really know anything about your application, so any advice we give is really looking from the outside in, and we don't have a good idea of what would be the easiest thing for you to do. This seems like a pretty simple solution, but I could be missing something because I don't know anything about your application or architecture.

Cloud hosting - shared storage with direct access

We have an application deployed on AWS using the EC2 and EBS services.
The infrastructure is split into layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
The application and database instances use a number of EBS block storage devices (root, data), which lets us attach and detach them and take EBS snapshots to S3. It's pretty much the default way AWS works.
But an EBS volume has to be located in a specific zone and can be attached to only one instance at a time.
The media server is one of the bottlenecks, so we'd like to scale it with a master/slave schema. For the media server storage we'd therefore like to try a distributed file system that can be attached to multiple servers. What do you advise?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
One thing to note: since GlusterFS is a layer above the other storage layers (particularly with Amazon), you might not get impressive disk performance. Actually it might be much worse than what you have now, for the sake of scalability... Basically you could be better off designing your application to serve streaming media from a CDN that already has the correct infrastructure for your type of application. It's something to think about.
HBase/Hadoop
Cassandra
MogileFS
A good, similar question (if I understand correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems; just find the one you need.
The ones above are just those I personally know of (I haven't tested them).