Best Practice in delivering processed image from server to client - amazon-web-services

I have a service that generates an image (1 MB to 10 MB) for each user request. Generation takes some computation, and I would like to deliver the generated image as quickly as possible (hopefully within seconds).
What would be the best option to consider to achieve this?
conditions
The image generation service is scalable, and image generation jobs are managed by a queue, so multiple service instances can be running and the client has no single destination to connect to directly.
Generated images are not reusable: every request with some input produces a different result image.
The images cannot be pre-generated, so we cannot store them on S3 in advance and serve them through CloudFront.
Server location: us-west-1; client location: South Korea.
some trials
I tried the scenarios below, but I still expect there is a better way to achieve this goal.
Upload the result file to S3 (public bucket) and give the client the key, so that the client can download it from S3 right after the file is uploaded.
Tested with and without S3 Transfer Acceleration.
Without acceleration, it is a bit slower than a direct socket transfer. Interestingly, with acceleration it is much faster than a socket transfer, even though the file was not a cache hit on a CloudFront edge server.
Run a separate WebSocket server so that it can emit the result image directly to clients.
Concerns:
To make this scalable, the WebSocket servers must scale along with the image generation services, which requires the client to know the exact destination from which to receive its result.
Network bandwidth limits on an individual EC2 instance.

Based on my experience, and from what I see of your scenario, the best option would be something like GlusterFS on EC2. I'm using it to deliver some Moodle data to end users, and the speed is definitely faster than using S3. You should give it a try.

Related

Optimal way to use AWS S3 for a backend application

In order to learn how to connect backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to traditional WYSIWYG. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server that receives the uploaded image. The server has to respond with the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone, for example, adds and removes the same file over and over again? Each time, there will be a new request to AWS.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronizing with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English, I do my best.
Thank you for help!
I think you should upload them to S3 as soon as they are available. This way you ensure their availability and their resistance to a failure of your instance. S3 stores files across multiple availability zones (AZs), ensuring reliable long-term storage. An instance, on the other hand, operates within only one AZ, and if something happens to it, the data stored on the instance can be lost. So you could potentially lose an entire batch of images if you delay the uploads.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. When you keep them in batches on an instance, depending on the image sizes, there may be a scenario where you simply run out of space.
Finally, a good practice when developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time, which is achieved by not storing any user data on them. This enables you to auto-scale your application and makes it fault tolerant.

best practice for streaming images in S3 to clients through a server

I am trying to find the best practice for streaming images from s3 to client's app.
I created a grid-like layout using Flutter on a mobile device (similar to Instagram). How can my client access all its images?
Here is my current setup: the client opens its profile screen (which contains the grid-like layout for all images, sorted by timestamp). This automatically requests all images from the server. My python3 backend server uses boto3 to access S3 and DynamoDB tables. The DynamoDB table has a list of all image paths the client uploaded, sorted by timestamp. Once I get the paths, I use them to download all images to my server first and then send them to the client.
Basically, my server is the middleman, downloading the images and sending them back to the client. Is this the right way of doing it? It seems that if the client accessed S3 directly it would be faster, but I'm not sure if that is safe. Plus, I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, without worrying too much about wasted server resources or unnecessary computation, and if you don't have scalability concerns.
However, if you're worrying about scalability and lower latency, as well as secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use that to download all images to my server first and then send it to the client.
This is the first part I would try to get rid of, as you don't really need your backend to download these images and stream them itself. However, it still seems necessary to control access to the resources based on who owns them. I would consider switching to the setup below to improve latency and spend fewer server resources:
Once you get the paths in your backend service, generate presigned URLs for the S3 objects, which give your client temporary access to these resources (depending on your needs, you can adjust how long a URL stays valid).
Then, send these links to your client so that it can stream the images directly from S3, rather than your server acting as the middleman.
Once you have this setup working, I would consider using Amazon CloudFront to improve access to your objects through its CDN capabilities, especially if your clients are distributed across different geographical regions. As far as I can see, CloudFront can also restrict access with signed URLs of its own.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs are your way of preventing uncontrolled access to your S3 objects. You probably need to think about edge cases, though (e.g. how the client should behave when its access to an S3 object has expired, so that users don't notice). All of these are costs of making something work at scale, if you have those scalability concerns.
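To make the presigned URL step more concrete, here is a minimal sketch in Python, since the question already uses boto3. `generate_presigned_url` is the real boto3 S3 client method; the function name `presign_image_urls`, the bucket name, and the keys are made up for illustration, and the 300-second expiry is just an assumed default.

```python
from typing import Iterable, List


def presign_image_urls(s3_client, bucket: str, keys: Iterable[str],
                       expires_in: int = 300) -> List[str]:
    """Return one temporary GET URL per S3 object key.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    expires_in is the URL lifetime in seconds (an assumed default, tune it).
    """
    return [
        s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_in,
        )
        for key in keys
    ]


# Hypothetical usage (bucket and keys are placeholders):
#   import boto3
#   urls = presign_image_urls(boto3.client("s3"), "my-image-bucket",
#                             ["user123/a.jpg", "user123/b.jpg"])
#   # send `urls` to the Flutter client instead of the image bytes
```

Signing happens locally with the server's credentials, so the backend only hands out short-lived links and never has to download the images itself.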

How do I transfer images from public database to Google Cloud Bucket without downloading locally

I have a CSV file with over 10,000 URLs pointing to images on the internet. I want to perform a machine learning task on them, and I am using Google Cloud Platform infrastructure for it. My first task is to transfer all these images from the URLs to a GCP bucket, so that I can access them later from Docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; instead I want to transfer them directly to the bucket. I have looked at Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed? Is this even a possible option?
If yes, how do I generate the MD5 hash that is mentioned here for each URL in my list, and also get the number of bytes of the image at each URL?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers may provide you with the MD5 of an object without requiring that you download it. Issuing an HTTP HEAD request may result in the server providing you with a Content-MD5 header in its response, which may not be in the form that Storage Transfer service requires, but it can be converted into that form.
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
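As a rough sketch of the HEAD approach, in Python using only the standard library: the function names are made up for illustration. As far as I know, a URL list is a tab-separated file starting with a `TsvHttpData-1.0` line, with the URL, the size in bytes, and the base64-encoded MD5 on each row, but check the Storage Transfer Service documentation before relying on that. The hex conversion is only needed for servers that expose the digest as hex instead of base64.

```python
import base64
import binascii
import urllib.request
from typing import Optional, Tuple


def head_size_and_md5(url: str, timeout: float = 10.0) -> Tuple[Optional[int], Optional[str]]:
    """Issue an HTTP HEAD request and return (Content-Length, Content-MD5).

    Either value is None when the server does not send that header.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
        md5 = response.headers.get("Content-MD5")
    return (int(length) if length is not None else None, md5)


def hex_md5_to_base64(hex_digest: str) -> str:
    """Convert a hex MD5 digest into the base64 form of the same 16 bytes."""
    return base64.b64encode(binascii.unhexlify(hex_digest)).decode("ascii")


def url_list_line(url: str, size: int, md5_base64: str) -> str:
    """Build one tab-separated row: URL, size in bytes, base64 MD5."""
    return "\t".join([url, str(size), md5_base64])
```

For URLs where `head_size_and_md5` returns `None` for either value, you would have to fall back to downloading the object to compute them.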
Another option worth considering is to set up one or more GCE instances and run a script from there to download the objects to your GCE instance and from there upload them into GCS. This still involves downloading them "locally," but locally no longer means a place off of Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1000 objects each in them, and setting up 10 GCE instances to do the work.

Uploaded images with malicious code in Amazon S3

I have a custom php web application where users will be able to upload images.
I know that there is a security concern with image files as a hacker can add malicious code to them and trigger them through the url of the image file.
So I'm no longer storing images on the web server and am uploading them directly to Amazon S3 instead. I was wondering whether a hacker could still achieve the same results with a malicious image even if the image files are stored in a completely separate place like Amazon S3.
If you upload files to S3, there is no need to worry about server-side exploits such as RCE, because S3 is object storage and will not execute the files. You still need to take care of client-side vulnerabilities such as XSS, though...
That is, even in your image upload case, an attacker cannot harm the server-side setup directly by exploiting unrestricted file upload, but they can embed a client-side script in the image and exploit that... As #dy10 mentioned, setting the proper Content-Type would help...
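One way to pick "the proper content-type" is to derive it from the file's leading bytes rather than trusting the name or type the user supplied. This is only a sketch in Python (the question's app is PHP, but the idea carries over); the function name and the set of accepted formats are my own assumptions, and a matching signature does not prove the rest of the file is harmless.

```python
from typing import Optional

# Leading "magic" bytes of the image formats this sketch accepts.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_image_content_type(data: bytes) -> Optional[str]:
    """Return the image MIME type implied by the leading bytes, or None.

    None means the upload should be rejected rather than stored.
    """
    for signature, content_type in _SIGNATURES:
        if data.startswith(signature):
            return content_type
    return None
```

The returned value can then be set explicitly as the object's Content-Type when uploading (for example, the `ContentType` argument of `put_object` in boto3), so that browsers treat the object as an image and not as HTML or script.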

Uploading various sized Images to AWS Cloudfront versus post processing

We are using AWS CloudFront to serve static content on our site, with an S3 bucket as the origin. As a next step, users will be able to upload images dynamically, which we want to push to the CDN. But we need them in different sizes so that we can use them later on the site. One option is to preprocess the images before pushing them to the S3 bucket, which ends up creating multiple images, one per size. Can we instead do post-processing, something like what http://imageprocessor.org/imageprocessor-web/ does, but still use CloudFront? Any feedback would be helpful.
Regards
Raghav
Well, yes, it is possible to do post-processing and use CloudFront but you need an intermediate layer between CloudFront and S3. I designed a system using the following high-level implementation:
Request arrives at CloudFront, which serves the image from cache if available; otherwise CloudFront sends the request to the origin server.
The origin server is not S3. The origin server is Varnish, on EC2.
Varnish sends the request to S3, where all the resized image results are stored. If S3 returns 200 OK, the image is returned to CloudFront and to the requesting browser and the process is complete. Since the Varnish machine runs in the same AWS region as the S3 bucket, the performance is essentially indistinguishable between CloudFront >> S3 and CloudFront >> Varnish >> S3.
Otherwise, Varnish is configured to retry the failed request by sending it to the resizer platform, which also runs in EC2.
The resizer examines the request to determine what image is being requested, and what size. In my application, the desired size is in the last few characters of the filename, so xxxxx_300_300_.jpg means 300 x 300. The resizer fetches the source image... resizes it... stores the result in S3... and returns the new image to Varnish, which returns it to CloudFront and to the requester. The resizer itself is Imagemagick wrapped in Mojolicious and uses a MySQL database to identify the source URI where the original image can be fetched.
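The check-the-store-first flow in the steps above can be sketched roughly as follows. This is not the actual system (which is Varnish plus Imagemagick wrapped in Mojolicious); it is an illustrative Python version in which a plain dict stands in for S3, `resize` is a hypothetical callable standing in for the resizer, and the regular expression assumes the `xxxxx_300_300_.jpg` naming scheme described above.

```python
import re
from typing import Callable, MutableMapping, Optional, Tuple

# Matches names such as "xxxxx_300_300_.jpg": base name, width, height, extension.
_SIZE_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<w>\d+)_(?P<h>\d+)_\.(?P<ext>\w+)$")


def parse_resize_request(filename: str) -> Optional[Tuple[str, int, int]]:
    """Return (base name, width, height), or None if the name has no size suffix."""
    match = _SIZE_SUFFIX.match(filename)
    if match is None:
        return None
    return match.group("base"), int(match.group("w")), int(match.group("h"))


def serve_image(filename: str,
                store: MutableMapping[str, bytes],
                resize: Callable[[str, int, int], bytes]) -> Optional[bytes]:
    """Serve from the backing store if present; otherwise resize, store, and serve."""
    cached = store.get(filename)
    if cached is not None:
        return cached  # the "S3 returns 200 OK" path
    parsed = parse_resize_request(filename)
    if parsed is None:
        return None  # not a resizable name; the caller would return an error
    base, width, height = parsed
    result = resize(base, width, height)
    store[filename] = result  # save the result so the work is not repeated
    return result
```

With this shape, the second request for the same name never reaches the resizer, regardless of which edge location forwarded it.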
Storing the results in a backing store, like S3, and checking there, first, on each request, is a critical part of this process, because CloudFront does not work like many people seem to assume. Check your assumptions against the following assertions:
CloudFront has 50+ edge locations. Requests are routed to the edge that is optimal for (usually, geographically close to) the viewer. The edge caches are all independent. If I request an object through CloudFront, and you request the same object, and our requests arrive at different edge locations, then neither of us benefits from the other's request: each edge has to fetch the object from the origin itself. If you are generating content on demand, you want to save your results to S3 so that you do not have to repeat the processing effort.
CloudFront honors your Cache-Control: header (or overridden values in configuration) for expiration purposes, but does not guarantee to retain objects in cache until they expire. Caches are volatile and CloudFront is no exception. For this reason, too, your results need to be stored in S3 to avoid duplicate processing.
This is a much more complex solution than pre-processing.
I have a pool of millions of images, a large percentage of which would have a very low probability of being viewed, and this is an appropriate solution, here. It was originally designed as a parallel solution to make up for deficiencies in a poorly-architected preprocessor that sometimes "forgot" to process everything correctly, but it worked so well that it is now the only service providing images.
However, if your motivation revolves around avoiding the storage cost of the preprocessed results, this solution won't entirely solve that.