How to add custom authentication to AWS S3 download of large files - amazon-web-services

I'm trying to figure out how to implement these requirements for S3 downloads:
Signed URL (links should become invalid after some amount of time).
Download only 1 time - any other requests to the same URL should fail.
Need to restrict downloads to the user/browser who made the request to generate the signed URL - no other user should be able to download.
Be able to deal with large files (ideally, streaming, just like when someone downloads directly from a standard S3 access point).
Things that I've tried:
S3 Object Lambda + Access Point
Generate a pre-signed URL to the Lambda access point - this works well.
Make use of S3 object metadata to store download state / restrict downloads to just 1 time. This works well.
No way to access the user-agent or the requestor's IP.
Large files are a problem. The timeout has been configured to 15 minutes (the max), but the request still times out much earlier. This was done with NodeJS.
Lambda + Lambda URL
A pre-signed URL is generated and passed to the Lambda URL as an encoded param - the Lambda makes the request if auth/validation passes. This approach seems to work fine.
Can use the same approach of leveraging S3 object metadata to limit downloads to just 1 time (a rough sketch of that check follows this list).
The user-agent and requestor IP are available, which is great.
Large files are a problem. I've tried NodeJS and it behaves the same as the S3 Object Lambda (it eventually times out, even earlier than the configured time). I also implemented the Java streaming handler, but it dies with an "out of memory" error, even when I bump the memory up to 3 GB (the file is only 1 GB, and I thought streaming would get around the memory problem anyway). I've tried several ways to stream (Java 11), but it really seems like the streaming handler is not actually streaming, but buffering somewhere outside of the Lambda.
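For reference, the metadata-based one-time check I mention above looks roughly like this (a Python/boto3 sketch purely for illustration - my actual handlers are NodeJS/Java, the bucket name and the "downloaded" metadata key are placeholders, and the self-copy is not atomic, so two concurrent requests could race):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-download-bucket"  # placeholder bucket name

    def allow_single_download(key: str) -> bool:
        """Return True on the first request for a key, False afterwards."""
        head = s3.head_object(Bucket=BUCKET, Key=key)
        metadata = head.get("Metadata", {})
        if metadata.get("downloaded") == "true":
            return False  # already downloaded once; reject this request

        # Mark the object as downloaded by copying it onto itself with new
        # user metadata. Note: not atomic, and REPLACE drops any other
        # user-defined metadata unless it is carried over (as done here).
        metadata["downloaded"] = "true"
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            Metadata=metadata,
            MetadataDirective="REPLACE",
        )
        return True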
I'm now unsure whether AWS Lambda will be able to handle all of these requirements, but I would really like to know if others have ideas, or if I'm missing something.

Related

Best way to stream or load audio files into S3 bucket (contact centre recordings)

What is the most reliable way to get our client to send audio files to our S3 bucket, where we will process them (ML processes that do speech-to-text insights)?
The files could be in .wav, .mp3, or other such audio formats. Also, some files may be large.
We would love to get ideas (e.g. API Gateway / Lambda / S3?) and to hear from anyone who may have done this before.
Some questions and answers to give context:
How do users interface with your system? We are looking for an API-based approach rather than a browser-based one. We can get a browser-based approach to work, but we are not sure that is the right technical, architectural, or scalable approach.
Do you require a bulk upload method? Yes, we would need bulk upload functionality, and some individual files may be large as well.
Will it be controlled by a human, or do you want it to upload automatically somehow? We certainly want it to upload automatically.
Ultimately, we are building a SaaS solution that will take the audio files and metadata, perform analytics on them, and deliver the results of our analysis through an API back to the app. So the approach we are looking for is something that will work within this context.
I have a similar scenario.
If you intend to use API Gateway/Lambda/S3, then you should know that there is a limit on the payload size that API Gateway and Lambda can accept. Specifically, API Gateway accepts payloads up to 10 MB and Lambda up to 6 MB.
There is a workaround for this issue, though. You can upload your files directly to an S3 bucket and attach a Lambda trigger on object creation.
I'll leave some articles that may point you in the right direction:
Uploading a file using presigned URLs:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/PresignedUrlUploadObject.html
Lambda trigger on s3 object creation: https://medium.com/analytics-vidhya/trigger-aws-lambda-function-to-store-audio-from-api-in-s3-bucket-b2bc191f23ec
A holistic view of the same issue: https://sookocheff.com/post/api/uploading-large-payloads-through-api-gateway/
Related GitHub issue:
https://github.com/serverless/examples/issues/106
So from my point of view, regarding uploading files, the best way would be to return a pre-signed URL and then have the client upload the file directly to S3. Otherwise, you'll have to implement uploading the file in chunks.
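To make that concrete, here is a rough sketch of the presigned-URL upload flow in Python/boto3 (the bucket name, key, and expiry are placeholders):

    import boto3
    import requests

    s3 = boto3.client("s3")

    # Server side: generate a presigned PUT URL the client can use directly.
    upload_url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "audio-uploads-bucket", "Key": "recordings/call-001.wav"},
        ExpiresIn=3600,  # URL valid for one hour
    )

    # Client side: upload with a plain HTTP PUT - no AWS credentials needed.
    with open("call-001.wav", "rb") as f:
        response = requests.put(upload_url, data=f)
    response.raise_for_status()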

Best practice for streaming images in S3 to clients through a server

I am trying to find the best practice for streaming images from S3 to a client's app.
I created a grid-like layout using Flutter on a mobile device (similar to Instagram). How can my client access all of its images?
Here is my current setup: the client opens its profile screen (which contains the grid-like layout of all images, sorted by timestamp). This automatically requests all images from the server. My Python 3 backend server uses boto3 to access S3 and DynamoDB tables. A DynamoDB table has a list of all the image paths the client uploaded, sorted by timestamp. Once I get the paths, I use them to download all images to my server first and then send them to the client.
Basically, my server is the middleman downloading and then sending the images back to the client. Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster, but I'm not sure if that is safe. Plus, I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, without worrying too much about wasted server resources and unnecessary computation, and if you don't have scalability concerns.
However, if you are worried about scalability and lower latency, as well as secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use them to download all images to my server first and then send them to the client.
This is the first part I would try to get rid of, as you don't really need your backend to download these images and stream them itself. However, you still need to control access to the resources based on who owns them. I would consider switching to the setup below to improve latency and spend fewer server resources:
Once you get the paths in your backend service, generate presigned URLs for the S3 objects, which will give your client temporary access to these resources (depending on your needs, you can adjust how long a URL remains valid).
Then send these links to your client so that it can stream directly from S3 via those URLs, rather than your server being the middleman.
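As a rough Python/boto3 sketch of that step (the bucket name, keys, and expiry are placeholders):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "user-images-bucket"  # placeholder

    def presign_image_urls(image_keys, expires_in=900):
        """Turn the object keys fetched from DynamoDB into short-lived GET URLs
        that the client can fetch straight from S3."""
        return [
            s3.generate_presigned_url(
                ClientMethod="get_object",
                Params={"Bucket": BUCKET, "Key": key},
                ExpiresIn=expires_in,  # seconds; adjust to your needs
            )
            for key in image_keys
        ]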
Once you have this setup working, I would consider using Amazon CloudFront to improve access to your objects through the CDN capabilities that CloudFront gives you, especially if your clients are distributed across different geographical regions. As far as I can see, you can also make CloudFront work with presigned URLs.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs are your way of mitigating uncontrolled access to your S3 objects. You probably need to handle some edge cases, though (e.g. how the client should act when its access to an S3 object has expired, so that users won't notice, etc.). All of these are the costs of making something work at scale, if you have those scalability concerns.

How to upload a large file using AWS API Gateway and S3 proxy

I have AWS API Gateway configured as a proxy for S3 to upload files to an S3 bucket. I have configured binary media to support multipart/form-data.
I am able to upload a file of size 10 MB or less without any issue. However, when the file size is more than 10 MB I get a 413 Request Entity Too Large error.
I know that API Gateway has a hard limit of 10 MB on the payload.
Questions
1> Shouldn't adding multipart/form-data solve the 10 MB limit issue? Do I need to configure anything else?
2> Another recommended approach is to create a pre-signed URL. I am assuming that for this approach to work, the client has to make a call to get the pre-signed URL and then use that URL to upload the file. Is this the only approach to upload a large file?
Note that I have gone through several SO posts regarding the same issue, but most of them are old and I am curious to see whether there are any new recommendations.
The 10 MB payload limit is hard and cannot be increased [1].
It seems to be possible to split the file into chunks on the client and then put it together on the server again [2] to circumvent the 10 MB limit, but I do not think this is a reasonable approach. The pre-signed URL approach seems better to me, if you do not use a client SDK which provides functionality for chunking.
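For instance, boto3's transfer manager handles the chunking for you when uploading straight to S3: above a threshold it switches to multipart upload automatically. A rough sketch (file name, bucket, and thresholds are placeholders):

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Above the threshold the SDK automatically switches to multipart upload,
    # splitting the file into chunks and reassembling it on the S3 side.
    config = TransferConfig(
        multipart_threshold=8 * 1024 * 1024,  # 8 MB
        multipart_chunksize=8 * 1024 * 1024,
        max_concurrency=4,
    )

    s3.upload_file(
        Filename="big-upload.bin",
        Bucket="my-upload-bucket",  # placeholder
        Key="uploads/big-upload.bin",
        Config=config,
    )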
Please note that if you ever decide to move away from S3, you can still implement the very same interface on any server yourself. In my opinion it is therefore the way to go.

Can you request an object from S3 without knowing its extension?

Say I have a bucket called uploads with two directories, both of which contain images.
The first directory, called catalog, has images with various extensions (.jpg, .png, etc.)
The second directory, called brands, has images with no extensions.
I can request uploads/catalog/some-image.jpg and uploads/brands/extensionless-image, and they both return an image as I expect.
We're already using a third-party service, imgix, which is just an image-processing CDN that links to the S3 bucket so that we can request, say, a smaller or cropped version of the image in the bucket.
Ideally, I'd like to keep the images and objects in their current formats in the bucket, but I would like the client side to be agnostic about which file it is requesting. In other words, I'd like to request some-image, and even though it may or may not actually have an extension in the bucket, I'd still like to somehow "intelligently guess" the image I'm requesting. We'll also assume that there are no collisions, i.e., there will never be both an image some-image.jpg and an image some-image sharing the same base name (our objects are named with a collision-less algorithm).
This is what I've tried:
Simply request images in one directory by their extension, and images in the other directory without an extension (however, even though the policy of requesting an image is the same, the mechanism has to be implemented in two different ways; I would like a single mechanism)
Another solution is to programmatically remove the extensions from all the images in catalog and re-sync the bucket
Anyone run into something similar before? Thoughts?
I suspect your best bet is going to be renaming the images. Not that there aren't other solutions, but because that is probably going to be the simplest and most straightforward approach.
First, S3 will not guess. The key on an S3 object is an opaque string from S3's perspective. The extension has no meaning, and even the slashes delimiting "directories" have no intrinsic meaning to S3. (Deleting a "directory" in S3 means sending a delete request for every individual object in the directory. The console creates a convenient illusion by doing this for you.)
S3 has redirect rules, but they only match and manipulate path prefixes, not suffixes, so no help there.
It would be possible, using a reverse proxy in front of S3, to inspect requests and, for any 404 or 403, have the proxy retry the request with alternate extensions until it found one that worked; it could potentially "learn" the right extension for use on subsequent requests. But then you'd have the added turn-around time and the additional cost of multiple requests.
I have developed systems whose job it is to "find" things requested over HTTP by trying multiple back-end URLs, without the requester being aware of the "hunting" going on in the background, and it can be very useful... but that is a much more complicated solution than you would probably want to consider, particularly in light of the fact that every millisecond counts when it comes to image loading.
There is no native solution for magic guessing with S3. You pretty much have to ask it for exactly what you want. Storage in S3 is cheap enough, of course, that you could probably duplicate your content, with and without extensions, without giving too much thought to the cost. If you used a Lambda event on the bucket, you could even automate the process of copying "kitten.jpg" to "kitten" each time "kitten.jpg" was modified.
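A minimal sketch of that Lambda idea, assuming a standard S3 "object created" notification on the bucket (Python; the guard against extensionless keys also stops the copies from re-triggering the function):

    import os
    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """On each new object, copy it to a key with the extension stripped,
        e.g. catalog/kitten.jpg -> catalog/kitten."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            base, ext = os.path.splitext(key)
            if not ext:
                continue  # already extensionless; also prevents recursive triggers
            s3.copy_object(
                Bucket=bucket,
                Key=base,
                CopySource={"Bucket": bucket, "Key": key},
            )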
If the content-type is set correctly in your object metadata, you should be fine regardless of extensions. If the content-type header is not set, you can set it, for example by using ImageMagick's identify to discover the image type and the AWS CLI to set it.
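If you would rather do that with a script than with ImageMagick plus the CLI, here is the same idea sketched in Python, using Pillow to detect the format (the bucket and key are placeholders):

    import io

    import boto3
    from PIL import Image  # Pillow

    s3 = boto3.client("s3")
    BUCKET = "uploads"  # placeholder

    def fix_content_type(key: str) -> None:
        """Detect the image format from the object's bytes and rewrite its
        Content-Type via a self-copy."""
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        image_format = Image.open(io.BytesIO(body)).format  # e.g. "JPEG", "PNG"
        content_type = Image.MIME.get(image_format, "application/octet-stream")
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            ContentType=content_type,
            MetadataDirective="REPLACE",  # required for the new Content-Type to take effect
        )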

Log delay in Amazon S3

I have recently started hosting on Amazon S3, and I need the log files to calculate statistics for the "get", "put", and "list" operations on the objects.
I've observed that the log files are organized strangely. I don't know when a log will appear (not immediately - at least 20 minutes after the operation) or how many lines of logs will be contained in one log file.
After that, I need to download these log files and analyse them. But I can't figure out how often I should do this.
Can somebody help? Thanks.
What you describe (log files being made available with delays and in unpredictable order) is exactly the behaviour AWS declares you should expect. This is due to the nature of the distributed system AWS uses to provide the S3 service; the same request may be served each time by a different server - I have seen 5 different IP addresses being used for publishing.
So the only solution is to accept the delay: measure the delay you experience, add some extra time, and learn to live with this total delay (I would expect something like 30 to 60 minutes, but statistics could tell you more).
If you need the log records ordered, you have to either sort them yourself or look for a log-processing solution - I have seen applications offered exactly for this purpose.
If you really need your log files with a very short delay, you have to produce the logs yourself. This means writing and running some frontend which gives access to your files on S3 and at the same time does the logging you need.
I run such a solution: users get a username, a password, and the URL of my frontend. When they send a request, I check whether they provided proper credentials and whether they are allowed to see the given resource; if so, I create a temporary URL for that resource, valid for a few minutes, and redirect the request to it.
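Schematically, such a frontend can be as small as this (a Python/Flask sketch purely for illustration; the bucket name and the credential check are placeholders):

    import boto3
    from flask import Flask, abort, redirect, request

    app = Flask(__name__)
    s3 = boto3.client("s3")
    BUCKET = "my-protected-files"  # placeholder

    def is_authorized(auth, key):
        # Placeholder for a real credential / permission check.
        return auth is not None and auth.username == "expected-user"

    @app.route("/files/<path:key>")
    def serve_file(key):
        if not is_authorized(request.authorization, key):
            abort(403)
        # Create a presigned URL valid for a few minutes and redirect to it,
        # so the actual download is served by S3, not by this frontend.
        url = s3.generate_presigned_url(
            ClientMethod="get_object",
            Params={"Bucket": BUCKET, "Key": key},
            ExpiresIn=300,
        )
        return redirect(url, code=302)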
But such a frontend costs money (you have to run it somewhere) and is less robust than accessing AWS S3 directly.
Good luck, Lulu.
A lot has changed since the question was originally posted. The delay is still there, but one of the OP's concerns was when to download the logs to analyze them.
One option right now would be to leverage Event Notifications: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/setup-event-notification-destination.html
This way, whenever an object is created in the access-logs bucket, you can trigger a notification to SNS, SQS, or Lambda, and based on that download and analyze the log files.
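If you point the notification at Lambda, the handler can pull each new log object as soon as it is delivered (a rough Python sketch; the analysis itself is a placeholder):

    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Triggered by the S3 event notification on the access-logs bucket."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            analyze(body.decode("utf-8").splitlines())

    def analyze(lines):
        # Placeholder analysis: count the operations the question asks about.
        ops = {"REST.GET.OBJECT": 0, "REST.PUT.OBJECT": 0, "REST.GET.BUCKET": 0}
        for line in lines:
            for op in ops:
                if op in line:
                    ops[op] += 1
        print(ops)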