I'd like to be able to detect when all parts of an S3 multipart upload have been uploaded.
Context
I'm working on a Backend application that sits between a Frontend and an S3 bucket.
When the Frontend needs to upload a large file, it makes a call to the Backend (step 1). The latter initiates a multipart upload in S3, generates a set of presigned URLs, and hands them to the Frontend (steps 2 - 5). The Frontend then uploads the part data directly to S3 (steps 6, 10).
S3 multipart uploads need to be explicitly completed. The obvious way to do this would be another call from the Frontend to the Backend once all parts have been uploaded, but if possible I'd like to avoid that extra call.
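For reference, here is a minimal boto3 sketch of what the Backend does in steps 2 - 5, plus the explicit completion call (the bucket, key and part count below are placeholders, not my real values):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"            # placeholder
KEY = "uploads/big-file.bin"    # placeholder
PART_COUNT = 5                  # however many parts the Frontend will upload

# Step 2: initiate the multipart upload
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]

# Steps 3 - 5: presign one URL per part and hand them to the Frontend
urls = [
    s3.generate_presigned_url(
        "upload_part",
        Params={
            "Bucket": BUCKET,
            "Key": KEY,
            "UploadId": upload_id,
            "PartNumber": part_number,
        },
        ExpiresIn=3600,
    )
    for part_number in range(1, PART_COUNT + 1)
]

# Once all parts are uploaded, the upload has to be completed explicitly,
# passing the ETag of every part:
parts = s3.list_parts(Bucket=BUCKET, Key=KEY, UploadId=upload_id)["Parts"]
s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=KEY,
    UploadId=upload_id,
    MultipartUpload={
        "Parts": [{"PartNumber": p["PartNumber"], "ETag": p["ETag"]} for p in parts]
    },
)
```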
A possible solution: S3 Event Notifications
I have S3 Event Notifications enabled on the S3 bucket so whenever something happens, it notifies an SNS topic which in turn calls the Backend.
If the bucket sent a notification after each part finished uploading, I could use those in the Backend to decide when it's time to complete the upload (steps 7 - 9, 11 - 14).
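Whichever way the signal arrives, the "is it time to complete" check itself could be as simple as list_parts, assuming the Backend kept the upload ID and the expected part count from step 2 (hypothetical names below):

```python
import boto3

s3 = boto3.client("s3")

def all_parts_uploaded(bucket, key, upload_id, expected_parts):
    """Return True once every expected part of this multipart upload is present."""
    uploaded = s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id).get("Parts", [])
    return len(uploaded) >= expected_parts

# When this returns True, the Backend would call complete_multipart_upload
# as in the previous sketch.
```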
But although some folks claim (one, two) that it's the case, I wasn't able to reproduce it.
For proof of concept, I used this guide from Amazon to upload a file using aws s3api create-multipart-upload, several aws s3api upload-part calls, and aws s3api complete-multipart-upload. I expected to get a notification after each upload-part, but I only got a single "s3:ObjectCreated:CompleteMultipartUpload" after, well, complete-multipart-upload.
My bucket is configured to send notifications for all object creation events: "s3:ObjectCreated:*".
Questions
Is it possible to somehow instruct S3 to send notifications upon upload of each part?
Are there any other mechanisms to find out in the Backend that all parts have been uploaded?
Maybe what I want is complete nonsense and even if there was a way to implement it, it would bring significant drawbacks?
Related
I am trying to build a file upload and download app using AWS API Gateway, AWS Lambda, and S3 for storage.
AWS Lambda puts a cap of 6 MB on the payload size and API Gateway a limit of 10 MB.
Therefore we decided to use pre-signed URLs for uploading and downloading files.
Step 1 - Client sends the list of filenames (let's say 5 files) to Lambda.
Step 2 - Lambda creates and returns the list of pre-signed URLs (PUT) for those files (5 URLs).
Step 3 - Client uploads the files to S3 using the URLs it received.
Note - The filenames are the S3 object keys.
We take a similar approach for downloading files.
Now the issue is latency: the whole process takes quite a long time and performance suffers.
The question is: is the above approach the only way to do file upload and download with Lambda?
This looks like a case for S3 Transfer Acceleration. You'll still create pre-signed URLs, but you enable this setting on the bucket, which reduces latency.
https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
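Roughly, with boto3 this amounts to enabling acceleration once on the bucket and then presigning against the accelerate endpoint (the bucket and key names below are placeholders):

```python
import boto3
from botocore.config import Config

# One-time: enable Transfer Acceleration on the bucket (can also be done in the console)
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Presign against the accelerate endpoint so uploads go through the nearest edge location
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
url = s3_accel.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "user-file.bin"},
    ExpiresIn=3600,
)
```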
Alternatively, you can use CloudFront with an S3 origin to upload / download files. You might have to re-architect your solution, but with CloudFront and the AWS networking backbone, latency can be reduced a lot.
I can't find some information about Amazon S3 and hope you will help me. When is a file available for a user to download after a POST upload? I mean a small JSON file that doesn't require much processing. Is it available for download immediately after uploading? Or does Amazon S3 work in some kind of sessions so that it always takes a few hours?
According to the doc,
Amazon S3 provides strong read-after-write consistency for PUTs and DELETEs of objects in your Amazon S3 bucket in all AWS Regions.
This means that your objects are available for download immediately after they're uploaded.
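As a quick illustration (bucket and key names are placeholders, nothing special about the file), a PUT followed immediately by a GET returns the new object:

```python
import boto3

s3 = boto3.client("s3")

# Upload a small JSON file...
s3.put_object(Bucket="my-bucket", Key="data.json", Body=b'{"hello": "world"}')

# ...and read it back right away; no waiting period is needed
body = s3.get_object(Bucket="my-bucket", Key="data.json")["Body"].read()
print(body)  # b'{"hello": "world"}'
```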
An object that is uploaded to an Amazon S3 bucket is available right away. There is no time period that you have to wait. That means if you are writing a client app that uses these objects, you can access them as soon as they are uploaded.
In case anyone is wondering how to programmatically interact with objects located in an Amazon S3 bucket through code, here is an example of uploading and reading objects in an Amazon S3 bucket from a client web app....
Creating an example AWS photo analyzer application using the AWS SDK for Java
We provide a REST API to upload user files; on the backend we use an S3 bucket to store the uploads.
As our REST API has a timeout of 30 seconds, users may hit timeouts depending on file size and their network.
So we thought of providing S3 pre-signed URLs for uploads, through which the user can upload large files via AJAX calls from the frontend or via a scheduled backend script.
Everything looks OK, but we have no visibility into the pre-signed URLs we hand out:
- whether the user attempted the upload or not
- if attempted, whether the upload succeeded or not
- if it failed, what the error was (URL expired or something else)
We can detect the success case by looking for the object key in our bucket, but in the case of failures we have no clue.
Please let me know if there is any way to track access/uploads via S3 pre-signed URLs.
You will not know when a pre-signed URL is used, but here are a couple of other options:
You can configure an Amazon S3 Event to trigger when a new file is uploaded to a bucket. This could trigger an AWS Lambda function that can process the file or, at least, make a log that the file was uploaded (see the sketch after these options).
You can use Amazon S3 Server Access Logging to track access to the bucket.
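For the first option, a bare-bones Lambda handler along these lines (just a sketch; the print could be replaced by writing to a database) would record which keys were actually uploaded:

```python
import json
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    # Each record describes one object-created event from the bucket
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys in events are URL-encoded
        size = record["s3"]["object"].get("size")
        # Log (or persist) the fact that this pre-signed upload actually happened
        print(json.dumps({"bucket": bucket, "key": key, "size": size}))
    return {"statusCode": 200}
```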
I want to expose an API (preferably using AWS API Gateway / Lambda / Go) to the users.
Using this API, the users can download a binary file from S3 bucket.
I want to capture metrics such as which user started the download of a file, and the times at which the download started and finished.
I want to record these timestamps in DynamoDB.
S3 has support for events on creating/modifying/deleting objects, so I can write a Lambda function for those events.
But S3 doesn't seem to have event support for read actions (e.g. downloading a file).
I am thinking of writing a Lambda function that is invoked when the user calls the API to download the file. In the Lambda, I want to record the timestamp, read the file into a buffer, encode it, and send it as a base64-encoded response to the client.
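Roughly, the handler I have in mind would look like this (simplified sketch; the table name, bucket name and event fields are placeholders based on an API Gateway proxy integration, and the base64 body still counts against Lambda's response size limit):

```python
import base64
import time
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("download-metrics")  # hypothetical table

def lambda_handler(event, context):
    key = event["pathParameters"]["key"]                        # file requested by the user
    user = event["requestContext"].get("identity", {}).get("user")

    started_at = int(time.time() * 1000)
    body = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
    finished_at = int(time.time() * 1000)

    # Record who downloaded what and when. Note that finished_at only marks when
    # the Lambda finished reading from S3, not when the client finished receiving.
    table.put_item(Item={
        "key": key,
        "user": user or "anonymous",
        "started_at": started_at,
        "finished_at": finished_at,
    })

    return {
        "statusCode": 200,
        "isBase64Encoded": True,
        "headers": {"Content-Type": "application/octet-stream"},
        "body": base64.b64encode(body).decode("utf-8"),
    }
```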
Let me know if there is any better alternative approach.
Use Amazon S3 Server Access Logging.
Don't use DynamoDB; if you need to query the logs in the target bucket, set up Spectrum to query them, since they are also in S3.
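If you go that route, enabling access logging is a one-time bucket configuration; a boto3 sketch with placeholder bucket names (the log bucket must allow the S3 logging service to write to it):

```python
import boto3

s3 = boto3.client("s3")

# Deliver access logs for the data bucket into a separate log bucket
s3.put_bucket_logging(
    Bucket="my-data-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-log-bucket",
            "TargetPrefix": "s3-access-logs/",
        }
    },
)
```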
Maybe you can use S3 Access Logs?
And configure an event based on new records in the log bucket. However, these logs will not tell you whether the user has finished the download or not.
We are currently publishing data to an S3 bucket. We now have multiple clients consuming the data stored in that bucket. Each client wants their own bucket, so the ask is to publish the data to each client's bucket.
Option 1: Have our publisher publish to each S3 bucket.
Cons: more logic in our publishing application; we would have to handle failures/retries per client.
Option 2: Use S3's Cross-region replication
Reason against it: even though we can replicate objects to other accounts, only one destination can be specified. And if the source bucket has server-side encryption we cannot replicate.
Option 3: AWS Lambda. Have S3 invoke Lambda, and have Lambda publish to multiple buckets.
Confused: not sure how different this is from option 1.
Option 4: Restrict access to our S3 bucket to read-only and have clients read from it. But I'm wondering how clients can know whether an object has already been read! I'd rather not use time-based folders: we have multiple publishers to this S3 bucket and clients can't know for sure whether a folder is really complete.
Is there any good option to solve the above problem?
I would go with option 3, Lambda. Your Lambda function could be triggered by S3 events so you wouldn't have to add any manual steps or change your current publishing process at all.
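A minimal sketch of such a function (the client bucket names are placeholders) that copies every newly created object into each client's bucket:

```python
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

# Placeholder destination buckets; in practice this list could come from configuration
CLIENT_BUCKETS = ["client-a-bucket", "client-b-bucket"]

def lambda_handler(event, context):
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # event keys are URL-encoded
        for destination in CLIENT_BUCKETS:
            # Server-side copy: the object bytes never pass through the Lambda
            s3.copy_object(
                Bucket=destination,
                Key=key,
                CopySource={"Bucket": source_bucket, "Key": key},
            )
```

Per-client failures and retries would still need some thought, but at least that logic moves out of your publishing application.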