My S3 Lambda Event listener is only seeing ObjectCreated:Put events when a file is uploaded via the S3 console. This is both for new files and overwriting existing files. Is this the expected behavior?
It seems like a new file upload should generate ObjectCreated:Post in keeping with the POST == Create, PUT == Update norm.
S3 has 4 APIs for object creation:
PUT is used for requests that send only the raw object bytes in the HTTP request body. It is the most common API used for creation of objects up to 5 GB in size.
POST uses specially-crafted HTML forms with attributes, authentication, and a file all as part of a multipart/form-data HTTP request body.
Copy is used when the source bytes come from an existing object in S3 (which incidentally also uses HTTP PUT on the wire, but is its own event type). The Copy API is also used any time you edit the metadata of an existing object: once stored in S3, objects and their metadata are completely immutable. The console allows you to "edit" metadata, but it accomplishes this by copying the object on top of itself while supplying revised metadata (which is a safe operation in S3, even when bucket versioning is not enabled, because the old object is untouched until the new object creation has succeeded). S3 does not support move or rename -- these are done with a copy followed by a delete. The maximum size of object that can be copied with the Copy API is 5 GB.
Multipart, which is mandatory for creating objects exceeding 5 GB and recommended for multi-megabyte objects. Multipart can be used for objects of any size, but each part (other than the last) must be at least 5 MiB, so it is not typically used for smaller uploads. This API also allows safe retrying of any parts that failed, supports uploading parts in parallel, and includes multiple integrity checks to prevent defects from appearing in the object that S3 reassembles. Multipart is also used to copy large objects.
The console communicates with S3 using the standard public APIs, the same as the SDKs use, and uses either PUT or multipart, depending on the object size, and Copy for editing object metadata, as mentioned above.
For best results, always use the s3:ObjectCreated:* event, unless you have a specific reason not to.
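If it helps, here is a minimal boto3 sketch of subscribing a Lambda function to that event; the bucket name and function ARN are placeholders, and the resource policy that lets S3 invoke the function is assumed to already be in place:

    # Sketch: subscribe a Lambda function to every object-creation event type.
    # 'my-bucket' and the function ARN below are placeholders.
    import boto3

    s3 = boto3.client('s3')
    s3.put_bucket_notification_configuration(
        Bucket='my-bucket',
        NotificationConfiguration={
            'LambdaFunctionConfigurations': [
                {
                    'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:my-handler',
                    # Catches Put, Post, Copy, and CompleteMultipartUpload.
                    'Events': ['s3:ObjectCreated:*'],
                }
            ]
        },
    )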
I would like to know how I could set default metadata for all future uploaded objects.
I am trying to set "Cache-Control: public,max-age=3600" as a header for each object in my bucket hosting a static website. For all the existing objects, I used the command from the guide to set the metadata, but I can't find a way to set it by default for future uploaded objects.
P.S. Developers are using the GCP console to upload the objects, and I recently realized that when they upload updated HTML files (which replace the ones in the bucket), the metadata resets.
According to the documentation, if an object does not have a Cache-Control entry, the default value when serving that object would be public,max-age=3600.
In case you still want to modify this metadata, you could do that using the JSON API inside a Cloud Function that would be triggered every time a new object is created or an existing one is overwritten.
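For illustration, a minimal sketch of such a function in Python, assuming a 1st-gen background-function signature, the google-cloud-storage client library, and a finalize trigger on the bucket (the function name and header value are placeholders):

    # Sketch: re-apply Cache-Control whenever an object is created or overwritten.
    # Assumes a 1st-gen background Cloud Function wired to the bucket's finalize event.
    from google.cloud import storage

    client = storage.Client()

    def set_cache_control(event, context):
        bucket = client.bucket(event['bucket'])
        blob = bucket.get_blob(event['name'])
        if blob is None:
            return  # object vanished before we could patch it
        blob.cache_control = 'public,max-age=3600'  # placeholder value
        blob.patch()  # updates metadata only; object data is untouched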
If I update an object in an S3 Bucket, and trigger on that S3 PUT event as my Lambda trigger, is there a chance that the Lambda could operate on the older version of that object given S3’s eventual consistency model?
I’m having a devil of a time parsing out an authoritative answer either way...
Yes, there is a possibility that a blind GET of an object could fetch a former version.
There are at least two solutions that come to mind.
Weak: the notification event data contains the etag of the newly-uploaded object. If the object you fetch doesn't have this same etag in its response headers, then you know it isn't the intended object.
Strong: enable versioning on the bucket. The event data then contains the object versionId. When you download the object from S3, specify this exact version in the request. The consistency model is not as well documented when you overwrite an object and then download it with a specific version-id, so it is possible that this might result in an occasional 404 -- in which case, you almost certainly just spared yourself from fetching the old object -- but you can at least be confident that S3 will never give you a version other than the one explicitly specified.
If you weren't already using versioning on the bucket, you'll want to consider whether to keep old versions around, or whether to create a lifecycle policy to purge them... but one brilliantly-engineered feature about versioning is that the parts of your code that were written without awareness of versioning should still function correctly with versioning enabled -- if you send non-versioning-aware requests to S3, it still does exactly the right thing... for example, if you delete an object without specifying a version-id and later try to GET the object without specifying a version-id, S3 will correctly respond with a 404, even though the "deleted" version is actually still in the bucket.
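Putting both approaches together, here is a minimal sketch of a Python Lambda handler, assuming boto3 and the standard S3 notification event structure:

    # Sketch: fetch the exact object the event described, or detect a stale read.
    import urllib.parse
    import boto3

    s3 = boto3.client('s3')

    def handler(event, context):
        record = event['Records'][0]
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        etag = record['s3']['object']['eTag']
        version_id = record['s3']['object'].get('versionId')  # only present with versioning enabled

        if version_id:
            # Strong: request exactly the version that triggered the event.
            response = s3.get_object(Bucket=bucket, Key=key, VersionId=version_id)
        else:
            # Weak: fetch, then verify the ETag matches the event; retry or bail if it doesn't.
            response = s3.get_object(Bucket=bucket, Key=key)
            if response['ETag'].strip('"') != etag:
                raise RuntimeError('fetched a different object version than the event described')

        body = response['Body'].read()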
How does the file get there in the first place? I'm asking because if you could reverse the order, it would solve your issue: you'd put your file into S3 via a Lambda that, before overwriting the file, first gets the existing version from the bucket and does whatever you need with it.
I've inherited a project at work. It's essentially a niche content repository, and we use S3 to store the content. The project was severely outdated, and I'm in the process of a thorough update.
For some unknown and undocumented reason, the content is stored in an AWS S3 bucket with the pattern web_cl_000000$DB_ID$CONTENT_NAME. So, one particular folder can be named web_cl_0000003458zyxwv. This makes no sense, and requires a bit of transformation logic to construct a URL to serve up the content!
I can write a Python script using the boto3 library to do an item-by-item rename, but would like to know if there's a faster way to do so. There are approximately 4M items in that bucket, which will take quite a long time.
That isn't possible, because the folders are an illusion derived from the strings between / delimiters in the object keys.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects. (emphasis added)
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
The console contributes to the illusion by allowing you to "create" a folder, but all that actually does is create a 0-byte object with / as its last character, which the console will display as a folder whether there are other objects with that prefix or not, making it easier to upload objects manually with some organization.
But any tool or technique that allows renaming folders in S3 will in fact be making a copy of each object with the modified name, then deleting the old object, because S3 does not actually support rename or move, either -- objects in S3, including their key and metadata, are actually immutable. Any "change" is handled at the API level with a copy/overwrite or copy-then-delete.
Worth noting, S3 should be able to easily sustain 100 such requests per second, so with asynchronous requests or multi-threaded code, or even several processes each handling a shard of the keyspace, you should be able to do the whole thing in a few hours.
Note also that the less sorted (more random) the new keys are in the requests, the harder you can push S3 during a mass-write operation like this. Sending the requests so that the new keys are in lexical order will be the most likely scenario in which you might see 503 Slow Down errors... in which case, you just back off and retry... but if the new keys are not ordered, S3 can more easily accommodate a large number of requests.
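For what it's worth, a rough Python/boto3 sketch of that copy-then-delete loop pushed through a thread pool might look like this; the key-transformation rule and bucket name are hypothetical placeholders you'd adapt to your real naming scheme:

    # Sketch: "rename" every object by copying it to a new key, then deleting the old key.
    # BUCKET and new_key_for() are hypothetical placeholders.
    import re
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client('s3')
    BUCKET = 'my-content-bucket'

    def new_key_for(old_key):
        # Hypothetical transform: strip the "web_cl_0000003458"-style prefix.
        return re.sub(r'^web_cl_0*\d+', '', old_key)

    def rename(old_key):
        new_key = new_key_for(old_key)
        if not new_key or new_key == old_key:
            return
        # Note: copy_object handles objects up to 5 GB; larger ones need a multipart copy.
        s3.copy_object(Bucket=BUCKET, Key=new_key,
                       CopySource={'Bucket': BUCKET, 'Key': old_key})
        s3.delete_object(Bucket=BUCKET, Key=old_key)

    paginator = s3.get_paginator('list_objects_v2')
    with ThreadPoolExecutor(max_workers=32) as pool:
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get('Contents', []):
                pool.submit(rename, obj['Key'])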
We need to move our video file storage to AWS S3. The old location is a CDN, so I only have a URL for each file (1000+ files, > 1 TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads the file, uploads it to an S3 bucket, and updates the DB records with the new HTTP URL, and it works perfectly except that it takes forever.
Downloading the file takes some time (considering each file is close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video file directly from the CDN to S3, so I could cut the processing time in half? Something like reading a chunk of the file and then putting it to S3 while reading the next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed; I run the app on a server with a 1 Gbit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream"; you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this with threads or async I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you could download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
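As one illustration of the streaming variant (not the parallel ranged-GET variant), here is a minimal Python sketch using requests and boto3's managed uploader, which performs a multipart upload under the hood; the URL, bucket, and key are placeholders:

    # Sketch: stream the remote file straight into S3 without staging it on disk.
    import boto3
    import requests

    s3 = boto3.client('s3')

    def transfer(url, bucket, key):
        with requests.get(url, stream=True, timeout=60) as response:
            response.raise_for_status()
            response.raw.decode_content = True  # transparently handle gzip/deflate
            s3.upload_fileobj(response.raw, bucket, key)

    # Placeholder URL/bucket/key:
    transfer('https://cdn.example.com/video1.mp4', 'my-video-bucket', 'videos/video1.mp4')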
This has been answered by me in this question; here's the gist:
require 'aws-sdk-s3'
require 'open-uri'   # provides URI.open for streaming the remote file

object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is not a 'direct' pull into S3, though. At least it doesn't download each file and then upload it serially; it streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
If a proxy (Node/Express) is suitable for you, then the portions of code at these two routes could be combined to do a GET-then-POST fetch chain, retrieving and then re-posting the response body to your destination S3 bucket.
Step one creates response.body.
Step two: set the stream in the second link to the response from the GET operation in the first link, and you will upload the stream (arrayBuffer) from the first fetch to the destination bucket.
Is it possible to have growing files on Amazon S3?
That is, can I upload a file whose final size I don't know when the upload starts, so that I can keep writing more data to it at a specified offset?
For example, write 1000 bytes in one go, and then in the next call continue writing to the file at offset 1001, so that the next byte written is the 1001st byte of the file.
Amazon S3 indeed allows you to do that by Uploading Objects Using Multipart Upload API:
Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. [...]
One of the listed advantages precisely addresses your use case, namely to Begin an upload before you know the final object size - You can upload an object as you are creating it.
This functionality is available via the REST API for Multipart Upload, and all AWS SDKs, as well as third-party libraries like boto (a Python package that provides interfaces to Amazon Web Services), offer multipart upload support based on this API.
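A minimal Python/boto3 sketch of that pattern, appending parts as the data becomes available, might look like the following; the bucket, key, and source of the chunks are placeholders, and remember that every part except the last must be at least 5 MiB:

    # Sketch: grow an object by uploading sequential parts as data becomes available.
    import boto3

    s3 = boto3.client('s3')
    bucket, key = 'my-bucket', 'growing-file.bin'  # placeholders

    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []

    def append_part(data):
        # Every part except the final one must be at least 5 MiB.
        part_number = len(parts) + 1
        result = s3.upload_part(Bucket=bucket, Key=key,
                                UploadId=upload['UploadId'],
                                PartNumber=part_number, Body=data)
        parts.append({'PartNumber': part_number, 'ETag': result['ETag']})

    # ... call append_part(chunk) each time another chunk is ready ...

    s3.complete_multipart_upload(Bucket=bucket, Key=key,
                                 UploadId=upload['UploadId'],
                                 MultipartUpload={'Parts': parts})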