AWS PutObjectResponse.ETag Mismatch

I am attempting to verify an S3 PutObject request using a local hash implementation, as follows:
// Requires: using System.Security.Cryptography; using System.Text;
MD5 md5 = MD5.Create();
byte[] inputBytes = Encoding.ASCII.GetBytes(str);
byte[] hash = md5.ComputeHash(inputBytes);
// Convert the raw digest to a lowercase hex string, the same format as the S3 ETag
StringBuilder sb = new StringBuilder();
foreach (byte byt in hash)
{
    sb.Append(byt.ToString("X2"));
}
return sb.ToString().ToLower();
From the AWS documentation, this should match the PutObjectResponse.ETag property, which is based on the content body (not the metadata) of my PUT request. In this case I am posting a JSON document, which is the source of my hash.
All is fine except when I use AWS managed KMS server-side encryption, at which point my hashes do not match. Is it not possible to verify that the content body matches what was posted, given that the ETag hash appears to be based on the encrypted content body rather than the original PUT content?

It is not possible to verify the integrity of the content/body using the ETag, because, as documented, the ETag no longer matches the expected value when using SSE with KMS keys.
However, you can prevent S3 from accepting an incomplete or corrupted upload by sending a Content-MD5 header with the upload. The value you supply is the binary (not hex) MD5 digest of the payload, encoded in base64 and not URL-escaped. If the payload doesn't match this hash, S3 won't accept the upload or store the object, and will return an error.
The SDK you are using might do this for you automatically, or you may be able to enable it; I don't know, since I use my own written-from-scratch libraries for all AWS service interactions.
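To illustrate, here is a minimal sketch of supplying Content-MD5 with the AWS SDK for JavaScript (v2); the SDK exposes the header as the ContentMD5 parameter, and the bucket and key names below are placeholders:
// Minimal sketch: upload with Content-MD5 so S3 rejects a corrupted payload.
// Assumes the AWS SDK for JavaScript v2; bucket and key are placeholders.
const AWS = require('aws-sdk');
const crypto = require('crypto');
const s3 = new AWS.S3();
const body = JSON.stringify({ hello: 'world' });
// base64 of the binary MD5 digest, as required by the Content-MD5 header
const contentMD5 = crypto.createHash('md5').update(body).digest('base64');
s3.putObject({
  Bucket: 'my-bucket',    // placeholder
  Key: 'my-key.json',     // placeholder
  Body: body,
  ContentMD5: contentMD5
}, function (err, data) {
  if (err) console.error('Upload rejected or failed:', err);
  else console.log('Uploaded, ETag:', data.ETag);
});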

How to create a s3 object download link once file uploaded to bucket? [duplicate]

I'm using an AWS Lambda function to create a file and save it to my bucket on S3, and it is working fine. After executing the putObject method, I get a data object, but it only contains the ETag of the recently added object.
s3.putObject(params, function(err, data) {
// data only contains Etag
});
I need to know the exact URL that I can use in a browser so the client can see the file. The folder has already been made public, and I can see the file if I copy the link from the S3 console.
I tried using getSignedUrl but the URL it returns is used for other purposes, I believe.
Thanks!
The SDKs do not generally contain a convenience method to create a URL for publicly-readable objects. However, when you called PutObject, you provided the bucket and the object's key and that's all you need. You can simply combine those to make the URL of the object, for example:
https://bucket.s3.amazonaws.com/key
So, for example, if your bucket is pablo and the object key is dogs/toto.png, use:
https://pablo.s3.amazonaws.com/dogs/toto.png
Note that S3 keys do not begin with a / prefix. A key is of the form dogs/toto.png, and not /dogs/toto.png.
For region-specific buckets, see Working with Amazon S3 Buckets and AWS S3 URL Styles. Replace s3 with s3.<region> or s3-<region> in the above URLs, for example:
https://seamus.s3.eu-west-1.amazonaws.com/dogs/setter.png (with dot)
https://seamus.s3-eu-west-1.amazonaws.com/dogs/setter.png (with dash)
If you are using IPv6, then the general URL form will be:
https://BUCKET.s3.dualstack.REGION.amazonaws.com
For some buckets, you may use the older path-style URLs. Path-style URLs are deprecated and only work with buckets created on or before September 30, 2020. They are used like this:
https://s3.amazonaws.com/bucket/key
https://s3.amazonaws.com/pablo/dogs/toto.png
https://s3.eu-west-1.amazonaws.com/seamus/dogs/setter.png
https://s3.dualstack.REGION.amazonaws.com/BUCKET
Currently there are TLS and SSL certificate issues that may require some buckets with dots (.) in their name to be accessed via path-style URLs. AWS plans to address this. See the AWS announcement.
Note: certain characters in object keys need special handling. For example, a space is encoded as + (plus sign) and a plus sign is encoded as %2B.
If you have the S3 bucket object and the file name and want to derive the URL, here is an option:
// Builds a virtual-hosted-style URL from an SDK bucket object and a key.
// Note: this produces the legacy dash-style regional endpoint (s3-<region>).
function getUrlFromBucket(s3Bucket, fileName) {
  const { config: { params, region } } = s3Bucket;
  const regionString = region.includes('us-east-1') ? '' : ('-' + region);
  return `https://${params.Bucket}.s3${regionString}.amazonaws.com/${fileName}`;
}
You can also make another call to generate a pre-signed URL for the object:
var params = {Bucket: 'bucket', Key: 'key'};
// Use 'getObject' for a URL that views/downloads the object ('putObject' would produce an upload URL)
s3.getSignedUrl('getObject', params, function (err, url) {
  console.log('The URL is', url);
});

AWS service to verify data integrity of file in S3 via checksum?

One method of ensuring a file in S3 is what it claims to be is to download it, get its checksum, and match the result against the checksum you were expecting.
Does AWS provide any service that allows this to happen without the user needing to first download the file? (i.e. ideally a simple request/url that provides the checksum of an S3 file, so that it can be verified before the file is downloaded)
What I've tried so far
I can think of a DIY solution along the lines of
Create an API endpoint that accepts a POST request with the S3 file url
Have the API run a lambda that generates the checksum of the file
Respond with the checksum value
This may work, but is already a little complicated and would have further considerations, e.g. large files may take a long time to generate a checksum (e.g. > 60 seconds)
I'm hoping AWS have some simple way of validating S3 files?
There is an ETag stored against each object, which is an MD5 digest of the object contents.
However, there are some exceptions.
From Common Response Headers - Amazon Simple Storage Service:
ETag: The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Also, the calculation of an ETag for a multi-part upload can be complex. See: s3cmd - What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB? - Stack Overflow
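To read the ETag without downloading the object, you can issue a HEAD request. As a rough sketch with the AWS SDK for JavaScript v2 (the bucket and key names are placeholders):
// Sketch: fetch only the object's metadata, including the ETag, via a HEAD request.
// Assumes the AWS SDK for JavaScript v2; bucket and key are placeholders.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
s3.headObject({ Bucket: 'my-bucket', Key: 'my-key' }, function (err, data) {
  if (err) return console.error(err);
  // data.ETag is returned quoted, e.g. '"<32-character hex MD5>"'
  console.log('ETag:', data.ETag);
});
Remember the exceptions quoted above: for SSE-C, SSE-KMS, and multipart uploads the ETag is not a plain MD5 of the object data.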

AWS S3 Upload Integrity

I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
For multipart uploads, the Content-MD5 header can be used in the same way, but for each individual part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenation of the binary (raw-byte) MD5 hashes of each part, followed by a hyphen and the number of parts.
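As a rough illustration of that formula (not an official algorithm, and assuming your local part boundaries match the sizes used for the upload), the multipart ETag could be reproduced locally like this:
// Sketch: reproduce the multipart ETag from the part buffers, in upload order.
// Assumes Node.js; parts is an array of Buffers split at the same boundaries as the upload.
const crypto = require('crypto');
function multipartETag(parts) {
  // MD5 each part, keeping the raw binary digests
  const partDigests = parts.map(p => crypto.createHash('md5').update(p).digest());
  // MD5 the concatenated digests, hex-encode, then append "-<number of parts>"
  const combined = crypto.createHash('md5').update(Buffer.concat(partDigests)).digest('hex');
  return `${combined}-${parts.length}`;
}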
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you when you uploaded the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 about the parts you've uploaded, and some lazy developers will just ask S3 for this list and then send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must have a request header, x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the SHA-256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz), I store the SHA of both the compressed and uncompressed data in the metadata, along with the compressed and uncompressed sizes in octets.
Note that Content-MD5 and x-amz-content-sha256 are request headers; they are not returned with downloads. If you want that information available later, you need to save it yourself, for example in the object metadata as described above.
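As a small sketch of that idea (the metadata key name below is made up; S3 exposes user-defined metadata under the x-amz-meta- prefix):
// Sketch: store the payload's SHA-256 as user-defined metadata at upload time,
// so it can be read back later with headObject or getObject.
// Assumes the AWS SDK for JavaScript v2; bucket, key, and the 'content-sha256' metadata key are made up.
const AWS = require('aws-sdk');
const crypto = require('crypto');
const s3 = new AWS.S3();
const body = Buffer.from('...file contents...');
const sha256Hex = crypto.createHash('sha256').update(body).digest('hex');
s3.putObject({
  Bucket: 'my-bucket',                       // placeholder
  Key: 'backups/file.tar.gz',                // placeholder
  Body: body,
  Metadata: { 'content-sha256': sha256Hex }  // surfaces as x-amz-meta-content-sha256
}, function (err, data) {
  if (err) console.error(err);
  else console.log('Stored with ETag', data.ETag);
});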
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer, whether you reach S3 from an instance with a public IPv4 or IPv6 address, through a NAT instance, through an S3 VPC endpoint, or through an IPv6 egress gateway. (You will be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway.) Obviously there are ways to automate this, but manually: select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can compute an MD5 checksum locally and then verify it against the MD5 checksum (ETag) of the object in S3 to ensure data integrity.

S3 URL encoding when generating Signed URL

I have a system where after a file is uploaded to S3, a Lambda job raises a queue message and I use it to maintain a list of keys in a MySQL table.
I am trying to generate a pre-signed URL based on the records in my table.
I currently have two records:
/41jQnjTkg/thumbnail.jpg
/41jQnjTkg/Artist+-+Song.mp3
I am generating the pre-signed URL using:
var params = {
Bucket: bucket,
Expires: Settings.UrlGetTimeout,
Key: record
};
S3.getSignedUrl('getObject', params);
The URL for thumbnail.jpg works perfectly fine, but the one with +-+ fails. The original file name on local disk was "Artist - Song.mp3"; S3 replaced the spaces with '+'. Now, when I generate a URL using the exact same key that S3 uses, it doesn't work; I get a "Specified Key doesn't exist" error from S3.
What must I do to generate URLs consistently for all filenames?
I solved this after a little experimentation.
Instead of directly storing the key that S3 provides in its S3 event message, I first replace the '+' character with a space (as the names originally appear on disk) and then URL-decode it:
// Turn the form-urlencoded key from the event back into the original file name
return decodeURIComponent(str.replace(/\+/g, " "));
Now generating a S3 Pre-Signed URL works as expected.
Before, MySQL had the following records:
/41jQnjTkg/thumbnail.jpg
/41jQnjTkg/Artist+-+Song.mp3
Now:
/41jQnjTkg/thumbnail.jpg
/41jQnjTkg/Artist - Song.mp3
I personally feel there is an inconsistency in S3's API/event messages.
Had I generated a signed URL directly using the key that S3 itself provided in the SQS event message, it wouldn't have worked. One must do this string replacement and URL decoding on the key in order to get a proper working URL.
Not sure if this is by design or a bug.
The second file's name is coming to you form-urlencoded. The + is actually a space, and if you had other characters (like parentheses) they would be percent-escaped. You need to run your data through a URL decoder before working with it further.
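For example, here is a rough sketch of decoding the key taken from an S3 event notification record before storing it or passing it to getSignedUrl (the event shape shown is the standard S3 notification format):
// Sketch: decode the form-urlencoded key from an S3 event notification record.
function decodeS3Key(rawKey) {
  // '+' represents a space in form-urlencoding; other characters are percent-escaped
  return decodeURIComponent(rawKey.replace(/\+/g, ' '));
}
// Usage inside a Lambda handler:
// const rawKey = event.Records[0].s3.object.key;  // e.g. "41jQnjTkg/Artist+-+Song.mp3"
// const key = decodeS3Key(rawKey);                // "41jQnjTkg/Artist - Song.mp3"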
Side-note: if the only thing your Lambda function does is create an SQS message, you can do that directly from S3 without writing your own function.

Data integrity check during upload to S3 with server side encryption

Data integrity checking is something the AWS Java SDK claims to provide by default: either the client calculates the object checksum itself and adds it as the "Headers.CONTENT_MD5" header on the request, or, if that header is null or not set, the S3 client internally computes an MD5 checksum on the client side and compares it with the ETag (which is simply the MD5 of the created object) returned in the object-creation response, throwing an error back to the client in the case of a data integrity failure. Note that in the latter case the integrity check happens on the client side and not on the S3 server side, which means the object is still created successfully and the client would need to clean it up explicitly.
Hence, using the header is recommended (the check then happens at the S3 end itself and fails early), but because TransferManager uses part uploads, it is not possible for the client to explicitly set the MD5 for a specific part. The TransferManager should take care of computing the MD5 of each part and setting the header, but I don't see that happening in the code.
As we want to use the TransferManager for multipart uploads, we would need to depend on the client-side check, which is enabled by default. However, there is a caveat to that too: when we enable SSE-KMS or SSE-C on the object in S3, this data integrity check is skipped, because (as mentioned in one of the comments in the code) S3 then returns an MD5 of the ciphertext, which can't be verified against the MD5 that was computed on the client side.
What should I use to enable the data integrity check with SSE in S3?
Note: Please verify that the above understanding is correct.