I have an application that has been running for a long time, uploading files (images) to S3 storage.
Now I've been asked to update this application to upload files using SSE-C encryption (Server-Side Encryption with Customer-provided keys), which I did.
I'm also able to upload SSE-C-encrypted files using the AWS CLI.
What I need now, and this is my question, is a way to apply SSE-C encryption to the older files that are already on S3 without SSE-C encryption.
Could someone explain whether and how this can be accomplished, or point me to documentation or a support page where I can find a solution?
One (possibly inefficient) way I found is to do the following for each file:
copy filename to filename.encrypted, applying SSE-C encryption
move filename.encrypted back to filename
Is this the only way to do it, or is there a better one?
NOTES:
Since I have a great many files, I obviously ruled out downloading each file and re-uploading it with SSE-C encryption, because that would be too slow and too expensive.
I'm looking for a solution that applies SSE-C without transferring data out of and back into S3.
Thank you very much for any feedback on this.
You can apply encryption to already-existing objects by simply copying the object on top of itself:
aws s3 cp s3://bucket/foo.txt s3://bucket/foo.txt --sse-c --sse-c-key fileb://key.bin
This works as long as something (e.g. the encryption) is changing.
I got the --sse-c syntax from: How to supply a key on the command line that's not Base 64 encoded
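For anyone doing this from code rather than the CLI, here is a minimal boto3 sketch of the same in-place copy (bucket, key and key file are placeholders; note that CopyObject only handles objects up to 5 GB, so larger objects need a multipart copy):

import boto3

s3 = boto3.client("s3")

bucket = "my-bucket"                     # placeholder
key = "path/to/foo.txt"                  # placeholder
sse_key = open("key.bin", "rb").read()   # raw 32-byte AES-256 key, as in the CLI example

# Copy the object onto itself, adding SSE-C. boto3 base64-encodes the key
# and adds the key-MD5 header for us.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=sse_key,
)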
Occasionally, a client requests a large chunk of data to be transferred to them.
We host our data in AWS S3, and a solution we use is to generate presigned URLs for the data they need.
My question:
When should data integrity checks actually be performed on data migration, or is relying on TLS good enough?
From my understanding, most uploads/downloads performed via the AWS CLI automatically include data integrity checks.
One potential solution I have is to manually generate MD5SUMS for all files transferred, and have the client perform a local comparison.
I understand that the ETag is a checksum of sorts, but because a lot of the files are multipart uploads, the ETag becomes very complicated to use as a comparison value.
You can enable "additional checksums" in Amazon S3.
The GetObjectAttributes API call returns the checksum for the object and (if applicable) for each part.
Check out this release blog: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/
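As a rough boto3 illustration (bucket and key names are placeholders), you can ask S3 to compute a SHA-256 checksum at upload time and read it back later without downloading the object:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "data/file.bin"   # placeholders

# Upload with an additional checksum; S3 stores the SHA-256 alongside the object.
with open("file.bin", "rb") as f:
    s3.put_object(Bucket=bucket, Key=key, Body=f, ChecksumAlgorithm="SHA256")

# Later, retrieve the stored checksum (and per-part checksums, if multipart)
# without fetching the body.
attrs = s3.get_object_attributes(
    Bucket=bucket,
    Key=key,
    ObjectAttributes=["Checksum", "ObjectParts"],
)
print(attrs["Checksum"]["ChecksumSHA256"])   # base64-encoded SHA-256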
One method of ensuring a file in S3 is what it claims to be is to download it, compute its checksum, and match the result against the checksum you were expecting.
Does AWS provide any service that allows this to happen without the user needing to download the file first? (i.e. ideally a simple request/URL that provides the checksum of an S3 file, so that it can be verified before the file is downloaded)
What I've tried so far
I can think of a DIY solution along the lines of:
Create an API endpoint that accepts a POST request with the S3 file URL
Have the API run a Lambda function that generates the checksum of the file
Respond with the checksum value
This may work, but it's already a little complicated and has further considerations, e.g. a large file may take a long time to checksum (> 60 seconds).
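A minimal sketch of what that Lambda might look like (the request contract and field names are hypothetical), streaming the object from S3 and returning its SHA-256:

import hashlib
import json

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical contract: the POST body is {"bucket": "...", "key": "..."}.
    body = json.loads(event["body"])
    obj = s3.get_object(Bucket=body["bucket"], Key=body["key"])

    # Stream the object in 1 MiB chunks so large files don't need to fit in memory.
    digest = hashlib.sha256()
    for chunk in obj["Body"].iter_chunks(chunk_size=1024 * 1024):
        digest.update(chunk)

    return {"statusCode": 200, "body": json.dumps({"sha256": digest.hexdigest()})}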
I'm hoping AWS have some simple way of validating S3 files?
There is an ETag created for each object, which is an MD5 digest of the object contents.
However, there are some exceptions.
From Common Response Headers - Amazon Simple Storage Service:
ETag: The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Also, the calculation of an ETag for a multi-part upload can be complex. See: s3cmd - What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB? - Stack Overflow
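For reference, a rough sketch of the commonly documented multipart-ETag calculation (this assumes you know the part size that was used for the upload, which is exactly why it gets messy in practice):

import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    # MD5 each part, then MD5 the concatenated raw digests and append "-<parts>".
    part_md5s = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_md5s.append(hashlib.md5(chunk).digest())
    if len(part_md5s) == 1:
        return part_md5s[0].hex()   # single PUT: the ETag is just the MD5
    combined = hashlib.md5(b"".join(part_md5s)).hexdigest()
    return f"{combined}-{len(part_md5s)}"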
I have a use case where I have to back up a 200+ TB, 18M-object S3 bucket, which changes often (it is used in batch processing of critical data), to another account. I need to add a verification step, but due to the size of the bucket, the object count, and the frequency of change, this is tricky.
My current thought is to pull the ETags from the original bucket and the archive bucket, and then write a streaming diff tool to compare the values. Has anyone here had to approach this problem, and if so, did you come up with a better answer?
Firstly, if you wish to keep two buckets in sync (once you've done the initial sync), you can use Cross-Region Replication (CRR).
To do the initial sync, you could try using the AWS Command-Line Interface (CLI), which has an aws s3 sync command. However, it might have some difficulties with a large number of files -- I suggest you give it a try. It uses keys, dates and file size to determine which files to sync.
If you do wish to create your own sync app, then the ETag is definitely a definitive way to compare files.
To make things simple, activate Amazon S3 Inventory, which can provide a daily listing of all files in a bucket, including the ETag. You could then compare the inventory files to discover which remaining files require synchronization.
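As a rough illustration of that comparison (file names and column positions are assumptions; the actual layout depends on which inventory fields you configure), diffing two inventory CSV listings on key and ETag could look like this:

import csv

def load_inventory(path, key_col=1, etag_col=3):
    # Assumed column layout: bucket, key, ..., etag. Adjust to your inventory config.
    with open(path, newline="") as f:
        return {row[key_col]: row[etag_col] for row in csv.reader(f)}

source = load_inventory("source-inventory.csv")   # placeholder file names
backup = load_inventory("backup-inventory.csv")

missing = [k for k in source if k not in backup]
mismatched = [k for k in source if k in backup and source[k] != backup[k]]

print(f"{len(missing)} objects missing, {len(mismatched)} objects differ")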
For anyone looking for a way to solve this problem in an automated way (as I was),
I created a small Python script that leverages S3 Inventories and Athena to do the comparison somewhat efficiently. (This is basically an automation of John Rosenstein's suggestion.)
You can find it here https://github.com/forter/s3-compare
I am writing an application using the AWS SDK for C++. I would like to enable integrity checking for S3 transfers, even transfers that require multiple requests due to the size of the file.
How can I do this? The documentation for the C++ version of the AWS SDK is scanty.
I scanned the source code to the SDK and found this in AmazonWebServiceRequest:
inline virtual bool ShouldComputeContentMd5() const { return false; }
but it's not clear to me how to get the S3 classes to use an overridden version of this method.
While we're on the subject, I'd rather use the relatively new SHA256 AWS feature instead of MD5, but there seem to be even fewer hooks for that hash algorithm in the C++ SDK.
Can anyone help? Thanks.
S3 has an ETag feature. Once an object is uploaded, either partially or fully, you can get the ETag from the S3 API call and read it from the response header.
The links below discuss ETags in more detail.
What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?
S3 Documentation on ETag Header:
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
I'm using S3 to back up large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
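As a rough boto3 sketch of that idea (bucket and key are placeholders), sending Content-MD5 with a single-part upload:

import base64
import hashlib

import boto3

s3 = boto3.client("s3")

data = open("backup.tar.gz", "rb").read()
md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode()

# S3 rejects the upload if the body's MD5 does not match Content-MD5.
s3.put_object(
    Bucket="my-bucket",             # placeholder
    Key="backups/backup.tar.gz",    # placeholder
    Body=data,
    ContentMD5=md5_b64,
)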
The ETag header will be set to the hex-encoded MD5 hash of the object, for single-part uploads (with an exception for some types of server-side encryption).
For multipart uploads, the Content-MD5 header can be set in the same way, but for each individual part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenated binary (raw-byte) MD5 hashes of each part, followed by a - and the number of parts.
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to ensure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 for the list of parts you've uploaded, and some lazy developers will just ask S3 for this list and send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
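A minimal boto3 sketch of that flow (names and part size are placeholders), sending Content-MD5 per part and keeping track of the returned ETags locally instead of asking S3 for the list afterwards:

import base64
import hashlib

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big/backup.bin"   # placeholders
part_size = 8 * 1024 * 1024                   # 8 MiB parts

upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

with open("backup.bin", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        md5_b64 = base64.b64encode(hashlib.md5(chunk).digest()).decode()
        resp = s3.upload_part(
            Bucket=bucket,
            Key=key,
            UploadId=upload["UploadId"],
            PartNumber=part_number,
            Body=chunk,
            ContentMD5=md5_b64,    # S3 rejects the part if the MD5 doesn't match
        )
        # Record the ETag S3 just returned, rather than listing parts later.
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)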
Correctly used, these features provide assurance of intact uploads.
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must have a request header x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
Note that Content-MD5 and x-amz-content-sha256 are request headers; they are not returned with downloads. If you want that information later, you need to save it in the object metadata yourself, as I described above.
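A rough sketch of that approach with boto3 (the metadata key name "sha256" is just a convention I'm assuming here):

import hashlib

import boto3

s3 = boto3.client("s3")

data = open("report.csv.gz", "rb").read()
sha256_hex = hashlib.sha256(data).hexdigest()

# Stored as x-amz-meta-sha256; returned with every subsequent GET or HEAD.
s3.put_object(
    Bucket="my-bucket",            # placeholder
    Key="reports/report.csv.gz",   # placeholder
    Body=data,
    Metadata={"sha256": sha256_hex},
)

# Later, fetch the stored hash without downloading the body.
head = s3.head_object(Bucket="my-bucket", Key="reports/report.csv.gz")
print(head["Metadata"]["sha256"])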
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer if you use an instance with a public IPv4 or IPv6 address, a NAT instance, an S3 VPC endpoint, or an egress-only internet gateway for IPv6. (You'll be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway.) Obviously there are ways to automate this, but manually: select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can compute an MD5 checksum locally and then verify it against the MD5 checksum of the object on S3 to ensure data integrity. Here is a guide
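A minimal boto3 sketch of that comparison (bucket and key are placeholders; this is only valid for single-part uploads without SSE-C/SSE-KMS, where the ETag is the plain hex MD5):

import hashlib

import boto3

s3 = boto3.client("s3")

local_md5 = hashlib.md5(open("photo.jpg", "rb").read()).hexdigest()

# For single-part, non-SSE-C/KMS objects the ETag is the hex MD5, wrapped in quotes.
etag = s3.head_object(Bucket="my-bucket", Key="photos/photo.jpg")["ETag"].strip('"')

print("match" if local_md5 == etag else "MISMATCH")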