Data integrity check during upload to S3 with server side encryption - amazon-web-services

Data integrity checking is something the AWS Java SDK claims to provide by default: either the client computes the object checksum itself and supplies it via the “Headers.CONTENT_MD5” header on the request, or, if the header is left null or unset, the S3 client internally computes an MD5 checksum on the client side and compares it to the ETag (which is simply the MD5 of the created object) returned in the object-creation response, throwing an error back to the client on a data integrity failure. Note that in the latter case the integrity check happens on the client side, not on the S3 server side, which means the object is still created successfully and the client has to clean it up explicitly.
Hence, using the header is recommended (the check then happens at the S3 end itself and fails early), but because TransferManager uses multipart upload, it is not possible for the client to explicitly set the MD5 for a specific part. TransferManager should take care of computing the MD5 of each part and setting the header, but I don't see that happening in the code.
As we want to use TransferManager for multipart uploads, we would need to depend on the client-side check, which is enabled by default. However, there is a caveat there too: when SSE-KMS or SSE-C is enabled on the object in S3, this data integrity check is skipped because (as one of the comments in the SDK code mentions) S3 then returns an MD5 of the ciphertext, which cannot be verified against the MD5 computed on the client side.
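For reference, the explicit Content-MD5 approach described above looks roughly like this for a plain single-part PutObject (a minimal sketch, assuming the AWS SDK for Java v1; the bucket, key, and file names are placeholders). This is exactly the per-part step that TransferManager does not expose:

    import java.io.ByteArrayInputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.util.Base64;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class Md5CheckedUpload {
        public static void main(String[] args) throws Exception {
            byte[] body = Files.readAllBytes(Paths.get("backup.bin")); // placeholder file

            // Content-MD5 must be the base64-encoded MD5 of the request body.
            String contentMd5 = Base64.getEncoder()
                    .encodeToString(MessageDigest.getInstance("MD5").digest(body));

            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(body.length);
            meta.setContentMD5(contentMd5); // S3 rejects the upload server-side on a mismatch

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            s3.putObject(new PutObjectRequest("my-bucket", "my-key",
                    new ByteArrayInputStream(body), meta));
        }
    }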
What should I use to enable the data integrity check with SSE in S3?
Note: Please verify that the above understanding is correct.

Related

Checking data integrity of downloaded AWS S3 data when using presigned URLS

Occasionally, a client requests a large chunk of data to be transferred to them.
We host our data in AWS S3, and a solution we use is to generate presign URLs for the data they need.
My question:
When should data integrity checks actually be performed on data migration, or is relying on TLS good enough...
From my understanding, most uploads/downloads performed via the AWS CLI will automatically include data integrity checks.
One potential solution I have is to manually generate MD5SUMS for all files transferred, and for them to perform a local comparison.
I understand that the ETAG is a checksum of sorts, but because a lot of the files are multipart uploads, the ETAG becomes a very complicated mess to use as a comparison value.
You can activate "Additional checksums" in AWS S3.
The GetObjectAttributes function returns the checksum for the object and (if applicable) for each part.
Check out this release blog: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/
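As a rough illustration of what that looks like in code (a sketch assuming the AWS SDK for Java v2, which exposes the additional-checksum support; bucket, key, and file names are placeholders): the upload requests a SHA-256 checksum that S3 validates server-side, and GetObjectAttributes later returns the stored checksum without downloading the object.

    import java.nio.file.Paths;

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.ChecksumAlgorithm;
    import software.amazon.awssdk.services.s3.model.GetObjectAttributesRequest;
    import software.amazon.awssdk.services.s3.model.GetObjectAttributesResponse;
    import software.amazon.awssdk.services.s3.model.ObjectAttributes;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class AdditionalChecksums {
        public static void main(String[] args) {
            S3Client s3 = S3Client.create();

            // Upload with an additional SHA-256 checksum; the SDK computes it and
            // S3 validates it before accepting the object.
            s3.putObject(PutObjectRequest.builder()
                            .bucket("my-bucket")
                            .key("my-key")
                            .checksumAlgorithm(ChecksumAlgorithm.SHA256)
                            .build(),
                    RequestBody.fromFile(Paths.get("backup.bin")));

            // Later, read the stored checksum back without downloading the object.
            GetObjectAttributesResponse attrs = s3.getObjectAttributes(
                    GetObjectAttributesRequest.builder()
                            .bucket("my-bucket")
                            .key("my-key")
                            .objectAttributes(ObjectAttributes.CHECKSUM, ObjectAttributes.OBJECT_PARTS)
                            .build());

            System.out.println("SHA-256: " + attrs.checksum().checksumSHA256());
        }
    }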

AWS service to verify data integrity of file in S3 via checksum?

One method of ensuring a file in S3 is what it claims to be is to download it, get its checksum, and match the result against the checksum you were expecting.
Does AWS provide any service that allows this to happen without the user needing to first download the file? (i.e. ideally a simple request/url that provides the checksum of an S3 file, so that it can be verified before the file is downloaded)
What I've tried so far
I can think of a DIY solution along the lines of
Create an API endpoint that accepts a POST request with the S3 file url
Have the API run a lambda that generates the checksum of the file
Respond with the checksum value
This may work, but it is already a little complicated and has further considerations; for example, large files may take a long time (e.g. > 60 seconds) to checksum.
I'm hoping AWS have some simple way of validating S3 files?
There is an ETag created against each object, which is an MD5 of the object contents.
However, there seem to be some exceptions.
From Common Response Headers - Amazon Simple Storage Service:
ETag: The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Also, the calculation of an ETag for a multi-part upload can be complex. See: s3cmd - What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB? - Stack Overflow
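For the simple case (single-part upload, not encrypted with SSE-KMS or SSE-C), the ETag can therefore be fetched with a HEAD request and compared against a locally computed MD5, with no download at all. A minimal sketch, assuming the AWS SDK for Java v1 and placeholder bucket, key, and file names:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class EtagCheck {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // HEAD request: returns metadata (including the ETag) without downloading the body.
            String etag = s3.getObjectMetadata("my-bucket", "my-key").getETag();

            // Hex-encoded MD5 of the local copy.
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(Files.readAllBytes(Paths.get("local-copy.bin")));
            StringBuilder localMd5 = new StringBuilder();
            for (byte b : digest) {
                localMd5.append(String.format("%02x", b));
            }

            // Only meaningful for single-part, non-SSE-KMS/SSE-C objects (see the caveats above).
            System.out.println(etag.equalsIgnoreCase(localMd5.toString()) ? "match" : "MISMATCH");
        }
    }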

AWS S3 object with data sensitive object names

We name S3 objects using the birthdays of employees. This is a bad idea, and we want to avoid putting sensitive data in object names. Is it safe to store the sensitive data in S3 user-defined metadata instead, or should we add an S3 bucket policy that denies the s3:GetObject action? Which will work?
As you mentioned, it's not a good idea to put sensitive data in an object name, but it's not too bad either. I would suggest removing the listAllObjects() permission from the S3 policy. The policy should only allow getObject(), which means a caller can get an object ONLY when they already know its name, i.e. when the caller already knows the user's DOB.
With the listAllObjects() permission, a caller can list all the objects in the bucket and obtain the users' DOBs.
Object keys and user metadata should not be used for sensitive data. The reasoning behind object keys is readily apparent, but metadata may be less obvious;
metadata is returned in the HTTP headers every time an object is fetched. This can't be disabled, but it can be worked around with CloudFront and Lambda@Edge response triggers, which can be used to redact the metadata when the object is downloaded through CloudFront; however,
metadata is not stored encrypted in S3, even if the object itself is encrypted.
Object tags are also not appropriate for sensitive data, because they are also not stored encrypted. Object tags are useful for flagging objects that contain sensitive data, because tags can be used in policies to control access permissions on the object, but this is only relevant when the object itself contains the sensitive data.
In the case where "sensitive" means "proprietary" rather than "personal," tags can be an acceptable place for data... this might be data that is considered sensitive from a business perspective but that does not need to be stored encrypted, such as the identification of a specific software version that created the object. (I use this strategy so that if a version of code is determined later to have a bug, I can identify which objects might have been impacted because they were generated by that version). You might want to keep this information proprietary but it would not be "sensitive" in this context.
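As a hedged sketch of that tagging strategy (AWS SDK for Java v1; the tag key, version value, bucket, key, and file names are illustrative):

    import java.io.File;
    import java.util.Arrays;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectTagging;
    import com.amazonaws.services.s3.model.PutObjectRequest;
    import com.amazonaws.services.s3.model.Tag;

    public class TaggedUpload {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Tag the object with the (non-sensitive) version of the code that produced it,
            // so affected objects can be found later if that version turns out to be buggy.
            PutObjectRequest request = new PutObjectRequest("my-bucket",
                    "reports/2024/summary.json", new File("summary.json"))
                    .withTagging(new ObjectTagging(
                            Arrays.asList(new Tag("generator-version", "1.4.2"))));

            s3.putObject(request);
        }
    }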
If your S3 bucket is used to store private data and you're allowing public access to the bucket, that is always a bad idea - it's basically security by obscurity.
Instead of changing your existing S3 structure, you could lock the bucket down to just your app and then serve the data via CloudFront signed URLs.
Basically, in your code, where you currently inject the S3 URL, you instead call the AWS API to create a signed URL from the S3 URL and a policy, and send this new URL to the end user. This masks the S3 URL, and you can enforce other restrictions such as how long the link is valid, requiring a specific header, or limiting access to a specific IP. You also get CDN edge caching and reduced costs as side benefits.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
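A rough sketch of generating such a signed URL with the v1 Java SDK's CloudFrontUrlSigner (the distribution domain, key-pair ID, key file, and object path are placeholders, and the helper used to load the private key may vary by SDK version):

    import java.io.File;
    import java.security.PrivateKey;
    import java.util.Date;

    import com.amazonaws.services.cloudfront.CloudFrontUrlSigner;
    import com.amazonaws.services.cloudfront.util.SignerUtils;

    public class SignedUrlExample {
        public static void main(String[] args) throws Exception {
            PrivateKey privateKey = SignerUtils.loadPrivateKey(new File("cf-private-key.pem"));

            // URL of the object as served through the CloudFront distribution (masks the S3 URL).
            String resourceUrl = "https://d1234example.cloudfront.net/reports/summary.json";

            // Canned policy: the link simply expires after one hour.
            Date expiresAt = new Date(System.currentTimeMillis() + 3600_000L);

            String signedUrl = CloudFrontUrlSigner.getSignedURLWithCannedPolicy(
                    resourceUrl, "APKAEXAMPLEKEYPAIRID", privateKey, expiresAt);

            System.out.println(signedUrl);
        }
    }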

AWS S3 Upload Integrity

I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
For multipart uploads, the Content-MD5 header is set to the same value, but for each part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenated binary-encoded (raw bytes) MD5 hashes of each part, followed by a - and the number of parts.
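As a concrete illustration, here is a sketch of recomputing that multipart ETag locally (plain Java, no SDK; the file name and the 8 MiB part size are assumptions, and the part size must match what the uploader actually used):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    public class MultipartEtag {
        // Recompute S3's multipart ETag: MD5 of the concatenated raw part MD5s, plus "-<partCount>".
        public static String compute(String path, int partSizeBytes) throws Exception {
            MessageDigest partDigest = MessageDigest.getInstance("MD5");
            MessageDigest combined = MessageDigest.getInstance("MD5");
            int partCount = 0;

            try (InputStream in = new FileInputStream(path)) {
                byte[] buf = new byte[partSizeBytes];
                int read;
                while ((read = in.readNBytes(buf, 0, partSizeBytes)) > 0) {
                    partDigest.reset();
                    partDigest.update(buf, 0, read);
                    combined.update(partDigest.digest()); // raw bytes of this part's MD5
                    partCount++;
                }
            }

            StringBuilder hex = new StringBuilder();
            for (byte b : combined.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex + "-" + partCount;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(compute("backup.bin", 8 * 1024 * 1024)); // assumes 8 MiB parts
        }
    }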
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 about the parts you've uploaded, and some lazy developers will just ask S3 for that list and send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must have a request header x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
Note that Content-MD5 and x-amz-content-sha256 are request headers; they are not returned with downloads. If you want that information later, you have to save it yourself, for example in the object metadata as described above.
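A minimal sketch of that save-it-yourself approach (AWS SDK for Java v1; the sha256 metadata key, bucket, key, and file names are just examples):

    import java.io.ByteArrayInputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class ShaInMetadata {
        public static void main(String[] args) throws Exception {
            byte[] body = Files.readAllBytes(Paths.get("backup.bin"));

            // Hex-encoded SHA-256 of the payload, stored with the object so it can be
            // fetched on download and compared (unlike the request-only checksum headers).
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(body);
            StringBuilder sha256 = new StringBuilder();
            for (byte b : digest) {
                sha256.append(String.format("%02x", b));
            }

            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(body.length);
            meta.addUserMetadata("sha256", sha256.toString()); // returned as x-amz-meta-sha256

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            s3.putObject(new PutObjectRequest("my-bucket", "my-key",
                    new ByteArrayInputStream(body), meta));
        }
    }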
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer as long as you use an instance with a public IPv4 or IPv6 address, a NAT instance, an S3 VPC endpoint, or an egress-only internet gateway for IPv6. (You'll be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway.) Obviously there are ways to automate this, but manually, if you select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can perform an MD5 checksum locally, and then verify that against the MD5 checksum of the object on S3 to ensure data integrity. Here is a guide

Multiuser access to encrypted data

I'm building a server-side application which requires the data to be stored encrypted in the database. When a client accesses the data, it also has to be transferred encrypted. The clients each have a unique login.
My original idea is to store the data encrypted with a symmetric algorithm like AES. When a client wants to access the data, the encrypted data is transferred to the client, while the key is encrypted with the client's public key.
Is this a secure way to do store and transfer the data or is there a better solution to this problem?
Update: If following Søren's suggestion to keep a copy of the AES key encrypted with each client's public key, wouldn't that require the key to be stored somewhere in order to add additional clients, or could it be generated in some way?
First you should start by defining some security properties you want to provide, for example:
Is it OK to give different users access to the same secret key? That is, if File1 is AES-encrypted with key K, is it a problem if user Alice and user Bob are both given K?
How do I revoke users from the system? (It turns out Bob from scenario 1 is actually a Chinese spy working for our company; how do I securely kick him out of the system?)
Does the encrypted data that is saved in the database need to be searched? (This problem is well researched and hard to solve!)
How much (if any) and what plaintext data will be placed into the database to help organize it? Databases expect data to have unique keys associated with them. You need to make sure these keys don't leak information, but are useful enough to retrieve the data later.
How often should secret keys be changed? If you are storing files and multiple users are allowed access to encrypted files, what happens when user X modifies a file? Does the secret key change? Should the new key be sent to all users?
What happens when 2 users modify the same data at the same time? Will the database be able to handle this without modification?
There are many others.
If the server is not trusted and must never see plaintext data, then here's a general overview of a possible solution.
Let the clients manage the crypto completely. Clients authenticate with the server and are allowed to store data in the database. It is the responsibility of the client to make sure the data is encrypted.
In this scenario, keys should be saved securely only on the clients' computers. If they must be placed elsewhere, a "master key" could be created.
Secure from what? You need to define your goals more clearly.
The solution would protect the data during transfer, but from your description, the server would have full access to the data (since it'd need to store the AES key unencrypted). In other words, a hacker or burglar with access to the server would have full access to the data.
If secure transmission is what you want, use an SSL / TLS wrapper around the database connection. This is a standard solution from all major vendors.
To secure the data server side, the server should not have the AES key. If the number of clients were limited, the server could store a copy of the AES key for every client, each copy of the key already encrypted with the public key of each client, such that the server never sees the plain text data nor any unencrypted AES keys.
That is indeed the common approach, e.g. also used by NTFS file encryption.
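A hedged sketch of that pattern in plain Java (JCE only; key sizes, algorithms, and the single-client handling are illustrative): the data is encrypted once with an AES data key, and that key is wrapped separately with each client's RSA public key, so the server stores only ciphertext and wrapped keys.

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.PublicKey;
    import java.security.SecureRandom;

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;

    public class EnvelopeEncryptionSketch {
        public static void main(String[] args) throws Exception {
            // One symmetric data key for the record.
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            SecretKey dataKey = kg.generateKey();

            // Encrypt the payload with AES-GCM (the IV must be stored alongside the ciphertext).
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
            byte[] ciphertext = aes.doFinal("sensitive record".getBytes(StandardCharsets.UTF_8));

            // Stand-in for a client's registered public key; in practice each client registers
            // its own key pair and the server only ever sees the public half.
            KeyPair clientKeys = KeyPairGenerator.getInstance("RSA").generateKeyPair();
            PublicKey clientPublicKey = clientKeys.getPublic();

            // Wrap the AES key for this client; store one wrapped copy per authorized client.
            Cipher rsa = Cipher.getInstance("RSA/ECB/OAEPWithSHA-256AndMGF1Padding");
            rsa.init(Cipher.WRAP_MODE, clientPublicKey);
            byte[] wrappedKeyForClient = rsa.wrap(dataKey);

            System.out.printf("ciphertext=%d bytes, wrapped key=%d bytes%n",
                    ciphertext.length, wrappedKeyForClient.length);
        }
    }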