I am trying to use the Amazon S3 REST API to upload a large file in chunks.
As per the API documentation here, I formed my request in Postman as follows.
After initiating the CreateMultipartUpload POST request, I successfully get an UploadId for the chunked PUT requests.
This is working fine.
I understand the multipart ordering, but when I execute the part-upload step (with partNumber and uploadId) in Postman, I get a SHA256Mismatch error because Postman calculates Content-MD5 for the whole file rather than for each chunk.
[Screenshots of the Postman request's Headers, Params, and Body]
I have found multiple solutions on various forums but those solutions didn't work.
Am I missing something here?
I successfully uploaded to Amazon S3 with Postman using Amazon's multipart upload, and the key (for me) was to add a Content-MD5 header manually and paste in the Base64-encoded MD5 hash for that part (details below). This may not be the exact problem the OP was having, but I still wanted to share how to use Postman, provided you have a working IAM key ID and secret for your Amazon S3 bucket.
First, I split a 9 MB "mytest.pdf" file into two parts to test this (I used the Linux/WSL command: split -b 5242880 mytest.pdf), making sure the first part is at least 5 MB (the last part can be smaller than 5 MB).
Next, set up Postman with the following four requests:
CreateMultipartUpload (e.g., POST
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?uploads)
UploadPart1 (e.g., PUT
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?partNumber=1&uploadId=297a2XMl9kNDqw1BaKl7jk6uK_Mop0mCV68TmWU2n8xjsrM6sgt0hu.93J92Qgw8yaEHlrlj0MSoc9ljmU3sD3dlQsGJixMq9hugPDRTkikM0KV6rmLdpmHjFcWzDEDO)
UploadPart2 (e.g., PUT
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?partNumber=2&uploadId=297a2XMl9kNDqw1BaKl7jk6uK_Mop0mCV68TmWU2n8xjsrM6sgt0hu.93J92Qgw8yaEHlrlj0MSoc9ljmU3sD3dlQsGJixMq9hugPDRTkikM0KV6rmLdpmHjFcWzDEDO)
CompleteMultipartUpload (e.g., POST
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?uploadId=297a2XMl9kNDqw1BaKl7jk6uK_Mop0mCV68TmWU2n8xjsrM6sgt0hu.93J92Qgw8yaEHlrlj0MSoc9ljmU3sD3dlQsGJixMq9hugPDRTkikM0KV6rmLdpmHjFcWzDEDO)
I pasted my IAM access key ID and secret access key into the Authorization section of Postman (there are separate articles for that).
Ran the CreateMultipartUpload POST to get the UploadId from Amazon.
Next, I calculated the MD5 hash of each of my two file parts (I used 7-Zip, but use the tool of your choice) and converted that result to Base64. However, I had to make sure I ended up with 22 characters followed by two equals signs (==). When I converted the MD5 as text to Base64, I got a longer string ending in a single equals sign, not two; that is a sign it is not encoded the way Amazon expects and will produce an InvalidDigest error. For example, if you use 7-Zip to calculate the MD5 hash of a file part and get the value 58942651efd0f5886810d04ed9df502f, and then use a tool such as the Base64 encoder below with "Text" as the input type, you get NTg5NDI2NTFlZmQwZjU4ODY4MTBkMDRlZDlkZjUwMmY=, but if you choose "HEX" as the input, you get the shorter string WJQmUe/Q9YhoENBO2d9QLw==, which is 24 characters (22 characters plus 2 equals signs). That's the one you want.
(no implied endorsement of this tool - not affiliated https://emn178.github.io/online-tools/base64_encode.html)
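If you would rather compute the header value in code than convert hex by hand, here is a minimal Node.js sketch using only built-in modules (the part file name is just an example of what split produces):

// Compute the Content-MD5 header value for one file part.
// 'xaa' is an example name; use whatever file names `split` produced.
const crypto = require('crypto');
const fs = require('fs');

const part = fs.readFileSync('xaa');
const contentMd5 = crypto.createHash('md5').update(part).digest('base64');
console.log(contentMd5); // e.g. WJQmUe/Q9YhoENBO2d9QLw== (22 characters + '==')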
If you get it wrong, Amazon will reply with an InvalidDigest error like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>InvalidDigest</Code>
<Message>The Content-MD5 you specified was invalid.</Message>
<Content-MD5>thisisbad</Content-MD5>
<RequestId>8274DC9566D4AAA8</RequestId>
<HostId>H6kSy4cl+54nMon1Hq6AGjmTX/MfTVMQQr8vEVNXUnPlfMtIt8HPdObfusckhBpwpG/CJ6ORWv16c=</HostId>
</Error>
Ran both UploadPart1 and UploadPart2.
Finally, ran CompleteMultipartUpload with the ETag values copied and pasted from the response headers of the previous two requests:
<CompleteMultipartUpload>
<Part>
<PartNumber>1</PartNumber>
<ETag>"c716d98e83db1edb27fc25fd03e0ae32"</ETag>
</Part>
<Part>
<PartNumber>2</PartNumber>
<ETag>"58942651efd0f5886810d04ed9df502f"</ETag>
</Part>
</CompleteMultipartUpload>
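For anyone who would rather script those same four calls than click through Postman, here is a rough sketch using the AWS SDK for JavaScript (v2); the region, bucket, and part file names are placeholders, not values from the answer above.

// Rough equivalent of the four Postman requests above, using the AWS SDK for JavaScript (v2).
const AWS = require('aws-sdk');
const crypto = require('crypto');
const fs = require('fs');

const s3 = new AWS.S3({ region: 'us-east-1' }); // placeholder region
const Bucket = 'mybucket';                      // placeholder bucket name
const Key = 'mytest.pdf';

async function multipartUpload(partFiles) {
  // 1. CreateMultipartUpload
  const { UploadId } = await s3.createMultipartUpload({ Bucket, Key }).promise();

  // 2. UploadPart for each chunk, sending a per-part Content-MD5
  const parts = [];
  for (let i = 0; i < partFiles.length; i++) {
    const Body = fs.readFileSync(partFiles[i]);
    const ContentMD5 = crypto.createHash('md5').update(Body).digest('base64');
    const { ETag } = await s3
      .uploadPart({ Bucket, Key, UploadId, PartNumber: i + 1, Body, ContentMD5 })
      .promise();
    parts.push({ PartNumber: i + 1, ETag });
  }

  // 3. CompleteMultipartUpload with the ETags returned for each part
  return s3
    .completeMultipartUpload({ Bucket, Key, UploadId, MultipartUpload: { Parts: parts } })
    .promise();
}

multipartUpload(['xaa', 'xab']).then(console.log).catch(console.error);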
In the app I'm working on, we have a process whereby a user can download a CSV or PDF version of their data. The generation works great, but I'm trying to get the file to download and am running into all sorts of problems. We're using API Gateway for all the requests, and the generation happens inside a Lambda on a POST request. The GET endpoint takes a file_name parameter, constructs the path in S3, and makes the request directly there.

The problem I'm having is when I try to transform the response. I get a 500 error, and when I look at the logs, it says Execution failed due to configuration error: Unable to transform response. So, clearly, that's where I've spent most of my time. I've tried at least 50 different iterations of templates and combinations with little success. The closest I've gotten is the following code, where the CSV downloads fine, but the PDF is no longer a valid PDF:
CSV:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$input.body
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
PDF:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$util.base64Encode($input.body)
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
where contentHandling = CONVERT_TO_TEXT. My binaryMediaTypes just has application/pdf and that's it. My goal is to get this working without having to offload the problem into a Lambda so we don't have that overhead at the download step. Any ideas how to do this right?
Just as another comment, I've tried CONVERT_TO_BINARY and just leaving it as Passthrough. I've tried it with text/csv as another binary media type and I've tried different combinations of encoding and decoding base64 and stuff. I know the data is coming back right from S3, but the transformation is where it's breaking. I am happy to post more logs if need be. Also, I'm pretty sure this makes sense on StackOverflow, but if it would fit in another StackExchange site better, please let me know.
Resources I've looked at:
https://docs.aws.amazon.com/apigateway/latest/developerguide/request-response-data-mappings.html
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-mapping-template-reference.html#util-template-reference
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-workflow.html
https://docs.amazonaws.cn/en_us/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-control-service-api.html
(But they're all so confusing...)
EDIT: One idea I've had is to do CONVERT_TO_BINARY and somehow base64-encode the CSVs in the transformation, but I can't figure out how to do it right. I keep feeling like I'm misunderstanding the order of things, specifically when the "CONVERT" part happens. If that makes any sense.
EDIT 2: So, I got rid of the $util.base64Encode in the PDF one, and now I get a PDF that's empty. The actual file in S3 definitely has things in it, but for some reason CONVERT_TO_TEXT is not handling it right, or I'm still not understanding how this all works.
I had similar issues. One major thing is the Accept header. I was testing in Chrome, which sends an Accept header of text/html,application/xhtml.... API Gateway ignores everything except the first entry (text/html). It will then convert any response from S3 to base64 to try to conform to text/html.
At last, after trying everything else, I tried via Postman, which defaults the Accept header to */*. Also set your content handling on the Integration Response to Passthrough. And everything was working!
One other thing is to pass the Content-Type and Content-Length headers through (add them in the Method Response first and then in the Integration Response):
Content-Length integration.response.header.Content-Length
Content-Type integration.response.header.Content-Type
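For what it's worth, the Accept-header behaviour is easy to reproduce outside Postman. Here is a small Node.js sketch (the URL and output file name are placeholders) that sends Accept: */* and writes the response bytes to disk unmodified:

// Fetch the file through API Gateway with Accept: */* (what Postman sends by default)
// and stream the raw response body to disk.
const https = require('https');
const fs = require('fs');

// Placeholder endpoint; substitute your own API Gateway URL.
const url = 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/download?file_name=report.pdf';

https.get(url, { headers: { Accept: '*/*' } }, (res) => {
  console.log('status:', res.statusCode);
  res.pipe(fs.createWriteStream('report.pdf'));
});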
One method of ensuring a file in S3 is what it claims to be is to download it, get its checksum, and match the result against the checksum you were expecting.
Does AWS provide any service that allows this to happen without the user needing to first download the file? (i.e. ideally a simple request/url that provides the checksum of an S3 file, so that it can be verified before the file is downloaded)
What I've tried so far
I can think of a DIY solution along the lines of
Create an API endpoint that accepts a POST request with the S3 file url
Have the API run a lambda that generates the checksum of the file
Respond with the checksum value
This may work, but it is already a little complicated and has further considerations; for example, large files may take a long time to checksum (e.g., > 60 seconds).
I'm hoping AWS have some simple way of validating S3 files?
There is an ETag created against each object, which is an MD5 of the object contents.
However, there seem to be some exceptions.
From Common Response Headers - Amazon Simple Storage Service:
ETag: The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Also, the calculation of an ETag for a multi-part upload can be complex. See: s3cmd - What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB? - Stack Overflow
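If the object was created as a single-part upload without SSE-C/SSE-KMS (the cases above where the ETag is an MD5 digest), here is a hedged sketch of the comparison with the AWS SDK for JavaScript; the bucket, key, and path are placeholders:

// Compare a local file's MD5 with the object's ETag without downloading the object.
// Only meaningful for single-part uploads without SSE-C/SSE-KMS (see the caveats above).
const AWS = require('aws-sdk');
const crypto = require('crypto');
const fs = require('fs');

const s3 = new AWS.S3();

async function matchesEtag(bucket, key, localPath) {
  const { ETag } = await s3.headObject({ Bucket: bucket, Key: key }).promise();
  const localMd5 = crypto.createHash('md5').update(fs.readFileSync(localPath)).digest('hex');
  return ETag === `"${localMd5}"`; // S3 returns the ETag wrapped in double quotes
}

matchesEtag('my-bucket', 'backup.tar.gz', './backup.tar.gz').then(console.log);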
When I want to get a partial range of file content from Google Cloud Storage, I use the XML API with HTTP Range GET requests. In the response from Google Cloud, I can find the x-goog-hash header, which contains CRC32C and MD5 checksums. But these checksums are calculated over the whole file. What I need is the CRC32C checksum of the partial range of content in the response. With that partial CRC32C checksum I can verify the data in the response; otherwise, I cannot validate the response.
I was wondering: are the files stored in your bucket in gzip format? I read here, Using Range Header on gzip-compressed files, that you can't get partial information from a compressed file; by default you get the information for the whole file.
Anyway, could you share the request you're sending?
I looked for more information and found this: Request Headers and Cloud Storage.
It says that when you use the Range header, the returned checksum will cover the whole file.
So far, there's no way to get the checksum for a byte range alone using the XML API.
However, you could try to do it by splitting your file with your preferred programming language and getting the checksum for that split part.
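A minimal Node.js sketch of that idea, hashing only a byte range of a local copy (MD5 is used here just because Node has it built in; CRC32C would need a third-party package, and the offsets are examples):

// Hash only bytes [start, end] of a local copy of the object, so you can compare it
// against a hash computed over the bytes returned by a Range request.
const crypto = require('crypto');
const fs = require('fs');

function md5OfRange(path, start, end) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(path, { start, end }) // end is inclusive, like the HTTP Range header
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

md5OfRange('myfile.bin', 0, 1048575).then(console.log); // first 1 MiB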
I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
For multipart uploads, the Content-MD5 header is set to the same value, but for each part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenated binary-encoded (raw bytes) MD5 hashes of each part, followed by a - and the number of parts.
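To make that concrete, here is a sketch of that ETag calculation in Node.js; the file name and part size are examples, and you must use the same part size the uploader used:

// Reproduce S3's multipart ETag: MD5 over the concatenated raw MD5 digests of each part,
// followed by "-" and the number of parts.
const crypto = require('crypto');
const fs = require('fs');

function multipartEtag(path, partSize) {
  const data = fs.readFileSync(path);
  const partDigests = [];
  for (let offset = 0; offset < data.length; offset += partSize) {
    partDigests.push(
      crypto.createHash('md5').update(data.subarray(offset, offset + partSize)).digest()
    );
  }
  const combined = crypto.createHash('md5').update(Buffer.concat(partDigests)).digest('hex');
  return `${combined}-${partDigests.length}`;
}

console.log(multipartEtag('backup.tar.gz', 8 * 1024 * 1024)); // example: 8 MiB parts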
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 about the parts you've uploaded, and some lazy developers will just ask S3 for this list and then send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must have a request header x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
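The original tool speaks to the REST API directly with no SDK, but here is a rough approximation of the same idea with the AWS SDK for JavaScript; the metadata key name (sha256) and the file names are hypothetical, not the ones the author used.

// Store checksums with the object at upload time so they can be fetched back later.
// The 'sha256' metadata key is a hypothetical example; it is stored as x-amz-meta-sha256.
const AWS = require('aws-sdk');
const crypto = require('crypto');
const fs = require('fs');

const s3 = new AWS.S3();

async function uploadWithChecksums(bucket, key, path) {
  const body = fs.readFileSync(path);
  return s3.putObject({
    Bucket: bucket,
    Key: key,
    Body: body,
    // S3 rejects the upload outright if this doesn't match the body.
    ContentMD5: crypto.createHash('md5').update(body).digest('base64'),
    // Saved with the object; comes back as x-amz-meta-sha256 on GET/HEAD.
    Metadata: { sha256: crypto.createHash('sha256').update(body).digest('hex') },
  }).promise();
}

uploadWithChecksums('my-bucket', 'backup.tar.gz', './backup.tar.gz').then(console.log);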
Note that Content-MD5 and x-amz-content-sha256 are request headers; they are not returned with downloads. If you want that information available later, you have to save it yourself, for example in the object metadata as I described above.
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer if you access S3 from an instance with a public IPv4 or IPv6 address, through a NAT instance, through an S3 VPC endpoint, or through an IPv6 egress-only gateway. (You will be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway.) Obviously there are ways to automate this, but manually: select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can perform an MD5 checksum locally, and then verify that against the MD5 checksum of the object on S3 to ensure data integrity. Here is a guide
I have a system where after a file is uploaded to S3, a Lambda job raises a queue message and I use it to maintain a list of keys in a MySQL table.
I am trying to generate a pre-signed URL based on the records in my table.
I have two records currently:
/41jQnjTkg/thumbnail.jpg
/41jQnjTkg/Artist+-+Song.mp3
I am generating the pre-signed URL using:
var params = {
Bucket: bucket,
Expires: Settings.UrlGetTimeout,
Key: record
};
S3.getSignedUrl('getObject', params);
The URL with thumbnail.jpg works perfectly fine, but the one with +-+ fails. The original file name on local disk was "Artist - Song.mp3". S3 replaced spaces with '+'. Now when I am generating a URL using the exact same filename that S3 uses, it doesn't work; I get a "Specified Key doesn't exist" error from S3.
What must I do to generate URLs consistently for all filenames?
I solved this after a little experimentation.
Instead of directly storing the key that S3 provides in its S3 event message, I first replace the '+' characters with spaces (as they are originally on disk) and then URL-decode the result:
return decodeURIComponent(str.replace(/\+/img, " "));
Now generating an S3 pre-signed URL works as expected.
Before, MySQL had the following records:
/41jQnjTkg/thumbnail.jpg
/41jQnjTkg/Artist+-+Song.mp3
Now:
/41jQnjTkg/thumbnail.jpg
/41jQnjTkg/Artist - Song.mp3
I personally feel there is an inconsistency in S3's API/event messages.
Had I generated a signed URL directly using the key that S3 itself provided in the SQS event message, it wouldn't have worked. One must do this string replacement and URL decoding on the key in order to get a proper working URL.
Not sure if this is by design or a bug.
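Putting it together, here is a sketch of the decode step inside a Lambda handler before signing; the handler shape is the standard S3 event structure, and the bucket name and expiry are placeholders:

// Decode the form-urlencoded key from the S3 event before storing it or signing with it.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  const rawKey = event.Records[0].s3.object.key;              // e.g. "41jQnjTkg/Artist+-+Song.mp3"
  const key = decodeURIComponent(rawKey.replace(/\+/g, ' ')); // "41jQnjTkg/Artist - Song.mp3"

  return s3.getSignedUrl('getObject', {
    Bucket: 'my-bucket', // placeholder
    Key: key,
    Expires: 900,        // placeholder expiry in seconds
  });
};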
The second file's name is coming to you form-urlencoded. The + is actually a space, and if you had other characters (like parentheses) they would be percent-escaped. You need to run your data through a URL decoder before working with it further.
Side-note: if the only thing your Lambda function does is create an SQS message, you can do that directly from S3 without writing your own function.