CRC32C checksum for HTTP Range Get requests in google cloud storage - google-cloud-platform

When I want to get a partial range of file content from Google Cloud Storage, I use the XML API with HTTP Range GET requests. In the response I can find the x-goog-hash header, which contains CRC32C and MD5 checksums, but those checksums are calculated over the whole file. What I need is the CRC32C checksum of the partial range of content in the response, so that I can verify the data; without it I cannot validate the response.

I was wondering: are the files stored in your bucket in gzip format? I read here (Using Range Header on gzip-compressed files) that you can't get partial content from a compressed file; by default you get the whole file.
Anyway, could you share the request you're sending?

I looked for more information and found this: Request Headers and Cloud Storage.
It says that when you use the Range header, the returned checksum will cover the whole file.
So far, there's no way to get the checksum for a byte range alone using the XML API.
However, you could compute it yourself: split the file (or download the byte range) with your preferred programming language and calculate the checksum for that part.
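For example, a minimal Python sketch of that approach, assuming the google-crc32c package and a valid OAuth token (the variable names are illustrative): download the byte range and compute its CRC32C locally. You can only compare it against a value you compute yourself, because x-goog-hash always describes the whole object.

import base64
import requests
import google_crc32c  # pip install google-crc32c

def crc32c_of_range(object_url, start, end, access_token):
    # Fetch bytes [start, end] of the object via the XML API.
    resp = requests.get(
        object_url,
        headers={
            "Range": f"bytes={start}-{end}",
            "Authorization": f"Bearer {access_token}",
        },
    )
    resp.raise_for_status()
    # CRC32C of just the returned range, encoded the same way GCS encodes
    # x-goog-hash (base64 of the 4-byte big-endian checksum).
    checksum = google_crc32c.Checksum()
    checksum.update(resp.content)
    return base64.b64encode(checksum.digest()).decode("ascii")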

Related

Problem in uploading multipart Amazon S3 Rest API using PostMan

I am trying to use the Amazon S3 REST API to upload a large file in chunks.
As per the API documentation here, I formed my request in Postman as follows.
After Initiating CreateMultipartUpload Post Request, I'm successfully getting UploadId for chunk put requests.
This is Working Fine.
I understand the multipart order, but when executing the step that uploads the partNumber & uploadId chunks in Postman, I'm getting a SHA256Mismatch error because Postman calculates Content-MD5 for the whole file, not for each chunk.
(Screenshots of the Postman Headers, Params, and Body tabs omitted.)
I have found multiple solutions on various forums but those solutions didn't work.
Am I missing something here?
I successfully uploaded to Amazon S3 with Postman using Amazon's multipart upload, and the key (for me) was to add a Content-MD5 header manually and paste in the Base64-encoded MD5 hash for that part (details below). This may not be the exact problem the OP was having, but I still wanted to share how to do this with Postman, provided you have a working IAM key id and secret for your Amazon S3 bucket.
First, I split a 9 MB "mytest.pdf" file into two parts to test this (I used the Linux/WSL command: split -b 5242880 mytest.pdf), making sure the first part is larger than 5 MB (the last part can be smaller than 5 MB).
Next, I set up Postman with the following four requests:
CreateMultipartUpload (e.g., POST
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?uploads)
UploadPart1 (e.g., PUT
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?partNumber=1&uploadId=297a2XMl9kNDqw1BaKl7jk6uK_Mop0mCV68TmWU2n8xjsrM6sgt0hu.93J92Qgw8yaEHlrlj0MSoc9ljmU3sD3dlQsGJixMq9hugPDRTkikM0KV6rmLdpmHjFcWzDEDO)
UploadPart2 (e.g., PUT
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?partNumber=2&uploadId=297a2XMl9kNDqw1BaKl7jk6uK_Mop0mCV68TmWU2n8xjsrM6sgt0hu.93J92Qgw8yaEHlrlj0MSoc9ljmU3sD3dlQsGJixMq9hugPDRTkikM0KV6rmLdpmHjFcWzDEDO)
CompleteMultipartUpload (e.g., POST
https://{{mybucket}}.s3.{{myregion}}.amazonaws.com/mytest.pdf?uploadId=297a2XMl9kNDqw1BaKl7jk6uK_Mop0mCV68TmWU2n8xjsrM6sgt0hu.93J92Qgw8yaEHlrlj0MSoc9ljmU3sD3dlQsGJixMq9hugPDRTkikM0KV6rmLdpmHjFcWzDEDO)
I pasted my IAM key id and secret access key into the authorization section of Postman (there are separate articles for that).
Ran the CreateMultipartUpload POST to get the UploadId from Amazon.
Next, I calculated the MD5 hash of each of my two file parts (I used 7-Zip, but use a tool of your choice) and converted that result to Base64. However, I had to make sure I ended up with 22 characters followed by two equals signs (==). When I converted the MD5 as text to Base64, I ended up with a longer string that ended with a single equals sign, not two; this is a sign that it is not encoded the way Amazon expects and will produce an InvalidDigest error. For example, if you use 7-Zip to calculate the MD5 hash of a file part and get the value 58942651efd0f5886810d04ed9df502f, and then use a tool such as the Base64 encoder below with "Text" as the input, you get NTg5NDI2NTFlZmQwZjU4ODY4MTBkMDRlZDlkZjUwMmY=; but if you choose "Hex" as the input, you get the shorter string WJQmUe/Q9YhoENBO2d9QLw==, which is 24 characters (22 characters plus 2 equals signs). That's the one you want: Amazon expects the Base64 encoding of the raw MD5 bytes, not of the hex text.
(no implied endorsement of this tool - not affiliated https://emn178.github.io/online-tools/base64_encode.html)
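In code, the distinction is simply "Base64 of the raw 16-byte digest" versus "Base64 of the 32-character hex string". A small Python sketch of the correct calculation (the file name matches one of the parts produced by split above):

import base64
import hashlib

def content_md5_for_part(path):
    # S3 expects Base64 of the raw MD5 digest (16 bytes -> 24 characters ending in "==").
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).digest()
    return base64.b64encode(digest).decode("ascii")

# Encoding the hex *text* instead of the raw bytes produces the longer,
# single-"=" string that triggers S3's InvalidDigest error.
print(content_md5_for_part("xaa"))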
If you get it wrong, Amazon will reply with an InvalidDigest error like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>InvalidDigest</Code>
<Message>The Content-MD5 you specified was invalid.</Message>
<Content-MD5>thisisbad</Content-MD5>
<RequestId>8274DC9566D4AAA8</RequestId>
<HostId>H6kSy4cl+54nMon1Hq6AGjmTX/MfTVMQQr8vEVNXUnPlfMtIt8HPdObfusckhBpwpG/CJ6ORWv16c=</HostId>
</Error>
Ran both UploadPart1 and UploadPart2.
Finally, ran CompleteMultipartUpload with the ETag values copied and pasted from the headers of the previous two requests:
<CompleteMultipartUpload>
<Part>
<PartNumber>1</PartNumber>
<ETag>"c716d98e83db1edb27fc25fd03e0ae32"</ETag>
</Part>
<Part>
<PartNumber>2</PartNumber>
<ETag>"58942651efd0f5886810d04ed9df502f"</ETag>
</Part>
</CompleteMultipartUpload>
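For reference, the same four-request flow can be sketched programmatically; a minimal boto3 sketch, assuming configured credentials, an existing bucket, and 5 MiB parts (except possibly the last):

import base64
import hashlib
import boto3

s3 = boto3.client("s3")
bucket, key = "mybucket", "mytest.pdf"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
with open("mytest.pdf", "rb") as f:
    for number, chunk in enumerate(iter(lambda: f.read(5 * 1024 * 1024), b""), start=1):
        # Content-MD5 per part: Base64 of the raw MD5 digest of this chunk only.
        md5_b64 = base64.b64encode(hashlib.md5(chunk).digest()).decode("ascii")
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
            PartNumber=number, Body=chunk, ContentMD5=md5_b64,
        )
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)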

How to facilitate downloading both CSV and PDF from API Gateway connected to S3

In the app I'm working on, we have a process whereby a user can download a CSV or PDF version of their data. The generation works great, but I'm trying to get it to download the file and am running into all sorts of problems. We're using API Gateway for all the requests, and the generation happens inside a Lambda on a POST request. The GET endpoint takes in a file_name parameter and then constructs the path in S3 and then makes the request directly there. The problem I'm having is when I'm trying to transform the response. I get a 500 error and when I look at the logs, it says Execution failed due to configuration error: Unable to transform response. So, clearly that's where I've spent most of my time. I've tried at least 50 different iterations of templates and combinations with little success. The closest I've gotten is the following code, where the CSV downloads fine, but the PDF is not a valid PDF anymore:
CSV:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$input.body
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
PDF:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$util.base64Encode($input.body)
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
where contentHandling = CONVERT_TO_TEXT. My binaryMediaTypes just has application/pdf and that's it. My goal is to get this working without having to offload the problem into a Lambda so we don't have that overhead at the download step. Any ideas how to do this right?
Just as another comment, I've tried CONVERT_TO_BINARY and just leaving it as Passthrough. I've tried it with text/csv as another binary media type and I've tried different combinations of encoding and decoding base64 and stuff. I know the data is coming back right from S3, but the transformation is where it's breaking. I am happy to post more logs if need be. Also, I'm pretty sure this makes sense on StackOverflow, but if it would fit in another StackExchange site better, please let me know.
Resources I've looked at:
https://docs.aws.amazon.com/apigateway/latest/developerguide/request-response-data-mappings.html
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-mapping-template-reference.html#util-template-reference
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-workflow.html
https://docs.amazonaws.cn/en_us/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-control-service-api.html.
(But they're all so confusing...)
EDIT: One idea I've had is to do CONVERT_TO_BINARY and somehow base64-encode the CSVs in the transformation, but I can't figure out how to do it right. I keep feeling like I'm misunderstanding the order of things, specifically when the "CONVERT" part happens. If that makes any sense.
EDIT 2: So, I got rid of the $util.base64Encode in the PDF one and now I have a PDF that's empty. The actual file in S3 definitely has things in it, but for some reason CONVERT_TO_TEXT is not handling it right, or I'm still not understanding how this all works.
I had similar issues. One major thing is the Accept header. I was testing in Chrome, which sends an Accept header of text/html,application/xhtml...; API Gateway ignores everything except the first value (text/html). It will then convert any response from S3 to base64 to try to conform to text/html.
At last, after trying everything else, I tried via Postman, which defaults the Accept header to */*. I also set content handling on the integration response to Passthrough, and everything worked!
One other thing is to pass the Content-Type and Content-Length headers through (add them in the method response first and then in the integration response):
Content-Length integration.response.header.Content-Length
Content-Type integration.response.header.Content-Type
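To illustrate the Accept header point, here is a small Python sketch of a client call; the endpoint URL and query parameter are hypothetical stand-ins for the API described above:

import requests

url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/download"  # hypothetical endpoint
resp = requests.get(
    url,
    params={"file_name": "report.pdf"},
    # Send an Accept value that matches one of the API's binaryMediaTypes so
    # API Gateway returns the S3 body as raw bytes instead of base64 text.
    headers={"Accept": "application/pdf"},
)
resp.raise_for_status()
with open("report.pdf", "wb") as f:
    f.write(resp.content)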

Online prediction with Data stored in Bucket

As per my understanding, online prediction works with JSON data. Currently I am running online prediction from localhost, where each image gets converted to JSON, and the ML Engine API uses that JSON from localhost for the prediction.
Internally, the ML Engine API might be uploading the JSON to the cloud for prediction.
Is there any way to run online prediction on JSON files already uploaded to a cloud bucket?
Internally we parse the input directly from the payload of the request for serving; we do not store the requests on disk. Reading inputs from Cloud Storage is currently not supported for online prediction. You may consider using batch prediction, which reads data from files stored in the cloud.
There is a small discrepancy in the inputs between online and batch prediction for a model that accepts only one string input (probably like your case). In that case, you must base64-encode the image bytes and put them in a JSON file for online prediction, while for batch prediction you need to pack the image bytes into records in TFRecord format and save them as tfrecord file(s). Other than that, the inputs are compatible.
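For the online case, the request body ends up looking something like the sketch below; it assumes the model's serving input is named image_bytes (adjust to your signature) and uses the standard {"b64": ...} convention for binary data:

import base64
import json

with open("image.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Body for the online prediction (projects.predict) REST call.
body = {"instances": [{"image_bytes": {"b64": encoded}}]}
with open("request.json", "w") as f:
    json.dump(body, f)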

AWS S3 Upload Integrity

I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
For multipart uploads, the Content-MD5 header is set to the same value, but for each part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenated binary (raw-byte) MD5 hashes of each part, followed by a hyphen and the number of parts.
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 about the parts you've uploaded, and some lazy developers will just ask S3 for this list and then send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
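If you want to check a multipart ETag yourself, the calculation can be reproduced locally; a sketch, assuming you know the part size that was actually used for the upload (8 MiB here is an assumption):

import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    # MD5 each part, then MD5 the concatenated raw digests and append "-<part count>".
    part_md5s = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(part_size), b""):
            part_md5s.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_md5s)).hexdigest()
    return f"{combined}-{len(part_md5s)}"

# Note: single-part (non-multipart) uploads just get the plain hex MD5 as the ETag.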
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must include a request header, x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
Note that Content-MD5 and x-amz-content-sha256 are request headers; they are not returned with downloads. If you want this information later, you have to save it yourself, for example in the object metadata as I described above.
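The answer above uses the raw REST API, but the same store-your-own-digest idea looks roughly like this with boto3 (the metadata key "sha256" is my own choice, not anything S3 defines; it is exposed to clients as x-amz-meta-sha256):

import hashlib
import boto3

s3 = boto3.client("s3")

def upload_with_sha256(path, bucket, key):
    with open(path, "rb") as f:
        data = f.read()
    # Store our own SHA-256 so it comes back with every GET/HEAD of the object.
    s3.put_object(Bucket=bucket, Key=key, Body=data,
                  Metadata={"sha256": hashlib.sha256(data).hexdigest()})

def verify_download(bucket, key):
    obj = s3.get_object(Bucket=bucket, Key=key)
    body = obj["Body"].read()
    return hashlib.sha256(body).hexdigest() == obj["Metadata"].get("sha256")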
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer if you use an instance with a public IPv4 or IPv6 address, a NAT instance, an S3 VPC endpoint, or through an IPv6 egress gateway. (You'll be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway). Obviously there are ways to automate this, but manually, if you select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can compute an MD5 checksum locally and then verify it against the MD5 checksum of the object on S3 to ensure data integrity.

Limit Size Of Objects While Uploading To Amazon S3 Using Pre-Signed URL

I know of limiting the upload size of an object using this method: http://doc.s3.amazonaws.com/proposals/post.html#Limiting_Uploaded_Content
But I would like to know how it can be done while generating a pre-signed URL using the S3 SDK on the server side as an IAM user.
This URL from the SDK has no such option in its parameters: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#putObject-property
Neither in this:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getSignedUrl-property
Please note: I already know of this answer: AWS S3 Pre-signed URL content-length, and it is NOT what I am looking for.
The V4 signing protocol offers the option to include arbitrary headers in the signature. See:
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
So, if you know the exact Content-Length in advance, you can include that in the signed URL. Based on some experiments with curl, S3 will truncate the file if you send more than specified in the Content-Length header. Here is an example of a V4 signature with multiple headers in the signature:
http://docs.aws.amazon.com/general/latest/gr/sigv4-add-signature-to-request.html
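A hedged boto3 sketch of that idea (ContentLength is a valid put_object parameter; whether your SDK version actually folds it into the signed headers of the pre-signed URL is something to verify yourself, much like the curl experiments above):

import boto3

s3 = boto3.client("s3")

# Pre-sign a PUT for an object of exactly 1024 bytes; bucket and key are placeholders.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "mybucket", "Key": "uploads/file.bin", "ContentLength": 1024},
    ExpiresIn=3600,
)
print(url)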
You may not be able to limit the content upload size ex ante, especially considering POST and multipart uploads. You could use AWS Lambda to create an ex-post solution: set up a Lambda function to receive notifications from the S3 bucket, have the function check the object size, and have it delete the object or take some other action.
Here's some documentation on Handling Amazon S3 Events Using AWS Lambda.
For any other wanderers who end up on this thread: if you set the Content-Length attribute when sending the request from your client, there are a few possibilities:
The Content-Length is calculated automatically, and S3 will store up to 5GB per file
The Content-Length is manually set by your client, which means one of these three scenarios will occur:
The Content-Length matches your actual file size and S3 stores it.
The Content-Length is less than your actual file size, so S3 will truncate your file to fit it.
The Content-Length is larger than your actual file size, and you will receive a 400 Bad Request
In any case, a malicious user can override your client and manually send an HTTP request with whatever headers they want, including a much larger Content-Length than you may be expecting. Signed URLs do not protect against this! The only way is to set up a POST policy. Official docs here: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
More details here: https://janac.medium.com/sending-files-directly-from-client-to-amazon-s3-signed-urls-4bf2cb81ddc3?postPublishedType=initial
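With boto3, the POST policy approach looks roughly like the sketch below (bucket, key, and limits are placeholders); the content-length-range condition is what actually caps the upload size:

import boto3

s3 = boto3.client("s3")

post = s3.generate_presigned_post(
    Bucket="mybucket",
    Key="uploads/file.bin",
    Conditions=[["content-length-range", 1, 10 * 1024 * 1024]],  # 1 byte up to 10 MiB
    ExpiresIn=3600,
)
# The client then uses post["url"] and post["fields"] in a multipart/form-data POST,
# e.g. with an HTML form or requests.post(url, data=fields, files={"file": ...}).
print(post["url"], post["fields"])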
Alternatively, you can have a Lambda that automatically deletes files that are larger than expected.