How to specify custom metadata in resumable upload (via XML API)? - google-cloud-platform

I am following steps for resumable upload outlined here.
According to documentation custom metadata has to be specified in first POST and is to be passed via x-goog-meta-* headers. I.e.:
x-goog-meta-header1: value1
x-goog-meta-header2: value2
... etc
But in my testing all these values disappear. After the final PUT, the object shows up in the bucket with the proper content-type but without a single piece of custom metadata.
What am I doing wrong?
P.S. It is rather suspicious that JSON API in resumable upload takes metadata as payload of first POST...
P.P.S. I am performing resumable upload via XML API described here (only using C++ code instead of curl utility). Adding x-goog-meta-mykey: myvalue header has no effect on object's custom metadata.

If you replace AWS4-HMAC-SHA256 in the Authorization header with GOOG4-HMAC-SHA256, it works. GCS uses this value as a "should I expect x-amz- or x-goog- headers?" switch. The problem is that a resumable upload requires x-goog-resumable, and adding x-amz-meta-* headers alongside it causes the request to fail with a message about mixing x-goog- and x-amz- headers.
I also went ahead and changed a few other aspects of the signature, namely:
request type: aws4_request -> goog4_request
signing key: AWS4 -> GOOG4 (GOOG1 works too)
service name: s3 -> storage (even though in some errors GCS asks for either s3 or storage to be specified here, it also takes gcs and maybe other values)
... but this doesn't seem to be necessary. I did it just for consistency.
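For anyone who wants to see where those three values end up, here is a minimal sketch of the V4-style signing key derivation, assuming the usual HMAC-SHA256 chain (date, then region, then service, then request type). The function name, the "auto" region, and the placeholder secret are my own illustration and not something from the original post.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of msg under key."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret: str, date_stamp: str, region: str,
                       prefix: str = "GOOG4",
                       service: str = "storage",
                       request_type: str = "goog4_request") -> bytes:
    """Derive a V4-style signing key.

    The defaults are the GCS-flavoured values from the answer above;
    pass prefix="AWS4", service="s3", request_type="aws4_request"
    to get the AWS-flavoured key instead.
    """
    k_date = _hmac_sha256((prefix + secret).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, request_type)


if __name__ == "__main__":
    # "my-hmac-secret" and "auto" are placeholders (assumptions).
    key = derive_signing_key("my-hmac-secret", "20240101", "auto")
    print(key.hex())
```

The resulting key is then used to HMAC the string-to-sign. The algorithm name in the Authorization header (GOOG4-HMAC-SHA256) is the part that actually flips the header switch described above.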

Related

How to facilitate downloading both CSV and PDF from API Gateway connected to S3

In the app I'm working on, we have a process whereby a user can download a CSV or PDF version of their data. The generation works great, but I'm trying to get it to download the file and am running into all sorts of problems. We're using API Gateway for all the requests, and the generation happens inside a Lambda on a POST request. The GET endpoint takes in a file_name parameter and then constructs the path in S3 and then makes the request directly there. The problem I'm having is when I'm trying to transform the response. I get a 500 error and when I look at the logs, it says Execution failed due to configuration error: Unable to transform response. So, clearly that's where I've spent most of my time. I've tried at least 50 different iterations of templates and combinations with little success. The closest I've gotten is the following code, where the CSV downloads fine, but the PDF is not a valid PDF anymore:
CSV:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$input.body
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
PDF:
#set($contentDisposition = "attachment;filename=${method.request.querystring.file_name}")
$util.base64Encode($input.body)
#set($context.responseOverride.header.Content-Disposition = $contentDisposition)
where contentHandling = CONVERT_TO_TEXT. My binaryMediaTypes just has application/pdf and that's it. My goal is to get this working without having to offload the problem into a Lambda so we don't have that overhead at the download step. Any ideas how to do this right?
Just as another comment, I've tried CONVERT_TO_BINARY and just leaving it as Passthrough. I've tried it with text/csv as another binary media type and I've tried different combinations of encoding and decoding base64 and stuff. I know the data is coming back right from S3, but the transformation is where it's breaking. I am happy to post more logs if need be. Also, I'm pretty sure this makes sense on StackOverflow, but if it would fit in another StackExchange site better, please let me know.
Resources I've looked at:
https://docs.aws.amazon.com/apigateway/latest/developerguide/request-response-data-mappings.html
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-mapping-template-reference.html#util-template-reference
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-workflow.html
https://docs.amazonaws.cn/en_us/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-control-service-api.html.
(But they're all so confusing...)
EDIT: One Idea I've had is to do CONVERT_TO_BINARY and somehow base64 encode the CSVs in the transformation, but I can't figure out how to do it right. I keep feeling like I'm misunderstanding the order of things, specifically when the "CONVERT" part happens. If that makes any sense.
EDIT 2: So, I got rid of the $util.base64Encode in the PDF one and now I have a PDF that's empty. The actual file in S3 definitely has things in it, but for some reason CONVERT_TO_TEXT is not handling it right or I'm still not understanding how this all works.
I had similar issues. One major factor is the Accept header. I was testing in Chrome, which sends an Accept header like text/html,application/xhtml.... API Gateway ignores everything except the first entry (text/html). It then converts any response from S3 to base64 to try to conform to text/html.
Finally, after trying everything else, I tested via Postman, which defaults the Accept header to */*. Also set your content handling on the Integration Response to Passthrough. With that, everything worked!
One other thing is to pass the Content-Type and Content-Length headers through (add them in the Method Response first and then in the Integration Response):
Content-Length integration.response.header.Content-Length
Content-Type integration.response.header.Content-Type

CSV Export using Api Gateway and Lambda

What I would like to do:
What I would like to do is have a URL which returns a CSV file to the caller, essentially an export of data. I would like this to remain a serverless solution.
What I have done:
I have created an AWS API Gateway with the URL I want. I have created a Lambda that queries the database and creates a CSV string of that data. That data is placed in a JSON object and returned. API Gateway then gets the CSV data from the JSON object and returns CSV to the caller with appropriate headers to indicate that it is a CSV and an attachment. Testing from the browser, I get the download automatically, just like I intended.
The problem I see:
This works well until there is a sizable amount of data at which point I start getting "body size is too long".
My attempts to resolve:
I did some googling around and I see others have had similar issues. In one solution I saw, they return a link to the file that they created. This solution seems viable for them because they had a server. For my serverless architecture it seems to be a little trickier. I could store the file in S3, but then I would have to return a link to S3. That seems like it could work, but it doesn't feel right, like I'm missing a configuration option. It also feels like I'm exposing the implementation by returning the S3 URLs.
I have looked around for tutorials and examples of people doing similar things and I haven't found any.
My Questions:
Is there a way to do this?
Is there another solution that I don't know of?
How do I return a larger file, in this case a CSV, from API Gateway?
There is a limit of 6 MB for AWS Lambda response payloads. If the files you need to serve are larger than that, you won't be able to serve them directly from Lambda.
Using S3 to store and serve the files is the standard way of doing something like this. I would leave the S3 bucket private and generate S3 Pre-signed URLs in the Lambda function. That will limit the time that the CSV file is available for download, and it will prevent people from being able to guess the URLs of files you are serving. You would use an S3 Lifecycle Policy to archive or delete the files after a period of time.

Amazon S3 Object Lifecycle Management via header

I've been searching for an answer to this question for quite some time but apparently I'm missing something.
I use s3cmd heavily to automate document uploads to AWS S3, via script. One of the parameters that can be used in s3cmd is --add-header, which I assume allows for lifecycle rules to be added.
My objective is to add this parameter and specify a +X (where X is days) to the upload. In the event of ... --add-header=...1 ... the lifecycle rule would delete this file after 24h.
I know this can be easily done via the console, but I would like to have a more detailed control over individual files/scripts.
I've read the parameters that can be passed to S3 via s3cmd, but I somehow can't understand how to put all of those together to get the intended result.
Thank you very much for any help or assistance!
The S3 API itself does not implement support for any request header that triggers lifecycle management at the object level.
The --add-header option for s3cmd can add headers that S3 understands, such as Content-Type, but there is no lifecycle header you can send using any tool.
You might be thinking of this:
If you make a GET or a HEAD request on an object that has been scheduled for expiration, the response will include an x-amz-expiration header that includes this expiration date and the corresponding rule Id
https://aws.amazon.com/blogs/aws/amazon-s3-object-expiration/
This is a response header, and is read-only.

AWS S3 Upload Integrity

I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
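As a small illustration of that header value, here is a sketch of how it can be computed with the Python standard library. The helper name content_md5 is just for illustration.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the Content-MD5 header value for a request body.

    This is the base64 encoding of the raw 16-byte MD5 digest,
    not of the hex string.
    """
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


if __name__ == "__main__":
    print(content_md5(b"example request body"))
```

A common mistake is to base64-encode the hex digest instead of the raw digest, which produces a value S3 will reject.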
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
For multipart uploads, the Content-MD5 header is set to the same value, but for each part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenated binary-encoded (raw bytes) MD5 hashes of each part, plus - plus the number of parts.
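To make the multipart rule concrete, here is a sketch that computes the expected ETag locally from the same parts you uploaded. It assumes you know the exact part boundaries used for the upload and that no server-side encryption mode changes the ETag; the function name is hypothetical.

```python
import hashlib
from typing import Iterable


def multipart_etag(parts: Iterable[bytes]) -> str:
    """Compute the ETag S3 is described as assigning to a multipart object.

    Each part is hashed with MD5, the raw (binary) digests are
    concatenated, that concatenation is hashed with MD5 again, and
    "-<number of parts>" is appended to the hex digest.
    """
    digests = [hashlib.md5(part).digest() for part in parts]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


if __name__ == "__main__":
    # Two tiny stand-in parts; real parts (except the last) are at least 5MB.
    print(multipart_etag([b"first part", b"second part"]))
```

Comparing this locally computed value with the ETag returned by the final "complete" request gives you an end-to-end check that does not rely on echoing back whatever S3 lists.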
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 about the parts you've uploaded, and some lazy developers will just ask S3 for this list and then send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must have a request header x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
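For reference, a minimal sketch of producing that header value. The helper name is made up for illustration, and the rest of the V4 signing process is not shown.

```python
import hashlib


def payload_hash_header(payload: bytes) -> dict:
    """Return the x-amz-content-sha256 header for a request payload.

    The value is the lowercase hex SHA-256 digest of the exact bytes
    that will be sent as the request body.
    """
    return {"x-amz-content-sha256": hashlib.sha256(payload).hexdigest()}


if __name__ == "__main__":
    print(payload_hash_header(b"example payload"))
```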
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
Note that Content-MD5 and x-amz-content-sha256 are request headers. They are not returned with downloads. If you want to keep this information, you have to save it in the object metadata yourself, as I described above.
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer if you use an instance with a public IPv4 or IPv6 address, a NAT instance, an S3 VPC endpoint, or through an IPv6 egress gateway. (You'll be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway). Obviously there are ways to automate this, but manually, if you select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can perform an MD5 checksum locally, and then verify that against the MD5 checksum of the object on S3 to ensure data integrity. Here is a guide
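A rough sketch of that local check, assuming a single-part upload without an encryption mode that changes the ETag (for multipart objects the ETag is not a plain MD5, as explained above). Function names are illustrative only.

```python
import hashlib
import io
from typing import BinaryIO


def stream_digest(stream: BinaryIO, algorithm: str = "md5",
                  chunk_size: int = 1024 * 1024) -> str:
    """Hash a file-like object in chunks so large files never sit in memory."""
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def matches_single_part_etag(stream: BinaryIO, etag: str) -> bool:
    """Compare a local MD5 against an ETag, ignoring the surrounding quotes."""
    return stream_digest(stream) == etag.strip('"').lower()


if __name__ == "__main__":
    data = b"backup contents"
    etag = '"' + hashlib.md5(data).hexdigest() + '"'  # stand-in for the real ETag
    print(matches_single_part_etag(io.BytesIO(data), etag))
```

For a real file you would pass open(path, "rb") instead of the in-memory stream.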

Limit Size Of Objects While Uploading To Amazon S3 Using Pre-Signed URL

I know of limiting the upload size of an object using this method: http://doc.s3.amazonaws.com/proposals/post.html#Limiting_Uploaded_Content
But I would like to know how it can be done while generating a pre-signed URL using the S3 SDK on the server side as an IAM user.
This method in the SDK has no such option in its parameters: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#putObject-property
Neither in this:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getSignedUrl-property
Please note: I already know of this answer: AWS S3 Pre-signed URL content-length and it is NOT what I am looking for.
The V4 signing protocol offers the option to include arbitrary headers in the signature. See:
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
So, if you know the exact Content-Length in advance, you can include that in the signed URL. Based on some experiments with CURL, S3 will truncate the file if you send more than specified in the Content-Length header. Here is an example V4 signature with multiple headers in the signature
http://docs.aws.amazon.com/general/latest/gr/sigv4-add-signature-to-request.html
You may not be able to limit content upload size ex-ante, especially considering POST and Multi-Part uploads. You could use AWS Lambda to create an ex-post solution. You can set up a Lambda function to receive notifications from the S3 bucket, have the function check the object size, and have the function delete the object or take some other action.
Here's some documentation on Handling Amazon S3 Events Using the AWS Lambda.
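The size check in such a function could look roughly like the sketch below. It only picks the oversized objects out of an S3 event notification; the actual delete call (for example through an AWS SDK client) is deliberately left out. The 10 MB limit and the function name are assumptions for illustration.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus

MAX_BYTES = 10 * 1024 * 1024  # hypothetical limit; pick your own


def oversized_objects(event: dict, max_bytes: int = MAX_BYTES) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event whose size exceeds max_bytes."""
    found = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        size = s3_info.get("object", {}).get("size", 0)
        if size > max_bytes:
            bucket = s3_info["bucket"]["name"]
            # Object keys arrive URL-encoded in event notifications.
            key = unquote_plus(s3_info["object"]["key"])
            found.append((bucket, key))
    return found


def handler(event, context):
    """Lambda-style entry point; deleting the objects is left to the caller."""
    to_delete = oversized_objects(event)
    # An SDK delete call for each (bucket, key) pair would go here.
    return {"oversized": to_delete}
```

Keep in mind this runs after the object has already been stored, so it cleans up rather than prevents.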
For any other wanderers that end up on this thread - if you set the Content-Length attribute when sending the request from your client, there are a few possibilities:
The Content-Length is calculated automatically, and S3 will store up to 5GB per file
The Content-Length is manually set by your client, which means one of these three scenarios will occur:
The Content-Length matches your actual file size and S3 stores it.
The Content-Length is less than your actual file size, so S3 will truncate your file to fit it.
The Content-Length is larger than your actual file size, and you will receive a 400 Bad Request
In any case, a malicious user can override your client and manually send an HTTP request with whatever headers they want, including a much larger Content-Length than you may be expecting. Signed URLs do not protect against this! The only way is to set up a POST policy. Official docs here: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
More details here: https://janac.medium.com/sending-files-directly-from-client-to-amazon-s3-signed-urls-4bf2cb81ddc3?postPublishedType=initial
Alternatively, you can have a Lambda that automatically deletes files that are larger than expected.
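For the POST policy route mentioned above, here is a minimal sketch of building the policy document with a content-length-range condition. It covers only the policy itself: a complete V4 POST also needs the algorithm, credential and date fields in the conditions plus a signature over the encoded policy, which are omitted here. The function name and parameters are illustrative.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def build_post_policy(bucket: str, key: str, max_bytes: int,
                      expires_in_seconds: int = 300, now=None) -> str:
    """Return a base64-encoded POST policy that caps the upload size."""
    now = now or datetime.now(timezone.utc)
    expiration = now + timedelta(seconds=expires_in_seconds)
    policy = {
        "expiration": expiration.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        "conditions": [
            {"bucket": bucket},
            {"key": key},
            # Reject uploads smaller than 0 or larger than max_bytes.
            ["content-length-range", 0, max_bytes],
        ],
    }
    return base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # "my-bucket" and the key are placeholders (assumptions).
    print(build_post_policy("my-bucket", "uploads/report.csv", 5 * 1024 * 1024))
```

Unlike a signed Content-Length header, the range condition is enforced by S3 on the form upload itself, so a modified client cannot exceed it.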