Does content encoding only compress for transmission, or also for persistent storage in S3? - amazon-web-services

Does 'Content-Encoding': 'gzip' reduce the file size in AWS S3?
Documentation says: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding
This lets the recipient know how to decode the representation in order to obtain the original payload format.
Does it mean that if a file / image / GIF is sent to S3, AWS will decode it and save it in decoded form? Or will S3 store it compressed and serve it compressed as well?
In my case we store GIFs in S3, and they have to be smaller than 8 MB, so we need some kind of compression.

The header is applied on the fly when the object is served; S3 doesn't decode or re-encode anything. You can remove and re-add the Content-Encoding (or Content-Type) header on the S3 object and check the file size on S3; it will be the same in both cases. S3 stores and serves exactly the bytes you upload.
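So if you want the stored object itself to be smaller, you have to compress it before uploading and set the header so clients know how to decode it. A minimal sketch, assuming boto3 and placeholder bucket/key names:
import gzip
import boto3

s3 = boto3.client('s3')

# Compress the payload yourself; S3 stores exactly these bytes.
with open('animation.gif', 'rb') as f:
    compressed = gzip.compress(f.read())

s3.put_object(
    Bucket='my-bucket',        # placeholder bucket name
    Key='animation.gif',
    Body=compressed,
    ContentType='image/gif',
    ContentEncoding='gzip',    # tells clients to gunzip on download
)
Keep in mind that GIF is already a compressed format, so gzip usually gains very little; re-encoding the GIF itself (fewer frames, smaller palette, lower resolution) is what typically gets it under a size limit.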

Related

AWS s3 object boto3, how to stream upload while streaming download

I have a function that gets an object from one bucket and uploads it to another bucket. My file sizes are unpredictable, so what I do is allocate more memory than I need most of the time.
Ideally what I want to do is stream the download/upload so I do not have to give it more memory than what it needs.
1) Stream download from bucketA (a chunk at a time)
2) Stream upload to bucketB
3) Remove the uploaded chunk from the buffer
4) Repeat step 1 until all chunks have been transferred
This way, I'm only buffering the chunk size during the whole process.
So far I know that streaming download is possible
response = s3.get_object(Bucket='bucket-name', Key=file)
for i, line in enumerate(response['Body'].iter_lines()):
    # upload line by line
How do I upload per "line" with put_object while also validating integrity with an MD5 hash?
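A minimal sketch of the chunked copy described above, assuming boto3's multipart upload API; the bucket names, key, and 8 MB part size are placeholders. Each part carries its base64-encoded MD5 via ContentMD5, so S3 rejects a corrupted part instead of storing it:
import base64
import hashlib
import boto3

s3 = boto3.client('s3')
PART_SIZE = 8 * 1024 * 1024  # parts must be at least 5 MB (except the last one)

def stream_copy(src_bucket, src_key, dst_bucket, dst_key):
    body = s3.get_object(Bucket=src_bucket, Key=src_key)['Body']
    upload = s3.create_multipart_upload(Bucket=dst_bucket, Key=dst_key)
    parts, part_number = [], 1
    try:
        while True:
            chunk = body.read(PART_SIZE)   # only one chunk in memory at a time
            if not chunk:
                break
            md5_b64 = base64.b64encode(hashlib.md5(chunk).digest()).decode()
            resp = s3.upload_part(Bucket=dst_bucket, Key=dst_key,
                                  UploadId=upload['UploadId'],
                                  PartNumber=part_number, Body=chunk,
                                  ContentMD5=md5_b64)
            parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
            part_number += 1
        s3.complete_multipart_upload(Bucket=dst_bucket, Key=dst_key,
                                     UploadId=upload['UploadId'],
                                     MultipartUpload={'Parts': parts})
    except Exception:
        s3.abort_multipart_upload(Bucket=dst_bucket, Key=dst_key,
                                  UploadId=upload['UploadId'])
        raise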

AWS Transcribe is not recognizing the media format of my file correctly

I'm using a lambda function to receive a bytes array of audio data, save it as mp3, store it in S3, and then use the S3 object to start a Transcribe job.
Everything's been processed correctly. I can see the .mp3 file in S3. I've also downloaded it to my local machine and played it, and it plays correctly as mp3.
However, when I start the transcription job I get back an error:
The media format that you specified doesn't match the detected media format. Check the media format and try your request again.
This is my call to start the AWS Transcribe job:
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='mp3',
    LanguageCode='en-US'
)
Any idea what may be causing this?
Cheers!
MP3 is an encoded (compressed) format; if you just save a raw byte array with an .mp3 extension, that doesn't make it MP3. You can use soxi to validate audio files: http://sox.sourceforge.net/soxi.html
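As a rough sanity check before saving, you can look at the first bytes of the payload; real MP3 data normally starts with an "ID3" tag or an MPEG frame-sync byte. A small sketch (the helper name is hypothetical):
def looks_like_mp3(data: bytes) -> bool:
    # MP3 files usually begin with an ID3 tag ("ID3") or an MPEG frame sync:
    # 0xFF followed by a byte whose top three bits are set.
    if data[:3] == b'ID3':
        return True
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
If the bytes turn out to be raw PCM or WAV data, you need to encode them to MP3 (or tell Transcribe the actual format) rather than just saving them with an .mp3 extension.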

Using FileGetMimeType() with uploads to Amazon S3

I have so far allowed users to upload images to my server and then used CF's FileGetMimeType() function to determine if the MIME type is valid (e.g. jpg).
The problem is that FileGetMimeType() wants a full path to the file on the server to work. Amazon S3 is just a URL of where the image is stored. In order to get FileGetMimeType() to work, I have to first upload the image to Amazon S3 then download it again using CFHTTP and then determine the file type. This seems way less efficient than the old way.
So why not just upload to my own server first, determine the MIME type, and then upload to S3, right? I can't do that because some of these files are going to be huge, with thousands of users uploading at the same time. We're talking videos as well as images.
Is there an efficient way to upload files to an external server i.e. Amazon S3 and then get the MIME type somehow without having to download the file all over again? Can it be done on S3's end?
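One thing worth knowing: S3 never inspects the file contents; it just stores whatever Content-Type was supplied at upload, and you can read that back with a HEAD request instead of downloading the object. For illustration, a sketch in Python/boto3 (bucket and key are placeholders); to verify against the actual bytes, a ranged GET of the first few bytes is enough to check the magic number:
import boto3

s3 = boto3.client('s3')

# The Content-Type S3 stored at upload time (only as trustworthy as the uploader).
head = s3.head_object(Bucket='my-bucket', Key='uploads/photo.jpg')
print(head['ContentType'])

# Fetch just the first few bytes and check the magic number
# (JPEG files start with 0xFF 0xD8 0xFF).
first_bytes = s3.get_object(Bucket='my-bucket', Key='uploads/photo.jpg',
                            Range='bytes=0-3')['Body'].read()
print(first_bytes.startswith(b'\xff\xd8\xff'))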

How to enable gzip compression on AWS CloudFront

I'm trying to gzip compress the images I'm serving through CloudFront. My origin is S3.
Based on several articles/blogs on AWS, what I did is:
1) Set the "Content-Length" header for the object I want to compress. I set the value equal to the size shown in the size property box.
2) Set the Compress Objects Automatically value to Yes in the behavior settings of my CloudFront distribution.
3) Invalidated my object to get a fresh copy from S3.
Still, I'm not able to make CloudFront gzip my object. Any ideas?
I'm trying to gzip compress the [image]
You don't typically need to gzip images -- doing so saves very little bandwidth, if any, since virtually all image formats used on the web are already compressed.
Also, CloudFront doesn't support it.
See File Types that CloudFront Compresses for the supported file formats. They are text-based formats, which tend to benefit substantially from gzip compression.
If you really want the files served gzipped, you can store the files in S3, already gzipped.
$ gzip -9 myfile.png
This will create a gzipped file myfile.png.gz.
Upload the file to S3 without the .gz on the end. Set the Content-Encoding: header to gzip and set the Content-Type: header to the normal, correct value for the file, such as image/png.
This breaks any browser that doesn't understand Content-Encoding: gzip, but there should be no browsers in use that have that limitation.
Note that the -9, above, means maximum compression.
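For reference, a minimal sketch of that upload with boto3 (the bucket name is a placeholder); the gzipped bytes are stored under the original key, and the headers tell clients how to decode them:
import boto3

s3 = boto3.client('s3')

# myfile.png.gz was produced by `gzip -9 myfile.png`
with open('myfile.png.gz', 'rb') as f:
    s3.put_object(
        Bucket='my-bucket',          # placeholder
        Key='myfile.png',            # uploaded without the .gz suffix
        Body=f,
        ContentType='image/png',     # the real type of the decoded content
        ContentEncoding='gzip',      # clients will gunzip transparently
    )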
If you're trying to gzip jpegs/pngs, I would suggest that you first compress them online with a tool such as https://tinyjpg.com/
Ideally you won't need to compress the images further after that. Image optimization tools work better than gzip -9 because they take the image's textures, colors, and patterns into account.
Also, make sure that you save your files in the proper formats (photographic images as JPG, graphics and flat-color images as PNG); this will help reduce the size of the images.

AWS S3 Upload Integrity

I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
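For illustration, a minimal sketch of setting that header with boto3 (bucket and file names are placeholders); put_object's ContentMD5 parameter takes the base64-encoded MD5 digest of the body:
import base64
import hashlib
import boto3

s3 = boto3.client('s3')

with open('backup.tar.gz', 'rb') as f:   # placeholder file
    body = f.read()

md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode()
s3.put_object(Bucket='my-backups', Key='backup.tar.gz',   # placeholder bucket
              Body=body, ContentMD5=md5_b64)
# If the bytes S3 receives don't hash to this value, the request fails
# with a BadDigest error instead of storing corrupt data.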
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
For multipart uploads, the Content-MD5 header works the same way, but it applies to each individual part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenation of the binary (raw-byte) MD5 hashes of each part, followed by a hyphen and the number of parts.
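For reference, a sketch of reproducing that multipart ETag locally, assuming you know the part size that was used for the upload (the 8 MB default here is just an example):
import hashlib

def expected_multipart_etag(path, part_size=8 * 1024 * 1024):
    # MD5 of the concatenated raw MD5 digests of each part, plus "-<part count>"
    part_digests = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b''.join(part_digests)).hexdigest()
    return '{}-{}'.format(combined, len(part_digests))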
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 for the list of parts you've uploaded, and some lazy developers will just ask S3 for that list and send it right back. The documentation warns against this, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
If you are using Signature Version 4 (which is also optional in older regions), there is an additional integrity mechanism, and this one isn't optional if you're actually using V4: uploads must include the request header x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request is denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
Note that Content-MD5 and x-amz-content-sha256 are request headers; they are not returned with downloads. If you want that information to be available later, you have to save it yourself, for example in the object metadata as described above.
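A sketch of that idea with boto3 (the metadata keys are hypothetical); user metadata is stored with the object under the x-amz-meta- prefix and comes back on every GET or HEAD:
import hashlib
import boto3

s3 = boto3.client('s3')

with open('backup.tar.gz', 'rb') as f:
    body = f.read()

s3.put_object(Bucket='my-backups', Key='backup.tar.gz', Body=body,
              Metadata={'sha256': hashlib.sha256(body).hexdigest(),
                        'bytes': str(len(body))})

# Later, the checksum travels with the object:
meta = s3.head_object(Bucket='my-backups', Key='backup.tar.gz')['Metadata']
print(meta['sha256'])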
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer if you access S3 through an instance with a public IPv4 or IPv6 address, a NAT instance, an S3 VPC endpoint, or an egress-only internet gateway (IPv6). (You'll be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway.) Obviously there are ways to automate this, but to do it manually, select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can perform an MD5 checksum locally, and then verify that against the MD5 checksum of the object on S3 to ensure data integrity. Here is a guide