Force AWS EMR to unzip files in S3 - amazon-web-services

I have a bucket in AWS's S3 service that contains gzipped CSV files, however when they were stored they all were saved with the metadata Content-Type of text/csv.
Now I am using AWS EMR, which will not recognize them as gzipped files and decompress them. I've looked through the configuration options for EMR but don't see anything that would work... I have almost a million files, so fixing their metadata would require a Boto script that cycled through all of them and updated the value.
Am I missing something easy? Thanks!

The Content-Type isn't the problem... that's correct if the files are CSV, but since you stored them gzipped, you would also have needed to set Content-Encoding: gzip in the header metadata. Doing that "should" trigger the user agent that's fetching them to gunzip them on the fly when they are downloaded... so had you done that, it should have "just worked."
(I store gzipped log files this way, with Content-Type: text/plain and Content-Encoding: gzip and when you download them with a web browser, the file you get is no longer gzipped because the browser untwizzles the compression on the fly due to the Content-Encoding header.)
But, since you've already uploaded the files, I did find this in the google machine, which might help:
GZipped input. A lot of my input data had already been gzipped, but luckily if you pass -jobconf stream.recordreader.compression=gzip in the extra arguments section Hadoop will decompress them on the fly before passing the data to your mapper.
http://petewarden.typepad.com/searchbrowser/2010/01/elastic-mapreduce-tips.html
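If you do want to go back and fix the headers on the objects you've already uploaded, the Boto script doesn't need to download anything; here is a minimal boto3 sketch (bucket name and prefix are placeholders, and note that MetadataDirective='REPLACE' also drops any user metadata you don't re-specify) that copies each object onto itself with the corrected headers:

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'        # placeholder bucket name

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix='data/'):   # placeholder prefix
    for obj in page.get('Contents', []):
        # Copy each object onto itself, replacing the headers so that
        # Content-Encoding: gzip is set alongside Content-Type: text/csv.
        s3.copy_object(
            Bucket=bucket,
            Key=obj['Key'],
            CopySource={'Bucket': bucket, 'Key': obj['Key']},
            MetadataDirective='REPLACE',
            ContentType='text/csv',
            ContentEncoding='gzip',
        )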

Related

AWS S3: .csv file is downloaded as .txt

I have 2 AWS accounts, each of which has one S3 bucket. I uploaded two same-size .CSV files, one to each S3 bucket.
When I try to Download or Download As, the file from the first account is downloaded as .CSV. BUT(!!) when I try to download the file from the second account, it is downloaded as .TXT.
How can this happen? Both files were created in the same way: through a Redshift UNLOAD query that copies selected data from Redshift to S3.
UPDATE:
Can it be because, for this document in this account, Server side encryption is set to AWS-KMS?
I noticed that the file that comes down as .txt has "Server side encryption: AWS-KMS", while the .csv file that is downloaded as .csv has "Server side encryption: NONE".
UPDATE: tried in different browsers - same result
Check the headers for each object in the AWS S3 console and compare the Content-Type values. Content-Type provides a hint to web browsers on what data the object contains.
If Content-Type does not exist or does not contain text/csv, add or modify the header in the S3 console or via your favorite S3 application such as CloudBerry.
John is right about the Content-Type not being text/csv. Sometimes S3 will get it right and sometimes it won't. If you can't manually correct this yourself, you can run a Lambda function to do it for you every time you upload a new object. You can use the Python 2.7 template Lambda function to download the object from the bucket, use the mimetypes library's guess_type on the S3 object's key, and then re-upload the file to the same bucket. You will need to trigger this function on S3 object upload and give it the necessary permissions (s3:GetObject).
P.S. This will work for files with any extension. If you know you are only going to upload .csv files, you can skip mimetypes and directly re-upload the object with
bucket.upload_file(filename, key, ExtraArgs={'ContentType': 'text/csv'})
If mimetypes cannot guess the type, you might need to add the types yourself; see an example here: https://www.programcreek.com/python/example/5209/mimetypes.add_type
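For reference, a minimal sketch of such a handler (written for Python 3 rather than the Python 2.7 template; the S3-trigger wiring and names are assumptions):

import io
import mimetypes
import urllib.parse

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])

    guessed, _ = mimetypes.guess_type(key)
    if guessed is None:
        return  # unknown extension; consider mimetypes.add_type(...)

    # Only rewrite the object when the stored Content-Type actually differs;
    # otherwise the re-upload would retrigger this function in a loop.
    if s3.head_object(Bucket=bucket, Key=key).get('ContentType') == guessed:
        return

    # Download and re-upload with the guessed Content-Type, as described above.
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    s3.upload_fileobj(io.BytesIO(body), bucket, key,
                      ExtraArgs={'ContentType': guessed})

Note that the re-upload also needs s3:PutObject permission in addition to s3:GetObject.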
Good Luck!
Here is a Scala solution (specifying the content type explicitly when uploading):
import java.io.{ByteArrayInputStream, InputStream}
import com.amazonaws.services.s3.model.ObjectMetadata

val settingsLine: String = "csvdata1,csvdata2,csvdata3"
val settingsStream: InputStream = new ByteArrayInputStream(settingsLine.getBytes())
// Set Content-Type (and the length, so the SDK does not have to buffer the stream)
val metadata: ObjectMetadata = new ObjectMetadata()
metadata.setContentType("text/csv")
metadata.setContentLength(settingsLine.getBytes().length.toLong)
s3Client.putObject(bucketName, prefix, settingsStream, metadata)

Decompress a zip file in AWS Glue

I have a gzip-compressed file in an S3 bucket. The files will be uploaded to the bucket daily by the client. When uncompressed, the archive contains 10 CSV files, all with the same schema. I need to uncompress the file and, using a Glue crawler, create a schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the file and creating a data catalog? Is there any Glue library available that we can use directly in the Python ETL script? Or should I opt for a Lambda function or another utility, so that as soon as the file is uploaded a utility decompresses it and provides it as input to Glue?
Appreciate any replies.
Glue can do the decompression, but it wouldn't be optimal: the gzip format is not splittable, which means only one executor will work on it. More info about that here.
You can try decompressing with a Lambda function and then invoking a Glue crawler on the new folder.
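A minimal sketch of that Lambda route (the target prefix and crawler name are placeholders, and it assumes a plain gzip object small enough to decompress in memory):

import gzip
import urllib.parse

import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def lambda_handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])

    # Decompress the uploaded .gz object (whole payload in memory; use
    # gzip.GzipFile over the streaming body for very large files).
    data = gzip.decompress(s3.get_object(Bucket=bucket, Key=key)['Body'].read())

    # Write the uncompressed copy to the folder the crawler points at.
    filename = key.rsplit('/', 1)[-1]
    if filename.endswith('.gz'):
        filename = filename[:-3]
    s3.put_object(Bucket=bucket, Key='uncompressed/' + filename, Body=data)

    # Kick off the crawler on the new folder.
    glue.start_crawler(Name='my-crawler')   # placeholder crawler name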
Use glueContext.create_dynamic_frame.from_options and specify the compression type in the connection options. Similarly, output can also be compressed while writing to S3 (a write-side sketch follows the read snippet below). The snippet below worked for bzip; change the compression value to gzip and try.
I tried the Target Location in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, and I modified the generated code to read a compressed file from S3. This is not directly covered in the docs.
Not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame and back to a DynamicFrame for a 400 MB bzip-compressed CSV file. Please note that execution time is different from the start_time and end_time shown in the console.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    'csv',
    {
        'separator': ';'
    }
)
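For the write side mentioned above, a sketch along these lines (the output path is a placeholder) should produce gzip-compressed CSV output via the same 'compression' connection option:

glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type='s3',
    connection_options={
        'path': 's3://bucketname/output/',   # placeholder output path
        'compression': 'gzip'
    },
    format='csv',
    format_options={
        'separator': ';'
    }
)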
I've written a Glue Job that can unzip s3 files and put them back in s3.
Take a look at https://stackoverflow.com/a/74657489/17369563

gZIP with AWS cloudFront and S3

CloudFront offers compression (gzip) for certain file types from the origin; in my setup the origin is S3.
The requirements for files to get compressed by CloudFront are:
1. The Compress Objects Automatically option has to be enabled in CloudFront's cache behavior settings.
2. Content-Type and Content-Length have to be returned by S3. S3 sends these headers by default; I have cross-checked this.
3. The file type must be one of the file types listed by CloudFront. In my case, I want to compress app.bundle.js, whose Content-Type is application/javascript, which is in CloudFront's list of supported file types.
I guess the above are the only requirements for getting a gzipped version of the files to the browser. Even with all of the above in place, gzip does not work for me. Any ideas what I am missing?

How to enable gzip compression on AWS CloudFront

I'm trying to gzip-compress the image I'm serving through CloudFront. My origin is S3.
Based on several articles/blogs on AWS, what I did is:
1) Set the "Content-Length" header for the object I want to compress. I set the value equal to the size shown in the size property box.
2) Set the Compress Objects Automatically value to Yes in the behavior settings of my CloudFront distribution.
3) Invalidated my object to get a fresh copy from S3.
Still, I'm not able to make CloudFront gzip my object. Any ideas?
I'm trying to gzip compress the [image]
You don't typically need to gzip images -- doing so saves very little bandwidth, if any, since virtually all image formats used on the web are already compressed.
Also, CloudFront doesn't support it.
See File Types that CloudFront Compresses for the supported file formats. They are text-based formats, which tend to benefit substantially from gzip compression.
If you really want the files served gzipped, you can store the files in S3, already gzipped.
$ gzip -9 myfile.png
This will create a gzipped file myfile.png.gz.
Upload the file to S3 without the .gz on the end. Set the Content-Encoding: header to gzip and set the Content-Type: header to the normal, correct value for the file, such as image/png.
This breaks any browser that doesn't understand Content-Encoding: gzip, but there should be no browsers in use that have that limitation.
Note that the -9, above, means maximum compression.
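For what it's worth, a minimal boto3 sketch of that upload step (file, bucket and key names are placeholders): gzip the file locally, then upload it under its original name with both headers set:

import gzip
import shutil

import boto3

# Equivalent of `gzip -9 myfile.png`, keeping the original file around.
with open('myfile.png', 'rb') as src, \
        gzip.open('myfile.png.gz', 'wb', compresslevel=9) as dst:
    shutil.copyfileobj(src, dst)

# Upload without the .gz on the key, with Content-Encoding and Content-Type set.
s3 = boto3.client('s3')
s3.upload_file('myfile.png.gz', 'my-bucket', 'myfile.png',
               ExtraArgs={'ContentEncoding': 'gzip',
                          'ContentType': 'image/png'})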
If you're trying to gzip jpegs/pngs, I would suggest that you first compress them online with a tool such as https://tinyjpg.com/
Ideally you will not need to compress the images further. Compressing images with image-optimization tools works better than using gzip -9, as they take textures, colors, patterns and the like into consideration.
Also, make sure that you save your files in the appropriate formats (photographic images as JPEG, graphics/logos as PNG) - this will help in reducing the size of the images.

Merging files on AWS S3 (Using Apache Camel)

I have some files that are uploaded to S3 and processed for a Redshift task. After that task is complete, these files need to be merged. Currently I am deleting the files and uploading the merged files again.
This eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?
I am using Apache Camel for routing.
S3 allows you to use an S3 object URI as the source for a copy operation. Combined with S3's multipart upload API, you can supply several S3 object URIs as the source keys for a multipart upload.
However, the devil is in the details: S3's multipart upload API has a minimum part size of 5 MB. Thus, if any file in the series of files under concatenation is < 5 MB, it will fail.
However, you can work around this by exploiting the loophole that allows the final upload part to be < 5 MB (allowed because this happens in the real world when uploading remainder pieces).
My production code does this by:
- Interrogating the manifest of files to be uploaded
- If the first part is under 5MB, downloading pieces* and buffering to disk until 5MB is buffered
- Appending parts sequentially until file concatenation is complete
- If a non-terminus file is < 5MB, appending it, then finishing the upload, creating a new upload, and continuing
Finally, there is a bug in the S3 API: the ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multi-part upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved by the final copy operation.
* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5GB, you only need to read in 5110K to meet the 5MB size needed to continue.
** You could also have a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using byte range of 5MB+1 to EOF-1
P.S. When I have time to make a Gist of this code I'll post the link here.
You can use Multipart Upload with Copy to merge objects on S3 without downloading and uploading them again.
You can find some examples in Java, .NET or with the REST API here.
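For reference, a minimal boto3 sketch of Multipart Upload with Copy (bucket and key names are placeholders); each source object becomes one server-side-copied part, and every part except the last must be at least 5 MB:

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'                                # placeholder
sources = ['in/part-0001.csv', 'in/part-0002.csv']  # placeholder keys, each >= 5 MB except the last
merged_key = 'merged/output.csv'

upload = s3.create_multipart_upload(Bucket=bucket, Key=merged_key)
parts = []
for number, key in enumerate(sources, start=1):
    # Server-side copy: the object bytes never leave S3.
    result = s3.upload_part_copy(
        Bucket=bucket,
        Key=merged_key,
        UploadId=upload['UploadId'],
        PartNumber=number,
        CopySource={'Bucket': bucket, 'Key': key},
    )
    parts.append({'ETag': result['CopyPartResult']['ETag'], 'PartNumber': number})

s3.complete_multipart_upload(
    Bucket=bucket,
    Key=merged_key,
    UploadId=upload['UploadId'],
    MultipartUpload={'Parts': parts},
)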