Decompress a zip file in AWS Glue

I have a compressed gzip file in an S3 bucket. The client will upload these files to the bucket daily. Each gzip, when uncompressed, contains 10 CSV files, all with the same schema. I need to uncompress the gzip file and, using a Glue crawler, create a schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the file and creating a data catalog? Is there any Glue library available that we can use directly in the Python ETL script? Or should I opt for a Lambda (or some other utility), so that as soon as the file is uploaded I can run something to decompress it and provide the result as input to Glue?
Appreciate any replies.

Glue can do the decompression, but it wouldn't be optimal: the gzip format is not splittable, which means only one executor will work on it. More info about that here.
You can instead do the decompression in a Lambda function and then invoke a Glue crawler on the new folder.
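A minimal sketch of that Lambda approach, assuming an S3 ObjectCreated trigger; the output prefix and crawler name are hypothetical, and it assumes a single-member .gz rather than a multi-file archive (which would need tarfile/zipfile instead):

import gzip
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

OUTPUT_PREFIX = "uncompressed/"      # hypothetical destination prefix
CRAWLER_NAME = "daily-csv-crawler"   # hypothetical crawler name


def handler(event, context):
    # Triggered by the S3 ObjectCreated event for the uploaded .gz file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Download and decompress the gzip payload in memory.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    data = gzip.decompress(body)

    # Write the uncompressed CSV back to S3 under a separate prefix.
    filename = key.rsplit("/", 1)[-1]
    if filename.endswith(".gz"):
        filename = filename[:-3]
    s3.put_object(Bucket=bucket, Key=OUTPUT_PREFIX + filename, Body=data)

    # Kick off the crawler so the Data Catalog picks up the new data.
    glue.start_crawler(Name=CRAWLER_NAME)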

Use glueContext.create_dynamic_frame.from_options and set the compression type in the connection options. Output can similarly be compressed while writing to S3 (a write-side sketch follows the snippet below). The snippet below worked for bzip; change the value to gz|gzip and try.
I tried the Target Location in the Glue console UI, found that bzip and gzip are supported when writing dynamic frames to S3, and changed the generated code to read a compressed file from S3. This is not directly covered in the docs.
Not sure about the efficiency: it took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame and back to a DynamicFrame for a 400 MB bzip-compressed CSV file. Note that execution time is different from the start_time and end_time shown in the console.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    'csv',
    {
        'separator': ';'
    }
)
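For the write side mentioned above, a minimal sketch assuming the datasource0 frame from the snippet; the output path is a placeholder, and the 'compression' connection option for S3 writes is used as the Glue docs describe it:

glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type='s3',
    connection_options={
        'path': 's3://bucketname/output/',   # hypothetical output location
        'compression': 'gzip'                # or 'bzip2' for bzip output
    },
    format='csv',
    format_options={'separator': ';'}
)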

I've written a Glue job that can unzip S3 files and put them back in S3.
Take a look at https://stackoverflow.com/a/74657489/17369563
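Not the code from the linked answer, but a minimal sketch of the general idea in a Glue Python job, using boto3 and zipfile; the bucket, key, and prefix are placeholders:

import io
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"               # hypothetical bucket
SOURCE_KEY = "incoming/data.zip"   # hypothetical zip object
DEST_PREFIX = "unzipped/"          # hypothetical destination prefix

# Pull the zip into memory, then write each member back to S3 uncompressed.
payload = s3.get_object(Bucket=BUCKET, Key=SOURCE_KEY)["Body"].read()
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    for member in archive.namelist():
        s3.put_object(
            Bucket=BUCKET,
            Key=DEST_PREFIX + member,
            Body=archive.read(member),
        )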

Related

AWS S3 file format

While writing files to S3 through a Glue job, how can I give a custom file name with a timestamp (for example, file-name_yyyy-mm-dd_hh-mm-ss)?
By default, Glue writes the output files with names like part-0**.
Since Glue uses Spark in the background, it is not possible to change the file names directly.
It is possible to rename them after you have written to S3, though. This answer provides a simple code snippet that should work.
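Not the snippet from the linked answer, but a minimal sketch of the usual post-write rename with boto3 (S3 has no rename, so it is a copy plus delete); the bucket, prefix, and target name are placeholders:

from datetime import datetime
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"        # hypothetical bucket
PREFIX = "glue-output/"     # hypothetical prefix the job wrote to

# Find the part-0* object the job produced (assumes exactly one part file).
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)["Contents"]
part_key = next(o["Key"] for o in objects if "part-" in o["Key"])

# Copy it to a timestamped name, then remove the original.
stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
new_key = f"{PREFIX}file-name_{stamp}.csv"
s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": part_key}, Key=new_key)
s3.delete_object(Bucket=BUCKET, Key=part_key)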

process non csv, json and parquet files from s3 using glue

A little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a plain graph, the only source options offered were CSV, JSON, and Parquet files from S3, and my files are not of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but that command is something I would need to download to a machine. Is there any way I can do that and then use Glue on the resulting JSON?
Thanks.

AWS Glue - Avro snappy compression read error - HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split

After saving Avro files with snappy compression (the same error occurs with gzip/bzip2 compression) to S3 using AWS Glue, when I try to read the data in Athena (cataloged via an AWS Glue crawler), I get the following error: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split - using org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat: Not a data file. Any idea why I get this error and how to resolve it?
Thank you.
I circumvented this issue by attaching the native Spark Avro jar file to the Glue job during execution and using native Spark read/write methods to write the data in Avro format; for the compression, set spark.conf.set("spark.sql.avro.compression.codec", "snappy") as soon as the Spark session is created.
This works perfectly for me, and the output can be read via Athena as well.
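A minimal sketch of that approach, assuming the spark-avro jar matching your Spark version is attached to the job as a dependent jar; the database, table, and output path are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Set the Avro codec right after the session is available.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")

# Read through Glue, convert to a DataFrame, and write Avro natively so the
# compression is applied per block (which Athena can then read).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="mytable"          # hypothetical catalog entries
)
dyf.toDF().write.format("avro").save("s3://my-bucket/avro-output/")  # hypothetical path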
AWS Glue doesn't support writing Avro with compression, even though this is not stated clearly in the docs. The job succeeds, but it applies the compression the wrong way: instead of compressing the file blocks, it compresses the entire file, which is wrong and is the reason Athena can't query it.
There are plans to fix the issue, but I don't know the ETA.
It would be nice if you could contact AWS support to let them know that you are having this issue too (more customers affected, sooner fixed).

Amazon Athena and compressed S3 files

I have an S3 bucket with several zipped CSV files (utilization logs.) I'd like to query this data with Athena, but the output is completely garbled.
It appears Athena is trying to parse the zip files without decompressing them first. Is it possible to force Hive to recognize my files as compressed data?
Compression is supported in Athena, but the supported formats are
Snappy (.snappy)
Zlib (.bz2)
GZIP (.gz)
Those formats are detected by their filename suffix. If the suffix doesn't match, the reader does not decode the content.
I tested it with a test.csv.gz file and it worked right away. So try changing the compression from zip to gzip and it should work.
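A minimal sketch of that recompression, assuming one CSV per zip; the bucket and key names are placeholders:

import gzip
import io
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-log-bucket"      # hypothetical bucket
ZIP_KEY = "logs/usage.zip"    # hypothetical zip object

# Extract each CSV from the zip and re-upload it gzip-compressed with a .gz
# suffix, so Athena's suffix-based detection kicks in.
payload = s3.get_object(Bucket=BUCKET, Key=ZIP_KEY)["Body"].read()
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    for member in archive.namelist():
        s3.put_object(
            Bucket=BUCKET,
            Key=f"logs-gz/{member}.gz",
            Body=gzip.compress(archive.read(member)),
        )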

how to download file on aws glacier without json format

In AWS Glacier, when we initiate a retrieval job, we can download the output after 4+ hours, but the download seems to be supported only in JSON format. How can I download my original files? Is it by using get-job-output or something else?
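For what it's worth, a hedged boto3 sketch of get-job-output; as far as I understand, an archive-retrieval job returns the original archive bytes, while only inventory-retrieval jobs return JSON. The vault name, job id, and output filename are placeholders:

import boto3

glacier = boto3.client("glacier")

# For an archive-retrieval job, the body is the original archive, not JSON.
resp = glacier.get_job_output(
    accountId="-",          # "-" means the account of the credentials in use
    vaultName="my-vault",   # hypothetical vault
    jobId="my-job-id",      # hypothetical job id from initiate-job
)

with open("restored-archive.bin", "wb") as f:
    f.write(resp["body"].read())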