Can Glue crawler read an XML zip file? - amazon-web-services

I have a zipped XML file. Can I create a schema for it using a Glue crawler?
I tried the crawler's XML classifier and added the classifier to the crawler to create the table, but since it's a zip file it can't be read. Has anyone had experience using zip files with a Glue crawler?

AWS Glue can read zip files, but the zip must contain only one file. From the docs:
ZIP (supported for archives containing only a single file). Note that Zip is not well-supported in other services (because of the archive).
However, XML support is quite limited: not all XML files can be read. For example, you can't read self-closing elements, as noted in the docs.
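
For reference, here is a minimal sketch of wiring a custom XML classifier to a crawler with boto3; the classifier/crawler names, role ARN, database, RowTag value and S3 path are all placeholder assumptions, not values from the question.

```python
# Sketch: create a custom XML classifier and attach it to a crawler.
# All names, the role ARN and the RowTag are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_classifier(
    XMLClassifier={
        "Name": "my-xml-classifier",
        "Classification": "xml",
        "RowTag": "record",  # the element that represents one row
    }
)

glue.create_crawler(
    Name="my-xml-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-role",
    DatabaseName="my_database",
    Classifiers=["my-xml-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my-bucket/xml-data/"}]},
)
```

Even with this in place, the ZIP and XML limitations above still apply.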

Related

Process non-CSV, JSON and Parquet files from S3 using Glue

A small disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a plain graph, the only options I got were CSV, JSON and Parquet file formats from S3, and my files are not of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but the command is something I would need to download to a machine. Is there any way I can do that and then use Glue on that JSON?
Thanks.

Read an XML file inside a .zip file on Amazon S3 without downloading the whole zip file?

I have a lot of .zip files on Amazon S3; they are big and I don't need to download all of them. I only need the single XML file inside each one to know which files should be downloaded.
This is the case for the Sentinel-3 data file xfdumanifest.xml, e.g.:
s3://s3-olci/LFR/2018/01/31/S3A_OL_2_LFR____20180131T225040_20180131T225340_20180202T040253_0180_027_215_2520_LN1_O_NT_002.zip/S3A_OL_2_LFR____20180131T225040_20180131T225340_20180202T040253_0180_027_215_2520_LN1_O_NT_002.SEN3/xfdumanifest.xml
Does anyone know how to read only this xfdumanifest.xml file without downloading the whole zip file?
S3 doesn't support downloading and extracting just one file from a ZIP.
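S3 itself offers no server-side API for this, so anything has to happen on the client. As an illustrative sketch (not part of the original answer), Python's zipfile module can work against a seekable S3 file object such as the one s3fs provides, which turns seeks into ranged GET requests instead of downloading the whole archive; the bucket and key below are placeholders.

```python
# Sketch: read one XML member of a large ZIP on S3 without a full download.
# Assumes the s3fs package; bucket/key are placeholders. zipfile seeks
# within the object, and s3fs translates those seeks into ranged GETs.
import zipfile
import s3fs

fs = s3fs.S3FileSystem()

with fs.open("my-bucket/path/to/product.zip", "rb") as f:
    with zipfile.ZipFile(f) as zf:
        name = next(n for n in zf.namelist() if n.endswith("xfdumanifest.xml"))
        manifest = zf.read(name)  # fetches only this member's bytes

print(manifest[:200])
```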

No extension when using 'from_options' in DynamicFrameWriter in AWS Glue Spark context

I am new to AWS. I am writing an **AWS Glue job** for some transformation, and that part works. After the transformation I used **'from_options' in the DynamicFrameWriter class** to write the data frame out as a CSV file, but the file was copied to S3 without any extension. Also, is there any way to rename the copied file, using DynamicFrameWriter or anything else? Please help.
Step 1: Triggered an AWS Glue job for transforming files in S3 to an RDS instance.
Step 2: On successful job completion, transfer the contents of the file to another S3 bucket using 'from_options' in the DynamicFrameWriter class. But the file doesn't have any extension.
You have to set the format of the file you are writing, e.g. format="csv". This should set the .csv file extension. You cannot, however, choose the name of the file being written; the only option is a separate S3 operation afterwards where you change the key name of the object.
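As a rough sketch of what that looks like in the job script (the frame name, bucket and prefix are placeholders):

```python
# Sketch: write a DynamicFrame to S3 as CSV by passing format="csv".
# glueContext, dyf and the output path are placeholders.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv",
    format_options={"separator": ","},
)
```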

Decompress a zip file in AWS Glue

I have a compressed gzip file in an S3 bucket. The files will be uploaded to the S3 bucket daily by the client. The gzip, when uncompressed, contains 10 files in CSV format, all with the same schema. I need to uncompress the gzip file and, using a Glue data crawler, create a schema before running an ETL script on a dev endpoint.
Is Glue capable of decompressing the zip file and creating a data catalog? Or is there any Glue library available which we can use directly in the Python ETL script? Or should I opt for a Lambda (or any other utility) so that as soon as the zip file is uploaded, I run a utility to decompress it and provide it as input to Glue?
Appreciate any replies.
Glue can do the decompression, but it wouldn't be optimal: the gzip format is not splittable, which means only one executor will work on it. More info about that here.
Alternatively, you can decompress the file with a Lambda and invoke the Glue crawler on the new folder.
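A rough sketch of that Lambda approach follows; the prefixes, the crawler name and the event shape are placeholder assumptions, and the object is assumed small enough to decompress in Lambda memory.

```python
# Sketch of the Lambda idea above: decompress a .gz object uploaded to S3,
# write the result under a "decompressed/" prefix, then start a Glue crawler.
# Bucket layout and crawler name are placeholders.
import gzip
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]  # e.g. incoming/data.csv.gz

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    data = gzip.decompress(body)

    out_key = "decompressed/" + key.rsplit("/", 1)[-1].removesuffix(".gz")
    s3.put_object(Bucket=bucket, Key=out_key, Body=data)

    glue.start_crawler(Name="my-crawler")  # placeholder crawler name
    return {"written": out_key}
```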
Use glueContext.create_dynamic_frame.from_options and specify the compression type in the connection options. Similarly, output can also be compressed while writing to S3. The snippet below worked for bzip; change the compression to gz/gzip and try.
I tried the Target Location in the Glue console UI and found that bzip and gzip are supported when writing dynamic frames to S3, and I modified the generated code to read a compressed file from S3. This is not directly mentioned in the docs.
I'm not sure about the efficiency. It took around 180 seconds of execution time to read, apply a Map transform, convert to a DataFrame and back to a DynamicFrame for a 400 MB compressed CSV file in bzip format. Note that execution time is different from the start_time and end_time shown in the console.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://bucketname/folder/filename_20180218_004625.bz2'],
        'compression': 'bzip'
    },
    format='csv',
    format_options={
        'separator': ';'
    }
)
I've written a Glue job that can unzip S3 files and put them back in S3.
Take a look at https://stackoverflow.com/a/74657489/17369563

Using a tar.gz file as a source for Amazon Athena

If I define *.tsv files on Amazon S3 as a source for an Athena table and use OpenCSVSerde or LazySimpleSerDe as the deserializer, it works correctly. But if I define *.tar.gz files that contain *.tsv files, I see several strange rows in the table (e.g. a row that contains the tsv file name, and several empty rows). What is the right way to use tar.gz files in Athena?
The problem is tar: it adds additional rows (the tar headers show up as data). Athena can open *.gz files, but not tar archives, so in this case I have to use *.gz instead of *.tar.gz.
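
If the data arrives as *.tar.gz, one way to repackage it, sketched here with local placeholder paths, is to extract each member and re-compress it as an individual .gz file before uploading:

```python
# Sketch: repackage a .tar.gz into individual .gz files that Athena can read.
# Input/output paths are placeholders.
import gzip
import tarfile
from pathlib import Path

src = Path("data.tar.gz")
out_dir = Path("gz_out")
out_dir.mkdir(exist_ok=True)

with tarfile.open(src, "r:gz") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue
        extracted = tar.extractfile(member)
        target = out_dir / (Path(member.name).name + ".gz")
        with gzip.open(target, "wb") as gz:
            gz.write(extracted.read())
```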