I want to query the nested XML file below from AWS Athena using AWS Glue.
<Files>
  <File>
    <Charges>
      <charge>
        <FRNo>99988881111</FRNo>
        <amount>25.0</amount>
        <Date>2019-02-25</Date>
        <chargeType>Recur</chargeType>
        <phoneNo>4444000012</phoneNo>
      </charge>
      <charge>
        <FRNo>99988881111</FRNo>
        <amount>40.0</amount>
        <Date>2019-02-25</Date>
        <chargeType>Recur</chargeType>
        <phoneNo>4444000012</phoneNo>
      </charge>
    </Charges>
    <FRNo>99988881111</FRNo>
    <address>New YORK</address>
    <amount>111</amount>
    <DN>100000</DN>
    <name>Rite</name>
    <phoneNo>4444000012</phoneNo>
    <tax>8.0</tax>
  </File>
</Files>
I have around 10k records like this. I think we need to make some modifications to the ETL job. Let me know if any other information is needed.
Athena cannot process XML files directly, so we need to convert to one of the formats Athena supports (CSV/JSON/etc.).
1) Crawl the XML file in Glue (give a proper rowTag value)
2) Write a Glue job to convert XML to CSV/JSON (a sketch of such a job is shown below)
3) Crawl the converted CSV/JSON
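For step 2, a rough sketch of such a Glue ETL script in Python (the bucket paths are placeholders, and rowTag "File" is an assumption; use whatever element represents one record in your data):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the XML from S3; rowTag tells Glue which element is one record.
records = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-input-bucket/xml/"]},
    format="xml",
    format_options={"rowTag": "File"},
)

# Write the same records back out as JSON so Athena can query them.
glue_context.write_dynamic_frame.from_options(
    frame=records,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/json/"},
    format="json",
)

job.commit()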
Currently, Amazon Athena does not support the XML file format. You may find the list of supported formats here: Supported SerDes and Data Formats - Amazon Athena
Since AWS Glue supports XML as an ETL input format (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html), you may first convert your data from XML to JSON and then query the JSON data using Athena.
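Once a table has been crawled or created over the converted JSON, it can be queried from the Athena console or programmatically. A minimal boto3 sketch (the database, table, column, and result-bucket names here are made-up placeholders):

import boto3

athena = boto3.client("athena")

# Kick off a query against the hypothetical table built over the converted JSON.
response = athena.start_query_execution(
    QueryString="SELECT frno, amount, chargetype FROM charges LIMIT 10",
    QueryExecutionContext={"Database": "billing_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])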
A little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue, but from what I saw when I tried to start a new job from a plain graph, the only options I got were CSV, JSON, and Parquet file formats from S3, and my files are not of these types. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but that command is something I would need to download to a machine. Is there any way I can do that and then use Glue on the resulting JSON?
Thanks.
I have an XML zip file. Can I create a schema using a Glue crawler?
I was trying to use the crawler's XML classifier and added the classifier to the crawler to create a table.
Since it's a zip file, the crawler is not able to read it. Does anyone have experience using zip files with a Glue crawler?
AWS Glue can read zip files, but the zip must contain only one file. From the docs:
ZIP (supported for archives containing only a single file). Note that Zip is not well-supported in other services (because of the archive).
However, reading XML is quite limited. Not all XML files can be read; for example, you can't read self-closing elements, as noted in the docs.
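If you do go the crawler route for plain (unzipped) XML, the classifier and crawler can also be created with boto3. A rough sketch with placeholder names and an assumed IAM role:

import boto3

glue = boto3.client("glue")

# Custom XML classifier; RowTag must name the element that represents one record.
glue.create_classifier(
    XMLClassifier={
        "Name": "my-xml-classifier",
        "Classification": "xml",
        "RowTag": "charge",
    }
)

# Crawler that uses the classifier to build a table in the Data Catalog.
glue.create_crawler(
    Name="my-xml-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",  # assumed role
    DatabaseName="my_database",
    Classifiers=["my-xml-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my-bucket/xml/"}]},
)

glue.start_crawler(Name="my-xml-crawler")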
I have this type of data in my S3:
{"version":"0","id":"c1d9e9a4-25a2-a0d8-2fa4-b062efec98c4","detail-type":"OneTypeee","source":"OneSource","account":"123456789","time":"2021-01-17T12:35:17Z","region":"eu-central-1","resources":[],"detail":{"Key1":"Value1"}}
{"version":"0","id":"c13879a4-2h32-a0d8-9m33-b03jsh3cxxj4","detail-type":"OtherType","source":"SomeMagicSource","account":"123456789","time":"2021-01-17T12:36:17Z","region":"eu-central-1","resources":[],"detail":{"Key2":"Value2", "Key22":"Value22"}}
{"version":"0","id":"gi442233-3y44a0d8-9m33-937rjd74jdddj","detail-type":"MoreTypes","source":"SomeMagicSource2","account":"123456789","time":"2021-01-17T12:45:17Z","region":"eu-central-1","resources":[],"detail":{"MagicKey":"MagicValue", "Foo":"Bar"}}
Please note, I have added new lines to make it more readable. In reality, Kinesis Firehose produces these batches with no newlines.
When I try to run an AWS Glue crawler on this type of data, it only crawls the first JSON line and that's it. I know this because when I run Athena SQL queries, I always get only one (first) result.
How do I make a glue crawler correctly crawl through this data and make a correct schema so I could query all of that data?
I wasn't able to run a crawler through JSON lines data, but simply specifying in the Glue Table Serde properties that the data is JSON worked for me. Glue automatically splits the JSON by newline and I can query the data in my Glue Jobs.
Here's what my table's properties look like. Additionally, my json lines data was compressed, so here you can ignore the compressionType property.
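The screenshot of those table properties isn't reproduced here, but the general idea can be sketched with boto3: fetch the catalog table and point its SerDe at a JSON SerDe (the database and table names below are placeholders):

import boto3

glue = boto3.client("glue")

# Fetch the existing table definition (names are placeholders).
table = glue.get_table(DatabaseName="events_db", Name="firehose_events")["Table"]

# Point the SerDe at a JSON SerDe so each record is parsed as JSON.
table["StorageDescriptor"]["SerdeInfo"] = {
    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
}

# update_table only accepts TableInput fields, so keep just the writable ones.
table_input = {
    k: v
    for k, v in table.items()
    if k in ("Name", "Description", "Retention", "StorageDescriptor",
             "PartitionKeys", "TableType", "Parameters")
}
glue.update_table(DatabaseName="events_db", TableInput=table_input)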
I had the same issue, and for me the reason was that the JSON records were being written to the S3 bucket without a newline character: \n.
Make sure your JSON records are written with \n appended at the end. In the case of Java, something like this:
// Firehose PutRecordRequest; note the "\n" appended to each JSON record.
PutRecordRequest request = new PutRecordRequest()
        .withRecord(new Record().withData(ByteBuffer.wrap((json + "\n").getBytes())))
        .withDeliveryStreamName(streamName);
amazonKinesis.putRecordAsync(request);
I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case that's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog.
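In a Python Glue script, the rough equivalent of getSourceWithFormat() is create_dynamic_frame.from_options with recurse enabled. A sketch under those assumptions (the output target and the renamed column are placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Recursively pick up every sales.json under the store prefixes.
sales = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# Any light "normalization" can happen here, e.g. renaming a column
# (the column names are hypothetical).
sales = sales.rename_field("storeid", "store_id")

# Write out as Parquet (or to a JDBC target / the Glue Catalog instead).
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/sales-parquet/"},
    format="parquet",
)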
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set "Create a single schema for each S3 path" under "Grouping behavior for S3 data" when creating the crawler)
Read more about it here
Then you can use relationalize to flatten out your JSON structure; read more about that here
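A minimal relationalize sketch in a Python Glue job, assuming the crawler created a table called sales in a database called sales_db (both names are placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table the crawler created.
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="sales"
)

# relationalize flattens nested JSON into a collection of flat frames;
# the second argument is scratch space in S3.
flattened = sales.relationalize("root", "s3://my-bucket/glue-temp/")

# "root" holds the top-level rows; nested arrays land in "root_<field>" frames.
flattened.select("root").toDF().show()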
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for JSONL, which is the actual format that AWS expects, making this less of an automatic win for CSV. I would say if your data is already in JSON format, it's probably smarter to convert it to JSONL than to convert it to CSV.
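If the raw files hold a single JSON array (or pretty-printed objects) rather than one object per line, a small pre-processing script along these lines (the file names are placeholders) can rewrite them as JSON Lines:

import json

# Read a file containing one JSON array and write one compact object per line.
with open("sales.json") as src, open("sales.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")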
I have XML files stored in an AWS S3 bucket. I want to extract the XML metadata and load it into Hive tables on HDFS. Is there any tool that can help expedite this activity?
Well, you might need to use the Hive XML SerDe to read the XML files, or write/use custom UDFs that can understand XML.
Some references that might help: https://community.hortonworks.com/articles/972/hive-and-xml-pasring.html
https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
https://community.hortonworks.com/questions/47840/how-do-i-do-xml-string-parsing-in-hive.html