How to read and query S3 XML files using Hive - amazon-web-services

I have XML files stored in an AWS S3 bucket. I want to extract the XML metadata and load it into Hive tables on HDFS. Is there any tool that can help expedite this?

You will probably need a Hive XML SerDe to read the XML files, or you can write (or reuse) custom UDFs that understand XML.
Some references that might help:
https://community.hortonworks.com/articles/972/hive-and-xml-pasring.html
https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
https://community.hortonworks.com/questions/47840/how-do-i-do-xml-string-parsing-in-hive.html
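
As a rough illustration of the SerDe route, the sketch below assembles the kind of CREATE TABLE statement described in the Hive-XML-SerDe wiki linked above. The class names and the column.xpath / xmlinput properties follow that project's README as I recall it, so check them against the version you install. The table name, columns, XPath expressions, and location are hypothetical placeholders.

```python
def build_xml_serde_ddl(table, location, columns, start_tag, end_tag):
    """Build a Hive DDL string for the Hive-XML-SerDe.

    columns is a list of (name, hive_type, xpath) tuples.
    start_tag / end_tag mark where one XML record begins and ends.
    """
    col_defs = ",\n  ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    xpaths = ",\n  ".join(
        f'"column.xpath.{name}"="{xpath}"' for name, _, xpath in columns
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {col_defs}\n)\n"
        "ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'\n"
        f"WITH SERDEPROPERTIES (\n  {xpaths}\n)\n"
        "STORED AS\n"
        "INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{location}'\n"
        f'TBLPROPERTIES ("xmlinput.start"="{start_tag}", "xmlinput.end"="{end_tag}");'
    )


# Hypothetical table: one row per <doc> element, two metadata columns.
ddl = build_xml_serde_ddl(
    table="xml_docs",
    location="s3a://my-bucket/xml/",  # placeholder path, not from the question
    columns=[
        ("doc_id", "STRING", "/doc/@id"),
        ("title", "STRING", "/doc/title/text()"),
    ],
    start_tag="<doc",
    end_tag="</doc>",
)
print(ddl)
```

The generated statement would be run in Hive after adding the SerDe jar to the session; the XPath expressions decide which parts of each record become columns.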

Related

Create Athena table using s3 source data

Below is the S3 path where I store the files produced at the end of a process. The path is dynamic: the values of the following fields vary - partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file under the output/ directory as well as under the intermediate_results/ directory, for each val1/val2 combination.
Each file is a CSV.
I'm not very familiar with AWS Athena, so I can't figure out how to implement this. I would really appreciate any help. Thanks!
Use the CREATE TABLE statement (see "CREATE TABLE - Amazon Athena" in the AWS documentation). You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena automatically uses all files under that path, including subdirectories. This means that a table created with a LOCATION of output/ will also include every subdirectory, including intermediate_results. Your current storage layout is therefore not compatible with the way you want to use Amazon Athena. You would need to put the data for each table into its own, non-nested path.
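
A minimal sketch of that point, assuming comma-delimited files with a header row: it builds an Athena CREATE EXTERNAL TABLE statement for a given location and checks whether two locations are nested inside each other. The table names, column names, the values acme and widgets, and the final_results directory are hypothetical and only stand in for your own values.

```python
def athena_csv_ddl(table, columns, location):
    """Return a CREATE EXTERNAL TABLE statement for CSV files under location."""
    cols = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'\n"
        # Assumption: each CSV starts with a header line that should be skipped.
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )


def locations_overlap(loc_a, loc_b):
    """True if one S3 location is the same as, or nested under, the other."""
    a = loc_a.rstrip("/") + "/"
    b = loc_b.rstrip("/") + "/"
    return a.startswith(b) or b.startswith(a)


base = "s3://bucket/acme/data/widgets"  # val1=acme, val2=widgets (made up)

# Layout from the question: intermediate_results sits inside output/.
nested = locations_overlap(f"{base}/output/", f"{base}/output/intermediate_results/")

# Alternative layout with sibling directories, one per table.
separate = locations_overlap(f"{base}/final_results/", f"{base}/intermediate_results/")

print(nested, separate)
print(athena_csv_ddl(
    "acme_widgets_intermediate",
    [("id", "string"), ("score", "double")],
    f"{base}/intermediate_results/",
))
```

With the nested layout, a table on output/ would also scan the intermediate files; with sibling directories each table sees only its own data.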

Process files other than CSV, JSON and Parquet from S3 using Glue

A little disclaimer: I have never used Glue.
I have files stored in S3 that I want to process using Glue. From what I saw when I tried to start a new job from a blank graph, the only S3 file formats offered were CSV, JSON and Parquet, and my files are none of these. Is there any way to process those files using Glue, or do I need to use another AWS service?
I can run a bash command to turn those files into JSON, but the command is a tool I would need to download onto a machine. Is there any way I can do that and then use Glue on the resulting JSON?
Thanks.

Can a Glue crawler read a zipped XML file

I have a zipped XML file. Can I create a schema for it using a Glue crawler?
I tried adding an XML classifier to the crawler to create the table,
but since the input is a zip file, the crawler is not able to read it. Does anyone have experience using zip files with a Glue crawler?
AWS Glue can read zip files, but the zip must contain only one file. From the docs:
ZIP (supported for archives containing only a single file). Note that Zip is not well-supported in other services (because of the archive).
However, XML support is very limited, and not all XML files can be read. For example, you can't read self-closing elements, as shown in the docs.
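
Since the single-file restriction is easy to trip over, here is a small sketch that checks an archive before it is uploaded for crawling. It uses only Python's standard zipfile module and says nothing about Glue itself; the function name and the in-memory test archives are made up for illustration.

```python
import io
import zipfile


def single_member_name(zip_bytes):
    """Return the name of the only file in the archive, or None if the
    archive holds zero or several files (directory entries are ignored)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        files = [info.filename for info in zf.infolist() if not info.is_dir()]
    return files[0] if len(files) == 1 else None


def make_zip(members):
    """Build an in-memory zip from a {name: text} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    return buf.getvalue()


ok_archive = make_zip({"data.xml": "<rows><row><id>1</id></row></rows>"})
bad_archive = make_zip({"a.xml": "<a></a>", "b.xml": "<b></b>"})

print(single_member_name(ok_archive))   # name of the single member
print(single_member_name(bad_archive))  # None, because there are two files
```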

How to query a nested XML file in AWS Athena via Glue

I want to query a nested XML file from AWS Athena using AWS Glue.
<Files>
  <File>
    <Charges>
      <charge>
        <FRNo>99988881111</FRNo>
        <amount>25.0</amount>
        <Date>2019-02-25</Date>
        <chargeType>Recur</chargeType>
        <phoneNo>4444000012</phoneNo>
      </charge>
      <charge>
        <FRNo>99988881111</FRNo>
        <amount>40.0</amount>
        <Date>2019-02-25</Date>
        <chargeType>Recur</chargeType>
        <phoneNo>4444000012</phoneNo>
      </charge>
    </Charges>
    <FRNo>99988881111</FRNo>
    <address>New YORK</address>
    <amount>111</amount>
    <DN>100000</DN>
    <name>Rite</name>
    <phoneNo>4444000012</phoneNo>
    <tax>8.0</tax>
  </File>
</Files>
I have about 10k records like this. I think we need to make some modifications in the ETL job. Let me know if you need any other information.
Athena cannot process XML files directly, so we need to convert them to one of the formats Athena supports (CSV, JSON, etc.):
1) Crawl XML file in Glue (Give proper rowTag value)
2) Write a Glue job to convert XML to CSV/JSON
3) Crawl converted CSV/JSON
Currently, Amazon Athena does not support the XML file format. You may find the list of supported formats here: Supported SerDes and Data Formats - Amazon Athena
Since AWS Glue supports XML as an ETL input format (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html), you may first convert your data from XML to JSON and then query the JSON data using Athena.
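
To make the XML-to-JSON step concrete, here is a plain-Python stand-in for the conversion (not a Glue script), using a shortened copy of the sample from the question. It emits one JSON object per File element on its own line, which is the layout Athena's JSON SerDes expect. All values are kept as strings; casting them to numbers or dates is left to the table definition.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE = """<Files>
  <File>
    <Charges>
      <charge><FRNo>99988881111</FRNo><amount>25.0</amount><Date>2019-02-25</Date></charge>
      <charge><FRNo>99988881111</FRNo><amount>40.0</amount><Date>2019-02-25</Date></charge>
    </Charges>
    <FRNo>99988881111</FRNo>
    <address>New YORK</address>
    <amount>111</amount>
    <name>Rite</name>
    <tax>8.0</tax>
  </File>
</Files>"""


def file_to_record(file_elem):
    """Turn one <File> element into a dict; <Charges> becomes a list of dicts."""
    record = {}
    for child in file_elem:
        if child.tag == "Charges":
            record["Charges"] = [
                {field.tag: field.text for field in charge}
                for charge in child.findall("charge")
            ]
        else:
            record[child.tag] = child.text
    return record


def xml_to_json_lines(xml_text):
    """Return a list of JSON strings, one per <File> element."""
    root = ET.fromstring(xml_text)
    return [json.dumps(file_to_record(f)) for f in root.findall("File")]


lines = xml_to_json_lines(SAMPLE)
print("\n".join(lines))
```

In Athena the Charges field could then be declared as an array of structs and expanded with UNNEST when one row per charge is needed.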

Can S3 Select search multiple objects?

I'm testing out S3 Select and as far as I understand from the examples, you can treat a single object (CSV or JSON) as a data store.
I wanted to have a single JSON document per S3 object and search the entire bucket as a 'database'. I'm saving each 'file' as <ID>.json and each file has JSON documents with the same schema.
Is it possible to search multiple objects in a single call, e.g. find all JSON documents where customerId = 123?
It appears that Amazon S3 Select operates on only one object.
You can use Amazon Athena to run queries across paths, which will include all files within that path. It also supports partitioning.
Alternatively, list the keys under the prefix ("folder") that holds all the files, then run S3 Select against each key in turn.
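
A sketch of that per-key loop, assuming a boto3 S3 client is passed in as s3 and that each object holds a single JSON document. The bucket name, prefix, and function name are placeholders. Note that this still makes one S3 Select request per object, so it is a loop of calls rather than a single call.

```python
import json


def select_across_objects(s3, bucket, prefix, expression):
    """Run the same S3 Select expression on every object under prefix.

    s3 is expected to behave like boto3.client("s3").
    Yields (key, record) pairs for each matching JSON record.
    """
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            response = s3.select_object_content(
                Bucket=bucket,
                Key=obj["Key"],
                ExpressionType="SQL",
                Expression=expression,
                InputSerialization={"JSON": {"Type": "DOCUMENT"}},
                OutputSerialization={"JSON": {}},
            )
            # The result arrives as an event stream; only Records events
            # carry data, and one record may be split across events.
            chunks = [
                event["Records"]["Payload"]
                for event in response["Payload"]
                if "Records" in event
            ]
            for line in b"".join(chunks).decode("utf-8").splitlines():
                if line.strip():
                    yield obj["Key"], json.loads(line)


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   query = "SELECT * FROM S3Object s WHERE s.customerId = 123"
#   for key, doc in select_across_objects(boto3.client("s3"), "my-bucket", "docs/", query):
#       print(key, doc)
```

For a large number of objects, the Athena approach mentioned above is likely the better fit, since it scans the whole path in one query.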