AWS Athena flattened data from nested JSON source - amazon-web-services

I'd like to create a table from nested JSON in Athena. The solutions I've found, using tools like the Hive OpenX JSON SerDe, attempt to mirror the full JSON structure in the SQL statement. I just want to pull a few fields out of the JSON file and create the table. I can't seem to find any resources on how to do that.
E.g.
JSON file:
{"records": [{"a": "data1", "b": "data2", "c": "data3"}]}
The table I'd like to create should only have columns a and b.

I think what you are trying to achieve is to unnest the array, turning each array entry into its own row.
This is possible by querying the data structure directly.
table definition:
CREATE EXTERNAL TABLE complex (
records array<struct<a:string,b:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/test1/';
query:
select record.a, record.b
from complex
cross join UNNEST(complex.records) as t1 (record);
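For the sample file above, a quick sanity check of what this returns: each element of the records array becomes one row, and fields not declared in the struct (c here) are simply ignored:

a     | b
------+------
data1 | data2

One caveat worth noting: the OpenX JSON SerDe expects each top-level JSON object to sit on a single line of the source file, so the record must not be pretty-printed across multiple lines.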

Related

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a DataFrame, and print the schema using the code below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, the downstream joins fail.
Is there a way to make the dynamic frame get the table schema from the catalog even for an empty table, or is there any other alternative?
I found a solution. It is not ideal, but it works: if you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name')
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()
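If listing every column by hand is too brittle, here is a hedged sketch of a generalization, assuming simple (non-nested) column types: it uses the boto3 Glue API to read the column list straight from the Data Catalog and builds the mapping from it.

import boto3

# Pull the column definitions straight from the Glue Data Catalog,
# so the mapping survives even when the table itself has no rows.
glue = boto3.client('glue')
cols = glue.get_table(DatabaseName='database_name',
                      Name='table_name')['Table']['StorageDescriptor']['Columns']

# Build the (source, type, target, type) tuples apply_mapping expects.
# Assumes simple types; nested struct/array types may need special handling.
mapping = [(c['Name'], c['Type'], c['Name'], c['Type']) for c in cols]

df = dyf.apply_mapping(mapping).toDF()
df.printSchema()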

Reading Json data from Athena

I have created a table by mapping the JSON data, but unfortunately I am not able to read the nested array within the JSON.
{
  "total": 10,
  "count": 100,
  "values": {
    "source": [
      {"sourceid": "10001", "source": "ABC"},
      {"sourceid": "10002", "source": "XYZ"}
    ]
  }
}
Athena table:
CREATE EXTERNAL TABLE source_master_data (
  total bigint,
  count bigint,
  values struct<source: array<struct<sourceid: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/';
I am trying to read the sourceid and source, but with no luck. Can anyone help me out?
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1
UNNEST needs to be applied to an array type. In your query, you are trying to unnest the struct, which is not possible in Athena.
The second issue is the use of values without quotes. This also fails, because values is a reserved word in Athena.
The overall query would look something like this.
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)
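Note that the table definition above only declares sourceid inside the inner struct, so source is not selectable as written. A hedged sketch of the extended DDL and query, assuming both fields are wanted (in DDL, the reserved word values is escaped with backticks rather than double quotes):

CREATE EXTERNAL TABLE source_master_data (
  total bigint,
  count bigint,
  `values` struct<source: array<struct<sourceid: string, source: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/';

select t1.source.sourceid, t1.source.source
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)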

Redshift Spectrum: Query Anonymous JSON array structure

I have a JSON array of structures in S3 that is successfully crawled and cataloged by Glue.
[{"key":"value"}, {"key":"value"}]
I'm using the custom Classifier:
$[*]
When trying to query from Spectrum, however, it returns:
Top level Ion/JSON structure must be an anonymous array if and only if
serde property 'strip.outer.array' is set. Mismatch occured in file...
I set that serde property manually in the Glue catalog table, but nothing changed.
Is it not possible to query an anonymous array via Spectrum?
Naming the array in the JSON file like this:
"values":[{"key":"value"},...}
And updating the classifier:
$.values[*]
This fixes the issue... I'd be interested to know if there is a way to query anonymous arrays, though. It seems pretty common to store data like that.
Update:
In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now there is still no solution other than using individual records per line:
{"key":"value"}
{"key":"value"}
etc.
It does seem to be a Spectrum-specific issue, as Athena still works.
Interested to know if anyone else was able to get it to work...
I've successfully done this, but without a data classifier. My JSON file looks like:
[
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  ...
]
I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration options under Output CAN'T be set to Update the table definition..., or else re-running the crawler later will overwrite the manual changes described below. I used Add new columns only.
I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:
anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>
partition_0 string
I manually updated my schema to:
col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string
(And also added the serde param strip.outer.array.)
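For reference, a hedged sketch of where that parameter lives if you express the table as Athena-style DDL instead of editing it in the Glue console (the location is illustrative):

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('strip.outer.array' = 'true')
LOCATION 's3://bucket/path/';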
Then I had to rerun my crawler, and finally I could query in Spectrum like:
select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;
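The bare LEFT JOIN o.col3 t ON true is Redshift Spectrum's nested-data extension: each element of the col3 array is unnested into its own output row, and using LEFT JOIN rather than an inner join keeps parent rows whose array is empty.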
You can use json_extract_path_text for extracting an element by key, or json_extract_array_element_text('json string', pos [, null_if_invalid ]) for extracting an array element by position.
For example, to get the element at index 2 (zero-based, i.e. the third element):
select json_extract_array_element_text('[111,112,113]', 2);
output: 113
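json_extract_path_text walks object keys instead of array positions; for example:
select json_extract_path_text('{"values":{"sourceid":"10001"}}', 'values', 'sourceid');
output: 10001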
If your table's structure is as follows:
CREATE EXTERNAL TABLE spectrum.testjson (
  id varchar(25),
  columnName array<struct<key:varchar(20), value:varchar(20)>>
);
you can use the following query to access the array element:
SELECT c.id, o.key, o.value FROM spectrum.testjson c, c.columnName o;
For more information, you can refer to the AWS documentation:
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data-sqlextensions.html

Apache Calcite query json nested data

I have managed to read simple data in JSON format from Kafka using Calcite SQL.
There is a special requirement: the data is nested, like below:
{"DATA":{"SPEED":80},"USER_ID":100200,"USER_NAME":"user1"}
And I want to use DATA.SPEED, not just DATA, as my query criterion in the where condition, like below:
select DATA.SPEED from "database" where DATA.SPEED > 50 order by
USER_ID desc
Is there a way to do this? Thanks.
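Not a definitive answer, but a hedged sketch of what generally works in Calcite, assuming the Kafka adapter exposes DATA as a ROW-typed column: nested fields can be dotted into, as long as the reference is qualified with a (quoted) table alias:

select t."DATA"."SPEED", t."USER_ID"
from "database" as t
where t."DATA"."SPEED" > 50
order by t."USER_ID" desc

If DATA instead arrives as a MAP column, Calcite's bracket (ITEM) syntax t."DATA"['SPEED'] is the equivalent.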

Pyspark with AWS Glue join on multiple columns creating duplicates

I have two tables in AWS Glue, table_1 and table_2, that have almost identical schemas; however, table_2 has two additional columns. I am trying to join these two tables together on the columns that are the same and add the columns that are unique to table_2, with null values for the "old" data whose schema does not include those fields.
Currently, I am able to join the two tables, using something similar to:
joined_table = Join.apply(table_1, table_2, 'id', 'id')
where the first 'id' is the id column in table_1 and the second 'id' is the id column in table_2. This call successfully joins the tables into one; however, the resulting joined_table has duplicate fields for the matching columns.
My two questions are:
How can I leverage AWS Glue job with Pyspark to join all columns that match across the two tables so that there are not duplicate columns and while adding the new fields?
This sample call only takes in the 'id' column, as I was just trying to get it to work; however, I want to pass in all the columns that match across the two tables. How can I pass a list of columns to this Join.apply call? I am aware of the methods available in PySpark directly, but am wondering whether there is a way specific to AWS Glue jobs, or something I need to do within AWS Glue, to leverage PySpark functionality.
I found that I needed to rename the columns in table_1, and that I was missing a call to .drop_fields after my Join.apply call to remove the old columns from the joined table.
Additionally, you can pass in a list of column names rather than the single 'id' column that I was trying to use in the question.
joineddata = Join.apply(frame1 = table1, frame2 = table2, keys1 = ['id'], keys2 = ['id'], transformation_ctx = 'joinedData')
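Putting both pieces together, a hedged sketch of the rename-join-drop flow described above (frame and column names are illustrative):

# Rename table_1's key so the joined frame doesn't carry two 'id' columns.
renamed = table1.rename_field('id', 'id_1')
joined_data = Join.apply(frame1=renamed, frame2=table2,
                         keys1=['id_1'], keys2=['id'],
                         transformation_ctx='joinedData')
# Drop the renamed key; table_2's 'id' survives as the only copy.
joined_data = joined_data.drop_fields(['id_1'])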
The join in AWS Glue doesn't handle duplicates. You need to convert to DataFrames and then drop the duplicates.
If you have duplicates, try this:
selectedFieldsDataFrame = joineddata.toDF()
# dropDuplicates() returns a new DataFrame, so keep the result
selectedFieldsDataFrame = selectedFieldsDataFrame.dropDuplicates()