Reading Json data from Athena - amazon-athena

i have created a table by mapping the json data, unfortunately i am not able to read the nested array within the json.
{
"total":10,
"count":100,
"values":{
"source":[{"sourceid":"10001","source":"ABC"},
{"sourceid":"10002","source":"XYZ"}
]}
}
```athena table
CREATE EXTERNAL TABLE source_master_data(
total bigint,
count bigint,
values struct<source: array<struct<sourceid: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/'
I am trying to read the sourceid and source but no luck.. can anyone help me out
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1

The unnest need to be placed on the array type. In your query, you are trying to unnest the struct which is not possible in Athena.
The second issue is the use of values without quotes. This also fails, because values is a reserved word in Athena.
The overall query would look something like this.
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)

Related

AWS Athena query returning empty string

I've seen other questions saying their query returns no results. This is not what is happening with my query. The query itself is returning empty strings/results.
I have an 81.7MB JSON file in my input bucket (input-data/test_data). I've setup the datasource as JSON.
However, when I execute SELECT * FROM test_table; it shows (in green) that the data has been scanned, the query was successful and there are results, but not saved to the output bucket or displayed in the GUI.
I'm not sure what I've done wrong in the setup?
This is my table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
`tbl_timestamp` timestamp,
`colmn1` string,
`colmn2` string,
`colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://input-data/test_data/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
Resolved this issue. The labels of the table (e.g. the keys) need to be the same labels in the file itself. Simple really!

incorrect row format returned by DDL

I have csv data formatted differently and this syntax works as expected. But when I use "Generate create table DDL" option, it does not return the same parameters for row format.
original and correct row format:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
"quoteChar" = "\""
)
row format generated by SHOW CREATE TABLE xyz syntax:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'quoteChar'='\"')
I will like to know how to get back the exactly same create table statement that I used in the first place.
That is not possible. SHOW CREATE TABLE … will at best give you SQL that can actually be used, but many times will give you something that won't even run, let alone create a identical copy of the table.
You should use the Glue API instead. Use GetTable to retrieve the table structure, modify what you need (the name, the database, and/or the location, for example), and then use CreateTable to create the new table.
What SHOW CREATE TABLE … does is that it looks up the table metadata in Glue and then does (a poor) conversion of what it finds into SQL DDL. You will be much better off doing the Glue operations yourself.

Redshift Spectrum: Query Anonymous JSON array structure

I have a JSON array of structures in S3, that is successfully Crawled & Cataloged by Glue.
[{"key":"value"}, {"key":"value"}]
I'm using the custom Classifier:
$[*]
When trying to query from Spectrum, however, it returns:
Top level Ion/JSON structure must be an anonymous array if and only if
serde property 'strip.outer.array' is set. Mismatch occured in file...
I set that serde property manually in the Glue catalog table, but nothing changed.
Is it no possible to query an anonymous array via Spectrum?
Naming the array in the JSON file like this:
"values":[{"key":"value"},...}
And updating the classifier:
$.values[*]
Fixes the issue... Interested to know if there is a way to query anonymous arrays though. It seems pretty common to store data like that.
Update:
In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now still no solution other than using individual records per line:
{"key":"value"}
{"key":"value"}
etc.
It does seem to be a Spectrum specific issue, as Athena would still work.
Interested to know if anyone else was able to get it to work...
I've successfully done this, but without a data classifier. My JSON file looks like:
[
{
"col1": "data_from_col1",
"col2": "data_from_col2",
"col3": [
{
"col4": "data_from_col4",
...
{
]
},
{
"col1": "data_from_col1",
"col2": "data_from_col2",
"col3": [
{
"col4": "data_from_col4",
...
{
]
},
...
]
I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration options under Output CAN'T be set to Update the table definition..., or else re-running the crawler later will overwrite the manual changes described below. I used Add new columns only.
I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:
anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>
partition_0 string
I manually updated my schema to:
col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string
(And also add the serde param strip.outer.array.)
Then I had to rerun my crawler, and finally I could query in Spectrum like:
select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;
You can use json_extract_path_text for extracting the element or json_extract_array_element_text('json string', pos [, null_if_invalid ] ).
for example:
for 2nd index element
select json_extract_array_element_text('[111,112,113]', 2);
output: 113
If your table's structure is as follows:
CREATE EXTERNAL TABLE spectrum.testjson(struct<id:varchar(25),
columnName<array<struct<key:varchar(20),value:varchar(20)>>>);
you can use the following query to access the array element:
SELECT c.id, o.key, o.value FROM spectrum.testjson c, c.columnName o;
For more information you can refer the AWS Documentation:
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data-sqlextensions.html

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works, when you specify all columns as STRING but that's not what I need.
Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?
Quick and dirty way to handle these data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense as the LazySimpleSerDe given the following from Hive.
I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose either or.
You can use OpenCSVSerDe and type all columns as string, that will give you support for quoted fields, and emtpty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/

AWS Athena flattened data from nested JSON source

I'd like to create a table from a nested JSON in Athena. The solutions described here using tools like hive Openx-JsonSerDe attempt to mirror the JSON data in the SQL statement. I just want to get a few fields from the JSON file and create the table. I can't seem to find any resources on how to do that.
E.g.
JSON file {"records": [{"a": "data1", "b": "data2", "c": "data3"}]}
The table I'd like to create just only has columns a and b
I think what you are trying to achieve is unnesting the array to transform one array entry into one row.
This is possible through the correct querying of your data structure.
table definition:
CREATE external TABLE complex (
records array<struct<a:string,b:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/test1/';
query:
select record.a,record.b from complex
cross join UNNEST(complex.records) as t1(record);