Querying nested JSON structures in AWS Athena

I have a JSON document in the following format, with nested structures:
{
  "id": "p-1234-2132321-213213213-12312",
  "name": "athena to the rescue",
  "groups": [
    {
      "strategy_group": "anyOf",
      "conditions": [
        {
          "strategy_conditions": "anyOf",
          "entries": [
            {
              "c_key": "service",
              "C_operation": "isOneOf",
              "C_value": "mambo,bambo,jumbo"
            },
            {
              "c_key": "hostname",
              "C_operation": "is",
              "C_value": "lols"
            }
          ]
        }
      ]
    }
  ],
  "tags": [
    "aaa",
    "bbb",
    "ccc"
  ]
}
I have created a table in Athena to support it, using the following:
CREATE EXTERNAL TABLE IF NOT EXISTS filters (
  id string,
  name string,
  tags array<string>,
  groups array<struct<
    strategy_group:string,
    conditions:array<struct<
      strategy_conditions:string,
      entries: array<struct<
        c_key:string,
        c_operation:string,
        c_value:string
      >>
    >>
  >>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location 's3://filterios/policies/';
My goal at the moment is to also query based on the columns inside the conditions entries. I have tried some queries, however SQL is not my biggest trade ;)
So far I have got to this query, which gives me the entries:
select cnds.entries from
filters,
UNNEST(filters.groups) AS t(grps),
UNNEST(grps.conditions) AS t(cnds)
However, since this is a complex array, it gives me some headache figuring out what the proper way to query it would be.
Any hints appreciated!
thanks
R

I am not sure whether I understood your query well. Look at the example below; maybe it will be useful to you.
select
id,
name,
tags,
grps.strategy_group,
cnds.strategy_conditions,
enes.c_key,
enes.c_operation,
enes.c_value
from
filters,
UNNEST(filters.groups) AS t(grps),
UNNEST(grps.conditions) AS t(cnds),
UNNEST(cnds.entries) AS t(enes)
where
enes.c_key='service'
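If you also need to match inside the comma-separated c_value field, the Presto functions split and contains (both available in Athena) can be combined with the same UNNEST chain. A sketch, assuming 'mambo' is the value you are looking for:
select
enes.c_key,
enes.c_operation,
enes.c_value
from
filters,
UNNEST(filters.groups) AS t(grps),
UNNEST(grps.conditions) AS t(cnds),
UNNEST(cnds.entries) AS t(enes)
where
enes.c_key = 'service'
-- split e.g. 'mambo,bambo,jumbo' into an array and test membership
and contains(split(enes.c_value, ','), 'mambo')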

Here is one example I recently worked with that may help:
My JSON:
{
  "type": "FeatureCollection",
  "features": [{
    "first": "raj",
    "geometry": {
      "type": "Point",
      "coordinates": [-117.06861096, 32.57889962]
    },
    "properties": "someprop"
  }]
}
Created the external table:
CREATE EXTERNAL TABLE `jsondata`(
`type` string COMMENT 'from deserializer',
`features` array<struct<type:string,geometry:struct<type:string,coordinates:array<string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='features,type')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://vicinitycheck/rawData/jsondata/'
TBLPROPERTIES (
'classification'='json')
Query the data:
SELECT type AS TypeEvent,
features[1].geometry.coordinates AS FeatherType
FROM test_vicinitycheck.jsondata
WHERE type = 'FeatureCollection'
test_vicinitycheck is my database name in Athena
jsondata is the table name in Athena
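If you need every element of the features array rather than only the first, a CROSS JOIN UNNEST variant against the same table should also work; this is a sketch, not verified against your data:
SELECT type AS TypeEvent,
f.geometry.type AS GeometryType,
f.geometry.coordinates AS Coordinates
FROM test_vicinitycheck.jsondata
CROSS JOIN UNNEST(features) AS t(f)
WHERE type = 'FeatureCollection'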
I documented some examples on my blog if it helps:
http://weavetoconnect.com/aws-athena-and-nested-json/

Related

Is there a way to skip one row during Data migration from Oracle to Redshift

I am working on data migration from Oracle to Redshift and want to apply a transformation rule to skip one row. I know we can remove a column but not a row. Can anyone give me suggestions on whether there is a way to skip a row?
There is no direct built-in way to skip a row with DMS.
One thing you can do is apply a filter operation on a column, provided you have a column on which you can define a range, e.g. an integer column in the source table.
Oracle's NTILE analytic function can sort the table into ordered buckets and help derive those ranges (see the sketch below).
Once the ranges are defined, you can split your DMS task based on them and skip the row that you want.
DMS Source Filter
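For example, a rough sketch (the bucket count of 4 is arbitrary) of using NTILE against the abc.table1 from the sample below to bucket rows by the ID column and read off range boundaries:
-- assign each row to one of 4 ordered buckets by ID;
-- the ID values at the bucket edges can then serve as DMS range boundaries
SELECT ID,
NTILE(4) OVER (ORDER BY ID) AS bucket
FROM abc.table1;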
A sample example for skipping row 3:
{
  "rules": [
    {
      "rule-type": "table-settings",
      "rule-id": "4",
      "rule-name": "4",
      "object-locator": {
        "schema-name": "abc",
        "table-name": "table1"
      },
      "parallel-load": {
        "type": "ranges",
        "columns": [
          "ID"
        ],
        "boundaries": [
          [
            "Row1"
          ],
          [
            "Row2"
          ],
          [
            "Row4"
          ],
          [
            "Row5"
          ]
        ]
      }
    }
  ]
}

aws athena query json array data

I'm not able to query S3 files with AWS Athena; the content of the files is regular JSON arrays like this:
[
{
"DataInvio": "2020-02-06T13:37:00+00:00",
"DataLettura": "2020-02-06T13:35:50+00:00",
"FlagDownloaded": 0,
"GUID": "f257c9c0-b7e1-4663-8d6d-97e652b27c10",
"IMEI": "866100000062167",
"Id": 0,
"IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
"IdTagLocal": 0,
"SerialNumber": "142707160028BJZZZZ",
"Tag": "E200001697080089188056D2",
"Tipo": "B",
"TipoEvento": "L",
"TipoSegnalazione": 0,
"TipoTag": "C",
"UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
},
{
"DataInvio": "2020-02-06T13:37:00+00:00",
"DataLettura": "2020-02-06T13:35:50+00:00",
"FlagDownloaded": 0,
"GUID": "e531272e-465c-4294-950d-95a683ff8e3b",
"IMEI": "866100000062167",
"Id": 0,
"IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
"IdTagLocal": 0,
"SerialNumber": "142707160028BJZZZZ",
"Tag": "E200341201321E0000A946D2",
"Tipo": "B",
"TipoEvento": "L",
"TipoSegnalazione": 0,
"TipoTag": "C",
"UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
}
]
A simple query select * from mytable returns empty rows if the table has been created this way:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
`IdSessione` string,
`DataLettura` date,
`GUID` string,
`DataInvio` date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'true'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')
or it gives me an error HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Missing value at 1 [character 2 line 1] if the table has been generated with:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable(
`IdSessione` string,
`DataLettura` date,
`GUID` string,
`DataInvio` date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')
If I modify the content of the file like this (one JSON object per row, without the enclosing array or trailing commas), the query gives me results:
{ "DataInvio": "2020-02-06T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
{ "DataInvio": "2020-02-07T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
How can I query JSON array structures directly?
Athena best practices recommend having one JSON object per row:
Make sure that each JSON-encoded record is represented on a separate line.
This has been asked a few times, and I don't think anyone has made it work with an array of JSON objects:
aws athena - Create table by an array of json object
AWS Glue Custom Classifiers Json Path
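If you can get each file's whole array onto a single physical line, another workaround is to expose it as a single string column and unpack the array in SQL with json_parse and UNNEST. This is a rough sketch with inline data rather than your real table, so treat it as an illustration only:
WITH raw AS (
SELECT '[{"GUID":"f257c9c0","IMEI":"866100000062167"},{"GUID":"e531272e","IMEI":"866100000062167"}]' AS line
)
SELECT
json_extract_scalar(elem, '$.GUID') AS guid,
json_extract_scalar(elem, '$.IMEI') AS imei
FROM raw
CROSS JOIN UNNEST(CAST(json_parse(line) AS ARRAY(JSON))) AS t(elem)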
This is related to the formatting of the JSON objects. The resolution of these issues is also described here: https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
Apart from this, if you are using AWS Glue to crawl these files, make sure the classification of the Data Catalog table is not "UNKNOWN".

Is it possible to assign more than 10 thousand parameters to an `IN` query in BigQuery?

I defined the following schema in BigQuery:
[
  {
    "mode": "REQUIRED",
    "name": "customer_id",
    "type": "STRING"
  },
  {
    "mode": "REPEATED",
    "name": "segments",
    "type": "RECORD",
    "fields": [
      {
        "mode": "REQUIRED",
        "name": "segment_id",
        "type": "STRING"
      }
    ]
  }
]
I am trying to insert a new segment_id for specific customer ids, something like this:
#standardSQL
UPDATE `sample-project.customer_segments.segments`
SET segments = ARRAY(
SELECT segment FROM UNNEST(segments) AS segment
UNION ALL
SELECT STRUCT('NEW_SEGMENT')
)
WHERE customer_id IN ('0000000000', '0000000001', '0000000002')
Is it possible to assign more than 10 thousand customer_ids to an IN query in BigQuery?
Is it possible to assign more than 10 thousand customer_ids to an IN query in BigQuery?
Assuming (based on the example in your question) that each customer_id is around 10 characters long, plus three characters for the apostrophes and comma, you will end up with roughly an extra 130 KB, which is within the 256 KB limit (see more in Quotas & Limits).
So you should be fine with 10K, and you can easily calculate the maximum; it looks like the limit will be around 19K customer ids.
Just to clarify, I meant the limitations below (mostly the first one):
Maximum unresolved query length: 256 KB
Maximum resolved query length: 12 MB
When working with a long list of possible values, it's a good idea to use a query parameter instead of inlining the entire list into the query, assuming you are working with the command line client or API. For example,
#standardSQL
UPDATE `sample-project.customer_segments.segments`
SET segments = ARRAY(
SELECT segment FROM UNNEST(segments) AS segment
UNION ALL
SELECT STRUCT('NEW_SEGMENT')
)
WHERE customer_id IN UNNEST(@customer_ids)
Here you would create a query parameter of type ARRAY<STRING> containing the customer IDs.
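If you are working in the console instead, BigQuery scripting with a declared array variable plays a similar role to a query parameter. A sketch, with placeholder IDs:
#standardSQL
DECLARE customer_ids ARRAY<STRING> DEFAULT ['0000000000', '0000000001', '0000000002'];

UPDATE `sample-project.customer_segments.segments`
SET segments = ARRAY(
SELECT segment FROM UNNEST(segments) AS segment
UNION ALL
SELECT STRUCT('NEW_SEGMENT')
)
WHERE customer_id IN UNNEST(customer_ids);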

Partitioned table BigQuery (with custom field)

I can't find any examples that show how to write the JSON for a partitioned table using a custom field. Below is an example of how to specify a table partitioned by type "DAY", but if, in addition, I would like to partition by a specific field, what would the JSON look like?
{
"tableReference": {
"projectId": "bookstore-1382",
"datasetId": "exports",
"tableId": "partition"
},
"timePartitioning": {
"type": "DAY"
}
}
Take a look at the API reference. The timePartitioning object currently supports the following attributes:
expirationMs
field
requirePartitionFilter
type
I won't copy/paste all of the comments here, but this is what it says for field:
[Experimental] [Optional] If not set, the table is partitioned by
pseudo column '_PARTITIONTIME'; if set, the table is partitioned by
this field. The field must be a top-level TIMESTAMP or DATE field. Its
mode must be NULLABLE or REQUIRED.
In your case, the payload would look like:
{
"tableReference": {
"projectId": "<your project>",
"datasetId": "<your dataset>",
"tableId": "partition"
},
"timePartitioning": {
"type": "DAY",
"field": "<date_or_timestamp_column_name>"
}
}
Alternatively, you can issue a CREATE TABLE DDL statement using standard SQL. To give an example:
#standardSQL
CREATE TABLE `your-project.your-dataset.table`
(
x INT64,
event_date DATE
)
PARTITION BY event_date;
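As a usage note, filtering on the partitioning column is what lets BigQuery prune partitions. A minimal sketch against the table created above (the date literal is arbitrary):
#standardSQL
SELECT COUNT(*) AS rows_on_day
FROM `your-project.your-dataset.table`
WHERE event_date = DATE '2020-01-01'; -- scans only the matching partition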

Handling missing and new fields in tableSchema of BigQuery in Google Cloud Dataflow

Here is the situation:
My BigQuery TableSchema is as follows:
{
"name": "Id",
"type": "INTEGER",
"mode": "nullable"
},
{
"name": "Address",
"type": "RECORD",
"mode": "repeated",
"fields":[
{
"name": "Street",
"type": "STRING",
"mode": "nullable"
},
{
"name": "City",
"type": "STRING",
"mode": "nullable"
}
]
}
I am reading data from a Google Cloud Storage bucket and writing it into BigQuery using a cloud function.
I have defined TableSchema in my cloud function as:
table_schema = bigquery.TableSchema()
Id_schema = bigquery.TableFieldSchema()
Id_schema.name = 'Id'
Id_schema.type = 'INTEGER'
Id_schema.mode = 'nullable'
table_schema.fields.append(Id_schema)
Address_schema = bigquery.TableFieldSchema()
Address_schema.name = 'Address'
Address_schema.type = 'RECORD'
Address_schema.mode = 'repeated'
Street_schema = bigquery.TableFieldSchema()
Street_schema.name = 'Street'
Street_schema.type = 'STRING'
Street_schema.mode = 'nullable'
Address_schema.fields.append(Street_schema)
City_schema = bigquery.TableFieldSchema()
City_schema.name = 'City'
City_schema.type = 'STRING'
City_schema.mode = 'nullable'
Address_schema.fields.append(City_schema)
table_schema.fields.append(Address_schema)
My data file looks like this (each row is a JSON object):
{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}
{"Id": 2, "Address": {"City":"Mumbai"}}
{"Id": 3, "Address": {"Street":"XYZ Road"}}
{"Id": 4}
{"Id": 5, "PhoneNumber": 12345678, "Address": {"Street":"ABCD Road", "City":"Bangalore"}}
Question:
How can I handle when the incoming data has some missing keys?
e.g.,
On row #2 of the data "Street" is missing
On row #3 of the data "City" is missing
On row #4 of the data "Address" is missing
On row #5 of the data "PhoneNumber" shows up..
Question 1: How to handle WriteToBigQuery if the data is missing (e.g., rows #2, #3, #4)?
Question 2: How to handle if a new field shows up in the data?
e.g.,
On row #5 "PhoneNumber" shows up..
How can I add a new column in BigQuery table on the fly?
(Do I have to define the BigQuery table schema exhaustively enough at first in order to accommodate such newly added fields?)
Question 3: How can I iterate through each row (while reading data file) of the incoming data file and determine which fields to parse?
One option for you: instead of struggling with schema changes, I would recommend writing your data into a table with just one field, line, of type string, and applying the schema logic on the fly at query time.
The example below, for BigQuery Standard SQL, shows how to apply a schema on the fly against a table that has the whole row in one field.
#standardSQL
WITH t AS (
SELECT '{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}' line UNION ALL
SELECT '{"Id": 2, "Address": {"City":"Mumbai"}}' UNION ALL
SELECT '{"Id": 3, "Address": {"Street":"XYZ Road"}}' UNION ALL
SELECT '{"Id": 4} ' UNION ALL
SELECT '{"Id": 5, "PhoneNumber": 12345678, "Address": {"Street":"ABCD Road", "City":"Bangalore"}}'
)
SELECT
JSON_EXTRACT_SCALAR(line, '$.Id') id,
JSON_EXTRACT_SCALAR(line, '$.PhoneNumber') PhoneNumber,
JSON_EXTRACT_SCALAR(line, '$.Address.Street') Street,
JSON_EXTRACT_SCALAR(line, '$.Address.City') City
FROM t
with the result as below:
Row  id  PhoneNumber  Street     City
1    1   null         MG Road    Pune
2    2   null         null       Mumbai
3    3   null         XYZ Road   null
4    4   null         null       null
5    5   12345678     ABCD Road  Bangalore
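For completeness, the raw table behind this approach only needs a single string column holding each JSON line; a hypothetical DDL sketch (the project, dataset, and table names are placeholders):
#standardSQL
CREATE TABLE `your-project.your-dataset.raw_lines`
(
line STRING
);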
I think this approach addresses all four of your questions:
Question: How can I handle when the incoming data has some missing keys?
Question 1: How to handle WriteToBigQuery if the data is missing (e.g., rows #2, #3, #4)?
Question 2: How to handle if a new field shows up in the data?
I recommend decoding the JSON string to some data structure, for example a custom Contact class, where you can access and manipulate member variables and define which members are optional and which are required. Using a custom class gives you a level of abstraction so that downstream transforms in the pipeline don't need to worry about how to manipulate JSON. A downstream transform can be implemented to build a TableRow from a Contact object and also adhere to the BigQuery table schema. This design follows general abstraction and separation of concerns principles and is able to handle all scenarios of missing or additional fields.
Question 3: How can I iterate through each row (while reading data file) of the incoming data file and determine which fields to parse?
Dataflow's execution of the pipeline does this automatically. If the pipeline reads from Google Cloud Storage (using TextIO for example), then Dataflow will process each line of the file as an individual element (individual JSON string). Determining which fields to parse is a detail of the business logic and can be defined in a transform which parses the JSON string.