Redshift Spectrum: Query Anonymous JSON array structure

I have a JSON array of structures in S3, that is successfully Crawled & Cataloged by Glue.
[{"key":"value"}, {"key":"value"}]
I'm using the custom Classifier:
$[*]
When trying to query from Spectrum, however, it returns:
Top level Ion/JSON structure must be an anonymous array if and only if
serde property 'strip.outer.array' is set. Mismatch occured in file...
I set that serde property manually in the Glue catalog table, but nothing changed.
Is it not possible to query an anonymous array via Spectrum?

Naming the array in the JSON file like this:
"values":[{"key":"value"},...}
And updating the classifier:
$.values[*]
Fixes the issue... Interested to know if there is a way to query anonymous arrays though. It seems pretty common to store data like that.
Update:
In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now still no solution other than using individual records per line:
{"key":"value"}
{"key":"value"}
etc.
It does seem to be a Spectrum-specific issue, as Athena would still work.
Interested to know if anyone else was able to get it to work...

I've successfully done this, but without a data classifier. My JSON file looks like:
[
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  ...
]
I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration option under Output can't be set to "Update the table definition...", or else re-running the crawler later will overwrite the manual changes described below. I used "Add new columns only".
I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:
anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>>
partition_0 string
I manually updated my schema to:
col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string
(I also added the serde property strip.outer.array.)
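Expressed as external-table DDL, the edited definition would look roughly like this (a sketch only; the column types, the fields inside col3, and the S3 location are placeholders for my actual ones):
CREATE EXTERNAL TABLE db.tablename (
  col1 string,
  col2 string,
  col3 array<struct<col4:string>>
)
PARTITIONED BY (partition_0 string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('strip.outer.array' = 'true')
LOCATION 's3://your-bucket/your-prefix/';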
Then I had to rerun my crawler, and finally I could query in Spectrum like:
select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;

You can use json_extract_path_text for extracting an element by key, or json_extract_array_element_text('json string', pos [, null_if_invalid ]) for extracting by position.
For example, for the element at index 2 (zero-based):
select json_extract_array_element_text('[111,112,113]', 2);
output: 113
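json_extract_path_text works the same way but walks object keys instead of array positions, for example (a quick sketch):
select json_extract_path_text('{"outer":{"inner":"value"}}', 'outer', 'inner');
output: value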

If your table's structure is as follows:
CREATE EXTERNAL TABLE spectrum.testjson(
  id varchar(25),
  columnName array<struct<key:varchar(20),value:varchar(20)>>
);
you can use the following query to access the array element:
SELECT c.id, o.key, o.value FROM spectrum.testjson c, c.columnName o;
For more information, you can refer to the AWS documentation:
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data-sqlextensions.html

Related

Reading Json data from Athena

I have created a table by mapping the JSON data; unfortunately, I am not able to read the nested array within the JSON.
{
  "total": 10,
  "count": 100,
  "values": {
    "source": [
      {"sourceid": "10001", "source": "ABC"},
      {"sourceid": "10002", "source": "XYZ"}
    ]
  }
}
Athena table:
CREATE EXTERNAL TABLE source_master_data(
total bigint,
count bigint,
values struct<source: array<struct<sourceid: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/'
I am trying to read the sourceid and source, but no luck. Can anyone help me out?
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1
The UNNEST needs to be applied to the array type. In your query, you are trying to unnest the struct, which is not possible in Athena.
The second issue is the use of values without quotes. This also fails, because values is a reserved word in Athena.
The overall query would look something like this.
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)
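If you also need the source field, note that the table definition above only declares sourceid inside the struct; assuming you extend it to values struct<source: array<struct<sourceid: string, source: string>>>, a sketch of the query would be:
select t1.source.sourceid, t1.source.source
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)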

Athena Schema creation when log format has missing fields

I have a custom log format where the log entries vary by the request type. So certain rows have more fields.
Can we specify certain fields as optional so that in rows that they are missing, the values will be set to certain default (null, 0)?
Here are some hypothetical log entries:
{"data":"[2017-09-10 10:44:54.448998 -0000] info ip=773.555.557.445 cluster=\"production\" query=uris type=TXT class=IN rcode=NXDOMAIN cnt=0 offset=74","header":{"recvtime":"2017-09-10 10:45:02","server":"m0107481","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=991.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=NOERROR cnt=1 offset=90 score=400","header":{"recvtime":"2017-09-10 10:45:02","server":"m010748","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=971.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=REFUSED cnt=1","header":{"recvtime":"2017-09-10 10:45:02","server":"m010574","refid":"ABC-123"}}
Note that each row of the log data is in JSON format, and the header part is fixed. If query in data is dnsbl, then sometimes the row has a score field, but other times it is missing. I am planning to use Athena to parse this type of data from S3 and query for some stats along the lines of: what % of the data are dns queries and what % have a score above 300.
It looks like your data is JSON with embedded structured logging in the data field. As long as the data is well-formed JSON with one object per line, you should be able to create a JSON table and then use functions to extract the other pieces out of the data field. You can create a view that does the extraction so that you don't have to do that in every query.
I'm thinking something like this:
CREATE EXTERNAL TABLE raw_log_entries (
data string,
header struct<recvtime: string, server: string, refid: string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://some-bucket/and/path/';
CREATE VIEW log_entries AS
SELECT
header.recvtime,
header.server,
header.refid,
regexp_extract(data, 'query=(\S+)', 1) AS query,
regexp_extract(data, 'type=(\S+)', 1) AS type,
regexp_extract(data, 'score=(\S+)', 1) AS score
-- and so on
FROM raw_log_entries;
You'll have to experiment with the regexes; since I don't have your data, I can't be sure they will work for all cases, but I hope you get the idea.
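As a rough sketch of the kind of stats mentioned in the question, run against the log_entries view above (assuming score is simply absent when it wasn't logged):
SELECT
  -- share of rows that are dnsbl queries
  CAST(count_if(query = 'dnsbl') AS double) / count(*) AS pct_dnsbl,
  -- share of rows with a score above 300
  CAST(count_if(TRY_CAST(score AS integer) > 300) AS double) / count(*) AS pct_score_above_300
FROM log_entries;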

Bigquery extract nested JSON using UNNEST

I have a table in Bigquery which has JSON data like below.
{
  "block_id": "000000000000053d90510fa4bbfbbed243baca490c85ac7856b1a1fab4d367e4",
  "transactions": [
    {
      "transaction_id": "4529b00ed3315ff85408118ef5992b3ad2b47f4c1c088cc3dea46084bdb600df",
      "inputs": [
        {
          "input_script_bytes": "BIvbBRoDwAgBEi9QMlNIL0JJUDE2L3NsdXNoL1Is+r5tbf4lsR1tDNnUOZk9JGzN4MkWc914Rol/+47Hn+msUG/nAQAAAAAAAAA=",
          "input_pubkey_base58_error": null
        }
      ],
      "outputs": [
        {
          "output_satoshis": "5048296000",
          "output_pubkey_base58_error": "Cannot cast this script to a pay-to-address type"
        }
      ]
    },
    {
      "transaction_id": "838b03a6f741c844e22079cdb0d1401b9687d65a82f355ccb0a993b042c49d54",
      "inputs": [
        {
          "input_script_bytes": "RzBEAiAE5fM2NHAEaWy9utrC2ypHQsKwUDeUTp/gjbj5tSy3lwIgUXXFcuwXhr3tx1m5D+kznhklTAK9+YYHRcB43aXTAZ8BQQR86qInfhczeYqqJsAD9yFfxSAzBAmIBlxk/bpTQSxgLkF4Ttipiuuoxt6TTVMDK/eewwFhAPJiHrvZq0psKI1d",
          "input_pubkey_base58_error": null
        }
      ],
      "outputs": [
        {
          "output_satoshis": "1",
          "output_pubkey_base58_error": null
        },
        {
          "output_satoshis": "4949999",
          "output_script_bytes": "dqkU4E0i4TQg1I6OpprIt6v7Ipuda/GIrA==",
          "output_pubkey_base58_error": null
        }
      ]
    }
  ]
}
I want to extract the transaction_id and output.input_pubkey_base58_error from this table.
How can I achieve this using UNNEST?
You can refer to the example code above.
It looks like the syntax should be like this (didn't try it!), guessing that your table is called mybitcoindata in BigQuery:
SELECT block_id, output.output_pubkey_base58_error
FROM yourdataset.yourtable as A
CROSS JOIN UNNEST(A.transactions) AS transaction
CROSS JOIN UNNEST(transaction.outputs) AS output
;
There are very good examples here
EDIT:
Just tested. If you convert your JSON data to single-line JSON, you can create the table in BigQuery. The above query works to explode multiple arrays.
First of all, I would like to clarify that you said you are interested in the fields transaction_id and output.input_pubkey_base58_error, but the latter does not exist according to the table schema (maybe you were referring to inputs.input_pubkey_base58_error or outputs.output_pubkey_base58_error). So I believe it is worth clarifying your scenario and/or use case.
In any case, working with the public Bitcoin dataset you mentioned, you can use a query like the one below (in Standard SQL) to retrieve only the fields you are interested in.
#standardSQL
SELECT
tr.transaction_id,
inp.input_pubkey_base58_error,
out.output_pubkey_base58_error
FROM
`bigquery-public-data.bitcoin_blockchain.blocks`,
UNNEST(transactions) AS tr,
UNNEST(tr.inputs) AS inp,
UNNEST(tr.outputs) as out
LIMIT
100
In this query, I am making use of the UNNEST Standard SQL operator to query for specific fields inside an array, but I strongly recommend going through the documentation for more details and specific examples of how it works.
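As a small variant of the query above (same dataset and fields), you could keep only the outputs whose decoding actually failed:
#standardSQL
SELECT
  tr.transaction_id,
  out.output_pubkey_base58_error
FROM
  `bigquery-public-data.bitcoin_blockchain.blocks`,
  UNNEST(transactions) AS tr,
  UNNEST(tr.outputs) AS out
WHERE
  out.output_pubkey_base58_error IS NOT NULL
LIMIT
  100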

AWS Athena flattened data from nested JSON source

I'd like to create a table from nested JSON in Athena. The solutions described here, using tools like the Hive OpenX JsonSerDe, attempt to mirror the JSON data in the SQL statement. I just want to get a few fields from the JSON file and create the table. I can't seem to find any resources on how to do that.
E.g.
JSON file {"records": [{"a": "data1", "b": "data2", "c": "data3"}]}
The table I'd like to create has only columns a and b.
I think what you are trying to achieve is unnesting the array to transform one array entry into one row.
This is possible through the correct querying of your data structure.
table definition:
CREATE external TABLE complex (
records array<struct<a:string,b:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/test1/';
query:
select record.a,record.b from complex
cross join UNNEST(complex.records) as t1(record);

dynamodb - scan items where map contains a key

I have a table that contains a field (not a key field), called appsMap, and it looks like this:
appsMap = { "qa-app": "abc", "another-app": "xyz" }
I want to scan all rows whose appsMap contains the key "qa-app" (the value is not important, just the key). I tried something like this but it doesn't work in the way I need:
FilterExpression = '#appsMap.#app <> :v',
ExpressionAttributeNames = {
"#app": "qa-app",
"#appsMap": "appsMap"
},
ExpressionAttributeValues = {
":v": { "NULL": True }
},
ProjectionExpression = "deviceID"
What's the correct syntax?
Thanks.
There is a discussion on the subject here:
https://forums.aws.amazon.com/thread.jspa?threadID=164470
You might be missing this part from the example:
ExpressionAttributeValues: {":name":{"S":"Jeff"}}
However, I just wanted to echo what was already said: scan is an expensive operation that goes through every item, which makes your database hard to scale.
Unlike with other databases, you have to do plenty of setup with DynamoDB to get it to perform at its best. Here is a suggestion:
1) Convert this into a root value, for example add to the root: qaExist, with possible values of 0|1 or true|false.
2) Create secondary index for the newly created value.
3) Make query on the new index specifying 0 as a search parameter.
This will make your system very fast and very scalable regardless of how many records you get in there later on.
If I understand the question correctly, you can do the following:
FilterExpression = 'attribute_exists(#0.#1)',
ExpressionAttributeNames = {
"#0": "appsMap",
"#1": "qa-app"
},
ProjectionExpression = "deviceID"
Since you're being a bit vague about your expectations and what's happening ("I tried something like this but it doesn't work in the way I need"), I'd like to mention that a scan with a filter is very different from a query.
Filters are applied on the server, but only after the scan request is executed, meaning that it still iterates over all the data in your table; instead of returning each item, it applies the filter to each response, saving you some network bandwidth but potentially returning empty results as you page through your entire table.
You could look into creating a GSI on the table if this is a query you expect to have to run often.