BigQuery: extract nested JSON using UNNEST - google-cloud-platform

I have a table in BigQuery which has JSON data like the following:
{
  "block_id": "000000000000053d90510fa4bbfbbed243baca490c85ac7856b1a1fab4d367e4",
  "transactions": [
    {
      "transaction_id": "4529b00ed3315ff85408118ef5992b3ad2b47f4c1c088cc3dea46084bdb600df",
      "inputs": [
        {
          "input_script_bytes": "BIvbBRoDwAgBEi9QMlNIL0JJUDE2L3NsdXNoL1Is+r5tbf4lsR1tDNnUOZk9JGzN4MkWc914Rol/+47Hn+msUG/nAQAAAAAAAAA=",
          "input_pubkey_base58_error": null
        }
      ],
      "outputs": [
        {
          "output_satoshis": "5048296000",
          "output_pubkey_base58_error": "Cannot cast this script to a pay-to-address type"
        }
      ]
    },
    {
      "transaction_id": "838b03a6f741c844e22079cdb0d1401b9687d65a82f355ccb0a993b042c49d54",
      "inputs": [
        {
          "input_script_bytes": "RzBEAiAE5fM2NHAEaWy9utrC2ypHQsKwUDeUTp/gjbj5tSy3lwIgUXXFcuwXhr3tx1m5D+kznhklTAK9+YYHRcB43aXTAZ8BQQR86qInfhczeYqqJsAD9yFfxSAzBAmIBlxk/bpTQSxgLkF4Ttipiuuoxt6TTVMDK/eewwFhAPJiHrvZq0psKI1d",
          "input_pubkey_base58_error": null
        }
      ],
      "outputs": [
        {
          "output_satoshis": "1",
          "output_pubkey_base58_error": null
        },
        {
          "output_satoshis": "4949999",
          "output_script_bytes": "dqkU4E0i4TQg1I6OpprIt6v7Ipuda/GIrA==",
          "output_pubkey_base58_error": null
        }
      ]
    }
  ]
}
I want to extract the transaction_id and output.input_pubkey_base58_error from this table.
How can I achieve this using UNNEST?
You can refer to the example data above.

It looks like the syntax should be like this (I didn't try it!). I'm guessing that your table is called mybitcoindata in BigQuery:
SELECT block_id, output.output_pubkey_base58_error
FROM yourdataset.yourtable as A
CROSS JOIN UNNEST(A.transactions) AS transaction
CROSS JOIN UNNEST(transaction.outputs) AS output
;
There are very good examples here.
EDIT:
Just tested it. If you convert your JSON data to single-line (newline-delimited) JSON, you can create the table in BigQuery. The above query works to explode multiple arrays.
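For reference, a minimal sketch of loading the converted newline-delimited JSON with the Python client library; the file name blocks.json and the table name yourdataset.yourtable are placeholders, not values from the question:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and file names.
table_id = "yourdataset.yourtable"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the nested/repeated schema
)

with open("blocks.json", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish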

First of all, I would like to clarify that you said you are interested in the fields transaction_id and output.input_pubkey_base58_error, but the latter does not exist according to the table schema (maybe you were referring to inputs.input_pubkey_base58_error or outputs.output_pubkey_base58_error). So I believe it is worth clarifying your scenario and/or use case.
In any case, working with the public Bitcoin dataset you mentioned, you can use a query like the one below in order to query (using Standard SQL) only for the fields you are interested in.
#standardSQL
SELECT
tr.transaction_id,
inp.input_pubkey_base58_error,
out.output_pubkey_base58_error
FROM
`bigquery-public-data.bitcoin_blockchain.blocks`,
UNNEST(transactions) AS tr,
UNNEST(tr.inputs) AS inp,
UNNEST(tr.outputs) as out
LIMIT
100
In this query, I am making use of the UNNEST Standard SQL operator in order to query for specific fields inside an array, but I strongly recommend you go through the documentation to see more details and specific examples of how it works.

Related

How to write an AWS AppSync response mapping template for an RDS data source

I have been following this guide for querying an Aurora Serverless database through an AppSync schema. Now I want to run a couple of queries at the same time with a request mapping like:
{
  "version": "2018-05-29",
  "statements": [
    "SELECT * FROM MyTable WHERE category='$ctx.args.category'",
    "SELECT COUNT(*) FROM MyTable WHERE category='$ctx.args.category'"
  ]
}
So, how do I handle multiple SELECTs in the response mapping? The page has a few examples, but none has two SELECTs:
$utils.toJson($utils.rds.toJsonObject($ctx.result)[0]) ## For first item results
$utils.toJson($utils.rds.toJsonObject($ctx.result)[0][0]) ## For first item of first query
$utils.toJson($utils.rds.toJsonObject($ctx.result)[1][0]) ## For first item of second query
$utils.toJson($utils.rds.toJsonObject($ctx.result)??????) ## ?? For first & second item results
I expect the response type to be like the following, but it is not strict as long as I can get the values.
type MyResponse {
  MyResponseItemList: [MyResponseItem]
  Count: Int
}
type MyResponseItem {
  Id: ID!
  Name: String
  ...
}
Doing two selects will not work with AppSync.
I suggest you either break apart the two SQL queries into two different GraphQL query operations or combine the two SQL queries into one.
I faced the same issue and got this working as follows.
Instead of having Count as a direct Int type result, I converted that into another type called PaginationResult.
type MyResponse {
  MyResponseItemList: [MyResponseItem]
  Count: PaginationResult
}
type PaginationResult {
  Count: Int
}
type MyResponseItem {
  ...
}
Response Velocity Template
#set($resMap = {
  "MyResponseItemList": $utils.rds.toJsonObject($ctx.result)[0],
  "Count": $utils.rds.toJsonObject($ctx.result)[1][0]
})
$util.toJson($resMap)
FWIW, I just got a UNION ALL AppSync/RDS request resolver query with two SELECTs working:
{
  "version": "2018-05-29",
  "statements": ["SELECT patientIDa, patientIDb, distance FROM Distances WHERE patientIDa='$ctx.args.patientID' UNION ALL SELECT patientIDb, patientIDa, distance FROM Distances WHERE patientIDb='$ctx.args.patientID'"]
}
Not sure if this will help the OP but it may.
Note: in my case (maybe because I'm on Windows) the ENTIRE ["SELECT...] statement needs to be on one line (no CR/LF), or else GraphQL errors with "non-escaped character..." (tested using GraphiQL).

Redshift Spectrum: Query Anonymous JSON array structure

I have a JSON array of structures in S3 that is successfully crawled and cataloged by Glue.
[{"key":"value"}, {"key":"value"}]
I'm using the custom Classifier:
$[*]
When trying to query from Spectrum, however, it returns:
Top level Ion/JSON structure must be an anonymous array if and only if
serde property 'strip.outer.array' is set. Mismatch occured in file...
I set that serde property manually in the Glue catalog table, but nothing changed.
Is it not possible to query an anonymous array via Spectrum?
Naming the array in the JSON file like this:
{"values":[{"key":"value"},...]}
And updating the classifier:
$.values[*]
Fixes the issue... I'm interested to know if there is a way to query anonymous arrays though. It seems pretty common to store data like that.
Update:
In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now still no solution other than using individual records per line:
{"key":"value"}
{"key":"value"}
etc.
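One way to produce that format is to rewrite the anonymous array as one JSON object per line; a minimal sketch, with placeholder file names:
import json

# Convert an anonymous JSON array into newline-delimited JSON,
# one object per line.
with open("input.json") as src, open("output.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")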
It does seem to be a Spectrum specific issue, as Athena would still work.
Interested to know if anyone else was able to get it to work...
I've successfully done this, but without a data classifier. My JSON file looks like:
[
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  ...
]
I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration options under Output CAN'T be set to Update the table definition..., or else re-running the crawler later will overwrite the manual changes described below. I used Add new columns only.
I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:
anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>>
partition_0 string
I manually updated my schema to:
col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string
(And also add the serde param strip.outer.array.)
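If you prefer to script that change rather than edit the table in the Glue console, a rough boto3 sketch might look like the following (the database and table names are placeholders, and it assumes the property belongs in the SerDe parameters):
import boto3

glue = boto3.client("glue")
database, table_name = "db", "tablename"  # placeholder names

# Fetch the current catalog entry, add the SerDe parameter, and write it
# back. update_table only accepts TableInput fields, so copy just those.
table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
table["StorageDescriptor"]["SerdeInfo"].setdefault("Parameters", {})["strip.outer.array"] = "true"

allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
glue.update_table(DatabaseName=database,
                  TableInput={k: v for k, v in table.items() if k in allowed})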
Then I had to rerun my crawler, and finally I could query in Spectrum like:
select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;
You can use json_extract_path_text to extract an element by key, or json_extract_array_element_text('json string', pos [, null_if_invalid ]) to extract by position.
For example, for the element at index 2 (positions are zero-based):
select json_extract_array_element_text('[111,112,113]', 2);
output: 113
If your table's structure is as follows:
CREATE EXTERNAL TABLE spectrum.testjson(
  id varchar(25),
  columnName array<struct<key:varchar(20),value:varchar(20)>>
);
you can use the following query to access the array element:
SELECT c.id, o.key, o.value FROM spectrum.testjson c, c.columnName o;
For more information you can refer to the AWS documentation:
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data-sqlextensions.html

Is it possible in BigQuery to assign more than 10 thousand parameters in an `IN` query?

I defined the following schema in BigQuery:
[
  {
    "mode": "REQUIRED",
    "name": "customer_id",
    "type": "STRING"
  },
  {
    "mode": "REPEATED",
    "name": "segments",
    "type": "RECORD",
    "fields": [
      {
        "mode": "REQUIRED",
        "name": "segment_id",
        "type": "STRING"
      }
    ]
  }
]
I am trying to insert a new segment_id for specific customer ids with something like this:
#standardSQL
UPDATE `sample-project.customer_segments.segments`
SET segments = ARRAY(
SELECT segment FROM UNNEST(segments) AS segment
UNION ALL
SELECT STRUCT('NEW_SEGMENT')
)
WHERE customer_id IN ('0000000000', '0000000001', '0000000002')
Is it possible to assign more than 10 thousand customer_ids to an IN query in BigQuery?
Is it possible to assign more than 10 thousand customer_ids to an IN query in BigQuery?
Assuming (based on the example in your question) that each customer_id is around 10 characters, plus three characters for the quotes and comma, you will end up with roughly an extra 130 KB, which is within the 256 KB limit (see more in Quotas & Limits).
So you should be fine with 10K, and you can easily calculate the ceiling; it looks like the limit would be around 19K customer IDs (a quick back-of-the-envelope check is sketched after the limits below).
Just to clarify, I meant the limitations below (mostly the first one):
Maximum unresolved query length — 256 KB
Maximum resolved query length — 12 MB
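A minimal sketch of that estimate (the per-value and base-query byte counts are assumptions, not measured values):
# Back-of-the-envelope estimate of how many inlined IDs fit in one query.
MAX_UNRESOLVED_QUERY_BYTES = 256 * 1024  # documented limit above
BYTES_PER_VALUE = 13                     # ~10-char ID plus quotes and a comma
BASE_QUERY_BYTES = 300                   # assumed size of the rest of the UPDATE

print((MAX_UNRESOLVED_QUERY_BYTES - BASE_QUERY_BYTES) // BYTES_PER_VALUE)
# -> about 20,000 values, the same ballpark as the ~19K estimate above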
When working with a long list of possible values, it's a good idea to use a query parameter instead of inlining the entire list into the query, assuming you are working with the command line client or API. For example,
#standardSQL
UPDATE `sample-project.customer_segments.segments`
SET segments = ARRAY(
SELECT segment FROM UNNEST(segments) AS segment
UNION ALL
SELECT STRUCT('NEW_SEGMENT')
)
WHERE customer_id IN UNNEST(@customer_ids)
Here you would create a query parameter of type ARRAY<STRING> containing the customer IDs.
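For instance, a sketch using the Python client library might look like this (the project, dataset, and ID list are taken from the example above; adjust to your own table):
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder list; in practice this can hold tens of thousands of IDs
# without bloating the query text itself.
customer_ids = ["0000000000", "0000000001", "0000000002"]

query = """
UPDATE `sample-project.customer_segments.segments`
SET segments = ARRAY(
  SELECT segment FROM UNNEST(segments) AS segment
  UNION ALL
  SELECT STRUCT('NEW_SEGMENT')
)
WHERE customer_id IN UNNEST(@customer_ids)
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("customer_ids", "STRING", customer_ids)
    ]
)
client.query(query, job_config=job_config).result()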

Partitioned table BigQuery (with custom field)

I can't find any examples that show how to write the JSON for a partitioned table using a custom field. Below is an example of how to specify a table partitioned by the type "DAY", but if, in addition, I would like to partition by a specific field, what would the JSON look like?
{
  "tableReference": {
    "projectId": "bookstore-1382",
    "datasetId": "exports",
    "tableId": "partition"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}
Take a look at the API reference. The timePartitioning object currently supports the following attributes:
expirationMs
field
requirePartitionFilter
type
I won't copy/paste all of the comments here, but this is what it says for field:
[Experimental] [Optional] If not set, the table is partitioned by
pseudo column '_PARTITIONTIME'; if set, the table is partitioned by
this field. The field must be a top-level TIMESTAMP or DATE field. Its
mode must be NULLABLE or REQUIRED.
In your case, the payload would look like:
{
  "tableReference": {
    "projectId": "<your project>",
    "datasetId": "<your dataset>",
    "tableId": "partition"
  },
  "timePartitioning": {
    "type": "DAY",
    "field": "<date_or_timestamp_column_name>"
  }
}
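If you are using the Python client library rather than calling the REST API directly, a roughly equivalent sketch (the table ID and schema here are placeholders) would be:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID and schema.
table = bigquery.Table(
    "your-project.your-dataset.table",
    schema=[
        bigquery.SchemaField("x", "INT64"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # partition by this column instead of _PARTITIONTIME
)
client.create_table(table)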
Alternatively, you can issue a CREATE TABLE DDL statement using standard SQL. To give an example:
#standardSQL
CREATE TABLE `your-project.your-dataset.table`
(
x INT64,
event_date DATE
)
PARTITION BY event_date;

Query DynamoDB for multiple items

I think I'm misunderstanding DynamoDB. I would like to query for all items whose child field in the JSON matches an identifier I'm passing. The structure is something like this:
{
  "messageId": "ced96cab-767e-509198be5-3d2896a3efeb",
  "identifier": {
    "primary": "9927fd47-5d33-4f51-a5bb-f292a0c733b1",
    "secondary": "none",
    "tertiary": "cfd96cab-767e-5091-8be5-3d2896a3efeb"
  },
  "attributes": {
    "MyID": {
      "Type": "String",
      "Value": "9927fd47-5c33-4f51-a5bb-f292a0c733b1"
    }
  }
}
I would like to query for all items in DynamoDB that have a value of MyID matching the one I'm passing. Everything I've read seems to say you need to use the key, which in my case is the messageId; this is unique for each entry and not a value I can use.
Hope this makes sense.
The DynamoDB Query API can be used only if you know the value of the partition key. Otherwise, you may need to scan the whole table using a FilterExpression to find the items.
Scanning tables
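For example, a rough boto3 scan sketch (the table name is a placeholder; the attribute path follows the item structure above):
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Messages")  # placeholder table name

my_id = "9927fd47-5c33-4f51-a5bb-f292a0c733b1"
filter_expr = Attr("attributes.MyID.Value").eq(my_id)

# Scan reads every item and filters afterwards, so it is far more
# expensive than a keyed Query on large tables.
response = table.scan(FilterExpression=filter_expr)
items = response["Items"]
while "LastEvaluatedKey" in response:  # keep paging through the table
    response = table.scan(FilterExpression=filter_expr,
                          ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])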
You can create a GSI on a top-level scalar attribute only. In the above case, MyID is nested inside a document data type (i.e. a MAP), so a GSI can't be created on it.