Partitioned table BigQuery (with custom field) - google-cloud-platform

I can't find any examples that show how to write the JSON for a partitioned table using a custom field. Below is an example of how to specify a table partitioned by type "DAY", but if I additionally wanted to partition by a specific field, what would the JSON look like?
{
  "tableReference": {
    "projectId": "bookstore-1382",
    "datasetId": "exports",
    "tableId": "partition"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}

Take a look at the API reference. The timePartitioning object currently supports the following attributes:
expirationMs
field
requirePartitionFilter
type
I won't copy/paste all of the comments here, but this is what it says for field:
[Experimental] [Optional] If not set, the table is partitioned by
pseudo column '_PARTITIONTIME'; if set, the table is partitioned by
this field. The field must be a top-level TIMESTAMP or DATE field. Its
mode must be NULLABLE or REQUIRED.
In your case, the payload would look like:
{
  "tableReference": {
    "projectId": "<your project>",
    "datasetId": "<your dataset>",
    "tableId": "partition"
  },
  "timePartitioning": {
    "type": "DAY",
    "field": "<date_or_timestamp_column_name>"
  }
}
Alternatively, you can issue a CREATE TABLE DDL statement using standard SQL. To give an example:
#standardSQL
CREATE TABLE `your-project.your-dataset.table`
(
  x INT64,
  event_date DATE
)
PARTITION BY event_date;
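If you are creating the table from code rather than via the raw API, here is a minimal sketch using the Python client library (google-cloud-bigquery); the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="your-project")

# The schema must contain the DATE or TIMESTAMP column you partition on.
schema = [
    bigquery.SchemaField("x", "INT64"),
    bigquery.SchemaField("event_date", "DATE"),
]

table = bigquery.Table("your-project.your-dataset.partition_demo", schema=schema)

# Setting `field` partitions by event_date instead of _PARTITIONTIME.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)

table = client.create_table(table)
print(table.full_table_id, table.time_partitioning.field)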

Related

DynamoDB Query with key condition and filter expression by array contains

I want to query a DDB GSI with a key condition and apply a filter on the returned results using the contains function.
The data I have in the DDB table:
{
  "lookupType": "PRODUCT_GROUP",
  "name": "Spring framework study set",
  "structureJson": {
    "productIds": [
      "FBCUPOQsrp",
      "Y4LDaiViLY",
      "J6N3UWq9CK"
    ]
  },
  "ownerId": "mT9R9y6zGO"
}
{
  "lookupType": "PRODUCT_GROUP",
  "name": "Relational databases study set",
  "structureJson": {
    "productIds": [
      "QWQWQWQWQW",
      "XZXZXZXZXZ"
    ]
  },
  "ownerId": "mT9R9y6zGO"
}
...
I have a compound GSI (ownerId - HASH, lookupType - RANGE).
When I try to query DDB (the query structure is in the "2" field), I get the result shown in the "3" field:
{
  "0": [],
  "2": {
    "TableName": "Products",
    "IndexName": "ProductsOwnerIdLookupTypeIndex",
    "KeyConditionExpression": "#ownerId = :ownerId and #lookupType = :lookupType",
    "FilterExpression": "contains(#structureMember, :memberId)",
    "ExpressionAttributeNames": {
      "#ownerId": "ownerId",
      "#lookupType": "lookupType",
      "#structureMember": "structureJson.productIds"
    },
    "ExpressionAttributeValues": {
      ":ownerId": "mT9R9y6zGO",
      ":lookupType": "PRODUCT_GROUP",
      ":memberId": "FBCUPOQsrp"
    }
  },
  "3": {
    "Items": [],
    "Count": 0,
    "ScannedCount": 2
  }
}
The returned result set is empty, even though I have data with the given field values.
How I see the query (or what I want to achieve):
When I query the GSI with ownerId = mT9R9y6zGO and lookupType = PRODUCT_GROUP, it will find 2 items - the Spring and Relational DB sets.
As the second step, DDB will scan the returned query results with the contains(structureJson.productIds, FBCUPOQsrp) filter expression, and it should return one result to me, but I get an empty set.
Is something wrong with the query, or am I missing some point in the DDB query workflow?
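A note on the filter expression above: per the DynamoDB documentation, each ExpressionAttributeNames placeholder substitutes a single path element, so "#structureMember": "structureJson.productIds" refers to a top-level attribute whose name literally contains a dot, not to the nested list. A minimal boto3 sketch with the path split into separate placeholders (table and index names taken from the question):

import boto3

table = boto3.resource("dynamodb").Table("Products")

resp = table.query(
    IndexName="ProductsOwnerIdLookupTypeIndex",
    KeyConditionExpression="#ownerId = :ownerId AND #lookupType = :lookupType",
    # One placeholder per path element; the '.' goes in the expression
    # itself, not inside the placeholder value.
    FilterExpression="contains(#structureJson.#productIds, :memberId)",
    ExpressionAttributeNames={
        "#ownerId": "ownerId",
        "#lookupType": "lookupType",
        "#structureJson": "structureJson",
        "#productIds": "productIds",
    },
    ExpressionAttributeValues={
        ":ownerId": "mT9R9y6zGO",
        ":lookupType": "PRODUCT_GROUP",
        ":memberId": "FBCUPOQsrp",
    },
)
print(resp["Items"])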

How to query AWS DynamoDB using multiple Indexes?

I have an AWS DynamoDb cart table with the following item structure -
{
  "cart_id": "5e4d0f9f-f08c-45ae-986a-f1b5ac7b7c13",
  "user_id": 1234,
  "type": "OTHER",
  "currency": "INR",
  "created_date": 132432423,
  "expiry": 132432425,
  "total_amount": 90000,
  "total_quantity": 2,
  "items": [
    {
      "amount": 90000,
      "category": "Laptops",
      "name": "Apple MacBook Pro",
      "quantity": 1
    }
  ]
}
-
{
  "cart_id": "12340f9f-f08c-45ae-986a-f1b5ac7b1234",
  "user_id": 1234,
  "type": "SPECIAL",
  "currency": "INR",
  "created_date": 132432423,
  "expiry": 132432425,
  "total_amount": 1000,
  "total_quantity": 2,
  "items": [
    {
      "amount": 1000,
      "category": "Special",
      "name": "Special Item",
      "quantity": 1
    }
  ]
}
The table will have cart_id as Primary key,
user_id as an Index or GSI,
type as an Index or GSI.
I want to be able to query the cart table to find the items which have user_id = 1234 AND type != "SPECIAL".
I don't know if this means the query would be -
--key-condition-expression "user_id = 1234 AND type != 'SPECIAL'"
I understand that an AWS DynamoDB table cannot be queried using multiple indexes at the same time.
I came across the following question; it has a similar use case, and the answer recommends creating a composite key:
Querying with multiple local Secondary Index Dynamodb
Does it mean that while putting a new item in the table,
I will need to maintain another column like user_id_type,
with its value as 1234SPECIAL, and create an Index / GSI for user_id_type?
Sample item structure -
{
  "cart_id": "5e4d0f9f-f08c-45ae-986a-f1b5ac7b7c13",
  "user_id": 1234,
  "type": "OTHER",
  "user_id_type": "1234OTHER",
  "currency": "INR",
  "created_date": 132432423,
  "expiry": 132432425,
  "total_amount": 90000,
  "total_quantity": 2,
  "items": [
    {
      "amount": 90000,
      "category": "Laptops",
      "name": "Apple MacBook Pro",
      "quantity": 1
    }
  ]
}
References -
1. Querying with multiple local Secondary Index Dynamodb
2. Is there a way to query multiple hash keys in DynamoDB?
Your assumption is correct. You could add a delimiter between the two fields, e.g. field1_field2, or hash them if either of them is too big in size, e.g. hashOfField1_hashOfField2.
That means spending some more processing power on your side, however, as DynamoDB does not natively support it.
Composite key in DynamoDB with more than 2 columns?
Dynamodb: query using more than two attributes
Additional info on your use case:
A != comparison is not allowed in a KeyConditionExpression (key conditions support only equality on the hash key plus a limited set of comparisons on the range key).
You can put it in the FilterExpression instead (see the sketch below).
Why is there no **not equal** comparison in DynamoDB queries?
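To illustrate the FilterExpression approach, a minimal boto3 sketch; the table name and the GSI user_id-index are placeholders, not names from the question:

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("cart")  # placeholder table name

resp = table.query(
    IndexName="user_id-index",  # hypothetical GSI with user_id as hash key
    KeyConditionExpression=Key("user_id").eq(1234),
    # != cannot appear in the key condition, but ne() works as a filter.
    # The filter runs after the read, so SPECIAL items still consume read
    # capacity even though they are dropped from the results.
    FilterExpression=Attr("type").ne("SPECIAL"),
)
print(resp["Items"])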
Does it mean that while putting a new item in the table,
I will need to maintain another column like user_id_type,
with its value as 1234SPECIAL and create an Index / GSI for user_id_type?
The answer is: it depends on how many columns you need (DynamoDB is schema-less; by a column I mean a data field) and whether you are happy with two round trips to the DB.
Your query:
user_id = 1234 AND type != "SPECIAL"
1- If you need all the information in the cart but you are happy with two round trips:
Solution: Create a GSI with user_id (HASH) and type (RANGE), then add cart_id (the base table's hash key) as a projection.
Explanation: you need one query on the index table to get the cart_id(s) for the given user_id; as noted above, the != check must go into a filter expression rather than the key condition, e.g.
--key-condition-expression "user_id = :uid" --filter-expression "#type <> :type"
Then you use the cart_id(s) from the result to make another query against the base table.
2- If you do not need all of the cart information:
Solution: create a GSI with user_id as HASH and type as RANGE, and add the columns you need to the projection.
Explanation: the projection is the set of additional columns you want to have in your index table. Add the extra columns that are most likely to be needed as query results, to avoid an extra round trip to the base table.
Note: adding too many extra columns can double your costs, as any update on the base table results in updates to the GSI's projected fields.
3- If you want just one round trip and you need all the data, then you need to manage it by yourself, and your suggestion can be applied (see the sketch below).
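A minimal sketch of option 3, assuming a hypothetical user_id_type attribute maintained at write time and a hypothetical GSI user_id_type-index on it (the delimiter is a choice, per the earlier answer):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("cart")  # placeholder table name

item = {
    "cart_id": "5e4d0f9f-f08c-45ae-986a-f1b5ac7b7c13",
    "user_id": 1234,
    "type": "OTHER",
}
# Maintain the composite attribute on every write.
item["user_id_type"] = f"{item['user_id']}#{item['type']}"  # "1234#OTHER"
table.put_item(Item=item)

# An exact match on the composite key is a single round trip. Note that a
# true "type != SPECIAL" still needs begins_with("1234#") plus a filter,
# or one query per known non-SPECIAL type.
resp = table.query(
    IndexName="user_id_type-index",  # hypothetical GSI
    KeyConditionExpression=Key("user_id_type").eq("1234#OTHER"),
)
print(resp["Items"])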
One possible answer is to create a single index with user_id as the hash key (and optionally type as the sort key). Note that a != comparison is not allowed in a KeyConditionExpression, and type is a DynamoDB reserved word, so the not-equal check has to be expressed as a filter:
{
  TableName: "...",
  IndexName: "UserIdAndTypeIndex",
  KeyConditionExpression: "user_id = :user_id",
  FilterExpression: "#type <> :type",
  ExpressionAttributeNames: {
    "#type": "type"
  },
  ExpressionAttributeValues: {
    ":user_id": 1234,
    ":type": "SPECIAL"
  }
}
You can also build a GraphQL schema with AWS AppSync from your DynamoDB table and then query it in your app with GraphQL.

Querying nested JSON structures in AWS Athena

I have the following JSON document format with nested structures:
{
  "id": "p-1234-2132321-213213213-12312",
  "name": "athena to the rescue",
  "groups": [
    {
      "strategy_group": "anyOf",
      "conditions": [
        {
          "strategy_conditions": "anyOf",
          "entries": [
            {
              "c_key": "service",
              "C_operation": "isOneOf",
              "C_value": "mambo,bambo,jumbo"
            },
            {
              "c_key": "hostname",
              "C_operation": "is",
              "C_value": "lols"
            }
          ]
        }
      ]
    }
  ],
  "tags": [
    "aaa",
    "bbb",
    "ccc"
  ]
}
I have created a table in Athena to support it using the following:
CREATE EXTERNAL TABLE IF NOT EXISTS filters (
  id string,
  name string,
  tags array<string>,
  groups array<struct<
    strategy_group:string,
    conditions:array<struct<
      strategy_conditions:string,
      entries: array<struct<
        c_key:string,
        c_operation:string,
        c_value:string
      >>
    >>
  >>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location 's3://filterios/policies/';
My goal at the moment is to query based on the conditions entries columns as well. I have tried some queries; however, SQL is not my strongest suit ;)
At the moment I have gotten to this query, which gives me the entries:
select cnds.entries from
filters,
UNNEST(filters.groups) AS t(grps),
UNNEST(grps.conditions) AS t(cnds)
However, since this is a complex array, it gives me some headache as to what the proper way to query it would be.
Any hints appreciated!
thanks
R
I am not sure whether I understood your query well. Look at the example below; maybe it will be useful to you.
select
  id,
  name,
  tags,
  grps.strategy_group,
  cnds.strategy_conditions,
  enes.c_key,
  enes.c_operation,
  enes.c_value
from
  filters,
  UNNEST(filters.groups) AS t(grps),
  UNNEST(grps.conditions) AS t(cnds),
  UNNEST(cnds.entries) AS t(enes)
where
  enes.c_key = 'service'
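If you want to run such a query programmatically, a minimal boto3 sketch; the database name and output bucket are placeholders:

import boto3

athena = boto3.client("athena")

query = """
select enes.c_key, enes.c_operation, enes.c_value
from filters,
UNNEST(filters.groups) AS t(grps),
UNNEST(grps.conditions) AS t(cnds),
UNNEST(cnds.entries) AS t(enes)
where enes.c_key = 'service'
"""

# Athena runs asynchronously; results land in the given S3 location.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "your_database"},
    ResultConfiguration={"OutputLocation": "s3://your-results-bucket/athena/"},
)
print(execution["QueryExecutionId"])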
Here is one example I recently worked with that may help.
My JSON:
{
  "type": "FeatureCollection",
  "features": [{
    "first": "raj",
    "geometry": {
      "type": "Point",
      "coordinates": [-117.06861096, 32.57889962]
    },
    "properties": "someprop"
  }]
}
Created external table:
CREATE EXTERNAL TABLE `jsondata`(
  `type` string COMMENT 'from deserializer',
  `features` array<struct<type:string,geometry:struct<type:string,coordinates:array<string>>>> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'paths'='features,type')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://vicinitycheck/rawData/jsondata/'
TBLPROPERTIES (
  'classification'='json')
Query data:
SELECT type AS TypeEvent,
  features[1].geometry.coordinates AS FeatherType
FROM test_vicinitycheck.jsondata
WHERE type = 'FeatureCollection'
test_vicinitycheck is my database name in Athena; jsondata is the table name.
I documented some examples on my blog if it helps:
http://weavetoconnect.com/aws-athena-and-nested-json/

Bigquery extract nested JSON using UNNEST

I have a table in Bigquery which has JSON data like below.
{
  "block_id": "000000000000053d90510fa4bbfbbed243baca490c85ac7856b1a1fab4d367e4",
  "transactions": [
    {
      "transaction_id": "4529b00ed3315ff85408118ef5992b3ad2b47f4c1c088cc3dea46084bdb600df",
      "inputs": [
        {
          "input_script_bytes": "BIvbBRoDwAgBEi9QMlNIL0JJUDE2L3NsdXNoL1Is+r5tbf4lsR1tDNnUOZk9JGzN4MkWc914Rol/+47Hn+msUG/nAQAAAAAAAAA=",
          "input_pubkey_base58_error": null
        }
      ],
      "outputs": [
        {
          "output_satoshis": "5048296000",
          "output_pubkey_base58_error": "Cannot cast this script to a pay-to-address type"
        }
      ]
    },
    {
      "transaction_id": "838b03a6f741c844e22079cdb0d1401b9687d65a82f355ccb0a993b042c49d54",
      "inputs": [
        {
          "input_script_bytes": "RzBEAiAE5fM2NHAEaWy9utrC2ypHQsKwUDeUTp/gjbj5tSy3lwIgUXXFcuwXhr3tx1m5D+kznhklTAK9+YYHRcB43aXTAZ8BQQR86qInfhczeYqqJsAD9yFfxSAzBAmIBlxk/bpTQSxgLkF4Ttipiuuoxt6TTVMDK/eewwFhAPJiHrvZq0psKI1d",
          "input_pubkey_base58_error": null
        }
      ],
      "outputs": [
        {
          "output_satoshis": "1",
          "output_pubkey_base58_error": null
        },
        {
          "output_satoshis": "4949999",
          "output_script_bytes": "dqkU4E0i4TQg1I6OpprIt6v7Ipuda/GIrA==",
          "output_pubkey_base58_error": null
        }
      ]
    }
  ]
}
I want to extract the transaction_id and output.input_pubkey_base58_error from this table.
How can I achieve this by using UNNEST?
You can refer to the example code above.
It looks like the syntax should be like this (didn't try it!), guessing that your table is called mybitcoindata in BigQuery:
SELECT block_id, output.output_pubkey_base58_error
FROM yourdataset.yourtable as A
CROSS JOIN UNNEST(A.transactions) AS transaction
CROSS JOIN UNNEST(transaction.outputs) AS output
;
There are very good examples here
EDIT:
Just tested: if you convert your JSON data to single-line JSON, you can create the table in BigQuery. The above query works to explode multiple arrays.
First of all, I would like to clarify that you said you are interested in the fields transaction_id and output.input_pubkey_base58_error, but the latter does not exist according to the table schema (maybe you were referring to inputs.input_pubkey_base58_error or outputs.output_pubkey_base58_error), so it is worth clarifying your scenario and/or use case.
In any case, working with the public Bitcoin dataset you mentioned, you can use a query like the one below in order to query (using Standard SQL) only for the fields you are interested in.
#standardSQL
SELECT
  tr.transaction_id,
  inp.input_pubkey_base58_error,
  out.output_pubkey_base58_error
FROM
  `bigquery-public-data.bitcoin_blockchain.blocks`,
  UNNEST(transactions) AS tr,
  UNNEST(tr.inputs) AS inp,
  UNNEST(tr.outputs) AS out
LIMIT
  100
In this query, I am making use of the UNNEST Standard SQL operator in order to query for specific fields inside an array, but I strongly recommend you go through the documentation to see more details and specific examples of how it works.
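If you would rather run this from Python, a minimal sketch using the google-cloud-bigquery client (the project name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client(project="your-project")

sql = """
SELECT
  tr.transaction_id,
  inp.input_pubkey_base58_error,
  out.output_pubkey_base58_error
FROM
  `bigquery-public-data.bitcoin_blockchain.blocks`,
  UNNEST(transactions) AS tr,
  UNNEST(tr.inputs) AS inp,
  UNNEST(tr.outputs) AS out
LIMIT 100
"""

# Iterating the query job waits for completion; rows behave like dicts
# keyed by the selected column names.
for row in client.query(sql):
    print(row["transaction_id"], row["output_pubkey_base58_error"])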

Query DynamoDb for multiple items

I think I'm misunderstanding DynamoDB. I would like to query for all items where a child field of the JSON matches an identifier I'm passing. The structure is something like -
{
  "messageId": "ced96cab-767e-509198be5-3d2896a3efeb",
  "identifier": {
    "primary": "9927fd47-5d33-4f51-a5bb-f292a0c733b1",
    "secondary": "none",
    "tertiary": "cfd96cab-767e-5091-8be5-3d2896a3efeb"
  },
  "attributes": {
    "MyID": {
      "Type": "String",
      "Value": "9927fd47-5c33-4f51-a5bb-f292a0c733b1"
    }
  }
}
I would like to query for all items in DynamoDB that have the value of MyID that I'm passing. Everything I've read seems to say you need to use the key, which in my case is the messageId; this is unique for each entry and not a value I can use.
Hope this makes sense.
The DynamoDB Query API can be used only if you know the value of the partition key. Otherwise, you need to scan the whole table with a FilterExpression to find the items.
Scanning tables
You can create a GSI on a scalar attribute only. In the case above, attributes is a document data type (i.e. MAP), so a GSI can't be created on it.
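To make the scan fallback concrete, a minimal boto3 sketch (the table name is a placeholder) filtering on the nested value:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("messages")  # placeholder name

my_id = "9927fd47-5c33-4f51-a5bb-f292a0c733b1"
filter_expr = Attr("attributes.MyID.Value").eq(my_id)

# Scan reads the whole table; the filter is applied after the read, so
# you pay for every item scanned, not just the matches.
resp = table.scan(FilterExpression=filter_expr)
items = resp["Items"]

# Scans are paginated; follow LastEvaluatedKey to get all pages.
while "LastEvaluatedKey" in resp:
    resp = table.scan(
        FilterExpression=filter_expr,
        ExclusiveStartKey=resp["LastEvaluatedKey"],
    )
    items.extend(resp["Items"])

print(items)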