Retain column name case in Athena JSON output - amazon-athena

I have a table in Athena and I am querying it with a CTAS query to produce a JSON result file. But the resulting JSON file does not preserve the camel case of either the column names or the aliases provided for them in the query.
Please suggest how I can overcome this issue.
For example:
{"email":"abc_def#ghi.com", "firstname": "abc", "lastname": "def"}
This should come up in the JSON file (the result of the Athena query) as:
{"email":"abc_def#ghi.com", "firstName": "abc", "lastName": "def"}
Query that I have tried:
CREATE TABLE users WITH (
    format = 'JSON',
    bucket_count = 1,
    bucketed_by = ARRAY['lastName']
) AS
SELECT email,
    firstname AS "firstName",
    lastname AS "lastName"
FROM all_users;
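One possible workaround, since Athena stores column names in lowercase, is to rename the keys after the CTAS query has run. A minimal Python sketch, assuming the CTAS output has been downloaded and decompressed to a local JSON Lines file (the file names and key mapping here are only illustrative):

import json

# Illustrative mapping from Athena's lowercased keys to the desired camelCase keys.
KEY_MAP = {"firstname": "firstName", "lastname": "lastName"}

# Assumes the CTAS result was downloaded and gunzipped to users.json,
# one JSON object per line, as Athena's JSON output format produces.
with open("users.json") as src, open("users_camel.json", "w") as dst:
    for line in src:
        record = json.loads(line)
        fixed = {KEY_MAP.get(key, key): value for key, value in record.items()}
        dst.write(json.dumps(fixed) + "\n")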

Related

AWS Athena query struct property in array

I have JSON files in an S3 bucket generated by the AWS Textract service, and I'm using Athena to query data from those files. Every file has the same structure, and I created a table in Athena with a column "blocks" that is an array of structs:
"blocks": [{
"BlockType": "LINE",
"Id": "12345",
"Text": "Text from document",
"Confidence": 98.7022933959961,
"Page": "1",
"SourceLanguage": "de",
"TargetLanguage": "en",
},
...100+ blocks]
How can I query just for the "Text" property from every block that has one?
Thanks in advance!
I have defined a table with the exact schema of yours using the sample JSON provided. The blocks column has the type:
array(row(blocktype varchar, id varchar, text varchar, confidence double, page varchar, sourcelanguage varchar, targetlanguage varchar))
I have used the UNNEST operator to flatten the array of blocks and fetch the Text column from it using the query below:
select block.text from <table-name> CROSS JOIN UNNEST(blocks) as t(block)
It looks like the column stores an array of rows, so you can process it as one with array functions:
select transform(
    filter(block_column, t -> t.text is not null),
    r -> cast(row(r.text) as row(text varchar))) texts
from table

aws athena query json array data

I'm not able to query S3 files with AWS Athena; the content of the files is a regular JSON array like this:
[
{
"DataInvio": "2020-02-06T13:37:00+00:00",
"DataLettura": "2020-02-06T13:35:50+00:00",
"FlagDownloaded": 0,
"GUID": "f257c9c0-b7e1-4663-8d6d-97e652b27c10",
"IMEI": "866100000062167",
"Id": 0,
"IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
"IdTagLocal": 0,
"SerialNumber": "142707160028BJZZZZ",
"Tag": "E200001697080089188056D2",
"Tipo": "B",
"TipoEvento": "L",
"TipoSegnalazione": 0,
"TipoTag": "C",
"UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
},
{
"DataInvio": "2020-02-06T13:37:00+00:00",
"DataLettura": "2020-02-06T13:35:50+00:00",
"FlagDownloaded": 0,
"GUID": "e531272e-465c-4294-950d-95a683ff8e3b",
"IMEI": "866100000062167",
"Id": 0,
"IdSessione": "4bd169ff-307c-4fbf-aa63-fce972f43fa2",
"IdTagLocal": 0,
"SerialNumber": "142707160028BJZZZZ",
"Tag": "E200341201321E0000A946D2",
"Tipo": "B",
"TipoEvento": "L",
"TipoSegnalazione": 0,
"TipoTag": "C",
"UsrId": "10642180-1e34-44ac-952e-9cb3e8e6a03c"
}
]
A simple query select * from mytable returns empty rows if the table has been generated in this way:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
    `IdSessione` string,
    `DataLettura` date,
    `GUID` string,
    `DataInvio` date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'ignore.malformed.json' = 'true'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')
or it gives me the error HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Missing value at 1 [character 2 line 1] if the table has been generated with:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
    `IdSessione` string,
    `DataLettura` date,
    `GUID` string,
    `DataInvio` date
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = '1'
) LOCATION 's3://athenatestsavino/files/anthea/'
TBLPROPERTIES ('has_encrypted_data'='false')
If I modify the content of the file so that there is one JSON object per row, without trailing commas, the query gives me results:
{ "DataInvio": "2020-02-06T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
{ "DataInvio": "2020-02-07T13:37:00+00:00", "DataLettura": "2020-02-06T13:35:50+00:00",....}
How can I query JSON array structures directly?
Athena best practices recommend having one JSON object per row:
Make sure that each JSON-encoded record is represented on a separate line.
This has been asked a few times and I don't think anyone has made it work with an array of JSON objects:
aws athena - Create table by an array of json object
AWS Glue Custom Classifiers Json Path
This is related to the formatting of the JSON objects. The resolution of these issues is also described here: https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/
Apart from this, if you are using AWS Glue to crawl these files, make sure the classification of the Data Catalog table is not "UNKNOWN".
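Since each record has to sit on its own line, one straightforward approach is to convert the array files into JSON Lines before uploading them to S3. A minimal Python sketch, assuming each file is small enough to load into memory (the file names are placeholders):

import json

# Read the original file, which contains a single top-level JSON array.
with open("events.json") as src:
    records = json.load(src)

# Write one JSON object per line (JSON Lines), which the Athena JSON SerDe expects.
with open("events.jsonl", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")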

How to query AWS DynamoDB using multiple Indexes?

I have an AWS DynamoDB cart table with the following item structure -
{
"cart_id": "5e4d0f9f-f08c-45ae-986a-f1b5ac7b7c13",
"user_id": 1234,
"type": "OTHER",
"currency": "INR",
"created_date": 132432423,
"expiry": 132432425,
"total_amount": 90000,
"total_quantity": 2,
"items": [
{
"amount": 90000,
"category": "Laptops",
"name": "Apple MacBook Pro",
"quantity": 1
}
]
}
-
{
"cart_id": "12340f9f-f08c-45ae-986a-f1b5ac7b1234",
"user_id": 1234,
"type": "SPECIAL",
"currency": "INR",
"created_date": 132432423,
"expiry": 132432425,
"total_amount": 1000,
"total_quantity": 2,
"items": [
{
"amount": 1000,
"category": "Special",
"name": "Special Item",
"quantity": 1
}
]
}
The table will have cart_id as Primary key,
user_id as an Index or GSI,
type as an Index or GSI.
I want to be able to query the cart table,
to find the items which have user_id = 1234 AND type != "SPECIAL".
I don't know whether this translates to the query -
--key-condition-expression "user_id = 1234 AND type != 'SPECIAL'"
I understand that an AWS DynamoDB table cannot be queried using multiple indexes at the same time.
I came across the following question; it has a similar use case and the answer recommends creating a composite key:
Querying with multiple local Secondary Index Dynamodb
Does it mean that while putting a new item in the table,
I will need to maintain another column like user_id_type,
with its value as 1234SPECIAL, and create an Index / GSI for user_id_type?
Sample item structure -
{
"cart_id": "5e4d0f9f-f08c-45ae-986a-f1b5ac7b7c13",
"user_id": 1234,
"type": "OTHER",
"user_id_type" : "1234OTHER",
"currency": "INR",
"created_date": 132432423,
"expiry": 132432425,
"total_amount": 90000,
"total_quantity": 2,
"items": [
{
"amount": 90000,
"category": "Laptops",
"name": "Apple MacBook Pro",
"quantity": 1
}
]
}
References -
1. Querying with multiple local Secondary Index Dynamodb
2. Is there a way to query multiple hash keys in DynamoDB?
Your assumption is correct. Maybe you can add a delimiter into that, field1_field2, or hash the values if either of them is too big, hashOfField1_hashOfField2.
That means spending some more processing power on your side, however, as DynamoDB does not natively support it.
Composite key in DynamoDB with more than 2 columns?
Dynamodb: query using more than two attributes
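As a hedged illustration of that composite-key idea, the extra attribute can be maintained on every write. A minimal boto3 sketch (the table name and the shortened item are only illustrative):

import boto3

table = boto3.resource("dynamodb").Table("cart")

item = {
    "cart_id": "5e4d0f9f-f08c-45ae-986a-f1b5ac7b7c13",
    "user_id": 1234,
    "type": "OTHER",
}
# Maintain the composite attribute on every put so a GSI on user_id_type stays consistent.
item["user_id_type"] = f"{item['user_id']}{item['type']}"
table.put_item(Item=item)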
Additional info on your use case
A KeyConditionExpression requires equality on the hash key and does not support a not-equal comparison on the range key.
You can put the type != 'SPECIAL' condition in the FilterExpression instead.
Why is there no **not equal** comparison in DynamoDB queries?
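A minimal boto3 sketch of that pattern, assuming a GSI whose hash key is user_id (the table and index names are hypothetical):

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("cart")

# Key condition on the GSI hash key; the not-equal check goes into a filter expression,
# which is applied after the items are read (filtered-out items still consume read capacity).
response = table.query(
    IndexName="user_id-index",                      # hypothetical GSI on user_id
    KeyConditionExpression=Key("user_id").eq(1234),
    FilterExpression=Attr("type").ne("SPECIAL"),
)
items = response["Items"]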
Does it mean that while putting a new item in the table,
I will need to maintain another column like user_id_type,
with its value as 1234SPECIAL and create an Index / GSI for user_id_type?
The answer is: it depends on how many columns you need (DynamoDB is schema-less; by a column I mean a data field) and whether you are happy with two round trips to the DB.
Your query:
user_id = 1234 AND type != "SPECIAL"
1 - If you need all the information in the cart but you are happy with two round trips:
Solution: create a GSI with user_id (HASH) and type (RANGE), then add cart_id (the base table's hash key) as a projection.
Explanation: you run one query on the index to get the cart_id(s) for the given user_id; since a key condition cannot use a not-equal operator, the type != 'SPECIAL' part goes into a filter expression, for example:
--key-condition-expression "user_id = :uid" --filter-expression "#t <> :t"
(with #t mapped to the type attribute via expression attribute names), then you use the cart_id(s) from the result and make another query against the base table.
2 - If you do not need all of the cart information:
Solution: create a GSI with user_id as HASH and type as RANGE, and add the columns you need to the projection (a sketch of this is shown after the options below).
Explanation: the projection is the set of additional columns you want to have in your index, so add the extra columns that are most likely to be needed in query results, to avoid an extra round trip to the base table.
Note: adding too many extra columns can double your costs, as any update on the base table results in updates to the projected fields of the GSI.
3 - If you want just one round trip and you need all the data:
then you need to manage it yourself, and your suggested user_id_type attribute can be applied.
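For option 2, a hedged boto3 sketch of adding such a GSI to an existing table; the index name, attribute types, projected attributes and throughput values are all assumptions:

import boto3

client = boto3.client("dynamodb")

# Create a GSI with user_id as HASH and type as RANGE, projecting a few extra attributes
# so most queries can be answered from the index without a second trip to the base table.
# Base-table key attributes (cart_id) are always projected into a GSI automatically.
client.update_table(
    TableName="cart",
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "N"},
        {"AttributeName": "type", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "user_id-type-index",   # hypothetical index name
            "KeySchema": [
                {"AttributeName": "user_id", "KeyType": "HASH"},
                {"AttributeName": "type", "KeyType": "RANGE"},
            ],
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["total_amount", "total_quantity"],
            },
            # Only needed when the table uses provisioned capacity.
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    }],
)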
One possible answer is to create a single GSI with user_id as the hash key and type as the sort key. Then you can do this (the not-equal condition has to go into a filter expression, since key conditions do not support it):
{
    TableName: "...",
    IndexName: "UserIdAndTypeIndex",
    KeyConditionExpression: "user_id = :user_id",
    FilterExpression: "#type <> :type",
    ExpressionAttributeNames: {
        "#type": "type"
    },
    ExpressionAttributeValues: {
        ":user_id": 1234,
        ":type": "SPECIAL"
    }
}
You can build a GraphQL schema with AWS AppSync from your DynamoDB table and then query it in your app with GraphQL. Link

Partitioned table BigQuery (with custom field)

I can't find any examples that show how to write the JSON for a partitioned table using a custom field. Below is an example of how to specify a table partitioned by type "DAY", but if, in addition, I would like to partition by a specific field, what would the JSON look like?
{
"tableReference": {
"projectId": "bookstore-1382",
"datasetId": "exports",
"tableId": "partition"
},
"timePartitioning": {
"type": "DAY"
}
}
Take a look at the API reference. The timePartitioning object currently supports the following attributes:
expirationMs
field
requirePartitionFilter
type
I won't copy/paste all of the comments here, but this is what it says for field:
[Experimental] [Optional] If not set, the table is partitioned by
pseudo column '_PARTITIONTIME'; if set, the table is partitioned by
this field. The field must be a top-level TIMESTAMP or DATE field. Its
mode must be NULLABLE or REQUIRED.
In your case, the payload would look like:
{
"tableReference": {
"projectId": "<your project>",
"datasetId": "<your dataset>",
"tableId": "partition"
},
"timePartitioning": {
"type": "DAY",
"field": "<date_or_timestamp_column_name>"
}
}
Alternatively, you can issue a CREATE TABLE DDL statement using standard SQL. To give an example:
#standardSQL
CREATE TABLE `your-project.your-dataset.table`
(
    x INT64,
    event_date DATE
)
PARTITION BY event_date;
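The same table can also be created through the client libraries. A minimal sketch with the Python BigQuery client, assuming placeholder project, dataset and table names:

from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "your-project.your_dataset.partition",      # placeholder table ID
    schema=[
        bigquery.SchemaField("x", "INT64"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
# Partition by the event_date column instead of the _PARTITIONTIME pseudo column.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)
client.create_table(table)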

Handling missing and new fields in tableSchema of BigQuery in Google Cloud Dataflow

Here is the situation:
My BigQuery TableSchema is as follows:
{
"name": "Id",
"type": "INTEGER",
"mode": "nullable"
},
{
"name": "Address",
"type": "RECORD",
"mode": "repeated",
"fields":[
{
"name": "Street",
"type": "STRING",
"mode": "nullable"
},
{
"name": "City",
"type": "STRING",
"mode": "nullable"
}
]
}
I am reading data from a Google Cloud Storage bucket and writing it into BigQuery using a cloud function.
I have defined the TableSchema in my cloud function as:
table_schema = bigquery.TableSchema()

# Top-level nullable INTEGER field "Id".
Id_schema = bigquery.TableFieldSchema()
Id_schema.name = 'Id'
Id_schema.type = 'INTEGER'
Id_schema.mode = 'nullable'
table_schema.fields.append(Id_schema)

# Repeated RECORD field "Address" with nested "Street" and "City" fields.
Address_schema = bigquery.TableFieldSchema()
Address_schema.name = 'Address'
Address_schema.type = 'RECORD'
Address_schema.mode = 'repeated'

Street_schema = bigquery.TableFieldSchema()
Street_schema.name = 'Street'
Street_schema.type = 'STRING'
Street_schema.mode = 'nullable'
Address_schema.fields.append(Street_schema)

City_schema = bigquery.TableFieldSchema()
City_schema.name = 'City'
City_schema.type = 'STRING'
City_schema.mode = 'nullable'
Address_schema.fields.append(City_schema)

# Append the Address record once, after all of its nested fields have been added.
table_schema.fields.append(Address_schema)
My data file looks like this (each row is a JSON object):
{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}
{"Id": 2, "Address": {"City":"Mumbai"}}
{"Id": 3, "Address": {"Street":"XYZ Road"}}
{"Id": 4}
{"Id": 5, "PhoneNumber": 12345678, "Address": {"Street":"ABCD Road", "City":"Bangalore"}}
Question:
How can I handle when the incoming data has some missing keys?
e.g.,
On row #2 of the data "Street" is missing
On row #3 of the data "City" is missing
On row #4 of the data "Address" is missing
On row #5 of the data "PhoneNumber" shows up.
Question 1: How to handle WriteToBigQuery if data is missing (e.g., rows #2, #3, #4)?
Question 2: How to handle if a new field shows up in the data?
e.g.,
On row #5 "PhoneNumber" shows up..
How can I add a new column in BigQuery table on the fly?
(Do I have to define the BigQuery table schema exhaustively enough at first in order to accommodate such newly added fields?)
Question 3: How can I iterate through each row (while reading data file) of the incoming data file and determine which fields to parse?
One option for you, instead of struggling with schema changes, is to write your data into a table with just one field, line, of type STRING, and apply the schema logic on the fly during querying.
The example below, for BigQuery Standard SQL, shows how to apply a schema on the fly against a table with the whole row in one field:
#standardSQL
WITH t AS (
    SELECT '{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}' line UNION ALL
    SELECT '{"Id": 2, "Address": {"City":"Mumbai"}}' UNION ALL
    SELECT '{"Id": 3, "Address": {"Street":"XYZ Road"}}' UNION ALL
    SELECT '{"Id": 4}' UNION ALL
    SELECT '{"Id": 5, "PhoneNumber": 12345678, "Address": {"Street":"ABCD Road", "City":"Bangalore"}}'
)
SELECT
    JSON_EXTRACT_SCALAR(line, '$.Id') id,
    JSON_EXTRACT_SCALAR(line, '$.PhoneNumber') PhoneNumber,
    JSON_EXTRACT_SCALAR(line, '$.Address.Street') Street,
    JSON_EXTRACT_SCALAR(line, '$.Address.City') City
FROM t
with the result as below:
Row  id  PhoneNumber  Street     City
1    1   null         MG Road    Pune
2    2   null         null       Mumbai
3    3   null         XYZ Road   null
4    4   null         null       null
5    5   12345678     ABCD Road  Bangalore
I think this approach answers/addresses all four of your questions.
Question: How can I handle when the incoming data has some missing keys?
Question 1: How to handle WriteToBigQuery if data is missing (e.g., rows #2, #3, #4)?
Question 2: How to handle if a new field shows up in the data?
I recommend decoding the JSON string to some data structure, for example a custom Contact class, where you can access and manipulate member variables and define which members are optional and which are required. Using a custom class gives you a level of abstraction so that downstream transforms in the pipeline don't need to worry about how to manipulate JSON. A downstream transform can be implemented to build a TableRow from a Contact object and also adhere to the BigQuery table schema. This design follows general abstraction and separation of concerns principles and is able to handle all scenarios of missing or additional fields.
Question 3: How can I iterate through each row (while reading data file) of the incoming data file and determine which fields to parse?
Dataflow's execution of the pipeline does this automatically. If the pipeline reads from Google Cloud Storage (using TextIO for example), then Dataflow will process each line of the file as an individual element (individual JSON string). Determining which fields to parse is a detail of the business logic and can be defined in a transform which parses the JSON string.
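As a rough illustration of both answers, here is a minimal Apache Beam (Python) sketch that parses each JSON line, keeps only the fields known to the table schema (so an unexpected field such as PhoneNumber is dropped and missing fields load as NULL), and writes the rows to BigQuery; the bucket path, table name and dispositions are assumptions:

import json

import apache_beam as beam

# Fields that exist in the BigQuery table schema (assumed to be just Id and Address).
KNOWN_FIELDS = {"Id", "Address"}


class ParseContact(beam.DoFn):
    """Decode one JSON line into a row dict that matches the table schema."""

    def process(self, line):
        record = json.loads(line)
        # Keep only known fields; anything missing is simply absent and loads as NULL.
        row = {key: value for key, value in record.items() if key in KNOWN_FIELDS}
        # Address is a REPEATED RECORD, so wrap a single object in a list.
        if isinstance(row.get("Address"), dict):
            row["Address"] = [row["Address"]]
        yield row


with beam.Pipeline() as pipeline:
    (pipeline
     | beam.io.ReadFromText("gs://your-bucket/data.json")               # hypothetical path
     | beam.ParDo(ParseContact())
     | beam.io.WriteToBigQuery(
         "your-project:your_dataset.your_table",                        # hypothetical table
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,   # table assumed to exist
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))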