Apache Calcite query json nested data - apache-calcite

I've managed to read simple JSON data from Kafka using Calcite SQL.
I now have a special requirement: the data is nested, like below.
{"DATA":{"SPEED":80},"USER_ID":100200,"USER_NAME":"user1"}
I want to use DATA.SPEED, not just DATA, in my query, both in the select list and in the where condition, like below:
select DATA.SPEED from "database" where DATA.SPEED > 50 order by
USER_ID desc
Is there a way to do this? Thanks.

Related

How to use query parameters in GCP BigQuery federated queries

I have a GCP-based environment. I use standard SQL scripting in BigQuery together with a federated query against Cloud SQL for MySQL. The federated query selects data from the Cloud SQL MySQL database, and I need to filter that data on a condition that depends on data in BigQuery. I use a BigQuery scripting variable to store the value that I select from BigQuery, and I want to use the value of this variable in the where clause of the MySQL query. In the following example I select a date from BigQuery and store it in the variable "BQ_LAST_DATETIME".
DECLARE BQ_LAST_DATETIME DATETIME;
SET BQ_LAST_DATETIME = (select max(date_created) from bq_my_dataset.bq_my_table);
I am using a BigQuery federated query to read data from the Cloud SQL database (https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries), as shown below, and I want to use the value stored in the variable "BQ_LAST_DATETIME" in the where clause of the MySQL query:
SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql", "select * from mysqlschema.mysql_table where date_created = #BQ_LAST_DATETIME;" );
Please note that in the above query I used "#BQ_LAST_DATETIME" as a placeholder to show what I want to achieve. I am not sure whether a BigQuery scripting variable can be used directly as a query parameter in the "external" part of a federated query.
Any suggestions on how to parametrize the external query in a federated query, or on how else I could achieve the same effect?
I actually tried the following: I used the BigQuery scripting variable as a query parameter in the "external" part of the federated query. The only nuance is that, since I was dealing with dates, I cast the value to DATE, and since the date variable is actually treated as a string, I converted it back to a date with MySQL's STR_TO_DATE, as follows:
DECLARE BQ_LAST_DATETIME DATETIME;
DECLARE BQ_LAST_DATE DATE;
SET BQ_LAST_DATETIME = (select max(date_created) from bq_my_dataset.bq_my_table);
SET BQ_LAST_DATE = CAST(BQ_LAST_DATETIME AS DATE);
SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql", "select * from mysqlschema.mysql_table where date_created = STR_TO_DATE(#BQ_LAST_DATE,'%Y-%m-%d') ;" );
While this query is accepted by the parser, it does NOT give the expected result.
Basically, the value of the variable #BQ_LAST_DATE does not seem to reach the MySQL query as expected.
Does anyone know what I am missing?
Thanks a lot for your help.
You can try EXECUTE IMMEDIATE:
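DECLARE BQ_LAST_DATETIME DATETIME;
DECLARE DSQL STRING;
-- Evaluate the subquery in BigQuery first: the external query runs in MySQL,
-- which cannot see BigQuery tables, so it must receive a literal value.
SET BQ_LAST_DATETIME = (SELECT max(date_created) FROM bq_my_dataset.bq_my_table);
SET DSQL = '"select * from mysqlschema.mysql_table where date_created = \'' || FORMAT_DATETIME('%Y-%m-%d %H:%M:%S', BQ_LAST_DATETIME) || '\'"';
EXECUTE IMMEDIATE 'SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql",' || DSQL || ');';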
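If the script is driven from client code rather than from BigQuery scripting, the same idea (resolve the value first, then embed it as a literal in the external query text) might look roughly like the sketch below. The function name build_federated_query is hypothetical, and the sketch only builds the SQL string; how the value is fetched and how the statement is submitted are left out.

```python
from datetime import datetime


def build_federated_query(connection_id: str, last_datetime: datetime) -> str:
    """Build a BigQuery statement whose external (MySQL) part contains a literal."""
    # strftime with this format yields only digits, '-', ':' and a space,
    # so the literal cannot break out of the surrounding quotes.
    literal = last_datetime.strftime("%Y-%m-%d %H:%M:%S")
    inner = (
        "select * from mysqlschema.mysql_table "
        f"where date_created = '{literal}'"
    )
    return f'SELECT * FROM EXTERNAL_QUERY("{connection_id}", "{inner}");'


sql = build_federated_query(
    "my-gcp-project.my-region.my-connection2-cloudsql",
    datetime(2021, 3, 4, 5, 6, 7),
)
```

Formatting the value yourself is only reasonable here because it is a datetime; arbitrary strings would need proper escaping before being embedded.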

Athena Schema creation when log format has missing fields

I have a custom log format where the log entries vary by request type, so certain rows have more fields than others.
Can we mark certain fields as optional, so that in rows where they are missing the values are set to a certain default (null, 0)?
Here are some hypothetical log entries:
{"data":"[2017-09-10 10:44:54.448998 -0000] info ip=773.555.557.445 cluster=\"production\" query=uris type=TXT class=IN rcode=NXDOMAIN cnt=0 offset=74","header":{"recvtime":"2017-09-10 10:45:02","server":"m0107481","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=991.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=NOERROR cnt=1 offset=90 score=400","header":{"recvtime":"2017-09-10 10:45:02","server":"m010748","refid":"ABC-123"}}
{"data":"[2017-09-10 10:44:54.457718 -0000] info ip=971.509.704.832 cluster=\"inbound\" query=dnsbl type=A class=IN rcode=REFUSED cnt=1","header":{"recvtime":"2017-09-10 10:45:02","server":"m010574","refid":"ABC-123"}}
Note that each row of the log data is in JSON format, and the header part is fixed. If query in data is dnsbl, the row sometimes has a score field and sometimes does not. I am planning to use Athena to parse this type of data from S3 and query for stats along the lines of: what % of the data are dns queries, and what % have a score above 300?
It looks like your data is JSON with embedded structured logging in the data field. As long as the data is well formed JSON with one object per line you should be able to create a JSON table and then use functions to extract the other pieces out of the data field. You can create a view that does the extraction so that you don't have to do that in every query.
I'm thinking something like this:
CREATE EXTERNAL TABLE raw_log_entries (
  data string,
  header struct<recvtime: string, server: string, refid: string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://some-bucket/and/path/';
CREATE VIEW log_entries AS
SELECT
  header.recvtime,
  header.server,
  header.refid,
  regexp_extract(data, 'query=(\S+)', 1) AS query,
  regexp_extract(data, 'type=(\S+)', 1) AS type,
  regexp_extract(data, 'score=(\S+)', 1) AS score
  -- and so on for the remaining fields (comma-separated, no trailing comma)
FROM raw_log_entries;
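You'll have to experiment with the regexes; since I don't have your data I can't be sure they will work for all cases, but I hope you get the idea. Note that regexp_extract returns NULL when the pattern does not match, so rows without a score field simply get a NULL score.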
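To try out the extraction patterns before putting them into a view, the same logic can be sketched locally. This is only an illustration, not Athena code: the sample lines are shortened versions of the entries in the question, and the helper names extract and parse_entry are made up for the example.

```python
import json
import re
from typing import Optional

# Shortened, hypothetical versions of the log entries from the question.
SAMPLE_LINES = [
    r'{"data":"[2017-09-10 10:44:54.448998 -0000] info cluster=\"production\" query=uris type=TXT rcode=NXDOMAIN cnt=0 offset=74","header":{"server":"m0107481"}}',
    r'{"data":"[2017-09-10 10:44:54.457718 -0000] info cluster=\"inbound\" query=dnsbl type=A rcode=NOERROR cnt=1 offset=90 score=400","header":{"server":"m010748"}}',
    r'{"data":"[2017-09-10 10:44:54.457718 -0000] info cluster=\"inbound\" query=dnsbl type=A rcode=REFUSED cnt=1","header":{"server":"m010574"}}',
]


def extract(data: str, key: str) -> Optional[str]:
    """Return the value of key=value in data, or None when the key is absent."""
    match = re.search(rf"\b{re.escape(key)}=(\S+)", data)
    return match.group(1) if match else None


def parse_entry(line: str) -> dict:
    """Parse one JSON log line and pull selected fields out of the data string."""
    record = json.loads(line)
    data = record["data"]
    score = extract(data, "score")
    return {
        "server": record["header"]["server"],
        "query": extract(data, "query"),
        "type": extract(data, "type"),
        # Assumes score is numeric whenever it is present.
        "score": int(score) if score is not None else None,
    }


entries = [parse_entry(line) for line in SAMPLE_LINES]
dnsbl_share = sum(e["query"] == "dnsbl" for e in entries) / len(entries)
high_score_share = sum(
    e["score"] is not None and e["score"] > 300 for e in entries
) / len(entries)
```

Missing fields come back as None here, which mirrors how a missing match surfaces as NULL in the view.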

Redshift Spectrum: Query Anonymous JSON array structure

I have a JSON array of structures in S3, that is successfully Crawled & Cataloged by Glue.
[{"key":"value"}, {"key":"value"}]
I'm using the custom Classifier:
$[*]
When trying to query from Spectrum, however, it returns:
Top level Ion/JSON structure must be an anonymous array if and only if
serde property 'strip.outer.array' is set. Mismatch occured in file...
I set that serde property manually in the Glue catalog table, but nothing changed.
Is it not possible to query an anonymous array via Spectrum?
Naming the array in the JSON file like this:
{"values":[{"key":"value"},...]}
And updating the classifier:
$.values[*]
Fixes the issue... Interested to know if there is a way to query anonymous arrays though. It seems pretty common to store data like that.
Update:
In the end this solution didn't work, as Spectrum would never actually return any results. There was no error, just no results, and as of now still no solution other than using individual records per line:
{"key":"value"}
{"key":"value"}
etc.
It does seem to be a Spectrum specific issue, as Athena would still work.
Interested to know if anyone else was able to get it to work...
I've successfully done this, but without a data classifier. My JSON file looks like:
[
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  {
    "col1": "data_from_col1",
    "col2": "data_from_col2",
    "col3": [
      {
        "col4": "data_from_col4",
        ...
      }
    ]
  },
  ...
]
I started with a crawler to get a basic table definition. IMPORTANT: the crawler's configuration options under Output CAN'T be set to Update the table definition..., or else re-running the crawler later will overwrite the manual changes described below. I used Add new columns only.
I had to add the 'strip.outer.array' property AND manually add the topmost columns within my anonymous array. The original schema from the initial crawler run was:
anon_array array<struct<col1:string,col2:string,col3:array<struct<col4...>>>>
partition_0 string
I manually updated my schema to:
col1:string
col2:string
col3:array<struct<col4...>>
partition_0 string
(And also add the serde param strip.outer.array.)
Then I had to rerun my crawler, and finally I could query in Spectrum like:
select o.partition_0, o.col1, o.col2, t.col4
from db.tablename o
LEFT JOIN o.col3 t on true;
You can use json_extract_path_text to extract an element by key, or json_extract_array_element_text('json string', pos [, null_if_invalid ] ) to extract an array element by position.
For example, for the element at index 2 (positions are zero-based):
select json_extract_array_element_text('[111,112,113]', 2);
output: 113
If your table's columns are defined as follows (storage and location clauses omitted):
CREATE EXTERNAL TABLE spectrum.testjson (
id varchar(25),
columnName array<struct<key:varchar(20),value:varchar(20)>>
);
you can use the following query to access the array elements:
SELECT c.id, o.key, o.value FROM spectrum.testjson c, c.columnName o;
For more information, refer to the AWS documentation:
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data-sqlextensions.html

How can I check the partition list from Athena in AWS?

I want to check the partition list in Athena.
I used a query like this:
show partitions table_name
But I want to check whether a specific partition exists.
So I used the query below, but no results were returned.
show partitions table_name partition(dt='2010-03-03')
That is because dt also contains the hour:
dt='2010-03-03-01', dt='2010-03-03-02', ...........
Is there any way to search so that, when I input '2010-03-03', it finds '2010-03-03-01', '2010-03-03-02'?
Or do I have to split the partition like this?
dt='2010-03-03', dh='01'
Also, show partitions table_name returns only 500 rows in Hive. Is it the same in Athena?
In Athena v2:
Use this SQL:
SELECT dt
FROM db_name."table_name$partitions"
WHERE dt LIKE '2010-03-03-%'
(see the official aws docs)
In Athena v1:
There is a way to return the partition list as a resultset, so this can be filtered using LIKE. But you need to use the internal information_schema database like this:
SELECT partition_value
FROM information_schema.__internal_partitions__
WHERE table_schema = '<DB_NAME>'
AND table_name = '<TABLE_NAME>'
AND partition_value LIKE '2010-03-03-%'

AWS Athena flattened data from nested JSON source

I'd like to create a table from a nested JSON in Athena. The solutions described here using tools like hive Openx-JsonSerDe attempt to mirror the JSON data in the SQL statement. I just want to get a few fields from the JSON file and create the table. I can't seem to find any resources on how to do that.
E.g.
JSON file {"records": [{"a": "data1", "b": "data2", "c": "data3"}]}
The table I'd like to create has only columns a and b.
I think what you are trying to achieve is unnesting the array to transform one array entry into one row.
This is possible through the correct querying of your data structure.
table definition:
CREATE external TABLE complex (
records array<struct<a:string,b:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/test1/';
query:
select record.a,record.b from complex
cross join UNNEST(complex.records) as t1(record);
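To make the effect of the UNNEST query concrete, here is a small local sketch of the same transformation: one output row per element of records, keeping only the selected columns. It is not Athena code, and the function name flatten_records is hypothetical.

```python
import json
from typing import Dict, List, Optional, Sequence


def flatten_records(
    document: str, columns: Sequence[str] = ("a", "b")
) -> List[Dict[str, Optional[str]]]:
    """Turn each entry of the top-level "records" array into one row."""
    parsed = json.loads(document)
    return [
        # Keys missing from an entry become None, much like NULL in SQL.
        {column: record.get(column) for column in columns}
        for record in parsed.get("records", [])
    ]


rows = flatten_records(
    '{"records": [{"a": "data1", "b": "data2", "c": "data3"}]}'
)
```

For the sample file from the question this gives a single row with a and b, and column c is dropped because it is not listed.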