Different default behaviors for skipping empty buckets in timeseries queries between Druid native queries and Druid SQL - apache-superset

According to the Apache Druid documentation on native queries:
Timeseries queries normally fill empty interior time buckets with zeroes.
For example, if you issue a "day" granularity timeseries query for the interval 2012-01-01/2012-01-04, and no data exists for 2012-01-02, you will receive:
[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { "sample_name1": 0 }
  },
  {
    "timestamp": "2012-01-03T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  }
]
This behavior is controlled by the context flag "skipEmptyBuckets", whose default value is false (empty buckets are not skipped; they are zero-filled).
However, when querying timeseries data with Druid SQL, the default behavior is to skip all empty buckets. I have to set the query context explicitly to get the results I want:
"context": {
"skipEmptyBuckets": true
}
This troubles me a lot because I need zero-filling to show all buckets in Apache Superset's timeseries charts, but there is no way to set the query context there.
As far as I know, the SQL statement is internally translated to a native query, so why the inconsistency?
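For reference, a minimal sketch of setting that context when querying the Druid SQL HTTP API directly (the router URL and datasource name below are placeholders, not from the question):

import requests

# Placeholder router URL and datasource; adjust for your cluster.
response = requests.post(
    "http://localhost:8888/druid/v2/sql/",
    json={
        "query": (
            "SELECT FLOOR(__time TO DAY) AS bucket_day, COUNT(*) AS cnt "
            "FROM my_datasource "
            "WHERE __time >= TIMESTAMP '2012-01-01' AND __time < TIMESTAMP '2012-01-04' "
            "GROUP BY 1"
        ),
        # Restore the native-query default of zero-filling empty buckets.
        "context": {"skipEmptyBuckets": False},
    },
)
print(response.json())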

Related

Glue JSON serialization and Athena query, return full record each field

I've been trying for a long time to get Glue's crawlers to recognize the .json files in my S3 bucket so they can be queried in Athena. But after various changes in settings, the best result I got is still wrong.
Glue's crawler even recognizes the column structure of my .json; however, when the table is queried in Athena, the columns are set up, but all items end up in the same row, one item per column, as in the images below.
My Classifier setting is "$[*]".
The .json data structure:
[
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e3a", "airspace_p": 1061, "codedistv1": "SFC", "fid": 299 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e39", "airspace_p": 408, "codedistv1": "STD", "fid": 766 },
{ "id": "TMA.fid--4f6e8018_18596f01b4f\_-5e38", "airspace_p": 901, "codedistv1": "STD", "fid": 806 },
...
]
Configuration result in Glue: (screenshot)
Result in Athena from this table: (screenshot)
I have already tried different .json structures and different classifiers, and changed and added the JsonSerde.
If you can change the data source, use the JSON Lines format instead, then run the Glue crawler without any custom classifier:
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e3a","airspace_p": 1061,"codedistv1": "SFC","fid": 299}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e39","airspace_p": 408,"codedistv1": "STD","fid": 766}
{"id": "TMA.fid--4f6e8018_18596f01b4f_-5e38","airspace_p": 901,"codedistv1": "STD","fid": 806}
The cause of your issue is that Athena doesn't support Glue's custom JSON classifiers.
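If editing the files by hand isn't practical, a small Python sketch (file names are hypothetical) that rewrites the JSON array into JSON Lines before uploading to S3:

import json

# Hypothetical input/output file names; adjust to your own paths.
with open("tma.json") as src, open("tma.jsonl", "w") as dst:
    for record in json.load(src):             # the source file is a single JSON array
        dst.write(json.dumps(record) + "\n")  # one compact JSON object per line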

Is it possible to iterate through a DynamoDB table within a step function's map state?

Just what the title says, basically. I have read through the documentation:
https://docs.aws.amazon.com/step-functions/latest/dg/connect-ddb.html
This describes how to get a single item of information out of a DynamoDB table from a step function. What I would like to do is iterate through the entire table and start execution of another state machine for each item. Each new state machine would have an individual item as input. I have attempted the following code, which unfortunately is not functional:
{
  "StartAt": "OuterFunction",
  "States": {
    "OuterFunction": {
      "Type": "Map",
      "Iterator": {
        "StartAt": "InnerFunction",
        "States": {
          "InnerFunction": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:getItem.sync",
            "Parameters": {
              "StateMachineArn": "other-state-machine-arn",
              "TableName": "TestTable"
            },
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
Is it actually possible to iterate through a DynamoDB table in this way?
You are now able to call DynamoDB directly from Step Functions, including the query and scan operations. With the result, you can then iterate through the items. The one less convenient caveat is that it does not use the document client, so the results are in the DynamoDB JSON format.
https://docs.aws.amazon.com/step-functions/latest/dg/connect-ddb.html
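If you hand the scan result to a Lambda, a minimal sketch of turning that DynamoDB JSON back into plain values with boto3 (the item shape shown in the comment is illustrative):

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def plain_items(scan_result):
    # scan_result["Items"] is a list of DynamoDB-JSON maps, e.g. {"fid": {"N": "299"}}.
    return [
        {key: deserializer.deserialize(value) for key, value in item.items()}
        for item in scan_result["Items"]
    ]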
No, getItem is designed to fetch a particular DynamoDB item. You need to write a custom Lambda that will .query() or .scan() your table and then use a Map step to iterate over the results (most likely you won't need getItem at that point, because you can load all the data with the query/scan operation).
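A minimal sketch of such a Lambda, assuming the TestTable name from the question and paginating through the scan:

import boto3

table = boto3.resource("dynamodb").Table("TestTable")  # table name from the question

def handler(event, context):
    items = []
    response = table.scan()
    items.extend(response["Items"])
    # Keep scanning while DynamoDB reports more pages.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items  # feed this list into the Map state's ItemsPath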

How to specify attributes to return from DynamoDB through AppSync

I have an AppSync pipeline resolver. The first function queries an ElasticSearch database for the DynamoDB keys. The second function queries DynamoDB using the provided keys. This was all working well until I ran into the 1 MB limit of AppSync. Since most of the data is in a few attributes/columns I don't need, I want to limit the results to just the attributes I need.
I tried adding AttributesToGet and ProjectionExpression (from here) but both gave errors like:
{
  "data": {
    "getItems": null
  },
  "errors": [
    {
      "path": [
        "getItems"
      ],
      "data": null,
      "errorType": "MappingTemplate",
      "errorInfo": null,
      "locations": [
        {
          "line": 2,
          "column": 3,
          "sourceName": null
        }
      ],
      "message": "Unsupported element '$[tables][dev-table-name][projectionExpression]'."
    }
  ]
}
My DynamoDB function request mapping template looks like (returns results as long as data is less than 1 MB):
#set($ids = [])
#foreach($pResult in ${ctx.prev.result})
#set($map = {})
$util.qr($map.put("id", $util.dynamodb.toString($pResult.id)))
$util.qr($map.put("ouId", $util.dynamodb.toString($pResult.ouId)))
$util.qr($ids.add($map))
#end
{
  "version" : "2018-05-29",
  "operation" : "BatchGetItem",
  "tables" : {
    "dev-table-name": {
      "keys": $util.toJson($ids),
      "consistentRead": false
    }
  }
}
I contacted AWS, who confirmed that ProjectionExpression is not currently supported and that it will be a while before they get to it.
Instead, I created a Lambda to pull the data from DynamoDB.
To limit the results from DynamoDB I used $ctx.info.selectionSetList in AppSync to get the list of requested columns, then used that list to specify the data to pull from DynamoDB. I needed to get multiple results while maintaining order, so I used BatchGetItem, then merged the results with the original list of IDs using LINQ (which put the DynamoDB results back in the correct order, since BatchGetItem in C# does not preserve sort order the way the AppSync version does).
Because I was using C# with a number of libraries, the cold start time was a little long, so I used Lambda Layers pre-JITed for Linux, which allowed us to get the cold start time down from ~1.8 seconds to ~1 second (when using 1024 MB of RAM for the Lambda).
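A rough Python sketch of that Lambda's DynamoDB call (the original was C#; the key names and table name follow the mapping template above, and requested_fields would come from $ctx.info.selectionSetList):

import boto3

client = boto3.client("dynamodb")

def fetch_items(keys, requested_fields):
    """keys is the list of {"id": ..., "ouId": ...} dicts from the previous pipeline function."""
    response = client.batch_get_item(
        RequestItems={
            "dev-table-name": {
                "Keys": [{"id": {"S": k["id"]}, "ouId": {"S": k["ouId"]}} for k in keys],
                # Only pull the attributes the GraphQL query actually asked for.
                "ProjectionExpression": ", ".join(requested_fields),
            }
        }
    )
    items = response["Responses"]["dev-table-name"]
    # BatchGetItem does not preserve order, so re-sort by the original key list.
    order = {k["id"]: i for i, k in enumerate(keys)}
    return sorted(items, key=lambda item: order[item["id"]["S"]])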
AppSync doesn't support projection, but you can explicitly define which fields to return in the response template instead of returning the entire result set:
{
  "id": "$ctx.result.get('id')",
  "name": "$ctx.result.get('name')",
  ...
}

Logsink to BigQuery partitioning not working

I created a logsink at folder level, so it neatly streams all the logs to BigQuery. In the logsink configuration, I specified the following options to let the logsink stream to (daily) partitions:
"bigqueryOptions": {
"usePartitionedTables": true,
"usesTimestampColumnPartitioning": true # output only
}
According to the BigQuery documentation and the BigQuery resource type, I would assume this would automatically create partitions, but it doesn't. I verified that it didn't create the partitions with the following query:
#LegacySQL
SELECT table_id, partition_id from [dataset1.table1$__PARTITIONS_SUMMARY__];
Gives me:
[
  {
    "table_id": "table1",
    "partition_id": "__UNPARTITIONED__"
  }
]
Is there something I am missing here? It should have been partitioned by date.
The problem was that I did not wait long enough for the first partition to become active. Basically, a logsink streams data in as unpartitioned; after a while, the data is partitioned by date, which only becomes visible after a few hours for today's partition. Problem solved! Running the same query later gives:
[
  {
    "table_id": "table1",
    "partition_id": "__UNPARTITIONED__"
  },
  {
    "table_id": "table1",
    "partition_id": "20200510"
  },
  {
    "table_id": "table1",
    "partition_id": "20200511"
  }
]
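For what it's worth, the same check can be done with the Python BigQuery client (dataset and table names follow the example above):

from google.cloud import bigquery

client = bigquery.Client()

# Prints partition IDs such as "20200510" once the streamed rows have been partitioned.
for partition_id in client.list_partitions("dataset1.table1"):
    print(partition_id)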

AWS IoT rule - timestamp for Elasticsearch

I have a bunch of IoT devices (ESP32) which publish a JSON object to things/THING_NAME/log for general debugging (to be extended into other topics with values in the future).
Here is the IoT rule, which kind of works:
{
  "sql": "SELECT *, parse_time(\"yyyy-mm-dd'T'hh:mm:ss\", timestamp()) AS timestamp, topic(2) AS deviceId FROM 'things/+/stdout'",
  "ruleDisabled": false,
  "awsIotSqlVersion": "2016-03-23",
  "actions": [
    {
      "elasticsearch": {
        "roleArn": "arn:aws:iam::xxx:role/iot-es-action-role",
        "endpoint": "https://xxxx.eu-west-1.es.amazonaws.com",
        "index": "devices",
        "type": "device",
        "id": "${newuuid()}"
      }
    }
  ]
}
I'm not sure how to set #timestamp inside Elasticsearch to allow time-based searches.
Maybe I'm going about this all wrong, but it almost works!
Elasticsearch can recognize date strings matching dynamic_date_formats.
The following format is automatically mapped as a date field in AWS Elasticsearch 7.1:
SELECT *, parse_time("yyyy/MM/dd HH:mm:ss", timestamp()) AS timestamp FROM 'events/job/#'
This approach does not require creating a preconfigured index, which is important for dynamically created indexes, e.g. with daily rotation for logs:
devices-${parse_time("yyyy.MM.dd", timestamp(), "UTC")}
According to the elastic.co documentation, the default value for dynamic_date_formats is:
[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
#timestamp is just a convention, as the # prefix is the default prefix for Logstash-generated fields. Because you are not using Logstash as a middleman between IoT and Elasticsearch, you don't have a default mapping for #timestamp.
But basically, it is just a name, so call it what you want; the only thing that matters is that you declare it as a date field in the mappings section of the Elasticsearch index.
If for some reason you still need it to be called #timestamp, you can either SELECT it with that prefix right away in the AS section (might be an issue with IoT's SQL restrictions, not sure):
SELECT *, parse_time(\"yyyy-mm-dd'T'hh:mm:ss\", timestamp()) AS #timestamp, topic(2) AS deviceId FROM 'things/+/stdout'
Or you use the copy_to functionality when declaring your mapping:
PUT devices
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "copy_to": "#timestamp"
      },
      "#timestamp": {
        "type": "date"
      }
    }
  }
}
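A minimal sketch of applying that mapping with the official Python client, 7.x style without a mapping type (the endpoint is the one from the IoT rule above; authentication is omitted):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://xxxx.eu-west-1.es.amazonaws.com")  # endpoint from the IoT rule

# Create the index with both date fields; "timestamp" is copied into "#timestamp".
es.indices.create(
    index="devices",
    body={
        "mappings": {
            "properties": {
                "timestamp": {"type": "date", "copy_to": "#timestamp"},
                "#timestamp": {"type": "date"},
            }
        }
    },
)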