Logsink to bigquery partitioning not working


I created a logsink on folder level, so it neatly streams all the logs to Bigquery. In the logsink configuration, I specified the following options to let the logsink stream to (daily) partitions:
"bigqueryOptions": {
"usePartitionedTables": true,
"usesTimestampColumnPartitioning": true # output only
}
According to the BigQuery documentation and the BigQuery resource type, I assumed this would automatically create partitions, but it doesn't. I checked for partitions with the following query:
#LegacySQL
SELECT table_id, partition_id from [dataset1.table1$__PARTITIONS_SUMMARY__];
Gives me:
[
{
"table_id": "table1",
"partition_id": "__UNPARTITIONED__"
}
]
Is there something I am missing here? It should have partitioned by date.

The problem was that I did not wait long enough for the first partition to become active. A logsink streams data in as unpartitioned. After a while the data is partitioned by date; for today's partition, this only becomes visible after a few hours. Problem solved! Running the same query later gives:
[
{
"table_id": "table1",
"partition_id": "__UNPARTITIONED__"
},
{
"table_id": "table1",
"partition_id": "20200510"
},
{
"table_id": "table1",
"partition_id": "20200511"
}
]
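As a small illustration of the check above, the rows returned by the __PARTITIONS_SUMMARY__ query can be filtered for real daily partitions. This is only a sketch: the helper name date_partitions is hypothetical, and it assumes daily partition IDs are eight-digit YYYYMMDD strings, as in the output shown.

```python
# Rows as returned by the __PARTITIONS_SUMMARY__ query in the answer above.
rows = [
    {"table_id": "table1", "partition_id": "__UNPARTITIONED__"},
    {"table_id": "table1", "partition_id": "20200510"},
    {"table_id": "table1", "partition_id": "20200511"},
]


def date_partitions(summary_rows):
    """Return daily partition IDs (YYYYMMDD), skipping special partitions.

    Hypothetical helper: anything that is not an 8-digit ID, such as
    __UNPARTITIONED__, is treated as "not yet partitioned".
    """
    return sorted(
        row["partition_id"]
        for row in summary_rows
        if len(row["partition_id"]) == 8 and row["partition_id"].isdigit()
    )


print(date_partitions(rows))
```

An empty list right after creating the sink would not necessarily mean partitioning is broken; per the answer, it may just be too early.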

Related

AWS AppFlow - rename source field

I have an AppFlow set up with Salesforce as the source and S3 as the destination. I am able to move all columns over by using a Map_all task type in the flow definition, and leaving the source fields empty.
However, now I want to move just a few columns to S3 and rename them as well. I was trying to do something like this:
"Tasks": [
{
"SourceFields": ["Website"],
"DestinationField": "Website",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account Name"],
"DestinationField": "AccountName",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account ID"],
"DestinationField": "AccountId",
"TaskType": "Map",
"TaskProperties": {},
}
],
but I get the error
Create Flow request failed: [Task Validation Error: You must specify a projection task or a MAP_ALL task].
How can I select a few columns as well as rename them before moving them to S3 without resorting to something like Glue?

Figured it out - first added a Projection task to fetch the fields needed, and then Map tasks, one per field being renamed
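A sketch of what that task list might look like, written as a Python dict builder so it can be passed to an SDK call or dumped to JSON. Assumptions to check against your own flow: the projection task is expressed as a Filter task with the Salesforce PROJECTION operator, the Map tasks use the NO_OP operator, and the Salesforce API field names are Name and Id rather than the labels "Account Name" and "Account ID". The helper name build_tasks is hypothetical.

```python
def build_tasks(renames):
    """Build AppFlow tasks: one projection task, then one Map task per field.

    `renames` maps source field name -> destination field name.
    """
    tasks = [
        {
            # Projection task: lists every source field the flow should fetch.
            "SourceFields": list(renames),
            "ConnectorOperator": {"Salesforce": "PROJECTION"},
            "TaskType": "Filter",
            "TaskProperties": {},
        }
    ]
    for source, destination in renames.items():
        # One Map task per field; DestinationField carries the new name.
        tasks.append(
            {
                "SourceFields": [source],
                "ConnectorOperator": {"Salesforce": "NO_OP"},
                "DestinationField": destination,
                "TaskType": "Map",
                "TaskProperties": {},
            }
        )
    return tasks


# Field names below are assumed API names, not taken from the question.
tasks = build_tasks({"Website": "Website", "Name": "AccountName", "Id": "AccountId"})
```

Fields that keep their name (Website here) still need a Map task; only the fields listed in the projection task are fetched.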

Different default behaviors of skipping empty buckets in timeseries queries by Druid native query and Druid SQL

According to the Apache Druid documentation on native queries:
Timeseries queries normally fill empty interior time buckets with zeroes.
For example, if you issue a "day" granularity timeseries query for the interval 2012-01-01/2012-01-04, and no data exists for 2012-01-02, you will receive:
[
{
"timestamp": "2012-01-01T00:00:00.000Z",
"result": { "sample_name1": <some_value> }
},
{
"timestamp": "2012-01-02T00:00:00.000Z",
"result": { "sample_name1": 0 }
},
{
"timestamp": "2012-01-03T00:00:00.000Z",
"result": { "sample_name1": <some_value> }
}
]
This can be controlled with the context flag "skipEmptyBuckets". Its default value is false, meaning empty buckets are not skipped but zero-filled.
However, when querying timeseries data with Druid SQL, the default behavior is to skip all empty buckets. I have to set the query context explicitly to get the results I want:
"context": {
"skipEmptyBuckets": false
}
This is a real problem for me because I need zero-filling to show all buckets in Apache Superset's timeseries charts, and there is no way to set the query context there.
As far as I know, the SQL statement is translated internally to a native query, so why the inconsistency?
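For clients that do let you control the request, the context can be sent alongside the SQL text. The sketch below only builds the JSON body for Druid's SQL HTTP endpoint (/druid/v2/sql); the datasource name sample_datasource and the helper name are hypothetical, and whether the planner honors the flag for a given query shape is something to verify on your Druid version.

```python
import json


def druid_sql_request(sql, zero_fill=True):
    """Build a Druid SQL request body with an explicit skipEmptyBuckets flag.

    zero_fill=True asks for empty buckets to be kept (skipEmptyBuckets=false).
    """
    return {"query": sql, "context": {"skipEmptyBuckets": not zero_fill}}


# Hypothetical datasource; the interval matches the documentation example.
sql = (
    'SELECT FLOOR(__time TO DAY) AS "day", COUNT(*) AS cnt '
    'FROM "sample_datasource" '
    "WHERE __time >= TIMESTAMP '2012-01-01' AND __time < TIMESTAMP '2012-01-04' "
    "GROUP BY 1"
)

body = druid_sql_request(sql)
print(json.dumps(body, indent=2))
```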

How to specify attributes to return from DynamoDB through AppSync

I have an AppSync pipeline resolver. The first function queries an Elasticsearch database for the DynamoDB keys. The second function queries DynamoDB using the provided keys. This was all working well until I ran into the 1 MB limit of AppSync. Since most of the data is in a few attributes/columns I don't need, I want to limit the results to just the attributes I need.
I tried adding AttributesToGet and ProjectionExpression, but both gave errors like:
{
"data": {
"getItems": null
},
"errors": [
{
"path": [
"getItems"
],
"data": null,
"errorType": "MappingTemplate",
"errorInfo": null,
"locations": [
{
"line": 2,
"column": 3,
"sourceName": null
}
],
"message": "Unsupported element '$[tables][dev-table-name][projectionExpression]'."
}
]
}
My DynamoDB function request mapping template looks like (returns results as long as data is less than 1 MB):
#set($ids = [])
#foreach($pResult in ${ctx.prev.result})
#set($map = {})
$util.qr($map.put("id", $util.dynamodb.toString($pResult.id)))
$util.qr($map.put("ouId", $util.dynamodb.toString($pResult.ouId)))
$util.qr($ids.add($map))
#end
{
"version" : "2018-05-29",
"operation" : "BatchGetItem",
"tables" : {
"dev-table-name": {
"keys": $util.toJson($ids),
"consistentRead": false
}
}
}

I contacted the AWS people who confirmed that ProjectionExpression is not supported currently and that it will be a while before they will get to it.
Instead, I created a lambda to pull the data from DynamoDB.
To limit the results from DynamoDB, I used $ctx.info.selectionSetList in AppSync to get the list of requested columns, then used that list to specify the data to pull from DynamoDB. I needed to get multiple results while maintaining order, so I used BatchGetItem and then merged the results with the original list of IDs using LINQ. This put the DynamoDB results back in the correct order, since BatchGetItem in C# does not preserve sort order like the AppSync version does.
Because I was using C# with a number of libraries, the cold start time was a little long, so I used Lambda Layers pre-JITed to Linux, which got the cold start time down from ~1.8 seconds to ~1 second (when using 1024 MB of RAM for the Lambda).
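The two steps described above (turning the selection set into a projection, then restoring the original order) can be sketched without any AWS calls. This is a Python illustration rather than the C#/LINQ code the answer refers to; the helper names are hypothetical, nested selections such as "address/city" are reduced to their top-level attribute, and items are matched on a single "id" key for brevity even though the table in the question also uses ouId.

```python
def build_projection(selection_set_list):
    """Turn an AppSync selectionSetList into DynamoDB projection parameters.

    Returns (ProjectionExpression, ExpressionAttributeNames). Placeholders
    (#f0, #f1, ...) avoid clashes with DynamoDB reserved words.
    """
    top_level = []
    for path in selection_set_list:
        name = path.split("/")[0]  # "address/city" -> "address"
        if name not in top_level:
            top_level.append(name)
    names = {f"#f{i}": name for i, name in enumerate(top_level)}
    return ", ".join(names), names


def restore_order(requested_ids, items, key="id"):
    """Reorder BatchGetItem results to match the order of the requested IDs."""
    by_id = {item[key]: item for item in items}
    return [by_id[i] for i in requested_ids if i in by_id]


expression, attribute_names = build_projection(
    ["id", "name", "address/city", "address/zip"]
)
ordered = restore_order(["b", "a", "c"], [{"id": "a"}, {"id": "b"}])
```

IDs with no matching item (here "c") are simply dropped; whether to drop them or return a placeholder is a design choice that depends on what the GraphQL schema expects.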

AppSync doesn't support projection but you can explicitly define what fields to return in the response template instead of returning the entire result set.
{
"id": "$ctx.result.get('id')",
"name": "$ctx.result.get('name')",
...
}

AWS IoT rule - timestamp for Elasticsearch

Have a bunch of IoT devices (ESP32) which publish a JSON object to things/THING_NAME/log for general debugging (to be extended into other topics with values in the future).
Here is the IoT rule which kind of works.
{
"sql": "SELECT *, parse_time(\"yyyy-mm-dd'T'hh:mm:ss\", timestamp()) AS timestamp, topic(2) AS deviceId FROM 'things/+/stdout'",
"ruleDisabled": false,
"awsIotSqlVersion": "2016-03-23",
"actions": [
{
"elasticsearch": {
"roleArn": "arn:aws:iam::xxx:role/iot-es-action-role",
"endpoint": "https://xxxx.eu-west-1.es.amazonaws.com",
"index": "devices",
"type": "device",
"id": "${newuuid()}"
}
}
]
}
I'm not sure how to set @timestamp inside Elasticsearch to allow time-based searches.
Maybe I'm going about this all wrong, but it almost works!

Elasticsearch can recognize date strings matching dynamic_date_formats.
The following format is automatically mapped as a date field in AWS Elasticsearch 7.1:
SELECT *, parse_time("yyyy/MM/dd HH:mm:ss", timestamp()) AS timestamp FROM 'events/job/#'
This approach does not require creating a preconfigured index, which is important for dynamically created indexes, e.g. with daily rotation for logs:
devices-${parse_time("yyyy.MM.dd", timestamp(), "UTC")}
According to elastic.co documentation,
The default value for dynamic_date_formats is:
[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
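To make the two parse_time patterns above concrete, here is a sketch of the strings they would produce for a given epoch-milliseconds value, reproduced with Python's strftime. It assumes UTC throughout, and the helper name es_fields is hypothetical; it is meant for checking what a document and index name should look like, not as a replacement for the rule.

```python
from datetime import datetime, timezone


def es_fields(epoch_ms):
    """Format an epoch-millisecond timestamp the way the rule above would.

    "timestamp" mirrors the yyyy/MM/dd HH:mm:ss pattern and "index" mirrors
    the devices-yyyy.MM.dd daily index name (UTC assumed).
    """
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return {
        "timestamp": dt.strftime("%Y/%m/%d %H:%M:%S"),
        "index": "devices-" + dt.strftime("%Y.%m.%d"),
    }


fields = es_fields(1589158861000)
print(fields)
```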

@timestamp is just a convention: the @ prefix is the default prefix for Logstash-generated fields. Because you are not using Logstash as a middleman between IoT and Elasticsearch, you don't have a default mapping for @timestamp.
But basically it is just a name, so call it what you want. The only thing that matters is that you declare it as a date field in the mappings section of the Elasticsearch index.
If for some reason you still need it to be called @timestamp, you can either SELECT it with that prefix right away in the AS clause (this might be an issue with IoT's SQL restrictions, not sure):
SELECT *, parse_time("yyyy-MM-dd'T'HH:mm:ss", timestamp()) AS @timestamp, topic(2) AS deviceId FROM 'things/+/stdout'
Or you can use the copy_to functionality when declaring your mapping:
PUT devices/device
{
"mappings": {
"properties": {
"timestamp": {
"type": "date",
"copy_to": "@timestamp"
},
"@timestamp": {
"type": "date"
}
}
}
}

Analytics in WSO2DAS

I'm getting a Table Not Found error while running a select query in the Spark console of WSO2 DAS. I've kept all the default configurations intact after the installation. I'm unable to fetch the data from the event stream even though the table is shown in the table dropdown of the Data Explorer.

When data first arrives in WSO2 DAS, it is persisted in the data store you mention.
But these are not tables that exist in Spark. You need to write a Spark query that creates a temporary table in Spark referencing the table you have persisted.
For example, if your stream is:
{
"name": "sample",
"version": "1.0.0",
"nickName": "",
"description": "",
"payloadData": [
{
"name": "ID",
"type": "INT"
},
{
"name": "NAME",
"type": "STRING"
}
]
}
you need to run the following Spark query in the Spark console:
CREATE TEMPORARY TABLE sample_temp USING CarbonAnalytics OPTIONS (tableName "sample", schema "ID INT, NAME STRING");
After executing the above script, try the following:
select * from sample_temp;
This should fetch the data you have pushed into WSO2DAS.
Happy learning!! :)
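If you have several streams, the CREATE TEMPORARY TABLE statement can be generated from the stream definition instead of typed by hand. This is only a sketch: the helper name and the _temp suffix are illustrative, and it assumes the stream's payload types can be used as-is in the schema option, which holds for the INT and STRING fields in the example above but may not for every type.

```python
def create_temp_table_sql(stream, suffix="_temp"):
    """Build the Spark statement that maps a persisted stream table into Spark."""
    schema = ", ".join(
        f'{field["name"]} {field["type"]}' for field in stream["payloadData"]
    )
    name = stream["name"]
    return (
        f"CREATE TEMPORARY TABLE {name}{suffix} USING CarbonAnalytics "
        f'OPTIONS (tableName "{name}", schema "{schema}");'
    )


# The stream definition from the answer above.
stream = {
    "name": "sample",
    "version": "1.0.0",
    "nickName": "",
    "description": "",
    "payloadData": [
        {"name": "ID", "type": "INT"},
        {"name": "NAME", "type": "STRING"},
    ],
}

print(create_temp_table_sql(stream))
```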
