Analytics in WSO2DAS - wso2

I'm getting a "Table Not Found" error while running a select query in the Spark console of WSO2 DAS. I've kept all the default configurations intact after installation. I'm unable to fetch data from the event stream even though it appears in the table dropdown of the Data Explorer.

When data is first pushed into WSO2 DAS, it is persisted in the data store you specify.
However, these are not tables that exist in Spark. You need to run a Spark query that creates a temporary table in Spark referencing the table you have persisted.
For example, if your stream definition is:
{
  "name": "sample",
  "version": "1.0.0",
  "nickName": "",
  "description": "",
  "payloadData": [
    {
      "name": "ID",
      "type": "INT"
    },
    {
      "name": "NAME",
      "type": "STRING"
    }
  ]
}
you need to run the following Spark query in the Spark console:
CREATE TEMPORARY TABLE sample_temp USING CarbonAnalytics OPTIONS (tableName "sample", schema "ID INT, NAME STRING");
After executing the above script, try the following:
select * from sample_temp;
This should fetch the data you have pushed into WSO2 DAS.
Happy learning!! :)


AWS AppFlow - rename source field

I have an AppFlow set up with Salesforce as the source and S3 as the destination. I am able to move all columns over by using a Map_all task type in the flow definition, and leaving the source fields empty.
However, now I want to move just a few columns to S3 and rename them as well. I was trying to do something like this:
"Tasks": [
{
"SourceFields": ["Website"],
"DestinationField": "Website",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account Name"],
"DestinationField": "AccountName",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account ID"],
"DestinationField": "AccountId",
"TaskType": "Map",
"TaskProperties": {},
}
],
but I get the error
Create Flow request failed: [Task Validation Error: You must specify a projection task or a MAP_ALL task].
How can I select a few columns as well as rename them before moving them to S3 without resorting to something like Glue?
Figured it out: first add a Projection task to fetch the fields needed, and then Map tasks, one per field being renamed (see the sketch below).
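For reference, here is a rough sketch of what that Tasks list could look like, written as a Python structure mirroring the JSON above. The field names are taken from the question; the Filter task with the PROJECTION connector operator is my reading of how AppFlow expects the projection to be declared, so treat it as a starting point rather than a verified definition.

# Sketch only: a projection task selecting the columns to pull from Salesforce,
# followed by one Map task per field being renamed.
tasks = [
    {
        # Assumed: the projection is expressed as a Filter task with the
        # PROJECTION connector operator listing every field to keep.
        "SourceFields": ["Website", "Account Name", "Account ID"],
        "ConnectorOperator": {"Salesforce": "PROJECTION"},
        "TaskType": "Filter",
        "TaskProperties": {},
    },
    {
        "SourceFields": ["Website"],
        "DestinationField": "Website",
        "TaskType": "Map",
        "TaskProperties": {},
    },
    {
        "SourceFields": ["Account Name"],
        "DestinationField": "AccountName",
        "TaskType": "Map",
        "TaskProperties": {},
    },
    {
        "SourceFields": ["Account ID"],
        "DestinationField": "AccountId",
        "TaskType": "Map",
        "TaskProperties": {},
    },
]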

How to POST logical type decimal values for Avro records to Confluent Rest Proxy?

We are using the Confluent REST Proxy to allow a vendor to communicate with our Kafka system, and we need to test a variety of data.
One of our fields in the Avro schema has a logical type of decimal. To keep this simple, let's assume the schema shown here:
{
  "fields": [
    {
      "name": "fieldName",
      "type": "string"
    },
    {
      "name": "amount",
      "type": {
        "logicalType": "decimal",
        "precision": 16,
        "scale": 2,
        "type": "bytes"
      }
    }
  ],
  "name": "Sample",
  "namespace": "com.test.sample",
  "type": "record"
}
It's easy enough to write to the topic via a Java producer, using Avro Tools to produce the appropriate class files. But when attempting to use the Rest Proxy, we have to pass values such as this:
{"value_schema_id":132,"records": [{"value":{"fieldName":"Field Name","amount":"\u0001ã"}}]}
This was copied from a record created via the Java producer and then downloaded from the topic. But in the amount field, we'd like to be able to pass a value such as 123.45. We're using Postman for the most part to send data. Is there a way to do this with a logical decimal field and without having to create and serialize the data first to see the representation such as \u0001ã?
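For what it's worth, the escaped string in that payload is just Avro's JSON encoding of a bytes-backed decimal: the unscaled value as big-endian two's-complement bytes, with each byte rendered as a Unicode code point (so \u0001ã is 0x01E3 = 483, i.e. 4.83 at scale 2). Below is a minimal Python sketch of producing that string from a readable value (the helper name is mine), which could then be dropped into the Postman body.

from decimal import Decimal

def decimal_to_avro_json_bytes(value, scale):
    """Encode a decimal as the Avro JSON form of a bytes-backed logical decimal:
    big-endian two's-complement unscaled value, one code point per byte."""
    unscaled = int(Decimal(str(value)).scaleb(scale))      # "123.45" -> 12345
    length = max(1, (unscaled.bit_length() + 8) // 8)      # leave room for the sign bit
    return unscaled.to_bytes(length, byteorder="big", signed=True).decode("latin-1")

# Request body for the REST Proxy, using the schema id from the question.
payload = {
    "value_schema_id": 132,
    "records": [
        {"value": {"fieldName": "Field Name",
                   "amount": decimal_to_avro_json_bytes("123.45", 2)}}
    ]
}

print(payload["records"][0]["value"]["amount"])  # '09' (bytes 0x30 0x39 == 12345)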

GCP Data Fusion multiple table import

I'm trying to use the Multiple Database Tables and BigQuery Multi Table Data Fusion plugins to import multiple tables in one pipeline.
But when I try to execute the pipeline I get the following error:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: BigQuery Multi Table has no outputs. Please check that the sink calls addOutput at some point.
I'm using Data Fusion version 6.1.4, Multiple Database Tables version 1.2.0, and BigQuery Multi Table version 0.14.8.
Any suggestion on what may be the problem?
Edit: here is the configuration of the Multiple Database Tables source:
{
  "name": "Multiple Database Tables",
  "plugin": {
    "name": "MultiTableDatabase",
    "type": "batchsource",
    "label": "Multiple Database Tables",
    "artifact": {
      "name": "multi-table-plugins",
      "version": "1.2.0",
      "scope": "USER"
    },
    "properties": {
      "splitsPerTable": "1",
      "referenceName": "multiTable",
      "connectionString": "${secure(connection)}",
      "jdbcPluginName": "netezza",
      "user": "${secure(username)}",
      "password": "${secure(password)}",
      "whiteList": "categoria_l,cliente_l,regione_l"
    }
  },
  "outputSchema": [
    {
      "name": "etlSchemaBody",
      "schema": ""
    }
  ]
},
After further testing, the problem is that the source response is empty because Data Fusion is not reading views from the source database, only tables.
It seems like the Multiple Database Tables source produced no records ("Out 0"). I'd check there first. You can do a quick check using the Preview mode. Plugin doc here.
Related answer here.

Store invalid JSON columns as STRING or skip them in BigQuery

I have a JSON data file which looks something like below
{
  "key_a": "value_a",
  "key_b": "value_b",
  "key_c": {
    "c_nested/invalid.key.according.to.bigquery": "valid_value_though"
  }
}
As we know, BigQuery considers c_nested/invalid.key.according.to.bigquery an invalid column name. I have a huge amount of log data exported by Stackdriver into Google Cloud Storage, which contains many invalid field names (according to BigQuery, fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long).
As a workaround, I am trying to store the value to the key_c (the whole {"c_nested/invalid.key.according.to.bigquery": "valid_value_though"} thing) as a string in the BigQuery table.
I presume my table definition would look something like below:
[
  {
    "mode": "NULLABLE",
    "name": "key_a",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "key_b",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "key_c",
    "type": "STRING"
  }
]
When I try to create a table with this schema I get the below error:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 0: Expected key
Assuming it is now supported in BigQuery, I thought of simply skipping the key_c column with the below schema:
[
  {
    "mode": "NULLABLE",
    "name": "key_a",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "key_b",
    "type": "STRING"
  }
]
The above schema lets me at least create a permanent table (for querying external data), but when I am trying to query the data I get the following error:
Error while reading table:
projectname.dataset_name.table_name, error message:
JSON parsing error in row starting at position 0: No such field: key_c.
I understand there is a way described here to load each JSON row raw into BigQuery, as if it were CSV, and then parse it in BigQuery, but that makes the queries too complicated.
Is cleaning the data the only way? How can I tackle this?
I am looking for a way to skip making a column for invalid fields and store them directly as STRING, or simply ignore them fully. Is this possible?
One of the main reasons people use BQ (and other cloud databases) is that storage is cheap. In practice, it is often helpful to load 'raw' or 'source' data into BQ and then transform it as needed (with views or other transformation tools). This is a paradigm shift from ETL to ELT.
With that in mind, I would import your "invalid" JSON blob as a string, and then parse it in your transformation steps. Here is one method:
with data as (
  select '{"key_a":"value_a","key_b":"value_b","key_c":{"c_nested/invalid.key.according.to.bigquery":"valid_value_though"}}' as my_string
)
select
  JSON_EXTRACT_SCALAR(my_string, '$.key_a') as key_a,
  JSON_EXTRACT_SCALAR(my_string, '$.key_b') as key_b,
  JSON_EXTRACT_SCALAR(REPLACE(my_string, "c_nested/invalid.key.according.to.bigquery", "custom_key"), '$.key_c.custom_key') as key_c
from data
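To get each raw JSON line into BigQuery as a single STRING column in the first place (the "load it as if it were CSV" trick mentioned in the question), one option is a load job that uses a field delimiter and quote character that never occur in the data. Here is a minimal sketch with the google-cloud-bigquery client, assuming newline-delimited JSON in GCS and made-up bucket and table names:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical locations; replace with your own.
source_uri = "gs://my-bucket/stackdriver-export/*.json"
table_id = "projectname.dataset_name.raw_logs"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    # One STRING column that receives the whole JSON line.
    schema=[bigquery.SchemaField("raw", "STRING")],
    # A delimiter and quote character that cannot occur in the data,
    # so each line lands intact in the single "raw" column.
    field_delimiter="\u00ff",
    quote_character="",
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish

The nested or invalid keys can then be parsed out of the raw column with JSON_EXTRACT_SCALAR, as in the query above.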

AWS IoT rule - timestamp for Elasticsearch

I have a bunch of IoT devices (ESP32) which publish a JSON object to things/THING_NAME/log for general debugging (to be extended to other topics with values in the future).
Here is the IoT rule, which kind of works:
{
  "sql": "SELECT *, parse_time(\"yyyy-mm-dd'T'hh:mm:ss\", timestamp()) AS timestamp, topic(2) AS deviceId FROM 'things/+/stdout'",
  "ruleDisabled": false,
  "awsIotSqlVersion": "2016-03-23",
  "actions": [
    {
      "elasticsearch": {
        "roleArn": "arn:aws:iam::xxx:role/iot-es-action-role",
        "endpoint": "https://xxxx.eu-west-1.es.amazonaws.com",
        "index": "devices",
        "type": "device",
        "id": "${newuuid()}"
      }
    }
  ]
}
I'm not sure how to set @timestamp inside Elasticsearch to allow time-based searches.
Maybe I'm going about this all wrong, but it almost works!
Elasticsearch can recognize date strings matching dynamic_date_formats.
The following format is automatically mapped as a date field in AWS Elasticsearch 7.1:
SELECT *, parse_time("yyyy/MM/dd HH:mm:ss", timestamp()) AS timestamp FROM 'events/job/#'
This approach does not require creating a preconfigured index, which is important for dynamically created indices, e.g. with daily rotation for logs:
devices-${parse_time("yyyy.MM.dd", timestamp(), "UTC")}
According to elastic.co documentation,
The default value for dynamic_date_formats is:
[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
@timestamp is just a convention, as the @ prefix is the default prefix for Logstash-generated fields. Because you are not using Logstash as a middleman between IoT and Elasticsearch, you don't have a default mapping for @timestamp.
But basically, it is just a name, so call it what you want; the only thing that matters is that you declare it as a date field in the mappings section of the Elasticsearch index.
If for some reason you still need it to be called @timestamp, you can either SELECT it with that prefix right away in the AS section (might be an issue with IoT's SQL restrictions, not sure):
SELECT *, parse_time(\"yyyy-mm-dd'T'hh:mm:ss\", timestamp()) AS @timestamp, topic(2) AS deviceId FROM 'things/+/stdout'
Or you use the copy_to functionality when declaring your mapping:
PUT devices
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "copy_to": "@timestamp"
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}