GCP Data Fusion multiple table import

GCP Data Fusion multiple table import - google-cloud-platform

I'm trying to use Multiple Database Tables and BigQuery Multi Table Data Fusion plugin to import multiple table in one pipeline
But when I try to execute I get the following error
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: BigQuery Multi Table has no outputs. Please check that the sink calls addOutput at some point.
I'm using Data Fusion version 6.1.4 Multiple Database Tables version 1.2.0 and BigQuery Multi Table version 0.14.8.
Any suggestion on what may be the problem?
Edit:
following the configuration of multiple table database source
{
"name": "Multiple Database Tables",
"plugin": {
"name": "MultiTableDatabase",
"type": "batchsource",
"label": "Multiple Database Tables",
"artifact": {
"name": "multi-table-plugins",
"version": "1.2.0",
"scope": "USER"
},
"properties": {
"splitsPerTable": "1",
"referenceName": "multiTable",
"connectionString": "${secure(connection)}",
"jdbcPluginName": "netezza",
"user": "${secure(username)}",
"password": "${secure(password)}",
"whiteList": "categoria_l,cliente_l,regione_l"
}
},
"outputSchema": [
{
"name": "etlSchemaBody",
"schema": ""
}
]
},
After further test the problem is that the source response is empty because data fusion is not reading view from source database but only tables

It seems like the Multiple Database Tables source produced no records ("Out 0"). I'd check there first. You can do a quick check using the Preview mode. Plugin doc here.
Related answer here.

Related

AWS AppFlow - rename source field

I have an AppFlow set up with Salesforce as the source and S3 as the destination. I am able to move all columns over by using a Map_all task type in the flow definition, and leaving the source fields empty.
However now I want to move just a few columns to S3, and rename them as well. I was trying to do something like this :
"Tasks": [
{
"SourceFields": ["Website"],
"DestinationField": "Website",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account Name"],
"DestinationField": "AccountName",
"TaskType": "Map",
"TaskProperties": {},
},
{
"SourceFields": ["Account ID"],
"DestinationField": "AccountId",
"TaskType": "Map",
"TaskProperties": {},
}
],
but I get the error
Create Flow request failed: [Task Validation Error: You must specify a projection task or a MAP_ALL task].
How can I select a few columns as well as rename them before moving them to S3 without resorting to something like Glue?

Figured it out - first added a Projection task to fetch the fields needed, and then Map tasks, one per field being renamed

How to POST logical type decimal values for Avro records to Confluent Rest Proxy?

We are using the Confluent Rest Proxy to communicate with Kafka and need to test a variety of data. We are using the Rest Proxy to allow a vendor to communicate with our Kafka system.
One of our fields in the Avro schema has a logical type of decimal. To keep this simple, let's assume the schema shown here:
{
"fields": [
{
"name": "fieldName",
"type": "string"
},
{
"name": "amount",
"type": {
"logicalType": "decimal",
"precision": 16,
"scale": 2,
"type": "bytes"
}
}
],
"name": "Sample",
"namespace": "com.test.sample",
"type": "record"
}
It's easy enough to write to the topic via a Java producer, using Avro Tools to produce the appropriate class files. But when attempting to use the Rest Proxy, we have to pass values such as this:
{"value_schema_id":132,"records": [{"value":{"fieldName":"Field Name","amount":"\u0001ã"}}]}
This was copied from a record created via the Java producer and then downloaded from the topic. But in the amount field, we'd like to be able to pass a value such as 123.45. We're using Postman for the most part to send data. Is there a way to do this with a logical decimal field and without having to create and serialize the data first to see the representation such as \u0001ã?

Error when importing CSV file into Amazon Personalize

I am trying to import a CSV file into Amazon Personalize
my schema looks like this:
{
"type": "record",
"name": "Items",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "AUTHOR",
"type": "string",
"categorical": true
},
{
"name": "COUNTRY",
"type": "string",
"categorical": true
},
{
"name": "CITY",
"type": "string",
"categorical": true
},
{
"name": "STYLES",
"type": "string",
"categorical": true
},
{
"name": "CATEGORIES",
"type": "string",
"categorical": true
}
],
"version": "1.0"
}
the first few rows of data look like this:
ITEM_ID,AUTHOR,COUNTRY,CITY,STYLES,CATEGORIES
5b4253a7e12434f55875381e,5acd193f48ed4b9b3add5be6,US,city_us_austin,5ad45bc575eb016f3cdb562b|571aa21888a4fd9934f0fd7b|571aa21888a4fd9934f0fd79|5ad45e8c75eb016f3cdb563f|5b4ea35abaa12285687a1f47,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
5b4253a7e12434f55875381f,5acd193f48ed4b9b3add5be6,US,city_us_jackson,571aa21888a4fd9934f0fd82|57600e419e4959cd069658eb|5ad45c3a75eb016f3cdb5631|571aa21888a4fd9934f0fd7b|57aaa7094a393f531ace43f0|575e6d8e34ca56f742bea1c8|571aa21888a4fd9934f0fd8f,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
I get the error
Failed to create a data import job for item dataset.
Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
How can I figure out what is wrong with the CSV (it's thousands of lines long), so I have not idea if its a general mistake, or something wrong on a specific line?

In my experience, so long as the dataset is not >250 thousand records, you can still use Excel to check the data utilizing data filters and corresponding search functions. If it's more than that, look into using Notepad++ and RegEx. Your problem may be one of the following things:
(1) There's a missing comma. This would misalign your data and keep it from being processed.
(2) There's a missing ITEM_ID value. For Items, Personalize requires ITEM_ID and at least one metadata field. It might give this error if there is an instance where you are missing ITEM_ID or have ITEM_ID but no other metadata field values.
(3) STYLES and/or CATEGORIES exceeds 256 characters. There is probably a limit on String length, but I can't get a clear answer on this from the developer's guide. I would guess it's 256 characters. If I was betting money, this would be my guess on your problem.

Here is a different approach to solve the problem, maybe will be useful for other cases. I had the same issue, but when dealing with int columns having null values. Pandas by default converts the columns to float data type - something AWS Personalize dataset import job will not accept if you have dedfined these columns as int or long. Long story short, converting these columns to int solves the problem:
df.column_name = df.column_name.astype(pd.Int32Dtype())

Store invalid JSON columns are STRING or skip them in BigQuery

I have a JSON data file which looks something like below
{
"key_a": "value_a",
"key_b": "value_b",
"key_c": {
"c_nested/invalid.key.according.to.bigquery": "valid_value_though"
}
}
As we know BigQuery considers c_nested/invalid.key.according.to.bigquery as an invalid column name. I have a huge amount of log data exported by StackDriver into Google Cloud Storage which has a lot of invalid fields (according to BigQuery Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long).
As a workaround, I am trying to store the value to the key_c (the whole {"c_nested/invalid.key.according.to.bigquery": "valid_value_though"} thing) as a string in the BigQuery table.
I presume my table definition would look something like below:
[
{
"mode": "NULLABLE",
"name": "key_a",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "key_b",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "key_c",
"type": "STRING"
}
]
When I try to create a table with this schema I get the below error:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 0: Expected key
Assuming it is now supported in BigQuery, I thought of simply skipping the key_c column with the below schema:
[
{
"mode": "NULLABLE",
"name": "key_a",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "key_b",
"type": "STRING"
}
]
The above schema lets me at least create a permanent table (for querying external data), but when I am trying to query the data I get the following error:
Error while reading table:
projectname.dataset_name.table_name, error message:
JSON parsing error in row starting at position 0: No such field: key_c.
I understand there is a way described here to load each JSON row raw to BigQuery - as if it was a CSV - and then parse in BigQuery but hat makes the queries too complicated.
Is cleaning the data the only way? How can I tackle this?
I am looking for a way to skip making a column for invalid fields and store then directly as STRING or simply ignore them fully. Is this possible?

One of the main premise why people use BQ (and other cloud databases) is that storage is cheap. In practice, it is often helpful to load 'raw' or 'source' data into BQ and then transform it as needed (views or other transformation tools). This is a paradigm shift from ETL to ELT.
With that in mind, I would import your "invalid" JSON blob as a string, and then parse it in your transformation steps. Here is one method:
with data as (select '{"key_a":"value_a","key_b":"value_b","key_c":{"c_nested/invalid.key.according.to.bigquery":"valid_value_though"}}' as my_string)
select
JSON_EXTRACT_SCALAR(my_string,'$.key_a') as key_a,
JSON_EXTRACT_SCALAR(my_string,'$.key_b') as key_b,
JSON_EXTRACT_SCALAR(REPLACE(my_string,"c_nested/invalid.key.according.to.bigquery","custom_key"),'$.key_c.custom_key') as key_c
from data

Analytics in WSO2DAS

I'm getting a Table Not Found error while running a select query on spark console of wso2das. I've kept all the default configurations intact after the installation. I'm unable to fetch the data from the event stream even when it's been shown under table dropdown of data explorer.

Initially when the data is moved into the wso2das, it would be persisted in the data store you mention.
But, these are not the tables that are created in spark. You need to write a spark query to create a temporary table in spark which would reference the table you have persisted.
For example,
If your stream is,
{
"name": "sample",
"version": "1.0.0",
"nickName": "",
"description": "",
"payloadData": [
{
"name": "ID",
"type": "INT"
},
{
"name": "NAME",
"type": "STRING"
}
]
}
you need to write the following spark query in the spark console,
CREATE TEMPORARY TABLE sample_temp USING CarbonAnalytics OPTIONS (tableName "sample", schema "ID INT, NAME STRING");
after executing the above script,try the following,
select * from sample_temp;
This should fetch the data you have pushed into WSO2DAS.
Happy learning!! :)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

GCP Data Fusion multiple table import - google-cloud-platform

It seems like the Multiple Database Tables source produced no records ("Out 0"). I'd check there first. You can do a quick check using the Preview mode. Plugin doc here. Related answer here.

Related

AWS AppFlow - rename source field

How to POST logical type decimal values for Avro records to Confluent Rest Proxy?

Error when importing CSV file into Amazon Personalize

Store invalid JSON columns are STRING or skip them in BigQuery

Analytics in WSO2DAS

Categories

Resources