Dealing With Incoming Null Values In Cloud Data Fusion When Building Data Pipeline

I have started trying out Google Cloud Data Fusion as a prospective ETL tool. While building a data pipeline that fetches data from a REST API source and loads it into a MySQL database, I am facing this error: 'Expected a string but was NULL at line 1 column 221'. Please check the system logs for more details. It is true that one field in the JSON response is null:
"systemanswertime": null
How do I deal with null values? The string type available in the dropdown in the Cloud Data Fusion Studio is not working. Are there other, optional data types that I can use?
Below are two screenshots showing my current data pipeline structure:
general view
view showing mapping and the output schema
Thank You!!

You need to tell the HTTP plugin to expect a null by checking the Null checkbox next to the field in the output schema on the right side.

You might be getting this error because the JSON schema defines the value types of the properties without allowing null. You should allow the systemanswertime parameter to be NULL.
You could try declaring the field in the schema as follows:
"systemanswertime": {
"type": [
"string",
"null"
]
}
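To illustrate why the union type matters, here is a minimal Python sketch. It is not Data Fusion's actual validation logic; `TYPE_CHECKS` and `matches_type` are hypothetical names introduced only for this example. It shows that a null value fails a plain "string" declaration but passes a ["string", "null"] union.

```python
# Hypothetical helper: mimics how a ["string", "null"] union accepts either
# a string or a null value. This is NOT Data Fusion's actual validator.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "null": lambda v: v is None,
}


def matches_type(value, declared_type):
    """Return True if value satisfies a type name or a union (list) of type names."""
    names = declared_type if isinstance(declared_type, list) else [declared_type]
    return any(TYPE_CHECKS[name](value) for name in names)


record = {"systemanswertime": None}

# A plain "string" declaration rejects the null value.
strict = matches_type(record["systemanswertime"], "string")

# The union declaration accepts it.
nullable = matches_type(record["systemanswertime"], ["string", "null"])

print(strict, nullable)
```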
If you don't have access to the JSON file, you could try to use this plug-in to let the HTTP source handle nullable values by dynamically substituting configurations that are served by an HTTP server. You will need to set up an accessible HTTP endpoint that serves content similar to:
{
"name" : "output.schema", "type" : "schema", "value" :
[
{ "name" : "id", "type" : "int", "nullable" : true},
{ "name" : "first_name", "type" : "string", "nullable" : true},
{ "name" : "last_name", "type" : "string", "nullable" : true},
{ "name" : "email", "type" : "string", "nullable" : true}
]
},

If you are facing an error such as: No matching schema found for union type: ["string","null"], you could try the following workaround. The root cause of this error is that some entries in the API response do not have all the fields the schema expects. For example, some entries may have callerId, channel, last_channel, last data, etc., but other entries may lack last_channel or some other field from the JSON. This leads to a mismatch with the schema provided in the HTTP source, and the pipeline fails right away.
As per this, when nodes encounter null values, logical errors, or other sources of errors, you may use an error handler plugin to catch errors. The steps are as follows:
In the HTTP source plug-in, change the following:
Change the Output schema to account for the custom field.
Change the JSON/XML field mapping to account for the custom field.
Change the Non-HTTP Error Handling field to Send to Error. This way it pushes the failing records through the error collector and the pipeline proceeds with subsequent records.
Add an Error Collector and a sink to capture the error records.
With this method you will be able to run the pipeline and have the problematic fields detected.
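The idea behind the error collector can be sketched in a few lines of Python. This is only an illustration of the routing behaviour, not how Data Fusion implements it; `REQUIRED_FIELDS` and `split_records` are hypothetical names, and the field names are borrowed from the example above.

```python
# Hypothetical required fields, taken from the example field names above.
REQUIRED_FIELDS = ("callerId", "channel", "last_channel")


def split_records(records, required=REQUIRED_FIELDS):
    """Separate complete records from those missing a required field."""
    good, errors = [], []
    for rec in records:
        missing = [field for field in required if field not in rec]
        if missing:
            # Route the record to the error output instead of failing the run.
            errors.append({"record": rec, "missing": missing})
        else:
            good.append(rec)
    return good, errors


good, errors = split_records([
    {"callerId": "1", "channel": "a", "last_channel": "b"},
    {"callerId": "2", "channel": "c"},  # last_channel is absent
])

print(len(good), "good record(s);", len(errors), "error record(s)")
```

The good records continue to the sink, while the error list plays the role of the separate error sink and shows which fields were missing.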
Kind regards,
Manuel

Related

AWS AppFlow - rename source field

I have an AppFlow set up with Salesforce as the source and S3 as the destination. I am able to move all columns over by using a Map_all task type in the flow definition, and leaving the source fields empty.
However, now I want to move just a few columns to S3, and rename them as well. I was trying to do something like this:
"Tasks": [
{
"SourceFields": ["Website"],
"DestinationField": "Website",
"TaskType": "Map",
"TaskProperties": {}
},
{
"SourceFields": ["Account Name"],
"DestinationField": "AccountName",
"TaskType": "Map",
"TaskProperties": {}
},
{
"SourceFields": ["Account ID"],
"DestinationField": "AccountId",
"TaskType": "Map",
"TaskProperties": {}
}
],
but I get the error
Create Flow request failed: [Task Validation Error: You must specify a projection task or a MAP_ALL task].
How can I select a few columns as well as rename them before moving them to S3 without resorting to something like Glue?

Figured it out: I first added a Projection task to fetch the fields needed, and then Map tasks, one per field being renamed.
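A rough Python sketch of that task list follows. It assumes the projection is expressed as a Filter task with the Salesforce PROJECTION connector operator, which is my understanding of the AppFlow task format and should be checked against the AppFlow API reference. `build_tasks` is a hypothetical helper, the field names are copied from the question, and the Map tasks may also need their own ConnectorOperator entry.

```python
import json


def build_tasks(renames):
    """Build a projection task followed by one Map task per renamed field."""
    # Assumption: a projection is a Filter task with the Salesforce
    # PROJECTION connector operator, listing every field to fetch.
    tasks = [{
        "SourceFields": list(renames),
        "ConnectorOperator": {"Salesforce": "PROJECTION"},
        "TaskType": "Filter",
        "TaskProperties": {},
    }]
    # One Map task per field, mapping the source name to the new name.
    for source, destination in renames.items():
        tasks.append({
            "SourceFields": [source],
            "DestinationField": destination,
            "TaskType": "Map",
            "TaskProperties": {},
        })
    return tasks


tasks = build_tasks({
    "Website": "Website",
    "Account Name": "AccountName",
    "Account ID": "AccountId",
})
print(json.dumps(tasks, indent=2))
```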

Creating Batch Operations with AWS Amplify [GraphQL, DataStore, AppSync]

I've currently been handling batch operations with a for loop, but obviously, this is not the best approach, especially as I'm adding an 'upload by CSV' option, which will take 1000+ putItems.
I searched around for the best ways to implement this, specifically this link:
https://docs.aws.amazon.com/appsync/latest/devguide/tutorial-dynamodb-batch.html
However, even after following those steps mentioned I'm not able to achieve a batch operation. Below is my code for a 'batch delete' operation.
Here is my schema.graphql file:
type Client @model @auth(rules: [{ allow: owner }]) {
id: ID!
name: String!
company: String
phone: String
email: String
}
type Mutation {
batchDelete(ids: [ID]): [Client]
}
I then create two new files. One request mapping template and one response mapping template.
#set($clientsdata = [])
#foreach($item in ${ctx.args.clients})
$util.qr($clientsdata.delete($util.dynamodb.toMapValues($item)))
#end
{
"version" : "2018-05-29",
"operation" : "BatchDeleteItem",
"tables" : {
"Clients": $utils.toJson($clientsdata)
}
}
and then as per the tutorial a "simple pass through" response mapping template:
$util.toJson($ctx.result.data.Posts)
However now when I run the batchdelete command, I keep getting nothing returned.
Would really appreciate guidance on this!

When performing DynamoDB batch operations with Amplify, note that the actual table name differs per environment. Your "Client" table will not be recognized as "Clients", as you have it in the request mapping template; it is known by the name it is given on amplify push, per environment.
E.g. Client-<some alphanumeric number>-envName
Add the full name of the table to your request and response mapping templates.
Also, your foreach statement should iterate over the argument that is actually passed to the mutation on the context object. With the schema above, that argument is ids:
#foreach($item in ${ctx.args.ids})
Hope this helps.
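For reference, the request document that the mapping template should end up producing can be sketched in Python. This assumes the mutation argument is ids, as in the schema above, and that id is the table's only key attribute. `build_batch_delete` and the table name `Client-abc123-dev` are hypothetical; use the real per-environment name from your own backend.

```python
import json


def build_batch_delete(ids, table_name):
    """Build a BatchDeleteItem request document for the given ids."""
    # Each key uses DynamoDB's typed format, which is what
    # $util.dynamodb.toMapValues produces for a string attribute.
    keys = [{"id": {"S": item_id}} for item_id in ids]
    return {
        "version": "2018-05-29",
        "operation": "BatchDeleteItem",
        "tables": {table_name: keys},
    }


# Hypothetical environment-specific table name.
request = build_batch_delete(["a1", "b2"], "Client-abc123-dev")
print(json.dumps(request, indent=2))
```

The response mapping template would then read the result under the same full table name rather than Posts.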

AWS Kendra PreHook Lambdas for Data Enrichment

I am working on a POC using Kendra and Salesforce. The connector allows me to connect to my Salesforce Org and index knowledge articles. I have been able to set this up and it is currently working as expected.
There are a few custom fields and data points I want to bring over to help enrich the data even more. One of these is an additional answer / body that will contain key information for the searching.
This field in my data source is rich text containing HTML and is often larger than 2048 characters, a limit that seems to be imposed in a String data field within Kendra.
I came across two hooks that are built in for Pre and Post data enrichment. My thought here is that I can use the pre hook to strip HTML tags and truncate the field before it gets stored in the index.
Hook Reference: https://docs.aws.amazon.com/kendra/latest/dg/API_CustomDocumentEnrichmentConfiguration.html
Current Setup:
I have added a new field to the index called sf_answer_preview. I then mapped this field in the data source to the rich text field in the Salesforce org.
If I run this as is, it will index about 200 of the 1,000 articles and give an error that the remaining articles exceed the 2048 character limit in that field, hence why I am trying to set up the enrichment.
I set up the above enrichment on my data source. I specified a lambda to use in the pre-extraction, as well as no additional filtering, so run this on every article. I am not 100% certain what the S3 bucket is for since I am using a data source, but it appears to be needed so I have added that as well.
For my lambda, I create the following:
exports.handler = async (event) => {
// Debug
console.log(JSON.stringify(event))
// Vars
const s3Bucket = event.s3Bucket;
const s3ObjectKey = event.s3ObjectKey;
const meta = event.metadata;
// Answer
const answer = meta.attributes.find(o => o.name === 'sf_answer_preview');
// Remove HTML Tags
const removeTags = (str) => {
// Return an empty string (not false) so truncate() always receives a string
if (str === null || str === undefined || str === '')
return '';
return str.toString().replace(/(<([^>]+)>)/ig, '');
}
// Truncate
const truncate = (input) => input.length > 2000 ? `${input.substring(0, 2000)}...` : input;
let result = truncate(removeTags(answer.value.stringValue));
// Response
const response = {
"version" : "v0",
"s3ObjectKey": s3ObjectKey,
"metadataUpdates": [
{"name":"sf_answer_preview", "value":{"stringValue":result}}
]
}
// Debug
console.log(response)
// Response
return response
};
Based on the contract for the lambda described here, it appears pretty straight forward. I access the event, find the field in the data called sf_answer_preview (the rich text field from Salesforce) and I strip and truncate the value to 2,000 characters.
For the response, I am telling it to update that field to the new formatted answer so that it complies with the field limits.
When I log the data in the lambda, the pre-extraction event details are as follows:
{
"s3Bucket": "kendrasfdev",
"s3ObjectKey": "pre-extraction/********/22736e62-c65e-4334-af60-8c925ef62034/https://*********.my.salesforce.com/ka1d0000000wkgVAAQ",
"metadata": {
"attributes": [
{
"name": "_document_title",
"value": {
"stringValue": "What majors are under the Exploratory track of Health and Life Sciences?"
}
},
{
"name": "sf_answer_preview",
"value": {
"stringValue": "A complete list of majors affiliated with the Exploratory Health and Life Sciences track is available online. This track allows you to explore a variety of majors related to the health and life science professions. For more information, please visit the Exploratory program description. "
}
},
{
"name": "_data_source_sync_job_execution_id",
"value": {
"stringValue": "0fbfb959-7206-4151-a2b7-fce761a46241"
}
},
]
}
}
The Problem:
When this runs, I am still getting the same field limit error that the content exceeds the character limit. When I run the lambda on the raw data, it strips and truncates it as expected. I am thinking that the response in the lambda for some reason isn't setting the field value to the new content correctly and still trying to use the data directly from Salesforce, thus throwing the error.
Has anyone set up lambdas for Kendra before that might know what I am doing wrong? This seems pretty common to be able to do things like strip PII information before it gets indexed, so I must be slightly off on my setup somewhere.
Any thoughts?

Since you are still passing the rich text as a metadata field of a document, the character limit still applies, so the document fails at the validation step of the API call and never reaches the enrichment step. A workaround is to somehow append those rich text fields to the body of the document so that your Lambda can access them there. But if those fields are auto-generated for your documents from your data source, that might not be easy.

Store invalid JSON columns as STRING or skip them in BigQuery

I have a JSON data file which looks something like below
{
"key_a": "value_a",
"key_b": "value_b",
"key_c": {
"c_nested/invalid.key.according.to.bigquery": "valid_value_though"
}
}
As we know, BigQuery considers c_nested/invalid.key.according.to.bigquery an invalid column name. I have a huge amount of log data exported by StackDriver into Google Cloud Storage which has a lot of invalid fields (according to BigQuery, fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long).
As a workaround, I am trying to store the value to the key_c (the whole {"c_nested/invalid.key.according.to.bigquery": "valid_value_though"} thing) as a string in the BigQuery table.
I presume my table definition would look something like below:
[
{
"mode": "NULLABLE",
"name": "key_a",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "key_b",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "key_c",
"type": "STRING"
}
]
When I try to create a table with this schema I get the below error:
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 0: Expected key
Assuming it is not supported in BigQuery, I thought of simply skipping the key_c column with the below schema:
[
{
"mode": "NULLABLE",
"name": "key_a",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "key_b",
"type": "STRING"
}
]
The above schema lets me at least create a permanent table (for querying external data), but when I am trying to query the data I get the following error:
Error while reading table:
projectname.dataset_name.table_name, error message:
JSON parsing error in row starting at position 0: No such field: key_c.
I understand there is a way described here to load each JSON row raw to BigQuery, as if it was a CSV, and then parse it in BigQuery, but that makes the queries too complicated.
Is cleaning the data the only way? How can I tackle this?
I am looking for a way to skip making a column for invalid fields and store them directly as STRING or simply ignore them fully. Is this possible?

One of the main reasons people use BQ (and other cloud databases) is that storage is cheap. In practice, it is often helpful to load 'raw' or 'source' data into BQ and then transform it as needed (with views or other transformation tools). This is a paradigm shift from ETL to ELT.
With that in mind, I would import your "invalid" JSON blob as a string, and then parse it in your transformation steps. Here is one method:
with data as (select '{"key_a":"value_a","key_b":"value_b","key_c":{"c_nested/invalid.key.according.to.bigquery":"valid_value_though"}}' as my_string)
select
JSON_EXTRACT_SCALAR(my_string,'$.key_a') as key_a,
JSON_EXTRACT_SCALAR(my_string,'$.key_b') as key_b,
JSON_EXTRACT_SCALAR(REPLACE(my_string,"c_nested/invalid.key.according.to.bigquery","custom_key"),'$.key_c.custom_key') as key_c
from data
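If preprocessing the files before loading is an option, the nested object can instead be serialized to a string so that key_c fits a STRING column. The following Python sketch assumes newline-delimited JSON input; `stringify_field` is a hypothetical helper and not part of any BigQuery tooling.

```python
import json


def stringify_field(line, field="key_c"):
    """Re-encode one JSON line so that `field` holds a JSON string, not an object."""
    row = json.loads(line)
    if field in row and not isinstance(row[field], str):
        # Serialize the nested object; BigQuery then sees a plain string value.
        row[field] = json.dumps(row[field], separators=(",", ":"))
    return json.dumps(row)


raw = (
    '{"key_a": "value_a", "key_b": "value_b", '
    '"key_c": {"c_nested/invalid.key.according.to.bigquery": "valid_value_though"}}'
)
cleaned = json.loads(stringify_field(raw))
print(cleaned["key_c"])
```

The stringified column can still be parsed later with the JSON functions shown in the query above.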

Amazon API Gateway swagger importer tool does not import minItems field from swagger

I am trying the api gateway validation example from here https://github.com/rpgreen/apigateway-validation-demo . I observed that from the given swagger.json file, minItems is not imported into the models which got created during the swagger import.
"CreateOrders": {
"title": "Create Orders Schema",
"type": "array",
"minItems" : 1,
"items": {
"type": "object",
"$ref" : "#/definitions/Order"
}
}
Because of this when you give an empty array [ ] as input, instead of throwing an error about minimum items in an array, the api responds with a message 'created orders successfully'.
When I manually add the same from the API Gateway console UI, it seems to work as expected. Am I missing something, or is this a bug in the importer?

This is a known issue with the Swagger import feature of API Gateway.
From http://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-known-issues.html
The maxItems and minItems tags are not included in simple request validation. To work around this, update the model after import before doing validation.
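One way to script that workaround is sketched below in Python. It only prepares the patched schema. The patch operation shape (a replace on /schema) is an assumption about API Gateway's model update call and should be verified before use; `restore_min_items` is a hypothetical helper, and the imported schema is abbreviated from the question.

```python
import json


def restore_min_items(schema_json, min_items=1):
    """Return the model schema with minItems re-added when the importer dropped it."""
    schema = json.loads(schema_json)
    if schema.get("type") == "array":
        # Keep an existing value; only fill in the missing constraint.
        schema.setdefault("minItems", min_items)
    return json.dumps(schema)


# Schema as it might look after import, with minItems missing.
imported = json.dumps({
    "title": "Create Orders Schema",
    "type": "array",
    "items": {"$ref": "#/definitions/Order"},
})

# Assumed shape of the patch passed to the model update call.
patch_operations = [
    {"op": "replace", "path": "/schema", "value": restore_min_items(imported)}
]
print(patch_operations[0]["value"])
```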
