I am trying to transfer data from Postgres to Redshift using AWS Data Pipeline, and the process I am following is:
Write a pipeline (CopyActivity) that moves the data from Postgres to S3
Write a pipeline (RedshiftCopyActivity) that moves the data from S3 to Redshift
Both pipelines work, but the problem is that the data ends up duplicated in the Redshift database.
For example, below is the data from the Postgres database in a table called company.
After a successful run of the S3-to-Redshift (RedshiftCopyActivity) pipeline, the data was copied but duplicated, as shown below.
Below is part of the definition from the RedshiftCopyActivity (S3 to Redshift) pipeline:
pipeline_definition = [{
    "id": "redshift_database_instance_output",
    "name": "redshift_database_instance_output",
    "fields": [
        {
            "key": "database",
            "refValue": "RedshiftDatabaseId_S34X5",
        },
        {
            "key": "primaryKeys",
            "stringValue": "id",
        },
        {
            "key": "type",
            "stringValue": "RedshiftDataNode",
        },
        {
            "key": "tableName",
            "stringValue": "company",
        },
        {
            "key": "schedule",
            "refValue": "DefaultScheduleTime",
        },
        {
            "key": "schemaName",
            "stringValue": RedShiftSchemaName,
        },
    ]
},
{
    "id": "CopyS3ToRedshift",
    "name": "CopyS3ToRedshift",
    "fields": [
        {
            "key": "output",
            "refValue": "redshift_database_instance_output",
        },
        {
            "key": "input",
            "refValue": "s3_input_data",
        },
        {
            "key": "runsOn",
            "refValue": "ResourceId_z9RNH",
        },
        {
            "key": "type",
            "stringValue": "RedshiftCopyActivity",
        },
        {
            "key": "insertMode",
            "stringValue": "KEEP_EXISTING",
        },
        {
            "key": "schedule",
            "refValue": "DefaultScheduleTime",
        },
    ]
}]
According to the docs, RedshiftCopyActivity uses insertMode to describe how the data should be handled (inserted/updated/deleted) when copying into a database table, as below:
insertMode : Determines what AWS Data Pipeline does with pre-existing data in the target table that overlaps with rows in the data to be loaded. Valid values are KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE and APPEND. KEEP_EXISTING adds new rows to the table, while leaving any existing rows unmodified. KEEP_EXISTING and OVERWRITE_EXISTING use the primary key, sort, and distribution keys to identify which incoming rows to match with existing rows, according to the information provided in Updating and inserting new data in the Amazon Redshift Database Developer Guide. TRUNCATE deletes all the data in the destination table before writing the new data. APPEND will add all records to the end of the Redshift table. APPEND does not require a primary, distribution key, or sort key so items that may be potential duplicates may be appended.
My requirements are:
When copying from Postgres (in fact, the data is in S3 now) to the Redshift database, if a row already exists, just update it.
If there are new records in S3, create new records in Redshift.
But even though I have used KEEP_EXISTING or OVERWRITE_EXISTING, the data just keeps repeating, as shown in the Redshift screenshot above.
So how do I achieve my requirements? Are there still any tweaks or settings to add to my configuration?
Edit
Table (company) definition from Redshift:
If you want to avoid duplication, you must define a primary key in Redshift and also set insertMode to "OVERWRITE_EXISTING".
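Concretely, a minimal sketch based on the definition in the question (everything else unchanged): declare id as the PRIMARY KEY in the company table's DDL in Redshift, since the copy uses the declared key to match incoming rows with existing ones, keep the primaryKeys field on the data node, and switch the activity's insertMode:
# In the "redshift_database_instance_output" (RedshiftDataNode) fields:
{
    "key": "primaryKeys",
    "stringValue": "id",
},
# In the "CopyS3ToRedshift" (RedshiftCopyActivity) fields:
{
    "key": "insertMode",
    "stringValue": "OVERWRITE_EXISTING",
},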
Related
I have an AppFlow set up with Salesforce as the source and S3 as the destination. I am able to move all columns over by using a Map_all task type in the flow definition, and leaving the source fields empty.
However, now I want to move just a few columns to S3 and rename them as well. I was trying to do something like this:
"Tasks": [
    {
        "SourceFields": ["Website"],
        "DestinationField": "Website",
        "TaskType": "Map",
        "TaskProperties": {}
    },
    {
        "SourceFields": ["Account Name"],
        "DestinationField": "AccountName",
        "TaskType": "Map",
        "TaskProperties": {}
    },
    {
        "SourceFields": ["Account ID"],
        "DestinationField": "AccountId",
        "TaskType": "Map",
        "TaskProperties": {}
    }
],
but I get the error
Create Flow request failed: [Task Validation Error: You must specify a projection task or a MAP_ALL task].
How can I select a few columns as well as rename them before moving them to S3 without resorting to something like Glue?
Figured it out: first add a Projection task to fetch the fields needed, and then Map tasks, one per field being renamed.
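A sketch of what that Tasks array can look like, based on the one in the question (a Filter task with the Salesforce PROJECTION connector operator is how the projection task is usually expressed; adjust the field list to your flow):
"Tasks": [
    {
        "SourceFields": ["Website", "Account Name", "Account ID"],
        "ConnectorOperator": { "Salesforce": "PROJECTION" },
        "TaskType": "Filter",
        "TaskProperties": {}
    },
    {
        "SourceFields": ["Website"],
        "DestinationField": "Website",
        "TaskType": "Map",
        "TaskProperties": {}
    },
    {
        "SourceFields": ["Account Name"],
        "DestinationField": "AccountName",
        "TaskType": "Map",
        "TaskProperties": {}
    },
    {
        "SourceFields": ["Account ID"],
        "DestinationField": "AccountId",
        "TaskType": "Map",
        "TaskProperties": {}
    }
],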
I have an AppSync pipeline resolver. The first function queries an ElasticSearch database for the DynamoDB keys. The second function queries DynamoDB using the provided keys. This was all working well until I ran into the 1 MB limit of AppSync. Since most of the data is in a few attributes/columns I don't need, I want to limit the results to just the attributes I need.
I tried adding AttributesToGet and ProjectionExpression (from here) but both gave errors like:
{
    "data": {
        "getItems": null
    },
    "errors": [
        {
            "path": [
                "getItems"
            ],
            "data": null,
            "errorType": "MappingTemplate",
            "errorInfo": null,
            "locations": [
                {
                    "line": 2,
                    "column": 3,
                    "sourceName": null
                }
            ],
            "message": "Unsupported element '$[tables][dev-table-name][projectionExpression]'."
        }
    ]
}
My DynamoDB function request mapping template looks like this (it returns results as long as the data is less than 1 MB):
#set($ids = [])
#foreach($pResult in ${ctx.prev.result})
    #set($map = {})
    $util.qr($map.put("id", $util.dynamodb.toString($pResult.id)))
    $util.qr($map.put("ouId", $util.dynamodb.toString($pResult.ouId)))
    $util.qr($ids.add($map))
#end
{
    "version" : "2018-05-29",
    "operation" : "BatchGetItem",
    "tables" : {
        "dev-table-name": {
            "keys": $util.toJson($ids),
            "consistentRead": false
        }
    }
}
I contacted the AWS people who confirmed that ProjectionExpression is not supported currently and that it will be a while before they will get to it.
Instead, I created a lambda to pull the data from DynamoDB.
To limit the results from DynamoDB, I used $ctx.info.selectionSetList in AppSync to get the list of requested columns, then used that list to specify the data to pull from DynamoDB, roughly as sketched below. I needed to get multiple results while maintaining order, so I used BatchGetItem, then merged the results with the original list of IDs using LINQ (which put the DynamoDB results back in the correct order, since BatchGetItem in C# does not preserve sort order the way the AppSync version does).
Because I was using C# with a number of libraries, the cold start time was a little long, so I used Lambda Layers pre-JITed for Linux, which brought the cold start time down from ~1.8 seconds to ~1 second (when using 1,024 MB of memory for the Lambda).
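For reference, a minimal sketch of the request mapping template that invokes such a Lambda from the second pipeline function; the payload field names (keys, fields) are hypothetical, simply whatever the Lambda is written to read:
## Sketch: Lambda Invoke request mapping template for the second pipeline function
{
    "version": "2018-05-29",
    "operation": "Invoke",
    "payload": {
        ## DynamoDB keys returned by the ElasticSearch function
        "keys": $util.toJson($ctx.prev.result),
        ## only the columns actually requested in the GraphQL query
        "fields": $util.toJson($ctx.info.selectionSetList)
    }
}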
AppSync doesn't support projection, but you can explicitly define what fields to return in the response template instead of returning the entire result set.
{
    "id": "$ctx.result.get('id')",
    "name": "$ctx.result.get('name')",
    ...
}
I have two tables, cuts and holds. In response to an event in my application I want to be able to move the values from a hold entry with a given id across to the cuts table. The naïve way is to do an INSERT and then a DELETE.
How do I run multiple SQL statements in an AppSync resolver to achieve that result? I have tried the following (replacing sql with statements and turning it into an array) without success.
{
    "version" : "2017-02-28",
    "operation": "Invoke",
    #set($id = $util.autoId())
    "payload": {
        "statements": [
            "INSERT INTO cuts (id, rollId, length, reason, notes, orderId) SELECT '$id', rollId, length, reason, notes, orderId FROM holds WHERE id=:ID",
            "DELETE FROM holds WHERE id=:ID"
        ],
        "variableMapping": {
            ":ID": "$context.arguments.id"
        },
        "responseSQL": "SELECT * FROM cuts WHERE id = '$id'"
    }
}
If you're using the "AWS AppSync Using Amazon Aurora as a Data Source via AWS Lambda" sample found here: https://github.com/aws-samples/aws-appsync-rds-aurora-sample, you won't be able to send multiple statements in the sql field.
If you are using the AWS AppSync integration with the Aurora Serverless Data API, you can pass up to 2 statements in a statements array, as in the example below:
{
    "version": "2018-05-29",
    "statements": [
        "select * from Pets WHERE id='$ctx.args.input.id'",
        "delete from Pets WHERE id='$ctx.args.input.id'"
    ]
}
If you're using the "AWS AppSync Using Amazon Aurora as a Data Source via AWS Lambda" sample found here (https://github.com/aws-samples/aws-appsync-rds-aurora-sample), you will be able to do it as follows.
In the resolver, add "sql0" and "sql1" fields (you can name them whatever you want):
{
    "version" : "2017-02-28",
    "operation": "Invoke",
    #set($id = $util.autoId())
    "payload": {
        "sql": "INSERT INTO cuts (id, rollId, length, reason, notes, orderId)",
        "sql0": "SELECT '$id', rollId, length, reason, notes, orderId FROM holds WHERE id=:ID",
        "sql1": "DELETE FROM holds WHERE id=:ID",
        "variableMapping": {
            ":ID": "$context.arguments.id"
        },
        "responseSQL": "SELECT * FROM cuts WHERE id = '$id'"
    }
}
In your lambda, add the following piece of code:
if (event.sql0) {
    const inputSQL0 = populateAndSanitizeSQL(event.sql0, event.variableMapping, connection);
    await executeSQL(connection, inputSQL0);
}
if (event.sql1) {
    const inputSQL1 = populateAndSanitizeSQL(event.sql1, event.variableMapping, connection);
    await executeSQL(connection, inputSQL1);
}
With this approach, you can send as many SQL statements to your lambda as you want, and your lambda will execute them.
I am having issues finding good sources for / figuring out how to correctly add server-side validation to my AppSync GraphQL mutations.
In essence, I used the AWS console to define my AppSync schema, and so had DynamoDB tables created for me, plus some basic resolvers set up for the data.
Now I need to achieve the following:
I have a player who has an inventory and gold.
The player calls a purchaseItem mutation with an item_id.
Once this mutation is called, I need to perform some checks in the resolver, i.e. check whether the item_id exists in the associated 'Items' DynamoDB table, and check whether the player has enough gold (again, in the associated 'Players' table); if so, write to the Players table, adding the item to the player's inventory and the newly subtracted gold amount.
I believe the most efficient way to achieve this, resulting in the least cost and latency, is to use the "Apache Velocity" templating language for AppSync?
It would be great to see an example of this showing how to query/write to DynamoDB, handle errors, and resolve the mutation correctly.
For writing to DynamoDB with VTL, use the following tutorial.
You can start with the PutItem template. My request template looks like this:
{
    "version" : "2017-02-28",
    "operation" : "PutItem",
    "key" : {
        "noteId" : { "S" : "${context.arguments.noteId}" },
        "userId" : { "S" : "${context.identity.sub}" }
    },
    "attributeValues" : {
        "title" : { "S" : "${context.arguments.title}" },
        "content": { "S" : "${context.arguments.content}" }
    }
}
For query:
{
    "version" : "2017-02-28",
    "operation" : "Query",
    "query" : {
        ## Provide a query expression. **
        "expression": "userId = :userId",
        "expressionValues" : {
            ":userId" : {
                "S" : "${context.identity.sub}"
            }
        }
    },
    ## Add 'limit' and 'nextToken' arguments to this field in your schema to implement pagination. **
    "limit": #if(${context.arguments.limit}) ${context.arguments.limit} #else 20 #end,
    "nextToken": #if(${context.arguments.nextToken}) "${context.arguments.nextToken}" #else null #end
}
This is based on the Paginated Query template.
What you want to look at is Pipeline Resolvers:
https://docs.aws.amazon.com/appsync/latest/devguide/pipeline-resolvers.html
Yes, this requires VTL (Velocity Templates).
They allow you to perform reads, writes, validation, and anything else you'd like using VTL. What you basically do is chain the inputs and outputs from one function's template into the next and carry out the required steps.
Here's a Medium post showing you how to do it:
https://medium.com/@dabit3/intro-to-aws-appsync-pipeline-functions-3df87ceddac1
In other words, what you can do is:
Have one function that queries the database, then pipe the result into another function that validates it and performs the insert if the check succeeds, or fails the request otherwise.
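As a rough sketch of the validation step (Players, gold, and $ctx.stash.itemPrice are assumed names for this example, not anything defined above), the response mapping template of the function that fetches the player could abort the mutation like this:
## Sketch: response mapping template of the pipeline function that fetched the player
## (Players table, gold attribute, and $ctx.stash.itemPrice are assumed names)
#if($util.isNull($ctx.result))
    $util.error("Player not found", "ValidationError")
#end
#if($ctx.result.gold < $ctx.stash.itemPrice)
    $util.error("Not enough gold to purchase this item", "ValidationError")
#end
$util.toJson($ctx.result)
The next function in the pipeline can then run the UpdateItem that adds the item to the inventory and subtracts the gold.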
This might be a stupid question, but I really cannot find a way to do that.
So, I have DynamoDB tables and a schema in my AppSync API. In one table, each row has a field whose value is a list. How can I append multiple items to this list without replacing the existing items? How should I write the resolver for that mutation?
Here is the screenshot of my table:
And you can see there are multiple programs in the list.
How can I just append two more programs?
Here is a new screenshot of my resolver:
I want to add an existence check to the UpdateItem operation, but the current code does not work. The logic I want is to use the "contains" method to see whether the "toBeAddedProgramId" already exists. But the question is how to extract the current program id list from the User table, and how to make the program id list a "list" type (since the contains method only takes a String set or a String).
I hope this question makes sense. Thanks so much guys.
Best,
Harrison
To append items to a list, you should use the DynamoDB UpdateItem operation.
Here is an example if you're using DynamoDB directly.
In AWS AppSync, you can use the DynamoDB data source and specify the DynamoDB UpdateItem operation in your request mapping template.
Your UpdateItem request template could look like the following (modify it to serve your needs):
{
    "version" : "2017-02-28",
    "operation" : "UpdateItem",
    "key" : {
        "id" : { "S" : "${context.arguments.id}" }
    },
    "update" : {
        "expression" : "SET #progs = list_append(#progs, :vals)",
        "expressionNames": {
            "#progs" : "programs"
        },
        "expressionValues": {
            ":vals" : {
                "L": [
                    { "M" : { "id": { "S": "49f2c...." }}},
                    { "M" : { "id": { "S": "931db...." }}}
                ]
            }
        }
    }
}
We have a tutorial here that goes into more detail if you are interested in learning more.