AWS DMS CDC - Only capture changed values, not the entire record? (Source: RDS MySQL)

I have a DMS CDC (change data capture) task set up to stream changes from a MySQL database to a Kinesis stream that a Lambda is connected to.
I was hoping to ultimately receive only the value that has changed, not an entire dump of the row, so that I know which column changed (at the moment it's impossible to work this out without setting up another system to track changes myself).
For example, with the following mapping rule:
{
  "rule-type": "selection",
  "rule-id": "1",
  "rule-name": "1",
  "object-locator": {
    "schema-name": "my-schema",
    "table-name": "product"
  },
  "rule-action": "include",
  "filters": []
},
and if I changed the name property of a record on the product table, I would hope to receive a record like this:
{
  "data": {
    "name": "newValue"
  },
  "metadata": {
    "timestamp": "2021-07-26T06:47:15.762584Z",
    "record-type": "data",
    "operation": "update",
    "partition-key-type": "schema-table",
    "schema-name": "my-schema",
    "table-name": "product",
    "transaction-id": 8633730840
  }
}
However, what I actually receive is something like this:
{
  "data": {
    "name": "newValue",
    "id": "unchangedId",
    "quantity": "unchangedQuantity",
    "otherProperty": "unchangedValue"
  },
  "metadata": {
    "timestamp": "2021-07-26T06:47:15.762584Z",
    "record-type": "data",
    "operation": "update",
    "partition-key-type": "schema-table",
    "schema-name": "my-schema",
    "table-name": "product",
    "transaction-id": 8633730840
  }
}
As you can see, when receiving this it's impossible to tell which property has changed without setting up additional systems to track it.
I've found another Stack Overflow thread where someone posted an issue because their CDC task was doing exactly what I want mine to do. Can anyone point me in the right direction to achieve this?

I found the answer after digging into AWS documentation some more.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Kinesis.html#CHAP_Target.Kinesis.BeforeImage
Different source database engines provide different amounts of
information for a before image:
Oracle provides updates to columns only if they change.
PostgreSQL provides only data for columns that are part of the primary
key (changed or not).
MySQL generally provides data for all columns (changed or not).
I used the BeforeImageSettings task setting to include the original data with the payloads.
"BeforeImageSettings": {
"EnableBeforeImage": true,
"FieldName": "before-image",
"ColumnFilter": "all"
}
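For reference, a minimal sketch of enabling this on an existing task with boto3 (the task ARN is a placeholder, the task generally has to be stopped before its settings can be modified, and placing BeforeImageSettings at the top level of the task settings JSON is an assumption based on the snippet above):

import json
import boto3

# Sketch: enable before images on an existing DMS task.
dms = boto3.client("dms")
task_arn = "arn:aws:dms:us-east-1:123456789012:task:EXAMPLE"  # placeholder

# Fetch the current settings, merge the BeforeImageSettings block in,
# and write the full settings document back.
task = dms.describe_replication_tasks(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)["ReplicationTasks"][0]
settings = json.loads(task["ReplicationTaskSettings"])
settings["BeforeImageSettings"] = {
    "EnableBeforeImage": True,
    "FieldName": "before-image",
    "ColumnFilter": "all",
}

dms.modify_replication_task(
    ReplicationTaskArn=task_arn,
    ReplicationTaskSettings=json.dumps(settings),
)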
While this still gives me the whole record, it gives me enough data to work out what's changed without additional systems (see the sketch after the sample record below).
{
  "data": {
    "name": "newValue",
    "id": "unchangedId",
    "quantity": "unchangedQuantity",
    "otherProperty": "unchangedValue"
  },
  "before-image": {
    "name": "oldValue",
    "id": "unchangedId",
    "quantity": "unchangedQuantity",
    "otherProperty": "unchangedValue"
  },
  "metadata": {
    "timestamp": "2021-07-26T06:47:15.762584Z",
    "record-type": "data",
    "operation": "update",
    "partition-key-type": "schema-table",
    "schema-name": "my-schema",
    "table-name": "product",
    "transaction-id": 8633730840
  }
}
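With the before image in place, the Lambda consuming the Kinesis stream can work out which columns changed by diffing the two objects. A minimal sketch (the field names follow the sample payload above; Kinesis delivers the record data base64-encoded):

import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("metadata", {}).get("operation") != "update":
            continue
        after = payload.get("data", {})
        before = payload.get("before-image", {})
        # Columns whose value differs between the before image and the new row.
        changed = {
            column: {"old": before.get(column), "new": value}
            for column, value in after.items()
            if before.get(column) != value
        }
        print(json.dumps({"table": payload["metadata"]["table-name"], "changed": changed}))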

Related

DMS replication to Kinesis omit certain fields

We have a use case where we have enabled an AWS DMS replication task which streams changes from our Aurora Postgres cluster to a Kinesis Data Stream. The replication task is working as expected, but the data it's sending to the Kinesis Data Stream as JSON contains fields, like metadata, that we don't care about and would ideally like to omit. Is there a way to do this without triggering a Lambda on KDS to remove the unwanted fields from the JSON?
I was looking at the table mappings config of the DMS task when KDS is the target; documentation here: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Kinesis.html. The docs don't mention anything of this sort. Maybe I am missing something.
The current table mapping for my use case is as follows:
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "rule-action": "include",
      "object-locator": {
        "schema-name": "public",
        "table-name": "%"
      }
    },
    {
      "rule-type": "object-mapping",
      "rule-id": "2",
      "rule-name": "DefaultMapToKinesis",
      "rule-action": "map-record-to-record",
      "object-locator": {
        "schema-name": "public",
        "table-name": "testing"
      }
    }
  ]
}
The table testing has only two columns, namely id and value, of type varchar and decimal respectively.
The result I am getting in KDS is as follows:
{
  "data": {
    "id": "5",
    "value": 1111.22
  },
  "metadata": {
    "timestamp": "2022-08-23T09:32:34.222745Z",
    "record-type": "data",
    "operation": "insert",
    "partition-key-type": "schema-table",
    "schema-name": "public",
    "table-name": "testing",
    "transaction-id": 145524
  }
}
As seen above, we are only interested in the data key of the JSON.
Is there any way in the DMS config or KDS to filter on the data portion of the JSON sent by DMS without involving any new infra like a Lambda?

Amazon Textract getting around odd behavior with tables

I'm using Textract to parse table data in PDFs. Most of the pages are parsed out correctly, but the table on the last page has its columns split out in strange ways.
I've got a config-driven process that relies on the index of a column to determine where to get the needed values. It causes problems when the index for a value in one part of the table suddenly changes.
Here's a portion of the config file for this particular document:
"rows": {
"invoice_nbr": {
"index": 3,
"type": "string"
},
"date": {
"index": 5,
"type": "date"
},
"tx_type": {
"index": 1,
"type": "list",
"match_list": [
"INV",
"CRM",
"OAC",
"SVC",
"ADJ"
]
},
"po_nbr": {
"index": 4,
"type": "none"
},
"original_amount": {
"index": 6,
"type": "currency"
},
"amount": {
"index": 10,
"type": "currency"
}
}
I'm looking for some ideas about how to approach this problem. I've thought about making the config index an array and falling back to another index if the first one doesn't validate, but it could still potentially validate the wrong column, and I feel that could open things up to future headaches, as the code handles different configurations for different document layouts.
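For what it's worth, a rough sketch of that fallback idea: each config entry lists candidate indexes, and the first cell that passes the field's type check wins. The config shape and validators here are hypothetical, not the asker's actual code:

import re
from datetime import datetime

def _is_date(value):
    try:
        datetime.strptime(value.strip(), "%m/%d/%Y")
        return True
    except ValueError:
        return False

# Hypothetical validators keyed by the "type" values used in the config above.
VALIDATORS = {
    "date": _is_date,
    "currency": lambda v: bool(re.fullmatch(r"-?\$?[\d,]+\.\d{2}", v.strip())),
    "string": lambda v: bool(v.strip()),
}

def pick_cell(row_cells, field_config):
    """Return the first candidate cell whose value passes the field's type check."""
    candidates = field_config["index"]
    if isinstance(candidates, int):  # keep plain integer indexes working
        candidates = [candidates]
    validate = VALIDATORS.get(field_config["type"], lambda v: True)
    for idx in candidates:
        if idx < len(row_cells) and validate(row_cells[idx]):
            return row_cells[idx]
    return None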

List users as non admin with custom fields

As per the documentation, I should be able to get a list of users with a custom schema as long as the field in the schema has a value of ALL_DOMAIN_USERS in the readAccessType property. That is the exact setup I have in the admin console. Moreover, when I perform a GET request to the schema get endpoint for the schema in question, I get confirmation that the schema fields are set to ALL_DOMAIN_USERS in the readAccessType property.
The problem is that when I perform a users list request, I don't get the custom schema in the response. The request is the following:
GET /admin/directory/v1/users?customer=my_customer&projection=full&query=franc&viewType=domain_public HTTP/1.1
Host: www.googleapis.com
Content-length: 0
Authorization: Bearer fakeTokena0AfH6SMD6jF2DwJbgiDZ
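For reference, the equivalent call through the Python client library (google-api-python-client, using Application Default Credentials; the scope and credential setup are assumptions, not part of the original request):

import google.auth
from googleapiclient.discovery import build

# Assumes the environment is already authorized for the Directory API read-only scope.
creds, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/admin.directory.user.readonly"]
)
service = build("admin", "directory_v1", credentials=creds)

response = service.users().list(
    customer="my_customer",
    projection="full",        # should include customSchemas in the response
    query="franc",
    viewType="domain_public", # non-admin view
).execute()

for user in response.get("users", []):
    print(user.get("primaryEmail"), user.get("customSchemas"))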
The response I get back is the following:
{
  "nextPageToken": "tokenData",
  "kind": "admin#directory#users",
  "etag": "etagData",
  "users": [
    {
      "externalIds": [
        {
          "type": "organization",
          "value": "value"
        }
      ],
      "organizations": [
        {
          "department": "department",
          "customType": "",
          "name": "Name",
          "title": "Title"
        }
      ],
      "kind": "admin#directory#user",
      "name": {
        "fullName": "Full Name",
        "givenName": "Full",
        "familyName": "Name"
      },
      "phones": [
        {
          "type": "work",
          "value": "(999)999-9999"
        }
      ],
      "thumbnailPhotoUrl": "https://photolinkurl",
      "primaryEmail": "user#domain.com",
      "relations": [
        {
          "type": "manager",
          "value": "user#domain.com"
        }
      ],
      "emails": [
        {
          "primary": true,
          "address": "user#domain.com"
        }
      ],
      "etag": "etagData",
      "thumbnailPhotoEtag": "photoEtagData",
      "id": "xxxxxxxxxxxxxxxxxx",
      "addresses": [
        {
          "locality": "Locality",
          "region": "XX",
          "formatted": "999 Some St Some State 99999",
          "primary": true,
          "streetAddress": "999 Some St",
          "postalCode": "99999",
          "type": "work"
        }
      ]
    }
  ]
}
However, if I perform the same request with a super admin user, I get an extra property in the response:
"customSchemas": {
"Dir": {
"fieldOne": false,
"fieldTwo": "value",
"fieldThree": value
}
}
My understanding is that I should get the custom schema with a non-admin user as long as the custom schema fields are set to be visible by all domain users. This is not happening. I opened a support ticket with G Suite, but the person who provided "support" sent me in this direction. I believe this is a bug, or maybe I overlooked something.
I contacted G Suite support and, in fact, this issue is a domain-specific problem.
It took several weeks for the issue to be addressed by the support engineers at Google, but it was finally resolved. The behaviour is now the intended one.

JSONAPI: Update relationships including attributes

I have a nested object in my SQLAlchemy table, produced with Marshmallow's nested schema feature. For example, an articles object GET response would include an author (a User type) object along with it.
I know that the JSONAPI spec already allows updating relationships. However, often I would like to update an article as well with its nested objects in one call (POST requests of articles that include a new author will automatically create the author). Is it possible to make a PATCH request that includes the resources of a relationship object that doesn't yet exist?
So instead of just this:
PATCH /articles/1 HTTP/1.1
Content-Type: application/vnd.api+json
Accept: application/vnd.api+json
{
  "data": {
    "type": "articles",
    "id": "1",
    "relationships": {
      "author": {
        "data": { "type": "people", "id": "1" }
      }
    }
  }
}
It'd be ideal to pass this in to create a new author if one didn't already exist (this is not my actual use case, but I have a similar real-life need):
PATCH /articles/1 HTTP/1.1
Content-Type: application/vnd.api+json
Accept: application/vnd.api+json
{
  "data": {
    "type": "articles",
    "id": "1",
    "relationships": {
      "author": {
        "data": { "type": "people", "id": "1", "attributes": { "name": "new author", "articles_written": 1 } }
      }
    }
  }
}
Is this at all possible, are there suggestions on which REST frameworks can support this, or would it be totally against the JSON API spec?
Updating multiple resources at once is not possible with JSON API spec v1.0, but there are several suggestions for how to do that. Both official extensions, which are not supported anymore, aimed to support creating, updating, or deleting multiple resources in one request. Also, there is an open pull request which would introduce operations to the upcoming JSON API spec v1.2.
For example, a request updating two resources at once using the suggested operations[1] would look like this:
PATCH /bulk HTTP/1.1
Host: example.org
Content-Type: application/vnd.api+json
{
  "operations": [{
    "op": "update",
    "ref": {
      "type": "articles",
      "id": "1"
    },
    "data": {
      "type": "articles",
      "id": "1",
      "attributes": {
        "title": "Updated title of articles 1"
      }
    }
  }, {
    "op": "update",
    "ref": {
      "type": "people",
      "id": "2"
    },
    "data": {
      "type": "people",
      "id": "2",
      "attributes": {
        "name": "updated name of author 2"
      }
    }
  }]
}
This would update the title attribute of the article with id 1 and the name attribute of the person resource with id 2 in a single transaction.
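Assuming a server that implements the proposed operations extension at a /bulk endpoint, a client could send such a document like this (a sketch with the Python requests library; the host and payload mirror the example above):

import requests

operations_document = {
    "operations": [
        {
            "op": "update",
            "ref": {"type": "articles", "id": "1"},
            "data": {
                "type": "articles",
                "id": "1",
                "attributes": {"title": "Updated title of articles 1"},
            },
        },
        {
            "op": "update",
            "ref": {"type": "people", "id": "2"},
            "data": {
                "type": "people",
                "id": "2",
                "attributes": {"name": "updated name of author 2"},
            },
        },
    ]
}

response = requests.patch(
    "https://example.org/bulk",
    json=operations_document,
    headers={
        "Content-Type": "application/vnd.api+json",
        "Accept": "application/vnd.api+json",
    },
)
response.raise_for_status()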
A request to update a to-one relationship and update the related resource in a single transaction would look like this:
PATCH /bulk HTTP/1.1
Host: example.org
Content-Type: application/vnd.api+json
{
  "operations": [{
    "op": "update",
    "ref": {
      "type": "articles",
      "id": "1",
      "relationship": "author"
    },
    "data": {
      "type": "people",
      "id": "2"
    }
  }, {
    "op": "update",
    "ref": {
      "type": "people",
      "id": "2"
    },
    "data": {
      "type": "people",
      "id": "2",
      "attributes": {
        "name": "updated name of author 2"
      }
    }
  }]
}
This is a request which creates a new person and associates it as the author of an existing article with id 1:
PATCH /bulk HTTP/1.1
Host: example.org
Content-Type: application/vnd.api+json
{
  "operations": [{
    "op": "add",
    "ref": {
      "type": "people"
    },
    "data": {
      "type": "people",
      "lid": "a",
      "attributes": {
        "name": "name of new person"
      }
    }
  }, {
    "op": "update",
    "ref": {
      "type": "articles",
      "id": "1",
      "relationship": "author"
    },
    "data": {
      "type": "people",
      "lid": "a"
    }
  }]
}
Note that the order is important: operations must be performed by the server in order, so the resource has to be created before it can be associated with an existing one. To express the relationship we use a local id (lid).
Please note that operations should only be used if a request has to be performed transactionally. If each of the included operations could be executed atomically, singular requests should be used.
According to the implementation list provided on jsonapi.org, there are some libraries supporting the Patch extension.
Update: Operations were not included in the release candidate for JSON API v1.1; they are planned for v1.2 now. The author of the pull request, who is also one of the maintainers of the spec, said that "operations are now the highest priority" and that a release candidate for v1.2 including operations may ship "within a few months of 1.1 final."
[1] Introducing operations to JSON API v1.2 is currently only a suggestion and has not been merged. There may be breaking changes. Read the related pull request before implementing.

Delete record Libcloud (GoDaddy api)

I'm trying to implement the delete method for a Record (delete-record), but it's my first time using Python and this API.
The GoDaddy API doesn't have a delete record method, so this functionality is not exposed in the driver.
https://developer.godaddy.com/doc#!/_v1_domains/recordReplace
The driver could offer the 'replace records in zone' method, which would allow you to fetch the current list of records and then set the new list minus the record you want to remove. But that feature is not implemented, and it is quite risky.
First, send a GET request to https://api.godaddy.com/v1/domains/{DOMAIN}/records.
Then, enumerate over all records of the API response (a JSON array) and prepare the new data by removing the one that needs to be deleted.
API Response (SAMPLE)
[
  {
    "data": "192.168.1.1",
    "name": "#",
    "ttl": 600,
    "type": "A"
  },
  {
    "data": "ns1.example.com",
    "name": "#",
    "ttl": 3600,
    "type": "NS"
  },
  {
    "data": "#",
    "name": "www",
    "ttl": 3600,
    "type": "CNAME"
  },
  {
    "data": "mail.example.com",
    "name": "#",
    "ttl": 3600,
    "priority": 1,
    "type": "MX"
  }
]
New Data (After deleting record) (SAMPLE)
[
  {
    "data": "192.168.1.1",
    "name": "#",
    "ttl": 600,
    "type": "A"
  },
  {
    "data": "ns1.example.com",
    "name": "#",
    "ttl": 3600,
    "type": "NS"
  },
  {
    "data": "#",
    "name": "www",
    "ttl": 3600,
    "type": "CNAME"
  }
]
Now, send a PUT request to https://api.godaddy.com/v1/domains/{DOMAIN}/records with the new data.
The most important thing is how you identify the records in the above array which need to be deleted. This should not be a difficult task, assuming you have good programming skills.
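A rough sketch of that flow in Python with the requests library (the domain, credentials, and the record to drop are placeholders; the GoDaddy v1 API authenticates with an Authorization: sso-key header). Keep in mind the caveat above: this replaces the whole record set, so a mistake in the filter can wipe records you wanted to keep.

import requests

DOMAIN = "example.com"
API = f"https://api.godaddy.com/v1/domains/{DOMAIN}/records"
HEADERS = {"Authorization": "sso-key API_KEY:API_SECRET"}

# 1. Fetch the current records.
records = requests.get(API, headers=HEADERS).json()

# 2. Drop the record we want to delete (here: the CNAME named "www").
remaining = [
    r for r in records
    if not (r["type"] == "CNAME" and r["name"] == "www")
]

# 3. Replace the full record set with the filtered list.
resp = requests.put(API, json=remaining, headers=HEADERS)
resp.raise_for_status()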
I managed to work around it in kind of a hacky way - we had a bunch of records we wanted to delete, and doing it manually seemed weird, so I added a JavaScript function in the Chrome Developer Console, running on an authenticated session from the DNS management page:
function deleteGoDaddyRecords(recordId) {
  $.ajax({
    url: 'https://dcc.godaddy.com/api/v3/domains/<YOUR-DOMAIN.com>/records?recordId=' + recordId,
    type: 'DELETE',
    success: function(result) {
      console.log(result)
    }
  });
}
which lets me use the same call the UI makes when you ask to delete a record.
The only thing you need to provide is the AttributeUid, which is not available with the public API, but it is in the front-end API:
https://dcc.godaddy.com/api/v2/domains/runahr.com/records
So I managed to create a script that will generate a bunch of calls like
deleteGoDaddyRecords('<RECORD-UUID>');
deleteGoDaddyRecords('<RECORD-UUID>');
copy & paste the generated script into the Developers Console and that solved it for now.
I hope GoDaddy will add a public DELETE endpoint to their API in the future :)