AWS DMS with AWS MSK (Kafka) CDC transactional changes - amazon-web-services

I'm going to use AWS Database Migration Service (DMS) with AWS MSK (Kafka).
I'd like to send all changes within the same transaction to the same partition of the Kafka topic, in order to guarantee correct message order (referential integrity).
For this purpose I'm going to enable the following property:
IncludeTransactionDetails – Provides detailed transaction information from the source database. This information includes a commit timestamp, a log position, and values for transaction_id, previous_transaction_id, and transaction_record_id (the record offset within a transaction). The default is false. (Source: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Kafka.html)
Also, as I may see from the same documentation:
AWS DMS supports the following two forms for partition keys:
1. SchemaName.TableName: A combination of the schema and table name.
2. ${AttributeName}: The value of one of the fields in the JSON, or the primary key of the table in the source database.
My question: in the case of 'IncludeTransactionDetails = true', will I be able to use 'transaction_id' from the event JSON as the partition key for the MSK (Kafka) migration topic?

The documentation says you can define a partition key to group the data:
"You also define a partition key for each table, which Apache Kafka uses to group the data into its partitions."

Related

AWS DMS Removing LOB Columns

I'm trying to set up a PostgreSQL migration using DMS with S3 as the target. But after running it I noticed that some tables were missing some columns.
After checking the logs I noticed this message:
Column 'column_name' was removed from table definition 'schema.table': the column data type is LOB and the table has no primary key or unique index
In the migration task settings I tried to increase the LOB limit by setting the option Maximum LOB size to 2000000, but I'm still getting the same result.
Does anyone know a workaround for this problem?
I guess the problem is that you do not have a primary key on your table.
From AWS documentation:
Currently, a table must have a primary key for AWS DMS to capture LOB changes. If a table that contains LOBs doesn't have a primary key, there are several actions you can take to capture LOB changes:
- Add a primary key to the table. This can be as simple as adding an ID column and populating it with a sequence using a trigger.
- Create a materialized view of the table that includes a system-generated ID as the primary key and migrate the materialized view rather than the table.
- Create a logical standby, add a primary key to the table, and migrate from the logical standby.
It is also important that the primary key is of a simple type, not a LOB:
In FULL LOB or LIMITED LOB mode, AWS DMS doesn't support replication of primary keys that are LOB data types.
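For reference (this is not part of the original answer): the "Maximum LOB size" option mentioned in the question lives under TargetMetadata in the DMS task settings JSON. A rough boto3 sketch, assuming a placeholder task ARN, is below; note that per the quoted documentation, adjusting it will not bring back LOB columns on a table without a primary key or unique index.

import boto3
import json

# Hypothetical sketch: put a stopped DMS task into limited LOB mode and raise
# the maximum LOB size (LobMaxSize is in KB). The task ARN is a placeholder.
dms = boto3.client("dms")

task_settings = {
    "TargetMetadata": {
        "SupportLobs": True,
        "FullLobMode": False,       # limited LOB mode
        "LimitedSizeLobMode": True,
        "LobMaxSize": 2000,         # KB; size it to the largest LOB you expect
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps(task_settings),
)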

DynamoDB update_item at the same time

I have a DynamoDB table called events.
The table schema is:
partition_key : <user_id>
sort_key : <month>
attributes: [<list of user events>]
I opened 3 terminals and ran the update_item command at the same time for the same partition_key and sort_key.
Question:
How does DynamoDB work in this case?
Will DynamoDB follow an approach like FIFO?
Or will DynamoDB perform the update_item operations in parallel for the same partition key and sort key?
Can someone tell me how DynamoDB works?
How DynamoDB works is explained in the excellent AWS presentation:
AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database
The part relevant to your question starts at the 6:46 mark, where they talk about storage leader nodes. When you put or update the same item, your requests go to a single, specific storage leader node responsible for the partition where the item lives. This means that all your concurrent updates end up at that single node. The node will probably (it is not explicitly stated) queue the requests, presumably in a similar way as for global tables, discussed at 51:58, which is "last writer wins" based on timestamp.
There are other questions discussing similar topics, e.g. here.
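Not from the answer, but to make the behaviour concrete: a common pattern when several writers touch the same item is an update_item with list_append, so each request is applied as an atomic read-modify-write at the leader node rather than a blind overwrite. A minimal sketch, reusing the key names from the question (the attribute name and event payload are made up):

import boto3

# Hypothetical sketch: three terminals running this concurrently will have
# their appends applied one after another by the partition's leader node.
table = boto3.resource("dynamodb").Table("events")

table.update_item(
    Key={"partition_key": "user-123", "sort_key": "2024-01"},  # placeholder values
    # Append the new event, creating the list first if it doesn't exist yet.
    UpdateExpression="SET #ev = list_append(if_not_exists(#ev, :empty), :new)",
    ExpressionAttributeNames={"#ev": "events"},
    ExpressionAttributeValues={
        ":empty": [],
        ":new": [{"type": "login", "at": "2024-01-15T10:00:00Z"}],  # sample event
    },
)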

Is there any mechanism to send notifications for a large set of queries when a matching item is added to the table

I have an AWS DynamoDB table with a set of documents, and all data is replicated to an Elasticsearch cluster using an AWS Lambda function. My web site offers a feature to search the documents of that DynamoDB table using the Elasticsearch cluster.
My requirement is that users should be able to subscribe to a search query and get email updates daily or weekly (or instantly) about newly added documents matching their query. Based on the data set, the expected number of user queries is around 1,000,000, and around 100 new documents can be added. (In this case there is no need for full-text search; simple matching is acceptable.)
Is there any known technology for fulfilling this requirement? I am currently using Node.js + AWS Lambda + DynamoDB + Elasticsearch, and I am also ready to change technology if required.
Current database details:
Entity: Book
Fields:
name: string
price: number
category: string array
I can't run all 1,000,000 queries every time a new document is added to the database. Is there any efficient technology to do that?

DynamoDB table/index schema design for querying multi-valued attributes

I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
email: "foo#foo.com",
... other attributes ...
}
When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.
Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?
Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "main", // sort key
email: "foo#foo.com",
... other attributes ...
}
If the user adds a second, third, etc. email, then add a new item for each email, like this:
{
userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
itemTypeAndIndex: "Email-2", // sort key
email: "bar#bar.com"
// no more attributes
}
The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.
If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this will be safer than before!)
If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users then we'd have to merge all items for that userId.
The same approach (new items with the same userId but different sort keys) could be used for other one-user-has-many-values data that needs to be Query-able.
Is this a good way to do it? Is there a better way?
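As an aside (not part of the original question): the "swap the email values" step in the design above maps naturally onto a single transact_write_items call, so the primary and non-primary items can never end up half-swapped. A minimal sketch with placeholder table name, key values, and email addresses:

import boto3

# Hypothetical sketch: atomically swap the primary and secondary email
# between the "main" item and the "Email-2" item for one userId.
# Table name, userId, and email values are placeholders.
dynamodb = boto3.client("dynamodb")

user_id = "08074c7e0c0a4453b3c723685021d0b6"

dynamodb.transact_write_items(
    TransactItems=[
        {
            "Update": {
                "TableName": "Users",
                "Key": {"userId": {"S": user_id}, "itemTypeAndIndex": {"S": "main"}},
                "UpdateExpression": "SET #e = :new_primary",
                "ExpressionAttributeNames": {"#e": "email"},
                "ExpressionAttributeValues": {":new_primary": {"S": "bar@bar.com"}},
            }
        },
        {
            "Update": {
                "TableName": "Users",
                "Key": {"userId": {"S": user_id}, "itemTypeAndIndex": {"S": "Email-2"}},
                "UpdateExpression": "SET #e = :old_primary",
                "ExpressionAttributeNames": {"#e": "email"},
                "ExpressionAttributeValues": {":old_primary": {"S": "foo@foo.com"}},
            }
        },
    ]
)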
Justin, for searching on attributes I would strongly advise against using DynamoDB. I am not saying you can't achieve this; however, I see a few problems that will eventually come your way if you go this route.
Using a sort key on the email ID will result in duplicate records for the same user, i.e. if a user has registered 5 emails, that implies 5 records in your table with the same schema and attributes except for the email attribute.
What if a new use case comes along in the future where you also want to search for a user based on some other attribute (for example a cell phone number, assuming a user may have more than one cell phone number)?
DynamoDB limits the number of secondary indexes you can create per table (the default quota is now 20 global secondary indexes; it was 5 for a long time).
Thus, as the search criteria grow, this solution will easily become a bottleneck for your system. As a result, your system may not scale well.
To the best of my knowledge, I can suggest a few options that you may choose from, based on your requirements/budget, to address this problem using a combination of databases.
Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]
Store the user records in a DynamoDB table (let's call it UserTable) as and when a user registers.
Enable DynamoDB Streams on the UserTable table.
Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch.
Now in your application, use DynamoDB for fetching user records by id. For all other search criteria (like searching on emailId, phone number, zip code, location, etc.) fetch the records from AWS Elasticsearch. AWS Elasticsearch by default indexes all the attributes of your record, so you can search on any field with millisecond latency.
Option 2. Use AWS Aurora [Less preferred solution]
If your application has a relational use case where the data is related, you may consider this option. Just to call out: Aurora is a SQL database.
Since this is relational storage, you can organize the records in multiple tables and join them based on the primary keys of those tables.
I suggest the 1st option because:
DynamoDB will provide you with durable, highly available, low-latency primary storage for your application.
AWS Elasticsearch will act as secondary storage, which is also durable, scalable and low-latency.
With AWS Elasticsearch, you can run any search query over your data. You can also do analytics on the data. A Kibana UI is provided out of the box, which you can use to plot the analytical data on a dashboard (how user growth is trending, how many users belong to a specific location, user distribution based on city/state/country, etc.).
With DynamoDB Streams and AWS Lambda, you will keep these two databases in sync in near real time [within a few milliseconds].
Your application will be scalable and the search feature can further be enhanced to do filtering on multi-level attributes. [One such example: search all users who belong to a given city]
Having said that, now I will leave this up to you to decide. 😊
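To make Option 1 slightly more concrete, here is a rough sketch of the sync Lambda that reads the UserTable stream and indexes records into Elasticsearch. The endpoint URL, index name, and the unauthenticated HTTP call are assumptions for illustration; a real setup needs authentication (e.g. SigV4) and error handling.

import json
import urllib.request

# Hypothetical sketch of the Option 1 sync Lambda: for every INSERT/MODIFY
# event in the UserTable stream, index the new image into Elasticsearch.
# ES_ENDPOINT and the index name are placeholders.
ES_ENDPOINT = "https://search-example.us-east-1.es.amazonaws.com"

def handler(event, context):
    for record in event.get("Records", []):
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]
        # Naive flattening of the DynamoDB attribute-value map,
        # e.g. {"email": {"S": "foo@foo.com"}} -> {"email": "foo@foo.com"}
        doc = {key: list(value.values())[0] for key, value in new_image.items()}
        req = urllib.request.Request(
            url=f"{ES_ENDPOINT}/users/_doc/{doc['userId']}",
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)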

How to update multiple columns of DynamoDB using the AWS IoT rules engine

I have a set of data: id, name, height and weight.
I am sending this data to AWS IoT in JSON format. From there I need to update the respective columns in a DynamoDB table, so I have created 3 rules to update name, height and weight, keeping id as the partition key.
But when I send a message, only one column gets updated. If I disable any 2 rules then the remaining rule works fine. So every time I update, the columns are getting overwritten.
How can I update all three columns from the incoming message?
Another answer: in your rule, use the "dynamoDBv2" action instead, which "allows you to write all or part of an MQTT message to a DynamoDB table. Each attribute in the payload is written to a separate column in the DynamoDB database ..."
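As an illustration of that suggestion (not taken from the answer itself), a topic rule using the dynamoDBv2 action could be created roughly like this; the rule name, SQL topic filter, table name and role ARN are all placeholders:

import boto3

# Hypothetical sketch: an IoT topic rule using the dynamoDBv2 action, which
# writes each attribute of the JSON payload (id, name, height, weight)
# to its own column. The id attribute must match the table's partition key.
iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="store_measurements",  # placeholder
    topicRulePayload={
        "sql": "SELECT id, name, height, weight FROM 'devices/measurements'",  # placeholder topic
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "dynamoDBv2": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-dynamodb-role",  # placeholder
                    "putItem": {"tableName": "measurements"},  # placeholder
                }
            }
        ],
    },
)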
The answer is: You can't do this with the IoT gateway rules themselves. You can only store data in a single column through the rules (apart from the hash and sort key).
A way around this is to create a Lambda rule which invokes, for example, a Python function that takes the message and stores it in the table. See also this other SO question.
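And a rough sketch of that Lambda workaround, assuming the rule forwards the raw JSON payload to the function (table and attribute names are placeholders):

import boto3

# Hypothetical sketch of the Lambda-rule workaround: the IoT rule invokes this
# function with the message payload, and all columns are written in a single
# put_item call. Table and attribute names are placeholders; non-integer
# numbers would need to be converted to Decimal for boto3.
table = boto3.resource("dynamodb").Table("measurements")

def handler(event, context):
    # event is the JSON message sent to AWS IoT, e.g.
    # {"id": "device-1", "name": "sensor-a", "height": 180, "weight": 75}
    table.put_item(
        Item={
            "id": event["id"],  # partition key
            "name": event["name"],
            "height": event["height"],
            "weight": event["weight"],
        }
    )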