I have a DynamoDB table called events.
The table schema is:
partition_key : <user_id>
sort_key : <month>
attributes: [<list of user events>]
I opened 3 terminals and ran the update_item command at the same time for the same partition_key and sort_key.
Question:
How does DynamoDB work in this case?
Does DynamoDB follow an approach like FIFO?
OR
Does DynamoDB perform the update_item operations in parallel for the same partition key and sort key?
Can someone tell me how DynamoDB handles this?
How DynamoDb works is explained in the excellent AWS presentation:
AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database
The part relevant to your question is at around 6:46, where they talk about storage leader nodes. When you put or update the same item, your requests go to a single, specific storage leader node responsible for the partition where the item lives. This means that all your concurrent updates end up on that one node. The node will probably (this is not explicitly stated) queue the requests, presumably in a similar way to global tables, discussed at 51:58, where conflicts are resolved as "last writer wins" based on timestamp.
There are other questions discussing similar topics, e.g. here.
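If the goal is for all three terminals' events to survive regardless of arrival order, one option is to let DynamoDB do the append inside the update itself instead of reading the list, modifying it in client code and writing it back. Here is a minimal sketch of that idea. The attribute names user_id, month and events are assumptions based on the schema in the question, and table is assumed to be a boto3 Table resource (or anything exposing a compatible update_item method).

```python
from typing import Any, Dict


def build_append_event_update(user_id: str, month: str, event: Dict[str, Any]) -> Dict[str, Any]:
    """Build update_item arguments that append one event to the item's list.

    Attribute names (user_id, month, events) are assumed from the question.
    """
    return {
        "Key": {"user_id": user_id, "month": month},
        # list_append and if_not_exists are DynamoDB update-expression functions.
        # if_not_exists seeds an empty list the first time the item is written.
        "UpdateExpression": "SET #ev = list_append(if_not_exists(#ev, :empty), :new)",
        # A placeholder avoids any clash with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#ev": "events"},
        "ExpressionAttributeValues": {":empty": [], ":new": [event]},
        "ReturnValues": "UPDATED_NEW",
    }


def append_event(table: Any, user_id: str, month: str, event: Dict[str, Any]) -> Any:
    """Send the append to DynamoDB through a Table-like object."""
    return table.update_item(**build_append_event_update(user_id, month, event))
```

Each update to a single item is applied atomically, so with this form the order of the appended events may vary between runs, but none of the concurrent appends should overwrite another.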
I'm going to use AWS Database Migration Service (DMS) with Amazon MSK (Kafka).
I'd like to send all changes within the same transaction to the same partition of a Kafka topic, in order to guarantee correct message order (referential integrity).
For this purpose I'm going to enable the following property:
IncludeTransactionDetails – Provides detailed transaction information from the source database. This information includes a commit timestamp, a log position, and values for transaction_id, previous_transaction_id, and transaction_record_id (the record offset within a transaction). The default is false. https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Kafka.html
Also, as I can see from the same documentation:
AWS DMS supports the following two forms for partition keys:
1. SchemaName.TableName: A combination of the schema and table name.
2. ${AttributeName}: The value of one of the fields in the JSON, or the primary key of the table in the source database.
My question: with 'IncludeTransactionDetails = true', will I be able to use 'transaction_id' from the event JSON as the partition key for the MSK (Kafka) migration topic?
The documentation says you can define a partition key to group the data:
"You also define a partition key for each table, which Apache Kafka uses to group the data into its partitions"
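To illustrate what "group the data into its partitions" means for ordering, here is a small sketch of key-based partitioning. It does not confirm that DMS accepts transaction_id as a partition key. It only shows why records that share a key land in the same partition and keep their relative order there. The crc32 hash is a stand-in chosen for the example (Kafka's default partitioner uses a murmur2 hash of the key bytes), and the message shape is hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. crc32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def group_by_partition(
    messages: List[Dict[str, Any]], num_partitions: int
) -> Dict[int, List[Dict[str, Any]]]:
    """Assign each message to a partition by its (hypothetical) transaction_id field."""
    partitions: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for message in messages:
        # Messages with the same key always hash to the same partition,
        # so their relative order is preserved inside that partition.
        partitions[partition_for(message["transaction_id"], num_partitions)].append(message)
    return dict(partitions)
```

Records from different transactions may still share a partition, which is fine for ordering within a transaction. Ordering across partitions is not guaranteed.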
In my use case, I need to periodically update a DynamoDB table (about once per day). Since lots of entries need to be inserted, deleted or modified, I plan to drop the old table and create a new one each time.
How can I keep the table queryable while I recreate it? Which API should I use? It's fine if queries keep hitting the old table in the meantime, so that customers don't experience any outage.
Is it possible to have something like a version number for the table so that I can roll back quickly?
I would suggest giving each table a name with a version suffix (some people use a date, others use a version number).
Store the usable DynamoDB table name in a configuration store (if you are not already using one, you could use Secrets Manager, SSM Parameter Store, another DynamoDB table, a Redis cluster or a third party solution such as Consul).
Automate the creation and insertion of data into a new DynamoDB table. Then update the config store with the name of the newly created DynamoDB table. Allow enough time to switchover, then remove the previous DynamoDB table.
You could automate the final part with Step Functions, using a Wait state of a few hours to ensure that nothing is still using the old table. You could even add a Lambda function that validates whether any traffic is still hitting the old DynamoDB table.
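The switchover described above can be sketched roughly as follows. The InMemoryConfigStore class is a hypothetical stand-in for whichever config store you pick (SSM Parameter Store, Secrets Manager, and so on), and the key names and helper functions are made up for illustration. Only the pointer swap is shown, not the table creation or data load.

```python
from typing import Dict, Optional


class InMemoryConfigStore:
    """Hypothetical stand-in for the real config store (e.g. SSM Parameter Store)."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._values.get(key)

    def put(self, key: str, value: str) -> None:
        self._values[key] = value


# Illustrative key names, not an AWS convention.
ACTIVE_KEY = "events/active-table"
PREVIOUS_KEY = "events/previous-table"


def versioned_table_name(base: str, version: str) -> str:
    """Build a table name from a shared base and a date or version suffix."""
    return f"{base}-{version}"


def promote(store: InMemoryConfigStore, new_table: str) -> Optional[str]:
    """Point readers at new_table and remember the old one for rollback."""
    previous = store.get(ACTIVE_KEY)
    if previous is not None:
        store.put(PREVIOUS_KEY, previous)
    store.put(ACTIVE_KEY, new_table)
    return previous


def rollback(store: InMemoryConfigStore) -> str:
    """Swap the active and previous table names."""
    previous = store.get(PREVIOUS_KEY)
    if previous is None:
        raise RuntimeError("no previous table recorded")
    current = store.get(ACTIVE_KEY)
    store.put(ACTIVE_KEY, previous)
    if current is not None:
        store.put(PREVIOUS_KEY, current)
    return previous
```

Readers look up ACTIVE_KEY before querying, so a rollback is just another pointer update, provided the previous table has not been deleted yet.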
I want to process recent updates on a DynamoDB table and save them in another one. Let's say I get updates from an IoT device irregularly put in Table1, and I need to use the N last updates to compute an update in Table2 for the same device in sync with the original updates (kind of a sliding window).
DynamoDB Triggers (Streams + Lambda) seem quite appropriate for my needs, but I did not find a clear definition of TRIM_HORIZON. From some docs I understand that it is the oldest data in Table1 (which can get huge), but from other docs it seems that it is 24h. Or maybe it is the oldest record in the stream, which is at most 24h old?
Does anyone know the truth about TRIM_HORIZON? Would it even be possible to configure it?
The alternative I see is not to use TRIM_HORIZON, but rather to use LATEST and perform a query on Table1. But that sort of defeats the purpose of streams.
Here are the relevant aspects for you, from DynamoDB's documentation (1 and 2):
All data in DynamoDB Streams is subject to a 24 hour lifetime. You can
retrieve and analyze the last 24 hours of activity for any given table
TRIM_HORIZON - Start reading at the last (untrimmed) stream record,
which is the oldest record in the shard. In DynamoDB Streams, there is
a 24 hour limit on data retention. Stream records whose age exceeds
this limit are subject to removal (trimming) from the stream.
So, if you have a Lambda that is continuously processing stream updates, I'd suggest going with LATEST.
Also, since you "need to use the N last updates to compute an update in Table2", you will have to query Table1 for every update, so that you can 'merge' the current update with the previous ones for that device. I don't think you can get around that by using TRIM_HORIZON either.
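A rough sketch of that flow, under a few assumptions: Table1 has a string partition key named device_id and a sort key that orders updates by time, the "computation" is a plain average, and fetch_last_n and write_result are hypothetical callables you would back with a Query on Table1 and a write to Table2. The event shape (Records, eventName, dynamodb.Keys) is the one Lambda receives from DynamoDB Streams.

```python
from typing import Any, Callable, Dict, List


def build_last_n_query(table_name: str, device_id: str, n: int) -> Dict[str, Any]:
    """Arguments for a low-level client query() returning the newest n items.

    Assumes the partition key is a string attribute named device_id.
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": "device_id = :d",
        "ExpressionAttributeValues": {":d": {"S": device_id}},
        "ScanIndexForward": False,  # descending sort-key order: newest first
        "Limit": n,
    }


def handle_stream_event(
    event: Dict[str, Any],
    fetch_last_n: Callable[[str, int], List[float]],
    write_result: Callable[[str, float], None],
    n: int = 5,
) -> int:
    """Recompute the sliding-window value for each inserted or modified item."""
    processed = 0
    for record in event.get("Records", []):
        # Deletions (REMOVE) do not contribute a new update.
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        device_id = record["dynamodb"]["Keys"]["device_id"]["S"]
        window = fetch_last_n(device_id, n)
        if not window:
            continue
        write_result(device_id, sum(window) / len(window))
        processed += 1
    return processed
```

With LATEST as the starting position, the stream only tells you that something changed. The window itself always comes from the query.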
I'm trying to figure out how to model the following data in AWS DynamoDB table.
I have a lot of IoT devices, each of which sends telemetry data every few seconds.
Attributes
device_id
timestamp
malware_name
company_name
action_performed (two possible values)
Queries
Show all incidents that happened in the last week.
Show all incidents for a specific device_id.
Show all incidents with action "unable_to_remove".
Show all incidents related to a specific malware.
Show all incidents related to a specific company.
Thoughts
I understand that I can add GSIs for each attribute, but I would like to use GSIs only if there is no other choice, as they cost me more money.
What would be the primary key (partition key and sort key)?
Please share your thoughts. I care about them more than about the perfect answer, as I'm trying to learn how to think and what to consider rather than get an answer to one specific question.
Thanks a lot !
If you absolutely need the query patterns mentioned, you have no way out but to create a GSI for each. That too has its set of caveats:
For query #1, your GSI would have incident_date (or whatever) as partition key and device_id as sort key. Depending on your access patterns, this might lead to hot partitions in DynamoDB.
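There is a quota on GSIs per table (5 when this was written; the default quota has since been raised to 20), and these access patterns already use up a good part of it. What will you do if you need to support another kind of query in the future?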
While evaluating pros and cons of using NoSQL for a given situation, one needs to consider both read and write access patterns. So, the question you should ask is, why DynamoDB?
For example, do you really need real-time queries? If not, you can use DynamoDB as the main database and periodically sync data (using AWS Lambda or Kinesis Firehose) to EMR or Redshift for later batch processing.
Edit: Proposed primary key:
device_id as partition key and incident_date as sort key, if you know that no two incidents for a given device_id can arrive at exactly the same time.
If the above doesn't work, then incident_id as partition key and incident_date as sort key.
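As a sketch of what the first proposal buys you, here are the arguments for a query that returns one device's incidents from the last week without any GSI. It assumes incident_date is stored as a fixed-width ISO-8601 string in UTC (so string order matches time order) and that the call goes through the low-level boto3 client's query(). The table name is whatever you choose.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def build_device_incidents_query(
    table_name: str, device_id: str, now: datetime, days: int = 7
) -> Dict[str, Any]:
    """Arguments for query(): incidents of one device in the last `days` days.

    Assumes partition key device_id (string) and sort key incident_date,
    stored as an ISO-8601 string with second precision.
    """
    start = now - timedelta(days=days)
    return {
        "TableName": table_name,
        "KeyConditionExpression": (
            "device_id = :d AND incident_date BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":d": {"S": device_id},
            # timespec="seconds" keeps every value the same width.
            ":start": {"S": start.isoformat(timespec="seconds")},
            ":end": {"S": now.isoformat(timespec="seconds")},
        },
    }
```

This key serves "all incidents for a specific device_id" directly, optionally narrowed by time. "All incidents in the last week" across every device is not covered by it, which is where the GSI (or a batch system) discussed above comes in.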
I have a data pipeline that runs every hour, using a HiveCopyActivity to select the past hour's data from DynamoDB into S3. The table I'm selecting from has a hash key VisitorID and a range key Timestamp, holds around 4 million rows and is 7.5GB in size. To reduce the time taken for the job, I created a global secondary index on Timestamp, but after monitoring CloudWatch, it seems that HiveCopyActivity doesn't use the index. I've read through all the relevant AWS documentation but can't find any mention of indexes.
Is there a way to force data pipeline to use an index while filtering like this? If not, are there any alternative applications which could transfer hourly (or any other period) data from DynamoDB to S3?
The DynamoDB EMR Hive adapter currently doesn't support using indexes, unfortunately. You would need to write your own sweeper that scans the index and outputs it to S3. You can check out https://github.com/awslabs/dynamodb-import-export-tool for some basics on implementing the import/export pipe. The library is essentially a parallel scan framework for sweeping DDB tables.
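For a feel of what such a sweeper involves, here is a minimal sketch of one worker in a parallel scan over an index. It is not taken from the linked tool. client is assumed to be a boto3 DynamoDB client (or anything with a compatible scan method), and the table and index names are placeholders. Writing the items to S3 is left out.

```python
from typing import Any, Dict, Iterator, Optional


def build_scan_kwargs(
    table_name: str,
    index_name: str,
    segment: int,
    total_segments: int,
    start_key: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Arguments for one page of a segmented Scan against an index."""
    kwargs: Dict[str, Any] = {
        "TableName": table_name,
        "IndexName": index_name,
        "Segment": segment,  # this worker's slice, 0-based
        "TotalSegments": total_segments,  # number of parallel workers
    }
    if start_key is not None:
        kwargs["ExclusiveStartKey"] = start_key
    return kwargs


def sweep_segment(
    client: Any, table_name: str, index_name: str, segment: int, total_segments: int
) -> Iterator[Dict[str, Any]]:
    """Yield every item in one segment, following pagination to the end."""
    start_key: Optional[Dict[str, Any]] = None
    while True:
        page = client.scan(
            **build_scan_kwargs(table_name, index_name, segment, total_segments, start_key)
        )
        yield from page.get("Items", [])
        # LastEvaluatedKey is absent once the segment is exhausted.
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
```

Running one such generator per segment (for example in separate threads or processes) covers the whole index. Note that adding a FilterExpression for the past hour would only trim the output, because filters are applied after the items are read.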