Consider the following architecture:
write -> DynamoDB table -> stream -> Lambda -> write metadata item to same table
It could be used for many awesome situations, e.g. table- and item-level aggregations. I've seen this architecture promoted in several tech talks by AWS engineers.
But doesn't writing the metadata item add a new record to the stream and run the Lambda again?
How do I avoid an infinite loop? Is there a way to keep the metadata write from appearing in the stream?
Or is spending two stream and Lambda requests per write inevitable with this architecture (we're charged per request), i.e. exiting the Lambda function early if it's a metadata item?
As triggering an AWS Lambda function from a DynamoDB stream is a binary option (on/off), it's not possible to trigger the AWS Lambda function only for certain writes to the table. So your AWS Lambda function will be called again for the items it just wrote to the DynamoDB table. The important bit is to have logic in place in your AWS Lambda function to detect that it wrote that data itself and not to write again in that case. Otherwise you'd get the mentioned infinite loop, which would be a really unfortunate situation, especially if it went unnoticed.
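For example, a minimal sketch of that early-exit check (the "isMetadata" marker attribute and the process() helper are hypothetical illustrations, not part of any AWS API):

def lambda_handler(event, context):
    for record in event["Records"]:
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        # Skip records this function wrote itself, identified by a marker attribute
        if "isMetadata" in new_image:
            continue
        process(record)  # your real aggregation / metadata-writing logic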
Currently DynamoDB does not offer condition-based subscription to the stream, so yes, DynamoDB will execute your Lambda function in an infinite loop. At the moment the only solution is to limit the time your Lambda function executes. You can use multiple Lambda functions; one Lambda function would be there just to check whether a metadata item was written or not. I'm sharing a cloud architecture diagram of how you can achieve it.
A bit late but hopefully people looking for a more demonstrative answer will find this useful.
Suppose you want to process records where you add to an item up to a certain threshold; you could have an if condition that checks the threshold and either processes or skips the record, e.g.
This code assumes you have an attribute "Type" for each of your entities / object types (this was recommended to me by Rick Houlihan himself, but you could also check whether an attribute exists, i.e. "<your-attribute>" in record["dynamodb"]["NewImage"]) and that you are designing with PK and SK as generic partition and sort key names.
import os
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["TABLE_NAME"])  # table name taken from an environment variable
threshold = int(os.environ.get("THRESHOLD", "0"))

def get_value(pk):
    # Query the table for the item whose aggregate we are tracking
    response = table.query(KeyConditionExpression=Key("PK").eq(pk))
    items = response.get("Items", [])
    return items[0].get("<your-attribute>", 0) if items else 0

def your_aggregation_function(record):
    # Your aggregation logic here
    # Write back to the table with a put_item call once done
    pass

def lambda_handler(event, context):
    for record in event["Records"]:
        new_image = record["dynamodb"].get("NewImage", {})
        if record["eventName"] != "REMOVE" and new_image.get("Type", {}).get("S") == "<your-entity-type>":
            # Query the table to extract the attribute value
            attribute_value = get_value(record["dynamodb"]["Keys"]["PK"]["S"])
            if attribute_value < threshold:
                # Send to your aggregation function
                your_aggregation_function(record)
Having these conditions in place in the Lambda handler (or wherever else suits your needs) prevents the infinite loop mentioned above.
You may want additional checks in the update expression to make sure two (or more) concurrent Lambdas are not writing the same object. I suggest you use a timestamp defined in the Lambda and add it to the SK, or, if you can't, add an "EventDate" attribute to your item so that you can use a ConditionExpression or an UpdateExpression like SET if_not_exists(#attribute, :date).
The above will help guarantee that your Lambda is idempotent.
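As a rough illustration of that idempotent write (a sketch only; the table, key names, and the way "EventDate" is used here are assumptions based on the description above):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("YourTable")  # hypothetical table name

def write_aggregate_once(pk, sk, total, event_date):
    try:
        table.update_item(
            Key={"PK": pk, "SK": sk},
            # Record the aggregate and stamp the event date only if it is not already set
            UpdateExpression="SET #total = :total, EventDate = if_not_exists(EventDate, :date)",
            ConditionExpression="attribute_not_exists(EventDate) OR EventDate = :date",
            ExpressionAttributeNames={"#total": "Total"},
            ExpressionAttributeValues={":total": total, ":date": event_date},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # A concurrent invocation already wrote a different event date; skip this write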
Related
My model represents users with unique names. In order to achieve that, I store the user and its name as two separate items using TransactWriteItems. The approximate structure looks like this:
PK | data
--------------------------------
userId#<userId> | {user data}
userName#<userName> | {userId: <userId>}
Data arrives at a Lambda from a Kinesis stream. If one Lambda invocation processes an "insert" event and another Lambda request comes in at about the same time (the difference could be 5 milliseconds), the "update" event causes a TransactionConflictException: Transaction is ongoing for the item error.
Should I just retry the update a second or so later? I couldn't really find a resolution strategy.
That implies you're getting data about the same user in quick succession and both writes are hitting the same items. One succeeds while the other throws the exception.
Is it always duplicate data? If you’re sure it is, then you can ignore the second write. It would be a no-op.
Is it different data? Then you've got to decide how to handle that conflict. You'll have one dataset in the database and a different dataset live in your code. That's a business logic question, not a database question.
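If you do decide that a short retry fits your case, a minimal sketch might look like this (the helper name, the backoff values, and the assumption that the conflicting write is safe to repeat are all illustrative, not from the answer above):

import time
import boto3
from botocore.exceptions import ClientError

client = boto3.client("dynamodb")

def transact_write_with_retry(transact_items, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return client.transact_write_items(TransactItems=transact_items)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            # Retry only when we collided with an in-flight transaction on the same item
            if code not in ("TransactionConflictException", "TransactionCanceledException"):
                raise
            time.sleep(0.1 * (2 ** attempt))  # simple exponential backoff
    raise RuntimeError("Write still conflicting after retries")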
I have a DynamoDB-based web application that uses DynamoDB to store my large JSON objects and perform simple CRUD operations on them via a web API. I would like to add a new table that acts like a categorization of these values. The user should be able to select from a selection box which category the object belongs to. If a desirable category does not exist, the user should be able to create a new category specifying a name which will be available to other objects in the future.
It is critical to the application that every one of these categories be given an integer ID that increments, starting at 1 for the first one. These auto-generated numbers will turn into reproducible serial numbers for back-end reports that will not use the user-visible text name.
So I would like to have a simple API available from the web frontend that allows me to:
A) GET /category : produces { int : string, ... } of all categories mapped to an ID
B) PUSH /category : accepts string and stores the string to the next integer
Here are some ideas for how to handle this kind of project.
Store it in DynamoDB with integer indexes. This has some benefits but leaves a lot to be desired. Firstly, there's no auto-incrementing ID in DynamoDB, but I could definitely get the state of the table, create a new ID, and store the result. This might have issues with consistency and race conditions, but there's probably a way to achieve this safely. It might, however, be a big anti-pattern to use DynamoDB this way.
Store it in DynamoDB as one object in a table with some random index. Just store the mapping as a JSON object. This really ignores the notion of tables in DynamoDB and uses it as a simple file. It might also run into some issues with race conditions.
Use AWS ElastiCache to have a Redis key-value store. This might be "the right" decision, but the downside is that ElastiCache is an always-on DB offering where you pay per hour. For a low-traffic web site like mine I'd be paying a minimum of about $12/mo, I think, and I would really like this to be pay-per-access/update due to the low volume. I'm not sure there's an auto-increment feature for Redis built in the way I'd need it, but it's pretty trivial to make a transaction that gets the length of the table, adds one, and stores a new value. Race conditions are easily avoided with this solution.
Use a SQL database like AWS Aurora or MySQL. This has the same upsides as Redis, but it's even more overkill than Redis, it costs a lot more, and it's still always on.
Run my own in-memory web service, MongoDB, etc. Still, you're paying for constantly running containers. Writing my own thing is obviously silly, but I'm sure there are services that match this issue perfectly; they'd all require a constantly running container, though.
Is there a good way to just store a simple list or integer mapping like this that doesn't incur a constant monthly cost? Is there a better way to do this with DynamoDB?
Store the maxCounterValue as an item in DynamoDB.
For the PUSH /category, perform the following:
Get the current maxCounterValue.
TransactWrite:
Put the category name and id into a new item with id = maxCounterValue + 1.
Update maxCounterValue to maxCounterValue + 1, adding a ConditionExpression to check that maxCounterValue = :valueFromGetOperation.
If the TransactWrite fails, start again at step 1 and try X more times (see the sketch below).
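A rough boto3 sketch of those steps (the table name, key names, and the CATEGORY#COUNTER counter item are assumptions made for illustration):

import boto3
from botocore.exceptions import ClientError

client = boto3.client("dynamodb")
TABLE = "Categories"  # hypothetical table name

def push_category(name, max_attempts=3):
    for _ in range(max_attempts):
        # Step 1: read the current maxCounterValue (0 if it does not exist yet)
        counter = client.get_item(
            TableName=TABLE,
            Key={"PK": {"S": "CATEGORY#COUNTER"}},
            ConsistentRead=True,
        ).get("Item", {})
        current = int(counter.get("maxCounterValue", {"N": "0"})["N"])
        next_id = current + 1
        try:
            # Step 2: write the new category and bump the counter atomically
            client.transact_write_items(TransactItems=[
                {"Put": {
                    "TableName": TABLE,
                    "Item": {"PK": {"S": f"CATEGORY#{next_id}"}, "name": {"S": name}},
                    "ConditionExpression": "attribute_not_exists(PK)",
                }},
                {"Update": {
                    "TableName": TABLE,
                    "Key": {"PK": {"S": "CATEGORY#COUNTER"}},
                    "UpdateExpression": "SET maxCounterValue = :next",
                    "ConditionExpression": "attribute_not_exists(maxCounterValue) OR maxCounterValue = :current",
                    "ExpressionAttributeValues": {":next": {"N": str(next_id)},
                                                  ":current": {"N": str(current)}},
                }},
            ])
            return next_id
        except ClientError as err:
            if err.response["Error"]["Code"] != "TransactionCanceledException":
                raise
            # Another writer won the race; re-read the counter and retry
    raise RuntimeError("Could not allocate a category id after retries")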
I have an application (both Lambda & a microservice) that reads from DynamoDB streams.
Is it possible to define a timestamp from where the application starts reading the data?
Defining a timestamp is not a data access pattern for DynamoDB Streams.
Based on the documentation, the only available data access pattern is via shard identifiers and shard iterators.
There might, though, be a way to use the halving-interval (aka bisection) method to look up shard records by their ApproximateCreationDateTime.
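For what it's worth, here is a rough sketch of reading stream records and filtering on ApproximateCreationDateTime; it does a simple linear scan from TRIM_HORIZON rather than a true bisection, and the stream ARN and cutoff date are placeholders:

import boto3
from datetime import datetime, timezone

streams = boto3.client("dynamodbstreams")
STREAM_ARN = "arn:aws:dynamodb:<region>:<account>:table/YourTable/stream/<label>"  # placeholder

def records_since(cutoff):
    shards = streams.describe_stream(StreamArn=STREAM_ARN)["StreamDescription"]["Shards"]
    for shard in shards:
        iterator = streams.get_shard_iterator(
            StreamArn=STREAM_ARN,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        while iterator:
            page = streams.get_records(ShardIterator=iterator)
            for record in page["Records"]:
                # Drop anything created before the requested timestamp
                if record["dynamodb"]["ApproximateCreationDateTime"] >= cutoff:
                    yield record
            if not page["Records"]:
                break  # stop instead of polling an open shard forever
            iterator = page.get("NextShardIterator")

# e.g. records_since(datetime(2023, 1, 1, tzinfo=timezone.utc))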
After re-reading this question, I believe what you are asking for is a 'start location' from where the Lambda will start reading data.
The answer to this is no, because that is not how streams work. DynamoDB streams are not I/O streams of data into your Lambda, but rather batched events that are collected into a single JSON event sent to your Lambda when its conditions (number of events or time passed) are met. You have some options like TRIM_HORIZON and such that give you some control over which events are sent and where it "starts", but this is not a 'start in the middle of the stream' sort of operation. These are single JSON events sent as they are generated.
It really depends on your use case here, but I'm guessing you want to be able to add a bunch of items to DynamoDB and NOT have those trigger the Lambda, then at a certain point have new items begin to trigger the Lambda.
If this is the case, you have two options:
1) Add an attribute to the items you don't want to process. Have the Lambda check the event in the stream for that attribute and, if it finds it, ignore that event.
2) Use the SDK for your language to turn the stream trigger on and off (see the sketch below).
Option 1 is far less complicated, and probably the far better option.
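If you do go with option 2, a minimal boto3 sketch of toggling the trigger (the event source mapping UUID is a placeholder you would look up yourself):

import boto3

lambda_client = boto3.client("lambda")

# UUID of the stream's event source mapping; find it with list_event_source_mappings()
MAPPING_UUID = "your-event-source-mapping-uuid"  # placeholder

def set_stream_trigger(enabled):
    # Enables or disables the DynamoDB stream trigger without deleting it
    lambda_client.update_event_source_mapping(UUID=MAPPING_UUID, Enabled=enabled)

set_stream_trigger(False)   # pause processing before the bulk load
# ... bulk writes that should not trigger the Lambda ...
set_stream_trigger(True)    # resume processing afterwards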
I'm currently working on a project where I use DynamoDB as my NoSQL database. Before I started, I tried to learn how to model NoSQL databases, since they're really different from the relational databases we know. I learned that I should stick to the single-table model. I'm using DynamoDB Streams to aggregate some data, for instance the customer count for a product (there are some more complex cases than that). Since I have only one single table, my Lambda function writes to the same table the stream came from:
Add new customer to TableA -> DynamoDB Stream triggers Lambda for TableA -> Lambda updates a row in TableA -> DynamoDB Stream triggers Lambda for TableA -> Lambda function terminates.
If I understand it right, this scenario could lead to an infinite loop, since the first insert trigger triggers an update trigger. This is something I'm escaping in my Lambda function. My question now is: am I getting billed for two Lambda invocations and two DynamoDB stream reads each time I make an insert in my db?
If yes, should I ignore the best-practice way of a NoSQL db and split the table into multiple tables, or should I invest the money? Because I am doubling my bill in this case. What are the cons of splitting my table into multiple tables? Would the effect of those cons be that big?
In general I would advise against such loops. In distributed systems, loops of dependencies can result in various kinds of undesired effects. In an architecture based on DDB + Lambda some of these effects are eliminated, but I would still try to avoid it. On top of that, in such an architecture there is also the risk of an infinite loop of Lambda invocations, which could come at a significant financial price. Even if you write your best code (and even implement safety checks against such loops), I think the best course of action is to design the architecture such that loops are not possible.
Thus, I would definitely split the DDB table into two.
The negative side of splitting into two tables is this: instead of retrieving just a single item your application may now need to retrieve two items (from the two tables) to get the same amount of information.
Is there a way to make a DynamoDB update return both the old and the new values?
something like:
updateItemSpec
.withPrimaryKey("id", id)
.withUpdateExpression(myUpdateExpression)
.withNameMap(nameMap)
.withValueMap(valueMap)
.withReturnValues("UPDATED_NEW, UPDATED_OLD");
There isn't.
It should be easy for you to simulate this by returning UPDATED_OLD. You already have the new values, since you set them in the update; so request the updated old values and use your value map to work out the new values.
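For example, a boto3 sketch of that approach (the question uses the Java Document API, but the idea is the same; the table, key, and attribute names here are made up):

import boto3

table = boto3.resource("dynamodb").Table("YourTable")  # hypothetical table name

def update_and_get_both(item_id, new_status):
    response = table.update_item(
        Key={"id": item_id},
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":s": new_status},
        ReturnValues="UPDATED_OLD",  # DynamoDB returns only the previous values
    )
    old_values = response.get("Attributes", {})
    new_values = {"status": new_status}  # you already know these: you just set them
    return old_values, new_values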
Depending on where you want to use the data: if you don't need it in the body of code where you update a DynamoDB record, you can capture table activity using DynamoDB Streams. You can configure an AWS Lambda trigger on the table so it invokes the Lambda when a specified event occurs, passing this event (in our case, the stream record) to the Lambda. From there, depending on how you have set up the stream, you can access the old and new versions of the record.
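With the stream view type set to NEW_AND_OLD_IMAGES, the handler sees something like this (a sketch; the handler body is purely illustrative):

def lambda_handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "MODIFY":
            old_image = record["dynamodb"].get("OldImage", {})
            new_image = record["dynamodb"].get("NewImage", {})
            # Both the previous and updated versions of the item are available here
            print("before:", old_image)
            print("after:", new_image)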