I have a DynamoDB table "My_Table" with an index on "lock_status":
{
my_pk: string,
lock_status: string,
useful_data: string,
}
Is it possible for two different threads to execute the update code below on the same record?
Essentially, I want exactly one thread to have access to any given record's "useful_data". To do this, I'm "locking" the record via lockStatus while a thread is working with the item. What I'm afraid of is that two threads execute this code at the same time, both find the same record based on the "ConditionExpression", and both lock it.
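const client = new AWS.DynamoDB.DocumentClient();
return await client.update({
  TableName: 'My_Table',
  // update needs the key of the item to lock (Limit is not an update parameter);
  // pk stands for the my_pk value of the record chosen beforehand.
  Key: { my_pk: pk },
  UpdateExpression: 'set lockStatus = :status_locked',
  // Succeed only while the record is still available.
  ConditionExpression: 'lockStatus = :status_available',
  ExpressionAttributeValues: {
    ':status_locked': 'LOCKED',
    ':status_available': 'AVAILABLE',
  },
  ReturnValues: 'ALL_NEW',
}).promise();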
It seems possible to avoid this problem with TransactWriteItems, but can I get away with a simple update for my simple scenario?
Edit
Allow me to add a little context so that things make more sense. I'm building a "library" of reusable user accounts for testing. A test would "check out" and "check in" the user account. While the account is checked out, I want to prevent all other tests from using the same account.
One piece of information I neglected to mention in my original post was that I'm first getting the My_Table data by getting the next not locked item. Something like this:
const client = new AWS.DynamoDB.DocumentClient();
return await client.query({
  TableName: 'My_Table',
  IndexName: 'LOCK_STATUS_INDEX',
  Limit: 1,
  KeyConditionExpression: 'lockStatus = :status_available',
  ExpressionAttributeValues: { ':status_available': 'AVAILABLE' }
}).promise();
Then in my subsequent update call, I'm locking the row as mentioned in my original post.
As @maurice suggested, I looked into optimistic locking. As a matter of fact, this article perfectly describes the scenario that I'm facing.
However, there is a problem that I will likely run into under high load. The problem goes something like this:
1. 10 threads come in and ask for the next unlocked record. All 10 threads get the same record. (This is very possible, since all I'm doing is Limit 1 and DynamoDB will likely return the first record it runs across, which would be the same for all threads.)
2. 10 threads try to update the same record with a given version number. One thread succeeds in the update and the rest fail.
3. The 9 failed threads retry and go back to step 1. (Worst case, more threads are added.)
I'm starting to think that my design is flawed, or perhaps DynamoDB is not the right technology. Any help with this problem would be useful.
You could use optimistic locking for this - the idea is fairly simple.
You create a version attribute for your item that's an integer which will be incremented.
{
pk: 123
sk: 123
version: 0
randomValue: abc
}
When you read the item to update it, you note the current version number. When you update the item, you also increment the version number. So if you wanted to update the random value, the item you'll write to DynamoDB would look like this:
{
pk: 123
sk: 123
version: 1
randomValue: newValue
}
You now add a condition expression to your UpdateItem or PutItem call to ensure that it only succeeds when the current version of that item is still 0.
That way the call will fail if somebody else updated the item while you were processing it, and you can read it again, update it and write again.
If the call succeeds, you know that nobody else has modified the item in the meantime.
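As a rough illustration of the condition described above, the following TypeScript sketch builds DocumentClient-style update parameters for the example item. The table name 'My_Table' and the helper name buildVersionedUpdate are assumptions made for this example; only the pk, sk, version and randomValue attributes come from the item shown earlier.

```typescript
// Shape of the example item from the answer.
interface VersionedItem {
  pk: number;
  sk: number;
  version: number;
  randomValue: string;
}

// Builds update parameters that only succeed if the stored version still
// matches the version we read earlier. 'My_Table' is a placeholder name.
function buildVersionedUpdate(current: VersionedItem, newValue: string) {
  return {
    TableName: 'My_Table',
    Key: { pk: current.pk, sk: current.sk },
    // Write the new value and bump the version in the same atomic update.
    UpdateExpression: 'SET randomValue = :newValue, #version = #version + :one',
    // Fails the write if somebody else changed the version in the meantime.
    ConditionExpression: '#version = :expectedVersion',
    ExpressionAttributeNames: { '#version': 'version' },
    ExpressionAttributeValues: {
      ':newValue': newValue,
      ':one': 1,
      ':expectedVersion': current.version,
    },
    ReturnValues: 'ALL_NEW',
  };
}

const versionedUpdate = buildVersionedUpdate(
  { pk: 123, sk: 123, version: 0, randomValue: 'abc' },
  'newValue',
);
console.log(versionedUpdate.ConditionExpression);
```

These parameters would be passed to an update call; when the condition does not hold, DynamoDB rejects the write with a ConditionalCheckFailedException, which is the signal to re-read the item and try again.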
I also wrote a more detailed blog post about this if you're curious: link
Related
As far as I can tell, there's no identifier being passed with the GT worker metadata (see below, from the documentation at https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-output.html). How would I link this information back to the actual labeling task?
I believe sub is a Cognito reference to the worker, so it is not a unique identifier for the submission. As of right now, I just know that one of the tasks took a certain amount of time for a particular worker, but I can't tell which one. I also guess I have to jump through a few hoops via Cognito to get the GT worker id from the sub?
I am looking for a way to summarize the original data shown (from the input manifest file), the label given, and the time it took to complete. As of right now, I have to make one table that has the data with their human-submitted labels, and a separate table with the time it took to complete each task, but I have no way to link the two... am I missing something?
here's the worker metadata json:
"submissionTime": "2020-12-28T18:59:58.321Z",
"acceptanceTime": "2020-12-28T18:59:15.191Z",
"timeSpentInSeconds": 40.543,
"workerId": "a12b3cdefg4h5i67",
"workerMetadata": {
"identityData": {
"identityProviderType": "Cognito",
"issuer": "https://cognito-idp.aws-region.amazonaws.com/aws-region_123456789",
"sub": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
}
}
Hi everyone,
I'm a little bit lost with a problem while thinking in a DDD way.
Imagine you have an application that sells concert tickets. You have an entity called Concert with the number of available tickets and a method to buy a ticket.
class Concert {
  constructor(
    public id: string,
    public name: string,
    public ticketQuantity: number,
  ) {}
  buyTicket() {
    this.ticketQuantity = this.ticketQuantity - 1;
  }
}
The command looks like this:
async execute(command: BookConcertCommand): Promise<void> {
  const concert = await this.concertRepository.findById(command.concertId);
  concert.buyTicket();
  await this.concertRepository.save(concert);
}
Imagine your application has to handle a lot of users, and 1000 users try to buy a ticket at the same time when the ticketQuantity is 500.
How can you ensure the invariant of the quantity can't be lower than 0?
How can you deal with concurrency here, given that the data can end up wrong even if only two users try to buy a ticket at the same time?
What patterns can we use to ensure consistency under concurrency?
Optimistic or pessimistic concurrency can't be a solution because it will frustrate a lot of users, and we try to put all our domain logic into our domain, so we can't put any logic inside SQL/the DB or use a transaction script approach.
How can you ensure the invariant of the quantity can't be lower than 0
You include logic in your domain model that only assigns a ticket if at least one unassigned ticket is available.
You include locking (either optimistic or pessimistic) to ensure "first writer wins" -- the loser(s) in a data race should abort or retry.
If your book of record was just data in memory, then you would ensure that all attempts to buy tickets for concert 12345 must first acquire the same lock. In effect, you serialize the requests so that the business logic is running one at a time only.
If your book of record was a relational database, then within the context of each transaction you might perform a "select for update" to get a copy of the current data, and perform the update in the same transaction. The database will raise its flavor of concurrent modification exception to the connections that lose the race.
Alternatively, you use something like the semantics of a conditional-write / compare and swap: you get an unlocked copy of the concert from the book of record, make your changes, then send an "update this record if it still looks like the unlocked copy" message. If you get the response announcing you've won the race, congratulations - you're done. If not, you retry or fail.
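To make the conditional-write idea a bit more concrete, here is a minimal TypeScript sketch. It keeps the invariant inside the Concert entity and uses an in-memory repository as a stand-in for the book of record. The version field, the InMemoryConcertRepository class and the bookTicket function are assumptions made for this illustration; they are not part of the original model.

```typescript
class Concert {
  constructor(
    public readonly id: string,
    public readonly name: string,
    public ticketQuantity: number,
    public readonly version: number = 0, // hypothetical field used for compare-and-swap
  ) {}

  // The invariant lives in the domain model: never sell below zero.
  buyTicket(): void {
    if (this.ticketQuantity <= 0) throw new Error('Sold out');
    this.ticketQuantity -= 1;
  }
}

interface ConcertRow { name: string; ticketQuantity: number; version: number; }

// In-memory stand-in for the book of record.
class InMemoryConcertRepository {
  private rows: Record<string, ConcertRow> = {};

  add(concert: Concert): void {
    this.rows[concert.id] = { name: concert.name, ticketQuantity: concert.ticketQuantity, version: concert.version };
  }

  // Returns an unlocked copy of the current state.
  findById(id: string): Concert {
    const row = this.rows[id];
    if (row === undefined) throw new Error(`Unknown concert ${id}`);
    return new Concert(id, row.name, row.ticketQuantity, row.version);
  }

  // Compare-and-swap: write only if the row still has the version we read.
  save(concert: Concert): boolean {
    const row = this.rows[concert.id];
    if (row === undefined || row.version !== concert.version) return false;
    this.rows[concert.id] = { name: concert.name, ticketQuantity: concert.ticketQuantity, version: row.version + 1 };
    return true;
  }
}

// Retries when the race is lost; lets the "Sold out" error propagate.
function bookTicket(repository: InMemoryConcertRepository, concertId: string, maxAttempts = 3): boolean {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const concert = repository.findById(concertId);
    concert.buyTicket();
    if (repository.save(concert)) return true;
  }
  return false;
}
```

With a real database, save would map to a conditional update on the version column or attribute, and the loser of a race would simply see save return false and re-read.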
Optimistic or pessimistic concurrency can't be a solution because it will frustrate a lot of users
Of course it can.
If the concert is overbooked, they are going to be frustrated anyway.
The business logic doesn't have to run synchronously with the request. It might be acceptable to write down that they want a ticket, and then contact them asynchronously to let them know that a ticket has been assigned to them.
It may be helpful to review some of Udi Dahan's writing on collaborative and competitive domains; for instance, this piece from 2011.
In a collaborative domain, an inherent property of the domain is that multiple actors operate in parallel on the same set of data. A reservation system for concerts would be a good example of a collaborative domain – everyone wants the “good seats” (although it might be better call that competitive rather than collaborative, it is effectively the same principle).
You might be following these steps:
1- ReserveRequested -> ReserveRequestAccepted -> TicketReserved
2- ReserveRequested -> ReserveRequestRejected
When somebody clicks the buy ticket button, you create a reserve request entity, and then you process the reservation in the background through a queue.
On the user side, you can return a unique reserve request id that can be used to check the result of the process. The frontend then polls for the result periodically until the request succeeds or fails.
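The flow above could be sketched roughly as follows in TypeScript, with an in-memory queue standing in for a real queue system. The class and method names (ReservationDesk, requestReservation, processNext, getStatus) are made up for this illustration, and the intermediate ReserveRequestAccepted state is skipped to keep it short.

```typescript
type ReserveStatus = 'ReserveRequested' | 'TicketReserved' | 'ReserveRequestRejected';

class ReservationDesk {
  private queue: string[] = [];
  private statuses: Record<string, ReserveStatus> = {};
  private nextId = 1;

  constructor(private ticketsLeft: number) {}

  // Called when the user clicks "buy": record the request and return its id at once.
  requestReservation(): string {
    const requestId = `req-${this.nextId++}`;
    this.statuses[requestId] = 'ReserveRequested';
    this.queue.push(requestId);
    return requestId;
  }

  // Background worker: handles one request at a time, so two requests
  // never compete for the same ticket. Returns false when the queue is empty.
  processNext(): boolean {
    const requestId = this.queue.shift();
    if (requestId === undefined) return false;
    if (this.ticketsLeft > 0) {
      this.ticketsLeft -= 1;
      this.statuses[requestId] = 'TicketReserved';
    } else {
      this.statuses[requestId] = 'ReserveRequestRejected';
    }
    return true;
  }

  // What the frontend polls with the request id.
  getStatus(requestId: string): ReserveStatus | undefined {
    return this.statuses[requestId];
  }
}
```

Because a single consumer drains the queue in order, the ticket count is only ever changed by one piece of code at a time, which is what removes the race.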
I have a scenario: query the list of students in a school, by year, and then use that information to do some other tasks, let's say printing a certificate for each student.
I'm using the serverless framework to deal with that scenario with this Lambda:
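const queryStudent = async (_school_id, _year) => {
  // Assumes dynamoClient is a DocumentClient, so plain JavaScript values can be bound.
  var params = {
    TableName: `schoolTable`,
    // Placeholders are bound below; bare variable names are not interpolated into the expression.
    KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: { ':school_id': _school_id, ':year': _year },
  };
  try {
    let _students = [];
    let items;
    do {
      items = await dynamoClient.query(params).promise();
      // Append each page; a plain assignment would keep only the last page.
      _students = _students.concat(items.Items);
      params.ExclusiveStartKey = items.LastEvaluatedKey;
    } while (typeof items.LastEvaluatedKey != 'undefined');
    return _students;
  } catch (e) {
    console.log('Error: ', e);
  }
};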
const mainHandler = async (event, context) => {
…
let students = await queryStudent(body.school_id, body.year);
await printCerificate(students)
…
}
So far, it’s working well with about 5k students (just sample data).
My concern: is this a scalable way to query large amounts of data in DynamoDB?
As far as I know, Lambda has a limited execution time. If the number of students goes up to a million, does the above solution still work?
Any best-practice approach for this scenario would be much appreciated.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means that they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use things like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather with the data.
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) in 15 minutes, and if it fails before it's completely through, you lose track of how far you got. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check if it exists on new Lambda invocations and start from there.
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function you do the querying. This is a solution that won't scale if it's a synchronous process and you have a million students. If stuff fails, you also have to consider retries and build that logic in your code.
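The partition key suffix mentioned under "Hot Partition" might look something like the TypeScript sketch below. The shard count, the "#" separator, the hash function and all function names are assumptions chosen for illustration; the right number of shards depends on the actual traffic.

```typescript
// Number of shards per school is a tuning choice (assumption).
const SHARD_COUNT = 4;

// Deterministic string hash so a student always maps to the same shard.
function hashString(value: string): number {
  let hash = 0;
  for (let i = 0; i < value.length; i++) {
    hash = (hash * 31 + value.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Write side: partition key with a shard suffix, in the form "<schoolId>#<shard>".
function shardedPartitionKey(schoolId: string, studentId: string): string {
  return `${schoolId}#${hashString(studentId) % SHARD_COUNT}`;
}

// Read side: every key that has to be queried (possibly in parallel) and
// whose results are merged afterwards, i.e. the scatter-gather step.
function allPartitionKeys(schoolId: string): string[] {
  const keys: string[] = [];
  for (let shard = 0; shard < SHARD_COUNT; shard++) {
    keys.push(`${schoolId}#${shard}`);
  }
  return keys;
}
```

Writes spread over SHARD_COUNT item collections instead of one, at the cost of one query per shard when reading all students of a school.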
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across the sharded partition keys and writes each student into an SQS queue for certificate-printing tasks. If the Lambda function notices that it's close to the runtime limit, it returns the LastEvaluatedKey; the Step Function recognizes this and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
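The checkpointing part of that design could be sketched along these lines in TypeScript. The page fetcher, the item handler and the clock are passed in as hypothetical functions so that the example does not depend on the AWS SDK; only the Items / LastEvaluatedKey shape mirrors a query response.

```typescript
// One page of results, mirroring the Items / LastEvaluatedKey shape of a query response.
interface Page<T, K> {
  Items: T[];
  LastEvaluatedKey?: K;
}

// What the function hands back to its caller (e.g. the Step Function).
interface Checkpoint<K> {
  done: boolean;
  lastEvaluatedKey?: K;
}

// Pages through a query until it is finished or the time budget runs low.
// fetchPage, handleItems and remainingMs are injected, hypothetical helpers.
async function queryWithCheckpoint<T, K>(
  fetchPage: (startKey?: K) => Promise<Page<T, K>>,
  handleItems: (items: T[]) => Promise<void>,
  remainingMs: () => number,
  safetyMarginMs: number,
  startKey?: K,
): Promise<Checkpoint<K>> {
  let key = startKey;
  do {
    const page = await fetchPage(key);
    await handleItems(page.Items); // e.g. send the students to the queue
    key = page.LastEvaluatedKey;
    // Stop early and report where to resume if time is running out.
    if (key !== undefined && remainingMs() < safetyMarginMs) {
      return { done: false, lastEvaluatedKey: key };
    }
  } while (key !== undefined);
  return { done: true };
}
```

In a Lambda function, remainingMs could be backed by context.getRemainingTimeInMillis(), and the returned checkpoint would become the input of the next invocation.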
Implementing this requires more effort, so I'd first figure out whether a million students per school per year is a realistic number :-)
I have a list of Lambda worker functions (say 1000), each running simultaneously and doing its job. To be able to figure out the end result of all workers I have come up with this idea.
Before starting the job and spawning the Lambda worker functions, I save a record in DynamoDB, for example two attributes:
total_number_of_jobs
jobs_completed (set initially to 0)
When each Lambda worker function finishes, it increments the attribute jobs_completed by one. It then reads the record and checks whether total_number_of_jobs equals jobs_completed; if it does, it puts a record in SQS.
My questions are:
Is this a good idea?
Would the updates be consistent and atomic? Could there be any race conditions?
Any better solution than this?
I would update the counter, jobs_completed, in an UpdateItem API call like this:
SET jobs_completed = jobs_completed + :incr_by where incr_by would be equal to 1.
As long as you use DynamoDB atomic counters, like your example shows, and you check the return value of the UpdateItem call instead of running a separate query, then your proposed solution should work fine.
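A small TypeScript sketch of that approach follows, assuming DocumentClient-style parameters. The table name 'Jobs', the key attribute 'job_id' and both function names are placeholders invented for this example; the attribute names jobs_completed and total_number_of_jobs come from the question.

```typescript
// Builds parameters for an atomic increment of jobs_completed.
// 'Jobs' and 'job_id' are placeholder names (assumptions).
function buildIncrementParams(jobId: string) {
  return {
    TableName: 'Jobs',
    Key: { job_id: jobId },
    UpdateExpression: 'SET jobs_completed = jobs_completed + :incr_by',
    ExpressionAttributeValues: { ':incr_by': 1 },
    // Return the item as it looks right after this increment,
    // so no separate read is needed.
    ReturnValues: 'ALL_NEW',
  };
}

interface JobCounters {
  total_number_of_jobs: number;
  jobs_completed: number;
}

// Decide from the attributes returned by the update itself whether
// this worker was the last one to finish.
function isLastWorker(attributes: JobCounters): boolean {
  return attributes.jobs_completed === attributes.total_number_of_jobs;
}
```

Since updates to a single item are applied one after another, only one worker should see the counter reach the total in its own update response, provided each worker increments exactly once; that worker is the one that puts the record in SQS.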
In my application I have the concept of a Draw, and that Draw has to always be contained within an Order.
A Draw has a set of attributes: background_color, font_size, ...
Quoting the famous REST thesis:
Any information that can be named can be a resource: a document or
image, a temporal service (e.g. "today's weather in Los Angeles"), a
collection of other resources, a non-virtual object (e.g. a person),
and so on.
So, my collection of other resources here would be an Order. An Order is a set of Draws (usually thousands or more). I want to let the User create an Order with several Draws, and here is my first approach:
{
  "order": {
    "background_color": "rgb(255,255,255)",
    "font_size": 10,
    "draws_attributes": [
      { "background_color": "rgb(0,0,0)", "font_size": 14 },
      { "other_attribute": "value" }
    ]
  }
}
A response to this would look like this:
{
  "order": {
    "id": 30,
    "draws": [
      { "id": 4 },
      { "id": 5 }
    ]
  }
}
So the User would know which resources have been created in the DB. However, when a request contains many draws, the response takes a while, because every draw is inserted in the DB first. Imagine doing 10,000 inserts if an Order has 10,000 draws.
I need to give the User the IDs of the draws that were just created so they can fetch them later. (By the way, the draws are created but not finished at that point: we only build the actual Draw with some image manipulation libraries when the Order is processed.) I fail to see how to do this in a RESTful way without making the HTTP request take a long time.
How do you deal with this kind of situation?
Accept the request wholesale, queue the processing, return a status URL that represents the state of the request. When the request is finished processing, present a url that represents the results of the request. Then, poll.
POST /submitOrder
202 Accepted
Location: http://host.com/orderstatus/1234
GET /orderstatus/1234
200
{ "status": "PROCESSING", "msg": "Request still processing" }
...
GET /orderstatus/1234
200
{ "status": "COMPLETED", "msg": "Request completed", "rel": "http://host.com/orderresults/3456" }
Addenda:
Well, there are a few options.
1) They can wait for the result to process and get the IDs when it's done, just like now. The difference with what I suggested is that the state of the network connection is not tied to the success or failure of the transaction.
2) You can pre-assign the order ids before hitting the database, and return those to the caller. But be aware that those resources do not exist yet (and they won't until the processing is completed).
3) Speed up your system to where the timeout is simply not an issue.
I think your exposed granularity is too fine - does the user need to be able to modify each Draw separately? If not, then present a document that represents an Order, and that contains naturally the Draws.
Will you need to query specific Draws from the database based on specific criteria that are unrelated to the Order? If not, then represent all the Draws as a single blob that is part of a row that represents the Order.