I've read the article.
The article describes the following solution for situations where many users can write to the same DB.
You as a user need to:
1. Retrieve the row and the last modified dateTime of the row.
2. Make the calculations you want, but don't write anything to the DB yet.
3. After the calculations, just before you want to write the result to the DB, retrieve the last modified dateTime of the same row again.
4. Compare the dateTime of #1 to the dateTime of #3.
   If they are equal, everything is OK: commit, and write the current time as the last modified dateTime of the row.
   Otherwise another user was here: roll back.
This process seems logical, BUT I see the following hole in it:
In #3 the user retrieves the last modified dateTime of the row, but what if, between the reading of this dateTime (in #3) and the time of writing in #4, another user comes in, writes their data and leaves? The first user can never know about it and will overwrite the second user's data.
Isn't that possible?
The algorithm you describe does indeed leave a window for missing concurrent updates between steps #3 and #4.
The part about testing for optimistic concurrency violations says:
When an update is attempted, the timestamp value in the database is
compared to the original timestamp value contained in the modified
row. If they match, the update is performed and the timestamp column
is updated with the current time to reflect the update. If they do not
match, an optimistic concurrency violation has occurred.
Although not mentioned explicitly, the idea is for the compare-and-update step to happen atomically on the server. This can be done with an UPDATE statement whose WHERE clause compares the timestamp column to its original value. It is similar to the example mentioned in the article, where the WHERE clause checks that all the original column values in a row still match those found in the database.
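The atomic compare-and-update step can be sketched as follows. This is a minimal sketch using Python's built-in sqlite3 module; the items table, its columns, and the update_if_unchanged helper are assumptions made for illustration and do not come from the article.

```python
import sqlite3


def update_if_unchanged(conn, row_id, new_value, original_modified, now):
    """Write new_value only if last_modified still equals original_modified.

    The comparison and the write happen in a single UPDATE statement, so no
    other writer can slip in between them. Returns True if the row was updated.
    """
    cur = conn.execute(
        "UPDATE items SET value = ?, last_modified = ? "
        "WHERE id = ? AND last_modified = ?",
        (new_value, now, row_id, original_modified),
    )
    conn.commit()
    # rowcount is 0 when the WHERE clause matched nothing, i.e. someone
    # else changed the row after we read it.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT, last_modified INTEGER)"
)
conn.execute("INSERT INTO items VALUES (1, 'initial', 100)")
conn.commit()

# Both users read the row while last_modified is 100.
first_ok = update_if_unchanged(conn, 1, "from user A", 100, now=101)
# User B still holds the stale value 100, so the WHERE clause matches no row.
second_ok = update_if_unchanged(conn, 1, "from user B", 100, now=102)
stored_value = conn.execute("SELECT value FROM items WHERE id = 1").fetchone()[0]
```

When the helper returns False, the caller knows an optimistic concurrency violation occurred and can re-read the row and retry, or report the conflict.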
I'm using Django 2.2 and my question is: does transaction.atomic roll back increments to a pk sequence?
Below is the background bug I wrote up that led me to this issue
I'm facing a really weird issue that I can't figure out and I'm hoping someone has faced a similar issue.
An insert using the django ORM .create() function is returning django.db.utils.IntegrityError: duplicate key value violates unique constraint "my_table_pkey" DETAIL: Key (id)=(5795) already exists.
Fine. But then I look at the table and no record with id=5795 exists!
SELECT * from my_table where id=5795;
shows (0 rows)
A look at the sequence my_table_id_seq shows that it has nonetheless incremented to last_value = 5795, as if the above record had been inserted. Moreover, the issue does not always occur: a successful insert with different data goes in at id=5796. (I tried resetting the pk sequence, but that didn't do anything, since it doesn't seem to be the problem anyway.)
I'm quite stumped by this, and it has caused us a lot of issues on one specific table. I finally realized that the call is wrapped in transaction.atomic and that a particular scenario may be causing a double insert with the same pk.
So my theory is: The transaction atomic is not rolling back the increment of the
Postgres sequences do not roll back. Every time a statement draws a value from a sequence, the sequence advances, whether or not the statement succeeds. For more information, see the Notes section of the PostgreSQL CREATE SEQUENCE documentation.
Here are my tables:
Table1
Id (String, composite PK partition key)
IdTwo (String, composite PK sort key)
Table2
IdTwo (String, simple PK partition key)
Timestamp (Number)
I want to PutItem in Table1 only if IdTwo does not exist in Table2 or the item in Table2 with the same IdTwo has Timestamp less than the current time (can be given as outside input).
The simple approach I know would work is:
1. GetItem on Table2 with ConsistentRead=true. If the item exists and its Timestamp is not less than the current time, exit early.
2. PutItem on Table1.
However, this is two network calls to DDB. I'd prefer optimizing it, like using TransactWriteItems which is one network call. Is it possible for my use case?
If you want to share code, I'd prefer Go, but any language is fine.
First off, the operation you're looking for is TransactWriteItems - https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TransactWriteItems.html
This is the API operation that lets you do atomic, transactional conditional writes. There are two parts to your question; I'm not sure they can be done together, but then they might not need to be.
The first part, inserting into Table1 if a condition is met in Table2, is simple enough: you add the item you want in Table1 in the Put section of the API call, and phrase the existence check for Table2 in the ConditionCheck section.
You can't do multiple checks right now, so the check to see if the timestamp is lower than the current time is a separate operation, also in a ConditionCheck. You can't combine them or do just one because of your rules.
I'd suggest doing a bit of optimistic concurrency here. Try the TransactWriteItems with the second ConditionCheck, where the write will succeed only if the timestamp is less than the current time. This is what should happen in most cases. If the transaction fails, you now need to check whether it failed because the timestamp was not lower or because the item doesn't exist yet.
If it doesn't exist yet, do a TransactWriteItems where you populate the timestamp, with a ConditionCheck to make sure it still doesn't exist (another thread might have written it in the meantime), and then retry the first operation.
You basically want to keep retrying the first operation (a write with a condition check to make sure the timestamp is lower) until it succeeds or fails for a good reason. If it fails because the data is uninitialized, initialize it, taking race conditions into account, and then try again.
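The first attempt described above (a Put on Table1 guarded by a ConditionCheck on Table2) could be expressed roughly like this. It is only a sketch of the request payload in the low-level DynamoDB JSON format: build_transact_items and the sample key values are hypothetical, the table and attribute names follow the question, and nothing is sent to AWS here.

```python
def build_transact_items(id_value, id_two, now):
    """Build the TransactItems list for the first attempt.

    The ConditionCheck on Table2 passes only if the item exists and its
    Timestamp is lower than `now`; only then is the Put on Table1 applied.
    """
    return [
        {
            "ConditionCheck": {
                "TableName": "Table2",
                "Key": {"IdTwo": {"S": id_two}},
                # TIMESTAMP is a DynamoDB reserved word, hence the #ts alias.
                "ConditionExpression": "#ts < :now",
                "ExpressionAttributeNames": {"#ts": "Timestamp"},
                "ExpressionAttributeValues": {":now": {"N": str(now)}},
            }
        },
        {
            "Put": {
                "TableName": "Table1",
                "Item": {"Id": {"S": id_value}, "IdTwo": {"S": id_two}},
            }
        },
    ]


# Hypothetical sample values, for illustration only.
items = build_transact_items("a-1", "b-7", 1700000000)

# With boto3 installed and credentials configured (an assumption), the list
# would be sent in one network call as:
#   boto3.client("dynamodb").transact_write_items(TransactItems=items)
```

If the transaction is cancelled, the caller would then decide, as described above, whether to initialize the Table2 item and retry or to stop.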
In Google Spanner, commit timestamps are generated by the server and based on "TrueTime", as discussed in https://cloud.google.com/spanner/docs/commit-timestamp. This page also states that timestamps are not guaranteed to be unique, so multiple independent writers can generate timestamps that are exactly the same.
The documentation on consistency guarantees states: "In addition if one transaction completes before another transaction starts to commit, the system guarantees that clients can never see a state that includes the effect of the second transaction but not the first."
What I'm trying to understand is the combination of:
1. Multiple concurrent transactions committing "at the same time", resulting in the same commit timestamp (where the commit timestamp forms part of a key for the table)
2. A reader observing new rows being entered into the above table
Under these circumstances, is it possible that a reader can observe some but not all of the rows that will (eventually) be stored with the exact same timestamp? Or put differently, if searching for all rows up to a known exact timestamp, while rows are being inserted with that timestamp, is it possible that the query first returns some of the results, but returns more when executed again?
The context of this is an attempt to model a stream of events ordered by time in an append only manner - I need to be able to keep what is effectively a cursor to a particular point in time (point in the stream of events) and need to know whether or not having observed events at time T means you can never get more events again at exactly time T.
Spanner is externally consistent, meaning that any reader will only be able to read the results of completed transactions...
As with all externally consistent DBs, it is not possible for a reader outside a transaction to read the 'pending state' of another transaction. So a reader at time T will only be able to see transactions that were committed before time T.
Multiple simultaneous insert/update transactions at commit time T (which would affect different rows, otherwise they could not be simultaneous) would not be seen by the reader at time T, but both would be seen by a reader at T+1.
I ... need to know whether or not having observed events at time T means you can never get more events again at exactly time T.
Yes - ish. Rephrasing slightly as this is nuanced:
Having read events up to and including time T means you will never get any more events occurring with time equal to or before time T
But remember that the commit timestamp column is a simple TIMESTAMP column where any value can be stored -- it is the application that requests that the value stored is the commit timestamp, and there is nothing at the DB level to stop the application storing any value it likes...
As always with Spanner, it is the application which has to enforce/maintain the data integrity.
Scenario: We have a DynamoDB table supporting Optimistic Locking with Version Number. Two concurrent threads are trying to save two different entries with the same primary key value to that table.
Question: Will ConditionalCheckFailedException be thrown for the latter save action?
Yes, the second thread that tries to insert the same data would get a ConditionalCheckFailedException.
com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException
Once the item is saved in the database, subsequent updates must carry a version that matches the value in the DynamoDB table (i.e. the server-side value).
save — For a new item, the DynamoDBMapper assigns an initial version
number 1. If you retrieve an item, update one or more of its
properties and attempt to save the changes, the save operation
succeeds only if the version number on the client-side and the
server-side match. The DynamoDBMapper increments the version number
automatically.
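The version rule quoted above can be illustrated with a small in-memory simulation. This is not DynamoDBMapper or any AWS API: VersionedStore and ConditionalCheckFailed are hypothetical stand-ins that only mimic the "client version must match server version" check.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class VersionedStore:
    """In-memory table that applies the version rule on every save."""

    def __init__(self):
        self._items = {}  # key -> (version, data)

    def get(self, key):
        return self._items.get(key)

    def save(self, key, data, client_version=None):
        stored = self._items.get(key)
        server_version = stored[0] if stored else None
        # A new item has no version on either side; otherwise both must match.
        if client_version != server_version:
            raise ConditionalCheckFailed(key)
        new_version = 1 if server_version is None else server_version + 1
        self._items[key] = (new_version, data)
        return new_version


store = VersionedStore()
store.save("pk-1", {"owner": "thread A"})  # new item, version becomes 1

try:
    # Thread B also believes the item is new, so it sends no version.
    store.save("pk-1", {"owner": "thread B"})
    second_save_failed = False
except ConditionalCheckFailed:
    second_save_failed = True
```

In this simulation the second save is rejected because the stored item already has version 1 while the second writer supplies none, which mirrors the scenario in the question.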
We had a similar use case in the past, but in our case multiple threads were first reading from DynamoDB and then trying to update the values.
So the version may have changed between the time they read the document and the time they try to update it, and if you don't read the latest value from DynamoDB, the intermediate update will be lost (this is known as the lost update problem; refer to the AWS docs for more info).
I am not sure whether you have this use case, but if you simply have two threads trying to update the value, and one of them carries a different version by the time its request reaches DynamoDB, you will get a ConditionalCheckFailedException.
More info about this error can be found here: http://grepcode.com/file/repo1.maven.org/maven2/com.michelboudreau/alternator/0.10.0/com/amazonaws/services/dynamodb/model/ConditionalCheckFailedException.java
In a Django project, I'm refreshing tens of thousands of lines of data from an external API on a daily basis. The problem is that since I don't know if the data is new or just an update, I can't do a bulk_create operation.
Note: Some, or perhaps many, of the rows do not actually change on a daily basis, but I don't know which, or how many, ahead of time.
So for now I do:
for row in csv_data:
    try:
        MyModel.objects.update_or_create(id=row['id'], defaults={'field1': row['value1']....})
    except:
        print 'error!'
And it takes.... forever! One or two lines a second at best, sometimes several seconds per line. Each model I'm refreshing has one or more other models connected to it through a foreign key, so I can't just delete them all and reinsert every day. I can't wrap my head around this one: how can I significantly cut down the number of database operations so the refresh doesn't take hours and hours?
Thanks for any help.
The problem is that you are doing a database action for each data row you grabbed from the API. You can avoid that by working out which rows are new (and doing a bulk insert of all new rows), which rows actually need an update, and which didn't change.
To elaborate:
1. Grab all the relevant rows from the database (meaning all the rows that can possibly be updated):
    old_data = MyModel.objects.all()  # if possible, narrow this with MyModel.objects.filter(...)
2. Grab all the API data you need to insert or update:
    api_data = [...]
3. For each row of data, work out whether it is new and put it in an array, or determine whether the row needs to update the DB:
    new_rows_array = []  # rows to be created in a single bulk insert
    for row in api_data:
        if is_new_row(row, old_data):
            new_rows_array.append(row)
        else:
            if is_data_modified(row, old_data):
                ...
                # do the update
            else:
                continue
    MyModel.objects.bulk_create(new_rows_array)
is_new_row decides whether the row is new, in which case it is added to the array that will be bulk created.
is_data_modified looks for the row in the old data and reports whether the data of that row has changed, so that the update happens only if it has changed.
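One way the new/modified/unchanged split could look is sketched below with plain dictionaries instead of model instances, so that it runs without Django. partition_rows and the sample data are hypothetical; the helpers is_new_row and is_data_modified from the answer are folded into a single pass here.

```python
def partition_rows(api_rows, existing_by_id):
    """Split API rows into new rows and changed rows; unchanged rows are skipped.

    api_rows: list of dicts that each contain an 'id' key.
    existing_by_id: dict mapping id -> dict of the values currently stored.
    """
    new_rows, changed_rows = [], []
    for row in api_rows:
        current = existing_by_id.get(row["id"])
        if current is None:
            # Not in the database yet: candidate for a bulk insert.
            new_rows.append(row)
        elif any(current.get(k) != v for k, v in row.items() if k != "id"):
            # At least one field differs: needs an update.
            changed_rows.append(row)
    return new_rows, changed_rows


existing = {1: {"field1": "a"}, 2: {"field1": "b"}}
incoming = [
    {"id": 1, "field1": "a"},  # unchanged
    {"id": 2, "field1": "B"},  # changed
    {"id": 3, "field1": "c"},  # new
]
new_rows, changed_rows = partition_rows(incoming, existing)
```

In a Django project, existing_by_id might be built from a single query such as MyModel.objects.values("id", "field1"), the new rows passed to bulk_create as model instances, and the changed rows written with update() or, on Django 2.2 and later, bulk_update(). Looking rows up in a dict keeps the comparison at one pass over the API data instead of one query per row.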
If you look at the source code for update_or_create(), you'll see that it's hitting the database multiple times for each call (either a get() followed by a save(), or a get() followed by a create()). It does things this way to maximize internal consistency - for example, this ensures that your model's save() method is called in either case.
But you might well be able to do better, depending on your specific models and the nature of your data. For example, if you don't have a custom save() method, aren't relying on signals, and know that most of your incoming data maps to existing rows, you could instead try an update() followed by a bulk_create() if the row doesn't exist. Leaving aside related models, that would result in one query in most cases, and two queries at the most. Something like:
updated = MyModel.objects.filter(field1="stuff").update(field2="other")
if not updated:
    MyModel.objects.bulk_create([MyModel(field1="stuff", field2="other")])
(Note that this simplified example has a race condition, see the Django source for how to deal with it.)
In the future there will probably be support for PostgreSQL's UPSERT functionality, but of course that won't help you now.
Finally, as mentioned in the comment above, the slowness might just be a function of your database structure and not anything Django-specific.
Just to add to the accepted answer: one way of recognizing whether the operation is an update or a create is to ask the API owner to include a last-updated timestamp with each row (if possible) and store it in your DB for each row. That way you only have to process the rows whose timestamp differs from the one in the API.
I faced exactly this issue, where I was updating every existing row and creating new ones. It took a whole minute to update 8000-odd rows. With selective updates, I cut the time down to just 10-15 seconds, depending on how many rows have actually changed.
I think the code below can do the same thing in place of update_or_create:
MyModel.objects.filter(...).update()
MyModel.objects.get_or_create()