DynamoDB Concurrent Add to StringSet - amazon-web-services

I have read that concurrent writes on the same attribute, such as list appends or overwriting by using SET attr = :val, will lose some data (unordered overwrites, etc.). A quick test with 4 processes running in parallel successfully added all elements to the set but I want to be sure. Can I count on this behavior? Will concurrent writes using ADD to the same StringSet field correctly write all elements to the set?
Currently using updateItem with the following payload:
TableName='TestTable',
Key={
    'PK': {'S': 'PrimaryKey'},
    'SK': {'S': 'SortKey'}
},
UpdateExpression='ADD Field :value',
ExpressionAttributeValues={
    ':value': {'SS': ['test value']}
}
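For reference, a minimal sketch of the kind of parallel test described above, reusing the same payload (the process count and the generated values are assumptions):
import boto3
from multiprocessing import Pool

def add_value(value):
    # Each worker process gets its own client; ADD on a string set is a server-side update.
    client = boto3.client('dynamodb')
    client.update_item(
        TableName='TestTable',
        Key={'PK': {'S': 'PrimaryKey'}, 'SK': {'S': 'SortKey'}},
        UpdateExpression='ADD Field :value',
        ExpressionAttributeValues={':value': {'SS': [value]}},
    )

if __name__ == '__main__':
    # Assumption: 4 worker processes and 100 distinct values, as in the test described.
    with Pool(4) as pool:
        pool.map(add_value, [f'value-{i}' for i in range(100)])
Afterwards you can GetItem the record and check that the set contains every value that was written.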

Related

How to fetch a certain number of records from a paginated DynamoDB table?

I am trying to work with the first 50 records, or the 1st scan page, returned from the get_paginator method.
This is how I scan through the table and get paginated results, over which I loop and do some post-processing.
dynamo_client = boto3.client('dynamodb')
paginator = dynamo_client.get_paginator("scan")
for page in paginator.paginate(TableName=table_name):
    yield from page["Items"]
Is it possible to work on only, say, the 1st scanned page and explicitly start from the 2nd page onwards? Summing it up, I am trying to query the first page of results in one Lambda function and the 2nd page specifically using another Lambda function. How can I achieve this?
You need to pass the NextToken to your other Lambda, somehow.
On the paginator response, there is a NextToken property. You can then pass that in the config of the paginator.paginate() call.
Somewhat contrived example:
dynamo_client = boto3.client('dynamodb')
paginator = dynamo_client.get_paginator("scan")
token = ""
# Grab the first page
for page in paginator.paginate(TableName=table_name):
    # do some work
    dowork(page["Items"])
    # grab the token
    token = page["NextToken"]
    # stop iterating after the first page for some reason
    break
# This will continue iterating where the last iterator left off
for page in paginator.paginate(TableName=table_name, PaginationConfig={'StartingToken': token}):
    # do some work
    dowork(page["Items"])
Let's say you were trying to use a Lambda to iterate over all your DynamoDB items in a table. You could have the iterator run until a time limit, break, then queue up next Lambda function, passing along the NextToken for it to resume with.
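For instance, a rough sketch of handing the token to the next function (the function name and payload shape are assumptions):
import json
import boto3

lambda_client = boto3.client('lambda')

def hand_off(token):
    # Fire-and-forget invocation of the follow-up Lambda, carrying the resume token.
    lambda_client.invoke(
        FunctionName='scan-page-two',             # assumed name of the other Lambda
        InvocationType='Event',
        Payload=json.dumps({'StartingToken': token}),
    )
The receiving handler would then read event['StartingToken'] and pass it as PaginationConfig={'StartingToken': ...} in its own paginate() call.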
You can learn more via the API doc which details what this does or see some further examples on GitHub.

Unable to iterate over all objects in table using objects.all()

I am writing a migration script that will iterate over all objects of a cassandra model (Cats). There are more than 30000 objects of Cat but using Cats.objects.all(), I am only able to iterate over 10000 objects.
qs = Cats.objects.all()
print(qs.count()) # returns 30000
print(len(qs)) # returns 10000
Model:
from django_cassandra_engine.models import DjangoCassandraModel
class Cats(DjangoCassandraModel):
...
Cassandra backend used: django-cassandra-engine version 1.6.2
The default fetch size (aka page size) is 10K so you'll only get the first 10K rows returned. If you really want to get all the records in the table, you'll need to override the session defaults:
'cassandra': {
    ...
    'OPTIONS': {
        ...
        'session': {
            ...
            'default_fetch_size': 10000
        }
    }
}
But be careful about setting it to a very high value, because it can overload the coordinator node for the request and affect the performance of your cluster.
You should instead iterate through the results one page at a time, requesting the next page until you've reached the end. Cheers!
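As an illustration of that page-by-page approach, here is a rough sketch using the underlying DataStax Python driver directly rather than the Django queryset (the contact point, keyspace, table name, and process() helper are assumptions):
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])            # assumed contact point
session = cluster.connect('my_keyspace')    # assumed keyspace

query = SimpleStatement('SELECT * FROM cats', fetch_size=1000)
results = session.execute(query)

while True:
    # Work only on the rows of the page currently in memory.
    for row in results.current_rows:
        process(row)                        # placeholder for your migration logic
    if results.has_more_pages:
        results.fetch_next_page()           # synchronously request the next page
    else:
        break
This keeps at most one page of rows in memory at a time, so the fetch size can stay at its default.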

Django Rest Framework: Disable save in update

This is my first question here; after reading the similar questions I did not find what I need. Thanks for your help.
I am creating a fairly simple API, but I want to use best practices at the security level.
Requirement: there is a table in SQL Server with 5+ million records for which I should ONLY allow READ (all fields) and UPDATE (one field). This is so that a data scientist can consume data from this table and, through a predictive model (I think), assign a value to each record.
For this I mainly need 2 things:
That only one field is updated despite all the fields of the table being sent in the JSON (I think I have achieved this with my serializer).
And, where I have problems: disabling the creation of new records when updating one that does not exist.
I am using an UpdateAPIView to try to allow a bulk update using JSON like this (subrrogate_key is a field in my table and I use it as the lookup_field):
[
    {
        "subrrogate_key": "A1",
        "class": "A"
    },
    {
        "subrrogate_key": "A2",
        "class": "B"
    },
    {
        "subrrogate_key": "A3",
        "class": "C"
    }
]
When using partial_update, it calls update, which calls perform_update, which in turn calls save; the default behavior is to insert a new record if the primary key (or the field specified in lookup_field) is not found.
If I override them, how can I make sure a new record is not inserted, and only update the field if it exists?
I tried:
Model.objects.filter(subrrogate_key=['subrrogate_key']).update(class=['class'])
Model.objects.update_or_create(...)
They work fine if all the keys in the JSON already exist, but if a new one comes they will insert it (I don't want this).
P.S. I use a translator, sorry.
perform_update will create a new record if you passed a serializer that doesn't have an instance. Depending on how you wrote your view, you can simply check if there is an instance in the serializer before calling save in perform_update to prevent creating a new record:
def perform_update(self, serializer):
    # If the serializer has no existing instance, saving would create a
    # new record, so return without saving.
    if not serializer.instance:
        return
    serializer.save()
Django implements that feature through the use of either force_update or update_fields during save().
https://docs.djangoproject.com/en/3.2/ref/models/instances/#forcing-an-insert-or-update
https://docs.djangoproject.com/en/3.2/ref/models/instances/#specifying-which-fields-to-save
https://docs.djangoproject.com/en/3.2/ref/models/instances/#saving-objects
In some rare circumstances, it’s necessary to be able to force the
save() method to perform an SQL INSERT and not fall back to doing an
UPDATE. Or vice-versa: update, if possible, but not insert a new row.
In these cases you can pass the force_insert=True or force_update=True
parameters to the save() method.
model_obj.save(force_update=True)
or
model_obj.save(update_fields=['field1', 'field2'])
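A rough sketch of how that could look in the update path, assuming subrrogate_key is the model's primary key (the model and field names are placeholders; with force_update=True, save() raises DatabaseError instead of inserting when no matching row exists):
from django.db import DatabaseError

def apply_prediction(key_value, predicted_value):
    # MyModel, subrrogate_key and predicted_class are placeholder names.
    obj = MyModel(subrrogate_key=key_value, predicted_class=predicted_value)
    try:
        # Issue an UPDATE for one field only; never INSERT.
        obj.save(force_update=True, update_fields=['predicted_class'])
    except DatabaseError:
        # No row with this key exists, so skip it instead of creating one.
        pass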

objects.update_or_create() creates a new record in the database rather than updating the existing record

I have created a model in Django for which I get the data from an API. I am trying to use the update_or_create method to get the data from the API into my database. However, I may be confused about how it works.
When I run it for the first time it adds the data into the database as expected; however, if I update one of the fields for a record - the data from the API has changed for the respective record - and then run it again, it creates a new record rather than updating the existing record.
So in the scenario below, I run it and the count for the record Commander Legends is 718, which I then update manually in the database to be 100. When I run it again, it creates a new record with a count of 718.
With my understanding of it, it should have updated the record rather than create a new record.
views.py
def set_update(request):
    try:
        discover_api = requests.get('https://api.scryfall.com/sets').json()
        set_data = discover_api['data']
        while discover_api['has_more']:
            discover_api = requests.get(discover_api['next_page']).json()
            set_data.extend(discover_api['data'])
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
    except Exception as err:
        print(f'Other error occurred: {err}')
    sorted_data = sorted(set_data, key=lambda k: k['releaseDate'], reverse=False)
    for i in sorted_data:
        Set.objects.update_or_create(
            scry_id=i.get('id'),
            code=i.get('code'),
            name=i.get('name'),
            type=i.get('set_type').replace("_", " ").title(),
            release_date=i.get('released_at'),
            card_count=i.get('card_count'),
            is_digital_only=i.get('digital', False),
            is_non_foil_only=i.get('nonfoil_only', False),
            is_foil_only=i.get('foil_only', False),
            block_name=i.get('block'),
            block_code=i.get('block_code'),
            parent_set_code=i.get('parent_set_code'),
            tcgplayer_id=i.get('tcgplayer_id'),
            last_modified=date.today(),
            defaults={
                'scry_id': i.get('id'),
                'code': i.get('code'),
                'name': i.get('name'),
                'type': i.get('set_type').replace("_", " ").title(),
                'release_date': i.get('released_at'),
                'card_count': i.get('card_count'),
                'is_digital_only': i.get('digital', False),
                'is_non_foil_only': i.get('nonfoil_only', False),
                'is_foil_only': i.get('foil_only', False),
                'block_name': i.get('block'),
                'block_code': i.get('block_code'),
                'parent_set_code': i.get('parent_set_code'),
                'tcgplayer_id': i.get('tcgplayer_id'),
                'last_modified': date.today(),
            }
        )
    return redirect('dashboard:sets')
In an update_or_create(…) [Django-doc], you basically have two parts:
the named parameters, which do the filtering: only if a record matches all the filters will it update that record; and
the defaults=… parameter, which is a dictionary of values that will be used to update or create that record.
If you thus want to update only the card_count and last_modified, it looks like:
Set.objects.update_or_create(
    scry_id=i.get('id'),
    code=i.get('code'),
    name=i.get('name'),
    type=i.get('set_type').replace('_', ' ').title(),
    release_date=i.get('released_at'),
    is_digital_only=i.get('digital', False),
    is_non_foil_only=i.get('nonfoil_only', False),
    is_foil_only=i.get('foil_only', False),
    block_name=i.get('block'),
    block_code=i.get('block_code'),
    parent_set_code=i.get('parent_set_code'),
    tcgplayer_id=i.get('tcgplayer_id'),
    defaults={
        'card_count': i.get('card_count'),
        'last_modified': date.today()
    }
)
If you thus add card_count to the kwargs, it will only update the record if the card_count matches completely. So if both defaults and the kwargs contain the same values, you will basically either do nothing (if a record exists with all the values in place) or create a new one (if no such record is present).
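A quick way to confirm which branch was taken is to look at the created flag the method returns. A small sketch, with the lookup narrowed to just scry_id on the assumption that it uniquely identifies a set:
obj, created = Set.objects.update_or_create(
    scry_id=i.get('id'),                    # lookup: identifies the existing row
    defaults={
        'card_count': i.get('card_count'),  # fields that are allowed to change
        'last_modified': date.today(),
    },
)
print('created' if created else 'updated', obj.name)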

Why does the following DynamoDB write with a conditional expression succeed?

I have the following code to create a DynamoDB table:
def create_mock_dynamo_table():
    conn = boto3.client(
        "dynamodb",
        region_name=REGION,
        aws_access_key_id="ak",
        aws_secret_access_key="sk",
    )
    conn.create_table(
        TableName=DYNAMO_DB_TABLE,
        KeySchema=[
            {'AttributeName': 'PK', 'KeyType': 'HASH'},
            {'AttributeName': 'SK', 'KeyType': 'RANGE'}
        ],
        AttributeDefinitions=[
            {'AttributeName': 'PK', 'AttributeType': 'S'},
            {'AttributeName': 'SK', 'AttributeType': 'S'}
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )
    mock_table = boto3.resource('dynamodb', region_name=REGION).Table(DYNAMO_DB_TABLE)
    return mock_table
Then I use it to make two put_item calls:
mock_table = create_mock_dynamo_table()
mock_table.put_item(
    Item={
        'PK': 'did:100000001',
        'SK': 'weekday:monday:start_time:00:30',
    }
)
mock_table.put_item(
    Item={
        'PK': 'did:100000001',
        'SK': 'weekday:monday:start_time:00:40',
    },
    ConditionExpression='attribute_not_exists(PK)'
)
When I do the second put_item, the PK is already there in the system and only the sort key is different. But the condition I am setting is only on the existence of the same PK. So the second put_item should fail, right?
The condition check for PutItem does not check the condition against arbitrary items. It only checks the condition against an item with the same primary key (hash and sort keys), if such an item exists.
In your case, the value of the sort key is different, so when you put the second item, DynamoDB sees that no item exists with that key, therefore the PK attribute does not exist.
This is also why the condition check fails the second time you run the code: at that point you already have an item with the same hash and sort keys.
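For contrast, a small sketch of a put that targets the same primary key as the first item and therefore does trip the condition (it reuses mock_table from above):
from botocore.exceptions import ClientError

try:
    # Same PK and SK as the first item, so an item with that key already
    # exists and attribute_not_exists(PK) evaluates to false.
    mock_table.put_item(
        Item={
            'PK': 'did:100000001',
            'SK': 'weekday:monday:start_time:00:30',
        },
        ConditionExpression='attribute_not_exists(PK)'
    )
except ClientError as err:
    if err.response['Error']['Code'] == 'ConditionalCheckFailedException':
        print('condition failed: an item with this key already exists')
    else:
        raise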
DynamoDB's "IOPS" is very low and the actual write takes some time. You can read more about it here. But, if you run the code a second time soon after, you'll see that you'll get the expected botocore.errorfactory.ConditionalCheckFailedException.
If I may refer to what I think you're trying to do - mock a DB + data. When you want to mock such an "expensive" resource, make an actual fake class. You'll want to wrap all your DB accesses in the actual code with some kind of dal.py module that consolidates operations such as write/read/etc. Then, you mock those methods/functions.
You don't want to write code so tightly coupled with the chosen DB.
The best practice is using an ORM framework such as SQLAlchemy. It is invaluable to take the time now to learn it. But, you might have time constraints I'm not aware of.
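For the dal.py idea above, a minimal sketch (the module layout, function names, and table wiring are assumptions):
# dal.py - consolidates all DynamoDB access behind plain functions.
import boto3

_table = boto3.resource('dynamodb').Table('TestTable')   # assumed table name

def put_schedule_entry(device_id, slot):
    _table.put_item(Item={'PK': device_id, 'SK': slot})

def get_schedule_entry(device_id, slot):
    return _table.get_item(Key={'PK': device_id, 'SK': slot}).get('Item')
The rest of the code only ever calls these functions, so in tests you patch or swap them for an in-memory fake instead of standing up a mock table.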