I have an existing DynamoDB table with the following attributes:

hk (hash key) | rk (range key) | a1 | a2 | a3

I have an existing DynamoDB client that updates existing records, but only attribute a1. I want to create a second writer (DDB client) that also updates existing records, but only a2 and a3. If both DDB clients try to update the same record at the exact same time (one for a1, the other for a2 and a3), does DynamoDB guarantee that a1, a2, and a3 all end up with the correct (new) values? Is using the save behavior UPDATE_SKIP_NULL_ATTRIBUTES sufficient for this purpose, or do I need to implement some kind of optimistic locking? If not, is there something DynamoDB provides out of the box for this purpose?
If you happen to be using the DynamoDB Java SDK you are in luck, because the SDK supports just that with optimistic locking. I'm not sure if the other SDKs support anything similar; I suspect they do not.
Optimistic locking is a strategy to ensure that the client-side item
that you are updating (or deleting) is the same as the item in
DynamoDB. If you use this strategy, then your database writes are
protected from being overwritten by the writes of others — and
vice-versa.
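Under the hood the Java SDK does this with a version attribute and a conditional write. If you are not on Java, you can hand-roll the same pattern with a conditional update. Here is a minimal sketch for the Node.js SDK; the table name, key names, and the version attribute are assumptions for illustration, not something DynamoDB gives you automatically:

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// Writer 1: update a1 only, guarded by a version check.
async function updateA1(hk, rk, newA1, expectedVersion) {
    return ddb.update({
        TableName: 'myTable', // hypothetical table name
        Key: { hk, rk },
        UpdateExpression: 'SET a1 = :a1, #v = :next',
        // The write succeeds only if nobody bumped the version since we read the item.
        ConditionExpression: '#v = :expected',
        ExpressionAttributeNames: { '#v': 'version' }, // alias the name to be safe with reserved words
        ExpressionAttributeValues: {
            ':a1': newA1,
            ':expected': expectedVersion,
            ':next': expectedVersion + 1,
        },
    }).promise();
}

On a conflict the call rejects with a ConditionalCheckFailedException; the caller re-reads the item and retries. Also worth noting: an UpdateExpression only touches the attributes it names, so two writers updating disjoint attributes won't clobber each other's fields; the version check matters when they may touch the same ones.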
Consider using this distributed locking library, https://www.npmjs.com/package/dynamodb-lock-client, here is the sample code we use in our codebase:
const DynamoDBLockClient = require('dynamodb-lock-client');

const PARTITION_KEY = 'id';
const HEARTBEAT_PERIOD_MS = 3e3;
const LEASE_DURATION_MS = 1e4;
const RETRY_COUNT = 1e2;

function dynamoLock(dynamodb, lockKey, callback) {
    const failOpenClient = new DynamoDBLockClient.FailOpen({
        dynamodb,
        lockTable: process.env.LOCK_STORE_TABLE, // replace this with your own lock store table
        partitionKey: PARTITION_KEY,
        heartbeatPeriodMs: HEARTBEAT_PERIOD_MS,
        leaseDurationMs: LEASE_DURATION_MS,
        retryCount: RETRY_COUNT,
    });
    return new Promise((resolve, reject) => {
        let error;
        // Locking is required as several Lambda instances may attempt to update
        // the table at the same time and we do not want lost updates.
        failOpenClient.acquireLock(lockKey, async (lockError, lock) => {
            if (lockError) {
                return reject(lockError);
            }
            let result = null;
            try {
                result = await callback(lock);
            } catch (callbackError) {
                error = callbackError;
            }
            return lock.release((releaseError) => {
                if (releaseError || error) {
                    return reject(releaseError || error);
                }
                return resolve(result);
            });
        });
    });
}

async function doStuff(id) {
    await dynamoLock(dynamodb, `Lock-DataReset-${id}`, async () => {
        // do your DDB work here
    });
}
Reads from DynamoDB are eventually consistent by default.
See this: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html
DynamoDB supports eventually consistent and strongly consistent reads.
Eventually Consistent Reads
When you read data from a DynamoDB table, the response might not
reflect the results of a recently completed write operation. The
response might include some stale data. If you repeat your read
request after a short time, the response should return the latest
data.
Strongly Consistent Reads
When you request a strongly consistent read, DynamoDB returns a
response with the most up-to-date data, reflecting the updates from
all prior write operations that were successful. A strongly consistent
read might not be available if there is a network delay or outage.
Note DynamoDB uses eventually consistent reads, unless you specify
otherwise. Read operations (such as GetItem, Query, and Scan) provide
a ConsistentRead parameter. If you set this parameter to true,
DynamoDB uses strongly consistent reads during the operation.
Basically, you have to specify that you need strongly consistent data when you read.
And that should solve your problem. With consistent reads you should see updates to all three fields.
Do note that there are pricing impacts for strongly consistent reads.
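For example, with the Node.js DocumentClient it's just a flag on the read call (the table and key names here are made up):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

async function readItem(hk, rk) {
    const { Item } = await ddb.get({
        TableName: 'myTable', // hypothetical table name
        Key: { hk, rk },
        ConsistentRead: true, // request a strongly consistent read
    }).promise();
    return Item;
}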
Using the Cloud Storage Node.js API, I am trying to do the following:
Generate file path like /uploads/${uuid.v4()}.extension
Write the file.
This is the code:
const { extname } = require('path');
const { v4: uuidv4 } = require('uuid');

// Note: extname() already includes the leading dot.
const path = `/uploads/${uuidv4()}${extname(fileName)}`;
const file = bucket.file(path);
await new Promise((resolve, reject) =>
    data
        .pipe(file.createWriteStream({ contentType }))
        .once('error', reject)
        .once('finish', resolve),
);
It works fine. But it bothers me to no end that there is a minuscule probability that the same UUID will be generated twice. It is not a practical concern.
How can I upload data to Cloud Storage but get an error if there's a clash? I can check if the file exists beforehand but there is still a race condition technically...
The chance of a collision is not just minuscule: it's astronomically low for UUIDs of significant size. Handling such a collision is unlikely to be worth the effort.
That said, if you still want to, you won't be able to do it with the Cloud Storage APIs alone, since they don't give you a transactional, atomic way to do it. If you want a "hard" guarantee that there is no collision, you will need an entirely different Cloud service that lets you effectively "lock" a unique string (e.g. a file path) as a flag for all other processes to check, so that they don't collide. Since you are working in Google Cloud, you might consider using a database with atomic transactional operations (any SQL database, or Firestore) to "reserve" the path so that only one process can use it, assuming all writers correctly observe this reservation and cooperate.
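A minimal sketch of that reservation idea with the Firestore Node.js server SDK (the collection name and encoding scheme are my own assumptions): create() is atomic and rejects with ALREADY_EXISTS if the document is already there, so only one process can claim a given path.

const { Firestore } = require('@google-cloud/firestore');
const firestore = new Firestore();

async function reservePath(path) {
    // Document IDs cannot contain '/', so encode the path first.
    const id = encodeURIComponent(path);
    // create() fails atomically if the document already exists.
    await firestore.collection('uploadPaths').doc(id).create({ reservedAt: new Date() });
}

If the create() call rejects, generate a fresh UUID and try again; this only works if every writer goes through the same reservation step.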
Isn't this exactly what preconditions are for? For a destination object that should not exist yet, set the ifGenerationMatch precondition to 0; the upload then fails with a 412 Precondition Failed if another writer has already created that object.
Copied from the docs: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-nodejs
const options = {
    destination: destFileName,
    // Optional:
    // Set a generation-match precondition to avoid potential race conditions
    // and data corruptions. The request to upload is aborted if the object's
    // generation number does not match your precondition. For a destination
    // object that does not yet exist, set the ifGenerationMatch precondition to 0.
    // If the destination object already exists in your bucket, set instead a
    // generation-match precondition using its generation number.
    preconditionOpts: { ifGenerationMatch: generationMatchPrecondition },
};

await storage.bucket(bucketName).upload(filePath, options);
console.log(`${filePath} uploaded to ${bucketName}`);
I have a scenario: query the list of students in a school, by year, and then use that information for some other tasks, let's say printing a certificate for each student.
I'm using the serverless framework to deal with that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
    const params = {
        TableName: 'schoolTable',
        // Use expression attribute value placeholders; raw variable names
        // inside the expression string are not substituted by DynamoDB.
        KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
        ExpressionAttributeValues: {
            ':school_id': _school_id,
            ':year': _year,
        },
    };
    try {
        let _students = [];
        let items;
        do {
            items = await dynamoClient.query(params).promise();
            // Accumulate each page instead of overwriting the previous one
            _students = _students.concat(items.Items);
            params.ExclusiveStartKey = items.LastEvaluatedKey;
        } while (typeof items.LastEvaluatedKey !== 'undefined');
        return _students;
    } catch (e) {
        console.log('Error: ', e);
    }
};

const mainHandler = async (event, context) => {
    …
    let students = await queryStudent(body.school_id, body.year);
    await printCertificate(students);
    …
};
So far, it’s working well with about 5k students (just sample data).
My concern: is this a scalable solution for querying large amounts of data in DynamoDB?
As I understand it, Lambda has a limited execution time; if the number of students goes up to a million, will the above solution still work?
Any best-practice approach for this scenario is appreciated and welcome.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection, which means they live on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather across the shards (see the sketch after this list).
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (plus the processing) within its 15-minute limit, and if it fails before it's completely through, you lose track of how far you've come. You can do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check for it on new Lambda invocations, and resume from there.
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function you do the querying. This is a solution that won't scale if it's a synchronous process and you have a million students. If stuff fails, you also have to consider retries and build that logic in your code.
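To make the sharding idea concrete, here is a rough sketch reusing the queryStudent function from the question (the shard count and hash are arbitrary choices, not a recommendation):

// Write sharding: spread one school's students across N partition keys.
const SHARD_COUNT = 10; // assumed value; tune to your read/write volume

function hashCode(s) {
    let h = 0;
    for (const ch of s) h = (h * 31 + ch.charCodeAt(0)) | 0;
    return h;
}

// Use a stable shard per student so the key is deterministic.
function shardedPartitionKey(schoolId, studentId) {
    const shard = Math.abs(hashCode(studentId)) % SHARD_COUNT;
    return `${schoolId}#${shard}`;
}

// Scatter-gather read: query every shard and merge the results.
async function queryAllShards(schoolId, year) {
    const pages = await Promise.all(
        Array.from({ length: SHARD_COUNT }, (_, shard) =>
            queryStudent(`${schoolId}#${shard}`, year)
        )
    );
    return pages.flat();
}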
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. The state machine has a single Lambda function that queries the table across the sharded partition keys and writes each student to an SQS queue of certificate-printing tasks. If the Lambda notices it's close to the runtime limit, it returns the LastEvaluatedKey; the Step Function recognizes that and starts the function again with this offset. The SQS queue can then invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
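A rough sketch of the query-and-checkpoint part (the names, queue URL, and 30-second safety margin are assumptions, not a tested implementation):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

// Invoked by the Step Function with { schoolId, lastEvaluatedKey } from the previous run.
exports.handler = async (event, context) => {
    const params = {
        TableName: 'schoolTable',
        KeyConditionExpression: 'partition_key = :school_id',
        ExpressionAttributeValues: { ':school_id': event.schoolId },
        ExclusiveStartKey: event.lastEvaluatedKey, // undefined on the first run
    };
    let page;
    do {
        page = await ddb.query(params).promise();
        // Fan each student out to the certificate-printing queue.
        for (const student of page.Items) {
            await sqs.sendMessage({
                QueueUrl: process.env.CERT_QUEUE_URL, // hypothetical queue
                MessageBody: JSON.stringify(student),
            }).promise();
        }
        params.ExclusiveStartKey = page.LastEvaluatedKey;
        // Checkpoint and hand back to the Step Function when close to the time limit.
        if (page.LastEvaluatedKey && context.getRemainingTimeInMillis() < 30000) {
            return { done: false, lastEvaluatedKey: page.LastEvaluatedKey };
        }
    } while (page.LastEvaluatedKey);
    return { done: true };
};

The state machine loops while done === false, feeding lastEvaluatedKey back into the next invocation.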
Implementing this requires more effort, so I'd first figure out whether a million students per school per year is a realistic number :-)
Problem
I'm using mssql v6.2.0 in a Lambda that is invoked frequently (consistently ~25 concurrent invocations under standard load).
I seem to be having trouble with connection pooling or something because I keep having tons of open DB connections which overwhelm my database (SQL Server on RDS) causing the Lambdas to just time out waiting for query results.
I have read the docs, various similar questions, Github issues, etc. but nothing has worked for this particular issue.
Things I've Learned Already
I did learn that pooling is possible across invocations due to the fact that variables outside the handler function are shared across invocations in the same container. This makes me think I should see just a few connections for each container running my Lambda, but I don't know how many that is so it's hard to verify. Bottom line is that pooling should keep me from having tons and tons of open connections, so something isn't working right.
There are several different ways to use mssql and I have tried several of them. Notably I've tried specifying max pool size with both large and small values but got the same results.
AWS recommends that you check to see if there's already a pool before trying to create a new one. I tried that to no avail. It was something like pool = pool || await createPool(); the general shape of that pattern is sketched at the end of this list.
I know that RDS Proxy exists to help with situations like this, but it appears it isn't offered (at this time) for SQL Server instances.
I do have the ability to slow down my data a bit, but this has a slight impact on the performance of the product as a whole, so I don't want to do that just to avoid solving a DB connections issue.
Left unchecked, I saw as many as 700 connections to the DB at once, leading me to think there's a leak of some kind and it's maybe not just a reasonable result of high usage.
I didn't find a way to shorten the TTL for connections on the SQL Server side as recommended by this re:Invent slide. Perhaps that is part of the answer?
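For reference, the caching pattern I mean looks roughly like this with mssql (a sketch of the general shape, not my exact code):

const sql = require('mssql');

// Cache the pool promise at module scope so warm invocations reuse it.
let poolPromise;

function getPool() {
    if (!poolPromise) {
        poolPromise = new sql.ConnectionPool(process.env.connectionString)
            .connect()
            .catch((err) => {
                poolPromise = undefined; // let the next invocation retry
                throw err;
            });
    }
    return poolPromise;
}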
Code
'use strict';

/* Dependencies */
const sql = require('mssql');
const fs = require('fs').promises;
const path = require('path');
const AWS = require('aws-sdk');
const GeoJSON = require('geojson');

AWS.config.update({ region: 'us-east-1' });
const iotdata = new AWS.IotData({ endpoint: process.env['IotEndpoint'] });

/* Export */
exports.handler = async function (event) {
    let myVal = event.Records[0].Sns.Message;

    // Gather prerequisites in parallel
    let [query1, query2, pool] = await Promise.all([
        fs.readFile(path.join(__dirname, 'query1.sql'), 'utf8'),
        fs.readFile(path.join(__dirname, 'query2.sql'), 'utf8'),
        sql.connect(process.env['connectionString'])
    ]);

    // Query DB for updated data
    let results = await pool.request()
        .input('MyCol', sql.TYPES.VarChar, myVal)
        .query(query1);

    // Prepare IoT Core message
    let params = {
        topic: `${process.env['MyTopic']}/${results.recordset[0].TopicName}`,
        payload: convertToGeoJsonString(results.recordset),
        qos: 0
    };

    // Publish results to MQTT topic
    try {
        await iotdata.publish(params).promise();
        console.log(`Successfully published update for ${myVal}`);

        // Query 2
        await pool.request()
            .input('MyCol1', sql.TYPES.Float, results.recordset[0]['Foo'])
            .input('MyCol2', sql.TYPES.Float, results.recordset[0]['Bar'])
            .input('MyCol3', sql.TYPES.VarChar, results.recordset[0]['Baz'])
            .query(query2);
    } catch (err) {
        console.log(err);
    }
};

/**
 * Convert query results to GeoJSON for API response
 * @param {Array|Object} data - The query results
 */
function convertToGeoJsonString(data) {
    let result = GeoJSON.parse(data, { Point: ['Latitude', 'Longitude'] });
    return JSON.stringify(result);
}
Question
Please help me understand why I'm getting runaway connections and how to fix it. For bonus points: what's the ideal strategy for handling high DB concurrency on Lambda?
Ultimately this service needs to handle several times the current load -- I realize this becomes quite an intense load. I'm open to options like read replicas or other read-performance-boosting measures, as long as they're compatible with SQL Server and aren't just a cop-out for writing proper DB access code.
Please let me know if I can improve the question. I know there are similar ones out there but I have read/tried a lot of them and didn't find them to help. Thanks in advance!
Related Material
https://forums.aws.amazon.com/thread.jspa?messageID=678029 (old, but similar)
https://www.slideshare.net/AmazonWebServices/best-practices-for-using-aws-lambda-with-rdsrdbms-solutions-srv320 re:Invent slide deck
https://www.jeremydaly.com/reuse-database-connections-aws-lambda/ Relevant info but for MySQL instead of SQL Server
Answer
I finally found the answer after 4 days of effort. All I needed to do was scale up the DB. The code is actually fine as-is.
I went from db.t2.micro to db.t3.small (or 1 vCPU, 1GB RAM to 2 vCPU and 2GB RAM) at a net cost of roughly $15/mo.
Theory
In my case, the DB probably couldn't handle the processing (which involves several geographic calculations) for all my invocations at once. I did see CPU go up, but I assumed that was a result of the high number of open connections. When the queries slowed down, concurrent invocations piled up as Lambdas waited for results, finally causing them to time out without closing their connections properly.
Comparisons:
db.t2.micro:
200+ DB connections (goes up continuously if you leave it running)
50+ concurrent invocations
5000+ ms Lambda duration when things slow down, ~300ms under no load
db.t3.small:
25-35 DB connections (constantly)
~5 concurrent invocations
~33 ms Lambda duration <-- ten times faster!
(CloudWatch dashboard screenshot)
Summary
I think this issue was confusing to me because it didn't smell like a capacity issue. Almost every time I've dealt with high DB connections in the past, it has been a code error. Having tried options there, I thought it was "some magical gotcha of serverless" that I needed to understand. In the end it was as simple as changing DB tiers. My takeaway is that DB capacity issues can manifest themselves in ways other than high CPU and memory usage, and that high connections may be a result of something besides a code bug.
Update (4 months in)
This continues to work very well. I'm impressed that doubling the DB resources seems to have given more than 2x the performance. Now, when DB connections get really high due to load (or a temporary bug during development), even over 1k, the DB handles it. I'm not seeing any issues at all with connections timing out or the database getting bogged down under load. Since the original time of writing I've added several CPU-intensive queries to support reporting workloads, and it continues to handle all of these loads simultaneously.
We've also deployed this setup to production for one customer since the time of writing and it handles that workload without issue.
A connection pool doesn't do you much good on Lambda by itself; what you can do is reuse connections.
The trouble is that every Lambda execution environment opens its own pool, which floods the DB just like you're seeing. What you want is one connection per Lambda container. You can use a DB class like the one below (this is rough, but let me know if you've got questions):
// Assumes a promise-based MySQL client such as mysql2/promise
// (note: its query() resolves to a [rows, fields] pair).
import mysql from 'mysql2/promise';

export default class MySQL {
    constructor() {
        this.connection = null;
    }

    async getConnection() {
        // The `state` check comes from the classic mysql driver;
        // adjust the staleness check to whatever client you use.
        if (this.connection === null || this.connection.state === 'disconnected') {
            return this.createConnection();
        }
        return this.connection;
    }

    async createConnection() {
        this.connection = await mysql.createConnection({
            host: process.env.dbHost,
            user: process.env.dbUser,
            password: process.env.dbPassword,
            database: process.env.database,
        });
        return this.connection;
    }

    async query(sql, params) {
        await this.getConnection();
        // Destructure in a single statement: `let rows` followed by a line
        // starting with `[` does not parse the way you might expect.
        const [err, rows] = await to(this.connection.query(sql, params));
        if (err) {
            console.log(err);
            return false;
        }
        return rows;
    }
}

// Tiny helper that resolves a promise to an [error, data] pair
function to(promise) {
    return promise
        .then((data) => [null, data])
        .catch((err) => [err]);
}
What you need to understand is that a Lambda execution environment is a little virtual machine that does a task and then stops. It does sit around for a while, and if another invocation comes in, it gets reused along with the container and its connection; for a single task there are never multiple connections from a single Lambda.
Hope this helps, let me know if you need any more detail! Oh, and welcome to Stack Overflow; that's a well-constructed question.
I am querying DynamoDB for a given primary key. The primary key consists of two UUID fields (fieldUUID1, fieldUUID2).
I have a lot of queries to execute for the above primary-key combination with a list of values, for which I am using asynchronous CompletableFutures with an ExecutorService with a thread pool of size 4.
After all the queries return results as CompletableFuture<Object>, I join them using the allOf method of CompletableFuture, which ensures that all the query executions are complete. That gives me a CompletableFuture<Void>, from which, using a stream, I obtain a CompletableFuture<List<Object>>.
If some of the queries paginate their results, i.e. return a lastEvaluatedKey, there is no way for me to know which QueryRequest produced it.
If I do a .get() call on a received CompletableFuture, it is a blocking operation, which defeats the purpose of going asynchronous. Is there a way I can handle this scenario?
Example:
I can try the thenCompose method, but how do I know at what point to stop, i.e. when lastEvaluatedKey is absent?
for (final QueryRequest queryRequest : queryRequests) {
    final CompletableFuture<QueryResult> futureResult =
        CompletableFuture.supplyAsync(() -> dynamoDBClient.query(queryRequest), executorService);
    futures.add(futureResult);
}

// Wait for completion of all of the futures provided
final CompletableFuture<Void> allFuture =
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[futures.size()]));

// The return type of CompletableFuture.allOf() is CompletableFuture<Void>, and its
// limitation is that it does not carry the combined results. Instead we have to collect
// the results manually; CompletableFuture.join() and the Java 8 Streams API make it simple:
final CompletableFuture<List<QueryResult>> allFutureList =
    allFuture.thenApply(val ->
        futures.stream().map(CompletableFuture::join).collect(Collectors.toList()));

try {
    // At this point all the futures should be done, because we already
    // composed them with CompletableFuture.allOf.
    final List<QueryResult> returnedResult = allFutureList.get();
    for (final QueryResult queryResult : returnedResult) {
        if (MapUtils.isNotEmpty(queryResult.getLastEvaluatedKey())) {
            // how to get hold of the original request and include the last evaluated key?
        }
    }
} catch (InterruptedException | ExecutionException e) {
    // handle appropriately
}
I could rely on the .get() method, but it would be a blocking call.
The quick solution to your need is to change your futures list: instead of storing CompletableFuture<QueryResult>, store CompletableFuture<RequestAndResult>, where RequestAndResult is a simple data class holding a QueryRequest and a QueryResult. To do that you need to change your first loop.
Then, once allFuture completes, you can iterate over futures and get access to both the requests and the results.
However, there is a deeper issue here. What are you planning to do once you have access to the original QueryRequest? My guess is that you want to issue a follow-up request with exclusiveStartKey set to whatever the response's lastEvaluatedKey holds. This means that you will wait for all the original queries to complete and only then issue the next batch. This is inefficient: if a query returned a lastEvaluatedKey, you want to issue its follow-up query as soon as possible.
To achieve this, my advice is to introduce a new method that takes a single QueryRequest object and returns a CompletableFuture<QueryResult>. Its implementation will be roughly as follows:
issue a query with the given request
once the result arrives, check it: if its lastEvaluatedKey is empty, return it as the result of the method
otherwise, update request.exclusiveStartKey and go back to the first step.
Yes, it's a bit harder to do this with CompletableFutures (compared to blocking code), but it is totally doable.
Once you have that method, your code needs to call it once for each of the requests in queryRequests, put the returned CompletableFutures in a list, and do a CompletableFuture.allOf() on that list. Once the allOf future completes, you can just use the results; there is no need to issue follow-up queries.
I'm working on a project to gradually phase out a legacy application.
In the process, as a temporary solution, we integrate with the legacy application through the database.
The legacy application uses transactions with the serializable isolation level.
Because of the database integration with a legacy application, I am, for the moment, best off using the same pessimistic concurrency model and serializable isolation level.
These serialised transactions should wrap not only the SaveChanges call but some reads of data as well.
I do this by
Creating a TransactionScope with serializable isolation level around my DbContext
Create a DbContext
Do some reads
Do some changes to objects
Call SaveChanges on the DbContext
Commit the transaction scope (thus saving the changes)
I am under the notion that this wraps my reads and writes into one serialised transaction and then commits.
I consider this a form of pessimistic concurrency.
However, this article, https://learn.microsoft.com/en-us/aspnet/mvc/overview/getting-started/getting-started-with-ef-using-mvc/handling-concurrency-with-the-entity-framework-in-an-asp-net-mvc-application, states that EF does not support pessimistic concurrency.
My question is:
A: Does EF support my way of using a serializable transaction around reads and writes?
B: Does wrapping the reads and writes in one transaction guarantee that my read data is not changed by others before the transaction commits?
C: This is a form of pessimistic concurrency, right?
One way to achieve pessimistic concurrency is to use something like this:
var options = new TransactionOptions
{
    IsolationLevel = System.Transactions.IsolationLevel.Serializable,
    Timeout = new TimeSpan(0, 0, 0, 10)
};

using (var scope = new TransactionScope(TransactionScopeOption.RequiresNew, options))
{
    // ... stuff here ...
}
In VS2017 it seems you have to right-click TransactionScope and have it add a reference to: Reference Assemblies\Microsoft\Framework.NETFramework\v4.6.1\System.Transactions.dll
However, if you have two threads attempt to increment the same counter, you will find that one succeeds whereas the other thread throws a timeout after 10 seconds. The reason is that when they proceed to saving changes, both need to upgrade their lock to exclusive, but neither can, because the other transaction is already holding a shared lock on the same row. SQL Server will detect the deadlock after a while and fail one of the transactions to resolve it. Failing one transaction releases its shared lock, so the second transaction can upgrade its shared lock to an exclusive lock and proceed with execution.
The way out of this deadlocking is to provide a UPDLOCK hint to the database using something such as:
private static TestEntity GetFirstEntity(Context context)
{
    return context.TestEntities
        .SqlQuery("SELECT TOP 1 Id, Value FROM TestEntities WITH (UPDLOCK)")
        .Single();
}
This code came from Ladislav Mrnka's blog, which now looks to be unavailable. The other alternative is to resort to optimistic locking.
The document states that EF does not have built-in pessimistic concurrency support. But this does not mean you can't have pessimistic locking with EF. So YOU CAN HAVE PESSIMISTIC LOCKING WITH EF!
The recipe is simple:
use transactions (not necessarily serializable, since that can lead to poor performance); read committed is OK to use, but it depends...
do your changes, call dbcontext.SaveChanges()
do lock your table: execute the T-SQL manually, or feel free to use the code attached below
the given T-SQL command with the hints will keep the table locked for the duration of the given transaction
there's one thing you need to take care of: your loaded entities might be obsolete at the point you take the lock, so all entities from the locked table should be re-fetched (reloaded)
I did a lot of pessimistic locking, but optimistic locking is better. You can't go wrong with it.
A typical example where pessimistic locking can't help is a parent-child relation, where you might lock the parent and treat it like an aggregate (so you assume you are the only one with access to the children too). If another thread tries to access the parent object, it won't work (it will be blocked) until the first thread releases the lock on the parent table. But with an ORM, any other coder can load a child independently, and from that point two threads can make changes to the same child object... With pessimistic locking you might mess up the data; with optimistic locking you get an exception, so you can reload valid data and try to save again...
So the code:
public static class DbContextSqlExtensions
{
    public static void LockTable<Entity>(this DbContext context) where Entity : class
    {
        var tableWithSchema = context.GetTableNameWithSchema<Entity>();
        context.Database.ExecuteSqlCommand(
            string.Format("SELECT null as dummy FROM {0} WITH (tablockx, holdlock)", tableWithSchema));
    }
}

public static class DbContextExtensions
{
    public static string GetTableNameWithSchema<T>(this DbContext context)
        where T : class
    {
        var entitySet = GetEntitySet<T>(context);
        if (entitySet == null)
            throw new Exception(string.Format("Unable to find entity set '{0}' in edm metadata", typeof(T).Name));
        var tableName = GetStringProperty(entitySet, "Schema") + "." + GetStringProperty(entitySet, "Table");
        return tableName;
    }

    private static EntitySet GetEntitySet<T>(DbContext context)
    {
        var type = typeof(T);
        var entityName = type.Name;
        var metadata = ((IObjectContextAdapter)context).ObjectContext.MetadataWorkspace;

        IEnumerable<EntitySet> entitySets;
        entitySets = metadata.GetItemCollection(DataSpace.SSpace)
            .GetItems<EntityContainer>()
            .Single()
            .BaseEntitySets
            .OfType<EntitySet>()
            .Where(s => !s.MetadataProperties.Contains("Type")
                || s.MetadataProperties["Type"].ToString() == "Tables");
        var entitySet = entitySets.FirstOrDefault(t => t.Name == entityName);
        return entitySet;
    }

    private static string GetStringProperty(MetadataItem entitySet, string propertyName)
    {
        MetadataProperty property;
        if (entitySet == null)
            throw new ArgumentNullException("entitySet");
        if (entitySet.MetadataProperties.TryGetValue(propertyName, false, out property))
        {
            string str = null;
            if (((property != null) &&
                (property.Value != null)) &&
                (((str = property.Value as string) != null) &&
                !string.IsNullOrEmpty(str)))
            {
                return str;
            }
        }
        return string.Empty;
    }
}
}