How to deal with deadlocks using ReentrantReadWriteLock in microservices - concurrency

We have a deadlock that occurs under heavy load on a microservice (say A), which receives multiple requests from different client services (B, C). Calls from B and C arrive for the same clientId (key), are served by different instances of A, and try to update the same clientId data in the database at the same time, causing the error below.
CannotAcquireLockException is thrown,
(SQL Error: 60, SQLState: 61000..
ORA-00060: deadlock detected while waiting for resource
We have decided to implement sharding at the load balancer (HAProxy) level, which will ensure that the same instance of A always serves the requests from B and C for a specific key (clientId), so multiple instances never process requests for the same key.
With that, everything for a given clientId happens in a single JVM, since requests from B and C for a specific clientId always reach the same instance of A.
Even so, requests from B and C for the same clientId can still arrive within nanoseconds of each other. Multiple threads will then again try to update the same clientId data in the database at the same time, causing the same error.
To address this we are looking at possible solutions, and one candidate is ReentrantReadWriteLock, which in principle should take care of this.
We are using Spring Data JPA, and the save looks like this:
clientJpaRepository.save(clientObject);
Is it possible to use something like the code below?
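public void save(Client clientObject) {
    String clientId = clientObject.getClientId();
    boolean isLockAcquired = false;
    try {
        isLockAcquired = writeLock.tryLock(100, TimeUnit.MILLISECONDS);
        if (isLockAcquired) {
            clientJpaRepository.save(clientObject);
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag
        log.error("exception occurred trying to acquire lock for clientId={}", clientId, e);
    } finally {
        // Unlock only if this thread holds the lock; otherwise unlock() throws IllegalMonitorStateException
        if (isLockAcquired) {
            writeLock.unlock();
        }
    }
}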
I am not sure how this is going to deal with the keys. I don't want threads to block if they want to update a different key (say clientId 2).
Also, another thing to note is that there could be reads of this data happening as part of other API calls. Hopefully they would not wait too long, and I hope I don't need to change anything for the reads.
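For the per-key concern, a minimal sketch (an illustration, not tested against this service): keep one ReentrantLock per clientId in a ConcurrentHashMap, so threads updating different clientIds never wait on each other. KeyedLocks and tryRun are hypothetical names, and the Runnable stands in for the clientJpaRepository.save(clientObject) call. It assumes all requests for a clientId reach the same JVM, which is what the load balancer sharding is meant to guarantee.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper: one lock per key, so different keys never block each other. */
class KeyedLocks {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /**
     * Runs the action while holding the lock for the given key.
     * Returns false, without running the action, if the lock could not be
     * acquired within the timeout or the thread was interrupted while waiting.
     */
    boolean tryRun(String key, long timeout, TimeUnit unit, Runnable action) {
        // computeIfAbsent creates the lock for a key at most once
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        boolean acquired;
        try {
            acquired = lock.tryLock(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
        if (!acquired) {
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock(); // only reached when the lock is actually held
        }
    }
}
```

Usage would look roughly like keyedLocks.tryRun(clientId, 100, TimeUnit.MILLISECONDS, () -> clientJpaRepository.save(clientObject)). The map grows with the number of distinct clientIds, so a long-running service would probably need eviction or a fixed set of striped locks.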
Sorry for the long question. I hope to hear from someone soon.
Thanks.

Related

Using a lock in C++ across multiple tasks

I am not really seeking code examples, but I'm hoping someone can review my program design and provide feedback. I am trying to figure out how do I ensure I have one instance of my "workflow" running at a time.
I am working in C++.
This is my workflow:
1. I read rows off of a Postgres database.
2. If the table has any records, I want to do these instructions:
   a. Read the records and transform them to JSON
   b. Send the JSON document to a remote Web service
   c. Parse the response from the service. The service tells me which records were saved or not saved, based on their primary key.
   d. I delete the successfully saved records
   e. I log the unsuccessful records (there's another process that consumes the logs and so my work is done).
I want to perform all of these steps using a separate thread (or "task", whatever higher-level abstraction is available in C++), and I want to make sure that if my function for [1] gets called multiple times, the additional calls basically get "dropped" if step 1 is already in flight.
In C++, I believe I can use a flag and a mutex. I use something like std::lock_guard<std::mutex> at the top of my method. Then the next line checks for a flag.
// MyWorkflow.cpp
#include <mutex>
#include <string>
#include <vector>

std::mutex myMutex;
int inFlight = 0;

void process() {
    std::lock_guard<std::mutex> guard(myMutex);
    if (inFlight) {
        return; // a run is already in flight, so drop this call
    }
    inFlight = 1;
    std::vector<Widget> widgets = readFromMyTable();
    std::string json = getJson(&widgets);
    ... // Send the json to the remote service and handle the response
}
Okay, let me explain my confusion. I want to use Curl to perform the HTTP request. But Curl works asynchronously. And so if I make the asynchronous HTTP call via Curl, my update function will just return and myMutex will be released, right?
I think in my asynchronous response handler, I need to call a second function that's in MyWorkflow.cpp
void markCompletion() {
    std::lock_guard<std::mutex> guard(myMutex);
    inFlight = 0; // Reset the inflight flag here
}
Is this the right approach? I am worried that if an exception is thrown anywhere before I call markCompletion(), I will block all future callers. I think I need to ensure I have proper exception handling and always call markCompletion().
I am terribly sorry for asking such a noob question, but I really want to learn to do this the right way.

MismatchingMessageCorrelationException : Cannot correlate message ‘onEventReceiver’: No process definition or execution matches the parameters

We are facing a MismatchingMessageCorrelationException for the receive task in some cases (less than 5%).
The callback that notifies the receive task is:
protected void respondToCallWorker(
    @NonNull final String correlationId,
    final CallWorkerResultKeys result,
    @Nullable final Map<String, Object> variables
) {
    try {
        runtimeService.createMessageCorrelation("callWorkerConsumer")
            .processInstanceId(correlationId)
            .setVariables(variables)
            .setVariable("callStatus", result.toString())
            .correlateWithResult();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
When I check the logs, I find that the query executed is this one:
select distinct RES.* from ACT_RU_EXECUTION RES
inner join ACT_RE_PROCDEF P on RES.PROC_DEF_ID_ = P.ID_
WHERE RES.PROC_INST_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0' and RES.SUSPENSION_STATE_ = '1'
and exists (select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = RES.ID_ and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer' )
Sometimes, when I look for the process instance in the database, I find it waiting in the receive task:
SELECT DISTINCT * FROM ACT_RU_EXECUTION RES
WHERE id_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
However, when I check the event subscription, it has not yet been created in the database:
select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer'
I think the solution is to persist the "receive task" before the response for respondToCallWorker arrives, but sadly I can't figure out how.
I tried "async before" on callWorker and on "Message consumer", but it did not work.
I also tried camunda.bpm.database.jdbc-batch-processing=false and got the same results.
I also tried parallel branches, but I get an OptimisticLockingException and a MismatchingMessageCorrelationException.
Maybe I am doing it wrong.
Thanks for your help.
This is an interesting problem. As you already found out, the error happens when you try to correlate the result from the "worker" before the main process has committed its transaction, so no message subscription is registered at the time you correlate.
This problem in process orchestration is described and analyzed in this blog post, which is definitely worth reading.
Taken from that post, here is a design that should solve the issue:
You make message send and receive parallel and put an async before the send task.
By doing so, the async continuation job for the send event and the message subscription are written in the same transaction, so when the async message send executes, you already have the subscription waiting.
Although this should work and solve the issue at the BPMN model level, it might be worth considering options that do not require remodeling the process.
First, instead of calling the worker directly from your delegate, you could (assuming you are on Spring Boot) publish a "CallWorkerCommand" (a simple POJO) and use a TransactionalEventListener on a Spring bean to execute the actual call. That way the BPMN process first commits its transaction, including the message subscription, and only afterwards does Spring execute your worker call.
Second, you could use a retry mechanism like resilience4j around your correlate message call, so in the rare cases where the result comes too quickly, you fail and retry a second later.
Another solution I could think of, since you seem to be using an "external worker" pattern here, is to use an external-task-service task directly, so the send/receive synchronization gets solved by the Camunda external worker API.
So many options to choose from. I would possibly prefer the external task, followed by the transactionalEventListener, but that is a matter of personal preference.
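The retry option can be sketched in plain Java without any library. This is only an illustration: Retry and withRetry are hypothetical names (not resilience4j or Camunda API), the Callable stands in for the correlateWithResult() call, and a real version would probably retry only on MismatchingMessageCorrelationException rather than on every exception.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper; not part of Camunda or resilience4j. */
final class Retry {
    private Retry() {}

    /**
     * Calls the action up to maxAttempts times, sleeping delayMillis between
     * failed attempts. Rethrows the last exception if every attempt fails.
     */
    static <T> T withRetry(int maxAttempts, long delayMillis, Callable<T> action)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // do not retry after an interrupt
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // give the subscription time to be committed
                }
            }
        }
        throw last;
    }
}
```

In the callback this would wrap the correlation, for example Retry.withRetry(3, 1000, () -> runtimeService.createMessageCorrelation("callWorkerConsumer")...correlateWithResult()), with the attempt count and delay chosen to fit how long the engine transaction usually takes.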

Lambda SQL Server RDS Connection Leak

Problem
I'm using mssql v6.2.0 in a Lambda that is invoked frequently (consistently ~25 concurrent invocations under standard load).
I seem to be having trouble with connection pooling or something because I keep having tons of open DB connections which overwhelm my database (SQL Server on RDS) causing the Lambdas to just time out waiting for query results.
I have read the docs, various similar questions, Github issues, etc. but nothing has worked for this particular issue.
Things I've Learned Already
I did learn that pooling is possible across invocations due to the fact that variables outside the handler function are shared across invocations in the same container. This makes me think I should see just a few connections for each container running my Lambda, but I don't know how many that is so it's hard to verify. Bottom line is that pooling should keep me from having tons and tons of open connections, so something isn't working right.
There are several different ways to use mssql and I have tried several of them. Notably I've tried specifying max pool size with both large and small values but got the same results.
AWS recommends that you check to see if there's already a pool before trying to create a new one. I tried that to no avail. It was something like pool = pool || await createPool()
I know that RDS Proxy exists to help with situations like this, but it appears it isn't offered (at this time) for SQL Server instances.
I do have the ability to slow down my data a bit, but this has a slight impact on the performance of the product as a whole, so I don't want to do that just to avoid solving a DB connections issue.
Left unchecked, I saw as many as 700 connections to the DB at once, leading me to think there's a leak of some kind and it's maybe not just a reasonable result of high usage.
I didn't find a way to shorten the TTL for connections on the SQL Server side as recommended by this re:Invent slide. Perhaps that is part of the answer?
Code
'use strict';

/* Dependencies */
const sql = require('mssql');
const fs = require('fs').promises;
const path = require('path');
const AWS = require('aws-sdk');
const GeoJSON = require('geojson');

AWS.config.update({ region: 'us-east-1' });
var iotdata = new AWS.IotData({ endpoint: process.env['IotEndpoint'] });

/* Export */
exports.handler = async function (event) {
    let myVal = event.Records[0].Sns.Message;

    // Gather prerequisites in parallel
    let [
        query1,
        query2,
        pool
    ] = await Promise.all([
        fs.readFile(path.join(__dirname, 'query1.sql'), 'utf8'),
        fs.readFile(path.join(__dirname, 'query2.sql'), 'utf8'),
        sql.connect(process.env['connectionString'])
    ]);

    // Query DB for updated data
    let results = await pool.request()
        .input('MyCol', sql.TYPES.VarChar, myVal)
        .query(query1);

    // Prepare IoT Core message
    let params = {
        topic: `${process.env['MyTopic']}/${results.recordset[0].TopicName}`,
        payload: convertToGeoJsonString(results.recordset),
        qos: 0
    };

    // Publish results to MQTT topic
    try {
        await iotdata.publish(params).promise();
        console.log(`Successfully published update for ${myVal}`);

        // Query 2
        await pool.request()
            .input('MyCol1', sql.TYPES.Float, results.recordset[0]['Foo'])
            .input('MyCol2', sql.TYPES.Float, results.recordset[0]['Bar'])
            .input('MyCol3', sql.TYPES.VarChar, results.recordset[0]['Baz'])
            .query(query2);
    } catch (err) {
        console.log(err);
    }
};

/**
 * Convert query results to GeoJSON for API response
 * @param {Array|Object} data - The query results
 */
function convertToGeoJsonString(data) {
    let result = GeoJSON.parse(data, { Point: ['Latitude', 'Longitude'] });
    return JSON.stringify(result);
}
Question
Please help me understand why I'm getting runaway connections and how to fix it. For bonus points: what's the ideal strategy for handling high DB concurrency on Lambda?
Ultimately this service needs to handle several times the current load -- I realize this becomes a quite intense load. I'm open to options like read replicas or other read-performance-boosting measures as long as they're compatible with SQL Server, and they're not just a cop out for writing proper DB access code.
Please let me know if I can improve the question. I know there are similar ones out there but I have read/tried a lot of them and didn't find them to help. Thanks in advance!
Related Material
https://forums.aws.amazon.com/thread.jspa?messageID=678029 (old, but similar)
https://www.slideshare.net/AmazonWebServices/best-practices-for-using-aws-lambda-with-rdsrdbms-solutions-srv320 re:Invent slide deck
https://www.jeremydaly.com/reuse-database-connections-aws-lambda/ Relevant info but for MySQL instead of SQL Server
Answer
I finally found the answer after 4 days of effort. All I needed to do was scale up the DB. The code is actually fine as-is.
I went from db.t2.micro to db.t3.small (or 1 vCPU, 1GB RAM to 2 vCPU and 2GB RAM) at a net cost of roughly $15/mo.
Theory
In my case, the DB probably couldn't handle the processing (which involves several geographic calculations) for all my invocations at once. I did see CPU go up, but I assumed that was a result of the high number of open connections. When the queries slowed down, the concurrent invocations piled up as Lambdas started to wait for results, finally causing them to time out and not close their connections properly.
Comparisons:
db.t2.micro:
200+ DB connections (goes up continuously if you leave it running)
50+ concurrent invocations
5000+ ms Lambda duration when things slow down, ~300ms under no load
db.t3.small:
25-35 DB connections (constantly)
~5 concurrent invocations
~33 ms Lambda duration <-- ten times faster!
CloudWatch Dashboard
Summary
I think this issue was confusing to me because it didn't smell like a capacity issue. Almost every time I've dealt with high DB connections in the past, it has been a code error. Having tried options there, I thought it was "some magical gotcha of serverless" that I needed to understand. In the end it was as simple as changing DB tiers. My takeaway is that DB capacity issues can manifest themselves in ways other than high CPU and memory usage, and that high connections may be a result of something besides a code bug.
Update (4 months in)
This continues to work very well. I'm impressed that doubling the DB resources seems to have given > 2x performance. Now, when due to load (or a temporary bug during development), the db connections get really high (even over 1k) the DB handles it. I'm not seeing any issues at all with db connections timing out or the database getting bogged down due to load. Since the original time of writing I've added several CPU-intensive queries to support reporting workloads, and it continues to handle all these loads simultaneously.
We've also deployed this setup to production for one customer since the time of writing and it handles that workload without issue.
So a connection pool is no good on Lambda at all. What you can do is reuse connections.
The trouble is that if every Lambda execution opens a pool, it'll just flood the DB like you're seeing. You want 1 connection per Lambda container. You can use a db class like so (this is rough, but let me know if you've got questions):
// Assumption: `mysql` is a promise-based MySQL client imported elsewhere in the module
export default class MySQL {
    constructor() {
        this.connection = null
    }

    // Reuse the container's connection if it is still alive, otherwise open a new one
    async getConnection() {
        if (this.connection === null || this.connection.state === 'disconnected') {
            return this.createConnection()
        }
        return this.connection
    }

    async createConnection() {
        this.connection = await mysql.createConnection({
            host: process.env.dbHost,
            user: process.env.dbUser,
            password: process.env.dbPassword,
            database: process.env.database,
        })
        return this.connection
    }

    async query(sql, params) {
        await this.getConnection()
        const [err, rows] = await to(this.connection.query(sql, params))
        if (err) {
            console.log(err)
            return false
        }
        return rows
    }
}

// Resolves to [null, data] on success or [err] on failure, so callers can skip try/catch
function to(promise) {
    return promise
        .then((data) => [null, data])
        .catch((err) => [err])
}
What you need to understand is that a Lambda execution is a little virtual machine that does a task and then stops. It does sit there for a while, and if anyone else needs it, it gets reused along with the container and connection. For a single task, there are never multiple connections to a single Lambda.
Hope this helps, let me know if you need any more detail! Oh, and welcome to Stack Overflow, that's a well-constructed question.

How to lock a long async call in a WebApi action?

I have this scenario where I have a WebApi and an endpoint that when triggered does a lot of work (around 2-5min). It is a POST endpoint with side effects and I would like to limit the execution so that if 2 requests are sent to this endpoint (should not happen, but better safe than sorry), one of them will have to wait in order to avoid race conditions.
I first tried to use a simple static lock inside the controller like this:
lock (_lockObj)
{
    var results = await _service.LongRunningWithSideEffects();
    return Ok(results);
}
This is of course not possible because of the await inside the lock statement.
Another solution I considered was to use a SemaphoreSlim implementation like this:
await semaphore.WaitAsync();
try
{
    var results = await _service.LongRunningWithSideEffects();
    return Ok(results);
}
finally
{
    semaphore.Release();
}
However, according to MSDN:
The SemaphoreSlim class represents a lightweight, fast semaphore that can be used for waiting within a single process when wait times are expected to be very short.
Since in this scenario the wait times may even reach 5 minutes, what should I use for concurrency control?
EDIT (in response to plog17):
I do understand that passing this task onto a service might be the optimal way, however, I do not necessarily want to queue something in the background that still runs after the request is done.
The request involves other requests and integrations that take some time, but I would still like the user to wait for this request to finish and get a response regardless.
This request is expected to be only fired once a day at a specific time by a cron job. However, there is also an option to fire it manually by a developer (mostly in case something goes wrong with the job) and I would like to ensure the API doesn't run into concurrency issues if the developer e.g. double-sends the request accidentally etc.
If only one request of that sort can be processed at a given time, why not implement a queue?
With such a design, there is no need to lock or wait while the long-running request is processed.
The flow could be:
1. Client POSTs to /RessourcesToProcess and should receive 202-Accepted quickly
2. HttpController simply queues the task for processing (and returns the 202-Accepted)
3. Another service (a Windows service?) dequeues the next task to process
4. It processes the task
5. It updates the resource status
During this process, the client should easily be able to get the status of requests made previously:
- If the task is not found: 404-NotFound. Resource not found for id 123
- If the task is processing: 200-OK. 123 is processing.
- If the task is done: 200-OK. Process response.
Your controller could look like:
public class TaskController
{
    //constructor and private members

    [HttpPost, Route("")]
    public void QueueTask(RequestBody body)
    {
        messageQueue.Add(body);
    }

    // IActionResult assumes ASP.NET Core; classic Web API 2 would use IHttpActionResult
    [HttpGet, Route("{taskId}")]
    public IActionResult GetTaskStatus(string taskId)
    {
        YourThing thing = tasksRepository.Get(taskId);
        if (thing == null)
        {
            return NotFound("thing does not exist");
        }
        if (thing.IsProcessing)
        {
            return Ok("thing is processing");
        }
        // Assumes ResponseContent stays null until processing has finished
        if (thing.ResponseContent == null)
        {
            return Ok("thing is not processing yet");
        }
        //here we assume thing had been processed
        return Ok(thing.ResponseContent);
    }
}
This design suggests that you do not handle long running process inside your WebApi. Indeed, it may not be the best design choice. If you still want to do so, you may want to read:
Long running task in WebAPI
https://blogs.msdn.microsoft.com/webdev/2014/06/04/queuebackgroundworkitem-to-reliably-schedule-and-run-background-processes-in-asp-net/

Eclipse RAP Multi-client but single server thread

I understand how RAP creates scopes and has a specific thread for each client and so on. I also understand that the application scope is shared among all clients; however, I don't know how to access that scope in a single-threaded manner.
I would like to have a server side (with access to databases and stuff) that is a single execution to ensure it has a global knowledge of all transaction and that requests from clients are executed in sequence instead of parallel.
Currently I am accessing the application context as follows from the UI:
synchronized( MyServer.class ) {
    ApplicationContext appContext = RWT.getApplicationContext();
    MyServer myServer = (MyServer) appContext.getAttribute("myServer");
    if (myServer == null){
        myServer = new MyServer();
        appContext.setAttribute("myServer", myServer);
    }
    myServer.doSomething(RWTUtils.getSessionID());
}
Even if I access the myServer object there and trigger requests, the execution will still be running in the UI thread.
For now, the only way to ensure the sequence is to use synchronized on my server, as follows:
public class MyServer {
    String text = "";

    public void doSomething(String string) {
        try {
            synchronized (this) {
                System.out.println("doSomething - start :" + string);
                text += "[" + string + "]";
                System.out.println("text: " + (text));
                Thread.sleep(10000);
                System.out.println("text: " + (text));
                System.out.println("doSomething - stop :" + string);
            }
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Is there a better way to not have to manage the thread synchronization myself?
Any help is welcome
EDIT:
To better explain myself, here is what I mean. Either I trust the database to handle multiple requests properly and additionally handle some other shared knowledge in a synchronized manner to share information between clients (example A), or I find a solution where another thread handles both the knowledge and the database (example B). Of course, the problem here is that one client may block the others, but this can be managed with background threads for long actions; most actions will be no problem. My initial question was: is there already some specific thread of the application scope that does example B, or is example A actually the way to go?
Conclusion (so far)
Basically, option A) is the way to go. Database access will require connection pooling, and shared information will require thoughtful synchronization of key objects. The main attention has to be paid to the database design and the synchronization of objects, to ensure that two clients cannot write incompatible data at the same time (e.g. write contradicting entries that make the result dependent on the write order).
First of all, the way that you create MyServer in the first snippet is not thread safe. You are likely to create more than one instance of MyServer.
You need to synchronize the creation of MyServer, like this for example:
synchronized( MyServer.class ) {
    MyServer myServer = (MyServer) appContext.getAttribute("myServer");
    if (myServer == null){
        myServer = new MyServer();
        appContext.setAttribute("myServer", myServer);
    }
}
See also this post How to implement thread-safe lazy initialization? for other possible solutions.
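As one possible alternative to an explicit synchronized block, the check-then-create step can be made atomic with ConcurrentHashMap.computeIfAbsent. The sketch below uses a hypothetical AppScope class as a stand-in for the application context (it is not RAP API), so it only illustrates the idea.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for an application-scoped attribute store. */
final class AppScope {
    private final ConcurrentMap<String, Object> attributes = new ConcurrentHashMap<>();

    /** Returns the attribute, creating it at most once even under concurrent calls. */
    @SuppressWarnings("unchecked")
    <T> T getOrCreate(String name, Supplier<T> factory) {
        // computeIfAbsent runs the factory atomically and only if the key is missing
        return (T) attributes.computeIfAbsent(name, k -> factory.get());
    }
}
```

With something like this, every caller would write scope.getOrCreate("myServer", MyServer::new) and receive the same instance, without a class-level lock around each lookup.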
Furthermore, your code is calling doSomething() on the client thread (i.e. the UI thread), which will cause each client to wait until pending requests of other clients are processed. The client UI will become unresponsive.
To solve this problem, your code should call doSomething() (or any other long-running operation, for that matter) from a background thread (see also Threads in RAP).
When the background thread has finished, you should use Server Push to update the UI.
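If the goal is to have requests from all clients executed strictly one after another without hand-written synchronization, one option is a single-threaded executor. A rough sketch in plain Java, assuming a hypothetical SerialServer class in place of MyServer. It does not use any RAP API, and updating the UI afterwards would still go through Server Push.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical server facade: all requests run one at a time on a single worker thread. */
final class SerialServer implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final StringBuilder text = new StringBuilder(); // touched only by the worker thread

    /** Queues the request and returns immediately; the Future completes with the text so far. */
    Future<String> doSomething(String sessionId) {
        return worker.submit(() -> {
            text.append('[').append(sessionId).append(']');
            return text.toString();
        });
    }

    /** Stops accepting new requests; already queued requests still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

Because the executor has exactly one thread and a FIFO queue, requests are processed in submission order and the shared state needs no locks. The calling UI thread is not blocked unless it calls get() on the returned Future.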