Experiencing deadlocks when using the Hikari transactor for Doobie with ZIO - concurrency

I'm using Doobie in a ZIO application, and sometimes I get deadlocks (a total freeze of the application). This can happen if I run my app on only one core, or if I reach the maximum number of parallel connections to the database.
My code looks like:
def mkTransactor(cfg: DatabaseConfig): RManaged[Blocking, Transactor[Task]] =
  ZIO.runtime[Blocking].toManaged_.flatMap { implicit rt =>
    val connectEC = rt.platform.executor.asEC
    val transactEC = rt.environment.get.blockingExecutor.asEC
    HikariTransactor
      .fromHikariConfig[Task](
        hikari(cfg),
        connectEC,
        Blocker.liftExecutionContext(transactEC)
      )
      .toManaged
  }
private def hikari(cfg: DatabaseConfig): HikariConfig = {
  val config = new com.zaxxer.hikari.HikariConfig
  config.setJdbcUrl(cfg.url)
  config.setSchema(cfg.schema)
  config.setUsername(cfg.user)
  config.setPassword(cfg.pass)
  config
}
If I instead set the leak detection parameter on Hikari (config.setLeakDetectionThreshold(10000L)), I get leak errors that are not explained by the time taken to process the DB queries.

There is a good explanation in the Doobie documentation about the execution contexts and the expectations for each: https://tpolecat.github.io/doobie/docs/14-Managing-Connections.html#about-transactors
According to the docs, the "execution context for awaiting connection to the database" (connectEC in the question) should be bounded.
ZIO, by default, has only two thread pools:
zio-default-async – Bounded
zio-default-blocking – Unbounded
So it is quite natural to believe that we should use zio-default-async since it is bounded.
Unfortunately, zio-default-async assumes that its operations never, ever block. This is extremely important because it is the execution context the ZIO runtime itself runs on: if you block a thread on it, you can stall the evaluation of the whole ZIO program. This happens all the more easily when only one core is available.
The problem is that the execution context for awaiting a DB connection is meant to block, waiting for free space in the Hikari connection pool. So we should not use zio-default-async for this execution context.
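For contrast, here is a minimal sketch (not from the question's code; ZIO 1.x style) of how a thread-blocking call is normally shifted onto zio-default-blocking explicitly:

import zio._
import zio.blocking._

// effectBlocking runs the thunk on the blocking pool (zio-default-blocking),
// so no zio-default-async thread is ever parked.
val waitForConnection: ZIO[Blocking, Throwable, Unit] =
  effectBlocking {
    Thread.sleep(1000) // stands in for "wait for a free connection from the pool"
  }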
The next question is: does it make sense to create a new thread pool and a corresponding execution context just for connectEC? Nothing forbids you from doing so, but it is likely not necessary, for three reasons:
You want to avoid creating thread pools, especially since you likely have several already created by your web framework, DB connection pool, scheduler, etc. Each thread pool has its own cost. Some examples:
More for the JVM to manage
Consumes more OS resources
More switching between threads, which is expensive in terms of performance
Makes your application runtime more complex to understand (more complex thread dumps, etc.)
ZIO's thread pool ergonomics are already well optimized for their intended usage
At the end of the day, you will have to manage your timeouts somewhere, and the connection layer is not the part of the system most likely to have enough information to know how long it should wait: different interactions (i.e., the outer parts of your app, nearer to the use sites) may require different timeout/retry logic (see the usage sketch after the configuration below).
All that being said, we found a configuration that works very well in an application running in production:
// zio.interop.catz._ provides a `zioContextShift`
val xa = (for {
  // our transaction EC: waits to acquire/release connections, must accept blocking operations
  te <- ZIO.access[Blocking](_.get.blockingExecutor.asEC)
} yield {
  Transactor.fromDataSource[Task](datasource, te, Blocker.liftExecutionContext(te))
}).provide(ZioRuntime.environment).runNow

def transactTask[T](query: Transactor[Task] => Task[T]): Task[T] = {
  query(xa)
}
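As a usage sketch (the query and names here are hypothetical, not part of the production code above), each call site can then attach its own timeout or retry policy, which is exactly the point made earlier about handling timeouts in the outer parts of the app:

import zio._
import zio.clock.Clock
import zio.duration._
import zio.interop.catz._
import doobie.implicits._

// Hypothetical query built on the transactTask helper above
def findUserName(id: Long): Task[Option[String]] =
  transactTask { xa =>
    sql"select name from users where id = $id".query[String].option.transact(xa)
  }

// Different call sites choose different timeout/retry policies
val quickLookup: ZIO[Clock, Throwable, Option[Option[String]]] =
  findUserName(42L).timeout(2.seconds)

val resilientLookup: ZIO[Clock, Throwable, Option[String]] =
  findUserName(42L).retry(Schedule.recurs(3))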
I made a drawing of how the Doobie and ZIO execution contexts map to each other: https://docs.google.com/drawings/d/1aJAkH6VFjX3ENu7gYUDK-qqOf9-AQI971EQ4sqhi2IY
UPDATE: I created a repo with 3 examples of this pattern in use (mixed app, pure app, ZLayer app) here: https://github.com/fanf/test-zio-doobie
Any feedback is welcome.

Related

Lambda SQL Server RDS Connection Leak

Problem
I'm using mssql v6.2.0 in a Lambda that is invoked frequently (consistently ~25 concurrent invocations under standard load).
I seem to be having trouble with connection pooling or something, because I keep seeing tons of open DB connections that overwhelm my database (SQL Server on RDS), causing the Lambdas to time out waiting for query results.
I have read the docs, various similar questions, Github issues, etc. but nothing has worked for this particular issue.
Things I've Learned Already
I did learn that pooling is possible across invocations because variables outside the handler function are shared across invocations in the same container (see the sketch after this list). This makes me think I should see just a few connections for each container running my Lambda, but I don't know how many containers that is, so it's hard to verify. The bottom line is that pooling should keep me from having tons and tons of open connections, so something isn't working right.
There are several different ways to use mssql and I have tried several of them. Notably I've tried specifying max pool size with both large and small values but got the same results.
AWS recommends that you check to see if there's already a pool before trying to create a new one. I tried that to no avail. It was something like pool = pool || await createPool()
I know that RDS Proxy exists to help with situations like this, but it appears it isn't offered (at this time) for SQL Server instances.
I do have the ability to slow down my data a bit, but this has a slight impact on the performance of the product as a whole, so I don't want to do that just to avoid solving a DB connections issue.
Left unchecked, I saw as many as 700 connections to the DB at once, leading me to think there's a leak of some kind and it's maybe not just a reasonable result of high usage.
I didn't find a way to shorten the TTL for connections on the SQL Server side as recommended by this re:Invent slide. Perhaps that is part of the answer?
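For reference, this is roughly the reuse pattern I mean (just a sketch using the same mssql package and connection-string variable; my real handler does more than this):

'use strict';
const sql = require('mssql');

// Created once per container; later invocations in the same container reuse it
let poolPromise = null;

function getPool() {
    if (!poolPromise) {
        poolPromise = sql.connect(process.env['connectionString']);
    }
    return poolPromise;
}

exports.handler = async function (event) {
    const pool = await getPool();
    const result = await pool.request().query('SELECT 1 AS ok'); // placeholder query
    console.log(result.recordset);
};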
Code
'use strict';

/* Dependencies */
const sql = require('mssql');
const fs = require('fs').promises;
const path = require('path');
const AWS = require('aws-sdk');
const GeoJSON = require('geojson');

AWS.config.update({ region: 'us-east-1' });
var iotdata = new AWS.IotData({ endpoint: process.env['IotEndpoint'] });

/* Export */
exports.handler = async function (event) {
    let myVal = event.Records[0].Sns.Message;

    // Gather prerequisites in parallel
    let [
        query1,
        query2,
        pool
    ] = await Promise.all([
        fs.readFile(path.join(__dirname, 'query1.sql'), 'utf8'),
        fs.readFile(path.join(__dirname, 'query2.sql'), 'utf8'),
        sql.connect(process.env['connectionString'])
    ]);

    // Query DB for updated data
    let results = await pool.request()
        .input('MyCol', sql.TYPES.VarChar, myVal)
        .query(query1);

    // Prepare IoT Core message
    let params = {
        topic: `${process.env['MyTopic']}/${results.recordset[0].TopicName}`,
        payload: convertToGeoJsonString(results.recordset),
        qos: 0
    };

    // Publish results to MQTT topic
    try {
        await iotdata.publish(params).promise();
        console.log(`Successfully published update for ${myVal}`);

        // Query 2
        await pool.request()
            .input('MyCol1', sql.TYPES.Float, results.recordset[0]['Foo'])
            .input('MyCol2', sql.TYPES.Float, results.recordset[0]['Bar'])
            .input('MyCol3', sql.TYPES.VarChar, results.recordset[0]['Baz'])
            .query(query2);
    } catch (err) {
        console.log(err);
    }
};

/**
 * Convert query results to GeoJSON for API response
 * @param {Array|Object} data - The query results
 */
function convertToGeoJsonString(data) {
    let result = GeoJSON.parse(data, { Point: ['Latitude', 'Longitude'] });
    return JSON.stringify(result);
}
Question
Please help me understand why I'm getting runaway connections and how to fix it. For bonus points: what's the ideal strategy for handling high DB concurrency on Lambda?
Ultimately this service needs to handle several times the current load -- I realize this becomes quite an intense load. I'm open to options like read replicas or other read-performance-boosting measures as long as they're compatible with SQL Server and aren't just a cop-out for not writing proper DB access code.
Please let me know if I can improve the question. I know there are similar ones out there but I have read/tried a lot of them and didn't find them to help. Thanks in advance!
Related Material
https://forums.aws.amazon.com/thread.jspa?messageID=678029 (old, but similar)
https://www.slideshare.net/AmazonWebServices/best-practices-for-using-aws-lambda-with-rdsrdbms-solutions-srv320 re:Invent slide deck
https://www.jeremydaly.com/reuse-database-connections-aws-lambda/ Relevant info but for MySQL instead of SQL Server
Answer
I finally found the answer after 4 days of effort. All I needed to do was scale up the DB. The code is actually fine as-is.
I went from db.t2.micro to db.t3.small (or 1 vCPU, 1GB RAM to 2 vCPU and 2GB RAM) at a net cost of roughly $15/mo.
Theory
In my case, the DB probably couldn't handle the processing (which involves several geographic calculations) for all my invocations at once. I did see CPU go up, but I assumed that was a result of the high number of open connections. When the queries slowed down, concurrent invocations piled up as Lambdas started waiting for results, eventually causing them to time out without closing their connections properly.
Comparisons:
db.t2.micro:
200+ DB connections (goes up continuously if you leave it running)
50+ concurrent invocations
5000+ ms Lambda duration when things slow down, ~300ms under no load
db.t3.small:
25-35 DB connections (constantly)
~5 concurrent invocations
~33 ms Lambda duration <-- ten times faster!
CloudWatch Dashboard
Summary
I think this issue was confusing to me because it didn't smell like a capacity issue. Almost every time I've dealt with high DB connections in the past, it has been a code error. Having tried the code-level options, I thought it was "some magical gotcha of serverless" that I needed to understand. In the end it was as simple as changing DB tiers. My takeaway is that DB capacity issues can manifest themselves in ways other than high CPU and memory usage, and that high connection counts may be the result of something besides a code bug.
Update (4 months in)
This continues to work very well. I'm impressed that doubling the DB resources seems to have given > 2x performance. Now, when the DB connections get really high due to load (or a temporary bug during development), even over 1k, the DB handles it. I'm not seeing any issues at all with DB connections timing out or the database getting bogged down under load. Since the original time of writing I've added several CPU-intensive queries to support reporting workloads, and it continues to handle all these loads simultaneously.
We've also deployed this setup to production for one customer since the time of writing and it handles that workload without issue.
So a connection pool is no good on Lambda at all; what you can do is reuse connections.
The trouble is that every Lambda execution opens a pool, which will just flood the DB like you're seeing. You want one connection per Lambda container, and you can use a DB class like the one below (this is rough, but let me know if you've got questions):
// `mysql` is assumed to be a promise-based MySQL client (e.g. mysql2/promise);
// the original answer omitted the import.
export default class MySQL {
    constructor() {
        this.connection = null
    }

    async getConnection() {
        if (this.connection === null || this.connection.state === 'disconnected') {
            return this.createConnection()
        }
        return this.connection
    }

    async createConnection() {
        this.connection = await mysql.createConnection({
            host: process.env.dbHost,
            user: process.env.dbUser,
            password: process.env.dbPassword,
            database: process.env.database,
        })
        return this.connection
    }

    async query(sql, params) {
        await this.getConnection()
        // Destructure the [err, rows] tuple returned by the `to` helper below
        const [err, rows] = await to(this.connection.query(sql, params))
        if (err) {
            console.log(err)
            return false
        }
        return rows
    }
}

// Wraps a promise so it resolves to an [err, data] tuple instead of throwing
function to(promise) {
    return promise.then((data) => {
        return [null, data]
    }).catch(err => [err])
}
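Rough usage sketch (the file name and query are made up): create the instance once, outside the handler, so every invocation in the container reuses the same connection:

// db.js exports the MySQL class above
import MySQL from './db'

// Created once per container, not per invocation
const db = new MySQL()

export const handler = async (event) => {
    const rows = await db.query('SELECT * FROM things WHERE id = ?', [event.id])
    return rows
}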
What you need to understand is that a Lambda execution is a little virtual machine that does a task and then stops. It does sit there for a while, and if anyone else needs it, it gets reused along with the container and its connection. Since a container handles a single task at a time, there are never multiple connections from a single Lambda.
Hope this helps, let me know if you need any more detail! Oh, and welcome to Stack Overflow; that's a well-constructed question.

Implementing a custom async task type and await

I am developing a C++ app in which I need to receive messages from an MQ and then parse them according to their type, and for a particular reason I want to make this process (receiving a single message followed by processing it) asynchronous. Since I want to keep things as simple as possible, so that the next developer has no problem continuing the code, I have written a very small class to implement the asynchrony.
I first spawn a new thread and pass a function to it:
task = new thread([&] {
    result = fn();
    isCompleted = true;
});
task->detach();
and in order to await the task I do the following:
while (!isCompleted && !(*cancelationToken))
{
    Sleep(5);
}
state = 1; // marking the task as completed
So far there is no problem and I have not faced any bug or error, but I am not sure whether this is "a good way to do it", and my question is focused on determining that.
Read about std::future and std::async.
If your task runs on another core or processor, the variable isCompleted may not be kept in sync, with a copy of it in each core's cache, so you may end up waiting longer than needed.
If you have to wait for something it is better to use a semaphore.
As said in comments, using standard methods is better anyway.
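For illustration, here is a minimal sketch of the same wait-for-completion flow using std::future and std::async (fn here just stands in for the real message-processing work from the question):

#include <chrono>
#include <future>
#include <iostream>

int fn() { return 42; } // placeholder for the real message processing

int main() {
    // std::async launches the work on another thread; the returned future
    // replaces the hand-rolled isCompleted flag.
    std::future<int> task = std::async(std::launch::async, fn);

    // wait_for gives a bounded wait instead of a Sleep() busy loop; a
    // cancellation check could go inside this loop, as in the original code.
    while (task.wait_for(std::chrono::milliseconds(5)) != std::future_status::ready) {
        // do other work or check a cancellation token here
    }

    std::cout << "result: " << task.get() << '\n';
    return 0;
}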

How to deal with deadlocks using ReentrantReadWriteLock in microservices

We have a deadlock situation that occurred because of heavy load on a microservice (say A), which receives multiple requests from different client services (B, C). These calls from B and C come in for the same clientId (the key), are served by different instances of A, and try to update the same clientId's data in the database at the same time, causing the error below.
CannotAcquireLockException is thrown,
(SQL Error: 60, SQLState: 61000..
ORA-00060: deadlock detected while waiting for resource
We have decided to implement sharding at the load balancer (HAProxy) level, which will ensure that the same instance of A always serves the requests from B and C for a specific key (clientId), so we don't have multiple instances processing requests for the same key (clientId).
Now everything happens within a single JVM, as we have made sure that requests from B and C for a specific clientId always reach the same instance of A.
Even so, it's still possible that requests from B and C arrive for the same clientId nanoseconds apart, and then multiple threads will again try to update the same clientId's data in the database at the same time, causing the same error again.
To improve this we are looking at possible solutions, and one of them is ReentrantReadWriteLock, which should take care of this conceptually.
We are using Spring Data JPA and have a save that looks like:
clientJpaRepository.save(clientObject);
Now, is it possible to use something like the code below?
public void save(Client clientObject) {
    String clientId = clientObject.getClientId();
    boolean isLockAcquired = false;
    try {
        isLockAcquired = writeLock.tryLock(100, TimeUnit.MILLISECONDS);
        if (isLockAcquired) {
            clientJpaRepository.save(clientObject);
        }
    } catch (InterruptedException e) {
        log.error("exception occurred trying to acquire lock for clientId={}", clientId);
    } finally {
        // only unlock if the lock was actually acquired; unlock() would otherwise throw
        if (isLockAcquired) {
            writeLock.unlock();
        }
    }
}
I am not very sure how this is going to deal with the keys; I don't want any threads to block if they want to update a different key (e.g., clientId 2).
Also, another thing to note is that there could be reads happening on this data from the database as part of other API calls. Hopefully those would not have to wait too long, and I hope I don't need to make any changes there for the reads.
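To make the question concrete, this is roughly what I imagine for the per-key part (a sketch only; the lock map is new, while clientJpaRepository and log are the same fields used above):

// needs java.util.concurrent.ConcurrentHashMap, java.util.concurrent.TimeUnit,
// and java.util.concurrent.locks.ReentrantReadWriteLock

// One lock per clientId, so updates for different clientIds never block each other
private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

public void save(Client clientObject) throws InterruptedException {
    String clientId = clientObject.getClientId();
    ReentrantReadWriteLock.WriteLock writeLock =
            locks.computeIfAbsent(clientId, id -> new ReentrantReadWriteLock()).writeLock();
    if (writeLock.tryLock(100, TimeUnit.MILLISECONDS)) {
        try {
            clientJpaRepository.save(clientObject);
        } finally {
            writeLock.unlock();
        }
    } else {
        log.warn("could not acquire lock for clientId={}", clientId);
    }
}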
Sorry for the long question; I hope to hear from someone soon.
Thanks.

Reading all available messages from mpsc UnboundedReceiver without blocking unnecessarily

I have a futures::sync::mpsc::unbounded channel. I can send messages to the UnboundedSender<T> but have problems receiving them from the UnboundedReceiver<T>.
I use the channel to send messages to the UI thread, and I have a function that gets called every frame, and I'd like to read all the available messages from the channel on each frame, without blocking the thread when there are no available messages.
From what I've read, the Future::poll method is kind of what I need: I just poll, and if I get Async::Ready I do something with the message, and if not, I just return from the function.
The problem is that poll panics when there is no task context (I'm not sure what that means or what to do about it).
What I tried:
let (sender, receiver) = unbounded(); // somewhere in the code, doesn't matter
// ...
let fut = match receiver.by_ref().collect().poll() {
    Async::Ready(items_vec) => // do something on UI with items,
    _ => return None
}
this panics because I don't have a task context.
Also tried:
let (sender, receiver) = unbounded(); // somewhere in the code, doesn't matter
// ...
let fut = receiver.by_ref().collect(); // how do I run the future?
tokio::runtime::current_thread::Runtime::new().unwrap().block_on(fut); // this blocks the thread when there are no items in the receiver
I would like help with reading the UnboundedReceiver<T> without blocking the thread when there are no items in the stream (just do nothing then).
Thanks!
You are using futures incorrectly -- you need a Runtime and a bit more boilerplate to get this to work:
extern crate tokio;
extern crate futures;

use tokio::prelude::*;
use futures::future::{lazy, ok};
use futures::sync::mpsc::unbounded;
use tokio::runtime::Runtime;

fn main() {
    let (sender, receiver) = unbounded::<i64>();

    let receiver = receiver.for_each(|result| {
        println!("Got: {}", result);
        Ok(())
    });

    let rt = Runtime::new().unwrap();
    rt.executor().spawn(receiver);

    let lazy_future = lazy(move || {
        sender.unbounded_send(1).unwrap();
        sender.unbounded_send(2).unwrap();
        sender.unbounded_send(3).unwrap();
        ok::<(), ()>(())
    });

    rt.block_on_all(lazy_future).unwrap();
}
Further reading, from Tokio's runtime model:
[...] in order to use Tokio and successfully execute tasks, an application must start an executor and the necessary drivers for the resources that the application’s tasks depend on. This requires significant boilerplate. To manage the boilerplate, Tokio offers a couple of runtime options. A runtime is an executor bundled with all necessary drivers to power Tokio’s resources. Instead of managing all the various Tokio components individually, a runtime is created and started in a single call.
Tokio offers a concurrent runtime and a single-threaded runtime. The concurrent runtime is backed by a multi-threaded, work-stealing executor. The single-threaded runtime executes all tasks and drivers on the current thread. The user may pick the runtime with characteristics best suited for the application.

Handling Interrupt in C++

I am writing a framework for an embedded device which has the ability to run multiple applications. When switching between apps how can I ensure that the state of my current application is cleaned up correctly? For example, say I am running through an intensive loop in one application and a request is made to run a second app while that loop has not yet finished. I cannot delete the object containing the loop until the loop has finished, yet I am unsure how to ensure the looping object is in a state ready to be deleted. Do I need some kind of polling mechanism or event callback which notifies me when it has completed?
Thanks.
Usually if you need to do this type of thing you'll have an OS/RTOS that can handle the multiple tasks (even if the OS is a simple homebrew type thing).
If you don't already have an RTOS, you may want to look into one (there are hundreds available) or look into incorporating something simple like protothreads: http://www.sics.se/~adam/pt/
So you have two threads: one running the kernel and one running the app? You will need to make a function in your kernel, say ReadyToYield(), that the application can call when it's happy for you to close it down. ReadyToYield() would flag the kernel thread to give it the good news and then sit and wait until the kernel thread decides what to do. It might look something like this:
volatile bool appWaitingOnKernel = false;
volatile bool continueWaitingForKernel;
On the app thread call:
void ReadyToYield(void)
{
    continueWaitingForKernel = true;
    appWaitingOnKernel = true;
    while(continueWaitingForKernel == true);
}
On the kernel thread call:
void CheckForWaitingApp(void)
{
    if(appWaitingOnKernel == true)
    {
        appWaitingOnKernel = false;
        if(needToDeleteApp)
            DeleteApp();
        else
            continueWaitingForKernel = false;
    }
}
Obviously, the actual implementation here depends on the underlying O/S but this is the gist.
John.
(1) You need to write thread-safe code. This is not specific to embedded systems.
(2) You need to save state away when you do a context switch.