I'm using Doobie in a ZIO application, and sometimes I get deadlocks (a total freeze of the application). That can happen if I run my app on only one core, or if I reach the maximum number of parallel connections to the database.
My code looks like:
def mkTransactor(cfg: DatabaseConfig): RManaged[Blocking, Transactor[Task]] =
  ZIO.runtime[Blocking].toManaged_.flatMap { implicit rt =>
    val connectEC  = rt.platform.executor.asEC
    val transactEC = rt.environment.get.blockingExecutor.asEC
    HikariTransactor
      .fromHikariConfig[Task](
        hikari(cfg),
        connectEC,
        Blocker.liftExecutionContext(transactEC)
      )
      .toManaged
  }

private def hikari(cfg: DatabaseConfig): HikariConfig = {
  val config = new com.zaxxer.hikari.HikariConfig
  config.setJdbcUrl(cfg.url)
  config.setSchema(cfg.schema)
  config.setUsername(cfg.user)
  config.setPassword(cfg.pass)
  config
}
Alternatively, if I set the leak detection parameter on Hikari (config.setLeakDetectionThreshold(10000L)), I get leak errors which are not explained by the time taken to process the DB queries.
There is a good explanation in the Doobie documentation about the execution contexts and the expectations for each: https://tpolecat.github.io/doobie/docs/14-Managing-Connections.html#about-transactors
According to the docs, the "execution context for awaiting connection to the database" (connectEC in the question) should be bounded.
ZIO, by default, has only two thread pools:
zio-default-async – bounded
zio-default-blocking – unbounded
So it is quite natural to believe that we should use zio-default-async since it is bounded.
Unfortunately, zio-default-async makes the assumption that its operations never, ever block. This is extremely important because it is the execution context used by the ZIO interpreter (its runtime) to run. If you block on it, you can actually block the evaluation progression of the whole ZIO program. This is more likely to happen when there is only one core available.
The problem is that the execution context for awaiting a DB connection is meant to block, waiting for free space in the Hikari connection pool. So we should not use zio-default-async for this execution context.
The next question is: does it make sense to create a new thread pool and a corresponding execution context just for connectEC? Nothing forbids you from doing so, but it is likely not necessary, for three reasons:
You want to avoid creating thread pools, especially since you likely have several already created by your web framework, DB connection pool, scheduler, etc. Each thread pool has its cost, for example:
more for the JVM to manage;
more OS resources consumed;
more context switching between threads, which is expensive in terms of performance;
a more complex application runtime to understand (more complex thread dumps, etc.).
ZIO's thread pool ergonomics are already well optimized for their intended usage.
At the end of the day, you will have to manage your timeout somewhere, and the connection layer is not the part of the system most likely to have enough information to know how long it should wait: different interactions (i.e., in the outer parts of your app, nearer to the use points) may require different timeout/retry logic.
All that being said, we found a configuration that works very well in an application running in production:
// zio.interop.catz._ provides a `zioContextShift`
val xa = (for {
  // our transaction EC: waits for acquire/release of connections, must accept blocking operations
  te <- ZIO.access[Blocking](_.get.blockingExecutor.asEC)
} yield {
  Transactor.fromDataSource[Task](datasource, te, Blocker.liftExecutionContext(te))
}).provide(ZioRuntime.environment).runNow

def transactTask[T](query: Transactor[Task] => Task[T]): Task[T] = {
  query(xa)
}
I made a drawing of how the Doobie and ZIO execution contexts map to each other: https://docs.google.com/drawings/d/1aJAkH6VFjX3ENu7gYUDK-qqOf9-AQI971EQ4sqhi2IY
UPDATE: I created a repo with 3 examples of that pattern's usage (mixed app, pure app, ZLayer app) here: https://github.com/fanf/test-zio-doobie
Any feedback is welcome.
Ignoring the absence of the Mix config file, I've written the following:
defmodule Test.Supervisor do
  use Supervisor

  def start_link do
    #"name:" will show up in :observer...
    Supervisor.start_link(__MODULE__, [], [name: :"root_supervisor"])
  end

  def init(args) do
    children = [
      worker(Test.Method, [], [function: :start, id: "my_root_process"]),
    ]
    supervise(children, [strategy: :one_for_one, name: :root])
  end
end

defmodule Test do
  def start(_type, _args) do
    Test.Supervisor.start_link()
  end
end

defmodule Test.Method do
  def start do
    IO.puts("Expect to see me often... #{self}")
  end
end
This crashes after the first run (iex -S mix), and the application is not restarted. The error message is:
=INFO REPORT==== 14-Jan-2016::22:34:04 ===
application: logger
exited: stopped
type: temporary
** (Mix) Could not start application mememe: Test.start(:normal, {}) returned
an error: shutdown: failed to start child: "my_root_process"
** (EXIT) :ok
If however, I change Test.start() to call Test.Method.start() directly, like so:
defmodule Test do
  def start(_type, _args) do
    Test.Method.start()
  end
end
Then it runs fine, but the code isn't supervised.
I'm quite sure I'm making an elementary mistake either in implementation, or comprehension here, but what is that mistake exactly?
There are a couple of problems with your code. First, you need a long-running function to supervise, something like:
def loop do
  receive do
    _anything -> IO.puts "Expect to see me often"
  end
  loop()
end
Then, in the Test.Method module, you have to spawn it:
def start do
  IO.puts("Starting...")
  pid = spawn_link(&loop/0)
  {:ok, pid}
end
It is important that the start function returns the tuple {:ok, pid_to_supervise}. Your app was crashing because the supervisor expected a process to monitor, but only got the atom :ok returned by IO.puts. A worker specification does not spawn a new process; it requires a function that returns the pid of the spawned process.
You should also link the supervisor to the supervised process, so in the end it might be a good idea to rename the function to start_link instead of start, as Jason Harrelson suggested.
This should be enough to properly start your project. Note that you will not see your processes in observer's Applications section. You are not using the Application behaviour, so your root_supervisor will be floating somewhere; you can find it in the Processes tab. my_root_process is an id used within the supervisor, so it won't be visible even in the Processes tab.
Spawning a process this way is easy for educational purposes, but in a real-world system you would like your processes to follow the OTP design principles. That means reacting to system messages, better logging, tracing, and debugging. Making a process that meets all these requirements is quite hard, but you don't have to do it manually: the OTP behaviours implement those principles for you.
So instead of spawning a process with loop, try using GenServer.
I would try changing the Test.Method.start function to a Test.Method.start_link function and stop passing function: :start in your opts to the worker function. The supervisor calls start_link by default, and there is no reason to break these semantics, as the supervisor will always link to the supervised process. If this does not work, then at least we have ruled out an issue in this area.
I'm writing a healthcheck endpoint for my web service.
The endpoint calls a series of functions that return True if the component is working correctly:
The system is considered to be working if all the components are working:
def is_health():
    healthy = all(r for r in (database(), cache(), worker(), storage()))
    return healthy
When things aren't working, the functions may take a long time to return. For example if the database is bogged down with slow queries, database() could take more than 30 seconds to return.
The healthcheck endpoint runs in the context of a Django view, running inside a uWSGI container. If the request / response cycle takes longer than 30 seconds, the request is harakiri-ed!
This is a huge bummer, because I lose all contextual information that I could have logged about which component took a long time.
What I'd really like, is for the component functions to run within a timeout or a deadline:
with timeout(seconds=30):
    database_result = database()
    cache_result = cache()
    worker_result = worker()
    storage_result = storage()
In my imagination, as the deadline / harakiri timeout approaches, I could abort the remaining health checks and just report the work I've completed.
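For reference, here is a minimal sketch of what such a timeout context manager could look like using SIGALRM (Python 3; the exception message is made up, alarms only take whole seconds, this only works in the main thread of the process, and, per the update below, signals are probably a mistake under uWSGI anyway):

import signal
from contextlib import contextmanager


@contextmanager
def timeout(seconds):
    # Raise in the main thread when the alarm fires.
    def _handle_alarm(signum, frame):
        raise TimeoutError("timed out after %s seconds" % seconds)

    previous = signal.signal(signal.SIGALRM, _handle_alarm)
    signal.alarm(seconds)  # SIGALRM only supports whole seconds
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)

With something like this, whichever component check is still running when the deadline hits raises TimeoutError, the remaining checks are skipped, and the traceback tells you which check was the slow one.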
What's the right way to handle this sort of thing?
I've looked at threading.Thread and Queue.Queue - the idea being that I create a work queue and a result queue, then use a thread to consume the work queue while placing the results in the result queue. Then I could use the thread's Thread.join function, with a timeout, to stop waiting for the rest of the components.
The one challenge there is that I'm not sure how to hard-exit the thread - I wouldn't want it hanging around forever if it didn't complete its run.
Here is the code I've got so far. Am I on the right track?
import Queue
import threading
import time


class WorkThread(threading.Thread):
    def __init__(self, work_queue, result_queue):
        super(WorkThread, self).__init__()
        self.work_queue = work_queue
        self.result_queue = result_queue
        self._timeout = threading.Event()

    def timeout(self):
        self._timeout.set()

    def timed_out(self):
        return self._timeout.is_set()

    def run(self):
        while not self.timed_out():
            try:
                work_fn, work_arg = self.work_queue.get()
                retval = work_fn(work_arg)
                self.result_queue.put(retval)
            except (Queue.Empty):
                break


def work(retval, timeout=1):
    time.sleep(timeout)
    return retval


def main():
    # Two work items that will take at least two seconds to complete.
    work_queue = Queue.Queue()
    work_queue.put_nowait([work, 1])
    work_queue.put_nowait([work, 2])
    result_queue = Queue.Queue()

    # Run the `WorkThread`. It should complete one item from the work queue
    # before it times out.
    t = WorkThread(work_queue=work_queue, result_queue=result_queue)
    t.start()
    t.join(timeout=1.1)
    t.timeout()

    results = []
    while True:
        try:
            result = result_queue.get_nowait()
            results.append(result)
        except (Queue.Empty):
            break

    print results


if __name__ == "__main__":
    main()
Update
It seems like in Python you've got a few options for timeouts of this nature:
Use SIGALRM, which works great if you have full control of the signals used by the process, but is probably a mistake when you're running in a container like uWSGI.
Threads, which give you limited timeout control (a rough sketch of this approach follows this list). Depending on your container environment (like uWSGI), you might need to set options to enable them.
Subprocesses, which give you full timeout control, but you need to be conscious of how they might change how your service consumes resources.
Use existing network timeouts. For example, if part of your healthcheck is to use Celery workers, you could rely on AsyncResult's timeout parameter to bound execution.
Do nothing! Log at regular intervals. Analyze later.
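To make the threads option concrete, here is a rough Python 3 sketch using concurrent.futures; the run_checks helper and the 25-second budget are hypothetical placeholders, not part of any framework. The caveat from the list applies: a check that blows the budget keeps running in its background thread until it returns on its own - you only stop waiting for it.

import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CheckTimeout


def run_checks(checks, budget=25.0):
    """Run each named check in a worker thread, within a total time budget."""
    pool = ThreadPoolExecutor(max_workers=len(checks))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    deadline = time.time() + budget

    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.time())
        try:
            results[name] = bool(future.result(timeout=remaining))
        except CheckTimeout:
            results[name] = False  # this component is the slow one - log it here
        except Exception:
            results[name] = False  # a crashing check also counts as unhealthy

    # Don't block on checks that overran the budget; their threads finish in the background.
    pool.shutdown(wait=False)
    return results


# Hypothetical usage with the component checks from the question:
# healthy = all(run_checks({"database": database, "cache": cache,
#                           "worker": worker, "storage": storage}).values())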
I'm exploring the benefits of these different options more.
Update #2
I put together a GitHub repo with quite a bit more information on the topic:
https://github.com/johnboxall/pytimeout
I'll type it up into an answer one day, but the TL;DR is here:
https://github.com/johnboxall/pytimeout#recommendations
There is a pattern I've seen occasionally where the init/1 function of a gen_server process will send a message to itself signalling that it should be initialized. The purpose of this is for the gen_server process to initialize itself asynchronously so that the process spawning it doesn't have to wait. Here is an example:
-module(test).
-compile(export_all).

init([]) ->
    gen_server:cast(self(), init),
    {ok, {}}.

handle_cast(init, {}) ->
    io:format("initializing~n"),
    {noreply, lists:sum(lists:seq(1, 10000000))};
handle_cast(m, X) when is_integer(X) ->
    io:format("got m. X: ~p~n", [X]),
    {noreply, X}.

b() ->
    receive P -> {} end,
    gen_server:cast(P, m),
    b().

test() ->
    B = spawn(fun test:b/0),
    {ok, A} = gen_server:start_link(test, [], []),
    B ! A.
The process assumes that the init message will be received before any other message - otherwise it will crash. Is it possible for this process to get the m message before the init message?
Let's assume there's no process sending messages to random pids generated by list_to_pid, since any application doing this will probably not work at all, regardless of the answer to this question.
The theoretical answer to the question "is it possible for a process to get a message before the init message?" is YES.
But practically (when no process is doing list_to_pid and sending a message to this process), the answer is NO, provided the gen_server is not a registered process.
This is because gen_server:start_link does not return until the gen_server's init callback has been executed. The init message is therefore already the first message in the process mailbox before any other process can get hold of the Pid to send it a message. Thus your process is safe and does not receive any other message before init.
But the same does not hold true for a registered process, as another process might send a message to the gen_server using the registered name even before the init callback completes.
Let's consider this test function:
test() ->
    Times = lists:seq(1, 1000),
    spawn(gen_server, start_link, [{local, ?MODULE}, ?MODULE, [], []]),
    [gen_server:cast(?MODULE, No) || No <- Times].
The sample output is
1> async_init:test().
Received:356
Received:357
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
Received:358
Received:359
2> Received:360
2> Received:361
...
2> Received:384
2> Received:385
2> Initializing
2> Received:386
2> Received:387
2> Received:388
2> Received:389
...
You can see that the gen_server received messages 356 to 385 before initializing.
Thus the async init callback does not work in the registered-name scenario.
This can be solved in two ways:
1. Register the process after the Pid is returned:
start_link_reg() ->
    {ok, Pid} = gen_server:start_link(?MODULE, [], []),
    register(?MODULE, Pid),
    {ok, Pid}.
2. Or register the process in the handle_cast clause for the init message:
handle_cast(init, State) ->
    register(?MODULE, self()),
    io:format("Initializing~n"),
    {noreply, State};
The sample output after this change is
1> async_init:test().
Initializing
Received:918
Received:919
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
Received:920
2> Received:921
2> Received:922
...
Thus sending a message to itself for initialization does not ensure that it is the first message received, but with a bit of change in the code (and design) you can ensure that it is the first one to be processed.
In this particular case, you would be safe in the assumption that the 'init' message will be received before 'm'. In general (and especially if you register your process) this is not true though.
If you want to be 100% safe knowing that your init code will run first you can do something like:
start_link(Args) ->
    gen_server:start_link(test, [self() | Args], []).

init([Parent | Args]) ->
    State = do_your_synchronous_start_stuff_here,
    proc_lib:init_ack(Parent, {ok, self()}),
    do_your_async_initializing_here,
    io:format("initializing~n"),
    {ok, State}.
I didn't test this, so I don't know if the "bonus" init_ack will print an ugly message to the terminal or not. If it does, the code has to be expanded slightly, but the general idea still stands. Let me know and I'll update my answer.
Your sample code is safe and m is always received after init.
However, from a theoretical point of view, if init/1 handler of a gen_server sends a message to itself, using gen_server:cast/2 or the send primitive, it is not guaranteed to be the first message.
There is no way to guarantee this, simply because init/1 is executed within the gen_server's own process, and therefore after the process has been created and allocated a pid and a mailbox. In non-SMP mode, the scheduler can schedule the process out under some load before the init function is called or before the message is sent, since calling a function (such as gen_server:cast/2, or the init handler for that matter) generates a reduction, and the BEAM emulator checks whether it is time to give some time to other processes. In SMP mode, you can have another scheduler running code that sends a message to your process.
What distinguishes theory from practice is the way to find out about the existence of the process (in order to send it a message before the init message). Code could use links from the supervisor, the registered name, the list of processes returned by erlang:processes(), or even call list_to_pid/1 with random values or deserialize pids with binary_to_term/1. Your node might even get a message from another node with a serialized pid, especially considering that the creation number wraps around after 3 (see your other question Wrong process getting killed on other node?).
This is unlikely in practice. As a result, from a practical point of view, every time this pattern is used, the code can be designed to ensure init message is received first and the server is initialized before it receives other messages.
If the gen_server is a registered process, you would start it from a supervisor and ensure that all clients are started afterward in the supervision tree or introduce some kind of (probably inferior) synchronization mechanism. This is required even if you do not use this pattern of asynchronous initialization (otherwise clients could not reach the server). Of course, you might still have issues in case of crashes and restarts of this gen_server, but this is true whatever the scenario, and you can only be saved by a carefully crafted supervision tree.
If the gen_server is not registered or referred to by name, clients will eventually call gen_server:call/2,3 or gen_server:cast/2 with a pid that they ultimately obtained through the supervisor, which calls gen_server:start_link/3. gen_server:start_link/3 only returns when init/1 has returned, and therefore after the init message was enqueued. This is exactly what your code above does.
gen_server uses proc_lib:init_ack to make sure that the process is properly started before returning the pid from start_link. So the message sent in init will be the first message.
This is not 100% safe!
In gen.erl line 117-129, we can see this:
init_it(GenMod, Starter, Parent, Mod, Args, Options) ->
    init_it2(GenMod, Starter, Parent, self(), Mod, Args, Options).

init_it(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    case name_register(Name) of
        true ->
            init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options);
        {false, Pid} ->
            proc_lib:init_ack(Starter, {error, {already_started, Pid}})
    end.

init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    GenMod:init_it(Starter, Parent, Name, Mod, Args, Options).
In init_it/7 the process registers its Name first, and then init_it2/7 calls GenMod:init_it/6, which in turn calls your init/1 function.
Admittedly, before gen_server:start_link returns, it is hardly possible to guess the new process id. However, if you send a message to the server using the registered Name, and that message arrives before the gen_server:cast from your init/1, your code will be wrong.
Daniel's solution may be right, but I'm not quite sure whether calling proc_lib:init_ack twice will cause an error or not. In any case, the parent would never like to receive an unexpected message. >_<
Here is another solution: keep a flag in your gen_server state that marks whether the server is initialized. When you receive m, check whether the server is initialized; if not, gen_server:cast m to yourself again.
This is a slightly troublesome solution, but I'm sure it is correct. =_=
I'm a freshman here, how I wish I could add a comment. >"<