Elixir - Basic supervisor setup crashes instead of restarting the child process - concurrency

Ignoring the absence of the Mix config file, I write the following:
defmodule Test.Supervisor do
  use Supervisor

  def start_link do
    # "name:" will show up in :observer...
    Supervisor.start_link(__MODULE__, [], [name: :"root_supervisor"])
  end

  def init(_args) do
    children = [
      worker(Test.Method, [], [function: :start, id: "my_root_process"])
    ]
    supervise(children, [strategy: :one_for_one, name: :root])
  end
end

defmodule Test do
  def start(_type, _args) do
    Test.Supervisor.start_link()
  end
end

defmodule Test.Method do
  def start do
    IO.puts("Expect to see me often... #{inspect(self())}")
  end
end
This crashes after the first run (iex -S mix) without restarting the application. The error message is:
=INFO REPORT==== 14-Jan-2016::22:34:04 ===
    application: logger
    exited: stopped
    type: temporary
** (Mix) Could not start application mememe: Test.start(:normal, {}) returned an error: shutdown: failed to start child: "my_root_process"
    ** (EXIT) :ok
If, however, I change Test.start/2 to call Test.Method.start() directly, like so:
defmodule Test do
  def start(_type, _args) do
    Test.Method.start()
  end
end
then it runs fine, but the code won't be supervised.
I'm quite sure I'm making an elementary mistake either in implementation, or comprehension here, but what is that mistake exactly?

There are a couple of problems with your code. First, you need a long-running function to supervise, something like:
def loop do
  receive do
    _anything -> IO.puts("Expect to see me often")
  end

  loop()
end
Then, in the Test.Method module, you have to spawn it:
def start do
  IO.puts("Starting...")
  pid = spawn_link(&loop/0)
  {:ok, pid}
end
It is important that the start function returns the tuple {:ok, pid_to_supervise}. Your app was crashing because the supervisor expected a process to monitor, but got only the atom :ok returned by IO.puts. A worker specification does not spawn a new process; it requires a function that returns the pid of the spawned process (wrapped in {:ok, pid}).
You should also link the supervisor to the supervised process, so in the end it might be a good idea to rename the function to start_link instead of start, as Jason Harrelson suggested in the other answer.
This should be enough to properly start your project. Note that you will not see your processes in Observer's Applications section: you are not using the Application behaviour, so your root_supervisor will be floating somewhere. You can find it in the Processes tab. my_root_process is an id used within the supervisor, so it won't be visible even in the Processes tab.
Spawning a process this way is fine for educational purposes, but in a real-world system you would want your processes to follow the OTP design principles. That means reacting to system messages, plus better logging, tracing and debugging. Writing a process that meets all those requirements is quite hard, but you don't have to do it manually: the OTP behaviours implement those principles for you.
So instead of spawning a process with loop, try using GenServer.
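For example, a minimal GenServer version of the worker could look like the sketch below (untested; the module name comes from the question, and :ping is just an illustrative message). GenServer.start_link/3 returns {:ok, pid}, which is exactly the shape the supervisor expects:

defmodule Test.Method do
  use GenServer

  # The supervisor calls this by default; it returns {:ok, pid}.
  def start_link do
    GenServer.start_link(__MODULE__, [])
  end

  def init(state) do
    IO.puts("Expect to see me often... #{inspect(self())}")
    {:ok, state}
  end

  # Handle any message sent with GenServer.cast(pid, :ping).
  def handle_cast(:ping, state) do
    IO.puts("Expect to see me often")
    {:noreply, state}
  end
end

With this, the worker spec in Test.Supervisor.init/1 no longer needs the function: :start option, e.g. worker(Test.Method, [], id: "my_root_process").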

I would try changing the Test.Method.start function to a Test.Method.start_link function and stop passing function: :start in your opts to the worker function. The supervisor calls start_link by default, and there is no reason to break these semantics, as the supervisor will always link to the supervised process. If this does not work, then at least we have ruled out an issue in this area.
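Concretely, the suggested change might look like this sketch (reusing the loop/0 from the answer above; untested):

defmodule Test.Method do
  # Renamed from start/0 so the supervisor's default applies.
  def start_link do
    pid = spawn_link(&loop/0)
    {:ok, pid}
  end

  defp loop do
    receive do
      _anything -> IO.puts("Expect to see me often")
    end

    loop()
  end
end

# In Test.Supervisor.init/1, drop the function: option:
# children = [worker(Test.Method, [], id: "my_root_process")]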

Related

How to test GenServer restart behaviour?

In my app I have a GenServer that backs up the data it needs to start again into an Agent. I want to test whether the GenServer backs up and restores correctly, so I wanted to start the backup Agent, then restart the GenServer and see if it works (i.e. remembers the config from before the restart).
Right now the GenServer is configured and started (with start_supervised!) in the test setup. I need to somehow restart that GenServer.
Is there a good way to do it? Should I be doing it completely differently? Is there a different, correct way of testing restart behavior?
A Supervisor decides when to restart a process under its supervision through the child_spec of that child process. By default, when you define your GenServer module with use GenServer, the restart value will be :permanent, which means: always restart this process if it exits.
Given this, it should be enough to send it an exit signal with Process.exit(your_gen_server_pid, :kill) (:kill ensures the process is killed even if it is trapping exits); the supervisor should then start the process again and you can do your assertions.
You'll need a way to address the "new" GenServer process: since the old one is killed, the pid after the restart won't be the same as the original. Usually you do that by providing a name when starting it.
If your GenServer loads the state as part of its init, you don't necessarily need to supervise it to test the backup behaviour; you could just start it individually, kill it, and then start it again.
There might be edge-cases depending on how you establish the backup, etc, but normally that would be enough.
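As a rough sketch, a test along those lines could look like this (untested; MyServer, the :get_config call, and expected_config/0 are placeholders for your own module and assertions):

test "recovers its state after being killed" do
  # Assumes MyServer.start_link/1 registers itself under the given name.
  pid = start_supervised!({MyServer, name: MyServer})
  ref = Process.monitor(pid)

  Process.exit(pid, :kill)
  assert_receive {:DOWN, ^ref, :process, ^pid, :killed}

  # Poll until the test supervisor has restarted it under the same name.
  new_pid =
    Enum.find_value(1..200, fn _ ->
      case GenServer.whereis(MyServer) do
        new when is_pid(new) and new != pid -> new
        _ -> Process.sleep(5) && nil
      end
    end)

  assert is_pid(new_pid)
  assert GenServer.call(new_pid, :get_config) == expected_config()
end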
UPDATE:
To address both the process exiting and being up again, you could write 2 helper functions to deal specifically with that.
def ensure_exited(pid, timeout \\ 1_000) do
  true = Process.alive?(pid)
  ref = Process.monitor(pid)
  Process.exit(pid, :kill)

  receive do
    {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
  after
    timeout -> :timeout
  end
end
You could make it take instead a name and do GenServer.whereis to retrieve the pid, but the idea is the same.
To make sure it's alive:
def is_back_up?(name, max \\ 200, tries \\ 0)

def is_back_up?(name, max, tries) when tries <= max do
  case GenServer.whereis(name) do
    nil ->
      Process.sleep(5)
      is_back_up?(name, max, tries + 1)

    _pid ->
      true
  end
end

def is_back_up?(_, _, _), do: false
That's the basic idea. I'm not sure if there are already some helpers to do this sort of thing.
Then you just use that (you could write a 3rd helper that takes the live pid, the name, and does it all in one "step"), or write:
:ok = ensure_exited(pid)
true = is_back_up?(name)
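For example, the combined helper mentioned above could be a thin wrapper over the two sketches (same hypothetical helper names as before):

def restart!(pid, name, timeout \\ 1_000) do
  :ok = ensure_exited(pid, timeout)
  true = is_back_up?(name)
  # Return the pid of the restarted process so the test can keep using it.
  GenServer.whereis(name)
end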

Akka system.shutdown and awaitTermination

The following code is called at the very end of my program (it's written in JRuby):
@na.tell(PoisonPill) if defined? @na # @na, @sa and @pe are Actors
@sa.tell(PoisonPill) if defined? @sa
@pe.tell(PoisonPill) if defined? @pe
@@system.shutdown # @@system is the ActorSystem
@@system.awaitTermination
I found this approach here but I don't understand why it works.
Does awaitTermination wait for all Actors to terminate?
Isn't @@system shut down before awaitTermination is called?
Edit: I noticed that I don't even need to call tell(PoisonPill). I commented it out and it still works...
Okay, I solved it now. When I call system.shutdown, all actors terminate after their current task. That's not what I want, because there could be more tasks in the queue.
So I send a PoisonPill to each actor at the end of my main thread and then wait for them to terminate. I also use the function postStop in each actor to set a finished flag and shut down the system when all actors have finished.
import Actors # needed for Java-style poisonPill
actor1.tell(Actors::poisonPill) # for Akka 2.0.0
actor2.tell(Actors::poisonPill)
system.awaitTermination

Can Amazon Simple Workflow (SWF) be made to work with jRuby?

For uninteresting reasons, I have to use jRuby on a particular project where we also want to use Amazon Simple Workflow (SWF). I don't have a choice in the jRuby department, so please don't say "use MRI".
The first problem I ran into is that jRuby doesn't support forking and SWF activity workers love to fork. After hacking through the SWF ruby libraries, I was able to figure out how to attach a logger and also figure out how to prevent forking, which was tremendously helpful:
AWS::Flow::ActivityWorker.new(
  swf.client, domain, "my_tasklist", MyActivities
) do |options|
  options.logger = Logger.new("logs/swf_logger.log")
  options.use_forking = false
end
This prevented forking, but now I'm hitting more exceptions deep in the SWF source code having to do with Fibers and the context not existing:
Error in the poller, exception:
AWS::Flow::Core::NoContextException: AWS::Flow::Core::NoContextException stacktrace:
"aws-flow-2.4.0/lib/aws/flow/implementation.rb:38:in 'task'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:292:in 'respond_activity_task_failed'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:204:in 'respond_activity_task_failed_with_retry'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:335:in 'process_single_task'",
"aws-flow-2.4.0/lib/aws/decider/task_poller.rb:388:in 'poll_and_process_single_task'",
"aws-flow-2.4.0/lib/aws/decider/worker.rb:447:in 'run_once'",
"aws-flow-2.4.0/lib/aws/decider/worker.rb:419:in 'start'",
"org/jruby/RubyKernel.java:1501:in `loop'",
"aws-flow-2.4.0/lib/aws/decider/worker.rb:417:in 'start'",
"/Users/trcull/dev/etl/flow/etl_runner.rb:28:in 'start_workers'"
This is the SWF code at that line:
# @param [Future] future
#   Unused; defaults to **nil**.
#
# @param block
#   The block of code to be executed when the task is run.
#
# @raise [NoContextException]
#   If the current fiber does not respond to `Fiber.__context__`.
#
# @return [Future]
#   The tasks result, which is a {Future}.
#
def task(future = nil, &block)
  fiber = ::Fiber.current
  raise NoContextException unless fiber.respond_to? :__context__
  context = fiber.__context__
  t = Task.new(nil, &block)
  task_context = TaskContext.new(:parent => context.get_closest_containing_scope, :task => t)
  context << t
  t.result
end
I fear this is another flavor of the same forking problem and also fear that I'm facing a long road of slogging through SWF source code and working around problems until I finally hit a wall I can't work around.
So, my question is, has anyone actually gotten jRuby and SWF to work together? If so, is there a list of steps and workarounds somewhere I can be pointed to? Googling for "SWF and jRuby" hasn't turned up anything so far and I'm already 1 1/2 days into this task.
I think the issue might be that aws-flow-ruby doesn't support Ruby 2.0. I found this PDF dated Jan 22, 2015.
    1.2.1 Tested Ruby Runtimes
    The AWS Flow Framework for Ruby has been tested with the official Ruby 1.9
    runtime, also known as YARV. Other versions of the Ruby runtime may work,
    but are unsupported.
I have a partial answer to my own question. The answer to "Can SWF be made to work on jRuby" is "Yes...ish."
I was, indeed, able to get a workflow working end-to-end (and even make calls to a database via JDBC, the original reason I had to do this). So, that's the "yes" part of the answer. Yes, SWF can be made to work on jRuby.
Here's the "ish" part of the answer.
The stack trace I posted above is the result of SWF trying to raise an ActivityTaskFailedException due to a problem in some of my activity code. That part is my fault. What's not my fault is that the superclass of ActivityTaskFailedException has this code in it:
def initialize(reason = "Something went wrong in Flow",
               details = "But this indicates that it got corrupted getting out")
  super(reason)
  @reason = reason
  @details = details
  details = details.message if details.is_a? Exception
  self.set_backtrace(details)
end
When your activity throws an exception, the "details" variable you see above is filled with a String. MRI is perfectly happy to take a String as an argument to set_backtrace(), but jRuby is not, and jRuby throws an exception saying that "details" must be an Array of Strings. This exception blows through all the nice error catching logic of the SWF library and into this code that's trying to do incompatible things with the Fiber library. That code then throws a follow-on exception and kills the activity worker thread entirely.
So, you can run SWF on jRuby as long as your activity and workflow code never, ever throws exceptions because otherwise those exceptions will kill your worker threads (which is not the intended behavior of SWF workers). What they are designed to do instead is communicate the exception back to SWF in a nice, trackable, recoverable fashion. But, the SWF code that does the communicating back to SWF has, itself, code that's incompatible with jRuby.
To get past this problem, I monkey-patched AWS::Flow::FlowException like so:
def initialize(reason = "Something went wrong in Flow",
               details = "But this indicates that it got corrupted getting out")
  super(reason)
  @reason = reason
  @details = details
  details = details.message if details.is_a? Exception
  details = [details] if details.is_a? String
  self.set_backtrace(details)
end
Hope that helps someone in the same situation as me.
I'm using JFlow, which lets you start SWF flow activity workers with JRuby.

Rails - run pusher in background

I use Pusher in my Rails-4 application.
The problem is that sometimes the connection is slow, so the execution of the code becomes slower.
I also get from time to time the following error:
Pusher::HTTPError: execution expired (HTTPClient::ConnectTimeoutError)
I send signals via Pusher with this code:
Pusher[channel].trigger!(event, msg)
I would like to execute it in background, so if an exception is thrown it will not break the flow of my app, and neither slow it down.
I tried to wrap the call with begin ... rescue, but it didn't solve the exception problem. Of course, even if it had, it wouldn't solve the slowdown problem I want to avoid.
Information on performing asynchronous triggers can be found here:
https://github.com/pusher/pusher-gem#asynchronous-requests
This also provides you with information on catching/handling errors.
Finally, I implemented this solution:
Thread.new do
  begin
    Pusher[channel].trigger!(event, msg)
    ActiveRecord::Base.connection.close
  rescue Pusher::Error => e
    Rails.logger.error "Pusher error: #{e.message}"
  end
end

If the init/1 function in a gen_server process sends a message to itself, is it guaranteed to arrive before any other message?

There is a pattern I've seen occasionally where the init/1 function of a gen_server process will send a message to itself signalling that it should be initialized. The purpose of this is for the gen_server process to initialize itself asynchronously so that the process spawning it doesn't have to wait. Here is an example:
-module(test).
-compile(export_all).

init([]) ->
    gen_server:cast(self(), init),
    {ok, {}}.

handle_cast(init, {}) ->
    io:format("initializing~n"),
    {noreply, lists:sum(lists:seq(1, 10000000))};
handle_cast(m, X) when is_integer(X) ->
    io:format("got m. X: ~p~n", [X]),
    {noreply, X}.

b() ->
    receive P -> {} end,
    gen_server:cast(P, m),
    b().

test() ->
    B = spawn(fun test:b/0),
    {ok, A} = gen_server:start_link(test, [], []),
    B ! A.
The process assumes that the init message will be received before any other message - otherwise it will crash. Is it possible for this process to get the m message before the init message?
Let's assume there's no process sending messages to random pids generated by list_to_pid, since any application doing this will probably not work at all, regardless of the answer to this question.
The theoretical answer to the question "is it possible for a process to get a message before the init message?" is YES.
But practically (when no process is calling list_to_pid and sending a message to this process), the answer is NO, provided the gen_server is not a registered process.
This is because gen_server:start_link does not return until the init callback of the gen_server has executed. Thus the init message is the first message in the process's message queue, before any other process can get hold of the Pid to send it a message. Your process is therefore safe and does not receive any other message before init.
The same is not true for a registered process, as another process might send a message to the gen_server using the registered name even before the init callback completes.
Let's consider this test function:
test() ->
    Times = lists:seq(1, 1000),
    spawn(gen_server, start_link, [{local, ?MODULE}, ?MODULE, [], []]),
    [gen_server:cast(?MODULE, No) || No <- Times].
The sample output is:
1> async_init:test().
Received:356
Received:357
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
Received:358
Received:359
2> Received:360
2> Received:361
...
2> Received:384
2> Received:385
2> Initializing
2> Received:386
2> Received:387
2> Received:388
2> Received:389
...
You can see that the gen_server received messages 356 to 385 before initializing.
Thus the asynchronous init does not work in the registered-name scenario.
This can be solved in two ways:
1. Register the process after the Pid is returned:
start_link_reg() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    register(?MODULE, Pid).
2. Or register the process in handle_cast for the init message:
handle_cast(init, State) ->
    register(?MODULE, self()),
    io:format("Initializing~n"),
    {noreply, State};
The sample output after this change is:
1> async_init:test().
Initializing
Received:918
Received:919
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
Received:920
2> Received:921
2> Received:922
...
Thus, sending a message to itself for initialization does not ensure that it is the first message received, but with a bit of change in the code (and design) you can ensure that it is handled first.
In this particular case, you would be safe in the assumption that the 'init' message will be received before 'm'. In general (and especially if you register your process) this is not true though.
If you want to be 100% safe knowing that your init code will run first you can do something like:
start_link(Args...) ->
    gen_server:start_link(test, [self(), Args...], []).

init([Parent, Args...]) ->
    do_your_synchronous_start_stuff_here,
    proc_lib:init_ack(Parent, {ok, self()}),
    do_your_async_initializing_here,
    io:format("initializing~n"),
    {ok, State}.
I didn't test this, so I don't know if the "bonus" init_ack will print an ugly message to the terminal or not. If it does, the code has to be expanded slightly, but the general idea still stands. Let me know and I'll update my answer.
Your sample code is safe and m is always received after init.
However, from a theoretical point of view, if init/1 handler of a gen_server sends a message to itself, using gen_server:cast/2 or the send primitive, it is not guaranteed to be the first message.
There is no way to guarantee this simply because init/1 is executed within the process of the gen_server, therefore after the process was created and allocated a pid and a mailbox. In non-SMP mode, the scheduler can schedule out the process under some load before the init function is called or before the message is sent, since calling a function (such as gen_server:cast/2 or the init handler for that matter) generates a reduction and the BEAM emulator tests whether it's time to give some time to other processes. In SMP mode, you can have another scheduler that will run some code sending a message to your process.
What distinguishes theory from practice is the way to find out about the existence of the process (in order to send it a message before the init message). Code could use links from the supervisor, the registered name, the list of processes returned by erlang:processes() or even call list_to_pid/1 with random values or unserialize pids with binary_to_term/1. Your node might even get a message from another node with a serialized pid, especially considering that creation number wraps around after 3 (see your other question Wrong process getting killed on other node?).
This is unlikely in practice. As a result, from a practical point of view, every time this pattern is used, the code can be designed to ensure init message is received first and the server is initialized before it receives other messages.
If the gen_server is a registered process, you would start it from a supervisor and ensure that all clients are started afterward in the supervision tree or introduce some kind of (probably inferior) synchronization mechanism. This is required even if you do not use this pattern of asynchronous initialization (otherwise clients could not reach the server). Of course, you might still have issues in case of crashes and restarts of this gen_server, but this is true whatever the scenario, and you can only be saved by a carefully crafted supervision tree.
If the gen_server is not registered or referred to by name, clients will eventually pass to gen_server:call/2,3 or gen_server:cast/2 a pid which they obtained through the supervisor that called gen_server:start_link/3. gen_server:start_link/3 only returns once init/1 has returned, and therefore after the init message was enqueued. This is exactly what your code above does.
gen_server uses proc_lib:init_ack to make sure that the process is properly started before returning the pid from start_link. So the message sent in init will be the first message.
This is not 100% safe!
In gen.erl, lines 117-129, we can see this:
init_it(GenMod, Starter, Parent, Mod, Args, Options) ->
    init_it2(GenMod, Starter, Parent, self(), Mod, Args, Options).

init_it(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    case name_register(Name) of
        true ->
            init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options);
        {false, Pid} ->
            proc_lib:init_ack(Starter, {error, {already_started, Pid}})
    end.

init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    GenMod:init_it(Starter, Parent, Name, Mod, Args, Options).
In init_it/7 the process registers its Name first, and then init_it2/7 calls GenMod:init_it/6, which in turn calls your init/1 function.
Before gen_server:start_link returns, it is hard to guess the new process id. However, if you send a message to the server using the registered Name, and that message arrives before your gen_server:cast is processed, your code will be wrong.
Daniel's solution may be right, but I'm not quite sure whether two proc_lib:init_ack calls will cause an error or not. However, the parent would never like to receive an unexpected message. >_<
Here is another solution: keep a flag in your gen_server state to mark whether the server is initialized. When you receive m, check whether the server is initialized; if not, gen_server:cast m to yourself (see the sketch below).
This is a somewhat clumsy solution, but I'm sure it is right. =_=
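For readers coming from the Elixir question at the top of this page, here is a sketch of that flag idea as an Elixir GenServer (untested; module and message names are illustrative, and the same structure translates directly to an Erlang gen_server):

defmodule DeferredInit do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, [], opts)

  def init([]) do
    GenServer.cast(self(), :init)
    # The flag marks whether the expensive init has run yet.
    {:ok, %{initialized: false, value: nil}}
  end

  def handle_cast(:init, state) do
    value = Enum.sum(1..10_000_000)
    {:noreply, %{state | initialized: true, value: value}}
  end

  # Not initialized yet: put :m back in our own queue and handle it later.
  def handle_cast(:m, %{initialized: false} = state) do
    GenServer.cast(self(), :m)
    {:noreply, state}
  end

  def handle_cast(:m, %{initialized: true, value: value} = state) do
    IO.puts("got m. value: #{value}")
    {:noreply, state}
  end
end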
I'm a freshman here, how I wish I could add a comment. >"<