Can I shut down a Ray [RLlib] A3C model's RolloutWorkers during inference? - ray

Is there a way to shut down or kill the RolloutWorkers of the Ray A3C model? I noticed that they are required to initialize an A3C trainer but are not useful for inference, and they consume a lot of CPU resources.
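One workaround, assuming an older ray.rllib.agents-style API (the exact module path, config keys, and checkpoint path below are assumptions and vary between Ray versions), is to build the trainer for inference with num_workers=0 so that no remote RolloutWorkers are created at all:

import gym
import ray
from ray.rllib.agents.a3c import A3CTrainer  # module path differs in newer Ray releases

ray.init()

# num_workers=0 means no remote RolloutWorkers are spawned; only the local
# worker used by compute_action() remains, which keeps CPU usage low.
trainer = A3CTrainer(env="CartPole-v0", config={"num_workers": 0})
trainer.restore("path/to/checkpoint")  # load previously trained weights

env = gym.make("CartPole-v0")
obs = env.reset()
action = trainer.compute_action(obs)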

Related

Keras predict not returning inside celery task

The following Keras function (predict) works when called synchronously:
pred = model.predict(x)
But it does not work when called from within an asynchronous task queue (Celery).
Keras predict function does not return any output when called asynchronously.
The stack is: Django, Celery, Redis, Keras, TensorFlow
I ran into this exact same issue, and man was it a rabbit hole. Wanted to post my solution here since it might save somebody a day of work:
TensorFlow Thread-Specific Data Structures
In TensorFlow, there are two key data structures that are working behind the scenes when you call model.predict (or keras.models.load_model, or keras.backend.clear_session, or pretty much any other function interacting with the TensorFlow backend):
A TensorFlow graph, which represents the structure of your Keras model
A TensorFlow session, which is the connection between your current graph and the TensorFlow runtime
Something that is not explicitly clear in the docs without some digging is that both the session and the graph are properties of the current thread. See API docs here and here.
Using TensorFlow Models in Different Threads
It's natural to want to load your model once and then call .predict() on it multiple times later:
from keras.models import load_model
MY_MODEL = load_model('path/to/model/file')
def some_worker_function(inputs):
    return MY_MODEL.predict(inputs)
In a webserver or worker-pool context like Celery, this means the model is loaded when the module containing the load_model line is imported, and a different thread then executes some_worker_function, running predict on the global variable holding the Keras model. However, trying to run predict on a model loaded in a different thread produces "tensor is not an element of this graph" errors (thanks to the several SO posts that touched on this topic, such as "ValueError: Tensor Tensor(...) is not an element of this graph. When using global variable keras model"). To get this to work, you need to hang on to the TensorFlow graph that was used -- as we saw earlier, the graph is a property of the current thread. The updated code looks like this:
from keras.models import load_model
import tensorflow as tf
MY_MODEL = load_model('path/to/model/file')
MY_GRAPH = tf.get_default_graph()
def some_worker_function(inputs):
    with MY_GRAPH.as_default():
        return MY_MODEL.predict(inputs)
The somewhat surprising twist here is: the above code is sufficient if you are using Threads, but hangs indefinitely if you are using Processes. And by default, Celery uses processes to manage all its worker pools. So at this point, things are still not working on Celery.
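To make the difference concrete, here is a minimal sketch (assuming TF 1.x with standalone Keras; the model path and input shape are placeholders) that reproduces the behavior described above:

import threading
import multiprocessing
import numpy as np
import tensorflow as tf
from keras.models import load_model

MY_MODEL = load_model('path/to/model/file')
MY_GRAPH = tf.get_default_graph()

def worker(inputs):
    with MY_GRAPH.as_default():
        print(MY_MODEL.predict(inputs))

x = np.zeros((1, 10))  # placeholder; use whatever input shape your model expects

threading.Thread(target=worker, args=(x,)).start()         # works: prints a prediction
multiprocessing.Process(target=worker, args=(x,)).start()  # typically hangs indefinitely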
Why does this only work on Threads?
In Python, Threads share the same global execution context as the parent process. From the Python _thread docs:
This module provides low-level primitives for working with multiple threads (also called light-weight processes or tasks) — multiple threads of control sharing their global data space.
Because threads are not actual separate processes, they use the same Python interpreter and thus are subject to the infamous Global Interpreter Lock (GIL). Perhaps more importantly for this investigation, they share global data space with the parent.
In contrast to this, Processes are actual new processes spawned by the program. This means:
New Python interpreter instance (and no GIL)
Global address space is duplicated
Note the difference here. While Threads have access to a shared single global Session variable (stored internally in the tensorflow_backend module of Keras), Processes have duplicates of the Session variable.
My best understanding of this issue is that the Session variable is supposed to represent a unique connection between a client (process) and the TensorFlow runtime, but by being duplicated in the forking process, this connection information is not properly adjusted. This causes TensorFlow to hang when trying to use a Session created in a different process. If anybody has more insight into how this is working under the hood in TensorFlow, I would love to hear it!
The Solution / Workaround
I went with adjusting Celery so that it uses Threads instead of Processes for pooling. There are some disadvantages to this approach (see GIL comment above), but this allows us to load the model only once. We aren't really CPU bound anyways since the TensorFlow runtime maxes out all the CPU cores (it can sidestep the GIL since it is not written in Python). You have to supply Celery with a separate library to do thread-based pooling; the docs suggest two options: gevent or eventlet. You then pass the library you choose into the worker via the --pool command line argument.
Alternatively, it seems (as you already found out #pX0r) that other Keras backends such as Theano do not have this issue. That makes sense, since these issues are tightly related to TensorFlow implementation details. I personally have not yet tried Theano, so your mileage may vary.
I know this question was posted a while ago, but the issue is still out there, so hopefully this will help somebody!
I got the reference from this Blog
TensorFlow uses thread-specific data structures that work behind the scenes when you call model.predict:
GRAPH = tf.get_default_graph()

with GRAPH.as_default():
    pred = model.predict(x)
    return pred
But Celery uses processes to manage all its worker pools, so at this point things are still not working on Celery; for that you need to use the gevent or eventlet library:
pip install gevent
now run celery as :
celery -A mysite worker --pool gevent -l info

Akka equivalent of Spring InitializingBean

I have written some actor classes and I find that I need to hook into the lifecycle of these entities. For example, whenever my actor is initialized I would like a method to be called so that I can set up some listeners on message queues (or open DB connections, etc.).
Is there an equivalent of this in Akka? The closest equivalent I can think of is Spring's InitializingBean and DisposableBean.
This is a typical scenario where you would override methods like preStart(), postStop(), etc. I don't see anything wrong with this.
Of course you have to be aware of the details - for example postStop() is called asynchronously after actor.stop() is invoked while preStart() is called when an Actor is started. This means that potentially slow/blocking things like DB interaction should be kept to a minimum.
You can also use the Actor's constructor for initialization of data.
As Matthew mentioned, supervision plays a big part in Akka - so you can instruct the supervisor to perform specific stuff on events. For example the so-called DeathWatch - you can be notified if one of the actors "you are watching upon" dies:
context.watch(child)
...
def receive = {
  case Terminated(`child`) => lastSender ! "finished"
}
An Actor is basically two methods -- a constructor, and onMessage(Object): void.
There's nothing in its lifecycle that naturally provides for "wiring" behavior, which leaves you with a few options.
Use a Supervisor actor to create your other actors. A Supervisor is responsible for watching, starting and restarting Actors on failure -- and therefore it is often valuable to have a Supervisor that understands the state of integrated systems to avoid continuously restarting. This Supervisor would create and manage Service objects (possibly via Spring) and pass them to Actors.
Use your preferred Initialization technique at the time of Actor construction. It's tricky but you can certainly combine Spring with Actors. Just be aware that should a Supervisor restart your actor, you'll need to be able to resurrect its desired state from whatever content you placed in the Props object you used to start it in the first place.
Wire everything on-demand. Open connections on demand when an Actor starts (and cache them as necessary). I find I do this fairly often -- and I let the Actor fail when its connections no longer work. The supervisor will restart the Actor, which will recreate all connections.
Some important things to remember:
The intent of Actor model is that Actors don't run continuously -- they only run when there are messages provided to them. If you add a message listener to an Actor, you are essentially adding new threads that can access that actor. This can be a problem if you use supervision -- a restarted actor may leak that thread and this may in turn cause the actor not to be garbage collected. It can also be a problem because it introduces a race condition, and part of the value of actors is avoiding that.
An Actor that does I/O is, from the perspective of the actor system, blocking. If you have too many Actors doing I/O at the same time, you will exhaust your Dispatcher's thread pool and lock up the system.
A given Actor instance can operate on many different threads over its lifetime, but will only operate on one thread at a time. This can be confusing to some messaging systems -- for example, the JMS spec asserts that a Session not be used on multiple threads, and many JMS providers interpret this as "can only run on the thread on which it was started." You may see warnings, or even exceptions, resulting from this.
For these reasons, I prefer to use non-actor code to do some of my I/O. For example, I'll have an incoming message listener object whose responsibility is to take JMS messages off a queue, use them to create POJO messages, and send tells to the Actor system. Alternately, I'll use an Actor, but place that actor on a custom Dispatcher that has thread pinning enabled. This assures that that Actor will only run on a specific thread and won't block up the system that other non-I/O actors are using.

What is the purpose of stopping actors in Akka?

I have read the Akka docs on fault tolerance & supervision, and I think I totally get them, with one big exception (no pun intended).
Why would you ever want/need to stop a child actor???
The only clue in the docs is:
Closer to the Erlang way is the strategy to just stop children when they fail and then take corrective action in the supervisor...
But to me, stopping a child is the same as saying "don't execute this code any longer", which to me, is effectively the same as deploying new changes to the code which has that actor removed entirely:
Every Actor plays some critical role in the actor system
To simply stop the actor means that actor currently doesn't have a role any longer, and presumes the system can now somehow (magically) work without it
So again, to me, this is no different than refactoring the code to not even have the actor any more, and then deploying those changes
I'm sure I'm just not seeing the forest through the trees on this one, but I just don't see any use cases where I'd have this big complex actor system, where each actor does critical work and then hands it off to the next critical actor, but then I stop an actor, and magically the whole system keeps on working perfectly.
In short: stopping an actor (to me) is like ripping the transmission out of a moving vehicle. How can this ever be a good/desirable thing?!?
The essence of the "error kernel" pattern is to delegate risky operations and protect essential state. It is common to spawn child actors for one-off operations; when such an operation is completed and its result has been sent off somewhere else, the child actor needs to stop itself (or the parent actor needs to stop it), otherwise the child actor will remain active and leak.
If the child actor is running a longer process that can be terminated safely, such as video encoding or some kind of file transformation, and you have to deploy a new build, a termination signal is useful for stopping the running work gracefully.
Every Actor plays some critical role in the actor system
This is where you are running into trouble. I can create a child actor to do a job, for example to execute a query against a database or to maintain the state of a connected user, and this is its only purpose.
Once the database query is complete or the user has gracefully disconnected the child actor no longer has any role to play and should be stopped so that it will release any resources it holds.
To simply stop the actor means that actor currently doesn't have a role any longer, and presumes the system can now somehow (magically) work without it
The system is able to continue because I can create new child actors if/when they are needed.

Celery tasks per Model Object. Cleanest way to track progress

I have distributed hardware sensor nodes that will be interrogated by Celery tasks. Each sensor node has an associated object holding recent readings and config data.
I never want more than one Celery task interrogating a single sensor node, but requests to interrogate a node might come in while it is still being worked on from a previous request.
I didn't see any example of this sort of task tracking in any of the Celery docs, but I assume it's a fairly common requirement.
My first thought was to just mark the model object at the beginning and end of the task with a task_in_progress-like flag.
Is there anything in the task instantiation that I can use to better realize my task tracking?
What you want is to lock a task on a given resource; there is a very nice example of this in the Celery cookbook.
To summarize, the example suggests using a cache key to hold the lock: a task will check the lock key (you can generate an instance-specific cache key like "sensor-%(id)s") before starting, and execute only if the cache key is not set.
Example:
def check_sensor(sensor_id):
    if check_lock_from_cache(sensor_id):
        ... handle the lock ...
    else:
        lock(sensor_id)
        ... use the sensor ...
        unlock(sensor_id)
You probably want to be really sure you do the unlock properly (try/except/finally).
Here's the Celery example: http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time
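For completeness, here is a minimal sketch of that cookbook recipe's locking idea, assuming a Django cache backend whose add() is atomic (e.g. memcached); interrogate_sensor is a hypothetical helper standing in for your actual node-polling code:

from celery import shared_task
from django.core.cache import cache

LOCK_EXPIRE = 60 * 5  # seconds; the lock expires on its own if a task dies mid-way

@shared_task
def check_sensor(sensor_id):
    lock_id = "sensor-lock-%s" % sensor_id
    # cache.add() only sets the key if it is not already present, so it acts
    # as an atomic test-and-set for the lock.
    if not cache.add(lock_id, "locked", LOCK_EXPIRE):
        return  # another task is already interrogating this sensor node
    try:
        interrogate_sensor(sensor_id)  # hypothetical: talk to the hardware node
    finally:
        cache.delete(lock_id)  # always release the lock, even on failure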

scala specs don't exit when testing actors

I'm trying to test some actors using Scala Specs. I run the test in IDEA or Maven (as JUnit) and it does not exit. Looking at the code, my test finishes, but some internal threads (the scheduler) are hanging around. How can I make the test finish?
Currently this is only possible by causing the actor framework's scheduler to forcibly shut down:
scala.actors.Scheduler.impl.shutdown
However, the underlying implementation of the scheduler has been changing in patch releases lately, so this may be different or may not quite work with the version you are on. In 2.7.7 the default scheduler appears to be an instance of scala.actors.FJTaskScheduler2, for which this approach should work; however, if you end up with a SingleThreadedScheduler it will not, as its shutdown method is a no-op.
This will only work if your actors are not waiting on a react at that time.