The following Keras function (predict) works when called synchronously:
pred = model.predict(x)
But it does not work when called from within an asynchronous task queue (Celery).
The Keras predict function does not return any output when called asynchronously.
The stack is: Django, Celery, Redis, Keras, TensorFlow
I ran into this exact same issue, and man was it a rabbit hole. Wanted to post my solution here since it might save somebody a day of work:
TensorFlow Thread-Specific Data Structures
In TensorFlow, there are two key data structures that are working behind the scenes when you call model.predict (or keras.models.load_model, or keras.backend.clear_session, or pretty much any other function interacting with the TensorFlow backend):
A TensorFlow graph, which represents the structure of your Keras model
A TensorFlow session, which is the connection between your current graph and the TensorFlow runtime
Something that is not explicitly clear in the docs without some digging is that both the session and the graph are properties of the current thread. See API docs here and here.
Using TensorFlow Models in Different Threads
It's natural to want to load your model once and then call .predict() on it multiple times later:
from keras.models import load_model
MY_MODEL = load_model('path/to/model/file')
def some_worker_function(inputs):
    return MY_MODEL.predict(inputs)
In a webserver or worker-pool context like Celery, what this means is that you load the model when you import the module containing the load_model line, and then a different thread executes some_worker_function, running predict on the global variable holding the Keras model. However, trying to run predict on a model loaded in a different thread produces "tensor is not an element of this graph" errors. Thanks to the several SO posts that touched on this topic, such as ValueError: Tensor Tensor(...) is not an element of this graph. When using global variable keras model. In order to get this to work, you need to hang on to the TensorFlow graph that was used; as we saw earlier, the graph is a property of the current thread. The updated code looks like this:
from keras.models import load_model
import tensorflow as tf
MY_MODEL = load_model('path/to/model/file')
MY_GRAPH = tf.get_default_graph()
def some_worker_function(inputs):
    with MY_GRAPH.as_default():
        return MY_MODEL.predict(inputs)
The somewhat surprising twist here is: the above code is sufficient if you are using Threads, but hangs indefinitely if you are using Processes. And by default, Celery uses processes to manage all its worker pools. So at this point, things are still not working on Celery.
Why does this only work on Threads?
In Python, Threads share the same global execution context as the parent process. From the Python _thread docs:
This module provides low-level primitives for working with multiple threads (also called light-weight processes or tasks) — multiple threads of control sharing their global data space.
Because threads are not actual separate processes, they use the same Python interpreter and thus are subject to the infamous Global Interpreter Lock (GIL). Perhaps more importantly for this investigation, they share global data space with the parent.
In contrast to this, Processes are actual new processes spawned by the program. This means:
New Python interpreter instance (and no GIL)
Global address space is duplicated
Note the difference here. While Threads have access to a shared single global Session variable (stored internally in the tensorflow_backend module of Keras), Processes have duplicates of the Session variable.
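To make the distinction concrete, here is a small, self-contained illustration (not Keras-specific, just plain Python) of what "shared" versus "duplicated" globals means in practice:

import threading
import multiprocessing

STATE = {"value": 0}

def mutate():
    STATE["value"] = 42

if __name__ == "__main__":
    # Threads share the parent's global data space, so the change is visible here.
    t = threading.Thread(target=mutate)
    t.start()
    t.join()
    print(STATE["value"])  # prints 42

    # A child process works on its own copy of the globals; the parent's STATE
    # is left untouched.
    STATE["value"] = 0
    p = multiprocessing.Process(target=mutate)
    p.start()
    p.join()
    print(STATE["value"])  # prints 0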
My best understanding of this issue is that the Session variable is supposed to represent a unique connection between a client (process) and the TensorFlow runtime, but by being duplicated in the forking process, this connection information is not properly adjusted. This causes TensorFlow to hang when trying to use a Session created in a different process. If anybody has more insight into how this is working under the hood in TensorFlow, I would love to hear it!
The Solution / Workaround
I went with adjusting Celery so that it uses Threads instead of Processes for pooling. There are some disadvantages to this approach (see GIL comment above), but this allows us to load the model only once. We aren't really CPU bound anyway since the TensorFlow runtime maxes out all the CPU cores (it can sidestep the GIL since it is not written in Python). You have to supply Celery with a separate library to do thread-based pooling; the docs suggest two options: gevent or eventlet. You then pass the library you choose into the worker via the --pool command line argument.
Alternatively, it seems (as you already found out #pX0r) that other Keras backends such as Theano do not have this issue. That makes sense, since these issues are tightly related to TensorFlow implementation details. I personally have not yet tried Theano, so your mileage may vary.
I know this question was posted a while ago, but the issue is still out there, so hopefully this will help somebody!
I got the reference from this Blog
TensorFlow uses thread-specific data structures that work behind the scenes when you call model.predict:
import tensorflow as tf

GRAPH = tf.get_default_graph()  # capture the graph when the model is loaded

def predict_task(x):  # e.g. the body of your Celery task
    with GRAPH.as_default():
        pred = model.predict(x)
        return pred
But Celery uses processes to manage its worker pools, so at this point things are still not working on Celery; for that, you need to use the gevent or eventlet library:
pip install gevent
Now run Celery as:
celery -A mysite worker --pool gevent -l info
Related
In my current setup, if I do five 100ms queries, they take 500ms total. Is there a way I can run them in parallel so it only takes 100ms?
I'm running Flask behind nginx/uwsgi, but can change any of that.
Specifically, I'd like to be able to turn code from this:
result_1 = db.session.query(...).all()
result_2 = db.session.query(...).all()
result_3 = db.session.query(...).all()
To something like this:
result_1, result_2, result_3 = run_in_parallel([
    db.session.query(...).all(),
    db.session.query(...).all(),
    db.session.query(...).all(),
])
Is there a way to do that with Flask and SQLAlchemy?
Parallelism in general
In general, if you want to run tasks in parallel you can use threads or processes. In python, threads are great for tasks that are I/O bound (meaning the time they take is spent waiting on another resource - waiting for your database, or for the disk, or for a remote webserver), and processes are great for tasks that are CPU bound (math and other computationally intensive tasks).
concurrent.futures
In your case, threads are ideal. Python has a threading module that you can look into, but there's a fair bit to unpack: safely using threads usually means limiting the number of threads that can be run by using a pool of threads and a queue for tasks. For that reason I much prefer the concurrent.futures library, which provides wrappers around threading to give you an easy-to-use interface and to handle a lot of the complexity for you.
When using concurrent.futures, you create an executor, and then you submit tasks to it along with a list of arguments. Instead of calling a function like this:
# get 4 to the power of 5
result = pow(4, 5)
print(result)
You submit the function and its arguments to the executor instead:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor()
future = executor.submit(pow, 4, 5)
print(future.result())
Notice how we don't call the function by using pow(), we submit the function object pow which the executor will call inside a thread.
To make it easier to use the concurrent.futures library with Flask, you can use flask-executor which works like any other Flask extension. It also handles the edge cases where your background tasks require access to Flask's context locals (like the app, session, g or request objects) inside a background task. Full disclosure: I wrote and maintain this library.
(Fun fact: concurrent.futures wraps both threading and multiprocessing, using the same API - so if you find yourself needing multiprocessing for CPU bound tasks in future, you can use the same library in the same way to achieve your goal)
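As a quick illustration of that fun fact, switching to processes is just a different executor class; this is only a sketch of the shared API, not something this particular use case needs:

from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    # Same submit()/result() API as ThreadPoolExecutor, but each task runs in a
    # separate process, so CPU-bound work isn't limited by the GIL.
    with ProcessPoolExecutor() as executor:
        future = executor.submit(pow, 4, 5)
        print(future.result())  # 1024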
Putting it all together
Here's what using flask-executor to run SQLAlchemy tasks in parallel looks like:
from flask_executor import Executor
# ... define your `app` and `db` objects
executor = Executor(app)
# run the same query three times in parallel and collect all the results
futures = []
for i in range(3):
    # note the lack of () after ".all", as we're passing the function object, not calling it ourselves
    future = executor.submit(db.session.query(MyModel).all)
    futures.append(future)

for future in futures:
    print(future.result())
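If you want the three distinct queries from the question rather than the same one three times, the same pattern applies; the model names here are hypothetical stand-ins for whatever you're actually querying:

f1 = executor.submit(db.session.query(User).all)
f2 = executor.submit(db.session.query(Order).all)
f3 = executor.submit(db.session.query(Product).all)
result_1, result_2, result_3 = f1.result(), f2.result(), f3.result()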
Boom, you have now run multiple Flask SQLAlchemy queries in parallel.
I have a working Python (2.7) script that communicates with gdb interactively through the pexpect module. However, it's intolerably slow, and I needed to speed it up with a multiprocessing pool. I found a way to do this, but in my implementation, each one of the multiple processes has to spawn its own pexpect instance. This seems like a massive waste of computational time, since spawning each pexpect instance takes a couple minutes, and I'll have to spawn hundreds of them.
Instead of this kind of flowchart which represents the current program,
Process A --- pexpect A
                       \
                        \
Process B --- pexpect B --- Main Script
                        /
                       /
Process C --- pexpect C
I would like to have something like this:
Process A
          \
           \
Process B --- global pexpect process --- Main Script
           /
          /
Process C
I'm aware that sharing objects between multiple processes is not new ground here on StackOverflow, but those objects discussed have by-and-large been read-only in nature. I think my issue is different in that this pexpect instance can run out of virtual memory, and will occasionally need to be restarted.
This means that the shared pexpect object needs to be writable in each one of the multiple processes, and every one of the multiple processes needs to be told to wait until the pexpect process has finished its restart. Further, each of the multiple processes needs to be able to update their copy of the pexpect instance with the restarted version after the restart has been completed.
Frankly, I don't know if this is possible. I'm aware that under the covers, Python uses os.fork() to implement multiprocessing, so I'm thinking that an arbitrary multiple-writer/multiple-reader shared-memory resource can't even be built. Nonetheless, I toyed around with trying to pass the shared pexpect object around using multiprocessing.Manager(), but when I tried to implement communication through the manager, I got an error that the pexpect object wasn't pickle-able.
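For reference, a minimal sketch of the kind of attempt described above (the gdb target and all names are hypothetical); it fails because anything handed to a Manager proxy has to be pickled to cross the process boundary, and a pexpect spawn object is not picklable:

import multiprocessing
import pexpect

def worker(shared):
    # would need an unpickled copy of the pexpect object here
    shared['gdb'].sendline('info registers')

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.dict()
    # raises a pickling error: the spawn object wraps a pty / file descriptor
    shared['gdb'] = pexpect.spawn('gdb ./a.out')
    p = multiprocessing.Process(target=worker, args=(shared,))
    p.start()
    p.join()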
Am I just dense in thinking this is basically impossible, or can this actually be done?
I have distributed hardware sensor nodes that will be interrogated by Celery tasks. Each sensor node has an associated object holding recent readings and config data.
I never want more than one Celery task interrogating a single sensor node, but requests might come in to interrogate the node while it is still being worked on from a previous request.
I didn't see any example of this sort of task tracking in any of the Celery docs, but I assume it's a fairly common requirement.
My first thought was to just mark the model object at the beginning and end of the task with a task_in_progress-like flag.
Is there anything in the task instantiation that I can use to better realize my task tracking?
What you want is to lock a task on a given resource; there is a very nice example in the Celery cookbook.
To summarize, the example suggests using a cache key to hold the lock: a task will check the lock key (you can generate an instance-specific cache key like "sensor-%(id)s") before starting, and execute only if the cache key is not set.
Example:
def check_sensor(sensor_id):
    if check_lock_from_cache(sensor_id):
        ... handle the lock ...
    else:
        lock(sensor_id)
        ... use the sensor ...
        unlock(sensor_id)
You probably want to be really sure to do the unlock properly (try/except/finally).
Here's the Celery example: http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time
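A slightly more concrete sketch of that pattern, assuming Django's cache framework with a backend whose add() is atomic (memcached or Redis) and the modern shared_task decorator; interrogate_sensor is a hypothetical stand-in for the actual work:

from django.core.cache import cache
from celery import shared_task

LOCK_EXPIRE = 60 * 5  # let the lock expire eventually in case the task dies

@shared_task
def check_sensor(sensor_id):
    lock_id = "sensor-lock-%s" % sensor_id
    # cache.add() only sets the key if it isn't already set, so it doubles as
    # an atomic "acquire lock" operation
    if not cache.add(lock_id, "locked", LOCK_EXPIRE):
        return  # another task is already interrogating this sensor
    try:
        interrogate_sensor(sensor_id)  # hypothetical: read the node, update the model
    finally:
        cache.delete(lock_id)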
Disclaimer: I do know that there are several similar questions on SO. I think I've read most if not all of them, but did not find an answer to my real question (see later).
I also do know that using celery or other asynchronous queue systems is the best way to achieve long running tasks - or at least use a cron-managed script. There's also mod_wsgi doc about processes and threads but I'm not sure I got it all correct.
The question is:
what are the exact risks/issues involved with using the solutions listed down there? Is any of them viable for long running tasks (ok, even though celery is better suited)?
My question is really more about understanding the internals of wsgi and python/django than finding the best overall solution. Issues with blocking threads, unsafe access to variables, zombie processes, etc.
Let's say:
my "long_process" is doing something really safe; even if it fails I don't care.
python >= 2.6
I'm using mod_wsgi with apache (will anything change with uwsgi or gunicorn?) in daemon mode
mod_wsgi conf:
WSGIDaemonProcess NAME user=www-data group=www-data threads=25
WSGIScriptAlias / /path/to/wsgi.py
WSGIProcessGroup %{ENV:VHOST}
I figured that these are the options available to launch separate processes (meant in a broad sense) to carry out a long-running task while quickly returning a response to the user:
os.fork
import os

if os.fork() == 0:
    long_process()
else:
    return HttpResponse()
subprocess
import subprocess
import sys

p = subprocess.Popen([sys.executable, '/path/to/script.py'],
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
(where the script is likely to be a manage.py command)
threads
import threading
t = threading.Thread(target=long_process,
                     args=args,
                     kwargs=kwargs)
t.setDaemon(True)
t.start()
return HttpResponse()
NB.
Due to the Global Interpreter Lock, in CPython only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
The main thread will quickly return (the HttpResponse). Will the spawned long-running thread block WSGI from doing something else for another request?!
multiprocessing
from multiprocessing import Process
p = Process(target=_bulk_action,args=(action,objs))
p.start()
return HttpResponse()
This should solve the thread concurrency issue, shouldn't it?
So those are the options I could think of. What would work and what not, and why?
os.fork
A fork will clone the parent process, which in this case is your Django stack. Since you merely want to run a separate Python script, this seems like an unnecessary amount of bloat.
subprocess
subprocess is expected to be used interactively. In other words, while you can use it to effectively spawn off a process, it's expected that at some point you'll terminate it when finished. It's possible Python might clean up for you if you leave one running, but my guess would be that this will actually result in a memory leak.
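For illustration, a minimal sketch of what "terminate it when finished" can look like (the script path is just a placeholder); the key point is that the parent should eventually reap the child, otherwise finished children linger as zombie processes:

import subprocess
import sys

p = subprocess.Popen([sys.executable, '/path/to/script.py'])

# ... later, e.g. in a cleanup path or a periodic check:
if p.poll() is None:   # still running
    p.terminate()
p.wait()               # reap the child so it doesn't become a zombie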
threading
Threads are defined units of logic. They start when their run() method is called, and terminate when the run() method's execution ends. This makes them well suited to creating a branch of logic that will run outside the current scope. However, as you mentioned, they are subject to the Global Interpreter Lock.
multiprocessing
This module allows you to spawn processes, and it has an API similar to that of threading. You could say it is like threads on steroids. These processes are not subject to the Global Interpreter Lock, and they can take advantage of multi-core architectures. However, they are more complicated to work with as a result.
So, your choices really come down to threads or processes. If you can get by with a thread and it makes sense for your application, go with a thread. Otherwise, use processes.
I have found that using uWSGI Decorators is quite a bit simpler than using Celery if you just need to run some long task in the background.
I think Celery is the best solution for a serious, heavy project, but it's overhead for doing something simple.
To start using uWSGI Decorators you just need to update your uWSGI config with:
<spooler-processes>1</spooler-processes>
<spooler>/here/the/path/to/dir</spooler>
write code like:
from uwsgidecorators import spoolraw
import uwsgi

@spoolraw
def long_task(arguments):
    try:
        ...  # do something with arguments['myarg']
    except Exception:
        ...  # handle/log the error
    return uwsgi.SPOOL_OK

def myView(request):
    long_task.spool({'myarg': str(someVar)})
    return render_to_response('done.html')
Then when you hit the view, this appears in the uWSGI log:
[spooler] written 208 bytes to file /here/the/path/to/dir/uwsgi_spoolfile_on_hostname_31139_2_0_1359694428_441414
and when task finished:
[spooler /here/the/path/to/dir pid: 31138] done with task uwsgi_spoolfile_on_hostname_31139_2_0_1359694428_441414 after 78 seconds
There are some (to me) strange restrictions:
- spool can only receive a dictionary of strings as its argument, apparently because it is serialized to a file as strings (see the sketch after this list)
- the spooler must be set up at startup, so the "spooled" code should live in a separate file, which should be declared in the uWSGI config as <import>pyFileWithSpooledCode</import>
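A possible workaround for the strings-only restriction, sketched here as an assumption rather than anything the uWSGI docs prescribe: serialize non-string arguments yourself and decode them inside the spooled task:

import json

# in the view: pack everything into a single JSON string value
long_task.spool({'payload': json.dumps({'myarg': 42, 'items': [1, 2, 3]})})

# inside long_task:
#     data = json.loads(arguments['payload'])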
For the question:
Will the spawned long thread block wsgi from doing something else for another request?!
the answer is no.
You still have to be careful creating background threads from a request though in case you simply create huge numbers of them and clog up the whole process. You really need a task queueing system even if you are doing stuff in process.
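As a rough illustration of that last point, an in-process task queueing system can be as small as a bounded queue plus a fixed pool of worker threads started once at import time; this is only a sketch of the idea, not a recommendation over Celery:

import threading
import queue  # "Queue" on Python 2

_tasks = queue.Queue(maxsize=100)  # bounded, so bursts apply back-pressure

def _worker():
    while True:
        func, args, kwargs = _tasks.get()
        try:
            func(*args, **kwargs)
        except Exception:
            pass  # log this in real code; keep the worker thread alive
        finally:
            _tasks.task_done()

# start a small, fixed pool once, at import time
for _ in range(4):
    t = threading.Thread(target=_worker)
    t.daemon = True
    t.start()

def run_in_background(func, *args, **kwargs):
    _tasks.put((func, args, kwargs))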
With respect to doing a fork or exec from the web process, especially from Apache, that is generally not a good idea, as Apache may impose odd conditions on the environment of the subprocess created which could technically interfere with its operation.
Using a system like Celery is still probably the best solution.
I'm trying to work out how to run a process in a background thread in Django. I'm new to both Django and threads, so please bear with me if I'm using the terminology wrong.
Here's the code I have. Basically I'd like start_processing to begin as soon as the success function is triggered. However start_processing is the kind of function that could easily take a few minutes or fail (it's dependent on an external service over which I have no control), and I don't want the user to have to wait for it to complete successfully before the view is rendered. ('Success' as far as they are concerned isn't dependent on the result of start_processing; I'm the only person who needs to worry if it fails.)
def success(request, filepath):
    start_processing(filepath)
    return render_to_response('success.html', context_instance=RequestContext(request))
From the Googling I've done, most people suggest that background threads aren't used in Django, and instead a cron job is more suitable. But I would quite like start_processing to begin as soon as the user gets to the success function, rather than waiting until the cron job runs. Is there a way to do this?
If you really need a quick hack, simply start a process using subprocess.
But I would not recommend spawning a process (or even a thread), especially if your web site is public: in case of high load (which could be "natural" or the result of a trivial DoS attack), you would be spawning many processes or threads, which would end up using up all your system resources and killing your server.
I would instead recommend using a job server: I use Celery (with Redis as the backend), it's very simple and works just great. You can check out many other job servers, such as RabbitMQ or Gearman. In your case, a job server might be overkill: you could simply run Redis and use it as a light-weight message server. Here is an example of how to do this.
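Along the lines of that last suggestion, a minimal sketch of using Redis as a light-weight message server, assuming the redis-py package; the queue name and function names are placeholders:

import json
import redis

r = redis.Redis()

# in the Django view: push the job and return immediately
def enqueue(filepath):
    r.lpush("jobs", json.dumps({"filepath": filepath}))

# in a separate worker process, started outside the web server
def worker_loop():
    while True:
        _, raw = r.brpop("jobs")  # blocks until a job is available
        job = json.loads(raw)
        start_processing(job["filepath"])  # the long-running function from the question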
Cheers
In case someone really wants to run another thread
def background_process():
    import time
    print("process started")
    time.sleep(100)
    print("process finished")

def index(request):
    import threading
    t = threading.Thread(target=background_process, args=(), kwargs={})
    t.setDaemon(True)
    t.start()
    return HttpResponse("main thread content")
This will return the response first, then print "process finished" to the console, so the user will not face any delay.
Using Celery is definitely a better solution. However, installing Celery could be unnecessary for a very small project with a limited server etc.
You may also need to use threads in a big project, because running Celery on all your servers is not a good idea; then there won't be a way to run a separate process on each server, and you may need threads to handle this case. File system operations might be an example. It's not very likely though, and it is still better to use Celery for long-running processes.
Use wisely.
I'm not sure you need a thread for that. It sounds like you just want to spawn off a process, so look into the subprocess module.
IIUC, the problem here is that the webserver process might not like extra long-running threads; it might kill/spawn server processes as demand goes up and down, etc.
You're probably better off communicating with an external service process for this type of processing, instead of embedding it in the webserver's wsgi/fastcgi process.
If the only thing you're sending over is the filepath, it ought to be pretty easy to write that service app.
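For what it's worth, a minimal sketch of what such a service app could look like if all you pass is a filepath; the spool-directory approach and all names here are assumptions, not a prescription:

import os
import time
import uuid

SPOOL_DIR = "/var/spool/myapp"  # hypothetical location

def submit(filepath):
    # called from the web process: drop a tiny job file and return immediately
    # (a real implementation would write to a temp name and rename, for atomicity)
    job = os.path.join(SPOOL_DIR, uuid.uuid4().hex)
    with open(job, "w") as f:
        f.write(filepath)

def service_loop():
    # run as a separate, long-lived process (e.g. under supervisord)
    while True:
        for name in os.listdir(SPOOL_DIR):
            job = os.path.join(SPOOL_DIR, name)
            with open(job) as f:
                filepath = f.read()
            os.remove(job)
            start_processing(filepath)  # the long-running function from the question
        time.sleep(1)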