Workers ending prematurely when using Tune - ray

I'm trying to learn the basics of Tune. In the following script, I would expect each worker to run for 100 iterations and then end; however, the workers end before reaching 100 iterations with state 3 (TypeError?). I do not see any error messages, so I might be confused as to what is actually supposed to happen. Out of 10 samples, only 2 reach 100 iterations. The rest of the samples end between 5 and 16 iterations.
"""Testing Tune with CartPole."""
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch
if __name__ == "__main__":
tune_metric = "info/learner/default_policy/critic_loss"
space = {"gamma": (0.01, 1)}
algo = BayesOptSearch(
space,
metric=tune_metric,
mode="min",
utility_kwargs={
"kind": "ucb",
"kappa": 2.5,
"xi": 0.0
})
scheduler = AsyncHyperBandScheduler(metric=tune_metric, mode="min")
ray.init()
analysis = tune.run(
"SAC",
stop={"training_iteration": 100},
search_alg=algo,
scheduler=scheduler,
num_samples=10,
config={
"env": "CartPole-v0",
},
)
print("Best config: ", analysis.get_best_config(metric=tune_metric,
mode="min"))
When I attempt to run the following example (the mnist pytorch trainable), the same thing occurs.

The cause is this line:
scheduler = AsyncHyperBandScheduler(metric=tune_metric, mode="min")
AsyncHyperBandScheduler automatically terminates low-performing trials early, so most trials ending well before 100 iterations is expected behavior, not an error.
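If you want to sanity-check that every trial can reach 100 iterations, a minimal sketch (assuming the script above is otherwise unchanged) is to drop the scheduler argument, which falls back to Tune's default FIFO scheduler and turns off early stopping:

# Same search, but with no early-stopping scheduler: every trial runs until
# the stop condition (training_iteration == 100) is met.
analysis = tune.run(
    "SAC",
    stop={"training_iteration": 100},
    search_alg=algo,          # the BayesOptSearch instance from the script above
    num_samples=10,
    config={"env": "CartPole-v0"},
)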

Related

how to schedule a task to run every n minutes at a specific time (celery-django)

When using celerybeat in Django, I would like to schedule a task to start at a specific time and then run every 5 minutes. I was able to get the task to run every 5 minutes using crontab(minute='*/5'), and it runs once celerybeat is up, but I want it to start at, for example, 8:30. How can I do this?
First you set up your task to run every 5 minutes, which you have already done.
The second step is to wrap the body of your task in a conditional that checks whether it should run at the current time.
Something like this:
from django.utils import timezone
import datetime

@app.task
def my_task():
    now = timezone.now().time()
    start_time = datetime.time(8, 30, 0)
    end_time = datetime.time(17, 30, 0)
    if start_time <= now < end_time:
        # your task ...
        pass
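For completeness, a minimal sketch of the beat schedule that drives the every-5-minutes part (the schedule name and the dotted task path are assumptions, not from the original post):

# settings.py -- hypothetical wiring for the periodic task above
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'my-task-every-5-minutes': {
        'task': 'myapp.tasks.my_task',      # hypothetical dotted path to my_task
        'schedule': crontab(minute='*/5'),
    },
}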

Celery/redis tasks don't always complete - not sure why or how to fix it

I am running celery v 4.0.3 / redis v 4.09 in a django v 3.0.1 application (Python v 3.6.9). I am also using face_recognition in a celery task, find_faces, to find faces in the images I have uploaded to the application, among other image processing celery tasks. There are no issues processing five or fewer image files: all the image processing celery tasks complete successfully.
When I have the image process tasks (including find_faces) iterate over 100 images there are 10-30 images where the find_faces task does not complete. When I use flower v0.9.7 to take a look at the celery tasks, I see that the find_faces task status is "started" for those images that did not complete. All the other images have find_faces task status as "success". The status of these "started" tasks never changes, and there are no errors or exceptions reported. I can then run the image processing tasks, including the find_faces task, on each of these images individually, and the task status is "success". These results do not change if I run celery as a daemon or locally, or if I run the django app using wsgi and apache or runserver. Flower also reports that retries = 0 for all my tasks.
I have CELERYD_TASK_SOFT_TIME_LIMIT = 60 set globally in the django app, and max_retries=5 for the find_faces task.
@app.task(bind=True, max_retries=5)
def find_faces_task(self, document_id, use_cuda=settings.USE_CUDA):
    logger.debug("find_faces_task START")
    try:
        temp_file = None
        temp_image = None
        from memorabilia.models import TaskStatus, Document
        args = "document_id=%s, use_cuda=%s" % (document_id, use_cuda)
        ts = TaskStatus(document_id_id=document_id, task_id=self.request.id, task_name='find_faces_task', task_args=args, task_status=TaskStatus.PENDING)
        ts.save()
        import time
        time_start = time.time()
        # Check if we already have the faces for this document
        from biometric_identification.models import Face
        if len(Face.objects.filter(document_id=document_id)) != 0:
            # This document has already been scanned, so we need to remove it and rescan.
            # Have to manually delete each object per the Django docs to ensure the
            # model delete method is run to update the metadata.
            logger.debug("Document %s has already been scanned" % document_id)
            faces = Face.objects.filter(document_id=document_id)
            for face in faces:
                face.delete()
                logger.debug("Deleted face=%s" % face.tag_value.value)
        document = Document.objects.get(document_id=document_id)
        image_file = document.get_default_image_file(settings.DEFAULT_DISPLAY_IMAGE)
        image_path = image_file.path
        time_start_looking = time.time()
        temp_file = open(image_path, 'rb')
        temp_image = Image.open(temp_file)
        logger.debug("temp_image.mode=%s" % temp_image.mode)
        width, height = temp_image.size
        image = face_recognition.load_image_file(temp_file)
        # Get the coordinates of each face
        if use_cuda:
            # With CUDA installed
            logger.debug("Using CUDA for face recognition")
            face_locations = face_recognition.face_locations(image, model="cnn", number_of_times_to_upsample=0)
        else:
            # Without CUDA installed
            logger.debug("NOT using CUDA for face recognition")
            face_locations = face_recognition.face_locations(image, model="hog", number_of_times_to_upsample=2)
        time_find_faces = time.time()
        # Get the face encodings for each face in the picture
        face_encodings = face_recognition.face_encodings(image, known_face_locations=face_locations)
        logger.debug("Found %s face locations and %s encodings" % (len(face_locations), len(face_encodings)))
        time_face_encodings = time.time()
        # Save the faces found in the database
        for location, encoding in zip(face_locations, face_encodings):
            # Create the new Face object and load in the document, encoding, and location of a face found
            # Locations seem to be of the form (y, x)
            from memorabilia.models import MetaData, MetaDataValue
            tag_type_people = MetaDataValue.objects.filter(metadata_id=MetaData.objects.filter(name='Tag_types')[0].metadata_id, value='People')[0]
            tag_value_unknown = MetaDataValue.objects.filter(metadata_id=MetaData.objects.filter(name='Unknown')[0].metadata_id, value='Unknown')[0]
            new_face = Face(document=document, face_encoding=numpy_to_json(encoding), face_location=location, image_size={'width': width, "height": height}, tag_type=tag_type_people, tag_value=tag_value_unknown)
            # Save the newly found Face object
            new_face.save()
            logger.debug("Saved new_face %s" % new_face.face_file)
        time_end = time.time()
        logger.debug("total time = {}".format(time_end - time_start))
        logger.debug("time to find faces = {}".format(time_find_faces - time_start_looking))
        logger.debug("time to find encodings = {}".format(time_face_encodings - time_find_faces))
        ts.task_status = TaskStatus.SUCCESS
        ts.comment = "Found %s faces" % len(face_encodings)
        return document_id
    except Exception as e:
        logger.exception("Hit an exception in find_faces_task %s" % str(e))
        ts.task_status = TaskStatus.ERROR
        ts.comment = "An exception while finding faces: %s" % repr(e)
    finally:
        logger.debug("Finally clause in find_faces_task")
        if temp_image:
            temp_image.close()
        if temp_file:
            temp_file.close()
        ts.save(update_fields=['task_status', 'comment'])
        logger.debug("find_faces_task END")
The find_faces task is called as part of a larger chain of tasks that manipulate the images. Each image file goes through this chain, where step_1 and step_2 are chords for different image processing steps:
step_1 = chord(group(clean), chordfinisher.si())  # clean creates different image sizes
step_2 = chord(group(jobs), chordfinisher.si())   # jobs include find_faces
transaction.on_commit(lambda: chain(step_1, step_2, faces_2, ocr_job, change_state_task.si(document_id, 'ready')).delay())

@app.task
def chordfinisher(*args, **kwargs):
    return "OK"
The images are large, so it can take up to 30 seconds for the find_faces task to complete. I thought the CELERYD_TASK_SOFT_TIME_LIMIT = 60 would take care of this long processing time.
I am by no means a celery expert, so I assume there is a celery setting or option that I need to enable to make sure the find_faces task completes all the time. I just don't know what that would be.
After some more research, I came up with this suggestion from Lewis Carroll, in this post: "Beware the oom-killer, my son! The jaws that bite, the claws that catch!", and this post: Chaining Chords produces enormously big messages causing OOM on workers, and this post: WorkerLostError: Worker exited prematurely: exitcode 155.
It seems my celery workers may have been running out of memory, as I did find traces of the dreaded oom-killer in my syslogs. I reconfigured my tasks to just be a chain (removed all the groups and chords) so each task runs individually, in sequence, for each image, and the tasks all completed successfully, no matter how many images I processed.
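Roughly, the chain-only wiring looks like this (task names and signatures here are guesses based on the snippets above, not the actual project code):

# Hypothetical chain-only version: every step for one image runs strictly in
# sequence, so no groups/chords build up large result payloads in the broker.
transaction.on_commit(lambda: chain(
    clean.si(document_id),             # create the different image sizes
    find_faces_task.si(document_id),   # face detection / encoding
    faces_2.si(document_id),           # follow-up face processing
    ocr_job.si(document_id),
    change_state_task.si(document_id, 'ready'),
).delay())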

"Rate of traffic exceeds capacity" error on Google Cloud VertexAI but only sending a single prediction request

As in the title. The exact response:
{
    "error": {
        "code": 429,
        "message": "Rate of traffic exceeds capacity. Ramp your traffic up more slowly. endpoint_id: <My Endpoint>, deployed_model_id: <My model>.",
        "status": "RESOURCE_EXHAUSTED"
    }
}
I send a single prediction request which consists of an instance of 1 string. The model is a pipeline of a custom tfidf vectorizer and logistic regression. I timed the loading time: ~0.5s, prediction time < 0.01s.
I can confirm through logs that the prediction is executed successfully but for some reason this is the response I get. Any ideas?
A few things to consider:
Allow your prediction service to serve using multiple workers.
Increase your number of replicas in Vertex, or move to stronger machine types, as long as you keep gaining improvement.
However, there's something worth doing first on the client side, assuming most of your prediction calls go through successfully and the service is only occasionally unavailable:
Configure your prediction client to use Retry (exponential backoff):
from google.api_core.retry import Retry, if_exception_type
import requests.exceptions
from google.auth import exceptions as auth_exceptions
from google.api_core import exceptions

if_error_retriable = if_exception_type(
    exceptions.GatewayTimeout,
    exceptions.TooManyRequests,
    exceptions.ResourceExhausted,
    exceptions.ServiceUnavailable,
    exceptions.DeadlineExceeded,
    requests.exceptions.ConnectionError,       # The last three might be an overkill
    requests.exceptions.ChunkedEncodingError,
    auth_exceptions.TransportError,
)

def _get_retry_arg(settings: PredictionClientSettings):
    return Retry(
        predicate=if_error_retriable,
        initial=1.0,     # Initial delay
        maximum=4.0,     # Maximum delay
        multiplier=2.0,  # Delay's multiplier
        deadline=9.0,    # After 9 secs it won't try again and it will throw an exception
    )

def predict_custom_trained_model_sample(
    project: str,
    endpoint_id: str,
    instance_dict: Dict,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    ...
    response = await client.predict(
        endpoint=endpoint,
        instances=instances,
        parameters=parameters,
        timeout=SOME_VALUE_IN_SEC,
        retry=_get_retry_arg(),
    )
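For what it's worth (my reading, not stated in the original answer): the 429 RESOURCE_EXHAUSTED response in the question corresponds to exceptions.ResourceExhausted / exceptions.TooManyRequests in google.api_core, so the predicate above retries exactly the failure being reported, with delays of roughly 1, 2, and then 4 seconds until the 9-second deadline expires.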

Tensorflow/Keras with django not working correctly with celery

We are building a script for face recognition from videos, mainly using tensorflow for the basic recognition functions.
When we run it directly with python test-reco.py (which takes a video path as a parameter), it works perfectly.
Now we are trying to integrate it into our website, within a celery task.
Here is the main code:
def extract_labels(self, path_to_video):
    if not os.path.exists(path_to_video):
        print("NO VIDEO!")
        return None
    video = VideoFileClip(path_to_video)
    n_frames = int(video.fps * video.duration)
    out = []
    for i, frame in enumerate(video.iter_frames()):
        if self.verbose > 0:
            print(
                'processing frame:',
                str(i).zfill(len(str(n_frames))),
                '/',
                n_frames
            )
        try:
            rect = face_detector(frame[::2, ::2], 0)[0]
            y0, x0, y1, x1 = np.array([rect.left(), rect.top(), rect.right(), rect.bottom()]) * 2
            bbox = frame[x0:x1, y0:y1]
            bbox = resize(bbox, [128, 128])
            bbox = rgb2gray(bbox)
            bbox = equalize_hist(bbox)
            y_hat = self.model.predict(bbox[None, :, :, None], verbose=1, batch_size=1)[0]
            # y_hat = np.ones(7)
            out.append(y_hat)
        except IndexError as e:
            print(out)
            print(e)
We need a try/except because sometimes there isn't any face present in the first frames.
But then this line:
y_hat = self.model.predict(bbox[None, :, :, None], verbose=1, batch_size=1)[0]
blocks, like an endless loop. The bbox isn't empty.
The celery worker simply blocks on it and you can't exit the process (the warm / cold shutdown never occurs).
Is there something specific we need to do to use tensorflow with Celery?
I had a very similar setup and problem. In my case it helped to simply shift all the imports that referenced Keras stuff into a dedicated initializer function, leading to a setup like this:
from celery import Celery
from celery.signals import worker_process_init

CELERY = ...

@worker_process_init.connect()
def init_worker_process(**kwargs):
    # Load all Keras-related imports here
    import ...

@CELERY.task()
def long_running_task(*args, **kwargs):
    # Actual calculation task
    ...
tf.Session (the TensorFlow session) is not fork-safe, and Celery's default prefork workers won't play well with a package that is not fork-safe.
I guess self.model.predict calls into a tf.Session somewhere, and that is where it gets blocked.
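Building on the worker_process_init pattern above, a minimal sketch of loading the model once per worker process, after the fork (the model path and task name are hypothetical, and CELERY is the app object from the snippet above):

from celery.signals import worker_process_init
import numpy as np

MODEL = None  # populated once per worker process

@worker_process_init.connect()
def load_model_per_worker(**kwargs):
    global MODEL
    from tensorflow import keras                          # import TF only after the fork
    MODEL = keras.models.load_model('/path/to/model.h5')  # hypothetical model path

@CELERY.task()
def predict_task(frame_batch):
    # MODEL (and its underlying session/graph) was created in this process,
    # so predict() does not touch state inherited from the parent process.
    return MODEL.predict(np.asarray(frame_batch)).tolist()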

How does this code for parallel tasks work in Python?

I've been using the script below to run some tasks in parallel on an Ubuntu server with 16 processors. It actually works, but I have a few questions about it:
What is the code actually doing?
The more workers I set up, the faster the script runs, but what is the limit on workers? I've run 100.
How could I improve it?
#!/usr/bin/env python
from multiprocessing import Process, Queue
from executable import run_model
from database import DB
import numpy as np

def worker(work_queue, db_conection):
    try:
        for phone in iter(work_queue.get, 'STOP'):
            registers_per_number = retrieve_CDRs(phone, db_conection)
            run_model(np.array(registers_per_number), db_conection)
            #print("The phone %s was already run" % (phone))
    except Exception:
        pass
    return True

def retrieve_CDRs(phone, db_conection):
    return db_conection.retrieve_data_by_person(phone)

def main():
    phone_numbers = np.genfromtxt("../listado.csv", dtype="int")[:2000]
    workers = 16
    work_queue = Queue()
    processes = []
    #print("Process started with %s" % (workers))
    for phone in phone_numbers:
        work_queue.put(phone)
        #print("Phone %s put at the queue" % (phone))
        #print("The queue %s" % (work_queue))
    for w in xrange(workers):
        #print("The worker %s" % (w))
        # new conection to data base
        db_conection = DB()
        p = Process(target=worker, args=(work_queue, db_conection))
        p.start()
        #print("Process %s started" % (p))
        processes.append(p)
    work_queue.put('STOP')
    for p in processes:
        p.join()

if __name__ == '__main__':
    main()
Cheers!
At first, start from the main function:
It creates a numpy array of 2000 integer phone numbers read from a CSV file.
Then it creates some variables and lists.
Next, it puts every phone number extracted from the CSV file onto a queue.
Next, for each of the 16 workers, it creates a DB connection, sets up the process arguments, and starts the worker process.
Hope that helps you to understand the code. What you have is multiprocessing rather than multi-threading: each worker is a separate process, so the work really does run in parallel, which is why adding workers speeds things up. With 2000 phone numbers on the queue there is, in principle, work for up to 2000 workers; beyond that, extra workers would just sit idle, and parallel processing works best when you minimise the number of idle workers. In practice the useful limit is much lower, roughly the number of processors for CPU-bound work, so improving the script mostly comes down to keeping every worker busy rather than adding more of them.
Hope that helps. Cheers!
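One concrete improvement worth flagging (an observation on the code above, not from the original answer): main() puts a single 'STOP' sentinel on the queue, so only one worker ever receives it; the others can stay blocked in work_queue.get() and their p.join() never returns. A sketch of a fixed main(), reusing worker, DB, and np from the script above, with one sentinel per worker and the pool sized to the machine:

from multiprocessing import Process, Queue, cpu_count

def main():
    phone_numbers = np.genfromtxt("../listado.csv", dtype="int")[:2000]
    workers = min(16, cpu_count())     # little benefit beyond the core count for CPU-bound work
    work_queue = Queue()
    for phone in phone_numbers:
        work_queue.put(phone)
    for _ in range(workers):           # one 'STOP' per worker so every process can exit cleanly
        work_queue.put('STOP')
    processes = []
    for _ in range(workers):
        db_conection = DB()            # one DB connection per process, as in the original
        p = Process(target=worker, args=(work_queue, db_conection))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()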