How to prevent trial execution on the head node - ray

I'm using ray.tune on an AWS "Autoscaling GPU cluster". Currently, my head and workers all have a GPU and are all used to execute trials. I'm trying to move to a setup where the head doesn't have a GPU,
along the lines of how Ray's docs define an "Autoscaling GPU cluster". However, I keep running into CUDA problems on the head, which makes sense since it is used for trial execution. The solution seems simple enough: I need to prevent trial execution on the head, but I can't find out how. I tried various resources_per_trial values, and likewise with ray.init(), but didn't get this to work.
Additional details:
I use ray 0.8.6.
I set resources_per_trial={'gpu': 1}
I set torch.device("cuda:0") everywhere
I use 1 head (CPU only) and 1 worker (GPU only); I require a minimum of 1 worker.
So everything is meant to run only on the GPU, which is why I focused on preventing execution on the head.
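For reference, the trials are launched roughly like this (a simplified sketch; TrainableAE is my Trainable subclass and the config shown here is just a placeholder):

import ray
from ray import tune

ray.init(address="auto")  # connect to the existing cluster

tune.run(
    TrainableAE,  # my Trainable subclass, body omitted here
    config={"lr": tune.grid_search([1e-4, 1e-3])},  # placeholder search space
    resources_per_trial={"gpu": 1},
)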
Regarding errors and warnings, I get the following:
WARNING tune.py:318 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
WARNING ray_trial_executor.py:549 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
WARNING worker.py:1047 -- The actor or task with ID ffffffffffffffff128bce290200 is pending and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {node:10.160.26.189: 1.000000}, {object_store_memory: 12.304688 GiB}, {CPU: 3.000000}, {memory: 41.650391 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
Even when I wait for the GPU worker to be up and running, I still get the above.
Finally, the error is:
ERROR trial_runner.py:520 -- Trial TrainableAE_a441f_00000: Error processing event.
Traceback (most recent call last):
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 468, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 1467, in get
values = worker.get_objects(object_ids, timeout=timeout)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 306, in get_objects
return self.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 281, in deserialize_objects
return context.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 312, in deserialize_objects
self._deserialize_object(data, metadata, object_id))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 252, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 233, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 221, in _deserialize_pickle5_data
obj = pickle.loads(in_band)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/storage.py", line 136, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 773, in _legacy_load
result = unpickler.load()
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 729, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 178, in default_restore_location
result = fn(storage, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 154, in _cuda_deserialize
device = validate_cuda_device(location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 138, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Thanks to richliaw for the comments. The solution became obvious once I stopped trying to prevent trial execution on the head and instead focused on why those trials were landing there in the first place. The AMI on my cluster's head node had the NVIDIA drivers and CUDA installed. After I removed those, Ray stopped trying to execute trials on the head. So I guess that is how Ray decides whether to send work to the head when resources_per_trial={'gpu': 1}: it auto-detects a GPU from the installed drivers.
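For anyone hitting the same thing: a quick way to confirm what the scheduler thinks each node has is to print the cluster resources, which is where an auto-detected GPU on the head would show up (a minimal sketch). I believe starting the head with ray start --head --num-gpus=0 would also hide the GPU from the scheduler, but I haven't tested that route.

import ray

ray.init(address="auto")
# If the head node still advertises {"GPU": 1.0} here, Ray has auto-detected a GPU
# on it (e.g. because the NVIDIA drivers / CUDA toolkit are installed on its AMI).
print(ray.cluster_resources())  # aggregate resources across the cluster
print(ray.nodes())              # per-node detail, including each node's "Resources" field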

Related

Celery with Redis and Django giving WorkerLostError on long running tasks

I have a long-running Celery task that computes the PDP of a feature. Below is the shared task that's run:
import numpy as np
import xgboost as xgb
from celery import shared_task

@shared_task
def get_pdp_single(bst, train_df, feature, value, f_id=-1):
    # Fix the chosen feature column to `value` and average the model's predictions
    x_temp = train_df.copy()
    x_temp.iloc[:, f_id] = value
    data = xgb.DMatrix(x_temp, feature_names=x_temp.columns.tolist())
    predictions = bst.predict(data)
    avg_predictions = np.mean(predictions)
    result_dict = {
        "feature": feature,
        "avg_predictions": avg_predictions.item()
    }
    return result_dict
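For context, the tasks are fanned out roughly like this (illustrative only; feature_names and grid_values stand in for my actual feature grid):

# Illustrative only: each (feature, value) pair becomes one task on the broker,
# and bst / train_df get serialized into every message.
for f_id, feature in enumerate(feature_names):
    for value in grid_values[feature]:
        get_pdp_single.delay(bst, train_df, feature, value, f_id=f_id)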
I'm computing H-statistics of all the features in the XGBoost model that was built, so lots of such tasks get queued on the broker (Redis): roughly 12k tasks for this.
I have an 8-core, 16 GB VM on which I instantiate a single Celery worker to do this work. Each child task takes ~40 seconds to complete, because the XGBoost predict method takes its time.
On such long-running work I invariably get WorkerLostErrors, and it is quite unpredictable when and how they occur. However, I'm pretty sure it is because of the number of tasks queued on the broker, since ~4-5k tasks run fine on the same setup without any issues.
Below is the stack trace that I get on Celery.
Restarting celery worker (/~/anaconda3/envs/py35_clone_canary/bin/celery -A ba_tpe_python_service worker -Q staging_celery_queue --loglevel=info)
Traceback (most recent call last):
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/worker.py", line 205, in start
self.blueprint.start(self)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/loops.py", line 74, in asynloop
state.maybe_shutdown()
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/state.py", line 80, in maybe_shutdown
raise WorkerShutdown(should_stop)
celery.exceptions.WorkerShutdown: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/billiard-3.6.1.0-py3.5.egg/billiard/pool.py", line 1267, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 70.
I have also looked at multiple issues reported on the GitHub pages of Celery and Billiard. The suggested solution was to use the latest versions of Celery and Billiard. I took the latest master branch of each repo and built it in my environment, but I'm still facing the same issue.
Celery version used: 4.4.0rc3
Billiard version used: 3.6.1.0
Please help me debug this issue.

What is the root cause of PyArrow HDFS IO error?

I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below, not on every run, only sometimes. I'm unable to determine the root cause of this issue; does anyone have any ideas?
File "/extractor.py", line 87, in __call__
json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
Turns out this was being caused by duplicated computation of "dask.get" tasks on Delayed objects, which was leading to multiple processes attempting to write to the same file.
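A rough sketch of the shape of the fix, assuming the dumps are driven by Delayed objects (extract_and_dump and partitions are placeholders for the actual callable and inputs): build the full list of delayed writes and compute them in a single call, so no task, and therefore no file write, gets scheduled twice.

import dask
from dask import delayed

# Placeholder names: extract_and_dump writes one JSON file, partitions is the work list.
writes = [delayed(extract_and_dump)(part, "/out/part-%d.json" % i)
          for i, part in enumerate(partitions)]
dask.compute(*writes)  # one compute call, so each delayed task runs exactly once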

Training model on AWS Deep Learning AMI instance - gets 'killed' with warnings

I am trying to train an Inception-ResNetV2 model on my own dataset on Amazon's Deep Learning AMI.
When I train on my local machine the training starts as usual, but when I try to train on the AWS instance it gets killed.
First I tried to train with the MXNet backend. It gave the following error (see screenshot):
Notice that it gets killed.
So in
nano ~/.keras/keras.json
I tried to set the image data format to channels_first:
{
    "image_data_format": "channels_first",
    "backend": "mxnet"
}
Then I got the error:
Traceback (most recent call last):
File "train.py", line 17, in <module>
model = applications.inception_resnet_v2.InceptionResNetV2(include_top=False, weights='imagenet', input_shape = (img_width, img_height, 3))
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/keras_applications/inception_resnet_v2.py", line 243, in InceptionResNetV2
weights=weights)
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/keras_applications/imagenet_utils.py", line 296, in _obtain_input_shape
'`input_shape=' + str(input_shape) + '`')
ValueError: The input must have 3 channels; got `input_shape=(182, 182, 3)`
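For what it's worth, that ValueError looks like a shape-ordering issue: with "image_data_format": "channels_first", keras_applications apparently reads the first entry of input_shape as the channel count, so presumably the call would need the channel axis first, roughly:

# Presumed fix for the channels_first ValueError only; it does not address the "killed" problem.
model = applications.inception_resnet_v2.InceptionResNetV2(
    include_top=False,
    weights='imagenet',
    input_shape=(3, img_width, img_height))  # channel axis first instead of (182, 182, 3)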
Then I tried to switch to the TensorFlow backend to see how it plays out, because there might be some misunderstanding on my part about how this process works. But when I switched to the TensorFlow backend and started training, I got the following error (again in a screenshot):
As you can see, it gets killed again.
I am not sure what to do next. Some help would be great.
P.S. I am sorry for the screenshots; you're going to have to zoom in a little to get a better view.
The Deep Learning AMI is mostly not supported on the t2 instance type. It should work on most of the larger CPU instance types (like C4, C5) or GPU instance types (G3, P2 and P3), and many other instance types.

Amazon EMR PySpark: rdd.distinct().count() failing

I am currently working with an EMR cluster connecting to RDS to gather 2 tables.
The two RDDs created are quite huge, but I can perform .take(x) operations on them.
I can also perform more complex operations such as:
info_rdd = somerdd.map(lambda x: (x[1], x[2])).groupByKey().map(some_lambda)
apps_rdd = apps.join(info_rdd).map(lambda x: (x[0], (x[1][0], x[1][1][0], x[1][1][1])))
But the following operation, counting the number of distinct users imported from RDS, does not work:
unique_users = rdd.distinct().count()
I have tried many configurations to see if it was a memory issue (just in case), but that did not solve the problem ...
These are the errors I am getting now:
Traceback (most recent call last):
File "/home/hadoop/AppEngine/src/server.py", line 56, in <module>
run_server()
File "/home/hadoop/AppEngine/src/server.py", line 53, in run_server
AppServer().run()
File "/home/hadoop/AppEngine/src/server.py", line 45, in run
api = create_app(self.context, self.apps, self.devices)
File "/home/hadoop/AppEngine/src/api.py", line 190, in create_app
engine = AppEngine(spark_context, apps, devices)
File "/home/hadoop/AppEngine/src/engine.py", line 56, in __init__
self.unique_users = self.ratings.distinct().count()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1041, in count
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1032, in sum
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 906, in fold
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 809, in collect
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.5 in stage 0.0 (TID 5, ip-172-31-3-140.eu-west-1.compute.internal, executor 13): ExecutorLostFailure (executor 13 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 164253 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
The solution for the problem was the following:
I did not have enough memory to perform the task.
I changed the type of the core instances I was using in my cluster to an instance type with more memory available (m4.4xlarge here).
Then I had to specify parameters to force the memory allocation of my instances for spark-submit:
--driver-memory 2G
--executor-memory 50G
You can also add these parameters to avoid a long task failing because of the heartbeat or the memory allocation:
--conf spark.yarn.executor.memoryOverhead=XXX (large number such as 1024 or 4096)
--conf spark.executor.heartbeatInterval=60s
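If you prefer to keep these in code, the executor-side settings can also go into the SparkConf when the context is built (a sketch; driver memory generally still has to be passed to spark-submit itself, since the driver JVM is already running by the time this code executes):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "50g")
        .set("spark.yarn.executor.memoryOverhead", "4096")  # in MB
        .set("spark.executor.heartbeatInterval", "60s"))
sc = SparkContext(conf=conf)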
ExecutorLostFailure Reason: Executor heartbeat timed out after 164253 ms
This error means that the executor didn't respond after 165 seconds and was killed (under the assumption that it is dead).
If by any chance you have a task which occupies the executor for such a long time and needs to be executed, you can try the following setting on the spark-submit command line, which increases the heartbeat timeout to a huge amount of time, as mentioned here: https://stackoverflow.com/a/37260231/5088142
Some methods for investigating this issue can be found here: https://stackoverflow.com/a/37272249/5088142
The sections below try to clarify some of the issues raised in your question.
Spark Actions vs Transformations
Spark uses lazy evaluation, i.e. when you perform a transformation it doesn't execute it; Spark executes only when you perform an action.
In the complex-operations example you gave, there is no action (i.e. nothing was executed/computed):
info_rdd = somerdd.map(lambda x: (x[1], x[2])).groupByKey().map(some_lambda)
apps_rdd = apps.join(info_rdd).map(lambda x: (x[0], (x[1][0], x[1][1][0], x[1][1][1])))
Reviewing the Spark documentation on transformations,
you can see that all the operations used in the example (map, groupByKey and join) are transformations.
Hence nothing was actually done after you executed those commands.
The difference between actions
The two RDDs created are quite huge but I can perform .take(x) operations on them.
There is a difference between the take(x) action and count():
The take(x) action ends after it has returned the first x elements.
The count() action ends only after it has passed over the entire RDD.
The fact that you executed some transformations (as in the example) which seemed to be running means nothing, as they weren't actually executed.
Running a take(x) action can't give any indication either, as it uses only a very small portion of your RDD.
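To make that concrete, a sketch using the RDDs from your own code (somerdd and apps as in the question):

# map / groupByKey / join are transformations: nothing runs here, Spark only records the lineage.
info_rdd = somerdd.map(lambda x: (x[1], x[2])).groupByKey().map(some_lambda)
apps_rdd = apps.join(info_rdd).map(lambda x: (x[0], (x[1][0], x[1][1][0], x[1][1][1])))

apps_rdd.take(5)   # action, but it stops as soon as 5 elements have been produced
apps_rdd.count()   # action that must evaluate every partition -- this is the full-size job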
Conclusion
It seems like the configuration of your machine doesn't support the size of the data you are using, or your code creates huge tasks which cause the executors to hang for a long period of time (160 seconds).
The first action which was actually executed on your RDD was the count action.

How to end APScheduler job after set number of seconds?

I use APScheduler to schedule a job which calls an API every minute. I now get a massive error which ends with:
File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
File "/usr/lib/python2.7/ssl.py", line 341, in recv
return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 260, in read
return self._sslobj.read(len)
SSLError: The read operation timed out
WARNING:apscheduler.scheduler:Execution of job "getAndStoreAPICallResult
(trigger: cron[minute='*'], next run at: 2014-07-08 15:37:00)" skipped: maximum
number of running instances reached (1)
So I guess the API call somehow doesn't get a response and therefore never finishes. This blocks the next job, because the first one never ended, so a new run can't be started.
I could of course increase the number of allowed concurrent running instances, but that wouldn't really solve the problem. I guess I need to make the job end if it hasn't finished after a certain number of seconds (let's say 5).
Because I've got a couple of other API-call jobs that I start with APScheduler, it would be awesome if I could somehow solve this with APScheduler itself. Does anybody know whether APScheduler makes it possible to terminate jobs that run too long, or do I need to solve this in another way?
All tips are welcome!
Since there is no way to terminate a thread from outside, consider setting a shorter SSL timeout.
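A minimal sketch of that suggestion for the Python 2.7 stdlib stack shown in the traceback (the 5-second value is just the example from the question):

import socket

# Give new sockets a default timeout so a hung HTTPS read raises an exception
# (socket.timeout / SSLError) instead of blocking the job forever.
socket.setdefaulttimeout(5)

# getAndStoreAPICallResult can then be scheduled as before; a timed-out call now
# fails fast, so the next cron run is no longer skipped for hitting max instances.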