Training model on AWS Deep Learning AMI instance - gets 'killed' with warnings

I am trying to train an InceptionResNetV2 model on my own dataset on Amazon's Deep Learning AMI.
When I train on my local machine, training starts as usual, but when I train on the AWS instance the process gets killed.
First I tried to train with the MXNet backend. It gave the following error (screenshot):
Notice that it gets killed.
So in ~/.keras/keras.json (opened with nano) I tried to set the image data format to channels_first:
{
"image_data_format": "channels_first",
"backend": "mxnet"
}
Then I got the error:
Traceback (most recent call last):
File "train.py", line 17, in <module>
model = applications.inception_resnet_v2.InceptionResNetV2(include_top=False, weights='imagenet', input_shape = (img_width, img_height, 3))
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/keras_applications/inception_resnet_v2.py", line 243, in InceptionResNetV2
weights=weights)
File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/keras_applications/imagenet_utils.py", line 296, in _obtain_input_shape
'`input_shape=' + str(input_shape) + '`')
ValueError: The input must have 3 channels; got `input_shape=(182, 182, 3)`
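For reference, this ValueError is expected once image_data_format is set to channels_first: keras_applications then requires the channel axis first in input_shape, e.g. (3, 182, 182) instead of (182, 182, 3). A minimal sketch of the adjusted call, keeping the rest of the training script unchanged:
from keras import applications

img_width, img_height = 182, 182
model = applications.inception_resnet_v2.InceptionResNetV2(
    include_top=False,
    weights='imagenet',
    input_shape=(3, img_width, img_height))  # channels first: (channels, height, width)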
Then I switched to the TensorFlow backend to see how it would play out, since I might be misunderstanding how this process works. But when I started training with the TensorFlow backend, I got the following error (screenshot):
As you can see, it gets killed again.
I am not sure what to do next. Some help would be great.
P.S. I am sorry for the screenshots; you're going to have to zoom in a little to get a better view.

The Deep Learning AMI is mostly not supported on the t2 instance types. It should work on most of the larger CPU instance types (such as C4 and C5), on the GPU instance types (G3, P2, and P3), and on many other instance types.
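In this context, a bare "Killed" message typically means the Linux out-of-memory killer terminated the training process because the instance ran out of RAM, which fits the instance-type explanation above. A minimal sanity check you could run before training, assuming psutil is available in the conda environment:
import psutil

mem = psutil.virtual_memory()
print("Total RAM: %.1f GB, available: %.1f GB" % (mem.total / 1e9, mem.available / 1e9))
# Loading InceptionResNetV2 with ImageNet weights plus the data pipeline needs several GB of RAM,
# which small t2 instances cannot provide.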

Related

How to test pytorch GPU code on a CPU machine

I want to verify on CPU machine that I've moved all tensors to CUDA device correctly. Is there a way to create a mock device that is still computed on CPU but marked to be incompatible with tensors I forget to move to this device?
Motivation: this would speed up my development cycle as I don't need to deploy code and run it on the GPU server only to find out I forgot to specify device on some tensor / model.
Maybe the meta device is what you need.
>>> a = torch.ones(2) # defaults to cpu
>>> b = torch.ones(2, device='meta')
>>> c = a + b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, meta and cpu!
Note that the original intention of the meta device is to compute the shape and dtype of an output without actually doing the full computation. Tensors on the meta device have no data in them, so I am not sure whether there are aspects of the test where the meta device behaves differently from a "real" device.
Interestingly, I couldn't find much documentation about the meta device at all; I found it mentioned here: https://github.com/pytorch/pytorch/issues/61654#issuecomment-879989145
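A minimal sketch of that shape-and-dtype use, assuming a reasonably recent PyTorch (1.9 or later):
import torch
import torch.nn as nn

model = nn.Linear(128, 10).to('meta')    # parameters become meta tensors holding no data
x = torch.empty(32, 128, device='meta')  # a "fake" input batch, also without data
out = model(x)                           # no real computation is performed
print(out.shape, out.dtype, out.device)  # torch.Size([32, 10]) torch.float32 meta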

How to prevent trials execution on the head

I'm using ray.tune on an AWS "Autoscaling GPU cluster". Currently, my head and workers all have a GPU and are all used to execute trials. I'm trying to move to a setup where the head doesn't have a GPU, along the lines of how Ray's docs define an "Autoscaling GPU cluster". However, I keep running into CUDA problems on the head, which makes sense since it is used for trial execution. The solution appears simple enough: I need to prevent trial execution on the head, but I can't find out how. I tried various resources_per_trial values, and the same with ray.init(), but didn't get this to work.
Additional details:
I use ray 0.8.6.
I set resources_per_trial={'gpu': 1}
I set torch.device("cuda:0") everywhere
I use 1 head (CPU only) and 1 worker (GPU only), and I require a minimum of 1 worker.
So everything is set up to run only on the GPU, which is why I focused on preventing execution on the head.
Regarding errors and warnings, I get the following:
WARNING tune.py:318 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
WARNING ray_trial_executor.py:549 -- Allowing trial to start even though the cluster does not have enough free resources. Trial actors may appear to hang until enough resources are added to the cluster (e.g., via autoscaling). You can disable this behavior by specifying `queue_trials=False` in ray.tune.run().
WARNING worker.py:1047 -- The actor or task with ID ffffffffffffffff128bce290200 is pending and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {node:10.160.26.189: 1.000000}, {object_store_memory: 12.304688 GiB}, {CPU: 3.000000}, {memory: 41.650391 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
Even when I wait for the GPU worker to be up and running, I still get the above.
Finally, the error is:
ERROR trial_runner.py:520 -- Trial TrainableAE_a441f_00000: Error processing event.
Traceback (most recent call last):
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 468, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 1467, in get
values = worker.get_objects(object_ids, timeout=timeout)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 306, in get_objects
return self.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/worker.py", line 281, in deserialize_objects
return context.deserialize_objects(data_metadata_pairs, object_ids)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 312, in deserialize_objects
self._deserialize_object(data, metadata, object_id))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 252, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 233, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/ray/serialization.py", line 221, in _deserialize_pickle5_data
obj = pickle.loads(in_band)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/storage.py", line 136, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 773, in _legacy_load
result = unpickler.load()
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 729, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 178, in default_restore_location
result = fn(storage, location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 154, in _cuda_deserialize
device = validate_cuda_device(location)
File "/opt/anaconda/2020/envs/py_37_pands0.25/lib/python3.7/site-packages/torch/serialization.py", line 138, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Thanks to richliaw for the comments. The solution became obvious once I stopped trying to prevent trial execution on the head and instead focused on finding out why trials were being executed there in the first place. The AMI on the head of my cluster had the NVIDIA drivers and CUDA installed. After I removed those, Ray stopped trying to execute trials on the head. So I guess this is how Ray decides whether to send computation to the head when resources_per_trial={'gpu': 1}.
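For reference, another approach that is often suggested for this situation (not the fix described above) is to explicitly tell Ray that the head node owns zero GPUs, so trials requesting {'gpu': 1} can only be scheduled on workers. In a cluster config this is typically done by starting the head with ray start --head --num-gpus=0; when Ray is started directly from the driver on the head, a minimal sketch of the equivalent is:
import ray

ray.init(num_gpus=0)            # the local (head) node advertises no GPU resources
print(ray.cluster_resources())  # any GPUs listed now come only from worker nodes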

Celery with Redis and Django giving WorkerLostError on long running tasks

I have a long-running Celery task that computes the PDP (partial dependence) of a feature. Below is the shared task that's run:
import numpy as np
import xgboost as xgb
from celery import shared_task


@shared_task
def get_pdp_single(bst, train_df, feature, value, f_id=-1):
    # Fix the target feature column to a single value and average the model's predictions
    x_temp = train_df.copy()
    x_temp.iloc[:, f_id] = value
    data = xgb.DMatrix(x_temp, feature_names=x_temp.columns.tolist())
    predictions = bst.predict(data)
    avg_predictions = np.mean(predictions)
    result_dict = {
        "feature": feature,
        "avg_predictions": avg_predictions.item()
    }
    return result_dict
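For context, the fan-out side is dispatched roughly as in the sketch below; pdp_grid is a hypothetical iterable of (feature, column index, value) triples standing in for the actual grid, not the exact code from the project:
from celery import group

job = group(
    get_pdp_single.s(bst, train_df, feature, value, f_id=f_id)
    for feature, f_id, value in pdp_grid
)
results = job.apply_async()  # queues all the subtasks on Redis
pdp_values = results.get()   # blocks until every subtask has finished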
I'm computing the H-statistics of all the features used in the XGBoost model I built, so lots of such tasks get queued in the broker (Redis): about 12k tasks get queued into Redis for this.
I have an 8-core, 16 GB VM on which I instantiate a single Celery worker to run these tasks. Each child task takes ~40 seconds to complete, because the XGBoost predict method takes its time.
On such long runs I invariably get WorkerLostErrors, and it is quite unpredictable when and how they occur. However, I'm pretty sure this is because of the number of tasks queued on the broker, because ~4-5k tasks run fine on the same setup without any issues.
Below is the stack trace that I get on Celery.
Restarting celery worker (/~/anaconda3/envs/py35_clone_canary/bin/celery -A ba_tpe_python_service worker -Q staging_celery_queue --loglevel=info)
Traceback (most recent call last):
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/worker.py", line 205, in start
self.blueprint.start(self)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/loops.py", line 74, in asynloop
state.maybe_shutdown()
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/celery-4.4.0rc3-py3.5.egg/celery/worker/state.py", line 80, in maybe_shutdown
raise WorkerShutdown(should_stop)
celery.exceptions.WorkerShutdown: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/~/anaconda3/envs/py35_clone_canary/lib/python3.5/site-packages/billiard-3.6.1.0-py3.5.egg/billiard/pool.py", line 1267, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 70.
I have also looked at multiple issues reported on the GitHub pages of Celery and Billiard. The suggested solution was to use the latest versions of Celery and Billiard. I took the latest master branches from their respective repositories and built them in my environment, but I am still facing the same issue.
Celery version used: 4.4.0rc3
Billiard version used: 3.6.1.0
Please help me in debugging the issue.

Run tflite accuracy tool on official tensorflow resnet50 model

I have downloaded the official resnet50 model provided here: https://github.com/tensorflow/models/tree/master/official/resnet. I needed a quantized TFLite version of this model, and hence converted it to the TFLite format as follows:
toco --output_file /tmp/resnet50_quant.tflite --saved_model_dir <path/to/saved_model_dir> --output_format TFLITE --quantize_weights QUANTIZE_WEIGHTS
After this, I thought I'd run the TFLite accuracy tool to verify that the accuracy of this model is still reasonable. However, it looks like I run into the following issue:
bazel run -c opt --copt=-march=native --cxxopt='--std=c++11' -- //tensorflow/contrib/lite/tools/accuracy/ilsvrc:imagenet_accuracy_eval --model_file=/tmp/resnet50_quant.tflite --ground_truth_images_path=<path/to/images> --ground_truth_labels=/tmp/validation_labels.txt --model_output_labels=/tmp/tf_labels.txt --output_file_path=/tmp/accuracy_output.txt --num_images=0
INFO: Analysed target //tensorflow/contrib/lite/tools/accuracy/ilsvrc:imagenet_accuracy_eval (0 packages loaded).
INFO: Found 1 target...
Target //tensorflow/contrib/lite/tools/accuracy/ilsvrc:imagenet_accuracy_eval up-to-date:
bazel-bin/tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval
INFO: Elapsed time: 14.589s, Critical Path: 14.28s
INFO: 3 processes: 3 local.
INFO: Build completed successfully, 4 total actions
INFO: Running command line: bazel-bin/tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval '--model_file=/tmp/resnet50_quant.tflite' '--ground_truth_images_path=<path/to/images>' '--ground_truth_labels=/tmp/validation_labels.txt' '--model_output_labels=/tmp/tf_labels.txt' '--output_file_path=/tmp/accuracy_output.txt' 'INFO: Build completed successfully, 4 total actions
2018-10-12 15:30:06.237058: E tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval.cc:155] Starting evaluation with: 4 threads.
2018-10-12 15:30:06.536802: E tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval.cc:98] Starting model evaluation: 50000
2018-10-12 15:30:06.565334: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at run_tflite_model_op.cc:89 : Invalid argument: Data shapes mismatch for tensors: 0 expected: [64,224,224,3] got: [1,224,224,3]
2018-10-12 15:30:06.565453: F tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_model_evaluator.cc:222] Non-OK-status: eval_pipeline->Run(CreateStringTensor(image_label.image), CreateStringTensor(image_label.label)) status: Invalid argument: Data shapes mismatch for tensors: 0 expected: [64,224,224,3] got: [1,224,224,3]
[[{{node stage_run_tfl_model_output}} = RunTFLiteModel[input_type=[DT_FLOAT], model_file_path="/tmp/resnet50_quant.tflite", output_type=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](stage_inception_preprocess_output)]]
It looks like the issue is that the official resnet model has an input tensor of [64, 224, 224, 3], whereas the accuracy tool provides an input of [1, 224, 224, 3]. So the official model seems to expect a batch of 64 images, and hence the accuracy tool fails.
I was wondering what I need to do to get the accuracy tool to run on the official resnet50 model. I'm guessing that although the input tensor for resnet50 is [64, 224, 224, 3], there should still be a way to run a single image through the model.
There are two ways to go about it:
Resize the input of your model to [1, 224, 224, 3] and run the tool (a rough interpreter-level check is sketched at the end of this answer).
You could try looking at this and then modifying this file accordingly.
Alternatively, modify the tool itself so that it feeds in 64 images at a time instead of 1; you can adapt the same code file I point to above.
If you're looking for long-term support, consider filing a feature request on GitHub so that batching can be supported.
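For option 1, a quick way to confirm the converted model can be resized to a batch of one is the Python TFLite interpreter; this is only a sketch with a recent TensorFlow (a weights-quantized model still takes float32 inputs):
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='/tmp/resnet50_quant.tflite')
input_index = interpreter.get_input_details()[0]['index']
interpreter.resize_tensor_input(input_index, [1, 224, 224, 3])  # batch 64 -> 1
interpreter.allocate_tensors()

dummy = np.zeros((1, 224, 224, 3), dtype=np.float32)
interpreter.set_tensor(input_index, dummy)
interpreter.invoke()
output = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
print(output.shape)  # should be (1, num_classes) if the resize succeeded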

No space left on device in Sagemaker model training

I'm using a custom algorithm shipped in a Docker image, running on a p2 instance with AWS SageMaker (a bit similar to https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).
At the end of the training process, I try to write my model to the output directory that is mounted via SageMaker (as in the tutorial), like this:
model_path = "/opt/ml/model"
model.save(os.path.join(model_path, 'model.h5'))
Unfortunately, the model apparently gets too big over time, and I get the
following error:
RuntimeError: Problems closing file (file write failed: time = Thu Jul
26 00:24:48 2018
00:24:49 , filename = 'model.h5', file descriptor = 22, errno = 28,
error message = 'No space left on device', buf = 0x1a41d7d0, total
write[...]
So all my hours of GPU time are wasted. How can I prevent this from happening again? Does anyone know what the size limit is for models stored on SageMaker-mounted directories?
When you train a model with the Estimator API, the training instance defaults to 30 GB of storage, which may not be enough. You can use the train_volume_size parameter on the constructor to increase this value. Try a large-ish number (like 100 GB) and see how big your model is. In subsequent jobs, you can tune the value down to something closer to what you actually need.
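A minimal sketch of where that parameter goes, assuming the SageMaker Python SDK v1 Estimator API for a bring-your-own-container job (the image URI, role, and S3 path below are hypothetical; in SDK v2 the parameter was renamed to volume_size):
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name='123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest',  # hypothetical image URI
    role='arn:aws:iam::123456789012:role/SageMakerRole',                       # hypothetical execution role
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
    train_volume_size=100)  # GB of EBS storage on the training instance; the default is 30
estimator.fit('s3://my-bucket/training-data')  # hypothetical S3 input path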
Storage costs $0.14 per GB-month of provisioned storage. Partial usage is prorated, so giving yourself some extra room is a cheap insurance policy against running out of storage.
In the SageMaker Jupyter notebook, you can check free space on the filesystem(s) by running !df -h. For a specific path, try something like !df -h /opt.
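A Python equivalent of the df -h check that can run inside the training job itself, right before model.save():
import shutil

total, used, free = shutil.disk_usage("/opt/ml/model")
print("Free space on /opt/ml/model: %.1f GB" % (free / 1e9))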