gcloud ai-platform local train not running in jupyter notebook - google-cloud-platform

This is one unsolved part of another post. I am trying to submit a Google Cloud job that trains a CNN model on MNIST digits.
My setup: Windows 10, Anaconda, Jupyter Notebook 6, Python 3.6, TF 1.13.0.
I use the gcloud command for local training. The second cell gets stuck at the [*] status and shows nothing until I close and halt the ipynb file. Training starts right after that, and the results are correct, as I verified in TensorBoard.
The same command runs in a terminal without this issue. I also submitted the job to the cloud, and it finished successfully.
Any thoughts on the local training problem? The code is below.
OUTDIR='trained_test'
INPDIR='..\data'
shutil.rmtree(path = OUTDIR, ignore_errors = True)
!gcloud ai-platform local train \
--module-name=trainer.task \
--package-path=trainer \
-- \
--output_dir=$OUTDIR \
--input_dir=$INPDIR \
--epochs=2 \
--learning_rate=0.001 \
--batch_size=100
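One thing that might be worth trying (a sketch only, not a confirmed fix for the hang described above) is to launch the command from Python with subprocess and print each output line as it arrives, instead of using the ! shell escape. The helper name run_and_stream is made up for illustration, and the demo uses a stand-in command rather than gcloud so that it runs anywhere. On Windows the gcloud launcher is typically gcloud.cmd, so the first list element may need to be adjusted.

```python
import subprocess
import sys


def run_and_stream(cmd):
    """Run cmd (a list of arguments) and echo its output line by line.

    Hypothetical helper for illustration. Returns (exit_code, lines).
    """
    lines = []
    # Merge stderr into stdout so that log messages are not lost.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    for line in proc.stdout:
        line = line.rstrip("\n")
        lines.append(line)
        # Flush so the notebook shows the line as soon as it is read.
        print(line, flush=True)
    proc.stdout.close()
    return proc.wait(), lines


if __name__ == "__main__":
    # Stand-in command; replace the list with the gcloud arguments,
    # e.g. ["gcloud", "ai-platform", "local", "train", ...] (assumption).
    run_and_stream([sys.executable, "-c", "print('training step 1')"])
```

If the output appears line by line with this approach, that would suggest the problem lies in how the notebook's ! escape handles the child process rather than in the trainer itself.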

Related

How can we get GCP Project ID on AI Platform Training?

I want to get my GCP Project ID from within an AI Platform training job.
I tried to:
use the metadata server
run gcloud config get-value project
but the AI Platform instance seems to run outside my GCP project.
One thing you can do is to pass --project $PROJECT_ID as an application parameter when you launch the job (docs). As an example based on this sample:
# Job configuration parameters go before the bare "--";
# application parameters (passed through to the trainer) go after it.
gcloud ai-platform jobs submit training ${JOB_NAME} \
--stream-logs \
--config=./config.yaml \
-- \
--project=${PROJECT_ID} \
--num-layers=3
Then, in task.py (or file defined in --module-name) you can add:
args_parser.add_argument(
'--project',
help='Service Project ID where ML jobs are launched.',
required=True)
and then simply access it with args.project:
logging.info('Project ID: {}'.format(args.project))
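Putting the pieces together, a minimal self-contained sketch of the argument handling in task.py could look like the following. The parse_args function and the --num-layers default are assumptions made for illustration; only the --project argument comes from the answer above. parse_known_args is used so that extra flags the service may append (such as --job-dir) do not cause a failure.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse application parameters passed after the bare "--"."""
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        '--project',
        help='Service Project ID where ML jobs are launched.',
        required=True)
    # Illustrative extra parameter (assumption, mirrors --num-layers=3).
    args_parser.add_argument('--num-layers', type=int, default=3)
    # Ignore flags this parser does not define instead of exiting.
    args, _unknown = args_parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    args = parse_args()
    logging.info('Project ID: {}'.format(args.project))
```

For a quick local check, parse_args(['--project=my-project']) returns a namespace whose project attribute is 'my-project'.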

Object detection training job fails on GCP

I am running a training job on GCP for object detection using my own dataset. My training job script is like this:
JOB_NAME=object_detection"_$(date +%m_%d_%Y_%H_%M_%S)"
echo $JOB_NAME
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir=gs://$1 \
--scale-tier BASIC_GPU \
--runtime-version 1.12 \
--packages $PWD/models/research/dist/object_detection-0.1.tar.gz,$PWD/models/research/slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name $PWD/models/research/object_detection.model_main \
--region europe-west1 \
-- \
--model_dir=gs://$1 \
--pipeline_config_path=gs://$1/data/fast_rcnn_resnet101_coco.config
It fails at the following line:
python -m $PWD/models/research/object_detection.model_main --model_dir=gs://my-hand-detector --pipeline_config_path=gs://my-hand-detector/data/fast_rcnn_resnet101_coco.config --job-dir gs://my-hand-detector/
/usr/bin/python: Import by filename is not supported.
Based on the logs, this appears to be the source of the error. Any help would be appreciated. Thank you.
I assume that you are using the model_main.py file from the TensorFlow GitHub repository. Using it, I was able to reproduce your error message. After troubleshooting, I successfully submitted the training job and trained the model properly.
To address your issue, I suggest you follow this tutorial, paying special attention to the following steps:
Make sure to have an updated version of TensorFlow (1.14 doesn't include all necessary capabilities)
Properly generate TFRecords from the input data and upload them to a GCS bucket
Configure the object detection pipeline (set the proper paths to the data and label map)
In my case, I reproduced the workflow using PASCAL VOC input data (See this).
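As a side note on the error text itself: "Import by filename is not supported" is what python -m reports when it is handed a path-like value rather than a dotted module name, and in the failing command --module-name expands to a filesystem path because of the $PWD prefix. The sketch below is a small pre-submit check under that assumption; is_dotted_module_name is a hypothetical helper, not part of gcloud or TensorFlow.

```python
import re

# One or more Python identifiers separated by dots, e.g. "pkg.module".
_MODULE_NAME = re.compile(
    r'^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$')


def is_dotted_module_name(value):
    """Return True if value looks like a dotted module name, not a path."""
    return bool(_MODULE_NAME.match(value))


if __name__ == '__main__':
    for candidate in ('object_detection.model_main',
                      '/some/dir/object_detection.model_main'):
        print(candidate, is_dotted_module_name(candidate))
```

With this check, object_detection.model_main passes, while any value containing a directory separator is rejected before the job is submitted.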

Cloud Machine Learning Engine fails to deploy model

I have trained both my own model and the one from the official tutorial.
I've reached the step of deploying the model for prediction. However, it keeps giving me an error saying:
"create version failed. internal error happened"
when I attempt to deploy the models by running:
gcloud ml-engine versions create v1 \
--model $MODEL_NAME \
--origin $MODEL_BINARIES \
--python-version 3.5 \
--runtime-version 1.13
Note: the model binaries should be correct, as I pointed MODEL_BINARIES to the folder containing model.pb and the variables folder, e.g. MODEL_BINARIES=gs://$BUCKET_NAME/results/20190404_020134/saved_model/1554343466.
I have also tried changing the region setting for the model, but this doesn't help.
It turns out that your GCS bucket and the trained model need to be in the same region. This was not explained well in the Cloud ML tutorial, which only says:
Note: Use the same region where you plan on running Cloud ML Engine jobs. The example uses us-central1 because that is the region used in the getting-started instructions.
Also note that a lot of regions cannot be used for both the bucket and model training (e.g. asia-east1).

GCP cloud ml-engine start job issue

I am running the code below on a default GCP VM, and I have created config.yaml with the necessary fields. However, I am getting a "source directory is not a valid directory" error.
gcloud ml-engine jobs submit training my_job \
--module-name trainer.task \
--staging-bucket gs://my-bucket \
--package-path /my/code/path/trainer \
--packages additional-dep1.tar.gz,dep2.whl
I have checked all the paths and they are correct, and the data is in them, but the command still does not execute.
Any help on this is much appreciated.
Your Python package structure is probably missing an __init__.py. See http://python-packaging.readthedocs.io/en/latest/minimal.html
You can verify the package by running python setup.py sdist.
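A quick way to check for this before submitting is to walk the directory given to --package-path and report any folder that holds Python files but no __init__.py. This is only a sketch; missing_init_dirs is a hypothetical helper, and the demo builds a throwaway trainer directory instead of touching real code.

```python
import os
import tempfile


def missing_init_dirs(package_path):
    """Return directories under package_path that contain .py files
    but no __init__.py."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(package_path):
        has_python = any(name.endswith('.py') for name in filenames)
        if has_python and '__init__.py' not in filenames:
            missing.append(dirpath)
    return sorted(missing)


if __name__ == '__main__':
    # Demo on a temporary directory that mimics trainer/task.py.
    with tempfile.TemporaryDirectory() as root:
        trainer = os.path.join(root, 'trainer')
        os.makedirs(trainer)
        open(os.path.join(trainer, 'task.py'), 'w').close()
        print(missing_init_dirs(trainer))  # reports the trainer directory
        open(os.path.join(trainer, '__init__.py'), 'w').close()
        print(missing_init_dirs(trainer))  # nothing left to report
```

An empty result means every folder with Python files is importable as a package, which is what the packaging step expects.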

Google Cloud ML returns empty predictions with object detection model

I am deploying a model to Google Cloud ML for the first time. I have trained and tested the model locally; it still needs work, but it works OK.
I have uploaded it to Cloud ML and tested it with the same example images that I use locally and that I know produce detections (using this tutorial).
When I do this, I get no detections. At first I thought I had uploaded the wrong checkpoint, but I tested it and the same checkpoint works with these images offline. I don't know how to debug further.
When I look at the results the file
prediction.results-00000-of-00001
is just empty
and the file
prediction.errors_stats-00000-of-00001
contains the following text: ('No JSON object could be decoded', 1)
Is this a sign that detection ran and found nothing, or is there some problem during the run?
Maybe I am preparing the images incorrectly for upload?
The logs show no errors at all.
Thank you.
EDIT:
I ran more tests and tried running the model locally with the command "gcloud ml-engine local predict" instead of my usual local code. I get the same result as online: no output at all, but also no error message.
EDIT 2:
I am using a TF_Record file, so I don't understand the JSON response. Here is a copy of my command:
gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=tf_record \
--input-paths=gs://MY_BUCKET/data_dir/inputs.tfr \
--output-path=gs://MY_BUCKET/data_dir/version4 \
--region us-central1 \
--model="gcp_detector" \
--version="Version4"
It works with the following commands.
Model export:
# From tensorflow/models
export PYTHONPATH=$PYTHONPATH:/home/[user]/repos/DeepLearning/tools/models/research:/home/[user]/repos/DeepLearning/tools/models/research/slim
cd /home/[user]/repos/DeepLearning/tools/models/research
python object_detection/export_inference_graph.py \
--input_type encoded_image_string_tensor \
--pipeline_config_path /home/[user]/[path]/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /[path_to_checkpoint]/model.ckpt-216593 \
--output_directory /[output_path]/output_inference_graph.pb
Cloud execution:
gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=TF_RECORD \
--input-paths=gs://my_inference/data_dir/inputs/* \
--output-path=${YOUR_OUTPUT_DIR} \
--region us-central1 \
--model="model_name" \
--version="version_name"
I don't know exactly which change fixes the issue, but there are some small differences, such as tf_record now being TF_RECORD. I hope this helps someone else. Props to Google support for their help (they suggested the changes).