GCP cloud ml-engine start job issue

I am running the command below on a default GCP VM, and I have created a config.yaml with the necessary fields. However, I get a "source directory is not a valid directory" error.
gcloud ml-engine jobs submit training my_job \
--module-name trainer.task \
--staging-bucket gs://my-bucket \
--package-path /my/code/path/trainer \
--packages additional-dep1.tar.gz,dep2.whl
I have checked all the paths and they are correct, and the data is in them, but the command still does not execute.
Any help with this is much appreciated.

Your Python package structure is probably missing an __init__.py file. See http://python-packaging.readthedocs.io/en/latest/minimal.html
You can verify the package by running python setup.py sdist.
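As a quick local check before submitting, the following is a minimal sketch (not part of the original answer) that lists directories containing .py files but no __init__.py. The function name find_missing_init and the throwaway trainer/task.py layout are assumptions made for illustration only.

```python
from pathlib import Path
import tempfile


def find_missing_init(package_dir):
    """Return directories at or below package_dir that hold .py files but no __init__.py."""
    root = Path(package_dir)
    candidates = [root] + sorted(p for p in root.rglob("*") if p.is_dir())
    missing = []
    for directory in candidates:
        has_python = any(directory.glob("*.py"))
        if has_python and not (directory / "__init__.py").exists():
            missing.append(directory)
    return missing


# Demonstration on a temporary, hypothetical layout: trainer/task.py
with tempfile.TemporaryDirectory() as tmp:
    trainer = Path(tmp) / "trainer"
    trainer.mkdir()
    (trainer / "task.py").write_text("print('training')\n")

    # Before adding __init__.py the trainer directory is reported.
    before = [p.name for p in find_missing_init(trainer)]

    # After adding an empty __init__.py nothing is reported.
    (trainer / "__init__.py").write_text("")
    after = find_missing_init(trainer)

print(before, after)
```

If the list is empty for your real package path, a missing __init__.py is unlikely to be the cause and the path itself is worth re-checking.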

Related

Terraform script to build and run Dataflow Flex template

I need to convert these two gcloud commands, which build and run Dataflow jobs, to Terraform.
gcloud dataflow flex-template build ${TEMPLATE_PATH} \
--image-gcr-path "${TARGET_GCR_IMAGE}" \
--sdk-language "JAVA" \
--flex-template-base-image ${BASE_CONTAINER_IMAGE} \
--metadata-file "/Users/b.j/g/codebase/g-dataflow/pubsub-lite/src/main/resources/g_pubsublite_to_gcs_metadata.json" \
--jar "/Users/b.j/g/codebase/g-dataflow/pubsub-lite/target/debian/pubsub-lite-0.0.1-SNAPSHOT-uber.jar" \
--env FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.in.g.gr.dataflow.PubSubLiteToGCS"
gcloud dataflow flex-template run "pub-sub-lite-flex-`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location=$TEMPLATE_FILE_LOCATION \
--parameters=subscription=$SUBSCRIPTION,output=$OUTPUT_DIR,windowSize=$WINDOW_SIZE_IN_SECS,partitionLevel=$PARTITION_LEVEL,numOfShards=$NUM_SHARDS \
--region=$REGION \
--worker-region=$WORKER_REGION \
--staging-location=$STAGING_LOCATION \
--subnetwork=$SUBNETWORK \
--network=$NETWORK
I've tried the google_dataflow_flex_template_job resource, which lets me run the Dataflow job from a stored Dataflow template (the 2nd gcloud command). Now I need to create the template and Docker image, as in my 1st gcloud command, using Terraform.
Any input on this? And what's the best way to pass the jars used in the 1st gcloud command (placing them in a GCS bucket)?

And what's the best way to pass the jars used in the 1st gcloud command (placing them in a GCS bucket)?
There is no need to manually store these jar files in GCS. The gcloud dataflow flex-template build command builds a Docker container image that includes all the required jar files and uploads the image to the container registry. This image (plus the metadata file) is the only thing needed to run the template.
Now I need to create the template and Docker image, as in my 1st gcloud command, using Terraform.
AFAIK there is no special Terraform module to build a flex template. I'd try using the terraform-google-gcloud module, which can execute an arbitrary gcloud command, to run gcloud dataflow flex-template build.
If you build your project using Maven, another option is to use jib-maven-plugin to build and upload the container image instead of using gcloud dataflow flex-template build. See these build instructions for an example. You'll still need to upload the JSON image spec ("Creating Image Spec" section in the instructions) somehow, e.g. using the gsutil command or maybe Terraform's google_storage_bucket_object, so I think this approach is more complicated.

Invoke different entrypoints/modules when training with custom container

I've built a custom Docker container with my training application. The Dockerfile, at the moment, is something like
FROM python:slim
COPY ./src /pipelines/component/src
RUN pip3 install -U ...
...
ENTRYPOINT ["python3", "/pipelines/component/src/training.py"]
so when I run
gcloud ai-platform jobs submit training JOB_NAME \
--region=$REGION \
--master-image-uri=$IMAGE_URI
it goes as expected.
What I'd like to do is to add another module, like /pipelines/component/src/tuning.py; remove the default ENTRYPOINT from Dockerfile; decide which module to call from the gcloud command. So I tried
gcloud ai-platform jobs submit training JOB_NAME \
--region=$REGION \
--master-image-uri=$IMAGE_URI \
--module-name=src.tuning \
--package-path=/pipelines/component/src
It returns "Source directory [/pipelines/component] is not a valid directory." because it searches for the package path on the local machine instead of in the container. How can I solve this problem?

You can use the TrainingInput.ReplicaConfig.ContainerCommand field to override the Docker image's entrypoint. Here is a sample command:
gcloud ai-platform jobs submit training JOB_NAME \
--region=$REGION \
--master-image-uri=$IMAGE_URI \
--config=config.yaml
The content of config.yaml will be something like this:
trainingInput:
  scaleTier: BASIC
  masterConfig:
    containerCommand: ["python3", "/pipelines/component/src/tuning.py"]
This link has more context about the --config flag.
Similarly, you can override the arguments passed to the Docker image's entrypoint with the containerArgs field.
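To switch between training.py and tuning.py without editing the file by hand, the config can be generated per job. The following is a minimal sketch under stated assumptions: make_job_config is a hypothetical helper, and the output is written as JSON on the assumption that the --config flag also accepts JSON (a config this simple should also parse as YAML flow syntax, but verify that against your gcloud version).

```python
import json


def make_job_config(script_path, scale_tier="BASIC"):
    """Build a job config whose containerCommand runs the given script in the container."""
    return {
        "trainingInput": {
            "scaleTier": scale_tier,
            "masterConfig": {
                # Overrides the image's ENTRYPOINT for this job only.
                "containerCommand": ["python3", script_path],
            },
        }
    }


# Script path taken from the question; swap in training.py for a training run.
config = make_job_config("/pipelines/component/src/tuning.py")
config_text = json.dumps(config, indent=2)
print(config_text)
```

Save the printed text to the file passed to --config, one file per entrypoint you want to invoke.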

Object detection training job fails on GCP

I am running a training job on GCP for object detection using my own dataset. My training job script is like this:
JOB_NAME=object_detection"_$(date +%m_%d_%Y_%H_%M_%S)"
echo $JOB_NAME
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir=gs://$1 \
--scale-tier BASIC_GPU \
--runtime-version 1.12 \
--packages $PWD/models/research/dist/object_detection-0.1.tar.gz,$PWD/models/research/slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name $PWD/models/research/object_detection.model_main \
--region europe-west1 \
-- \
--model_dir=gs://$1 \
--pipeline_config_path=gs://$1/data/fast_rcnn_resnet101_coco.config
It fails at the following line:
python -m $PWD/models/research/object_detection.model_main --model_dir=gs://my-hand-detector --pipeline_config_path=gs://my-hand-detector/data/fast_rcnn_resnet101_coco.config --job-dir gs://my-hand-detector/
/usr/bin/python: Import by filename is not supported.
Based on the logs, this appears to be the source of the error. Any help would be appreciated. Thank you.

I assume that you are using the model_main.py file from the TensorFlow GitHub repository. Using it, I was able to reproduce your error message. After troubleshooting, I successfully submitted the training job and trained the model.
To address your issue, I suggest following this tutorial, paying special attention to the following steps:
Make sure to have an updated version of TensorFlow (1.14 doesn’t include all necessary capabilities).
Properly generate TFRecords from the input data and upload them to a GCS bucket.
Configure the object detection pipeline (set the proper paths to the data and label map).
In my case, I reproduced the workflow using PASCAL VOC input data (see this).
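As a side note that the answer above does not spell out: "Import by filename is not supported." is the message Python 2 prints when the argument to -m is a file path rather than a dotted module name, and the failing line in the question passes a value that starts with $PWD/. The following is a minimal sketch of a pre-submission check; is_dotted_module_name is a hypothetical helper, and the /home/user prefix is only an illustrative stand-in for $PWD.

```python
import re

# A dotted module name: identifiers joined by dots, with no slashes and no ".py" suffix.
_MODULE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def is_dotted_module_name(value):
    """Return True if value looks like something `python -m` can import by name."""
    return _MODULE_NAME.fullmatch(value) is not None


ok = is_dotted_module_name("object_detection.model_main")
bad = is_dotted_module_name("/home/user/models/research/object_detection.model_main")
print(ok, bad)
```

If the check fails for the value given to --module-name, passing only the dotted name (here object_detection.model_main) may be worth trying, since the package itself is already supplied through --packages.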

Adding custom dependency won't work in ML-Engine submit training

I have a .sh script that launches a training job submission as follows:
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="campign_retention_model__$now"
JOB_DIR="gs://machine_learning_datasets/campaign_retention"
REGION="us-east1"
PYTHON_VERSION='3.5'
RUNTIME_VERSION='1.12'
TRAINER_PACKAGE_PATH="./trainer/"
PACKAGE_STAGING_PATH="gs://machine_learning_datasets/campaign_retention"
CLOUDSDK_PYTHON="/usr/bin/python"
MAIN_TRAINER_MODULE="trainer.task"
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version=$RUNTIME_VERSION \
--python-version=$PYTHON_VERSION
This works great (note that the .sh script is located next to the trainer directory).
Due to external infrastructure requirements, I was forced to store the contents of my project in a bucket named:
"gs://campign_retention_code/camp_ret"
I also have to hand out a standalone .sh script, so I changed only the value of TRAINER_PACKAGE_PATH:
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="campign_retention_model__$now"
JOB_DIR="gs://machine_learning_datasets/campaign_retention"
REGION="us-east1"
PYTHON_VERSION='3.5'
RUNTIME_VERSION='1.12'
TRAINER_PACKAGE_PATH="gs://campign_retention_code/camp_ret/trainer"
PACKAGE_STAGING_PATH="gs://machine_learning_datasets/campaign_retention"
CLOUDSDK_PYTHON="/usr/bin/python"
MAIN_TRAINER_MODULE="trainer.task"
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version=$RUNTIME_VERSION \
--python-version=$PYTHON_VERSION
Now when I run it (I moved it to a different location, /Users/yehoshaphatschellekens/Desktop, to make sure it is not next to my project), I get the following error:
ERROR: (gcloud.ml-engine.jobs.submit.training) Source directory [/Users/yehoshaphatschellekens/Desktop/camp_ret] is not a valid directory.
Looking at the packaging-trainer docs, I noticed that there are two examples: one that works like my original script, which, as I said, works perfectly, and another that uses a packaged dependency.
Why won't the submit job recognise my dependencies on GCS? Can't I just point --package-path to a directory on GCS instead of my local directory?
Thanks in advance!

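I believe what you are trying to do requires using
--packages gs://path/to/packages
instead of --package-path. The --package-path flag expects a local directory, which gcloud builds into a package and uploads for you, whereas --packages also accepts Cloud Storage URLs of already-built archives.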
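The choice between the two flags can be sketched as a small helper that a wrapper script might use. This is only an illustration: packaging_args is a hypothetical function, and the dist/trainer-0.1.tar.gz object name is an assumed example of an archive built beforehand (for instance with setup.py sdist) and uploaded to the bucket from the question.

```python
def packaging_args(code_location):
    """Pick the gcloud flag for a code location: local directory or built archive on GCS."""
    if code_location.startswith("gs://"):
        # Assumes the URL points at an already-built archive, not a source directory.
        return ["--packages", code_location]
    # A local directory that gcloud packages and stages itself.
    return ["--package-path", code_location]


local_args = packaging_args("./trainer/")
# Hypothetical archive location inside the bucket named in the question.
remote_args = packaging_args("gs://campign_retention_code/camp_ret/dist/trainer-0.1.tar.gz")
print(local_args)
print(remote_args)
```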

Google Cloud ML returns empty predictions with object detection model

I am deploying a model to Google Cloud ML for the first time. I have trained and tested the model locally; it still needs work, but it works OK.
I uploaded it to Cloud ML and tested it with the same example images that I use locally and that I know produce detections (using this tutorial).
When I do this, I get no detections. At first I thought I had uploaded the wrong checkpoint, but I tested it and the same checkpoint works with these images offline. I don't know how to debug further.
When I look at the results, the file
prediction.results-00000-of-00001
is just empty,
and the file
prediction.errors_stats-00000-of-00001
contains the following text: ('No JSON object could be decoded', 1)
Is this a sign that the detection ran and detected nothing, or was there some problem while running?
Maybe the problem is that I am preparing the images incorrectly for uploading?
The logs show no errors at all.
Thank you.
EDIT:
I ran more tests and tried to run the model locally using the command "gcloud ml-engine local predict" instead of the usual local code. I get the same result as online: no answer at all, but also no error message.
EDIT 2:
I am using a TFRecord file, so I don't understand the JSON response. Here is a copy of my command:
gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=tf_record \
--input-paths=gs://MY_BUCKET/data_dir/inputs.tfr \
--output-path=gs://MY_BUCKET/data_dir/version4 \
--region us-central1 \
--model="gcp_detector" \
--version="Version4"

It works with the following commands.
Model export:
# From tensorflow/models
export PYTHONPATH=$PYTHONPATH:/home/[user]/repos/DeepLearning/tools/models/research:/home/[user]/repos/DeepLearning/tools/models/research/slim
cd /home/[user]/repos/DeepLearning/tools/models/research
python object_detection/export_inference_graph.py \
--input_type encoded_image_string_tensor \
--pipeline_config_path /home/[user]/[path]/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /[path_to_checkpoint]/model.ckpt-216593 \
--output_directory /[output_path]/output_inference_graph.pb
Cloud execution:
gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=TF_RECORD \
--input-paths=gs://my_inference/data_dir/inputs/* \
--output-path=${YOUR_OUTPUT_DIR} \
--region us-central1 \
--model="model_name" \
--version="version_name"
I don't know exactly which change fixes the issue, but there are some small changes, such as tf_record now being TF_RECORD. I hope this helps someone else. Props to Google support for their help (they suggested the changes).
