Adding custom dependancy wont work in ML-Engine submit training - google-cloud-ml

I have a .sh script that lunches a submit training job as following:
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="campign_retention_model__$now"
JOB_DIR="gs://machine_learning_datasets/campaign_retention"
REGION="us-east1"
PYTHON_VERSION='3.5'
RUNTIME_VERSION='1.12'
TRAINER_PACKAGE_PATH="./trainer/"
PACKAGE_STAGING_PATH="gs://machine_learning_datasets/campaign_retention"
CLOUDSDK_PYTHON="/usr/bin/python"
MAIN_TRAINER_MODULE="trainer.task"
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version=$RUNTIME_VERSION \
--python-version=$PYTHON_VERSION \
Which works great (Notice that the .sh is located next to the trainer dir).
Due to external infra requirements, i was forced to save the content of my project within a bucket named:
"gs://campign_retention_code/camp_ret"
And hand out a stand alone sh, So I've just changed it to (just changed the path of TRAINER_PACKAGE_PATH):
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="campign_retention_model__$now"
JOB_DIR="gs://machine_learning_datasets/campaign_retention"
REGION="us-east1"
PYTHON_VERSION='3.5'
RUNTIME_VERSION='1.12'
TRAINER_PACKAGE_PATH="gs://campign_retention_code/camp_ret/trainer"
PACKAGE_STAGING_PATH="gs://machine_learning_datasets/campaign_retention"
CLOUDSDK_PYTHON="/usr/bin/python"
MAIN_TRAINER_MODULE="trainer.task"
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version=$RUNTIME_VERSION \
--python-version=$PYTHON_VERSION \
Now when i'm running it (I moved it to a different location on the desktop to /Users/yehoshaphatschellekens/Desktop, to make sure its not close to my project) i'm getting the following error:
ERROR: (gcloud.ml-engine.jobs.submit.training) Source directory [/Users/yehoshaphatschellekens/Desktop/camp_ret] is not a valid directory.
Looking at the docs packaging-trainer i noticed that there are two examples, one that works like my original script, which as i said, works perfectly, and another example that uses a packaged dependancy.
Why the submit job won't recognise my dependancies on gs, can't i just point to --package-path a directory from gs instead of my local dir?
Thanks in Advance!!!

I believe what you are trying to do requires using
--packages gs://path/to/packages
INSTEAD of --package-path

Related

Why is Dataflow Runner v2 experiments flag not appearing in pipeline options?

I've added to this command so that Dataflow Runner v2 is used:
mvn -Pdataflow-runner compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--project=PROJECT_ID \
--gcpTempLocation=gs://BUCKET_NAME/temp/ \
--output=gs://BUCKET_NAME/output \
--runner=DataflowRunner \
--region=REGION \
--experiments=use_runner_v2"
Note --experiments=use_runner_v2 is added at the end. But experiments flag is not shown in the pipeline options in the gcp ui:
The userAgent is Apache_Beam_SDK_for_Java/2.39.0(JRE_17_environment) and region is europe-west1. This might be relevant information about why my setup isn't working.

using gcloud beta builds triggers create cloud-source-repositories doesn't working with --dockerfile-image

I'm working on a auto devops workflow only based on the dockerfile using Cloud Build on GCP, when I try to use the following command it seems is not using the flag: --dockerfile-image
gcloud beta builds triggers create cloud-source-repositories \
--name="test-trigger-2" \
--repo="projects/nodrize-dev/repos/b722166a-56e0-46af-bd0d-42af8d37c570/bf11672f-34d5-4d8c-80cb-31120f39251a/quirino-backend" \
--branch-pattern="^master$" \
--dockerfile="Dockerfile" \
--dockerfile-dir="" \
--dockerfile-image="gcr.io/nodrize-dev/test-backend"
Created [https://cloudbuild.googleapis.com/v1/projects/nodrize-dev/triggers/896f8ac8-397c-464a-84f7-43e69f1bc6cb].
NAME CREATE_TIME STATUS
test-trigger-2 2021-06-02T21:06:54+00:00
I want to create trigger to run it later but the last flag isnt working I asume is using the default or fallback, because as you can see in the image name is:
gcr.io/nodrize-dev/b722166a-56e0-46af-bd0d-42af8d37c570/bf11672f-34d5-4d8c-80cb-31120f39251a/quirino-backend:$COMMIT_SHA:
dockerimage-name in gcp concole:
I hope someone can help me or at least know what is happening.
This works for me.
I suspect perhaps that the trigger is incorrect or is not being triggered and|or the image is not what was generated by the trigger.
PROJECT=...
REPO=...
gcloud source repos create ${REPO} \
--project=${PROJECT}
gcloud beta builds triggers create cloud-source-repositories \
--name="trigger" \
--project=${PROJECT} \
--repo=${REPO} \
--branch-pattern="^master$" \
--dockerfile="Dockerfile" \
--dockerfile-dir="." \
--dockerfile-image="gcr.io/${PROJECT}/freddie-01"
NAME CREATE_TIME STATUS
trigger 2021-06-03T15:24:27+00:00
git push google master
gcloud builds list \
--project=${PROJECT} \
--format="value(images)"
gcr.io/${PROJECT}/freddie-01:7dcf74e126af711d24bb2b652d86f0d28bbe3bd9
gcloud container images list \
--project=${PROJECT}
NAME
gcr.io/${PROJECT}/freddie-01

Object detection training job fails on GCP

I am running a training job on GCP for object detection using my own dataset. My training job script is like this:
JOB_NAME=object_detection"_$(date +%m_%d_%Y_%H_%M_%S)"
echo $JOB_NAME
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir=gs://$1 \
--scale-tier BASIC_GPU \
--runtime-version 1.12 \
--packages $PWD/models/research/dist/object_detection-0.1.tar.gz,$PWD/models/research/slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name $PWD/models/research/object_detection.model_main \
--region europe-west1 \
-- \
--model_dir=gs://$1 \
--pipeline_config_path=gs://$1/data/fast_rcnn_resnet101_coco.config
It fails at the following line :
python -m $PWD/models/research/object_detection.model_main --model_dir=gs://my-hand-detector --pipeline_config_path=gs://my-hand-detector/data/fast_rcnn_resnet101_coco.config --job-dir gs://my-hand-detector/
/usr/bin/python: Import by filename is not supported.
Based on logs, this is the source of error which I have understood. Any help in this regard would be helpful. Thank you.
I assume that you are using model_main.py file from Tensorflow GitHub repository. Using it, I have been able to replicate your error message. After troubleshooting, I successfully submitted the training job and could train the model properly.
In order to address your issue I suggest you to follow this tutorial, taking special consideration to the following steps:
Make sure to have an updated version of tensorflow (1.14 doesn’t include all necessary capabilities)
Properly generate TFRecords from input data and upload them to GCS bucket
Configure object detection pipeline (set the proper paths to data and label map)
In my case, I have reproduced the workflow using PASCAL VOC input data (See this).

GCP cloud ml-engine start job issue

I am running the below code in GCP default VM and config.yaml also created with necessary fields. Although I am getting source directory is not a valid directory error.
gcloud ml-engine jobs submit training my_job \ --module-name
trainer.task \ --staging-bucket gs://my-bucket \ --package-path
/my/code/path/trainer \ --packages additional-dep1.tar.gz,dep2.whl
Have checked all the paths and they are ok and the data is within them however the command is not executing...
Help on above topic is much appreciated
Your Python package structure is probably missing an init.py. see http://python-packaging.readthedocs.io/en/latest/minimal.html
And verify it using setup.py sdist

ERROR: gcloud crashed (ArgumentError): argument USER_ARGS: unrecognized args: --runtime_version=1.0

Below script was running fine until yesterday morning.
gcloud ml-engine jobs submit training "$JOB_ID" \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--runtime_version=1.0 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*" \
--classification_type "multilabel" \
Running into below error:
ERROR: gcloud crashed (ArgumentError): argument USER_ARGS: unrecognized args: --runtime_version=1.0
The '--' argument must be specified between gcloud specific args on the left and USER_ARGS on the right.
Below are the gcloud components version:
$ gcloud version
Google Cloud SDK 147.0.0
alpha 2016.01.12
app-engine-go
app-engine-go-linux-x86_64 1.9.50
app-engine-java 1.9.50
app-engine-php " "
app-engine-python 1.9.50
beta 2016.01.12
bq 2.0.24
bq-nix 2.0.24
cloud-datastore-emulator 1.2.1
core 2017.03.13
alpha 2016.01.12
core-nix 2016.11.07
datalab 20170309
datalab-nix 20170105
gcd-emulator v1beta3-1.0.0
gcloud
gcloud-deps 2017.03.13
gcloud-deps-linux-x86_64 2017.02.21
gsutil 4.22
gsutil-nix 4.18
kubectl
kubectl-linux-x86_64 1.5.3
pubsub-emulator 2017.02.07
Not sure whether this is anything changed in Cloud, or I need check any config on my end that may cause this error.
You might need to use --runtime-version as the name of the argument (hyphen instead of underscore).
Without that, gcloud is assuming its some custom user-defined argument, which it expects to be in the list after the '--', hence the confusing error message.