In Google's sample code at cloudml-samples/flowers/sample.sh, between lines 66 and 69, the argument "region" is used:
# Tell CloudML about a new type of model coming. Think of a "model" here as
# a namespace for deployed Tensorflow graphs.
gcloud ml-engine models create "$MODEL_NAME" \
--region us-central1
Shouldn't "region" be replaced with "regions" to avoid an error?
(I'm not in a position to submit a PR about this.)
This has been fixed and will be pushed in the next couple of days.
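For reference, a minimal sketch of the corrected call, assuming the plural --regions flag (which accepts a comma-separated list of regions) is what the sample intends:
# Same command as in the sample, but with the plural --regions flag
gcloud ml-engine models create "$MODEL_NAME" \
--regions us-central1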
I'm experiencing very strange behaviour with the --filter, --format and --limit flags.
I have the following command:
gcloud run revisions list --sort-by=~creationTimestamp --service "api-gateway" --platform managed --format="value(metadata.name)" --filter="spec.containers.env.name=ENDPOINTS_SERVICE_NAME"
The command returns this list, with 177 items in total:
api-gateway-00295-xeb 2020-07-21T06:46:14.991421Z
api-gateway-00283-wug 2020-07-20T14:41:02.108809Z
api-gateway-00281-yix 2020-07-20T14:32:17.325634Z
api-gateway-00278-ham 2020-07-20T12:50:13.385984Z
api-gateway-00276-mol 2020-07-17T12:21:36.897245Z
api-gateway-00274-nih 2020-07-16T07:50:18.544546Z
api-gateway-00272-kol 2020-07-13T12:55:35.485589Z
api-gateway-00270-vis 2020-07-13T08:38:52.352422Z
api-gateway-00263-zaf 2020-07-10T14:08:36.502972Z
...
The first thing is that the timestamp is returned for some strange reason. (I explicitly state what I want with --format, and when I remove the --sort-by flag the timestamp is gone.)
Secondly, when I add --limit 1 no result is returned at all!
gcloud run revisions list --sort-by=~creationTimestamp --service "api-gateway" --platform managed --format="value(metadata.name)" --filter="spec.containers.env.name=ENDPOINTS_SERVICE_NAME" --limit 1
With --limit 5 only two are returned, so it could be that the limit is applied before filtering, although the documentation says it should be the other way around.
However, the "latest" entry is api-gateway-00295-xeb and it should be returned with a limit of 1.
I don't understand the behaviour of the gcloud CLI here.
Does anyone have explanations for the two things?
As DazWilkin suggested, I created an issue in the public Google Issue Tracker here:
https://issuetracker.google.com/issues/161833506
The Cloud SDK engineering team is looking into this; however, there is no ETA.
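Until the fix lands, a possible workaround (just a sketch, assuming the server-side limit really is applied before the filter) is to drop --limit and truncate the filtered output on the client instead:
# Filter server-side, then keep only the newest revision on the client
gcloud run revisions list --sort-by=~creationTimestamp --service "api-gateway" --platform managed \
--format="value(metadata.name)" \
--filter="spec.containers.env.name=ENDPOINTS_SERVICE_NAME" | head -n 1
If the extra timestamp column still shows up next to the name, piping through awk '{print $1}' strips it.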
I am running a training job on GCP for object detection using my own dataset. My training job script is like this:
JOB_NAME=object_detection"_$(date +%m_%d_%Y_%H_%M_%S)"
echo $JOB_NAME
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir=gs://$1 \
--scale-tier BASIC_GPU \
--runtime-version 1.12 \
--packages $PWD/models/research/dist/object_detection-0.1.tar.gz,$PWD/models/research/slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name $PWD/models/research/object_detection.model_main \
--region europe-west1 \
-- \
--model_dir=gs://$1 \
--pipeline_config_path=gs://$1/data/fast_rcnn_resnet101_coco.config
It fails at the following line:
python -m $PWD/models/research/object_detection.model_main --model_dir=gs://my-hand-detector --pipeline_config_path=gs://my-hand-detector/data/fast_rcnn_resnet101_coco.config --job-dir gs://my-hand-detector/
/usr/bin/python: Import by filename is not supported.
Based on the logs, this is the source of the error, as far as I understand. Any help in this regard would be appreciated. Thank you.
I assume that you are using the model_main.py file from the TensorFlow GitHub repository. Using it, I was able to replicate your error message. After troubleshooting, I successfully submitted the training job and could train the model properly.
To address your issue, I suggest you follow this tutorial, paying special attention to the following steps:
Make sure to have an updated version of TensorFlow (1.14 doesn't include all the necessary capabilities)
Properly generate TFRecords from input data and upload them to GCS bucket
Configure object detection pipeline (set the proper paths to data and label map)
In my case, I have reproduced the workflow using PASCAL VOC input data (See this).
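One more observation (my own reading of the error, not something the tutorial states explicitly): "Import by filename is not supported" is what python -m raises when it is handed a filesystem path, so --module-name most likely needs the dotted Python module path rather than a $PWD/... path. A sketch of the adjusted submission, keeping the other flags from the question unchanged:
# --module-name takes a dotted module path, not a filesystem path
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir=gs://$1 \
--scale-tier BASIC_GPU \
--runtime-version 1.12 \
--packages $PWD/models/research/dist/object_detection-0.1.tar.gz,$PWD/models/research/slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region europe-west1 \
-- \
--model_dir=gs://$1 \
--pipeline_config_path=gs://$1/data/fast_rcnn_resnet101_coco.config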
I have trained both my own model and the one from the official tutorial.
I'm now at the step of deploying the model to support prediction. However, it keeps giving me an error saying:
"create version failed. internal error happened"
when I attempt to deploy the models by running:
gcloud ml-engine versions create v1 \
--model $MODEL_NAME \
--origin $MODEL_BINARIES \
--python-version 3.5 \
--runtime-version 1.13
The model binary should be correct, as I pointed it to the folder containing model.pb and the variables folder, e.g. MODEL_BINARIES=gs://$BUCKET_NAME/results/20190404_020134/saved_model/1554343466.
I have also tried changing the region setting for the model, but this doesn't help.
It turns out that your GCS bucket and the trained model need to be in the same region. This was not explained well in the Cloud ML tutorial, which only says:
Note: Use the same region where you plan on running Cloud ML Engine jobs. The example uses us-central1 because that is the region used in the getting-started instructions.
Also note that a lot of regions cannot be used for both the bucket and model training (e.g. asia-east1).
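To double-check where a bucket lives, and to create one in the training region if needed, something like the following works (a sketch; bucket names and region are only examples):
# Show the bucket's metadata, including its Location constraint
gsutil ls -L -b gs://$BUCKET_NAME
# If it doesn't match the training region, create a bucket there and copy the model over
gsutil mb -l us-central1 gs://${BUCKET_NAME}-us-central1
gsutil -m cp -r gs://$BUCKET_NAME/results gs://${BUCKET_NAME}-us-central1/results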
I am deploying a model to Google Cloud ML for the first time. I have trained and tested the model locally; it still needs work, but it works OK.
I have uploaded it to Cloud ML and tested it with the same example images I test locally, which I know produce detections (using this tutorial).
When I do this, I get no detections. At first I thought I had uploaded the wrong checkpoint, but I tested it and the same checkpoint works with these images offline. I don't know how to debug further.
When I look at the results, the file
prediction.results-00000-of-00001
is just empty
and the file
prediction.errors_stats-00000-of-00001
contains the following text: ('No JSON object could be decoded', 1)
Is this a sign the detection has run and detected nothing, or is there some problem while running?
Maybe the problem is I am preparing the images wrong for uploading?
The logs show no errors at all
Thank you
EDIT:
I did some more tests and tried running the model locally using "gcloud ml-engine local predict" instead of my usual local code. I get the same result as online: no answer at all, but also no error message.
EDIT 2:
I am using a TF_Record file, so I don't understand the JSON-related error. Here is a copy of my command:
gcloud ml-engine jobs submit prediction ${JOB_ID} \
--data-format=tf_record \
--input-paths=gs://MY_BUCKET/data_dir/inputs.tfr \
--output-path=gs://MY_BUCKET/data_dir/version4 \
--region us-central1 \
--model="gcp_detector" \
--version="Version4"
It works with the following commands.
Model export:
# From tensorflow/models
export PYTHONPATH=$PYTHONPATH:/home/[user]/repos/DeepLearning/tools/models/research:/home/[user]/repos/DeepLearning/tools/models/research/slim
cd /home/[user]/repos/DeepLearning/tools/models/research
python object_detection/export_inference_graph.py \
--input_type encoded_image_string_tensor \
--pipeline_config_path /home/[user]/[path]/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /[path_to_checkpoint]/model.ckpt-216593 \
--output_directory /[output_path]/output_inference_graph.pb
Cloud execution
gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=TF_RECORD \
--input-paths=gs://my_inference/data_dir/inputs/* \
--output-path=${YOUR_OUTPUT_DIR} \
--region us-central1 \
--model="model_name" \
--version="version_name"
I don't know exactly which change fixes the issue, but there are some small changes, like tf_record now being TF_RECORD. Hope this helps someone else. Props to Google support for their help (they suggested the changes).
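Once a prediction job finishes, a quick way to see whether anything was actually detected is to read the output shards straight from the bucket (paths below are just examples matching the command above):
# Inspect the prediction results and error-statistics shards
gsutil cat "${YOUR_OUTPUT_DIR}/prediction.results-*"
gsutil cat "${YOUR_OUTPUT_DIR}/prediction.errors_stats-*"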
I'm trying to run a (py)Spark job on EMR that will process a large amount of data. Currently my job is failing with the following error message:
Reason: Container killed by YARN for exceeding memory limits.
5.5 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
So I Googled how to do this and found that I should pass the spark.yarn.executor.memoryOverhead parameter with the --conf flag. I'm doing it this way:
aws emr add-steps\
--cluster-id %s\
--profile EMR\
--region us-west-2\
--steps Name=Spark,Jar=command-runner.jar,\
Args=[\
/usr/lib/spark/bin/spark-submit,\
--deploy-mode,client,\
/home/hadoop/%s,\
--executor-memory,100g,\
--num-executors,3,\
--total-executor-cores,1,\
--conf,'spark.python.worker.memory=1200m',\
--conf,'spark.yarn.executor.memoryOverhead=15300',\
],ActionOnFailure=CONTINUE" % (cluster_id,script_name)\
But when I rerun the job it keeps giving me the same error message, with 5.5 GB of 5.5 GB physical memory used, which implies that my memory did not increase. Any hints on what I am doing wrong?
EDIT
Here are details on how I initially create the cluster:
aws emr create-cluster\
--name "Spark"\
--release-label emr-4.7.0\
--applications Name=Spark\
--bootstrap-action Path=s3://emr-code-matgreen/bootstraps/install_python_modules.sh\
--ec2-attributes KeyName=EMR2,InstanceProfile=EMR_EC2_DefaultRole\
--log-uri s3://emr-logs-zerex\
--instance-type r3.xlarge\
--instance-count 4\
--profile EMR\
--service-role EMR_DefaultRole\
--region us-west-2
Thanks.
After a couple of hours I found the solution to this problem. When creating the cluster, I needed to pass the following flag as a parameter:
--configurations file://./sparkConfig.json\
With the JSON file containing:
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.executor.memory": "10G"
}
}
]
This allows me to increase the memoryOverhead in the next step by using the parameter I initially posted.
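As far as I know, the same spark-defaults classification also accepts spark.yarn.executor.memoryOverhead directly, so the overhead could be baked into the cluster defaults instead of being passed per step. A sketch of writing such a file (the values are only examples):
# Write a spark-defaults classification that sets both executor memory and overhead
cat > sparkConfig.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "10G",
      "spark.yarn.executor.memoryOverhead": "1500"
    }
  }
]
EOF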
If you are logged into an EMR node and want to further alter Spark's default settings without dealing with the AWS CLI tools, you can add a line to the spark-defaults.conf file. Spark lives in EMR's /etc directory, and the file can be edited directly at /etc/spark/conf/spark-defaults.conf.
In this case we'd append spark.yarn.executor.memoryOverhead to the end of the spark-defaults file. The end of the file then looks very similar to this example:
spark.driver.memory 1024M
spark.executor.memory 4305M
spark.default.parallelism 8
spark.logConf true
spark.executorEnv.PYTHONPATH /usr/lib/spark/python
spark.driver.maxResultSize 0
spark.worker.timeout 600
spark.storage.blockManagerSlaveTimeoutMs 600000
spark.executorEnv.PYTHONHASHSEED 0
spark.akka.timeout 600
spark.sql.shuffle.partitions 300
spark.yarn.executor.memoryOverhead 1000M
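If you would rather not open an editor, appending the line in place works too (a sketch, mirroring the value shown above; it needs sudo):
# Append the overhead setting to EMR's Spark defaults
echo "spark.yarn.executor.memoryOverhead 1000M" | sudo tee -a /etc/spark/conf/spark-defaults.conf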
Similarly, the heap size can be controlled with the --executor-memory=xg flag or the spark.executor.memory property.
Hope this helps...