Bad model detected when deploying model on google ml engine - google-cloud-ml

I trained my model using google ml engine with the following configuration.
JOB_NAME=object_detection"_$(date +%m_%d_%Y_%H_%M_%S)"
echo $JOB_NAME
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir=gs://$1/train \
--scale-tier BASIC_GPU \
--runtime-version 1.12 \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region europe-west1 \
-- \
--model_dir=gs://$1/train \
--pipeline_config_path=gs://$1/data/fast_rcnn_resnet101_coco.config
After training, I downloaded the latest checkpoint from GCP and exported the model using the following command:
python export_inference_graph.py --input_type encoded_image_string_tensor --pipeline_config_path training/fast_rcnn_resnet101_coco.config --trained_checkpoint_prefix training/model.ckpt-11127 --output_directory exported_graphs
The signature of the exported SavedModel looks like this:
The given SavedModel SignatureDef contains the following input(s):
inputs['inputs'] tensor_info:
dtype: DT_UINT8
shape: (-1, -1, -1, 3)
name: image_tensor:0
The given SavedModel SignatureDef contains the following output(s):
outputs['detection_boxes'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 300, 4)
name: detection_boxes:0
outputs['detection_classes'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 300)
name: detection_classes:0
outputs['detection_features'] tensor_info:
dtype: DT_FLOAT
shape: (-1, -1, -1, -1, -1)
name: detection_features:0
outputs['detection_multiclass_scores'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 300, 2)
name: detection_multiclass_scores:0
outputs['detection_scores'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 300)
name: detection_scores:0
outputs['num_detections'] tensor_info:
dtype: DT_FLOAT
shape: (-1)
name: num_detections:0
outputs['raw_detection_boxes'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 300, 4)
name: raw_detection_boxes:0
outputs['raw_detection_scores'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 300, 2)
name: raw_detection_scores:0
Method name is: tensorflow/serving/predict
After this, I deployed the model on ml-engine with the following configuration:
Python version 2.7
Framework TensorFlow
Framework version 1.12.3
Runtime version 1.12
Machine type Single core CPU
I receive the following error:
Error
Create Version failed. Bad model detected with error: "Failed to load model: Loading servable: {name: default version: 1} failed: Not found: Op type not registered 'FusedBatchNormV3' in binary running on localhost. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.\n\n (Error code: 0)"

It's most likely an incompatibility between TensorFlow versions somewhere, e.g. between the version that produced the model and the serving runtime. Did you create your model with the same TensorFlow version you are actually running it on?
Several other threads point to the same cause:
Not found: Op type not registered 'CountExtremelyRandomStats'
Bad model deploying to GCP Cloudml

I was able to figure this out: I was using a different TensorFlow version when exporting the model. To keep things consistent and avoid this error, make sure the TensorFlow version is the same for training, exporting, and deployment.
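As a rough illustration of that check, here is a minimal sketch that compares version strings by their major.minor prefix. The helper name same_minor_version and the export version shown are assumptions for illustration only; in practice you would read tf.__version__ in the environment used for exporting and compare it with the --runtime-version used for training and deployment.

```python
def same_minor_version(*versions):
    """Return True if all version strings share the same major.minor prefix."""
    prefixes = {tuple(v.split(".")[:2]) for v in versions}
    return len(prefixes) == 1


# Hypothetical values: the export version is NOT stated in the thread.
export_version = "1.14.0"   # assumed: whatever tf.__version__ reports locally
runtime_version = "1.12"    # the --runtime-version used above

if not same_minor_version(export_version, runtime_version):
    print("Version mismatch: re-export the model with the runtime's TensorFlow version")
```

If the versions differ, re-exporting with a matching local TensorFlow install is the simplest way to rule this out.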

Related

Prometheus series values for time metrics

I'm defining a data series for testing a Prometheus alert using the container_last_seen metric from the cadvisor exporter.
How do I enter timestamp series values, as returned by the container_last_seen metric? I'm testing the alerts on an Apple Mac; in production they run on Linux boxes.
Here's one thing I tried:
input_series:
  - series: |
      container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
It seems whatever I put in the values for the series is not accepted.
I've also tried durations: '0h+1mx60'
Since time() - container_last_seen{...} is a legal expression, container_last_seen is definitely a timestamp, and I would expect a timestamp to be represented by a Unix epoch number. Executing the query on Prometheus returns Unix epoch times, but putting such numbers in a series is rejected with the error below.
promtool is recognising the different types but giving much the same error:
➜ promtool test rules alertrules-service-oriented-test.yml
Unit Testing: alertrules-service-oriented-test.yml
FAILED:
1:1: parse error: unexpected number "0" in series values
If the values are '1h+0mx61', promtool correctly identifies the values as durations:
1:1: parse error: unexpected duration "1h" in series values
Note that when this test is commented out, there is no 1:1: parse error and the tests complete successfully. This is not a problem with out of sight parts of the test file.
Thanks for any insights.
Here's the alert:
alertrules.yaml:
- name: containers
  interval: 15s
  rules:
    - alert: prod_container_crashing
      expr: |
        count by (instance, container_label_com_docker_swarm_service_name)
        (
          count_over_time(container_last_seen{container_label_com_docker_swarm_service_name!="",env="prod"}[15m])
        ) - 1 > 2
      for: 5m
      labels:
        service: prod
        type: container
        severity: critical
      annotations:
        summary: "pdce {{ $labels.container_label_com_docker_swarm_service_name }}"
        description: "{{ $labels.container_label_com_docker_swarm_service_name }} in prod cluster on {{ $labels.instance }} is crashing"
and here's the test file:
alertrules_test.yml:
rule_files:
  - alertrules.yml
evaluation_interval: 1m
tests:
  - name: container_tests
    interval: 15s
    input_series:
      - series: |
          container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
        values: '1563968832+0x61'
    alert_rule_test:
      - eval_time: 15m
        alertname: prod_container_crashing
        exp_alerts:
          - exp_labels:
              service: prod
              type: container
              severity: critical
            exp_annotations:
              summary: prod service1
              description: service1 in prod cluster on 10.0.0.1 is crashing
When the series: value is all on one line, without a > or | YAML block scalar indicator, e.g.
  - series: container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
the error goes away, and I don't know why. So this doesn't appear to be a data typing issue.
It's a shame for readability; either Prometheus or Go may have a quirk in its YAML handling.
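For reference, the values string itself is valid expanding notation: as I understand the Prometheus unit testing docs, 'a+bxn' expands to n+1 samples starting at a and increasing by b. Here is a minimal Python sketch of that expansion. The function name expand_series_values is hypothetical, this is not promtool's parser, and it does not handle '_' (missing sample) or 'stale' entries.

```python
import re

# One item of expanding notation: <start><+|-><step>x<count>
_EXPANDING = re.compile(r"(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)")


def expand_series_values(expr):
    """Expand space-separated values such as '1+2x3' into a list of floats."""
    samples = []
    for item in expr.split():
        match = _EXPANDING.fullmatch(item)
        if match is None:
            samples.append(float(item))  # plain literal sample
            continue
        start, sign, step, count = match.groups()
        delta = float(step) if sign == "+" else -float(step)
        # 'a+bxn' yields the initial value plus n further samples
        samples.extend(float(start) + i * delta for i in range(int(count) + 1))
    return samples


values = expand_series_values("1563968832+0x61")
print(len(values), values[0], values[-1])
```

With the 15s interval from the test file, those samples cover a little over 15 minutes of a constant epoch timestamp, which is consistent with the problem being in how the series string is written rather than in the numbers.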

How to set table format as default in google cloud shell?

When I try to list anything, the result is not shown as a table (as in the video). Each region is listed separately with its fields, something like this:
NAME: us-west3
CPUS: 0/24
DISKS_GB: 0/4096
ADDRESSES: 0/8
RESERVED_ADDRESSES: 0/8
STATUS: UP
TURNDOWN_DATE:
NAME: us-west4
CPUS: 0/24
DISKS_GB: 0/4096
ADDRESSES: 0/8
RESERVED_ADDRESSES: 0/8
STATUS: UP
TURNDOWN_DATE:
Please try:
gcloud config set accessibility/screen_reader False
And then repeat the command.

Google AI Platform Prediction error - Object Detection API models - HttpError 400 - Tensor name has inconsistent batch size

I need to do remote, on-line predictions using the TensorFlow Object Detection API. I am trying to use the Google AI-Platform. When I do on-line predictions of Object Detection models on the AI Platform, I get an error similar to:
HttpError 400 Tensor name: num_proposals has inconsistent batch size: 1 expecting: 49152
When I execute predictions locally (e.g. result = model(image)), I get the desired results.
This error occurs for a variety of Object Detection models -- Mask-RCNN and MobileNet. The error occurs on Object Detection models that I have trained, and ones loaded directly from the Object Detection Model Zoo (v2). I get successful results using the same code, but a model deployed on AI Platform that is not Object Detection.
Signature Information
The model input signature-def seems to be correct:
!saved_model_cli show --dir {MODEL_DIR_GS}
!saved_model_cli show --dir {MODEL_DIR_GS} --tag_set serve
!saved_model_cli show --dir {MODEL_DIR_GS} --tag_set serve --signature_def serving_default
gives:
The given SavedModel contains the following tag-sets:
serve
The given SavedModel MetaGraphDef contains SignatureDefs with the following keys:
SignatureDef key: "__saved_model_init_op"
SignatureDef key: "serving_default"
The given SavedModel SignatureDef contains the following input(s):
inputs['input_tensor'] tensor_info:
dtype: DT_UINT8
shape: (1, -1, -1, 3)
name: serving_default_input_tensor:0
The given SavedModel SignatureDef contains the following output(s):
outputs['anchors'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 4)
name: StatefulPartitionedCall:0
outputs['box_classifier_features'] tensor_info:
dtype: DT_FLOAT
shape: (300, 9, 9, 1536)
name: StatefulPartitionedCall:1
outputs['class_predictions_with_background'] tensor_info:
dtype: DT_FLOAT
shape: (300, 2)
name: StatefulPartitionedCall:2
outputs['detection_anchor_indices'] tensor_info:
dtype: DT_FLOAT
shape: (1, 100)
name: StatefulPartitionedCall:3
outputs['detection_boxes'] tensor_info:
dtype: DT_FLOAT
shape: (1, 100, 4)
name: StatefulPartitionedCall:4
outputs['detection_classes'] tensor_info:
dtype: DT_FLOAT
shape: (1, 100)
name: StatefulPartitionedCall:5
outputs['detection_masks'] tensor_info:
dtype: DT_FLOAT
shape: (1, 100, 33, 33)
name: StatefulPartitionedCall:6
outputs['detection_multiclass_scores'] tensor_info:
dtype: DT_FLOAT
shape: (1, 100, 2)
name: StatefulPartitionedCall:7
outputs['detection_scores'] tensor_info:
dtype: DT_FLOAT
shape: (1, 100)
name: StatefulPartitionedCall:8
outputs['final_anchors'] tensor_info:
dtype: DT_FLOAT
shape: (1, 300, 4)
name: StatefulPartitionedCall:9
outputs['image_shape'] tensor_info:
dtype: DT_FLOAT
shape: (4)
name: StatefulPartitionedCall:10
outputs['mask_predictions'] tensor_info:
dtype: DT_FLOAT
shape: (100, 1, 33, 33)
name: StatefulPartitionedCall:11
outputs['num_detections'] tensor_info:
dtype: DT_FLOAT
shape: (1)
name: StatefulPartitionedCall:12
outputs['num_proposals'] tensor_info:
dtype: DT_FLOAT
shape: (1)
name: StatefulPartitionedCall:13
outputs['proposal_boxes'] tensor_info:
dtype: DT_FLOAT
shape: (1, 300, 4)
name: StatefulPartitionedCall:14
outputs['proposal_boxes_normalized'] tensor_info:
dtype: DT_FLOAT
shape: (1, 300, 4)
name: StatefulPartitionedCall:15
outputs['raw_detection_boxes'] tensor_info:
dtype: DT_FLOAT
shape: (1, 300, 4)
name: StatefulPartitionedCall:16
outputs['raw_detection_scores'] tensor_info:
dtype: DT_FLOAT
shape: (1, 300, 2)
name: StatefulPartitionedCall:17
outputs['refined_box_encodings'] tensor_info:
dtype: DT_FLOAT
shape: (300, 1, 4)
name: StatefulPartitionedCall:18
outputs['rpn_box_encodings'] tensor_info:
dtype: DT_FLOAT
shape: (1, 12288, 4)
name: StatefulPartitionedCall:19
outputs['rpn_objectness_predictions_with_background'] tensor_info:
dtype: DT_FLOAT
shape: (1, 12288, 2)
name: StatefulPartitionedCall:20
Method name is: tensorflow/serving/predict
Steps to Reproduce
Download a model from TensorFlow Model Zoo.
Deploy to AI Platform
!gcloud config set project $PROJECT
!gcloud beta ai-platform models create $MODEL --regions=us-central1
%%bash -s $PROJECT $MODEL $VERSION $MODEL_DIR_GS
gcloud ai-platform versions create $3 \
--project $1 \
--model $2 \
--origin $4 \
--runtime-version=2.1 \
--framework=tensorflow \
--python-version=3.7 \
--machine-type=n1-standard-2 \
--accelerator type=nvidia-tesla-t4
Evaluate remotely
import socket

import googleapiclient.discovery  # the discovery submodule must be imported explicitly
import numpy as np

# Dummy 100x100 RGB image, converted to nested lists so it can be sent as JSON
img_np = np.zeros((100, 100, 3), dtype=np.uint8)
img_list = img_np.tolist()
instances = [img_list]
socket.setdefaulttimeout(600)  # set timeout to 10 minutes
service = googleapiclient.discovery.build('ml', 'v1', cache_discovery=False)
model_version_string = 'projects/{}/models/{}/versions/{}'.format(PROJECT, MODEL, VERSION)
print(model_version_string)
response = service.projects().predict(
    name=model_version_string,
    body={'instances': instances}
).execute()
if 'error' in response:
    raise RuntimeError(response['error'])
else:
    print(f'Success. # keys={response.keys()}')
I get an error similar to:
HttpError: <HttpError 400 when requesting
https://ml.googleapis.com/v1/projects/gcp_project/models/error_demo/versions/mobilenet:predict?alt=json
returned "{ "error": "Tensor name: refined_box_encodings has inconsistent batch size: 300
expecting: 1"}}>
Additional Information
The code fails if I change the instances variable in the request body from instances = [img_list] to instances = [{'input_tensor':img_list}].
If I intentionally use an incorrect input shape, e.g. (1, 100, 100, 2) or (100, 100, 2), I get a response stating that the input shape is not correct.
The Google Cloud Storage JSON Error Code documentation states:
invalidArgument -- The value for one of fields in the request body was invalid.
If I repeat this prediction step, I get the same error message, except with different names for tensors.
If I run the process using gcloud
import json
x = {"instances":[
[
[
[0, 0, 0],
[0, 0, 0]
],
[
[0, 0, 0],
[0, 0, 0]
]
]
]
}
with open('test.json', 'w') as f:
    json.dump(x, f)
!gcloud ai-platform predict --model $MODEL --json-request=./test.json
I get an INVALID_ARGUMENT error.
ERROR: (gcloud.ai-platform.predict) HTTP request failed. Response: {
"error": {
"code": 400,
"message": "{ \"error\": \"Tensor name: anchors has inconsistent batch size: 49152 expecting: 1\" }",
"status": "INVALID_ARGUMENT"
}
}
I get the same error if I submit the same JSON data above using Google Cloud Console -- the Test & Use tab of the AI Platform Version Details screen, or the AI Platform Prediction JSON documentation on Method: Projects.predict
I enabled logging (both regular and console), but it gives no additional information.
I've placed the details required to reproduce in a Colab.
Thanks in advance. I've spent over a day working on this and am really stuck!
Per https://github.com/tensorflow/serving/issues/1047, when the request uses the instances key, TensorFlow Serving ensures that all components of the output have the same batch size. The workaround is to use the inputs keyword.
E.g.
inputs = [img_list]
...
response = service.projects().predict(
    name=model_version_string,
    body={'inputs': inputs}
).execute()
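To make the difference between the two request formats concrete: in the TensorFlow Serving REST API, "instances" is the row format, where every input and output is expected to share the same leading batch dimension, while "inputs" is the columnar format, which has no such requirement. Outputs such as anchors (shape (-1, 4)) or refined_box_encodings (shape (300, 1, 4)) don't have the batch size of 1 as their first dimension, which would explain the error. Here is a minimal sketch of building either body; build_predict_body is a hypothetical helper, not part of any Google client library.

```python
import json


def build_predict_body(image_rows, columnar=True):
    """Build a predict request body for a single image given as nested lists.

    columnar=True  -> {"inputs": [...]}     (columnar format, no shared batch size needed)
    columnar=False -> {"instances": [...]}  (row format, outputs must share the batch size)
    """
    batch = [image_rows]  # add the leading batch dimension of 1
    key = "inputs" if columnar else "instances"
    return {key: batch}


# Tiny 2x2 RGB image of zeros, same layout as the test.json in the question
image = [[[0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0]]]
body = build_predict_body(image)
print(json.dumps(body))
```

The resulting dict can be passed as the body argument of service.projects().predict(...), or dumped to a file for gcloud ai-platform predict --json-request.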

(gcloud.ai-platform.jobs.submit.training) INVALID_ARGUMENT: Field

I'm getting the following error when running an AI Platform training job:
ERROR: (gcloud.ai-platform.jobs.submit.training) INVALID_ARGUMENT: Field: master_config.accelerator_config Error: Attaching 1 NVIDIA_TESLA_T4(s) on VM type n1-highcpu-32 is not supported.
- '#type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:
- description: Attaching 1 NVIDIA_TESLA_T4(s) on VM type n1-highcpu-32 is not supported.
field: master_config.accelerator_config
config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-32
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4
For n1-highcpu-32 you can attach 2 or 4 NVIDIA Tesla T4 GPUs, but not 1.
Please verify the combination of Compute Engine machine type and accelerator in the documentation.
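If you want to catch this before submitting a job, a small pre-flight check is one option. The sketch below is only an illustration: the table contains just the single combination confirmed in this answer, and both ALLOWED_GPU_COUNTS and check_accelerator are hypothetical names that you would fill in from the official compatibility table.

```python
# Only the combination confirmed in this thread; extend from the official table.
ALLOWED_GPU_COUNTS = {
    ("n1-highcpu-32", "NVIDIA_TESLA_T4"): {2, 4},
}


def check_accelerator(machine_type, gpu_type, count):
    """Return True/False if the combination is known, or None if it isn't listed."""
    allowed = ALLOWED_GPU_COUNTS.get((machine_type, gpu_type))
    if allowed is None:
        return None  # not in our (deliberately incomplete) table
    return count in allowed


print(check_accelerator("n1-highcpu-32", "NVIDIA_TESLA_T4", 1))
print(check_accelerator("n1-highcpu-32", "NVIDIA_TESLA_T4", 2))
```

For the config.yaml in the question, the first call reports the unsupported count of 1, while setting count to 2 or 4 passes.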

What does 'Attempting to upgrade input file specified using deprecated transformation parameters' mean?

I am currently trying to train my first net with Caffe. I get the following output:
caffe train --solver=first_net_solver.prototxt
I0515 09:01:06.577710 15331 caffe.cpp:117] Use CPU.
I0515 09:01:06.578014 15331 caffe.cpp:121] Starting Optimization
I0515 09:01:06.578097 15331 solver.cpp:32] Initializing solver from parameters:
test_iter: 1
test_interval: 1
base_lr: 0.01
display: 1
max_iter: 2
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0
snapshot: 1
snapshot_prefix: "first_net"
solver_mode: CPU
net: "first_net.prototxt"
I0515 09:01:06.578203 15331 solver.cpp:70] Creating training net from net file: first_net.prototxt
E0515 09:01:06.578348 15331 upgrade_proto.cpp:609] Attempting to upgrade input file specified using deprecated transformation parameters: first_net.prototxt
I0515 09:01:06.578533 15331 upgrade_proto.cpp:612] Successfully upgraded file specified using deprecated data transformation parameters.
E0515 09:01:06.578549 15331 upgrade_proto.cpp:614] Note that future Caffe releases will only support transform_param messages for transformation fields.
E0515 09:01:06.578574 15331 upgrade_proto.cpp:618] Attempting to upgrade input file specified using deprecated V1LayerParameter: first_net.prototxt
I0515 09:01:06.578635 15331 upgrade_proto.cpp:626] Successfully upgraded file specified using deprecated V1LayerParameter
I0515 09:01:06.578729 15331 net.cpp:42] Initializing net from parameters:
name: "first_net"
input: "data"
input_dim: 1
input_dim: 5
input_dim: 41
input_dim: 41
state {
phase: TRAIN
}
layer {
name: "data"
type: "ImageData"
top: "data2"
top: "data-idx"
transform_param {
mirror: false
crop_size: 41
}
image_data_param {
source: "/home/moose/GitHub/first-net/data-images.txt"
}
}
layer {
name: "label-mask"
type: "ImageData"
top: "label-mask"
top: "label-idx"
transform_param {
mirror: false
crop_size: 41
}
image_data_param {
source: "/home/moose/GitHub/first-net/labels-images.txt"
}
}
layer {
name: "assert-idx"
type: "EuclideanLoss"
bottom: "data-idx"
top: "loss"
}
What does
Attempting to upgrade input file specified using deprecated transform parameters / V1LayerParameter
mean? Where exactly did I use something deprecated? What should I use instead?
Recently, input transformation (scaling, cropping, etc.) was moved out of the IMAGE_DATA layer into a separate object, the data transformer. This change affected the protobuf syntax and the syntax of the IMAGE_DATA layer.
It appears as if your first_net.prototxt is in the old format and Caffe converts it for you to the new format.
You can do this conversion manually yourself using ./build/tools/upgrade_net_proto_text (for prototxt files) and ./build/tools/upgrade_net_proto_binary (for binaryproto files).
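If you have several files to convert, the command line can be assembled with a small helper. This is a sketch under assumptions: upgrade_command is a hypothetical function, it assumes Caffe was built under caffe_root, and it assumes both tools take an input file followed by an output file. It only builds the argument list; pass the result to subprocess.run to actually execute it.

```python
import os


def upgrade_command(src, dst, caffe_root="."):
    """Build the argv for Caffe's upgrade tool matching the file type of src."""
    if src.endswith(".binaryproto"):
        tool = "upgrade_net_proto_binary"
    else:
        tool = "upgrade_net_proto_text"  # prototxt and other text-format files
    return [os.path.join(caffe_root, "build", "tools", tool), src, dst]


# Output file name is illustrative only
cmd = upgrade_command("first_net.prototxt", "first_net_upgraded.prototxt")
print(" ".join(cmd))
```

Writing to a new file keeps the original first_net.prototxt intact, so you can diff the two and see exactly which fields were considered deprecated.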