(gcloud.ai-platform.jobs.submit.training) INVALID_ARGUMENT: Field - google-cloud-platform

I'm getting the following error when running an AI Platform training job:
ERROR: (gcloud.ai-platform.jobs.submit.training) INVALID_ARGUMENT: Field: master_config.accelerator_config Error: Attaching 1 NVIDIA_TESLA_T4(s) on VM type n1-highcpu-32 is not supported.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: Attaching 1 NVIDIA_TESLA_T4(s) on VM type n1-highcpu-32 is not supported.
    field: master_config.accelerator_config
config.yaml:
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-32
  masterConfig:
    acceleratorConfig:
      count: 1
      type: NVIDIA_TESLA_T4

For n1-highcpu-32 you can attach only 2 or 4 NVIDIA Tesla T4 GPUs, so a count of 1 is rejected.
Please verify the combination of Compute Engine machine type and accelerator here
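A pre-submit check along these lines could catch the mismatch before calling gcloud. This is only a sketch: the lookup table holds just the one combination stated above (n1-highcpu-32 with T4 accepts 2 or 4 GPUs), and the names ALLOWED_COUNTS and check_accelerator are hypothetical. The full compatibility matrix has to come from the AI Platform documentation.

```python
# Allowed GPU counts per (machine type, accelerator type).
# Only the combination named in the answer is filled in; extend it from the docs.
ALLOWED_COUNTS = {
    ("n1-highcpu-32", "NVIDIA_TESLA_T4"): {2, 4},
}


def check_accelerator(machine_type, accelerator_type, count):
    """Return None if the combination looks valid (or is unknown), else an error message."""
    allowed = ALLOWED_COUNTS.get((machine_type, accelerator_type))
    if allowed is None:
        # Combination is not in the table, so this sketch cannot judge it.
        return None
    if count in allowed:
        return None
    return (
        f"{count} x {accelerator_type} on {machine_type} is not supported; "
        f"allowed counts: {sorted(allowed)}"
    )


# The config from the question (count: 1) is flagged; count: 2 passes.
print(check_accelerator("n1-highcpu-32", "NVIDIA_TESLA_T4", 1))
print(check_accelerator("n1-highcpu-32", "NVIDIA_TESLA_T4", 2))
```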

Related

Prometheus series values for time metrics

I'm defining a data series for testing a Prometheus alert using the container_last_seen metric from the cadvisor exporter.
How do I enter timestamp series values, as returned by the container_last_seen metric? I'm testing the alerts on a Mac; in production they run on Linux boxes.
Here's one thing I tried:
input_series:
  - series: |
      container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
It seems whatever I put in the values for the series is not accepted.
I've also tried durations: '0h+1mx60'
Since time() - container_last_seen{...} is a legal expression, container_last_seen is definitely a timestamp, and I would expect a timestamp to be represented by a Unix epoch number. Executing the query on Prometheus gives Unix epoch times, but putting numbers in a series is rejected with the error below.
promtool is recognising the different types but giving much the same error:
➜ promtool test rules alertrules-service-oriented-test.yml
Unit Testing: alertrules-service-oriented-test.yml
FAILED:
1:1: parse error: unexpected number "0" in series values
If the values are '1h+0mx61', promtool correctly identifies the values as durations:
1:1: parse error: unexpected duration "1h" in series values
Note that when this test is commented out, there is no 1:1: parse error and the tests complete successfully. This is not a problem with out of sight parts of the test file.
Thanks for any insights.
Here's the alert:
alertrules.yaml:
- name: containers
  interval: 15s
  rules:
  - alert: prod_container_crashing
    expr: |
      count by (instance, container_label_com_docker_swarm_service_name)
      (
        count_over_time(container_last_seen{container_label_com_docker_swarm_service_name!="",env="prod"}[15m])
      ) - 1 > 2
    for: 5m
    labels:
      service: prod
      type: container
      severity: critical
    annotations:
      summary: "pdce {{ $labels.container_label_com_docker_swarm_service_name }}"
      description: "{{ $labels.container_label_com_docker_swarm_service_name }} in prod cluster on {{ $labels.instance }} is crashing"
and here's the test file:
alertrules_test.yml:
rule_files:
  - alertrules.yml
evaluation_interval: 1m
tests:
  - name: container_tests
    interval: 15s
    input_series:
      - series: |
          container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
        values: '1563968832+0x61'
    alert_rule_test:
      - eval_time: 15m
        alertname: prod_container_crashing
        exp_alerts:
          - exp_labels:
              service: prod
              type: container
              severity: critical
            exp_annotations:
              summary: prod service1
              description: service1 in prod cluster on 10.0.0.1 is crashing
When the series: value is all on one line, without a > or | YAML block scalar indicator, e.g.
  - series: container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
the error goes away, and I don't know why. So this doesn't appear to be a data typing issue.
It's a shame for readability reasons. Either Prometheus or Go may have a squeaky wheel in its YAML implementation.
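For reference, promtool's expanding notation 'a+bxn' stands for the n+1 samples a, a+b, ..., a+n*b, so '1563968832+0x61' describes 62 identical epoch samples, which at a 15s interval covers just over 15 minutes. The sketch below mimics that expansion for a single term so the intended series can be checked by eye. The function name expand is hypothetical, and it handles only the plain numeric form, not promtool's `_` (missing sample) or `stale` tokens.

```python
import re

# One expanding-notation term: start, sign, step, repeat count (e.g. "1+2x3").
_EXPANDING = re.compile(r"^(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)$")


def expand(term):
    """Expand one 'a+bxn' or 'a-bxn' term into its n+1 sample values."""
    match = _EXPANDING.match(term)
    if match is None:
        raise ValueError(f"not an expanding-notation term: {term!r}")
    start, sign, step, count = match.groups()
    signed_step = float(step) if sign == "+" else -float(step)
    # n repetitions of the step yield n+1 samples, including the start value.
    return [float(start) + i * signed_step for i in range(int(count) + 1)]


samples = expand("1563968832+0x61")
# 62 samples, all equal to the start value because the step is 0.
print(len(samples), samples[0], samples[-1])
```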

How to set table format as default in google cloud shell?

When I try to list anything, the result is not shown as a table (as in the video). Each region is listed separately with its fields, something like this:
NAME: us-west3
CPUS: 0/24
DISKS_GB: 0/4096
ADDRESSES: 0/8
RESERVED_ADDRESSES: 0/8
STATUS: UP
TURNDOWN_DATE:
NAME: us-west4
CPUS: 0/24
DISKS_GB: 0/4096
ADDRESSES: 0/8
RESERVED_ADDRESSES: 0/8
STATUS: UP
TURNDOWN_DATE:
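This flattened output is what gcloud prints when the accessibility/screen_reader property is enabled. Please try:
  gcloud config set accessibility/screen_reader False
and then repeat the command.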

Error instantiating snitch class 'org.apache.cassandra.locator.Ec2Snitch'

I'm having a hard time setting up a 2-node Cassandra cluster on EC2 instances. This is version 2.2.19; I cannot upgrade because of other dependencies.
The EC2 instances are in a private subnet and have static private IPs assigned.
Here is my cassandra.yaml:
cluster_name: 'Test-cluster'
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "${private_ip}"
listen_address: ${private_ip}
start_native_transport: true
native_transport_port: 9042
storage_port: 7000
num_tokens: 32
ssl_storage_port: 9042
start_rpc: true
rpc_address: ${private_ip}
rpc_port: 9160
broadcast_rpc_address: ${private_ip}
endpoint_snitch: Ec2Snitch
partitioner: org.apache.cassandra.dht.RandomPartitioner
Here is my system.log
INFO [main] 2021-06-07 18:42:41,900 DatabaseDescriptor.java:327 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO [main] 2021-06-07 18:42:42,022 DatabaseDescriptor.java:437 - Global memtable on-heap threshold is enabled at 251MB
INFO [main] 2021-06-07 18:42:42,023 DatabaseDescriptor.java:441 - Global memtable off-heap threshold is enabled at 251MB
ERROR [main] 2021-06-07 18:42:42,049 CassandraDaemon.java:787 - Exception encountered during startup
org.apache.cassandra.exceptions.ConfigurationException: Error instantiating snitch class 'org.apache.cassandra.locator.Ec2Snitch'.
at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:551) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:529) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.config.DatabaseDescriptor.createEndpointSnitch(DatabaseDescriptor.java:741) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.config.DatabaseDescriptor.applyConfig(DatabaseDescriptor.java:465) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:133) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:599) [apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:774) [apache-cassandra-2.2.19.jar:2.2.19]
Caused by: org.apache.cassandra.exceptions.ConfigurationException: Ec2Snitch was unable to execute the API call. Not an ec2 node?
at org.apache.cassandra.locator.Ec2Snitch.awsApiCall(Ec2Snitch.java:79) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.locator.Ec2Snitch.<init>(Ec2Snitch.java:55) ~[apache-cassandra-2.2.19.jar:2.2.19]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_282]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_282]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_282]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_282]
at java.lang.Class.newInstance(Class.java:442) ~[na:1.8.0_282]
at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:536) ~[apache-cassandra-2.2.19.jar:2.2.19]
Note: When I change snitch to SimpleSnitch it actually works.
Please help!!
Answering my own question.
Ec2Snitch uses IMDSv1 to read instance metadata from http://169.254.169.254/latest/meta-data/placement/availability-zone, which it needs to determine the node's region and availability zone.
I created the EC2 instances through Terraform, where my code has
metadata_options {
  http_endpoint = "enabled"
  http_tokens   = "required"
}
The above setting forces IMDSv2 only, which is what causes the issue: Ec2Snitch issues a plain, token-less GET (the equivalent of a simple curl command), and an IMDSv2-only instance rejects it.
Solution:
metadata_options {
  http_endpoint = "enabled"
  http_tokens   = "optional"
}
If you launch the instance through the console, make sure the metadata version is set to "V1 and V2".
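The difference between the two metadata modes can be sketched as follows. With IMDSv1 a plain GET is enough, which is what the answer above says Ec2Snitch relies on. With IMDSv2 the client first has to obtain a session token with a PUT request and then send that token as a header on every metadata request. The helper names below are hypothetical, and fetch_az_v2 only works when run on an EC2 instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"
AZ_PATH = "/latest/meta-data/placement/availability-zone"


def v1_request():
    """IMDSv1: a plain, token-less GET. Rejected when http_tokens is 'required'."""
    return urllib.request.Request(IMDS + AZ_PATH)


def v2_token_request(ttl_seconds=60):
    """IMDSv2 step 1: PUT to the token endpoint with a TTL header."""
    return urllib.request.Request(
        IMDS + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def v2_request(token):
    """IMDSv2 step 2: the same GET as v1, plus the session token header."""
    return urllib.request.Request(
        IMDS + AZ_PATH,
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_az_v2(timeout=2):
    """Fetch the availability zone via IMDSv2 (only meaningful on an EC2 instance)."""
    with urllib.request.urlopen(v2_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(v2_request(token), timeout=timeout) as resp:
        return resp.read().decode()
```

Running the equivalent of v1_request against an instance is a quick way to check whether a token-less client such as the snitch described above can reach the metadata service at all.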

How to create the config.yaml file for distributed training on Unified Cloud AI Platform

I am looking to train a model using Google Cloud's new service - the Unified AI Platform. To do so I am using a config.yaml that looks like this:
workerPoolSpecs:
  workerPoolSpec:
    machineSpec:
      machineType: n1-highmem-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 1
    pythonPackageSpec:
      executorImageUri: us-docker.pkg.dev/cloud-aiplatform/training/tf-gpu.2-4:latest
      packageUris: gs://path/to/bucket/unified_ai_platform/src_dist/trainer-0.1.tar.gz
      pythonModule: trainer.task
  workerPoolSpec:
    machineSpec:
      machineType: n1-highmem-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 2
    pythonPackageSpec:
      executorImageUri: us-docker.pkg.dev/cloud-aiplatform/training/tf-gpu.2-4:latest
      packageUris: gs://path/to/bucket/unified_ai_platform/src_dist/trainer-0.1.tar.gz
      pythonModule: trainer.task
However, for distributed training I am unable to understand how to pass multiple workerPoolSpecs in this file. The example YAML file provided does not cover the case where I provide multiple workerPoolSpecs.
The example's documentation also says that "You can specify multiple worker pool specs in order to create a custom job with multiple worker pools".
Any help in this regard will be appreciated.
Answering my own question. The config.yaml file should look like this:
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 1
    containerSpec:
      imageUri: gcr.io/path/to/container:v2
      args:
        - --model-dir=gs://path/to/model
        - --tfrecord-dir=gs://path/to/training/data/
        - --epochs=2
  - machineSpec:
      machineType: n1-standard-16
      acceleratorType: NVIDIA_TESLA_P100
      acceleratorCount: 2
    replicaCount: 2
    containerSpec:
      imageUri: gcr.io/path/to/container:v2
      args:
        - --model-dir=gs://path/to/models
        - --tfrecord-dir=gs://path/to/training/data/
        - --epochs=2
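The key point in the working file is that workerPoolSpecs is a sequence: each `-` item is one worker pool, where the question's version repeated a workerPoolSpec mapping key. The same shape can be sketched in Python as a list of dicts. The helper worker_pool is hypothetical, the values are the placeholders from the answer, and whether a given gcloud version accepts the JSON printed at the end in place of YAML is an assumption to verify.

```python
import json


def worker_pool(machine_type, accelerator_type, accelerator_count,
                replica_count, image_uri, args):
    """Build one entry of the workerPoolSpecs list (hypothetical helper)."""
    return {
        "machineSpec": {
            "machineType": machine_type,
            "acceleratorType": accelerator_type,
            "acceleratorCount": accelerator_count,
        },
        "replicaCount": replica_count,
        "containerSpec": {"imageUri": image_uri, "args": list(args)},
    }


# Placeholder arguments copied from the answer's first pool.
ARGS = [
    "--model-dir=gs://path/to/model",
    "--tfrecord-dir=gs://path/to/training/data/",
    "--epochs=2",
]

config = {
    "workerPoolSpecs": [
        # First list item: the single-replica pool.
        worker_pool("n1-standard-16", "NVIDIA_TESLA_P100", 2, 1,
                    "gcr.io/path/to/container:v2", ARGS),
        # Second list item: the pool with two replicas.
        worker_pool("n1-standard-16", "NVIDIA_TESLA_P100", 2, 2,
                    "gcr.io/path/to/container:v2", ARGS),
    ]
}

print(json.dumps(config, indent=2))
```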

GCP Dataproc - Error: Unknown name "optionalComponents" at 'cluster.config': Cannot find field

I am trying to create a Dataproc cluster from a configuration defined in a YAML file (using import).
The command I have been using successfully:
$ gcloud beta dataproc clusters import $CLUSTER_NAME --region=$REGION \
    --source=cluster_conf_file.yaml
Later on I tried adding the HBase component, which is one of the available optional components, using the --optional-components flag:
$ gcloud beta dataproc clusters import $CLUSTER_NAME --optional-components=HBASE --region=$REGION \
    --source=cluster_conf_file.yaml
(Documentation referred to:
https://cloud.google.com/dataproc/docs/concepts/components/hbase#installing_the_component)
This caused the error below:
ERROR: (gcloud.beta.dataproc.clusters.import) unrecognized arguments: --optional-components=HBASE
Then I tried adding the option as optionalComponents in the YAML file (instead of passing it on the command line), following this documentation.
Sample YAML:
config:
  endpointConfig:
    enableHttpPortAccess: BOOLEAN_VALUE
  configBucket: BUCKET_NAME
  gceClusterConfig:
    serviceAccount: SERVICE_ACCOUNT
    subnetworkUri: SUBNETWORK_URI
    tags:
    - Tag1
    - TAG2
  optionalComponents:    # <---- attribute causing the error
  - HBASE
  softwareConfig:
    imageVersion: IMAGE_VERSION
    properties:
      PROPERTY: VALUE
  .
  .
  .
  masterConfig:
    diskConfig:
      bootDiskSizeGb: SIZE
      bootDiskType: TYPE
    machineTypeUri: TYPE_URI
    numInstances: COUNT
This caused the error below:
ERROR: (gcloud.dataproc.clusters.import) INVALID_ARGUMENT: Invalid JSON payload received. Unknown name "optionalComponents" at 'cluster.config': Cannot find field.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: "Invalid JSON payload received. Unknown name \"optionalComponents\"\
      \ at 'cluster.config': Cannot find field."
    field: cluster.config
Is there a way to fix this?
optionalComponents should be under config.softwareConfig:
config:
  ...
  softwareConfig:
    imageVersion: IMAGE_VERSION
    optionalComponents:
    - ZOOKEEPER
    - HBASE
You can verify this by first creating a cluster with optional components and then exporting it to a YAML file.
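As an illustration of the fix, the following sketch moves a misplaced optionalComponents list from config into config.softwareConfig on an already parsed cluster definition (a plain dict, for example the result of loading the YAML file). The function name relocate_optional_components is hypothetical and not part of any Google tooling.

```python
import copy


def relocate_optional_components(cluster):
    """Return a copy with config.optionalComponents moved under config.softwareConfig."""
    fixed = copy.deepcopy(cluster)
    config = fixed.get("config", {})
    if "optionalComponents" in config:
        misplaced = config.pop("optionalComponents")
        software = config.setdefault("softwareConfig", {})
        existing = software.setdefault("optionalComponents", [])
        # Merge without duplicating components that are already listed.
        for component in misplaced:
            if component not in existing:
                existing.append(component)
    return fixed


# Shape of the failing file from the question, reduced to the relevant keys.
broken = {
    "config": {
        "optionalComponents": ["HBASE"],
        "softwareConfig": {"imageVersion": "IMAGE_VERSION"},
    }
}

print(relocate_optional_components(broken))
```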