Use `capture_tpu_profile` in AI Platform

We are trying to capture TPU profiling data while running our training task on AI Platform, following this tutorial. All the needed information, such as the TPU name, comes from our model output.
config.yaml:
trainingInput:
  scaleTier: BASIC_TPU
  runtimeVersion: '1.15'  # also tried '2.1'
Job submission command:
export DATE=$(date '+%Y%m%d_%H%M%S') && \
gcloud ai-platform jobs submit training "imaterialist_image_classification_model_${DATE}" \
--region=us-central1 \
--staging-bucket="gs://${BUCKET}" \
--module-name='efficientnet.main' \
--config=config.yaml \
--package-path="${PWD}/efficientnet" \
-- \
--data_dir="gs://${BUCKET}/tfrecords/" \
--train_batch_size=8 \
--train_steps=5 \
--model_dir="gs://${BUCKET}/algorithms_training/imaterialist_image_classification_model/${DATE}" \
--model_name='efficientnet-b4' \
--skip_host_call=true \
--gcp_project=${GCP_PROJECT_ID} \
--mode=train
When we tried to run capture_tpu_profile with the name our model got from the master:
capture_tpu_profile --gcp_project="${GCP_PROJECT_ID}" --logdir="gs://${BUCKET}/algorithms_training/imaterialist_image_classification_model/20200318_005446" --tpu_zone='us-central1-b' --tpu='<tpu_IP_address>'
we got this error:
File "/home/kovtuh/.local/lib/python3.7/site-packages/tensorflow_core/python/distribute/cluster_resolver/tpu_cluster_resolver.py", line 480, in _fetch_cloud_tpu_metadata
"constructor. Exception: %s" % (self._tpu, e))
ValueError: Could not lookup TPU metadata from name 'b'<tpu_IP_address>''. Please doublecheck the tpu argument in the TPUClusterResolver constructor. Exception: <HttpError 404 when requesting https://tpu.googleapis.com/v1/projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>?alt=json returned "Resource 'projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>' was not found". Details: "[{'#type': 'type.googleapis.com/google.rpc.ResourceInfo', 'resourceName': 'projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>'}]">
It seems the TPU device isn't attached to our project when it is provisioned by AI Platform. Which project is it attached to, and can we get access to such TPUs to capture their profiles?
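One avenue worth noting (not confirmed for AI Platform-managed TPUs): capture_tpu_profile also accepts --service_addr, which takes the profiler's IP:port directly instead of resolving a TPU node name through the project. A sketch, assuming the profiler listens on its default port 8466:
# Bypass node-name resolution by pointing at the TPU worker's profiler port directly
capture_tpu_profile \
--service_addr="<tpu_IP_address>:8466" \
--logdir="gs://${BUCKET}/algorithms_training/imaterialist_image_classification_model/20200318_005446"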

Related

Google Cloud Storage File System, Python Package Error: AttributeError: 'GCSFile' object has no attribute 'gcsfs'

I am trying to run Python code that downloads and streams chunks of data from a source URL to a destination Cloud Storage blob.
It works fine on a standalone PC, as a local function, and so on.
But when I try the same on GCP Cloud Run, it throws a weird error.
AttributeError: 'GCSFile' object has no attribute 'gcsfs'
Complete error:
Traceback (most recent call last):
  File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1683, in __del__
    self.close()
  File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1661, in close
    self.flush(force=True)
  File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1527, in flush
    self._initiate_upload()
  File "/home/<user>/.local/lib/python3.9/site-packages/gcsfs/core.py", line 1443, in _initiate_upload
    self.gcsfs.loop,
AttributeError: 'GCSFile' object has no attribute 'gcsfs'
This has consumed my week; any help or direction is highly appreciated. Thanks in advance.
The actual code being used:
from flask import Flask, request
import os
import gcsfs
import requests

app = Flask(__name__)

@app.route('/urltogcs')
def urltogcs():
    try:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "secret.json"
        gcp_file_system = gcsfs.GCSFileSystem(project='<project_id>')
        session = requests.Session()
        url = request.args.get('source', 'temp')
        blob_path = request.args.get('destination', 'temp')
        with session.get(url, stream=True) as r:
            r.raise_for_status()
            with gcp_file_system.open(blob_path, 'wb') as f_obj:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f_obj.write(chunk)
        return f'Successfully downloaded from {url} to {blob_path} :)'
    except Exception as e:
        print("Failure")
        print(e)
        return f'download failed for {url} :('

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
Your code (with the proposed changes) works for me:
main.py:
from flask import Flask, request
import os
import gcsfs
import requests

app = Flask(__name__)

project = os.getenv("PROJECT")
port = os.getenv("PORT", 8080)

@app.route('/urltogcs')
def urltogcs():
    try:
        gcp_file_system = gcsfs.GCSFileSystem(project=project)
        session = requests.Session()
        url = request.args.get('source', 'temp')
        blob_path = request.args.get('destination', 'temp')
        with session.get(url, stream=True) as r:
            r.raise_for_status()
            with gcp_file_system.open(blob_path, 'wb') as f_obj:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f_obj.write(chunk)
        return f'Successfully downloaded from {url} to {blob_path} :)'
    except Exception as e:
        print("Failure")
        print(e)
        return f'download failed for {url} :('

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(port))
Note: the code requires project from the environment, which isn't ideal. It would be better if gcsfs.GCSFileSystem didn't require project. Alternatively, project could be obtained from Google's metadata service. For convenience (!), I'm setting it via the environment.
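For illustration, a minimal sketch of the metadata-service alternative (the endpoint and header below are the standard GCE/Cloud Run metadata interface; assigning the result to PROJECT is my choice):
# Inside Cloud Run (or on GCE), the metadata server can supply the project ID
PROJECT=$(curl --silent \
--header "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/project/project-id")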
requirements.txt:
Flask==2.2.2
gcsfs==2022.7.1
gunicorn==20.1.0
Dockerfile:
FROM python:3.10-slim
ENV PYTHONUNBUFFERED True
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
RUN pip install --no-cache-dir -r requirements.txt
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
Bash script:
BILLING="[YOUR-BILLING]"
PROJECT="[YOUR-PROJECT]"
REGION="[YOUR-REGION]"
BUCKET="[YOUR-BUCKET]"
# Create Project
gcloud projects create ${PROJECT}
# Associate with Billing Account
gcloud beta billing projects link ${PROJECT} \
--billing-account=${BILLING}
# Enabled services
SERVICES=(
"artifactregistry"
"cloudbuild"
"run"
)
for SERVICE in "${SERVICES[@]}"
do
gcloud services enable ${SERVICE}.googleapis.com \
--project=${PROJECT}
done
# Create Bucket
gsutil mb -p ${PROJECT} gs://${BUCKET}
# Service Account
ACCOUNT=tester
EMAIL=${ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
# Create Service Account
gcloud iam service-accounts create ${ACCOUNT} \
--project=${PROJECT}
# Create Service Account key
gcloud iam service-accounts keys create ${PWD}/${ACCOUNT}.json \
--iam-account=${EMAIL} \
--project=${PROJECT}
# Ensure Service Account can write to storage
gcloud projects add-iam-policy-binding ${PROJECT} \
--role=roles/storage.admin \
--member=serviceAccount:${EMAIL}
# Only needed for local testing
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${ACCOUNT}.json
# Deploy Cloud Run service
# Run service as Service Account
NAME="urltogcs"
gcloud run deploy ${NAME} \
--source=${PWD} \
--set-env-vars=PROJECT=${PROJECT} \
--no-allow-unauthenticated \
--service-account=${EMAIL} \
--region=${REGION} \
--project=${PROJECT}
# Grab the Cloud Run service's endpoint
ENDPOINT=$(gcloud run services describe ${NAME} \
--region=${REGION} \
--project=${PROJECT} \
--format="value(status.url)")
# Cloud Run service requires auth
TOKEN=$(gcloud auth print-identity-token)
# This page
SRC="https://stackoverflow.com/questions/73393808/"
# Generate a GCS Object name by epoch
DST="gs://${BUCKET}/$(date +%s)"
curl \
--silent \
--get \
--header "Authorization: Bearer ${TOKEN}" \
--data-urlencode "source=${SRC}" \
--data-urlencode "destination=${DST}" \
--write-out '%{response_code}' \
--output /dev/null \
${ENDPOINT}/urltogcs
Yields OK:
200
And:
gsutil ls gs://${BUCKET}
gs://${BUCKET}/1660780270

How to get latest version of an image from artifact registry

Is there a gcloud command that returns the latest fully qualified name of an image from Artifact Registry?
Try:
PROJECT=
REGION=
REPO=
IMAGE=
gcloud artifacts docker images list \
${REGION}-docker.pkg.dev/${PROJECT}/${REPO} \
--filter="package=${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}" \
--sort-by="~UPDATE_TIME" \
--limit=1 \
--format="value(format("{0}#{1}",package,version))"
Because:
Filters the list for a specific image
Sorts the results descending (~) by UPDATE_TIME1
Only takes 1 value, i.e. the most recent
Outputs the result as {package}@{version}
1 -- Curiously, --sort-by uses the output (!) field name, not the underlying type name (the one surfaced by e.g. --format=json or --format=yaml).
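As a usage sketch (the LATEST variable name is mine), the fully qualified reference can be captured and pulled directly:
# Capture the most recent {package}@{version} reference and pull it
LATEST=$(gcloud artifacts docker images list \
${REGION}-docker.pkg.dev/${PROJECT}/${REPO} \
--filter="package=${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}" \
--sort-by="~UPDATE_TIME" \
--limit=1 \
--format="value(format("{0}@{1}",package,version))")
docker pull "${LATEST}"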
Many thanks for the previous answer; I use it to remove the "latest" tag from my last pushed artifact, then add it back when I push another. Leaving this here in case anyone is interested.
Docs: https://cloud.google.com/artifact-registry/docs/docker/manage-images#tag
Remove the tag:
gcloud artifacts docker tags delete \
$(gcloud artifacts docker images list \
${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}/ \
--filter="package=${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}" \
--sort-by="~UPDATE_TIME" --limit=1 \
--format="value(format("{0}",package))"):latest
Add tag:
gcloud artifacts docker tags add \
$(gcloud artifacts docker images list \
${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}/ \
--filter="package=${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}" \
--sort-by="~UPDATE_TIME" --limit=1 \
--format="value(format("{0}#{1}",package,version))") \
$(gcloud artifacts docker images list \
${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}/ \
--filter="package=${REGION}-docker.pkg.dev/${PROJECT}/${REPO}/${IMAGE}" \
--sort-by="~UPDATE_TIME" --limit=1 \
--format="value(format("{0}",package))"):latest

unable to create a gcloud alert policy in command line with multiple conditions

I am trying to create a single alert policy for a Cloud SQL instance_state through gcloud with multiple conditions.
If the instance is in the "RUNNABLE" OR "FAILED" state for more than 5 minutes, then an alert should be triggered. I was able to create this in the console (screenshot omitted).
Now I try the same using the command line with this gcloud command:
gcloud alpha monitoring policies create \
--display-name='Test Database State Alert ('$PROJECTID')' \
--condition-display-name='Instance is not running for 5 minutes' \
--notification-channels="x23234dfdfffffff" \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_COUNT_TRUE"}' \
--condition-filter='metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "RUNNABLE")'
OR 'metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "FAILED")' \
--duration='300s' \
--if='> 0.0' \
--trigger-count=1 \
--combiner='OR' \
--documentation='The rule "${condition.display_name}" has generated this alert for the "${metric.display_name}".' \
--project="$PROJECTID" \
--enabled
I am getting the error below in the OR part of the condition:
ERROR: (gcloud.alpha.monitoring.policies.create) unrecognized arguments:
OR
metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "FAILED")
Even if I put parentheses around the condition it still fails, and the || operator fails as well.
Can anyone please tell me the correct gcloud command for this? I also want the structure of the alert policy to be similar to the one created in the Cloud Console, as shown above.
Thanks
I was able to use gcloud alpha monitoring policies conditions create to append additional conditions.
gcloud alpha monitoring policies create \
--notification-channels=projects/qwiklabs-gcp-04-d822dd6cd419/notificationChannels/2510735656842641871 \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_MEAN"}' \
--condition-display-name='CPU Utilization >0.95 for 1m' \
--condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" resource.type="gce_instance"' \
--duration='1m' \
--if='> 0.95' \
--display-name='alert on spikes or consistently high cpu' \
--combiner='OR'
gcloud alpha monitoring policies list --format='value(name,displayName)'
gcloud alpha monitoring policies conditions create \
projects/qwiklabs-gcp-04-d822dd6cd419/alertPolicies/1712202834227136574 \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_MEAN"}' \
--condition-display-name='CPU Utilization >0.80 for 10m' \
--condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" resource.type="gce_instance"' \
--duration='10m' \
--if='> 0.80'
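To confirm that the second condition was appended, the policy can be described (a sketch reusing the policy name from above):
# Both conditions should appear in the output
gcloud alpha monitoring policies describe \
projects/qwiklabs-gcp-04-d822dd6cd419/alertPolicies/1712202834227136574 \
--format=yaml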
Duplicate --condition-filter clauses did not work for me. YMMV.
From the docs gcloud alpha monitoring policies create, it appears that you can specify repeated (!) occurrences of:
[--aggregation=AGGREGATION --condition-display-name=CONDITION_DISPLAY_NAME --condition-filter=CONDITION_FILTER --duration=DURATION --if=IF_VALUE --trigger-count=TRIGGER_COUNT | --trigger-percent=TRIGGER_PERCENT]
So I think you need to duplicate your --condition-filter with the --combiner="OR", i.e.
gcloud alpha monitoring policies create \
--display-name='Test Database State Alert ('$PROJECTID')' \
--notification-channels="x23234dfdfffffff" \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_COUNT_TRUE"}' \
--condition-display-name='RUNNABLE' \
--condition-filter='metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "RUNNABLE")' \
--duration='300s' \
--if='> 0.0' \
--trigger-count=1 \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_COUNT_TRUE"}' \
--condition-display-name='FAILED' \
--condition-filter='metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "FAILED")' \
--duration='300s' \
--if='> 0.0' \
--trigger-count=1 \
--combiner='OR' \
--documentation='The rule "${condition.display_name}" has generated this alert for the "${metric.display_name}".' \
--project="$PROJECTID" \
--enabled

How to create multiple versions of a model in GCP ML Engine?

GCP has a multiple-version capability for models; you can even specify one version as the default. However, how do you actually upload multiple versions?
For instance, this command will create a new version under an existing model if the version name does not already exist.
%%bash
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/${MODEL_NAME}/${TRAINING_DIR}/export/exporter | tail -1)
DESCRIPTION="Has multiple bang count entries. 1200 training samples"
gcloud ml-engine versions create ${MODEL_VERSION}_${MODEL_SUBVERSION} \
--model ${MODEL_NAME} \
--origin ${MODEL_LOCATION} \
--runtime-version $TFVERSION \
--description="${DESCRIPTION}" \
--labels='some_key'="${SOME_VALUE}",another_key="another_value"
However, each time I bump the model version, I get this error:
ERROR: (gcloud.ml-engine.versions.create) ALREADY_EXISTS: Field: version.name Error: A version with the same name already exists.
- '@type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:
- description: A version with the same name already exists.
field: version.name
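One way to sidestep the collision (a sketch using the same gcloud surface as the question) is to check which version names already exist, or to delete the clashing one before recreating it:
# List existing versions for the model; pick a name that isn't taken
gcloud ml-engine versions list --model ${MODEL_NAME}
# Or remove the clashing version first (destructive; use with care)
gcloud ml-engine versions delete ${MODEL_VERSION}_${MODEL_SUBVERSION} --model ${MODEL_NAME}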

runtime_version versus runtime-version in cloudml-samples/flowers/sample.sh

In Google's sample code found at cloudml-samples/flowers/sample.sh, between lines 52 and 64, the argument "runtime_version" appears:
# Training on CloudML is quick after preprocessing. If you ran the above
# commands asynchronously, make sure they have completed before calling this one.
gcloud ml-engine jobs submit training "$JOB_ID" \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--runtime_version=1.0 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
Shouldn't "runtime_version" be replaced with "runtime-version" to avoid an error?
Yes. I've submitted a PR (in the future, never hesitate to do so yourself).
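For clarity, here is the sample command with only that flag corrected; note that the flags after the bare -- are passed to the trainer module itself, so their underscores are fine:
gcloud ml-engine jobs submit training "$JOB_ID" \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--runtime-version=1.0 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"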