GPU utilization is zero when running batch transform in Amazon SageMaker

I want to run a batch transform job on AWS SageMaker. I have an image classification model which I trained on a local GPU. Now I want to deploy it on AWS SageMaker and make predictions using Batch Transform. The batch transform job runs successfully, but GPU utilization during the job is always zero (GPU memory utilization, however, is at 97%), according to CloudWatch. The job also takes roughly 7 minutes to process 500 images; I would expect it to run much faster, at least compared to the time it takes to process the same images on my local GPU.
My question: why doesn't the GPU get used during batch transform, even though I am using a GPU instance (ml.p3.2xlarge)? I was able to deploy the very same model to an endpoint and send requests; when deploying to an endpoint instead of using batch transform, the GPU actually does get used.
Model preparation
I am using a Keras model with the TensorFlow backend. I converted this model to a SageMaker model following this guide: https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/
import tensorflow as tf
from tensorflow.python.saved_model import builder
from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
from tensorflow.python.saved_model import tag_constants
import tarfile
import sagemaker

# deactivate eager mode
if tf.executing_eagerly():
    tf.compat.v1.disable_eager_execution()

builder = builder.SavedModelBuilder(export_dir)

# Create prediction signature to be used by TensorFlow Serving Predict API
signature = predict_signature_def(
    inputs={"image_bytes": model.input}, outputs={"score_bytes": model.output})

with tf.compat.v1.keras.backend.get_session() as sess:
    # Save the meta graph and variables
    builder.add_meta_graph_and_variables(
        sess=sess, tags=[tag_constants.SERVING], signature_def_map={"serving_default": signature})
    builder.save()

with tarfile.open(tar_model_file, mode='w:gz') as archive:
    archive.add('export', recursive=True)

sagemaker_session = sagemaker.Session()
s3_uri = sagemaker_session.upload_data(path=tar_model_file, bucket=bucket, key_prefix=sagemaker_model_dir)
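As a quick sanity check (not part of the original post), the exported signature can be inspected in a fresh Python session before uploading, assuming TF 2.x with eager execution enabled and export_dir pointing at the directory written above:
import tensorflow as tf

# The "serving_default" signature with the "image_bytes" input should be listed
# if the export above succeeded.
loaded = tf.saved_model.load(export_dir)
print(list(loaded.signatures.keys()))
print(loaded.signatures["serving_default"].structured_input_signature)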
Batch Transform
Container image used for batch transform: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-inference:2.0.0-gpu
framework = 'tensorflow'
instance_type='ml.p3.2xlarge'
image_scope = 'inference'
tf_version = '2.0.0'
py_version = '3.6'
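These variables are presumably fed into sagemaker.image_uris.retrieve to resolve the tensorflow_image URI used below; the exact call is not shown in the post, so this is a hedged sketch (assuming SageMaker Python SDK v2; 'eu-central-1' matches the ECR image listed above):
from sagemaker import image_uris

tensorflow_image = image_uris.retrieve(
    framework=framework,
    region='eu-central-1',      # matches the ECR registry shown above
    version=tf_version,
    py_version=py_version,
    instance_type=instance_type,
    image_scope=image_scope,
)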
from sagemaker.tensorflow import TensorFlowModel

sagemaker_model = TensorFlowModel(model_data=MODEL_TAR_ON_S3, role=role, image_uri=tensorflow_image)
transformer = sagemaker_model.transformer(
    instance_count=1,
    instance_type=instance_type,
    strategy='MultiRecord',
    max_concurrent_transforms=8,
    max_payload=10,  # in MB
    output_path=output_data_path,
)
transformer.transform(data=input_data_path,
                      job_name=job_name,
                      content_type='application/json',
                      logs=False,
                      wait=True)
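The post does not show the input files themselves; with content_type='application/json', the SageMaker TensorFlow Serving container passes JSON request bodies through to TensorFlow Serving's REST API, so one record could be built along these lines (a hedged sketch: build_record, the preprocessing, and the 224x224x3 shape are assumptions, and the nested list must match model.input):
import json
import numpy as np

def build_record(image_array):
    # Serialize one preprocessed image into a TensorFlow Serving REST request body.
    return json.dumps({"instances": [image_array.tolist()]})

# Hypothetical example record; the real shape must match model.input.
record = build_record(np.zeros((224, 224, 3), dtype=np.float32))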
Log file excerpts
Loading the model takes quite a long time (several minutes). During this time, the following errors are logged:
2020-11-08T15:14:12.433+01:00  2020/11/08 14:14:12 [error] 14#14: *3066 no live upstreams while connecting to upstream, client: 169.254.255.130, server: , request: "GET /ping HTTP/1.1", subrequest: "/v1/models/my_model:predict", upstream: "http://tfs_upstream/v1/models/my_model:predict", host: "169.254.255.131:8080"
2020-11-08T15:14:12.433+01:00  169.254.255.130 - - [08/Nov/2020:14:14:12 +0000] "GET /ping HTTP/1.1" 502 157 "-" "Go-http-client/1.1"
2020-11-08T15:14:12.433+01:00  2020/11/08 14:14:12 [error] 14#14: *3066 js: failed ping
2020-11-08T15:14:12.433+01:00  502 Bad Gateway  (the remaining entries are the stripped HTML of nginx/1.16.1's default "502 Bad Gateway" error page)
There was a log entry about a NUMA node read:
successful NUMA node read from SysFS had negative value (-1), but
there must be at least one NUMA node, so returning NUMA node zero
And about a serving warmup request:
No warmup data file found at
/opt/ml/model/export/my_model/1/assets.extra/tf_serving_warmup_requests
And this warning:
[warn] getaddrinfo: address family for nodename not supported

Related

Google Kubernetes Engine: The node was low on resource: ephemeral-storage, which exceeds its request of 0

I have a GKE cluster where I create jobs through Django; it runs my C++ code images, and the builds are triggered through GitHub. It was working just fine up until now. However, I recently pushed a new commit to GitHub (a really small change, three or four lines of basic operations) and it built an image as usual. But this time, when trying to create the job, it said Pod errors: BackoffLimitExceeded, Error with exit code 137, and the job is not completed.
I did some digging into the problem, and by running kubectl describe POD_NAME I got this output from a failed pod:
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-nqgnl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m32s default-scheduler Successfully assigned default/xvb8zfzrhhmz-jk9vf to gke-cluster-1-default-pool-ee7e99bb-xzhk
Normal Pulling 7m7s kubelet Pulling image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest"
Normal Pulled 4m1s kubelet Successfully pulled image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest" in 3m6.343917225s
Normal Created 4m1s kubelet Created container jobcontainer
Normal Started 4m kubelet Started container jobcontainer
Warning Evicted 3m29s kubelet The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
Normal Killing 3m29s kubelet Stopping container jobcontainer
Warning ExceededGracePeriod 3m19s kubelet Container runtime did not kill the pod within specified grace period.
The error occurs because of this line:
The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
I do not have a YAML file where I set my pod information; instead, the job configuration is handled by a Django call which looks like this:
def kube_create_job_object(name, container_image, namespace="default", container_name="jobcontainer", env_vars={}):
    # Body is the object Body
    body = client.V1Job(api_version="batch/v1", kind="Job")
    # Body needs Metadata
    # Attention: Each JOB must have a different name!
    body.metadata = client.V1ObjectMeta(namespace=namespace, name=name)
    # And a Status
    body.status = client.V1JobStatus()
    # Now we start with the Template...
    template = client.V1PodTemplate()
    template.template = client.V1PodTemplateSpec()
    # Passing Arguments in Env:
    env_list = []
    for env_name, env_value in env_vars.items():
        env_list.append(client.V1EnvVar(name=env_name, value=env_value))
    print(env_list)
    security = client.V1SecurityContext(privileged=True, allow_privilege_escalation=True, capabilities=client.V1Capabilities(add=["CAP_SYS_ADMIN"]))
    container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security)
    template.template.spec = client.V1PodSpec(containers=[container], restart_policy='Never')
    body.spec = client.V1JobSpec(backoff_limit=0, ttl_seconds_after_finished=600, template=template.template)
    return body
def kube_create_job(manifest, output_uuid, output_signed_url, webhook_url, valgrind, sleep, isaudioonly):
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform', ])
    credentials.refresh(google.auth.transport.requests.Request())
    cluster_manager = ClusterManagerClient(credentials=credentials)
    cluster = cluster_manager.get_cluster(name=f"path/to/cluster")
    with NamedTemporaryFile(delete=False) as ca_cert:
        ca_cert.write(base64.b64decode(cluster.master_auth.cluster_ca_certificate))
    config = client.Configuration()
    config.host = f'https://{cluster.endpoint}:443'
    config.verify_ssl = True
    config.api_key = {"authorization": "Bearer " + credentials.token}
    config.username = credentials._service_account_email
    config.ssl_ca_cert = ca_cert.name
    client.Configuration.set_default(config)
    # Setup K8 configs
    api_instance = kubernetes.client.BatchV1Api(kubernetes.client.ApiClient(config))
    container_image = get_first_success_build_from_list_builds(client)
    name = id_generator()
    body = kube_create_job_object(name, container_image,
                                  env_vars={
                                      "PROJECT": json.dumps(manifest),
                                      "BUCKET": settings.GS_BUCKET_NAME,
                                  })
    try:
        api_response = api_instance.create_namespaced_job("default", body, pretty=True)
        print(api_response)
    except ApiException as e:
        print("Exception when calling BatchV1Api->create_namespaced_job: %s\n" % e)
    return body
What causes this, and how can I fix it? Am I supposed to set resource/limit variables to some value, and if so, how can I do that inside my Django job call?
It looks like you are running out of storage on the actual node itself. Since your job spec does not have a request for ephemeral storage, it is being scheduled on any node, and in this case it appears that this particular node does not have enough storage available.
I'm not a Python expert, but it looks like you should be able to do something like:
storage_size = SOME_VALUE
requests = {'ephemeral-storage': storage_size}
resources = client.V1ResourceRequirements(requests=requests)
container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security, resources=resources)
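To make the request take effect, the resources object just needs to be passed when the container is built inside kube_create_job_object, as above. A hedged variant that also caps usage with a matching limit (the "1Gi" value is an assumption; size it to the job's real scratch-disk usage):
from kubernetes import client

storage_size = "1Gi"  # assumed value, not from the original post

# Request and cap ephemeral storage so the scheduler only places the pod on
# nodes with enough free local disk, and evicts it if it exceeds the limit.
resources = client.V1ResourceRequirements(
    requests={"ephemeral-storage": storage_size},
    limits={"ephemeral-storage": storage_size},
)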

_InactiveRpcError while querying Vertex AI Matching Engine Index

I am following the example notebook from the GCP docs to test Vertex Matching Engine. I have deployed an index, but when trying to query the index I get an _InactiveRpcError. The VPC network is in us-west2 with private service access enabled, and the index is deployed in us-central1. My VPC network contains the pre-populated firewall rules.
Index
createTime: '2021-11-23T15:25:53.928606Z'
deployedIndexes:
- deployedIndexId: brute_force_glove_deployed_v3
  indexEndpoint: projects/XXXXXXXXXXXX/locations/us-central1/indexEndpoints/XXXXXXXXXXXX
description: testing python script for creating index
displayName: glove_100_brute_force_20211123152551
etag: AMEw9yOVPWBOTpbAvJLllqxWMi2YurEV_sad2n13QvbIlqjOdMyiq_j20gG1ldhdZNTL
metadata:
  config:
    algorithmConfig:
      bruteForceConfig: {}
    dimensions: 100
    distanceMeasureType: DOT_PRODUCT_DISTANCE
metadataSchemaUri: gs://google-cloud-aiplatform/schema/matchingengine/metadata/nearest_neighbor_search_1.0.0.yaml
name: projects/XXXXXXXXXXXX/locations/us-central1/indexes/XXXXXXXXXXXX
updateTime: '2021-11-23T16:04:17.993730Z'
Index-Endpoint
createTime: '2021-11-24T10:59:51.975949Z'
deployedIndexes:
- automaticResources:
    maxReplicaCount: 1
    minReplicaCount: 1
  createTime: '2021-11-30T15:16:12.323028Z'
  deploymentGroup: default
  displayName: brute_force_glove_deployed_v3
  enableAccessLogging: true
  id: brute_force_glove_deployed_v3
  index: projects/XXXXXXXXXXXX/locations/us-central1/indexes/XXXXXXXXXXXX
  indexSyncTime: '2021-11-30T16:37:35.597200Z'
  privateEndpoints:
    matchGrpcAddress: 10.242.4.5
displayName: index_endpoint_for_demo
etag: AMEw9yO6cuDfgpBhGVw7-NKnlS1vdFI5nnOtqVgW1ddMP-CMXM7NfGWVpqRpMRPsNCwc
name: projects/XXXXXXXXXXXX/locations/us-central1/indexEndpoints/XXXXXXXXXXXX
network: projects/XXXXXXXXXXXX/global/networks/XXXXXXXXXXXX
updateTime: '2021-11-24T10:59:53.271100Z'
Code
import grpc
# import the generated classes
import match_service_pb2
import match_service_pb2_grpc
DEPLOYED_INDEX_SERVER_IP = '10.242.0.5'
DEPLOYED_INDEX_ID = 'brute_force_glove_deployed_v3'
query = [-0.11333, 0.48402, 0.090771, -0.22439, 0.034206, -0.55831, 0.041849, -0.53573, 0.18809, -0.58722, 0.015313, -0.014555, 0.80842, -0.038519, 0.75348, 0.70502, -0.17863, 0.3222, 0.67575, 0.67198, 0.26044, 0.4187, -0.34122, 0.2286, -0.53529, 1.2582, -0.091543, 0.19716, -0.037454, -0.3336, 0.31399, 0.36488, 0.71263, 0.1307, -0.24654, -0.52445, -0.036091, 0.55068, 0.10017, 0.48095, 0.71104, -0.053462, 0.22325, 0.30917, -0.39926, 0.036634, -0.35431, -0.42795, 0.46444, 0.25586, 0.68257, -0.20821, 0.38433, 0.055773, -0.2539, -0.20804, 0.52522, -0.11399, -0.3253, -0.44104, 0.17528, 0.62255, 0.50237, -0.7607, -0.071786, 0.0080131, -0.13286, 0.50097, 0.18824, -0.54722, -0.42664, 0.4292, 0.14877, -0.0072514, -0.16484, -0.059798, 0.9895, -0.61738, 0.054169, 0.48424, -0.35084, -0.27053, 0.37829, 0.11503, -0.39613, 0.24266, 0.39147, -0.075256, 0.65093, -0.20822, -0.17456, 0.53571, -0.16537, 0.13582, -0.56016, 0.016964, 0.1277, 0.94071, -0.22608, -0.021106]
channel = grpc.insecure_channel("{}:10000".format(DEPLOYED_INDEX_SERVER_IP))
stub = match_service_pb2_grpc.MatchServiceStub(channel)
request = match_service_pb2.MatchRequest()
request.deployed_index_id = DEPLOYED_INDEX_ID
for val in query:
    request.float_val.append(val)
response = stub.Match(request)
response
Error
_InactiveRpcError Traceback (most recent call last)
/tmp/ipykernel_3451/467153318.py in <module>
108 request.float_val.append(val)
109
--> 110 response = stub.Match(request)
111 response
/opt/conda/lib/python3.7/site-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
944 state, call, = self._blocking(request, timeout, metadata, credentials,
945 wait_for_ready, compression)
--> 946 return _end_unary_response_blocking(state, call, False, None)
947
948 def with_call(self,
/opt/conda/lib/python3.7/site-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline)
847 return state.response
848 else:
--> 849 raise _InactiveRpcError(state)
850
851
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1638277076.941429628","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3093,"referenced_errors":[{"created":"#1638277076.941428202","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
Currently, Matching Engine only supports queries from the same region. Can you try running the code from a VM in us-central1?

ExportDicomData request of Google Cloud Healthcare API on GitHub tutorials never finish

I'm trying the AutoML Vision ML Codelab from the Cloud Healthcare API GitHub tutorials.
https://github.com/GoogleCloudPlatform/healthcare/blob/master/imaging/ml_codelab/breast_density_auto_ml.ipynb
I ran the Export DICOM data cell in the Convert DICOM to JPEG section; the request, as well as all the preceding cells, succeeded.
However, waiting for operation completion times out, and the operation never finishes.
(The ExportDicomData request status on the Dataset page stays "Running" all day. I tried many times, but all the requests got stuck at "Running". A few times I started over from scratch, and the results were the same.)
What I have done so far:
1) Removed the outer "output_config" wrapper, since an INVALID ARGUMENT error occurs otherwise.
https://github.com/GoogleCloudPlatform/healthcare/issues/133
2) Enabled the Cloud Resource Manager API, since it is needed.
This is the cell code.
# Path to export DICOM data.
dicom_store_url = os.path.join(HEALTHCARE_API_URL, 'projects', project_id, 'locations', location, 'datasets', dataset_id, 'dicomStores', dicom_store_id)
path = dicom_store_url + ":export"
# Headers (send request in JSON format).
headers = {'Content-Type': 'application/json'}
# Body (encoded in JSON format).
# output_config = {'output_config': {'gcs_destination': {'uri_prefix': jpeg_folder, 'mime_type': 'image/jpeg; transfer-syntax=1.2.840.10008.1.2.4.50'}}}
output_config = {'gcs_destination': {'uri_prefix': jpeg_folder, 'mime_type': 'image/jpeg; transfer-syntax=1.2.840.10008.1.2.4.50'}}
body = json.dumps(output_config)
resp, content = http.request(path, method='POST', headers=headers, body=body)
assert resp.status == 200, 'error exporting to JPEG, code: {0}, response: {1}'.format(resp.status, content)
print('Full response:\n{0}'.format(content))
# Record operation_name so we can poll for it later.
response = json.loads(content)
operation_name = response['name']
This is the result of waiting.
Waiting for operation completion...
Full response:
{
"name": "projects/my-datalab-tutorials/locations/us-central1/datasets/sample-dataset/operations/18300485449992372225",
"metadata": {
"#type": "type.googleapis.com/google.cloud.healthcare.v1beta1.OperationMetadata",
"apiMethodName": "google.cloud.healthcare.v1beta1.dicom.DicomService.ExportDicomData",
"createTime": "2019-08-18T10:37:49.809136Z"
}
}
AssertionErrorTraceback (most recent call last)
<ipython-input-18-1a57fd38ea96> in <module>()
21 timeout = time.time() + 10*60 # Wait up to 10 minutes.
22 path = os.path.join(HEALTHCARE_API_URL, operation_name)
---> 23 _ = wait_for_operation_completion(path, timeout)
<ipython-input-18-1a57fd38ea96> in wait_for_operation_completion(path, timeout)
15
16 print('Full response:\n{0}'.format(content))
---> 17 assert success, "operation did not complete successfully in time limit"
18 print('Success!')
19 return response
AssertionError: operation did not complete successfully in time limit
The API version is v1beta1.
I was wondering if somebody has any suggestions.
Thank you.
After several more attempts, and after leaving it running overnight, it finally succeeded. I don't know why.
There was a recent update to the codelabs. The error message is due to the timeout in the codelab and not the actual operation. This has been addressed in the update. Please let me know if you are still running into any issues!
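If the operation itself is still running, one workaround is simply to poll it for longer than the codelab's 10-minute limit; a minimal sketch, assuming the same http object, HEALTHCARE_API_URL and operation_name as in the cells above:
import json
import os
import time

timeout = time.time() + 60 * 60  # wait up to 1 hour instead of 10 minutes
path = os.path.join(HEALTHCARE_API_URL, operation_name)
while time.time() < timeout:
    # Long-running operations report "done": true once the export has finished.
    resp, content = http.request(path, method='GET')
    assert resp.status == 200, 'error polling operation, code: {0}'.format(resp.status)
    response = json.loads(content)
    if response.get('done'):
        print('Success!')
        break
    time.sleep(30)
else:
    print('Operation still not finished after the extended wait.')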

Is it possible to avoid the 60-second limit in urllib2.urlopen with GAE?

I'm requesting a file of around 14 MB from a slow server with urllib2.urlopen. It takes more than 60 seconds to get the data, and I'm getting the error:
Deadline exceeded while waiting for HTTP response from URL:
http://bigfile.zip?type=CSV
Here is my code:
class CronChargeBT(webapp2.RequestHandler):
    def get(self):
        taskqueue.add(queue_name='optimized-queue', url='/cronChargeBTB')

class CronChargeBTB(webapp2.RequestHandler):
    def post(self):
        url = "http://bigfile.zip?type=CSV"
        url_request = urllib2.Request(url)
        url_request.add_header('Accept-encoding', 'gzip')
        urlfetch.set_default_fetch_deadline(300)
        response = urllib2.urlopen(url_request, timeout=300)
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        # ...work with the data inside the file...
I create a cron task that calls CronChargeBT. Here is the cron.yaml:
- description: cargar BlueTomato
  url: /cronChargeBT
  target: charge
  schedule: every wed,sun 01:00
and it creates a new task and inserts it into a queue. Here is the queue configuration:
- name: optimized-queue
  rate: 40/s
  bucket_size: 60
  max_concurrent_requests: 10
  retry_parameters:
    task_retry_limit: 1
    min_backoff_seconds: 10
    max_backoff_seconds: 200
Of course the timeout=300 isn't working because of the 60-second limit in GAE, but I thought I could avoid it by using a task... Does anyone know how I can get the data in the file while avoiding this timeout?
Thanks a lot!!!
Cron jobs are limited to a 10-minute deadline, not 60 seconds. If your download fails, perhaps just retry? Does the download work if you download it from your computer? There's nothing you can do on GAE if the server you are downloading from is too slow or unstable.
Edit: According to https://cloud.google.com/appengine/docs/java/outbound-requests#request_timeouts, there is a maximum deadline of 60 seconds for cron job requests. Therefore, you can't get around it.

AWS Worker tier cron - server error #500 - "post http 1.1 500 AWS aws-sqsd/2.0"

I'm trying to set up a cron job on Elastic Beanstalk. The task is being scheduled; for testing purposes it should run every minute. However, it is not working. It is a Django app. The app runs in two environments: one is the worker, and the other one is "hosting" the application.
That part is working. The command is triggered, but it does not actually get executed (the files are not being deleted).
Here is views.py:
@login_required
def delete_expired_files(request):
    users = DemoUser.objects.all()
    for user in users:
        documents = Document.objects.filter(owner=user.id)
        if documents:
            for doc in documents:
                now = timezone.now()
                if now >= doc.date_published + timedelta(days=doc.owner.group.valid_time):
                    doc.delete()
    return redirect("user_home")
cron.yml:
version: 1
cron:
 - name: "delete_expired_files"
   url: "http://networksapp.elasticbeanstalk.com/networks_app/delete_expired_files"
   schedule: "* * * * *"
However, the access_log part of the log file shows this:
"POST /myapp/management/commands/delete_expired_files HTTP/1.1" 500 124709 "-" "aws-sqsd/2.0"
This is the log file I am accessing so far:
Log file content
Why is this happening, and how can I fix it?
Thank you so much.