Amazon SageMaker training job using a prebuilt Docker image - amazon-web-services

Hi, I am new to AWS SageMaker. I am trying to deploy a custom time series model on SageMaker, so I built a Docker image using the SageMaker terminal. But when I try to create a training job, it shows an error. I have been struggling with this for the past four days; could anyone please help me?
Here is my code:
lstm = sage.estimator.Estimator(image,
                                role, 1, 'ml.m4.xlarge',
                                output_path='s3://' + s3Bucket,
                                sagemaker_session=sess)
lstm.fit(upload_data)
Here is my error. I attached a policy with full ECR access permissions to the SageMaker IAM role, and the account is in the same region.
ClientError Traceback (most recent call last)
<ipython-input-48-1d7f3ff70f18> in <module>()
4 sagemaker_session=sess)
5
----> 6 lstm.fit(upload_data)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name, experiment_config)
472 self._prepare_for_training(job_name=job_name)
473
--> 474 self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
475 self.jobs.append(self.latest_training_job)
476 if wait:
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in start_new(cls, estimator, inputs, experiment_config)
1036 train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
1037
-> 1038 estimator.sagemaker_session.train(**train_args)
1039
1040 return cls(estimator.sagemaker_session, estimator._current_job_name)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
588 LOGGER.info("Creating training-job with name: %s", job_name)
589 LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 590 self.sagemaker_client.create_training_job(**train_request)
591
592 def process(
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _api_call(self, *args, **kwargs)
314 "%s() only accepts keyword arguments." % py_operation_name)
315 # The "self" in this scope is referring to the BaseClient.
--> 316 return self._make_api_call(operation_name, kwargs)
317
318 _api_call.__name__ = str(py_operation_name)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _make_api_call(self, operation_name, api_params)
624 error_code = parsed_response.get("Error", {}).get("Code")
625 error_class = self.exceptions.from_code(error_code)
--> 626 raise error_class(parsed_response, operation_name)
627 else:
628 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Cannot find repository: sagemaker-model in registry ID: 534860077983 Please check if your ECR repository exists and role arn:aws:iam::534860077983:role/service-role/AmazonSageMaker-ExecutionRole-20190508T215284 has proper pull permissions for SageMaker: ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer

TL;DR: It seems you are not passing the correct ECR repository for the image to the SageMaker estimator. Does the repository exist?
Also make sure that the repository's permissions allow the principal sagemaker.amazonaws.com to perform ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer.
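To make both checks concrete, here is a minimal sketch that splits an ECR image URI into its parts (so the repository name and registry ID can be compared against the error message) and builds a repository policy granting the three pull actions to sagemaker.amazonaws.com. The function names, the statement Sid, and the us-east-1 region in the example URI are assumptions for illustration; the question does not show the actual value of image or the region.

```python
import re

# Standard ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag]
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository and tag."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError("not a recognised ECR image URI: %r" % uri)
    parts = match.groupdict()
    # Docker falls back to the "latest" tag when none is given.
    parts["tag"] = parts["tag"] or "latest"
    return parts


def sagemaker_pull_policy():
    """Repository policy letting the SageMaker service pull the image."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSageMakerPull",  # hypothetical statement id
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:BatchGetImage",
                    "ecr:GetDownloadUrlForLayer",
                ],
            }
        ],
    }


# The region below is a placeholder; use the one your repository lives in.
example = parse_ecr_image_uri(
    "534860077983.dkr.ecr.us-east-1.amazonaws.com/sagemaker-model:latest"
)
print(example["repository"], example["region"], example["tag"])
```

If the parsed repository name does not match a repository that actually exists in that account and region, the "Cannot find repository" error is expected. With boto3 available, the policy could be applied by passing json.dumps(sagemaker_pull_policy()) as policyText to the ECR client's set_repository_policy call.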

Related

_InactiveRpcError while querying Vertex AI Matching Engine Index

I am following the example notebook as per GCP docs to test Vertex Matching Engine. I have deployed an index but while trying to query the index I am getting _InactiveRpcError. The VPC network is in us-west2 with private service access enabled and the Index is deployed in us-central1. My VPC network contains the pre-populated firewall rules.
Index
createTime: '2021-11-23T15:25:53.928606Z'
deployedIndexes:
- deployedIndexId: brute_force_glove_deployed_v3
indexEndpoint: projects/XXXXXXXXXXXX/locations/us-central1/indexEndpoints/XXXXXXXXXXXX
description: testing python script for creating index
displayName: glove_100_brute_force_20211123152551
etag: AMEw9yOVPWBOTpbAvJLllqxWMi2YurEV_sad2n13QvbIlqjOdMyiq_j20gG1ldhdZNTL
metadata:
config:
algorithmConfig:
bruteForceConfig: {}
dimensions: 100
distanceMeasureType: DOT_PRODUCT_DISTANCE
metadataSchemaUri: gs://google-cloud-aiplatform/schema/matchingengine/metadata/nearest_neighbor_search_1.0.0.yaml
name: projects/XXXXXXXXXXXX/locations/us-central1/indexes/XXXXXXXXXXXX
updateTime: '2021-11-23T16:04:17.993730Z'
Index-Endpoint
createTime: '2021-11-24T10:59:51.975949Z'
deployedIndexes:
- automaticResources:
maxReplicaCount: 1
minReplicaCount: 1
createTime: '2021-11-30T15:16:12.323028Z'
deploymentGroup: default
displayName: brute_force_glove_deployed_v3
enableAccessLogging: true
id: brute_force_glove_deployed_v3
index: projects/XXXXXXXXXXXX/locations/us-central1/indexes/XXXXXXXXXXXX
indexSyncTime: '2021-11-30T16:37:35.597200Z'
privateEndpoints:
matchGrpcAddress: 10.242.4.5
displayName: index_endpoint_for_demo
etag: AMEw9yO6cuDfgpBhGVw7-NKnlS1vdFI5nnOtqVgW1ddMP-CMXM7NfGWVpqRpMRPsNCwc
name: projects/XXXXXXXXXXXX/locations/us-central1/indexEndpoints/XXXXXXXXXXXX
network: projects/XXXXXXXXXXXX/global/networks/XXXXXXXXXXXX
updateTime: '2021-11-24T10:59:53.271100Z'
Code
import grpc
# import the generated classes
import match_service_pb2
import match_service_pb2_grpc
DEPLOYED_INDEX_SERVER_IP = '10.242.0.5'
DEPLOYED_INDEX_ID = 'brute_force_glove_deployed_v3'
query = [-0.11333, 0.48402, 0.090771, -0.22439, 0.034206, -0.55831, 0.041849, -0.53573, 0.18809, -0.58722, 0.015313, -0.014555, 0.80842, -0.038519, 0.75348, 0.70502, -0.17863, 0.3222, 0.67575, 0.67198, 0.26044, 0.4187, -0.34122, 0.2286, -0.53529, 1.2582, -0.091543, 0.19716, -0.037454, -0.3336, 0.31399, 0.36488, 0.71263, 0.1307, -0.24654, -0.52445, -0.036091, 0.55068, 0.10017, 0.48095, 0.71104, -0.053462, 0.22325, 0.30917, -0.39926, 0.036634, -0.35431, -0.42795, 0.46444, 0.25586, 0.68257, -0.20821, 0.38433, 0.055773, -0.2539, -0.20804, 0.52522, -0.11399, -0.3253, -0.44104, 0.17528, 0.62255, 0.50237, -0.7607, -0.071786, 0.0080131, -0.13286, 0.50097, 0.18824, -0.54722, -0.42664, 0.4292, 0.14877, -0.0072514, -0.16484, -0.059798, 0.9895, -0.61738, 0.054169, 0.48424, -0.35084, -0.27053, 0.37829, 0.11503, -0.39613, 0.24266, 0.39147, -0.075256, 0.65093, -0.20822, -0.17456, 0.53571, -0.16537, 0.13582, -0.56016, 0.016964, 0.1277, 0.94071, -0.22608, -0.021106]
channel = grpc.insecure_channel("{}:10000".format(DEPLOYED_INDEX_SERVER_IP))
stub = match_service_pb2_grpc.MatchServiceStub(channel)
request = match_service_pb2.MatchRequest()
request.deployed_index_id = DEPLOYED_INDEX_ID
for val in query:
    request.float_val.append(val)
response = stub.Match(request)
response
Error
_InactiveRpcError Traceback (most recent call last)
/tmp/ipykernel_3451/467153318.py in <module>
108 request.float_val.append(val)
109
--> 110 response = stub.Match(request)
111 response
/opt/conda/lib/python3.7/site-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
944 state, call, = self._blocking(request, timeout, metadata, credentials,
945 wait_for_ready, compression)
--> 946 return _end_unary_response_blocking(state, call, False, None)
947
948 def with_call(self,
/opt/conda/lib/python3.7/site-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline)
847 return state.response
848 else:
--> 849 raise _InactiveRpcError(state)
850
851
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1638277076.941429628","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3093,"referenced_errors":[{"created":"#1638277076.941428202","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
Currently, Matching Engine only supports queries from the same region. Can you try running the code from a VM in us-central1?
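As a quick pre-flight check, the region can be read from the index endpoint's resource name and compared with the region the client runs in. This is only a sketch: the helper names are made up for illustration, and the client region has to be supplied by you (for example, from the zone of the notebook VM).

```python
import re

# Vertex AI resource names look like:
#   projects/<project>/locations/<region>/indexEndpoints/<id>
_LOCATION = re.compile(r"^projects/[^/]+/locations/(?P<location>[^/]+)/")


def resource_location(resource_name):
    """Return the region embedded in a Vertex AI resource name."""
    match = _LOCATION.match(resource_name)
    if match is None:
        raise ValueError("no location in resource name: %r" % resource_name)
    return match.group("location")


def same_region(resource_name, client_region):
    """True when the client runs in the region the resource is deployed in."""
    return resource_location(resource_name) == client_region


endpoint = "projects/XXXXXXXXXXXX/locations/us-central1/indexEndpoints/XXXXXXXXXXXX"
# The setup described in the question: client side in us-west2.
print(same_region(endpoint, "us-west2"))
```

For the configuration in the question this check fails, which matches the UNAVAILABLE / "failed to connect to all addresses" error.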

Sagemaker endpoint invalid when create_monitoring_schedule is called on the endpoint

I am following this GitHub repo, adapting it to a text classification problem built on DistilBERT. Given a string of text, the model should return a label and a (probability) score.
Output from the model:
sentiment_input = {"inputs": "I love using the new Inference DLC."}
# sentiment_input= "I love using the new Inference DLC."
response = predictor.predict(data=sentiment_input)
print(response)
Output:
[{'label': 'LABEL_80', 'score': 0.008507220074534416}]
When I run the following
# Create an EndpointInput
endpointInput = EndpointInput(
    endpoint_name=predictor.endpoint_name,
    probability_attribute="score",
    inference_attribute="label",
    # probability_threshold_attribute=0.5,
    destination="/opt/ml/processing/input_data",
)
# Create the monitoring schedule to execute every hour.
from sagemaker.model_monitor import CronExpressionGenerator
response = clinc_intent0911.create_monitoring_schedule(
    monitor_schedule_name=clincintent_monitor_schedule_name,
    endpoint_input=endpointInput,
    output_s3_uri=baseline_results_uri,
    problem_type="MulticlassClassification",
    ground_truth_input=ground_truth_upload_path,
    constraints=baseline_job.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
I get the following error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-269-72e7049246fb> in <module>
10 constraints=baseline_job.suggested_constraints(),
11 schedule_cron_expression=CronExpressionGenerator.hourly(),
---> 12 enable_cloudwatch_metrics=True,
13 )
/opt/conda/lib/python3.6/site-packages/sagemaker/model_monitor/model_monitoring.py in create_monitoring_schedule(self, endpoint_input, ground_truth_input, problem_type, record_preprocessor_script, post_analytics_processor_script, output_s3_uri, constraints, monitor_schedule_name, schedule_cron_expression, enable_cloudwatch_metrics)
2615 network_config=self.network_config,
2616 )
-> 2617 self.sagemaker_session.sagemaker_client.create_model_quality_job_definition(**request_dict)
2618
2619 # create schedule
/opt/conda/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.__name__ = str(py_operation_name)
/opt/conda/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
674 error_code = parsed_response.get("Error", {}).get("Code")
675 error_class = self.exceptions.from_code(error_code)
--> 676 raise error_class(parsed_response, operation_name)
677 else:
678 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateModelQualityJobDefinition operation: Endpoint 'clinc-intent-analysis-0911' does not exist or is not valid
At this point my SageMaker endpoint is live, and I am unable to work out why it is reported as not valid.
SageMaker Model Monitor currently works out of the box only for tabular datasets (see documentation), hence the "not valid" error message. To use it on NLP problems, you have to bring your own model monitor container (BYOC). Here is an example to get started: https://aws.amazon.com/blogs/machine-learning/detect-nlp-data-drift-using-custom-amazon-sagemaker-model-monitor/
The associated GitHub repo is here: https://github.com/aws-samples/detecting-data-drift-in-nlp-using-amazon-sagemaker-custom-model-monitor

AWS SageMaker does not save model artifacts after training RLEstimator

I am training an RLEstimator with the COACH toolkit and the TENSORFLOW framework.
My code is based on the SageMaker docs:
import sagemaker
bucket = sagemaker.Session().default_bucket()
role = sagemaker.get_execution_role()
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework
instance_type = "ml.c5.2xlarge"
estimator = RLEstimator(source_dir='src',
                        entry_point="train-coach.py",
                        dependencies=["common/sagemaker_rl"],
                        toolkit=RLToolkit.COACH,
                        toolkit_version='0.11.1',
                        framework=RLFramework.TENSORFLOW,
                        role=role,
                        instance_count=1,
                        instance_type=instance_type,
                        output_path='s3://{}/'.format(bucket),
                        base_job_name='my-job-name')
estimator.fit()
Training runs normally. The last lines of the output are:
2021-06-15 06:10:02,088 sagemaker-containers INFO Reporting training SUCCESS
2021-06-15 06:10:20 Uploading - Uploading generated training model
2021-06-15 06:10:20 Completed - Training job completed
Training seconds: 136
Billable seconds: 136
But an attempt to deploy the model causes an error:
predictor = estimator.deploy(instance_type='ml.m4.xlarge',
                             initial_instance_count=1)
update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-8-3b92cb4c3461> in <module>
1 # Deploy my estimator to a SageMaker Endpoint and get a MXNetPredictor
2 predictor = estimator.deploy(instance_type='ml.m4.xlarge',
----> 3 initial_instance_count=1)
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
949 wait=wait,
950 kms_key=kms_key,
--> 951 data_capture_config=data_capture_config,
952 )
953
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/tensorflow/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, update_endpoint)
285 wait=wait,
286 data_capture_config=data_capture_config,
--> 287 update_endpoint=update_endpoint,
288 )
289
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
761 self._base_name = "-".join((self._base_name, compiled_model_suffix))
762
--> 763 self._create_sagemaker_model(instance_type, accelerator_type, tags)
764 production_variant = sagemaker.production_variant(
765 self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
329 vpc_config=self.vpc_config,
330 enable_network_isolation=enable_network_isolation,
--> 331 tags=tags,
332 )
333
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/session.py in create_model(self, name, role, container_defs, vpc_config, enable_network_isolation, primary_container, tags)
2554
2555 try:
-> 2556 self.sagemaker_client.create_model(**create_model_request)
2557 except ClientError as e:
2558 error_code = e.response["Error"]["Code"]
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
384 "%s() only accepts keyword arguments." % py_operation_name)
385 # The "self" in this scope is referring to the BaseClient.
--> 386 return self._make_api_call(operation_name, kwargs)
387
388 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
703 error_code = parsed_response.get("Error", {}).get("Code")
704 error_class = self.exceptions.from_code(error_code)
--> 705 raise error_class(parsed_response, operation_name)
706 else:
707 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://<my-bucket>/sagemaker-rl-tensorflow-2021-06-15-06-05-55-347/output/model.tar.gz.
The /sagemaker-rl-tensorflow-2021-06-15-06-05-55-347/output/ folder contains an intermediate/ folder and output.tar.gz, but no model.tar.gz.
Why are the model artifacts not saved during training?
How can I deploy my model to a SageMaker endpoint?
Your training code needs to save the model in /opt/ml/model.
At the end of training, SageMaker checks this location and uploads its contents to S3.
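A minimal sketch of that pattern follows. It writes a placeholder artifact into the model directory, reading the path from the SM_MODEL_DIR environment variable that the SageMaker training containers set and falling back to /opt/ml/model. The function name, the model.json file name, and the JSON payload are stand-ins for illustration; a real Coach/TensorFlow job would write its own checkpoint or exported model files there instead.

```python
import json
import os


def save_model_artifact(weights, model_dir=None):
    """Write a model artifact where SageMaker looks for it after training.

    `weights` is a stand-in for whatever your framework produces; here it is
    any JSON-serialisable object.
    """
    model_dir = model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.json")  # hypothetical file name
    with open(path, "w") as handle:
        json.dump(weights, handle)
    return path
```

Calling save_model_artifact(...) at the end of the training entry point leaves a file in the model directory, so there is something to package as model.tar.gz. If that directory is empty when training ends, no model.tar.gz is produced, which is the situation in the question.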
You should first check how many episodes you are actually running. If your training output doesn't contain any checkpoint logs showing files being written to /opt/ml/model/data...etc..., then your _save_tf_model function may not have any files to save. I had a similar issue where I ran only a single episode, so no checkpoints wrote *.onnx files to my checkpoint directory.
Also, check line 202 in your sagemaker_rl.coach_launcher.py file. In an older version I was using, it was missing this single line:
import tensorflow as tf

AWS personalize service not available using boto3

I am trying to set up a boto3 client to use the AWS Personalize service, along the lines of what is done here:
https://docs.aws.amazon.com/personalize/latest/dg/data-prep-importing.html
I have faithfully followed the tutorial up to this point. I have my s3 bucket set up, and I have an appropriately formatted csv.
I configured my access and secret tokens and can successfully perform basic operations on my s3 bucket, so I think that part is working:
import s3fs
fs = s3fs.S3FileSystem()
fs.ls('personalize-service-test')
'personalize-test/data'
When I try to create my service, things start to fail:
import boto3
personalize = boto3.client('personalize')
---------------------------------------------------------------------------
UnknownServiceError Traceback (most recent call last)
<ipython-input-13-c23d30ee6bd1> in <module>
----> 1 personalize = boto3.client('personalize')
/anaconda3/envs/ds_std_3.6/lib/python3.6/site-packages/boto3/__init__.py in client(*args, **kwargs)
89 See :py:meth:`boto3.session.Session.client`.
90 """
---> 91 return _get_default_session().client(*args, **kwargs)
92
93
/anaconda3/envs/ds_std_3.6/lib/python3.6/site-packages/boto3/session.py in client(self, service_name, region_name, api_version, use_ssl, verify, endpoint_url, aws_access_key_id, aws_secret_access_key, aws_session_token, config)
261 aws_access_key_id=aws_access_key_id,
262 aws_secret_access_key=aws_secret_access_key,
--> 263 aws_session_token=aws_session_token, config=config)
264
265 def resource(self, service_name, region_name=None, api_version=None,
/anaconda3/envs/ds_std_3.6/lib/python3.6/site-packages/botocore/session.py in create_client(self, service_name, region_name, api_version, use_ssl, verify, endpoint_url, aws_access_key_id, aws_secret_access_key, aws_session_token, config)
836 is_secure=use_ssl, endpoint_url=endpoint_url, verify=verify,
837 credentials=credentials, scoped_config=self.get_scoped_config(),
--> 838 client_config=config, api_version=api_version)
839 monitor = self._get_internal_component('monitor')
840 if monitor is not None:
/anaconda3/envs/ds_std_3.6/lib/python3.6/site-packages/botocore/client.py in create_client(self, service_name, region_name, is_secure, endpoint_url, verify, credentials, scoped_config, api_version, client_config)
77 'choose-service-name', service_name=service_name)
78 service_name = first_non_none_response(responses, default=service_name)
---> 79 service_model = self._load_service_model(service_name, api_version)
80 cls = self._create_client_class(service_name, service_model)
81 endpoint_bridge = ClientEndpointBridge(
/anaconda3/envs/ds_std_3.6/lib/python3.6/site-packages/botocore/client.py in _load_service_model(self, service_name, api_version)
115 def _load_service_model(self, service_name, api_version=None):
116 json_model = self._loader.load_service_model(service_name, 'service-2',
--> 117 api_version=api_version)
118 service_model = ServiceModel(json_model, service_name=service_name)
119 return service_model
/anaconda3/envs/ds_std_3.6/lib/python3.6/site-packages/botocore/loaders.py in _wrapper(self, *args, **kwargs)
130 if key in self._cache:
131 return self._cache[key]
--> 132 data = func(self, *args, **kwargs)
133 self._cache[key] = data
134 return data
/anaconda3/envs/ds_std_3.6/lib/python3.6/site-packages/botocore/loaders.py in load_service_model(self, service_name, type_name, api_version)
376 raise UnknownServiceError(
377 service_name=service_name,
--> 378 known_service_names=', '.join(sorted(known_services)))
379 if api_version is None:
380 api_version = self.determine_latest_version(
UnknownServiceError: Unknown service: 'personalize'. Valid service names are: acm, acm-pca, alexaforbusiness, amplify, apigateway, apigatewaymanagementapi, apigatewayv2, application-autoscaling, appmesh, appstream, appsync, athena, autoscaling, autoscaling-plans, backup, batch, budgets, ce, chime, cloud9, clouddirectory, cloudformation, cloudfront, cloudhsm, cloudhsmv2, cloudsearch, cloudsearchdomain, cloudtrail, cloudwatch, codebuild, codecommit, codedeploy, codepipeline, codestar, cognito-identity, cognito-idp, cognito-sync, comprehend, comprehendmedical, config, connect, cur, datapipeline, datasync, dax, devicefarm, directconnect, discovery, dlm, dms, docdb, ds, dynamodb, dynamodbstreams, ec2, ecr, ecs, efs, eks, elasticache, elasticbeanstalk, elastictranscoder, elb, elbv2, emr, es, events, firehose, fms, fsx, gamelift, glacier, globalaccelerator, glue, greengrass, guardduty, health, iam, importexport, inspector, iot, iot-data, iot-jobs-data, iot1click-devices, iot1click-projects, iotanalytics, kafka, kinesis, kinesis-video-archived-media, kinesis-video-media, kinesisanalytics, kinesisanalyticsv2, kinesisvideo, kms, lambda, lex-models, lex-runtime, license-manager, lightsail, logs, machinelearning, macie, marketplace-entitlement, marketplacecommerceanalytics, mediaconnect, mediaconvert, medialive, mediapackage, mediastore, mediastore-data, mediatailor, meteringmarketplace, mgh, mobile, mq, mturk, neptune, opsworks, opsworkscm, organizations, pi, pinpoint, pinpoint-email, pinpoint-sms-voice, polly, pricing, quicksight, ram, rds, rds-data, redshift, rekognition, resource-groups, resourcegroupstaggingapi, robomaker, route53, route53domains, route53resolver, s3, s3control, sagemaker, sagemaker-runtime, sdb, secretsmanager, securityhub, serverlessrepo, servicecatalog, servicediscovery, ses, shield, signer, sms, sms-voice, snowball, sns, sqs, ssm, stepfunctions, storagegateway, sts, support, swf, textract, transcribe, transfer, translate, waf, waf-regional, workdocs, 
worklink, workmail, workspaces, xray
Indeed the service name 'personalize' is missing from the list.
I already tried upgrading boto3 and botocore to their latest version and restarting my kernel.
boto3 1.9.143
botocore 1.12.143
Any idea as to what to try next would be great.
The boto3 documentation does not (currently on 1.9.143) list personalize as a supported service.
Have you signed up for the preview through the landing page? Personalize is still in limited preview, so your account will not be able to access it through boto3 by default, unless otherwise whitelisted.
Edit: Personalize is currently publicly available, so this answer is no longer relevant.
I had neglected to perform these setup steps:
https://docs.aws.amazon.com/personalize/latest/dg/aws-personalize-set-up-aws-cli.html
When I did that and restarted my kernel, boto3 picked up the service definition and now things seem to work.
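For context, botocore looks for extra service definitions under ~/.aws/models, laid out as <service>/<api-version>/service-2.json, in addition to the models bundled with the package. Presumably the setup steps above placed the Personalize definition there. The sketch below lists the services that have such a locally added definition; the function name is hypothetical, and it only inspects the directory layout rather than calling botocore.

```python
import os


def locally_added_services(models_root=None):
    """List service names that have a service-2.json under the models root.

    Defaults to ~/.aws/models, one of the paths botocore searches for
    service definitions besides its built-in data directory.
    """
    root = models_root or os.path.join(os.path.expanduser("~"), ".aws", "models")
    if not os.path.isdir(root):
        return []
    found = []
    for service in sorted(os.listdir(root)):
        service_dir = os.path.join(root, service)
        if not os.path.isdir(service_dir):
            continue
        # Expected layout: <root>/<service>/<api-version>/service-2.json
        for version in os.listdir(service_dir):
            if os.path.isfile(os.path.join(service_dir, version, "service-2.json")):
                found.append(service)
                break
    return found


print(locally_added_services())
```

If 'personalize' is neither in this list nor in the output of boto3.session.Session().get_available_services(), an UnknownServiceError like the one above is expected.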

H2O machine learning platform for Python incurs EnvironmentError while building models

I am new to the H2O machine learning platform and am having the issue below while trying to build models.
When I tried to build 5 GBM models with a not-so-large dataset, I got the following error:
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [################# ] 34%
EnvironmentError Traceback (most recent call last)
<ipython-input-22-e74b34df2f1a> in <module>()
13 params_model={'x': features_pca_all, 'y': response, 'training_frame': train_holdout_pca_hex, 'validation_frame': validation_holdout_pca_hex, 'ntrees': ntree, 'max_depth':depth, 'min_rows': min_rows, 'learn_rate': 0.005}
14
---> 15 gbm_model=h2o.gbm(**params_model)
16
17 #store model
C:\Anaconda2\lib\site-packages\h2o\h2o.pyc in gbm(x, y, validation_x, validation_y, training_frame, model_id, distribution, tweedie_power, ntrees, max_depth, min_rows, learn_rate, nbins, nbins_cats, validation_frame, balance_classes, max_after_balance_size, seed, build_tree_one_node, nfolds, fold_column, fold_assignment, keep_cross_validation_predictions, score_each_iteration, offset_column, weights_column, do_future, checkpoint)
1058 parms = {k:v for k,v in locals().items() if k in ["training_frame", "validation_frame", "validation_x", "validation_y", "offset_column", "weights_column", "fold_column"] or v is not None}
1059 parms["algo"]="gbm"
-> 1060 return h2o_model_builder.supervised(parms)
1061
1062
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in supervised(kwargs)
28 algo = kwargs["algo"]
29 parms={k:v for k,v in kwargs.items() if (k not in ["x","y","validation_x","validation_y","algo"] and v is not None) or k=="validation_frame"}
---> 30 return supervised_model_build(x,y,vx,vy,algo,offsets,weights,fold_column,parms)
31
32 def unsupervised_model_build(x,validation_x,algo_url,kwargs): return _model_build(x,None,validation_x,None,algo_url,None,None,None,kwargs)
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in supervised_model_build(x, y, vx, vy, algo, offsets, weights, fold_column, kwargs)
16 if not is_auto_encoder and y is None: raise ValueError("Missing response")
17 if vx is not None and vy is None: raise ValueError("Missing response validating a supervised model")
---> 18 return _model_build(x,y,vx,vy,algo,offsets,weights,fold_column,kwargs)
19
20 def supervised(kwargs):
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in _model_build(x, y, vx, vy, algo, offsets, weights, fold_column, kwargs)
86 do_future = kwargs.pop("do_future") if "do_future" in kwargs else False
87 future_model = H2OModelFuture(H2OJob(H2OConnection.post_json("ModelBuilders/"+algo, **kwargs), job_type=(algo+" Model Build")), x)
---> 88 return future_model if do_future else _resolve_model(future_model, **kwargs)
89
90 def _resolve_model(future_model, **kwargs):
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in _resolve_model(future_model, **kwargs)
89
90 def _resolve_model(future_model, **kwargs):
---> 91 future_model.poll()
92 if '_rest_version' in kwargs.keys(): model_json = H2OConnection.get_json("Models/"+future_model.job.dest_key, _rest_version=kwargs['_rest_version'])["models"][0]
93 else: model_json = H2OConnection.get_json("Models/"+future_model.job.dest_key)["models"][0]
C:\Anaconda2\lib\site-packages\h2o\model\model_future.pyc in poll(self)
8
9 def poll(self):
---> 10 self.job.poll()
11 self.x = None
C:\Anaconda2\lib\site-packages\h2o\job.pyc in poll(self)
39 time.sleep(sleep)
40 if sleep < 1.0: sleep += 0.1
---> 41 self._refresh_job_view()
42 running = self._is_running()
43 self._update_progress()
C:\Anaconda2\lib\site-packages\h2o\job.pyc in _refresh_job_view(self)
52
53 def _refresh_job_view(self):
---> 54 jobs = H2OConnection.get_json(url_suffix="Jobs/" + self.job_key)
55 self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0]
56 self.status = self.job["status"]
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in get_json(url_suffix, **kwargs)
410 if __H2OCONN__ is None:
411 raise ValueError("No h2o connection. Did you run `h2o.init()` ?")
--> 412 return __H2OCONN__._rest_json(url_suffix, "GET", None, **kwargs)
413
414 #staticmethod
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in _rest_json(self, url_suffix, method, file_upload_info, **kwargs)
419
420 def _rest_json(self, url_suffix, method, file_upload_info, **kwargs):
--> 421 raw_txt = self._do_raw_rest(url_suffix, method, file_upload_info, **kwargs)
422 return self._process_tables(raw_txt.json())
423
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in _do_raw_rest(self, url_suffix, method, file_upload_info, **kwargs)
476
477 begin_time_seconds = time.time()
--> 478 http_result = self._attempt_rest(url, method, post_body, file_upload_info)
479 end_time_seconds = time.time()
480 elapsed_time_seconds = end_time_seconds - begin_time_seconds
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in _attempt_rest(self, url, method, post_body, file_upload_info)
526
527 except requests.ConnectionError as e:
--> 528 raise EnvironmentError("h2o-py encountered an unexpected HTTP error:\n {}".format(e))
529
530 return http_result
EnvironmentError: h2o-py encountered an unexpected HTTP error:
('Connection aborted.', BadStatusLine("''",))
My hunch is that the cluster has only around 247.5 MB of memory, which is not enough for the model building, so the connection to h2o was aborted. Here is the code I used to start h2o:
# initialization of the h2o module
import subprocess as sp
import sys
import os.path as p
import h2o
# path of h2o jar file
h2o_path = p.join(sys.prefix, "h2o_jar", "h2o.jar")
# subprocess to launch h2o
# the command can be further modified to include virtual machine parameters
sp.Popen("java -jar " + h2o_path)
# h2o.init() call to verify that the h2o launch is successful
h2o.init(ip="localhost", port=54321, size=1, start_h2o=False, enable_assertions=False,
         license=None, max_mem_size_GB=4, min_mem_size_GB=4, ice_root=None)
and here is the returned status table:
Any ideas on the above would be greatly appreciated!!
Just to close out this question, I'll restate the solution mentioned in the comments above. The user was able to resolve the issue by starting H2O from the command line with 1GB of memory using java -jar -Xmx1g h2o.jar, and then connecting to the existing H2O server from Python using h2o.init().
It's not clear to me why h2o.init() was not creating a cluster of the correct size from the max_mem_size_GB argument. Regardless, this argument was recently deprecated and replaced by another argument, max_mem_size, so it may no longer be an issue.
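The same fix can be applied to the launch code in the question by putting the heap flag into the java command itself. One plausible reading, offered as an assumption rather than a confirmed cause, is that memory arguments to h2o.init() had no effect because the JVM was already started separately via Popen and start_h2o=False was passed. The sketch below builds the command as an argument list; the helper name is made up, and it uses the conventional ordering with the JVM option before -jar.

```python
def h2o_launch_command(jar_path, max_heap="1g"):
    """Build the argv for starting an H2O node with an explicit JVM heap limit."""
    return ["java", "-Xmx" + max_heap, "-jar", jar_path]


cmd = h2o_launch_command("h2o.jar")
print(" ".join(cmd))

# To actually launch and connect (requires Java and the h2o package):
#   import subprocess, h2o
#   subprocess.Popen(cmd)
#   h2o.init()   # attaches to the server that is already running
```

Passing a list to subprocess.Popen instead of a concatenated string also avoids quoting problems when the jar path contains spaces.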