aws sagemaker does not save model artifacts after training RLEstimator - amazon-web-services

I am training an RLEstimator with the Coach toolkit and the TensorFlow framework.
My code is based on the SageMaker docs:
import sagemaker
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

bucket = sagemaker.Session().default_bucket()
role = sagemaker.get_execution_role()
instance_type = "ml.c5.2xlarge"

estimator = RLEstimator(source_dir='src',
                        entry_point="train-coach.py",
                        dependencies=["common/sagemaker_rl"],
                        toolkit=RLToolkit.COACH,
                        toolkit_version='0.11.1',
                        framework=RLFramework.TENSORFLOW,
                        role=role,
                        instance_count=1,
                        instance_type=instance_type,
                        output_path='s3://{}/'.format(bucket),
                        base_job_name='my-job-name')
estimator.fit()
Training runs normally; the last lines of the output are:
2021-06-15 06:10:02,088 sagemaker-containers INFO Reporting training SUCCESS
2021-06-15 06:10:20 Uploading - Uploading generated training model
2021-06-15 06:10:20 Completed - Training job completed
Training seconds: 136
Billable seconds: 136
But an attempt to deploy the model causes an error:
predictor = estimator.deploy(instance_type='ml.m4.xlarge',
                             initial_instance_count=1)
update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-8-3b92cb4c3461> in <module>
1 # Deploy my estimator to a SageMaker Endpoint and get a MXNetPredictor
2 predictor = estimator.deploy(instance_type='ml.m4.xlarge',
----> 3 initial_instance_count=1)
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
949 wait=wait,
950 kms_key=kms_key,
--> 951 data_capture_config=data_capture_config,
952 )
953
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/tensorflow/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, update_endpoint)
285 wait=wait,
286 data_capture_config=data_capture_config,
--> 287 update_endpoint=update_endpoint,
288 )
289
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
761 self._base_name = "-".join((self._base_name, compiled_model_suffix))
762
--> 763 self._create_sagemaker_model(instance_type, accelerator_type, tags)
764 production_variant = sagemaker.production_variant(
765 self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
329 vpc_config=self.vpc_config,
330 enable_network_isolation=enable_network_isolation,
--> 331 tags=tags,
332 )
333
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/session.py in create_model(self, name, role, container_defs, vpc_config, enable_network_isolation, primary_container, tags)
2554
2555 try:
-> 2556 self.sagemaker_client.create_model(**create_model_request)
2557 except ClientError as e:
2558 error_code = e.response["Error"]["Code"]
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
384 "%s() only accepts keyword arguments." % py_operation_name)
385 # The "self" in this scope is referring to the BaseClient.
--> 386 return self._make_api_call(operation_name, kwargs)
387
388 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
703 error_code = parsed_response.get("Error", {}).get("Code")
704 error_class = self.exceptions.from_code(error_code)
--> 705 raise error_class(parsed_response, operation_name)
706 else:
707 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://<my-bucket>/sagemaker-rl-tensorflow-2021-06-15-06-05-55-347/output/model.tar.gz.
The /sagemaker-rl-tensorflow-2021-06-15-06-05-55-347/output/ folder contains an intermediate/ folder and output.tar.gz, but no model.tar.gz exists.
Why are the model artifacts not saved during training?
How can I deploy my model to a SageMaker endpoint?

You need your code to save a model in /opt/ml/model.
At the end of training, SageMaker checks this location and uploads its contents to S3.
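For example, here is a minimal sketch of what the end of a training script could do, assuming a TF 1.x environment (matching Coach 0.11); the graph below is a toy stand-in, since with the Coach launcher the _save_tf_model hook normally does this for you:

import os
import tensorflow as tf  # TF 1.x, matching Coach 0.11

# SageMaker sets SM_MODEL_DIR (default /opt/ml/model); everything written
# here is packed into model.tar.gz and uploaded to the S3 output_path.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

# Toy graph standing in for the trained policy network.
obs = tf.placeholder(tf.float32, [None, 4], name="observation")
act = tf.layers.dense(obs, 2, name="action")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # TF Serving (behind the TensorFlow endpoint) expects a numeric
    # version subdirectory inside the model directory.
    tf.saved_model.simple_save(
        sess,
        os.path.join(model_dir, "1"),
        inputs={"observation": obs},
        outputs={"action": act},
    )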

You should first check how many episodes you're actually running. If your training printout doesn't contain any checkpoint logs writing to /opt/ml/model/data...etc..., then your _save_tf_model function may not have any files to save. I had a similar issue where I only ran a single episode, and thus there were no checkpoints sending *.onnx files to my checkpoint directory.
Also, check line 202 of your sagemaker_rl/coach_launcher.py file. In an older version I was using, it was missing the single line:
import tensorflow as tf

Related

Sagemaker endpoint invalid when create_monitoring_schedule is called on the endpoint

I am following this GitHub repo, adapting it to a text classification problem built on DistilBERT. So given a string of text, the model should return a label and a (probability) score.
Output from the model:
sentiment_input = {"inputs": "I love using the new Inference DLC."}
# sentiment_input= "I love using the new Inference DLC."
response = predictor.predict(data=sentiment_input)
print(response)
Output:
[{'label': 'LABEL_80', 'score': 0.008507220074534416}]
When I run the following
# Create an EndpointInput
from sagemaker.model_monitor import EndpointInput

endpointInput = EndpointInput(
    endpoint_name=predictor.endpoint_name,
    probability_attribute="score",
    inference_attribute="label",
    # probability_threshold_attribute=0.5,
    destination="/opt/ml/processing/input_data",
)
# Create the monitoring schedule to execute every hour.
from sagemaker.model_monitor import CronExpressionGenerator

response = clinc_intent0911.create_monitoring_schedule(
    monitor_schedule_name=clincintent_monitor_schedule_name,
    endpoint_input=endpointInput,
    output_s3_uri=baseline_results_uri,
    problem_type="MulticlassClassification",
    ground_truth_input=ground_truth_upload_path,
    constraints=baseline_job.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
I get the following error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<ipython-input-269-72e7049246fb> in <module>
10 constraints=baseline_job.suggested_constraints(),
11 schedule_cron_expression=CronExpressionGenerator.hourly(),
---> 12 enable_cloudwatch_metrics=True,
13 )
/opt/conda/lib/python3.6/site-packages/sagemaker/model_monitor/model_monitoring.py in create_monitoring_schedule(self, endpoint_input, ground_truth_input, problem_type, record_preprocessor_script, post_analytics_processor_script, output_s3_uri, constraints, monitor_schedule_name, schedule_cron_expression, enable_cloudwatch_metrics)
2615 network_config=self.network_config,
2616 )
-> 2617 self.sagemaker_session.sagemaker_client.create_model_quality_job_definition(**request_dict)
2618
2619 # create schedule
/opt/conda/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.__name__ = str(py_operation_name)
/opt/conda/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
674 error_code = parsed_response.get("Error", {}).get("Code")
675 error_class = self.exceptions.from_code(error_code)
--> 676 raise error_class(parsed_response, operation_name)
677 else:
678 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateModelQualityJobDefinition operation: Endpoint 'clinc-intent-analysis-0911' does not exist or is not valid
At this point my SageMaker endpoint is live, and I am unable to debug why it is "not valid".
SageMaker Model Monitor currently only works for tabular datasets out of the box (see the documentation), hence the "not valid" error message. To use it on NLP problems, you'd have to bring your own model monitor container (BYOC). Here is an example to get started: https://aws.amazon.com/blogs/machine-learning/detect-nlp-data-drift-using-custom-amazon-sagemaker-model-monitor/,
and the associated GitHub repo is here: https://github.com/aws-samples/detecting-data-drift-in-nlp-using-amazon-sagemaker-custom-model-monitor

Amazon SageMaker training job using prebuilt Docker image

Hi, I am a newbie to AWS SageMaker. I am trying to deploy a custom time-series model on SageMaker, so I built a Docker image using the SageMaker terminal. But when I try to create the training job, it shows an error. I have been struggling with this for the past four days; could anyone please help me?
Here is my code:
lstm = sage.estimator.Estimator(image,
                                role, 1, 'ml.m4.xlarge',
                                output_path='s3://' + s3Bucket,
                                sagemaker_session=sess)
lstm.fit(upload_data)
Here is my error. I attached an ECR full-access policy to the SageMaker IAM role, and the account is in the same region.
ClientErrorTraceback (most recent call last)
<ipython-input-48-1d7f3ff70f18> in <module>()
4 sagemaker_session=sess)
5
----> 6 lstm.fit(upload_data)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name, experiment_config)
472 self._prepare_for_training(job_name=job_name)
473
--> 474 self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
475 self.jobs.append(self.latest_training_job)
476 if wait:
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in start_new(cls, estimator, inputs, experiment_config)
1036 train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
1037
-> 1038 estimator.sagemaker_session.train(**train_args)
1039
1040 return cls(estimator.sagemaker_session, estimator._current_job_name)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
588 LOGGER.info("Creating training-job with name: %s", job_name)
589 LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 590 self.sagemaker_client.create_training_job(**train_request)
591
592 def process(
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _api_call(self, *args, **kwargs)
314 "%s() only accepts keyword arguments." % py_operation_name)
315 # The "self" in this scope is referring to the BaseClient.
--> 316 return self._make_api_call(operation_name, kwargs)
317
318 _api_call.__name__ = str(py_operation_name)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _make_api_call(self, operation_name, api_params)
624 error_code = parsed_response.get("Error", {}).get("Code")
625 error_class = self.exceptions.from_code(error_code)
--> 626 raise error_class(parsed_response, operation_name)
627 else:
628 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Cannot find repository: sagemaker-model in registry ID: 534860077983 Please check if your ECR repository exists and role arn:aws:iam::534860077983:role/service-role/AmazonSageMaker-ExecutionRole-20190508T215284 has proper pull permissions for SageMaker: ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer
TL;DR: Seems like you're not providing the correct repository for the ECR image to the SageMaker estimator. Maybe the repository doesn't exist?
Also make sure that the repository's permissions are configured to allow the principal sagemaker.amazonaws.com to perform ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer.
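If in doubt, a quick way to check and fix both from Python is with boto3 (a sketch; the repository name "sagemaker-model" is taken from the error message above):

import json
import boto3

ecr = boto3.client("ecr")

# Raises RepositoryNotFoundException if the repository doesn't exist.
ecr.describe_repositories(repositoryNames=["sagemaker-model"])

# Grant SageMaker the three pull permissions named in the error message.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowSageMakerPull",
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": [
            "ecr:BatchCheckLayerAvailability",
            "ecr:BatchGetImage",
            "ecr:GetDownloadUrlForLayer",
        ],
    }],
}
ecr.set_repository_policy(repositoryName="sagemaker-model",
                          policyText=json.dumps(policy))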

Pyredis cannot connect to Digital Ocean hosted Redis (connection lost ConnectionError exception)

In [16]: r
Out[16]: Redis<ConnectionPool<Connection<host=****,port=*****,db=3>>>
In [17]: r.set('zaza', 'king')
---------------------------------------------------------------------------
ConnectionError Traceback (most recent call last)
<ipython-input-17-8126d1846970> in <module>
----> 1 r.set('zaza', 'king')
/usr/local/lib/python3.7/site-packages/redis/client.py in set(self, name, value, ex, px, nx, xx)
1517 if xx:
1518 pieces.append('XX')
-> 1519 return self.execute_command('SET', *pieces)
1520
1521 def __setitem__(self, name, value):
/usr/local/lib/python3.7/site-packages/redis/client.py in execute_command(self, *args, **options)
834 pool = self.connection_pool
835 command_name = args[0]
--> 836 conn = self.connection or pool.get_connection(command_name, **options)
837 try:
838 conn.send_command(*args)
/usr/local/lib/python3.7/site-packages/redis/connection.py in get_connection(self, command_name, *keys, **options)
1069 try:
1070 # ensure this connection is connected to Redis
-> 1071 connection.connect()
1072 # connections that the pool provides should be ready to send
1073 # a command. if not, the connection was either returned to the
/usr/local/lib/python3.7/site-packages/redis/connection.py in connect(self)
545 self._sock = sock
546 try:
--> 547 self.on_connect()
548 except RedisError:
549 # clean up after any error in on_connect
/usr/local/lib/python3.7/site-packages/redis/connection.py in on_connect(self)
615 # to check the health prior to the AUTH
616 self.send_command('AUTH', self.password, check_health=False)
--> 617 if nativestr(self.read_response()) != 'OK':
618 raise AuthenticationError('Invalid Password')
619
/usr/local/lib/python3.7/site-packages/redis/connection.py in read_response(self)
697 "Read the response from a previously sent command"
698 try:
--> 699 response = self._parser.read_response()
700 except socket.timeout:
701 self.disconnect()
/usr/local/lib/python3.7/site-packages/redis/connection.py in read_response(self)
307
308 def read_response(self):
--> 309 response = self._buffer.readline()
310 if not response:
311 raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
/usr/local/lib/python3.7/site-packages/redis/connection.py in readline(self)
239 while not data.endswith(SYM_CRLF):
240 # there's more data in the socket that we need
--> 241 self._read_from_socket()
242 buf.seek(self.bytes_read)
243 data = buf.readline()
/usr/local/lib/python3.7/site-packages/redis/connection.py in _read_from_socket(self, length, timeout, raise_on_timeout)
184 # an empty string indicates the server shutdown the socket
185 if isinstance(data, bytes) and len(data) == 0:
--> 186 raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
187 buf.write(data)
188 data_length = len(data)
This started happening after a move from a local instance to hosted Redis.
So, the problem is that the Redis connection string on these hosted solutions has to start with rediss:// (that is, Redis + SSL), as per the official documentation:
https://redislabs.com/lp/python-redis/
If you use hosted Redis from AWS or Digital Ocean, this might well happen to you :)
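A minimal sketch of the fix for the plain redis-py client (host, port, and password below are placeholders):

import redis

# Note the rediss:// scheme (Redis over SSL/TLS). With plain redis:// the
# server drops the connection during AUTH, exactly as in the traceback above.
r = redis.from_url('rediss://default:<password>@<host>:<port>/3')
r.set('zaza', 'king')  # should now succeed instead of raising ConnectionError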
If you are using Celery, you would also need to modify your app config in app.py as per
https://github.com/celery/celery/issues/5371

RuntimeError: latex was not able to process the following string: '_auto'

To whom it may concern:
I ran into a problem running Python code built by the LIGO team in a Jupyter notebook. When I run the following simple code outside of the Jupyter notebook environment, I obtain the correct output and the attached figure.
+++++++++++++++++++++++++++++++++
from gwpy.timeseries import TimeSeries
from numpy import random
series = TimeSeries(random.random(1000), sample_rate=100, unit='m')
plot = series.plot()
plot.show()
+++++++++++++++++++++++++++
[Figure: generation of random plots]
But when I applied the following similar python code in the jupyter notebook:
++++++++++++++++
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from gwpy.timeseries import TimeSeries
from numpy import random
series = TimeSeries(random.random(1000), sample_rate=100, unit='m')
plot = series.plot()
++++++++++++++++
I obtained a runtime error related to LaTeX.
Since I get the correct output when I do not use Jupyter, I suspect the problem is caused by the transformation of the plot for display in the notebook environment.
I looked into the related LaTeX runtime errors and installed the relevant packages (e.g., dvipng, texlive-latex-extra & texlive-fonts-recommended) through MacPorts, and also re-installed MacTeX on my machine, but the problem still exists.
Here are the warning & error messages I obtained in the Jupyter notebook.
RuntimeError Traceback (most recent call last)
/Users/lupin/Library/Python/2.7/lib/python/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
332 pass
333 else:
--> 334 return printer(obj)
335 # Finally look for special method names
336 method = get_real_method(obj, self.print_method)
/Users/lupin/Library/Python/2.7/lib/python/site-packages/IPython/core/pylabtools.pyc in <lambda>(fig)
247 png_formatter.for_type(Figure, lambda fig: print_figure(fig, 'png', **kwargs))
248 if 'retina' in formats or 'png2x' in formats:
--> 249 png_formatter.for_type(Figure, lambda fig: retina_figure(fig, **kwargs))
250 if 'jpg' in formats or 'jpeg' in formats:
251 jpg_formatter.for_type(Figure, lambda fig: print_figure(fig, 'jpg', **kwargs))
/Users/lupin/Library/Python/2.7/lib/python/site-packages/IPython/core/pylabtools.pyc in retina_figure(fig, **kwargs)
137 def retina_figure(fig, **kwargs):
138 """format a figure as a pixel-doubled (retina) PNG"""
--> 139 pngdata = print_figure(fig, fmt='retina', **kwargs)
140 # Make sure that retina_figure acts just like print_figure and returns
141 # None when the figure is empty.
/Users/lupin/Library/Python/2.7/lib/python/site-packages/IPython/core/pylabtools.pyc in print_figure(fig, fmt, bbox_inches, **kwargs)
129
130 bytes_io = BytesIO()
--> 131 fig.canvas.print_figure(bytes_io, **kw)
132 data = bytes_io.getvalue()
133 if fmt == 'svg':
/Library/Python/2.7/site-packages/matplotlib/backend_bases.pyc in print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, **kwargs)
2212 **kwargs)
2213 renderer = self.figure._cachedRenderer
-> 2214 bbox_inches = self.figure.get_tightbbox(renderer)
2215
2216 bbox_artists = kwargs.pop("bbox_extra_artists", None)
/Library/Python/2.7/site-packages/matplotlib/figure.pyc in get_tightbbox(self, renderer)
2188 for ax in self.axes:
2189 if ax.get_visible():
-> 2190 bb.append(ax.get_tightbbox(renderer))
2191
2192 if len(bb) == 0:
/Library/Python/2.7/site-packages/matplotlib/axes/_base.pyc in get_tightbbox(self, renderer, call_axes_locator)
4168 bb.append(self._right_title.get_window_extent(renderer))
4169
-> 4170 bb_xaxis = self.xaxis.get_tightbbox(renderer)
4171 if bb_xaxis:
4172 bb.append(bb_xaxis)
/Library/Python/2.7/site-packages/matplotlib/axis.pyc in get_tightbbox(self, renderer)
1158 for a in [self.label, self.offsetText]:
1159 if a.get_visible():
-> 1160 bb.append(a.get_window_extent(renderer))
1161
1162 bb.extend(ticklabelBoxes)
/Library/Python/2.7/site-packages/matplotlib/text.pyc in get_window_extent(self, renderer, dpi)
920 raise RuntimeError('Cannot get window extent w/o renderer')
921
--> 922 bbox, info, descent = self._get_layout(self._renderer)
923 x, y = self.get_unitless_position()
924 x, y = self.get_transform().transform_point((x, y))
/Library/Python/2.7/site-packages/matplotlib/text.pyc in _get_layout(self, renderer)
307 w, h, d = renderer.get_text_width_height_descent(clean_line,
308
self._fontproperties,
--> 309 ismath=ismath)
310 else:
311 w, h, d = 0, 0, 0
/Library/Python/2.7/site-packages/matplotlib/backends/backend_agg.pyc in get_text_width_height_descent(self, s, prop, ismath)
230 fontsize = prop.get_size_in_points()
231 w, h, d = texmanager.get_text_width_height_descent(
--> 232 s, fontsize, renderer=self)
233 return w, h, d
234
/Library/Python/2.7/site-packages/matplotlib/texmanager.pyc in get_text_width_height_descent(self, tex, fontsize, renderer)
499 else:
500 # use dviread. It sometimes returns a wrong descent.
--> 501 dvifile = self.make_dvi(tex, fontsize)
502 with dviread.Dvi(dvifile, 72 * dpi_fraction) as dvi:
503 page = next(iter(dvi))
/Library/Python/2.7/site-packages/matplotlib/texmanager.pyc in make_dvi(self, tex, fontsize)
363 self._run_checked_subprocess(
364 ["latex", "-interaction=nonstopmode", "--halt-on-error",
--> 365 texfile], tex)
366 for fname in glob.glob(basefile + '*'):
367 if not fname.endswith(('dvi', 'tex')):
/Library/Python/2.7/site-packages/matplotlib/texmanager.pyc in _run_checked_subprocess(self, command, tex)
342 prog=command[0],
343 tex=tex.encode('unicode_escape'),
--> 344 exc=exc.output.decode('utf-8')))
345 _log.debug(report)
346 return report
RuntimeError: latex was not able to process the following string:
'_auto'
Here is the full report generated by latex:
This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017/MacPorts 2017_4) (preloaded format=latex)
restricted \write18 enabled.
entering extended mode
(/Users/lupin/.matplotlib/tex.cache/1de80ee53f095837776b678f34112ba4.tex
LaTeX2e <2017-04-15>
Babel <3.10> and hyphenation patterns for 3 language(s) loaded.
(/opt/local/share/texmf-texlive/tex/latex/base/article.cls
Document Class: article 2014/09/29 v1.4h Standard LaTeX document class
(/opt/local/share/texmf-texlive/tex/latex/base/size10.clo))
(/opt/local/share/texmf-texlive/tex/latex/type1cm/type1cm.sty)
(/opt/local/share/texmf-texlive/tex/latex/base/textcomp.sty
(/opt/local/share/texmf-texlive/tex/latex/base/ts1enc.def))
(/opt/local/share/texmf-texlive/tex/latex/geometry/geometry.sty
(/opt/local/share/texmf-texlive/tex/latex/graphics/keyval.sty)
(/opt/local/share/texmf-texlive/tex/generic/oberdiek/ifpdf.sty)
(/opt/local/share/texmf-texlive/tex/generic/oberdiek/ifvtex.sty)
(/opt/local/share/texmf-texlive/tex/generic/ifxetex/ifxetex.sty)
Package geometry Warning: Over-specification in `h'-direction.
`width' (5058.9pt) is ignored.
Package geometry Warning: Over-specification in `v'-direction.
`height' (5058.9pt) is ignored.
) (./1de80ee53f095837776b678f34112ba4.aux)
(/opt/local/share/texmf-texlive/tex/latex/base/ts1cmr.fd)
*geometry* driver: auto-detecting
*geometry* detected driver: dvips
! Missing $ inserted.
<inserted text>
$
l.13 \fontsize{20.000000}{25.000000}{\rmfamily _
auto}
No pages of output.
Transcript written on 1de80ee53f095837776b678f34112ba4.log.
Can anyone provide some suggestions to solve this issue?
In LaTeX, '_' denotes a subscript and is only valid inside math mode ('$...$'), so the interpreter can't process it in plain text; replacing '_auto' with '-auto' may help :)
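For completeness, two hedged workarounds along those lines, assuming the string is being fed to matplotlib's external LaTeX renderer (as the traceback suggests):

import matplotlib

# Workaround 1: don't route text through the external latex binary at all;
# matplotlib's built-in mathtext handles a bare underscore without error.
matplotlib.rcParams['text.usetex'] = False

# Workaround 2: keep usetex, but escape the underscore on whichever label
# carries the '_auto' string before the figure is rendered, e.g.:
# plot.gca().set_ylabel(r'\_auto')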

H2O machine learning platform for Python incurs EnvironmentError while building models

I am new to the H2O machine learning platform and am having the issue below while trying to build models.
When I try to build 5 GBM models with a not-so-large dataset, I get the following error:
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [##################################################] 100%
gbm Model Build Progress: [################# ] 34%
EnvironmentErrorTraceback (most recent call last)
<ipython-input-22-e74b34df2f1a> in <module>()
13 params_model={'x': features_pca_all, 'y': response, 'training_frame': train_holdout_pca_hex, 'validation_frame': validation_holdout_pca_hex, 'ntrees': ntree, 'max_depth':depth, 'min_rows': min_rows, 'learn_rate': 0.005}
14
---> 15 gbm_model=h2o.gbm(**params_model)
16
17 #store model
C:\Anaconda2\lib\site-packages\h2o\h2o.pyc in gbm(x, y, validation_x, validation_y, training_frame, model_id, distribution, tweedie_power, ntrees, max_depth, min_rows, learn_rate, nbins, nbins_cats, validation_frame, balance_classes, max_after_balance_size, seed, build_tree_one_node, nfolds, fold_column, fold_assignment, keep_cross_validation_predictions, score_each_iteration, offset_column, weights_column, do_future, checkpoint)
1058 parms = {k:v for k,v in locals().items() if k in ["training_frame", "validation_frame", "validation_x", "validation_y", "offset_column", "weights_column", "fold_column"] or v is not None}
1059 parms["algo"]="gbm"
-> 1060 return h2o_model_builder.supervised(parms)
1061
1062
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in supervised(kwargs)
28 algo = kwargs["algo"]
29 parms={k:v for k,v in kwargs.items() if (k not in ["x","y","validation_x","validation_y","algo"] and v is not None) or k=="validation_frame"}
---> 30 return supervised_model_build(x,y,vx,vy,algo,offsets,weights,fold_column,parms)
31
32 def unsupervised_model_build(x,validation_x,algo_url,kwargs): return _model_build(x,None,validation_x,None,algo_url,None,None,None,kwargs)
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in supervised_model_build(x, y, vx, vy, algo, offsets, weights, fold_column, kwargs)
16 if not is_auto_encoder and y is None: raise ValueError("Missing response")
17 if vx is not None and vy is None: raise ValueError("Missing response validating a supervised model")
---> 18 return _model_build(x,y,vx,vy,algo,offsets,weights,fold_column,kwargs)
19
20 def supervised(kwargs):
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in _model_build(x, y, vx, vy, algo, offsets, weights, fold_column, kwargs)
86 do_future = kwargs.pop("do_future") if "do_future" in kwargs else False
87 future_model = H2OModelFuture(H2OJob(H2OConnection.post_json("ModelBuilders/"+algo, **kwargs), job_type=(algo+" Model Build")), x)
---> 88 return future_model if do_future else _resolve_model(future_model, **kwargs)
89
90 def _resolve_model(future_model, **kwargs):
C:\Anaconda2\lib\site-packages\h2o\h2o_model_builder.pyc in _resolve_model(future_model, **kwargs)
89
90 def _resolve_model(future_model, **kwargs):
---> 91 future_model.poll()
92 if '_rest_version' in kwargs.keys(): model_json = H2OConnection.get_json("Models/"+future_model.job.dest_key, _rest_version=kwargs['_rest_version'])["models"][0]
93 else: model_json = H2OConnection.get_json("Models/"+future_model.job.dest_key)["models"][0]
C:\Anaconda2\lib\site-packages\h2o\model\model_future.pyc in poll(self)
8
9 def poll(self):
---> 10 self.job.poll()
11 self.x = None
C:\Anaconda2\lib\site-packages\h2o\job.pyc in poll(self)
39 time.sleep(sleep)
40 if sleep < 1.0: sleep += 0.1
---> 41 self._refresh_job_view()
42 running = self._is_running()
43 self._update_progress()
C:\Anaconda2\lib\site-packages\h2o\job.pyc in _refresh_job_view(self)
52
53 def _refresh_job_view(self):
---> 54 jobs = H2OConnection.get_json(url_suffix="Jobs/" + self.job_key)
55 self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0]
56 self.status = self.job["status"]
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in get_json(url_suffix, **kwargs)
410 if __H2OCONN__ is None:
411 raise ValueError("No h2o connection. Did you run `h2o.init()` ?")
--> 412 return __H2OCONN__._rest_json(url_suffix, "GET", None, **kwargs)
413
414 #staticmethod
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in _rest_json(self, url_suffix, method, file_upload_info, **kwargs)
419
420 def _rest_json(self, url_suffix, method, file_upload_info, **kwargs):
--> 421 raw_txt = self._do_raw_rest(url_suffix, method, file_upload_info, **kwargs)
422 return self._process_tables(raw_txt.json())
423
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in _do_raw_rest(self, url_suffix, method, file_upload_info, **kwargs)
476
477 begin_time_seconds = time.time()
--> 478 http_result = self._attempt_rest(url, method, post_body, file_upload_info)
479 end_time_seconds = time.time()
480 elapsed_time_seconds = end_time_seconds - begin_time_seconds
C:\Anaconda2\lib\site-packages\h2o\connection.pyc in _attempt_rest(self, url, method, post_body, file_upload_info)
526
527 except requests.ConnectionError as e:
--> 528 raise EnvironmentError("h2o-py encountered an unexpected HTTP error:\n {}".format(e))
529
530 return http_result
EnvironmentError: h2o-py encountered an unexpected HTTP error:
('Connection aborted.', BadStatusLine("''",))
My hunch is that the cluster has only around 247.5 MB of memory, which is not enough to handle the model building, and hence the connection to H2O was aborted. Here is the code I used to initialize H2O:
# initialization of the h2o module
import subprocess as sp
import sys
import os.path as p
import h2o

# path of the h2o jar file
h2o_path = p.join(sys.prefix, "h2o_jar", "h2o.jar")

# subprocess to launch h2o
# the command can be further modified to include virtual machine parameters
sp.Popen("java -jar " + h2o_path)

# h2o.init() call to verify that the h2o launch is successful
h2o.init(ip="localhost", port=54321, size=1, start_h2o=False, enable_assertions=False,
         license=None, max_mem_size_GB=4, min_mem_size_GB=4, ice_root=None)
and here is the returned status table:
Any ideas on the above would be greatly appreciated!!
Just to close out this question, I'll restate the solution mentioned in the comments above. The user was able to resolve the issue by starting H2O from the command line with 1GB of memory using java -jar -Xmx1g h2o.jar, and then connected to the existing H2O server in Python using h2o.init().
It's not clear to me why h2o.init() was not creating the correct size cluster using the max_mem_size_GB argument. Regardless, this argument has been deprecated recently and replaced by another argument, max_mem_size, so it may no longer be an issue.
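In code, the working pattern looks like this (a sketch of the fix described above):

# First, launch H2O from the command line with an explicit 1 GB heap:
#   java -Xmx1g -jar h2o.jar
import h2o

# Called with no memory/launch arguments, h2o.init() attaches to the
# already-running server on localhost:54321 instead of starting a new one.
h2o.init(ip="localhost", port=54321)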