I am receiving the following error when I try to train an XGBoost model and have no idea how to fix it. Any help, please?
UnexpectedStatusException: Error for Training job sagemaker-xgboost-2022-08-22-21-37-39-774: Failed. Reason: AlgorithmError: framework error:
Traceback (most recent call last):
File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
entrypoint()
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
train(framework.training_env())
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
run_algorithm_mode()
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
checkpoint_config=checkpoint_config
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 110, in sagemaker_train
validated_train_config = hyperparameters.validate(train_config)
File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 270, in validate
raise exc.UserError("Missing required hyperparameter: {}".format(hp)
My full notebook is too large to post here, but below I have also added an image of the code right before the training
SageMaker's implementation of XGBoost requires one hyperparameter; all other parameters are optional. It looks like you are not passing the required parameter, num_round.
Try adding this to your dictionary:
hyperparameters = {
"num_round": 10
}
Per AWS docs, num_round: The number of rounds to run the training. Valid values: integer
For more information on SageMaker's hyperparameter requirements, refer to: SageMaker Dev Guide: XGBoost Hyperparameters
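A quick local check before submitting the job can catch this earlier. The sketch below is only an illustration: the helper name missing_hyperparameters and the example values are made up here, and the required list contains only num_round, as described above.

```python
# Hypothetical pre-flight check; not part of the SageMaker SDK.
REQUIRED_HYPERPARAMETERS = ("num_round",)


def missing_hyperparameters(hyperparameters, required=REQUIRED_HYPERPARAMETERS):
    """Return the required hyperparameter names absent from the dict."""
    return [name for name in required if name not in hyperparameters]


# Illustrative values only; num_round is deliberately left out at first.
hyperparameters = {"max_depth": 5, "eta": 0.2, "objective": "binary:logistic"}
print(missing_hyperparameters(hyperparameters))

hyperparameters["num_round"] = 10
print(missing_hyperparameters(hyperparameters))
```

If the first call reports num_round as missing, the training job would most likely fail with the error shown in the question.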
This question already has answers here:
Google App Engine dev_appserver.py: watcher_ignore_re flag "is not JSON serializable"
(2 answers)
Closed 1 year ago.
I want to first point out that I tried every answer mentioned in this thread. None of them seems to fix the issue, and that question dates back a while.
Issue
I want to run dev_appserver.py while adding certain files to the ignore list for the watcher; this means that skip_files is out of the question, as that option stops the server from reading those files at all.
When I run dev_appserver.py without the --watcher_ignore_re flag, everything works fine except for the file watch. When I do run it with the flag, I get the following error:
INFO 2021-11-02 13:54:50,100 devappserver2.py:309] Skipping SDK update check.
Traceback (most recent call last):
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 109, in <module>
_run_file(__file__, globals())
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 103, in _run_file
_execfile(_PATHS.script_file(script_name), globals_)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 83, in _execfile
execfile(fn, scope)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 635, in <module>
main()
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 623, in main
dev_server.start(options)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 356, in start
java_major_version=self.java_major_version
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/metrics.py", line 185, in Start
self._cmd_args = json.dumps(vars(cmd_args)) if cmd_args else None
File "/usr/lib/python2.7/json/__init__.py", line 244, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <_sre.SRE_Pattern object at 0x7f720c625240> is not JSON serializable
I have tried with different versions without success:
GCloud 361.0.0/362.0.0/357.0.0/240.0.0/220.0.0/200.0.0
Python 2.7.18/3.9.7
I have also tried different string values on the watcher flag:
""
''
".css"
"*.css"
".*\css"
'.css'
'*.css'
'.*\css'
etc.
I know the issue is therefore not with how the string is formulated (at least it doesn't seem like it). And the different versions don't help either.
My work colleagues don't have this issue and are using the different versions I listed here on macOS. I am currently on Arch Linux, but I have run into the exact same issue on my Mac as well.
I have also added export CLOUDSDK_PYTHON=python2.7 in my ~/.zshrc file.
So turns out this was a duplicate after all. I missed one comment that had the solution. This one: https://stackoverflow.com/a/52238832/9706597
It looks like it is an issue with the google analytics code built into dev_appserver2 (google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\devappserver2.py on or around line 316). It wants to send all of your command line options to google analytics. If you remove the analytics client id by adding the command line option --google_analytics_client_id= (note: '=' without any following value) the appserver won't call the google analytics code where it is trying to JSON serialize an SRE object and failing.
In short, for anyone else coming through here, simply add this option:
--google_analytics_client_id= with no value.
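The underlying failure can be reproduced outside the SDK. The sketch below is not the dev_appserver code: the dictionary is a stand-in for vars(cmd_args), and the port entry is an illustrative assumption. It only shows that json.dumps rejects a compiled regular expression, while the pattern's source string serializes fine.

```python
import json
import re

# Stand-in for vars(cmd_args); "port" is an illustrative extra option.
cmd_args = {"watcher_ignore_re": re.compile(r".*\.css"), "port": 8080}


def try_dump(options):
    """Return the JSON text, or the TypeError message if serialization fails."""
    try:
        return json.dumps(options)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# A compiled pattern is not JSON serializable, so this reports a TypeError.
print(try_dump(cmd_args))

# Replacing compiled patterns with their source strings avoids the error.
safe_args = {
    key: (value.pattern if hasattr(value, "pattern") else value)
    for key, value in cmd_args.items()
}
print(try_dump(safe_args))
```

Passing an empty --google_analytics_client_id sidesteps this because, per the answer above, the analytics code that performs the serialization is then never called.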
I've been trying to get started with Google Cloud's AI Platform. I have been following this tutorial: https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu_1
My own models are written in PyTorch, hence the choice to get started with PyTorch. Why I want to use the GPU I guess goes without saying.
I've tried to follow the instructions to the letter, and I've used the provided sample code. Yet I still run into errors. I can create a job without problems, but the job ends up failing with the following error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
I'm relatively new to PyTorch and completely new to GCP, so I have no idea how I would go about fixing this and any help would be much appreciated.
Full trace:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 123, in <module>
main()
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 119, in main
experiment.run(args)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 132, in run
train(sequential_model, train_loader, criterion, optimizer, epoch)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 37, in train
for batch_index, data in enumerate(train_loader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 347, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 387, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: CUDA error: no kernel image is available for execution on the device
I have managed to reproduce the same issue with pytorch-gpu.1-6.
As a workaround, it works with the pytorch-gpu.1-4
I suppose that something in the code changed as of 1.6, but I am not sure what, as I am not familiar with PyTorch.
Besides, up to version 1.7, it seems that the code to select the device hasn’t changed compared to our sample code.
Additionally, the Nvidia Tesla K80 in our basic GPU tier seems to be supported by CUDA 9.0, 9.2, 10.0, or 11.
In any case, I have created a public issue on issuetracker so the AI Platform engineering team can investigate it.
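One way to narrow this down is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. This is a diagnostic sketch only. It assumes the error comes from such a mismatch, which is not confirmed here. The helper binary_compatible is hypothetical, it ignores PTX (compute_XY) entries, and torch.cuda.get_arch_list may be missing in older PyTorch versions, so the script guards for that.

```python
def binary_compatible(capability, arch_list):
    """True if some sm_XY entry has the same major version as the device
    and a minor version that is not newer than the device's."""
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # PTX entries such as compute_70 are ignored in this sketch
        digits = arch[3:]
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    get_arch_list = getattr(getattr(torch, "cuda", None), "get_arch_list", None)
    if torch is not None and torch.cuda.is_available() and get_arch_list:
        capability = torch.cuda.get_device_capability(0)
        archs = get_arch_list()
        print("device capability:", capability)
        print("compiled architectures:", archs)
        print("binary compatible:", binary_compatible(capability, archs))
    else:
        print("PyTorch with CUDA and get_arch_list is not available here")
```

The Tesla K80 has compute capability 3.7, so a build whose list contains no sm_3x entry would be a plausible explanation for the error.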
I have a problem: I don't understand the error I get when trying to list Kaggle datasets in Google Colab.
Notebook config: Python 3.x, no hardware accelerator.
#to upload my kaggle token
from google.colab import files
files.upload()
#setting up the token
!pip install --upgrade kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
#and taking a look at datasets
!kaggle datasets list
Traceback (most recent call last):
File "/usr/local/bin/kaggle", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/kaggle/cli.py", line 51, in main
out = args.func(**command_args)
File "/usr/local/lib/python3.6/dist-packages/kaggle/api/kaggle_api_extended.py", line 940, in dataset_list_cli
max_size, min_size)
File "/usr/local/lib/python3.6/dist-packages/kaggle/api/kaggle_api_extended.py", line 905, in dataset_list
return [Dataset(d) for d in datasets_list_result]
File "/usr/local/lib/python3.6/dist-packages/kaggle/api/kaggle_api_extended.py", line 905, in <listcomp>
return [Dataset(d) for d in datasets_list_result]
File "/usr/local/lib/python3.6/dist-packages/kaggle/models/kaggle_models_extended.py", line 67, in __init__
self.size = File.get_size(self.totalBytes)
File "/usr/local/lib/python3.6/dist-packages/kaggle/models/kaggle_models_extended.py", line 107, in get_size
while size >= 1024 and suffix_index < 4:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
I would like to understand what happened and how to fix it. Thanks in advance.
jet.
I am encountering this problem as well. I noticed that the following call works:
kaggle datasets list --min-size 1
Note that you will need version 1.5.6. I had 1.5.4 on a Colab instance, and that version didn't support that argument.
The problem seems to be that the bigquery/crypto-litecoin dataset has no data. As a consequence, it looks like totalBytes is None in Dataset.
I've opened an issue on GitHub and will create a PR. If you want a temporary workaround, you can grab the file from my fork. You can use your traceback to determine where to put the file. Alternatively, just use --min-size 1 so that datasets with no data files are ignored.
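For illustration, a None guard of the kind described above could look like the sketch below. This is not the actual Kaggle code or the patch from the fork. It is a standalone function modeled on the loop visible in the traceback, and treating a missing size as 0 bytes is an assumption.

```python
def get_size(size, precision=0):
    """Format a byte count as a human-readable string, tolerating None."""
    suffixes = ["B", "KB", "MB", "GB", "TB"]
    if size is None:
        # Assumption: a dataset with no data files is reported as 0 bytes.
        size = 0
    suffix_index = 0
    while size >= 1024 and suffix_index < 4:
        suffix_index += 1
        size /= 1024.0
    return "%.*f%s" % (precision, size, suffixes[suffix_index])


print(get_size(None))
print(get_size(2048))
print(get_size(1536, precision=1))
```

Without the guard, the comparison size >= 1024 is what raises the TypeError when totalBytes is None.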
I ran into the same problem.
Generate the Kaggle JSON API file: click the widget/icon in the top right corner -> click "Account" -> scroll down to the "API" subsection -> click "Expire API Token" -> click "Create New API Token".
In Google Colab, upload your JSON file.
Run the following code:
#first upload the Kaggle API file "kaggle.json"
import os
#this path contains the json file
os.environ['KAGGLE_CONFIG_DIR'] = "/content"
#find the competition or dataset under Data, like this:
!kaggle competitions download -c jane-street-market-prediction
This worked for me after a lot of banging my head against the wall.
If you still get errors, you may need to link your Colab and Kaggle accounts. You can do this in the account settings section of Kaggle.
We recently added X-Ray to our code by having:
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all
patch_all()
This runs fine on AWS Lambda, but when running locally we got the following exception while calling Elasticsearch:
ERROR:aws_xray_sdk.core.context:cannot find the current segment/subsegment, please make sure you have a segment open
queryCustomers - DEBUG - Caught exception for <function search_customer at 0x10bfcf0d0>
Traceback (most recent call last):
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/chalice/app.py", line 659, in _get_view_function_response
response = view_function(**function_args)
File "/Users/jameslin/projects/test-project/src/app.py", line 57, in search_customer
return query[0:size].execute().to_dict()['hits']['hits']
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch_dsl/search.py", line 639, in execute
**self._params
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 632, in search
doc_type, '_search'), params=params, body=body)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/connection/http_requests.py", line 71, in perform_request
prepared_request = self.session.prepare_request(request)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/ext/requests/patch.py", line 38, in _inject_header
inject_trace_header(headers, xray_recorder.current_subsegment())
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/recorder.py", line 251, in current_subsegment
entity = self.get_trace_entity()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/recorder.py", line 316, in get_trace_entity
return self.context.get_trace_entity()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/context.py", line 93, in get_trace_entity
return self.handle_context_missing()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/context.py", line 118, in handle_context_missing
raise SegmentNotFoundException(MISSING_SEGMENT_MSG)
aws_xray_sdk.core.exceptions.exceptions.SegmentNotFoundException: cannot find the current segment/subsegment, please make sure you have a segment open
I have no idea what this means or how to get rid of it. My Google searches gave few relevant results. I also tried running the X-Ray daemon locally, but I still have the same problem:
./xray_mac -o -n ap-southeast-2
When your code is running on AWS Lambda with tracing enabled, the Lambda container generates a segment representing the whole function invocation. It also sets the context as an environment variable so the SDK can link any subsegment created inside the function back to the parent segment.
If you run the same code locally, the SDK still tries to create subsegments for the actual function code, but it can't find any context, so it throws the error you posted.
To solve this, you will need to set up some environment variables so that the SDK has the same information as if it were running in an actual Lambda container.
Make sure the SDK thinks it is running on a Lambda container by setting LAMBDA_TASK_ROOT with whatever value you'd like (only the presence of the key matters). You can see the source code here: https://github.com/aws/aws-xray-sdk-python/blob/master/aws_xray_sdk/core/lambda_launcher.py
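Set the environment variable named by LAMBDA_TRACE_HEADER_KEY so the function has a tracing context (in the source file linked above, that constant appears to be defined as _X_AMZN_TRACE_ID). The value must be a trace header; you can see more details here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-x-ray.html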
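The two steps above can be sketched as follows. This is a sketch under assumptions, not official guidance. The helper names are hypothetical. The variable name _X_AMZN_TRACE_ID is taken from the SDK source linked above and should be checked against your SDK version. The code probably needs to run before aws_xray_sdk is imported.

```python
import os
import secrets
import time


def make_trace_header(sampled=True):
    """Build a trace header of the form Root=1-<time>-<id>;Parent=<id>;Sampled=<0|1>."""
    epoch_hex = format(int(time.time()), "08x")
    trace_id = "1-{}-{}".format(epoch_hex, secrets.token_hex(12))
    parent_id = secrets.token_hex(8)
    return "Root={};Parent={};Sampled={}".format(
        trace_id, parent_id, 1 if sampled else 0
    )


def fake_lambda_environment(environ=os.environ):
    """Set the variables discussed above unless they are already present."""
    # Any non-empty value should do; "/tmp" is an arbitrary choice.
    environ.setdefault("LAMBDA_TASK_ROOT", "/tmp")
    # Assumed variable name, see the note above.
    environ.setdefault("_X_AMZN_TRACE_ID", make_trace_header())
    return environ


# Applied to a plain dict here so the real environment is left untouched.
print(fake_lambda_environment({}))
```

For local runs, calling fake_lambda_environment() with no arguments at the very top of the entry point would apply the values to os.environ.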
This workaround is not ideal, as it requires extra code changes on the user's side. We would like to provide a better experience for testing X-Ray-instrumented Lambda functions locally. Would you mind sharing more details about how you do local testing and how you expect X-Ray tracing to work in that environment, so we can improve things for your use case?
I need help resolving this issue. I am unable to add new datastreams to a few specific objects in the Fedora repository, and I have no clue what is actually wrong with these objects. Here is the error trace I get:
HTTP code=500, Reason=Internal Server Error, body=javax.ws.rs.WebApplicationException: org.fcrepo.server.errors.ObjectNotFoundException: Error creating replication job: The requested object doesn't exist in the registry.
Traceback (most recent call last):
File "/opt/2.0/flx/pylons/flx/compress_upload_images.py", line 159, in run
obj.addDataStream(cDSName, fc.getDSXml(r.type.name), label=label, mimeType=h.safe_decode('%s' % mimeType), controlGroup=controlGroup, logMessage=h.safe_decode('Storing compressed %s' % r.type.name))
File "/usr/local/lib/python2.6/dist-packages/fcrepo/object.py", line 64, in addDataStream
self.client.addDatastream(self.pid, dsid, body, **params)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/client.py", line 119, in addDatastream
response = request.submit(body, **params)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/wadl.py", line 81, in submit
method=self.method.name)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/connection.py", line 80, in open
return check_response_status(self.conn.getresponse())
File "/usr/local/lib/python2.6/dist-packages/fcrepo/connection.py", line 107, in check_response_status
raise ex
FedoraConnectionException: HTTP code=500, Reason=Internal Server Error, body=javax.ws.rs.WebApplicationException: org.fcrepo.server.errors.ObjectNotFoundException: Error creating replication job: The requested object doesn't exist in the registry.
I was finally able to fix this issue. The problem was that the database used by Fedora Commons was inconsistent: for the image objects that were failing, the entries were not present in the database. The inconsistency was caused by an earlier migration in which some rows were missed. We copied the missing data from the old database to the new one, and it worked this time!