RuntimeError 'CUDA error' when trying the Google Cloud AI Platform Tutorial for PyTorch - google-cloud-platform

I've been trying to get started with Google Cloud's AI Platform. I have been following this tutorial: https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu_1
My own models are written in PyTorch, hence the choice to get started with PyTorch. Why I want to use the GPU I guess goes without saying.
I've tried to follow the instructions to the letter, and I've used the provided sample code. Yet I still run into errors. I can create a job without problems, but the job ends up failing with the following error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
I'm relatively new to PyTorch and completely new to GCP, so I have no idea how I would go about fixing this and any help would be much appreciated.
Full trace:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 123, in <module>
main()
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 119, in main
experiment.run(args)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 132, in run
train(sequential_model, train_loader, criterion, optimizer, epoch)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 37, in train
for batch_index, data in enumerate(train_loader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 347, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 387, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: CUDA error: no kernel image is available for execution on the device

I have managed to reproduce the same issue with pytorch-gpu.1-6.
As a workaround, it works with the pytorch-gpu.1-4 container.
I suppose that something changed in PyTorch starting with 1.6, but I'm not sure what, as I am not familiar with PyTorch.
Moreover, up to version 1.7 the device-selection code doesn't seem to have changed compared to our sample code.
Additionally, the NVIDIA Tesla K80 offered by our basic GPU tier appears to be supported by CUDA 9.0, 9.2, 10.0, or 11.
In any case, I have created a public issue on issuetracker so the AI Platform engineering team can investigate it.
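The K80 angle can be checked from inside the container: PyTorch reports the compute capabilities its binaries were compiled for, and a K80 is capability 3.7 (sm_37). The helper below is a hypothetical sketch in pure Python, with example arch lists hard-coded so it runs anywhere; on a real machine you would feed it torch.cuda.get_arch_list() and torch.cuda.get_device_capability(0).

```python
def kernel_supported(arch_list, capability):
    """Check whether a PyTorch build's compiled arch list covers a GPU.

    arch_list:  e.g. torch.cuda.get_arch_list() -> ['sm_37', 'sm_50', ...]
    capability: e.g. torch.cuda.get_device_capability(0) -> (3, 7)
    """
    major, minor = capability
    return "sm_{}{}".format(major, minor) in arch_list

# A Tesla K80 has compute capability 3.7; if 'sm_37' is absent from the
# build's arch list, no kernels can run on it -> the error above.
print(kernel_supported(["sm_37", "sm_50", "sm_60"], (3, 7)))  # True
print(kernel_supported(["sm_50", "sm_60", "sm_70"], (3, 7)))  # False
```

If sm_37 is missing from the arch list of the newer container's PyTorch build, that would explain why the 1.4 container works and 1.6 does not.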

Related

AWS Sagemaker Error for Training job - Algorithm Error

I am receiving the following error when I try to train an XGBoost model, and I have no idea how to fix it. Any help please?
UnexpectedStatusException: Error for Training job sagemaker-xgboost-2022-08-22-21-37-39-774: Failed. Reason: AlgorithmError: framework error:
Traceback (most recent call last):
File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
entrypoint()
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
train(framework.training_env())
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
run_algorithm_mode()
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
checkpoint_config=checkpoint_config
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 110, in sagemaker_train
validated_train_config = hyperparameters.validate(train_config)
File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 270, in validate
raise exc.UserError("Missing required hyperparameter: {}".format(hp)
My full notebook is too large to post here, but below I have also added an image of the code right before the training
SageMaker's implementation of XGBoost requires one hyperparameter; all the others are optional. It looks like you are not passing that required parameter, num_round.
Try adding this to your dictionary:
hyperparameters = {
"num_round": 10
}
Per AWS docs, num_round: The number of rounds to run the training. Valid values: integer
For more information on SageMaker's hyperparameters requirements refer to: SageMaker Dev Guide: XGBoost Hyperparameters
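For illustration, the gist of the failing check can be sketched in a few lines of plain Python; this is my own simplified mock, not the actual sagemaker_algorithm_toolkit code:

```python
# num_round is the only required XGBoost hyperparameter in SageMaker's
# algorithm mode; everything else is optional.
REQUIRED_HYPERPARAMETERS = {"num_round"}

def validate(train_config):
    """Raise if a required hyperparameter is missing, else return the config."""
    missing = REQUIRED_HYPERPARAMETERS - train_config.keys()
    if missing:
        raise ValueError(
            "Missing required hyperparameter: {}".format(", ".join(sorted(missing)))
        )
    return train_config

validate({"num_round": 10, "max_depth": 5})  # passes
```

Passing the hyperparameters dictionary shown above to the estimator satisfies this check.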

How can I fix Google App Engine dev_appserver.py: watcher_ignore_re flag "is not JSON serializable" error? [duplicate]

This question already has answers here:
Google App Engine dev_appserver.py: watcher_ignore_re flag "is not JSON serializable"
(2 answers)
Closed 1 year ago.
I want to first point out that I tried every answer mentioned in this thread. None of them seems to fix the issue, and the question already dates back a while.
Issue
I want to run dev_appserver.py while adding certain files to the watcher's ignore list; this means skip_files is out of the question, as that option stops the server from reading the files altogether.
When I run dev_appserver.py without the --watcher_ignore_re flag, everything works fine except for the file watch. When I do run it with the flag, I get the following error:
INFO 2021-11-02 13:54:50,100 devappserver2.py:309] Skipping SDK update check.
Traceback (most recent call last):
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 109, in <module>
_run_file(__file__, globals())
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 103, in _run_file
_execfile(_PATHS.script_file(script_name), globals_)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 83, in _execfile
execfile(fn, scope)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 635, in <module>
main()
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 623, in main
dev_server.start(options)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 356, in start
java_major_version=self.java_major_version
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/metrics.py", line 185, in Start
self._cmd_args = json.dumps(vars(cmd_args)) if cmd_args else None
File "/usr/lib/python2.7/json/__init__.py", line 244, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <_sre.SRE_Pattern object at 0x7f720c625240> is not JSON serializable
I have tried with different versions without success:
GCloud 361.0.0/362.0.0/357.0.0/240.0.0/220.0.0/200.0.0
Python 2.7.18/3.9.7
I have also tried different string values on the watcher flag:
""
''
".css"
"*.css"
".*\css"
'.css'
'*.css'
'.*\css'
etc.
So I know the issue is not with how the string is formulated (at least it doesn't seem like it), and the different versions don't help either.
My colleagues on macOS don't have this issue, even across the different versions I listed here. I am currently on Arch Linux, but I have run into the exact same issue on my Mac as well.
I have also added export CLOUDSDK_PYTHON=python2.7 in my ~/.zshrc file.
So it turns out this was a duplicate after all. I missed one comment that had the solution, this one: https://stackoverflow.com/a/52238832/9706597
It looks like it is an issue with the Google Analytics code built into dev_appserver2 (google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\devappserver2.py, on or around line 316), which wants to send all of your command-line options to Google Analytics. If you remove the analytics client id by adding the command-line option --google_analytics_client_id= (note: '=' without any following value), the appserver won't call the Google Analytics code where it tries, and fails, to JSON-serialize an SRE object.
In short, for anyone else coming through here: simply add the option --google_analytics_client_id= with no value.
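The root cause is easy to reproduce outside the dev server: the --watcher_ignore_re value is stored in the parsed options as a compiled regex, and json.dumps has no encoder for that type. A minimal stand-alone demonstration (the dict key mirrors the flag name; the pattern is just an example):

```python
import json
import re

# The dev server stores the flag's value as a compiled regex object.
pattern = re.compile(r".*\.css")

# Serializing the options dict then fails, exactly as in the traceback above.
try:
    json.dumps({"watcher_ignore_re": pattern})
except TypeError as err:
    print(err)  # the same "is not JSON serializable" TypeError the dev server hits
```

With the workaround applied, the invocation looks something like dev_appserver.py --watcher_ignore_re=".*\.css" --google_analytics_client_id= app.yaml (path and pattern are illustrative).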

AWS X-Ray ERROR:aws_xray_sdk.core.context:cannot find the current segment/subsegment

We recently added X-Ray to our code by having:
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all
patch_all()
While this runs fine on AWS Lambda, when we try to run it locally and call Elasticsearch we get the following exception:
ERROR:aws_xray_sdk.core.context:cannot find the current segment/subsegment, please make sure you have a segment open
queryCustomers - DEBUG - Caught exception for <function search_customer at 0x10bfcf0d0>
Traceback (most recent call last):
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/chalice/app.py", line 659, in _get_view_function_response
response = view_function(**function_args)
File "/Users/jameslin/projects/test-project/src/app.py", line 57, in search_customer
return query[0:size].execute().to_dict()['hits']['hits']
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch_dsl/search.py", line 639, in execute
**self._params
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 632, in search
doc_type, '_search'), params=params, body=body)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/connection/http_requests.py", line 71, in perform_request
prepared_request = self.session.prepare_request(request)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/ext/requests/patch.py", line 38, in _inject_header
inject_trace_header(headers, xray_recorder.current_subsegment())
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/recorder.py", line 251, in current_subsegment
entity = self.get_trace_entity()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/recorder.py", line 316, in get_trace_entity
return self.context.get_trace_entity()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/context.py", line 93, in get_trace_entity
return self.handle_context_missing()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/context.py", line 118, in handle_context_missing
raise SegmentNotFoundException(MISSING_SEGMENT_MSG)
aws_xray_sdk.core.exceptions.exceptions.SegmentNotFoundException: cannot find the current segment/subsegment, please make sure you have a segment open
I have no idea what this means or how to get rid of it. My Google searches turn up few relevant results, and I also tried running the X-Ray daemon locally, but I'm still having the same problem:
./xray_mac -o -n ap-southeast-2
When your code is running on AWS Lambda with tracing enabled, Lambda container will generate a segment representing the whole function invocation. It also sets the context as the environment variable so the SDK can link any subsegment created inside the function back to the parent segment.
If you run the same code locally the SDK still tries to create subsegments for the actual function code but it can't find any context, thus throwing the error you posted.
To solve this you will need to set up some environment variables so the SDK has the same information as if it were running on an actual Lambda container.
Make sure the SDK thinks it is running on a Lambda container by setting LAMBDA_TASK_ROOT with whatever value you'd like (only the presence of the key matters). You can see the source code here: https://github.com/aws/aws-xray-sdk-python/blob/master/aws_xray_sdk/core/lambda_launcher.py
Setting LAMBDA_TRACE_HEADER_KEY so the function has a tracing context. The value must be a trace header and you can see more details here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-x-ray.html
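A minimal sketch of those two steps, assuming the trace-header environment variable is _X_AMZN_TRACE_ID (the value of LAMBDA_TRACE_HEADER_KEY in the SDK source linked above); both values below are placeholders:

```python
import os

# Set these BEFORE importing aws_xray_sdk / calling patch_all(), so the SDK's
# lambda_launcher picks them up at import time.
os.environ["LAMBDA_TASK_ROOT"] = "/tmp"  # presence alone flips the SDK into Lambda mode
os.environ["_X_AMZN_TRACE_ID"] = (
    # Made-up example in the documented Root=...;Parent=...;Sampled=... format
    "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
)
```

With both variables present, the SDK links subsegments to this facade segment instead of raising SegmentNotFoundException.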
This workaround is not ideal as it requires extra code changes from user side. We would like to provide better customer experience for testing X-Ray instrumented Lambda function locally. Do you mind sharing more details about how you are doing local testing and how you expect X-Ray tracing works in such testing environment, so we can have better improvement for your use case?

Unable to add new datastreams to some specific objects in Fedora repository (Fedora commons)

I need help in resolving this issue. I am unable to add new datastreams to a few specific objects in the Fedora repository, but I have no clue what's really wrong with these objects. Here is the error trace I get:
HTTP code=500, Reason=Internal Server Error, body=javax.ws.rs.WebApplicationException: org.fcrepo.server.errors.ObjectNotFoundException: Error creating replication job: The requested object doesn't exist in the registry.
Traceback (most recent call last):
File "/opt/2.0/flx/pylons/flx/compress_upload_images.py", line 159, in run
obj.addDataStream(cDSName, fc.getDSXml(r.type.name), label=label, mimeType=h.safe_decode('%s' % mimeType), controlGroup=controlGroup, logMessage=h.safe_decode('Storing compressed %s' % r.type.name))
File "/usr/local/lib/python2.6/dist-packages/fcrepo/object.py", line 64, in addDataStream
self.client.addDatastream(self.pid, dsid, body, **params)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/client.py", line 119, in addDatastream
response = request.submit(body, **params)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/wadl.py", line 81, in submit
method=self.method.name)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/connection.py", line 80, in open
return check_response_status(self.conn.getresponse())
File "/usr/local/lib/python2.6/dist-packages/fcrepo/connection.py", line 107, in check_response_status
raise ex
FedoraConnectionException: HTTP code=500, Reason=Internal Server Error, body=javax.ws.rs.WebApplicationException: org.fcrepo.server.errors.ObjectNotFoundException: Error creating replication job: The requested object doesn't exist in the registry.
Finally I was able to fix this issue. The problem was that the database used by Fedora Commons was inconsistent: for the image objects on which I was seeing the error, the entries were simply not present in the database. The inconsistency arose from a migration in which some rows were missed. We copied the missing data from the old database to the new one, and it worked this time!

MemoryError of Python Module dbf

I would like to convert a CSV to DBF in Python in the following way:
import dbf
table = dbf.from_csv('/home/beata/Documents/NC/CNRM_control/CNRM_pr_power1961','CNRM_pr_power1961.dbf')
but I got the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.7/dbf/__init__.py", line 172, in from_csv
mtable.append(tuple(row))
File "/usr/lib/pymodules/python2.7/dbf/tables.py", line 1154, in append
newrecord[index] = item
File "/usr/lib/pymodules/python2.7/dbf/tables.py", line 278, in __setitem__
yo.__setattr__(yo._layout.fields[name], value)
File "/usr/lib/pymodules/python2.7/dbf/tables.py", line 269, in __setattr__
yo._updateFieldValue(fielddef, value)
File "/usr/lib/pymodules/python2.7/dbf/tables.py", line 168, in _updateFieldValue
bytes = array('c', update(value, fielddef, yo._layout.memo))
File "/usr/lib/pymodules/python2.7/dbf/_io.py", line 132, in updateMemo
block = memo.put_memo(string)
File "/usr/lib/pymodules/python2.7/dbf/tables.py", line 424, in put_memo
yo.memory[thismemo] = data
MemoryError
>>>
The CSV is 2.4 GiB in size. I'm running 64-bit Ubuntu 14.04 LTS with 31.3 GiB of memory and an Intel Xeon(R) CPU E5-1660 v2 @ 3.70GHz × 12.
Could someone tell me what I should do to fix this error?
Thank you for your help in advance!
The problem you have there is that dbf.from_csv attempts to create an in-memory table, and your O/S isn't letting you have enough RAM to do so.
To get around that problem I re-wrote from_csv to write directly to disk if you pass on_disk=True. Check out PyPI for the latest code.
The remaining problem is in the dbf format itself -- you may run into problems with files that large as the internal structure wasn't designed for such large capacities. If the update doesn't completely solve your problem you'll need to split your csv file and create multiple dbfs out of it.
Feel free to email me directly if you have more questions.
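If the update alone doesn't get you there, the CSV-splitting route can be sketched with the standard library. Note that split_csv and its chunk-naming scheme are my own invention, not part of the dbf module:

```python
import csv
import itertools

def split_csv(path, rows_per_chunk):
    """Carve a large CSV into numbered chunk files and return their paths.

    Each chunk can then be converted into its own, smaller .dbf table.
    """
    chunks = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        for i in itertools.count():
            # Pull the next batch of rows without loading the whole file.
            rows = list(itertools.islice(reader, rows_per_chunk))
            if not rows:
                break
            chunk_path = "{}.part{}.csv".format(path, i)
            with open(chunk_path, "w", newline="") as out:
                csv.writer(out).writerows(rows)
            chunks.append(chunk_path)
    return chunks
```

Each resulting chunk can then be converted on its own, e.g. dbf.from_csv(chunk, chunk + '.dbf', on_disk=True) once the updated package is installed.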