AWS X-Ray ERROR:aws_xray_sdk.core.context:cannot find the current segment/subsegment

We recently added X-Ray to our code with:
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all
patch_all()
This runs fine on AWS Lambda, but when running locally and calling Elasticsearch we get the following exception:
ERROR:aws_xray_sdk.core.context:cannot find the current segment/subsegment, please make sure you have a segment open
queryCustomers - DEBUG - Caught exception for <function search_customer at 0x10bfcf0d0>
Traceback (most recent call last):
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/chalice/app.py", line 659, in _get_view_function_response
response = view_function(**function_args)
File "/Users/jameslin/projects/test-project/src/app.py", line 57, in search_customer
return query[0:size].execute().to_dict()['hits']['hits']
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch_dsl/search.py", line 639, in execute
**self._params
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 632, in search
doc_type, '_search'), params=params, body=body)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/elasticsearch/connection/http_requests.py", line 71, in perform_request
prepared_request = self.session.prepare_request(request)
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/ext/requests/patch.py", line 38, in _inject_header
inject_trace_header(headers, xray_recorder.current_subsegment())
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/recorder.py", line 251, in current_subsegment
entity = self.get_trace_entity()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/recorder.py", line 316, in get_trace_entity
return self.context.get_trace_entity()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/context.py", line 93, in get_trace_entity
return self.handle_context_missing()
File "/Users/jameslin/virtualenvs/test-project/lib/python3.6/site-packages/aws_xray_sdk/core/context.py", line 118, in handle_context_missing
raise SegmentNotFoundException(MISSING_SEGMENT_MSG)
aws_xray_sdk.core.exceptions.exceptions.SegmentNotFoundException: cannot find the current segment/subsegment, please make sure you have a segment open
I have no idea what this means or how to get rid of it; my Google searches turn up few relevant results. I also tried running the X-Ray daemon locally, but the problem persists:
./xray_mac -o -n ap-southeast-2

When your code runs on AWS Lambda with tracing enabled, the Lambda container generates a segment representing the whole function invocation. It also sets the context as an environment variable so the SDK can link any subsegment created inside the function back to the parent segment.
If you run the same code locally, the SDK still tries to create subsegments for the function code, but it can't find any context, hence the error you posted.
To solve this you will need to set up some environment variables so the SDK has the same information as if it were running on an actual Lambda container:
Make sure the SDK thinks it is running on a Lambda container by setting LAMBDA_TASK_ROOT to any value you like (only the presence of the key matters). You can see the source code here: https://github.com/aws/aws-xray-sdk-python/blob/master/aws_xray_sdk/core/lambda_launcher.py
Set LAMBDA_TRACE_HEADER_KEY so the function has a tracing context. The value must be a trace header; you can see more details here: https://docs.aws.amazon.com/lambda/latest/dg/lambda-x-ray.html
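The two steps above can be sketched as follows. This is a minimal sketch: the environment-variable names are the ones read by the SDK's lambda_launcher.py, and the trace-header value is a made-up example in the documented Root=...;Parent=...;Sampled=... format, not a real trace id.

```python
import os

# Emulate the Lambda environment before the X-Ray SDK inspects it.
# LAMBDA_TASK_ROOT only needs to be present; its value is ignored.
os.environ['LAMBDA_TASK_ROOT'] = '/tmp'

# _X_AMZN_TRACE_ID is the env var the SDK's lambda_launcher reads for the
# trace header; the ids below are illustrative, not real trace ids.
os.environ['_X_AMZN_TRACE_ID'] = (
    'Root=1-5759e988-bd862e3fe1be46a994272793;'
    'Parent=53995c3f42cd8ad8;Sampled=1'
)

# With these set, `from aws_xray_sdk.core import xray_recorder` followed by
# `patch_all()` should find a trace context instead of raising
# SegmentNotFoundException.
```

Set these before the X-Ray SDK is imported, since the SDK decides at import time whether it is running inside Lambda.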
This workaround is not ideal, as it requires extra code changes on the user's side. We would like to provide a better experience for testing X-Ray-instrumented Lambda functions locally. Would you mind sharing more details about how you are doing local testing and how you expect X-Ray tracing to work in that environment, so we can improve support for your use case?

Related

AWS Sagemaker Error for Training job - Algorithm Error

I am receiving the following error when I try to train an XGBoost model, and I have no idea how to fix it. Any help, please?
UnexpectedStatusException: Error for Training job sagemaker-xgboost-2022-08-22-21-37-39-774: Failed. Reason: AlgorithmError: framework error:
Traceback (most recent call last):
File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
entrypoint()
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
train(framework.training_env())
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
run_algorithm_mode()
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
checkpoint_config=checkpoint_config
File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 110, in sagemaker_train
validated_train_config = hyperparameters.validate(train_config)
File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 270, in validate
raise exc.UserError("Missing required hyperparameter: {}".format(hp)
My full notebook is too large to post here, but below I have also added an image of the code right before the training
SageMaker's implementation of XGBoost requires one hyperparameter; all the others are optional. It looks like you are not passing the required parameter, num_round.
Try adding this to your dictionary:
hyperparameters = {
"num_round": 10
}
Per the AWS docs, num_round is "The number of rounds to run the training." Valid values: integer.
For more information on SageMaker's hyperparameters requirements refer to: SageMaker Dev Guide: XGBoost Hyperparameters
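To catch this before launching a (billable) training job, you could mirror the container's check locally. A minimal sketch, where the required set is assumed from the error above rather than taken from an official list:

```python
# Assumed from the error above: num_round is the one required hyperparameter.
REQUIRED_HYPERPARAMETERS = {"num_round"}

def validate_hyperparameters(hyperparameters):
    # Raise early, locally, instead of waiting for the training job to fail.
    missing = REQUIRED_HYPERPARAMETERS - set(hyperparameters)
    if missing:
        raise ValueError(
            "Missing required hyperparameter: %s" % ", ".join(sorted(missing))
        )
    return hyperparameters

validate_hyperparameters({"num_round": 10})  # passes silently
```

Run this on the dictionary you pass to the estimator before calling fit, so the failure surfaces in the notebook rather than in the job logs.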

How can I fix Google App Engine dev_appserver.py: watcher_ignore_re flag "is not JSON serializable" error? [duplicate]

This question already has answers here:
Google App Engine dev_appserver.py: watcher_ignore_re flag "is not JSON serializable"
(2 answers)
Closed 1 year ago.
I want to first point out that I have tried every answer mentioned in this thread. None of them fixes the issue, and the question dates back a while.
Issue
I want to run dev_appserver.py while adding certain files to the watcher's ignore list; this means skip_files is out of the question, as that option stops the server from reading the files entirely.
When I run dev_appserver.py without the --watcher_ignore_re flag, everything works fine except the file watching. When I do run it with the flag, I get the following error:
INFO 2021-11-02 13:54:50,100 devappserver2.py:309] Skipping SDK update check.
Traceback (most recent call last):
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 109, in <module>
_run_file(__file__, globals())
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 103, in _run_file
_execfile(_PATHS.script_file(script_name), globals_)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 83, in _execfile
execfile(fn, scope)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 635, in <module>
main()
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 623, in main
dev_server.start(options)
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 356, in start
java_major_version=self.java_major_version
File "/home/USERNAME/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/metrics.py", line 185, in Start
self._cmd_args = json.dumps(vars(cmd_args)) if cmd_args else None
File "/usr/lib/python2.7/json/__init__.py", line 244, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <_sre.SRE_Pattern object at 0x7f720c625240> is not JSON serializable
I have tried with different versions without success:
GCloud 361.0.0/362.0.0/357.0.0/240.0.0/220.0.0/200.0.0
Python 2.7.18/3.9.7
I have also tried different string values on the watcher flag:
""
''
".css"
"*.css"
".*\css"
'.css'
'*.css'
'.*\css'
etc.
So I know the issue is not with how the string is formulated (at least it doesn't seem to be), and the different versions don't help either.
My work colleagues don't have this issue and are using various of the versions listed above, on macOS. I am currently on Arch Linux, but I have run into the exact same issue on my Mac as well.
I have also added export CLOUDSDK_PYTHON=python2.7 to my ~/.zshrc file.
So it turns out this was a duplicate after all; I missed the one comment that had the solution: https://stackoverflow.com/a/52238832/9706597
It looks like an issue with the Google Analytics code built into dev_appserver2 (google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\devappserver2.py, on or around line 316). It wants to send all of your command-line options to Google Analytics. If you remove the analytics client id by adding the command-line option --google_analytics_client_id= (note: '=' with no following value), the app server won't call the Google Analytics code where it tries, and fails, to JSON-serialize an SRE object.
In short, for anyone else coming through here: simply add the option --google_analytics_client_id= with no value.
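Putting it together, the invocation might look like this. This is a sketch: app.yaml and the regex are placeholders for your own app config and ignore pattern.

```shell
dev_appserver.py app.yaml \
  --watcher_ignore_re=".*\.css" \
  --google_analytics_client_id=
```

The empty client id disables the analytics reporting path entirely, so the regex object is never handed to the JSON serializer.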

RuntimeError 'CUDA error' when trying the Google Cloud AI Platform Tutorial for PyTorch

I've been trying to get started with Google Cloud's AI Platform. I have been following this tutorial: https://cloud.google.com/ai-platform/training/docs/getting-started-pytorch#gpu_1
My own models are written in PyTorch, hence the choice to get started with PyTorch. Why I want to use the GPU I guess goes without saying.
I've tried to follow the instructions to the letter, and I've used the provided sample code. Yet I still run into errors. I can create a job without problems, but the job ends up failing with the following error:
RuntimeError: CUDA error: no kernel image is available for execution on the device
I'm relatively new to PyTorch and completely new to GCP, so I have no idea how I would go about fixing this and any help would be much appreciated.
Full trace:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 123, in <module>
main()
File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 119, in main
experiment.run(args)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 132, in run
train(sequential_model, train_loader, criterion, optimizer, epoch)
File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 37, in train
for batch_index, data in enumerate(train_loader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 347, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 387, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in default_collate
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 74, in <dictcomp>
return {key: default_collate([d[key] for d in batch]) for key in elem}
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: CUDA error: no kernel image is available for execution on the device
I have managed to reproduce the same issue with pytorch-gpu.1-6.
As a workaround, it works with pytorch-gpu.1-4.
I suppose something in the code changed in 1.6, but I'm not sure what, as I am not familiar with PyTorch.
Besides, up to version 1.7 the device-selection code seems unchanged compared to our sample code.
Additionally, our GPU basic tier's NVIDIA Tesla K80 seems to be supported by CUDA 9.0, 9.2, 10.0, or 11.
In any case, I have created a public issue on the issue tracker so the AI Platform engineering team can investigate it.

Unable to add new datastreams to some specific objects in Fedora repository (Fedora commons)

I need help resolving this issue. I am unable to add new datastreams to a few specific objects in the Fedora repository, but have no clue what's actually wrong with these objects. Here is the error trace I get:
HTTP code=500, Reason=Internal Server Error, body=javax.ws.rs.WebApplicationException: org.fcrepo.server.errors.ObjectNotFoundException: Error creating replication job: The requested object doesn't exist in the registry.
Traceback (most recent call last):
File "/opt/2.0/flx/pylons/flx/compress_upload_images.py", line 159, in run
obj.addDataStream(cDSName, fc.getDSXml(r.type.name), label=label, mimeType=h.safe_decode('%s' % mimeType), controlGroup=controlGroup, logMessage=h.safe_decode('Storing compressed %s' % r.type.name))
File "/usr/local/lib/python2.6/dist-packages/fcrepo/object.py", line 64, in addDataStream
self.client.addDatastream(self.pid, dsid, body, **params)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/client.py", line 119, in addDatastream
response = request.submit(body, **params)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/wadl.py", line 81, in submit
method=self.method.name)
File "/usr/local/lib/python2.6/dist-packages/fcrepo/connection.py", line 80, in open
return check_response_status(self.conn.getresponse())
File "/usr/local/lib/python2.6/dist-packages/fcrepo/connection.py", line 107, in check_response_status
raise ex
FedoraConnectionException: HTTP code=500, Reason=Internal Server Error, body=javax.ws.rs.WebApplicationException: org.fcrepo.server.errors.ObjectNotFoundException: Error creating replication job: The requested object doesn't exist in the registry.
Finally I was able to fix this issue. The problem was that the database used by Fedora Commons was inconsistent: for the image objects affected, the entries were simply not present in the db. The inconsistency had happened during a migration, in which some rows were missed. We copied the missing data from the old database to the new one, and this time it worked!

How to mount a snapshot from boto?

I'm using boto 2.5.1, Python 2.7, Ubuntu Precise. I want to mount a snapshot on an EC2 instance. I've gotten as far as creating a volume from the snapshot, but then I can't figure out how to attach it. If I do:
[setup stuff elided]
c = EC2Connection()
print volume
print instance
c.attach_volume(volume, instance, "/dev/snap")
I get the amazingly unhelpful exception:
vol-2df00677
i-1509d364
Traceback (most recent call last):
File "./mongo_pulldown.py", line 48, in <module>
main()
File "./mongo_pulldown.py", line 28, in main
c.attach_volume(volume, instance, "/dev/snap")
File "/home/roy/deploy/current/python/local/lib/python2.7/site-packages/boto/ec2/connection.py", line 1530, in attach_volume
return self.get_status('AttachVolume', params, verb='POST')
File "/home/roy/deploy/current/python/local/lib/python2.7/site-packages/boto/connection.py", line 985, in get_status
raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
with no clue about what I've done wrong. I'm assuming the device name is arbitrary and the attach call will create the device as part of the process? Or does the device have to exist already?
How can I get a more useful diagnostic than just "Bad Request"?
The attach_volume method takes a volume_id and an instance_id, but you are passing objects. Try this:
c.attach_volume(volume.id, instance.id, "/dev/sdh")
The device_name should be a reasonable device name for the OS you are using. You can find more about what that value should be here.
boto uses standard Python logging so you can configure it to log as much or as little as you want. This gist shows a shortcut approach to get full debug logging. However, boto can only log what it has access to and it's possible the response from EC2 just doesn't provide much information.
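The logging setup the answer describes might look like this; a minimal sketch using only the stdlib logging module, with no boto-specific API needed:

```python
import logging

# Send all log output to stderr at DEBUG level...
logging.basicConfig(level=logging.DEBUG)
# ...and make sure boto's logger (named 'boto') passes DEBUG records through,
# so the raw HTTP request/response -- including the EC2 error body with the
# real reason behind "400 Bad Request" -- gets printed.
logging.getLogger('boto').setLevel(logging.DEBUG)
```

Put this at the top of the script, before creating the EC2Connection, so the connection setup is logged as well.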
Well, it turns out I was passing something bad for device: '/dev/snap'. When I changed it to 'xvdg', things worked. It appears '/dev/xvdg' also works (and has the same effect; the '/dev/' part appears to be ignored).
I wrote a little function to find the next available unused device name:
import os

def get_device_name():
    # Scan /dev/xvdf .. /dev/xvdp and return the first path not in use yet.
    for c in 'fghijklmnop':
        name = "xvd%s" % c
        path = "/dev/%s" % name
        try:
            os.stat(path)
        except OSError:
            # stat failed, so no device node exists at this path: it's free.
            return path
    raise RuntimeError("no unused device name found in /dev/xvd[f-p]")
I was hoping that by using a fixed name outside of the set that's normally used, I could avoid having to worry about this silliness.