ValidationException error when calling the CreateTrainingJob operation: You can't override the metric definitions for Amazon SageMaker algorithms

I'm trying to run a Lambda function that creates a SageMaker training job using the same parameters as a previous training job. Here's my Lambda function:
import datetime
import os

import boto3


def lambda_handler(event, context):
    training_job_name = os.environ['training_job_name']
    sm = boto3.client('sagemaker')
    job = sm.describe_training_job(TrainingJobName=training_job_name)
    training_job_prefix = 'new-randomcutforest-'
    training_job_name = training_job_prefix + str(datetime.datetime.today()).replace(' ', '-').replace(':', '-').rsplit('.')[0]
    print("Starting training job %s" % training_job_name)
    resp = sm.create_training_job(
        TrainingJobName=training_job_name,
        AlgorithmSpecification=job['AlgorithmSpecification'],
        RoleArn=job['RoleArn'],
        InputDataConfig=job['InputDataConfig'],
        OutputDataConfig=job['OutputDataConfig'],
        ResourceConfig=job['ResourceConfig'],
        StoppingCondition=job['StoppingCondition'],
        VpcConfig=job['VpcConfig'],
        HyperParameters=job['HyperParameters'] if 'HyperParameters' in job else {},
        Tags=job['Tags'] if 'Tags' in job else [])
[...]
And I keep getting the following error message:
An error occurred (ValidationException) when calling the CreateTrainingJob operation: You can't override the metric definitions for Amazon SageMaker algorithms. Please retry the request without specifying metric definitions.: ClientError
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 96, in lambda_handler
StoppingCondition=job['StoppingCondition']
I get the same error for HyperParameters and Tags.
I tried to remove these parameters, but they are required, so that's not a solution:
Parameter validation failed:
Missing required parameter in input: "StoppingCondition": ParamValidationError
I tried to hard-code these variables, but it led to the same error.
The exact same function used to work for a few training jobs (around five), then started failing with this error, and now it fails every time with the same message. Any idea why?

Before calling "sm.create_training_job", remove the MetricDefinitions key. To do this, pop that key from the 'AlgorithmSpecification' dictionary.
job['AlgorithmSpecification'].pop('MetricDefinitions',None)
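For context, here is a minimal sketch of what the call from the question looks like with that fix applied; the only change is stripping MetricDefinitions from the described job before reusing its AlgorithmSpecification:

# Sketch: clone the previous job's settings, but drop MetricDefinitions,
# which cannot be overridden for built-in SageMaker algorithms.
algo_spec = job['AlgorithmSpecification']
algo_spec.pop('MetricDefinitions', None)  # safe even if the key is absent

resp = sm.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification=algo_spec,
    RoleArn=job['RoleArn'],
    InputDataConfig=job['InputDataConfig'],
    OutputDataConfig=job['OutputDataConfig'],
    ResourceConfig=job['ResourceConfig'],
    StoppingCondition=job['StoppingCondition'],
    HyperParameters=job.get('HyperParameters', {}),
    Tags=job.get('Tags', []))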

It's hard to tell exactly what's going wrong here and why your previous job's hyperparameters didn't work. Perhaps instead of just passing them along to the new job you could print them out and inspect them (see the small sketch after this answer)?
Going by this line...
training_job_prefix = 'new-randomcutforest-'
... I am going to hazard a guess and assume you are trying to run RCF. The hyperparameters that the algorithm requires are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html
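Following that suggestion to inspect the values, a tiny sketch (assuming the same `job` dictionary returned by describe_training_job in the question):

import json

# Dump the inherited hyperparameters before reusing them, so stale or
# unexpected values show up in the Lambda's CloudWatch logs.
print(json.dumps(job.get('HyperParameters', {}), indent=2))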

Related

Error in AWS API requesting for MTurk workers with a certain qualification type

I am trying to retrieve workers with a certain qualification type by using the following code.
response = mturk_client.list_workers_with_qualification_type(
    QualificationTypeId=args.qualification_type,
    Status='Granted',
    MaxResults=101
)
However, it results in the following error:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the ListWorkersWithQualificationType operation: 1 validation error detected: Value '101' at 'maxResults' failed to satisfy constraint: Member must have value less than or equal to 100
When I change the MaxResults parameter to 100 or less, things work fine. It looks like the API enforces a hard cap of 100 on MaxResults.
Does anyone have any suggestions for how I can fix this?
I am using boto3 version 1.24.22 in Python.
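For reference, the 100-item cap is per call; the usual way to retrieve more than 100 workers is to page through the results with NextToken. A minimal sketch, assuming the mturk_client and args.qualification_type from the question above:

# Sketch: page through all workers 100 at a time instead of requesting 101 at once.
workers = []
kwargs = {
    'QualificationTypeId': args.qualification_type,
    'Status': 'Granted',
    'MaxResults': 100,  # the maximum the API accepts per call
}
while True:
    response = mturk_client.list_workers_with_qualification_type(**kwargs)
    workers.extend(response.get('Qualifications', []))
    next_token = response.get('NextToken')
    if not next_token:
        break
    kwargs['NextToken'] = next_token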

Check if a Transcribe job exists and, if it does, delete it

Hello, I'm using boto3 with Python. I need to delete previous jobs from the Transcribe service to avoid a high invoice. The problem occurs when the script starts and no job exists yet: I need to delete the job if it exists, and do nothing if it doesn't.
import boto3
from botocore.exceptions import ClientError

transcribe_client = boto3.client('transcribe', aws_access_key_id=AWS_ACCESS_KEY_ID,
                                 aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                 region_name=AWS_DEFAULT_REGION)
JOB_NAME = "Example-job"
try:
    transcribe_client.delete_transcription_job(TranscriptionJobName=JOB_NAME)
except ClientError as e:
    raise Exception("boto3 client error, the job doesn't exist: " + e.__str__())
except Exception as e:
    raise Exception("Unexpected error deleting job: " + e.__str__())
When the program starts it throws an exception because there isn't any job yet. I need to check whether the job exists and then delete it; if it doesn't exist, do nothing and don't crash.
Also, I don't know whether these jobs have a price. If they don't, I could generate many jobs with unique IDs and avoid the crashes that way.
Any ideas to solve this problem would be appreciated.
Thanks so much.
You can call list_transcription_jobs to retrieve a list of transcription jobs that match the specified criteria, or a list of all transcription jobs if you don't specify criteria. You can then iterate over the results and decide which jobs need to be deleted.
Alternatively, you can call get_transcription_job which, according to the docs, will throw TranscribeService.Client.exceptions.NotFoundException if the job doesn't exist.
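A minimal sketch of that second suggestion, using the transcribe_client and JOB_NAME from the question and the exception name mentioned above:

# Sketch: only delete the job if it actually exists; a missing job is not an error.
try:
    transcribe_client.get_transcription_job(TranscriptionJobName=JOB_NAME)
except transcribe_client.exceptions.NotFoundException:
    pass  # no such job, nothing to delete
else:
    transcribe_client.delete_transcription_job(TranscriptionJobName=JOB_NAME)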

how to return what error is encountered when getting parameter from ssm

So I'm using Lambda to run a Flask-Ask app, and right now I'm having trouble retrieving data from the Parameter Store. When I run my function from VS it runs fine and has no problem retrieving a parameter.
This is the piece of code causing my trouble:
#app.route("/test")
def test():
try:
URL_var = client.get_parameter(Name="URL")
return str(URL_var['Parameter']['Value'])
except Exception as e:
track = traceback.format_exc()
return track
Once it's actually running on AWS the problems start.
Initially the page would just time out when I tried to retrieve or put a parameter, so I configured it so connect_timeout and read_timeout were both equal to 5, and then I put the code in a try/except block. Now what I get is a stack trace that really only tells me that the function timed out.
What I need is some way to know what's going wrong when I call get_parameter so I can figure out where to go from here.
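One way to surface the underlying error instead of a bare timeout is to set short botocore timeouts with minimal retries and return the specific exception class and message. A rough sketch under those assumptions (the `app` object and the "URL" parameter name come from the question; the exact exception you see will depend on what is actually failing):

import traceback

import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Fail fast instead of letting the Lambda time out, and keep retries low so
# the real error surfaces in the response instead of being retried away.
cfg = Config(connect_timeout=5, read_timeout=5, retries={'max_attempts': 1})
client = boto3.client('ssm', config=cfg)

@app.route("/test")
def test():
    try:
        url_var = client.get_parameter(Name="URL")
        return str(url_var['Parameter']['Value'])
    except (ClientError, BotoCoreError) as e:
        # Report the exception class and message so the caller can see what
        # failed (e.g. a permissions error vs. a connect/read timeout).
        return "%s: %s" % (type(e).__name__, e)
    except Exception:
        return traceback.format_exc()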

Amazon MTurk: can't delete HIT in state 'Reviewable'

I am using the script offered here to delete deployed HITs from the Amazon Mechanical Turk platform. However, I am getting the following exception by the mturk client:
An error occurred (RequestError) when calling the DeleteHIT operation: This HIT is currently in the state 'Reviewable'. This operation can be called with a status of: Reviewing, Reviewable (1574723552282 s)
To me, the error message itself seems to be wrong. Does anybody have an explanation for this behaviour?
Try something like this. I found it somewhere and it may solve your problem:
# if hit is reviewable and has assignments, approve assignments
if status == 'Reviewable':
    assignments = mturk.list_assignments_for_hit(HITId=hit['HITId'],
                                                 AssignmentStatuses=['Submitted'])
    if assignments['NumResults'] > 0:
        for assign in assignments['Assignments']:
            mturk.approve_assignment(AssignmentId=assign['AssignmentId'])
try:
    mturk.delete_hit(HITId=hit['HITId'])
except:
    print('Not deleted')

How to increase deploy timeout limit at AWS Opsworks?

I would like to increase the deploy timeout in a stack layer that hosts many apps (AWS OpsWorks).
Currently I get the following error:
Error
[2014-05-05T22:27:51+00:00] ERROR: Running exception handlers
[2014-05-05T22:27:51+00:00] ERROR: Exception handlers complete
[2014-05-05T22:27:51+00:00] FATAL: Stacktrace dumped to /var/lib/aws/opsworks/cache/chef-stacktrace.out
[2014-05-05T22:27:51+00:00] ERROR: deploy[/srv/www/lakers_test] (opsworks_delayed_job::deploy line 65) had an error: Mixlib::ShellOut::CommandTimeout: Command timed out after 600s:
Thanks in advance.
First of all, as mentioned in this ticket reporting a similar issue, the Opsworks guys recommend trying to speed up the call first (there's always room for optimization).
If that doesn't work, we can go down the rabbit hole: this gets called, which in turn calls Mixlib::ShellOut.new, which happens to have a timeout option that you can pass in the initializer!
Now you can use an Opsworks custom cookbook to overwrite the initial method, and pass the corresponding timeout option. Opsworks merges the contents of its base cookbooks with the contents of your custom cookbook - therefore you only need to add & edit one single file to your custom cookbook: opsworks_commons/libraries/shellout.rb:
module OpsWorks
  module ShellOut
    extend self

    # This would be your new default timeout.
    DEFAULT_OPTIONS = { timeout: 900 }

    def shellout(command, options = {})
      cmd = Mixlib::ShellOut.new(command, DEFAULT_OPTIONS.merge(options))
      cmd.run_command
      cmd.error!
      [cmd.stderr, cmd.stdout].join("\n")
    end
  end
end
Notice how the only additions are just DEFAULT_OPTIONS and merging these options in the Mixlib::ShellOut.new call.
An improvement to this method would be changing this timeout option via a chef attribute, that you could in turn update via your custom JSON in the Opsworks interface. This means passing the timeout attribute in the initial Opsworks::ShellOut.shellout call - not in the method definition. But this depends on how the shellout method actually gets called...