Hello I'm using the aws cli with pyhon. I need to delete previous jobs from the transcription service to prevent a high invoice. My problem remains when the script starts because yet there isn't any job, I need delete if it exist and if it doesn't exists do nothing.
transcribe_client = boto3.client('transcribe', aws_access_key_id=AWS_ACCESS_KEY_ID,
aws_secret_access_key=AWS_SECRET_ACCESS_KEY, region_name=AWS_DEFAULT_REGION)
JOB_NAME = "Example-job"
try:
transcribe_client.delete_transcription_job(TranscriptionJobName=JOB_NAME)
except ClientError as e:
raise Exception( "boto3 client error el job doesnt exists: " + e.__str__())
except Exception as e:
raise Exception( "Unexpected error deleting job: " + e.__str__())
When the program starts it's throwing a exception because there isn't any job.
I need to check if this exists and the delete it, and if this job doesn't exists do nothing and should not exists crashes.
Also I don't know if this jobs has a price, if this jobs doesn't have cost/price I will generate many jobs with unique id and of this way I should prevent crashes.
any idea guys to solve this problem I will appreciate.
thanks so much.
You can call list_transcription_jobs to retrieve a list of transcription jobs that match the specified criteria, or a list of all transcription jobs if you don't specify criteria. You can then iterate over the results and decide which jobs need to be deleted.
Alternatively, You can call get_transcription_job which, according to the docs, will throw TranscribeService.Client.exceptions.NotFoundException.
Related
I am using the script offered here to delete deployed HITs from the Amazon Mechanical Turk platform. However, I am getting the following exception by the mturk client:
An error occurred (RequestError) when calling the DeleteHIT operation: This HIT is currently in the state 'Reviewable'. This operation can be called with a status of: Reviewing, Reviewable (1574723552282 s)
To me, the error msg itself seems to be wrong. Does anybody have an explanation for this behaviour?
Try somethng like this maybe, I found this somewhere and it can solve your problem,
# if hit is reviewable and has assignments, approve assignments
if status == 'Reviewable':
assignments = mturk.list_assignments_for_hit(HITId=hit['HITId'], AssignmentStatuses=['Submitted'])
if assignments['NumResults'] > 0:
for assign in assignments['Assignments']:
mturk.approve_assignment(AssignmentId=assign['AssignmentId'])
try:
mturk.delete_hit(HITId=hit['HITId'])
except:
print('Not deleted')
I'm trying to run a Lambda function to create a SageMaker training job using the same parameters as another previous training job. Here's my lambda function:
def lambda_handler(event, context):
training_job_name = os.environ['training_job_name']
sm = boto3.client('sagemaker')
job = sm.describe_training_job(TrainingJobName=training_job_name)
training_job_prefix = 'new-randomcutforest-'
training_job_name = training_job_prefix+str(datetime.datetime.today()).replace(' ', '-').replace(':', '-').rsplit('.')[0]
print("Starting training job %s" % training_job_name)
resp = sm.create_training_job(
TrainingJobName=training_job_name,
AlgorithmSpecification=job['AlgorithmSpecification'],
RoleArn=job['RoleArn'],
InputDataConfig=job['InputDataConfig'],
OutputDataConfig=job['OutputDataConfig'],
ResourceConfig=job['ResourceConfig'],
StoppingCondition=job['StoppingCondition'],
VpcConfig=job['VpcConfig'],
HyperParameters=job['HyperParameters'] if 'HyperParameters' in job else {},
Tags=job['Tags'] if 'Tags' in job else [])
[...]
And I keep getting the following error message:
An error occurred (ValidationException) when calling the CreateTrainingJob operation: You can’t override the metric definitions for Amazon SageMaker algorithms. Please retry the request without specifying metric definitions.: ClientError
Traceback (most recent call last):
File “/var/task/lambda_function.py”, line 96, in lambda_handler
StoppingCondition=job[‘StoppingCondition’]
, and I get the same error for Hyperparameters and Tags.
I tried to remove these parameters, but they are required, so that's not a solution:
Parameter validation failed:
Missing required parameter in input: "StoppingCondition": ParamValidationError
I tried to hard-code these variables, but it led to the same error.
The exact same function used to work, but only for a few training jobs (around 5), and then it gave this error message. Now it stopped working completely, and the same error message comes up. Any idea why?
Before calling "sm.create_training_job", remove the MetricDefinitions key. To do this, pop that key from the 'AlgorithmSpecification' dictionary.
job['AlgorithmSpecification'].pop('MetricDefinitions',None)
It's hard to tell exactly what's going wrong here and why your previous job's hyperparemeters didn't work. Perhaps instead of just passing them along to the new job you could print them out to be able to inspect them?
Going the by this line...
training_job_prefix = 'new-randomcutforest-'
... I am going to hazard a guess and assume you are trying to run RCF. The hyperparameters that that algo requires are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html
I am wondering something, and I really can't find information about it. Maybe it is not the way to go but, I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume batch messages. In my Lambda function I iterate each message, and if one fails, Lambda exits. And the cycle starts again.
I am wondering about slightly different approach
Let's assume I have three messages: A, B and C. I also take them in batches. Now if the message B fails (e.g. API call failed), I return message B to SQS and keep processing the message C.
Is it possible? If it is, is it a good approach? Because I see that I need to implement some extra complexity in Lambda and what not.
Thanks
There's an excellent article here. The relevant parts for you are...
Using a batchSize of 1, so that messages succeed or fail on their own.
Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
Handle errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
sqs = boto3.client('sqs')
def handler(event, context):
failed = False
for msg in event['Records']:
try:
# Do something with the message.
handle_message(msg)
except Exception:
# Ok it failed, but allow the loop to finish.
logger.exception('Failed to handle message')
failed = True
else:
# The message was handled successfully. We can delete it now.
sqs.delete_message(
QueueUrl=<queue_url>,
ReceiptHandle=msg['receiptHandle'],
)
# It doesn't matter what the error is. You just want to raise here
# to ensure the trigger doesn't delete any of the failed messages.
if failed:
raise RuntimeError('Failed to process one or more messages')
def handle_msg(msg):
...
For Node.js, check out https://www.npmjs.com/package/#middy/sqs-partial-batch-failure.
const middy = require('#middy/core')
const sqsBatch = require('#middy/sqs-partial-batch-failure')
const originalHandler = (event, context, cb) => {
const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
return Promise.allSettled(recordPromises)
}
const handler = middy(originalHandler)
.use(sqsBatch())
Check out https://medium.com/#brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.
As of Nov 2019, AWS has introduced the concept of Bisect On Function Error, along with Maximum retries. If your function is idempotent this can be used.
In this approach you should throw an error from the function even if one item in the batch is failing. AWS with split the batch into two and retry. Now one half of the batch should pass successfully. For the other half the process is continued till the bad record is isolated.
Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.
If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach to achieve this is iterating through a list of your events (ex [eventA, eventB, eventC]), and for each execution, append to a list of failed events if the event failed. Then, have an end case that checks to see if the list of failed events has anything in it, and if it does, manually send the messages back to SQS (using SQS sendMessageBatch).
However, you should note that this puts the events to the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
To add to the answer by David:
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
I implemented this, but a batch of A, B and C, with B failing, would still mark all three as complete. It turns out you need to explicitly define the lambda event source mapping to expect a batch failure to be returned. It can be done by adding the key of FunctionResponseTypes with the value of a list containing ReportBatchItemFailures. Here is the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My sam template looks like this after adding this:
Type: SQS
Properties:
Queue: my-queue-arn
BatchSize: 10
Enabled: true
FunctionResponseTypes:
- ReportBatchItemFailures
When I run dataflow jobs that writes to google cloud datastore, sometime I see the metrics show that I had one or two datastoreRpcErrors:
Since these datastore writes usually contain a batch of keys, I am wondering in the situation of RpcError, if some retry will happen automatically. If not, what would be a good way to handle these cases?
tl;dr: By default datastoreRpcErrors will use 5 retries automatically.
I dig into the code of datastoreio in beam python sdk. It looks like the final entity mutations are flushed in batch via DatastoreWriteFn().
# Flush the current batch of mutations to Cloud Datastore.
_, latency_ms = helper.write_mutations(
self._datastore, self._project, self._mutations,
self._throttler, self._update_rpc_stats,
throttle_delay=_Mutate._WRITE_BATCH_TARGET_LATENCY_MS/1000)
The RPCError is caught by this block of code in write_mutations in the helper; and there is a decorator #retry.with_exponential_backoff for commit method; and the default number of retry is set to 5; retry_on_rpc_error defines the concrete RPCError and SocketError reasons to trigger retry.
for mutation in mutations:
commit_request.mutations.add().CopyFrom(mutation)
#retry.with_exponential_backoff(num_retries=5,
retry_filter=retry_on_rpc_error)
def commit(request):
# Client-side throttling.
while throttler.throttle_request(time.time()*1000):
try:
response = datastore.commit(request)
...
except (RPCError, SocketError):
if rpc_stats_callback:
rpc_stats_callback(errors=1)
raise
...
I think you should first of all determine which kind of error occurred in order to see what are your options.
However, in the official Datastore documentation, there is a list of all the possible errors and their error codes . Fortunately, they come with recommended actions for each.
My advice is that your implement their recommendations and see for alternatives if they are not effective for you
I'm writing a database-driven application with APScheduler (v3.0.0). Especially during development, I find myself frequently wanting to command a scheduled job to start running now without affecting its subsequent schedule.
It's possible to do this at job creation time, of course:
def dummy_job(arg):
pass
sched.add_job(dummy_job, trigger='interval', hours=3, args=(None,))
sched.add_job(dummy_job, trigger=None, args=(None,))
However, if I already have a job scheduled with an interval or date trigger...
>>> sched.print_jobs()
Jobstore default:
job1 (trigger: interval[3:00:00], next run at: 2014-08-19 18:56:48 PDT)
... there doesn't seem to be a good way to tell the scheduler "make a copy of this job which will start right now." I've tried sched.reschedule_job(trigger=None), which schedules the job to start right now, but removes its existing trigger.
There's also no obvious, simple way to duplicate a job object while preserving its args and any other stateful properties. The interface I'm imagining is something like this:
sched.dup_job(id='job1', new_id='job2')
sched.reschedule_job('job2', trigger=None)
Clearly, APScheduler already contains an internal mechanism to copy job objects since repeated calls to get_job don't return the same object (that is, (sched.get_job(id) is sched.get_job(id))==False).
Has anyone else come up with a solution here? I'm thinking of posting a suggestion on the developers' site if not.
As you've probably figured out by now, that phenomenon is caused by the job stores instantiating jobs on the fly based on data loaded from the back end. To run a copy of a job immediately, this should do the trick:
job = sched.get_job(id)
sched.add_job(job.func, args=job.args, kwargs=job.kwargs)