Every time I check my BigQuery job status through the API, it returns 0 for waitMsAvg while the job is running.
After the job is finished, I can see valid numbers for waitMsAvg.
Is this intended behavior, or is it a bug?
How can I see the values while the job is running?
The BigQuery query plan provides a timing classification for each query stage, where waitMsAvg is the average time workers spent waiting to be scheduled.
You are getting waitMsAvg as 0 because your query is still running. Once the query job completes, waitMsAvg will report the average time workers spent waiting to be scheduled.
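Once the job has completed, those timing fields can be read from the query plan via the client libraries. A minimal sketch, assuming the google-cloud-bigquery Python client (the query is just an example; wait_ms_avg and wait_ms_max are the snake_case properties that client exposes for these fields):

from google.cloud import bigquery

client = bigquery.Client()  # uses default credentials and project
job = client.query(
    "SELECT corpus, COUNT(*) FROM `bigquery-public-data.samples.shakespeare` GROUP BY corpus"
)
job.result()  # wait for completion; the timing fields only carry final values afterwards

for stage in job.query_plan:
    # Average and maximum time workers spent waiting to be scheduled, per stage
    print(stage.name, stage.wait_ms_avg, stage.wait_ms_max)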
Related
I am running AWS Glue jobs using PySpark. They have a Timeout set (as visible on the screenshot) of 1440 minutes, which is 24 hours. Nevertheless, the jobs continue running beyond those 24 hours.
When one particular job had been running for over 5 days, I stopped it manually (clicking the stop icon in the "Run status" column of the GUI visible on the screenshot). However, since then (it has been over 2 days) it still hasn't stopped: the "Run status" is Stopping, not Stopped.
Additionally, after about 4 hours of running, new logs (column "Logs") in CloudWatch for this Job Run stopped appearing (my PySpark script has print() statements which regularly and frequently log extra data). Also, the last error log in CloudWatch (column "Error logs") was written 24 seconds after the date of the newest entry in "Logs".
This behaviour continues for multiple jobs.
My questions are:
What could be the reasons for Job Runs not obeying the set Timeout value? How can that be fixed?
Why is the newest log from about 4 hours after the Job Run started, when logs should appear regularly throughout the 24 hours of the (desired) duration of the Job Run?
Why don't the Job Runs stop when I try to stop them manually? How can they be stopped? (See the API sketch below.)
Thank you in advance for your advice and hints.
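For reference, a stop request can also be issued through the Glue API instead of the console. A minimal boto3 sketch (the job name is a placeholder, and this may not help if a run is already stuck in Stopping):

import boto3

glue = boto3.client("glue")
JOB_NAME = "my-glue-job"  # placeholder

# Find runs of this job that are still active.
runs = glue.get_job_runs(JobName=JOB_NAME)["JobRuns"]
running_ids = [r["Id"] for r in runs if r["JobRunState"] == "RUNNING"]

if running_ids:
    # Ask Glue to stop them; runs move to STOPPING and eventually STOPPED.
    glue.batch_stop_job_run(JobName=JOB_NAME, JobRunIds=running_ids)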
In the AWS Batch Job queues dashboard, it shows the failed and succeeded job counts for all job statuses over the last 24 hours. Is it possible to reset the counters to zero?
No, it's not possible to clear jobs. Batch keeps finished jobs around for at least a day (and in my experience occasionally up to a few weeks), and there's no API or console mechanism to accelerate the process.
Is there a way to set a maximum running time for AWS Batch jobs (or queues)? This is a standard setting in most batch managers, which avoids wasting resources when a job hangs for whatever reason.
As of April 2018, AWS Batch supports setting a Job Timeout when submitting a Job, or in the job definition.
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-batch-adds-support-for-automatic-termination-with-job-execution-timeout/
You specify an attemptDurationSeconds parameter, which must be at least 60 seconds, either in your job definition, or when you submit the job. When this number of seconds has passed following the job attempt's startedAt timestamp, AWS Batch terminates the job. On the compute resource, your job's container receives a SIGTERM signal to give your application a chance to shut down gracefully; if the container is still running after 30 seconds, a SIGKILL signal is sent to forcefully shut down the container.
Source: https://docs.aws.amazon.com/batch/latest/userguide/job_timeouts.html
POST /v1/submitjob HTTP/1.1
Content-type: application/json

{
   ...
   "timeout": {
      "attemptDurationSeconds": number
   }
}
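The same timeout can also be set from the SDKs; for example, a boto3 sketch (the job name, queue and definition are placeholders, and the 2-hour value is just an example):

import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="nightly-report",           # placeholder
    jobQueue="my-job-queue",            # placeholder
    jobDefinition="my-job-definition",  # placeholder
    # Terminate the attempt if it runs longer than 2 hours.
    timeout={"attemptDurationSeconds": 7200},
)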
AFAIK there is no feature to do this. However, a workaround was suggested in the forum for a similar question.
One idea is to call Batch as an Activity from Step Functions, pinging back on a schedule (e.g. every minute) from that job. If it stops responding then you can detect that situation as a Timeout in the activity and act accordingly (terminate the job etc.). Not an ideal solution (especially if the job continues to ping back as a "zombie"), but it's a start. You'd also likely have to store activity tokens in a database to trace them to the Batch job id.
Alternatively, you split that setup into 2 steps, and schedule a Batch job from a Lambda in the first state, then pass the Batch job id to the second step, which then polls Batch (from another Lambda) for its state with Retry and IntervalSeconds (e.g. once every minute, or even with exponential backoff), and MaxAttempts calculated based on your timeout. This way, you don't need any external state storage mechanism, long polling or even a "ping back" from the job (it CAN be a zombie), but the downside is more steps.
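A rough sketch of what the polling Lambda in that second state might look like (boto3; it assumes the Batch job id is passed in the state input and relies on the state machine's Retry/IntervalSeconds/MaxAttempts to drive the polling):

import boto3

batch = boto3.client("batch")

def lambda_handler(event, context):
    job_id = event["jobId"]  # passed along from the first state
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    status = job["status"]  # SUBMITTED, RUNNABLE, RUNNING, SUCCEEDED, FAILED, ...

    if status not in ("SUCCEEDED", "FAILED"):
        # Raising here fails the state, so Retry keeps polling; once MaxAttempts
        # is exhausted, a Catch in the state machine can terminate the job.
        raise RuntimeError(f"Job {job_id} is still {status}")

    return {"jobId": job_id, "status": status}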
There is no option to set a timeout on a Batch job, but you can set up a Lambda function that triggers every hour or so and terminates jobs created more than, say, 24 hours ago.
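A minimal sketch of such a cleanup Lambda (boto3; the queue name and the 24-hour cutoff are assumptions, and pagination of list_jobs is omitted):

import time
import boto3

batch = boto3.client("batch")
QUEUE = "my-job-queue"          # placeholder
MAX_AGE_SECONDS = 24 * 60 * 60  # assumed cutoff

def lambda_handler(event, context):
    now_ms = int(time.time() * 1000)
    jobs = batch.list_jobs(jobQueue=QUEUE, jobStatus="RUNNING")["jobSummaryList"]
    for job in jobs:
        # createdAt is a Unix timestamp in milliseconds.
        if now_ms - job["createdAt"] > MAX_AGE_SECONDS * 1000:
            batch.terminate_job(jobId=job["jobId"], reason="Exceeded 24h limit")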
I have been working with AWS for some time now and could not find a way to set a maximum running time for Batch jobs.
However, there are some alternative approaches you could use.
AWS Forum
Sadly, there is no way to set a limit on execution time in AWS Batch.
One solution may be to edit the Docker image's entry point so that it enforces the execution time limit itself.
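One way to do that is to make the entry point a small wrapper that kills the real workload after a fixed time. A minimal Python sketch (the wrapped command and the 4-hour limit are placeholders):

import subprocess
import sys

MAX_RUNTIME_SECONDS = 4 * 60 * 60  # assumed limit

try:
    # Run the real workload; subprocess raises if it exceeds the limit.
    subprocess.run(["python", "run_job.py"], check=True, timeout=MAX_RUNTIME_SECONDS)
except subprocess.TimeoutExpired:
    print("Job exceeded its maximum running time, aborting", file=sys.stderr)
    sys.exit(1)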
After adding some asynchronous jobs to our workflow, the execution of some instances became slow. I use the embedded process engine Camunda (https://docs.camunda.org/get-started/spring/embedded-process-engine/).
Any ideas?
It looks like your job executions result in timers being added; there was a bug where the process engine did not realize that new jobs had been added, or that there might be other jobs to execute in that case.
The issue is described in Issue CAM-6453
The scenario for us was that we had several thousand processes accumulated due to a network problem. The process would execute one service task and then wait for an intermediate timer catch event. Because adding a timer did not hint the job executor, it would execute a few processes and then sleep for 60 seconds before acquiring the next batch of jobs, even though there were still a few thousand jobs available for execution.
It should be fixed since 7.4.10, 7.5.4 and 7.6.
We set up a schedule to execute a command.
It is scheduled to run every 5 minutes as follows: 20090201T235900|20190201T235900|127|00:05:00
However, from the logs we see it runs only every hour.
Is there a reason for this?
Check the scheduling frequency in your sitecore.config file:
<sitecore>
  <scheduling>
    <!-- Time between checking for scheduled tasks waiting to execute -->
    <frequency>00:05:00</frequency>
  </scheduling>
</sitecore>
The scheduling behaviour is based on the scheduler interval and the job interval. Every scheduler interval period, all the configured jobs are evaluated (this is logged). During that evaluation, each job is checked against the last time it ran; if that interval is greater than the configured job interval, the job is started.
It's fairly simple, but it's important to understand the mechanism. You can also see how it allows no way of inherently running jobs at a specific time, only at approximate intervals.
You can also see that jobs can never run more frequently than the scheduler interval regardless of the job interval. It is not unreasonable to set the scheduler to one-minute intervals to reduce the inaccuracy of job timings to no more than a minute.
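The mechanism can be summarised with a rough sketch (illustrative Python, not Sitecore's actual code; the agent and its interval are assumptions):

import time
from datetime import datetime, timedelta

SCHEDULER_FREQUENCY = timedelta(minutes=5)  # the <frequency> setting

# Assumed example agent: configured job interval and last run time
agents = {"MyCommand": {"interval": timedelta(minutes=5), "last_run": datetime.min}}

while True:
    now = datetime.utcnow()
    for name, state in agents.items():
        # A job starts only if the time since its last run exceeds its own interval.
        if now - state["last_run"] > state["interval"]:
            print(f"running {name} at {now}")
            state["last_run"] = now
    # Nothing is re-evaluated until the next scheduler tick.
    time.sleep(SCHEDULER_FREQUENCY.total_seconds())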
In the worst case, with a 5-minute scheduler interval and a 5-minute job interval, the gap before a job starts again can be up to 9 minutes 59 seconds: if a scheduler check finds that only 4 minutes 59 seconds have passed since the job last ran, the job is skipped, and it is not considered again until the next check 5 minutes later.