AWS Glue job run not respecting Timeout and not stopping - amazon-web-services

I am running AWS Glue jobs using PySpark. They have a Timeout (visible on the screenshot) set to 1440 minutes, which is 24 hours. Nevertheless, the job keeps running past those 24 hours.
When this particular job had been running for over 5 days, I stopped it manually (by clicking the stop icon in the "Run status" column of the GUI, visible on the screenshot). However, more than 2 days later it still hasn't stopped: the "Run status" is Stopping, not Stopped.
Additionally, after about 4 hours of running, new logs (column "Logs") for this Job Run stopped appearing in CloudWatch (my PySpark script has print() statements that log extra data regularly and often). Also, the last error log in CloudWatch (column "Error logs") was written 24 seconds after the newest log in "Logs".
This behaviour occurs with multiple jobs.
My questions are:
What could be the reasons for Job Runs not obeying the set Timeout value? How can I fix that?
Why is the newest log from 4 hours after the Job Run started, when logs should appear regularly during the 24 hours of the (desired) duration of the Job Run?
Why don't the Job Runs stop when I try to stop them manually? How can they be stopped?
Thank you in advance for your advice and hints.
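
As a starting point for diagnosing this, here is a minimal sketch that lists runs that are still active past their configured Timeout and requests a stop for those not already stopping. It assumes a boto3 Glue client is passed in and relies on the `get_job_runs` and `batch_stop_job_run` calls; the helper names `overdue_runs` and `stop_overdue_runs` are hypothetical. It only reads the first page of results, and whether a repeated stop request helps a run already stuck in Stopping is not something this thread establishes.

```python
from datetime import datetime, timedelta, timezone

# Run states that still count as "not finished" for this check.
ACTIVE_STATES = {"STARTING", "RUNNING", "STOPPING"}


def overdue_runs(job_runs, now=None):
    """Return the runs that are still active past their Timeout (in minutes).

    `job_runs` is a list shaped like the "JobRuns" list from get_job_runs.
    """
    now = now or datetime.now(timezone.utc)
    overdue = []
    for run in job_runs:
        if run.get("JobRunState") not in ACTIVE_STATES:
            continue
        timeout = run.get("Timeout")
        started = run.get("StartedOn")
        if timeout is None or started is None:
            continue
        if now - started > timedelta(minutes=timeout):
            overdue.append(run)
    return overdue


def stop_overdue_runs(glue_client, job_name, now=None):
    """Request a stop for overdue runs that are not already in STOPPING.

    Returns the list of run IDs for which a stop was requested.
    Only the first page of get_job_runs is inspected (no NextToken handling).
    """
    response = glue_client.get_job_runs(JobName=job_name)
    run_ids = [
        run["Id"]
        for run in overdue_runs(response["JobRuns"], now)
        if run["JobRunState"] != "STOPPING"
    ]
    if run_ids:
        glue_client.batch_stop_job_run(JobName=job_name, JobRunIds=run_ids)
    return run_ids
```

With a real client this would be called as `stop_overdue_runs(boto3.client("glue"), "my-job")`, where `"my-job"` is a placeholder job name.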

Related

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want and dump the filtered results to S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200MB of json data.
My initial setup used environment class mw1.small with 10 worker machines. If I run the task only once in this setup, each run takes about 8 minutes to finish. But when I start the schedule to run every minute, most runs cannot finish, they start to take much longer (around 18 minutes), and this error message is displayed:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried upgrading the environment class to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but they still could not keep up with data arriving every minute. The Negsignal.SIGKILL error still appeared even before the worker count reached its maximum.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This makes sure each worker machine runs one job at a time, which prevents multiple jobs from sharing a worker and so saves memory and reduces runtime.
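
The same two options could also be applied programmatically. The sketch below assumes a boto3 MWAA client is passed in and uses its `get_environment` and `update_environment` calls; the helper name `apply_single_task_workers` is hypothetical. It merges the new options into the existing ones on the assumption that an update replaces the whole option set.

```python
# Options from the answer above; MWAA expects string values.
SINGLE_TASK_OPTIONS = {
    "celery.sync_parallelism": "1",
    "celery.worker_autoscale": "1,1",
}


def apply_single_task_workers(mwaa_client, environment_name):
    """Merge the single-task worker options into an environment's config.

    Returns the merged options dict that was sent to update_environment.
    """
    env = mwaa_client.get_environment(Name=environment_name)["Environment"]
    existing = env.get("AirflowConfigurationOptions", {})
    # Keep whatever was already configured; the new keys win on conflict.
    merged = {**existing, **SINGLE_TASK_OPTIONS}
    mwaa_client.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=merged,
    )
    return merged
```

With a real client this would be called as `apply_single_task_workers(boto3.client("mwaa"), "my-environment")`, where `"my-environment"` is a placeholder name.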

Why is AWS Glue taking time to start the execution?

Once execution starts, it completes in just a few seconds, but the pending phase before execution takes 10 to 15 minutes. I understand that this time is spent setting up the environment for the job, but in my case I need to run this job (transforming JSON) every 15 minutes. Will this work, or is there another option? Am I missing any configuration?

How to clear AWS Batch job history in dashboard

In the AWS Batch Job queues dashboard, it shows job counts by status, including failed and succeeded, for the last 24 hours. Is it possible to reset the counters to zero?
No, it's not possible to clear jobs. Batch keeps finished jobs around for at least a day (and in my experience occasionally up to a few weeks), and there's no API or console mechanism to accelerate the process.

AWS stop alarm not working

I cannot get alarms to work reliably for an AWS ec2 instance. I have a g2.xlarge image running and want it to stop when it is not in use, i.e. when average usage falls below 2%.
When I try 1 or 2 periods of 1 hour below 2% it usually works but then when I start it up again it immediately stops itself as it is in an alarm condition. I have tried 12 periods of 5 minutes which allows it to start ok but now it doesn't stop at all despite being in an alarm condition for several hours.
I have tried various options and can't nail down what makes it work and what doesn't. It feels as if sometimes things work and sometimes they don't. Is it buggy or am I missing something?
Here is a screenshot of my setup which has failed to trigger a shutdown...

Scheduled task "Daily every" not firing

I have the developers edition of CF running on my machine, and I have a job that is scheduled to run:
Daily every 9 min(s) from 12:01 AM to 12:59 PM
but it's not running.
I can press the "Run Scheduled Task" button and it runs, but it's not running on its own.
I have other jobs that run daily, but this one is not running every 9 minutes.
Check the scheduler.log file for the job's execution and its next rescheduling time. If it shows a time that is not what you have set, delete the job and recreate it.
I have faced the same problem, and this is how I got it running.
The best way to find out what's going on with the job is to take a look at the scheduler log in the CF Admin. After running the job, you should be able to check and see the next time it's scheduled to run.
Also, make sure the job isn't paused on the Scheduled Tasks page.