Why does AWS Glue say "Max concurrent runs exceeded", when there are no jobs running? - amazon-web-services

I have an AWS Glue job, with max concurrent runs set to 1. The job is currently not running. But when I try to run it, I keep getting the error: "Max concurrent runs exceeded".
Deleting and re-creating the job does not help. Also, other jobs in the same account run fine, so it cannot be a problem with account wide service quotas.
Why am I getting this error?

I raised this issue with AWS support, and they confirmed that it is a known bug:
I would like to inform you that this is a known bug, where an internal distributed counter that keeps track of job concurrency goes into a stale state due to an edge case, causing this error. Our internal Service team has to manually reset the counter to fix this issue. Service team has already added the bug fix in their product roadmap and will be working on it. Unfortunately I may not be able to comment on the ETA on the deployment, as we don’t have any visibility on product teams road map and fix release timeline.
The suggested workarounds are:
Increase the max concurrency to 2 or higher
Re-create the job with a different name

Glue container is start and its taking some time same when your job end container shutdown taking some time in between if you try to execute new Jon and default concurrency is 1 so you will get this error.
How to resolve:
Go to your Glue Job --> Under Job detail tab you can find "Maximum concurrency" default value is 1 change it to 3 or more as per your need.

I tried changing "Maximum concurrency" to 2 and then run it !
It worked but again running it cause the same issue, but I looked into my s3 ,it has dumped the data ,so it run for once!
I'm still looking for a stable solution but this may work!

Related

Aws Glue Workflow triggering multiple times one job (incorrect behavior)

I have a big glue workflow (about 100 jobs / crawlers), and it was executing properly until last week. Since then, my first conditional trigger (ALL), is executing 20 time the same job.
I've configured the job it self, to just allow 1 parallel execution, but every time the workflow executes, it tries to launch 20 times (the same job).
Also configured the workflow, to allow a max concurrency of 1, but that doesn't fix the problem.
Since i started working with glue workflow, i've noticed that the tool it self is buggy, old and maybe deprecated ?
Any tips on how to fix this problem ?
I too have faced similar problems. Even one is fixed, some other issue would occur lately. So my suggestion is to try using step functions.

Data flow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using service account with all required IAM roles
Generally The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h can be caused by too long setup progress. In order to solve this issue you can try to increase worker resources (via --machine_type parameter) to overcome the issue.
For example, While installing several dependencies that required building wheels (pystan, fbprophet) which will take more than an hour on the minimal machine (n1-standard-1 with 1 vCPU and 3.75GB RAM). Using a more powerful instance (n1-standard-4 which has 4 times more resources) will solve the problem.
You can debug this by looking at the worker startup logs in cloud logging. You are likely to see pip issues with installing dependencies.
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, maybe worker VMs are started but they can't reach the Dataflow service, which is often related to network connectivity.
Please note that by default, Dataflow creates jobs using the network and subnetwork default (please check if it exists on your project), and you can change to a specific one by specifying --subnetwork. Check https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnect every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without having to lose my progress. At the minute, it shows that there is "No Kernel" at the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections will be caused by inactivity because a job is running for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, you can now run an EMR Spark query directly on the EMR cluster since December 2021:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a useful blog here https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/ which is helpful in getting you up and running.
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me a quick solution was to open a Terminal instead, save the notebook file as a Pytohn file, and run it from the terminal within Sagemaker.

Why do Dataflow steps not start?

I have a linear three step Dataflow pipeline - for some reason the last step started, but the preceding two steps hung in Not started for a long time before I gave up and killed the job. I'm not sure what caused this, as this same pipeline had successfully run in the past, and I'm surprised it didn't show any errors in the logs as to what was preventing the first two steps from starting. What can cause such a situation and how can I prevent it from happening?
This was happening because of an error in the worker start up. Certain Dataflow steps do not seem to require workers (e.g. writing to GCS), which is why that step was able to start - i.e. that step starting does not imply that workers are being created correctly. Worker start up is not displayed in the job logs by default - you need to click the link to Stackdriver in the job logs and then add worker-startup in the logs drop down in order to see any of those errors.

Job Scheduling in SAS Data Integration Studio

i want to schedule a job in SAS-DIS. i tried the process using sas management console,bt an error is popping up saying scheluing server not found.
can anyone help me how to setup a scheduling server? or is it a software to be installed?
Thanks
I think a scheduling server is an extra package that has to be purchased. Our BI setup is lacking that option and no matter what we can't seem to get it approved. Check with your SAS server admin to see if the job scheduling has been enabled. If so he/she should be able to tell you the process for getting it scheduled.
Alternatively, without a scheduling server you still deploy your jobs and can either use
1. Cron and Crontab (in Unix or Linux)
2. Windows OS scheduler
to schedule jobs manually as this is the best option available if there is none. I know this can be very tedious and cumbersome , but can give it a try if you have less number of jobs to schedule.