Why do Dataflow steps not start? - google-cloud-platform

I have a linear three-step Dataflow pipeline - for some reason the last step started, but the preceding two steps hung in the "Not started" state for a long time before I gave up and killed the job. I'm not sure what caused this, as this same pipeline had run successfully in the past, and I'm surprised it didn't show any errors in the logs as to what was preventing the first two steps from starting. What can cause such a situation and how can I prevent it from happening?

This was happening because of an error during worker startup. Certain Dataflow steps do not seem to require workers (e.g. writing to GCS), which is why that step was able to start - i.e. that step starting does not imply that workers are being created correctly. Worker startup logs are not displayed in the job logs by default - you need to click the link to Stackdriver in the job logs and then select worker-startup in the logs drop-down in order to see any of those errors.
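If you want to pull those same worker-startup entries programmatically rather than through the Stackdriver UI, a minimal sketch using the google-cloud-logging client could look like the following (YOUR_PROJECT and YOUR_JOB_ID are placeholders you would replace):

# Sketch: list worker-startup log entries for a Dataflow job.
# YOUR_PROJECT and YOUR_JOB_ID are placeholders.
from google.cloud import logging

client = logging.Client(project="YOUR_PROJECT")

# Dataflow worker startup logs live under this log name; narrow them
# down to one job via the dataflow_step resource labels.
log_filter = (
    'logName="projects/YOUR_PROJECT/logs/'
    'dataflow.googleapis.com%2Fworker-startup" '
    'AND resource.type="dataflow_step" '
    'AND resource.labels.job_id="YOUR_JOB_ID"'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)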

Related

AWS Glue workflow triggering one job multiple times (incorrect behavior)

I have a big Glue workflow (about 100 jobs / crawlers), and it was executing properly until last week. Since then, my first conditional trigger (ALL) has been executing the same job 20 times.
I've configured the job itself to allow just 1 parallel execution, but every time the workflow executes, it tries to launch the same job 20 times.
I also configured the workflow to allow a max concurrency of 1, but that doesn't fix the problem.
Since I started working with Glue workflows, I've noticed that the tool itself is buggy, old and maybe deprecated?
Any tips on how to fix this problem?
I too have faced similar problems. Even when one is fixed, some other issue appears later. So my suggestion is to try using Step Functions.
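As a rough illustration of that suggestion (the state machine name, role ARN and Glue job name below are all placeholders), a minimal Step Functions definition that runs a Glue job and waits for it to finish might look like this sketch:

# Sketch: create a minimal Step Functions state machine that runs a
# Glue job synchronously (all names and ARNs are placeholders).
import json
import boto3

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # The .sync integration waits for the Glue job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "your-glue-job"},
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="run-your-glue-job",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/your-stepfunctions-role",
)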

Dataflow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all required IAM roles.
Generally, The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h is caused by worker setup taking too long. To solve this issue you can try to increase worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1 with 1 vCPU and 3.75GB RAM). Using a more powerful instance (n1-standard-4, which has 4 times more resources) will solve the problem.
You can debug this by looking at the worker startup logs in Cloud Logging. You are likely to see pip issues with installing dependencies.
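As a hedged sketch of that fix (assuming the Apache Beam Python SDK; the project, bucket and region are placeholders), the worker machine type can be set through the pipeline options when launching the job:

# Sketch: launch a Beam pipeline on Dataflow with a larger worker
# machine type (project, bucket and region are placeholders).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="YOUR_PROJECT",
    region="us-central1",
    temp_location="gs://YOUR_BUCKET/temp",
    machine_type="n1-standard-4",  # more CPU/RAM for slow dependency installs
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create([1, 2, 3])
     | "Print" >> beam.Map(print))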
Do you have any error logs showing that the Dataflow workers are crashing when trying to start?
If not, maybe the worker VMs are started but can't reach the Dataflow service, which is often related to network connectivity.
Please note that by default Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project), and you can switch to a specific one by specifying --subnetwork. Check https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
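Under the same assumptions as the sketch above (Beam Python SDK, placeholder names), a specific subnetwork can be passed as a pipeline option, for example:

# Sketch: point Dataflow workers at a specific subnetwork
# (project, region and subnetwork names are placeholders).
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="YOUR_PROJECT",
    region="us-central1",
    temp_location="gs://YOUR_BUCKET/temp",
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/YOUR_PROJECT/"
        "regions/us-central1/subnetworks/YOUR_SUBNETWORK"
    ),
)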

Why does AWS Glue say "Max concurrent runs exceeded", when there are no jobs running?

I have an AWS Glue job, with max concurrent runs set to 1. The job is currently not running. But when I try to run it, I keep getting the error: "Max concurrent runs exceeded".
Deleting and re-creating the job does not help. Also, other jobs in the same account run fine, so it cannot be a problem with account wide service quotas.
Why am I getting this error?
I raised this issue with AWS support, and they confirmed that it is a known bug:
I would like to inform you that this is a known bug, where an internal distributed counter that keeps track of job concurrency goes into a stale state due to an edge case, causing this error. Our internal Service team has to manually reset the counter to fix this issue. Service team has already added the bug fix in their product roadmap and will be working on it. Unfortunately I may not be able to comment on the ETA on the deployment, as we don’t have any visibility on product teams road map and fix release timeline.
The suggested workarounds are:
Increase the max concurrency to 2 or higher
Re-create the job with a different name
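For reference, here is a sketch of checking and bumping that setting with boto3 (the job name is a placeholder). Note that update_job replaces the job definition, so fields omitted from JobUpdate may be reset to defaults - verify the job configuration afterwards:

# Sketch: inspect and raise a Glue job's max concurrent runs with boto3
# (job name is a placeholder). update_job replaces the job definition,
# so fields left out of JobUpdate may be reset to their defaults.
import boto3

glue = boto3.client("glue")
job = glue.get_job(JobName="your-glue-job")["Job"]
print(job.get("ExecutionProperty", {}))  # current MaxConcurrentRuns

glue.update_job(
    JobName="your-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": job.get("DefaultArguments", {}),
        "ExecutionProperty": {"MaxConcurrentRuns": 2},
    },
)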
The Glue container takes some time to start, and likewise takes some time to shut down after your job ends. If you try to execute a new job in that window, and the default concurrency is 1, you will get this error.
How to resolve:
Go to your Glue job --> under the Job details tab you can find "Maximum concurrency" (default value is 1); change it to 3 or more as per your need.
I tried changing "Maximum concurrency" to 2 and then ran it.
It worked, but running it again caused the same issue. However, I looked into my S3 bucket and it had dumped the data, so it did run once!
I'm still looking for a stable solution, but this may work!

Amazon EMR: Only start new scheduled job if previous job has finished

I have an AWS EMR cluster job which runs every 2 hours. I have set up a schedule using a CloudWatch rule to run it every two hours.
But sometimes the next job (which runs 2 hours after the previous one) starts when the previous one has not finished, as it can take more than 2 hours to complete depending on the data to be processed.
I need some configuration by which I could prevent the next job from starting if the previous job is still running.
I tried but couldn't find any such setting. Does anyone know how to do that?
Add them as EMR steps. EMR steps run sequentially by default (unless you change the step concurrency setting).
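As a sketch of that approach (the cluster ID, bucket and script path are placeholders), each scheduled run could submit a step to the existing cluster instead of launching an independent job, and the cluster will work through queued steps one at a time:

# Sketch: submit work as an EMR step so runs queue up sequentially
# (cluster ID and script path are placeholders).
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "two-hourly-batch",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://your-bucket/your_job.py"],
            },
        }
    ],
)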

Google Cloud DataPrep schedule is spawning multiple DataFlow jobs

I have a schedule which runs my flow twice a day - at 0910 and 1520 BST.
This is spawning a massive number of DataFlow jobs - so far today just the second schedule (1520) has spawned 80 jobs:
$ gcloud dataflow jobs list
JOB_ID NAME TYPE CREATION_TIME STATE REGION
2018-07-29_12_17_06-14876588186269022154 project-name-513008-by-username Batch 2018-07-29 19:17:07 Running us-central1
2018-07-29_12_14_54-6436458673562317581 project-name-512986-by-username Batch 2018-07-29 19:14:55 Cancelled us-central1
2018-07-29_12_13_55-6167618802124600084 project-name-512985-by-username Batch 2018-07-29 19:13:57 Cancelled us-central1
...
(see PasteBin for the full list)
In the days after the DataPrep update last week, I had trouble accessing the run settings url for the flow. I suspect that there's a process as part of the run settings which walks back through the flow (I have 12 flows chained by reference datasets) and sanity checks it - it seems that my flow was just on the cusp of being complex enough to cause the page load to time out, and I had to cut out a couple of steps just to get to the run settings.
I wonder if each time this timed out, it somehow duplicated the schedule or something else in the process - but then again, the number of duplicated jobs is inconsistent.
I recently rebuilt this project after seeing some issues with sampling errors (in that the sample was corrupt, so I couldn't load the transformation UI, but also couldn't build a new sample). After a hefty attempt at resolving the issue, I took the chance to rebuild as a dedicated GCP project with structure improvements, etc. I didn't see this scheduling error before the rebuild.