Dataflow pipeline got stuck - google-cloud-platform

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all the required IAM roles.

Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" can be caused by worker setup taking too long. To solve this, you can try increasing worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1 with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) can solve the problem.
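As an illustration, here is a minimal sketch of requesting a larger machine type when launching a Beam Python pipeline on Dataflow; the project, region, and bucket values are placeholders, not taken from the original question:

```python
# Minimal sketch: request bigger workers so dependency installation (e.g.
# building pystan/fbprophet wheels) finishes before the 1h inactivity check.
# Project, region, and bucket values below are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    machine_type="n1-standard-4",          # instead of the default n1-standard-1
    requirements_file="requirements.txt",  # heavy dependencies built at worker startup
)

with beam.Pipeline(options=options) as pipeline:
    pipeline | beam.Create(["check worker sizing"]) | beam.Map(print)
```

The same settings can be passed as command-line flags (--machine_type, --requirements_file) when the pipeline builds its options from sys.argv.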
You can debug this by looking at the worker startup logs in Cloud Logging; you are likely to see pip errors while installing dependencies.
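For reference, a rough sketch of pulling those worker-startup entries with the Cloud Logging client library; the project ID and job ID are placeholders you would replace with your own:

```python
# Rough sketch: list the worker-startup log entries for a Dataflow job and scan
# them for pip/dependency errors. Project and job IDs are placeholders.
from google.cloud import logging

client = logging.Client(project="my-project")
job_id = "2020-01-01_00_00_00-1234567890123456789"

log_filter = (
    'resource.type="dataflow_step" '
    f'AND resource.labels.job_id="{job_id}" '
    'AND log_name:"worker-startup"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)
```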

Do you have any error logs showing that the Dataflow workers are crashing when trying to start?
If not, the worker VMs may be starting but unable to reach the Dataflow service, which is often related to network connectivity.
Please note that by default, Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project); you can switch to a specific one by specifying --subnetwork. See https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
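A minimal sketch of pinning the job to a specific VPC network/subnetwork from the Beam Python SDK; the network, subnetwork, project, and bucket names are placeholders, and whether you also disable public IPs depends on your VPC setup:

```python
# Minimal sketch: run Dataflow workers on an explicit VPC network/subnetwork.
# Network, subnetwork, project, and bucket names are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    network="my-vpc",
    subnetwork="regions/europe-west1/subnetworks/my-subnet",
    use_public_ips=False,  # only if the subnet has Private Google Access enabled
)
```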

Related

Dataflow job failing due to ZONE_RESOURCE_POOL_EXHAUSTED in europe-west3 region

My Dataflow job has been failing since 7 AM this morning with the error:
Startup of the worker pool in zone europe-west3-c failed to bring up any of the desired 1 workers. ZONE_RESOURCE_POOL_EXHAUSTED: Instance '' creation failed: The zone 'projects//zones/europe-west3-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
I tried to launch the job in europe-west3-a and europe-west3-b and I get the same error. It's been well over 12 hours, but the problem persists. I know this is not a general resource-availability problem, as I can create a new VM in that region without any problems.
I even have a case open with Google Support, but unfortunately they don't seem to read my ticket and simply send a standard reply asking me to do things I've already tried.
Any idea what I can do here?
Update 1:
I tried to create a new job with --worker-machine-type=e2-standard-2 and that works. The problem seems to be related to the machine type the Dataflow service picks by default.
Update 2:
We are now going into day 2 of the problem in europe-west3. Our dev environment is in europe-west1 and this problem doesn't occur there.
This error occurs when Compute Engine resources (such as a particular machine type or GPUs) are temporarily unavailable in that zone.
It is not related to your Compute Engine quota.
You can resolve the issue by creating the resource in another zone within the region, or in a different region.
You can read more information and other resolutions for this error in this document
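For completeness, a minimal sketch of steering the Dataflow workers to a different region/zone and machine family, assuming the Beam Python SDK; the project, bucket, and exact zone values are placeholder assumptions:

```python
# Minimal sketch: move Dataflow workers to a zone/region with capacity and use
# a machine family that is available (e2-standard-2 worked per Update 1).
# Project, bucket, and zone values are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    temp_location="gs://my-bucket/tmp",
    region="europe-west1",         # fall back to a region that has capacity
    worker_zone="europe-west1-b",  # optional: pin a specific zone within it
    machine_type="e2-standard-2",
)
```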

Why does AWS Glue say "Max concurrent runs exceeded", when there are no jobs running?

I have an AWS Glue job, with max concurrent runs set to 1. The job is currently not running. But when I try to run it, I keep getting the error: "Max concurrent runs exceeded".
Deleting and re-creating the job does not help. Also, other jobs in the same account run fine, so it cannot be a problem with account-wide service quotas.
Why am I getting this error?
I raised this issue with AWS support, and they confirmed that it is a known bug:
I would like to inform you that this is a known bug, where an internal distributed counter that keeps track of job concurrency goes into a stale state due to an edge case, causing this error. Our internal Service team has to manually reset the counter to fix this issue. Service team has already added the bug fix in their product roadmap and will be working on it. Unfortunately I may not be able to comment on the ETA on the deployment, as we don’t have any visibility on product teams road map and fix release timeline.
The suggested workarounds are:
Increase the max concurrency to 2 or higher
Re-create the job with a different name
The Glue container takes some time to start up, and likewise takes some time to shut down after your job ends. If you try to start a new run during that window and the default concurrency is 1, you will get this error.
How to resolve:
Go to your Glue job --> under the Job details tab you can find "Maximum concurrency"; the default value is 1, change it to 3 or more as per your need.
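If you prefer to script the change rather than use the console, here is a rough boto3 sketch; the job name is a placeholder, and depending on how your job is defined you may need to drop or keep additional fields before calling update_job:

```python
# Rough sketch: raise MaxConcurrentRuns on an existing Glue job with boto3.
# The job name is a placeholder; update_job replaces the whole definition, so
# we start from get_job and strip fields that JobUpdate does not accept.
import boto3

glue = boto3.client("glue")
job_name = "my-glue-job"

job = glue.get_job(JobName=job_name)["Job"]
job["ExecutionProperty"] = {"MaxConcurrentRuns": 2}

# Response-only fields are not valid in JobUpdate.
for field in ("Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity"):
    job.pop(field, None)
# If the job uses WorkerType/NumberOfWorkers, MaxCapacity must not be sent too.
if "WorkerType" in job:
    job.pop("MaxCapacity", None)

glue.update_job(JobName=job_name, JobUpdate=job)
```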
I tried changing "Maximum concurrency" to 2 and then ran it.
It worked, but running it again caused the same issue. However, I looked into my S3 bucket and it had dumped the data, so it did run once.
I'm still looking for a stable solution, but this may work!

Can't use Composer environment after re-enabling service

I have been trying to find a way to save on the costs of Airflow by disabling it when not in use. I have discovered that if we disable the composer.googleapis.com service while it is not in use, Google does not charge for the service while it is disabled, although it does continue to charge for other resources that are still active. Unfortunately, if the service is disabled for more than an hour or so, it is not usable after re-enabling it. After the service has been disabled for an extended period of time, the Composer Environment Details page shows
An error occurred with retrieving the last operation on this environment
and
This environment cannot be edited due to the errors that occurred during environment creation/update. Please investigate the logs to determine the cause, or create a new environment.
And gcloud composer environments describe shows state: ERROR
The one error that I did see in the logs was a duplicate key when the airflow_monitoring DAG was rescheduled after a little over an hour. I therefore created a new Composer environment, disabled all DAGs, disabled the composer service, waited a while, then enabled it again. The environment was once again in an error state.
The Cloud Composer documentation states:
If you disable the Cloud Composer API, environments become unusable within an hour of service deactivation unless you re-enable the API. If you re-enable the API, you are billed for the service usage that occurs while the Cloud Composer service is deactivating.
Maybe this is poorly worded, but to me it sounds like the environment would become unusable within an hour if you disable the API, but if you re-enable it any time later, it will become usable again. I am wondering if it really means that once you disable it, you must re-enable it within 1 hour or it will become permanently unusable.
Is there a way to disable the composer.googleapis.com service for longer than an hour and then get it working again after the service has been re-enabled? Is there something I can restart, or some way to clear the error state? Is there more I should do before disabling it?
I am using composer-1.10.4-airflow-1.10.6 with Python 3.
Thanks.
No, there is no way to disable the composer.googleapis.com service for more than an hour and then have environments be functional after re-enablement.
GCP services are not meant to be enabled/disabled on the fly in this manner, and disablement of a service is meant to be performed with the intention of disabling it for the long term. Keeping a service disabled for long enough means Google-managed components created for the service (specifically for your project) will be decommissioned, and in Composer's case, this will render your environments permanently unusable.
The error state in the environment cannot be cleared. If you want to save on costs, you should delete Composer environments as opposed to deactivating the service entirely. The "service" is not cluster-like and isn't meant to be toggled on and off.
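As a hedged illustration of the delete-instead-of-disable approach, here is a rough sketch assuming the google-cloud-orchestration-airflow client library; the project, location, and environment names are placeholders:

```python
# Rough sketch: delete a Composer environment to stop paying for it, rather
# than disabling the composer.googleapis.com service. Names are placeholders.
from google.cloud.orchestration.airflow import service_v1

client = service_v1.EnvironmentsClient()
name = "projects/my-project/locations/us-central1/environments/my-environment"

operation = client.delete_environment(name=name)
operation.result()  # block until the deletion finishes
print("Environment deleted:", name)
```

Deleting an environment removes its Airflow metadata, so when it is needed again you would recreate it from your configuration and DAG code (for example with gcloud composer environments create).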

Google Cloud Dataprep job failing with error message

I have simple Dataprep jobs that transfer GCS data to BigQuery. Until today, the scheduled jobs were running fine, but today two jobs failed and two jobs succeeded only after taking between half an hour and an hour.
Error message I am getting is below:
java.lang.RuntimeException: Failed to create job with prefix beam_load_clouddataprepcmreportalllobmedia4505510bydataprepadmi_aef678fce2f441eaa9732418fc1a6485_2b57eddf335d0c0b09e3000a805a73d6_00001_00000, reached max retries: 3, last failed job:
I ran the same job again; it again took a very long time and failed, but this time with a different message:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Any pointers or direction on the possible cause? Also, links or troubleshooting tips for Dataprep or Dataflow jobs would be appreciated.
Thank you
There are many potential causes for the jobs getting stuck: transient issues, a quota/limit being reached, a change in data format/size, or another issue with the resources being used. I suggest starting the troubleshooting from the Dataflow side.
Here are some useful resources that can guide you through the most common job errors, and how to troubleshoot them:
Troubleshooting your pipeline
Dataflow common errors
In addition, you could check the Google Issue Trackers for Dataprep and Dataflow to see if the issue has been reported before:
Issue tracker for Dataprep
Issue tracker for Dataflow
You can also look at the GCP status dashboard to rule out a widespread issue with a service:
Google Cloud Status Dashboard
Finally, if you have GCP support, you can reach out to them directly. If you don't have support, you can use the Issue Tracker to create a new issue for Dataprep and report the behavior you're seeing.

Job Scheduling in SAS Data Integration Studio

I want to schedule a job in SAS DIS. I tried the process using SAS Management Console, but an error pops up saying the scheduling server was not found.
Can anyone help me with how to set up a scheduling server? Or is it software that has to be installed?
Thanks
I think a scheduling server is an extra package that has to be purchased. Our BI setup lacks that option, and no matter what, we can't seem to get it approved. Check with your SAS server admin to see whether job scheduling has been enabled; if so, they should be able to tell you the process for getting your job scheduled.
Alternatively, without a scheduling server you can still deploy your jobs and use either
1. cron and crontab (on Unix or Linux), or
2. the Windows OS scheduler
to schedule the jobs manually; this is the best option available when there is no scheduling server. I know this can be very tedious and cumbersome, but you can give it a try if you have a small number of jobs to schedule.