simple AWS Batch array job stuck in pending - amazon-web-services

I'm following the print-color AWS Batch tutorial for an array job from the official AWS Batch User Guide (page 23, https://docs.aws.amazon.com/batch/latest/userguide/batch_user.pdf). It is supposed to be a very simple tutorial, but my submitted array job stays stuck in PENDING indefinitely.
Does anybody have an idea? I can't find any more information that would tell me whether there is a bug, and there is nothing in CloudWatch. (Screenshot: pending job)
Thanks in advance.
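As a first troubleshooting step (a minimal sketch, not from the original post; the job ID is a placeholder and boto3 with the usual AWS credentials is assumed), the job's statusReason and the state of the compute environment can be inspected:

# Sketch: inspect why an AWS Batch array job has not progressed.
# The job ID is a placeholder; boto3 and configured AWS credentials are assumed.
import boto3

batch = boto3.client("batch")

# For an array job, describe the parent first; child jobs are <job-id>:<index>.
resp = batch.describe_jobs(jobs=["REPLACE-WITH-YOUR-JOB-ID"])
for job in resp["jobs"]:
    # statusReason often explains why a job is stuck before reaching RUNNING.
    print(job["jobName"], job["status"], job.get("statusReason"))

# Jobs can also stall when the compute environment has no capacity; check its state.
envs = batch.describe_compute_environments()
for env in envs["computeEnvironments"]:
    print(env["computeEnvironmentName"], env["state"], env["status"])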

Related

Dataflow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all the required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" can be caused by setup taking too long. To work around it, you can try increasing the worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1 with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) will solve the problem.
You can debug this by looking at the worker startup logs in Cloud Logging. You are likely to see pip issues while installing dependencies.
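As an illustration (a minimal sketch, not from the answer above; the project, region, and bucket values are placeholders), the machine type can be set through the Beam Python pipeline options:

# Sketch: raising worker resources for a Dataflow job with the Beam Python SDK.
# All project/region/bucket values are placeholders; machine_type is the point here.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    machine_type="n1-standard-4",  # larger workers so dependency builds finish within the startup window
)

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(print)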
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, maybe the worker VMs start but cannot reach the Dataflow service, which is often related to network connectivity.
Please note that, by default, Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project); you can point the job at a specific one with --subnetwork. See https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
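For example (a sketch with placeholder values, assuming a Beam Python pipeline), the subnetwork can be passed as a pipeline option:

# Sketch: pointing Dataflow workers at a specific VPC subnetwork.
# Values are placeholders; the subnetwork must exist in the same region as the workers.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    subnetwork="regions/us-central1/subnetworks/my-subnet",
)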

EMR not generating step logs

For some reason, I do not see step logs for my jobs in EMR. It used to work fine a while back, but it just stopped logging.
I checked HDFS at the path /mnt/var/log/hadoop/steps/, but there are no logs there. The steps complete successfully; there are just no logs.
Is there anything I can do to find the issue and get logging back to work?
Thanks in advance for taking time to read and respond.
All the best.

BigQuery job already exists

In one of our GCP projects, multiple BigQuery scheduled queries are running, but recently jobs started failing with the following error:
Error Message: Already Exists: Job <PROJECT ID>:US.scheduled_query_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx; JobID:<PROJECT NUMBER>:scheduled_query_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
The write preference of these queries is "WRITE_APPEND". These jobs were created using the GCP console.
Once retriggered, these jobs run successfully with a new job ID.
Please help me understand why an already-used job ID is being allocated to these scheduled queries, and please suggest a fix if one is available.
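To illustrate the mechanics of the error (a sketch, not tied to the scheduled queries above; project, dataset, and table names are placeholders): BigQuery job IDs must be unique within a project, so submitting a job with an ID that was already used fails with Already Exists, while retrying with a fresh ID succeeds.

# Sketch of how "Already Exists" arises: reusing a job ID within a project fails.
# Project, dataset, and table names are placeholders.
from google.api_core.exceptions import Conflict
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.my_table",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
sql = "SELECT CURRENT_TIMESTAMP() AS ts"

# First submission with an explicit job ID succeeds.
client.query(sql, job_config=job_config, job_id="my-fixed-job-id").result()

try:
    # Submitting again with the same job ID raises 409 Already Exists.
    client.query(sql, job_config=job_config, job_id="my-fixed-job-id").result()
except Conflict as exc:
    print("Job ID collision:", exc)

# Retrying with a new (auto-generated) job ID succeeds, as observed in the question.
client.query(sql, job_config=job_config).result()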

Google Cloud Dataprep job failing with error message

I have simple Dataprep jobs that transfer GCS data to BigQuery. Until today the scheduled jobs were running fine, but today two jobs failed and two jobs succeeded only after taking between half an hour and an hour.
The error message I am getting is below:
java.lang.RuntimeException: Failed to create job with prefix beam_load_clouddataprepcmreportalllobmedia4505510bydataprepadmi_aef678fce2f441eaa9732418fc1a6485_2b57eddf335d0c0b09e3000a805a73d6_00001_00000, reached max retries: 3, last failed job:
I ran the same job again; it again took a very long time and failed, but this time with a different message:
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
Any pointers or direction on the possible cause? Links or troubleshooting tips for Dataprep or Dataflow jobs are also appreciated.
Thank you
There could be many potential causes for the jobs to get stuck: transient issues, some quota or limit being reached, a change in data format or size, or another issue with the resources being used. I suggest starting the troubleshooting from the Dataflow side.
Here are some useful resources that can guide you through the most common job errors, and how to troubleshoot them:
Troubleshooting your pipeline
Dataflow common errors
In addition, you could check the Google Issue Trackers for Dataprep and Dataflow to see if the issue has been reported before:
Issue tracker for Dataprep
Issue tracker for Dataflow
You can also look at the GCP Status Dashboard to rule out a widespread issue with some service:
Google Cloud Status Dashboard
Finally, if you have GCP support, you can reach out to them directly. If you don't have support, you can use the Issue Tracker to create a new issue for Dataprep and report the behavior you're seeing.

Way to trigger Dataflow only after a BigQuery job finishes

Currently, my data goes through the following steps:
New objects in a GCS bucket trigger a Google Cloud Function that creates a BigQuery job to load this data into BigQuery.
I need a low-cost solution to know when this BigQuery job is finished and to trigger a Dataflow pipeline only after the job has completed.
Notes:
I know about the BigQuery alpha trigger for Google Cloud Functions, but I don't know if it is a good idea: from what I saw, this trigger uses the job ID, which apparently cannot be fixed, so whenever a job runs I would apparently have to deploy the function again. And of course it's an alpha solution.
I read about a Stackdriver Logging -> Pub/Sub -> Google Cloud Function -> Dataflow solution, but I didn't find any log entry that indicates that the job finished.
My files are large, so it isn't a good idea to have a Google Cloud Function wait until the job finishes.
Despite your remark about Stackdriver Logging, you can use it with this filter:
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state="DONE"
severity="INFO"
You can additionally add a dataset filter if needed.
Then create a sink on this advanced filter that triggers your Function (for example via Pub/Sub), and run your Dataflow job from there.
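A minimal sketch of the Function side, assuming the sink routes to a Pub/Sub topic that triggers this function and that the downstream pipeline is packaged as a Dataflow template (project, region, and template path are placeholders):

# Sketch of a Pub/Sub-triggered Cloud Function that launches a Dataflow template
# once the BigQuery "job DONE" log entry arrives via the logging sink.
# Project, region, and template values are placeholders.
import base64
import json

from googleapiclient.discovery import build

PROJECT = "my-project"
REGION = "us-central1"
TEMPLATE = "gs://my-bucket/templates/my-template"


def on_bq_job_done(event, context):
    # Background Cloud Function triggered by the Pub/Sub topic behind the sink.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # Field layout follows the BigQuery audit log format used in the filter above.
    job = entry["protoPayload"]["serviceData"]["jobCompletedEvent"]["job"]
    print("BigQuery job finished:", job["jobName"]["jobId"])

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={"jobName": "dataflow-after-bq-load"},
    )
    print(request.execute())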
If this doesn't match your expectation, can you detail why?
You can look at Cloud Composer, which is managed Apache Airflow, for orchestrating jobs in a sequential fashion. Composer runs a DAG, executing each node and checking dependencies, so that tasks run in parallel or sequentially based on the conditions you have defined.
You can take a look at the example mentioned here - https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples
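As a rough illustration (a sketch using the Google provider operators for Airflow; the DAG ID, query, template path, and region are placeholders, not taken from the linked example), such a DAG could chain the BigQuery job and the Dataflow job:

# Sketch of a Composer/Airflow DAG that runs a Dataflow job only after a BigQuery
# job has completed. Operator availability depends on the installed Google provider;
# all IDs, queries, and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="bq_then_dataflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered externally, e.g. when new GCS objects arrive
    catchup=False,
) as dag:
    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bq",
        configuration={
            "query": {
                "query": "SELECT 1",  # placeholder for the real load/transform job
                "useLegacySql": False,
            }
        },
    )

    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow",
        job_name="dataflow-after-bq",
        template="gs://my-bucket/templates/my-template",
        location="us-central1",
    )

    # Airflow starts the Dataflow task only once the BigQuery task has succeeded.
    load_to_bq >> run_dataflow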