BigQuery unable to insert job. Workflow failed - google-cloud-platform

I need to run a batch job from GCS to BigQuery via Dataflow and Beam. All my files are Avro with the same schema.
I've created a Dataflow Java application that runs successfully on a smaller set of data (~1 GB, about 5 files).
But when I try to run it on a bigger set of data (>500 GB, >1000 files), I receive this error message:
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix 1b83679a4f5d48c5b45ff20b2b822728_6e48345728d4da6cb51353f0dc550c1b_00001_00000, reached max retries: 3, last failed load job: ...
After 3 retries it terminates with:
Workflow failed. Causes: S57....... A work item was attempted 4 times without success....
This step is the load to BigQuery.
Stackdriver says the processing is stuck in step ....for 10m00s... and
Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes.....
I looked up the 409 error code, which indicates that I might have an existing job, dataset, or table. I've removed all the tables and re-run the application, but it still shows the same error message.
I am currently limited to 65 workers, and they are running on n1-standard-4 machines.
I believe there are other ways to move the data from GCS to BigQuery, but I need to demonstrate Dataflow.
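For reference (not from the original post), a minimal sketch of the same kind of batch pipeline using the Beam Python SDK, with placeholder project, bucket and table names, just to make the setup concrete:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Read Avro files from GCS and load them into an existing BigQuery table
# via BigQuery load jobs (the same mechanism that is failing above).
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadAvro" >> beam.io.ReadFromAvro("gs://my-bucket/input/*.avro")
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.my_table",
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         method=beam.io.WriteToBigQuery.Method.FILE_LOADS))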

"java.lang.RuntimeException: Failed to create job with prefix beam_load_csvtobigqueryxxxxxxxxxxxxxx, reached max retries: 3, last failed job: null.
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:198)..... "
One possible cause could be a permissions issue. Ensure the user account that interacts with BigQuery has the "bigquery.jobs.create" permission, which is included in the predefined "BigQuery User" role.
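A quick way to confirm that permission, as a minimal sketch (assuming the google-cloud-bigquery client library and the same credentials the Dataflow job uses; the project id is a placeholder):

from google.api_core.exceptions import Forbidden
from google.cloud import bigquery

# Creating any job (even a trivial query job) requires bigquery.jobs.create,
# so a 403 Forbidden here points at the missing permission.
client = bigquery.Client(project="my-project")
try:
    client.query("SELECT 1").result()
    print("bigquery.jobs.create is granted")
except Forbidden as exc:
    print("missing bigquery.jobs.create:", exc)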

Posting the comment of @DeaconDesperado as community wiki: they experienced the same error, and what they did was remove the characters that are not allowed in the table name (BigQuery table names may contain only Unicode letters, marks, numbers, connectors, dashes or spaces), after which the error was gone.
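A rough sketch of that idea (deliberately conservative: BigQuery accepts a wider character set, but restricting table names to ASCII letters, digits and underscores sidesteps the problem; the example name is made up):

import re

def sanitize_table_name(name):
    # Replace anything outside a conservative character set with "_".
    return re.sub(r"[^A-Za-z0-9_]", "_", name)

print(sanitize_table_name("events 2021-07"))  # -> events_2021_07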

I got the same problem using "roles/bigquery.jobUser", "roles/bigquery.dataViewer", and "roles/bigquery.user", but the issue was only resolved once I granted "roles/bigquery.admin".

Related

Unable to create environments on Google Cloud Composer

I tried to create a Google Cloud Composer environment, but on the setup page I get the following errors:
Service Error: Failed to load GKE machine types. Please leave the field
empty to apply default values or retry later.
Service Error: Failed to load regions. Please leave the field empty to
apply default values or retry later.
Service Error: Failed to load zones. Please leave the field empty to apply
default values or retry later.
Service Error: Failed to load service accounts. Please leave the field
empty to apply default values or retry later.
The only parameters GCP lets me change are the region and the number of nodes, but it still lets me create the environment. After 30 minutes the environment crashes with the following error:
CREATE operation on this environment failed 1 day ago with the following error message:
Http error status code: 400
Http error message: BAD REQUEST
Errors in: [Web server]; Error messages:
Failed to deploy the Airflow web server. This might be a temporary issue. You can retry the operation later.
If the issue persists, it might be caused by problems with permissions or network configuration. For more information, see https://cloud.google.com/composer/docs/troubleshooting-environment-creation.
An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2021-07-20T14:31:23.047Z7050.xd.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
Got error "Another operation failed." during CP_DEPLOYMENT_CREATING_STANDARD []
Is it a problem with permissions? If so, what permissions do I need? Thank you!
It looks like more of a temporary issue:
The first set of errors states that the console cannot load the metadata (regions list, zones list, and so on); you don't have a clear PERMISSION_DENIED error.
The second error also suggests: "This might be a temporary issue."

AWS Eventbridge Events (Sagemaker training job status change) fired multiple times with the same payload

I created an event rule for the SageMaker training job state change in CloudWatch to monitor my training jobs. I then use these events to trigger a Lambda function that sends messages to a Telegram group as a bot. This way I receive a message every time one of the training jobs changes its status. It works, but there is a problem with the events: they are fired multiple times with the exact same payload, so I receive tons of duplicate messages.
Since all the payloads are identical (except for the field LastModifiedTime), I cannot filter them in the Lambda. Unfortunately I don't have the AWS Developer plan, so I cannot receive support from Amazon. Any ideas?
EDIT
There are no duplicate rules/events. I also noticed that enabling the SageMaker profiler (which is now on by default) causes the number of identical rule invocations to literally explode. All of them have the same payload except for LastModifiedTime, so I suspect there is a bug in AWS here. One solution could be to implement some sort of data retention in the Lambda and check whether an invocation has already been processed, but I don't want to complicate something that should be very simple. I just tried to launch a new training job and got this sequence (I only report the fields I parse):
Status: InProgress
Secondary Status: Starting
Status Message: Launching requested ML instances
Status: InProgress
Secondary Status: Starting
Status Message: Starting the training job
Status: InProgress
Secondary Status: Starting
Status Message: Starting the training job
Status: InProgress
Secondary Status: Starting
Status Message: Starting the training job
Status: InProgress
Secondary Status: Starting
Status Message: Preparing the instances for training
Status: InProgress
Secondary Status: Downloading
Status Message: Downloading input data
Status: InProgress
Secondary Status: Training
Status Message: Downloading the training image
Status: InProgress
Secondary Status: Training
Status Message: Training in-progres
Status: InProgress
Secondary Status: Training
Status Message: Training image download completed. Training in progress
Duplicate messages can happen, but they should be very rare. You should check whether there are any duplicate rules/schedules. You can use metrics to identify what's being invoked/matched: https://docs.aws.amazon.com/eventbridge/latest/userguide/eventbridge-monitoring-cloudwatch-metrics.html.
Another reason may be that your rules are too broad and are matching multiple events from the same source. You can create another target on the same rule pointing to CloudWatch Logs, to see which events get matched and whether any filtering is needed.
It's also possible that SageMaker just sends duplicate events to EventBridge, in which case your best option would be to use ElastiCache to temporarily store the IDs and check against them in your Lambda, for example as sketched below.
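A minimal sketch of that dedup idea (assuming a Redis-compatible ElastiCache endpoint in the REDIS_HOST environment variable, the redis-py client packaged with the function, and that the event detail mirrors DescribeTrainingJob fields such as TrainingJobArn and SecondaryStatusTransitions; adjust the key to whatever you actually parse):

import os
import redis

r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)

def handler(event, context):
    detail = event["detail"]
    transitions = detail.get("SecondaryStatusTransitions", [])
    message = transitions[-1].get("StatusMessage", "") if transitions else ""
    # Key on the fields that distinguish a real status change and ignore
    # LastModifiedTime, which differs between the duplicates.
    key = "|".join([
        detail.get("TrainingJobArn", ""),
        detail.get("SecondaryStatus", ""),
        message,
    ])
    # SET with nx=True returns None when the key already exists -> duplicate.
    if not r.set(key, 1, nx=True, ex=3600):
        return {"skipped": True}
    # ...send the Telegram message here...
    return {"sent": True}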
After a lot of experiments I can answer my own question: SageMaker generates multiple events with the same payload, except for the LastModifiedTime field. I don't know if this is a bug, but in my opinion it should not happen. These rules are defined by AWS itself, so there is nothing I can customize. The situation is even worse if you enable the profiler.
There is nothing I can do, since I already posted on the official AWS forum multiple times without any luck.

Long-running Dataflow job fails with no errors in user code

After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know what the probability is that a single work item will fail 4 times over the course of a 24 hour job. But this same type of job failure happens frequently for this long-running job, so it seems like we need some way to either decrease the failure rate of work items, or increase the allowed number of retries. Is either possible? This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem and it was solved by scaling up my workers' resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See this for a more extensive list of machine type options.
Edit: Set it to highcpu or highmem depending on the requirements of your pipeline process.
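For reference, a sketch of the same setting made from code instead of a command-line flag (Beam Python SDK; project, region and bucket are placeholders, and --machine_type also has the alias --worker_machine_type):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    machine_type="n1-highcpu-96",  # or an n1-highmem-* type for memory-bound pipelines
)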

AWS Glue job runs correctly but returns a connection refused error

I am running a test job on AWS. I am reading CSV data from an S3 bucket, running a Glue ETL job on it, and storing the same data in Amazon Redshift. The Glue job just reads the data from S3 and stores it in Redshift without any modification. The job runs fine and I get the desired result in Redshift, but it returns an error which I am unable to understand.
Here is the error log:
18/11/14 09:17:31 WARN YarnClient: The GET request failed for the URL http://169.254.76.1:8088/ws/v1/cluster/apps/application_1542186720539_0001
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 169.254.76.1:8088 [/169.254.76.1] failed: Connection refused (Connection refused)
It is a WARN rather than an error, but I want to understand what is causing it. I tried to search for the IP indicated in the WARN, but I am not able to find a machine with that IP.
I noticed these errors coming up in my AWS Glue job as well, so I found something from AWS that could be helpful:
This WARN message is not so special, and does not mean job failure or any errors directly. I guess there should be other cause.
I would recommend you to enable continuous logging, and check both driver/executor logs to see if there are any suspicious behavior.
If you enable job bookmark, please try disabling it and see how it goes without bookmark.
https://forums.aws.amazon.com/thread.jspa?messageID=927547
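As a concrete (hypothetical) example of both suggestions, these are the special job parameters involved, here passed as DefaultArguments when defining the job with boto3; all names and paths are placeholders:

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-etl-job",
    Role="my-glue-service-role",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/etl.py",
             "PythonVersion": "3"},
    DefaultArguments={
        # Enable continuous logging so driver/executor logs stream to CloudWatch.
        "--enable-continuous-cloudwatch-log": "true",
        # Disable the job bookmark, as suggested above.
        "--job-bookmark-option": "job-bookmark-disable",
    },
)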
I had disabled bookmarks from the beginning. What I found is that my Glue job was writing data to S3 and getting a memory exception, so what I did was repartition the data.
MyDynamicFrame.coalesce(100).write.partitionBy("month").mode("overwrite").parquet("s3://"+bucket+"/"+path+"/out_data")
So if you have some write operations, I'd recommend checking how you are writing to S3.
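A sketch of that same fix in a Glue script (placeholder database, table and bucket names; assumes the source is read as a DynamicFrame from the Data Catalog): convert to a DataFrame, shrink the number of partitions, then write partitioned Parquet back to S3.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table")

(dyf.toDF()                          # DynamicFrame -> Spark DataFrame
    .coalesce(100)                   # fewer, larger partitions per write
    .write.partitionBy("month")
    .mode("overwrite")
    .parquet("s3://my-bucket/out_data"))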

A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool'

I have a series of Azure SQL Data Warehouse databases (for our development/evaluation purposes). Due to a recent unplanned extended outage (due to an issue with the Tenant Ring associated with some of these databases), I decided to resume the canary queries I had been running before but had quiesced for a couple of months due to frequent exceptions.
The canary queries are not running particularly frequently on any specific database, say every 15 minutes. On one database, I've received two indications of issues completing the canary query in 24 hours. The error is:
Msg 110802, Level 16, State 1, Server adwscdev1, Line 1110802;An internal DMS error occurred that caused this operation to fail. Details: A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool' (2000000007). Rerun the query.
This database is under essentially no load, running at more than 100 DWU.
Other databases on the same logical server may be running under a load, but I have not seen the error on them.
What is the explanation for this error?
Please open a support ticket for this issue; support will have full access to the DMS logs and will be able to see exactly what is going on. This behavior is not expected.
While I agree a support case would be reasonable, I think you should also try scaling up to, say, DWU400 and retrying. I would also consider trying largerc or xlargerc on DWU100 and DWU400, as described here. Note that a larger resource class gets more memory and resources per query.
Run the following then retry your query:
EXEC sp_addrolemember 'largerc', 'yourLoginName'