Apache Beam StatusRuntimeException on Dataflow pipeline - google-cloud-platform

I am working on a Dataflow pipeline written in Python 2.7 using apache_beam==2.24.0. The pipeline consumes Pub/Sub messages from a subscription in batches using Beam's ReadFromPubSub, does some processing on the messages, and then persists the resulting data to two different BigQuery tables. I am consuming a large volume of data. The google-cloud-pubsub version is 1.7.0. After the pipeline starts, everything works fine, but after a few hours I start getting the exception:
org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
In the GCP Dataflow console, the logs show this error, but the job itself seems to work fine: it consumes data from the subscription and writes it to BigQuery. Which CANCELLED call is being referred to here, why am I getting this error, and how can I resolve it?
Full stacktrace:
Caused by: org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Status.asRuntimeException(Status.java:524)
org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341)
org.apache.beam.sdk.fn.stream.DirectStreamObserver.onNext(DirectStreamObserver.java:98)
org.apache.beam.sdk.fn.data.BeamFnDataSizeBasedBufferingOutboundObserver.flush(BeamFnDataSizeBasedBufferingOutboundObserver.java:100)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.shouldWait(RemoteGrpcPortWriteOperation.java:124)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.maybeWait(RemoteGrpcPortWriteOperation.java:167)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.process(RemoteGrpcPortWriteOperation.java:196)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:57)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:39)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:123)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1365)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:154)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1085)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

The client I am working for has the option of raising a support ticket with Google Cloud Support. Their reply:
The error you are seeing is rather harmless. Dataflow is a massively parallel data processing platform, and autoscaling events can move worker VMs around. When a VM is being shut down, the gRPC channel is closed before the runner process, and the work item being processed is retried on a newly launched runner. These errors can be ignored.

Related

PERMISSION_DENIED for BigQuery Storage API on Apache Beam 2.39.0 and DataFlow runner

I am getting the following error for one of my Dataflow jobs:
2022-06-15T16:12:27.365182607Z Error message from worker: java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: BigQuery Storage API has not been used in project 770406736630 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/bigquerystorage.googleapis.com/overview?project=770406736630 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
The same code works fine with Apache Beam 2.38.0. I tested multiple times and this is not a temporary issue. The project number mentioned in the error (770406736630) is not mine.
Any idea why I get this error?
I had the same issue. I'm using Spring Cloud GCP and hadn't set the spring.cloud.gcp.project-id property, which I'm guessing makes the SDK or API use some default value.
I don't know how you've set up your environment, because you haven't specified, but look into how you can explicitly set the project ID. You can get it from the project selection dialog in the GCP Console.
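If you are also on Spring Cloud GCP, a minimal sketch of the fix is to set the property explicitly in application.properties (the project ID below is a placeholder, not a value from this question):
# application.properties (placeholder project ID)
spring.cloud.gcp.project-id=my-gcp-project-id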
I just ran into this, and simply needed to re-authenticate with the gcloud CLI by running gcloud auth application-default login.
The error happens with the latest Apache Beam SDK (2.41.0) when BigQueryIO.Write.Method.STORAGE_WRITE_API is used and the destination does not specify the project name, for example dataset.table instead of project-id:dataset.table.
This is the solution that worked for me:
BigQueryIO.writeTableRows()
    .to("project-id:dataset.table")
    .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
For some reason the Apache Beam implementation of the BigQuery Storage Write API does not handle this situation, even though it works fine with the FILE_LOADS method.
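For context, a slightly fuller sketch of such a write, assuming a PCollection<TableRow> named rows and a TableSchema named schema that are not part of the original answer:
// Illustrative only; 'rows' and 'schema' are assumed to exist elsewhere in the pipeline.
rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table") // fully qualified spec with an explicit project ID
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));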
You may also receive a slightly different error with the latest Beam SDK:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Permission denied: Consumer 'project:null' has been suspended.

Google cloud error: No agent on master node(s) found to be active in the past 300 seconds

I am seeing an error in my Google Cloud logs: "No agent on master node(s) found to be active in the past 300 seconds" and I am not sure what is causing it.
I have a schedule in Google Cloud to run my queries, and it has been running fine for months. Now, when it should trigger a new Cloud Function, I see this error.
Thanks
I probably cannot help with OP's issue, but for those coming here after trying to submit a job in the Console on a recently started cluster: try cancelling the submit-job procedure and starting again.
I tried to submit a job without starting the cluster and received a "cluster not active" error; I then started the cluster, tried to submit the job again, and received exactly this message ("No agent on master node(s) found to be active in the past 300 seconds"). However, the error was resolved and I could submit the job once I closed the form, refreshed the page, and started filling in the submit-job form again.
This error occurs when the agent on the Dataproc master node is not able to accept any new jobs. It may happen either because the agent is running out of memory or because the master VM itself is unhealthy. The problem can be resolved by restarting (stopping, then starting) the Dataproc cluster or by retrying the job submission later. More information regarding this error is found in the public documentation here.
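For reference, such a restart can be done with the gcloud CLI (the cluster name and region below are placeholders; depending on your gcloud version these commands may live under gcloud beta):
gcloud dataproc clusters stop my-cluster --region=us-central1
gcloud dataproc clusters start my-cluster --region=us-central1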

Consuming messages from Google Pubsub and publishing it to Kafka

I am trying to consume Google Pub/Sub messages using the synchronous PULL API, which is available in the Apache Beam Google PubSub IO connector library.
I want to write the consumed messages to Kafka using KafkaIO. I want to use FlinkRunner to execute the job, since we run this application outside GCP.
The problem I am facing is that the consumed messages are not getting ACKed in GCP Pub/Sub. I have confirmed that the local Kafka instance has the messages consumed from GCP Pub/Sub. The GCP Dataflow documentation indicates that the data bundle gets finalized when the pipeline is terminated with a data sink, which is Kafka in my case.
But since the code is running on Apache Flink and not GCP Dataflow, I think some callback related to ACKing the committed messages is not getting fired.
What am I doing wrong here?
pipeline
    .apply("Read GCP PubSub Messages", PubsubIO.readStrings()
        .fromSubscription(subscription))
    .apply(ParseJsons.of(User.class))
    .setCoder(SerializableCoder.of(User.class))
    .apply("Filter-1", ParDo.of(new FilterTextFn()))
    .apply(AsJsons.of(User.class).withMapper(new ObjectMapper()))
    .apply("Write to Local Kafka",
        KafkaIO.<Void, String>write()
            .withBootstrapServers("127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094")
            .withTopic("test-topic")
            .withValueSerializer(StringSerializer.class)
            .values());
The Beam documentation on the PubSub IO class mentions this:
Checkpoints are used both to ACK received messages back to Pubsub (so that they may be retired on the Pubsub end), and to NACK already consumed messages should a checkpoint need to be restored (so that Pubsub will resend those messages promptly).
The ACKs are not tied to Dataflow; you should see the same behavior on Dataflow. The ACKs are sent on checkpoints. Usually, the checkpoints correspond to the windows that you set on your streaming flow.
But you didn't set a window! By default, the window is global, and it is closed only at the end, if you stop your job gracefully (and even then, I'm not sure). In any case, a better solution is to use fixed windows (for example, of 5 minutes) so that the messages are ACKed at the end of each of these windows.
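A minimal sketch of the fixed-window transform this answer suggests, placed right after the Pub/Sub read in the pipeline above (the 5-minute duration is just the example value from the answer):
// Requires org.apache.beam.sdk.transforms.windowing.Window, FixedWindows, and org.joda.time.Duration.
.apply("Fixed 5 minute windows",
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))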
The way I fixed this was by following Guillaume Blaquiere's (https://stackoverflow.com/users/11372593/guillaume-blaquiere) suggestion of looking at checkpoints. Even after adding the Window.into() transform to the pipeline, the source Pub/Sub subscription did not receive ACKs.
The problem was in my Flink server configuration: I had failed to include the checkpoint configuration. Without these parameters, checkpoints are disabled.
state.backend: rocksdb
state.checkpoints.dir: file:///tmp/flink-1.9.3/state/checkpoints/
These settings go in flink_home/conf/flink-conf.yaml.
After adding these entries and restarting Flink, all the backlogged (unACKed) messages went to 0 in the GCP Pub/Sub monitoring chart.
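As a side note (not part of the original answer), when the job is launched through Beam's FlinkRunner, the checkpoint interval can also be configured programmatically through FlinkPipelineOptions; a minimal sketch, with an arbitrary 60-second interval:
// Illustrative only; the 60 s interval is an assumption, not a value from the original post.
FlinkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
options.setCheckpointingInterval(60000L); // milliseconds; checkpointing is disabled by default
Pipeline pipeline = Pipeline.create(options);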

Cloud Data Fusion - Errors with quickstart

I'm working on testing Cloud Data Fusion in GCP by executing their quickstart tutorial. The tutorial I am following is here
I configured my environment to have all the appropriate permissions and get to the point where my Dataproc cluster is up and running and the job starts.
After a few minutes, the job fails with the following error:
java.io.IOException: com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed out (Connection timed out)
And:
io.grpc.netty.shaded.io.netty.channel.ChannelException: eventfd_write(...) failed: Bad file descriptor
For the second error, I manually changed the 'input' format to JSON instead of text (as it comes when you import the pipeline from the Hub), but still no luck. For the first error, I'm not exactly sure what's going wrong.
I have already reviewed the Creating a Cloud Data Fusion instance documentation, but I still receive the errors.
Any suggestions?

Dataflow - 20 streaming Windmill RPC errors for a stream

I've been using Dataflow and Pub/Sub for streaming for over a year, and today, without me changing anything, Dataflow stopped reading from Pub/Sub. At first I was getting the error below in my logs, but it stopped appearing once I updated Pub/Sub to the latest version and the Apache Beam SDK from 2.10.0 to 2.17.0:
20 streaming Windmill RPC errors for a stream, last was: org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found.
I found the link below, but at the end it just says GCP is working on it and does not say whether the author did anything to fix the issue. How does this get fixed, and what is causing it?
Dataflow: streaming Windmill RPC errors for a stream