I've been using Dataflow and Pub/Sub for streaming for over a year, and today, without my changing anything, Dataflow stopped reading from Pub/Sub. At first I was getting the error below in my logging, but it stopped appearing once I updated the Pub/Sub client to the latest version and the Apache Beam SDK from 2.10.0 to 2.17.0.
20 streaming Windmill RPC errors for a stream, last was: org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusRuntimeException: NOT_FOUND: Requested entity was not found.
I see the link below, but at the end it just says GCP is working on it and does not say whether the writer did anything to fix the issue. How does this get fixed, and what is causing it?
Dataflow: streaming Windmill RPC errors for a stream
We have a pipeline that extracts embeddings (feature vectors) from images stored in a Cloud Storage bucket and inserts them into a BigQuery table.
We're consistently getting SDK harness sdk-0-1 disconnected. errors when the Dataflow job runs on N1-type VM instances.
Error message from worker:
Data channel closed, unable to send additional data to SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
SDK harness sdk-0-0 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-3
SDK harness sdk-0-1 disconnected.
SDK harness sdk-0-2 disconnected.
Data channel closed, unable to receive additional data from SDK sdk-0-1
Notes
N2 machines work fine but N1 fails, which is somewhat surprising because N1 is the Google default machine type.
Jobs run slower on N1 machines and sometimes appear to fail due to these errors.
Using a larger VM (more memory, CPU and disk) didn't resolve the errors.
We also have another pipeline that extracts embeddings from text using a LaBSE model; it has the same errors on both N1 and N2 machines.
Diagnostics tab: No errors found during this interval.
We're creating Dataflow job templates (Apache Beam 2.40, Python), storing them in Cloud Storage, and using the API to launch new jobs.
We're batching the items before giving them to the stage where embeddings are extracted. Reducing batch size didn't matter.
Changing the pipeline option sdk_worker_parallelism from 0 (the default) to 1 didn't change anything.
Auto-scaling disabled (max_worker=1) and same errors.
Reshuffle stage removed from the pipeline: there are still disconnect errors (e.g. SDK harness sdk-0-0 disconnected.) but no data channel errors (e.g. Data channel closed, unable to send additional data to SDK sdk-0-3).
This error message can have a wide variety of causes, which are hard to pin down unless the error is accompanied by other symptoms; it could stem from any of the errors listed in the Dataflow troubleshooting documentation.
To get more information about the error, investigate it in the Diagnostics tab of the job page.
The Diagnostics tab shows the timeline of the errors that occurred and possible recommendations for your pipeline. You can also view the job metrics to monitor your Dataflow jobs.
I am working on a Dataflow pipeline written in Python 2.7 using apache_beam==2.24.0. The pipeline consumes Pub/Sub messages from a subscription in batches using Beam's ReadFromPubSub, does some processing on the messages, and then persists the resulting data to two different BigQuery tables. I am consuming a lot of data. The google-cloud-pubsub version is 1.7.0. After the pipeline starts, everything works fine, but after a few hours I start getting this exception:
org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
In the GCP Dataflow console the logs show this error, but the job itself seems to work fine: it consumes data from the subscription and writes it to BigQuery. Which CANCELLED call is being referred to here, why am I getting this error, and how can I resolve it?
Full stacktrace:
Caused by: org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Status.asRuntimeException(Status.java:524)
org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:341)
org.apache.beam.sdk.fn.stream.DirectStreamObserver.onNext(DirectStreamObserver.java:98)
org.apache.beam.sdk.fn.data.BeamFnDataSizeBasedBufferingOutboundObserver.flush(BeamFnDataSizeBasedBufferingOutboundObserver.java:100)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.shouldWait(RemoteGrpcPortWriteOperation.java:124)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.maybeWait(RemoteGrpcPortWriteOperation.java:167)
org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.process(RemoteGrpcPortWriteOperation.java:196)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:182)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:57)
org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowReshuffleFn.processElement(StreamingGroupAlsoByWindowReshuffleFn.java:39)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:134)
org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:123)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1365)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:154)
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1085)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
The client I am working for has the option of raising a support ticket with Google Cloud Support. The reply from Google Cloud Support:
The error you are seeing is rather harmless. Dataflow is a massively parallel data processing platform, and autoscaling events can move worker VMs around. When a VM is shut down, the gRPC channel is closed before the runner process, and the work item being processed is retried on another newly launched runner. These errors can be ignored.
I am trying to consume Google Pub/Sub messages using the synchronous pull API, which is available in the Apache Beam Google Pub/Sub IO connector library.
I want to write the consumed messages to Kafka using KafkaIO. I want to use the FlinkRunner to execute the job, since we run this application outside GCP.
The problem I am facing is that the consumed messages are not getting ACK'd in GCP Pub/Sub. I have confirmed that the local Kafka instance has the messages consumed from GCP Pub/Sub. The GCP Dataflow documentation indicates that the data bundle gets finalized when the pipeline is terminated with a data sink, which is Kafka in my case.
But since the code is running on Apache Flink and not GCP Dataflow, I think some callback related to ACK'ing the committed message is not getting fired.
What am I doing wrong here?
pipeline
.apply("Read GCP PubSub Messages", PubsubIO.readStrings()
.fromSubscription(subscription)
)
.apply(ParseJsons.of(User.class))
.setCoder(SerializableCoder.of(User.class))
.apply("Filter-1", ParDo.of(new FilterTextFn()))
.apply(AsJsons.of(User.class).withMapper(new ObjectMapper()))
.apply("Write to Local Kafka",
KafkaIO.<Void,String>write()
.withBootstrapServers("127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094")
.withTopic("test-topic")
.withValueSerializer((StringSerializer.class))
.values()
);
The Beam documentation for the PubSub IO class mentions this:
Checkpoints are used both to ACK received messages back to Pubsub (so that they may be retired on the Pubsub end), and to NACK already consumed messages should a checkpoint need to be restored (so that Pubsub will resend those messages promptly).
The ACKs are not tied to Dataflow; you should see the same behavior on Dataflow. The ACKs are sent on checkpoints, and usually the checkpoints correspond to the windows that you set on your streaming flow.
But you didn't set a window! By default the window is global, and it closes only at the end, when you stop your job gracefully (and even then I'm not sure about this). A better solution is to use fixed windows (of 5 minutes, for example) so the messages are ACK'd at each of these windows, as in the sketch below.
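For illustration, here is a minimal sketch of how that suggestion could be wired into the pipeline from the question; everything except the Window transform is assumed to match the original code, and the subscription string is a placeholder.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedPubsubRead {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> windowed = pipeline
        .apply("Read GCP PubSub Messages",
            PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"))
        // Group elements into 5-minute fixed windows so that checkpoints
        // (and therefore Pub/Sub ACKs) happen regularly instead of only at shutdown.
        .apply("Fixed 5 minute windows",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))));

    // ... parse, filter and write to Kafka as in the original pipeline ...
    pipeline.run();
  }
}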
The way I fixed this was by following Guillaume Blaquiere's (https://stackoverflow.com/users/11372593/guillaume-blaquiere) suggestion of looking at checkpoints. Even after adding the Window.into() transform to the pipeline, the source Pub/Sub subscription did not receive ACKs.
The problem was in the Flink server configuration: I had not specified any checkpoint configuration. Without these parameters, checkpoints are disabled.
state.backend: rocksdb
state.checkpoints.dir: file:///tmp/flink-1.9.3/state/checkpoints/
These configs go in flink_home/conf/flink-conf.yaml.
After adding these entries and restarting Flink, all the backlogged (unACK'd) messages went to 0 in the GCP Pub/Sub monitoring chart.
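As a complementary sketch (not from the original answer), checkpointing can also be requested from the Beam side through the Flink runner's pipeline options; the 60-second interval below is an illustrative value, not taken from the question.

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlinkCheckpointingOptions {
  public static void main(String[] args) {
    FlinkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    // Enable periodic checkpoints so buffered Pub/Sub messages are ACK'd regularly;
    // the 60 000 ms interval is illustrative only.
    options.setCheckpointingInterval(60_000L);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the Pub/Sub -> Kafka pipeline here as in the question ...
    pipeline.run();
  }
}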
My data is in Pub/Sub and I want to stream it. I use Dataproc to run my Spark job in Java, but the job fails with the following error.
19/06/18 06:32:30 WARN org.apache.spark.streaming.scheduler.ReceiverTracker: Error reported by receiver for stream 0: Failed to pull messages - java.lang.NullPointerException
at scala.collection.convert.Wrappers$JListWrapper.iterator(Wrappers.scala:88)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
at org.apache.spark.streaming.pubsub.PubsubReceiver.receive(PubsubInputDStream.scala:259)
at org.apache.spark.streaming.pubsub.PubsubReceiver$$anon$1.run(PubsubInputDStream.scala:247)
The code segment I used is:
PubsubUtils.createStream(jssc, "projectId", "TopicName", "subscriptionName",
    new SparkGCPCredentials.Builder()
        .jsonServiceAccount("absolute path to json placed in dataproc")
        .build(),  // build() turns the Builder into the SparkGCPCredentials the method expects
    StorageLevel.MEMORY_AND_DISK_2());
This warning shows up when Spark tries to read events from a Pub/Sub subscription that has no waiting events. It should not break the job; it only means there were no events to read for a given batch of data.
I have a Dataflow streaming job with a Pub/Sub subscription as an unbounded source. I want to know at what stage Dataflow ACKs the incoming Pub/Sub message. It appears to me that the message is lost if an exception is thrown during any stage of the Dataflow pipeline.
I'd also like to know the best practices for writing a Dataflow pipeline with a Pub/Sub unbounded source so messages can be recovered on failure. Thank you!
The Dataflow streaming runner ACKs Pub/Sub messages received by a bundle after the bundle has succeeded and its results (outputs, state mutations, etc.) have been durably committed. Failed bundles are retried until they succeed and don't cause data loss. If you believe data loss may be happening, please include details (the job ID and the reasoning that led you to conclude data has been dropped because of the failures) and we'll investigate.
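As an illustrative sketch (not part of the answer above), one common pattern that keeps bundles from failing repeatedly is to catch per-element exceptions in a DoFn and route the bad records to a dead-letter output, so the bundle commits (and the Pub/Sub message is ACK'd) while the failing records are preserved for inspection. All names below (ParseWithDeadLetterFn, PARSED, DEAD_LETTER) are hypothetical.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

public class ParseWithDeadLetterFn extends DoFn<String, String> {
  // Main output: successfully processed records.
  public static final TupleTag<String> PARSED = new TupleTag<String>() {};
  // Side output: raw records that threw during processing.
  public static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

  @ProcessElement
  public void processElement(@Element String raw, MultiOutputReceiver out) {
    try {
      out.get(PARSED).output(transform(raw));
    } catch (Exception e) {
      // Emit the failing record instead of letting the bundle fail and be retried.
      out.get(DEAD_LETTER).output(raw);
    }
  }

  private String transform(String raw) {
    // Placeholder for the real parsing/processing logic.
    return raw.trim();
  }
}

It would typically be applied with ParDo.of(new ParseWithDeadLetterFn()).withOutputTags(ParseWithDeadLetterFn.PARSED, TupleTagList.of(ParseWithDeadLetterFn.DEAD_LETTER)), with the dead-letter collection written to, for example, a separate BigQuery table.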