How to solve stability problems in Google Dataflow


I have a Dataflow job that has been running stably for several months.
For the last 3 days or so, I've had problems with the job: it gets stuck after a certain amount of time, and the only thing I can do is stop the job and start a new one. This happened after 2, 6 and 24 hours of processing. Here is the latest exception:
java.lang.ExceptionInInitializerError
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:183)
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:169)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:158)
at com.sun.proxy.$Proxy54.getWindmillServerStub (Unknown Source)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.<init> (StreamingDataflowWorker.java:677)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromDataflowWorkerHarnessOptions (StreamingDataflowWorker.java:562)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main (StreamingDataflowWorker.java:274)
Caused by: java.lang.RuntimeException: Loading windmill_service failed:
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:42)
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0 (Native Method)
at sun.nio.ch.FileDispatcherImpl.write (FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer (IOUtil.java:93)
at sun.nio.ch.IOUtil.write (IOUtil.java:65)
at sun.nio.ch.FileChannelImpl.write (FileChannelImpl.java:211)
at java.nio.channels.Channels.writeFullyImpl (Channels.java:78)
at java.nio.channels.Channels.writeFully (Channels.java:101)
at java.nio.channels.Channels.access$000 (Channels.java:61)
at java.nio.channels.Channels$1.write (Channels.java:174)
at java.nio.file.Files.copy (Files.java:2909)
at java.nio.file.Files.copy (Files.java:3027)
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:39)
Seems like there is no space left on a device, but shouldn't this be managed by Google? Or is this an error in my job somehow?
UPDATE:
The workflow is as follows:
Reading mass data from PubSub (up to 1500/s)
Filter some messages
Keeping session window on key and grouping by it
Sort the data and do calculations
Output the data to another PubSub

You can increase the disk capacity through a parameter of your pipeline. Look at the diskSizeGb option on this page.
In addition, the more data you keep in memory, the more memory you need. This is the case for windows: if you never close them, or if you allow late data for too long, you need a lot of memory to hold all that data.
Tune either your pipeline, or your machine type. Or both!
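
To make the window argument above concrete, here is a rough back-of-the-envelope sketch. It is not a Dataflow API: the function name, the 1 KiB average message size, and the session and lateness durations are assumptions chosen for illustration. Only the 1500 messages/s peak rate comes from the question. It treats every message as buffered until its window can close, so it is an upper bound that ignores the filtering step.

```python
def estimate_buffered_bytes(msgs_per_sec, avg_msg_bytes,
                            max_session_secs, allowed_lateness_secs=0):
    """Rough upper bound on bytes held for windows that are still open.

    Assumes every message stays buffered for the full session length
    plus the allowed lateness (a deliberate simplification).
    """
    retention_secs = max_session_secs + allowed_lateness_secs
    return msgs_per_sec * avg_msg_bytes * retention_secs


GIB = 1024 ** 3

# 1500 msg/s is the peak rate from the question; everything else is hypothetical.
no_lateness = estimate_buffered_bytes(1500, 1024, max_session_secs=30 * 60)
with_lateness = estimate_buffered_bytes(
    1500, 1024, max_session_secs=30 * 60, allowed_lateness_secs=24 * 3600
)

print(f"no lateness:   {no_lateness / GIB:.1f} GiB")
print(f"24 h lateness: {with_lateness / GIB:.1f} GiB")
```

Under these assumptions, allowed lateness dominates the estimate, which is why tightening it (or the session gap) is usually worth trying before adding disk or a bigger machine type.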

Related

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of size say 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K but I'm troubleshooting with 1000).
I tried increasing the number of replicas assuming that the # of instances handed to the model would be (1000/# of replicas) but that doesn't seem to be the case.
I then read that the default batch size is 64 and so tried decreasing the batch size to 4 like this from the python code that creates the batch job:
model_parameters = dict(batch_size=4)
from google.cloud import aiplatform


def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        # replica_count is defined elsewhere in the original code (not shown here)
        starting_replica_count=replica_count,
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks

Answering your follow-up question above:
"Is that timeout for a single instance request or a batch request? Also, is it in seconds?"
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code, we end up here and eventually in gapic, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
to complete. Note that if ``retry`` is used, this timeout
applies to each individual attempt and the overall time it
takes for this method to complete may be longer. If
unspecified, the the default timeout in the client
configuration is used. If ``None``, then the RPC method will
not time out.
My suggestion is to stick with whatever works for your prediction model. If adding the timeout improves things, you might as well build on it, along with your initial solution of using a machine with a higher spec. You can also try a machine with more memory, such as the n1-highmem-* family.
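
The quoted docstring says the timeout applies to each attempt when retries are used, so the overall call can take longer. The sketch below illustrates that arithmetic only. The function name and the backoff parameters (initial delay, multiplier, cap) are hypothetical and are not the client library's actual defaults. Real clients may also add jitter or an overall deadline, which this ignores.

```python
def worst_case_seconds(timeout, attempts,
                       initial_delay=1.0, multiplier=2.0, max_delay=60.0):
    """Upper bound on total wall time when each attempt may use the full timeout.

    Between attempts we assume a simple exponential backoff capped at max_delay.
    All backoff values here are illustrative assumptions.
    """
    total = attempts * timeout
    delay = initial_delay
    for _ in range(attempts - 1):
        total += min(delay, max_delay)
        delay *= multiplier
    return total


# One attempt: the bound is just the per-attempt timeout.
single = worst_case_seconds(timeout=60, attempts=1)
# Three attempts: 3 * 60 s of timeouts plus 1 s + 2 s of assumed backoff.
retried = worst_case_seconds(timeout=60, attempts=3)
print(single, retried)
```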

Collect one cell from pyspark Dataframe failed [duplicate]

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.
17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:726)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:755)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XXX.XX:36245 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
The reason of adding this configuration was the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o171.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Therefore, I increased maxResultSize to 2.5 Gb, but the Spark job fails anyway (the error shown above).
How to solve this issue?

It seems the problem is that the amount of data you are trying to pull back to your driver is too large. Most likely you are using the collect method to retrieve all values from a DataFrame/RDD. The driver is a single process, and by collecting a DataFrame you are pulling all of the data you had distributed across the cluster back to one node. This defeats the purpose of distributing it! It only makes sense to do this after you have reduced the data down to a manageable amount.
You have two options:
If you really need to work with all that data, then you should keep it out on the executors. Use HDFS and Parquet to save the data in a distributed manner and use Spark methods to work with the data on the cluster instead of trying to collect it all back to one place.
If you really need to get the data back to the driver, you should examine whether you really need ALL of the data or not. If you only need summary statistics then compute that out on the executors before calling collect. Or if you only need the top 100 results, then only collect the top 100.
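
The second option can be sketched without Spark at all. The example below is plain Python that simulates partitions as lists. It is not the Spark API, and the function names are made up for illustration. Each "executor" keeps only its own top N, so the "driver" merges at most N values per partition instead of the whole dataset.

```python
import heapq


def top_n_per_partition(partition, n):
    """Work that would run on an executor: keep only the n largest values."""
    return heapq.nlargest(n, partition)


def collect_top_n(partitions, n):
    """Driver side: merge the small per-partition results into a global top n."""
    partial = [top_n_per_partition(p, n) for p in partitions]
    # The driver only ever sees at most n * len(partitions) values.
    return heapq.nlargest(n, (value for part in partial for value in part))


# Three simulated partitions of a distributed dataset.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6, 4]]
top3 = collect_top_n(partitions, 3)
print(top3)
```

In a real job the same idea is expressed with DataFrame or RDD operations that reduce the data before anything is collected.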
Update:
There is another, less obvious reason you can run into this error. Spark sends data back to the driver at times other than when you explicitly call collect. It also sends back accumulator results for each task if you are using accumulators, data for broadcast joins, and some small status data about each task. If you have LOTS of partitions (20k+ in my experience), you can sometimes see this error. This is a known issue with some improvements made, and more in the works.
If this is your issue, the options for getting past it are:
Increase spark.driver.maxResultSize or set it to 0 for unlimited
If broadcast joins are the culprit, you can reduce spark.sql.autoBroadcastJoinThreshold to limit the size of broadcast join data
Reduce the number of partitions

Cause: actions like RDD's collect() that send a big chunk of data to the driver
Solution:
set by SparkConf: conf.set("spark.driver.maxResultSize", "4g")
OR
set by spark-defaults.conf: spark.driver.maxResultSize 4g
OR
set when calling spark-submit: --conf spark.driver.maxResultSize=4g

Streaming MutationGroups into Spanner

I'm trying to stream MutationGroups into spanner with SpannerIO.
The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-real-time KPIs.
When I don't use any windows, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1585)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1470)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:868)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:823)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:52)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:20)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:388)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:372)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline.main(EntityBuilderPipeline.java:122)
:entityBuilder FAILED
Because of the error above I assume the input collection needs to be windowed and triggered, as SpannerIO uses a GroupByKey (this is also what I need for my use case):
...
.apply("1-minute windows", Window.<MutationGroup>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(Repeatedly.forever(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(10))
    ).orFinally(AfterWatermark.pastEndOfWindow()))
    .discardingFiredPanes()
    .withAllowedLateness(Duration.ZERO))
.apply(SpannerIO.write()
    .withProjectId(entityConfig.getSpannerProject())
    .withInstanceId(entityConfig.getSpannerInstance())
    .withDatabaseId(entityConfig.getSpannerDb())
    .grouped());
When I do this, I get the following exceptions during runtime:
java.lang.IllegalArgumentException: Attempted to get side input window for GlobalWindow from non-global WindowFn
org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.issueSideInputFetch(StreamingModeExecutionContext.java:631)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$UserStepContext.issueSideInputFetch(StreamingModeExecutionContext.java:683)
com.google.cloud.dataflow.worker.StreamingSideInputFetcher.storeIfBlocked(StreamingSideInputFetcher.java:182)
com.google.cloud.dataflow.worker.StreamingSideInputDoFnRunner.processElement(StreamingSideInputDoFnRunner.java:71)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.sdk.values.ValueWithRecordId$StripIdsDoFn.processElement(ValueWithRecordId.java:145)
After investigating further it appears to be due to the .apply(Wait.on(input)) in SpannerIO: It has a global side input which does not seem to work with my fixed windows, as the docs of Wait.java state:
"If signal is globally windowed, main input must also be. This typically would be useful only in a batch pipeline, because the global window of an infinite PCollection never closes, so the wait signal will never be ready."
As a temporary workaround I tried the following:
add a GlobalWindow with triggers instead of fixed windows:
.apply("globalwindow", Window.<MutationGroup>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(10))
    ).orFinally(AfterWatermark.pastEndOfWindow()))
    .discardingFiredPanes()
    .withAllowedLateness(Duration.ZERO))
This results in writes to Spanner only when I drain my pipeline. I have the impression that the Wait.on() signal is only triggered when the global window closes, and that it doesn't work with triggers.
Disable the .apply(Wait.on(input)) in SpannerIO:
This results in the pipeline getting stuck on the view creation, which is described in this SO post: SpannerIO Dataflow 2.3.0 stuck in CreateDataflowView.
When I check the worker logs for clues, I do get the following warnings:
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type SpannerSchema have well defined equals method. This may produce incorrect results on some PipelineRunner"
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner"
Note that everything works with the DirectRunner and that I'm trying to use the DataflowRunner.
Does anyone have any other suggestions for things I can try to get this running? I can hardly imagine that I'm the only one trying to stream MutationGroups into spanner.
Thanks in advance!

Currently, SpannerIO connector is not supported with Beam Streaming. Please follow this Pull Request which adds streaming support for spanner IO connector.

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation in this post, I updated to the latest version (3.11.1.1) and re-created all servers. This change caused my 5 servers to operate at a much lower CPU load (it was around 75% before and is now around 20%, as shown in the graph below).
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, recommending to change the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values (batch-index-threads with 2, 4, 8, 16) and none of them solved the problem. I also tried changing the batch-max-unused-buffers param. Nothing solves my problem; I keep receiving the All batch queues are full error.
Here is the relevant part of my aerospike.conf:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 4
    batch-index-threads 40
    proto-fd-max 15000
    batch-max-requests 30000
    replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.

A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations which adds additional network load.

I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from the default of 255, you will also want to raise batch-max-unused-buffers to at least batch-index-threads x batch-max-buffer-per-queue + 1. If you do not, new buffers will be created and destroyed constantly, because the number of free (unused) buffers allowed is smaller than the number you are using. The moment the batch response is served, the system will strive to trim the buffers down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example if you raise the batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance the batch-max-unused-buffers should be set to 13000 which will have a max memory consumption of 1625MB (1.59GB) per-node.
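
The sizing arithmetic above can be reproduced with a small helper. This is only a calculator sketch: the function names are made up, and the 128 KB buffer size and the "threads x buffers + 1" rule are taken from the answer itself rather than from any Aerospike API.

```python
BUFFER_KB = 128  # response buffer size quoted in the answer


def buffers_to_mb(buffer_count):
    """Memory in MB used by the given number of 128 KB buffers."""
    return buffer_count * BUFFER_KB / 1024


def batch_buffer_plan(batch_index_threads, batch_max_buffer_per_queue):
    """Apply the answer's rule: unused buffers >= threads x buffers-per-queue + 1."""
    in_use = batch_index_threads * batch_max_buffer_per_queue
    return {
        "buffers_in_use": in_use,
        "min_unused_buffers": in_use + 1,
        "in_use_mb": buffers_to_mb(in_use),
    }


# Values from the question (40 threads) and the answer (320 buffers per queue).
plan = batch_buffer_plan(40, 320)
print(plan)
# Memory ceiling for the answer's suggested batch-max-unused-buffers of 13000.
print(buffers_to_mb(13000))
```

With 40 threads and 320 buffers per queue the minimum for batch-max-unused-buffers comes out just below the 13000 suggested above, which is consistent with the answer rounding up.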

Data error(cyclic redundancy check) while logging transaction status using bitronix transaction manager

The exception below occurred. Are there any possible explanations? My guess is that it may be a problem with the filesystem.
Caused by: bitronix.tm.internal.BitronixSystemException: error logging status
at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:400)
at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:379)
at bitronix.tm.BitronixTransaction.setActive(BitronixTransaction.java:367)
at bitronix.tm.BitronixTransactionManager.begin(BitronixTransactionManager.java:126)
... 8 more
Caused by: java.io.IOException: Data error (cyclic redundancy check)
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:71)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
at sun.nio.ch.IOUtil.write(IOUtil.java:60)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195)
at bitronix.tm.journal.TransactionLogAppender.writeLog(TransactionLogAppender.java:121)
at bitronix.tm.journal.DiskJournal.log(DiskJournal.java:98)
at bitronix.tm.BitronixTransaction.setStatus(BitronixTransaction.java:389)
... 12 more

There are two possible reasons for such a problem: a bug in the BTM disk journal or a hardware failure (could be RAM, disk, power supply, motherboard... almost anything).
Since the Disk journal is IMHO quite a solid piece of software that has been running on many production systems for years, I'd rather suspect your hardware first.
