Cloud Dataflow: Side effect when watermark advances

Cloud Dataflow: Side effect when watermark advances - google-cloud-platform

Working with a streaming, unbounded PCollection in Google Dataflow that originates from a Cloud PubSub subscription. We are using this as a firehose to simply deliver events to BigTable continuously. Everything with the delivery is performing nicely.
Our problem is that we have downstream batch jobs that expect to read a day's worth of data out of BigTable once it is delivered. I would like to utilize windowing and triggering to implement a side effect that will write a marker row out to bigtable when the watermark advances beyond the day threshold, indicating that dataflow has reason to believe that most of the events have been delivered (we don't need strong guarantees on completeness, just reasonable ones) and that downstream processing can begin.
What we've tried is write out the raw events as one sink in the pipeline, and then window into another sink, using the timing information in the pane to determine if the watermark has advanced. The problem with this approach is that it operates upon the raw events themselves again, which is undesirable since it would repeat writing the event rows. We can prevent this write, but the parallel path in the pipeline would still be operating over the windowed streams of events.
Is there an effecient way to attach a callback-of-sorts to the watermark, such that we can perform a single action when the watermark advances?

The general ability to set a timer in event time and receive a callback is definitely an important feature request, filed as BEAM-27, which is under active development.
But actually your approach of windowing into FixedWindows.of(Duration.standardDays(1)) seems like it will accomplish your goal using just the features of the Dataflow Java SDK 1.x. Instead of forking your pipeline, you can maintain the "firehose" behavior by adding the trigger AfterPane.elementCountAtLeast(1). It does incur the cost of a GroupByKey but does not duplicate anything.
The complete pipeline might look like this:
pipeline
// Read your data from Cloud Pubsub and parse to MyValue
.apply(PubsubIO.Read.topic(...).withCoder(MyValueCoder.of())
// You'll need some keys
.apply(WithKeys.<MyKey, MyValue>of(...))
// Window into daily windows, but still output as fast as possible
.apply(Window.into(FixedWindows.of(Duration.standardDays(1)))
.triggering(AfterPane.elementCountAtLeast(1)))
// GroupByKey adds the necessary EARLY / ON_TIME / LATE labeling
.apply(GroupByKey.<MyKey, MyValue>create())
// Convert KV<MyKey, Iterable<MyValue>>
// to KV<ByteString, Iterable<Mutation>>
// where the iterable of mutations has the "end of day" marker if
// it was ON_TIME
.apply(MapElements.via(new MessageToMutationWithEndOfWindow())
// Write it!
.apply(BigTableIO.Write.to(...);
Please do comment on my answer if I have missed some detail of your use case.

Related

Dataflow job stuck and not reading messages from PubSub

I have a dataflow job which reads JSON from 3 PubSub topics, flattening them in one, apply some transformations and save to BigQuery.
I'm using a GlobalWindow with following configuration.
.apply(Window.<PubsubMessage>into(new GlobalWindows()).triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterFirst.of(AfterPane.elementCountAtLeast(20000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations))))
.discardingFiredPanes());
The job is running with following configuration
Max Workers : 20
Disk Size: 10GB
Machine Type : n1-standard-4
Autoscaling Algo: Throughput Based
The problem I'm facing is that after processing few messages (approx ~80k) the job stops reading messages from PubSub. There is a backlog of close to 10 Million messages in one of those topics and yet the Dataflow Job is not reading the messages or autoscaling.
I also checked the CPU usage of each worker and that is also hovering in single digit after initial burst.
I've tried changing machine type and max worker configuration but nothing seems to work.
How should I approach this problem ?

I suspect the windowing function is the culprit. GlobalWindow isn't suited to streaming jobs (which I assume this job is, due to the use of PubSub), because it won't fire the window until all elements are present, which never happens in a streaming context.
In your situation, it looks like the window will fire early once, when it hits either that element count or duration, but after that the window will get stuck waiting for all the elements to finally arrive. A quick fix to check if this is the case is to wrap the early firings in a Repeatedly.forever trigger, like so:
withEarlyFirings(
Repeatedly.forever(
AfterFirst.of(
AfterPane.elementCountAtLeast(20000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(durations)))))
This should allow the early firing to fire repeatedly, preventing the window from getting stuck.
However for a more permanent solution I recommend moving away from using GlobalWindow in streaming pipelines. Using fixed-time windows with early firings based on element count would give you the same behavior, but without risk of getting stuck.

Dataflow discarding massive amount of events due to Window object or inner processing

Been recently developing a Dataflow consumer which read from a PubSub subscription and outputs to Parquet files the combination of all those objects grouped within the same window.
While I was doing testing of this without a huge load everything seemed to work fine.
However, after performing some heavy testing I can see that from 1.000.000 events sent to that PubSub queue, only 1000 make it to Parquet!
According to multiple wall times across different stages, the one which parses the events prior applying the window seems to last 58 minutes. The last stage which writes to Parquet files lasts 1h and 32 minutes.
I will show now the most relevant parts of the code within, hope you can shed some light if its due to the logic that comes before the Window object definition or if it's the Window object iself.
pipeline
.apply("Reading PubSub Events",
PubsubIO.readMessagesWithAttributes()
.fromSubscription(options.getSubscription()))
.apply("Map to AvroSchemaRecord (GenericRecord)",
ParDo.of(new PubsubMessageToGenericRecord()))
.setCoder(AvroCoder.of(AVRO_SCHEMA))
.apply("15m window",
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(15)))
.triggering(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()
)
Also note that I'm running Beam 2.9.0.
Could the logic inside the second stage be too heavy so that messages arrive too late and get discarded in the Window? The logic basically consists reading the payload, parsing into a POJO (reading inner Map attributes, filtering and such)
However, if I sent a million events to PubSub, all those million events make it till the Parquet write to file stage, but then those Parquet files don't contain all those events, just partially. Does that make sense?
I would need the trigger to consume all those events independently of the delay.

Citing from an answer on the Apache Beam mailing list:
This is an unfortunate usability problem with triggers where you can accidentally close the window and drop all data. I think instead, you probably want this trigger:
Repeatedly.forever(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
The way I recommend to express this trigger is:
AfterWatermark.pastEndOfWindow().withEarlyFirings(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
In the second case it is impossible to accidentally "close" the window and drop all data.

Custom Media Foundation sink never receives samples

I have my own MediaSink in Windows Media Foundation with one stream. In the OnClockStart method, I instruct the stream to queue (i) MEStreamStarted and (ii) MEStreamSinkRequestSample on itself. For implementing the queue, I use the IMFMediaEventQueue, and using the mtrace tool, I can also see that someone dequeues the event.
The problem is that ProcessSample of my stream is actually never called. This also has the effect that no further samples are requested, because this is done after processing a sample like in https://github.com/Microsoft/Windows-classic-samples/tree/master/Samples/DX11VideoRenderer.
Is the described approach the right way to implement the sink? If not, what would be the right way? If so, where could I search for the problem?
Some background info: The sink is an RTSP sink based on live555. Since the latter is also sink-driven, I thought it would be a good idea queuing a MEStreamSinkRequestSample whenever live555 requests more data from me. This is working as intended.
However, the solution has the problem that new samples are only requested as long as a client is connected to live555. If I now add a tee before the sink, eg to show a local preview, the system gets out of control, because the tee accumulates samples on the output of my sink which are never fetched. I then started playing around with discardable samples (cf. https://social.msdn.microsoft.com/Forums/sharepoint/en-US/5065a7cd-3c63-43e8-8f70-be777c89b38e/mixing-rate-sink-and-rateless-sink-on-a-tee-node?forum=mediafoundationdevelopment), but the problem is either that the stream does not start, queues are growing or the frame rate of the faster sink is artificially limited depending on which side is discardable.
Therefore, the next idea was rewriting my sink such that it always requests a new sample when it has processed the current one and puts all samples in a ring buffer for live555 such that whenever clients are connected, they can retrieve their data from there, and otherwise, the samples are just discarded. This does not work at all. Now, my sink does not get anything even without the tee.
The observation is: if I just request a lot of samples (as in the original approach), at some point, I get data. However, if I request only one (I also tried moderately larger numbers up to 5), ProcessSample is just not called, so no subsequent requests can be generated. I send MeStreamStarted once the clock is started or restarted exactly as described on https://msdn.microsoft.com/en-us/library/windows/desktop/ms701626, and after that, I request the first sample. In my understanding, MEStreamSinkRequestSample should not get lost, so I should get something even on a single request. Is that a misunderstanding? Should I request until I get something?

Measure the rate at which messages arrive in a Akka Streams Flow or a Sink

I have written an Akka Streams application and its working fine.
What I want to do is to attach my JMX console to the JVM intance running the Akka Streams application and then study the amount of messages coming into my Sink and Flows.
Is this possible? I googled but didn't find a concrete way.
The final stage of my application is a sink to a Cassandra database. I want to know the rate of messages per second coming into the Sink.
I also want to pick a random Flow in my graph and then know the number of messages per second flowing through the flow.
Is there anything out of box? or should I just code something like dropwizard into each of my flows to measure the rate.

Presently there is nothing "out of the box" you can leverage to monitor rates inside your Akka Stream.
However, this is a very simple facility you can extract in a monitoring Flow you can place wherever it fits your needs.
The example below is based on Kamon, but you can see it can be ported to Dropwizard very easily:
def meter[T](name: String): Flow[T, T, NotUsed] = {
val msgCounter = Kamon.metrics.counter(name)
Flow[T].map { x =>
msgCounter.increment()
x
}
}
mySource
.via(meter("source"))
.via(myFlow)
.via(meter("sink"))
.runWith(mySink)
The above is part of a demo you can find at this repo.
An ad-hoc solution like this has the advantage of being perfectly tailored for your application, whilst keeping simplicity.

What are the options to process timeseries data from a Kinesis stream

I need to process data from an AWS Kinesis stream, which collects events from devices. Processing function has to be called each second with all events received during the last 10 seconds.
Say, I have two devices A and B that write events into the stream.
My procedure has name of MyFunction and takes the following params:
DeviceId
Array of data for a period
If I start processing at 10:00:00 (and already have accumulated events for devices A and B for the last 10 seconds)
then I need to make two calls:
MyFunction(А, {Events for device A from 09:59:50 to 10:00:00})
MyFunction(B, {Events for device B from 09:59:50 to 10:00:00})
In the next second, at 10:00:01
MyFunction(А, {Events for device A from 09:59:51 to 10:00:01})
MyFunction(B, {Events for device B from 09:59:51 to 10:00:01})
and so on.
Looks like the most simple way to accumulate all the data received from devices is just store it memory in a temp buffer (the last 10 seconds only, of course), so I'd like to try this first.
And the most convenient way to keep such a memory based buffer I have found is to create a Java Kinesis Client Library (KCL) based application.
I have also considered AWS Lambda based solution, but looks like it's impossible to keep data in memory for lambda. Another option for Lambda is to have 2 functions, the first one has to write all the data into DynamoDB, and the second one to be called each second to process data fetched from db, not from memory. (So this option is much more complicated)
So my questions is: what can be other options to implement such processing?

So, what you are doing is called "window operation" (or "windowed computation"). There are multiple ways to achieve that, like you said buffering is the best option.
In memory cache systems: Ehcache, Hazelcast
Accumulate data in a cache system and choose the proper eviction policy (10 minutes in your case). Then do a grouping summation operation and calculate the output.
In memory database: Redis, VoltDB
Just like a cache system, you can use a database architecture. Redis could be helpful and stateful. If you use VoltDB or such SQL system, calling a "sum()" or "avg()" operation would be easier.
Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
It is possible to use Spark to do that counting. You can try Elastic MapReduce (EMR), so you will stay in AWS ecosystem and integration would be easier.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js