Dataflow pipeline waits for elements from all streams before performing GroupBy - google-cloud-platform

We are running a Dataflow job that handles multiple input streams. Some of them are high traffic and some of them rarely get messages through. We are joining all streams with a "shared" stream that contains information relevant to all elements. This is a simplified example of the pipeline:
I noticed that the job will not produce any output, until both streams contain some traffic.
For example, let's suppose that Stream 1 gets a steady flow of traffic, whereas Stream 2 does not produce any messages for a period of time. During this time, the job's DAG will show elements accumulating in the GroupByKey step, but nothing will be propagated beyond it. I can also see the Flatten PCollections step showing input elements for the left side of the graph but not the right one. This creates a problem when dealing with high-traffic and low-traffic streams in the same job, since output will be delayed for as long as it takes Stream 2 to pick up messages.
I am not sure if my observation is correct, but I wanted to ask whether this is how Flatten/GroupByKey works in general and, if so, whether the issue we're seeing can be avoided by constructing the pipeline differently.
(Example JobID: 2017-02-10_06_48_01-14191266875301315728)

As described in the documentation of GroupByKey, the default behavior is to wait for all data within the window to have arrived; this is necessary to ensure the correctness of downstream results.
Depending on what you are trying to do, you may be able to use triggers to cause the aggregates to be output earlier.
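For instance, here is a minimal sketch of an early-firing trigger using the Apache Beam / Dataflow Java SDK; the element type, window size, firing delay and allowed lateness are illustrative placeholders, and input stands in for one of your keyed streams:

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Emit speculative early panes once a minute while the window is open, then a
    // final pane once the watermark passes the end of the window. Accumulating mode
    // means the final pane contains the complete aggregate.
    PCollection<KV<String, String>> windowed =
        input.apply(
            Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(10)))
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        .withEarlyFirings(
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardMinutes(1))))
                .withAllowedLateness(Duration.standardMinutes(10))
                .accumulatingFiredPanes());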
You may also be able to use the slow-stream as a side-input to the processing of the fast-stream.
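A minimal sketch of that side-input approach, again with the Beam/Dataflow Java SDK; sharedStream, fastStream, the element types and the join logic are assumptions for illustration (imports as in the previous sketch, plus View, ParDo, DoFn, PCollectionView and Map):

    // Window both streams the same way, expose the slow "shared" stream as a
    // map-valued side input, and look it up while processing the fast stream.
    final PCollectionView<Map<String, String>> sharedView =
        sharedStream
            .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(10))))
            .apply(View.asMap());

    PCollection<String> joined =
        fastStream
            .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(10))))
            .apply(ParDo.of(new DoFn<KV<String, String>, String>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    // Look up the shared value for this element's key (null if absent).
                    String shared = c.sideInput(sharedView).get(c.element().getKey());
                    c.output(c.element().getValue() + "|" + shared);
                  }
                }).withSideInputs(sharedView));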
If you're still stuck, it would help if you could describe in more detail the contents of the streams and how you're trying to use them, since more detailed answers depend on the goal.

Related

Map-Reduce with a wait

The concept of map-reduce is very familiar to me. It seems like a great fit for a problem I'm trying to solve, but either it's missing something or I lack enough understanding of the concept.
I have a stream of items, structured as follows:
{
  "jobId": 777,
  "numberOfParts": 5,
  "data": "some data..."
}
I want to do a map-reduce on many such items.
My mapping operation is straightforward - take the jobId.
My reduce operation is irrelevant for this phase; all we know is that it takes multiple strings (the "some data..." parts) and somehow reduces them to a single object.
The only problem is - I need all five parts of this job to complete before I can reduce all the strings into a single object. Every item has a "numberOfParts" property which indicates the number of items I must have before I apply the reduce operation. The items are not ordered, therefore I don't have a "partId" field.
Long story short - I need some kind of waiting mechanism that waits for all parts of the job to arrive before initiating the reduce operation, and I need this waiting mechanism to rely on a value that exists within the payload (therefore solutions like Kafka wouldn't work).
Is there a way to do that, hopefully using a single tool/framework?
I only want to write the map/reduce part and the "waiting" logic; the rest, I believe, should come out of the box.
**** EDIT ****
I'm currently in the design phase of the project and therefore not using any framework (such as Spark, Hadoop, etc.).
I asked this because I wanted to find out the best way to tackle this problem.
"Waiting" is not the correct approach.
Assuming your jobId is the key and data contains some number of parts (zero or more), you need multiple reducers: one that gathers all parts of the same job, and another that processes only the jobs whose collection of parts has reached numberOfParts, ignoring the rest.
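A minimal sketch of those two stages in plain Java (no framework); the Item record mirrors the payload from the question, items is an in-memory collection standing in for the stream, and reduceParts() is a placeholder for the real reduce operation:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    record Item(long jobId, int numberOfParts, String data) {}

    // Stage 1: gather all parts belonging to the same job.
    Map<Long, List<Item>> partsByJob = items.stream()
        .collect(Collectors.groupingBy(Item::jobId));

    // Stage 2: reduce only the jobs that have accumulated all of their parts,
    // ignoring (or re-queueing) the incomplete ones.
    List<Object> results = partsByJob.values().stream()
        .filter(parts -> parts.size() >= parts.get(0).numberOfParts())
        .map(parts -> reduceParts(parts))
        .collect(Collectors.toList());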

Chronicle Queue - reader modifying msg

We are preparing to use Chronicle Queue (SingleChronicleQueue) to record our messages. The prototype is working now; however, we have some problems.
Can readers modify the messages? We use a Chronicle Map to record the indices we have read, so that we can drop duplicate messages after a restart. In case that doesn't work, we want to tag messages as read on the reader side; actually, we already do that. The problem is that now we sometimes get error messages like "15c77d8be (62) was 8000003f is now 3f", and we suspect this is because writes across cache-line boundaries are no longer atomic. What is the recommended way to solve this? Currently we add a one-byte tag before the message; would adding 3 bytes of padding solve the problem?
Can we use our own roll policy? We'd like to use an hourly policy, but the built-in hourly policy limits a file to fewer than 256 million entries. Can we use a custom roll cycle? Are there any caveats?
One common approach is to record your consumers' read indices in another output queue. On restart, simply read backwards from the end of the output queue to determine what each consumer's read sequence should be.
Without seeing your code it is a little difficult to determine what the problem might be with trying to modify existing records. Note that records inserted into a queue are supposed to be immutable; modifying them from a reader thread is not supported.
With regards to your RollCycle requirements, the LARGE_HOURLY cycle was recently added, allowing ~2 billion entries per cycle:
https://github.com/OpenHFT/Chronicle-Queue/blob/master/src/main/java/net/openhft/chronicle/queue/RollCycles.java#L27
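A minimal sketch of selecting that cycle when building the queue, assuming a Chronicle Queue version that already ships LARGE_HOURLY; the path is a placeholder:

    import net.openhft.chronicle.queue.RollCycles;
    import net.openhft.chronicle.queue.impl.single.SingleChronicleQueue;
    import net.openhft.chronicle.queue.impl.single.SingleChronicleQueueBuilder;

    // Build an hourly-rolling queue that allows ~2 billion entries per cycle.
    SingleChronicleQueue queue = SingleChronicleQueueBuilder
        .binary("/data/my-queue")               // placeholder path
        .rollCycle(RollCycles.LARGE_HOURLY)
        .build();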

Dataflow doesn't update GroupByKey's "Output collections" field

Dataflow does not update GroupByKey's Output Collections field, even though I can see the output data in Cloud Storage (the next transform in the pipeline writes the output to GCS). None of the transforms after GroupByKey show input/output collections either. I have also tried data-driven triggering using AfterPane.elementCountAtLeast(3000), but even then the Output Collections field does not get updated after 3000 elements have been input.
The problem is that the Estimated Size keeps increasing, and I am afraid it will eventually lead to more workers, costing me more money. I have been waiting for more than an hour but it still doesn't get updated. I have an unbounded PCollection, and I have set the windowing and triggering as shown below:
input
  .apply(
    Window
      .into[String](FixedWindows.of(Duration.standardMinutes(windowSize)))
      .withAllowedLateness(Duration.standardMinutes(windowSize))
      .discardingFiredPanes()
      .triggering(AfterPane.elementCountAtLeast(3000)))
What might be the issue?

Why does EventHubClient.SendBatch() only support a single partition?

Apparently (based on an exception) EventHubClient.SendBatch and EventHubClient.SendBatchAsync only support sending to a single partition per operation. This appears to be indicated in the documentation by the method summary "Sends a batch of event data to the logical partition represented by PartitionId", which appears to have been copied from the partition-specific EventHubSender.SendBatch.
Are there design considerations (versus just writing less code) behind having the higher-level client not rebatch as needed?
EventHubClient has control over the partition-key hashing/distribution, which is not available to callers that wish to send a batch of data with differing keys that may land on the same partition. Left to rebatch myself, I need to make a number of calls on the order of the number of messages rather than the number of partitions, which with small messages is easily a two-orders-of-magnitude difference.
Since it's already necessary to rebatch, it could be worse.
I was assuming the PartitionKey on the EventData objects in the batch would be used to partition them out, but apparently not.
However, Paolo Salvatori wrote a nice set of extension methods that provide good, easy support for sending batches to Event Hubs.
You'll probably like his post here: http://blogs.msdn.com/b/paolos/archive/2015/03/26/how-to-implement-a-partitioned-sendbatch-method-for-azure-service-bus-entities.aspx
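For reference, the gist of the partitioned-batch approach described in that post, as a rough Java-flavoured sketch; EventData, getPartitionKey() and sendBatch() are hypothetical stand-ins here, not the real SDK signatures:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Group events by partition key and send one batch per key, so the number of
    // send calls scales with the number of distinct keys rather than the number
    // of messages.
    Map<String, List<EventData>> byKey = events.stream()
        .collect(Collectors.groupingBy(EventData::getPartitionKey));

    for (Map.Entry<String, List<EventData>> entry : byKey.entrySet()) {
        sendBatch(entry.getValue(), entry.getKey());   // one call per partition key
    }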
Best regards

Amazon - DynamoDB Strong consistent reads, Are they latest and how?

In an attempt to use DynamoDB for one of my projects, I have a question regarding its strong consistency model. From the FAQs:
Strongly Consistent Reads — in addition to eventual consistency, Amazon DynamoDB also gives you the flexibility and control to request a strongly consistent read if your application, or an element of your application, requires it. A strongly consistent read returns a result that reflects all writes that received a successful response prior to the read.
From the definition above, what I understand is that a strongly consistent read will return the latest written value.
To take an example: let's say Client1 issues a write command on key K1 to update the value from V0 to V1. A few milliseconds later, Client2 issues a read command for key K1. With strong consistency, V1 will always be returned, whereas with eventual consistency either V1 or V0 may be returned. Is my understanding correct?
If it is, what happens when the write operation returns success but the data has not yet been propagated to all replicas and we then issue a strongly consistent read? How will DynamoDB ensure that the latest written value is returned in that case?
The following link, AWS DynamoDB read after write consistency - how does it work theoretically?, tries to explain the architecture behind this, but I don't know whether that is how it actually works. The next question that comes to my mind after going through the link is: is DynamoDB based on a single-master, multiple-slave architecture, where writes and strongly consistent reads go through the master replica and normal reads go through the others?
Short answer: writing successfully in strongly consistent mode requires that your write succeed on a majority of the servers that can contain the record; therefore any future consistent read will always see that data, because a consistent read must consult a majority of the servers that can contain the desired record. If you do not perform a strongly consistent read, the system asks a random server for the record, and it is possible that the data will not be up to date.
Imagine three servers. Server 1, server 2 and server 3. To write a strongly consistent record, you pick two servers at minimum, and write the data. Let's pick 1 and 2.
Now you want to read the data consistently. Pick a majority of servers. Let's say we picked 2 and 3.
Server 2 has the new data, and this is what the system returns.
Eventually consistent reads could come from server 1, 2, or 3. This means that if server 3 is chosen at random, your new write will not appear until replication occurs.
If a single server fails, your data is still safe, but if two out of three servers fail your new write may be lost until the offline servers are restored.
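A toy sketch of that three-server walk-through (not DynamoDB's actual implementation); each replica stores a version/value pair, a write must land on a majority, and a consistent read asks a majority and keeps the highest version it sees:

    import java.util.Comparator;
    import java.util.List;

    class Replica {
        long version;
        String value;
    }

    // Write to a majority of the replicas, e.g. servers 1 and 2.
    static void quorumWrite(List<Replica> writeMajority, long version, String value) {
        for (Replica r : writeMajority) {
            r.version = version;
            r.value = value;
        }
    }

    // Read from a majority of the replicas, e.g. servers 2 and 3. Any two majorities
    // overlap in at least one replica, so the highest version seen is the latest write.
    static String quorumRead(List<Replica> readMajority) {
        return readMajority.stream()
            .max(Comparator.comparingLong(r -> r.version))
            .map(r -> r.value)
            .orElse(null);
    }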
More explanation:
DynamoDB (assuming it is similar to the database described in the Dynamo paper that Amazon released) uses a ring topology, where data is spread across many servers. Strong consistency is guaranteed because you directly query all relevant servers and get the current data from them. There is no master in the ring and there are no slaves in the ring. A given record maps to a number of identical hosts in the ring, and all of those servers contain that record. There is no slave that could lag behind, and there is no master that can fail.
Feel free to read any of the many papers on the topic. A similar database called Apache Cassandra is available which also uses ring replication.
http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
Disclaimer: the following cannot be verified from the public DynamoDB documentation, but it is probably very close to the truth.
Starting from the theory, DynamoDB makes use of quorums, where V is the total number of replica nodes, Vr is the number of replica nodes a read operation consults, and Vw is the number of replica nodes each write is performed on. Two standard conditions apply: Vr + Vw > V guarantees that a read overlaps the latest successful write, and Vw > V/2 guarantees that two concurrent writes cannot both succeed on disjoint sets of replicas. The read quorum (Vr) can therefore be leveraged to make sure the client gets the latest value, while the write quorum (Vw) can be leveraged to make sure that writes do not create conflicts.
Based on the fact that there are no write conflicts in DynamoDB (since these would have to be reconciled by the client, and would thus be exposed in the API), we can conclude that DynamoDB uses a Vw that respects the second condition (Vw > V/2), probably just V/2 + 1 to keep write latency low.
Now, regarding read quorums, DynamoDB provides two different kinds of read. A strongly consistent read uses a read quorum that respects the first condition (Vr + Vw > V), probably the smallest Vr that still satisfies it given Vw = V/2 + 1 for writes as before. An eventually consistent read, however, can use a single random replica (Vr = 1), making it much quicker but giving no consistency guarantee.
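To put illustrative numbers on this (these are not documented DynamoDB values): with V = 3 replicas, a write quorum of Vw = 2 satisfies Vw > V/2, and a strongly consistent read with Vr = 2 satisfies Vr + Vw = 4 > 3 = V, so the read set always overlaps the write set in at least one replica holding the latest write; an eventually consistent read with Vr = 1 provides no such overlap guarantee.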
Note: there is a possibility that the write quorum used does not respect the second condition (Vw > V/2), but that would mean DynamoDB resolves such conflicts automatically (e.g. by selecting the latest write based on local time) without reconciliation by the client. I believe this is unlikely, since there is no such reference in the DynamoDB documentation. Even in that case, though, the rest of the reasoning stays the same.
You can find the answer to your question here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/APISummary.html
When you issue a strongly consistent read request, Amazon DynamoDB returns a response with the most up-to-date data that reflects updates by all prior related write operations to which Amazon DynamoDB returned a successful response.
In your example, if the UpdateItem request to update the value from V0 to V1 was successful, a subsequent strongly consistent read request will return V1.
Hope this helps.