GCP Dataflow droppedDueToClosedWindow & Commit request for stage S8 and key 8 is larger than 2GB

We have run into problems with our Dataflow pipeline on Google Cloud. The pipeline consists of various input steps, which get data pushed in via GCP Pub/Sub. We then aggregate the data and sort it. These steps [1] are clearly too heavy for Dataflow and the window we configured. We get an exception [2] on that step. We also see these metrics:
droppedDueToClosedWindow 3,838,662 Bids/AggregateExchangeOrders
droppedDueToClosedWindow 21,060,627 Asks/AggregateExchangeOrders
Now I am seeking advice on how to attack this issue. Should I break down the steps so that, for example, iterations and sorting can be done in parallel steps?
Is there a way to get more information about what exactly happens?
Should we increase the number of workers? (Currently 1).
We are rather new to Dataflow. Good advice is most welcome.
Edit: I am adding a bit of detail on the steps.
This is how the steps below are 'chained' together:
@Override
public PCollection<KV<KV<String, String>, List<ExchangeOrder>>> expand(
        PCollection<KV<String, KV<String, String>>> input) {
    return input.apply("PairWithType", new ByPairWithType(type))
        .apply("UnfoldExchangeOrders", new ByAggregatedExchangeOrders())
        .apply("AggregateExchangeOrders",
            GroupByKey.<KV<String, String>, KV<String, KV<BigDecimal, BigDecimal>>>create())
        .apply("ReorderExchangeOrders", ParDo.of(new ReorderExchangeOrders()));
}
AggregateExchangeOrders:
So here, clearly, we iterate through a collection of orders and parse each value (twice) into a BigDecimal.
Which makes me think we could skip one parse step, as described here:
Convert string to BigDecimal in Java
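For reference, the single-parse variant presumably boils down to something like this (a standalone illustration with a made-up value, not part of the pipeline):

import java.math.BigDecimal;

public class ParseExample {
    public static void main(String[] args) {
        // Current approach: two parses (String -> double -> BigDecimal).
        BigDecimal viaDouble = BigDecimal.valueOf(Double.valueOf("0.1"));
        // Single parse: String -> BigDecimal directly, no detour through binary floating point.
        BigDecimal direct = new BigDecimal("0.1");
        System.out.println(viaDouble + " vs " + direct); // prints: 0.1 vs 0.1
    }
}

The current processElement looks like this: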
@ProcessElement
public void processElement(ProcessContext c) {
    KV<String, KV<String, String>> key = c.element().getKey();
    List<KV<String, String>> value = c.element().getValue();
    value.forEach(
        exchangeOrder -> {
            try {
                BigDecimal unitPrice = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getKey()));
                BigDecimal quantity = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getValue()));
                if (quantity.compareTo(BigDecimal.ZERO) != 0) {
                    // Exclude exchange orders with no quantity.
                    c.output(KV.of(key.getValue(), KV.of(key.getKey(), KV.of(unitPrice, quantity))));
                }
            } catch (NumberFormatException e) {
                // Exclude exchange orders with invalid element.
            }
        });
}
Next we group and sort (and optionally reverse). This step does not seem to take a huge load.
ReorderExchangeOrders:
@ProcessElement
public void processElement(ProcessContext c) {
    KV<String, String> pairAndType = c.element().getKey();
    Iterable<KV<String, KV<BigDecimal, BigDecimal>>> exchangeOrderBook = c.element().getValue();
    List<ExchangeOrder> list = new ArrayList<>();
    exchangeOrderBook.forEach(exchangeOrder -> list.add(
        new ExchangeOrder(exchangeOrder.getKey(),
            exchangeOrder.getValue().getKey(),
            exchangeOrder.getValue().getValue())));
    // Asks are sorted in ASC order.
    Collections.sort(list);
    // Bids are sorted in DESC order.
    if (pairAndType.getValue().equals(EXCHANGE_ORDER_TYPE.BIDS.toString())) {
        Collections.reverse(list);
    }
    c.output(KV.of(pairAndType, list));
}
[1] Dataflow screenshot:
[2] Exception: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed.
java.lang.IllegalStateException: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed. This may be caused by grouping a very large amount of data in a single window without using Combine, or by producing a large amount of data from a single input element.
com.google.cloud.dataflow.worker.StreamingDataflowWorker$Commit.getSize(StreamingDataflowWorker.java:327)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.lambda$new$0(StreamingDataflowWorker.java:342)

The error message is kind of straightforward.
The root cause of the problem, as many of the comments point out, is that the structure that contains all the results for one of the DoFns is larger than 2GB, and your best option would be to partition your data in some way to make your work units smaller.
In the code I see that some of the structures returned by the DoFns are nested structures of the form KV<KV<Key1, Key2>, Value>. This arrangement forces Dataflow to send the whole response back in one monolithic bundle and prevents it from chunking it into smaller pieces.
One possible solution would be to use composite keys instead of nested structures for as long as possible in the pipeline, and only combine them when strictly necessary.
For example, instead of KV<KV<Key1, Key2>, Value>, the DoFn could return
KV<concat(Key1, Key2), Value>
This would split the work units into much smaller sets that can then be dispatched in parallel to multiple workers.
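For illustration, here is a minimal, untested sketch of that idea in Beam-style Java; the FlattenKeyFn name, the Double value type and the "#" separator are assumptions, not part of the original pipeline:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Flattens a nested KV<KV<Key1, Key2>, Value> element into KV<concat(Key1, Key2), Value>
// so that downstream transforms group on a plain string key. The "#" separator must not
// occur in the key parts.
public class FlattenKeyFn extends DoFn<KV<KV<String, String>, Double>, KV<String, Double>> {

    @ProcessElement
    public void processElement(ProcessContext c) {
        KV<String, String> nestedKey = c.element().getKey();
        String compositeKey = nestedKey.getKey() + "#" + nestedKey.getValue();
        c.output(KV.of(compositeKey, c.element().getValue()));
    }
}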
To answer the other questions: increasing the number of workers will have no effect, as the huge collection generated by the DoFn does not look splittable. Adding logging to see how the collection reaches 2GB might provide useful hints on how to prevent this.

Related

How to query big data in DynamoDB in best practice

I have a scenario: query the list of students in a school, by year, and then use that information to do some other tasks, let's say printing a certificate for each student.
I'm using the serverless framework to deal with that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
    var params = {
        TableName: `schoolTable`,
        // Use placeholders plus ExpressionAttributeValues so the key condition is valid.
        KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
        ExpressionAttributeValues: { ':school_id': _school_id, ':year': _year },
    };
    try {
        let _students = [];
        let items;
        do {
            items = await dynamoClient.query(params).promise();
            // Accumulate pages instead of overwriting the previous one.
            _students = _students.concat(items.Items);
            params.ExclusiveStartKey = items.LastEvaluatedKey;
        } while (typeof items.LastEvaluatedKey != 'undefined');
        return _students;
    } catch (e) {
        console.log('Error: ', e);
    }
};
const mainHandler = async (event, context) => {
    …
    let students = await queryStudent(body.school_id, body.year);
    await printCerificate(students);
    …
};
So far, it’s working well with about 5k students (just sample data)
My concern: is it a scalable solution to query large data in DynamoDB?
As I know, Lambda has limited execution time; if the number of students goes up to a million, does the above solution still work?
Any best practice approach for this scenario is very appreciated and welcome.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means that they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use things like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather with the data.
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) in 15 minutes, and if it fails before it's completely through, you lose the information about how far you've come. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check if it exists on new Lambda invocations, and start from there (see the sketch after these points).
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function that does the querying. This is a solution that won't scale if it's a synchronous process and you have a million students. If stuff fails, you also have to consider retries and build that logic into your code.
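The question's code is Node.js, but to stay with the Java used elsewhere on this page, here is a rough sketch of that checkpointing idea using the AWS SDK for Java v2. The table and attribute names are taken from the question; persisting the checkpoint (to S3, a DynamoDB item, Step Functions state, ...) is left as a hypothetical saveCheckpoint call:

import java.util.HashMap;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class CheckpointedQuery {

    // Runs one page of the query, resuming from a previously saved LastEvaluatedKey.
    public static void queryPage(DynamoDbClient dynamo, String schoolId, String year,
                                 Map<String, AttributeValue> checkpoint) {
        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":school_id", AttributeValue.builder().s(schoolId).build());
        values.put(":year", AttributeValue.builder().s(year).build());

        QueryRequest.Builder request = QueryRequest.builder()
                .tableName("schoolTable")
                .keyConditionExpression("partition_key = :school_id AND begins_with(sort_key, :year)")
                .expressionAttributeValues(values);
        if (checkpoint != null && !checkpoint.isEmpty()) {
            request.exclusiveStartKey(checkpoint); // resume where the previous invocation stopped
        }

        QueryResponse response = dynamo.query(request.build());
        // Process response.items() here, e.g. push each student onto an SQS queue.

        if (response.hasLastEvaluatedKey()) {
            // saveCheckpoint(response.lastEvaluatedKey()); // hypothetical: persist for the next invocation
        }
    }
}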
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across sharded partition keys and writes each student into an SQS queue of certificate-printing tasks. If the Lambda notices it's close to the runtime limit, it returns the LastEvaluatedKey, and the Step Function recognizes that and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
Implementing this requires more effort, so I'd first figure out whether a million students per school per year is a realistic number :-)

Aggregating a huge list from reducer input without running out of memory

At the reduce stage (around 67% of the reduce phase), my code ends up getting stuck and failing after hours of attempting to complete. I found out that the issue is that the reducer is receiving huge amounts of data that it can't handle and ends up running out of memory, which leads to the reducer being stuck.
Now I am trying to find a way around this. Currently, I assemble a list from the values received by the reducer for each key. At the end of the reduce phase, I try to write the key and all of the values in the list. So my question is: how can I get the same functionality of having the key and the list of values related to that key without running out of memory?
public class XMLReducer extends Reducer<Text, Text, Text, TextArrayWritable> {
    private final Logger logger = Logger.getLogger(XMLReducer.class);

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        //logger.info(key.toString());
        Set<String> filesFinal = new HashSet<>();
        int size = 0;
        for (Text value : values) {
            String[] files = value.toString().split(",\\s+");
            filesFinal.add(value.toString());
            //size++;
        }
        //logger.info(Integer.toString(size));
        String[] temp = new String[filesFinal.size()];
        temp = filesFinal.toArray(temp);
        Text[] tempText = new Text[filesFinal.size()];
        for (int i = 0; i < filesFinal.size(); i++) {
            tempText[i] = new Text(temp[i]);
        }
    }
}
and TextArrayWritable is just a way to write an array to file
You can try reducing the amount of data that is read by the single reducer by writing a custom Partitioner.
HashPartitioner is the default partitioner used by a MapReduce job. While this generally gives a uniform distribution, in some cases it is quite possible that many keys get hashed to a single reducer. As a result, a single reducer would have a lot of data compared to the others. In your case, I think this is the issue.
To resolve this:
1. Analyze your data and the key on which you are doing the group by.
2. Try to come up with a partitioning function based on your group-by key for your custom Partitioner. Try limiting the number of keys for each partition.
You would see an increase in the number of reduce tasks in your job. If the issue is related to uneven key distribution, the solution I proposed should resolve your issue.
You could also try increasing reducer memory.
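For reference, a custom Partitioner is a small class; a minimal skeleton might look like this (the actual partitioning rule has to come from your own key analysis, so the body below simply mirrors the default HashPartitioner behaviour):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomKeyPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Placeholder rule, same as HashPartitioner; replace it with a rule derived from
        // your group-by key so that heavy keys are spread across reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

You would register it with job.setPartitionerClass(CustomKeyPartitioner.class) and, if needed, raise job.setNumReduceTasks(...) so there are enough partitions to spread the load.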

Joining a stream against a "table" in Dataflow

Let me use a slightly contrived example to explain what I'm trying to do. Imagine I have a stream of trades coming in, with the stock symbol, share count, and price: { symbol = "GOOG", count = 30, price = 200 }. I want to enrich these events with the name of the stock, in this case "Google".
For this purpose I want to, inside Dataflow, maintain a "table" of symbol->name mappings that is updated by a PCollection<KV<String, String>>, and join my stream of trades with this table, yielding e.g. a PCollection<KV<Trade, String>>.
This seems like a thoroughly fundamental use case for stream processing applications, yet I'm having a hard time figuring out how to accomplish this in Dataflow. I know it's possible in Kafka Streams.
Note that I do not want to use an external database for the lookups – I need to solve this problem inside Dataflow or switch to Kafka Streams.
I'm going to describe two options. One using side-inputs which should work with the current version of Dataflow (1.X) and one using state within a DoFn which should be part of the upcoming Dataflow (2.X).
Solution for Dataflow 1.X, using side inputs
The general idea here is to use a map-valued side-input to make the symbol->name mapping available to all the workers.
This table will need to be in the global window (so nothing ever ages out), will need to be triggered for every element (or as often as you want new updates to be produced), and will need to accumulate elements across all firings. It will also need some logic to take the latest name for each symbol.
The downside to this solution is that the entire lookup table will be regenerated every time a new entry comes in and it will not be immediately pushed to all workers. Rather, each will get the new mapping "at some point" in the future.
At a high level, this pipeline might look something like this (I haven't tested the code, so there may be some typos):
PCollection<KV<Symbol, Name>> symbolToNameInput = ...;
final PCollectionView<Map<Symbol, Iterable<Name>>> symbolToNames = symbolToNameInput
    .apply(Window.<KV<Symbol, Name>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5))))
        .accumulatingFiredPanes())
    .apply(View.asMultimap());
Note that we had to use View.asMultimap here. This means that we actually build up all the names for every symbol. When we look things up, we'll need to make sure to take the latest name in the iterable.
PCollection<Detail> symbolDetails = ...;
symbolDetails
    .apply(ParDo.withSideInputs(symbolToNames).of(new DoFn<Detail, AugmentedDetails>() {
        @Override
        public void processElement(ProcessContext c) {
            Iterable<Name> names = c.sideInput(symbolToNames).get(c.element().symbol());
            Name name = chooseName(names);
            c.output(augmentDetails(c.element(), name));
        }
    }));
Solution for Dataflow 2.X, using the State API
This solution uses a new feature that will be part of the upcoming Dataflow 2.0 release. It is not yet part of the preview releases (currently Dataflow 2.0-beta1) but you can watch the release notes to see when it is available.
The general idea is that keyed state allows us to store some values associated with the specific key. In this case, we're going to remember the latest "name" value we've seen.
Before running the stateful DoFn, we're going to wrap each element into a common element type (a NameOrDetails object). This would look something like the following:
// Convert SymbolToName entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> left = symbolToName
    .apply(ParDo.of(new DoFn<SymbolToName, KV<Symbol, NameOrDetails>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            SymbolToName e = c.element();
            c.output(KV.of(e.getSymbol(), NameOrDetails.name(e.getName())));
        }
    }));
// Convert detailed entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> right = details
    .apply(ParDo.of(new DoFn<Details, KV<Symbol, NameOrDetails>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            Details e = c.element();
            c.output(KV.of(e.getSymbol(), NameOrDetails.details(e)));
        }
    }));
// Flatten the two streams together
PCollectionList.of(left).and(right)
    .apply(Flatten.pCollections())
    .apply(ParDo.of(new DoFn<KV<Symbol, NameOrDetails>, AugmentedDetails>() {
        @StateId("name")
        private final StateSpec<ValueState<String>> nameSpec =
            StateSpecs.value(StringUtf8Coder.of());

        @ProcessElement
        public void processElement(ProcessContext c,
                @StateId("name") ValueState<String> nameState) {
            NameOrDetails e = c.element().getValue();
            if (e.isName()) {
                nameState.write(e.getName());
            } else {
                String name = nameState.read();
                if (name == null) {
                    // Use the symbol if we haven't received a mapping yet.
                    name = c.element().getKey();
                }
                c.output(e.getDetails().withName(name));
            }
        }
    }));

Creating an Akka stream for parallel processing of collection elements

I am trying to define a graph for an Akka stream that contains a parallel processing flow (I am using Akka.NET, but this shouldn't matter). Imagine a data source of orders; each order consists of an order ID and a list of products (order items). The workflow is as follows:
1. Receive an order.
2. Broadcast the order to two flows: flow A will deal with the order items, flow B will deal with the order ID (some bookkeeping work).
3. Flow A: split the collection of order items into individual elements, each one to be processed separately.
4. Flow A: for each order item that results from the split in the previous step, call some external service which looks up extra information (price, availability, etc.).
5. Flow B: do some extra bookkeeping for the given order ID.
6. Merge flows A and B.
7. Send the merged data from the previous step to the sink, which results in enriched order information.
Steps 1 (Source.From), 2 (Broadcast), 4-5 (Map), 6 (Merge) and 7 (Sink) look OK. But how is the collection split in step 3 implemented in Akka or Reactive Streams terms? This is not broadcasting or flattening: a collection of N elements needs to be split into N independent substreams that will later be merged back. How is this achieved?
I recommend doing it in one flow. I know two flows look cooler, but trust me, it's not worth it in terms of simplicity of design (I tried). You may write something like this:
import akka.stream.scaladsl.{Flow, Sink, Source, SubFlow}
import scala.collection.immutable
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

case class Item()
case class Order(items: List[Item])

val flow = Flow[Order]
  .mapAsync(4) { order =>
    Future {
      // Enrich your order here
      order
    }
  }
  .mapConcat { order =>
    order.items.map(order -> _)
  }
  .mapAsync(4) { case (order, item) =>
    Future {
      // Enrich your item here
      order -> item
    }
  }
  .groupBy(2, tuple => tuple._1)
  .fold[Map[Order, List[Item]]](immutable.Map.empty) { case (map, (order, item)) =>
    map.updated(order, map.getOrElse(order, Nil) :+ item)
  }
  .mapConcat { _.map { case (order, newItems) => order.copy(items = newItems) } }
But even this approach is bad. There are so many things that can go wrong, either with the code above or with your design. What will you do if enrichment of one of the order's items fails? What if enrichment of the order object fails? What should happen to your stream(s)?
If I were you, I'd have a Flow[Order] and process its items inside mapAsync, so at least it guarantees I don't end up with partially processed orders.

Querying a growing data-set

We have a data set that grows while the application is processing the data set. After a long discussion we have come to the decision that we do not want blocking or asynchronous APIs at this time, and we will periodically query our data store.
We thought of two options to design an API for querying our storage:
1. A query method returns a snapshot of the data and a flag indicating whether we might have more data. When we finish iterating over the last returned snapshot, we query again to get another snapshot for the rest of the data.
2. A query method returns a "live" iterator over the data, and when this iterator advances it returns one of the following options: Data is available, No more data, Might have more data.
We are using C++ and we borrowed the .NET style enumerator API for reasons which are out of scope for this question. Here is some code to demonstrate the two options. Which option would you prefer?
/* ======== FIRST OPTION ============== */
// Similar to the familiar .NET enumerator.
class IFooEnumerator
{
    // true  --> A data element may be accessed using the Current() method.
    // false --> End of sequence. Calling Current() is an invalid operation.
    virtual bool MoveNext() = 0;
    virtual Foo Current() const = 0;
    virtual ~IFooEnumerator() {}
};

enum class Availability
{
    EndOfData,
    MightHaveMoreData,
};

class IDataProvider
{
    // Query params allow specifying the ID of the starting element. Here is the intended usage pattern:
    // 1. Call GetFoo() without specifying a starting point.
    // 2. Process all elements returned by IFooEnumerator until it ends.
    // 3. Check the availability.
    // 3.1 MightHaveMoreData --> Invoke GetFoo() again after some time by specifying the last processed
    //     element as the starting point, and repeat steps (2) and (3).
    // 3.2 EndOfData --> The data set will not grow any more and we know that we have finished processing.
    virtual std::tuple<std::unique_ptr<IFooEnumerator>, Availability> GetFoo(query-params) = 0;
};
/* ====== SECOND OPTION ====== */
enum class Availability
{
    HasData,
    MightHaveMoreData,
    EndOfData,
};

class IGrowingFooEnumerator
{
    // HasData:
    //   We may access the current data element by invoking Current().
    // EndOfData:
    //   The data set has finished growing and no more data elements will arrive later.
    // MightHaveMoreData:
    //   The data set will grow and we need to continue calling MoveNext() periodically
    //   (preferably after a short delay) until we get a "HasData" or "EndOfData" result.
    virtual Availability MoveNext() = 0;
    virtual Foo Current() const = 0;
    virtual ~IGrowingFooEnumerator() {}
};

class IDataProvider
{
    virtual std::unique_ptr<IGrowingFooEnumerator> GetFoo(query-params) = 0;
};
Update
Given the current answers, here is some clarification. The debate is mainly over the interface: its expressiveness and intuitiveness in representing queries for a growing data-set that will, at some point in time, stop growing. The implementation of both interfaces is possible without race conditions (at least we believe so) because of the following properties:
The 1st option can be implemented correctly if the pair of the iterator + the flag represent a snapshot of the system at the time of querying. Getting snapshot semantics is a non-issue, as we use database transactions.
The 2nd option can be implemented given a correct implementation of the 1st option. The "MoveNext()" of the 2nd option will, internally, use something like the 1st option and re-issue the query if needed.
The data-set can change from "Might have more data" to "End of data", but not vice versa. So if we, wrongly, return "Might have more data" because of a race condition, we just get a small performance overhead because we need to query again, and the next time we will receive "End of data".
"Invoke GetFoo() again after some time by specifying the last processed element as the starting point"
How are you planning to do that? If it's by using the earlier-returned IFooEnumerator, then functionally the two options are equivalent. Otherwise, letting the caller destroy the "enumerator" and then, however long afterwards, call GetFoo() to continue iteration means you lose the ability to monitor the client's ongoing interest in the query results. It might be that right now you have no need for that, but I think it's poor design to exclude the ability to track state throughout the overall result processing.
Whether the overall system will work at all really depends on many things (not going into details about your actual implementation):
No matter how you twist it, there will be a race condition between checking for "is there more data" and more data being added to the system, which means it is possibly pointless to try to capture the last few data items.
You probably need to limit the number of repeated runs of "is there more data", or you could end up in an endless loop of "new data came in while processing the last lot".
How easy it is to know whether data has been updated: if all the updates are "new items" with new IDs that are sequentially higher, you can simply query "is there data above X", where X is your last ID. But if you are, for example, counting how many items in the data have property Y set to value A, and the data may be updated anywhere in the database at any time (e.g. a database of where taxis are at present, which gets updated via GPS every few seconds and has thousands of cars), it may be hard to determine which cars have had updates since the last time you read the database.
As to your implementation: in option 2, I'm not sure what you mean by the MightHaveMoreData state - either it has more data, or it hasn't, right? Repeated polling for more data is a bad design in this case, given that you will never be able to say with 100% certainty that no "new data" arrived in the time it took from fetching the last data until it was processed and acted on (displayed, used to buy shares on the stock market, stopped the train, or whatever it is that you want to do once you have processed your new data).
A read-write lock could help. Many readers can have simultaneous access to the data set, while only one writer can modify it.
The idea is simple:
- when you need read-only access, the reader takes the read lock, which can be shared with other readers but is exclusive with writers;
- when you need write access, the writer takes the write lock, which is exclusive for both readers and writers.
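The pattern itself is language-agnostic; here is a rough sketch in Java (to match the rest of this page), with ReentrantReadWriteLock standing in for C++'s std::shared_mutex and a made-up GrowingDataSet container:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Many readers may hold the read lock at once; the write lock is exclusive
// against both readers and other writers.
public class GrowingDataSet<T> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<T> items = new ArrayList<>();

    public void add(T item) {               // writer: exclusive access
        lock.writeLock().lock();
        try {
            items.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public List<T> snapshot() {             // reader: shared access
        lock.readLock().lock();
        try {
            return new ArrayList<>(items);  // copy so callers can iterate without holding the lock
        } finally {
            lock.readLock().unlock();
        }
    }
}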