Aggregating a huge list from reducer input without running out of memory

At the reduce stage (stuck at 67% reduce progress), my code gets stuck and fails after hours of attempting to complete. I found out that the reducer is receiving huge amounts of data that it can't handle, so it runs out of memory, which is what leaves it stuck.
Now I am trying to find a way around this. Currently I assemble a list from the values the reducer receives for each key, and at the end of the reduce phase I try to write the key and all of the values in that list. So my question is: how can I keep the same functionality of having the key and the list of values related to that key, without running out of memory?
public class XMLReducer extends Reducer<Text, Text, Text, TextArrayWritable> {
    private final Logger logger = Logger.getLogger(XMLReducer.class);

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Collect every distinct file name seen across all values for this key.
        Set<String> filesFinal = new HashSet<>();
        for (Text value : values) {
            String[] files = value.toString().split(",\\s+");
            Collections.addAll(filesFinal, files);
        }
        // Convert the set to a Text[] so it can be written as one TextArrayWritable.
        Text[] tempText = new Text[filesFinal.size()];
        int i = 0;
        for (String file : filesFinal) {
            tempText[i++] = new Text(file);
        }
        context.write(key, new TextArrayWritable(tempText));
    }
}
and TextArrayWritable is just a way to write an array to file

You can try reducing the amount of data read by a single reducer by writing a custom Partitioner.
HashPartitioner is the default partitioner used by a map-reduce job. While it spreads keys roughly uniformly across reducers, it is still possible for many keys (or a few very heavy keys) to hash to the same reducer, leaving that reducer with far more data than the others. In your case, I think this is the issue.
To resolve this:
Analyze your data and the key you are grouping by.
Come up with a partitioning function for your custom Partitioner based on that group-by key, and try to limit the number of keys per partition; a sketch follows this list.
You would then see an increase in the number of reduce tasks in your job. If the issue is uneven key distribution, this should resolve it.
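For illustration, here is a minimal sketch of such a custom Partitioner. It assumes a Text key shaped like "group:detail"; that layout is made up, so adapt the split to your own keys:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class GroupKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Spread records by the detail part of the key so that one hot
        // group does not pin all of its data to a single reducer.
        String[] parts = key.toString().split(":", 2);
        String spreadOn = parts.length > 1 ? parts[1] : parts[0];
        return (spreadOn.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
Register it with job.setPartitionerClass(GroupKeyPartitioner.class), and raise job.setNumReduceTasks(...) so the extra partitions actually map to more reducers.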
You could also try increasing reducer memory.
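On Hadoop 2.x that would mean raising the reducer container size and its JVM heap when setting up the job; the values below are only illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
conf.set("mapreduce.reduce.memory.mb", "4096");      // reducer container size
conf.set("mapreduce.reduce.java.opts", "-Xmx3584m"); // heap inside the container
Job job = Job.getInstance(conf, "xml aggregation");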

Related

How to query big data in DynamoDB in best practice

I have a scenario: query the list of students in a school by year, and then use that information to do some other tasks, say, printing a certificate for each student.
I'm using the Serverless Framework to handle this scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
  const params = {
    TableName: 'schoolTable',
    // Use expression attribute values; raw variable names are not
    // substituted inside a KeyConditionExpression string.
    KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: {
      ':school_id': _school_id,
      ':year': _year,
    },
  };
  try {
    let _students = [];
    let items;
    do {
      items = await dynamoClient.query(params).promise();
      _students = _students.concat(items.Items); // accumulate each page
      params.ExclusiveStartKey = items.LastEvaluatedKey;
    } while (typeof items.LastEvaluatedKey !== 'undefined');
    return _students;
  } catch (e) {
    console.log('Error: ', e);
  }
};
const mainHandler = async (event, context) => {
  …
  let students = await queryStudent(body.school_id, body.year);
  await printCertificate(students);
  …
};
So far, it's working well with about 5k students (just sample data).
My concern: is this a scalable solution for querying large amounts of data in DynamoDB?
As I understand it, Lambda has a limited execution time; if the number of students goes up to a million, will the above solution still work?
Any best-practice approach for this scenario is very much appreciated and welcome.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use things like read/write sharding here, e.g. add a suffix to the partition key and scatter-gather the data.
Lambda: Query: if you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) within its 15-minute limit, and if it fails before it's completely through, you lose track of how far you've come. You could add checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check whether it exists on new Lambda invocations, and continue from there.
Lambda: Processing: you seem to be creating a certificate for each student in a year in the same Lambda function that does the querying. This won't scale if it's a synchronous process and you have a million students. If things fail, you also have to consider retries and build that logic into your code.
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. The Step Function has a single Lambda function. That function queries the table across the sharded partition keys and writes each student into an SQS queue of certificate-printing tasks. If the Lambda function notices it's close to the runtime limit, it returns the LastEvaluatedKey; the Step Function recognizes that and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple the query from the processing and also get built-in retry logic for failed tasks in the form of the SQS/Lambda integration, as well as checkpointing for the query across many items.
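For reference, here is a hypothetical sketch of that checkpointed query step, written in Java to match the other DynamoDB examples on this page (the class name, the enqueueForPrinting placeholder, and the 30-second safety margin are all my own inventions):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;
import com.amazonaws.services.lambda.runtime.Context;
import java.util.List;
import java.util.Map;

public class CertificateQueryTask {
    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    // Returns the key to resume from, or null once every page has been read.
    // The Step Function re-invokes the task with the returned checkpoint.
    public Map<String, AttributeValue> run(QueryRequest request,
            Map<String, AttributeValue> checkpoint, Context context) {
        request.setExclusiveStartKey(checkpoint); // null on the first invocation
        while (true) {
            QueryResult page = dynamo.query(request);
            enqueueForPrinting(page.getItems()); // e.g. one SQS message per student
            Map<String, AttributeValue> lastKey = page.getLastEvaluatedKey();
            if (lastKey == null || lastKey.isEmpty()) {
                return null; // finished all pages
            }
            request.setExclusiveStartKey(lastKey);
            // Stop early and hand back a checkpoint before Lambda times out.
            if (context.getRemainingTimeInMillis() < 30_000) {
                return lastKey;
            }
        }
    }

    private void enqueueForPrinting(List<Map<String, AttributeValue>> students) {
        // Placeholder: push each student onto the certificate-printing queue.
    }
}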
Implementing this requires more effort, so I'd first figure out whether a million students per school per year is a realistic number :-)

Pagination in Dynamo DB Results with Completable Future

I am querying DynamoDB for a given primary key. The primary key consists of two UUID fields (fieldUUID1, fieldUUID2).
I have a lot of queries to execute for the above primary-key combination with a list of values, for which I am using asynchronous CompletableFutures with an ExecutorService and a thread pool of size 4.
After all the queries return their results, each a CompletableFuture<Object>, I join them using the allOf method of CompletableFuture, which ensures that all query executions are complete and gives me a CompletableFuture<Void>, from which I build a CompletableFuture<List<Object>> using a stream.
If some of the queries return a paginated result, i.e. return a lastEvaluatedKey, there is no way for me to know which QueryRequest produced it.
If I call .get() on the received CompletableFuture, it becomes a blocking operation, which defeats the purpose of using asynchronous calls. Is there a way I can handle this scenario?
example:
I can try the thenCompose method, but how do I know at what point I need to stop, i.e. when lastEvaluatedKey is absent?
for (final QueryRequest queryRequest : queryRequests) {
    final CompletableFuture<QueryResult> futureResult =
        CompletableFuture.supplyAsync(() ->
            dynamoDBClient.query(queryRequest), executorService);
    futures.add(futureResult);
}
// Wait for completion of all of the Futures provided
final CompletableFuture<Void> allfuture = CompletableFuture
    .allOf(futures.toArray(new CompletableFuture[futures.size()]));
// The return type of CompletableFuture.allOf() is CompletableFuture<Void>.
// The limitation of this method is that it does not return the combined results
// of all the futures, so we have to collect them manually. The
// CompletableFuture.join() method and the Java 8 Streams API make that simple:
final CompletableFuture<List<QueryResult>> allFutureList = allfuture.thenApply(val ->
    futures.stream().map(CompletableFuture::join).collect(Collectors.toList()));
try {
    // At this point all the futures are done, because we already executed
    // CompletableFuture.allOf, so this get() does not actually block.
    final List<QueryResult> returnedResult = allFutureList.get();
    for (final QueryResult queryResult : returnedResult) {
        if (MapUtils.isNotEmpty(queryResult.getLastEvaluatedKey())) {
            // how to get hold of the original request and include the last evaluated key?
        }
    }
} catch (InterruptedException | ExecutionException e) {
    // handle accordingly
}
I can rely on the .get() method, but it will be a blocking call.
The quick solution to your need is to change your futures list: instead of having it store CompletableFuture<QueryResult>, have it store CompletableFuture<RequestAndResult>, where RequestAndResult is a simple data class holding a QueryRequest and a QueryResult. To do that you need to change your first loop, as sketched below.
Then, once allfuture completes, you can iterate over futures and get access to both the requests and the results.
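A sketch of that data class and the adjusted first loop (futures then becomes a List<CompletableFuture<RequestAndResult>>):
// Simple holder pairing each request with its result.
final class RequestAndResult {
    final QueryRequest request;
    final QueryResult result;

    RequestAndResult(QueryRequest request, QueryResult result) {
        this.request = request;
        this.result = result;
    }
}

// The first loop, adjusted to capture the request alongside the result.
for (final QueryRequest queryRequest : queryRequests) {
    futures.add(CompletableFuture.supplyAsync(
        () -> new RequestAndResult(queryRequest, dynamoDBClient.query(queryRequest)),
        executorService));
}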
However, there is a deeper issue here. What are you planning to do once you have access to the original QueryRequest? My guess is that you want to issue a follow-up request with exclusiveStartKey set to whatever the response's lastEvaluatedKey holds. This means that you will wait for all the original queries to complete and only then issue the next batch. This is inefficient: if a query returned a lastEvaluatedKey, you want to issue its follow-up query as soon as possible.
To achieve this, my advice is to introduce a new method that takes a single QueryRequest object and returns a CompletableFuture<QueryResult>. Its implementation will be roughly as follows:
issue a query with the given request
once the result arrives, check it: if its lastEvaluatedKey is empty, return it as the result of the method
otherwise, update request.exclusiveStartKey and go back to the first step.
Yes, it's a bit harder to do this with CompletableFutures (compared to blocking code), but it is totally doable.
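Here is a sketch of such a method (the name queryFully is mine; note that it completes with the final page's QueryResult, so if you need every item you should accumulate them inside the chain):
private CompletableFuture<QueryResult> queryFully(QueryRequest request) {
    return CompletableFuture
        .supplyAsync(() -> dynamoDBClient.query(request), executorService)
        .thenCompose(result -> {
            if (MapUtils.isEmpty(result.getLastEvaluatedKey())) {
                // No more pages: complete with this result.
                return CompletableFuture.completedFuture(result);
            }
            // Issue the follow-up query as soon as this page arrives.
            request.setExclusiveStartKey(result.getLastEvaluatedKey());
            return queryFully(request);
        });
}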
Once you have that method, your code needs to call it once for each of the requests in queryRequests, put the returned CompletableFutures in a list, and call CompletableFuture.allOf() on that list. Once the allOf future completes you can just use the results - no need to issue follow-up queries.

GCP Dataflow droppedDueToClosedWindow & Commit request for stage S8 and key 8 is larger than 2GB

We are running into problems with our Dataflow pipeline on Google Cloud. Our pipeline consists of various input steps, which get data pushed in via GCP Pub/Sub. We then aggregate the data and sort it. These last steps are clearly too heavy for Dataflow and the window we configured, and we get an exception [2] on that step. We also see these metrics:
droppedDueToClosedWindow 3,838,662 Bids/AggregateExchangeOrders
droppedDueToClosedWindow 21,060,627 Asks/AggregateExchangeOrders
Now I am seeking advice on how to attack this issue. Should I break down the steps so that, for example, iteration and sorting can be done in parallel steps?
Is there a way to get more information about what exactly happens?
Should we increase the number of workers? (Currently 1).
We are rather new to Dataflow. Good advice is most welcome.
Edit: I am adding a bit of details on the steps.
This is how the steps below are 'chained' together:
@Override
public PCollection<KV<KV<String, String>, List<ExchangeOrder>>> expand(PCollection<KV<String, KV<String, String>>> input) {
    return input.apply("PairWithType", new ByPairWithType(type))
        .apply("UnfoldExchangeOrders", new ByAggregatedExchangeOrders())
        .apply("AggregateExchangeOrders", GroupByKey.<KV<String, String>, KV<String, KV<BigDecimal, BigDecimal>>>create())
        .apply("ReorderExchangeOrders", ParDo.of(new ReorderExchangeOrders()));
}
AggregateExchangeOrders:
So here we clearly iterate through a collection of orders and parse each value twice (String to Double to BigDecimal).
Which makes me think we could skip one parse step, as described here:
Convert string to BigDecimal in java
@ProcessElement
public void processElement(ProcessContext c) {
    KV<String, KV<String, String>> key = c.element().getKey();
    List<KV<String, String>> value = c.element().getValue();
    value.forEach(
        exchangeOrder -> {
            try {
                BigDecimal unitPrice = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getKey()));
                BigDecimal quantity = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getValue()));
                if (quantity.compareTo(BigDecimal.ZERO) != 0) {
                    // Exclude exchange orders with no quantity.
                    c.output(KV.of(key.getValue(), KV.of(key.getKey(), KV.of(unitPrice, quantity))));
                }
            } catch (NumberFormatException e) {
                // Exclude exchange orders with an invalid element.
            }
        });
}
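For reference, that single-step parse would look like this (it still throws NumberFormatException on bad input, so the existing catch block keeps working):
BigDecimal unitPrice = new BigDecimal(exchangeOrder.getKey());
BigDecimal quantity = new BigDecimal(exchangeOrder.getValue());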
...next we group and sort (and optionally reverse); this step does not seem to be taking a huge load.
ReorderExchangeOrders:
@ProcessElement
public void processElement(ProcessContext c) {
    KV<String, String> pairAndType = c.element().getKey();
    Iterable<KV<String, KV<BigDecimal, BigDecimal>>> exchangeOrderBook = c.element().getValue();
    List<ExchangeOrder> list = new ArrayList<>();
    exchangeOrderBook.forEach(exchangeOrder -> list.add(
        new ExchangeOrder(exchangeOrder.getKey(), exchangeOrder.getValue().getKey(), exchangeOrder.getValue().getValue())));
    // Asks are sorted in ASC order.
    Collections.sort(list);
    // Bids are sorted in DESC order.
    if (pairAndType.getValue().equals(EXCHANGE_ORDER_TYPE.BIDS.toString())) {
        Collections.reverse(list);
    }
    c.output(KV.of(pairAndType, list));
}
[1] Dataflow screenshot (not included here).
[2] Exception: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed.
java.lang.IllegalStateException: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed. This may be caused by grouping a very large amount of data in a single window without using Combine, or by producing a large amount of data from a single input element.
com.google.cloud.dataflow.worker.StreamingDataflowWorker$Commit.getSize(StreamingDataflowWorker.java:327)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.lambda$new$0(StreamingDataflowWorker.java:342)
The error message is kind of straightforward.
The root cause of the problem, as many of the comments point out, is that the structure that contains all the results for one of the DoFns is larger than 2GB, and your best option is to partition your data in some way to make the work units smaller.
In the code I see that some of the structures returned by DoFns are nested structures of the form KV<KV<String, String>, List<ExchangeOrder>>. This arrangement forces Dataflow to send the whole response back in one monolithic bundle and prevents it from chunking it into smaller pieces.
One possible solution is to use composite keys instead of nested structures for as long as possible in the pipeline, and only combine them when strictly necessary.
For example, instead of KV<KV<Key1, Key2>, Value>, the DoFn could return
KV<concat(Key1, Key2), Value>
This would split the work units into much smaller sets that can then be dispatched in parallel to multiple workers.
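As an illustration, a DoFn in this pipeline could emit a flat composite key like this (the "|" separator is an assumption; pick one that cannot occur in the data):
@ProcessElement
public void processElement(ProcessContext c) {
    KV<KV<String, String>, KV<String, KV<BigDecimal, BigDecimal>>> e = c.element();
    // Flatten the nested key into a single composite string key.
    String compositeKey = e.getKey().getKey() + "|" + e.getKey().getValue();
    c.output(KV.of(compositeKey, e.getValue()));
}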
To answer the other questions: increasing the number of workers will have no effect, as the huge collection generated by the DoFn does not appear to be splittable. Adding logging to see how the collection grows to 2GB might provide useful hints on how to prevent this.

Joining a stream against a "table" in Dataflow

Let me use a slightly contrived example to explain what I'm trying to do. Imagine I have a stream of trades coming in, with the stock symbol, share count, and price: { symbol = "GOOG", count = 30, price = 200 }. I want to enrich these events with the name of the stock, in this case "Google".
For this purpose I want to, inside Dataflow, maintain a "table" of symbol->name mappings that is updated by a PCollection<KV<String, String>>, and join my stream of trades with this table, yielding e.g. a PCollection<KV<Trade, String>>.
This seems like a thoroughly fundamental use case for stream processing applications, yet I'm having a hard time figuring out how to accomplish this in Dataflow. I know it's possible in Kafka Streams.
Note that I do not want to use an external database for the lookups – I need to solve this problem inside Dataflow or switch to Kafka Streams.
I'm going to describe two options. One using side-inputs which should work with the current version of Dataflow (1.X) and one using state within a DoFn which should be part of the upcoming Dataflow (2.X).
Solution for Dataflow 1.X, using side inputs
The general idea here is to use a map-valued side-input to make the symbol->name mapping available to all the workers.
This table will need to be in the global window (so nothing ever ages out), will need to trigger on every element (or as often as you want new updates to be produced), and will need to accumulate elements across all firings. It will also need some logic to take the latest name for each symbol.
The downside to this solution is that the entire lookup table will be regenerated every time a new entry comes in and it will not be immediately pushed to all workers. Rather, each will get the new mapping "at some point" in the future.
At a high level, this pipeline might look something like this (I haven't tested the code, so there may be some typos):
PCollection<KV<Symbol, Name>> symbolToNameInput = ...;
final PCollectionView<Map<Symbol, Iterable<Name>>> symbolToNames = symbolToNameInput
    .apply(Window.<KV<Symbol, Name>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5))))
        .accumulatingFiredPanes())
    .apply(View.asMultimap());
Note that we had to use View.asMultimap here. This means that we actually build up all the names for every symbol. When we look things up we'll need to make sure to take the latest name in the iterable (see the chooseName sketch after the snippet below).
PCollection<Detail> symbolDetails = ...;
symbolDetails
    .apply(ParDo.withSideInputs(symbolToNames).of(new DoFn<Detail, AugmentedDetails>() {
        @Override
        public void processElement(ProcessContext c) {
            Iterable<Name> names = c.sideInput(symbolToNames).get(c.element().symbol());
            Name name = chooseName(names);
            c.output(augmentDetails(c.element(), name));
        }
    }));
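chooseName is left undefined above; here is a minimal sketch, assuming Name exposes an update timestamp (without some ordering field there is no way to tell which entry of the iterable is newest):
private static Name chooseName(Iterable<Name> names) {
    Name latest = null;
    for (Name candidate : names) {
        // getTimestamp() is a hypothetical accessor on Name.
        if (latest == null || candidate.getTimestamp() > latest.getTimestamp()) {
            latest = candidate;
        }
    }
    return latest;
}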
Solution for Dataflow 2.X, using the State API
This solution uses a new feature that will be part of the upcoming Dataflow 2.0 release. It is not yet part of the preview releases (currently Dataflow 2.0-beta1) but you can watch the release notes to see when it is available.
The general idea is that keyed state allows us to store some values associated with the specific key. In this case, we're going to remember the latest "name" value we've seen.
Before running the stateful DoFn, we're going to wrap each element into a common element type (a NameOrDetails object). This would look something like the following:
// Convert SymbolToName entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> left = symbolToName
    .apply(ParDo.of(new DoFn<SymbolToName, KV<Symbol, NameOrDetails>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            SymbolToName e = c.element();
            c.output(KV.of(e.getSymbol(), NameOrDetails.name(e.getName())));
        }
    }));
// Convert detailed entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> right = details
    .apply(ParDo.of(new DoFn<Details, KV<Symbol, NameOrDetails>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            Details e = c.element();
            c.output(KV.of(e.getSymbol(), NameOrDetails.details(e)));
        }
    }));
// Flatten the two streams together
PCollectionList.of(left).and(right)
    .apply(Flatten.pCollections())
    .apply(ParDo.of(new DoFn<KV<Symbol, NameOrDetails>, AugmentedDetails>() {
        @StateId("name")
        private final StateSpec<ValueState<String>> nameSpec =
            StateSpecs.value(StringUtf8Coder.of());

        @ProcessElement
        public void processElement(ProcessContext c,
                @StateId("name") ValueState<String> nameState) {
            NameOrDetails e = c.element().getValue();
            if (e.isName()) {
                nameState.write(e.getName());
            } else {
                String name = nameState.read();
                if (name == null) {
                    // Use the symbol if we haven't received a mapping yet.
                    name = c.element().getKey();
                }
                c.output(e.getDetails().withName(name));
            }
        }
    }));

How to read all the items present in an AppFabric cache

I am trying to develop a tool (in Visual Studio 2010, C#) which can read all the items present in an AppFabric cache and store them in a table. I don't have to use PowerShell.
First I thought that if I could get all the regions present in the cache, I could make use of the DataCache.GetObjectsInRegion method to complete my task. But I was not able to get all the region names from the cache, as it does not show the user-defined region names but only the default ones, so now I am giving up on this approach.
Can anyone please guide me here, my main goal is to read all the items present in a cache.
There is no built-in method to list all items in the cache.
You're correct: it's possible to list all items using GetObjectsInRegion for a named cache. You first have to know all the region names (if regions are used) or call GetSystemRegions to get all the (default) system regions. A simple foreach then lets you list all items. When you put something into the cache without a region name, it is added to a system region.
Here is a basic example:
// Declare array for cache host(s).
DataCacheServerEndpoint[] servers = new DataCacheServerEndpoint[1];
servers[0] = new DataCacheServerEndpoint("YOURSERVERHERE", 22233);
// Setup the DataCacheFactory configuration.
DataCacheFactoryConfiguration factoryConfig = new DataCacheFactoryConfiguration();
factoryConfig.Servers = servers;
factoryConfig.SecurityProperties = new DataCacheSecurity(DataCacheSecurityMode.None, DataCacheProtectionLevel.None);
// Create a configured DataCacheFactory object.
DataCacheFactory mycacheFactory = new DataCacheFactory(factoryConfig);
// Get a cache client for the default cache
DataCache myCache = mycacheFactory.GetDefaultCache(); //or change to mycacheFactory.GetCache(myNamedCache);
// insert dummy test data
myCache.Put("key1", "myobject1");
myCache.Put("key2", "myobject2");
myCache.Put("key3", "myobject3");
// list all items in the cache: the important part
foreach (string region in myCache.GetSystemRegions())
{
    foreach (var kvp in myCache.GetObjectsInRegion(region))
    {
        Console.WriteLine("data item ('{0}','{1}') in region {2} of cache {3}", kvp.Key, kvp.Value.ToString(), region, "default");
    }
}