Joining a stream against a "table" in Dataflow - google-cloud-platform

Let me use a slightly contrived example to explain what I'm trying to do. Imagine I have a stream of trades coming in, with the stock symbol, share count, and price: { symbol = "GOOG", count = 30, price = 200 }. I want to enrich these events with the name of the stock, in this case "Google".
For this purpose I want to, inside Dataflow, maintain a "table" of symbol->name mappings that is updated by a PCollection<KV<String, String>>, and join my stream of trades with this table, yielding e.g. a PCollection<KV<Trade, String>>.
This seems like a thoroughly fundamental use case for stream processing applications, yet I'm having a hard time figuring out how to accomplish this in Dataflow. I know it's possible in Kafka Streams.
Note that I do not want to use an external database for the lookups – I need to solve this problem inside Dataflow or switch to Kafka Streams.

I'm going to describe two options: one using side inputs, which should work with the current version of Dataflow (1.X), and one using state within a DoFn, which should be part of the upcoming Dataflow (2.X).
Solution for Dataflow 1.X, using side inputs
The general idea here is to use a map-valued side-input to make the symbol->name mapping available to all the workers.
This table will need to be in the global window (so nothing ever ages out), will need to be triggered on every element (or as often as you want new updates to be produced), and accumulate elements across all firings. It will also need some logic to take the latest name for each symbol.
The downside to this solution is that the entire lookup table will be regenerated every time a new entry comes in and it will not be immediately pushed to all workers. Rather, each worker will get the new mapping "at some point" in the future.
At a high level, this pipeline might look something like (I haven't tested this code, so there may be some typos):
PCollection<KV<Symbol, Name>> symbolToNameInput = ...;
final PCollectionView<Map<Symbol, Iterable<Name>>> symbolToNames = symbolToNameInput
    .apply(Window.<KV<Symbol, Name>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5))))
        .accumulatingFiredPanes())
    .apply(View.<Symbol, Name>asMultimap());
Note that we had to use View.asMultimap here. This means that we actually build up all the names for every symbol. When we look things up we'll need to make sure to take the latest name in the iterable.
PCollection<Detail> symbolDetails = ...;
symbolDetails
.apply(ParDo.withSideInputs(symbolToNames).of(new DoFn<Detail, AugmentedDetails>() {
@Override
public void processElement(ProcessContext c) {
Iterable<Name> names = c.sideInput(symbolToNames).get(c.element().symbol());
Name name = chooseName(names);
c.output(augmentDetails(c.element(), name));
}
}));
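The chooseName helper above is not defined anywhere in this answer. As a minimal sketch, assuming each Name value carries an update timestamp (the iterable coming out of the multimap side input has no ordering guarantee, so you need some signal to order by), it could simply pick the most recent entry:
// Hypothetical helper: picks the most recently updated name from the multimap
// side input. Assumes Name exposes an update timestamp; without one, "latest"
// is not well defined for an unordered iterable.
private static Name chooseName(Iterable<Name> names) {
  Name latest = null;
  for (Name candidate : names) {
    if (latest == null || candidate.getUpdateTime().isAfter(latest.getUpdateTime())) {
      latest = candidate;
    }
  }
  return latest; // may be null if no mapping has arrived yet for this symbol
}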
Solution for Dataflow 2.X, using the State API
This solution uses a new feature that will be part of the upcoming Dataflow 2.0 release. It is not yet part of the preview releases (currently Dataflow 2.0-beta1) but you can watch the release notes to see when it is available.
The general idea is that keyed state allows us to store some values associated with the specific key. In this case, we're going to remember the latest "name" value we've seen.
Before running the stateful DoFn we're going to wrap each element into a common element type (a NameOrDetails object; a sketch of this type is given after the pipeline code below). This would look something like the following:
// Convert SymbolToName entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> left = symbolToName
.apply(ParDo.of(new DoFn<SymbolToName, KV<Symbol, NameOrDetails>>() {
@ProcessElement
public void processElement(ProcessContext c) {
SymbolToName e = c.element();
c.output(KV.of(e.getSymbol(), NameOrDetails.name(e.getName())));
}
}));
// Convert detailed entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> right = details
.apply(ParDo.of(new DoFn<Details, KV<Symbol, NameOrDetails>>() {
@ProcessElement
public void processElement(ProcessContext c) {
Details e = c.element();
c.output(KV.of(e.getSymbol(), NameOrDetails.details(e)));
}
}));
// Flatten the two streams together
PCollectionList.of(left).and(right)
.apply(Flatten.pCollections())
.apply(ParDo.of(new DoFn<KV<Symbol, NameOrDetails>, AugmentedDetails>() {
#StateId("name")
private final StateSpec<ValueState<String>> nameSpec =
StateSpecs.value(StringUtf8Coder.of());
#ProcessElement
public void processElement(ProcessContext c
#StateId("name") ValueState<String> nameState) {
NameOrValue e = c.element().getValue();
if (e.isName()) {
nameState.write(e.getName());
} else {
String name = nameState.read();
if (name == null) {
// Use symbol if we haven't received a mapping yet.
name = c.element().getKey();
}
c.output(e.getDetails().withName(name));
}
}
}));
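The snippets above assume a NameOrDetails wrapper type but never define it. A minimal sketch, with Details standing in for the question's trade type and the factory and accessor names matching the calls above, might look like this:
// Hypothetical union type: holds either a name update or a trade's details, never both.
public class NameOrDetails implements Serializable {
  private final String name;      // non-null for name updates
  private final Details details;  // non-null for trade details

  private NameOrDetails(String name, Details details) {
    this.name = name;
    this.details = details;
  }

  public static NameOrDetails name(String name) {
    return new NameOrDetails(name, null);
  }

  public static NameOrDetails details(Details details) {
    return new NameOrDetails(null, details);
  }

  public boolean isName() { return name != null; }
  public String getName() { return name; }
  public Details getDetails() { return details; }
}
In a real pipeline this type would also need a coder the runner can encode it with (for example a SerializableCoder or a small custom coder).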

Related

Intercepting all orders with full data in MT4

I'm trying to write a trade copier for MT4. I have already written one for MT5, but the issue I'm having with translation is in intercepting active orders. In MT5, this is relatively simple:
void OnTradeTransaction(const MqlTradeTransaction &trans, const MqlTradeRequest &request,
const MqlTradeResult &result) {
// Code goes here
}
As shown in the MQL5 documentation, this event intercepts all orders sent from the client and accepted by a trade server.
Looking at the MQL4 documentation, however, I don't see any easy way of doing this. The closest I could get would be to iterate over all the orders by doing this:
for (int i = 0; i < OrdersTotal(); i++) {
if (!OrderSelect(i, SELECT_BY_POS)) {
// Error handling here
}
// Do stuff with this order
}
My understanding is that this code also gets all open orders. However, the issue I'm having is that there are key pieces of information that I cannot determine on these orders:
Slippage
Position-by (for close-by orders)
Action type (close, close-by, delete, modify, send), although this could be inferred from the fields populated on the order.
In my mind, I could then go and intercept the orders when they're generated (i.e. wrap OrderClose, OrderCloseBy, OrderDelete, OrderModify and OrderSend) and pull the relevant information off of the orders that way. But that still doesn't cover the case where the user enters an order manually.
Is there a way I can intercept all order data without losing information?

How to query big data in DynamoDB in best practice

I have a scenario: query the list of students in a school, by year, and then use that information to do some other tasks, let's say printing a certificate for each student.
I'm using the serverless framework to deal with that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
var params = {
TableName: `schoolTable`,
KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
ExpressionAttributeValues: { ':school_id': _school_id, ':year': _year },
};
try {
let _students = [];
let items;
do {
items = await dynamoClient.query(params).promise();
_students = _students.concat(items.Items); // accumulate each page instead of overwriting
params.ExclusiveStartKey = items.LastEvaluatedKey;
} while (typeof items.LastEvaluatedKey != 'undefined');
return _students;
} catch (e) {
console.log('Error: ', e);
}
};
const mainHandler = async (event, context) => {
…
let students = await queryStudent(body.school_id, body.year);
await printCertificate(students);
…
}
So far, it’s working well with about 5k students (just sample data)
My concern: is this a scalable solution for querying large amounts of data in DynamoDB?
As far as I know, Lambda has limited execution time; if the number of students goes up to a million, does the above solution still work?
Any best-practice approach for this scenario is very much appreciated and welcome.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means that they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use things like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather with the data (see the sketch after this list).
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) in 15 minutes, and if it fails before it's completely through, you lose the information about how far you've gotten. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check whether it exists on new Lambda invocations, and start from there (a sketch appears at the end of this answer).
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function that does the querying. This won't scale if it's a synchronous process and you have a million students. If stuff fails, you also have to consider retries and build that logic into your code.
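To make the sharding idea from the first point concrete, here is a minimal sketch of the key layout (written in Java purely for illustration; SHARD_COUNT, the "#" separator and the key format are assumptions, not a prescription):
// Illustrative only: spread one school's students across several item collections
// by suffixing the partition key at write time. Reads then scatter-gather over
// all SHARD_COUNT shards and merge the results.
static final int SHARD_COUNT = 10;

static String shardedPartitionKey(String schoolId, String studentId) {
  int shard = Math.floorMod(studentId.hashCode(), SHARD_COUNT);
  return schoolId + "#" + shard; // e.g. "school-123#4"
}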
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across the sharded partition keys and writes each student into an SQS queue for certificate-printing tasks. If the Lambda function notices that it's close to the runtime limit, it returns the LastEvaluatedKey, and the Step Function recognizes that and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
Implementing this requires more effort, so I'd first figure out whether a million students per school per year is a realistic number :-)
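To illustrate the query/checkpoint part of that architecture, here is a rough sketch, written in Java with the AWS SDK v2 rather than the question's Node.js purely for illustration; the table name, key expression and the enqueueCertificateTasks helper are assumptions. The idea is to page through the query, push each page of students onto the queue, and return the LastEvaluatedKey as a checkpoint when the function runs low on time:
import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class CheckpointedStudentQuery {

  private final DynamoDbClient dynamo = DynamoDbClient.create();

  // Pages through one school's students and stops early when the deadline is near
  // (in Lambda the deadline would be derived from context.getRemainingTimeInMillis()).
  // Returns the key to resume from, or null when the query has finished.
  public Map<String, AttributeValue> queryWithCheckpoint(
      String schoolId, String year,
      Map<String, AttributeValue> checkpoint, long deadlineMillis) {
    Map<String, AttributeValue> startKey = checkpoint;
    do {
      QueryRequest.Builder request = QueryRequest.builder()
          .tableName("schoolTable")
          .keyConditionExpression("partition_key = :school_id AND begins_with(sort_key, :year)")
          .expressionAttributeValues(Map.of(
              ":school_id", AttributeValue.builder().s(schoolId).build(),
              ":year", AttributeValue.builder().s(year).build()));
      if (startKey != null) {
        request.exclusiveStartKey(startKey);
      }
      QueryResponse response = dynamo.query(request.build());
      enqueueCertificateTasks(response.items()); // hypothetical: push this batch of students to SQS
      Map<String, AttributeValue> lastKey = response.lastEvaluatedKey();
      startKey = (lastKey == null || lastKey.isEmpty()) ? null : lastKey;
    } while (startKey != null && System.currentTimeMillis() < deadlineMillis);
    return startKey; // non-null means "not done yet"; the Step Function restarts with it
  }

  private void enqueueCertificateTasks(List<Map<String, AttributeValue>> students) {
    // hypothetical: batch the students into SQS messages for the certificate-printing Lambdas
  }
}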

Google Cloud Datastore - get after insert in one request

I am trying to retrieve an entity immediately after it was saved. When debugging, I insert the entity and check the entities in the Google Cloud Console, and I can see it was created.
Key key = datastore.put(fullEntity)
After that, I continue with getting the entity with
datastore.get(key)
, but nothing is returned. How do I retrieve the saved entity within one request?
I've read this question Missing entities after insertion in Google Cloud DataStore
but I am only saving 1 entity, not tens of thousands like in that question
I am using Java 11 and Google Datastore (the com.google.cloud.datastore package).
Edit: added the code showing how the entity is created
public Key create.... {
// creating the entity inside a method
Transaction txn = this.datastore.newTransaction();
this.datastore = DatastoreOptions.getDefaultInstance().getService();
Builder<IncompleteKey> builder = newBuilder(entitykey);
setLongOrNull(builder, "price", purchase.getPrice());
setTimestampOrNull(builder, "validFrom", of(purchase.getValidFrom()));
setStringOrNull(builder, "invoiceNumber", purchase.getInvoiceNumber());
setBooleanOrNull(builder, "paidByCard", purchase.getPaidByCard());
newPurchase = entityToObject(this.datastore.put(builder.build()));
if (newPurchase != null && purchase.getItems() != null && purchase.getItems().size() > 0) {
for (Item item : purchase.getItems()) {
newPurchase.getItems().add(this.itemDao.save(item, newPurchase));
}
}
txn.commit();
return newPurchase.getKey();
}
after that, I am trying to retrieve the created entity
Key key = create(...);
Entity e = datastore.get(key);
I believe that there are a few issues with your code, but since we are unable to see the logic behind many of your methods, here is my guess.
First of all, as you can see in the documentation, it's possible to save and retrieve an entity in the same code, so this is not the problem.
It seems like you are using a transaction, which is the right way to perform multiple operations as a single action, but it doesn't seem like you are using it properly: you only instantiate it and commit it, but you don't run any operation through it. Furthermore, you are using this.datastore to save to the database, which completely bypasses the transaction.
So either save the object once it already has all of its items added, or use the transaction to save all the entities at once.
And I believe you should use the entityKey to fetch the added purchase afterwards, but don't mix the two.
Also you are creating the Transaction object from this.datastore before instantiating the latter, but I assume this is a copy-paste error.
Since you're creating a transaction for this operation, the entity put should happen inside the transaction:
txn.put(builder.build());
The operations inside the loop where you add the purchase.getItems() to the newPurchase object should be done in the context of the same transaction as well.
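Putting those pieces together, the create method could be reshaped roughly like this. This is only a sketch that reuses the question's own helpers and types; the item-saving part is elided because the item DAO would also need to write through the transaction instead of this.datastore:
// Initialize the Datastore service first, run every write through the same
// transaction, and commit once at the end.
this.datastore = DatastoreOptions.getDefaultInstance().getService();
Transaction txn = this.datastore.newTransaction();
try {
  Builder<IncompleteKey> builder = newBuilder(entitykey);
  setLongOrNull(builder, "price", purchase.getPrice());
  // ... set the remaining properties as in the question ...
  Entity saved = txn.put(builder.build()); // the put now goes through the transaction
  // ... save the purchase items here, through the same txn ...
  txn.commit();
  return saved.getKey();
} finally {
  if (txn.isActive()) {
    txn.rollback();
  }
}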
Let me know if this resolves the issue.
Cheers!

GCP Dataflow droppedDueToClosedWindow & Commit request for stage S8 and key 8 is larger than 2GB

We are running into problems with our Dataflow on Google Cloud. Our pipeline consists of various input steps, which get data pushed in with GCP Pub/Sub. We then aggregate the data and sort it. These steps [1] are clearly too heavy for Dataflow and the window we configured, and we get an exception [2] on the step. We also see these metrics:
droppedDueToClosedWindow 3,838,662 Bids/AggregateExchangeOrders
droppedDueToClosedWindow 21,060,627 Asks/AggregateExchangeOrders
Now I am seeking advice on how to attack this issue. Should I break down the steps so that, for example, the iteration and sorting can be done in parallel steps?
Is there a way to get more information about what exactly happens?
Should we increase the number of workers? (Currently 1).
We are rather new to Dataflow. Good advice is most welcome.
Edit: I am adding a bit of detail on the steps.
This is how the steps below are 'chained' together:
@Override
public PCollection<KV<KV<String, String>, List<ExchangeOrder>>> expand(PCollection<KV<String, KV<String, String>>> input) {
return input.apply("PairWithType", new ByPairWithType(type))
.apply("UnfoldExchangeOrders", new ByAggregatedExchangeOrders())
.apply("AggregateExchangeOrders", GroupByKey.<KV<String, String>, KV<String, KV<BigDecimal, BigDecimal>>>create())
.apply("ReorderExchangeOrders", ParDo.of(new ReorderExchangeOrders()));
}
AggregateExchangeOrders:
So here we clearly iterate through a collection of orders and parse each value (twice) so that it's a BigDecimal.
Which makes me think we could skip one parse step, as described here:
Convert string to BigDecimal in java
@ProcessElement
public void processElement(ProcessContext c) {
KV<String, KV<String, String>> key = c.element().getKey();
List<KV<String, String>> value = c.element().getValue();
value.forEach(
exchangeOrder -> {
try {
BigDecimal unitPrice = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getKey()));
BigDecimal quantity = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getValue()));
if (quantity.compareTo(BigDecimal.ZERO) != 0) {
// Exclude exchange orders with no quantity.
c.output(KV.of(key.getValue(), KV.of(key.getKey(), KV.of(unitPrice, quantity))));
}
} catch (NumberFormatException e) {
// Exclude exchange orders with invalid element.
}
});
}
...next we group and sort (and optionally reverse it); it seems this step is not taking a huge load.
ReorderExchangeOrders:
@ProcessElement
public void processElement(ProcessContext c) {
KV<String, String> pairAndType = c.element().getKey();
Iterable<KV<String, KV<BigDecimal, BigDecimal>>> exchangeOrderBook = c.element().getValue();
List<ExchangeOrder> list = new ArrayList<>();
exchangeOrderBook.forEach(exchangeOrder -> list.add(
new ExchangeOrder(exchangeOrder.getKey(), exchangeOrder.getValue().getKey(), exchangeOrder.getValue().getValue())));
// Asks are sorted in ASC order
Collections.sort(list);
// Bids are sorted in DSC order
if (pairAndType.getValue().equals(EXCHANGE_ORDER_TYPE.BIDS.toString())) {
Collections.reverse(list);
}
c.output(KV.of(pairAndType, list));
}
[ 1 ] Dataflow screenshot (not included).
[ 2 ] Exception: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed.
java.lang.IllegalStateException: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed. This may be caused by grouping a very large amount of data in a single window without using Combine, or by producing a large amount of data from a single input element.
com.google.cloud.dataflow.worker.StreamingDataflowWorker$Commit.getSize(StreamingDataflowWorker.java:327)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.lambda$new$0(StreamingDataflowWorker.java:342)
The error message is kind of straightforward.
The root cause of the problem, as many of the comments point out, is that the structure that contains all the results for one of the DoFn's is larger than 2GB, and your best option would be to partition your data in some way to make your work units smaller.
In the code I see that some of the structures returned by the DoFn's are nested structures of the form KV<KV<key1, key2>, value>. This arrangement forces Dataflow to send the whole response back in one monolithic bundle and prevents it from chunking it into smaller pieces.
One possible solution would be to use composite keys instead of nested structures for as long as possible in the pipeline, and only combine them when strictly necessary.
For example,
instead of KV<KV<Key1, Key2>, Value>, the DoFn could return
KV<concat(Key1, Key2), Value>
This would split the work units into much smaller sets that can then be dispatched in parallel to multiple workers.
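As a minimal sketch of what that could look like inside the question's AggregateExchangeOrders DoFn (the "|" separator is an assumption; use any character that cannot occur in the pair or type strings):
// Flatten the nested KV<String, String> key into one composite string key before grouping.
KV<String, KV<String, String>> key = c.element().getKey();
String compositeKey = key.getValue().getKey() + "|" + key.getValue().getValue();
// ... parse unitPrice and quantity exactly as in the question ...
c.output(KV.of(compositeKey, KV.of(key.getKey(), KV.of(unitPrice, quantity))));
The downstream GroupByKey would then be keyed by the composite String, and ReorderExchangeOrders can split the key back apart wherever it needs the pair and the type separately.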
To answer the other questions: increasing the number of workers will have no effect, as the huge collection generated by the DoFn does not appear to be splittable. Adding logging to see how the collection reaches 2 GB might provide useful hints on how to prevent this.

Aggregating a huge list from reducer input without running out of memory

At the reduce stage (67% of the reduce progress), my code gets stuck and then fails after hours of attempting to complete. I found out that the issue is that the reducer is receiving huge amounts of data that it can't handle and ends up running out of memory, which leads to the reducer getting stuck.
Now, I am trying to find a way around this. Currently, I am assembling a list from the values received by the reducer for each key. At the end of the reduce phase, I try to write the key and all of the values in the list. So my question is, how can I get the same functionality of having the key and the list of values related to that key without running out of memory?
public class XMLReducer extends Reducer<Text, Text, Text, TextArrayWritable> {
private final Logger logger = Logger.getLogger(XMLReducer.class);
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//logger.info(key.toString());
Set<String> filesFinal = new HashSet<>();
int size = 0;
for(Text value : values) {
String[] files = value.toString().split(",\\s+");
filesFinal.add(value.toString());
//size++;
}
//logger.info(Integer.toString(size));
String[] temp = new String[filesFinal.size()];
temp = filesFinal.toArray(temp);
Text[] tempText = new Text[filesFinal.size()];
for(int i = 0; i < filesFinal.size(); i++) {
tempText[i] = new Text(temp[i]);
}
context.write(key, new TextArrayWritable(tempText)); // write the aggregated values (assumes TextArrayWritable can be built from a Text[])
}
}
and TextArrayWritable is just a way to write an array to file
You can try reducing the amount of data that is read by the single reducer by writing a custom Partitioner.
HashPartitioner is the default partitioner used by a MapReduce job. While this generally gives you a uniform distribution, in some cases it is quite possible that many keys get hashed to a single reducer. As a result, a single reducer ends up with far more data than the others. In your case, I think this is the issue.
To resolve this:
Analyze your data and the key on which you are grouping.
Try to come up with a partitioning function based on your group-by key for your custom Partitioner, and try to limit the number of keys for each partition (a sketch is given below).
You should see an increase in the number of reduce tasks in your job. If the issue is related to uneven key distribution, the solution I proposed should resolve your issue.
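As a minimal sketch of such a custom Partitioner, assuming your map output key and value are both Text as in the reducer above (the partition rule itself is only a placeholder you would replace after analyzing your keys):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomXMLPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Replace this with a rule derived from your key analysis, e.g. range
    // partitioning on known key prefixes, so heavy groups of keys do not all
    // land on the same reducer. The line below is just a hash-style default
    // as a starting point.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
You would register it with job.setPartitionerClass(CustomXMLPartitioner.class) and increase the number of reduce tasks accordingly.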
You could also try increasing reducer memory.