GCP Dataflow Apache Beam writing output error handling - google-cloud-platform

I need to apply error handling to my Dataflow for multiple inserts to Spanner with the same primary key. The logic being that an older message may be received after the current message and I do not want to overwrite the saved values. Therefore I will create my mutation as an insert and throw an error when a duplicate insert is attempted.
I have seen several examples of try blocks within DoFn's that write to a side output to log any errors. This is a very nice solution but I need to apply error handling to the step that writes to Spanner which does not contain a DoFn
spannerBranchTuples2.get(spannerOutput2)
.apply("Create Spanner Mutation", ParDo.of(createSpannerMutation))
.apply("Write Spanner Records", SpannerIO.write()
.withInstanceId(options.getSpannerInstanceId())
.withDatabaseId(options.getSpannerDatabaseId())
.grouped());
I have not found any documentation that allows error handling to be applied to this step, or found a way to re-write it as a DoFn. Any suggestions how to apply error handling to this? thanks

There is an interesting pattern for this in Dataflow documentation.
Basically, the idea is to have a DoFn before sending your results to your writing transforms. It'd look something like so:
final TupleTag<Output> successTag = new TupleTag<>() {};
final TupleTag<Input> deadLetterTag = new TupleTag<>() {};
PCollection<Input> input = /* … */;
PCollectionTuple outputTuple = input.apply(ParDo.of(new DoFn<Input, Output>() {
#Override
void processElement(ProcessContext c) {
try {
c.output(process(c.element());
} catch (Exception e) {
LOG.severe("Failed to process input {} -- adding to dead letter file",
c.element(), e);
c.sideOutput(deadLetterTag, c.element());
}
}).withOutputTags(successTag, TupleTagList.of(deadLetterTag)));
outputTuple.get(deadLetterTag)
.apply(/* Write to a file or table or anything */);
outputTuple.get(successTag)
.apply(/* Write to Spanner or any other sink */);
Let me know if this is useful!

Related

Apache Beam Kafka IO for Json messages - org.apache.kafka.common.errors.SerializationException

Am trying to get familiar with Apache beam Kafka IO and getting following error
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:348)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:318)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:213)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
at com.andrewjones.KafkaConsumerExample.main(KafkaConsumerExample.java:58)
Caused by: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
Following is piece of code which reads messages from a kafka topic. Appreciate if you someone can provide some pointers.
public class KafkaConsumerJsonExample {
static final String TOKENIZER_PATTERN = "[^\p{L}]+";
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
// Create the Pipeline object with the options we defined above.
Pipeline p = Pipeline.create(options);
p.apply(KafkaIO.<Long, String>read()
.withBootstrapServers("localhost:9092")
.withTopic("myTopic2")
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.updateConsumerProperties(ImmutableMap.of("auto.offset.reset", (Object)"earliest"))
// We're writing to a file, which does not support unbounded data sources. This line makes it bounded to
// the first 5 records.
// In reality, we would likely be writing to a data source that supports unbounded data, such as BigQuery.
.withMaxNumRecords(5)
.withoutMetadata() // PCollection<KV<Long, String>>
)
.apply(Values.<String>create())
.apply(TextIO.write().to("wordcounts"));
System.out.println("running data pipeline");
p.run().waitUntilFinish();
}
}
The issue is caused by using LongDeserializer for keys that seems were serialised by other serialiser than Long and it depends how you produced the records.
So, you either can use a proper deserializer or, if keys don't matter, as a workaround, try to use StringDeserializer for keys as well.

Joining a stream against a "table" in Dataflow

Let me use a slightly contrived example to explain what I'm trying to do. Imagine I have a stream of trades coming in, with the stock symbol, share count, and price: { symbol = "GOOG", count = 30, price = 200 }. I want to enrich these events with the name of the stock, in this case "Google".
For this purpose I want to, inside Dataflow, maintain a "table" of symbol->name mappings that is updated by a PCollection<KV<String, String>>, and join my stream of trades with this table, yielding e.g. a PCollection<KV<Trade, String>>.
This seems like a thoroughly fundamental use case for stream processing applications, yet I'm having a hard time figuring out how to accomplish this in Dataflow. I know it's possible in Kafka Streams.
Note that I do not want to use an external database for the lookups – I need to solve this problem inside Dataflow or switch to Kafka Streams.
I'm going to describe two options. One using side-inputs which should work with the current version of Dataflow (1.X) and one using state within a DoFn which should be part of the upcoming Dataflow (2.X).
Solution for Dataflow 1.X, using side inputs
The general idea here is to use a map-valued side-input to make the symbol->name mapping available to all the workers.
This table will need to be in the global window (so nothing ever ages out), will need to be triggered every element (or as often as you want new updates to be produced), and accumulate elements across all firings. It will also need some logic to take the latest name for each symbol.
The downside to this solution is that the entire lookup table will be regenerated every time a new entry comes in and it will not be immediately pushed to all workers. Rather, each will get the new mapping "at some point" in the future.
At a high level, this pipeline might look something like (I haven't tested this code, so there may be some types):
PCollection<KV<Symbol, Name>> symbolToNameInput = ...;
final PCollectionView<Map<Symbol, Iterable<Name>>> symbolToNames = symbolToNameInput
.apply(Window.into(GlobalWindows.of())
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(5)))
.accumulatingFiredPanes())
.apply(View.asMultiMap())
Note that we had to use viewAsMultiMap here. This means that we actually build up all the names for every symbol. When we look things up we'll need make sure to take the latest name in the iterable.
PCollection<Detail> symbolDetails = ...;
symbolDetails
.apply(ParDo.withSideInputs(symbolToNames).of(new DoFn<Detail, AugmentedDetails>() {
#Override
public void processElement(ProcessContext c) {
Iterable<Name> names = c.sideInput(symbolToNames).get(c.element().symbol());
Name name = chooseName(names);
c.output(augmentDetails(c.element(), name));
}
}));
Solution for Dataflow 2.X, using the State API
This solution uses a new feature that will be part of the upcoming Dataflow 2.0 release. It is not yet part of the preview releases (currently Dataflow 2.0-beta1) but you can watch the release notes to see when it is available.
The general idea is that keyed state allows us to store some values associated with the specific key. In this case, we're going to remember the latest "name" value we've seen.
Before running the stateful DoFn we're going to wrap each element into a common element type (a NameOrDetails) object. This would look something like the following:
// Convert SymbolToName entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> left = symbolToName
.apply(ParDo.of(new DoFn<SymbolToName, KV<Symbol, NameOrDetails>>() {
#ProcessElement
public void processElement(ProcessContext c) {
SymbolToName e = c.element();
c.output(KV.of(e.getSymbol(), NameOrDetails.name(e.getName())));
}
});
// Convert detailed entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> right = details
.apply(ParDo.of(new DoFn<Details, KV<Symbol, NameOrDetails>>() {
#ProcessElement
public void processElement(ProcessContext c) {
Details e = c.element();
c.output(KV.of(e.getSymobl(), NameOrDetails.details(e)));
}
});
// Flatten the two streams together
PCollectionList.of(left).and(right)
.apply(Flatten.create())
.apply(ParDo.of(new DoFn<KV<Symbol, NameOrDetails>, AugmentedDetails>() {
#StateId("name")
private final StateSpec<ValueState<String>> nameSpec =
StateSpecs.value(StringUtf8Coder.of());
#ProcessElement
public void processElement(ProcessContext c
#StateId("name") ValueState<String> nameState) {
NameOrValue e = c.element().getValue();
if (e.isName()) {
nameState.write(e.getName());
} else {
String name = nameState.read();
if (name == null) {
// Use symbol if we haven't received a mapping yet.
name = c.element().getKey();
}
c.output(e.getDetails().withName(name));
}
});

Intercepting error handling with loopback

Is there somewhere complete, consistent and well documented source of information on error handling in loopback?
Things like error codes and their meaning, relation with http statuses. I've already read their docs and have not found anything like this.
I would like to translate all the messages to add multi language support to my app. I would also like to add my custom messages, with their code and to use it consistently with other loopback errors.
In order to achieve this, I need to intercept all the errors (I've done this already) and to know all the possible different codes, so I can translate them.
For example, if there is an error with code 555, I have to know what it means and treat it accordingly.
Any ideas?
I need to "catch" all the messages and translate them
This is the beginning of an answer. You can write an error-handling middleware that will intercept any error returned by the server. You will need in turn to implement the logic for making the translation.
module.exports = function() {
return function logError(err, req, res, next) {
if (err) {
console.log('ERR', req.url, err);
}
next();
};
};
This middleware must be configured to be called in the final phase. Save the code above in log-error.js for instance, then modify server/middleware.json
{ "final": { "./middleware/log-error": {} } }
I need a full list of loopback codes/messages
I'm pretty sure there is no such thing. Errors are build and returned all over the place in the code, not centralized anywhere.

Can I force a Cassandra table flush from the C/C++ driver like nodetool does?

I'm wondering whether I could replicate the forceKeyspaceFlush() function found in the nodetool utility from the C/C++ driver of Cassandra.
The nodetool function looks like this:
public class Flush extends NodeToolCmd
{
#Arguments(usage = "[<keyspace> <tables>...]", description = "The keyspace followed by one or many tables")
private List<String> args = new ArrayList<>();
#Override
public void execute(NodeProbe probe)
{
List<String> keyspaces = parseOptionalKeyspace(args, probe);
String[] tableNames = parseOptionalTables(args);
for (String keyspace : keyspaces)
{
try
{
probe.forceKeyspaceFlush(keyspace, tableNames);
} catch (Exception e)
{
throw new RuntimeException("Error occurred during flushing", e);
}
}
}
}
What I would like to replicate in my C++ software is this line:
probe.forceKeyspaceFlush(keyspace, tableNames);
Is it possible?
That's an unusual request, primarily because Cassandra is designed to be distributed, so if you're executing a query, you'd need to perform that blocking flush on each of the (potentially many) replicas. Rather than convince you that you don't really need this, I'll attempt to answer your question - however, you probably don't really need this.
Nodetool is using the JMX interface (on tcp/7199) to force that flush - Your c/c++ driver talks over the native protocol (on tcp/9042). At this time, flush is not possible via the native protocol.
Work around the limitation, you'd need to either exec a jmx-capable commandline utility (nodetool or other), implement a JMX client in c++ (it's been done), or extend the native protocol. None of those are particularly pleasant options, but I imagine executing a jmx CLI utility is significantly easier than the other two.

many queries in a task to generate json

So I've got a task to build which is going to archive a ton of data in our DB into JSON.
To give you a better idea of what is happening; X has 100s of Ys, and Y has 100s of Zs and so on. I'm creating a json file for every X, Y, and Z. But every X json file has an array of ids for the child Ys of X, and likewise the Ys store an array of child Zs..
It more complicated than that in many cases, but you should get an idea of the complexity involved from that example I think.
I was using ColdFusion but it seems to be a bad choice for this task because it is crashing due to memory errors. It seems to me that if it were removing queries from memory that are no longer referenced while running the task (ie: garbage collecting) then the task should have enough memory, but afaict ColdFusion isn't doing any garbage collection at all, and must be doing it after a request is complete.
So I'm looking either for advice on how to better achieve my task in CF, or for recommendations on other languages to use..
Thanks.
1) If you have debugging enabled, coldfusion will hold on to your queries until the page is done. Turn it off!
2) You may need to structDelete() the query variable to allow it to be garbage collected, otherwise it may persist as long as the scope that has a reference to it exists. eg.,
<cfset structDelete(variables,'myQuery') />
3) A cfquery pulls the entire ResultSet into memory. Most of the time this is fine. But for reporting on a large result set, you don't want this. Some JDBC drivers support setting the fetchSize, which in a forward, read only fashion, will let you get a few results at a time. This way you can deal with thousands and thousands of rows, without swamping memory. I just generated a 1GB csv file in ~80 seconds, using less than 100mb of heap. This requires dropping out to Java. But it kills two birds with one stone. It reduces the amount of data brought in at a time by the JDBC driver, and since you're working directly with the ResultSet, you don't hit the cfloop problem #orangepips mentioned. Granted, it's not for those without some Java chops.
You can do it something like this (you need cfusion.jar in your build path):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.ResultSet;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import au.com.bytecode.opencsv.CSVWriter;
import coldfusion.server.ServiceFactory;
public class CSVExport {
public static void export(String dsn,String query,String fileName) {
Connection conn = null;
Statement stmt = null;
ResultSet rs = null;
FileWriter fw = null;
BufferedWriter bw = null;
try {
DataSource ds = ServiceFactory.getDataSourceService().getDatasource(dsn);
conn = ds.getConnection();
// we want a forward-only, read-only result.
// you may want need to use a PreparedStatement instead.
stmt = conn.createStatement(
ResultSet.TYPE_FORWARD_ONLY,
ResultSet.CONCUR_READ_ONLY
);
// we only want to go forward!
stmt.setFetchDirect(ResultSet.FETCH_FORWARD);
// how many records to pull back at a time.
// the hard part is balancing memory usage, and round trips to the database.
// basically sacrificing speed for a lower memory hit.
stmt.setFetchSize(256);
rs = stmt.executeQuery(query);
// do something with the ResultSet, for example write to csv using opencsv
// the key is to stream it. you don't want it stored in memory.
// so excel spreadsheets and pdf files are out, but text formats like
// like csv, json, html, and some binary formats like MDB (via jackcess)
// that support streaming are in.
fw = new FileWriter(fileName);
bw = new BufferedWriter(fw);
CSVWriter writer = new CSVWriter(bw);
writer.writeAll(rs,true);
}
catch (Exception e) {
// handle your exception.
// maybe try ServiceFactory.getLoggingService() if you want to do a cflog.
e.printStackTrace();
}
finally() {
try {rs.close()} catch (Exception e) {}
try {stmt.close()} catch (Exception e) {}
try {conn.close()} catch (Exception e) {}
try {bw.close()} catch (Exception e) {}
try {fw.close()} catch (Exception e) {}
}
}
}
Figuring out how to pass parameters, logging, turning this into a background process (hint: extend Thread) etc. are separate issues, but if you grok this code, it shouldn't be too difficult.
4) Perhaps look at Jackson for generating your json. It supports streaming, and combined with the fetchSize, and a BufferedOutputStream, you should be able to keep the memory usage way down.
Eric, you are absolutely correct about ColdFusion garbage collection not removing query information from memory until request end and I've documented it fairly extensively in another SO question. In short, you hit OoM Exceptions when you loop over queries. You can prove it using a tool like VisualVM to generate a heap dump while the process is running and then running the resulting dump through Eclipse Memory Analyzer Tool (MAT). What MAT would show you is a large hierarchy, starting with an object named (I'm not making this up) CFDummyContent that holds, among other things, references to cfquery and cfqueryparam tags. Note, attempting to change it up to stored procs or even doing the database interaction via JDBC does not make difference.
So. What. To. Do?
This took me a while to figure out, but you've got 3 options in increasing order of complexity:
<cthread/>
asynchronous CFML gateway
daisy chain http requests
Using cfthread looks like this:
<cfloop ...>
<cfset threadName = "thread" & createUuid()>
<cfthread name="#threadName#" input="#value#">
<!--- do query stuff --->
<!--- code has access to passed attributes (e.g. #attributes.input#) --->
<cfset thread.passOutOfThread = somethingGeneratedInTheThread>
</cfthread>
<cfthread action="join" name="#threadName#">
<cfset passedOutOfThread = cfthread["#threadName#"].passOutOfThread>
</cfloop>
Note, this code is not taking advantage of asynchronous processing, thus the immediate join after each thread call, but rather the side effect that cfthread runs in its own request-like scope independent of the page.
I'll not cover ColdFusion gateways here. HTTP daisy chaining means executing an increment of the work, and at the end of the increment launching a request to the same algorithm telling it to execute the next increment.
Basically, all three approaches allow those memory references to be collected mid process.
And yes, for whoever asks, bugs have been raised with Adobe, see the question referenced. Also, I believe this issue is specific to Adobe ColdFusion, but have not tested Railo or OpenDB.
Finally, have to rant. I've spent a lot of time tracking this one down, fixing it in my own large code base, and several others listed in the question referenced have as well. AFAIK Adobe has not acknowledge the issue much-the-less committed to fixing it. And, yes it's a bug, plain and simple.