Google cloud dataflow - batch insert in bigquery - google-cloud-platform

I was able to create a Dataflow pipeline that reads data from Pub/Sub and, after processing it, writes to BigQuery in streaming mode.
Now, instead of streaming mode, I would like to run my pipeline in batch mode to reduce costs.
Currently my pipeline is doing streaming inserts into BigQuery with dynamic destinations. I would like to know if there is a way to perform a batch insert operation with dynamic destinations.
Below is the code:
public class StarterPipeline {
public interface StarterPipelineOption extends PipelineOptions {
/**
* Set this required option to specify where to read the input.
*/
@Description("Path of the file to read from")
@Default.String(Constants.pubsub_event_pipeline_url)
String getInputFile();
void setInputFile(String value);
}
@SuppressWarnings("serial")
public static void main(String[] args) throws SocketTimeoutException {
StarterPipelineOption options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(StarterPipelineOption.class);
Pipeline p = Pipeline.create(options);
PCollection<String> datastream = p.apply("Read Events From Pubsub",
PubsubIO.readStrings().fromSubscription(Constants.pubsub_event_pipeline_url));
PCollection<String> windowed_items = datastream.apply(Window.<String>into(new GlobalWindows())
.triggering(Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(300))))
.withAllowedLateness(Duration.standardDays(10)).discardingFiredPanes());
// Write into Big Query
windowed_items.apply("Read and make event table row", new
ReadEventJson_bigquery())
.apply("Write_events_to_BQ",
BigQueryIO.writeTableRows().to(new DynamicDestinations<TableRow, String>() {
public String getDestination(ValueInSingleWindow<TableRow> element) {
String destination = EventSchemaBuilder
.fetch_destination_based_on_event(element.getValue().get("event").toString());
return destination;
}
@Override
public TableDestination getTable(String table) {
String destination =
EventSchemaBuilder.fetch_table_name_based_on_event(table);
return new TableDestination(destination, destination);
}
@Override
public TableSchema getSchema(String table) {
TableSchema table_schema =
EventSchemaBuilder.fetch_table_schema_based_on_event(table);
return table_schema;
}
}).withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));
p.run().waitUntilFinish();
log.info("Events Pipeline Job Stopped");
}
}

Batch or streaming mode is determined by the PCollection, so you would need to transform your streaming PCollection from Pub/Sub into a batched PCollection before writing to BigQuery. The transform that allows you to do this is GroupIntoBatches<K,InputT>.
Note that since this transform uses key-value pairs, batches will contain only elements of a single key. For non-KV elements, check this related answer.
Once you have batched your PCollection using this transform, apply the BigQuery write with dynamic destinations as you did with the streaming PCollection.
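For illustration, here is a minimal, untested sketch of that approach. The choice of the "event" field as the key, the eventRows variable (standing in for the PCollection of TableRows produced by ReadEventJson_bigquery()), and the batch size of 500 are all assumptions, not part of the original pipeline:
// Key each TableRow by its destination; the key function here is an illustrative assumption.
PCollection<KV<String, TableRow>> keyedRows = eventRows.apply("Key by destination",
    WithKeys.of((TableRow row) -> row.get("event").toString())
        .withKeyType(TypeDescriptors.strings()));
// Emit batches of up to 500 rows per key; downstream steps receive KV<String, Iterable<TableRow>>.
PCollection<KV<String, Iterable<TableRow>>> batchedRows = keyedRows
    .apply("Group into batches", GroupIntoBatches.<String, TableRow>ofSize(500));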

You can limit costs by using file loads for streaming jobs. The Insertion Method section states that BigQueryIO.Write supports two methods of inserting data into BigQuery, specified via BigQueryIO.Write.withMethod (org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method). If no method is supplied, a default method is chosen based on the input PCollection. See BigQueryIO.Write.Method for more information about the methods.
The different insertion methods provide different tradeoffs of cost, quota, and data consistency. Please see BigQuery documentation for more information about these tradeoffs.
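As a rough, untested sketch of what that would look like for the write in the question: the triggering frequency and shard count below are illustrative, and dynamicDestinations stands in for the anonymous DynamicDestinations defined above. Note that withFailedInsertRetryPolicy applies only to streaming inserts, so it is dropped here:
windowed_items.apply("Read and make event table row", new ReadEventJson_bigquery())
    .apply("Write_events_to_BQ",
        BigQueryIO.writeTableRows().to(dynamicDestinations)
            // Load data via periodic batch load jobs instead of streaming inserts.
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Required for unbounded input when using FILE_LOADS; values are illustrative.
            .withTriggeringFrequency(Duration.standardMinutes(10))
            .withNumFileShards(100)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));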

Related

AWS Kinesis KCL skips records added before startup

I started to use both the KPL and KCL to exchange data between services. But whenever the consumer service is offline, all data sent by the KPL is lost forever. So I get only those chunks of data that were sent while the consumer service is up and its shardConsumer is ready. I need to start from the last consumed point, or otherwise process the data left behind.
Here is my ShardProcessor code:
@Override
public void initialize(InitializationInput initializationInput) {
}
@Override
public void processRecords(ProcessRecordsInput processRecordsInput) {
processRecordsInput.records()
.forEach(record -> {
//my logic
});
}
@Override
public void leaseLost(LeaseLostInput leaseLostInput) {
}
@Override
public void shardEnded(ShardEndedInput shardEndedInput) {
try {
shardEndedInput.checkpointer().checkpoint();
} catch (ShutdownException | InvalidStateException e) {
LOG.error("Kinesis error on Shard Ended", e);
}
}
@Override
public void shutdownRequested(ShutdownRequestedInput shutdownRequestedInput) {
try {
shutdownRequestedInput.checkpointer().checkpoint();
} catch (ShutdownException | InvalidStateException e) {
LOG.error("Kinesis error on Shutdown Requested", e);
}
}
And configuration code:
public void configure(String streamName, ShardRecordProcessorFactory factory) {
Region region = Region.of(awsRegion);
KinesisAsyncClient kinesisAsyncClient =
KinesisClientUtil.createKinesisAsyncClient(KinesisAsyncClient.builder().region(region));
DynamoDbAsyncClient dynamoClient = DynamoDbAsyncClient.builder().region(region).build();
CloudWatchAsyncClient cloudWatchClient = CloudWatchAsyncClient.builder().region(region).build();
ConfigsBuilder configsBuilder =
new ConfigsBuilder(streamName, appName, kinesisAsyncClient, dynamoClient, cloudWatchClient,
UUID.randomUUID().toString(), factory);
Scheduler scheduler = new Scheduler(
configsBuilder.checkpointConfig(),
configsBuilder.coordinatorConfig(),
configsBuilder.leaseManagementConfig(),
configsBuilder.lifecycleConfig(),
configsBuilder.metricsConfig(),
configsBuilder.processorConfig(),
configsBuilder.retrievalConfig()
.retrievalSpecificConfig(new PollingConfig(streamName, kinesisAsyncClient))
);
Thread schedulerThread = new Thread(scheduler);
schedulerThread.setDaemon(true);
schedulerThread.start();
}
There are two ways to address this. First, the problem.
By default, the KCL is configured to start reading the stream at LATEST. This setting tells the stream reader to pick up the stream at the "current" timestamp.
In your case, you have data in that stream that was placed in there before "now." In order to read that data, you might want to consider reading the earliest data you have in the stream. If you set up a default stream, the stream will store data for 24 hours.
To read the data from the "beginning" of that stream, or 24 hours before you start the KCL application, you'll want to set the stream reader to TRIM_HORIZON. This setting is called initialPositionInStream. You can read about it here. There are three different settings documented in the API.
To solve your issue, the preferred method, as noted in the first link, is to add an entry to the properties file. If you're not using a properties file, you can simply add this to your Scheduler ctor:
Scheduler scheduler = new Scheduler(
configsBuilder.checkpointConfig(),
configsBuilder.coordinatorConfig(),
configsBuilder.leaseManagementConfig(),
configsBuilder.lifecycleConfig(),
configsBuilder.metricsConfig(),
configsBuilder.processorConfig(),
configsBuilder.retrievalConfig()
.initialPositionInStreamExtended(InitialPositionInStreamExtended.newInitialPosition(TRIM_HORIZON))
.retrievalSpecificConfig(new PollingConfig(streamName, kinesisAsyncClient))
);
One thing to keep in mind with this setting is startup behavior when you have data in the stream and you start at TRIM_HORIZON. In this scenario, the RecordProcessor will iterate through records as fast as it can. This could create performance issues at the Kinesis API, or even in downstream systems (wherever you're sending the data once the RecordProcessor has it).

Apache Beam Kafka IO for Json messages - org.apache.kafka.common.errors.SerializationException

I am trying to get familiar with Apache Beam Kafka IO and am getting the following error:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:348)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:318)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:213)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
at com.andrewjones.KafkaConsumerExample.main(KafkaConsumerExample.java:58)
Caused by: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
Following is the piece of code which reads messages from a Kafka topic. I'd appreciate it if someone can provide some pointers.
public class KafkaConsumerJsonExample {
static final String TOKENIZER_PATTERN = "[^\\p{L}]+";
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
// Create the Pipeline object with the options we defined above.
Pipeline p = Pipeline.create(options);
p.apply(KafkaIO.<Long, String>read()
.withBootstrapServers("localhost:9092")
.withTopic("myTopic2")
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.updateConsumerProperties(ImmutableMap.of("auto.offset.reset", (Object)"earliest"))
// We're writing to a file, which does not support unbounded data sources. This line makes it bounded to
// the first 5 records.
// In reality, we would likely be writing to a data source that supports unbounded data, such as BigQuery.
.withMaxNumRecords(5)
.withoutMetadata() // PCollection<KV<Long, String>>
)
.apply(Values.<String>create())
.apply(TextIO.write().to("wordcounts"));
System.out.println("running data pipeline");
p.run().waitUntilFinish();
}
}
The issue is caused by using LongDeserializer for keys that were apparently serialised with something other than a Long serialiser; it depends on how you produced the records.
So, you can either use the proper deserializer or, if the keys don't matter, as a workaround try using StringDeserializer for the keys as well.
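For example (untested sketch based on the question's pipeline), reading both keys and values as strings:
p.apply(KafkaIO.<String, String>read()
    .withBootstrapServers("localhost:9092")
    .withTopic("myTopic2")
    // Read keys as raw strings instead of longs, since they were not serialised as longs.
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", (Object) "earliest"))
    .withMaxNumRecords(5)
    .withoutMetadata()) // PCollection<KV<String, String>>
    .apply(Values.<String>create())
    .apply(TextIO.write().to("wordcounts"));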

GCP Dataflow Apache Beam writing output error handling

I need to apply error handling to my Dataflow for multiple inserts to Spanner with the same primary key. The logic being that an older message may be received after the current message and I do not want to overwrite the saved values. Therefore I will create my mutation as an insert and throw an error when a duplicate insert is attempted.
I have seen several examples of try blocks within DoFns that write to a side output to log any errors. This is a very nice solution, but I need to apply error handling to the step that writes to Spanner, which does not contain a DoFn:
spannerBranchTuples2.get(spannerOutput2)
.apply("Create Spanner Mutation", ParDo.of(createSpannerMutation))
.apply("Write Spanner Records", SpannerIO.write()
.withInstanceId(options.getSpannerInstanceId())
.withDatabaseId(options.getSpannerDatabaseId())
.grouped());
I have not found any documentation that allows error handling to be applied to this step, nor a way to rewrite it as a DoFn. Any suggestions on how to apply error handling to this? Thanks.
There is an interesting pattern for this in the Dataflow documentation.
Basically, the idea is to have a DoFn before sending your results to your writing transforms. It would look something like this:
final TupleTag<Output> successTag = new TupleTag<Output>() {};
final TupleTag<Input> deadLetterTag = new TupleTag<Input>() {};
PCollection<Input> input = /* … */;
PCollectionTuple outputTuple = input.apply(ParDo.of(new DoFn<Input, Output>() {
@Override
public void processElement(ProcessContext c) {
try {
c.output(process(c.element()));
} catch (Exception e) {
LOG.severe("Failed to process input {} -- adding to dead letter file",
c.element(), e);
c.sideOutput(deadLetterTag, c.element());
}
}).withOutputTags(successTag, TupleTagList.of(deadLetterTag)));
outputTuple.get(deadLetterTag)
.apply(/* Write to a file or table or anything */);
outputTuple.get(successTag)
.apply(/* Write to Spanner or any other sink */);
Let me know if this is useful!

Aggregating a huge list from reducer input without running out of memory

At the reduce stage (67% of reduce percentage), my code ends up getting stuck and failing after hours of attempting to complete. I found out that the issue is that the reducer is receiving huge amounts of data that it can't handle and ends up running out of memory, which leads to the reducer being stuck.
Now, I am trying to find a way around this. Currently, I am assembling a list from the values received by the reducer for each key. At the end of the reduce phase, I try to write the key and all of the values in the list. So my question is: how can I get the same functionality of having the key and the list of values related to that key without running out of memory?
public class XMLReducer extends Reducer<Text, Text, Text, TextArrayWritable> {
private final Logger logger = Logger.getLogger(XMLReducer.class);
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//logger.info(key.toString());
Set<String> filesFinal = new HashSet<>();
int size = 0;
for(Text value : values) {
String[] files = value.toString().split(",\\s+");
filesFinal.add(value.toString());
//size++;
}
//logger.info(Integer.toString(size));
String[] temp = new String[filesFinal.size()];
temp = filesFinal.toArray(temp);
Text[] tempText = new Text[filesFinal.size()];
for(int i = 0; i < filesFinal.size(); i++) {
tempText[i] = new Text(temp[i]);
}
}
}
and TextArrayWritable is just a way to write an array to file
You can try reducing the amount of data that is read by the single reducer by writing a Custom partitioner.
HashPartitioner is the default partitioner used by a map reduce job. While this usually gives a fairly uniform distribution, in some cases it is quite possible that many keys get hashed to a single reducer. As a result, a single reducer would have a lot of data compared to the others. In your case, I think this is the issue.
To resolve this:
Analyze your data and the key on which you are doing the group by.
Try to come up with a partitioning function based on your group-by key for your custom Partitioner, limiting the number of keys per partition (a sketch is given after these suggestions).
You would then see an increase in the number of reduce tasks in your job. If the issue is related to uneven key distribution, this should resolve it.
You could also try increasing reducer memory.
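For reference, an illustrative (untested) sketch of such a custom partitioner; the prefix-based scheme and the class name are assumptions and should be adapted to your actual key distribution:
public class XMLKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Partition on a short prefix of the key so that hot key ranges are spread
        // more evenly than with the default HashPartitioner.
        String k = key.toString();
        String prefix = k.length() >= 2 ? k.substring(0, 2) : k;
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// Wire it into the job and raise the reducer count:
// job.setPartitioner(XMLKeyPartitioner.class);
// job.setNumReduceTasks(20);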

Joining a stream against a "table" in Dataflow

Let me use a slightly contrived example to explain what I'm trying to do. Imagine I have a stream of trades coming in, with the stock symbol, share count, and price: { symbol = "GOOG", count = 30, price = 200 }. I want to enrich these events with the name of the stock, in this case "Google".
For this purpose I want to, inside Dataflow, maintain a "table" of symbol->name mappings that is updated by a PCollection<KV<String, String>>, and join my stream of trades with this table, yielding e.g. a PCollection<KV<Trade, String>>.
This seems like a fairly fundamental use case for stream processing applications, yet I'm having a hard time figuring out how to accomplish it in Dataflow. I know it's possible in Kafka Streams.
Note that I do not want to use an external database for the lookups – I need to solve this problem inside Dataflow or switch to Kafka Streams.
I'm going to describe two options. One using side-inputs which should work with the current version of Dataflow (1.X) and one using state within a DoFn which should be part of the upcoming Dataflow (2.X).
Solution for Dataflow 1.X, using side inputs
The general idea here is to use a map-valued side-input to make the symbol->name mapping available to all the workers.
This table will need to be in the global window (so nothing ever ages out), will need to be triggered on every element (or as often as you want new updates to be produced), and will need to accumulate elements across all firings. It will also need some logic to take the latest name for each symbol.
The downside to this solution is that the entire lookup table will be regenerated every time a new entry comes in and it will not be immediately pushed to all workers. Rather, each will get the new mapping "at some point" in the future.
At a high level, this pipeline might look something like this (I haven't tested this code, so there may be some typos):
PCollection<KV<Symbol, Name>> symbolToNameInput = ...;
final PCollectionView<Map<Symbol, Iterable<Name>>> symbolToNames = symbolToNameInput
.apply(Window.<KV<Symbol, Name>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(5))))
.accumulatingFiredPanes())
.apply(View.asMultimap());
Note that we had to use View.asMultimap here. This means that we actually build up all the names for every symbol. When we look things up, we'll need to make sure to take the latest name in the iterable.
PCollection<Detail> symbolDetails = ...;
symbolDetails
.apply(ParDo.withSideInputs(symbolToNames).of(new DoFn<Detail, AugmentedDetails>() {
@Override
public void processElement(ProcessContext c) {
Iterable<Name> names = c.sideInput(symbolToNames).get(c.element().symbol());
Name name = chooseName(names);
c.output(augmentDetails(c.element(), name));
}
}));
Solution for Dataflow 2.X, using the State API
This solution uses a new feature that will be part of the upcoming Dataflow 2.0 release. It is not yet part of the preview releases (currently Dataflow 2.0-beta1) but you can watch the release notes to see when it is available.
The general idea is that keyed state allows us to store some values associated with the specific key. In this case, we're going to remember the latest "name" value we've seen.
Before running the stateful DoFn we're going to wrap each element into a common element type (a NameOrDetails) object. This would look something like the following:
// Convert SymbolToName entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> left = symbolToName
.apply(ParDo.of(new DoFn<SymbolToName, KV<Symbol, NameOrDetails>>() {
@ProcessElement
public void processElement(ProcessContext c) {
SymbolToName e = c.element();
c.output(KV.of(e.getSymbol(), NameOrDetails.name(e.getName())));
}
}));
// Convert detailed entries to KV<Symbol, NameOrDetails>
PCollection<KV<Symbol, NameOrDetails>> right = details
.apply(ParDo.of(new DoFn<Details, KV<Symbol, NameOrDetails>>() {
@ProcessElement
public void processElement(ProcessContext c) {
Details e = c.element();
c.output(KV.of(e.getSymbol(), NameOrDetails.details(e)));
}
}));
// Flatten the two streams together
PCollectionList.of(left).and(right)
.apply(Flatten.pCollections())
.apply(ParDo.of(new DoFn<KV<Symbol, NameOrDetails>, AugmentedDetails>() {
@StateId("name")
private final StateSpec<ValueState<String>> nameSpec =
StateSpecs.value(StringUtf8Coder.of());
@ProcessElement
public void processElement(ProcessContext c,
@StateId("name") ValueState<String> nameState) {
NameOrDetails e = c.element().getValue();
if (e.isName()) {
nameState.write(e.getName());
} else {
String name = nameState.read();
if (name == null) {
// Use symbol if we haven't received a mapping yet.
name = c.element().getKey();
}
c.output(e.getDetails().withName(name));
}
}
}));