How to report invalid data while processing data with Google Dataflow?

I am looking at the documentation and the provided examples to find out how I can report invalid data while processing data with Google's dataflow service.
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
.apply(new SomeTransformation())
.apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
p.run();
In addition to the actual input/output, I want to produce a second output file that contains records that are considered invalid (e.g. missing data, malformed data, values that are too high). I want to troubleshoot those records and process them separately.
Input: gs://.../input.csv
Output: gs://.../output.csv
List of invalid records: gs://.../invalid.csv
How can I redirect those invalid records into a separate output?

You can use PCollectionTuples to return multiple PCollections from a single transform. For example,
TupleTag<String> mainOutput = new TupleTag<>("main");
TupleTag<String> missingData = new TupleTag<>("missing");
TupleTag<String> badValues = new TupleTag<>("bad");
Pipeline p = Pipeline.create(options);
PCollectionTuple all = p
.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
.apply(new SomeTransformation());
all.get(mainOutput)
.apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
all.get(missingData)
.apply(TextIO.Write.named("WriteMissingData").to(...));
...
PCollectionTuples can either be built up directly out of existing PCollections, or emitted from ParDo operations with side outputs, e.g.
PCollectionTuple partitioned = input.apply(ParDo
.of(new DoFn<String, String>() {
public void processElement(ProcessContext c) {
if (checkOK(c.element())) {
// Shows up in partitioned.get(mainOutput).
c.output(...);
} else if (hasMissingData(c.element())) {
// Shows up in partitioned.get(missingData).
c.sideOutput(missingData, c.element());
} else {
// Shows up in partitioned.get(badValues).
c.sideOutput(badValues, c.element());
}
}
})
.withOutputTags(mainOutput, TupleTagList.of(missingData).and(badValues)));
Note that in general the various side outputs need not have the same type, and data can be emitted any number of times to any number of side outputs (rather than the strict partitioning we have here).
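For instance, a side output can carry a richer element type than the main output. Here is a minimal sketch (the isValid check and the tag names are placeholders, not part of the answer above) where bad records are emitted together with a reason string:
TupleTag<String> goodTag = new TupleTag<>("good");
TupleTag<KV<String, String>> badWithReasonTag = new TupleTag<>("badWithReason");

PCollectionTuple results = input.apply(ParDo
    .of(new DoFn<String, String>() {
      public void processElement(ProcessContext c) {
        String record = c.element();
        if (isValid(record)) {  // isValid is a hypothetical check, like checkOK above
          c.output(record);
        } else {
          // The side output element type (KV<String, String>) differs from the main output type (String).
          c.sideOutput(badWithReasonTag, KV.of(record, "failed validation"));
        }
      }
    })
    .withOutputTags(goodTag, TupleTagList.of(badWithReasonTag)));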
Your SomeTransformation class could then look something like
class SomeTransformation extends PTransform<PCollection<String>,
PCollectionTuple> {
public PCollectionTuple apply(PCollection<String> input) {
// Filter into good and bad data.
PCollectionTuple partitioned = ...
// Process the good data.
PCollection<String> processed =
partitioned.get(mainOutput)
.apply(...)
.apply(...)
...;
// Repackage everything into a new output tuple.
return PCollectionTuple.of(mainOutput, processed)
.and(missingData, partitioned.get(missingData))
.and(badValues, partitioned.get(badValues));
}
}

Robert's suggestion of using sideOutputs is great, but note that this will only work if the bad data is identified by your ParDos. There currently isn't a way to identify bad records encountered during initial decoding (where the error occurs in Coder.decode). We've got plans to work on that soon.
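A common workaround in the meantime, sketched below rather than taken from the answers above, is to read the input as raw strings and do the parsing inside a ParDo with a try/catch, so parse failures can be routed to a side output instead of failing inside Coder.decode. MyRecord and parseRecord are hypothetical names:
TupleTag<MyRecord> parsedTag = new TupleTag<>("parsed");
TupleTag<String> unparseableTag = new TupleTag<>("unparseable");

PCollectionTuple parsed = rawLines.apply(ParDo
    .of(new DoFn<String, MyRecord>() {
      public void processElement(ProcessContext c) {
        try {
          // parseRecord is a hypothetical parser that throws on malformed input.
          c.output(parseRecord(c.element()));
        } catch (Exception e) {
          // Keep the raw line so it can be written out and inspected separately.
          c.sideOutput(unparseableTag, c.element());
        }
      }
    })
    .withOutputTags(parsedTag, TupleTagList.of(unparseableTag)));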

Related

Can someone please explain the proper usage of Timers and Triggers in Apache Beam?

I'm looking for some examples of the usage of Triggers and Timers in Apache Beam. I want to use processing-time timers to listen to my data from Pub/Sub every 5 minutes, and processing-time triggers to process the data collected over an hour all together, in Python.
Please take a look at the following resources: Stateful processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam.
The first blog post is a more general introduction to handling state for context, and the second has some examples of buffering and triggering after a certain period of time, which seems similar to what you are trying to do.
A full example was requested. Here is what I was able to come up with:
PCollection<String> records =
    pipeline.apply(
        "ReadPubsub",
        PubsubIO.readStrings()
            .fromSubscription(
                "projects/{project}/subscriptions/{subscription}"));

TupleTag<Iterable<String>> every5MinTag = new TupleTag<>();
TupleTag<Iterable<String>> everyHourTag = new TupleTag<>();

PCollectionTuple timersTuple =
    records
        .apply("WithKeys", WithKeys.of(1)) // A KV<> is required to use state. Keying by data is more appropriate than hardcode.
        .apply(
            "Batch",
            ParDo.of(
                    new DoFn<KV<Integer, String>, Iterable<String>>() {

                      @StateId("buffer5Min")
                      private final StateSpec<BagState<String>> bufferedEvents5Min =
                          StateSpecs.bag();

                      @StateId("count5Min")
                      private final StateSpec<ValueState<Integer>> countState5Min =
                          StateSpecs.value();

                      @TimerId("every5Min")
                      private final TimerSpec every5MinSpec =
                          TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

                      @StateId("bufferHour")
                      private final StateSpec<BagState<String>> bufferedEventsHour =
                          StateSpecs.bag();

                      @StateId("countHour")
                      private final StateSpec<ValueState<Integer>> countStateHour =
                          StateSpecs.value();

                      @TimerId("everyHour")
                      private final TimerSpec everyHourSpec =
                          TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

                      @ProcessElement
                      public void process(
                          @Element KV<Integer, String> record,
                          @StateId("count5Min") ValueState<Integer> count5MinState,
                          @StateId("countHour") ValueState<Integer> countHourState,
                          @StateId("buffer5Min") BagState<String> buffer5Min,
                          @StateId("bufferHour") BagState<String> bufferHour,
                          @TimerId("every5Min") Timer every5MinTimer,
                          @TimerId("everyHour") Timer everyHourTimer) {
                        if (Objects.firstNonNull(count5MinState.read(), 0) == 0) {
                          every5MinTimer
                              .offset(Duration.standardMinutes(1))
                              .align(Duration.standardMinutes(1))
                              .setRelative();
                        }
                        buffer5Min.add(record.getValue());

                        if (Objects.firstNonNull(countHourState.read(), 0) == 0) {
                          everyHourTimer
                              .offset(Duration.standardMinutes(60))
                              .align(Duration.standardMinutes(60))
                              .setRelative();
                        }
                        bufferHour.add(record.getValue());
                      }

                      @OnTimer("every5Min")
                      public void onTimerEvery5Min(
                          OnTimerContext context,
                          @StateId("buffer5Min") BagState<String> bufferState,
                          @StateId("count5Min") ValueState<Integer> countState) {
                        if (!bufferState.isEmpty().read()) {
                          context.output(every5MinTag, bufferState.read());
                          bufferState.clear();
                          countState.clear();
                        }
                      }

                      @OnTimer("everyHour")
                      public void onTimerEveryHour(
                          OnTimerContext context,
                          @StateId("bufferHour") BagState<String> bufferState,
                          @StateId("countHour") ValueState<Integer> countState) {
                        if (!bufferState.isEmpty().read()) {
                          context.output(everyHourTag, bufferState.read());
                          bufferState.clear();
                          countState.clear();
                        }
                      }
                    })
                .withOutputTags(every5MinTag, TupleTagList.of(everyHourTag)));

timersTuple
    .get(every5MinTag)
    .setCoder(IterableCoder.of(StringUtf8Coder.of()))
    .apply(<<do something every 5 min>>);

timersTuple
    .get(everyHourTag)
    .setCoder(IterableCoder.of(StringUtf8Coder.of()))
    .apply(<<do something every hour>>);

pipeline.run().waitUntilFinish();

Write more than 25 items using BatchWriteItemEnhancedRequest Dynamodb JAVA SDK 2

I have a List of items to be inserted into a DynamoDB table. The size of the list may vary from 100 to 10k. I'm looking for an optimised way to batch-write all the items using BatchWriteItemEnhancedRequest (Java SDK 2). What is the best way to add the items to the WriteBatch builder and then write the request using BatchWriteItemEnhancedRequest?
My Current Code:
WriteBatch.Builder<T> builder = WriteBatch.builder(itemType).mappedTableResource(getTable()); // itemType is the mapped item class
items.forEach(item -> { builder.addPutItem(item); });
BatchWriteItemEnhancedRequest bwr = BatchWriteItemEnhancedRequest.builder().writeBatches(builder.build()).build();
BatchWriteResult batchWriteResult =
DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
do {
// Check for unprocessed keys which could happen if you exceed
// provisioned throughput
List<T> unprocessedItems = batchWriteResult.unprocessedPutItemsForTable(getTable());
if (unprocessedItems.size() != 0) {
unprocessedItems.forEach(unprocessedItem -> {
builder.addPutItem(unprocessedItem);
});
batchWriteResult = DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
}
} while (batchWriteResult.unprocessedPutItemsForTable(getTable()).size() > 0);
I'm looking for batching logic and a better way to execute the BatchWriteItemEnhancedRequest.
I came up with a utility class to deal with that. AWS's batches-of-batches approach in v2 is overly complex for most use cases, especially when we're still limited to 25 items overall.
import java.util.List;

import com.google.common.collect.Lists;

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.model.BatchWriteItemEnhancedRequest;
import software.amazon.awssdk.enhanced.dynamodb.model.WriteBatch;

public class DynamoDbUtil {

    private static final int MAX_DYNAMODB_BATCH_SIZE = 25; // AWS blows chunks if you try to include more than 25 items in a batch or sub-batch

    /**
     * Writes the list of items to the specified DynamoDB table.
     */
    public static <T> void batchWrite(Class<T> itemType, List<T> items, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
        List<List<T>> chunksOfItems = Lists.partition(items, MAX_DYNAMODB_BATCH_SIZE);

        chunksOfItems.forEach(chunkOfItems -> {
            List<T> unprocessedItems = batchWriteImpl(itemType, chunkOfItems, client, table);
            while (!unprocessedItems.isEmpty()) {
                // some failed (provisioning problems, etc.), so write those again
                unprocessedItems = batchWriteImpl(itemType, unprocessedItems, client, table);
            }
        });
    }

    /**
     * Writes a single batch of (at most) 25 items to DynamoDB.
     * Note that the overall limit of items in a batch is 25, so you can't have nested batches
     * of 25 each that would exceed that overall limit.
     *
     * @return those items that couldn't be written due to provisioning issues, etc., but were otherwise valid
     */
    private static <T> List<T> batchWriteImpl(Class<T> itemType, List<T> chunkOfItems, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
        WriteBatch.Builder<T> subBatchBuilder = WriteBatch.builder(itemType).mappedTableResource(table);
        chunkOfItems.forEach(subBatchBuilder::addPutItem);

        BatchWriteItemEnhancedRequest.Builder overallBatchBuilder = BatchWriteItemEnhancedRequest.builder();
        overallBatchBuilder.addWriteBatch(subBatchBuilder.build());

        return client.batchWriteItem(overallBatchBuilder.build()).unprocessedPutItemsForTable(table);
    }
}
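A usage sketch, assuming a bean-mapped Customer class and a table named "customers" (both hypothetical, not part of the answer above):
DynamoDbEnhancedClient enhancedClient = DynamoDbEnhancedClient.builder()
        .dynamoDbClient(DynamoDbClient.create())
        .build();

// Customer is a hypothetical @DynamoDbBean-annotated class; "customers" is a hypothetical table name.
DynamoDbTable<Customer> customerTable =
        enhancedClient.table("customers", TableSchema.fromBean(Customer.class));

List<Customer> customers = loadCustomers(); // hypothetical source of the items to write
DynamoDbUtil.batchWrite(Customer.class, customers, enhancedClient, customerTable);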

Create XML dataset with the same table name as initial data set in DBUnit?

I'm trying to create an initial DB state in DB Unit like this...
public function getDataSet() {
$primary = new \PHPUnit\DbUnit\DataSet\CompositeDataSet();
$fixturePaths = [
"test/Seeds/Upc/DB/UpcSelect.xml",
"test/Seeds/Generic/DB/ProductUpcSelect.xml"
];
foreach($fixturePaths as $fixturePath) {
$dataSet = $this->createXmlDataSet($fixturePath);
$primary->addDataSet($dataSet);
}
return $primary;
}
Then after my query I'm attempting to call this user-defined function...
protected function compareDatabase(String $seedPath, String $table) {
$expected = $this->createFlatXmlDataSet($seedPath)->getTable($table);
$result = $this->getConnection()->createQueryTable($table, "SELECT * FROM $table");
$this->assertTablesEqual($expected, $result);
}
The idea here is that I have an initial DB state, run my query, then compare the actual table state with the XML data set representing what I expect the table to look like. This process is described in PHPUnit's documentation for DBUnit, but I keep getting an exception:
PHPUnit\DbUnit\InvalidArgumentException: There is already a table named upc with different table definition
Test example...
public function testDeleteByUpc() {
$mapper = new UpcMapper($this->getPdo());
$mapper->deleteByUpc("someUpcCode1");
$this->compareDatabase("test/Seeds/Upc/DB/UpcAfterDelete.xml", 'upc');
}
I seem to be following the docs...how is this supposed to be done?
This was actually unrelated to creating a second XML Dataset. This exception was thrown because the two fixtures I loaded in my getDataSet() method both had table definitions for upc.

Issue with Weka core DenseInstance

I'm building up an Instances object, adding Attributes, and then adding data in the form of Instance objects.
When I go to write it out, the toString() method throws an IndexOutOfBoundsException and is unable to evaluate the data in the Instances. I receive the error when I try to print out the data, and I can also see the exception being thrown in the debugger, which shows it can't evaluate toString() for the data object.
The only clue I have is that the error message seems to be taking the first data element (StudentId) and using it as an index. I'm confused as to why.
The code:
// Set up the attributes for the Weka data model
ArrayList<Attribute> attributes = new ArrayList<>();
attributes.add(new Attribute("StudentIdentifier", true));
attributes.add(new Attribute("CourseGrade", true));
attributes.add(new Attribute("CourseIdentifier"));
attributes.add(new Attribute("Term", true));
attributes.add(new Attribute("YearCourseTaken", true));
// Create the data model object - I'm not happy that capacity is required and fixed? But that's another issue
Instances dataSet = new Instances("Records", attributes, 500);
// Set the attribute that will be used for prediction purposes - that will be CourseIdentifier
dataSet.setClassIndex(2);
// Pull back all the records in this term range, create Weka Instance objects for each and add to the data set
List<Record> records = recordsInTermRangeFindService.find(0, 10);
int count = 0;
for (Record r : records) {
Instance i = new DenseInstance(attributes.size());
i.setValue(attributes.get(0), r.studentIdentifier);
i.setValue(attributes.get(1), r.courseGrade);
i.setValue(attributes.get(2), r.courseIdentifier);
i.setValue(attributes.get(3), r.term);
i.setValue(attributes.get(4), r.yearCourseTaken);
dataSet.add(i);
}
System.out.println(dataSet.size());
BufferedWriter writer = null;
try {
writer = new BufferedWriter(new FileWriter("./test.arff"));
writer.write(dataSet.toString());
writer.flush();
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
The error message:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1010, Size: 0
I finally figured it out. I was making the Attributes strings with the 'true' second parameter in the constructors, but the values coming out of the database table were integers. I needed to change my lines to convert the integers to strings:
i.setValue(attributes.get(0), Integer.toString(r.studentIdentifier));
However, that created a different set of issues for me as things like the Apriori algorithm don't work on strings! I'm continuing to plug along learning Weka.
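For reference, one way around the string-attribute limitation (a sketch with assumed attribute values, not from the original post) is to declare nominal attributes with an explicit list of values; algorithms such as Apriori work with nominal data:
// Uses weka.core.Attribute, weka.core.DenseInstance, weka.core.Instances as above.
// Declare CourseGrade as a nominal attribute instead of a string attribute;
// the grade values here are assumptions for illustration.
ArrayList<String> gradeValues = new ArrayList<>(Arrays.asList("A", "B", "C", "D", "F"));
ArrayList<Attribute> attrs = new ArrayList<>();
attrs.add(new Attribute("CourseGrade", gradeValues)); // nominal attribute
attrs.add(new Attribute("YearCourseTaken"));          // numeric attribute

Instances data = new Instances("Records", attrs, 0);
DenseInstance row = new DenseInstance(attrs.size());
row.setDataset(data);             // associate the instance with the dataset header
row.setValue(attrs.get(0), "B");  // nominal value must be one of the declared values
row.setValue(attrs.get(1), 2021);
data.add(row);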

How to return a JSON object I made in a reduce function

I need your help with a CouchDB reduce function.
I have some docs like:
{'about':'1', 'foo':'a1','bar':'qwe'}
{'about':'1', 'foo':'a1','bar':'rty'}
{'about':'1', 'foo':'a2','bar':'uio'}
{'about':'1', 'foo':'a1','bar':'iop'}
{'about':'2', 'foo':'b1','bar':'qsd'}
{'about':'2', 'foo':'b1','bar':'fgh'}
{'about':'3', 'foo':'c1','bar':'wxc'}
{'about':'3', 'foo':'c2','bar':'vbn'}
As you can see, they all have the same key; just the values are different.
My purpose is to use Map/Reduce, and the return I expect would be:
'rows':[ 'keys':'1','value':{'1':{'foo':'a1', 'at':'rty'},
'2':{'foo':'a2', 'at':'uio'},
'3':{'foo':'a1', 'at':'iop'}}
'keys':'1','value':{'foo':'a1', 'bar':'rty'}
...
'keys':'3','value':{'foo':'c2', 'bar':'vbn'}
]
Here is the result of my Map function:
'rows':[ 'keys':'1','value':{'foo':'a1', 'bar':'qwe'}
'keys':'1','value':{'foo':'a1', 'bar':'rty'}
...
'keys':'3','value':{'foo':'c2', 'bar':'vbn'}
]
But my Reduce function isn't working:
function(keys,values,rereduce){
var res= {};
var lastCheck = values[0];
for(i=0; i<values.length;++i)
{
value = values[i];
if (lastCheck.foo != value.foo)
{
res.append({'change':[i:lastCheck]});
}
lastCheck = value;
}
return res;
}
Is it possible to get what I expect, or do I need to use another approach?
You should not do this in the reduce function. As the CouchDB wiki explains:
If you are building a composite return structure in your reduce, or only transforming the values field, rather than summarizing it, you might be misusing this feature.
There are two approaches that you can take instead:
1. Transform the results at your application layer.
2. Use a list function.
List functions are simple. I will try to explain them here.
List functions, like views, are saved in design documents under the key lists, like so:
"lists":{
"formatResults" : "function(head,req) {....}"
}
To call the list function, you use a URL like this:
http://localhost:5984/your-database/_design/your-designdoc/_list/your-list-function/your-view-name
Here is an example of a list function:
function(head, req) {
    var row = getRow();
    if (!row) {
        return 'no rows';
    }
    var jsonOb = {};
    do {
        // construct the json object here, e.g. jsonOb[row.key] = row.value;
    } while (row = getRow());
    return {"body": JSON.stringify(jsonOb), "headers": {"Content-Type": "application/json"}};
}
The getRow function is of interest to us. It returns the next row of the view result, so we can read it like this:
row.key for the key
row.value for the value
All you have to do now is construct the JSON the way you want and then send it.
By the way, you can use log() to debug your functions.
I hope this helps a little.
Apparently, you now need to use
provides('json', function() { ... });
As in:
Simplify Couchdb JSON response