I have a List of items to be inserted into a DynamoDB table. The size of the list may vary from 100 to 10k. I am looking for an optimised way to batch write all the items using BatchWriteItemEnhancedRequest (Java SDK 2). What is the best way to add the items to the WriteBatch builder and then write the request using BatchWriteItemEnhancedRequest?
My Current Code:
WriteBatch.Builder<T> builder = WriteBatch.builder(getItemClass()).mappedTableResource(getTable()); // getItemClass() returns the item's Class<T>, in the same style as getTable()
items.forEach(item -> { builder.addPutItem(item); });
BatchWriteItemEnhancedRequest bwr = BatchWriteItemEnhancedRequest.builder().writeBatches(builder.build()).build();
BatchWriteResult batchWriteResult =
DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
do {
// Check for unprocessed keys which could happen if you exceed
// provisioned throughput
List<T> unprocessedItems = batchWriteResult.unprocessedPutItemsForTable(getTable());
if (unprocessedItems.size() != 0) {
unprocessedItems.forEach(unprocessedItem -> {
builder.addPutItem(unprocessedItem);
});
batchWriteResult = DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
}
} while (batchWriteResult.unprocessedPutItemsForTable(getTable()).size() > 0);
I am looking for batching logic and a better way to execute the BatchWriteItemEnhancedRequest.
I came up with a utility class to deal with that. The SDK's batches-of-batches approach in v2 is overly complex for most use cases, especially when we're still limited to 25 items overall.
public class DynamoDbUtil {
private static final int MAX_DYNAMODB_BATCH_SIZE = 25; // AWS blows chunks if you try to include more than 25 items in a batch or sub-batch
/**
* Writes the list of items to the specified DynamoDB table.
*/
public static <T> void batchWrite(Class<T> itemType, List<T> items, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
List<List<T>> chunksOfItems = Lists.partition(items, MAX_DYNAMODB_BATCH_SIZE); // Guava's Lists.partition returns a List of sublists
chunksOfItems.forEach(chunkOfItems -> {
List<T> unprocessedItems = batchWriteImpl(itemType, chunkOfItems, client, table);
while (!unprocessedItems.isEmpty()) {
// some failed (provisioning problems, etc.), so write those again
unprocessedItems = batchWriteImpl(itemType, unprocessedItems, client, table);
}
});
}
/**
* Writes a single batch of (at most) 25 items to DynamoDB.
* Note that the overall limit of items in a batch is 25, so you can't have nested batches
* of 25 each that would exceed that overall limit.
*
* @return those items that couldn't be written due to provisioning issues, etc., but were otherwise valid
*/
private static <T> List<T> batchWriteImpl(Class<T> itemType, List<T> chunkOfItems, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
WriteBatch.Builder<T> subBatchBuilder = WriteBatch.builder(itemType).mappedTableResource(table);
chunkOfItems.forEach(subBatchBuilder::addPutItem);
BatchWriteItemEnhancedRequest.Builder overallBatchBuilder = BatchWriteItemEnhancedRequest.builder();
overallBatchBuilder.addWriteBatch(subBatchBuilder.build());
return client.batchWriteItem(overallBatchBuilder.build()).unprocessedPutItemsForTable(table);
}
}
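For reference, here is a minimal sketch of how the utility might be wired up. The Customer bean, the "customer" table name and the loadCustomers() helper are illustrative assumptions, not part of the original class:
import java.util.List;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
public class BatchWriteExample {
    public static void main(String[] args) {
        DynamoDbEnhancedClient enhancedClient = DynamoDbEnhancedClient.builder()
                .dynamoDbClient(DynamoDbClient.create())
                .build();
        // Customer is assumed to be a @DynamoDbBean mapped to the "customer" table.
        DynamoDbTable<Customer> customerTable =
                enhancedClient.table("customer", TableSchema.fromBean(Customer.class));
        List<Customer> customers = loadCustomers(); // anywhere from 100 to 10k items
        DynamoDbUtil.batchWrite(Customer.class, customers, enhancedClient, customerTable);
    }
    private static List<Customer> loadCustomers() {
        return List.of(); // placeholder for however the items are actually produced
    }
}
One caveat: the retry loop in batchWrite resubmits unprocessed items immediately, so under sustained throttling it can hammer the table; adding a small exponential backoff between retries is usually worthwhile.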
Related
I'm looking for some examples of the usage of Triggers and Timers in Apache Beam. I want to use processing-time timers to listen to my data from Pub/Sub every 5 minutes, and processing-time triggers to process the data collected over an hour altogether, in Python.
Please take a look at the following resources: Stateful processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam
The first blog post is more general in how to handle states for context, and the second has some examples on buffering and triggering after a certain period of time, which seems similar to what you are trying to do.
A full example was requested. Here is what I was able to come up with:
PCollection<String> records =
pipeline.apply(
"ReadPubsub",
PubsubIO.readStrings()
.fromSubscription(
"projects/{project}/subscriptions/{subscription}"));
TupleTag<Iterable<String>> every5MinTag = new TupleTag<>();
TupleTag<Iterable<String>> everyHourTag = new TupleTag<>();
PCollectionTuple timersTuple =
records
.apply("WithKeys", WithKeys.of(1)) // A KV<> is required to use state. Keying by data is more appropriate than hardcode.
.apply(
"Batch",
ParDo.of(
new DoFn<KV<Integer, String>, Iterable<String>>() {
#StateId("buffer5Min")
private final StateSpec<BagState<String>> bufferedEvents5Min =
StateSpecs.bag();
#StateId("count5Min")
private final StateSpec<ValueState<Integer>> countState5Min =
StateSpecs.value();
#TimerId("every5Min")
private final TimerSpec every5MinSpec =
TimerSpecs.timer(TimeDomain.PROCESSING_TIME);
#StateId("bufferHour")
private final StateSpec<BagState<String>> bufferedEventsHour =
StateSpecs.bag();
#StateId("countHour")
private final StateSpec<ValueState<Integer>> countStateHour =
StateSpecs.value();
#TimerId("everyHour")
private final TimerSpec everyHourSpec =
TimerSpecs.timer(TimeDomain.PROCESSING_TIME);
@ProcessElement
public void process(
@Element KV<Integer, String> record,
@StateId("count5Min") ValueState<Integer> count5MinState,
@StateId("countHour") ValueState<Integer> countHourState,
@StateId("buffer5Min") BagState<String> buffer5Min,
@StateId("bufferHour") BagState<String> bufferHour,
@TimerId("every5Min") Timer every5MinTimer,
@TimerId("everyHour") Timer everyHourTimer) {
if (Objects.firstNonNull(count5MinState.read(), 0) == 0) {
every5MinTimer
.offset(Duration.standardMinutes(1))
.align(Duration.standardMinutes(1))
.setRelative();
}
buffer5Min.add(record.getValue());
if (Objects.firstNonNull(countHourState.read(), 0) == 0) {
everyHourTimer
.offset(Duration.standardMinutes(60))
.align(Duration.standardMinutes(60))
.setRelative();
}
bufferHour.add(record.getValue());
}
#OnTimer("every5Min")
public void onTimerEvery5Min(
OnTimerContext context,
#StateId("buffer5Min") BagState<String> bufferState,
#StateId("count5Min") ValueState<Integer> countState) {
if (!bufferState.isEmpty().read()) {
context.output(every5MinTag, bufferState.read());
bufferState.clear();
countState.clear();
}
}
#OnTimer("everyHour")
public void onTimerEveryHour(
OnTimerContext context,
#StateId("bufferHour") BagState<String> bufferState,
#StateId("countHour") ValueState<Integer> countState) {
if (!bufferState.isEmpty().read()) {
context.output(everyHourTag, bufferState.read());
bufferState.clear();
countState.clear();
}
}
})
.withOutputTags(every5MinTag, TupleTagList.of(everyHourTag)));
timersTuple
.get(every5MinTag)
.setCoder(IterableCoder.of(StringUtf8Coder.of()))
.apply(<<do something every 5 min>>);
timersTuple
.get(everyHourTag)
.setCoder(IterableCoder.of(StringUtf8Coder.of()))
.apply(<< do something every hour>>);
pipeline.run().waitUntilFinish();
I am looking at the documentation and the provided examples to find out how I can report invalid data while processing data with Google's Dataflow service.
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
.apply(new SomeTransformation())
.apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
p.run();
In addition to the actual input/output, I want to produce a second output file that contains records that are considered invalid (e.g. missing data, malformed data, values that are too high). I want to troubleshoot those records and process them separately.
Input: gs://.../input.csv
Output: gs://.../output.csv
List of invalid records: gs://.../invalid.csv
How can I redirect those invalid records into a separate output?
You can use PCollectionTuples to return multiple PCollections from a single transform. For example,
TupleTag<String> mainOutput = new TupleTag<>("main");
TupleTag<String> missingData = new TupleTag<>("missing");
TupleTag<String> badValues = new TupleTag<>("bad");
Pipeline p = Pipeline.create(options);
PCollectionTuple all = p
.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
.apply(new SomeTransformation());
all.get(mainOutput)
.apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
all.get(missingData)
.apply(TextIO.Write.named("WriteMissingData").to(...));
...
PCollectionTuples can either be built up directly out of existing PCollections, or emitted from ParDo operations with side outputs, e.g.
PCollectionTuple partitioned = input.apply(ParDo
.of(new DoFn<String, String>() {
public void processElement(ProcessContext c) {
if (checkOK(c.element())) {
// Shows up in partitioned.get(mainOutput).
c.output(...);
} else if (hasMissingData(c.element())) {
// Shows up in partitioned.get(missingData).
c.sideOutput(missingData, c.element());
} else {
// Shows up in partitioned.get(badValues).
c.sideOutput(badValues, c.element());
}
}
})
.withOutputTags(mainOutput, TupleTagList.of(missingData).and(badValues)));
Note that in general the various side outputs need not have the same type, and data can be emitted any number of times to any number of side outputs (rather than the strict partitioning we have here).
Your SomeTransformation class could then look something like
class SomeTransformation extends PTransform<PCollection<String>,
PCollectionTuple> {
public PCollectionTuple apply(PCollection<String> input) {
// Filter into good and bad data.
PCollectionTuple partitioned = ...
// Process the good data.
PCollection<String> processed =
partitioned.get(mainOutput)
.apply(...)
.apply(...)
...;
// Repackage everything into a new output tuple.
return PCollectionTuple.of(mainOutput, processed)
.and(missingData, partitioned.get(missingData))
.and(badValues, partitioned.get(badValues));
}
}
Robert's suggestion of using sideOutputs is great, but note that this will only work if the bad data is identified by your ParDos. There currently isn't a way to identify bad records hit during initial decoding (where the error is hit in Coder.decode). We've got plans to work on that soon.
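Until then, one possible workaround is to read the raw lines as plain strings and do the real parsing yourself inside a ParDo, so that decode-style failures surface as catchable exceptions you can route to a side output, in the same style as above. A rough sketch; MyRecord, parseRecord and getInvalidOutput are illustrative placeholders, not part of the original answers:
TupleTag<MyRecord> parsedTag = new TupleTag<>("parsed");
TupleTag<String> unparseableTag = new TupleTag<>("unparseable");
PCollection<String> rawLines =
    p.apply(TextIO.Read.named("ReadRawLines").from(options.getInput()));
PCollectionTuple parseResult = rawLines.apply(ParDo
    .of(new DoFn<String, MyRecord>() {
      public void processElement(ProcessContext c) {
        try {
          // parseRecord is a hypothetical helper that throws on malformed input.
          c.output(parseRecord(c.element()));
        } catch (Exception e) {
          // Keep the raw line so it can be written out and inspected later.
          c.sideOutput(unparseableTag, c.element());
        }
      }
    })
    .withOutputTags(parsedTag, TupleTagList.of(unparseableTag)));
// getInvalidOutput() is a hypothetical pipeline option pointing at the invalid-records file.
parseResult.get(unparseableTag)
    .apply(TextIO.Write.named("WriteUnparseable").to(options.getInvalidOutput()));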
We have a Silverlight application that calls a web service to retrieve database rows from SQL Server. These are then displayed on the screen in a paged control. However, the entire table is needed and it consists of several thousand rows. On the customer system, if we ask for more than around 1500 rows, we get HttpRequestTimedOutWithoutDetail. On our development system we need about 500,000 rows before this happens.
Obviously, what we should be doing is paging the results and returning them to the Silverlight client bit by bit. But I don't know how to do this. Can anyone advise, or point me to some web pages that clearly explain the principles and methods? (I am a bit simple!)
Here is the code in the Web Service:
public IQueryable<Referral> GetReferrals()
{
/*
* In the customer's environments it seems that any more than around 1500 referrals will break the system: they will fail to load.
* In the dev environment it takes around 500,000 so it seems to be timeout related.
*
* The code below was an attempt to restrict the number, only returning referrals above a certain size and within a certain age.
* It seems the customer is more interested in the smaller referrals though since they are more likely to be added to existing
* substations so if this method is re-instated, we should be using '<' instead of '>'
*
int mdToBeMet = int.Parse(ConfigurationManager.AppSettings["ReferralMDToBeMet"]);
DateTime minusNYears = DateTime.Today.AddYears(int.Parse(ConfigurationManager.AppSettings["ReferralTargetDate"]) * -1);
int maxReferralsCount = int.Parse(ConfigurationManager.AppSettings["ReferralMaxRecordCount"]);
if (mdToBeMet != 0 && maxReferralsCount != 0)
{
return this.ObjectContext.Referrals.Where(x => x.MD_to_be_Met > mdToBeMet && x.Target_Date > minusNYears).OrderByDescending(y => y.Target_Date).Take(maxReferralsCount);
}
*/
/*
* This is the 2nd attempt: the customer is mainly interested in referrals that have an allocated substation
*/
bool allocatedReferralsOnly = bool.Parse(ConfigurationManager.AppSettings["ReferralAllocatedOnly"]);
int maxReferralsCount = int.Parse(ConfigurationManager.AppSettings["ReferralMaxRecordCount"]);
if (allocatedReferralsOnly)
{
var referrals = this.ObjectContext.Referrals.Where(x => x.Sub_no != "").OrderByDescending(y => y.Target_Date).Take(maxReferralsCount);
return referrals;
}
else
{
/*
* Ideally, we should just page the referrals here and return to retrieving all of them, bit by bit.
*/
var referrals = this.ObjectContext.Referrals;
return referrals;
}
}
Many thanks for any suggestions.
An example to expand on the comment I gave ...
/// <summary>
/// Returns a page of Referrals
/// pageNumber is 0-index based
/// </summary>
public IQueryable<Referral> GetReferrals(int pageNumber)
{
var pageSize = 100;
return ObjectContext.Referrals.OrderByDescending(r => r.Target_Date).Skip(pageNumber * pageSize).Take(pageSize); // LINQ to Entities requires an explicit ordering before Skip/Take
}
Obviously you could pass in the pageSize too if you wanted, or define it as a constant somewhere outside of this method.
I have a requirement wherein I need to maintain a list in memory.
Eg -
List<Product>
Every user will run the application and add a new product, update a product, or remove a product from the list,
and all the changes should be reflected in that list.
I am trying to store the list in the ObjectCache,
but when I run the application it creates products; the moment I run it a second time, the list is no longer in the cache.
I need some help.
Following is the code -
public class ProductManagement
{
List<Product> _productList;
ObjectCache cache= MemoryCache.Default;
public int Createproduct(int id,string productname)
{
if (cache.Contains("Productlist"))
{
_productList = (List<Product>)cache.Get("Productlist");
}
else
{
_productList = new List<Product>();
}
Product pro = new Product();
pro.ID = id;
pro.ProductName = productname;
_productList.Add(pro);
cache.AddOrGetExisting("Productlist", _productList, DateTime.MaxValue);
return id;
}
public Product GetProductbyId(int id)
{
if (cache.Contains("Productlist"))
{
_productList = (List<Product>)cache.Get("Productlist");
}
else
{
_productList = new List<Product>();
}
var product = _productList.Single(i => i.ID == id);
return product;
}
}
How can I make sure that the list is persistent even when the application is not running? Is this possible?
Also, how can I keep the cache alive and retrieve the same cache the next time?
Many thanks for any help.
Memory can mean different things (http://en.wikipedia.org/wiki/Computer_memory). MemoryCache stores information in process memory, which exists only while the process exists.
If your requirement is to maintain the list in process memory, your task is already done; in fact you do not even need ObjectCache cache = MemoryCache.Default; you can just keep the list as a field of ProductManagement.
If you need to keep the list between application launches, you need to do additional work to write the list to a file when you close the application and read it back when you open the application.
I recently set up a 4-node Cassandra cluster for learning, with one column family which holds time series data as:
Key -> {column name: timeUUID, column value: csv log line, ttl: 1 year}. I used the Netflix Astyanax Java client to load about 1 million log lines.
I also configured Hadoop to run map-reduce jobs, with 1 namenode and 4 datanodes, to run some analytics on the Cassandra data.
All the available examples on the internet use a column name in the SlicePredicate for the Hadoop job configuration, whereas I have timeUUIDs as column names. How can I efficiently feed Cassandra data to the Hadoop job configuration in batches of 1000 columns at a time?
There are more than 10,000 columns for some rows in this test data, and it is expected to be more in the real data.
I configure my job as
public int run(String[] arg0) throws Exception {
Job job = new Job(getConf(), JOB_NAME);
job.setJarByClass(LogTypeCounterByDate.class);
job.setMapperClass(LogTypeCounterByDateMapper.class);
job.setReducerClass(LogTypeCounterByDateReducer.class);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
ConfigHelper.setRangeBatchSize(getConf(), 1000);
SliceRange sliceRange = new SliceRange(ByteBuffer.wrap(new byte[0]),
ByteBuffer.wrap(new byte[0]), true, 1000);
SlicePredicate slicePredicate = new SlicePredicate();
slicePredicate.setSlice_range(sliceRange);
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
ConfigHelper.setInputRpcPort(job.getConfiguration(), INPUT_RPC_PORT);
ConfigHelper.setInputInitialAddress(job.getConfiguration(), INPUT_INITIAL_ADRESS);
ConfigHelper.setInputPartitioner(job.getConfiguration(), INPUT_PARTITIONER);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), slicePredicate);
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.waitForCompletion(true);
return job.isSuccessful() ? 0 : 1;
}
But I am not able to understand how to define the Mapper; could you kindly provide a template for the Mapper class?
public static class LogTypeCounterByDateMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
{
private Text key = null;
private LongWritable value = null;
@Override
protected void setup(Context context){
}
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context){
//String[] lines = columns.;
}
}
ConfigHelper.setRangeBatchSize(getConf(), 1000)
...
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(TimeUUID.asByteBuffer(startValue), TimeUUID.asByteBuffer(endValue), false, 1000))
ConfigHelper.setInputSlicePredicate(conf, predicate)
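For the Mapper itself, a minimal sketch along the following lines should work with ColumnFamilyInputFormat. The CSV handling (taking the log type from the first field) is a placeholder assumption about your log format:
// imports needed in the enclosing job class:
// java.io.IOException, java.nio.ByteBuffer, java.util.SortedMap,
// org.apache.cassandra.db.IColumn, org.apache.cassandra.utils.ByteBufferUtil,
// org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Mapper
public static class LogTypeCounterByDateMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    @Override
    public void map(ByteBuffer rowKey, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        // Each entry is one timeUUID column whose value is a CSV log line.
        for (IColumn column : columns.values()) {
            String logLine = ByteBufferUtil.string(column.value());
            // Placeholder parsing: assume the log type is the first CSV field.
            String logType = logLine.split(",", 2)[0];
            context.write(new Text(logType), ONE);
        }
    }
}
Keep in mind that the slice range above only asks for the first 1000 columns of each row; if some rows are much wider than that, it is worth looking at the wide-row support in ColumnFamilyInputFormat (if your Cassandra version has it) rather than paging manually.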