Using Java 8, I want to create a new collection from a list and accumulate a sum along the way.
The source list consists of objects that look something like this:
class Event {
String description;
double sum;
}
With an example list like this:
{ { "desc1", 10.0 }, {"desc2", 14.0 }, {"desc3", 5.0 } }
The resulting list should look like this:
desc1, 10.0, 10.0
desc2, 14.0, 24.0
desc3, 5.0, 29.0
I know how to sum up to get a final sum, in this case 29.0, but I want to create the result list and at the same time accumulate the sum along the way.
How can I do this with Java 8?
You could do this by implementing your own collector to perform the mapping and summing together. Your streaming code would look like this:
List<SummedEvent> summedEvents = events.stream()
        .collect(EventConsumer::new, EventConsumer::accept, EventConsumer::combine)
        .summedEvents();
summedEvents.forEach((se) -> System.out.println(String.format("%s, %.2f, %.2f", se.description, se.sum, se.runningTotal)));
For this I've assumed a new class SummedEvent which also holds the running total. Your collector class would then be implemented something like this:
class EventConsumer {
private List<SummedEvent> summedEvents = new ArrayList<>();
private double runningTotal = 0;
public void accept(Event event) {
runningTotal += event.sum;
summedEvents.add(new SummedEvent(event.description, event.sum, runningTotal));
}
public void combine(EventConsumer other) {
this.summedEvents.addAll(other.summedEvents);
this.runningTotal += other.runningTotal;
}
public List<SummedEvent> summedEvents() {
return summedEvents;
}
}
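For completeness, a minimal SummedEvent might look like this (just a sketch; the field names are taken from the format string and constructor call above):
class SummedEvent {
    final String description;
    final double sum;
    final double runningTotal;
    // carries the event data plus the accumulated total at the point the event was processed
    SummedEvent(String description, double sum, double runningTotal) {
        this.description = description;
        this.sum = sum;
        this.runningTotal = runningTotal;
    }
}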
If you run your pipeline sequentially, you can use this little hack with peek.
double[] acc = {0};
List<CustomEvent> list = originalList.stream()
.peek(e -> acc[0] += e.sum)
.map(e -> new CustomEvent(e, acc[0]))
.collect(toList());
Be aware that you'll get wrong results if the stream is run in parallel.
However, I'm not sure whether the pipeline can be run in parallel in one pass, but assuming the underlying list has fast access to the element at index i, you can do it like this:
double[] acc = originalList.stream().mapToDouble(e -> e.sum).toArray();
Arrays.parallelPrefix(acc, Double::sum);
List<CustomEvent> lx = IntStream.range(0, originalList.size())
.parallel()
.mapToObj(i -> new CustomEvent(originalList.get(i), acc[i]))
.collect(toList());
parallelPrefix will apply the reduction you are looking for to the sums. Then you just have to stream the indices and map each event to its corresponding accumulated sum.
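To illustrate what Arrays.parallelPrefix does here, a minimal standalone example using the sums from the question:
double[] sums = {10.0, 14.0, 5.0};
java.util.Arrays.parallelPrefix(sums, Double::sum);
// sums is now {10.0, 24.0, 29.0}: each slot holds the running total up to that index
System.out.println(java.util.Arrays.toString(sums));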
I have a List of items to be inserted into DynamoDB. The size of the list may vary from 100 to 10k. I'm looking for an optimised way to batch write all the items using BatchWriteItemEnhancedRequest (Java SDK 2). What is the best way to add the items to the WriteBatch builder and then write the request using BatchWriteItemEnhancedRequest?
My Current Code:
WriteBatch.Builder<T> builder = WriteBatch.builder(itemClass).mappedTableResource(getTable()); // itemClass: the Class<T> of the items
items.forEach(item -> { builder.addPutItem(item); });
BatchWriteItemEnhancedRequest bwr = BatchWriteItemEnhancedRequest.builder().writeBatches(builder.build()).build();
BatchWriteResult batchWriteResult =
DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
do {
// Check for unprocessed keys which could happen if you exceed
// provisioned throughput
List<T> unprocessedItems = batchWriteResult.unprocessedPutItemsForTable(getTable());
if (unprocessedItems.size() != 0) {
unprocessedItems.forEach(unprocessedItem -> {
builder.addPutItem(unprocessedItem);
});
batchWriteResult = DynamoDB.enhancedClient().batchWriteItem(getBatchWriteItemEnhancedRequest(builder));
}
} while (batchWriteResult.unprocessedPutItemsForTable(getTable()).size() > 0);
I'm looking for batching logic and a better way to execute the BatchWriteItemEnhancedRequest.
I came up with a utility class to deal with that. Their batches of batches approach in v2 is overly complex for most use cases, especially when we're still limited to 25 items overall.
public class DynamoDbUtil {
private static final int MAX_DYNAMODB_BATCH_SIZE = 25; // AWS blows chunks if you try to include more than 25 items in a batch or sub-batch
/**
* Writes the list of items to the specified DynamoDB table.
*/
public static <T> void batchWrite(Class<T> itemType, List<T> items, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
List<List<T>> chunksOfItems = Lists.partition(items, MAX_DYNAMODB_BATCH_SIZE); // Lists.partition comes from Guava
chunksOfItems.forEach(chunkOfItems -> {
List<T> unprocessedItems = batchWriteImpl(itemType, chunkOfItems, client, table);
while (!unprocessedItems.isEmpty()) {
// some failed (provisioning problems, etc.), so write those again
unprocessedItems = batchWriteImpl(itemType, unprocessedItems, client, table);
}
});
}
/**
* Writes a single batch of (at most) 25 items to DynamoDB.
* Note that the overall limit of items in a batch is 25, so you can't have nested batches
* of 25 each that would exceed that overall limit.
*
* @return those items that couldn't be written due to provisioning issues, etc., but were otherwise valid
*/
private static <T> List<T> batchWriteImpl(Class<T> itemType, List<T> chunkOfItems, DynamoDbEnhancedClient client, DynamoDbTable<T> table) {
WriteBatch.Builder<T> subBatchBuilder = WriteBatch.builder(itemType).mappedTableResource(table);
chunkOfItems.forEach(subBatchBuilder::addPutItem);
BatchWriteItemEnhancedRequest.Builder overallBatchBuilder = BatchWriteItemEnhancedRequest.builder();
overallBatchBuilder.addWriteBatch(subBatchBuilder.build());
return client.batchWriteItem(overallBatchBuilder.build()).unprocessedPutItemsForTable(table);
}
}
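Usage would then look roughly like this (a sketch; Order, the "orders" table name, and loadOrders() are placeholders for your own item type and data source):
DynamoDbEnhancedClient enhancedClient = DynamoDbEnhancedClient.create();
DynamoDbTable<Order> table = enhancedClient.table("orders", TableSchema.fromBean(Order.class));
List<Order> orders = loadOrders(); // however you obtain your items
DynamoDbUtil.batchWrite(Order.class, orders, enhancedClient, table);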
I want to build a JSON from two lists. I need to use the corresponding elements from both lists to create a single JSON object.
My problem could be solved with an ordinary loop like this:
List<Class1> items = baseManager.findObjectsByNamedQuery(Class1.class, "Class1.findAll", new Object[]{});
for(int i=0 ; i<items.size();i++){
List<Class2> items2 = baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy" ,new Object[] {items.get(i).getCreatedBy()});
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
JsonObjectBuilder jpb = Json.createObjectBuilder()
.add("createdBy",items.get(i).getCreatedBy())
.add("phone",items2.get(0).getPhone())
groupsBuilder.add(jpb);
}
Is it possible to solve it using Java 8 Stream API?
There are still some things unclear, like why you are insisting on creating that SimpleDateFormat instance that you are not using anywhere, or whether there is any significance in calling getCreatedBy() multiple times. Assuming that it is not necessary, the following code is equivalent:
baseManager.findObjectsByNamedQuery(Class1.class, "Class1.findAll", new Object[]{})
.stream()
.map(item -> item.getCreatedBy())
.map(createdBy -> Json.createObjectBuilder()
.add("createdBy", createdBy)
.add("phone", baseManager.findObjectsByNamedQuery(
Class2.class, "Class2.findByCreatedBy", new Object[] {createdBy})
.get(0).getPhone())
)
.forEach(jpb -> groupsBuilder.add(jpb));
It’s still unclear to me whether (or why) findObjectsByNamedQuery is not a varargs method. It would be quite natural to be a varargs method, not requiring these explicit new Object[] { … } allocations.
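For illustration, a varargs declaration might look like this (a sketch; the real method lives in your baseManager, so the interface shown here is only a placeholder):
import java.util.List;

// hypothetical declaration with varargs instead of an explicit Object[] parameter
interface BaseManager {
    <T> List<T> findObjectsByNamedQuery(Class<T> type, String queryName, Object... params);
}

// call sites then need no explicit array allocation:
// baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy", createdBy);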
With the pure Java 8 Stream API:
public void convertItemsToJSon(List<Item> items) {
...
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
Map<Item, List<Class2>> sqlItems = items
.stream()
.collect(Collectors.toMap(Function.identity(), (item) -> baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy", new Object[]{item.getCreatedBy()})));
sqlItems.entrySet()
.stream()
.map(sqlItem -> buildJson(sqlItem.getKey(), sqlItem.getValue()))
.forEach(groupsBuilder::add);
...
}
private JsonObjectBuilder buildJson(Item item, List<Class2> class2Items) {
return Json.createObjectBuilder().add("createdBy", item.getCreatedBy());
}
With the StreamEx library:
public void convertItemsToJSonStreamEx(List<Item> items) {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
...
StreamEx.of(items)
.cross(item -> baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy", new Object[]{item.getCreatedBy()}).stream())
.mapKeys(item -> Json.createObjectBuilder().add("createdBy", item.getCreatedBy()))
.mapKeyValue(this::addField)
.forEach(groupsBuilder::add);
...
}
private JsonObjectBuilder addField(JsonObjectBuilder json, Class2 class2) {
// Your logic for converting class2 into a field of the JSON
return json;
}
Thanks for your help and solutions. The most helpful was the first response from Vlad Bochenin. The code is here:
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
items.stream().map(item -> {
List<Class2> items2 = baseManager.findObjectsByNamedQuery(Class2.class, "Class2.findByCreatedBy", new Object[]{item.getCreatedBy()});
JsonObjectBuilder jpb = Json.createObjectBuilder()
.add("createdBy", item.getCreatedBy())
.add("phone", items2.get(0).getPhone())
return jpb;
}).forEach(groupsBuilder::add);
I'm developing a network model in OMNeT++ in which I have introduced a custom channel type to represent links in my network. For one property of this channel type's instances, I'd like to assign a random parameter. However, the random number should be the same for connected gates.
My node definition has the following gates definition:
simple GridAgent
{
/* ... other parameters/definitions omitted ... */
gates:
inout agentConnections[];
}
In my network configuration, I connect nodes using the simple <--> syntax, e.g.:
someSwitchyard.agentConnections++ <--> AgentConnectionChannel <--> someWindfarm.agentConnections++;
Now, this AgentConnectionChannel has a property called impedance, which I'd like to randomly assign. This impedance property should be the same for both A -> B and B -> A. I have tried to add { impedance = default(uniform(1, 10)) } to the network definition, as well as putting **.agentConnections$o[*].channel.impedance = uniform(1, 10) into omnetpp.ini. In both cases, however, A -> B has a different value assigned than B -> A.
As indicated on the OMNeT++ mailing list, this happens because the <--> syntax is actually a shorthand for creating two distinct connections, hence two draws from the random number distribution happen.
How can I assign a random parameter to a connection's property and have the same value for both directions of two connected gates? Is there a way to do this in the omnetpp.ini file, or do I need to create a script in, e.g., Perl, Ruby, or Python to generate the omnetpp.ini for my runs?
There is no simple solution to your problem, and it cannot be resolved merely by manipulating the omnetpp.ini file.
I propose manually rewriting the parameter value for the second direction. This requires preparing a C++ class for the channel (which you have probably already done).
Assuming that your channel definition in NED is the following:
channel AgentConnectionChannel extends ned.DatarateChannel {
@class(AgentConnectionChannel);
double impedance;
}
and in omnetpp.ini you have:
**.agentConnections$o[*].channel.impedance = uniform(1, 10)
you should prepare the C++ class AgentConnectionChannel:
class AgentConnectionChannel: public cDatarateChannel {
public:
AgentConnectionChannel() : parAlreadyRewritten(false) {}
void setParAlreadyRewritten() {parAlreadyRewritten=true;}
protected:
virtual void initialize();
private:
bool parAlreadyRewritten;
private:
double impedance;
};
Define_Channel(AgentConnectionChannel);
void AgentConnectionChannel::initialize() {
if (parAlreadyRewritten == false) {
parAlreadyRewritten = true;
cGate * srcOut = this->getSourceGate();
cModule *owner = srcOut->getOwnerModule();
int index = srcOut->isVector() ? srcOut->getIndex() : -1;
cGate *srcIn = owner->gateHalf(srcOut->getBaseName(), cGate::INPUT,
index);
cChannel * channel = srcIn->findIncomingTransmissionChannel();
AgentConnectionChannel * reverseChan =
dynamic_cast<AgentConnectionChannel*>(channel);
if (reverseChan) {
reverseChan->setParAlreadyRewritten();
// assigning a value from forward direction channel
reverseChan->par("impedance") = this->par("impedance");
}
}
// and now read a parameter as usual
impedance = par("impedance").doubleValue();
EV << getFullPath() << ", impedance=" << impedance << endl;
}
I'm making an application with Spark that will run some topic extraction algorithms. For that, I first need to do some preprocessing, extracting the document-term matrix at the end. I've been able to do that, but for a (not that) big collection of documents (only 2 thousand, 5 MB), this process is taking forever.
So, debugging, I've found where the program kind of gets stuck, and it's in a reduce operation. What I'm doing in this part of the code is counting how many times each term occurs in the collection, so first I do a "map", counting it for each RDD, and then I "reduce" it, saving the result inside a HashMap. The map operation is very fast, but in the reduce, it splits the operation into 40 blocks, and each block takes 5~10 minutes to process.
So I'm trying to figure out what I'm doing wrong, or whether reduce operations are really that costly.
SparkConf: standalone mode, using local[2]. I've also tried "spark://master:7077", and it worked, but with the same slowness.
Code:
"filesIn" is a JavaPairRDD where the key is the file path and the value is the content of the file.
So, first the map, where I take this "filesIn", split the words, and count their frequency (in this case it doesn't matter which document a word is in).
And then the reduce, where I create a HashMap (term, freq).
JavaRDD<HashMap<String, Integer>> termDF_ = filesIn.map(new Function<Tuple2<String, String>, HashMap<String, Integer>>() {
@Override
public HashMap<String, Integer> call(Tuple2<String, String> t) throws Exception {
String[] allWords = t._2.split(" ");
HashMap<String, Double> hashTermFreq = new HashMap<String, Double>();
ArrayList<String> words = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
HashMap<String, Integer> termDF = new HashMap<String, Integer>();
for (String term : allWords) {
if (hashTermFreq.containsKey(term)) {
Double freq = hashTermFreq.get(term);
hashTermFreq.put(term, freq + 1);
} else {
if (term.length() > 1) {
hashTermFreq.put(term, 1.0);
if (!terms.contains(term)) {
terms.add(term);
}
if (!words.contains(term)) {
words.add(term);
if (termDF.containsKey(term)) {
int value = termDF.get(term);
value++;
termDF.put(term, value);
} else {
termDF.put(term, 1);
}
}
}
}
}
return termDF;
}
});
HashMap<String, Integer> termDF = termDF_.reduce(new Function2<HashMap<String, Integer>, HashMap<String, Integer>, HashMap<String, Integer>>() {
@Override
public HashMap<String, Integer> call(HashMap<String, Integer> t1, HashMap<String, Integer> t2) throws Exception {
HashMap<String, Integer> result = new HashMap<String, Integer>();
Iterator iterator = t1.keySet().iterator();
while (iterator.hasNext()) {
String key = (String) iterator.next();
if (result.containsKey(key) == false) {
result.put(key, t1.get(key));
} else {
result.put(key, result.get(key) + 1);
}
}
iterator = t2.keySet().iterator();
while (iterator.hasNext()) {
String key = (String) iterator.next();
if (result.containsKey(key) == false) {
result.put(key, t2.get(key));
} else {
result.put(key, result.get(key) + 1);
}
}
return result;
}
});
Thanks!
OK, so just off the top of my head:
Spark transformations are lazy. It means that map is not executed until you call the subsequent reduce action, so what you describe as a slow reduce is most likely a slow map + reduce.
ArrayList.contains is O(N), so all these words.contains and terms.contains calls are extremely inefficient.
map logic smells fishy. In particular:
if a term has already been seen, you never get into the else branch
at first glance, words and terms should have exactly the same content and should be equivalent to the hashTermFreq keys or termDF keys.
it looks like values in termDF can only take the value 1. If this is what you want and you ignore frequencies, what is the point of creating hashTermFreq?
the reduce phase as implemented here means an inefficient linear scan over the data with a growing object, while what you really want is reduceByKey.
Using Scala as pseudocode, your whole code can be efficiently expressed as follows:
val termDF = filesIn.flatMap{
case (_, text) =>
text.split(" ") // Split
.toSet // Take unique terms
.filter(_.size > 1) // Remove single characters
.map(term => (term, 1))} // map to pairs
.reduceByKey(_ + _) // Reduce by key
termDF.collectAsMap // Optionally
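For reference, a rough Java equivalent of that pipeline might look like the sketch below (assuming Spark 2.x, where flatMapToPair expects an Iterator; on 1.x it expects an Iterable):
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// filesIn is the same JavaPairRDD<String, String> (path -> content) as in the question
JavaPairRDD<String, Integer> termDF = filesIn
        .flatMapToPair(t -> {
            // unique terms per document, dropping single characters
            Set<String> terms = Arrays.stream(t._2.split(" "))
                    .filter(term -> term.length() > 1)
                    .collect(Collectors.toSet());
            return terms.stream()
                    .map(term -> new Tuple2<>(term, 1))
                    .iterator();
        })
        .reduceByKey(Integer::sum);
Map<String, Integer> documentFrequencies = termDF.collectAsMap();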
Finally, it looks like you're reinventing the wheel. At least some of the tools you need are already implemented in mllib.feature or ml.feature.
A simple version of my document has the following structure:
doc:
{
"date": "2014-04-16T17:13:00",
"key": "de5cefc56ff51c33351459b88d42ca9f828445c0",
}
I would like to group my documents by key, to get the latest date and the number of documents for each key, something like:
{ "Last": "2014-04-16T16:00:00", "Count": 10 }
My idea is to do a map/reduce view and query it with group set to true.
This is what I have tried so far. I get the exact count, but not the correct dates.
map
function (doc, meta) {
if(doc.type =="doc")
emit(doc.key, doc.date);
}
reduce
function(key, values, rereduce) {
var result = {
Last: 0,
Count: 0
};
if (rereduce) {
for (var i = 0; i < values.length; i++) {
result.Count += values[i].Count;
result.Last = values[i].Last;
}
} else {
result.Count = values.length;
result.Last = values[0]
}
return result;
}
You're not comparing dates... Couchbase sorts values by key. In your situation it will not sort them by date, so you should do it manually in your reduce function. It will probably look like:
result.Last = values[i].Last > result.Last ? values[i].Last : result.Last;
and in the reduce function the value can also be an array, so I don't think that your reduce function will always be correct.
Here is an example of my reduce function that filters documents and leaves just the one with the newest date. Maybe it will be helpful, or you can try to use it (it seems close to the reduce function that you want; you just need to add a count somewhere).
function(k,v,r){
if (r){
if (v.length > 1){
var m = v[0].Date;
var mid = 0;
for (var i=1;i<v.length;i++){
if (v[i].Date > m){
m = v[i].Date;
mid = i;
}
}
return v[mid];
}
else {
return v[0] || v;
}
}
if (v.length > 1){
var m = v[0].Date;
var mid = 0;
for (var i=1;i<v.length;i++){
if (v[i].Date > m){
m = v[i].Date;
mid = i;
}
}
return v[mid];
}
else {
return v[0] || v;
}
}
UPD: Here is an example of what that reduce does:
The input data (values) for that function will look like this (I've used just numbers instead of text dates to make it shorter):
[{Date:1},{Date:3},{Date:8},{Date:2},{Date:4},{Date:7},{Date:5}]
In the first step, rereduce will be false, so we need to find the biggest date in the array, and it will return
Object {Date: 8}.
Note that this function may be called just once, but it can also be called on several servers in a cluster or on several branches of the b-tree inside one Couchbase instance.
Then in the next step (if there were several machines in the cluster, or several "branches"), reduce will be called again with the rereduce var set to true.
Incoming data will be:
[{Date:8},{Date:10},{Date:3}], where {Date:8} came from the reduce on one server (or branch), and the other dates came from another server (or branch).
So we need to do exactly the same on those new values to find the biggest one.
Answering your question from the comments: I don't remember why I used the same code for reduce and rereduce, because it was a long time ago (when Couchbase 2.0 was in dev preview). Maybe Couchbase had some bugs, or I was just trying to understand how rereduce works. But I remember that without that if (r) {..} it did not work at that time.
You can try to place a return v; in different parts of my or your reduce function to see what it returns in each reduce phase. It's better to try it once yourself to understand what actually happens there.
I forgot to mention that I have many documents for the same key. In fact, for each key I can have many documents (message here):
{
"date": "2014-04-16T17:13:00",
"key": "de5cefc56ff51c33351459b88d42ca9f828445c0",
"message": "message1",
}
{
"date": "2014-04-16T15:22:00",
"key": "de5cefc56ff51c33351459b88d42ca9f828445c0",
"message": "message2",
}
Another way to deal with the problem is to do it in the map function:
function (doc, meta) {
var count = 0;
var last = '';
if(doc.type =="doc"){
for (k in doc.message){
count += 1;
last = doc.date > last ? doc.date : last;
}
emit(doc.key,{'Count':count,'Last': last});
}
}
I found this simpler and it does the job in my case.