Apache Flink's Iterative Stream won't loop

I'm implementing the connected components algorithm using Flink's DataStream API, since there is not yet an implementation of it using this API.
For this algorithm, I'm separating the data into tumbling windows, so for each window I try to run the algorithm independently.
My problem comes from the iterative nature of the algorithm. I implemented the data pipeline I wanted for the iterations (the step data pipeline), which consists of FlatMaps, one Join, one ProcessWindow and one Filter. However, the stream I wanted to feed back does not seem to actually be fed back to the beginning of the loop, because the algorithm never iterates. I suspect this is not possible when the original iteration DataStream has been joined with another stream (even though the latter originated from a flatMap on the former).
The code that I'm using is as follows:
// neigborsList = DataStream of <Vertex, [List of neighbors], label>
IterativeStream<Tuple3<Integer, ArrayList<Integer>, Integer>> beginning_loop =
        neigborsList.iterate(maxTimeout);
// Emits <Vertex, Label> tuples for every vertex and its neighbors
DataStream<Tuple2<Integer, Integer>> labels = beginning_loop
        // DataStream of <Vertex, label> for every neigborsList.f0 and element in neigborsList.f1
        .flatMap(new EmitVertexLabel())
        .keyBy(0)
        .window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .minBy(1);
DataStream<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> updatedVertex = beginning_loop
        // Update vertex label with the results from the labels reduction
        .join(labels)
        .where("vertex")
        .equalTo("vertex")
        .window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .apply(new JoinFunction<Tuple3<Integer, ArrayList<Integer>, Integer>, Tuple2<Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>>() {
            @Override
            public Tuple4<Integer, ArrayList<Integer>, Integer, Integer> join(
                    Tuple3<Integer, ArrayList<Integer>, Integer> arg0, Tuple2<Integer, Integer> arg1)
                    throws Exception {
                int hasConverged = 1;
                if (arg1.f1.intValue() < arg0.f2.intValue()) {
                    arg0.f2 = arg1.f1;
                    hasConverged = 0;
                }
                return new Tuple4<>(arg0.f0, arg0.f1, arg0.f2, Integer.valueOf(hasConverged));
            }
        })
        // Disseminates the convergence flag if a change was made in the window
        .windowAll(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
        .process(new ProcessAllWindowFunction<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, Tuple4<Integer, ArrayList<Integer>, Integer, Integer>, TimeWindow>() {
            @Override
            public void process(
                    Context ctx,
                    Iterable<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> values,
                    Collector<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> out) throws Exception {
                Iterator<Tuple4<Integer, ArrayList<Integer>, Integer, Integer>> iterator = values.iterator();
                Tuple4<Integer, ArrayList<Integer>, Integer, Integer> element;
                int hasConverged = 1;
                while (iterator.hasNext()) {
                    element = iterator.next();
                    if (element.f3.intValue() > 0) {
                        hasConverged = 0;
                        break;
                    }
                }
                // Re-iterate and emit the values with the window-wide flag
                iterator = values.iterator();
                Integer converged = Integer.valueOf(hasConverged);
                while (iterator.hasNext()) {
                    element = iterator.next();
                    element.f3 = converged;
                    out.collect(element);
                }
            }
        });
DataStream<Tuple3<Integer, ArrayList<Integer>, Integer>> feed_back = updatedVertex
        .filter(new NotConvergedFilter())
        // Removes the convergence flag:
        // transforms the Tuple4s into Tuple3s so that they become compatible with beginning_loop
        .map(new RemoveConvergeceFlag());
beginning_loop.closeWith(feed_back);
// Selects the windows that have already converged
DataStream<?> convergedWindows = updatedVertex
        .filter(new ConvergedFilter());
convergedWindows.print()
        .setParallelism(1)
        .name("Sink to stdout");
At the end of the execution, convergedWindows does not receive any tuples (because the algorithm cannot converge with only one iteration).
If I print beginning_loop, I see the initial tuples and the tuples from feed_back resulting from the first iteration, but nothing beyond that.
So, to summarize my question: could this be a limitation of Flink? If so, do you know a different way of updating the vertex labels after the initial reduction, one that is not based on joins?
PS. I'm using Flink 1.3.3
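For reference, the basic iterate/closeWith pattern does loop when the feedback stream is derived directly from the iteration head, without windows or joins in between. Below is a minimal sketch (essentially the decrement-until-zero example from the Flink documentation, with illustrative identifiers only); it is just a baseline to compare against, not a solution for the windowed join above:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> someIntegers = env.generateSequence(0, 1000);
IterativeStream<Long> iteration = someIntegers.iterate();
// Step function: decrement every element
DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
    @Override
    public Long map(Long value) throws Exception {
        return value - 1;
    }
});
// Elements that are still positive are fed back into the loop...
DataStream<Long> stillGreaterThanZero = minusOne.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long value) throws Exception {
        return value > 0;
    }
});
iteration.closeWith(stillGreaterThanZero);
// ...while the rest leave the iteration and continue downstream
DataStream<Long> lessThanZero = minusOne.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long value) throws Exception {
        return value <= 0;
    }
});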

Related

Java 8 Lambda compare two Lists and transform to Map

Suppose I have two classes:
class Key {
    private Integer id;
    private String key;
}
class Value {
    private Integer id;
    private Integer key_id;
    private String value;
}
Now I fill the first list as follows:
List<Key> keys = new ArrayList<>();
keys.add(new Key(1, "Name"));
keys.add(new Key(2, "Surname"));
keys.add(new Key(3, "Address"));
And the second one:
List<Value> values = new ArrayList<>();
values.add(new Value(1, 1, "Mark"));
values.add(new Value(2, 3, "Fifth Avenue"));
values.add(new Value(3, 2, "Fischer"));
Can you please tell me how I can rewrite the following code:
for (Key k : keys) {
    for (Value v : values) {
        if (k.getId().equals(v.getKey_Id())) {
            map.put(k.getKey(), v.getValue());
            break;
        }
    }
}
Using Lambdas?
Thank you!
-------UPDATE-------
Yes, sure, it works; I forgot to say "using Lambdas" in the first post (now I have added it). I would like to rewrite the two nested for loops with lambdas.
Here is how you would do it using streams:
stream the key list
stream an index for indexing the value list
filter matching ids
package the key instance's key and the value instance's value into a SimpleEntry
then collect that into a map
Map<String, String> results = keys.stream()
        .flatMap(k -> IntStream.range(0, values.size())
                // compare the Integer ids with equals, not ==
                .filter(i -> k.getId().equals(values.get(i).getKey_id()))
                .mapToObj(i -> new AbstractMap.SimpleEntry<>(
                        k.getKey(), values.get(i).getValue())))
        .collect(Collectors.toMap(Entry::getKey, Entry::getValue));
results.entrySet().forEach(System.out::println);
prints
Address=Fifth Avenue
Surname=Fischer
Name=Mark
IMO, your way is much clearer and easier to understand. Streams with lambdas or method references are not always the best approach.
A hybrid approach might also be considered:
allocate a map
iterate over the keys
stream the values, trying to find a match on key_id, and return the first one found
if a value was found (isPresent), add it to the map
Map<String, String> map = new HashMap<>();
for (Key k : keys) {
    Optional<Value> opt = values.stream()
            // compare the Integer ids with equals, not ==
            .filter(v -> k.getId().equals(v.getKey_id()))
            .findFirst();
    if (opt.isPresent()) {
        map.put(k.getKey(), opt.get().getValue());
    }
}
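If the value list is large, a variation on the same idea is to index the values by key_id once and then stream the keys. This is only a sketch, assuming the getters used above and unique key_id values (Collectors.toMap throws on duplicate keys):
// Build a lookup map from key_id to value once, then resolve each key with an O(1) lookup
Map<Integer, String> valueByKeyId = values.stream()
        .collect(Collectors.toMap(Value::getKey_id, Value::getValue));
Map<String, String> map = keys.stream()
        .filter(k -> valueByKeyId.containsKey(k.getId()))
        .collect(Collectors.toMap(Key::getKey, k -> valueByKeyId.get(k.getId())));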

Read a list of parameters from a LuaRef using LuaBridge

[RESOLVED]
I'm building a game engine that uses LuaBridge in order to read components for entities. In my engine, an entity file looks like this, where "Components" is a list of the components that my entity has and the rest of parameters are used to setup the values for each individual component:
-- myEntity.lua
Components = {"MeshRenderer", "Transform", "Rigidbody"}
MeshRenderer = {
    Type = "Sphere",
    Position = {0, 300, 0}
}
Transform = {
    Position = {0, 150, 0},
    Scale = {1, 1, 1},
    Rotation = {0, 0, 0}
}
Rigidbody = {
    Type = "Sphere",
    Mass = 1
}
I'm currently using this function (in C++) in order to read the value from a parameter (given its name) inside a LuaRef.
template<class T>
T readParameter(LuaRef& table, const std::string& parameterName)
{
    try {
        return table.rawget(parameterName).cast<T>();
    }
    catch (std::exception e) {
        // std::cout ...
        return NULL;
    }
}
For example, when calling readParameter<std::string>(myRigidbodyTable, "Type"), with myRigidbodyTable being a LuaRef with the values of Rigidbody, this function should return an std::string with the value "Sphere".
My problem is that after I finish reading and storing the values of my Transform component, when I then want to read the values for "Rigidbody" and my engine reads the value "Type", an unhandled exception is thrown at Stack::push(lua_State* L, const std::string& str, std::error_code&).
I am pretty sure that this has to do with the fact that my component Transform stores a list of values for parameters like "Position", because I've had no problems while reading components that only had a single value for each parameter. What's the right way to do this, in case I am doing something wrong?
I'd also like to point out that I am new to LuaBridge, so this might be a beginner problem with a solution that I've been unable to find. Any help is appreciated :)
Found the problem, I wasn't reading the table properly. Instead of
LuaRef myTable = getGlobal(state, tableName.c_str());
I was using the following
LuaRef myTable = getGlobal(state, tableName.c_str()).getMetatable();

Spark - Reduce operation taking too long

I'm making an application with Spark that will run some topic extraction algorithms. For that, I first need to do some preprocessing, extracting the document-term matrix at the end. I was able to do that, but for a (not that) big collection of documents (only 2 thousand, 5 MB), this process is taking forever.
So, while debugging, I found where the program kind of gets stuck, and it's in a reduce operation. What I'm doing in this part of the code is counting how many times each term occurs in the collection, so first I do a "map", counting it for each RDD, and then I "reduce" it, saving the result inside a HashMap. The map operation is very fast, but in the reduce it splits the operation into 40 blocks, and each block takes 5~10 minutes to process.
So I'm trying to figure out what I'm doing wrong, or whether reduce operations really are that costly.
SparkConf: Standalone mode, using local[2]. I've tried to use it as "spark://master:7077", and it worked, but still the same slowness.
Code:
"filesIn" is a JavaPairRDD where the key is the file path and the value is the content of the file.
So, first the map, where I take this "filesIn", split the words, and count their frequency (in this case it doesn't matter which document they are in).
And then the reduce, where I create a HashMap (term, freq).
JavaRDD<HashMap<String, Integer>> termDF_ = filesIn.map(new Function<Tuple2<String, String>, HashMap<String, Integer>>() {
    @Override
    public HashMap<String, Integer> call(Tuple2<String, String> t) throws Exception {
        String[] allWords = t._2.split(" ");
        HashMap<String, Double> hashTermFreq = new HashMap<String, Double>();
        ArrayList<String> words = new ArrayList<String>();
        ArrayList<String> terms = new ArrayList<String>();
        HashMap<String, Integer> termDF = new HashMap<String, Integer>();
        for (String term : allWords) {
            if (hashTermFreq.containsKey(term)) {
                Double freq = hashTermFreq.get(term);
                hashTermFreq.put(term, freq + 1);
            } else {
                if (term.length() > 1) {
                    hashTermFreq.put(term, 1.0);
                    if (!terms.contains(term)) {
                        terms.add(term);
                    }
                    if (!words.contains(term)) {
                        words.add(term);
                        if (termDF.containsKey(term)) {
                            int value = termDF.get(term);
                            value++;
                            termDF.put(term, value);
                        } else {
                            termDF.put(term, 1);
                        }
                    }
                }
            }
        }
        return termDF;
    }
});
HashMap<String, Integer> termDF = termDF_.reduce(new Function2<HashMap<String, Integer>, HashMap<String, Integer>, HashMap<String, Integer>>() {
    @Override
    public HashMap<String, Integer> call(HashMap<String, Integer> t1, HashMap<String, Integer> t2) throws Exception {
        HashMap<String, Integer> result = new HashMap<String, Integer>();
        Iterator iterator = t1.keySet().iterator();
        while (iterator.hasNext()) {
            String key = (String) iterator.next();
            if (result.containsKey(key) == false) {
                result.put(key, t1.get(key));
            } else {
                result.put(key, result.get(key) + 1);
            }
        }
        iterator = t2.keySet().iterator();
        while (iterator.hasNext()) {
            String key = (String) iterator.next();
            if (result.containsKey(key) == false) {
                result.put(key, t2.get(key));
            } else {
                result.put(key, result.get(key) + 1);
            }
        }
        return result;
    }
});
Thanks!
OK, so just off the top of my head:
Spark transformations are lazy. This means that map is not executed until you call the subsequent reduce action, so what you describe as a slow reduce is most likely a slow map + reduce.
ArrayList.contains is O(N), so all these words.contains and terms.contains calls are extremely inefficient.
The map logic smells fishy. In particular:
if a term has already been seen, you never get into the else branch
at first glance, words and terms should have exactly the same content and should be equivalent to the hashTermFreq keys or termDF keys
it looks like values in termDF can only take the value 1; if this is what you want and you ignore frequencies, what is the point of creating hashTermFreq?
The reduce phase as implemented here means an inefficient linear scan with a growing object over the data, while what you really want is reduceByKey.
Using Scala as pseudocode, your whole code can be expressed efficiently as follows:
val termDF = filesIn.flatMap {
  case (_, text) =>
    text.split(" ")            // Split
      .toSet                   // Take unique terms
      .filter(_.size > 1)      // Remove single characters
      .map(term => (term, 1))  // Map to pairs
}.reduceByKey(_ + _)           // Reduce by key
termDF.collectAsMap            // Optionally
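If you prefer to stay in the Java API, a rough equivalent of that sketch might look like the following. It is untested and assumes Spark 1.x (where PairFlatMapFunction returns an Iterable; in 2.x it returns an Iterator) and the filesIn JavaPairRDD<String, String> from the question:
JavaPairRDD<String, Integer> termCounts = filesIn
        .flatMapToPair(new PairFlatMapFunction<Tuple2<String, String>, String, Integer>() {
            @Override
            public Iterable<Tuple2<String, Integer>> call(Tuple2<String, String> doc) throws Exception {
                // Unique terms per document, dropping single characters
                Set<String> uniqueTerms = new HashSet<String>(Arrays.asList(doc._2.split(" ")));
                List<Tuple2<String, Integer>> pairs = new ArrayList<Tuple2<String, Integer>>();
                for (String term : uniqueTerms) {
                    if (term.length() > 1) {
                        pairs.add(new Tuple2<String, Integer>(term, 1));
                    }
                }
                return pairs;
            }
        })
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer a, Integer b) throws Exception {
                return a + b;
            }
        });
// Document frequency per term, collected to the driver
Map<String, Integer> termDF = termCounts.collectAsMap();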
Finally, it looks like you're reinventing the wheel. At least some of the tools you need are already implemented in mllib.feature or ml.feature.

Java 8 - accumulate and create new collection

Using Java 8, I want to create a new collection from a list and accumulate a sum along the way.
The source list consists of objects that look something like this:
class Event {
    String description;
    double sum;
}
With an example list like this:
{ { "desc1", 10.0 }, {"desc2", 14.0 }, {"desc3", 5.0 } }
The resulting list should look like this
desc1, 10.0, 10.0
desc2, 14.0, 24.0
desc3, 5.0, 29.0
I know how to sum up to get a final sum, in this case 29.0, but I want to create the result list and at the same time accumulate the sum along the way.
How can I do this with Java8?
You could do this by implementing your own collector to perform the mapping and summing together. Your streaming code would look like this:
List<SummedEvent> summedEvents = events.stream()
        .collect(EventConsumer::new, EventConsumer::accept, EventConsumer::combine)
        .summedEvents();
summedEvents.forEach((se) ->
        System.out.println(String.format("%s, %2f, %2f", se.description, se.sum, se.runningTotal)));
For this I've assumed a new class SummedEvent which also holds the running total. Your collector class would then be implemented something like this:
class EventConsumer {
    private List<SummedEvent> summedEvents = new ArrayList<>();
    private double runningTotal = 0;

    public void accept(Event event) {
        runningTotal += event.sum;
        summedEvents.add(new SummedEvent(event.description, event.sum, runningTotal));
    }

    public void combine(EventConsumer other) {
        this.summedEvents.addAll(other.summedEvents);
        this.runningTotal += other.runningTotal;
    }

    public List<SummedEvent> summedEvents() {
        return summedEvents;
    }
}
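The SummedEvent class is only assumed here, not shown in the question; a minimal version (with the fields accessed directly, as in the print statement above) could look like this:
class SummedEvent {
    final String description;
    final double sum;
    final double runningTotal;

    SummedEvent(String description, double sum, double runningTotal) {
        this.description = description;
        this.sum = sum;
        this.runningTotal = runningTotal;
    }
}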
If you run your pipeline sequentially, you can use this little hack with peek.
double[] acc = {0};
List<CustomEvent> list = originalList.stream()
        .peek(e -> acc[0] += e.sum)
        .map(e -> new CustomEvent(e, acc[0]))
        .collect(toList());
Be aware that you'll get wrong results if the stream is run in parallel.
However, I'm not sure whether this pipeline can be run in parallel in one pass, but assuming the underlying list has fast access to the element at index i, you can do it like this:
double[] acc = originalList.stream().mapToDouble(e -> e.sum).toArray();
Arrays.parallelPrefix(acc, Double::sum);
List<CustomEvent> lx = IntStream.range(0, originalList.size())
        .parallel()
        .mapToObj(i -> new CustomEvent(originalList.get(i), acc[i]))
        .collect(toList());
parallelPrefix will apply the reduction you are looking for over the sums. Then you just have to stream the indices and map each event to its corresponding accumulated sum.
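To make the prefix step concrete with the sample sums from the question (10.0, 14.0, 5.0), Arrays.parallelPrefix turns the array of per-event sums into running totals in place:
double[] acc = {10.0, 14.0, 5.0};
Arrays.parallelPrefix(acc, Double::sum);
// acc is now {10.0, 24.0, 29.0}: each slot holds the running total up to its index
System.out.println(Arrays.toString(acc));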

Hadoop: Use only a part of the reduce Iterable

I have a situation in which I only want to use the first n values of the Iterable given to my reducer and then abort. I have been reading about the Iterable class and it seems like this may not be trivial.
I can't use a for loop or a next method. I can't use a foreach since it iterates over the whole object. Is there a straightforward solution, or am I approaching the problem wrong?
Thanks.
You can just extract the iterator from the iterable and use a good old for loop, or a while loop.
For example, the code below sums at most the first TOPN values.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    private static final int TOPN = 10;

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Stop once TOPN values have been consumed
        Iterator<IntWritable> iter = values.iterator();
        for (int i = 0; iter.hasNext() && i < TOPN; i++) {
            sum += iter.next().get();
        }
        result.set(sum);
        context.write(key, result);
    }
}