Spark - GraphX: mapReduceTriplets vs aggregateMessages

I am working through a tutorial:
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
At some point it uses the mapReduceTriplets operation, which returns the expected result:
// Find the oldest follower for each user
val oldestFollower: VertexRDD[(String, Int)] = userGraph.mapReduceTriplets[(String, Int)](
// For each edge send a message to the destination vertex with the attribute of the source vertex
edge => Iterator((edge.dstId, (edge.srcAttr.name, edge.srcAttr.age))),
// To combine messages take the message for the older follower
(a, b) => if (a._2 > b._2) a else b
)
But IntelliJ points out that mapReduceTriplets is deprecated (as of 1.2.0) and should be replaced by aggregateMessages, so I tried:
// Find the oldest follower for each user
val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages()[(String, Int)](
// For each edge send a message to the destination vertex with the attribute of the source vertex
edge => Iterator((edge.dstId, (edge.srcAttr.name, edge.srcAttr.age))),
// To combine messages take the message for the older follower
(a, b) => if (a._2 > b._2) a else b
)
So I ran the same code, but now I don't get any output. Is that the expected result, or should I change something else due to the change to aggregateMessages?

Probably you need something like this:
val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages[(String, Int)](
  // For each edge send a message to the destination vertex with the attribute of the source vertex
  sendMsg = triplet => triplet.sendToDst((triplet.srcAttr.name, triplet.srcAttr.age)),
  // To combine messages take the message for the older follower
  mergeMsg = (a, b) => if (a._2 > b._2) a else b
)
You can find the aggregateMessages function signature and useful examples on the GraphX programming guide page. Hope this helps.
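For context, here is a minimal end-to-end sketch (my own, not from the tutorial) that builds a small follower graph and runs aggregateMessages; it assumes a SparkContext sc as in the Spark shell, and the User attribute class is invented for the example:

import org.apache.spark.graphx.{Edge, Graph, VertexRDD}
import org.apache.spark.rdd.RDD

case class User(name: String, age: Int)

// bob and charlie follow alice
val vertices: RDD[(Long, User)] = sc.parallelize(Seq(
  (1L, User("alice", 28)), (2L, User("bob", 27)), (3L, User("charlie", 65))))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1)))
val userGraph = Graph(vertices, edges)

val oldestFollower: VertexRDD[(String, Int)] = userGraph.aggregateMessages[(String, Int)](
  triplet => triplet.sendToDst((triplet.srcAttr.name, triplet.srcAttr.age)),
  (a, b) => if (a._2 > b._2) a else b
)
oldestFollower.collect.foreach(println) // (1,(charlie,65))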

Related

Akka Streams: Throttling a File Source

I've got a file containing several thousand lines like this:
Mr|David|Smith|david.smith@gmail.com
Mrs|Teri|Smith|teri.smith@gmail.com
...
I want to read the file, emitting each line downstream, but in a throttled manner, i.e. one line per second.
I cannot quite figure out how to get the throttling working in the flow.
flow1 (below) outputs the first line after 1 sec and then terminates.
flow2 (below) waits 1 sec then outputs the whole file.
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)

val flow1 = Flow[ByteString]
  .via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
  .throttle(1, 1.second, 1, ThrottleMode.shaping)
  .map(bs => bs.utf8String)

val flow2 = Flow[ByteString]
  .throttle(1, 1.second, 1, ThrottleMode.shaping)
  .via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
  .map(bs => bs.utf8String)

val sink = Sink.foreach(println)

val res = source.via(flow2).to(sink).run().onComplete(_ => system.terminate())
I couldn't glean any solution from studying the docs.
Would greatly appreciate any pointers.
Use runWith, instead of to, with flow1:
val source: Source[ByteString, Future[IOResult]] = FileIO.fromPath(file)
val flow1 =
Flow[ByteString]
.via(Framing.delimiter(ByteString(System.lineSeparator), 10000))
.throttle(1, 1.second, 1, ThrottleMode.shaping)
.map(bs => bs.utf8String)
val sink = Sink.foreach(println)
source.via(flow1).runWith(sink).onComplete(_ => system.terminate())
to returns the materialized value of the Source (i.e., the source.via(flow1)), so you're terminating the actor system when the "left-hand side" of the stream is completed. What you want to do is to shut down the system when the materialized value of the Sink is completed. Using runWith returns the materialized value of the Sink parameter and is equivalent to:
source
.via(flow1)
.toMat(sink)(Keep.right)
.run()
.onComplete(_ => system.terminate())
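If you also need the IOResult from the file source, here is a sketch of the other Keep combinators (same source, flow1, and sink as above; Done comes from akka.Done):

// Keep.left is what plain `to(sink)` does: you get the Future[IOResult] of the FileIO source
val leftMat: Future[IOResult] = source.via(flow1).toMat(sink)(Keep.left).run()

// Keep.both hands you both materialized values
val both: (Future[IOResult], Future[Done]) = source.via(flow1).toMat(sink)(Keep.both).run()
both._2.onComplete(_ => system.terminate()) // still terminate on sink completion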

How to extract messages using regex in Scala?

My regex is being greedy and not working as it's supposed to. I need to extract each message with its timestamp and the user who created it. Also, if a user has two or more consecutive messages, they should go inside one match / block / group. How do I solve this?
https://regex101.com/r/zD5bR6/1
val pattern = "((a\.b|c\.d)\n(.+\n)+)+?".r
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
UPDATE
ndn's solution throws a StackOverflowError when executed:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4708)
.......
Code:
val pattern = "(?:.+(?:\\Z|\\n))+?(?=\\Z|\\w\\.\\w)".r
val array = (pattern findAllIn str).toArray.reverse foreach{println _}
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)
I don't think a regular expression is the right tool for this job. My solution below uses a (tail) recursive function to loop over the lines, keep the current username and create a Message for every timestamp / message pair.
import java.time.LocalTime
case class Message(user: String, timestamp: LocalTime, message: String)
val Timestamp = """\[(\d{2})\:(\d{2})\:(\d{2})\]""".r
def parseMessages(lines: List[String], usernames: Set[String]) = {
  @scala.annotation.tailrec
  def go(
    lines: List[String], currentUser: Option[String], messages: List[Message]
  ): List[Message] = lines match {
    // no more lines -> return parsed messages
    case Nil => messages.reverse
    // found a user -> keep as currentUser
    case user :: tail if usernames.contains(user) =>
      go(tail, Some(user), messages)
    // timestamp and message on next line -> create a Message
    case Timestamp(h, m, s) :: msg :: tail if currentUser.isDefined =>
      val time = LocalTime.of(h.toInt, m.toInt, s.toInt)
      val newMsg = Message(currentUser.get, time, msg)
      go(tail, currentUser, newMsg :: messages)
    // invalid line -> ignore
    case _ =>
      go(lines.tail, currentUser, messages)
  }
  go(lines, None, Nil)
}
Which we can use as:
val input = """
a.b
[10:12:03]
you can also get commands
[10:11:26]
from the console
[10:11:21]
can you check if has been resolved
[10:10:47]
ah, okay
c.d
[10:10:39]
anyways startsLevel is still 4
a.b
[10:09:25]
might be a dead end
[10:08:56]
that need to be started early as well
"""
val lines = input.split('\n').toList
val users = Set("a.b", "c.d")
parseMessages(lines, users).foreach(println)
// Message(a.b,10:12:03,you can also get commands)
// Message(a.b,10:11:26,from the console)
// Message(a.b,10:11:21,can you check if has been resolved)
// Message(a.b,10:10:47,ah, okay)
// Message(c.d,10:10:39,anyways startsLevel is still 4)
// Message(a.b,10:09:25,might be a dead end)
// Message(a.b,10:08:56,that need to be started early as well)
The idea is to take as few characters as possible, up to a following username or the end of the string:
(?:.+(?:\Z|\n))+?(?=\Z|\w\.\w)
See it in action
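For reference, a minimal way to drive that pattern from Scala (assuming str holds the chat log from the question); note that the asker's update above reports a StackOverflowError with this pattern on larger inputs:

val pattern = """(?:.+(?:\Z|\n))+?(?=\Z|\w\.\w)""".r
// each match is one user's block of timestamp/message lines
pattern.findAllIn(str).foreach { block =>
  println("---- block ----")
  print(block)
}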

How to maintain an immutable list when its elements link to each other

I'm trying to code the fast non-dominated sorting algorithm (NDS) of Deb, used in NSGA-II, in an immutable way using Scala.
The problem turned out to be harder than I thought, so I've simplified it here into an MWE.
Imagine a population Seq[A], where each A element is wrapped in a decoratedA holding a list of pointers to other elements of the population Seq[A].
A function evalA(a: decoratedA) takes the list of linked A's it contains and decrements the value of each.
Next I take a subset decoratedAPopulation of the population and call evalA on each element. Here is my problem: between iterations over the elements of decoratedAPopulation, I need to update my population of A with the new decoratedA and the new, updated linkedA it contains ...
Worse, each element of the population needs its linkedA list updated, to replace any linked element that has changed ...
As you can see, it seems complicated to keep all the linked lists synchronized this way. I propose another solution below, which probably needs recursion to return, after each evalA, a new population with the elements replaced.
How can I do that correctly in an immutable way?
It's easy to code mutably, but I can't find a good immutable way to do it. Do you have an idea, or a path to a solution?
object test extends App {
  case class A(value: Int) { def decrement() = new A(value - 1) }
  case class decoratedA(oneAdecorated: A, listOfLinkedA: Seq[A])

  // We start the algorithm loop with A elements with value = 0
  val population = Seq(new A(0), new A(0), new A(8), new A(1))
  val decoratedApopulation = Seq(
    new decoratedA(population(1), Seq(population(2), population(3))),
    new decoratedA(population(2), Seq(population(1), population(3))))

  def evalA(a: decoratedA) = {
    val newListOfLinked = a.listOfLinkedA.map { e => e.decrement() }
    new decoratedA(a.oneAdecorated, newListOfLinked)
  }

  def run() = {
    //decoratedApopulation.map{
    //  ?
    //}
  }
}
Update 1:
About the input / output of the initial algorithm.
The first part of Deb's algorithm (Step 1 to Step 3) analyses a list of individuals and computes for each A: (a) its domination count, the number of A's that dominate it (the value attribute of A), and (b) the list of A's it dominates (listOfLinkedA).
So it returns a fully initialized population of decoratedA, and as the input of Step 4 (my problem) I take the first non-dominated front, i.e. the subset of decoratedA elements with A value = 0.
My problem starts here, with a list of decoratedA with A value = 0; I search for the next front within this list by processing each listOfLinkedA of each of these A's.
At each iteration from step 4 to step 6, I need to compute a new subset B of decoratedA with A value = 0. For each element, I first decrement the domination count attribute of each element in its listOfLinkedA, then filter to keep the elements equal to 0. At the end of step 6, B is saved to a List[Seq[DecoratedA]]; then I restart at step 4 with B and compute a new C, etc.
Something like this in my code; I call explore() for each element of B, with Q equal at the end to the new subset of decoratedA with value (fitness here) = 0:
case class PopulationElement(popElement: Seq[Double]) {
  implicit def poptodouble(): Seq[Double] = popElement
}

class SolutionElement(values: PopulationElement, val fitness: Double, dominates: Seq[SolutionElement]) {
  def decrement() = if (fitness == 0) this else new SolutionElement(values, fitness - 1, dominates)

  def explore(Q: Seq[SolutionElement]): Seq[SolutionElement] = {
    // return all dominated elements with fitness - 1
    val newSolutionSet = dominates.map { _.decrement() }
    val filteredSolution: Seq[SolutionElement] = newSolutionSet.filter { s => s.fitness == 0.0 }.diff(Q)
    filteredSolution
  }
}
At the end of the algorithm, I have a final List[Seq[DecoratedA]] containing all the computed fronts.
Update 2
A sample of values extracted from this example.
I take only the Pareto front (shown in red in the original figure) and the next front {f,h,l}, with dominated count = 1.
case class p(x: Double, y: Double)
case class A(XY: p, value: Int) { def decrement() = new A(XY, value - 1) }
case class ARoot(node: A, children: Seq[A])

val a = A(p(3.5, 1.0), 0)
val b = A(p(3.0, 1.5), 0)
val c = A(p(2.0, 2.0), 0)
val d = A(p(1.0, 3.0), 0)
val e = A(p(0.5, 4.0), 0)
val f = A(p(0.5, 4.5), 1)
val h = A(p(1.5, 3.5), 1)
val l = A(p(4.5, 1.0), 1)

val population = Seq(
  ARoot(a, Seq(f, h, l)),
  ARoot(b, Seq(f, h, l)),
  ARoot(c, Seq(f, h, l)),
  ARoot(d, Seq(f, h, l)),
  ARoot(e, Seq(f, h, l)),
  ARoot(f, Nil),
  ARoot(h, Nil),
  ARoot(l, Nil))
The algorithm returns List(List(a,b,c,d,e), List(f,h,l)).
Update 3
After 2 hours, and some pattern matching problems (ahem...), I'm coming back with a complete example which automatically computes the dominated counter and the children of each ARoot.
But I have the same problem: my children list computation is not totally correct, because each element A may be a shared member of another ARoot's children list, so I need to think about your answer to modify it :/ At this point I only compute children lists as Seq[p], and I need them as Seq[A].
case class p(x: Double, y: Double) {
  def toSeq(): Seq[Double] = Seq(x, y)
}
case class A(XY: p, dominatedCounter: Int) { def decrement() = new A(XY, dominatedCounter - 1) }
case class ARoot(node: A, children: Seq[A])
case class ARootRaw(node: A, children: Seq[p])

object test_stackoverflow extends App {
  val a = new p(3.5, 1.0)
  val b = new p(3.0, 1.5)
  val c = new p(2.0, 2.0)
  val d = new p(1.0, 3.0)
  val e = new p(0.5, 4.0)
  val f = new p(0.5, 4.5)
  val g = new p(1.5, 4.5)
  val h = new p(1.5, 3.5)
  val i = new p(2.0, 3.5)
  val j = new p(2.5, 3.0)
  val k = new p(3.5, 2.0)
  val l = new p(4.5, 1.0)
  val m = new p(4.5, 2.5)
  val n = new p(4.0, 4.0)
  val o = new p(3.0, 4.0)
  val p = new p(5.0, 4.5)

  def isStriclyDominated(p1: p, p2: p): Boolean =
    (p1.toSeq zip p2.toSeq).exists { case (g1, g2) => g1 < g2 }

  def sortedByRank(population: Seq[p]) = {
    def paretoRanking(values: Set[p]) = {
      // comment from @dk14: I suppose the order of values doesn't matter here, otherwise use SortedSet
      values.map { v1 =>
        val t = (values - v1).filter(isStriclyDominated(v1, _)).toSeq
        val a = new A(v1, values.size - t.size - 1)
        val root = new ARootRaw(a, t)
        println("Root value ", root)
        root
      }
    }
    val listOfARootRaw = paretoRanking(population.toSet)
    // From @dk14: here is the conversion from Seq[p] to Seq[A]
    // From @dk14: a map with the dominatedCounter for each point
    val dominations: Map[p, Int] = listOfARootRaw.map(a => a.node.XY -> a.node.dominatedCounter).toMap
    val listOfARoot = listOfARootRaw.map(raw =>
      ARoot(raw.node, raw.children.map(p => A(p, dominations.getOrElse(p, 0)))))
    listOfARoot.groupBy(_.node.dominatedCounter)
  }

  // Get the first front, a subset of ARoot, and start step 4
  println(sortedByRank(Seq(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p)).head)
}
Talking about your problem with distinguishing fronts (after update 2):
val (left, right) = population.partition(_.node.value == 0)
List(left, right.map(r => r.copy(node = r.node.copy(value = r.node.value - 1))))
No need to mutate anything here. copy copies everything except the fields you give new values for. In this code, the new copy is linked to the same list of children, but with value = value - 1.
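To see the sharing, here is a tiny sketch of mine using the A/ARoot classes and values from update 2:

val root = ARoot(a, Seq(f, h, l))
val bumped = root.copy(node = root.node.copy(value = root.node.value - 1))
println(bumped.children eq root.children) // true: the children Seq is shared, not copied
println(bumped.node) // A(p(3.5,1.0),-1): only the node changed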
P.S. I have a feeling you may actually want to do something like this:
case class A(id: String, level: Int)
val a = A("a", 1)
val b = A("b", 2)
val c = A("c", 2)
val d = A("d", 3)
clusterize(List(a,b,c,d)) === List(List(a), List(b,c), List(d))
It's simple to implement:
def clusterize(list: List[A]) =
list.groupBy(_.level).toList.sortBy(_._1).map(_._2)
Test:
scala> clusterize(List(A("a", 1), A("b", 2), A("c", 2), A("d", 3)))
res2: List[List[A]] = List(List(A(a,1)), List(A(b,2), A(c,2)), List(A(d,3)))
P.S.2. Please consider better naming conventions, like here.
Talking about "mutating" elements in some complex structure:
The idea of "immutably mutating" a value shared between parts of a structure is to separate your "mutation" from the structure itself. Simply put, divide and conquer:
calculate changes in advance
apply them
The code:
case class A(v: Int)
case class AA(a: A, seq: Seq[A]) //decoratedA
def update(input: Seq[AA]) = {
  // shows how to decrement each value wherever it is:
  val stats = input.map(_.a).groupBy(identity).mapValues(_.size) // domination count for each A
  def upd(a: A) = A(a.v - stats.getOrElse(a, 0)) // apply the decrement
  input.map(aa => AA(upd(aa.a), aa.seq.map(upd))) // traverse and "update" the original structure
}
So I've introduced a new Map[A, Int] structure that shows how to modify the original one. This approach is based on a highly simplified version of the Applicative Functor concept. In the general case it would be Map[A, A => A], or even Map[K, A => B], or even Map[K, Zipper[A] => B], as the applicative functor (input <*> map). A Zipper (see 1, 2) can actually give you information about the current element's context.
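A sketch of that generalization (my own illustration, reusing A/AA from above and the input from the example below): the changes map carries transformations instead of plain deltas:

// Map[A, A => A]: each key maps to the function to apply wherever that A occurs
def updateWith(input: Seq[AA], changes: Map[A, A => A]): Seq[AA] = {
  def upd(a: A): A = changes.getOrElse(a, (x: A) => x)(a)
  input.map(aa => AA(upd(aa.a), aa.seq.map(upd)))
}

// e.g. decrement every A(3) and double every A(8):
updateWith(input, Map[A, A => A](A(3) -> (a => A(a.v - 1)), A(8) -> (a => A(a.v * 2))))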
Notes:
I assumed that A's with the same value are the same; that's the default behaviour for case classes. Otherwise you need to provide some additional ids (or redefine hashCode/equals).
If you need more levels, like AA(AA(AA(...))), just make stats and upd recursive; if the decrement's weight depends on the nesting level, just add the nesting level as a parameter to your recursive function.
If the decrement depends on the parent node (like: decrement only A(3)'s which belong to an A(3)), add the parent node(s) as part of stats's key and analyse it during upd.
If there is some dependency between the stats calculations (how much to decrement) of, say, input(1) and input(0), use foldLeft with the partial stats as accumulator, as sketched below: val stats = input.foldLeft(Map[A, Int]())((partialStats, elem) => partialStats ++ analize(partialStats, elem))
Btw, it takes O(N) here (linear memory and CPU usage).
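The foldLeft variant in full, with a hypothetical analize rule (a running count of heads, invented purely for illustration; input as in the example below):

def analize(partial: Map[A, Int], elem: AA): Map[A, Int] =
  Map(elem.a -> (partial.getOrElse(elem.a, 0) + 1)) // this rule may depend on earlier stats

val dependentStats = input.foldLeft(Map[A, Int]()) { (partialStats, elem) =>
  partialStats ++ analize(partialStats, elem)
}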
Example:
scala> val population = Seq(A(3), A(6), A(8), A(3))
population: Seq[A] = List(A(3), A(6), A(8), A(3))
scala> val input = Seq(AA(population(1),Seq(population(2),population(3))), AA(population(2),Seq(population(1),population(3))))
input: Seq[AA] = List(AA(A(6),List(A(8), A(3))), AA(A(8),List(A(6), A(3))))
scala> update(input)
res34: Seq[AA] = List(AA(A(5),List(A(7), A(3))), AA(A(7),List(A(5), A(3))))

Saving partial spark DStream window to HDFS

I am counting the values in each window to find the top values, and I want to save only the top 10 most frequent values of each window to HDFS, rather than all of the values.
eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
val counts = eegStreams(a).map(x => (math.round(x.toDouble), 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(4), Seconds(4))
val sortedCounts = counts.map(_.swap).transform(rdd => rdd.sortByKey(false)).map(_.swap)
//sortedCounts.foreachRDD(rdd =>println("\nTop 10 amplitudes:\n" + rdd.take(10).mkString("\n")))
sortedCounts.map(tuple => "%s,%s".format(tuple._1, tuple._2)).saveAsTextFiles("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))
I can print top 10 as above (commented).
I have also tried
sortedCounts.foreachRDD{ rdd => ssc.sparkContext.parallelize(rdd.take(10)).saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a+1))}
but I get the following error. My Array is not serializable
15/01/05 17:12:23 ERROR actor.OneForOneStrategy: org.apache.spark.streaming.StreamingContext
java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
.......
Can you try this?
sortedCounts.foreachRDD(rdd => rdd.filterWith(ind => ind)((v, ind) => ind <= 10).saveAsTextFile(...))
Note: I didn't test the snippet...
Your first version should work. Just declare @transient val ssc = ... where the StreamingContext is first created.
The second version won't work because the StreamingContext cannot be serialized in a closure.
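A sketch of that fix applied to the first version (sortedCounts, a, and the output path as in the question; the context setup details here are assumptions, not the asker's actual configuration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// @transient keeps the context out of the serialized foreachRDD closure
@transient val ssc = new StreamingContext(new SparkConf().setAppName("eeg"), Seconds(4))

sortedCounts.foreachRDD { rdd =>
  ssc.sparkContext.parallelize(rdd.take(10)) // take(10) runs on the driver
    .saveAsTextFile("hdfs://ec2-23-21-113-136.compute-1.amazonaws.com:9000/user/hduser/output/" + (a + 1))
}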

Insertion order of a list based on order of another list

I have a sorting problem in Scala that I could certainly solve with brute-force, but I'm hopeful there is a more clever/elegant solution available. Suppose I have a list of strings in no particular order:
val keys = List("john", "jill", "ganesh", "wei", "bruce", "123", "Pantera")
Then I receive the values for these keys at random (full disclosure: I'm experiencing this problem in an Akka actor, so events do not arrive in order):
def receive:Receive = {
case Value(key, otherStuff) => // key is an element in keys ...
And I want to store these results in a List where the Value objects appear in the same order as their key fields in the keys list. For instance, I may have this list after receiving the first two Value messages:
List(Value("ganesh", stuff1), Value("bruce", stuff2))
ganesh appears before bruce merely because he appears earlier in the keys list. Once the third message is received, I should insert it into this list in the correct location per the ordering established by keys. For instance, on receiving wei I should insert him into the middle:
List(Value("ganesh", stuff1), Value("wei", stuff3), Value("bruce", stuff2))
At any point during this process, my list may be incomplete but in the expected order. Since the keys are redundant with my Value data, I throw them away once the list of values is complete.
Show me what you've got!
I assume you want no worse than O(n log n) performance. So:
val order = keys.zipWithIndex.toMap
var part = collection.immutable.TreeSet.empty[Value](
math.Ordering.by(v => order(v.key))
)
Then you just add your items.
scala> part = part + Value("ganesh", 0.1)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1))
scala> part = part + Value("bruce", 0.2)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1), Value(bruce,0.2))
scala> part = part + Value("wei", 0.3)
part: scala.collection.immutable.TreeSet[Value] =
TreeSet(Value(ganesh,0.1), Value(wei,0.3), Value(bruce,0.2))
When you're done, you can .toList it. While you're building it, you probably don't want to, since updating a list in random order so that it is in a desired sorted order is an obligatory O(n^2) cost.
Edit: with your example of seven items, my solution takes about 1/3 the time of Jean-Philippe's. For 25 items, it's 1/10th the time. 1/30th for 200 (which is the difference between 6 ms and 0.2 ms on my machine).
If you can use a ListMap instead of a list of tuples to store values while they're gathered, this could work. ListMap preserves insertion order.
import scala.collection.immutable.ListMap

class MyActor(keys: List[String]) extends Actor {
  def initial(values: ListMap[String, Option[Value]]): Receive = {
    case v @ Value(key, otherStuff) =>
      val updated = values.updated(key, Some(v))
      if (updated.forall(_._2.isDefined))
        context.become(valuesReceived(updated.collect { case (_, Some(value)) => value }.toSeq))
      else
        context.become(initial(updated))
  }
  def valuesReceived(values: Seq[Value]): Receive = { case _ => } // whatever you need
  def receive = initial(ListMap(keys.map(k => k -> (None: Option[Value])): _*))
}
(warning: not compiled)