merge elements in a Spark RDD under custom condition - mapreduce

How can I merge elements in a Spark RDD under a custom condition?
Suppose there is an RDD[Seq[Int]] in which some of the Seq[Int] contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD and store the result in a new RDD.
For example, if RDD[Seq[Int]] = [[1,2,3], [2,4,5], [1,2], [7,8,9]], the result should be [[1,2,3,4,5], [7,8,9]].
Since the RDD[Seq[Int]] is very large, I cannot do this in the driver program. Is it possible to get it done using distributed groupBy/map/reduce, etc.?

I finally worked it out myself.
This problem can be reduced to computing the connected components formed by the elements of the RDD[Seq[Int]], since the merge condition (two Seq[Int] share at least one integer) defines connectivity between two Seq[Int].
The basic idea is:
1. Give each element of the RDD[Seq[Int]] a unique key (.zipWithUniqueId).
2. Group the generated keys by the integers in each Seq[Int], so that an integer appearing in several Seq[Int] has all of the corresponding keys grouped together.
3. Build a graph RDD whose edges are key pairs taken from the same group in Step 2.
4. Use GraphX to compute the connected components, and join the result back to the keyed sets.
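import org.apache.spark.graphx.Graph

val sets = Seq(Seq(1,2,3,4), Seq(4,5), Seq(1,2,3), Seq(6,7,8), Seq(9,10), Seq(7,9))
// Step 1: key every set by a unique id
val rddSets = sc.parallelize(sets)
  .zipWithUniqueId
  .map(x => (x._2, x._1)).cache()
// Steps 2 and 3: group the set ids by integer, then chain the ids in each group into edges
val edges = rddSets.flatMap(s => s._2.map(i => (i, s._1)))
  .groupByKey.flatMap(g => {
    var first = g._2.head
    for (v <- g._2.drop(1)) yield {
      val pair = (first, v)
      first = v
      pair
    }
  }).flatMap(e => Seq((e._1, e._2), (e._2, e._1)))
// Step 4: label every set id with the id of its connected component
val vertices = Graph.fromEdgeTuples[Long](edges, defaultValue = 0)
  .connectedComponents.vertices
// A set that overlaps no other set produces no edge and is absent from the graph,
// so use leftOuterJoin and fall back to the set's own id as its component id
rddSets.leftOuterJoin(vertices).map(x => (x._2._2.getOrElse(x._1), x._2._1))
  .reduceByKey((s1, s2) => s1.union(s2).distinct)
  .collect().foreach(x => println(x._2.toString()))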
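To sanity-check the approach on a small sample without a cluster, the same connected-components idea can be sketched in plain Scala with a union-find. This is only a local, non-distributed illustration; mergeOverlapping is a made-up helper name, not a Spark or GraphX API, and it assumes the sample fits in memory.

```scala
// Local sketch: merge all sets that (transitively) share an integer.
def mergeOverlapping(sets: Seq[Seq[Int]]): Seq[Seq[Int]] = {
  // parent(i) points towards the representative of the component of set i
  val parent = Array.tabulate(sets.length)(identity)

  // Find the representative of i, shortening the path on the way
  def find(i: Int): Int = {
    var x = i
    while (parent(x) != x) {
      parent(x) = parent(parent(x))
      x = parent(x)
    }
    x
  }

  // Put the components of a and b together
  def union(a: Int, b: Int): Unit = {
    val ra = find(a)
    val rb = find(b)
    if (ra != rb) parent(rb) = ra
  }

  // For every integer, the indices of the sets that contain it
  val owners: Map[Int, Seq[Int]] =
    sets.zipWithIndex
      .flatMap { case (s, idx) => s.map(i => (i, idx)) }
      .groupBy(_._1)
      .map { case (i, pairs) => (i, pairs.map(_._2)) }

  // Sets sharing an integer belong to the same component
  for (ids <- owners.values; id <- ids.tail) union(ids.head, id)

  // Collect the members of each component into one sorted, de-duplicated set
  sets.indices
    .groupBy(i => find(i))
    .values
    .map(ids => ids.flatMap(i => sets(i)).distinct.sorted)
    .toSeq
    .sortBy(_.headOption)
}

val merged = mergeOverlapping(Seq(Seq(1, 2, 3), Seq(2, 4, 5), Seq(1, 2), Seq(7, 8, 9)))
// merged contains two sets: 1..5 and 7..9
```

Note that the set [7,8,9] overlaps nothing and still appears in the result, which is the behaviour the question asks for.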

Related

Scala iterate over two consecutive elements of a list

How can I iterate over each pair of consecutive elements of a list and apply the difference function to them?
For instance, I have this:
val list = List(List("Eat", "Drink", "Sleep", "work"), List("Eat", "Sleep", "Dance"))
I want to iterate over these two consecutive elements and calculate the difference.
I've tried the following, but I do not know how to iterate over each pair of consecutive elements:
list.map((a,b) => a.diff(b))
The output should be List("Drink", "work").
If I understand correctly, you probably want to iterate over a sliding window.
list.sliding(2).map{
case List(a, b) => a.diff(b)
case List(a) => a
}.toList
Alternatively you might also want grouped(2) which partitions the list into groups instead.
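A small sketch of how the two differ, assuming a third list is appended to the question's data purely for illustration (diffGroups is a hypothetical helper name): sliding(2) yields overlapping pairs, while grouped(2) yields disjoint chunks and leaves a lone trailing list on its own.

```scala
val lists = List(
  List("Eat", "Drink", "Sleep", "work"),
  List("Eat", "Sleep", "Dance"),
  List("Dance", "Run") // extra list, added only for this illustration
)

// Applies diff to each group of two; a lone list is passed through unchanged.
def diffGroups(groups: Iterator[List[List[String]]]): List[List[String]] =
  groups.map {
    case List(a, b) => a.diff(b)
    case List(a)    => a
    case _          => Nil
  }.toList

// Overlapping windows: (first, second), (second, third)
val overlapping = diffGroups(lists.sliding(2))
// List(List(Drink, work), List(Eat, Sleep))

// Disjoint chunks: (first, second), (third)
val disjoint = diffGroups(lists.grouped(2))
// List(List(Drink, work), List(Dance, Run))
```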
def main(args: Array[String]): Unit = {
val list = List(List("Eat", "Drink", "Sleep", "work"), List("Eat", "Sleep", "Dance"));
val diff = list.head.diff(list(1))
println(diff)
}
In your case, match can work perfectly fine:
val list = List(List("Eat", "Drink", "Sleep", "work"), List("Eat", "Sleep", "Dance"))
list match { case a :: b :: Nil => a diff b}
If the list does not always have exactly 2 items, you should also have a catch-all case in the match; otherwise a MatchError is thrown at runtime.
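One possible shape for that catch-all, sketched with a hypothetical helper diffOfPair that returns an Option so that callers can tell "no pair" apart from an empty difference:

```scala
// Returns the difference only when the list holds exactly two inner lists.
def diffOfPair(list: List[List[String]]): Option[List[String]] = list match {
  case a :: b :: Nil => Some(a diff b)
  case _             => None // zero, one, or more than two inner lists
}

val two = diffOfPair(List(List("Eat", "Drink", "Sleep", "work"), List("Eat", "Sleep", "Dance")))
// Some(List(Drink, work))
val one = diffOfPair(List(List("Eat")))
// None
```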

Add sum of values of two lists into new one in scala

v1 = [1,2,3,4]
v2 = [1,2,3,4,5]
I need the element-wise sum of these lists: [2,4,6,8,5]
Also, is there any way to print the elements for which a+b = c, where c is, for example, 8?
How can I do that in Scala?
You can use zipAll to zip the lists together. That method takes two extra arguments: the element used to pad this list if it is the shorter one, and the element used to pad the other list if that one is shorter. Since you are adding the lists, you should use the additive identity 0 for both. Then you can simply map over the generated list of tuples:
val v1 = List(1, 2, 3, 4)
val v2 = List(1, 2, 3, 4, 5)
v1.zipAll(v2, 0, 0).map { case (a, b) => a + b }
zipAll is documented under IterableLike. The most relevant part:
Returns a iterable collection formed from this iterable collection and another iterable collection by combining corresponding elements in pairs. If one of the two collections is shorter than the other, placeholder elements are used to extend the shorter collection to the length of the longer.
If you're looking to print out certain elements, you might choose to filter instead of map, and then use foreach:
v1.zipAll(v2, 0, 0).filter {
case(a, b) => a + b == 8
}.foreach {
case(a, b) => println(s"$a+$b=8")
}
Or just a foreach with more interesting case statements:
v1.zipAll(v2, 0, 0).foreach {
case(a, b) if a + b == 8 => println(s"$a+$b=8")
case _ =>
}
Or you could use collect, and ignore the return value:
v1.zipAll(v2, 0, 0).collect {
case(a, b) if a + b == 8 => println(s"$a+$b=8")
}
You might want to read some introductory text to the Scala collections library, like the one in the docs.
A similar approach to Ben's, using a for comprehension,
for ( (a,b) <- v1.zipAll(v2, 0, 0) if a+b == 8 ) yield (a,b)
which delivers those (zipped) pairs of values whose sum is 8.
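Putting the pieces together, here is a minimal sketch with the target sum passed as a parameter rather than hard-coded to 8. The helper names sumPadded and pairsSummingTo are made up for this example.

```scala
// Element-wise sum; the shorter list is padded with 0.
def sumPadded(v1: List[Int], v2: List[Int]): List[Int] =
  v1.zipAll(v2, 0, 0).map { case (a, b) => a + b }

// The zipped pairs whose sum equals c.
def pairsSummingTo(v1: List[Int], v2: List[Int], c: Int): List[(Int, Int)] =
  v1.zipAll(v2, 0, 0).filter { case (a, b) => a + b == c }

val v1 = List(1, 2, 3, 4)
val v2 = List(1, 2, 3, 4, 5)

val sums = sumPadded(v1, v2)            // List(2, 4, 6, 8, 5)
val eights = pairsSummingTo(v1, v2, 8)  // List((4,4))
```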

How to check a list contains substring from other list using scala?

I have the following lists:
val A = List(("192.168.20.1", "WinInfra", List("naa.6d867d9c7ac")),
  ("192.168.20.1", "TriSQLFreshInstall", List("naa.6d867d",
    "naa.42704fdc4")),
  ("192.168.20.1", "redHat7", List("naa.4270cdf",
    "naa.427045dc")))
val B = List("4270cdf", "427045dc", "42704fdc4")
I want to check whether the last element of each tuple in list A (a list of strings) contains any substring from list B, and get only the unmatched elements as output.
Edit: I want to check whether any element of list B occurs in list A, and collect only those elements of list A that do not contain any element of list B.
I want the following output:
List(("192.168.20.1","WinInfra",List( "naa.6d867d9c7ac")))
How do I get the above output using Scala?
I think something like this:
A.filterNot(a => B.exists(b => a._3.exists(str => str.contains(b))))
or
A.filterNot(a => a._3.exists(str => B.exists(b => str.contains(b))))
or shorter, but less readable
A.filterNot(_._3 exists (B exists _.contains))
First, I wouldn't pass around tuples. It would be a lot easier if you put this data structure into an object and worked with that. However, it is easier to start by finding the matches first. So you'll start out by applying a filter on list A:
A.filter { case (ip, disc, sublist) => .... }
Where the items in your sublist are in list B:
sublist.exists(sublistItem => B.contains(sublistItem.replace("naa.", "")))
This returns:
res1: List[(String, String, List[String])] = List((192.168.20.1,TriSQLFreshInstall,List(naa.6d867d, naa.42704fdc4)), (192.168.20.1,redHat7,List(naa.4270cdf, naa.427045dc)))
Which is the opposite of what you want. This is easy to correct by using filterNot:
A.filterNot { case (ip, disc, sublist) => sublist.exists(sublistItem => B.contains(sublistItem.replace("naa.", ""))) }
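As a rough sketch of the "put it into an object" suggestion, the tuples could become a case class. The name Host and its field names are assumptions made up for this example, not something from the question.

```scala
// Hypothetical record type replacing the (String, String, List[String]) tuples
case class Host(ip: String, name: String, disks: List[String])

val hosts = List(
  Host("192.168.20.1", "WinInfra", List("naa.6d867d9c7ac")),
  Host("192.168.20.1", "TriSQLFreshInstall", List("naa.6d867d", "naa.42704fdc4")),
  Host("192.168.20.1", "redHat7", List("naa.4270cdf", "naa.427045dc"))
)
val B = List("4270cdf", "427045dc", "42704fdc4")

// Keep only hosts none of whose disk ids (without the "naa." prefix) appear in B
val unmatched = hosts.filterNot(h => h.disks.exists(d => B.contains(d.stripPrefix("naa."))))
// List(Host(192.168.20.1,WinInfra,List(naa.6d867d9c7ac)))
```

Named fields make the predicate readable without _3 or pattern matching on tuple positions.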

How can I fold the nth and (n+1)th elements into a new list in Scala?

Let's say I have List(1,2,3,4,5) and I want to get
List(3,5,7,9), that is, the sums of each element and the previous one (1+2, 2+3, 3+4, 4+5).
I tried to do this by making two lists:
val list1 = List(1,2,3,4,5)
val list2 = (list1.tail ::: List(0)) // 2,3,4,5,0
for (n0_ <- list1; n1_ <- list2) yield (n0_ + n1_)
But that combines all the elements with each other like a cross product, and I only want to combine the elements pairwise. I'm new to functional programming and I thought I'd use map() but can't seem to do so.
List(1, 2, 3, 4, 5).sliding(2).map(_.sum).toList does the job.
Docs:
def sliding(size: Int): Iterator[Seq[A]]
Groups elements in fixed size blocks by passing a "sliding window" over them (as opposed to partitioning them, as is done in grouped.)
You can combine the lists with zip and use map to add the pairs.
val list1 = List(1,2,3,4,5)
list1.zip(list1.tail).map(x => x._1 + x._2)
res0: List[Int] = List(3, 5, 7, 9)
Personally I think using sliding as Infinity has is the clearest, but if you want to use a zip-based solution then you might want to use the zipped method:
( list1, list1.tail ).zipped map (_+_)
In addition to being arguably clearer than using zip, it is more efficient in that the intermediate data structure (the list of tuples) created by zip is not created with zipped. However, don't use it with infinite streams, or it will eat all of your memory.
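On Scala 2.13 and later, zipped on tuples is deprecated in favour of lazyZip, which likewise avoids building the intermediate list of tuples. A minimal sketch, assuming Scala 2.13 or newer:

```scala
val list1 = List(1, 2, 3, 4, 5)

// lazyZip pairs the elements lazily; map then combines each pair directly
val sums = list1.lazyZip(list1.tail).map(_ + _)
// List(3, 5, 7, 9)
```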

Scala conditional sum of elements in a filtered tuples list

I'm new to Scala and need a little help with combining
filter and sum on a list of tuples.
What I need is the sum of the integers in a filtered list of tuples, which
is essentially the answer to the question:
What is the sum of all set weights?
The result should be 20 for the sample list below.
The list is pretty simple:
val ln = List( ("durationWeight" , true, 10),
("seasonWeight" , true, 10),
("regionWeight" , false, 5),
("otherWeight" , false, 5)
)
Filtering the list according to the Boolean flag is simple:
val filtered = ln.filter { case(name, check, value) => check == true }
which returns the wanted tuples. Getting their sum seems to work
with a map followed by sum:
val b = filtered.map{case((name, check, value) ) => value}.sum
which returns the wanted sum of all set weights.
However, how do I do all that in one step, combining filter, map and sum,
ideally in an elegant one-liner?
Thanks for your help.
ln.collect{ case (_, true, value) => value }.sum
Another approach for the heck of it:
(0 /: ln)((sum,x) => if (x._2) sum + x._3 else sum)
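The /: operator is the symbolic alias of foldLeft and is deprecated as of Scala 2.13, so the same fold may be easier to read when spelled out. A minimal sketch using the sample list from the question:

```scala
val ln = List(
  ("durationWeight", true, 10),
  ("seasonWeight", true, 10),
  ("regionWeight", false, 5),
  ("otherWeight", false, 5)
)

// Add the weight only when the flag is set
val total = ln.foldLeft(0) { case (sum, (_, check, value)) =>
  if (check) sum + value else sum
}
// 20

// The collect-based one-liner gives the same result
val viaCollect = ln.collect { case (_, true, value) => value }.sum
```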