I have the following RDD:
x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready
group: NT=app_1,hadoop-exec,sparkConnection,Ready
group: NT=app_exmpl_2,DB-exec,MDBConnection,NR
group: NT=apprexec,hadoop-exec,sparkConnection,Ready
group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR
I just want to get the first part of every element of this RDD, as you can see in the next example:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
To do this, I am trying the following:
//Here I get the RDD:
val x = spark.sparkContext.parallelize(List(value)).collect()
// Try to use a regex on it; this regex is meant to match everything up to the first comma
val regex1 = """(^(.+?),)"""
val rdd_1 = x.map(g => g.matches(regex1))
This is what I am trying, but it is not working for me because I just get an Array of Boolean. What am I doing wrong?
I am new to Apache Spark and Scala. If you need anything more, just tell me. Thanks in advance!
Try this:
val x: Array[String] =
  Array(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

val rdd = sc.parallelize(x)

// keep only the text before the first comma
val result = rdd.map(line => line.split(",")(0))

result.collect().foreach(println)
output:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
Try with this regex:
^\s*([^,]+)(_\w+)?
Demo
To implement this regex in your example, you can try:
val arr = Seq("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
"group: NT=app_1,hadoop-exec,sparkConnection,Ready",
"group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
"group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
"group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
import org.apache.spark.sql.Row

// triple-quoted so the backslashes survive, and unanchored so the extractor
// only has to match the leading part of each line
val pattern = """^\s*([^,]+)(_\w+)?""".r.unanchored

val rd_var = spark.sparkContext.parallelize(arr.map(Row(_)))

rd_var.map {
  case Row(str: String) => str match {
    case pattern(gr1, _) => gr1
  }
}.foreach(println(_))
With RDD:
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").getOrCreate()
// group 1 captures everything up to the first comma, group 2 the rest
val pattern = "([a-zA-Z0-9=:_ ]+),(.*)".r
val el = Seq(
  "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
  "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
  "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

def main(args: Array[String]): Unit = {
  val rdd = spark.sparkContext.parallelize(el.map(Row(_)))
  rdd.map {
    case Row(str: String) => str match {
      case pattern(gr1, _) => gr1
    }
  }.foreach(println(_))
}
It gives:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
I want to find out whether a specific string is present in a list, for example:
val fruit: List[String] = List("apples", "oranges", "pears")
I want to check if "oranges" is present in the given list.
It would be great if someone could help me with this. TIA
There are several ways to do that:
scala> val fruits: List[String] = List("apples", "oranges", "pears")
fruits: List[String] = List(apples, oranges, pears)
using .contains
scala> val hasApples = fruits.contains("apples")
hasApples: Boolean = true
scala> val hasBananas = fruits.contains("bananas")
hasBananas: Boolean = false
or using .find
scala> fruits.find(_ == "apples")
res1: Option[String] = Some(apples)
scala> fruits.find(_ == "bananas")
res2: Option[String] = None
Check the documentation for other useful methods at: http://www.scala-lang.org/api/current/#scala.collection.immutable.List
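Another standard-library option (not shown above) is .exists, which takes a predicate:
scala> fruits.exists(_ == "oranges")
res3: Boolean = true

scala> fruits.exists(_.startsWith("ban"))
res4: Boolean = false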
I have 2 lists of the following class:
case class User(var userId: Int =0,
var userName: String ="",
var email: String="",
var password: String ="") {
def this() = this(0, "", "", "")
}
globalList of the User class.
localList of the User class.
I would like to remove/filter all items from globalList that have the same userId as an item in localList.
I tried a couple of APIs with no success, such as filterNot, filter, drop, and dropWhile. Please advise me how it can be done.
The diff operator "Computes the multiset difference between this list and another sequence". Note that diff compares whole elements, so for the question's User objects you would diff the userIds rather than the Users themselves (or filter on them, as in the other answer).
scala> val global = List(0,1,2,3,4,5)
global: List[Int] = List(0, 1, 2, 3, 4, 5)
scala> val local = List(1,2,3)
local: List[Int] = List(1, 2, 3)
scala> global.diff(local)
res9: List[Int] = List(0, 4, 5)
You can try the following:
val userIdSet = localList.map(_.userId).toSet
val filteredList = globalList.filterNot(u => userIdSet.contains(u.userId))
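A small usage sketch with the question's User case class; the sample data below is invented for illustration:
val globalList = List(User(1, "ann"), User(2, "bob"), User(3, "cat"))
val localList  = List(User(2, "bob"))

val userIdSet = localList.map(_.userId).toSet
// drop every global user whose userId also appears in the local list
val filteredList = globalList.filterNot(u => userIdSet.contains(u.userId))
// filteredList keeps the users with ids 1 and 3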
I expect the regex "[a-zA-Z]\\d{6}" to match "z999999", but it does not match; instead an empty string is mapped:
val lines = List("z999999"); //> lines : List[String] = List(z999999)
val regex = """[a-zA-Z]\d{6}""".r //> regex : scala.util.matching.Regex = [a-zA-Z]\d{6}
val fi = lines.map(line => line match { case regex(group) => group case _ => "" })
//> fi : List[String] = List("")
Is there a problem with my regex, or with how I'm using it in Scala?
val l="z999999"
val regex = """[a-zA-Z]\d{6}""".r
regex.findAllIn(l).toList
res1: List[String] = List(z999999)
The regex seems valid.
lines.map( _ match { case regex(group) => group; case _ => "" })
res2: List[String] = List("")
How odd. Let's see what happens with a capturing group around the whole expression we defined in regex.
val regex2= """([a-zA-Z]\d{6})""".r
regex2: scala.util.matching.Regex = ([a-zA-Z]\d{6})
lines.map( _ match { case regex2(group) => group; case _ => "" })
res3: List[String] = List(z999999)
Huzzah.
The unapply method on a regex is for getting the results of capturing groups.
There are other methods on a regex object that just get matches (e.g. findAllIn, findFirstIn, etc.).
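To summarise, a small sketch contrasting the two approaches, reusing the definitions from above:
val regex  = """[a-zA-Z]\d{6}""".r    // no capturing group
val regex2 = """([a-zA-Z]\d{6})""".r  // whole match captured as group 1

// findFirstIn only looks for a match, no capturing groups needed
regex.findFirstIn("z999999")          // Some(z999999)

// the extractor (unapply) binds one name per capturing group
"z999999" match {
  case regex2(group) => group         // "z999999"
  case _             => ""
}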
What is the problem with parsing the blank/whitespace?
scala> import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.combinator.RegexParsers

scala> object BlankParser extends RegexParsers {
def blank: Parser[Any] = " "
def foo: Parser[Any] = "foo"
}
defined module BlankParser
scala> BlankParser.parseAll(BlankParser.foo, "foo")
res15: BlankParser.ParseResult[Any] = [1.4] parsed: foo
scala> BlankParser.parseAll(BlankParser.blank, " ")
res16: BlankParser.ParseResult[Any] =
[1.2] failure: ` ' expected but ` ' found
^
scala>
RegexParsers throws whitespace away by default, so the blank parser never sees the space.
Try
override val skipWhitespace = false
to avoid this.
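For example, a minimal sketch of the same parser with whitespace skipping disabled (the blank rule then succeeds):
import scala.util.parsing.combinator.RegexParsers

object BlankParser extends RegexParsers {
  // do not silently skip whitespace before applying each parser
  override val skipWhitespace = false

  def blank: Parser[Any] = " "
  def foo: Parser[Any] = "foo"
}

// BlankParser.parseAll(BlankParser.blank, " ") now parses the single space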
It seems the question was already solved; see also:
Scala parser combinators for language embedded in html or text (like php)