Regex on an RDD using Apache Spark and Scala

I have the following RDD:
x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
"group: NT=app_1,hadoop-exec,sparkConnection,Ready",
"group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
"group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
"group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
I just want the first field of each element of this RDD (everything before the first comma), as in this example:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
To do this, I tried the following:
// Here I get the RDD:
val x = spark.sparkContext.parallelize(List(value)).collect()
// Try a regex on it; this regex should match up to the first comma
val regex1 = """(^(.+?),)"""
val rdd_1 = x.map(g => g.matches(regex1))
This does not work, because I just get an Array of Boolean. What am I doing wrong?
I am new to Apache Spark and Scala. If you need anything more, just tell me. Thanks in advance!

Try this:
val x: Array[String] =
Array(
"Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
"group: NT=app_1,hadoop-exec,sparkConnection,Ready",
"group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
"group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
"group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
val rdd = sc.parallelize(x)
val result = rdd.map(lines => {
lines.split(",")(0)
})
result.collect().foreach(println)
Output:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
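As for why the original attempt produced Booleans: String.matches only tests whether the whole string matches the pattern and never returns the matched text. Below is a minimal sketch of extracting the text before the first comma with a regex. It uses plain Scala collections so that it runs without Spark, and the names firstFieldRegex, firstField and sample are illustrative, not from the question. The same function could presumably be passed to rdd.map or rdd.flatMap.

```scala
import scala.util.matching.Regex

// Capture everything before the first comma (hypothetical helper names).
val firstFieldRegex: Regex = """^([^,]+),""".r

// findFirstMatchIn returns Option[Match]; group(1) is the captured text.
def firstField(line: String): Option[String] =
  firstFieldRegex.findFirstMatchIn(line).map(_.group(1))

val sample = List(
  "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
  "group: NT=app_1,hadoop-exec,sparkConnection,Ready")

// String.matches yields Booleans, which is what the question observed.
val booleans: List[Boolean] = sample.map(_.matches("""(^(.+?),)"""))

// flatMap drops lines without a comma instead of throwing.
val fields: List[String] = sample.flatMap(line => firstField(line))
```

Returning an Option means a line without a comma is skipped rather than causing an exception, which split(",")(0) also avoids, but a non-exhaustive regex pattern match would not.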

Try this regex:
^\s*([^,]+)(_\w+)?
To use this regex in your example, you can try:
val arr = Seq("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
"group: NT=app_1,hadoop-exec,sparkConnection,Ready",
"group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
"group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
"group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
val rd_var = spark.sparkContext.parallelize((arr).map((Row(_))))
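// Triple quotes keep the backslashes literal; .unanchored lets the pattern match just the start of the line
val pattern = """^\s*([^,]+)(_\w+)?""".r.unanchored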
rd_var.map {
case Row(str) => str match {
case pattern(gr1, _) => gr1
}
}.foreach(println(_))

With RDD:
val spark = SparkSession.builder().master("local[1]").getOrCreate()
val pattern = "([a-zA-Z0-9=:_ ]+),(.*)".r
val el = Seq("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
"group: NT=app_1,hadoop-exec,sparkConnection,Ready",
"group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
"group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
"group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
def main(args: Array[String]): Unit = {
val rdd = spark.sparkContext.parallelize((el).map((Row(_))))
rdd.map {
case Row(str) => str match {
case pattern(gr1, _) => gr1
}
}.foreach(println(_))
}
It gives:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app

Related

Parsing case statements in scala

I want to parse SQL CASE statements in Scala, for example:
CASE WHEN col1 <> 0 AND col2 <> 0 THEN 'COL1 & COL2 IS NOT ZERO' ELSE 'COL1 & COL2 IS ZERO'
The challenge is to cover every context in which a CASE statement can appear; for example, it can appear inside a function call. CASE statements, functions and so on can also be nested inside another CASE statement, and that has to be handled as well.
This problem can be solved with Scala parser combinators.
First, define the classes needed to represent the expressions:
sealed trait Exp {
def asStr: String
override def toString: String = asStr
}
case class OperationExp(a: Exp, op: String, b: Exp, c: Option[String]) extends Exp { override def asStr = s"$a $op $b ${c.getOrElse("")}" }
case class CaseConditions(conditionValue: List[(String, String)] , elseValue: String, asAlias: Option[Exp]) extends Exp {
override def asStr = "CASE " + conditionValue.map(c => s"WHEN ${c._1} THEN ${c._2}").mkString(" ") + s" ELSE ${elseValue} END ${asAlias.getOrElse("")}"
}
Now the solution:
val identifiers: Parser[String] = "[a-zA-Z0-9_~\\|,'\\-\\+:.()]+".r
val operatorTokens: Parser[String] = "[<>=!]+".r | ("IS NOT" | "IN" | "IS")
val conditionJoiner: Parser[String] = ( "AND" | "OR" )
val excludeKeywords = List("CASE","WHEN", "THEN", "ELSE", "END")
val identifierWithoutCaseKw: Parser[Exp] = Parser(input =>
identifiers(input).filterWithError(
!excludeKeywords.contains(_),
reservedWord => s"$reservedWord encountered",
input
)
) ^^ StrExp
val anyStrExp: Parser[Exp] = "[^()]*".r ^^ StrExp
val funcIdentifier: Parser[Exp] = name ~ ("(" ~> (caseConditionExpresionParser | funcIdentifier | anyStrExp) <~ ")") ^^ {case func ~ param => FunCallExp(func, Seq(param))}
val identifierOrFunctions = funcIdentifier | identifierWithoutCaseKw
val conditionParser: Parser[String] =
identifierOrFunctions ~ operatorTokens ~ identifierOrFunctions ~ opt(conditionJoiner) ^^ {
case a ~ op ~ b ~ c => s"$a $op $b ${c.getOrElse("")}"
}
def caseConditionExpresionParser: Parser[CaseConditions] = "CASE" ~ rep1("WHEN" ~ rep(conditionParser) ~ "THEN" ~ rep(identifierWithoutCaseKw)) ~ "ELSE" ~ rep(identifierWithoutCaseKw) ~ "END" ~ opt("AS" ~> identifierWithoutCaseKw)^^ {
case "CASE" ~ conditionValuePair ~ "ELSE" ~ falseValue ~ "END" ~ asName =>
CaseConditions(
conditionValuePair.map(cv => (
cv._1._1._2.mkString(" "),
parsePipes(cv._2.mkString(" ")).isRight match {
case true => parsePipes(cv._2.mkString(" ")).right.get
case _ => cv._2.mkString(" ")
}
)),
parsePipes(falseValue.mkString("")).isRight match {
case true => parsePipes(falseValue.mkString(" ")).right.get
case _ => falseValue.mkString("")
}, asName)
}
//this parser can be used to get the results
val caseExpression = caseConditionExpresionParser | funcIdentifier
def parsePipes(input: String): Either[Seq[ParsingError], String] = {
parse(caseExpression, input) match {
case Success(parsed, _) => Right(parsed.asStr)
case Failure(msg, next) => Left(Seq(ParsingError(s"Failed to parse $input: $msg, next: ${next.source}.")))
case Error(msg, next) => Left(Seq(ParsingError(s"Error in $input parse: $msg, next: ${next.source}.")))
}
}

How to do pattern matching on regex in a foreach function in scala?

I don't understand why this doesn't work (I get "no match" twice here):
val a = "aaa".r
val b = "bbb".r
List("aaa", "bbb").foreach {
case a(t) => println(t)
case b(t) => println(t)
case _ => println("no match")
}
The variable in parentheses is supposed to be bound to a capturing group, and your regexes have none.
Change your regexes to val a = "(aaa)".r; val b = "(bbb)".r, and it will do what you want.
Alternatively, change the match patterns so that they bind nothing:
List("aaa", "bbb").foreach {
case a() => println("aaa")
case b() => println("bbb")
case _ => println("no match")
}
Your pattern contains no capture group. You need to put parentheses around the part of the pattern you want to capture for the pattern matching to work:
val a = "(aaa)".r
// a: scala.util.matching.Regex = (aaa)
val b = "(bbb)".r
// b: scala.util.matching.Regex = (bbb)
List("aaa", "bbb").foreach {
case b(t) => println(t)
case a(t) => println(t)
case _ => println("no match")
}
//aaa
//bbb
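The rule behind both answers is that a regex used as a pattern yields one value per capturing group, so the number of binders in the case must equal the number of groups. Here is a small sketch that makes this visible through unapplySeq, the method the pattern match calls. The names noGroup, oneGroup and describe are made up for illustration.

```scala
val noGroup  = "aaa".r     // zero capturing groups
val oneGroup = "(bbb)".r   // one capturing group

// unapplySeq returns the list of captured groups when the whole string matches.
val zero: Option[List[String]] = noGroup.unapplySeq("aaa")
val one: Option[List[String]]  = oneGroup.unapplySeq("bbb")

def describe(s: String): String = s match {
  case noGroup()   => "aaa matched, nothing captured" // zero binders for zero groups
  case oneGroup(t) => s"captured $t"                  // one binder for one group
  case _           => "no match"
}
```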

Scala Escape Character Regex

How can I write an expression to filter inputs so that only those in the format
(AAA) are kept, where each A is a digit from 0-9?
Ex: (123), (592), (999)
Usually you want to do more than filter.
scala> val r = raw"\(\d{3}\)".r
r: scala.util.matching.Regex = \(\d{3}\)
scala> List("(123)", "xyz", "(456)").filter { case r() => true case _ => false }
res0: List[String] = List((123), (456))
scala> import PartialFunction.{cond => when}
import PartialFunction.{cond=>when}
scala> List("(123)", "xyz", "(456)").filter(when(_) { case r() => true })
res1: List[String] = List((123), (456))
Keeping all matches from each input:
scala> List("a(123)b", "xyz", "c(456)d").flatMap(s =>
| r.findAllMatchIn(s).map(_.matched).toList)
res2: List[String] = List((123), (456))
scala> List("a(123)b", "xyz", "c(456)d(789)e").flatMap(s =>
| r.findAllMatchIn(s).map(_.matched).toList)
res3: List[String] = List((123), (456), (789))
Keeping just the first:
scala> val r = raw"(\(\d{3}\))".r.unanchored
r: scala.util.matching.UnanchoredRegex = (\(\d{3}\))
scala> List("a(123)b", "xyz", "c(456)d(789)e").flatMap(r.unapplySeq(_: String)).flatten
res4: List[String] = List((123), (456))
scala> List("a(123)b", "xyz", "c(456)d(789)e").collect { case r(x) => x }
res5: List[String] = List((123), (456))
Keeping entire lines that match:
scala> List("a(123)b", "xyz", "c(456)d(789)e").collect { case s @ r(_*) => s }
res6: List[String] = List(a(123)b, c(456)d(789)e)
Java API:
scala> import java.util.regex._
import java.util.regex._
scala> val p = Pattern.compile(raw"(\(\d{3}\))")
p: java.util.regex.Pattern = (\(\d{3}\))
scala> val q = p.asPredicate
q: java.util.function.Predicate[String] = java.util.regex.Pattern$$Lambda$1107/824691524@3234474
scala> List("(123)", "xyz", "(456)").filter(q.test)
res0: List[String] = List((123), (456))
Typically you create regexes by using the .r method available on strings, such as "[0-9]".r. However, as you have noticed, you can't write regex escapes such as \d or \( directly in an ordinary string literal, because the compiler treats the backslash as the start of a string escape, not a regex escape.
To get around this, you can use Scala's triple-quoted strings, which produce exactly the character sequence you typed, including backslashes and newlines.
To create a regex like the one you describe, you could write """\(\d\d\d\)""".r. Here's an example of it in use:
val regex = """\(\d\d\d\)""".r.pattern
Seq("(123)", "(---)", "456").filter(str => regex.matcher(str).matches)
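To make the difference between the literal forms concrete, here is a short sketch comparing an ordinary literal with doubled backslashes, a triple-quoted literal, and the raw interpolator used in the other answer. The value names are illustrative only.

```scala
val escaped = "\\(\\d{3}\\)"     // ordinary literal: every backslash doubled
val triple  = """\(\d{3}\)"""    // triple-quoted: backslashes taken literally
val rawStr  = raw"\(\d{3}\)"     // raw interpolator: escapes are not processed

// All three spell the same regex source text.
val sameText: Boolean = escaped == triple && triple == rawStr

val areaCode = triple.r
val kept: Seq[String] =
  Seq("(123)", "(---)", "456").filter(s => areaCode.pattern.matcher(s).matches)
```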

Regex not matching

I expect the regex "[a-zA-Z]\\d{6}" to match "z999999", but it does not match: the result is a List containing an empty string:
val lines = List("z999999"); //> lines : List[String] = List(z999999)
val regex = """[a-zA-Z]\d{6}""".r //> regex : scala.util.matching.Regex = [a-zA-Z]\d{6}
val fi = lines.map(line => line match { case regex(group) => group case _ => "" })
//> fi : List[String] = List("")
Is there a problem with my regex, or with how I'm using it in Scala?
val l="z999999"
val regex = """[a-zA-Z]\d{6}""".r
regex.findAllIn(l).toList
res1: List[String] = List(z999999)
The regex seems valid.
lines.map( _ match { case regex(group) => group; case _ => "" })
res2: List[String] = List("")
How odd. Let's see what happens with a capturing group around the whole expression we defined in regex.
val regex2= """([a-zA-Z]\d{6})""".r
regex2: scala.util.matching.Regex = ([a-zA-Z]\d{6})
lines.map( _ match { case regex2(group) => group; case _ => "" })
res3: List[String] = List(z999999)
Huzzah.
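The unapplySeq method on a regex, which is what a pattern match calls, returns the contents of the capturing groups, so the pattern needs one group for each variable you bind.
There are other methods on a regex object that just return matches (e.g. findAllIn, findFirstIn, etc.).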
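One more difference between the two styles is worth knowing: a regex used in a pattern match must match the entire input, while findFirstIn searches anywhere in it. Here is a minimal sketch of that, with made-up names (id, whole, anywhere, viaUnanchored) and a made-up input string.

```scala
val id = """([a-zA-Z]\d{6})""".r

// Pattern matching is anchored: the whole input must match.
def whole(s: String): Option[String] = s match {
  case id(g) => Some(g)
  case _     => None
}

// findFirstIn searches for the pattern anywhere in the input.
def anywhere(s: String): Option[String] = id.findFirstIn(s)

// .unanchored makes the pattern match behave like a search.
val idAnywhere = id.unanchored
def viaUnanchored(s: String): Option[String] = s match {
  case idAnywhere(g) => Some(g)
  case _             => None
}
```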

Parsing a blank / whitespace with RegexParsers

What is the problem with parsing the blank/whitespace?
scala> object BlankParser extends RegexParsers {
def blank: Parser[Any] = " "
def foo: Parser[Any] = "foo"
}
defined module BlankParser
scala> BlankParser.parseAll(BlankParser.foo, "foo")
res15: BlankParser.ParseResult[Any] = [1.4] parsed: foo
scala> BlankParser.parseAll(BlankParser.blank, " ")
res16: BlankParser.ParseResult[Any] =
[1.2] failure: ` ' expected but ` ' found
^
scala>
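By default, RegexParsers throws whitespace away before each parser runs, so the blank is skipped before your parser sees it. To avoid this, add
override val skipWhitespace = false
to your parser object.
It seems this question has already been answered in "Scala parser combinators for language embedded in html or text (like php)".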