strange behaviour with filter? - regex

I want to extract MIME-like headers (starting with [Cc]ontent- ) from a multiline string:
scala> val regex = "[Cc]ontent-".r
regex: scala.util.matching.Regex = [Cc]ontent-
scala> headerAndBody
res2: String =
"Content-Type:application/smil
Content-ID:0.smil
content-transfer-encoding:binary
<smil><head>
"
This fails
scala> headerAndBody.lines.filter(x => regex.pattern.matcher(x).matches).toList
res4: List[String] = List()
but the "related" cases work as expected:
scala> headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
res5: List[String] = List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary, <smil><head>)
and:
scala> headerAndBody.lines.filter(x => x.startsWith("Content-")).toList
res8: List[String] = List(Content-Type:application/smil, Content-ID:0.smil)
what am I doing wrong in
x => regex.pattern.matcher(x).matches
since it returns an empty List??

The reason for the failure with the first line is that you use the java.util.regex.Matcher.matches() method that requires a full string match.
To fix that, use the Matcher.find() method that searches for the match anywhere inside the input string and use the "^[Cc]ontent-" regex (note that the ^ symbol will force the match to appear at the start of the string).
Note that this line of code does not work as you expect:
headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
Here the regex is checked against the literal string "Content-" instead of against x, so the check is true for every line (that is why you get all the lines in the result).
See this IDEONE demo:
val headerAndBody = "Content-Type:application/smil\nContent-ID:0.smil\ncontent-transfer-encoding:binary\n<smil><head>"
val regex = "^[Cc]ontent-".r
val s1 = headerAndBody.lines.filter(x => regex.pattern.matcher(x).find()).toList
println(s1)
val s2 = headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
println(s2)
Results (the first is the fix, and the second shows that your second line of code fails):
List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary)
List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary, <smil><head>)
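A variant that stays within scala.util.matching.Regex is sketched below. It assumes the same sample input as above and uses findPrefixOf, which succeeds only when the match begins at the start of the string, so no ^ anchor is needed. The object and value names are made up for illustration.

```scala
import scala.util.matching.Regex

object HeaderPrefixSketch {
  // Same sample input as in the demo above
  val headerAndBody: String =
    "Content-Type:application/smil\nContent-ID:0.smil\ncontent-transfer-encoding:binary\n<smil><head>"

  val headerPrefix: Regex = "[Cc]ontent-".r

  // findPrefixOf returns Some(...) only if the regex matches at the start of the line
  val headers: List[String] =
    headerAndBody.split("\n").toList.filter(line => headerPrefix.findPrefixOf(line).isDefined)
}
```

Splitting on "\n" is used here instead of lines, because on newer JDKs lines may resolve to the Java method that returns a Stream.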

Your regex has to match the whole line, not just the leading substring.
val regex = "[Cc]ontent-.*".r
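To see why the trailing .* matters, here is a small sketch (the object and helper names are hypothetical). It compares the prefix-only pattern with the padded one when both are checked with matches.

```scala
object FullLineMatchSketch {
  private val prefixOnly = "[Cc]ontent-".r
  private val wholeLine  = "[Cc]ontent-.*".r

  // matches requires the pattern to cover the entire input,
  // so the prefix-only pattern rejects any line with text after "Content-"
  def matchesPrefixOnly(line: String): Boolean =
    prefixOnly.pattern.matcher(line).matches

  // With .* appended, the rest of the line is consumed and matches succeeds
  def isHeader(line: String): Boolean =
    wholeLine.pattern.matcher(line).matches
}
```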

Related

Scala: Regex Pattern Matching

I have the following input strings
"/horses/c132?XXX=abc-049#companyorg"
"/Goats/b-01?XXX=abc-721#"
"/CATS/001?XXX=abc-451#CompanyOrg"
I'd like to obtain the following as output
"horses", "c132", "abc-049#companyorg"
"Goats", "b-01", "abc-721#"
"CATS", "001", "abc-451#CompanyOrg"
I tried the following
StandardTokenParsers
import scala.util.parsing.combinator.syntactical._
val p = new StandardTokenParsers {
lexical.reserved ++= List("/", "?", "XXX=")
def p = "/" ~ opt(ident) ~ "/" ~ opt(ident) ~ "?" ~ "XXX=" ~ opt(ident)
}
p: scala.util.parsing.combinator.syntactical.StandardTokenParsers{def p: this.Parser[this.~[this.~[this.~[String,Option[String]],String],Option[String]]]} = $anon$1@6ca97ddf
scala> p.p(new p.lexical.Scanner("/horses/c132?XXX=abc-049#companyorg"))
warning: there was one feature warning; re-run with -feature for details
res3: p.ParseResult[p.~[p.~[p.~[String,Option[String]],String],Option[String]]] =
[1.1] failure: ``/'' expected but ErrorToken(illegal character) found
/horses/c132?XXX=abc-049#companyorg
^
RegEx
import scala.util.matching.Regex
val p1 = "(/)(.*)(/)(.*)(?)(XXX)(=)(.*)".r
p1: scala.util.matching.Regex = (/)(.*)(/)(.*)(?)(XXX)(=)(.*)
scala> val p1(_,animal,_,id,_,_,_,company) = "/horses/c132?XXX=abc-049#companyorg"
scala.MatchError: /horses/c132?XXX=abc-049#companyorg (of class java.lang.String)
... 32 elided
Can someone please help? Thanks!
Your pattern looks like /(desired-group1)/(desired-group2)?XXX=(desired-group3).
So, regex would be
scala> val extractionPattern = """(/)(.*)(/)(.*)(\?XXX=)(.*)""".r
extractionPattern: scala.util.matching.Regex = (/)(.*)(/)(.*)(\?XXX=)(.*)
Note: the ? character must be escaped.
How it is going to work is,
Full match `/horses/c132?XXX=abc-049#companyorg`
Group 1. `/`
Group 2. `horses`
Group 3. `/`
Group 4. `c132`
Group 5. `?XXX=`
Group 6. `abc-049#companyorg`
Now, apply the regex which gives you the group of all matches
scala> extractionPattern.findAllIn("""/horses/c132?XXX=abc-049#companyorg""")
.matchData.flatMap{m => m.subgroups}.toList
res15: List[String] = List(/, horses, /, c132, ?XXX=, abc-049#companyorg)
Since you only care about the 2nd, 4th and 6th groups, collect only those (indices 1, 3 and 5 of subgroups).
So the solution would look like,
scala> extractionPattern.findAllIn("""/horses/c132?XXX=abc-049#companyorg""")
.matchData.map(_.subgroups)
.flatMap(matches => Seq(matches(1), matches(3), matches(5))).toList
res16: List[String] = List(horses, c132, abc-049#companyorg)
When your input does not match the regex, you get an empty result
scala> extractionPattern.findAllIn("""/horses/c132""")
.matchData.map(_.subgroups)
.flatMap(matches => Seq(matches(1), matches(3), matches(5))).toList
res17: List[String] = List()
Working regex here - https://regex101.com/r/HuGRls/1/
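As an alternative sketch, the delimiters can be left outside the capture groups so that only the three wanted parts are captured, and the regex can then be used directly as an extractor. The pattern and the names below are illustrative, not taken from the answer above, and they assume the first segment contains no / and the second contains no ?.

```scala
object PathExtractSketch {
  // Only the three wanted parts are captured; "/", "?XXX=" stay outside the groups
  private val pathPattern = """/([^/]+)/([^?]+)\?XXX=(.*)""".r

  // A regex used as an extractor must match the entire input
  def extract(input: String): Option[(String, String, String)] = input match {
    case pathPattern(animal, id, company) => Some((animal, id, company))
    case _                                => None
  }
}
```

Returning an Option makes the no-match case explicit instead of raising a MatchError.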

How to pull string value in url using scala regex?

I have the URLs below in my application, and I want to extract one of the values from them.
For example:
rapidView value 416
Input URL: http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&
Output should be: 416
I've written the code in scala using import java.util.regex.{Matcher, Pattern}
val p: Pattern = Pattern.compile("[?&]rapidView=(\\d+)[?&]")
val m:Matcher = p.matcher(url)
if(m.find())
println(m.group(1))
The code above works with the Java utilities, but I want to migrate it to Scala's scala.util.matching library.
How can I implement this simply?
In Scala, you may use an unanchored regex within a match block to get just the captured part:
val s = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
val pattern ="""[?&]rapidView=(\d+)""".r.unanchored
val res = s match {
case pattern(rapidView) => rapidView
case _ => ""
}
println(res)
// => 416
See the Scala demo
Details:
"""[?&]rapidView=(\d+)""".r.unanchored - the triple quoted string literal allows using single backslashes with regex escapes, and the .unanchored property makes the regex match partially, not the entire string
pattern(rapidView) gets the 1 or more digits part (captured with (\d+)) if a pattern finds a partial match
case _ => "" will return an empty string upon no match.
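If an Option is preferred over an empty-string fallback, a sketch along these lines may help. It uses findFirstMatchIn, which searches anywhere in the input, so .unanchored is not needed. The object and method names are hypothetical.

```scala
object RapidViewSketch {
  private val rapidView = """[?&]rapidView=(\d+)""".r

  // findFirstMatchIn returns the first match found anywhere in the input, if any;
  // group(1) is the digits captured by (\d+)
  def rapidViewId(url: String): Option[String] =
    rapidView.findFirstMatchIn(url).map(_.group(1))
}
```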
You can do this quite easily with Scala:
scala> val url = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
url: String = http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&
scala> url.split("rapidView=").tail.head.split("&").head
res0: String = 416
You can also extend it by parameterizing the search word:
scala> def searchParam(sp: String) = sp + "="
searchParam: (sp: String)String
scala> val sw = "rapidView"
sw: String = rapidView
And just search with the parameter name
scala> url.split(searchParam(sw)).tail.head.split("&").head
res1: String = 416
scala> val sw2 = "projectKey"
sw2: String = projectKey
scala> url.split(searchParam(sw2)).tail.head.split("&").head
res2: String = DSCI
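Note that the split-based lookup throws when the parameter is absent (tail.head on an empty array), and that String.split treats its argument as a regular expression. A more defensive sketch, assuming the input is a well-formed URI, parses the query string with java.net.URI and returns an Option. The object and method names are made up for illustration.

```scala
object QueryParamSketch {
  // Assumes url is a syntactically valid URI; java.net.URI throws URISyntaxException otherwise
  def queryParam(url: String, name: String): Option[String] = {
    // getQuery returns null when the URI has no query part
    val query = Option(new java.net.URI(url).getQuery).getOrElse("")
    query.split("&").iterator
      .map(_.split("=", 2))                                  // split each pair into key and value
      .collectFirst { case Array(k, v) if k == name => v }   // first pair whose key matches
  }
}
```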

Scala, regex matching ignore unnecessary words

My program is:
val pattern = "[*]prefix_([a-zA-Z]*)_[*]".r
val outputFieldMod = "TRASHprefix_target_TRASH"
var tar =
outputFieldMod match {
case pattern(target) => target
}
println(tar)
Basically, I try to get the "target" and ignore "TRASH" (I used *). But it has some error and I am not sure why..
Simple and straight forward standard library function (unanchored)
Use Unanchored
Solution one
Use unanchored on the pattern to match inside the string ignoring the trash
val pattern = "prefix_([a-zA-Z]*)_".r.unanchored
unanchored will only match the pattern ignoring all the trash (all the other words)
val result = str match {
case pattern(value) => value
case _ => ""
}
Example
Scala REPL
scala> val pattern = """foo\((.*)\)""".r.unanchored
pattern: scala.util.matching.UnanchoredRegex = foo\((.*)\)
scala> val str = "blahblahfoo(bar)blahblah"
str: String = blahblahfoo(bar)blahblah
scala> str match { case pattern(value) => value ; case _ => "no match" }
res3: String = bar
Solution two
Pad your pattern on both sides with .*, which matches zero or more characters other than line breaks.
val pattern = ".*prefix_([a-zA-Z]*)_.*".r
val result = str match {
case pattern(value) => value
case _ => ""
}
Example
Scala REPL
scala> val pattern = """.*foo\((.*)\).*""".r
pattern: scala.util.matching.Regex = .*foo\((.*)\).*
scala> val str = "blahblahfoo(bar)blahblah"
str: String = blahblahfoo(bar)blahblah
scala> str match { case pattern(value) => value ; case _ => "no match" }
res4: String = bar
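Applied to the input from the question, both solutions can be sketched as follows. The third pattern is the one from the question, included to show that [*] is a character class matching a literal asterisk rather than a wildcard. The object and method names are hypothetical.

```scala
object TargetSketch {
  private val unanchoredPattern = "prefix_([a-zA-Z]*)_".r.unanchored
  private val paddedPattern     = ".*prefix_([a-zA-Z]*)_.*".r
  // Pattern from the question: [*] matches one literal '*' character
  private val literalStars      = "[*]prefix_([a-zA-Z]*)_[*]".r

  // Solution one: partial match via unanchored
  def viaUnanchored(s: String): Option[String] = s match {
    case unanchoredPattern(target) => Some(target)
    case _                         => None
  }

  // Solution two: full match with .* padding on both sides
  def viaPadded(s: String): Option[String] = s match {
    case paddedPattern(target) => Some(target)
    case _                     => None
  }

  // The original pattern only matches input that really contains asterisks
  def viaLiteralStars(s: String): Option[String] = s match {
    case literalStars(target) => Some(target)
    case _                    => None
  }
}
```

The case _ branch is what prevents the MatchError that a match with a single case raises when the pattern does not apply.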
This will work, val pattern = ".*prefix_([a-z]+).*".r, but it distinguishes between target and trash via lower/upper-case letters. Whatever determines real target data from trash data will determine the real regex pattern.

apache-spark regex extract words from rdd

I try to extract words from a textfile.
Textfile:
"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"
The following works well:
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)
scala> data.count
res14: Long = 3
scala> all.count
res11: Long = 1419
But I want to extract the words for every line.
If i type
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
i get
scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
found : Char
required: CharSequence
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
What am I doing wrong?
Thanks in advance
Thank you for your Answer.
The goal was to count the occurrences of words from a pos/neg word list.
Seems this works:
// load inputfile
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
// load wordlists
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).cache().collect().toSet
val neg_words = sc.textFile(neg_file).cache().collect().toSet
// RegEx
val regexpr = """[a-zA-Z]+""".r
val separated = data.map(line => regexpr.findAllIn(line).toList)
// #_of_words - #_of_pos_words_ - #_of_neg_words
val counts = separated.map(list => (list.size,(list.filter(pos => pos_words contains pos)).size, (list.filter(neg => neg_words contains neg)).size))
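The per-line counting step can also be tried without Spark. The sketch below assumes the word lists are already available as plain Sets; the object name, method name and parameters are hypothetical.

```scala
object WordCountSketch {
  private val word = "[a-zA-Z]+".r

  // Returns (number of words, number of positive words, number of negative words)
  def counts(line: String, posWords: Set[String], negWords: Set[String]): (Int, Int, Int) = {
    val words = word.findAllIn(line.toLowerCase).toList
    (words.size, words.count(posWords.contains), words.count(negWords.contains))
  }
}
```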
Your problem is not specific to Apache Spark: your first map gives you a line, but the flatMap on that line iterates over the characters of the line String. So, Spark or not, your code won't work; for example, in a Scala REPL:
> val lines = List("Line1 with words to extract",
"Line2 with words to extract",
"Line3 with words to extract")
> lines.map( line => line.flatMap("[a-zA-Z]+".r findAllIn _))
<console>:9: error: type mismatch;
found : Char
required: CharSequence
So if you really want, using your regexp, all the words in your line, just use flatMap once :
scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
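If the words should stay grouped per line, as the question asks, a plain-Scala sketch would use map with findAllIn on each whole line instead of flatMap. The sample lines are those from the question; the object and value names are made up.

```scala
object WordsPerLineSketch {
  private val word = "[a-zA-Z]+".r

  val lines: List[String] = List(
    "Line1 with words to extract",
    "Line2 with words to extract",
    "Line3 with words to extract")

  // map (not flatMap) keeps one list of words per input line;
  // the digits in "Line1" are not letters, so only "Line" is extracted
  val wordsPerLine: List[List[String]] =
    lines.map(line => word.findAllIn(line).toList)
}
```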

Scala regex pattern match groups different from that using findAllIn

I find that the groups extracted by Pattern-matching on regex's in Scala are different from those extracted using findAllIn function.
1) Here is an example of extraction using pattern match -
scala> val fullRegex = """(.+?)=(.+?)""".r
fullRegex: scala.util.matching.Regex = (.+?)=(.+?)
scala> val x = """a='b'"""
x: String = a='b'
scala> x match { case fullRegex(l,r) => println( l ); println(r) }
a
'b'
2) And here is an example of extraction using the findAllIn function -
scala> fullRegex.findAllIn(x).toArray
res4: Array[String] = Array(a=')
I was expecting the returned Array using findAllIn to be Array(a, 'b'). Why is it not so?
This is because you have not specified to what extent the second lazy match should go. So after = it consumes just one character and stops as it is in lazy mode.
See here.
https://regex101.com/r/dU7oN5/10
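Change it to .+?=.+ to match the full string. findAllIn then returns Array(a='b'), since it yields whole matches rather than capture groups.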
In particular, the pattern match's use of unapplySeq uses Matcher.matches, while findAllIn uses Matcher.find. matches tries to match entire input.
scala> import java.util.regex._
import java.util.regex._
scala> val p = Pattern compile ".+?"
p: java.util.regex.Pattern = .+?
scala> val m = p matcher "hello"
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=.+? region=0,5 lastmatch=]
scala> m.matches
res0: Boolean = true
scala> m.group
res1: String = hello
scala> m.reset
res2: java.util.regex.Matcher = java.util.regex.Matcher[pattern=.+? region=0,5 lastmatch=]
scala> m.find
res3: Boolean = true
scala> m.group
res4: String = h
scala>
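To get the capture groups from a search rather than the whole matches, one option is findAllMatchIn together with subgroups. The sketch below contrasts it with findAllIn on the input from the question; the object and method names are hypothetical.

```scala
object GroupsSketch {
  private val lazyBoth  = """(.+?)=(.+?)""".r
  private val greedyRhs = """(.+?)=(.+)""".r

  // findAllIn yields whole matches (group 0), located with Matcher.find
  def wholeMatches(s: String): List[String] =
    lazyBoth.findAllIn(s).toList

  // findAllMatchIn yields Match objects; subgroups exposes their capture groups
  def groups(s: String): List[List[String]] =
    greedyRhs.findAllMatchIn(s).map(_.subgroups).toList
}
```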