Scala: Regex Pattern Matching

I have the following input strings
"/horses/c132?XXX=abc-049#companyorg"
"/Goats/b-01?XXX=abc-721#"
"/CATS/001?XXX=abc-451#CompanyOrg"
I'd like to obtain the following as output
"horses", "c132", "abc-049#companyorg"
"Goats", "b-01", "abc-721#"
"CATS", "001", "abc-451#CompanyOrg"
I tried the following
StandardTokenParsers
import scala.util.parsing.combinator.syntactical._
val p = new StandardTokenParsers {
lexical.reserved ++= List("/", "?", "XXX=")
def p = "/" ~ opt(ident) ~ "/" ~ opt(ident) ~ "?" ~ "XXX=" ~ opt(ident)
}
p: scala.util.parsing.combinator.syntactical.StandardTokenParsers{def p: this.Parser[this.~[this.~[this.~[String,Option[String]],String],Option[String]]]} = $anon$1@6ca97ddf
scala> p.p(new p.lexical.Scanner("/horses/c132?XXX=abc-049#companyorg"))
warning: there was one feature warning; re-run with -feature for details
res3: p.ParseResult[p.~[p.~[p.~[String,Option[String]],String],Option[String]]] =
[1.1] failure: ``/'' expected but ErrorToken(illegal character) found
/horses/c132?XXX=abc-049#companyorg
^
RegEx
import scala.util.matching.Regex
val p1 = "(/)(.*)(/)(.*)(?)(XXX)(=)(.*)".r
p1: scala.util.matching.Regex = (/)(.*)(/)(.*)(?)(XXX)(=)(.*)
scala> val p1(_,animal,_,id,_,_,_,company) = "/horses/c132?XXX=abc-049#companyorg"
scala.MatchError: /horses/c132?XXX=abc-049#companyorg (of class java.lang.String)
... 32 elided
Can someone please help? Thanks!

Your input has the shape /(desired-group1)/(desired-group2)?XXX=(desired-group3).
So the regex would be:
scala> val extractionPattern = """(/)(.*)(/)(.*)(\?XXX=)(.*)""".r
extractionPattern: scala.util.matching.Regex = (/)(.*)(/)(.*)(\?XXX=)(.*)
Note that the ? character must be escaped, because an unescaped ? is a regex metacharacter.
This is how the groups break down:
Full match `/horses/c132?XXX=abc-049#companyorg`
Group 1. `/`
Group 2. `horses`
Group 3. `/`
Group 4. `c132`
Group 5. `?XXX=`
Group 6. `abc-049#companyorg`
Now apply the regex, which gives you all the captured groups of each match:
scala> extractionPattern.findAllIn("""/horses/c132?XXX=abc-049#companyorg""")
.matchData.flatMap{m => m.subgroups}.toList
res15: List[String] = List(/, horses, /, c132, ?XXX=, abc-049#companyorg)
Since you only care about the 2nd, 4th and 6th groups, collect only those (indices 1, 3 and 5 of the zero-based subgroups list).
So the solution would look like this:
scala> extractionPattern.findAllIn("""/horses/c132?XXX=abc-049#companyorg""")
.matchData.map(_.subgroups)
.flatMap(matches => Seq(matches(1), matches(3), matches(5))).toList
res16: List[String] = List(horses, c132, abc-049#companyorg)
When your input does not match the regex, you get an empty result:
scala> extractionPattern.findAllIn("""/horses/c132""")
.matchData.map(_.subgroups)
.flatMap(matches => Seq(matches(1), matches(3), matches(5))).toList
res17: List[String] = List()
Working regex here - https://regex101.com/r/HuGRls/1/
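If only the three fields are needed, one option is to capture just those and leave the separators uncaptured, so no groups have to be picked by index. This is a minimal sketch, assuming the first segment contains no / and the second contains no ?. The names PathPattern and parse are made up for illustration:

```scala
import scala.util.matching.Regex

// Only the three wanted parts are captured; "/" and "?XXX=" are matched but not captured.
val PathPattern: Regex = """/([^/]+)/([^?]+)\?XXX=(.*)""".r

// Returns the three fields, or None when the input has a different shape.
def parse(input: String): Option[(String, String, String)] = input match {
  case PathPattern(animal, id, company) => Some((animal, id, company))
  case _                                => None
}

val parsed = parse("/horses/c132?XXX=abc-049#companyorg")
// parsed == Some(("horses", "c132", "abc-049#companyorg"))
```

Returning an Option keeps the no-match case explicit instead of raising a MatchError.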

Related

Regex doesn't work when newline is at the end of the string

Exercise: given a string with a name, then a space or newline, then an email, then optionally a newline and some text separated by newlines, capture the name and the domain of the email.
So I created the following:
val regexp = "^([a-zA-Z]+)(?:\\s|\\n)\\w+#(\\w+\\.\\w+)(?:.|\\r|\\n)*".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
And started testing:
scala> val input = "oleg oleg#email.com"
scala> fun(input)
res17: String = oleg email.com
scala> val input = "oleg\noleg#email.com"
scala> fun(input)
res18: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3"""
scala> fun(input)
res19: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3
| """
scala> fun(input)
res20: String = invalid
Why doesn't the regexp capture the string with the newline at the end?
The part (?:\\s|\\n) can be shortened to \s, since \s also matches a newline. Because your multi-line inputs still have a space before the email, it can be \s+ to repeat it 1 or more times.
Matching any character with (?:.|\\r|\\n)* is very inefficient due to the alternation. You can use either [\S\s]* or the inline modifier (?s) to make the dot match a newline.
But if you only need the name and the domain of the email, you don't have to match what comes after them, since only the 2 capturing groups are used in the output:
^([a-zA-Z]+)\s+\w+#(\w+\.\w+)
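Keep in mind that str match { case regexp(...) } only succeeds when the regex matches the whole string, so the shortened pattern needs an API that accepts a partial match. Here is a minimal sketch using findFirstMatchIn. The names shortPattern and nameAndDomain are hypothetical, and # is kept as the separator used in the question:

```scala
import scala.util.matching.Regex

val shortPattern: Regex = """^([a-zA-Z]+)\s+\w+#(\w+\.\w+)""".r

// findFirstMatchIn searches for a match inside the string, so trailing
// text (including a final newline) does not have to be consumed.
def nameAndDomain(str: String): String =
  shortPattern.findFirstMatchIn(str) match {
    case Some(m) => m.group(1) + " " + m.group(2)
    case None    => "invalid"
  }
```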
If you do want to match all that follows, you can use:
val regexp = """(?s)^([a-zA-Z]+)\s+\w+#(\w+\.\w+).*""".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
Note that the pattern \w+#(\w+\.\w+) is very limited for matching an email address.

How to pull string value in url using scala regex?

I have the URLs below in my application, and I want to extract one of the values from them.
For example:
rapidView value: 416
Input URL: http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&
Output should be: 416
I've written the code in scala using import java.util.regex.{Matcher, Pattern}
val p: Pattern = Pattern.compile("[?&]rapidView=(\\d+)[?&]")
val m:Matcher = p.matcher(url)
if(m.find())
println(m.group(1))
This gives the expected output, but I want to migrate it to the scala.util.matching library.
How can I implement this simply?
The code above works with the Java utilities.
In Scala, you may use an unanchored regex within a match block to get just the captured part:
val s = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
val pattern = """[?&]rapidView=(\d+)""".r.unanchored
val res = s match {
case pattern(rapidView) => rapidView
case _ => ""
}
println(res)
// => 416
Details:
"""[?&]rapidView=(\d+)""".r.unanchored - the triple quoted string literal allows using single backslashes with regex escapes, and the .unanchored property makes the regex match partially, not the entire string
pattern(rapidView) gets the 1 or more digits part (captured with (\d+)) if a pattern finds a partial match
case _ => "" will return an empty string upon no match.
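To make the effect of .unanchored concrete, here is a small sketch that applies the same pattern both ways. The helper name firstGroup is made up for illustration:

```scala
import scala.util.matching.Regex

val anchored: Regex = """[?&]rapidView=(\d+)""".r
val partial: Regex  = anchored.unanchored

// Hypothetical helper that applies a regex as an extractor.
def firstGroup(r: Regex, s: String): Option[String] = s match {
  case r(value) => Some(value)
  case _        => None
}

val url = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
val viaAnchored = firstGroup(anchored, url) // None: the pattern does not cover the whole URL
val viaPartial  = firstGroup(partial, url)  // Some("416")
```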
You can do this quite easily with Scala:
scala> val url = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
url: String = http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&
scala> url.split("rapidView=").tail.head.split("&").head
res0: String = 416
You can also extend it by parameterizing the search word:
scala> def searchParam(sp: String) = sp + "="
searchParam: (sp: String)String
scala> val sw = "rapidView"
sw: String = rapidView
And just search with the parameter name
scala> url.split(searchParam(sw)).tail.head.split("&").head
res1: String = 416
scala> val sw2 = "projectKey"
sw2: String = projectKey
scala> url.split(searchParam(sw2)).tail.head.split("&").head
res2: String = DSCI
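One caveat with the split approach: .tail.head throws a NoSuchElementException when the parameter is absent. A possible alternative is a lookup that returns an Option. This is only a sketch; the name queryParam is hypothetical, and it does no URL decoding:

```scala
import scala.util.matching.Regex

// Looks up one query parameter by name; None when it is absent.
// Regex.quote keeps characters in the name from being read as regex syntax.
def queryParam(url: String, name: String): Option[String] = {
  val pattern = ("[?&]" + Regex.quote(name) + "=([^&#]*)").r
  pattern.findFirstMatchIn(url).map(_.group(1))
}

val url = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
val rapidView = queryParam(url, "rapidView") // Some("416")
val missing   = queryParam(url, "sprint")    // None
```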

strange behaviour with filter?

I want to extract MIME-like headers (starting with [Cc]ontent- ) from a multiline string:
scala> val regex = "[Cc]ontent-".r
regex: scala.util.matching.Regex = [Cc]ontent-
scala> headerAndBody
res2: String =
"Content-Type:application/smil
Content-ID:0.smil
content-transfer-encoding:binary
<smil><head>
"
This fails
scala> headerAndBody.lines.filter(x => regex.pattern.matcher(x).matches).toList
res4: List[String] = List()
but the "related" cases work as expected:
scala> headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
res5: List[String] = List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary, <smil><head>)
and:
scala> headerAndBody.lines.filter(x => x.startsWith("Content-")).toList
res8: List[String] = List(Content-Type:application/smil, Content-ID:0.smil)
What am I doing wrong in
x => regex.pattern.matcher(x).matches
such that it returns an empty List?
The first line fails because you use the java.util.regex.Matcher.matches() method, which requires a full string match.
To fix that, use the Matcher.find() method, which searches for a match anywhere inside the input string, together with the "^[Cc]ontent-" regex (the ^ anchor forces the match to appear at the start of the string).
Note that this line of code does not work as you expect:
headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
You run the regex check against the constant string "Content-" instead of against x, and that check is always true (that is why you get all the lines in the result).
Here is a demo:
val headerAndBody = "Content-Type:application/smil\nContent-ID:0.smil\ncontent-transfer-encoding:binary\n<smil><head>"
val regex = "^[Cc]ontent-".r
val s1 = headerAndBody.lines.filter(x => regex.pattern.matcher(x).find()).toList
println(s1)
val s2 = headerAndBody.lines.filter(x => regex.pattern.matcher("Content-").matches).toList
println(s2)
Results (the first is the fix, and the second shows that your second line of code does not filter anything):
List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary)
List(Content-Type:application/smil, Content-ID:0.smil, content-transfer-encoding:binary, <smil><head>)
Your regexp should match the whole line, not just its first substring.
val regex = "[Cc]ontent-.*".r
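The difference between whole-line and prefix matching can be checked side by side. The sketch below keeps the original "[Cc]ontent-" pattern and uses Regex.findPrefixOf, which only requires a match at the start of the input. It splits on "\n" rather than calling lines, since how lines resolves can depend on the Scala and JDK versions in use:

```scala
val headerAndBody =
  "Content-Type:application/smil\nContent-ID:0.smil\ncontent-transfer-encoding:binary\n<smil><head>"
val regex = "[Cc]ontent-".r

val lines = headerAndBody.split("\n").toList

// matches() must consume the whole line, so a bare prefix pattern never succeeds here.
val viaMatches = lines.filter(x => regex.pattern.matcher(x).matches)

// findPrefixOf succeeds when the pattern matches at the start of the line.
val viaPrefix = lines.filter(x => regex.findPrefixOf(x).isDefined)
```

Here viaMatches is empty, while viaPrefix keeps the three header lines.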

regular expression matching string in scala

I have a string like this
result: String = /home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountDetailsMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountDetailsMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorAddressMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorAddressMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData.class
/home/a/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/a/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
regex: scala.util.matching.Regex = (\\/([u|s|r])\\/([s|h|a|r|e]))
x: scala.util.matching.Regex.MatchIterator = empty iterator
Out of this, how can I get only the part /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar? This part can be anywhere in the string. I tried using a regular expression in Scala but don't know how to handle forward slashes, so could somebody please explain how to do this in Scala?
What are your search criteria? Your pattern seems to be wrong.
In your regexp, I see [u|s|r]. That is a character class, which matches a single character: u, s, r, or a literal |. It does not match the sequence usr.
how can I get only this part
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar and
this part can be anywhere in the string
If you are looking for a path, see the example below:
scala> val input = """/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
| /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar"""
input: String =
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp = "/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar".r
myRegExp: scala.util.matching.Regex = /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp2 = "helloWorld.jar".r
myRegExp2: scala.util.matching.Regex = helloWorld.jar
scala> (myRegExp findAllIn input) foreach( println)
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> (myRegExp2 findAllIn input) foreach( println)
scala>
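Two details of the example above may matter in practice: an unescaped . in the pattern matches any character, and a substring search also hits /home/usr/share/..., which is why two results are printed. The following sketch, with made-up names target, anywhere and wholeLine and a shortened input, uses Regex.quote to treat the path literally and (?m) with ^ and $ to accept only whole lines:

```scala
import scala.util.matching.Regex

val target = "/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar"

val input = List(
  "/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar",
  "/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar",
  "/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar"
).mkString("\n")

// Regex.quote turns the path into a literal, so "." no longer matches any character.
val anywhere: Regex = Regex.quote(target).r

// (?m) lets ^ and $ match at line boundaries, so only whole lines count.
val wholeLine: Regex = ("(?m)^" + Regex.quote(target) + "$").r

val substringHits = anywhere.findAllIn(input).toList  // two hits: lines 2 and 3 contain the path
val exactHits     = wholeLine.findAllIn(input).toList // one hit: only line 3 is exactly the path
```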

how do I extract substring (group) using regex without knowing if regex matches?

I want to use this
val r = """^myprefix:(.*)""".r
val r(suffix) = line
println(suffix)
But it gives an error when the string doesn't match. How do I use a similar construct where matching is optional?
Edit: To make it clear, I need the group (.*)
You can extract match groups via pattern matching.
val r = """^myprefix:(.*)""".r
line match {
case r(group) => group
case _ => ""
}
Another way using Option:
Option(line) collect { case r(group) => group }
"""^myprefix:(.*)""".r // Regex
.findFirstMatchIn(line) // Option[Match]
.map(_ group 1) // Option[String]
This has the advantage that you can write it as a one-liner without needing to assign the regex to an intermediate value r.
In case you're wondering, group 0 is the matched string while group 1 etc are the capture groups.
Try:
r.findFirstIn(line)
Update:
scala> val rgx = """^myprefix:(.*)""".r
rgx: scala.util.matching.Regex = ^myprefix:(.*)
scala> val line = "myprefix:value"
line: java.lang.String = myprefix:value
scala> for (rgx(group) <- rgx.findFirstIn(line)) yield group
res0: Option[String] = Some(value)
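The variants above all come down to producing an Option[String] for the group. As a minimal sketch, this can be wrapped in a small function so callers decide what to do on no match. The names Prefixed and suffixOf are hypothetical:

```scala
val Prefixed = """^myprefix:(.*)""".r

// Some(group) when the line matches, None otherwise; no exception is thrown.
def suffixOf(line: String): Option[String] = line match {
  case Prefixed(suffix) => Some(suffix)
  case _                => None
}

val found   = suffixOf("myprefix:value")               // Some("value")
val absent  = suffixOf("something else")               // None
val orEmpty = suffixOf("something else").getOrElse("") // "" as a fallback
```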