Scala regex: extract values from a string

I am trying to extract a few values from a long string, but I am having a hard time. I have tried a couple of regex patterns, and they always give me no match. They seem to work on the online regex testers, but not in Scala. What I am trying to do is:
Input :
ESSSTOR\Disk&Ven_VendorName&Prod_MO_Might_MS_5.0&Rev_6.01\08765J54U3K4QVR0&0
Extract [output]:
VendorName
MO_Might_MS_5.0&Rev_6.01
08765J54U3K4QVR0&0
I am trying to extract those three values from the input string, but I am unable to do so.
Can someone please help me see what I am doing wrong?
Thanks in advance.
//Input value
val device:String= "ESSSTOR\\Disk&Ven_VendorName&Prod_MO_Might_MS_5.0&Rev_6.01\\08765J54U3K4QVR0&0"
// Regex build for product extraction
val proReg= """.*[Prod_]([^\\\\]*)""".r
// """.*Prod_([^\\\\]*)""".r -- no match as output
// """(?:Prod_)([^\\\\]*)""".r -- no match as output
println("Device: "+device)
// method -1:
device match{
case proReg(prVal) => println(s"$prVal is product val")
case _ => println("no match") }
// method-2 :
val proReg(g1) = "ESSSTOR\\Disk&Ven_VendorName&Prod_MO_Might_MS_5.0&Rev_6.01\\08765J54U3K4QVR0&0"
println(s"group1: $g1 ")
O/P:
Device: ESSSTOR\Disk&Ven_VendorName&Prod_MO_Might_MS_5.0&Rev_6.01\08765J54U3K4QVR0&0
//method-1
no match
// method-2
error
// Regex build for dev serial
val serReg = """(?:Prod_\\S*[\\\\])(.*)""".r
device match {
case serReg(srVal) => println(s"$srVal is product val")
case _ => println("no match")
}
o/p:
no match
// Regex for vendor
val venReg="""(?:Ven_)([^&]*)""".r
device match {
case venReg(vnVal) => println(s"$vnVal is vendor val")
case _ => println("no match")
}
o/p:
no match

See if this gets closer to what you want.
val pttrn = raw"Ven_([^&]+)&Prod_([^&]+)&Rev_6.01\\(.*)".r.unanchored
device match {
case pttrn(ven, prod, rev) =>
s"vendor: $ven\nproduct: $prod\nrevNum: $rev"
case _ => "does not match pattern"
}
Explanation:
Ven_([^&]+) --> Look for the text Ven_, then capture everything up to the next ampersand &.
&Prod_([^&]+) --> That must be followed by &Prod_. Capture everything up to the next ampersand &.
&Rev_6.01\\(.*) --> That must be followed by &Rev_6.01 and a single backslash \. Capture everything that follows (the code binds this to rev, although it is the part after the revision).
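The question asked for the product value to include the Rev_ part and for the serial to be the text after the last backslash. A minimal sketch of that variant follows, assuming the product always runs up to the next backslash; the object name DeviceFields and the method parse are made up for illustration.

```scala
object DeviceFields {
  // Assumed layout: Ven_<vendor>&Prod_<product>\<serial>.
  // [^\\]+ is "one or more characters that are not a backslash".
  private val Fields = """Ven_([^&]+)&Prod_([^\\]+)\\(.*)""".r.unanchored

  // Returns (vendor, product, serial) when the pattern is found anywhere in the input.
  def parse(device: String): Option[(String, String, String)] = device match {
    case Fields(vendor, product, serial) => Some((vendor, product, serial))
    case _                               => None
  }
}
```

With the device string from the question, DeviceFields.parse(device) should yield VendorName, MO_Might_MS_5.0&Rev_6.01 and 08765J54U3K4QVR0&0. The .unanchored call is what lets the pattern skip the leading ESSSTOR\Disk& text.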

Related

Extract the repetitive parts of a String by Regex pattern matching in Scala

I have this code for extracting the repetitive : separated sections of a regex, which does not give me the right output.
val pattern = """([a-zA-Z]+)(:([a-zA-Z]+))*""".r
for (p <- pattern findAllIn "it:is:very:great just:because:it is") p match {
case pattern("it", pattern(is, pattern(very, great))) => println("it: "+ is + very+ great)
case pattern(it, _,rest) => println( it+" : "+ rest)
case pattern(it, is, very, great) => println(it +" : "+ is +" : "+ very +" : " + great)
case _ => println("match failure")
}
What am I doing wrong?
How can I write a case expression which allows me to extract each : separated part of the pattern regex?
What is the right syntax with which to solve this?
How can the match against unknown number of arguments to be extracted from a regex be done?
In this case print:
it : is : very : great
just : because : it
is
You can't use a repeated capturing group like that: it only keeps the last captured value as the group's value.
You can still get the matches you need with the regex \b[a-zA-Z]+(?::[a-zA-Z]+)*\b and then split each match on the colon:
val text = "it:is:very:great just:because:it is"
val regex = """\b[a-zA-Z]+(?::[a-zA-Z]+)*\b""".r
val results = regex.findAllIn(text).map(_ split ':').toList
results.foreach { x => println(x.mkString(", ")) }
// => it, is, very, great
// just, because, it
// is
Regex details:
\b - word boundary
[a-zA-Z]+ - one or more ASCII letters
(?::[a-zA-Z]+)* - zero or more repetitions of
    : - a colon
    [a-zA-Z]+ - one or more ASCII letters
\b - word boundary
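To see why the original pattern cannot return every section, it may help to inspect its groups directly. The sketch below uses the pattern from the question; the names RepeatedGroupDemo and groups are hypothetical.

```scala
object RepeatedGroupDemo {
  // The pattern from the question: group 2 and group 3 sit inside a repetition.
  private val Repeated = """([a-zA-Z]+)(:([a-zA-Z]+))*""".r

  // Capture groups of the first match, in order.
  def groups(text: String): List[String] =
    Repeated.findFirstMatchIn(text).map(_.subgroups).getOrElse(Nil)
}
```

For "it:is:very:great" this returns List("it", ":great", "great"): the repeated groups hold only their last iteration, so "is" and "very" are lost.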

Scala - match on regex for non-empty string

In Scala, I need to check whether a given string is non-empty. The following snippet reports a match for every input. What is wrong with the regex or with the match?
def isEmpty(input: String): String = {
val nonEmptyStringPattern = raw"""(\w+)""".r
input match {
case nonEmptyStringPattern => s"matched $input"
case _ => "n/a"
}
}
However, the same regex works as expected with the matches method, as below.
def isEmpty(input: String): String = {
val nonEmptyStringPattern = raw"""(\w+)""".r
input match {
case input if nonEmptyStringPattern matches( input) => s"matched $input"
case _ => "n/a"
}
}
Does this mean match cannot use regex instances?
Just as case x => ... creates a new variable x to match against, it's the same for case nonEmptyStringPattern => .... A new variable is created that shadows the existing nonEmptyStringPattern. And as it's an unencumbered variable, it will match anything and everything.
Also, you've created and compiled a regex pattern, but you have to apply it as an extractor (with parentheses) in order to pattern match against it.
def isEmpty(input: String): String = {
val nonEmptyStringPattern = "\\w+".r
input match {
case nonEmptyStringPattern() => s"matched $input"
case _ => "n/a"
}
}
This now works, except for the fact that not all String characters are \w word characters.
isEmpty("") //res0: String = n/a
isEmpty("abc") //res1: String = matched abc
isEmpty("#$#") //res2: String = n/a
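If the goal is "any non-empty string" rather than "only word characters", one possible variant is to match one or more arbitrary characters. This is a sketch, not the answer's code; NonEmptyCheck and describe are hypothetical names, and it treats whitespace-only input as non-empty.

```scala
object NonEmptyCheck {
  // (?s) lets '.' match line terminators too, so any string of length >= 1 matches.
  private val NonEmpty = "(?s).+".r

  def describe(input: String): String = input match {
    // No capturing groups in the pattern, so the extractor takes no arguments.
    case NonEmpty() => s"matched $input"
    case _          => "n/a"
  }
}
```

For a plain emptiness check, input.nonEmpty does the same job without a regex.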

Pattern matching extract String Scala

I want to extract the part of a String that matches one of the two regex patterns I defined:
//should match R0010, R0100,R0300 etc
val rPat="[R]{1}[0-9]{4}".r
// should match P.25.01.21 , P.27.03.25 etc
val pPat="[P]{1}[.]{1}[0-9]{2}[.]{1}[0-9]{2}[.]{1}[0-9]{2}".r
When I now define my method to extract the elements as:
val matcher= (s:String) => s match {case pPat(el)=> println(el) // print the P.25.01.25
case rPat(el)=>println(el) // print R0100
case _ => println("no match")}
And test it eg with:
val pSt=" P.25.01.21 - Hello whats going on?"
matcher(pSt)//prints "no match" but should print P.25.01.21
val rSt= "R0010 test test 3,870"
matcher(rSt) //prints also "no match" but should print R0010
//check if regex is wrong
val pHead="P.25.01.21"
pHead.matches(pPat.toString)//returns true
val rHead="R0010"
rHead.matches(rPat.toString)//return true
I'm not sure whether the regex expressions are wrong, but the matches method works on these elements. So what is wrong with the approach?
When you use pattern matching with strings, you need to bear in mind that:
The .r pattern you pass will need to match the whole string, else, no match will be returned (the solution is to make the pattern .r.unanchored)
Once you make it unanchored, watch out for unwanted matches: R[0-9]{4} will match R1234 in CSR123456 (solutions are different depending on what your real requirements are, usually word boundaries \b are enough, or negative lookarounds can be used)
Inside a match block, the regex matching function requires a capturing group to be present if you want to get some value back (you bound it as el in pPat(el) and rPat(el)).
So, I suggest the following solution:
val rPat="""\b(R\d{4})\b""".r.unanchored
val pPat="""\b(P\.\d{2}\.\d{2}\.\d{2})\b""".r.unanchored
val matcher= (s:String) => s match {case pPat(el)=> println(el) // print the P.25.01.25
case rPat(el)=>println(el) // print R0100
case _ => println("no match")
}
Then,
val pSt=" P.25.01.21 - Hello whats going on?"
matcher(pSt) // => P.25.01.21
val pSt2_bad=" CP.2334565.01124.212 - Hello whats going on?"
matcher(pSt2_bad) // => no match
val rSt= "R0010 test test 3,870"
matcher(rSt) // => R0010
val rSt2_bad = "CSR00105 test test 3,870"
matcher(rSt2_bad) // => no match
Some notes on the patterns:
\b - a leading word boundary
(R\d{4}) - a capturing group matching R followed by exactly 4 digits
\b - a trailing word boundary
Due to the triple quotes used to define the string literal, there is no need to escape the backslashes.
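The three points above can be compared side by side. The sketch below is illustrative only: AnchoringDemo and its method names are made up, and it covers just the R pattern.

```scala
import scala.util.matching.Regex

object AnchoringDemo {
  private val Anchored   = """(R\d{4})""".r             // must match the whole input
  private val Unanchored = Anchored.unanchored          // may match anywhere
  private val Bounded    = """\b(R\d{4})\b""".r.unanchored // anywhere, but only as a whole word

  private def extract(re: Regex, s: String): Option[String] = s match {
    case re(code) => Some(code)
    case _        => None
  }

  def anchored(s: String): Option[String]   = extract(Anchored, s)
  def unanchored(s: String): Option[String] = extract(Unanchored, s)
  def bounded(s: String): Option[String]    = extract(Bounded, s)
}
```

Here anchored("R0010 test test 3,870") is None, unanchored("CSR123456") is Some("R1234") (the unwanted match), and bounded("CSR123456") is None.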
Introduce groups in your patterns:
val rPat=".*([R]{1}[0-9]{4}).*".r
val pPat=".*([P]{1}[.]{1}[0-9]{2}[.]{1}[0-9]{2}[.]{1}[0-9]{2}).*".r
...
scala> matcher(pSt)
P.25.01.21
scala> matcher(rSt)
R0010
If the code is written in the following way, it produces the desired output. The API documentation I followed is http://www.scala-lang.org/api/2.12.1/scala/util/matching/Regex.html
//should match R0010, R0100,R0300 etc
val rPat="[R]{1}[0-9]{4}".r
// should match P.25.01.21 , P.27.03.25 etc
val pPat="[P]{1}[.]{1}[0-9]{2}[.]{1}[0-9]{2}[.]{1}[0-9]{2}".r
def main(args: Array[String]): Unit = {
val pSt=" P.25.01.21 - Hello whats going on?"
val pPatMatches = pPat.findAllIn(pSt)
pPatMatches.foreach(println) // prints P.25.01.21
val rSt= "R0010 test test 3,870"
val rPatMatches = rPat.findAllIn(rSt)
rPatMatches.foreach(println) // prints R0010
}
Please, let me know if that works for you.
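If only the first occurrence is needed, findFirstIn returns an Option and needs neither a capturing group nor .unanchored. A small sketch along those lines, with the hypothetical names CodeFinder and firstCode, and the patterns slightly shortened:

```scala
object CodeFinder {
  private val RPat = "R[0-9]{4}".r
  private val PPat = "P[.][0-9]{2}[.][0-9]{2}[.][0-9]{2}".r

  // Try the P pattern first, then fall back to the R pattern.
  def firstCode(s: String): Option[String] =
    PPat.findFirstIn(s).orElse(RPat.findFirstIn(s))
}
```

Like the unanchored patterns discussed earlier, this finds a match anywhere in the input, so word boundaries may still be needed to avoid matching inside longer tokens.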

Scala Regex Extractor with OR operator

I have this verbose code that does short-circuit Regex extraction / matching in Scala. It attempts to match a string with the first Regex and, if that doesn't match, attempts to match the string with the second Regex.
val regex1 : scala.util.matching.Regex = "^a(b)".r
val regex2 : scala.util.matching.Regex = "^c(d)".r
val s = ?
val extractedGroup1 : Option[String] = s match { case regex1(v) => Some(v) case _ => None }
val extractedGroup2 : Option[String] = s match { case regex2(v) => Some(v) case _ => None}
val extractedValue = extractedGroup1.orElse(extractedGroup2)
Here are the results:
s == "ab" then extractedValue == "b"
s == "cd" then extractedValue == "c"
s == "gg" then extractedValue == None.
My question is how can we combine the two regex into a single regex with the regex or operator, and still use Scala extractors. I tried this, but it's always the None case.
val regex : scala.util.matching.Regex = "^a(b)$ | ^c(d)$".r
val extractedValue = s match { case regex(v) => Some(v) case _ => None }
Don't put readability spaces inside the regex. They may feel natural in Scala code, but they are taken literally: your pattern then requires a space after the end of the string or a space before its start, which can never happen. Try ^(?:a(b)|c(d))$, which is the same as your pattern without repeating ^ and $.
Your own ^a(b)$|^c(d)$ can work too (if you remove the spaces).
Also, do you really get c out of cd? Judging by your regex, you should be getting d, if we're talking about capture groups.
Also, note that you're extracting capture groups. If you combine the regexes, an extracted d will be in group 2, while b will be in group 1, so the extractor needs two variables, and the group that did not take part in the match is null.
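A sketch of how the combined pattern could be used with an extractor: the pattern has two capturing groups, so the case needs two variables, and the group that did not take part in the match comes back as null. CombinedExtractor and extract are illustrative names.

```scala
object CombinedExtractor {
  private val Combined = "^(?:a(b)|c(d))$".r

  def extract(s: String): Option[String] = s match {
    // Exactly one of b and d is non-null; Option(null) is None.
    case Combined(b, d) => Option(b).orElse(Option(d))
    case _              => None
  }
}
```

With this, "ab" gives Some("b"), "cd" gives Some("d"), and "gg" gives None.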

Scala regex "starts with lowercase alphabets" not working

val AlphabetPattern = "^([a-z]+)".r
def stringMatch(s: String) = s match {
case AlphabetPattern() => println("found")
case _ => println("not found")
}
If I try,
stringMatch("hello")
I get "not found", but I expected to get "found".
My understanding of the regex,
[a-z] = in the range of 'a' to 'z'
+ = one or more of the previous pattern
^ = starts with
So regex AlphabetPattern is "all strings that start with one or more alphabets in the range a-z"
Surely I am missing something, want to know what.
Replace case AlphabetPattern() with case AlphabetPattern(_) and it works. The extractor pattern takes a variable to which it binds the result. Here we discard it but you could use x or whatever.
edit: Further to Randall's comment below, if you check the docs for Regex you'll see that it has an unapplySeq rather than an unapply method, which means it takes multiple variables. If you have the wrong number, it won't match, rather like
list match { case List(a,b,c) => a + b + c }
won't match if list doesn't have exactly 3 elements.
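A short sketch of the arity rule, using the pattern from the question. Note also that a plain .r extractor must match the whole string, so a true "starts with" check needs .unanchored; StartsWithLower and its methods are hypothetical names.

```scala
object StartsWithLower {
  private val AlphabetPattern = "^([a-z]+)".r
  private val Prefix          = AlphabetPattern.unanchored

  // One capturing group, so the extractor takes one argument (discarded here).
  def wholeString(s: String): Boolean = s match {
    case AlphabetPattern(_) => true
    case _                  => false
  }

  // Unanchored: succeeds when the string merely starts with lowercase letters.
  def leadingLetters(s: String): Option[String] = s match {
    case Prefix(letters) => Some(letters)
    case _               => None
  }
}
```

Here wholeString("hello") is true but wholeString("hello world") is false, while leadingLetters("hello world") is Some("hello").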
There is an issue with the match statement. s match tests s against AlphabetPattern() and then against _. Because the pattern has one capturing group while AlphabetPattern() binds none, the first case never matches and the wildcard case always wins. Alternatively, use one of the find methods in scala.util.matching.Regex to look for a match with the given Regex.
For example, use findFirstIn to find the first match of AlphabetPattern in a string.
scala> AlphabetPattern.findFirstIn("hello")
res0: Option[String] = Some(hello)
The stringMatch method using findFirstIn and a case statement:
scala> def stringMatch(s: String) = AlphabetPattern findFirstIn s match {
| case Some(s) => println("Found: " + s)
| case None => println("Not found")
| }
stringMatch: (s:String)Unit
scala> stringMatch("hello")
Found: hello