Split a string into parts around a regular expression match (Scala language) - regex

How can I split a string into three parts using a regular expression?
The first part contains the text before the match;
The second part contains the match for the regular expression;
The third part contains the text after the match.
val line="before word after word tree"
val regex="""word""".r
val (resultLeft,result,resultRight) = find(line, regex) //("before ","word"," after word tree")
I need to implement "find" in Scala.

You can group parts of a regex with parentheses.
Then you can reach each group through the match data returned by regex.findAllIn.
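object Main extends App {
val str = "prefix:somestring:postfix"
// [A-Za-z0-9] instead of [A-z0-9]: the range A-z also covers [, \, ], ^, _ and `
val regex = """(prefix:)([A-Za-z0-9]+)(:postfix)""".r
regex.findAllIn(str).matchData foreach {
// print the three groups as one tuple
m => println((m.group(1), m.group(2), m.group(3)))
}
}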

I found a solution:
def find(line:String,regex: String)={
val matcher=regex.r.pattern.matcher(line)
if (matcher.find()){
(
line.substring(0,matcher.start()),
line.substring(matcher.start(),matcher.end()),
line.substring(matcher.end())
)
}else
throw new Exception("no match")
}
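A variant that avoids the exception can be sketched with Regex.findFirstMatchIn, whose Match exposes before, matched and after. The helper name findParts is hypothetical, and returning an Option instead of throwing is an assumption about what the caller wants:

```scala
import scala.util.matching.Regex

// Hypothetical helper: splits `line` around the first match of `regex`.
// Returns None when the regex does not match anywhere in the line.
def findParts(line: String, regex: Regex): Option[(String, String, String)] =
  regex.findFirstMatchIn(line).map { m =>
    // before/after are CharSequences, so convert them to String
    (m.before.toString, m.matched, m.after.toString)
  }

val parts = findParts("before word after word tree", """word""".r)
println(parts) // Some((before ,word, after word tree))
```

The caller can then pattern match on Some((left, matched, right)) or None.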

Related

scala regex meaning

I am new to Scala and hate regex :D
Currently I am debugging a piece of code:
def validateReslutions(reslutions: String): Unit = {
val regex = "(\\d+-\\d+[d,w,m,h,y],?)*"
if (!reslutions.matches(regex)) {
throw new Error("no match")
} else {
print("matched")
}
}
validateReslutions(reslutions = "(20-1w,100-1w)")
The problem is that it produces no match for this input. How can I correct the regex so that it matches?
Your (20-1w,100-1w) string contains a pair of parentheses at the start and end, and the rest matches with your (\d+-\d+[d,w,m,h,y],?)* regex. Since String#matches requires a full string match, you get an exception.
Add patterns for the literal parentheses to the regex to avoid the exception:
def validateReslutions(reslutions: String): Unit = {
val regex = """\((\d+-\d+[dwmhy],?)*\)"""
if (!reslutions.matches(regex)) {
throw new Error("no match")
} else {
print("matched")
}
}
validateReslutions(reslutions = "(20-1w,100-1w)")
// => matched
See the Scala demo.
Note the triple quotes used to define the string literal: inside them a single backslash is a literal backslash character, so regex escapes such as \d need no doubling.
Also, note the absence of commas in the character class: inside a character class a comma matches a literal comma in the text; it does not mean "or".
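To see the effect of the commas, here is a small check. The value names are made up for illustration, and the patterns cover a single item only, without the surrounding parentheses:

```scala
// Character class with commas: a comma is accepted where a unit letter is expected.
val withCommas = """\d+-\d+[d,w,m,h,y]"""
// Character class without commas: only the unit letters are accepted.
val withoutCommas = """\d+-\d+[dwmhy]"""

println("20-1,".matches(withCommas))    // true
println("20-1,".matches(withoutCommas)) // false
println("20-1w".matches(withoutCommas)) // true
```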

Scala regex find matches in middle of string [duplicate]

I have written the following code in scala:
val regex_str = "([a-z]+)(\\d+)".r
"_abc123" match {
case regex_str(a, n) => "found"
case _ => "other"
}
which returns "other", but if I take off the leading underscore:
val regex_str = "([a-z]+)(\\d+)".r
"abc123" match {
case regex_str(a, n) => "found"
case _ => "other"
}
I get "found". How can I find any ([a-z]+)(\\d+) instead of just at the beginning? I am used to other regex languages where you use a ^ to specify beginning of the string, and the absence of that just gets all matches.
Scala regex patterns are "anchored" by default when used in a pattern match, i.e. they must match from the beginning to the end of the target string.
You'll get the expected match with this:
val regex_str = "([a-z]+)(\\d+)".r.unanchored
Hi, maybe you need something like this:
val regex_str = "[^>]([a-z]+)(\\d+)".r
"_abc123" match {
case regex_str(a, n) => println(s"found $a $n")
case _ => println("other")
}
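This skips the first character of your string: [^>] consumes exactly one character other than >, so it only helps when exactly one unwanted character precedes the letters.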
Hope this helps!
The unapplySeq of the Regex tries to capture the whole input by default (treats the pattern as if it was between ^ and $).
There are two ways to capture inside the input:
use .*? before and .* after the captures: val regex_str = ".*?([a-z]+)(\\d+).*".r (the leading quantifier has to be reluctant; a greedy .* would leave only the last letter for the first group)
use .unanchored instead: val regex_str = "([a-z]+)(\\d+)".r.unanchored
Otherwise Scala treats regular expression anchors the same way as other languages do; this one is an exception made for semantic reasons.
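The difference between padding the pattern with a greedy .* and using .unanchored can be checked with a short sketch. The value names are illustrative only:

```scala
// Greedy padding: the leading .* takes as many characters as it can,
// so the first group is left with a single letter.
val greedyPadded = ".*([a-z]+)(\\d+).*".r
// Unanchored: the extractor searches for the pattern anywhere in the input.
val unanchoredPattern = "([a-z]+)(\\d+)".r.unanchored

val fromGreedy = "_abc123" match {
  case greedyPadded(a, n) => (a, n)
  case _ => ("", "")
}
val fromUnanchored = "_abc123" match {
  case unanchoredPattern(a, n) => (a, n)
  case _ => ("", "")
}

println(fromGreedy)     // (c,123)
println(fromUnanchored) // (abc,123)
```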
The regex extractor in Scala pattern matching attempts to match the entire string. If you want to skip some junk characters at the beginning and at the end, prepend .*? (a dot with a reluctant quantifier) and append .* to the regex:
val regex_str = ".*?([a-z]+)(\\d+).*".r
val result = "_!+<>__abc123_%$" match {
case regex_str(a, n) => s"found a = '$a', n = '$n'"
case _ => "no match"
}
println(result)
This outputs:
found a = 'abc', n = '123'
Otherwise, don't use the pattern match with the extractor; use "...".r.findAllIn to find all matches.
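As a rough sketch of that last option, findAllMatchIn (a sibling of findAllIn that yields Match objects) can collect the groups of every occurrence. The sample input is made up for illustration:

```scala
val pattern = "([a-z]+)(\\d+)".r

// Collect (letters, digits) for every occurrence in the input, wherever it starts.
val pairs = pattern
  .findAllMatchIn("_abc123 and xyz7!")
  .map(m => (m.group(1), m.group(2)))
  .toList

println(pairs) // List((abc,123), (xyz,7))
```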

In Scala how can I split a string on whitespaces accounting for an embedded quoted string?

I know Scala can split strings on regexes, like this simple split on whitespace:
myString.split("\\s+").foreach(println)
What if I want to split on whitespace, accounting for the possibility that there may be a quoted string in the input (which I wish to be treated as 1 thing)?
"""This is a "very complex" test"""
In this example I want the resulting substrings to be:
This
is
a
very complex
test
While handling quoted expressions with split can be tricky, doing so with Regex matches is quite easy. We just need to match all non-whitespace character sequences with ([^\\s]+) and all quoted character sequences with \"(.*?)\" (toList is added because findAllMatchIn returns an iterator that can be traversed only once):
import scala.util.matching._
val text = """This is a "very complex" test"""
val regex = new Regex("\"(.*?)\"|([^\\s]+)")
val matches = regex.findAllMatchIn(text).toList
val words = matches.map { _.subgroups.flatMap(Option(_)).fold("")(_ ++ _) }
words.foreach(println)
/*
This
is
a
very complex
test
*/
Note that the solution also treats the quote itself as a word boundary. If you want to inline quoted strings into the surrounding expressions, you'll need to add [^\\s]* on both sides of the quoted case and adjust the group boundaries correspondingly:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)")
...
/*
This
is
a
["very complex"]
test
*/
You can also omit quote symbols when inlining a string by splitting a regex group:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")
...
/*
This
is
a
[very complex]
test
*/
In more complex scenarios, when you have to deal with CSV strings, you'd better use a CSV parser (e.g. scala-csv).
For a string like the one in question, when you do not have to deal with escaped quotation marks, nor with any "wild" quotes appearing in the middle of the fields, you may adapt a known Java solution (see Regex for splitting a string using space when not surrounded by single or double quotes):
val text = """This is a "very complex" test"""
val p = "\"([^\"]*)\"|[^\"\\s]+".r
val allMatches = p.findAllMatchIn(text).map(
m => if (m.group(1) != null) m.group(1) else m.group(0)
)
println(allMatches.mkString("\n"))
See the online Scala demo, output:
This
is
a
very complex
test
The regex is rather basic as it contains 2 alternatives, a single capturing group and a negated character class. Here are its details:
\"([^\"]*)\" - ", followed with 0+ chars other than " (captured into Group 1) and then a "
| - or
[^\"\\s]+ - 1+ chars other than " and whitespace.
You only grab .group(1) if Group 1 participated in the match, else, grab the whole match value (.group(0)).
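This should work when each quoted phrase consists of exactly two words (the loop merges a token that contains a quote with the token right after it):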
val xx = """This is a "very complex" test"""
var x = xx.split("\\s+")
for(i <-0 until x.length) {
if(x(i) contains "\"") {
x(i) = x(i) + " " + x(i + 1)
x(i + 1 ) = ""
}
}
val newX= x.filter(_ != "")
for(i<-newX) {
println(i.replace("\"",""))
}
Rather than using split, I used a recursive approach. Treat the input string as a List[Char], then step through, inspecting the head of the list to see if it is a quote or whitespace, and handle accordingly.
def fancySplit(s: String): List[String] = {
def recurse(s: List[Char]): List[String] = s match {
case Nil => Nil
case '"' :: tail =>
val (quoted, theRest) = tail.span(_ != '"')
quoted.mkString :: recurse(theRest drop 1)
case c :: tail if c.isWhitespace => recurse(tail)
case chars =>
val (word, theRest) = chars.span(c => !c.isWhitespace && c != '"')
word.mkString :: recurse(theRest)
}
recurse(s.toList)
}
If the list is empty, you've finished recursion.
If the first character is a ", grab everything up to the next quote, and recurse with what's left (after throwing out that second quote).
If the first character is whitespace, throw it out and recurse from the next character.
In any other case, grab everything up to the next split character, then recurse with what's left.
Results:
scala> fancySplit("""This is a "very complex" test""") foreach println
This
is
a
very complex
test

Scala regex find and replace

I'm having problems finding and replacing portions of a string using regex in scala.
Given the following string: q[k6.q3]>=0 and q[dist.report][0] or q[dist.report][1] and q[10]>20
I want to replace all the occurrences of "and" and "or" with "&&" and "||".
The regex I have come up with is: .+\s((and|or)+)\s.+. However, this seems to only find the last "and".
When using https://regex101.com/#pcre I tried to solve this by adding the modifiers gU, which seems to work. But I'm not sure how to use those modifiers in Scala code.
Any help is much appreciated
Why not use a solution like this:
str.replaceAll("\\sand\\s", " && ").replaceAll("\\sor\\s", " || ")
You can check the captured/matched substrings with a lambda and use an if/else syntax to replace with the appropriate replacement:
val str = "q[k6.q3]>=0 and q[dist.report][0] or q[dist.report][1] and q[10]>20"
val pattern = """\b(and|or)\b""".r
val replacedStr = pattern replaceAllIn (str, m => if (m.group(1) == "or") "||" else "&&")
println(replacedStr)
Result of the code demo: q[k6.q3]>=0 && q[dist.report][0] || q[dist.report][1] && q[10]>20
Regex breakdown:
\b - word boundary
(and|or) - the letter sequence "and" or the letter sequence "or"
\b - the closing word boundary.
If you require whitespaces on both ends, use
val pattern = """ (and|or) """.r
val replacedStr = pattern replaceAllIn (str, m => if (m.group(1) == "or") " || " else " && ")
See another Scala demo
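If more operators are added later, the if/else can be swapped for a lookup table. The sketch below assumes a hypothetical operators map; Regex.quoteReplacement is used so that a replacement containing $ or \ would be inserted literally:

```scala
import scala.util.matching.Regex

// Hypothetical lookup table from word operators to symbols.
val operators = Map("and" -> "&&", "or" -> "||")
val wordOperator = """\b(and|or)\b""".r

val input = "q[k6.q3]>=0 and q[dist.report][0] or q[dist.report][1] and q[10]>20"
// group(1) is always a key of the map because the regex only matches "and" or "or".
val output = wordOperator.replaceAllIn(input, m => Regex.quoteReplacement(operators(m.group(1))))

println(output) // q[k6.q3]>=0 && q[dist.report][0] || q[dist.report][1] && q[10]>20
```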
You need to add "?" in the right places to make your patterns reluctant:
val line = "q[k6.q3]>=0 and q[dist.report][0] or q[dist.report][1] and q[10]>20"
val regex = ".+\\s((and|or)+)\\s.+".r
regex.findAllIn(line).toList
//Produces a list with one item, the whole line, because the greedy .+ on each side takes as much as possible:
//res0: List[String] = List(q[k6.q3]>=0 and q[dist.report][0] or q[dist.report][1] and q[10]>20)
Compared with:
val line = "q[k6.q3]>=0 and q[dist.report][0] or q[dist.report][1] and q[10]>20"
val regex = ".+?\\s((and|or)+)\\s.+?".r
regex.findAllIn(line).toList
//List with 3 items:
//res0: List[String] = List(q[k6.q3]>=0 and q, [dist.report][0] or q, [dist.report][1] and q)

Scala capture group using regex

Let's say I have this code:
val string = "one493two483three"
val pattern = """two(\d+)three""".r
pattern.findAllIn(string).foreach(println)
I expected findAllIn to only return 483, but instead, it returned two483three. I know I could use unapply to extract only that part, but I'd have to have a pattern for the entire string, something like:
val pattern = """one.*two(\d+)three""".r
val pattern(aMatch) = string
println(aMatch) // prints 483
Is there another way of achieving this, without using the classes from java.util directly, and without using unapply?
Here's an example of how you can access group(1) of each match:
val string = "one493two483three"
val pattern = """two(\d+)three""".r
pattern.findAllIn(string).matchData foreach {
m => println(m.group(1))
}
This prints "483" (as seen on ideone.com).
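The same idea can be written with findFirstMatchIn and findAllMatchIn, which return Match objects directly. This is only a sketch; the second input string is invented to show several matches:

```scala
val pattern = """two(\d+)three""".r

// First occurrence only, as an Option.
val first: Option[String] =
  pattern.findFirstMatchIn("one493two483three").map(_.group(1))

// Group 1 of every occurrence.
val all: List[String] =
  pattern.findAllMatchIn("two1three two22three").map(_.group(1)).toList

println(first) // Some(483)
println(all)   // List(1, 22)
```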
The lookaround option
Depending on the complexity of the pattern, you can also use lookarounds to only match the portion you want. It'll look something like this:
val string = "one493two483three"
val pattern = """(?<=two)\d+(?=three)""".r
pattern.findAllIn(string).foreach(println)
The above also prints "483" (as seen on ideone.com).
References
regular-expressions.info/Lookarounds
val string = "one493two483three"
val pattern = """.*two(\d+)three.*""".r
string match {
case pattern(a483) => println(a483) //matched group(1) assigned to variable a483
case _ => // no match
}
Starting Scala 2.13, as an alternative to regex solutions, it's also possible to pattern match a String by unapplying a string interpolator:
"one493two483three" match { case s"${x}two${y}three" => y }
// String = "483"
Or even:
val s"${x}two${y}three" = "one493two483three"
// x: String = one493
// y: String = 483
If you expect non-matching input, you can add a default case:
"one493deux483three" match {
case s"${x}two${y}three" => y
case _ => "no match"
}
// String = "no match"
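The interpolator pattern does no regex work, so it cannot require digits by itself. A possible sketch, assuming Scala 2.13 or later and a hypothetical helper named numberBetween, validates the captured text with toIntOption:

```scala
// Hypothetical helper (Scala 2.13+): extracts the number between "two" and "three".
def numberBetween(input: String): Option[Int] = input match {
  // digits is plain text here; toIntOption rejects anything that is not an Int
  case s"${prefix}two${digits}three" => digits.toIntOption
  case _ => None
}

println(numberBetween("one493two483three"))  // Some(483)
println(numberBetween("one493deux483three")) // None
```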
You want to look at group(1); you're currently looking at group(0), which is "the entire matched string".
See this regex tutorial.
def extractFileNameFromHttpFilePathExpression(expr: String) = {
//define regex
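// the dot before the extension is escaped so that it matches only a literal "."
val regex = "http4.*\\/(\\w+\\.(xlsx|xls|zip))$".r
// findFirstMatchIn returns Option[Match] (findAllMatchIn returns Iterator[Match]); Match has methods to access capture groups.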
regex.findFirstMatchIn(expr) match {
case Some(i) => i.group(1)
case None => "regex_error"
}
}
extractFileNameFromHttpFilePathExpression(
"http4://testing.bbmkl.com/document/sth1234.zip")