Escape string to be passed as regex - regex

I would like to create a function that creates regex matching an arbitrary string given at the input. For example, when I feed it with 123$ it should match literally "123$" and not 123 at the end of the string.
def convert( xs: String ) = (xs map ( x => "\\"+x)).mkString
val text = """ 123 \d+ 567 """
val x = """\d+"""
val p1 = x.r
val p2 = convert(x).r
println( p1.toString )
\d+ // regex to match number
println( ( p1 findAllIn text ).toList )
List(123, 567) // ok, numbers are matched
println( p2.toString )
\\\d\+ // regex to match "backshash d plus"
println( ( p2 findAllIn text ).toList )
List() // nothing matched :(
So the last findAllIn should find \d+ in text, but it doesn't. What's wrong here?

You can use Java's Pattern class to escape strings as regular expressions. See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#quote%28java.lang.String%29
For example:
scala> Pattern.quote("123$").r.findFirstIn("123$")
res3: Option[String] = Some(123$)

Just to bring more attention to Harold L's comment above, if you want to do this with a Scala library you can use:
import scala.util.matching.Regex
Regex.quote("123$").r.findFirstIn("123$")

Related

Scala regex on a whole column

I have the following pattern that I could parse using pandas in Python, but struggle with translating the code into Scala.
grade string_column
85 (str:ann smith,14)(str:frank chase,15)
86 (str:john foo,15)(str:al more,14)
In python I used:
df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)
with the output:
grade 0 1
85 str:ann smith 14
85 str:frank chase 15
86 str:john foo 15
86 str:al more 14
In Scala I tried to duplicate the approach, but it's failing:
import scala.util.matching.Regex
val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn(str)).mkString(","))
There are a few notes about the code:
There is an unmatched parenthesis for a group, but that one should be escaped
The backslashes should be double escaped
In the println you don't have to use all the parenthesis and the dot
findAllIn returns a MatchIterator, and looping those will expose a matched string. Joining those matched strings with a comma, will in this case give back the same string again.
For example
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn str mkString ",")
Output
(str:ann smith,14),(str:frank chase,15)
But if you want to print out the group 1 and group 2 values, you can use findAllMatchIn that returns a collection of Regex Matches:
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
println(m.group(1))
println(m.group(2))
}
)
Output
str:ann smith
14
str:frank chase
15
In Python, Series.str.extractall only returns captured substrings. In Scala, findAllIn returns the matched values if you do not query its matchData property that in its turn contains a subgroups property.
So, to get the captures only in Scala, you need to use
val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
m => println(m.subgroups.mkString(","))
}
Output:
str:ann smith,14
str:frank chase,15
See the Scala online demo.
Here, m.subgroups accesses all subgroups (captures) of each match (m).
Also, note you do not need to double backslashes in triple-quoted string literals. \((str:[^,()]+),(\d+)\) matches
\( - a ( char
(str:[^,()]+) - Group 1: str: and one or more chars other than ,, ( and )
, - a comma
(\d+) - Group 2: one or more digits
\) - a ) char.
If you just want to get all matches without captures, you can use
val pattern = """\((str:[^,]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))
Output:
(str:ann smith,14),(str:frank chase,15)
See the online demo.

Regex doesn't work when newline is at the end of the string

Exercise: given a string with a name, then space or newline, then email, then maybe newline and some text separated by newlines capture the name and the domain of email.
So I created the following:
val regexp = "^([a-zA-Z]+)(?:\\s|\\n)\\w+#(\\w+\\.\\w+)(?:.|\\r|\\n)*".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
And started testing:
scala> val input = "oleg oleg#email.com"
scala> fun(input)
res17: String = oleg email.com
scala> val input = "oleg\noleg#email.com"
scala> fun(input)
res18: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3"""
scala> fun(input)
res19: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3
| """
scala> fun(input)
res20: String = invalid
Why doesn't the regexp capture the string with the newline at the end?
This part (?:\\s|\\n) can be shortened to \s as it will also match a newline, and as there is still a space before the emails where you are using multiple lines it can be \s+ to repeat it 1 or more times.
Matching any character like this (?:.|\\r|\\n)* if very inefficient due to the alternation. You can use either [\S\s]* or use an inline modifier (?s) to make the dot match a newline.
But using your pattern to just get the name and the domain of the email you don't have to match what comes after it, as you are using the 2 capturing groups in the output.
^([a-zA-Z]+)\s+\w+#(\w+\.\w+)
Regex demo
If you do want to match all that follows, you can use:
val regexp = """(?s)^([a-zA-Z]+)\s+\w+#(\w+\.\w+).*""".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
Scala demo
Note that this pattern \w+#(\w+\.\w+) is very limited for matching an email

Scala regex get parameters in path

regex noob here.
example path:
home://Joseph/age=20/race=human/height=170/etc
Using regex, how do I grab everything after the "=" between the /Joseph/ path and /etc? I'm trying to create a list like
[20, human, 170]
So far I have
val pattern = ("""(?<=Joseph/)[^/]*""").r
val matches = pattern.findAllIn(path)
The pattern lets me just get "age=20" but I thought findAllIn would let me find all of the "parameter=" matches. And after that, I'm not sure how I would use regex to just obtain the "20" in "age=20", etc.
Code
See regex in use here
(?:(?<=/Joseph/)|\G(?!\A)/)[^=]+=([^=/]+)
Usage
See code in use here
object Main extends App {
val path = "home://Joseph/age=20/race=human/height=170/etc"
val pattern = ("""(?:(?<=/Joseph/)|\G(?!\A)/)[^=]+=([^=/]+)""").r
pattern.findAllIn(path).matchData foreach {
m => println(m.group(1))
}
}
Results
Input
home://Joseph/age=20/race=human/height=170/etc
Output
20
human
170
Explanation
(?:(?<=/Joseph/)|\G(?!\A)/) Match the following
(?<=/Joseph/) Positive lookbehind ensuring what precedes matches /Joseph/ literally
\G(?!\A)/ Assert position at the end of the previous match and match / literally
[^=]+ Match one or more of any character except =
= Match this literally
([^=/]+) Capture one or more of any character except = and / into capture group 1
Your pattern looks for the pattern directly after Joseph/, which is why only age=20 matched, maybe just look after =?
val s = "home://Joseph/age=20/race=human/height=170/etc"
// s: String = home://Joseph/age=20/race=human/height=170/etc
val pattern = "(?<==)[^/]*".r
// pattern: scala.util.matching.Regex = (?<==)[^/]*
pattern.findAllIn(s).toList
// res3: List[String] = List(20, human, 170)

Scala: string pattern matching and splitting

I am new to Scala and want to create a function to split Hello123 or Hello 123 into two strings as follows:
val string1 = 123
val string2 = Hello
What is the best way to do it, I have attempted to use regex matching \\d and \\D but I am not sure how to write the function fully.
Regards
You may replace with 0+ whitespaces (\s*+) that are preceded with letters and followed with digits:
var str = "Hello123"
val res = str.split("(?<=[a-zA-Z])\\s*+(?=\\d)")
println(res.deep.mkString(", ")) // => Hello, 123
See the online Scala demo
Pattern details:
(?<=[a-zA-Z]) - a positive lookbehind that only checks (but does not consume the matched text) if there is an ASCII letter before the current position in the string
\\s*+ - matches (consumes) zero or more spaces possessively, i.e.
(?=\\d) - this check is performed only once after the whitespaces - if any - were matched, and it requires a digit to appear right after the current position in the string.
Based on the given string I assume you have to match a string and a number with any number of spaces in between
here is the regex for that
([a-zA-Z]+)\\s*(\\d+)
Now create a regex object using .r
"([a-zA-Z]+)\\s*(\\d+)".r
Scala REPL
scala> val regex = "([a-zA-Z]+)\\s*(\\d+)".r
scala> val regex(a, b) = "hello 123"
a: String = "hello"
b: String = "123"
scala> val regex(a, b) = "hello123"
a: String = "hello"
b: String = "123"
Function to handle pattern matching safely
pattern match with extractors
str match {
case regex(a, b) => Some(a -> b.toInt)
case _ => None
}
Here is the function which does Regex with Pattern matching
def matchStr(str: String): Option[(String, Int)] = {
val regex = "([a-zA-Z]+)\\s*(\\d+)".r
str match {
case regex(a, b) => Some(a -> b.toInt)
case _ => None
}
}
Scala REPL
scala> def matchStr(str: String): Option[(String, Int)] = {
val regex = "([a-zA-Z]+)\\s*(\\d+)".r
str match {
case regex(a, b) => Some(a -> b.toInt)
case _ => None
}
}
defined function matchStr
scala> matchStr("Hello123")
res41: Option[(String, Int)] = Some(("Hello", 123))
scala> matchStr("Hello 123")
res42: Option[(String, Int)] = Some(("Hello", 123))

Capitalize first letter after period

I am using regex to capitalize first letter after . or ? or ! but I am not able to use Upper case, is there something I am missing?
val reply = line.replaceAll("""([\.!?])\s+([a-z])""","""$1"""+" "+"""$2""".toUpperCase)
.toUpperCase has not effect so I tried this:
val pattern = """(?:(.+)?([\.!?])\s+([a-z])(.+)?)+""".r
val reply = line match {
case pattern(a,b,c,d) => a+b+" "+c.toUpperCase+d
case _ => line
}
This does not match all the occurences of . and it only capitalizes the letter after the first period.
You could use replaceAllIn method of Regex:
scala> """[\.!?]\s+[a-z]""".r.replaceAllIn("abc. abc", _.matched.toUpperCase)
res0: String = abc. Abc