Scala regex for strings with double quotes

I am struggling to concatenate a message made of two quoted texts into a single text using regex in Scala.
original message = "part1 "+" part2"
original message = "part1 " + " part2"
original message = "part 1 "+ " part2"
concatenated message = "part1 part2"
What I am using is the code below (to at least replace the + sign with an empty string):
val line:String = """"text1"+"text2"""" //My original String which is "text1"+"text2"
val temp_line:String = line.replaceAll("\\+","")
println(temp_line)
It works fine and results in "text1""text2". Is there a way to get the output "text1 text2" using regex?
Please help. Thanks in advance

This is really not an ideal problem for regexes, but okay:
val Part = """"([^"]*)"(.*$)""".r // Quotes, non quotes, quotes, then the rest
val Plus = """\s*\+\s*(.*)""".r // Plus with optional spaces, then the rest
def parts(s: String, found: List[String] = Nil): String = s match {
case Part(p,rest) => rest match {
case "" => (p :: found).map(_.filter(c => !c.isWhitespace)).reverse.mkString(" ")
case Plus(more) => parts(more, p :: found)
case x => throw new IllegalArgumentException(s"$p :$x:")
}
case x => throw new IllegalArgumentException(s"|$x|")
}
This just takes the input string apart piece by piece; you can add printlns if you want to see how it works. (Note that + is a special character in regex, so you need to escape it to match it.)
scala> parts(""""part1 "+" part2"""")
res1: String = part1 part2
scala> parts(""""part1 " + " part2"""")
res2: String = part1 part2
scala> parts(""""part 1 "+ " part2"""")
res3: String = part1 part2
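If validating the + separators is not required, a shorter variant is possible. The following is only a sketch under that assumption: it collects every double-quoted chunk, strips the whitespace inside each one (as the answer above does), and joins the chunks with a single space. The names JoinQuotedParts and join are illustrative and do not come from the original answer.

```scala
object JoinQuotedParts {
  // Matches one double-quoted chunk and captures what is between the quotes.
  private val Quoted = "\"([^\"]*)\"".r

  // Assumption: whatever sits between the quoted chunks (the + signs) is ignored, not validated.
  def join(s: String): String =
    Quoted.findAllMatchIn(s)
      .map(_.group(1).filterNot(_.isWhitespace)) // drop whitespace inside each chunk
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    println(join("\"part1 \"+\" part2\""))    // part1 part2
    println(join("\"part 1 \"+ \" part2\""))  // part1 part2
  }
}
```

Unlike the recursive version, this does not throw on malformed input; it simply returns whatever quoted chunks it finds.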

Related

Regex doesn't work when newline is at the end of the string

Exercise: given a string with a name, then space or newline, then email, then maybe newline and some text separated by newlines capture the name and the domain of email.
So I created the following:
val regexp = "^([a-zA-Z]+)(?:\\s|\\n)\\w+#(\\w+\\.\\w+)(?:.|\\r|\\n)*".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
And started testing:
scala> val input = "oleg oleg#email.com"
scala> fun(input)
res17: String = oleg email.com
scala> val input = "oleg\noleg#email.com"
scala> fun(input)
res18: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3"""
scala> fun(input)
res19: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3
| """
scala> fun(input)
res20: String = invalid
Why doesn't the regexp capture the string with the newline at the end?
This part (?:\\s|\\n) can be shortened to \s, as \s also matches a newline. Because there is still a space before the email in your multi-line inputs, use \s+ to match one or more whitespace characters.
Matching any character like this (?:.|\\r|\\n)* is very inefficient due to the alternation. You can use either [\S\s]* or the inline modifier (?s) to make the dot match a newline.
But using your pattern to just get the name and the domain of the email you don't have to match what comes after it, as you are using the 2 capturing groups in the output.
^([a-zA-Z]+)\s+\w+#(\w+\.\w+)
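One caveat when using the shortened pattern from Scala: a `case regexp(...)` pattern match requires the regex to match the whole input, so a pattern that stops after the domain has to be applied with findFirstMatchIn (or made unanchored). A minimal sketch, where NameAndDomain and extract are illustrative names and the # separator is kept exactly as in the question's sample data:

```scala
object NameAndDomain {
  // Shortened pattern: name, whitespace, local part, '#', domain. Nothing after the domain is matched.
  private val NameDomain = """^([a-zA-Z]+)\s+\w+#(\w+\.\w+)""".r

  def extract(str: String): String =
    NameDomain.findFirstMatchIn(str) match {
      case Some(m) => m.group(1) + " " + m.group(2)
      case None    => "invalid"
    }

  def main(args: Array[String]): Unit = {
    // Trailing newline and extra lines are simply ignored.
    println(extract("oleg\noleg#email.com\n7bdaf0a1be3\n")) // oleg email.com
    println(extract("12345"))                                // invalid
  }
}
```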
If you do want to match all that follows, you can use:
val regexp = """(?s)^([a-zA-Z]+)\s+\w+#(\w+\.\w+).*""".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
Note that this pattern \w+#(\w+\.\w+) is very limited for matching an email

Regex Matching using Matcher and Pattern

I am trying to clean a number with regexes based on the conditions below; however, it's returning an empty string.
import java.util.regex.Matcher
import java.util.regex.Pattern
object clean extends App {
val ALPHANUMERIC: Pattern = Pattern.compile("^[a-zA-Z0-9]*$")
val SPECIALCHAR: Pattern = Pattern.compile("[a-zA-Z0-9\\-#\\.\\(\\)\\/%&\\s]")
val LEADINGZEROES: Pattern = Pattern.compile("^[0]+(?!$)")
val TRAILINGZEROES: Pattern = Pattern.compile("\\.0*$|(\\.\\d*?)0+$")
def evaluate(codes: String): String = {
var str2: String = codes.toString
var text:Matcher = LEADINGZEROES.matcher(str2)
str2 = text.replaceAll("")
text = ALPHANUMERIC.matcher(str2)
str2 = text.replaceAll("")
text = SPECIALCHAR.matcher(str2)
str2 = text.replaceAll("")
text = TRAILINGZEROES.matcher(str2)
str2 = text.replaceAll("")
str2 // return the cleaned value (an assignment alone evaluates to Unit)
}
}
The code returns an empty string for an input that should only match LEADINGZEROES.
scala> println("cleaned value :" + evaluate("0001234"))
cleaned value :
What should I change to make the code work as I expect? Basically, I am trying to remove leading/trailing zeroes, and if the number contains special characters or letters, the entire value should be returned as null.
Your LEADINGZEROES pattern works correctly, as
val LEADINGZEROES: Pattern = Pattern.compile("^[0]+(?!$)")
println(LEADINGZEROES.matcher("0001234").replaceAll(""))
gives
//1234
But then the next step,
text = ALPHANUMERIC.matcher(str2)
matches the whole alphanumeric string and replaces it with "", which leaves str2 empty.
You can see this when you run
val ALPHANUMERIC: Pattern = Pattern.compile("^[a-zA-Z0-9]*$")
val LEADINGZEROES: Pattern = Pattern.compile("^[0]+(?!$)")
println(ALPHANUMERIC.matcher(LEADINGZEROES.matcher("0001234").replaceAll("")).replaceAll(""))
which prints an empty string.
Updated
As you have commented
if there is a code that is alphanumeric i want to make that value NULL
but in case of leading or trailing zeroes its pure number, which should return me the value after removing zeroes
but its also returning null for trailing and leading zeroes matches
and also how can I skip a match , suppose i want the regex to not match the number 0999 for trimming leading zeroes
You can write your evaluate function and regexes as below
val LEADINGTRAILINGZEROES = """(0*)(\d{4})(0*)""".r
val ALPHANUMERIC = """[a-zA-Z]""".r
def evaluate(codes: String): String = {
val LEADINGTRAILINGZEROES(first, second, third) = if(ALPHANUMERIC.findAllIn(codes).length != 0) "0010" else codes
if(second.equalsIgnoreCase("0010")) "NULL" else second
}
which should give you
println("cleaned value : " + evaluate("000123400"))
// cleaned value : 1234
println("alphanumeric : " + evaluate("0001A234"))
// alphanumeric : NULL
println("skipping : " + evaluate("0999"))
// skipping : 0999
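Note that the pattern above only matches values whose significant part is exactly four digits; any other input makes the extractor assignment throw a MatchError. A more general variant can be sketched as follows. It rests on assumptions that are not in the original answer: the result is an Option instead of the string "NULL", trailing zeroes are only stripped from a decimal part (as the question's TRAILINGZEROES pattern suggests), and the "skip 0999" requirement is not handled. CleanCode, Numeric, and this evaluate are hypothetical names.

```scala
object CleanCode {
  // Leading zeroes, then the integer part, then an optional decimal part
  // whose trailing zeroes are left outside the capturing group.
  private val Numeric = """0*(\d+?)(?:\.(\d*?)0*)?""".r

  def evaluate(code: String): Option[String] = code match {
    case Numeric(intPart, fracPart) =>
      // fracPart is null when there is no decimal part at all.
      if (fracPart == null || fracPart.isEmpty) Some(intPart)
      else Some(s"$intPart.$fracPart")
    case _ => None // letters or special characters: reject the whole value
  }

  def main(args: Array[String]): Unit = {
    println(evaluate("0001234"))  // Some(1234)
    println(evaluate("12.500"))   // Some(12.5)
    println(evaluate("0001A234")) // None
  }
}
```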
I hope the answer is helpful

In Scala how can I split a string on whitespaces accounting for an embedded quoted string?

I know Scala can split strings on regex's like this simple split on whitespace:
myString.split("\\s+").foreach(println)
What if I want to split on whitespace, accounting for the possibility that there may be a quoted string in the input (which I wish to be treated as 1 thing)?
"""This is a "very complex" test"""
In this example I want the resulting substrings to be:
This
is
a
very complex
test
While handling quoted expressions with split can be tricky, doing so with Regex matches is quite easy. We just need to match all non-whitespace character sequences with ([^\\s]+) and all quoted character sequences with \"(.*?)\" (toList added in order to avoid reiteration):
import scala.util.matching._
val text = """This is a "very complex" test"""
val regex = new Regex("\"(.*?)\"|([^\\s]+)")
val matches = regex.findAllMatchIn(text).toList
val words = matches.map { _.subgroups.flatMap(Option(_)).fold("")(_ ++ _) }
words.foreach(println)
/*
This
is
a
very complex
test
*/
Note that the solution also counts quote itself as a word boundary. If you want to inline quoted strings into surrounding expressions, you'll need to add [^\\s]* from both sides of the quoted case and adjust group boundaries correspondingly:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)")
...
/*
This
is
a
["very complex"]
test
*/
You can also omit quote symbols when inlining a string by splitting a regex group:
...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")
...
/*
This
is
a
[very complex]
test
*/
In more complex scenarios, when you have to deal with CSV strings, you'd better use a CSV parser (e.g. scala-csv).
For a string like the one in question, when you do not have to deal with escaped quotation marks, nor with any "wild" quotes appearing in the middle of the fields, you may adapt a known Java solution (see Regex for splitting a string using space when not surrounded by single or double quotes):
val text = """This is a "very complex" test"""
val p = "\"([^\"]*)\"|[^\"\\s]+".r
val allMatches = p.findAllMatchIn(text).map(
m => if (m.group(1) != null) m.group(1) else m.group(0)
)
println(allMatches.mkString("\n"))
Output:
This
is
a
very complex
test
The regex is rather basic as it contains 2 alternatives, a single capturing group and a negated character class. Here are its details:
\"([^\"]*)\" - ", followed with 0+ chars other than " (captured into Group 1) and then a "
| - or
[^\"\\s]+ - 1+ chars other than " and whitespace.
You only grab .group(1) if Group 1 participated in the match, else, grab the whole match value (.group(0)).
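This should work for the example, although it assumes every quoted string spans exactly two whitespace-separated words: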
val xx = """This is a "very complex" test"""
var x = xx.split("\\s+")
for(i <-0 until x.length) {
if(x(i) contains "\"") {
x(i) = x(i) + " " + x(i + 1)
x(i + 1 ) = ""
}
}
val newX= x.filter(_ != "")
for(i<-newX) {
println(i.replace("\"",""))
}
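The loop above pairs a token containing a quote with the single token after it. One way to cope with quoted strings of any length is to split on the quote character first and only split the unquoted segments on whitespace. This is a sketch that assumes quotes are balanced and never escaped; QuoteAwareSplit is an illustrative name.

```scala
object QuoteAwareSplit {
  def split(s: String): List[String] =
    s.split("\"", -1).toList.zipWithIndex.flatMap {
      // Odd-indexed segments sit between a pair of quotes: keep them whole.
      case (segment, i) if i % 2 == 1 => List(segment)
      // Even-indexed segments are outside quotes: split them on whitespace.
      case (segment, _) => segment.trim.split("\\s+").toList.filter(_.nonEmpty)
    }

  def main(args: Array[String]): Unit = {
    split("""This is a "very complex" test""").foreach(println)
    // This
    // is
    // a
    // very complex
    // test
  }
}
```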
Rather than using split, I used a recursive approach. Treat the input string as a List[Char], then step through, inspecting the head of the list to see if it is a quote or whitespace, and handle accordingly.
def fancySplit(s: String): List[String] = {
def recurse(s: List[Char]): List[String] = s match {
case Nil => Nil
case '"' :: tail =>
val (quoted, theRest) = tail.span(_ != '"')
quoted.mkString :: recurse(theRest drop 1)
case c :: tail if c.isWhitespace => recurse(tail)
case chars =>
val (word, theRest) = chars.span(c => !c.isWhitespace && c != '"')
word.mkString :: recurse(theRest)
}
recurse(s.toList)
}
If the list is empty, you've finished recursion
If the first character is a ", grab everything up to the next quote, and recurse with what's left (after throwing out that second quote).
If the first character is whitespace, throw it out and recurse from the next character
In any other case, grab everything up to the next split character, then recurse with what's left
Results:
scala> fancySplit("""This is a "very complex" test""") foreach println
This
is
a
very complex
test

Scala, regex matching ignore unnecessary words

My program is:
val pattern = "[*]prefix_([a-zA-Z]*)_[*]".r
val outputFieldMod = "TRASHprefix_target_TRASH"
var tar =
outputFieldMod match {
case pattern(target) => target
}
println(tar)
Basically, I try to get the "target" and ignore "TRASH" (I used *). But it has some error and I am not sure why..
Solution one: use unanchored from the standard library
Call unanchored on the pattern to match inside the string, ignoring the trash:
val pattern = "prefix_([a-zA-Z]*)_".r.unanchored
An unanchored pattern only has to match somewhere inside the input, so the surrounding trash (all the other words) is ignored:
val result = str match {
case pattern(value) => value
case _ => ""
}
Example
Scala REPL
scala> val pattern = """foo\((.*)\)""".r.unanchored
pattern: scala.util.matching.UnanchoredRegex = foo\((.*)\)
scala> val str = "blahblahfoo(bar)blahblah"
str: String = blahblahfoo(bar)blahblah
scala> str match { case pattern(value) => value ; case _ => "no match" }
res3: String = bar
Solution two
Pad your pattern on both sides with .*, which matches zero or more characters other than line breaks:
val pattern = ".*prefix_([a-zA-Z]*)_.*".r
val result = str match {
case pattern(value) => value
case _ => ""
}
Example
Scala REPL
scala> val pattern = """.*foo\((.*)\).*""".r
pattern: scala.util.matching.Regex = .*foo\((.*)\).*
scala> val str = "blahblahfoo(bar)blahblah"
str: String = blahblahfoo(bar)blahblah
scala> str match { case pattern(value) => value ; case _ => "no match" }
res4: String = bar
This will work, val pattern = ".*prefix_([a-z]+).*".r, but it distinguishes between target and trash via lower/upper-case letters. Whatever determines real target data from trash data will determine the real regex pattern.
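For reference, the pattern in the question fails because [*] is a character class that matches a literal asterisk, not "anything", and because the match has no default case, so a non-matching input throws a MatchError. A small sketch contrasting the two patterns (TargetExtract, Literal, Inside, and extract are illustrative names):

```scala
import scala.util.matching.Regex

object TargetExtract {
  // The question's pattern: [*] only matches the character '*'.
  val Literal: Regex = "[*]prefix_([a-zA-Z]*)_[*]".r
  // Unanchored pattern: may match anywhere inside the input.
  val Inside: Regex = "prefix_([a-zA-Z]*)_".r.unanchored

  // The default case avoids a MatchError when nothing matches.
  def extract(pattern: Regex, s: String): Option[String] = s match {
    case pattern(target) => Some(target)
    case _               => None
  }

  def main(args: Array[String]): Unit = {
    println(extract(Literal, "TRASHprefix_target_TRASH")) // None
    println(extract(Literal, "*prefix_target_*"))         // Some(target)
    println(extract(Inside, "TRASHprefix_target_TRASH"))  // Some(target)
  }
}
```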

Scala: How to replace all consecutive underscore with a single space?

I want to replace all the consecutive underscores with a single space. Below is the code that I have written, but it is not replacing anything. What am I doing wrong?
import scala.util.matching.Regex
val regex: Regex = new Regex("/[\\W_]+/g")
val name: String = "cust_id"
val newName: String = regex.replaceAllIn(name, " ")
println(newName)
Output: "cust_id"
You could call replaceAll directly on the string. It takes a regex, so _+ matches a whole run of underscores:
val name: String = "cust_id"
val newName: String = name.replaceAll("_+", " ")
println(newName)
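The slashes and the g flag in your regular expression are JavaScript regex-literal syntax and don't belong there. replaceAllIn already replaces every match, and the second argument of the Regex constructor is a list of group names, not flags.
new Regex("[\\W_]+").replaceAllIn("cust_id", " ")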
// "cust id"
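To see the "consecutive underscores become one space" behaviour that the question asks for, here is a small sketch; Underscores, collapse, and collapseNonWord are illustrative names, and the second helper mirrors the [\W_]+ character class from the question.

```scala
object Underscores {
  // A run of one or more underscores becomes a single space.
  def collapse(name: String): String = name.replaceAll("_+", " ")

  // Same idea with the question's class: any run of non-word characters or underscores.
  def collapseNonWord(name: String): String = "[\\W_]+".r.replaceAllIn(name, " ")

  def main(args: Array[String]): Unit = {
    println(collapse("cust___id"))        // cust id
    println(collapseNonWord("cust_-_id")) // cust id
  }
}
```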
A string in Scala may be treated as a collection, so we can map over it and use pattern matching to substitute characters, like this:
"cust_id".map {
case '_' => " "
case c => c
}.mkString
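The mkString method glues the resulting sequence of characters back into a string. Note that this replaces each underscore individually, so consecutive underscores become several spaces.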