Non-empty iterator over regex groups becomes empty array - regex

I have this strange situation: when I print regex groups to the console, they show up, but when I convert the iterator to an array, it's empty. The following code doesn't print anything:
val str = "buy--751-rates.data"
val expr = "--(.+)-rates.data".r
val target = Array[String]()
expr.findAllIn(str).matchData map(m => m group 1) copyToArray(target, 0, 4)
target foreach { println }
But this snippet works:
val str = "buy--751-rates.data"
val expr = "--(.+)-rates.data".r
println("Scala matches:")
expr.findAllIn(str).matchData foreach {
m => println(m group 1)
}
I guess I missed something simple

You didn't get anything because you were copying into a zero-length array: copyToArray fills only as many slots as the target array has, and Array[String]() has none. You don't actually need to do that, as there is a toArray method on the iterator that converts it to an array, and from that you can get the head value if you want. For example:
(expr.findAllIn(str).matchData).map(m => m group 1).toArray.head
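To make the difference concrete, here is a minimal sketch that contrasts a pre-sized target array with converting the iterator directly. The object and member names (IteratorToArrayDemo, groups, sized, viaToList, first) are made up for illustration; only findAllIn, matchData, copyToArray, toList and headOption come from the standard library.

```scala
object IteratorToArrayDemo {
  val str  = "buy--751-rates.data"
  val expr = "--(.+)-rates.data".r

  // An iterator can be consumed only once, so build a fresh one for each use.
  def groups: Iterator[String] =
    expr.findAllIn(str).matchData.map(_.group(1))

  // copyToArray fills existing slots only, so the target needs a non-zero length.
  val sized: Array[String] = {
    val target = new Array[String](4)
    groups.copyToArray(target, 0, 4)
    target // slot 0 holds the group, the unused slots stay null
  }

  // Converting the iterator directly avoids guessing a size up front.
  val viaToList: List[String] = groups.toList

  // headOption avoids the exception that head throws when nothing matched.
  val first: Option[String] = viaToList.headOption

  def main(args: Array[String]): Unit = {
    println(viaToList)
    println(first)
  }
}
```

Using headOption instead of head is a design choice for inputs that might not match at all.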

Related

How to use match with regular expressions in Scala

I am starting to learn Scala and want to use regular expressions to match a character from a string so I can populate a mutable map of characters and their value (String values, numbers etc) and then print the result.
I have looked at several answers on SO and gone over the Scala Docs but can't seem to get this right. I have a short Lexer class that currently looks like this:
class Lexer {
private val tokens: mutable.Map[String, Any] = collection.mutable.Map()
private def checkCharacter(char: Character): Unit = {
val Operator = "[-+*/^%=()]".r
val Digit = "[\\d]".r
val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => tokens(c) = "Operator"
case Digit(c) => tokens(c) = Integer.parseInt(c)
case Other(c) => tokens(c) = "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
for (s <- inputArray)
checkCharacter(s)
for((key, value) <- tokens)
println(key + ": " + value)
}
}
I'm pretty confused by the sort of strange method syntax, Operator(c), that I have seen being used to handle the value to match and am also unsure if this is the correct way to use regex in Scala. I think what I want this code to do is clear, I'd really appreciate some help understanding this. If more info is needed I will supply what I can
This official doc has lots of examples: https://www.scala-lang.org/api/2.12.1/scala/util/matching/Regex.html. What might be confusing is the type of the regular expression and its use in pattern matching...
You can construct a regex from any string by using .r:
scala> val regex = "(something)".r
regex: scala.util.matching.Regex = (something)
Your regex becomes an object that has a few useful methods to be able to find matching groups like findAllIn.
In Scala it's idiomatic to use pattern matching for safe extraction of values, so the Regex class also has an unapplySeq method to support pattern matching. This makes it an extractor object. You can call it directly (not common):
scala> regex.unapplySeq("something")
res1: Option[List[String]] = Some(List(something))
or you can let Scala compiler call it for you when you do pattern matching:
scala> "something" match {
| case regex(x) => x
| case _ => ???
| }
res2: String = something
You might ask why exactly this return type on unapply/unapplySeq. The doc explains it very well:
The return type of an unapply should be chosen as follows:
If it is just a test, return a Boolean. For instance case even().
If it returns a single sub-value of type T, return an Option[T].
If you want to return several sub-values T1,...,Tn, group them in an optional tuple Option[(T1,...,Tn)].
Sometimes, the number of values to extract isn’t fixed and we would
like to return an arbitrary number of values, depending on the input.
For this use case, you can define extractors with an unapplySeq method
which returns an Option[Seq[T]]. Common examples of these patterns
include deconstructing a List using case List(x, y, z) => and
decomposing a String using a regular expression Regex, such as case
r(name, remainingFields @ _*) =>
In short your regex might match one or more groups, thus you need to return a list/seq. It has to be wrapped in an Option to comply with extractor contract.
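As a small sketch of that contract, the following shows a regex with several capture groups used both through unapplySeq directly and through pattern matching, including the `@ _*` form for binding the remaining groups. The patterns and names here (KeyValue, Version, describe, headAndRest) are hypothetical and chosen only for illustration.

```scala
import scala.util.matching.Regex

object ExtractorDemo {
  // Two capture groups: pattern matching binds one variable per group.
  val KeyValue: Regex = """(\w+)=(\w+)""".r

  // Three capture groups, used below with a sequence wildcard.
  val Version: Regex = """(\d+)\.(\d+)\.(\d+)""".r

  def describe(s: String): String = s match {
    case KeyValue(k, v) => s"$k -> $v" // the whole string must match the regex
    case _              => "no match"
  }

  // Bind the first group by name and collect the remaining groups as a sequence.
  def headAndRest(s: String): Option[(String, List[String])] = s match {
    case Version(major, rest @ _*) => Some((major, rest.toList))
    case _                         => None
  }

  def main(args: Array[String]): Unit = {
    println(KeyValue.unapplySeq("a=1"))
    println(describe("a=1"))
    println(headAndRest("1.2.3"))
  }
}
```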
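The way you are using regex is almost correct. Note the parentheses added to each pattern below: case Operator(c) binds c to a capture group, so the regex needs one, otherwise the case never matches. I would also map your function over the input array to avoid creating mutable maps. Perhaps something like this: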
class Lexer {
private def getCharacterType(char: Character): Any = {
val Operator = "([-+*/^%=()])".r
val Digit = "([\\d])".r
//val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => "Operator"
case Digit(c) => Integer.parseInt(c)
case _ => "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
val tokens = inputArray.map(x => x -> getCharacterType(x))
for((key, value) <- tokens)
println(key + ": " + value)
}
}
scala> val l = new Lexer()
l: Lexer = Lexer@60f662bd
scala> l.lex("a-1")
a: Other
-: Operator
1: 1
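The capture groups in the rewritten patterns matter. A minimal sketch of the difference, using hypothetical names (CaptureGroupDemo, NoGroup, WithGroup and the helper methods): a regex without a capture group yields no groups, so a pattern that tries to bind one variable does not match, while the zero-argument form can still be used as a plain test.

```scala
object CaptureGroupDemo {
  val NoGroup   = "[\\d]".r   // no capture group, as in the question
  val WithGroup = "([\\d])".r // one capture group, as in the answer

  // Tries to bind one group; falls through when the regex captures nothing.
  def bindNoGroup(s: String): Option[String] = s match {
    case NoGroup(c) => Some(c)
    case _          => None
  }

  def bindWithGroup(s: String): Option[String] = s match {
    case WithGroup(c) => Some(c)
    case _            => None
  }

  // With zero binders, a group-less regex works as a yes/no test.
  def isDigit(s: String): Boolean = s match {
    case NoGroup() => true
    case _         => false
  }

  def main(args: Array[String]): Unit = {
    println(bindNoGroup("7"))
    println(bindWithGroup("7"))
    println(isDigit("7"))
  }
}
```

In the original Lexer there was no fallback case, so an input that matched none of the three patterns would have raised a MatchError instead of returning None.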

Find index locations by regex pattern and replace them with a list of indexes in Scala

I have strings in this format:
object[i].base.base_x[i] and I get lists like List(0,1).
I want to use regular expressions in Scala to find each match of [i] in the given string and replace the first occurrence with 0 and the second with 1, giving something like object[0].base.base_x[1].
I have the following code:
val stringWithoutIndex = "object[i].base.base_x[i]" // basically this string is generated dynamically
val indexReplacePattern = raw"\[i\]".r
val indexValues = List(0,1) // list generated dynamically
if(indexValues.nonEmpty){
indexValues.map(row => {
indexReplacePattern.replaceFirstIn(stringWithoutIndex , "[" + row + "]")
})
} else stringWithoutIndex
Since String is immutable, stringWithoutIndex is never updated, so the output is List("object[0].base.base_x[i]", "object[1].base.base_x[i]").
I tried looking into StringBuilder but I am not sure how to update it. Also, is there a better way to do this? Suggestions other than regex are also welcome.
You could loop through the integers in indexValues using foldLeft and pass the string stringWithoutIndex as the start value.
Then use replaceFirst to replace the first remaining match with the current value from indexValues.
If you want to use a regex, you might use a positive lookahead (?=]) and a positive lookbehind (?<=\[) to assert that the i sits between an opening and a closing square bracket.
(?<=\[)i(?=])
For example:
val strRegex = """(?<=\[)i(?=])"""
val res = indexValues.foldLeft(stringWithoutIndex) { (s, row) =>
s.replaceFirst(strRegex, row.toString)
}
How about this:
scala> val str = "object[i].base.base_x[i]"
str: String = object[i].base.base_x[i]
scala> str.replace('i', '0').replace("base_x[0]", "base_x[1]")
res0: String = object[0].base.base_x[1]
This sounds like a job for foldLeft. No need for the if (indexValues.nonEmpty) check.
indexValues.foldLeft(stringWithoutIndex) { (s, row) =>
indexReplacePattern.replaceFirstIn(s, "[" + row + "]")
}
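For reference, here is the foldLeft approach wrapped in a small self-contained sketch. The names IndexFillDemo and fill are made up for illustration. Each step replaces the first remaining [i], so the accumulated string is threaded through the fold instead of being mutated.

```scala
object IndexFillDemo {
  private val placeholder = raw"\[i\]".r

  // Threads the string through the fold: each index fills the first remaining "[i]".
  def fill(template: String, indexes: List[Int]): String =
    indexes.foldLeft(template) { (s, idx) =>
      placeholder.replaceFirstIn(s, s"[$idx]")
    }

  def main(args: Array[String]): Unit = {
    println(fill("object[i].base.base_x[i]", List(0, 1)))
  }
}
```

An empty list returns the template unchanged, and surplus indexes are ignored once no placeholder is left, which is why no explicit nonEmpty check is needed.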

How to group similar characters in a string in scala?

Let's assume I have a string as such:
val a = "aaaabbbcccss"
and I want to group only the a's and b's as such:
"a4b3cccss"
I have tried a.toList.groupBy(identity).mapValues(_.size), but that returns a map with no ordering, so I cannot convert it into the form I want. Is there a function in Scala that can achieve this?
You may use
val a = "aaaabbbcccss"
val p = """([ab])\1*""".r
println(p replaceAllIn (a, m => s"${m.group(1)}${m.group(0).size}") )
The regex matches:
([ab]) - Group 1: a or b
\1* - zero or more occurrences of the char captured into Group 1.
In the replacement part, m.group(1) is the char captured into Group 1 and m.group(0).size is the size of the whole match.
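If the set of characters to compress is not fixed, the same backreference idea can be parameterized. This is only a sketch under that assumption: the names RunLengthDemo and compress are hypothetical, and the pattern is assembled from the given characters with Pattern.quote so that regex metacharacters are treated literally.

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

object RunLengthDemo {
  // Replaces each run of a character from `chars` with the character plus the run length.
  def compress(s: String, chars: Set[Char]): String =
    if (chars.isEmpty) s // an empty alternation would match the empty string everywhere
    else {
      val alternatives = chars.map(c => Pattern.quote(c.toString)).mkString("|")
      val runs = s"($alternatives)\\1*".r // one char, then repeats of that same char
      runs.replaceAllIn(
        s,
        m => Regex.quoteReplacement(s"${m.group(1)}${m.group(0).length}")
      )
    }

  def main(args: Array[String]): Unit = {
    println(compress("aaaabbbcccss", Set('a', 'b')))
  }
}
```

Regex.quoteReplacement guards against a `$` or backslash in the replacement being read as a group reference.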
As an alternative, you could write a function that takes the string and a list of characters and processes the string recursively: take the run of consecutive equal characters at the head of the list with takeWhile, drop that many characters, and recurse on the rest.
Each run is rendered either as the character followed by its count (if the character is in the list) or as the run itself, and the pieces are concatenated until the list is empty.
def countSimilar(str: String, ch: List[Char]): String = {
def process(l: List[Char], acc: String = ""): String = {
l match {
case Nil => acc
case h :: _ =>
val tw = l.takeWhile(_ == h)
acc + process(
l.drop(tw.length),
if (ch.contains(h)) h + tw.length.toString else tw.mkString("")
)
}
}
process(str.toList)
}
println(countSimilar("aaaabbbcccss", List('a', 'b')))
println(countSimilar("aaaabbbcccssaaaabb", List('a', 'b', 'c')))
That will give you:
a4b3cccss
a4b3c3ssa4b2

regular expression to match a ascii character

I want to match a regular expression for the string
2=abc\u000148=123\u0001
Explanation
Key-value pairs separated by the SOH (\u0001) character
Key - a number
Value - a string of digits, letters, or decimals
Key and value are separated by "="
The regex I tried is
[0-9]=.*[u0001]+
but it does not match properly
Update
I have a list of numbers, val num = Seq(2,3,4).
Instead of finding the matches, I want to remove them from the string.
The keys to remove are the values in the list num.
Input
2=abc\u000148=123\u00013=def\u0001
Output (the filtered string)
148=123\u0001, where the pairs whose keys match 2 and 3 have been removed
object Main extends App {
val s = "2=abc\u000148=123\u00013=def\u0001"
val num = Seq(2,3)
for (e <- num) {
val p = s"(\\$e+)=([^\u0001]*)".r
test(p)
}
private def test(p: Regex) = {
p.findAllIn(s).matchData foreach {
m => println(m.group(1) + " : " + m.group(2))
}
}
}
You need to build the pattern dynamically like this:
s"\\b(?:${num.mkString("|")})=[^\\u0001]*\\u0001*"
Details
\b - a word boundary
(?:num1|num2...|numN) - any of the values in the num variable
= - an equal sign
[^\u0001]* - zero or more chars other than a SOH char (a char with the decimal code of 1)
\u0001* - zero or more SOH chars.
A complete example:
val num = Seq(2,3)
val s = "1041=pqr\u000148=xyz\u000122=8\u00012=abc\u000148=123\u00013=def\u0001"
val pattern = s"\\b(?:${num.mkString("|")})=[^\\u0001]*\\u0001*"
// println(pattern) // => \b(?:2|3)=[^\u0001]*\u0001*
println(s.replaceAll(pattern, ""))
// => 1041=pqr\u000148=xyz\u000122=8\u000148=123\u0001
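As a non-regex alternative (a different technique from the pattern above, not a restatement of it), the message can be split on the SOH character and filtered by key. This sketch assumes every field is terminated by SOH, as in the sample input; the names SohFilterDemo and dropKeys are hypothetical.

```scala
object SohFilterDemo {
  private val Soh = '\u0001'

  // Drops every key=value field whose key is in `keys`; assumes SOH-terminated fields.
  def dropKeys(msg: String, keys: Set[Int]): String = {
    val unwanted = keys.map(_.toString) // compare keys as text to avoid parsing
    msg
      .split(Soh)
      .filter(_.nonEmpty)
      .filterNot(field => unwanted.contains(field.takeWhile(_ != '=')))
      .map(field => s"$field$Soh") // restore the terminator on each kept field
      .mkString
  }

  def main(args: Array[String]): Unit = {
    val s = "1041=pqr\u000148=xyz\u000122=8\u00012=abc\u000148=123\u00013=def\u0001"
    println(dropKeys(s, Set(2, 3)).replace(Soh, '|')) // show SOH as '|' for readability
  }
}
```

Comparing whole keys means 22=8 survives when 2 is removed, which is the job the word boundary does in the regex version.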

How does regex capturing work in scala?

Here is an example:
object RegexTest {
def main (args: Array[String]): Unit = {
val input = "Enjoy this apple 3.14 times"
val pattern = """.* apple ([\d.]+) times""".r
val pattern(amountText) = input
val amount = amountText.toDouble
println(amount)
}
}
I understand what this does, but how does val pattern(amountText) = input actually work? It looks very weird to me.
What that line is doing is calling Regex.unapplySeq (which is also called an extractor) to deconstruct input into a list of captured groups, and then bind each group to a new variable. In this particular scenario, only one group is expected to be captured and bound to the value amountText.
Validation aside, this is kinda what's going on behind the scenes:
// unapplySeq returns an Option; .get stands in for the validation skipped here
val capturedGroups = pattern.unapplySeq(input).get
val amountText = capturedGroups(0)
// And this (for a pattern with three capture groups):
val pattern(a, b, c) = input
// Would be equivalent to this:
val capturedGroups = pattern.unapplySeq(input).get
val a = capturedGroups(0)
val b = capturedGroups(1)
val c = capturedGroups(2)
It is very similar in essence to extracting tuples:
val (a, b) = (2, 3)
Or even pattern matching:
(2,3) match {
case (a, b) =>
}
In both of these cases, conceptually, Tuple2.unapply is being called.
I suggest you have a look at this page: http://docs.scala-lang.org/tutorials/tour/extractor-objects.html. It is the official tutorial on extractors, which is the pattern you are asking about.
I find that looking at the source makes it clear how it works : https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/matching/Regex.scala#L243
Note that your code val pattern(amountText) = input works, but only if the input is guaranteed to match the regex; if it does not match, a MatchError is thrown at runtime.
If you cannot be sure of that, I recommend writing it this way:
input match {
case pattern(amountText) => ...
case _ => ...
}
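A minimal sketch of that safer form, filling in the elided branches with one possible choice: return an Option so that callers must handle the no-match case. The names SafeExtractDemo and amount are hypothetical, and wrapping toDouble in Try is an extra assumption, added because [\d.]+ also accepts text such as 1.2.3 that is not a valid number.

```scala
import scala.util.Try

object SafeExtractDemo {
  val pattern = """.* apple ([\d.]+) times""".r

  // None when the input does not match or the captured text is not a valid number.
  def amount(input: String): Option[Double] = input match {
    case pattern(amountText) => Try(amountText.toDouble).toOption
    case _                   => None
  }

  def main(args: Array[String]): Unit = {
    println(amount("Enjoy this apple 3.14 times"))
    println(amount("Enjoy this pear 3.14 times"))
  }
}
```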