Scala regular expression : matching a long unicode Devanagari pattern fails

Scala regular expression : matching a long unicode Devanagari pattern fails - regex

Consider the following script code:
import scala.util.matching.Regex
val VIRAMA = "्"
val consonantNonVowelPattern = s"(म|त|य)([^$VIRAMA])".r
// val consonantNonVowelPattern = s"(थ|ठ|छ|स|ब|घ|ण|ट|ज|ग|न|ष|भ|ळ|ढ|ख|श|प|ह|ध|ङ|म|झ|ड|ल|व|र|फ|क|द|च|ञ|त|य)([^$VIRAMA])".r
var output = "असय रामः "
output = consonantNonVowelPattern.replaceAllIn(output, _ match {
case consonantNonVowelPattern(consonant, followingCharacter) =>
consonant + VIRAMA + "a" + followingCharacter
})
println("After virAma addition: " + output.mkString("-"))
It produces the following correct output:
After virAma addition: अ-स-य-्-a- -र-ा-म-्-a-ः-
However, if I use the longer pattern (commented out above), I get the following wrong output:
After virAma addition: अ-स-्-a-य- -र-्-a-ा-म-्-a-ः-
Is this a bug? Am I doing something wrong?

The below is thanks to Lalit Pant-
I'm assuming that the correct output for the second case is:
अ-स-्-a-य-्-a- -र-्-a-ा-म-्-a-ः-
If that's the case, read on. If not, tell me the expected output.
The problem appears to be that with your bigger 'consonantNonVowelPattern', the presence of 'सय' in 'output' makes 'य' show up as a 'followingCharacter' in your pattern match after the 'स' consonant.  'य' is consequently never reported as a consonant.

Related

Casting regex match to String

I created a simple code in Scala that checks whether an input is correctly formatted as HH:mm. I expect the code to result in an Array of valid strings. However, what I'm getting as a result is of type Any = Array(), which is problematic as when I try to print that result I get something like that:
[Ljava.lang.Object;#32a59591.
I guess it's a simple problem but being a Scala newbie I didn't manage to solve it even after a good few hours of googling and trial & error.
val scheduleHours = if (inputScheduleHours == "") {
dbutils.notebook.exit(s"ERROR: Missing param value for schedule hours.")
}
else {
val timePattern = """^((?:[0-30]?[0-9]|2[0-3]):[0-5][0-9])$""".r
val inputScheduleHoursParsed = inputScheduleHours.split(";").map(_.trim)
for (e <- inputScheduleHoursParsed) yield e match {
case timePattern(e) => e.toString
case _ => dbutils.notebook.exit(s"ERROR: Wrong param value for schedule hours: '${inputScheduleHours}'")
}
}

The problem is that some branches return the result you want and others return dbutils.notebook.exit which (I think) returns Unit. Scala must pick a type for the result that is compatible with both Unit and Array[String], and Any is the only one that fits.
One solution is to add a compatible value after the calls to dbutils.notebook.exit, e.g.
val scheduleHours = if (inputScheduleHours == "") {
dbutils.notebook.exit(s"ERROR: Missing param value for schedule hours.")
Array.empty[String]
}
Then all the branches return Array[String] so that will be the type of the result.

Using regex in Scala to group and pattern match

I need to process phone numbers using regex and group them by (country code) (area code) (number). The input format:
country code: between 1-3 digits
, area code: between 1-3 digits
, number: between 4-10 digits
Examples:
1 877 2638277
91-011-23413627
And then I need to print out the groups like this:
CC=91,AC=011,Number=23413627
This is what I have so far:
String s = readLine
val pattern = """([0-9]{1,3})[ -]([0-9]{1,3})[ -]([0-9]{4,10})""".r
val ret = pattern.findAllIn(s)
println("CC=" + ret.group(1) + "AC=" + ret.group(2) + "Number=" + ret.group(3));
The compiler said "empty iterator." I also tried:
val (cc,ac,n) = s
and that didn't work either. How to fix this?

The problem is with your pattern. I would recommend using some tool like RegexPal to test them. Put the pattern in the first text box and your provided examples in the second one. It will highlight the matched parts.
You added spaces between your groups and [ -] separators, and it was expecting spaces there. The correct pattern is:
val pattern = """([0-9]{1,3})[ -]([0-9]{1,3})[ -]([0-9]{4,10})""".r
Also if you want to explicitly get groups then you want to get a Match returned. For an example the findFirstMatchIn function returns the first optional Match or the findAllMatchIn returns a list of matches:
val allMatches = pattern.findAllMatchIn(s)
allMatches.foreach { m =>
println("CC=" + m.group(1) + "AC=" + m.group(2) + "Number=" + m.group(3))
}
val matched = pattern.findFirstMatchIn(s)
matched match {
case Some(m) =>
println("CC=" + m.group(1) + "AC=" + m.group(2) + "Number=" + m.group(3))
case None =>
println("There wasn't a match!")
}
I see you also tried extracting the string into variables. You have to use the Regex extractor in the following way:
val Pattern = """([0-9]{1,3})[ -]([0-9]{1,3})[ -]([0-9]{4,10})""".r
val Pattern(cc, ac, n) = s
println(s"CC=${cc}AC=${ac}Number=$n")
And if you want to handle errors:
s match {
case Pattern(cc, ac, n) =>
println(s"CC=${cc}AC=${ac}Number=$n")
case _ =>
println("No match!")
}
Also you can also take a look at string interpolation to make your strings easier to understand: s"..."

Scala regex "starts with lowercase alphabets" not working

val AlphabetPattern = "^([a-z]+)".r
def stringMatch(s: String) = s match {
case AlphabetPattern() => println("found")
case _ => println("not found")
}
If I try,
stringMatch("hello")
I get "not found", but I expected to get "found".
My understanding of the regex,
[a-z] = in the range of 'a' to 'z'
+ = one more of the previous pattern
^ = starts with
So regex AlphabetPattern is "all strings that start with one or more alphabets in the range a-z"
Surely I am missing something, want to know what.

Replace case AlphabetPattern() with case AlphabetPattern(_) and it works. The extractor pattern takes a variable to which it binds the result. Here we discard it but you could use x or whatever.
edit: Further to Randall's comment below, if you check the docs for Regex you'll see that it has an unapplySeq rather than an unapply method, which means it takes multiple variables. If you have the wrong number, it won't match, rather like
list match { case List(a,b,c) => a + b + c }
won't match if list doesn't have exactly 3 elements.

There are some issues with the match statement. s match is matching on the value of s which is checked against AlphabetPattern and _ which always evaluates to _ since s is never equal to "^([a-z]+)".r. Use one of the find methods in Scala.Util.Regex to look for a match with the given `Regex.
For example, using findFirstIn to find the first match of a string in AlphabetPattern.
scala> AlphabetPattern.findFirstIn("hello")
res0: Option[String] = Some(hello)
The stringMatch method using findFirstIn and a case statement:
scala> def stringMatch(s: String) = AlphabetPattern findFirstIn s match {
| case Some(s) => println("Found: " + s)
| case None => println("Not found")
| }
stringMatch: (s:String)Unit
scala> stringMatch("hello")
Found: hello

regex how can I split this word?

I have a list of several phrases in the following format
thisIsAnExampleSentance
hereIsAnotherExampleWithMoreWordsInIt
and I'm trying to end up with
This Is An Example Sentance
Here Is Another Example With More Words In It
Each phrase has the white space condensed and the first letter is forced to lowercase.
Can I use regex to add a space before each A-Z and have the first letter of the phrase be capitalized?
I thought of doing something like
([a-z]+)([A-Z])([a-z]+)([A-Z])([a-z]+) // etc
$1 $2$3 $4$5 // etc
but on 50 records of varying length, my idea is a poor solution. Is there a way to regex in a way that will be more dynamic? Thanks

A Java fragment I use looks like this (now revised):
result = source.replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
result = result.substring(0, 1).toUpperCase() + result.substring(1);
This, by the way, converts the string givenProductUPCSymbol into Given Product UPC Symbol - make sure this is fine with the way you use this type of thing
Finally, a single line version could be:
result = source.substring(0, 1).toUpperCase() + source(1).replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
Also, in an Example similar to one given in the question comments, the string hiMyNameIsBobAndIWantAPuppy will be changed to Hi My Name Is Bob And I Want A Puppy

For the space problem it's easy if your language supports zero-width-look-behind
var result = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "(?<=[a-z])([A-Z])", " $1");
or even if it doesn't support them
var result2 = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "([a-z])([A-Z])", "$1 $2");
I'm using C#, but the regexes should be usable in any language that support the replace using the $1...$n .
But for the lower-to-upper case you can't do it directly in Regex. You can get the first character through a regex like: ^[a-z] but you can't convet it.
For example in C# you could do
var result4 = Regex.Replace(result, "^([a-z])", m =>
{
return m.ToString().ToUpperInvariant();
});
using a match evaluator to change the input string.
You could then even fuse the two together
var result4 = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "^([a-z])|([a-z])([A-Z])", m =>
{
if (m.Groups[1].Success)
{
return m.ToString().ToUpperInvariant();
}
else
{
return m.Groups[2].ToString() + " " + m.Groups[3].ToString();
}
});

A Perl example with unicode character support:
s/\p{Lu}/ $&/g;
s/^./\U$&/;

Parametrized Regex for pattern matching

Is it possible to match regular expression pattern which is returned from a function? Can I do something like this?
def pattern(prefix: String) = (prefix + "_(\\w+)").r
val x = something match {
case pattern("a")(key) => "AAAA" + key
case pattern("b")(key) => "BBBB" + key
}
I cannot compile the above code. The following console snapshot shows an error I get. What am I doing wrong?
scala> def pattern(prefix: String) = (prefix + "_(\\w+)").r
pattern: (prefix: String)scala.util.matching.Regex
scala> def f(s:String) = s match {
| case pattern("a")(x) => s+x+"AAAAA"
<console>:2: error: '=>' expected but '(' found.
case pattern("a")(x) => s+x+"AAAAA"
^

This syntax is not supported by scala, you have to declare the extractor before you use it. See my earlier question on this topic.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Scala regular expression : matching a long unicode Devanagari pattern fails - regex

Related

Casting regex match to String

Using regex in Scala to group and pattern match

Scala regex "starts with lowercase alphabets" not working

regex how can I split this word?

Parametrized Regex for pattern matching

Categories

Resources