scala matching optional set of characters - regex

I am using scala regex to extract a token from a URL
my url is http://www.google.com?x=10&id=x10_23&y=2
here I want to extract the value of x10 in front of id. note that _23 is optional and may or may not appear but if it appears it must be removed.
The regex which I have written is
val regex = "^.*id=(.*)(\\_\\d+)?.*$".r
x match {
case regex(id) => print(id)
case _ => print("none")
}
this should work because (\\_\\d+)? should make the _23 optional as a whole.
So I don't understand why it prints none.

Note that your pattern ^.*id=(.*)(\\_\\d+)?.*$ actually puts x10_23&y=2 into Group 1 because of the 1st greedy dot matching subpattern. Since (_\d+)? is optional, the first greedy subpattern does not have to yield any characters to that capture group.
You can use
val regex = "(?s).*[?&]id=([^\\W&]+?)(?:_\\d+)?(?:&.*)?".r
val x = "http://www.google.com?x=10&id=x10_23&y=2"
x match {
case regex(id) => print(id)
case _ => print("none")
}
See the IDEONE demo (regex demo)
Note that there is no need defining ^ and $ - that pattern is anchored in Scala by default. (?s) ensures we match the full input string even if it contains newline symbols.

Another idea instead of using a regular expression to extract tokens would be to use the built-in URI Java class with its getQuery() method. There you can split the query by = and then check if one of the pair starts with id= and extract the value.
For instance (just as an example):
val x = "http://www.google.com?x=10&id=x10_23&y=2"
val uri = new URI(x)
uri.getQuery.split('&').find(_.startsWith("id=")) match {
case Some(param) => println(param.split('=')(1).replace("_23", ""))
case None => println("None")
}
I find it simpler to maintain that the regular expression you have, but that's just my thought!

Related

How to replace part of string using regex pattern matching in scala?

I have a String which contains column names and datatypes as below:
val cdt = "header:integer|releaseNumber:numeric|amountCredit:numeric|lastUpdatedBy:numeric(15,10)|orderNumber:numeric(20,0)"
My requirement is to convert the postgres datatypes which are present as numeric, numeric(15,10) into spark-sql compatible datatypes.
In this case,
numeric -> decimal(38,30)
numeric(15,10) -> decimal(15,10)
numeric(20,0) -> bigint (This is an integeral datatype as there its precision is zero.)
In order to access the datatype in the string: cdt, I split it and created a Seq from it.
val dt = cdt.split("\\|").toSeq
Now I have a Seq of elements in which each element is a String in the below format:
Seq("header:integer", "releaseNumber:numeric","amountCredit:numeric","lastUpdatedBy:numeric(15,10)","orderNumber:numeric(20,0)")
I have the pattern matching regex: """numeric\(\d+,(\d+)\)""".r, for numeric(precision, scale) which only works if there is a
scale of two digits, ex: numeric(20,23).
I am very new to REGEX and Scala & I don't understand how to create regex pattterns for the remaining two cases & apply it on a string to match a condition. I tried it in the below way but it gives me a compilation error: "Cannot resolve symbol findFirstMatchIn"
dt.map(e => e.split("\\:")).map(e => changeDataType(e(0), e(1)))
def changeDataType(colName: String, cd:String): String = {
val finalColumns = new String
val pattern1 = """numeric\(\d+,(\d+)\)""".r
cd match {
case pattern1.findFirstMatchIn(dt) =>
}
}
I am trying to get the final output into a String as below:
header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
How to multiple regex patterns for different cases to check/apply pattern matching on datatype of each value in the seq and change it to my suitable datatype as mentioned above.
Could anyone let me know how can I achieve it ?
It can be done with a single regex pattern, but some testing of the match results is required.
val numericRE = raw"([^:]+):numeric(?:\((\d+),(\d+)\))?".r
cdt.split("\\|")
.map{
case numericRE(col,a,b) =>
if (Option(b).isEmpty) s"$col:decimal(38,30)"
else if (b == "0") s"$col:bigint"
else s"$col:decimal($a,$b)"
case x => x //pass-through
}.mkString("|")
//res0: String = header:integer|releaseNumber:decimal(38,30)|amountCredit:decimal(38,30)|lastUpdatedBy:decimal(15,10)|orderNumber:bigint
Of course it can be done with three different regex patterns, but I think this is pretty clear.
explanation
raw - don't need so many escape characters - \
([^:]+) - capture everything up to the 1st colon
:numeric - followed by the string ":numeric"
(?: - start a non-capture group
\((\d+),(\d+)\) - capture the 2 digit strings, separated by a comma, inside parentheses
)? - the non-capture group is optional
numericRE(col,a,b) - col is the 1st capture group, a and b are the digit captures, but they are inside the optional non-capture group so they might be null

How to find the whole word in kotlin using regex

I want to find the whole word in string. But I don't know how to find the all word in kotlin. My finding word is [non alpha]cba[non alpha]. My example code is bellows.
val testLink3 = """cba#cba cba"""
val word = "cba"
val matcher = "\\b[^a-zA-Z]*(?i)$word[^a-zA-Z]*\\b".toRegex()
val ret = matcher.find(testLink3)?.groupValues
But output of my source code is "cba"
My expected value is string array such as "{cba, cba, cba}".
How to find this value in kotlin language.
You may use
val testLink3 = """cBa#Cba cbA123"""
val word = "cba"
val matcher = "(?i)(?<!\\p{L})$word(?!\\p{L})".toRegex()
println(matcher.findAll(testLink3).map{it.value}.toList() )
println(matcher.findAll(testLink3).count() )
// => [cBa, Cba, cbA]
// => 3
See the online Kotlin demo.
To obtain the list of matches, you need to run the findAll method on the regex object, map the results to values and cast to list:
.findAll(testLink3).map{it.value}.toList()
To count the matches, you may use
matcher.findAll(testLink3).count()
Regex demo
(?i) - case insensitive modifier
(?<!\\p{L}) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a letter
$word - your word variable value
(?!\\p{L}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a letter.

Scala Anchored Regex acts as unachored

So for some reason in Scala 2.11, my anchored regex patterns act as unanchored regex patterns.
scala> """something\.com""".r.anchored findFirstIn "app.something.com"
res66: Option[String] = Some(something.com)
scala> """^.something\.com$""".r.anchored findFirstIn "app.something.com"
res65: Option[String] = None
I thought the first expression would evaluate as None like the second (manually entered anchors) but it does not.
Any help would be appreciated.
The findFirstIn method un-anchors the regex automatically.
You can see that the example code also matches A only:
Example:
"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"
BTW, once you create a regex like "pattern".r, it is anchored by default, but that only matters when you use the regex in a match block. Inside the FindAllIn or FindFirstIn, this type of anchoring is just ignored.
So, to make sure the regex matches the whole string, always add ^ and $ (or \A and \z) anchors if you are not sure where you are going to use the regexes.
I think, it is only supposed to work with match:
val reg = "o".r.anchored
"foo" match {
case reg() => "Yes!"
case _ => "No!"
}
... returns "No!".
This doesn't seem very useful, because just "o".r is anchored by default anyway. The only use of this I can imagine is if you made some unanchored (by accident? :)), and then want to undo it, or if you just want to match both cases, but s
eparately:
val reg = "o".r.unanchored
"foo" match {
case reg.anchored() => "Anchored!
case reg() => "Unanchored"
case _ => "I dunno"
}

How to create "blocks" with Regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
Still gets detected as a correct string, though the order is wrong, it needs to be in the order specified below, and in no other order.
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!
To match the \, folloed with x, followed with 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
See the regex demo
To force it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()){
res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [a-zA-Z] to [a-z] if we add a case insensitive modifier (?i) to the start of the pattern, or just use \p{Alnum} that matches any alphanumeric char in a Java regex.
The String#matches method always anchors the regex by default, we do not need the leading ^ and trailing $ anchors when using the pattern inside it.

scala regex to match tab separated words from a string

I'm trying to match the following string
"name type this is a comment"
Name and type are definitely there.
Comment may or may not exist.
I'm trying to store this into variables n,t and c.
val nameTypeComment = """^(\w+\s+){2}(?:[\w+\s*)*\(\,\,]+)"""
str match { case nameType(n, t, c) => print(n,t,c) }
This is what I have but doesn't seem to be working. Any help is appreciated.
val nameType = """^(\w+)\s+([\w\)\(\,]+)""".r
However this works when i was trying to work with strings only with name and type and no comment which is a group of words which might or not be there.
Note that ^(\w+\s+){2}(?:[\w+\s*)*\(\,\,]+) regex only contains 1 capturing group ((\w+\s+)) while you define 3 in the match block.
The ^(\w+)\s+([\w\)\(\,]+) only contains 2 capturing groups: (\w+) and ([\w\)\(\,]+).
To make your code work, you need to define 3 capturing groups. Also, it is not clear what the separators are, let me assume the first two fields are just 1 or more alphanumeric/underscore symbols separated by 1 or more whitespaces. The comment is anything after 2 first fields.
Then, use
val s = "name type this comment a comment"
val nameType = """(\w+)\s+(\w+)\s+(.*)""".r
val res = s match {
case nameType(n, t, c) => print(n,t,c)
case _ => print("NONE")
}
See the online demo
Note that we need to compile a regex object, pay attention at the .r after the regex pattern nameType.
Note that a pattern inside match is anchored by default, the start of string anchor ^ can be omitted.
Also, it is a good idea to add case _ to define the behavior when no match is found.