Groovy regex PatternSyntaxException when parsing GString-style variables - regex

Groovy here. I'm being given a String with GString-style variables in it like:
String target = 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
Keep in mind, this is not intended to be used as an actual GString!!! That is, I'm not going to have 3 string variables (animal, role and bodyPart, respectively) that Groovy will be resolving at runtime. Instead, I'm looking to do 2 distinct things to these "target" strings:
I want to be able to find all instances of these variable refs ("${*}") in the target string and replace each one with a ?; and
I also need to find all instances of these variable refs and obtain a list (allowing dupes) of their names (which, in the above example, would be [animal,role,bodyPart])
My best attempt thus far:
class TargetStringUtils {
private static final String VARIABLE_PATTERN = "\${*}"
// Example input: 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
// Example desired output: 'How now brown ?. The ? has oddly-shaped ?.'
static String replaceVarsWithQuestionMarks(String target) {
target.replaceAll(VARIABLE_PATTERN, '?')
}
// Example input: 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
// Example desired output: [animal,role,bodyPart] } list of strings
static List<String> collectVariableRefs(String target) {
target.findAll(VARIABLE_PATTERN)
}
}
...produces a PatternSyntaxException any time I run either method:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal repetition near index 0
${*}
^
Any ideas where I'm going awry?

The issue is that you have not escaped the pattern properly, and findAll will only collect all matches, while you need to capture a subpattern inside the {}.
Use
def target = 'How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}.'
println target.replaceAll(/\$\{([^{}]*)\}/, '?') // => How now brown ?. The ? has oddly-shaped ?.
def lst = new ArrayList<>();
def m = target =~ /\$\{([^{}]*)\}/
(0..<m.count).each { lst.add(m[it][1]) }
println lst // => [animal, role, bodyPart]
See this Groovy demo
Inside a /\$\{([^{}]*)\}/ slashy string, you can use single backslashes to escape the special regex metacharacters, and the whole regex pattern looks cleaner.
\$ - will match a literal $
\{ - will match a literal {
([^{}]*) - Group 1 capturing any characters other than { and }, 0 or more times
\} - a literal }.
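For readers outside Groovy, here is a minimal sketch of the same two operations in Python, assuming the same `\$\{([^{}]*)\}` pattern. The function names are illustrative ports of the question's method names, not part of any library.

```python
import re
from typing import List

# Matches "${name}" and captures the name in group 1 (same pattern as above).
VARIABLE_PATTERN = re.compile(r"\$\{([^{}]*)\}")

def replace_vars_with_question_marks(target: str) -> str:
    """Replace every ${...} reference with a literal '?'."""
    return VARIABLE_PATTERN.sub("?", target)

def collect_variable_refs(target: str) -> List[str]:
    """Return the captured names in order of appearance, duplicates included."""
    # With exactly one capture group, findall returns the group contents.
    return VARIABLE_PATTERN.findall(target)

target = "How now brown ${animal}. The ${role} has oddly-shaped ${bodyPart}."
print(replace_vars_with_question_marks(target))  # How now brown ?. The ? has oddly-shaped ?.
print(collect_variable_refs(target))             # ['animal', 'role', 'bodyPart']
```

The key point carries over unchanged: `$`, `{` and `}` must be escaped so the engine does not read `{*}` as a malformed repetition.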

Related

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest thing to filter the list strings would be to use Js.Re.test[https://rescript-lang.org/docs/manual/latest/api/js/re#test_], and it is (easy).
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
let break = ref(false);
while (!break.contents) {
switch (Js.Re.exec_ (re, searchStr)) {
| Some(result) => {
let match = Js.Re.captures(result)[0];
Js.log2("Matching: ", match)
}
| None => {
break := true;
}
}
}
};
input->escapeRegExCtrl;
If I disregard the "test" portion of the string being skipped, the above output will produce:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, what I'm ultimately trying to produce is this (with a leading and trailing .*):
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for Js.Re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
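Let's keep this simple and process the string in two steps, each one using a regex replace (JavaScript's String.prototype.replace, which ReasonML exposes as Js.String.replaceByRe). The special characters in a regex are \ ^ $ . | ? * + ( ) [ ] { }. Suitably quoting all of these except * for replace:
str.replace(/[\[\]\\\^\$\.\|\?\+\(\)\{\}]/g, '\\$&')
The character class is just all those special characters quoted. In the replacement, \\ emits a literal backslash and $& inserts whatever matched, so each special character ends up preceded by a backslash.
Then pass that result to a second replace for the * to .*? transformation. Note that the * must itself be escaped in the pattern:
str.replace(/\*+/g, '.*?')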
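The two-step idea (quote the metacharacters, then turn runs of * into a lazy wildcard) can be sketched in Python as follows. This is an illustration of the technique rather than the ReasonML code the question asks for; `glob_to_regex` is a hypothetical helper name, and it relies on `re.escape` also escaping `*`, so the second step looks for the escaped form `\*`.

```python
import re

def glob_to_regex(glob):
    """Compile a glob whose only control character is '*' into a regex."""
    # Step 1: escape every regex metacharacter, including '*' (it becomes '\*').
    escaped = re.escape(glob)
    # Step 2: turn each run of escaped stars into a lazy wildcard.
    wildcarded = re.sub(r"(?:\\\*)+", ".*?", escaped)
    return re.compile(wildcarded)

print(glob_to_regex("test^ing?123[foo").pattern)  # test\^ing\?123\[foo

# Filtering a list: search() is unanchored, so no leading/trailing .* is needed.
names = ["my test^ing?123[foo file", "testing123foo", "other"]
pattern = glob_to_regex("test^ing?123[foo")
print([n for n in names if pattern.search(n)])    # ['my test^ing?123[foo file']
```

Because `search` matches anywhere in the string, the surrounding `.*` from the question is only needed if the regex is later used with a full-match API.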

How to regex the class name out of this?

So imagine I have a big long string, and inside it I have this piece of text:
(BlahUtils.loggerName(MyClass.class.getName())
I want to extract out "MyClass".
If I do:
def matcher1 = test =~ /MyClass/
matcher1[0]
I get it. But then MyClass can be anything and that is what I want to extract out. How do I do that?
You may use
/(?<=loggerName\()\w+(?=\.class\b)/
See the regex demo
Details
(?<=loggerName\() - right before, there must be loggerName( substring
\w+ - 1+ word chars
(?=\.class\b) - right after, there must be a .class as whole word.
See the Groovy demo:
String test = "(BlahUtils.loggerName(MyClass.class.getName())"
def m = (test =~ /(?<=loggerName\()\w+(?=\.class\b)/)
if (m) {
println m.group();
}
Simple no-brainer:
'(BlahUtils.loggerName(MyClass.class.getName())'.eachMatch( /loggerName\(([^\(\)\.]+)/ ){ println it[ 1 ] }
gives MyClass
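Both answers can be compared side by side in a short Python sketch: one pattern uses lookarounds so the whole match is the class name, the other uses a capture group. The helper names are hypothetical, and the capture pattern here uses `\w+` rather than the `[^\(\)\.]+` class from the one-liner above.

```python
import re

# Lookaround version: the match itself is the class name.
LOOKAROUND = re.compile(r"(?<=loggerName\()\w+(?=\.class\b)")
# Capture-group version: the class name is in group 1.
CAPTURE = re.compile(r"loggerName\((\w+)\.class\b")

def class_name_by_lookaround(text):
    m = LOOKAROUND.search(text)
    return m.group(0) if m else None

def class_name_by_capture(text):
    m = CAPTURE.search(text)
    return m.group(1) if m else None

test = "(BlahUtils.loggerName(MyClass.class.getName())"
print(class_name_by_lookaround(test))  # MyClass
print(class_name_by_capture(test))     # MyClass
```

The capture-group form is usually easier to port, since some engines restrict what a lookbehind may contain.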

preg_match pattern with slashes stored in variable

I'm having trouble with this regex. (https://regex101.com/r/vQLlyY/1)
My pattern is working and is:
(?<=Property: )(.*?)(?= \(Contact)|(?<=Property: )(.*?)(?= - )
You'll see in the link that the property text is extracted in both these strings:
Property: This is the property (Contact - Warren)
Property: This is another property - Warren
In my code, this pattern is stored like this:
$this->rex["property"][2] = '/(?<=Property: )(.*?)(?= \(Contact)|(?<=Property: )(.*?)(?= - )/s'
Then, it is extracted like this:
foreach ($this->rex as $key => $value ) {
if (isset($value[$provider])) {
preg_match_all($value[$provider], $emailtext, $matches);
if (!empty($matches[1][0])) {
$emaildetails[$key] = trim(preg_replace("/\r|\n/", "", $matches[1][0]));
} else {
$emaildetails[$key] = "";
}
}
}
In this example, $provider = 2
My problem, I'm sure, is with the backslash, because I can't get this code to pick up the (Contact part of the pattern where I need to escape the bracket. I know the code works because I have many other patterns in use. Also, this works for the property text if the pattern is stored like this:
$this->rex["property"][2] = '/(?<=Property: )(.*?)(?= - )/s'
So, am I storing the pattern correctly with the escaped bracket, or is that even my problem? Thanks in advance!
Because you're using separate capture groups, the different paths are ending up in different match indexes. For instance, the first line (the Contact - Warren one) is storing the match result in index 1, where the second line has an empty string in index 1 and the match result you're looking for in index 2.
To solve this issue, you can use non-capture groups or you can rewrite your expression to use positive lookaheads. The benefits of the former include allowing for quantifiers. The benefits of the latter include not having the entire match result end up in your 0 match index.
Example of non-capture group: (?<=Property: )(.*?)\s*(?:\(Contact|- ) https://regex101.com/r/vQLlyY/2.
Example of positive-lookahead: (?<=Property: )(.*?)(?= \(Contact| - ) https://regex101.com/r/vQLlyY/3.
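To make the index problem concrete, here is a small Python sketch (not PHP) that runs the original two-group pattern and the single-lookahead rewrite against both sample lines. One difference to keep in mind: Python reports an unmatched group as `None`, whereas `preg_match_all` yields an empty string.

```python
import re

# Original: two alternatives, each with its own capture group.
ORIGINAL = re.compile(r"(?<=Property: )(.*?)(?= \(Contact)|(?<=Property: )(.*?)(?= - )", re.S)
# Rewrite: one capture group, alternation moved inside the lookahead.
REWRITTEN = re.compile(r"(?<=Property: )(.*?)(?= \(Contact| - )", re.S)

line1 = "Property: This is the property (Contact - Warren)"
line2 = "Property: This is another property - Warren"

m = ORIGINAL.search(line2)
print(m.group(1), "|", m.group(2))       # None | This is another property

for line in (line1, line2):
    print(REWRITTEN.search(line).group(1))
# This is the property
# This is another property
```

With the rewritten pattern, code that only ever reads group 1 (like `$matches[1][0]` in the question) sees the value for both layouts.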

Overlapping matches in Regex - Scala

I'm trying to extract all possible combinations of 3 letters from a String following the pattern XYX.
val text = "abaca dedfd ghgig"
val p = """([a-z])(?!\1)[a-z]\1""".r
p.findAllIn(text).toArray
When I run the script I get:
aba, ded, ghg
And it should be:
aba, aca, ded, dfd, ghg, gig
It does not detect overlapped combinations.
The way to do this is to enclose the whole pattern in a lookahead, so that each match consumes no characters and the next attempt starts one position later:
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
p.findAllIn(text).matchData foreach {
m => println(m.group(1))
}
The lookahead is only an assertion (a test) for the current position and the pattern inside doesn't consume characters. The result you are looking for is in the first capture group (that is needed to get the result since the whole match is empty).
You need to capture the whole pattern and put it inside a positive lookahead. The code in Scala will be the following:
object Main extends App {
val text = "abaca dedfd ghgig"
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
val allMatches = p.findAllMatchIn(text).map(_.group(1))
println(allMatches.mkString(", "))
// => aba, aca, ded, dfd, ghg, gig
}
See the online Scala demo
Note that the backreference will turn to \2 as the group to check will have ID = 2 and Group 1 will contain the value you need to collect.
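The same lookahead trick is not specific to Scala. As a rough illustration, this Python sketch contrasts the consuming pattern with the lookahead-wrapped one; `find_xyx` is a hypothetical helper name.

```python
import re

# Consuming pattern: a match uses up its characters, so overlaps are skipped.
CONSUMING = re.compile(r"([a-z])(?!\1)[a-z]\1")
# Zero-width pattern: the real text is captured in group 1 inside a lookahead,
# and the inner backreference shifts from \1 to \2.
OVERLAPPING = re.compile(r"(?=(([a-z])(?!\2)[a-z]\2))")

def find_xyx(text):
    """Return every XYX triple, including overlapping ones."""
    return [m.group(1) for m in OVERLAPPING.finditer(text)]

text = "abaca dedfd ghgig"
print([m.group(0) for m in CONSUMING.finditer(text)])  # ['aba', 'ded', 'ghg']
print(find_xyx(text))  # ['aba', 'aca', 'ded', 'dfd', 'ghg', 'gig']
```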

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which does not have single-line mode):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
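For comparison, a minimal Python sketch of the same match-everything-then-replace approach. Python writes group references in the replacement as `\1\2` rather than `$1$2`, and `re.S` plays the role of the single-line/dotall option; `keep_outer` is a hypothetical name.

```python
import re

def keep_outer(s):
    """Match the whole input, keep only the ABC and DEF groups."""
    return re.sub(r"^.*(ABC).*(DEF).*$", r"\1\2", s, flags=re.S)

print(keep_outer("ABC 123 DEF"))            # ABCDEF
print(keep_outer("xx ABC\n123\nDEF yy"))    # ABCDEF
print(keep_outer("no match here"))          # no match here
```

As the last call shows, input that does not contain the pattern is returned unchanged, so callers may want to check for a match separately.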
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}
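The group-concatenation variant looks like this as a Python sketch, assuming the same `(ABC).*(DEF)` pattern; `join_groups` is a hypothetical helper. Unlike the replace-based approach, it makes the no-match case explicit.

```python
import re

def join_groups(s):
    """Concatenate all capture groups of the first match, or None if no match."""
    m = re.search(r"(ABC).*(DEF)", s)
    return "".join(m.groups()) if m else None

print(join_groups("ABC 123 DEF"))  # ABCDEF
print(join_groups("xyz"))          # None
```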