Cannot get `Regex::replace()` to replace a numbered capture group - regex

I'm porting a pluralizer to Rust and I'm having some difficulty with regular expressions. I can't get the Regex::replace() method to replace a numbered capture group as I would expect. For example, the following displays an empty string:
let re = Regex::new("(m|l)ouse").unwrap();
println!("{}", re.replace("mouse", "$1ice"));
I would expect it to print "mice", as it does in JavaScript (or Swift, Python, C# or Go)
var re = RegExp("(m|l)ouse")
console.log("mouse".replace(re, "$1ice"))
Is there some method I should be using instead of Regex::replace()?
Examining the Inflector crate, I see that it extracts the first capture group and then appends the suffix to the captured text:
if let Some(c) = rule.captures(&non_plural_string) {
if let Some(c) = c.get(1) {
return format!("{}{}", c.as_str(), replace);
}
}
However, given that replace works in every other language I've used regular expressions in, I would expect it work in Rust as well.

As mentioned in the documentation:
The longest possible name is used. e.g., $1a looks up the capture group named 1a and not the capture group at index 1. To exert more precise control over the name, use braces, e.g., ${1}a.
And
Sometimes the replacement string requires use of curly braces to
delineate a capture group replacement and surrounding literal text.
For example, if we wanted to join two words together with an
underscore:
let re = Regex::new(r"(?P<first>\w+)\s+(?P<second>\w+)").unwrap();
let result = re.replace("deep fried", "${first}_$second");
assert_eq!(result, "deep_fried");
Without the curly braces, the capture group name first_ would be used,
and since it doesn't exist, it would be replaced with the empty
string.
You want re.replace("mouse", "${1}ice")

Related

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest thing to filter the list strings would be to use Js.Re.test[https://rescript-lang.org/docs/manual/latest/api/js/re#test_], and it is (easy).
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
let break = ref(false);
while (!break.contents) {
switch (Js.Re.exec_ (re, searchStr)) {
| Some(result) => {
let match = Js.Re.captures(result)[0];
Js.log2("Matching: ", match)
}
| None => {
break := true;
}
}
}
};
search -> escapeRegExCtrl
If I disregard the "test" portion of the string being skipped, the above output will produce:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, at the end of the day, what I'm trying to produce is this (with leading and following .*:
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for js.re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
Let's keep this simple and process the string in two steps, each one using Js.re.replace. So the list of special characters in regex are [^$.|?*+(). Suitably quoting these for replace:
str.replace(/[\[\\\^\$\.\|\?\+\(\)]/g, '\$&')
This is just all those special characters quoted. The $& in the replacement specifications says to insert whatever matched.
Then pass that result to a second replace for the * to .*? transformation.
str.replace(/*+/g, '.*?')

Match same number of repetitions as previous group

I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit As I feared, and pointed out by answers and comments, this is not possible to represent in regex, at least not without some sort of recursion which the Rust regex crate doesn't for the time being support. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
let group2_start = string.find(group2).unwrap();
let group1_reps = (string.len() - group2_start) / group1.len();
let group2_reps = group2_start / group2.len();
group1_reps == group2_reps
}
fn main() {
assert_eq!(matching_reps("abc123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
playground
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is however a simple way to compute it :
use a regexp to make sure your string is a ^((abc)*)((123)*)$
if your string matches, take the two captured substrings, and compare their lengths
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
See proof. (?:)? is redundant, it matches no text, (?:...)? matches an optional pattern.
Rust snippet:
let a = "abc"; // Prefix
let b = "123"; // Suffix
let level = 3; // Recursion (repetition) level
let mut result = "".to_string();
for _n in 0..level {
result = format!("{}(?:{})?{}", a, result, b);
}
println!("{}", result);
// abc(?:abc(?:abc(?:)?123)?123)?123
There's an extension to the regexp libraries, that is implemented from the old times unix and that allows to match (literally) an already scanned group literally after the group has been matched.
For example... let's say you have a number, and that number must be equal to another (e.g. the score of a soccer game, and you are interested only in draws between the two teams) You can use the following regexp:
([0-9][0-9]*) - \1
and suppose we feed it with "123-123" (it will match) but if we use "123-12" that will not match, as the \1 is not the same string as what was matched in the first group. When the first group is matched, the actual regular expression converts the \1 into the literal sequence of characters that was matched in the first group.
But there's a problem with your sample... is that there's no way to end the first group if you try:
([0-9][0-9]*)\1
to match 123123, because the automaton cannot close the first group (you need at least a nondigit character to make the first group to finalize)
But for example, this means that you can use:
\+(\([0-9][0-9]*\))\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(the number in between the parenthesys is catched as a sample, and then you use the group to build a sequence of that number separated by dashes. You can se the expression working in this demo.)

Evaluate second regex after one regex pass

I want to grab value after one regex is passed. The sample is
My test string is ##[FirstVal][SecondVal]##
I want to grab FirstVal and SecondVal.
I have tried \#\#(.*?)\#\# pattern but only return [FirstVal][SecondVal].
Is it possible to evaluate result of one regex and apply another regex?
In .NET, you may use a capture stack to grab all the repeated captures.
A regex like
##(?:\[([^][]*)])+##
will find ##, then match and capture any amount of strings like [ + texts-with-no-brackets + ] and all these texts-with-no-brackets will be pushed into a CaptureCollection that is associated with capture group 1.
See the regex demo online
In C#, you would use the following code:
var s = "My test string is ##[FirstVal][SecondVal]##";
var myvalues = Regex.Matches(s, #"##(?:\[([^][]*)])+##")
.Cast<Match>()
.SelectMany(m => m.Groups[1].Captures
.Cast<Capture>()
.Select(t => t.Value));
Console.WriteLine(string.Join(", ", myvalues));
See the C# demo
Mind you can do a similar thing with Python PyPi regex module.
It will make a difference as to what programming language you are using as the modules might vary slightly. I used python for my solution, since you didn't specify what language you were using, and you could use two parentheses with the #'s on either side and using an escape character to make regex not match the square braces (ie. \[(\w+)\]. Where in the python re module the \w represents the denotation for a-zA-Z0-9_.
import re
data = "##[FirstVal][SecondVal]##"
x = re.search(r'##\[(\w+)\]\[(\w+)\]', data)
print(x.groups())
Which prints ('FirstVal', 'SecondVal')

Parsing of a string with the length specified within the string

Example data:
029Extract this specific string. Do not capture anything else.
In the example above, I would like to capture the first n characters immediately after the 3 digit entry which defines the value of n. I.E. the 29 characters "Extract this specific string."
I can do this within a loop, but it is slow. I would like (if it is possible) to achieve this with a single regex statement instead, using some kind of backreference. Something like:
(\d{3})(.{\1})
With perl, you can do:
my $str = '029Extract this specific string. Do not capture anything else.';
$str =~ s/^(\d+)(.*)$/substr($2,0,$1)/e;
say $str;
output:
Extract this specific string.
You can not do it with single regex, while you can use knowledge where regex stop processing to use substr. For example in JavaScript you can do something like this http://jsfiddle.net/75Tm5/
var input = "blahblah 011I want this, and 029Extract this specific string. Do not capture anything else.";
var regex = /(\d{3})/g;
var matches;
while ((matches = regex.exec(input)) != null) {
alert(input.substr(regex.lastIndex, matches[0]));
}
This will returns both lines:
I want this
Extract this specific string.
Depending on what you really want, you can modify Regex to match only numbers starting from line beginning, match only first match etc
Are you sure you need a regex?
From https://stackoverflow.com/tags/regex/info:
Fools Rush in Where Angels Fear to Tread
The tremendous power and expressivity of modern regular expressions
can seduce the gullible — or the foolhardy — into trying to use
regular expressions on every string-related task they come across.
This is a bad idea in general, ...
Here's a Python three-liner:
foo = "029Extract this specific string. Do not capture anything else."
substr_len = int(foo[:3])
print foo[3:substr_len+3]
And here's a PHP three-liner:
$foo = "029Extract this specific string. Do not capture anything else.";
$substr_len = (int) substr($foo,0,3);
echo substr($foo,3,substr_len+3);

Regex to remove characters up to a certain point in a string

How do I use regex to convert
11111aA$xx1111xxdj$%%`
to
aA$xx1111xxdj$%%
So, in other words, I want to remove (or match) the FIRST grouping of 1's.
Depending on the language, you should have a way to replace a string by regex. In Java, you can do it like this:
String s = "11111aA$xx1111xxdj$%%";
String res = s.replaceAll("^1+", "");
The ^ "anchor" indicates that the beginning of the input must be matched. The 1+ means a sequence of one or more 1 characters.
Here is a link to ideone with this running program.
The same program in C#:
var rx = new Regex("^1+");
var s = "11111aA$xx1111xxdj$%%";
var res = rx.Replace(s, "");
Console.WriteLine(res);
(link to ideone)
In general, if you would like to make a match of anything only at the beginning of a string, add a ^ prefix to your expression; similarly, adding a $ at the end makes the match accept only strings at the end of your input.
If this is the beginning, you can use this:
^[1]*
As far as replacing, it depends on the language. In powershell, I would do this:
[regex]::Replace("11111aA$xx1111xxdj$%%","^[1]*","")
This will return:
aA$xx1111xxdj$%%
If you only want to replace consecutive "1"s at the beginning of the string, replace the following with an empty string:
^1+
If the consecutive "1"s won't necessarily be the first characters in the string (but you still only want to replace one group), replace the following with the contents of the first capture group (usually \1 or $1):
1+(.*)
Note that this is only necessary if you only have a "replace all" capability available to you, but most regex implementations also provide a way to replace only one instance of a match, in which case you could just replace 1+ with an empty string.
I'm not sure but you can try this
[^1](\w*\d*\W)* - match all as a single group except starting "1"(n) symbols
In Javascript
var str = '11111aA$xx1111xxdj$%%';
var patt = /^1+/g;
str = str.replace(patt,"");