Why doesn't (...)? regular expression capture the string? - c++

I have code where a QString is modified using a regular expression:
QString str; // str is the string that shall be modified
QString pattern, after; // pattern and after are parameters provided as arguments
str.replace(QRegularExpression(pattern), after);
Whenever I need to append something to the end of the string I use the arguments:
QString pattern("$");
QString after("ending");
Now I have a case where the same pattern is applied twice, but it should append the string only once. I expected this to work (assuming the initial string doesn't end with "ending"):
QString pattern("(ending)?$");
QString after("ending");
But if applied twice, this pattern produces a double ending: "<initial string>endingending".
It looks like the ()? expression is lazy; it only captures the expression in parentheses if I force it with a sub-expression before it:
QString pattern("string(ending)?$");
QString after("ending");
QString str("Init string");
str.replace(QRegularExpression(pattern), after);
// str == "Init ending"
What's wrong with the "()?" construction (why is it lazy), and how can I achieve my goal?
I'm using Qt 5.14.0 (due to some dependencies I cannot use Qt6).

A pattern like (foo)?$ matches twice at the end of a string ending with foo. You can easily see this in action in Perl or at https://regex101.com/r/3Oqwo1/1 :
$ perl -E '$_ = "abcfoo"; while ($_ =~ /(foo)?$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched |foo| from 3 to 6
Matched || from 6 to 6
Therefore you'll do two substitutions at the end, defeating your purpose.
A way to see this is that patterns match between characters:
            /-----------\
            v           v  first pattern matches here
| a | b | c | f | o | o |
                       ^ ^
                       \-/ second pattern matches here
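The same double match can be reproduced outside Perl. The following is a minimal sketch in Python, assuming Python 3.7 or later (where `re.sub` also replaces an empty match that is adjacent to a previous non-empty one); the sample string is the one used above.

```python
import re

pattern = re.compile(r"(foo)?$")

# Every match the engine reports for a string ending in "foo":
# first "foo" itself, then the empty string at the very end.
spans = [m.span() for m in pattern.finditer("abcfoo")]
print(spans)

# Both matches are replaced, so the suffix ends up in the string twice.
result = pattern.sub("foo", "abcfoo")
print(result)
```

The two spans correspond to the two Perl matches shown above, and the substitution result carries the suffix twice.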
If the "tail" is fixed-length, you can use a negative lookbehind, like already suggested: (?<!foo)$.
$ perl -E '$_ = "abcfoo"; while ($_ =~ /(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
# no match
$ perl -E '$_ = "abcfie"; while ($_ =~ /(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched || from 6 to 6
Note that there's no .* before, nor ? after the negative lookbehind. If you add them, you'll again break the matching:
$ perl -E '$_ = "abcfie"; while ($_ =~ /.*(?<!foo)$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched |abcfie| from 0 to 6
Matched || from 6 to 6
Global matching will happen twice in abcfie, once matching the entire string, and again matching the empty string at the end (look at the offsets). This will result in 2 replacements.
$ perl -E '$_ = "abcfoo"; while ($_ =~ /(?<!foo)?$/g) { say "Matched |$&| from $-[0] to $+[0]"; }'
Matched || from 6 to 6
This will match at the very end of the string, resulting in a replacement that you don't want (string already ends in foo).
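As an illustration of the negative lookbehind approach, here is a small Python sketch. The helper name `append_once` is hypothetical, and `\Z` is used instead of `$` because in Python `$` also matches just before a trailing newline.

```python
import re

def append_once(text, suffix="ending"):
    """Append suffix unless text already ends with it (hypothetical helper)."""
    # re.escape keeps the suffix literal; the lookbehind is fixed-width,
    # which Python's re module requires.
    pattern = r"(?<!" + re.escape(suffix) + r")\Z"
    # A function replacement avoids backslash processing of the suffix.
    return re.sub(pattern, lambda m: suffix, text)

once = append_once("Init string")
twice = append_once(once)
print(once)
print(twice)
```

Applying the helper a second time leaves the string unchanged, because the lookbehind rejects the only position where the end anchor can match.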

The ? in your regex tells the regex engine that the string may optionally end with ending. Your question is a bit unclear, but if I understand it correctly, what you need instead is a negative lookbehind. Changing your pattern as follows should do the trick:
QString pattern(".*(?<!ending)$");
This makes sure that it only matches strings that don't already end with ending.

OK, I have an explanation of why this happens (so the question in the title is answered).
Basically, QString::replace or std::regex_replace finds two matches and performs two replacements: one where the capture group matches ending, and a second where the capture group does not participate and only the end of the string is matched.
This is a result of the fact that $ is just an assertion. It doesn't consume any character, so it can match more than once in an index-based search (when doing a replace-all).
Here is a demo in plain C++ that illustrates the issue:
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string s;
    auto r = std::regex{"(ending)?$"};
    auto after = "ending";
    // For each input line, print the replacement result and every match found.
    while (std::getline(std::cin, s)) {
        std::cout << "s: " << s << '\n';
        std::cout << "replace: " << std::regex_replace(s, r, after) << '\n';
        for (auto i = std::regex_iterator{s.begin(), s.end(), r};
             i != decltype(i){};
             ++i) {
            std::cout << "found: " << i->str() << " capture: " << i->str(1);
            std::cout << '\n';
        }
        std::cout << "------------\n";
    }
    return 0;
}
https://godbolt.org/z/e85ajb9aP
Now, knowing the root cause, you can try to address the issue.
I came up with this: match the whole string using two capture groups, the first one non-greedy, and reuse that first group in after. re: ^(.*?)(ending)?$ and after: $1ending
https://godbolt.org/z/EKz4fEaT7
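The whole-string variant can be sketched in Python as well. This assumes a single-line string without a trailing newline; the function name is hypothetical, and Python writes the backreference in the replacement as `\1` rather than `$1`.

```python
import re

# Group 1 lazily takes everything before an optional existing "ending".
# The leading ^ rules out a second, empty match at the end of the string.
PATTERN = re.compile(r"^(.*?)(ending)?$")

def append_once_anchored(text):
    return PATTERN.sub(r"\1ending", text)

first = append_once_anchored("Init string")
second = append_once_anchored(first)
print(first)
print(second)
```

Because the pattern is anchored at the start, only one match is possible per string, so repeated application does not grow the result.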

Related

Execute code block in nested regex only if the whole match succeeded in Perl 6

I would like a code block within a nested regex to be executed only if the whole pattern matches.
Here's my test code.
my regex left {
| a { take "left: 'a' matched" }
| aa { take "left: 'aa' matched" }
}
my regex right { b }
my regex whole {
^
<left>
<right>
$
}
my $string = 'aab';
my @left_history = gather
$string ~~ m:ex/ <whole> /;
.say for @left_history;
As expected, it produces the following output:
left: 'aa' matched
left: 'a' matched
But I want it to print only the 1st line, corresponding to the value of left, which was used in the successful match. Is it possible?
(Of course, I understand that I can extract the successful values of <left> from the $/ match variable)
UPD: In general, :ex or :g produce many matches, so I'd like the code block in the nested regex to be executed during each successful match of the whole pattern.
The following modification of the program is based on @raiph's very useful explanations. It's not an entirely general answer to the original question, but it solves my problems, so I venture to post it as an answer. Any corrections and improvements are very welcome!
my regex left {
| a { $*left_history = "left: 'a' matched" }
| aa { $*left_history = "left: 'aa' matched" }
| aab { $*left_history = "left: 'aab' matched" }
}
my regex right { b }
my regex whole {
:my $*left_history;
^
<left>
<right>
$
{ take $*left_history }
}
my $string = 'aab';
my @left_history = gather
$string ~~ m:ex/ <whole> /;
.say for @left_history;
Output:
left: 'aa' matched
It would be better if the scope of the variable would be restricted to whole and left, but I don't know how to do it.

Regex to read the id

I have the log file with the following content:
(2947:_dRW00T3WEeSkhZ9pqkt5dQ) ---$ ABC XY "Share" 16-Sep-2014 03:22 PM
(2948:_3nFSwz3TEeSkhZ9pqkt5dQ) ---$ ABC XY "Share" 16-Sep-2014 03:05 PM
(2949:_voeYED3AEeSkhZ9pqkt5dQ) ---$ ABC XY "Initial for Re,oved" 16-Sep-2014 12:44 PM
I want to read the unique id say _dRW00T3WEeSkhZ9pqkt5dQ from each line and store it in a array.
My current code is:
while(<$fh>) {
if ($_ =~ /\((.*?)\)/) {
push @cs_ids, $1;
}
}
Try this:
while(<$fh>) {
if ($_ =~ /\(\d+:(.+?)\)/) {
push @cs_ids, $1;
}
}
The regexp matches a ( followed by one or more digits, a colon, and then one or more characters (which are stored in $1), up to the closing ).
You were almost there:
perl -e '$string = "(2947:_dRW00T3WEeSkhZ9pqkt5dQ)"; if ($string =~ /^\((\d+:)(.*?)\)$/) { die $2; }'
_dRW00T3WEeSkhZ9pqkt5dQ at -e line 1.
Change your regular expression condition to:
/^\((\d+:)(.*?)\)$/
What that does is match and group the 4 digits and colon into special var $1 and the id you want into special var $2.
If every line of the log file is guaranteed to have an ID string, then you can write just
while (<$fh>) {
/:(\w+)/ and push @cs_ids, $1;
}
The \w ("word") character class matches alphanumeric characters or underscore, and this regex just snags the first sequence of word characters that follow a colon. It is best to avoid the non-greedy modifier if possible as it is a sloppy specification and can be much slower than a simple multiple character match.
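For comparison, here is a Python sketch that combines the ideas from the answers above: it anchors on the opening parenthesis, skips the numeric prefix and the colon, and captures the run of word characters. The names `LOG`, `ID_PATTERN` and the in-memory sample are assumptions made for illustration; the log lines are the ones from the question.

```python
import re

LOG = """\
(2947:_dRW00T3WEeSkhZ9pqkt5dQ) ---$ ABC XY "Share" 16-Sep-2014 03:22 PM
(2948:_3nFSwz3TEeSkhZ9pqkt5dQ) ---$ ABC XY "Share" 16-Sep-2014 03:05 PM
(2949:_voeYED3AEeSkhZ9pqkt5dQ) ---$ ABC XY "Initial for Re,oved" 16-Sep-2014 12:44 PM
"""

# "(" + digits + ":" + captured id (word characters) + ")"
ID_PATTERN = re.compile(r"\(\d+:(\w+)\)")

cs_ids = []
for line in LOG.splitlines():
    match = ID_PATTERN.match(line)  # match() only tries the start of the line
    if match:
        cs_ids.append(match.group(1))

print(cs_ids)
```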

Regular expressions to match protected separated values

I'd like to have a regular expression to match a separated values with some protected values that can contain the separator character.
For instance:
"A,B,{C,D,E},F"
would give:
"A"
"B"
"{C,D,E}"
"F"
Please note the protected values can be nested, as follows:
"A,B,{C,D,{E,F}},G"
would give:
"A"
"B"
"{C,D,{E,F}}"
"G"
I already coded that feature with a character iteration as follow:
sub Parse
{
    my @item;
    my $curly;
    my $string;
    foreach(split //)
    {
        $_ eq "{" and ++$curly;
        $_ eq "}" and --$curly;
        if(!$curly && /[,:]/)
        {
            push @item, $string;
            undef $string;
            next;
        }
        $string .= $_;
    }
    push @item, $string;
    return @item;
}
But it would definitively be so much nicer with a regexp.
A regex that supports nesting would look as follows:
my @items;
push @items, $1 while
/
(?: ^ | \G , )
(
(?: [^,{}]+
| (
\{
(?: [^{}]
| (?2)
)*
\}
)
| # Empty
)
)
/xg;
$ perl -E'$_ = shift; ... say for @items;' 'A,B,{C,D,{E,F}},G'
A
B
{C,D,{E,F}}
G
Assumes valid input since it can't extract and validate at the same time. (Well, not without making things really messy.)
Improved from nhahtdh's answer.
$_ = "A,B,{C,D,E},F";
while ( m/(\{.*?\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "[$&]\n";
}
Improved it again. Please look at this one!
$_ = "A,B,{C,D,{E,F}},G";
while ( m/(\{.*\}|((?<=^)|(?<=,)).(?=,|$))/g ) {
print "$&\n";
}
It will get:
A
B
{C,D,{E,F}}
G
$a = "A,B,{C,D,E},F";
while ($a =~ s/(\{[\{\}\w,]+\}|\w)//) {
push (@res, $1);
}
print "\@res: @res\n"
Result:
@res: A B {C,D,E} F
Explanation : we try to match either the protected block \{[\{\}\w,]+\} or just a single character \w successively in a loop, deleting it from the original string if there is a match. Every time there is a match, we store it (meaning the $1) in the array, et voilà!
Here is a regex in bash:
chronos@localhost / $ echo "A,B,{C,D,E},F" | grep -oE "(\{[^\}]*\}|[A-Z])"
A
B
{C,D,E}
F
Try this regex. Use the regex to match and extract the token.
/(\{.*?\}|(?<=,|^).*?(?=,|$))/
I have not tested this code in Perl.
There is an assumption here about how the regex engine works (I assume that it will try to match the first part \{.*?\} before the second part). I also assume that there are no nested or badly paired curly brackets.
$s = "A,B,{C,D,E},F";
@t = split /,(?=.*{)|,(?!.*})/, $s;
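Not every regex engine supports the recursion construct (?2) used in the nesting-aware answer above; Python's built-in `re` module, for example, does not. In such a case a depth-counting scan, similar in spirit to the asker's Parse routine, is a workable fallback. The sketch below is in Python, the function name `split_protected` is hypothetical, and it assumes well-formed (balanced) braces.

```python
def split_protected(text, separators=","):
    """Split on separators that are not inside {...}, allowing nesting."""
    items, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth == 0 and ch in separators:
            # A top-level separator closes the current item.
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
    items.append("".join(current))
    return items

print(split_protected("A,B,{C,D,E},F"))
print(split_protected("A,B,{C,D,{E,F}},G"))
```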

Regular expression that matches the longest repeating sequence

I want to match the longest sequence that is repeating at least once
Having:
T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced
the result should be: pending-cancel-replace_replaced
Try this
(.+)(?=.*\1)
See it here on Regexr
This will match any character sequence with at least one character, that is repeated later on in the string.
You would need to store your matches and decide which one is the longest afterwards.
This solution requires your regex flavour to support backreferences and lookaheads.
it will match any character sequence with at least one character .+ and store it in the group 1 because of the brackets around it. The next step is the positive lookahead (?=.*\1), it will be true if the captured sequence occurs at a later point again in the string.
Here a perl script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $s = q/T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced/;
my $max = 0;
my $seq = '';
while($s =~ /(.+)(?=.*\1)/g) {
if(length$1 > $max) {
$max = length $1;
$seq = $1;
}
}
say "longuest sequence : $seq, length = $max"
output:
longuest sequence : _pending-cancel-replace_replaced, length = 32
I have to admit that this one got me thinking. It was obvious that positive lookahead is absolutely necessary to solve this with regex. Anyhow here is how it would work in Java:
public static String biggestOccurance(String input){
Pattern p = Pattern.compile("(.+)(?=.*\\1)");
Matcher m = p.matcher(input);
String longestOccurence = "";
while(m.find()){
if(longestOccurence.length() < m.group(1).length()) longestOccurence = m.group(1);
}
return longestOccurence;
}
The thing that got me stuck was the
\\1
I knew that you could refer to a backreference in Java with
$1
but if you replace $1 with \\1 it will not work.
Will have to dig into that.
Cheers,Eugene.
Using Perl you can do:
s='T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced'
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g'
Which gives:
T_
send_
ac
k-
n
e
w_
a
mend
_pending-cancel-replace_replaced
_
cancel
_
p
e
n
d
in
g-
c
a
nce
l
-replace
_re
placed
Then you need to post process it in any language or script to get longest text.
One Possible Post Processing of repeated string can be using awk:
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g' | awk '{ if (length($0) > max) {max = length($0); maxline = $0} } END { print maxline }'
Which prints:
_pending-cancel-replace_replaced
PS: Note longest string here is _pending-cancel-replace_replaced
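The same idea, matching with (.+)(?=.*\1) and keeping the longest capture, can be sketched in Python. The function name is hypothetical. Like the answers above, it only compares the non-overlapping matches the engine reports, so it is not guaranteed to find the longest repeat for every possible input.

```python
import re

def longest_repeated(text):
    """Return the longest capture of (.+) that occurs again later in text."""
    best = ""
    for m in re.finditer(r"(.+)(?=.*\1)", text):
        if len(m.group(1)) > len(best):
            best = m.group(1)
    return best

s = ("T_send_ack-new_amend_pending-cancel-replace_replaced"
     "_cancel_pending-cancel-replace_replaced")
print(longest_repeated(s))
```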

Perl regex digit

Consider my regex in this code section:
use strict;
my @list = ("1", "2", "123");
&chk(@list);
sub chk {
    my @num = split (" ", "@_");
    foreach my $chk (@num) {
        chomp $chk;
        if ($chk =~ m/\d{1,2}?/) {
            print "$chk\n";
        }
    }
}
Using \d{4} prints nothing. Using \d{3} prints only 123. But if I change it to \d{1,2}? it prints everything. I thought, according to all the sources I have read so far, that {1,2} means: at least one digit but no more than two. So it should have printed only 1 and 2, correct?
What do I need to extract items that contains only one to two digits?
\d{1,2} succeeds if it finds 1 or 2 digits anywhere in the string provided. Additional string content does not cause the match to fail. If you want to match only when the string contains exactly 1 or 2 digits, do this: ^\d{1,2}$
You should anchor your regular expression for the desired effect. The built-in function grep suits better here since it is a selection from an array that is to be done:
#!/usr/bin/env perl
use strict;
use warnings;
my @list = ( 1, 2, 123 );
print join "\n", grep /^\d{1,2}$/, @list;
It appears to be working perfectly!
Here's a hint: Use the Perl variables $`, $&, and $'. These variables are special regular expression variables that show the part of the string before the match, what was matched, and the post matched string.
Here's a sample program:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use Scalar::Util;
my @list = ("1", "2", "123");
foreach my $string (@list) {
    if ($string =~ /\d{1,2}?/) {
        say qq(We have a match for "$string"!);
        say qq("$`" "$&" "$'");
    }
    else {
        say "No match makes David Sad";
    }
}
The output will be:
We have a match for "1"!
"" "1" ""
We have a match for "2"!
"" "2" ""
We have a match for "123"!
"" "1" "23"
What this does is divide up the string into three sections: The section of the string before the regular expression match, the section of the string that matches the regular expression, and the section of the string after the regular expression match.
In each case, there was no pre-match because the regular expression matches from the start of the string. We also see that \d{1,2}? matches a single digit in each case even though 123 could have matched two digits. Why? Because the question mark on the end of the match specifier tells the regular expression not to be greedy. In this case, we tell the regular expression to match either one or two characters. Fine, it matches on one. Remove the question mark, and the last line would have looked like this:
We have a match for "123"!
"" "12" "3"
If you want to match on one or two digits, but not three or more digits, you'll have to specify the part of your string before and after the one or two digits. Something like this:
/\D\d{1,2}\D/
This would match your string foo12bar, but not foo123bar. But what if the string is 12? In that case, we want to say that either we have the beginning of the string, or a non-digit before our one or two character match, and we either have a non-digit or the end of the string at the end of our one or two character match:
/(\D|^)\d{1,2}(\D|$)/
A quick explanation:
(\D|^): A non-digit or the beginning of the string (The ^ anchor)
\d{1,2}: One or two digits
(\D|$): A non-digit or the end of the string (The $ anchor)
Now, this will match 12, but not 123, and it will match foo12 and foo12bar, but not foo123 or foo123bar.
Just looking for a one or two digit number, we can simply specify the anchors:
/^\d{1,2}$/;
Now, that will match 1, 12, but not foo12 or 123.
The main thing is to use the $`, $&, and $' variables in order to help see exactly what your regular expression is matching on and what's before and after your match.
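The difference between an unanchored match, an anchored match, and lazy versus greedy quantifiers can also be checked quickly in Python. This is only an illustrative sketch; `re.fullmatch` plays the role of the ^...$ anchors discussed above, and the variable names are made up for the example.

```python
import re

items = ["1", "2", "123"]

# Unanchored: the pattern only has to match somewhere, so "123" passes too.
unanchored = [s for s in items if re.search(r"\d{1,2}", s)]

# fullmatch requires the whole string to be one or two digits.
anchored = [s for s in items if re.fullmatch(r"\d{1,2}", s)]

# Lazy vs. greedy: the part of "123" that each variant actually matches.
lazy = re.search(r"\d{1,2}?", "123").group()
greedy = re.search(r"\d{1,2}", "123").group()

print(unanchored, anchored, lazy, greedy)
```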
No, because while the regex only matches two digits, $chk still contains 123. If you want to only print the part that is matched, use
if ($chk =~ m/(\d{1,2})/) {
print "$1\n";
}
Note the parentheses and the $1. This causes it to print only that which is in the parentheses.
Also, this code doesn't make much sense:
sub chk {
    my @num = split (" ", "@_");
Because @_ already is an array it makes no sense to make it into a string and then split it. Simply do:
sub chk {
    foreach my $chk (@_) {
You also do not need to use chomp for data that is not coming from user input, as it is intended to remove the trailing newline. There is no newline in any of this data.
#!/usr/bin/perl
use strict;
my @list = ("1", "2", "123");
&chk(\@list);
sub chk {
    foreach my $chk (@{$_[0]}) {
        print "$chk\n" if $chk =~ m/^\d{1,2}$/ ;
    }
}
#!/usr/bin/perl
use strict;
use warnings;
my @list = ("1", "2", "123");
&chk(@list);
sub chk {
    my @num = split (" ", "@_");
    foreach my $chk (@num) {
        chomp $chk;
        if ($chk =~ m/\d{1,2}/ && length($chk) <= 2) {
            print "$chk\n";
        }
    }
}