Alphabetic order regex using backreferences - regex

I recently came across a puzzle to find a regular expression that matches:
5-character-long strings comprised of lowercase English letters in ascending ASCII order
Valid examples include:
aaaaa
abcde
xxyyz
ghost
chips
demos
Invalid examples include:
abCde
xxyyzz
hgost
chps
My current solution is kludgy. I use the regex:
(?=^[a-z]{5}$)^(a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*)$
which uses a zero-width lookahead (it consumes no characters) to assert a string length of 5, and then verifies that the string consists of lowercase English letters in order (see Rubular).
Instead, I'd like to use back references inside character classes. Something like:
^([a-z])([\1-z])([\2-z])([\3-z])([\4-z])$
The logic for the solution (see Rubular) in my head is to capture the first character [a-z], use it as a backreference in the second character class, and so on. However, \1, \2 ... within character classes seem to refer to ASCII values of 1, 2... effectively matching any four- or five-character string.
I have 2 questions:
Can I use back references in my character classes to check for ascending order strings?
Is there any less-hacky solution to this puzzle?

I'm posting this answer more as a comment than an answer since it has better formatting than comments.
Related to your questions:
Can I use back references in my character classes to check for ascending order strings?
No, you can't. If you take a look at the backref section of regular-expressions, you will find the documentation below:
Parentheses and Backreferences Cannot Be Used Inside Character Classes
Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, (, and ).
Backreferences, too, cannot be used inside a character class. The \1 in a regex like (a)[\1b] is either an error or a needlessly escaped literal 1. In JavaScript it's an octal escape.
Regarding your 2nd question:
Is there any less-hacky solution to this puzzle?
Imho, your regex is perfectly fine; you could shorten it very slightly at the beginning, like this:
(?=^.{5}$)^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
    ^--- Here
Regex demo
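As a rough cross-check of the lookahead approach, here is a minimal Python sketch (Python is my choice for illustration, not something the question uses). It builds the a*b*...z* chain programmatically, uses \Z instead of $ so a trailing newline is not accepted, and compares the result with a plain non-regex test. The helper names are hypothetical.

```python
import re
import string

# Lookahead pins the length to exactly 5 lowercase letters;
# the a*b*...z* chain then enforces ascending order.
ORDERED = "".join(letter + "*" for letter in string.ascii_lowercase)
PATTERN = re.compile(r"(?=[a-z]{5}\Z)" + ORDERED + r"\Z")

def is_ascending_regex(s):
    return PATTERN.match(s) is not None

def is_ascending_plain(s):
    # Non-regex cross-check: five lowercase ASCII letters, already sorted.
    return (len(s) == 5
            and all("a" <= c <= "z" for c in s)
            and list(s) == sorted(s))

valid = ["aaaaa", "abcde", "xxyyz", "ghost", "chips", "demos"]
invalid = ["abCde", "xxyyzz", "hgost", "chps"]

for word in valid + invalid:
    print(word, is_ascending_regex(word), is_ascending_plain(word))
```

Both functions should agree on every example from the question.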

If you are willing to use Perl (!), this will work:
/^([a-z])((??{"[$1-z]"}))((??{"[$2-z]"}))((??{"[$3-z]"}))(??{"[$4-z]"})$/

Since someone has broken the ice by using Perl, this is a
Perl solution I guess ..
Note that this is a basic non-regex solution that just happens to be
stuffed into code constructs inside a Perl regex.
The interesting thing is that if a day comes when you need the synergy
of regex/code this is a good choice.
Instead of a simple [a-z] character, you could then use a much more
complex pattern in its place and still check it against the last match.
That is power !!
The regex ^(?:([a-z])(?(?{ $last gt $1 })(?!)|(?{ $last = $1 }))){5}$
Perl code
use strict;
use warnings;
$/ = "";
my @DAry = split /\s+/, <DATA>;
my $last;
for (@DAry)
{
$last = '';
if (
/
^ # BOS
(?: # Cluster begin
( [a-z] ) # (1), Single a-z letter
# Code conditional
(?(?{
$last gt $1 # last > current ?
})
(?!) # Fail
| # else,
(?{ $last = $1 }) # Assign last = current
)
){5} # Cluster end, do 5 times
$ # EOS
/x )
{
print "good $_\n";
}
else {
print "bad $_\n";
}
}
__DATA__
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
Output
good aaaaa
good abcde
good xxyyz
good ghost
good chips
good demos
bad abCde
bad xxyyzz
bad hgost
bad chps

Ah, well, it's a finite set, so you can always enumerate it with alternation! This C program emits a "brute force" kind of regex wrapped in a little Perl read loop:
#include <stdio.h>
int main(void) {
printf("while (<>) { if (/^(?:");
for (int a = 'a'; a <= 'z'; ++a)
for (int b = a; b <= 'z'; ++b)
for (int c = b; c <= 'z'; ++c) {
for (int d = c; d <= 'y'; ++d)
printf("%c%c%c%c[%c-z]|", a, b, c, d, d);
printf("%c%c%czz", a, b, c);
if (a != 'z' || b != 'z' || c != 'z') printf("|\n");
}
printf(")$/x) { print \"Match!\\n\" } else { print \"No match.\\n\" }}\n");
return 0;
}
And now:
$ gcc r.c
$ ./a.out > foo.pl
$ cat > data.txt
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
^D
$ perl foo.pl < data.txt
Match!
Match!
Match!
Match!
Match!
Match!
No match.
No match.
No match.
No match.
The regex is only 220Kb or so ;-)
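To get a feel for how large that finite set is, here is a small Python sketch (an illustration only, not part of the C/Perl approach above). It assumes that each valid string corresponds to a multiset of five letters written in sorted order, which is what itertools.combinations_with_replacement enumerates.

```python
from itertools import combinations_with_replacement
from string import ascii_lowercase

# Every non-decreasing 5-letter string is a multiset of 5 letters
# written out in sorted order, so this enumerates the whole set.
words = {"".join(combo)
         for combo in combinations_with_replacement(ascii_lowercase, 5)}

print(len(words))          # size of the finite set
print("ghost" in words)    # an ascending example from the question
print("hgost" in words)    # a non-ascending example from the question
```

The C program keeps its output smaller than a plain list of every word by folding the last letter into a character class such as [d-z].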


Regex doesn't fetch the nested curly braces

The regex matches the curly braces in some cases but not in others.
My Code:
use strict;
use warnings;
my $str1 = '$$\eqalign{&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\{\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+\!m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}$$';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
my $regex = qr/(?:[^{}]*(?:{(?:[^{}]*(?:{(?:[^{}]*(?:{[^{}]*})*[^{}]*)})*[^{}]*)*})*[^{}]*)*/;
if($str1=~m/\{$regex\}/) { print "str1: $&\n"; }
if($str2=~m/\{$regex\}/) { print "str2: $&\n"; }
OUTPUT:
str1: {&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\ {\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+ !m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}
str2: {2}
The str1 output is correct. The str2 output is incorrect.
Expected Output on str2 is:
str2: {{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}
In the first sample, str1, the nested curly braces are matched. In the second sample, str2, they are not.
My question is how to match the nested curly braces in both cases. I am clueless; it would be great if someone could point out my mistake.
Thanks in advance.
Since your actual requirements (discussed in the chat) are to match substrings starting with \bib followed with {...} substrings or any chars other than { and }, you should use a regex with a subroutine:
/\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g
Details:
\\bib - \bib literal text
(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])* - 0+ occurrences of:
({(?:[^{}]++|(?1))*}) - Group 1 (that will be recursed with (?1)) matching
{ - a literal {
(?:[^{}]++|(?1))* - 0 or more occurrences of 1+ chars other than { and } or the whole Group 1 subpattern
} - a literal }
| - or
(?!\\bib)[^{}] - a char other than { and } not starting a \bib literal char sequence.
See the sample Perl code:
use strict;
use warnings;
use feature 'say';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
while($str2 =~ /\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g) {
say "$&";
}
Note: The edit in the question adds \\bibcite{Airdetal2013} in front. However, this doesn't change the analysis below, as it doesn't change the overall nesting levels.
This has got to be possible to do in a better way. There is a recursive regex offered by Wiktor Stribiżew in the comments. There are modules for recursive parsing. And there are tools for parsing LaTeX.
However, out of curiosity ...
Your string, shortened suitably
my $str2 = "{{2}{2017}{{{John}{et~al.}}}{{{James}, ... {Gnali}, \& {Giito}}}}";
or, with C standing for a pair of curlies with something inside (no nesting)
"{ C C { { C C } { C, ... \& C } } }"
So you have three levels of nesting, to get down to the last pair {...} (no further nesting).
Your regex, spread out and with $nc = qr/[^{}]*/ (Non-Curlies), so that we can look at it
my $regex = qr/
(?: $nc
(?: {
(?: $nc
(?: {
(?: $nc (?: { $nc } )* $nc )
}
)* $nc
)*
}
)* $nc
)*/x;
I can count two levels here. (The $nc has no curlies so { $nc } matches my C above.)
Thus this regex cannot match that whole string.
How to fix it? Best: find another way, so as not to drown in this.
Or, write it out like above, very carefully, and add the missing level.
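One such "other way", sketched in Python purely as an illustration, is to skip the fixed-depth regex and track the brace depth with a counter. The function name and the shortened sample string are hypothetical, and the sketch assumes that braces are never escaped (a literal \{ would be counted as an opening brace).

```python
def first_balanced_group(text):
    """Return the substring from the first '{' to its matching '}',
    or None if that brace is never closed."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i          # remember where the outermost group opens
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:         # the outermost group just closed
                return text[start:i + 1]
    return None

sample = "{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}}}}"
print(first_balanced_group(sample))
```

Because the counter has no fixed limit, it copes with any nesting depth, which a written-out regex cannot.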

Non-greedy regular expression match for multicharacter delimiters in awk

Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?
I have tried the following:
awk '
BEGIN {
str="AB 1 BA 2 AB 3 BA"
regex="AB([^B][^A]|B[^A]|[^B]A)*BA"
if (match(str,regex))
print substr(str,RSTART,RLENGTH)
}'
with no output. I believe the reason for no match is that there is an odd number of characters between "AB" and "BA", while every alternative in my group consumes exactly two characters. If I replace str with "AB 11 BA 22 AB 33 BA" the regex seems to work.
Merge your two negated character classes and remove the [^A] from the second alternation:
regex = "AB([^AB]|B|[^B]A)*BA"
This regex fails on the string ABABA, though - not sure if that is a problem.
Explanation:
AB # Match AB
( # Group 1 (could also be non-capturing)
[^AB] # Match any character except A or B
| # or
B # Match B
| # or
[^B]A # Match any character except B, then A
)* # Repeat as needed
BA # Match BA
Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
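The behaviour described above can be tried out with a short Python sketch. Note the assumption: Python's re module is a backtracking engine, not a POSIX leftmost-longest one like awk, so this only illustrates the pattern's logic. It is compared with Python's own lazy quantifier, which awk lacks.

```python
import re

# The pattern from this answer, with a non-capturing group.
tempered = re.compile(r"AB(?:[^AB]|B|[^B]A)*BA")
# Python's lazy quantifier, shown only for comparison.
lazy = re.compile(r"AB.*?BA")

text = "AB 1 BA 2 AB 3 BA"
print(tempered.search(text).group())    # first delimited chunk
print(tempered.findall(text))           # every delimited chunk
print(lazy.search(text).group())

# The edge case mentioned above: the tempered pattern rejects ABABA,
# while the lazy version accepts it.
print(tempered.search("ABABA"))
print(lazy.search("ABABA").group())
```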
The other answer didn't really answer: how to match non-greedily?
Looks like it can't be done in (G)AWK. The manual says this:
awk (and POSIX) regular expressions always match the leftmost, longest
sequence of input characters that can match.
https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest
And the whole manual doesn't contain the words "greedy" or "lazy". It mentions Extended Regular Expressions, but for non-greedy matching you'd need Perl-Compatible Regular Expressions. So… no, can't be done.
For general expressions, I'm using this as a non-greedy match:
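function smatch(s, r,    m, n) {   # m and n are local variables
if (match(s, r)) {
m = RSTART
do {
n = RLENGTH
# keep shrinking only while a shorter, non-empty match still starts at m
} while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
RSTART = m
RLENGTH = n
return RSTART
} else return 0
}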
smatch behaves like match, returning:
the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match a newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
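The three points above can be seen side by side in a short Python sketch (Python is used here only for illustration; the names are hypothetical). re.fullmatch plays the role of the \A ... \z anchors, and the original unanchored pattern is kept to show why it accepts too much.

```python
import re

# The pattern from the question: unanchored, and '|' is a literal inside [...].
unanchored = re.compile(r"[A|C|G|T]", re.IGNORECASE)
# The corrected character class; fullmatch requires the whole string to match.
anchored = re.compile(r"[ACGT]+", re.IGNORECASE)

def is_dna(s):
    return anchored.fullmatch(s) is not None

for candidate in ["AACGGGTTA", "AAYYGGTTA", "acgt", "", "ACGT\n"]:
    print(repr(candidate), is_dna(candidate))

# The original pattern finds a single allowed letter anywhere, so it
# accepts the invalid string, and it even accepts a lone '|'.
print(unanchored.search("AAYYGGTTA") is not None)
print(unanchored.search("|") is not None)
```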

Detecting text like "#smth" with RegExp (with some more terms)

I'm really bad at regular expressions, so please help me.
I need to find any pieces like #text in a string.
text mustn't contain any whitespace characters (\\s). Its length must be at least 2 characters ({2,}), and it must contain at least 1 letter (QChar::isLetter()).
Examples:
#c, #1, #123456, #123 456, #123_456 are incorrect
#cc, #text, #text123, #123text are correct
I use QRegExp.
QRegExp rx("#(\\S+[A-Za-z]\\S*|\\S*[A-Za-z]\\S+)$");
bool result = (rx.indexIn(str) == 0);
rx matches either one or more non-whitespace characters followed by a letter and then any number of non-whitespace characters, or any number of non-whitespace characters followed by a letter and then at least one non-whitespace character.
Styne666 gave the right regex.
Here is a little Perl script which is trying to match its first argument with this regex:
#!/usr/bin/env perl
use strict;
use warnings;
my $arg = shift;
if ($arg =~ m/(#(?=\d*[a-zA-Z])[a-zA-Z\d]{2,})/) {
print "$1 MATCHES THE PATTERN!\n";
} else {
print "NO MATCH\n";
}
Perl is always great to quickly test your regular expressions.
Now, your question is a bit different. You want to find all the substrings in your text string,
and you want to do it in C++/Qt. Here is what I could come up with in a couple of minutes:
#include <QtCore/QCoreApplication>
#include <QRegExp>
#include <iostream>
using namespace std;
int main(int argc, char *argv[])
{
QString str = argv[1];
QRegExp rx("[\\s]?(\\#(?=\\d*[a-zA-Z])[a-zA-Z\\d]{2,})\\b");
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1)
{
QString token = rx.cap(1);
cout << token.toStdString().c_str() << endl;
pos += rx.matchedLength();
}
return 0;
}
To make my test I feed it an input like this (making a long string just one command line argument):
peter@ubuntu01$ qt-regexp "#hjhj 4324 fdsafdsa #33e #22"
And it matches only two words: #hjhj and #33e.
Hope it helps.
The shortest I could come up with (which should work, but I haven't tested extensively) is:
QRegExp("^#(?=[0-9]*[A-Za-z])[A-Za-z0-9]{2,}$");
Which matches:
^ the start of the string
# a literal hash character
(?= then look ahead (but don't match)
[0-9]* zero or more latin numbers
[A-Za-z] a single upper- or lower-case latin letter
)
[A-Za-z0-9]{2,} then match at least two characters which may be upper- or lower-case latin letters or latin numbers
$ then assert the end of the string (zero-width, nothing is consumed)
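The breakdown above can be checked against the examples from the question with a small Python sketch. It assumes that Python's re module treats this particular pattern (anchors, lookahead, character classes, {2,}) the same way QRegExp does, which I have not verified in Qt itself.

```python
import re

# Same pattern as above, written as a Python raw string.
TAG = re.compile(r"^#(?=[0-9]*[A-Za-z])[A-Za-z0-9]{2,}$")

rejected = ["#c", "#1", "#123456", "#123 456", "#123_456"]
accepted = ["#cc", "#text", "#text123", "#123text"]

for candidate in rejected + accepted:
    print(candidate, TAG.match(candidate) is not None)
```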
Technically speaking though this is still wrong. It only matches latin letters and numbers. Replacing a few bits gives you:
QRegExp("^#(?=\\d*[^\\d\\s])\\w{2,}$");
This should work for non-latin letters and numbers but this is totally untested. Have a quick read of the QRegExp class reference for an explanation of each escaped group.
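And then to match within larger strings of text (again, untested; the backslash in \b must be doubled inside a C++ string literal, otherwise the compiler reads it as a backspace character):
QRegExp("\\b#(?=\\d*[^\\d\\s])\\w{2,}\\b");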
A useful tool is the Regular Expressions Example which comes with the SDK.
Use this regular expression; hopefully it will solve your problem.
^([#(a-zA-Z)]+[(a-zA-Z0-9)]+)*(#[0-9]+[(a-zA-Z)]+[(a-zA-Z0-9)]*)*$

Using alternation or character class for single character matching?

(Note: The title doesn't seem too clear. If someone can rephrase it, I'm all for it!)
Given this regex: (.*_e\.txt), which matches some filenames, I need to add some other single character suffixes in addition to the e. Should I choose a character class or should I use an alternation for this? (Or does it really matter??)
That is, which of the following two seems "better", and why:
a) (.*(e|f|x)\.txt), or
b) (.*[efx]\.txt)
Use [efx] - that's exactly what character classes are designed for: to match one of the included characters. Therefore it's also the most readable and shortest solution.
I don't know if it's faster, but I would be very much surprised if it wasn't. It definitely won't be slower.
My reasoning (without ever having written a regex engine, so this is pure conjecture):
The regex token [abc] will be applied in a single step of the regex engine: "Is the next character one of a, b, or c?"
(a|b|c) however tells the regex engine to
remember the current position in the string for backtracking, if necessary
check if it's possible to match a. If so, success. If not:
check if it's possible to match b. If so, success. If not:
check if it's possible to match c. If so, success. If not:
give up.
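Whatever the speed difference turns out to be, the two forms accept the same input. Here is a minimal Python sketch of that equivalence, using the patterns from the question and some made-up file names (the names are hypothetical, and Python's engine is not Perl's, so this says nothing about timing).

```python
import re

# The two candidate patterns from the question.
alternation = re.compile(r"(.*(e|f|x)\.txt)")
char_class = re.compile(r"(.*[efx]\.txt)")

# Hypothetical file names, chosen only for illustration.
names = ["report_e.txt", "report_f.txt", "report_x.txt",
         "report_a.txt", "notes.txt"]

for name in names:
    print(name,
          bool(alternation.fullmatch(name)),
          bool(char_class.fullmatch(name)))
```

The only visible difference is that the alternation adds a second capture group holding the suffix letter.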
Here is a benchmark:
Updated according to tchrist's comment; the difference is now more significant.
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Benchmark qw(:all);
my @l;
foreach(qw/b c d f g h j k l m n ñ p q r s t v w x z B C D F G H J K L M N ñ P Q R S T V W X Z/) {
push @l, "abc$_.txt";
}
my $re1 = qr/^(.*(b|c|d|f|g|h|j|k|l|m|n|ñ|p|q|r|s|t|v|w|x|z)\.txt)$/;
my $re2 = qr/^(.*[bcdfghjklmnñpqrstvwxz]\.txt)$/;
my $cpt;
my $count = -3;
my $r = cmpthese($count, {
'alternation' => sub {
for(@l) {
$cpt++ if $_ =~ $re1;
}
},
'class' => sub {
for(@l) {
$cpt++ if $_ =~ $re2;
}
}
});
result:
              Rate alternation       class
alternation 2855/s          --        -50%
class       5677/s         99%          --
With a single character, the difference is going to be so small that it won't matter (unless you're doing LOTS of operations).
However, for readability (and a slight performance increase) you should be using the character class method.
For a bit of further information: an opening round bracket ( causes Perl to set up backtracking from the current position, which you really don't need for your regex, as there are no further matches to go against. A character class will not do this.