Using alternation or character class for single character matching? - regex

(Note: The title doesn't seem too clear -- if someone can rephrase it, I'm all for it!)
Given this regex: (.*_e\.txt), which matches some filenames, I need to add some other single character suffixes in addition to the e. Should I choose a character class or should I use an alternation for this? (Or does it really matter??)
That is, which of the following two seems "better", and why:
a) (.*(e|f|x)\.txt), or
b) (.*[efx]\.txt)

Use [efx] - that's exactly what character classes are designed for: to match one of the included characters. Therefore it's also the most readable and shortest solution.
I don't know if it's faster, but I would be very much surprised if it wasn't. It definitely won't be slower.
My reasoning (without ever having written a regex engine, so this is pure conjecture):
The regex token [abc] will be applied in a single step of the regex engine: "Is the next character one of a, b, or c?"
(a|b|c) however tells the regex engine to
remember the current position in the string for backtracking, if necessary
check if it's possible to match a. If so, success. If not:
check if it's possible to match b. If so, success. If not:
check if it's possible to match c. If so, success. If not:
give up.
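As a quick sanity check that the two forms accept the same inputs, here is a minimal Python sketch. The two patterns are the ones from the question; the sample filenames are made up for illustration. It only compares match results and says nothing about speed.

```python
import re

# Patterns taken from the question; the file names below are hypothetical.
alternation = re.compile(r"(.*(e|f|x)\.txt)")
char_class = re.compile(r"(.*[efx]\.txt)")

names = ["report_e.txt", "report_f.txt", "report_x.txt",
         "report_g.txt", "e.txt", "notes.md"]

# fullmatch() requires the whole name to match the pattern.
matched_by_alternation = [n for n in names if alternation.fullmatch(n)]
matched_by_class = [n for n in names if char_class.fullmatch(n)]

print(matched_by_alternation == matched_by_class)
print(matched_by_class)
```

The only visible difference is that the alternation version creates an extra capture group, which the character class avoids.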

Here is a benchmark:
Updated according to tchrist's comment; the difference is now more significant.
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Benchmark qw(:all);
# Build the list of test file names, one per consonant.
my @l;
foreach (qw/b c d f g h j k l m n ñ p q r s t v w x z B C D F G H J K L M N ñ P Q R S T V W X Z/) {
    push @l, "abc$_.txt";
}
my $re1 = qr/^(.*(b|c|d|f|g|h|j|k|l|m|n|ñ|p|q|r|s|t|v|w|x|z)\.txt)$/;
my $re2 = qr/^(.*[bcdfghjklmnñpqrstvwxz]\.txt)$/;
my $cpt;
my $count = -3;    # negative count: run each sub for at least 3 CPU seconds
my $r = cmpthese($count, {
    'alternation' => sub {
        for (@l) {
            $cpt++ if $_ =~ $re1;
        }
    },
    'class' => sub {
        for (@l) {
            $cpt++ if $_ =~ $re2;
        }
    }
});
result:
Rate alternation class
alternation 2855/s -- -50%
class 5677/s 99% --

With a single character, it's going to have such a minimal difference that it won't matter. (unless you're doing LOTS of operations)
However, for readability (and a slight performance increase) you should be using the character class method.
For a bit further information - opening a round bracket ( causes Perl to start backtracking for that current position, which, as you don't have further matches to go against, you really don't need for your regex. A character class will not do this.


Remove substrings between < and > (including the brackets) with no angle brackets inside

I have to modify a html-like text with the sed command. I have to delete substrings starting with one or more < chars, then having 0 or more occurrences of any characters but angle brackets and then any 1 or more > chars.
For example: from
aaa<bbb>ccc I would like to get aaaccc
I am able to do this with
"s/<[^>]\+>//g"
but this command doesn't work if between <> characters is an empty string, or if there is double <<>> in the text.
For example, from
aa<>bb<cc>vv<<gg>>h
I get
aa<>bbvv>h
instead of
aabbvvh
How can I modify it to give me the right result?
The problem is that once you allow nesting the < and > characters, you convert the language type from "regular" to "context free".
Regular languages are those that are matched by regular expressions, while context-free languages cannot, in general, be parsed by a regular expression. The unbounded level of nesting is what prevents this: parsing such languages requires a stack-based (pushdown) automaton.
There is a somewhat complicated workaround, though. If you accept an upper limit on the nesting depth in the text you are processing, the language becomes regular, on the premise that deeper nesting never occurs.
Suppose you will never have more than three levels of nesting (enough to see the pattern and extend it to N levels). You can then use the following procedure to build a regular expression that matches up to three levels of nesting, but no more. (You can always build a regexp for N levels, but never for unlimited depth; this is the bounded nature of regexps :) )
Let's construct the expression recursively from the bottom up. At level 0 there is no nesting at all, so neither < nor > may appear (allowing < would allow further nesting levels, which is forbidden at level 0):
{l0} = [^<>]*
a string including no < and > characters.
Your matching text will be of this class of strings, surrounded by a pair of < and > chars:
{l1} = <[^<>]*>
Now, you can build a second level of nesting by alternating {l0}{l1}{l0}{l1}...{l0} (that is, {l0}({l1}{l0})*) and surrounding the whole thing with < and >, to build {l2}
{l2} = <{l0}({l1}{l0})*> = <[^<>]*(<[^<>]*>[^<>]*)*>
Now, you can build a third by alternating sequences of {l0} and {l2} inside a pair of brackets... (remember that {li} represents a regexp that allows up to i levels of nesting)
{l3} = <{l0}({l2}{l0})*> = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
and so on, successively, you form a sequence of
{lN} = <{l0}({l(N-1)}{l0})*>
and stop when you consider there will not be a deeper nesting in your input file.
So your level three regexp is:
<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l3--------------------------------------}
<{l0--}({l2---------------------}{l0--})*>
<{l0--}({l1----}{l0--})*>
<{l0--}>
You can see that the regexp grows as you consider more levels. The good thing is that you can settle on a maximum level of three or four, and most text will fit in this category.
See demo.
NOTE
Never hesitate to build a regular expression just because it appears somewhat complex. Remember that you can build it inside your program, using the same technique I've used here (e.g. a 16-level nesting regexp is a large string that is very difficult to write by hand, but very easy to build with a computer):
package com.stackoverflow.q61630608;
import java.util.regex.Pattern;
public class NestingRegex {
    public static String build_regexp( char left, char right, int level ) {
        return level == 0
            ? "[^" + left + right + "]*"
            : level == 1
                ? left + build_regexp( left, right, 0 ) + right
                : left + build_regexp( left, right, 0 )
                    + "(" + build_regexp( left, right, level - 1 )
                    + build_regexp( left, right, 0 )
                    + ")*" + right;
    }
    public static void main( String[] args ) {
        for ( int i = 0; i < 5; i++ )
            System.out.println( "{l" + i + "} = "
                    + build_regexp( '<', '>', i ) );
        Pattern pat = Pattern.compile( build_regexp( '<', '>', 16 ), 0 );
        String s = "aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp";
        System.out.println(
                String.format( "pat.matcher(\"%s\").replaceAll(\"#\") => %s",
                        s, pat.matcher( s ).replaceAll( "#" ) ) );
    }
}
which, on run gives:
{l0} = [^<>]*
{l1} = <[^<>]*>
{l2} = <[^<>]*(<[^<>]*>[^<>]*)*>
{l3} = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l4} = <[^<>]*(<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>[^<>]*)*>
pat.matcher("aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp").replaceAll("#") => aa#bb#vv#h#ppp
The main advantage of regular expressions is that, once written, the expression compiles into an internal representation that only has to visit each character of the string being matched once, leading to very efficient matching code (you would probably not do as well writing the code by hand).
Sed
For sed, you only need to generate a sufficiently deep regexp and use it to process your text file:
sed 's/<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>//g' file1.xml
will give you appropriate results (this handles 6 levels of nesting or less; remember that ( and ) must be escaped to be treated as group delimiters in sed)
Your regexp can be constructed using shell variables with the following approach:
l0="[^<>]*"
l1="<${l0}>"
l2="<${l0}\(${l1}${l0}\)*>"
l3="<${l0}\(${l2}${l0}\)*>"
l4="<${l0}\(${l3}${l0}\)*>"
l5="<${l0}\(${l4}${l0}\)*>"
l6="<${l0}\(${l5}${l0}\)*>"
echo regexp is "${l6}"
regexp is <[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>
sed -e "s/${l6}/#/g" <<EOF
aa<>bb<cc>vv<<gg>>h<iii<jj<>j>k<k>k<<lll>mmm>ooo>ppp
EOF
aa#bb#vv#h#ppp
(I've used # as the replacement text here so you can see where in the input string the patterns were detected.)
You may use
sed 's/<\+[^>]*>\+//g'
sed 's/<\{1,\}[^>]*>\{1,\}//g'
sed -E 's/<+[^>]*>+//g'
The patterns match
<\+ / <\{1,\} - 1 or more occurrences of < char
[^>]* - negated bracket expression that matches 0 or more chars other than >
>\+ / >\{1,\} - 1 or more occurrences of > char
Note that in the last, POSIX ERE, example, + that is unescaped is a quantifier matching 1 or more occurrences, same as \+ in the POSIX BRE pattern.
See the online sed demo:
s='aa<>bb<cc>vv<<gg>>h'
sed 's/<\+[^>]*>\+//g' <<< "$s"
sed 's/<\{1,\}[^>]*>\{1,\}//g' <<< "$s"
sed -E 's/<+[^>]*>+//g' <<< "$s"
Result of each sed command is aabbvvh.
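For readers who want to try the same pattern outside sed, here is a minimal Python sketch of the `sed -E` variant. The function name `strip_tags` is just an illustrative choice. It assumes, like the sed commands above, that brackets are not nested in an interleaved way.

```python
import re

def strip_tags(text):
    # Same pattern as the `sed -E` command above: one or more '<',
    # any run of characters other than '>', then one or more '>'.
    return re.sub(r"<+[^>]*>+", "", text)

print(strip_tags("aa<>bb<cc>vv<<gg>>h"))
```

Note that this simple pattern does not track nesting: an input such as `x<a<b>c>y` is cut at the first `>`, leaving `xc>y`.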

C++: Regex pattern

I got a regex pattern: (~[A-Z]){10,30} (Thanks to KekuSemau). And I need to edit it, so it will skip 1 letter. So it will be like down below.
Input: CABBYCRDCEBFYGGHQIPJOK
Output: A B C D E F G H I J K
Just match two letters each iteration but only capture the second part.
(?:~[A-Z](~[A-Z])){5,15}
live: https://regex101.com/r/pIAxH8/1
I cut the repetition count (the bit inside the {}'s) by half since the new regex is matching two at a time.
The ?: in (?:...) bit disables capturing of the group.
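A minimal Python sketch of the "match two, capture the second" idea follows. It is an illustration under two assumptions: the `~` from the original pattern is dropped because the sample input contains only letters, and `re.findall` is used instead of a repeated group, since a repeated capture group in Python keeps only its last match.

```python
import re

def every_second_letter(text):
    # Each match consumes two uppercase letters; only the second is captured.
    return re.findall(r"[A-Z]([A-Z])", text)

print(" ".join(every_second_letter("CABBYCRDCEBFYGGHQIPJOK")))
```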
In regex only, there is no way you can achieve this directly.
But you can do this in code:
Use following regex:
(.(?<pick>[A-Z]))+
and in code loop over the "captures" of the desired group, for example in C#:
string value = "";
for (int i = 0; i < match.Groups["pick"].Captures.Count; i++)
{
    // Append the i-th captured letter (not always the first one).
    value += match.Groups["pick"].Captures[i].Value;
}

Alphabetic order regex using backreferences

I recently came across a puzzle to find a regular expression that matches:
5-character-long strings comprised of lowercase English letters in ascending ASCII order
Valid examples include:
aaaaa
abcde
xxyyz
ghost
chips
demos
Invalid examples include:
abCde
xxyyzz
hgost
chps
My current solution is kludgy. I use the regex:
(?=^[a-z]{5}$)^(a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*)$
which uses a lookahead (a non-consuming group) to assert a string length of 5, and then verifies that the string consists of lowercase English letters in order (see Rubular).
Instead, I'd like to use back references inside character classes. Something like:
^([a-z])([\1-z])([\2-z])([\3-z])([\4-z])$
The logic for the solution (see Rubular) in my head is to capture the first character [a-z], use it as a backreference in the second character class, and so on. However, \1, \2 ... within character classes seem to refer to ASCII values of 1, 2... effectively matching any four- or five-character string.
I have 2 questions:
Can I use back references in my character classes to check for ascending order strings?
Is there any less-hacky solution to this puzzle?
I'm posting this answer more as a comment than an answer since it has better formatting than comments.
Related to your questions:
Can I use back references in my character classes to check for ascending order strings?
No, you can't. If you take a look at the backreferences section of the regular-expressions documentation, you will find the following:
Parentheses and Backreferences Cannot Be Used Inside Character Classes
Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, (, and ).
Backreferences, too, cannot be used inside a character class. The \1 in a regex like (a)[\1b] is either an error or a needlessly escaped literal 1. In JavaScript it's an octal escape.
Regarding your 2nd question:
Is there any less-hacky solution to this puzzle?
Imho, your regex is perfectly fine. You could shorten it very slightly at the beginning, like this:
(?=^.{5}$)^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
^--- Here
Regex demo
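To check the lookahead-plus-`a*b*...z*` approach against the examples from the question, here is a minimal Python sketch. The names `ASCENDING`, `PATTERN` and `is_ascending` are illustrative, and `\Z` is used in place of `$` so that a trailing newline is not accepted.

```python
import re
import string

# a*b*c*...z* built programmatically rather than typed by hand.
ASCENDING = "".join(letter + "*" for letter in string.ascii_lowercase)

# The lookahead pins the length to exactly 5 lowercase letters;
# the rest enforces non-decreasing alphabetical order.
PATTERN = re.compile(r"(?=[a-z]{5}\Z)" + ASCENDING + r"\Z")

def is_ascending(word):
    # match() anchors at the start of the string.
    return PATTERN.match(word) is not None

for word in ["aaaaa", "abcde", "xxyyz", "ghost", "chips", "demos",
             "abCde", "xxyyzz", "hgost", "chps"]:
    print(word, is_ascending(word))
```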
If you are willing to use Perl (!), this will work:
/^([a-z])((??{"[$1-z]"}))((??{"[$2-z]"}))((??{"[$3-z]"}))(??{"[$4-z]"})$/
Since someone has broken the ice by using Perl, this is a
Perl solution I guess ..
Note that this is a basic non-regex solution that just happens to be
stuffed into code constructs inside a Perl regex.
The interesting thing is that if a day comes when you need the synergy
of regex/code this is a good choice.
It is possible then that instead of a simple [a-z] character, you may
use a very complex pattern in its place, still checking it against the last match.
That is power !!
The regex ^(?:([a-z])(?(?{ $last gt $1 })(?!)|(?{ $last = $1 }))){5}$
Perl code
use strict;
use warnings;
$/ = "";
my @DAry = split /\s+/, <DATA>;
my $last;
for (@DAry)
{
    $last = '';
    if (
        /
          ^                           # BOS
          (?:                         # Cluster begin
             ( [a-z] )                # (1), Single a-z letter
                                      # Code conditional
             (?(?{
                  $last gt $1         # last > current ?
               })
                  (?!)                # Fail
               |                      # else,
                  (?{ $last = $1 })   # Assign last = current
             )
          ){5}                        # Cluster end, do 5 times
          $                           # EOS
        /x )
    {
        print "good $_\n";
    }
    else {
        print "bad $_\n";
    }
}
__DATA__
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
Output
good aaaaa
good abcde
good xxyyz
good ghost
good chips
good demos
bad abCde
bad xxyyzz
bad hgost
bad chps
Ah, well, it's a finite set, so you can always enumerate it with alternation! This C program emits a "brute force" kind of regex, wrapped in a small Perl read-and-match loop:
#include <stdio.h>
int main(void) {
    printf("while (<>) { if (/^(?:");
    for (int a = 'a'; a <= 'z'; ++a)
        for (int b = a; b <= 'z'; ++b)
            for (int c = b; c <= 'z'; ++c) {
                for (int d = c; d <= 'y'; ++d)
                    printf("%c%c%c%c[%c-z]|", a, b, c, d, d);
                printf("%c%c%czz", a, b, c);
                if (a != 'z' || b != 'z' || c != 'z') printf("|\n");
            }
    printf(")$/x) { print \"Match!\\n\" } else { print \"No match.\\n\" }}\n");
    return 0;
}
And now:
$ gcc r.c
$ ./a.out > foo.pl
$ cat > data.txt
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
^D
$ perl foo.pl < data.txt
Match!
Match!
Match!
Match!
Match!
Match!
No match.
No match.
No match.
No match.
The regex is only 220Kb or so ;-)

Error with regex, match numbers

I have a string 00000001001300000708303939313833313932E2
so, I want to match everything between 708 & E2..
So I wrote:
(?<=708)(.*\n?)(?=E2) - tested in RegExr (it's working)
Now, from that result 303939313833313932 match to get result
(every second number):
099183192
How ?
To match everything between 708 and E2, use:
708(\d+)
if you are sure that there will be only digits. Otherwise try with:
708(.*?)E2
To match every second digit from 303939313833313932, use:
(?:\d(\d))+
use a global replace:
find: \d(\d)
replace: $1
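Both steps can be sketched in a few lines of Python. This is only an illustration: the question does not name a language, and the variable names are arbitrary.

```python
import re

raw = "00000001001300000708303939313833313932E2"

# Lazy capture of whatever sits between the first "708" and the next "E2".
between = re.search(r"708(.*?)E2", raw).group(1)

# Global replace: each pair of digits is replaced by its second digit.
every_second = re.sub(r"\d(\d)", r"\1", between)

print(between)       # 303939313833313932
print(every_second)  # 099183192
```

The global replace is used here because a repeated group such as `(?:\d(\d))+` keeps only its last capture in most regex engines.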
Are you expecting a regular expression answer to this?
You are perhaps better off doing this using string operations in whatever programming language you're using. If you have text = "abcdefghi..." then do output = text[0] + text[2] + text[4]... in a loop, until you run out of characters.
You haven't specified a programming language, but in Python I would do something like:
>>> text = "abcdefghjiklmnop"
>>> for n, char in enumerate(text):
...     if n % 2 == 0:  # every second char
...         print(char)
...
a
c
e
g
j
k
m
o

RegEx to find words with characters

I've found answers to many of my questions here but this time I'm stuck. I've looked at 100's of questions but haven't found an answer that solves my problem so I'm hoping for your help :D
Considering the following list of words:
iris
iridium
initialization
How can I use regex to find words in this list when I am looking using exactly the characters u, i, i? I'm expecting the regex to find "iridium" only because it is the only word in the list that has two i's and one u.
What I've tried
I've been searching both here and elsewhere but haven't come across any that helps me.
[i].*[i].*[u]
matches iridium, as expected, and not iris nor initialization. However, the characters i, i, u must be in that sequence in the word, which may or may not be the case. So trying with a different sequence
[u].*[i].*[i]
This does not match iridium (but I want it to, iridium contains u, i, i) and I'm stuck for what to do to make it match. Any ideas?
I know I could try all sequences (in the example above it would be iiu; iui; uii) but that gets messy when I'm looking for more characters (say 6, tnztii which would match initialization).
[t].*[n].*[z].*[t].*[i].*[i]
[t].*[z].*[n].*[t].*[i].*[i]
[t].*[z].*[n].*[i].*[t].*[i]
..... (long list until)
[i].*[n].*[i].*[t].*[z].*[t] (the first matching sequence)
Is there a way to use regex to find the word, irrespective of the sequence of the characters?
I don't think there's a way to solve this with RegularExpressions which does not end in a horribly convoluted expression - might be possible with LookForward and LookBehind expressions, but I think it's probably faster and less messy if you simply solve this programmatically.
Chop the string up by its whitespaces and then iterate over all the words and count the instances your characters appear inside this word. To speed things up, discard all words with a length less than your character number requirement.
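The counting approach described above could look like the following Python sketch. The function name `words_containing` is hypothetical, and it treats the requested letters as a minimum ("at least two i's and one u"), which is what the iridium example implies.

```python
from collections import Counter

def words_containing(text, letters):
    needed = Counter(letters)
    result = []
    for word in text.split():
        # A word shorter than the letter list can never qualify.
        if len(word) < len(letters):
            continue
        # Counter subtraction drops non-positive counts, so an empty
        # result means every required letter occurs often enough.
        if not needed - Counter(word):
            result.append(word)
    return result

print(words_containing("iris iridium initialization", "uii"))
print(words_containing("iris iridium initialization", "tnztii"))
```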
Is this an academic exercise, or can you use more than a single regular expression? Is there a language wrapped around this? The simplest way to do what you want is to have a regexp that matches just i or u, and examine (count) the matches. Using python, it could be a one-liner. What are you using?
The part you haven't gotten around to yet is that there might be additional i's or u's in the word. So instead of matching on .*, match on [^iu]*.
Here's what I would do:
Array.prototype.findItemsByChars = function(charGroup) {
    console.log('charGroup:',charGroup);
    charGroup = charGroup.toLowerCase().split('').sort().join('');
    charGroup = charGroup.match(/(.)\1*/g);
    for (var i = 0; i < charGroup.length; i++) {
        charGroup[i] = {char:charGroup[i].substr(0,1),count:charGroup[i].length};
        console.log('{char:'+charGroup[i].char+' ,count:'+charGroup[i].count+'}');
    }
    var matches = [];
    for (var i = 0; i < this.length; i++) {
        var charMatch = 0;
        //console.log('word:',this[i]);
        for (var j = 0; j < charGroup.length; j++) {
            try {
                var count = this[i].match(new RegExp(charGroup[j].char,'g')).length;
                //console.log('\tchar:',charGroup[j].char,'count:',count);
                if (count >= charGroup[j].count) {
                    if (++charMatch == charGroup.length) matches.push(this[i]);
                }
            } catch(e) { break };
        }
    }
    return matches.length ? matches : false;
};
var words = ['iris','iridium','initialization','ulisi'];
var matches = words.findItemsByChars('iui');
console.log('matches:',matches);
EDIT: Let me know if you need any explanation.
I know this is a really old post, but I found this topic really interesting and thought people might look for a similar answer some day.
So the goal is to match all words with a specific set of characters in any order. There is a simple way to do this using lookaheads :
\b(?=(?:[^i\W]*i){2})(?=[^u\W]*u)\w+\b
Here is how it works :
We use one lookahead (?=...) for each letter to be matched
In this, we put [^x\W]*x where x is the letter that must be present.
We then make this pattern occur n times, where n is the number of times that x must appear in the word, using (?:...){n}
The resulting regex for a letter x having to appear n times in the word is then (?=(?:[^x\W]*x){n})
All you have to do then is to add this pattern for each letter and add \w+ at the end to match the word !
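The recipe above can be automated. Here is a minimal Python sketch that builds the lookahead pattern from a string of required letters. The helper name `build_pattern` is hypothetical, and it assumes the letters are ordinary word characters.

```python
import re
from collections import Counter

def build_pattern(letters):
    # One lookahead per distinct letter: (?=(?:[^x\W]*x){n})
    lookaheads = "".join(
        "(?=(?:[^{0}\\W]*{0}){{{1}}})".format(re.escape(letter), count)
        for letter, count in sorted(Counter(letters).items())
    )
    # \w+ finally consumes the word once every lookahead has succeeded.
    return r"\b" + lookaheads + r"\w+\b"

words = "iris iridium initialization"
print(build_pattern("iui"))
print(re.findall(build_pattern("iui"), words))
print(re.findall(build_pattern("tnztii"), words))
```

A letter that is required only once gets an explicit `{1}` here, which is equivalent to the shorter form without a quantifier.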