How to limit string length to a range using a regular expression - regex

As far as I know, curly braces {} are used to limit the number of characters in a regular expression; for example, {3,12} would match a length between 3 and 12 characters.
I am trying to validate a username that may contain either a period . or an underscore _, but not both; its position does not matter. The regex below works very well for this.
(^[a-z0-9]+$)|(^[a-z0-9]*[\.\_][a-z0-9]*$)
But I also need to limit the string length to between 3 and 12 characters. I tried adding {3,12} to the regex, but that doesn't work.
((^[a-z0-9]+$)|(^[a-z0-9]*[\.\_][a-z0-9]*$)){3,12}
See Example: https://regex101.com/r/kN3aO1/1

As hwnd suggested, a simpler solution would be:
^(?=.{3,12}$)[a-z0-9]+(?:[._][a-z0-9]+)?$
The old solution, which is rather complex and convoluted, is left here for reference, but use the one above instead.
^(?!(?:.{13,}|.{1,2})$)(?:([a-z0-9]+)|([a-z0-9]*[\.\_][a-z0-9]*))$
You can add a lookahead for this.
Demo on regex101
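As a quick check of the lookahead approach, here is a minimal Python sketch. The function name is_valid_username and the sample names are assumptions made for illustration; the pattern itself is the simpler one given above.

```python
import re

# Pattern from the answer: the lookahead (?=.{3,12}$) enforces the total
# length; the rest allows at most one '.' or '_' between alphanumerics.
USERNAME_RE = re.compile(r"^(?=.{3,12}$)[a-z0-9]+(?:[._][a-z0-9]+)?$")


def is_valid_username(name):
    """Return True if name is 3-12 characters with at most one '.' or '_'."""
    # fullmatch also rejects a trailing newline, which a bare '$' would allow.
    return USERNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["abc", "john.doe", "a_b", "ab", "a.b_c", "a" * 13]:
        print(candidate, is_valid_username(candidate))
```

Note that this pattern requires the separator to sit between alphanumeric characters, so a leading or trailing . or _ is rejected, unlike the regex in the question.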

I would do this in three steps.
Check to see if the string has any '.' in it.
Check to see if the string has any '_' in it.
Check to see if string length is between 3 and 12.
In Perl:
if ( ( ( $name =~ /_/ ) && ( $name =~ /\./ ) ) ||
( length($name) < 3 ) ||
( length($name) > 12 ) )
{
# Handle invalid username
}
If you want to make sure that the username contains only one dot or underscore, you may count them. Again, in Perl:
my $dcnt = $name =~ tr /././;
my $ucnt = $name =~ tr /_/_/;
if ( ( $dcnt > 0 && $ucnt > 0 ) ||
( $dcnt > 1 ) ||
( $ucnt > 1 ) ||
( length($name) < 3 ) ||
( length($name) > 12 ) )
{
# Handle invalid username
}
Why not one monster regular expression that does everything at once? Well, for the sake of maintainability. If you or a colleague looks at this code in a year's time, when requirements have changed, this approach will make it easier to update the code.
Notice also that {3,12} says nothing about lengths. It allows the previous pattern to match three to twelve times.
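For comparison, the same step-by-step checks could be sketched in Python roughly as follows. The function name is an assumption for illustration, and, like the Perl version, it does not check which other characters are allowed.

```python
def is_valid_username(name):
    """Step-by-step checks mirroring the Perl version above."""
    dots = name.count(".")
    underscores = name.count("_")
    if dots > 0 and underscores > 0:   # both separators present
        return False
    if dots > 1 or underscores > 1:    # the same separator repeated
        return False
    return 3 <= len(name) <= 12        # length check


if __name__ == "__main__":
    for candidate in ["john.doe", "a.b_c", "a..b", "ab", "abc"]:
        print(candidate, is_valid_username(candidate))
```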

Related

Remove substrings between < and > (including the brackets) with no angle brackets inside

I have to modify HTML-like text with the sed command. I have to delete substrings that start with one or more < characters, followed by zero or more characters other than angle brackets, and end with one or more > characters.
For example: from
aaa<bbb>ccc I would like to get aaaccc
I am able to do this with
"s/<[^>]\+>//g"
but this command doesn't work if there is an empty string between the < and > characters, or if there are doubled brackets such as <<>> in the text.
For example, from
aa<>bb<cc>vv<<gg>>h
I get
aa<>bbvv>h
instead of
aabbvvh
How can I modify it to give me the right result?
The problem is that once you allow nesting of the < and > characters, the language changes from "regular" to "context-free".
Regular languages are those that can be matched by regular expressions, while context-free languages cannot, in general, be parsed by a regular expression. Unbounded nesting is what prevents this: parsing such languages requires a stack-based (pushdown) automaton.
There is a somewhat complicated workaround, though. If you accept an upper limit on the nesting depth in the text you are dealing with, you can treat the language as regular, on the premise that the non-regular cases will never occur.
Suppose you will never have more than three levels of nesting in your input (three is enough to see the pattern and extend it to N levels). You can use the following algorithm to build a regular expression that matches up to three levels of nesting, but no more (you can make a regexp that parses N levels, but no more; this is the bounded nature of regexps :) ).
Let's construct the expression recursively from the bottom up. At level 0 there is no nesting at all, so neither < nor > may appear (allowing < would allow more nesting levels, which is forbidden at level 0):
{l0} = [^<>]*
a string including no < and > characters.
Your matching text will be of this class of strings, surrounded by a pair of < and > chars:
{l1} = <[^<>]*>
Now you can build a second level of nesting by alternating {l0}{l1}{l0}{l1}...{l0} (that is, {l0}({l1}{l0})*) and surrounding the whole thing with < and > to build {l2}:
{l2} = <{l0}({l1}{l0})*> = <[^<>]*(<[^<>]*>[^<>]*)*>
Now you can build a third level by alternating sequences of {l0} and {l2} inside a pair of brackets (remember that {l-i} represents a regexp that allows up to i levels of nesting):
{l3} = <{l0}({l2}{l0})*> = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
and so on, successively, you form a sequence of
{lN} = <{l0}({l(N-1)}{l0})*>
and stop when you consider there will not be a deeper nesting in your input file.
So your level three regexp is:
<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l3--------------------------------------}
<{l0--}({l2---------------------}{l0--})*>
        <{l0--}({l1----}{l0--})*>
                <{l0--}>
You can see that the regexp grows as you consider more levels. The good thing is that you can settle on a maximum of three or four levels and most text will fit in this category.
See demo.
NOTE
Never hesitate to build a regular expression just because it looks somewhat complex. Remember that you can build it inside your program, using the same technique I've used here (e.g. a regexp for 16 levels of nesting is a large string that is very difficult to write by hand, but very easy to build with a computer):
package com.stackoverflow.q61630608;
import java.util.regex.Pattern;
public class NestingRegex {
public static String build_regexp( char left, char right, int level ) {
return level == 0
? "[^" + left + right + "]*"
: level == 1
? left + build_regexp( left, right, 0 ) + right
: left + build_regexp( left, right, 0 )
+ "(" + build_regexp( left, right, level - 1 )
+ build_regexp( left, right, 0 )
+ ")*" + right;
}
public static void main( String[] args ) {
for ( int i = 0; i < 5; i++ )
System.out.println( "{l" + i + "} = "
+ build_regexp( '<', '>', i ) );
Pattern pat = Pattern.compile( build_regexp( '<', '>', 16 ), 0 );
String s = "aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp";
System.out.println(
String.format( "pat.matcher(\"%s\").replaceAll(\"#\") => %s",
s, pat.matcher( s ).replaceAll( "#" ) ) );
}
}
which, when run, gives:
{l0} = [^<>]*
{l1} = <[^<>]*>
{l2} = <[^<>]*(<[^<>]*>[^<>]*)*>
{l3} = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l4} = <[^<>]*(<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>[^<>]*)*>
pat.matcher("aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp").replaceAll("#") => aa#bb#vv#h#ppp
The main advantage of using regular expressions is that, once written, the expression is compiled into an internal representation. With a DFA-based engine, that representation visits each character of the input only once, leading to very efficient matching code (probably more efficient than code you would write yourself). Backtracking engines, such as java.util.regex, do not give that guarantee.
Sed
For sed, you only need to generate a sufficiently deep regexp and use it to process your text file:
sed 's/<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>//g' file1.xml
will give you appropriate results (this handles 6 levels of nesting or fewer; remember that ( and ) must be escaped to be treated as group delimiters in sed)
Your regexp can be constructed using shell variables with the following approach:
l0="[^<>]*"
l1="<${l0}>"
l2="<${l0}\(${l1}${l0}\)*>"
l3="<${l0}\(${l2}${l0}\)*>"
l4="<${l0}\(${l3}${l0}\)*>"
l5="<${l0}\(${l4}${l0}\)*>"
l6="<${l0}\(${l5}${l0}\)*>"
echo regexp is "${l6}"
regexp is <[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>
sed -e "s/${l6}/#/g" <<EOF
aa<>bb<cc>vv<<gg>>h<iii<jj<>j>k<k>k<<lll>mmm>ooo>ppp
EOF
aa#bb#vv#h#ppp
(I've used # as the replacement string here, so you can see where in the input string the patterns have been matched)
You may use
sed 's/<\+[^>]*>\+//g'
sed 's/<\{1,\}[^>]*>\{1,\}//g'
sed -E 's/<+[^>]*>+//g'
The patterns match
<\+ / <\{1,\} - 1 or more occurrences of < char
[^>]* - negated bracket expression that matches 0 or more chars other than >
>\+ / >\{1,\} - 1 or more occurrences of > char
Note that in the last example, which uses POSIX ERE, the unescaped + is a quantifier matching 1 or more occurrences, the same as \+ in the BRE patterns (\+ is a GNU extension; \{1,\} is the portable POSIX BRE form).
See the online sed demo:
s='aa<>bb<cc>vv<<gg>>h'
sed 's/<\+[^>]*>\+//g' <<< "$s"
sed 's/<\{1,\}[^>]*>\{1,\}//g' <<< "$s"
sed -E 's/<+[^>]*>+//g' <<< "$s"
Result of each sed command is aabbvvh.
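The two answers behave the same on the sample input but differ once text follows an inner group. The Python sketch below compares them under that assumption; build_nested, strip_simple and strip_nested are hypothetical helper names, and the bounded-nesting pattern is rebuilt with non-capturing groups.

```python
import re


def build_nested(levels):
    """Build a regex matching <...> groups nested up to `levels` deep."""
    l0 = "[^<>]*"                    # text without angle brackets
    pattern = "<" + l0 + ">"         # one level: <...>
    for _ in range(levels - 1):
        # wrap the previous level: <{l0}({previous}{l0})*>
        pattern = "<" + l0 + "(?:" + pattern + l0 + ")*>"
    return pattern


SIMPLE = re.compile(r"<+[^>]*>+")    # same pattern as the sed -E command
NESTED = re.compile(build_nested(3))


def strip_simple(text):
    return SIMPLE.sub("", text)


def strip_nested(text):
    return NESTED.sub("", text)


if __name__ == "__main__":
    for sample in ["aa<>bb<cc>vv<<gg>>h", "x<a<b>c>y"]:
        print(sample, strip_simple(sample), strip_nested(sample))
```

Which behaviour is wanted for an input like x<a<b>c>y depends on how the text is meant to be interpreted; the question's title suggests inner brackets are not expected, in which case the simpler pattern is enough.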

Find any permutation of a set using Perl's RegEx

I need to find a way of checking for the existence of sets of the type {1,2,3,4,5,6,8,9,10}, that have a preset number of elements. Also, notice the missing 7. Obviously the numbers could be in any order and should appear only once, since according to definition, {1,2,3} = {3,2,1} = {1,2,3,3} = ... and so forth.
How could I do this with Perl (or is it even possible)? One thing I tried was
{([1-6],|[8-9],|10,){8}([1-6]|[8-9]|10)} here, but this doesn't take care of the multiple instances of the same number within the brackets.
Regexes are almost certainly the wrong tool here. You want something that deals with permutations of an input list.
This blog post gives a useful overview of Perl modules that deal with permutations and combinations. Sounds to me like Algorithm::Combinatorics would be a good place to start. Something like this, perhaps:
use Algorithm::Combinatorics qw(permutations);
my @input = qw[1 2 3 4 5 6 8 9 10];
my @perms = permutations(\@input);
You then need some way to compare the valid permutations with the sets you want to test. I'd consider constructing a string representation of the sets (by joining them with a known delimiter) and doing a simple string comparison.
my @perm_strs = map { join ':', @$_ } @perms;
my @test = qw[2 4 3 1 10 5 9 8 6];
my $test_str = join ':', @test;
my $match = 0;
for (@perm_strs) {
    if ($test_str eq $_) {
        $match = 1;
        last;
    }
}
The success of the match is now in $match.
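The same idea can be sketched in Python with itertools.permutations from the standard library. This is only an illustration of the join-and-compare approach described above; the function name matches_some_permutation is an assumption.

```python
from itertools import permutations


def matches_some_permutation(test, members):
    """Join each permutation with ':' and compare it to the joined test list."""
    test_str = ":".join(test)
    # permutations() is lazy, so the search stops at the first match.
    return any(":".join(p) == test_str for p in permutations(members))


if __name__ == "__main__":
    members = "1 2 3 4 5 6 8 9 10".split()
    print(matches_some_permutation("2 4 3 1 10 5 9 8 6".split(), members))
```

The number of permutations grows factorially with the set size. Comparing sorted(test) with sorted(members) gives the same answer without generating any permutations.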
This regex does that.
Here 10 slots are allocated, but you can add as many as you want (a hundred?).
That doesn't mean you have to match 10 unique numbers in a set.
You can match any count less than or equal to 10 (for example {5}),
or even a range like {3,7}.
The slots are filled sequentially, starting from 1,
so you just have to loop from 1 to N, checking whether each capture group is defined.
If you're looking for speed, this is the demon you want !
/\{(?>(?>(?(1)(?!))((?&GetNum))|(?(2)(?!))((?&GetNum))|(?(3)(?!))((?&GetNum))|(?(4)(?!))((?&GetNum))|(?(5)(?!))((?&GetNum))|(?(6)(?!))((?&GetNum))|(?(7)(?!))((?&GetNum))|(?(8)(?!))((?&GetNum))|(?(9)(?!))((?&GetNum))|(?(10)(?!))((?&GetNum)))(?:,(?!\})|(?=\}))){3,7}\}(?(DEFINE)(?<GetNum>(?!(?:\g{1}|\g{2}|\g{3}|\g{4}|\g{5}|\g{6}|\g{7}|\g{8}|\g{9}|\g{10})\b)\d+))/
https://regex101.com/r/pPwPTe/1
Readable regex
# Unique numbers in set, 10 slots
\{
(?> # Atomic, no backtracking allowed
(?> # ditto
(?(1) (?!) ) ( (?&GetNum) ) # (1), Slot 1
| (?(2) (?!) ) ( (?&GetNum) ) # (2), Slot 2
| (?(3) (?!) ) ( (?&GetNum) ) # (3), Slot 3
| (?(4) (?!) ) ( (?&GetNum) ) # (4), Slot 4
| (?(5) (?!) ) ( (?&GetNum) ) # (5), Slot 5
| (?(6) (?!) ) ( (?&GetNum) ) # (6), Slot 6
| (?(7) (?!) ) ( (?&GetNum) ) # (7), Slot 7
| (?(8) (?!) ) ( (?&GetNum) ) # (8), Slot 8
| (?(9) (?!) ) ( (?&GetNum) ) # (9), Slot 9
| (?(10) (?!) ) ( (?&GetNum) ) # (10), Slot 10
)
(?: , (?! \} ) | (?= \} ) )
){3,7} # Set range, example: 3 to 7 unique numbers in set
\}
(?(DEFINE)
(?<GetNum> # (11) Get a new number, must not be seen before
(?! (?: \g{1}|\g{2}|\g{3}|\g{4}|\g{5}|\g{6}|\g{7}|\g{8}|\g{9}|\g{10} ) \b )
\d+
)
)
Given front matter and test cases of
#! /usr/bin/env perl
use strict;
use warnings;
my @tests = (
"{}",
"{1,1}",
"{1,2,3,4,5,6,8,9,10}",
"{1,1,2,3,4,5,6,8,9,10}",
"{1,2,3,4,5,6,7,8,9,10}",
"{10,9,8,7,6,5,4,3,2,1}",
"{10,9,8,6,5,4,3,2,1}",
"{10,9,8,6,5,4,3,2,1",
"{10,9,8,6,5,4,3,2,1,1}",
"{2,4,6,8,10,9,5,3,1}",
);
you have at least three approaches to implementing what you want.
Brute force
When in doubt, try a bigger hammer. Generate all permutations and bake those into your pattern directly. Note that this has a factorial cost, so it quickly becomes intractable as the number of elements in your set grows.
# perlfaq4: How do I permute N elements of a list?
sub permute (&@) {
    my $code = shift;
    my @idx = 0..$#_;
    while ( $code->(@_[@idx]) ) {
        my $p = $#idx;
        --$p while $idx[$p-1] > $idx[$p];
        my $q = $p or return;
        push @idx, reverse splice @idx, $p;
        ++$q while $idx[$p-1] > $idx[$q];
        @idx[$p-1,$q]=@idx[$q,$p-1];
    }
}
my $brute_force;
permute { local $" = ",";
          $brute_force .= "|" if $brute_force;
          $brute_force .= "{@_}" }
        @members;
$brute_force = qr/ ^ (?: $brute_force ) $/x;
for (@tests) {
    my $result = /$brute_force/x ? "ACCEPT" : "REJECT";
    print "$_ - $result\n";
}
Generating all permutations on my laptop takes about 3 minutes. Precomputing the pattern may or may not make sense depending on your application.
Piggyback on the regex engine’s backtracking
One way to do it is to take advantage of the Perl regex engine’s backtracking and to run (?{ code }) at various points within your pattern.
Define members of your set as below. Note that these must be global variables because of limitations of the regex engine, so use our and not my.
# must use package variables inside (?{ })
our @members = (1 .. 6, 8 .. 10);
our %remaining;
A pattern that matches permutations becomes
my $permutation = qr!
\{ (?{ @remaining{@members} = map +($_ => 1), @members })
( ([0-9]+), (?(?{ delete local $remaining{$^N} })|(*FAIL)))+
([0-9]+)\} (?(?{ delete local $remaining{$^N} && keys %remaining == 0 })|(*FAIL))
!x;
Code inside (?{ code }) sections runs at corresponding points of the pattern match. For example, the first one initializes the hash %remaining to contain all members of the set as keys.
The second and third (?{ code }) sections are within (?(condition)yes-pattern|no-pattern) sections and (*FAIL) backtracking control verbs. For any member before the last in the set (which we know because it is terminated by a comma), the member just matched, available in the $^N special variable, must be still available in %remaining. For the last member (terminated by right curly brace), the member just matched must be available and we must have covered all elements of the set to succeed. If these constraints are met, we match against an empty yes-pattern and continue successfully, but if one of these conditions fails, we meet (*FAIL) in the no-pattern. This causes the current attempted match to fail and the regex engine backtracks to attempt the next possibility.
Writing delete local localizes deletion of the particular key from %remaining. This delegates the error-prone bookkeeping to the regex engine that correctly restores localized values when it backtracks past a non-viable match.
Note that this implementation requires a set of at least two members.
Use it as in
for (@tests) {
    my $result = /^ $permutation $/x ? "ACCEPT" : "REJECT";
    print "$_ - $result\n";
}
Hybrid approach
Finally, combine the approaches by searching for everything that looks like a set and reject invalid permutations.
sub _assert_permutation_of {
    my($members,$set) = @_;
    my %seen = map +($_ => 1), @$members;
    while ($set =~ /\b([0-9]+)\b/g) {
        return unless delete $seen{$1};
    }
    keys %seen == 0;
}
my $hybrid = qr!
  ( \{               # opening brace
    (?: [0-9]+ , )+  # comma-terminated integers
    [0-9]+           # final integer
    \}               # closing brace
  )
  (?(?{ _assert_permutation_of \@members, $^N })|(*FAIL))
!x;
for (@tests) {
    my $result = /^ $hybrid $/x ? "ACCEPT" : "REJECT";
    print "$_ - $result\n";
}
Test output
For all three, the output is
{} - REJECT
{1,1} - REJECT
{1,2,3,4,5,6,8,9,10} - ACCEPT
{1,1,2,3,4,5,6,8,9,10} - REJECT
{1,2,3,4,5,6,7,8,9,10} - REJECT
{10,9,8,7,6,5,4,3,2,1} - REJECT
{10,9,8,6,5,4,3,2,1} - ACCEPT
{10,9,8,6,5,4,3,2,1 - REJECT
{10,9,8,6,5,4,3,2,1,1} - REJECT
{2,4,6,8,10,9,5,3,1} - ACCEPT
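For readers who do not use Perl, the hybrid idea (match the shape with a regex, then check membership in ordinary code) might look like this in Python. It is a sketch under the same test cases; the names MEMBERS, SET_SHAPE and is_permutation_of are assumptions, and numbers are compared as strings, as in the Perl version.

```python
import re

MEMBERS = ["1", "2", "3", "4", "5", "6", "8", "9", "10"]

# Shape check only: an opening brace, at least two comma-separated
# integers, and a closing brace.
SET_SHAPE = re.compile(r"\{([0-9]+(?:,[0-9]+)+)\}")


def is_permutation_of(text, members=MEMBERS):
    """Return True if text looks like {a,b,...} and lists each member once."""
    match = SET_SHAPE.fullmatch(text)
    if match is None:
        return False
    remaining = set(members)
    for number in match.group(1).split(","):
        if number not in remaining:   # unknown or repeated number
            return False
        remaining.remove(number)
    return not remaining              # every member must have been used


if __name__ == "__main__":
    for test in ["{}", "{1,1}", "{1,2,3,4,5,6,8,9,10}", "{10,9,8,6,5,4,3,2,1"]:
        print(test, "-", "ACCEPT" if is_permutation_of(test) else "REJECT")
```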

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( @{$body} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
if ( $delimit ) {
next if (/$delimit/ && ! $tp);
last if (/$delimit/ && $tp);
$tp = 1, next if /text.plain/;
$tp = 0, next if /text.html/;
s/<[^>]*>//g;
$newbody .= $_ if $tp;
} else {
s/<[^>]*>//g;
$newbody .= $_ ;
}
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( @{$body_text} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/^\s*NEMS\s*=\s*(\d+)/i;
$nems = $1;
next;
}
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
which includes the clarification that the new $body_text is a multiline string.
It is clear that $nems must be declared outside of the loop, and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit: It's been clarified that there can only be 5 or 7 digits. The regex can then be tightened to check that the input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut, for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
A later clarification states that all digits following = need to be captured into $nems (and there may be 5, 7, 8, 9 or 10 digits, but not 6). The regex is then simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means one or more digits, so it captures a string of digits (the match fails if there are none).
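As a cross-check of that final pattern, here is a small Python sketch of the same line-by-line search. The function name find_nems and the sample mail body are assumptions; unlike the Perl loop, it returns as soon as the first matching line is found.

```python
import re

# Same pattern as above: optional leading spaces, NEMS, '=', then digits.
NEMS_RE = re.compile(r"^\s*NEMS\s*=\s*(\d+)", re.IGNORECASE)


def find_nems(body_text):
    """Return the digits after NEMS= on the first matching line, or None."""
    for line in body_text.split("\n"):
        match = NEMS_RE.match(line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    print(find_nems("Hello\n\nNEMS=12345\nRegards"))
```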

Perl hash substitution with special characters in keys

My current script will take an expression, ex:
my $expression = '( a || b || c )';
and go through each boolean combination of inputs using sub/replace, like so:
my $keys = join '|', keys %stimhash;
$expression =~ s/($keys)\b/$stimhash{$1}/g;
So for example expression may hold,
( 0 || 1 || 0 )
This works great.
However, I would like to allow the variables (also in %stimhash) to contain a tag, *.
my $expression = '( a* || b* || c* )';
Also, printing the keys of the stimhash returns:
a*|b*|c*
It is not properly substituting/replacing with the extra special character, *.
It gives this warning:
Use of uninitialized value within %stimhash in substitution iterator
I tried using quotemeta() but did not have good results so far.
It will drop the values. An example after the substitution looks like:
( * || * || * )
Any suggestions are appreciated,
John
Problem 1
You use the pattern a* thinking it will match only a*, but a* means "0 or more a". You can use quotemeta to convert text into a regex pattern that matches that text.
Replace
my $keys = join '|', keys %stimhash;
with
my $keys = join '|', map quotemeta, keys %stimhash;
Problem 2
\b
is basically
(?<!\w)(?=\w)|(?<=\w)(?!\w)
But * (like the space) isn't a word character, so there is no word boundary between a * and the space that follows it. The solution might be to replace
s/($keys)\b/$stimhash{$1}/g
with
s/($keys)(?![\w*])/$stimhash{$1}/g
though the following makes more sense to me
s/(?<![\w*])($keys)(?![\w*])/$stimhash{$1}/g
Personally, I'd use
s{([\w*]+)}{ $stimhash{$1} // $1 }eg
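Both fixes carry over to other regex engines. As an illustration only, here is a Python sketch in which re.escape stands in for quotemeta and the lookarounds replace \b; the function name substitute and the sample values are assumptions.

```python
import re


def substitute(expression, values):
    """Replace each key of `values` found in `expression` with its value."""
    # re.escape plays the role of Perl's quotemeta: 'a*' becomes 'a\*'.
    keys = "|".join(re.escape(key) for key in values)
    # Lookarounds replace \b, because '*' is not a word character.
    pattern = re.compile(r"(?<![\w*])(" + keys + r")(?![\w*])")
    return pattern.sub(lambda m: str(values[m.group(1)]), expression)


if __name__ == "__main__":
    stimhash = {"a*": 0, "b*": 1, "c*": 0}
    print(substitute("( a* || b* || c* )", stimhash))
```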

Regular expression in Groovy. Catch fragment of a String

Given the following input:
BGM+220+105961-44+9'
DTM+137:20140121:102'
NAD+BY+0048003479::91'
NAD+SE+0000805406::91'
NAD+DP+0048003479::91'
CUX+2:USD+9'
PIA+1+M1PL05883LOT+":BP::92'
PIA+1+927700077001:VP::91'
PRI+AAA:9:::1:PCE'
SCC+1'
QTY+21:10000:PCE'
DTM+2:11022014:102'
PIA+1+M1PL05883LOT+":BP::92'
PIA+1+927700077001:VP::91'
PRI+AAA:9:::1:PCE'
SCC+1'
QTY+21:20000:PCE'
DTM+2:04022014:102'
UNS+S'
UNT++1'
UNZ+1+10596144'
The goal is to capture from the first line:
BGM+220+105961-44+9'
the value between "-" and "end of the digit". In the above example, it would be "44".
Thanks in advance
You could do:
text.tokenize( '\n' ) // split it based on newlines
.head() // grab the first one
.find( /-\d+/ ) // find '-44'
.substring( 1 ) // remove the '-'
Actually, you don't need to split it, so just:
text.find( /-\d+/ )?.substring( 1 )
does the same thing (as it's the first line you're interested in)
Edit after comment:
To get both the numbers surrounding the -, you could do:
def (pre,post) = text.find( /\d+-\d+/ )?.tokenize( '-' )
assert pre == '105961'
assert post == '44'
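The same two regexes work unchanged outside Groovy. A minimal Python sketch, where the function names order_suffix and number_pair are assumptions and only the first two lines of the input are used:

```python
import re


def order_suffix(text):
    """Return the digits after the first '-', e.g. '44', or None."""
    match = re.search(r"-(\d+)", text)
    return match.group(1) if match else None


def number_pair(text):
    """Return the numbers on both sides of the first digit-dash-digit."""
    match = re.search(r"(\d+)-(\d+)", text)
    return match.groups() if match else None


if __name__ == "__main__":
    text = "BGM+220+105961-44+9'\nDTM+137:20140121:102'"
    print(order_suffix(text), number_pair(text))
```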