I need a regex to allow either a single 0 or any other number of digits that do not start with zero so:
0 or 23443 or 984756 are allowed but 0123 is not allowed.
I have got the following which allows only 1 to 9
[1-9]\d
Look for a lone 0 or 1-9 followed by any other digits.
^(0|[1-9]\d*)$
If you want to match numbers inside of larger strings, use the word boundary marker \b in place of ^ and $:
\b(0|[1-9]\d*)\b
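As a quick way to try both patterns, here is a minimal sketch in Python rather than Perl or C#. The function names is_valid_number and find_numbers are made up for illustration. Note that Python's \d also accepts non-ASCII digits, so swap in [0-9] if that matters.

```python
import re

# Anchored form: the whole string must be a lone 0 or a number without a leading zero.
WHOLE = re.compile(r"^(0|[1-9]\d*)$")
# Word-boundary form: finds such numbers inside a larger string.
EMBEDDED = re.compile(r"\b(0|[1-9]\d*)\b")

def is_valid_number(s):
    """Return True if s is '0' or a digit string that does not start with 0."""
    return WHOLE.match(s) is not None

def find_numbers(text):
    """Return every standalone number in text that has no leading zero."""
    return [m.group(0) for m in EMBEDDED.finditer(text)]
```

With the word-boundary form, a token such as 0123 is skipped entirely, because there is no word boundary between its digits.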
You don't need to force everything into a single regex to do this.
It will be far clearer if you use multiple regexes, each one making a specific check. In Perl, you would do it like this, but you can adapt it to C# just fine.
if ( ($s eq '0') || (($s =~ /^\d+$/) && not ($s =~ /^0/)) )
You have made explicitly clear what the intent is:
if ( (string is '0') OR ((string is all digits) AND (string does not start with '0')) )
Note that the first check to see if the string is 0 doesn't even use a regex at all, because you are comparing to a single value.
Let the expressive form of your host language be used, rather than trying to cram logic into a regex.
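The same idea carries over to other host languages. A rough Python equivalent, assuming a hypothetical helper named is_plain_number, might look like this:

```python
import re

def is_plain_number(s):
    """Accept '0', or an all-digit string that does not start with '0'."""
    is_zero = (s == "0")                                   # plain comparison, no regex needed
    all_digits = re.fullmatch(r"[0-9]+", s) is not None    # string is all digits
    starts_with_zero = s.startswith("0")                   # string starts with '0'
    return is_zero or (all_digits and not starts_with_zero)
```

Each condition has a name, so the final line reads almost like the plain-language rule.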
I have been doing regular expression for 25+ years but I don't understand why this regex is not a match (using Perl syntax):
"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'
Can someone help solve that riddle?
The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
The drawback to this, as well as to consuming the whole such substring, is that all three subpatterns are spelled out literally. What if the number needs to be decided dynamically? Or if it's twelve? The pattern can be built at runtime, of course. In Perl, one way is
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
For the sake of performance, for long strings and many repetitions, make that .* non-greedy
(?=[iny].*?[iny].*?[iny])
(when forming the pattern dynamically join with .*?)
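For readers outside Perl, building the lookahead at runtime could be sketched in Python as below. The helper name scattered_lookahead and its parameters are assumptions for illustration, and count is assumed to be at least 1.

```python
import re

def scattered_lookahead(char_class, count, lazy=True):
    """Build a lookahead requiring `count` occurrences of char_class, in any positions."""
    gap = ".*?" if lazy else ".*"          # non-greedy gap by default
    return "(?=" + gap.join([char_class] * count) + ")"

pattern = scattered_lookahead("[iny]", 3)  # same shape as the hand-written pattern above

# "unify" contains n, i and y, so three occurrences are found starting at the n.
found = re.search(pattern, "unify") is not None
```

Because the lookahead begins with the character class itself, the search only succeeds at a position where one of those characters sits, which is fine for an unanchored search.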
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);
my $str = 'aa'
. join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
. 'a'x($n+1)
. 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
    my ($s, $pla) = @_;
    return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}
cmpthese(-$runfor, {
greedy => sub { match_repeated($str, $pat_greedy) },
non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
              Rate     greedy non_greedy
greedy      56.3/s         --      -100%
non_greedy 90169/s    159926%         --
So ... some 1600 times better non-greedy. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in greedy case); so there's a lot going on there. With default 2, so for a short string and a simpler pattern, the difference is 10%.
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
               Rate     greedy non_greedy
greedy       56.5/s         --      -100%
non_greedy 171949/s    304117%         --
The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing three characters three times, with replacement, means there are 3^3 = 27 matching substrings:
iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
To match non-adjacent letters you can use one of these:
[iny].*[iny].*[iny]
[iny](.*[iny]){2}
([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
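To see the difference between the adjacent and non-adjacent forms, here is a small sketch in Python (the variable names are illustrative only):

```python
import re

# Three consecutive characters from the set: fails on "unify", since f sits between i and y.
ADJACENT = re.compile(r"[iny]{3}")

# The three non-adjacent variants listed above.
SCATTERED = [
    re.compile(r"[iny].*[iny].*[iny]"),
    re.compile(r"[iny](.*[iny]){2}"),
    re.compile(r"([iny].*){3}"),
]

adjacent_hit = ADJACENT.search("unify")                      # None
scattered_hits = [p.search("unify").group(0) for p in SCATTERED]
# The trailing .* of the last variant swallows whatever follows the final letter.
tail_demo = [p.search("unify!").group(0) for p in SCATTERED]
```

On "unify!" the first two variants stop at the y, while the last one also consumes the "!", which is the over-matching mentioned in the note above.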
That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.
Perhaps you meant to use [inf] or [ify]?
Looks like you are looking for 3 consecutive letters, so yours should not match
[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
Hope that helps somewhat :)
The {3} quantifier means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive, as they are separated by other characters in your string.
It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.
I am trying understand the difference between glob and regex patterns. I need to do some pattern matching in TCL.
The purpose is to find out if a hexadecimal value has been entered.
The value may or may not start with 0x
The value shall contain between 1 and 12 hex characters i.e 0-9, a-f, A-F and these shall follow the 0x if it exists
The thing is that glob does not allow use of {a,b} to tell about how many characters to look for. Also, at start I tried to use (0x[Xx])? but I think this is not working.
It is not essential to use glob. I can see that there are subtle differences between glob and regex. I just want to know if this can be done only through regex and not glob.
Tcl's glob patterns are much simpler than regular expressions. All they support is:
* to mean any number of any character.
? to mean any single character.
[…] to mean any single character from the set (the chars inside the brackets, which may include ranges).
\x to mean a literal x (which can be any character). That's how you put a glob metacharacter in a glob pattern.
They're also always anchored at both ends. (Regular expressions are much more powerful. They're also slower. You pay for power.)
To match hex numbers like 0xF00d, you'd use a glob pattern like this:
0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]
(or, as an actual Tcl command; we put the pattern in {braces} to avoid needing lots of backslashes for all the brackets…)
string match {0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]} $value
Note that we have to match an exact number of characters. (You can shorten the pattern by using case-insensitive matching, to 0x[0-9a-f][0-9a-f][0-9a-f][0-9a-f].)
Matching hex numbers is better done with regexp or scan (which also parses the hex number). Everyone likes to forget scan for parsing, yet it's quite good at it…
regexp {^0x([[:xdigit:]]+)$} $value -> theHexDigits
scan $value "0x%x" theParsedValue
The thing is that glob does not allow use of {a,b} to tell about how
many characters to look for. Also, at start I tried to use (0x[Xx])?
but I think this is not working.
A commonly used regular expression, not specific to Tcl at all, is ^(0[xX])?[A-Fa-f0-9]{1,12}$.
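Since that expression is not Tcl-specific, it can be tried in any engine. A minimal sketch in Python follows (is_hex is a made-up helper name). It uses fullmatch so that a trailing newline is not silently accepted by $.

```python
import re

# Optional 0x/0X prefix, then 1 to 12 hex digits.
HEX = re.compile(r"^(0[xX])?[A-Fa-f0-9]{1,12}$")

def is_hex(value):
    """Return True if value is 1 to 12 hex digits with an optional 0x prefix."""
    return HEX.fullmatch(value) is not None
```

A bare "0x" is rejected, because at least one hex digit must follow the prefix.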
Update
As Donal writes, there is a power-cost tradeoff when it comes to regexp. I was curious and, for the given requirements (optional 0x prefix, range check [1,12]), found that a carefully crafted script using string operations incl. string match (see isHex1 below) outperforms regexp in this setting (see isHex2), whatever the input case:
proc isHex1 {str min max} {
set idx [string last "0x" $str]
if {$idx > 0} {
return 0
} elseif {$idx == 0} {
set str [string range $str 2 end]
}
set l [string length $str]
expr {$l >= $min && $l <= $max && [string match -nocase [string repeat {[0-9a-f]} $l] $str]}
}
proc isHex2 {str min max} {
set regex [format {^(0x)?[[:xdigit:]]{%d,%d}$} $min $max]
regexp $regex $str
}
isHex1 extends the idea of computing the string match pattern based on the input length (w/ or w/o prefix) and string repeat. My own timings suggest that isHex1 runs at least 40% faster than isHex2 (all using time, 10000 iterations), in a worst case (within range, final character decides). Other cases (e.g., out-of-range) are substantially faster.
The glob syntax is described in the string match documentation. Compared to regular expressions, glob is a blunt instrument.
With regular expressions, you get the standard character classes, including [:xdigit:] to match a hexadecimal digit.
To contrast with mrcalvin's answer, a Tcl-specific regex would be: (?i)^(0x)?[[:xdigit:]]{1,12}$
the leading (?i) means the expression will be matched case-insensitively.
If all you care about is determining if the input is a valid number, you can use string is integer:
set s 0xdeadbeef
string is integer $s ;# => 1
set s deadbeef
string is integer $s ;# => 0
set s 0xdeadbeetle
string is integer $s ;# => 0
I am a newbie to RegEx and have done a lot of searching already but have not found anything specific.
I am writing a regular expression that validates a password string.
The acceptable string must have at least 3 of 4 character types: digits, lowercase, uppercase, special char[<+$*)], but must not include another special set of characters(|;{}).
I got an idea regarding the inclusion(that is if it is the right way).
It looks like this:
^((a-z | A-Z |\d)(a-z|A-Z|[<+$*])(a-z|[<+$*]|\d)(A-Z|[<+$*]|\d)).*$
How do I ensure that user does not enter special chars(|;{})
This is what I tried with the exclusion string:
^(?:(?![|;{}]).)*$
I have tried a bit of tricks to combine the two in a single regEx but can't get it to work.
Any input on how to do this right?
Don't try to do it all in one regex. Make two different checks.
Say you're working in Perl (since you didn't specify language):
$valid_pw =
    ( $pw =~ /^((a-z | A-Z |\d)(a-z|A-Z|[<+$*])(a-z|[<+$*]|\d)(A-Z|[<+$*]|\d)).*$/ ) &&
    ( $pw !~ /[|;{}]/ );
You're saying "If the PW matches all the inclusion rules, and the PW does NOT match any of the excluded characters, then the password is valid."
Look how much clearer that is than something like @Jerry's response above of:
^(?![^a-zA-Z]*$|[^a-z0-9]*$|[^a-z<+$*]*$|[^A-Z0-9]*$|[^A-Z<+$*]*$|[^0-9<+$*]*$|.*[|;{}]).*$
I don't doubt that Jerry's version works, but which one do you want to maintain?
In fact, you could break it down even further and be extremely clear:
my $cases_matched = 0;
$cases_matched++ if ( $pw =~ /\d/ ); # digits
$cases_matched++ if ( $pw =~ /[a-z]/ ); # lowercase
$cases_matched++ if ( $pw =~ /[A-Z]/ ); # uppercase
$cases_matched++ if ( $pw =~ /[<+\$*)]/ ); # special characters
my $is_valid = ($cases_matched >= 3) && ($pw !~ /[|;{}]/); # At least 3, and none forbidden.
Sure, that takes up 6 lines instead of one, but in a year when you go back and have to add a new rule, or figure out what the code does, you'll be glad you wrote it that way.
Just because you can do it in one regex doesn't mean you should.
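If you are not working in Perl, the same step-by-step check translates directly. Here is a sketch in Python, assuming the special set is [<+$*)] as given in the question. The names CLASSES, FORBIDDEN and is_valid_password are made up for illustration.

```python
import re

# One small pattern per character type; each is checked independently.
CLASSES = [
    r"[0-9]",     # digits
    r"[a-z]",     # lowercase
    r"[A-Z]",     # uppercase
    r"[<+$*)]",   # allowed special characters
]
FORBIDDEN = r"[|;{}]"  # any one of these makes the password invalid

def is_valid_password(pw):
    """At least 3 of the 4 character types, and no forbidden character."""
    classes_matched = sum(1 for c in CLASSES if re.search(c, pw))
    return classes_matched >= 3 and re.search(FORBIDDEN, pw) is None
```

Adding a fifth character type later means adding one line to CLASSES, not reworking a long expression.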
Your current regex will not work for enforcing the at least 3 of 4 requirement. Using regex for this gets pretty complicated, but in my opinion the best way to do this is to use a negative lookahead that contains all of the failure cases, so that the entire match will fail if any of the negative cases are met. In this case the "at least 3 of 4" requirement can also be described as "fail if any 2 groups are not found". This also makes it very easy to add the final requirement to ensure that no characters from [|;{}] are found:
^ # beginning of string anchor
(?! # fail if
[^a-zA-Z]*$ # no [a-z] or [A-Z] anywhere
| # OR
[^a-z0-9]*$ # no [a-z] or [0-9] anywhere
| # OR
[^a-z<+$*]*$ # no [a-z] or [<+$*] anywhere
| # OR
[^A-Z0-9]*$ # no [A-Z] or [0-9] anywhere
| # OR
[^A-Z<+$*]*$ # no [A-Z] or [<+$*] anywhere
| # OR
[^0-9<+$*]*$ # no [0-9] or [<+$*] anywhere
| # OR
.*[|;{}] # a character from [|;{}] exists
)
.*$ # made it past the negative cases, match the entire string
Here it is as a single line:
^(?![^a-zA-Z]*$|[^a-z0-9]*$|[^a-z<+$*]*$|[^A-Z0-9]*$|[^A-Z<+$*]*$|[^0-9<+$*]*$|.*[|;{}]).*$
Example: http://rubular.com/r/4YV6Aj0vqh
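The single-line form can also be checked outside Rubular. This is a minimal sketch in Python, with the pattern copied as-is and a hypothetical wrapper named passes:

```python
import re

# Negative lookahead listing every failure case: any two character groups missing,
# or a forbidden character present.
PASSWORD = re.compile(
    r"^(?![^a-zA-Z]*$|[^a-z0-9]*$|[^a-z<+$*]*$|[^A-Z0-9]*$|[^A-Z<+$*]*$|[^0-9<+$*]*$|.*[|;{}]).*$"
)

def passes(pw):
    """Return True if pw survives every failure case in the lookahead."""
    return PASSWORD.match(pw) is not None
```

For example, "abc123" is rejected by the [^A-Z<+$*]*$ branch, since it contains neither an uppercase letter nor a special character.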
This is for accepting only the characters you mentioned:
^(?:(?=.*[0-9])(?=.*[a-z])(?=.*[<+$*)])|(?=.*[a-z])(?=.*[<+$*)])(?=.*[A-Z])|(?=.*[0-9])(?=.*[A-Z])(?=.*[<+$*)])|(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z]))[0-9A-Za-z<+$*)]+$
And this one for all the characters you mentioned, and any special characters except |;{}.
^(?:(?=.*[0-9])(?=.*[a-z])(?=.*[<+$*)])|(?=.*[a-z])(?=.*[<+$*)])(?=.*[A-Z])|(?=.*[0-9])(?=.*[A-Z])(?=.*[<+$*)])|(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z]))(?!.*[|;{}].*$).+$
(One difference is that the first regex doesn't accept the special char # but the second does).
I have also used + since passwords logically can't be 0 width.
However, it's quite long, longer than F.J's regex, oh well. That's because I'm using positive lookaheads, which require more checks.
I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order. If a 5 appears anywhere without a 4 directly before it, that is a problem: for example, if the number is 322195458900023, the 545 is a problem. A 5 always has to come right after a 4.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
(\d) # match a digit, save it in $1
(\d) # match another digit, save it in $2
(??{ # start postponed regex
$1 < $2 # if $1 is smaller than $2
? "" # then return the empty string (i.e. succeed)
: "(*FAIL)" # else return the *FAIL verb
}) # close postponed regex
}x; # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
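For comparison, the non-regex route can be sketched as a plain walk over adjacent characters. This is in Python rather than Perl, and has_increasing_digit_pair is a made-up name. It is meant to accept the same inputs as the one-liner above for single-line ASCII input.

```python
DIGITS = "0123456789"

def has_increasing_digit_pair(s):
    """Return True if s contains two adjacent digits where the first is smaller."""
    return any(
        a in DIGITS and b in DIGITS and a < b   # single digits compare correctly as characters
        for a, b in zip(s, s[1:])               # every pair of neighbouring characters
    )
```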
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: The string is composed from any number (zero or more) characters that are not a five, or a five that is followed by either a character that is not a four or the five is the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
You just need to search for any 5 that is not preceded by a 4; if one is found, the check fails:
if( $str =~ /(?<!4)5/ ) {
#Fail
}
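The same negative lookbehind works in most engines. Here is a small sketch in Python, with a hypothetical helper named fives_follow_fours:

```python
import re

def fives_follow_fours(s):
    """Return True if every 5 in s is directly preceded by a 4."""
    # (?<!4)5 finds a 5 whose previous character is not 4 (or that starts the string).
    return re.search(r"(?<!4)5", s) is None
```

Note that this only checks the 5s. A 4 that is not followed by a 5 is still accepted.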
I'm struggling with checking the validity of version numbers in Perl. Correct version number is like this:
Starts with either v or ver,
After that a number, if it is 0, then no other numbers are allowed in this part (e.g. 10, 3993 and 0 are ok, 01 is not),
After that a full stop, a number, full stop, number, full stop and number.
I.e. a valid version number could look something like v0.123.45.678 or ver18.493.039.1.
I came up with the following regexp:
if ($ver_string !~ m/^v(er)?(0{1}\.)|([1-9]+\d*\.)\d+\.\d+\.\d+/)
{
#print error
}
But this does not work, because a version number like verer01.34.56.78 gets accepted. I can't understand this, I know Perl tends to be greedy, but shouldn't ^v(er)? make sure that there can be a max of one "er"? And why doesn't 0{1}. match only "0.", instead of accepting "01." as well?
This regex actually caught the "erer" thing: m/^v(er)?[0-9.]+/ but I can't see where I allow it in my attempt.
Your problem is that the or - | - you are using is splitting the whole pattern in two. A | will scope to brackets or the end of an expression rather than just on the two neighbouring items.
You need to put in some extra brackets to show which part of the expression you want or-ed. So a first step to fixing your pattern would be:
^v(er)?((0{1}\.)|([1-9]+\d*\.))\d+\.\d+\.\d+
You also want to put a $ at the end to ensure there are no spurious characters at the end of the version number.
Also, putting {1} is unnecessary, as it means the previous item exactly once, which is the default. However, you could use {3} at the end of your pattern, as you want three dot-digit groups there.
Similarly, you don't need the + after the [1-9] as other digits will be grabbed by the \d*.
And we can also remove the unnecessary brackets.
So you can simplify your pattern to the following (note that the dot must be escaped, otherwise it matches any character):
^v(er)?(0|[1-9]\d*)(\.\d+){3}$
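As a sanity check, here is a sketch that runs the simplified pattern, with the dots escaped so they match literally, against the version strings discussed in this thread. It is in Python rather than Perl, and version_ok is just an illustrative name.

```python
import re

# v or ver, a first number without a leading zero, then three dot-number groups.
VERSION = re.compile(r"^v(er)?(0|[1-9]\d*)(\.\d+){3}$")

def version_ok(s):
    """Return True if s is a valid version string under the rules above."""
    return VERSION.fullmatch(s) is not None
```

Only the first number is barred from having a leading zero, so ver18.493.039.1 is still accepted.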
You could do it with a single regexp, or you could do it in 2 steps, the second step being to check that the first number doesn't start with a 0.
BTW, I tend to use [0-9] instead of \d for numbers, there are lots of characters that are classified as numbers in the Unicode standard (and thus in Perl) that you may not want to deal with.
Here is a sample code, the version_ok sub is where everything happens.
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 7;
while( <DATA>)
{ chomp;
my( $version, $expected)= split /\s*=>\s*/;
is( version_ok( $version), $expected, $version);
}
sub version_ok
{ my( $version)=@_;
if( $version=~ m{^v(?:er)? # starts with v or ver
([0-9]+) # the first number
(?:\.[0-9]+){3} # 3 times dot number
$}x) # end
{ if( $1 =~ m{^0[0-9]})
{ return 0; } # no good: first number starts with 0
else
{ return 1; }
}
else
{ return 0; }
}
__DATA__
v0.123.45.678 => 1
ver18.493.039.1 => 1
verer01.34.56.78 => 0
v01.5.5.5 => 0
ver101.5.5.5 => 1
ver101.5.5. => 0
ver101.5.5 => 0
The regex may work for your test cases, but the CPAN module Perl::Version would seem like the best option, with two caveats:
haven't tried it myself
seems like the latest module release was in 2007 - kind of makes it a recursive problem