How to replace lookahead in regex? - regex

I wrote a regex that validates an input string. It must be at least 8 characters long (composed of alphanumeric and punctuation characters) and it must contain at least one digit and one alphabetic character. So I've come up with this regex:
^(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9-,._;:]{8,}$
Now I have to rewrite this regex in a language that doesn't support lookahead. How should I rewrite it?
Valid inputs are:
1foo,bar
foo,bar1
1fooobar
foooobar1
fooo11bar
1234x567
a1234567
Invalid inputs:
fooo,bar
1234-567
.1234567

There are two approaches. One is to compose a single expression which handles all possible alternatives:
^[a-zA-Z][0-9][a-zA-Z0-9-,._;:]{6,}$
|
^[a-zA-Z][a-zA-Z0-9-,._;:][0-9][a-zA-Z0-9-,._;:]{5,}$
|
^[a-zA-Z][a-zA-Z0-9-,._;:]{2}[0-9][a-zA-Z0-9-,._;:]{4,}$
etc. This is a combinatoric nightmare, but it would work.
A much simpler approach is to validate the same string twice using two expressions:
^[a-zA-Z0-9-,._;:]{8,}$ # check length and permitted characters
and
[a-zA-Z].*[0-9]|[0-9].*[a-zA-Z] # check required characters
EDIT: @briandfoy correctly points out that it will be more efficient to search for each required character separately:
[a-zA-Z] # check for required alpha
and
[0-9] # check for required digit
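As a minimal sketch of the separate-checks approach, here are the three patterns above applied one after another. Python is used purely for illustration (none of these patterns needs lookahead, so they should carry over to other engines), and the names `RULES` and `is_valid` are made up for this example.

```python
import re

# One simple pattern per rule; none of them needs lookahead.
RULES = [
    re.compile(r"[a-zA-Z0-9,._;:-]{8,}"),  # length and permitted characters
    re.compile(r"[a-zA-Z]"),                # at least one letter
    re.compile(r"[0-9]"),                   # at least one digit
]


def is_valid(s):
    # The first rule must cover the whole string; the others may match anywhere.
    if RULES[0].fullmatch(s) is None:
        return False
    return all(rule.search(s) for rule in RULES[1:])


for candidate in ["1foo,bar", "a1234567", "fooo,bar", ".1234567"]:
    print(candidate, is_valid(candidate))
```

Changing the password policy then means adding or removing one entry in the list rather than rewriting a single large expression.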

This question was originally tagged as perl, and that's how I answered it. For the Oracle side, I have no idea how you'd do the same thing. However, I'd try to validate the input before it got that far.
I wouldn't do this in one regular expression. When you decide to change the rules, you'll have the same amount of work to craft the new regular expression. I wouldn't use lookarounds for this even if they were available since I wouldn't want to tolerate all the backtracking.
This looks like it's a lot of code, but the part that addresses your problem is just the subroutine. It has very simple patterns. When the password rules change, you add or delete patterns. It might be worth it to use study, but I didn't investigate that:
use v5.10;
use strict;
use Test::More;

my @valids = qw(
    1foo,bar
    foo,bar1
    1fooobar
    foooobar1
    fooo11bar
);
my @invalids = qw(
    fooo,bar
    short
    nodigitbutlong
    12345678
    ,,,,,,,,
);

sub is_good_password {
    my( $password ) = @_;
    # One simple pattern per rule; add or delete entries when the rules change
    state $rules = [
        qr/\A[A-Z0-9,._;:-]{8,}\z/i,   # length and permitted characters
        qr/[0-9]/,                     # at least one digit
        qr/[A-Z]/i,                    # at least one letter
    ];
    foreach my $rule ( @$rules ) {
        return 0 unless $password =~ $rule;
    }
    return 1;
}

foreach my $valid ( @valids ) {
    ok( is_good_password( $valid ), "Password $valid is valid" );
}
foreach my $invalid ( @invalids ) {
    ok( ! is_good_password( $invalid ), "Password $invalid is invalid" );
}
done_testing();

The best I can come up with right now is
(.*[a-zA-Z].*[0-9].*|.*[0-9].*[a-zA-Z].*)
But you have to check the length of the string separately.

I would play around with these ideas to get the best performance:
This should be faster for short valid inputs, but will be slower (because of backtracking) for inputs like "0a000000000000000000" or "aaaaaaaaaaaaaaa":
regexp_like(regexp_substr(input_string, '^[a-zA-Z0-9_,.;:-]{8,}$'),
'[0-9].*[a-zA-Z]|[a-zA-Z].*[0-9]')
This should be faster if there are a lot of invalid inputs (don't miss the negated class [^...] on the 2nd line):
(length(input_string) >= 8 and
not regexp_like(input_string, '[^a-zA-Z0-9_,.;:-]') and
regexp_like(input_string, '[a-zA-Z]') and
regexp_like(input_string, '[0-9]'))

Related

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash by doing this:
### Domains ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; #this regex get the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently prints this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two 1-3 long digits separated by a dot
But if you only need to pick out the domains that are not numbered, you could just check whether the last character of the domain is a letter, because TLDs cannot contain digits. You would do something like:
/[a-zA-Z]$/ -- if this returns true, it is not a numbered domain (provided you've stripped spaces and newlines)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+.\w+)$/)
{
#the domain is assigned to the variable $domain
} else {
#it is a number domain
}
Take what it currently puts, and use the regex:
/\d+\.\d+/
If it matches, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. This is the sub-pattern to /\d+\.\d+/, so both regular expressions will match the same thing.
The
$dom =~ /(\w+\.\w+)$/;
is anchored to the end of $dom by the $, so it matches the last two runs of letters, digits, or underscores that have a period between them. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This sets $foo to whatever $temp matched, but then immediately resets $foo to OTHER and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.
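For comparison, here is a minimal sketch of the counting logic the question seems to be after, written in Python for illustration. It assumes every name ends in a `word.word` tail and that a tail made only of digits should be counted under `OTHER`. The function name `bucket` and the sample list are made up for this example.

```python
import re
from collections import Counter


def bucket(name):
    """Return the counting key for a host name, or None if it has no 'x.y' tail."""
    m = re.search(r"(\w+\.\w+)$", name)
    if m is None:
        return None
    tail = m.group(1)
    # An all-digit tail such as "72.154" is the end of an IP address, not a domain.
    if re.fullmatch(r"[0-9]+\.[0-9]+", tail):
        return "OTHER"
    return tail


names = [
    "something.something.72.154",
    "something.something.73.205",
    "something.something.com.br",
    "something.something.cox.net",
]
counts = Counter(b for b in map(bucket, names) if b is not None)
for key, n in sorted(counts.items()):
    print(key, n)
```

Note that the digit test has no leading `^d+` typo: it checks the whole tail against digits, a period, and digits.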

Is it possible to make conditional regex of following?

Hello I am wondering whether it is possible to do this type of regex:
I have certain characters representing objects, e.g. @, #, $, and operations that may be used on them, like +, -, %. Every object has a different set of operations, and I want my regex to find the valid pairs.
So for example I want the pairs @+, #-, $+ to be matched, but the pair $- not to be matched, as it is invalid.
So is there any way to do this with regexes only, without doing some gymnastics in the language around the regex engine?
Give every object its own rules in []:
/(\@[+-]|\$[+]|#[+-])/
You need to properly escape special characters.
Gymnastics is hard. Try something like /\@\+|#-|\$\+/ or something like that.
Just remember, +, $, and ^ are reserved, so they'll need to be escaped.
Another approach, mix not allowed with raw combinations, but this might be slower.
/(?!\$-|\$\%)([\@\$\#][+\-\%])/, though not if there are many alternations of the first character.
my $str = '
@+, #-, $+ to be matched,
but pair $- not to be matched as it is invalid.
$% $- #% $%
';
my $regex =
    qr/
        (?!\$-|\$\%)          # Specific combinations not allowed
        (
            [\@\$\#][+\-\%]   # Raw combinations allowed
        )
    /x;
while ( $str =~ /$regex/g ) {
    print "found: '$1'\n";
}
__END__
Output:
found: '@+'
found: '#-'
found: '$+'
found: '#%'
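The "one character class per object" idea can also be generated from a table instead of being typed by hand. The following is only a sketch in Python: the `ALLOWED` table is a hypothetical rule set (the question does not list the full rules), and `build_pattern` is a name invented for this example.

```python
import re

# Hypothetical rule table: each object character maps to its allowed operations.
ALLOWED = {"@": "+-", "#": "+-", "$": "+"}


def build_pattern(allowed):
    # One alternative per object: the escaped object followed by a
    # character class holding the operations that object accepts.
    parts = []
    for obj, ops in allowed.items():
        op_class = "".join(re.escape(op) for op in ops)
        parts.append(re.escape(obj) + "[" + op_class + "]")
    return re.compile("|".join(parts))


PAIRS = build_pattern(ALLOWED)
text = "@+, #-, $+ to be matched, but $- is invalid. $% @% #+"
print(PAIRS.findall(text))
```

Because every special character goes through `re.escape`, nothing has to be escaped by hand when a new object or operation is added to the table.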

How can I parse a phone number in Perl?

I am trying to grab any digits in front of a known line number of a phone, if they exist (in Perl). There will be no dashes, only digits.
For example, say I know the line number will always be 8675309. 8675309 may or may not have leading digits, if it does I want to capture them. There is not really a limit on the number of leading digits.
$input $digits $number
'8675309' '' '8675309'
'8008675309' '800' '8675309'
'18888675309' '1888' '8675309'
'18675309' '1' '8675309'
'86753091' not a match
/8675309$/ will match, but how do I capture the leading digits in the same regex?
Some regexes work better backwards than forwards. So sometimes it is useful to use sexeger, rather than regexes.
my $pn = '18008675309';
reverse($pn) =~ /^9035768(\d*)/;
my $got = reverse $1;
The regex is cleaner and avoids a lot of backtracking, at the cost of some fiddling with reversing the input and the captured value.
The backtracking gain is smaller in this case than it would be if you had a general phone number extraction regex:
Regex: /^(\d*)\d{7}$/
Sexeger: /^\d{7}(\d*)/
There is a whole class of problems where this technique is useful. For more info see the sexeger post on Perlmonks.
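The reversal trick is not specific to Perl. Here is a small Python sketch of the same idea, assuming the input should consist of digits only. `LINE_NUMBER` and `leading_digits_reversed` are names made up for this example.

```python
import re

LINE_NUMBER = "8675309"  # the known line number from the question


def leading_digits_reversed(s):
    """Return the digits before LINE_NUMBER, or None if s does not end with it."""
    # Reverse the input so the fixed part sits at the front of the string.
    m = re.fullmatch(re.escape(LINE_NUMBER[::-1]) + r"([0-9]*)", s[::-1])
    if m is None:
        return None
    # The capture is reversed too, so flip it back.
    return m.group(1)[::-1]


print(leading_digits_reversed("18008675309"))
```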
my($digits,$number);
if ($input =~ /^(\d*)(8675309)$/) {
($digits,$number) = ($1,$2);
}
The * quantifier is greedy, but that means it matches as much as possible while still allowing a match. So initially, yes, \d* tries to gobble up all the digits in $number, but it reluctantly gives up character-by-character what it's matched until the whole pattern matches successfully.
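The same greedy-then-backtrack behaviour can be checked against the table in the question. This is a Python sketch, with `split_number` as a made-up helper name and `[0-9]` used in place of `\d` to keep the match to ASCII digits.

```python
import re


def split_number(s):
    """Return (leading_digits, line_number), or None when s is not a match."""
    # [0-9]* first grabs every digit, then gives characters back
    # until the literal 8675309 can match at the end of the string.
    m = re.fullmatch(r"([0-9]*)(8675309)", s)
    return m.groups() if m else None


for s in ["8675309", "8008675309", "18888675309", "18675309", "86753091"]:
    print(repr(s), split_number(s))
```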
Another approach is to chop off the tail:
(my $digits = $input) =~ s/8675309$//;
You could do the same without using a regular expression:
my $digits = $input;
substr($digits, -7) = "";
The above, at least with perl-5.10-1, could even be condensed to
substr(my $digits = $input, -7) = "";
The regex special variables $` and $& are another way of grabbing those pieces of information. They hold the contents of the data preceding the match and the match itself respectively.
if ( /8675309$/ )
{
printf( "%s,%s,%s\n", $_, $`, $& );
}
else
{
printf( "%s,Not a match\n", $_ );
}
There's a Perl package that deals with at least UK and US phone numbers.
It's called Number::Phone and the code is somewhere on the cpan.org site.
How about /(\d)?(8675309)/?
UPDATE:
Whoops, that should have been /(\d*)(8675309)/
I might not understand the problem. Why is there a difference between the first and fourth examples:
'8675309' '' '8675309'
...
'8675309' '1' '8675309'
If all you want is to separate the last seven digits from everything else, you could have said it that way rather than provide confusing examples. A regex for that would be:
/(\d*)(\d{7,7})$/
If you weren't just providing a hypothetical number, and really are only looking for lines with '8675309' (seems strange), replace the '\d{7,7}' with '8675309'.

Regex to check fix length field with packed space

Say I have a text file to parse, which contains some fixed length content:
123jackysee        45678887
456charliewong     32145644
<3><------16------><--8---> # Not part of the data.
The first three characters are the ID, followed by 16 characters of user name, followed by an 8-digit phone number.
I would like to write a regular expression to match and verify the input for each line, the one I come up with:
(\d{3})([A-Za-z ]{16})(\d{8})
The user name should contain 8-16 characters. But ([A-Za-z ]{16}) would also match an empty (all-space) value. I thought of ([A-Za-z]{8,16} {0,8}), but that would accept more than 16 characters. Any suggestions?
No, no, no, no! :-)
Why do people insist on trying to pack so much functionality into a single RE or SQL statement?
My suggestion, do something like:
Ensure the length is 27.
Extract the three components into separate strings (0-2, 3-18, 19-26).
Check that the first matches "\d{3}".
Check that the second matches "[A-Za-z]{8,} *".
Check that the third matches "\d{8}".
If you want the entire check to fit on one line of source code, put it in a function, isValidLine(), and call it.
Even something like this would do the trick:
def isValidLine(s):
    if s.len() != 27 return false
    return s.match("^\d{3}[A-Za-z]{8,} *\d{8}$")
Don't be fooled into thinking that's clean Python code, it's actually PaxLang, my own proprietary pseudo-code. Hopefully, it's clear enough, the first line checks to see that the length is 27, the second that it matches the given RE.
The middle field is automatically 16 characters total due to the first line and the fact that the other two fields are fixed-length in the RE. The RE also ensures that it's eight or more alphas followed by the right number of spaces.
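For readers who want something runnable, here is a hedged Python rendering of the pseudo-code above. It assumes a line carries no trailing newline and that only ASCII letters and digits are acceptable. The names `LINE_RE` and `is_valid_line` are made up for this example.

```python
import re

# 3 digits, at least 8 letters, optional space padding, 8 digits.
LINE_RE = re.compile(r"[0-9]{3}[A-Za-z]{8,} *[0-9]{8}")


def is_valid_line(s):
    # The length check pins the middle field to exactly 16 characters,
    # because the other two fields are fixed-width in the pattern.
    return len(s) == 27 and LINE_RE.fullmatch(s) is not None


print(is_valid_line("123" + "jackysee".ljust(16) + "45678887"))
```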
To do this sort of thing with a single RE would be some monstrosity like:
^\d{3}(([A-Za-z]{8} {8})
|([A-Za-z]{9} {7})
|([A-Za-z]{10} {6})
|([A-Za-z]{11} {5})
|([A-Za-z]{12} {4})
|([A-Za-z]{13} {3})
|([A-Za-z]{14} {2})
|([A-Za-z]{15} )
|([A-Za-z]{16}))
\d{8}$
You could do it by ensuring it passes two separate REs:
^\d{3}[A-Za-z]{8,} *\d{8}$
^.{27}$
but, since that last one is simply a length check, it's no different to the isValidLine() above.
I would use the regex you suggested with a small addition:
(\d{3})([A-Za-z]{3,16} {0,13})(\d{8})
which will match things that have a non-whitespace username but still allow space padding. The only addition is that you would then have to check the length of each input to verify the correct number of characters.
Hmm... Depending on the exact version of Regex you're running, consider:
(?P<id>\d{3})(?=[A-Za-z\s]{16}\d)(?P<username>[A-Za-z]{8,16})\s*(?P<phone>\d{8})
Not 100% sure this will work, and I've used the whitespace escape \s instead of an actual space. I get nervous with just the space character myself, but you may want to be more restrictive.
See if it works. I'm only intermediate with RegEx myself, so I might be in error.
Check out the named groups syntax for your version of RegEx a) exists and b) matches the standard I've used above.
EDIT:
Just to expand what I'm trying to do (sorry to make your eyes bleed, Pax!) for those without a lot of RegEx experience:
(?P<id>\d{3})
This will try to match a named capture group - 'id' - that is three digits in length. Most versions of RegEx let you use named capture groups to extract the values you matched against. This lets you do validation and data capture at the same time. Different versions of RegEx have slightly different syntaxes for this - check out http://www.regular-expressions.info/named.html for more detail regarding your particular implementation.
(?=[A-Za-z\s]{16}\d)
The ?= is a lookahead operator. This looks ahead for the next sixteen characters, and will return true if they are all letters or whitespace characters AND are followed by a digit. The lookahead operator is zero length, so it doesn't actually return anything. Your RegEx string keeps going from the point the Lookahead started. Check out http://www.regular-expressions.info/lookaround.html for more detail on lookahead.
(?P<username>[A-Za-z]{8,16})\s*
If the lookahead passes, then we keep counting from the fourth character in. We want to find eight-to-sixteen characters, followed by zero or more whitespaces. The 'or more' is actually safe, as we've already made sure in the lookahead that there can't be more than sixteen characters in total before the next digit.
Finally,
(?P<phone>\d{8})
This should check the eight-digit phone number.
I'm a bit nervous that this won't exactly work - your version of RegEx may not support the named group syntax or the lookahead syntax that I'm used to.
I'm also a bit nervous that this Regex will successfully match an empty string. Different versions of Regex handle empty strings differently.
You may also want to consider anchoring this Regex between a ^ and $ to ensure you're matching against the whole line, and not just part of a bigger line.
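Python's `re` module happens to support both the `(?P<name>...)` syntax and lookahead, so the pattern can be tried there as written. This is only a sketch: `fullmatch` is used for the anchoring suggested above, and `parse_record` is a made-up helper name.

```python
import re

RECORD_RE = re.compile(
    r"(?P<id>\d{3})"                   # three-digit ID
    r"(?=[A-Za-z\s]{16}\d)"            # the next 16 chars are letters/whitespace, then a digit
    r"(?P<username>[A-Za-z]{8,16})\s*" # 8-16 letters plus any padding
    r"(?P<phone>\d{8})"                # eight-digit phone number
)


def parse_record(line):
    """Return a dict of the three fields, or None if the line is invalid."""
    m = RECORD_RE.fullmatch(line)
    return m.groupdict() if m else None


print(parse_record("123" + "jackysee".ljust(16) + "45678887"))
print(parse_record(""))
```

In this engine the empty string does not match, because the first group already requires three digits.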
Assuming you mean perl regex and if you allow '_' in the username:
perl -ne 'exit 1 unless /(\d{3})(\w{8,16})\s+(\d{8})/ && length == 28'
@OP, not every problem needs a regex. Your problem is pretty simple to check. Depending on what language you are using, there will be some sort of built-in string functions. Use them.
The following minimal example is done in Python.
import sys
for line in open("file"):
    line = line.rstrip("\n")
    # check that the first 3 chars are digits
    if not line[0:3].isdigit(): sys.exit()
    # check the username: 8 to 16 letters once the padding is stripped
    name = line[3:19].rstrip()
    if not (name.isalpha() and 8 <= len(name) <= 16): sys.exit()
    # check the total length and that the phone number is all digits
    if len(line) != 27 or not line[19:27].isdigit(): sys.exit()
    print(line)
I also don't think you should try to pack all the functionality into a single regex. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my @fields = split;
if (
( my ($id, $name) = $fields[0] =~ /^([0-9]{3})([A-Za-z]{8,16})$/ )
and ( my ($phone) = $fields[1] =~ /^([0-9]{8})$/ )
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
And here is another way:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my ($id, $name, $phone) = unpack 'A3A16A8';
if ( is_valid_id($id)
and is_valid_name($name)
and is_valid_phone($phone)
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
sub is_valid_id { ($_[0]) = ($_[0] =~ /^([0-9]{3})$/) }
sub is_valid_name { ($_[0]) = ($_[0] =~ /^([A-Za-z]{8,16})\s*$/) }
sub is_valid_phone { ($_[0]) = ($_[0] =~ /^([0-9]{8})$/) }
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
Generalizing:
#!/usr/bin/perl
use strict;
use warnings;
my %validators = (
id => make_validator( qr/^([0-9]{3})$/ ),
name => make_validator( qr/^([A-Za-z]{8,16})\s*$/ ),
phone => make_validator( qr/^([0-9]{8})$/ ),
);
INPUT:
while ( <DATA> ) {
chomp;
last unless /\S/;
my %fields;
@fields{qw(id name phone)} = unpack 'A3A16A8';
for my $field ( keys %fields ) {
unless ( $validators{$field}->($fields{$field}) ) {
warn "Invalid line: $_\n";
next INPUT;
}
}
print "$_ : $fields{$_}\n" for qw(id name phone);
}
sub make_validator {
my ($re) = @_;
return sub { ($_[0]) = ($_[0] =~ $re) };
}
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
You can use lookahead: ^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$
Testing:
123jackysee 45678887 Match
456charliewong 32145644 Match
789jop 12345678 No Match - username too short
999abcdefghijabcde12345678 No Match - username 'column' is less than 16 characters
999abcdefghijabcdef12345678 Match
999abcdefghijabcdefg12345678 No Match - username column is more than 16 characters

Checking version number validity with Perl

I'm struggling with checking the validity of version numbers in Perl. Correct version number is like this:
Starts with either v or ver,
After that a number, if it is 0, then no other numbers are allowed in this part (e.g. 10, 3993 and 0 are ok, 01 is not),
After that a full stop, a number, full stop, number, full stop and number.
I.e. a valid version number could look something like v0.123.45.678 or ver18.493.039.1.
I came up with the following regexp:
if ($ver_string !~ m/^v(er)?(0{1}\.)|([1-9]+\d*\.)\d+\.\d+\.\d+/)
{
#print error
}
But this does not work, because a version number like verer01.34.56.78 gets accepted. I can't understand this, I know Perl tends to be greedy, but shouldn't ^v(er)? make sure that there can be a max of one "er"? And why doesn't 0{1}. match only "0.", instead of accepting "01." as well?
This regex actually caught the "erer" thing: m/^v(er)?[0-9.]+/ but I can't see where I allow it in my attempt.
Your problem is that the or - | - you are using is splitting the whole pattern in two. A | will scope to brackets or the end of an expression rather than just on the two neighbouring items.
You need to add some extra brackets to show which part of the expression you want or-ed. So a first step to fixing your pattern would be:
^v(er)?((0{1}\.)|([1-9]+\d*\.))\d+\.\d+\.\d+
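To see the effect of the top-level `|`, the two patterns can be compared on the string from the question. The sketch below uses Python's `re.search` as a stand-in for Perl's `=~`. The variable names are made up for this example.

```python
import re

# The pattern from the question: the | splits it into two independent halves.
original = r"^v(er)?(0{1}\.)|([1-9]+\d*\.)\d+\.\d+\.\d+"
# The same pattern with the alternation wrapped in its own group.
grouped = r"^v(er)?((0{1}\.)|([1-9]+\d*\.))\d+\.\d+\.\d+"

bad = "verer01.34.56.78"

m = re.search(original, bad)
print(m.group(0) if m else None)  # what the unanchored second half matched
print(re.search(grouped, bad))
```

With the original pattern the second alternative is not anchored at all, so it finds `1.34.56.78` in the middle of the string. Once the alternation is grouped, the `^v(er)?` prefix applies to both branches and the string is rejected.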
You also want to put a $ at the end to ensure there are no spurious characters at the end of the version number.
Also, putting {1} is unnecessary, as it means the previous item exactly once, which is the default. However, you could use {3} at the end of your pattern, as you want three dot-digit groups at the end.
Similarly, you don't need the + after the [1-9], as other digits will be grabbed by the \d*.
And we can also remove the unnecessary brackets.
So you can simplify your pattern to the following:
^v(er)?(0|[1-9]\d*)(\.\d+){3}$
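As a quick check, here is a Python sketch of the simplified pattern with the dots escaped. It assumes ASCII digits only, so `[0-9]` replaces `\d`, and uses `fullmatch` for the anchoring. `VERSION_RE` and `version_ok` are names chosen for this example.

```python
import re

# v or ver, then 0 or a number without a leading zero, then three ".number" groups.
VERSION_RE = re.compile(r"v(?:er)?(?:0|[1-9][0-9]*)(?:\.[0-9]+){3}")


def version_ok(s):
    return VERSION_RE.fullmatch(s) is not None


for v in ["v0.123.45.678", "ver18.493.039.1", "verer01.34.56.78", "v01.5.5.5"]:
    print(v, version_ok(v))
```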
You could do it with a single regexp, or you could do it in 2 steps, the second step being to check that the first number doesn't start with a 0.
BTW, I tend to use [0-9] instead of \d for numbers, there are lots of characters that are classified as numbers in the Unicode standard (and thus in Perl) that you may not want to deal with.
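The same distinction exists in other Unicode-aware engines. As an illustration only (this is Python's `re` on `str` patterns, not Perl), `\d` accepts Arabic-Indic digits while `[0-9]` does not:

```python
import re

arabic_indic = "\u0661\u0662\u0663"  # the digits one, two, three in Arabic-Indic script

# \d matches any Unicode decimal digit for str patterns.
print(bool(re.fullmatch(r"\d+", arabic_indic)))
# [0-9] is limited to the ten ASCII digits.
print(bool(re.fullmatch(r"[0-9]+", arabic_indic)))
# re.ASCII narrows \d back down to [0-9].
print(bool(re.fullmatch(r"\d+", arabic_indic, re.ASCII)))
```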
Here is a sample code, the version_ok sub is where everything happens.
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 7;
while( <DATA>)
{ chomp;
my( $version, $expected)= split /\s*=>\s*/;
is( version_ok( $version), $expected, $version);
}
sub version_ok
{ my( $version)=@_;
if( $version=~ m{^v(?:er)? # starts with v or ver
([0-9]+) # the first number
(?:\.[0-9]+){3} # 3 times dot number
$}x) # end
{ if( $1 =~ m{^0[0-9]})
{ return 0; } # no good: first number starts with 0
else
{ return 1; }
}
else
{ return 0; }
}
__DATA__
v0.123.45.678 => 1
ver18.493.039.1 => 1
verer01.34.56.78 => 0
v01.5.5.5 => 0
ver101.5.5.5 => 1
ver101.5.5. => 0
ver101.5.5 => 0
The regex may work for your test cases, but the CPAN module Perl::Version would seem like the best option, with two caveats:
haven't tried it myself
seems like the latest module release was in 2007 - kind of makes it a recursive problem