Regex to check fix length field with packed space - regex

Say I have a text file to parse, which contains some fixed length content:
123jackysee 45678887
456charliewong 32145644
<3><------16------><--8---> # Not part of the data.
The first three characters is ID, then 16 characters user name, then 8 digit phone number.
I would like to write a regular expression to match and verify the input for each line, the one I come up with:
(\d{3})([A-Za-z ]{16})(\d{8})
The user name should contains 8-16 characters. But ([A-Za-z ]{16}) would also match null value or space. I think of ([A-Za-z]{8,16} {0,8}) but it would detect more than 16 characters. Any suggestions?

No, no, no, no! :-)
Why do people insist on trying to pack so much functionality into a single RE or SQL statement?
My suggestion, do something like:
Ensure the length is 27.
Extract the three components into separate strings (0-2, 3-18, 19-26).
Check that the first matches "\d{3}".
Check that the second matches "[A-Za-z]{8,} *".
Check that the third matches "\d{8}".
If you want the entire check to fit on one line of source code, put it in a function, isValidLine(), and call it.
Even something like this would do the trick:
def isValidLine(s):
if s.len() != 27 return false
return s.match("^\d{3}[A-za-z]{8,} *\d{8}$"):
Don't be fooled into thinking that's clean Python code, it's actually PaxLang, my own proprietary pseudo-code. Hopefully, it's clear enough, the first line checks to see that the length is 27, the second that it matches the given RE.
The middle field is automatically 16 characters total due to the first line and the fact that the other two fields are fixed-length in the RE. The RE also ensures that it's eight or more alphas followed by the right number of spaces.
To do this sort of thing with a single RE would be some monstrosity like:
^\d{3}(([A-za-z]{8} {8})
|([A-za-z]{9} {7})
|([A-za-z]{10} {6})
|([A-za-z]{11} {5})
|([A-za-z]{12} )
|([A-za-z]{13} )
|([A-za-z]{14} )
|([A-za-z]{15} )
|([A-za-z]{16}))
\d{8}$
You could do it by ensuring it passes two separate REs:
^\d{3}[A-za-z]{8,} *\d{8}$
^.{27}$
but, since that last one is simply a length check, it's no different to the isValidLine() above.

I would use the regex you suggested with a small addition:
(\d{3})([A-Za-z]{3,16} {0,13})(\d{8})
which will match things that have a non-whitespace username but still allow space padding. The only addition is that you would then have to check the length of each input to verify the correct number of characters.

Hmm... Depending on the exact version of Regex you're running, consider:
(?P<id>\d{3})(?=[A-Za-z\s]{16}\d)(?P<username>[A-Za-z]{8,16})\s*(?P<phone>\d{8})
Note 100% sure this will work, and I've used the whitespace escape char instead of an actual space - I get nervous with just the space character myself, but you may want to be more restrictive.
See if it works. I'm only intermediate with RegEx myself, so I might be in error.
Check out the named groups syntax for your version of RegEx a) exists and b) matches the standard I've used above.
EDIT:
Just to expand what I'm trying to do (sorry to make your eyes bleed, Pax!) for those without a lot of RegEx experience:
(?P<id>\d{3})
This will try to match a named capture group - 'id' - that is three digits in length. Most versions of RegEx let you use named capture groups to extract the values you matched against. This lets you do validation and data capture at the same time. Different versions of RegEx have slightly different syntaxes for this - check out http://www.regular-expressions.info/named.html for more detail regarding your particular implementation.
(?=[A-Za-z\s]{16}\d)
The ?= is a lookahead operator. This looks ahead for the next sixteen characters, and will return true if they are all letters or whitespace characters AND are followed by a digit. The lookahead operator is zero length, so it doesn't actually return anything. Your RegEx string keeps going from the point the Lookahead started. Check out http://www.regular-expressions.info/lookaround.html for more detail on lookahead.
(?P<username>[A-Za-z]{8,16})\s*
If the lookahead passes, then we keep counting from the fourth character in. We want to find eight-to-sixteen characters, followed by zero or more whitespaces. The 'or more' is actually safe, as we've already made sure in the lookahead that there can't be more than sixteen characters in total before the next digit.
Finally,
(?P<phone>\d{8})
This should check the eight-digit phone number.
I'm a bit nervous that this won't exactly work - your version of RegEx may not support the named group syntax or the lookahead syntax that I'm used to.
I'm also a bit nervous that this Regex will successfully match an empty string. Different versions of Regex handle empty strings differently.
You may also want to consider anchoring this Regex between a ^ and $ to ensure you're matching against the whole line, and not just part of a bigger line.

Assuming you mean perl regex and if you allow '_' in the username:
perl -ne 'exit 1 unless /(\d{3})(\w{8,16})\s+(\d{8})/ && length == 28'

#OP,not every problem needs a regex. your problem is pretty simple to check. depending on what language you are using, they would have some sort of built in string functions. use them.
the following minimal example is done in Python.
import sys
for line in open("file"):
line=line.strip()
# check first 3 char for digit
if not line[0:3].isdigit(): sys.exit()
# check length of username.
if len(line[3:18]) <8 or len(line[3:18]) > 16: sys.exit()
# check phone number length and whether they are digits.
if len(line[19:26]) == 8 and not line[19:26].isdigit(): sys.exit()
print line

I also don't think you should try to pack all the functionality into a single regex. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my #fields = split;
if (
( my ($id, $name) = $fields[0] =~ /^([0-9]{3})([A-Za-z]{8,16})$/ )
and ( my ($phone) = $fields[1] =~ /^([0-9]{8})$/ )
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
And here is another way:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my ($id, $name, $phone) = unpack 'A3A16A8';
if ( is_valid_id($id)
and is_valid_name($name)
and is_valid_phone($phone)
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
sub is_valid_id { ($_[0]) = ($_[0] =~ /^([0-9]{3})$/) }
sub is_valid_name { ($_[0]) = ($_[0] =~ /^([A-Za-z]{8,16})\s*$/) }
sub is_valid_phone { ($_[0]) = ($_[0] =~ /^([0-9]{8})$/) }
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
Generalizing:
#!/usr/bin/perl
use strict;
use warnings;
my %validators = (
id => make_validator( qr/^([0-9]{3})$/ ),
name => make_validator( qr/^([A-Za-z]{8,16})\s*$/ ),
phone => make_validator( qr/^([0-9]{8})$/ ),
);
INPUT:
while ( <DATA> ) {
chomp;
last unless /\S/;
my %fields;
#fields{qw(id name phone)} = unpack 'A3A16A8';
for my $field ( keys %fields ) {
unless ( $validators{$field}->($fields{$field}) ) {
warn "Invalid line: $_\n";
next INPUT;
}
}
print "$_ : $fields{$_}\n" for qw(id name phone);
}
sub make_validator {
my ($re) = #_;
return sub { ($_[0]) = ($_[0] =~ $re) };
}
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678

You can use lookahead: ^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$
Testing:
123jackysee 45678887 Match
456charliewong 32145644 Match
789jop 12345678 No Match - username too short
999abcdefghijabcde12345678 No Match - username 'column' is less that 16 characters
999abcdefghijabcdef12345678 Match
999abcdefghijabcdefg12345678 No Match - username column more that 16 characters

Related

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would the first song title Ai Wo Quing shut up and not the next one because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
print $line;
}
else {
print "Non-english\n";
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically - the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
chomp;
my #fields = split /<SEP>/;
say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
Also, the word character class \w matches exactly [a-zA-z0-9_] (and \W matches the complement) so you can rewrite your if statement like this
if ( $line =~ /\W/ ) {
print "Non-English\n";
}
else {
print $line;
}

Sensethising domains

So I'm trying to put all numbered domains into on element of a hash doing this:
### Domanis ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; #this regex get the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently print this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two 1-3 long digits separated by a dot
but I mean if you only need to separate those domains that are not numbered you could just check the last character in the domain whether it is a letter, because TLDs cannot contain numbers, so you would do something like
/\w$/ -- if returns true, it is not a numbered domain (providing you've stripped spaces and new lines)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+.\w+)$/)
{
#the domain is assigned to the variable $domain
} else {
#it is a number domain
}
Take what it currently puts, and use the regex:
/\d+\.\d+/
if it matches this, then its a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. This is the sub-pattern to /\d+\.\d+/, so both regular expressions will match the same thing.
The
$dom =~ /(\w+\.\w+)$/;
Is matching anywhere in the entire string $dom where there are two letters, digits. or underscores with a decimal between them. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This is setting $foo to whatever $dom is matching, but then immediately resets $foo to OTHER, and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.

Using perl Regular expressions I want to make sure a number comes in order

I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order, but if 5 comes before at all say 322195458900023 is the number then where 545 is a problem 5 always have to come right after 4.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
(\d) # match a number, save it in $1
(\d) # match another number, save it in $2
(??{ # start postponed regex
$1 < $2 # if $1 is smaller than $2
? "" # then return the empty string (i.e. succeed)
: "(*FAIL)" # else return the *FAIL verb
}) # close postponed regex
}x; # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: The string is composed from any number (zero or more) characters that are not a five, or a five that is followed by either a character that is not a four or the five is the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
You just need to match all 5 and search fails, where preceded is not 4:
if( $str =~ /(?<!4)5/ ) {
#Fail
}

How to replace lookahead in regex?

I wrote a regex that validates an input string. It must have a minimum length of 8 chars (composed by alphanumeric and punctuation chars) and it must have at least one digit and one alphabetic char. So I've come up with the regex:
^(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9-,._;:]{8,}$
Now I have to rewrite this regex in a language that doesn't support lookahead, how should I rewrite that regex?
Valid inputs are:
1foo,bar
foo,bar1
1fooobar
foooobar1
fooo11bar
1234x567
a1234567
Invalid inputs:
fooo,bar
1234-567
.1234567
There are two approaches. One is to compose a single expression which handles all possible alternatives:
^[a-zA-Z][0-9][a-zA-Z0-9-,._;:]{6,}$
|
^[a-zA-Z][a-zA-Z0-9-,._;:][0-9][a-zA-Z0-9-,._;:]{5,}$
|
^[a-zA-Z][a-zA-Z0-9-,._;:]{2}[0-9][a-zA-Z0-9-,._;:]{4,}$
etc. This is a combinatoric nightmare, but it would work.
A much simpler approach is to validate the same string twice using two expressions:
^[a-zA-Z0-9-,._;:]{8,}$ # check length and permitted characters
and
[a-zA-Z].*[0-9]|[0-9].*[a-zA-Z] # check required characters
EDIT: #briandfoy correctly points out that it will be more efficient to search for each required character separately:
[a-zA-Z] # check for required alpha
and
[0-9] # check for required digit
This question was original tagged as perl, and that's how I answered it. For the oracle stuff, I have no idea how you'd do the same thing. However, I'd try to validate this stuff before it got that far.
I wouldn't do this in one regular expression. When you decide to change the rules, you'll have the same amount of work to craft the new regular expression. I wouldn't use lookarounds for this even if they were available since I wouldn't want to tolerate all the backtracking.
This looks like it's a lot of code, but the part that addresses your problem is just the subroutine. It has very simple patterns. When the password rules change, you add or delete patterns. It might be worth it to use study, but I didn't investigate that:
use v5.10;
use strict;
use Test::More;
my #valids = qw(
1foo,bar
foo,bar1
1fooobar
foooobar1
fooo11bar
);
my #invalids = qw(
fooo,bar
short
nodigitbutlong
12345678
,,,,,,,,
);
sub is_good_password {
my( $password ) = #_;
state $rules = [
qr/\A[A-Z0-9,._;:-]{8,}\z/i,
qr/[0-9]/,
qr/[A-Z]/i,
];
foreach my $rule ( #$rules ) {
return 0 unless $password =~ $rule;
}
return 1;
}
foreach my $valid ( #valids ) {
ok( is_good_password( $valid ), "Password $valid is valid" );
}
foreach my $invalid ( #invalids ) {
ok( ! is_good_password( $invalid ), "Password $invalid is invalid" );
}
done_testing();
The best I can come up with right now is
(.*[a-zA-Z].*[0-9].*|.*[0-9].*[a-zA-Z].*)
But you have to check the length of the string separately.
I would play around with these ideas to get the best performance:
should be faster for short valid inputs, but will be slower (backtrack) for inputs like "0a000000000000000000" or "aaaaaaaaaaaaaaa":
regexp_like(regexp_substr(input_string, '^[a-zA-Z0-9_,.;:-]{8,}$'),
'[0-9].*[a-zA-Z]|[a-zA-Z].*[0-9]')
should be faster if there are a lot of invalid inputs (don't miss [^...] on the 2nd line):
(length(input_string) >= 8 and
not regexp_like(input_string, '[^a-zA-Z0-9_,.;:-]') and
regexp_like(input_string, '[a-zA-Z]') and
regexp_like(input_string, '[0-9]'))

How can I parse a phone number in Perl?

I am trying to grab any digits in front of a known line number of a phone, if they exist (in Perl). There will be no dashes, only digits.
For example, say I know the line number will always be 8675309. 8675309 may or may not have leading digits, if it does I want to capture them. There is not really a limit on the number of leading digits.
$input $digits $number
'8675309' '' '8675309'
'8008675309' '800' '8675309'
'18888675309' '1888' '8675309'
'18675309' '1' '8675309'
'86753091' not a match
/8675309$/ this will match how to capture the pre-digits in one regex?
Some regexes work better backwards than forwards. So sometimes it is useful to use sexeger, rather than regexes.
my $pn = '18008675309';
reverse($pn) =~ /^9035768(\d*)/;
my $got = reverse $1;
The regex is cleaner and avoids a lot of back tracking at the cost of some fummery with reversing the input and captured values.
The backtracking gain is smaller in this case than it would be if you had a general phone number extraction regex:
Regex: /^(\d*)\d{7}$/
Sexeger: /^\d{7}(\d*)/
There is a whole class of problems where this technique is useful. For more info see the sexeger post on Perlmonks.
my($digits,$number);
if ($input =~ /^(\d*)(8675309)$/) {
($digits,$number) = ($1,$2);
}
The * quantifier is greedy, but that means it matches as much as possible while still allowing a match. So initially, yes, \d* tries to gobble up all the digits in $number, but it reluctantly gives up character-by-character what it's matched until the whole pattern matches successfully.
Another approach is to chop off the tail:
(my $digits = $input) =~ s/8675309$//;
You could do the same without using a regular expression:
my $digits = $input;
substr($digits, -7) = "";
The above, at least with perl-5.10-1, could even be condensed to
substr(my $digits = $input, -7) = "";
The regex special variables $` and $& are another way of grabbing those pieces of information. They hold the contents of the data preceding the match and the match itself respectively.
if ( /8675309$/ )
{
printf( "%s,%s,%s\n", $_, $`, $& );
}
else
{
printf( "%s,Not a match\n", $_ );
}
There's a Perl package that deals with at least UK and US phone numbers.
It's called Number::Phone and the code is somewhere on the cpan.org site.
How about /(\d)?(8675309)/?
UPDATE:
whoops that should haev been /(\d*)(8675309)/
I might not understand the problem. Why is there a difference between the first and fourth examples:
'8675309' '' '8675309'
...
'8675309' '1' '8675309'
If all you want is to separate the last seven digits from everything else, you could have said it that way rather than provide confusing examples. A regex for that would be:
/(\d*)(\d{7,7})$/
If you weren't just providing a hypothetical number, and really are only looking for lines with '8675309' (seems strange), replace the '\d{7,7}' with '8675309'.