Get the second string of the URI with Perl regex - regex

I need to get the second part of the URI, the possible URI are:
/api/application/v1/method
/web/application/v1/method
I can get "application" using:
([^\/api]\w*)
and
([^\/web]\w*)
But I know this is not the best approach. What would be a good way?
Thanks!
Edit: thank you all for the input. The goal was to set the second part of the URI into a header in Apache with rewrite rules.

A general regex (Perl or PCRE syntax) solution would be:
^/[^/]+/([^/]+)
Each section is delimited with /, so just capture as many non-/ characters as there are.
This is preferable to non-greedy regexes because it does not need to backtrack, and allows for whatever else the sections may contain, which can easily contain non-word characters such as - that won't be matched by \w.
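As a minimal sketch of how that pattern could be used in Perl (the helper name second_segment is made up for illustration, it is not from the question):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the second path segment, or undef if there is none.
sub second_segment {
    my ($path) = @_;
    my ($segment) = $path =~ m{^/[^/]+/([^/]+)};
    return $segment;
}

for my $path ('/api/application/v1/method', '/web/my-app/v1/method', '/api') {
    my $segment = second_segment($path);
    printf "%-28s => %s\n", $path, defined $segment ? $segment : '(no second segment)';
}
```

The my-app path is there to show that a segment with a - in it is still captured whole.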

There are many ways to do this and I am not sure which would be best, but it could be as simple as:
\/(.+?)\/(.+?)\/.*
where the desired output is in the second capturing group, $2.
Demo 1
Example
#!/usr/bin/perl -w
use strict;
use warnings;
use feature qw( say );
main();
sub main{
my $string = '/api/application/v1/method
/web/application/v1/method';
my $pattern = '\/(.+?)\/(.+?)\/.*';
my $match = replace($pattern, '$2', $string);
say $match , " is a match 💚💚💚 ";
}
sub replace {
my ($pattern, $replacement, $string) = @_;
$string =~s/$pattern/$replacement/gee;
return $string;
}
Output
application
application is a match 💚💚💚
Advice
zdim advises that:
A legitimate approach, notes:
(1) there is no need for the trailing .*
(2) Need /|$ (not just /), in case the path finishes without / (to
terminate the non-greedy pattern at the end of string, if there is no
/)
(3) note though that /ee can be vulnerable (even just to errors),
since the second evaluation (e) will run code if the first evaluation
results in code. And it may be difficult to ensure that that is always
done under full control. More to the point, for this purpose there is
no reason to run a substitution --- just match and capture is enough.
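A rough sketch of what that advice amounts to: a plain match with a capture, terminated by / or end of string, with no substitution and no /ee. The helper name second_part is hypothetical, and only the part of interest is captured here, so it lands in the first group rather than $2.

```perl
use strict;
use warnings;

# Hypothetical helper: match and capture only, no s///ee involved.
sub second_part {
    my ($path) = @_;
    my ($second) = $path =~ m{/.+?/(.+?)(?:/|$)};
    return $second;
}

for my $path ('/api/application/v1/method', '/web/application') {
    print "$path => ", second_part($path), "\n";
}
```

The second path has no trailing /, which is the case the /|$ alternative is there for.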

With all the regex answers, which is what was explicitly asked for, I'd like to bring up other approaches.
These also parse only a (URI-style) path, like the regex ones, and return the second directory.
The most basic and efficient one is to just split the string on /:
my $dir = ( split /\//, $path )[2];
The split returns '' first (before the first /) thus we need the third element. (Note that we can use an alternate delimiter for the separator pattern, it being regex: split m{/}, $path.)
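To make the leading empty field visible, here is a small sketch that prints everything split returns for one of the question's paths:

```perl
use strict;
use warnings;

my $path  = '/api/application/v1/method';
my @parts = split m{/}, $path;

# The first element is the empty string before the leading "/".
print join(', ', map { "'$_'" } @parts), "\n";
print "second directory: $parts[2]\n";
```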
Use appropriate modules, for example URI
use URI;
my $dir = ( URI->new($path)->path_segments )[2];
or Mojo::Path
use Mojo::Path;
my $dir = Mojo::Path->new($path)->parts->[1];
What to use depends on details of what you do -- if you've got any other work with URLs and web then you clearly want modules for that; otherwise they may (or may not) be an overkill.
I've benchmarked these for a sanity check of what one is paying with modules.
The split either beats regex by up to 10-15% (the regex using negated character class and the one based on non-greedy .+? come around the same), or is about the same with them. They are faster than Mojo by about 30%, and only URI lags seriously, by a factor of 5 behind Mojo.
That's for paths typical for real-life URLs, with a handful of short components. With only two very long strings (10k chars), Mojo::Path (surprisingly for me) is a factor of six ahead of split (!), which is ahead of character-class regex by more than an order of magnitude.
The negated-character-class regex for such long strings beats the non-greedy (.+?) one by a factor of 3, good to know in its own right.
In all this the URI and Mojo objects were created once, ahead of time.
Benchmark code. I'd like to note that the details of these timings are far less important than the structure and quality of code.
use warnings;
use strict;
use feature 'say';
use URI;
use Mojo::Path;
use Benchmark qw(cmpthese);
my $runfor = shift // 3; #/
#my $path = '/' . 'a' x 10_000 . '/' . 'X' x 10_000;
my $path = q(/api/app/v1/method);
my $uri = URI->new($path);
my $mojo = Mojo::Path->new($path);
sub neg_cc {
my ($dir) = $path =~ m{ [^/]+ / ([^/]+) }x; return $dir; #/
}
sub non_greedy {
my ($dir) = $path =~ m{ .+? / (.+?) (?:/|$) }x; return $dir; #/
}
sub URI_path {
my $dir = ( $uri->path_segments )[2]; return $dir;
}
sub Mojo_path {
my $dir = $mojo->parts->[1]; return $dir;
}
sub just_split {
my $dir = ( split /\//, $path )[2]; return $dir;
}
cmpthese( -$runfor, {
neg_cc => sub { neg_cc($path) },
non_greedy => sub { non_greedy($path) },
just_split => sub { just_split($path) },
URI_path => sub { URI_path($path) },
Mojo_path => sub { Mojo_path($path) },
});
With a (10-second) run this prints, on a laptop with v5.16
                Rate   URI_path  Mojo_path non_greedy     neg_cc just_split
URI_path    146731/s         --       -82%       -87%       -87%       -89%
Mojo_path   834297/s       469%         --       -24%       -28%       -36%
non_greedy 1098243/s       648%        32%         --        -5%       -16%
neg_cc     1158137/s       689%        39%         5%         --       -11%
just_split 1308227/s       792%        57%        19%        13%         --
One should keep in mind that the overhead of the function-call is very large for such a simple job, and in spite of Benchmark's work these numbers are probably best taken as a cursory guide.

Your pattern ([^\/api]\w*) consists of a capturing group containing a negated character class, which first matches exactly one character that is not /, a, p or i. See demo.
After that, \w* matches zero or more word characters. The pattern could, for example, match just a single character that is not listed in the character class.
What you might do instead is match the prefix explicitly and capture \w+:
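A small sketch of that effect, using the question's two patterns on the question's two paths (first_capture is a made-up helper). For the /api path the capture cannot start at the a or p of application, since both are excluded by the class, so it starts at the l. The /web pattern only works because application contains no w, e or b before its first allowed character. The last line uses an explicit alternation for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture group of $regex on $path.
sub first_capture {
    my ($path, $regex) = @_;
    my ($capture) = $path =~ $regex;
    return $capture;
}

print first_capture('/web/application/v1/method', qr/([^\/web]\w*)/), "\n";
print first_capture('/api/application/v1/method', qr/([^\/api]\w*)/), "\n";
print first_capture('/api/application/v1/method', qr{^/(?:api|web)/(\w+)/v1/method}), "\n";
```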
^/(?:api|web)/(\w+)/v1/method
Explanation
^ Start of string
(?:api|web) Non capturing group with alternation. Match either api or web
(\w+) Capturing group 1, match 1+ word chars
/v1/method Match literally as in your example data.
Regex demo

Related

Regular expression puzzler

I have been doing regular expressions for 25+ years but I don't understand why this regex does not match (using Perl syntax):
"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'
Can someone help solve that riddle?
The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
The drawback to this, as well as to consuming the whole such substring, is the literal spelling out of all three subpatterns; what if the number needs to be decided dynamically? Or when it's twelve? The pattern can be built at runtime of course. In Perl, one way is
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
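For instance, the runtime-built lookahead could be wrapped up like this (a sketch only; at_least_n is a name I made up, and the character class is passed in as a plain string):

```perl
use strict;
use warnings;

# Hypothetical helper: builds a lookahead that requires $count characters
# from $class to appear, in any positions, in the rest of the string.
sub at_least_n {
    my ($class, $count) = @_;
    my $pattern = '(?=' . join('.*', ($class) x $count) . ')';
    return qr/$pattern/;
}

my $three = at_least_n('[iny]', 3);
for my $word (qw(unify unit)) {
    printf "%-5s %s\n", $word, $word =~ $three ? 'match' : 'no match';
}
```

Here unify has n, i and y scattered around, while unit only has two characters from the class.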
For the sake of performance, for long strings and many repetitions, make that .* non-greedy
(?=[iny].*?[iny].*?[iny])
(when forming the pattern dynamically join with .*?)
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);
my $str = 'aa'
. join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
. 'a'x($n+1)
. 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
my ($s, $pla) = @_;
return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}
cmpthese(-$runfor, {
greedy => sub { match_repeated($str, $pat_greedy) },
non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
              Rate     greedy non_greedy
greedy      56.3/s         --      -100%
non_greedy 90169/s    159926%         --
So ... some 1600 times better non-greedy. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in greedy case); so there's a lot going on there. With default 2, so for a short string and a simpler pattern, the difference is 10%.
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
               Rate     greedy non_greedy
greedy       56.5/s         --      -100%
non_greedy 171949/s    304117%         --
The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing three characters three times, with replacement, means there are 3^3 = 27 matching substrings:
iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
To match non-adjacent letters you can use one of these:
[iny].*[iny].*[iny]
[iny](.*[iny]){2}
([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
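A quick sketch to check the three alternatives side by side, next to the original {3} pattern (the hash keys are just labels I picked):

```perl
use strict;
use warnings;

my %patterns = (
    spelled_out => qr/[iny].*[iny].*[iny]/,
    group_twice => qr/[iny](.*[iny]){2}/,
    group_three => qr/([iny].*){3}/,
);

for my $name (sort keys %patterns) {
    for my $word (qw(unify unit)) {
        printf "%-12s %-6s %s\n", $name, $word,
            $word =~ $patterns{$name} ? 'match' : 'no match';
    }
}

# The original pattern still needs three adjacent characters from the set.
print "unify =~ /[iny]{3}/: ", ('unify' =~ /[iny]{3}/ ? 'match' : 'no match'), "\n";
print "tiny  =~ /[iny]{3}/: ", ('tiny'  =~ /[iny]{3}/ ? 'match' : 'no match'), "\n";
```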
That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.
Perhaps you meant to use [inf] or [ify]?
Looks like you are looking for 3 consecutive letters, so yours should not match
[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
Hope that helps somewhat :)
The {3} quantifier means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive as they are separated by other characters in your string.
It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.

Perl: Method to convert regexp with greedy quantifiers to non-greedy

My user gives a regexp with quantifiers that default to being greedy. He can give any valid regexp. So the solution will have to deal with anything that the user can throw at me.
How do I convert the regexp so any greedy quantifier will be non-greedy?
Does Perl have a (?...:regexp) construct that forces the greedy default for quantifiers into a non-greedy one?
If not: Is there a different way I can force a regexp with greedy quantifiers into a non-greedy one?
E.g., a user may enter:
.*
[.*]
[.*]{4,10}
[.*{4,10}]{4,10}
While these four examples may look similar, they have completely different meanings.
If you simply add ? after every */} you will change the character sets in the last three examples.
Instead they should be changed to/behave like:
.*?
[.*]
[.*]{4,10}?
[.*{4,10}]{4,10}?
but where the matched string is the minimal match, and not first-match, that Perl will default to:
$a="aab";
$a=~/(a.*?b)$/;
# Matches aab, not ab
print $1;
But given the non-greedy regexp, the minimal match can probably be obtained by prepending .*:
$a="aab";
$a=~/.*(a.*?b)$/;
# Matches ab
print $1;
"Greediness" is not a property of the whole regular expression. It's a property of a quantifier.
It can be controlled for each quantifier separately. Just add a ? after a quantifier to make it non-greedy, e.g.
[a-z]*?
a{2,3}?
[0-9]??
\s+?
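To see the per-quantifier effect, here is a tiny sketch on a made-up string; the only difference between the two patterns is the ? after the *:

```perl
use strict;
use warnings;

my $text = '<a><b>';

my ($greedy) = $text =~ /<(.*)>/;    # .* takes as much as it can
my ($lazy)   = $text =~ /<(.*?)>/;   # .*? takes as little as it can

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```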
And no, there isn't any built-in way to turn the whole regex to some "default-non-greedy" mode. You need to parse the regex, detect all quantifiers and change them accordingly. Maybe there's a regex-parsing library somewhere on CPAN.
The closest I've found so far is the Regexp::Parser module. I didn't try it, but it looks like it could parse the regex, walk the tree, make appropriate changes and then build a modified regex. Please take a look.
You can use a state machine:
#!/usr/bin/perl
use strict;
use warnings;
my @regexes = ( ".*", "[.*]", "[.*]{4,10}", "[.*{4,10}]{4,10}" );
for (@regexes) {
print "give: $_\n";
my $ungreedy = make_ungreedy($_,0);
print "got: $ungreedy\n";
print "============================================\n"
}
sub make_ungreedy {
my $regex = shift;
my $class_state = 0;
my $escape_state = 0;
my $found = 0;
my $ungreedy = "";
for (split (//, $regex)) {
if ($found) {
$ungreedy .= "?" unless (/\?/);
$found = 0;
}
$ungreedy .= $_;
$escape_state = 0, next if ($escape_state);
$escape_state = 1, next if (/\\/);
$class_state = 1, next if (/\[/);
if ($class_state) {
$class_state = 0 if (/\]/);
next;
}
$found = 1 if (/[*}+]/);
}
$ungreedy .= '?' if $found;
return $ungreedy;
}

Extracting text with the inner and outer-most boundary with Perl regex

Given these two lines of text as examples:
my $line = "[cytokine]<ADJVNT-PROP-0> signaling, which have not [to]<PREP> date been shown [to]<PREP> be [[regulat]<EXP-V-0>ed]<EXP-PP-V-0>";
my $line2 = "[Human [papillomavirus]<VACC-PROP-0>]<VACC-PROP-0> genotype [31]<NUM> does not [express]<EXP-V-0> detectable [microRNA]<MIR-0> levels [during]<PREP> latent or productive virus replication.";
What I want to do is extract all the strings that are bounded by <VAC or <ADJ on the left and <EXP on the right.
On the left side, when there are multiple matches, extract the string from the innermost one
onwards to the right, up to the furthest <EXP.
For example, for the lines above I want a single regex that returns these:
Output1: signaling, which have not [to]<PREP> date been shown [to]<PREP> be [[regulat]<EXP-V-0>ed]
Output2: genotype [31]<NUM> does not [express]
Why doesn't this code work?
my @lines = ("[cytokine]<ADJVNT-PROP-0> signaling, which have not [to]<PREP> date been shown [to]<PREP> be [[regulat]<EXP-V-0>ed]<EXP-PP-V-0>",
"[Human [papillomavirus]<VACC-PROP-0>]<VACC-PROP-0> genotype [31]<NUM> does not [express]<EXP-V-0> detectable [microRNA]<MIR-0> levels [during]<PREP> latent or productive virus replication.");
my $count = 0;
foreach $line (@lines) {
$count++;
my ($sel) = $line =~ /<VAC|<ADJ.*>(.*)<EXP.*>/;
print "Output $count: $sel\n";
}
Executable here: https://eval.in/50772
What's the right way to do it?
First your OR operator has the wrong scope:
/<VAC|<ADJ.*>(.*)<EXP.*>/
This will match either <VAC or <ADJ.*>(.*)<EXP.*>. Wrap the needed part around non-capture groups:
/<(?:VAC|ADJ).*>(.*)<EXP.*>/
Then, I think it's safer to use some negated class here, and by that, I mean [^>]+ instead of .*:
/<(?:VAC|ADJ)[^>]+>(.*)<EXP[^>]+>/
Lastly, you don't seem to want any <VAC or <ADJ in the captures. So I added a negative lookahead (and made the (.*) lazy) in the (.*) part:
/<(?:VAC|ADJ)[^>]+>((?:(?!<(?:VAC|ADJ)).)*?)<EXP[^>]+>/
eval.in updated
If you want to get the <EXP part in (your first example), extend the capturing group:
/<(?:VAC|ADJ)[^>]+>((?:(?!<(?:VAC|ADJ)).)*?<EXP[^>]+>)/
eval.in for this part.
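Going back to the first point about the scope of |, here is a sketch on a shortened, made-up string (not the question's data) that shows what the ungrouped alternation actually does:

```perl
use strict;
use warnings;

my $text = '<VACC>x<EXP>';

# Ungrouped: the two alternatives are "<VAC" and "<ADJ.*>(.*)<EXP.*>".
# The first one matches on its own, so the capture group never takes part.
my $ungrouped_ok      = $text =~ /<VAC|<ADJ.*>(.*)<EXP.*>/ ? 1 : 0;
my $ungrouped_capture = $ungrouped_ok ? $1 : undef;

# Grouped: only VAC and ADJ alternate; the rest of the pattern always applies.
my ($grouped_capture) = $text =~ /<(?:VAC|ADJ).*>(.*)<EXP.*>/;

print "ungrouped matched: $ungrouped_ok, capture: ",
    (defined $ungrouped_capture ? $ungrouped_capture : 'undef'), "\n";
print "grouped capture: $grouped_capture\n";
```

So the ungrouped regex does match, it just matches nothing more than <VAC and leaves the capture undefined.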
Several problems:
| means "or", but you did not use any kind of parentheses, so it is <VAC or the rest. You in fact want <VAC or ADJ, then the rest.
.* is greedy. It matches as much as it can. If you want it to match less, use .*?.
The regex tries to match as soon as possible. If you want it to match later, prepend a greedy .*.
This should work:
/.*<(?:VAC|ADJ).*?>(.*)<EXP.*>/
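A sketch of the third point, again on a short made-up string rather than the question's data: without the leading .* the match starts at the first <VAC, with it the match starts at the last one.

```perl
use strict;
use warnings;

my $text = '<VAC1>a<VAC2>b<EXP>';

# Leftmost match: starts at the first <VAC.
my ($from_first) = $text =~ /<(?:VAC|ADJ).*?>(.*)<EXP.*>/;

# A greedy .* in front pushes the start to the last <VAC.
my ($from_last) = $text =~ /.*<(?:VAC|ADJ).*?>(.*)<EXP.*>/;

print "without leading .*: $from_first\n";
print "with leading .*:    $from_last\n";
```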

Get date in different format in Perl?

I need to get a date which could be in 3 possible formats.
11/20/2012
11.20.2012
11-20-2012
How could I achieve this in Perl? I'm trying a regex to get what I want. Here's my code.
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012"); #array values may vary in every run
foreach my $date (@dates){
$date =~ /[-.\/\d+]/g;
print "Date: $date \n";
}
I want the output to be as follows (the code above doesn't print anything):
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
Where am I wrong? Please Help. Thanks
Note: I want to achieve this without using any CPAN module as much as possible. I know there are a lot of CPAN modules that could provide what I want.
Your code almost produces what you want. I assume your input is a bit more complicated, or you have posted code that you are not actually running.
Either way, the problem is this
$date =~ /[-.\/\d+]/g;
First off, your plus quantifier is inside the character class; it should be after it. Second, it is just a pattern match: you need to use it in list context and store its return value:
my ($match) = $date =~ /[-.\/\d]+/g;
print "Date: $match\n";
Then it will return the first of the strings found that contains one or more of dash, period, slash or a number. Be aware that it will match other things as well, as it is a rather unstrict regex.
Why does it work? Because a pattern match in list context returns a list of the matches when the global /g modifier is used.
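A short sketch of both effects: the list returned by /g, and the over-matching that comes with such a loose pattern (the first input string is made up for illustration):

```perl
use strict;
use warnings;

# In list context, /g returns every match (there are no capture groups,
# so each element is a whole match).
my @found = 'v1.2 on 11/20/2012' =~ /[-.\/\d]+/g;
print "all matches: @found\n";

# Assigning to a one-element list keeps only the first match.
my ($first) = 'Mon 11/20/2012' =~ /[-.\/\d]+/g;
print "first match: $first\n";
```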
I highly recommend the DateTime::Format::Strptime module, which has a rich set of functionality. Think not only of parsing strings, but also of checking that the date is valid.
Why not search for the formats one at a time?
=~ m!(\d{2}/\d{2}/\d{4}|\d{4}\.\d{2}\.\d{2}|\d{2}-\d{2}-\d{4})!
should do the trick. Other than that, there's a module dealing with dates called DateTime.
Try matching the formats in turn. The regex below matches any of your permitted separators (/, ., or -) and then requires the same separator via backreference (\2 or \3). Otherwise, you have three possible separators times two possible positions for the year to make six alternatives in your pattern.
#! /usr/bin/env perl
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr<
\b # begin on word boundary
(
(?: [0-9][0-9] ([-/.]) [0-9][0-9] \2 [0-9][0-9][0-9][0-9])
| (?: [0-9][0-9][0-9][0-9] ([-/.]) [0-9][0-9] \3 [0-9][0-9])
)
\b # end on word boundary
>x;
foreach my $date (@dates) {
if (my($match) = $date =~ /$date_pattern/) {
print "Date: $match\n";
}
}
Output:
Date: 11/20/2012
Date: 2012.11.20
Date: 20-11-2012
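To isolate what the backreference buys, here is a cut-down sketch with only the year-last alternative (so the separator is group 2, as in the full pattern). The mixed-separator input is made up:

```perl
use strict;
use warnings;

# Simplified sketch: two digits, separator, two digits, the SAME separator,
# four digits. \2 refers back to whatever ([-/.]) matched.
my $same_separator = qr{\b(\d{2}([-/.])\d{2}\2\d{4})\b};

for my $candidate ('11/20/2012', '11-20-2012', '11/20-2012') {
    printf "%-10s %s\n", $candidate,
        $candidate =~ $same_separator ? 'accepted' : 'rejected';
}
```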
On my first try at the code above, I had \2 in the YYYY-MM-DD alternative where I should have had \3, which failed to match. To spare us counting parentheses, version 5.10.0 added named capture buffers.
Named Capture Buffers
It is now possible to name capturing parenthesis in a pattern and refer to the captured contents by name. The naming syntax is (?<NAME>....). It's possible to backreference to a named buffer with the \k<NAME> syntax. In code, the new magical hashes %+ and %- can be used to access the contents of the capture buffers.
Using this handy feature, the code above becomes
#! /usr/bin/env perl
use 5.10.0; # named capture buffers
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
\b # begin on word boundary
(?<date>
(?: [0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9][0-9][0-9])
| (?: [0-9][0-9][0-9][0-9] (?<sep>[-/.]) [0-9][0-9] \k{sep} [0-9][0-9])
)
\b # end on word boundary
!x;
foreach my $date (@dates) {
if ($date =~ /$date_pattern/) {
print "Date: $+{date}\n";
}
}
and produces the same output.
The code above still contains a lot of repetition. Using the (DEFINE) special case combined with named captures, we can make the pattern much nicer.
#! /usr/bin/env perl
use 5.10.0;
use strict;
use warnings;
#array values may vary in every run
my @dates = ("Mon 11/20/2012","2012.11.20","20-11-2012");
my $date_pattern = qr!
\b (?<date> (?&YMD) | (?&DMY)) \b
(?(DEFINE)
(?<SEP> [-/.])
(?<YYYY> [0-9][0-9][0-9][0-9])
(?<MM> [0-9][0-9])
(?<DD> [0-9][0-9])
(?<YMD> (?&YYYY) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&DD))
(?<DMY> (?&DD) (?<sep>(?&SEP)) (?&MM) \k<sep> (?&YYYY))
)
!x;
foreach my $date (@dates) {
if ($date =~ /$date_pattern/) {
print "Date: $+{date}\n";
}
}
Yes, the subpattern named DMY also matches dates in MDY form. For now it suffices, and you ain't gonna need it.

Regex to check fix length field with packed space

Say I have a text file to parse, which contains some fixed length content:
123jackysee        45678887
456charliewong     32145644
<3><------16------><--8---> # Not part of the data.
The first three characters is ID, then 16 characters user name, then 8 digit phone number.
I would like to write a regular expression to match and verify the input for each line, the one I come up with:
(\d{3})([A-Za-z ]{16})(\d{8})
The user name should contain 8-16 characters. But ([A-Za-z ]{16}) would also match an empty name consisting only of spaces. I thought of ([A-Za-z]{8,16} {0,8}) but it would accept more than 16 characters. Any suggestions?
No, no, no, no! :-)
Why do people insist on trying to pack so much functionality into a single RE or SQL statement?
My suggestion, do something like:
Ensure the length is 27.
Extract the three components into separate strings (0-2, 3-18, 19-26).
Check that the first matches "\d{3}".
Check that the second matches "[A-Za-z]{8,} *".
Check that the third matches "\d{8}".
If you want the entire check to fit on one line of source code, put it in a function, isValidLine(), and call it.
Even something like this would do the trick:
def isValidLine(s):
    if s.len() != 27 return false
    return s.match("^\d{3}[A-Za-z]{8,} *\d{8}$")
Don't be fooled into thinking that's clean Python code, it's actually PaxLang, my own proprietary pseudo-code. Hopefully, it's clear enough, the first line checks to see that the length is 27, the second that it matches the given RE.
The middle field is automatically 16 characters total due to the first line and the fact that the other two fields are fixed-length in the RE. The RE also ensures that it's eight or more alphas followed by the right number of spaces.
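For what it's worth, the step-by-step version listed earlier in this answer could be sketched in Perl roughly like this. The name is_valid_line is just the Perl spelling of isValidLine(), and I've used [0-9] and \A...\z where the steps say \d, which is a stricter assumption on my part:

```perl
use strict;
use warnings;

sub is_valid_line {
    my ($line) = @_;
    return 0 unless length($line) == 27;

    # Offsets 0-2, 3-18 and 19-26.
    my $id    = substr $line, 0, 3;
    my $name  = substr $line, 3, 16;
    my $phone = substr $line, 19, 8;

    return 0 unless $id    =~ /\A[0-9]{3}\z/;
    return 0 unless $name  =~ /\A[A-Za-z]{8,} *\z/;
    return 0 unless $phone =~ /\A[0-9]{8}\z/;
    return 1;
}

# sprintf '%-16s' pads the name with spaces to 16 characters.
for my $name (qw(jackysee charliewong jop)) {
    my $line = '123' . sprintf('%-16s', $name) . '45678887';
    print "$name: ", (is_valid_line($line) ? 'valid' : 'invalid'), "\n";
}
```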
To do this sort of thing with a single RE would be some monstrosity like:
^\d{3}(([A-Za-z]{8} {8})
|([A-Za-z]{9} {7})
|([A-Za-z]{10} {6})
|([A-Za-z]{11} {5})
|([A-Za-z]{12} {4})
|([A-Za-z]{13} {3})
|([A-Za-z]{14} {2})
|([A-Za-z]{15} {1})
|([A-Za-z]{16}))
\d{8}$
You could do it by ensuring it passes two separate REs:
^\d{3}[A-Za-z]{8,} *\d{8}$
^.{27}$
but, since that last one is simply a length check, it's no different to the isValidLine() above.
I would use the regex you suggested with a small addition:
(\d{3})([A-Za-z]{3,16} {0,13})(\d{8})
which will match things that have a non-whitespace username but still allow space padding. The only addition is that you would then have to check the length of each input to verify the correct number of characters.
Hmm... Depending on the exact version of Regex you're running, consider:
(?P<id>\d{3})(?=[A-Za-z\s]{16}\d)(?P<username>[A-Za-z]{8,16})\s*(?P<phone>\d{8})
Not 100% sure this will work, and I've used the whitespace escape char instead of an actual space - I get nervous with just the space character myself, but you may want to be more restrictive.
See if it works. I'm only intermediate with RegEx myself, so I might be in error.
Check that the named-group syntax a) exists in your version of RegEx and b) matches the standard I've used above.
EDIT:
Just to expand what I'm trying to do (sorry to make your eyes bleed, Pax!) for those without a lot of RegEx experience:
(?P<id>\d{3})
This will try to match a named capture group - 'id' - that is three digits in length. Most versions of RegEx let you use named capture groups to extract the values you matched against. This lets you do validation and data capture at the same time. Different versions of RegEx have slightly different syntaxes for this - check out http://www.regular-expressions.info/named.html for more detail regarding your particular implementation.
(?=[A-Za-z\s]{16}\d)
The ?= is a lookahead operator. This looks ahead for the next sixteen characters, and will return true if they are all letters or whitespace characters AND are followed by a digit. The lookahead operator is zero length, so it doesn't actually return anything. Your RegEx string keeps going from the point the Lookahead started. Check out http://www.regular-expressions.info/lookaround.html for more detail on lookahead.
(?P<username>[A-Za-z]{8,16})\s*
If the lookahead passes, then we keep counting from the fourth character in. We want to find eight-to-sixteen characters, followed by zero or more whitespaces. The 'or more' is actually safe, as we've already made sure in the lookahead that there can't be more than sixteen characters in total before the next digit.
Finally,
(?P<phone>\d{8})
This should check the eight-digit phone number.
I'm a bit nervous that this won't exactly work - your version of RegEx may not support the named group syntax or the lookahead syntax that I'm used to.
I'm also a bit nervous that this Regex will successfully match an empty string. Different versions of Regex handle empty strings differently.
You may also want to consider anchoring this Regex between a ^ and $ to ensure you're matching against the whole line, and not just part of a bigger line.
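Here is how that might look in Perl, as a sketch: the same pattern, with Perl's (?<name>...) spelling of named groups, anchored with \A and \z as suggested above. The helper name parse_line is made up, and the test lines are padded with sprintf.

```perl
use strict;
use warnings;

my $line_re = qr/\A(?<id>\d{3})(?=[A-Za-z\s]{16}\d)(?<username>[A-Za-z]{8,16})\s*(?<phone>\d{8})\z/;

# Hypothetical helper: returns a hash ref of the fields, or nothing on failure.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ $line_re;
    return { id => $+{id}, username => $+{username}, phone => $+{phone} };
}

for my $name (qw(jackysee jop)) {
    my $line   = '123' . sprintf('%-16s', $name) . '45678887';
    my $fields = parse_line($line);
    print "$name: ", ($fields ? "ok, phone $fields->{phone}" : 'rejected'), "\n";
}
```

With the anchors in place an empty string does not match, since the pattern needs at least the digits.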
Assuming you mean perl regex and if you allow '_' in the username:
perl -ne 'exit 1 unless /(\d{3})(\w{8,16})\s+(\d{8})/ && length == 28'
@OP, not every problem needs a regex. Your problem is pretty simple to check. Depending on what language you are using, there will be some sort of built-in string functions. Use them.
The following minimal example is done in Python.
import sys
for line in open("file"):
    line = line.rstrip("\n")
    # check that the first 3 characters are digits
    if not line[0:3].isdigit(): sys.exit()
    # check the username: a 16-character field, space padded, 8-16 letters
    name = line[3:19].rstrip()
    if len(name) < 8 or not name.isalpha(): sys.exit()
    # check the total length and that the 8-character phone field is all digits
    phone = line[19:27]
    if len(line) != 27 or not phone.isdigit(): sys.exit()
    print(line)
I also don't think you should try to pack all the functionality into a single regex. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my #fields = split;
if (
( my ($id, $name) = $fields[0] =~ /^([0-9]{3})([A-Za-z]{8,16})$/ )
and ( my ($phone) = $fields[1] =~ /^([0-9]{8})$/ )
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
And here is another way:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my ($id, $name, $phone) = unpack 'A3A16A8';
if ( is_valid_id($id)
and is_valid_name($name)
and is_valid_phone($phone)
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
sub is_valid_id { ($_[0]) = ($_[0] =~ /^([0-9]{3})$/) }
sub is_valid_name { ($_[0]) = ($_[0] =~ /^([A-Za-z]{8,16})\s*$/) }
sub is_valid_phone { ($_[0]) = ($_[0] =~ /^([0-9]{8})$/) }
__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
Generalizing:
#!/usr/bin/perl
use strict;
use warnings;
my %validators = (
id => make_validator( qr/^([0-9]{3})$/ ),
name => make_validator( qr/^([A-Za-z]{8,16})\s*$/ ),
phone => make_validator( qr/^([0-9]{8})$/ ),
);
INPUT:
while ( <DATA> ) {
chomp;
last unless /\S/;
my %fields;
@fields{qw(id name phone)} = unpack 'A3A16A8';
for my $field ( keys %fields ) {
unless ( $validators{$field}->($fields{$field}) ) {
warn "Invalid line: $_\n";
next INPUT;
}
}
print "$_ : $fields{$_}\n" for qw(id name phone);
}
sub make_validator {
my ($re) = @_;
return sub { ($_[0]) = ($_[0] =~ $re) };
}
__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
You can use lookahead: ^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$
Testing:
123jackysee        45678887 Match
456charliewong     32145644 Match
789jop             12345678 No Match - username too short
999abcdefghijabcde12345678 No Match - username 'column' is less than 16 characters
999abcdefghijabcdef12345678 Match
999abcdefghijabcdefg12345678 No Match - username column is more than 16 characters
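A sketch that runs the lookahead regex over those six cases in Perl. The first three lines are built with sprintf so the username column is space-padded to 16 characters, which is an assumption about how the test data was laid out:

```perl
use strict;
use warnings;

my $line_re = qr/^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$/;

my @cases = (
    '123' . sprintf('%-16s', 'jackysee')    . '45678887',
    '456' . sprintf('%-16s', 'charliewong') . '32145644',
    '789' . sprintf('%-16s', 'jop')         . '12345678',
    '999abcdefghijabcde12345678',
    '999abcdefghijabcdef12345678',
    '999abcdefghijabcdefg12345678',
);

for my $line (@cases) {
    printf "%-30s %s\n", "'$line'", $line =~ $line_re ? 'Match' : 'No Match';
}
```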