regular expressions in perl explanation - regex

I need help in understanding the below regular expressions.
Can somebody please tell what this means m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx
foreach my $line ( @{ $self->{'stdout'} } ) {
    if ( $line =~ m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx ) {
        $timestamp = $1;
    }
}

This
if ( $line =~ m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx ) {
is quite unreadable. The author made a start on readability by adding the /x flag (which allows whitespace in the pattern) but never made use of it, so the pattern still suffers from backslashitis. It also doesn't limit the matches to what is really meant.
Rewriting it with different delimiters gets rid of some of the backslashes:
if ( $line =~ m{^(\d+/\d+/\d+\s\d+:\d+:\d+\.\d+)}msx ) {
Adding whitespace, using [.] to match a single dot, and adding comments gives a better idea of what would be matched:
if ( $line =~ m{^ # (start of line)
( # (capture group $1)
\d+ / \d+ / \d+ # digit(s) slash digit(s) slash digit(s)
\s # ANY whitespace character (space, tab, etc)
\d+ : \d+ : \d+ # digit(s) colon digit(s) colon digit(s)
[.] \d+ # dot digit(s)
) # (end capture group $1)
}msx ) {
Where digit(s) means one or more digits 0-9 (or any other Unicode digit, since \d is not limited to ASCII). So this would happily match something like "00000/0000/0000000000 0000:0000000000000000000000:000000.0000", but it seems they meant to match e.g. "0000/00/00 00:00:00.000" (a time stamp including milliseconds).
A better regex is as specific as it can be, which lowers the chance of matching something it shouldn't. Because the pattern is anchored to the start of the line, this makes no real practical difference here, but as a general rule it is highly advisable. It would be something like this:
if ( $line =~ m{^
(
[0-9]{4} / [0-9]{2} / [0-9]{2}
[ ] # space character
[0-9]{2} : [0-9]{2} : [0-9]{2}
[.] [0-9]{3}
)
}msx ) {
With that in hand, the regex manpage that others have already linked should make more sense.
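To see the difference between the two patterns in practice, here is a minimal sketch that runs both against a few lines. The sample lines are made up for illustration and are not from the original log.

```perl
use strict;
use warnings;

# Loose pattern from the question and the stricter rewrite from this answer.
my $loose  = qr{^(\d+/\d+/\d+\s\d+:\d+:\d+\.\d+)}msx;
my $strict = qr{^
    (
        [0-9]{4} / [0-9]{2} / [0-9]{2}
        [ ]                                # space character
        [0-9]{2} : [0-9]{2} : [0-9]{2}
        [.] [0-9]{3}
    )
}msx;

# Hypothetical sample lines, invented for this illustration.
my @samples = (
    '2014/07/31 22:53:42.123 service started',
    '0/1/2 3:4:5.6 not really a timestamp',
    'no timestamp here',
);

for my $line (@samples) {
    my ($loose_ts)  = $line =~ $loose;     # undef when there is no match
    my ($strict_ts) = $line =~ $strict;
    printf "%-40s loose=%-25s strict=%s\n",
        $line, $loose_ts // '(none)', $strict_ts // '(none)';
}
```

The loose pattern accepts the second line as well, while the strict one only accepts the first.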

This original regex
m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx
is a very poorly-written pattern that matches a date/time string that looks like 2014/07/31 22:53:42.123
Since it contains no dots . the /s modifier is redundant.
The /x modifier allows whitespace layout to be added so we may as well do that, using a different delimiter so that slashes don't need escaping
m{ ^ ( \d+ / \d+ / \d+ \s \d+ : \d+ : \d+ \. \d+ ) }mx
So that matches
From the beginning of any line (i.e. at the start of the string or right after a newline because the /m modifier is in effect)
Capture the following
Some digits, a slash, some more digits, another slash, some more digits
A white space character (here I think a space was assumed)
Some digits, a colon, some more digits, another colon, some more digits, a dot, some more digits
Stop capturing
So, as I said, it would match (and capture)
2014/07/31 22:53:42.123
It would also match
0/1/2 3:4:5.6
I hope that helps
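The effect of the /m modifier only shows up when the string holds more than one line. A small sketch, using a hypothetical two-line buffer (the original code matches one line at a time, so there /m probably changes nothing):

```perl
use strict;
use warnings;

# Hypothetical two-line buffer; the timestamp sits on the second line.
my $buffer = "starting up\n2014/07/31 22:53:42.123 ready\n";

my $re_with_m    = qr{ ^ ( \d+ / \d+ / \d+ \s \d+ : \d+ : \d+ \. \d+ ) }mx;
my $re_without_m = qr{ ^ ( \d+ / \d+ / \d+ \s \d+ : \d+ : \d+ \. \d+ ) }x;

my ($with)    = $buffer =~ $re_with_m;     # ^ may also match right after a newline
my ($without) = $buffer =~ $re_without_m;  # ^ only matches at the start of the string

print "with /m:    ", $with    // '(no match)', "\n";
print "without /m: ", $without // '(no match)', "\n";
```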

Related

Perl phone-number regex

Sorry for asking such a simple question, I'm still an inexperienced programmer. I stumbled across a phone-number-matching regex in some old perl code at work, I'd love it if somebody could explain exactly what it means (my regex skills are severely lacking).
if ($value !~ /^\+[[:space:]]*[0-9][0-9.[:space:]-]*(\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?$/i) {
...
}
Thank you in advance :)
The code roughly says "you should replace this with Number::Phone".
All joking and good advice aside, first thing to do when figuring out a regex is to expand it with /x. First pass is to break things up by capture group.
/^
\+[[:space:]]*[0-9][0-9.[:space:]-]*
(\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?
([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?
([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?
$/xi
Then, since this is dominated by character sets, I'd space by character sets.
/^
\+ [[:space:]]* [0-9] [0-9.[:space:]-]*
( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
( [[:space:]]+ ext . [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
$/xi
Now you can start to see some similar elements. Try lining those up to see the similarities.
/^
\+ [[:space:]]* [0-9] [0-9.[:space:]-]*
( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
( [[:space:]]+
ext .
[0-9.[:space:]-]* [0-9] [0-9.[:space:]-]*
)?
$/xi
Then zero in on an element and try to figure it out. This is the important one: [0-9.[:space:]-]* means "zero or more numbers, spaces, dashes or dots". That doesn't make much sense for phone parsing on its own; maybe it will make more sense in context. Let's look at a line where we can guess what it's trying to do.
( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
Open paren.
Zero or more numbers, spaces, dashes or dots.
A number
Zero or more numbers, spaces, dashes or dots.
Close paren.
The parens suggest this is trying to parse an area code. The rest limits it to any number of numbers, spaces, dashes or dots, but the [0-9] ensures there is at least one number. This is likely the author's way of dealing with the multitude of phone number formats.
Let's give this a name, call it phone_chars, because it's what the author has decided phone numbers are made of. There's another element, the [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* which I'll call a "phone atom" because it's what the author decided an atom of a phone number can be. If we put that in its own regex and build the phone regex with it, things become a lot clearer.
my $phone_chars = qr{[0-9.[:space:]-]};
my $phone_atom = qr{$phone_chars* [0-9] $phone_chars*}x;
/^
\+ [[:space:]]* [0-9] $phone_chars*
( \( $phone_atom \) )?
( $phone_atom )?
( [[:space:]]+ ext . $phone_atom )?
$/xi;
If you know something about phone numbers, it's like this:
Mandatory country code (which must start with a + and a number)
Optional area code
Optional phone number
Optional extension
This regex doesn't do a very good job validating phone numbers. According to this regex, "+1" is a valid phone number, but "(555) 123-4567" isn't because it doesn't have a country code.
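Those two claims are easy to check. The sketch below rebuilds the regex from the pieces above and tries it on a few made-up numbers; the name $phone_re and the test numbers are mine, not from the original code.

```perl
use strict;
use warnings;

my $phone_chars = qr{[0-9.[:space:]-]};
my $phone_atom  = qr{$phone_chars* [0-9] $phone_chars*}x;

# $phone_re is a name chosen for this sketch.
my $phone_re = qr{^
    \+ [[:space:]]* [0-9] $phone_chars*
    ( \( $phone_atom \) )?
    ( $phone_atom )?
    ( [[:space:]]+ ext . $phone_atom )?
    $
}xi;

# Fictional numbers, for illustration only.
my @candidates = (
    '+1',
    '(555) 123-4567',
    '+1 (555) 123-4567',
    '+44 20 7946 0958 ext. 12',
);

for my $candidate (@candidates) {
    printf "%-28s %s\n", $candidate,
        $candidate =~ $phone_re ? 'accepted' : 'rejected';
}
```

Only the second candidate is rejected, because it lacks the leading + and country code.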
Phone number validation is hard. Did I mention Number::Phone?
use strict;
use warnings;
use v5.10;
use Number::Phone;
my $number = Number::Phone->new("+1(555)456-2398");
say $number->is_valid;
Amazing what extended mode, a little whitespace and a few comments can do ...
if ($value !~ /
^ # Anchor to start of string
\+ # followed (immediately) by literal '+'
[[:space:]]* # zero or more chars in the POSIX character class 'space'
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more digit, full-stop, space or hyphen
( # start capture to $1
\( # Literal open parentheses
[0-9.[:space:]-]* # zero or more ... (as above)
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more ... (as above)
\) # Literal close parentheses
)? # close capture to $1 - whole thing optional
( # start capture to $2
[0-9.[:space:]-]* # zero or more ... (as above)
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more ... (as above)
)? # close capture to $2 - whole thing optional
( # start capture to $3
[[:space:]]+ # at least one space (as defined by POSIX)
ext. # literal 'ext' followed by any character
[0-9.[:space:]-]* # zero or more ... (as above)
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more ... (as above)
)? # close capture to $3 - whole thing optional
$ # Anchor to end of string
/ix # close regex; ignore case, extended mode options
) {

Perl regex to extract digits from string with parenthesis

I have the following string:
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another variable.
The following statement does not work:
my ($NAdapter) = $string =~ /\((\d+)\)/;
What is the correct syntax?
\d+(?=[^(]*\))
You can use this. See the demo. Yours will not work because inside the () there is more data besides \d+.
https://regex101.com/r/fM9lY3/57
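You could try something like
my ($NAdapter) = $string =~ /\(.*?(\d+).*\)/;
After that, $NAdapter should contain the number that you want. The lazy .*? keeps the whole number; with a greedy .* only the last digit of a multi-digit number would be captured.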
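The two short suggestions above behave differently once the number has more than one digit. Here is a sketch comparing a greedy prefix, a lazy prefix and the lookahead version; the two-digit adapter number is a hypothetical variation on the question's string.

```perl
use strict;
use warnings;

# Hypothetical variant of the question's string with a two-digit adapter number.
my $string = "Ethernet FlexNIC (NIC 12) LOM1:1-a FC:15:B4:13:6A:A8";

my ($greedy)    = $string =~ /\(.*(\d+).*\)/;     # greedy .* leaves only the last digit
my ($lazy)      = $string =~ /\(.*?(\d+).*\)/;    # lazy .*? stops before the first digit
my ($lookahead) = $string =~ /(\d+(?=[^(]*\)))/;  # digits followed by ')' with no '(' in between

print "greedy:    $greedy\n";     # 2
print "lazy:      $lazy\n";       # 12
print "lookahead: $lookahead\n";  # 12
```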
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another
variable
Your regex (with some spaces for clarity):
/ \( (\d+) \) /x;
says to match:
A literal opening parenthesis, immediately followed by...
A digit, one or more times (captured in group 1), immediately followed by...
A literal closing parenthesis.
Yet, the substring you want to match:
(NIC 1)
is of the form:
A literal opening parenthesis, immediately followed by...
Some capital letters
STOP EVERYTHING! NO MATCH!
As an alternative, your substring:
(NIC 1)
could be described as:
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (NIC 1234) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
(\d+) #Match any digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
#Parentheses have a special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
1234
Another description of your substring:
(NIC 1)
could be:
A literal opening parenthesis, immediately followed by...
Some non-digits, immediately followed by...
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (ABC NIC789) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
789
If there might be spaces on some lines and not others, such as:
spaces
||
VV
(NIC 1 )
(NIC 2)
You can insert a \s* (any whitespace, zero or more times) in the appropriate place in the regex, for instance:
my ($match) = $string =~ /
#Parentheses have special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\s* #any whitespace, zero or more times, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
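As a quick check of that last pattern, here is a sketch that stores it with qr and runs it over a few inputs. The input lines are hypothetical and only meant to cover the with-space, without-space and no-digit cases.

```perl
use strict;
use warnings;

my $re = qr/
    \(       # literal opening parenthesis
    \D+      # one or more non-digits
    (\d+)    # one or more digits, captured in group 1
    \s*      # optional whitespace
    \)       # literal closing parenthesis
/xms;

# Hypothetical inputs, with and without whitespace before the ')'.
my @lines = (
    '(NIC 1 )',
    '(NIC 2)',
    '(NIC)',
    'Ethernet FlexNIC (NIC 34  ) LOM1:1-a',
);

for my $line (@lines) {
    my ($n) = $line =~ $re;   # undef when the line does not match
    print "$line => ", defined $n ? $n : 'no match', "\n";
}
```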

RegEx solution for 1.23+1.23+1

I searched a lot but cannot find this regular expression. My problem is that I made a calculator but cannot validate my display entirely. My case is with the dot.
I need my regular expression to accept: digit dot digit operator digit dot (1.23+1.23+1.). The dot must appear only once per number, not like (1..23+ 1.1.1). I have found similar regular expressions, but they didn't cover the case (1.23 +1.)
Here is my regEx -> /[0-9-+/*]+(\.[0-9][0-9]?)?/g
Could use this
^[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[+-](?:\d+(?:\.\d*)?|\.\d+))*$
Expanded:
^ # BOS
[+-]? # Optional Plus or minus
(?: # Decimal term
\d+
(?: \. \d* )?
| \. \d+
)
(?: # Optionally, many more terms
[+-] # Required Plus or minus
(?: # Decimal term
\d+
(?: \. \d* )?
| \. \d+
)
)*
$ # EOS
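A small Perl sketch of the same pattern, split into a reusable term, shows which inputs it accepts. The names $term and $expr and the list of inputs are mine; note that, as written, the pattern only knows the + and - operators and allows no spaces.

```perl
use strict;
use warnings;

# One decimal term: digits with an optional fraction, or a bare fraction.
my $term = qr{ \d+ (?: \. \d* )? | \. \d+ }x;

# Optional sign, one term, then any number of "operator term" pairs.
my $expr = qr{ ^ [+-]? (?: $term ) (?: [+-] (?: $term ) )* $ }x;

for my $input ('1.23+1.23+1.', '1..23+1.1.1', '.5-2', '1.23 +1.', '1.23*2') {
    printf "%-14s %s\n", $input, $input =~ $expr ? 'valid' : 'invalid';
}
```

Only the first and third inputs are reported as valid; the last two fail because of the space and the * operator.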
Check this out (demo):
/^(([-+*\/ ]+)?(\b(\d+\.\d+)\b|\d))+$/
but it will work only if there is one equation per string, since it matches at the beginning (^) and at the end ($) of a string. However, you can also use it with the /m and/or /g modifiers.
EDIT
If it is only about '–' character it is enough to add it to character class:
/^(([-–+*\/ ]+)?(\b(\d+\.\d+)\b|\d))+$/

Matching the rightmost and the leftmost symbols in perl with regular expression

I'm trying to match a string such that the leftmost symbol and the rightmost symbol are the same. How do I do that?
It’s impossible to know exactly what you mean without clarification of what you consider a “symbol”, but here is one possible solution:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Grapheme_Base} ) \X ) .* \1 \s* \z /sx;
and here is another:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Symbol} ) \X ) .* \1 \s* \z /sx;
and here is one more:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?: (?= \p{Symbol} ) \X )+ ) .* \1 \s* \z /sx;
And it is even possible that, in some very limited circumstances, you might be able to get away with:
$string =~ / ^ (\pS) .* \1 $ /xs;
But if you do, it’s also likely that someday you’re going to wish you had been more careful.
$string =~ m/^(.).*\1$/
should work. This fails to match strings of length 1, though.
Why do you want to do this with a regex? Is it homework? I avoid regexes for trivial patterns like this.
use Unicode::Normalize qw(NFC);
$s = NFC( $s );
substr( $s, 0, 1 ) eq substr( $s, -1, 1 );
Because Tom will complain about characters versus graphemes, you can handle that too:
use v5.10.1;
use Unicode::GCString;
use Unicode::Normalize qw(NFC);
my $gcs = Unicode::GCString->new( NFC( $s ) );
$gcs->substr( 0, 1 ) eq $gcs->substr( -1, 1 )
These regexes match strings of length 1 and greater. In the expressions, (.) is a capture group; substitute your own class of symbols for the dot, I guess (see the Unicode gurus' answers, although that does not seem to be the intent of the question).
The regex uses single-line mode (the /s modifier). It lets the dot match newlines as well as anything else (like [\s\S]), so newlines can be embedded in the string and can even be the outermost delimiter.
Using \z is the same as $ (in /s mode), except that \z avoids a case where $ can match just before a trailing newline (matching at the very end of the string is more common). If the character in question is a newline, you use a non-greedy quantifier (like .*?), and the target string is "\nasdf\n\n", $ could falsely match before the final newline. That is a moot issue here since the match is greedy. Still, leave it in for grins.
/^(?=(.)).*\1\z/s
expanded
/
^ # Beginning of string
(?=(.)) # Lookahead - capture grp1, first (any) character (but don't consume it)
.* # Optionally consume all the characters up until before the last character
\1 # Backreference to capture grp1, this must exist
\z # End of string
/s # s modifier
Example stipulating just word class characters
/^(?=(\w)).*\1\z/s
Again, just substitute your acceptable symbols
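To compare this with the simpler m/^(.).*\1$/ suggested earlier, here is a short sketch over a few test strings. The variable names and the strings are chosen for this illustration.

```perl
use strict;
use warnings;

my $same_ends = qr/^(?=(.)).*\1\z/s;   # lookahead version from this answer
my $simple    = qr/^(.).*\1$/;         # simpler version, needs at least two characters

for my $s ('abca', 'a', 'ab', "\nasdf\n") {
    (my $shown = $s) =~ s/\n/\\n/g;    # make newlines visible in the output
    printf "%-10s lookahead=%-3s simple=%s\n", $shown,
        ($s =~ $same_ends ? 'yes' : 'no'),
        ($s =~ $simple    ? 'yes' : 'no');
}
```

The two versions agree on 'abca' and 'ab', but only the lookahead version accepts the single character 'a' and the string that starts and ends with a newline.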

Matching degree-based geographical coordinates with a regular expression

I'd like to be able to identify patterns of the form
28°44'30"N., 33°12'36"E.
Here's what I have so far:
use utf8;
qr{
(?:
\d{1,3} \s* ° \s*
\d{1,2} \s* ' \s*
\d{1,2} \s* " \s*
[ENSW] \s* \.?
\s* ,? \s*
){2}
}x;
Needless to say, this doesn't match. Does it have anything to do with the extended characters (namely the degree symbol)? Or am I just screwing this up big time?
I'd also appreciate directions to CPAN, if you know of something there that will solve my problem. I've looked at Regex::Common and Geo::Formatter, but none of these do what I want. Any ideas?
Update
It turns out that I needed to take out use utf8 when reading the coordinates from a file. If I manually initialize a variable with a coordinate, it would match fine, but as soon as I read that same line from a file, it wouldn't match. Taking out use utf8 solved that. I guess I don't really understand what utf8 is doing.
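One possible explanation, assuming the file was saved as UTF-8: use utf8 only tells Perl that the source code is UTF-8, so the ° in the pattern becomes a single character, while a line read from a file without a decoding layer still holds raw bytes. The sketch below imitates the file with an in-memory filehandle (the variable names are stand-ins) and reads it both ways.

```perl
use strict;
use warnings;
use utf8;                  # the source code, including the ° below, is UTF-8
use Encode qw(encode);

my $re = qr{
    (?:
        \d{1,3} \s* ° \s*
        \d{1,2} \s* ' \s*
        \d{1,2} \s* " \s*
        [ENSW] \s* \.?
        \s* ,? \s*
    ){2}
}x;

# Stand-in for a UTF-8 encoded file: the same line as raw bytes.
my $bytes = encode('UTF-8', qq{28°44'30"N., 33°12'36"E.\n});

# Read once without decoding...
open my $raw_fh, '<', \$bytes or die "open: $!";
my $raw_line = <$raw_fh>;

# ...and once through a decoding layer.
open my $utf8_fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $decoded_line = <$utf8_fh>;

print "raw bytes:    ", ($raw_line     =~ $re ? "match" : "no match"), "\n";
print "decoded text: ", ($decoded_line =~ $re ? "match" : "no match"), "\n";
```

If that assumption holds, keeping use utf8 and opening the real file with the :encoding(UTF-8) layer would be an alternative to removing the pragma.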
This:
use strict;
use warnings;
use utf8;
my $re = qr{
(?:
\d{1,3} \s* ° \s*
\d{1,2} \s* ' \s*
\d{1,2} \s* " \s*
[ENSW] \s* \.?
\s* ,? \s*
){2}
}x;
if (q{28°44'30"N., 33°12'36"E.} =~ $re) {
print "match\n";
} else {
print "no match\n";
}
works:
$ ./coord.pl
match
Try dropping the use utf8 statement.
The degree symbol corresponds to byte value 0xB0 in my current encoding (whatever that is, but it ain't UTF-8). 0xB0 is a "continuation byte" in UTF-8; it is expected to be the second, third, or fourth byte of a sequence that begins with a byte between 0xC2 and 0xF4. Using that string with utf8 will give you an error.
You forgot the x modifier on the qr operator.
The ?: at the beginning of the regex makes it non-capturing, which is probably why the matches cannot be extracted or seen. Dropping it from the regex may be the solution.
If all of the coordinates are fixed-format, unpack may be a better way of obtaining the desired values.
my @twoCoordinates = unpack 'A2xA2xA2xAx3A2xA2xA2xA', q{28°44'30"N., 33°12'36"E.}; # assumes ° is a single character
print "@twoCoordinates"; # prints '28 44 30 N 33 12 36 E'
If not, then modify the regex:
my @twoCoordinates = q{28°44'30"N., 33°12'36"E.} =~ /\w+/g;
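If you want to keep the stricter pattern from the question and still get the values out, note that a capturing group repeated with {2} only keeps its last repetition. A sketch of one alternative is to capture a single coordinate and match it with /g; the names $coord, $text and @fields are chosen for this example.

```perl
use strict;
use warnings;
use utf8;   # the ° in the source is a single character

# One coordinate: degrees, minutes, seconds, hemisphere.
my $coord = qr{
    (\d{1,3}) \s* ° \s*
    (\d{1,2}) \s* ' \s*
    (\d{1,2}) \s* " \s*
    ([ENSW]) \s* \.?
}x;

my $text = q{28°44'30"N., 33°12'36"E.};

# In list context, /g returns the captures of every match, one after another.
my @fields = $text =~ /$coord/g;
print "@fields\n";   # 28 44 30 N 33 12 36 E
```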