Matching degree-based geographical coordinates with a regular expression

I'd like to be able to identify patterns of the form
28°44'30"N., 33°12'36"E.
Here's what I have so far:
use utf8;
qr{
(?:
\d{1,3} \s* ° \s*
\d{1,2} \s* ' \s*
\d{1,2} \s* " \s*
[ENSW] \s* \.?
\s* ,? \s*
){2}
}x;
Needless to say, this doesn't match. Does it have anything to do with the extended characters (namely the degree symbol)? Or am I just screwing this up big time?
I'd also appreciate directions to CPAN, if you know of something there that will solve my problem. I've looked at Regex::Common and Geo::Formatter, but none of these do what I want. Any ideas?
Update
It turns out that I needed to take out use utf8 when reading the coordinates from a file. If I manually initialize a variable with a coordinate, it would match fine, but as soon as I read that same line from a file, it wouldn't match. Taking out use utf8 solved that. I guess I don't really understand what utf8 is doing.

This:
use strict;
use warnings;
use utf8;
my $re = qr{
(?:
\d{1,3} \s* ° \s*
\d{1,2} \s* ' \s*
\d{1,2} \s* " \s*
[ENSW] \s* \.?
\s* ,? \s*
){2}
}x;
if (q{28°44'30"N., 33°12'36"E.} =~ $re) {
print "match\n";
} else {
print "no match\n";
}
works:
$ ./coord.pl
match

Try dropping the use utf8 statement.
The degree symbol corresponds to character value 0xB0 in my current encoding (whatever that is, but it ain't UTF8). 0xB0 is a "continuation byte" in UTF8; it is expected to be the second, third, or fourth byte of a sequence that begins with a byte between 0xC2 and 0xF4. Using that string with utf8 will give you an error.
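As a minimal sketch of the byte-versus-character mismatch described above (and of the file-reading problem from the question's update): the pattern below spells the degree sign as the escape \x{B0} so the example does not depend on how the source file is saved, and the sample data is written out as the raw UTF-8 bytes a file read would return without a decoding layer. The helper name matches_coords is hypothetical. Opening the file with a '<:encoding(UTF-8)' layer would be another way to get decoded text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# \x{B0} is the degree sign as a code point, independent of source encoding.
my $coord_re = qr{
    (?:
        \d{1,3} \s* \x{B0} \s*
        \d{1,2} \s* '      \s*
        \d{1,2} \s* "      \s*
        [ENSW]  \s* \.?
        \s* ,? \s*
    ){2}
}x;

# Hypothetical helper: decode raw UTF-8 bytes into characters, then test.
sub matches_coords {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return $chars =~ $coord_re ? 1 : 0;
}

# The degree sign is the two bytes C2 B0 in UTF-8.
my $raw = qq{28\xC2\xB044'30"N., 33\xC2\xB012'36"E.};

print "raw bytes: ", ($raw =~ $coord_re    ? "match" : "no match"), "\n";
print "decoded:   ", (matches_coords($raw) ? "match" : "no match"), "\n";
```

The undecoded bytes fail because the byte after the digits is 0xC2, not 0xB0; after decoding, the two bytes become the single character U+00B0 and the pattern matches.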

You forgot the x modifier on the qr operator.

The ?: at the beginning of the regex makes it non-capturing, which is probably why the matches cannot be extracted or seen. Dropping it from the regex may be the solution.
If all of the coordinates are fixed-format, unpack may be a better way of obtaining the desired values (this assumes the string is a character string, e.g. a literal under use utf8, so that ° counts as a single character):
my @twoCoordinates = unpack 'A2xA2xA2xAx3A2xA2xA2xA', q{28°44'30"N., 33°12'36"E.};
print "@twoCoordinates"; # prints '28 44 30 N 33 12 36 E'
If not, then modify the regex:
my @twoCoordinates = q{28°44'30"N., 33°12'36"E.} =~ /\w+/g;

Related

Perl Regexp::Common package not matching certain real numbers when used with word boundary

The following code prints "34" instead of the expected ".34"
use strict;
use warnings;
use Regexp::Common;
my $regex = qr/\b($RE{num}{real})\s*/;
my $str = "This is .34 meters of cable";
if ($str =~ /$regex/) {
print $1;
}
Do I need to fix my regex? (The word boundary is needed because without it the pattern would match inside a string like xx34, which I don't want.)
Or is it a bug in Regexp::Common? I always thought that the longest match should win.
The word boundary is a context-dependent regex construct. When it is followed with a word char (letter, digit or _) this location should be preceded either with the start of a string or a non-word char. In this concrete case, the word boundary is followed with a non-word char and thus requires a word char to appear right before this character.
You may use a non-ambiguous word boundary expressed with a negative lookbehind:
my $regex = qr/(?<!\w)($RE{num}{real})/;
               ^^^^^^^
The (?<!\w) negative lookbehind always denotes one thing: fail the match if there
is a word character immediately to the left of the current location.
Or, use a whitespace boundary if you want your matches to only occur after whitespace or start of string:
my $regex = qr/(?<!\S)($RE{num}{real})/;
               ^^^^^^^
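A small sketch of the difference, for illustration only. To keep it free of non-core modules it uses a simplified hand-written number pattern ($real) as a stand-in for $RE{num}{real}; that stand-in and the helper first_real are assumptions, not the module's actual pattern.

```perl
use strict;
use warnings;

# Simplified stand-in for $RE{num}{real} (an assumption for this sketch).
my $real = qr/[+-]?(?:\d+(?:\.\d*)?|\.\d+)/;

# Hypothetical helper: return the first capture, or undef if no match.
sub first_real {
    my ($re, $str) = @_;
    return $str =~ $re ? $1 : undef;
}

my $with_b          = qr/\b($real)/;        # word boundary
my $with_lookbehind = qr/(?<!\w)($real)/;   # "no word char to the left"

my $str = "This is .34 meters of cable";
print first_real($with_b, $str), "\n";            # prints 34
print first_real($with_lookbehind, $str), "\n";   # prints .34
```

With \b there is no boundary between the space and the dot (both are non-word characters), so the match can only start at the 3. The lookbehind version only asks that the preceding character is not a word character, so it can start at the dot, and it still rejects xx34.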
Try this pattern: (?:^| )(\d*\.?\d+)
Explanation:
(?:...) - non-capturing group
^| - match either ^ (beginning of the string) or a space
\d* - match zero or more digits
\.? - match dot literally - zero or one
\d+ - match one or more digits
Matched number will be stored in first capturing group.

Perl phone-number regex

Sorry for asking such a simple question, I'm still an inexperienced programmer. I stumbled across a phone-number-matching regex in some old perl code at work, I'd love it if somebody could explain exactly what it means (my regex skills are severely lacking).
if ($value !~ /^\+[[:space:]]*[0-9][0-9.[:space:]-]*(\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?$/i) {
...
}
Thank you in advance :)
The code roughly says "you should replace this with Number::Phone".
All joking and good advice aside, first thing to do when figuring out a regex is to expand it with /x. First pass is to break things up by capture group.
/^
\+[[:space:]]*[0-9][0-9.[:space:]-]*
(\([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*\))?
([0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?
([[:space:]]+ext.[0-9.[:space:]-]*[0-9][0-9.[:space:]-]*)?
$/xi
Then, since this is dominated by character sets, I'd space by character sets.
/^
\+ [[:space:]]* [0-9] [0-9.[:space:]-]*
( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
( [[:space:]]+ ext . [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
$/xi
Now you can start to see some similar elements. Try lining those up to see the similarities.
/^
\+ [[:space:]]* [0-9] [0-9.[:space:]-]*
( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* )?
( [[:space:]]+
ext .
[0-9.[:space:]-]* [0-9] [0-9.[:space:]-]*
)?
$/xi
Then zero in on an element and try to figure it out. This is the important one, [0-9.[:space:]-]* meaning "Zero or more numbers, spaces, dashes or dots". That doesn't make much sense for phone parsing on its own; maybe it will make more sense in context. Let's look at a line where we can guess what it's trying to do.
( \( [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* \) )?
Open paren.
Zero or more numbers, spaces, dashes or dots.
A number
Zero or more numbers, spaces, dashes or dots.
Close paren.
The parens suggest this is trying to parse an area code. The rest limits it to any number of numbers, spaces, dashes or dots, but the [0-9] ensures there is at least one number. This is likely the author's way of dealing with the multitude of phone number formats.
Let's give this a name, call it phone_chars, because it's what the author has decided phone numbers are made of. There's another element, the [0-9.[:space:]-]* [0-9] [0-9.[:space:]-]* which I'll call a "phone atom" because it's what the author decided an atom of a phone number can be. If we put that in its own regex and build the phone regex with it, things become a lot clearer.
my $phone_chars = qr{[0-9.[:space:]-]};
my $phone_atom = qr{$phone_chars* [0-9] $phone_chars*}x;
/^
\+ [[:space:]]* [0-9] $phone_chars*
( \( $phone_atom \) )?
( $phone_atom )?
( [[:space:]]+ ext . $phone_atom )?
$/xi;
If you know something about phone numbers, it's like this:
Mandatory country code (which must start with a + and a number)
Optional area code
Optional phone number
Optional extension
This regex doesn't do a very good job validating phone numbers. According to this regex, "+1" is a valid phone number, but "(555) 123-4567" isn't because it doesn't have a country code.
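The two claims above can be checked with a short sketch that assembles the pattern from the $phone_chars and $phone_atom pieces. The wrapper name looks_like_phone and the third sample number are made up for illustration.

```perl
use strict;
use warnings;

my $phone_chars = qr{[0-9.[:space:]-]};
my $phone_atom  = qr{$phone_chars* [0-9] $phone_chars*}x;

my $phone_re = qr{^
    \+ [[:space:]]* [0-9] $phone_chars*
    ( \( $phone_atom \) )?
    ( $phone_atom )?
    ( [[:space:]]+ ext . $phone_atom )?
    $
}xi;

# Hypothetical wrapper around the match.
sub looks_like_phone {
    my ($value) = @_;
    return $value =~ $phone_re ? 1 : 0;
}

for my $candidate ('+1', '(555) 123-4567', '+1 (555) 123-4567 ext. 89') {
    printf "%-28s %s\n", $candidate,
        looks_like_phone($candidate) ? 'accepted' : 'rejected';
}
```

The first and third candidates are accepted and the second is rejected, which is the behaviour described above: the leading + and digit are the only mandatory parts.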
Phone number validation is hard. Did I mention Number::Phone?
use strict;
use warnings;
use v5.10;
use Number::Phone;
my $number = Number::Phone->new("+1(555)456-2398");
say $number->is_valid;
Amazing what extended mode, a little whitespace and a few comments can do ...
if ($value !~ /
^ # Anchor to start of string
\+ # followed (immediately) by literal '+'
[[:space:]]* # zero or more chars in the POSIX character class 'space'
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more digits, full-stops, spaces or hyphens
( # start capture to $1
\( # Literal open parentheses
[0-9.[:space:]-]* # zero or more ... (as above)
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more ... (as above)
\) # Literal close parentheses
)? # close capture to $1 - whole thing optional
( # start capture to $2
[0-9.[:space:]-]* # zero or more ... (as above)
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more ... (as above)
)? # close capture to $2 - whole thing optional
( # start capture to $3
[[:space:]]+ # at least one space (as defined by POSIX)
ext. # literal 'ext' followed by any character
[0-9.[:space:]-]* # zero or more ... (as above)
[0-9] # compulsory digit
[0-9.[:space:]-]* # zero or more ... (as above)
)? # close capture to $3 - whole thing optional
$ # Anchor to end of string
/ix # close regex; ignore case, extended mode options
) {

regular expressions in perl explanation

I need help in understanding the below regular expressions.
Can somebody please tell what this means m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx
foreach my $line ( @{ $self->{'stdout'} } ) {
if ( $line =~ m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx ) {
$timestamp = $1;
}
}
This
if ( $line =~ m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx ) {
is quite unreadable, though the author made a start to make it readable as it has the /x flag (allowing whitespace, but not made use of), but it still suffers from backslashitis and doesn't limit the matches to what is really meant.
Rewriting it with different delimiters gets rid of some of the backslashes:
if ( $line =~ m{^(\d+/\d+/\d+\s\d+:\d+:\d+\.\d+)}msx ) {
Adding whitespace and using [.] instead to match a single dot and adding comments can provide a better idea of what would be matched:
if ( $line =~ m{^ # (start of line)
( # (capture group $1)
\d+ / \d+ / \d+ # digit(s) slash digit(s) slash digit(s)
\s # ANY whitespace character (space, tab, etc)
\d+ : \d+ : \d+ # digit(s) colon digit(s) colon digit(s)
[.] \d+ # dot digit(s)
) # (end capture group $1)
}msx ) {
Where digit(s) means one or more digits 0-9 (or any other Unicode digit). So this would happily match something like "00000/0000/0000000000 0000:0000000000000000000000:000000.0000", but it seems they meant to match e.g. "0000/00/00 00:00:00.000" (a time stamp including milliseconds).
A better regex has a lower chance of matching something it shouldn't. Because this one is anchored to the start of the line there is little practical difference here, but as a general rule it's highly advisable to be as specific as you can. It would be something like this:
if ( $line =~ m{^
(
[0-9]{4} / [0-9]{2} / [0-9]{2}
[ ] # space character
[0-9]{2} : [0-9]{2} : [0-9]{2}
[.] [0-9]{3}
)
}msx ) {
With that in hand, the regex manpage others linked already should make more sense.
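To compare the loose original with the stricter rewrite, here is a small sketch. The helper name extract_timestamp and the sample log lines are invented for illustration.

```perl
use strict;
use warnings;

# The original pattern, with {} delimiters to avoid escaping slashes.
my $loose = qr{^(\d+/\d+/\d+\s\d+:\d+:\d+\.\d+)}msx;

# The stricter version: fixed widths and a literal space.
my $strict = qr{^
    (
        [0-9]{4} / [0-9]{2} / [0-9]{2}
        [ ]
        [0-9]{2} : [0-9]{2} : [0-9]{2}
        [.] [0-9]{3}
    )
}msx;

# Hypothetical helper: return the captured timestamp or undef.
sub extract_timestamp {
    my ($re, $line) = @_;
    return $line =~ $re ? $1 : undef;
}

for my $line ('2014/07/31 22:53:42.123 job started', '0/1/2 3:4:5.6 junk') {
    for my $pair ([loose => $loose], [strict => $strict]) {
        my ($name, $re) = @$pair;
        my $ts = extract_timestamp($re, $line);
        printf "%-6s %-40s => %s\n", $name, $line, defined $ts ? $ts : '(no match)';
    }
}
```

Both patterns capture 2014/07/31 22:53:42.123 from the first line; only the loose one accepts 0/1/2 3:4:5.6.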
This original regex
m/^(\d+\/\d+\/\d+\s\d+\:\d+\:\d+\.\d+)/msx
is a very poorly-written pattern that matches a date/time string that looks like 2014/07/31 22:53:42.123
Since it contains no unescaped dots (the only dot is the literal \.) the /s modifier is redundant.
The /x modifier allows whitespace layout to be added so we may as well do that, using a different delimiter so that slashes don't need escaping
m{ ^ ( \d+ / \d+ / \d+ \s \d+ : \d+ : \d+ \. \d+ ) }mx
So that matches
From the beginning of any line (i.e. at the start of the string or right after a newline because the /m modifier is in effect)
Capture the following
Some digits, a slash, some more digits, another slash, some more digits
A white space character (here I think a space was assumed)
Some digits, a colon, some more digits, another colon, some more digits, a dot, some more digits
Stop capturing
So, as I said, it would match (and capture)
2014/07/31 22:53:42.123
It would also match
0/1/2 3:4:5.6
I hope that helps

Matching the rightmost and the leftmost symbols in perl with regular expression

I'm trying to match a string such that the leftmost symbol and the rightmost symbol are the same. How do I do that?
It’s impossible to know exactly what you mean without clarification of what you consider a “symbol”, but here is one possible solution:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Grapheme_Base} ) \X ) .* \1 \s* \z /sx;
and here is another:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Symbol} ) \X ) .* \1 \s* \z /sx;
and here is one more:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?: (?= \p{Symbol} ) \X )+ ) .* \1 \s* \z /sx;
And it is even possible that in some very limited circumstances you might be able to get away with:
$string =~ / ^ (\pS) .* \1 $ /xs;
But if you do, it’s also likely that someday you’re going to wish you had been more careful.
$string =~ m/^(.).*\1$/
should work. This fails to match strings of length 1, though.
Why do you want to do this with a regex? Is it homework? I avoid regexes for trivial patterns like this.
use Unicode::Normalize qw(NFC);
$s = NFC( $s );
substr( $s, 0, 1 ) eq substr( $s, -1, 1 );
Because Tom will complain about characters versus graphemes, you can handle that too:
use v5.10.1;
use Unicode::GCString;
use Unicode::Normalize qw(NFC);
my $gcs = Unicode::GCString->new( NFC( $s ) );
$gcs->substr( 0, 1 ) eq $gcs->substr( -1, 1 )
These regexes match strings of length 1 and greater. In the expressions, (.) represents a capture group where the dot should be replaced with your class of symbols, I guess (see the Unicode gurus' answers, although that does not seem to be the intent of the question).
The context of this regex is single line (/s modifier). It allows the dot to match newlines as well as anything else (like [\s\S]), so newlines can be embedded as well as being the outermost delimiter.
Using \z is the same as $ (in /s mode), except that \z corrects a scenario where $ could match before a final newline (matching at the very end of the string is the more common case). If the character in question is a newline and you use an un-greedy quantifier (like .*?) and the target string is "\nasdf\n\n", $ could falsely match before the final newline. But that is a moot issue since the match is all greedy. Still, leave it in for grins.
/^(?=(.)).*\1\z/s
expanded
/
^ # Beginning of string
(?=(.)) # Lookahead - capture grp1, first (any) character (but don't consume it)
.* # Optionally consume all the characters up until before the last character
\1 # Backreference to capture grp1, this must exist
\z # End of string
/s # s modifier
Example stipulating just word class characters
/^(?=(\w)).*\1\z/s
Again, just substitute your acceptable symbols
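A quick sketch of the lookahead version in use. The function name same_ends and the sample strings are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical wrapper: true when the first and last characters are equal.
sub same_ends {
    my ($s) = @_;
    return $s =~ /^(?=(.)).*\1\z/s ? 1 : 0;
}

for my $s ('abca', 'a', 'ab', "\nasdf\n") {
    (my $shown = $s) =~ s/\n/\\n/g;    # make newlines visible in the output
    printf "%-10s %s\n", $shown, same_ends($s) ? 'same' : 'different';
}
```

Because the first character is captured inside a lookahead and not consumed, a one-character string such as 'a' also passes, which the plain /^(.).*\1$/ form does not allow.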

What does this Perl regex mean: m/(.*?):(.*?)$/g?

I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $1.
: # Match a colon.
(.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $2.
$ # Match the end of the line.
/gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (it looks like this would be if the string does not contain a colon), then it won't process the code that's within the brackets. But if it does match, then the code within the brackets can use and process the special $1 and $2 variables (at least, until the next regular expression shows up, if there is one within the brackets).
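The cases listed above can be tried out with a short sketch. The helper name split_at_first_colon and the last sample line are invented for illustration; the /g modifier is left off since a single match is all that is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: return ($1, $2) on a match, or an empty list.
sub split_at_first_colon {
    my ($line) = @_;
    if ($line =~ m/(.*?):(.*?)$/) {
        my ($before, $after) = ($1, $2);   # copy before another match runs
        return ($before, $after);
    }
    return;
}

for my $line (':', 'abc:', 'abc:def:ghi', 'no colon here') {
    my @parts = split_at_first_colon($line);
    if (@parts) {
        print "'$line' => \$1='$parts[0]' \$2='$parts[1]'\n";
    } else {
        print "'$line' => no match\n";
    }
}
```

For 'abc:def:ghi' the first group stops at the first colon because it is non-greedy, and the second group is forced by the $ anchor to take everything else.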
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
This could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression
to a pattern match. Certain operations
search or modify the string $_ by
default. This operator makes that kind
of operation work on some other
string. The right argument is a search
pattern, substitution, or
transliteration. The left argument is
what is supposed to be searched,
substituted, or transliterated instead
of the default $_. When used in scalar
context, the return value generally
indicates the success of the
operation.
The regex itself is:
m/ #Perform a "match" operation
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
: #Match a literal colon character
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
$ #Match the end of string
/g #Perform the regex globally (find all occurrences in $lines)
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.