Capture alphanumeric string using regex

I have a file that contains any of the following strings:
"155"
>555.123.4567
>555-123-4567
>(555).123.4567
>(555)123-4567
>(555)-123-4567
I would like to capture every string except the first, using a regex, with output like this:
555.123.4567
555-123-4567
(555).123.4567
(555)123-4567
(555)-123-4567
So far I have only been able to come up with the regex below, but it works only for the last three strings:
/(\([\d]+\).\-?(-|)[\d]+.-?[\d]+)/g

You can use this regex with optional delimiters to match your inputs:
/[(]?\d{3}[)]?[.-]?\d{3}[.-]?\d{4}\b/
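As a minimal sketch of how this pattern might be applied to the sample lines, the Perl below wraps it in a helper. The name extract_phone is hypothetical, and capturing parentheses are added around the whole pattern so the matched text can be returned.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first phone-like substring, or undef.
sub extract_phone {
    my ($line) = @_;
    # Same pattern as above, wrapped in one capture group.
    my ($phone) = $line =~ /([(]?\d{3}[)]?[.-]?\d{3}[.-]?\d{4}\b)/;
    return $phone;
}

for my $line ('"155"', '>555.123.4567', '>(555)123-4567') {
    my $phone = extract_phone($line);
    print defined $phone ? "$phone\n" : "no match: $line\n";
}
```

Because every delimiter is optional, this pattern also accepts ten bare digits such as 5551234567.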

It looks like you want to search for
Three digits in parentheses, or just three digits
A dot or a hyphen, or nothing
Three more digits
A dot or a hyphen
Four more digits
This regex is a transliteration of that. Note that it's always sensible to add the /x modifier to complex patterns so that you can add insignificant spaces and newlines to make your program more readable.
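use strict;
use warnings 'all';
use feature 'say';

while ( <DATA> ) {
    next unless / (
        (?: \( \d{3} \) | \d{3} )   # three digits, optionally in parentheses
        [.-]?                       # a dot or a hyphen, or nothing
        \d{3}                       # three more digits
        [.-]                        # a dot or a hyphen
        \d{4}                       # four more digits
    ) /x;
    say $1;
}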
__DATA__
"155"
>555.123.4567
>555-123-4567
>(555).123.4567
>(555)123-4567
>(555)-123-4567
output
555.123.4567
555-123-4567
(555).123.4567
(555)123-4567
(555)-123-4567
But that's very specific, and it looks like you're verifying manually-input phone numbers. I'm not sure you wouldn't be better off with this
Some digits and parentheses
A dot or a hyphen
More digits, dots, or hyphens
which looks like this
/ ( [()\d]+ [.-] [\d.-]+ ) /x
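To make the trade-off concrete, here is a small Perl sketch that runs the looser pattern. The helper name loose_phone is hypothetical and not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the looser pattern from the answer.
sub loose_phone {
    my ($line) = @_;
    my ($phone) = $line =~ / ( [()\d]+ [.-] [\d.-]+ ) /x;
    return $phone;
}

for my $line ('"155"', '>555.123.4567', '>(555)123-4567', '1-2') {
    my $phone = loose_phone($line);
    print defined $phone ? "$phone\n" : "no match: $line\n";
}
```

The sample numbers are still extracted, but so is a short string such as 1-2, which is the price of the simpler pattern.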


Perl comprehensive phone number regex [duplicate]

I have a file that contains phone numbers of the following formats:
(xxx) xxx.xxxx
(xxx).xxx.xxxx
(xxx) xxx-xxxx
(xxx)-xxx-xxxx
xxx.xxx.xxxx
xxx-xxx-xxxx
xxx xxx-xxxx
xxx xxx.xxxx
I must parse the file for phone numbers of those and ONLY those formats, and output them to a separate file. I'm using perl, and so far I have what I think is a valid regex for two of these numbers
my $phone_regex = qr/^(\d{3}\-)?(\(\d{3}\))?\d{3}\-\d{4}$/;
But I'm not sure if this is correct, or how to do the rest all in one regex. Thank you!
Here you go
\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}
See a demo on regex101.com.
Broken down this is
\(? # "(", optional
\d{3} # three digits
\)? # ")", optional
[-. ] # one of "-", "." or " "
\d{3} # three digits
[-. ] # same as above
\d{4} # four digits
If you want, you can add a word boundary (\b) on the right side; some potential matches may then be filtered out.
You have escaped the hyphens unnecessarily; outside a character class a hyphen needs no backslash. The regex you are trying to create is this:
^\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}$
Explanation:
^ - Start of string
\(? - Optional starting parenthesis (
\d{3} - Followed by three digits
\)? - Optional closing parenthesis )
[ .-] - A single character either a space or . or -
\d{3} - Followed by three digits
[ .-] - Again a single character either a space or . or -
\d{4} - Followed by four digits
$ - End of string
Your current regex allows too much, as it will allow xxx-(xxx) at the beginning. It also doesn't handle any of the . or space separated cases. You want to have only three sets of digits, and then allow optional parentheses around the first set which you can use an alternation for, and then you can make use of character classes to indicate the set of separators you want to allow.
Additionally, don't use \d as it will match any unicode digit. Since you likely only want to allow ASCII digits, use the character class [0-9] (there are other options, but this is the simplest).
Finally, $ allows a newline at the end of the string, so use \z instead which does not. Make sure if you are reading these from a file that you chomp them so they do not contain trailing newlines.
This leaves us with:
qr/^(?:[0-9]{3}|\([0-9]{3}\))[-. ][0-9]{3}[-.][0-9]{4}\z/
If you want to ensure that the two separators are the same if the first is a . or -, it is easiest to do this in multiple regex checks (these can be more lenient since we already validated the general format):
if ($str =~ m/^[0-9()]+ /
    or $str =~ m/^[0-9()]+\.[0-9]{3}\./
    or $str =~ m/^[0-9()]+-[0-9]{3}-/) {
    # allowed
}
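Putting the two stages together, a validation routine might look like the sketch below. The helper name is_valid_phone is hypothetical; the patterns are the ones given above.

```perl
use strict;
use warnings;

# Hypothetical helper combining the format check with the separator check.
sub is_valid_phone {
    my ($str) = @_;
    # General format: ASCII digits only; \z rejects a trailing newline.
    return 0 unless $str =~ /^(?:[0-9]{3}|\([0-9]{3}\))[-. ][0-9]{3}[-.][0-9]{4}\z/;
    # A space as first separator may pair with either second separator.
    return 1 if $str =~ m/^[0-9()]+ /;
    # Otherwise "." and "-" must be repeated as the second separator.
    return 1 if $str =~ m/^[0-9()]+\.[0-9]{3}\./;
    return 1 if $str =~ m/^[0-9()]+-[0-9]{3}-/;
    return 0;
}

for my $str ('(555) 123.4567', '555-123-4567', '555.123-4567') {
    print "$str: ", (is_valid_phone($str) ? 'allowed' : 'rejected'), "\n";
}
```

Here the mixed form 555.123-4567 passes the general format check but is rejected by the separator check.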

WKT: regex to extract only the first two floats values

I have the input below:
LINESTRING(-111.928130305897 33.4490602213529,-111.928130305897 33.4490602213529)
and I need a regex that generates this:
-111.928130305897 33.4490602213529
It's essentially the first two floats.
You can use the following regex:
(?<=\()-?(?:[1-9]\d*|\d)(?:\.\d*)\s+-?(?:[1-9]\d*|\d)(?:\.\d*)(?=,)
DEMO: https://regex101.com/r/Q2HreC/3
Explanations and assumptions:
(?<=\() positive lookbehind requiring that the floats follow an opening parenthesis
-?(?:[1-9]\d*|\d)(?:\.\d*) matches the first float: an optional -, then either a multi-digit number starting with a non-zero digit or a single digit, followed by a . and any decimals
\s+ some whitespace in the middle
followed by a second float of the same form
(?=,) positive lookahead requiring that a , follows
To match the first 2 floats for your example, you might use:
^LINESTRING\(([-+]?\d*\.?\d+) ([-+]?\d*\.?\d+)
That would match:
^LINESTRING from the beginning of the string
\( an opening parenthesis
followed by matching a float ([-+]?\d*\.?\d+) 2 times in a capturing group
The float regex:
( # Capturing group
[-+]? # Optional + or -
\d* # Match a digit zero or more times
\.? # Optional dot
\d+ # Match a digit one or more times
) # Close capturing group
Or to match -111.928130305897 33.4490602213529 for your example
without capturing groups you could use:
(?<=^LINESTRING\()[-+]?\d*\.?\d+ [-+]?\d*\.?\d+
or
(?<=^LINESTRING\()[^,]+
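In Perl, the capturing-group version could be used roughly as follows. This is only a sketch, and the helper name first_point is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first coordinate pair of a LINESTRING.
sub first_point {
    my ($wkt) = @_;
    # In list context the match returns the two captured floats.
    my ($x, $y) = $wkt =~ /^LINESTRING\(([-+]?\d*\.?\d+) ([-+]?\d*\.?\d+)/;
    return ($x, $y);
}

my ($x, $y) = first_point(
    'LINESTRING(-111.928130305897 33.4490602213529,-111.928130305897 33.4490602213529)'
);
print "$x $y\n";   # prints -111.928130305897 33.4490602213529
```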
What about using the right tool for the job? Here is a Perl module that parses WKT properly:
Code:
#!/usr/bin/env perl
use strict; use warnings;
use Geo::WKT::Simple;
my $arr = [];
push @{ $arr }, Geo::WKT::Simple::wkt_parse_linestring("LINESTRING(-111.928130305897 33.4490602213529,-111.928130305897 33.4490602213529)");
print join "\n", @{ $arr->[0] };
Output :
-111.928130305897
33.4490602213529
Doc :
https://metacpan.org/pod/distribution/Geo-WKT/lib/Geo/WKT.pod

Matching numbers with non-digits embedded

I am trying to match strings of digits that contain non-digits within them. Using the default text in http://regexr.com/, the following should match:
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
The following should not match:
0123456789
12345
I tried:
/[^\n\ ]{1,}\d+\S+\d/g
But it would not match +42 and it incorrectly matched 0123456789 and 12345, and it treated "555.123.4567 +1-(800)-555-2468" as one string.
I tried to alleviate it by putting a $ at the end but that matched nothing. Not sure what I am doing wrong.
You can use this regex to match any text with at least one non-digit:
/^\d*[^\d\n]+\d.*$/mg
RegEx Breakup:
^ - Start
\d* - Match 0 or more digits
[^\d\n]+ - Match 1 or more of any character that is not a digit and not a newline
\d - Match a digit
.* - Match 0 or more of any character
$ - End
Try this:
^(?=.*\d)(?=.*[^\d\s])\S+$
This means "at least one digit and one non-digit and no whitespace".
See live demo.
If no newlines were in your input, you could use slightly simpler:
^(?=.*\d)(?=.*\D)\S+$
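A quick way to try the first lookahead pattern against individual tokens is sketched below. The helper name is_mixed is hypothetical, and the input is assumed to be already split into whitespace-free tokens.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the token has a digit, a non-digit,
# and no whitespace.
sub is_mixed {
    my ($token) = @_;
    return $token =~ /^(?=.*\d)(?=.*[^\d\s])\S+$/ ? 1 : 0;
}

for my $token (qw( v2.1 +42 555.123.4567 12345 )) {
    printf "%-14s %s\n", $token, is_mixed($token) ? 'matches' : "doesn't match";
}
```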
Aren't you over-thinking this massively? What's wrong with using /\D/ to match a string that contains a non-digit?
I'm not sure what your exact requirements are, but if you're looking for a string that contains at least one digit and at least one non-digit, then the easiest approach is to use two regex matches - /\d/ && /\D/.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

while (<DATA>) {
    chomp;
    say "$_: " . (/\d/ && /\D/ ? 'matches' : 'doesn\'t match');
}
__DATA__
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
0123456789
12345
Looks like you want to dodge strings made up entirely of digits, or entirely of letters. So you can exclude those. That will also let in strings without any numbers, so also require a number.
my $exclude = qr/(?: [0-9]+ | [A-Za-z]+ )/x;
my @res = grep { not /^$exclude$/ and /\d/ } @strings;
If any other characters need be excluded (underscore?), add it to the list.
It is not clear how your input is coming, this takes a list of ready strings. Add word boundaries and/or /s, depending on the input. Or parse the input into a list of strings for this.
If input comes as a multi-line string, my @strings = split '\n|\s+', $text;.

Regex perl with letters and numbers

I need to extract a string from a text file that contains both letters and numbers. The lines start like this
Report filename: ABCL00-67900010079415.rpt ______________________
All I need is the last 8 digits, so in this example that would be 10079415.
while (<DATA>) {
    if (/Report filename/) {
        my ($bagID) = ( m/(\d{8}+)./ );
        print $bagID;
    }
}
Right now this prints out the first 8 but I want the last 8.
You just need to escape the dot, so that the regex matches the 8 digits that come immediately before the dot character.
my ($bagID) = ( m/(\d{8}+)\./ );
. is a special character in regex which matches any character. In order to match a literal dot, you must escape it.
To match the last of anything, just precede it with a wildcard that will match as many characters as possible
my ($bag_id) = / .* (\d{8}) /x;
Note that I have also used the /x modifier so that the regex can contain insignificant whitespace for readability. Also, your \d{8}+ is what is called a possessive quantifier; it is used for optimising some regex constructions and makes no difference at the end of the pattern.
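As a minimal sketch, the greedy-wildcard approach can be checked against the sample line like this. The helper name last_eight_digits is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the greedy .* consumes as much as possible, so the
# capture lands on the last place where 8 consecutive digits fit.
sub last_eight_digits {
    my ($line) = @_;
    my ($bag_id) = $line =~ / .* (\d{8}) /x;
    return $bag_id;
}

my $line = 'Report filename: ABCL00-67900010079415.rpt ______________________';
print last_eight_digits($line), "\n";   # prints 10079415
```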

Match a number in a string with letters and numbers

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.
Regex atoms will match as much as they can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
 +------------------- Matches "test12"
 |     +------------- Matches "3"
 |     |     +------- Matches ""
 |     |     |
---  -----  ---
\S*  (\d+)  \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, ... until it finds a digit, then it will match as many digits as it can.
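The effect of the greedy \S* can be seen by running the patterns side by side. This is only an illustrative sketch: the helper name first_capture is hypothetical, and the third pattern uses a non-greedy \S*? for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture of $pattern against $text.
sub first_capture {
    my ($text, $pattern) = @_;
    my ($capture) = $text =~ $pattern;
    return $capture;
}

print first_capture('test123', qr/\S*(\d+)\S*/),  "\n";   # prints 3
print first_capture('test123', qr/(\d+)/),        "\n";   # prints 123
print first_capture('test123', qr/\S*?(\d+)\S*/), "\n";   # prints 123
```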
The * quantifiers in your regex are greedy, which is why they also "eat" digits. Exactly as @Marc said, you don't need them.
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'
"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)
\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;
Were it a case where a non-digit was required (say before, per your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
Your expression satisfies any condition with one or more digits, and that may be what you want. It could be a string of digits surrounded by spaces (" 123 "), because the border between the last space and the first digit satisfies zero-or-more non-space, same thing is true about the border between the '3' and the following space.
Chances are that you don't need any specification and capturing the first digits in the string is enough. But when it's not, it's good to know how to specify expected patterns.
I think parentheses signify capture groups, which is exactly what you don't want. Remove them. You're looking for /\d+/ or /[0-9]+/