Match a number in a string with letters and numbers - regex

I need to write a Perl regex to match numbers in a word with both letters and numbers.
Example: test123. I want to write a regex that matches only the number part and capture it
I am trying this \S*(\d+)\S* and it captures only the 3 but not 123.

Regex atoms will match as much as they can.
Initially, the first \S* matched "test123", but the regex engine had to backtrack to allow \d+ to match. The result is:
 +------------------- Matches "test12"
 |     +-------------- Matches "3"
 |     |     +--------- Matches ""
 |     |     |
---  -----  ---
\S*  (\d+)  \S*
All you need is:
my ($num) = "test123" =~ /(\d+)/;
It'll try to match at position 0, then position 1, and so on until it finds a digit; then it will match as many digits as it can.
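The backtracking described above can be checked with a short script. This is only a sketch; the helper name first_capture is made up for illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Return the first capture group of $re applied to $text (undef if no match).
# first_capture is a hypothetical helper used only for this demonstration.
sub first_capture {
    my ($re, $text) = @_;
    my ($captured) = $text =~ $re;
    return $captured;
}

# Pattern from the question: the leading \S* gives back a single digit.
print first_capture(qr/\S*(\d+)\S*/, "test123"), "\n";   # prints 3

# Pattern from the answer: nothing in front competes for the digits.
print first_capture(qr/(\d+)/, "test123"), "\n";         # prints 123
```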

The * quantifiers in your regex are greedy; that's why they also "eat" the digits. As @Marc said, you don't need them.
perl -e '$_ = "qwe123qwe"; s/(\d+)/$numbers=$1/e; print $numbers . "\n";'

"something122320" =~ /(\d+)/ will return 122320; this is probably what you're trying to do ;)

\S matches any non-whitespace characters, including digits. You want \d+:
my ($number) = 'test123' =~ /(\d+)/;

Were it a case where a non-digit was required (say before, per your example), you could use the following non-greedy expressions:
/\w+?(\d+)/ or /\S+?(\d+)/
(The second one is more in tune with your \S* specification.)
Your expression is satisfied by any string that contains one or more digits, and that may be what you want. It would even match a string of digits surrounded by spaces (" 123 "), because the boundary between the last space and the first digit satisfies zero-or-more non-space characters, and the same is true of the boundary between the '3' and the following space.
Chances are that you don't need any further specification and capturing the first run of digits in the string is enough. But when it isn't, it's good to know how to specify the expected pattern.
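Here is a small sketch of the non-greedy form applied to a few inputs. The helper name digits_after_prefix is an assumption for illustration. One thing it shows: because \S (like \w) also matches digits, the "prefix" can itself be a digit, so a string of digits alone still matches, minus its first digit.

```perl
use strict;
use warnings;

# Capture a digit run that has at least one non-space character in front of it.
# digits_after_prefix is a hypothetical helper, not code from the answer.
sub digits_after_prefix {
    my ($text) = @_;
    my ($num) = $text =~ /\S+?(\d+)/;
    return $num;
}

for my $s ("test123", "123", "abc") {
    my $num = digits_after_prefix($s);
    print "'$s' => ", (defined $num ? $num : "(no match)"), "\n";
}
# 'test123' => 123
# '123' => 23          (the lazy \S+? consumed the leading 1)
# 'abc' => (no match)
```

If the prefix really must be a non-digit, \D+? in place of \S+? would express that more strictly.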

Parentheses create capture groups. If you only need to test for a match and don't need the captured value, remove them: /\d+/ or /[0-9]+/

Related

Matching numbers with non-digits embedded

I am trying to match strings of digits that contain non-digits within them. Using the default text in http://regexr.com/, the following should match:
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
The following should not match:
0123456789
12345
I tried:
/[^\n\ ]{1,}\d+\S+\d/g
But it would not match +42 and it incorrectly matched 0123456789 and 12345, and it treated "555.123.4567 +1-(800)-555-2468" as one string.
I tried to alleviate it by putting a $ at the end but that matched nothing. Not sure what I am doing wrong.
You can use this regex to match any line with at least one non-digit followed by a digit:
/^\d*[^\d\n]+\d.*$/mg
RegEx Breakup:
^ - Start
\d* - Match 0 or more digits
[^\d\n]+ - Match 1 or more of any character that is not a digit and not a newline
\d - Match a digit
.* - Match 0 or more of any character
$ - End
Try this:
^(?=.*\d)(?=.*[^\d\s])\S+$
This means "at least one digit and one non-digit and no whitespace".
If no newlines were in your input, you could use slightly simpler:
^(?=.*\d)(?=.*\D)\S+$
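The lookahead pattern can be tried against some of the sample strings from the question. This is a minimal sketch; the function name is_mixed is an assumption, and each string is tested on its own rather than as part of a larger text.

```perl
use strict;
use warnings;

# True when the string has no whitespace, at least one digit,
# and at least one character that is neither a digit nor whitespace.
# is_mixed is a hypothetical name chosen for this example.
sub is_mixed {
    my ($s) = @_;
    return $s =~ /^(?=.*\d)(?=.*[^\d\s])\S+$/ ? 1 : 0;
}

for my $s ('v2.1', '+42', '555.123.4567', '0123456789', '12345') {
    printf "%-14s %s\n", $s, is_mixed($s) ? 'matches' : "doesn't match";
}
```

The first three strings match and the two all-digit strings do not.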
Aren't you over-thinking this massively? What's wrong with using /\D/ to match a string that contains a non-digit?
I'm not sure what your exact requirements are, but if you're looking for a string that contains at least one digit and at least one non-digit, then the easiest approach is to use two regex matches: /\d/ && /\D/.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
while (<DATA>) {
chomp;
say "$_: " . (/\d/ && /\D/ ? 'matches' : 'doesn\'t match');
}
__DATA__
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
0123456789
12345
Looks like you want to dodge strings made up entirely of digits, or entirely of letters. So you can exclude those. That will also let in strings without any numbers, so also require a number.
my $exclude = qr/(?: [0-9]+ | [A-Za-z]+ )/x;
my @res = grep { not /^$exclude$/ and /\d/ } @strings;
If any other characters need be excluded (underscore?), add it to the list.
It is not clear how your input arrives; this takes a list of ready strings. Add word boundaries and/or /s, depending on the input, or parse the input into a list of strings first.
If the input comes as a multi-line string, my @strings = split '\n|\s+', $text;.

Regex perl with letters and numbers

I need to extract a string from a text file that contains both letters and numbers. The lines start like this
Report filename: ABCL00-67900010079415.rpt ______________________
All I need is the last 8 numbers so in this example that would be 10079415
while (<DATA>) {
    if (/Report filename/) {
        my ($bagID) = ( m/(\d{8}+)./ );
        print $bagID;
    }
}
Right now this prints out the first 8 but I want the last 8.
You just need to escape the dot so that the pattern matches the 8 digits that come right before the dot character.
my ($bagID) = ( m/(\d{8}+)\./ );
. is a special character in regex that matches any character. To match a literal dot, you must escape it.
To match the last of anything, just precede it with a wildcard that will match as many characters as possible
my ($bag_id) = / .* (\d{8}) /x;
Note that I have also used the /x modifier so that the regex can contain insignificant whitespace for readability. Also, the + in your \d{8}+ makes it what is called a possessive quantifier; it is used for optimising some regex constructions and makes no difference here, because {8} is a fixed count.
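To compare the three patterns discussed in this thread side by side, here is a short sketch run against the sample line from the question (without the trailing underscores). The sub names are assumptions made up for the example.

```perl
use strict;
use warnings;

# Pattern from the question (minus the possessive +): 8 digits, then any character.
sub first_eight      { my ($line) = @_; my ($id) = $line =~ /(\d{8})./;  return $id; }

# Escaped dot: the 8 digits must sit directly before a literal '.'.
sub eight_before_dot { my ($line) = @_; my ($id) = $line =~ /(\d{8})\./; return $id; }

# Greedy prefix: .* takes the whole line, then backtracks until 8 digits fit.
sub last_eight       { my ($line) = @_; my ($id) = $line =~ /.*(\d{8})/; return $id; }

my $line = 'Report filename: ABCL00-67900010079415.rpt';
print first_eight($line), "\n";        # prints 67900010
print eight_before_dot($line), "\n";   # prints 10079415
print last_eight($line), "\n";         # prints 10079415
```

The last two agree on this input; they would differ if another run of 8 digits appeared later in the line, after the file name.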

Regex in PHP: take all the words after the first one in string and truncate all of them to the first character

I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( '/^([\w\-]+)/', $str1, $first_word )
then all the words after the first one... but how do I match those? Should I use preg_match again with offset = 1 in the arguments? But that offset is in characters or bytes, right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 0, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could keep the first two words whole when the string contains three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice way around the restriction that a lookbehind cannot be variable-length in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches first word in a name and skips it (does nothing)
(\w)\w+ matches all other words and replaces it with first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) captures each uppercase letter that is preceded by a horizontal whitespace character.
\w+ matches one or more word characters.
Replacing the match with group 1 (\1) plus a dot gives the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
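For readers following the Perl threads on this page, the same word-boundary pattern can be tried in Perl. This is a port, not the PHP code from the answer, and it assumes the two engines treat this particular pattern the same way; the sub name abbreviate_name is made up for the example.

```perl
use strict;
use warnings;

# Abbreviate every word except the first to its initial plus a dot.
# Single-letter words such as "X" are left alone because \w+ needs a second character.
# abbreviate_name is a hypothetical helper for this sketch.
sub abbreviate_name {
    my ($name) = @_;
    (my $short = $name) =~ s/(?!^)\b(\w)\w+/${1}./g;
    return $short;
}

for my $name ('John Smith', 'David X. Cohen', 'Kim Jong Un', 'Bob') {
    print abbreviate_name($name), "\n";
}
# John S.
# David X. C.
# Kim J. U.
# Bob
```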

Perl Regular Expression extracting sub-string?

I have a String variable containing something like ABCD.asd.qwe.com:/dir1.
I want to extract the ABCD portion, i.e. everything from the beginning up to the first occurrence of '.'. The problem is that any number of characters (alphanumeric only) can come before the '.', so I created this regexp.
if($arg =~ /(.*?\.?)/)
{
my $temp_name = $1;
}
However it is giving me blank string. The logic is that :
.*? - any character non-greedily
\.? - till first or none appearance of .
What could be wrong?
You can instead use a negated character class like this
^[^.]+
[^.] would match any character except .
[^.]+ would match 1 to many characters(except .)
^ depicts the start of string
OR
^.+?(?=\.|$)
(?=) is a lookahead, which checks for a particular pattern after the current position without consuming it. So for the text abcdad with the regex a(?=b), only the first a would match.
$ denotes the end of a line (with the multiline option) or the end of the string (without it).
\.? doesn't mean "till first or none appearance of .". It means "a . here or not".
If the first character of the string is .:
.*? matches 0 chars at position 0.
\.? matches 1 char at position 0.
$1 contains ..
If the first character of the string isn't .:
.*? matches 0 chars at position 0.
\.? matches 0 chars at position 0.
$1 is empty.
To match ABCD, the following would do:
/^(.*?)\./
However, I hate the non-greedy modifier. It's fragile, in the sense that it stops doing what you want if you use two in the same pattern. I'd use the following instead ("match non-periods"):
/^([^.]*)\./
or even just
/^([^.]*)/
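A quick way to see the difference is to run the question's pattern and the negated-class pattern on the same input. This is a minimal sketch; before_first_dot is a name chosen for illustration.

```perl
use strict;
use warnings;

# Everything before the first '.', or the whole string when there is no '.'.
# before_first_dot is a hypothetical helper for this example.
sub before_first_dot {
    my ($s) = @_;
    my ($head) = $s =~ /^([^.]*)/;
    return $head;
}

my $arg = 'ABCD.asd.qwe.com:/dir1';

# Pattern from the question: both .*? and \.? are happy to match nothing.
my ($original) = $arg =~ /(.*?\.?)/;
print "original: '$original'\n";                     # prints original: ''

print "fixed:    '", before_first_dot($arg), "'\n";  # prints fixed:    'ABCD'
```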
use strict;
my $string = "ABCD.asd.qwe.com:/dir1";
$string =~ /([^.]+)/;
my $capture = $1;
print"$capture\n";
OR you can also use Split function like,
my $sub_string = ( split /\./, $string )[0];
print"$sub_string\n";
A general note: for an explanation of a regex (to help understand a complex one), take a look at the YAPE::Regex::Explain module.
This should work:
if($arg =~ /(.*?)\..+/)
{
my $temp_name = $1;
}
That would match anything before the first . .
You could change the .+ to .* if your input may end after the first ..
You could change the first .*? to .+? if you are sure that there is always at least one character before the first ..

perl regex to get trailing numbers

I'm basically trying to separate a specific amount of text from the one or more digits that appear at the end. The code below works when there is one trailing digit but not when there are two or more. Shouldn't the (\d+) be getting the "12" in "P_TIME12"?
my @strs = ('P_ABC1','P_DFRES3','P_TIME12');
foreach my $str (@strs) {
if ($str =~ /^P_(\w+)(\d+)$/) {
print "word " . $1 . " digits " . $2 . "\n";
}
}
Results in
word ABC digits 1
word DFRES digits 3
word TIME1 digits 2
TIA
\w includes digits; use [_a-zA-Z] instead if the only digits are at the end.
Also, \w+ is greedy: it first matches the whole word and leaves nothing for \d+, so it has to backtrack one character, and that last character is good enough for \d+.
If you want a lazy quantifier because you have digits in the middle, use ^P_(\w+?)(\d+)$
/^P_(\D+)(\d+)$/
The character class \d matches digits; its negation \D matches everything else.
In case it is acceptable for you to capture also spaces in the first part, a simpler solution is to match anything ungreedily before the trailing numbers, then the trailing numbers greedily.
This has the advantage that you can match even digits in the first part (provided that they don't appear at the end).
And spaces as well, as already said.
That is:
my @strs = qw(P_1ABC1 P_DFRES3 P_3TIME12);
foreach (@strs) {
if ( /^P_(.*?)(\d+)$/ ) {
print ">$1<", "\t\t", ">$2<", "\n"
}
}
which produces:
>1ABC< >1<
>DFRES< >3<
>3TIME< >12<
\w matches "word characters", including digits and underscore. \w+ is greedy, so it takes every character it can and leaves \d+ only the single digit it requires.
You should be more explicit than \w and use /^P_([A-Za-z_]+)(\d+)$/ instead.
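The lazy variant suggested earlier in this thread can be sketched as a small helper and run on the question's data plus one string with a digit in the middle. The name split_trailing_digits is an assumption for the example.

```perl
use strict;
use warnings;

# Split a P_-prefixed name into its word part and its trailing digits.
# The lazy \w+? takes as little as possible, so \d+ gets the whole digit run.
# split_trailing_digits is a hypothetical helper for this sketch.
sub split_trailing_digits {
    my ($str) = @_;
    return $str =~ /^P_(\w+?)(\d+)$/;
}

for my $str ('P_ABC1', 'P_DFRES3', 'P_TIME12', 'P_3TIME12') {
    my ($word, $digits) = split_trailing_digits($str);
    print "word $word digits $digits\n";
}
# word ABC digits 1
# word DFRES digits 3
# word TIME digits 12
# word 3TIME digits 12
```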