Extracting specific values from a Perl regex - regex

I want to use a Perl regex to extract certain values from file names.
They have the following (valid) names:
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
From the above I would like to extract the first 3 digits (always guaranteed to be 3), and the last part of the string, after the last _ (which is either off, or (m orp) followed by 3 digits). So the first thing I would be extracting are 3 digits, the second a string.
And I came out with the following method (I realise this might be not the most optimal/nicest one):
my $marker = '^testImrr[a-zA-z_]+\d{3}_(off|(m|p)\d{3})$';
if ($str =~ m/$marker/)
{
print "1=$1 2=$2";
}
Where only $1 has a valid result (namely the last bit of info I want), but $2 turns out empty. Any ideas on how to get those 3 digits in the middle?

You were almost there.
Just :
- capture the three digits by adding parenthesis around: (\d{3})
- don't capture m|p by adding ?: after the parenthesis before it ((?:m|p)), or by using [mp] instead:
^testImrr[a-zA-z_]+(\d{3})_(off|[mp]\d{3})$
And you'll get :
1=001 2=off
1=000 2=m030
1=231 2=p030

You can capture both at once, e.g with
if ($str =~ /(\d{3})_(off|(?:m|p)\d{3})$/ ) {
print "1=$1, 2=$2".$/;
}
You example has two capture groups as well (off|(m|p)\d{3} and m|p). In case of you first filename, for the second capture group nothing is catched due to matching the other branch. For non-capturing groups use (?:yourgroup).

There's really no need for regular expressions when a simple split and substr will suffice:
use strict;
use warnings;
while (<DATA>) {
chomp;
my #fields = split(/_/);
my $digits = substr($fields[1], -3);
print "1=$digits 2=$fields[2]\n";
}
__DATA__
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
Output:
1=001 2=off
1=000 2=m030
1=231 2=p030

Related

Perl regex exclude optional word from match

I have a strings and need to extract only icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract below data from above example.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but its working for first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I have expression which would match for three set of data.
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
m/number:([A-Z0-9]+)/;
print "$1\n";
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative to get only the values as a match using \K to reset the match buffer
\b(?:icn)?number:\K[^\W_]+
Regex demo | Perl demo
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
See the regex demo.
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
^^^^^^^^
Pattern details
\b - a word boundary (immediately to the right, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
Just another suggestion maybe, but if your strings are always valid, you may consider just to split on a character class and pull the second index from the resulting array:
my $string= "number:987654321FR";
my #part = (split /[:_]/, $string)[1];
print #part
Or for the whole array of strings:
#Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (#Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n"
}
Results in:
9876AB54321
987654321FR
987654321YQ
Regular expression can have 'icn' as an option and part of the interest is 11 characters after :.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
Already you got best and better answers here anyway I trying to solve your question right now.
Get the whole string,
my $str = do { local $/; <DATA> }; #print $str;
You can check the first grouping method upto _ or \b from the below line,
#arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
You can check the non-words \W and _ for the first grouping here, and pushing the matches in the array
#arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
print the output
print join "\n", #arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Matching numbers for substitution in Perl

I have this little script:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the string start for $1 on this case.
Try this pattern
foreach (#list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use here the branch reset feature (i.e. (?|...()...|...()...)) that allows to put several capturing groups in a single reference ( $1 here ). So, you avoid using a second replacement to trim a zero from the left of the capture.
To remove all from the begining before the number, I use :
.*? # all characters zero or more times
# ( ? -> make the * quantifier lazy to match as less as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (#list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
I would modify your regular expression. Try using this code:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the first part in your s/// matches, what you think it does, but that the second part isn't replacing what you think it should. s/// will only replace what was previously matched. Thus to replace something like T3_ you will have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
#array = ("cat_dog","horse","rabbit","chicken-pig")
#array = ("elephant","MOUSE_RAT","spider","lion-tiger")
#array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there may also be instances where a commas is immediately succeeded by a space and then another word, as that is the correct method of using commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
^ ^
| |
note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have a tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to #matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my #matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
push(#matches, split(/,/, $&));
}
print join("\n",#matches),"\n";
Though you can probably construct a single regex, a combination of regexs, splits, grep and map looks decently
my #array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split
Going from right to left:
Split the line on spaces (split)
Leave only elements having no comma at the either end but having one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my #strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my #matches = map {
$_ =~ /$regex/;
if ($1) {
my $comma_sep_list = $1;
[split ',', $comma_sep_list];
}
else {
[]
}
} #strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (#arr, $&);
}
the regular expression matches three cases :
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig

Using perl Regular expressions I want to make sure a number comes in order

I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order, but if 5 comes before at all say 322195458900023 is the number then where 545 is a problem 5 always have to come right after 4.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
(\d) # match a number, save it in $1
(\d) # match another number, save it in $2
(??{ # start postponed regex
$1 < $2 # if $1 is smaller than $2
? "" # then return the empty string (i.e. succeed)
: "(*FAIL)" # else return the *FAIL verb
}) # close postponed regex
}x; # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: The string is composed from any number (zero or more) characters that are not a five, or a five that is followed by either a character that is not a four or the five is the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
You just need to match all 5 and search fails, where preceded is not 4:
if( $str =~ /(?<!4)5/ ) {
#Fail
}

Regex to match the longest repeating substring

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.
What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.
As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.
Further examples:
56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match
What I would get using current expression (same strings used):
no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match
In Perl you can do it with one expression with help of (??{ code }):
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
Output:
101
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.
To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.
The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).
Thus the full expression would have the form:
(?=(.+)\1)(?!.+?(..{N,})\2}))
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
Usage example:
use v5.10;
sub longest_rep{
$_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}
say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
Output:
101
10001
191919
101
You can do it in a single regex, you just have to pick the longest match from the list of results manually.
def longestrepeating(strg):
regex = re.compile(r"(?=(.+)\1)")
matches = regex.findall(strg)
if matches:
return max(matches, key=len)
This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):
>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']
Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.
It's pretty rudimentary code, but hopefully you'll get the gist of it.
use v5.10;
use strict;
use warnings;
while (<DATA>) {
chomp;
print "$_ : ";
my $longest = foo($_);
if ($longest) {
say $longest;
} else {
say "No matches found";
}
}
sub foo {
my $num = shift;
my #hits;
for my $i (0 .. length($num)) {
my $part = substr $num, $i;
push #hits, $part =~ /(.+)(?=\1)/g;
}
my $long = shift #hits;
for (#hits) {
if (length($long) < length) {
$long = $_;
}
}
return $long;
}
__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919
Not sure if anyone's thought of this...
my $originalstring="pdxabababqababqh1234112341";
my $max=int(length($originalstring)/2);
my #result;
foreach my $n (reverse(1..$max)) {
#result=$originalstring=~m/(.{$n})\1/g;
last if #result;
}
print join(",",#result),"\n";
The longest doubled match cannot exceed half the length of the original string, so we count down from there.
If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
my $n;
foreach (1..$max) {
unless (#result=$originalstring=~m/(.{$_,})\1/g) {
$n=--$_;
last;
}
}
#result=$originalstring=~m/(.{$n})\1/g;
print join(",",#result),"\n";
Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
if($s =~ m/(.{$i})\1/)
{
$one = $1;
last;
}
}
# now $one is '101'