Finding the Nth occurrence of a matching line - regex

I have a list (a multiline text string). The order of the items and the number of lines may vary:
Ardei
Mere
Pere
Ardei
Castraveti
I want to find the 2nd occurrence of a line that contains 'Ardei' and replace the item name with another name and, separately in another regex, find the 1st occurrence of 'Ardei' and replace the name with something else (Perl).

Let's say you want to replace the 2nd "Ardei" with "XYZ". You could do that like this (PCRE syntax):
^(?s)(.*?Ardei.*?)Ardei
and replace it with:
$1XYZ
The $1 contains everything captured by (.*?Ardei.*?), and the (?s) flag makes . match every character, including line break characters.
A little demo:
#!/usr/bin/perl -w
my $text = 'Ardei
Mere
Pere
Ardei
Castraveti
Ardei';
$text =~ s/^(?s)(.*?Ardei.*?)Ardei/$1XYZ/;
# or just: $text =~ s/^(.*?Ardei.*?)Ardei/$1XYZ/s;
print $text;
will print:
Ardei
Mere
Pere
XYZ
Castraveti
Ardei

Ardei[\W\w]*?(Ardei)
matches from the first "Ardei" through the second one and captures the second "Ardei" in group 1 (\1), so you can use it to locate exactly the second instance.
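Both answers above hard-code the second occurrence. For an arbitrary N, one possible approach is to count matches inside a substitution with the /e modifier. This is only a sketch: replace_nth is a hypothetical helper name, not something from the answers above.

```perl
use strict;
use warnings;

# Replace only the $n-th occurrence of $word in $text (hypothetical helper).
sub replace_nth {
    my ($text, $word, $n, $replacement) = @_;
    my $seen = 0;
    # /g visits every match; /e evaluates the replacement as Perl code,
    # so only the $n-th match is swapped and the others are put back unchanged.
    $text =~ s/(\Q$word\E)/ ++$seen == $n ? $replacement : $1 /ge;
    return $text;
}

my $text   = "Ardei\nMere\nPere\nArdei\nCastraveti\nArdei";
my $second = replace_nth($text, 'Ardei', 2, 'XYZ');   # 2nd occurrence
my $first  = replace_nth($text, 'Ardei', 1, 'ABC');   # 1st occurrence
print "$second\n";
```

The \Q...\E pair quotes the word, so it is matched literally even if it contains regex metacharacters.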

Related

Perl regex exclude optional word from match

I have some strings and need to extract only the icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract below data from above example.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it only works for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three lines of data?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
# print only when the line matched, so a stale $1 is never reused
print "$1\n" if m/number:([A-Z0-9]+)/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative is to get only the values as the match, using \K to discard everything matched so far:
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the left, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
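As a rough illustration of the pattern described above, the following sketch runs it over the three sample lines from the question and collects group 2. The names @lines and @values are made up for this example.

```perl
use strict;
use warnings;

my @lines = (
    'icnnumber:9876AB54321_IN',
    'number:987654321FR',
    'icnnumber:987654321YQ',
);

my @values;
for my $line (@lines) {
    # Group 1 is the label, group 2 the run of letters/digits (no underscore).
    if ($line =~ /\b((?:icn)?number):([^\W_]+)/) {
        push @values, $2;
        print "$1 => $2\n";
    }
}
```

Because [^\W_] excludes the underscore, the _IN suffix on the first line is left out of group 2 without needing a separate optional group.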
Just another suggestion maybe, but if your strings are always valid, you may consider just to split on a character class and pull the second index from the resulting array:
my $string= "number:987654321FR";
my @part = (split /[:_]/, $string)[1];
print @part;
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n"
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is another way to solve it.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
Capture everything after number: up to an _ or a word boundary:
@arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
Capture a run of characters that are neither non-word characters (\W) nor _, and collect the matches in an array:
@arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Extracting specific values from a Perl regex

I want to use a Perl regex to extract certain values from file names.
They have the following (valid) names:
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
From the above I would like to extract the first 3 digits (always guaranteed to be 3), and the last part of the string, after the last _ (which is either off, or m or p followed by 3 digits). So the first thing I would be extracting is 3 digits, the second a string.
And I came up with the following method (I realise this might not be the most optimal/nicest one):
my $marker = '^testImrr[a-zA-Z_]+\d{3}_(off|(m|p)\d{3})$';
if ($str =~ m/$marker/)
{
print "1=$1 2=$2";
}
Where only $1 has a valid result (namely the last bit of info I want), but $2 turns out empty. Any ideas on how to get those 3 digits in the middle?
You were almost there.
Just :
- capture the three digits by adding parentheses around them: (\d{3})
- don't capture m|p, either by adding ?: after the opening parenthesis ((?:m|p)) or by using [mp] instead:
^testImrr[a-zA-Z_]+(\d{3})_(off|[mp]\d{3})$
And you'll get :
1=001 2=off
1=000 2=m030
1=231 2=p030
You can capture both at once, e.g with
if ($str =~ /(\d{3})_(off|(?:m|p)\d{3})$/ ) {
print "1=$1, 2=$2".$/;
}
Your example has two capture groups as well: (off|(m|p)\d{3}) and (m|p). For your first filename, the second capture group catches nothing because the off branch matched instead. For non-capturing groups, use (?:yourgroup).
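A small sketch of the same idea, assuming the three sample file names from the question: in list context a match returns its captures, so both values can be assigned to named variables directly. The variable names ($digits, $suffix, @results) are chosen for this example only.

```perl
use strict;
use warnings;

my @names = qw(
    testImrrFoo_Bar001_off
    testImrrFooBar_bar000_m030
    testImrrFooBar_bar231_p030
);

my @results;
for my $name (@names) {
    # In list context a successful match returns the captured groups,
    # so both values are assigned without touching $1 and $2.
    if (my ($digits, $suffix) = $name =~ /(\d{3})_(off|[mp]\d{3})$/) {
        push @results, "1=$digits 2=$suffix";
    }
}
print "$_\n" for @results;
```

Since [mp] is a character class rather than a group, the pattern has exactly two capture groups and their numbering stays stable.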
There's really no need for regular expressions when a simple split and substr will suffice:
use strict;
use warnings;
while (<DATA>) {
chomp;
my @fields = split(/_/);
my $digits = substr($fields[1], -3);
print "1=$digits 2=$fields[2]\n";
}
__DATA__
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
Output:
1=001 2=off
1=000 2=m030
1=231 2=p030

Perl pattern matching "nothing"/empty

This is driving me nuts!
I read a txt file into a string called $filestring.
sysopen(handle, $filepath, O_RDONLY) or die "WHAT?";
local $/ = undef;
my $filestring = <handle>;
I made a pattern variable called $regex which is generated dynamically, but takes on the format:
(a)|(b)|(c)
I search the text for patterns separated by a space
while($filestring =~ m/($regex)\s($regex)/g){
print "Match: $1 $2\n";
#...more stuff
}
Most of the matches are valid, but for some reason I get a match like the following every once in a while:
Match: and
whereas a normal match should have two outputs like the following:
Match: , and
Does anyone know what might be causing this?
EDIT: it appears that the NULL character is being matched in the pattern.
Each of the alternatives in your regexp is a separate capture group. The whole regexp looks like:
((a)|(b)|(c))\s((a)|(b)|(c))
12   3   4     56   7   8
I've notated it with the capture group number for each piece of the regexp.
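So if $filestring is "b a", $1 will be b, but $2 will be undefined (it prints as an empty string) because the (a) alternative inside the first group did not take part in the match. The second word ends up in $5 instead.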
To avoid this, you should use non-capturing groups for the alternatives:
((?:a)|(?:b)|(?:c))\s((?:a)|(?:b)|(?:c))
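Since the pattern in the question is generated dynamically, one option is to build it without any inner parentheses at all. The sketch below assumes a hypothetical token list (@tokens) and sample text; the real generator from the question is not shown, so treat this as an illustration only.

```perl
use strict;
use warnings;

# Hypothetical token list; the real pattern in the question is built elsewhere.
my @tokens = (',', 'and', 'or');

# quotemeta escapes regex metacharacters; no capturing parentheses are
# added around the individual alternatives.
my $regex = join '|', map { quotemeta($_) } @tokens;

my $filestring = 'apples , and pears or and plums';
my @pairs;
while ($filestring =~ m/($regex)\s($regex)/g) {
    push @pairs, "$1 $2";   # only two groups exist, so $2 is the second token
}
print "Match: $_\n" for @pairs;
```

With only the two outer groups left, $1 and $2 always hold the first and second matched token.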

perl regex grouping overload

I am using the following perl regex lines
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
$myalbum =~ m/([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+)/i;
to match the following strings
"30_Seconds_To_Mars_-_30_Seconds_To_Mars"
"30_Seconds_To_Mars_-_A_Beautiful_Lie"
"311_-_311"
"311_-_From_Chaos"
"311_-_Grassroots"
"311_-_Sound_System"
What I am experiencing is that for strings with fewer than 5 words (e.g. 311_-_311), attempting to print $1 $2 $3 prints nothing at all. Only strings with 5 or more words print anything.
How do I resolve this?
It looks like you just want the words in separate groups. To me, it seems like you're abusing regexes to do that when you could just run your substitutions and then split. Just do:
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
my @myalbum_list = split(/\s+/, $myalbum);  # \s+ so runs of spaces don't yield empty fields
#Print out whatever it is you want/ test length, etc...
print "$myalbum_list[0] $myalbum_list[1] $myalbum_list[2]";
The + character means "one or more". This means your regex m/([A-Z0-9\$]+) +([A-Z0-9\$]+) + ... requires all five fields to be present for it to match at all. You are not capturing anything because the string does not actually match.
You are probably looking for the * character, which means "zero or more" rather than "one or more".
I suppose your capture groups are empty for "311 - 311" because this string doesn't match your regex.
How to resolve it? Use * instead of + to permit empty sequences.
Edit: From your post I guess you want to extract the album name, i.e. the part before the minus sign.
Why not match against '(.*) - (.*)', with the first group being the album and the second the title? The problem is with strings like "Album with minus - sign - First track" or "My Album - Track is one - two - three". But even a human couldn't tell where the album ends and the track starts.
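Another way to cope with a variable number of words, sketched here under the assumption that only the words themselves are needed, is a global match in list context, which returns however many captures there are. The hash %words_for is a made-up name, and the character class is simplified to ASCII punctuation.

```perl
use strict;
use warnings;

my @albums = ('311_-_311', '30_Seconds_To_Mars_-_A_Beautiful_Lie');

my %words_for;
for my $myalbum (@albums) {
    # Work on a copy; turn separators into spaces as in the question.
    (my $clean = $myalbum) =~ s/[-_'&]/ /g;
    # With /g in list context, the match returns every captured word,
    # however many there are.
    my @words = $clean =~ /([A-Z0-9\$]+)/gi;
    $words_for{$myalbum} = \@words;
    print scalar(@words), " words: @words\n";
}
```

This avoids fixing the number of groups in the pattern, so short names such as 311_-_311 still produce a result.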

Why can't I match a substring which may appear 0 or 1 time using /(subpattern)?/

The original string is like this:
checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19
The last part "fail1:19" may appear 0 or 1 time. And I tried to match the number after "fail1:", which is 19, using this:
($reg_suc, $reg_fail) = ($1, $2) if $line =~ /^checksession\s+ok:(\d+).*(fail1:(\d+))?/;
It doesn't work. The $2 variable is empty even when "fail1:19" is present. If I delete the "?", it matches only when the "fail1:19" part exists, and $2 is then "fail1:19". But when the "fail1:19" part doesn't exist, neither $1 nor $2 is set, which is not what I want.
How can I rewrite this pattern to capture the two numbers correctly? That is, when the "fail1:19" part exists, both numbers should be recorded, and when it doesn't exist, only the number after "ok:" should be recorded.
First, the number in the fail field would end up in $3, since capture variables are numbered by the order of their opening parentheses. Second, as codaddict shows, the .* construct is greedy, so it will consume even the fail... part. Third, you can avoid numbered variables like this:
my $line = "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
if(my ($reg_suc, $reg_fail, $addend)
= $line =~ /^checksession\s+ok:(\d+).*?(fail1:(\d+))?$/
) {
warn "$reg_suc\n$reg_fail\n$addend\n";
}
Try the regex:
^checksession\s+ok:(\d+).*?(fail1:(\d+))?$
Changes made:
- .* in the middle has been made non-greedy, and
- $ (end anchor) has been added.
As a result of the above changes, .*? will try to consume as little as possible and the end anchor forces the regex to match up to the end of the string, matching fail1:number if present.
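To see the effect on both kinds of input, here is a minimal sketch that applies the non-greedy, anchored pattern to a line with and without the fail1 field. It makes the inner group non-capturing, (?:fail1:(\d+))?, so the count lands in the second capture; that tweak and the @report array are assumptions of this example, not part of the answer above.

```perl
use strict;
use warnings;

my @lines = (
    'checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19',
    'checksession ok:6178 avg:479 avgnet:480 MaxTime:18081',
);

my @report;
for my $line (@lines) {
    # (?:...) groups "fail1:" without capturing, so the count is capture 2.
    if (my ($reg_suc, $reg_fail) =
            $line =~ /^checksession\s+ok:(\d+).*?(?:fail1:(\d+))?$/) {
        # $reg_fail is undef when the optional part is absent.
        push @report, "$reg_suc " . (defined $reg_fail ? $reg_fail : 'none');
    }
}
print "$_\n" for @report;
```

Checking defined before using $reg_fail avoids an "uninitialized value" warning on lines that have no fail1 field.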
I think this is one of the few cases where a split is actually more robust than a regex:
$bar[0]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
$bar[1]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081";
for $line (@bar){
(@fields) = split/ /,$line;
$reg_suc = $fields[1];
$reg_fail = $fields[5];
print "$reg_suc $reg_fail\n";
}
I try to avoid the non-greedy modifier. It often bites back. Kudos for suggesting split, but I'd go a step further:
my %rec = split /\s+|:/, ( $line =~ /^checksession (.*)/ )[0];
print "$rec{ok} $rec{fail1}\n";