Regex named group in different locations

I have an input that contains a substring in the format TXT number or number TXT. I would like to write a regexp that matches either format and returns only the number.
I came up with something like this:
$regex = '/TXT(?<number>[0-9]+)|(?<number>[0-9]+)TXT/'
The problem is that the compiler says a group named number is already defined, even though there is an alternation (|) between the two groups.
Is it possible in PHP to write two groups with the same name? If not, how can I write a regexp like that?

To write 2 groups with the same name you need to use the (?J) inline flag:
'/(?J)TXT(?<number>[0-9]+)|(?<number>[0-9]+)TXT/'
Documentation:
J (PCRE_INFO_JCHANGED)
The (?J) internal option setting changes the local PCRE_DUPNAMES option. Allow duplicate names for subpatterns. As of PHP 7.2.0 J is supported as modifier as well.
PHP demo:
$regex = '/(?J)TXT(?<number>[0-9]+)|(?<number>[0-9]+)TXT/';
if (preg_match_all($regex, "TXT123 and 456TXT1", $matches, PREG_SET_ORDER, 0)) {
    foreach ($matches as $m) {
        echo $m["number"] . PHP_EOL;
    }
}
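Note that in your case you do not need the groups at all:
'/TXT\K[0-9]+|[0-9]+(?=TXT)/'
Here \K drops the already-matched TXT from the reported match, and the lookahead (?=TXT) checks for a trailing TXT without consuming it. Because that trailing TXT is not consumed, it can still start another match: on 456TXT1 this pattern returns both 456 and 1, whereas the grouped pattern returns only 456.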

You could use a branch reset group, (?|...), so that both alternatives share the same group number and name. This version also expects a space between the digits and TXT.
(?|TXT (?<number>[0-9]+)|(?<number>[0-9]+) TXT)
For example
$re = '/(?|TXT (?<number>[0-9]+)|(?<number>[0-9]+) TXT)/';
$str = 'TXT 4
4 TXT';
preg_match_all($re, $str, $matches);
print_r($matches["number"]);
Output
Array
(
[0] => 4
[1] => 4
)
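
Not every engine offers (?J) or branch reset. Python's built-in re module, for instance, rejects duplicate group names. A minimal workaround sketch in Python, assuming two distinct group names (after_txt and before_txt are made up for this illustration) and picking whichever one took part in the match:

```python
import re

# Python's built-in re module rejects duplicate group names, so each
# alternative gets its own (hypothetical) name.
PATTERN = re.compile(r"TXT(?P<after_txt>[0-9]+)|(?P<before_txt>[0-9]+)TXT")


def extract_numbers(text):
    """Return the digits of every 'TXT<number>' or '<number>TXT' occurrence."""
    numbers = []
    for match in PATTERN.finditer(text):
        # Exactly one of the two groups took part in each match.
        value = match.group("after_txt")
        if value is None:
            value = match.group("before_txt")
        numbers.append(value)
    return numbers


print(extract_numbers("TXT123 and 456TXT1"))  # prints ['123', '456']
```

Like the (?J) pattern, this consumes the trailing TXT of 456TXT, so the final 1 is not reported.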

Related

Extracting specific values from a Perl regex

I want to use a Perl regex to extract certain values from file names.
They have the following (valid) names:
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
From the above I would like to extract the first 3 digits (always guaranteed to be 3) and the last part of the string, after the last _ (which is either off, or m or p followed by 3 digits). So the first thing I would be extracting is 3 digits, the second a string.
I came up with the following method (I realise it might not be the nicest one):
my $marker = '^testImrr[a-zA-Z_]+\d{3}_(off|(m|p)\d{3})$';
if ($str =~ m/$marker/)
{
print "1=$1 2=$2";
}
Where only $1 has a valid result (namely the last bit of info I want), but $2 turns out empty. Any ideas on how to get those 3 digits in the middle?
You were almost there.
Just:
- capture the three digits by putting parentheses around them: (\d{3})
- don't capture m|p: either add ?: after the opening parenthesis ((?:m|p)) or use the character class [mp] instead:
^testImrr[a-zA-Z_]+(\d{3})_(off|[mp]\d{3})$
And you'll get :
1=001 2=off
1=000 2=m030
1=231 2=p030
You can capture both at once, e.g. with
if ($str =~ /(\d{3})_(off|(?:m|p)\d{3})$/ ) {
    print "1=$1, 2=$2".$/;
}
Your example has two capture groups as well: (off|(m|p)\d{3}) and (m|p). For your first file name, the second group captures nothing because the off branch matched instead. For non-capturing groups use (?:yourgroup).
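
The same capturing versus non-capturing distinction applies in other regex engines. A minimal sketch in Python, assuming the corrected pattern from the answer above (the names PATTERN and extract are made up for this illustration):

```python
import re

# Group 1: the three digits. Group 2: "off", or m/p followed by three digits.
# [mp] is a character class, so it adds no capture group of its own.
PATTERN = re.compile(r"^testImrr[a-zA-Z_]+(\d{3})_(off|[mp]\d{3})$")


def extract(name):
    """Return (digits, suffix) for a valid file name, or None otherwise."""
    match = PATTERN.match(name)
    return match.groups() if match else None


for name in ("testImrrFoo_Bar001_off",
             "testImrrFooBar_bar000_m030",
             "testImrrFooBar_bar231_p030"):
    print(name, extract(name))
```

Because only two groups capture, the result is always a pair, whichever branch of the alternation matched.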
There's really no need for regular expressions when a simple split and substr will suffice:
use strict;
use warnings;
while (<DATA>) {
    chomp;
    my @fields = split(/_/);
    my $digits = substr($fields[1], -3);
    print "1=$digits 2=$fields[2]\n";
}
__DATA__
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
Output:
1=001 2=off
1=000 2=m030
1=231 2=p030

Generic Regular expression in Perl for the following text

I need to match the following text with a regular expression in Perl.
PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00
The expression that I have written is:
(\w+)\-(\w+)\-(\w+)\-(\w+)\-(\w+)\-(\w+)
But I want something more generic. By generic what I mean is I want to have any number of hyphens (-) in it.
Maybe there is something like if - then in regex, i.e. if some character is present then look for some other thing. Can anyone please help me out?
More about my problem:
AB-ab
abc-mno-xyz
lmi-jlk-mno-xyz
......... and so on...!
I wish to match all of these patterns. To be more precise, my string can be considered a set of any number of alphanumeric substrings with a hyphen ('-') as the delimiter. (Feel free to use \w, since the substrings can contain uppercase, lowercase, numeric, and underscore '_' characters.)
You are looking for a regex with quantifiers (see perldoc perlre, section Quantifiers).
You have several possibilities:
/(\w+(?:-\w+)+)/ will match two or more groups of \w characters linked by hyphens (-). For example, AB-CD will match. Note that \w matches upper- and lower-case letters, digits, and the underscore, so a word like pre-owned will also match as a key.
/(\w+(?:-\w+){5})/ will match keys with exactly 6 groups. It is equivalent to the one you have.
/(\w+(?:-\w+){5,})/ will match keys with 6 groups or more.
If there is more than one key in the document, you can loop implicitly over all matches with the /g option.
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw{say};
use Data::Dumper;
my $text = "some text here PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00 some text there";
my @matches = $text =~ /(\w+(?:-\w+)+)/g;
print Dumper(\@matches);
Result:
$VAR1 = [
'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00'
];
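
For comparison, the quantifier approach carries over directly to other engines. A minimal Python sketch, assuming the same sample text (KEY, LONG_KEY and find_keys are names made up for this illustration):

```python
import re

# Two or more hyphen-separated \w+ chunks. No capture group is needed:
# findall() returns the whole match when the pattern has no groups.
KEY = re.compile(r"\w+(?:-\w+)+")
# Same idea, but requiring at least six chunks (five or more hyphens).
LONG_KEY = re.compile(r"\w+(?:-\w+){5,}")


def find_keys(text, pattern=KEY):
    """Return every hyphen-delimited key found in text."""
    return pattern.findall(text)


text = "some text here PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00 some text there"
for key in find_keys(text):
    # Splitting on the hyphen recovers the individual chunks.
    print(key, key.split("-"))
```

Passing LONG_KEY as the second argument filters out short hyphenated words such as pre-owned.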
How about using split:
my $str = 'PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00';
my @elem = split(/-/, $str);
Edit according to comments:
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my $str = 'Text before PS3XAY3N5SZ4K-XX_5C9F-S801-F04BN01K-00000-00 text after';
my ($str2) = $str =~ /(\w+(?:-\w+)+)/;
my @elem = split(/-/, $str2);
say Dumper \@elem;
Output:
$VAR1 = [
'PS3XAY3N5SZ4K',
'XX_5C9F',
'S801',
'F04BN01K',
'00000',
'00'
];

regular expression to match multiline text including delimiters

I want to get data between delimiters and include the delimiters in the match.
Example text:
>>> Possible error is caused by the segmentation fault
provided detection report:
<detection-report>
This is somthing that already in the report.
just an example report.
</detection-report>
---------------------------------------------
have a nice day
My current code is:
if($oopsmessage =~/(?<=<detection-report>)((.|\n|\r|\s)+)(?=<\/detection-report>)/) {
$this->{'detection_report'} = $1;
}
It retrieves the following:
This is something that already in the report.
just an example report.
How can I include both detection-report delimiters?
You can simplify the regex to the following:
my ($report) = $oopsmessage =~ m{(<detection-report>.*?</detection-report>)}s;
Notice I used different delimiters to avoid the "leaning toothpick syndrome".
The s modifier makes . match newlines.
The parentheses in ($report) force list context, so the match returns all the matching groups. $1 is therefore assigned to $report.
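
The two ingredients here, a dot that matches newlines and a non-greedy quantifier, exist in most engines. A minimal sketch in Python, where re.DOTALL plays the role of the s modifier (REPORT, extract_report and the two-report sample text are made up for this illustration):

```python
import re

# re.DOTALL lets "." match newlines; ".*?" stops at the first closing tag.
REPORT = re.compile(r"<detection-report>.*?</detection-report>", re.DOTALL)


def extract_report(message):
    """Return the first report including its delimiters, or None."""
    match = REPORT.search(message)
    return match.group(0) if match else None


sample = (
    ">>> Possible error is caused by the segmentation fault\n"
    "<detection-report>\nfirst report\n</detection-report>\n"
    "-----\n"
    "<detection-report>\nsecond report\n</detection-report>\n"
)
print(extract_report(sample))
```

With a greedy .* the match would instead run from the first opening tag to the last closing tag and swallow both reports.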
(<detection-report>(?:(?!<\/detection-report>).)*<\/detection-report>)
Try this. Use the g and s flags. See the demo:
http://regex101.com/r/xT7yD8/18
Just do:
if ($oopsmessage =~ m#(<detection-report>[\s\S]+?</detection-report>)#) {
    $this->{'detection_report'} = $1;
}
or, if you're reading a file line by line:
while(<$fh>) {
    if (/<detection-report>/ .. /<\/detection-report>/) {
        $this->{'detection_report'} .= $_;
    }
}
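
The line-by-line variant is a small state machine: start collecting at the opening tag, stop after the closing tag. A minimal Python sketch of that idea, assuming that only the first report is wanted (unlike the Perl range operator, which would restart on a later opening tag); collect_report is a name made up for this illustration:

```python
def collect_report(lines):
    """Join the lines from the opening tag through the closing tag, inclusive."""
    inside = False
    collected = []
    for line in lines:
        if not inside and "<detection-report>" in line:
            inside = True
        if inside:
            collected.append(line)
            # The closing tag is checked on the same line that opened the
            # range, so a one-line report is handled as well.
            if "</detection-report>" in line:
                break
    return "".join(collected)


lines = [
    "provided detection report:\n",
    "<detection-report>\n",
    "just an example report.\n",
    "</detection-report>\n",
    "have a nice day\n",
]
print(collect_report(lines), end="")
```

This keeps only the report in memory, which matters when the input is too large to read in one go.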
Use the below regex to get the data with delimiters.
(<detection-report>[\S\s]+?<\/detection-report>)
Group index 1 contains the string you want.
[\S\s] matches any single character, whitespace or not (including newlines); with +? it matches one or more of them, as few as possible.
/(<detection-report>.*?<\/detection-report>)/gs
You can simplify your regex to the following:
if($oopsmessage =~ m#(<detection-report>.+</detection-report>)#s) {
$this->{'detection_report'} = $1;
}
say $this->{'detection_report'};
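The s modifier lets . match a newline, so the match can span multiple lines. Using # instead of / as the delimiter means no faffing around with escaping slashes. Note that .+ is greedy, so if the text contained several reports this would capture everything from the first opening tag to the last closing tag.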
Output:
<detection-report>
This is somthing that already in the report.
just an example report.
</detection-report>

Perl assign regex match groups to variables

I have a string in the following format: _num1_num2. I need to assign the num1 and num2 values to variables. My regex is (\d+), and it shows the correct match groups on Rubular.com, but I don't know how to assign these match groups to variables. Can anybody help me? Thanks in advance.
That should be (assuming your string is stored in '$string'):
my ($var1, $var2) = $string =~ /_(\d+)_(\d+)/s;
The idea is to grab numbers until you get a non-number character: here '_'.
Each capturing group is then assigned to its respective variable.
As mentioned in this question (and in the comments below by Kaoru):
\d can indeed match more than 10 different characters, if applied to Unicode strings.
So you can use instead:
my ($var1, $var2) = $string =~ /_([0-9]+)_([0-9]+)/s;
Using the g modifier also allows you to do away with the grouping parentheses:
my ($five, $sixty) = '_5_60' =~ /\d+/g;
This allows any separation of integers but it doesn't verify the input format.
The use of the global flag in the first answer is a bit confusing. The regex /_(\d+)_(\d+)/ already captures two integers. Additionally the g modifier tries to match multiple times. So this is redundant.
IMHO the g modifier should be used when the number of matches is unknown or when it simplifies the regex.
As far as I see this works exactly the same way as in JavaScript.
Here are some examples:
use strict;
use warnings;
use Data::Dumper;
my $str_a = '_1_22'; # two integers, each preceded by an underscore
# expect two integers
# using the g modifier for global matching
my ($int1_g, $int2_g) = $str_a =~ m/_(\d+)/g;
print "global:\n", Dumper( $str_a, $int1_g, $int2_g ), "\n";
# match two ints explicitly
my ( $int1_e, $int2_e) = $str_a =~ m/_(\d+)_(\d+)/;
print "explicit:\n", Dumper( $str_a, $int1_e, $int2_e ), "\n";
# matching an unknown number of integers
my $str_b = '_1_22_333_4444';
my @ints = $str_b =~ m/_(\d+)/g;
print "multiple integers:\n", Dumper( $str_b, \@ints ), "\n";
# alternatively you can use split; the leading underscore yields an
# empty first field, which is discarded here with undef
my ( undef, $int1_s, $int2_s ) = split m/_/, $str_a;
print "split:\n", Dumper( $str_a, $int1_s, $int2_s ), "\n";
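
The same trade-off between an explicit pattern and a global match shows up in other languages. A minimal Python sketch, assuming the stricter reading that the whole string must have the _num1_num2 shape (two_ints and all_ints are names made up for this illustration):

```python
import re


def two_ints(text):
    """Explicit form: validate the whole string and return exactly two ints."""
    match = re.fullmatch(r"_([0-9]+)_([0-9]+)", text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def all_ints(text):
    """Global form: return every underscore-prefixed integer, however many."""
    return [int(number) for number in re.findall(r"_([0-9]+)", text)]


print(two_ints("_5_60"))
print(all_ints("_1_22_333_4444"))
```

The explicit form rejects unexpected input, while the global form accepts any number of integers without checking the overall format.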

Perl pattern matching "nothing"/empty

This is driving me nuts!
I read a txt file into a string called $filestring.
sysopen(handle, $filepath, O_RDONLY) or die "WHAT?";
local $/ = undef;
my $filestring = <handle>;
I made a pattern variable called $regex which is generated dynamically, but takes on the format:
(a)|(b)|(c)
I search the text for patterns separated by a space
while($filestring =~ m/($regex)\s($regex)/g){
print "Match: $1 $2\n";
#...more stuff
}
Most of the matches are valid, but for some reason I get a match like the following every once in a while:
Match: and
whereas a normal match should have two outputs like the following:
Match: , and
Does anyone know what might be causing this?
EDIT: it appears that the NULL character is being matched in the pattern.
Each of the alternatives in your regexp is a separate capture group. The whole regexp looks like:
((a)|(b)|(c))\s((a)|(b)|(c))
12   3   4     56   7   8
I've notated it with the capture group number for each piece of the regexp.
So if $filestring is b a, $1 will be b, but $2 will be empty because group 2, the first (a), did not take part in the match.
To avoid this, you should use non-capturing groups for the alternatives:
((?:a)|(?:b)|(?:c))\s((?:a)|(?:b)|(?:c))
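
Group numbering by opening parenthesis works the same way in other engines, so the effect is easy to reproduce. A minimal Python sketch using the literal a, b and c alternatives from above (the variable names are made up for this illustration; groups that did not take part in the match come back as None):

```python
import re

# Every alternative is its own capture group: eight groups in total.
capturing = re.compile(r"((a)|(b)|(c))\s((a)|(b)|(c))")
# Non-capturing alternatives leave only the two outer groups.
non_capturing = re.compile(r"((?:a)|(?:b)|(?:c))\s((?:a)|(?:b)|(?:c))")

with_inner = capturing.search("b a").groups()
outer_only = non_capturing.search("b a").groups()

print(with_inner)  # group 2 is None: the first (a) did not take part
print(outer_only)
```

With the non-capturing version, groups 1 and 2 are the left and right words, which is what the loop in the question expects.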