How to get the value in a string using regex

I am confused about how to get the number from the string below. I crafted a regex:
$regex = '(?s)^.+\bTotal Members in Group: ([\d.]+).*$'
I need only the number (2 in this example). Inside a long string, one line reads Total Members in Group: 2.
My $regex returns the entire line, but what I really need is the number.
The number is random.

cls
$string = 'Total Members in Group: 2'
$membersCount = $string -replace "\D*"
$membersCount
One more way:
cls
$string = 'Total Members in Group: 2'
$membersCount = [Regex]::Match($string, "(?<=Group:\s*)\d+").Value
$membersCount

Fors1k's helpful answer shows elegant solutions that bypass the need for a capture group ((...)) - they are the best solutions for the specific example in the question.
To answer the general question as to how to extract substrings
if and when capture groups are needed, the PowerShell-idiomatic way is to:
Either: Use -match, the regular-expression matching operator with a single input string: if the -match operation returns $true, the automatic $Matches variable reflects what the regex captured, with property (key) 0 containing the full match, 1 the first capture group's match, and so on.
$string = 'Total Members in Group: 2'
if ($string -match '(?s)^.*\bTotal Members in Group: (\d+).*$') {
# Output the first capture group's match
$Matches.1
}
Note:
-match only ever looks for one match in the input.
Direct use of the underlying .NET APIs is required to look for all matches, via [regex]::Matches() - see this answer for an example.
While -match only populates $Matches with a single input string (with an array, it acts as a filter and returns the sub-array of matching input strings), you can use a switch statement with -Regex to apply -match behavior to an array of input strings; here's a simplified example (outputs '1', '2', '3'):
switch -Regex ('A1', 'A2', 'A3') {
'A(.)' { $Matches.1 }
}
Or: Use -replace, the regular-expression-based string replacement operator, to match the entire input string and replace it with a reference to what the capture group(s) of interest captured; e.g., $1 refers to the first capture group's value.
$string = 'Total Members in Group: 2'
$string -replace '(?s)^.*\bTotal Members in Group: (\d+).*$', '$1'
Note:
-replace, unlike -match, looks for all matches in the input.
-replace also supports an array of input strings, in which case each array element is processed separately (-match does too, but in that case it does not populate $Matches; as stated, switch can remedy that).
A caveat re -replace is that if the regex does not match, the input string is returned as-is.

Related

Perl is returning hash when I am trying to find the characters after a searched-for character

I want to search for a given character in a string and return the character after it.
Based on a post here, I tried writing
my $string = 'v' . '2';
my $char = $string =~ 'v'.{0,1};
print $char;
but this returns 1 and a hash (last time I ran it, the exact output was 1HASH(0x11823a498)). Does anyone know why it returns a hash instead of the character?
Return a character after a specific pattern (a character here)
my $string = 'example';
my $pattern = qr(e);
my ($ret) = $string =~ /$pattern(.)/; #--> 'x'
This matches the first occurrence of $pattern in the $string, and captures and returns the next character, x. (The example doesn't handle the case when there may not be a character following, like for the other e; it would simply fail to match so $ret would stay undef.)
I use the qr operator to form the pattern, but a normal string would do just as well here.
The regex match operator returns different things in scalar and list contexts: in scalar context it returns true/false for whether it matched, while in list context it returns the matches. See perlretut.
So you need the match to be in list context, and a common way to provide that is to put the variable being assigned to in parentheses.
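As a minimal sketch of that difference (the variable names here are made up for illustration), the same match can be assigned once in scalar context and once in list context:

```perl
use strict;
use warnings;

my $string = 'v2';

# Scalar assignment: the match only reports success (1).
my $scalar_result = $string =~ /v(.)/;

# Parentheses on the left impose list context: the capture is returned.
my ($list_result) = $string =~ /v(.)/;

print "scalar: $scalar_result, list: $list_result\n";   # scalar: 1, list: 2
```

Only the parenthesized form hands back the captured character.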
The first problem with the example in the question is that the =~ operator binds more tightly than the . operator, so the example is effectively
my $char = ( ($string =~ 'v') . {0,1} );
So there's first the regex match, which succeeds and returns 1 (since it is in the scalar context, imposed by the . operator) and then there is a hash-reference {0,1} which is concatenated to that 1. So $char gets assigned the 1 concatenated with a stringification for a hashref, which is a string HASH(0x...) (in the parens is a hex stringification of an address).
Next, the needed . in the pattern isn't there. Perhaps it got confused with the concatenation operator (.)?
Then, the capturing parentheses are absent, although they are needed for the intended subpattern.
Finally, the match is in scalar context, as mentioned, which would only yield true/false.
Altogether, that would need to be
my ($char) = $string =~ ( q{v} . q{(.)} );
But I'd like to add: while Perl has very fluid semantics, I'd recommend not building regex patterns on the fly like that. I'd also recommend actually using delimiters in the match operator, for clarity (even though you mostly don't have to).

Perl regex exclude optional word from match

I have strings and need to extract only the icnnumbers/numbers from them.
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
I need to extract below data from above example.
9876AB54321
987654321FR
987654321YQ
Here is my regex, but it works only for the first line of data.
(icnnumber|number):(\w+)(?:_IN)
How can I write an expression that matches all three sets of data?
Given your strings to extract are only upper case and numeric, why use \w when that also matches _?
How about just matching:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
# Print the capture only when the line actually matched
print "$1\n" if m/number:([A-Z0-9]+)/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Another alternative is to get only the values as the match, using \K to reset the start of the match:
\b(?:icn)?number:\K[^\W_]+
For example
my $str = 'icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ';
while($str =~ /\b(?:icn)?number:\K[^\W_]+/g ) {
print $& . "\n";
}
Output
9876AB54321
987654321FR
987654321YQ
You may replace \w (that matches letters, digits and underscores) with [^\W_] that is almost the same, but does not match underscores:
(icnnumber|number):([^\W_]+)
If you want to make sure icnnumber and number are matched as whole words, you may add a word boundary at the start:
\b(icnnumber|number):([^\W_]+)
^^
You may even refactor the pattern a bit in order not to repeat number using an optional non-capturing group, see below:
\b((?:icn)?number):([^\W_]+)
   ^^^^^^^^
Pattern details
\b - a word boundary (immediately to the left, there must be start of string or a char other than letter, digit or _)
((?:icn)?number) - Group 1: an optional sequence of icn substring and then number substring
: - a : char
([^\W_]+) - Group 2: one or more letters or digits.
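A short sketch of how the refactored pattern could be applied to the three sample lines from the question; the array names are hypothetical and only serve the illustration:

```perl
use strict;
use warnings;

# Sample input taken from the question.
my @lines = (
    'icnnumber:9876AB54321_IN',
    'number:987654321FR',
    'icnnumber:987654321YQ',
);

my @values;
for my $line (@lines) {
    # Group 1: the key (icnnumber or number); Group 2: letters/digits only.
    if ($line =~ /\b((?:icn)?number):([^\W_]+)/) {
        push @values, $2;
    }
}

print "$_\n" for @values;
# 9876AB54321
# 987654321FR
# 987654321YQ
```

Because [^\W_] excludes the underscore, the match stops before _IN without having to mention the suffix in the pattern.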
Just another suggestion: if your strings are always valid, you could split on a character class and take the second element of the resulting list:
my $string = "number:987654321FR";
my @part = (split /[:_]/, $string)[1];
print @part;
Or for the whole array of strings:
my @Array = ("icnnumber:9876AB54321_IN", "number:987654321FR", "icnnumber:987654321YQ");
foreach (@Array)
{
my $el = (split /[:_]/, $_)[1];
print "$el\n";
}
Results in:
9876AB54321
987654321FR
987654321YQ
The regular expression can treat 'icn' as optional; the part of interest is the 11 characters after the colon.
my $re = qr/(icn)?number:(.{11})/;
Test code snippet
use strict;
use warnings;
use feature 'say';
my $re = qr/(icn)?number:(.{11})/;
while(<DATA>) {
say $2 if /$re/;
}
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ
Output
9876AB54321
987654321FR
987654321YQ
You already have good answers here, but here is another way to solve it.
Read the whole input into one string:
my $str = do { local $/; <DATA> }; #print $str;
The first variant captures everything up to an _ or a word boundary (\b):
@arrs = ($str=~m/number\:((?:(?!\_).)*)(?:\b|\_)/ig);
(or)
The second variant captures one or more characters that are neither non-word characters (\W) nor _, and collects the matches in an array:
@arrs = ($str=~m/number\:([^\W\_]+)(?:\_|\b)/ig);
Print the output:
print join "\n", @arrs;
__DATA__
icnnumber:9876AB54321_IN
number:987654321FR
icnnumber:987654321YQ

Extracting specific values from a Perl regex

I want to use a Perl regex to extract certain values from file names.
They have the following (valid) names:
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
From the above I would like to extract the first 3 digits (always guaranteed to be 3), and the last part of the string, after the last _ (which is either off, or (m or p) followed by 3 digits). So the first thing I would be extracting is 3 digits, the second a string.
And I came up with the following method (I realise this might not be the most optimal/nicest one):
my $marker = '^testImrr[a-zA-z_]+\d{3}_(off|(m|p)\d{3})$';
if ($str =~ m/$marker/)
{
print "1=$1 2=$2";
}
Where only $1 has a valid result (namely the last bit of info I want), but $2 turns out empty. Any ideas on how to get those 3 digits in the middle?
You were almost there.
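Just:
- capture the three digits by adding parentheses around them: (\d{3})
- don't capture m|p, either by adding ?: after the opening parenthesis ((?:m|p)) or by using [mp] instead
- write A-Z rather than A-z in the character class; the range A-z also covers the punctuation characters between Z and a
This gives:
^testImrr[a-zA-Z_]+(\d{3})_(off|[mp]\d{3})$
And you'll get: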
1=001 2=off
1=000 2=m030
1=231 2=p030
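To check those results, here is a small sketch that runs the pattern over the three file names from the question. The character class is written with A-Z here, and the array names are assumptions made for the example:

```perl
use strict;
use warnings;

# File names taken from the question.
my @names = qw(
    testImrrFoo_Bar001_off
    testImrrFooBar_bar000_m030
    testImrrFooBar_bar231_p030
);

my @results;
for my $str (@names) {
    # $1: the three digits; $2: off, or m/p followed by three digits.
    if ($str =~ /^testImrr[a-zA-Z_]+(\d{3})_(off|[mp]\d{3})$/) {
        push @results, "1=$1 2=$2";
    }
}

print "$_\n" for @results;
```

With [mp] in place of (m|p), the pattern has only the two groups of interest, so $2 is always the trailing part.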
You can capture both at once, e.g. with
if ($str =~ /(\d{3})_(off|(?:m|p)\d{3})$/ ) {
print "1=$1, 2=$2".$/;
}
Your example has two capture groups as well: (off|(m|p)\d{3}) and (m|p). For your first filename, the second capture group captures nothing, because the off branch matched instead. For non-capturing groups, use (?:yourgroup).
There's really no need for regular expressions when a simple split and substr will suffice:
use strict;
use warnings;
while (<DATA>) {
chomp;
my @fields = split(/_/);
my $digits = substr($fields[1], -3);
print "1=$digits 2=$fields[2]\n";
}
__DATA__
testImrrFoo_Bar001_off
testImrrFooBar_bar000_m030
testImrrFooBar_bar231_p030
Output:
1=001 2=off
1=000 2=m030
1=231 2=p030

Why does regex capturing group not return the captured text when assigned to scalar variable?

I want to capture a number contained in certain lines of a file. I am using Perl and I am using a matching operator to capture the number occurring at a specific position relative to other symbols in the lines of the file. Here is an example line:
fixedStep chrom=chr1 start=3000306 step=1
Here is the relevant portion of the script:
while ( <FILE> ) {
if ( $_=~m/fixedStep/ ) {
my $line = $_;
print $line;
my $position = ($line =~ /start\=(\d+)/);
print "position is $position\n\n";
}
}
$position prints as 1, not the number I need. According to the online regex tool regex101.com, the regex I am using works; it captures the appropriate element in the line.
To get the capture groups from a match, you have to call it in list context. It can be turned on by enclosing the scalar on the left hand side of the assignment operator into parentheses:
my ($position) = $line =~ /start=(\d+)/;
Note that = is not special in regexes, so there is no need to backslash it. Also be careful with \d if your input is Unicode: it also matches decimal digits from other scripts (such as ௫ or ٣), which you probably do not want here.
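A small sketch of the \d caveat, assuming Perl 5.14 or later for the /a modifier; the variable names are invented for the example. It tests the Tamil digit five (U+0BEB) against three patterns:

```perl
use strict;
use warnings;

my $tamil_five = "\x{0BEB}";   # TAMIL DIGIT FIVE

# \d matches any Unicode decimal digit for a character string like this one.
my $matches_d     = $tamil_five =~ /\A\d\z/    ? 1 : 0;
# An explicit ASCII class does not.
my $matches_ascii = $tamil_five =~ /\A[0-9]\z/ ? 1 : 0;
# The /a modifier restricts \d to ASCII digits.
my $matches_d_a   = $tamil_five =~ /\A\d\z/a   ? 1 : 0;

print "\\d: $matches_d, [0-9]: $matches_ascii, \\d with /a: $matches_d_a\n";
# \d: 1, [0-9]: 0, \d with /a: 0
```

Either [0-9] or the /a modifier keeps the capture limited to ASCII digits.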
When you use my $position = ($line =~ /start\=(\d+)/);, you are evaluating the match in scalar context, because of the scalar assignment on the LHS. In scalar context, the match operator returns only whether it succeeded, so $position will be either 1 or an empty string (false), depending on whether this particular match succeeded.
By using my ($position) = on the LHS, you create list context. The captured substring ends up in $position (if there are more captures, they get discarded).
Also, in general, avoid bareword filehandles such as FILE (except for special builtin ones such as DATA and ARGV). Those are package level variables. Also, assign to a lexical variable in the smallest possible scope, instead of overwriting $_. In addition, the test and match can be combined, resulting in a more specific specification of the string you want to match. Of course, you know the constraints best, so, for example, if the chrom field always appears second in valid input, you should specify that.
The pattern below just requires that the lines begin with fixedStep and there is one more field before the one you want to capture.
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
if (my ($position) = ($line =~ m{
\A
fixedStep
\s+ \S+ \s+
start=([0-9]+)
}x)) {
print "$position\n";
}
}
__DATA__
fixedStep chrom=chr1 start=0 step=1
fixedStep chrom=chr1 start=3000306 step=1
start=9999 -- hey, that's wrong
Output:
C:\Temp> tt
0
3000306
You can use
my ($position) = ($line =~ /start\=(\d+)/);
The form without parentheses around $position does not work:
my $position = $line =~ /start\=(\d+)/;
It evaluates the match in scalar context, so $position receives only the success value (1) rather than the captured text.

How do you match two strings in two different variables using regular expressions?

$a='program';
$b='programming';
if ($b=~ /[$a]/){print "true";}
This is not working.
Thanks, everyone; I was a little confused.
The [] in a regex denotes a character class, which matches any one of the characters listed inside it.
Your regex is equivalent to:
$b=~ /[program]/
which returns true as character p is found in $b.
You print true to see whether the match happened. Note that an unquoted print true will not show anything (without strict, Perl most likely treats the bareword as a filehandle), so quote the string or print something else.
But if you want to see whether one string is present inside another, drop the [..]:
if ($b=~ /$a/) { print 'true';}
If $a contains any regex metacharacters, the match above may fail or match something unintended. To fix that, place the variable between \Q and \E so that any metacharacters in it are escaped:
if ($b=~ /\Q$a\E/) { print 'true';}
Assuming either variable may come from external input, please quote the variables inside the regex:
if ($b=~ /\Q$a\E/){print "true";}
You then won't get burned when the pattern you're looking for contains "reserved characters" such as any of -[]{}().
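A minimal sketch of what the quoting changes, using made-up strings in which the search text contains the metacharacter . (the variable names are hypothetical):

```perl
use strict;
use warnings;

my $needle   = 'a.c';   # contains the regex metacharacter .
my $haystack = 'abc';

# Unquoted: . matches any character, so 'abc' matches.
my $unquoted = $haystack =~ /$needle/     ? 1 : 0;
# Quoted: . is taken literally, so 'abc' no longer matches.
my $quoted   = $haystack =~ /\Q$needle\E/ ? 1 : 0;
# Quoted still finds the literal text when it is present.
my $literal  = 'xa.cx'   =~ /\Q$needle\E/ ? 1 : 0;

print "unquoted=$unquoted quoted=$quoted literal=$literal\n";
# unquoted=1 quoted=0 literal=1
```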
(Apart from the missing semicolons:) Why do you put $a in square brackets? This makes it a list of possible characters. Try:
$b =~ /\Q${a}\E/
Update
To answer your remarks regarding = and =~:
=~ is the matching operator, and specifies the variable to which you are applying the regex ($b) in your example above. If you omit =~, then Perl will automatically use an implied $_ =~.
In list context, a match returns the list of captured substrings. You usually assign this to a list, such as in ($match1, $match2) = $b =~ /.../;. If, on the other hand, you assign the result to a scalar, the match is evaluated in scalar context and the scalar only receives whether the match succeeded (1 or an empty string).
So if you write $b = /\Q$a\E/, you'll end up with $b = $_ =~ /\Q$a\E/.
$a='program';
$b='programming';
if ( $b =~ /\Q$a\E/) {
print "match found\n";
}
If you're just looking for whether one string is contained within another and don't need to use any character classes, quantifiers, etc., then there's really no need to fire up the regex engine to do an exact literal match. Consider using index instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'program';
my $string = 'programming';
if (index($string, $target) > -1) {
print "target is in string\n";
}