Perl Regular Expression - What does the gc modifier mean? - regex

I have a regex which matches some text as:
$text =~ m/$regex/gcxs
Now I want to know what the 'gc' modifier means.
I have searched and found that gc means "Allow continued search after failed /g match".
This is not clear to me. What does "continued search" mean?
As far as I understand, it means that matching starts again at the beginning if the /g search fails. But doesn't the /g modifier match against the whole string anyway?

The /g modifier is used to remember the "position in a string" so you can incrementally process a string. e.g.
my $txt = "abc3de";
while( $txt =~ /\G[a-z]/g )
{
print "$&";
}
while( $txt =~ /\G./g )
{
print "$&";
}
Because the position is reset on a failed match, the above will output
abcabc3de
The /c flag does not reset the position on a failed match. So if we add /c to the first regex like so
my $txt = "abc3de";
while( $txt =~ /\G[a-z]/gc )
{
print "$&";
}
while( $txt =~ /\G./g )
{
print "$&";
}
We end up with
abc3de
Sample code: http://ideone.com/cC9wb
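The difference can also be observed directly through pos(). The following is a minimal sketch, not part of the original answer; the variable names $without_c and $with_c are made up for illustration.

```perl
use strict;
use warnings;

my $txt = "abc3de";

# Without /c: the loop matches a, b, c, then fails at "3",
# and that failure resets pos($txt) to undef.
1 while $txt =~ /\G[a-z]/g;
my $without_c = defined pos($txt) ? pos($txt) : 'undef';

# With /c: the same failure leaves pos($txt) where the last
# successful match ended (offset 3, just before the "3").
1 while $txt =~ /\G[a-z]/gc;
my $with_c = defined pos($txt) ? pos($txt) : 'undef';

print "without /c: $without_c\n";   # without /c: undef
print "with /c: $with_c\n";         # with /c: 3
```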

In the perldoc perlre http://perldoc.perl.org/perlre.html#Modifiers
Global matching, and keep the Current position after failed matching. Unlike i, m, s and x, these two flags affect the way the regex is used rather than the regex itself. See Using regular expressions in Perl in perlretut for further explanation of the g and c modifiers.
The specified ref leads to:
http://perldoc.perl.org/perlretut.html#Using-regular-expressions-in-Perl
This URI has a sub-section entitled, 'Global matching' which contains a small tutorial/working example, including:
A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the //c , as in /regexp/gc . The current position in the string is associated with the string, not the regexp. This means that different strings have different positions and their respective positions can be set or read independently.
HTH
Lee

Related

grep a pattern and return all characters before and after another specific character bash

I'm interested in searching a variable inside a log file, in case the search returns something then I wish for all entries before the variable until the character '{' is met and after the pattern until the character '}' is met.
To be more precise let's take the following example:
something something {
entry 1
entry 2
name foo
entry 3
entry 4
}
something something test
test1 test2
test3 test4
In this case I would search for 'name foo' which will be stored in a variable (which I create before in a separate part) and the expected output would be:
{
entry 1
entry 2
name foo
entry 3
entry 4
}
I tried finding something on grep, awk or sed. I was able to only come up with options for finding the pattern and then return all lines until '}' is met, however I can't find a suitable solution for the lines before the pattern.
I found a Perl-style regex that could be used, but I'm not able to use the variable in it. If I replace the variable with the literal 'foo', I do get output.
grep -Poz '.*(?s)\{[^}]*name\tfoo.*?\}'
The regex is quite simple, once the whole file is read into a variable
use warnings;
use strict;
use feature 'say';
die "Usage: $0 filename\n" if not @ARGV;
my $file_content = do { local $/; <> }; # "slurp" file with given name
my $target = qr{name foo};
while ( $file_content =~ /({ .*? $target .*? })/gsx ) {
say $1;
}
Since we undef-ine the input record separator inside the do block using local, the following read via the null filehandle <> pulls the whole file at once, as a string ("slurps" it). That is returned by the do block and assigned to the variable. The <> reads from the file(s) named in @ARGV, that is, whatever was given on the command line at the program's invocation.
In the regex pattern, the ? after .* makes it non-greedy, so it matches only up to the first occurrence of the next subpattern: after { the .*? matches up to the first (evaluated) $target, then the $target is matched, then .*? matches everything up to the first }. All that is captured by the enclosing () and is thus later available in $1.
The /s modifier makes . match newlines, which it normally doesn't, and which is necessary in order to match patterns that span multiple lines. With the /g modifier it keeps going through the string searching for all such matches. With /x literal whitespace in the pattern is ignored, so we can spread out the pattern for readability (even over lines -- and use comments!).
The $target is compiled as a proper regex pattern using the qr operator.
See regex tutorial perlretut, and then there's the full reference perlre.
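Since the question asks about a search string held in a variable, here is a hedged sketch of one way to do that. It is a variation, not the answer's exact pattern: the string is interpolated with \Q...\E so that any regex metacharacters in it are taken literally, and [^{}]* is swapped in for .*? so that a match cannot start in an earlier brace block that lacks the target. The sample data and the names $name and @blocks are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample input: one block without the target, one with it.
my $file_content = <<'END';
other {
entry 0
}
something something {
entry 1
name foo
entry 2
}
something something test
END

my $name   = 'name foo';        # hypothetical search string
my $target = qr{\Q$name\E};     # \Q...\E quotes regex metacharacters

# [^{}] never crosses a brace, so each match stays inside a single
# { ... } block; /s is not needed because a character class already
# matches newlines.
my @blocks = $file_content =~ /( \{ [^{}]* $target [^{}]* \} )/gx;

print "$_\n" for @blocks;
```

Only the second block is printed, from its opening { to its closing }.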
Here's an Awk attempt which tries to read between the lines to articulate an actual requirement. What I'm guessing you are trying to say is that "if there is an opening brace, print all content between it and the closing brace in case of a match inside the braces. Otherwise, just print the matching line."
We accomplish this by creating a state variable in Awk which keeps track of whether you are in a brace context or not. This simple implementation will not handle nested braces correctly; if that's your requirement, maybe post a new and better question with your actual requirements.
awk -v search="foo" 'n { context[++n] = $0 }
/{/ { delete context; n=0; matched=0; context[++n] = $0 }
/}/ && n { if (matched) for (i=1; i<=n; i++) print context[i];
delete context; n=0 }
$0 ~ search { if(n) matched=1; else print }' file
The variable n is the number of lines in the collected array context; when it is zero, we are not in a context between braces. If we find a match and are collecting lines into context, defer printing until we have collected the whole context. Otherwise, just print the current line.

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple spaces, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here). It would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
push @positions, $-[0];   # $-[0] is the start offset of the last match
}
my @matches;
for my $pos (@positions) {
pos($text) = $pos;        # start the next \G match at this offset
push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, we also remove the need for two different regex patterns to agree as to what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
my $word = $1;
push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
For example, a proper tokenizer could be used.
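As a rough illustration of what such a tokenizer might look like, here is a minimal sketch built on \G and /gc. The token classes (quoted string, word, single character) and the STR:/WORD:/CHAR: labels are assumptions chosen for the example, not something the question specifies.

```perl
use strict;
use warnings;

my $text = qq{say "hello world" twice\n};
my @tokens;

pos($text) = 0;
while (pos($text) < length $text) {
    # \G anchors each attempt at the current position; /c keeps that
    # position when an alternative fails, so the next one can try there.
    if    ($text =~ /\G\s+/gc)       { next }                     # skip whitespace
    elsif ($text =~ /\G"([^"]*)"/gc) { push @tokens, "STR:$1"  }  # quoted string
    elsif ($text =~ /\G(\w+)/gc)     { push @tokens, "WORD:$1" }  # bare word
    elsif ($text =~ /\G(.)/gcs)      { push @tokens, "CHAR:$1" }  # anything else
}

print join('|', @tokens), "\n";   # WORD:say|STR:hello world|WORD:twice
```

The last alternative always consumes one character, so the loop cannot stall on input that none of the other patterns recognise.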
In the absence of more information, I can only suggest the pos function.
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string.
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and the program displays the character found, if any.
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the string and leave information about only the very last match.
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a b c d e f g h i j k l m n';
# 0123456789012345678901234567890123456789
# 1 2 3
for ( 4, 31, 16, 22 ) {
pos($str) = $_;
say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i
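The same technique on a shorter string, as a hedged sketch: the string, the offsets and the @found array are made up for illustration, and $-[1] is used to report where the captured character was found.

```perl
use strict;
use warnings;

my $str = 'ab   cd  ef';
#          01234567890

my @found;
for my $offset (2, 7, 0) {
    pos($str) = $offset;              # next m//g starts searching here
    if ($str =~ /(\S)/g) {            # "if" supplies scalar context
        push @found, "offset $offset: found '$1' at index $-[1]";
    }
}

print "$_\n" for @found;
# offset 2: found 'c' at index 5
# offset 7: found 'e' at index 9
# offset 0: found 'a' at index 0
```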

Why does regex capturing group not return the captured text when assigned to scalar variable?

I want to capture a number contained in certain lines of a file. I am using Perl and I am using a matching operator to capture the number occurring at a specific position relative to other symbols in the lines of the file. Here is an example line:
fixedStep chrom=chr1 start=3000306 step=1
Here is the relevant portion of the script:
while ( <FILE> ) {
if ( $_=~m/fixedStep/ ) {
my $line = $_;
print $line;
my $position = ($line =~ /start\=(\d+)/);
print "position is $position\n\n";
}
$position prints as 1, not the number I need. According to the online regex tool regex101.com, the regex I am using works; it captures the appropriate element in the line.
To get the capture groups from a match, you have to call it in list context. It can be turned on by enclosing the scalar on the left hand side of the assignment operator into parentheses:
my ($position) = $line =~ /start=(\d+)/;
Note that = is not special in regexes, so there is no need to backslash it. Also be careful with \d if your input is Unicode: it matches digits from other scripts too (such as ௫ or ٣), which you probably do not want. Use [0-9] to match ASCII digits only.
When you use my $position = ($line =~ /start\=(\d+)/);, you are evaluating the match in scalar context, because of the scalar assignment on the LHS. In scalar context, the match operator returns only whether it succeeded: 1 on success or the empty string on failure. That is the value that ends up in $position, not the captured text.
By using my ($position) = on the LHS, you create list context. The successful matched substring ends up in $position (if there are more, they get discarded).
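A small side-by-side sketch of the two contexts, using the sample line from the question; the variable names $in_scalar and $in_list are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'fixedStep chrom=chr1 start=3000306 step=1';

# Scalar assignment: the match returns its success status.
my $in_scalar = $line =~ /start=(\d+)/;

# List assignment: the match returns the captured groups.
my ($in_list) = $line =~ /start=(\d+)/;

print "scalar context: $in_scalar\n";   # scalar context: 1
print "list context: $in_list\n";       # list context: 3000306
```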
Also, in general, avoid bareword filehandles such as FILE (except for special builtin ones such as DATA and ARGV). Those are package level variables. Also, assign to a lexical variable in the smallest possible scope, instead of overwriting $_. In addition, the test and match can be combined, resulting in a more specific specification of the string you want to match. Of course, you know the constraints best, so, for example, if the chrom field always appears second in valid input, you should specify that.
The pattern below just requires that the lines begin with fixedStep and there is one more field before the one you want to capture.
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
if (my ($position) = ($line =~ m{
\A
fixedStep
\s+ \S+ \s+
start=([0-9]+)
}x)) {
print "$position\n";
}
}
__DATA__
fixedStep chrom=chr1 start=0 step=1
fixedStep chrom=chr1 start=3000306 step=1
start=9999 -- hey, that's wrong
Output:
C:\Temp> tt
0
3000306
[ EDIT: See comments for an explanation of why the struck text is wrong ]
You can use
my ($position) = ($line =~ /start\=(\d+)/);
[struck out as wrong: or
my $position = $line =~ /start\=(\d+)/;
either should work]
Otherwise, you are mixing list and scalar contexts, and subsequently just getting the success status of the match (1) instead of the captured text

Search for value from command output and just print that found value

I am calling my program from Perl and getting the output with:
$output = `$calling 2>>bla.txt`;
Now I need just a specific value that will be presented in the output which I can check with Regex.
The needed output is:
Distance from Segment XY to its Centroid is: 3.455564713591596
Where XY is any number, and I just match for the "to its Centroid is: " the following:
if( $output =~ m/\sto\sits\sCentroid\sis:\s(\d)*$/)
But how do I get only the value that is presented near to the end?
I just want it to be printed on the screen.
Any advice?
Instead of \d* ("zero or more digits"), you probably need to match \d+([.]\d+)? ("one or more digits, optionally followed by a decimal point and one or more additional digits"). That would give you:
if( $output =~ m/\sto\sits\sCentroid\sis:\s\d+([.]\d+)?$/)
(hat-tip to Jonathan Leffler for pointing that out).
That done — you want to capture the \d+([.]\d+)?, so, wrap it in parentheses to create a capture-group:
if( $output =~ m/\sto\sits\sCentroid\sis:\s(\d+([.]\d+)?)$/)
and then the special variable $1 will be whatever it captured:
if( $output =~ m/\sto\sits\sCentroid\sis:\s(\d+([.]\d+)?)$/)
{ print $1; }
See the "Extracting matches" section of the perlretut ("Perl regular expressions tutorial") manual-page.
By the way, \s matches a single white-space character. Usually you'd want either to match only an actual space — write e.g. to its rather than to\sits — or to match one or more white-space characters — e.g. to\s+its.
You print the number you captured in the regex with the parentheses:
print "$1\n" if ($output =~ m/\sto\sits\sCentroid\sis:\s([-+]?\d*\.?\d+)$/);
You also make sure that the regex can pick up a number with a decimal point, and I've allowed an optional sign, too. If you need to worry about optional exponents, add (?:[eE][-+]?\d+)? after the \d+ in my regex.
If you have other things to do with the value, then convert into a regular if statement:
if ($output =~ m/\sto\sits\sCentroid\sis:\s([-+]?\d*\.?\d+)$/)
{
print "$1\n";
process_centroid($1);
}
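One thing to watch for when the command prints more than one line: without /m, $ only matches at the very end of the string (or just before a final newline), so a pattern anchored with $ will miss a line in the middle of $output. The sketch below assumes a made-up three-line output and adds the /m modifier so that $ matches at the end of each line.

```perl
use strict;
use warnings;

# Made-up command output; only the middle line is of interest.
my $output = "Starting\n"
           . "Distance from Segment 12 to its Centroid is: 3.455564713591596\n"
           . "Done\n";

my $value;
# /m lets $ match before every newline, not only at the end of the string.
if ($output =~ /to its Centroid is:\s+([-+]?\d*\.?\d+)$/m) {
    $value = $1;
    print "$value\n";   # 3.455564713591596
}
```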

Why can't I match a substring which may appear 0 or 1 time using /(subpattern)?/

The original string is like this:
checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19
The last part "fail1:19" may appear 0 or 1 time. And I tried to match the number after "fail1:", which is 19, using this:
($reg_suc, $reg_fail) = ($1, $2) if $line =~ /^checksession\s+ok:(\d+).*(fail1:(\d+))?/;
It doesn't work. The $2 variable is empty even if the "fail1:19" part does exist. If I delete the "?", it matches only if the "fail1:19" part exists, and the $2 variable will be "fail1:19". But if the "fail1:19" part doesn't exist, neither $1 nor $2 is set. This is incorrect.
How can I rewrite this pattern to capture the 2 numbers correctly? That means when the "fail1:19" part exists, two numbers will be recorded, and when it doesn't exist, only the number after "ok:" will be recorded.
First, the number in the fail field would end up in $3, as those variables are filled according to the order of opening parentheses. Second, as codaddict shows, the .* construct in a regex is greedy, so it will eat even the fail... part. Third, you can avoid numbered variables like this:
my $line = "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
if(my ($reg_suc, $reg_fail, $addend)
= $line =~ /^checksession\s+ok:(\d+).*?(fail1:(\d+))?$/
) {
warn "$reg_suc\n$reg_fail\n$addend\n";
}
Try the regex:
^checksession\s+ok:(\d+).*?(fail1:(\d+))?$
Ideone Link
Changes made:
.* in the middle has been made non-greedy, and
$ (end anchor) has been added.
As a result of above changes .*? will try to consume as little as possible and the end anchor forces the regex to match till the end of the string, matching fail1:number if present.
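To see the effect on both forms of the input line, here is a hedged sketch. It differs from the regex above in one respect: the outer group is made non-capturing with (?:...), so the fail number lands in the second capture. The @report array and the 'none' placeholder are made up for illustration.

```perl
use strict;
use warnings;

my @lines = (
    "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19",
    "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081",
);

my @report;
for my $line (@lines) {
    # Non-greedy .*? plus the $ anchor: the optional group must be
    # tried before the pattern is allowed to reach the end of the line.
    my ($ok, $fail) = $line =~ /^checksession\s+ok:(\d+).*?(?:fail1:(\d+))?$/;
    push @report, sprintf "ok=%s fail=%s", $ok, defined $fail ? $fail : 'none';
}

# For contrast: greedy .* swallows "fail1:19", the optional group then
# matches nothing, and the capture stays undefined.
my (undef, $greedy_fail) = $lines[0] =~ /^checksession\s+ok:(\d+).*(?:fail1:(\d+))?/;

print "$_\n" for @report;
# ok=6178 fail=19
# ok=6178 fail=none
```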
I think this is one of the few cases where a split is actually more robust than a regex:
$bar[0]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
$bar[1]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081";
for $line (@bar){
(@fields) = split/ /,$line;
$reg_suc = $fields[1];
$reg_fail = $fields[5];
print "$reg_suc $reg_fail\n";
}
I try to avoid the non-greedy modifier. It often bites back. Kudos for suggesting split, but I'd go a step further:
my %rec = split /\s+|:/, ( $line =~ /^checksession (.*)/ )[0];
print "$rec{ok} $rec{fail1}\n";