Perl regular expression variables and matched pattern substitution - regex

Can anyone explain regular expression text substitutions when the regular expression is held in a variable? I'm trying to process some text, Clearcase config specs actually, and substitute text as I go. The rules for the substitution are held in an array of hashes that have the regular expression to match and the text to substitute.
The input text looks something like this:
element /my_elem/releases/... VERSION_STRING.020 -nocheckout
Most of the substitutions are simply to remove lines that contain a specific text string, this works fine. In some cases I want to substitute the text, but re-use the VERSION_STRING text. I've tried using $1 in the substitution expression but it doesn't work. $1 gets the version string in the match, but the replacement of $1 doesn't work in the substitution.
In these cases the output should look something like this:
element -directory /my_elem/releases/... VERSION_STRING.020 -nocheckout
element /my_elem/releases/.../*.[ch] VERSION_STRING.020 -nocheckout
i.e. one input line became two output lines, and the version string has been reused.
The code looks something like this. First the regular expressions and substitutions:
my @Special_Regex = (
{ regex => "\\s*element\\s*\/my_elem_removed\\s*\/main\/\\d+\$", subs => "# Line removed" },
{ regex => "\\s*element\\s*\/my_elem_changed\/releases\/\.\.\.\\s*\(\.\*\$\)",
subs => "element \-directory \/my_elem\/releases\/\.\.\. \\1\nelement \/my_elem\/releases\/\.\.\.\/\*\.\[ch\] \\1" }
);
In the second regex the variable $1 is defined in the portion (.*\$) and this is working correctly. The subs expression does not substitute it, however.
foreach my $line (<INFILE>)
{
    chomp($line);
    my $test = $line;
    foreach my $hash (@Special_Regex)
    {
        my $regex = qr/$hash->{regex}/is;
        if($test =~ s/$regex/$hash->{subs}/)
        {
            print "$test\n";
            print "$line\n";
            print "$1\n";
        }
    }
}
What am I missing? Thanks in advance.

The substitution string in your regex is only getting evaluated once, which transforms $hash->{subs} into its string. You need to evaluate it again to interpolate its internal variables. You can add the e modifier to the end of the regex which tells Perl to run the substitution through eval which can perform the second interpolation among other things. You can apply multiple e flags to evaluate more than once (if you have a problem that needs it). As tchrist helpfully points out, in this case, you need ee since the first eval will just expand the variable, the second is needed to expand the variables in the expansion.
You can find more detail in perlop about the s operator.
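A minimal sketch of the ee approach, under stated assumptions: the rule ($regex and $subs) is a simplified, hypothetical one rather than the asker's, and the replacement is stored as the source text of a double-quoted string, so that the second evaluation is what interpolates $1 and $2. Keep in mind that ee runs the replacement text as Perl code, so it should only be used with rules you control.

```perl
use strict;
use warnings;

# Hypothetical rule. The replacement is stored WITH its double quotes, so the
# second /e evaluation sees a quoted string and interpolates $1 and $2.
my $regex = qr/^\s*element\s+(\S+)\s+(.*)$/;
my $subs  = '"element -directory $1 $2"';

my $line = 'element /my_elem/releases/... VERSION_STRING.020 -nocheckout';

# First e: evaluates the expression $subs, yielding the stored text.
# Second e: evaluates that text as Perl code, expanding $1 and $2.
(my $out = $line) =~ s/$regex/$subs/ee;

print "$out\n";
```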

There is no compilation for a replace expression. So about the only thing you can do is exec or eval it with the e flag:
if($test =~ s/$regex/eval qq["$hash->{subs}"]/e ) { #...
worked for me after changing \\1 to \$1 in the replacement strings.
s/$regex/$hash->{subs}/
only replaces the matched part with the literal value stored in $hash->{subs} as the complete substitution. In order to get the substitution working, you have to force Perl to evaluate the string as a string, so that means you even have to add the dquotes back in in order to get the interpolating behavior you are looking for (because they are not part of the string.)
But that's kind of clumsy, so I changed the replace expressions into subs:
my @Special_Regex
    = (
        { regex => qr{\s*element\s+/my_elem_removed\s*/main/\d+$}
        , subs => sub { '#Line removed' }
        }
      , { regex => qr{\s*element\s+/my_elem_changed/releases/\.\.\.\s*(.*$)}
        , subs => sub {
              return "element -directory /my_elem/releases/... $1\n"
                   . "element /my_elem/releases/.../*.[ch] $1"
                   ;
          }
        }
    );
I got rid of a bunch of stuff that you don't have to escape in a substitution expression. Since what you want to do is interpolate the value of $1 into the replacement string, the subroutine does simply that. And because $1 will be visible until something else is matched, it will be the right value when we run this code.
So now the replacement looks like:
s/$regex/$hash->{subs}->()/e
Of course making it pass $1 makes it a little more bulletproof, because you're not depending on the global $1:
s/$regex/$hash->{subs}->( $1 )/e
Of course, you would change the sub like so:
subs => sub {
my $c1 = shift;
return "element -directory /my_elem/releases/... $c1\n"
. "element /my_elem/releases/.../*.[ch] $c1"
;
}
Just one last note: "\.\.\." didn't do what you think it did. You just ended up with '...' in the regex, which matches any three characters.
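Putting the pieces together, the sub-based approach might look roughly like the sketch below. The names @rules and rewrite_line are made up for illustration, and the second pattern is written against the sample input line from the question (/my_elem/releases/...), which is an assumption about what the real config specs contain.

```perl
use strict;
use warnings;

# Hypothetical rule table: a compiled regex plus a sub that builds the replacement.
my @rules = (
    { regex => qr{^\s*element\s+/my_elem_removed\s*/main/\d+$},
      subs  => sub { '# Line removed' },
    },
    { regex => qr{^\s*element\s+/my_elem/releases/\.\.\.\s+(.*)$},
      subs  => sub {
          my $version = shift;    # the captured text, passed in explicitly
          return "element -directory /my_elem/releases/... $version\n"
               . "element /my_elem/releases/.../*.[ch] $version";
      },
    },
);

# Apply the first rule that matches; lines matching no rule come back unchanged.
sub rewrite_line {
    my ($line) = @_;
    for my $rule (@rules) {
        return $line if $line =~ s/$rule->{regex}/$rule->{subs}->($1)/e;
    }
    return $line;
}

print rewrite_line('element /my_elem/releases/... VERSION_STRING.020 -nocheckout'), "\n";
```

A single e is enough here because the replacement is an ordinary Perl expression (a call through the code reference), not text that needs a second round of interpolation.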

Related

grep a pattern and return all characters before and after another specific character bash

I'm interested in searching a variable inside a log file, in case the search returns something then I wish for all entries before the variable until the character '{' is met and after the pattern until the character '}' is met.
To be more precise let's take the following example:
something something {
entry 1
entry 2
name foo
entry 3
entry 4
}
something something test
test1 test2
test3 test4
In this case I would search for 'name foo' which will be stored in a variable (which I create before in a separate part) and the expected output would be:
{
entry 1
entry 2
name foo
entry 3
entry 4
}
I tried finding something on grep, awk or sed. I was able to only come up with options for finding the pattern and then return all lines until '}' is met, however I can't find a suitable solution for the lines before the pattern.
I found a regex in Perl that could be used but I'm not able to use the variable, in case I switch the variable with 'foo' then I will have output.
grep -Poz '.*(?s)\{[^}]*name\tfoo.*?\}'
The regex is quite simple, once the whole file is read into a variable
use warnings;
use strict;
use feature 'say';
die "Usage: $0 filename\n" if not @ARGV;
my $file_content = do { local $/; <> }; # "slurp" file with given name
my $target = qr{name foo};
while ( $file_content =~ /({ .*? $target .*? })/gsx ) {
say $1;
}
Since we undef-ine the input record separator inside the do block using local, the following read via the null filehandle <> pulls the whole file at once, as a string ("slurps" it). That is returned by the do block and assigned to the variable. The <> reads from the file(s) named in @ARGV, that is, the names given on the command line when the program was invoked.
In the regex pattern, the ? after .* makes it non-greedy, so it matches only up to the first occurrence of the next subpattern: after { the .*? matches up to the first (evaluated) $target, then the $target is matched, then .*? matches everything up to the first }. All that is captured by the enclosing () and is thus later available in $1.
The /s modifier makes . match newlines, which it normally doesn't, and which is necessary in order to match patterns that span multiple lines. With the /g modifier it keeps going through the string searching for all such matches. With /x, literal whitespace in the pattern is ignored, so we can spread out the pattern for readability (even over lines -- and use comments!).
The $target is compiled as a proper regex pattern using the qr operator.
See regex tutorial perlretut, and then there's the full reference perlre.
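If the search text arrives from outside the program (a shell variable passed as an argument, say), a variation along these lines may be safer. This is only a sketch: blocks_containing is a hypothetical helper, \Q...\E makes the target match literally, and [^{}]* is swapped in for .*? so that a match cannot run across a neighbouring block (nested braces are not handled).

```perl
use strict;
use warnings;

# Hypothetical helper: return every {...} block whose body contains $target
# as literal text.
sub blocks_containing {
    my ($text, $target) = @_;
    my @blocks;
    # \Q...\E escapes metacharacters (and the space, which matters under /x).
    # [^{}]* keeps each match inside a single pair of braces.
    while ($text =~ /(\{ [^{}]*? \Q$target\E [^{}]* \})/gx) {
        push @blocks, $1;
    }
    return @blocks;
}

my $log = join "\n",
    'something something {', 'entry 1', 'entry 2', 'name foo',
    'entry 3', 'entry 4', '}', 'something something test', 'test1 test2', '';

print "$_\n" for blocks_containing($log, 'name foo');
```

From a shell, the target could then be handed over as an ordinary argument and read from @ARGV instead of being spliced into the pattern text.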
Here's an Awk attempt which tries to read between the lines to articulate an actual requirement. What I'm guessing you are trying to say is that "if there is an opening brace, print all content between it and the closing brace in case of a match inside the braces. Otherwise, just print the matching line."
We accomplish this by creating a state variable in Awk which keeps track of whether you are in a brace context or not. This simple implementation will not handle nested braces correctly; if that's your requirement, maybe post a new and better question with your actual requirements.
awk -v search="foo" 'n { context[++n] = $0 }
/{/ { delete context; n=0; matched=0; context[++n] = $0 }
/}/ && n { if (matched) for (i=1; i<=n; i++) print context[i];
delete context; n=0 }
$0 ~ search { if(n) matched=1; else print }' file
The variable n is the number of lines in the collected array context; when it is zero, we are not in a context between braces. If we find a match and are collecting lines into context, defer printing until we have collected the whole context. Otherwise, just print the current line.

Regular expression to match exactly and only n times

If I have the lines:
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
'asfdcacttaskdfjcacttklasdjf'
'cksjdfcacttlkasdjf'
I want to match them by the number of times a repeating subunit (cactt) occurs. In other words, if I ask for n repeats, I want matches that contain n and ONLY n instances of the pattern.
My initial attempt was implemented in perl and looks like this:
sub MATCHER {
    print "matches with $_ CACTT's\n";
    my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
    my @grep_matches = grep(/$pattern/, @matching);
    print "$_\n" for @grep_matches;
    my @copy = @grep_matches;
    my $squashed = @copy;
    print "number of rows total: $squashed\n";
}
for (2...6) {
    MATCHER($_);
}
Notes:
@matching contains the strings from 1, 2, and 3 in an array.
the for loop is set from integers 2-6 because I have a separate regex that works to forbid duplicate occurrences of the pattern.
This loop ALMOST works except that for n=2, matches containing 3 occurrences of the "cactt" pattern are returned. In fact, for any string containing n+1 matches (where n>=2), lines with n+1 occurrences are also returned by the match. I thought the negative lookahead could prevent this behavior in perl. If anyone could give me thoughts, I would be appreciative.
Also, I have thought of getting a count per line and separating them by count; I dislike the approach because it requires two steps when one should accomplish what I want.
I would be okay with a:
foreach (@matches) { $_ =~ /$pattern/; push(@selected_by_n, $1);}
The regex seems like it should be similar, but for whatever reason in practice the results differ dramatically.
Thanks in advance!
Your code is sort of strange. This regex
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
..tries to match first beginning of string ^, then a minimal match of any character .*?, followed by your sequence CACTT, followed by a minimal match (but slightly different from .*?) .+?. And you want to match these $_ times. You assume $_ will be correct when calling the sub (this is bad). Then you have a look-ahead assertion that wants to make sure that there is no minimal match of any char .*? followed by your sequence, followed by any char of any length followed by end of line $.
First off, this is always redundant: ^.*. Beginning of line anchor followed by any character any number of times. This actually makes the anchor useless. Same goes for .*$. Why? Because any match that will occur, will occur anyway at the first possible time. And .*$ matches exactly the same thing that the empty string does: Anything.
For example: the regex /^.*?foo.*?$/ matches exactly the same thing as /foo/. (Excluding cases of multiline matching with strings that contain newlines).
In your case, if you want to count the occurrences of a string inside a string, you can just match them like this:
my $count = () = $str =~ /CACTT/gi;
This code:
my @copy = @grep_matches;
my $squashed = @copy;
Is completely redundant. You can just do my $squashed = @grep_matches. It makes little to no sense to first copy the array.
This code:
MATCHER($_);
Does the same as this: MATCHER("foo") or MATCHER(3.1415926536). You are not using the subroutine argument, you are ignoring it, and relying on the fact that $_ is global and visible inside the sub. What you want to do is
sub MATCHER {
    my $number = shift; # shift argument from @_
Now you have encapsulated the code and all is well.
What you want to do in your case, I assume, is to count the occurrences of the substring inside your strings, then report them. I would do something like this
use strict;
use warnings;
use Data::Dumper;
my %data;
while (<DATA>) {
chomp;
my $count = () = /cactt/gi; # count number of matches
push @{ $data{$count} }, $_; # store count and original
}
print Dumper \%data;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
This will print
$VAR1 = {
'2' => [
'asfdcacttaskdfjcacttklasdjf'
],
'3' => [
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
],
'1' => [
'cksjdfcacttlkasdjf'
]
};
This is just to demonstrate how to create the data structure. You can now access the strings in the order of matches. For example:
for (@{ $data{3} }) { # print strings with 3 matches
    print "$_\n";
}
Would you just do something like this:
use warnings;
use strict;
my $n=2;
my $match_line_cnt=0;
my $line_cnt=0;
while (<DATA>) {
my $m_cnt = () = /cactt/g;
if ($m_cnt>=$n){
print;
$match_line_cnt++;
}
$line_cnt++;
}
print "total lines: $line_cnt\n";
print "matched lines: $match_line_cnt\n";
print "squashed: ",$line_cnt-$match_line_cnt;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
prints:
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
total lines: 3
matched lines: 2
squashed: 1
I think you're unintentionally asking two separate questions.
If you want to directly capture the number of times a pattern matches in a string, this one liner is all you need.
$string = 'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf';
$pattern = qr/cactt/;
print $count = () = $string =~ m/$pattern/g;
-> 3
That last line is as if you had written $count = @junk = $string =~ m/$pattern/g; but without needing an intermediate array variable. () = is the null list assignment and it throws away whatever is assigned to it just like scalar undef = throws away its right hand side. But, the null list assignment still returns the number of things thrown away when its left hand side is in scalar context. It returns an empty list in list context.
If you want to match strings that only contain some number of pattern matches, then you want to stop matching once too many are found. If the string is large (like a document) then you would waste a lot of time counting past n.
Try this.
sub matcher {
    my ($string, $pattern, $n) = @_;
    my $c = 0;
    while ($string =~ m/$pattern/g) {
        $c++;
        return if $c > $n;
    }
    return $c == $n ? 1 : ();
}
Now there is one more option but if you call it over and over again it gets inefficient. You can build a custom regex that matches only n times on the fly. If you only build this once however, it's just fine and speedy. I think this is what you originally had in mind.
$regex = qr/^(?:(?:(?!$pattern).)*$pattern){$n}(?:(?!$pattern).)*$/;
I'll leave the rest of that one to you. Check for n > 1 etc. The key is understanding how to use lookahead. You have to match all the NOT THINGS before you try to match THING.
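One way the "rest" might be filled in is sketched below. The helper name exactly_n_regex is made up for illustration, and \A and \z are used in place of ^ and $ as a stylistic choice; the idea is the same as in the regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches only strings containing
# exactly $n non-overlapping occurrences of $pattern.
sub exactly_n_regex {
    my ($pattern, $n) = @_;
    # "Any run of characters in which $pattern does not start".
    my $not_pattern = qr/(?:(?!$pattern).)*/s;
    return qr/\A(?:$not_pattern$pattern){$n}$not_pattern\z/;
}

my @lines = qw(
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
    asfdcacttaskdfjcacttklasdjf
    cksjdfcacttlkasdjf
);

# Build the regex once, then reuse it for every line.
my $two = exactly_n_regex(qr/cactt/i, 2);
print "$_\n" for grep { $_ =~ $two } @lines;
```

With the three sample strings from the question, only the second one (two occurrences) gets through for n = 2.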
https://perldoc.perl.org/perlre

Capture the match content of two different regexp in perl

I am using a while loop with two separate regular expression
while(($string1=~m/(\d+)/igs)==($string2=~m/([^^]*?)\n+/igs)) {}
To store the match from $string1 I have used $temp1=$1.
How can I store the match from $string2? Any suggestions would be appreciated.
my ($m1,$m2);
while (do{
($m1,$m2) = ();
$m1 = $1 if $string1 =~ /(\d+)/igs;
$m2 = $1 if $string2 =~ /([^^]*?)\n+/igs;
defined $m1 == defined $m2;
}) {
# print "$m1-$m2-\n";
}
There might be more clever ways, but I'd just break them up into separate statements:
while (1) {
    $res1 = $string1=~m/(\d+)/igs;
    $temp1 = $1;
    $res2 = $string2=~m/([^^]*?)\n+/igs;
    $temp2 = $1;
    last unless $res1 == $res2;
    ...
}
Just because it's perl you don't have to find the most terse, cryptic way to write something (that's what APL is for).
If the "g" and "s" options aren't really necessary to your task and you actually only want to compare the first matching substrings, you can make a one-line test as follows:
if (($a =~ /regex1/)[0] == ($b =~ /regex2/)[0]) {
...
And if you need to know what the two matched strings were, just add some temporary variables to hold them:
if (($first = ($a =~ /regex1/)[0]) == ($second = ($b =~ /regex2/)[0])) {
...
But if you really want to compare all of the successive matches in each string to see if each pair are equal, there's no single-statement solution I can think of that will do it. Your regexes each return a list and "==" only compares their lengths. You've got to use the first solution proposed above and write out the comparison code in "long-hand".
The second solution above won't work since it will keep testing only the first match in each string.
It's a bit hard to understand what you're trying to do but you could at least drop the "i" option on the first test for /(\d+)/. Presumably the "s" option is only needed for the second string since you're looking for embedded new-lines.

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the string start for $1 on this case.
Try this pattern
foreach (@list) {
    s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
    print $_ . "\n";
}
Explanations:
I use the branch reset feature here (i.e. (?|...()...|...()...)), which lets several alternative capturing groups share a single number ($1 here). That way you avoid a second substitution to trim a zero from the left of the capture.
To remove everything from the beginning up to the number, I use:
.*? # any characters, zero or more times
# (the ? makes the * quantifier lazy, so it matches as little as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
    if (/(\d{2})/) {
        $_ = "$1.txt";
    }
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
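As a self-contained sketch of the same idea (short_name is a hypothetical helper name, not something from the question), the capture and the numeric conversion can be combined like this:

```perl
use strict;
use warnings;

# Hypothetical helper: take the first run of two digits in a file name and
# return "<number>.txt"; adding 0 drops any leading zero.
sub short_name {
    my ($name) = @_;
    return $name unless $name =~ /(\d{2})/;   # leave non-matching names alone
    return (0 + $1) . '.txt';
}

print short_name($_), "\n" for 'R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt';
```

For the three names from the question this prints 5.txt, 12.txt and 1.txt.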
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/.*(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The problem is that the first part of your s/// matches what you think it does, but the second part isn't replacing what you think it should. s/// only replaces the text that was actually matched. Thus to replace something like T3_ you will have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

how do you match two strings in two different variables using regular expressions?

$a='program';
$b='programming';
if ($b=~ /[$a]/){print "true";}
this is not working
thanks every one i was a little confused
The [] in regex mean character class which match any one of the character listed inside it.
Your regex is equivalent to:
$b=~ /[program]/
which returns true as character p is found in $b.
To see whether the match happens you are printing true. Note that an unquoted print true; will not show anything, because Perl parses the bareword as a filehandle; print a quoted string instead.
But if you wanted to see if one string is present inside another you have to drop the [..] as:
if ($b=~ /$a/) { print 'true';}
If the variable $a contains any regex metacharacter, the above match may not behave as intended. To fix that, place the variable between \Q and \E so that any metacharacters in it are escaped:
if ($b=~ /\Q$a\E/) { print 'true';}
Assuming either variable may come from external input, please quote the variables inside the regex:
if ($b=~ /\Q$a\E/){print "true";}
You then won't get burned when the pattern you'll be looking for will contain "reserved characters" like any of -[]{}().
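A small sketch of the difference, using made-up values ($needle and $haystack are illustrative names only):

```perl
use strict;
use warnings;

my $needle   = 'a.c';   # hypothetical input containing a metacharacter
my $haystack = 'abc';   # does NOT contain the literal text "a.c"

# Without \Q...\E the "." is a wildcard, so the pattern matches "abc".
my $unquoted = $haystack =~ /$needle/ ? 1 : 0;

# With \Q...\E the "." must match a literal dot, so there is no match.
my $quoted = $haystack =~ /\Q$needle\E/ ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```

This prints "unquoted: 1, quoted: 0", which is exactly the kind of false positive the quoting guards against.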
(apart the missing semicolons:) Why do you put $a in square brackets? This makes it a list of possible characters. Try:
$b =~ /\Q${a}\E/
Update
To answer your remarks regarding = and =~:
=~ is the matching operator, and specifies the variable to which you are applying the regex ($b) in your example above. If you omit =~, then Perl will automatically use an implied $_ =~.
In list context, a match returns a list of the captured groups, which you usually assign to a list, as in ($match1, $match2) = $b =~ /.../;. In scalar context, on the other hand, the match simply returns true (1) or false (the empty string).
So if you write $b = /\Q$a\E/, you'll end up with $b = $_ =~ /\Q$a\E/.
$a='program';
$b='programming';
if ( $b =~ /\Q$a\E/) {
print "match found\n";
}
If you're just looking for whether one string is contained within another and don't need to use any character classes, quantifiers, etc., then there's really no need to fire up the regex engine to do an exact literal match. Consider using index instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'program';
my $string = 'programming';
if (index($string, $target) > -1) {
print "target is in string\n";
}