Replace text between START & END strings excluding the END string in perl - regex

I went through examples and questions on the web about finding and replacing text between two strings (say START and END) using Perl, and the suggested solutions worked for me. In those solutions, START and END are themselves part of the replaced text. The syntax I used was s/START(.*?)END/replace_text/s, which replaces multiple lines between START and END and stops at the first occurrence of END.
But I would like to know how to replace the text between START and END while keeping END, as shown below, using Perl only.
Before:
START
I am learning pattern matching.
I am using perl.
END
After:
Successfully replaced.
END

To check for END without including it in the match, you can use a positive lookahead:
s/START.*?(?=END)/replace_text/s
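As a minimal sketch of the lookahead approach applied to the sample from the question (the helper name replace_before_end is made up for illustration, and a newline is appended to the replacement so that END stays on its own line):

```perl
use strict;
use warnings;

# Hypothetical helper: replace everything from START up to, but not
# including, the first END. The (?=END) lookahead checks that END follows
# without consuming it, so END remains in the text.
sub replace_before_end {
    my ($text, $replacement) = @_;
    (my $copy = $text) =~ s/START.*?(?=END)/$replacement\n/s;
    return $copy;
}

my $input = "START\nI am learning pattern matching.\nI am using perl.\nEND\n";
print replace_before_end($input, 'Successfully replaced.');
```

Only the first START..END block is replaced here; adding /g would replace every block.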

One solution is to capture the END and use it in the replacement text.
s/START(.*?)(END)/replace_text$2/s

Another option is to use the range operator .. to skip every line of input until you find the end marker of a block, then output the replacement string and the end marker:
#!/usr/bin/perl
use strict;
use warnings;
my $rep_str = 'Successfully replaced.';
while (<>) {
    my $switch = m/^START/ .. /^END/;
    print unless $switch;
    print "$rep_str\n$_" if $switch =~ m/E0$/;
}
It is quite easy to adapt it to work for an array of strings:
foreach (@strings) {
    my $switch = ...
    ...
}

To use look-around assertions across lines, the whole START..END block must be in a single string, so you need to redefine the input record separator ($/) (see perlvar), perhaps to slurp the whole file into memory. To avoid this, the range ("flip-flop") operator is quite useful:
while (<>) {
    if (/^START/../^END/) {
        next unless m{^END};
        print "substituted_text\n";
        print;
    }
    else {
        print;
    }
}
The above preserves any lines in the output that precede or follow the START/END block.
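The same flip-flop logic can be sketched for lines that are already in memory. The helper name replace_blocks and the sample lines are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines with each START..END block replaced
# by $replacement, keeping the END line itself.
# Note: the flip-flop keeps its state between calls, so this sketch assumes
# every START is eventually followed by an END within the same call.
sub replace_blocks {
    my ($replacement, @lines) = @_;
    my @out;
    for (@lines) {
        if (/^START/ .. /^END/) {
            # Inside a block: drop every line except the closing END marker.
            next unless /^END/;
            push @out, "$replacement\n";
        }
        push @out, $_;
    }
    return @out;
}

print replace_blocks('Successfully replaced.',
    "before\n", "START\n", "I am using perl.\n", "END\n", "after\n");
```

Lines outside a block pass through unchanged, so nothing is buffered beyond the output list.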


grep a pattern and return all characters before and after another specific character bash

I want to search a log file for a value stored in a variable. If the search finds something, I want every line before the match back to the character '{' and every line after the match up to the character '}'.
To be more precise let's take the following example:
something something {
entry 1
entry 2
name foo
entry 3
entry 4
}
something something test
test1 test2
test3 test4
In this case I would search for 'name foo' which will be stored in a variable (which I create before in a separate part) and the expected output would be:
{
entry 1
entry 2
name foo
entry 3
entry 4
}
I looked for a solution using grep, awk or sed. I was only able to come up with options that find the pattern and then return all lines until '}' is met; I can't find a suitable solution for the lines before the pattern.
I found a Perl-style regex that could be used, but I'm not able to make it work with the variable; if I replace the variable with the literal 'foo', I do get output.
grep -Poz '.*(?s)\{[^}]*name\tfoo.*?\}'
The regex is quite simple, once the whole file is read into a variable
use warnings;
use strict;
use feature 'say';
die "Usage: $0 filename\n" if not @ARGV;
my $file_content = do { local $/; <> }; # "slurp" file with given name
my $target = qr{name foo};
while ( $file_content =~ /({ .*? $target .*? })/gsx ) {
    say $1;
}
Since we undefine the input record separator inside the do block using local, the following read via the null filehandle <> pulls in the whole file at once, as a string ("slurps" it). That string is returned by the do block and assigned to the variable. The <> reads from the file(s) named in @ARGV, that is, the names given on the command line when the program was invoked.
In the regex pattern, the ? after .* makes the quantifier non-greedy, so it matches only up to the first occurrence of the next subpattern: after { the .*? matches up to the first (interpolated) $target, then $target is matched, then .*? matches everything up to the first }. All of that is captured by the enclosing () and is thus available later in $1.
The /s modifier makes . match newlines, which it normally doesn't; this is necessary in order to match patterns that span multiple lines. With the /g modifier the regex keeps going through the string, searching for all such matches. With /x, literal whitespace in the pattern is ignored, so we can spread out the pattern for readability (even over several lines, with comments).
The $target is compiled as a proper regex pattern using the qr operator.
See regex tutorial perlretut, and then there's the full reference perlre.
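Here is a small sketch of the same idea working on a string rather than a file. It is a variation, not the code above: the helper name blocks_containing is made up, the search text is quoted with \Q...\E so it is matched literally, and [^{}]* is swapped in for .*? so that a match cannot run across the braces of a neighbouring block (nested braces are still not handled).

```perl
use strict;
use warnings;

# Hypothetical helper: in list context, return every {...} block in $text
# that contains $needle. Under /x the spaces in the pattern are ignored,
# while the space inside the quoted needle is still matched literally.
sub blocks_containing {
    my ($text, $needle) = @_;
    return $text =~ /(\{ [^{}]*? \Q$needle\E [^{}]* \})/gx;
}

my $log = "something something {\nentry 1\nname foo\nentry 2\n}\n"
        . "other {\nname bar\n}\n";
print "$_\n" for blocks_containing($log, 'name foo');
```

Because the needle is an ordinary string, it can come straight from a variable set elsewhere in the program.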
Here's an Awk attempt which tries to read between the lines to articulate an actual requirement. What I'm guessing you are trying to say is that "if there is an opening brace, print all content between it and the closing brace in case of a match inside the braces. Otherwise, just print the matching line."
We accomplish this by creating a state variable in Awk which keeps track of whether you are in a brace context or not. This simple implementation will not handle nested braces correctly; if that's your requirement, maybe post a new and better question with your actual requirements.
awk -v search="foo" 'n { context[++n] = $0 }
/{/ { delete context; n=0; matched=0; context[++n] = $0 }
/}/ && n { if (matched) for (i=1; i<=n; i++) print context[i];
delete context; n=0 }
$0 ~ search { if(n) matched=1; else print }' file
The variable n is the number of lines in the collected array context; when it is zero, we are not in a context between braces. If we find a match and are collecting lines into context, defer printing until we have collected the whole context. Otherwise, just print the current line.

Perl Regex Find and Return Every Possible Match

I'm trying to create a while loop that will find every possible substring within a string. But so far all I can match is the longest instance or the shortest. So for example I have the string
EDIT CHANGE STRING FOR DEMO PURPOSES
"A.....B.....B......B......B......B"
And I want to find every possible sequence of "A.......B"
This code will give me the shortest possible return and exit the while loop
while($string =~ m/(A(.*?)B)/gi) {
    print "found\n";
    my $substr = $1;
    print $substr."\n";
}
And this will give me the longest and exit the while loop.
$string =~ m/(A(.*)B)/gi
But I want it to loop through the string returning every possible match. Does anyone know if Perl allows for this?
EDIT ADDED DESIRED OUTPUT BELOW
found
A.....B
found
A.....B.....B
found
A.....B.....B......B
found
A.....B.....B......B......B
found
A.....B.....B......B......B......B
There are various ways to parse the string to scoop up what you want.
For example, use a regex to step through all A...A substrings and process each capture:
use warnings;
use strict;
use feature 'say';
my $s = "A.....B.....B......B......B......B";
while ($s =~ m/(A.*)(?=A|$)/gi) {
    my @seqs = split /(B)/, $1;
    for my $i (0..$#seqs) {
        say @seqs[0..$i] if $i % 2 != 0;
    }
}
The (?=A|$) is a lookahead, so .* matches everything up to an A (or the end of string) but that A is not consumed and so is there for the next match. The split uses () in the separator pattern so that the separator, too, is returned (so we have all those B's). It only prints for an even number of elements, so only substrings ending with the separator (B here).
The above prints
A.....B
A.....B.....B
A.....B.....B......B
A.....B.....B......B......B
A.....B.....B......B......B......B
There may be bioinformatics modules that do this but I am not familiar with them.
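As an alternative sketch that avoids regexes altogether, the positions of the markers can be walked with index. This is a different technique from the split-based code above; the helper name all_spans is made up, and matching is case-sensitive here.

```perl
use strict;
use warnings;

# Hypothetical helper: return every substring of $s that begins at an
# occurrence of $start and ends at a later occurrence of $end.
sub all_spans {
    my ($s, $start, $end) = @_;
    my @found;
    my $i = -1;
    while (($i = index($s, $start, $i + 1)) >= 0) {
        my $j = $i;
        while (($j = index($s, $end, $j + 1)) >= 0) {
            push @found, substr($s, $i, $j - $i + length($end));
        }
    }
    return @found;
}

print "$_\n" for all_spans("A.....B.....B......B", "A", "B");
```

The nested loop produces one result per (start, end) pair, so the number of results grows with the product of the two counts.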

Deleting a line with a pattern unless another pattern is found?

I have a very messy data file, that can look something like this
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
I would like to delete all the lines with "=" only in them, but keep the line if TOTAL is also in the line.
My code is as follows:
for my $file (glob '*.csv') {
    open my $in, '<', $file;
    my @lines;
    while (<$in>) {
        next if /===/; #THIS IS THE PROBLEM
        push @lines, $_;
    }
    close $in;
    open my $out, '>', $file;
    print $out $_ for @lines;
    close $out;
}
I was wondering if there was a way to do this in perl with regular expressions. I was thinking something like letting "TOTAL" be condition 1 and "===" be condition 2. Then, perhaps if both conditions are satisfied, the script leaves the line alone, but if only one or zero are fulfilled, then the line is deleted?
Thanks in advance!
You need \A or ^ to check whether the string starts with = or not. Put an anchor in the regex, like:
next if /^===/;
or if only = is going to exist then:
next if /^=+/;
This will skip all the lines beginning with =. The + matches one or more occurrences of the previous token.
Edit:
Then you should use a negative lookbehind, like
next if /(?<!TOTAL)===/;
This ensures that your === is not immediately preceded by TOTAL.
As any number of characters may occur between TOTAL and ===, I suggest using two regexes to ensure that the string contains === but does not contain TOTAL, like:
next if (($_ =~ /===/) && ($_ !~ /TOTAL/));
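You can use a negative lookbehind assertion:
next if /(?<!TOTAL)===/;
This matches === when it is NOT immediately preceded by TOTAL. Note that in the sample line "df as TOTAL ============" a space separates TOTAL from the = signs, so that line would still match and be skipped.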
As a general rule, you should avoid making your regexes more complicated. Compressing too many things into a single regex may seem clever, but it makes it harder to understand and thus debug.
So why not just do a compound condition?
E.g. like this:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
    next if ( m/===/ and not m/TOTAL/ );
    push @lines, $_;
}
print $_ for @lines;
__DATA__
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
This will skip any lines containing ===, as long as they don't contain TOTAL. And it doesn't need advanced regex features, which I assure you will get your maintenance programmers cursing you.
Your current regex will pick up anything that contains the string === anywhere in the string.
Hello===        Match
===goodbye      Match
=======         Match
foo======bar    Match
===             Match
=               No Match
Hello==         No Match
=========       Match
If you wanted to ensure it picks up only strings made up of = signs then you would need to anchor to the start and the end of the line and account for any number of = signs. The regex that will work will be as follows:
next if /^=+$/;
Each symbol's meaning:
^ The start of the string
= A literal "=" sign
+ One or more of the previous
$ The end of the string
This will pick up a string of any length from the start of the string to the end of the string made up of only = signs.
Hello===        No Match
===goodbye      No Match
=======         Match
foo======bar    No Match
===             Match
=               Match
Hello==         No Match
=========       Match
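The anchored pattern can be checked against the same test strings with a short script. This is only a sketch; the helper name is_rule_line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line consists solely of '=' characters.
# The $ anchor also matches just before a trailing newline, so lines read
# from a file do not need to be chomped first.
sub is_rule_line {
    my ($line) = @_;
    return $line =~ /^=+$/ ? 1 : 0;
}

for my $line ('Hello===', '===goodbye', '=======', 'foo======bar',
              '===', '=', 'Hello==', '=========') {
    printf "%-14s %s\n", $line, is_rule_line($line) ? 'Match' : 'No Match';
}
```

A line such as "df as TOTAL ============" is never matched by this pattern, so it is kept without a separate check for TOTAL.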
I suggest you read up on Perl's regexes and what each symbol means; they can be a very powerful tool if you know what's going on.
http://perldoc.perl.org/perlre.html#Regular-Expressions
EDIT:
If you want to skip a line on matching both TOTAL and the = then just put in 2 checks:
next if(/TOTAL/ and /=+/)
This can probably be done with a single line of regex. But why bother making it complicated and less readable?

Find the occurrences of a particular string in a given string?

Hi friends, I'm working in Perl and need to check whether a given string occurs in the elements of an array. I have tried a lot but can't get it to work. Can anybody help?
open FILE, "<", "tab_name.txt" or die $!;
my @tabName=<FILE>;
my @all_files= <*>;
foreach my $tab(@tabName){
    $_=$tab;
    my $pgr="PGR Usage";
    if(m/$pgr/)
    {
        for(my $t=0;scalar @all_files;$t++){
            my $file_name='PGR.csv';
            $_=$all_files[$t];
            if(m\$file_name\)
            {
                print $file_name;
            }
        }
        print "\n$tab\n";
    }
}
Here is a problem:
for(my $t=0;scalar @all_files;$t++){
The second part of the for loop needs to be a condition, such as:
for(my $t=0;$t < @all_files;$t++){
Your code as written will never end.
However, this is much better:
foreach (@all_files){
In addition, you have a problem with your regex. A variable in a regex is treated as a regular expression. . is a special character matching anything. Thus, your code would match PGR.csv, but also PGRacsv, etc. And it would also match filenames where that is a part of the name, such as FOO_PGR.csvblah. To solve this:
Use quote literal (\Q...\E) to make sure the filename is treated literally.
Use anchors to match the beginning and end of the string (^, $).
Also, backslashes are valid as regex delimiters, but they are a strange choice.
The corrected regex looks like this:
m/^\Q$file_name\E$/
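A brief sketch of the difference, using the filenames mentioned above (the helper name is_exact_name is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: true only when $name is exactly $wanted.
# \Q...\E stops the '.' in $wanted from acting as a wildcard, and ^...$
# rules out partial matches (a single trailing newline is still tolerated).
sub is_exact_name {
    my ($name, $wanted) = @_;
    return $name =~ /^\Q$wanted\E$/ ? 1 : 0;
}

for my $name ('PGR.csv', 'PGRacsv', 'FOO_PGR.csvblah') {
    printf "%-16s unquoted=%d exact=%d\n",
        $name, ($name =~ /PGR.csv/ ? 1 : 0), is_exact_name($name, 'PGR.csv');
}
```

When an exact comparison is all that is needed, the eq operator does the same job without a regex.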
Also, you should put this at the top of every script you write:
use warnings;
use strict;
This line:
for(my $t=0;scalar @all_files;$t++){
produces an infinite loop, you'd better use:
for(my $t=0;$t < @all_files;$t++){
Aside from the problems you have going through the array, are you looking for substr?

cant get the perl regex to work

My perl is getting rusty. It only prints "matched=" but $1 is blank!?!
EDIT 1: WHo the h#$! downvoted this? There are no wrong questions. If you dont like it, move on to next one!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([.\n\r]+)/gsi) {
    print "matched=", $1, "\n";
} else {
    print "not matched!\n";
}
EDIT 2: This is the code fragment with updated regex, works great!
$crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/([\s\S]+)/gsi) {
    print "matched=", $1, "\n";
} else {
    print "not matched!\n";
}
EDIT 3: Haha, i see perl police strikes yet again!!!
I don't know if this is your exact problem, but inside square brackets, '.' is just looking for a period. I didn't see a period in the input, so I wondered which you meant.
Aside from the period, the rest of the character class is looking for consecutive whitespace. And as you didn't use the multiline switch, you've got newlines being counted as whitespace (and any character), but no indication to scan beyond the first record separator. But because of the way that you print it out, it also gives some indication that you meant more than the literal period, as mentioned above.
Axeman is correct; your problem is that . in a character class doesn't do what you expect.
By default, . outside a character class (and not backslashed) matches any character but a newline. If you want to include newlines, you specify the /s flag (which you seem to already have) on your regex or put the . in a (?s:...) group:
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if ($crazy =~ m/((?s:.+))/) {
    print "matched=", $1, "\n";
} else {
    print "not matched!\n";
}
. in a character class is a literal period, not "match anything". What you really want is /(.+)/s. The /g flag says to match multiple times, but you are using the regex in scalar context, so it will only match the first item. The /i flag makes the regex case-insensitive, but there are no characters with case in your regex. The /s flag makes . match newlines, and . always matches "\r", so instead of [.\n\r] you can just use a plain . (dot).
However, /(.+)/s will match any string with one or more characters, so you would be better off with
my $crazy="abcd\r\nallo\nXYZ\n\n\nQQQ";
if (length $crazy) {
    print "matched=$crazy\n";
} else {
    print "not matched!\n";
}
It is possible you meant to do something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
while ($crazy =~ /(.+)[\r\n]+/g) {
    print "matched=$1\n";
}
But that would probably be better phrased:
#!/usr/bin/perl
use strict;
use warnings;
my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";
for my $part (split /[\r\n]+/, $crazy) {
    print "matched=$part\n";
}
$1 contains only whitespace, which is why you don't see it when it is printed like that; add something after it or quote it.
Example:
perl -E "qq'abcd\r\nallo\nXYZ\n\n\nQQQ'=~/([.\n\r]+)/gsi;say 'got(',length($1),qq') >$1<';"
got(2) >
<
Updated for your comments:
To match everything you can simply use /(.+)/s
[.] (dot inside a character class) does not mean "match any character", it just means match the literal . character. So in an input string without any dots,
m/([.\n\r]+)/gsi
will just match strings of \n and \r characters.
With the /s modifier, you are already asking the regex engine to include newlines with . (match any character), so you could just write
m/(.+)/gsi
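To see the two behaviours side by side, here is a short sketch that compares the lengths of the two captures on the string from the question (the variable names $class_match and $dot_match are made up for illustration):

```perl
use strict;
use warnings;

my $crazy = "abcd\r\nallo\nXYZ\n\n\nQQQ";

# Inside [...] the dot is a literal period, so this captures only the first
# run of \r and \n characters (the string contains no periods).
my ($class_match) = $crazy =~ /([.\n\r]+)/;

# With /s the dot matches any character, newlines included.
my ($dot_match) = $crazy =~ /(.+)/s;

printf "character class captured %d characters\n", length($class_match);
printf "dot with /s captured %d of %d characters\n",
    length($dot_match), length($crazy);
```

Printing lengths rather than the captures themselves makes the whitespace-only match visible.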