Deleting a line with a pattern unless another pattern is found? - regex

I have a very messy data file, that can look something like this
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
I would like to delete all the lines with "=" only in them, but keep the line if TOTAL is also in the line.
My code is as follows:
for my $file (glob '*.csv') {
    open my $in, '<', $file;
    my @lines;
    while (<$in>) {
        next if /===/; #THIS IS THE PROBLEM
        push @lines, $_;
    }
    close $in;
    open my $out, '>', $file;
    print $out $_ for @lines;
    close $out;
}
I was wondering if there was a way to do this in perl with regular expressions. I was thinking something like letting "TOTAL" be condition 1 and "===" be condition 2. Then, perhaps if both conditions are satisfied, the script leaves the line alone, but if only one or zero are fulfilled, then the line is deleted?
Thanks in advance!

You need \A or ^ to check whether the string starts with =. Put an anchor in the regex, like:
next if /^===/;
or, if only = is going to appear there:
next if /^=+/;
This skips all lines beginning with =. The + matches one or more occurrences of the previous token.
Edit:
Then you should use a negative lookbehind, like
next if /(?<!TOTAL)===/;
This ensures that the === is not immediately preceded by TOTAL.
Since any number of characters may occur between TOTAL and ===, I suggest using two regexes instead, to check that the string contains === but does not contain TOTAL:
next if (($_ =~ /===/) && ($_ !~ /TOTAL/));
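To see the difference between the two approaches, here is a small sketch run against the sample lines from the question. The sub names skip_lookbehind and skip_two_checks are made up for this illustration. The lookbehind only inspects the five characters directly before each ===, so the sample TOTAL line (which has a space before the = signs) is still skipped by it.

```perl
use strict;
use warnings;

# Sample lines from the question.
my @sample = (
    '========',
    'Line 1',
    'dfa====dsfdas==',
    'Line 2',
    'df as TOTAL ============',
);

# Hypothetical helper: true if the lookbehind version would skip the line.
sub skip_lookbehind {
    my ($line) = @_;
    return $line =~ /(?<!TOTAL)===/ ? 1 : 0;
}

# Hypothetical helper: true if the two-regex version would skip the line.
sub skip_two_checks {
    my ($line) = @_;
    return ( $line =~ /===/ && $line !~ /TOTAL/ ) ? 1 : 0;
}

# Show, per line, what each approach would do.
for my $line (@sample) {
    printf "%-26s lookbehind:%s two-checks:%s\n", $line,
        skip_lookbehind($line) ? 'skip' : 'keep',
        skip_two_checks($line) ? 'skip' : 'keep';
}
```

Only the two-regex version keeps the last sample line.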

You can use a negative lookbehind assertion:
next if /(?<!TOTAL)===/;
This matches === only when it is not immediately preceded by TOTAL.

As a general rule, you should avoid making your regexes more complicated. Compressing too many things into a single regex may seem clever, but it makes it harder to understand and thus debug.
So why not just do a compound condition?
E.g. like this:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
    next if ( m/====/ and not m/TOTAL/ );
    push @lines, $_;
}
print $_ for @lines;
__DATA__
========
Line 1
dfa====dsfdas==
Line 2
df as TOTAL ============
This skips any line that contains ==== unless it also contains TOTAL. It also doesn't need advanced regex features, which I assure you will get your maintenance programmers cursing you.

Your current regex will pick up anything that contains the string === anywhere in the string.
Hello=== Match
===goodbye Match
======= Match
foo======bar Match
=== Match
= No Match
Hello== No Match
========= Match
If you wanted to ensure it picks up only strings made up of = signs then you would need to anchor to the start and the end of the line and account for any number of = signs. The regex that will work will be as follows:
next if /^=+$/;
Each symbol's meaning:
^ The start of the string
= A literal "=" sign
+ One or more of the previous
$ The end of the string
This will pick up a string of any length from the start of the string to the end of the string made up of only = signs.
Hello=== No Match
===goodbye No Match
======= Match
foo======bar No Match
=== Match
= Match
Hello== No Match
========= Match
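As a minimal sketch of how the anchored pattern behaves on the question's data, the hypothetical helper drop_separator_lines below filters a list of lines with it. Because the pattern only matches lines made up entirely of = signs, the TOTAL line and the mixed line are kept without any extra check. Note that $ also matches just before a trailing newline, so lines read from a file do not need chomp first.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines that are not made up
# entirely of "=" characters.
sub drop_separator_lines {
    my @lines = @_;
    return grep { !/^=+$/ } @lines;
}

# Sample lines from the question, with newlines as read from a file.
my @sample = (
    "========\n",
    "Line 1\n",
    "dfa====dsfdas==\n",
    "Line 2\n",
    "df as TOTAL ============\n",
);

print drop_separator_lines(@sample);
```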
I suggest you read up on Perl's regexes and what each symbol means. They can be a very powerful tool if you know what's going on.
http://perldoc.perl.org/perlre.html#Regular-Expressions
EDIT:
If you want to skip a line that contains = signs but keep it when it also contains TOTAL, then just put in 2 checks:
next if (/=+/ and not /TOTAL/);
This can probably be done with a single regex. But why bother making it complicated and less readable?

Related

grep a pattern and return all characters before and after another specific character bash

I'm interested in searching for the value of a variable inside a log file. If the search returns something, I want all entries before the match back to the character '{' and all entries after it up to the character '}'.
To be more precise let's take the following example:
something something {
entry 1
entry 2
name foo
entry 3
entry 4
}
something something test
test1 test2
test3 test4
In this case I would search for 'name foo' which will be stored in a variable (which I create before in a separate part) and the expected output would be:
{
entry 1
entry 2
name foo
entry 3
entry 4
}
I tried finding something on grep, awk or sed. I was able to only come up with options for finding the pattern and then return all lines until '}' is met, however I can't find a suitable solution for the lines before the pattern.
I found a Perl-style regex that could be used, but I'm not able to use the variable with it; if I replace the variable with a literal 'foo', I do get output.
grep -Poz '.*(?s)\{[^}]*name\tfoo.*?\}'
The regex is quite simple once the whole file is read into a variable:
use warnings;
use strict;
use feature 'say';
die "Usage: $0 filename\n" if not @ARGV;
my $file_content = do { local $/; <> }; # "slurp" file with given name
my $target = qr{name foo};
while ( $file_content =~ /({ .*? $target .*? })/gsx ) {
say $1;
}
Since we undefine the input record separator inside the do block using local, the following read via the null filehandle <> pulls in the whole file at once, as a string ("slurps" it). That string is returned by the do block and assigned to the variable. The <> reads from the file(s) named in @ARGV, that is, whatever was submitted on the command line at the program's invocation.
In the regex pattern, the ? quantifier makes .* match only up to the first occurrence of the next subpattern, so after { the .*? matches up to the first (evaluated) $target, then the $target is matched, then .*? matches everything up to the first }. All of that is captured by the enclosing () and is thus available later in $1.
The /s modifier makes . match newlines, which it normally doesn't; this is necessary in order to match patterns that span multiple lines. With the /g modifier the regex keeps going through the string, searching for all such matches. With /x, whitespace in the pattern is ignored, so we can spread out the pattern for readability (even over several lines, with comments).
The $target is compiled as a proper regex pattern using the qr operator.
See regex tutorial perlretut, and then there's the full reference perlre.
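Since the question asks about a search term held in a variable, here is a hedged variation on the same idea. It is a sketch, not the answer's exact pattern: the helper name blocks_containing is made up, the term is treated as literal text via \Q...\E, and [^{}]* is used in place of .*? so that a match cannot run across the closing brace of an earlier, unrelated block. Nested braces are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: return every {...} block in $content that contains
# the literal text $target. Nested braces are not handled.
sub blocks_containing {
    my ($content, $target) = @_;
    my @blocks;
    # \Q...\E quotes regex metacharacters (and spaces, which matters under /x).
    while ( $content =~ / ( \{ [^{}]* \Q$target\E [^{}]* \} ) /gx ) {
        push @blocks, $1;
    }
    return @blocks;
}

# Made-up sample input with one unrelated block before the wanted one.
my $content = <<'END';
other {
entry 0
}
something something {
entry 1
name foo
entry 2
}
something something test
END

print "$_\n" for blocks_containing($content, 'name foo');
```

With this input only the second block is printed, from its opening { to its closing }.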
Here's an Awk attempt which tries to read between the lines to articulate an actual requirement. What I'm guessing you are trying to say is that "if there is an opening brace, print all content between it and the closing brace in case of a match inside the braces. Otherwise, just print the matching line."
We accomplish this by creating a state variable in Awk which keeps track of whether you are in a brace context or not. This simple implementation will not handle nested braces correctly; if that's your requirement, maybe post a new and better question with your actual requirements.
awk -v search="foo" 'n { context[++n] = $0 }
/{/ { delete context; n=0; matched=0; context[++n] = $0 }
/}/ && n { if (matched) for (i=1; i<=n; i++) print context[i];
delete context; n=0 }
$0 ~ search { if(n) matched=1; else print }' file
The variable n is the number of lines in the collected array context; when it is zero, we are not in a context between braces. If we find a match and are collecting lines into context, defer printing until we have collected the whole context. Otherwise, just print the current line.

Regular expression to match exactly and only n times

If I have the lines:
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
'asfdcacttaskdfjcacttklasdjf'
'cksjdfcacttlkasdjf'
I want to match them by the number of times a repeating subunit (cactt) occurs. In other words, if I ask for n repeats, I want matches that contain n and ONLY n instances of the pattern.
My initial attempt was implemented in perl and looks like this:
sub MATCHER {
    print "matches with $_ CACTT's\n";
    my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
    my @grep_matches = grep(/$pattern/, @matching);
    print "$_\n" for @grep_matches;
    my @copy = @grep_matches;
    my $squashed = @copy;
    print "number of rows total: $squashed\n";
}
for (2...6) {
    MATCHER($_);
}
Notes:
@matching contains the strings from 1, 2, and 3 in an array.
the for loop is set from integers 2-6 because I have a separate regex that works to forbid duplicate occurrences of the pattern.
This loop ALMOST works, except that for n=2, matches containing 3 occurrences of the "cactt" pattern are returned. In fact, for any n>=2, lines with n+1 occurrences are also returned by the match. I thought the negative lookahead would prevent this behavior in Perl. If anyone could give me thoughts, I would be appreciative.
Also, I have thought of getting a count per line and separating them by count; I dislike the approach because it requires two steps when one should accomplish what I want.
I would be okay with a:
foreach (@matches) { $_ =~ /$pattern/; push(@selected_by_n, $1); }
The regex seems like it should be similar, but for whatever reason in practice the results differ dramatically.
Thanks in advance!
Your code is sort of strange. This regex
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
..tries to match first beginning of string ^, then a minimal match of any character .*?, followed by your sequence CACTT, followed by a minimal match (but slightly different from .*?) .+?. And you want to match these $_ times. You assume $_ will be correct when calling the sub (this is bad). Then you have a look-ahead assumption that wants to make sure that there is no minimal match of any char .*? followed by your sequence, followed by any char of any length followed by end of line $.
First off, this is always redundant: ^.*. Beginning of line anchor followed by any character any number of times. This actually makes the anchor useless. Same goes for .*$. Why? Because any match that will occur, will occur anyway at the first possible time. And .*$ matches exactly the same thing that the empty string does: Anything.
For example: the regex /^.*?foo.*?$/ matches exactly the same thing as /foo/. (Excluding cases of multiline matching with strings that contain newlines).
In your case, if you want to count the occurrences of a string inside a string, you can just match them like this:
my $count = () = $str =~ /CACTT/gi;
This code:
my @copy = @grep_matches;
my $squashed = @copy;
is completely redundant. You can just do my $squashed = @grep_matches. It makes little to no sense to first copy the array.
This code:
MATCHER($_);
Does the same as this: MATCHER("foo") or MATCHER(3.1415926536). You are not using the subroutine argument, you are ignoring it, and relying on the fact that $_ is global and visible inside the sub. What you want to do is
sub MATCHER {
    my $number = shift; # shift argument from @_
Now you have encapsulated the code and all is well.
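Putting the two points together (pass the count as an argument, and count matches instead of encoding the count in the regex), a minimal sketch could look like the following. The sub name lines_with_exactly is hypothetical, and the case-insensitive /cactt/gi pattern is taken from the counting line shown earlier in this answer.

```perl
use strict;
use warnings;

# Hypothetical rewrite of MATCHER: return the lines that contain the
# pattern exactly $n times.
sub lines_with_exactly {
    my ($n, @lines) = @_;
    return grep {
        my $count = () = /cactt/gi;   # number of matches in $_
        $count == $n;
    } @lines;
}

# The three strings from the question.
my @matching = qw(
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
    asfdcacttaskdfjcacttklasdjf
    cksjdfcacttlkasdjf
);

for my $n (1 .. 3) {
    my @hits = lines_with_exactly($n, @matching);
    print "$n: @hits\n";
}
```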
What you want to do in your case, I assume, is to count the occurrences of the substring inside your strings, then report them. I would do something like this
use strict;
use warnings;
use Data::Dumper;
my %data;
while (<DATA>) {
chomp;
my $count = () = /cactt/gi; # count number of matches
    push @{ $data{$count} }, $_; # store count and original
}
print Dumper \%data;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
This will print
$VAR1 = {
'2' => [
'asfdcacttaskdfjcacttklasdjf'
],
'3' => [
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
],
'1' => [
'cksjdfcacttlkasdjf'
]
};
This is just to demonstrate how to create the data structure. You can now access the strings in the order of matches. For example:
for (@{ $data{3} }) { # print strings with 3 matches
    print "$_\n";
}
Would you just do something like this:
use warnings;
use strict;
my $n=2;
my $match_line_cnt=0;
my $line_cnt=0;
while (<DATA>) {
my $m_cnt = () = /cactt/g;
if ($m_cnt>=$n){
print;
$match_line_cnt++;
}
$line_cnt++;
}
print "total lines: $line_cnt\n";
print "matched lines: $match_line_cnt\n";
print "squashed: ",$line_cnt-$match_line_cnt;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
prints:
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
total lines: 3
matched lines: 2
squashed: 1
I think you're unintentionally asking two separate questions.
If you want to directly capture the number of times a pattern matches in a string, this one liner is all you need.
$string = 'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf';
$pattern = qr/cactt/;
print $count = () = $string =~ m/$pattern/g;
-> 3
That last line is as if you had written $count = @junk = $string =~ m/$pattern/g; but without needing an intermediate array variable. () = is the null list assignment, and it throws away whatever is assigned to it, just like assigning to undef throws away its right-hand side. But the null list assignment still returns the number of things thrown away when the assignment is evaluated in scalar context. It returns an empty list in list context.
If you want to match strings that only contain some number of pattern matches, then you want to stop matching once too many are found. If the string is large (like a document) then you would waste a lot of time counting past n.
Try this.
sub matcher {
    my ($string, $pattern, $n) = @_;
my $c = 0;
while ($string =~ m/$pattern/g) {
$c++;
return if $c > $n;
}
return $c == $n ? 1 : ();
}
Now there is one more option but if you call it over and over again it gets inefficient. You can build a custom regex that matches only n times on the fly. If you only build this once however, it's just fine and speedy. I think this is what you originally had in mind.
$regex = qr/^(?:(?:(?!$pattern).)*$pattern){$n}(?:(?!$pattern).)*$/;
I'll leave the rest of that one to you. Check for n > 1 etc. The key is understanding how to use lookahead. You have to match all the NOT THINGS before you try to match THING.
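As a hedged sketch of how that custom regex might be built and used, the hypothetical helper exactly_n_regex below wraps the same construction. It assumes the pattern is passed as a qr// object, and it uses \A and \z plus /s on the "not the pattern" part, which is a small deviation from the one-liner above so that embedded newlines do not matter.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches a string only if
# $pattern (a qr// object) occurs exactly $n times in it.
sub exactly_n_regex {
    my ($pattern, $n) = @_;
    # Any run of characters at which $pattern does not start.
    my $not_pattern = qr/(?:(?!$pattern).)*/s;
    return qr/ \A (?: $not_pattern $pattern ){$n} $not_pattern \z /x;
}

my $two = exactly_n_regex( qr/cactt/, 2 );

# The three strings from the question.
for my $s (qw(
    cksjdfcacttlkasdjf
    asfdcacttaskdfjcacttklasdjf
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
)) {
    print "$s\n" if $s =~ $two;
}
```

Only the string with two occurrences is printed.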
https://perldoc.perl.org/perlre

Perl Regex Find and Return Every Possible Match

I'm trying to create a while loop that will find every possible substring within a string. But so far all I can match is the longest instance or the shortest. So for example I have the string
EDIT CHANGE STRING FOR DEMO PURPOSES
"A.....B.....B......B......B......B"
And I want to find every possible sequence of "A.......B"
This code will give me the shortest possible return and exit the while loop
while($string =~ m/(A(.*?)B)/gi) {
print "found\n";
my $substr = $1;
print $substr."\n";
}
And this will give me the longest and exit the while loop.
$string =~ m/(A(.*)B)/gi
But I want it to loop through the string returning every possible match. Does anyone know if Perl allows for this?
EDIT ADDED DESIRED OUTPUT BELOW
found
A.....B
found
A.....B.....B
found
A.....B.....B......B
found
A.....B.....B......B......B
found
A.....B.....B......B......B......B
There are various ways to parse the string so to scoop up what you want.
For example, use regex to step through all A...A substrings and process each capture
use warnings;
use strict;
use feature 'say';
my $s = "A.....B.....B......B......B......B";
while ($s =~ m/(A.*?)(?=A|$)/gi) {
    my @seqs = split /(B)/, $1;
    for my $i (0..$#seqs) {
        say @seqs[0..$i] if $i % 2 != 0;
    }
}
The (?=A|$) is a lookahead, so .* matches everything up to an A (or the end of string) but that A is not consumed and so is there for the next match. The split uses () in the separator pattern so that the separator, too, is returned (so we have all those B's). It only prints for an even number of elements, so only substrings ending with the separator (B here).
The above prints
A.....B
A.....B.....B
A.....B.....B......B
A.....B.....B......B......B
A.....B.....B......B......B......B
There may be bioinformatics modules that do this but I am not familiar with them.
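A different way to get at the same output, sketched here under the assumption that "every possible match" means every substring that starts at an A and ends at any later B. This is not the split-based technique above: it walks the A positions with a regex and the B positions with index(). The helper name all_a_to_b is made up, and the matching is case-sensitive, unlike the /i in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return every substring that starts at an "A" and
# ends at any later "B".
sub all_a_to_b {
    my ($s) = @_;
    my @found;
    while ( $s =~ /A/g ) {
        my $start = $-[0];    # offset of this "A"
        my $end   = $start;
        # index() finds each later "B" without disturbing pos($s).
        while ( ( $end = index( $s, 'B', $end + 1 ) ) != -1 ) {
            push @found, substr( $s, $start, $end - $start + 1 );
        }
    }
    return @found;
}

print "$_\n" for all_a_to_b("A.....B.....B......B......B......B");
```

If the string contains several A characters, this version also reports substrings that span a later A, which may or may not be what is wanted.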

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple space, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here). It would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
    push @positions, $-[0];
}
my @matches;
for my $pos (@positions) {
    pos($text) = $pos;
    push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, this also removes the need for two different regex patterns to agree on what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
    push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
    my $word = $1;
    push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
[1] For example, a proper tokenizer could be used.
In the absence of more information, I can only suggest the pos function.
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching on the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string.
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and the character found, if any, is displayed.
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the string and leave information about only the very last match.
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a  b  c  d  e  f  g  h  i  j  k  l  m  n';
#          0123456789012345678901234567890123456789
#                    1         2         3
for ( 4, 31, 16, 22 ) {
pos($str) = $_;
say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i
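Applied to the question's goal of hopping from word to word, a sketch along these lines may help. The helper name next_word is hypothetical, a "word" is assumed to be any run of non-whitespace characters, and the text is passed by reference so that a multi-megabyte string is not copied on each call. The \G anchor ties the match to the position set through pos, so the search does not scan from the start of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: starting at $offset, skip any whitespace and return
# the next run of non-whitespace characters plus the offset just past it.
# Returns an empty list when no further word is found.
sub next_word {
    my ($text_ref, $offset) = @_;
    pos($$text_ref) = $offset;
    if ( $$text_ref =~ /\G\s*(\S+)/gc ) {
        return ( $1, pos($$text_ref) );
    }
    return;
}

my $text   = "alpha  beta\tgamma\n delta";
my $offset = 0;
while ( my ($word, $next) = next_word( \$text, $offset ) ) {
    print "$word\n";
    $offset = $next;
}
```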

Why can't I match a substring which may appear 0 or 1 time using /(subpattern)?/

The original string is like this:
checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19
The last part "fail1:19" may appear 0 or 1 time. And I tried to match the number after "fail1:", which is 19, using this:
($reg_suc, $reg_fail) = ($1, $2) if $line =~ /^checksession\s+ok:(\d+).*(fail1:(\d+))?/;
It doesn't work. The $2 variable is empty even if "fail1:19" does exist. If I delete the "?", it matches only if the "fail1:19" part exists, and $2 will be "fail1:19". But if the "fail1:19" part doesn't exist, neither $1 nor $2 is set, which is not what I want.
How can I rewrite this pattern to capture the two numbers correctly? That is, when the "fail1:19" part exists, both numbers should be recorded, and when it doesn't exist, only the number after "ok:" should be recorded.
First, the number in the fail field would end up in $3, because those variables are numbered by their opening parentheses. Second, as codaddict shows, the .* construct in a regex is greedy, so it will consume even the fail... part. Third, you can avoid the numbered variables like this:
my $line = "checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
if(my ($reg_suc, $reg_fail, $addend)
= $line =~ /^checksession\s+ok:(\d+).*?(fail1:(\d+))?$/
) {
warn "$reg_suc\n$reg_fail\n$addend\n";
}
Try the regex:
^checksession\s+ok:(\d+).*?(fail1:(\d+))?$
Changes made:
.* in the middle has been made non-greedy, and
$ (end anchor) has been added.
As a result of the above changes, .*? will try to consume as little as possible, and the end anchor forces the regex to match to the end of the string, matching fail1:number if present.
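A small sketch of that pattern run against both forms of the line. The helper name parse_checksession is made up, and the group around fail1: is written as non-capturing here (a deviation from the pattern above) so that the fail count lands in the second capture.

```perl
use strict;
use warnings;

# Hypothetical helper: return the "ok" count and the "fail1" count
# (undef when the fail1 field is absent), or an empty list on no match.
sub parse_checksession {
    my ($line) = @_;
    my ($ok, $fail) =
        $line =~ /^checksession\s+ok:(\d+).*?(?:fail1:(\d+))?$/
        or return;
    return ( $ok, $fail );
}

for my $line (
    'checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19',
    'checksession ok:6178 avg:479 avgnet:480 MaxTime:18081',
) {
    my ($ok, $fail) = parse_checksession($line);
    printf "ok=%s fail=%s\n", $ok, defined $fail ? $fail : 'none';
}
```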
I think this is one of the few cases where a split is actually more robust than a regex:
$bar[0]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081 fail1:19";
$bar[1]="checksession ok:6178 avg:479 avgnet:480 MaxTime:18081";
for $line (@bar){
    (@fields) = split/ /,$line;
    $reg_suc = $fields[1];
    $reg_fail = $fields[5];
    print "$reg_suc $reg_fail\n";
}
I try to avoid the non-greedy modifier. It often bites back. Kudos for suggesting split, but I'd go a step further:
my %rec = split /\s+|:/, ( $line =~ /^checksession (.*)/ )[0];
print "$rec{ok} $rec{fail1}\n";