Negate regular expression in Perl - regex

I am splitting a text file into blocks in order to extract those blocks which do not contain a certain line by using a regular expression.
The text file looks like this:
[Term]
id: id1
name: name1
xref: type1:aab
xref: type2:cdc
[Term]
id: id2
name: name2
xref: type1:aba
xref: type3:fee
Someone helped me a few days ago by showing me how to extract those blocks which do contain a certain regular expression (for example "xref: type3"):
while (<MYFILE>) {
BEGIN { $/ = q|| }
my @lines = split /\n/;
for my $line ( @lines ) {
if ( $line =~ m/xref:\s*type3/ ) {
printf NEWFILE qq|%s|, $_;
last;
}
}
}
Now I want to write all blocks in a new file which do not contain "xref: type3". I tried to do this by simply negating the regex
if ( $line !~ m/xref:\s*type3/ )
or alternatively by negating the if statement by using
unless ( $line =~ m/xref:\s*type3/ )
Unfortunately it doesn't work: the output file is the same as the original one. Any ideas what I'm doing wrong?

You have:
For every line, print this block if this line doesn't match the pattern.
But you want:
Print this block if none of the lines in the block match the pattern.
As such, you can't start printing the block until you have examined every line in it (or every line up to the first matching one).
local $/ = q||;
while (<MYFILE>) {
my @lines = split /\n/;
my $skip = 0;
for my $line ( @lines ) {
if ( $line =~ m/^xref:\s*type3/ ) {
$skip = 1;
last;
}
}
# split removed the newlines, so print the original block rather than the pieces
print NEWFILE $_ if !$skip;
}
But there's no need to split into lines. We can check and print the whole block at once.
local $/ = q||;
while (<MYFILE>) {
print NEWFILE $_ if !/^xref:\s*type3/m;
}
(Note the /m to make ^ match the start of any line.)
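The whole-block check can be tried without any files. The following is a minimal sketch, assuming the blocks are separated by blank lines (which paragraph mode requires); the helper name blocks_without and the in-memory filehandle are illustrative choices, not part of the original code.

```perl
use strict;
use warnings;

# Return the blocks of $text that contain no line matching $pattern.
# Assumption: blocks are separated by one or more blank lines.
sub blocks_without {
    my ( $text, $pattern ) = @_;
    open my $in, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = '';    # paragraph mode: read one blank-line-separated block at a time
    my @kept;
    while ( my $block = <$in> ) {
        push @kept, $block unless $block =~ $pattern;
    }
    close $in;
    return @kept;
}

my $sample = <<'END';
[Term]
id: id1
name: name1
xref: type1:aab
xref: type2:cdc

[Term]
id: id2
name: name2
xref: type1:aba
xref: type3:fee
END

# Prints only the first block, because the second one has an "xref: type3" line.
print blocks_without( $sample, qr/^xref:\s*type3/m );
```

The /m flag on the pattern matters here for the same reason as above: each block is a multi-line string, so ^ has to match at the start of any line.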

The problem is that you are using unless with !~, which is interpreted as "if $line does not NOT match, do this" (a double negative).
When using the unless block with the normal pattern matching operator =~, your code worked perfectly; that is, I see the first block as output because it does not contain type3.
LOOP:
while (<$MYFILE>) {
BEGIN { $/ = q|| }
my @lines = split /\n/;
for my $line ( @lines ) {
unless ( $line =~ m/xref:\s*type3/ ) {
printf qq|%s|, $_;
last LOOP;
}
}
}
# prints
# [Term]
# id: id1
# name: name1
# xref: type1:aab
# xref: type2:cdc

Do not process the records line by line. Use paragraph mode:
{
local $/ = q();
while (<MYFILE>) {
if (! /xref:\s*type3/ ) {
printf NEWFILE qq|%s|, $_;
}
}
}

Related

perl count line in double looping, if match regular expression plus 1

I read a file into an array, one line per element. Some lines in the file carry a duplicate value, which I pick out with a regular expression, and I want to count each match. The regular expression looks like this:
$b =~ /\/([^\/]+)##/. I want to compare the $1 values.
my @array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is my attempt, but it ends up matching the same line in my file. Thanks for any help.
foreach my $b (@array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (@array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is some sample data, which contains duplicate lines:
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the values that occur more than once (i.e. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
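The counting approach can be checked on a few of the sample lines. This is a minimal sketch; the helper name duplicate_names is hypothetical, and it assumes the value of interest is the path component immediately before "##", as in the regex from the question.

```perl
use strict;
use warnings;

# Return the names (the path component just before "##") seen more than once.
sub duplicate_names {
    my @lines = @_;
    my %count_of;
    for my $line (@lines) {
        # List assignment yields 0 in boolean context when the match fails,
        # so lines without a match are skipped.
        my ($name) = $line =~ m{/([^/]+)##} or next;
        $count_of{$name}++;
    }
    my @dupes = grep { $count_of{$_} > 1 } keys %count_of;
    return sort @dupes;
}

my @sample = (
    '/a/b/c/d/code/.project##/main/CSS5/5',
    '/a/b/c/d/code/.cdtproject##/main/place.09/0',
    '/a/b/c/d/code/.cdtproject##/main/place.09/0',
);
print "$_\n" for duplicate_names(@sample);    # prints ".cdtproject"
```

A single pass with a hash avoids the nested loop from the question, where every line is compared with every other line.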
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer; I just want to offer an alternate flavour that:
Spells out the regex
Uses the %line_num hash to record the line on which the pattern first appears, to enable slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
\#\# # Must be immediately followed by a literal ## (escaped, since under /x a bare # starts a comment)
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Run on the data above, this produces:
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

How do I get perl to print n lines following a specific string?

I have a very large file and want to pull out all atom symbols and coordinates for the equilibrium geometry. The desired information is displayed as below:
***** EQUILIBRIUM GEOMETRY LOCATED *****
COORDINATES OF ALL ATOMS ARE (ANGS)
ATOM CHARGE X Y Z
-----------------------------------------------------------
C 6.0 0.8438492825 -2.0554543742 0.8601734285
C 6.0 1.7887997955 -1.2651150894 0.4121141006
N 7.0 1.3006136046 0.0934593194 0.2602148346
NOTE: After the coordinates finish there is a blank line.
I have so far patched together some code that makes sense to me; however, it produces errors and I am not sure why.
It expects one file after calling on the script, saves each line and changes to $start==1 when it sees the string containing EQUILIBRIUM GEOMETRY which triggers recording of symbols and coordinates. It continues to save lines that contain the coordinate format until it sees a blank line where it finishes recording into $geom.
#!/usr/bin/perl
$num_args = $#ARGV + 1;
if ($num_args != 1) {
   print "\nMust supply GAMESS .log file.\n";
   exit;
}
$file = $ARGV[0];
open FILE, "<", $file;
$start = 0;
$geom="";
while (<FILE>) {
 $line = $_;
if ( $line eq "\n" && ($start == 1) ) {
   $start = 0; }
 if ( $start == 1 && $line =~ m/\s+[A-Z]+\s+[0-9\.]+\s+[0-9\.\-]+\s+[0-9\.\-]+\s+[0-9\.\-]+/ ) {
$line =~ s/^\s+//;
@coordinates = split(/\s+/,$line);
$geom=$coordinates[0],$coordinates[3],$coordinates[4],$coordinates[5];
 }
 if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) {
   $geom = "";
   $start = 1;
 }
}
print $geom;
Error Message:
Unrecognized character \xC2; marked by <-- HERE after <-- HERE near column 1 at ./perl-grep line 5.
There is an invisible character on line 13.
I created a file containing only this line (by cut and paste)
and then added a line above it, which I retyped by hand:
$geom="";
$geom="";
They look the same, but they are not (the second line is the buggy one):
[tmp]=> cat x | perl -ne '$LINE = $_; $HEX = unpack "H*"; print "$HEX $LINE" '
2467656f6d3d22223b0a $geom="";
2467656f6d3d22223be280a80a $geom="";
You can see there are some extra characters when you examine the file in hex.
So just delete this single line completely and retype it.
By the way, there is another issue in your file: you forgot to close the regexp with '/':
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) {
But I guess there is still work to do to finish your script, because I don't really see its purpose ;)
I copied this over to my Linux box and got the same problem.
Basically, the script is saying that there's an unreadable character at the line:
$geom="";
I re-typed that line in gedit and it ran fine.
Also, there's an unclosed regex at the bottom of your script. I added a "/" to the line that reads:
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) {
OK, first and foremost - strict and warnings are really the first port of call when you're writing a script and having problems. Actually, even before you have problems - switch them on, and adhere to them*.
So as with your code:
$geom="";? - trailing question mark. Should be removed.
$num_args = $#ARGV + 1; - redundant. scalar @ARGV has the same result.
open FILE, "<", $file; - 3 arg open is good. Not using lexical filehandles or checking success is bad.
$line = $_; - redundant. Just use while ( my $line = <FH> ) { instead.
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s+) - broken regex, no trailing /.
$geom=$coordinates[0],$coordinates[3],$coordinates[4],$coordinates[5]; - doesn't do what you think. You probably want to join these or concatenate them.
$line eq "\n" - picks up blank lines, but might be better if you chomp; first and eq ''.
$start looks like you're trying to do the same thing as the range operator. http://perldoc.perl.org/perlop.html#Range-Operators
you overwrite $geom as you go. Is that your intention?
$line =~ s/^\s+//; - is redundant given all you do is split. split ' ' does the same thing.
it's good form to close your filehandle after using it. Especially when it's not lexically scoped.
So with that in mind, your code might look bit like this:
#!/usr/bin/perl
use strict;
use warnings;
if ( @ARGV != 1 or not -f $ARGV[0] ) {
print "\nMust supply GAMESS .log file.\n";
exit;
}
open( my $input_fh, "<", $ARGV[0] ) or die $!;
my $geom = "";
while ( my $line = <$input_fh> ) {
chomp $line;
if ( $line =~ m/\s+\*+ EQUILIBRIUM GEOMETRY LOCATED\s\*+\s*/ .. $line =~ m/^$/ ) {
if ( $line
=~ m/\s+[A-Z]+\s+[0-9\.]+\s+[0-9\.\-]+\s+[0-9\.\-]+\s+[0-9\.\-]+/
)
{
my @coordinates = split( ' ', $line );
$geom = join( "",
$coordinates[0], $coordinates[3],
$coordinates[4], $coordinates[5] );
}
}
}
close($input_fh);
print $geom;
(If you have some sample input, I'll verify it).
* There are occasions you might want to switch them off. If you know what these are and why, then you switch them off. Otherwise just assume they're mandatory.
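For comparison, here is a minimal sketch of the range-operator approach that accumulates every atom row instead of overwriting $geom. It makes assumptions not confirmed by the original: rows have exactly the five columns shown in the question (symbol, charge, X, Y, Z), so the coordinates are fields 2 to 4, and the helper name equilibrium_geometry is hypothetical.

```perl
use strict;
use warnings;

# Return "symbol x y z" for every atom row between the
# "EQUILIBRIUM GEOMETRY LOCATED" header and the next blank line.
sub equilibrium_geometry {
    my @lines = @_;
    my @atoms;
    for my $line (@lines) {
        chomp( my $copy = $line );
        # Flip-flop: true from the header line through the next blank line.
        if ( $copy =~ /EQUILIBRIUM GEOMETRY LOCATED/ .. $copy =~ /^\s*$/ ) {
            my @f = split ' ', $copy;
            # Assumed row layout: symbol, charge, x, y, z.
            next unless @f == 5
                && $f[0] =~ /^[A-Z][a-z]?$/
                && $f[1] =~ /^\d+\.\d+$/;
            push @atoms, join ' ', @f[ 0, 2, 3, 4 ];
        }
    }
    return @atoms;
}

my $log = <<'END';
      ***** EQUILIBRIUM GEOMETRY LOCATED *****
 COORDINATES OF ALL ATOMS ARE (ANGS)
   ATOM   CHARGE       X              Y              Z
 ------------------------------------------------------------
 C           6.0   0.8438492825  -2.0554543742   0.8601734285
 C           6.0   1.7887997955  -1.2651150894   0.4121141006
 N           7.0   1.3006136046   0.0934593194   0.2602148346

END

# split /^/ keeps the blank line as an element, so the flip-flop switches off again.
print "$_\n" for equilibrium_geometry( split /^/, $log );
```

Because the flip-flop keeps its state between calls, the input should contain the terminating blank line; otherwise the next call would start "inside" the block.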

Perl parsing JavaScript file regex, to catch quotes only at the beginning and end of the returned string

I'm just starting to learn Perl. I need to parse JavaScript file. I came up with the following subroutine, to do it:
sub __settings {
my ($_s) = @_;
my $f = $config_directory . "/authentic-theme/settings.js";
if ( -r $f ) {
for (
split(
'\n',
$s = do {
local $/ = undef;
open my $fh, "<", $f;
<$fh>;
}
)
)
{
if ( index( $_, '//' ) == -1
&& ( my @m = $_ =~ /(?:$_s\s*=\s*(.*))/g ) )
{
my $m = join( '\n', @m );
$m =~ s/[\'\;]//g;
return $m;
}
}
}
}
I have the following regex, that removes ' and ; from the string:
s/[\'\;]//g;
It works all right, but if those characters (' and ;) also appear inside the string, they are removed as well. This is undesirable, and this is where I am stuck: I'm not sure how to change the regex above so that it will only:
Remove only the first ' in the string
Remove only the last ' in the string
Remove only the last ; in the string, if it exists
Any help, please?
You can use the following to match:
^'|';?$|;$
And replace with '' (empty string)
Remove only first ' in string
Remove only last ' in string
^[^']*\K'|'(?=[^']*$)
Try this. See the demo:
https://regex101.com/r/oF9hR9/8
Remove only the last ; in string if it exists
;(?=[^;]*$)
Try this. See the demo:
https://regex101.com/r/oF9hR9/9
All three in one
^[^']*\K'|'(?=[^']*$)|;(?=[^;]*$)
You can use this code:
#!/usr/bin/perl
$str = "'string; 'inside' another;";
$str =~ s/^'|'?;?$//g;
print $str;
The main idea is to use anchors: ^ matches the beginning of the string, $ matches the end of the string, and ;? matches the ";" symbol at the end only if it is present (the ? quantifier makes the pattern preceding it optional). EDIT: Also, ; will get removed even if there is no preceding '.
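The same anchoring idea can also be written as separate substitutions, which some readers find easier to follow. This is a minimal sketch rather than a drop-in replacement; the helper name strip_outer is hypothetical, and it assumes any trailing whitespace should be dropped before the final ; and ' are examined.

```perl
use strict;
use warnings;

# Remove one trailing ";", then one leading and one trailing "'".
# Quotes and semicolons inside the value are left alone.
sub strip_outer {
    my ($value) = @_;
    $value =~ s/\s+$//;    # assumption: trailing whitespace is not wanted
    $value =~ s/;$//;      # trailing semicolon, if present
    $value =~ s/^'//;      # first character, if it is a quote
    $value =~ s/'$//;      # last character, if it is a quote
    return $value;
}

print strip_outer(q{'it's a value; really';}), "\n";    # prints: it's a value; really
```

Each substitution is anchored with ^ or $, so none of them can touch a character in the middle of the string.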
I suggest that your original code should look more like this. It is much more idiomatic Perl and I think more straightforward to follow
sub __settings {
my ($_s) = @_;
my $file = "$config_directory/authentic-theme/settings.js";
return unless -r $file;
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my @file = <$fh>;
chomp @file;
for ( @file ) {
next if m{//};
if ( my @matches = $_ =~ /(?:$_s\s*=\s*(.*))/g ) {
my $matches = join "\n", @matches;
$matches =~ tr/';//d;
return $matches;
}
}
}

Using iterated variables with regex

The point of the overall script is to:
step 1) open a single column file and read off first entry.
step 2) open a second file containing lots of rows and columns, read off EACH line one at a time, and find anything in that line that matches the first entry from the first file.
step 3) if a match is found, then "do something constructive", and if not, go to the first file and take the second entry and repeat step 2 and step 3, and so on...
here is the script:
#!/usr/bin/perl
use strict; #use warnings;
unless(@ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = $ARGV[0];
chomp( $list );
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
my( @list ) = (<LIST>);
close LIST;
open (CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my(@spreadsheet) = (<CHR1>);
close CHR1;
for (my $i = 0; $i < scalar @list; $i++ ) {
print "$i in list is $list[$i]\n";
for (my $j = 1; $j < scalar @spreadsheet; $j++ ) {
#print "$spreadsheet[$j]\n";
if ( $spreadsheet[$j] ) {
print "will $list[$i] match with $spreadsheet[$j]?\n";
}
else { print "no match\n" };
} #for
} #for
I plan to use a regex in the line if ( $spreadsheet[$j] ) { but am having a problem at this step as it is now. On the first iteration, the line print "will $list[$i] match with $spreadsheet[$j]?\n"; prints $list[$i] OK but does not print $spreadsheet[$j]. On the second and following iterations it prints both variables correctly. I do not see why.
At first glance nothing looks overtly incorrect. As mentioned in the comments the $j = 1 looks questionable but perhaps you are skipping the first row on purpose.
Here is a more perlish starting point that is tested. If it does not work then you have something going on with your input files.
Note the extended trailing whitespace removal. Sometimes if you open a WINDOWS file on a UNIX machine and use chomp, you can have embedded \r in your text that causes weird things to happen to printed output.
#!/usr/bin/perl
use strict; #use warnings;
unless(@ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = shift;
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
open(CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my @spreadsheet = map { s/\s+$//; $_ } <CHR1>;
close CHR1;
# s/\s+$//; is like chomp but trims all trailing whitespace even
# WINDOWS files opened on a UNIX system.
for my $item (<LIST>) {
$item =~ s/\s+$//; # trim all trailing whitespace
print "==> processing '$item'\n";
for my $row (@spreadsheet) {
if ($row =~ /\Q$item\E/) { # see perlre for \Q \E
print "match '$row'\n";
}
else {
print "no match '$row'\n";
}
}
}
close LIST;
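The point about Windows line endings can be seen in isolation. The following is a small sketch (the helper name trim_trailing and the sample value are made up for illustration) showing that chomp leaves the carriage return behind while s/\s+$// removes it.

```perl
use strict;
use warnings;

# Trim all trailing whitespace, including "\r" left over from CRLF line endings.
sub trim_trailing {
    my ($text) = @_;
    $text =~ s/\s+$//;
    return $text;
}

my $line = "BRCA1\r\n";           # a line as read from a Windows-style file
chomp( my $chomped = $line );     # removes only "\n" (the default $/)

printf "chomp leaves %d chars, trim leaves %d\n",
    length $chomped, length trim_trailing($line);
# prints: chomp leaves 6 chars, trim leaves 5
```

A stray "\r" sends the cursor back to the start of the line when printed to a terminal, which is one way printed values can appear to be missing.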

Capturing output with Perl, until a specific pattern is found

I feel like I am missing something very simple here, but this is the first time I've needed to do this and am having trouble finding an example.
I have a giant foreach loop that walks through output logs and extracts various bits of information based on matching regular expressions. My main problem is that a few larger types of output have a header and footer, like *** Begin bangle tracking log*** followed by several lines of gibberish and then a ***End bangle tracking log***.
Is there a way, from within a foreach loop, to have an inner loop that stores all the lines until a footer is found?
foreach my $line( @parseme )
{
if( $line =~ m/\*\*\*\s*Begin bangle tracking log\s*\*\*\*/ )
{
#Help! Push all lines into an array until bangle tracking footer is found.
}
if( $line =~ m/Other stuff I am tracking/ )
{
#Do other things
}
}
You could use the range operator, which acts as a flip-flop in scalar context:
foreach ( @parseme ) {
if ( /Begin bangle tracking log/ .. /End bangle tracking log/ ) {
push @array, $_;
}
# other stuff...
}
I used $_ for the foreach loop because it allows for more concise syntax. You can use another variable if you like, but then you'll have to write the condition as something like:
if ( $line =~ /Begin .../ .. $line =~ /End .../ ) {
which might be more readable with some extra parentheses:
if ( ($line =~ /Begin .../) .. ($line =~ /End .../) ) {
One issue to note about the flip-flop operator is that it remembers its state even after the loop ends. This means that, if you intend to run the loop again, you really ought to make sure that the @parseme array ends with a line that matches the /End .../ regexp, so that the flip-flop will be in a known state when the loop starts the next time.
Edit: Per DVK's comment below, if you want to process the collected lines as soon as you reach the footer line, you can do that by checking the return value of the .. operator, which will end with E0 on the last line:
foreach ( @parseme ) {
my $in_block = /Begin bangle tracking log/ .. /End bangle tracking log/;
if ( $in_block ) {
push @array, $_;
}
if ( $in_block =~ /E0$/ ) { # last line
# process the lines in @array
@array = ();
}
# other stuff...
}
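To see the E0 marker in action, here is a minimal self-contained sketch that gathers every header-to-footer block into its own array. The helper name collect_blocks and the sample lines are illustrative assumptions; the header and footer lines are kept as part of each block.

```perl
use strict;
use warnings;

# Return one array reference per block, header and footer lines included.
sub collect_blocks {
    my @lines = @_;
    my ( @blocks, @current );
    for (@lines) {
        # In scalar context ".." returns a sequence number while inside the
        # range and an empty string outside it.
        my $in_block = /Begin bangle tracking log/ .. /End bangle tracking log/;
        next unless $in_block;
        push @current, $_;
        if ( $in_block =~ /E0$/ ) {    # "E0" is appended on the footer line
            push @blocks, [@current];
            @current = ();
        }
    }
    return @blocks;
}

my @log = (
    "unrelated line\n",
    "***Begin bangle tracking log***\n",
    "gibberish 1\n",
    "***End bangle tracking log***\n",
);
my @blocks = collect_blocks(@log);
printf "%d block(s), first has %d lines\n", scalar @blocks, scalar @{ $blocks[0] };
# prints: 1 block(s), first has 3 lines
```

As noted above, the flip-flop keeps its state between calls, so the input should end with (or at least contain) the closing footer line.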
You can do that somewhat easily by implementing a primitive state machine:
my $inside_bangle = 0; # 0=outside block, 1=inside
my @buffer;
foreach my $line( @parseme ) {
if ($line =~ m/\*\*\*\s*Begin bangle tracking log\s*\*\*\*/ ) {
$inside_bangle = 1;
@buffer = (); # start each block with an empty buffer
next;
}
if ($line =~ m/\*\*\*\s*End bangle tracking log\s*\*\*\*/ ) {
$inside_bangle = 0;
# PROCESS @buffer somehow
next;
}
if ($inside_bangle) {
push @buffer, $line;
next;
}
if ($line =~ m/other stuff i am tracking/ ) {
#Do other things
}
}
Another option is to use flip-flop (..)
You're probably looking for the .. operator, which has some magical properties when applied with regular expressions. The following example is stolen from PLEAC:
while (<>) {
if (/BEGIN PATTERN/ .. /END PATTERN/) {
# line falls between BEGIN and END in the
# text, inclusive.
}
}
Within the block, append to your array variable as you see fit.
And now you can store multiple instances of your log snippet, if it occurs more than once (based on DVK's original code):
my $inside_bangle = 0; # 0=outside block, 1=inside
my %buffer;
my $index = 0;
foreach my $line( @parseme ) {
if ($line =~ m/\*\*\*\s*Begin bangle tracking log\s*\*\*\*/ ) {
$inside_bangle = 1;
next;
}
if ($line =~ m/\*\*\*\s*End bangle tracking log\s*\*\*\*/ ) {
$inside_bangle = 0;
# PROCESS @{ $buffer{$index} } somehow
$index++;
next;
}
if ($inside_bangle) {
push @{ $buffer{$index} }, $line;
next;
}
if ($line =~ m/other stuff i am tracking/ ) {
#Do other things
}
}
Here's what I wrote initially but I thought DVK's code is more readable and neat:
open FILE, "<", 'testfile.log';
my @parseme = <FILE>;
my $initialize = shift @parseme;
my $startLogging = $initialize =~ m/^Log Start$/ ? 1 : 0; # test if first line of array is start of log
my %storage = ();
my $index = 0;
foreach my $line (@parseme) {
$startLogging = 1 if $line =~ m/^Log Start$/;
if ($startLogging == 1) {
push( @{ $storage{$index} }, $line ) unless $line =~ m/(Log Start|Log End)$/;
if ($line =~ m/^Log End$/) {
$startLogging = 0;
$index++;
}
}
}