The script below works, but it requires a kludge. By "kludge" I mean a line of code which makes the script do what I want --- but I do not understand why the line is necessary. Evidently, I do not understand exactly what the multiline regex substitution, ending /mg, is doing.
Is there not a more elegant way to accomplish the task?
The script reads through a file by paragraphs. It partitions each paragraph into two subsets: $text and $cmnt. The $text includes the left part of every line, i.e., from the first column up to the first %, if it exists, or to end of the line if it doesn't. The $cmnt includes the rest.
Motivation: The files to be read are LaTeX markup, where % announces the beginning of a comment. We could change the value of $breaker to equal # if we were reading through a perl script. After separating $text from $cmnt, one could perform a match across lines such as
print "match" if ($text =~ /WOLF\s*DOG/s);
Please see the line labeled "kludge."
Without that line, something funny happens after the last % in a record. If there are lines of $text (material not commented out by %) after the last commented line of the record, those lines are included both at the end of $cmnt and in $text.
In the example below, this means that without the kludge, in record 2, "cat lion" is included both in the $text, where it belongs, and also in $cmnt.
(The kludge causes an unnecessary % to appear at the end of every non-void $cmnt. This is because the kludge-pasted-on % announces a final, fictitious empty comment line.)
According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means
Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
Therefore, I would expect the 2nd match in
s/^([^$breaker]*)($breaker.*?)$/$2/mg
to start with the first %, to extend as far as end-of-line, and stop there. So even without the kludge, it should not include "cat lion" in record 2. But obviously it does, so I am misreading, or missing, some part of the documentation. I suspect it has to do with the /g regex modifier?
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/$breaker/; # kludge
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy
{
$cmnt = $_;
die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy
}
else
{
$cmnt = '';
}
print "\nRECORD $count_record:\n";
print "******** text==";
print "\n|";
print $text;
print "|\n";
print "******** cmnt==|";
print $cmnt;
print "|\n";
}
Example file to run it on:
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA
% what was that?
cat lion
no comments in this line
%The last paragraph of this file is nothing but a single-line comment.
You must also delete from $cmnt the lines that do not contain a comment:
use feature qw(say);
use strict;
use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg) # non-greedy
{
$cmnt = $_;
$cmnt =~ s/^[^$breaker]*?$//mg;
die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg); # non-greedy
}
else
{
$cmnt = '';
}
print "\nRECORD $count_record:\n";
print "******** text==";
print "\n|";
print $text;
print "|\n";
print "******** cmnt==|";
print $cmnt;
print "|\n";
}
Output:
RECORD 1:
******** text==
|dog wolf
DOG WOLF
DOG WOLLLLLLF
|
******** cmnt==|% flea
% FLEA
% FLLLLLLEA
|
RECORD 2:
******** text==
|
cat lion
|
******** cmnt==|% what was that?
|
RECORD 3:
******** text==
|no comments in this line
|
******** cmnt==||
RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
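An alternative that avoids the multi-line substitution altogether is to split each record into lines and cut every line at the first breaker. This is only a sketch: split_record is a hypothetical helper name, not part of the original script, and it assumes the record uses \n line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: split one paragraph into its text and comment parts.
sub split_record {
    my ($record, $breaker) = @_;
    my ($text, $cmnt) = ('', '');
    for my $line (split /\n/, $record) {
        # Limit 2: cut only at the first breaker on the line.
        my ($left, $right) = split /\Q$breaker\E/, $line, 2;
        $text .= (defined $left ? $left : '') . "\n";
        # Lines without a breaker contribute nothing to the comment part.
        $cmnt .= "$breaker$right\n" if defined $right;
    }
    return ($text, $cmnt);
}

my ($text, $cmnt) = split_record("% what was that?\ncat lion\n", '%');
print "text==|$text|\ncmnt==|$cmnt|\n";
```

Because each line is handled on its own, a line without a breaker can never leak into the comment part.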
My main source of confusion was a failure to distinguish between
(1) whether or not an entire record matches (here, a record is potentially a multi-line paragraph), and
(2) whether or not a line inside a record matches.
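The difference can be seen in isolation with a small sketch (keep_comments is a hypothetical wrapper around the question's second substitution, not code from the original script). A line with no breaker on or after it never matches, so s///mg leaves it where it is. Note also that [^%] can match a newline, so breaker-free lines that come before a comment are swallowed into $1 and dropped.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the question's second substitution.
sub keep_comments {
    my ($record, $breaker) = @_;
    (my $cmnt = $record) =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;
    return $cmnt;
}

# Text after the last comment: that line cannot match, so it survives.
print keep_comments("% what was that?\ncat lion\n", '%');

# Text before a comment: [^%]* runs across the newline, so it is removed.
print keep_comments("cat lion\n% what was that?\n", '%');
```

This asymmetry is why the stray text only shows up after the last % of a record.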
The following script incorporates insights from both answers that others offered, and includes extensive explanation.
#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<DATA>)
{
$count_record++;
my $text = $_;
my $cmnt;
s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
print "RECORD $count_record:";
print "\n|"; print $_; print "|\n";
# https://perldoc.perl.org/perlre.html#Modifiers
# the following regex:
# ^ /m: ^==start of line, not of record
# ([^$breaker]*) zero or more characters that are not $breaker
# ($breaker.*?) non-greedy: the first instance of $breaker, followed by everything after $breaker
# $ /m: $==end of line, not of record
# /g: "globally match the pattern repeatedly in the string"
if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)
{
$cmnt = $_;
# In at least one line of this record, the pattern above has matched.
# But this does not mean every line matches. There may be any number of
# lines inside the record that do not match /$breaker/; for these lines,
# in spite of /g, there will be no match, and thus the exclusion of $1 and printing only of $2,
# in the substitution below, will not take place. Thus, those particular lines must be deleted from $cmnt.
# Thus:
$cmnt =~ s/^[^$breaker]*?$/\n/mg; # remove entire line if it does not match /$breaker/
# recall that /m guarantees that ^ and $ match the start and end of the line, not of the record.
die "code error: cmnt does not match this record" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);
if ( $text =~ /\S/ )
{
print "|text|==\n|$text|\n";
}
else
{
print "NO text found\n";
}
print "|cmnt|==\n|$cmnt|\n";
}
else
{
print "NO comment found\n";
}
}
__DATA__
one dogs% one comment %d**n lies %statistics
two %two comment
thuh-ree
fower
fi-yiv % (he means 5)
SIX 66 % ¿666==antichrist?
seven % the seventh seal, the seven days
ate
niner
ten
As Douglass said to Lincoln ...
%Darryl Pinckney
The /m modifier changes ^ and $ so that they match at the start and end of every line within the string (that is, around each embedded \n), and /g repeats the match across the whole string. Together, /mg lets a pattern be applied to every line of a multi-line string.
Please study the following code, which should simplify the solution to your problem.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $breaker = '%';
my @records = do { local $/ = ''; <DATA> };
for( @records ) {
my %hash = ( /(.*?)$breaker(.*)/mg );
next unless %hash;
say Dumper(\%hash);
}
__DATA__
dog wolf % flea
DOG WOLF % FLEA
DOG WOLLLLLLF % FLLLLLLEA
% what was that?
cat lion
no comments in this line
%The last paragraph of this file is nothing but a single-line comment.
Output
$VAR1 = {
'DOG WOLF ' => ' FLEA ',
'dog wolf ' => ' flea ',
'DOG WOLLLLLLF ' => ' FLLLLLLEA '
};
$VAR1 = {
'' => ' what was that?'
};
$VAR1 = {
'' => 'The last paragraph of this file is nothing but a single-line comment.'
};
Editing to be more concise, pardon.
I need to be able to grep from an array using a string that may contain one of the following characters: '.', '+', '/', '-'. The string will be captured from the user. The array contains each line of the file I'm searching through (I'm reading the chomped file into the array to avoid keeping it open while the user is interacting with the program, because it is on a cron and I do not want to have it open when the cron runs), and each line has a unique identifier within it, which is the basis for the search string used in the regexp. The code below shows the grep statement I am using. I use our and my in my programs so that the variables I want access to in all namespaces are available, and the ones I use only in subroutines are not. If you do want to try and replicate the issue
#!/usr/bin/perl -w
use strict;
use Switch;
use Data::Dumper;
our $pgm_path = "/tmp/";
our $device_info = "";
our @new_filetype1 = ();
our @new_filetype2 = ();
our @dev_info = ();
our @pgm_files = ();
our %arch_rtgs = ();
our $file = "/path/file.csv";
open my $fh, '<', $file or die "Couldn't open $file!\n";
chomp(our @source_file = <$fh>);
close $fh;
print "Please enter the device name:\n";
chomp(our $dev = <STDIN>);
while ($device_info eq "") {
# Grep the device info from the sms file
my @sms_device = grep(/\Q$dev\E/, @source_file);
if (scalar(@sms_device) > 1) {
my $which_dup = find_the_duplicate(\@sms_device);
if ($which_dup eq "program") {
print "\n-> $sms_dev <- must be a program name instead of a device name." .
"\nChoose the device from the list you are working on, specifically.\n";
foreach my $fix(@sms_device) {
my @fix_array = split(',', $fix);
print "$fix_array[1]\n";
undef @fix_array;
}
chomp($sms_dev = <STDIN>);
} else { $device_info = $which_dup; }
} elsif (scalar(@sms_device) == 1) {
($device_info) = @sms_device;
@sms_device = ();
}
}
When I try the code with an anchor:
my @sms_device = grep(/\Q$dev\E^/, @source_file);
No more activity from the program is noticed. It just sits there like it's waiting on some more input from the user. This is not what I expected to happen. The reason I would like to anchor the search pattern is because there are many, many examples of similarly named devices that have the same character order as the search pattern, but also include additional characters that are ignored in the regexp evaluation. I don't want them to be ignored, in the sense that they are included in matches. I want to force an exact match of the string in the variable.
Thanks in advance for wading through my terribly inexperienced code and communication attempts at detailing my problem.
The device id followed by the start of the string? /\Q$dev\E^/ makes no sense. You want the device id to be preceded by the start of the string and followed by the end of the string.
grep { /^\Q$dev\E\z/ }
Better yet, let's avoid spinning up the regex engine for nothing.
grep { $_ eq $dev }
For example,
$ perl -e'my $dev = "ccc"; CORE::say for grep { /^\Q$dev\E\z/ } qw( accc ccc ccce );'
ccc
$ perl -e'my $dev = "ccc"; CORE::say for grep { $_ eq $dev } qw( accc ccc ccce );'
ccc
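If the identifier is only one comma-separated field of each line rather than the whole line, the same eq comparison can be applied to that field. The following is a sketch under assumptions: lines_for_device is a hypothetical helper, the sample lines are invented, and the column index (1, as suggested by the $fix_array[1] in the question) is a guess about the file layout.

```perl
use strict;
use warnings;

# Hypothetical helper: keep lines whose CSV column $col equals $dev exactly.
sub lines_for_device {
    my ($dev, $col, @lines) = @_;
    return grep {
        my @fields = split /,/, $_, -1;   # -1 keeps trailing empty fields
        defined $fields[$col] && $fields[$col] eq $dev;
    } @lines;
}

# Invented sample data; the real file layout is not shown in the question.
my @source_file = ('p1,dev-1.0,x', 'p1,dev-1.0+b,y', 'p2,dev-1,z');
print "$_\n" for lines_for_device('dev-1.0', 1, @source_file);
```

Since no regex is built from the user's input, characters such as '.', '+', '/', and '-' need no escaping here.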
I would use quotemeta. Here is an example of how it compares:
my $regexp = '\t';
my $metaxp = quotemeta ($regexp);
while (<DATA>) {
print "match \$regexp - $_" if /$regexp/;
print "match \$metaxp - $_" if /$metaxp/;
}
__DATA__
This \t is not a tab
This is a tab
(there is literally a tab in the second line)
The meta version will match line 1, because quotemeta turned \t into \\t (a literal backslash followed by t), and the non-meta (original) version will match line 2, because the regex engine interprets \t as a tab.
match $metaxp - This \t is not a tab
match $regexp - This is a tab
Hopefully you get my meaning.
I think adding $regexp = quotemeta ($regexp) (or doing it when you capture the standard input) should meet your need.
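For the characters mentioned in the question, a short sketch shows what quoting changes. The device string and the candidate strings here are invented for illustration.

```perl
use strict;
use warnings;

my $dev    = 'a.b+c';          # invented input containing regex metacharacters
my $quoted = quotemeta $dev;   # every non-word character gets a backslash

for my $candidate (qw( a.b+c axbbc )) {
    # Unquoted, '.' means "any character" and '+' means "one or more".
    my $raw  = $candidate =~ /^$dev\z/    ? 'match' : 'no match';
    my $safe = $candidate =~ /^$quoted\z/ ? 'match' : 'no match';
    print "$candidate: unquoted=$raw quoted=$safe\n";
}
```

Interpolating the quoted string is equivalent to writing \Q$dev\E inside the pattern.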
Kindly explain why this issue occurs.
My data file:
DATA----1
DATA----2
DATA----3
DATA----4
DATA----5
DATA----6
DATA----7
SAMPLE----1
SAMPLE----12
SAMPLE----13
SAMPLE----2
SAMPLE----3
SAMPLE----4
SAMPLE----5
OTHER----1
OTHER----2
OTHER----3
I need one array holding every line that starts with DATA or SAMPLE, and another array holding the lines that start with SAMPLE and end with a two-digit number.
I got the expected output with the following script:
use strict;
use warnings;
open(FH, "di.txt");
my @file = <FH>;
close(FH);
my @arr2 = grep { $_ =~ m/^SAMPLE.+\d\d$/g } @file; ## this array prints
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
print @arr1,"\n\t~~~~~~~~~~~\n\n",@arr2;
I first wrote it as
use strict;
use warnings;
open(FH, "di.txt");
my @file = <FH>;
close(FH);
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
my @arr2 = grep { $_ =~ m/^SAMPLE.+\d\d$/g } @file; ## this doesn't print
print @arr1,"\n\t~~~~~~~~~~~\n\n",@arr2;
When I run this version, it prints only @arr1.
What is the reason @arr2 doesn't print?
The problem is because of the behaviour of the global match /g option in scalar context
Every scalar variable has a marker that remembers where the most recent global match left off, and hence where the next one should start searching. It enables the use of the \G anchor in regex patterns, as well as while loops like this
my $s = 'aaabacad';
while ( $s =~ /a(.)/g ) {
print "$1 ";
}
which prints
a b c d
In truth you're not interested in a global match in this case, you just want to discover whether OR NOT the pattern can be found in the string. The grep operator applies scalar context to its first parameter, so in using the /g option in this statement
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
you have left every matching element of @file with its marker set to just after DATA or SAMPLE. That means the next match on the same element, m/^SAMPLE.+\d\d$/g, starts looking from there and cannot satisfy the ^ anchor, so the match fails.
The pos function gives you access to the marker, and you can fix your original code by resetting it to the start of the string after the first grep call. If you write this instead
my @arr1 = grep { $_ =~ m/^DATA|^SAMPLE/g } @file;
pos($_) = 0 for @file;
my @arr2 = grep { $_ =~ m/^SAMPLE.+\d\d$/g } @file; ## now this prints
then the output will be what you expected
The correct fix, however, is to write what you mean anyway, which means you should remove the /g option from the pattern matches. This code also works fine, and it's also more concise, more readable, and far less fragile
my @arr1 = grep /^DATA|^SAMPLE/, @file;
my @arr2 = grep /^SAMPLE.+\d\d$/, @file;
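The marker's effect can be reproduced on a single string. In this sketch, probe is a hypothetical helper that runs the two /g matches from the question in scalar context and reports the marker position in between.

```perl
use strict;
use warnings;

# Hypothetical helper: run both /g matches on one string, as grep would.
sub probe {
    my ($s) = @_;
    my $first  = ($s =~ /^DATA|^SAMPLE/g)   ? 1 : 0;  # scalar-context /g match
    my $marker = pos($s);                             # where the next /g match starts
    my $second = ($s =~ /^SAMPLE.+\d\d$/g)  ? 1 : 0;  # resumes from the marker
    return ($first, $marker, $second);
}

my ($first, $marker, $second) = probe("SAMPLE----12\n");
print "first=$first marker=$marker second=$second\n";
```

The second match fails even though the string fits the pattern, because the search resumes at offset 6, where ^ cannot match.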
This is a part of my input Genbank file:
LOCUS AC_000005 34125 bp DNA linear VRL 03-OCT-2005
DEFINITION Human adenovirus type 12, complete genome.
ACCESSION AC_000005 BK000405
VERSION AC_000005.1 GI:56160436
KEYWORDS .
SOURCE Human adenovirus type 12
ORGANISM Human adenovirus type 12
Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE 1 (bases 1 to 34125)
AUTHORS Davison,A.J., Benko,M. and Harrach,B.
TITLE Genetic content and evolution of adenoviruses
JOURNAL J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
PUBMED 14573794
And I want to extract the journal title for example J. Gen. Virol. (not including the issue number and pages)
This is my code and it doesn't give any result so I am wondering what goes wrong. I did use parentheses for $1, $2 etc... And though it worked my tutor told me to try without using that method, use substr instead.
foreach my $line (@lines) {
if ( $line =~ m/JOURNAL/g ) {
$journal_line = $line;
$character = substr( $line, $index, 2 );
if ( $character =~ m/\s\d/ ) {
print substr( $line, 12, $index - 13 );
print "\n";
}
$index++;
}
}
Another way to do this is to take advantage of BioPerl, which can parse GenBank files:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $io=Bio::SeqIO->new(-file=>'AC_000005.1.gb', -format=>'genbank');
my $seq=$io->next_seq;
foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
print $annotation->location . "\n";
}
If you run this script with AC_000005.1 saved in a file called AC_000005.1.gb, you get:
J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
Rather than matching and using substr, it is much easier to use a single regex to capture the whole JOURNAL line and use brackets to capture the text representing the journal information:
foreach my $line (@lines) {
if ($line =~ /JOURNAL\s+(.+)/) {
print "Journal information: $1\n";
}
}
The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the rest of the characters in the line.
To get the text without using $1, I think you're trying to do something like this:
if ($line =~ /JOURNAL/) {
my $ix = length('JOURNAL');
# variable containing the journal name
my $j_name;
# while the journal name is not defined...
while (! $j_name) {
# starting with $ix = the length of the word JOURNAL, get character $ix in the string
if (substr($line, $ix, 1) =~ /\s/) {
# if it is whitespace, increase $ix by one
$ix++;
}
else {
# if it isn't whitespace, we've found the text!!!!!
$j_name = substr($line, $ix);
}
}
}
If you already know how many characters there are in the left-hand column, you can just do substr($line, 12) (or whatever) to retrieve a substring of $line starting at character 12:
foreach my $line (@lines) {
if ($line =~ /JOURNAL/) {
print "Journal information: " . substr($line, 12) . "\n";
}
}
You can combine the two techniques to eliminate the issue number and dates from the journal data:
if ($line =~ /JOURNAL/) {
my $j_name;
my $digit;
my $indent = 12; # the width of the left-hand column
my $ix = $indent; # we'll use this to track the characters in our loop
while (! $digit && $ix < length $line) { # stop at end of line if no digit is found
# starting with $ix = the length of the indent,
# get character $ix in the string
if (substr($line, $ix, 1) =~ /\d/) {
# if it is a digit, we've found the number of the journal
# we can stop looping now. Whew!
$digit = $ix;
# set j_name
# get a substring of $line starting at $indent going to $digit
# (i.e. of length $digit - $indent)
$j_name = substr($line, $indent, $digit-$indent);
}
$ix++;
}
print "Journal information: $j_name\n";
}
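For comparison, the same trimming can be done with one capturing regex, if captures are allowed. This is a sketch: journal_name is a hypothetical helper, and it assumes the volume number is the first digit that follows whitespace on the JOURNAL line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the journal name without volume and pages.
sub journal_name {
    my ($line) = @_;
    # Capture lazily, stopping at the first whitespace followed by a digit.
    return $line =~ /^\s*JOURNAL\s+(.+?)\s+\d/ ? $1 : undef;
}

my $line = '  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)';
print journal_name($line), "\n";
```

Unlike the substr loop, this also drops the space between the name and the volume number.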
I think it would have been easier just to get the data from the Pubmed API! ;)
The point of the overall script is to:
step 1) open a single column file and read off first entry.
step 2) open a second file containing lots of rows and columns, read off EACH line one at a time, and find anything in that line that matches the first entry from the first file.
step 3) if a match is found, then "do something constructive"; if not, go back to the first file, take the second entry, and repeat step 2 and step 3, and so on...
here is the script:
#!/usr/bin/perl
use strict; #use warnings;
unless(@ARGV) {
print "\nusage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = $ARGV[0];
chomp( $list );
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
my( @list ) = (<LIST>);
close LIST;
open (CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my(@spreadsheet) = (<CHR1>);
close CHR1;
for (my $i = 0; $i < scalar @list; $i++ ) {
print "$i in list is $list[$i]\n";
for (my $j = 1; $j < scalar @spreadsheet; $j++ ) {
#print "$spreadsheet[$j]\n";
if ( $spreadsheet[$j] ) {
print "will $list[$i] match with $spreadsheet[$j]?\n";
}
else { print "no match\n" };
} #for
} #for
I plan to use a regex in the line if ( $spreadsheet[$j] ) { but am having a problem at this step as it is now. On the first iteration, the line print "will $list[$i] match with $spreadsheet[$j]?\n"; prints $list[$i] OK but does not print $spreadsheet[$j]. This line prints both variables correctly on the second and following iterations. I do not see why.
At first glance nothing looks overtly incorrect. As mentioned in the comments the $j = 1 looks questionable but perhaps you are skipping the first row on purpose.
Here is a more perlish starting point that is tested. If it does not work then you have something going on with your input files.
Note the extended trailing whitespace removal. Sometimes if you open a WINDOWS file on a UNIX machine and use chomp, you can have embedded \r in your text that causes weird things to happen to printed output.
#!/usr/bin/perl
use strict; #use warnings;
unless(@ARGV) {
print "\nusage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = shift;
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
open(CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my @spreadsheet = map { s/\s+$//; $_ } <CHR1>;
close CHR1;
# s/\s+$//; is like chomp but trims all trailing whitespace even
# WINDOWS files opened on a UNIX system.
for my $item (<LIST>) {
$item =~ s/\s+$//; # trim all trailing whitespace
print "==> processing '$item'\n";
for my $row (@spreadsheet) {
if ($row =~ /\Q$item\E/) { # see perlre for \Q \E
print "match '$row'\n";
}
else {
print "no match '$row'\n";
}
}
}
close LIST;
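The difference between chomp and the trailing-whitespace substitution can be checked on a single string. In this sketch, rtrim is a hypothetical helper name and the sample line is invented; it assumes the default input record separator ("\n").

```perl
use strict;
use warnings;

# Hypothetical helper: strip every trailing whitespace character.
sub rtrim {
    my ($s) = @_;
    $s =~ s/\s+$//;
    return $s;
}

my $dos_line = "GENE1\r\n";        # a line as it arrives from a Windows file
chomp(my $chomped = $dos_line);    # removes "\n" only, the "\r" remains
printf "chomp leaves %d chars, rtrim leaves %d\n",
    length $chomped, length rtrim($dos_line);
```

A leftover \r moves the cursor back to the start of the line when printed, which can make earlier output on that line appear to vanish.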
I have a text file filled with sentences with unique pattern. The unique pattern is:
NAME [ e_NAME ]
Simple rule: if "e_" appears inside the brackets, it must be followed by the "NAME" of those brackets!
The problem arises when the string is complicated. Here are the edge cases that may be hard to analyse:
Lines that won't match the rule:
(1) NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1]
(2) NAME1[blabla] + NAME2[e_BAD2]
(3) NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3]
(4) NAME1[e_NAME1BAD1] -> means it has to be only NAME1
Lines that match the rule:
(1) FOO1[blabla + 1]
(2) [blalbla] + bla
(3) bla + blabla
(4) FOO1[ccc + ddd + FOO2[e_FOO2]] = 123
(5) FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3]
I already asked this question, but I couldn't catch these edge cases...
Edited after requirements were clarified
Either Text::Balanced or Regexp::Common might be useful. I initially posted an answer using the former but didn't like it very much. The following example uses Regexp::Common and seems fairly straightforward.
use strict;
use warnings;
use Regexp::Common;
my $PRE = '[^[]*?';
my $VAR = '\w+';
my $BRACK = $RE{balanced}{-parens=>'[]'};
my $POST = '.*';
while (<DATA>){
my ($bad, $full);
# Brackets, if any, must balance
$bad = 1 unless s/\[/[/g == s/\]/]/g;
$full = $_;
until ($bad){
# Find some bracketed text and store all components.
my ($pre, $var, $brack, $post) =
$full =~ /^($PRE)($VAR)($BRACK)($POST)$/;
last unless defined $brack;
# Create a copy of the bracketed text, removing both the outer
# brackets and all instances of inner-bracketed text.
chop (my $clean = substr $brack, 1);
$clean =~ s/$BRACK/ /g;
# If e_FOO exists, FOO must equal $var.
$bad = 1 if $clean =~ /e_(\w+)/ and $1 ne $var;
# Remove the part of $full we've already checked.
substr($full, 0, length($pre) + length($var) + 1, '');
}
print if $bad;
}
# Your test data, with some trailing comments.
__DATA__
NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1] NOT OK 1
NAME1[blabla] + NAME2[e_BAD2] NOT OK 2
NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3] NOT OK 3
NAME1[e_NAME1BAD1] NOT OK 4
FOO1[blabla + 1] OK 1
[blalbla] + bla OK 2
bla + blabla OK 3
FOO1[ccc + ddd + FOO2[e_FOO2]] = 123 OK 4
FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3] OK 5
Maybe you are looking for something like:
if ($string =~ /(\w+)\[e_(\w+)/ && $1 eq $2) {
print "Pattern '$1' contained in string '$string'\n";
}
Based on the accepted answer to your first question, I came up with this:
use strict;
use warnings;
while (<DATA>) {
my $l = $_;
while (s/(\w+)\[([^\[\]]*)\]//) {
my ($n, $chk) = ($1, $2);
unless ($chk =~ /\be_$n\b/) {
warn "Bad line: $l";
last;
}
}
}
The \b checks for a word boundary. This version still doesn't check for unbalanced brackets, but it does seem to catch all the examples you gave, and will also complain when the e_NAME1 is inside another nested block, like so:
NAME1[stuff + NAME2[e_NAME1 + e_NAME2] + morestuff]
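As far as I can tell, the loop above warns about any bracketed group that lacks its own e_NAME, so a line such as FOO1[blabla + 1] would also be reported. The following variant is a sketch of the rule as the question states it: it objects only when an e_ tag is present and names a different group. line_ok is a hypothetical helper name, and unbalanced brackets are still not checked.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every e_TAG sits directly inside TAG[...].
sub line_ok {
    my ($line) = @_;
    # Repeatedly strip the innermost NAME[...] group and inspect its contents.
    while ($line =~ s/(\w+)\[([^\[\]]*)\]//) {
        my ($name, $inner) = ($1, $2);
        while ($inner =~ /\be_(\w+)/g) {
            return 0 if $1 ne $name;   # tag names some other group
        }
    }
    return 1;
}

for my $line ('NAME1[blabla] + NAME2[e_BAD2]', 'FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3]') {
    print line_ok($line) ? "OK:  $line\n" : "BAD: $line\n";
}
```

Stripping innermost groups first means an e_ tag is only ever compared with the name of the bracket that directly encloses it.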
use Text::Balanced;
CPAN is wonderful.