parse multiple lines in perl regular expression and extract value - regex

I am a beginner in Perl. I have a text file with text similar to the example below. I need to extract VALUE="<NEEDED VALUE>". Say for SPINACH, I should be getting SALAD alone.
How can I use a Perl regex to get the value? I need to parse multiple lines to get it, i.e. between each #ifonly ... #endifonly pair.
$ cat check.txt
#ifonly APPLE CARROT SPINACH
VALUE="SALAD" REQUIRED="yes"
QW RETEWRT OIOUR
#endifonly
#ifonly APPLE MANGO ORANGE CARROT
VALUE="JUICE" REQUIRED="yes"
as df fg
#endifonly
This is what I have tried so far:
while (<$file>)
{
    if (m/#ifonly .+ SPINACH .+ VALUE=(")([\w]*)(") .+ #endifonly/g)
    {
        my $chosen = $2;
    }
}

use strict;
use warnings;
use 5.010;

while (<DATA>) {
    my $rc = /#ifonly .+ SPINACH/ .. (my ($value) = /VALUE="([^"]*)"/);
    next unless $rc =~ /E0$/;
    say $value;
}

__DATA__
#ifonly APPLE CARROT SPINACH
VALUE="SALAD" REQUIRED="yes"
QW RETEWRT OIOUR
#endifonly
#ifonly APPLE MANGO ORANGE CARROT
VALUE="JUICE" REQUIRED="yes"
as df fg
#endifonly
This uses a small trick described by brian d foy here. As the link describes, it relies on the scalar range operator (the flip-flop).
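To make the trick a little more visible, here is a minimal, purely illustrative sketch of what the range operator returns in scalar context: a sequence number for every line inside the range, with "E0" appended on the line that ends the range, which is exactly what the $rc =~ /E0$/ test above relies on.
use strict;
use warnings;

while (<DATA>) {
    my $rc = /^#ifonly/ .. /^#endifonly/;
    # $rc is "" outside the range, 1, 2, ... inside, and e.g. "3E0" on the closing line
    printf "%-4s %s", $rc, $_;
}

__DATA__
#ifonly APPLE CARROT SPINACH
VALUE="SALAD" REQUIRED="yes"
#endifonly
as df fg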

In case your file is very big (or you want to read it line by line for some other reason) you could do it as follows:
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

my ($file, $keyword);

# now get command line options (see Usage note below)
GetOptions(
    "f=s" => \$file,
    "k=s" => \$keyword,
);

# if either the file or the keyword has not been provided, display a
# help text and exit
if (! $file || ! $keyword) {
    print STDERR <<EOF;
Usage: script.pl -f filename -k keyword
EOF
    exit(1);
}

my $found;         # indicator that the keyword has been found
my $returned_word; # will store the word you want to retrieve

open FILE, '<', $file or die "Cannot open file '$file': $!";
while (<FILE>) {
    if (/$keyword/) {
        $found = 1;
    }
    # the following condition is true for every line between '#ifonly'
    # and '#endifonly' - and we only act on those lines once the keyword
    # has been found (note the parentheses around the range: '..' binds
    # more loosely than '&&')
    if ( (/^#ifonly/ .. /^#endifonly/) && $found ) {
        if (/VALUE="(\w+)"/) {
            $returned_word = $1;
            print "looking for $keyword --> found $returned_word\n";
            last; # if you want to get ALL values after the keyword,
                  # remove the 'last' statement, as it makes the script
                  # exit the while loop
        }
    }
}
close FILE;
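For example, assuming the sample file from the question is saved as check.txt next to the script (named script.pl as in the usage message), a run might look like this:

$ perl script.pl -f check.txt -k SPINACH
looking for SPINACH --> found SALAD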

You can read the file contents into a string and then search for the pattern in the string:
my $file;
$file .= $_ while (<>);

if ($file =~ /#ifonly.+?\bSPINACH\b.+?VALUE="(\w*)".+?#endifonly/s) {
    print $1;
}
Your original regex needs some tweaking:
You need to make your quantifiers non-greedy.
Use the /s modifier so that . matches newlines as well.

Here's another answer based on the flip-flop operator:
use strict;
use warnings;
use 5.010;

while (<$file>)
{
    if ( (/^#ifonly.*\bSPINACH\b/ .. /^#endifonly/) &&
         (my ($chosen) = /^VALUE="(\w+)"/) )
    {
        say $chosen;
    }
}
This solution applies the second test to all of the lines in the range. The trick @Hugmeir used to exclude the start and end lines isn't needed because the "inner" regex, /^VALUE="(\w+)"/, can never match them anyway (I added the ^ anchor to all regexes to make doubly sure of that).

These two lines in one answer given two days ago
my $file;
$file .= $_ while (<>);
are not very efficient. Perl will likely read the file in big chunks, break those chunks into lines of text for the <>, and then the .= will join those lines back together to make one big string. It would be more efficient to slurp the file. The basic style is to alter $/, the input record separator:
undef $/;
$file = <>;
The module File::Slurp (see perldoc File::Slurp) may be even better.
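For instance, a minimal sketch using File::Slurp's read_file (assuming the module is installed and the data is in check.txt as in the question):

use strict;
use warnings;
use File::Slurp qw(read_file);

my $file = read_file('check.txt');   # slurp the whole file into one string
if ($file =~ /#ifonly.+?\bSPINACH\b.+?VALUE="(\w*)".+?#endifonly/s) {
    print "$1\n";
}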


Can anyone spot the problem in my for loop in Perl?

I have a text file which contains a protein sequence in FASTA format. FASTA files have a first line which is a header, and the rest is the sequence of interest. Each letter is one amino acid. I want to write a program that finds the motif VSEX (X being any amino acid and the others being specific ones) and prints out the motif itself and the position where it was found. This is my code:
#!usr/bin/perl
open (IN,'P51170.fasta.txt');
while(<IN>) {
    $seq.=$_;
    $seq=~s/ //g;
    chomp $seq;
}
#print $seq;
$j=0;
$l= length $seq;
#print $l;
for ($i=0, $i<=$l-4,$i++){
    $j=$i+1;
    $motif= substr ($seq,$i,4);
    if ($motif=~m/VSE(.)/) {
        print "motif $motif found in position $j \n" ;
    }
}
I'm pretty sure I have messed up the loop, but I don't know what went wrong. The output I get on cygwin is the following:
motif found in position 2
motif found in position 2
motif found in position 2
So some general perl tips:
Always use strict; and use warnings; - that'll tell you when your code is doing something bogus.
Anyway, in trying to figure out what's going wrong (although the other answer correctly points out that Perl for loops need semicolons, not commas), I rewrote it a little to accomplish what I think is the same result:
#!/usr/bin/perl
use strict;
use warnings;

# read in source data
open (my $input_data, '<', 'P51170.fasta.txt') or die $!;

# extract 'first line':
# >sp|P51170|SCNNG_HUMAN Amiloride-sensitive sodium channel subunit gamma OS=Homo sapiens OX=9606 GN=SCNN1G PE=1 SV=4
my $header = <$input_data>;

# slurp all the rest of the file into the seq string.
# $/ is the 'end of line' separator, thus we temporarily undefine it
# to read the whole file rather than just one line
my $seq = do { local $/; <$input_data> };

# remove linefeeds from the sequence
$seq =~ s/[\r\n]//g;
close $input_data;

# printing what either looks like for clarity
print $header, "\n\n";
print $seq, "\n\n";

# iterate regex matches against $seq
while ( $seq =~ m/VSE(.)/g ) {
    # use pos() function to report where matches happened. $1 is the contents
    # of the first 'capture bracket'.
    print "motif $1 at ", pos($seq), "\n";
}
So rather than manually for-looping through your data, we instead use the regex engine and the perl pos() function to find where any relevant matches occur.
Use semicolons in the C-style loop:
for ($i=0; $i<=$l-4; $i++) {
Or, use a Perl style loop:
for my $i (0 .. $l - 4) {
But you don't have to loop over the positions, Perl can do that for you (see pos):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

open my $in, '<', 'P51170.fasta.txt' or die $!;
my $seq = "";
while (<$in>) {
    chomp;
    s/ //g;
    $seq .= $_;
}

while ($seq =~ /(VSE.)/g) {
    say "Motif $1 found at ", pos($seq) - 3;
}
Note that I followed some good practices:
I used strict and warnings;
I checked the return value of open;
I used the 3-argument version of open;
I used a lexical filehandle;
I don't remove spaces from $seq again and again, only from the newly read lines.
Your base program can be reduced to a one-liner:
perl -0777 -nwE's/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
-0777 is slurp mode: the whole file is read in one go
-n wraps the code in a while (<>) loop, so input comes either from standard input or from file names given as arguments
-E enables features (say in this case)
s/\s+//g removes all whitespace from the default variable $_ -- the input from the file
$1 contains the string matched by the parentheses in the regex
pos reports the position of the match
Though of course this includes the header in the offset, which I assume you do not want. So we're going to have to skip the slurp mode -0777 and the -n switch. It will clutter the code somewhat:
perl -wE'$h = <>; $_ = do { local $/; <> }; s/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
Here we use the idiomatic do block slurp in combination with the diamond operator <>. The diamond operator will use either standard input, or a file name, just like -n.
The first <> reads the header, which can be used later.
The one-liner as a program file looks like this:
use strict;
use warnings; # never skip these pragmas
use feature 'say';
my $h = <>; # header, skip for now
$_ = do { local $/; <> }; # slurp rest of file
s/\s+//g; # remove whitespace
say "motif $1 found at " . pos while /VSE(.)/g;
Use it like this:
perl fasta.pl P51170.fasta.txt

Regex/Perl to match blocks of text that contain a string

So I have a log file that looks something like this:
EVENT-header
apple
orange
peach
blueberry

EVENT-header
bike
car
blueberry

EVENT-header
reddit
hacker news
stack overflow
slashdot?
voat
What I am trying to do is extract the blocks of text (from EVENT-header to the two newlines before the next EVENT-header) that contain the word "peach".
I think this is a problem that regex would solve, but I am having trouble making regex that does this. Here's what I have come up so far:
's/EVENT-header((?!\n\n).)+peach((?!\n\n).)+\n\n/&/p'
I'm not an expert at this. Is there an easy way to do it using regex/perl?
You can do this easily using paragraph mode, which makes perl read blocks of text delimited by blank lines:
perl -00 -ne'print if /peach/' logfile.log
If you prefer a full program file then it looks like this
use strict;
use warnings;

open my $fh, '<', 'logfile.log' or die $!;

{
    local $/ = '';
    while ( <$fh> ) {
        print if /peach/;
    }
}
EVENT-header\n[\s\S]*?(?=(?:\n\nEVENT-header|$))
You can use this. See the demo:
https://regex101.com/r/hR7tH4/3
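For reference, here is a minimal sketch of applying that pattern from Perl, slurping the log and keeping only the blocks that mention peach (the logfile.log name is just an example):

use strict;
use warnings;

open my $fh, '<', 'logfile.log' or die $!;
my $log = do { local $/; <$fh> };   # slurp the whole file

while ( $log =~ /(EVENT-header\n[\s\S]*?)(?=(?:\n\nEVENT-header|$))/g ) {
    print "$1\n\n" if $1 =~ /peach/;
}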
There are various ways to do this, with multiline regex match being a good candidate. If the data file is as regular as it appears, specifically with each "record" separated by the marker 'EVENT-header', then you can also use the trick of setting $/ (aka $RS aka $INPUT_RECORD_SEPARATOR) to be this marker and then slurping the file into an array. You'll get an array entry for each record in the file, and then it's trivial to loop over the array, select the elements that match 'peach', and print out the entire containing record.
For example:
#!/usr/bin/perl -w
use strict;

$/ = 'EVENT-header';

my (@entries, $entry);
my $infile = 'data.txt';

open(IN, "<$infile") or die "Aaargh: $^E\n";
@entries = <IN>;
chomp @entries;
close(IN);

foreach $entry (@entries)
{
    if ($entry =~ m/peach/)
    {
        print "matching entry: $entry\n";
    }
}
Borodin has already given the best solution for your question, but here is some code in case you don't want to use a one-liner:
#!/usr/bin/perl
use warnings;
use strict;

local $/ = ""; # to enable paragraph mode

open my $fh, "<", "input.log" or die "Unable to open file: $!";

while (my $line = <$fh>)
{
    chomp $line;
    if ($line =~ m/peach/)
    {
        print $line, "\n";
    }
}
Output:
EVENT-header
apple
orange
peach
blueberry

How to append a RANDOM string to all occurrences of another string in file

I'm trying to write a bash script that would modify all occurrences of a certain string in a file.
I have a file with a bunch of text, in which urls occur. All urls are in the following format: http://goo.gl/abc23 (that's goo.gl/, followed by 4 or 5 alphanumeric characters).
What I'd like to do is append a string to all urls. I managed (with the help of user Dan Fego) to get this done with sed, but it only works for appending a static string.
What I'm looking for is a way to append a different string to each occurrence. Let's say I have a function generatestring that echoes a different string every time. I'd like to append a different generated string to each url: http://goo.gl/abc23 would become http://goo.gl/abc23?GeneratedString1, http://goo.gl/JB007 would become http://goo.gl/JB007?GeneratedString2, and so on.
Does anyone know if this can be done? I've been told that perl is the way to go, but I have zero experience with perl. That's why I'm asking here.
Thanks in advance for any help.
ETA: Assuming the URLs are embedded in other text:
$ perl -lnwe 's#http://goo.gl/\w{5}\K\b# "?" . rand(100) #ge; print' googl.txt
For example:
$ cat googl
random text here, and perhaps some html <a href="http://goo.gl/abc23">
more stuff http://goo.gl/abc23 foo fake link http://foo.bar/abc12
longer http://goo.gl/abc23123123 foo fake link http://foo.bar/abc12
$ perl -lnwe 's#http://goo.gl/\w{5}\K\b# "?" . rand(100) #ge; print' googl
random text here, and perhaps some html <a href="http://goo.gl/abc23?69.998515">
more stuff http://goo.gl/abc23?26.186867532985 foo fake link http://foo.bar/abc12
longer http://goo.gl/abc23123123 foo fake link http://foo.bar/abc12
-l chomps the input and adds a newline to print. -n adds a while (<>) loop around the script, which basically means it reads either from argument file names or from STDIN. \K keeps everything matched so far out of the substitution (so the URL itself stays in place and the replacement is effectively appended), and \b is a word boundary, so that you do not match partial strings.
Do note that it will still match http://goo.gl/abc12/foo, but since I do not know what your data looks like, you will have to determine what boundaries are acceptable.
Of course, rand(100) is just there as a placeholder for whatever function you intend to use.
If you needed the script version, here's the deparsed code:
use strict;
use warnings;

BEGIN { $/ = "\n"; $\ = "\n"; }
while (<>) {
    chomp;
    s[http://goo.gl/\w{5}\K\b]['?' . rand(100);]eg;
    print;
}
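If you already have a generator like the generatestring mentioned in the question, a sketch of plugging it in place of the rand(100) placeholder could look like this (the counter-based sub is purely illustrative):

use strict;
use warnings;

# illustrative stand-in for the question's generatestring function
my $n = 0;
sub generatestring { return "GeneratedString" . ++$n }

while (<>) {
    chomp;
    s{http://goo\.gl/\w{5}\K\b}{ '?' . generatestring() }eg;
    print "$_\n";
}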
If the URLs aren't alone in each line, you can do:
#!/usr/bin/perl
use strict;
use warnings;

sub generate {
    my $i = shift;
    return "GeneratedString$i";
}

my $i = 0;
while (my $line = <>) {
    $line =~ s~(http://\S+)~$1 . "?" . generate(++$i)~eg;
    print $line;
}
usage:
test.pl file_to_modify
output:
http://goo.gl/abc23?GeneratedString1
http://goo.gl/JB007?GeneratedString2
You can do it in a lot of languages, but in Perl it's pretty straight forward:
#!/usr/bin/perl
use strict;
use constant MAX_RANDOM_STRING_LENGTH => 5;

my $regex_url = '(http://goo.gl/\w{5})';
my @alphanumeric = ("A".."Z", "0".."9");
my $random_cap = $#alphanumeric + 1;

sub generate_string
{
    my $string = "?";
    for (my $i = 0; $i < MAX_RANDOM_STRING_LENGTH; $i++)
    {
        $string .= $alphanumeric[int(rand($random_cap))];
    }
    return $string;
}

my @input = <>;
for (@input)
{
    my $cur = $_;
    while ($cur =~ /$regex_url/)
    {
        $cur = $';
        my $new_url = $1 . generate_string();
        s/$1/$new_url/g;
    }
}
print(@input);
Usage:
script_name.pl < input.txt > output.txt
This might work for you:
gs(){ echo $(tr -cd '[:alnum:]' </dev/urandom | head -c5); }
export -f gs
cat <<\! >file
> http://goo.gl/abc23
> http://goo.gl/JB007
> bunch of text http://goo.gl/qwert another bunch of text
> another bot http://goo.gl/qwert another bot http://goo.gl/qaza
!
sed '\|http://goo\.gl/[0-9a-zA-Z]\{4,5\}\>|{s//&?'\''$(gs)'\''/g;s/^/echo '\''/;s/$/'\''/}' file |
sh
http://goo.gl/abc23?0Az23
http://goo.gl/JB007?ugczB
bunch of text http://goo.gl/qwert?LDW27 another bunch of text
another bot http://goo.gl/qwert?U9my2 another bot http://goo.gl/qaza?Ybtlp

Perl regex which grabs ALL double-letter occurrences in a line

Still plugging away at teaching myself Perl. I'm trying to write some code that will count the lines of a file that contain double letters and then place parentheses around those double letters.
Now what I've come up with will find the first occurrence of double letters, but not any other ones. For instance, if the line is:
Amp, James Watt, Bob Transformer, etc. These pioneers conducted many
My code will render this:
19 Amp, James Wa(tt), Bob Transformer, etc. These pioneers conducted many
The "19" is the count (of lines containing double letters) and it gets the "tt" of "Watt" but misses the "ee" in "pioneers".
Below is my code:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";

my $counter = 0;
while (<FH>) {
    chomp();
    if (/(\w)\1/) {
        $counter += 1;
        s/$&/\($&\)/g;
        print "\n\n$counter $_\n\n";
    } else {
        print "$_\n";
    }
}
close(FH);
What am I overlooking?
use strict;
use warnings;
use 5.010;
use autodie;

my $file = '/path/to/file/electricity.txt';
open my $fh, '<', $file;

my $counter = 0;
while (<$fh>) {
    chomp;
    if (/(\w)\1/) {
        $counter++;
        s/
            (?<full>
                (?<letter>\p{L})
                \g{letter}
            )
        /($+{full})/xg;
        $_ = $counter . ' ' . $_;
    }
    say;
}
You are overlooking a few things: strict and warnings; use 5.010 (or higher!) for say; autodie, so you don't have to keep typing those 'or die's; lexical filehandles and the three-argument form of open; a bit nitpicky, but knowing when (not) to use parens for function calls; understanding why you shouldn't use $&; and the autoincrement operator.
But on the regex part specifically: $& is only set on matches (m//), not substitutions. (Actually no, ysth is right as usual. Sorry!)
(I took the liberty of modifying your regex a bit; it makes use of named captures - (?<name>...) instead of bare parens, accessed through \g{} notation inside the regex and the %+ hash outside of it - and Unicode-style properties - \p{Etc}.) A lot more about those in perlre and perluniprops, respectively.
You need to use a back reference:
#! /usr/bin/env perl
use warnings;
use strict;
my $line = "this is a doubble letter test of my scrippt";
$line =~ s/([[:alpha:]])(\1)/($1$2)/g;
print "$line\n";
And now the test.
$ ./test.pl
this is a dou(bb)le le(tt)er test of my scri(pp)t
It works!
When you do a substitution, you use the $1 to represent what is in the parentheses. When you are referring to a part of the regular expression itself, you use the \1 form.
The [[:alpha:]] is a special POSIX class. You can find out more information by typing in
$ perldoc perlre
at the command line.
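Here is a tiny, purely illustrative sketch of that distinction between \1 in the pattern and $1 in the replacement:

my $s = "look at the moon";
$s =~ s/(o)\1/<$1$1>/g;   # \1 is a backreference inside the pattern,
                          # $1 is the captured text, used in the replacement
print "$s\n";             # prints: l<oo>k at the m<oo>n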
You're overcomplicating things by messing around with $&. s///g returns the number of substitutions performed when used in scalar context, so you can do it all in one shot without needing to count matches by hand or track the position of each match:
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = $text =~ s/(\w)\1/($1$1)/g;
print "$doubles $text\n";
Output:
4 James Wa(tt), a pion(ee)r of wa(tt)age engin(ee)ring
Edit: OP stated in comments that the exercise in question says not to use =~, so here's a non-regex-based solution, since all regex matches use =~ (implicitly or explicitly):
#!/usr/bin/env perl
use strict;
use warnings;

my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = 0;
for my $i (reverse 1 .. length $text) {
    if (substr($text, $i, 1) eq substr($text, $i - 1, 1)) {
        $doubles++;
        substr($text, $i - 1, 2) = '(' . substr($text, $i - 1, 2) . ')';
    }
}
print "$doubles $text\n";
The problem is that you're using $& in the second regex, and $& only contains the first occurrence of a double-letter set matched by the if:
if (/(\w)\1/) { # first occurrence matched, so the pattern in the replace regex will only be that particular set of double letters
Try doing something like this:
s/(\w)\1/\($1$1\)/g; instead of s/$&/\($&\)/g;
Full code after editing:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";

my $counter = 0;
while (<FH>) {
    chomp();
    if (s/(\w)\1/\($1$1\)/g) {
        $counter++;
        print "\n\n$counter $_\n\n";
    } else {
        print "$_\n";
    }
}
close(FH);
Notice that you can use the s///g substitution in a conditional statement, which is true when a replacement occurred.

How to search for lines in a file between two timestamps using Perl?

In Perl I am trying to read a log file and print only the lines that have a timestamp between two specific times. The time format is hh:mm:ss and this is always the third field on each log line. For example, I would be searching for lines that fall between 12:52:33 and 12:59:33.
I am new to Perl and have no idea which route to take to even begin to program this. I am pretty sure this would use some type of regex, but for the life of me I cannot even begin to fathom what that would be. Could someone please assist me with this?
Also, to make this more difficult I have to do this with the core Perl modules because my company will not allow me to use any other modules until they have been tested and verified there will be no ill effects on any of the systems the script may interact with.
In pseudocode, you'd do something like this:
read in the file line by line:
parse the timestamp for this line.
if it's less than the start time, skip to the next line.
if it's greater than the end time, skip to the next line!
else: this is a line you want: print it out.
This may be too advanced for your needs, but the flip-flop operator .. immediately comes to mind as something that would be useful here.
For reading in a file from stdin, this is the conventional pattern:
while (my $line = <>)
{
# do stuff...
}
Parsing a line into fields can be done easily with split (see perldoc -f split). You will probably need to split the line by tabs or spaces, depending on the format.
Once you've got the particular field (containing the timestamp), you can examine it using a customized regexp. Read about those at perldoc perlre.
Here's something which might get you closer:
use strict;
use warnings;
use POSIX 'mktime';

# mktime(sec, min, hour, mday, mon, year) - use a fixed dummy date so
# that the times of day can be compared to each other
my $starttime = mktime(33, 52, 12, 1, 0, 100);
my $endtime   = mktime(33, 59, 12, 1, 0, 100);

while (my $line = <>)
{
    # split into fields using whitespace as the delimiter
    my @fields = split(/\s+/, $line);

    # the timestamp is the 3rd field
    my $timestamp = $fields[2];

    my ($hour, $min, $sec) = split(':', $timestamp);
    my $time = mktime($sec, $min, $hour, 1, 0, 100);

    next unless $time >= $starttime && $time <= $endtime;
    print $line;
}
If the start and end times are known, a Perl one-liner with a flip-flop operator is what you need:
perl -ne 'print if /12:52:33/../12:59:33/' logFile
If there is some underlying logic needed in order for you to determine the start and end times, then 'unroll' the one-liner to a formal script:
use strict;
use warnings;

open my $log, '<', 'logFile' or die $!;

my $startTime = get_start_time(); # Sets $startTime in hh:mm:ss format
my $endTime   = get_end_time();   # Sets $endTime in hh:mm:ss format

while ( <$log> ) {
    print if /$startTime/../$endTime/;
}
As noted by Ether's comment, this will fail if the exact time is not present. If this is a possibility, one might implement the following logic instead:
use strict;
use warnings;

open my $log, '<', 'logFile' or die $!;

my $startTime = get_start_time(); # Sets $startTime in hh:mm:ss format
my $endTime   = get_end_time();   # Sets $endTime in hh:mm:ss format

while ( <$log> ) {
    my $time = (split /,/, $_)[2];   # Assuming fields are comma-separated
                                     # and the timestamp is the 3rd field
    last if $time gt $endTime;       # Stop when the stop time is reached
    print if $time ge $startTime;
}
If each line in the file has the time stamp, then in 'sed' you could write:
sed -n '/12:52:33/,/12:59:33/p' logfile
This will echo the relevant lines.
There is a Perl program, s2p, that will convert 'sed' scripts to Perl.
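For example, something along these lines should produce an equivalent Perl script (s2p ships with perl; treat the exact invocation as a sketch):

s2p -n -e '/12:52:33/,/12:59:33/p' > between.pl
perl between.pl logfile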
The basic Perl structure is along the lines of:
my $atfirst = 0;
my $atend   = 0;
while (<>)
{
    last if $atend;
    $atfirst = 1 if m/12:52:33/;
    $atend   = 1 if m/12:59:33/;
    if ($atfirst)
    {
        # process line as required
    }
}
Note that as written, the code will process the first line that matches the end marker. If you don't want that, move the 'last' after the test.
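That variant would look like this, with the processing still left as a placeholder comment:

my $atfirst = 0;
my $atend   = 0;
while (<>)
{
    $atfirst = 1 if m/12:52:33/;
    $atend   = 1 if m/12:59:33/;
    last if $atend;
    if ($atfirst)
    {
        # process line as required
    }
}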
If your log files are segregated by day, you could convert the timestamps to seconds and compare those. (If not, use the technique from my answer to a question you asked earlier.)
Say your log is
12:52:32 outside
12:52:43 strictly inside
12:59:33 end
12:59:34 outside
Then with
#! /usr/bin/perl
use warnings;
use strict;

my $LOGPATH = "/tmp/foo.log";

sub usage { "Usage: $0 start-time end-time\n" }

sub to_seconds {
    my ($h, $m, $s) = split /:/, $_[0];
    $h * 60 * 60 +
    $m * 60 +
    $s;
}

die usage unless @ARGV == 2;
my ($start, $end) = map to_seconds($_), @ARGV;

open my $log, "<", $LOGPATH or die "$0: open $LOGPATH: $!";

while (<$log>) {
    if (/^(\d+:\d+:\d+)\s+/) {
        my $time = to_seconds $1;
        print if $time >= $start && $time <= $end;
    }
    else {
        warn "$0: $LOGPATH:$.: no timestamp!\n";
    }
}
you'd get the following output:
$ ./between 12:52:33 12:59:33
12:52:43 strictly inside
12:59:33 end