perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"` - regex

This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?

A linefeed is matched by \n
echo "a\n\nb" | perl -pe's/\n/z/'
This prints azzbz, without a trailing newline, so the next prompt appears on the same line. Note that the program is fed one line at a time, so there is no need for the /g modifier. (That is also why \n\n doesn't match.) The /m modifier is likewise unrelated to this example.†
I don't know in what form this is used but I'd imagine not with echo feeding the input? Then better test it with input in a file, or in a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two
last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match at the beginning and end of lines inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will add z at the end of each line inside the string. Note, however, that $ is a zero-width assertion, so the newlines themselves stay.

You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'

Regex: Interpret groups with the same content as a single group

I have the following situation:
^ID[ \t]*=[ \t]*('(.*)'|"(.*)")
When the file contains
ID = '01'
the content 01 is captured by the second group. When it contains
ID = "01"
it is captured by the third group instead.
This causes a problem for me with perl:
perl -lne "print \$2 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/" test.txt
If the group with single quotes matches, I get the output:
01
Otherwise I get an empty string.
How do I make both the single-quote case and the double-quote case end up in group two of the regex?
You can print both the groups, as they can never match at the same time:
perl -lne "print \$2.\$3 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/"
or remember the quotes in $2 and use $3 for the quoted string, followed by the remembered quote:
perl -lne "print \$3 if /^ID[ \t]*=[ \t]*((['\"])(.*)\2)/"
This looks like it's a good candidate for the branch reset operator, (?|...). Either capture in that alternation is $1, and the branch-reset construct takes care of the grouping without capturing anything:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
say $1 if m/^ID \h* = \h* (?|'(\d+)'|"(\d+)") /x
}
You need v5.10, and that allows you to use the \h to match horizontal whitespace.
But, you don't need to repeat the pattern. You can match the quote and match that same quote later. A relative backreference, \g{-N}, can do that:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
say $2 if m/^ID \h* = \h* (['"])(\d+)\g{-2} /x
}
I prefer that \g{-2} because I usually don't have to update numbering if I change the pattern to include more captures before the thing it refers to.
And, since this is a one-liner, don't type out the literal quotes (as ikegami has already shown):
say $2 if m/^ID \h* = \h* ([\x22\x27])(\d+)\g{-2} /x
Only one of the two will be defined, so simply use the one that's defined.
perl -nle'print $1//$2 if /^ID\h*=\h*(?:\x27(.*)\x27|"(.*)")/' # \x27 is '
You could also use a backreference.
perl -nle'print $2 if /^ID\h*=\h*(["\x27])(.*)\1/'
Note that all the provided solutions including these two fail (leave the escape sequence in) if you have something like ID="abc\"def" or ID="abc\ndef", assuming those are supported.
Thank you @brian_d_foy:
perl -lne "print \$1 if /^ID\h*=\h*(?|'(.*)'|\"(.*)\")/" test.txt
Or better:
perl -lne "print \$2 if /^ID\h*=\h*(['\"])(.*)\1/" test.txt
I have decided to also accept
ID = 01 #Followed by one or more horizontal spaces.
In addition to:
ID = "01" #Followed by one or more horizontal spaces.
And:
ID = '01' #Followed by one or more horizontal spaces.
Therefore I have adopted a rather complex solution:
perl -lne "print \$2 if /^ID\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/" test.txt
I have fused both of your solutions, @brian_d_foy. The doubled parentheses put the second alternative's match into the second group as well; otherwise it would be in the first group, and without the branch reset operator it would be in group 4.
I then wrapped this in a function:
function parse-config {
command perl -pe "s/\R/\n/g" "$2" | command perl -lne "print \$2 if /^$1\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/"
return $?
}
parse-config "ID" "test.txt"
In this:
"s/\R/\n/g"
I replace every CRLF, CR or LF with LF. \R is a very powerful escape sequence available since perl v5.10. That version introduced several features that turned out to be essential for me; as it happens, I needed all of them (\h, \R, ?|). Whoever made that update was brilliant.
I needed this because the "$" anchor at the end of the line did not match: there was a "\r" before the Linux end-of-line "\n".

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where one or more words are enclosed in tags. In those instances where there is more than one word (the words are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually, so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...
This needs a mini nested-parser and I'd recommend a script, as easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in the @ARGV global array in the script. For the use indicated in the question's "Edit" replace the hardcoded
my $str = q(...);
with
my $str = shift @ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
which in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV, drop the -p switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.
The shift takes an element off of an array's front and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).
This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
The s/ at the beginning is a substitution, so the command would be something like:
cat in | perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' > out
The -p switch prints $_ after each line has been processed, so nothing more is needed; the return value of the global substitution (the number of substitutions made) is simply discarded.
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
$e = $'; # ' (stop markup highlighter)
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.

How to remove the whitespaces in fasta file using perl?

My fasta file
>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
Else try this http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin for fasta files.
open(fas,'d:\a4.fas');
$s=<fas>;
@fasta = <fas>;
@r1 = grep{s/\s//g} @fasta; #It is not remove the white space
@r2 = grep{s/(\s)$//g} @fasta; #It is not working
@r3 = grep{s/.$//g} @fasta; #It is remove the last character, but not remove the last space
print "@r1\n@r2\n@r3\n";
This code gives the following output:
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS
I expect the whitespace to be removed from the second line onward. How can I do it?
Using perl one liner,
perl -i -pe 's|[ \t]||g' a4.fas
removing all white spaces, including new lines,
perl -i -pe 's|\s||g' a4.fas
use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/\s+//g;
print $line;
}
__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
grep is the wrong choice to make changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
A substitution s/// is true unless it made no changes to the target string, so of your grep statements,
@r1 = grep { s/\s//g } @fasta
This removes all spaces, including newlines, from the strings in @fasta. It puts in @r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.
@r2 = grep { s/(\s)$//g } @fasta
Because of the anchor $, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace before the end of the string is untouched. It puts in @r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.
@r3 = grep { s/.$//g } @fasta;
This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in @r3 only those elements that contain more than just a newline, which again is probably all of them.
I think you want to retain the newlines (which are normally considered as whitespace).
This example will read the whole file, apart from the header, into the variable $data, and then use tr/// to remove spaces and tabs.
use strict;
use warnings;
use 5.010;
use autodie;
my $data = do {
open my $fas, '<', 'D:\a4.fas';
<$fas>; # Drop the header
local $/;
<$fas>;
};
$data =~ tr/ \t//d;
print $data;
Per perlrecharclass:
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
Therefore the following will display your file with horizontal spacing removed:
perl -pe "s|\h+||g" d:\a4.fas
If you don't want to display the header, just add a condition with $.
perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas
Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.

How to make a perl one-liner "line-endings agnostic"

I scratched my head for an hour over a perl one-liner that failed because the file had CRLF line endings. It has a regex with a capture group at the end of the line, and the CR got included in the match, which broke the replacement that uses the backreference.
I ended up specifying the CRLF manually in the regex, but is there a way to get perl to handle line endings automatically, whatever they are?
Original command is
perl -pe 's/foo bar(.*)$/foo $1 bar/g' file.txt
"Correct" command is
perl -pe 's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt
I know I can also convert line endings before processing, I'm interested in how to get Perl handle this case gracefully.
Example file (save with CRLF line endings!)
[19:06:57.033] foo barmy
[19:06:57.033] foo baryour
Expected output
[19:06:57.033] foo my bar
[19:06:57.033] foo your bar
Output with original command (bar goes at line beginning because it's matched together with carriage return):
bar:06:57.033] foo my
bar:06:57.033] foo your
First of all, let's keep in mind that
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
is short for something close to
perl -e'
while (<>) {
chomp;
s/foo bar(.*)\z/foo $1 bar/g;
print $_, $/;
}
' file.txt
Perl makes it so code can read/write local text files in a platform independent manner.
In a comment, you asked how to read/write both local text files and foreign text files in a platform independent manner.
First, you'll have to disable Perl's normal handling.
binmode STDIN;
binmode STDOUT;
Then you'll have to handle the multiple line endings.
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
So instead of
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
you would have
perl -e'
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
binmode STDIN;
binmode STDOUT;
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
' <file
In newer perls, you can use \R in your regex to strip off all end-of-line characters (it includes both \n and \r). See perldoc perlre.
The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the documentation online), which matches "generic newlines" (platform-agnostically), can be made to work here (example uses Bash to create the multi-line input string):
$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar
Note that the only difference to Ether's answer is use of a non-greedy construct (.*? rather than just .*), which makes all the difference here.
Read on, if you want to know more.
Background:
It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters - either \r\n or, typically, \n:[1]
With the greedy (.*) construct , "my\r" - including the \r - is captured, because the regex engine apparently only backtracks by one character to look for \R, which the remaining \n by itself also satisfies.
By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.
[1] \R matches MORE than just \r\n and \n: it also matches any single character that is classified as vertical whitespace in Unicode terms, which includes \x0B (vertical tab), \f (form feed), \r (by itself), and the following Unicode chars: U+0085 (NEXT LINE), U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR)
You can say:
perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt
The line endings would be preserved, i.e. would be the same as the input file.
You might also want to refer to perldoc perlport.
is there a way to get perl handle automatically platform-specific line-ending?
Yes. It's actually the default.
The issue is that you're trying to handle Windows line endings on a unix platform.
This will definitely do it:
perl -pe'
BEGIN {
binmode STDIN, ":crlf";
binmode STDOUT, ":crlf";
}
s/foo bar(.*)$/foo $1 bar/g;
' <file.txt
Might I suggest you keep doing it manually?
Alternatively, you could convert the file to a text file and convert it back.
<file.orig dos2unix | perl -pe'...' | unix2dos >file.new

Multi platform script perl or awk

I am trying to match records in following format:
(-,username,domain1.co.uk)\
(-,username,domain2.co.uk)
either awk or perl must be used. I am using cygwin and wrote following code which works and matches both above entries:
awk 'BEGIN {musr="(-,username,[^)]+.co.uk)"} {if ($0~musr) print $0}' netgroup
But if I try to modify this regexp to be more specific the output is nothing:
1st: match record then last backslash and then match newline:
"(-,username,[^)]+.co.uk)\\$"
2nd: match new line immediately after record without backslash:
"(-,username,[^)]+.co.uk)$"
So I decided to rewrite script into perl, hoping that perl can deal with backslashes and end of line symbols. For this purpose I used a2p this way:
echo 'BEGIN {musr="(-,username,[^)]+.co.uk)"} {if ($0~musr) print $0}' | a2p.exe
#!/usr/bin/perl
eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
if $running_under_some_shell;
# this emulates #! processing on NIH machines.
# (remove #! line above if indigestible)
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
# process any FOO=bar switches
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
$musr = '(-,username,[^)]+.co.uk)';
while (<>) {
chomp; # strip record separator
if ($_ =~ $musr) {
print $_;
}
}
This generated perl script also matches both entries, however if I try modify this script to more specific I get the following errors:
1st:
$musr = "(-,username,[^)]+.co.uk)\\";
Trailing \ in regex m/(-,username,[^)]+.co.uk)\/ at perlmatch.pl line 18, <> line 1.
2nd:
$musr = "(-,username,[^)]+.co.uk)$";
Final $ should be \$ or $name at perlmatch.pl line 14, within string
syntax error at perlmatch.pl line 14, near "= "(-,username,[^)]+.co.uk)$""
Execution of perlmatch.pl aborted due to compilation errors.
3rd:
$musr = "(-,username,[^)]+.co.uk)\$";
[the output is nothing]
What am I doing wrong? My question also touches on the fact that if somebody needs to use a script on several platforms (aix, solaris, linux), then using perl should be a better approach than dealing with (non)GNU utils and various (g|n)awk versions etc. Regards
Your problems arise from string quoting in Perl.
$musr = "(-,username,[^)]+.co.uk)\\"; replaces \\ with a single backslash when the string is created. But you would need to pass two backslashes to the regex. So you would have to put four in when you create the string.
$musr = "(-,username,[^)]+.co.uk)$"; tries to perform variable interpolation within the string.
In addition, parentheses should be escaped, as John Kugelman noted.
The solution is to use Perl's built-in delimiters for regular expressions, rather than normal quoted strings. The simple way is to put it right into your loop:
while (<>) {
chomp; # strip record separator
if ($_ =~ /\(-,username,[^)]+.co.uk\)$/) {
print $_;
}
}
If you do need to put the pattern into a variable first, use the special qr// operator.
my $musr = qr/\(-,username,[^)]+.co.uk\)$/;
while (<>) {
chomp; # strip record separator
if ($_ =~ $musr) {
print $_;
}
}
(-,username,[^)]+.co.uk)\\$
The problem here is not with the backslash at the end of the line, it's the parentheses. Parentheses are used for grouping. You need to escape them to match literal ( ) characters. You should also escape the dots so they match literal dots instead of "any character".
$ awk '/\(-,username,[^)]+\.co\.uk\)$/ {print}' netgroup
(-,username,domain2.co.uk)
$ awk '/\(-,username,[^)]+\.co\.uk\)\\$/ {print}' netgroup
(-,username,domain1.co.uk)\
If you stick with plain awk and don't use [gn]awk-specific features awk is very portable. More portable than perl is, I would think.
Parentheses must be escaped. Otherwise they group expressions. To be more specific, match an optional backslash at the end of the line (Backslashes are doubled because as string they must be escaped too).
awk 'BEGIN {musr="\\(-,username,[^)]+.co.uk\\)\\\\?$"} {if ($0~musr) print $0}' netgroup