I spent an hour scratching my head over a Perl one-liner that failed because the file had CRLF line endings. The regex has a capture group at the end of the line, and the CR was included in the match, which broke the replacement that uses the backreference.
I ended up specifying the CRLF manually in the regex, but is there a way to get Perl to handle line endings automatically, whatever they are?
Original command is
perl -pe 's/foo bar(.*)$/foo $1 bar/g' file.txt
"Correct" command is
perl -pe 's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt
I know I can also convert line endings before processing; I'm interested in how to get Perl to handle this case gracefully.
Example file (save with CRLF line endings!)
[19:06:57.033] foo barmy
[19:06:57.033] foo baryour
Expected output
[19:06:57.033] foo my bar
[19:06:57.033] foo your bar
Output with original command (bar goes at line beginning because it's matched together with carriage return):
bar:06:57.033] foo my
bar:06:57.033] foo your
First of all, let's keep in mind that
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
is short for something close to
perl -e'
while (<>) {
chomp;
s/foo bar(.*)\z/foo $1 bar/g;
print $_, $/;
}
' file.txt
Perl makes it so code can read/write local text files in a platform independent manner.
In a comment, you asked how to read/write both local text files and foreign text files in a platform independent manner.
First, you'll have to disable Perl's normal handling.
binmode STDIN;
binmode STDOUT;
Then you'll have to handle the multiple line endings.
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
So instead of
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
you would have
perl -e'
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
binmode STDIN;
binmode STDOUT;
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
' <file
In newer perls, you can use \R in your regex to strip off all end-of-line characters (it includes both \n and \r). See perldoc perlre.
The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the documentation online), which matches "generic newlines" platform-agnostically, can be made to work here (the example uses Bash to create the multi-line input string):
$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar
Note that the only difference to Ether's answer is use of a non-greedy construct (.*? rather than just .*), which makes all the difference here.
Read on, if you want to know more.
Background:
It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters - either \r\n or, typically, \n:[1]
With the greedy (.*) construct, "my\r" (including the \r) is captured, because the regex engine apparently only backtracks by one character to look for \R, which the remaining \n by itself also satisfies.
By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.
[1] \R matches more than just \r\n and \n: besides the two-character \r\n sequence, it matches any single character that is classified as vertical whitespace in Unicode terms, which also includes vertical tab (\x0B), form feed (\f), \r by itself, and the following Unicode characters: U+0085 (NEXT LINE), U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR).
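The greedy versus non-greedy behaviour described above can be reproduced in a small script. This is only a sketch: the sub names swap_greedy and swap_lazy are made up for illustration, and the second capture group (\R|\z) is an addition of mine so that the original terminator (or its absence on the last line) is preserved rather than replaced by \n.

```perl
use strict;
use warnings;
use v5.10;    # \R needs Perl 5.10 or later

# Greedy capture: (.*) swallows the \r, leaving only the \n for \R.
sub swap_greedy {
    my ($line) = @_;
    $line =~ s/foo bar(.*)\R/foo $1 bar\n/;
    return $line;
}

# Non-greedy capture: (.*?) stops as soon as \R can match, so \R takes
# the whole \r\n pair; $2 puts the original terminator back.
sub swap_lazy {
    my ($line) = @_;
    $line =~ s/foo bar(.*?)(\R|\z)/foo $1 bar$2/;
    return $line;
}
```

With the input "foo barmy\r\n", swap_greedy leaves a stray \r in the middle of the line, while swap_lazy returns "foo my bar\r\n".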
You can say:
perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt
The line endings would be preserved, i.e. would be the same as the input file.
You might also want to refer to perldoc perlport.
is there a way to get perl handle automatically platform-specific line-ending?
Yes. It's actually the default.
The issue is that you're trying to handle Windows line endings on a unix platform.
This will definitely do it:
perl -pe'
BEGIN {
binmode STDIN, ":crlf";
binmode STDOUT, ":crlf";
}
s/foo bar(.*)$/foo $1 bar/g;
' <file.txt
Might I suggest you keep doing it manually?
Alternatively, you could convert the file to a text file and convert it back.
<file.orig dos2unix | perl -pe'...' | unix2dos >file.new
Related
This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?
A linefeed is matched by \n
echo "a\n\nb" | perl -pe's/\n/z/'
This prints azzbz, with no trailing newline, so the next prompt appears on the same line. Note that the program is fed one line at a time, so there is no need for the /g modifier. (That is also why \n\n doesn't match.) The /m modifier is unrelated to this example.†
I don't know in what form this is used but I'd imagine not with echo feeding the input? Then better test it with input in a file, or in a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two


last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match beginning and end of lines inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will replace newlines inside the string. However, this example bears some complexities since some of the newlines stay.
You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'
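To see why the per-line loop can never match \n\n while the slurped version can, here is a small sketch. The sub names are hypothetical, and the per-line sub only imitates what -p does by splitting a string into lines that keep their newline.

```perl
use strict;
use warnings;

# Imitates perl -pe: the substitution sees one line (with its single
# trailing newline) at a time, so \n\n can never match.
sub replace_per_line {
    my ($text) = @_;
    my $out = '';
    for my $line (split /^/, $text) {    # split /^/ keeps the newlines
        $line =~ s/\n\n/z/g;
        $out .= $line;
    }
    return $out;
}

# Imitates perl -0777 -pe: the substitution sees the whole input at once.
sub replace_slurped {
    my ($text) = @_;
    $text =~ s/\n\n/z/g;
    return $text;
}
```

For the input "aaa\n\n\nbbb\n" the first sub returns the text unchanged, and the second returns "aaaz\nbbb\n": the first two newlines become z and the third one is left over.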
I have the following situation:
^ID[ \t]*=[ \t]*('(.*)'|"(.*)")
The group containing
01
is the second one when the file contains:
ID = '01'
but the third one when the file contains:
ID = "01"
This causes a problem with Perl:
perl -lne "print \$2 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/" test.txt
If the single-quoted alternative matches, I get the output:
01
Otherwise I get an empty string.
How do I make both the single-quote case and the double-quote case capture into group two of the regex?
You can print both the groups, as they can never match at the same time:
perl -lne "print \$2.\$3 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/"
or remember the quotes in $2 and use $3 for the quoted string, followed by the remembered quote:
perl -lne "print \$3 if /^ID[ \t]*=[ \t]*((['\"])(.*)\2)/"
This looks like it's a good candidate for the branch reset operator, (?|...). Either capture in that alternation is $1, and the branch-reset construct takes care of the grouping without capturing anything:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
say $1 if m/^ID \h* = \h* (?|'(\d+)'|"(\d+)") /x
}
You need v5.10, and that allows you to use the \h to match horizontal whitespace.
But, you don't need to repeat the pattern. You can match the quote and match that same quote later. A relative backreference, \g{N}, can do that:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
say $2 if m/^ID \h* = \h* (['"])(\d+)\g{-2} /x
}
I prefer \g{-2} because I usually don't have to update the numbering if I change the pattern to include more captures before the thing it refers to.
And, since this is a one-liner, don't type out the literal quotes (as ikegami has already shown):
say $2 if m/^ID \h* = \h* ([\x22\x27])(\d+)\g{-2} /x
Only one of the two will be defined, so simply use the one that's defined.
perl -nle'print $1//$2 if /^ID\h*=\h*(?:\x27(.*)\x27|"(.*)")/' # \x27 is '
You could also use a backreference.
perl -nle'print $2 if /^ID\h*=\h*(["\x27])(.*)\1/'
Note that all the provided solutions including these two fail (leave the escape sequence in) if you have something like ID="abc\"def" or ID="abc\ndef", assuming those are supported.
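The backreference approach can also be written as a small script instead of a one-liner, which avoids the shell-quoting problem altogether. This is a sketch only; extract_id is a hypothetical name, and it shares the limitation noted above for escaped quotes inside the value.

```perl
use strict;
use warnings;
use v5.10;    # \h and \g{-N} need Perl 5.10 or later

# Return the quoted value of an ID line, or undef if the line does not
# match or the opening and closing quotes differ.
sub extract_id {
    my ($line) = @_;
    # Group 1 remembers which quote opened the value; \g{-2} requires
    # the same quote to close it. Group 2 is the value itself.
    return $line =~ /^ID\h*=\h*(["'])(.*)\g{-2}/ ? $2 : undef;
}
```

Both ID = '01' and ID = "01" yield 01, while a mismatched pair such as ID = "01' yields undef.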
Thank you @brian_d_foy:
perl -lne "print \$1 if /^ID\h*=\h*(?|'(.*)'|\"(.*)\")/" test.txt
Or better:
perl -lne "print \$2 if /^ID\h*=\h*(['\"])(.*)\1/" test.txt
I have decided to also accept
ID = 01 #Followed by one or more horizontal spaces.
In addition to:
ID = "01" #Followed by one or more horizontal spaces.
And:
ID = '01' #Followed by one or more horizontal spaces.
Therefore I have adopted a rather complex solution:
perl -lne "print \$2 if /^ID\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/" test.txt
I merged both of your solutions, @brian_d_foy. The doubled parentheses put the second alternative's capture in group 2 as well; with single parentheses it would be group 1, and without the branch reset operator it would be group 4.
I later wrapped this in a function:
function parse-config {
command perl -pe "s/\R/\n/g" "$2" | command perl -lne "print \$2 if /^$1\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/"
return $?
}
parse-config "ID" "test.txt"
In this:
"s/\R/\n/g"
I replace every CRLF, CR, or LF with LF. \R is a very powerful escape sequence available since Perl v5.10. That version apparently introduced several features that turned out to be essential for me; as luck would have it, I needed all of them (\h, \R and (?|...)). Whoever did the update was brilliant.
I needed this because the "$" at the end of the pattern did not match: there was a "\r" before the Linux end-of-line "\n".
My fasta file
>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
Else try this http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin for fasta files.
open(fas,'d:\a4.fas');
$s=<fas>;
@fasta = <fas>;
@r1 = grep{s/\s//g} @fasta; # does not remove the whitespace
@r2 = grep{s/(\s)$//g} @fasta; # does not work
@r3 = grep{s/.$//g} @fasta; # removes the last character, but not the last space
print "@r1\n@r2\n@r3\n";
This code gives the following output:
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS
I expect the whitespace to be removed from the second line onward. How can I do it?
Using perl one liner,
perl -i -pe 's|[ \t]||g' a4.fas
removing all white spaces, including new lines,
perl -i -pe 's|\s||g' a4.fas
use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/\s+//g;
print $line;
}
__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
grep is the wrong choice to make changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
A substitution s/// is true unless it made no changes to the target string, so of your grep statements,
@r1 = grep { s/\s//g } @fasta
This removes all spaces, including newlines, from the strings in @fasta. It puts in @r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.
@r2 = grep { s/(\s)$//g } @fasta
Because of the anchor $, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace before the end of the string is untouched. It puts in @r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.
@r3 = grep { s/.$//g } @fasta;
This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in @r3 only those elements that contain more than just a newline, which again is probably all of them.
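The difference between filtering and modifying can be sketched as follows. The sub names are hypothetical, and the sketch assumes that only spaces and tabs should go while newlines stay.

```perl
use strict;
use warnings;

# Return copies of the lines with spaces and tabs removed; newlines stay.
sub strip_blanks {
    my @lines = @_;          # copy, so the caller's array is untouched
    tr/ \t//d for @lines;    # tr///d deletes the listed characters in place
    return @lines;
}

# grep only filters: it keeps the elements for which the block is true
# and changes nothing as long as the block has no side effects.
sub lines_with_blanks {
    return grep { /[ \t]/ } @_;
}
```

Here strip_blanks is the operation that actually changes the data, and lines_with_blanks shows what grep is meant for: selecting elements, not editing them.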
I think you want to retain the newlines (which are normally considered as whitespace).
This example reads the whole file, apart from the header, into the variable $data, and then uses tr/// to remove spaces and tabs.
use strict;
use warnings;
use 5.010;
use autodie;
my $data = do {
open my $fas, '<', 'D:\a4.fas';
<$fas>; # Drop the header
local $/;
<$fas>;
};
$data =~ tr/ \t//d;
print $data;
Per perlrecharclass:
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
Therefore the following will display your file with horizontal spacing removed:
perl -pe "s|\h+||g" d:\a4.fas
If you don't want to display the header, just add a condition with $.
perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas
Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.
chomp(); # Remove the newline.
$_ =~ s/\s*//; # No extra spaces
$_ =~ s/\\//; # Kill any line connector
I am not very familiar with Perl/regex. I am modifying an existing perl script with the above snippet. This chunk of code removes the newline character, spaces and line ending with '\' as connector.
My question is, in the following line, what do the two bold characters mean? I understand that anything between '/ /' is a regular expression. But,
1) What does the s preceding the '/ /' mean?
2) What does the second '/' at the end mean?
In $_ =~ s/\s* //;
s means "do replace".
$_ =~ s/SEARCH/REPLACE/;
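So, s/\s*//; finds the first match of "zero or more whitespace characters" and removes it (replaces it with the empty string). Without the /g modifier only the first match is replaced, and because \s* can match zero characters, that first match is always at the very start of the string. In effect it strips leading whitespace only; to remove all whitespace you would write s/\s+//g.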
The "s" is the substitution function so it is saying "substitute"/"what matches here"/"with what's here"/
The "/" here is the opening and closing delimiter of the parameters: s/something_to_match/replacement. '/' is very commonly used as the delimiter character, but perl pretty much allows any character that isn't used in the regexes as the delimiter.
See: http://perldoc.perl.org/functions/s.html
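A few variations may make the pieces of s/// easier to see. This is a sketch with hypothetical sub names; the last sub uses braces instead of slashes as delimiters, which is handy when the pattern itself contains awkward characters.

```perl
use strict;
use warnings;

# s/PATTERN/REPLACEMENT/ without /g replaces the first match only.
# \s* can match the empty string, so the first match is at position 0:
# only leading whitespace is removed.
sub strip_leading {
    my ($s) = @_;
    $s =~ s/\s*//;
    return $s;
}

# With /g every match is replaced, so all whitespace is removed.
sub strip_all {
    my ($s) = @_;
    $s =~ s/\s+//g;
    return $s;
}

# Same operator with { } as delimiters: remove the first backslash.
sub drop_backslash {
    my ($s) = @_;
    $s =~ s{\\}{};
    return $s;
}
```

For example, strip_leading("  a b ") returns "a b " and strip_all("  a b ") returns "ab".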
I am trying to match records in following format:
(-,username,domain1.co.uk)\
(-,username,domain2.co.uk)
Either awk or Perl must be used. I am using Cygwin and wrote the following code, which works and matches both of the above entries:
awk 'BEGIN {musr="(-,username,[^)]+.co.uk)"} {if ($0~musr) print $0}' netgroup
But if I try to modify this regexp to be more specific the output is nothing:
1st: match record then last backslash and then match newline:
"(-,username,[^)]+.co.uk)\\$"
2nd: match the newline immediately after the record, without a backslash:
"(-,username,[^)]+.co.uk)$"
So I decided to rewrite script into perl, hoping that perl can deal with backslashes and end of line symbols. For this purpose I used a2p this way:
echo 'BEGIN {musr="(-,username,[^)]+.co.uk)"} {if ($0~musr) print $0}' | a2p.exe
#!/usr/bin/perl
eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
if $running_under_some_shell;
# this emulates #! processing on NIH machines.
# (remove #! line above if indigestible)
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
# process any FOO=bar switches
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
$musr = '(-,username,[^)]+.co.uk)';
while (<>) {
chomp; # strip record separator
if ($_ =~ $musr) {
print $_;
}
}
This generated Perl script also matches both entries; however, if I try to modify it to be more specific, I get the following errors:
1st:
$musr = "(-,username,[^)]+.co.uk)\\";
Trailing \ in regex m/(-,username,[^)]+.co.uk)\/ at perlmatch.pl line 18, <> line 1.
2nd:
$musr = "(-,username,[^)]+.co.uk)$";
Final $ should be \$ or $name at perlmatch.pl line 14, within string
syntax error at perlmatch.pl line 14, near "= "(-,username,[^)]+.co.uk)$""
Execution of perlmatch.pl aborted due to compilation errors.
3rd:
$musr = "(-,username,[^)]+.co.uk)\$";
[the output is nothing]
What am I doing wrong? My question also touches on the fact that if somebody needs to use a script on several platforms (AIX, Solaris, Linux), then using Perl should be a better approach than dealing with (non-)GNU utils, various (g|n)awk versions, and so on. Regards
Your problems arise from string quoting in Perl.
$musr = "(-,username,[^)]+.co.uk)\\"; replaces \\ with a single backslash when the string is created. But you would need to pass two backslashes to the regex. So you would have to put four in when you create the string.
$musr = "(-,username,[^)]+.co.uk)$"; tries to perform variable interpolation within the string.
In addition, parentheses should be escaped, as John Kugelman noted.
The solution is to use Perl's built-in delimiters for regular expressions, rather than normal quoted strings. The simple way is to put it right into your loop:
while (<>) {
chomp; # strip record separator
if ($_ =~ /\(-,username,[^)]+\.co\.uk\)$/) {
print $_;
}
}
If you do need to put the pattern into a variable first, use the special qr// operator.
my $musr = qr/\(-,username,[^)]+\.co\.uk\)$/;
while (<>) {
chomp; # strip record separator
if ($_ =~ $musr) {
print $_;
}
}
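Putting the pieces together, a qr// pattern that accepts both kinds of record might look like the sketch below. The sub name is_record and the optional-backslash group are my own assumptions about what "more specific" should mean here; the parentheses and dots are escaped so they match literally.

```perl
use strict;
use warnings;

# Inside qr//, \\ is the regex for one literal backslash, so \\? means
# "an optional trailing backslash". No extra doubling is needed because
# the pattern is not built from a double-quoted string.
my $record = qr/\(-,username,[^)]+\.co\.uk\)\\?$/;

# Hypothetical helper: true if the line is a netgroup record, with or
# without a trailing continuation backslash.
sub is_record {
    my ($line) = @_;
    chomp $line;    # strip the record separator, as in the loop above
    return $line =~ $record ? 1 : 0;
}
```

A loop such as while (<>) { print if is_record($_) } would then print both example records.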
(-,username,[^)]+.co.uk)\\$
The problem here is not with the backslash at the end of the line, it's the parentheses. Parentheses are used for grouping. You need to escape them to match literal ( ) characters. You should also escape the dots so they match literal dots instead of "any character".
$ awk '/\(-,username,[^)]+\.co\.uk\)$/ {print}' netgroup
(-,username,domain2.co.uk)
$ awk '/\(-,username,[^)]+\.co\.uk\)\\$/ {print}' netgroup
(-,username,domain1.co.uk)\
If you stick with plain awk and don't use [gn]awk-specific features awk is very portable. More portable than perl is, I would think.
Parentheses must be escaped. Otherwise they group expressions. To be more specific, match an optional backslash at the end of the line (Backslashes are doubled because as string they must be escaped too).
awk 'BEGIN {musr="\\(-,username,[^)]+.co.uk\\)\\\\?$"} {if ($0~musr) print $0}' netgroup