Why does `perl -pe 's/$/\n/g'` add 2 blank lines?


I'm working through the one liner book and came across
perl -pe 's/$/\n/' file
which inserts a blank line after each line: it replaces the end of the line with a newline, which adds a second newline to the existing one and so produces a blank line.
As this is the first example without g at the end of the pattern, I tried
perl -pe 's/$/\n/g' file
this results in 2 blank lines between lines.
I would have expected no difference since there is only one $ per line so replacing all of them should be the same as replacing just the first one.
What's going on here?

/$/ matches the “end of string”. This might be
the end of string (like /\z/),
or just before a newline before the end of string (like /(?=\n\z)/).
(Additionally, /$/m matches the “end of line”. This might be
the end of string,
or just before a newline (like /(?=\n)/).)
With your substitution s/$/\n/g, the regex matches twice: once before the newline, then again at the end of string:
The first match is before the newline:
"foo\n"
#   ^ match
A newline is inserted at the match position:
"foo\n\n"
#   ^ inserted newline
The next match is at the end of string:
"foo\n\n"
#       ^ match
A newline is inserted at the match position:
"foo\n\n\n"
#       ^ inserted newline
No further match is found.
The solution: if $ is too DWIMmy for you, always match \z or \n explicitly, possibly together with lookaheads like (?=\n). Consider matching all Unicode line separators with \R instead of just \n.

This isn't a sound understanding of the situation. $ is a badly-defined and unintuitive metacharacter:
It is a zero-width match.
It will match before a newline character at the end of the bound string.
It will match at the end of the bound string.
With the /m modifier in place, it will also match before any newline character anywhere in the string, but not immediately after one unless that newline is the last character of the string.
\z is much more useful: it only ever matches at the end of the string.
"by setting the end of the line to new line"
Mentioning "lines" at all is misleading, and you should be careful to explain in comments what meaning you're applying. If you have
use feature 'say';
my $s = "xxx\n";
then
say pos($s) while $s =~ /$/g;
will produce
3
4
i.e. both before and after the newline, because it happens to be at the end of the string
This is also why your s/$/\n/g adds two newlines: there are two zero-width matches for /$/ within this string, and a global substitution finds them and replaces them both with a newline, resulting in three newlines instead of the original one
It's unclear what you intended
Adding a newline to the end of a string, regardless of what's there already is s/\z/\n/ or just $s .= "\n"
If you want to ensure that, say, there are exactly two newlines at the end of a string, then replace any existing trailing linefeeds in one step with s/\n*\z/\n\n/
As you can see, \z is much more useful than $
And don't forget \R if you're dealing with cross-platform data. It will match any standard line terminator, including CR, LF and CRLF
If this still leaves you with a problem then please ask again. I was going to write about zero-width matches but it's hard to know whether my answer is clear without it

Related

How does Perl match annotation "//" for verilog files?

I have found one method, but I don't understand the principle:
#remove lines starting with //
$file =~ s/(?<=\n)[ \t]*?\/\/.*?\n//sg;
How does (?<=\n)[ \t]*? work?

The critical piece is the lookbehind (?<=...). It is a zero-width assertion, which means that it does not consume its match -- it only asserts that the pattern given inside is indeed in the string, right before the pattern that follows it.
So (?<=\n)[ \t] matches either a space or a tab, [ \t], that has a newline before it. With the quantifier, [ \t]*, it matches a space-or-tab any number of times (possibly zero). Then we have the // (each escaped by \). Then it matches any character any number of times up to the first newline, .*?\n.
Here ? makes .* non-greedy so that it stops at the first match of the following pattern.
This can be done in other ways, too.
$file =~ s{ ^ \s* // .*? \n }{}gmx
The modifier m makes anchors ^ and $ (unused here) match the beginning and end of each line. I use {}{} as delimiters so that I don't have to escape /. The modifier x allows use of spaces (and comments and newlines) inside for readability.
You can also do it by split-ing the string by newline and passing lines through grep
my $new_file = join "\n", grep { not m|^\s*//.*| } split /\n/, $file;
The split returns a list of lines and this is input for grep, which passes those for which the code in the block evaluates to true. The list that it returns is then joined back, if you wish to again have a multiline string. Note the double quotes around "\n": in single quotes Perl would join with a literal backslash and n.
If you want the lines themselves, remove the join "\n" and assign to an array instead.
The regex in the grep block is now far simpler, but the whole thing may be an eyeful in comparison with the previous regex. However, this approach can turn hard jobs into easy ones: instead of going for a monster master regex, break the string and process the pieces easily.

\1 not defined in the RE

In my script, I'm passing in a markdown file and, using sed, I'm trying to find lines that do not start with one or more # and are not empty, and then surround those lines with <p></p> tags
My reasoning:
^[^#]+ At beginning of line, find lines that do not begin with 1 or more #
.\+ Then find lines that contain one or more character (aka not empty lines)
Then replace the matched line with <p>\1</p>, where \1 represents the matched line.
However, I'm getting "\1 not defined in the RE". Is my reasoning above correct and how do I fix this error?
BODY=$(sed -E 's/^[^#]+.\+/<p>\1</p>/g' "$1")

Backslash followed by a number is replaced with the match for the Nth capture group in the regexp, but your regexp has no capture groups.
If you want to replace the entire match, use &:
BODY=$(sed -E 's%^[^#].*%<p>&</p>%' "$1")
You don't need to use .+ to find non-empty lines -- the fact that it has a character at the beginning that doesn't match # means it's not empty. And you don't need + after [^#] -- all you care is that the first character isn't #. You also don't need the g modifier when the regexp matches the entire line -- that's only needed to replace multiple matches per line.
And since your replacement string contains /, you need to either escape it or change the delimiter to some other character.
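Here is a quick check of that command on a small made-up markdown sample (the sample text is an assumption, not the asker's file). Headings and the empty line should pass through untouched, and everything else should be wrapped.

```shell
# & in the replacement stands for the whole match; % is used as the delimiter
# so the / in </p> needs no escaping.
body=$(printf '# Title\n\nSome text\n## Sub\nMore text\n' |
  sed -E 's%^[^#].*%<p>&</p>%')

printf '%s\n' "$body"
```

The empty line is left alone because [^#] has to consume one character, and an empty line has none.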

interpreting regular expression in perl

I am trying to reverse engineer a Perl script. One of the lines contains a matching operator that reads:
$line =~ /^\s*^>/
The input is just FASTA sequences with header information. The script is looking for a particular pattern in the header, I believe.
Here is an example of the files the script is applied to:
>mm9_refGene_NM_001252200_0 range=chr1:39958075-39958131 5'pad=0 3'pad=0 strand=+
repeatMasking=none
ATGGCGAACGACTCTCCCGCGAAGAGCCTGGTGGACATTGACCTGTCGTC
CCTGCGG
>mm9_refGene_NM_001252200_1 range=chr1:39958354-39958419 5'pad=0 3'pad=0 strand=+
repeatMasking=none
GACCCTGCTGGGATTTTTGAGCTGGTGGAAGTGGTTGGAAATGGCACCTA
TGGACAAGTCTATAAG
This is a matching operator asking whether the line, from its beginning, contains zero or more whitespace characters, but after that I lose its meaning.
This is how I have parsed the regex so far:
from the beginning [ /^... ], contains whitespace [ ...\s... ], zero or more times [ ...*... ].

Using RegexBuddy (or, as r3mus said, regex101.com, which is free):
Assert position at the beginning of the string «^»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the beginning of the string «^»
Match the character “>” literally «>»
EDIT: Birei's answer is probably more correct if the regex in question is actually wrong.

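You have to get rid of the second ^ character. It is a metacharacter that means the beginning of the string (or of a line, with the /m flag), but that is already asserted by the first one. With the second ^ in place and no /m flag, the regex can only match when \s* matches nothing, that is, when > is the very first character.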
The character > will match at the beginning of the line without the second ^ because the initial whitespace is optional (* quantifier). So, use:
$line =~ /^\s*>/

It is much easier to reverse engineer a Perl script with a debugger:
"perl -d script.pl" or, if you have ddd on Linux, "ddd script.pl &".
With the /m modifier, this regex also matches an empty or whitespace-only line followed by the start of the next FASTA record.
http://www.rexfiddle.net/c6locQg

Removing repeated characters, including spaces, in one line

I currently have a string, say $line='55.25040882, 3,,,,,,', that I want to remove all whitespace and repeated commas and periods from. Currently, I have:
$line =~ s/[.,]{2,}//;
$line =~ s/\s{1,}//;
Which works, as I get '55.25040882,3', but when I try
$line =~ s/[.,\s]{2,}//;
It pulls out the ", " and leaves the ",,,,,,". I want to retain the first comma and just get rid of the whitespace.
Is there a way to elegantly do this with one line of regex? Please let me know if I need to provide additional information.
EDIT: Since there were so many solutions, I decided to update my question with the answer below:
$line =~ s/([.,])\1{1,}| |\t//g;
This removes all repeated periods and commas, removes all spaces and tabs, while retaining the \r and \n characters. There are so many ways to do this, but this is the one I settled for. Thanks so much!

This is mostly a critique of Rohit's answer, which seems to contain several misconceptions about character class syntax, especially the negation operator (^). Specifically:
[(^\n^\r)\s] matches ( or ^ or ) or any whitespace character, including linefeed (\n) and carriage return (\r). In fact, they're each specified twice (since \s matches them too), though the class still only consumes one character at a time.
^[\n\r]|\s matches a linefeed or carriage return at the beginning of the string, or any whitespace character anywhere (which makes the first part redundant, since any whitespace character includes linefeed and carriage return, and anywhere includes the beginning of the string).
Inside a character class, the caret (^) negates the meaning of everything that follows iff it appears immediately after the opening [; anywhere else, it's just a caret. All other metacharacters except \ lose their special meanings entirely inside character classes. (But the normally non-special characters, - and ], become special.)
Outside a character class, ^ is an anchor.
Here's how I would write the regex:
$line =~ s/([.,])\1+|\h+//g;
Explanation:
Since you finally went with ([.,])\1{1,}, I assume you want to match repeated periods or repeated commas, not things like ., or ,.. Success with regexes means learning to look at text the way the regex engine does, and it's not intuitive. You'll help yourself a lot if you try to describe each problem the way the regex engine would, if it could speak.
{1,} is not incorrect, but why add all that clutter to your regex when + does the same thing?
\h matches horizontal whitespace, which includes spaces and tabs, but not linefeeds or carriage returns. (That only works in Perl, AFAIK. In Ruby/Oniguruma, \h matches a hex digit; in every other flavor I know of, it's a syntax error.)

You can try using: -
my $line='55.25040...882, 3,,,,,,';
$line =~ s/[^\S\n\r]|[.,]{2,}//g; # Negates non-whitespace char, \n and \r
print $line
OUTPUT: -
55.25040882,3
[^\S\n\r]|[.,]{2,} -> This means either [^\S\n\r] or [.,]{2,}
[.,]{2,} -> This matches a run of two or more consecutive , or . characters (in any mix), which is then removed.
[^\S\n\r] -> This matches any whitespace character except linefeed (\n) and carriage return (\r).

How can I match at the beginning of any line, including the first, with a Perl regex?

According to the Perl documentation on regexes:
By default, the "^" character is guaranteed to match only the beginning of the string ... Embedded newlines will not be matched by "^" ... You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string ... you can do this by using the /m modifier on the pattern match operator.
The "after any newline" part means that it will only match at the beginning of the 2nd and subsequent lines. What if I want to match at the beginning of any line (1st, 2nd, etc.)?
EDIT: OK, it seems that the file has BOM information (3 chars) at the beginning and that's what's messing me up. Any way to get ^ to match anyway?
EDIT: So in the end it works (as long as there's no BOM), but now it seems that the Perl documentation is wrong, since it says "after any newline"

The ^ does match the 1st line with the /m flag:
~:1932$ perl -e '$a="12\n23\n34";$a=~s/^/:/gm;print $a'
:12
:23
:34
To match with BOM you need to include it in the match.
~:1939$ perl -e '$a="ï»¿12\n23\n34";$a=~s/^(\d)/<\1>:/mg;print $a'
ï»¿12
<2>:3
<3>:4
~:1940$ perl -e '$a="ï»¿12\n23\n34";$a=~s/^(?:ï»¿)?(\d)/<\1>:/mg;print $a'
<1>:2
<2>:3
<3>:4

You can use the /^(?:\xEF\xBB\xBF)?/mg regex to match at the beginning of the line anyway, if you want to preserve the BOM.

Conceptually, there's assumed to be a newline before the beginning of the string. Consequently, /^a/ will find a letter 'a' at the beginning of a string.

Put an empty line at the beginning of the file. This calms things down and avoids making the regex hard to read.
Yes, the BOM. It might appear at the beginning of the file, so put an empty line at the start. The BOM is not matched by \s and cannot be seen with the naked eye. A BOM that makes a regex fail has cost me hours.
