$ and Perl's global regular expression modifier - regex

I finally figured out how to append text to the end of each line in a file:
perl -pe 's/$/addthis/' myfile.txt
However, as I'm trying to learn Perl for frequent regex use, I can't figure out why the following perl command adds the text 'addthis' to both the end and the start of each line:
perl -pe 's/$/addthis/g' myfile.txt
I thought that '$' matched the end of a line no matter what modifier was used for the regex match, but I guess this is wrong?

Summary: For what you're doing, drop the /g so it only matches before the newline. The /g is telling it to match before the newline and at the end of the string (after the newline).
Without the /m modifier, $ will match either before a newline (if it occurs at the end of the string) or at the end of the string. For instance, with both "foo" and "foo\n", the $ would match after foo. With "foo\nbar", though, it would match after bar, because the embedded newline isn't at the end of the string.
With the /g modifier, you're getting all the places that $ would match -- so
s/$/X/g;
would take a line like "foo\n" and turn it into "fooX\nX".
Sidebar:
The /m modifier will allow $ to match newlines that occur before the end of the string, so that
s/$/X/mg;
would convert "foo\nbar\n" into "fooX\nbarX\nX".

As Jim Davis pointed out, $ matches either at the end of the string or just before a \n at the end of the string (and, with the /m option, before any \n). (See the Regular Expressions section of the perlre Perldoc page.) Using the g modifier allowed it to continue matching.
Multi-line Perl regular expressions (i.e., Perl regular expressions applied to text with the newline character in it, even if it only occurs once at the end of the line) cause all sorts of complications that most Perl programmers have trouble handling.
If you're reading in a file one line at a time, always use chomp before doing ANYTHING with that line. This would have solved your issue when using the g qualifier.
Further issues can happen if you're reading files on Linux/Mac which came from Windows. In that case, you will have both the \r and \n character. As I found out recently in attempting to debug a program, the \r character isn't removed by chomp. I now make sure I always open my text files for reading like this:
open my $file_handle, "<:crlf", $file...
This will automatically substitute the \r\n characters with just \n if this is in fact a Windows file on a Linux/Mac system. If this is a regular Linux/Mac text file, it will do nothing. The other obvious solution is not to use Windows (rim shot!).
Of course, in your case, using chomp first would have done the following:
$ cat file
line one
line two
line three
line four
$ perl -pe 'chomp;s/$/addthis::/g' file
line oneaddthis::line twoaddthis::line threeaddthis::line fouraddthis::
The chomp removed the \n, so now you don't see it when the lines print out. Hmm...
$ perl -pe 'chomp;s/$/addthis/g;$_ .= "\n"' file
line oneaddthis
line twoaddthis
line threeaddthis
line fouraddthis
That works! And, your one liner is only mildly incomprehensible.
The other thing is to take a more modern approach that Damian Conway recommends in Chapter 12 of his book Perl Best Practices:
Use \A and \z as string boundary anchors.
Even if you don’t adopt the previous practice of always using /m, using ^ and $ with their default meanings is a bad idea. Sure, you know what ^ and $ actually mean in a Perl regex[1]. But will those who read or maintain your code know? Or is it more likely that they will misinterpret those metacharacters in the ways described earlier?
Perl provides markers that always—and unambiguously—mean “start of string” and “end of string”: \A and \z (capital A, but lowercase z). They mean “start/end of string” regardless of whether /m is active. They mean “start/end of string” regardless of what the reader thinks ^ and $ mean.
If you followed Conway's advice, and did this:
perl -pe 's/\z/addthis/mg' myfile.txt
You would see that your phrase addthis was added only to the very end of each and every line:
$ cat myfile.txt
line one
line two
line three
line four
$ perl -pe 's/\z/addthis/mg' myfile.txt
line one
addthisline two
addthisline three
addthisline four
addthis
See how well that works. That addthis was added to the very end of each line! ...Right after the \n character on that line.
Enough fun and back to work. (Wait, it's President's Day. It's a paid holiday. No work today except of course all that stuff I promised to have done by Tuesday morning).
Hope this helped you understand how much fun regular expressions are and why so many people have decided to learn Python.
1. Know what ^ and $ really mean in Perl? Uh, yes of course I do. I've been programming in Perl for a few decades. Yup, I know all this stuff. (Note to self: $ apparently doesn't mean what I always thought it meant.)

A workaround:
perl -pe 's/\n/addthis\n/'
No need for the g modifier: the regex is applied line by line.

Related

How to find and replace multiple line texts in perl

I have a text file named "data" with the content:
a
b
c
abc
I'd like to find all "abc" (it doesn't need to be on the same line) and replace the leading "a" with "A". Here "b" can be any character (one or more) but not 'c'.
(This is a simplification of my actual use case.)
I thought this perl command would do
perl -pi.bak -e 's/a([^c]+?)c/A\1c/mg' data
With this 'data' was changed to:
a
b
c
Abc
I was expecting:
A
b
c
Abc
I'm not sure why perl missed the first occurrence (on lines 1-3).
Let me know if you spot anything wrong with my perl command or you know a working alternative. Much appreciated.
You're reading a line at a time, applying the code to that one line. It can't possibly match across multiple lines. The simple solution is to tell perl to treat the entire file as one line using -0777.
perl -i.bak -0777pe's/a([^c]+c)/A$1/g' data
Replaced the incorrect \1 with $1.
Removed the useless /m. It only affects ^ and $, but you don't use those.
Removed the useless non-greedy modifier.
Moved the c into the capture to avoid repeating it.

Can someone break down this regular expression?

While looking for a way to format 'ifconfig' output and display only the network interfaces names, I found a regular expression that worked like a charm for OS X.
ifconfig -a | sed -E 's/[[:space:]:].*//;/^$/d'
How can I break down this regular expression so I can understand it?
Here is the sed command
s/[[:space:]:].*//;/^$/d
There is a semicolon in the middle, so it's actually two commands:
s/[[:space:]:].*//
/^$/d
First command is a substitution. What to substitute? It's between the 1st 2 slashes.
[[:space:]:].*
A character class [] matching any kind of whitespace or a colon, followed by zero or more (*) of any character (.). This matches everything in a line after the first whitespace or colon.
Substitute with what? Between the 2nd two slashes: s/...//: Nothing. The matched strings are deleted from each line.
This leaves the interface names, which start their lines. The other lines remain too, but they are now empty, because they start with whitespace.
How to remove these empty lines? That's the second command:
/^$/d
Find empty lines that match regex with nothing between start of line ^ and end of line $. Then delete them with command d.
All that's left are the interface names.
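The two stages can be watched separately with a sketch like the one below. The input is made-up text shaped like ifconfig output, not real output from any machine, and tr again shows each newline as |.

```shell
# Hypothetical ifconfig-like input: names at column 0, details tab-indented.
sample=$(printf 'lo0: flags=8049<UP> mtu 16384\n\tinet 127.0.0.1 netmask 0xff000000\nen0: flags=8863<UP> mtu 1500\n\tether aa:bb:cc:dd:ee:ff\n')
# Stage 1 only: cut each line at its first whitespace or colon.
after_s=$(printf '%s\n' "$sample" | sed -E 's/[[:space:]:].*//' | tr '\n' '|')
# Both stages: also delete the lines that stage 1 left empty.
after_d=$(printf '%s\n' "$sample" | sed -E 's/[[:space:]:].*//;/^$/d' | tr '\n' '|')
echo "$after_s"   # lo0||en0||
echo "$after_d"   # lo0|en0|
```

The doubled || in the first result marks the empty lines left behind by the indented detail lines, which is exactly what /^$/d then removes.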
This is more a sequence of commands than it is a regular expression, but I suppose breaking the sequence down may be instructive.
Read the manpage on ifconfig to find this:
Optionally, the -a flag may be used instead of an interface name. This
flag instructs ifconfig to display information about all interfaces in
the system. The -d flag limits this to interfaces that are down, and
-u limits this to interfaces that are up. When no arguments are given,
-a is implied.
That's one part done. The pipe (|) sends what ifconfig would normally print to the standard output to the standard input of sed instead.
You're passing sed the option -E. Again, man sed is your friend and tells you that this option means
Interpret regular expressions as extended (modern) regular
expressions rather than basic regular expressions (BRE's). The
re_format(7) manual page fully describes both formats.
This isn't all you need though... The first string that you're giving sed lets it know which operation to perform.
Search the same manual for the word "substitute" to reach this paragraph:
[2addr]s/regular expression/replacement/flags
Substitute the replacement string for the first instance of
the regular expression in the pattern space. Any character other than
backslash or newline can be used instead of a slash to delimit the RE
and the replacement. Within the RE and the replacement, the RE
delimiter itself can be used as a literal character if it is preceded
by a backslash.
Now we can run man 7 re_format to decode the first command s/[[:space:]:].*// which means "for each line passed to standard input, substitute the part matching the extended regular expression [[:space:]:].* with the empty string"
[[:space:]:] = match either a : or any character in the character class [:space:]
.* = match any character (.), zero or more times (*)
To understand the second command look for the [2addr]d part of the sed manual page.
[2addr]d
Delete the pattern space and start the next cycle.
Let's then look at the next command /^$/d which says "for each line passed to standard input, delete it if it corresponds to the extended regex ^$"
^$ = a line that contains no characters between its start (^) and its end ($)
We've discussed how to start with man pages and follow the clues to "decode" commands you see in everyday life.
Thanks Benjamin and Xufox for the resources. After taking a look, this is my conclusion:
s/[[:space:]:].*//;
[[:space:]:] this will search for spaces and/or : and begin the execution of the command, and this and anything that comes afterwards(hence the '.*') will be substituted by nothing (because the next thing is //, which in between should be what we would want to substitute for, which in this case is nothing.).
;
marks the end of the first command
and then we have
/^$/d
where ^$ means search for all empty spaces and d to delete them.
This is half wrong. Take a look at the other answer which gives you the complete and correct response! Thanks guys.

How can I match multi-line patterns in the command line with perl-style regex?

I regularly use regex to transform text.
To transform giant text files from the command line, perl lets me do this:
perl -pe 's/.../.../' < in.txt > out.txt
But this is inherently on a line-by-line basis. Occasionally, I want to match on multi-line things.
How can I do this on the command line?
To slurp a file instead of doing line by line processing, use the -0777 switch:
perl -0777 -pe 's/.../.../g' in.txt > out.txt
As documented in perlrun #Command Switches:
The special value -00 will cause Perl to slurp files in paragraph mode. Any value -0400 or above will cause Perl to slurp files whole, but by convention the value -0777 is the one normally used for this purpose.
Obviously, for large files this may not work well, in which case you'll need to code some type of buffer to do this replacement. We can't advise any better though without real information about your intent.
Grepping across line boundaries
So you want to grep across line boundaries...
You quite possibly already have pcregrep installed. As you may know, PCRE stands for Perl-Compatible Regular Expressions, and the library is definitely Perl-style, though not identical to Perl.
To match across multiple lines, you have to turn on the multi-line mode -M, which is not the same as (?m).
Running pcregrep -M "(?s)^b.*\d+" text.txt
On this text file:
a
b
c11
The output will be
b
c11
whereas grep would return empty.
Excerpt from the doc:
-M, --multiline Allow patterns to match more than one line. When this
option is given, patterns may usefully contain literal newline characters
and internal occurrences of ^ and $ characters. The output
for a successful match may consist of more than one line, the last
of which is the one in which the match ended. If the matched string
ends with a newline sequence the output ends at the end of that line.
When this option is set, the PCRE library is called in "multiline"
mode. There is a limit to the number of lines that can be matched,
imposed by the way that pcregrep buffers the input file as it scans
it. However, pcregrep ensures that at least 8K characters or the rest
of the document (whichever is the shorter) are available for
forward matching, and similarly the previous 8K characters (or all
the previous characters, if fewer than 8K) are guaranteed to be
available for lookbehind assertions. This option does not work when
input is read line by line (see --line-buffered.)

sed regex stop at first match

I want to replace part of the following html text (excerpt of a huge file), to update old forum formatting (resulting from a very bad forum porting job done 2 years ago) to regular phpBB formatting:
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i>
this should be filtered into:
[quote=User]
I used the following regex in sed
s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g
This works on the given example, but in the actual file, several quotes like this can be on one line. In that case sed is too greedy, and places everything between the first and the last match in the [quote=...] tag. I cannot seem to make it replace every occurrence of this pattern in the line... (I don't think there are any nested quotes, but that would make it even more difficult.)
You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like make a minimal match, or one with a negative lookahead.
The easiest way to do this is simply to use Perl in the first place.
If you have an existing sed script, you can translate it into Perl using the s2p(1) utility. Note that in Perl you really want to use $1 on the right side of the s/// operator. In most cases the \1 is grandfathered, but in general you want $1 there:
s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g;
Notice I have removed the backslash from the front of the parens. Another advantage of using Perl is that it uses the sane egrep-style regexes (like awk), not the ugly grep-style ones (like sed) that require all those confusing (and inconsistent) backslashes all over the place.
Another advantage to using Perl is you can use paired, nestable delimiters to avoid ugly backslashes. For example:
s{<blockquote.*?written by (.*?)</i>}
{[quote=$1]}g;
Other advantages include that Perl gets along excellently well with UTF-8 (now the Web’s majority encoding form), and that you can do multiline matches without the extreme pain that sed requires for that. For example:
$ perl -CSD -0777 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ...
The -CSD makes it treat stdin, stdout, and files as UTF-8. The -0777 makes it read the entire file in one fell slurp (-00 would only read a paragraph at a time), and the /s makes the dot cross newline boundaries as need be.
I don't think sed supports non-greedy matching. You can try perl though:
perl -pe 's/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g' filename
This might work for you:
sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g' file
Explanation:
If a line doesn't contain the pattern then skip it. /<blockquote.*written by .*<\/i>/!b
Change the front of the pattern into a newline globally throughout the line. s/<blockquote/\n/g
Globally replace the newline followed by the remaining pattern using a [^\n]* instead of .*. s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
Revert those newlines not changed to the original front pattern. s/\n/\<blockquote/g

Regex (grep) for multi-line search needed [duplicate]

I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.
-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? find . in non-greedy mode, that is, stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This is an attempt to match the same indentation as the line where the method starts.
As you can imagine, this search prints the main method in a C (*.c) source file.
I am not very good at grep, but your problem can be solved with awk.
Just see
awk '/select/,/from/' *.sql
The above prints every line from the first occurrence of select through the next line that matches from, and does the same for each later select. You then need to check whether the returned statements contain customerName; for this you can pipe the result into awk or grep again.
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
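$/ = "\n\n"; # Read a paragraph at a time
while (<>)
{
    # /s lets . match newlines inside the paragraph; /i ignores case
    if ($_ =~ m/SELECT.*customerName.*FROM/si)
    {
        print "$ARGV\n";   # name of the file currently being read
        close ARGV;        # skip the rest of this file, go to the next one
    }
}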
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.