Proper Perl syntax for complex substitution - regex

I've got a large number of PHP files and lines that need to be altered from a standard
echo "string goes here"; syntax to:
custom_echo("string goes here");
This is the line I'm trying to punch into Perl to accomplish this:
perl -pi -e 's/echo \Q(.?*)\E;/custom_echo($1);/g' test.php
Unfortunately, I'm making some minor syntax error, and it's not altering "test.php" in the least. Can anyone tell me how to fix it?

Why not just do something like:
perl -pi -e 's|echo (\".*?\");|custom_echo($1);|g' file.php
I don't think \Q and \E are doing what you think they're doing. They're not beginning and end of quotes. They're in case you put in a special regex character (like .) -- if you surround it by \Q ... \E then the special regex character doesn't get interpreted.
In other words, your regular expression is trying to match the literal string (.?*), which you probably don't have, and thus substitutions don't get made.
You also had your ? and * backwards -- I assume you want to match non-greedily, in which case the ? has to come after the .* as a non-greedy modifier, i.e. .*? rather than .?*.
Edit: I also strongly suggest doing:
perl -pi.bak -e ... file.php
This will create a "backup" file that the original file gets copied to. In my above example, it'll create a file named file.php.bak that contains the original, pre-substitution contents. This is incredibly useful during testing until you're certain that you've built your regex properly. Hell, disk is cheap, I'd suggest always using the -pi.bak command-line option.

You put your grouping parentheses inside the metaquoting expression (\Q(pattern)\E) instead of outside ((\Qpattern\E)), so your parentheses also get escaped and your regex is not capturing anything.

Related

Expand environment variable inside Perl regex

I am having trouble with a short bash script. It seems like all forward slashes need to be escaped. How can required characters in expanded (environment) variables be escaped before perl reads them? Or is there some other method that perl understands?
This is what I am trying to do, but this will not work properly.
eval "perl -pi -e 's/$HOME\/_TV_rips\///g'" '*$videoID.info.json'
That is part of a longer script where videoID=$1. (And for some reason perl expands variables both within single and double quotes.)
This simple workaround with no forward slash in the expanded environment variable $USER works. But I would like to not have /Users/ hard coded:
eval "perl -pi -e 's/\/Users\/$USER\/_TV_rips\///g'" '*$videoID.info.json'
This is probably solvable in some better way fetching home dir for files or something else. The goal is to remove the folder name in youtube-dl's json data.
I am using perl just because it can handle extended regex. But perl is not required. Any better substitute for extended regex on macOS is welcome.
You are building the following Perl program:
s//home/username\/_TV_rips\///g
That's quite wrong.
You shouldn't be attempting to build Perl code from the shell in the first place. There are a few ways you could pass values to the Perl code instead of generating Perl code. Since the value is conveniently in the environment, we can use
perl -i -pe's/\Q$ENV{HOME}\E\/_TV_rips\///' *"$videoID.info.json"
or better yet
perl -i -pe's{\Q$ENV{HOME}\E/_TV_rips/}{}' *"$videoID.info.json"
(Also note the lack of eval and the fixed quoting on the glob.)
Just assembling the ideas in comments, this should achieve what you expected :
perl -pi -e 's{$ENV{HOME}/_TV_rips/}{}g' *$videoID.info.json
@ikegami thanks for your comment! It is indeed safer with \Q...\E, in case $HOME contains characters like $.
Any regex delimiter that appears in the input string must of course be escaped.
But as Stefen stated, you can use other delimiters in Perl, like % or §. That makes it easier if you have many slashes in your string.
Special characters:
# is the Perl comment character - don't use this.
?, [], {}, $, ^ and . are regex control characters - they must be escaped in the regex.
You should always write a comment to make clear you are using different delimiters, because this makes your regex hard to read for inexperienced users.
Try out your RegEx here: https://regex101.com/r/cIWk1o/1

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why are the escape characters needed in the first case?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!
You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab*" could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
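Here is a rough way to watch the expansion happen without involving grep at all. The scratch directory and the contents of file.txt are my own assumptions; the two file names are the ones from the example above. It uses set -- to let the shell expand the word and then counts the resulting arguments.

```shell
# Scratch directory holding the two hypothetical files from the example.
demo=$(mktemp -d)
cd "$demo" || exit 1
: > abcd.txt
: > abcdef.jpg
printf 'xabbby\n' > file.txt

# Unquoted: the shell replaces ab* with the matching file names.
set -- ab*
unquoted_count=$#
printf 'unquoted ab* became %s words: %s\n' "$unquoted_count" "$*"

# Quoted: the pattern survives as a single word.
set -- 'ab*'
quoted_count=$#
printf 'quoted ab* stayed %s word: %s\n' "$quoted_count" "$*"

# With quotes, grep receives the regex it was meant to get.
matches=$(grep -o 'ab*' file.txt)
printf '%s\n' "$matches"
```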
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using Perl extensions in an ERE context). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
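To check the BRE/ERE difference on your own machine, a sketch along these lines should do. The sample string is invented, and the last result reflects how the grep implementations I know treat a bare { in BRE (as a literal character); treat that part as an assumption about your grep.

```shell
sample='a  b'   # "a", two spaces, "b"

# BRE (plain grep): the interval braces must be escaped.
bre=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]\{1,3\}b' || true)

# ERE (grep -E): the braces are special without escaping.
ere=$(printf '%s\n' "$sample" | grep -Ec 'a[[:space:]]{1,3}b' || true)

# BRE with bare braces: this looks for a literal "{1,3}" and finds nothing.
bre_bare=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]{1,3}b' || true)

printf 'BRE escaped=%s  ERE bare=%s  BRE bare=%s\n' "$bre" "$ere" "$bre_bare"
```

The || true is only there so that a zero-match grep (exit status 1) does not abort a script running under set -e.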
Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E  .+  myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
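grep -E \ \\.+\  myfile.txt
Note that you need to prefix the \ which is intended for grep by another \, because the backslash has, like a space, a special meaning for the shell, and if you did not write \\., grep would not see a backslash-period, but just a period. Also note the two spaces before myfile.txt: the first one is escaped and belongs to the pattern, the second one separates the arguments.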

sed regex stop at first match

I want to replace part of the following html text (excerpt of a huge file), to update old forum formatting (resulting from a very bad forum porting job done 2 years ago) to regular phpBB formatting:
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i>
this should be filtered into:
[quote=User]
I used the following regex in sed
s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g
This works on the given example, but in the actual file, several quotes like this can be in one line. In that case sed is too greedy, and places everything between the first and the last match in the [quote=...] tag. I cannot seem to make it replace every occurrence of this pattern in the line... (I don't think there are any nested quotes, but that would make it even more difficult.)
You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like make a minimal match, or one with a negative lookahead.
The easiest way to do this is simply to use Perl in the first place.
If you have an existing sed script, you can translate it into Perl using the s2p(1) utility. Note that in Perl you really want to use $1 on the right side of the s/// operator. In most cases the \1 is grandfathered, but in general you want $1 there:
s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g;
Notice I have removed the backslash from the front of the parens. Another advantage of using Perl is that it uses the sane egrep-style regexes (like awk), not the ugly grep-style ones (like sed) that require all those confusing (and inconsistent) backslashes all over the place.
Another advantage to using Perl is you can use paired, nestable delimiters to avoid ugly backslashes. For example:
s{<blockquote.*?written by (.*?)</i>}
{[quote=$1]}g;
Other advantages include that Perl gets along excellently well with UTF-8 (now the Web’s majority encoding form), and that you can do multiline matches without the extreme pain that sed requires for that. For example:
$ perl -CSD -0777 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ...
The -CSD makes it treat stdin, stdout, and files as UTF-8. The -0777 makes it read each file in one fell slurp (a plain -00 would only switch to paragraph mode), and the /s makes the dot cross newline boundaries as need be.
I don't think sed supports non-greedy match. You can try perl though:
perl -pe 's/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g' filename
This might work for you:
sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g' file
Explanation:
If a line doesn't contain the pattern then skip it. /<blockquote.*written by .*<\/i>/!b
Change the front of the pattern into a newline globally throughout the line. s/<blockquote/\n/g
Globally replace the newline followed by the remaining pattern using a [^\n]* instead of .*. s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
Revert those newlines not changed to the original front pattern. s/\n/\<blockquote/g

Hexadecimal Variables in substitution patterns

The file I am getting is full with badly formatted UTF-8 codes, like <0308> etc. I can identify them all right, but I want to replace them with the actual utf-8 letter, preferable with a regex. I've tried dozens of regexes like this:
s/<[0-9a-fA-F]{2,4}/\x{$1}/g
s/<[0-9a-fA-F]{2,4}/\N{U+$1}/g
And so on, but each time it tells me that $ is not a valid hex-char (to which I fully agree). Shouldn't it just take the number in my $1 and put it in there? Or does Perl really expect me to use \x{..} or \N{U+..} only with fixed values? If so, I'd have to hand-write the conversion for every possible hex-value - not very useful.
For one thing, you need to use parentheses to capture something in your regular expression; otherwise $1 will not get set to anything.
chr + hex with the /e (eval) modifier will do the trick here:
s/ <
([0-9a-fA-F]{2,4}) # parentheses to set $1
>
/
chr(hex($1))
/gex;
What version of perl are you using? This seems to work fine for me on 5.10.1:
$ perl -E '$foo = "<0308>"; $foo =~ s/<[0-9a-fA-F]{2,4}/\N{U+$1}/g; say $foo'
Wide character in print at -e line 1.
�>
(With \x{$1}, it seems to substitute the numbers with nothing, but I still don't get an error message.)
You probably need to use the eval switch to it. Try /\x{$1}/eg or /"\x{$1}"/eg

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex version supports it, you can use \L, like so in a POSIX shell (both \L in the replacement and the -r option are GNU sed extensions):
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate [1], there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example [2], here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as marking the beginning of a special character (as with \s, \w, \b and so forth) rather than as escaping a special character to make it a literal. So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
[1] I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
[2] I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this is the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*.m4a)/$1\L$2\E$3/g' *
In Perl, there's
$string =~ tr/A-Z/a-z/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.