Backtick Character ` in Perl Regex Meaning

I have inherited a Perl codebase that uses regex to parse an XML file. Not optimal, I know. I have lines of code like:
$title =~ s`<(.*?)>``g;
and
$content =~ s`__DOUBLEFIG__`${fig_html}`;
The standard Perl find and replace takes the form of
s/foo/bar/g;
What does
s`foo`bar`g;
do?

Those are alternative ways of delimiting the regex. Whoever wrote that code chose a different delimiter for personal reasons or, possibly, for readability. When a pattern is likely to contain /, another delimiter is often chosen so that the slash does not have to be escaped in the regex.
This answer provides more information.
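A minimal sketch of the equivalence, assuming a made-up sample title (the string below is hypothetical, not from the inherited codebase). Both substitutions strip the tags and should leave the same result:

```perl
use strict;
use warnings;

# Hypothetical sample input; not taken from the original codebase.
my $title_slash    = 'A <b>bold</b> title';
my $title_backtick = $title_slash;

# Conventional slash delimiters.
$title_slash =~ s/<(.*?)>//g;

# The same substitution with backticks as delimiters.
$title_backtick =~ s`<(.*?)>``g;

print "$title_slash\n";     # A bold title
print "$title_backtick\n";  # A bold title
```

The backtick carries no special meaning here; it only marks where the pattern and the replacement begin and end.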

Related

Perl, Change the Case of Letter at { character

I am a perl newb, and just need to get something done quick and dirty.
I have lines of text (from .bib files) such as
Title = {{the Particle Swarm - Explosion, Stability, and Convergence in a Multidimensional Complex Space}},
How can I write a regex such that the first letter after the second { becomes capitalised?
Thanks
One way, for the question as asked
$string =~ s/{{\K(\w)/uc($1)/ge;
whereby /e makes it evaluate the replacement side as code. The \K makes it drop all previous matches so {{ aren't "consumed" (and thus need not be retyped in the replacement side).
If you wish to allow for possible spaces:  $string =~ s/{{\s*\K(\w)/uc($1)/ge;. As far as I know BibTeX, it makes sense to allow for spaces between the curlies as well, so {\s*{.
If simple capitalization is all you need then \U$1 in the replacement side suffices and there is no need for the /e modifier with it, per comment by Grinnz. The \U is an escape sequence available in double-quoted (interpolating) contexts in general, so it can also be used in the replacement side of a substitution; see under Escape sequences in perlre, and in perlretut.
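Here is a small sketch comparing the two variants on a shortened version of the line from the question. The braces are escaped as \{ in this sketch, which is my own choice: newer Perl versions are stricter about unescaped { in patterns, so escaping is the safer habit.

```perl
use strict;
use warnings;

# Sample line adapted from the question (title shortened).
my $line = 'Title = {{the Particle Swarm}},';

# Variant 1: /e evaluates uc($1) as Perl code.
(my $with_e = $line) =~ s/\{\s*\{\s*\K(\w)/uc($1)/ge;

# Variant 2: \U uppercases what follows it in the replacement; no /e needed.
(my $with_U = $line) =~ s/\{\s*\{\s*\K(\w)/\U$1/g;

print "$with_e\n";  # Title = {{The Particle Swarm}},
print "$with_U\n";  # Title = {{The Particle Swarm}},
```

Because of \K the braces themselves are not part of what gets replaced, so only the captured letter changes.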
I recommend a good read through the tutorial perlretut. That will go a long way.
However, I must also ask: Are you certain that you may indeed just unleash that on your whole file? Will it catch all cases you need? Will it not clip something else you didn't mean to?

perl regular expressions substitute

I'm new to perl and I found this expression a bit difficult to understand
$a =~ s/\'/\'\"\'\"\'/g
Can someone help me understand what this piece of code does?
It is not clear which part of it is a problem. Here is a brief mention of each.
The expression $str =~ s/pattern/replacement/ is a substitution driven by a regular expression: it replaces the part of the string $str that is matched by the "pattern" with the "replacement". With /g at the end it does so for all instances of the "pattern" found in the string. There is far, far more to it -- see the tutorial, a quick start, the reference, and full information in perlre and perlop.
Some characters have a special meaning, in particular when used in the "pattern". A common set is .^$*+?()[{\|, but this depends on the flavor of regex. Also, whatever is used as a delimiter (commonly /) has to be escaped as well. See documentation and for example this post. If you want to use any one of those characters as a literal character to be found in the string, they have to be escaped by the backslash. (Also see about escaped sequences.) In your example, the ' is escaped, and so are the characters used as the replacement.
The thing is, these in fact don't need to be escaped. So this works
use strict;
use warnings;
my $str = "a'b'c";
$str =~ s/'/'"'/g;
print "$str\n";
It prints the desired a'"'b'"'c. Note that you still may escape them and it will work. One situation where ' is indeed very special is in a one liner, where it delimits the whole piece of code to submit to Perl via shell, to be run. (However, merely escaping it there does not work.) Altogether, the expression you show does the replacement just as in the answer by Jens, but with a twist that is not needed.
A note on a related feature: you can use nearly anything for delimiters, not necessarily just /. So this |patt|repl| is fine, and nicely even this is: {patt}{repl} (unless they are used inside). This is sometimes extremely handy, improving readability a lot. For one thing, then you don't have to escape the / itself inside the regex.
I hope this helps a little.
All the \ in that are useless (but harmless), so it is equivalent to s/'/'"'"'/g. It replaces every ' with '"'"' in the string.
This is often used for shell quoting. See for example https://stackoverflow.com/a/24869016/17389
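To illustrate the shell-quoting use, here is a rough sketch with a hypothetical helper (shell_quote is a name made up for this example, not a built-in). It assumes a POSIX-style shell, where a single quote cannot appear inside a single-quoted string, so each one is closed, emitted inside double quotes, and reopened:

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a string in single quotes for a POSIX-style shell.
sub shell_quote {
    my ($text) = @_;
    # Close the single quote, emit a double-quoted ', reopen the single quote.
    $text =~ s/'/'"'"'/g;
    return "'$text'";
}

my $quoted = shell_quote("it's here");
print "$quoted\n";   # 'it'"'"'s here'
```

The shell concatenates the adjacent quoted pieces back into the single argument it's here.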
This piece of code replaces every single quote with '"'"'
It can be simplified by removing the needless backslashes
$a =~ s/'/'"'"'/g
Every ' will be replaced with '"'"'

What is the meaning of $testModule =~ s#/#::#ig; in Perl?

I am a Perl newbie reading some Perl code, and I found one line below that I couldn't understand. Can anyone tell me the meaning of
s#/#::#ig
I know =~ tries to match a regular expression. Usually I would see code like s/<regular expression>//gi, so I was a little confused by the following code. Can anyone help to elaborate?
$testModule =~ s#/#::#ig;
You can use lots of different characters as regex delimiters.
This one is using # instead of / so it can use / as data inside the regex without escaping it.
It's equivalent to:
$testModule =~ s/\//::/ig;
See quote and quote-like operators in the Perl documentation for more details.
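A quick sketch to check the equivalence, assuming $testModule holds a path-like string (the question does not show its value, so the one below is made up). The /i flag is left out because the pattern contains no letters, so it should make no difference:

```perl
use strict;
use warnings;

# Hypothetical value; the question does not show what $testModule holds.
my $testModule = 'Foo/Bar/Baz';
my $escaped    = $testModule;

$testModule =~ s#/#::#g;     # hash delimiters, slash needs no escaping
$escaped    =~ s/\//::/g;    # slash delimiters, slash must be escaped

print "$testModule\n";  # Foo::Bar::Baz
print "$escaped\n";     # Foo::Bar::Baz
```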

sed regex stop at first match

I want to replace part of the following html text (excerpt of a huge file), to update old forum formatting (resulting from a very bad forum porting job done 2 years ago) to regular phpBB formatting:
<blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i>
this should be filtered into:
[quote=User]
I used the following regex in sed
s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g
This works on the given example, but in the actual file several quotes like this can be on one line. In that case sed is too greedy, and places everything between the first and the last match in the [quote=...] tag. I cannot seem to make it replace every occurrence of this pattern in the line... (I don't think there are any nested quotes, but that would make it even more difficult)
You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like make a minimal match, or one with a negative lookahead.
The easiest way to do this is simply to use Perl in the first place.
If you have an existing sed script, you can translate it into Perl using the s2p(1) utility. Note that in Perl you really want to use $1 on the right side of the s/// operator. In most cases the \1 is grandfathered, but in general you want $1 there:
s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g;
Notice I have removed the backslash from the front of the parens. Another advantage of using Perl is that it uses the sane egrep-style regexes (like awk), not the ugly grep-style ones (like sed) that require all those confusing (and inconsistent) backslashes all over the place.
Another advantage to using Perl is you can use paired, nestable delimiters to avoid ugly backslashes. For example:
s{<blockquote.*?written by (.*?)</i>}
{[quote=$1]}g;
Other advantages include that Perl gets along excellently well with UTF-8 (now the Web’s majority encoding form), and that you can do multiline matches without the extreme pain that sed requires for that. For example:
$ perl -CSD -00 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ...
The -CSD makes it treat stdin, stdout, and files as UTF-8. The -00 makes it read the input in paragraph mode, one blank-line-separated chunk at a time (use -0777 instead to slurp each file whole), and the /s makes the dot cross newline boundaries as need be.
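To see the difference between greedy and minimal matching, here is a small sketch on a simplified stand-in for the forum markup (the input string is made up and carries fewer attributes than the real lines):

```perl
use strict;
use warnings;

# Simplified, hypothetical stand-in for a line holding two quotes.
my $html = '<blockquote id="quote"><i>written by Ann</i> hi '
         . '<blockquote id="quote"><i>written by Bob</i> bye';

# Greedy: .* runs to the last "written by" on the line.
(my $greedy = $html) =~ s{<blockquote.*written by (.*)</i>}{[quote=$1]}g;

# Minimal: .*? stops at the first place the rest of the pattern can match.
(my $lazy = $html) =~ s{<blockquote.*?written by (.*?)</i>}{[quote=$1]}g;

print "$greedy\n";  # [quote=Bob] bye
print "$lazy\n";    # [quote=Ann] hi [quote=Bob] bye
```

The greedy form swallows the first quote entirely, which is the behaviour described in the question; the minimal form handles each quote on its own.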
I don't think sed supports non-greedy match. You can try perl though:
perl -pe 's/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g' filename
This might work for you:
sed '/<blockquote.*written by .*<\/i>/!b;s/<blockquote/\n/g;s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g;s/\n/\<blockquote/g' file
Explanation:
If a line doesn't contain the pattern then skip it. /<blockquote.*written by .*<\/i>/!b
Change the front of the pattern into a newline globally throughout the line. s/<blockquote/\n/g
Globally replace the newline followed by the remaining pattern using a [^\n]* instead of .*. s/\n[^\n]*written by \([^\n]*\)<\/i>/[quote=\1]/g
Revert those newlines not changed to the original front pattern. s/\n/\<blockquote/g

Regex to replace all occurrences of a given character, ONLY after a given match

For the sake of simplicity, let's say that we have input strings with this format:
*text1*|*text2*
So, I want to leave text1 alone, and remove all spaces in text2.
This could be easy if we didn't have text1, a simple search and replace like this one would do:
%s/\s//g
but in this context I don't know what to do.
I tried with something like:
%s/\(.*|\S*\).\(.*\)/\1\2/g
which works, but it removes only one character per run; I mean, it would have to be run on the same line once for each offending space.
So, a preferred restriction, is to solve this with only one search and replace. And, although I used Vim syntax, use the regular expression flavor you're most comfortable with to answer, I mean, maybe you need some functionality only offered by Perl.
Edit:
My solution for Vim:
%s:\(|.*\)\@<=\s::g
One way, in perl:
s/(^.*\||(?=\s))\s*/$1/g
Certainly much greater efficiency is possible if you allow more than just one search and replace.
So you have a string with one pipe (|) in it, and you want to replace only those spaces that don't precede the pipe?
s/\s+(?![^|]*\|)//g
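A short sketch of the negative-lookahead version on a made-up input in the question's text1|text2 shape. Whitespace is removed only when no pipe remains anywhere ahead of it:

```perl
use strict;
use warnings;

# Hypothetical input in the text1|text2 shape from the question.
my $input = 'keep these spaces|drop these spaces';

# Remove whitespace only where no '|' follows later in the string.
(my $result = $input) =~ s/\s+(?![^|]*\|)//g;

print "$result\n";  # keep these spaces|dropthesespaces
```

This relies on the string holding exactly one pipe; with more pipes, only the text after the last one would be affected.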
You might try embedding Perl code in a regular expression (using the (?{...}) syntax), which is, however, rather an experimental feature and might not work or even be available in your scenario.
This
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s:\s::g })/$1$x/
should theoretically work, but I got an "Out of memory!" failure, which can be fixed by replacing '\s' with a space:
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s: ::g })/$1$x/
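As a possible alternative to (?{...}), and not what the answer above does, the code can live in the replacement side via the /e modifier instead. This is only a sketch, assuming one pipe per line and a made-up input; $tail is a name introduced for the example:

```perl
use strict;
use warnings;

# Hypothetical input in the text1|text2 shape from the question.
my $input = 'keep these spaces|drop these spaces';

# Match the first '|', keep it via \K, and capture the rest of the line.
# With /e the replacement is code: copy the capture, strip its whitespace,
# and return the cleaned copy.
(my $result = $input) =~ s{\|\K(.*)}{ my $tail = $1; $tail =~ s/\s+//g; $tail }e;

print "$result\n";  # keep these spaces|dropthesespaces
```

It is still a single outer substitution, although the replacement runs a second one internally.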