perl regular expressions substitue - regex

I'm new to perl and I found this expressions bit difficult to understand
$a =~ s/\'/\'\"\'\"\'/g
Can someone help me understand what this piece of code does?

It is not clear which part of it is a problem. Here is a brief mention of each.
The expression $str =~ s/pattern/replacement/ is a regular expression, replacing the part of string $str that is matched by the "pattern", by the "replacement". With /g at the end it does so for all instances of the "pattern" found in the string. There is far, far more to it -- see the tutorial, a quick start, the reference, and full information in perlre and perlop.
Some characters have a special meaning, in particular when used in the "pattern". A common set is .^$*+?()[{\|, but this depends on the flavor of regex. Also, whatever is used as a delimiter (commonly /) has to be escaped as well. See documentation and for example this post. If you want to use any one of those characters as a literal character to be found in the string, they have to be escaped by the backslash. (Also see about escaped sequences.) In your example, the ' is escaped, and so are the characters used as the replacement.
The thing is, these in fact don't need to be escaped. So this works
use strict;
use warnings;
my $str = "a'b'c";
$str =~ s/'/'"'/g;
print "$str\n";
It prints the desired a'"'b'"'c. Note that you still may escape them and it will work. One situation where ' is indeed very special is in a one liner, where it delimits the whole piece of code to submit to Perl via shell, to be run. (However, merely escaping it there does not work.) Altogether, the expression you show does the replacement just as in the answer by Jens, but with a twist that is not needed.
A note on a related feature: you can use nearly anything for delimiters, not necessarily just /. So this |patt|repl| is fine, and nicely even this is: {patt}{repl} (unless they are used inside). This is sometimes extremely handy, improving readability a lot. For one thing, then you don't have to escape the / itself inside the regex.
I hope this helps a little.

All the \ in that are useless (but harmless), so it is equivalent to s/'/'"'"'/g. It replaces every ' with '"'"' in the string.
This is often used for shell quoting. See for example https://stackoverflow.com/a/24869016/17389

This peace of Code replace every singe Quote with '"'"'

Can be simplified by removing the needless back slashes
$a =~ s/'/"'"'/g
Every ' will we replaced with "'"'

Related

Perl, Change the Case of Letter at { character

I am a perl newb, and just need to get something done quick and dirty.
I have lines of text (from .bib files) such as
Title = {{the Particle Swarm - Explosion, Stability, and Convergence in a Multidimensional Complex Space}},
How can I write a regex such that the first letter after the second { becomes capitalised.
Thanks
One way, for the question as asked
$string =~ s/{{\K(\w)/uc($1)/ge;
whereby /e makes it evaluate the replacement side as code. The \K makes it drop all previous matches so {{ aren't "consumed" (and thus need not be retyped in the replacement side).
If you wish to allow for possible spaces:  $string =~ s/{{\s*\K(\w)/uc($1)/ge;, and as far as I know bibtex why not allow for spaces between curlies as well, so {\s*{.
If simple capitalization is all you need then \U$1 in the replacement side sufficies and there is no need for /e modifier with it, per comment by Grinnz. The \U is a generic quote-like operator, which can thus also be used in regex; see under Escape sequences in perlre, and in perlretut.
I recommend a good read through the tutorial perlretut. That will go a long way.
However, I must also ask: Are you certain that you may indeed just unleash that on your whole file? Will it catch all cases you need? Will it not clip something else you didn't mean to?

Backtick Character ` in Perl Regex Meaning

I have inherited a Perl codebase that uses regex to parse an XML file. Not optimal, I know. I have lines of code like:
$title =~ s`<(.*?)>``g;
and
$content =~ s`__DOUBLEFIG__`${fig_html}`;
The standard Perl find and replace takes the form of
s/foo/bar/g;
What does
s`foo`bar`g;
do?
Those are alternative ways of delimiting the regex. Whomever wrote that code chose to use something else for personal reasons or, possibly, for code readability. Often, if patterns are often going to involve /, something else will be chosen to avoid having to escape the slash character in the regex.
This answer provides more information.

Regex doesn't match. Online generator does

I've want to check with a regex this kind of string:
2020_2021_01_01
I've putted it in a variable, say $session
so i do:
if [[ "$session" =~ \d{4}[_]\d{4}[_]\d{2}[_]\d{2} ]]; then
stuff
fi
you see...it doesn't work... but I don't know why....
any help?
THANKS!
The bash manual rather tersely explains that when the =~ operator "is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3))".
Here, regex(3) is a reference to man 3 regex, which might explain what an "extended regular expression" is. A longer description would be "Posix standard extended regular expressions", and you can find the documentation for those in the Posix document. If you're using an online regular expression tester, make sure you select "Posix regular expressions".
In short, they don't include Perlisms like \d. You can write [[:digit:]] or (if you are using the C locale) [0-9].
So your regex could have been written:
([[:digit:]]{4}_){2}[[:digit:]]{2}_[[:digit:]]{2}
(there is no need to quote _). However, be aware that the =~ operator looks for a substring which matches the pattern, rather than testing whether the left-hand operator precisely matches the pattern. So you quite possibly actually wanted an anchored match:
^([[:digit:]]{4}_){2}[[:digit:]]{2}_[[:digit:]]{2}$
The backslash character is an escape character in bash shell. In your example, I think that's making the the regular expression read like this:
d{4}[_]d{4}[_]d{2}[_]d{2}
You could confirm this by testing, setting $session to dddd_dddd_dd_dd
To workaround this, to preserve the backslash character in the regular expression, you'll need to "escape" it. In your case, preceding each backslash with an "extra" backslash may do the trick. The shell will see the two backslashes, and leave the second one, as part of the string.
if [[ "$session" =~ \\d{4}[_]\\d{4}[_]\\d{2}[_]\\d{2} ]]; then
I'm not sure if there are other characters that are going to need to be escaped. This calls for a real short script, one that you can change and run, to figure out what's working and whats not. Can you match the start of the string, a single digit character, etc.
(The whole escaping thing gets funkier... inside double quotes, inside single quotes, ...)
There was a website I used to use, put in the string I wanted, and it would give me back what it needed to look like in the shell script, I don't have a link to that anymore. There's probably a regular expression tester that let's you test "bash" regular expressions.

Regex to replace all ocurrences of a given character, ONLY after a given match

For the sake of simplicity, let's say that we have input strings with this format:
*text1*|*text2*
So, I want to leave text1 alone, and remove all spaces in text2.
This could be easy if we didn't have text1, a simple search and replace like this one would do:
%s/\s//g
but in this context I don't know what to do.
I tried with something like:
%s/\(.*|\S*\).\(.*\)/\1\2/g
which works, but removing only the first character, I mean, this should be run on the same line one time for each offending space.
So, a preferred restriction, is to solve this with only one search and replace. And, although I used Vim syntax, use the regular expression flavor you're most comfortable with to answer, I mean, maybe you need some functionality only offered by Perl.
Edit:
My solution for Vim:
%s:\(|.*\)\#<=\s::g
One way, in perl:
s/(^.*\||(?=\s))\s*/$1/g
Certainly much greater efficiency is possible if you allow more than just one search and replace.
So you have a string with one pipe (|) in it, and you want to replace only those spaces that don't precede the pipe?
s/\s+(?![^|]*\|)//g
You might try embedding Perl code in a regular expression (using the (?{...}) syntax), which is, however, rather an experimental feature and might not work or even be available in your scenario.
This
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s:\s::g })/$1$x/
should theoretically work, but I got an "Out of memory!" failure, which can be fixed by replacing '\s' with a space:
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s: ::g })/$1$x/

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex version supports it, you can use \L, like so in a POSIX shell:
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate1, there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example2, here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as—instead of escaping a special character to make it a literal—marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
1 I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
2 I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*.m4a)/$1\L$2\E$3/g' *
In Perl, there's
$string =~ tr/[A-Z]/[a-z]/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.