How can I substitute regexp matches and map the substitutions in Perl?

I.e.:
echo H#97llo | MagicPerlCommand
Stdout:
Hallo
where MagicPerlCommand is something like
perl -pnle "s/#(\d+)/chr(\1)/ge"
(but that doesn't work).

Change \1 to $1 in your MagicPerlCommand. The \digit backreference style doesn't work when the replacement is evaluated as an expression (i.e. with s///e).
That worked for me on Windows and Linux.

As per the j_random_hacker answer, you must use $1 rather than \1.
This is because the '/e' modifier makes the right-hand half an ordinary Perl expression rather than a regex replacement string. Since it's Perl code, you have to use Perl's syntax for the capture-group reference ($1), not the regex backreference syntax (\1).

Related

PCRE regex behaves differently when moved to subroutine

Using PCRE v8.42, I am trying to abstract a regex into a named subroutine, but when it's in a subroutine, it seems to behave differently.
This outputs 10/:
echo '10/' | pcregrep '(?:0?[1-9]|1[0-2])\/'
This outputs nothing:
echo '10/' | pcregrep '(?(DEFINE)(?<MONTHNUM>(?:0?[1-9]|1[0-2])))(?&MONTHNUM)\/'
Are these two regular expressions not equivalent?
In PCRE 8.x and in versions of PCRE2 prior to 10.30, subroutine calls are always treated as atomic groups. Your (?(DEFINE)(?<MONTHNUM>(?:0?[1-9]|1[0-2])))(?&MONTHNUM)\/ regex is effectively equivalent to (?>0?[1-9]|1[0-2])\/. See this regex demo, where, as expected, 10/ does not match.
There is no match because 0?[1-9] matched the 1 in 10/ and since there is no backtracking allowed, the second alternative was not tested ("entered"), and the whole match failed as there is no / after 1.
You need to make sure the longer alternative comes first:
(?(DEFINE)(?<MONTHNUM>(?:1[0-2]|0?[1-9])))(?&MONTHNUM)/
See the regex demo. Note that in the pcregrep pattern, you do not need to escape /.
Alternatively, you can use PCRE2 v10.30 or newer (its command-line tool is pcre2grep), where subroutine calls are no longer atomic.

Regex doesn't match. Online generator does

I want to check this kind of string with a regex:
2020_2021_01_01
I've put it in a variable, say $session
so I do:
if [[ "$session" =~ \d{4}[_]\d{4}[_]\d{2}[_]\d{2} ]]; then
stuff
fi
As you can see, it doesn't work, but I don't know why.
Any help?
Thanks!
The bash manual rather tersely explains that when the =~ operator "is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3))".
Here, regex(3) is a reference to man 3 regex, which might explain what an "extended regular expression" is. A longer description would be "POSIX standard extended regular expressions", and you can find the documentation for those in the POSIX document. If you're using an online regular expression tester, make sure you select "POSIX regular expressions".
In short, they don't include Perlisms like \d. You can write [[:digit:]] or (if you are using the C locale) [0-9].
So your regex could have been written:
([[:digit:]]{4}_){2}[[:digit:]]{2}_[[:digit:]]{2}
(there is no need to quote _). However, be aware that the =~ operator looks for a substring which matches the pattern, rather than testing whether the left-hand operand matches the pattern in its entirety. So you quite possibly wanted an anchored match:
^([[:digit:]]{4}_){2}[[:digit:]]{2}_[[:digit:]]{2}$
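A small bash sketch of the difference between the unanchored and the anchored pattern. The names padded, unanchored, anchored and check are made up for the example, and keeping the regex in a variable is just one common way to avoid quoting trouble, not the only one.

```shell
# Requires bash: [[ ... =~ ... ]] is not available in plain POSIX sh.
session='2020_2021_01_01'
padded="xx${session}yy"

unanchored='([[:digit:]]{4}_){2}[[:digit:]]{2}_[[:digit:]]{2}'
anchored="^${unanchored}\$"

check() {
  # $1 = string, $2 = regex. $2 is left unquoted on purpose:
  # a quoted right-hand side would be matched as a literal string.
  if [[ $1 =~ $2 ]]; then echo match; else echo 'no match'; fi
}

r1=$(check "$session" "$anchored")    # whole string fits the pattern
r2=$(check "$padded" "$unanchored")   # substring match is enough
r3=$(check "$padded" "$anchored")     # extra characters are rejected
echo "$r1 / $r2 / $r3"
```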
The backslash character is an escape character in the bash shell. In your example, I think that's making the regular expression read like this:
d{4}[_]d{4}[_]d{2}[_]d{2}
You could confirm this by testing, setting $session to dddd_dddd_dd_dd
To work around this and preserve the backslash character in the regular expression, you'll need to "escape" it. In your case, preceding each backslash with an "extra" backslash may do the trick. The shell will see the two backslashes and leave the second one as part of the string.
if [[ "$session" =~ \\d{4}[_]\\d{4}[_]\\d{2}[_]\\d{2} ]]; then
I'm not sure if there are other characters that are going to need to be escaped. This calls for a really short script, one that you can change and run, to figure out what's working and what's not. Can you match the start of the string, a single digit character, etc.?
(The whole escaping thing gets funkier... inside double quotes, inside single quotes, ...)
There was a website I used to use: you put in the string you wanted, and it would give you back what it needed to look like in the shell script. I don't have a link to that anymore. There's probably a regular expression tester that lets you test "bash" regular expressions.

Perl Extended Regular Expressions - match with multiple question marks inside

I have got a weird thing to solve in perl using regular expressions.
Consider the strings -
abcdef000000123
blaDeF002500456
wefdEF120045423
All of these strings are matching with the below regular expression when I tried in C with pcre library support :
???[dD][eE][fF][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
But I'm unable to achieve the same in perl code. I'm getting some weird errors.
Please help with the piece of perl code with which these two things match.
Thanks in advance...
? is a quantifier that makes the preceding pattern or group optional. On its own, ? has nothing to quantify, which is why you are getting an error like: Quantifier follows nothing in regex.
Following regex should work for you in perl:
...[dD][eE][fF][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
OR even more concise regex:
.{3}[dD][eE][fF][0-9]{9}
Each dot means match any character.
PS: You probably are getting confused by shell's glob vs regex.
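To make the glob-versus-regex point concrete, here is a rough sketch that checks one of the sample strings both ways. It assumes a POSIX shell and grep with -E; the variable names are invented for the example.

```shell
s='blaDeF002500456'

# As a shell glob, ? means "any single character", and a case pattern
# has to match the whole string.
case $s in
  ???[dD][eE][fF][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) glob_result=match ;;
  *) glob_result='no match' ;;
esac

# As a regex, the same idea is written with . and needs explicit anchors.
if printf '%s\n' "$s" | grep -Eq '^.{3}[dD][eE][fF][0-9]{9}$'; then
  regex_result=match
else
  regex_result='no match'
fi

echo "glob: $glob_result, regex: $regex_result"
```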
That looks more like a file-system glob pattern than a PCRE. In Perl, the ? is a quantifier, not a wildcard. You may want to replace each ? with . to get the same results in anything Perl compatible.
I might use ...[dD][eE][fF][0-9]{9} or even replace the [0-9] with \d.
qr/[A-Za-z]{3}def[0-9]{9}/i
should be the Perl Regex object used to validate the mentioned strings.
Regards

Proper Perl syntax for complex substitution

I've got a large number of PHP files and lines that need to be altered from a standard
echo "string goes here"; syntax to:
custom_echo("string goes here");
This is the line I'm trying to punch into Perl to accomplish this:
perl -pi -e 's/echo \Q(.?*)\E;/custom_echo($1);/g' test.php
Unfortunately, I'm making some minor syntax error, and it's not altering "test.php" in the least. Can anyone tell me how to fix it?
Why not just do something like:
perl -pi -e 's|echo (\".*?\");|custom_echo($1);|g' file.php
I don't think \Q and \E are doing what you think they're doing. They're not beginning and end of quotes. They're in case you put in a special regex character (like .) -- if you surround it by \Q ... \E then the special regex character doesn't get interpreted.
In other words, your regular expression is trying to match the literal string (.?*), which you probably don't have, and thus substitutions don't get made.
You also had your ? and * backwards -- I assume you want to match non-greedily, in which case you need to put the ? as a non-greedy modifier to the .* characters.
Edit: I also strongly suggest doing:
perl -pi.bak -e ... file.php
This will create a "backup" file that the original file gets copied to. In my above example, it'll create a file named file.php.bak that contains the original, pre-substitution contents. This is incredibly useful during testing until you're certain that you've built your regex properly. Hell, disk is cheap, I'd suggest always using the -pi.bak command-line option.
You put your grouping parentheses inside the metaquoting expression (\Q(pattern)\E) instead of outside ((\Qpattern\E)), so your parentheses also get escaped and your regex is not capturing anything.

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex implementation supports it, you can use \L, for example with GNU sed (both -r and \L are GNU extensions):
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate [1], there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example [2], here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as—instead of escaping a special character to make it a literal—marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
[1] I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
[2] I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this is the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*\.m4a)/$1\L$2\E$3/g' *
In Perl, there's
$string =~ tr/A-Z/a-z/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.