Hexadecimal Variables in substitution patterns - regex

The file I am getting is full of badly formatted UTF-8 codes, like <0308> etc. I can identify them all right, but I want to replace them with the actual UTF-8 character, preferably with a regex. I've tried dozens of regexes like this:
s/<[0-9a-fA-F]{2,4}/\x{$1}/g
s/<[0-9a-fA-F]{2,4}/\N{U+$1}/g
And so on, but each time it tells me that $ is not a valid hex-char (to which I fully agree). Shouldn't it just take the number in my $1 and put it in there? Or does Perl really expect me to use \x{..} or \N{U+..} only with fixed values? If so, I'd have to hand-write the conversion for every possible hex-value - not very useful.

For one thing, you need to use parentheses to capture something in your regular expression; otherwise $1 will not get set to anything.
Combining chr and hex with the /e (eval) modifier will do the trick here:
s/ <
([0-9a-fA-F]{2,4}) # parentheses to set $1
>
/
chr(hex($1))
/gex;

What version of perl are you using? This seems to work fine for me on 5.10.1:
$ perl -E '$foo = "<0308>"; $foo =~ s/<[0-9a-fA-F]{2,4}/\N{U+$1}/g; say $foo'
Wide character in print at -e line 1.
�>
(With \x{$1}, it seems to substitute the numbers with nothing, but I still don't get an error message.)

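You probably need the /e (eval) modifier. Note, though, that \x{...} and \N{U+...} are resolved when the string literal is compiled, so /\x{$1}/eg or /"\x{$1}"/eg will most likely not pick up the value of $1; build the character at run time instead, for example with chr(hex($1)) under /e.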

Related

Expand environment variable inside Perl regex

I am having trouble with a short bash script. It seems like all forward slashes need to be escaped. How can required characters in expanded (environment) variables be escaped before perl reads them? Or is there some other method that perl understands?
This is what I am trying to do, but this will not work properly.
eval "perl -pi -e 's/$HOME\/_TV_rips\///g'" '*$videoID.info.json'
That is part of a longer script where videoID=$1. (And for some reason perl expands variables both within single and double quotes.)
This simple workaround with no forward slash in the expanded environment variable $USER works. But I would like to not have /Users/ hard coded:
eval "perl -pi -e 's/\/Users\/$USER\/_TV_rips\///g'" '*$videoID.info.json'
This is probably solvable in some better way fetching home dir for files or something else. The goal is to remove the folder name in youtube-dl's json data.
I am using perl just because it can handle extended regex. But perl is not required. Any better substitute for extended regex on macOS is welcome.
You are building the following Perl program:
s//home/username\/_TV_rips\///g
That's quite wrong.
You shouldn't be attempting to build Perl code from the shell in the first place. There are a few ways you could pass values to the Perl code instead of generating Perl code. Since the value is conveniently in the environment, we can use
perl -i -pe's/\Q$ENV{HOME}\E\/_TV_rips\///' *"$videoID.info.json"
or better yet
perl -i -pe's{\Q$ENV{HOME}\E/_TV_rips/}{}' *"$videoID.info.json"
(Also note the lack of eval and the fixed quoting on the glob.)
Just assembling the ideas in comments, this should achieve what you expected:
perl -pi -e 's{$ENV{HOME}/_TV_rips/}{}g' *$videoID.info.json
@ikegami thanks for your comment! It is indeed safer with \Q...\E, in case $HOME contains characters like $.
All regex delimiters must of course be escaped in the input string.
But as Stefen stated, you can use other delimiters in Perl, like % or §. That makes it easier if you have many slashes in your string.
Special characters:
# : Perl comment character; don't use this
?, [], {}, $, ^, . : regex control characters; must be escaped in the regex
You should always write a comment to make clear you are using different delimiters, because this makes your regex hard to read for inexperienced users.
Try out your RegEx here: https://regex101.com/r/cIWk1o/1

Is there an alternative to negative look ahead in sed

In sed I would like to be able to match /js/ but not /js/m. I cannot use /js/[^m] because that would match /js/ plus whatever character comes after it. Negative lookahead does not work in sed, or I would have used /js/(?!m) and called it a day. Is there a way to achieve this with sed that would work for most similar situations where you want a section of text that does not end in another section of text?
Is there a better tool for what I am trying to do than sed? Possibly one that allows look ahead. awk seems a bit too much with its own language.
Well you could just do this:
$ echo 'I would like to be able to match /js/ but not /js/m' |
sed 's:#:#A:g; s:/js/m:#B:g; s:/js/:<&>:g; s:#B:/js/m:g; s:#A:#:g'
I would like to be able to match </js/> but not /js/m
You didn't say what you wanted to do with /js/ when you found it so I just put <> around it. That will work on all UNIX systems, unlike a perl solution since perl isn't guaranteed to be available and you're not guaranteed to be allowed to install it.
The approach I use above is a common idiom in sed, awk, etc. to create strings that can't be present in the input. It doesn't matter what character you use for # as long as it's not present in the string or regexp you're really interested in, which in the above is /js/. s/#/#A/g ensures that every occurrence of # in the input is followed by A. So now when I do s/foobar/#B/g I have replaced every occurrence of foobar with #B and I KNOW that every #B represents foobar because all other #s are followed by A. So now I can do s/foo/whatever/ without tripping over foo appearing within foobar. Then I just unwind the initial substitutions with s/#B/foobar/g; s/#A/#/g.
In this case though since you aren't using multi-line hold-spaces you can do it more simply with:
sed 's:/js/m:\n:g; s:/js/:<&>:g; s:\n:/js/m:g'
since there can't be newlines in a newline-separated string. The above will only work in seds that support use of \n to represent a newline (e.g. GNU sed) but for portability to all seds it should be:
sed 's:/js/m:\
:g; s:/js/:<&>:g; s:\
:/js/m:g'
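To see that the placeholder idiom really is safe when the input already contains the marker character, here is a sketch with made-up input that includes both a bare # and a literal #B. The function name wrap_js is hypothetical.

```shell
# Hypothetical wrapper around the five-step substitution described above.
wrap_js() {
  sed 's:#:#A:g; s:/js/m:#B:g; s:/js/:<&>:g; s:#B:/js/m:g; s:#A:#:g'
}

# Made-up input containing "#" and "#B", which must survive unchanged.
out=$(printf '%s\n' 'a # b /js/ c /js/m d #B' | wrap_js)

printf '%s\n' "$out"
```

The literal #B in the input becomes #AB after the first step, so it can never be mistaken for the #B placeholder that stands for /js/m.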

Proper Perl syntax for complex substitution

I've got a large number of PHP files and lines that need to be altered from a standard
echo "string goes here"; syntax to:
custom_echo("string goes here");
This is the line I'm trying to punch into Perl to accomplish this:
perl -pi -e 's/echo \Q(.?*)\E;/custom_echo($1);/g' test.php
Unfortunately, I'm making some minor syntax error, and it's not altering "test.php" in the least. Can anyone tell me how to fix it?
Why not just do something like:
perl -pi -e 's|echo (\".*?\");|custom_echo($1);|g' file.php
I don't think \Q and \E are doing what you think they're doing. They're not beginning and end of quotes. They're in case you put in a special regex character (like .) -- if you surround it by \Q ... \E then the special regex character doesn't get interpreted.
In other words, your regular expression is trying to match the literal string (.?*), which you probably don't have, and thus substitutions don't get made.
You also had your ? and * backwards -- I assume you want to match non-greedily, in which case you need to put the ? as a non-greedy modifier to the .* characters.
Edit: I also strongly suggest doing:
perl -pi.bak -e ... file.php
This will create a "backup" file that the original file gets copied to. In my above example, it'll create a file named file.php.bak that contains the original, pre-substitution contents. This is incredibly useful during testing until you're certain that you've built your regex properly. Hell, disk is cheap, I'd suggest always using the -pi.bak command-line option.
You put your grouping parentheses inside the metaquoting expression (\Q(pattern)\E) instead of outside ((\Qpattern\E)), so your parentheses also get escaped and your regex is not capturing anything.

Regex to replace all occurrences of a given character, ONLY after a given match

For the sake of simplicity, let's say that we have input strings with this format:
*text1*|*text2*
So, I want to leave text1 alone, and remove all spaces in text2.
This could be easy if we didn't have text1, a simple search and replace like this one would do:
%s/\s//g
but in this context I don't know what to do.
I tried with something like:
%s/\(.*|\S*\).\(.*\)/\1\2/g
which works, but removes only one character per pass; it would have to be run on the same line once for each offending space.
So the preferred restriction is to solve this with only one search and replace. And although I used Vim syntax, answer in whichever regular expression flavor you're most comfortable with; maybe you need some functionality only offered by Perl.
Edit:
My solution for Vim:
%s:\(|.*\)\@<=\s::g
One way, in perl:
s/(^.*\||(?=\s))\s*/$1/g
Certainly much greater efficiency is possible if you allow more than just one search and replace.
So you have a string with one pipe (|) in it, and you want to replace only those spaces that don't precede the pipe?
s/\s+(?![^|]*\|)//g
You might try embedding Perl code in a regular expression (using the (?{...}) syntax), which is, however, rather an experimental feature and might not work or even be available in your scenario.
This
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s:\s::g })/$1$x/
should theoretically work, but I got an "Out of memory!" failure, which can be fixed by replacing '\s' with a space:
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s: ::g })/$1$x/

What is the best way to do string manipulation in a shell script?

I have a path as a string in a shell-script, could be absolute or relative:
/usr/userName/config.cfg
or
../config.cfg
I want to extract the file name (part after the last /, so in this case: "config.cfg")
I figure the best way to do this is with some simple regex?
Is this correct? Or should I use sed or awk instead?
Shell-scripting's string manipulation features seem pretty primitive by themselves, and appear very esoteric.
Any example solutions are also appreciated.
If you're okay with using bash, you can use bash string expansions:
FILE="/path/to/file.example"
FILE_BASENAME="${FILE##*/}"
It's a little cryptic, but the braces start the variable expansion, and the double hash does a greedy removal of the specified glob pattern from the beginning of the string.
Double %% does the same thing from the end of a string, and a single percent or hash does a non-greedy removal.
Also, a simple replace construct is available too:
FILE=${FILE// /_}
would replace all spaces with underscores for instance.
A single slash replaces only the first match.
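The prefix and suffix removals side by side, as a sketch on a made-up path. Only the # and % forms are shown, since those also work in a plain POSIX sh; the ${var//pattern/replacement} form needs bash.

```shell
# Made-up path with two dots, so greedy and non-greedy results differ.
file='/usr/userName/config.tar.gz'

name=${file##*/}        # longest "*/" prefix removed  -> config.tar.gz
rest=${file#*/}         # shortest "*/" prefix removed -> usr/userName/config.tar.gz
dir=${file%/*}          # shortest "/*" suffix removed -> /usr/userName
no_last_ext=${file%.*}  # shortest ".*" suffix removed -> /usr/userName/config.tar
no_ext=${file%%.*}      # longest ".*" suffix removed  -> /usr/userName/config

printf '%s\n' "$name" "$rest" "$dir" "$no_last_ext" "$no_ext"
```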
Instead of string manipulation I'd just use
file=`basename "$filename"`
Edit:
Thanks to unwind for some newer syntax for this (which assumes your filename is held in $filename):
file=$(basename "$filename")
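A short sketch of basename and its companion dirname on a made-up path that contains a space, which is why the variable is quoted. The optional second argument to basename strips a trailing suffix.

```shell
# Made-up path; the space would split the argument if "$filename" were unquoted.
filename='/usr/user Name/config.cfg'

file=$(basename "$filename")        # config.cfg
stem=$(basename "$filename" .cfg)   # config
dir=$(dirname "$filename")          # /usr/user Name

printf '%s\n' "$file" "$stem" "$dir"
```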
Most environments have access to perl and I'm more comfortable with that for most string manipulation.
But as mentioned, for something this simple, you can use basename.
I typically use sed with a simple regex, like this:
echo "/usr/userName/config.cfg" | sed -e 's+^.*/++'
result:
>echo "/usr/userName/config.cfg" | sed -e 's+^.*/++'
config.cfg