Perl substitution output drops characters from Bash script input - regex

I have a variable in a bash script with a length of 64 characters
authkey=$(LC_ALL=C tr -cd 'a-zA-Z0-9,;.:_#*+~!#$%&()=?{[]}|><-' < /dev/random | head -c 64)
If I pass the variable to Perl to do a string substitution:
perl -pi -e "s/'AUTH_KEY', 'put your unique phrase here'/'AUTH_KEY', '$authkey'/" test.txt
the length of the output differs depending on which random characters were selected. The output looks like the following (in each pair, the first string is the output in the resulting text file, the second string is the echoed value of the variable in the bash script):
q=dB7oUz59.IDBXI:i>ckW4oy3smX&k:-C.[rIf*9w}H=(N93yiB&nk{fP:y0_
q=dB7oUz59.IDBXI:i>ckW4oy3smX&k$s:-C.[rIf*9w}H=(N93yiB&nk{fP:y0_
5A+BwP~l3~<evp.ciTkMYtvmPjyMrL=):Qj1VaMI(,TSS,ZGMcd.m,4W
5A+BwP~l3~<evp.ciTkMYtvmPjyMrL=):Qj1VaMI(#Dk7UNgs,TSS,ZGMcd.m,4W
dX73}i5G1d;L*J=60WHHe<!61Ji_KJ)T5B~b2bCfaNDjBQr_N]}3HS=;GzAaX<gB
dX73}i5G1d;L*J=60WHHe<!61Ji_KJ)T5B~b2bCfaNDjBQr_N]}3HS=;GzAaX<gB
6Ndn(9+:>(6>*rh?B.m),3POp)>sfm8c1rh9vXr~fzZj;]!)kf3#60=M
6Ndn(9+:>(6>*rh?B.m),3POp)>sfm8c1rh9vXr~fzZj;]#YH!)kf3#$=$$ckt=M
FYMI,K|6WutC&dr-3]6)f(>QU-~{vBX>n!J-zq:kK84T|fZ7UW:{1&qU[nwYZLmC
FYMI,K|6WutC&dr-3]6)f(>QU-~{vBX>n!J-zq:kK84T|fZ7UW:{1&qU[nwYZLmC
5A+BwP~l3~<evp.ciTkMYtvmPjyMrL=):Qj1VaMI(,TSS,ZGMcd.m,4W
5A+BwP~l3~<evp.ciTkMYtvmPjyMrL=):Qj1VaMI(#Dk7UNgs,TSS,ZGMcd.m,4W
v1FR8c8}dZD(QGwOrr%M{FSUw*?h.JGI?Ay4tgRVp~l7C5eAxW<w<;c}emeX#S
v1FR8c8}dZD(QGwOrr%M{FSUw*?h.JGI?Ay4tgRVp#s~l7C5eAxW<w<;c}emeX#S
+MGg0=*NrhJ}.qPkk6v[lc)J.uiW1o?LL5t<HTC#Q-hSeqn%-ke!cwL5tk[e
+MGg$|=*NrhJ}.qPkk6v[lc)J.uiW1o?L$55L5t<HTC#Q-hSeqn%-ke!cwL5tk[e
Each character dropout was caused by either a $ or a # at the beginning of the dropped group of characters. Is there a way to prevent that behaviour? Best regards, Ralf

Using a single quote ' instead of a slash / as the delimiter for the substitution suppresses variable interpolation:
$ foobar=\$bar; perl -p -e "s'foo'$foobar'"
xx
xx
foo
$bar
^C
$
Unfortunately, the single quotes that are matched in the substitution now need escaping:
foobar=\$bar; perl -p -e "s'\'foo'$foobar\''"
x
x
'foo
$bar'
^C
But that seems to get passed through to Perl OK, without having to munge the authkey contents with sed first.

You can escape $ and # before calling perl:
authkey=$(echo -n "$authkey" | sed -re 's/(\$|\#)/\\\1/g')

Related

Find multi-line text & replace it, using regex, in shell script

I am trying to find a pattern of two consecutive lines, where the first line is a fixed string and the second has a part substring I like to replace.
This is to be done in sh or bash on macOS.
If I had a regex tool at hand that would operate on the entire text, this would be easy for me. However, all I find is bash's simple text replacement - which doesn't work with regex, and sed, which is line oriented.
I suspect that I can use sed in a way where it first finds a matching first line, and only then looks to replace the following line if its pattern also matches, but I cannot figure this out.
Or are there other tools present on macOS that would let me do a regex-based search-and-replace over an entire file or a string? Maybe with Python (v2.7 and v3 is installed)?
Here's a sample text and how I like it modified:
keyA
value:474
keyB
value:474 <-- only this shall be replaced (follows "keyB")
keyC
value:474
keyB
value:474
Now, I want to find all occurrences where the first line is "keyB" and the following one is "value:474", and then replace that second line with another value, e.g. "value:888".
As a regex that ignores line separators, I'd write this:
Search: (\bkeyB\n\s*value):474
Replace: $1:888
So, basically, I find the pattern before the 474, and then replace it with the same pattern plus the new number 888, thereby preserving the original indentation (which is variable).
You can use
sed -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}' file
# Or, to replace the contents of the file inline in FreeBSD sed:
sed -i '' -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}' file
Details:
/keyB$/ - finds all lines that end with keyB
n - prints the current pattern space (unless -n is in effect), then replaces it with the next line of input
s/\(.*\):[0-9]*/\1:888/ - find any text up to the last : + zero or more digits capturing that text into Group 1, and replaces with the contents of the group and :888.
The {...} create a block that is executed only once the /keyB$/ condition is met.
See an online sed demo.
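To see the n-based block at work, here is a small run against the sample from the question. The indentation of the value lines is an assumption added here to show that it is preserved, and the input is fed through a pipe rather than from a file.

```shell
# Sample input from the question; the varying indentation is an assumption.
input='keyA
  value:474
keyB
    value:474
keyC
  value:474
keyB
  value:474'

# On a line ending in keyB: print it, load the next line (n), then rewrite
# everything after the last colon on that next line.
result=$(printf '%s\n' "$input" |
  sed -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}')

printf '%s\n' "$result"
```

Only the two value lines that directly follow a keyB line change to value:888, each keeping its own indentation.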
Use a perl one-liner with -0777 to scan over multiple lines:
$ # inline edit:
$ perl -0777 -i -pe 's/(\bkeyB\s*value):\d*/$1:888/g' file.txt
$ # to stdout:
$ cat file.txt | perl -0777 -pe 's/(\bkeyB\s*value):\d*/$1:888/g'
In plain bash:
#!/bin/bash
keypattern='^[[:blank:]]*keyB$'
valpattern='(.*):'
replacement=888
while read -r; do
printf '%s\n' "$REPLY"
if [[ $REPLY =~ $keypattern ]]; then
read -r
if [[ $REPLY =~ $valpattern ]]; then
printf '%s%s\n' "${BASH_REMATCH[0]}" "$replacement"
else
printf '%s\n' "$REPLY"
fi
fi
done < file

Replace the separator between pairs of numbers

I want to replace all strings like [0-9][0-9]-[0-9][0-9] with [0-9][0-9]/[0-9][0-9] using sed.
In other words, I want to replace - with /.
If I have somewhere in my text:
09-36
32-43
54-65
I want this change:
09/36
32/43
54/65
Using GNU sed:
$ echo '09-36 32-43 54-65' | sed -r 's|\<([0-9]{2})-([0-9]{2})\>|\1/\2|g'
09/36 32/43 54/65
-r turns on extended regular expressions, in which:
the ( ) { } chars. don't require \-escaping.
\< and \> can be used to match only at word boundaries (in GNU sed these also work without -r; if the expression should only match full lines, use ^ and $ instead, and omit the g option)
| is used as an alternative regex delimiter so that / can be used without \-escaping.
A BSD/macOS sed solution would look slightly different:
echo '09-36 32-43 54-65' | sed -E 's|[[:<:]]([0-9]{2})-([0-9]{2})[[:>:]]|\1/\2|g'
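The effect of the word-boundary assertions is easiest to see on input that contains longer digit runs. This is a small sketch assuming GNU sed (for \< and \>); the sample string is made up for illustration.

```shell
sample='09-36 444-44 55-555'

# With word boundaries: only the stand-alone two-digit pair is rewritten.
result_bounded=$(printf '%s\n' "$sample" |
  sed -E 's|\<([0-9]{2})-([0-9]{2})\>|\1/\2|g')

# Without them: two-digit pairs inside longer numbers are rewritten too.
result_plain=$(printf '%s\n' "$sample" |
  sed -E 's|([0-9]{2})-([0-9]{2})|\1/\2|g')

printf '%s\n' "$result_bounded"   # 09/36 444-44 55-555
printf '%s\n' "$result_plain"     # 09/36 444/44 55/555
```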
sed -e 's/\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1\/\2/g'
Might not be the most elegant version, but works for me. The gazillion backslashes make this rather unreadable in my opinion. You might improve the readability by not using / to separate the pattern and the replacement maybe?
perl -C -npe 's/(?<!\d)(\d\d)-(\d\d)(?!\d)/\1\/\2/g' file
Input
维基 1-11 22-33 444-44 55-555 66-66百科
77-77
8 88-88
Output
维基 1-11 22/33 444-44 55-555 66/66百科
77/77
8 88/88
In the command above
-C enables Unicode;
-n causes Perl to process the script for each input line without printing anything automatically;
-p does the same but also prints each line after the script has run (when both are given, -p overrides -n, so -npe behaves like -pe);
-e accepts a Perl expression (here, a substitution).
In this mode (-npe), Perl works just like sed. The script replaces the - between each pair of two-digit numbers with a slash.
(?<!\d) and (?!\d) are negative lookaround expressions.
To edit the file in place use -i option: perl -C -i.backup -npe ....
If the input is not a file, you can pass the input to Perl via pipe, e.g.:
echo '维基 1-11 22-33 444-44 55-555 66-66百科' | \
perl -C -npe 's/(?<!\d)(\d\d)-(\d\d)(?!\d)/\1\/\2/g'

Is it possible to escape regex metacharacters reliably with sed

I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string so that it can be used in a subsequent sed command. Like this:
#!/bin/bash
# Trying to replace one regex by another in an input file with sed
search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"
# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")
# Use it in a sed command
sed "s/$search/$replace/" input
I know that there are better tools to work with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)
I have tried a lot of things but anytime I could find an input which broke my attempt. I thought keeping it abstract as script to escape would not lead anybody into the wrong direction.
Btw, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate them.
Note:
If you're looking for prepackaged functionality based on the techniques discussed in this answer:
bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
@EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
Ed's answer now has an improved version of the sed command used below, corrected in calestyo's answer, which is needed if you want to escape string literals for potential use with other regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means: instead of the
sed 's/[^^]/[&]/g; s/\^/\\^/g' command used below, you must use
sed 's/[^^\]/[&]/g; s/[\^]/\\&/g;'
All snippets below assume bash as the shell (POSIX-compliant reformulations are possible):
SINGLE-line Solutions
Escaping a string literal for use as a regex in sed:
To give credit where credit is due: I found the regex used below in this answer.
Assuming that the search string is a single-line string:
search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3' # sample input containing metachars.
searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.
sed -n "s/$searchEscaped/foo/p" <<<"$search" # Echoes 'foo'
Every character except ^ is placed in its own character set [...] expression to treat it as a literal.
Note that ^ is the one char. you cannot represent as [^], because it has special meaning in that location (negation).
Then, ^ chars. are escaped as \^.
Note that you cannot just escape every char by putting a \ in front of it because that can turn a literal char into a metachar, e.g. \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of a RE interval like \{1,3\}, etc.
The approach is robust, but not efficient.
The robustness comes from not trying to anticipate all special regex characters - which will vary across regex dialects - but to focus on only 2 features shared by all regex dialects:
the ability to specify literal characters inside a character set.
the ability to escape a literal ^ as \^
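A short contrast may help show why the bracketing matters. The sketch below uses a made-up literal, a.c, and checks it against input that a raw regex would match by accident.

```shell
literal='a.c'

# Wrap every character except ^ in its own bracket expression.
escaped=$(printf '%s\n' "$literal" | sed 's/[^^]/[&]/g; s/\^/\\^/g')
printf '%s\n' "$escaped"    # [a][.][c]

# Used raw, the . is a metacharacter and matches the b in abc.
raw_hit=$(printf 'abc\n' | sed -n "s/$literal/X/p")

# Escaped, the pattern no longer matches abc, only the literal text a.c.
esc_hit=$(printf 'abc\n' | sed -n "s/$escaped/X/p")
lit_hit=$(printf 'a.c\n' | sed -n "s/$escaped/X/p")

printf 'raw=%s escaped=%s literal=%s\n' "$raw_hit" "$esc_hit" "$lit_hit"
```

Here raw_hit and lit_hit are X, while esc_hit stays empty.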
Escaping a string literal for use as the replacement string in sed's s/// command:
The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer to either the entire string matched by the regex (&) or specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.
Assuming that the replacement string is a single-line string:
replace='Laurel & Hardy; PS\2' # sample input containing metachars.
replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it
sed -n "s/.*/$replaceEscaped/p" <<<"foo" # Echoes $replace as-is
MULTI-line Solutions
Escaping a MULTI-LINE string literal for use as a regex in sed:
Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.
# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'
# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n') #'
# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"
The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
$!a\'$'\n''\\n' appends string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<)
tr -d '\n' then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop, therefore leaving subsequent commands to operate on all input lines at once.
If you're using GNU sed (only), you can use its -z option to simplify reading all input lines at once:
sed -z "s/$searchEscaped/foo/" <<<"$search"
Escaping a MULTI-LINE string literal for use as the replacement string in sed's s/// command:
# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'
# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}
# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"
Newlines in the input string must be retained as actual newlines, but \-escaped.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop.
s/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
s/\n/\\&/g then \-prefixes all actual newlines.
IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.
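The difference between a command substitution and read -d '' is easy to verify in isolation. This is a minimal bash sketch: the variable names and the sample value are made up, and printf stands in for the sed command whose output is being captured.

```shell
value='line1
'   # ends with one newline, so 6 characters in total

# Command substitution silently strips trailing newlines.
viaSubst=$(printf '%s' "$value")

# read -d '' reads up to a NUL or end of input and keeps the newline.
# It returns non-zero at end of input, hence the guard for set -e.
IFS= read -r -d '' viaRead < <(printf '%s' "$value") || true

printf '%s %s\n' "${#viaSubst}" "${#viaRead}"   # 5 6
```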
bash functions based on the above (for sed):
quoteRe() quotes (escapes) for use in a regex
quoteSubst() quotes for use in the substitution string of a s/// call.
both handle multi-line input correctly
Note that because sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedValue < <(quoteSubst "$value")
# SYNOPSIS
# quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# SYNOPSIS
# quoteSubst <text>
quoteSubst() {
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
printf %s "${REPLY%$'\n'}"
}
Example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.
# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"
Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.
perl solution:
Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.
# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"
Note the use of -0777 to read all input at once, so that the multi-line substitution works.
The -s option allows placing -<var>=<val>-style Perl variable definitions following -- after the script, before any filename operands.
Building upon @mklement0's answer in this thread, the following tool will replace any single-line string (as opposed to regexp) with any other single-line string using sed and bash:
$ cat sedstr
#!/bin/bash
old="$1"
new="$2"
file="${3:--}"
escOld=$(sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g' <<< "$old")
escNew=$(sed 's/[&/\]/\\&/g' <<< "$new")
sed "s/$escOld/$escNew/g" "$file"
To illustrate the need for this tool, consider trying to replace a.*/b{2,}\nc with d&e\1f by calling sed directly:
$ cat file
a.*/b{2,}\nc
axx/bb\nc
$ sed 's/a.*/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 16: unknown option to `s'
$ sed 's/a.*\/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 23: invalid reference \1 on `s' command's RHS
$ sed 's/a.*\/b{2,}\nc/d&e\\1f/' file
a.*/b{2,}\nc
axx/bb\nc
# .... and so on, peeling the onion ad nauseam until:
$ sed 's/a\.\*\/b{2,}\\nc/d\&e\\1f/' file
d&e\1f
axx/bb\nc
or use the above tool:
$ sedstr 'a.*/b{2,}\nc' 'd&e\1f' file
d&e\1f
axx/bb\nc
The reason this is useful is that it can be easily augmented to use word-delimiters to replace words if necessary, e.g. in GNU sed syntax:
sed "s/\<$escOld\>/$escNew/g" "$file"
whereas the tools that actually operate on strings (e.g. awk's index()) cannot use word-delimiters.
NOTE: the reason to not wrap \ in a bracket expression is that if you were using a tool that accepts [\]] as a literal ] inside a bracket expression (e.g. perl and most awk implementations) to do the actual final substitution (i.e. instead of sed "s/$escOld/$escNew/g") then you couldn't use the approach of:
sed 's/[^^]/[&]/g; s/\^/\\^/g'
to escape \ by enclosing it in [] because then \x would become [\][x] which means \ or ] or [ or x. Instead you'd need:
sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
So while [\] is probably OK for all current sed implementations, we know that \\ will work for all sed, awk, perl, etc. implementations and so use that form of escaping.
It should be noted that the regular expression used in some answers above among this and that one:
's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
seems to be wrong:
Doing first s/\^/\\^/g followed by s/\\/\\\\/g is an error, as any ^ escaped first to \^ will then have its \ escaped again.
A better way seems to be: 's/[^\^]/[&]/g; s/[\^]/\\&/g;'.
[^^\\] with sed (BRE/ERE) should be just [^\^] (or [^^\]). \ has no special meaning inside a bracket expression and needs not to be quoted.
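The double-escaping problem can be reproduced with a literal that contains both ^ and \. The sketch below assumes GNU sed; the sample literal x^y\z is made up for illustration.

```shell
literal='x^y\z'

# Three-step form: the \ added in front of ^ is doubled by the last step.
buggy=$(printf '%s\n' "$literal" |
  sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g')

# Two-step form: ^ and \ are each escaped exactly once.
fixed=$(printf '%s\n' "$literal" |
  sed 's/[^^\]/[&]/g; s/[\^]/\\&/g')

printf '%s\n' "$buggy"   # [x]\\^[y]\\[z]
printf '%s\n' "$fixed"   # [x]\^[y]\\[z]

# Only the two-step result still matches the original literal.
buggy_hit=$(printf '%s\n' "$literal" | sed -n "s/$buggy/OK/p")
fixed_hit=$(printf '%s\n' "$literal" | sed -n "s/$fixed/OK/p")
printf 'buggy=%s fixed=%s\n' "$buggy_hit" "$fixed_hit"
```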
Bash parameter expansion can be used to escape a string for use as a Sed replacement string:
# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'
# Escape it for use as a Sed replacement string.
: "${replace//\\/\\\\}"
: "${_//&/\\\&}"
: "${_//\//\\\/}"
: "${_//$'\n'/\\$'\n'}"
replaceEscaped=$_
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''
In bash 5.2+, it can be simplified further:
# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'
# Escape it for use as a Sed replacement string.
shopt -s extglob
shopt -s patsub_replacement # An & in the replacement will expand to what matched. bash 5.2+
: "${replace//@(&|\\|\/|$'\n')/\\&}"
replaceEscaped=$_
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''
Encapsulate it in a bash function:
##
# escape_replacement -v var replacement
#
# Escape special characters in _replacement_ so that it can be
# used as the replacement part in a sed substitute command.
# Store the result in _var_.
escape_replacement() {
if ! [[ $# = 3 && $1 = '-v' ]]; then
echo "escape_replacement: invalid usage" >&2
echo "escape_replacement: usage: escape_replacement -v var replacement" >&2
return 1
fi
local -n var=$2 # nameref (requires Bash 4.3+)
# We use the : command (true builtin) as a dummy command as we
# trigger a sequence of parameter expansions
# We exploit that the $_ variable (last argument to the previous command
# after expansion) contains the result of the previous parameter expansion
: "${3//\\/\\\\}" # Backslash-escape any existing backslashes
: "${_//&/\\\&}" # Backslash-escape &
: "${_//\//\\\/}" # Backslash-escape the delimiter (we assume /)
: "${_//$'\n'/\\$'\n'}" # Backslash-escape newline
var=$_ # Assign to the nameref
# To support Bash older than 4.3, the following can be used instead of nameref
#eval "$2=\$_" # Use eval instead of nameref https://mywiki.wooledge.org/BashFAQ/006
}
# Test the function
# =================
# Define a sample multi-line literal. Include a trailing newline to test corner case
replace='a&b;c\1
d/e
'
escape_replacement -v replaceEscaped "$replace"
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''

why doesn't this Perl capture work

I expected this to capture and print just the group defined in parens, but instead it prints the whole line. How can I capture and print just the group in parens?
echo "abcdef" | perl -ne "print $1 if /(cd)/ "
What I want this to print: cd
What it actually prints: abcdef
How to fix?
In the perl command, you have to use single quotes or protect the variables:
echo "abcdef" | perl -ne "print \$1 if /(cd)/"
or
echo "abcdef" | perl -ne 'print $1 if /(cd)/'
In double quotes, the shell expands $1.
The instant fix to your question is to change your double quotes to single quotes, like this:
$ echo abcdef | perl -ne 'print $1 if /(cd)/'
cd
With double quotes, the shell environment interprets your unprotected variable $1, which in your environment apparently evaluates to an empty string. So perl only receives the command print if /(cd)/ which is an implied command print $_ if /(cd)/ which prints the entire line.
You can also use a protected variable like this:
$ echo abcdef | perl -ne "print \$1 if /(cd)/"
cd
Note that matches which use delimiters other than / must begin with the m keyword rather than using the shorthand form. This does not matter in your case, but it is worth being aware of when working with matches; e.g., m|/| matches a / character using the pipe as the delimiter for the regular expression.

How to remove invalid characters from an xml file using sed or Perl

I want to get rid of all invalid characters, for example the hexadecimal value 0x1A, from an XML file using sed.
What is the regex and the command line?
EDIT
Added Perl tag hoping to get more responses. I prefer a one-liner solution.
EDIT
These are the valid XML characters
x9 | xA | xD | [x20-xD7FF] | [xE000-xFFFD] | [x10000-x10FFFF]
Assuming UTF-8 XML documents:
perl -CSDA -pe'
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml
If you want to encode the bad bytes instead,
perl -CSDA -pe'
s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
"&#".ord($1).";"
/xeg;
' file.xml > file_fixed.xml
You can call it a few different ways:
perl -CSDA -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml # Inplace with backup
perl -CSDA -i -pe'...' file.xml # Inplace without backup
The tr command would be simpler. So, try something like:
cat <filename> | tr -d '\032' > <newfilename>
Note that ascii character '0x1a' has the octal value '032', so we use that instead with tr. Not sure if tr likes hex.
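A quick way to try the tr approach without touching a file is shown below; the sample string is made up. Note that it removes only the single byte 0x1A, not the other characters that are invalid in XML.

```shell
# printf turns the octal escape \032 into the raw 0x1A byte.
cleaned=$(printf '<a>bad\032char</a>\n' | tr -d '\032')

printf '%s\n' "$cleaned"   # <a>badchar</a>
```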
Try:
perl -pi -e 's/[^\x9\xA\xD\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}]//g' file.xml
There is actually a way to do this with sed, like so:
cat input_file | LANG=C sed -E \
-e 's/.*/& /g' \
-e 's/(('\
'[\x9\xa\xd\x20-\x7f]|'\
'[\xc0-\xdf][\x80-\xbf]|'\
'[\xe0-\xec][\x80-\xbf][\x80-\xbf]|'\
'[\xed][\x80-\x9f][\x80-\xbf]|'\
'[\xee-\xef][\x80-\xbf][\x80-\xbf]|'\
'[\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf]'\
')*)./\1?/g' \
-e 's/(.*)\?/\1/g' \
-e 's|]]>|]]>]]<![CDATA[>|g' > output_file
This works in four steps:
Add a single whitespace character to the end of every line.
Replace every sequence of legal characters followed by any character
with the same sequence of legal characters followed by a question mark
character (instead of the any).
Note that in a line of only legal characters, the '.' matches the last
character in the line, which is why we added a space in step 1.
Remove the last character in the line, which we expect to be a question mark.
Replace the string ']]>' with ']]>]]<![CDATA[>'.
The LANG=C env variable is set to prevent sed from doing charset conversion itself - it should treat every character as 8-bit ascii.