How to use sed to delete Unicode characters in some range?

I want to remove Unicode characters in some range, e.g.:
echo "abcＡＢＣ123" | sed 's/[\uff21-\uff3b]//g'
I expect "abc123", but get:
sed: -e expression #1, char 20: Invalid range end
Or, using:
echo "abcＡＢＣ123" | sed 's/[Ａ-Ｚ]//g'
I get:
sed: -e expression #1, char 14: Invalid collation character

Unicode support in sed is not well defined. You may be better off using command line perl:
echo "abcＡＢＣ123" | perl -CS -pe 's/[\x{FF21}-\x{FF3B}]+//g'
abc123
The -CS flag matters here: it makes perl treat STDIN, STDOUT, and STDERR as UTF-8.

Not sure why sed is not working, but you can use tr instead
$ echo 'abcABC123' | tr -d 'A-Z'
abc123
From man tr
tr - translate or delete characters
-d, --delete
delete characters in SET1, do not translate

Related

sed not performing expected substitution

I have a bash variable, some file path (with spaces) and filename, e.g:
$ echo $tmp
/home/xyz/some/path/with spaces/AlbumArt_{random-number-sequence}_Large.jpg
When I attempt to identify the filename part with grep, e.g:
$ echo "$tmp" | egrep 'AlbumArt.*Large.jpe?g$'
/home/xyz/some/path/with spaces/**AlbumArt_{random-number-sequence}_Large.jpg**
The filename part appears to be identified correctly, but when I attempt to convert this to a sed substitution expression, e.g:
$ echo "$tmp" | sed 's#AlbumArt.*Large.jpe?g$#NewString#'
/home/xyz/some/path/with spaces/AlbumArt_{random-number-sequence}_Large.jpg
The expected substitution isn't happening. Thanks in advance for any help.
egrep is equivalent to grep -E, which enables extended regular expressions (see https://en.wikipedia.org/wiki/Regular_expression#Standards). By default sed uses basic regular expressions, where an unescaped ? is a literal character, so jpe?g never matches.
You just need to pass the same option to sed:
echo "$tmp" | sed -E 's#AlbumArt.*Large.jpe?g$#NewString#'
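A short sketch of the difference, assuming a sed that accepts -E (GNU and BSD sed both do). The sample path uses 12345 as a hypothetical stand-in for the random number sequence, and the dot before the extension is escaped here, which the original pattern did not do. The first command stays in basic regular expressions and spells the optional "e" as a \{0,1\} interval; the second uses -E.

```shell
# Hypothetical sample path; 12345 stands in for the random number sequence.
tmp='/home/xyz/some/path/with spaces/AlbumArt_12345_Large.jpg'

# BRE: an optional character is written as an interval, \{0,1\}.
bre=$(printf '%s\n' "$tmp" | sed 's#AlbumArt.*Large\.jpe\{0,1\}g$#NewString#')

# ERE: with -E, ? means "zero or one of the previous item".
ere=$(printf '%s\n' "$tmp" | sed -E 's#AlbumArt.*Large\.jpe?g$#NewString#')

echo "$bre"
echo "$ere"
```

Both commands should print the directory part followed by NewString.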

Replace string with another string based on backreference with sed

I'm trying to replace a predefined string %c#, where # is some number, with another string. The catch is that the other string must be truncated to # characters.
Ideally this set of commands would work:
FORMAT="%c10"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
echo $FORMAT | sed "s/%c\([0-9]\+\)/${LAST_COMMIT:0:\1}/g"
but clearly there is a syntax error on the \1. You can replace it with a number to see what I'm trying to get as output.
I'm open to using some other program other than sed to achieve this but ideally it should be programs that are pretty much native to most linux installations.
Thanks!
This is my idea.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c//')
This gets the number with sed, then takes that many leading characters with head.
EDIT1
This might be better.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c\([0-9]\+\)/\1/')
EDIT2
I wrote a script because the one-liner is hard to follow. Please try this.
$ cat sample.sh
#!/bin/bash
FORMAT="%b-%t-%c10-%c5"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
## List numbers
lengths=$(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g")
## Substitute %cXX with the first XX characters of LAST_COMMIT
for n in ${lengths}
do
    to_str=${LAST_COMMIT:0:${n}}
    FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/")
done
## Print result
echo ${FORMAT}
This is the result.
$ ./sample.sh
%b-%t-5189e42b14-5189e
The same thing as a one-liner (same contents, but long and hard to read):
for n in $(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g"); do to_str=${LAST_COMMIT:0:${n}}; FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/"); done; echo ${FORMAT}
The value of $LAST_COMMIT gets interpolated before sed runs, so there is no backreference to refer back to yet. There is an /e extension in GNU sed which would support something like this, but I would simply use a slightly more capable tool.
perl -e '$fmt = shift; $fmt=~ s/%c(\d+)/%.$1s/g; printf("$fmt\n", @ARGV)' '%c10' "$LAST_COMMIT"
Of course, if you can let go of your own ad-hoc format string specifier, and switch to a printf-compatible format string altogether, just use the printf shell command straight off.
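A minimal sketch of the printf idea, assuming FORMAT holds exactly one %cN specifier and nothing else. The variable names n and short are illustrative. printf's precision on %s truncates the string, so %.10s yields the first 10 characters.

```shell
FORMAT='%c10'
LAST_COMMIT='5189e42b14797b1e36ffb7fc5657c7eea08f1c0f'

# Pull the number out of %cN (assumes FORMAT is exactly "%c" plus digits).
n=$(printf '%s\n' "$FORMAT" | sed 's/^%c\([0-9][0-9]*\)$/\1/')

# A precision on %s truncates: "%.10s" keeps the first 10 characters.
short=$(printf "%.${n}s" "$LAST_COMMIT")

echo "$short"
```

This avoids the backreference problem entirely, because the number is extracted first and only then used by the shell.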
length=$(echo $FORMAT | sed "s/%c\([0-9]\+\)/\1/g")
echo "${LAST_COMMIT:0:$length}"

Sed replace asterisk symbols

I'm trying to replace a series of asterisk symbols in a text file with -999.9 using sed. However, I can't figure out how to properly escape the wildcard symbol.
e.g.
$ echo "2006.0,1.0,************,-5.0" | sed 's/************/-999.9/g'
sed: 1: "s/************/-999.9/g": RE error: repetition-operator operand invalid
Doesn't work. And
$ echo "2006.0,1.0,************,-5.0" | sed 's/[************]/-999.9/g'
2006.0,1.0,-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9-999.9,-5.0
puts a -999.9 for every * which isn't what I intended either.
Thanks!
Use this:
echo "2006.0,1.0,************,-5.0" | sed 's/[*]\+/-999.9/g'
Test:
$ echo "2006.0,1.0,************,-5.0" | sed 's/[*]\+/-999.9/g'
2006.0,1.0,-999.9,-5.0
Any of these (and more) is a regexp that will modify that line as you want:
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\**/-999.9/g'
2006.0,1.0,-999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\+/-999.9/g'
2006.0,1.0,-999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed -r 's/\*+/-999.9/g'
2006.0,1.0,-999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\{12\}/-999.9/g'
2006.0,1.0,-999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed -r 's/\*{12}/-999.9/g'
2006.0,1.0,-999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed 's/\*\{1,\}/-999.9/g'
2006.0,1.0,-999.9,-5.0
$ echo "2006.0,1.0,************,-5.0" | sed -r 's/\*{1,}/-999.9/g'
2006.0,1.0,-999.9,-5.0
sed operates on regular expressions, not strings, so you need to learn regular expression syntax if you're going to use sed, in particular the difference between BREs (which sed uses by default), EREs (which some seds can be told to use instead, e.g. with -r or -E), and PCREs (which sed never uses but some other tools and "regexp checkers" do). Of the commands above, only those written with \*\**, \{12\}, and \{1,\} are POSIX BREs that should work in any sed; \+ is a GNU extension, and the -r forms need a sed with ERE support. Google is your friend.
* is a regex symbol that needs to be escaped.
You can even use BASH string replacement:
s="2006.0,1.0,************,-5.0"
echo "${s/\**,/-999.9,}"
2006.0,1.0,-999.9,-5.0
Using sed:
sed 's/\*\+/-999.9/g' <<< "$s"
2006.0,1.0,-999.9,-5.0
Yes, * is a special metacharacter that repeats the previous token zero or more times. Escape * in order to match literal * characters.
sed 's/\*\*\*\*\*\*\*\*\*\*\*\*/-999.9/g'
When this possibility was introduced into gawk I have no idea!
gawk -F, '{sub(/************/,"-999.9",$3)}1' OFS=, file
2006.0,1.0,-999.9,-5.0
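The gawk one-liner above relies on gawk accepting a leading * as a literal. A sketch that avoids that question puts the asterisk in a bracket expression and replaces any comma-separated field consisting only of asterisks, whatever its width. The variable names are illustrative.

```shell
line='2006.0,1.0,************,-5.0'

# Inside [...] the asterisk is always literal, so no escaping is needed.
# Any field made up solely of asterisks is replaced with -999.9.
result=$(printf '%s\n' "$line" | awk -F, -v OFS=, '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[*]+$/) $i = "-999.9"
    print
}')

echo "$result"
```

Working per field means a stray asterisk inside an otherwise valid value is left alone, which may or may not be what you want.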

(GNU)Sed: how to replace any character from nth character to nth+10?

I need to replace the 10th through 20th characters in a string that looks like this:
123456789012345678901234567890
So far I've tried:
a)
Works for the 10th character ONLY:
echo "123456789012345678901234567890" | sed 's/./X/10'
b)
Doesn't work on the range:
echo "123456789012345678901234567890" | sed 's/./X/10,20'
echo "123456789012345678901234567890" | sed 's/./X/10\,20'
echo "123456789012345678901234567890" | sed 's/./X/\{10,20\}'
echo "123456789012345678901234567890" | sed 's/./X/\{10\,20\}'
Does not work and I get error
unknown option to `s'
So the question is: how do I make this work?
echo "123456789012345678901234567890" | sed 's/./X/10,20'
Try:
$ sed -r "s/^(.{9})(.{11})/\1XXXXXXXXXXX/" <<< 123456789012345678901234567890
123456789XXXXXXXXXXX1234567890
This is an awkward problem for sed; the only solution I could find is this one, which replaces characters 11 through 20:
$ sed 's/^\(.\{10\}\)\(.\{10\}\)/\1XXXXXXXXXX/' <<< 123456789012345678901234567890
1234567890XXXXXXXXXX1234567890
With awk it looks nicer:
$ awk 'BEGIN{FS=OFS=""} {for (i=10;i<=20;i++) $i="X"} {print}' <<< 123456789012345678901234567890
123456789XXXXXXXXXXX1234567890
You can do it with bash parameter substitution like this:
#!/bin/bash
s="123456789012345678901234567890"
l=${s:0:9}    # Extract left part (characters 1-9)
m=${s:9:11}   # Extract middle part (characters 10-20)
r=${s:20}     # Extract right part (characters 21 onward)
# Diddle with middle part to your heart's content and re-assemble "$l$m$r" when done
m=$(sed 's/./X/g' <<<"$m")
echo "$l$m$r"
See here for more explanation and examples.
Or, you can do this:
1. Transform the row of characters into a column so each is on its own line.
2. Apply your edits to lines 10 through 20 (as opposed to characters 10 through 20).
3. Transform the column of characters back into a row (by deleting linefeeds).
This is shown in the one-liner below:
$ echo "123456789012345678901234567890" | sed "s/\(.\)/\1\n/g" | sed "10,20s/./X/" | tr -d "\n"
I know, that it looks ugly, but:
echo "123456789012345678901234567890" | \
sed 's/^\(.\{10\}\).\{10\}\(.*\)/\1XXXXXXXXXX\2/'
Without placing multiple X in sed command:
echo "123456789012345678901234567890" | sed -r 's/^(.{9})(.{11})(.*)$/\1\n\2\n\3/' | sed '2s/./X/g' | sed 'N;N;s/\n//g'
To replace the 10th to 20th characters, inclusive, try:
echo 123456789012345678901234567890 | sed 's/\(.\{9\}\).\{11\}/\1XXXXXXXXXXX/'
123456789XXXXXXXXXXX1234567890
With GNU sed, you can use the -r switch to remove most of the backslashes:
echo 123456789012345678901234567890 | sed -r 's/(.{9}).{11}/\1XXXXXXXXXXX/'
Or the naive approach also works here:
echo 123456789012345678901234567890 | sed 's/\(.........\).........../\1XXXXXXXXXXX/'
This might work for you (GNU sed):
sed ':a;/.\{9\}X\{11\}/!s/\(.\{9\}X*\)./\1X/;ta' file
or with a bit of syntactic sugar:
sed -r ':a;/.{9}X{11}/!s/(.{9}X*)./\1X/;ta' file
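For comparison, here is a sketch that takes the start and end positions as parameters, so the run of X characters does not have to be typed out. It uses awk's substr rather than sed, and mask_range is a hypothetical helper name, not something from the answers above.

```shell
# mask_range STRING FIRST LAST
# Replaces characters FIRST..LAST (1-based, inclusive) with X.
mask_range() {
    printf '%s\n' "$1" | awk -v a="$2" -v b="$3" '{
        mid = substr($0, a, b - a + 1)   # the span to mask
        gsub(/./, "X", mid)              # turn every character into X
        print substr($0, 1, a - 1) mid substr($0, b + 1)
    }'
}

masked=$(mask_range 123456789012345678901234567890 10 20)
echo "$masked"
```

Characters 10 through 20 are eleven characters, so the output keeps its original length of 30.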

How to remove invalid characters from an xml file using sed or Perl

I want to get rid of all invalid characters (for example, hexadecimal value 0x1A) in an XML file using sed.
What is the regex and the command line?
EDIT
Added Perl tag hoping to get more responses. I prefer a one-liner solution.
EDIT
These are the valid XML characters
x9 | xA | xD | [x20-xD7FF] | [xE000-xFFFD] | [x10000-x10FFFF]
Assuming UTF-8 XML documents:
perl -CSDA -pe'
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml
If you want to encode the bad bytes instead,
perl -CSDA -pe'
s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
"&#".ord($1).";"
/xeg;
' file.xml > file_fixed.xml
You can call it a few different ways:
perl -CSDA -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml # Inplace with backup
perl -CSDA -i -pe'...' file.xml # Inplace without backup
The tr command would be simpler. So, try something like:
cat <filename> | tr -d '\032' > <newfilename>
Note that ascii character '0x1a' has the octal value '032', so we use that instead with tr. Not sure if tr likes hex.
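The same tr idea can be widened with octal ranges. This sketch deletes every control byte below 0x20 except tab, line feed, and carriage return, which are the only ones in that range listed as valid above. It works on bytes, which is safe for UTF-8 input because bytes below 0x80 never occur inside a multibyte sequence; it does not touch invalid characters above 0x7F. The sample input is made up.

```shell
# Delete 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F; keep tab (011), LF (012), CR (015).
clean=$(printf 'a\001b\032c\td\n' | tr -d '\000-\010\013\014\016-\037')

printf '%s\n' "$clean"
```

Here the 0x01 and 0x1A bytes are dropped while the tab between "c" and "d" is kept.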
Try:
perl -pi -e 's/[^\x9\xA\xD\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}]//g' file.xml
There is actually a way to do this with sed, like so:
cat input_file | LANG=C sed -E \
-e 's/.*/& /g' \
-e 's/(('\
'[\x9\xa\xd\x20-\x7f]|'\
'[\xc0-\xdf][\x80-\xbf]|'\
'[\xe0-\xec][\x80-\xbf][\x80-\xbf]|'\
'[\xed][\x80-\x9f][\x80-\xbf]|'\
'[\xee-\xef][\x80-\xbf][\x80-\xbf]|'\
'[\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf]'\
')*)./\1?/g' \
-e 's/(.*)\?/\1/g' \
-e 's|]]>|]]>]]<![CDATA[>|g' > output_file
This works in four steps:
1. Add a single whitespace character to the end of every line.
2. Replace every sequence of legal characters followed by any character with the same sequence of legal characters followed by a question mark character (instead of the any). Note that in a line of only legal characters, the '.' matches the last character in the line, which is why we added a space in step 1.
3. Remove the last character in the line, which we expect to be a question mark.
4. Replace the string ']]>' with ']]>]]<![CDATA[>'.
The LANG=C env variable is set to prevent sed from doing charset conversion itself - it should treat every character as 8-bit ascii.