Remove non UTF-8 characters from an XML file, using sed

A given XML file with UTF-8 declared as the encoding does not pass xmllint. With the assumption that a non-UTF-8 character is causing the error, the following sed command is being run against the file:
sed 's/[^\x00-\x7F]//g' file.xml
Either the command is wrong, or non-UTF-8 characters are not the problem, as xmllint still fails after running the sed. The first question is: does the sed regex appear correct?
= = = = =
Here is the output of xmllint:
$ xmllint file.xml
file.xml:35533: parser error : CData section not finished
<p class="imgcont"><img alt="Diets of 2013" src="h
<b>What You Eat: </b>Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35533: parser error : PCDATA invalid Char value 31
<b>What You Eat: </b>Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35588: parser error : Sequence ']]>' not allowed in content
as.people.com/2013/11/07/kerry-washington-pregnant-diet-green-smoothie-recipe/"]
^
= = = = =
UPDATE: In TextMate, on viewing the file, there is a character that is being shown as <US>. If that character is manually deleted from the file, the file then passes xmllint.

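It is somewhat hard to use sed to remove specific code points from the Unicode table. Note also that the character xmllint complains about (Char value 31, shown as <US> in TextMate) is the ASCII control character U+001F, Unit Separator. It lies inside the range \x00-\x7F, so the sed command in the question never removes it.
If you need to target specific Unicode categories of characters, it makes more sense to work with Perl.
perl -CSD -i -pe 's/(?![\t\n\r])\p{Cc}//g' file
will remove all control characters but TAB, CR and LF. The -i option has to come before -pe, because -e takes the argument that follows it as the program text. -CSD makes Perl decode the file as UTF-8; without it, the bytes 0x80-0x9F that occur inside multibyte UTF-8 sequences also match \p{Cc} and would be stripped.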
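Since the offending character here is an ASCII control character, a byte-oriented tool is enough for this particular file. The following is a minimal sketch, not the answer's method: it uses tr to delete the C0 control bytes that XML 1.0 does not allow, and leaves bytes of 0x80 and above untouched so multibyte UTF-8 text survives. The function name strip_c0_controls is hypothetical.

```shell
# Delete C0 control bytes that are not legal in XML 1.0.
# Removed: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F (written as octal ranges).
# Kept:    TAB (0x09), LF (0x0A), CR (0x0D) and every byte >= 0x20.
strip_c0_controls() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}
```

It reads standard input and writes standard output, so a possible use is strip_c0_controls <file.xml >clean.xml followed by xmllint --noout clean.xml.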

Related

Need to extract entry names from file to populate list or variable

I have this config file with entry names encased in brackets: []. I need to extract each entry name into a list or variable to be used in a for loop. Still new and fumbling with some commands. I have a feeling grep is my answer but I don't know where to start. Any help would be appreciated.
[dropbox]
type = dropbox
scope = dropbox
token = {"access_token":"my_token"}
[drive2]
type = drive
scope = drive
token = {"access_token":"other_token"}

You can use sed:
sed -rn 's/(^\[)(.*)(\]$)/\2/p' configfile
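Enable extended regular expressions with -r and suppress automatic printing with -n. The pattern splits each line of the file (configfile) into three groups: [ at the start of the line, then anything (.*), then ] at the end of the line. The substitution replaces the whole line with just the second group, and the p flag prints only the lines where a substitution happened.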

You can use GNU grep:
printf '[dropbox]\ntype = dropbox\n' | grep -Po '\[\K[^\]]*'
# Prints: dropbox
Here, grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only, 1 match/line, not the entire lines.
\[\K[^\]]* : literal [, escaped, which is followed by the special character \K that tells the regex engine to pretend that the match starts at that point, which is followed by any non-] character, repeated 0 or more times ([^\]]*).
SEE ALSO:
grep manual
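To use the names in a for loop, as the question asks, the extraction can be wrapped in a small function. This is a sketch under assumptions: list_entries is a hypothetical name, the pattern is the basic-regex equivalent of the sed answer above, and the loop relies on word splitting, so entry names must not contain whitespace.

```shell
# Print the text between [ and ] for every line that is exactly "[name]".
list_entries() {
    sed -n 's/^\[\(.*\)]$/\1/p' "$1"
}

# Hypothetical usage: loop over the entry names in ./configfile, if present.
if [ -f configfile ]; then
    for entry in $(list_entries configfile); do
        printf 'entry: %s\n' "$entry"
    done
fi
```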

How to replace all unicode characters except for Spanish ones?

I am trying to remove all Unicode characters from a file except for the Spanish characters.
Matching the different vowels has not been any issue and áéíóúÁÉÍÓÚ are not replaced using the following regex (but all other Unicode appears to be replaced):
perl -pe 's/[^áéíóúÁÉÍÓÚ[:ascii:]]//g;' filename
But when I add the inverted question mark ¿ or the inverted exclamation mark ¡ to the regex, other Unicode characters that I would like removed are also kept:
perl -pe 's/[^áéíóúÁÉÍÓÚ¡¿[:ascii:]]//g;' filename does not remove the following (some are not printable):
³ � �

Am I missing something obvious here? I am also open to other ways of doing this on the terminal.

You have a UTF-8 encoded file and work with Unicode characters, so you need to pass a specific set of options to let Perl know that.
You should add -Mutf8 to let Perl recognize the UTF-8-encoded characters used directly in your Perl code.
You also need to pass -CSD (equivalent to -CIOED) so that your input is decoded and your output re-encoded. These options assume UTF-8; they will not help with other encodings.
perl -CSD -Mutf8 -pe 's/[^áéíóúñüÁÉÍÓÚÑÜ¡¿[:ascii:]]//g;' filename
Do not forget about Ü and ü.

sed: replace a match with the same match and add more text

I have a file called config.properties that contains the following text:
cat config.properties
// a lot of data
com.enterprise.project.AERO_CARRIERS = LA|LP|XL|4M|LU|4C
//more data
My goal is to keep the same data but add more. In this example I want to append |JJ|PZ to the value assigned to this variable, which results in:
cat config.properties
// a lot of data
com.enterprise.project.AERO_CARRIERS = LA|LP|XL|4M|LU|4C|JJ|PZ
//more data
The command that I've been using for this is:
sed 's/\(com\.enterprise\.project\.AERO_CARRIERS\s*\=\s*.+\)/\1\|JJ\|PZ/g' config.properties
But this doesn't work. What am I doing wrong?

\s and + are not POSIX compliant. In a basic regular expression an unescaped + is a literal plus sign, which is why .+ never matches here:
you can match spaces and tabs with [[:blank:]] and whitespace characters (including line breaks) with [[:space:]].
.+ can be replaced with .\{1,\} or ..*
You also don't need a backreference here; use & in the replacement instead, which stands for the whole text matched by the pattern:
sed 's/^com\.enterprise\.project\.AERO_CARRIERS[[:blank:]]*=[[:blank:]]*.\{1,\}/&|JJ|PZ/'
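The same command can be tried on standard input with a small wrapper. This is only a sketch: append_carriers is a hypothetical name, and the text to append is assumed to contain no /, & or \ characters, since sed would interpret those in the replacement.

```shell
# Append "$1" to the AERO_CARRIERS assignment line; other lines pass through.
# & in the replacement stands for the whole matched line content.
append_carriers() {
    sed "s/^com\.enterprise\.project\.AERO_CARRIERS[[:blank:]]*=[[:blank:]]*.\{1,\}/&$1/"
}
```

A possible use is append_carriers '|JJ|PZ' <config.properties >config.properties.new, which leaves the original file untouched until the result has been checked.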

As an alternative to stream editors like sed, you can use ed, the classic UNIX text editor, for in-place search and replacement. The option used (-s) is POSIX compliant, so there are no portability issues:
printf '%s\n' ",g/com.enterprise.project.AERO_CARRIERS/ s/$/\|JJ\|PZ/g" w q | ed -s -- inputFile
The part ,g/com.enterprise.project.AERO_CARRIERS/ selects the lines containing the pattern, the part s/$/\|JJ\|PZ/g appends |JJ|PZ to the end of each such line, w writes the file in place, and q quits.

You can match first:
sed '/com\.enterprise\.project\.AERO_CARRIERS\s*\=\s*.\+/ s/$/|JJ|PZ/g' config.properties

"sed" special characters handling

We have a sed command in our script that replaces placeholders in a file with values from variables,
for example:
export value="dba01upc\Fusion_test"
sed -i "s%{"sara_ftp_username"}%$value%g" /home_ldap/user1/placeholder/Sara.xml
The sed command drops special characters such as '\' and inserts the string "dba01upcFusion_test", without the '\'.
It works if I do the export as export value='dba01upc\Fusion_test' (with the '\' surrounded by ''), but unfortunately our client wants to export the original text dba01upc\Fusion_test with single/double quotes and doesn't want to add any extra characters to the text.
Can anyone tell me how to make sed insert text that contains special characters?
Before Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account >
<ser:description/>
<ser:static-account>
<con:username>{sara_ftp_username}</con:username>
</ser:static-account>
</ser:service-account>
After Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account>
<ser:description/>
<ser:static-account>
<con:username>dba01upcFusion_test</con:username>
</ser:static-account>
</ser:service-account>
Thanks in advance

You cannot robustly solve this problem with sed. Just use awk instead:
awk -v old="string1" -v new="string2" '
idx = index($0,old) {
    $0 = substr($0,1,idx-1) new substr($0,idx+length(old))
}
1' file
Ah, @mklement0 has a good point: to stop escapes from being interpreted, you need to pass the values in the argument list along with the file names and assign the variables from that, rather than assigning values to the variables with -v (see the summary I wrote a LONG time ago for the comp.unix.shell FAQ at http://cfajohnson.com/shell/cus-faq-2.html#Q24 but apparently had forgotten!).
The following will robustly make the desired substitution (a\ta -> e\tf) on every search string found on every line:
$ cat tst.awk
BEGIN {
    old=ARGV[1]; delete ARGV[1]
    new=ARGV[2]; delete ARGV[2]
    lgthOld = length(old)
}
{
    head = ""; tail = $0
    while ( idx = index(tail,old) ) {
        head = head substr(tail,1,idx-1) new
        tail = substr(tail,idx+lgthOld)
    }
    print head tail
}
$ cat file
a\ta a a a\ta
$ awk -f tst.awk 'a\ta' 'e\tf' file
e\tf a a e\tf
The white space in file is tabs. You can shift ARGV[3] down and adjust ARGC if you like but it's not necessary in most cases.
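The difference between -v and ARGV that motivates this script can be seen in isolation. In the sketch below, via_v and via_argv are hypothetical helper names; each just prints the string it received, as awk sees it.

```shell
# With -v, awk processes escape sequences in the value, so \t becomes a tab.
via_v()    { awk -v s="$1" 'BEGIN { print s }'; }

# A plain command-line argument read from ARGV is taken literally.
via_argv() { awk 'BEGIN { s = ARGV[1]; delete ARGV[1]; print s }' "$1"; }
```

Calling via_v 'a\tb' prints a, a tab and b, while via_argv 'a\tb' prints the five characters a\tb unchanged.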

Update with the benefit of hindsight, to present options:
Update 2: If you're intent on using sed, see the - somewhat cumbersome, but now robust and generic - solution below.
If you want a robust, self-contained awk solution that also properly handles both arbitrary search and replacement strings (but cannot incorporate regex features such as word-boundary assertions), see Ed Morton's answer.
If you want a pure bash solution and your input files are small and preserving multiple trailing newlines is not important, see Charles Duffy's answer.
If you want a full-fledged third-party templating solution, consider, for instance, j2cli, a templating CLI for Jinja2 - if you have Python and pip, install with sudo pip install j2cli.
Simple example (note that since the replacement string is provided via a file, this may not be appropriate for sensitive data; note the double braces ({{...}})):
value='dba01upc\Fusion_test'
echo "sara_ftp_username=$value" >data.env
echo '<con:username>{{sara_ftp_username}}</con:username>' >tmpl.xml
j2 tmpl.xml data.env # -> <con:username>dba01upc\Fusion_test</con:username>
If you use sed, careful escaping of both the search and the replacement string is required, because:
As Ed Morton points out in a comment elsewhere, sed doesn't support use of literal strings as replacement strings - it invariably interprets special characters/sequences in the replacement string.
Similarly, the search string literal must be escaped in a way that its characters aren't mistaken for special regular-expression characters.
The following uses two generic helper functions that perform this escaping (quoting) that apply techniques explained at "Is it possible to escape regex characters reliably with sed?":
#!/usr/bin/env bash
# SYNOPSIS
# quoteRe <text>
# DESCRIPTION
# Quotes (escapes) the specified literal text for use in a regular expression,
# whether basic or extended - should work with all common flavors.
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# '
# SYNOPSIS
# quoteSubst <text>
# DESCRIPTION
# Quotes (escapes) the specified literal string for safe use as the substitution string (the 'new' in `s/old/new/`).
quoteSubst() {
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
printf %s "${REPLY%$'\n'}"
}
# The search string.
search='{sara_ftp_username}'
# The replacement string; a demo value with characters that need escaping.
value='&\1%"'\'';<>/|dba01upc\Fusion_test'
# Use the appropriately escaped versions of both strings.
sed "s/$(quoteRe "$search")/$(quoteSubst "$value")/g" <<<'<el>{sara_ftp_username}</el>'
# -> <el>&\1%"';<>/|dba01upc\Fusion_test</el>
Both quoteRe() and quoteSubst() correctly handle multi-line strings.
Note, however, that since sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
quoteRe() is always safe to use with a command substitution ($(...)), because it always returns a single-line string (newlines in the input are encoded as '\n').
By contrast, if you use quoteSubst() with a string that has trailing newlines, you mustn't use $(...), because the latter will remove the last trailing newline and therefore break the encoding (since quoteSubst() \-escapes actual newlines, the string returned would end in a dangling \).
Thus, for strings with trailing newlines, use IFS= read -d '' -r escapedValue < <(quoteSubst "$value") to read the escaped value into a separate variable first, then use that variable in the sed command.

This can be done with bash builtins alone -- no sed, no awk, etc.
orig='{sara_ftp_username}'              # put the original value into a variable
new='dba01upc\Fusion_test'              # ...no need to 'export'!
contents=$(<Sara.xml)                   # read the file's content into a variable
new_contents=${contents//"$orig"/$new}  # use parameter expansion to replace
printf '%s' "$new_contents" >Sara.xml   # write new content to disk
See the relevant part of BashFAQ #100 for information on using parameter expansion for string substitution.

Understand sed usage

I use sed to automatically update the version in my doxyfile, using this:
sed -i -e "s/PROJECT_NUMBER.([ ]{2,}=.*)/PROJECT_NUMBER = $$VERSION/g" ".doxygen"
with $$VERSION = 1.1.0 (for example)
and as a source:
PROJECT_NUMBER = 1.0.10
But it generates a copy of my .doxygen named .doxygen-e and doesn't change the line. I've tested my regex here.
I don't understand what's wrong given the fact that it works with my plist file using this :
sed -i -e "s/#VERSION#/$$VERSION/g" "./$${TARGET}.app/Contents/Info.plist"

There are a couple of problems here:
You need to refer to a shell variable $FOO as $$FOO in a Makefile. If you are attempting to do it in bash or any other shell, saying:
$$FOO
would result in the numeric PID of the current process concatenated with FOO, e.g. if the PID of the current process is 1234, then you'd get:
1234FOO
That said, your regex seems to be wrong on more than one count. You say:
PROJECT_NUMBER.([ ]{2,}=.*)
Since you are not using any option that enables Extended Regular Expressions, sed treats the pattern as a basic regular expression, in which (, ), { and } are ordinary characters. It therefore matches the string PROJECT_NUMBER, followed by any one character, a literal (, a space, the literal text {2,}, an = sign, and then everything up to the last ) on the line.
Since you haven't said much about what the line in the file looks like, I'd assume that it's of the form:
PROJECT_NUMBER = 42.42
The following might work for you:
sed "s/\(PROJECT_NUMBER[ ]*=[ ]*\)[^ ]*/\1$VERSION/" filename
If invoking from within a Makefile, you'd need to double the $.
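On the .doxygen-e file: with BSD/macOS sed, -i takes the backup suffix as its argument, so -i -e is read as "edit in place and keep a backup with the suffix -e", which would explain the extra copy. One way to avoid that portability difference is to skip -i and go through a temporary file. The sketch below assumes the version string contains no /, & or \ characters; set_project_number is a hypothetical name.

```shell
# Rewrite the value on the PROJECT_NUMBER line of a Doxygen config file.
# $1: new version string, $2: path to the file.
set_project_number() {
    version=$1
    file=$2
    tmp=$(mktemp) || return 1
    # Keep "PROJECT_NUMBER", the "=" and the spacing around it; replace the rest.
    sed "s/^\(PROJECT_NUMBER[[:blank:]]*=[[:blank:]]*\).*/\1$version/" "$file" >"$tmp" &&
        cat "$tmp" >"$file"
    rm -f "$tmp"
}
```

From a shell this would be called as set_project_number "$VERSION" .doxygen; inside a Makefile recipe the $ signs need doubling, as noted above.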
