Remove content from XML node with sed - regex

My XML input file looks like this:
...
<logos>
<logo name="" primary="true" guid="c6aae8fe-bb04-4067-9b14-18b1bcf940d3" />
<logo name="" primary="false" guid="68b55f4d-f401-4180-b0e0-160974758348" />
</logos>
...
I need to remove the content, keeping the node. Expected output:
<logos></logos>
My command looks like this:
sed -i 's|\(<logos>\)\(.+\)\(</logos>\)|\1\3|gi' $filename
But it ain't working. What am I missing?
Edit: this is not a duplicate of delete node in a xml file with sed : that question is about deleting the whole node. Here I need to delete the content of the node only.

You could use an address range together with the c (change) command:
sed -i.bak '/<logos>/,/<\/logos>/c<logos></logos>' $filename

sed and similar tools are a bad choice for such cases.
Use a proper XML/HTML parser.
xmlstarlet solution:
Sample input.xml:
<root>
<logos>
<logo name="" primary="true" guid="c6aae8fe-bb04-4067-9b14-18b1bcf940d3"/>
<logo name="" primary="false" guid="68b55f4d-f401-4180-b0e0-160974758348"/>
</logos>
</root>
xmlstarlet ed -O -d '//logos/*' input.xml
The output:
<root>
<logos/>
</root>
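If xmlstarlet is not available, roughly the same thing can be done with a parser from a scripting language. Below is a minimal sketch using Python's standard xml.etree.ElementTree; the function name empty_logos and the inline SAMPLE string are only illustrative, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the answer above.
SAMPLE = """<root>
<logos>
<logo name="" primary="true" guid="c6aae8fe-bb04-4067-9b14-18b1bcf940d3"/>
<logo name="" primary="false" guid="68b55f4d-f401-4180-b0e0-160974758348"/>
</logos>
</root>"""


def empty_logos(xml_text):
    """Remove every child of each <logos> element but keep the element itself."""
    root = ET.fromstring(xml_text)
    for logos in root.iter("logos"):
        for child in list(logos):  # copy the list, since we mutate while looping
            logos.remove(child)
        logos.text = None          # drop the leftover whitespace inside the node
    return ET.tostring(root, encoding="unicode")


emptied = empty_logos(SAMPLE)
print(emptied)
```

As with xmlstarlet, the empty node is serialized in its self-closing form, which is equivalent to <logos></logos> for any XML consumer.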

Related

Regex to extract http links from an XML file

I have an xml file with many lines like:
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
How do I extract just the link - http://store.vcenter.com/stores/en/product/tigers-midi/100?
I tried http://www\.\.com[^<]+ but that captures everything until the end of the line, including quotes and closing XML tags.
I'm using this expression with egrep.
Don't parse HTML with regex, use a proper XML/HTML parser.
Check: Using regular expressions with HTML tags
You can use one of the following :
xmllint
xmlstarlet
saxon-lint
File:
<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>
Example with xmllint :
xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null
Output:
href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
If you need a quick & dirty one time command, you can do:
egrep -o 'https?://[^"]+' file
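If you want only the bare URL (without the href="..." wrapper), a parser in a scripting language gives you the attribute value directly. This is a minimal sketch with Python's standard xml.etree.ElementTree. It assumes the real file declares the xhtml prefix somewhere; the namespace URI on the root element below is my assumption, since the snippet in the question does not show it, and vip_links is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Assumption: the xhtml prefix is declared; the URI here is a guess for the sketch.
SAMPLE = """<root xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>"""


def vip_links(xml_text):
    """Return the href of every element whose vip attribute is "true"."""
    root = ET.fromstring(xml_text)
    return [
        el.get("href")
        for el in root.iter()
        if el.get("vip") == "true" and el.get("href") is not None
    ]


links = vip_links(SAMPLE)
for link in links:
    print(link)
```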

Using sed with regex to subtitute string

I have this xml data
<institution>
<id>83812745840</id>
<code>2701811200</code>
<full_name>full name 1</full_name>
<address>adress 1</address>
<institution_type>
<id>191</id>
<code>inst code 1</code>
<name>institution name1</name>
</institution_type>
<place>
<id>812007638</id>
<name>place-name_1</name>
<code>415995</code></place>
<activity>
<code>811855905</code>
<name>act-name-1</name>
<equipment_specialty>false</equipment_specialty>
</activity>
</institution>
I need to change <code> to <code_> and <place><name> to <place><name_>. How can this be done with sed and regex?
I tried sed 's/<institution>.*<code>.*<\/code>/<institution>.*<code_>.*<\/code_>/g' but the .* in the replacement string stays a literal .* instead of the text matched by the regex.
The main issue here is that you are not using an XML/HTML parser, which you should always do when dealing with XML/HTML data:
the right way with xmlstarlet tool:
xmlstarlet ed -O -r '//institution/code' -v 'code_' -r '//place/name' -v 'name_' input.xml
The output:
<institution>
<id>83812745840</id>
<code_>2701811200</code_>
<full_name>full name 1</full_name>
<address>adress 1</address>
<institution_type>
<id>191</id>
<code>inst code 1</code>
<name>institution name1</name>
</institution_type>
<place>
<id>812007638</id>
<name_>place-name_1</name_>
<code>415995</code>
</place>
<activity>
<code>811855905</code>
<name>act-name-1</name>
<equipment_specialty>false</equipment_specialty>
</activity>
</institution>
To modify the file in place, add the -L option: xmlstarlet ed -O -L ....
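The same two renames can also be sketched with Python's standard xml.etree.ElementTree, in case xmlstarlet is not installed. The function name rename_nodes and the shortened SAMPLE are illustrative only; the selection logic is meant to mirror the two XPath expressions above (code directly under institution, name directly under place).

```python
import xml.etree.ElementTree as ET

# Shortened version of the input from the question.
SAMPLE = """<institution>
<id>83812745840</id>
<code>2701811200</code>
<institution_type>
<id>191</id>
<code>inst code 1</code>
</institution_type>
<place>
<id>812007638</id>
<name>place-name_1</name>
<code>415995</code></place>
</institution>"""


def rename_nodes(xml_text):
    root = ET.fromstring(xml_text)
    # Like //institution/code: only <code> elements directly under <institution>.
    for institution in root.iter("institution"):
        for code in institution.findall("code"):
            code.tag = "code_"
    # Like //place/name: only <name> elements directly under <place>.
    for name in root.findall(".//place/name"):
        name.tag = "name_"
    return ET.tostring(root, encoding="unicode")


renamed = rename_nodes(SAMPLE)
print(renamed)
```

Because the elements are selected by their position in the tree, the <code> nodes under institution_type and place are left alone, which is hard to guarantee with a line-oriented regex.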

How to solve this sed syntax issue

I wrote a regex code to extract anchor tags from a html file and got this output.
mdlinks.txt
<a href='/aspnet/aspnet_refhtmlcontrols.asp'>ASP.NET Reference</a>
<a href='/aspnet/webpages_ref_classes.asp'>Razor Reference</a>
<a href='/html/html_examples.asp'>HTML Examples</a>
<a href='/css/css_examples.asp'>CSS Examples</a>
<a href='/w3css/w3css_examples.asp'>W3.CSS Examples</a>
JavaScript Examples
HTML DOM Examples
I have to represent the output as
"text to display" using the sed tool.
<a[\s]href=('|")([^>]+)">((?:.(?!\<\/a\>))*.)<\/a>
This is my regex which captures the text and href link.
Here is the sed command I wrote:
sed -E "s/\"<a[\s]href=('|\")([^>]+)\">((?:.(?!\<\/a\>))*.)<\/a>\"/\[\2\] \(\1\)/" mdlinks.txt
But this gives me an error.
Can someone please help me?
This is not a job for regex (or any other string manipulation tool). You need tools able to parse html. An example using xsltproc:
1) install the xsltproc package (if needed)
2) Write this xsl file that describes how to transform the html input: stylesheet.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version= "1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="//a">[<xsl:value-of select="text()"/>] (<xsl:value-of select="@href"/>)</xsl:template>
</xsl:stylesheet>
3) Take your original file, or put your original HTML content in a variable (say "CONTENT"), rather than using mdlinks.txt: that intermediate step is unnecessary, and grepping links out of HTML content is error-prone and a waste of time. Then run:
xsltproc --html --novalid stylesheet.xsl <(echo "$CONTENT")
You obtain:
[Google.com] (http://google.com)
[An Example] (http://example.com/files.html)
[File #23] (file23.html)
[See my picture!] (images/mypic.png)
[Email Joel] (mailto:joelross@uw.edu)
Link: http://scott.dd.com.au/wiki/XSLT_Tutorial
Parsing html with line oriented tools will normally fail. Given your simple layout, you could try
tr -s "<" ">" < mdlinks.txt | cut -d">" -f3
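For comparison, here is a minimal sketch of the parser approach using only Python's standard html.parser module. The class name AnchorCollector is illustrative, and the output format "[text] (href)" simply follows the xsltproc example above; adjust it to whatever the target format really is.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect "[text] (href)" for every <a href=...> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append("[{}] ({})".format(text, self._href))
            self._href = None


# Two lines taken from mdlinks.txt in the question.
HTML = """<a href='/aspnet/aspnet_refhtmlcontrols.asp'>ASP.NET Reference</a>
<a href='/html/html_examples.asp'>HTML Examples</a>"""

collector = AnchorCollector()
collector.feed(HTML)
collector.close()
for line in collector.links:
    print(line)
```

The parser copes with single or double quotes and extra attributes without any change to the code, which is where the regex in the question gets complicated.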

Bash using sed to find symbols

I am using sed to parse an XML file from Yahoo Finance. The file contains a lot of uninteresting information plus all global stock symbols, which I want to extract. It is a one-line XML file with a large number of stock symbols, represented like this:
symbol="VALUE"
i am using sed like this:
sed "s/.* symbol=\"\(.*\)\".*/\1/" list_stocksymbols.xml >> ./tmpfile.txt
my output looks like that:
<?xml version="1.0" encoding="UTF-8"?>
WRG.AX
<!-- engine8.yql.bf1.yahoo.com -->
problem
As you can see, only one symbol is extracted (WRG.AX).
question
How would I go about getting sed to write out all symbols?
I tried
sed "s/.* symbol=\"\(.*\)\".*/\1/g" list_stocksymbols.xml >> ./tmpfile.txt
with the global flag, but it didn't work.
xml file extract
<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="215" yahoo:created="2014-08-22T09:05:59Z" yahoo:lang="en-US">
<results><industry id="112" name="Agricultural Chemicals">
<company name="Adarsh Plant Protect Ltd" symbol="ADARSHPL.BO"/>
<company name="Agrium Inc" symbol="AGU.DE"/><company name="Agrium Inc" symbol="AGU.TO"/>
<company name="Agrium Inc." symbol="AGU"/>
<company name="Aimco Pesticides Ltd" symbol="AIMCO.BO"/>
<company name="American Vanguard Corp." symbol="AVD"/>
... and so on. The file is in 1 line only and not formatted like above.
perl regex try
perl -nle'print $& if m{(?<=symbol=")[^"]+}' list_stocksymbols
This also printed only the first occurrence.
grep -Eo 'symbol="[^"]+' yahoo.txt | cut -c 9-
This works for all the grep versions without Perl support (as in Mac OS X in your case).
Using only sed also works when each line holds a single symbol; on a one-line file the greedy .* keeps only the last symbol:
sed 's/.*symbol=\"//;s/\".*//' yahoo.txt
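The reason the original sed and the greedy patterns return a single value is that .* matches as much as it can, so on a one-line file it swallows every symbol but one. The negated character class [^"]+ used in the grep answer stops at the closing quote instead. A small sketch in Python's re module to show the difference; the LINE string is a shortened excerpt of the question's data and the variable names are mine.

```python
import re

# Three <company> elements on a single line, as in the real file.
LINE = (
    '<company name="Agrium Inc" symbol="AGU.DE"/>'
    '<company name="Agrium Inc" symbol="AGU.TO"/>'
    '<company name="Agrium Inc." symbol="AGU"/>'
)

# Greedy: .* runs to the last quote on the line, so there is only one match.
greedy = re.findall(r'symbol="(.*)"', LINE)

# Negated class: each match stops at the first closing quote.
precise = re.findall(r'symbol="([^"]+)"', LINE)

print(len(greedy))
print(precise)
```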

sed to edit only part of a file with regular expression

I have a file named test.txt with the following content
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test time="60" id="01">
<java.lang.String value="cat"/><java.lang.String value="dog"/>
<java.lang.String value="mouse"/>
<java.lang.String value="cow"/>
</test>
What I would like to do is edit the file so that whenever I find something like <java.lang.String value="something"/>, that part is changed to <animal>something</animal>.
So for previous example , after applying a script with sed/awk/grep command the file content will be changed to or a new file will be created like following:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test time="60" id="01">
<animal>cat</animal><animal>dog</animal>
<animal>mouse</animal>
<animal>cow</animal>
</test>
I tried to extract that particular part using following command :
$less test.txt | grep -Po 'java.lang.String value="\K[^"]*' | awk -F: '{print "<animal>" $1 "</animal>"}'
The output gives me the changed part, but I want this changed part along with the rest of the file unchanged :
<animal>cat</animal>
<animal>dog</animal>
<animal>mouse</animal>
<animal>cow</animal>
I am new to scripting and don't know how to write the complete output to a file.
sed -r 's#<java.lang.String value="([^"]*)"/>#<animal>\1</animal>#g' test.txt
And you should not do XML transformations with regular expressions...
EDIT about how it works
By default sed uses "basic regular expressions", where many special characters have to be prefixed with \. -r flag switches to "extended regular expressions" where the syntax is less cumbersome. See OpenGroup for details.
By default sed prints output as-is unless commands modify it. The replacement command is like s#search_regexp#replacement#flags. The delimiter can be anything like /, #, or ,. I chose # so it doesn't clash with the / character in XML.
Then we match things like <java.lang.String value="anything_except_quotes"/>. The part that we want to reuse is wrapped in parentheses; this is called a capture group. In the replacement we refer to the text captured by the group as \1.
The g flag makes sed replace all occurrences of the search pattern, not only the first one.
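To see the capture group and the effect of the g flag in isolation, here is a small sketch of the same substitution in Python's re module. It is not what sed does internally, just the equivalent idea: re.sub replaces every match by default (like sed with g), and count=1 mimics sed without g. I escaped the dots in the pattern, which the sed command above leaves unescaped; the variable names are illustrative.

```python
import re

# One line from test.txt that contains two elements.
LINE = '<java.lang.String value="cat"/><java.lang.String value="dog"/>'

# ([^"]*) is the capture group; \1 in the replacement refers back to it.
PATTERN = re.compile(r'<java\.lang\.String value="([^"]*)"/>')
REPLACEMENT = r"<animal>\1</animal>"

all_replaced = PATTERN.sub(REPLACEMENT, LINE)             # like sed 's#...#...#g'
first_only = PATTERN.sub(REPLACEMENT, LINE, count=1)      # like sed 's#...#...#'

print(all_replaced)
print(first_only)
```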
ok some problems with your command:
less test.txt | grep -Po 'java.lang.String value="\K[^"]*' | awk -F: '{print "<animal>" $1 "</animal>"}'
to begin with, there's a useless use of less, grep can take a file as a parameter:
grep -Po 'java.lang.String value="\K[^"]*' test.txt | awk -F: '{print "<animal>" $1 "</animal>"}'
Then you're using grep -o, which prints only the parts that match, so your sequence of commands explicitly keeps only the java.lang... values and throws everything else away. A simpler solution would be to use sed:
sed -r 's,<java.lang.String value="([^"]*)"\s*/>,<animal>\1</animal>,g' test.txt
This uses sed's substitution syntax to replace the match, while capturing what is between the parentheses ( and ) as \1 for the right-hand side. The [^"] part matches any character that is not a ", and the * operator applies that match 0 or more times. Likewise, \s* matches 0 or more whitespace characters.
A regex is an automaton that uses states and transitions to match a given string. Here's a visual of how the regex works:
[Figure: demo of the regex on an example]
Though in your particular case that simple regex works out, keep in mind that this is only a hack. You should instead use an XML parser and replace the nodes to match your needs, using XSLT/XSLFO that are tools designed to transform an XML into another one (or something else).
To do that, you could use a tool such as xsltproc and look at this Q for an example that transforms all foo nodes into bar nodes in an XML tree, here's how to do it:
test.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<!--Identity Template. This will copy everything as-is.-->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!--Change "java.lang.String" element to "animal" element.-->
<xsl:template match="java.lang.String">
<animal>
<!-- copy all attributes of java.lang.String (here: 'value') -->
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</animal>
</xsl:template>
</xsl:stylesheet>
run:
xsltproc test.xsl test.xml
result:
<?xml version="1.0"?>
<test time="60" id="01">
<animal value="cat"/>
<animal value="dog"/>
<animal value="mouse"/>
<animal value="cow"/>
</test>
By the way, given your XML, it looks like it has been generated by Java, and there are multiple ways to apply that XSL from within your code, even before you need to handle it using command line tools.
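Note that the stylesheet above keeps value as an attribute, whereas the question asks for <animal>something</animal>. If a scripting language is an option, a parser can move the attribute into the element text. This is a minimal sketch with Python's standard xml.etree.ElementTree; the function name to_animals is illustrative, and the XML declaration is left out of SAMPLE to keep the example short.

```python
import xml.etree.ElementTree as ET

# Content of test.txt from the question, without the XML declaration.
SAMPLE = """<test time="60" id="01">
<java.lang.String value="cat"/><java.lang.String value="dog"/>
<java.lang.String value="mouse"/>
<java.lang.String value="cow"/>
</test>"""


def to_animals(xml_text):
    """Turn <java.lang.String value="x"/> into <animal>x</animal>."""
    root = ET.fromstring(xml_text)
    # list() so that renaming tags does not interfere with the iteration.
    for el in list(root.iter("java.lang.String")):
        el.tag = "animal"
        el.text = el.attrib.pop("value", "")  # move the attribute into the text
    return ET.tostring(root, encoding="unicode")


converted = to_animals(SAMPLE)
print(converted)
```

Everything else in the document (the test element, its attributes, the line breaks between elements) is carried over unchanged.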