Searching and replacing a block of XML formatted text in Bash - regex

I have been trying to figure out how to search a block of XML-formatted text and modify it in Bash. The file I want to process is a simulation file with XML formatting. Assume that the file contains multiple blocks of XML statements such as:
<mote>
<breakpoints />
<interface_config>
org.contikios.cooja.interfaces.Position
<x>0.0</x>
<y>75.0</y>
<z>0.0</z>
</interface_config>
<interface_config>
org.contikios.cooja.mspmote.interfaces.MspClock
<deviation>1.0</deviation>
</interface_config>
<interface_config>
org.contikios.cooja.mspmote.interfaces.MspMoteID
<id>4</id>
</interface_config>
<motetype_identifier>sky2</motetype_identifier>
</mote>
What I want to search is a block of XML statements here:
<id>4</id>
</interface_config>
<motetype_identifier>sky2</motetype_identifier>
And replace it with
<id>4</id>
</interface_config>
<motetype_identifier>sky3</motetype_identifier>
The rest of the XML statements before and after these statements will remain unchanged. This will let me change the mote type of node 4 from sky2 to sky3 from a Bash script.

xmlstarlet ed --omit-decl -u "//mote[interface_config/id='4']/motetype_identifier" -v 'sky3' file.xml
Output:
<mote>
<breakpoints/>
<interface_config>
org.contikios.cooja.interfaces.Position
<x>0.0</x>
<y>75.0</y>
<z>0.0</z>
</interface_config>
<interface_config>
org.contikios.cooja.mspmote.interfaces.MspClock
<deviation>1.0</deviation>
</interface_config>
<interface_config>
org.contikios.cooja.mspmote.interfaces.MspMoteID
<id>4</id>
</interface_config>
<motetype_identifier>sky3</motetype_identifier>
</mote>
If you want to edit file.xml in place, add the -L option.
See: xmlstarlet ed --help
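For use in a script, the same command can be wrapped in a function. This is only a minimal sketch, assuming xmlstarlet is installed; the function name set_mote_type and its argument order are made up for illustration. The id is spliced into the XPath as-is, so it assumes the id contains no single quote.

```shell
#!/bin/bash
# set_mote_type FILE ID TYPE  (hypothetical helper, not part of xmlstarlet)
# Sets <motetype_identifier> to TYPE for the <mote> whose <id> equals ID.
set_mote_type() {
    local file=$1 id=$2 type=$3
    # -L edits the file in place; -u selects the node(s) to update,
    # -v supplies the new value.
    xmlstarlet ed -L \
        -u "//mote[interface_config/id='${id}']/motetype_identifier" \
        -v "$type" "$file"
}

# Example call:
#   set_mote_type file.xml 4 sky3
```

Motes with a different id are not matched by the XPath predicate, so their mote type is left alone.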

Related

How to extract the data from a consecutive xml tag attribute based on the previous tag value

I have trouble getting my regex right for the below use case.
<LOB>
<LOBStatusInfo>
<LOB>Mobile</LOB>
<Status>Active</Status>
</LOBStatusInfo>
<LOBStatusINfo>
<LOB>Voice</LOB>
<Status>Active</Status>
</LOBStatusInfo>
<LOBStatusInfo>
<LOB>Internet</LOB>
<Status>Disconnect</Status>
</LOBStatusInfo>
</LOBStatus>
In the above XML, I'm looking to extract only the status corresponding to Voice (which is active).
So far, I was able to get the LOB itself, but not the corresponding status.
ps: I'm a newbie, please pardon if the details weren't enough.
We don't parse XML with regex; see: Using regular expressions with HTML tags
Instead, you can use XPath and a proper XML parser. What is your environment and language?
Test :
Input file
<LOB>
<LOBStatus>
<LOBStatusInfo>
<LOB>Mobile</LOB>
<Status>Active</Status>
</LOBStatusInfo>
<LOBStatusInfo>
<LOB>Voice</LOB>
<Status>Active</Status>
</LOBStatusInfo>
<LOBStatusInfo>
<LOB>Internet</LOB>
<Status>Disconnect</Status>
</LOBStatusInfo>
</LOBStatus>
</LOB>
Command
(just an example, now in shell, but the query can be used in any language of your choice)
xmllint --xpath '//LOB[text()="Voice"]/../Status/text()' file.xml
or
xmllint --xpath '//LOB[text()="Voice"]/following-sibling::Status/text()' file.xml
Output:
Active
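To capture the value in a shell variable, the query can be wrapped with string(), which returns the text of the first matching node, or an empty string when nothing matches. A minimal sketch, assuming xmllint (from libxml2) is installed; the function name lob_status is hypothetical, and the LOB name is assumed not to contain a single quote.

```shell
#!/bin/bash
# lob_status FILE NAME  (hypothetical helper)
# Prints the <Status> of the <LOBStatusInfo> whose <LOB> equals NAME.
lob_status() {
    local file=$1 name=$2
    # The predicate picks the LOBStatusInfo by its LOB child;
    # string() reduces the matching Status element to its text.
    xmllint --xpath "string(//LOBStatusInfo[LOB='${name}']/Status)" "$file"
}

# Example call:
#   status=$(lob_status file.xml Voice)
```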

shell script to recognise jira key

The lines below, from the Execute shell step of my Jenkins job configuration, retrieve the Jira key:
JIRA_KEY=$(curl --request GET "http://jenkins-server/job/myProject/job/mySubProject/job/myComponent/${BUILD_NUMBER}/api/xml?xpath=/*/changeSet/item/comment" | sed -e "s/<comment>\(.*\)<\/comment>/\1/")
JIRA_KEY=$(echo $JIRA_KEY | cut -c10-17)
But if the text doesn't start with a Jira key, the current logic assigns whatever text falls in character range 10-17. I need to store the empty string "" in the variable JIRA_KEY when no Jira key is present in the <comment>. How can I do that?
Here is the xml
<freeStyleBuild _class="hudson.model.FreeStyleBuild">
<changeSet _class="hudson.plugins.git.GitChangeSetList">
<item _class="hudson.plugins.git.GitChangeSet">
<comment>
JRA-1011 This is commit
message.
</comment>
</item>
</changeSet>
</freeStyleBuild>
As I mentioned in the comments, it is not clear which output you need, so based on some assumptions, could you please try the following and let me know how it goes.
I- If you need all the strings between <comment> and </comment>, you could run the following.
awk '/<\/comment>/{a="";next}/<comment>/{a=1;next}a' Input_file
II- If you need only the JRA string between <comment> and </comment>, you could do the following.
awk '/<\/comment>/{a="";next}/<comment>/{a=1;next} a && /JRA/{match($0,/[a-zA-Z]+[^ ]*/);print substr($0,RSTART,RLENGTH)}' Input_file
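For the "empty string when there is no key" requirement, one option is to test the comment text against a key pattern instead of cutting fixed character positions. The sketch below assumes a Jira key looks like uppercase letters or digits (starting with a letter), a dash, then digits, and that only a key at the very start of the comment counts; jira_key_from_comment is a made-up name.

```shell
#!/bin/bash
# jira_key_from_comment TEXT  (hypothetical helper)
# Prints the Jira key if TEXT, ignoring leading whitespace, starts with one;
# prints nothing otherwise.
jira_key_from_comment() {
    # Join the lines so the pattern sees the whole comment at once, then
    # print only when the text starts with something shaped like KEY-123.
    # LC_ALL=C keeps [A-Z] from matching lowercase letters in some locales.
    printf '%s' "$1" | tr '\n' ' ' |
        LC_ALL=C sed -n -E 's/^[[:space:]]*([A-Z][A-Z0-9]*-[0-9]+)([^0-9A-Za-z].*)?$/\1/p'
}

# Example call, where $comment_text holds the text between the comment tags:
#   JIRA_KEY=$(jira_key_from_comment "$comment_text")
```

When nothing matches, sed prints nothing, so the command substitution leaves JIRA_KEY empty.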

BASH script to rename XML file to an attribute value

I have a lot of .xml files structured the same way:
<parent id="idvalue" attr1="val1" attr2="val2" ...>
<child attr3="val3" attr4="val4" ... />
<child attr3="val5" attr4="val6" ... />
...
</parent>
Each file has exactly one <parent> element with exactly one id attribute.
All of those files (almost 1,700,000 of them) are named as part.xxxxx where xxxxx is a random number.
I want to name each of those files as idvalue.xml, according to the sole id attribute from the file's content.
I believe doing it with a bash script would be the fastest and most automated way. But if there are other suggestions, I would love to hear them.
My main problem is that I don't know how to get the idvalue from a specific file, so that I could use it with the mv part.xxxxx idvalue.xml command.
First I would iterate through the files using find:
find . -maxdepth 1 -name 'part.*' -exec ./rename_xml.sh {} \;
The line above executes rename_xml.sh once for every matching file, passing the file name to the script as its argument.
rename_xml.sh should look like this:
#!/bin/bash
# Get the id using XPath. You might need to install
# xmllint for that if it is not already present.
# The XPath query returns a string like this (try it!):
#
#  id="idvalue"
#
# We use sed to extract the value from that.
id=$(xmllint --xpath '//parent/@id' "$1" | sed -r 's/[^"]+"([^"]+).*/\1/')
mv -v "$1" "$id.xml"
Don't forget to
chmod +x rename_xml.sh
Use a proper XML handling tool to extract the id from the files. For example,
xsh:
for file in part.* ; do
mv "$file" "$(xsh -aC 'open { shift }; echo /parent/@id' "$file").xml"
done
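With this many files it may be worth guarding against files that have no readable id and against overwriting an existing target. A rough sketch of such a loop, assuming xmllint is installed and that ids contain no slash; rename_by_id is a hypothetical name, and mv -n is the GNU/BSD "do not overwrite" option.

```shell
#!/bin/bash
# rename_by_id FILE...  (hypothetical helper)
# Renames each FILE to <id>.xml in the same directory, where <id> is the
# id attribute of its <parent> root element.
rename_by_id() {
    local file id
    for file in "$@"; do
        # string() yields just the attribute value, so no sed step is needed.
        id=$(xmllint --xpath 'string(/parent/@id)' "$file" 2>/dev/null)
        if [ -z "$id" ]; then
            echo "skipping $file: no id found" >&2
            continue
        fi
        # -n refuses to overwrite a file that already has the target name.
        mv -n -- "$file" "$(dirname "$file")/$id.xml"
    done
}

# Example call:
#   rename_by_id part.*
```

Because rename_by_id is a shell function, the expanded part.* list is not passed through exec, so the usual argument-length limit for external commands does not apply to the call itself.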
As I mentioned in my comment, I am not sure how the performance of XSLT compares to that of bash scripts, but I created an XSLT stylesheet for you to try out.
In the stylesheet below, Dire is the directory that contains the XML files. The select expression "tokenize(document-uri(.), '/')[last()]" retrieves the filename, and the next line concatenates the directory name with the filename to get the path of the file. The xsl:copy-of line copies the entire XML document.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxml="urn:schemas-microsoft-com:xslt" xmlns:random="http://www.microsoft.com/msxsl">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="collection('Dire/?select=*.xml')" >
<xsl:variable name="filename" select="tokenize(document-uri(.), '/')[last()]"/>
<xsl:variable name="filepath" select="concat('Dire/',$filename)"/>
<xsl:variable name="doc" select="document($filepath)"/>
<xsl:variable name="outname" select="$doc/parent/@id"/>
<xsl:result-document href="{$outname}.xml" method="xml">
<xsl:copy-of select="$doc/node()"/>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
I ran the XSLT using Saxon 8. Unfortunately, I could not find a way to rename the XML files directly, but the above code should be worth a try.

Removing specific tags in a KML file

I have a KML file which is a list of places around the world with coordinates and some other attributes. It looks like this for one place:
<Placemark>
<name>Albania - Durrës</name>
<open>0</open>
<visibility>1</visibility>
<description>(Spot ID: 275801) show <![CDATA[forecast]]></description>
<styleUrl>#wgStyle001</styleUrl><Point>
<coordinates>19.489747,41.277806,0</coordinates>
</Point>
<LookAt><range>200000</range><longitude>19.489747</longitude><latitude>41.277806</latitude></LookAt>
</Placemark>
I would like to remove everything except the name of the place. So in this case that would mean I would like to remove everything except
<name>Albania - Durrës</name>
The problem is, this KML file includes more than 1000 of these places. Doing this manually obviously isn't an option, so then how can I remove all tags except for the name tags for all of the items in the list? Can I use some kind of program for that?
Use a specialized command line tool that understands XML documents.
One such tool is xmlstarlet, which is available here for Linux, Windows and Solaris.
To address your particular problem, I used the xmlstarlet executable xml.exe like this (on Windows):
xml.exe sel -N ns=http://www.opengis.net/kml/2.2 -t -v /ns:kml/ns:Document/ns:Placemark/ns:name places.kml
This produces this output:
Albania - Durrës
Second Name
Third Name
...
Final Name
If you can guarantee that <name> occurs only as a child of <Placemark>, then this abbreviated version will produce the same result:
xml.exe sel -N ns=http://www.opengis.net/kml/2.2 -t -v //ns:name places.kml
(This is because this shorter version finds all <name> elements no matter where they occur in the document.)
If you really want an XML document, you'll need to do a little post-processing. Here's an example of a complete XML document:
<?xml version='1.0' encoding='utf-8'?>
<items>
<item>Albania - Durrës</item>
<item>Second Name</item>
<item>Third Name</item>
<!-- ... -->
<item>Final Name</item>
</items>
This first line is the XML declaration. It declares the Unicode encoding utf-8. You'll need to include this line so that XML processors recognize that your document includes Unicode characters. (As in Durrës.)
More: Here's an enhanced 'xmlstarlet' command that will produce the XML document above:
xml.exe sel -N ns=http://www.opengis.net/kml/2.2 -T -t -o "<?xml version='1.0' encoding='utf-8'?>" -n -t -v "'<items>'" -n -t -m //ns:Placemark -v "concat('<item>',ns:name,'</item>')" -n -t -o "</items>" -n places.kml
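If the goal is to keep the KML structure and only strip the other tags, xmlstarlet's edit mode can delete every child of <Placemark> that is not a <name>. This is a sketch under the same assumption as above, namely that the file uses the KML 2.2 namespace; the function name keep_placemark_names is made up, and on Windows the executable would be xml.exe rather than xmlstarlet.

```shell
#!/bin/bash
# keep_placemark_names FILE  (hypothetical helper)
# Writes to stdout a copy of FILE in which every <Placemark> keeps only
# its <name> child.
keep_placemark_names() {
    # -N binds the prefix k to the KML namespace;
    # -d deletes every node selected by the XPath expression.
    xmlstarlet ed -N k=http://www.opengis.net/kml/2.2 \
        -d '//k:Placemark/*[not(self::k:name)]' "$1"
}

# Example call:
#   keep_placemark_names places.kml > names_only.kml
```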
If you are on linux or similar:
grep "<name>" your_file.kml > file_with_only_name_tags
On windows, see What are good grep tools for Windows?

Access attributes from XML in shell

I'm trying to parse out values from a Widget config.xml using shell. I don't want to use sed for this task. If there is something that sucks less than xsltproc, I'd love to know.
In this example I am after the id attribute value from the config.xml below:
<?xml version="1.0" encoding="UTF-8"?>
<widget xmlns="http://www.w3.org/ns/widgets" id="http://example.org/exampleWidget" version="2.0 Beta" height="200" width="200">
<name short="123">Foo Widget</name>
</widget>
I wish it was as simple as jQuery's attr: var id = $("widget").attr("id");
Currently this shell code utilising xsltproc fails:
snag () {
TMP=$(tempfile)
cat << EOF > $TMP
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="utf-8" indent="no"/>
<xsl:template>
<xsl:value-of select="$1"/>
</xsl:template>
</xsl:stylesheet>
EOF
echo $(xsltproc $TMP config.xml)
rm -f $TMP
}
ID=$(snag "widget/@id")
if test "$ID" = "http://example.org/exampleWidget"
then
echo Mission accomplished.
else
echo "<$ID> is wrong."
fi
XMLStarlet (http://xmlstar.sourceforge.net/) is a nice command-line tool that supports such queries:
xmlstarlet sel -N w=http://www.w3.org/ns/widgets -T -t -m "/w:widget/@id" -v . -n config.xml
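The template needs a match attribute, and the value is read with value-of select="@id". Since widget is in a namespace, the match must use a prefix bound to that namespace:
<xsl:template xmlns:wgt="http://www.w3.org/ns/widgets" match="/wgt:widget">
<xsl:value-of select="@id" />
</xsl:template>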
You don't need XSLT if you're not doing a transform.
If you only need to grab a value use XPath.
There's an xpath program that comes with Perl's XML::XPath module.
From the shell:
ID=$(xpath config.xml 'string(/widget/@id)' )
(The string() function is there to get only the value of the id;
/widget/@id by itself returns "id=value".)
If you only need to produce some other output depending on the value, you could
do it all in xslt. There are also other XPath implementations available from
other scripting languages: I've used Java's XPath from both rhino and Jython.
There's also XQuery from the command line with Saxon.
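As another XPath-only option, xmllint (part of libxml2) can evaluate the expression directly. The sketch below is just one way to do it, assuming xmllint is installed; widget_id is a hypothetical name. It matches the root element by local-name() so that the default namespace on <widget> does not get in the way.

```shell
#!/bin/bash
# widget_id FILE  (hypothetical helper)
# Prints the id attribute of the root <widget> element.
widget_id() {
    # local-name() matches the element whatever namespace it is in,
    # which sidesteps the xmlns declaration on <widget>.
    xmllint --xpath 'string(/*[local-name()="widget"]/@id)' "$1"
}

# Example call:
#   ID=$(widget_id config.xml)
```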