Change XML structure - replace

Hi, I need to do some text manipulation on this piece of XML.
Deleting some tags is no problem, but before that I need to rename the car's id to car_id and move it inside each trip element.
Could this be done with, e.g., the XMLStarlet toolkit?
xmlstarlet somevariable
Original
<car>
<id>155028827</id>
<trip>
<id>1</id>
<date>1.1.1970</date>
</trip>
<trip>
<id>2</id>
<date>1.1.1970</date>
</trip>
</car>
Expected result
<trip>
<car_id>155028827</car_id>
<id>1</id>
<date>1.1.1970</date>
</trip>
<trip>
<car_id>155028827</car_id>
<id>2</id>
<date>1.1.1970</date>
</trip>

I'd say
xmlstarlet ed -i '/car/trip/descendant::node()[1]' -t elem -n car_id -u '/car/trip/car_id' -x 'ancestor::car/id/text()' filename.xml | xmlstarlet sel -t -c '/car/trip'
This falls into two parts:
xmlstarlet ed \
-i '/car/trip/descendant::node()[1]' -t elem -n car_id \
-u '/car/trip/car_id' -x 'ancestor::car/id/text()' \
filename.xml
and
xmlstarlet sel -t -c '/car/trip'
The first is an xmlstarlet ed command, which means that XML goes in, is edited, and edited XML comes out. The edits are
-i '/car/trip/descendant::node()[1]' -t elem -n car_id
which inserts a car_id element before the first descendant of every /car/trip node, and
-u '/car/trip/car_id' -x 'ancestor::car/id/text()'
which sets the value of every /car/trip/car_id node to the text inside the id child of its car ancestor. (The ancestor step has to name car explicitly: trip is also an ancestor of car_id, so an unrestricted ancestor::node() would pick up the trip's own id text as well.) This alone produces
<?xml version="1.0"?>
<car>
<id>155028827</id>
<trip>
<car_id>155028827</car_id>
<id>1</id>
<date>1.1.1970</date>
</trip>
<trip>
<car_id>155028827</car_id>
<id>2</id>
<date>1.1.1970</date>
</trip>
</car>
which is then piped through
xmlstarlet sel -t -c '/car/trip'
This selects (and prints) the /car/trip nodes of this XML data, producing
<trip>
<car_id>155028827</car_id>
<id>1</id>
<date>1.1.1970</date>
</trip><trip>
<car_id>155028827</car_id>
<id>2</id>
<date>1.1.1970</date>
</trip>
You could, if the formatting annoys you, use
xmlstarlet sel -t -c '/car/trip | /car/text()'
to preserve the whitespace between the tags (and get more readably formatted output); with this change, the output is
<trip>
<car_id>155028827</car_id>
<id>1</id>
<date>1.1.1970</date>
</trip>
<trip>
<car_id>155028827</car_id>
<id>2</id>
<date>1.1.1970</date>
</trip>
...with two more blank lines at the top; they're the newlines before and after the /car/id node. Unfortunately, the output is not a well-formed XML document (it has no single root element), so we can't just pipe it through an XML pretty-printer (which is what I'd really like to do). My suspicion is that this will be embedded in further XML (so that it can be properly parsed); if formatting is important, my suggestion is to embed it first and pipe the whole XML through a pretty-printer afterwards.
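The embed-then-pretty-print suggestion could look roughly like this. It is only a sketch: it assumes xmlstarlet is installed, and both the wrapper element name trips and the function name trips_pretty are made up for illustration. The update step uses ancestor::car so that only the car's own id is copied into car_id.

```shell
# Hypothetical helper: wrap the selected <trip> nodes in an assumed <trips>
# root so the result is well-formed, then let "xmlstarlet fo" re-indent it.
trips_pretty() {
  {
    printf '<trips>\n'
    xmlstarlet ed \
      -i '/car/trip/descendant::node()[1]' -t elem -n car_id \
      -u '/car/trip/car_id' -x 'ancestor::car/id/text()' \
      "$1" | xmlstarlet sel -t -c '/car/trip'
    printf '\n</trips>\n'
  } | xmlstarlet fo
}
```

It would be called as trips_pretty filename.xml, with the result redirected wherever the surrounding document needs it.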

Related

Extract specific XMLs from log file

I have large log files (around 50 MB each), which contain Java debug information plus all kinds of XML responses.
Here's an example of something I'm trying to extract from the log
<envelope>
<response>
<ATTR name="uniqueid" value="XYZ_00000-00-00_12345_1"/>
<ATTR name="status" value="Activated"/>
<ATTR name="datecreated" value="2018/10/04 09:39:05"/>
</response>
</envelope>
I need only the XML blocks in which the uniqueid attribute contains "12345" and the status attribute is set to "Activated".
Using sed I'm able to extract all the envelopes, and currently I loop over them and use a regex to check whether both conditions hold inside each one.
sed -n '/<envelope>/,/<\/envelope>/p' logfile
What would be a proper solution to extract what I need from the file?
Thanks!
Assuming your XML is formatted as shown, this should work:
$ awk '/<envelope>/ {line=$0; p=0; s=0; next}
line {line=line ORS $0}
/uniqueid/ && $3~/12345/ {p=1}
/"status"/ && $3~/"Activated"/ {s=1}
/<\/envelope>/ {if (p && s) print line; line=""}' file
At the opening tag it starts accumulating lines; when the uniqueid and status lines match, it sets the two flags; at the closing tag it prints the record if both flags are set, then stops accumulating.
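With gawk you can do this instead:
$ gawk -F'\n' -v RS='</envelope>\n' \
'$3~/uniqueid.*12345/ && $4~/status.*Activated/{print $0, RT}' file
This relies on the uniqueid and status lines being the third and fourth lines of each record, so it only fits input with no other log lines between envelopes. There will be an extra newline after each record, though.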
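To try the filtering idea on a small sample before running it over a 50 MB log, something like the following can help. It is a sketch only: the function name extract_envelopes, the two DEBUG lines and the second envelope are invented test data, and it assumes one tag per line as in the question.

```shell
# Hypothetical helper: print every <envelope> block whose uniqueid line
# contains $1 and whose status value equals $2; $3 is the log file.
extract_envelopes() {
  awk -v id="$1" -v status="$2" '
    /<envelope>/   { buf = $0; inside = 1; has_id = 0; has_status = 0; next }
    inside         { buf = buf ORS $0 }
    inside && /name="uniqueid"/ && index($0, id) { has_id = 1 }
    inside && /name="status"/ && index($0, "value=\"" status "\"") { has_status = 1 }
    /<\/envelope>/ { if (inside && has_id && has_status) print buf; inside = 0 }
  ' "$3"
}

# Made-up sample log: two envelopes, only the first one is Activated.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2018-10-04 09:39:05 DEBUG some java debug line
<envelope>
  <response>
    <ATTR name="uniqueid" value="XYZ_00000-00-00_12345_1"/>
    <ATTR name="status" value="Activated"/>
  </response>
</envelope>
2018-10-04 09:39:06 DEBUG another line
<envelope>
  <response>
    <ATTR name="uniqueid" value="XYZ_00000-00-00_12345_2"/>
    <ATTR name="status" value="Suspended"/>
  </response>
</envelope>
EOF

matches=$(extract_envelopes 12345 Activated "$sample")
rm -f "$sample"
printf '%s\n' "$matches"
```

Passing the id and status as awk variables keeps the program text fixed, so the same function can be reused for other values without editing the regexes.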

Remove content from XML node with sed

My XML input file looks like this:
...
<logos>
<logo name="" primary="true" guid="c6aae8fe-bb04-4067-9b14-18b1bcf940d3" />
<logo name="" primary="false" guid="68b55f4d-f401-4180-b0e0-160974758348" />
</logos>
...
I need to remove the content, keeping the node. Expected output:
<logos></logos>
My command looks like this:
sed -i 's|\(<logos>\)\(.+\)\(</logos>\)|\1\3|gi' $filename
But it ain't working. What am I missing?
Edit: this is not a duplicate of delete node in a xml file with sed : that question is about deleting the whole node. Here I need to delete the content of the node only.
You could use an address range together with the c (change) command (one-line form, GNU sed):
sed -i.bak '/<logos>/,/<\/logos>/c<logos></logos>' "$filename"
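As for what the original attempt was missing: sed works on one line at a time, so <logos> and </logos> never end up in the same pattern space, and in a basic regular expression a bare + is a literal plus sign. The range-plus-c approach sidesteps both. A small sketch on inline sample data, where the surrounding root element is an assumption, using the portable "c\" spelling:

```shell
# Sample input modelled on the question; <root> is an assumed wrapper.
input='<root>
<logos>
<logo name="" primary="true" guid="c6aae8fe-bb04-4067-9b14-18b1bcf940d3" />
<logo name="" primary="false" guid="68b55f4d-f401-4180-b0e0-160974758348" />
</logos>
</root>'

# The range selects every line from <logos> through </logos>; "c" emits the
# replacement text once for the whole range. "c\" followed by a newline is
# the portable form of the one-line GNU variant.
emptied=$(printf '%s\n' "$input" | sed '/<logos>/,/<\/logos>/c\
<logos></logos>')
printf '%s\n' "$emptied"
```

Note that the replacement line does not inherit the indentation of the original <logos> line.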
sed and the like are a poor choice for such cases.
Use a proper XML/HTML parser.
xmlstarlet solution:
Sample input.xml:
<root>
<logos>
<logo name="" primary="true" guid="c6aae8fe-bb04-4067-9b14-18b1bcf940d3"/>
<logo name="" primary="false" guid="68b55f4d-f401-4180-b0e0-160974758348"/>
</logos>
</root>
xmlstarlet ed -O -d '//logos/*' input.xml
The output:
<root>
<logos/>
</root>

Using sed with regex to substitute string

I have this xml data
<institution>
<id>83812745840</id>
<code>2701811200</code>
<full_name>full name 1</full_name>
<address>adress 1</address>
<institution_type>
<id>191</id>
<code>inst code 1</code>
<name>institution name1</name>
</institution_type>
<place>
<id>812007638</id>
<name>place-name_1</name>
<code>415995</code></place>
<activity>
<code>811855905</code>
<name>act-name-1</name>
<equipment_specialty>false</equipment_specialty>
</activity>
</institution>
I need to change <code> to <code_> and <place><name> to <place><name_>. How can this be done with sed and regex?
I tried sed 's/<institution>.*<code>.*<\/code>/<institution>.*<code_>.*<\/code_>/g', but the .* in the replacement string stays a literal .* instead of standing for whatever the regex matched.
The main issue here is not using an XML/HTML parser; you should always use one when dealing with XML/HTML data.
The right way, with the xmlstarlet tool:
xmlstarlet ed -O -r '//institution/code' -v 'code_' -r '//place/name' -v 'name_' input.xml
The output:
<institution>
<id>83812745840</id>
<code_>2701811200</code_>
<full_name>full name 1</full_name>
<address>adress 1</address>
<institution_type>
<id>191</id>
<code>inst code 1</code>
<name>institution name1</name>
</institution_type>
<place>
<id>812007638</id>
<name_>place-name_1</name_>
<code>415995</code>
</place>
<activity>
<code>811855905</code>
<name>act-name-1</name>
<equipment_specialty>false</equipment_specialty>
</activity>
</institution>
To modify the file in place, add the -L option: xmlstarlet ed -O -L ....
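On the narrower question of why .* showed up literally in the sed output: the replacement side of s/// is plain text, and matched text can only be reused through a capture group and a backreference such as \1. The sketch below, on two sample lines taken from the question's data, shows the mechanism and also its limit: sed has no notion of parent elements, so it renames every code element it sees, which is what the XPath-based command above avoids.

```shell
# Two <code> elements with different parents in the original document.
sample='  <code>2701811200</code>
    <code>inst code 1</code>'

# \( ... \) captures the element text and \1 re-inserts it in the replacement.
renamed=$(printf '%s\n' "$sample" | sed 's|<code>\(.*\)</code>|<code_>\1</code_>|')
printf '%s\n' "$renamed"
```

Both lines come out renamed, including the one under institution_type that was supposed to stay as it was.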

Bash using sed to find symbols

I am using sed to parse an XML file from Yahoo Finance. The file contains a lot of uninteresting information plus all global stock symbols, which I want to extract. It's a one-line XML file with a large number of stock symbols, represented like this:
symbol="VALUE"
I am using sed like this:
sed "s/.* symbol=\"\(.*\)\".*/\1/" list_stocksymbols.xml >> ./tmpfile.txt
My output looks like this:
<?xml version="1.0" encoding="UTF-8"?>
WRG.AX
<!-- engine8.yql.bf1.yahoo.com -->
Problem
As you can see, only one symbol is extracted (WRG.AX).
Question
How would I go about getting sed to write out all symbols?
I tried
sed "s/.* symbol=\"\(.*\)\".*/\1/g" list_stocksymbols.xml >> ./tmpfile.txt
with the global flag, but it didn't work.
XML file extract
<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="215" yahoo:created="2014-08-22T09:05:59Z" yahoo:lang="en-US">
<results><industry id="112" name="Agricultural Chemicals">
<company name="Adarsh Plant Protect Ltd" symbol="ADARSHPL.BO"/>
<company name="Agrium Inc" symbol="AGU.DE"/><company name="Agrium Inc" symbol="AGU.TO"/>
<company name="Agrium Inc." symbol="AGU"/>
<company name="Aimco Pesticides Ltd" symbol="AIMCO.BO"/>
<company name="American Vanguard Corp." symbol="AVD"/>
... and so on. The actual file is a single line, not formatted as above.
Perl regex attempt
perl -nle'print $& if m{(?<=symbol=")[^"]+}' list_stocksymbols
This also printed only the first occurrence.
grep -Eo 'symbol="[^"]+' yahoo.txt | cut -c 9-
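This also works with grep versions that lack Perl regex support (as on Mac OS X in your case).
With sed alone, the following works only when each line holds at most one symbol, because the greedy .* skips ahead to the last symbol= on the line:
sed 's/.*symbol="//;s/".*//' yahoo.txt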
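For a file that really is a single line, the difference between the two commands is easy to see on a small sample. The sketch below builds a one-line string from three of the company entries quoted in the question (the variable names are arbitrary) and runs both approaches on it.

```shell
# One-line sample built from entries quoted in the question.
xml='<results><company name="Agrium Inc" symbol="AGU.DE"/><company name="Agrium Inc" symbol="AGU.TO"/><company name="Agrium Inc." symbol="AGU"/></results>'

# Greedy: ".*" runs to the LAST symbol=" on the line, so one value survives.
greedy=$(printf '%s\n' "$xml" | sed 's/.*symbol="//;s/".*//')

# grep -o prints every match on its own line; cut drops the
# 8-character prefix symbol=" from each of them.
all=$(printf '%s\n' "$xml" | grep -Eo 'symbol="[^"]+' | cut -c 9-)

printf 'sed:  %s\n' "$greedy"
printf 'grep:\n%s\n' "$all"
```

The sed pipeline yields only AGU, the last symbol on the line, while the grep pipeline yields AGU.DE, AGU.TO and AGU, one per line.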

Multi-line find (grepping) and replace text in XML file using perl

Here I am trying to find and replace text content in an XML file using a Perl regular expression.
Sample XML Code:
<root>
<add>
<st>xxxx</st>
<pin>xxx</pin>
</add>
</root>
Now I want to find (grep) the text from <add> to </add> and replace it with <xyz>xxx</xyz>:
<add>
<st>xxxx</st>
<pin>xxx</pin>
</add>
Note:
If the content above were on a single line, i.e. without line breaks between <add> and </add>, as in <add><st>xxxx</st><pin>xxx</pin></add>, I could use <add>(.*)<\/add> to find it.
Thanking You
Thirusanguraja V
Using XML::XSH2, a wrapper around XML::LibXML:
open input.xml ;
$add = /root/add ;
delete $add/* ;
insert element xyz into $add ;
insert text 'xxx' into $add/xyz ;
save :b ;