Using sed with regex to substitute string - regex

I have this xml data
<institution>
<id>83812745840</id>
<code>2701811200</code>
<full_name>full name 1</full_name>
<address>adress 1</address>
<institution_type>
<id>191</id>
<code>inst code 1</code>
<name>institution name1</name>
</institution_type>
<place>
<id>812007638</id>
<name>place-name_1</name>
<code>415995</code></place>
<activity>
<code>811855905</code>
<name>act-name-1</name>
<equipment_specialty>false</equipment_specialty>
</activity>
</institution>
I need to change <code> to <code_> and <place><name> to <place><name_>. How can this be done with sed and regex?
I tried sed 's/<institution>.*<code>.*<\/code>/<institution>.*<code_>.*<\/code_>/g', but the .* in the replacement string is inserted literally as .* instead of the text that the regex matched.

The main issue here is that no XML/HTML parser is being used; you should always use one when dealing with XML/HTML data.
The right way, with the xmlstarlet tool:
xmlstarlet ed -O -r '//institution/code' -v 'code_' -r '//place/name' -v 'name_' input.xml
The output:
<institution>
<id>83812745840</id>
<code_>2701811200</code_>
<full_name>full name 1</full_name>
<address>adress 1</address>
<institution_type>
<id>191</id>
<code>inst code 1</code>
<name>institution name1</name>
</institution_type>
<place>
<id>812007638</id>
<name_>place-name_1</name_>
<code>415995</code>
</place>
<activity>
<code>811855905</code>
<name>act-name-1</name>
<equipment_specialty>false</equipment_specialty>
</activity>
</institution>
To modify the file in place, add the -L option: xmlstarlet ed -O -L ....
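As for why the original attempt behaved that way: in the replacement part of s///, .* is plain text. Matched text can only be reused through a capture group \(...\) and a back-reference such as \1. The following is only a sketch to illustrate that, not a replacement for the parser approach. It assumes one tag per line, as in the sample, and the trimmed input below is made up for the illustration.

```shell
# Trimmed sample input, one tag per line as in the question.
xml='<institution>
<code>2701811200</code>
<institution_type>
<code>inst code 1</code>
<name>institution name1</name>
</institution_type>
<place>
<name>place-name_1</name>
<code>415995</code></place>
</institution>'

# \(...\) captures the element text; \1 reuses it in the replacement.
# Each range address limits its substitution to part of the file:
#   <place> .. </place>                 for <name>
#   <institution> .. <institution_type> for the first <code>
result=$(printf '%s\n' "$xml" | sed \
  -e '/<place>/,/<\/place>/ s|<name>\(.*\)</name>|<name_>\1</name_>|' \
  -e '/<institution>/,/<institution_type>/ s|<code>\(.*\)</code>|<code_>\1</code_>|')
printf '%s\n' "$result"
```

This breaks as soon as the layout changes (several tags on one line, attributes, comments), which is the reason to prefer xmlstarlet.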

Related

Why is my regex failing to select the correct elements, when it works on the online regex tester

I have a number of XML files that have HTML embedded in a node. I need to capture everything that is not a tag and add some non-HTML tags (for Moodle) around the text.
I'm processing the files from the command line, using a bash script. I'm using xpath to get the content, piping it through xargs to sneakily rip out newlines, and then piping it through sed.
Here's a sample of the tag:
xpath -q -e '/activity/page/content' page.xml|xargs
<content><h3 style=float:right><img
src=##PLUGINFILE##/consumables.png> </h3> <h3>TITLE</h3>
<p>In order to conduct an LE5 drug test you need a Druglizaer
(batch controlled) foil pouch that contains two items:</p>
<p></p> <ol> <li><span style=font-
weight:900>Druglizer Cartridge</span></li><li><span
style=font-weight:900>Druglizer Oral Fluid
Collector</span></li> </ol> <p></p></content>
On https://regex101.com/ I used \>(.*?)\< which groups the text as expected, but when I run it with sed it doesn't do any substitutions.
#!/bin/bash
# get new name string
name=$(xpath -q -e '/activity/page/name' page.xml);
en=$(echo $name|sed -e 's/<[^>]*>//g');
vi=$(echo $en|trans -brief -t vi);
cn=$(echo $en|trans -brief -t zh-CN);
mlang_name=$(echo "{mlang en}$en{mlang}{mlang
vi}$vi{mlang}{mlang
zh_cn}$cn{mlang}")
# xmlstarlet to update node
# get new content string
content=$(xpath -q -e '/activity/page/content' page.xml);
# \>(.*?)\<
mlang_name=$(echo $content|sed -e 's/\>(.*?)\</\{mlang
en\}$1\{mlang\}\{mlang
vi\}#VI#\{mlang\}\{mlang
zh_cn\}#CN#\{mlang\}/g')
# xmlstarlet to update node
I need the replacement to put {mlang en}TEXT{mlang} around the text.
I ended up using perl, as it supports the non-greedy syntax I was using.
perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'
With the above file, the full command I used was
content=$(xpath -q -e '/activity/page/content' page.xml);echo $content|xargs|sed -e 's/<|<content>//g'|sed -e 's|</content>||g' |perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'|sed -e 's/{mlang en}[\ ]*{mlang}//g'|sed -e 's/<content>//g'
Which gave the following output
<h3 style=float:right><img src=##PLUGINFILE##/consumables.png></h3><h3>{mlang en}TITLE{mlang}</h3><p>{mlang en}In order to conduct an LE5 drug test you need a Druglizaer (batch controlled) foil pouch that contains two items:{mlang}</p><p></p><ol><li><span style=font-weight:900>{mlang en}Druglizer LE5 Cartridge{mlang}</span></li><li><span style=font-weight:900>{mlang en}Druglizer Oral Fluid Collector{mlang}</span></li></ol><p></p>
If there's a more elegant way feel free to let me know.
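A likely reason the sed attempt matched nothing: sed's regex dialects have no non-greedy *?, in GNU sed \< and \> are word-boundary anchors rather than literal angle brackets, and $1 is Perl syntax (sed uses \1). A sed-only sketch that avoids all three by using a negated character class is shown below. The input is a shortened, made-up stand-in for the real content.

```shell
# Shortened stand-in for the content node.
input='<h3>TITLE</h3><p>Some text</p><p></p>'

# [^<>][^<>]* matches one or more characters that are not part of a tag,
# so the match cannot run past the next "<" (no non-greedy operator needed).
out=$(printf '%s\n' "$input" \
  | sed 's/>\([^<>][^<>]*\)</>{mlang en}\1{mlang}</g')
echo "$out"
```

Text made only of spaces between two tags would still be wrapped by this pattern, so a clean-up step like the one in the pipeline above may still be needed.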

Regex to extract http links from an XML file

I have an xml file with many lines like:
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
How do I extract just the link - http://store.vcenter.com/stores/en/product/tigers-midi/100?
I tried http://www\.\.com[^<]+ but that captures everything until the end of the line, including quotes and closing XML tags.
I'm using this expression with egrep.
Don't parse HTML with regex, use a proper XML/HTML parser.
Check: Using regular expressions with HTML tags
You can use one of the following :
xmllint
xmlstarlet
saxon-lint
File:
<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>
Example with xmllint :
xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null
Output:
href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
If you need a quick & dirty one time command, you can do:
egrep -o 'https?://[^"]+' file
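The difference between the attempted pattern and the quick one comes down to the character class: [^<] only stops at a "<", while [^"] stops at the quote that closes the attribute value. A small sketch of that, using a simplified version of the attempted pattern (https?://[^<]+ is an assumption made for the illustration) and grep -oE, which is the same as egrep -o:

```shell
line='<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />'

# [^<]+ only stops at "<", so it runs past the closing quote to the end of the tag.
too_much=$(printf '%s\n' "$line" | grep -oE 'https?://[^<]+')

# [^"]+ stops at the closing quote of the attribute value.
url=$(printf '%s\n' "$line" | grep -oE 'https?://[^"]+')

printf '%s\n%s\n' "$too_much" "$url"
```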

Retrieving Value from XML using grep and regular expressions

I have the below response being returned from my build system. The build generates multiple artifacts and I want to extract the link to a particular artifact from the response below, say something.exe.
<Artifacts>
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
</Artifacts>
I would like to know a way to extract just the URL for something.exe. I have tried piping the curl output into grep -E with a regular expression, but that gives me the entire line instead.
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url='
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url="http\S+OVF10.ova"'
Is there a way to just extract the following ?
http://build.system.org/path/to/artifact/folder/something.exe
The right way would be to use XML tools in this case, such as xmlstarlet.
But that, of course, requires a valid XML structure. A valid XML structure would look like:
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
The command:
xmlstarlet sel -t -v "//artifact/file[contains(@url,'something.exe')]/@url" -n xmlfile
The output:
http://build.system.org/path/to/artifact/folder/something.exe
-v option (or --value-of ) - print value of XPATH expression
The XPATH contains() function returns true if the first argument string contains the second argument string, and otherwise returns false.
As RomanPerekhrest said, use an xml parser for this kind of task. For your example input you could use xmlstarlet like this:
xml sel -t -m 'Artifacts/artifact/file [contains(@url, "something.exe")]' -v @url
Output:
http://build.system.org/path/to/artifact/folder/something.exe
This regex should work: ([\w\d\s]*\.exe)"\/> (it searches for a string of the form somename.exe"/>, where somename must consist of word characters (letters, digits, "_") or whitespace).
$ regex='([\w\d\s]*\.exe)"\/>'
$ echo "$input" | grep -oP "$regex"
Though, as someone mentioned above, you shouldn't use regex to parse XML; use an XML parser.
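If no XML tool is at hand, grep can still return only the URL rather than the whole line, because -o prints just the matched text. This is a rough sketch that assumes the URL never contains a double quote; the response below is shortened from the one in the question.

```shell
# Shortened response: several <file> elements on one line, as in the question.
response='<file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>'

# -o prints only the matched text; [^"]* keeps the match inside one attribute value.
url=$(printf '%s\n' "$response" | grep -oE 'http://[^"]*/something\.exe')
echo "$url"
```

In the pipeline from the question, this grep would take the place of grep -E 'file url='.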

Replacing tags in xml in bash

I have an xml file that is of the following format:
<list>
<version>1.5</version>
<version>1.4</version>
<version>1.3</version>
<version>1.2</version>
</list>
The idea is that I always update the first version tag with a new version. And when I do so, I replace the subsequent tags.
For example, when I update the 1.6 version as the first tag (which I know how to do), the following tags would be:
<list>
<version>1.6</version>
<version>1.5</version>
<version>1.4</version>
<version>1.3</version>
</list>
I've tried to get two options going.
First Option:
My preferred option would be to search the xml file and replace the version tag i+1 with version tag i. Something like:
sed -E '2,/<version>.*<\/version>/s#<version>(.*)</c>#<version>\1</version>#' file.xml
Where I search for the second instance of version and replace it with the first instance of version (currently not working).
Second Option:
My second option would be to store the version tags in variables like:
version=$(grep -oPm1 "(?<=version>)[^<]+" file.xml)
version2=$(grep -oPm2 "(?<=version>)[^<]+" file.xml)
Then replace version 2 by version 1 and do the replacement:
sed -i "s/${version2}/${version}/g" file.xml
However, this option gives:
sed: -e expression #1, char 9: unterminated 's' command.
And when I try:
sed -i "/$version2/s/${version2}/${version}/g" file.xml
I get:
unterminated address regex
Obviously, the idea for either option would be to put the code in a loop so that I can run it i times. However, I am stuck and both options I've tried don't work.
Don't use text-manipulation tools such as awk or sed to work with XML if you can at all avoid it. While this specific subset may be so simple as to make the approach feasible, having the right tools at hand will avoid headaches later (if the file format gets extended; if someone adds comments to the front; etc).
new_version=1.6
xmlstarlet ed \
-d '/list/version[last()]' \
-i '/list/version[1]' -t elem -n version -v "$new_version" \
<old.xml >new.xml
-d '/list/version[last()]' deletes the last version entry in the list.
-i '/list/version[1]' -t elem -n version -v 1.6 introduces a new element named version, with the value 1.6, in the position currently held by the very first version.
Use ! or # as the separator in sed instead of /.
The command breaks if your match or replacement variables contain /.
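Another possible cause of both errors in the second option, given the sample file: grep -m2 stops after two matching lines, so version2 would hold two values separated by a newline, and a newline inside an s command or an address makes sed report it as unterminated. The sketch below picks out exactly one value per variable. It uses sed for the extraction instead of grep -P, which is a substitution made here for portability and not what the question used.

```shell
xml='<list>
<version>1.5</version>
<version>1.4</version>
<version>1.3</version>
</list>'

# Print every version value, one per line.
versions=$(printf '%s\n' "$xml" | sed -n 's|.*<version>\([^<]*\)</version>.*|\1|p')

# Keep exactly one line in each variable, so neither contains a newline.
version=$(printf '%s\n' "$versions" | sed -n '1p')
version2=$(printf '%s\n' "$versions" | sed -n '2p')
echo "$version $version2"
```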

Sed get xml attribute value

I have the following XML file:
<AutoTest>
<Source>EBS FX</Source>
<CreateCFF>No</CreateCFF>
<FoXML descriptor="pb.fx.spotfwd.trade.feed" version="2.0">
<FxSpotFwdTradeFeed>
<FxSpotFwd feed_datetime="17-Dec-2014 10:20:09"
cpty_sds_id="EBS" match_id="L845586141217" original_trade_id_feed="L80107141217"
value_date="20141218" trade_id_external="001-002141880445/5862" match_sds_id="EBSFeedCpty"
counter_ccy="USD" trade_id_feed="107" trade_type="S" feed_source_id="80" quoting_term="M"
deal_ccy="GBP" rate="1.5" trade_date="20141217" modified_by="automation" cpty_side="B" counter_amt="1500000"
smart_match="0" booking_status_id="10" trade_status_id="22" deal_amt="1000000" trade_direction="B">
<Notes />
</FxSpotFwd>
</FxSpotFwdTradeFeed>
<TestCases />
</FoXML>
</AutoTest>
How can I get the value of the trade_id_external attribute using sed?
I tried this expression: sed -n '/trade_id_external/s/.*=//p' ./file.xml
but had no luck.
You don't even need the /trade_id_external/ address before the s///.
$ sed -n 's/.*trade_id_external="\([^"]*\).*/\1/p' file
001-002141880445/5862
In basic sed, \(...\) is a capturing group; it captures the characters you want to print at the end, and \1 refers to them.
With grep:
$ grep -oP 'trade_id_external="\K[^"]*' file
001-002141880445/5862
-P turns on Perl-regex mode in grep, so any PCRE regex can be used. \K in the above regex discards the previously matched characters; that is, the text matched by the part of the pattern before \K is not included in the output.
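To see why the expression from the question printed the wrong thing: .* is greedy, so .*= consumes everything up to the last = on the line, not the one after trade_id_external. A short sketch comparing both commands on one line of the file (the line is shortened here for readability):

```shell
line='value_date="20141218" trade_id_external="001-002141880445/5862" match_sds_id="EBSFeedCpty"'

# Original attempt: greedy .* runs to the LAST "=" on the line.
attempt=$(printf '%s\n' "$line" | sed -n '/trade_id_external/s/.*=//p')

# Anchor on the attribute name and capture up to the closing quote.
value=$(printf '%s\n' "$line" | sed -n 's/.*trade_id_external="\([^"]*\).*/\1/p')

printf '%s\n%s\n' "$attempt" "$value"
```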