How to find properties matching a certain pattern using xmllint

I am trying to extract a value in a shell script using xmllint. I was able to find and extract values by matching complete key strings.
The problem is that for some values I only know what the key starts with.
For example, let part of the XML be:
<property>
<name>foo.bar.random_part_of_name</name>
<value> SOME_VALUE</value>
</property>
I want to extract this entire segment and write it to an output file.
So far, I have been able to match complete segments with
if (xmllint --xpath '//property[name/text()="foo.bar"]/value/text()' "$INPUT_FILE"); then
value=$(xmllint --xpath '//property[name/text()="foo.bar"]/value/text()' "$INPUT_FILE")
echo "<property><name>foo.bar</name><value>$value</value></property>">> $OUTPUT_FILE
fi
Thanks in advance

XPath 1.0 offers the starts-with(string, prefix) function to do what you want:
name="foo.bar"
value=$(xmllint --xpath "//property[starts-with(name,'$name')]/value/text()" test.xml)
if [ -n "$value" ]; then
echo "<property><name>$name</name><value>$value</value></property>"
fi
Result:
<property><name>foo.bar</name><value> SOME_VALUE</value></property>
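
The snippet above prints the prefix rather than the full property name. A minimal sketch, assuming the goal is to copy the whole matching <property> element to the output file: selecting the element node itself (without /value/text()) makes xmllint print it serialized. The sample file, the other.key property, and the variable names are hypothetical, and the xmllint call is skipped when the tool is not installed.

```shell
#!/bin/bash
# Hypothetical sample input; the second property only shows that non-matching keys are skipped.
input_file=$(mktemp)
output_file=$(mktemp)
cat > "$input_file" <<'EOF'
<configuration>
<property>
<name>foo.bar.random_part_of_name</name>
<value> SOME_VALUE</value>
</property>
<property>
<name>other.key</name>
<value>OTHER</value>
</property>
</configuration>
EOF

prefix="foo.bar"
if command -v xmllint >/dev/null 2>&1; then
    # xmllint exits non-zero when the XPath matches nothing,
    # so the assignment doubles as the existence check.
    if segment=$(xmllint --xpath "//property[starts-with(name,'$prefix')]" "$input_file" 2>/dev/null); then
        printf '%s\n' "$segment" >> "$output_file"
    fi
else
    echo "xmllint is not installed; nothing extracted" >&2
fi
cat "$output_file"
```

This also avoids running xmllint twice, once for the test and once for the value.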

Related

Using sed to replace multiline xml

I'm trying to use sed to edit/change a xml file, but I'm having problems with multilines
The file I want to change has (extract)
<keyStore>
<location>repository/resources/security/apimanager.jks</location>
<password>wso2carbon</password>
</keyStore>
I want to change the password (and only the keyStore password, the file has another password tag)
I'm trying
sed -i 's/\(<keyStore.*>[\s\S]*<password.*>\)[^<>]*\(<\/password.*>\)/\1$WSO2_STORE_PASS\2/g' $WSO2_PATH/$1/repository/conf/broker.xml
but it's not working (nothing changes; the pattern is not found).
If I test the pattern in an online tester (https://regex101.com/) it seems to work fine.
Also, I have tried to replace the [\s\S]* with [^]*, but in this case sed generates a syntax error.
I'm using Ubuntu 16.04.1.
Any suggestions are welcome.

Parsing XML with regular expressions is always going to be problematic, as XML is not a regular language. Instead, you can use a proper XML parser, for example with XMLStarlet:
xmlstarlet ed --inplace -u "//keyStore/password" -v "$WSO2_STORE_PASS" "$WSO2_PATH/$1/repository/conf/broker.xml"

Sed is not the tool for the job. Use an XML-aware tool, for example xsh:
open { shift } ;
insert text { shift } replace //keyStore/password/text() ;
save :b ;
Run as
xsh script.xsh "$WSO2_PATH/$1/repository/conf/broker.xml" "$WSO2_STORE_PASS"
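
If no XML-aware tool can be installed, a sed address range is a fragile fallback. The sketch below assumes the keyStore block is laid out on separate lines exactly as in the question's extract; the sample file, the trustStore block, and the new password value are hypothetical. It also shows two reasons the original attempt could not work: sed processes one line at a time, and single quotes stop the shell from expanding $WSO2_STORE_PASS.

```shell
#!/bin/bash
# Hypothetical sample file mirroring the extract in the question.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<broker>
<keyStore>
<location>repository/resources/security/apimanager.jks</location>
<password>wso2carbon</password>
</keyStore>
<trustStore>
<password>wso2carbon</password>
</trustStore>
</broker>
EOF

WSO2_STORE_PASS='newSecret'   # illustrative value

# Double quotes let the shell expand ${WSO2_STORE_PASS}.
# The address range /<keyStore>/,/<\/keyStore>/ limits the substitution
# to the lines between the opening and closing keyStore tags.
# Caveat: a password containing |, & or \ would need escaping first.
sed "/<keyStore>/,/<\/keyStore>/ s|<password>[^<]*</password>|<password>${WSO2_STORE_PASS}</password>|" "$conf" > "$conf.new"
cat "$conf.new"
```

Writing to a second file instead of using -i keeps the sketch portable across sed implementations. It breaks as soon as the tags share a line or carry attributes, which is why the XML tools above remain the safer choice.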

Retrieving Value from XML using grep and regular expressions

I have the below response being returned from my build system. The build generates multiple artifacts and I want to extract the link to a particular artifact from the response below, for example something.exe.
<Artifacts>
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
</Artifacts>
I would like to know a way to extract just the URL for something.exe. I have tried piping the curl output into grep -E with a regular expression, but that gives me the entire line instead.
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url='
curl -s --request GET http://build.system.org/path/to/artifact/folder/api/?build=13321123 | grep -E 'file url="http\S+OVF10.ova"'
Is there a way to extract just the following?
http://build.system.org/path/to/artifact/folder/something.exe

The right way would be to use XML tools in this case, such as xmlstarlet.
But that, of course, requires a valid XML structure. A valid XML structure would look like:
<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07"
endtime="2017-04-21 00:59:54.680601-07"
status="succeeded"
change="e850b01967222464ffca02bf94dc711236fa978a"
released="no">
<file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>
The command:
xmlstarlet sel -t -v "//artifact/file[contains(@url,'something.exe')]/@url" -n xmlfile
The output:
http://build.system.org/path/to/artifact/folder/something.exe
-v option (or --value-of ) - print value of XPATH expression
The XPATH contains() function returns true if the first argument string contains the second argument string, and otherwise returns false.

As RomanPerekhrest said, use an xml parser for this kind of task. For your example input you could use xmlstarlet like this:
xml sel -t -m 'Artifacts/artifact/file [contains(@url, "something.exe")]' -v @url
Output:
http://build.system.org/path/to/artifact/folder/something.exe

This regex should work: ([\w\d\s]*\.exe)"/> (it matches a file name ending in .exe that is followed by "/>, where the file name must consist of letters, digits, underscores, or whitespace).
$ regex='([\w\d\s]*\.exe)"/>'
$ echo "$input" | grep -oP "$regex"
Though, as someone mentioned above, you shouldn't use regex to parse xml, use xml parsers.
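
For completeness, a minimal sketch of the grep approach the question was attempting, assuming the response keeps each URL inside a double-quoted url attribute. The -o flag is what the original commands were missing: it prints only the matched text instead of the whole line. The response is held in a variable here rather than fetched with curl.

```shell
#!/bin/bash
# Fragment of the response from the question, stored in a variable for the demo.
response='<file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>'

# -o prints only the matched text; [^"]* cannot cross a closing quote,
# so the match stays inside a single url="..." attribute.
url=$(printf '%s\n' "$response" | grep -oE 'https?://[^"]*/something\.exe')
echo "$url"
```

With the sample above this prints http://build.system.org/path/to/artifact/folder/something.exe. Like any regex over markup, it depends on the exact quoting style of the response.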

Parsing HTML page using bash

I have an HTML page and I'm trying to parse it.
Source ::
<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td>0</td><td>0</td><td>-</td><td>0</td><td>0</td><td></td><td>0</td><td>0</td><td>-</td><td><u>0<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>0</td></tr><tr><th colspan=3>Avg over last 1024 success. conn.</th></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td>0</td><td>ms</td></tr></table></div></u></td><td>0</td><td>?</td><td>0</td><td>0</td><td></td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 0 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')</div></u></td><td class=ac>1</td><td class=ac>Y</td><td class=ac>-</td><td><u>1<div class=tips>Failed Health Checks</div></u></td><td>1</td><td>17h12m</td><td class=ac>-</td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td>0</td><td>0</td><td></td><td>1</td><td>24</td><td></td><td>29</td><td>41</td><td>200</td><td><u>5<span class="rls">4</span>033<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>5<span class="rls">4</span>033</td></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td><span class="rls">6</span>094</td><td>ms</td></tr></table></div></u></td><td>5<span class="rls">4</span>033</td><td>1s</td><td><span class="rls">4</span>89<span class="rls">1</span>000</td><td>1<span class="rls">8</span>11<span class="rls">6</span>385<div class=tips>compression: in=0 out=0 bypassed=0 savings=0%</div></td><td>0</td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 54004 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m UP</td><td class=ac> </td><td class=ac>1</td><td class=ac>1</td><td class=ac>0</td><td class=ac> </td><td>0</td><td>0s</td><td></td></tr></table><p>
What I want is ::
172.29.219.17 L7TOUT in 1001ms
So what I'm trying right now is ::
grep redis index.html | grep 'a name=\"redis\/[0-9]*.*\"'
to extract the IP address.
But the regex doesn't pick out only the first row; it returns both rows, whereas the IP is only in row 1.
I've double-checked the regex I'm using but it doesn't seem to work.
Any ideas?

Using XPath expressions in xmllint with its built-in HTML parser, the IP address can be extracted as
ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
172.29.219.17
and for the timeout value, I manually counted the position of the td cell containing it, which turned out to be 24
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html
produces an output as
L7TOUT in 1001ms
Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')
removing the whitespaces and extracting out only the needed parts with Awk as
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}'
L7TOUT in 1001ms
put in a variable as
timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')
Now you can print both the values together as
echo "${ipAddr} ${timeOut}"
172.29.219.17 L7TOUT in 1001ms
version details,
xmllint --version
xmllint: using libxml version 20902
Also, there is a stray closing tag in your HTML input file, </table> at the end just before <p>, which xmllint reports as
htmlfile:147: HTML parser error : Unexpected end tag : table
Remove it before further testing.

Here is a list of command line tools that will help you parse different formats via bash; bash is extremely powerful and useful.
JSON utilize jq
XML/HTML utilize xq
YAML utilize yq
CSS utilize bashcss
I have tested all the tools except bashcss; please comment if you have tried it.
If the code starts getting truly complex, you might consider the naive answer below, as languages with class support will assist.
naive - Old Answer
Parsing complex formats like JSON, XML, HTML, CSS, YAML, ...ETC is extremely difficult in bash and likely error prone. Because of this I recommend one of the following:
PHP
RUBY
PYTHON
GOLANG
because these languages are cross platform and have parsers for all the above listed formats.

If you want to parse HTML with regexes, then you have to make assumptions about the HTML formatting. E.g. you assume here that the a tag and its name attribute are on the same line. However, this is perfectly valid HTML too:
<a
name="redis/172.29.219.17">
Some text
</a>
Anyway, let's solve the problem assuming that the a tags are on one line and that name is the first attribute. This is what I could come up with:
sed 's/\(<a name="redis\)/\n\1/g' index.html | grep '^<a name="redis\/[0-9.]\+"' | sed -e 's/^<a name="redis\///g' -e 's/".*//g'
Explanation:
The first sed command makes sure that all <a name="redis text goes to a separate line.
Then the grep keeps only those lines that start with <a name="redis/ followed by digits and dots and a closing ".
The last sed contains two expressions:
The first expression removes the leading <a name="redis/ text
The last expression removes everything that comes after the closing "
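
Under the same assumptions (the a tag on one line, name as its first attribute), the pipeline can be sketched more compactly with grep -o, which avoids relying on \n in a sed replacement. The two shortened rows below are a hypothetical stand-in for index.html.

```shell
#!/bin/bash
# Shortened stand-in for the two rows in the question (hypothetical sample).
html='<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td></tr>'

# grep -o prints only the matching text, so the rest of the row is dropped.
# [0-9.]+ requires at least one digit or dot, which rejects "Backend".
# cut keeps what follows the slash; tr strips the closing quote.
ip=$(printf '%s\n' "$html" | grep -oE '<a name="redis/[0-9.]+"' | cut -d/ -f2 | tr -d '"')
echo "$ip"
```

The original attempt matched both rows because [0-9]* also matches zero digits and the following .* accepts anything, so "redis/Backend" satisfied the pattern.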

Extract parent url from header with bash script

I have spent some time trying to extract part of a text (a header) in a bash script, but I couldn't. This is the string I have:
link: <https://api.some.com/v1/monitor/zzsomeLongIdzz?access_token=xxSomeLongTokenxx==>; rel="monitor",<https://api.some.com/v1/services/xx/something-more/accounts/2345?access_token=xxSomeLongTokenxx==>; rel="parent"
Here it is with some formatting so you can see it better:
link:
<https://api.some.com/v1/monitor/zzsomeLongIdzz?access_token=xxSomeLongTokenxx==>; rel="monitor",
<https://api.some.com/v1/services/xx/something-more/accounts/2345?access_token=xxSomeLongTokenxx==>; rel="parent"
I need the second part, just the url, basically the values between ,< and >; rel="parent" and assign that to a variable, like:
my_url = $(echo $complete_header) <== some way to filter that
I have no idea how to apply a regex or pattern to extract the data I need. In the past I have used jq for filtering JSON responses, like this:
error_message=$(echo $response | jq '.["errors"]|.[0]|.["message"]')
But unfortunately for me, this is not JSON. Could somebody point me in the right direction with that?

Use the following code:
#!/bin/bash
link='<https://api.some.com/v1/monitor/zzsomeLongIdzz?access_token=xxSomeLongTokenxx==>; rel="monitor",<https://api.some.com/v1/services/xx/something-more/accounts/2345?access_token=xxSomeLongTokenxx==>; rel="parent"'
re=",<([^>]+)>"
# look for ,< literally
# capture everything that is not a > one or more times ([^>]+)
# look for the closing > literally
my_url="test"
if [[ $link =~ $re ]]; then my_url=${BASH_REMATCH[1]}; fi
echo "$my_url"
See a demo on ideone.com.
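
The pattern above returns whatever follows the first ",<", which happens to be the parent link in this header. A minimal variation, assuming the target is always written as <url>; rel="parent" with exactly that spacing, keys on the rel value instead of on position:

```shell
#!/bin/bash
link='<https://api.some.com/v1/monitor/zzsomeLongIdzz?access_token=xxSomeLongTokenxx==>; rel="monitor",<https://api.some.com/v1/services/xx/something-more/accounts/2345?access_token=xxSomeLongTokenxx==>; rel="parent"'

# Capture the <...> target that is immediately followed by ; rel="parent".
# [^>]+ cannot cross a closing >, so the monitor link cannot match.
re='<([^>]+)>; rel="parent"'
my_url=""
if [[ $link =~ $re ]]; then
    my_url=${BASH_REMATCH[1]}
fi
echo "$my_url"
```

Keeping the regex in a variable and leaving it unquoted on the right of =~ lets bash treat it as a pattern rather than a literal string.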

regular expression shell script save specific part of file to variable

I need help with a command to handle the following situation.
I searched the net for help but couldn't find a solution.
The problem is:
I have an XML file named "temp.xml".
The text in the XML file is, for example:
<?xml version="1.0" encoding="utf-8"?>
<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />
I want to save only the Note_Nbr value (312) to my variable x.
I tried something but it doesn't work.
X=$(sed -r 's/[^|]*(Note_Nbr=")(\d\d\d)([^|]*)/2 /' temp.xml )
Thank you for your help.

The right way to do this is with a real XML parser:
x=$(xmllint --xpath 'string(/NoteData/@Note_Nbr)' test.xml)
...or, if you have XMLStarlet rather than a new enough xmllint:
x=$(xmlstarlet sel -t -m '/NoteData' -v @Note_Nbr -n <test.xml)
See also this answer: https://stackoverflow.com/a/1732454/14122
Now, if you only wanted to work with literal strings, you could build something fragile that looked like this, using parameter expansion:
s='<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />'
s=${s#* Note_Nbr=\"}; s=${s%%\"*}; echo "$s"
Alternately, you could use native regular expression support within bash (note that this functionality is a bash extension not present in POSIX sh):
s='<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />'
re='Note_Nbr="([^"]+)"'
if [[ $s =~ $re ]]; then
match="${BASH_REMATCH[1]}"
else
echo "ERROR: No match found" >&2
fi
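
For reference, a sketch of how the sed attempt from the question could be repaired, with the same fragility caveat as the other string-based approaches. The original failed because sed's regex syntax has no \d and because the replacement used a plain 2 where the back-reference \2 was intended. The temporary file below is a hypothetical copy of temp.xml.

```shell
#!/bin/bash
# Hypothetical copy of temp.xml from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<NoteData Note_Nbr="312" Data_Revision="2" Note_Revision="1" />
EOF

# -n together with the p flag prints only lines where the substitution
# succeeded, so the XML declaration line produces no output.
# [0-9]* stands in for \d; \1 refers to the captured digits.
x=$(sed -n 's/.*Note_Nbr="\([0-9]*\)".*/\1/p' "$tmp")
echo "$x"
```

With the sample file this prints 312.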
