Using sed to replace multiline xml - regex

I'm trying to use sed to edit an XML file, but I'm having problems matching across multiple lines.
The file I want to change has (extract)
<keyStore>
<location>repository/resources/security/apimanager.jks</location>
<password>wso2carbon</password>
</keyStore>
I want to change the password (and only the keyStore password, the file has another password tag)
I'm trying
sed -i 's/\(<keyStore.*>[\s\S]*<password.*>\)[^<>]*\(<\/password.*>\)/\1$WSO2_STORE_PASS\2/g' $WSO2_PATH/$1/repository/conf/broker.xml
but it's not working (nothing changes; the pattern is not found).
If I test the pattern in an online tester (https://regex101.com/), it seems to work fine.
I have also tried replacing [\s\S]* with [^]*, but in that case sed reports a syntax error.
I'm using Ubuntu 16.04.1.
Any suggestions are welcome.

Parsing XML with regular expressions is always going to be problematic, as XML is not a regular language. Instead, you can use a proper XML parser, for example with XMLStarlet:
xmlstarlet ed --inplace -u "//keyStore/password" -v "$WSO2_STORE_PASS" "$WSO2_PATH/$1/repository/conf/broker.xml"

Sed is not the tool for the job. Use an XML-aware tool, for example xsh:
open { shift } ;
insert text { shift } replace //keyStore/password/text() ;
save :b ;
Run as
xsh script.xsh "$WSO2_PATH/$1/repository/conf/broker.xml" "$WSO2_STORE_PASS"
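If no XML tool is available, a sed-only workaround is possible under a strong assumption: the opening tag, the password element and the closing tag each sit on their own line, as in the extract. The original command fails for three reasons: sed processes one line at a time, so a pattern cannot span lines; inside a bracket expression \s and \S are not character classes; and the single quotes stop the shell from expanding $WSO2_STORE_PASS. The sketch below uses an address range instead. The file contents, the trustStore block and the names new_pass and xml_file are made up for illustration, and a password containing |, & or \ would need escaping first.

```shell
# Sketch only: assumes <keyStore>, <password> and </keyStore> are on separate lines.
new_pass='s3cret'        # hypothetical value standing in for $WSO2_STORE_PASS
xml_file=$(mktemp)       # hypothetical stand-in for broker.xml

cat > "$xml_file" <<'EOF'
<trustStore>
<password>wso2carbon</password>
</trustStore>
<keyStore>
<location>repository/resources/security/apimanager.jks</location>
<password>wso2carbon</password>
</keyStore>
EOF

# Double quotes let the shell expand $new_pass; the address range restricts the
# substitution to the lines from <keyStore> through </keyStore>.
result=$(sed "/<keyStore>/,/<\/keyStore>/ s|<password>[^<]*</password>|<password>${new_pass}</password>|" "$xml_file")
printf '%s\n' "$result"
rm -f "$xml_file"
```

Once the output looks right, adding -i to the sed call would apply the change in place.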

Related

Parsing HTML page using bash

I have an HTML page and I'm trying to parse it.
Source:
<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td>0</td><td>0</td><td>-</td><td>0</td><td>0</td><td></td><td>0</td><td>0</td><td>-</td><td><u>0<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>0</td></tr><tr><th colspan=3>Avg over last 1024 success. conn.</th></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td>0</td><td>ms</td></tr></table></div></u></td><td>0</td><td>?</td><td>0</td><td>0</td><td></td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 0 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')</div></u></td><td class=ac>1</td><td class=ac>Y</td><td class=ac>-</td><td><u>1<div class=tips>Failed Health Checks</div></u></td><td>1</td><td>17h12m</td><td class=ac>-</td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td>0</td><td>0</td><td></td><td>1</td><td>24</td><td></td><td>29</td><td>41</td><td>200</td><td><u>5<span class="rls">4</span>033<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>5<span class="rls">4</span>033</td></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td><span class="rls">6</span>094</td><td>ms</td></tr></table></div></u></td><td>5<span class="rls">4</span>033</td><td>1s</td><td><span class="rls">4</span>89<span class="rls">1</span>000</td><td>1<span class="rls">8</span>11<span class="rls">6</span>385<div class=tips>compression: in=0 out=0 bypassed=0 savings=0%</div></td><td>0</td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 54004 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m UP</td><td class=ac> </td><td class=ac>1</td><td class=ac>1</td><td class=ac>0</td><td class=ac> </td><td>0</td><td>0s</td><td></td></tr></table><p>
What I want is:
172.29.219.17 L7TOUT in 1001ms
What I'm trying right now is:
grep redis index.html | grep 'a name=\"redis\/[0-9]*.*\"'
to extract the IP address.
But the regex doesn't pick out only the first row; it returns both rows, even though the IP is only in row 1.
I've double-checked the regex I'm using, but it doesn't seem to work.
Any ideas?
Using an XPath expression in xmllint with its built-in HTML parser, you can capture the IP address as
ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
172.29.219.17
For the timeout value, I manually counted the td cells in the row; the one containing the value turned out to be number 24:
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html
produces an output as
L7TOUT in 1001ms
Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')
Removing the whitespace and extracting only the needed part with Awk:
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}'
L7TOUT in 1001ms
Put it in a variable as
timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')
Now you can print both values together as
echo "${ipAddr} ${timeOut}"
172.29.219.17 L7TOUT in 1001ms
Version details:
xmllint --version
xmllint: using libxml version 20902
Also, there is a stray </table> end tag in your HTML input file, at the end just before <p>, which xmllint reports as
htmlfile:147: HTML parser error : Unexpected end tag : table
Remove that tag before further testing.
Here is a list of command-line tools that will help you parse different formats from bash; bash is extremely powerful and useful.
JSON: jq
XML/HTML: xq
YAML: yq
CSS: bashcss
I have tested all the tools except the last one; comments on it are welcome.
If the code starts getting truly complex, you might consider the naive answer below, as languages with class support will assist.
Naive (old) answer
Parsing complex formats like JSON, XML, HTML, CSS, YAML, etc. is extremely difficult in bash and likely to be error-prone. Because of this, I recommend one of the following:
PHP
RUBY
PYTHON
GOLANG
because these languages are cross platform and have parsers for all the above listed formats.
If you want to parse HTML with regexes, then you have to make assumptions about the HTML formatting. E.g. you assume here that the a tag and its name attribute are on the same line. However, this is perfectly valid HTML too:
<a
name="redis/172.29.219.17">
Some text
</a>
Anyway, let's solve the problem assuming that the a tags are on one line and that name is the first attribute. This is what I could come up with:
sed 's/\(<a name="redis\)/\n\1/g' index.html | grep '^<a name="redis\/[0-9.]\+"' | sed -e 's/^<a name="redis\///g' -e 's/".*//g'
Explanation:
The first sed command makes sure that every <a name="redis text starts on a separate line.
Then the grep keeps only those lines that start with <a name="redis/ followed by digits and dots and a closing ".
The last sed contains two expressions:
The first expression removes the leading <a name="redis/ text
The last expression removes everything from the closing " onwards
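The original grep returns both rows because [0-9]* may match zero digits and .* then matches anything up to a later quote, so redis/Backend matches too. A minimal sketch of a stricter match, assuming GNU grep (for -o) and that each row stays on one line. The shortened sample rows and the variable names are for illustration only, and the L7 status pattern is a guess based on the single value shown in the question:

```shell
# Shortened copies of the two rows from the question (illustration only).
html='<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout</div></u></td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td class=ac>17h12m UP</td></tr>'

# Require a dotted quad after redis/, so redis/Backend no longer matches;
# -o prints only the matched part, then cut and tr strip the surrounding text.
ip=$(printf '%s\n' "$html" | grep -oE 'name="redis/[0-9]+(\.[0-9]+){3}"' | cut -d/ -f2 | tr -d '"')

# Assumed shape of the status text: "<u> L7XXX in NNNms".
status=$(printf '%s\n' "$html" | grep -oE '<u> *L7[A-Z]+ in [0-9]+ms' | sed 's/^<u> *//')

echo "$ip $status"
```

This is still regex-on-HTML, so it breaks as soon as the markup is reflowed across lines.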

print multiple patterns with sed

I'm trying to print multiple patterns with sed.
Here's a typical string to process :
(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>
and I would like : (1.15)
For this, I tried :
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
but I get (1.)15</span>)</td></tr>
Can anyone see what's wrong?
Thanks
If you are Chuck Norris, use regex, brainfuck or assembly. If you're not, don't use regex to parse HTML; instead, use a tool that supports XPath, like xmllint. In 2014, it's a solved problem:
xmllint --html --xpath '//span[@class="arabic"]/text()' file_or_URL
See the famous question "RegEx match open tags except XHTML self-contained tags".
xmllint comes from libxml2-utils package (for debian and derivatives)
Reason why you are getting (1.)15</span>)</td></tr> as your output:
sed 's/^(<span.*">\([0-9]*\).*\([0-9]*\).*">/(\1\.\2)/'
                                          ^^
The two characters "> need to be placed before the second \([0-9]*\), because in your line "> comes directly before each number. This way sed can find the pattern.
The correct sed command:
sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
                              ^^
Complete command line:
echo '(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'|sed 's/^(<span.*">\([0-9]*\).*">\([0-9]*\).*/(\1.\2)/'
Result of the command line above:
(1.15)
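If the goal is only the text between the tags, another option is to delete every tag instead of capturing the numbers. A minimal sketch, assuming no > character appears inside an attribute value; the variable names are illustrative:

```shell
line='(<span class="arabic">1</span>.<span class="arabic">15</span>)</td></tr>'

# [^>]* cannot run past the end of a tag, so greedy matching is harmless here.
text=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')
echo "$text"
```

Because nothing is captured, there are no group positions to get wrong, which is what caused the (1.) output in the question.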
If data is at the same place all the time, awk may be a simpler solution than sed:
awk -F"[<>]" '{print "("$3"."$7")"}' file
(1.15)
$ lynx -dump -nomargins file.htm
(1.15)

search and replace multi-line text with white space

I am trying to search for some text in an XML file. The text is:
</p_dpopis>
<IMGURL>
And replace it with:
</p_dpopis>
<p_vyrobce>NONAME</p_vyrobce>
<IMGURL>
Here is what I tried with perl, without any luck:
perl -0pe 's|</p_dpopis>.*\n.*<IMGURL>|replacement|' myxml.xml
What is wrong here?
Your syntax works:
$ cat file
</p_dpopis>
<IMGURL>
$ perl -0pe 's|</p_dpopis>.*\n.*<IMGURL>|replacement|g' file
replacement
Here is a sed example with the same example file:
$ sed -r '/<\/p_dpopis>/{ N; s%</p_dpopis>.*\n.*<IMGURL>%replaced\ntest%g }' file
replaced
test
See this reference for more info.
You're missing a 'global' modifier for your regex, and using \s+ to match any amount of whitespace is much easier than specifying .*\n.*. It's also nicer to send the output to another file, rather than having to deal with it in the terminal window.
perl -0pe 's|</p_dpopis>\s+<IMGURL>|</p_dpopis>\n<p_vyrobce>NONAME</p_vyrobce>\n<IMGURL>|g' myxml.xml > my_new_xml.xml
If you're manipulating XML, it is really better to use a dedicated XML parser -- you can get into all sorts of mischief by manipulating an irregular language such as XML with regular expressions.
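The same insertion can also be sketched line by line in awk, without reading the whole file into memory. This assumes </p_dpopis> ends one line and <IMGURL> appears on the very next one, as in the question; the sample content and variable names are invented for illustration:

```shell
input='<p_dpopis>text</p_dpopis>
<IMGURL>http://example.com/a.jpg</IMGURL>'

output=$(printf '%s\n' "$input" | awk '
  # If the previous line closed p_dpopis and this one has IMGURL, emit the new element first.
  prev_closed && /<IMGURL>/ { print "<p_vyrobce>NONAME</p_vyrobce>" }
  # Print every line unchanged and remember whether it closed p_dpopis.
  { print; prev_closed = ($0 ~ /<\/p_dpopis>/) }
')
printf '%s\n' "$output"
```

Unlike the \s+ version, this does not tolerate blank lines between the two tags.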

Regular expression to extract text from XML-ish data using GNU sed

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". Every line in the file has one of these two formats:
<field number='1' name='Account' type='STRING'W/>
<field number='2' name='AdvId' type='STRING'W>
I've inserted a 'W' near the end to represent optional whitespace. The order and number of properties are not necessarily the same in all lines throughout the file, although "number" always comes before "type".
What I'm searching for is a regular expression "regexp" that I can give to gnu sed so that this command:
gsed regexp -i FILENAME
gives me a file with lines looking like this:
1 STRING
2 STRING
I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.
I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?
Thanks a lot,
jules
Using xsh, a Perl wrapper around XML::LibXML:
open file.xml ;
for //field echo @number @type ;
I'm sure this can be optimized, but it works for me and answers your question:
sed "s/^.*number='\([0-9]*\)'.*type='\(.*\)'.*$/\1 \2/" <filename>
That said, I think the others are right: if you have an XML file, you should use an XML parser.
I think you're much better off using a command-line XML tool such as XMLStarlet. That will integrate well with the shell and let you perform XPath searches. It's XML-aware, so it will handle character encodings, whitespace, etc. correctly.
Simple cut should work for you:
cut -f2,6 -d"'" --output-delimiter=" "
If you really want sed:
sed -r "s/[^']*'([^']*)'.*type='([^']*)'.*/\1 \2/"
You can use this:
sed -r "s/<field [^>]*?number='([0-9]+)'[^>]*?type='([^']+)'[^>]*>/\1 \2/"
You would be better off using an XML parser, but if you had to use sed:
sed -E "s/<field number='([^']*)'.*type='([^']*)'.*/\1 \2/"
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*#\1 \2#p" FILENAME
Or if you don't mind contents of number and type to be optional:
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*#\1 \2#p" FILENAME
Just change from [^']\\+ to [^']* at your preference.
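To check an expression like these against the two sample formats, a small test harness helps. The sketch below uses a variant with [^']* for both values and assumes the W placeholder is a space or nothing; the variable names are illustrative:

```shell
sample="<field number='1' name='Account' type='STRING' />
<field number='2' name='AdvId' type='STRING'>"

# -n together with the p flag prints only lines where the substitution succeeded.
result=$(printf '%s\n' "$sample" | sed -n "s/.*number='\([^']*\)'.*type='\([^']*\)'.*/\1 \2/p")
printf '%s\n' "$result"
```

Anchoring on number=' and type=' rather than on quote positions keeps the match independent of how many other attributes sit in between.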

How can I remove everything after a word on every line of a text file?

I have a text file that looks a bit like this
356, http://linkgoeshere.com/4445555 title="The Chariot"> <br />
356, http://linkgoeshere.com/4445555 title="fddsfssfd"> <br />
356, http://linkgoeshere.com/4445555 title="T3434534535"> <br />
I want to keep everything up to and including the link and remove everything after it, but each part after the link is unique apart from the title=, so I can't do a plain find and replace.
(About 800 lines of this btw)
Is there any way I can do this using programming?
Thanks.
In Notepad++ you can do this with find and replace using Regular expression
Click menu Search --> Replace...
In Search Mode select Regular expression
Enter the regular expression \stitle=".*$ in Find what
Make Replace with box empty
Click Replace all
Tested in version 6.2.2
This should also work in other editors supporting find and replace using Regular expressions.
Editor way (vim):
Open your file with vim and type :%s/ title=.*$//g and you will see the result.
In fact, any editor that supports regex replace would work.
Script programming:
sed:
(Note: the command below makes the change in place.)
sed -i 's/ title=.*$//' file
awk:
A tricky way, without regex:
awk '{print $1,$2}' file
The output goes to stdout. You can redirect it to a file with awk ... > newFile
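A quick way to try the sed approach without touching the real file is to run it on a couple of sample lines first. A sketch, assuming title= is always preceded by a single space; the variable names are illustrative:

```shell
sample='356, http://linkgoeshere.com/4445555 title="The Chariot"> <br />
356, http://linkgoeshere.com/4445555 title="fddsfssfd"> <br />'

# Without -i, sed writes to stdout and leaves its input alone.
# " title=.*$" matches from the space before title= to the end of the line.
cleaned=$(printf '%s\n' "$sample" | sed 's/ title=.*$//')
printf '%s\n' "$cleaned"
```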
Excel
If your editor doesn't support regular expressions, use Excel to import the file as a CSV file (Data -> From Text) and tell Excel to use the space as the field delimiter. Then export the first two columns as a new CSV file.