Parsing HTML page using bash - regex

I have an HTML web page and I'm trying to parse it.
Source ::
<tr class="active0"><td class=ac><a name="redis/172.29.219.17"></a><a class=lfsb href="#redis/172.29.219.17">172.29.219.17</a></td><td>0</td><td>0</td><td>-</td><td>0</td><td>0</td><td></td><td>0</td><td>0</td><td>-</td><td><u>0<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>0</td></tr><tr><th colspan=3>Avg over last 1024 success. conn.</th></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td>0</td><td>ms</td></tr></table></div></u></td><td>0</td><td>?</td><td>0</td><td>0</td><td></td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 0 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m DOWN</td><td class=ac><u> L7TOUT in 1001ms<div class=tips>Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')</div></u></td><td class=ac>1</td><td class=ac>Y</td><td class=ac>-</td><td><u>1<div class=tips>Failed Health Checks</div></u></td><td>1</td><td>17h12m</td><td class=ac>-</td></tr>
<tr class="backend"><td class=ac><a name="redis/Backend"></a><a class=lfsb href="#redis/Backend">Backend</a></td><td>0</td><td>0</td><td></td><td>1</td><td>24</td><td></td><td>29</td><td>41</td><td>200</td><td><u>5<span class="rls">4</span>033<div class=tips><table class=det><tr><th>Cum. sessions:</th><td>5<span class="rls">4</span>033</td></tr><tr><th>- Queue time:</th><td>0</td><td>ms</td></tr><tr><th>- Connect time:</th><td>0</td><td>ms</td></tr><tr><th>- Total time:</th><td><span class="rls">6</span>094</td><td>ms</td></tr></table></div></u></td><td>5<span class="rls">4</span>033</td><td>1s</td><td><span class="rls">4</span>89<span class="rls">1</span>000</td><td>1<span class="rls">8</span>11<span class="rls">6</span>385<div class=tips>compression: in=0 out=0 bypassed=0 savings=0%</div></td><td>0</td><td>0</td><td></td><td>0</td><td><u>0<div class=tips>Connection resets during transfers: 54004 client, 0 server</div></u></td><td>0</td><td>0</td><td class=ac>17h12m UP</td><td class=ac> </td><td class=ac>1</td><td class=ac>1</td><td class=ac>0</td><td class=ac> </td><td>0</td><td>0s</td><td></td></tr></table><p>
What I want is ::
172.29.219.17 L7TOUT in 1001ms
So what I'm trying right now is ::
grep redis index.html | grep 'a name=\"redis\/[0-9]*.*\"'
to extract the IP address.
But the regex doesn't pick out only the first row; it returns both rows, whereas the IP is only in row 1.
I've double-checked the regex I'm using but it doesn't seem to work.
Any ideas?

Using XPath expressions in xmllint with its built-in HTML parser, you can extract the IP address:
ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
172.29.219.17
For the timeout value, I manually counted the position of the td cell containing it, which turned out to be 24.
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html
produces an output as
L7TOUT in 1001ms
Layer7 timeout: at step 6 of tcp-check (expect string 'role:master')
Removing the whitespace and extracting only the needed part with Awk:
xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}'
L7TOUT in 1001ms
Put it in a variable as
timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')
Now you can print both values together as
echo "${ipAddr} ${timeOut}"
172.29.219.17 L7TOUT in 1001ms
Version details:
xmllint --version
xmllint: using libxml version 20902
Also, there is a stray </table> tag at the end of your HTML input, just before <p>, which xmllint reports as
htmlfile:147: HTML parser error : Unexpected end tag : table
Remove it before further testing.
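Putting the two extractions together in one small script (a sketch, assuming the page is saved in a file named html and the row/cell positions are the same as above):
#!/bin/bash
# IP address from the first table row, first cell
ipAddr=$(xmllint --html --xpath "string(//tr[1]/td[1])" html)
# status text from cell 24 of the first row, trimmed to the L7TOUT line
timeOut=$(xmllint --html --xpath "string(//tr[1]/td[24]/u[1])" html | awk 'NF && /L7TOUT/{gsub(/^[[:space:]]*/,"",$0); print}')
echo "${ipAddr} ${timeOut}"
# expected: 172.29.219.17 L7TOUT in 1001ms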

Here is a list of command line tools that will help you parse different formats via bash; bash is extremely powerful and useful.
For JSON, use jq (a short jq example follows this list)
For XML/HTML, use xq
For YAML, use yq
For CSS, use bashcss
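As a minimal illustration of the first of these, assuming jq is installed (the JSON input here is made up for the example, not taken from the question):
echo '{"server":{"ip":"172.29.219.17","status":"DOWN"}}' | jq -r '.server.ip'
# prints 172.29.219.17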
I have tested all of the other tools; please comment if you have tried bashcss.
If the code starts getting truly complex, you might consider the naive answer below, since coding languages with class support will assist.
naive - Old Answer
Parsing complex formats like JSON, XML, HTML, CSS, YAML, etc. in bash is extremely difficult and likely error prone. Because of this I recommend one of the following:
PHP
RUBY
PYTHON
GOLANG
because these languages are cross platform and have parsers for all the above listed formats.

If you want to parse HTML with regexes, then you have to make assumptions about the HTML formatting. E.g. you assume here that the a tag and its name attribute are on the same line. However, this is perfectly valid HTML too:
<a
name="redis/172.29.219.17">
Some text
</a>
Anyway, let's solve the problem assuming that the a tags are on one line and that name is the first attribute. This is what I could come up with:
sed 's/\(<a name="redis\)/\n\1/g' index.html | grep '^<a name="redis\/[0-9.]\+"' | sed -e 's/^<a name="redis\///g' -e 's/".*//g'
Explanation:
The first sed command makes sure that all <a name="redis text goes to a separate line.
Then the grep keeps only those lines that start with <a name="redis/ followed by an IP address (digits and dots).
The last sed contains two expressions:
The first expression removes the leading <a name="redis/ text
The last expression removes everything that comes after the closing "
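For comparison, GNU grep alone can pull out the IP in one step under the same single-line assumption (a sketch, not part of the original pipeline):
grep -oE '<a name="redis/[0-9.]+"' index.html | grep -oE '[0-9.]+'
# prints 172.29.219.17; the Backend row has no digits after redis/, so it is skipped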

Related

Extract Google Drive folder id from URLs

I am just trying to extract the Google Drive folder id from a bunch of different Google Drive URLs
cat links.txt
https://drive.google.com/drive/mobile/folders/1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE?usp=sharing
https://drive.google.com/open?id=1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
https://drive.google.com/folderview?id=1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
https://drive.google.com/file/d/1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_/view?usp=drivesdk
https://drive.google.com/drive/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA/1re_-YAGfTuyE1Gt848vzTu4ZDC6j23sG/1Ye90fM5qYMYkXp4QMAcQftsJCFVHswWj/149W7xNROO33zaPvIYTNwvtVGAXFxCg_b?sort=13&direction=a
https://drive.google.com/drive/mobile/folders/1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9?sort=13&direction=a
https://drive.google.com/drive/folders/1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF?sort=13&direction=a
Expected Output
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
After an hour of trial and error, I came up with this regex - ([01A-Z])(?=[\w-]*[A-Za-z])[\w-]+
It seems to work almost well, except that it can't process the 3rd-last link properly. If there are multiple nested folder ids in a URL, I need the innermost one in the output. Can someone please help me out with this and possibly improve the regex if it can be done more efficiently than mine?
You may try this sed:
sed -E 's~.*[/=]([01A-Z][-_[:alnum:]]+)([?/].*|$)~\1~' links.txt
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
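To try the sed against a single URL before running it on the whole file (a quick check):
echo 'https://drive.google.com/open?id=1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx' | sed -E 's~.*[/=]([01A-Z][-_[:alnum:]]+)([?/].*|$)~\1~'
# prints 1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx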
With GNU awk:
awk '{print $NF}' FPAT='[a-zA-Z0-9_-]{19,34}' file
$NF: the last field (column) of each record
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
Output:
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
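For a quick sanity check, the same FPAT approach can be run on a single URL from the list above (requires GNU awk, since FPAT is a gawk extension):
echo 'https://drive.google.com/open?id=1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx' | awk '{print $NF}' FPAT='[a-zA-Z0-9_-]{19,34}'
# only the 33-character id is long enough to form a field, so $NF prints it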

perl one-liner - add html tags around matched html element

I am using MS word to generate (Web page, filtered) html and have a code style which I want to wrap with <pre><code>content</pre></code> to have preformatted text (including tabs and spaces) so visitors can copy and paste code from my site.
I am aware of the cautions / dangers of using regex on html / xml, but, not an issue, since I control content.
Input html looks like this:
<table class=Code1 PLAIN_MULTILINE_TEXT_AND_FORMATTING</table>
Output html should look like this:
<pre><code>
<table class=Code1 PLAIN_MULTILINE_TEXT_AND_FORMATTING</table>
</pre></code>
here is my one-liner, in bash script, $1 is filename:
perl -pi -e 's|<table class=Code1 (.*?)</table>|<pre><code><table class=Code1 $1</table></pre></code>|sg' $1
Which has no effect. Appears to not match.
Questions:
1 - What is wrong?
2 - Do I need /s (multiline) modifier
3 - Is there a better way (ultimately, will add this to a pre-cache / rendering script, along with existing auto table of contents and auto popup definitions creation)?
site: www.rossco.org
Thanks;
Bill
Add this to your Perl command line flags to read the entire file instead of line by line: -0777
Why not make the code a little bit more readable? And maybe it is not a bad idea to keep a backup of the original file:
perl -0777 -pi'orig_*' -e 's|(<table class=Code1 .*?</table>)|<pre><code>$1</pre></code>|sg' filename
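To preview the result before editing in place, the same substitution can be run without -i so it only writes to stdout (a sketch; filename stands in for whatever your script passes as $1):
perl -0777 -pe 's|(<table class=Code1 .*?</table>)|<pre><code>$1</pre></code>|sg' filename
# inspect the output, then switch back to -pi'orig_*' to edit in place with a backup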

Why is my regex failing to select the correct elements, when it works on the online regex tester

I have a number of XML files that have HTML embedded in a node. I need to capture everything that is not the tags and add some non-HTML tags (for Moodle) around the text.
I'm processing the files from the command line, using a bash script. I'm using xpath to get the content, piping through xargs to sneakily rip out newlines and then piping through sed.
Here's a sample of the tag:
xpath -q -e '/activity/page/content' page.xml|xargs
<content><h3 style=float:right><img
src=##PLUGINFILE##/consumables.png> </h3> <h3>TITLE</h3>
<p>In order to conduct an LE5 drug test you need a Druglizaer
(batch controlled) foil pouch that contains two items:</p>
<p></p> <ol> <li><span style=font-
weight:900>Druglizer Cartridge</span></li><li><span
style=font-weight:900>Druglizer Oral Fluid
Collector</span></li> </ol> <p></p></content>
On https://regex101.com/ I used \>(.*?)\<, which groups the text as expected. But when I run it with sed it isn't doing any substitutions.
#!/bin/bash
# get new name string
name=$(xpath -q -e '/activity/page/name' page.xml);
en=$(echo $name|sed -e 's/<[^>]*>//g');
vi=$(echo $en|trans -brief -t vi);
cn=$(echo $en|trans -brief -t zh-CN);
mlang_name=$(echo "{mlang en}$en{mlang}{mlang
vi}$vi{mlang}{mlang
zh_cn}$cn{mlang}")
# xmlstarlet to update node
# get new content string
content=$(xpath -q -e '/activity/page/content' page.xml);
# \>(.*?)\<
mlang_name=$(echo $content|sed -e 's/\>(.*?)\</\{mlang en\}$1\{mlang\}\{mlang vi\}#VI#\{mlang\}\{mlang zh_cn\}#CN#\{mlang\}/g')
# xmlstarlet to update node
I need the replace to put {mlang en}TEXT{mlang} around the text.
I ended up using Perl, as it supports the non-greedy syntax I was using.
perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'
With the above file, the full command I used was
content=$(xpath -q -e '/activity/page/content' page.xml);echo $content|xargs|sed -e 's/<|<content>//g'|sed -e 's|</content>||g' |perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'|sed -e 's/{mlang en}[\ ]*{mlang}//g'|sed -e 's/<content>//g'
Which gave the following output
<h3 style=float:right><img src=##PLUGINFILE##/consumables.png></h3><h3>{mlang en}TITLE{mlang}</h3><p>{mlang en}In order to conduct an LE5 drug test you need a Druglizaer (batch controlled) foil pouch that contains two items:{mlang}</p><p></p><ol><li><span style=font-weight:900>{mlang en}Druglizer LE5 Cartridge{mlang}</span></li><li><span style=font-weight:900>{mlang en}Druglizer Oral Fluid Collector{mlang}</span></li></ol><p></p>
If there's a more elegant way feel free to let me know.
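For reference, the reason the sed attempt silently failed is that POSIX sed has no non-greedy .*? quantifier; a negated character class is the usual workaround if you want to stay in sed (a toy sketch under that assumption, not a drop-in for the full pipeline above):
echo '<p>TEXT</p>' | sed 's/>\([^<][^<]*\)</>{mlang en}\1{mlang}</g'
# prints <p>{mlang en}TEXT{mlang}</p>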

Using sed to replace multiline xml

I'm trying to use sed to edit/change an XML file, but I'm having problems with multiline matches.
The file I want to change contains (extract):
<keyStore>
<location>repository/resources/security/apimanager.jks</location>
<password>wso2carbon</password>
</keyStore>
I want to change the password (and only the keyStore password, the file has another password tag)
I'm trying
sed -i 's/\(<keyStore.*>[\s\S]*<password.*>\)[^<>]*\(<\/password.*>\)/\1$WSO2_STORE_PASS\2/g' $WSO2_PATH/$1/repository/conf/broker.xml
but it's not working (it changes nothing; pattern not found)
If I test the pattern in an online tester (https://regex101.com/) it seems to work fine.
Also, I have tried replacing the [\s\S]* with [^]*, but in that case sed generates a syntax error.
I'm using Ubuntu 16.04.1.
Any suggestion will be welcome
Parsing XML with regular expressions is always going to be problematic, as XML is not a regular language. Instead, you can use a proper XML parser, for example with XMLStarlet:
xmlstarlet ed --inplace -u "//keyStore/password" -v "$WSO2_STORE_PASS" "$WSO2_PATH/$1/repository/conf/broker.xml"
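To confirm the edit took effect, the value can be read back with xmlstarlet sel (a quick check using the same XPath):
xmlstarlet sel -t -v "//keyStore/password" "$WSO2_PATH/$1/repository/conf/broker.xml"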
Sed is not the tool for the job. Use an XML-aware tool, for example xsh:
open { shift } ;
insert text { shift } replace //keyStore/password/text() ;
save :b ;
Run as
xsh script.xsh "$WSO2_PATH/$1/repository/conf/broker.xml" "$WSO2_STORE_PASS"

Use a RegEx expression for PayPal output

I'm having issues using a sed expression to get the data I would like. I've researched it a bit and tried a small tutorial, but I could use some help. I feel that I can't use any
The closest I've come to a similar thread was "How do i print word after regex but not a similar word?".
I'm trying to parse through this to get information:
<table cellpadding=""0"" cellspacing=""0"" border=""0""><tr><td>Product<br>Total: 9.99 CAD<br></td></tr><tr><td><br /> <table cellpadding=""0"" cellspacing=""0"" border=""0"" style=""font-size:10px;""><tr><td colspan=""2""><b style=""color:#777; font size:12px;"">==Payer Info==</b></td></tr><tr><td width=""70""><b style=""color:#777"">First Name</b> </td><td>Greg</td></tr><tr><td><b style=""color:#777"">Last Name</b> </td><td>Allan</td></tr><tr><td><b style=""color:#777"">E-Mail</b></td><td>gregoryallan#me.com</td></tr></table></td></tr></table>
Ideally, from this I'd like to get the person's first name. I have to make an expression that matches everything up to the > before the first name and then grabs that value.
$ sed -n 's/^.*[Payer Info] -- grab name and stop when you hit </td>
I've been misleading because I implied I was doing it in the terminal, which was my first goal. But now I need to use this regex in a Google Apps Script. I assumed that it would be similar, and it is not. Very sorry to all those I misled.
This might work (assuming the format is always exactly like in your example):
sed -e 's/^.*First Name<\/b> <\/td><td>\([^<]*\).*$/\1/g' sed_sample
Here is how I extracted the name (Greg in your case):
sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
You can easily modify it to get other fields out.
Second name:
sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_'
Email:
sed 's_^.*E-Mail[^d]*d>[^>]*>\([A-Za-z#.]*\).*_\1_'
Inside a script you can use something like:
NAME=$(echo "$STRING" | sed xxx)
where you replace xxx with one of the sed commands above.
There are many other possibilities to capture the output of a process inside a script.
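Putting it together in a small script (a sketch, assuming the single-line HTML from the question is saved in a file; paypal.html is a made-up name):
#!/bin/bash
# read the PayPal HTML blob into a variable (paypal.html is a placeholder filename)
STRING=$(cat paypal.html)
# extract the fields with the sed commands shown above
FIRST=$(echo "$STRING" | sed 's_^.*First Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_')
LAST=$(echo "$STRING" | sed 's_^.*Last Name[^d]*d>[^>]*>\([A-Za-z]*\).*_\1_')
EMAIL=$(echo "$STRING" | sed 's_^.*E-Mail[^d]*d>[^>]*>\([A-Za-z#.]*\).*_\1_')
echo "$FIRST $LAST $EMAIL"
# expected: Greg Allan gregoryallan#me.com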