html - hyperlink and link text extraction - regex

Hi, I'm trying to extract the hyperlink and the link text from each row.
HTML
<tr valign="top">
<td class="beginner">
B03
</td>
<td>
<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
<a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B05
</td>
<td>
<a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B06
</td>
<td>
<a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>
Code
sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Desired
http://www.drawspace.com/lessons/b03/simple-symmetry Simple Symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase Faces and a Vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing Blind Contour Drawing
http://www.drawspace.com/lessons/b06/seeing-values Seeing Values

1st solution: With your shown samples, please try the following awk code, written and tested in GNU awk. It sets RS to the regex <a href="[^"]*">[^<]*, so each matched anchor ends up in RT. The main block checks that RT is not empty and splits its value into the array arr on the delimiters > or ". The 2nd and 4th elements of arr are then the URL and the link text, which are printed to get the needed output.
awk -v RS='<a href="[^"]*">[^<]*' '
RT && split(RT,arr,"[>\"]"){
print arr[2],arr[4]
}
' Input_file
2nd solution: Using sed with its -E option (to enable ERE, extended regular expressions), please try the following code. The -n option stops sed from printing every line by default. The s command then performs the substitution: the regex [[:space:]]+<a href="([^"]*)">([^<]*).* requires whitespace before <a and creates 2 capture groups, the URL and the link text. The matched text is replaced by those two groups, and the p flag prints only the lines where the substitution succeeded.
sed -E -n 's/[[:space:]]+<a href="([^"]*)">([^<]*).*/\1 \2/p' Input_file
3rd solution: Using GNU awk's match function with a regex that has 2 capture groups, which are stored in the array arr, to fetch the required values.
awk '
match($0,/^[[:space:]]+<a href="([^"]*)">([^<]*)/,arr){
print arr[1],arr[2]
}
' Input_file
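For comparison, here is a minimal Python sketch of the same capture-group idea used by the solutions above. The sample rows are modelled on the question, and the indentation before <a> is an assumption; the names SAMPLE, LINK_RE and extract_links are made up for illustration. Like the awk and sed versions, it only handles simple anchors of the form <a href="...">text and is not a general HTML parser.

```python
import re

# Hypothetical sample modelled on the question's rows; the indentation
# before <a> is an assumption.
SAMPLE = """<tr valign="top">
<td class="beginner">
B03
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
"""

# Group 1: the href value; group 2: the link text up to the next tag.
LINK_RE = re.compile(r'<a href="([^"]*)">([^<]*)')


def extract_links(html):
    """Return (url, text) pairs for every simple <a href="..."> anchor."""
    return [(url, text.strip()) for url, text in LINK_RE.findall(html)]


links = extract_links(SAMPLE)
for url, text in links:
    print(url, text)
```

Each printed line has the URL followed by the link text, which is the layout asked for in the question.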

Related

Collect results from groups to one string

I want to parse weather html page for Openhab.
This is significant part of whole html:
<!-- Amount of Sun -->
<tr>
<td class="label_det">
<span class="sum">∑</span> <span class="unit">in u</span>
</td>
<td class="sunamount">
10.2
</td>
<td class="sunamount">
10.6
</td>
<td class="sunamount">
5.9
</td>
<td class="sunamount">
6.8
</td>
<td class="dgrey sunamount">
6.8
</td>
<td class="dgrey sunamount">
5.4
</td>
<td class="sunamount">
5
</td>
</tr>
I would like to collect all the numbers into one string. I understand that it's perhaps not possible, but maybe...
So something like this: '10.2 10.6 5.9 6.8 6.8 5.4 5'
Example of full html and my current regex is here: https://regex101.com/r/nrzPHU/1
Thanks in advance.
You need named capture groups. A named capture group lets you give a name to a part of the regex so that you can extract that part later. It starts with (?<name>, followed by the sub-pattern, and ends with ).
<td class\=\".*?sunamount\">\s+(?<amount>\d+(\.\d+)?)\s+<\/td>
You would then be able to extract the amount by applying your regex to the input and picking the group named amount out of it.
Reading about OpenHab online, I'm not sure it supports named capture groups. An alternative would be to use the regex above, without the named group, to match all the cells with amounts in the input, and then to run a regex replace on the matched string. So something like...
Use this regex to get amounts:
<td class\=\".*?sunamount\">\s+\d+(\.\d+)?\s+<\/td>
Use this regex on the result of the regex above to replace the non amounts (and replace them with an empty string to delete them):
([\s]|<td class=".*?">|<\/td>)
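As a rough illustration of the named-group approach outside OpenHab, here is a minimal Python sketch. It says nothing about what OpenHab itself supports. Python spells a named group as (?P<name>...), the sample is a shortened copy of the question's rows, and AMOUNT_RE and collect_amounts are hypothetical names.

```python
import re

# Shortened copy of the question's table row (assumption: same layout,
# one value per line between <td ...> and </td>).
SAMPLE = """<tr>
<td class="label_det">
<span class="sum">∑</span> <span class="unit">in u</span>
</td>
<td class="sunamount">
10.2
</td>
<td class="dgrey sunamount">
6.8
</td>
<td class="sunamount">
5
</td>
</tr>
"""

# The class attribute must end in "sunamount"; the number is captured
# in the group named "amount".
AMOUNT_RE = re.compile(
    r'<td class="[^"]*sunamount">\s+(?P<amount>\d+(?:\.\d+)?)\s+</td>'
)


def collect_amounts(html):
    """Join every captured amount into one space-separated string."""
    return " ".join(m.group("amount") for m in AMOUNT_RE.finditer(html))


result = collect_amounts(SAMPLE)
print(result)
```

A single regex match cannot return a variable number of groups, so the joining happens in the surrounding code, one match per cell.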

Perl 5 regex match all non html-tags without variable length lookbehind

So I am trying to parse a web page for all the non html-tag matches. I was using RegExr and one of their sample patterns worked perfectly for what I need. The only problem is I am using Perl 5 and it keeps spitting out this error:
Variable length lookbehind not implemented in regex m/((?<=^|>)[^><]+?(?=<|$))/ at POODLE_calc.pl line 36.
I've read many other posts on here about this error but still can't get it to work! I've tried rewriting the pattern in as many different ways as I can think of or find on Google, and tried \K as suggested in one of the Stack Overflow posts, but still nothing works.
This is the excerpt from the HTML page I was experimenting on in RegExr (Full page made it crash)
<TABLE border cellspacing="2">
<TR align="center">
<TD width="50"> no. </TD>
<TD width="50"> AA </TD>
<TD width="50"> ORD/DIS </TD>
<TD width="50"> Prob. </TD>
</TR>
<tr align="center">
<td> 1 </td>
<td> M </td>
<td> -1 </td>
<td> 0.1029 </td>
</tr>
If you could help me figure out a pattern that will give me "no. AA ORD/DIS Prob. 1 M -1 0.1029" that Perl will accept I would greatly appreciate it!
Thanks,
Hobbit
EDIT
I used the pattern suggested by ikegami and it stopped the Perl error but it is only returning "no." and all of the space characters.
Here is the code that is doing the parsing:
while (<FILE>){
$_ =~ /((?:^|(?<=>))[^><]+?(?=<|$))/g;
$proteinScores .= $1;
}
print $proteinScores."\n";
This can help, assuming no text spans multiple lines and there is a single text per line:
while (<DATA>){
$proteinScores .= $1 if />([^>]+)</;
}
This one can do multiple texts per line:
while (<DATA>){
$proteinScores .= $1 while />([^>]+)</g;
}
and this one can handle spanning text:
$text = join("", <DATA>);
$proteinScores .= $1 while $text =~ />([^<>]+)</g;
(?<=^|>) could be written as (?:(?<=^)|(?<=>)), which simplifies to (?:^|(?<=>)) because ^ is already a zero-width assertion.
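The same "match what sits between > and <" idea, with no look-behind at all, can be sketched in Python as follows. This is only an illustration of the technique on the question's excerpt; TEXT_RE and text_between_tags are hypothetical names, and whitespace-only chunks between tags are dropped.

```python
import re

# The HTML excerpt from the question.
SAMPLE = """<TABLE border cellspacing="2">
<TR align="center">
<TD width="50"> no. </TD>
<TD width="50"> AA </TD>
<TD width="50"> ORD/DIS </TD>
<TD width="50"> Prob. </TD>
</TR>
<tr align="center">
<td> 1 </td>
<td> M </td>
<td> -1 </td>
<td> 0.1029 </td>
</tr>
"""

# Capture any run of non-bracket characters sitting between '>' and '<'.
# The closing '<' is consumed, the opening '>' never is, so consecutive
# chunks are all found.
TEXT_RE = re.compile(r">([^<>]+)<")


def text_between_tags(html):
    """Return the non-empty text chunks, stripped and joined by spaces."""
    chunks = (m.group(1).strip() for m in TEXT_RE.finditer(html))
    return " ".join(chunk for chunk in chunks if chunk)


scores = text_between_tags(SAMPLE)
print(scores)
```

Because the whole input is searched as one string, text that spans several lines is handled the same way as in the join("", <DATA>) variant above.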

Find a table's last cell by regular expression

I want to use a regular expression (compatible with PCRE) to select a table
cell in an XML or HTML file. This cell spans several lines and contains
other elements with their attributes and values. The cell is supposed to be in the last column.
For various reasons I can't, and don't want to, use the ". matches newline" option.
For example, in this code:
EDITED:
<table colcount="4">
<tr>
<td colspan="2">
<para><text> Mike</text></para>
</td>
<td>
<tab />
</td>
<td1>
<para><text>Jack</text></para>
<para><text>Sarah</text></para>
</td>
</tr1>
<tr>
<td>
<para><text>Bob</text></para>
<para><text>Rita</text></para>
</td>
<td2 colspan="3" with>
<para><text>Helen</text></para>
</td>
</tr2>
<tr>
<td style="with:445px;">
<para><text>Sam</text></para>
</td>
<td>
<para><text>Emma</text></para>
<para><text>George</text></para>
</td>
<td>
</td>
<td3 colspan="">
<tab />
</td>
</tr3>
</table>
/EDITED
I want to find and select the whole last cell together with its start and end tags (<td and </td>)
and the end tag of the corresponding row(</tr>), that is:
EDITED:
Here is what I want to select in the table like above using RegEx:
Either from <td1 to </tr1> - or from <td2 to </tr2> - or from <td3 to </tr3>
/EDITED
The format has to be preserved (indentation and new lines); I mean I can't put, for example,
</tr> in front of the closing tag of the cell (</td>).
Indentation consists only of space characters.
Thanks for any help...
Best you can do with regex is:
<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>(?!(.|\r|\n)*<tr)
But this is kinda ugly, resource intensive and breaks when you have nested tables. A better route is indeed to use an XML or HTML parser for whichever programming language you're using.
If you want to select the last cell from EVERY row, as your updated question suggests, leave out the negative lookahead like so:
<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>
Working example here: http://refiddle.com/gt2
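To see both patterns at work without the ". matches newline" option, here is a minimal Python sketch. The sample table is a simplified, hypothetical version of the one in the question, with plain <td> and </tr> tags instead of the td1/tr1 markers. The lookahead (.|\r|\n)* is written as [\s\S]*, which is meant to be equivalent, and LAST_CELL_RE and LAST_OF_TABLE_RE are made-up names.

```python
import re

# Simplified, hypothetical table with plain tag names.
SAMPLE = """<table>
<tr>
<td>
<para><text>Mike</text></para>
</td>
<td>
<para><text>Jack</text></para>
</td>
</tr>
<tr>
<td>
<para><text>Bob</text></para>
</td>
<td colspan="3">
<para><text>Helen</text></para>
</td>
</tr>
</table>
"""

# A cell body is any character that is not '<', or a '<' that does not
# start '</td>'. The cell must be followed directly by '</tr>'.
LAST_CELL = r"<td(?:[^<]|<(?!/td>))*</td>\s*</tr>"

# Last cell of every row.
LAST_CELL_RE = re.compile(LAST_CELL)

# Last cell of the last row only: no further '<tr' may follow.
LAST_OF_TABLE_RE = re.compile(LAST_CELL + r"(?![\s\S]*<tr)")

last_cells = [m.group(0) for m in LAST_CELL_RE.finditer(SAMPLE)]
last_of_table = [m.group(0) for m in LAST_OF_TABLE_RE.finditer(SAMPLE)]

for cell in last_cells:
    print(cell)
    print("---")
```

As the answer says, this breaks on nested tables, so a parser remains the more robust choice.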

sed replace html

Having difficulty getting a sed search and replace working with an html file.
I have multiple sections that look like this:
<TABLE class="cattable">
<TBODY>
<TR>
<TH colspan="2">Header</TH></TR>
<TR>
<TD>Random Amount of Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Moar Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Yup, More</TD>
<TD>4</TD></TR></TBODY></TABLE>
I need to:
replace with xxxxFOOxxxx:
<TABLE class="cattable">
<TBODY>
<TR>
<TH colspan="2">
keep this:
Header
Replace this with yyyyFOOyyyy:
</TH></TR>
Keep this:
<TR>
<TD>Random Amount of Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Moar Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Yup, More</TD>
<TD>4</TD></TR>
replace this with zzzzFOOzzzz:
</TBODY></TABLE>
Here's what I've tried in vim, but I can't limit the greedy .* properly:
s:\(<TABLE class="cattable">\_s\s*<TBODY>\_s\s*<TR>\_s\s*<TH colspan="2">\)\(.*\)\(<\/TH><\/TR>\)\(\_.*[^<]*\)\(<\/TABLE>\):xxxxFOOxxxx\2yyyyFOOyyyy\4zzzzFOOzzz<br>:g
tia
Replace * with \{-} to get a non-greedy match (the same as *? in PCRE/Perl regexes). For more complicated cases you will have to use negative look-aheads/look-behinds, like \(\(<\/TH><\/TR>\)\@!.\)* in place of .*; here .\{-} is probably enough.
Note: won’t work in sed.
Note 2: vim does not use BRE or ERE (basic/extended regular expressions) like sed does, vim's :s does not invoke any external program (including sed), and your attempt is not suitable for sed either. So if you did not mean to ask how to do this in sed, remove sed from the tags.
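The difference between the greedy and the non-greedy quantifier can be illustrated with a small Python sketch, where .*? plays the role of vim's .\{-}. The input is a shortened, hypothetical version of the question's table repeated twice, and build_pattern and REPLACEMENT are made-up names. This is not vim or sed syntax.

```python
import re

# Hypothetical input: two copies of a shortened version of the table.
TABLE = (
    '<TABLE class="cattable">\n'
    '<TBODY>\n'
    '<TR>\n'
    '<TH colspan="2">Header</TH></TR>\n'
    '<TR>\n'
    '<TD>Moar Data</TD>\n'
    '<TD>3</TD></TR></TBODY></TABLE>\n'
)
SAMPLE = TABLE + TABLE


def build_pattern(wildcard):
    """Build the three-part pattern around the given wildcard."""
    return re.compile(
        r'<TABLE class="cattable">\s*<TBODY>\s*<TR>\s*<TH colspan="2">'
        + "(" + wildcard + ")" + r"</TH></TR>"
        + "(" + wildcard + ")" + r"</TBODY></TABLE>",
        re.DOTALL,  # let '.' cross line breaks, like \_. in vim
    )


REPLACEMENT = r"xxxxFOOxxxx\1yyyyFOOyyyy\2zzzzFOOzzzz"

# Greedy: the first group runs on to the last </TH></TR> in the input,
# so both tables are swallowed by a single match.
greedy_result = build_pattern(".*").sub(REPLACEMENT, SAMPLE)

# Non-greedy: each group stops at the first possible end, so every
# table is rewritten on its own.
lazy_result = build_pattern(".*?").sub(REPLACEMENT, SAMPLE)

print(greedy_result.count("xxxxFOOxxxx"))
print(lazy_result.count("xxxxFOOxxxx"))
```

With several sections in one file, only the non-greedy version replaces the markers in each section separately.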

parse text using grep regex pull out text from multiple lines of text in a file

I have a chunk of text in a file:
<tr bgcolor="#F9F9F9">
<td align="left">8/7/2012 11:23:42 AM</td>
<td align="left"><em>Here is the text I want to parse out</em></td>
<td class="ra">9.00</td>
<td class="ra">297.00</td>
<td class="ra">0.00</td>
<td class="ra">0.00</td>
<td class="ra">$0.00</td>
<td class="ra">$0.50</td>
<td class="ra"></td>
</tr>
Using grep, I would like to end up with the result being
Here is the text I want to parse out
Working on the code now I have
cat file.txt | grep -m 1 -oP '<em>[^</em>]*'
but that does not work... thanks for your help!
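A correct regex would be (?<=<em>).*?(?=</em>). In your attempt, [^</em>] is a bracket expression that excludes the single characters <, /, e, m and >, not the string </em>, so the match stops at the first e in the text.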
So, try:
grep -m 1 -oP '(?<=<em>).*?(?=</em>)' file.txt
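The same lookaround pattern can be tried in Python, which also makes it easy to see what the bracket expression from the question actually matches. This is a minimal sketch on a shortened copy of the question's row; EM_RE, BROKEN_RE and first_em_text are hypothetical names.

```python
import re

# Shortened copy of the row from the question.
SAMPLE = """<tr bgcolor="#F9F9F9">
<td align="left">8/7/2012 11:23:42 AM</td>
<td align="left"><em>Here is the text I want to parse out</em></td>
<td class="ra">9.00</td>
</tr>
"""

# Look-behind and look-ahead keep the tags out of the match itself.
EM_RE = re.compile(r"(?<=<em>).*?(?=</em>)")

# The attempt from the question: the class excludes the characters
# '<', '/', 'e', 'm' and '>', so it stops at the first 'e'.
BROKEN_RE = re.compile(r"<em>[^</em>]*")


def first_em_text(html):
    """Return the text of the first <em> element, or None if absent."""
    match = EM_RE.search(html)
    return match.group(0) if match else None


text = first_em_text(SAMPLE)
broken = BROKEN_RE.search(SAMPLE).group(0)
print(text)
print(broken)
```

Using search rather than findall mirrors grep's -m 1 for this input, where the only match is on a single line.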