sed replace html - regex

Having difficulty getting a sed search and replace working with an html file.
I have multiple sections that look like this:
<TABLE class="cattable">
<TBODY>
<TR>
<TH colspan="2">Header</TH></TR>
<TR>
<TD>Random Amount of Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Moar Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Yup, More</TD>
<TD>4</TD></TR></TBODY></TABLE>
I need to:
replace with xxxxFOOxxxx:
<TABLE class="cattable">
<TBODY>
<TR>
<TH colspan="2">
keep this:
Header
Replace this with yyyyFOOyyyy:
</TH></TR>
Keep this:
<TR>
<TD>Random Amount of Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Moar Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Yup, More</TD>
<TD>4</TD></TR>
replace this with zzzzFOOzzzz:
</TBODY></TABLE>
Here's what I've tried in vim, but I can't limit the greedy .* properly:
s:\(<TABLE class="cattable">\_s\s*<TBODY>\_s\s*<TR>\_s\s*<TH colspan="2">\)\(.*\)\(<\/TH><\/TR>\)\(\_.*[^<]*\)\(<\/TABLE>\):xxxxFOOxxxx\2yyyyFOOyyyy\4zzzzFOOzzz<br>:g
tia

Replace * with \{-} to get a non-greedy match (the same as *? in PCRE/Perl regexes). More complicated cases need a negative look-ahead or look-behind, for example \(\(<\/TH><\/TR>\)\@!.\)* in place of .*, but here .\{-} is probably enough.
Note: this won't work in sed, which has no non-greedy quantifier.
Note 2: vim does not use BRE or ERE (basic/extended regular expressions) as sed does, vim's :s does not invoke any external program (sed included), and your attempt is not valid sed either. So if you did not mean to ask how to do this in sed, remove sed from the tags.
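For readers outside vim, here is a minimal Python sketch of the same non-greedy idea. It is not the answerer's code: the shortened sample table and the names pattern and rewrite_tables are assumptions made for illustration, and .*? with re.DOTALL stands in for vim's \_.\{-}.

```python
import re

# Shortened copy of one table from the question (assumed sample data).
html = """<TABLE class="cattable">
<TBODY>
<TR>
<TH colspan="2">Header</TH></TR>
<TR>
<TD>Random Amount of Data</TD>
<TD>3</TD></TR>
<TR>
<TD>Yup, More</TD>
<TD>4</TD></TR></TBODY></TABLE>
"""

pattern = re.compile(
    r'<TABLE class="cattable">\s*<TBODY>\s*<TR>\s*<TH colspan="2">'
    r'(.*?)'              # group 1: header text, non-greedy
    r'</TH></TR>'
    r'(.*?)'              # group 2: data rows, non-greedy
    r'</TBODY></TABLE>',
    re.DOTALL,            # let . cross newlines
)

def rewrite_tables(text):
    """Swap the three fixed pieces of every table for the FOO markers."""
    return pattern.sub(r'xxxxFOOxxxx\g<1>yyyyFOOyyyy\g<2>zzzzFOOzzzz', text)

result = rewrite_tables(html)
print(result)
```

Because both groups are non-greedy, each table in a file with several tables is rewritten on its own instead of being swallowed by one long match.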

Related

html - hyperlink and link text extraction

Hi I'm trying to extract the hyperlink and link text
HTML
<tr valign="top">
<td class="beginner">
B03
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B05
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B06
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>
Code
sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Desired
http://www.drawspace.com/lessons/b03/simple-symmetry Simple Symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase Faces and a Vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing Blind Contour Drawing
http://www.drawspace.com/lessons/b06/seeing-values Seeing Values
1st solution: With your shown samples, please try the following awk code, written and tested in GNU awk. It sets RS to the regex <a href="[^"]*">[^<]*, so each record terminator RT holds one link. The main block checks that RT is not empty, splits it into the array arr on the delimiters > or ", and prints the 2nd and 4th elements of arr, which are the URL and the link text.
awk -v RS='<a href="[^"]*">[^<]*' '
RT && split(RT,arr,"[>\"]"){
print arr[2],arr[4]
}
' Input_file
2nd solution: Using sed with its -E option (which enables ERE, extended regular expressions), please try the following code. The -n option stops sed from printing every line by default. The s command matches [[:space:]]+<a href="([^"]*)">([^<]*).*, whose two capture groups hold the URL and the link text, and replaces the whole match with those two groups. The p flag then prints only the lines where a substitution happened.
sed -E -n 's/[[:space:]]+<a href="([^"]*)">([^<]*).*/\1 \2/p' Input_file
3rd solution: Using GNU awk's match function with a regex that has 2 capture groups. match stores the groups in the array arr, from which the required values are printed.
awk '
match($0,/^[[:space:]]+<a href="([^"]*)">([^<]*)/,arr){
print arr[1],arr[2]
}
' Input_file
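The same two-capture-group idea can be sketched in Python. This is only an illustration: it assumes each link is written as <a href="URL">text</a>, the form the regexes above expect, and the names link_re and extract_links are made up for the example.

```python
import re

# Two rows in the shape the answers above assume (sample data).
html = """<tr valign="top">
<td class="beginner">
B03
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
</tr>
<tr valign="top">
<td class="beginner">
B04
</td>
<td>
    <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>
"""

# Group 1 is the URL, group 2 is the link text up to the next tag.
link_re = re.compile(r'<a href="([^"]*)">([^<]*)')

def extract_links(text):
    """Return one 'URL text' string per link found in text."""
    return [f"{url} {label.strip()}" for url, label in link_re.findall(text)]

for line in extract_links(html):
    print(line)
```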

Collect results from groups to one string

I want to parse weather html page for Openhab.
This is significant part of whole html:
<!-- Amount of Sun -->
<tr>
<td class="label_det">
<span class="sum">∑</span> <span class="unit">in u</span>
</td>
<td class="sunamount">
10.2
</td>
<td class="sunamount">
10.6
</td>
<td class="sunamount">
5.9
</td>
<td class="sunamount">
6.8
</td>
<td class="dgrey sunamount">
6.8
</td>
<td class="dgrey sunamount">
5.4
</td>
<td class="sunamount">
5
</td>
</tr>
I would like to collect all numbers into one string. I understand that it may not be possible, but maybe...
So something like this: '10.2 10.6 5.9 6.8 6.8 5.4 5'
Example of full html and my current regex is here: https://regex101.com/r/nrzPHU/1
Thanks in advance.
You need named capture groups. A named capture group lets you give a part of the regex a name so you can extract it later. It starts with (?<name>, followed by the sub-pattern, and ends with ).
<td class\=\".*?sunamount\">\s+(?<amount>\d+(\.\d+)?)\s+<\/td>
You would then be able to extract the amount by applying your regex to the input and picking the group named amount out of it.
Reading about OpenHab online I'm not sure they support named capture groups. So an alternative would be using the regex above to match all lines with amounts in the input. Then using a regex replace on that matched string. So something like...
Use this regex to get amounts:
<td class\=\".*?sunamount\">\s+\d+(\.\d+)?\s+<\/td>
Use this regex on the result of the regex above to replace the non amounts (and replace them with an empty string to delete them):
([\s]|<td class=".*?">|<\/td>)
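Outside OpenHab, where you can post-process matches in code, collecting the values is a find-all followed by a join. The Python sketch below only illustrates that idea and says nothing about what OpenHab itself supports; the shortened sample and the names amount_re and collect_amounts are assumptions.

```python
import re

# Shortened sample of the "Amount of Sun" row (assumed data).
html = """<td class="sunamount">
10.2
</td>
<td class="dgrey sunamount">
6.8
</td>
<td class="sunamount">
5
</td>
"""

# One capture group: the number inside a cell whose class ends in "sunamount".
amount_re = re.compile(r'<td class="[^"]*sunamount">\s+(\d+(?:\.\d+)?)\s+</td>')

def collect_amounts(text):
    """Join every captured amount into one space-separated string."""
    return " ".join(amount_re.findall(text))

print(collect_amounts(html))
```

The inner (?:...) is non-capturing so that findall returns just the number rather than a tuple of groups.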

RegEx Multiline result - Netbeans 7.2 find and replace feature

I'm using the "find" tool in Netbeans 7.2 and I'm looking to make regular expression that would allow me to gather results that have multiple lines.
A sample Html code I would like to apply the regular expression on :
<tr>
<td>
<label>some label</label><span>*</span>
</td>
<td>
<label>some label</label><span>*</span>
<label>some label</label>
<label>some label</label>
</td>
</tr>
Basically, I want to gather any <td> tag including its content and its end tag </td>.
In the above example, my first result should be :
<td>
<label>some label</label><span>*</span>
</td>
and my second result expected would be :
<td>
<label>some label</label><span>*</span>
<label>some label</label>
<label>some label</label>
</td>
I've tried many different regular expressions that would pick up the start of the <td> and the next line (if the <td>'s content is on more than one line).
Example :
<td>.*(.*\s*).*
But I'm looking for a regular expression that picks up every <td> tag and its content, no matter how many <label> tags it holds.
You have to use the s modifier to make the dot match newlines. I don't know where you can set this in NetBeans, but you can begin your expression with (?s) to enable it.
So a regex to match <td ...> ... </td> would be something like this (?s)(<td[^>]*>.*?<\/td>).
Explanation:
(?s) : make the dot match also newlines
<td : match <td
[^>]* : match everything except > 0 or more times
> : match >
.*? : match everything 0 or more times ungreedy (until </td> found in this case)
<\/td> : should I even explain o_o ?
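To see the pattern at work outside NetBeans, here is a small Python sketch run against the sample from the question. Python's re module also accepts the inline (?s) flag; the names td_re and cells are made up for the example.

```python
import re

html = """<tr>
<td>
<label>some label</label><span>*</span>
</td>
<td>
<label>some label</label><span>*</span>
<label>some label</label>
<label>some label</label>
</td>
</tr>
"""

# (?s) lets . match newlines; .*? stops at the first </td> it can reach.
td_re = re.compile(r'(?s)<td[^>]*>.*?</td>')

cells = td_re.findall(html)
for number, cell in enumerate(cells, start=1):
    print(f"--- result {number} ---")
    print(cell)
```

With a greedy .* instead of .*?, the two cells would come back as a single match running from the first <td> to the last </td>.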

Delete text block with regexp in notepad++

Textfile contains blocks:
<tr>
<td>waitForElementPresent</td>
<td>//div</td>
<td></td>
</tr>
<tr>
<td>assertElementPresent</td>
<td>//div</td>
<td></td>
</tr>
I want to remove all blocks with regexp in notepad++ with the word assertElementPresent in it:
<tr>*assertElementPresent*</tr>
Who can help me with the regular expression?
<tr>.*?assertElementPresent.*?</tr>
should be a good start (note the ungreedy matches), however, it's rather brittle.
<tr>(?!.*<tr>.*).*?assertElementPresent.*?</tr>
It has been tested with RAD and RegExr. The previously suggested solution also picks up the preceding row.
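Another way to keep the match inside a single row is a "tempered" dot, which refuses to step over the next <tr>. This is a different technique from the two patterns above, sketched here in Python rather than Notepad++; the names block_re and remove_assert_rows are assumptions for the example.

```python
import re

html = """<tr>
<td>waitForElementPresent</td>
<td>//div</td>
<td></td>
</tr>
<tr>
<td>assertElementPresent</td>
<td>//div</td>
<td></td>
</tr>
"""

# (?:(?!<tr>).)*? consumes one character at a time, but never starts
# consuming a new <tr>, so the match cannot begin in an earlier row.
block_re = re.compile(
    r'<tr>(?:(?!<tr>).)*?assertElementPresent.*?</tr>\n?',
    re.DOTALL,
)

def remove_assert_rows(text):
    """Delete every <tr>...</tr> block that contains assertElementPresent."""
    return block_re.sub("", text)

cleaned = remove_assert_rows(html)
print(cleaned)
```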

Notepad++ search & replace

I'm trying to convert an html file with 100 entries like this one:
<table>
<tr>
<td valign="top" width="30">
1.</td>
<td>
TEXT DESCRIPTION
</td>
</tr>
</table>
<table><tr><td></td></tr></table>
where the number "1." goes from 1 to 100, into this:
<li>
TEXT DESCRIPTION
</li>
I haven't found a way to do this, neither with regexp nor with extended search mode. Any ideas?
You could start with this:
Replace
.*<td>(.*[A-Za-z]+.*)<\/td>.*
with
<li>\1</li>
This will match one chunk of code of the form you posted. You must modify it to match multiple chunks of the same form in the same file.
To work correctly it should also match lazily. Does anyone know how?
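On the lazy-matching question: in most regex flavours a ? after the quantifier, as in .*?, makes it lazy. The Python sketch below shows the idea on the posted sample. It is not a Notepad++ recipe, it assumes every entry has exactly the posted layout including the empty spacer table, and the names entry_re and to_list_items are made up.

```python
import re

html = """<table>
<tr>
<td valign="top" width="30">
1.</td>
<td>
TEXT DESCRIPTION
</td>
</tr>
</table>
<table><tr><td></td></tr></table>
"""

entry_re = re.compile(
    r'<table>\s*<tr>\s*<td valign="top" width="30">\s*\d+\.</td>\s*'
    r'<td>\s*(.*?)\s*</td>\s*</tr>\s*</table>\s*'   # group 1: description, lazy
    r'<table><tr><td></td></tr></table>',           # empty spacer table
    re.DOTALL,
)

def to_list_items(text):
    """Rewrite each numbered table entry as an <li> element."""
    return entry_re.sub(r'<li>\n\1\n</li>', text)

print(to_list_items(html))
```

Because (.*?) is lazy, each entry stops at its own </td>, so a file with many entries yields one <li> per entry.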