Parsing HTML Table using Regex - regex

I am trying to extract the contents of a table using regex.
I have removed most of the tags from the table, but I am stuck with <br>, <a href>, <img> and <b>. How can I remove them?
For the <b> tag I tried this regex:
\s*<b[^>]*>\s*
(?<value>.*?)
\s* </b>\s*
It worked for some lines, but for others it gives output such as
<b class="saadirheader">Email:</b>
Can anyone help me remove these tags:
<br>, <a href>, <img> and <b>?
Full tags:
<img src="Newrecord_files/spacer.gif" alt="" border="0" height="1" width="5">
<a href="mailto:first.last@email.org">
Thanking you,
Naveen HS

Use the following Regex:
(?:<br|<a href|<img|<b)(?:.(?!>))*.>
This regex matches the opening form of all the tags you mentioned above. If there are more tags you forgot to mention, add a "|" followed by the start of the tag (for example |<i) inside the first group.
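One caveat with the pattern above: it only targets opening tags, and a bare <b> with no attributes can match through to the next closing tag and swallow the text in between. Below is a minimal Python sketch of an alternative, assuming the tag list from the question (br, a, img, b) and assuming no attribute value contains a literal ">". The names TAG_RE and strip_tags are made up for illustration.

```python
import re

# Assumed tag list from the question: br, a, img, b.
# </? also catches closing tags; \b stops <b from matching <body>.
# [^>]* assumes no attribute value contains a literal ">".
TAG_RE = re.compile(r"</?(?:br|a|img|b)\b[^>]*>", re.IGNORECASE)


def strip_tags(html: str) -> str:
    """Remove the listed opening and closing tags, keeping the text between them."""
    return TAG_RE.sub("", html)


sample = (
    '<b class="saadirheader">Email:</b><br>'
    '<a href="mailto:first.last@email.org">first.last@email.org</a>'
)
print(strip_tags(sample))
```

For anything beyond this kind of one-off cleanup, a real HTML parser is usually the safer choice.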

Related

How to remove li tags within a particular DIV tag in Notepad++ using regex

I have content like below
<div class="content1">
<ul>
<li>line1</li>
<li>line2</li>
<li>line3</li>
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I want to strip all li tags within the content1 div only and retain the contents inside them, like below:
<div class="content1">
<ul>
line1
line2
line3
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I have about 500 HTML files to edit. Is there any regex to achieve this in Notepad++?
You can use a regex like this
<li>(.*?)<\/li>
With the replacement string:
$1
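The regexes to match those tags are
<li>
<\/li>
The backslash before the slash is only needed where / is also the pattern delimiter. Do not escape the angle brackets: in GNU sed, \< and \> are word-boundary anchors, not literal characters.
If you use a terminal you can use the stream editor, sed:
sed 's/<li>//g; s/<\/li>//g' input.txt > output.txt
In Notepad++, I believe you can do the same with Find and Replace (Ctrl+H) in regular expression mode.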
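Neither pattern above limits the change to the content1 div, which is what the question asks for. A rough Python sketch of one way to scope it follows, assuming the div is written exactly as <div class="content1"> and contains no nested <div>. The names DIV_RE, LI_RE and strip_li_in_content1 are hypothetical.

```python
import re

# Captures the opening tag, the body, and the closing tag of the content1 div.
# Assumes the div contains no nested <div>, so the first </div> closes it.
DIV_RE = re.compile(r'(<div class="content1">)(.*?)(</div>)', re.DOTALL)
LI_RE = re.compile(r"</?li>")


def strip_li_in_content1(html: str) -> str:
    """Drop <li> and </li> inside the content1 div only; other divs are untouched."""
    def fix(match: re.Match) -> str:
        return match.group(1) + LI_RE.sub("", match.group(2)) + match.group(3)

    return DIV_RE.sub(fix, html)


page = (
    '<div class="content1">\n<ul>\n<li>line1</li>\n</ul>\n</div>\n'
    '<div class="content2">\n<ul>\n<li>line4</li>\n</ul>\n</div>'
)
print(strip_li_in_content1(page))
```

For the 500 files, the same function could be applied to each file's text in a loop.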

Required text is not getting extracted

I am having trouble extracting data with an XPath or CSS selector from the HTML below.
I want to extract the "XYZ" text and the "xyz.com" text separately, into two different variables.
I tried a CSS selector like the one below, but it extracted all the text, both XYZ and xyz.com:
response.css('p>b[id="name"],
<p>
<b id="name">Name</b>
<i class="abc">
XYX
</i>
</p>
<p>
<b id="email">Email</b>
<i class="abc">
XYX.com
</i>
</p>
Is there any way I can extract xyz and xyz.com and store them in separate variables?
Try it with XPath:
name = response.xpath('//p[b[@id="name"]]/i/a/text()').extract_first()
email = response.xpath('//p[b[@id="email"]]/i/a/text()').extract_first()
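The idea in both expressions is to pick the <p> by the id of its <b> child and then read the sibling value. Here is a small sketch of that same pairing using only the standard library's xml.etree.ElementTree rather than Scrapy. It assumes the snippet is well-formed and wraps it in a hypothetical <div> root. The HTML as posted has no <a> inside <i>, so this version reads the text of <i> directly. The helper name value_for is made up.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <div> root so it parses as XML.
SAMPLE = """<div>
<p><b id="name">Name</b><i class="abc">
XYX
</i></p>
<p><b id="email">Email</b><i class="abc">
XYX.com
</i></p>
</div>"""


def value_for(root, label_id):
    """Return the stripped <i> text of the <p> whose <b> child has the given id."""
    for p in root.iter("p"):
        if p.find(f"b[@id='{label_id}']") is not None:
            return (p.findtext("i") or "").strip()
    return None


root = ET.fromstring(SAMPLE)
name = value_for(root, "name")
email = value_for(root, "email")
print(name, email)
```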

Need help to write a regular expression statement (Newbie alert!)

I use Photobucket to host the images for my eBay ads, so I copy the HTML out of Photobucket into Notepad. I'm always left with the <img> tag wrapped in Photobucket's <a> tag, and I have to go through each line and manually delete each <a></a>, which on 26 lines across multiple items can soon add up to hundreds of "highlight and delete" actions.
I already search for the closing tag </a> and replace it with nothing, which removes it. The string I cannot work out how to remove, because the image file name is different on every line, is demonstrated by the following example:
So it's essentially the section of the anchor tag up to and including the > that I need to remove on a mass scale. Any help would be greatly appreciated!
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC02424c_zpslt9m0cuu.jpg" border="0" alt=" photo DSC05653_zpslt9m0cuu.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC04444_zpspkgjw6vf.jpg" border="0" alt=" photo DSC05654_zpspkgjw6vf.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC05655_zpsxuev7czs.jpg" border="0" alt=" photo DSC05655_zpsxuev7czs.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC06624_zpsifjidypy.jpg" border="0" alt=" photo DSC05656_zpsifjidypy.jpg"/>
<img src="http://i1297.photobucket.com/albums/ag35/eye/Programmes/Yes%20joblot/DSC07777_zpsacyjrnnr.jpg" border="0" alt=" photo DSC05663_zpsacyjrnnr.jpg"/>
<a href="[^"]+?" target="_blank">
would do what you want, or even more general:
<a href=[^>]+?>
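As a quick way to check the more general pattern outside an editor, here is a minimal Python sketch that removes the opening anchor with that regex and the closing </a> with a plain string replace. The sample line uses a made-up example.com URL, and the names ANCHOR_OPEN_RE and unwrap_images are hypothetical.

```python
import re

# The more general pattern from the answer: the opening <a href=...> tag up to its ">".
ANCHOR_OPEN_RE = re.compile(r"<a href=[^>]+?>")


def unwrap_images(html: str) -> str:
    """Drop opening <a href=...> tags and closing </a> tags, leaving <img> intact."""
    return ANCHOR_OPEN_RE.sub("", html).replace("</a>", "")


# Hypothetical input line; the real Photobucket URLs differ.
line = (
    '<a href="http://example.com/page.html" target="_blank">'
    '<img src="http://example.com/pic.jpg" border="0" alt=" photo pic.jpg"/></a>'
)
print(unwrap_images(line))
```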

Regex lookahead and behind?

So I have an unordered list that looks like:
<ul class='radio' id='input_16_5'>
<li>
<input name='input_5' type='radio' value='location_1' id='choice_16_5_0' />
<label for='choice_16_5_0' id='label_16_5_0'>Location 1</label></li>
<li>
<input name='input_5' type='radio' value='location_2' id='choice_16_5_1' />
<label for='choice_16_5_1' id='label_16_5_1'>Location 2</label></li>
<li>
<input name='input_5' type='radio' value='location_3' id='choice_16_5_2' />
<label for='choice_16_5_2' id='label_16_5_2'>Location 3</label></li>
</ul>
I would like to pass a value (e.g. location_2) to a regular expression that will then capture the whole list item it is part of, in order to remove it. So if I pass it location_2, it should match from the <li> to the </li> (inclusive) of the list item that contains it.
I can match up to the end of the list item with /location_3.+?(?=<li|<\/ul)/, but is there something I can do to also match back to the opening <li> without capturing other items?
This should get what you want
<li>(?:(?!<li>)[\S\s])+location_1[\S\s]+?<\/li>
Explanation
<li>: the opening li tag,
(?:(?!<li>)[\S\s])+: matches any characters, including newlines, while the negative lookahead ensures the match cannot run across another <li> tag,
location_1: the keyword used to select the whole <li> item,
[\S\s]+?: any characters including newlines, matched non-greedily. (Thanks to @Tensibai for the comment that simplified this regex with a non-greedy quantifier.)
<\/li>: the closing li tag.
DEMO: https://regex101.com/r/cU4eC6/5
Additional information:
/<li>(?:(?!<li>).)+location_2.+?<\/li>/s
This regex also works; it uses the s modifier to let . match newlines instead of [\S\s]. (Thanks again to @Tensibai.)
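To pass the value in as a parameter, the same pattern can be built at run time. Below is a hedged Python sketch, where re.DOTALL plays the role of the s modifier. The function name remove_item and the shortened sample list are assumptions for illustration, not part of the original answer.

```python
import re


def remove_item(html: str, value: str) -> str:
    """Remove the first <li>...</li> item that contains the given value."""
    pattern = re.compile(
        # (?:(?!<li>).)+ cannot cross into another <li>, so the match
        # always starts at the <li> that encloses the value.
        r"<li>(?:(?!<li>).)+" + re.escape(value) + r".+?</li>",
        re.DOTALL,
    )
    return pattern.sub("", html, count=1)


# Shortened version of the list in the question.
LIST_HTML = (
    "<ul>\n"
    "<li>\n<input value='location_1' />\n<label>Location 1</label></li>\n"
    "<li>\n<input value='location_2' />\n<label>Location 2</label></li>\n"
    "<li>\n<input value='location_3' />\n<label>Location 3</label></li>\n"
    "</ul>"
)
result = remove_item(LIST_HTML, "location_2")
print(result)
```

Note that a value such as location_1 would also match inside location_10, so a stricter delimiter (for example the surrounding quotes) may be needed in real data.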

find & replace part of text

I am trying to do a search and replace using GREP/Regex
Here is what I am searching for
<div align="center" class="orange-arial-11"><b>.+<br>
I want to remove the <b>, <br> tags, and place <h3> tags around what .+ finds.
But I can't get what .+ finds to stay when it does the replace.
For example, I want to find this
<div align="center" class="orange-arial-11"><b>This is the section I want intact<br>
to change to this
<div align="center" class="orange-arial-11"><h3>This is the section I want intact</h3>
Any help is appreciated.
Use sed instead of grep:
# Modify the file in-place
sed -i~ 's|\(<div align="center" class="orange-arial-11">\)<b>\(.\+\)<br>|\1<h3>\2</h3>|' the-file
It depends on exactly which tool you're using, but if you put something in parentheses you can refer to it later in the replacement as a backreference.
So it might be something like
s/<b>(.+)<br>/<h3>\1<\/h3>/
In TextWrangler:
search for:
<div align="center" class="orange-arial-11"><b>(.+?)<br>
replace with:
<div align="center" class="orange-arial-11"><h3>\1</h3>
The '\1' will be replaced with the string matched inside the parens in the search pattern.
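The same capture-and-backreference idea can be tried in Python with re.sub. This is only a sketch of the technique under the assumption that each heading sits on one line, as in the question; the names PATTERN and to_h3 are made up for illustration.

```python
import re

# Group 1 keeps the <div ...> prefix, group 2 captures the text between <b> and <br>.
PATTERN = re.compile(
    r'(<div align="center" class="orange-arial-11">)<b>(.+?)<br>'
)


def to_h3(html: str) -> str:
    """Replace <b>text<br> after the div with <h3>text</h3>, keeping the text."""
    return PATTERN.sub(r"\1<h3>\2</h3>", html)


before = '<div align="center" class="orange-arial-11"><b>This is the section I want intact<br>'
print(to_h3(before))
```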