Required text is not getting extracted - python-2.7

I am having trouble extracting data from the HTML below using an XPath or CSS selector.
I want to extract the "XYZ" text and the "xyz.com" text separately into two different variables.
I tried a CSS selector like the one below, but it extracted all the text, both XYZ and xyz.com:
response.css('p>b[id="name"],
<p>
<b id="name">Name</b>
<i class="abc">
XYX
</i>
</p>
<p>
<b id="email">Email</b>
<i class="abc">
XYX.com
</i>
</p>
Is there any way I can extract xyz and xyz.com and store them in separate variables?

Try it with XPath:
name = response.xpath('//p[b[@id="name"]]/i/text()').extract_first()
email = response.xpath('//p[b[@id="email"]]/i/text()').extract_first()
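
The same lookup can be tried outside a Scrapy spider. The following is a minimal sketch, not the answerer's code: it assumes the snippet is well-formed enough to parse with the standard library's xml.etree.ElementTree instead of a Scrapy response, and the wrapping <div> and the helper name field_text are made up for illustration. ElementTree does not support nested predicates, so the path finds the <b> by id, steps up to its <p>, and then down to the <i>.

```python
import xml.etree.ElementTree as ET

# The question's snippet, wrapped in a <div> (an assumption) so it has a single root.
html = """<div>
<p>
<b id="name">Name</b>
<i class="abc">
XYX
</i>
</p>
<p>
<b id="email">Email</b>
<i class="abc">
XYX.com
</i>
</p>
</div>"""

root = ET.fromstring(html)


def field_text(root, field_id):
    """Return the stripped text of the <i> that shares a <p> with <b id=field_id>."""
    node = root.find(".//b[@id='%s']/../i" % field_id)
    if node is None or node.text is None:
        return None
    return node.text.strip()


name = field_text(root, "name")
email = field_text(root, "email")
print(name)
print(email)
```

Stripping matters here because the text inside <i> is surrounded by newlines in the original markup.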

Is there a way to find and append a specific URL within HTML using Regex?

I have not been able to find advice on Stack Overflow about finding specific URLs and appending to them. I am looking to create "deep links" using a popular affiliate network within HTML content. For example, here is some HTML:
<html>
<body>
<h2>This is a title</h2>
<p>this is some text</p>
<p>link to macys</p>
<p>link to google</p>
<p>something else</p>
</body>
</html>
I want to use a regex to find just the Macys link (not the Google link) in the HTML and append the affiliate network's "deep link" code to its URL, so that it looks like this:
<html>
<body>
<h2>This is a title</h2>
<p>this is some text</p>
<p>link to macys</p>
<p>link to google</p>
<p>something else</p>
</body>
</html>
I did a find and replace for http://macys.com, and www.macys.com, and http://www.macys.com and it works.
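
A regex version of that find and replace might look like the sketch below. Several things are assumptions, because the links in the question's HTML did not survive: the href attributes in the sample, the AFFILIATE_PREFIX value (a hypothetical placeholder, not a real network's format), and the choice to pass the original URL as an encoded parameter. The pattern covers the three spellings mentioned above (http://macys.com, www.macys.com, http://www.macys.com) and leaves other hosts alone.

```python
import re
from urllib.parse import quote

# Hypothetical deep-link prefix; a real affiliate network defines its own format.
AFFILIATE_PREFIX = "https://affiliate.example.com/click?url="

# href values that point at macys.com, with or without scheme and "www."
MACYS_HREF = re.compile(
    r'href="((?:https?://)?(?:www\.)?macys\.com[^"]*)"', re.IGNORECASE
)


def add_deep_links(html):
    """Rewrite only the macys.com hrefs so they go through the affiliate prefix."""
    def repl(match):
        original_url = match.group(1)
        return 'href="%s%s"' % (AFFILIATE_PREFIX, quote(original_url, safe=""))
    return MACYS_HREF.sub(repl, html)


# Sample markup with assumed href attributes.
sample = (
    '<p><a href="http://www.macys.com/shoes">link to macys</a></p>'
    '<p><a href="http://google.com">link to google</a></p>'
)
result = add_deep_links(sample)
print(result)
```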

PowerShell regular expression to get all the HTML tags

I have a string with HTML tags. I have to write a PowerShell script that splits this string using a regular expression matching both opening and closing HTML tags. I have tried many times but with no luck.
<([A-Z][A-Z0-9]*)[^>]*>
I tried this for opening tags, but it only removes the '<' and '>' from the string, not the whole tag.
My string is something like this:
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
My desired output is: some text inside. This is text inside font. this is h1 text. This is a new paragraph.
Not sure how you're doing your split, but it shouldn't be that difficult:
$Text =
@'
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
'@
$text -split '<.+?>' -match '\S'
some text inside.
this is text inside font.
this is h1 text.
This is a new paragraph.
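
For comparison, the same split-and-filter idea can be sketched in Python. This is only an illustration of the technique, not part of the PowerShell answer: re.split plays the role of -split, and the list comprehension plays the role of -match '\S' by dropping whitespace-only pieces.

```python
import re

text = """<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>"""

# Split on anything that looks like a tag, then keep only pieces with visible text.
parts = [piece.strip() for piece in re.split(r"<.+?>", text) if piece.strip()]

for part in parts:
    print(part)
```

As with the PowerShell one-liner, this relies on each tag sitting on a single line, since . does not match newlines by default.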

How to skip a particular tag and crawl other tag's text in Beautifulsoup

I am crawling a webpage using BeautifulSoup. In one case I want to skip the content of one particular tag and get the contents of the other tags. In the code below I don't want the div tag's contents, but I couldn't work out how to exclude them. Please help me.
HTML code:
<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
,text
</blockquote>
I tried this:
content = soup.find('blockquote',attrs={'class':'messagetext'}).text
But it also fetches the unwanted text inside the div tag.
Use the clear function like this:
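from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.find('blockquote', attrs={'class': 'messagetext'})
# Empty every <div> inside the blockquote so its text is dropped
for tag in content.findChildren():
    if tag.name == 'div':
        tag.clear()
print(content.text)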
This yields:
Text..............
text
text
text
,text
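
If BeautifulSoup is not available, the same "skip one tag, keep the rest" idea can be sketched with the standard library's html.parser. This is a different technique from the clear() answer above, shown only as an illustration; the class name SkipTagText is made up. It tracks how deep the parser is inside the unwanted tag and collects text only when that depth is zero.

```python
from html.parser import HTMLParser


class SkipTagText(HTMLParser):
    """Collect text from a document, ignoring everything inside one tag name."""

    def __init__(self, skip_tag):
        super().__init__()
        self.skip_tag = skip_tag
        self.depth = 0      # how many unclosed skip_tag elements we are inside
        self.chunks = []    # non-empty text pieces found outside skip_tag

    def handle_starttag(self, tag, attrs):
        if tag == self.skip_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


html_doc = """<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
,text
</blockquote>"""

parser = SkipTagText("div")
parser.feed(html_doc)
parser.close()
chunks = parser.chunks
print("\n".join(chunks))
```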

Regular expression to remove lines with special characters

<a class='jdr' href='javascript:void(0);' onClick="return openDiv('jrtp');"></a>
<span class="jcn">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET" title='Aptech N Power Hardware & Networking' >Aptech N Power Hardware & Networkin...</a>
</span>
<section class="jrat">
<a rel="nofollow" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw"><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s0'></span></a>
<a class="jrt" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw">2 ratings</a>
<span class="jrt"> |</span>
<a class="rate_this" onclick="_ct('ratethis','lspg');" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET/writereview">Rate this</a>
</section>
<section class="jcar">
<section class="jbc">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET">
<img width="83" height="56" border="0" src="http://images.jdmagicbox.com/upload_test/ahmedabad/b4/079pxx79.xx79.110420172948.d4b4/logo/faf3f2409ed7993aaa70f848ab0bb6fb_t.jpg" class="Clogo" />
</a>
<!-- <span class="noLogo"></span> -->
<section class="jrcl">
<p>
**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>
</p>
From the above HTML I want to extract the following:
A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024
I need help creating a regular expression to find and remove all lines containing special characters.
I am using the following regex:
/(\<.+?>)/g
Please help. Thanks.
Try this:
/(?<=\*{2})([^<>]*?)(?=\*{2})/g
It matches all content between the ** markers.
I think you want to remove lines which are HTML tags, so try this (the m flag makes ^ match at the start of every line):
/^<.*>\n/gm
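
Both suggestions can be tried in Python as a rough sketch. The snippet below uses only the last few lines of the question's markup; Python's re module accepts the fixed-width lookbehind used in the first pattern, and re.MULTILINE stands in for the multiline flag in the second.

```python
import re

snippet = (
    '<section class="jrcl">\n'
    "<p>\n"
    "**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, "
    "Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>\n"
    "</p>\n"
)

# First suggestion: capture whatever sits between the two ** markers.
between_markers = re.findall(r"(?<=\*{2})([^<>]*?)(?=\*{2})", snippet)

# Second suggestion: drop every line that starts with '<' and ends with '>'.
without_tag_lines = re.sub(r"^<.*>\n", "", snippet, flags=re.MULTILINE)

print(between_markers[0])
print(without_tag_lines)
```

Note that the second pattern keeps the address line but leaves its trailing "| View Map<br>" in place, because that line does not begin with a tag.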

Parsing HTML Table using Regex

I am trying to extract the contents of a table using a regex.
I have removed most of the tags from the table, but I am stuck with <br>, <a href>, <img> and <b>. How can I remove them?
For the <b> tag I tried this regex:
\s*<b[^>]*>\s*
(?<value>.*?)
\s* </b>\s*
It worked for some lines, but for others it gives output such as:
<b class="saadirheader">Email:</b>
Can anyone help me remove these tags:
<br>, <a href>, <img> and <b>
Full tags:
<img src="Newrecord_files/spacer.gif" alt="" border="0" height="1" width="5">
<a href="mailto:first.last@email.org">
Thanking you,
Naveen HS
Use the following Regex:
(?:<br|<a href|<img|<b)(?:.(?!>))*.>
This regex matches all the opening tags you mentioned above. If there are more tags you forgot to mention, add a "|" followed by the tag you want and insert it into the first group of parentheses.
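
As a rough illustration of stripping those four tags in code, here is a Python sketch. It deliberately swaps in a simpler pattern than the one in the answer: a tag-name alternation followed by [^>]*, with an optional leading slash so closing tags such as </b> and </a> are removed as well. The sample string is assembled from the fragments in the question and is otherwise an assumption.

```python
import re

# Opening or closing <br>, <a>, <img> and <b> tags, with any attributes.
TAG_PATTERN = re.compile(r"</?(?:br|a|img|b)\b[^>]*>", re.IGNORECASE)

sample = (
    '<b class="saadirheader">Email:</b> '
    '<a href="mailto:first.last@email.org">first.last@email.org</a><br>'
    '<img src="Newrecord_files/spacer.gif" alt="" border="0" height="1" width="5">'
)

cleaned = TAG_PATTERN.sub("", sample)
print(cleaned)
```

The \b after the tag name keeps the pattern from touching longer names that merely start with the same letters, such as <body>. To cover more tags, extend the alternation in the same way the answer describes.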