BeautifulSoup not able to parse perfectly

When I run soup.find("h3", text="Main Address:").find_parents("section"), I get this output:
[<section class="otlnrw" itemscope="" itemtype="http://microformats.org/wiki/hCard">\n<header>\n<h3 itemprop="name">Main Address:</h3>\n</header>\n<p>600 Dexter <abbr title="Avenue\r"><abbr title="Avenue\r">Ave.</abbr></abbr><br/><span class="locality">Montgomery</span>, <span class="region">AL</span>, <span class="postal-code">36104</span></p> </section>]
Now I want to print only the paragraph's text, and I am not able to do that. How can I print just the text inside the paragraph of this section?
My HTML page looks like this:
<article>
<header>
<h2 id="state-government">State Government</h2>
</header>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Official Name:</h3></header>
<p>Alaska
</p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 class="org">Governor:</h3></header>
<p>Bill Walker</p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 itemprop="name">Phone Number:</h3></header>
<p class="spk tel">907-465-3708</p>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
<section>
<header><h2 id="state-agencies">State Agencies</h2></header>
<ul>
<li>Consumer Protection Offices</li>
<li>Corrections Department</li>
<li>Election Office</li>
<li>Motor Vehicle Offices</li>
<li>Surplus Property Sales</li>
<li>Travel and Tourism</li>
</ul>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
</article>
How can I get just the address text from it?

Your current code returns a list with one element. To get the <p> elements inside it, take the first item and call it with "p" (calling a tag this way is shorthand for find_all):
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")
That is again a list. To get what is inside the p element, take the first item of that list and call decode_contents on it:
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")[0].decode_contents(formatter="html")
In your case that will return:
u'120 East 4th Street<br/><span class="locality">Juneau</span>, <span class="region">AK</span>, <span class="postal-code">99801</span>'
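Note that decode_contents returns the inner HTML, tags included. If only the text is wanted, get_text is one option. Here is a minimal sketch, assuming bs4 is installed and using a trimmed copy of the Alaska markup from the question. The variable names are illustrative, and string= is the newer spelling of the text= argument used above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Main Address" section from the question.
html = """
<section class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the heading, climb to its enclosing <section>, then take the first <p>.
heading = soup.find("h3", string="Main Address:")
section = heading.find_parent("section")
paragraph = section.find("p")

# get_text() drops the tags; split/join collapses the newlines left in the markup.
address = " ".join(paragraph.get_text().split())
print(address)  # 120 East 4th Street Juneau, AK, 99801
```

find_parent returns the nearest matching ancestor directly, so the [0] indexing needed with find_parents goes away.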

Related

RIDE Robot framework Select from dynamic list

I am trying to choose an element ("Classic") from a dynamic dropdown list. The problem is that two elements contain the word Classic.
The HTML page is:
<ul id="dynamic-14" class="results" role="list">
<li class="results-dept result">
<div id="dynamic-102" class="results" role="option">
<span class="match"/>
</div>
</li>
<li class="results-dept result">
<div id="dynamic-12" class="results" role="option">
<span class="match"/>
Classic
</div>
</li>
<li class="results-dept result">
<div id="dynamic-1022" class="results" role="option">
<span class="match"/>
Classic numbers
</div>
</li>
I tried to do it with XPath using:
//ul[@class="results"] //div[contains(.,'Classic')]
but it returns two elements, so Robot Framework can't choose the one I need.

Use the normalize-space() function to get rid of the leading and trailing whitespace, then compare for equality instead of using contains():
//ul[@class="results"] //div[normalize-space(.)='Classic']
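To see why the equality test picks exactly one element, here is a rough sketch that mirrors the two predicates in plain Python. It assumes a trimmed, well-formed copy of the list from the question. The helpers string_value and normalize_space are hypothetical stand-ins for what XPath computes, not part of any library, and Python's split() treats a few more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the dropdown markup from the question.
markup = """
<ul id="dynamic-14" class="results" role="list">
  <li class="results-dept result">
    <div class="results" role="option"><span class="match"/></div>
  </li>
  <li class="results-dept result">
    <div class="results" role="option"><span class="match"/>
      Classic
    </div>
  </li>
  <li class="results-dept result">
    <div class="results" role="option"><span class="match"/>
      Classic numbers
    </div>
  </li>
</ul>
"""

def string_value(element):
    """Concatenate all descendant text, like the XPath string value of '.'."""
    return "".join(element.itertext())

def normalize_space(text):
    """Trim the ends and collapse whitespace runs, like XPath normalize-space()."""
    return " ".join(text.split())

divs = ET.fromstring(markup).findall(".//div")

# contains(., 'Classic') is a substring test, so it matches two divs.
contains_matches = [normalize_space(string_value(d))
                    for d in divs if "Classic" in string_value(d)]

# normalize-space(.) = 'Classic' is an equality test, so it matches one div.
exact_matches = [normalize_space(string_value(d))
                 for d in divs if normalize_space(string_value(d)) == "Classic"]

print(contains_matches)  # ['Classic', 'Classic numbers']
print(exact_matches)     # ['Classic']
```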

Extracting all dojo attach point values from HTML

I have a saved HTML page which I've opened in Notepad++. I would like to extract all the attach points from the HTML file. Example from the HTML below:
<div class="contentBar">
<div class="banner" style="">
<span class="bannerRepeat"></span>
<span class="bannerDecal"></span>
</div>
<div>
<div class="logo" data-dojo-attach-point="pageLogoPt">
ABC
</div>
<div class="title" data-dojo-attach-point="pageTitlePt">
ABC
</div>
<div class="userPane">
<div>
<span class="LoginCell LoginText"><span data-dojo-attach-point="welcomeBlockPt">Welcome</span>, <b data-dojo-attach-point="usernameBlockPt">User Name</b></span>
<span widgetid="acme_Button_0" id="acme_Button_0" class="LoginCell Button" data-dojo-type="acme.Button" data-dojo-props="size: 'small'" data-dojo-attach-point="logOutButtonPt"><span widgetid="dijit_form_Button_0" class="dijit dijitReset dijitInline dijitButton ButtonSmall" role="presentation"><span class="dijitReset dijitInline dijitButtonNode" data-dojo-attach-event="ondijitclick:__onClick" role="presentation"><span style="-moz-user-select: none;" aria-disabled="false" id="dijit_form_Button_0" tabindex="0" class="dijitReset dijitStretch dijitButtonContents" data-dojo-attach-point="titleNode,focusNode" role="button" aria-labelledby="dijit_form_Button_0_label"><span class="dijitReset dijitInline dijitIcon dijitNoIcon" data-dojo-attach-point="iconNode"></span><span class="dijitReset dijitToggleButtonIconChar">●</span><span class="dijitReset dijitInline dijitButtonText" id="dijit_form_Button_0_label" data-dojo-attach-point="containerNode">Logout</span></span></span><input value="" class="dijitOffScreen" data-dojo-attach-event="onclick:_onClick" tabindex="-1" role="presentation" aria-hidden="true" data-dojo-attach-point="valueNode" type="button"></span></span>
</div>
<div>
<span id="printLink" style="display:none;">Print</span>
<span id="zoomPercentageDisplay"><span data-dojo-attach-point="zoomBlockPt">Zoom</span>: 100%</span>
<span id="smallFontSizeLink" style="font-size: .8em;">A</span>
<span id="defaultFontSizeLink" style="font-size: 1em;">AA</span>
<span id="largeFontSizeLink" style="font-size: 1.2em;">AAA</span>
</div>
</div>
</div>
</div>
I would like to get:
pageLogoPt
pageTitlePt
welcomeBlockPt
usernameBlockPt
etc ...
Is this possible? Thanks

You can do the following:
1. With the Regular expression search mode selected, replace (data-dojo-attach-point="[^"]+)(?=") with \n\1\n. This puts each match on its own line.
2. Mark All based on the regex data-dojo-attach-point="[^"]+. Tick the "Bookmark line" checkbox.
3. Search -> Bookmark -> Remove Unmarked Lines.
4. Replace data-dojo-attach-point=" with nothing.
This gives you the list with each item on its own line.
Tested on Notepad++ 6.8.8.
Inspired by https://superuser.com/questions/477628/export-all-regular-expression-matches-in-textpad-or-notepad-as-a-list.
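If a script is acceptable instead of an editor, the same extraction can be sketched in a few lines of Python. This is only an illustration on a shortened sample of the markup. Splitting values such as titleNode,focusNode on commas is an assumption about the desired output, and the variable names are made up for the example.

```python
import re

# Shortened sample of the saved page; the real file would be read from disk.
html = '''<div class="logo" data-dojo-attach-point="pageLogoPt">ABC</div>
<div class="title" data-dojo-attach-point="pageTitlePt">ABC</div>
<span data-dojo-attach-point="welcomeBlockPt">Welcome</span>, <b data-dojo-attach-point="usernameBlockPt">User Name</b>
<span data-dojo-attach-point="titleNode,focusNode" role="button"></span>'''

# Capture whatever sits between the quotes of each attach-point attribute.
pattern = re.compile(r'data-dojo-attach-point="([^"]+)"')

attach_points = []
for value in pattern.findall(html):
    # Assumption: one attribute may list several points separated by commas.
    attach_points.extend(name.strip() for name in value.split(","))

print("\n".join(attach_points))
```

The pattern does not match data-dojo-attach-event, because the attribute name must be followed immediately by =".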

Foundation 4 - Sections- Centering Tabs

Problem: I am trying to create 6 centered tabs in a single row, but I can't keep all 6 tabs on the same row. The tabs are centered, but 2 of them are pushed onto the line below, and an extra empty cell is left on the first line.
I tried using small-centered and it made no difference.
Version: Foundation 4
Browser: Chrome - Latest
Code:
<div class="row">
<div class="large-6 large-centered columns">
<div class="section-container horizontal-nav" data-section="horizontal-nav" >
<section class="section">
<p class="title">Tab 1</p>
</section>
<section class="section">
<p class="title">Tab 2</p>
</section>
<section class="section">
<p class="title">Tab 3</p>
</section>
<section class="section">
<p class="title">Tab 4</p>
</section>
<section class="section">
<p class="title">Tab 5</p>
</section>
<section class="section">
<p class="title">Tab 6</p>
</section>
</div>
</div>
</div>

I don't think this is possible with Foundation 4 out of the box. By using large-centered, you are just centering the div containing the tabs and not the tabs themselves. The tabs will always align left.
The functionality you are looking for is available in button groups. For example, button-group even-6 will fill the entire space. You might be able to use jQuery to drive a button group instead, or look at the JavaScript source to see how Zurb resizes those elements and apply the same approach to your sections / tabs.

Regular expression to remove lines with special characters

<a class='jdr' href='javascript:void(0);' onClick="return openDiv('jrtp');"></a>
<span class="jcn">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET" title='Aptech N Power Hardware & Networking' >Aptech N Power Hardware & Networkin...</a>
</span>
<section class="jrat">
<a rel="nofollow" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw"><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s0'></span></a>
<a class="jrt" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw">2 ratings</a>
<span class="jrt"> |</span>
<a class="rate_this" onclick="_ct('ratethis','lspg');" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET/writereview">Rate this</a>
</section>
<section class="jcar">
<section class="jbc">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET">
<img width="83" height="56" border="0" src="http://images.jdmagicbox.com/upload_test/ahmedabad/b4/079pxx79.xx79.110420172948.d4b4/logo/faf3f2409ed7993aaa70f848ab0bb6fb_t.jpg" class="Clogo" />
</a>
<!-- <span class="noLogo"></span> -->
<section class="jrcl">
<p>
**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>
</p>
From the above HTML I want to extract the following:
A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024
I need help creating a regular expression to find and remove all lines containing special characters.
I am currently using the following regex:
/(\<.+?>)/g
Please help. Thanks.

Try this:
/(?<=\*{2})([^<>]*?)(?=\*{2})/g
It matches everything between the two ** markers.
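As a quick check of that pattern, here is a small sketch using Python's re module, which supports the fixed-width lookbehind used here. It assumes the ** markers are literally present in the input, as in the sample line from the question. The variable names are illustrative.

```python
import re

# The address line from the question, with the literal ** markers.
line = ("**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, "
        "Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>")

# The lookbehind requires ** just before the match and the lookahead requires
# ** just after it, so neither marker ends up in the captured text.
between_markers = re.compile(r"(?<=\*{2})([^<>]*?)(?=\*{2})")

addresses = between_markers.findall(line)
print(addresses[0])
```

Because [^<>] excludes angle brackets, a match cannot run across an HTML tag.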

I think you want to remove the lines that are HTML tags, so try this (the m flag makes ^ match at the start of every line, not only at the start of the input):
/^<.*>\n/gm
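The line-removal idea can be sketched in Python as follows. Note that ^ only anchors at the start of each line when multiline mode is on, which in Python is re.MULTILINE. The sample text is a shortened stand-in for the markup in the question, not the full page.

```python
import re

# Shortened stand-in for the markup in the question.
text = (
    '<span class="jcn">\n'
    '</span>\n'
    '<p>\n'
    '**A/35, Lakhani Chamber** | View Map<br>\n'
    '</p>\n'
)

# Delete every line that starts with "<" and ends with ">".
# Without re.MULTILINE, ^ would only match at the very start of the string.
remaining = re.sub(r"^<.*>\n", "", text, flags=re.MULTILINE)

print(remaining)
```

Lines that begin with anything other than <, such as the address line, are left untouched, so any trailing tag on such a line still needs separate handling.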

xslt matching all nodes prior to a specific node

I am trying to match all the nodes before a specific node. Input XML:
<story>
<content>
<p>This is the text I want</p>
<p>This is the text I want</p>
<p>This is the text I want</p>
<ul>
<li></li>
...
</ul>
....
....
</content>
</story>
With that as my input XML, I am trying, and failing, to select all the <p> elements that come before the <ul> and render them. There could be zero <p> elements or any number of them. Any thoughts on how to do this with XSLT 1.0? Thanks!

/story/content/p[not(preceding-sibling::ul)]

Use:
//p[not(preceding::ul or ancestor::ul)]
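This is generally wrong:
/story/content/p[not(preceding-sibling::ul)]
because it only tests for ul siblings of the p children of content: it doesn't select p elements that come before any ul but sit elsewhere in the document, and it does select p elements that follow a ul which isn't their sibling.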
For example, given this XML document:
<story>
<div>
<p>Must be selected</p>
</div>
<ul>
<li><p>Must not be selected</p></li>
</ul>
<content>
<p>Must not be selected</p>
<div>
<p>Must not be selected</p>
</div>
<p>Must not be selected</p>
<p>Must not be selected</p>
<ul>
<li></li>
<li><p>This must not be selected</p></li>
</ul>
....
....
</content>
</story>
the above wrong expression selects:
<p>Must not be selected</p>
<p>Must not be selected</p>
<p>Must not be selected</p>
and doesn't select the wanted element:
<p>Must be selected</p>
But the correct expression at the start of this answer selects just the wanted p element:
<p>Must be selected</p>
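One way to read the correct expression: a ul is on the preceding or ancestor axis of a p exactly when its start tag comes before the p in document order. Under that reading, the selection can be sketched in Python by walking the tree in document order and stopping at the first ul. This is an illustration of the logic only, not XSLT. The function name paragraphs_before_first_ul is hypothetical and the sample document is a shortened version of the one above.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example document above.
document = """
<story>
  <div><p>Must be selected</p></div>
  <ul><li><p>Must not be selected (inside ul)</p></li></ul>
  <content>
    <p>Must not be selected (after ul)</p>
    <ul><li></li></ul>
  </content>
</story>
"""

def paragraphs_before_first_ul(root):
    """Collect p elements that start before any ul in document order."""
    selected = []
    # Element.iter() visits elements depth-first, which is document order.
    for element in root.iter():
        if element.tag == "ul":
            break  # every later p has this ul as a preceding node or ancestor
        if element.tag == "p":
            selected.append(element)
    return selected

root = ET.fromstring(document)
texts = [p.text for p in paragraphs_before_first_ul(root)]
print(texts)  # ['Must be selected']
```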
