XSLT: matching all nodes prior to a specific node

I am trying to match all the nodes before a specific node. Input XML
<story>
<content>
<p>This is the text I want</p>
<p>This is the text I want</p>
<p>This is the text I want</p>
<ul>
<li></li>
...
</ul>
....
....
</content>
</story>
With that as my input XML, I am trying and failing to grab all the <p> tags prior to the <ul> tag and render them. There could be zero <p> tags or any number of them. Any thoughts on how to do this with XSLT 1.0? Thanks!

/story/content/p[not(preceding-sibling::ul)]

Use:
//p[not(preceding::ul or ancestor::ul)]
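This is generally wrong:
/story/content/p[not(preceding-sibling::ul)]
because it doesn't select p elements that come before any ul but aren't children of /story/content, and it does select p children of content that come after a ul which isn't their sibling.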
For example, given this XML document:
<story>
<div>
<p>Must be selected</p>
</div>
<ul>
<li><p>Must not be selected</p></li>
</ul>
<content>
<p>Must not be selected</p>
<div>
<p>Must not be selected</p>
</div>
<p>Must not be selected</p>
<p>Must not be selected</p>
<ul>
<li></li>
<li><p>This must not be selected</p></li>
</ul>
....
....
</content>
</story>
the above wrong expression selects:
<p>Must not be selected</p>
<p>Must not be selected</p>
<p>Must not be selected</p>
and doesn't select the wanted element:
<p>Must be selected</p>
But the correct expression at the start of this answer selects just the wanted p element:
<p>Must be selected</p>
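The difference can be checked without an XSLT processor. The following is only a sketch in Python's standard xml.etree.ElementTree, which does not support the preceding or ancestor axes, so both expressions are imitated by hand. It relies on the observation that a p has no preceding or ancestor ul exactly when no ul starts before it in document order. The function names are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The example document from the answer (the "...." filler lines are left out).
DOC = """<story>
<div>
<p>Must be selected</p>
</div>
<ul>
<li><p>Must not be selected</p></li>
</ul>
<content>
<p>Must not be selected</p>
<div>
<p>Must not be selected</p>
</div>
<p>Must not be selected</p>
<p>Must not be selected</p>
<ul>
<li></li>
<li><p>This must not be selected</p></li>
</ul>
</content>
</story>"""


def p_before_first_ul(root):
    """Imitate //p[not(preceding::ul or ancestor::ul)] (hypothetical helper)."""
    selected = []
    for node in root.iter():  # iter() walks the tree in document order
        if node.tag == "ul":
            break  # every later p either follows this ul or sits inside it
        if node.tag == "p":
            selected.append(node)
    return selected


def content_p_without_preceding_sibling_ul(root):
    """Imitate /story/content/p[not(preceding-sibling::ul)] (hypothetical helper)."""
    selected = []
    for content in root.findall("content"):
        for child in content:  # direct children of content only
            if child.tag == "ul":
                break  # later children have a preceding-sibling ul
            if child.tag == "p":
                selected.append(child)
    return selected


root = ET.fromstring(DOC)
print([p.text for p in p_before_first_ul(root)])
print([p.text for p in content_p_without_preceding_sibling_ul(root)])
```

The first call prints only the "Must be selected" paragraph. The second prints the three direct p children of content, none of which is wanted.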

Related

Test for following sibling not having argument

Consider the following xml file:
<main>
<sub>
<div n="1"/>
<div n="2"/>
<div n="3"/>
<div n="-"/>
</sub>
<sub>
<div n="1"/>
<div n="2"/>
<div/>
</sub>
<sub>
<div n="1"/>
<div n="2"/>
<div n="-"/>
</sub>
</main>
I want to output only the value of the last @n that is a digit and is followed either by a div with no @n or by one with @n = '-'. The following test, in a template that matches divs, outputs every @n containing digits...
<xsl:if test="
matches(@n, '[0-9]') and
following-sibling::div[not(@n) or @n='-']
">
I also tried:
<xsl:if test="
matches(@n, '[0-9]') and
(not(following-sibling::div[@n]) or
following-sibling::div[@n='-'])
">
Am I missing something?

BeautifulSoup not able to parse perfectly

When I am using soup.find("h3", text="Main Address:").find_parents("section"), I am getting an output which is:
[<section class="otlnrw" itemscope="" itemtype="http://microformats.org/wiki/hCard">\n<header>\n<h3 itemprop="name">Main Address:</h3>\n</header>\n<p>600 Dexter <abbr title="Avenue\r"><abbr title="Avenue\r">Ave.</abbr></abbr><br/><span class="locality">Montgomery</span>, <span class="region">AL</span>, <span class="postal-code">36104</span></p> </section>]
Now I want to print only the paragraph's text, but I am not able to do that. How can I print only the text inside the paragraph of this section?
My HTML page looks like this:
<article>
<header>
<h2 id="state-government">State Government</h2>
</header>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Official Name:</h3></header>
<p>Alaska
</p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 class="org">Governor:</h3></header>
<p>Bill Walker</p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>
<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otlnrw">
<header><h3 itemprop="name">Phone Number:</h3></header>
<p class="spk tel">907-465-3708</p>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
<section>
<header><h2 id="state-agencies">State Agencies</h2></header>
<ul>
<li>Consumer Protection Offices</li>
<li>Corrections Department</li>
<li>Election Office</li>
<li>Motor Vehicle Offices</li>
<li>Surplus Property Sales</li>
<li>Travel and Tourism</li>
</ul>
</section>
<p class="volver clearfix"><a href="#skiptarget">
<span class="icon-backtotop-dwnlvl">Back to Top</span></a></p>
</article>
How can I get only the text of the address from it?
Your current code returns a list with one element. To get the <p> element in it, you can expand it a bit:
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")
If you want to get what is inside that p element, you'll have to get the first element of that list again, and run decode_contents on it:
soup.find("h3", text="Main Address:").find_parents("section")[0]("p")[0].decode_contents(formatter="html")
In your case that will return:
u'120 East 4th Street<br/><span class="locality">Juneau</span>, <span class="region">AK</span>, <span class="postal-code">99801</span>'
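If only the text is wanted, without the inner tags, get_text() may be a better fit than decode_contents(). The following is a minimal sketch, assuming a reasonably recent bs4 (one where find() accepts string= in place of the older text= argument) and using only the "Main Address:" section of the page from the question as sample input. The whitespace normalisation with split() and join() is an addition of this sketch, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Only the "Main Address:" section of the page shown in the question.
HTML = """<section itemscope="" itemtype="http://microformats.org/wiki/hCard" class="otln">
<header><h3 itemprop="name">Main Address:</h3></header>
<p>120 East 4th Street<br>
<span class="locality">Juneau</span>,
<span class="region">AK</span>,
<span class="postal-code">99801</span></p>
</section>"""

soup = BeautifulSoup(HTML, "html.parser")

# Locate the heading, climb to its enclosing section, then take that section's <p>.
heading = soup.find("h3", string="Main Address:")
paragraph = heading.find_parent("section").find("p")

# get_text() drops the tags; split()/join() collapses the newlines between the spans.
address = " ".join(paragraph.get_text().split())
print(address)  # 120 East 4th Street Juneau, AK, 99801
```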

Insert element in a tag on the fly (all in the "content" side)

I need to modify the "content" side of a tag on the fly by appending some text.
I have (on the content side) the classic portal-tabs:
<ul class="nav" id="portal-globalnav">
....
<li id="portaltab-events" class="plain">
Eventi
</li>
</ul>
I need to append (via Diazo) the content of another tag (#numbers) on the fly, to obtain something like:
<ul class="nav" id="portal-globalnav">
....
<li id="portaltab-events" class="plain">
Eventi
<div id="#numbers">33</div>
</li>
</ul>
How can I solve this issue?
Thanks
You might see if this helps: http://docs.diazo.org/en/latest/recipes/modifying-text/index.html
Also, where does the #numbers div come from? If you append it to each LI tag, you'll end up with invalid HTML (more than one element with the same ID).
A content replace containing a little XSL should do it.
<replace css:content="#portaltab-events a">
<xsl:copy-of select="." />
<xsl:copy-of select="//*[@id='numbers']" />
<xsl:apply-templates />
</replace>
If you separately drop the #numbers div, you'll need to add mode="raw" to the apply-templates to prevent it from being dropped here.

Regular expression to remove lines with special characters

<a class='jdr' href='javascript:void(0);' onClick="return openDiv('jrtp');"></a>
<span class="jcn">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET" title='Aptech N Power Hardware & Networking' >Aptech N Power Hardware & Networkin...</a>
</span>
<section class="jrat">
<a rel="nofollow" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw"><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s10'></span><span class='s0'></span></a>
<a class="jrt" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET#rvw">2 ratings</a>
<span class="jrt"> |</span>
<a class="rate_this" onclick="_ct('ratethis','lspg');" href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET/writereview">Rate this</a>
</section>
<section class="jcar">
<section class="jbc">
<a href="http://example.com/Ahmedabad/Aptech-N-Power-Hardware-Networking-<near>-Toll-Naka-Opp-Kakadia-Hospital-Below-Sankalp-Reataurant-Bapu-Nagar/079PXX79-XX79-110420173655-D4K6_QWhtZWRhYmFkIENDTkEgVHJhaW5pbmcgSW5zdGl0dXRlcw==_BZDET">
<img width="83" height="56" border="0" src="http://images.jdmagicbox.com/upload_test/ahmedabad/b4/079pxx79.xx79.110420172948.d4b4/logo/faf3f2409ed7993aaa70f848ab0bb6fb_t.jpg" class="Clogo" />
</a>
<!-- <span class="noLogo"></span> -->
<section class="jrcl">
<p>
**A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024** | View Map<br>
</p>
From the above markup I want to extract the following:
A/35, Lakhani Chamber, Toll Naka, Opp Kakadia Hospital, Below Sankalp Reataurant, Bapu Nagar, Ahmedabad - 380024
I need help creating a regular expression to find and remove all lines containing special characters.
I am using the following regex:
/(\<.+?>)/g
Please help. Thanks
Try this:
/(?<=\*{2})([^<>]*?)(?=\*{2})/g
It matches all content between the two ** markers.
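I think you want to remove the lines that consist of HTML tags, so try this (the m flag is needed so that ^ matches at the start of every line, not only at the start of the input):
/^<.*>\n/gm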
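Both suggestions can be tried out in Python's re module. This is only a sketch: the sample string is a shortened, hypothetical version of the markup in the question, and re.MULTILINE plays the role of the m flag so that ^ matches at the start of each line.

```python
import re

# Shortened, hypothetical sample based on the markup in the question.
sample = (
    '<section class="jrcl">\n'
    "<p>\n"
    "**A/35, Lakhani Chamber, Toll Naka, Ahmedabad - 380024** | View Map<br>\n"
    "</p>\n"
)

# First suggestion: capture whatever sits between the two ** markers.
match = re.search(r"(?<=\*{2})([^<>]*?)(?=\*{2})", sample)
address = match.group(1) if match else None
print(address)

# Second suggestion: drop every line that starts with "<" and ends with ">".
without_tag_lines = re.sub(r"^<.*>\n", "", sample, flags=re.MULTILINE)
print(without_tag_lines)
```

The first pattern extracts the address directly. The second leaves the address line in place, still with its ** markers and the trailing "| View Map<br>", so it would need a further clean-up step.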

building django template files with xslt

I have about 4,000 HTML documents that I am trying to convert into Django templates using XSLT. The problem is that XSLT escapes the '{' curly braces of template variables when I try to include a template variable inside an attribute value.
My XSLT file looks like this:
<xsl:template match="p">
<p>
<xsl:attribute name="nid"><xsl:value-of select="$node_id"/></xsl:attribute>
<xsl:apply-templates select="*|node()"/>
</p>
<span>
{% get_comment_count for thing '<xsl:value-of select="$node_id"/>' as node_count %}
{{ node_count }} //This works as expected
</span>
<div>
<xsl:attribute name="class">HControl</xsl:attribute>
<xsl:text disable-output-escaping="yes">{% if node_count > 0 %}</xsl:text> // have to escape this because of the '>'
<div class="comment-list">
{% get_comment_list for thing '<xsl:value-of select="$node_id"/>' as node_comments %}
{% for comment in node_comments %}
<div class="comment {{ comment.object_id }}"> // this gets escaped
<a>
<xsl:attribute name="name">c{{ comment.id }}</xsl:attribute> //and so does this
</a>
<a>
<xsl:attribute name="href">
{% get_comment_permalink comment %}
</xsl:attribute>
permalink for comment #{{ forloop.counter }}
</a>
<div>
{{ comment.comment }}
</div>
</div>
{% endfor %}
</div>
{% endif %}
</div>
the output looks something like this:
<div>
<p nid="50:1r:SB:1101S:5">
<span class="Insert">B. A person who violates this section is guilty of a class 1 misdemeanor.</span>
</p>
<span>
1
</span>
<div class="HControl">
<div class="comment-list">
<div class="comment '{ comment.object_id }'"> // this should be class="comment #c123"
<a name="c%7B%7B%20comment.id%20%7D%7D"></a> // this should name="c123"
<a href="%7B%%20get_comment_permalink%20comment%20%%7D"> //this should be an href to the comment
permalink for comment #1
</a>
<div>
Well you should show some respect!
</div>
</div>
</div>
</div>
I transform the file with lxml.etree, then pass the resulting string to a Django template object and render it.
I just don't understand how to get the XSLT processor to leave the curly braces alone.
XSLT has its own purpose for curly braces - they are used in Attribute Value Templates, like this:
<!-- $someVariableOrExpression will be evaluated here -->
<div title="{$someVariableOrExpression}" />
To get literal curly braces into attribute values in XSLT, you need to escape them, which is done by doubling them:
<!-- the title will be "{$someVariableOrExpression}" here -->
<div title="{{$someVariableOrExpression}}" />
So if you want to output literal double curly braces, you need (guess what):
<div title="{{{{$someVariableOrExpression}}}}" />
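With thousands of documents, the doubling could be applied mechanically before a Django tag is written into an attribute of a literal result element in the stylesheet. A minimal sketch in Python follows; the helper name escape_for_avt is made up for this illustration, and it assumes the text ends up in an attribute value template, not in element content or inside xsl:attribute, where braces are not treated as AVT delimiters.

```python
def escape_for_avt(text: str) -> str:
    """Double every curly brace so an XSLT attribute value template
    outputs it literally (hypothetical helper, not part of any library)."""
    return text.replace("{", "{{").replace("}", "}}")


# A Django variable and a Django block tag, as they would have to appear
# inside an attribute of a literal result element in the stylesheet.
print(escape_for_avt("comment {{ comment.object_id }}"))
print(escape_for_avt("{% get_comment_permalink comment %}"))
```

The first call yields comment {{{{ comment.object_id }}}}, which the XSLT processor turns back into comment {{ comment.object_id }} in the output.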