XSLT - Select all siblings of a given tag until another tag (again) - xslt

Given the following XML, I want to select every potential element between "First heading" and "Second heading", these heading elements excluded.
I am not sure what version of XSLT I can use (I'm modifying a sheet run by a proprietary app...)
<body>
<h1 class="heading1">Some title</h1>
<p class="bodytext">Some text.</p>
<p class="sectiontitle">First heading</p>
<p class="bodytext">Want that.</p>
<div>
<p class="bodytext">Want that too!</p>
</div>
<p class="sectiontitle">Second heading</p>
<p class="bodytext">Some text</p>
<p class="sectiontitle">Third heading</p>
...
</body>
Expected:
<p class="bodytext">Want that.</p>
<div>
<p class="bodytext">Want that too!</p>
</div>
I know that <p class="sectiontitle">First heading</p>:
will always be of the sectiontitle class.
will always contain First heading.
does not have to be the first p of this class; its position is unknown.
I also know that I will stop once I find <p class="sectiontitle">Could be any title</p> (so based on class only).
I have seen the other similar posts about this kind of problem, and I still can't crack my case...
What I have tried, amongst other things:
//*[(preceding-sibling::p/text()="First heading") and (not(following-sibling::p[@class="sectiontitle"]))]

You can use the following XPath expression (updated to avoid selecting the 2nd sectiontitle element) :
//p[@class='sectiontitle' and .='First heading']
/following-sibling::*[
preceding-sibling::p[@class='sectiontitle'][1] = 'First heading'
and not(self::p/@class = 'sectiontitle')
]
Basically, the XPath returns following-sibling elements of the First Heading element, where the nearest preceding sibling 'sectiontitle' is the First Heading element itself.
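To check the selection logic outside an XSLT processor, here is a minimal Python sketch of the same idea. It does not evaluate the XPath itself (the standard library's xml.etree.ElementTree has no following-sibling axis); it walks the children of body in document order instead. The names SAMPLE, is_section_title and siblings_after_heading are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (the "..." placeholder is left out).
SAMPLE = """<body>
<h1 class="heading1">Some title</h1>
<p class="bodytext">Some text.</p>
<p class="sectiontitle">First heading</p>
<p class="bodytext">Want that.</p>
<div>
<p class="bodytext">Want that too!</p>
</div>
<p class="sectiontitle">Second heading</p>
<p class="bodytext">Some text</p>
<p class="sectiontitle">Third heading</p>
</body>"""


def is_section_title(el):
    """True for <p class="sectiontitle"> elements."""
    return el.tag == "p" and el.get("class") == "sectiontitle"


def siblings_after_heading(parent, heading_text):
    """Return the children of parent that sit between the named
    sectiontitle and the next sectiontitle, both excluded."""
    selected = []
    collecting = False
    for child in parent:
        if is_section_title(child):
            if collecting:
                break  # reached the next heading: stop
            collecting = (child.text or "").strip() == heading_text
            continue
        if collecting:
            selected.append(child)
    return selected


body = ET.fromstring(SAMPLE)
between = siblings_after_heading(body, "First heading")
print([el.tag for el in between])  # ['p', 'div']
```

Like the XPath, this selects only direct siblings of the heading, so the nested p inside the div comes along as part of the div rather than as a separate result.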

I think this is more straightforward, because you name both headings that bound the output:
//p[@class='sectiontitle' and text()='Second heading']/preceding-sibling::*[preceding-sibling::p[@class='sectiontitle'][1] = 'First heading']
For example, to get the output between 'Second heading' and 'Third heading', change 'Second heading' to 'Third heading' and 'First heading' to 'Second heading' in the expression above.

I discovered a great way to answer my own question using ids.
Let's say you want to select the following siblings of the current tag (a sectiontitle in my example), until you find any element that has a 'title' looking class, so for instance paragraphtitle or sectiontitle:
<xsl:variable name="thisgid" select="generate-id(.)" />
<xsl:apply-templates select="following-sibling::*[not(@class='sectiontitle' or @class='paragraphtitle')]
[generate-id(preceding-sibling::p[@class='sectiontitle'][1]) = $thisgid]"/>
That has solved many problems in my case.
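As a rough cross-check of what that select expression picks up for each sectiontitle, here is a small Python sketch using only the standard library. It mirrors the expression as written: elements with a title-like class are skipped, and every other sibling is assigned to the nearest preceding p of class sectiontitle. The sample markup, TITLE_CLASSES and group_by_section are invented for illustration and are not part of the stylesheet.

```python
import xml.etree.ElementTree as ET

TITLE_CLASSES = {"sectiontitle", "paragraphtitle"}

# Hypothetical input, shaped like the document in the question.
SAMPLE = """<body>
<p class="bodytext">Intro</p>
<p class="sectiontitle">First heading</p>
<p class="bodytext">A</p>
<p class="paragraphtitle">Sub</p>
<p class="bodytext">B</p>
<p class="sectiontitle">Second heading</p>
<div>C</div>
</body>"""


def group_by_section(parent):
    """Return a list of (title text, [elements]) pairs, one per sectiontitle."""
    groups = []
    for child in parent:
        cls = child.get("class")
        if child.tag == "p" and cls == "sectiontitle":
            # A new sectiontitle becomes the "nearest preceding" one.
            groups.append(((child.text or "").strip(), []))
        elif cls in TITLE_CLASSES:
            # Title-like elements are filtered out, as in the first predicate.
            continue
        elif groups:
            groups[-1][1].append(child)
    return groups


sections = group_by_section(ET.fromstring(SAMPLE))
print([(title, [el.text for el in els]) for title, els in sections])
```

Note that, as in the XSLT, a paragraphtitle is left out of the result but does not end the group: "B" is still attached to "First heading".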

Related

How to keep only the first paragraph of $product.description_short in a listing?

I want to keep only the first paragraph of product description in categories.
Example: <p>This is a pretty good description.</p><p>The rest of the description, even if it's cool to I don't want it.</p>
To : <p>This is a pretty good description.</p>
This is the default code in product-list.tpl Prestashop (1.6):
<p class="product-desc" itemprop="description">
{$product.description_short|strip_tags:'UTF-8'|truncate:360:'...'}
</p>
This is what I tried:
<p class="product-desc" itemprop="description">
{assign var $newdescription = $product.description_short|strip_tags:'UTF-8'|truncate:360:'...'}
{preg_replace('(?<=<\/p>)\s+<p>.*','',$newdescription)}
</p>
Split the description on the closing </p> tag and keep only the first piece:
{$_shorten = explode('</p>', $product.description_short)}
{* with valid HTML tags *}
{$_shorten.0|cat:'</p>'}
{* or if you want to strip tags: *}
{$_shorten.0|strip_tags}
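The same split-and-keep-the-first-piece idea, sketched in Python so it can be tried outside a template. This is only an illustration of the logic, not PrestaShop or Smarty code; first_paragraph and strip_tags are hypothetical helper names, and the tag stripping is a naive regex that assumes simple markup.

```python
import re


def first_paragraph(html):
    """Keep everything up to and including the first closing </p> tag.
    If there is no </p>, the input is returned unchanged."""
    head, closing, _rest = html.partition("</p>")
    return head + closing


def strip_tags(html):
    """Naive tag removal, good enough for simple markup."""
    return re.sub(r"<[^>]+>", "", html)


description = (
    "<p>This is a pretty good description.</p>"
    "<p>The rest of the description.</p>"
)
print(first_paragraph(description))
print(strip_tags(first_paragraph(description)))
```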

How do I scrape nested data using selenium and Python

I basically want to scrape Litigation Paralegal under <h3 class="Sans-17px-black-85%-semibold"> and Olswang under <span class="pv-entity__secondary-title Sans-15px-black-55%">, but I can't seem to get to it. Here's the HTML:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
if tree.xpath('//*[@class="pv-entity__summary-info"]'):
    experience_title = tree.xpath('//*[@class="Sans-17px-black-85%-semibold"]/h3/text()')
    print(experience_title)
    experience_company = tree.xpath('//*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text()')
    print(experience_company)
My output:
Experience title : []
[]
Your XPath expressions are incorrect:
//*[@class="Sans-17px-black-85%-semibold"]/h3/text() selects the text of an h3 that is a child of an element whose class attribute is "Sans-17px-black-85%-semibold". Instead you need
//h3[@class="Sans-17px-black-85%-semibold"]/text()
which selects the text of the h3 element whose class attribute is "Sans-17px-black-85%-semibold".
In //*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text() you forgot the slash before text() (you need /text(), not just text()). Also, the target span has no class name pv-position-entity__secondary-title. You need to use
//span[@class="pv-entity__secondary-title Sans-15px-black-55%"]/text()
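The corrected predicates can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree instead of the tree object from the question, which is an assumption made to keep the example self-contained. ElementTree only supports a subset of XPath: there is no text() step, so the text is read from the .text attribute, and the snippet has to be well-formed XML, which a real page may not be.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the HTML from the question.
SNIPPET = """<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
</div>"""

root = ET.fromstring(SNIPPET)

# The predicate goes on the element that carries the class, not on its parent.
title = root.find(".//h3[@class='Sans-17px-black-85%-semibold']")
company = root.find(
    ".//span[@class='pv-entity__secondary-title Sans-15px-black-55%']"
)

print(title.text)    # Litigation Paralegal
print(company.text)  # Olswang
```

The @class='...' test is an exact string comparison against the whole attribute value, which is why the extra pv-position-entity__secondary-title token in the original expression made it match nothing.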
You can get both of these easily with CSS selectors and I find them a lot easier to read and understand than XPath.
driver.find_element_by_css_selector("div.pv-entity__summary-info > h3").text
driver.find_element_by_css_selector("div.pv-entity__summary-info span.pv-entity__secondary-title").text
. indicates class name
> indicates child (one level below only)
(a space) indicates a descendant (any number of levels below)
Here are some references to get you started.
CSS Selectors Reference
CSS Selectors Tips
Advanced CSS Selectors

jSoup - How to get elements with background style (inline CSS)?

I'm building an app in Railo, which uses the jSoup .jar library. It all works really well in my CFML language.
Anyhow, I can grab every element with a "style" attribute doing:
<cfset variables.mySelection = variables.myDocument.select("*[style]") />
But this returns an array containing elements that sometimes have no "background" or "background-image" style. As an example, the HTML might look like this:
<p style="color: red;">I should not be selected</p>
<p style="background: green">I **should** be selected</p>
<p style="text-align: left;">I should not be selected</p>
<p style="background-image: url('/path/to/image.jpg');">I **should** be selected</p>
So I can get these elements above, but I don't want the 1st and 3rd in my array, as they don't have a background style...do you know how I can only grab and work with these?
Please note, I'm not after a computed style, or anything that complicated. I'm just wondering if I can filter based on the properties of an inline CSS style. Perhaps some regex after the fact? I'm open to ideas!
I tried messing with :contains(background) as a key word, but I wasn't sure if that was the correct path?
Many thanks for your help.
Michael.
Try with:
variables.myDocument.select("*[style*='background']")
*= is the standard attribute selector for matching a substring of the attribute value.
Elements els = doc.select("div[style*=dashed]");
Or
Elements elements = doc1.select("span[style*=font-weight:bold]");
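One thing to keep in mind is that a substring match on "background" also matches properties such as background-color. If only background and background-image should count, the style value can be checked property by property after the selection, along the lines of the "filter after the fact" idea in the question. The sketch below shows that check in plain Python rather than CFML or Java; has_background and WANTED are hypothetical names, and the parsing is deliberately naive (it splits on semicolons, so unusual values containing semicolons are not handled carefully).

```python
WANTED = {"background", "background-image"}


def has_background(style):
    """True if an inline style string declares one of the WANTED properties."""
    for declaration in style.split(";"):
        # Split at the first colon only, so URLs in the value stay intact.
        name, colon, _value = declaration.partition(":")
        if colon and name.strip().lower() in WANTED:
            return True
    return False


styles = [
    "color: red;",
    "background: green",
    "text-align: left;",
    "background-image: url('/path/to/image.jpg');",
]
print([has_background(s) for s in styles])  # [False, True, False, True]
```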

can't add a link to an entire div section

I have a problem with TinyMCE in Joomla 2.5.4. For a few days now I have tried to add a link to a div section (like <div> something </div>) but failed: the anchor is stripped from the HTML because TinyMCE treats it as invalid HTML4. After three days of research I gave up and used an unordered list instead of a div.
Now when I try to add a link to a list item (like <li> <p> something </p> </li>), TinyMCE rearranges everything and moves the anchor inside the list item (like <li> <a href="#"> <p> something </p> </a> </li>).
I have tried pretty much everything, from valid_elements : "[]" to setting the text filter to No Filtering, but I have run out of ideas.
Can anyone please help me?
Try playing around with TinyMCE's html5 options: http://www.tinymce.com/tryit/html5_formats.php
Hit "view source" to see how they're doing it. It's mainly this option inside tinyMCE.init:
schema: "html5",

How to embed <pre> tag in a list in a wiki

I am trying to embed a <pre> tag within an ordered list, of the form:
# Some content
#: <pre>
Some pre-formatted content
</pre>
But it doesn't work. Can someone please tell me how to achieve this?
You can use a regular HTML list:
<ol>
<li>Some Content</li>
<li><dl><dd><pre>Some pre-formatted content</pre></dd></dl></li>
</ol>
This is a better way to continue a numbered list after the <pre> tag without resorting to HTML:
# one
#:<pre>
#:some stuff
#:some more stuff</pre>
# two
Produces:
1. one
some stuff
some more stuff
2. two