How to keep only the first paragraph of $product.description_short in a listing?

I want to keep only the first paragraph of the product description in category listings.
Example: <p>This is a pretty good description.</p><p>The rest of the description, even if it's cool to I don't want it.</p>
To: <p>This is a pretty good description.</p>
This is the default code in product-list.tpl Prestashop (1.6):
<p class="product-desc" itemprop="description">
{$product.description_short|strip_tags:'UTF-8'|truncate:360:'...'}
</p>
This is what I tried:
<p class="product-desc" itemprop="description">
{assign var $newdescription = $product.description_short|strip_tags:'UTF-8'|truncate:360:'...'}
{preg_replace('(?<=<\/p>)\s+<p>.*','',$newdescription)}
</p>

{$_shorten = explode('</p>', $product.description_short)}
{* keep valid HTML by restoring the closing tag *}
{$_shorten.0|cat:'</p>'}
{* or, if you want to strip the tags: *}
{$_shorten.0|strip_tags}
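For anyone who wants to check the cutting logic outside a template, here is a minimal Python sketch of the same idea: split on the first closing </p> and keep what comes before it. The function names are made up for illustration, and the regex tag stripper is only a rough stand-in for Smarty's strip_tags, not an equivalent.

```python
import re


def first_paragraph(html):
    """Return everything up to and including the first closing </p> tag."""
    # partition() leaves the string unchanged if there is no </p> at all
    head, closing_tag, _rest = html.partition("</p>")
    return head + closing_tag


def first_paragraph_text(html):
    """Same cut, with the remaining tags removed (crude stand-in for strip_tags)."""
    head = html.partition("</p>")[0]
    return re.sub(r"<[^>]+>", "", head).strip()


description = (
    "<p>This is a pretty good description.</p>"
    "<p>The rest of the description.</p>"
)
print(first_paragraph(description))
print(first_paragraph_text(description))
```

Like the explode() version, this assumes the description really does start with a <p> element and that the closing tag is written exactly as </p>.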


Getting the first author in a pandoc template (Rmarkdown)

I am currently creating my own pandoc template for Rmarkdown (outputting html). I want my report to show a footer containing the title and first author's name.
Reading the pandoc manual, I saw that it is possible to use a pipe to get the first element of an array (var/first). So having the header:
title: "My Report"
author:
- "Jane Doe"
- "John Doe"
I tried to do the following in the template:
<div class="footer">
<div class="footer-content">
<div class="footer-title">
<h3>$title$</h3>
</div>
<div class="footer-author">
<h3>$author/first$</h3>
</div>
</div>
</div>
Then I got this error on the author line:
"template" (line 96, column 24):
unexpected "/"
expecting "." or "$"
Please note that the pandoc version used by rmarkdown is 2.3.1.
Is there a different way to accomplish that?
You could write a small Lua filter to get just the first author:
function Meta (meta)
local firstauthor = meta.author.t == 'MetaList'
and meta.author[1]
or meta.author
meta.firstauthor = firstauthor
return meta
end
You can then use $firstauthor$ in your template. See here for a brief discussion of Lua filters and how to use them.

How do I scrape nested data using selenium and Python

I basically want to scrape Litigation Paralegal under <h3 class="Sans-17px-black-85%-semibold"> and Olswang under <span class="pv-entity__secondary-title Sans-15px-black-55%">, but I can't seem to get to it. Here's the HTML:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
if tree.xpath('//*[@class="pv-entity__summary-info"]'):
    experience_title = tree.xpath('//*[@class="Sans-17px-black-85%-semibold"]/h3/text()')
    print(experience_title)
    experience_company = tree.xpath('//*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text()')
    print(experience_company)
My output:
Experience title : []
[]
Your XPath expressions are incorrect:
//*[@class="Sans-17px-black-85%-semibold"]/h3/text() means the text content of an h3 that is a child of an element whose class attribute is "Sans-17px-black-85%-semibold". Instead you need
//h3[@class="Sans-17px-black-85%-semibold"]/text()
which means the text content of the h3 element whose class attribute is "Sans-17px-black-85%-semibold".
In //*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text() you forgot a slash before text() (you need /text(), not just text()). Also, the target span has no class name pv-position-entity__secondary-title. You need to use
//span[@class="pv-entity__secondary-title Sans-15px-black-55%"]/text()
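To see the two corrected predicates in action without any third-party package, here is a small sketch using Python's standard xml.etree.ElementTree on a trimmed copy of the snippet from the question. This is an assumption-laden stand-in, not the lxml code from the question: ElementTree only supports a subset of XPath (no text() step), so the text is read from the .text attribute instead, and it only works here because the trimmed snippet happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the HTML from the question (well-formed, so ET can parse it)
SNIPPET = """<div class="pv-entity__summary-info">
  <h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
  <h4>
    <span class="visually-hidden">Company Name</span>
    <span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
  </h4>
</div>"""

root = ET.fromstring(SNIPPET)

# The predicate sits on the h3 / span itself, not on a parent element
title = root.find(".//h3[@class='Sans-17px-black-85%-semibold']")
company = root.find(
    ".//span[@class='pv-entity__secondary-title Sans-15px-black-55%']"
)

print(title.text)
print(company.text)
```

Note that [@class='...'] compares the whole attribute value, so the class string has to match exactly, including the order of the class names.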
You can get both of these easily with CSS selectors and I find them a lot easier to read and understand than XPath.
from selenium.webdriver.common.by import By
driver.find_element(By.CSS_SELECTOR, "div.pv-entity__summary-info > h3").text
driver.find_element(By.CSS_SELECTOR, "div.pv-entity__summary-info span.pv-entity__secondary-title").text
. indicates a class name
> indicates a child (one level below only)
a space indicates a descendant (any number of levels below)
Here are some references to get you started.
CSS Selectors Reference
CSS Selectors Tips
Advanced CSS Selectors

XSLT - Select all siblings of a given tag until another tag (again)

Given the following XML, I want to select every potential element between "First heading" and "Second heading", these heading elements excluded.
I am not sure what version of XSLT I can use (I'm modifying a sheet run by a proprietary app...)
<body>
<h1 class="heading1">Some title</h1>
<p class="bodytext">Some text.</p>
<p class="sectiontitle">First heading</p>
<p class="bodytext">Want that.</p>
<div>
<p class="bodytext">Want that too!</p>
</div>
<p class="sectiontitle">Second heading</p>
<p class="bodytext">Some text</p>
<p class="sectiontitle">Third heading</p>
...
</body>
Expected:
<p class="bodytext">Want that.</p>
<div>
<p class="bodytext">Want that too!</p>
</div>
I know that <p class="sectiontitle">First heading</p>:
will always be of the sectiontitle class.
will always contain First heading.
does not have to be the first p of this class; its position is unknown.
I also know that I will stop once I find <p class="sectiontitle">Could be any title</p> (so based on class only).
I have seen the other similar posts about this kind of problem, and I still can't crack my case...
What I have tried, amongst other things:
//*[(preceding-sibling::p/text()="First heading") and (not(following-sibling::p[@class="sectiontitle"]))]
You can use the following XPath expression (updated to avoid selecting the 2nd sectiontitle element) :
//p[@class='sectiontitle' and .='First heading']
  /following-sibling::*[
    preceding-sibling::p[@class='sectiontitle'][1] = 'First heading'
    and not(self::p/@class = 'sectiontitle')
  ]
Basically, the XPath returns following-sibling elements of the First Heading element, where the nearest preceding sibling 'sectiontitle' is the First Heading element itself.
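If it helps to check the selection logic outside XSLT, here is a minimal Python sketch that produces the same result on the sample document. It is not the XPath itself: the standard library's xml.etree.ElementTree has no following-sibling axis, so the sketch walks the children of body in document order instead. The helper names are made up for illustration, and it assumes the target heading occurs only once.

```python
import xml.etree.ElementTree as ET

BODY = """<body>
  <h1 class="heading1">Some title</h1>
  <p class="bodytext">Some text.</p>
  <p class="sectiontitle">First heading</p>
  <p class="bodytext">Want that.</p>
  <div>
    <p class="bodytext">Want that too!</p>
  </div>
  <p class="sectiontitle">Second heading</p>
  <p class="bodytext">Some text</p>
  <p class="sectiontitle">Third heading</p>
</body>"""


def is_section_title(element):
    """True for <p class="sectiontitle"> elements."""
    return element.tag == "p" and element.get("class") == "sectiontitle"


def siblings_after_heading(parent, heading_text):
    """Collect the siblings between the named heading and the next sectiontitle."""
    selected = []
    collecting = False
    for child in parent:
        if is_section_title(child):
            if collecting:
                break  # reached the next heading, stop
            collecting = child.text == heading_text
            continue  # headings themselves are never selected
        if collecting:
            selected.append(child)
    return selected


body = ET.fromstring(BODY)
selected = siblings_after_heading(body, "First heading")
print([element.tag for element in selected])
```

The loop mirrors the predicate: an element is kept only while the nearest preceding sectiontitle is the requested heading, and any sectiontitle element is skipped.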
I think this is more straightforward, meaning you can specify between which two headings you want the output :
//p[@class='sectiontitle' and text()='Second heading']/preceding-sibling::*[preceding-sibling::p[@class='sectiontitle'][1] = 'First heading']
For example, if you want to get the output between 'Second heading' and 'Third heading', just change 'Second heading' to 'Third heading' and 'First heading' to 'Second heading' in the above expression.
I discovered a great way to answer my own question using ids.
Let's say you want to select the following siblings of the current tag (a sectiontitle in my example), until you find any element that has a 'title' looking class, so for instance paragraphtitle or sectiontitle:
<xsl:variable name="thisgid" select="generate-id(.)" />
<xsl:apply-templates select="following-sibling::*[not(@class='sectiontitle' or @class='paragraphtitle')]
[generate-id(preceding-sibling::p[@class='sectiontitle'][1]) = $thisgid]"/>
That has solved many problems in my case.

How can I use non-ASCII characters?

I am using Scrapy and XPath to parse a website in Russian.
In this topic, alecxe suggested how to construct the XPath expression to get the values. However, I don't understand how I can handle the case when the Param1_name is in Russian.
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I didn't specify the question properly, since none of the suggested solutions worked for me, i.e. when I tested the suggested XPath expressions in the Scrapy console, the output was empty. So here is more detailed information about the website that I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:
Consider declaring your encoding at the beginning of the file as latin-1. See the documentation for a thorough explanation as to why.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from lxml import html
markup = """<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a unicode string, the u at the beginning of the XPath is necessary, same as @warwaruk's comment in your question.
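For comparison, here is a minimal sketch of the same lookup under Python 3, where every str is already Unicode, so no u prefix or coding line is needed. It is an assumption-laden stand-in rather than the code above: it uses the standard library's xml.etree.ElementTree on a shortened, well-formed version of the markup, and because ElementTree has no following-sibling::text(), it reads the element's .tail (the text that follows the closing tag) instead. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the markup with a Cyrillic label
MARKUP = """<div class="obj-params-col">
  <p><b>Некий текст</b> Param1_value</p>
  <p><strong>Param2_name</strong> Param2_value</p>
</div>"""

root = ET.fromstring(MARKUP)


def value_after_label(tree_root, label):
    """Return the text that directly follows the element whose text is `label`."""
    for element in tree_root.iter():
        if element.text == label:
            # .tail holds the text between this element's end tag and the next tag
            return (element.tail or "").strip()
    return None


print(value_after_label(root, "Некий текст"))
```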
Let us know if this helps.
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html
url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)
divs = tree.xpath("//div[@class='obj-left']")
for div in divs:
    name = div.xpath("./h3/span/a/text()")[0]
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]
    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")
This doesn't print them all out cleanly on my end (I get some [Decode error - output not utf-8]). However, I believe that, encoding aside, this approach is much better scraping practice overall.
Let us know what you think.

CSS3 class match letter range [a-z]+?

Is there any way to create a CSS definition for any element whose class is "icon-" followed by a set of letters, but not numbers?
According to this article something like:
[class^='/icon\-([a-zA-Z]+)/'] {}
should work. But for some reason it doesn't.
In particular, I need to create a style definition for all elements like "icon-user", "icon-ok", etc., but not "icon-16" or "icon-32".
Is it possible at all?
CSS attribute selectors do not support regular expressions.
If you actually read that article closely:
Regex Matching Attribute Selectors
They don’t exist, but wouldn’t that be so cool? I’ve no idea how hard it would be to implement, or how to expensive to parse, but wouldn’t it just be the bomb?
Notice the first three words. They don't exist. That article is nothing more than a blog post lamenting the absence of regex support in CSS attribute selectors.
But if you're using jQuery, James Padolsey's :regex selector for jQuery may interest you. Your given CSS selector might look like this for example:
$(":regex(class, ^icon\-[a-zA-Z]+)")
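To make explicit which class names the pattern is meant to accept, here is a small Python sketch that applies it to a list of names. This only illustrates the regular expression; it does nothing in the browser. The trailing $ is an addition of mine (an assumption about the intent) so that names such as icon-user2 are rejected, and the function name is made up.

```python
import re

# "icon-" followed by letters only; the trailing $ is an added assumption
ICON_CLASS = re.compile(r"^icon-[a-zA-Z]+$")


def matching_classes(class_names):
    """Keep only the class names that are 'icon-' plus letters."""
    return [name for name in class_names if ICON_CLASS.match(name)]


names = ["icon-user", "icon-ok", "icon-16", "icon-32", "icon-16-install"]
print(matching_classes(names))
```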
I answered this one on facebook but thought I'd best share here too :)
I haven't tested this, so don't shoot me if it doesn't work :) but my guess would be to explicitly target elements that contain the word icon in the class name, and to instruct the browser not to include those classes containing numbers.
Example code:
div[class|=icon]:not(.icon-16, .icon-32, icon-64, icon-96) {.....}
Reference:
attribute selectors... (http://www.w3.org/TR/CSS2/selector.html#attribute-selectors):
[att|=val]
Represents an element with the att attribute, its value either being exactly "val" or beginning with "val" immediately followed by "-" (U+002D).
:not selector...
(http://kilianvalkhof.com/2008/css-xhtml/the-css3-not-selector/)
Hope this helps,
Waseem
I tested my previous solution and can confirm that it DOES NOT work (see comment from BoltClock). This however does:
OP: "In particular I need to create style definition for all elements like "icon-user", "icon-ok" etc but not "icon-16" or "icon-32""
The required CSS code would look something like this:
/* target every element where the class attribute begins with ( ^= ) "icon" but NOT those that begin with ( ^= ) "icon-16" or "icon-32" */
*[class^="icon"]:not([class^="icon-16"]):not([class^="icon-32"]) {.....}
or
/* target every element where the class attribute begins with ( ^= ) "icon" but NOT those that contain ( *= ) the number "16" or the number "32" */
*[class^="icon"]:not([class*="16"]):not([class*="32"]) { ...... }
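As a rough way to reason about which elements these selectors pick up, here is a Python sketch that mimics the three attribute operators used in this thread on plain strings. It is only a model of the matching rules, not a CSS engine, and the function names are made up. One thing it makes visible: the operators compare against the whole class attribute value, so an element with class="btn icon-user" is not matched by [class^="icon"].

```python
def starts_with(value, prefix):
    """Model of [class^="prefix"]: attribute value begins with prefix."""
    return value.startswith(prefix)


def contains(value, needle):
    """Model of [class*="needle"]: attribute value contains needle."""
    return needle in value


def dash_match(value, val):
    """Model of [class|=val]: exactly val, or val immediately followed by '-'."""
    return value == val or value.startswith(val + "-")


def is_targeted(class_attr):
    """Model of *[class^="icon"]:not([class^="icon-16"]):not([class^="icon-32"])."""
    return (
        starts_with(class_attr, "icon")
        and not starts_with(class_attr, "icon-16")
        and not starts_with(class_attr, "icon-32")
    )


for class_attr in ["icon-user", "icon-ok", "icon-16-install", "icon-32", "btn icon-user"]:
    print(class_attr, is_targeted(class_attr))
```

Any other numeric size (icon-64, for example) would still be targeted, so each one has to be excluded with its own :not().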
Test code:
<!DOCTYPE html>
<html>
<head>
<style>
div{border:1px solid #999;margin-bottom:1em;height:100px;}
*[class|=icon]:not([class|=icon-16]):not([class|=icon-32]) {background:red;color:white;}
</style>
</head>
<body>
<div class="icon-something">
<h4>icon-something</h4>
<p><strong>IS</strong> targeted therefore background colour will be red</p>
</div>
<div class="icon-anotherthing">
<h4>icon-anotherthing</h4>
<p><strong>IS</strong> targeted therefore background colour will be red</p>
</div>
<div class="icon-16-install">
<h4>icon-16-install</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-16-redirect">
<h4>icon-16-redirect</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-16-login">
<h4>icon-16-login</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-install">
<h4>icon-32-install</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-redirect">
<h4>icon-32-redirect</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
<div class="icon-32-login">
<h4>icon-32-login</h4>
<p>Is <strong>NOT</strong> targeted therefore no background colour</p>
</div>
</body>
</html>