Extract text with bold content from css selector - python-2.7

I am trying to extract the text of forum posts, but the content of the bold element is ignored.
How can I extract the full text, i.e. Some text to extract bold content? Currently I only get Some text to extract ?
<blockquote class="messageText SelectQuoteContainer ugc baseHtml">
Some text to extract <b>bold content</b>?
</blockquote>
def parse_page(self, response):
    for quote in response.css('article'):
        yield {
            'text': quote.css('blockquote::text').extract()
        }

You need a space in your CSS selector:
'blockquote ::text'
           ^
With the space, the selector matches the text of every descendant node under blockquote; without it, it matches only the text directly inside the blockquote node.

Use the * selector to select the text of all elements nested inside an element. Join the fragments first and strip afterwards, so the space before the bold text is kept:
''.join(quote.css('blockquote *::text').extract()).strip()
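
The difference between the two selectors can be illustrated without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made for illustration, not how Scrapy works internally): the element's own text nodes play the role of blockquote::text, while itertext() walks all descendant text like blockquote ::text.

```python
import xml.etree.ElementTree as ET

html = ('<blockquote class="messageText SelectQuoteContainer ugc baseHtml">\n'
        'Some text to extract <b>bold content</b>?\n'
        '</blockquote>')
quote = ET.fromstring(html)

# Text nodes that are direct children of <blockquote> (like 'blockquote::text')
direct = [quote.text or ''] + [child.tail or '' for child in quote]
direct_text = ''.join(direct).strip()

# Text of every descendant node (like 'blockquote ::text')
full_text = ''.join(quote.itertext()).strip()

print(direct_text)  # Some text to extract ?
print(full_text)    # Some text to extract bold content?
```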

Related

Regex to find a particular pattern in Excel

This is the text in cell H1
<a class="stop-propagation" href="javascript:void(0);" data-link="/propertyDetails/poiOnMap.html?lat=19.2412011&longt=73.1290596&projectOrProp=Project&city=Thane&includeJs=y&type=poiMap2017&address=Thane, Maharashtra" id="map_link_27696295" onclick="stopPage=true; showPhotoMap('/propertyDetails/poiOnMap.html?lat=19.2412011&longt=73.1290596&projectOrProp=Project&city=Thane&includeJs=y&type=poiMap2017&address=Thane, Maharashtra');" style="outline: 1px solid blue;"><span class="icoMap"></span>Map</a>
From the above cell I'm trying to extract the first occurrence of lat and longt.
This is what I have tried:
=IF(LEFT(H1,2)="lat=",SUBSTITUTE(H1,"lat=",""),IF(RIGHT(H1,2)="lat=",SUBSTITUTE(H1,"lat=",""),H1))
But it doesn't give me the proper output.
This is what I expect:
lat=19.2412011
longt=73.1290596
Any help would be much appreciated.
Thanks
For the lat=19.2412011,
=TRIM(LEFT(SUBSTITUTE(REPLACE(H1,1,FIND("?",H1),TEXT(,)),"&",REPT(" ",LEN(H1))), LEN(H1)))
For the longt=73.1290596,
=TRIM(MID(SUBSTITUTE(REPLACE(H1,1,FIND("?",H1),TEXT(,)),"&",REPT(" ",LEN(H1))), LEN(H1), LEN(H1)))
For the two together in a single cell with a line feed,
=TRIM(LEFT(SUBSTITUTE(REPLACE(H1,1,FIND("?",H1),TEXT(,)),"&",REPT(" ",LEN(H1))),LEN(H1)))&CHAR(10)&TRIM(MID(SUBSTITUTE(REPLACE(H1,1,FIND("?",H1),TEXT(,)),"&",REPT(" ",LEN(H1))),LEN(H1),LEN(H1)))
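
If the cell text can be processed outside Excel, the same extraction can be sketched with an actual regular expression. The helper name first_param and the shortened sample string are assumptions made for this illustration; the pattern simply takes the first name=value pair and stops at the next &, quote, or whitespace.

```python
import re

def first_param(text, name):
    """Return the first 'name=value' pair found in text, or None."""
    match = re.search(r"\b" + re.escape(name) + r"=[^&\"'\s]+", text)
    return match.group(0) if match else None

# Shortened version of the text in cell H1
cell = ('<a class="stop-propagation" href="javascript:void(0);" '
        'data-link="/propertyDetails/poiOnMap.html?lat=19.2412011'
        '&longt=73.1290596&projectOrProp=Project&city=Thane">Map</a>')

print(first_param(cell, "lat"))    # lat=19.2412011
print(first_param(cell, "longt"))  # longt=73.1290596
```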

XML find and delete all text in doc not within a specified tag

I have an XML doc which is massive - a short example is below to illustrate formatting. What I want to do is find all the text in the doc which is not within a tag and delete it - so I am left with just a list of the data...
So here is the original:
51.639973121-2.161205923
112.0
<time>2017-02-19T11:26:45Z</time>
51.639902964-2.161258059
111.6
<time>2017-02-19T11:26:46Z</time>
51.639834484-2.161310529
111.6
<time>2017-02-19T11:26:47Z</time>
51.639765501-2.161366101
111.6
<time>2017-02-19T11:26:48Z</time>
51.639697859-2.161426451
111.8
<time>2017-02-19T11:26:49Z</time>
And once formatted - it will become:
<time>2017-02-19T11:26:45Z</time>
<time>2017-02-19T11:26:46Z</time>
<time>2017-02-19T11:26:47Z</time>
<time>2017-02-19T11:26:48Z</time>
<time>2017-02-19T11:26:49Z</time>
How can this be done?
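The following expression matches every line that is not a time tag, including its line break; replace the matches with an empty string to delete them:
^(?!<time>[^<]+<\/time>).*\R
It works only if each tag is on its own line, as in your example input. Note that \R (any line break) is not supported by every regex engine.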
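
As a rough sketch of the same idea in a script, the filtering can be done line by line in Python. Python's re module does not support \R, so this version keeps the lines that are time tags instead of deleting the others. The function name keep_time_tags and the sample string are made up for illustration.

```python
import re

# A line that consists of exactly one <time>...</time> tag
TIME_LINE = re.compile(r"<time>[^<]+</time>$")

def keep_time_tags(text):
    """Return only the lines of text that are <time> tags."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if TIME_LINE.match(stripped):
            kept.append(stripped)
    return "\n".join(kept)

sample = """51.639973121-2.161205923
112.0
<time>2017-02-19T11:26:45Z</time>
51.639902964-2.161258059
111.6
<time>2017-02-19T11:26:46Z</time>"""

print(keep_time_tags(sample))
```

For a very large file, the same test could be applied while iterating over the open file object, so the whole document never has to be held in memory.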

Construct Xpath

I have the following repeated piece of the web-page:
<div class="txt ext">
<strong class="param">param_value1</strong>
<strong class="param">param_value2</strong>
</div>
I would like to extract separately values param_value1 and param_value2 using Xpath. How can I do it?
I have tried the following constructions:
'//strong[@class="param"]/text()[0]'
'//strong[@class="txt ext"]/strong[@class="param"][0]/text()'
'//strong[@class="param"]'
none of which returned me separately param_value1 and param_value2.
P.S. I am using Python 2.7 and the latest version of Scrapy.
Here is my testing code:
from scrapy.selector import HtmlXPathSelector  # legacy Scrapy selector API
test_content = '<div class="txt ext"><strong class="param">param_value1</strong><strong class="param">param_value2</strong></div>'
sel = HtmlXPathSelector(text=test_content)
sel.select('//div/strong[@class="param"]/text()').extract()[0]
sel.select('//div/strong[@class="param"]/text()').extract()[1]
// means descendant or self. You are selecting any strong element in any context. [...] is a predicate which restricts your selection according to some boolean test. There is no strong element with a class attribute which equals txt ext, so you can exclude your second expression.
Your last expression will actually return a node-set of all the strong elements whose class attribute equals param. You can then extract individual nodes from the node-set (use [1], [2]) and then get their text contents (use text()).
Your first expression selects the text contents of both nodes, but it is also wrong: the [0] predicate is in the wrong place, and XPath positions start at 1, so node zero does not exist. If you want the text contents of the first node you should use:
//strong[@class="param"][1]/text()
and you can use
//strong[@class="param"][2]/text()
for the second text.
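
The 1-based positions can be tried out with a minimal sketch that uses only the standard library. It relies on xml.etree.ElementTree, which implements just a subset of XPath: there is no text() step, so the .text attribute is read instead, and a positional predicate counts among sibling elements with the same tag, which gives the same result for this snippet. The variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

test_content = ('<div class="txt ext">'
                '<strong class="param">param_value1</strong>'
                '<strong class="param">param_value2</strong>'
                '</div>')
root = ET.fromstring(test_content)

# Positions start at 1, as in the XPath expressions above
first_value = root.find(".//strong[@class='param'][1]").text
second_value = root.find(".//strong[@class='param'][2]").text

# All matching values at once, in document order
all_values = [node.text for node in root.findall(".//strong[@class='param']")]

print(first_value)   # param_value1
print(second_value)  # param_value2
```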

Qt : QXmlQuery and XPaths

I'm here to ask you some help with QXmlQuery and Xpath.
I'm trying to use this combination to extract some data from several HTML documents.
These documents are downloaded and then cleaned with the HTML Tidy Library.
The problem is when I try my XPath. Here is an example code :
[...]
<ul class="bullet" id="idTab2">
<li><span>Hauteur :</span> 1127 mm</li>
<li><span>Largeur :</span> 640 mm</li>
<li><span>Profondeur :</span> 685 mm</li>
<li><span>Poids :</span> 159.6 kg</li>
[...]
The clean code is stored in a QString "code" :
QStringList fields, values;
QXmlQuery query;
query.setFocus(code);
query.setQuery("//*[@id=\"idTab2\"]/*/*/string()");
query.evaluateTo(&fields);
My goal is to get all the fields (Hauteur, Largeur, Profondeur, Poids, etc.) and their value (1127 mm, 640 mm, 685 mm, 159.6 kg, etc.).
Question 1
As you can see, I use the XPath //*[@id="idTab2"]/*/*/string() to retrieve the fields, because //ul[@id="idTab2"]/li/span/string() doesn't work. When I try to specify a tag name, it gives me nothing; it only works with *. Why? I've checked the code returned by the tidy function and the XPath is not altered, so I don't see any problem. Is this normal? Or maybe there is something I don't know...
Question 2
In the previous XHTML code, the li tags wrap a span tag and some text. I don't know how to get only the text and not the content of the span tag. I tried:
//*[@id="idTab2"]/*/string() gives: Hauteur : 1127 mm Largeur : 640 mm Profondeur : 685 mm
//*[@id="idTab2"]/*[2]/string() gives: nothing
So, if I'm not wrong, the text in the li tag is not considered as a child node but it should be. See the accepted answer : Select just text directly in node, not in child nodes.
Thanks for reading, I hope someone can help me.
To get the elements (not the text representation) inside the different <li>s, you can test the text content:
//*[@id=\"idTab2\"]/li[starts-with(span, "Hauteur")]
The same works for the other items:
//*[@id=\"idTab2\"]/li[starts-with(span, "Largeur")]
//*[@id=\"idTab2\"]/li[starts-with(span, "Profondeur")]
//*[@id=\"idTab2\"]/li[starts-with(span, "Poids")]
To get the string representation of one of these <li>, wrap the whole expression in string(), like this:
string(//*[@id=\"idTab2\"]/li[starts-with(span, "Poids")])
which gives "Poids : 159.6 kg"
To extract only the text node in the <li>, without the <span>, you can use these expressions, which select the text nodes that are direct children of <li> (the <span> is an element, not a text node) and remove the leading and trailing whitespace characters with normalize-space():
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Hauteur")]/text())
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Largeur")]/text())
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Profondeur")]/text())
normalize-space(//*[@id=\"idTab2\"]/li[starts-with(span, "Poids")]/text())
The last one gives "159.6 kg"
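
The split between the <span> content and the text that is a direct child of <li> can also be sketched in Python with the standard library. This is an illustration only, not QXmlQuery code: in xml.etree.ElementTree the text that follows a child element is stored in that child's tail attribute. The closing </ul> and the specs dictionary are assumptions added to make the snippet self-contained.

```python
import xml.etree.ElementTree as ET

snippet = """<ul class="bullet" id="idTab2">
<li><span>Hauteur :</span> 1127 mm</li>
<li><span>Largeur :</span> 640 mm</li>
<li><span>Profondeur :</span> 685 mm</li>
<li><span>Poids :</span> 159.6 kg</li>
</ul>"""

root = ET.fromstring(snippet)
specs = {}
for li in root.findall("li"):
    span = li.find("span")
    field = span.text.rstrip(" :")      # "Hauteur :" -> "Hauteur"
    value = (span.tail or "").strip()   # text node after the <span>
    specs[field] = value

print(specs["Hauteur"])  # 1127 mm
print(specs["Poids"])    # 159.6 kg
```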

re pulls data from one tag and not the other

I am trying to get a program to work that parses HTML-like tags for a TREC collection. I don't program often, except for databases, and I am getting stuck on the syntax. Here's my current code:
import re

# In the following code, the p regex worked in Python
def parseTREC(atext):
    atext = open(atext, "r")
    filePath = "testLA.txt"
    docID = []
    docTXT = []
    p = re.compile('<DOCNO>(.*?)</DOCNO>', re.IGNORECASE)
    m = re.compile('<P>(.*?)</P>', re.IGNORECASE)
    for aline in atext:
        values = str(aline)
        if p.findall(values):
            docID.append(p.findall(values))
        if m.findall(values):
            docID.append(p.findall(values))
    print docID
    atext.close()

parseTREC('LA010189.txt')
The p regex pulled the DOCNO as it was supposed to. The m regex, though, would not pull any data and would print an empty list. I am pretty sure that there are white spaces and also a new line. I tried re.M, and that did not help pull the data from the other lines. Ideally I would like to get to the point where I store the result in a dictionary {DOCNO: Count}. Count would be determined by summing up every word that is in the P tags and also in a list []. I would appreciate any suggestions or advice.
You can try removing all the line breaks from the file if you think they are affecting your regex results. Also, make sure you don't have nested <P> tags, because your regex may not match as expected. For example:
<p>
  <p>
    <p>here's some data</p>
    And some more data.
  </p>
  And even more data.
</p>
Because of the non-greedy "?", the match stops at the first closing tag, so (once the regex can span lines) it covers only this section:
<p>
  <p>
    <p>here's some data</p>
Also, is this a typo:
if p.findall(values):
    docID.append(p.findall(values))
if m.findall(values):
    docID.append(p.findall(values))
Should that be:
docID.append(m.findall(values))
on the last line?
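Add the re.DOTALL flag like so:
m = re.compile('<P>(.*?)</P>',
               re.IGNORECASE | re.DOTALL)
You may want to add this to the other regex as well. Note that re.DOTALL only helps when the regex is applied to text that still contains the line breaks, for example the whole file content from read(), rather than to one line at a time.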
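
Putting these suggestions together, a minimal sketch of the {DOCNO: Count} dictionary could look like the following. It assumes each document is wrapped in <DOC> tags and runs the regexes over one string rather than line by line. The function name count_words, the optional vocabulary argument (standing in for the word list mentioned in the question), and the sample text are all hypothetical.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
DOC_RE = re.compile(r'<DOC>(.*?)</DOC>', FLAGS)
DOCNO_RE = re.compile(r'<DOCNO>(.*?)</DOCNO>', FLAGS)
P_RE = re.compile(r'<P>(.*?)</P>', FLAGS)
WORD_RE = re.compile(r'[a-z]+', re.IGNORECASE)

def count_words(text, vocabulary=None):
    """Map each DOCNO to the number of words inside its <P> tags."""
    counts = {}
    for doc in DOC_RE.findall(text):
        docno = DOCNO_RE.search(doc)
        if docno is None:
            continue
        total = 0
        for paragraph in P_RE.findall(doc):
            words = WORD_RE.findall(paragraph)
            if vocabulary is not None:
                # Count only the words that appear in the given list
                words = [w for w in words if w.lower() in vocabulary]
            total += len(words)
        counts[docno.group(1).strip()] = total
    return counts

# Made-up sample in the assumed format
sample = """<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<P>
First paragraph
spans two lines.
</P>
<P>
Second one.
</P>
</DOC>"""

print(count_words(sample))
print(count_words(sample, vocabulary={'paragraph', 'second'}))
```

For a real file, the text argument would be the result of open(path).read().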
from xml.dom.minidom import parseString
import re

def parseTREC2(atext):
    fc = open(atext, 'r').read()
    # Wrap the content in a single root element so it parses as XML
    fc = '<DOCS>\n' + fc + '\n</DOCS>'
    dom = parseString(fc)
    w_re = re.compile('[a-z]+', re.IGNORECASE)
    doc_nodes = dom.getElementsByTagName('DOC')
    for doc_node in doc_nodes:
        docno = doc_node.getElementsByTagName('DOCNO')[0].firstChild.data
        cnt = 1
        for p_node in doc_node.getElementsByTagName('P'):
            p = p_node.firstChild.data
            words = w_re.findall(p)
            print "\t".join([docno, str(cnt), p])
            print words
            cnt += 1

parseTREC2('LA010189.txt')
The code wraps the file content in a <DOCS> root element because the file has no single parent tag, which an XML parser requires. The program then retrieves the information through the XML parser.