I need some help with QXmlQuery and XPath.
I'm trying to use this combination to extract some data from several HTML documents.
These documents are downloaded and then cleaned with the HTML Tidy Library.
The problem appears when I run my XPath expressions. Here is a sample of the markup:
[...]
<ul class="bullet" id="idTab2">
<li><span>Hauteur :</span> 1127 mm</li>
<li><span>Largeur :</span> 640 mm</li>
<li><span>Profondeur :</span> 685 mm</li>
<li><span>Poids :</span> 159.6 kg</li>
[...]
The cleaned markup is stored in a QString named code:
QStringList fields, values;
QXmlQuery query;
query.setFocus(code);
query.setQuery("//*[@id=\"idTab2\"]/*/*/string()");
query.evaluateTo(&fields);
My goal is to get all the fields (Hauteur, Largeur, Profondeur, Poids, etc.) and their value (1127 mm, 640 mm, 685 mm, 159.6 kg, etc.).
Question 1
As you can see, I use the XPath //*[@id="idTab2"]/*/*/string() to retrieve the fields, because //ul[@id="idTab2"]/li/span/string() doesn't work. As soon as I specify a tag name, the query returns nothing; it only works with *. Why? I've checked the markup returned by the tidy function and the structure is not altered, so I don't see any problem. Is this normal, or is there something I'm missing?
Question 2
In the XHTML above, each li tag wraps a span tag and some text. I don't know how to get only the text and not the content of the span tag. I tried:
//*[@id="idTab2"]/*/string() gives: Hauteur : 1127 mm Largeur : 640 mm Profondeur : 685 mm
//*[@id="idTab2"]/*[2]/string() gives: nothing
So, unless I'm mistaken, the text in the li tag is not treated as a child node, although it should be. See the accepted answer to: Select just text directly in node, not in child nodes.
Thanks for reading, I hope someone can help me.
To get the elements (not the text representation) inside the different <li>s, you can test the text content:
//*[@id="idTab2"]/li[starts-with(span, "Hauteur")]
The same works for the other items:
//*[@id="idTab2"]/li[starts-with(span, "Largeur")]
//*[@id="idTab2"]/li[starts-with(span, "Profondeur")]
//*[@id="idTab2"]/li[starts-with(span, "Poids")]
To get the string representation of these <li>, you can use string() around the whole expression, like this:
string(//*[@id="idTab2"]/li[starts-with(span, "Poids")])
which gives "Poids : 159.6 kg"
To extract only the text in the <li>, without the <span>, you can use the following expressions. They select the text nodes that are direct children of <li> (the <span> is an element, not a text node) and strip the leading and trailing whitespace with normalize-space():
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Hauteur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Largeur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Profondeur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Poids")]/text())
The last one gives "159.6 kg"
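The split between the text inside the <span> and the text that follows it can also be sketched outside Qt. This is only an illustration using Python's standard xml.etree.ElementTree, not QXmlQuery, and it assumes the list from the question is closed with </ul> so that it parses as XML. ElementTree has no normalize-space(), so the whitespace is trimmed in Python instead.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment from the question, closed so it is well-formed XML.
snippet = """<ul class="bullet" id="idTab2">
<li><span>Hauteur :</span> 1127 mm</li>
<li><span>Largeur :</span> 640 mm</li>
<li><span>Profondeur :</span> 685 mm</li>
<li><span>Poids :</span> 159.6 kg</li>
</ul>"""

root = ET.fromstring(snippet)  # root is the <ul> element

pairs = {}
for li in root.findall("li"):
    span = li.find("span")
    # span.text is the label inside <span>; drop the trailing " :".
    field = span.text.rstrip(" :")
    # The text directly inside <li> that follows <span> is the span's tail.
    value = (span.tail or "").strip()
    pairs[field] = value

print(pairs)
```

Each value comes from the text node that is a direct child of <li>, which is what the /text() step selects in the expressions above.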
I'm trying to create a list in Markdown. As I've read in some documentation, if I write this Markdown code:
My list
* first item
* second item
* third item
Not in the list
I should get the same result as if I had written this HTML:
<p>My list</p>
<ul>
<li>first item</li>
<li>second item</li>
<li>third item</li>
</ul>
<p>Not in the list</p>
I use Atom as editor and its Markdown previewer and everything is OK, but when I use pandoc to convert my Markdown file as follows:
pandoc test.md -o test.odt
what I get is this:
My list * first item * second item * third item
Not in the list
What am I doing wrong?
There are two possible solutions to your problem:
Add a blank line between the paragraph and the list (as @melpomene mentioned in a comment).
My list

* first item
* second item
* third item

Not in the list
Leave out the blank line and tell Pandoc to use commonmark as the input format rather than the default, markdown.
pandoc -f commonmark -o test.odt test.md
The "problem" is that the Atom editor uses a CommonMark parser, whereas Pandoc by default uses an old-school Markdown parser which mostly follows the original Markdown rules and the reference implementation (markdown.pl). In fact, the CommonMark spec specifically acknowledges this difference:
In CommonMark, a list can interrupt a paragraph. That is, no blank
line is needed to separate a paragraph from a following list:
Foo
- bar
- baz
<p>Foo</p>
<ul>
<li>bar</li>
<li>baz</li>
</ul>
Markdown.pl does not allow this, through fear of triggering a list
via a numeral in a hard-wrapped line:
The number of windows in my house is
14. The number of doors is 6.
If you want common behavior among your tools, then you need to only use tools which follow the same behavior.
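If you have many files written in the CommonMark style, the first solution (adding the blank line) could be automated. The following is only a rough sketch in Python under some assumptions: the function name separate_lists is made up for this example, it only looks at bullet markers (*, + and -), and it does not try to handle code blocks or indented continuation lines. Numbered markers are deliberately left out because of the hard-wrapped numeral problem quoted above.

```python
import re

# A bullet list item: up to three spaces, then *, + or -, then whitespace.
BULLET_ITEM = re.compile(r"^ {0,3}[*+-]\s+")

def separate_lists(markdown_text):
    """Insert a blank line before a bullet list that directly follows a paragraph."""
    out = []
    for line in markdown_text.splitlines():
        previous = out[-1] if out else ""
        starts_list = (
            BULLET_ITEM.match(line)
            and previous.strip()                 # previous line is not blank
            and not BULLET_ITEM.match(previous)  # and is not a list item itself
        )
        if starts_list:
            out.append("")
        out.append(line)
    return "\n".join(out)

source = "My list\n* first item\n* second item\n* third item\n\nNot in the list"
print(separate_lists(source))
```

The output has a blank line between "My list" and the first item, which is the form Pandoc's default markdown reader expects.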
I want to create a regex that captures property names in JSON objects so that I can color them. Then, in a loop (in TypeScript), I will wrap each captured match in a span with a class that sets the color.
For example:
I have an object that looks like this
{"restriction_data" : "ALL","old" : null,"new" : ["ALL"],"record_type" : "product"}
I want to get restriction_data, old, new and record_type from the regex and color them red.
Other JSON that I can get is:
{"category":"category is mandatory","field":"[u'NONE'] is invalid field. Found: NONE","description":"description is mandatory"}
Likewise, I want to get category, field and description and color them red.
I tried \"(.*?)\" regex, but it doesn't quite work for me.
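One possible direction, sketched in Python rather than TypeScript, is to match a quoted string only when it is followed by a colon, so that string values are skipped. This is a sketch, not a full JSON tokenizer. The names KEY_PATTERN, property_names, highlight_keys and the CSS class json-key are all made up for this illustration.

```python
import re

# A double-quoted string (allowing backslash escapes) that is followed by
# optional whitespace and a colon. The lookahead keeps the colon out of the match.
KEY_PATTERN = re.compile(r'"((?:[^"\\]|\\.)*)"(?=\s*:)')

def property_names(json_text):
    """Return the property names found in json_text, in order of appearance."""
    return KEY_PATTERN.findall(json_text)

def highlight_keys(json_text, css_class="json-key"):
    """Wrap each quoted property name in a span carrying css_class."""
    return KEY_PATTERN.sub(
        lambda m: '<span class="%s">%s</span>' % (css_class, m.group(0)),
        json_text,
    )

first = '{"restriction_data" : "ALL","old" : null,"new" : ["ALL"],"record_type" : "product"}'
second = ('{"category":"category is mandatory",'
          '"field":"[u\'NONE\'] is invalid field. Found: NONE",'
          '"description":"description is mandatory"}')

print(property_names(first))
print(property_names(second))
print(highlight_keys('{"a" : 1}'))
```

The same pattern should carry over to a JavaScript or TypeScript regex, since it only uses a capture group and a lookahead. If the input is always valid JSON, parsing it and walking the keys would be more robust than any regex.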
I am trying to extract text from forum posts, but the content of the bold element is ignored.
How can I extract the full text, "Some text to extract bold content?", from the markup below? Currently I only get "Some text to extract".
<blockquote class="messageText SelectQuoteContainer ugc baseHtml">
Some text to extract <b>bold content</b>?
</blockquote>
def parse_page(self, response):
    for quote in response.css('article'):
        yield {
            'text': quote.css('blockquote::text').extract()
        }
You need a space in your css selector:
'blockquote ::text'
           ^
You want the text of every descendant node under blockquote; without the space, the selector means just the text directly inside the blockquote node.
Alternatively, use the * selector to select the text of all inner elements inside an element, then join the pieces and strip the result once so that the spaces between fragments are kept:
''.join(quote.css('blockquote *::text').extract()).strip()
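The difference between "text directly in the node" and "text of every descendant" can be illustrated without Scrapy. The sketch below uses Python's standard xml.etree.ElementTree as an analogy only; it does not use Scrapy's selectors, and the markup is the blockquote from the question.

```python
import xml.etree.ElementTree as ET

snippet = ('<blockquote class="messageText SelectQuoteContainer ugc baseHtml">\n'
           'Some text to extract <b>bold content</b>?\n'
           '</blockquote>')
quote = ET.fromstring(snippet)

# .text holds only the text before the first child element.
direct_text = (quote.text or "").strip()

# itertext() walks the element and all of its descendants in document order.
all_text = "".join(quote.itertext()).strip()

print(direct_text)
print(all_text)
```

Joining first and stripping once at the end keeps the space between "extract" and "bold".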
I have posted my HTML below. I want to get the name value out of it. I've tried several approaches and still haven't found a working solution. Please check my HTML and code snippet, and show me a possible solution.
The label before the name always stays the same when I refresh the page. The name itself changes, but it always starts with the literal "mr." as its first 3 characters (4 characters if you count the trailing space). As a regex: ([mM]r.\ ). Below is my table example.
<table>
<tr><td><b>Your Name is </b> mr. kamrul</td></tr>
<tr><td><b>your age </b> 12</td></tr>
<tr><td><b>Email:</b>kennethdasma30#gmail.com</td></tr>
<tr><td><b>job title</b> sales man</td></tr>
</table>
As shown below I am trying this process using listbox but I am not receiving anything.
HtmlElementCollection bColl =
    webBrowser1.Document.GetElementsByTagName("table");
foreach (HtmlElement bEl in bColl)
{
    if (bEl.GetAttribute("table") != null)
    {
        listBox1.Items.Add(bEl.GetAttribute("table"));
    }
}
If anyone can give me an idea of how to get every name in the browser window into my list box as ("mr. " + text), I would appreciate it. Also, please explain the answer in detail and with good comments, as I'd like to understand it thoroughly.
Here is one simple way using Regex, assuming that the format of your html page doesn't change.
Regex re = new Regex(@"(?<=<tr><td><b>Your\sName\sis\s?</b>\s?)[mM]r\.\s.+?(?=</td></tr>)", RegexOptions.Singleline);
foreach (Match match in re.Matches(webBrowser1.DocumentText))
{
    listBox1.Items.Add(match.Value);
}
I have the following repeated piece of the web-page:
<div class="txt ext">
<strong class="param">param_value1</strong>
<strong class="param">param_value2</strong>
</div>
I would like to extract the values param_value1 and param_value2 separately using XPath. How can I do it?
I have tried the following constructions:
'//strong[@class="param"]/text()[0]'
'//strong[@class="txt ext"]/strong[@class="param"][0]/text()'
'//strong[@class="param"]'
none of which returned me separately param_value1 and param_value2.
P.S. I am using Python 2.7 and the latest version of Scrapy.
Here is my testing code:
test_content = '<div class="txt ext"><strong class="param">param_value1</strong><strong class="param">param_value2</strong></div>'
sel = HtmlXPathSelector(text=test_content)
sel.select('//div/strong[@class="param"]/text()').extract()[0]
sel.select('//div/strong[@class="param"]/text()').extract()[1]
// means descendant-or-self, so you are selecting any strong element in any context. [...] is a predicate, which restricts your selection according to some boolean test. There is no strong element with a class attribute equal to txt ext, so you can rule out your second expression.
Your last expression actually returns a node-set of all the strong elements whose class attribute equals param. You can then extract individual nodes from the node-set (use [1], [2]) and get their text contents (use text()).
Your first expression selects the text contents of both nodes, but it is also wrong: the predicate is in the wrong place, and you can't select node zero (it doesn't exist, because XPath positions start at 1). If you want the text contents of the first node you should use:
//strong[@class="param"][1]/text()
and you can use
//strong[@class="param"][2]/text()
for the second text.
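The 1-based positions can be checked with a small sketch. It uses Python's standard xml.etree.ElementTree instead of Scrapy, which is an assumption made so the example has no external dependencies. ElementTree supports only a subset of XPath, so the position predicate is applied to the plain tag name here.

```python
import xml.etree.ElementTree as ET

test_content = ('<div class="txt ext">'
                '<strong class="param">param_value1</strong>'
                '<strong class="param">param_value2</strong>'
                '</div>')
root = ET.fromstring(test_content)  # root is the <div> element

# Every matching node, collected into a Python list (0-based indexing).
params = [el.text for el in root.findall(".//strong[@class='param']")]

# XPath position predicates are 1-based: [1] is the first strong child.
first = root.find("./strong[1]").text
second = root.find("./strong[2]").text

print(params, first, second)
```

Note the two indexing conventions side by side: params[0] in Python corresponds to position [1] in XPath.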