DOMXPath - Select only a few items

Let's say we have this html and the following DOMXPath code:
<div>
<div>
<p>1</p>
</div>
<div>
<p>2</p>
</div>
<div>
<p>3</p>
</div>
<div>
<p>4</p>
</div>
<div>
<p>5</p>
</div>
<div>
<p>6</p>
</div>
</div>
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // set before loading the markup
$doc->loadHTML($strhtml);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div/div[2]/p');
$result = '';
foreach ($nodelist as $node) {
    // append, so that every matched node ends up in $result
    $result .= $node->nodeValue . "\n";
}
echo $result;
Obviously, $result = '2', since we asked for the value of 'p' from the second 'div' node.
Now, how can I get the values from, say, 'div[2]' to 'div[4]' and sum them?
To be precise, I would like to know how to get "from # to #" and also how to get "this #, that #, also # and #". So two questions, for two different problems.
Thanks in advance.

You can select a range of elements with DOMXPath.
For the first problem ("from # to #"), use position() in a predicate:
// select the nodes at positions 2 through 4
$nodelist = $xpath->query('//div/div[position()>1 and position()<5]');
For the second problem ("this #, that #, also # and #"), combine the individual paths with the | union operator:
// select the 2nd, 4th and 6th elements
$nodelist = $xpath->query('//div/div[2] | //div/div[4] | //div/div[6]');
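The question also asked how to sum the selected values, which the two queries above do not show. As a rough sketch of the whole flow, here is the same selection in Python with the standard library's xml.etree.ElementTree instead of PHP's DOMXPath. ElementTree only understands a subset of XPath (it accepts div[2] but not position()>1), so the range is built by querying one position at a time. The helper name values_at is made up for this example.

```python
import xml.etree.ElementTree as ET

# The markup from the question; it is well-formed, so an XML parser can read it.
HTML = """<div>
<div><p>1</p></div>
<div><p>2</p></div>
<div><p>3</p></div>
<div><p>4</p></div>
<div><p>5</p></div>
<div><p>6</p></div>
</div>"""


def values_at(root, positions):
    """Return the integer <p> values of the child <div>s at the given 1-based positions."""
    values = []
    for pos in positions:
        # div[2] means "the second child div", exactly as in the XPath above
        for p in root.findall(f"./div[{pos}]/p"):
            values.append(int(p.text))
    return values


root = ET.fromstring(HTML)

# "from # to #": positions 2 through 4
range_values = values_at(root, range(2, 5))
# "this #, that #, also #": positions 2, 4 and 6
picked_values = values_at(root, (2, 4, 6))

print(range_values, sum(range_values))
print(picked_values, sum(picked_values))
```

With the sample markup this prints [2, 3, 4] 9 and [2, 4, 6] 12. In PHP the equivalent step would be to loop over $nodelist and add up (int)$node->nodeValue.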

Related

Scrapy concatenate array elements inside div in python

I need to concatenate some text inside a <div> with xpath in Scrapy. The div has the next structure:
<div class="col-12 e-description" itemprop="description">
"-Text1"
<br>
<br>
"-Text2"
<br>
<br>
"-Text3"
</div>
I've created a ScrapyItem in my Spider:
class MyScrapyItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
If I do this,
item['description'] = response.xpath('//div[@itemprop="description"]/text()').extract()
everything gets mixed and separated by commas, like this:
- Text1
,- Text2
,- Text3
I think that's because response.xpath('//div[@itemprop="description"]/text()').extract() returns an array, so it adds commas to separate the array items.
I'm trying to loop over the array and join each item inside the "description" ScrapyItem property.
This is what I'm trying:
def parse_item(self, response):
    item = MyScrapyItem()
    item['name'] = response.xpath('normalize-space(//span[@itemprop="name"]/text())').extract()
    for subItem in response.xpath('//div[@itemprop="description"]/text()'):
        item['description'] = " ".join(subItem.extract())
I know it would work if I could do something like this:
for subItem in response.xpath('//div[@itemprop="description"]/text()'):
    item['description'] = " ".join(subItem.xpath('//div[@itemprop="something_here"]/text()').extract())
but the div that contains the text has no more tags inside.
Any help would be appreciated, it's my first Scrapy project.
It is the other way around. You have used
item['description'] = response.xpath('//div[@itemprop="description"]/text()').extract()
which returns a list. Join that list directly:
item['description'] = " ".join(response.xpath('//div[@itemprop="description"]/text()').extract())
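To see the join in isolation, here is a small sketch that runs without Scrapy. The list literal stands in for what .extract() would return for the markup in the question; its exact whitespace is an assumption. Because text() also yields the whitespace-only nodes between the <br> tags, the sketch strips each piece and skips the empty ones before joining.

```python
# Stand-in for the list that response.xpath(...).extract() would return here;
# the exact whitespace is an assumption based on the markup in the question.
parts = ['\n"-Text1"\n', '\n', '\n"-Text2"\n', '\n', '\n"-Text3"\n']

# Strip each text node and skip the ones that were only whitespace
# (the gaps between consecutive <br> tags).
description = " ".join(part.strip() for part in parts if part.strip())

print(description)
```

This prints "-Text1" "-Text2" "-Text3" on a single line. In the spider, the same expression can be applied to the list returned by .extract().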

RegExp replace all but selected

So I'm trying to erase everything except the matched text in this 1900-line document with Notepad++ RegExp Find/Replace, so that I only have the file names, which should shorten it to under about 1000 lines. I know the expression that selects the text, (?<=/images/item/)(.*)(?=" a), but I don't know how to make it erase everything that doesn't match. Here's a portion of the document.
Using Notepad++, it would find and select abyssal-scepter.gif, aegis-of-the-legion.gif, etc.
<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br> <div id="id_77" class="tier-wrapper drag-items health magic-resist health-regen champ-box float-left ajax-tooltip {t:'Item',i:'77'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-advanced filter-bonus-aura filter-category-health filter-category-magic-resist filter-category-health-regen ui-draggable ui-draggable-handle">
<img src="/images/item/aegis-of-the-legion.gif" alt="LoL Item: Aegis of the Legion"><br> <div id="id_235" class="tier-wrapper drag-items ability-power movement champ-box float-left ajax-tooltip {t:'Item',i:'235'} filter-tier-advanced filter-bonus-unique-passive filter-category-ability-power filter-category-movement ui-draggable ui-draggable-handle">
<img src="/images/item/aether-wisp.gif" alt="LoL Item: Aether Wisp"><br>
<div class="info">
<div class="champ-name">Aether Wisp</div>
<div class="champ-sub">
<img src="/images/gold.png" alt="Item Cost" style="width:16px; vertical-align:middle;"> 850 / 415
</div>
</div>
</div>
<div id="id_21" class="tier-wrapper drag-items ability-power champ-box float-left ajax-tooltip {t:'Item',i:'21'} classic-and-dominion filter-is-dominion filter-is-classic filter-tier-basic filter-category-ability-power ui-draggable ui-draggable-handle">
<img src="/images/item/amplifying-tome.gif" alt="LoL Item: Amplifying Tome"><br>
<div class="info">
<div class="champ-name">Amplifying Tome</div>
<div class="champ-sub">
I'm not familiar with RegExp, so to summarize, I need it to look like this at the end of it.
abyssal-scepter.gif
aegis-of-the-legion.gif
aether-wisp.gif
amplifying-tome.gif
Thank you for your time
A Notepad++ solution:
Find what : .*?/images/item/(.*?)"|.*
Replace with : $1\n
Search mode : Regular expression (with ". matches newline" checked)
The result will have an extra linefeed at the end.
But that shouldn't pose a problem I suppose.
Maybe this can help, or maybe not, since you dropped the JavaScript tag from your original post.
<script type="text/javascript">
var thestring = "<img src=\"/images/item/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
var thestring2 = "<img src=\"/images/otherstuff/aegis-of-the-legion.gif\" alt=\"LoL Item: Aegis of the Legion\"><br>";
function ParseIt(incomingstring) {
    // capture the file name after "/images/item/, stopping at the closing quote
    var pattern = /"\/images\/item\/([^"]*)" /;
    if (pattern.test(incomingstring)) {
        return pattern.exec(incomingstring)[1];
    }
    else {
        return "";
    }
    //return pattern.test(incomingstring) ? pattern.exec(incomingstring)[1] : "";
}
</script>
Calling ParseIt(thestring) returns "aegis-of-the-legion.gif"
Calling ParseIt(thestring2) returns ""
Since you are doing this in NP++, this works for me. In cases like this, where speed and results matter more than specific technique, I'll usually run several regexes:
Replace > with >\n so that each tag is on its own line for simpler processing.
Replace ^>*<.*?".*?/?([\w\d\-_]+\.\w{2,4})?".*>.*$ with $1 to extract the filenames from the tags and remove the unneeded text.
Replace <.*> with an empty string to clear the tags that had no filename in them.
Use Edit > Line Operations > Remove Empty Lines.
That gives the result you're looking for. It's not a 100% regex solution, but this is a one-time action that only needs a simple result.
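If the file can be processed outside Notepad++, the same extraction can be sketched in a few lines of Python. This is only an illustration: the sample string is abbreviated from the question (the long class attributes are shortened), and the pattern assumes every file name ends at the next double quote.

```python
import re

# Lines abbreviated from the sample in the question.
document = '''<img src="/images/item/abyssal-scepter.gif" alt="LoL Item: Abyssal Scepter"><br> <div id="id_77" class="tier-wrapper">
<img src="/images/item/aegis-of-the-legion.gif" alt="LoL Item: Aegis of the Legion"><br>
<img src="/images/gold.png" alt="Item Cost" style="width:16px; vertical-align:middle;"> 850 / 415
'''

# Capture everything between /images/item/ and the next double quote.
filenames = re.findall(r'/images/item/([^"]+)"', document)

print("\n".join(filenames))
```

This prints abyssal-scepter.gif and aegis-of-the-legion.gif, one per line. The gold.png image is skipped because its path does not contain /images/item/.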

Trying to match src part of HTML <img> tag Regular Expression

I've got a bunch of strings already separated from an HTML file, examples:
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
I am trying to make a regular expression that will grab the src="URL" part of the img tag so that I can replace it later based on a few other conditions. The many quotation marks are giving me the most trouble. I'm still relatively new to regex, so a lot of the tricks are beyond me.
Thanks in advance
Use DOM or another parser for this; don't try to parse HTML with regular expressions.
Example:
$html = <<<DATA
<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
    echo $img->getAttribute('src') . "\n";
}
Output
//i.imgur.com/tApg8ebb.jpg
//i.imgur.com/SwmwL4Gb.jpg
//s.imgur.com/images/blog_rss.png
If you would rather store the results in an array, you could do:
$sources = array(); // initialise before appending
foreach ($imgs as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
Output
Array
(
[0] => //i.imgur.com/tApg8ebb.jpg
[1] => //i.imgur.com/SwmwL4Gb.jpg
[2] => //s.imgur.com/images/blog_rss.png
)
$pattern = '/<img.+src="([\w/\._\-]+)"/';
I'm not sure which language you're using, so quote syntax will vary.
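Since the language was left open, here is a rough Python counterpart of the parser-based approach, using only the standard library's html.parser. The class name ImgSrcCollector is made up for this sketch. The title attribute with markup inside it is the case that tends to trip up a hand-written regex.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


html = '''<img alt="" src="//i.imgur.com/tApg8ebb.jpg" title="Some manly skills for you guys<p><span class='points-q7Vdm'>18,736</span> <span class='points-text-q7Vdm'>points</span> : 316,091 views</p>">
<img src="//i.imgur.com/SwmwL4Gb.jpg" width="48" height="48">
<img src="//s.imgur.com/images/blog_rss.png">
'''

collector = ImgSrcCollector()
collector.feed(html)
collector.close()

print(collector.sources)
```

The collected list holds the same three URLs as the PHP output above. Replacing a src afterwards is easier with a tree-building parser, since html.parser only reports what it sees.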

Database execution with Opencart

I am using this module which allows me to run a simple blog / news feed in OC which does what I need apart from one thing, I need to display the first 4 articles on the homepage.
I have got the following so far:
<?php
$sql = "SELECT * FROM " . DB_PREFIX . "blog b LEFT JOIN " . DB_PREFIX . "blog_description bd ON (b.blog_id = bd.blog_id) WHERE b.status = 1 AND b.date <= NOW() AND bd.language_id = '" . (int)$this->config->get('config_language_id') . "'";
$query = $this->db->query($sql);
$blogs = array();
?>
<div class="box" id="news">
<div class="title">
<p>Latest News</p>
</div>
<div class="content">
<?php
foreach($query->rows as $result){
$blogs[] = $result;
}
?>
</div>
</div>
I have been developing OC templates for a while but modules are a whole new ball game for me, any help would be appreciated
First of all, you should learn and understand how MVC (or whatever it tends to be) is implemented in OpenCart: we have controllers, models and view templates.
Your approach mixes the controller and model parts into a view template, which is the wrong place for them.
So what should go where:
SQL query should go into the model
A new method for retrieving the data from the model and preparing it for the view template should go into the controller
Only HTML markup should be present in the view template
Let's say the extension you have downloaded has a controller here: catalog/controller/information/news.php. You should extend its index() method (or another appropriate one, or even create a new one if needed) so that it calls the news model. In that model, add a new method getLastFourArticles(), which should look like:
public function getLastFourArticles() {
    $sql = "
        SELECT *
        FROM " . DB_PREFIX . "blog b
        LEFT JOIN " . DB_PREFIX . "blog_description bd ON b.blog_id = bd.blog_id
        WHERE b.status = 1
        AND b.date <= NOW()
        AND bd.language_id = " . (int)$this->config->get('config_language_id') . "
        ORDER BY b.date DESC
        LIMIT 4";
    // query() returns a result object; the template loops over its array of rows
    $query = $this->db->query($sql);
    return $query->rows;
}
The ORDER BY part sorts the blog entries from newest to oldest, and the LIMIT 4 part makes sure we receive at most 4 rows.
Now in the controller You should do something like:
$this->data['latest_entries'] = $this->model_information_news->getLastFourArticles();
This assumes that the model is catalog/model/information/news.php and that it has been loaded ($this->load->model('information/news');).
Now in Your template only this part is needed:
<div class="box" id="news">
<div class="title">
<p>Latest News</p>
</div>
<div class="content">
<?php foreach($latest_entries as $entry) { ?>
<span class="date"><?php echo $entry['date']; ?></span>
<span class="entry"><?php echo $entry['text']; ?></span>
<?php } ?>
</div>
</div>
Keep in mind that this answer is only a guideline: you should substitute the right names and variables (and the right indices for the blog entry in the template).
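To try the ORDER BY ... LIMIT 4 query in isolation, here is a small self-contained sketch in Python with an in-memory SQLite database. It is not OpenCart code: the table layout, the sample rows and the title column are assumptions made for the illustration, and date('now') is SQLite's counterpart of NOW().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog (blog_id INTEGER PRIMARY KEY, status INTEGER, date TEXT);
    CREATE TABLE blog_description (blog_id INTEGER, language_id INTEGER, title TEXT);
""")

# Hypothetical rows: entry 3 is disabled and entry 7 is dated in the future.
entries = [(1, 1, "2020-01-01"), (2, 1, "2020-01-02"), (3, 0, "2020-01-03"),
           (4, 1, "2020-01-04"), (5, 1, "2020-01-05"), (6, 1, "2020-01-06"),
           (7, 1, "2999-01-01")]
conn.executemany("INSERT INTO blog VALUES (?, ?, ?)", entries)
conn.executemany("INSERT INTO blog_description VALUES (?, 1, ?)",
                 [(blog_id, f"Post {blog_id}") for blog_id, _, _ in entries])

language_id = 1
# Same shape as the model query: filter, sort newest first, keep at most 4 rows.
rows = conn.execute("""
    SELECT b.blog_id, bd.title
    FROM blog b
    LEFT JOIN blog_description bd ON b.blog_id = bd.blog_id
    WHERE b.status = 1
      AND b.date <= date('now')
      AND bd.language_id = ?
    ORDER BY b.date DESC
    LIMIT 4
""", (language_id,)).fetchall()

titles = [title for _, title in rows]
print(titles)
```

This prints ['Post 6', 'Post 5', 'Post 4', 'Post 2']: the disabled and the future-dated entries are filtered out, and only the four newest of the rest are returned.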

How to write this in regular expression in Python?

I have a big HTML file from which I need to parse some data using regular expressions. The first item is the name of the restaurant. Hotel names are in this format:
Update:
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body><div class="businessresult clearfix">
<div class="leftcol">
<div id="bizTitle0" class="itemheading">
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
</div>
<div class="itemcategories">
Categories: Italian, Seafood
</div>
<div class="itemneighborhoods">
Neighborhood: Marina/Cow Hollow
</div>
</div>
<div class="rightcol">
<div class="rating"><img src="yelp_listings_files/stars_map.html" alt="4 star rating" title="4 star rating" class="stars_4 " height="325" width="83"></div> <a class="reviews" href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco">270 reviews</a>
<address>
1809 Union St<br>San Francisco, CA 94123<br>
</address><div class="phone">
(415) 409-8001
</div>
</div>
There are 40 hotels altogether. I think there are two spaces after the . that follows the number. I need to list all the hotels from 1 to 40. I have tried using:
re.findall("[./0-9]", string_Name)
It outputs the number. I want to get the number and all the hotel names. How can I do that?
The answer by Blender gives the rating and the restaurant list. That's fine but I want rating and the restaurant name in a different variable.
Parse the HTML:
import re
from bs4 import BeautifulSoup
html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>
'''
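# name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
# string= filters on the tag's text (older bs4 releases call this argument text=)
for link in soup.find_all('a', string=re.compile(r'^\d')):
    print(link.get_text().strip())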
And the output:
1. Capannina
5. Ristorante Parma
You shouldn't run regexes on HTML directly (prefer an HTML parser), but if you must, try this regex:
(\d+)\.\s+([^<]+)
one or more digits
a dot
one or more whitespace characters
one or more characters other than <
Each pair of parentheses () creates a capture group. Capture group 1 will contain the number, and capture group 2 will contain the name.
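As a minimal sketch of how the two capture groups end up in separate variables, the snippet below applies the regex with Python's re module to the same two links used in the BeautifulSoup example. The list names numbers and names are only illustrative.

```python
import re

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>
'''

pattern = re.compile(r'(\d+)\.\s+([^<]+)')

numbers = []
names = []
# findall returns one (group 1, group 2) tuple per match
for number, name in pattern.findall(html):
    numbers.append(int(number))
    # group 2 runs up to the next <, so it still carries the trailing newline
    names.append(name.strip())

print(numbers)
print(names)
```

This prints [1, 5] and ['Capannina', 'Ristorante Parma']. On the full page the pattern could also match unrelated text that looks like "digits, dot, space", which is one more reason to narrow the input with a parser first.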