I need to concatenate some text inside a <div> with XPath in Scrapy. The div has the following structure:
<div class="col-12 e-description" itemprop="description">
"-Text1"
<br>
<br>
"-Text2"
<br>
<br>
"-Text3"
</div>
I've created a ScrapyItem in my Spider:
class MyScrapyItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
If I do this,
item['description'] = response.xpath('//div[@itemprop="description"]/text()').extract()
everything gets mixed and separated by commas, like this:
- Text1
,- Text2
,- Text3
I think that's because response.xpath('//div[@itemprop="description"]/text()').extract() returns a list, so commas are added to separate the list items.
I'm trying to loop over the list and join each item into the "description" field of the item.
This is what I'm trying:
def parse_item(self, response):
    item = MyScrapyItem()
    item['name'] = response.xpath('normalize-space(//span[@itemprop="name"]/text())').extract()
    for subItem in response.xpath('//div[@itemprop="description"]/text()'):
        item['description'] = " ".join(subItem.extract())
I know it would work if I could do something like this:
for subItem in response.xpath('//div[@itemprop="description"]/text()'):
    item['description'] = " ".join(subItem.xpath('//div[@itemprop="something_here"]/text()').extract())
but the div that contains the text has no more tags inside.
Any help would be appreciated, it's my first Scrapy project.
It is the other way around. You have used
item['description'] = response.xpath('//div[@itemprop="description"]/text()').extract()
which returns a list. There is no need to loop; join the list directly:
item['description'] = " ".join(response.xpath('//div[@itemprop="description"]/text()').extract())
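As a minimal sketch of the join, the list below stands in for what .extract() might return for the div in the question (the exact whitespace in each text node is an assumption, not taken from a real page). Stripping each piece and dropping the empty ones keeps stray newlines out of the result:

```python
# Hypothetical stand-in for:
#   response.xpath('//div[@itemprop="description"]/text()').extract()
# The whitespace around each text node is assumed, not taken from a real page.
text_nodes = ['\n"-Text1"\n', '\n', '\n"-Text2"\n', '\n', '\n"-Text3"\n']

# Strip each text node, skip the ones that were only whitespace,
# and join the rest into a single string.
description = " ".join(part.strip() for part in text_nodes if part.strip())

print(description)
```

In the spider, the same expression would wrap the real .extract() call before assigning to item['description'].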
I'm trying to get specific text that is in a span class from a web page. I can get the first instance, but I'm not sure how to iterate to get the one I need.
<div class="pricing-base__plan-pricing">
<div class="pricing-base__plan-price pricing-base__plan-price--annual">
<sup class="pricing-base__price-symbol">$</sup>
<span class="pricing-base__price-value">14</span></div>
<div class="pricing-base__plan-price pricing-base__plan-price--monthly">
<sup class="pricing-base__price-symbol">$</sup>
<span class="pricing-base__price-value">18</span>
</div>
<div class="pricing-base__term">
<div class="pricing-base__term-wrapper">
<div class="pricing-base__date">mo*</div>
</div>
I need to get the "18" in the line
<span class="pricing-base__price-value">18</span>
That number changes quite often, and it is the value my code needs to scrape.
You can use a class selector, as shown, to retrieve a list of all prices, then index into that list to get the annual and monthly values:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.gotomeeting.com/meeting/pricingc')
soup = bs(r.content, 'lxml')
prices = [item.text for item in soup.select('.pricing-base__price-value')]
monthly = prices[1]
annual = prices[0]
You could also add in parent classes:
monthly = soup.select_one('.pricing-base__plan-price--monthly .pricing-base__price-value').text
annual = soup.select_one('.pricing-base__plan-price--annual .pricing-base__price-value').text
I am building a search engine that needs a custom filter to display the text surrounding a keyword, like the excerpts on a Google results page. I am using regex to identify the surrounding words. Here is my code for the filter:
@register.filter(needs_autoescape=True)
@stringfilter
def show_excerpt(value, search_term, autoescape=True):
    # make the keyword put into the search engine case insensitive
    keywords = re.compile(re.escape(search_term), re.IGNORECASE)
    # make excerpt return 300 characters before and after keyword
    excerpt_text = '.{300}' + str(keywords) + '.{300}'
    # replace the original text with excerpt
    excerpt = value.sub(excerpt_text, value)
    return mark_safe(excerpt)
Code for the search engine in view.py:
def query_search(request):
    articles = cross_currents.objects.all()
    search_term = ''
    if 'keyword' in request.GET:
        search_term = request.GET['keyword']
        articles = articles.annotate(similarity=Greatest(TrigramSimilarity('Title', search_term), TrigramSimilarity('Content', search_term))).filter(similarity__gte=0.03).order_by('-similarity')
    context = {'articles': articles, 'search_term': search_term}
    return render(request, 'query_search.html', context)
HTML template (it includes a custom highlight filter that highlights the keyword put into search engine):
<ul>
{% for article in articles %}
<li>{{ article|highlight:search_term }}</li>
<p> {{ article.Content|highlight:search_term|show_excerpt:search_term }} </p>
{% endfor %}
</ul>
Error message: 'SafeText' object has no attribute 'sub'
I think I am doing .sub wrong. I just need the excerpt to replace the entire original text (the text that I am putting a filter on). The original text starts from the beginning of the data but I just want to display the data surrounding the keyword, with my highlight custom filter highlighting the keyword (just like on Google). Any idea?
EDIT: when I do re.sub(excerpt_text, value), I get the error message sub() missing 1 required positional argument: 'string'.
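You need to call re.sub(), not value.sub(). You are calling .sub() on a SafeText object, but sub() belongs to the re module (and to compiled patterns), not to strings.
I haven't tested your code, but if the rest of it is correct you should just change that line to use re.sub(). Note that re.sub() takes the pattern, the replacement, and the string, in that order, so calling re.sub(excerpt_text, value) with only two arguments raises the "missing 1 required positional argument" error mentioned in the EDIT.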
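For what the regex route could look like, here is a minimal sketch in plain Python, outside Django. It assumes the goal is to return up to 300 characters on each side of the first match; the function name make_excerpt and the width parameter are hypothetical. It uses search() rather than sub(), since the aim is to extract a span, not to substitute text:

```python
import re


def make_excerpt(text, search_term, width=300):
    """Return up to `width` characters around the first match of search_term."""
    # Escape the term so it is matched literally, then allow up to `width`
    # characters of any kind (including newlines, via DOTALL) on either side.
    pattern = re.compile(
        r".{0,%d}%s.{0,%d}" % (width, re.escape(search_term), width),
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(text)
    # Fall back to the start of the text when the term does not occur.
    return match.group(0) if match else text[:width]


sample = "x" * 500 + " Django filters " + "y" * 500
excerpt = make_excerpt(sample, "django", width=20)
print(excerpt)
```

Using .{0,300} instead of .{300} matters: a fixed count fails to match whenever the keyword is closer than 300 characters to either end of the text.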
I decided to ditch regex and just do good old string slicing. Working code for the filter:
@register.filter(needs_autoescape=True)
@stringfilter
def show_excerpt(value, search_term, autoescape=True):
    # make data into string and lower
    original_text = str(value)
    lower_original_text = original_text.lower()
    # make keyword into string and lower
    keyword_string = str(search_term)
    lower_keyword_string = keyword_string.lower()
    # find the position of the keyword in the data
    keyword_index = lower_original_text.find(lower_keyword_string)
    # specify the beginning and ending positions of the excerpt
    start_index = keyword_index - 10
    end_index = keyword_index + 300
    # define the position range of the excerpt
    excerpt = original_text[start_index:end_index]
    return mark_safe(excerpt)
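One caveat with the slicing above: when the keyword sits within the first 10 characters, start_index goes negative and the slice counts from the end of the string, and when the keyword is missing, find() returns -1. A minimal sketch of a guarded variant follows, in plain Python without the Django decorators; the function name excerpt_around and its parameters are hypothetical:

```python
def excerpt_around(text, search_term, before=10, after=300):
    """Slice out the text surrounding the first case-insensitive match."""
    keyword_index = text.lower().find(search_term.lower())
    if keyword_index == -1:
        # Keyword not found: fall back to the beginning of the text.
        return text[:before + after]
    # Clamp at 0 so a match near the start does not produce a negative
    # index, which would make the slice count from the end of the string.
    start_index = max(keyword_index - before, 0)
    end_index = keyword_index + after
    return text[start_index:end_index]


print(excerpt_around("Scrapy is a crawling framework", "scrapy"))
print(excerpt_around("An intro to the Scrapy framework", "scrapy", before=4, after=6))
```

Inside the filter, the return value would still be passed through mark_safe() as in the code above.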
I have the following HTML structure:
<div id="mod_imoveis_result">
<a class="mod_res" href="#">
<div id="g-img-imo">
<div class="img_p_results">
<img src="/img/image.jpg">
</div>
</div>
</a>
</div>
This is a product results page, so each page has 7 blocks with that mod_imoveis_result id, like the one above. I need to get the image src from all blocks.
I tried:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
class QuotesSpider(scrapy.Spider):
    name = "magichat"
    start_urls = ['https://magictest/results']
    def parse(self, response):
        for bimb in response.xpath('//div[@id="mod_imoveis_result"]'):
            yield {
                'img_url': bimb.xpath('//div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src').extract_first(),
                'text': bimb.css('#titulo_imovel::text').extract_first()
            }
        next_page = response.xpath('//a[contains(@class, "num_pages") and contains(@class, "pg_number_next")]/@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
I can't understand why the text target is correct while img_url returns the first result for every block on the page. For example, each page has 7 blocks, so there should be 7 texts and 7 img_urls, but the img_url is the same for all 7 blocks while the text is right. Why?
If I change extract_first to extract, I get the other URLs, but they all come back in the same brackets. Example:
text: 1aaaa
img_url : a,b,c,d,e,f,g
but I need
text: 1aaaa
img_url: a
text: 2aaaa
img_url: b
What is wrong with that loop?
A leading // searches from the document root, no matter which node xpath() is called on. Inside the loop, '//div[@id="g-img-imo"]...' therefore matches the image divs of every block on the page, and extract_first() always returns the first one.
. selects the current node, so an expression starting with .// searches only the descendants of the current node, irrespective of how deep they are.
In your case, xpath('.//div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src') searches relative to the current <div id="mod_imoveis_result"> and reaches the node marked by the arrow:
<div id="mod_imoveis_result">
<a class="mod_res" href="#">
---> <div id="g-img-imo">
<div class="img_p_results">
<img src="/img/image.jpg">
</div>
</div>
</a>
</div>
I hope that makes it clear.
If the class of the image div is not used anywhere else on the page (here, img_p_results), you can also select the image div directly and extract the image URLs:
//*[@class="img_p_results"]/img/@src
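The per-block search can be sketched with the standard library's xml.etree.ElementTree instead of Scrapy (a swapped-in tool, chosen only so the snippet runs without a spider). The markup is a made-up two-block version of the page and the image paths are placeholders. ElementTree does not accept a trailing /@src step, so the attribute is read with .get():

```python
import xml.etree.ElementTree as ET

# Made-up page with two result blocks; image paths are placeholders.
page = ET.fromstring("""
<html><body>
  <div id="mod_imoveis_result">
    <a class="mod_res" href="#">
      <div id="g-img-imo"><div class="img_p_results"><img src="/img/a.jpg"/></div></div>
    </a>
  </div>
  <div id="mod_imoveis_result">
    <a class="mod_res" href="#">
      <div id="g-img-imo"><div class="img_p_results"><img src="/img/b.jpg"/></div></div>
    </a>
  </div>
</body></html>
""".strip())

image_path = ".//div[@id='g-img-imo']/div[@class='img_p_results']/img"

img_urls = []
for block in page.findall(".//div[@id='mod_imoveis_result']"):
    # The leading "." keeps the search inside the current block, so each
    # iteration finds that block's own image rather than the page's first one.
    img = block.find(image_path)
    img_urls.append(img.get("src"))

print(img_urls)
```

In the Scrapy spider, the equivalent change is to start the inner expression with .// when calling bimb.xpath(...) inside the loop.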
sup2 = soup2.find_all("div", {"class": "xxxxxxx"})
When I use find_all over a div, I get the following result:
<div class="xxxxxxx" data-reactid="37">aa , bb </div>
How do I get the href values on either side of the comma?
Iterate over the Tag elements in sup2 and select the 'href' attribute of each <a> inside them, e.g.:
hrefs = [a['href'] for tag in sup2 for a in tag.find_all('a')]
Using css selectors:
hrefs = [tag['href'] for tag in soup2.select("div.xxxxxxx a")]
I'm new to Python and BeautifulSoup, how would I search certain tags whose children have certain attributes?
For example,
<section ...>
<a href="URL" ...>
<h4 itemprop="name">ABC</h4>
<p class="open"></p>
</a>
</section>
I would like to get all names ('ABC') and URLs ("URL") where class="open". I can get all sections with
soup.findAll(lambda tag: tag.name == "section")
But I don't know how to add other conditions since tag.children is a listiterator.
Because you're looking for certain attributes with the <p> tags, I would search for only <p> tags with attrs={"class": "open"} and then select the parent (which is the <a> tag) and gather the rest of the information from that.
soup = BeautifulSoup(data, "html.parser")
items = soup.find_all("p", attrs={"class": "open"})
for item in items:
    name = item.parent.h4.text
    url = item.parent.attrs.get('href', None)
    print("{} : {}".format(name, url))