finding href value in python - regex

I'm working on a project that searches webpage content for some data:
from lxml import html
import requests
def tabletPhone(webAddress):
    page = requests.get(webAddress)
    tree = html.fromstring(page.content)
    product = tree.xpath('//h1[@class="product_title entry-title"]/text()')
    price = tree.xpath("//span[@class='price-number']/text()")
    availability = tree.xpath("//n:link", namespaces={'n': 'availability'})
    return product, price, availability
I have a problem finding the availability of the product. The HTML looks something like this:
<link itemprop="availability" href="http://schema.org/InStock" />
Is there any way to return {'availability': 'http://schema.org/InStock'}, or just 'http://schema.org/InStock'?

You can achieve this with BeautifulSoup:
from bs4 import BeautifulSoup
samplecode = '''Google <link itemprop="availability" href="http://schema.org/InStock"/> <link itemprop="availability" href="http://google.com"/>'''
soup = BeautifulSoup(samplecode, "lxml")
# Match every tag that has an href attribute
for line in soup.find_all(href=True):
    print("Url-", line['href'])
The snippet above works for all tags: it looks for an href attribute on every tag. To restrict the search to a specific tag, pass the tag name as well:
for line in soup.find_all('link', href=True):
    print("Url-", line['href'])
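To get exactly the dictionary the question asks for, the lookup can be keyed on the itemprop attribute instead of on the tag name alone. The following is a minimal sketch that uses only the standard library's html.parser so it runs without extra installs; the class name ItempropLinkParser and the sample markup are illustrative assumptions, not part of the original answer. With BeautifulSoup, the equivalent filter would be soup.find('link', itemprop='availability')['href'].

```python
from html.parser import HTMLParser


class ItempropLinkParser(HTMLParser):
    """Collect href values of <link> tags, keyed by their itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is easier to query.
        attributes = dict(attrs)
        if tag == "link" and "itemprop" in attributes and "href" in attributes:
            self.found[attributes["itemprop"]] = attributes["href"]


# Sample markup modelled on the snippet in the question (an assumption).
sample = '<div><link itemprop="availability" href="http://schema.org/InStock" /></div>'

parser = ItempropLinkParser()
parser.feed(sample)
print(parser.found)
```

If several tags share the same itemprop value, this sketch keeps the last one it sees; collecting into a list instead would preserve all of them.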

Related

How to select text without the html code when web scraping in Python (2.7)?

The below code returns the text including the html code. However, I need to retrieve the text only so that it can be loaded nicely into a pd.DataFrame. How do I 'strip' the text?
#importing packages
from bs4 import BeautifulSoup
import requests
#url
url = "https://example.com/this_is_just_an_example"
#request to get text from url
r = requests.get(url).text
#create soup version of the text
soup = BeautifulSoup(r, features="lxml")
#create a list to store the text
MyHeadlines = []
#append each headline to the list
for i in soup.find_all('h3', {'class': 'headline'}):
    MyHeadlines.append(str(i))
You can easily do this with some simple regex:
import re
CLEAN_TEXT = re.sub('<[^<]+?>', '', YOUR_TEXT)
Enjoy!
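Here is a small runnable sketch of that regex applied to a list of headlines. The sample strings and the helper name strip_tags are made up for illustration; the extra html.unescape step is an addition that decodes entities such as &amp;, which the regex alone leaves in place. Note that a regex like this can misfire on unusual markup (for example a > inside an attribute value), and that calling i.get_text() in the loop instead of str(i) would avoid producing the tags in the first place.

```python
import re
from html import unescape

# Hypothetical headline markup, similar to what str(tag) would produce.
raw_headlines = [
    '<h3 class="headline">First <b>story</b></h3>',
    '<h3 class="headline">Second story &amp; more</h3>',
]

# Same pattern as in the answer above, compiled once for reuse.
TAG_PATTERN = re.compile(r"<[^<]+?>")


def strip_tags(markup):
    # Remove anything that looks like a tag, then decode HTML entities.
    return unescape(TAG_PATTERN.sub("", markup)).strip()


clean_headlines = [strip_tags(h) for h in raw_headlines]
print(clean_headlines)
```

The resulting list of plain strings can be passed straight to pd.DataFrame.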

Remove img tags from xml with BeautifulSoup

This is my first time using Python and BeautifulSoup. I'm migrating all the articles of a blog from one website to another, and to do this I'm extracting certain information from an XML file. The last part of my code extracts only the text between positions 0 and 164 from the meta tag, so that it appears on the Google SERP the way they want it to.
The problem is that some articles have img tags in the first lines inside the tag, and I want to remove them, including their src attributes, so that the code grabs only the text after those img tags.
I have tried to solve this in several ways without success.
Here is my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
import sys
import re
reload(sys)
sys.setdefaultencoding('utf8')
base_url = ("http://pimacleanpro.com/blog?rss=true")
soup = BeautifulSoup(urlopen(base_url).read(),"xml")
titles = soup("title")
slugs = soup("link")
bodies = soup("description")
with open("blog-data.csv", "w") as f:
    fieldnames = ("title", "content", "slug", "seo_title", "seo_description","site_id", "page_path", "category")
    output = csv.writer(f, delimiter=",")
    output.writerow(fieldnames)
    for i in xrange(len(titles)):
        output.writerow([titles[i].encode_contents(),bodies[i].encode_contents(formatter=None),slugs[i].get_text(),titles[i].encode_contents(),bodies[i].encode_contents(formatter=None)[4:164]])
print "Done writing file"
Any help will be appreciated.
Here's a Python 2.7 example that I think does what you want:
from bs4 import BeautifulSoup
from urllib2 import urlopen
from xml.sax.saxutils import unescape
base_url = ("http://pimacleanpro.com/blog?rss=true")
# Unescape to allow BS to parse the <img> tags
soup = BeautifulSoup(unescape(urlopen(base_url).read()))
titles = soup("title")
slugs = soup("link")
bodies = soup("description")
print bodies[2].encode_contents(formatter=None)[4:164]
# Remove all 'img' tags in all the 'description' tags in bodies
for body in bodies:
    for img in body("img"):
        img.decompose()
print bodies[2].encode_contents(formatter=None)[4:164]
# Proceed to writing to CSV, etc.
The first print statement outputs the following:
<img src='"http://ekblog.s3.amazonaws.com/contentp/wp-content/uploads/2018/09/03082910/decoration-design-detail-691710-300x221.jpg"'><br>
<em>Whether you are up
While the second one after removing the <img> tags is as follows:
<em>Whether you are upgrading just one room or giving your home a complete renovation, it’s likely that your first thought is to choose carpet for all of
Of course you could just remove all image tags in the soup object before creating titles, slugs, or bodies if they're not of interest to you:
for tag in soup("img"):
    tag.decompose()
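If the goal is a plain-text SEO description, another option is to keep only the text nodes, so that img tags (and every other tag) drop out before the slice is taken. The sketch below is written for Python 3 and uses only the standard library rather than BeautifulSoup; the names TextOnly and seo_description, the 160-character limit, and the sample body are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep text nodes and ignore every tag, including <img> and its attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by the dropped tags.
        return " ".join("".join(self.chunks).split())


def seo_description(html_fragment, limit=160):
    parser = TextOnly()
    parser.feed(html_fragment)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()[:limit]


# Hypothetical description body, shaped like the feed output shown above.
body = '<p><img src="http://example.com/a.jpg"><br>\n<em>Whether you are upgrading just one room</em></p>'
print(seo_description(body))
```

Because the tags are gone before the slice, the truncated description cannot end in the middle of a tag, which a fixed slice over raw markup can do.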

get picture from dynamic content python

I'm trying to get the href of the picture from a URL without using Selenium:
def():
    try:
        page = urllib2.urlopen('')
    except httplib.IncompleteRead, e:
        page = e.partial
    response = BeautifulSoup(page)
    print response
    var = response.find("div", {"id":"il_m"}).find('p')
But I got None as a result. What should I do to get the href?
You can also get the link from an anchor tag with a download attribute:
In [2]: from bs4 import BeautifulSoup
In [3]: import urllib2
In [4]: r = urllib2.urlopen('http://icecat.us/index.php/product/image_gallery?num=9010647&id=9409545&lang=us&imgrefurl=philips.com')
In [5]: soup = BeautifulSoup(r,"html.parser")
In [6]: print(soup.select_one("p a[download]")["href"])
http://images.icecat.biz/img/gallery/9010647-Philips-_FP.jpg
You should also take note of the text "Images may be subject to copyright." on the page.
You're not targeting the right p tag:
First of all, you want to extract the href from the <a> node, not from the <p>.
The first <p> child element that is found is this one: <p class="il_r" id="url_domain"></p>
What you could do is target the <a> inside the 5th <p> element, which is the image. One way of doing this is: var = response.find("div", id="il_m").find_all('p')[4].find('a')
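Both answers come down to the same point: find() returns the first matching <p>, so the link has to be selected by something that distinguishes it, such as its download attribute, or by its position. The sketch below illustrates the attribute-based idea with Python 3's standard library only; the markup is a made-up stand-in for the real page, and DownloadLinkParser is a hypothetical name.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Record the href of every <a> tag that carries a download attribute."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A valueless attribute such as `download` is reported with value None.
        if tag == "a" and "download" in attributes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


# Hypothetical markup: the first <p> holds no link, a later one holds the image.
sample = """
<div id="il_m">
  <p class="il_r" id="url_domain">example.com</p>
  <p><a href="http://example.com/page">Visit page</a></p>
  <p><a download href="http://example.com/picture.jpg">View image</a></p>
</div>
"""

parser = DownloadLinkParser()
parser.feed(sample)
print(parser.hrefs[0])
```

Selecting by attribute tends to survive layout changes better than indexing the 5th <p>, which breaks as soon as a paragraph is added or removed.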

Using beautifulsoup to extract data from html content - HTML Parsing

The output of my script, which uses the BeautifulSoup library, is as follows:
<meta content="Free" itemprop="price" />
and
<div class="content" itemprop="datePublished">November 4, 2013</div>
I want to pull the words Free and November 4, 2013 from that output. Would a regex help, or does BeautifulSoup have attributes that pull these out directly? Here is the code I used:
from BeautifulSoup import BeautifulSoup
import urllib
import re
pageFile = urllib.urlopen("https://play.google.com/store/apps/details?id=com.ea.game.fifa14_na")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
item = soup.find("meta", {"itemprop":"price"})
print item
items = soup.find("div",{"itemprop":"datePublished"})
print items
OK, got it! Just access the values as follows (for the case above):
from BeautifulSoup import BeautifulSoup
import urllib
pageFile = urllib.urlopen("https://play.google.com/store/apps/details?id=com.ea.game.fifa14_na")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
item = soup.find("meta", {"itemprop":"price"}) # meta content="Free" itemprop="price"
print item['content']
items = soup.find("div",{"itemprop":"datePublished"})
print items.string
No regex is needed. Reading through the documentation would have helped.
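The distinction that matters here is attribute versus text: the price is stored in the content attribute, while the date is the element's text. The following sketch shows the same distinction with Python 3's standard-library ElementTree instead of BeautifulSoup; the wrapping root element is an assumption added so that the two snippets parse as one well-formed document. ElementTree only accepts well-formed XML, so a real page would still need an HTML parser such as BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# The two elements from the question, wrapped in a root so they parse together.
snippet = """
<root>
  <meta content="Free" itemprop="price" />
  <div class="content" itemprop="datePublished">November 4, 2013</div>
</root>
"""

root = ET.fromstring(snippet)

# The price lives in an attribute, so read it with .get().
price = root.find(".//meta[@itemprop='price']").get("content")

# The date is the element's text content, so read .text.
published = root.find(".//div[@itemprop='datePublished']").text

print(price, published)
```

This mirrors item['content'] and items.string in the BeautifulSoup version above.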

Using Beautiful Soup4 can I get the <a> embedded text?

I am using Beautiful Soup 4 with Python and have got as far as the code below. How can I get the values Dog, Cat, and Horse, and also the ids? Please help!
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.find_all('a')
# [<a class="Animal" href="http://example.com/elsie" id="link1">Dog</a>,
# <a class="Animal" href="http://example.com/lacie" id="link2">Cat</a>,
# <a class="Animal" href="http://example.com/tillie" id="link3">Horse</a>]
See the documentation:
for a in soup.find_all('a'):
    id = a.get('id')
    value = a.string # This is a NavigableString
    unicode_value = unicode(value)