I'm using django-yarr for my RSS reader application. Is there any way to fetch content from an RSS URL and save it in the database? Or is there a library that could do that?
Are you looking to read data from an RSS feed, process it, and save it?
Use Requests to fetch the data:
import requests

req = requests.get('http://feeds.bbci.co.uk/news/technology/rss.xml')
req.text  # XML as a string
Use BeautifulSoup, lxml or ElementTree (or similar libraries that can process XML) to parse the data:
from bs4 import BeautifulSoup

soup = BeautifulSoup(req.text, 'html.parser')  # an explicit parser keeps the media: tag names intact
images = soup.findAll('media:thumbnail')
Finally, do whatever you want with the data:
for image in images:
    thing = DjangoModelThing()
    thing.image = image.attrs.get('url')
    thing.save()
UPDATE
Alternatively, you could grab each article from the RSS feed:
articles = soup.findAll('item')

for article in articles:
    title = article.find('title')
    description = article.find('description')
    link = article.find('link')
    images = article.find('media:thumbnail')
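To answer the library part of the question: there is also feedparser, which is built specifically for reading RSS/Atom. A minimal sketch of saving to the database, assuming a hypothetical Django model named Article with title, link, description and image fields, and assuming feedparser exposes Media RSS thumbnails under the media_thumbnail key:

import feedparser

feed = feedparser.parse('http://feeds.bbci.co.uk/news/technology/rss.xml')
for entry in feed.entries:
    article = Article()  # hypothetical model -- adjust to your schema
    article.title = entry.title
    article.link = entry.link
    article.description = entry.summary
    # media:thumbnail elements, if present, are assumed to land here
    thumbnails = entry.get('media_thumbnail', [])
    if thumbnails:
        article.image = thumbnails[0].get('url')
    article.save()

feedparser also normalises the differences between RSS versions and Atom, so you don't have to handle namespaced tag names like media:thumbnail yourself.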
What I am attempting to accomplish is a simple Python web scraping script for Google Trends, but I am running into an issue when grabbing the class.
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When printing the tags I receive an empty result; however, when I print soup I receive the entire HTML document. My goal is to print out the text of each "Keyword" that was searched for on the page https://trends.google.com/trends/trendingsearches/daily?geo=AU, which has a list of results:
1. covid-19
2. Woolworths jobs
If you use Google developer tools, select Inspect and hover over a title, you will see div.details-top.
How would I print just the text of each title?
I can see that data being dynamically retrieved from an API call in the dev tools network tab. You can issue an XHR to that URL and then use a regex on the response text to parse out the query titles:
import re
import requests

r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles)  # on Python 2.7 use: print titles
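Alternatively, you can treat the response as JSON rather than using a regex. A minimal sketch, assuming the response still begins with Google's )]}', anti-JSON-hijacking guard line and that the key layout (default -> trendingSearchesDays -> trendingSearches -> title.query) is unchanged:

import json
import requests

r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
# Drop the ")]}'," guard line Google prepends (assumed stable)
payload = json.loads(r.split('\n', 1)[1])
titles = [search['title']['query']
          for day in payload['default']['trendingSearchesDays']
          for search in day['trendingSearches']]
print(titles)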
I'm just a beginner at web scraping and Python in general, so I'm sorry if the answer is obvious, but I can't figure out why I'm unable to find any of the table elements on https://www.hockey-reference.com/leagues/NHL_2018.html.
My initial thought was that this was a result of the whole div being commented out, so following some advice I found here in another similar post, I replaced the comment characters and confirmed that they were removed when I saved soup.text to a text file and searched it. I was still unable to find any tables, however.
In trying to search a little further, I took the ID out of my .find and did a findAll instead, and the table search still came up empty.
Here's the code I was trying to use; any advice is much appreciated!
import csv
import re
import requests
from bs4 import BeautifulSoup

comm = re.compile("<!--|-->")
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(comm.sub("", html))
table = soup.find('table', id="stats")
When searching for all of the table elements, I was using:
table = soup.findAll('table')
I'm also aware that there is a csv version on the site, I was just eager to practice.
Pass a parser along with your markup, for example BeautifulSoup(html, 'lxml'). Try the code below:
import requests
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
table = soup.findAll('table')
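Note that hockey-reference serves most of its tables inside HTML comments, so a parser alone may not surface all of them. A sketch that also re-parses each comment body, using bs4's Comment class:

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

tables = soup.find_all('table')
# Re-parse every HTML comment; the hidden tables live inside them
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    tables.extend(BeautifulSoup(comment, 'lxml').find_all('table'))

print(len(tables))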
[DISCLAIMER] I have been through plenty of the other answers in this area, but they do not seem to work for me.
I want to be able to export the data I have scraped as a CSV file.
My question is how do I write the piece of code which outputs the data to a CSV?
Current Code
import requests
from bs4 import BeautifulSoup
url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
req = requests.get(url).text
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        print "<a href='%s'>%s</a>" % (link.get("href"), link.text)
Output from the code
View Position
</a>
<a href='/career/management-consultants-to-help-our-customers-succeed-with-
it/'>
Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
View Position
</a>
<a href='/career/management-consultants-within-process-improvement/'>
Management consultants within process improvement
COPENHAGEN • We are looking for consultants with profound
experience in Six Sigma, Lean and operational
management
Code I have tried
with open('ImplementTest1.csv', "w") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["link.get", "link.text"])
    csv_file.close()
Output in CSV format
Column 1: Url Links
Column 2: Job description
E.g.:
Column 1: /career/management-consultants-to-help-our-customers-succeed-with-
it/
Column 2: Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
Try this script to get the CSV output:
import csv
import requests
from bs4 import BeautifulSoup

outfile = open('career.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["job_link", "job_desc"])

res = requests.get("http://implementconsultinggroup.com/career/#/6257").text
soup = BeautifulSoup(res, "lxml")
links = soup.find_all("a")

for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        item_link = link.get("href").strip()
        item_text = link.text.replace("View Position", "").strip()
        writer.writerow([item_link, item_text])
        print(item_link, item_text)

outfile.close()
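The same script can also be sketched with a with block, so the file is closed automatically even if the request or the parsing raises; the href guard additionally avoids a TypeError on anchors that have no href attribute:

import csv
import requests
from bs4 import BeautifulSoup

res = requests.get("http://implementconsultinggroup.com/career/#/6257").text
soup = BeautifulSoup(res, "lxml")

with open('career.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["job_link", "job_desc"])
    for link in soup.find_all("a"):
        href = link.get("href")
        # Skip anchors with no href; link.get("href") returns None for them
        if href and "career" in href and 'COPENHAGEN' in link.text:
            writer.writerow([href.strip(), link.text.replace("View Position", "").strip()])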
I'm trying to scrape the image from an article using Beautiful Soup. It seems to work, but I can't open the image; I get a file format error every time I try to open it from my desktop. Any insights?
timestamp = time.asctime()
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Create a new file to write content to
txt = open('%s.jpg' % timestamp, "wb")
# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write('\n' + "Image(s): " + download_img.read() + '\n' + '\n')
txt.close()
You are appending a new line and text to the start of the data for every image, essentially corrupting it.
Also, you are writing every image into the same file, again corrupting them.
Put the logic for writing the files inside the loop, don't add any extra data to the images, and it should work fine.
# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    timestamp = time.asctime()
    txt = open('%s.jpg' % timestamp, "wb")
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write(download_img.read())
    txt.close()
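One caveat: time.asctime() contains spaces and colons (colons are not valid in Windows filenames), and two images fetched within the same second would overwrite each other. A sketch that numbers the files instead, reusing soup and urllib2 from the question's Python 2 code:

import time
import urllib2

links = soup.find('figure').find_all('img', src=True)
for i, link in enumerate(links):
    src = link["src"].split("src=")[-1]
    # The index keeps filenames unique even within the same second
    filename = 'article_%d_%02d.jpg' % (int(time.time()), i)
    with open(filename, "wb") as img_file:
        img_file.write(urllib2.urlopen(src).read())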
I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read some text that contains multilingual information. I give my script below, hoping to make things clearer.
import urllib2
import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
print data[0]['content'].encode("utf-8")
The result I am getting is:
BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text
The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to find the exact encoding of the language of each script?
PS: I would like to mention that the site was selected totally at random, as it is representative of the problem I am encountering.
Thank you in advance!
The problem is with the terminal where you are outputting the result. The script works fine, and if you output the data to a file you will get it correctly.
Example:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup(page)
data = dom.findAll('meta', {'name': 'keywords'})

with open("test.txt", "w") as myfile:
    myfile.write(data[0]['content'].encode("utf-8"))
test.txt:
BBC中文网,主页,bbcchinese.com, email news, newsletter, subscription, full text
Which OS and terminal are you using?
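To confirm the terminal is the culprit, check what encoding Python believes stdout has; a quick Python 2 sketch, reusing the data variable from the script above:

import sys

# If this prints 'ascii' or a legacy codepage rather than 'UTF-8',
# the terminal encoding is what garbles the output
print sys.stdout.encoding

# Re-encode to whatever the terminal supports, replacing characters
# it cannot display instead of raising UnicodeEncodeError
print data[0]['content'].encode(sys.stdout.encoding or 'utf-8', 'replace')

You can also force UTF-8 output by setting the PYTHONIOENCODING environment variable (Python 2.6+) before running the script, e.g. PYTHONIOENCODING=utf-8 python script.py.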