BeautifulSoup: cleaning article text further - python-2.7

I need to count the characters in a news article. Some pages contain a lot of content I don't need (navigation, footer, etc.). I managed to get rid of most of it, but I still struggle to remove a few things: image copyright notices, image and video captions, and adverts. Could anyone suggest how to improve the code below so that it returns only the useful article text?
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content)
for s in soup.findAll("div", {"class":"story-body__inner"}):
    article = ''.join(s.findAll(text=True))
print(article)
print (len(article))
The code for this particular url yields this (top part just to illustrate the problem):
Image copyright
AFP
Image caption
Erdogan supporters began celebrating early outside party headquarters in Ankara
Turks have backed President Recep Tayyip Erdogan's call for sweeping new presidential powers, partial official results of a referendum indicate.With about 98% of ballots counted, "Yes" was on about 51.3% and "No" on about 48.7%.Erdogan supporters say replacing the parliamentary system with an executive presidency would modernise the country. Opponents have attacked a decision to accept unstamped ballot papers as valid unless proven otherwise.The main opposition Republican People's Party (CHP) is already demanding a recount of 60% of the votes.
/**/
(function() {
if (window.bbcdotcom && bbcdotcom.adverts && bbcdotcom.adverts.slotAsync) {
bbcdotcom.adverts.slotAsync('mpu', [1,2,3]);
}
})();
/**/

It seems that you need neither the script nor the figure tags, so remove them before extracting the text:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content)
# delete unwanted tags:
for e in soup(['figure', 'script']):
    e.decompose()
article_soup = [e.get_text() for e in soup.find_all(
    'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
print (len(article))
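The same idea can be tried without fetching the page. The sketch below is only an illustration: the HTML is made up to resemble the structure described in the question (it is not the real BBC markup), `extract_article` is a hypothetical helper, and the built-in `html.parser` is assumed. It also joins paragraphs with a newline, which avoids sentences running together as in "indicate.With" in the output above, at the cost of one extra character per paragraph break.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure from the question (not the real BBC markup).
SAMPLE_HTML = """
<div class="story-body__inner">
  <figure><span>Image copyright</span><span>AFP</span>
    <figcaption>Image caption Supporters outside party headquarters</figcaption>
  </figure>
  <p>First paragraph of the article.</p>
  <script>bbcdotcom.adverts.slotAsync('mpu', [1,2,3]);</script>
  <p>Second paragraph of the article.</p>
</div>
"""

def extract_article(html_text):
    """Hypothetical helper: return article text without figures and scripts."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Remove the tags whose text should not be counted.
    for tag in soup(["figure", "script"]):
        tag.decompose()
    paragraphs = []
    for body in soup.find_all("div", {"class": "story-body__inner"}):
        # Collect paragraph text only, so whitespace between tags is ignored.
        paragraphs.extend(p.get_text(strip=True) for p in body.find_all("p"))
    return "\n".join(paragraphs)

article = extract_article(SAMPLE_HTML)
print(article)
print(len(article))
```

If other unwanted elements remain on a real page, adding their tag names to the list passed to `soup(...)` is one way to extend this.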

Related

Export data from BeautifulSoup to CSV

[DISCLAIMER] I have been through plenty of other answers on this topic, but none of them seem to work for me.
I want to export the data I have scraped to a CSV file.
My question is: how do I write the piece of code that outputs the data to a CSV?
Current Code
import requests
from bs4 import BeautifulSoup
url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
req = requests.get(url).text
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        print "<a href='%s'>%s</a>" %(link.get("href"), link.text)
Output from the code
View Position
</a>
<a href='/career/management-consultants-to-help-our-customers-succeed-with-
it/'>
Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
View Position
</a>
<a href='/career/management-consultants-within-process-improvement/'>
Management consultants within process improvement
COPENHAGEN • We are looking for consultants with profound
experience in Six Sigma, Lean and operational
management
Code I have tried
with open('ImplementTest1.csv',"w") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["link.get", "link.text"])
csv_file.close()
Output in CSV format
Column 1: Url Links
Column 2: Job description
E.g
Column 1: /career/management-consultants-to-help-our-customers-succeed-with-
it/
Column 2: Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
Try this script and get the csv output:
import csv
import requests
from bs4 import BeautifulSoup

res = requests.get("http://implementconsultinggroup.com/career/#/6257").text
soup = BeautifulSoup(res, "lxml")
# newline='' (Python 3) stops the csv module from writing blank rows on Windows
with open('career.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["job_link", "job_desc"])
    for link in soup.find_all("a"):
        href = link.get("href") or ""  # some anchors have no href attribute
        if "career" in href and 'COPENHAGEN' in link.text:
            item_link = href.strip()
            item_text = link.text.replace("View Position", "").strip()
            writer.writerow([item_link, item_text])
            print(item_link, item_text)
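The link text on that page spans several lines, so each description ends up in the CSV as a quoted multi-line field. If one job per line is preferred, the whitespace can be collapsed before writing. This is a small sketch with made-up rows and an in-memory buffer standing in for the file; `clean_text` is a hypothetical helper, not part of the csv module.

```python
import csv
import io

def clean_text(text):
    """Hypothetical helper: collapse whitespace runs (including newlines) to single spaces."""
    return " ".join(text.split())

# Made-up rows shaped like the (href, link text) pairs scraped above.
rows = [
    ("/career/job-a/", "Job A\n  COPENHAGEN - First description\nView Position"),
    ("/career/job-b/", "Job B\n  COPENHAGEN - Second, with a comma\nView Position"),
]

buffer = io.StringIO()  # stands in for open('career.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(["job_link", "job_desc"])
for href, text in rows:
    writer.writerow([href, clean_text(text.replace("View Position", ""))])

csv_output = buffer.getvalue()
print(csv_output)

# Reading the data back confirms that each job is one record with two fields.
records = list(csv.reader(io.StringIO(csv_output)))
```

The csv writer quotes the field that contains a comma on its own, so no manual escaping is needed.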

Working with Scrapy 'regex definition'

I have been trying to write a script to scrape data from the website https://services.aamc.org/msar/home#null. I wrote a Python 2.7 script to get a piece of text from the website (I am aiming for anything at this point), but cannot seem to get it to work. I suspect this is because I have not written my regex properly to match the tag I am trying to scrape from. Does anyone have any idea what I might be doing wrong and how I can fix it?
Much appreciated.
Matt
import urllib
import re
url = "https://services.aamc.org/msar/home#null"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
regex = '<td colspan="2" class="schoolLocation">(.+?)</td>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print "the school location is ",price
First of all, don't use regular expressions to parse HTML. There are specialized tools called HTML parsers, like BeautifulSoup or lxml.html.
Actually, that advice is not very relevant to this particular problem, since there is no HTML to parse. The search results on this page are loaded dynamically from a separate endpoint: the browser sends an XHR request, receives a JSON response, parses it, and displays the search results using JavaScript. urllib is not a browser, so it gives you only the initial page HTML, in which the search results container is empty.
What you need to do is simulate the XHR request in your code. Let's use the requests package. Complete working code, printing a list of school programs:
import requests
url = "https://services.aamc.org/msar/home#null"
search_url = "https://services.aamc.org/msar/search/resultData"
with requests.Session() as session:
    session.get(url)  # visit main page
    # search
    data = {
        "start": "0",
        "limit": "40",
        "sort": "",
        "dir": "",
        "newSearch": "true",
        "msarYear": ""
    }
    response = session.post(search_url, data=data)
    # extract search results
    results = response.json()["searchResults"]["rows"]
    for result in results:
        print(result["schoolProgramName"])
Prints:
Albany Medical College
Albert Einstein College of Medicine
Baylor College of Medicine
...
Howard University College of Medicine
Howard University College of Medicine Joint Degree Program
Icahn School of Medicine at Mount Sinai
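The extraction step can be exercised offline. In the sketch below the payload is made up; only the nesting of `searchResults`, `rows` and `schoolProgramName` is taken from the code above. The defensive `.get()` calls and the `program_names` helper are assumptions about how one might guard against a response whose shape has changed.

```python
import json

# Made-up payload; only the key nesting mirrors the code above.
SAMPLE_RESPONSE = """
{"searchResults": {"rows": [
    {"schoolProgramName": "Albany Medical College"},
    {"schoolProgramName": "Baylor College of Medicine"},
    {"otherField": "row without a program name"}
]}}
"""

def program_names(payload):
    """Hypothetical helper: list program names, tolerating missing keys."""
    # Fall back to empty containers so a missing key yields [] instead of KeyError.
    rows = payload.get("searchResults", {}).get("rows", [])
    return [row["schoolProgramName"] for row in rows if "schoolProgramName" in row]

names = program_names(json.loads(SAMPLE_RESPONSE))
print(names)
```

With a live response, the call would be `program_names(response.json())`.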

Having issues with Python xpath scraping

I'm back again with a question for the wonderful people here :)
I've recently begun getting back into Python (50% done at Codecademy lol) and decided to make a quick script for web-scraping the spot price of gold in CAD. This will eventually be part of a much bigger script... but I'm VERY rusty and thought it would be a good project.
My issue:
I have been following the guide over at http://docs.python-guide.org/en/latest/scenarios/scrape/ to accomplish my goal, however my script always returns/prints
<Element html at 0xRANDOM>
with RANDOM being a (I assume) random hex number. This happens no matter which website I use.
My Code:
#!/bin/python
#Scrape current gold spot price in CAD
from lxml import html
import requests
def scraped_price():
    page = requests.get('http://goldprice.org/gold-price-canada.html')
    tree = html.fromstring(page.content)
    print "The full page is: ", tree #added for debug WHERE ERROR OCCURS
    bid = tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()")
    print "Scraped content: ", bid
    return bid
gold_scraper = scraped_price()
My research:
1) www.w3schools.com/xsl/xpath_syntax.asp
This is where I figured out to use '//span' to find all 'span' objects and then used the @id to narrow it down to the one I need.
2) Scraping web content using xpath won't work
This makes me think I simply have a bad tree.xpath setup. However I cannot seem to figure out where or why.
Any assistance would be greatly appreciated.
<Element html at 0xRANDOM>
What you see printed is the string representation of lxml.html's Element class. If you want to see the actual HTML content, use tostring():
print(html.tostring(tree, pretty_print=True))
You are also getting Scraped content: [] printed, which means that no elements matched the locator. If you inspect the HTML printed by tostring(), you will see that there is no element with id="gpotickerLeftCAD_price" in the downloaded source.
The prices on this particular site are retrieved dynamically with continuous JSONP GET requests issued periodically. You can either look into simulating these requests, or stay on a higher level automating a browser via selenium. Demo (using PhantomJS headless browser):
>>> import time
>>> from selenium import webdriver
>>>
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://goldprice.org/gold-price-canada.html")
>>> while True:
...     print(driver.find_element_by_id("gpotickerLeftCAD_price").text)
...     time.sleep(1)
...
1,595.28
1,595.28
1,595.28
1,595.28
1,595.28
1,595.19
...
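For the first option, simulating the requests: a JSONP response is JSON wrapped in a function call, so once the request is reproduced the wrapper has to be stripped before parsing. Here is a minimal sketch of that step only. The callback name and payload are invented for illustration and are not what goldprice.org actually returns; `parse_jsonp` is a hypothetical helper.

```python
import json
import re

def parse_jsonp(text):
    """Hypothetical helper: strip a callbackName(...) wrapper and parse the JSON inside."""
    # Match callbackName( ... ) with an optional trailing semicolon.
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))

# Invented response body, for illustration only.
sample = 'updateTicker({"currency": "CAD", "bid": 1595.28});'
data = parse_jsonp(sample)
print(data["bid"])
```

The real endpoint and its parameters would have to be read from the browser's network tab.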

Python lxml xpath no output

For educational purposes I am trying to scrape this page using lxml and requests in Python.
Specifically I just want to print the research areas of all the professors on the page.
This is what I have done so far:
import requests
from lxml import html
response=requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body=html.fromstring(response.content)
for row in parsed_body.xpath('//div[@id="maincontent"]//tr[position() mod 2 = 1]'):
    for column in row.xpath('//td[@class="fcardcls"]/tr[2]/td/font/text()'):
        print column.strip()
But it is not printing anything. I struggled quite a bit with XPaths and was initially using the "Copy XPath" feature in Chrome. I followed what was done in the following SO questions/answers, cleaned up my code quite a bit, and got rid of 'tbody' in the XPaths. The code still returns nothing.
1. Empty List Returned
2. Python-lxml-xpath problem
First of all, the main content with the desired data inside is loaded from a different endpoint via an XHR request - simulate that in your code.
Here is the complete working code printing names and a list of research areas per name:
import requests
from lxml import html
response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)
for row in parsed_body.xpath('.//td[@class="fcardcls"]'):
    name = row.findtext(".//a[@href]/b")
    name = ' '.join(name.split())  # getting rid of multiple spaces
    research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")
    print(name, research_areas)
The idea here is to use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text and the research areas from the text node that follows the bold "Research Areas: " label.
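The following-sibling step can be checked on a small inline fragment. The markup below is invented to imitate one "professor block" (the real page may differ), so treat it as a sketch of the technique, not a copy of the site's structure.

```python
from lxml import html

# Made-up markup imitating one "professor block"; the real page may differ.
SAMPLE = """
<table><tr>
  <td class="fcardcls">
    <a href="/profile?id=X"><b>Jane   Doe</b></a><br>
    <b>Research Areas: </b>Algorithms, Graph Theory
  </td>
</tr></table>
"""

tree = html.fromstring(SAMPLE)
faculty = []
for block in tree.xpath('.//td[@class="fcardcls"]'):
    # Normalise the repeated spaces inside the name.
    name = " ".join(block.findtext(".//a[@href]/b").split())
    # following-sibling::text() selects the text nodes after the bold label.
    areas_text = block.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0]
    faculty.append((name, [area.strip() for area in areas_text.split(",")]))

print(faculty)
```

Stripping each area after splitting on "," is slightly more tolerant than splitting on ", " when the spacing on the page is inconsistent.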

Python Scraping, how to get text from this through BeautifulSoup?

Here is my code to scrape the text content from a site. It works, but I am not getting plain text only. How do I handle that?
from bs4 import BeautifulSoup
import mechanize
def getArticle(url):
    br = mechanize.Browser()
    htmltext = br.open(url).read()
    soup = BeautifulSoup(htmltext)
    for tag in soup.findAll('span',{'itemprop':'articleBody'}):
        print tag.contents
For example, when I scrape a site, I get this output:
"[u"\nIn Soviet Russia, it's the banks that pay customers' bills.\xa0Or, at least, one might.", , u'\n', , u'\r\nAn interesting case has surfaced in Voronezh, Russia, where a man is suing a bank for more than 24 million Russian rubles (about $727,000) in compensation over a handcrafted document that was signed and recognized by the bank.\xa0', , u'\n', , u'\r\nA person who goes by name Dmitry Alexeev (his surname was changed ', by the first Russian outlet to publish this story, u') said that in 2008 he received a letter from ', Tinkoff Credit Systems, u'\xa0in his mailbox. It was a credit card application form with an agreement contract enclosed, much like the applications Americans receive daily from various banks working with ', Visa
How do I get plain text only?
Use tag.text instead of tag.contents:
from bs4 import BeautifulSoup
import mechanize
url = "http://www.minyanville.com/business-news/editors-pick/articles/A-Russian-Bank-Is-Sued-for/8/7/2013/id/51205"
br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('span',{'itemprop':'articleBody'}):
    print tag.text
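The difference between the two attributes is easy to see on a small inline snippet. The HTML below is made up (the real article markup is more complex) and the built-in `html.parser` is assumed. Replacing `\xa0` is an optional clean-up step, since the non-breaking spaces visible in the question's output survive in `.text`.

```python
from bs4 import BeautifulSoup

# Made-up snippet; the real article markup is more complex.
SAMPLE = (
    '<span itemprop="articleBody">In Soviet Russia, '
    '<a href="#">banks</a> pay\xa0bills.<br/>Next line.</span>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
tag = soup.find("span", {"itemprop": "articleBody"})

# .contents is a list mixing strings and Tag objects, hence the noisy output.
print(tag.contents)

# .get_text() (the same as .text) joins only the strings.
plain = tag.get_text()
# Optional clean-up: turn non-breaking spaces into ordinary spaces.
plain = plain.replace(u"\xa0", " ")
print(plain)
```

Passing a separator, as in `tag.get_text(" ")`, is one way to keep text on either side of a `<br/>` from running together.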