Python Scraping, how to get text from this through BeautifulSoup? - python-2.7

Here is my code to scrape text content from a site. It works, but I am not getting plain text only. How do I handle that?
from bs4 import BeautifulSoup
import mechanize
def getArticle(url):
    br = mechanize.Browser()
    htmltext = br.open(url).read()
    soup = BeautifulSoup(htmltext)
    for tag in soup.findAll('span',{'itemprop':'articleBody'}):
        print tag.contents
For example, when I scrape a site, I get this output:
"[u"\nIn Soviet Russia, it's the banks that pay customers' bills.\xa0Or, at least, one might.", , u'\n', , u'\r\nAn interesting case has surfaced in Voronezh, Russia, where a man is suing a bank for more than 24 million Russian rubles (about $727,000) in compensation over a handcrafted document that was signed and recognized by the bank.\xa0', , u'\n', , u'\r\nA person who goes by name Dmitry Alexeev (his surname was changed ', by the first Russian outlet to publish this story, u') said that in 2008 he received a letter from ', Tinkoff Credit Systems, u'\xa0in his mailbox. It was a credit card application form with an agreement contract enclosed, much like the applications Americans receive daily from various banks working with ', Visa
How do I get plain text only?

Use tag.text instead of tag.contents:
from bs4 import BeautifulSoup
import mechanize
url = "http://www.minyanville.com/business-news/editors-pick/articles/A-Russian-Bank-Is-Sued-for/8/7/2013/id/51205"
br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('span',{'itemprop':'articleBody'}):
    print tag.text
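To see the difference without hitting the network, here is a minimal sketch on an inline snippet. The HTML string is a made-up stand-in for the real page, and the parser name "html.parser" is just the one bundled with Python; the answer above does not specify one.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page markup.
html = (
    '<span itemprop="articleBody">In Soviet Russia, '
    '<a href="#">banks</a> pay the bills.\xa0Or one might.</span>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("span", {"itemprop": "articleBody"})

# .contents is a list of child nodes (strings and Tag objects),
# which is why printing it shows a Python list with u'...' items.
print(len(tag.contents))

# get_text() (the method behind .text) joins all descendant strings.
# Replacing the non-breaking space is optional cleanup.
plain = tag.get_text().replace(u"\xa0", u" ").strip()
print(plain)
```

The \xa0 sequences in the question's output are non-breaking spaces from the page; they are still present in tag.text and only disappear if you replace them yourself.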

Related

Issue scraping website with bs4 (beautiful soup) python 2.7

I am trying to write a simple Python web scraping script for Google Trends and am running into an issue when grabbing elements by class.
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When printing the tags I get an empty result; however, when I print soup I get the entire HTML document. My goal is to print the text of each "Keyword" that was searched for on the page https://trends.google.com/trends/trendingsearches/daily?geo=AU
This has a list of results:
1. covid-19
2. Woolworths jobs
If you open Google's developer tools, select Inspect and hover over a title, you will see div.details-top.
How would I print just the text of each title?
I can see that data being dynamically retrieved from an API call in the dev tools network tab, which is why it is missing from the HTML that requests receives. You can issue an XHR to that URL, then use a regex on the response text to parse out the query titles.
import requests, re
from bs4 import BeautifulSoup as bs
r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles) # 2.7 use print titles
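The regex step can be tried offline on a small sample. This is only a sketch: the sample string, its ")]}'," prefix and its field layout are assumptions made up for illustration, not a copy of the real API response.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the API response text.
# The prefix and the nesting of the fields are assumptions.
sample = (
    ")]}',\n"
    '{"default":{"searches":['
    '{"title":{"query":"covid-19"}},'
    '{"title":{"query":"Woolworths jobs"}}]}}'
)

# Approach from the answer: pull every "query" value with a regex.
titles = re.findall(r'"query":"(.*?)"', sample)
print(titles)

# Alternative: drop everything before the first "{" and parse as JSON.
payload = json.loads(sample[sample.index("{"):])
titles_json = [s["title"]["query"] for s in payload["default"]["searches"]]
print(titles_json)
```

Parsing as JSON also decodes escaped characters in the titles, which the regex leaves as they appear in the raw text.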

Export data from BeautifulSoup to CSV

[DISCLAIMER] I have been through plenty of the other answers in this area, but they do not seem to work for me.
I want to be able to export the data I have scraped as a CSV file.
My question is how do I write the piece of code which outputs the data to a CSV?
Current Code
import requests
from bs4 import BeautifulSoup
url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
req = requests.get(url).text
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        print "<a href='%s'>%s</a>" %(link.get("href"), link.text)
Output from the code
View Position
</a>
<a href='/career/management-consultants-to-help-our-customers-succeed-with-
it/'>
Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
View Position
</a>
<a href='/career/management-consultants-within-process-improvement/'>
Management consultants within process improvement
COPENHAGEN • We are looking for consultants with profound
experience in Six Sigma, Lean and operational
management
Code I have tried
with open('ImplementTest1.csv',"w") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["link.get", "link.text"])
csv_file.close()
Output in CSV format
Column 1: Url Links
Column 2: Job description
E.g
Column 1: /career/management-consultants-to-help-our-customers-succeed-with-
it/
Column 2: Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
Try this script and get the csv output:
import csv
import requests
from bs4 import BeautifulSoup
outfile = open('career.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["job_link", "job_desc"])
res = requests.get("http://implementconsultinggroup.com/career/#/6257").text
soup = BeautifulSoup(res,"lxml")
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        item_link = link.get("href").strip()
        item_text = link.text.replace("View Position","").strip()
        writer.writerow([item_link, item_text])
        print(item_link, item_text)
outfile.close()
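The writing step on its own can be sketched without the network. The rows below are hypothetical stand-ins for the scraped (href, text) pairs, and an in-memory buffer replaces the file so the result can be inspected. Two things are added that the script above does not do: it guards against anchors without an href (link.get("href") returns None for those), and it collapses line breaks inside the description.

```python
import csv
import io

# Hypothetical rows standing in for the (href, text) pairs scraped above.
rows = [
    ("/career/management-consultants-within-process-improvement/",
     "Management consultants within process improvement\nCOPENHAGEN"),
    (None, "link without href"),
]

# Swap for open("career.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["job_link", "job_desc"])
for href, text in rows:
    # Skip anchors with no href before testing for "career".
    if href and "career" in href and "COPENHAGEN" in text:
        # Collapse internal whitespace so each job stays on one CSV line.
        writer.writerow([href.strip(), " ".join(text.split())])

print(buffer.getvalue())
```

Leaving the line breaks in is also valid CSV, since csv.writer quotes such cells, but many spreadsheet imports are easier to read with one line per row.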

BeautifulSoup: cleaning article text further

I need to count characters in a news article. Some pages have lots of stuff I don't need (navigation, footer etc.). I managed to get rid of all of that, but I still struggle to remove a few things such as image copyright notices, image and video captions, and adverts. Could anyone suggest how to improve the code below to get only the useful text from the article?
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content)
for s in soup.findAll("div", {"class":"story-body__inner"}):
    article = ''.join(s.findAll(text=True))
print(article)
print (len(article))
The code for this particular url yields this (top part just to illustrate the problem):
Image copyright
AFP
Image caption
Erdogan supporters began celebrating early outside party headquarters in Ankara
Turks have backed President Recep Tayyip Erdogan's call for sweeping new presidential powers, partial official results of a referendum indicate.With about 98% of ballots counted, "Yes" was on about 51.3% and "No" on about 48.7%.Erdogan supporters say replacing the parliamentary system with an executive presidency would modernise the country. Opponents have attacked a decision to accept unstamped ballot papers as valid unless proven otherwise.The main opposition Republican People's Party (CHP) is already demanding a recount of 60% of the votes.
/**/
(function() {
if (window.bbcdotcom && bbcdotcom.adverts && bbcdotcom.adverts.slotAsync) {
bbcdotcom.adverts.slotAsync('mpu', [1,2,3]);
}
})();
/**/
It seems that you need neither the script nor the figure tags, so remove them before extracting the text:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content)
# delete unwanted tags:
for e in soup(['figure', 'script']):
    e.decompose()
article_soup = [e.get_text() for e in soup.find_all(
    'div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
print (len(article))
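The same idea on a small inline document, as a sketch: the markup is invented to imitate the structure described in the question, not copied from the BBC page. One deliberate change from the code above is that paragraphs are joined with a space instead of an empty string, so sentences do not run together as they do in the question's output ("indicate.With about").

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described above.
html = """
<div class="story-body__inner">
  <figure><span>Image caption</span> Supporters in Ankara</figure>
  <p>Turks have backed the call for new powers.</p>
  <script>bbcdotcom.adverts.slotAsync('mpu', [1,2,3]);</script>
  <p>The opposition is demanding a recount.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Remove captions and inline scripts from the tree entirely.
for element in soup(["figure", "script"]):
    element.decompose()

# Join paragraph texts with a space so sentences stay separated.
body = soup.find("div", {"class": "story-body__inner"})
article = " ".join(p.get_text(strip=True) for p in body.find_all("p"))
print(article)
print(len(article))
```

Keep in mind that the separator counts toward the character total, so choose it according to how you want the length measured.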

Working with Scrapy 'regex definition'

I have been trying to write a script to scrape data from the website https://services.aamc.org/msar/home#null. I wrote a Python 2.7 scraping script to get a piece of text from the website (I am aiming for anything at this point), but cannot seem to get it to work. I suspect this is because I have not configured my regex properly to identify the tag I am trying to scrape from. Does anyone have any idea what I might be doing wrong and how I can fix it?
Much appreciated.
Matt
import urllib
import re
url = "https://services.aamc.org/msar/home#null"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
regex = '<td colspan="2" class="schoolLocation">(.+?)</td>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print "the school location is ",price
First of all, don't use regular expressions to parse HTML. There are specialized tools called HTML parsers, like BeautifulSoup or lxml.html.
Actually, that advice is not very relevant to this particular problem, since there is no need to parse HTML. The search results on this page are loaded dynamically from a separate endpoint: the browser sends an XHR request, receives a JSON response, parses it and displays the search results with the help of JavaScript executed in the browser. urllib is not a browser, so it only gives you the initial page HTML with an empty search results container.
What you need to do is simulate the XHR request in your code. Let's use the requests package. Complete working code, printing a list of school programs:
import requests
url = "https://services.aamc.org/msar/home#null"
search_url = "https://services.aamc.org/msar/search/resultData"
with requests.Session() as session:
    session.get(url)  # visit main page
    # search
    data = {
        "start": "0",
        "limit": "40",
        "sort": "",
        "dir": "",
        "newSearch": "true",
        "msarYear": ""
    }
    response = session.post(search_url, data=data)
    # extract search results
    results = response.json()["searchResults"]["rows"]
    for result in results:
        print(result["schoolProgramName"])
Prints:
Albany Medical College
Albert Einstein College of Medicine
Baylor College of Medicine
...
Howard University College of Medicine
Howard University College of Medicine Joint Degree Program
Icahn School of Medicine at Mount Sinai
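For pages where the cell really is present in the static HTML, the parser-based approach mentioned at the start of this answer might look like the sketch below. The markup and the location values are invented for illustration; the real page fills its table via XHR, as explained above.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup with made-up values.
html = """
<table>
  <tr><td colspan="2" class="schoolLocation">Albany, NY</td></tr>
  <tr><td class="schoolLocation" colspan="2">Houston,
      TX</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching on the class works regardless of attribute order or line
# breaks inside the cell, both of which defeat the question's regex.
locations = [" ".join(td.get_text().split())
             for td in soup.find_all("td", class_="schoolLocation")]
print(locations)
```

The second row is the case the question's regex would miss: the attributes are in a different order and the text spans two lines, which (.+?) does not match without re.DOTALL.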

Read multilanguage strings from html via Python 2.7

I am new to Python 2.7 and I am trying to extract some info from HTML files. More specifically, I want to read some text that contains multilanguage information. Here is my script, hoping it makes things clearer.
import urllib2
import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
print data[0]['content'].encode("utf-8")
The result I get is:
BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text
The problem is in the first string. Is there any way to print what exactly I am reading? Also is there any way to find the exact encoding of the language of each script?
PS: I would like to mention that the site was selected at random, as it is representative of the problem I am encountering.
Thank you in advance!
The problem is with the terminal you are printing to. The script works fine, and if you write the data to a file you will get it correctly.
Example:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
with open("test.txt", "w") as myfile:
    myfile.write(data[0]['content'].encode("utf-8"))
test.txt:
BBC中文网,主页,bbcchinese.com, email news, newsletter, subscription, full text
Which OS and terminal are you using?
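The effect can be reproduced without a terminal. This is a sketch under one assumption: the terminal interprets the UTF-8 bytes with a legacy code page. cp437 is used here purely as an example; the actual code page depends on the OS and terminal settings.

```python
import io
import os
import tempfile

# "BBC" followed by three Chinese characters, written with escapes so the
# example does not depend on the source file's own encoding.
text = u"BBC\u4e2d\u6587\u7f51"
utf8_bytes = text.encode("utf-8")

# Assumption: the terminal decodes output with a legacy code page.
garbled = utf8_bytes.decode("cp437")
print(garbled == text)

# The bytes themselves are intact: undoing the wrong decode recovers them.
print(garbled.encode("cp437").decode("utf-8") == text)

# Writing to a file with an explicit encoding bypasses the terminal.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
with io.open(path, encoding="utf-8") as handle:
    restored = handle.read()
print(restored == text)
```

io.open with an explicit encoding behaves the same on Python 2.7 and Python 3, so the manual .encode("utf-8") before writing is not needed with it.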