Export data from BeautifulSoup to CSV - python-2.7

[DISCLAIMER] I have been through plenty of the other answers in this area, but they do not seem to work for me.
I want to be able to export the data I have scraped as a CSV file.
My question is how do I write the piece of code which outputs the data to a CSV?
Current Code
import requests
from bs4 import BeautifulSoup

url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        print "<a href='%s'>%s</a>" % (link.get("href"), link.text)
Output from the code
View Position
</a>
<a href='/career/management-consultants-to-help-our-customers-succeed-with-
it/'>
Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
View Position
</a>
<a href='/career/management-consultants-within-process-improvement/'>
Management consultants within process improvement
COPENHAGEN • We are looking for consultants with profound
experience in Six Sigma, Lean and operational
management
Code I have tried
import csv

with open('ImplementTest1.csv', "w") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["link.get", "link.text"])
Desired output in CSV format:
Column 1: URL links
Column 2: Job description
E.g.
Column 1: /career/management-consultants-to-help-our-customers-succeed-with-
it/
Column 2: Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.

Try this script to get the CSV output:
import csv
import requests
from bs4 import BeautifulSoup

# note: newline='' requires Python 3; on Python 2.7 open the file as open('career.csv', 'wb') instead
outfile = open('career.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["job_link", "job_desc"])

res = requests.get("http://implementconsultinggroup.com/career/#/6257").text
soup = BeautifulSoup(res, "lxml")
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        item_link = link.get("href").strip()
        item_text = link.text.replace("View Position", "").strip()
        writer.writerow([item_link, item_text])
        print(item_link, item_text)
outfile.close()
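Since the question is tagged python-2.7, here is a minimal sketch of the same loop adapted for Python 2 (an untested variant: Python 2's csv module wants the file opened in binary mode and byte strings written to it, so the unicode text is encoded to UTF-8 first):

import csv
import requests
from bs4 import BeautifulSoup

outfile = open('career.csv', 'wb')  # binary mode for Python 2's csv module
writer = csv.writer(outfile)
writer.writerow(["job_link", "job_desc"])

res = requests.get("http://implementconsultinggroup.com/career/#/6257").text
soup = BeautifulSoup(res, "lxml")
for link in soup.find_all("a"):
    href = link.get("href") or ""  # some anchors have no href at all
    if "career" in href and 'COPENHAGEN' in link.text:
        item_text = link.text.replace("View Position", "").strip()
        # encode unicode to UTF-8 bytes before handing it to the csv writer
        writer.writerow([href.strip(), item_text.encode('utf-8')])
outfile.close()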

Related

Sentiment Analysis using NLTK and beautifulsoup

I'm working on a personal project where I'm thinking of doing sentiment analysis using NLTK and Vader to compare presidential speeches.
I was able to use Beautiful Soup to find one of George Washington's speeches and I managed to put the speech in a list. But after that, I'm not really sure of the best way to go further. It seems typical for the text to be read from a text file, but writing the list out directly leaves the brackets in the file, which makes it difficult. I'm not sure if I should store the web-scraped speech in a file or just work from the list. Or maybe I should put the speech into a dataframe already? I'm not too sure.
from bs4 import BeautifulSoup
import requests
import spacy
import pandas as pd
page_link = 'https://www.ourdocuments.gov/doc.php?flash=false&doc=11&page=transcript'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
textContent = []
for i in range(0, 7):
    paragraphs = page_content.find_all("p")[i].text
    textContent.append(paragraphs)
toWrite = open('washington.txt', 'w')
line = textContent
toWrite.write(str(line))
toWrite.close()
Any help or pointers would be greatly appreciated.
You can seek help from this article; do check:
https://towardsdatascience.com/basic-binary-sentiment-analysis-using-nltk-c94ba17ae386
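To go one step further than the link: below is a minimal sketch of scoring the scraped speech with NLTK's VADER. It assumes the textContent list from the question's code is in scope; joining the list into one string also sidesteps the bracket problem from writing str(list) to a file:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

# join the scraped paragraphs into a single string instead of writing str(list)
speech = ' '.join(textContent)

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(speech)
print(scores)  # a dict with 'neg', 'neu', 'pos' and 'compound' scores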

Issue scraping website with bs4 (beautiful soup) python 2.7

What I am attempting to accomplish is a simple Python web scraping script for Google Trends, and I am running into an issue when grabbing the class.
from bs4 import BeautifulSoup
import requests
results = requests.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
soup = BeautifulSoup(results.text, 'lxml')
keyword_list = soup.find_all('.details-top')
for keyword in keyword_list:
    print(keyword)
When printing keyword I receive an empty result; however, when I print soup I receive the entire HTML document. My goal is to print out the text of each keyword that was searched for on the page https://trends.google.com/trends/trendingsearches/daily?geo=AU which has a list of results such as:
1. covid-19
2. Woolworths jobs
If you use Google developer tools, select Inspect and hover over a title, you will see div.details-top. How would I print just the text of each title?
I can see that data being dynamically retrieved from an API call in the dev tools network tab. You can issue an XHR to that URL and then use a regex on the response text to parse out the query titles.
import requests, re

r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
p = re.compile(r'"query":"(.*?)"')
titles = p.findall(r)
print(titles)  # on Python 2.7 use: print titles
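If you prefer structured parsing over a regex, the endpoint returns JSON behind an anti-XSSI guard prefix. The key names below matched the response at the time of writing; treat the structure as an assumption to verify in the network tab:

import json
import requests

r = requests.get('https://trends.google.com/trends/api/dailytrends?hl=en-GB&tz=0&geo=AU&ns=15').text
# the body starts with a guard line such as )]}' -- drop everything before the first '{'
payload = json.loads(r[r.index('{'):])
# assumed structure: default -> trendingSearchesDays -> trendingSearches -> title -> query
titles = [search['title']['query']
          for day in payload['default']['trendingSearchesDays']
          for search in day['trendingSearches']]
print(titles)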

Soup.find and findAll unable to find table elements on hockey-reference.com

I'm just a beginner at web scraping and Python in general, so I'm sorry if the answer is obvious, but I can't figure out why I'm unable to find any of the table elements on https://www.hockey-reference.com/leagues/NHL_2018.html.
My initial thought was that this was a result of the whole div being commented out, so following some advice I found on here in another similar post, I replaced the comment characters and confirmed that they were removed when I saved soup.text to a text file and searched it. I was still unable to find any tables, however.
Trying to search a little further, I took the ID out of my .find and did a findAll, and the table was still coming up empty.
Here's the code I was trying to use, any advice is much appreciated!
import csv
import requests
from bs4 import BeautifulSoup
import re
comm = re.compile("<!--|-->")
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(comm.sub("", html))
table = soup.find('table', id="stats")
When searching for all of the table elements I was using
table = soup.findAll('table')
I'm also aware that there is a csv version on the site, I was just eager to practice.
Pass a parser along with your markup, for example BeautifulSoup(html, 'lxml'). Try the code below:
import requests
from bs4 import BeautifulSoup

url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
table = soup.findAll('table')
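Note that on hockey-reference.com many of the tables are wrapped in HTML comments, so it may help to combine the explicit parser with the comment-stripping idea from the question; a rough sketch:

import re
import requests
from bs4 import BeautifulSoup

comm = re.compile("<!--|-->")
url = 'https://www.hockey-reference.com/leagues/NHL_2018.html'
html = requests.get(url).text  # .text gives a str, which re.sub expects
soup = BeautifulSoup(comm.sub("", html), 'lxml')
tables = soup.findAll('table')
print(len(tables))  # should now include the commented-out tables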

BeautifulSoup: cleaning article text further

I need to count characters in a news article. Some pages have lots of stuff I don't need (navigation, footer etc.). I managed to get rid of all of these, but I still struggle to remove a few things like image copyright lines, image and video captions, and adverts. Could anyone suggest how to improve the code below to get only the useful text from the article?
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content)
for s in soup.findAll("div", {"class": "story-body__inner"}):
    article = ''.join(s.findAll(text=True))
    print(article)
    print(len(article))
The code for this particular url yields this (top part just to illustrate the problem):
Image copyright
AFP
Image caption
Erdogan supporters began celebrating early outside party headquarters in Ankara
Turks have backed President Recep Tayyip Erdogan's call for sweeping new presidential powers, partial official results of a referendum indicate.With about 98% of ballots counted, "Yes" was on about 51.3% and "No" on about 48.7%.Erdogan supporters say replacing the parliamentary system with an executive presidency would modernise the country. Opponents have attacked a decision to accept unstamped ballot papers as valid unless proven otherwise.The main opposition Republican People's Party (CHP) is already demanding a recount of 60% of the votes.
/**/
(function() {
if (window.bbcdotcom && bbcdotcom.adverts && bbcdotcom.adverts.slotAsync) {
bbcdotcom.adverts.slotAsync('mpu', [1,2,3]);
}
})();
/**/
It seems that you don't need the script or figure tags, so:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content)

# delete unwanted tags:
for e in soup(['figure', 'script']):
    e.decompose()

article_soup = [e.get_text() for e in soup.find_all('div', {'class': 'story-body__inner'})]
article = ''.join(article_soup)
print(article)
print(len(article))
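If captions or adverts still slip through, the same decompose() pattern extends to any other element you can target with a selector. The class names below are hypothetical placeholders; inspect the page in your browser's dev tools and substitute the real ones:

# hypothetical selectors -- replace with the class names you find in dev tools
for e in soup.select('.some-caption-class, .some-advert-class'):
    e.decompose()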

Python Scraping, how to get text from this through BeautifulSoup?

Well, here is my code to scrape text content from a site. It is working, though I am not getting plain text only. How do I handle that?
from bs4 import BeautifulSoup
import mechanize
def getArticle(url):
    br = mechanize.Browser()
    htmltext = br.open(url).read()
    soup = BeautifulSoup(htmltext)
    for tag in soup.findAll('span', {'itemprop': 'articleBody'}):
        print tag.contents
For example, when I scrape a site, I get this output:
"[u"\nIn Soviet Russia, it's the banks that pay customers' bills.\xa0Or, at least, one might.", , u'\n', , u'\r\nAn interesting case has surfaced in Voronezh, Russia, where a man is suing a bank for more than 24 million Russian rubles (about $727,000) in compensation over a handcrafted document that was signed and recognized by the bank.\xa0', , u'\n', , u'\r\nA person who goes by name Dmitry Alexeev (his surname was changed ', by the first Russian outlet to publish this story, u') said that in 2008 he received a letter from ', Tinkoff Credit Systems, u'\xa0in his mailbox. It was a credit card application form with an agreement contract enclosed, much like the applications Americans receive daily from various banks working with ', Visa
How do I get plain text only?
Use tag.text instead of tag.contents:
from bs4 import BeautifulSoup
import mechanize
url = "http://www.minyanville.com/business-news/editors-pick/articles/A-Russian-Bank-Is-Sued-for/8/7/2013/id/51205"
br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('span', {'itemprop': 'articleBody'}):
    print tag.text
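As a side note, bs4's get_text() gives a little more control over the output than .text; the separator and strip arguments below are standard options:

for tag in soup.findAll('span', {'itemprop': 'articleBody'}):
    # get_text() joins all descendant strings; the separator and strip
    # arguments tidy the whitespace between fragments
    print tag.get_text(" ", strip=True)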