python 2.7 BeautifulSoup find the table containing a particular string - python-2.7

After searching a BeautifulSoup document for a string, how do I get the table which contains that string? I have a solution which works on one table that I am familiar with:
My code is as follows:
import mechanize
from bs4 import BeautifulSoup
sitemap_url = "https://www.rbi.org.in/scripts/sitemap.aspx"
br = mechanize.Browser()
br.addheaders = [('User-agent',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'),
('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')]
response = br.open(sitemap_url)
text = response.read()
br.close()
soup = BeautifulSoup(text, 'lxml')
# Find the table containing the financial intermediaries.
# First I find "Financial Intermediaries" in soup.
fin_str = soup.find(text="Financial Intermediaries")
# Next I step out through the parents
# until it turns out that I have found the table.
fin_tbl = fin_str.parent.parent.parent.parent
The problem with this is that I have to check the results each time I step out of the document. How can I add .parent until I see the table?

Append the following code onto the program:
# The first tag around the string is the parent.
fn_in = fin_str.parent
# Step out through the parents.
def step_out(i):
if isinstance(i, element.NavigableString):
pass
return i.parent
# Continue until 'table' is in the name of the tag.
while not 'table' in fn_in.name:
fn_in = step_out(fn_in)

Related

Replace White-space with hyphen then create URL

I'm trying to speed up a process of webscraping by sending raw data to python in lieu of correctly formatted data.
Current data is received as an excel file with data formatted as:
26 EXAMPLE RD EXAMPLEVILLE SA 5000
Data is formatted in excel via macros to:
Replace all spaces with hyphen
Change all text to lower-case
Paste text onto end of http://example.com/property/
Formatted data is http://www.example.com/property/26-example-rd-exampleville-sa-5000
What i'm trying to accomplish:
Get python to go into excel sheet and follow formatting rules listed above, then pass the records to the scraper.
Here is the code I have been trying to compile - please go easy i am VERY new.
Any advice or reading sources related to python formatting would be appreciated.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import csv
from lxml import html
import xlrd
# URL_BUILDER
# Source File for UNFORMATTED DATA
file_location = "C:\Python27\Projects\REA_SCRAPER\NewScraper\ScrapeFile.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')
# REA_SCRAPER
# Pass Data from URL_BUILDER to URL_LIST []
URL_LIST = []
# Search Phrase to capture suitable URL's for Scraping
text2search = \
'''<p class="property-value__title">
RECENTLY SOLD
</p>'''
# Write Sales .CSV file
with open('Results.csv', 'wb') as csv_file:
writer = csv.writer(csv_file)
for (index, url) in enumerate(URL_LIST):
page = requests.get(url)
print '<Scanning Url For Sale>'
if text2search in page.text:
tree = html.fromstring(page.content)
(title, ) = (x.text_content() for x in tree.xpath('//title'))
(price, ) = (x.text_content() for x in tree.xpath('//div[#class="property-value__price"]'))
(sold, ) = (x.text_content().strip() for x intree.xpath('//p[#class="property-value__agent"]'))
writer.writerow([title, price, sold])
else:
writer.writerow(['No Sale'])
If you're just trying to figure out how to do the formatting in Python:
text = '26 EXAMPLE RD EXAMPLEVILLE SA 5000'
url = 'http://example.com/property/' + text.replace(' ', '-').lower()
print(url)
# Output:
# http://example.com/property/26-example-rd-exampleville-sa-5000

How do I write all of these rows into a CSV file for a given range?

The purpose of the below code is the webscrape the oxford english dictionary for words that were "invented" in each year within a range of years. This all works as intended.
import csv
import os
import re
import requests
import urllib2
year_start= 1550
year_end = 1552
subject_search = ['Law']
for year in range(year_start, year_end +1):
path = '/Applications/Python 3.5/Economic'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent':user_agent}
resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')
request = urllib2.Request('http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='+ str(year)+ '&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='+ str(subject_search)+ '&type=dictionarysearch', None, header)
page = opener.open(request)
with open(resultPath, 'wb') as outputw, open(htmlPath, 'w') as outputh:
urlpage = page.read()
outputh.write(urlpage)
new_words = re.findall(r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
print new_words
csv_writer = csv.writer(outputw)
if csv_writer.writerow([year] + new_words):
csv_writer.writerow([year, word])
However, when I actually run the code, the only portion that gets written to the csv file is the very last year that I call. So, my csv file ends up looking like a one row like this:
1552, word1, word2, word3, etc....
I basically want to have a separate row for each year in the range of years. How do I go about this?
You keep overwriting in the loop and every time you run the code, open it once outside the loops and append to the file opening with a instead of w so each run of the code will add to the existing data not overwrite.:
with open("/Applications/Python 3.5/Economic/OED_table.csv", 'a') as outputw, open("/Applications/Python 3.5/Economic/OED.html", 'a') as outputh:
for year in range(year_start, year_end +1):
.....................

Python Web scraper using Beautifulsoup 4

I wanted to create a database with commonly used words. Right now when I run this script it works fine but my biggest issue is I need all of the words to be in one column. I feel like what I did was more of a hack than a real fix. Using Beautifulsoup, can you print everything in one column without having extra blank lines?
import requests
import re
from bs4 import BeautifulSoup
#Website you want to scrap info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
# Creating the CSV file
commonFile = open('common_words.csv', 'wb')
# Grabbing the lines you want
for node in soup.findAll("tr"):
# Getting just the text and removing the html
words = ''.join(node.findAll(text=True))
# Removing the extra lines
ID = re.sub(r'[\t\r\n]', '', words)
# Needed to add a break in the line to make the rows
update = ''.join(ID)+'\n'
# Now we add this to the file
commonFile.write(update)
commonFile.close()
How about this?
import requests
import csv
from bs4 import BeautifulSoup
f = csv.writer(open("common_words.csv", "w"))
f.writerow(["common_words"])
#Website you want to scrap info from
res = requests.get("https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt")
# Getting just the content using bs4
soup = BeautifulSoup(res.content, "lxml")
words = soup.select('div[class=file] tr')
for i in range(len(words)):
word = words[i].text
f.writerow([word.replace('\n', '')])

Scraping aspx with Python mechanize - getting search results

I've been trying to scrape Congressional financial disclosure reports using mechanize; the form submits successfully, but I can't locate any of the search results. My script is below:
br = Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('http://clerk.house.gov/public_disc/financial-search.aspx')
br.select_form(name='aspnetForm')
br.set_all_readonly(False)
br['filing_year'] = ['2008']
response = br.submit(name='search_btn')
html = response.read()
I'm new to scraping, and would appreciate any corrections/advice on this. Thanks!
This is an alternative solution that involves a real browser with the help of selenium tool.
from selenium import webdriver
from selenium.webdriver.support.select import Select
# initialize webdriver instance and visit url
url = "http://clerk.house.gov/public_disc/financial-search.aspx"
browser = webdriver.Firefox()
browser.get(url)
# find select tag and select 2008
select = Select(browser.find_element_by_id('ctl00_cphMain_txbFiling_year'))
select.select_by_value('2008')
# find "search" button and click it
button = browser.find_element_by_id('ctl00_cphMain_btnSearch')
button.click()
# display results
table = browser.find_element_by_id('search_results')
for row in table.find_elements_by_tag_name('tr')[1:-1]:
print [cell.text for cell in row.find_elements_by_tag_name('td')]
# close the browser
browser.close()
Prints:
[u'ABERCROMBIE, HON.NEIL', u'HI01', u'2008', u'FD Amendment']
[u'ABERCROMBIE, HON.NEIL', u'HI01', u'2008', u'FD Original']
[u'ACKERMAN, HON.GARY L.', u'NY05', u'2008', u'FD Amendment']
[u'ACKERMAN, HON.GARY L.', u'NY05', u'2008', u'FD Amendment']
...

How do you convert the multi-line content scraped into a list?

I was trying to convert the content scraped into a list for data manipulation, but got the following error: TypeError: 'NoneType' object is not callable
#! /usr/bin/python
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import os
import re
# Copy all of the content from the provided web page
webpage = urlopen("http://www.optionstrategist.com/calculators/free-volatility- data").read()
# Grab everything that lies between the title tags using a REGEX
preBegin = webpage.find('<pre>') # Locate the pre provided
preEnd = webpage.find('</pre>') # Locate the /pre provided
# Copy the content between the pre tags
voltable = webpage[preBegin:preEnd]
# Pass the content to the Beautiful Soup Module
raw_data = BeautifulSoup(voltable).splitline()
The code is very simple. This is the code for BeautifulSoup4:
# Find all <pre> tag in the HTML page
preTags = webpage.find_all('pre')
for tag in preTags:
# Get the text inside the tag
print(tag.get_text())
Reference:
find_all()
Kinds of filters to put into name field of find()/findall()
get_text()
To get the text from the first pre element:
#!/usr/bin/env python
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = "http://www.optionstrategist.com/calculators/free-volatility-data"
soup = BeautifulSoup(urlopen(url))
print soup.pre.string
To extract lines with data:
from itertools import dropwhile
lines = soup.pre.string.splitlines()
# drop lines before the data table header
lines = dropwhile(lambda line: not line.startswith("Symbol"), lines)
# extract lines with data
lines = (line for line in lines if '%ile' in line)
Now each line contains data in a fixed-column format. You could use slicing and/or regex to parse/validate individual fields in each row.