How can I assign web scraping outputs to an array using Python? - python-2.7

I would like to execute this and get all of the text from the title and href attributes. The code runs, and I do get all of the needed data, but when I attempt to assign the outputs to an array, I only get the last instance of the matching attributes in the HTML.
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('http://www.genome.jp/kegg-bin/show_pathway?map=hsa05215&show_description=show').read()
soup = BeautifulSoup(r, "lxml")

for area in soup.find_all('area', href=True):
    print area['href']

for area in soup.find_all('area', title=True):
    print area['title']
If it helps, I'm doing this because I will create a list with the data later. I'm just beginning to learn, so extra explanations are much appreciated.

You need to use list comprehensions:
links = [area['href'] for area in soup.find_all('area', href=True)]
titles = [area['title'] for area in soup.find_all('area', title=True)]
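Each comprehension above is equivalent to an explicit loop that appends to a list. Here is the links case written out, which also shows why your original attempt only kept the last value: plain assignment inside the loop overwrites the variable on every pass, while append() grows the list.
links = []
for area in soup.find_all('area', href=True):
    links.append(area['href'])  # collect every href instead of overwriting one variable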


How to iterate .extract_text in pdfplumber

I am trying to build a tool to extract the text from each page of a PDF file. So far, only pdfplumber is returning readable text. Examples of pdfplumber (e.g. https://github.com/jsvine/pdfplumber) show the text being extracted per page. As such, I have done the following to capture multiple pages:
import pdfplumber

with pdfplumber.open(file) as pdf:
    p1 = pdf.pages[0]
    p2 = pdf.pages[1]
    p3 = pdf.pages[2]
    p1_text = p1.extract_text()
    p2_text = p2.extract_text()
    p3_text = p3.extract_text()
print(p1_text, p2_text, p3_text)
My pdf has 17 pages. I want to know whether it is possible to iterate through a list (i.e. 0 - 16) in order to generate p1, p2, p3... p17 (the first block under the with statement).
I have generated the necessary list using:
file = '/Users/Guy/Coding/Crossref/sample.pdf'
from PyPDF2 import PdfFileReader
pdf = PdfFileReader(open(file,'rb'))
total_pages = pdf.getNumPages()
total_pages_range = list(range(1, total_pages))
But I can't seem to join the two together.
Any help would be much appreciated - just starting out with Python.
Thanks.
The pdfplumber.PDF class has a .pages property, which is a list containing one pdfplumber.Page instance per loaded page. So, if your PDF has n pages, you can iterate through all of them like this:
import pdfplumber
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
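If you want to keep each page's text (for example, to pair it with your 0-16 page numbers) rather than just print it, you can collect the results into a list; a minimal sketch:
import pdfplumber

with pdfplumber.open(file) as pdf:
    # one entry per page, in reading order; page_texts[0] is page 1
    page_texts = [page.extract_text() for page in pdf.pages]

print(len(page_texts))  # 17 for your file
print(page_texts[0])    # text of the first page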

How would I format data in a PrettyTable?

I'm getting the text from the title and href attributes in the HTML. The code runs fine and I'm able to import it all into a PrettyTable. The problem I face now is that some titles are too long for their column, which distorts the entire table. I've tried adjusting hrules, vrules, and padding_width without finding a resolution.
from bs4 import BeautifulSoup
from prettytable import PrettyTable
import urllib

r = urllib.urlopen('http://www.genome.jp/kegg-bin/show_pathway?map=hsa05215&show_description=show').read()
soup = BeautifulSoup(r, "lxml")

links = [area['href'] for area in soup.find_all('area', href=True)]
titles = [area['title'] for area in soup.find_all('area', title=True)]

k = PrettyTable()
k.field_names = ["ID", "Active Compound", "Link"]
for i in range(len(titles)):
    k.add_row([i + 1, titles[i], links[i]])

print(k)
I would like the entire table to display properly, printed for example as:
print(k.get_string(start=0, end=25))
If PrettyTable can't do it. Are there any other recommended modules that could accomplish this?
This was not a formatting error; the table was simply so large that the Python window could not fit all of the values on screen. This was proven by switching to a much smaller font size. If it helps anyone: exporting as .csv and then arranging the data in Excel also worked.
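If you do want PrettyTable itself to wrap the long cells, recent versions of the library expose a max_width setting that caps column widths (check the docs of your installed version); a sketch, reusing the table above:
k.max_width = 40  # cells longer than 40 characters wrap onto additional lines
print(k)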

networkx graph display using matplotlib - missing labels

I am writing a program which generates satisfiable models (connected graphs) for a specific input string. The details are not important; the main problem is that each node has a label, and those labels can be lengthy. As a result they do not fit into the figure: all the nodes are displayed, but some labels are only partly shown. Also, the displayed figure does not offer a way to zoom out, so it is impossible to capture the entire graph with full labels in one figure.
Can someone help me out and perhaps suggest a solution?
import matplotlib.pyplot as plt
import networkx as nx

for i in range(0, len(Graphs)):
    graph = Graphs[i]
    custom_labels = {}
    node_colours = ['y']
    for node in graph.nodes():
        custom_labels[node] = graph.node[node]
        node_colours.append('c')
    #nx.circular_layout(Graphs[i])
    nx.draw(Graphs[i], nx.circular_layout(Graphs[i]), node_size=1500, with_labels=True, labels=custom_labels, node_color=node_colours)
    # show with custom labels
    fig_name = "graph" + str(i) + ".png"
    #plt.savefig(fig_name)
    plt.show()
Update: picture added (not reproduced here).
You could scale the figure:
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edge('a'*50, 'b'*50)
nx.draw(G, with_labels=True)
plt.savefig('before.png')

# widen the x-axis limits so the long labels fall inside the axes
l, r = plt.xlim()
print(l, r)
plt.xlim(l - 2, r + 2)
plt.savefig('after.png')
(The "before" and "after" images are not reproduced here; run the snippet to generate before.png and after.png.)
You could reduce the font size using the font_size parameter:
nx.draw(Graphs[i], nx.circular_layout(Graphs[i]), ... , font_size=6)
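You can also combine both ideas: give matplotlib a larger canvas via figsize and shrink the label font. A minimal sketch, reusing G from the example above:
import matplotlib.pyplot as plt
import networkx as nx

plt.figure(figsize=(12, 12))  # a bigger canvas leaves room for long labels
nx.draw(G, nx.circular_layout(G), with_labels=True, node_size=1500, font_size=6)
plt.savefig('graph.png', bbox_inches='tight')  # trim surplus whitespace on save
plt.show()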

Paginating with Python 2.7.9 Web Crawler

I am trying to code a program in Python 2.7.9 to crawl and gather the club names, addresses and phone numbers from the website http://tennishub.co.uk/
The following code gets the job done, except that it doesn't move on to the subsequent pages for each location, such as
/Berkshire/1
/Berkshire/2
/Berkshire/3
..and so on.
import requests
from bs4 import BeautifulSoup

def tennis_club():
    url = 'http://tennishub.co.uk/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for link in soup.select('div.countylist a'):
        href = 'http://tennishub.co.uk' + link.get('href')
        pages_data(href)

def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text)
    g_data = soup.select('table.display-table')
    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[3].find_all('td', {'class': 'telrow'})[0].text
        except:
            pass
        try:
            print item.contents[5].findAll('td', {'class': 'emailrow'})[0].text
        except:
            pass
        print item_url

tennis_club()
I have tried tweaking the code to the best of my understanding, but it doesn't work at all.
Can someone please advise what I need to do so that the program goes through all the pages of a location, collects the data, and moves on to the next location, and so on?
You are going to need to put another for loop into this code:
for link in soup.select('div.countylist a'):
    href = 'http://tennishub.co.uk' + link.get('href')
    # new for loop goes here #
    pages_data(href)
If you want to brute-force it, you can just have the for loop run as many times as the area with the most clubs (Surrey), but then you would double-, triple-, or quadruple-count the last clubs for many of the areas. This is ugly, but you can get away with it if you are inserting into a database that rejects duplicates. It is unacceptable if you are writing to a file, though. In that case you need to pull the number in parentheses after the area name, e.g. Berkshire (39). To get that number you can call get_text() on the div.countylist, which changes the above to:
for link in soup.select('div.countylist'):
    for endHref in link.find_all('a'):
        numClubs = endHref.next
        # clean numClubs up to remove spaces and parens, e.g. " (39)" -> 39
        numClubs = int(numClubs.strip().strip('()'))
        endHrefNum = numClubs // 10 + 1  # add one because // gives the floor
        for page in range(1, endHrefNum + 1):
            href = 'http://tennishub.co.uk' + endHref.get('href') + '/' + str(page)
            pages_data(href)
(Disclaimer: I didn't run this through bs4, so there might be syntax errors, and you might need to use something other than .next, but the logic should help you.)
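An alternative sketch that avoids parsing the club counts entirely: keep requesting /1, /2, /3 and so on for each county until a page comes back with no club tables. This assumes the site serves an empty listing past the last page, which is worth verifying first:
import requests
from bs4 import BeautifulSoup

def crawl_county(county_href):
    page = 1
    while True:
        url = 'http://tennishub.co.uk' + county_href + '/' + str(page)
        soup = BeautifulSoup(requests.get(url).text)
        if not soup.select('table.display-table'):
            break  # no club tables here, so we have run past the last page
        pages_data(url)  # your existing function; re-fetching is wasteful but simple
        page += 1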

Converting a list from a .txt file into a dictionary

Ok, I've tried all the methods in Convert a list to a dictionary in Python, but I can't seem to get this to work right. I'm trying to convert a list that I've made from a .txt file into a dictionary. So far my code is:
import os.path
from tkinter import *
from tkinter.filedialog import askopenfilename
import csv

window = Tk()
window.title("Please Choose a .txt File")
fileName = askopenfilename()

classInfoList = []
classRoster = {}

with open(fileName, newline='') as listClasses:
    for line in csv.reader(listClasses):
        classInfoList.append(line)
The .txt file is in the format:
professor
class
students
An example would be:
Professor White
Chem 101
Jesse Pinkman, Brandon Walsh, Skinny Pete
The output I desire would be a dictionary with professors as the keys, and then the class and list of students for the values.
OUTPUT:
{"Professor White": ["Chem 101", [Jesse Pinkman, Brandon Walsh, Skinny Pete]]}
However, when I tried the things in the above post, I kept getting errors.
What can I do here?
Thanks
Since the data making up your dictionary is on consecutive lines, you will have to process three lines at once. You can use the next() method on the file handle like this:
output = {}
input_file = open('file1')

for line in input_file:
    key = line.strip()                  # professor line
    value = [next(input_file).strip()]  # class line
    value.append([s.strip() for s in next(input_file).split(',')])  # students line
    output[key] = value

input_file.close()
This would give you:
{'Professor White': ['Chem 101', ['Jesse Pinkman', 'Brandon Walsh', 'Skinny Pete']]}
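A variant of the same idea using a with block, so the file is closed automatically, and slicing to group the lines three at a time; a minimal sketch:
output = {}
with open('file1') as f:
    lines = [line.strip() for line in f if line.strip()]

# lines[0::3] are professors, lines[1::3] classes, lines[2::3] student rows
for prof, course, students in zip(lines[0::3], lines[1::3], lines[2::3]):
    output[prof] = [course, [s.strip() for s in students.split(',')]]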