How to iterate .extract_text() in pdfplumber - list

I am trying to build a tool to extract the text from each page of a PDF file. So far, only pdfplumber is returning readable text. Examples of pdfplumber (e.g. https://github.com/jsvine/pdfplumber) show the text being extracted per page. As such, I have done the following to capture multiple pages:
import pdfplumber

with pdfplumber.open(file) as pdf:
    p1 = pdf.pages[0]
    p2 = pdf.pages[1]
    p3 = pdf.pages[2]
    p1_text = p1.extract_text()
    p2_text = p2.extract_text()
    p3_text = p3.extract_text()
print(p1_text, p2_text, p3_text)
My PDF has 17 pages. I want to know whether it is possible to iterate through a list (i.e. 0-16) in order to generate p1, p2, p3, ... p17 (the first block under the with statement).
I have generated the necessary list using:
from PyPDF2 import PdfFileReader

file = '/Users/Guy/Coding/Crossref/sample.pdf'
pdf = PdfFileReader(open(file, 'rb'))
total_pages = pdf.getNumPages()
total_pages_range = list(range(total_pages))  # pages are 0-indexed: 0 .. total_pages - 1
But I can't seem to join the two together.
Any help would be much appreciated - just starting out with Python.
Thanks.

The pdfplumber.PDF class has a .pages property, which is a list containing one pdfplumber.Page instance per page loaded. So, if your PDF has n pages, you can iterate through all of them like this:
import pdfplumber

with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
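If you also want the page count, len(pdf.pages) replaces the separate PyPDF2 step, and enumerate() gives you the page number alongside each page. A minimal sketch (the pdf_path value and the all_text list are just illustrations):
import pdfplumber

pdf_path = '/Users/Guy/Coding/Crossref/sample.pdf'  # your file here
all_text = []
with pdfplumber.open(pdf_path) as pdf:
    print('total pages:', len(pdf.pages))  # no separate PyPDF2 step needed
    for i, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ''   # extract_text() can return None on empty pages
        all_text.append('--- page %d ---\n%s' % (i, text))
print('\n'.join(all_text))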

Python - webscraping; dictionary data structure

I need to scrape this website (http://setkab.go.id/profil-kabinet/#) and produce an Excel file that has headers "Cabinet names" in column 1 and "Era" in column 2. That means each Cabinet name (e.g. Kabinet Presidensil, Kabinet Sjahrir I) should have its own row - alongside its respective era (e.g. Era Revolusi Fisik, Era Republik Indonesia Serikat).
This is the closest I've gotten:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://setkab.go.id/profil-kabinet/#')
soup = BeautifulSoup(response.text, 'html.parser')
eras = soup.find_all('div', attrs={'class': "wpb_accordion_section group"})

setkab = {}
for element in eras:
    setkab[element.a.get_text()] = {}

for element in eras:
    cabname = element.find('div', attrs={'class': 'wpb_wrapper'}).get_text()
    setkab[element.a.get_text()]['cbnm'] = cabname

for item in setkab.keys():
    print item + setkab[item]['cbnm']

import os, csv
os.chdir("/Users/mxcodes/Code")
with open("setkabfinal.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8"), setkab[a]["cbnm"]])
However, this creates an Excel file with the headers "Era" and "Cabinet names" in column 1 and 2, respectively. It fails to put each Cabinet name in a separate row. For example, it has 'Era Revolusi Fisik' in column 1 and lists all the cabinets together in column 2.
My guess is that I need to switch the key-value pairs somehow so that each Cabinet becomes a key and its era becomes its value - because currently it's the other way around. But I've tried and failed to do so. Any help? Thank you!
From what I can see, the setkab[a]["cbnm"] value you write is just one long Unicode string. So when you do writer.writerow([a.encode("utf-8"), setkab[a]["cbnm"]]), you write the era in the first column and the whole string in a single cell of the next column. Even if the string contains \n characters, that does not split it across rows: the csv module assumes you want the value in exactly ONE cell, so it puts " before and after the value to make sure it stays in one cell. To write every cabinet on its own row, you should call the writerow method separately for each desired row.
For example, this code worked fine for me:
cabinets = setkab
with open("cabinets.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["Era", "Cabinet name"])
    for a in setkab.keys():
        writer.writerow([a.encode("utf-8")])  # write the era row
        # get all the values separated by newline chars (skipping empty strings)
        cabinets_list = [i for i in cabinets[a]["cbnm"].split('\n') if i != '']
        for i in cabinets_list:
            writer.writerow([a.encode("utf-8"), i])  # write every cabinet name in its own row
As you can see, I changed only the last three lines.
I hope this will help you!
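To see the quoting behaviour described above for yourself, here is a tiny standalone demo (Python 3 syntax; the strings are just examples from the question):
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# the second value contains a newline, so csv wraps it in quotes
# to keep it inside a single cell
writer.writerow(["Era Revolusi Fisik", "Kabinet Presidensil\nKabinet Sjahrir I"])
print(buf.getvalue())
# Era Revolusi Fisik,"Kabinet Presidensil
# Kabinet Sjahrir I"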

How would I format data in a PrettyTable?

I'm getting the text from the title and href attributes in the HTML. The code runs fine and I'm able to load it all into a PrettyTable. The problem I face now is that some titles seem to be too large for their cells and distort the entire table. I've tried adjusting hrules, vrules, and padding_width and have not found a resolution.
from bs4 import BeautifulSoup
from prettytable import PrettyTable
import urllib

r = urllib.urlopen('http://www.genome.jp/kegg-bin/show_pathway?map=hsa05215&show_description=show').read()
soup = BeautifulSoup(r, "lxml")
links = [area['href'] for area in soup.find_all('area', href=True)]
titles = [area['title'] for area in soup.find_all('area', title=True)]

k = PrettyTable()
k.field_names = ["ID", "Active Compound", "Link"]
c = 1
for i in range(len(titles)):
    k.add_row([c, titles[i], links[i]])
    c += 1
print(k)
This is how I would like the entire table to display:
print(k.get_string(start=0, end=25))
If PrettyTable can't do it, are there any other recommended modules that could accomplish this?
This was not a formatting error; rather, the table was so large that the Python window could not fit all the values on the screen.
This was proven by switching to a much smaller font size. If it helps anyone: exporting as .csv and then arranging the data in Excel worked.
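If you would rather keep the table readable in the terminal, PrettyTable's max_width option can wrap long cells onto extra lines instead of stretching the row. A minimal sketch, assuming your prettytable version supports max_width (the 40-character limit is an arbitrary choice):
from prettytable import PrettyTable

k = PrettyTable()
k.field_names = ["ID", "Active Compound", "Link"]
k.max_width = 40  # wrap any cell wider than 40 characters
k.add_row([1, "a deliberately long compound title " * 3,
           "http://www.genome.jp/kegg-bin/show_pathway?map=hsa05215"])
print(k)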

How can I assign web scraping outputs to an array using python?

I would like to execute this and get all of the text from the title and href attributes. The code runs, and I do get all of the needed data, but when I try to assign the output to a variable I only get the last instance of a matching attribute in the HTML.
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('http://www.genome.jp/kegg-bin/show_pathway?map=hsa05215&show_description=show').read()
soup = BeautifulSoup(r, "lxml")
for area in soup.find_all('area', href=True):
    print area['href']
for area in soup.find_all('area', title=True):
    print area['title']
If it helps, I'm doing this because I will create a list with the data later. I'm just beginning to learn, so extra explanations are much appreciated.
You need to use list comprehensions:
links = [area['href'] for area in soup.find_all('area', href=True)]
titles = [area['title'] for area in soup.find_all('area', title=True)]
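If the titles and links are guaranteed to line up, you can also collect them as pairs in one pass. A small sketch of that idea, reusing the soup object from the question and filtering on both attributes at once so the two lists never get out of step:
# collect (title, href) pairs from areas that carry both attributes
pairs = [(area['title'], area['href'])
         for area in soup.find_all('area', href=True, title=True)]
for title, href in pairs:
    print(title + ' -> ' + href)  # same behaviour on Python 2 and 3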

Paginating with Python 2.7.9 Web Crawler

I am trying to code a program in Python 2.7.9 to crawl and gather the club names, addresses and phone numbers from the website http://tennishub.co.uk/
The following code gets the job done, except that it doesn't move on to the subsequent pages for each location, such as
/Berkshire/1
/Berkshire/2
/Berkshire/3
..and so on.
import requests
from bs4 import BeautifulSoup

def tennis_club():
    url = 'http://tennishub.co.uk/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for link in soup.select('div.countylist a'):
        href = 'http://tennishub.co.uk' + link.get('href')
        pages_data(href)

def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text)
    g_data = soup.select('table.display-table')
    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[3].find_all('td', {'class': 'telrow'})[0].text
        except:
            pass
        try:
            print item.contents[5].findAll('td', {'class': 'emailrow'})[0].text
        except:
            pass
        print item_url

tennis_club()
I have tried tweaking the code to the best of my understanding, but it doesn't work at all.
Can someone please advise what I need to do so that the program goes through all the pages of a location, collects the data, and moves on to the next location, and so on?
You are going to need to put another for loop into this code:
for link in soup.select('div.countylist a'):
    href = 'http://tennishub.co.uk' + link.get('href')
    # new for loop goes here #
    pages_data(href)
If you want to brute force it, you could just have the for loop run as many times as the area with the most clubs (Surrey). However, you would double, triple, quadruple, etc. count the last clubs for many of the areas. This is ugly, but you can get away with it if you are using a database where you don't insert duplicates; it is unacceptable if you are writing to a file. In that case you will need to pull the number in parentheses after the area, e.g. Berkshire (39). To get that number you can do a get_text() on the div.countylist, which would change the above to:
for link in soup.select('div.countylist'):
    for endHref in link.find_all('a'):
        numClubs = endHref.next
        # need to clean numClubs up here to strip spaces and parens, e.g. " (39)" -> 39
        numPages = int(numClubs) // 10 + 1  # add one because // gives the floor
        for page in range(1, numPages + 1):  # visit every page, not just the last
            href = 'http://tennishub.co.uk' + endHref.get('href') + '/' + str(page)
            pages_data(href)
(Disclaimer: I didn't run this through bs4, so there might be errors, and you might need to use something other than .next, but the logic should help you.)
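For completeness, here is one way the whole crawl could look with the page loop in place. This is a sketch, not tested against the live site: it assumes ten clubs are listed per page and that the club count appears in parentheses right after each county link, and it reuses the pages_data() function from the question:
import re
import requests
from bs4 import BeautifulSoup

BASE = 'http://tennishub.co.uk'
CLUBS_PER_PAGE = 10  # assumption: ten clubs are listed per page

def county_page_urls():
    soup = BeautifulSoup(requests.get(BASE).text, 'html.parser')
    for link in soup.select('div.countylist a'):
        # assumption: the text right after the link looks like " (39)"
        match = re.search(r'\((\d+)\)', str(link.next_sibling or ''))
        num_clubs = int(match.group(1)) if match else CLUBS_PER_PAGE
        num_pages = (num_clubs + CLUBS_PER_PAGE - 1) // CLUBS_PER_PAGE  # ceiling division
        for page in range(1, num_pages + 1):
            yield BASE + link.get('href') + '/' + str(page)

for url in county_page_urls():
    pages_data(url)  # pages_data() as defined in the question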

Use Python with lxml to parse an XML document and write elements into a text file

With the following Python code I want to parse an XML file. You can see an extract of the XML file below the code. I need to extract everything that comes after "inv: name =", which in this case is "'datasource roof height' and (value = 1000 or value = 2000 or value = 3000 or value = 4000 or value = 5000 or value = 6000)". Any ideas?
My Python code (so far):
from lxml import etree

doc = etree.parse("data.xml")
for con in doc.xpath("//specification"):
    for cons in con.xpath("./@body"):
        with open("output.txt", "w") as cons_out:
            cons_out.write(cons)
Part of the XML file:
<ownedRule xmi:type="uml:Constraint" xmi:id="EAID_OR000004_EE68_4efa_8E1B_8DDFA8F95FB8" name="datasource roof height">
<constrainedElement xmi:idref="EAID_94F3B0A6_EE68_4efa_8E1B_8DDFA8F95FB8"/>
<specification xmi:type="uml:OpaqueExpression" xmi:id="EAID_COE000004_EE68_4efa_8E1B_8DDFA8F95FB8" body="inv: name = 'datasource roof height' and (value = 1000 or value = 2000 or value = 3000 or value = 4000 or value = 5000 or value = 6000)"/>
</ownedRule>
XML parsers understand attributes and elements; what is present within them (the textual content) is of no concern to the parser.
In order to solve your problem you need to split the string retrieved from the body attribute. Of course, I am assuming that the body attribute has the same format for all elements, i.e. "inv: name = some content".
from lxml import etree

doc = etree.parse("data.xml")
with open("output.txt", "w") as cons_out:  # open once, so later matches don't overwrite the file
    for con in doc.xpath("//specification"):
        for cons in con.xpath("./@body"):
            content = cons.split("inv: name =")[1]  # keep only what follows "inv: name ="
            cons_out.write(content + "\n")          # newline so each match gets its own line
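Since XPath can address attributes directly, the two loops can also collapse into a single query. A short sketch of that variant, using the same data.xml and output.txt names as above:
from lxml import etree

doc = etree.parse("data.xml")
with open("output.txt", "w") as cons_out:
    # //specification/@body returns the attribute values directly as strings
    for body in doc.xpath("//specification/@body"):
        # keep only what follows "inv: name =", one match per line
        cons_out.write(body.split("inv: name =")[1].strip() + "\n")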