Python 3 | Beautiful Soup For Loop, Variable not assigned - regex

I'm in the process of learning Python, and I decided to practice by writing a program that can search for text on a site called "Library of Babel" (https://libraryofbabel.info/).
I'm using BeautifulSoup to get the actual text out of the HTML, and I'm then using Regular Expressions to search for what I'm looking for; in this case I was testing it with just the letter "a".
But for some reason the code raises an error saying the variable I'm searching for the "a" in is not assigned.
Code:
import re
import requests
from bs4 import BeautifulSoup

url = "https://libraryofbabel.info/browse.cgi"
pages, data = [], []
r = requests.get(url)
r = r.text
soup = BeautifulSoup(r, "html.parser")
for text in soup.findAll("li", {"onclick": "gethexfromlist(this.innerHTML); enterhex();"}):
    page = text.string
    pages.append(page)
for eachRoom in pages:
    url = "https://libraryofbabel.info/browse.cgi?" + eachRoom
    for eachWall in range(1, 5):
        url = url + "-w" + str(eachWall)
        for eachShelf in range(1, 6):
            url = url + "s-" + str(eachShelf)
            for eachVolume in range(1, 33):
                if len(str(eachVolume)) == 1:
                    url = url + "-v0" + str(eachVolume)
                else:
                    url = url + "-v" + str(eachVolume)
                for eachPage in range(1, 411):
                    url = url + ":" + str(eachPage)
                    r = requests.get(url)
                    r = r.text
                    soup = BeautifulSoup(r, "html.parser")
                    for text in soup.findAll("div", {"class": "bookrealign"}):
                        rdata = text.string
                    if data == []:
                        data = re.findall(r"a", rdata)
                    else:
                        break
Error:
Traceback (most recent call last):
File "C:\Users\...", line 37, in <module>
data = re.findall(r"a",rdata)
NameError: name 'rdata' is not defined
Thanks in advance for any help given :)

Your if sits outside the inner for loop, and soup.findAll("div", {"class": "bookrealign"}) finds nothing, so the loop body never runs and rdata never gets assigned before the if uses it.
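A minimal sketch of the fix: keep the regex search inside the loop, so it only runs when a matching div was actually found (the guard on data is kept from the original code).

for text in soup.findAll("div", {"class": "bookrealign"}):
    rdata = text.string
    # text.string can still be None when the div contains nested tags
    if rdata and data == []:
        data = re.findall(r"a", rdata)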

Related

how to convert .docx file to html using python?

import mammoth
f = open(r"D:\filename.docx", 'rb')  # raw string: in a normal string "\f" is a form feed
document = mammoth.convert_to_html(f)
I am unable to get an .html file when I run this code; please help me get it. Also, when I do convert to .html, the images in the Word file are not inserted into the .html file. Can you please help me get the images from the .docx into the .html?
Try this:
import mammoth
f = open("path_to_file.docx", 'rb')
b = open('filename.html', 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))  # document.value is the generated HTML string
f.close()
b.close()
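On the images question: by default mammoth inlines images as base64 data URIs inside the HTML. A hedged sketch, using mammoth's documented convert_image hook, that writes each image out to its own file instead (the filenames here are illustrative, not from the original answer):

import mammoth

counter = [0]  # mutable so the nested function can update it

def save_image(image):
    # image.content_type is e.g. "image/png"; derive a file extension from it
    counter[0] += 1
    extension = image.content_type.partition("/")[2]
    filename = "image{0}.{1}".format(counter[0], extension)
    with image.open() as image_bytes, open(filename, "wb") as f:
        f.write(image_bytes.read())
    return {"src": filename}

with open("path_to_file.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file, convert_image=mammoth.images.img_element(save_image))
with open("filename.html", "wb") as html_file:
    html_file.write(result.value.encode("utf8"))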
This may be late, but in case someone is still looking for an answer where the Word tables/images should remain the same after conversion to HTML, the answer below should help.
import win32com.client as win32
# Open MS Word
word = win32.gencache.EnsureDispatch('Word.Application')
wordFilePath = r"C:\filename.docx"  # raw string: in a normal string "\f" is a form feed
doc = word.Documents.Open(wordFilePath)
# change to a .html
txt_path = wordFilePath.split('.')[0] + '.html'
# wdFormatFilteredHTML has value 10
# saves the doc as an html
doc.SaveAs(txt_path, 10)
doc.Close()
# noinspection PyBroadException
try:
    word.ActiveDocument()
except Exception:
    word.Quit()
I suggest you try the following code:
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value

Getting ParseError when parsing using xml.etree.ElementTree

I am trying to extract the <comment> tags (using xml.etree.ElementTree) from the XML, find each comment's count number, and add all of the numbers together. I am reading the file from a URL using the urllib package.
Sample data: http://python-data.dr-chuck.net/comments_42.xml
But currently I am just trying to print the name and count.
import urllib
import xml.etree.ElementTree as ET

serviceurl = 'http://python-data.dr-chuck.net/comments_42.xml'
address = raw_input("Enter location: ")
url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})
print ("Retrieving: ", url)
link = urllib.urlopen(url)
data = link.read()
print("Retrieved ", len(data), "characters")
tree = ET.fromstring(data)
tags = tree.findall('.//comment')
for tag in tags:
    Name = ''
    count = ''
    Name = tree.find('commentinfo').find('comments').find('comment').find('name').text
    count = tree.find('comments').find('comments').find('comment').find('count').number
    print Name, count
Unfortunately, I am not even able to parse the XML file, because I am getting the following error:
Traceback (most recent call last):
File "ch13_parseXML_assignment.py", line 14, in <module>
tree = ET.fromstring(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
parser.feed(text)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: syntax error: line 1, column 49
I have read that in a similar situation the parser may not be accepting the XML file. Anticipating this, I put a try/except around tree = ET.fromstring(data) and was able to get past this line, but later it throws an error saying the tree variable is not defined, which defeats the purpose of the output I am expecting.
Can somebody please point me in a direction that helps me?
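A hedged diagnosis, not from an original answer: a ParseError at line 1 usually means the response is not XML at all. Here the query string is appended without a '?', and the sample file takes no parameters anyway, so the server most likely returns an HTML error page. A minimal sketch that fetches the sample URL directly and sums the <count> values:

import urllib
import xml.etree.ElementTree as ET

url = 'http://python-data.dr-chuck.net/comments_42.xml'
data = urllib.urlopen(url).read()
tree = ET.fromstring(data)
# './/comment/count' finds every <count> that is a child of a <comment>
counts = [int(c.text) for c in tree.findall('.//comment/count')]
print(sum(counts))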

Exception Handling in Beautiful soup/Python

I have written code that searches for some text in a web page. The page has multiple tabs, which I'm navigating using selenium. The problem is that the text I'm trying to find is not fixed to a specific page; it can be in any of the tabs. If the text is not found, an exception is raised, and the code should then go to the next tab to search. I'm having difficulty handling the exceptions.
Below is the code I'm trying out.
import requests
from bs4 import BeautifulSoup
import re
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.yxx.com/71463001")
a = driver.page_source
soup = BeautifulSoup(a, "html.parser")
try:
    head = soup.find_all("div", {"style":"overflow:hidden;max-height:25px"})
    head_str = str(head)
    z = re.search('B00.{7}', head_str).group(0)
    print z
    print 'header'
except AttributeError:
    g_info = soup.find_all("div", {"id":"details_readonly"})
    g_info1 = str(g_info)
    x = re.search('B00.{7}', g_info1).group(0)
    print x
    print 'description'
except AttributeError:
    corre = driver.find_element_by_id("tab_correspondence")
    corre.click()
    corr_g_info = soup.find_all("table", {"id" : "correspondence_view"})
    corr_g_info1 = str(corr_g_info)
    print corr_g_info
    y = re.search('B00.{7}', corr_g_info1).group(0)
    print y
    print 'correspondance'
When I run this code I get this error:
Traceback (most recent call last):
File "C:\Python27\BS.py", line 21, in <module>
x = re.search('B00.{7}', g_info1).group(0)
AttributeError: 'NoneType' object has no attribute 'group'
You're getting that error because re.search returned None (no match), and None has no .group() method. When I run your code, it fails there because the page you're trying to connect to isn't currently up.
As for why your except isn't catching it: you mistakenly wrote two excepts for one try, so only the first except AttributeError can ever handle the exception, and an AttributeError raised inside that handler is not caught by the second one.
By changing line 19 to x = re.search('B00.{7}', g_info1), the code runs and returns None and description - again, because the page isn't currently up.
Alternatively, to achieve what I think you're going for, nesting the try/except is an option:
try:
    head = soup.find_all("div", {"style":"overflow:hidden;max-height:25px"})
    head_str = str(head)
    z = re.search('B00.{7}', head_str).group(0)
    print z
    print 'header'
except AttributeError:
    try:
        g_info = soup.find_all("div", {"id":"details_readonly"})
        g_info1 = str(g_info)
        x = re.search('B00.{7}', g_info1)
        print x
        print 'description'
    except AttributeError:
        corre = driver.find_element_by_id("tab_correspondence")
        corre.click()
        corr_g_info = soup.find_all("table", {"id" : "correspondence_view"})
        corr_g_info1 = str(corr_g_info)
        print corr_g_info
        y = re.search('B00.{7}', corr_g_info1).group(0)
        print y
        print 'correspondance'
Of course, this code currently throws a NameError because there is no info on the site from which to define the corr_g_info variable.
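An alternative sketch, not from the original answer: loop over the candidate locations and treat "no match" as a value (None) instead of an exception, which avoids nesting try/except entirely. The click on the correspondence tab is omitted here for brevity:

searches = [
    ('header', 'div', {'style': 'overflow:hidden;max-height:25px'}),
    ('description', 'div', {'id': 'details_readonly'}),
    ('correspondance', 'table', {'id': 'correspondence_view'}),
]
for label, tag, attrs in searches:
    # re.search returns None when nothing matches, so test before calling .group
    match = re.search('B00.{7}', str(soup.find_all(tag, attrs)))
    if match:
        print match.group(0)
        print label
        break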

The BeautifulSoup object isn't fetching <ul> tags with class set to comments. Any suggestions?

>>> from bs4 import BeautifulSoup
>>> import urllib
>>> url = "http://www.securitytube.net/video/7313"
>>>
>>> page = urllib.urlopen(url)
>>>
>>> pageDom = BeautifulSoup(page)
On running the above code, I receive the dom object in the 'pageDom' variable. Now I do this (code mentioned below) and I get an empty list.
>>> allComments = pageDom.find_all("ul", class_="comments")
>>>
>>> allComments
[]
>>>
>>>
So now I removed 'class_' and am able to fetch all the unordered list tags.
Check the code below.
>>> allComments = pageDom.find_all("ul")
>>> len(allComments)
27
>>>
If I look at the source code of the page I can clearly see all the <ul> tags with the class "comments". I don't know where I'm going wrong. I also tried changing the parser to "lxml", but no joy.
Any suggestions/improvements will be highly appreciated.
I am not sure whether the version makes a difference, but here is the code and the output that worked fine with Python 3.4:
url = "http://www.securitytube.net/video/7313"
page = urllib.request.urlopen(url)
pageDom = BeautifulSoup(page)
#print(pageDom)
#On running the above code, I receive the dom object in the 'pageDom' variable. Now I do this (code mentioned below) and I get an empty list.
allComments = pageDom.find_all("ul", class_="comments")
#print(allComments)
print(len(allComments))
#So now I removed 'class_' and am able to fetch all the unordered list tags. Check the code below.
allComments = pageDom.find_all("ul")
#print(allComments)
print(len(allComments))
Output:
C:\Python34\python.exe C:/{path}/testPython.py
2
27
Process finished with exit code 0
You can uncomment the print lines to see the array contents.
I tested (multiple times) in Python 2.7 32-bit:
from bs4 import BeautifulSoup
import urllib

url = "http://www.securitytube.net/video/7313"
page = urllib.urlopen(url)
page = page.read()
pageDom = BeautifulSoup(page, 'lxml')
allComments = pageDom.find_all("ul", class_="comments")
print len(allComments)
allComments = pageDom.find_all("ul")
print len(allComments)
It prints:
2
27
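A hedged debugging sketch, not from either answer: when a class_ filter unexpectedly returns an empty list, comparing parsers can show whether malformed markup is the cause (html5lib must be installed separately):

from bs4 import BeautifulSoup
import urllib

page = urllib.urlopen("http://www.securitytube.net/video/7313").read()
for parser in ("html.parser", "lxml", "html5lib"):
    # each parser repairs broken markup differently, which changes what find_all sees
    dom = BeautifulSoup(page, parser)
    print parser, len(dom.find_all("ul", class_="comments"))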

PDF to Word Doc in Python

I've read through the other Stack Overflow questions regarding this, but they don't answer my issue, so down-vote away. It's version 2.7.
All I want to do is use Python to convert a PDF to a Word doc, or at minimum to text so I can copy and paste into a Word doc.
This is the code I have so far. All it prints is the female gender symbol.
Is my code wrong? Am I approaching this wrong? Do some PDFs just not work with PDFMiner? Do you know of any other alternatives to accomplish my goal of converting a PDF to Word, besides using PyPDF2 or PDFMiner?
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file('Bottom Dec.pdf', 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

print convert_pdf_to_txt(1)
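A hedged aside, not from the original thread: with the maintained fork pdfminer.six (Python 3), the same extraction is a one-liner; this assumes pdfminer.six is installed:

from pdfminer.high_level import extract_text  # pdfminer.six only

text = extract_text('Bottom Dec.pdf')
print(text)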
from pdf2docx import Converter

pdf_file = r'E:\Muhammad UMER LAR.pdf'  # raw strings avoid surprises with backslashes
doc_file = r'E:\Lari.docx'
c = Converter(pdf_file)
c.convert(doc_file)
c.close()
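A usage sketch under the same assumptions: pdf2docx's convert also accepts a page range via its documented start/end parameters, which helps with large files:

c = Converter(pdf_file)
c.convert(doc_file, start=0, end=2)  # only the first pages; see the pdf2docx docs for exact semantics
c.close()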
Another alternative solution is Aspose.Words Cloud SDK for Python; you can install it from pip for PDF to DOC conversion.
import asposewordscloud
import asposewordscloud.models.requests

api_client = asposewordscloud.ApiClient()
api_client.configuration.host = 'https://api.aspose.cloud'
# Get AppKey and AppSID from https://dashboard.aspose.cloud/
api_client.configuration.api_key['api_key'] = 'xxxxxxxxxxxxxxxxxxxxx'  # Put your appKey here
api_client.configuration.api_key['app_sid'] = 'xxxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxx'  # Put your appSid here
words_api = asposewordscloud.WordsApi(api_client)
filename = '02_pages.pdf'
remote_name = 'TestPostDocumentSaveAs.pdf'
dest_name = 'TestPostDocumentSaveAs.doc'
# Upload the PDF file to storage
request_storage = asposewordscloud.models.requests.UploadFileRequest(filename, remote_name)
response = words_api.upload_file(request_storage)
# Convert PDF to DOC and save to storage
save_options = asposewordscloud.SaveOptionsData(save_format='doc', file_name=dest_name)
request = asposewordscloud.models.requests.SaveAsRequest(remote_name, save_options)
result = words_api.save_as(request)
print("Result {}".format(result))
I'm developer evangelist at Aspose.