I have written a code block that searches for some text in a web page. The page has multiple tabs, which I'm navigating using Selenium. The problem is that the text I'm trying to find is not fixed to a specific tab: it can be in any of the tabs on the page. If the text is not found, an exception is raised, and when that happens the script should move on to the next tab and search there. I'm having difficulty handling the exceptions.
Below is the code I'm trying out.
import requests
from bs4 import BeautifulSoup
import re
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.yxx.com/71463001")
a = driver.page_source
soup = BeautifulSoup(a, "html.parser")
try:
    head = soup.find_all("div", {"style": "overflow:hidden;max-height:25px"})
    head_str = str(head)
    z = re.search('B00.{7}', head_str).group(0)
    print z
    print 'header'
except AttributeError:
    g_info = soup.find_all("div", {"id": "details_readonly"})
    g_info1 = str(g_info)
    x = re.search('B00.{7}', g_info1).group(0)
    print x
    print 'description'
except AttributeError:
    corre = driver.find_element_by_id("tab_correspondence")
    corre.click()
    corr_g_info = soup.find_all("table", {"id": "correspondence_view"})
    corr_g_info1 = str(corr_g_info)
    print corr_g_info
    y = re.search('B00.{7}', corr_g_info1).group(0)
    print y
    print 'correspondance'
When I run this code I get this error:
Traceback (most recent call last):
  File "C:\Python27\BS.py", line 21, in <module>
    x = re.search('B00.{7}', g_info1).group(0)
AttributeError: 'NoneType' object has no attribute 'group'
You're getting that error because you're calling .group() on the result of re.search, and re.search returns None when the pattern isn't found. When I run your code it fails there because the page you're trying to connect to isn't currently up.
As for why your except isn't catching it: you wrote two excepts for a single try, and the second one can never fire. The try only catches an AttributeError raised by the code before the first except; an AttributeError raised inside the first except block is not handled by the second, it simply propagates.
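A minimal sketch of that behaviour (the messages are hypothetical, purely to illustrate the scoping):

try:
    raise AttributeError('from the try body')    # caught by the first except
except AttributeError:
    raise AttributeError('from the handler')     # NOT caught below; it propagates
except AttributeError:
    print 'never reached'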
By changing line 19 to x = re.search('B00.{7}', g_info1) (dropping the .group(0)), the code runs and prints None and description, again because the page isn't currently up.
Alternatively, to achieve what I think you're going for, you can nest the try/except:
try:
    head = soup.find_all("div", {"style": "overflow:hidden;max-height:25px"})
    head_str = str(head)
    z = re.search('B00.{7}', head_str).group(0)
    print z
    print 'header'
except AttributeError:
    try:
        g_info = soup.find_all("div", {"id": "details_readonly"})
        g_info1 = str(g_info)
        x = re.search('B00.{7}', g_info1)
        print x
        print 'description'
    except AttributeError:
        corre = driver.find_element_by_id("tab_correspondence")
        corre.click()
        corr_g_info = soup.find_all("table", {"id": "correspondence_view"})
        corr_g_info1 = str(corr_g_info)
        print corr_g_info
        y = re.search('B00.{7}', corr_g_info1).group(0)
        print y
        print 'correspondance'
Of course, this code currently throws a NameError because there is no info on the site from which to define the corr_g_info variable.
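If you end up with more than a couple of places to check, a flatter pattern is to loop over the candidate locations. Here is a sketch reusing the selectors above (it leaves out the tab click that the correspondence lookup would need, and the not-found message is an assumption about what you want):

sources = [
    ('header', ("div", {"style": "overflow:hidden;max-height:25px"})),
    ('description', ("div", {"id": "details_readonly"})),
    ('correspondance', ("table", {"id": "correspondence_view"})),
]
for label, (tag, attrs) in sources:
    match = re.search('B00.{7}', str(soup.find_all(tag, attrs)))
    if match:                        # found the code, so stop searching
        print match.group(0)
        print label
        break
else:
    print 'not found on any tab'     # the loop finished without a break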
I am trying to extract the <comment> tags (using xml.etree.ElementTree) from the XML, find each comment count number, and add all of the numbers together. I am reading the file via a URL using the urllib package.
sample data: http://python-data.dr-chuck.net/comments_42.xml
But currently I am just trying to print the name and count.
import urllib
import xml.etree.ElementTree as ET

serviceurl = 'http://python-data.dr-chuck.net/comments_42.xml'
address = raw_input("Enter location: ")
url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})

print ("Retrieving: ", url)
link = urllib.urlopen(url)
data = link.read()
print("Retrieved ", len(data), "characters")

tree = ET.fromstring(data)
tags = tree.findall('.//comment')

for tag in tags:
    Name = ''
    count = ''
    Name = tree.find('commentinfo').find('comments').find('comment').find('name').text
    count = tree.find('comments').find('comments').find('comment').find('count').number
    print Name, count
Unfortunately, I am not able to even parse the XML file into Python, because I am getting the following error:
Traceback (most recent call last):
  File "ch13_parseXML_assignment.py", line 14, in <module>
    tree = ET.fromstring(data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: syntax error: line 1, column 49
I have read previously that in a similar situation the parser may not be accepting the XML file. Anticipating this, I put a try/except around tree = ET.fromstring(data) and was able to get past that line, but later it throws an error saying the tree variable is not defined. This defeats the purpose of the output I am expecting.
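Roughly what that attempt looked like (reconstructed from the description, not the exact code):

try:
    tree = ET.fromstring(data)
except ET.ParseError:
    print 'parse failed'               # the error is swallowed, but tree is never assigned

tags = tree.findall('.//comment')      # so this raises NameError: name 'tree' is not defined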
Can somebody please point me in a direction that helps me?
I just began Python crawling and have been trying to crawl web text for a month.
I tried this code with Python 2.7.13 and it worked well before.
import urllib
from bs4 import BeautifulSoup

class IEEECrawler:
    def __init__(self):
        self.baseUrl = "http://ieeexplore.ieee.org"
        self.targetUrl = "http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?reload=true&filter%3DAND%28p_IS_Number%3A4359286%29&rowsPerPage=100&pageNumber=1&resultAction=REFINE&resultAction=ROWS_PER_PAGE&isnumber=4359286#1.html"
        self.soup = BeautifulSoup(urllib.urlopen(self.targetUrl).read(), "lxml")
        self.doc_list = self.soup.find_all('div', {'class': "txt"})
        self.subUrl = []

    def crawlOriginalPage(self):
        file = open("./result.txt", "w")
        for doc in self.doc_list:
            head = doc.find("h3")
            author_list = ''
            for author in doc.find_all("div", {'class': "authors"}):
                for tt in author.find_all('span', {'id': "preferredName"}):
                    author_list += tt['data-author-name'] + ";"
            author_list = author_list[:-1]
            file.write(head.find("span").text + ';')
            file.write(author_list.strip() + ';')
            file.write(self.baseUrl + head.find('a')['href'] + ';')
            file.write(doc.find("div", {'class': "hide abstract RevealContent"}).find("p").text.replace('View full abstract' + '»'.decode('utf-8'), '').strip() + '\n')
        file.close()
        print 'finish'
However, today I ran this code again and it doesn't work, with the error messages below. I can't figure out what should be fixed.
Traceback (most recent call last):
  File "/Users/user/Downloads/ieee_fin/ieee.py", line 35, in <module>
    crawler.crawlOriginalPage()
  File "/Users/user/Downloads/ieee_fin/ieee.py", line 29, in crawlOriginalPage
    file.write(doc.find("div", {'class': "hide abstract RevealContent"}).find("p").text.replace('View full abstract'+'»'.decode('utf-8'),'').strip()+ '\n')
AttributeError: 'NoneType' object has no attribute 'find'
The error shows you the line:
file.write(doc.find("div", {'class': "hide abstract RevealContent"}).find("p").text.replace('View full abstract'+'»'.decode('utf-8'),'').strip()+ '\n')
Just look for the method find (there are 2) and check to see what comes before it.
Maybe this part is bad:
doc.find(...)
which would mean that doc is None.
Or maybe this part is bad:
doc.find("div", {'class': "hide abstract RevealContent"}).find("p")
which would mean that doc.find(...class...) is returning None, possibly because no matching div exists for that entry.
Bottom line, you probably either need to put a try...except wrapper around that code, or break it up a little and start checking for None.
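For example, here is a sketch of the second option, keeping the names from the script above (skipping entries that have no abstract is an assumption about the desired behaviour):

for doc in self.doc_list:
    abstract_div = doc.find("div", {'class': "hide abstract RevealContent"})
    if abstract_div is None:
        continue                     # this search result has no abstract block
    p = abstract_div.find("p")
    if p is None:
        continue                     # abstract block without a paragraph
    text = p.text.replace('View full abstract' + '»'.decode('utf-8'), '').strip()
    file.write(text + '\n')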
I'm in the process of learning Python, and I decided to practice by making a program that can search for text on a site called the Library of Babel (https://libraryofbabel.info/).
I'm using BeautifulSoup to get the actual text out of the HTML and then using regular expressions to search for what I'm looking for; in this case I was testing it with just the letter "a".
But for some reason the code gives an error saying the variable I'm searching for the "a" in is not assigned.
Code:
import re
import requests
from bs4 import BeautifulSoup

url = "https://libraryofbabel.info/browse.cgi"
pages, data = [], []

r = requests.get(url)
r = r.text
soup = BeautifulSoup(r, "html.parser")
for text in soup.findAll("li", {"onclick": "gethexfromlist(this.innerHTML); enterhex();"}):
    page = text.string
    pages.append(page)

for eachRoom in pages:
    url = "https://libraryofbabel.info/browse.cgi?" + eachRoom
    for eachWall in range(1, 5):
        url = url + "-w" + str(eachWall)
        for eachShelf in range(1, 6):
            url = url + "s-" + str(eachShelf)
            for eachVolume in range(1, 33):
                if len(str(eachVolume)) == 1:
                    url = url + "-v0" + str(eachVolume)
                else:
                    url = url + "-v" + str(eachVolume)
                for eachPage in range(1, 411):
                    url = url + ":" + str(eachPage)
                    r = requests.get(url)
                    r = r.text
                    soup = BeautifulSoup(r, "html.parser")
                    for text in soup.findAll("div", {"class": "bookrealign"}):
                        rdata = text.string
                    if data == []:
                        data = re.findall(r"a", rdata)
                    else:
                        break
Error:
Traceback (most recent call last):
  File "C:\Users\...", line 37, in <module>
    data = re.findall(r"a",rdata)
NameError: name 'rdata' is not defined
Thanks in advance for any help given :)
Your if is outside the loop, and soup.findAll("div",{"class":"bookrealign"}) finds nothing, so rdata never gets defined.
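A sketch of one way to guard it, keeping your names (treating an empty result as "nothing on this page" is an assumption):

divs = soup.findAll("div", {"class": "bookrealign"})
if divs:                               # only search when the div actually exists
    rdata = divs[0].string
    if rdata and data == []:
        data = re.findall(r"a", rdata)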
Running the below script works for about 60% of the entries from the MasterGroupList, but then it suddenly fails with the error below. Although my questions seem to be poorly worded, you guys have been able to help me before. Any idea how I can avoid this error, or what is throwing off the script? The MasterGroupList looks like:
Groups Pulled from AD
SET00 POWERUSER
SET00 USERS
SEF00 CREATORS
SEF00 USERS
...another 300 entries...
Error:
Traceback (most recent call last):
  File "C:\Users\ks185278\OneDrive - NCR Corporation\Active Directory Access Script\test.py", line 44, in <module>
    print group.member
  File "C:\Python27\lib\site-packages\active_directory.py", line 805, in __getattr__
    raise AttributeError
AttributeError
Code:
from active_directory import *
import os

file = open("C:\Users\NAME\Active Directory Access Script\MasterGroupList.txt", "r")
fileAsList = file.readlines()
indexOfTitle = fileAsList.index("Groups Pulled from AD\n")
i = indexOfTitle + 1
while i <= len(fileAsList):
    fileLocation = 'C:\\AD Access\\%s\\%s.txt' % (fileAsList[i][:5], fileAsList[i][:fileAsList[i].find("\n")])
    # Creates the dir if it does not exist already
    if not os.path.isdir(os.path.dirname(fileLocation)):
        os.makedirs(os.path.dirname(fileLocation))
    fileGroup = open(fileLocation, "w+")
    # writes group members to the open file
    group = find_group(fileAsList[i][:fileAsList[i].find("\n")])
    print group.member
    for group_member in group.member:  # this is line 44
        fileGroup.write(group_member.cn + "\n")
    fileGroup.close()
    i += 1
Disclaimer: I don't know Python, but I know Active Directory fairly well.
If it's failing on this:
for group_member in group.member:
it could possibly mean that the group has no members.
Depending on how Python handles this, it could also mean that the group has only one member and group.member is a plain string rather than an array.
What does print group.member show?
The source code of active_directory.py is here: https://github.com/tjguk/active_directory/blob/master/active_directory.py
These are the relevant lines:
if name not in self._delegate_map:
try:
attr = getattr(self.com_object, name)
except AttributeError:
try:
attr = self.com_object.Get(name)
except:
raise AttributeError
So it looks like it just can't find the attribute you're looking up, which in this case looks like the 'member' attribute.
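A guard on the Python side seems like the simplest fix. Here is a sketch that treats a missing member attribute as an empty group (that interpretation is an assumption); a single member coming back as a plain string, as the other answer suggests, would still need its own handling:

try:
    members = group.member
except AttributeError:
    members = []                       # no 'member' attribute: treat as an empty group
for group_member in members:
    fileGroup.write(group_member.cn + "\n")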
I just wrote a simple web-scraping script to give me all the episode links on a particular site's page. The script was working fine, but now it's broken. I didn't change anything.
Try this URL (for scraping): http://www.crunchyroll.com/tabi-machi-late-show
Now the script works midway and then gives me an error stating 'Element not found in the cache - perhaps the page has changed since it was looked up'.
I looked it up on the internet, and people suggested using the 'implicit wait' command at certain places. I did that; still no luck.
UPDATE: I tried this script on a remote desktop and it's working there without any problems.
Here's my script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import time
from subprocess import Popen
#------------------------------------------------
try:
    Link = raw_input("Please enter your Link : ")
    if not Link:
        raise ValueError('Please Enter A Link To The Anime Page. This Application Will now Exit in 5 Seconds.')
except ValueError as e:
    print(e)
    time.sleep(5)
    exit()

print 'Analyzing the Page. Hold on a minute.'
driver = webdriver.Firefox()
driver.get(Link)
assert "Crunchyroll" in driver.title
driver.implicitly_wait(5)  # <-- I tried removing this line as well. No luck.
elem = driver.find_elements_by_xpath("//*[@href]")
driver.implicitly_wait(10)  # <-- I tried removing this line as well. No luck.
text_file = open("BatchLink.txt", "w")
print 'Fetching The Links, please wait.'
for elem in elem:
    x = elem.get_attribute("href")
    #print x
    text_file.write(x + '\n')
print 'Links have been fetched. Just doing the final cleaning now.'
text_file.close()

CleanFile = open("queue.txt", "w")
with open('BatchLink.txt') as f:
    mylist = f.read().splitlines()
#print mylist
with open('BatchLink.txt', 'r') as inF:
    for line in inF:
        if 'episode' in line:
            CleanFile.write(line)
print 'Please Check the file named queue.txt'
CleanFile.close()
os.remove('BatchLink.txt')
driver.close()
Here's a screenshot of the error (might be of some help):
http://i.imgur.com/SaANlsg.png
OK, I don't work with Python, but I know the problem.
You have a variable that you initialize: elem = driver.find_elements_by_xpath("//*[@href]")
After that, you do some things with it in a loop.
Before finishing the loop, try to initialize this variable again:
elem = driver.find_elements_by_xpath("//*[@href]")
The thing is that the DOM changes and you lose the element collection.
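In Python terms, a simpler variant of that idea is a sketch like the following (same XPath as the script above): copy the href strings out immediately, so the loop never touches a live element that can go stale.

elems = driver.find_elements_by_xpath("//*[@href]")
hrefs = [e.get_attribute("href") for e in elems]    # plain strings cannot go stale

text_file = open("BatchLink.txt", "w")
for href in hrefs:
    if href:
        text_file.write(href + '\n')
text_file.close()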