I just wrote a simple web-scraping script to give me all the episode links on a particular site's page. The script was working fine, but now it's broken, and I didn't change anything.
Try this URL (for scraping): http://www.crunchyroll.com/tabi-machi-late-show
Now the script runs part-way and then gives me an error stating: 'Element not found in the cache - perhaps the page has changed since it was looked up'.
I looked it up on the internet, and people suggested using the 'implicit wait' command at certain places. I did that, but still no luck.
UPDATE: I tried this script on a remote desktop and it works there without any problems.
Here's my script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import time
from subprocess import Popen
#------------------------------------------------
try:
    Link = raw_input("Please enter your Link : ")
    if not Link:
        raise ValueError('Please Enter A Link To The Anime Page. This Application Will now Exit in 5 Seconds.')
except ValueError as e:
    print(e)
    time.sleep(5)
    exit()
print 'Analyzing the Page. Hold on a minute.'
driver = webdriver.Firefox()
driver.get(Link)
assert "Crunchyroll" in driver.title
driver.implicitly_wait(5)  # <-- I tried removing this line as well. No luck.
elems = driver.find_elements_by_xpath("//*[@href]")
driver.implicitly_wait(10)  # <-- I tried removing this line as well. No luck.
text_file = open("BatchLink.txt", "w")
print 'Fetching The Links, please wait.'
for elem in elems:
    x = elem.get_attribute("href")
    #print x
    text_file.write(x + '\n')
print 'Links have been fetched. Just doing the final cleaning now.'
text_file.close()
CleanFile = open("queue.txt", "w")
with open('BatchLink.txt') as f:
    mylist = f.read().splitlines()
    #print mylist
with open('BatchLink.txt', 'r') as inF:
    for line in inF:
        if 'episode' in line:
            CleanFile.write(line)
print 'Please Check the file named queue.txt'
CleanFile.close()
os.remove('BatchLink.txt')
driver.close()
Here's a screenshot of the error (might be of some help):
http://i.imgur.com/SaANlsg.png
OK, I haven't worked with Python, but I know the problem.
You have a variable that you initialize with elem = driver.find_elements_by_xpath("//*[@href]") and then use inside a loop.
Before you finish the loop, try initializing that variable again:
elem = driver.find_elements_by_xpath("//*[@href]")
The thing is that the DOM changes, so you lose the element collection (the references go stale).
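For what it's worth, here is a minimal sketch of that idea (same URL as the question, but not the thread's exact code): copy the href values out of the elements immediately after the lookup, so a later DOM change cannot invalidate the references you are iterating over.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.crunchyroll.com/tabi-machi-late-show")
driver.implicitly_wait(10)

# Re-run the lookup right before using it, then copy the attribute values
# into plain strings straight away so stale references don't matter.
elements = driver.find_elements_by_xpath("//*[@href]")
hrefs = [element.get_attribute("href") for element in elements]

with open("BatchLink.txt", "w") as text_file:
    for href in hrefs:
        if href:  # some elements may report an empty href
            text_file.write(href + "\n")

driver.close()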
Related
I'm trying to search for some mail, but it does not work. My code:
import imaplib
import string

import config  # my own settings module with MAIL_ADDRESS / MAIL_PASSWORD

conn = imaplib.IMAP4_SSL('imap.qq.com', 993)
# login
try:
    conn.login(config.MAIL_ADDRESS, config.MAIL_PASSWORD)
except Exception as err:
    print 'connect fail: ', err
conn.select("inbox")
typ, data = conn.search(None, '(SUBJECT "test")')
print "UID list length is %i" % len(string.split(data[0]))
When I run this code it reports every mail in the inbox, so it seems the SUBJECT search does not filter anything. I do not know where it's wrong. I'm using Python 2.7.
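No fix is recorded in this thread, but as a hedged diagnostic sketch (placeholder credentials, not the asker's config module) it can help to check the status codes returned by select() and search(), and to try passing the criteria as separate arguments, which is the other form imaplib's search() accepts:
import imaplib

conn = imaplib.IMAP4_SSL('imap.qq.com', 993)
conn.login('user@example.com', 'password')  # placeholder credentials

# Check the status code instead of assuming the mailbox was selected.
typ, _ = conn.select("inbox")
print 'select status:', typ

# Equivalent search written with separate criteria arguments.
typ, data = conn.search(None, 'SUBJECT', '"test"')
print 'search status:', typ
print 'matching UIDs:', data[0].split()
print "UID list length is %i" % len(data[0].split())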
I have a spider that will run on a schedule. The spider input is date-based: from the date of the last scrape to today's date. So the question is: how do I save the date of the last scrape within the Scrapy project? There is an option to read data packaged with the project using the pkgutil module, but I did not find any reference in the docs on how to write data back to that file. Any idea? Maybe an alternative?
P.S. My other option is to use some free remote MySQL DB just for this, but that looks like more work if a simpler solution is available.
import json
import pkgutil

import scrapy


class CodeSpider(scrapy.Spider):
    name = "code"
    allowed_domains = ["google.com.au"]

    def start_requests(self):
        # Read the state bundled with the package...
        f = pkgutil.get_data("au_go", "res/state.json")
        ids = json.loads(f)
        id = ids[0]['state']
        yield {'state': id}

        # ...then try to write the new state back to the same file.
        ids[0]['state'] = 'New State'
        with open('./au_go/res/state.json', 'w') as f:
            json.dump(ids, f)
The above works fine when run locally, but I am getting 'no such file or directory' when running the code on Scrapinghub.
File "/tmp/unpacked-eggs/__main__.egg/au_go/spiders/test_state.py", line 33, in parse
with open(savePath, 'w') as f:
IOError: [Errno 2] No such file or directory: './au_go/res/state.json'
The problem is fixed by using Scrapinghub Collections and the scrapinghub API. It works nicely now. Here is some example code in case somebody finds it useful.
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('<YOUR_API_KEY>')
project = client.get_project('<YOUR_PROJECT_ID>')
collections = project.collections

last_accessed = collections.get_store('last_accessed')
last_accessed.set({'_key': 'Date', 'value': '12-54-1235'})
print last_accessed.get('Date')['value']
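A hedged sketch of how that collection could be wired into the spider itself: read the stored date in start_requests() and write today's date back when the spider closes. The API key, project ID, key name and URL pattern below are placeholders, not values from the question.
import datetime

import scrapy
from scrapinghub import ScrapinghubClient


class CodeSpider(scrapy.Spider):
    name = "code"
    allowed_domains = ["google.com.au"]

    def start_requests(self):
        client = ScrapinghubClient('<YOUR_API_KEY>')        # placeholder
        project = client.get_project('<YOUR_PROJECT_ID>')   # placeholder
        self.store = project.collections.get_store('last_accessed')
        try:
            last_run = self.store.get('last_run')['value']
        except Exception:
            # First run: the key has not been written yet (the exact
            # exception raised depends on the client version).
            last_run = '2000-01-01'
        # Hypothetical date-based start URL built from the stored value.
        url = 'https://www.google.com.au/search?q=after:%s' % last_run
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'scraped_from': response.url}

    def closed(self, reason):
        # Remember today's date for the next scheduled run.
        today = datetime.date.today().isoformat()
        self.store.set({'_key': 'last_run', 'value': today})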
I have a simple web.py app that reads a config file and serves two URL paths. However, I get two strange behaviors. One, changes made to data in main are not reflected in the results of GET. Two, main appears to run twice.
The desired behavior is that modifying data in main causes the GET methods to see the modified data, and that main does not re-run.
Questions:
1. What is really happening here, such that mydict is not modified in either GET?
2. Why is some code running twice?
3. Simplest path to desired behavior (most important)
4. Pythonic path to desired behavior (least important)
From pbuck (accepted answer): the answer for 3) is to replace
app = web.application(urls, globals())
with:
app = web.application(urls, globals(), autoreload=False)
Same behavior on Linux (CentOS 6, Python 2.6.6) and a MacBook (brew Python 2.7.12).
When started I get:
$ python ./foo.py 8080
Initializing mydict
Modifying mydict
http://0.0.0.0:8080/
When queried with:
wget http://localhost:8080/node/first/foo
wget http://localhost:8080/node/second/bar
Which results in (notice a second "Initializing mydict"):
Initializing mydict
firstClass.GET called with clobber foo
firstClass.GET somevalue is something static
127.0.0.1:52480 - - [17/Feb/2017 17:30:42] "HTTP/1.1 GET /node/first/foo" - 200 OK
secondClass.GET called with clobber bar
secondClass.GET somevalue is something static
127.0.0.1:52486 - - [17/Feb/2017 17:30:47] "HTTP/1.1 GET /node/second/bar" - 200 OK
Code:
#!/usr/bin/python
import web

urls = (
    '/node/first/(.*)', 'firstClass',
    '/node/second/(.*)', 'secondClass'
)

# Initialize web server, start it later at "app.run()"
#app = web.application(urls, globals())
# Running web.application in Main or above does not change behavior

# Static initialization of mydict
print "Initializing mydict"
mydict = {}
mydict['somevalue'] = "something static"

class firstClass:
    def GET(self, globarg):
        print "firstClass.GET called with clobber %s" % globarg
        print "firstClass.GET somevalue is %s" % mydict['somevalue']
        return mydict['somevalue']

class secondClass:
    def GET(self, globarg):
        print "secondClass.GET called with clobber %s" % globarg
        print "secondClass.GET somevalue is %s" % mydict['somevalue']
        return mydict['somevalue']

if __name__ == '__main__':
    app = web.application(urls, globals())
    # read configuration files for initializations here
    print "Modifying mydict"
    mydict['somevalue'] = "something dynamic"
    app.run()
Short answer: avoid using globals, as they don't do what you think they do, especially when you eventually deploy this under nginx / apache where there will (likely) be multiple processes running.
Longer answer
Why am I getting some code running twice?
Code that is global to app.py runs twice because it runs once, as it normally does, and then a second time within the web.application(urls, globals()) call. That call, given globals(), sets up module loading / re-loading, and part of that is re-loading all modules (including app.py). If you set autoreload=False in the web.application() call, it won't do that.
What is really happening here, that mydict is not modified in either GET?
mydict is getting set to 'something dynamic', but then being re-set to 'something static' on second load. Again, set autoreload=False and it will work as you expect.
Shortest path?
autoreload=False
Pythonic path?
Well, I wonder why you have mydict['somevalue'] = 'something static' at module level and mydict['somevalue'] = 'something dynamic' under '__main__' this way: why not just set it once under '__main__'?
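A hedged sketch of what that could look like, keeping one of the original handler classes and only reorganizing the startup code; autoreload=False is the fix from above, and the config-file reading step is just a placeholder comment.
#!/usr/bin/python
import web

urls = (
    '/node/first/(.*)', 'firstClass',
)

mydict = {}

class firstClass:
    def GET(self, globarg):
        return mydict['somevalue']

if __name__ == '__main__':
    # Set the value once, before the server starts; with autoreload=False
    # the module is not re-imported, so this assignment is not clobbered.
    mydict['somevalue'] = "something dynamic"   # read from a config file here
    app = web.application(urls, globals(), autoreload=False)
    app.run()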
I'm trying to implement simple_tokenize using the dictionary output from my previous code, but I get an error message. Any assistance with the following code would be much appreciated. I'm using Python 2.7 in Jupyter.
import csv

reader = csv.reader(open('data.csv'))
dictionary = {}
for row in reader:
    key = row[0]
    dictionary[key] = row[1:]
print dictionary
The above works pretty well, but the issue is with the following:
import re

words = dictionary
split_regex = r'\W+'

def simple_tokenize(string):
    for i in rows:
        word = words.split
        #pass
    print word
I get this error:
NameError Traceback (most recent call last)
<ipython-input-2-0d0e05fb1556> in <module>()
1 import re
2
----> 3 words = dictionary
4 split_regex = r'\W+'
5
NameError: name 'dictionary' is not defined
Variables are not saved between Jupyter sessions, unless you explicitly do so yourself. Thus, if you ran the first code section, then quit your Jupyter session, started a new Jupyter session and ran the second code block, dictionary is not preserved from the first session and will thus be undefined, as indicated by the error.
If you are running the above code blocks differently (e.g., not across separate Jupyter sessions), you should indicate this, but the tags and traceback suggest that this is what is happening.
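If the blocks really are run in separate sessions, one workaround (a minimal sketch, not part of the answer above; the file name dictionary.json is made up) is to persist the dictionary to disk at the end of the first session and reload it at the start of the second, e.g. with json:
import csv
import json

# Session 1: build the dictionary from data.csv and save it to disk.
reader = csv.reader(open('data.csv'))
dictionary = {}
for row in reader:
    dictionary[row[0]] = row[1:]
with open('dictionary.json', 'w') as f:
    json.dump(dictionary, f)

# Session 2 (a new Jupyter kernel): reload it before tokenizing.
with open('dictionary.json') as f:
    dictionary = json.load(f)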
I'm trying to execute the following code in Python 2.7 on Windows 7. The purpose of the code is to take a backup of the specified folder into a specified target folder, using the naming pattern given.
However, I'm not able to get it to work. The output is always 'Backup FAILED'.
Please advise on how I can resolve this and get the code working.
Thanks.
Code :
backup_ver1.py
import os
import time
import sys

sys.path.append('C:\Python27\GnuWin32\bin')

source = 'C:\New'
target_dir = 'E:\Backup'
target = target_dir + os.sep + time.strftime('%Y%m%d%H%M%S') + '.zip'
zip_command = "zip -qr {0} {1}".format(target, ''.join(source))

print('This is a program for backing up files')
print(zip_command)

if os.system(zip_command) == 0:
    print('Successful backup to', target)
else:
    print('Backup FAILED')
See if escaping the \'s helps:
source = 'C:\\New'
target_dir = 'E:\\Backup'
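For completeness, here is a sketch of all three path strings with the backslashes escaped (raw strings such as r'C:\New' behave the same way). Note in particular that the original 'C:\Python27\GnuWin32\bin' contains \b, which Python interprets as a backspace escape.
import os
import sys
import time

# Escaped versions of the paths from the question.
sys.path.append('C:\\Python27\\GnuWin32\\bin')  # note: sys.path only affects imports,
                                                # not where os.system() looks for zip.exe
source = 'C:\\New'
target_dir = 'E:\\Backup'

target = target_dir + os.sep + time.strftime('%Y%m%d%H%M%S') + '.zip'
zip_command = "zip -qr {0} {1}".format(target, source)
print(zip_command)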