The following code produces the desired result, but the root of my issue stems from the process slowing down with each append. In the first 100 pages or so, it runs at less than one second per page, but by the 500s it gets up to 3 seconds and into the 1000s it runs at about 5 seconds per append. Are there suggestions for how to make this more efficient or explanations for why this is just the way things are?
import lxml
from lxml import html
import itertools
import datetime
l = []
for pageno in itertools.count(start=1):
    time = datetime.datetime.now()
    url = 'http://example.com/'
    parse = lxml.html.parse(url)
    try:
        # Drop <center> blocks and free the siblings already processed
        for x in parse.xpath('//center'):
            x.getparent().remove(x)
            x.clear()
            while x.getprevious() is not None:
                del x.getparent()[0]
        # Collect one row per <tr class="rt">, then free it
        for n in parse.xpath('//tr[@class="rt"]'):
            l.append([n.find('td/a').text.encode('utf8').decode('utf8').strip(),
                      n.find('td/form/p/a').text.encode('utf8').decode('utf8').strip(),
                      n.find('td/form/p/a').attrib['title'].encode('utf8').decode('utf8').strip()]
                     + [c.text.encode('utf8').decode('utf8').strip() for c in n if c.text.strip() != ''])
            n.clear()
            while n.getprevious() is not None:
                del n.getparent()[0]
    except Exception:
        print 'Page ' + str(pageno) + ' Does Not Exist'
    print '{0} Pages Complete: {1}'.format(pageno, datetime.datetime.now() - time)
I have tried a number of solutions, such as disabling the garbage collector, writing one list as a row to file instead of appending to a large list, etc. I look forward to learning more from potential suggestions/answers.
I am struggling to find an efficient way of retrieving the solution to an optimization problem. The solution consists of around 200K variables that I would like in a pandas DataFrame. After searching online, the only approach I found for accessing the variables was a for loop that looks something like this:
instance = M.create_instance('input.dat') # reading in a datafile
results = opt.solve(instance, tee=True)
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print ("Variable",v)
    varobject = getattr(instance, str(v))
    for index in varobject:
        print (" ",index, varobject[index].value)
I know I can use this for loop to store them in a dataframe but this is pretty inefficient.
I found out how to access the indexes by using
import pandas as pd
index = pd.DataFrame(instance.component_objects(Var, active=True))
But I don't know how to get the solution values.
There is actually a very simple and elegant solution, using the method pandas.DataFrame.from_dict combined with the Var.extract_values() method.
from pyomo.environ import *
import pandas as pd
m = ConcreteModel()
m.N = RangeSet(5)
m.x = Var(m.N, rule=lambda _, el: el**2) # x = [1,4,9,16,25]
df = pd.DataFrame.from_dict(m.x.extract_values(), orient='index', columns=[str(m.x)])
print(df)
yields
    x
1   1
2   4
3   9
4  16
5  25
Note that for Var we can use either get_values() or extract_values(); they appear to do the same thing. For Param there is only extract_values().
Of course, you can use instance.some_var.pprint() to print it on the screen.
But if you have a variable indexed by a large set, you can also write it to a
separate file. The following code writes the result to a .txt file:
f = open('Result.txt', 'a')
instance.some_var.pprint(f)
f.close()
I had the same issue as Jasper and tried the suggested solutions. In doing so I noticed that the part that writes the results takes the most time. Maybe this is also true in Jasper's case.
results.write()
instance.solutions.load_from(results)
So I suggest suppressing these two lines if you can. Maybe someone has a suggestion for how to speed this up, or an alternative method?
Also, I saw that in this post (Pyomo: Save results to CSV files) the "for loop" method is recommended. A Pyomo developer states: "I think it's possible in option 2 for the indices and the variable slice to be iterated over in a different order which would invalidate your resulting array."
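One way to sidestep that ordering concern is to read each index together with its value in a single pass, so the two can never drift apart. Here is a minimal sketch, assuming the solution is available as a plain dict mapping index to value (which is what extract_values() returns). The dict contents and the rows_from_values helper are made up for illustration:

```python
import csv
import io

def rows_from_values(values):
    """Return (index, value) rows sorted by index, read in one pass."""
    # Each key stays paired with its own value, so the iteration order
    # of the dict cannot mismatch indices and values.
    return sorted(values.items())

# Stand-in for something like instance.some_var.extract_values()
values = {3: 9.0, 1: 1.0, 2: 4.0}
rows = rows_from_values(values)

# Write to an in-memory buffer; a real file opened with newline='' works the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["index", "value"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The same list of rows could also be handed to pandas if a DataFrame is preferred over a CSV file.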
For simplicity of code and to largely avoid for-loops, I found the pyomoio module in the urbs project, which has taken over the slightly deprecated code of pandaspyomo.py. It relies on each pyomo object's iteritem() method, and handles multiple dimensions elegantly. It can extract sets, parameters, variables as pandas objects.
If I set up a small pyomo model
from pyomo.environ import *
import pyomoio as po
import pandas as pd
# Define a model with 200k values
m = ConcreteModel()
m.ix = RangeSet(200000)
def idem(model, i):
    return i
m.a = Param(m.ix, rule=idem)
I can read in the parameter with just one line of code
%%timeit
a_po = po.get_entity(m, 'a')
# 110 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, if I compare it to the approach in the original question, it is not faster, even a little slower:
%%timeit
val = []
ix = []
varobject = getattr(m, 'a')
for index in varobject:
    ix.append(index)
    val.append(varobject[index])
a = pd.Series(index=ix, data=val)
# 92.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I edited and saved a text file, "fullname" on my Windows 7 computer.
I ran the following two lines of code immediately after saving the edits to "fullname", and I expected both of the following lines of code to return almost the same number of seconds since the epoch:
print str(os.path.getmtime(fullname))
print str(time.mktime(t.timetuple()))
The second line of code was borrowed from How to convert a Python datetime object to seconds
The results were not even close:
"1494082110.0"
"1319180400.0"
I would like to know why the results were not close.
My ultimate goal is to generate a float timestamp matching a calendar date of my choosing, for use in the context of:
win32file.SetFileTime(handle, CreatedTime , AccessTime , WrittenTime )
Any help in understanding these issues would be much appreciated.
You need to compare the current time with the time at which you saved the file. In this code I save a file, then I get the current time in t and display it, then I get the modification time for the file and display that. Note that the two times differ by only about 15 seconds.
>>> import datetime
>>> import time
>>> import os
>>> fullname = 'temp.txt'
>>> open('temp.txt', 'w').write('something')
9
>>> t = datetime.datetime.now()
>>> time.mktime(t.timetuple())
1502039202.0
>>> os.path.getmtime(fullname)
1502039187.4629886
I notice too that,
>>> datetime.datetime.fromtimestamp(1319180400)
datetime.datetime(2011, 10, 21, 3, 0)
In other words, the second number in your question corresponds to a date in 2011, long before you posted your question.
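For the goal of producing a float timestamp for a calendar date of your choosing, the same mktime call works on any datetime you construct yourself. This is a minimal sketch using only the standard library; the date is an arbitrary example, and os.utime on a scratch file stands in for win32file.SetFileTime, which is not exercised here. Note that os.utime sets only the access and modification times, not the creation time.

```python
import datetime
import os
import tempfile
import time

# Hypothetical calendar date chosen for illustration (interpreted as local time)
chosen = datetime.datetime(2017, 5, 6, 12, 0, 0)

# Convert the local datetime to seconds since the epoch
timestamp = time.mktime(chosen.timetuple())

# Round trip: the float converts back to the same calendar date
round_trip = datetime.datetime.fromtimestamp(timestamp)

# Apply the timestamp to a scratch file as its access and modification time
fd, path = tempfile.mkstemp()
os.close(fd)
os.utime(path, (timestamp, timestamp))
mtime = os.path.getmtime(path)
os.remove(path)
```

After this runs, mtime should match timestamp to within the file system's time resolution.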
I am trying to code a program in Python 2.7.9 to crawl and gather the club names, addresses and phone numbers from the website http://tennishub.co.uk/
The following code gets the job done, except that it doesn't move on to the subsequent pages for each location, such as
/Berkshire/1
/Berkshire/2
/Berkshire/3
..and so on.
import requests
from bs4 import BeautifulSoup
def tennis_club():
    url = 'http://tennishub.co.uk/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for link in soup.select('div.countylist a'):
        href = 'http://tennishub.co.uk' + link.get('href')
        pages_data(href)
def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text)
    g_data = soup.select('table.display-table')
    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[3].find_all('td',{'class':'telrow'})[0].text
        except:
            pass
        try:
            print item.contents[5].findAll('td',{'class':'emailrow'})[0].text
        except:
            pass
        print item_url
tennis_club()
I have tried tweaking the code to the best of my understanding but it doesn't work at all.
Can someone please advise what I need to do so that the program goes through all the pages of a location, collects the data, then moves on to the next location, and so on?
You are going to need to put another for loop into this code:
for link in soup.select('div.countylist a'):
    href = 'http://tennishub.co.uk' + link.get('href')
    # new for loop goes here #
    pages_data(href)
If you want to brute force it, you can just have the for loop run as many times as there are pages for the area with the most clubs (Surrey); however, you would double, triple, quadruple, etc. count the last clubs for many of the areas. This is ugly, but you can get away with it if you are using a database where you don't insert duplicates. It is unacceptable, however, if you are writing to a file. In that case you will need to pull the number in parentheses after the area name, e.g. Berkshire (39). To get that number you can do a get_text() on the div.countylist, which would change the above to
for link in soup.select('div.countylist'):
    for endHref in link.find_all('a'):
        # assumes the count, e.g. " (39)", is the text node right after the link
        numClubs = int(endHref.next_sibling.strip().strip('()'))
        endHrefNum = numClubs // 10 + 1  # add one because // gives the floor
        for page in range(1, endHrefNum + 1):
            href = 'http://tennishub.co.uk' + endHref.get('href') + '/' + str(page)
            pages_data(href)
(Disclaimer: I didn't run this through bs4, so there might be syntax errors, and you might need something other than .next_sibling to reach the count, but the logic should help you.)
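The page arithmetic can be isolated in a small helper that is easy to check without touching the site. This is a minimal sketch, assuming 10 clubs per page and page numbers starting at 1 (both inferred from this thread, not verified against the site); page_urls is a hypothetical name:

```python
import re

def page_urls(base, href, count_text, per_page=10):
    """Build one URL per results page for a county link."""
    # Pull the first run of digits out of text such as " (39)"
    match = re.search(r'\d+', count_text)
    count = int(match.group()) if match else 0
    # Ceiling division: 39 clubs at 10 per page gives 4 pages
    pages = max(1, -(-count // per_page))
    return ['%s%s/%d' % (base, href, page) for page in range(1, pages + 1)]

urls = page_urls('http://tennishub.co.uk', '/Berkshire', ' (39)')
```

Unlike floor division plus one, ceiling division does not request an extra page when the count is an exact multiple of the page size.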
I have been collecting tweets related to "lung cancer" from the past 7 days. Yesterday I realized I needed more fields, so I added some and started re-collecting tweets for the same period. The problem is that the first time, I collected ~2000 tweets related to lung cancer for 18 Sept 2014, but last night the same query returned only ~300 tweets. When I looked at the timestamps in this new set, it only contains tweets from about 23:29 to 23:59 on 18 Sept 2014. A large chunk of data is obviously missing. I don't think the cause is my code (below): I have tested various changes, including deleting most of the fields to be collected, and the data is still cut off prematurely.
Is this a known issue with the Twitter API (when collecting the last 7 days' data)? If so, it would be pretty horrible for anyone trying to do serious research. Or is it something in my code that caused this (note: it runs perfectly fine for other previous/subsequent dates)?
import tweepy
import time
import csv
ckey = ""
csecret = ""
atoken = ""
asecret = ""
OAUTH_KEYS = {'consumer_key':ckey, 'consumer_secret':csecret,
'access_token_key':atoken, 'access_token_secret':asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)
# Stream the first "xxx" tweets related to "car", then filter out the ones without geo-enabled
# Reference of search (q) operator: https://dev.twitter.com/rest/public/search
# Common parameters: Changeable only here
startSince = '2014-09-18'
endUntil = '2014-09-20'
suffix = '_18SEP2014.csv'
############################
### Lung cancer starts #####
searchTerms2 = '"lung cancer" OR "lung cancers" OR "lungcancer" OR "lungcancers" OR \
"lung tumor" OR "lungtumor" OR "lung tumors" OR "lungtumors" OR "lung neoplasm"'
# Items from 0 to 500,000 (which *should* cover all tweets)
# Increase by 4,000 for each cycle (because 5000-6000 is over the Twitter rate limit)
# Then wait for 20 min before next request (because twitter request wait time is 15min)
counter2 = 0
for tweet in tweepy.Cursor(api.search, q=searchTerms2,
                           since=startSince, until=endUntil).items(999999999): # changeable here
    try:
        '''
        print "Name:", tweet.author.name.encode('utf8')
        print "Screen-name:", tweet.author.screen_name.encode('utf8')
        print "Tweet created:", tweet.created_at'''
        placeHolder = []
        placeHolder.append(tweet.author.name.encode('utf8'))
        placeHolder.append(tweet.author.screen_name.encode('utf8'))
        placeHolder.append(tweet.created_at)
        prefix = 'TweetData_lungCancer'
        wholeFileName = prefix + suffix
        with open(wholeFileName, "ab") as f: # changeable here
            writeFile = csv.writer(f)
            writeFile.writerow(placeHolder)
        counter2 += 1
        if counter2 == 4000:
            time.sleep(60*20) # wait for 20 min everytime 4,000 tweets are extracted
            counter2 = 0
            continue
    except tweepy.TweepError:
        time.sleep(60*20)
        continue
    except IOError:
        time.sleep(60*2.5)
        continue
    except StopIteration:
        break
Update:
I have since tried running the same Python script on a different computer (faster and more powerful than my home laptop), and it returned the expected number of tweets. I don't know why this happens, as my home laptop works fine for many programs, but I think we can rule out problems with the script or the Twitter API.
If you want to collect more data, I would highly recommend the streaming API that Tweepy offers. It has a much higher rate limit; in fact, I was able to collect 500,000 tweets in just one day.
Also, your rate limit checking is not very robust: you don't know for sure that Twitter will allow you to access 4000 tweets. From experience, I found that the more often you hit the rate limit, the fewer tweets you are allowed and the longer you have to wait.
I would recommend using:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
so that your application will not exceed the rate limit. Alternatively, you can check what you have used with:
print (api.rate_limit_status())
and then you can just sleep the thread like you have done.
Also, your end date is incorrect. The end date should be '2014-09-21', one day after today's date, whatever that is.
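Rather than hard-coding the dates, the search window can be derived from the current day. This is a minimal sketch using only the standard library; search_window is a hypothetical helper, and the 7-day default simply mirrors the "past 7 days" mentioned in the question:

```python
import datetime

def search_window(today, days_back=7):
    """Return (since, until) strings in YYYY-MM-DD form."""
    since = today - datetime.timedelta(days=days_back)
    until = today + datetime.timedelta(days=1)  # one day past today
    return since.isoformat(), until.isoformat()

# Fixed date used here so the result is reproducible; in practice
# datetime.date.today() would be passed instead.
since, until = search_window(datetime.date(2014, 9, 20))
```

The two strings can then be passed as the since and until arguments of the search call.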
I have written a python 2.7 script to retrieve all my historical data from Xively.
Originally I wrote it in C#, and it works perfectly.
I am limiting the request to 6 hour blocks, to retrieve all stored data.
My version in Python is as follows:
requestString = 'http://api.xively.com/v2/feeds/41189/datastreams/0001.csv?key=YcfzZVxtXxxxxxxxxxxORnVu_dMQ&start=' + requestDate + '&duration=6hours&interval=0&per_page=1000'
response = urllib2.urlopen(requestString).read()
The request date is in the correct format; I compared the full C# requestString with the Python one.
Using the above request, I only get 101 lines of data, which equates to a few minutes of results.
My suspicion is that the problem is the .read() function: it returns about 34k characters, which is far less than the C# version. I tried passing 100000 as an argument to the read function, but the result did not change.
Here is another solution, also written in Python 2.7.
In my case I retrieve the data in 30-minute blocks, because many sensors send values every minute and the Xively API limits a request to half an hour of data at that frequency.
This is the general module:
for day in datespan(start_datetime, end_datetime, deltatime): # step from start_datetime to end_datetime in deltatime increments
    while True: # keep retrying until the data is retrieved
        try:
            response = urllib2.urlopen('https://api.xively.com/v2/feeds/'+str(feed)+'.csv?key='+apikey_xively+'&start='+ day.strftime("%Y-%m-%dT%H:%M:%SZ")+'&interval='+str(interval)+'&duration='+duration) # get data
            break
        except Exception:
            time.sleep(0.3) # wait briefly, then try again
    cr = csv.reader(response) # return data in columns
    print '.'
    for row in cr:
        if row[0] in id: # choose desired data
            f.write(row[0]+","+row[1]+","+row[2]+"\n") # write "id,timestamp,value"
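The loop relies on a datespan helper that is not shown in this excerpt. Below is a minimal sketch of what such a generator might look like, assuming it yields the start of each block from the start time up to, but not including, the end time; the linked script may define it differently:

```python
import datetime

def datespan(start, end, delta):
    """Yield datetimes from start (inclusive) to end (exclusive) in steps of delta."""
    current = start
    while current < end:
        yield current
        current += delta

# Example: 30-minute blocks over a two-hour window
blocks = list(datespan(datetime.datetime(2014, 1, 1, 0, 0),
                       datetime.datetime(2014, 1, 1, 2, 0),
                       datetime.timedelta(minutes=30)))
stamps = [b.strftime("%Y-%m-%dT%H:%M:%SZ") for b in blocks]
```

Each entry in stamps has the same format as the start parameter built in the request above.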
You can find the full script here: https://github.com/CarlosRufo/scripts/blob/master/python/retrievalDataXively.py
Hope this helps; I'd be delighted to answer any questions :)