Xively read data in Python - python-2.7

I have written a python 2.7 script to retrieve all my historical data from Xively.
Originally I wrote it in C#, and it works perfectly.
I am limiting the request to 6 hour blocks, to retrieve all stored data.
My version in Python is as follows:
requestString = 'http://api.xively.com/v2/feeds/41189/datastreams/0001.csv?key=YcfzZVxtXxxxxxxxxxxORnVu_dMQ&start=' + requestDate + '&duration=6hours&interval=0&per_page=1000'
response = urllib2.urlopen(requestString).read()
The request date is in the correct format; I compared the full C# requestString against the Python one.
Using the above request, I only get 101 lines of data, which equates to a few minutes of results.
My suspicion is that it is the .read() function: it returns about 34k characters, which is far less than the C# version. I tried passing 100000 as an argument to read(), but the result did not change.
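For reference, the surrounding loop looks roughly like this (a minimal Python 2.7 sketch that prints how many CSV rows each 6-hour block returns; the feed, datastream, and key are from the request above, and the date range is only illustrative):
import urllib2
import datetime

start = datetime.datetime(2014, 1, 1)   # illustrative range
end = datetime.datetime(2014, 1, 2)
block = datetime.timedelta(hours=6)

while start < end:
    requestDate = start.strftime('%Y-%m-%dT%H:%M:%SZ')
    requestString = ('http://api.xively.com/v2/feeds/41189/datastreams/0001.csv'
                     '?key=YcfzZVxtXxxxxxxxxxxORnVu_dMQ&start=' + requestDate +
                     '&duration=6hours&interval=0&per_page=1000')
    response = urllib2.urlopen(requestString).read()
    print requestDate, len(response.splitlines()), 'rows'
    start += block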

Here is another solution, also written in Python 2.7.
In my case I retrieved data in 30-minute blocks, because many sensors sent values every minute and the Xively API limits a single request to half an hour of data at that send frequency.
This is the general module:
for day in datespan(start_datetime, end_datetime, deltatime):  # loop, adding deltatime to start_datetime until the end is reached
    while True:  # keep retrying until the data is retrieved correctly
        try:
            response = urllib2.urlopen('https://api.xively.com/v2/feeds/' + str(feed) + '.csv?key=' + apikey_xively + '&start=' + day.strftime("%Y-%m-%dT%H:%M:%SZ") + '&interval=' + str(interval) + '&duration=' + duration)  # get data
            break
        except:
            time.sleep(0.3)  # wait briefly, then try again
    cr = csv.reader(response)  # return data in columns
    print '.'
    for row in cr:
        if row[0] in id:  # choose desired data
            f.write(row[0] + "," + row[1] + "," + row[2] + "\n")  # write "id,timestamp,value"
You can find the full script here: https://github.com/CarlosRufo/scripts/blob/master/python/retrievalDataXively.py
Hope this helps; I'm happy to answer any questions :)
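For reference, a datespan() generator along these lines would drive the loop above; the real one is in the linked script, so treat this as just a sketch:
import datetime

def datespan(start_datetime, end_datetime, deltatime):
    # yield start_datetime, start_datetime + deltatime, ... up to end_datetime
    current = start_datetime
    while current < end_datetime:
        yield current
        current += deltatime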

Related

My fuzzy logic code is taking too long and not giving result

I have a task to remove duplicate records from master data of approximately 80K data points.
The duplicate identification can only be done using the 'description' field.
I have tried fuzzy matching with both the fuzzywuzzy and thefuzz libraries. It has been more than 18 hours and the code is still running.
I have upgraded my instance size on the cloud platform, but that is not helping either.
Below is the code.
I am looking for ways to get the result more quickly.
for index, row in df.iterrows():
    value = row[column_name]
    if any(fuzz.token_set_ratio(value, x) > 90 for x in unique_values):
        removed_duplicates = removed_duplicates.append(row)
        df = df.drop(index)
    else:
        # if it is not, add the value to the unique value list
        unique_values.append(value)

# save the modified dataframe to a new excel file
df.to_excel("file_without_duplicates_after_fuzzyV23.xlsx", index=False)
removed_duplicates.to_excel("removed_duplicatesV23.xlsx", index=False)
What I have tried: fuzzywuzzy, thefuzz, upgrading the instance on the cloud platform, and patience.
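Not a full answer, but a sketch of the same threshold logic restructured under two assumptions: that the rapidfuzz package is available as a faster drop-in for thefuzz (its process.extractOne accepts a score_cutoff), and that it is acceptable to drop all duplicate rows in one call at the end rather than appending to a DataFrame inside the loop, which copies the frame on every append:
from rapidfuzz import fuzz, process

def dedupe_by_description(df, column_name="description", threshold=90):
    unique_values = []   # descriptions we have decided to keep
    duplicate_idx = []   # index labels of rows judged to be duplicates
    for index, row in df.iterrows():
        value = row[column_name]
        match = None
        if unique_values:
            match = process.extractOne(value, unique_values,
                                       scorer=fuzz.token_set_ratio,
                                       score_cutoff=threshold)
        if match is not None:    # some kept description scored at or above the threshold
            duplicate_idx.append(index)
        else:
            unique_values.append(value)
    removed_duplicates = df.loc[duplicate_idx]
    deduped = df.drop(duplicate_idx)
    return deduped, removed_duplicates
This is still quadratic in the number of unique descriptions, but each comparison runs in compiled code and the DataFrame is only copied once at the end instead of on every duplicate.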

Yahoo Finance API .get_historical() not working python

So I recently downloaded the yahoo_finance API, version 1.4.0. I got it a few days ago, and .get_historical() was working fine. Now, however, it doesn't. Here's what it's doing:
import yahoo_finance as yf
apple=yf.Share('AAPL')
apple_price=apple.get_price()
print apple.get_historical('2016-02-15', '2016-04-29')
The error I get is: YQLResponseMalformedError: Response malformed. Is there a bug in the API or am I forgetting something?
Unfortunately, the Yahoo stock price API, which a lot of modules are based on, doesn't work anymore.
As an alternative, you could use Google's API:
https://www.google.com/finance/getprices?q=1101&x=TPE&i=86400&p=3d&f=d,c,h,l,o,v
q=1101 is the stock quote
x=TPE is the exchange (List of Exchanges here: https://www.google.com/googlefinance/disclaimer/ )
i=86400 interval in seconds (86400 sec = 1 day)
p=3d data since how long ago
f= fields of data (d=date, c=close, h=high, l=low, o=open, v=volume)
Data would look like this:
EXCHANGE%3DTPE
MARKET_OPEN_MINUTE=540
MARKET_CLOSE_MINUTE=810
INTERVAL=86400
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=480
a1496295000,24.4,24.75,24.35,24.75,11782000
1,24.5,24.5,24.3,24.4,10747000
a1496295000 is the Unix timestamp of the first row of data
in the second row, 1 is the interval offset from the first row (an offset of one day)
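For illustration, here is a minimal sketch (Python 2.7, urllib2) of parsing that response format, assuming the endpoint still responds and returns data shaped like the sample above:
import urllib2
import datetime

url = ('https://www.google.com/finance/getprices'
       '?q=1101&x=TPE&i=86400&p=3d&f=d,c,h,l,o,v')
lines = urllib2.urlopen(url).read().splitlines()

interval = 86400   # overwritten by the INTERVAL= header if present
base_ts = None
for line in lines:
    if line.startswith('INTERVAL='):
        interval = int(line.split('=')[1])
    elif line.startswith('a') or line[:1].isdigit():
        cols = line.split(',')
        if cols[0].startswith('a'):              # absolute Unix timestamp row
            base_ts = int(cols[0][1:])
            ts = base_ts
        else:                                    # offset row: n intervals after the base
            ts = base_ts + int(cols[0]) * interval
        date = datetime.datetime.utcfromtimestamp(ts)
        close, high, low, opn, volume = cols[1:6]
        print date.date(), close, high, low, opn, volume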

Looping in Python Slowing Down with each Iteration

The following code produces the desired result, but the root of my issue stems from the process slowing down with each append. In the first 100 pages or so, it runs at less than one second per page, but by the 500s it gets up to 3 seconds and into the 1000s it runs at about 5 seconds per append. Are there suggestions for how to make this more efficient or explanations for why this is just the way things are?
import lxml
from lxml import html
import itertools
import datetime

l = []
for pageno in itertools.count(start=1):
    time = datetime.datetime.now()
    url = 'http://example.com/'
    parse = lxml.html.parse(url)
    try:
        for x in parse.xpath('//center'):
            x.getparent().remove(x)
            x.clear()
            while x.getprevious() is not None:
                del x.getparent()[0]
        for n in parse.xpath('//tr[@class="rt"]'):
            l.append([n.find('td/a').text.encode('utf8').decode('utf8').strip(),
                      n.find('td/form/p/a').text.encode('utf8').decode('utf8').strip(),
                      n.find('td/form/p/a').attrib['title'].encode('utf8').decode('utf8').strip()]
                     + [c.text.encode('utf8').decode('utf8').strip() for c in n if c.text.strip() != ''])
            n.clear()
            while n.getprevious() is not None:
                del n.getparent()[0]
    except:
        print 'Page ' + str(pageno) + ' Does Not Exist'
    print '{0} Pages Complete: {1}'.format(pageno, datetime.datetime.now() - time)
I have tried a number of solutions, such as disabling the garbage collector, writing one list as a row to file instead of appending to a large list, etc. I look forward to learning more from potential suggestions/answers.
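One way to narrow this down is to time each phase separately per page, to see whether the parse, the XPath extraction, or the appending is what grows; here is a minimal sketch (the URL and XPath are the placeholders from the code above):
import datetime
import lxml.html

def timed_page(url):
    t0 = datetime.datetime.now()
    tree = lxml.html.parse(url)            # fetch and parse the page
    t1 = datetime.datetime.now()
    rows = tree.xpath('//tr[@class="rt"]') # extract the rows of interest
    t2 = datetime.datetime.now()
    print 'parse: {0}  xpath: {1}  rows: {2}'.format(t1 - t0, t2 - t1, len(rows))
    return rows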

Paginating with Python 2.7.9 Web Crawler

I am trying to code a program in Python 2.7.9 to crawl and gather the club names, addresses and phone numbers from the website http://tennishub.co.uk/
The following code gets the job done, except that it doesn't move on to the subsequent pages for each location, such as
/Berkshire/1
/Berkshire/2
/Berkshire/3
...and so on.
import requests
from bs4 import BeautifulSoup

def tennis_club():
    url = 'http://tennishub.co.uk/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    for link in soup.select('div.countylist a'):
        href = 'http://tennishub.co.uk' + link.get('href')
        pages_data(href)

def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text)
    g_data = soup.select('table.display-table')
    for item in g_data:
        print item.contents[1].text
        print item.contents[3].findAll('td')[1].text
        try:
            print item.contents[3].find_all('td', {'class': 'telrow'})[0].text
        except:
            pass
        try:
            print item.contents[5].findAll('td', {'class': 'emailrow'})[0].text
        except:
            pass
        print item_url

tennis_club()
I have tried tweaking the code to the best of my understanding, but it doesn't work at all.
Can someone please advise what I need to do so that the program goes through all the pages of a location, collects the data, and then moves on to the next location, and so on?
You are going to need to put another for loop into this code:
for link in soup.select('div.countylist a'):
    href = 'http://tennishub.co.uk' + link.get('href')
    # new for loop goes here #
    pages_data(href)
If you want to brute-force it, you can just have the for loop run as many times as the area with the most clubs (Surrey), but then you would double-, triple-, or quadruple-count the last clubs for many of the areas. This is ugly, but you can get away with it if you are using a database that won't insert duplicates; it is unacceptable if you are writing to a file. In that case you will need to pull the number in parentheses after the area, e.g. Berkshire (39). To get that number you can call get_text() on the div.countylist, which would change the above to:
for link in soup.select('div.countylist'):
    for endHref in link.find_all('a'):
        numClubs = endHref.next                 # text after the link, e.g. ' (39)'
        numClubs = int(numClubs.strip(' ()'))   # clean up spaces and parens
        lastPage = numClubs // 10 + 1           # add one because // gives the floor
        for pageNum in range(1, lastPage + 1):
            href = 'http://tennishub.co.uk' + endHref.get('href') + '/' + str(pageNum)
            pages_data(href)
(Disclaimer: I didn't run this through bs4, so there might be syntax errors, and you might need to use something other than .next, but the logic should help you.)
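An alternative sketch, assuming (unverified) that the site simply returns no table.display-table entries once you go past the last page for an area, so you can keep requesting /Area/1, /Area/2, ... until a page comes back empty:
import itertools
import requests
from bs4 import BeautifulSoup

def crawl_area(area_href):                # e.g. '/Berkshire', taken from div.countylist
    for page in itertools.count(start=1):
        r = requests.get('http://tennishub.co.uk' + area_href + '/' + str(page))
        soup = BeautifulSoup(r.text)
        tables = soup.select('table.display-table')
        if not tables:                    # no more clubs listed: stop paginating
            break
        for item in tables:
            print item.contents[1].text   # club name, as in pages_data() above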

Missing Tweets from Twitter API (using Tweepy)?

I have been collecting tweets for the past week, gathering the past-7-days tweets related to "lung cancer". Yesterday I figured I needed to start collecting more fields, so I added some fields and started re-collecting the same period of tweets related to "lung cancer" from last week. The problem is that the first time, I collected ~2,000 tweets related to lung cancer on 18 Sept 2014, but last night it only gave ~300 tweets. When I looked at the times of the tweets in this new set, it only covers something like 23:29 to 23:59 on 18 Sept 2014; a large chunk of data is obviously missing. I don't think it's something in my code (below); I have tested various approaches, including deleting most of the fields to be collected, and the data is still cut off prematurely.
Is this a known issue with the Twitter API when collecting the last 7 days' data? If so, it would be pretty horrible for anyone trying to do serious research. Or is it something in my code that caused this? (Note: it runs perfectly fine for other previous/subsequent dates.)
import tweepy
import time
import csv

ckey = ""
csecret = ""
atoken = ""
asecret = ""

OAUTH_KEYS = {'consumer_key': ckey, 'consumer_secret': csecret,
              'access_token_key': atoken, 'access_token_secret': asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)

# Stream the first "xxx" tweets related to "car", then filter out the ones without geo enabled
# Reference for the search (q) operator: https://dev.twitter.com/rest/public/search
# Common parameters: changeable only here
startSince = '2014-09-18'
endUntil = '2014-09-20'
suffix = '_18SEP2014.csv'

############################
### Lung cancer starts #####
searchTerms2 = '"lung cancer" OR "lung cancers" OR "lungcancer" OR "lungcancers" OR \
"lung tumor" OR "lungtumor" OR "lung tumors" OR "lungtumors" OR "lung neoplasm"'

# Items from 0 to 500,000 (which *should* cover all tweets)
# Increase by 4,000 for each cycle (because 5000-6000 is over the Twitter rate limit)
# Then wait for 20 min before the next request (because the Twitter rate-limit window is 15 min)
counter2 = 0
for tweet in tweepy.Cursor(api.search, q=searchTerms2,
                           since=startSince, until=endUntil).items(999999999):  # changeable here
    try:
        '''
        print "Name:", tweet.author.name.encode('utf8')
        print "Screen-name:", tweet.author.screen_name.encode('utf8')
        print "Tweet created:", tweet.created_at'''
        placeHolder = []
        placeHolder.append(tweet.author.name.encode('utf8'))
        placeHolder.append(tweet.author.screen_name.encode('utf8'))
        placeHolder.append(tweet.created_at)
        prefix = 'TweetData_lungCancer'
        wholeFileName = prefix + suffix
        with open(wholeFileName, "ab") as f:  # changeable here
            writeFile = csv.writer(f)
            writeFile.writerow(placeHolder)
        counter2 += 1
        if counter2 == 4000:
            time.sleep(60 * 20)  # wait for 20 min every time 4,000 tweets are extracted
            counter2 = 0
        continue
    except tweepy.TweepError:
        time.sleep(60 * 20)
        continue
    except IOError:
        time.sleep(60 * 2.5)
        continue
    except StopIteration:
        break
Update:
I have since tried running the same Python script on a different computer (which is faster and more powerful than my home laptop), and it produced the expected number of tweets. I don't know why this happens, as my home laptop works fine for many other programs, but I think we can rest the case and rule out potential issues with the script or the Twitter API.
If you want to collect more data, I would highly recommend the Streaming API that Tweepy has to offer. It has a much higher rate limit; in fact, I was able to collect 500,000 tweets in just one day.
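For illustration, here is a minimal sketch of that approach, assuming an older Tweepy release (3.x) that still provides StreamListener; the credential names and the track terms mirror the question above, and the listener class name is hypothetical:
import tweepy

ckey, csecret, atoken, asecret = "", "", "", ""   # fill in your credentials, as in the question

class LungCancerListener(tweepy.StreamListener):  # hypothetical listener name
    def on_status(self, status):
        # print each incoming tweet; in practice you would write it to CSV instead
        print status.author.screen_name, status.created_at, status.text.encode('utf8')

    def on_error(self, status_code):
        if status_code == 420:  # being rate limited: disconnect and back off
            return False

auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
stream = tweepy.Stream(auth=auth, listener=LungCancerListener())
stream.filter(track=['lung cancer', 'lung tumor'])
Note that streaming collects tweets as they happen rather than searching back over the past week, so it complements rather than replaces the search-based collection above.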
Also, your rate-limit handling is not very robust: you don't know for sure that Twitter will allow you to access 4,000 tweets. From experience, I have found that the more often you hit the rate limit, the fewer tweets you are allowed and the longer you have to wait.
I would recommend using:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
so that your application will not exceed the rate limit. Alternatively, you can check what you have used with:
print (api.rate_limit_status())
and then you can just sleep the thread like you have done.
Also, your end date is incorrect. The end date should be '2014-09-21', one higher than whatever today's date is.