Missing Tweets from Twitter API (using Tweepy)? - python-2.7

I have been collecting tweets from the past week to collect the past-7-days tweets related to "lung cancer", yesterday, I figured I needed to start collecting more fields, so I added some fields and started re-collecting the same period of Tweets related to "lung cancer" from last week. The problem is, the first time I've collected ~2000 tweets related to lung cancer on 18th, Sept 2014. But last night, it only gave ~300 tweets, when I looked at the time of the tweets for this new set, it's only collecting tweets from something like ~23:29 to 23:59 on 18th Sept 2014. A large chunk of data is obviously missing. I don't think it's something with my code (below), I have tested various ways including deleting most of the fields to be collected and the time of data is still cut off prematurely.
Is this a known issue with Twitter API (when collecting last 7 days' data)? If so, it will be pretty horrible if someone is trying to do serious research. Or is it somewhere in my code that caused this (note: it runs perfectly fine for other previous/subsequent dates)?
import tweepy
import time
import csv
ckey = ""
csecret = ""
atoken = ""
asecret = ""
OAUTH_KEYS = {'consumer_key':ckey, 'consumer_secret':csecret,
'access_token_key':atoken, 'access_token_secret':asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)
# Stream the first "xxx" tweets related to "car", then filter out the ones without geo-enabled
# Reference of search (q) operator: https://dev.twitter.com/rest/public/search
# Common parameters: Changeable only here
startSince = '2014-09-18'
endUntil = '2014-09-20'
suffix = '_18SEP2014.csv'
############################
### Lung cancer starts #####
searchTerms2 = '"lung cancer" OR "lung cancers" OR "lungcancer" OR "lungcancers" OR \
"lung tumor" OR "lungtumor" OR "lung tumors" OR "lungtumors" OR "lung neoplasm"'
# Items from 0 to 500,000 (which *should* cover all tweets)
# Increase by 4,000 for each cycle (because 5000-6000 is over the Twitter rate limit)
# Then wait for 20 min before next request (becaues twitter request wait time is 15min)
counter2 = 0
for tweet in tweepy.Cursor(api.search, q=searchTerms2,
since=startSince, until=endUntil).items(999999999): # changeable here
try:
'''
print "Name:", tweet.author.name.encode('utf8')
print "Screen-name:", tweet.author.screen_name.encode('utf8')
print "Tweet created:", tweet.created_at'''
placeHolder = []
placeHolder.append(tweet.author.name.encode('utf8'))
placeHolder.append(tweet.author.screen_name.encode('utf8'))
placeHolder.append(tweet.created_at)
prefix = 'TweetData_lungCancer'
wholeFileName = prefix + suffix
with open(wholeFileName, "ab") as f: # changeable here
writeFile = csv.writer(f)
writeFile.writerow(placeHolder)
counter2 += 1
if counter2 == 4000:
time.sleep(60*20) # wait for 20 min everytime 4,000 tweets are extracted
counter2 = 0
continue
except tweepy.TweepError:
time.sleep(60*20)
continue
except IOError:
time.sleep(60*2.5)
continue
except StopIteration:
break
Update:
I have since tried running the same python scripts on a different computer (which is faster and more powerful than my home laptop). And the latter resulted in the expected number of tweets, I don't know why it's happening as my home laptop works fine for many programs, but I think we could rest the case and rule out the potential issues related to the scripts or Twitter API.

If you want to collect more data, I would highly recommend the streaming api that Tweepy has to offer. It has a much higher rate limit, in fact I was able to collect 500,000 tweets in just one day.
Also your rate limit checking is not very robust, you don't know for sure that Twitter will allow you to access 4000 tweets. From experience, I found that the more often you hit the rate limit the fewer tweets you are allowed and the longer you have to wait.
I would recommend using:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
so that your application will not exceed the rate limit, alternatively you should check what you have used with:
print (api.rate_limit_status())
and then you can just sleep the thread like you have done.
Also your end date is incorrect. The end date should be '2014-09-21', one higher than whatever todays date is.

Related

NLTK regex parser's output has changed. Unable to parse phrases like verb followed by a noun

I have written a piece of code to parse the action items from a troubleshooting doc.
I want to extract phrases that start with a verb and end with a noun.
It was working as expected earlier (a month ago). But on running against the same input as earlier, its missing some action items that it was catching previously.
I haven't changed the code. Has something changed from nltk or punkt side that may be affecting my results?
Please help me figure what needs to be changed to make it run as earlier.
import re
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
#One time downloads
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')
custom_sent_tokenizer = PunktSentenceTokenizer()
def process_content(x):
try:
#sent_tag = []
act_item = []
for i in x:
print('tokenized = ',i)
words = nltk.word_tokenize(i)
print(words)
tagged = nltk.pos_tag(words)
print('tagged = ',tagged)
#sent_tag.append(tagged)
#print('sent= ',sent_tag)
#chunking
chunkGram = r"""ActionItems: {<VB.>+<JJ.|CD|VB.|,|CC|NN.|IN|DT>*<NN|NN.>+}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'ActionItems'):
print('Filtered chunks= ',subtree)
ActionItems = ' '.join([w for w, t in subtree.leaves()])
act_item.append(ActionItems)
chunked.draw()
return act_item
except Exception as e:
#print(str(e))
return str(e)
res = 'replaced rev 6 aeb with a rev 7 aeb. configured new board and regained activity. tuned, flooded and calibrated camera. scanned fi rst patient with no issues. made new backups. replaced aeb board and completed setup. however, det 2 st ill not showing any counts. performed all necessary tests and the y passed . worked with tech support to try and resolve the issue. we decided to order another board due to lower rev received. camera is st ill down.'
tokenized = custom_sent_tokenizer.tokenize(res)
tag = process_content(tokenized)
With the input as shared in the code, earlier, the following action items were being parsed:
['replaced rev 6 aeb', 'configured new board', 'regained activity', 'tuned , flooded and calibrated camera', 'scanned fi rst patient', 'made new backups', 'replaced aeb board', 'completed setup', 'det 2 st ill', 'showing any counts', 'performed all necessary tests and the y', 'worked with tech support']
But now, only these are coming up:
['regained activity', 'tuned , flooded and calibrated camera', 'completed setup', 'det 2 st ill', 'showing any counts']
I finally resolved this by replacing JJ. with JJ|JJR|JJS
So my chunk is defined as :
chunkGram = r"""ActionItems: {<VB.>+<JJ|JJR|JJS|CD|NN.|CC|IN|VB.|,|DT>*<NN|NN.>+}"""
I dont understand this change in behavior.
Dot (.) was a really good way of using all modifiers on a POS

slack bot news feeder not banning words

I'm having trouble getting my slack bots news feeder to ban words, the trigger words are working fine, but its letting banned words through but it catches them because it prints out found banned word: xxxxx. I'm new to python and I just don't get what's up.
triggers = ['SEC', 'CSA', 'OSC', 'CFTC', 'CME', 'CBOE', 'AMD', 'Intel', 'Nvidia',
'Bitcoin', 'blockchain', 'Apple', 'Amazon', 'Google', 'Microsoft',
'commerce', 'business', 'law', 'legal', 'financial', 'hack', 'hacked',
'chains', 'chairman', 'CEO', 'board', 'bank']
banned = ['technical', 'analysis', 'bearish', 'bullish', 'trading', 'trade', 'opinion',
'sponsored', 'price', 'watch']
def feedparsecheck(url):
feed = feedparser.parse(url)
feed_title = feed['feed']['title']
feed_entries = feed.entries
database()
print "feed 30 min"
for entry in feed.entries:
article_title = entry.title
article_link = entry.link
for trig in triggers:
if trig.lower() in article_title.lower():#trigger
for ban in banned:
if ban.lower() not in article_title.lower():#banned
response = "%s\n%s\n" % (article_title, article_link)
article_link = str(article_link.strip())
if not in_database(article_link):
update_database(article_link)
#print article_link
slack_client.api_call("chat.postMessage", channel=NEWS, text=''.join(response), as_user=True)
else:
print "found banned word:- " + ban
You are iterating over all triggers and all banned words for all articles, which means that every article will get sent to your channel for every banned word that's not in the title * number of triggers in the title.
Example:
Bitcoin trading for lower price after hack
Will get sent to your channel 16 times. 2 triggers (Bitcoin, hack) * 8 banned words not in title (10 - len(trading, price)) = 16.
To fix:
title_lower = article_title.lower()
if any(trig.lower() in title_lower for trig in triggers):
if any(ban.lower() in title_lower for ban in banned):
print 'found banned word:- '+ ban
else:
# post to slack channel

Looping in Python Slowing Down with each Iteration

The following code produces the desired result, but the root of my issue stems from the process slowing down with each append. In the first 100 pages or so, it runs at less than one second per page, but by the 500s it gets up to 3 seconds and into the 1000s it runs at about 5 seconds per append. Are there suggestions for how to make this more efficient or explanations for why this is just the way things are?
import lxml
from lxml import html
import itertools
import datetime
l=[]
for pageno in itertools.count(start=1):
time = datetime.datetime.now()
url = 'http://example.com/'
parse = lxml.html.parse(url)
try:
for x in parse.xpath('//center'):
x.getparent().remove(x)
x.clear()
while x.getprevious() is not None:
del x.getparent()[0]
for n in parse.xpath('//tr[#class="rt"]'):
l.append([n.find('td/a').text.encode('utf8').decode('utf8').strip()\
,n.find('td/form/p/a').text.encode('utf8').decode('utf8').strip()\
,n.find('td/form/p/a').attrib['title'].encode('utf8').decode('utf8').strip()]\
+[c.text.encode('utf8').decode('utf8').strip() for c in n if c.text.strip() is not ''])
n.clear()
while n.getprevious() is not None:
del n.getparent()[0]
except:
print 'Page ' + str(pageno) + 'Does Not Exist'
print '{0} Pages Complete: {1}'.format(pageno,datetime.datetime.now()-time)
I have tried a number of solutions, such as disabling the garbage collector, writing one list as a row to file instead of appending to a large list, etc. I look forward to learning more from potential suggestions/answers.

Specify exact time to start and end the collection of tweets using Python Tweepy?

I am having quite a bit of technical issues. My python script below usually works (when time is in yyyy-mm-dd' format. But during the extremely heavy tweet activities, for example more than 500,000 tweets collected a day, my computer runs out of memory and have to force stop the program.
I can work around by looking at the time of the last tweets in the stopped csv file, in this case it's at time 18:44:00. I have tried many time format (for example 'yyyy-mm-dd hh:mm:ss' format as below) but none actually works.
import tweepy
import time
import csv
ckey = ""
csecret = ""
atoken = ""
asecret = ""
OAUTH_KEYS = {'consumer_key':ckey, 'consumer_secret':csecret,
'access_token_key':atoken, 'access_token_secret':asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)
# Stream the first "xxx" tweets related to "car", then filter out the ones without geo-enabled
# Reference of search (q) operator: https://dev.twitter.com/rest/public/search
# Common parameters: Changeable only here
startSince = '2014-09-18 00:00:00'
endUntil = '2014-09-18 18:44:00'
suffix = '_18SEP2014.csv'
############################
### Lung cancer starts #####
searchTerms2 = '"lung cancer" OR "lung cancers" OR "lungcancer" OR "lungcancers" OR \
"lung tumor" OR "lungtumor" OR "lung tumors" OR "lungtumors" OR "lung neoplasm"'
# Items from 0 to 500,000 (which *should* cover all tweets)
# Increase by 4,000 for each cycle (because 5000-6000 is over the Twitter rate limit)
# Then wait for 20 min before next request (becaues twitter request wait time is 15min)
counter2 = 0
for tweet in tweepy.Cursor(api.search, q=searchTerms2,
since=startSince, until=endUntil).items(999999999): # changeable here
try:
'''
print "Name:", tweet.author.name.encode('utf8')
print "Screen-name:", tweet.author.screen_name.encode('utf8')
print "Tweet created:", tweet.created_at'''
placeHolder = []
placeHolder.append(tweet.author.name.encode('utf8'))
placeHolder.append(tweet.author.screen_name.encode('utf8'))
placeHolder.append(tweet.created_at)
prefix = 'TweetData_lungCancer'
wholeFileName = prefix + suffix
with open(wholeFileName, "ab") as f: # changeable here
writeFile = csv.writer(f)
writeFile.writerow(placeHolder)
counter2 += 1
if counter2 == 4000:
time.sleep(60*20) # wait for 20 min everytime 4,000 tweets are extracted
counter2 = 0
continue
except tweepy.TweepError:
time.sleep(60*20)
continue
except IOError:
time.sleep(60*2.5)
continue
except StopIteration:
break

Xively read data in Python

I have written a python 2.7 script to retrieve all my historical data from Xively.
Originally I wrote it in C#, and it works perfectly.
I am limiting the request to 6 hour blocks, to retrieve all stored data.
My version in Python is as follows:
requestString = 'http://api.xively.com/v2/feeds/41189/datastreams/0001.csv?key=YcfzZVxtXxxxxxxxxxxORnVu_dMQ&start=' + requestDate + '&duration=6hours&interval=0&per_page=1000' response = urllib2.urlopen(requestString).read()
The request date is in the correct format, I compared the full c# requestString version and the python one.
Using the above request, I only get 101 lines of data, which equates to a few minutes of results.
My suspicion is that it is the .read() function, it returns about 34k of characters which is far less than the c# version. I tried adding 100000 as an argument to the ad function, but no change in result.
Left another solution wrote in Python 2.7 too.
In my case, got data each 30 minutes because many sensors sent values every minute and Xively API has limited half hour of data to this sent frequency.
It's general module:
for day in datespan(start_datetime, end_datetime, deltatime): # loop increasing deltatime to star_datetime until finish
while(True): # assurance correct retrieval data
try:
response = urllib2.urlopen('https://api.xively.com/v2/feeds/'+str(feed)+'.csv?key='+apikey_xively+'&start='+ day.strftime("%Y-%m-%dT%H:%M:%SZ")+'&interval='+str(interval)+'&duration='+duration) # get data
break
except:
time.sleep(0.3)
raise # try again
cr = csv.reader(response) # return data in columns
print '.'
for row in cr:
if row[0] in id: # choose desired data
f.write(row[0]+","+row[1]+","+row[2]+"\n") # write "id,timestamp,value"
The full script you can find it here: https://github.com/CarlosRufo/scripts/blob/master/python/retrievalDataXively.py
Hope you might help, delighted to answer any questions :)