Tweepy - Get All Followers For Account - Rate Limit Issues - python-2.7

Below is my working code to get twitter followers for certain accounts (#hudsonci in this case).
My issue is the time it takes to pull in all of these followers. This account has approx. 1,000 followers, and I can only get 300 at a time with the rate limiting restrictions, so it takes more than an hour to get all the followers for this account. I can imagine this will become a huge pain in the ass for large accounts.
I am looking for suggestions for how I can improve this. I feel like I am not taking full advantage of the pagination cursor, but I can't be sure.
Any help is appreciated.
#!/usr/bin/env python
# encoding: utf-8
import tweepy
import time
#Twitter API credentials
consumer_key = "mine"
consumer_secret = "mine"
access_key = "mine"
access_secret = "mine"
#authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
def handle_errors(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.TweepError:
            # back off for 20 minutes when the rate limit is hit
            time.sleep(20 * 60)

for user in handle_errors(tweepy.Cursor(api.followers, screen_name='hudsonci').items()):
    print user.screen_name

As per the Twitter documentation for followers/ids, you need to use the count parameter.
Specifies the number of IDs attempt retrieval of, up to a maximum of 5,000 per distinct request.
So adding count=5000 should help you, provided you only need follower IDs; followers/list, which returns full user objects, is capped at 200 per request.
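For example, here is a minimal sketch (assuming an older Tweepy where api.followers_ids and api.lookup_users are available, and the same api object as in the question) that pulls IDs 5,000 at a time and then hydrates them in batches of 100:
# Pull follower IDs 5,000 per request, then hydrate them 100 at a time via users/lookup
follower_ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name='hudsonci', count=5000).pages():
    follower_ids.extend(page)

for i in range(0, len(follower_ids), 100):  # users/lookup accepts at most 100 IDs per call
    for user in api.lookup_users(user_ids=follower_ids[i:i + 100]):
        print user.screen_name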

You are getting 300 followers at a time because fetching full follower objects (as opposed to IDs only) defaults to 20 users per page. With 15 requests per 15-minute window, that comes out to 20 × 15 = 300 followers.
Here are the docs for followers: https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-followers-list
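As a sketch (assuming a Tweepy version whose API constructor accepts wait_on_rate_limit), you can raise the page size to the followers/list maximum of 200 and let Tweepy sleep through the rate-limit window instead of the hand-rolled handle_errors generator:
# 200 followers per page, and Tweepy sleeps automatically when rate limited
api = tweepy.API(auth,
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

for user in tweepy.Cursor(api.followers, screen_name='hudsonci', count=200).items():
    print user.screen_name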

Related

How to create a queue for python-requests in Django?

A REST API service has a request limit (say a maximum of 100 requests per minute). In Django, I am trying to allow users to access such an API and retrieve data in real time to update SQL tables. The problem is that if multiple users access the API at the same time, the request limit is likely to be exceeded.
Here is a code snippet showing how I currently perform requests: each user adds a list of objects he wants to request and runs request_engine().start(object_list) to access the API. I use multithreading to speed up the requests, and I allow failed API requests to be retried by setting a limit (upper_limit) on the number of attempts per request object.
As I understand it, there should be some kind of queue for API requests. I expect there is a more elegant solution for this, but I could not find any similar examples. How can one implement/rewrite this for multi-user usage with Django?
import requests
from multiprocessing.dummy import Pool as ThreadPool

N = 50           # number of threads
upper_limit = 1  # limit on the number of requests for a single object

class request_engine():
    def __init__(self):
        pass

    def start(self, objs):
        self.objs = {obj: {'status': 0, 'data': None} for obj in objs}
        done = False
        while not done:
            self.parallel_requests()
            done = all(_['status'] > upper_limit or _['status'] == -1 for obj, _ in self.objs.items())
        return dict(self.objs)

    def single_request(self, request_obj):
        URL = f"https://reqres.in/api/users?page={request_obj}"
        r = requests.get(url=URL)
        if r.ok:
            res = r.json()
            self.objs[request_obj]['status'] = -1
            self.objs[request_obj]['data'] = res
        else:
            self.objs[request_obj]['status'] += 1

    def parallel_requests(self):
        objs = [obj for obj, _ in self.objs.items() if _['status'] != -1 and _['status'] <= upper_limit]
        pool = ThreadPool(N)
        pool.map(self.single_request, objs)
        pool.close()
        pool.join()

objs = [1, 2, 3, 4, 5, 6, 7, 7, 8, 234, 124, 24, 535, 6, 234, 24, 4, 1, 3, 4, 5, 4, 3, 5, 3, 1, 5, 2, 3, 5, 3]
result = request_engine().start(objs)
print([_['status'] for obj, _ in result.items()])
# status corresponds to the number of unsuccessful requests
# status=-1 implies success of the request
# status corresponds to the number of unsuccessful requests
# status=-1 implies success of the request
Thanks in advance.
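One possible approach, sketched below purely as an illustration (the Throttle class name, the 100-requests-per-minute figure, and where it lives are assumptions, not something from the question), is to funnel every outgoing call through a small thread-safe throttle so that all worker threads in a Django process share one request budget:
import threading
import time

class Throttle(object):
    """Thread-safe limiter: at most max_calls acquisitions per period seconds."""
    def __init__(self, max_calls=100, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self.lock = threading.Lock()
        self.calls = []  # timestamps of recent acquisitions

    def acquire(self):
        while True:
            with self.lock:
                now = time.time()
                # drop timestamps that have aged out of the window
                self.calls = [t for t in self.calls if now - t < self.period]
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                wait = self.period - (now - self.calls[0])
            time.sleep(max(wait, 0.1))

# one shared instance per process; call throttle.acquire() just before requests.get()
throttle = Throttle(max_calls=100, period=60.0)
Inside single_request you would call throttle.acquire() before requests.get(...). Note that this only coordinates threads inside one worker process; if you run several Django workers, the budget has to live somewhere shared, for example a cache or a task queue such as Celery with its rate_limit option.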

Instagram Graph API - Fetch media insights metric when a user switched from personal to business account

I'm looking for a way to fetch Media Insights metrics in Instagram Graph API (https://developers.facebook.com/docs/instagram-api/reference/media/insights) with a nested query based on the userId, even when a client switched from a Personal to a Business account.
I use this nested query to fetch all the data I need: https://graph.facebook.com/v3.2/{userId}?fields=followers_count,media{media_type,caption,timestamp,like_count,insights.metric(reach, impressions)} (the insights.metric(reach, impressions) part causes the error; it works, however, for an account that has always been a Business one).
However, because some media linked to the userId were posted before the user switched to a Business account, instead of returning the data only for the media posted afterwards, the API returns this error:
{
  "error": {
    "message": "Invalid parameter",
    "type": "OAuthException",
    "code": 100,
    "error_data": {
      "blame_field_specs": [
        [
          ""
        ]
      ]
    },
    "error_subcode": 2108006,
    "is_transient": false,
    "error_user_title": "Media Posted Before Business Account Conversion",
    "error_user_msg": "The media was posted before the most recent time that the user's account was converted to a business account from a personal account.",
    "fbtrace_id": "Gs85pUz14JC"
  }
}
Is there a way to know, through the API, which media were created before and after the account switch from Personal to Business? Or is there a way to fetch the date on which the account was switched?
The only way I currently see to get the data I need is to use the /media edge and query insights for each media item until I get an error. Then I would get approximately the date I need. However, this is not optimized at all, since we are rate limited to 200 calls per user per hour.
I have the same problem.
For now I switch between two queries, falling back to the second if the first returns the error:
"userId"?fields=id,media.limit(100){insights.metric(reach, impressions)}
"userId"?fields=id,media.limit(100)
In the fallback case I show the user all insights as zero.
I don't know if this is the best alternative; ideally one would identify the time of the conversion to a Business account and only request insights for the posts in that date range.
I got the same problem and solved it like this:
Use the nested query just like you did, including insights.metric.
If the error appears, do another call without insights.metric, to at least get all the other data (a sketch of this fallback follows).
For most accounts it works with no additional API call. For the rest, I just cannot get the insights and have to live with it, I guess, until Facebook/IG fixes the issue.
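A minimal sketch of that fallback with plain requests (the GRAPH base URL/version, the field lists, and the fetch_media helper are placeholders taken from the question, not something prescribed by the API docs):
import requests

GRAPH = "https://graph.facebook.com/v3.2"
FIELDS_WITH_INSIGHTS = ("followers_count,media{media_type,caption,timestamp,"
                        "like_count,insights.metric(reach,impressions)}")
FIELDS_WITHOUT_INSIGHTS = "followers_count,media{media_type,caption,timestamp,like_count}"

def fetch_media(user_id, access_token):
    # First try the full query, including insights.metric
    params = {"fields": FIELDS_WITH_INSIGHTS, "access_token": access_token}
    resp = requests.get("%s/%s" % (GRAPH, user_id), params=params).json()
    error = resp.get("error", {})
    if error.get("error_subcode") == 2108006:
        # "Media Posted Before Business Account Conversion": retry without insights
        params["fields"] = FIELDS_WITHOUT_INSIGHTS
        resp = requests.get("%s/%s" % (GRAPH, user_id), params=params).json()
    return resp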
I got the same problem and solved it like this:
Step 1: Convert your Instagram account to a Professional account.
Step 2: Then, per the error message, publish a new post on Instagram (after the conversion) and get its Post-ID.
Step 3: Then make a request using that Post-ID:
{Post-ID}?fields=comments_count,like_count,timestamp,insights.metric(reach,impressions)
curl -i -X GET "https://graph.facebook.com/v12.0/{Post-ID}?fields=comments_count%2Clike_count%2Ctimestamp%2Cinsights.metric(reach%2Cimpressions)&access_token={access_token}"
For more: insights
Here is the relevant logic from a script that can handle this error while still doing a full import. It works by reducing the requested limit to 1 once the error is encountered. It will keep requesting insights until it encounters the error again, then removes insights from the fields and returns to the requested limit.
limit = 50
error_2108006 = False
metrics = 'insights.metric%28impressions%29%2C'  # Must be URL encoded for replacement
url = '/PAGE_ID/media?fields=%sid,caption,media_url,media_type&limit=%s' % (metrics, limit)

# While we have more pages
while True:
    # Make your API call to Instagram
    posts = get_posts_from_instagram(url)

    # Check for error 2108006
    if posts == 2108006:
        # First time getting this error: keep trying to get insights, but one by one
        if error_2108006 is False:
            error_2108006 = True
            url = url.replace('limit={}'.format(limit), 'limit=1')
            continue
        # Not the first time. Strip out insights and return to the desired limit.
        url = url.replace(metrics, '')
        url = url.replace('limit=1', 'limit={}'.format(limit))
        continue

    # Do something with the data
    for post in posts:
        continue

    # If there are more pages, fetch the next URL
    if 'paging' in posts and 'next' in posts['paging']:
        url = posts['paging']['next']
        continue

    # Done
    break

gspread Invalid token: Stateless token expired [duplicate]

I am using gspread with a Service Account Key (Other, JSON file) to continually update a Google spreadsheet with Python 2.7. I have this running on a Raspberry Pi with the latest Raspbian Jessie, and my oauth2client and gspread should be the latest versions available for my platform. My script runs for one hour (the maximum token lifespan), then stops working with the error message "Invalid token: Stateless token expired". My code is as follows:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import httplib2
from httplib2 import Http

scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('filename.json', scope)
gc = gspread.authorize(credentials)
wks = gc.open('spreadsheet name')
p1 = wks.worksheet('Printer One')

def function():
    ...
    p1.append_row(printing)
Any Help would be greatly appreciated, Thank You.
Authorisation expires every 0.5/1 hour (I think it depends on which of the two available methods you use to connect).
I have a Google Sheet connected 24/7 that updates every 2 seconds. Almost always the reason for a bad read/write is an authorisation error, but the Google API can also throw a variety of errors at you that normally resolve after a few seconds. Here's one of my functions to update a cell, using your details for auth_for_worksheet. Every operation (update a single cell, update a range, read a column of values) has a similar construct as a function, which always returns an authorised worksheet. It's probably not the most elegant solution, but the sheet has been connected for 3 months with no downtime.
import time
import logging

import gspread
from oauth2client.service_account import ServiceAccountCredentials

logger = logging.getLogger(__name__)

def auth_for_worksheet():
    scope = ['https://spreadsheets.google.com/feeds']
    credentials = ServiceAccountCredentials.from_json_keyfile_name('filename.json', scope)
    gc = gspread.authorize(credentials)
    wks = gc.open('spreadsheet name')
    p1 = wks.worksheet('Printer One')
    return p1

def update_single_cell(worksheet, counter, message):
    """ No data to return, update a single cell in column B to reflect this """
    single_cell_updated = False
    while not single_cell_updated:
        try:
            cell_location = "B" + str(counter)
            worksheet.update_acell(cell_location, message)
            single_cell_updated = True
        except gspread.exceptions.HTTPError:
            logger.critical("Could not update single cell")
            time.sleep(10)
            worksheet = auth_for_worksheet()
    logger.info("Updated single cell")
    return worksheet

if __name__ == '__main__':
    # your code here, but now to update a single cell
    wksheet = update_single_cell(wksheet, x, "NOT FOUND")
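Another option, sketched below, is to refresh proactively instead of waiting for the failure. With the oauth2client / older-gspread stack from the question, the credentials object exposes access_token_expired and refresh(), and the client has a login() method; the helper name is mine, and this is an assumption about that stack rather than a guaranteed API:
import httplib2

def ensure_fresh(gc, credentials):
    # Hypothetical helper: refresh the service-account token if it has expired
    if credentials.access_token_expired:
        credentials.refresh(httplib2.Http())  # fetch a new access token
        gc.login()                            # re-send the refreshed credentials
    return gc
You would call ensure_fresh(gc, credentials) before each append_row / update_acell, instead of (or in addition to) rebuilding everything via auth_for_worksheet().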

Making very specific time requests (to the second) on Twitter API, using Python Tweepy?

I would like to request tweets on a specific topic (for example, "cancer") using Python Tweepy, but as far as I can tell the time window can only be specified down to a specific day, for example:
startSince = '2014-10-01'
endUntil = '2014-10-02'
for tweet in tweepy.Cursor(api.search, q="cancer",
                           since=startSince, until=endUntil).items(999999999):
Is there a way to specify the time so I can collect "cancer" tweets between 2014-10-01 00:00:00 and 2014-10-02 12:00:00? This is for my academic research: I was able to collect cancer tweets for the last month, but the sudden burst in the quantity of "breast cancer" tweets due to cancer awareness month breaks my script, so I have to collect them in separate time segments, and I will not be able to retrieve the tweets for Oct 01, 2014 if I can't figure this out soon.
There is no way that I've found to specify a time using since/until.
You can hack your way around this using since_id & max_id.
If you can find a tweet made at around the times you want, you can restrict your search to those made after since_id and before max_id.
import tweepy
consumer_key = 'aaa'
consumer_secret = 'bbb'
access_token = 'ccc'
access_token_secret = 'ddd'
# OAuth process, using the keys and tokens
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
results = api.search(q="cancer", since_id=518857118838181000, max_id=518857136202194000)
for result in results:
    print result.text
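If you don't have a convenient tweet at each boundary, you can derive the IDs yourself: modern tweet IDs are snowflake IDs that embed a millisecond timestamp (milliseconds since the Twitter epoch 1288834974657, shifted left 22 bits). That is an assumption about the ID format rather than a documented Tweepy feature, but a sketch of turning UTC datetimes into since_id / max_id for the api object above looks like this:
import calendar
from datetime import datetime

TWITTER_EPOCH_MS = 1288834974657  # start of the snowflake epoch

def datetime_to_snowflake(dt):
    # Approximate tweet ID for a UTC datetime (machine/sequence bits zeroed)
    ms = calendar.timegm(dt.timetuple()) * 1000
    return (ms - TWITTER_EPOCH_MS) << 22

since_id = datetime_to_snowflake(datetime(2014, 10, 1, 0, 0, 0))
max_id = datetime_to_snowflake(datetime(2014, 10, 2, 12, 0, 0))
results = api.search(q="cancer", since_id=since_id, max_id=max_id)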