How to get the full list of a Twitter account's followers with Tweepy and save it to a file - python-2.7

I wrote this code to get the full list of twitter account followers using Tweepy:
# ... twitter connection and streaming
fulldf = pd.DataFrame()
line = {}
ids = []
try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
        df = pd.DataFrame()
        ids.extend(page)
        try:
            for i in ids:
                user = api.get_user(i)
                line = [{'id': user.id,
                         'Name': user.name,
                         'Statuses Count': user.statuses_count,
                         'Friends Count': user.friends_count,
                         'Screen Name': user.screen_name,
                         'Followers Count': user.followers_count,
                         'Location': user.location,
                         'Language': user.lang,
                         'Created at': user.created_at,
                         'Time zone': user.time_zone,
                         'Geo enable': user.geo_enabled,
                         'Description': user.description.encode(sys.stdout.encoding, errors='replace')}]
                df = pd.DataFrame(line)
                fulldf = fulldf.append(df)
                del df
                fulldf.to_csv('out.csv', sep=',', index=False)
                print i, len(ids)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
except tweepy.TweepError as e2:
    print "exception global block"
    print e2.message[0]['code']
    print e2.args[0][0]['code']
At the end I have only 1000 lines in the csv file. It's not the best solution to keep everything in memory (the dataframe) and write it to the file inside the same loop, but at least I have something that works; it just doesn't get the full list, only 1000 out of 15000 followers.
Any help with this will be appreciated.

Consider the following part of your code:
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    df = pd.DataFrame()
    ids.extend(page)
    try:
        for i in ids:
            user = api.get_user(i)
As you use extend for each page, you simply add the new set of ids onto the end of your list of ids. The way you have nested your for statements means that with every new page you return, you call get_user for all of the previous pages first. As a result, by the time you reach the final page of ids you'd still be looking at the first 1000 or so when you hit the rate limit and have no more pages to browse. You're also likely hitting the rate limit for your cursor, which would be why you're seeing the exception.
Let's start over a bit.
Firstly, tweepy can deal with rate limits (one of the main error sources) for you when you create your API if you use wait_on_rate_limit. This solves a whole bunch of problems, so we'll do that.
Secondly, if you use lookup_users, you can look up 100 user objects per request. I've written about this in another answer so I've taken the method from there.
Finally, we don't need to create a dataframe or export to a csv until the very end. If we get a list of user information dictionaries, this can quickly change to a DataFrame with no real effort from us.
Here is the full code - you'll need to sub in your keys and the username of the user you actually want to look up, but other than that it hopefully will work!
import tweepy
import pandas as pd


def lookup_user_list(user_id_list, api):
    full_users = []
    users_count = len(user_id_list)
    try:
        for i in range((users_count / 100) + 1):
            print i
            full_users.extend(api.lookup_users(user_ids=user_id_list[i * 100:min((i + 1) * 100, users_count)]))
        return full_users
    except tweepy.TweepError:
        print 'Something went wrong, quitting...'


consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    ids.extend(page)

results = lookup_user_list(ids, api)

all_users = [{'id': user.id,
              'Name': user.name,
              'Statuses Count': user.statuses_count,
              'Friends Count': user.friends_count,
              'Screen Name': user.screen_name,
              'Followers Count': user.followers_count,
              'Location': user.location,
              'Language': user.lang,
              'Created at': user.created_at,
              'Time zone': user.time_zone,
              'Geo enable': user.geo_enabled,
              'Description': user.description}
             for user in results]

df = pd.DataFrame(all_users)
df.to_csv('All followers.csv', index=False, encoding='utf-8')

Related

How to extract multiple rows of data relative to a single row in scrapy

I am trying to scrape the webpage at this link -
http://new-york.eat24hours.com/picasso-pizza/19053
Here I am trying to get all the possible details, like the address and phone etc.
So far I have extracted the name, phone, address, reviews, and rating.
But I also want to extract the full menu of the restaurant (name of each item with its price).
So far I have no idea how to fit this data into the csv output.
The rest of the data for a single url is a single value, but the number of menu items will always vary.
Here is my code so far:
import scrapy
from urls import start_urls


class eat24Spider(scrapy.Spider):
    AUTOTHROTTLE_ENABLED = True
    name = 'eat24'

    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        brickset = response
        NAME_SELECTOR = 'normalize-space(.//h1[@id="restaurant_name"]/a/text())'
        ADDRESS_SELECTION = 'normalize-space(.//span[@itemprop="streetAddress"]/text())'
        LOCALITY = 'normalize-space(.//span[@itemprop="addressLocality"]/text())'
        REGION = 'normalize-space(.//span[@itemprop="addressRegion"]/text())'
        ZIP = 'normalize-space(.//span[@itemprop="postalCode"]/text())'
        PHONE_SELECTOR = 'normalize-space(.//span[@itemprop="telephone"]/text())'
        RATING = './/meta[@itemprop="ratingValue"]/@content'
        NO_OF_REVIEWS = './/meta[@itemprop="reviewCount"]/@content'
        OPENING_HOURS = './/div[@class="hours_info"]//nobr/text()'
        EMAIL_SELECTOR = './/div[@class="company-info__block"]/div[@class="business-buttons"]/a[span]/@href[substring-after(.,"mailto:")]'
        yield {
            'name': brickset.xpath(NAME_SELECTOR).extract_first().encode('utf8'),
            'pagelink': response.url,
            'address': str(brickset.xpath(ADDRESS_SELECTION).extract_first().encode('utf8') + ', ' + brickset.xpath(LOCALITY).extract_first().encode('utf8') + ', ' + brickset.xpath(REGION).extract_first().encode('utf8') + ', ' + brickset.xpath(ZIP).extract_first().encode('utf8')),
            'phone': str(brickset.xpath(PHONE_SELECTOR).extract_first()),
            'reviews': str(brickset.xpath(NO_OF_REVIEWS).extract_first()),
            'rating': str(brickset.xpath(RATING).extract_first()),
            'opening_hours': str(brickset.xpath(OPENING_HOURS).extract_first())
        }
I am sorry if I am making this confusing but any kind of help will be appreciated.
Thank you in advance!!
If you want to extract the full restaurant menu, first of all you need to locate the element that contains both the name and the price:
menu_items = response.xpath('//tr[@itemscope]')
After that, you can simply make a for loop and iterate over the menu items, appending each name and price to a list:
menu = []
for item in menu_items:
    menu.append({
        'name': item.xpath('.//a[@class="cpa"]/text()').extract_first(),
        'price': item.xpath('.//span[@itemprop="price"]/text()').extract_first()
    })
Finally, you can add a new 'menu' key to your dict:
yield {'menu': menu}
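Putting it together, here is a sketch of how the menu could be yielded alongside the other fields in your parse method (I've kept your selector names as-is and have not re-tested them against the live page, so treat this as an outline rather than a drop-in replacement):
def parse(self, response):
    # ... selectors defined exactly as in your original parse ...
    menu_items = response.xpath('//tr[@itemscope]')
    menu = []
    for item in menu_items:
        menu.append({
            'name': item.xpath('.//a[@class="cpa"]/text()').extract_first(),
            'price': item.xpath('.//span[@itemprop="price"]/text()').extract_first()
        })
    yield {
        'name': response.xpath(NAME_SELECTOR).extract_first(),
        'pagelink': response.url,
        # ... the other fields from your original yield ...
        'menu': menu
    }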
Also, I suggest you use scrapy Items for storing scraped data:
https://doc.scrapy.org/en/latest/topics/items.html
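As a rough illustration (the class and field names here are just an assumption mirroring the dict keys above, not anything scrapy or eat24 prescribes), an Item for this spider could look like:
import scrapy

class RestaurantItem(scrapy.Item):
    name = scrapy.Field()
    pagelink = scrapy.Field()
    address = scrapy.Field()
    phone = scrapy.Field()
    reviews = scrapy.Field()
    rating = scrapy.Field()
    opening_hours = scrapy.Field()
    menu = scrapy.Field()
The spider would then yield RestaurantItem(name=..., menu=menu, ...) instead of a plain dict.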
For outputting the data to a csv file, use scrapy Feed exports; type in the console:
scrapy crawl yourspidername -o restaurants.csv

Writing search results only from last row in csv

Here is my code:
import urllib
import json
import csv

apiKey = "MY_KEY"  # Google API credentials

## perform a text search based on input, place results in text-search-results.json
print "Starting"
myfile = open("results.csv", "wb")
headers = []
headers.append(['Search', 'Name', 'Address', 'Phone', 'Website', 'Type', 'Google ID', 'Rating', 'Permanently Closed'])
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerows(headers)

with open('input_file.csv', 'rb') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in filereader:
        search = ', '.join(row)
        search.replace(' ', '+')
        url1 = "https://maps.googleapis.com/maps/api/place/textsearch/json?query=%s&key=%s" % (search, apiKey)
        urllib.urlretrieve(url1, "text-search-results.json")
        print "SEARCH", search
        print "Google Place URL", url1

        ## load text-search-results.json and get the list of place IDs
        textSearchResults = json.load(open("text-search-results.json"))
        listOfPlaceIds = []
        for item in textSearchResults["results"]:
            listOfPlaceIds.append(str(item["place_id"]))

        ## open a nested list for the results
        output = []

        ## iterate through and download a JSON for each place ID
        for ids in listOfPlaceIds:
            url = "https://maps.googleapis.com/maps/api/place/details/json?placeid=%s&key=%s" % (ids, apiKey)
            fn = ids + "-details.json"
            urllib.urlretrieve(url, fn)
            data = json.load(open(fn))
            lineToAppend = []
            lineToAppend.append(search)
            try:
                lineToAppend.append(str(data["result"]["name"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["formatted_address"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["formatted_phone_number"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["website"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["types"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["place_id"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["rating"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["permanently_closed"]))
            except KeyError:
                lineToAppend.append('')
        output.append(lineToAppend)
        wr.writerows(output)
myfile.close()
What this is doing is taking the search terms from one column in the input_file and running that search through the Google Places API. However, when I have multiple search terms, it only returns the last search results in the results.csv file. I am not quite sure why this is happening since it is reading all of the search terms and running them through, but only returning the last result. Any suggestions?
Currently you are only writing out the last line because you reset the variable lineToAppend inside your for loop but never append it to output inside that loop. The loop therefore runs to the end, and only the final lineToAppend is appended and written out.
At the moment it looks like this (shortened for brevity):
for ids in listOfPlaceIds:
    url = "https://maps.googleapis.com/maps/api/place/details/json?placeid=%s&key=%s" % (ids, apiKey)
    fn = ids + "-details.json"
    urllib.urlretrieve(url, fn)
    data = json.load(open(fn))
    lineToAppend = []
    lineToAppend.append(search)
    ...
output.append(lineToAppend)
wr.writerows(output)
Whereas it should be:
for ids in listOfPlaceIds:
    url = "https://maps.googleapis.com/maps/api/place/details/json?placeid=%s&key=%s" % (ids, apiKey)
    fn = ids + "-details.json"
    urllib.urlretrieve(url, fn)
    data = json.load(open(fn))
    lineToAppend = []
    lineToAppend.append(search)
    ...
    output.append(lineToAppend)
wr.writerows(output)
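As a side note on the design (just an alternative, not required for the fix), you could drop the output list entirely and write each row as soon as it is built with csv's writerow, which makes this kind of placement mistake harder to make:
for ids in listOfPlaceIds:
    # ... build lineToAppend exactly as above ...
    wr.writerow(lineToAppend)   # write this place's row immediately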

Tweepy location on Twitter API filter always throws 406 error

I'm using the following code (from a django management command) to listen to the Twitter stream. I've used the same code in a separate command to track keywords successfully, and I've branched this out to use location; (apparently rightly) I wanted to test this out without disrupting my existing analysis that's running.
I've followed the docs and have made sure the box is in Long/Lat format (in fact, I'm using the example long/lat from the Twitter docs now). It looks broadly the same as the question here, and I tried using their version of the code from the answer - same error. If I switch back to using 'track=...', the same code works, so it's a problem with the location filter.
After adding a print debug inside tweepy's streaming.py so I can see what's happening, I print out self.parameters, self.url and self.headers from _run, and get:
{'track': 't,w,i,t,t,e,r', 'delimited': 'length', 'locations': '-121.7500,36.8000,-122.7500,37.8000'}
/1.1/statuses/filter.json?delimited=length and
{'Content-type': 'application/x-www-form-urlencoded'}
respectively - it seems to be missing the location part of the search in some way, shape or form. I'm obviously not the only one using tweepy's location search, so I think it's more likely a problem in my use of it than a bug in tweepy (I'm on 2.3.0), but my implementation looks right as far as I can tell.
My stream handling code is here:
consumer_key = 'stuff'
consumer_secret = 'stuff'
access_token = 'stuff'
access_token_secret_var = 'stuff'

import tweepy
import json


# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        #print type(decoded), decoded
        # Also, we convert UTF-8 to ASCII ignoring all bad characters sent by users
        try:
            user, created = read_user(decoded)
            print "DEBUG USER", user, created
            if decoded['lang'] == 'en':
                tweet, created = read_tweet(decoded, user)
                print "DEBUG TWEET", tweet, created
            else:
                pass
        except KeyError, e:
            print "Error on Key", e
            pass
        except DataError, e:
            print "DataError", e
            pass
        #print user, created
        print ''
        return True

    def on_error(self, status):
        print status


l = StdOutListener()
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret_var)
stream = tweepy.Stream(auth, l)
# locations must be long, lat
stream.filter(locations=[-121.75, 36.8, -122.75, 37.8], track='twitter')
The issue here was the order of the coordinates.
The correct format is: SouthWest corner (long, lat), then NorthEast corner (long, lat). I had them transposed. :(
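In other words, a sketch using the same bounding box as the question, just with the corners in the right order (keyword tracking is left out here; see the next answer for why):
# south-west corner first (long, lat), then north-east corner (long, lat)
stream.filter(locations=[-122.75, 36.8, -121.75, 37.8])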
The streaming API doesn't allow you to filter by location AND keyword simultaneously.
You can refer to this answer; I had the same problem earlier:
https://stackoverflow.com/a/22889470/4432830

BigQuery results run via python in Google Cloud don't match results running on MAC

I have a python app that runs a query on BigQuery and appends the results to a file. I've run this on a Mac workstation (Yosemite) and on a GC instance (ubuntu 14.1), and the floating point results differ. How can I make them the same? The python environments are the same on both.
run on google cloud instance
1120224,2015-04-06,23989,866,55159.71274162368,0.04923989554019882,0.021414467106578683,0.03609987911125933,63.69481840834143
54897577,2015-04-06,1188089,43462,2802473.708558333,0.051049132980100984,0.021641920553251377,0.03658143455582873,64.4810111950286
run on mac workstation
1120224,2015-04-06,23989,866,55159.712741623654,0.049239895540198794,0.021414467106578683,0.03609987911125933,63.694818408341405
54897577,2015-04-06,1188089,43462,2802473.708558335,0.05104913298010102,0.021641920553251377,0.03658143455582873,64.48101119502864
import sys
import pdb
import json
from collections import OrderedDict
from csv import DictWriter
from pprint import pprint
from apiclient import discovery
from oauth2client import tools
import functools
import argparse
import httplib2
import time
from subprocess import call


def authenticate_SERVICE_ACCOUNT(service_acct_email, private_key_path):
    """ Generic authentication through a service account.
    Args:
        service_acct_email: The service account email associated
            with the private key
        private_key_path: The path to the private key file
    """
    from oauth2client.client import SignedJwtAssertionCredentials
    with open(private_key_path, 'rb') as pk_file:
        key = pk_file.read()
    credentials = SignedJwtAssertionCredentials(
        service_acct_email,
        key,
        scope='https://www.googleapis.com/auth/bigquery')
    http = httplib2.Http()
    auth_http = credentials.authorize(http)
    return discovery.build('bigquery', 'v2', http=auth_http)


def create_query(number_of_days_ago):
    """ Create a query
    Args:
        number_of_days_ago: Default value of 1 gets yesterday's data
    """
    q = 'SELECT xxxxxxxxxx'
    return q


def translate_row(row, schema):
    """Apply the given schema to the given BigQuery data row.
    Args:
        row: A single BigQuery row to transform.
        schema: The BigQuery table schema to apply to the row, specifically
            the list of field dicts.
    Returns:
        Dict containing keys that match the schema and values that match
        the row.
    Adapted from the bigquery client
    https://github.com/tylertreat/BigQuery-Python/blob/master/bigquery/client.py
    """
    log = {}
    #pdb.set_trace()
    # Match each schema column with its associated row value
    for index, col_dict in enumerate(schema):
        col_name = col_dict['name']
        row_value = row['f'][index]['v']
        if row_value is None:
            log[col_name] = None
            continue
        # Cast the value for some types
        if col_dict['type'] == 'INTEGER':
            row_value = int(row_value)
        elif col_dict['type'] == 'FLOAT':
            row_value = float(row_value)
        elif col_dict['type'] == 'BOOLEAN':
            row_value = row_value in ('True', 'true', 'TRUE')
        log[col_name] = row_value
    return log


def extractResult(queryReply):
    """ Extract a result from the query reply. Uses schema and rows to translate.
    Args:
        queryReply: the object returned by bigquery
    """
    #pdb.set_trace()
    result = []
    schema = queryReply.get('schema', {'fields': None})['fields']
    rows = queryReply.get('rows', [])
    for row in rows:
        result.append(translate_row(row, schema))
    return result


def writeToCsv(results, filename, ordered_fieldnames, withHeader=True):
    """ Create a csv file from a list of rows.
    Args:
        results: list of rows of data (first row is assumed to be a header)
        ordered_fieldnames: a dict with names of fields in order desired - names must exist in results header
        withHeader: a boolean to indicate whether to write out header -
            Set to false if you are going to append data to existing csv
    """
    try:
        the_file = open(filename, "w")
        writer = DictWriter(the_file, fieldnames=ordered_fieldnames)
        if withHeader:
            writer.writeheader()
        writer.writerows(results)
        the_file.close()
    except:
        print "Unexpected error:", sys.exc_info()[0]
        raise


def runSyncQuery(client, projectId, query, timeout=0):
    results = []
    try:
        print 'timeout:%d' % timeout
        jobCollection = client.jobs()
        queryData = {'query': query,
                     'timeoutMs': timeout}
        queryReply = jobCollection.query(projectId=projectId,
                                         body=queryData).execute()
        jobReference = queryReply['jobReference']

        # Timeout exceeded: keep polling until the job is complete.
        while (not queryReply['jobComplete']):
            print 'Job not yet complete...'
            queryReply = jobCollection.getQueryResults(
                projectId=jobReference['projectId'],
                jobId=jobReference['jobId'],
                timeoutMs=timeout).execute()

        # If the result has rows, print the rows in the reply.
        if ('rows' in queryReply):
            #print 'has a rows attribute'
            #pdb.set_trace();
            result = extractResult(queryReply)
            results.extend(result)
            currentPageRowCount = len(queryReply['rows'])

            # Loop through each page of data
            while ('rows' in queryReply and currentPageRowCount < int(queryReply['totalRows'])):
                queryReply = jobCollection.getQueryResults(
                    projectId=jobReference['projectId'],
                    jobId=jobReference['jobId'],
                    startIndex=currentRow).execute()
                if ('rows' in queryReply):
                    result = extractResult(queryReply)
                    results.extend(result)
                    currentRow += len(queryReply['rows'])
    except AccessTokenRefreshError:
        print ("The credentials have been revoked or expired, please re-run"
               "the application to re-authorize")
    except HttpError as err:
        print 'Error in runSyncQuery:', pprint.pprint(err.content)
    except Exception as err:
        print 'Undefined error' % err
    return results


# Main
if __name__ == '__main__':
    # Name of file
    FILE_NAME = "results.csv"
    # Default prior number of days to run query
    NUMBER_OF_DAYS = "1"
    # BigQuery project id as listed in the Google Developers Console.
    PROJECT_ID = 'xxxxxx'
    # Service account email address as listed in the Google Developers Console.
    SERVICE_ACCOUNT = 'xxxxxx@developer.gserviceaccount.com'
    KEY = "/usr/local/xxxxxxxx"

    query = create_query(NUMBER_OF_DAYS)

    # Authenticate
    client = authenticate_SERVICE_ACCOUNT(SERVICE_ACCOUNT, KEY)

    # Get query results
    results = runSyncQuery(client, PROJECT_ID, query, timeout=0)
    #pdb.set_trace();

    # Write results to csv without header
    ordered_fieldnames = OrderedDict([('f_split', None), ('m_members', None), ('f_day', None), ('visitors', None), ('purchasers', None), ('demand', None), ('dmd_per_mem', None), ('visitors_per_mem', None), ('purchasers_per_visitor', None), ('dmd_per_purchaser', None)])
    writeToCsv(results, FILE_NAME, ordered_fieldnames, False)

    # Backup current data
    backupfilename = "data_bk-" + time.strftime("%y-%m-%d") + ".csv"
    call(['cp', '../data/data.csv', backupfilename])

    # Concatenate new results to data
    with open("../data/data.csv", "ab") as outfile:
        with open("results.csv", "rb") as infile:
            line = infile.read()
            outfile.write(line)
You mention that these come from aggregate sums of floating point data. As Felipe mentioned, floating point is awkward; it violates some of the mathematical identities that we tend to assume.
In this case, the associative property is the one that bites us. That is, usually (A+B)+C == A+(B+C). However, in floating point math, this isn't the case. Each operation is an approximation; you can see this better if you wrap with an 'approx' function: approx(approx(A+B) + C) is clearly different from approx(A + approx(B+C)).
If you think about how bigquery computes aggregates, it builds an execution tree, and computes the value to be aggregated at the leaves of the tree. As those answers are ready, they're passed back up to the higher levels of the tree and aggregated (let's say they're added). The "when they're ready" part makes it non-deterministic.
A node may get results back in the order A, B, C the first time and C, A, B the second time. This means that the order of the additions will change, since you'll get approx(approx(A + B) + C) the first time and approx(approx(C + A) + B) the second time. Note that since we're dealing with ordering, it may look like the commutative property is the problematic one, but it isn't; A + B in floating point math is the same as B + A. The problem is really that you're adding partial results, and that addition isn't associative.
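A quick way to see this in plain Python (the numbers are only an illustrative example; nothing here is BigQuery-specific):
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 - same operands, different grouping, different result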
Floating point math has all sorts of nasty properties and should usually be avoided if you rely on precision.
Assume floating point is non-deterministic:
https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/
“the IEEE standard does not guarantee that the same program will
deliver identical results on all conforming systems.”

Iterating tweets and saving them in a shapefile with tweepy and shapefile

I am searching for tweets and want to save them in a shapefile. Iterating through the tweets goes well, and when I use print statements I get exactly what I want. I am attempting to put these tweets into a point shapefile, but for some reason it does not accept the writing inside the if statement as I iterate. So how do I iterate through my tweets and save them one by one in my point shapefile, with only the tweet.text and tweet.id attached?
I got inspired by looking at the following link: https://code.google.com/p/pyshp/
import tweepy
import shapefile

consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."

auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweetsaspoints = shapefile.Writer(shapefile.POINT)

page = 1
while True:
    statuses = api.search(q="*", count=1000, geocode="52.015106,5.394287,150km")
    if statuses:
        for status in statuses:
            print status.geo
            tweetsaspoints._shapes.extend([status.geo['coordinates']])
            tweetsaspoints.records.extend([("TEXT", "Test")])
    else:
        # All done
        break
    page += 1  # next page

tweetsaspoints.save('shapefiles/test/point')
I do not understand the page part. I seem to iterate through the same tweets over and over again. Also, I am not succeeding in writing my coordinates and data to a point shapefile.
As per documentation, try to use:
for status in tweepy.Cursor(api.user_timeline).items():
    # process status here
    process_status(status)
Alternative:
page = 1
while True:
    statuses = api.user_timeline(page=page)
    if statuses:
        for status in statuses:
            # process status here
            process_status(status)
    else:
        # All done
        break
    page += 1  # next page
reference: http://docs.tweepy.org/en/latest/cursor_tutorial.html
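Applied to the search in your question, a rough, untested sketch could look like the following. It assumes the pyshp 1.x Writer API (field/point/record/save, which your code already uses) and that status.geo['coordinates'] comes back as [lat, lon], which is how Twitter returns the geo field:
import tweepy
import shapefile

# auth and api set up as in the question
tweetsaspoints = shapefile.Writer(shapefile.POINT)
tweetsaspoints.field('ID', 'C', '25')      # tweet id stored as text
tweetsaspoints.field('TEXT', 'C', '140')   # tweet text

for status in tweepy.Cursor(api.search, q="*",
                            geocode="52.015106,5.394287,150km").items(1000):
    if status.geo:                          # only tweets with a point geometry
        lat, lon = status.geo['coordinates']
        tweetsaspoints.point(lon, lat)      # shapefile points are x=lon, y=lat
        tweetsaspoints.record(str(status.id), status.text.encode('utf-8'))

tweetsaspoints.save('shapefiles/test/point')
This writes one point and one matching record per geotagged tweet, which keeps the shapes and attribute records aligned instead of extending the writer's internal lists directly.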