Here is my code:
import urllib
import json
import csv
apiKey = "MY_KEY" # Google API credentials
##perform a text search based on input, place results in text-search-results.json
print "Starting"
myfile = open("results.csv","wb")
headers = []
headers.append(['Search','Name','Address','Phone','Website','Type','Google ID','Rating','Permanently Closed'])
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerows(headers)
with open('input_file.csv', 'rb') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in filereader:
        search = ', '.join(row)
        search.replace(' ', '+')
        url1 = "https://maps.googleapis.com/maps/api/place/textsearch/json?query=%s&key=%s" % (search, apiKey)
        urllib.urlretrieve(url1, "text-search-results.json")
        print "SEARCH", search
        print "Google Place URL", url1

        ## load text-search-results.json and get the list of place IDs
        textSearchResults = json.load(open("text-search-results.json"))
        listOfPlaceIds = []
        for item in textSearchResults["results"]:
            listOfPlaceIds.append(str(item["place_id"]))

        ## open a nested list for the results
        output = []

        ## iterate through and download a JSON for each place ID
        for ids in listOfPlaceIds:
            url = "https://maps.googleapis.com/maps/api/place/details/json?placeid=%s&key=%s" % (ids, apiKey)
            fn = ids + "-details.json"
            urllib.urlretrieve(url, fn)
            data = json.load(open(fn))
            lineToAppend = []
            lineToAppend.append(search)
            try:
                lineToAppend.append(str(data["result"]["name"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["formatted_address"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["formatted_phone_number"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["website"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["types"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["place_id"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["rating"]))
            except KeyError:
                lineToAppend.append('')
            try:
                lineToAppend.append(str(data["result"]["permanently_closed"]))
            except KeyError:
                lineToAppend.append('')
        output.append(lineToAppend)
        wr.writerows(output)
myfile.close()
This takes the search terms from one column of input_file.csv and runs each search through the Google Places API. However, when I have multiple search terms, only the last search's results appear in results.csv. I am not sure why this happens, since the script reads all of the search terms and runs each of them, yet only the last result is returned. Any suggestions?
You are only writing out the last line because you reset the variable lineToAppend inside your for loop, but you never append it to output inside that loop. The append only happens after the loop finishes, so only the final lineToAppend is ever written out.
Currently it looks like this (shortened for brevity):
for ids in listOfPlaceIds:
    url = "https://maps.googleapis.com/maps/api/place/details/json?placeid=%s&key=%s" % (ids, apiKey)
    fn = ids + "-details.json"
    urllib.urlretrieve(url, fn)
    data = json.load(open(fn))
    lineToAppend = []
    lineToAppend.append(search)
    ...
output.append(lineToAppend)
wr.writerows(output)
Whereas it should be:
for ids in listOfPlaceIds:
    url = "https://maps.googleapis.com/maps/api/place/details/json?placeid=%s&key=%s" % (ids, apiKey)
    fn = ids + "-details.json"
    urllib.urlretrieve(url, fn)
    data = json.load(open(fn))
    lineToAppend = []
    lineToAppend.append(search)
    ...
    output.append(lineToAppend)
wr.writerows(output)
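One side note on the posted script, separate from the placement fix above: search.replace(' ', '+') returns a new string rather than modifying search in place, so the query is sent with its spaces and other special characters unencoded. A minimal sketch of how the term could be encoded before building the URL, assuming Python 2's urllib.quote_plus:

search = ', '.join(row)
encoded_search = urllib.quote_plus(search)  # spaces become '+', other characters are percent-encoded
url1 = "https://maps.googleapis.com/maps/api/place/textsearch/json?query=%s&key=%s" % (encoded_search, apiKey)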
I'm trying to write a small crawler to crawl multiple wikipedia pages.
I want to make the crawl somewhat dynamic by concatenating the hyperlink for the exact wikipage from a file which contains a list of names.
For example, the first line of "deutsche_Schauspieler.txt" says "Alfred Abel", and the concatenated string would be "https://de.wikipedia.org/wiki/Alfred Abel". Using the txt file results in heading being None, yet when I hard-code the same link as a string inside the script, it works.
This is for Python 2.x.
I already tried to switch from " to ',
tried + instead of %s,
tried to put the whole string into the txt file (so that the first line reads "http://..." instead of "Alfred Abel"),
and tried to switch from "Alfred Abel" to "Alfred_Abel".
from bs4 import BeautifulSoup
import requests

file = open("test.txt", "w")
f = open("deutsche_Schauspieler.txt", "r")
content = f.readlines()
for line in content:
    link = "https://de.wikipedia.org/wiki/%s" % (str(line))
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html)
    heading = soup.find(id='Vorlage_Personendaten')
    uls = heading.find_all('td')
    for item in uls:
        file.write(item.text.encode('utf-8') + "\n")
f.close()
file.close()
I expect to get the content of the table "Vorlage_Personendaten", which actually works if I change the link assignment to
link = "https://de.wikipedia.org/wiki/Alfred Abel"
# link = "https://de.wikipedia.org/wiki/Alfred_Abel" also works
But I want it to work using the text file.
It looks like the problem is in your text file, where you have used "Alfred Abel" with quotes; that is why you are getting the following exception:
uls = heading.find_all('td')
AttributeError: 'NoneType' object has no attribute 'find_all'
Please remove the quotes around "Alfred Abel" and use Alfred Abel inside the text file deutsche_Schauspieler.txt. It will work as expected.
I found the solution myself.
Although there are no additional lines in the file, the content array displays as ['Alfred Abel\n'], while printing the first index of the array shows 'Alfred Abel'. The value actually used is the string from the array, trailing newline included, which produces a broken link.
So you want to remove the last character (the newline) from the current line.
A solution could look like this:
from bs4 import BeautifulSoup
import requests

file = open("test.txt", "w")
f = open("deutsche_Schauspieler.txt", "r")
content = f.readlines()
print (content)
for line in content:
    line = line[:-1]  # strip the trailing newline character
    link = "https://de.wikipedia.org/wiki/%s" % str(line)
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    try:
        heading = soup.find(id='Vorlage_Personendaten')
        uls = heading.find_all('td')
        for item in uls:
            file.write(item.text.encode('utf-8') + "\n")
    except:
        print ("That did not work")
        pass
f.close()
file.close()
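A small refinement worth mentioning: line[:-1] drops a real character whenever the last line of the file has no trailing newline, while str.rstrip('\n') only removes a newline that is actually there. A minimal sketch of the same loop using it:

for line in content:
    title = line.rstrip('\n')  # removes the trailing newline only if present
    link = "https://de.wikipedia.org/wiki/%s" % title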
I'm reading a file outputfromextract and I want to split the contents of that file on the delimiter ',', which I have done.
When reading the contents into a list, there are two 'faff' entries at the beginning that I'm trying to remove, but I find myself unable to remove them by index.
import json

class device:
    ipaddress = None
    macaddress = None

    def __init__(self, ipaddress, macaddress):
        self.ipaddress = ipaddress
        self.macaddress = macaddress

listofItems = []
listofdevices = []

def format_the_data():
    file = open("outputfromextract")
    contentsofFile = file.read()
    individualItem = contentsofFile.split(',')
    listofItems.append(individualItem)
    print(listofItems[0][0:2])  # this here displays the entries I want to remove
    listofItems.remove[0[0:2]]  # fails here and raises a TypeError (int object not subscriptable)
In the file I have created the first three lines are enclosed below for reference:
[u' #created by system\n', u'time at 12:05\n', u'192.168.1.1\n',...
I want to simply remove those two items from the list; the rest will be passed to a constructor.
I believe listofItems.remove[0[0:2]] should be listofItems.remove[0][0:2].
But, slicing will be much easier, for example:
with open("outputfromextract") as f:
contentsofFile = f.read()
individualItem = contentsofFile.split(',')[2:]
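If you would rather drop the two entries in place instead of re-slicing, del also works on a slice; a one-line sketch using the names from the question:

del listofItems[0][0:2]  # removes the first two entries from the nested list in place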
I wrote this code to get the full list of twitter account followers using Tweepy:
# ... twitter connection and streaming
fulldf = pd.DataFrame()
line = {}
ids = []
try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
        df = pd.DataFrame()
        ids.extend(page)
        try:
            for i in ids:
                user = api.get_user(i)
                line = [{'id': user.id,
                         'Name': user.name,
                         'Statuses Count': user.statuses_count,
                         'Friends Count': user.friends_count,
                         'Screen Name': user.screen_name,
                         'Followers Count': user.followers_count,
                         'Location': user.location,
                         'Language': user.lang,
                         'Created at': user.created_at,
                         'Time zone': user.time_zone,
                         'Geo enable': user.geo_enabled,
                         'Description': user.description.encode(sys.stdout.encoding, errors='replace')}]
                df = pd.DataFrame(line)
                fulldf = fulldf.append(df)
                del df
                fulldf.to_csv('out.csv', sep=',', index=False)
                print i, len(ids)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
except tweepy.TweepError as e2:
    print "exception global block"
    print e2.message[0]['code']
    print e2.args[0][0]['code']
At the end I have only 1000 lines in the CSV file. It's not the best solution to keep everything in memory (a DataFrame) and write it to the file inside the same loop, but at least I have something that works. Still, I'm not getting the full list, just 1000 out of 15000 followers.
Any help with this will be appreciated.
Consider the following part of your code:
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    df = pd.DataFrame()
    ids.extend(page)
    try:
        for i in ids:
            user = api.get_user(i)
As you use extend for each page, you simply add the new set of ids onto the end of your list of ids. The way you have nested your for statements means that with every new page you return, you call get_user for all of the previous pages first - as such, when you hit the final page of ids you would still be working through the first 1000 or so when you hit the rate limit, with no more pages to browse. You are also likely hitting the rate limit for your cursor, which would be why you are seeing the exception.
Let's start over a bit.
Firstly, tweepy can deal with rate limits (one of the main error sources) for you when you create your API if you use wait_on_rate_limit. This solves a whole bunch of problems, so we'll do that.
Secondly, if you use lookup_users, you can look up 100 user objects per request. I've written about this in another answer so I've taken the method from there.
Finally, we don't need to create a dataframe or export to a csv until the very end. If we get a list of user information dictionaries, this can quickly change to a DataFrame with no real effort from us.
Here is the full code - you'll need to sub in your keys and the username of the user you actually want to look up, but other than that it hopefully will work!
import tweepy
import pandas as pd


def lookup_user_list(user_id_list, api):
    full_users = []
    users_count = len(user_id_list)
    try:
        for i in range((users_count / 100) + 1):
            print i
            full_users.extend(api.lookup_users(user_ids=user_id_list[i * 100:min((i + 1) * 100, users_count)]))
        return full_users
    except tweepy.TweepError:
        print 'Something went wrong, quitting...'

consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    ids.extend(page)

results = lookup_user_list(ids, api)

all_users = [{'id': user.id,
              'Name': user.name,
              'Statuses Count': user.statuses_count,
              'Friends Count': user.friends_count,
              'Screen Name': user.screen_name,
              'Followers Count': user.followers_count,
              'Location': user.location,
              'Language': user.lang,
              'Created at': user.created_at,
              'Time zone': user.time_zone,
              'Geo enable': user.geo_enabled,
              'Description': user.description}
             for user in results]

df = pd.DataFrame(all_users)
df.to_csv('All followers.csv', index=False, encoding='utf-8')
I don't know why the Python script below will not crawl the glassdoor.com website.
from bs4 import BeautifulSoup  # documentation available at: www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import NavigableString, Tag
import requests  # To send http requests and access the page: docs.python-requests.org/en/latest/
import csv  # To create the output csv file
import unicodedata  # To work with the string encoding of the data

entries = []
entry = []
urlnumber = 1  # Give the page number to start with

while urlnumber < 100:  # Give the page number to end with
    #print type(urlnumber), urlnumber
    url = 'http://www.glassdoor.com/p%d' % (urlnumber,)  # Give the url of the forum, excluding the page number in the hyperlink
    #print url
    try:
        r = requests.get(url, timeout=10)  # Sending a request to access the page
    except Exception, e:
        print e.message
        break
    if r.status_code == 200:
        data = r.text
    else:
        print str(r.status_code) + " " + url
    soup = BeautifulSoup(data)  # Getting the page source into the soup
    for div in soup.find_all('div'):
        entry = []
        if(div.get('class') != None and div.get('class')[0] == 'Comment'):  # A single post is referred to as a comment. Each comment is a block denoted in a div tag which has a class called comment.
            ps = div.find_all('p')  # gets all the tags called p to a variable ps
            aas = div.find_all('a')  # gets all the tags called a to a variable aas
            spans = div.find_all('span')
            times = div.find_all('time')  # used to extract the time tag which gives the date of the post
            concat_str = ''
            for str in aas[1].contents:  # iterates over the contents between the tag start and end
                if str != "<br>" or str != "<br/>":  # This denotes breaks in the post which we need to work around.
                    concat_str = (concat_str + ' ' + str.encode('iso-8859-1')).strip()  # The encoding is because the extracted format is unicode. We need a uniform structure to work with the strings.
            entry.append(concat_str)
            concat_str = ''
            for str in times[0].contents:
                if str != "<br>" or str != "<br/>":
                    concat_str = (concat_str + ' ' + str.encode('iso-8859-1')).strip()
            entry.append(concat_str)
            #print "-------------------------"
            for div in div.find_all('div'):
                if (div.get('class') != None and div.get('class')[0] == 'Message'):  # Extracting the div tag with the class attribute Message.
                    blockqoutes = []
                    x = div.get_text()
                    for bl in div.find_all('blockquote'):
                        blockqoutes.append(bl.get_text())  # Blockquote holds the quote made by a person. get_text helps to eliminate the hyperlinks and pulls out only the data.
                        bl.decompose()
                    entry.append(div.get_text().replace("\n", " ").replace("<br/>", "").encode('ascii', 'replace').encode('iso-8859-1'))
                    for bl in blockqoutes:
                        #print bl
                        entry.append(bl.replace("\n", " ").replace("<br/>", "").encode('ascii', 'replace').encode('iso-8859-1'))
            #print entry
            entries.append(entry)
    urlnumber = urlnumber + 1  # increment so that we can extract the next page

with open('gd1.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', lineterminator='\n')
    writer.writerows(entries)
    print "Wrote to gd1.csv"
I fixed some errors in your script, but I guess that it doesn't print anything because you only get 405 (Method Not Allowed) responses for those pages.
Also, your previous try/except block didn't print the error message. Was that on purpose?
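For reference, here is a minimal sketch of the kind of check described above, reusing the variable names from the posted script; it logs the status code, skips pages that do not return 200, and prints the exception itself rather than just e.message:

try:
    r = requests.get(url, timeout=10)
except Exception as e:
    print e  # print the full exception, not only e.message
    break
if r.status_code != 200:
    print str(r.status_code) + " " + url  # e.g. "405 http://www.glassdoor.com/p1"
    urlnumber = urlnumber + 1
    continue  # skip this page instead of parsing data that was never set
data = r.text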
I have a Python app that runs a query on BigQuery and appends the results to a file. I've run this on a Mac workstation (Yosemite) and on a GCE instance (Ubuntu 14.1), and the floating point results differ. How can I make them the same? The Python environments are the same on both.
Run on the Google Cloud instance:
1120224,2015-04-06,23989,866,55159.71274162368,0.04923989554019882,0.021414467106578683,0.03609987911125933,63.69481840834143
54897577,2015-04-06,1188089,43462,2802473.708558333,0.051049132980100984,0.021641920553251377,0.03658143455582873,64.4810111950286
Run on the Mac workstation:
1120224,2015-04-06,23989,866,55159.712741623654,0.049239895540198794,0.021414467106578683,0.03609987911125933,63.694818408341405
54897577,2015-04-06,1188089,43462,2802473.708558335,0.05104913298010102,0.021641920553251377,0.03658143455582873,64.48101119502864
import sys
import pdb
import json
from collections import OrderedDict
from csv import DictWriter
from pprint import pprint
from apiclient import discovery
from oauth2client import tools
import functools
import argparse
import httplib2
import time
from subprocess import call


def authenticate_SERVICE_ACCOUNT(service_acct_email, private_key_path):
    """ Generic authentication through a service account.
    Args:
        service_acct_email: The service account email associated
            with the private key
        private_key_path: The path to the private key file
    """
    from oauth2client.client import SignedJwtAssertionCredentials
    with open(private_key_path, 'rb') as pk_file:
        key = pk_file.read()
    credentials = SignedJwtAssertionCredentials(
        service_acct_email,
        key,
        scope='https://www.googleapis.com/auth/bigquery')
    http = httplib2.Http()
    auth_http = credentials.authorize(http)
    return discovery.build('bigquery', 'v2', http=auth_http)


def create_query(number_of_days_ago):
    """ Create a query
    Args:
        number_of_days_ago: Default value of 1 gets yesterday's data
    """
    q = 'SELECT xxxxxxxxxx'
    return q


def translate_row(row, schema):
    """Apply the given schema to the given BigQuery data row.
    Args:
        row: A single BigQuery row to transform.
        schema: The BigQuery table schema to apply to the row, specifically
            the list of field dicts.
    Returns:
        Dict containing keys that match the schema and values that match
        the row.
    Adapted from the bigquery client
    https://github.com/tylertreat/BigQuery-Python/blob/master/bigquery/client.py
    """
    log = {}
    #pdb.set_trace()
    # Match each schema column with its associated row value
    for index, col_dict in enumerate(schema):
        col_name = col_dict['name']
        row_value = row['f'][index]['v']
        if row_value is None:
            log[col_name] = None
            continue
        # Cast the value for some types
        if col_dict['type'] == 'INTEGER':
            row_value = int(row_value)
        elif col_dict['type'] == 'FLOAT':
            row_value = float(row_value)
        elif col_dict['type'] == 'BOOLEAN':
            row_value = row_value in ('True', 'true', 'TRUE')
        log[col_name] = row_value
    return log


def extractResult(queryReply):
    """ Extract a result from the query reply. Uses schema and rows to translate.
    Args:
        queryReply: the object returned by bigquery
    """
    #pdb.set_trace()
    result = []
    schema = queryReply.get('schema', {'fields': None})['fields']
    rows = queryReply.get('rows', [])
    for row in rows:
        result.append(translate_row(row, schema))
    return result


def writeToCsv(results, filename, ordered_fieldnames, withHeader=True):
    """ Create a csv file from a list of rows.
    Args:
        results: list of rows of data (first row is assumed to be a header)
        ordered_fieldnames: a dict with names of fields in the order desired - names must exist in results header
        withHeader: a boolean to indicate whether to write out the header -
            set to False if you are going to append data to an existing csv
    """
    try:
        the_file = open(filename, "w")
        writer = DictWriter(the_file, fieldnames=ordered_fieldnames)
        if withHeader:
            writer.writeheader()
        writer.writerows(results)
        the_file.close()
    except:
        print "Unexpected error:", sys.exc_info()[0]
        raise


def runSyncQuery(client, projectId, query, timeout=0):
    results = []
    try:
        print 'timeout:%d' % timeout
        jobCollection = client.jobs()
        queryData = {'query': query,
                     'timeoutMs': timeout}
        queryReply = jobCollection.query(projectId=projectId,
                                         body=queryData).execute()
        jobReference = queryReply['jobReference']

        # Timeout exceeded: keep polling until the job is complete.
        while(not queryReply['jobComplete']):
            print 'Job not yet complete...'
            queryReply = jobCollection.getQueryResults(
                projectId=jobReference['projectId'],
                jobId=jobReference['jobId'],
                timeoutMs=timeout).execute()

        # If the result has rows, print the rows in the reply.
        if('rows' in queryReply):
            #print 'has a rows attribute'
            #pdb.set_trace();
            result = extractResult(queryReply)
            results.extend(result)
            currentPageRowCount = len(queryReply['rows'])

            # Loop through each page of data
            while('rows' in queryReply and currentPageRowCount < int(queryReply['totalRows'])):
                queryReply = jobCollection.getQueryResults(
                    projectId=jobReference['projectId'],
                    jobId=jobReference['jobId'],
                    startIndex=currentRow).execute()
                if('rows' in queryReply):
                    result = extractResult(queryReply)
                    results.extend(result)
                    currentRow += len(queryReply['rows'])
    except AccessTokenRefreshError:
        print ("The credentials have been revoked or expired, please re-run "
               "the application to re-authorize")
    except HttpError as err:
        print 'Error in runSyncQuery:', pprint.pprint(err.content)
    except Exception as err:
        print 'Undefined error' % err
    return results


# Main
if __name__ == '__main__':
    # Name of file
    FILE_NAME = "results.csv"
    # Default prior number of days to run query
    NUMBER_OF_DAYS = "1"
    # BigQuery project id as listed in the Google Developers Console.
    PROJECT_ID = 'xxxxxx'
    # Service account email address as listed in the Google Developers Console.
    SERVICE_ACCOUNT = 'xxxxxx@developer.gserviceaccount.com'
    KEY = "/usr/local/xxxxxxxx"

    query = create_query(NUMBER_OF_DAYS)

    # Authenticate
    client = authenticate_SERVICE_ACCOUNT(SERVICE_ACCOUNT, KEY)

    # Get query results
    results = runSyncQuery(client, PROJECT_ID, query, timeout=0)
    #pdb.set_trace();

    # Write results to csv without header
    ordered_fieldnames = OrderedDict([('f_split', None), ('m_members', None), ('f_day', None), ('visitors', None), ('purchasers', None), ('demand', None), ('dmd_per_mem', None), ('visitors_per_mem', None), ('purchasers_per_visitor', None), ('dmd_per_purchaser', None)])
    writeToCsv(results, FILE_NAME, ordered_fieldnames, False)

    # Backup current data
    backupfilename = "data_bk-" + time.strftime("%y-%m-%d") + ".csv"
    call(['cp', '../data/data.csv', backupfilename])

    # Concatenate new results to data
    with open("../data/data.csv", "ab") as outfile:
        with open("results.csv", "rb") as infile:
            line = infile.read()
            outfile.write(line)
You mention that these come from aggregate sums of floating point data. As Felipe mentioned, floating point is awkward; it violates some of the mathematical identities that we tend to assume.
In this case, the associative property is the one that bites us. That is, usually (A+B)+C == A+(B+C). However, in floating point math, this isn't guaranteed. Each operation is an approximation; you can see this better if you wrap each step with an 'approx' function: approx(approx(A+B) + C) can differ from approx(A + approx(B+C)).
If you think about how bigquery computes aggregates, it builds an execution tree, and computes the value to be aggregated at the leaves of the tree. As those answers are ready, they're passed back up to the higher levels of the tree and aggregated (let's say they're added). The "when they're ready" part makes it non-deterministic.
A node may get results back in the order A, B, C the first time and C, A, B the second time. This means that the order in which partial results are combined will change, since you'll get approx(approx(A + B) + C) the first time and approx(approx(C + A) + B) the second time. Note that since we're dealing with ordering, it may look like the commutative property is the problematic one, but it isn't; A + B in floating point math is the same as B + A. The problem is really that you're adding partial results, and that isn't associative.
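You can see the non-associativity directly with ordinary Python floats (the same IEEE 754 doubles BigQuery returns):

>>> (0.1 + 0.2) + 0.3
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)
0.6
>>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
False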
Floating point math has all sorts of nasty properties and should usually be avoided if you rely on precision.
Assume floating point is non-deterministic:
https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/
“the IEEE standard does not guarantee that the same program will
deliver identical results on all conforming systems.”
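If the immediate goal is just to confirm that the two runs agree up to accumulated rounding error, rather than expecting byte-identical CSV output, a tolerance comparison is usually enough. A rough sketch; the 1e-9 relative tolerance is an arbitrary choice, not anything prescribed by BigQuery:

def close_enough(a, b, rel_tol=1e-9):
    # True when a and b differ only by floating point noise
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

print close_enough(55159.71274162368, 55159.712741623654)  # values from the two runs above -> True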