How to page through QueryResults - python-2.7

I am getting results from BigQuery using the following code:
from google.oauth2 import service_account
from google.cloud import bigquery
credential = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_FILE)
scoped_credential = credential.with_scopes(BIG_QUERY_SCOPE)
client = bigquery.Client(project="XX-XX",credentials=scoped_credential)
query_results = client.run_sync_query(query_detail)
query_results.use_legacy_sql = False
query_results.run()
iterator = query_results.fetch_data()
rows = iterator.query_result.rows
But it only returns up to 50,000 rows. I tried to paginate while fetching the data, but could not figure out how to do it:
page_token = query_results.page_token
iterator = query_results.fetch_data(max_results=500, page_token=page_token)
I could not find out how to get the updated page_token.
Thanks,

I think you are close. Try running this code:
data = list(query_results.fetch_data()) # variable renamed from `iterator` to `data`
The management of page tokens is done automatically for you.
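To illustrate what "automatically" means here, the following is a minimal sketch of token-based paging, written against a hypothetical `fetch_page(page_token)` function rather than the real BigQuery client. `iter_all_rows`, `fake_fetch_page`, and the page size are assumptions made up for the example; the point is only that each response carries the token for the next page, and iteration stops when no token comes back.

```python
def iter_all_rows(fetch_page):
    """Yield every row from a paged source.

    fetch_page(page_token) is a hypothetical callable that returns
    (rows, next_page_token); next_page_token is None on the last page.
    """
    page_token = None
    while True:
        rows, page_token = fetch_page(page_token)
        for row in rows:
            yield row
        if page_token is None:
            break


# Stand-in for a remote API: serves 7 rows in pages of 3.
_DATA = list(range(7))
_PAGE_SIZE = 3


def fake_fetch_page(page_token):
    start = 0 if page_token is None else page_token
    end = start + _PAGE_SIZE
    next_token = end if end < len(_DATA) else None
    return _DATA[start:end], next_token


all_rows = list(iter_all_rows(fake_fetch_page))
print(all_rows)
```

Wrapping the iterator in `list(...)`, as in the answer, plays the role of the last two lines: it keeps requesting pages until the source is exhausted.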

Related

Can't connect to Online Sharepoint using Python

I'm trying to display the names of all SharePoint lists, but I'm getting this error:
No handlers could be found for logger "office365.runtime.auth.saml_token_provider.SamlTokenProvider._process_service_token_response"
This is my code :
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
url = 'https://abc.sharepoint.com/sites/siteName/'
ctx_auth = AuthenticationContext(url)
if ctx_auth.acquire_token_for_user(username='username@abc.com',
                                   password='password'):
    ctx = ClientContext(url, ctx_auth)
    lists = ctx.web.lists
    ctx.load(lists)
    ctx.execute_query()
    for l in lists:
        print(l.properties["Title"])
Thanks
I tested the code below with Python 2.7 and it works well.
from office365.runtime.auth.authentication_context import AuthenticationContext
from office365.sharepoint.client_context import ClientContext
tenant_url= "https://company.sharepoint.com"
site_url="https://company.sharepoint.com/sites/sname"
ctx_auth = AuthenticationContext(tenant_url)
if ctx_auth.acquire_token_for_user("abc@company.onmicrosoft.com", "mypassword"):
    ctx = ClientContext(site_url, ctx_auth)
    lists = ctx.web.lists
    ctx.load(lists)
    ctx.execute_query()
    for l in lists:
        print(l.properties["Title"])
else:
    print(ctx_auth.get_last_error())
If this is related to ADFS, please refer to this closed issue:
https://github.com/vgrem/Office365-REST-Python-Client/issues/85
BR
Well, I found a solution to get the data for a specific SharePoint list:
from shareplum import Site
from shareplum import Office365
import json
import csv
import pandas
authcookie = Office365('https://abc.sharepoint.com/', username='username', password='password').GetCookies()
site = Site('https://abc.sharepoint.com/sites/SitesName/', authcookie=authcookie)
sp_list = site.List('ListName')
#print(sp_list)
data = sp_list.GetListItems(fields=['FieldName1','FieldName2'])
c = pandas.read_json(json.dumps(data)).to_csv("output.csv")

Unable to insert data into existing BigQuery Table?

I am trying to insert some data into a BigQuery table that already exists, but the data never shows up in the table.
I tried the standard example provided by Google (insert_rows), but no luck. I have also looked at this issue: https://github.com/googleapis/google-cloud-python/issues/5539
I have also tried passing the data as a list of tuples, with the same result.
from google.cloud import bigquery
import datetime
bigquery_client = bigquery.Client()
dataset_ref = bigquery_client.dataset('my_dataset_id')
table_ref = dataset_ref.table('my_destination_table_id')
table = bigquery_client.get_table(table_ref)
rows_to_insert = [
{u'jobName': 'writetobigquery'},
{u'startDatetime': datetime.datetime.now().strftime('%Y-%m-%d-%H%M%S')},
{u'jobStatus': 'Success'},
{u'logMessage': 'NA'},
]
errors = bigquery_client.insert_rows(table, rows_to_insert)
When I execute this, I don't get an error, but nothing is written to the table. It would be great if anyone could suggest something that works. Thank you!
After making some modifications to your code, I could make it work as expected. I changed your rows from a list of dictionaries with one value each to a single dictionary that holds all the columns of one row. I also changed the datetime format, as it was invalid for BigQuery (valid formats can be found here). The following snippet should work fine:
from google.cloud import bigquery
import datetime
bigquery_client = bigquery.Client()
dataset_ref = bigquery_client.dataset('my_dataset_id')
table_ref = dataset_ref.table('my_destination_table_id')
table = bigquery_client.get_table(table_ref)
rows_to_insert = [
{u'jobName': 'writetobigquery',
u'startDatetime': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
u'jobStatus': 'Success',
u'logMessage': 'NA'}
]
errors = bigquery_client.insert_rows(table, rows_to_insert)
print "Errors occurred:", errors
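The reshaping step can be sketched without the BigQuery client. The helper name `merge_fields` is hypothetical, the column names are taken from the question, and the fixed timestamp is only there to make the example reproducible. The sketch shows nothing more than turning a list of one-key dictionaries into a single row dictionary and formatting the timestamp as `YYYY-MM-DD HH:MM:SS`.

```python
import datetime


def merge_fields(field_dicts):
    """Merge a list of one-key dicts into a single row dict."""
    row = {}
    for field in field_dicts:
        row.update(field)
    return row


started = datetime.datetime(2019, 1, 2, 3, 4, 5)  # fixed value for the example
fields = [
    {'jobName': 'writetobigquery'},
    {'startDatetime': started.strftime('%Y-%m-%d %H:%M:%S')},
    {'jobStatus': 'Success'},
    {'logMessage': 'NA'},
]
rows_to_insert = [merge_fields(fields)]  # one dict per table row
print(rows_to_insert)
```

With the rows in this shape, each dictionary corresponds to exactly one table row, and printing the `errors` value returned by `insert_rows` is how a rejected row would show up.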
Shouldn't your rows be a list of dictionaries? I assume your table schema is like jobName, startDatetime, jobStatus, logMessage, then:
rows_to_insert = [
{
u'jobName': 'writetobigquery',
u'startDatetime': datetime.datetime.now().strftime('%Y-%m-%d-%H%M%S'),
u'jobStatus': 'Success',
u'logMessage': 'NA'
}
]
errors = bigquery_client.insert_rows(table, rows_to_insert)

Bigquery : job is done but job.query_results().total_bytes_processed returns None

The following code :
import time
from google.cloud import bigquery
client = bigquery.Client()
query = """\
select 3 as x
"""
dataset = client.dataset('dataset_name')
table = dataset.table(name='table_name')
job = client.run_async_query('job_name_76', query)
job.write_disposition = 'WRITE_TRUNCATE'
job.destination = table
job.begin()
retry_count = 100
while retry_count > 0 and job.state != 'DONE':
    retry_count -= 1
    time.sleep(10)
    job.reload()
print job.state
print job.query_results().name
print job.query_results().total_bytes_processed
prints :
DONE
job_name_76
None
I do not understand why total_bytes_processed returns None because the job is done and the documentation says :
total_bytes_processed:
Total number of bytes processed by the query.
See
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#totalBytesProcessed
Return type: int, or NoneType
Returns: Count generated on the server (None until set by the server).
Looks like you are right. As you can see in the code, the current API does not expose the number of bytes processed.
This has been reported in this issue, and as you can see in tseaver's PR, the feature has already been implemented and awaits review and merging, so it will probably be released quite soon.
In the meantime, you can get the value from the _properties attribute of the job, like this:
from google.cloud.bigquery import Client
import types
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/key.json'
bc = Client()
query = 'your query'
job = bc.run_async_query('name', query)
job.begin()
wait_job(job) # helper not shown here; presumably it polls job.reload() until job.state == 'DONE'
query_results = job._properties['statistics'].get('query')
query_results should have the totalBytesProcessed you are looking for.
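`wait_job` is not defined in the snippet above. A minimal sketch of such a helper follows, assuming it simply polls `job.reload()` until `job.state` is `'DONE'`, as the loop in the question does. `FakeJob` is a made-up stand-in so the example runs without credentials, and the layout of its `_properties` dictionary is an assumption modelled on the `statistics.query.totalBytesProcessed` field of the REST job resource.

```python
import time


def wait_job(job, retries=100, delay=0.0):
    """Poll job.reload() until the job reports DONE or retries run out."""
    while retries > 0 and job.state != 'DONE':
        retries -= 1
        time.sleep(delay)
        job.reload()
    return job.state == 'DONE'


class FakeJob(object):
    """Hypothetical stand-in for a query job; becomes DONE after 3 reloads."""

    def __init__(self):
        self.state = 'RUNNING'
        self._reloads = 0
        self._properties = {}

    def reload(self):
        self._reloads += 1
        if self._reloads >= 3:
            self.state = 'DONE'
            # Assumed layout; the REST API reports int64 values as strings.
            self._properties = {
                'statistics': {'query': {'totalBytesProcessed': '0'}}}


job = FakeJob()
finished = wait_job(job)
# Chained .get() calls avoid a KeyError if the statistics are not set yet.
query_stats = job._properties.get('statistics', {}).get('query', {})
total_bytes = query_stats.get('totalBytesProcessed')
print('%s %s' % (finished, total_bytes))
```

Against a real job, a non-zero `delay` (the question uses 10 seconds) would keep the loop from hammering the API.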

BigQuery results run via python in Google Cloud don't match results running on MAC

I have a Python app that runs a query on BigQuery and appends the results to a file. I've run this on a Mac workstation (Yosemite) and on a GC instance (ubuntu 14.1), and the floating-point results differ. How can I make them the same? The Python environments are the same on both.
run on google cloud instance
1120224,2015-04-06,23989,866,55159.71274162368,0.04923989554019882,0.021414467106578683,0.03609987911125933,63.69481840834143
54897577,2015-04-06,1188089,43462,2802473.708558333,0.051049132980100984,0.021641920553251377,0.03658143455582873,64.4810111950286
run on mac workstation
1120224,2015-04-06,23989,866,55159.712741623654,0.049239895540198794,0.021414467106578683,0.03609987911125933,63.694818408341405
54897577,2015-04-06,1188089,43462,2802473.708558335,0.05104913298010102,0.021641920553251377,0.03658143455582873,64.48101119502864
import sys
import pdb
import json
from collections import OrderedDict
from csv import DictWriter
from pprint import pprint
from apiclient import discovery
from apiclient.errors import HttpError
from oauth2client import tools
from oauth2client.client import AccessTokenRefreshError
import functools
import argparse
import httplib2
import time
from subprocess import call
def authenticate_SERVICE_ACCOUNT(service_acct_email, private_key_path):
    """ Generic authentication through a service account.
    Args:
        service_acct_email: The service account email associated
            with the private key
        private_key_path: The path to the private key file
    """
    from oauth2client.client import SignedJwtAssertionCredentials
    with open(private_key_path, 'rb') as pk_file:
        key = pk_file.read()
    credentials = SignedJwtAssertionCredentials(
        service_acct_email,
        key,
        scope='https://www.googleapis.com/auth/bigquery')
    http = httplib2.Http()
    auth_http = credentials.authorize(http)
    return discovery.build('bigquery', 'v2', http=auth_http)
def create_query(number_of_days_ago):
    """ Create a query
    Args:
        number_of_days_ago: Default value of 1 gets yesterday's data
    """
    q = 'SELECT xxxxxxxxxx'
    return q
def translate_row(row, schema):
    """Apply the given schema to the given BigQuery data row.
    Args:
        row: A single BigQuery row to transform.
        schema: The BigQuery table schema to apply to the row, specifically
            the list of field dicts.
    Returns:
        Dict containing keys that match the schema and values that match
        the row.
    Adapted from bigquery client
    https://github.com/tylertreat/BigQuery-Python/blob/master/bigquery/client.py
    """
    log = {}
    #pdb.set_trace()
    # Match each schema column with its associated row value
    for index, col_dict in enumerate(schema):
        col_name = col_dict['name']
        row_value = row['f'][index]['v']
        if row_value is None:
            log[col_name] = None
            continue
        # Cast the value for some types
        if col_dict['type'] == 'INTEGER':
            row_value = int(row_value)
        elif col_dict['type'] == 'FLOAT':
            row_value = float(row_value)
        elif col_dict['type'] == 'BOOLEAN':
            row_value = row_value in ('True', 'true', 'TRUE')
        log[col_name] = row_value
    return log
def extractResult(queryReply):
    """ Extract a result from the query reply. Uses schema and rows to translate.
    Args:
        queryReply: the object returned by bigquery
    """
    #pdb.set_trace()
    result = []
    schema = queryReply.get('schema', {'fields': None})['fields']
    rows = queryReply.get('rows',[])
    for row in rows:
        result.append(translate_row(row, schema))
    return result
def writeToCsv(results, filename, ordered_fieldnames, withHeader=True):
    """ Create a csv file from a list of rows.
    Args:
        results: list of rows of data (first row is assumed to be a header)
        ordered_fieldnames: a dict with names of fields in order desired - names must exist in results header
        withHeader: a boolean to indicate whether to write out header -
            Set to false if you are going to append data to existing csv
    """
    try:
        the_file = open(filename, "w")
        writer = DictWriter(the_file, fieldnames=ordered_fieldnames)
        if withHeader:
            writer.writeheader()
        writer.writerows(results)
        the_file.close()
    except:
        print "Unexpected error:", sys.exc_info()[0]
        raise
def runSyncQuery(client, projectId, query, timeout=0):
    # Note: AccessTokenRefreshError (oauth2client.client) and HttpError
    # (apiclient.errors) must be imported at module level.
    results = []
    try:
        print 'timeout:%d' % timeout
        jobCollection = client.jobs()
        queryData = {'query': query,
                     'timeoutMs': timeout}
        queryReply = jobCollection.query(projectId=projectId,
                                         body=queryData).execute()
        jobReference = queryReply['jobReference']
        # Timeout exceeded: keep polling until the job is complete.
        while not queryReply['jobComplete']:
            print 'Job not yet complete...'
            queryReply = jobCollection.getQueryResults(
                projectId=jobReference['projectId'],
                jobId=jobReference['jobId'],
                timeoutMs=timeout).execute()
        # If the result has rows, collect the first page.
        if 'rows' in queryReply:
            #pdb.set_trace();
            results.extend(extractResult(queryReply))
            # Rows fetched so far; also the start index of the next page.
            currentRow = len(queryReply['rows'])
            # Loop through each remaining page of data
            while 'rows' in queryReply and currentRow < int(queryReply['totalRows']):
                queryReply = jobCollection.getQueryResults(
                    projectId=jobReference['projectId'],
                    jobId=jobReference['jobId'],
                    startIndex=currentRow).execute()
                if 'rows' in queryReply:
                    results.extend(extractResult(queryReply))
                    currentRow += len(queryReply['rows'])
    except AccessTokenRefreshError:
        print ("The credentials have been revoked or expired, please re-run "
               "the application to re-authorize")
    except HttpError as err:
        print 'Error in runSyncQuery:', err.content
    except Exception as err:
        print 'Undefined error: %s' % err
    return results
# Main
if __name__ == '__main__':
    # Name of file
    FILE_NAME = "results.csv"
    # Default prior number of days to run query
    NUMBER_OF_DAYS = "1"
    # BigQuery project id as listed in the Google Developers Console.
    PROJECT_ID = 'xxxxxx'
    # Service account email address as listed in the Google Developers Console.
    SERVICE_ACCOUNT = 'xxxxxx@developer.gserviceaccount.com'
    KEY = "/usr/local/xxxxxxxx"
    query = create_query(NUMBER_OF_DAYS)
    # Authenticate
    client = authenticate_SERVICE_ACCOUNT(SERVICE_ACCOUNT, KEY)
    # Get query results
    results = runSyncQuery(client, PROJECT_ID, query, timeout=0)
    #pdb.set_trace();
    # Write results to csv without header
    ordered_fieldnames = OrderedDict([('f_split',None),('m_members',None),('f_day',None),('visitors',None),('purchasers',None),('demand',None), ('dmd_per_mem',None),('visitors_per_mem',None),('purchasers_per_visitor',None),('dmd_per_purchaser',None)])
    writeToCsv(results, FILE_NAME, ordered_fieldnames, False)
    # Backup current data
    backupfilename = "data_bk-" + time.strftime("%y-%m-%d") + ".csv"
    call(['cp','../data/data.csv',backupfilename])
    # Concatenate new results to data
    with open("../data/data.csv", "ab") as outfile:
        with open("results.csv","rb") as infile:
            line = infile.read()
            outfile.write(line)
You mention that these come from aggregate sums of floating point data. As Felipe mentioned, floating point is awkward; it violates some of the mathematical identities that we tend to assume.
In this case, the associative property is the one that bites us. That is, usually (A+B)+C == A+(B+C). However, in floating point math, this isn't the case. Each operation is an approximation; you can see this better if you wrap with an 'approx' function: approx(approx(A+B) + C) is clearly different from approx(A + approx(B+C)).
If you think about how BigQuery computes aggregates, it builds an execution tree and computes the values to be aggregated at the leaves of the tree. When those answers are ready, they're passed back up to the higher levels of the tree and aggregated (let's say they're added). The "when they're ready" part makes it non-deterministic.
A node may get results back in the order A, B, C the first time and C, A, B the second time. This means that the grouping of the additions will change, since you'll get approx(approx(A + B) + C) the first time and approx(approx(C + A) + B) the second time. Note that since we're dealing with ordering, it may look like the commutative property is the problematic one, but it isn't; A+B in floating-point math is the same as B+A. The problem is really that you're adding partial results, and floating-point addition isn't associative.
Floating point math has all sorts of nasty properties and should usually be avoided if you rely on precision.
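The non-associativity is easy to reproduce in plain Python. The sketch below is not how BigQuery aggregates; it only shows that regrouping the same three additions changes the last digits, and that one way to get identical files on both machines is to round before writing. The choice of nine decimal places is an assumption for the example, applied to two of the values from the question.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # approx(approx(A + B) + C)
right = a + (b + c)  # approx(A + approx(B + C))
print(left == right)  # False: the grouping changes the rounding

# Addition is still commutative: swapping operands changes nothing.
print(a + b == b + a)  # True

# math.fsum tracks the rounding error, so its result does not depend
# on the order of the inputs.
print(math.fsum([a, b, c]) == math.fsum([c, a, b]))  # True

# Two values from the question that differ only in the last digits.
cloud_value = 55159.71274162368
mac_value = 55159.712741623654
print(cloud_value == mac_value)  # False
print('%.9f' % cloud_value == '%.9f' % mac_value)  # True after rounding
```

Rounding at output time hides the difference rather than removing it, so it is only appropriate when the discarded digits carry no meaning for the analysis.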
Assume floating point is non-deterministic:
https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/
“the IEEE standard does not guarantee that the same program will
deliver identical results on all conforming systems.”

Facebook problem + django

I am trying to write a Facebook app where a user can see the status history of his friends.
Everything seems to work fine until I try to save the status information in my DB.
Here is the code:
class UserStatus(models.Model):
    facebookid = models.IntegerField()
    time = models.IntegerField()
    status_msg = models.CharField(max_length=2000)
@facebook.require_login()
def canvas(request):
    # Get the User object
    user, created = FacebookUser.objects.get_or_create(id=request.facebook.uid)
    user_lastname = request.facebook.users.getInfo([request.facebook.uid], ['last_name'])[0]['last_name']
    query = "SELECT time,message FROM status WHERE uid=%s" % request.facebook.uid
    result = request.facebook.fql.query(query)
So result gives me all the status information.
My problem is that I get an error when I try to save it:
userstatus = UserStatus()
for item in result:
    userstatus.facebookid = request.facebook.uid
    userstatus.time = item.time
    userstatus.msg = item.message
    userstatus.save()
error:
Errors while loading page from application
Received HTTP error code 500 while loading
So how can I fix this?
Thanks.
First, you should check whether you are getting results from this:
result = request.facebook.fql.query(query)
Make sure that the results are in the format required by your model (uid is an integer, time is an integer, and message is a string).
Also make sure that result is a valid Python object and not a JSON string or object.
Remember that JSON text is not a Python object, so if result is JSON, convert it to a Python object like this:
import simplejson
result = simplejson.loads(result) # if result was a JSON string
result = simplejson.loads(simplejson.dumps(result)) # if result was a JSON object
Now check that result is a list of dictionaries such as {"time": 123456, "message": "xyz"}.
for item in result:
    userstatus = UserStatus()
    userstatus.facebookid = request.facebook.uid
    userstatus.time = item["time"]
    userstatus.status_msg = item["message"] # the model field is status_msg, not msg
    userstatus.save()
And you should not have any errors now.
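The conversion step can be sketched with the standard-library `json` module, which offers the same `loads`/`dumps` calls as `simplejson`. The payload below is invented for the example, and `to_status_rows` is a hypothetical helper that builds plain dictionaries instead of Django model instances so that it runs on its own. The `isinstance(result, str)` check assumes Python 3; under Python 2.7, `basestring` would be the equivalent.

```python
import json


def to_status_rows(result, uid):
    """Turn an FQL-style result into dicts matching the UserStatus fields."""
    if isinstance(result, str):
        result = json.loads(result)  # JSON text -> list of dicts
    rows = []
    for item in result:
        rows.append({
            'facebookid': int(uid),
            'time': int(item['time']),  # tolerate "123" as well as 123
            'status_msg': item['message'],
        })
    return rows


raw = '[{"time": 123456, "message": "xyz"}, {"time": "123999", "message": "abc"}]'
status_rows = to_status_rows(raw, uid='42')
print(status_rows)
```

In the view, each of these dictionaries could then be saved with something like `UserStatus(**row).save()`, which creates one model instance per status instead of reusing a single one.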