I'm looking to make an HTTP request to BigQuery via a Cloud Function within GCP, where the request passes a value, that value is used in the query, and another value is returned from a joined table. The SQL works in BigQuery, but I'm having issues getting a value returned when I apply it in the Cloud Function.
Here's the URL I'm calling with curl:
https://us-central1-something.cloudfunctions.net/something/something2?client_Id=823754783.2
Thanks in advance.
from flask import escape
from google.cloud import bigquery

def cors_enabled_function(request):
    if request.method == 'OPTIONS':
        headers = {
            'Access-Control-Allow-Origin': '*',
            'Access-Control-Allow-Methods': 'GET',
            'Access-Control-Allow-Headers': 'Content-Type',
            'Access-Control-Max-Age': '3600'
        }
        return ('', 204, headers)

client = bigquery.Client()

def something2(request):
    request_json = request.get_json(silent=True)
    request_args = request_args
    table = ['`audience-cookie.ecommerce.traffic` as traffic JOIN `audience-cookie.ecommerce.cardholder` as cardholder ON traffic.customerId = cardholder.customerId']
    QUERY = ('SELECT '+cardholder+' from `'+table+'` WHERE client_Id='+client_Id)
    try:
        query_job = client.query(QUERY)
        rows = query_job.result()
        row_list = []
        for row in rows:
            row_list.append(str(row[cardholder]))
        return("<p>"+"</p><p>".join(row(row_list) + "</p>"))
    except e:
        return(e)
I recommend printing your query; there are several mistakes:
Why are you using [] in the table variable definition?
You use a backquote ` in table and again when you insert table into the query, so you end up with a double backquote.
client_Id isn't defined in your code.
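For illustration only, here is a minimal sketch of how the handler could be rebuilt once those issues are fixed. It assumes the bigquery import from the snippet above, that client_Id lives on the traffic table, and that the column you want back is cardholder.cardholder (the question never shows either); binding a query parameter also avoids concatenating user input into the SQL:

def something2(request):
    # Read client_Id from the query string (hypothetical; adjust to how you call the function)
    client_id = request.args.get('client_Id')
    client = bigquery.Client()

    # One table expression as a plain string: no [] wrapper, no extra backquotes added later
    table = ('`audience-cookie.ecommerce.traffic` AS traffic '
             'JOIN `audience-cookie.ecommerce.cardholder` AS cardholder '
             'ON traffic.customerId = cardholder.customerId')

    # Bind client_Id as a query parameter instead of concatenating it into the SQL.
    # The selected column and the table holding client_Id are assumptions here.
    query = ('SELECT cardholder.cardholder FROM ' + table +
             ' WHERE traffic.client_Id = @client_id')
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter('client_id', 'STRING', client_id)])

    rows = client.query(query, job_config=job_config).result()
    return '<p>' + '</p><p>'.join(str(row[0]) for row in rows) + '</p>'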
I have ten records in a MySQL database and am using the fetchall() method.
Now I have a requirement to display all the database results as JSON using raw SQL queries in Django.
When I run the code below, it only shows the first record while the rest are not displayed.
I was wondering why I am getting only one JSON record despite using the fetchall() approach.
Here is the code:
from django.db import connection
from django.http import HttpResponse
import json

def read(request):
    sql = 'SELECT * from crud_posts'
    with connection.cursor() as cursor:
        cursor.execute(sql)
        output = cursor.fetchall()
    print(output[0])
    items = []
    for row in output:
        items.append({'id': row[0], 'title': row[1], 'content': row[2]})
        jsondata = json.dumps({'items': items})
        return HttpResponse(jsondata, content_type='application/json')
You are returning from inside the for loop, so you exit after the first iteration... fix your indentation:
from django.db import connection
from django.http import HttpResponse
import json

def read(request):
    sql = 'SELECT * from crud_posts'
    with connection.cursor() as cursor:
        cursor.execute(sql)
        output = cursor.fetchall()
    print(output[0])
    items = []
    for row in output:
        items.append({'id': row[0], 'title': row[1], 'content': row[2]})
    jsondata = json.dumps({'items': items})
    return HttpResponse(jsondata, content_type='application/json')
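As a side note, and purely as a sketch, Django's built-in JsonResponse can replace the manual json.dumps and content_type handling:

from django.db import connection
from django.http import JsonResponse

def read(request):
    with connection.cursor() as cursor:
        cursor.execute('SELECT * from crud_posts')
        output = cursor.fetchall()
    # Build the full list first, then let JsonResponse serialize the dict
    # and set the application/json content type.
    items = [{'id': row[0], 'title': row[1], 'content': row[2]} for row in output]
    return JsonResponse({'items': items})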
I am using the Python module called PyAthenaJDBC in order to query Athena using the provided JDBC driver.
Here is the link: https://pypi.python.org/pypi/PyAthenaJDBC/
I have been facing a persistent issue: I keep getting a Java error whenever I use the Athena connection twice in a row.
As a matter of fact, I was able to connect to Athena, show databases, create new tables and even query their content. I am building an application using Django and running its server to use Athena.
However, I am obliged to re-run the server in order for the Athena connection to work once again.
Here is a glimpse of the class I have built:
import os
import configparser
import pyathenajdbc

# Get AWS credentials for the moment
aws_config_file = '~/.aws/config'
Config = configparser.ConfigParser()
Config.read(os.path.expanduser(aws_config_file))
access_key_id = Config['default']['aws_access_key_id']
secret_key_id = Config['default']['aws_secret_access_key']

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
athena_jdbc_driver_path = BASE_DIR + "/lib/static/AthenaJDBC.jar"
log_path = BASE_DIR + "/lib/static/queries.log"


class PyAthenaLoader():
    def __init__(self):
        pyathenajdbc.ATHENA_JAR = athena_jdbc_driver_path

    def connecti(self):
        self.conn = pyathenajdbc.connect(
            s3_staging_dir="s3://aws-athena-query-results--us-west-2",
            access_key=access_key_id,
            secret_key=secret_key_id,
            # profile_name="default",
            # credential_file=aws_config_file,
            region_name="us-west-2",
            log_path=log_path,
            driver_path=athena_jdbc_driver_path
        )

    def databases(self):
        dbs = self.query("show databases;")
        return dbs

    def tables(self, database):
        tables = self.query("show tables in {0};".format(database))
        return tables

    def create(self):
        self.connecti()
        try:
            with self.conn.cursor() as cursor:
                cursor.execute(
                    """CREATE EXTERNAL TABLE IF NOT EXISTS sales4 (
                    Day_ID date,
                    Product_Id string,
                    Store_Id string,
                    Sales_Units int,
                    Sales_Cost float,
                    Currency string
                    )
                    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
                    WITH SERDEPROPERTIES (
                    'serialization.format' = '|',
                    'field.delim' = '|',
                    'collection.delimm' = 'undefined',
                    'mapkey.delim' = 'undefined'
                    ) LOCATION 's3://athena-internship/';
                    """)
                res = cursor.description
        finally:
            self.conn.close()
        return res

    def query(self, req):
        self.connecti()
        try:
            with self.conn.cursor() as cursor:
                cursor.execute(req)
                print(cursor.description)
                res = cursor.fetchall()
        finally:
            self.conn.close()
        return res

    def info(self):
        res = []
        for i in dir(pyathenajdbc):
            temp = i + ' = ' + str(getattr(pyathenajdbc, i))
            # print(temp)
            res.append(temp)
        return res
Example of usage:
from django.shortcuts import render
from . import jdbc  # the module containing PyAthenaLoader (import path assumed)

def test(request):
    athena = jdbc.PyAthenaLoader()
    res = athena.query('Select * from sales;')
    return render(request, 'test.html', {'data': res})
Works just fine! However, refreshing the page causes the error below.
Note that I am using a local .jar file: I thought that would solve the issue, but I was wrong.
Even if I remove the path of the JDBC driver and let the module download it from S3, the error persists:
File "/home/tewfikghariani/.virtualenvs/venv/lib/python3.4/site-packages/pyathenajdbc/connection.py", line 69, in init
ATHENA_CONNECTION_STRING.format(region=self.region_name, schema=schema_name), props)
jpype._jexception.java.sql.SQLExceptionPyRaisable:
java.sql.SQLException: No suitable driver found for
jdbc:awsathena://athena.us-west-2.amazonaws.com:443/hive/default/
Furthermore, when I run the module on its own, it works just fine.
When I set up multiple connections inside my view before rendering the template, that works just fine as well.
I guess the issue is related to the Django view: once one of the views has made a connection to Athena, the next connection is no longer possible and the error is raised unless I restart the server.
Any help? If other details are missing, I will provide them immediately.
Update:
After posting the issue on GitHub, the author solved the problem and released a new version that works perfectly.
It was a multi-threading problem with JPype.
Question answered!
Ref: https://github.com/laughingman7743/PyAthenaJDBC/pull/8
I am using the script below to extract data from Google Analytics. Here I am extracting data for the last week. I want to automate the date range so that I don't have to change date_ranges every week.
I also want to avoid sampling of the data by GA. Please guide me on the correct way to automate this in detail.
__author__ = 'test@gmail.com (test)'

import argparse
import sys
import csv
import string
import datetime
import json
import time

from apiclient.errors import HttpError
from apiclient import sample_tools
from oauth2client.client import AccessTokenRefreshError

cam_name = sys.argv[1:]


class SampledDataError(Exception): pass


def main(argv):
    # Authenticate and construct service.
    service, flags = sample_tools.init(
        argv[0], 'analytics', 'v3', __doc__, __file__,
        scope='https://www.googleapis.com/auth/analytics.readonly')

    # Try to make a request to the API. Print the results or handle errors.
    try:
        profile_id = profile_ids[profile]
        if not profile_id:
            print('Could not find a valid profile for this user.')
        else:
            metrics = argv[1]
            dimensions = argv[2]
            reportName = argv[3]
            sort = argv[4]
            filters = argv[5]
            for start_date, end_date in date_ranges:
                limit = ga_query(service, profile_id, 0,
                                 start_date, end_date, metrics, dimensions, sort, filters).get('totalResults')
                for pag_index in range(0, limit, 10000):
                    results = ga_query(service, profile_id, pag_index,
                                       start_date, end_date, metrics, dimensions, sort, filters)
                    # if results.get('containsSampledData'):
                    #     raise SampledDataError
                    print_results(results, pag_index, start_date, end_date, reportName)

    except TypeError as error:
        # Handle errors in constructing a query.
        print('There was an error in constructing your query : %s' % error)

    except HttpError as error:
        # Handle API errors.
        print('Arg, there was an API error : %s : %s' %
              (error.resp.status, error._get_reason()))

    except AccessTokenRefreshError:
        # Handle Auth errors.
        print('The credentials have been revoked or expired, please re-run '
              'the application to re-authorize')

    except SampledDataError:
        # Force an error if ever a query returns data that is sampled!
        print('Error: Query contains sampled data!')


def ga_query(service, profile_id, pag_index, start_date, end_date, metrics, dimensions, sort, filters):
    return service.data().ga().get(
        ids='ga:' + profile_id,
        start_date=start_date,
        end_date=end_date,
        metrics=metrics,
        dimensions=dimensions,
        sort=sort,
        filters=filters,
        samplingLevel='HIGHER_PRECISION',
        start_index=str(pag_index + 1),
        max_results=str(pag_index + 10000)).execute()


def print_results(results, pag_index, start_date, end_date, reportName):
    """Prints out the results.

    This prints out the profile name, the column headers, and all the rows of
    data.

    Args:
        results: The response returned from the Core Reporting API.
    """
    # Write header on the first page of the first date range.
    if pag_index == 0:
        if (start_date, end_date) == date_ranges[0]:
            print('Profile Name: %s' % results.get('profileInfo').get('profileName'))
        columnHeaders = results.get('columnHeaders')
        cleanHeaders = [str(h['name']) for h in columnHeaders]
        writer.writerow(cleanHeaders)
        print(reportName, 'Now pulling data from %s to %s.' % (start_date, end_date))

    # Print data table.
    if results.get('rows', []):
        for row in results.get('rows'):
            for i in range(len(row)):
                old, new = row[i], str()
                for s in old:
                    new += s if s in string.printable else ''
                row[i] = new
            writer.writerow(row)
    else:
        print('No Rows Found')

    limit = results.get('totalResults')
    print(pag_index, 'of about', int(round(limit, -4)), 'rows.')
    return None


# Uncomment this line & replace with 'profile name': 'id' to query a single profile.
# Delete or comment out this line to loop over multiple profiles.

# Brands
profile_ids = {'abc-Mobile': '12345',
               'abc-Desktop': '23456',
               'pqr-Mobile': '34567',
               'pqr-Desktop': '45678',
               'xyz-Mobile': '56789',
               'xyz-Desktop': '67890'}

date_ranges = [
    ('2017-01-24', '2017-01-24'),
    ('2017-01-25', '2017-01-25'),
    ('2017-01-26', '2017-01-26'),
    ('2017-01-27', '2017-01-27'),
    ('2017-01-28', '2017-01-28'),
    ('2017-01-29', '2017-01-29'),
    ('2017-01-30', '2017-01-30')
]

for profile in sorted(profile_ids):
    print("Sequence 1", profile)
    with open('qwerty.json') as json_data:
        d = json.load(json_data)
    for getThisReport in d["Reports"]:
        print("Sequence 2", getThisReport["ReportName"])
        reportName = getThisReport["ReportName"]
        metrics = getThisReport["Metrics"]
        dimensions = getThisReport["Dimensions"]
        sort = getThisReport["sort"]
        filters = getThisReport["filter"]

        path = 'C:\\Projects\\DataExport\\test\\'  # replace with the path to the folder where the csv file will be written
        today = time.strftime('%Y%m%d')
        filename = profile + '_' + reportName + '_' + today + '.csv'  # replace with your filename
        with open(path + filename, 'wt') as f:
            writer = csv.writer(f, delimiter='|', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
            args = [sys.argv, metrics, dimensions, reportName, sort, filters]
            if __name__ == '__main__': main(args)
    print("Profile done. Next profile...")

print("All profiles done.")
The Core Reporting API supports some interesting things as far as dates go.
All Analytics data requests must specify a date range. If you do not include start-date and end-date parameters in the request, the server returns an error. Date values can be for a specific date by using the pattern YYYY-MM-DD or relative by using today, yesterday, or the NdaysAgo pattern. Values must match [0-9]{4}-[0-9]{2}-[0-9]{2}|today|yesterday|[0-9]+(daysAgo).
So you can do something like:
start_date = '7daysAgo'
end_date = 'today'
Just remember that data hasn't completed processing for 24 - 48 hours so your data for today, yesterday and the day before that may not be 100% accurate.
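If you prefer to keep the per-day date_ranges structure from the script above (querying one day at a time also keeps each request small, which helps avoid sampling), the list could be generated instead of hard-coded. A minimal sketch, assuming you want the last seven full days ending yesterday:

import datetime

def last_n_days(n=7):
    """Build (start, end) pairs for each of the last n full days, oldest first."""
    today = datetime.date.today()
    days = [today - datetime.timedelta(days=i) for i in range(n, 0, -1)]
    return [(d.isoformat(), d.isoformat()) for d in days]

# Replaces the hard-coded list; yesterday is the most recent day included,
# since the latest 24-48 hours of data may still be processing.
date_ranges = last_n_days(7)

Alternatively, a single ('7daysAgo', 'yesterday') pair works if you do not need the per-day split.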
I have a task like this in Django:
from celery import task
import subprocess, celery

@celery.task
def file(password, source12, destination):
    return subprocess.Popen(['sshpass', '-p', password, 'rsync', '-avz', '--info=progress2', source12, destination],
                            stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()[0]
This transfers a file from one server to another using rsync.
Here are my views:
def sync(request):
    """Sync the files into the server with the progress bar"""
    choices = request.POST.getlist('choice')
    for i in choices:
        new_source = source + "/" + i
        start_date1 = datetime.datetime.utcnow().replace(tzinfo=utc)
        source12 = new_source.replace(' ', '')  # Remove whitespaces
        result = file.delay(password, source12, destination)
        result.get()
        a = result.ready()
        start_date = start_date1.strftime("%B %d, %Y, %H:%M%p")
        extension = os.path.splitext(i)[1][1:]  # Get the file extension
        fullname = os.path.join(destination, i)  # Get the full file path to calculate size
        st = int(os.path.getsize(fullname))
        f_size = size(st, system=alternative)
I want to update a table in the database and show it to the user while the file is being transferred. How can I do that using django-celery?
There is nothing too special about Celery when it comes to Django. You can just update the database like you normally would. The only thing you may need to think about is transactions.
Just to be sure, I would recommend using either manual commits or autocommit to update the database. That said, I would suggest using Redis or Memcached instead of the database for this kind of status update; they are better suited for the purpose.
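For illustration, a minimal sketch of what that could look like from inside the task, using Django's cache (backed by Redis or Memcached) to hold the latest progress; transfer_id is a hypothetical key the view would use to poll the status:

from celery import shared_task
from django.core.cache import cache
import subprocess

@shared_task
def transfer_file(password, source12, destination, transfer_id):
    # Start rsync and read its output as it is produced so progress can be reported.
    proc = subprocess.Popen(
        ['sshpass', '-p', password, 'rsync', '-avz', '--info=progress2', source12, destination],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True)
    for line in proc.stdout:
        # Store the latest output line; the view (or an AJAX poll) reads it back with cache.get().
        cache.set('transfer_%s' % transfer_id, line.strip(), timeout=3600)
    proc.wait()
    cache.set('transfer_%s' % transfer_id, 'done', timeout=3600)
    return proc.returncode

Note that rsync's --info=progress2 rewrites a single line using carriage returns, so in practice you may need to read raw chunks and split on \r rather than iterating line by line.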