Writing to a table after transformation (bonobo-sqlalchemy) - bonobo-etl

I'm trying to read a table, modify a column and write to another table. I followed the available documentation and ran the following code. It doesn't give any errors, but the task doesn't get performed either.
If I remove the transformation step, the data does get written.
import sqlalchemy
import bonobo
import bonobo_sqlalchemy

def get_services():
    return {
        'sql_alchemy.engine': sqlalchemy.create_engine('postgresql://postgres:password@localhost:5432/postgres')
    }

def transform(*row):
    new_row = row[0] + 1, row[1]
    yield new_row

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        bonobo_sqlalchemy.Select('SELECT * FROM users', engine='sql_alchemy.engine'),
        transform,
        bonobo_sqlalchemy.InsertOrUpdate(table_name='table_1', engine='sql_alchemy.engine'),
    )
    return graph

# The __main__ block actually executes the graph.
if __name__ == '__main__':
    parser = bonobo.get_argument_parser()
    with bonobo.parse_args(parser) as options:
        bonobo.run(get_graph(**options), services=get_services(**options))
Output:
- Select in=1 out=6 [done]
- format_for_db in=6 out=6 [done]
- InsertOrUpdate in=6 out=6 [done]

It works when a dictionary is yielded from the transformation instead, as follows,
    yield {"id": row[0], "text": row[1], "count": row[2]}
with a bonobo.UnpackItems(0) node in the chain after the transformation, as in the sketch below.
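A minimal sketch of the working graph under that assumption (the column names id, text and count are taken from the dictionary above; untested):

import bonobo
import bonobo_sqlalchemy

def transform(*row):
    # Yield a mapping instead of a bare tuple so the target columns are named.
    yield {"id": row[0] + 1, "text": row[1], "count": row[2]}

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        bonobo_sqlalchemy.Select('SELECT * FROM users', engine='sql_alchemy.engine'),
        transform,
        bonobo.UnpackItems(0),  # unpack the dict so InsertOrUpdate receives named fields
        bonobo_sqlalchemy.InsertOrUpdate(table_name='table_1', engine='sql_alchemy.engine'),
    )
    return graph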

Related

How to mock boto3 calls when testing a function that calls boto3 in its body

I am trying to test a function called get_date_from_s3(bucket, table) using pytest. In this function, there is a boto3.client("s3").list_objects_v2() call that I would like to mock during testing, but I can't seem to figure out how to make this work.
Here is my directory setup:
my_project/
    glue/
        continuous.py
    tests/
        glue/
            test_continuous.py
            conftest.py
        conftest.py
The code in continuous.py will be executed as an AWS Glue job, but I am testing it locally.
my_project/glue/continuous.py
import boto3

def get_date_from_s3(bucket, table):
    s3_client = boto3.client("s3")
    result = s3_client.list_objects_v2(Bucket=bucket, Prefix="Foo/{}/".format(table))
    # [the actual thing I want to test]
    latest_date = datetime_date(1, 1, 1)
    output = None
    for content in result.get("Contents"):
        date = key.split("/")
        output = [some logic to get the latest date from the file name in s3]
    return output

def main(argv):
    date = get_date_from_s3(argv[1], argv[2])

if __name__ == "__main__":
    main(sys.argv[1:])
my_project/tests/glue/test_continuous.py
This is what I want: I want to test get_date_from_s3() by mocking the s3_client.list_objects_v2() call and explicitly setting its response value to example_response. I tried something like the below, but it doesn't work:
from glue import continuous
import mock

def test_get_date_from_s3(mocker):
    example_response = {
        "ResponseMetadata": "somethingsomething",
        "IsTruncated": False,
        "Contents": [
            {
                "Key": "/year=2021/month=01/day=03/some_file.parquet",
                "LastModified": "datetime.datetime(2021, 2, 5, 17, 5, 11, tzinfo=tzlocal())",
                ...
            },
            {
                "Key": "/year=2021/month=01/day=02/some_file.parquet",
                "LastModified": ...,
            },
            ...
        ]
    }
    mocker.patch(
        'continuous.boto3.client.list_objects_v2',
        return_value=example_response
    )
    expected = "20210102"
    actual = get_date_from_s3(bucket, table)
    assert actual == expected
Note
I noticed that a lot of mocking examples have the functions under test as part of a class. Because continuous.py is a Glue job, I didn't see the point of creating a class; I just have functions and a main() that calls them. Is that bad practice? It seems like mock decorators before functions are only used for functions that are part of a class.
I also read about moto, but couldn't figure out how to apply it here.
The idea with mocking and patching is that one wants to mock/patch something specific. So, to patch correctly, one has to specify exactly the thing to be mocked/patched. In the given example, the thing to be patched is located at: glue > continuous > boto3 > client instance > list_objects_v2.
As you pointed out, you would like calls to list_objects_v2() to give back prepared data. This means that you have to first mock "glue.continuous.boto3.client" and then, using that mock, stub "list_objects_v2".
In practice you need to do something along the lines of:
from glue import continuous_deduplicate
from unittest.mock import Mock, patch

@patch("glue.continuous.boto3.client")
def test_get_date_from_s3(mocked_client):
    mocked_response = Mock()
    mocked_response.return_value = { ... }
    mocked_client.list_objects_v2 = mocked_response
    # Run other setup and function under test:
In the end, I figured out that my patch target was wrong, thanks to @Gros Lalo. It should have been 'glue.continuous.boto3.client.list_objects_v'. That still didn't work, however; it threw the error AttributeError: <function client at 0x7fad6f1b2af0> does not have the attribute 'list_objects_v'.
So I did a little refactoring to wrap the whole boto3.client in a function that is easier to mock. Here is my new my_project/glue/continuous.py file:
import boto3

def get_s3_objects(bucket, table):
    s3_client = boto3.client("s3")
    return s3_client.list_objects_v2(Bucket=bucket, Prefix="Foo/{}/".format(table))

def get_date_from_s3(bucket, table):
    result = get_s3_objects(bucket, table)
    # [the actual thing I want to test]
    latest_date = datetime_date(1, 1, 1)
    output = None
    for content in result.get("Contents"):
        date = key.split("/")
        output = [some logic to get the latest date from the file name in s3]
    return output

def main(argv):
    date = get_date_from_s3(argv[1], argv[2])

if __name__ == "__main__":
    main(sys.argv[1:])
My new test_get_latest_date_from_s3() is therefore:
def test_get_latest_date_from_s3(mocker):
    example_response = {
        "ResponseMetadata": "somethingsomething",
        "IsTruncated": False,
        "Contents": [
            {
                "Key": "/year=2021/month=01/day=03/some_file.parquet",
                "LastModified": "datetime.datetime(2021, 2, 5, 17, 5, 11, tzinfo=tzlocal())",
                ...
            },
            {
                "Key": "/year=2021/month=01/day=02/some_file.parquet",
                "LastModified": ...,
            },
            ...
        ]
    }
    mocker.patch('glue.continuous_deduplicate.get_s3_objects', return_value=example_response)
    expected_date = "20190823"
    actual_date = continuous_deduplicate.get_latest_date_from_s3("some_bucket", "some_table")
    assert expected_date == actual_date
The refactoring worked out for me, but if there is a way to mock the list_objects_v2() directly without having to wrap it in another function, I am still interested!
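For the record, one way to stub list_objects_v2() directly (no wrapper) is to patch boto3.client where glue.continuous looks it up and configure the fake client it returns. A sketch using the pytest-mock mocker fixture against the original, unwrapped continuous.py; the bucket and table names are placeholders:

from glue import continuous

def test_get_date_from_s3_direct(mocker):
    example_response = {
        "Contents": [
            {"Key": "/year=2021/month=01/day=03/some_file.parquet"},
            {"Key": "/year=2021/month=01/day=02/some_file.parquet"},
        ]
    }
    # Patch the client factory where continuous.py looks it up; its return_value
    # stands in for the client object that boto3.client("s3") would create.
    fake_client = mocker.patch("glue.continuous.boto3.client").return_value
    fake_client.list_objects_v2.return_value = example_response

    continuous.get_date_from_s3("some_bucket", "some_table")

    fake_client.list_objects_v2.assert_called_once_with(
        Bucket="some_bucket", Prefix="Foo/some_table/"
    )
    # Assertions on the returned date go here once the parsing logic is in place.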
In order to achieve this result using moto, you would have to create the data normally using the boto3 SDK. In other words: create a test case that succeeds against AWS itself, and then slap the moto decorator on it.
For your use case, I imagine it looks something like:
from moto import mock_s3

@mock_s3
def test_glue():
    # create test data
    s3 = boto3.client("s3")
    for d in range(5):
        s3.put_object(Bucket="", Key=f"year=2021/month=01/day={d}/some_file.parquet", Body="asdf")
    # test
    result = get_date_from_s3(bucket, table)
    # assert result is as expected
    ...

prometheus unit test for Summary

I'm trying to write a unit test for a 'Summary', but I'm not sure which values I need to check.
from prometheus_client import Counter, Summary
import unittest
import time
from prometheus_client import REGISTRY

my_summary = Summary('my_summary', 'A useful help string.')

def my_function():
    time.sleep(1)
    my_summary.observe(5)

class TestMyFunction(unittest.TestCase):
    def test_metric_incremented(self):
        print 'here'
        before = REGISTRY.get_sample_value('my_summary')
        print 'summary before == ', before
        my_function()
        after = REGISTRY.get_sample_value('my_summary')
        print 'summary after == ', after
        self.assertEqual(0, after - before)

if __name__ == '__main__':
    unittest.main()
Here is my code; in the function I observe the value 5. I'm not sure if this is the right approach. Any test example would be great.
I tried following this blog: https://www.robustperception.io/how-to-unit-test-prometheus-instrumentation/
The time series you want are my_summary_count and my_summary_sum.
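For example, a test along those lines might look like this (a sketch using the default REGISTRY; the metric and function mirror the question):

import time
import unittest
from prometheus_client import REGISTRY, Summary

my_summary = Summary('my_summary', 'A useful help string.')

def my_function():
    time.sleep(1)
    my_summary.observe(5)

class TestMyFunction(unittest.TestCase):
    def test_summary_observed(self):
        count_before = REGISTRY.get_sample_value('my_summary_count') or 0
        sum_before = REGISTRY.get_sample_value('my_summary_sum') or 0
        my_function()
        # One observation was made, and it added 5 to the running sum.
        self.assertEqual(1, REGISTRY.get_sample_value('my_summary_count') - count_before)
        self.assertEqual(5, REGISTRY.get_sample_value('my_summary_sum') - sum_before)

if __name__ == '__main__':
    unittest.main()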

Bigquery : job is done but job.query_results().total_bytes_processed returns None

The following code :
import time
from google.cloud import bigquery

client = bigquery.Client()
query = """\
select 3 as x
"""
dataset = client.dataset('dataset_name')
table = dataset.table(name='table_name')
job = client.run_async_query('job_name_76', query)
job.write_disposition = 'WRITE_TRUNCATE'
job.destination = table
job.begin()

retry_count = 100
while retry_count > 0 and job.state != 'DONE':
    retry_count -= 1
    time.sleep(10)
    job.reload()

print job.state
print job.query_results().name
print job.query_results().total_bytes_processed
prints:
DONE
job_name_76
None
I do not understand why total_bytes_processed returns None, since the job is done and the documentation says:
total_bytes_processed:
Total number of bytes processed by the query.
See
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#totalBytesProcessed
Return type: int, or NoneType
Returns: Count generated on the server (None until set by the server).
Looks like you are right. As you can see in the code, the current version of the library does not populate the bytes-processed statistic.
This has been reported in this issue, and as you can see in tseaver's PR this feature has already been implemented and awaits review/merging, so we'll probably have this code in production quite soon.
In the meantime you could get the result from the _properties attribute of the job, like:
from google.cloud.bigquery import Client
import types
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/key.json'

bc = Client()
query = 'your query'
job = bc.run_async_query('name', query)
job.begin()
wait_job(job)

query_results = job._properties['statistics'].get('query')
query_results should have the totalBytesProcessed you are looking for.
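wait_job above is not part of the library; a minimal polling helper, mirroring the retry loop from the question (an assumption, not part of the original answer), could look like this:

import time

def wait_job(job, retries=100, delay=10):
    # Poll until the job reports DONE, refreshing its state from the server.
    while retries > 0 and job.state != 'DONE':
        retries -= 1
        time.sleep(delay)
        job.reload()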

Code is Not able to find my function in Python(Spark) class

I need some help with an error in my code. The code retrieves Zomato reviews, stores them in HDFS, reads them back and performs recommender analytics on them. The problem is that one of my functions is not being recognised in the PySpark code. I am not pasting the whole thing as it might be confusing, so I am writing a small, similar use case for easier understanding.
I am trying to read a file from local storage, convert the RDD to a dataframe, perform some operations, convert it back to an RDD, apply a map operation to delimit the fields with '|' and then save it to HDFS.
When I try to call self.filter_data(y) in the lambda of the check function, it is not recognised and gives me the error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
Can anyone help me understand why my filter_data function is not recognised? Do I need to add anything, or is something wrong with the way I am calling it? Please help me. Thanks in advance.
INPUT VALUE
starting
0|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/565/aed32fa2eb18bb4a5a3ba426870fd565.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/akellaram87?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|2.5|FFBA00|Well...|unknown|16946626|2017-08-01T00-25-43.455182Z|30059877|Have been here for a quick bite for lunch, ambience and everything looked good, food was okay but presentation was not very appealing. We or...|2017-04-15 16:38:38|Big Foodie|6|Venkata Ram Akella|akellaram87|Bad Food|0.969352505662|0|0|0|0|0|0|1|1|0|0|1|0|0|0.782388212399
ending
starting
1|0|ffae4f|0|https://b.zmtcdn.com/data/user_profile_pictures/4d1/d70d7a57e1bfdf296ff4db3d8daf94d1.jpg?fit=around%7C100%3A100&crop=100%3A100%3B%2A%2C%2A|https://www.zomato.com/users/sm4-2011696?utm_source=api_basic_user&utm_medium=api&utm_campaign=v2.1|1|CB202D|Avoid!|unknown|16946626|2017-08-01T00-25-43.455182Z|29123338|Giving a 1.0 rating because one cannot proceed with writing a review, without rating it. This restaurant deserves a 0 star rating. The qual...|2017-01-04 10:54:53|Big Foodie|4|Sm4|unknown|Bad Service|0.964402034541|0|1|0|0|0|0|0|1|0|0|0|1|0|0.814540622345
ending
My code:
if __name__ == '__main__':
    import os,logging,sys,time,pandas,json;from subprocess import PIPE,Popen,call;from datetime import datetime, time, timedelta
    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName('test')
    sc = SparkContext(conf = conf,pyFiles=['/bdaas/exe/nlu_project/spark_classifier.py','/bdaas/exe/spark_zomato/other_files/spark_zipcode.py','/bdaas/exe/spark_zomato/other_files/spark_zomato.py','/bdaas/exe/spark_zomato/conf_files/spark_conf.py','/bdaas/exe/spark_zomato/conf_files/date_comparision.py'])
    from pyspark.sql import Row, SQLContext, HiveContext
    from pyspark.sql.functions import lit
    sqlContext = HiveContext(sc)
    import sys,logging,pandas as pd
    import spark_conf
    n = new()
    n.check()

class new:
    def __init__(self):
        print 'entered into init'

    def check(self):
        data = sc.textFile('file:///bdaas/src/spark_dependencies/classifier_data/final_Output.txt').map(lambda x: x.split('|')).map(lambda z: Row(restaurant_id=z[0], rating = z[1], review_id = z[2],review_text = z[3],rating_color = z[4],rating_time_friendly=z[5],rating_text=z[6],time_stamp=z[7],likes=z[8],comment_count =z[9],user_name = z[10],user_zomatohandle=z[11],user_foodie_level = z[12],user_level_num=z[13],foodie_color=z[14],profile_url=z[15],profile_image=z[16],retrieved_time=z[17]))
        data_r = sqlContext.createDataFrame(data)
        data_r.show()
        d = data_r.rdd.collect()
        print d
        data_r.rdd.map(lambda x: list(x)).map(lambda y: self.filter_data(y)).collect()
        print data_r

    def filter_data(self, y):
        s = str()
        for i in y:
            print i.encode('utf-8')
            if i != '':
                s = s + i.encode('utf-8') + '|'
        print s[0:-1]
        return s[0:-1]
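As SPARK-5063 hints, a common workaround (an assumption here, not quoted from an answer) is to avoid referencing self inside the functions shipped to the executors, since serialising the bound method drags driver-side state along with it. A sketch of that refactor, moving the row-level logic to a plain module-level function:

def filter_data(y):
    # Plain function: the closure sent to the executors no longer captures self.
    return '|'.join(i.encode('utf-8') for i in y if i != '')

# ...and inside check():
#     data_r.rdd.map(lambda x: list(x)).map(filter_data).collect()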

BigQuery results run via python in Google Cloud don't match results running on MAC

I have a python app that runs a query on BigQuery and appends the results to a file. I've run this on a Mac workstation (Yosemite) and on a GC instance (Ubuntu 14.1), and the floating-point results differ. How can I make them the same? The python environments are the same on both.
run on google cloud instance
1120224,2015-04-06,23989,866,55159.71274162368,0.04923989554019882,0.021414467106578683,0.03609987911125933,63.69481840834143
54897577,2015-04-06,1188089,43462,2802473.708558333,0.051049132980100984,0.021641920553251377,0.03658143455582873,64.4810111950286
run on mac workstation
1120224,2015-04-06,23989,866,55159.712741623654,0.049239895540198794,0.021414467106578683,0.03609987911125933,63.694818408341405
54897577,2015-04-06,1188089,43462,2802473.708558335,0.05104913298010102,0.021641920553251377,0.03658143455582873,64.48101119502864
import sys
import pdb
import json
from collections import OrderedDict
from csv import DictWriter
from pprint import pprint
from apiclient import discovery
from oauth2client import tools
import functools
import argparse
import httplib2
import time
from subprocess import call
def authenticate_SERVICE_ACCOUNT(service_acct_email, private_key_path):
    """ Generic authentication through a service account.
    Args:
      service_acct_email: The service account email associated with the private key
      private_key_path: The path to the private key file
    """
    from oauth2client.client import SignedJwtAssertionCredentials
    with open(private_key_path, 'rb') as pk_file:
        key = pk_file.read()
    credentials = SignedJwtAssertionCredentials(
        service_acct_email,
        key,
        scope='https://www.googleapis.com/auth/bigquery')
    http = httplib2.Http()
    auth_http = credentials.authorize(http)
    return discovery.build('bigquery', 'v2', http=auth_http)

def create_query(number_of_days_ago):
    """ Create a query
    Args:
      number_of_days_ago: Default value of 1 gets yesterday's data
    """
    q = 'SELECT xxxxxxxxxx'
    return q
def translate_row(row, schema):
    """Apply the given schema to the given BigQuery data row.
    Args:
      row: A single BigQuery row to transform.
      schema: The BigQuery table schema to apply to the row, specifically
        the list of field dicts.
    Returns:
      Dict containing keys that match the schema and values that match
      the row.
    Adapted from the bigquery client:
    https://github.com/tylertreat/BigQuery-Python/blob/master/bigquery/client.py
    """
    log = {}
    #pdb.set_trace()
    # Match each schema column with its associated row value
    for index, col_dict in enumerate(schema):
        col_name = col_dict['name']
        row_value = row['f'][index]['v']
        if row_value is None:
            log[col_name] = None
            continue
        # Cast the value for some types
        if col_dict['type'] == 'INTEGER':
            row_value = int(row_value)
        elif col_dict['type'] == 'FLOAT':
            row_value = float(row_value)
        elif col_dict['type'] == 'BOOLEAN':
            row_value = row_value in ('True', 'true', 'TRUE')
        log[col_name] = row_value
    return log
def extractResult(queryReply):
    """ Extract a result from the query reply. Uses schema and rows to translate.
    Args:
      queryReply: the object returned by bigquery
    """
    #pdb.set_trace()
    result = []
    schema = queryReply.get('schema', {'fields': None})['fields']
    rows = queryReply.get('rows', [])
    for row in rows:
        result.append(translate_row(row, schema))
    return result
def writeToCsv(results, filename, ordered_fieldnames, withHeader=True):
    """ Create a csv file from a list of rows.
    Args:
      results: list of rows of data (first row is assumed to be a header)
      ordered_fieldnames: a dict with names of fields in the order desired - names must exist in results header
      withHeader: a boolean to indicate whether to write out the header -
        set to False if you are going to append data to an existing csv
    """
    try:
        the_file = open(filename, "w")
        writer = DictWriter(the_file, fieldnames=ordered_fieldnames)
        if withHeader:
            writer.writeheader()
        writer.writerows(results)
        the_file.close()
    except:
        print "Unexpected error:", sys.exc_info()[0]
        raise
def runSyncQuery(client, projectId, query, timeout=0):
    results = []
    try:
        print 'timeout:%d' % timeout
        jobCollection = client.jobs()
        queryData = {'query': query,
                     'timeoutMs': timeout}
        queryReply = jobCollection.query(projectId=projectId,
                                         body=queryData).execute()
        jobReference = queryReply['jobReference']

        # Timeout exceeded: keep polling until the job is complete.
        while not queryReply['jobComplete']:
            print 'Job not yet complete...'
            queryReply = jobCollection.getQueryResults(
                projectId=jobReference['projectId'],
                jobId=jobReference['jobId'],
                timeoutMs=timeout).execute()

        # If the result has rows, print the rows in the reply.
        if 'rows' in queryReply:
            #print 'has a rows attribute'
            #pdb.set_trace()
            result = extractResult(queryReply)
            results.extend(result)
            currentPageRowCount = len(queryReply['rows'])

            # Loop through each page of data
            while 'rows' in queryReply and currentPageRowCount < int(queryReply['totalRows']):
                queryReply = jobCollection.getQueryResults(
                    projectId=jobReference['projectId'],
                    jobId=jobReference['jobId'],
                    startIndex=currentRow).execute()
                if 'rows' in queryReply:
                    result = extractResult(queryReply)
                    results.extend(result)
                    currentRow += len(queryReply['rows'])
    except AccessTokenRefreshError:
        print ("The credentials have been revoked or expired, please re-run"
               "the application to re-authorize")
    except HttpError as err:
        print 'Error in runSyncQuery:', pprint.pprint(err.content)
    except Exception as err:
        print 'Undefined error' % err
    return results
# Main
if __name__ == '__main__':
    # Name of file
    FILE_NAME = "results.csv"

    # Default prior number of days to run query
    NUMBER_OF_DAYS = "1"

    # BigQuery project id as listed in the Google Developers Console.
    PROJECT_ID = 'xxxxxx'

    # Service account email address as listed in the Google Developers Console.
    SERVICE_ACCOUNT = 'xxxxxx@developer.gserviceaccount.com'
    KEY = "/usr/local/xxxxxxxx"

    query = create_query(NUMBER_OF_DAYS)

    # Authenticate
    client = authenticate_SERVICE_ACCOUNT(SERVICE_ACCOUNT, KEY)

    # Get query results
    results = runSyncQuery(client, PROJECT_ID, query, timeout=0)
    #pdb.set_trace()

    # Write results to csv without header
    ordered_fieldnames = OrderedDict([('f_split',None),('m_members',None),('f_day',None),('visitors',None),('purchasers',None),('demand',None),('dmd_per_mem',None),('visitors_per_mem',None),('purchasers_per_visitor',None),('dmd_per_purchaser',None)])
    writeToCsv(results, FILE_NAME, ordered_fieldnames, False)

    # Backup current data
    backupfilename = "data_bk-" + time.strftime("%y-%m-%d") + ".csv"
    call(['cp', '../data/data.csv', backupfilename])

    # Concatenate new results to data
    with open("../data/data.csv", "ab") as outfile:
        with open("results.csv", "rb") as infile:
            line = infile.read()
            outfile.write(line)
You mention that these come from aggregate sums of floating point data. As Felipe mentioned, floating point is awkward; it violates some of the mathematical identities that we tend to assume.
In this case, the associative property is the one that bites us. That is, usually (A+B)+C == A+(B+C). However, in floating point math, this isn't the case. Each operation is an approximation; you can see this better if you wrap with an 'approx' function: approx(approx(A+B) + C) is clearly different from approx(A + approx(B+C)).
If you think about how bigquery computes aggregates, it builds an execution tree, and computes the value to be aggregated at the leaves of the tree. As those answers are ready, they're passed back up to the higher levels of the tree and aggregated (let's say they're added). The "when they're ready" part makes it non-deterministic.
A node may get results back in the order A,B,C the first time and C,A,B the second time. This means that the order in which partial results are combined will change, since you'll get approx(approx(A + B) + C) the first time and approx(approx(C + A) + B) the second time. Note that since we're dealing with ordering, it may look like the commutative property is the problematic one, but it isn't; A+B in floating point math is the same as B+A. The problem is really that you're adding partial results, and that addition isn't associative.
Floating point math has all sorts of nasty properties and should usually be avoided if you rely on precision.
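A tiny self-contained illustration of the non-associativity (plain Python doubles, nothing BigQuery-specific):

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
assert left != right  # same inputs, different grouping, different double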
Assume floating point is non-deterministic:
https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/
“the IEEE standard does not guarantee that the same program will
deliver identical results on all conforming systems.”