How to mock boto3 calls when testing a function that calls boto3 in its body - unit-testing

I am trying to test a function called get_date_from_s3(bucket, table) using pytest. In this function, there is a boto3.client("s3").list_objects_v2() call that I would like to mock during testing, but I can't seem to figure out how this would work.
Here is my directory setup:
my_project/
    glue/
        continuous.py
    tests/
        glue/
            test_continuous.py
        conftest.py
    conftest.py
The code in continuous.py will be executed in an AWS Glue job, but I am testing it locally.
my_project/glue/continuous.py
import sys
import boto3
from datetime import date as datetime_date

def get_date_from_s3(bucket, table):
    s3_client = boto3.client("s3")
    result = s3_client.list_objects_v2(Bucket=bucket, Prefix="Foo/{}/".format(table))
    # [the actual thing I want to test]
    latest_date = datetime_date(1, 1, 1)
    output = None
    for content in result.get("Contents"):
        date = content["Key"].split("/")
        output = [some logic to get the latest date from the file name in s3]
    return output

def main(argv):
    date = get_date_from_s3(argv[1], argv[2])

if __name__ == "__main__":
    main(sys.argv[1:])
my_project/tests/glue/test_continuous.py
This is what I want: to test get_date_from_s3() by mocking the s3_client.list_objects_v2() call and explicitly setting its response to example_response. I tried something like the following, but it doesn't work:
from glue import continuous
import mock

def test_get_date_from_s3(mocker):
    example_response = {
        "ResponseMetadata": "somethingsomething",
        "IsTruncated": False,
        "Contents": [
            {
                "Key": "/year=2021/month=01/day=03/some_file.parquet",
                "LastModified": "datetime.datetime(2021, 2, 5, 17, 5, 11, tzinfo=tzlocal())",
                ...
            },
            {
                "Key": "/year=2021/month=01/day=02/some_file.parquet",
                "LastModified": ...,
            },
            ...
        ]
    }
    mocker.patch(
        'continuous.boto3.client.list_objects_v2',
        return_value=example_response
    )
    expected = "20210102"
    actual = get_date_from_s3(bucket, table)
    assert actual == expected
Note
I noticed that a lot of mocking examples have the functions under test as part of a class. Because continuous.py is a Glue job, I didn't see the utility of creating a class; I just have functions and a main() that calls them. Is that bad practice? It seems like mock decorators placed before functions are only ever used for methods of a class.
I also read about moto, but couldn't seem to figure out how to apply it here.

The idea with mocking and patching is that you mock/patch something specific; so, to patch correctly, you have to specify exactly the thing to be mocked/patched. In the given example, the thing to be patched is located at: glue > continuous > boto3 > client instance > list_objects_v2.
As you pointed out, you would like calls to list_objects_v2() to return prepared data. This means you first have to mock "glue.continuous.boto3.client" and then, through that mock, mock "list_objects_v2".
In practice, you need to do something along the lines of:
from glue import continuous
from unittest.mock import Mock, patch

@patch("glue.continuous.boto3.client")
def test_get_date_from_s3(mocked_client):
    mocked_response = Mock()
    mocked_response.return_value = { ... }
    # boto3.client("s3") returns mocked_client.return_value, so attach the
    # stubbed method to that instance:
    mocked_client.return_value.list_objects_v2 = mocked_response
    # Run other setup and function under test:

In the end, I figured out that my patching target was wrong, thanks to @Gros Lalo. It should have been 'glue.continuous.boto3.client.list_objects_v2'. That still didn't work, however; it threw the error AttributeError: <function client at 0x7fad6f1b2af0> does not have the attribute 'list_objects_v2'.
So I did a little refactoring to wrap the boto3.client call in a function that is easier to mock. Here is my new my_project/glue/continuous.py file:
import sys
import boto3
from datetime import date as datetime_date

def get_s3_objects(bucket, table):
    s3_client = boto3.client("s3")
    return s3_client.list_objects_v2(Bucket=bucket, Prefix="Foo/{}/".format(table))

def get_date_from_s3(bucket, table):
    result = get_s3_objects(bucket, table)
    # [the actual thing I want to test]
    latest_date = datetime_date(1, 1, 1)
    output = None
    for content in result.get("Contents"):
        date = content["Key"].split("/")
        output = [some logic to get the latest date from the file name in s3]
    return output

def main(argv):
    date = get_date_from_s3(argv[1], argv[2])

if __name__ == "__main__":
    main(sys.argv[1:])
My new test_get_latest_date_from_s3() is therefore:
from glue import continuous

def test_get_latest_date_from_s3(mocker):
    example_response = {
        "ResponseMetadata": "somethingsomething",
        "IsTruncated": False,
        "Contents": [
            {
                "Key": "/year=2021/month=01/day=03/some_file.parquet",
                "LastModified": "datetime.datetime(2021, 2, 5, 17, 5, 11, tzinfo=tzlocal())",
                ...
            },
            {
                "Key": "/year=2021/month=01/day=02/some_file.parquet",
                "LastModified": ...,
            },
            ...
        ]
    }
    mocker.patch('glue.continuous.get_s3_objects', return_value=example_response)
    expected_date = "20210103"
    actual_date = continuous.get_date_from_s3("some_bucket", "some_table")
    assert expected_date == actual_date
The refactoring worked out for me, but if there is a way to mock the list_objects_v2() directly without having to wrap it in another function, I am still interested!
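For anyone who lands here later: botocore ships botocore.stub.Stubber, which can stub list_objects_v2() on a real client object without a wrapper function. A minimal sketch against the module layout above; the bucket/table values and the patch target are assumptions based on that code:

import boto3
from botocore.stub import Stubber

from glue import continuous

def test_get_date_from_s3_stubbed(mocker):
    s3_client = boto3.client("s3", region_name="us-east-1")
    stubber = Stubber(s3_client)
    stubber.add_response(
        "list_objects_v2",
        {"IsTruncated": False,
         "Contents": [{"Key": "Foo/some_table/year=2021/month=01/day=03/some_file.parquet"}]},
        {"Bucket": "some_bucket", "Prefix": "Foo/some_table/"},
    )
    stubber.activate()
    # Hand the stubbed client to the code under test:
    mocker.patch("glue.continuous.boto3.client", return_value=s3_client)
    result = continuous.get_date_from_s3("some_bucket", "some_table")

Unlike a plain Mock, Stubber validates both the expected request parameters and the response shape against the real S3 service model, so typos fail loudly instead of being silently absorbed.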

In order to achieve this result using moto, you would create the data normally using the boto3 SDK. In other words: write a test case that succeeds against AWS itself, and then slap the moto decorator on it.
For your use case, I imagine it looks something like:
import boto3
from moto import mock_s3

from glue import continuous

@mock_s3
def test_glue():
    # create test data; the bucket must exist before objects can be put,
    # and the keys must match the "Foo/<table>/" prefix the code queries
    s3 = boto3.client("s3")
    s3.create_bucket(Bucket="some_bucket")
    for d in range(5):
        s3.put_object(Bucket="some_bucket", Key=f"Foo/some_table/year=2021/month=01/day={d}/some_file.parquet", Body="asdf")
    # test
    result = continuous.get_date_from_s3("some_bucket", "some_table")
    # assert result is as expected
    ...

Related

Mocking functions from object created by context manager

I have the following module that I am trying to write unit tests for.
import myModuleWithCtxMgr

def myFunc(arg1):
    with myModuleWithCtxMgr.ctxMgr() as ctxMgr:
        result = ctxMgr.someFunc()
        if result:
            return True, result
    return False, None
The unit test I'm working on looks like this.
import mock
import unittest
import myModule

class MyUnitTests(unittest.TestCase):
    @mock.patch("myModuleWithCtxMgr.ctxMgr")
    def testMyFunc(self, mockFunc):
        mockReturn = mock.MagicMock()
        mockReturn.someFunc = mock.Mock(return_value="val")
        mockFunc.return_value = mockReturn
        result = myModule.myFunc("arg")
        self.assertEqual(result, (True, "val"))
The test is failing because result[1] is a MagicMock, not the return value I thought I had configured.
I've tried a few different variations of the test, but I can't seem to mock the return value of ctxMgr.someFunc(). Does anyone know how I might accomplish this?
Thanks!
The error says:
First differing element 1:
<MagicMock name='ctxMgr().__enter__().someFunc()' id='139943278730000'>
'val'
- (True, <MagicMock name='ctxMgr().__enter__().someFunc()' id='139943278730000'>)
+ (True, 'val')
The error contains the mock name, which shows you exactly what needs to be mocked. Note that __enter__ corresponds to the context manager protocol.
This works for me:
class MyUnitTests(unittest.TestCase):
    @mock.patch("myModuleWithCtxMgr.ctxMgr")
    def testMyFunc(self, mockCtxMgr):
        mockCtxMgr().__enter__().someFunc.return_value = "val"
        result = myModule.myFunc("arg")
        self.assertEqual(result, (True, "val"))
Note how each of these is a separate MagicMock instance which you can configure:
mockCtxMgr
mockCtxMgr()
mockCtxMgr().__enter__
mockCtxMgr().__enter__()
mockCtxMgr().__enter__().someFunc
MagicMocks are created lazily but have identity, so you can configure them this way and it Just Works.
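To make that last point concrete, here is a small standalone illustration (plain unittest.mock; not from the original answer):

from unittest import mock

m = mock.MagicMock()
# Child mocks are created on first access and then cached, so repeated
# calls and attribute lookups hand back the same objects:
assert m() is m()
assert m().__enter__() is m().__enter__()
# That identity is what lets you configure a deeply nested return value:
m().__enter__().someFunc.return_value = "val"
assert m().__enter__().someFunc() == "val"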

Writing to a table after transformation (bonobo-sqlalchemy)

I'm trying to read a table, modify a column, and write to another table. I followed the available documentation and ran the following code. It doesn't give any errors, but the task doesn't get performed either.
If I remove the transformation step, the data does get written.
import sqlalchemy
import bonobo
import bonobo_sqlalchemy

def get_services(**options):
    return {
        'sql_alchemy.engine': sqlalchemy.create_engine('postgresql://postgres:password@localhost:5432/postgres')
    }

def transform(*row):
    new_row = row[0] + 1, row[1]
    yield new_row

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        bonobo_sqlalchemy.Select('SELECT * FROM users', engine='sql_alchemy.engine'),
        transform,
        bonobo_sqlalchemy.InsertOrUpdate(table_name='table_1', engine='sql_alchemy.engine'),
    )
    return graph

# The __main__ block actually executes the graph.
if __name__ == '__main__':
    parser = bonobo.get_argument_parser()
    with bonobo.parse_args(parser) as options:
        bonobo.run(get_graph(**options), services=get_services(**options))
Output:
- Select in=1 out=6 [done]
- format_for_db in=6 out=6 [done]
- InsertOrUpdate in=6 out=6 [done]
It works when a dictionary is yielded instead, for example:

yield {"id": row[0], "text": row[1], "count": row[2]}

with a bonobo.UnpackItems(0) node in the chain after the transformation.
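Put together, the working transform and graph look something like this (a sketch; the column names "id", "text", and "count" are taken from the yielded dictionary above and may differ in your schema):

def transform(*row):
    # yield named fields instead of a bare tuple
    yield {"id": row[0] + 1, "text": row[1], "count": row[2]}

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        bonobo_sqlalchemy.Select('SELECT * FROM users', engine='sql_alchemy.engine'),
        transform,
        bonobo.UnpackItems(0),  # unpack the dict so InsertOrUpdate sees named columns
        bonobo_sqlalchemy.InsertOrUpdate(table_name='table_1', engine='sql_alchemy.engine'),
    )
    return graph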

Twisted - how to make lots of Python code non-blocking

I've been trying to get this script to execute the code in hub() in the order it is written.
hub() contains a mix of standard Python code and requests to carry out I/O using Twisted and Crossbar.
However, because the Python code is blocking, the reactor doesn't get a chance to carry out those 'publish' tasks, and my frontend receives all the published messages at the end.
This code is a massively simplified version of what I'm actually dealing with. The real script (hub() and the other methods it calls) is over 1,500 lines long. Modifying all those functions to make them non-blocking is not ideal; I'd rather isolate the changes to a few methods like publish(), if that can fix this problem.
I have played around with async, await, deferLater, loopingCall, and others, but I have not yet found an example that helps in my situation.
Is there a way to modify publish() (or hub()) so that the messages are sent out in order?
from autobahn.twisted.component import Component, run
from twisted.internet.defer import inlineCallbacks, returnValue
from twisted.internet import reactor, defer

component = Component(
    transports=[
        {
            u"type": u"websocket",
            u"url": u"ws://127.0.0.1:8080/ws",
            u"endpoint": {
                u"type": u"tcp",
                u"host": u"localhost",
                u"port": 8080,
            },
            u"options": {
                u"open_handshake_timeout": 100,
            }
        },
    ],
    realm=u"realm1",
)

@component.on_join
@inlineCallbacks
def join(session, details):
    print("joined {}: {}".format(session, details))

    def publish(context='output', value='default'):
        """ Publish a message. """
        print('publish', value)
        session.publish(u'com.myapp.universal_feedback', {"id": context, "value": value})

    def hub(thing):
        """ Main script. """
        do_things
        publish('output', 'some data for you')
        do_more_things
        publish('status', 'a progress message')
        do_even_more_things
        publish('status', 'some more data')
        do_all_the_things
        publish('other', 'something else')

    try:
        yield session.register(hub, u'com.myapp.hello')
        print("procedure registered")
    except Exception as e:
        print("could not register procedure: {0}".format(e))

if __name__ == "__main__":
    run([component])
    reactor.run()
Your join() function is async (decorated with @inlineCallbacks and containing at least one yield in its body).
Internally it registers the function hub() as a WAMP RPC endpoint; hub(), however, is not async.
The calls to session.publish() are also not yielded, as async calls should be.
Result: you add a bunch of events to the event loop but never await them, so they only run when the event loop is flushed at application shutdown.
You need to make both hub() and publish() async:
@inlineCallbacks
def publish(context='output', value='default'):
    """ Publish a message. """
    print('publish', value)
    yield session.publish(u'com.myapp.universal_feedback', {"id": context, "value": value})

@inlineCallbacks
def hub(thing):
    """ Main script. """
    do_things
    yield publish('output', 'some data for you')
    do_more_things
    yield publish('status', 'a progress message')
    do_even_more_things
    yield publish('status', 'some more data')
    do_all_the_things
    yield publish('other', 'something else')
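On Python 3.5+, the same idea can be written with native coroutines, which Twisted and autobahn also accept (Deferreds are awaitable). A hedged sketch of the equivalent publish():

async def publish(context='output', value='default'):
    """ Publish a message. """
    print('publish', value)
    # awaiting the Deferred yields control to the reactor, so the message
    # actually goes out before the next blocking chunk of code runs
    await session.publish(u'com.myapp.universal_feedback', {"id": context, "value": value})

hub() would likewise become an async def that awaits each publish() call.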

Testing Graphene-Django

Currently I am investigating using graphene to build my web server API. I have been using Django REST Framework for quite a while and want to try something different.
I have figured out how to wire it up with my existing project, and I can test queries from the GraphiQL UI by typing something like:
{
    industry(id: 10) {
        name
        description
    }
}
Now I want the new API covered by unit/integration tests, and here the problem starts. All the documentation/posts I am checking on testing query execution in graphene do something like:

result = schema.execute("{industry(id:10){name, description}}")
assertEqual(result, {"data": {"industry": {"name": "Technology", "description": "blab"}}})

My point is that the query inside execute() is just a big chunk of text, and I don't know how I can maintain it in the future. I, or another developer, will have to read that text, figure out what it means, and update it if needed.
Is that how this is supposed to be? How do you write unit tests for graphene?
I've been writing tests that do have a big block of text for the query, but I've made it easy to paste that big block of text in from GraphiQL. I've also been using RequestFactory so I can send a user along with the query.
from django.test import RequestFactory, TestCase
from graphene.test import Client

def execute_test_client_api_query(api_query, user=None, variable_values=None, **kwargs):
    """
    Returns the results of executing a graphQL query using the graphene test client.
    This is a helper method for our tests.
    """
    request_factory = RequestFactory()
    context_value = request_factory.get('/api/')  # or use reverse() on your API endpoint
    context_value.user = user
    client = Client(schema)  # Note: you need to import your schema
    executed = client.execute(api_query, context_value=context_value, variable_values=variable_values, **kwargs)
    return executed
class APITest(TestCase):
    def test_accounts_queries(self):
        # This is the test method.
        # Let's assume that there's a user object "my_test_user" that was already set up.
        query = '''
{
  user {
    id
    firstName
  }
}
'''
        executed = execute_test_client_api_query(query, my_test_user)
        data = executed.get('data')
        self.assertEqual(data['user']['firstName'], my_test_user.first_name)
        # ...more tests etc. etc.
Everything between the set of '''s ({ user { id firstName } }) is pasted straight in from GraphiQL, which makes it easy to update as needed. If I make a change that causes a test to fail, I can paste the query from my code into GraphiQL, fix the query there, and paste the corrected query back into my code. The pasted-in query purposefully has no extra tabbing, to facilitate this repeated pasting.
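Since the helper already threads variable_values through to the graphene test client, the same pattern extends to parameterized queries. A hypothetical example inside a test method, reusing the industry query from the question (the $id typing and the asserted values are assumptions):

query = '''
query Industry($id: ID!) {
  industry(id: $id) {
    name
    description
  }
}
'''
executed = execute_test_client_api_query(query, my_test_user, variable_values={"id": 10})
self.assertEqual(executed['data']['industry']['name'], 'Technology')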

Python & nose: what is the variable scope in the "test class"?

I am running tests with nose and would like to use a variable from one test in another. For this, I create the variable in the class setup. It seems that the variable is copied for each test, so the one set on the class in fact stays untouched. If I use a list instead of a simple variable, I see the behavior I was expecting.
I wrote a small example; we can observe that var1 and varg always show the same value when entering a test:
import time
import sys
import logging

logger = logging.getLogger('008')

class Test008:
    varg = None

    @classmethod
    def setup_class(cls):
        logger.info('* setup class')
        cls.var1 = None
        cls.list1 = []

    def setup(self):
        logger.info('\r\n* setup')
        logger.info('\t var1: {}, varg: {}, list: {}'.format(
            self.var1, self.varg, self.list1))

    def teardown(self):
        logger.info('* teardown')
        logger.info('\t var1: {}, varg: {}, list: {}'.format(
            self.var1, self.varg, self.list1))

    def test_000(self):
        self.var1 = 0
        self.varg = 0
        self.list1.append(0)

    def test_001(self):
        # Here I would like to access the variables but they still show 'None'
        self.var1 = 1
        self.varg = 1
        self.list1.append(1)

    @classmethod
    def teardown_class(cls):
        logger.info('* teardown class')
Result:
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
* setup class
008_TestVars.Test008.test_000 ...
* setup
var1: None, varg: None, list: []
* teardown
var1: 0, varg: 0, list: [0]
ok
008_TestVars.Test008.test_001 ...
* setup
var1: None, varg: None, list: [0]
* teardown
var1: 1, varg: 1, list: [0, 1]
ok
* teardown class
----------------------------------------------------------------------
Is there a way to have the values of var1 and varg carried over from one test to the other?
The docs clearly say:

a test case is constructed to run each method with a fresh instance of the test class

If you need one call to set up state for another call to the function you are testing, why not write a single test that sets up the state, calls your function once and asserts it passes, then calls it again and asserts it fails?
def test_that_two_calls_to_my_method_with_same_params_fails(self):
    var1 = 1
    varg = 1
    assert myMethod(var1, varg)
    assert myMethod(var1, varg) == False
I think that is clearer, because one test has all the state together, and the tests can run in any order.
You could argue that nose does offer a way to share state, since you were trying to use the setup method; the docs also say:
A test module is a python module that matches the testMatch regular expression. Test modules offer module-level setup and teardown; define the method setup, setup_module, setUp or setUpModule for setup, and teardown, teardown_module, or tearDownModule for teardown.

So, outside your class, have a:

def setup_module():
    # bother, now I need to use global variables
    pass
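For completeness, a sketch of what that module-level setup might look like, and why the list in the question appeared to carry values while the plain variables did not (names here are illustrative, not from the original posts):

# module-level state shared by every test in this module
shared = {}

def setup_module():
    shared['var1'] = None
    shared['list1'] = []

# Inside the test class, self.var1 = 0 rebinds a NEW attribute on the fresh
# instance nose creates for each test, leaving the class attribute untouched.
# self.list1.append(0) instead mutates the single list object reached through
# the class attribute, which is why list values carried over between tests.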