Databricks - Test SQL created table exists with correct columns - unit-testing

I'm trying to create a test in Databricks that checks that a suite of tables has been created with the correct columns. This feels as if it should be simple, but I can't quite grasp the solution; everything has been migrated from Oracle, and my background is Oracle and SQL rather than Python.
So for example, imagine the following table that will be populated with dashboard data. If it already exists with a different structure, the reporting scripts will fail.
%sql
CREATE TABLE IF NOT EXISTS $report_schema.p23a_com
(AGE_GROUP STRING,
UniqServReqID STRING,
Rec_Code STRING,
Elapsed_days INT)
USING delta
PARTITIONED BY (AGE_GROUP)
Part of the test is as follows, but obviously the assert fails because of the partition column information that DESCRIBE appends.
I can't seem to make DESCRIBE less wordy. I could remove the # rows from the input list, but that seems messy and makes it more difficult when I extend the test to pick up data types. Is there a better way to capture the schema?
def get_table_schema(dbase, table_name):
    desc_query = "DESCRIBE " + dbase + "." + table_name
    df_tab_cols = sqlContext.sql(desc_query)
    return df_tab_cols

def test_table_schema(tab_cols, list_tab_cols):
    input_col_list = tab_cols.select("col_name").rdd.map(lambda row: row[0]).collect()
    assert set(input_col_list) == set(list_tab_cols)

db = report_schema
table = "p23a_com"
cols = ["AGE_GROUP", "UniqServReqID", "Rec_Code", "Elapsed_days"]

df_tab_cols = get_table_schema(db, table)
test_table_schema(df_tab_cols, cols)

The answer is me being a bit thick and too SQL-focused. Instead of using SQL DESCRIBE, all I needed to do was read the table directly into a DataFrame via spark.table() and then use its columns, i.e.
def get_table_schema(dbase, table_name):
    desc_query = dbase + "." + table_name
    df_tab_cols = spark.table(desc_query)
    return df_tab_cols

def test_table_schema(tab_cols, list_tab_cols):
    input_col_list = list(tab_cols.columns)
    assert set(input_col_list) == set(list_tab_cols)
    print(input_col_list)

My issue is that we have migrated a reporting system to Databricks and some builds are failing due to tables already existing, old versions of tables, or creates failing when drops haven't completed; we also don't have full git integration, and some environments are more wild west than I would like.
As a result a schema check is necessary.
As I couldn't get an answer, the following is my final code in case anyone else needs something similar. As I wanted to check both columns and column types, I switched to dictionaries; it turns out that comparing dictionaries is simple.
I don't think it is very Pythonic and it isn't efficient when there are a lot of tables. I think the dictionary creation from a DataFrame might need an RDD in non-Databricks environments.
Feel free to critique as I am still learning.
# table_list is a python dictionary in the form
# {tablename:{column1:column1_type, column2:column2_type, etc:etc}}
# When a table is added or changed in the build it should be added as a dictionary item.
table_list = {
    'p23a_com': {'AGE_GROUP': 'string',
                 'UniqServReqID': 'string',
                 'Rec_Code': 'string',
                 'Elapsed_days': 'int'},
    'p23b_com': {'AGE_GROUP': 'string',
                 'UniqServReqID': 'string',
                 'Org_code': 'string',
                 'Elapsed_days': 'int'}}

# Function to get the schema for a table
# Function takes the database name and the table name and returns a dictionary of the columns and data types.
def get_table_schema(dbase, table_name):
    desc_query = dbase + "." + table_name
    df_tab_cols = spark.createDataFrame(spark.table(desc_query).dtypes)
    tab_cols_dict = dict(map(lambda row: (row[0], row[1]), df_tab_cols.collect()))
    return tab_cols_dict

# This is the test, it cycles through the table_list dictionary returning the columns and types and then doing a dictionary compare.
# The query will fail on any missing table or any table with incorrect columns.
for tab in table_list:
    tab_cols_d = get_table_schema(db, tab)
    assert tab_cols_d == table_list[tab]
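A slightly leaner sketch of the same check is possible, since DataFrame.dtypes already returns (column, type) pairs; it reuses the table_list and db names from above.
# Sketch: DataFrame.dtypes is already a list of (column_name, type_string)
# tuples, so the intermediate DataFrame and collect() can be skipped.
def get_table_schema(dbase, table_name):
    return dict(spark.table(dbase + "." + table_name).dtypes)

for tab in table_list:
    assert get_table_schema(db, tab) == table_list[tab], "Schema mismatch: " + tab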

Related

How do you unit test Django RawQuerySets

I'm using raw() and annotate() on a Django query and need some way to test it.
Here's what my code looks like (this is simplified significantly)
query = """SELECT
    table1.id, table1.column1, table1.column2,
    table2.other_column1, table2.other_column2
FROM myapp_mymodel as table1
JOIN otherapp_othermodel as table2 ON table1.othermodel_id = table2.id"""

return MyModel.objects.annotate(
    other_column1=models.Value('other_column1', models.IntegerField()),
    other_column2=models.Value('other_column2', models.DateField())
).raw(query)
It's relatively straightforward to fill the database with sample data, but what's the best way to check that the data is returned by this code?
There are a lot of options when dealing with standard querysets that seem to go out the window when dealing with RawQuerySets.
Usually the approach is to have the test set up a relatively small data set that contains some things the query should find, and some things it shouldn't find. Then inspect the returned QuerySet and verify:
That it contains the expected number of results
That the set of primary key values returned matches what you expected to return
That the annotated values on the returned objects are what they're expected to be.
So, for example, you might do something like:
def test_custom_query(self):
    # Put whatever code you need here to insert the test data
    results = run_your_query_here()
    self.assertEqual(results.count(), expected_number_of_results)
    self.assertEqual({obj.pk for obj in results}, set_of_expected_primary_keys)
    self.assertEqual(
        [obj.annotated_value for obj in results],
        list_of_expected_annotated_values
    )
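One caveat worth noting (an assumption about the setup above, since run_your_query_here() is a placeholder): if it returns a RawQuerySet rather than a regular QuerySet, count() may not be available, so it can be safer to materialise the results first:
# Materialise the raw results once, then assert against the plain list;
# run_your_query_here() and the expected_* names are the placeholders from above.
results = list(run_your_query_here())
self.assertEqual(len(results), expected_number_of_results)
self.assertEqual({obj.pk for obj in results}, set_of_expected_primary_keys)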

Fulltext Search DynamoDB

Following situation:
I'm storing elements in DynamoDB for my customers. The hash key is an element ID and the range key is the customer ID. In addition to these fields I'm storing an array of strings -> tags (e.g. ["Pets", "House"]) and a multiline text.
I want to provide a search function in my application, where the user can type free text or select tags and get all related elements.
In my opinion a plain DB query is not the correct solution. I was playing around with CloudSearch, but I'm not really sure if this is the correct solution, because every time the user adds a tag the index must be updated...
I hope you have some hints for me.
DynamoDB is now integrated with Elasticsearch, enabling you to perform
full-text queries on your data.
https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-dynamodb-elasticsearch-integration/
DynamoDB streams are used to keep the search index up-to-date.
You can use an instant-search engine like Typesense to search through data in your DynamoDB table:
https://github.com/typesense/typesense
There's also ElasticSearch, but it has a steep learning curve and can become a beast to manage, given the number of features and configuration options it supports.
At a high level:
Turn on DynamoDB streams
Setup an AWS Lambda trigger to listen to these change events
Write code inside your lambda function to index data into Typesense:
import typesense

def lambda_handler(event, context):
    client = typesense.Client({
        'nodes': [{
            'host': '<Endpoint URL>',
            'port': '<Port Number>',
            'protocol': 'https',
        }],
        'api_key': '<API Key>',
        'connection_timeout_seconds': 2
    })
    processed = 0
    for record in event['Records']:
        ddb_record = record['dynamodb']
        if record['eventName'] == 'REMOVE':
            res = client.collections['<collection-name>'].documents[str(ddb_record['OldImage']['id']['N'])].delete()
        else:
            document = ddb_record['NewImage']  # format your document here and then use the upsert function to index it.
            res = client.collections['<collection-name>'].documents.upsert(document)
        print(res)
        processed = processed + 1
    print('Successfully processed {} records'.format(processed))
    return processed
Here's a detailed article from Typesense's docs on how to do this: https://typesense.org/docs/0.19.0/guide/dynamodb-full-text-search.html
DynamoDB just added PartiQL, a SQL-compatible language for querying data. You can use the contains() function to find a value within a set (or a substring): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-functions.contains.html
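For illustration, here is a minimal sketch of that approach with boto3's execute_statement; the table name "Elements" and the "tags" attribute are assumptions based on the question, not names it defines.
import boto3

# Sketch: run a PartiQL statement that uses contains() to match a tag.
# "Elements" and "tags" are placeholder names assumed from the question.
client = boto3.client('dynamodb')

response = client.execute_statement(
    Statement='SELECT * FROM "Elements" WHERE contains("tags", ?)',
    Parameters=[{'S': 'Pets'}],
)
for item in response['Items']:
    print(item)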
In your specific case you need Elasticsearch, but you can do wildcard text search on the sort key:
/* Return all of the songs by an artist, matching first part of title */
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE 'Call%';
/* Return all of the songs by an artist, with a particular word in the title...
...but only if the price is less than 1.00 */
SELECT * FROM Music
WHERE Artist='No One You Know' AND SongTitle LIKE '%Today%'
AND Price < 1.00;
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SQLtoNoSQL.ReadData.Query.html
This is the advantage of using DynamoDB as a managed service from AWS: you get multiple components managed alongside the managed NoSQL DB itself.
If you are using the downloaded (local) version of DynamoDB, then you need to build your own Elasticsearch cluster and index the DynamoDB data yourself.

Django ORM: Select items where latest status is `success`

I have this model.
class Item(models.Model):
    name = models.CharField(max_length=128)
An Item gets transferred several times. A transfer can be successful or not.
class TransferLog(models.Model):
    item = models.ForeignKey(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
How can I query for all Items whose latest TransferLog was successful?
With "latest" I mean ordered by timestamp.
TransferLog Table
Here is a data sample. Here item1 should not be included, since the last transfer was not successful:
ID|item_id|timestamp |success
---------------------------------------
1 | item1 |2014-11-15 12:00:00 | False
2 | item1 |2014-11-15 14:00:00 | True
3 | item1 |2014-11-15 16:00:00 | False
I know how to solve this with a loop in python, but I would like to do the query in the database.
An efficient trick is possible if the timestamps in the log are increasing, that is, the end of the transfer is logged as the timestamp (not the start of the transfer), or if you can at least expect that the older transfer ended before a newer one started. Then you can use the TransferLog object with the highest id instead of the one with the highest timestamp.
from django.db.models import Max

qs = TransferLog.objects.filter(
    id__in=TransferLog.objects.values('item')
                              .annotate(max_id=Max('id'))
                              .values('max_id'),
    success=True)
It groups by item_id in the subquery and sends the highest id of every group to the main query, where it is filtered by the success of the latest row in the group.
You can see that Django compiles it directly into the best possible single query.
Verified how it compiles to SQL: print(qs.query.get_compiler('default').as_sql())
SELECT L.id, L.item_id, L.timestamp, L.success FROM app_transferlog L
WHERE L.success = true AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0 GROUP BY U0.item_id )
(I edited the example result compiled SQL for better readability by replacing many "app_transferlog"."field" by a short alias L.field, by substituting the True parameter directly into SQL and by editing whitespace and parentheses.)
It can be improved by adding some example filter and by selecting the related Item in the same query:
kwargs = {}  # e.g. filter: kwargs = {'timestamp__gte': ..., 'timestamp__lt': ...}
qs = TransferLog.objects.filter(
    id__in=TransferLog.objects.filter(**kwargs).values('item')
                              .annotate(max_id=Max('id'))
                              .values('max_id'),
    success=True).select_related('item')
Verified how compiled to SQL: print(qs.query.get_compiler('default').as_sql()[0])
SELECT L.id, L.item_id, L.timestamp, L.success, I.id, I.name
FROM app_transferlog L INNER JOIN app_item I ON ( L.item_id = I.id )
WHERE L.success = %s AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0
WHERE U0.timestamp >= %s AND U0.timestamp < %s
GROUP BY U0.item_id )
print(qs.query.get_compiler('default').as_sql()[1])
# result
(True, <timestamp_start>, <timestamp_end>)
The useful fields of the latest TransferLog and the related Items are acquired in one query:
for logitem in qs:
    item = logitem.item  # the item is still cached in the logitem
    ...
The query can be more optimized according to circumstances, e.g. if you are not interested in the timestamp any more and you work with big data...
Without assumption of increasing timestamps it is really more complicated than a plain Django ORM. My solutions can be found here.
EDIT after it has been accepted:
An exact solution for a non increasing dataset is possible by two queries:
Get a set of ids of the last failed transfers. (I used the fail list because it is much smaller than the list of successful transfers.)
Iterate over the list of all last transfers. Exclude items found in the list of failed transfers.
This way, queries that would otherwise require custom SQL can be simulated efficiently:
SELECT a_boolean_field_or_expression,
rank() OVER (PARTITION BY parent_id ORDER BY the_maximized_field DESC)
FROM ...
WHERE rank = 1 GROUP BY parent_object_id
I'm thinking about implementing an aggregation function (e.g. Rank(maximized_field)) as an extension for Django with PostgreSQL; it would be useful.
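On newer Django versions (2.0+) the ranking part can already be expressed with window expressions, so no custom extension is needed for that piece; here is only a sketch using the models from the question. Note that filtering on the window result (rank = 1) may still need a subquery or raw SQL, depending on the Django version.
# Sketch: annotate each TransferLog with its rank per item, newest first
# (requires Django 2.0+). Picking only rank == 1 rows afterwards may still
# require a subquery or raw SQL on versions that cannot filter on windows.
from django.db.models import F, Window
from django.db.models.functions import Rank

ranked = TransferLog.objects.annotate(
    rank=Window(
        expression=Rank(),
        partition_by=[F('item_id')],
        order_by=F('timestamp').desc(),
    )
)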
try this
# your query
items_with_good_translogs = Item.objects.filter(
    id__in=(x.item.id for x in TransferLog.objects.filter(success=True)))
Since you said "How can I query for all Items whose latest TransferLog was successful?", it is logically easy to follow if you start the query with the Item model.
The Q object can also be useful in places like this (negation, or, ...).
(x.item.id for x in TransferLog.objects.filter(success=True)) gives a generator of the item ids of TransferLogs where success is True.
You will probably have an easier time approaching this from the TransferLog side, thusly:
dataset = TransferLog.objects.order_by('item', '-timestamp').distinct('item')
Sadly that does not weed out the False items, and I can't find a way to apply the filter AFTER the distinct. You can however filter it after the fact with a Python list comprehension:
dataset = [d.item for d in dataset if d.success]
If you are doing this for log entries within a given time period, it's best to filter that before ordering and distinct-ing:
dataset = TransferLog.objects.filter(
    timestamp__gt=start,
    timestamp__lt=end
).order_by(
    'item', '-timestamp'
).distinct('item')
If you can modify your models, I actually think you'll have an easier time using a ManyToMany relationship instead of an explicit ForeignKey -- Django has some built-in convenience methods that will make your querying easier. Docs on ManyToMany are here. I suggest the following model:
class TransferLog(models.Model):
    item = models.ManyToManyField(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
Then you could do (I know, not a nice, single-line of code, but I'm trying to be explicit to be clearer):
results = []
for item in Item.objects.all():
    if item.transferlog_set.all().order_by('-timestamp')[0].success:
        results.append(item)
Then your results array will have all the items whose latest transfer was successful. I know, it's still a loop in Python...but perhaps a cleaner loop.
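For completeness, on Django 1.11 and later the same question can be answered without the Python loop by letting the database pick the newest log per item via Subquery/OuterRef; this is only a sketch based on the ForeignKey models from the question.
# Sketch: annotate each Item with the success flag of its newest TransferLog,
# then keep only items where that flag is True (requires Django 1.11+).
from django.db.models import OuterRef, Subquery

latest_logs = TransferLog.objects.filter(item=OuterRef('pk')).order_by('-timestamp')

successful_items = Item.objects.annotate(
    latest_success=Subquery(latest_logs.values('success')[:1])
).filter(latest_success=True)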

Django bulk_create with ignore rows that cause IntegrityError?

I am using bulk_create to load thousands of rows into a PostgreSQL DB. Unfortunately some of the rows are causing an IntegrityError and stopping the bulk_create process. I was wondering if there was a way to tell Django to ignore such rows and save as much of the batch as possible?
This is now possible on Django 2.2
Django 2.2 adds a new ignore_conflicts option to the bulk_create method, from the documentation:
On databases that support it (all except PostgreSQL < 9.5 and Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values. Enabling this parameter disables setting the primary key on each model instance (if the database normally supports it).
Example:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
], ignore_conflicts=True)
One quick-and-dirty workaround for this that doesn't involve manual SQL and temporary tables is to just attempt to bulk insert the data. If it fails, revert to serial insertion.
objs = [(Event), (Event), (Event)...]

try:
    Event.objects.bulk_create(objs)
except IntegrityError:
    for obj in objs:
        try:
            obj.save()
        except IntegrityError:
            continue
If you have lots and lots of errors this may not be so efficient (you'll spend more time serially inserting than doing so in bulk), but I'm working through a high-cardinality dataset with few duplicates so this solves most of my problems.
(Note: I don't use Django, so there may be more suitable framework-specific answers)
It is not possible for Django to do this by simply ignoring INSERT failures because PostgreSQL aborts the whole transaction on the first error.
Django would need one of these approaches:
INSERT each row in a separate transaction and ignore errors (very slow);
Create a SAVEPOINT before each insert (can have scaling problems);
Use a procedure or query to insert only if the row doesn't already exist (complicated and slow); or
Bulk-insert or (better) COPY the data into a TEMPORARY table, then merge that into the main table server-side.
The upsert-like approach (3) seems like a good idea, but upsert and insert-if-not-exists are surprisingly complicated.
Personally, I'd take (4): I'd bulk-insert into a new separate table, probably UNLOGGED or TEMPORARY, then I'd run some manual SQL to:
LOCK TABLE realtable IN EXCLUSIVE MODE;

INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
    SELECT 1 FROM realtable WHERE temptable.id = realtable.id
);
The LOCK TABLE ... IN EXCLUSIVE MODE prevents a concurrent insert that creates a row from conflicting with an insert done by the above statement and failing. It does not block concurrent SELECTs, only SELECT ... FOR UPDATE, INSERT, UPDATE and DELETE, so reads from the table carry on as normal.
If you can't afford to block concurrent writes for too long you could instead use a writable CTE to copy ranges of rows from temptable into realtable, retrying each block if it failed.
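For concreteness, here is a rough Django-side sketch of approach (4); the table names app_event and staging_event are placeholders (not from the question), and it assumes the rows have already been bulk-inserted or COPYed into the staging table with the same columns as the real one.
from django.db import connection, transaction

def merge_staging_into_real():
    # Sketch only: "app_event" and "staging_event" are assumed table names.
    # The EXCLUSIVE lock blocks concurrent writers (not readers) while the
    # de-duplicating insert runs, as described above.
    with transaction.atomic():
        with connection.cursor() as cursor:
            cursor.execute("LOCK TABLE app_event IN EXCLUSIVE MODE")
            cursor.execute("""
                INSERT INTO app_event
                SELECT s.* FROM staging_event AS s
                WHERE NOT EXISTS (
                    SELECT 1 FROM app_event AS r WHERE r.id = s.id
                )
            """)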
Or 5. Divide and conquer
I didn't test or benchmark this thoroughly, but it performs pretty well for me. YMMV, depending in particular on how many errors you expect to get in a bulk operation.
def psql_copy(records):
    count = len(records)
    if count < 1:
        return True
    try:
        pg.copy_bin_values(records)
        return True
    except IntegrityError:
        if count == 1:
            # found culprit!
            msg = "Integrity error copying record:\n%r"
            logger.error(msg % records[0], exc_info=True)
            return False
    finally:
        connection.commit()
    # There was an integrity error but we had more than one record.
    # Divide and conquer.
    mid = count // 2
    return psql_copy(records[:mid]) and psql_copy(records[mid:])
    # or just return False
Even in Django 1.11 there is no way to do this. I found a better option than using raw SQL: django-query-builder. It has an upsert method:
from querybuilder.query import Query
q = Query().from_table(YourModel)
# replace with your real objects
rows = [YourModel() for i in range(10)]
q.upsert(rows, ['unique_fld1', 'unique_fld2'], ['fld1_to_update', 'fld2_to_update'])
Note: the library only supports PostgreSQL.
Here is a gist that I use for bulk insert that supports ignoring IntegrityErrors and returns the records inserted.
Late answer for pre-Django 2.2 projects:
I ran into this situation recently and found my way out with a second list to check for uniqueness.
In my case, the model has a unique-together constraint, and bulk_create throws an IntegrityError because the list passed to bulk_create contains duplicate data.
So I decided to keep a checklist alongside the bulk_create object list. Here is the sample code; the unique keys are owner and brand, where owner is a user object instance and brand is a string instance:
create_list = []
create_list_check = []
for brand in brands:
    if (owner.id, brand) not in create_list_check:
        create_list_check.append((owner.id, brand))
        create_list.append(ProductBrand(owner=owner, name=brand))

if create_list:
    ProductBrand.objects.bulk_create(create_list)
It works for me. I use this function in a thread.
My CSV file contains 120,907 rows.
import os

import pandas as pd
from django.conf import settings

# logger and the Product model come from the surrounding project.
def products_create():
    full_path = os.path.join(settings.MEDIA_ROOT, 'productcsv')
    filename = os.listdir(full_path)[0]
    logger.debug(filename)
    logger.debug(len(Product.objects.all()))
    if len(Product.objects.all()) > 0:
        logger.debug("Products Data Erasing")
        Product.objects.all().delete()
        logger.debug("Products Erasing Done")

    csvfile = os.path.join(full_path, filename)
    csv_df = pd.read_csv(csvfile, sep=',')
    csv_df['HSN Code'] = csv_df['HSN Code'].fillna(0)
    row_iter = csv_df.iterrows()
    logger.debug(row_iter)

    logger.debug("New Products Creating")
    for index, row in row_iter:
        Product.objects.create(part_number=row[0],
                               part_description=row[1],
                               mrp=row[2],
                               hsn_code=row[3],
                               gst=row[4],
                               )

    # products_list = [
    #     Product(
    #         part_number=row[0],
    #         part_description=row[1],
    #         mrp=row[2],
    #         hsn_code=row[3],
    #         gst=row[4],
    #     )
    #     for index, row in row_iter
    # ]
    # logger.debug(products_list)
    # Product.objects.bulk_create(products_list)

    logger.debug("Products uploading done")

How do I get the related objects In an extra().values() call in Django?

Thank to this post I'm able to easily do count and group by queries in a Django view:
Django equivalent for count and group by
What I'm doing in my app is displaying a list of coin types and face values available in my database for a country, so coins from the UK might have a face value of "1 farthing" or "6 pence". The face_value is the 6, the currency_type is the "pence", stored in a related table.
I have the following code in my view that gets me 90% of the way there:
def coins_by_country(request, country_name):
    country = Country.objects.get(name=country_name)
    coin_values = Collectible.objects.filter(country=country.id, type=1).extra(
        select={'count': 'count(1)'},
        order_by=['-count']).values('count', 'face_value', 'currency_type')
    coin_values.query.group_by = ['currency_type_id', 'face_value']
    return render_to_response('icollectit/coins_by_country.html', {'coin_values': coin_values, 'country': country})
The currency_type_id comes across as the number stored in the foreign key field (i.e. 4). What I want to do is retrieve the actual object that it references as part of the query (the Currency model, so I can get the Currency.name field in my template).
What's the best way to do that?
You can't do it with values(). But there's no need to use that - you can just get the actual Collectible objects, and each one will have a currency_type attribute that will be the relevant linked object.
And as justinhamade suggests, using select_related() will help to cut down the number of database queries.
Putting it together, you get:
coin_values = Collectible.objects.filter(
        country=country.id, type=1
    ).extra(
        select={'count': 'count(1)'},
        order_by=['-count']
    ).select_related()
select_related() got me pretty close, but it wanted me to add every field that I've selected to the group_by clause.
So I tried appending values() after the select_related(). No go. Then I tried various permutations of each in different positions of the query. Close, but not quite.
I ended up "wimping out" and just using raw SQL, since I already knew how to write the SQL query.
def coins_by_country(request, country_name):
    country = get_object_or_404(Country, name=country_name)
    cursor = connection.cursor()
    cursor.execute('SELECT count(*), face_value, collection_currency.name FROM collection_collectible, collection_currency WHERE collection_collectible.currency_type_id = collection_currency.id AND country_id=%s AND type=1 group by face_value, collection_currency.name', [country.id])
    coin_values = cursor.fetchall()
    return render_to_response('icollectit/coins_by_country.html', {'coin_values': coin_values, 'country': country})
If there's a way to phrase that exact query in the Django queryset language I'd be curious to know. I imagine that an SQL join with a count and grouping by two columns isn't super-rare, so I'd be surprised if there wasn't a clean way.
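For later readers: on Django versions with the aggregation API, this join-count-group-by can be expressed directly with values() and annotate(). A sketch assuming the field names from the question (currency_type__name presumes the Currency model has a name field):
# Sketch: group by face_value and the related currency's name, count the
# rows in each group, and order by that count (descending).
from django.db.models import Count

coin_values = (Collectible.objects
               .filter(country=country, type=1)
               .values('face_value', 'currency_type__name')
               .annotate(count=Count('id'))
               .order_by('-count'))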
Have you tried select_related()? http://docs.djangoproject.com/en/dev/ref/models/querysets/#id4
I use it a lot and it seems to work well; then you can go coin_values.currency.name.
Also, I don't think you need to do country=country.id in your filter, just country=country, but I am not sure what difference that makes other than less typing.