django - OR query using lambda - django

I want to perform OR query using django ORM. I referred this answer and it fits my need.
I have a list of integers which gets generated dynamically. These integers represent user id in a particular table. This table also has a date field. I want to query the database for all user ids in the list for a given date.
For example: From below table, I want records for user ids 2 and 3 for the date 2015-02-28
id | date
---------------
1 | 2015-02-23
1 | 2015-02-25
1 | 2015-02-28
2 | 2015-02-28
2 | 2015-03-01
3 | 2015-02-28
I am unable to figure out which of the following two should be perfect for my use case:
Table.objects.filter(reduce(lambda x, y: (x | y) & Q(date=datetime.date(2015, 2, 28)), [Q(user_id=i) for i in ids])
OR
Table.objects.filter(reduce(lambda x, y: (x | y), [Q(user_id=i) for i in ids]) & Q(date=datetime.date(2015, 2, 28))
Both of the above yield similar output at the moment. Without lambda, below query would fit my need:
Table.objects.filter(Q(user_id=3) & Q(date=datetime.date(2015, 2, 28))| Q(user_id=2) & Q(date=datetime.date(2015, 2, 28)))

I think you do not need reduce and Q objects here, you can just do:
Table.objects.filter(
user_id__in=[2,3],
date=datetime.date(2015, 2, 28),
)

Related

Django ORM. Select only duplicated fields from DB

I have table in DB like this:
MyTableWithValues
id | user(fk to Users) | value(fk to Values) | text | something1 | something2 ...
1 | userobject1 | valueobject1 |asdasdasdasd| 123 | 12321
2 | userobject2 | valueobject50 |QWQWQWQWQWQW| 515 | 5555455
3 | userobject1 | valueobject1 |asdasdasdasd| 12345 | 123213
I need to delete all objects where are repeated fields user, value and text, but save one from them. In this example will be deleted 3rd record.
How can I do this, using Django ORM?
PS:
try this:
recs = (
MyTableWithValues.objects
.order_by()
.annotate(max_id=Max('id'), count_id=Count('user__id'))
#.filter(count_id__gt=1)
.annotate(count_values=Count('values'))
#.filter(count_icd__gt=1)
)
...
...
for r in recs:
print(r.id, r.count_id, , r.count_values)
it prints something like this:
1 1 1
2 1 1
3 1 1
...
Dispite the fact, that in database there are duplicated values. I cant understand, why Count function does not work.
Can anybody help me?
You should first be aware of how count works.
The Count method will count for identical rows.
It uses all the fields available in an object to check if it is identical with fields of other rows or not.
So in current situation the count_values is resulting 1 because Count is using all fields excluding id to look for similar rows.
Count is including user,value,text,something1,something2 fields to check for similarity.
To count rows with similar fields you have to use only user,values & text field
Query:
recs = MyTableWithValues.objects
.values('user','values','text')
.annotate(max_id=Max('id'),count_id=Count('user__id'))
.annotate(count_values=Count('values'))
It will return a list of dictionary
print(recs)
Output:
<QuerySet[{'user':1,'values':1,'text':'asdasdasdasd','max_id':3,'count_id':2,'count_values':2},{'user':2,'values':2,'text':'QWQWQWQWQWQW','max_id':2,'count_id':1,'count_values':1}]
using this queryset you can check how many times a row contains user,values & text field with same values
Would a Python loop work for you?
import collections
d = collections.defaultdict(list)
# group all objects by the key
for e in MyTableWithValues.objects.all():
k = (e.user_id, e.value_id, e.text)
d[k].append(e)
for k, obj_list in d.items():
if len(obj_list) > 1:
for e in obj_list[1:]:
# except the first one, delete all objects
e.delete()

Using dictionary in regexp_replace function in pyspark

I want to perform an regexp_replace operation on a pyspark dataframe column using dictionary.
Dictionary : {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....}
The dictionary will have around 270 key value pair.
Input Dataframe:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on internet. And if trying to pass dictionary as below codes-
data=data.withColumn('Address_Clean',regexp_replace('Address',dict))
Throws an error "regexp_replace takes 3 arguments, 2 given".
Dataset will be around 20 million in size. Hence, UDF solution will be slow (due to row wise operation) and we don't have access to spark 2.3.0 which supports pandas_udf.
Is there any efficient method of doing it other than may be using a loop?
It is trowing you this error because regexp_replace() needs three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regexp and a directory table that looks exactly like your original directory :)
Here is my solution for this:
# You need to get rid of all the things you want to replace.
# You can use the OR (|) operator for that.
# You could probably automate that and pass it a string that looks like that instead but I will leave that for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace("original_address","RD|DR|etc...",""))
# You will still need the old ends in a separate column
# This way you have something to join on your directory table.
input_df = input_df.withColumn('end_of_address',sf.regexp_extract('original_address',"(.*) (.*)", 2))
# Now we join the directory table that has two columns - ends you want to replace and ends you want to have instead.
input_df = directory_df.join(input_df,'end_of_address')
# And now you just need to concatenate the address with the correct ending.
input_df = input_df.withColumn('address_clean',sf.concat('start_address','correct_end'))

django rawsql postgres nested json/jsonb query

I have a jsonb structure on postgres named data where each row (there are around 3 million of them) looks like this:
[
{
"number": 100,
"key": "this-is-your-key",
"listr": "20 Purple block, THE-CITY, Columbia",
"realcode": "LA40",
"ainfo": {
"city": "THE-CITY",
"county": "Columbia",
"street": "20 Purple block",
"var_1": ""
},
"booleanval": true,
"min_address": "20 Purple block, THE-CITY, Columbia LA40"
},
.....
]
I would like to query the min_address field in the fastest possible way. In Django I tried to use:
APModel.objects.filter(data__0__min_address__icontains=search_term)
but this takes ages to complete (also, "THE-CITY" is in uppercase, so, I have to use icontains here. I tried dropping to rawsql like so:
cursor.execute("""\
SELECT * FROM "apmodel_ap_model"
WHERE ("apmodel_ap_model"."data"
#>> array['0', 'min_address'])
#> %s \
""",\
[json.dumps([{'min_address': search_term}])]
)
but this throws me strange errors like:
LINE 4: #> '[{"min_address": "some lane"}]'
^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
I am wondering what is the fastest way I can query the field min_address by using rawsql cursors.
Late answer, probably it won't help OP anymore. Also I'm not at all an expert in Postgres/JSONB, so this might be a terrible idea.
Given this setup;
so49263641=# \d apmodel_ap_model;
Table "public.apmodel_ap_model"
Column | Type | Collation | Nullable | Default
--------+-------+-----------+----------+---------
data | jsonb | | |
so49263641=# select * from apmodel_ap_model ;
data
-------------------------------------------------------------------------------------------
[{"number": 1, "min_address": "Columbia"}, {"number": 2, "min_address": "colorado"}]
[{"number": 3, "min_address": " columbia "}, {"number": 4, "min_address": "California"}]
(2 rows)
The following query "expands" objects from data arrays to individual rows. Then it applies pattern matching to the min_address field.
so49263641=# SELECT element->'number' as number, element->'min_address' as min_address
FROM apmodel_ap_model ap, JSONB_ARRAY_ELEMENTS(ap.data) element
WHERE element->>'min_address' ILIKE '%col%';
number | min_address
--------+---------------
1 | "Columbia"
2 | "colorado"
3 | " columbia "
(3 rows)
However, I doubt it will perform well on large datasets as the min_address values are casted to text before pattern matching.
Edit: Some great advice here on indexing JSONB data for search https://stackoverflow.com/a/33028467/1284043

Compare fields within relationship on Django ORM

I have two models, route and stop.
A route can have several stop, each stop have a name and a number. On same route, stop.number are unique.
The problem:
I need to search which route has two different stops and one stop.number is less than the other stop.number
Consider the following models:
class Route(models.Model):
name = models.CharField(max_length=20)
class Stop(models.Model):
route = models.ForeignKey(Route)
number = models.PositiveSmallIntegerField()
location = models.CharField(max_length=45)
And the following data:
Stop table
| id | route_id | number | location |
|----|----------|--------|----------|
| 1 | 1 | 1 | 'A' |
| 2 | 1 | 2 | 'B' |
| 3 | 1 | 3 | 'C' |
| 4 | 2 | 1 | 'C' |
| 5 | 2 | 2 | 'B' |
| 6 | 2 | 3 | 'A' |
In example:
Given two locations 'A' and 'B', search which routes have both location and A.number is less than B.number
With the previous data, it should match route id 1 and not route id 2
On raw SQL, this works with a single query:
SELECT
`route`.id
FROM
`route`
LEFT JOIN `stop` stop_from ON stop_from.`route_id` = `route`.`id`
LEFT JOIN `stop` stop_to ON stop_to.`route_id` = `route`.`id`
WHERE
stop_from.`stop_location_id` = 'A'
AND stop_to.`stop_location_id` = 'B'
AND stop_from.stop_number < stop_to.stop_number
Is this possible to do with one single query on Django ORM as well?
Generally ORM frameworks like Django ORM, SQLAlchemy and even Hibernate is not design to autogenerate most efficient query. There is a way to write this query only using Model objects, however, since I had similar issue, I would suggest to use raw query for more complex queries. Following is link for Django raw query:
[https://docs.djangoproject.com/en/1.11/topics/db/sql/]
Although, you can write your query in many ways but something like following could help.
from django.db import connection
def my_custom_sql(self):
with connection.cursor() as cursor:
cursor.execute("SELECT
`route`.id
FROM
`route`
LEFT JOIN `stop` stop_from ON stop_from.`route_id` = `route`.`id`
LEFT JOIN `stop` stop_to ON stop_to.`route_id` = `route`.`id`
WHERE
stop_from.`stop_location_id` = %s
AND stop_to.`stop_location_id` = %s
AND stop_from.stop_number < stop_to.stop_number", ['A', 'B'])
row = cursor.fetchone()
return row
hope this helps.

Django group by in many to many relationships

I have a model named Evaluation with following schema:
user = models.ForeignKey(User)
value = models.IntegerField()
The value field will take value in 0,1,2,3.
Now I want to get the count of evaluations of a given user with each value. For example, suppose my data are:
user.id | value
1 | 0
1 | 0
1 | 1
1 | 2
1 | 3
1 | 3
I want to get the result
value | count
0 | 2
1 | 1
2 | 1
3 | 2
I use the query
Evaluation.objects.filter(user=request.user).annotate(count=Count('value')).order_by('value')
But it does not return the correct answer. Could anyone help?
you can do it this way:
Evaluation.objects.filter(user=request.user).values('value').annotate(count=Count('value')).order_by('value')
Add the values() method:
Evaluation.objects.filter(user_id=request.user) \
.values('value').annotate(count=Count('value')) \
.order_by('value')
You could build reverse query and query the User model instead:
User.objects.filter(user=request.user).values('evaluation__value').annotate(count=Count('evaluation__user'))
which will produce below results:
[{'count': 1, 'evaluation__value': 1}, {'count': 1, 'evaluation__value': 2}, {'count': 2, 'evaluation__value': 0}, {'count': 2, 'evaluation__value': 3}]
Additionally you might want to sort the results:
queryset.order_by('-count') # sorts by count desc
Unfortunately you cannot alias the value in values queryset method hence the ugly evaluation__value as field name. See this Django ticket.
HTH.