django rawsql postgres nested json/jsonb query - django

I have a jsonb structure on postgres named data where each row (there are around 3 million of them) looks like this:
[
{
"number": 100,
"key": "this-is-your-key",
"listr": "20 Purple block, THE-CITY, Columbia",
"realcode": "LA40",
"ainfo": {
"city": "THE-CITY",
"county": "Columbia",
"street": "20 Purple block",
"var_1": ""
},
"booleanval": true,
"min_address": "20 Purple block, THE-CITY, Columbia LA40"
},
.....
]
I would like to query the min_address field in the fastest possible way. In Django I tried to use:
APModel.objects.filter(data__0__min_address__icontains=search_term)
but this takes ages to complete (also, "THE-CITY" is in uppercase, so I have to use icontains here). I tried dropping to raw SQL like so:
cursor.execute("""\
SELECT * FROM "apmodel_ap_model"
WHERE ("apmodel_ap_model"."data"
#>> array['0', 'min_address'])
#> %s \
""",\
[json.dumps([{'min_address': search_term}])]
)
but this throws me strange errors like:
LINE 4: #> '[{"min_address": "some lane"}]'
^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
I am wondering what is the fastest way I can query the field min_address by using rawsql cursors.

Late answer; it probably won't help the OP anymore. Also, I'm not at all an expert in Postgres/JSONB, so this might be a terrible idea.
Given this setup:
so49263641=# \d apmodel_ap_model;
Table "public.apmodel_ap_model"
 Column |  Type | Collation | Nullable | Default
--------+-------+-----------+----------+---------
 data   | jsonb |           |          |
so49263641=# select * from apmodel_ap_model ;
data
-------------------------------------------------------------------------------------------
[{"number": 1, "min_address": "Columbia"}, {"number": 2, "min_address": "colorado"}]
[{"number": 3, "min_address": " columbia "}, {"number": 4, "min_address": "California"}]
(2 rows)
The following query "expands" objects from data arrays to individual rows. Then it applies pattern matching to the min_address field.
so49263641=# SELECT element->'number' as number, element->'min_address' as min_address
FROM apmodel_ap_model ap, JSONB_ARRAY_ELEMENTS(ap.data) element
WHERE element->>'min_address' ILIKE '%col%';
 number |  min_address
--------+---------------
 1      | "Columbia"
 2      | "colorado"
 3      | " columbia "
(3 rows)
However, I doubt it will perform well on large datasets, as the min_address values are cast to text before pattern matching.
Edit: there's some great advice on indexing JSONB data for search here: https://stackoverflow.com/a/33028467/1284043
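If you want to drive that same query from Django, as the question asked, a minimal raw-cursor sketch might look like this (assuming the table name from the setup above; search_term is the user's input):

from django.db import connection

def search_min_address(search_term):
    # Expand each JSONB array element into its own row, then
    # pattern-match on min_address; the term is passed as a bound
    # parameter so user input is never interpolated into the SQL.
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT element->>'number', element->>'min_address'
            FROM apmodel_ap_model ap,
                 JSONB_ARRAY_ELEMENTS(ap.data) element
            WHERE element->>'min_address' ILIKE %s
        """, ['%' + search_term + '%'])
        return cursor.fetchall()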

Related

Aggregate the results of a union query without using raw

I have a table that looks like this
date   car_crashes  city
01.01  1            Washington
01.02  4            Washington
01.03  0            Washington
01.04  2            Washington
01.05  0            Washington
01.06  3            Washington
01.07  4            Washington
01.08  1            Washington
01.01  0            Detroit
01.02  2            Detroit
01.03  4            Detroit
01.04  2            Detroit
01.05  0            Detroit
01.06  3            Detroit
01.07  1            Detroit
I want to know how many car crashes happened each day across the entire nation, and I can do that with this:
Model.objects.values("date") \
    .annotate(car_crashes=Sum('car_crashes')) \
    .values("date", "car_crashes")
Now, let's suppose I have an array like this:
weights = [
{
"city": "Washington",
"weight": 1,
},
{
"city": "Detroit",
"weight": 2,
}
]
This means that Detroit's car crashes should be multiplied by 2 before being aggregated with Washington's.
It can be done like this:
from django.db.models import Case, F, IntegerField, Sum, When

when_list = [When(city=w['city'], then=w['weight']) for w in weights]
case_params = {'default': 1, 'output_field': IntegerField()}
Model.objects.values('date') \
    .annotate(
        weighted_car_crashes=Sum(
            F('car_crashes') * Case(*when_list, **case_params)
        ))
However, this generates very slow SQL code, especially as more properties and a larger array are introduced.
Another solution, which is way faster but still sub-optimal, is using pandas:
import pandas as pd

aggregated = False
for w in weights:
    ag = Model.objects.filter(city=w['city']).values("date") \
        .annotate(car_crashes=Sum('car_crashes') * w['weight']) \
        .values("date", "car_crashes")
    if aggregated is False:
        aggregated = ag
    else:
        aggregated = aggregated.union(ag)
aggregated = pd.DataFrame(aggregated)
if len(weights) > 1:
    aggregated = aggregated.groupby("date", as_index=False).sum()
This is faster, but still not as fast as what happens if, before calling pandas, I take the aggregated.query string and wrap it with a few lines of SQL:
SELECT "date", sum("car_crashes") FROM (
// String from Python
str(aggregated.query)
) as "foo" GROUP BY "date"
This works perfectly when pasted into my database SQL client. I could do this in Python/Django using .raw(), but the documentation says to ask before using .raw(), as almost anything can be accomplished with the ORM.
Yet, I don't see how. Once I call .union() on 2 querysets, I cannot aggregate further.
aggregated.union(ag).annotate(cc=Sum('car_crashes'))
gives
Cannot compute Sum('car_crashes'): 'car_crashes' is an aggregate
Is this possible to do with the Django ORM or should I use .raw()?
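For what it's worth, the SQL wrapping described above can also be driven from Python without .raw(); since the grouped rows no longer map to model instances, a plain cursor may be a better fit. A minimal sketch, assuming aggregated is the union queryset built above (before the pandas conversion):

from django.db import connection

# Compile the union queryset to SQL plus its bound parameters,
# then wrap it as a subquery exactly as described above.
sql, params = aggregated.query.sql_with_params()
with connection.cursor() as cursor:
    cursor.execute(
        'SELECT "date", SUM("car_crashes") FROM (' + sql + ') AS "foo" GROUP BY "date"',
        params,
    )
    rows = cursor.fetchall()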

Extract value from a string occurring after a particular word

A JSON script is passed as a string, and I need to extract the numeric value after content_id for further mapping. Sample data below:
{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}
The parameters are dynamic, so I can't extract the value with a substr function or by counting occurrences of a special character.
The JSON in your example is malformed: it contains an extra ] and some tail after the closing }. For correct JSON you can use get_json_object, for example:
select get_json_object(src_json,'$.url.content_id') from
(
select '{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36], "packager_path": "/opt/bento4"}}' as src_json
)s
;
Result:
OK
1000231205
Time taken: 21.606 seconds, Fetched: 1 row(s)
You can use the regexp_extract function in Hive with a matching regex to extract only the digits from content_id.
Example:
select regexp_extract(col1,'"content_id":\\s"(\\d+)"',1) from (
select string('{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}')col1
)t;
+-------------+--+
| _c0 |
+-------------+--+
| 1000231205 |
+-------------+--+
Regex description:
"content_id":\\s"(\\d+)" //match literal "content_id": + any space + "digit inside quotes"
Found an expensive way to do it through a combination of regex and substring functions:
substr(split(regexp_extract(message,'content_id([^&]*)'), '"')[3],1) as content_id

Using dictionary in regexp_replace function in pyspark

I want to perform a regexp_replace operation on a PySpark dataframe column using a dictionary.
Dictionary: {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....}
The dictionary will have around 270 key-value pairs.
Input Dataframe:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on the internet. Trying to pass the dictionary as in the code below
data = data.withColumn('Address_Clean', regexp_replace('Address', dict))
throws the error "regexp_replace takes 3 arguments, 2 given".
The dataset will be around 20 million rows. Hence, a UDF solution will be slow (due to row-wise operation), and we don't have access to Spark 2.3.0, which supports pandas_udf.
Is there any efficient method of doing it, other than maybe using a loop?
It is throwing this error because regexp_replace() needs three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regexp and a directory (lookup) table that looks exactly like your original dictionary :)
Here is my solution for this:
import pyspark.sql.functions as sf

# You need to get rid of all the endings you want to replace.
# You can use the OR (|) operator for that.
# You could probably automate building that pattern from the dictionary
# instead of hard-coding it, but I will leave that for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace("original_address", "RD|DR|etc...", ""))
# You will still need the old ends in a separate column
# This way you have something to join on your directory table.
input_df = input_df.withColumn('end_of_address',sf.regexp_extract('original_address',"(.*) (.*)", 2))
# Now we join the directory table that has two columns - ends you want to replace and ends you want to have instead.
input_df = directory_df.join(input_df,'end_of_address')
# And now you just need to concatenate the address with the correct ending.
input_df = input_df.withColumn('address_clean',sf.concat('start_address','correct_end'))
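The directory table and the OR pattern can both be derived straight from the original dictionary rather than hard-coded; a minimal sketch, assuming a SparkSession named spark and a trimmed sample of the 270-entry dictionary:

abbrevs = {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE'}  # trimmed sample

# Lookup table with the ending to replace and the ending to keep,
# matching the join above on 'end_of_address'.
directory_df = spark.createDataFrame(
    [(k, v) for k, v in abbrevs.items()],
    ['end_of_address', 'correct_end'],
)

# OR pattern for the regexp_replace step above, e.g. 'RD|DR|AVE'.
pattern = '|'.join(abbrevs)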

django - OR query using lambda

I want to perform an OR query using the Django ORM. I referred to this answer and it fits my need.
I have a list of integers which gets generated dynamically. These integers represent user id in a particular table. This table also has a date field. I want to query the database for all user ids in the list for a given date.
For example: From below table, I want records for user ids 2 and 3 for the date 2015-02-28
id | date
---------------
1 | 2015-02-23
1 | 2015-02-25
1 | 2015-02-28
2 | 2015-02-28
2 | 2015-03-01
3 | 2015-02-28
I am unable to figure out which of the following two should be perfect for my use case:
Table.objects.filter(reduce(lambda x, y: (x | y) & Q(date=datetime.date(2015, 2, 28)), [Q(user_id=i) for i in ids])
OR
Table.objects.filter(reduce(lambda x, y: (x | y), [Q(user_id=i) for i in ids]) & Q(date=datetime.date(2015, 2, 28)))
Both of the above yield similar output at the moment. Without lambda, the query below would fit my need:
Table.objects.filter(Q(user_id=3) & Q(date=datetime.date(2015, 2, 28)) | Q(user_id=2) & Q(date=datetime.date(2015, 2, 28)))
I think you do not need reduce and Q objects here, you can just do:
Table.objects.filter(
    user_id__in=[2, 3],
    date=datetime.date(2015, 2, 28),
)

search for specific characters within column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create separate columns for them.
|PARAM_NAME |param_Value |
|-----------|------------|
|Step 4     | SP:0.09    |
|Procedure  | MAX:125    |
|Step 4     | SP:Ambient |
|(null)     | +/-:N/A    |
|Steam      | SP:2       |
|Step 3     | MIN:0      |
|Step 4     | RDPHN427B  |
|Testing De | N/A        |
I only want the values with the following prefixes, and I want to give the new columns these names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with T-SQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
, CASE WHEN "param_Value" LIKE 'SP:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END SET_POINT_VALUE
, CASE WHEN "param_Value" LIKE '+/-:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END UPPER_LOWER_LIMIT
, CASE WHEN "param_Value" LIKE 'MAX:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MAX_VALUE
, CASE WHEN "param_Value" LIKE 'MIN:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MIN_VALUE
FROM PROCESS_STEPS
;
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, there's no leading % in your case, so it's sargable and probably not a total efficiency sink.
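If it helps to see that LIKE-then-extract idea outside SQL, here is the same logic sketched in Python (a hypothetical helper, purely illustrative):

def extract(value, prefix):
    # Mirror of LIKE 'prefix%' plus SUBSTR/INSTR: return the text
    # after the first ':' when the value starts with the prefix.
    if value.startswith(prefix):
        return value.split(':', 1)[1]
    return None

print(extract('SP:0.09', 'SP:'))  # 0.09
print(extract('MAX:125', 'SP:'))  # None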
Really, though, I have to question why you're laying out your data like this: substring operations are slow in any language, and a DB is no exception. Why not use a separate column for the limit type, or store the data the way it's laid out in the view you're building here?