PostgreSQL Full Text Search: Documents composed of multiple rows - django

I'm using PostgreSQL's Full Text Search in Django.
My data is stored in a tree structure, like so:
- note 100   # This is a tree
  - note 341
    - note 422
- note 101   # This is another tree
  - note 218
  - note 106
In the database, each note is just an individual row with a link to its parent:
id | note_body | parent_id
-----------------------------
341 | "foo" | 100
422 | "bar" | 341
...
This makes it possible to retrieve a single tree (i.e. several individual notes) at once.
My question is: how can I use full text search where each document is a whole tree, rather than a single note?
In this example, searching for "foo" or "bar" should return note 100, which is the root of the tree that contains that word.
I would like to do this using Django's full text search API.
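For reference, a minimal Django model matching the table above might look like this (note_body and the parent link come from the question; everything else is an assumption):
from django.db import models

class Note(models.Model):
    note_body = models.TextField()
    parent = models.ForeignKey('self', null=True, blank=True,
                               on_delete=models.CASCADE,
                               related_name='children')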

Related

Using dictionary in regexp_replace function in pyspark

I want to perform a regexp_replace operation on a PySpark dataframe column using a dictionary.
Dictionary: {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....}
The dictionary will have around 270 key-value pairs.
Input Dataframe:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on this on the internet. Trying to pass the dictionary directly, as below:
data=data.withColumn('Address_Clean',regexp_replace('Address',dict))
throws an error: "regexp_replace takes 3 arguments, 2 given".
The dataset will be around 20 million rows. Hence a UDF solution will be slow (due to the row-wise operation), and we don't have access to Spark 2.3.0, which supports pandas_udf.
Is there any efficient method of doing this, other than maybe using a loop?
It is throwing this error because regexp_replace() takes three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regex and a lookup table that looks exactly like your original dictionary :)
Here is my solution for this:
import pyspark.sql.functions as sf

# First, strip out the endings you want to replace; the OR (|) operator
# combines them into a single pattern. You could build this pattern string
# from the dictionary keys instead of hard-coding it, but I will leave that
# for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace('original_address', 'RD|DR|etc...', ''))
# You will still need the old ending in a separate column,
# so you have something to join on your lookup table.
input_df = input_df.withColumn('end_of_address', sf.regexp_extract('original_address', '(.*) (.*)', 2))
# Now join the lookup table, which has two columns: the endings you want
# to replace (end_of_address) and the endings you want instead (correct_end).
input_df = directory_df.join(input_df, 'end_of_address')
# Finally, concatenate the address stem with the correct ending.
input_df = input_df.withColumn('address_clean', sf.concat('start_address', 'correct_end'))
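Since all the replacements are literal tokens, another option worth sketching is to fold the dictionary into one chained column expression, one regexp_replace per entry, with no UDF or join. This is a hedged sketch, not a tested recipe: the column names come from the question, and the \b word-boundary anchoring is my assumption to avoid replacing substrings inside words.
import pyspark.sql.functions as sf

abbrevs = {'RD': 'ROAD', 'DR': 'DRIVE', 'AVE': 'AVENUE'}  # sample of the ~270 entries

col = sf.col('Address')
for pat, repl in abbrevs.items():
    # \b ensures only whole tokens are replaced, not substrings of words
    col = sf.regexp_replace(col, r'\b' + pat + r'\b', repl)
data = data.withColumn('Address_Clean', col)
With ~270 entries this builds a deep expression tree, which Catalyst generally handles, but it is worth testing on a sample first.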

SQLite extract string from text in column

I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closely, there is the part "ref"=>"A 73".
I want to extract the A 73 as the name of the motorway.
How can I do this in SQLite?
If the format doesn't change, meaning you can expect the other_tags field to look like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
instr(other_tags, '"ref"=>"') + 8,
instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
The result will be
name
A 73
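If you need this from application code, here is a hedged sketch using Python's sqlite3 module with the same substr/instr idea; instead of anchoring on '","width"', it cuts at the next double quote after the ref value (assuming ref values never contain one). The database path is a placeholder:
import sqlite3

conn = sqlite3.connect('osm.sqlite')  # placeholder path
sql = """
SELECT substr(v, 1, instr(v, '"') - 1) AS name
FROM (
    SELECT substr(other_tags, instr(other_tags, '"ref"=>"') + 8) AS v
    FROM lines
    WHERE other_tags GLOB '*A [0-9]*'
      AND highway = 'motorway'
)
"""
for (name,) in conn.execute(sql):
    print(name)  # e.g. A 73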
Check with the following conditions:
other_tags LIKE 'A%' -- begins with 'A'
abs(substr(other_tags, 3, 2)) <> 0.0 -- the two characters starting at position 3 are numeric
length(other_tags) = 4 -- the length of other_tags is 4
So here is how your query should be:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'

In R, use regular expression to match multiple patterns and add new column to list

I've found numerous examples of how to match and update an entire list with one pattern and one replacement, but what I am looking for now is a way to do this for multiple patterns and multiple replacements in a single statement or loop.
Example:
> print(recs)
  phonenumber amount
1     5345091    200
2     5386052    200
3     5413949    600
4     7420155    700
5     7992284    600
I would like to insert a new column called 'service_provider' with /^5/ as Company1 and /^7/ as Company2.
I can do this with the following two lines of R:
recs$service_provider[grepl("^5", recs$phonenumber)]<-"Company1"
recs$service_provider[grepl("^7", recs$phonenumber)]<-"Company2"
Then I get:
  phonenumber amount service_provider
1     5345091    200         Company1
2     5386052    200         Company1
3     5413949    600         Company1
4     7420155    700         Company2
5     7992284    600         Company2
I'd like to provide a list, rather than a discrete set of grepls, so it is easier to keep country-specific information in one place and all the programming logic in another.
thisPhoneCompanies<-list(c('^5','Company1'),c('^7','Company2'))
In other languages I would use a for loop over the phone company list:
For every row in thisPhoneCompanies
Add service provider to matched entries in recs (such as the grepl statement)
end loop
But I understand that isn't the way to do it in R.
Using stringi:
library(stringi)
recs$service_provider <- stri_replace_all_regex(str = recs$phonenumber,
                                                pattern = c('^5.*', '^7.*'),
                                                replacement = c('Company1', 'Company2'),
                                                vectorize_all = FALSE)
recs
#   phonenumber amount service_provider
# 1     5345091    200         Company1
# 2     5386052    200         Company1
# 3     5413949    600         Company1
# 4     7420155    700         Company2
# 5     7992284    600         Company2
Thanks to @thelatemail.
Looks like if I use a dataframe instead of a list for the phone companies:
phcomp <- data.frame(ph=c(5,7),comp=c("Company1","Company2"))
I can match and add a new column to my list of phone numbers in a single command (using the match function).
recs$service_provider <- phcomp$comp[match(substr(recs$phonenumber,1,1), phcomp$ph)]
Looks like I lose the ability to use regular expressions, but the matching here is very simple, just the first digit of the phone number.

search for specific characters within column and then create different columns from it

I have a param_Value column that holds different kinds of values. I need to extract these values and create columns for all of them.
|PARAM_NAME |param_Value |
|-----------|------------|
|Step 4     | SP:0.09    |
|Procedure  | MAX:125    |
|Step 4     | SP:Ambient |
|(null)     | +/-:N/A    |
|Steam      | SP:2       |
|Step 3     | MIN:0      |
|Step 4     | RDPHN427B  |
|Testing De | N/A        |
I only want columns for values with the following prefixes, and want to give the columns these names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with T-SQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
, CASE WHEN "param_Value" LIKE 'SP:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END SET_POINT_VALUE
, CASE WHEN "param_Value" LIKE '+/-:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END UPPER_LOWER_LIMIT
, CASE WHEN "param_Value" LIKE 'MAX:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MAX_VALUE
, CASE WHEN "param_Value" LIKE 'MIN:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MIN_VALUE
FROM PROCESS_STEPS
;
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, since there's no leading % in your case it's sargable, and thus probably not a total efficiency sink.
Really, though, I have to ask why you're laying out your data like this: substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not store the data laid out the way this view presents it?
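For what it's worth, here is a hedged Python sketch of the same prefix-to-column mapping, handy for prototyping the logic outside the database. The prefixes and column names are from the question; the function itself is purely illustrative:
PREFIXES = {
    'SP:': 'SET_POINT_VALUE',
    'MAX:': 'MAX_LIMIT',
    'MIN:': 'MIN_LIMIT',
    '+/-:': 'UPPER_LOWER_LIMIT',
}

def split_param(param_value):
    # One output column per prefix; unmatched values leave all columns NULL.
    row = {name: None for name in PREFIXES.values()}
    for prefix, name in PREFIXES.items():
        if param_value.startswith(prefix):
            row[name] = param_value[len(prefix):]
    return row

print(split_param('SP:0.09'))  # SET_POINT_VALUE = '0.09', the rest None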

The best way to generate path pattern for materialized path tree structures

Browsing through examples all over the web, I can see that people generate the path using something like "parent_id.node_id". Examples:
uid | name | tree_id
--------------------
1 | Ali | 1.
2 | Abu | 2.
3 | Ita | 1.3.
4 | Ira | 1.3.
5 | Yui | 1.3.4
But as explained in this question (Sorting tree with a materialized path?), adding zero padding to the tree_id makes it easy to sort by creation order.
uid | name | tree_id
--------------------
1 | Ali | 0001.
2 | Abu | 0002.
3 | Ita | 0001.0003.
4 | Ira | 0001.0003.
5 | Yui | 0001.0003.0004
Using fixed-length strings like this also makes it easy for me to calculate the level: length(tree_id)/5. What I'm worried about is that it would limit me to a maximum of 9999 users, rather than 9999 per branch. Am I right here?
9999 | Tar | 0001.9999
10000 | Tor | 0001.??
You are correct: zero-padding each node ID would allow you to sort the entire tree quite simply. However, you have to make the padding width match the upper limit of digits of the ID field, as you pointed out in your last example. E.g., if you're using an int unsigned field for your ID, the highest value would be 4,294,967,295. This is ten digits, meaning that the record set from your last example might look like:
uid | name | tree_id
9999 | Tar | 0000000001.0000009999
10000 | Tor | 0000000001.0000010000
As long as you know you're not going to need to change your ID field to bigint unsigned in the future, this will continue to work, though it might be a bit data-hungry depending on how huge your tables get. You could shave off two bytes per node ID by storing the values in hexadecimal, which would still sort correctly as strings:
uid | name | tree_id
9999 | Tar | 00000001.0000270F
10000 | Tor | 00000001.00002710
I can imagine this would make things a real headache when trying to update the paths (pruning nodes, etc.), though.
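To make the padding and hex variants concrete, here is a small Python sketch (the function names are mine, and the width of 10 is tied to the int unsigned assumption above):
def make_path(ids, width=10, use_hex=False):
    # width=10 covers int unsigned (4,294,967,295); 8 hex digits cover the same range
    fmt = '{:08X}' if use_hex else '{{:0{}d}}'.format(width)
    return '.'.join(fmt.format(i) for i in ids) + '.'

print(make_path([1, 9999]))                 # 0000000001.0000009999.
print(make_path([1, 10000]))                # 0000000001.0000010000.
print(make_path([1, 9999], use_hex=True))   # 00000001.0000270F.
print(make_path([1, 10000], use_hex=True))  # 00000001.00002710.

# The level calculation generalizes: each segment is width + 1 characters
def level(tree_id, width=10):
    return len(tree_id) // (width + 1)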
You can also create extra fields for sorting, e.g.:
uid | name | tree_id | name_sort
9999 | Tar | 00000001.0000270F | Ali.Tar
10000 | Tor | 00000001.00002710 | Ali.Tor
There are limitations, however, as laid out by this guy's answer to a similar materialized path sorting question. The name field would have to be padded to a set length (fortunately, in your example, each name seems to be three characters long), and it would take up a lot of space.
In conclusion, given the above issues, I've found that the most versatile way to do sorting like this is to simply do it in your application logic -- say, using a recursive function that builds a nested array, sorting the children of each node as it goes.
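A hedged sketch of that application-side approach: build a nested structure from (id, parent_id, name) rows and sort each node's children recursively. All names and the sample data here are illustrative.
def build_tree(rows, parent=None):
    # Collect this level's children, sort them, then recurse into each one
    children = [r for r in rows if r['parent_id'] == parent]
    children.sort(key=lambda r: r['name'])
    return [{'id': r['id'],
             'name': r['name'],
             'children': build_tree(rows, r['id'])} for r in children]

rows = [
    {'id': 1, 'parent_id': None, 'name': 'Ali'},
    {'id': 3, 'parent_id': 1, 'name': 'Ita'},
    {'id': 4, 'parent_id': 3, 'name': 'Ira'},
]
print(build_tree(rows))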