Finding duplicated values by pattern in postgres sql

Is it possible to write a query that finds duplicate (similar) values by pattern: ignoring extra spaces between words, comparing only the first 3-5 words, and treating everything as lower (or upper) case?
I have a documents table with many columns, one of which is 'title'.
I need to find documents by title, but titles may differ slightly, for example by having two spaces between words or by lower/upper case.
Or maybe it could find similar duplicates where the string begins with the same three to five words.
The query:
SELECT title, COUNT(title)
FROM doc_documents
where not deleted and status ='CONFIRMED'
GROUP BY title
HAVING ( COUNT(title) > 1 )
order by count
It works sort of OK, but it does not find values which differ only by the spaces between words.
Like:
10-12 year classmates, which learns differently
11 – 12 year classmates, which learns differently
Also, is it possible to match on only three words, ignoring spaces and the left part of the string, so that
10-12 year classmates and 11 – 12 year classmates will be found?
I can't think of any solution.

use a regexp to split the title string into an array of the wanted words
implode this array back into a string
group on this string, or use it as a canonical identifier for the fuzzy string
YMMV
-- sample table and data
CREATE TABLE titles
( id serial NOT NULL PRIMARY KEY
, title text
);
INSERT INTO titles ( title ) VALUES
('10-12 year classmates, which learns differently')
, ('10-12 year classmates, which learns differently')
, (' 11 – 12 year classmates, which learns differently');
-- CTE performing the regexp and array magic
WITH tit AS (
    -- you could add a ',' after the 'z' in the character class below,
    -- so that commas are kept as part of the words
    SELECT t.id
         , array_to_string( regexp_split_to_array( btrim(t.title) , E'[^0-9A-Za-z]+'), ' ') AS tit
         , t.title AS org
    FROM titles t
    )
-- Use the CTE to see if it works
SELECT tit
-- , MIN(org) AS one
-- , MAX(org) AS two
, COUNT(*) AS cnt
FROM tit
GROUP BY tit
;
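For a quick check of what the regexp does to the sample titles, the same split-and-join steps can be sketched outside the database in Python. This is only an illustration under assumptions: the function name canonical is made up here, the second title is given two spaces on purpose, and empty pieces are dropped explicitly, which regexp_split_to_array does not do on its own.

```python
import re
from collections import Counter

def canonical(title: str) -> str:
    """Collapse every run of non-alphanumeric characters into one space."""
    words = re.split(r"[^0-9A-Za-z]+", title.strip())
    # Drop empty pieces left by leading or trailing punctuation.
    return " ".join(w for w in words if w)

titles = [
    "10-12 year classmates, which learns differently",
    "10-12  year classmates, which learns differently",   # two spaces
    " 11 – 12 year classmates, which learns differently",
]

# Group on the canonical form, like GROUP BY tit in the query above.
counts = Counter(canonical(t) for t in titles)
for key, cnt in counts.items():
    print(cnt, key)
```

Wrapping the key in lower() on the SQL side (or .lower() here) would additionally make the grouping case-insensitive, which the question also asked for.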

Related

oracle: using dbms.random for randomizing characters and numbers

I need help fully randomizing the characters and numbers in an address without affecting the spaces. I tried the query below (PICTURE1), but it replaces all characters with one single random character. Are there any alternatives or logic to achieve the desired output? Thanks.
Result needed (sample only): random numbers/letters, with the same length and the same space positions.

Try something like this (depending on your version, this can be simplified to make detecting numeric strings easier/faster):
with strings as (
  select 'abc def 123 ktr' colv, 1 id from dual
  union all
  select 'abcdef gh 1trzzz' , 2 id from dual
)
select id,
       (select listagg(
                 case
                   -- true for an all-digit word (and also for any one-character word)
                   when length(regexp_replace(strd,'\d+','0'))=1 then
                     rpad(to_char(mod(abs(dbms_random.random()),power(10,length(strd)))),length(strd),'0')
                   else dbms_random.string('U',length(trim(strd)))
                 end,' ')
          from (select trim(regexp_substr(colv,'(\w*\Z)|(\w* )',1,level)) strd
                  from dual
               connect by level<=regexp_count(colv,' ')+1)
       ) rand_col
  from strings
/
ID RAND_COL
---------- ------------------------------------------------------------
1 IYV MFS 609 SRV
2 GBLPUY LY BHIYNS
The idea is to split the strings into words, replace these words with random strings of equal size, and then reconstruct the string.
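The same idea (split into words, replace each word with a random string of equal size, reconstruct) can be sketched in Python to see the logic without the Oracle-specific parts. This is a minimal sketch, not a translation of the query: the function name randomize_words is hypothetical, and it assumes that all-digit words should become digits and every other word upper-case letters, which is how the query above behaves on the sample data.

```python
import random
import re
import string

def randomize_words(text: str, rng: random.Random) -> str:
    """Replace each word with random characters of the same length."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        # All-digit words get random digits, everything else gets
        # random upper-case letters.
        if re.fullmatch(r"[0-9]+", word):
            alphabet = string.digits
        else:
            alphabet = string.ascii_uppercase
        return "".join(rng.choice(alphabet) for _ in word)

    # \S+ matches only the words, so the spaces stay exactly where they are.
    return re.sub(r"\S+", replace, text)

rng = random.Random()
print(randomize_words("abc def 123 ktr", rng))
print(randomize_words("abcdef gh 1trzzz", rng))
```

Because the substitution touches only the non-space runs, the length and the space positions of the input are preserved by construction.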

Wildcard expression in SQL Server

I know that the following query returns the rows that contain exactly 5 characters between the A and the G:
select *
from
(select 'prefixABBBBBGsuffix' code /*this will be returned. */
union
select 'prefixABBBBGsuffix') rex
where
code like '%A_____G%'
But I want 17 characters between A and G, so the LIKE condition would need 17 underscores. After a little searching on Google I found that [] can be used in LIKE, so I tried the following:
select *
from
(select 'AprefixABBBBBGsuffixG' code
union
select 'AprefixABBBBGsuffixG') rex
where
code like '%A[_]^17G%' /*As per my understanding, '[]' makes a set. And
'^17' would be power of the set (like Mathematics).*/
This returns an empty set. How can I search for rows that have a certain number of characters from the set []?
Note:
I'm using SQL Server 2012.

I would use REPLICATE to generate the desired number of '_':
select * from (
select 'prefixABBBBBGsuffix' code
union
select 'prefixABBBBGsuffix'
) rex
where code like '%A' + REPLICATE('_',17) + 'G%';
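To see which strings a pattern built this way accepts, here is a small Python sketch that builds the same pattern text and checks it with a regex. The helper names like_between and like_to_regex are made up for this illustration, the translation handles only % and _ (not [] sets or ESCAPE), and matching is case-sensitive, whereas in SQL Server that depends on the collation.

```python
import re

def like_between(start: str, end: str, n: int) -> str:
    """Build the LIKE pattern: start, exactly n wildcard characters, end."""
    return "%" + start + "_" * n + end + "%"

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a LIKE pattern that uses only % and _ into a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")       # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

pattern = like_between("A", "G", 5)
for code in ("prefixABBBBBGsuffix", "prefixABBBBGsuffix"):
    # LIKE has to match the whole value, hence fullmatch.
    print(code, bool(like_to_regex(pattern).fullmatch(code)))
```

With n = 5 only the first sample string is accepted, the same as the hand-written '%A_____G%' in the question.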

Same answer as the previous one, but corrected. For the sample strings the number is not 17: there are 19 and 18 characters between the outer A and G. I also added len() of the text between A and G to show this.
select rex.*
from (
select len('prefixABBBBBGsuffix') leng, 'AprefixABBBBBGsuffixG' code
union
select len('prefixABBBBGsuffix'), 'AprefixABBBBGsuffixG'
union
select 0, 'A___________________G'
) rex
where
rex.code like '%A' + replicate('_',19) + 'G%'
--and with [] the set would be [A-Za-z]. Notice this set does not match the A___________________G string.
select rex.*
from (
select len('prefixABBBBBGsuffix') leng, 'AprefixABBBBBGsuffixG' code
union
select len('prefixABBBBGsuffix'), 'AprefixABBBBGsuffixG'
union
select 0, 'A___________________G'
) rex
where
rex.code like '%A' + replicate('[A-Za-z]',19) + 'G%'
[A-Za-z0-9] matches a single character that is either a letter (both cases) or a digit 0 through 9.
I can't find any working information about another way to specify a number of characters like that in LIKE; REPLICATE is just a way to ease parameterization and typing.

How to lookup an array of strings to match a value in a column?

I have a master table holding the list of possible street types:
CREATE TABLE land.street_type (
str_type character varying(300)
);
insert into land.street_type values
('STREET'),
('DRIVE'),
('ROAD');
I have a table in which address is loaded and I need to parse the string to do a lookup on the master street type to fetch the suburb following the street.
CREATE TABLE land.bank_application (
mailing_address character varying(300)
);
insert into land.bank_application values
('8 115 MACKIE STREET VICTORIA PARK WA 6100 AU'),
('69 79 CABBAGE TREE ROAD BAYVIEW NSW 2104 AU'),
('17 COWPER DRIVE CAMDEN SOUTH NSW 2570 AU');
Expected output:
VICTORIA PARK
BAYVIEW
CAMDEN SOUTH
Is there any PostgreSQL technique to look up an array of values against a table column and fetch the data following the matching word?
If I'm able to fetch the data present after the street type, then I can remove the last 3 fields state, postal code and country code from that to identify the suburb.

This query does what you ask for using regular expressions:
SELECT substring(b.mailing_address, ' ' || s.str_type || ' (.*) \D+ \d+ \D+$') AS suburb
FROM land.bank_application b
JOIN land.street_type s ON b.mailing_address ~ (' ' || s.str_type || ' ');
The regexp ' (.*) \D+ \d+ \D+$' explained step by step:
' ' .. leading space (the assumed delimiter, else something like 'BROAD' would match 'ROAD')
(.*) .. capturing parentheses with 0-n arbitrary characters: .*
\D+ .. 1-n non-digits
\d+ .. 1-n digits
$ .. end of string
The manual on POSIX Regular Expressions.
But it relies on the given format of mailing_address. Is the format of your strings that reliable?
And suburbs can have words like 'STREET' etc. as part of their name, so the approach seems unreliable in principle.
BTW, there is no array involved, you seem to be confusing arrays and sets.
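To try the regular expression on the sample addresses without a database, here is a rough Python equivalent. It is a sketch under assumptions: the function name suburb and the hard-coded STREET_TYPES list stand in for the street_type table, and it returns only the first hit, whereas the join above returns one row per matching street type.

```python
import re

# Stand-in for the rows of the street_type table.
STREET_TYPES = ["STREET", "DRIVE", "ROAD"]

def suburb(address: str):
    """Return the text between the street type and 'state postcode country'."""
    for street_type in STREET_TYPES:
        # Same pattern as in the query: street type delimited by spaces,
        # then the captured suburb, then non-digits, digits, non-digits.
        pattern = " " + re.escape(street_type) + r" (.*) \D+ \d+ \D+$"
        match = re.search(pattern, address)
        if match:
            return match.group(1)
    return None

addresses = [
    "8 115 MACKIE STREET VICTORIA PARK WA 6100 AU",
    "69 79 CABBAGE TREE ROAD BAYVIEW NSW 2104 AU",
    "17 COWPER DRIVE CAMDEN SOUTH NSW 2570 AU",
]
for address in addresses:
    print(suburb(address))
```

The greedy (.*) takes as much as it can while still leaving a state, a postcode and a country code at the end, which is why 'CAMDEN SOUTH' comes out whole.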

POSTGRESQL at least 8 characters in name with LIKE or REGEX

SELECT name
FROM players
WHERE name ~ '(.*){8,}'
It is really simple but I cannot seem to get it.
I have a list with names and I have to filter out the ones with at least 8 characters... But I still get the full list.
What am I doing wrong?
Thanks! :)

The regex (.*){8,} means: match any zero or more chars, 8 or more times. Since every repetition may match the empty string, it matches every name.
If you want to match any 8 or more chars, you would use .{8,}.
However, using character_length is more appropriate for this task:
char_length(string) or character_length(string): returns int, the number of characters in string
CREATE TABLE table1
(s character varying)
;
INSERT INTO table1
(s)
VALUES
('abc'),
('abc45678'),
('abc45678910')
;
SELECT * from table1 WHERE character_length(s) >= 8;
See the online demo
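The difference between the two quantifiers is easy to reproduce in Python. This is just an illustration, assuming Python's re module treats these two patterns the same way PostgreSQL does here; the sample values are the ones from table1 above.

```python
import re

names = ["abc", "abc45678", "abc45678910"]

# (.*){8,} repeats a group that may match the empty string,
# so even a three-character value satisfies it.
print(bool(re.search(r"(.*){8,}", "abc")))

# .{8,} needs eight or more actual characters.
by_regex = [n for n in names if re.search(r".{8,}", n)]

# Comparing the length directly gives the same rows without a regex.
by_length = [n for n in names if len(n) >= 8]

print(by_regex)
print(by_length)
```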

Find matching strings in table column Oracle 10g

I am trying to search a varchar2 column in a table for matching strings using the value in another column. The column being searched allows free form text and allows words and numbers of different lengths. I want to find a string that is not part of a larger string of text and numbers.
Example: 1234a should match "Invoice #1234a" but not "Invoice #1234a567"
Steps Taken:
I have tried Regexp_Like(table2.Searched_Field,table1.Invoice) but get many false hits when the invoice number has a number sequence that can be found in other invoice numbers.

Suggestions:
Match only at end:
REGEXP_LIKE(table2.Searched_Field, table1.Invoice || '$')
Match exactly:
table2.Searched_Field = 'Invoice #' || table1.Invoice
Match only at end with LIKE:
table2.Searched_Field LIKE '%' || table1.Invoice
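A quick way to compare the "at end" suggestions on the example values is a small Python sketch. The function names are made up for the illustration, and re.escape is an addition that the REGEXP_LIKE call above does not have: it keeps characters in the invoice number from being read as regex syntax.

```python
import re

def matches_at_end_regex(searched_field: str, invoice: str) -> bool:
    """Regex variant: the invoice number must be the last thing in the text."""
    return re.search(re.escape(invoice) + r"\Z", searched_field) is not None

def matches_at_end_like(searched_field: str, invoice: str) -> bool:
    """LIKE '%' || invoice variant: a plain suffix test."""
    return searched_field.endswith(invoice)

for text in ("Invoice #1234a", "Invoice #1234a567"):
    print(text, matches_at_end_regex(text, "1234a"), matches_at_end_like(text, "1234a"))
```

Both variants reject "Invoice #1234a567", but they would still accept a value such as "Invoice #91234a", so the exact comparison is the safer choice whenever the text before the number is known.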
