How to compare two columns by ignoring special characters? - regex

I am comparing two columns from different tables to get the matching records. The tables do not have any unique key other than first and last names. But I don't get the correct output if tableA has Aa'aa and tableB has Aaaa. Could anyone advise how to compare while ignoring the special characters, or suggest an alternative way to get them matched?
SELECT * FROM TableA A where EXISTS
(SELECT '' FROM TableB B
WHERE
TRIM(A.namef) = TRIM(B.namef)
AND TRIM(A.namel) = TRIM(B.namel)
)
-Thanks

You could try a regex approach. Assuming that you want to compare only the alphabetic and numeric characters, you can do:
where
regexp_replace(a.namef, '\W', '', 'g') = regexp_replace(b.namef, '\W', '', 'g')
and regexp_replace(a.namel, '\W', '', 'g') = regexp_replace(b.namel, '\W', '', 'g')
Basically this removes non-word characters from each string before comparing them - with a word character being defined as a letter or a digit, plus the underscore character.

If you only want to remove anything that is not a letter, use:
regexp_replace(a.namef, '[^A-Za-z]', '', 'g') = regexp_replace(b.namef, '[^A-Za-z]', '', 'g')
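To see what each pattern keeps, here is a minimal Python sketch of the same normalize-then-compare idea. The helper names letters_only and same_name are made up for illustration, and Python's re module is only an approximation of the database's regex engine (for example, how \W treats accented letters may differ).

```python
import re

def letters_only(name):
    # Counterpart of '[^A-Za-z]' with the 'g' flag: drop everything except ASCII letters.
    return re.sub(r"[^A-Za-z]", "", name)

def same_name(a, b):
    # Compare two names after stripping special characters from both sides.
    return letters_only(a) == letters_only(b)

print(same_name("Aa'aa", "Aaaa"))    # True
print(same_name("Aa'aa", "Abaa"))    # False
# \W leaves letters, digits and the underscore in place.
print(re.sub(r"\W", "", "Aa'a_a1"))  # Aaa_a1
```

Because the letters-only version also drops spaces, it makes the TRIM calls from the question unnecessary.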

Related

SQLite Pattern Matching with Extra Character

My database contains these rows:
DuPage
Saint John
What queries could I use that would match people entering either 'Du Page' or 'SaintJohn': in other words: adding an extra character (at any position) that shouldn't be there, or removing a character (at any position) that should be there?
The first example has a possible workaround: I could just remove the space character from the 'Du Page' input before searching the table, but I cannot do that with the second example unless there was some way of saying 'match 'SaintJohn' with the database text that has had all spaces removed', or alternatively 'match a database row that has every letter in 'SaintJohn' somewhere in the row'.
Remove spaces from the column and the search text:
select * from tablename
where replace(textcolumn, ' ', '') like '%' || replace('<your search string>', ' ', '') || '%'
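The query above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the table and column names are taken from the answer, the helper find_ignoring_spaces is hypothetical, and the search text is passed as a bound parameter instead of being pasted into the SQL.

```python
import sqlite3

def find_ignoring_spaces(conn, search_text):
    # Same query as above, with the search text supplied as a parameter.
    sql = (
        "SELECT textcolumn FROM tablename "
        "WHERE replace(textcolumn, ' ', '') "
        "LIKE '%' || replace(?, ' ', '') || '%'"
    )
    return [row[0] for row in conn.execute(sql, (search_text,))]

# In-memory database holding the two rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (textcolumn TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?)", [("DuPage",), ("Saint John",)])

print(find_ignoring_spaces(conn, "Du Page"))    # ['DuPage']
print(find_ignoring_spaces(conn, "SaintJohn"))  # ['Saint John']
```

Note that this only covers extra or missing spaces, not arbitrary inserted or deleted characters.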

How to simplify postgres regexp_replace

Is there a way to simplify this query using only one regexp_replace?
select regexp_replace(regexp_replace('BL 081', '([^0-9.])', '', 'g'), '(^0+)', '', 'g')
the result should be 81
I'm trying to remove all non-numeric chars and leading 0's from the result
You can do this by capturing the digits you want (not including any leading zeros) and removing everything else:
select regexp_replace('BL 0081', '.*?([1-9][0-9]*)$', '\1')
Output
81
Note you don't need the g flag as you are only making one replacement.
Why not just change the range from 0-9 to 1-9?
regexp_replace('BL 081', '(^[^1-9]+)', '', 'g')
This pattern should do: \D+|(?<=\s)0+
\D - matches characters that are not digits
(?<=\s) - looks behind for spaces and matches leading zeros
You can use one fewer regexp_replace:
select regexp_replace('BL 081', '\D+|(?<=\s)0+', '', 'g')
-- outputs 81
Alternatively, if you are interested in the numeric value, you could use a simpler regex and then cast to an integer.
select regexp_replace('BL 081', '\D+', '', 'g')::int
-- also outputs 81, but its type is int
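For comparison, the "strip the non-digits, then drop the leading zeros" idea can be sketched in Python. The function name is made up for illustration, and unlike the original double regexp_replace (which kept '.' via [^0-9.]) this sketch keeps digits only.

```python
import re

def digits_without_leading_zeros(text):
    # Drop every run of non-digits, then strip zeros from the left.
    return re.sub(r"\D+", "", text).lstrip("0")

print(digits_without_leading_zeros("BL 081"))   # 81
print(digits_without_leading_zeros("BL 0081"))  # 81
# Converting to int discards leading zeros as well.
print(int(re.sub(r"\D+", "", "BL 081")))        # 81
```

An input whose digits are all zeros ends up as an empty string here, so the int route may be the safer choice when a number is what you need.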

Regex: how to separate strings by apostrophes in certain cases only

I am looking to capitalize the first letter of words in a string. I've managed to put together something by reading examples on here. However, I'm trying to get any names that start with O' to separate into 2 strings so that each gets capitalized. I have this so far:
\b([^\W_\d](?!')[^\s-]*) *
which omits selecting the X' from any string X'XYZ. That works for capitalizing the part after the ', but doesn't capitalize the X'. Furthermore, i'm becomes i'M since it's not specific to O'. To state the goal:
o'malley should go to O'Malley
o'malley's should go to O'Malley's
don't should go to Don't
i'll should go to I'll
(as an aside, I want to omit any strings that start with numbers, like 23F, that seems to work with what I have)
How to make it specific to the strings that start with O'? Thx
If you use the following pattern:
([oO])'([\w']+)|([\w']+)
then each match gives you three groups:
match[0] == 'o' and match[1] == 'name' #if word is "o'name"
match[2] == 'word' #if word is "word"
Only one of the two alternatives matches a given word, so the groups of the other alternative are empty, i.e. if word == "word" then
match[0] == match[1] == ""
since there is no o' prefix.
Test Example:
>>> import re
>>> string = "o'malley don't i'm hello world"
>>> match = re.findall(r"([oO])'([\w']+)|([\w']+)",string)
>>> match
[('o', 'malley', ''), ('', '', "don't"), ('', '', "i'm"), ('', '', 'hello'), ('', '', 'world')]
NOTE: This is for python. This MIGHT not work for all engines.
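Building on the pattern above, here is one possible way to do the actual capitalization with re.sub and a replacement function. The helper names upper_first and capitalize_words are hypothetical, and skipping words that start with a digit is an assumption based on the aside in the question.

```python
import re

PATTERN = re.compile(r"([oO])'([\w']+)|([\w']+)")

def upper_first(word):
    # Uppercase only the first character; leave the rest untouched.
    return word[:1].upper() + word[1:]

def capitalize_words(text):
    def fix(match):
        if match.group(1):
            # "o'malley" style: capitalize the O and the part after the apostrophe.
            return "O'" + upper_first(match.group(2))
        word = match.group(3)
        if word[0].isdigit():
            # Leave tokens such as "23F" alone.
            return word
        return upper_first(word)
    return PATTERN.sub(fix, text)

print(capitalize_words("o'malley o'malley's don't i'll 23F"))
# O'Malley O'Malley's Don't I'll 23F
```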

regexp rule which returns column entries of text database

Given a simple delimiter-separated text database, I want to construct a regexp rule which returns the column / field entries.
Given the following two example lines
entry1 = '|123|some|string |101112 |'
entry2 = '|123|some| |101112 |'
I want to get the following output:
values1 = '123', 'some', 'string', '101112'
values2 = '123', 'some', '', '101112'
so far I'm using the following regexp and regexprep combination:
values = regexp(regexprep(entry, '[\s]', ''), '\|', 'split')
which unfortunately returns the following:
values1 = '' '123' 'some' 'string' '101112' ''
values2 = '' '123' 'some' '' '101112' ''
but I want to get (no extra '' before the '123' and no extra '' after the '101112'):
values1 = '123', 'some', 'string', '101112'
values2 = '123', 'some', '', '101112'
Given my regexp rule, why do I get the '' at the beginning and the end? How do I have to change my regexp rule to only return the field values?
I am not sure it is exactly what you are asking for, but you can use strread:
strread(entry1(2:end),'%d','delimiter','|')
ans =
123
456
789
101112
Empty strings are there because you tell MATLAB to split at | characters. Splitting means cutting the string at each delimiter, so if there is nothing before a |, you get an empty string. For example, splitting this (the intermediate result after regexprep):
'|123|456|789|101112|'
results in (imagine cutting the string at |):
'', '123', '456', '789', '101112', ''
So, either split the string between the first and the last |:
nospaces = regexprep(entry, '\s', '')
betweenpipes = nospaces(2:size(nospaces,2)-1)
values = regexp(betweenpipes, '\|', 'split')
..or don't use split at all and just search for the required pattern:
regexp(entry, '(?<=\|)(?:\s*)(\d+)(?:\s*)(?=\|)', 'match')
Regexp explained:
look behind for |, but don't include it in the match: (?<=\|)
skip possible whitespace but don't remember it: (?:\s*)
match a number: (\d+)
skip possible whitespace but don't remember it: (?:\s*)
look ahead for |, but don't include it in the match: (?=\|)
I'm writing this from memory as I don't have MATLAB here, so there may be some bugs.
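The "split only between the first and the last |" idea is not specific to MATLAB. As a rough illustration, here is the same approach sketched in Python; the function name split_fields is made up, and the sketch assumes every entry starts and ends with the delimiter.

```python
def split_fields(entry):
    # Assumes the entry starts and ends with the '|' delimiter.
    inner = entry.strip()[1:-1]
    # Split on '|' and trim the whitespace inside each field.
    return [field.strip() for field in inner.split("|")]

print(split_fields("|123|some|string |101112 |"))  # ['123', 'some', 'string', '101112']
print(split_fields("|123|some| |101112 |"))        # ['123', 'some', '', '101112']
```

Dropping the outer delimiters before splitting is what avoids the empty first and last entries.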

postgres regexp_replace want to allow only a-z and A-Z

A string column in a table can contain numbers, special characters and white space.
I want to replace the numbers, special characters and white space with nothing. I see there is a function named regexp_replace, but there is not much user-friendly help available on how to use it. For example, I want to use the following string:
String = 'abc$wanto&toremove#special~chars'
I want to remove all special characters and numbers from the above string, allowing only a-z and A-Z; all other characters should be replaced with ''. How can I do that?
SELECT regexp_replace('abc$wanto&toremove#special~chars', '[^a-zA-Z]', '', 'g');
regexp_replace
------------------------------
abcwantotoremovespecialchars
For me the following worked.
regexp_replace(code, '[^a-zA-Z0-9]+', '','g')
The 'g' flag makes the replacement global, so the regex is applied across the entire string rather than only to the first match.
Example,
SELECT regexp_replace('Well- This Did-Not work&*($%%)_', '[^a-zA-Z0-9]+', '')
Returns: "WellThis Did-Not work&*($%%)_"
SELECT regexp_replace('Well- This Did-Not work&*($%%)_', '[^a-zA-Z0-9]+', '','g')
Returns: "WellThisDidNotwork"
Here all the characters we don't want have been removed.
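The difference between replacing only the first match and replacing every match can be reproduced outside the database. This small Python sketch uses re.sub, where count=1 plays the role of omitting the 'g' flag; it is only an analogy, not PostgreSQL's own engine.

```python
import re

text = "Well- This Did-Not work&*($%%)_"
pattern = r"[^a-zA-Z0-9]+"

# Replace only the first run of unwanted characters (like omitting 'g').
first_only = re.sub(pattern, "", text, count=1)
# Replace every run of unwanted characters (like passing 'g').
everywhere = re.sub(pattern, "", text)

print(first_only)  # WellThis Did-Not work&*($%%)_
print(everywhere)  # WellThisDidNotwork
```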
To make it simpler:
regexp_replace('abc$wanto&toremove#special~chars', '[^[:alpha:]]', '', 'g')
If you want to replace the char with the closest not special char, you can do something like this:
select
translate(
lower( name ), ' ''àáâãäéèëêíìïîóòõöôúùüûçÇ', '--aaaaaeeeeiiiiooooouuuucc'
) as new_name,
name
from cities;
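The same character-for-character mapping can be sketched in Python with str.translate. The table below mirrors the translate() call above (space and apostrophe become '-', accented letters become their plain counterparts); the names GROUPS, TABLE and simplify are made up for illustration.

```python
# Each plain replacement character and the characters it stands in for.
GROUPS = {
    "-": " '",
    "a": "àáâãä",
    "e": "éèëê",
    "i": "íìïî",
    "o": "óòõöô",
    "u": "úùüû",
    "c": "ç",
}
# Build a one-to-one translation table: special character -> plain character.
TABLE = str.maketrans({src: dst for dst, chars in GROUPS.items() for src in chars})

def simplify(name):
    # Lowercase first, then map each special character to its replacement.
    return name.lower().translate(TABLE)

print(simplify("São João"))  # sao-joao
```

Grouping the characters by their replacement makes it harder to get the two strings out of step than with two long parallel literals.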
Should be:
regexp_replace('abc$wanto&toremove#special~chars', '[^a-zA-Z]+', '', 'g')