SQLite Pattern Matching with Extra Character - regex

My database contains these rows:
DuPage
Saint John
What queries could I use to match people entering either 'Du Page' or 'SaintJohn'? In other words, the input may add an extra character (at any position) that shouldn't be there, or drop a character (at any position) that should be there.
The first example has a possible workaround: I could just remove the space character from the 'Du Page' input before searching the table. But I cannot do that with the second example unless there is some way of saying "match 'SaintJohn' against the database text with all spaces removed", or alternatively "match a database row that has every letter of 'SaintJohn' somewhere in the row".

Remove spaces from the column and the search text:
select * from tablename
where replace(textcolumn, ' ', '') like '%' || replace('<your search string>', ' ', '') || '%'
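To check the query above end to end, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (tablename, textcolumn) are the placeholders from the answer, find_ignoring_spaces is a hypothetical helper, and the search text is bound as a parameter rather than pasted into the SQL.

```python
import sqlite3

def find_ignoring_spaces(conn, search):
    """Return rows whose text matches `search` once spaces are stripped from both sides."""
    sql = """
        SELECT textcolumn FROM tablename
        WHERE replace(textcolumn, ' ', '') LIKE '%' || replace(?, ' ', '') || '%'
    """
    return [row[0] for row in conn.execute(sql, (search,))]

# In-memory database holding the two rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (textcolumn TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?)", [("DuPage",), ("Saint John",)])

print(find_ignoring_spaces(conn, "Du Page"))
print(find_ignoring_spaces(conn, "SaintJohn"))
```

Because of the surrounding '%' wildcards this is a substring match, and it only compensates for missing or extra spaces, not for arbitrary extra or missing characters.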

Related

Split records with complex delimiter

I have an incoming record with a complex column delimiter and need to tokenize the record.
One of the delimiter characters can be part of the data.
I am looking for a regex expression.
Required to use on Teradata 16.1 with the function "REGEXP_SUBSTR".
There can be a maximum of 5 columns to tokenize.
Planning to use CASE statements in Teradata to tokenize the columns.
I guess regular expression for one token will do the trick.
Case#1: Column delimiter is ' - '
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]
My attempt : ([\S]*)\s{1}\-{1}\s{1}
Case#2 : Column delimiter is '::'
Sample data : On:e::tw:o::thr$ee
Required output : [On:e, tw:o, thr$ee]
Case#3 : Column delimiter is ':;'
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]
The above 3 cases are independent and do not occur together, i.e., 3 different solutions are required.
If you absolutely must use RegEx for this, you could do it like in the examples shown below using capture groups.
Generic example:
/(?<data>.+?)($delimiter|$)/gm
(?<data>.+?) named capture group data, matching:
. any character
+? occurring between one and unlimited times, as few as possible (lazy)
followed by
($delimiter|$) another capture group, matching:
$delimiter - replace this with regex matching your delimiter string
| or
$ end of string
Picking up your examples:
Case #1:
Column delimiter is ' - '
/(?<data>.+?)(\s-\s|$)/gm
(https://regex101.com/r/qMYxAY/1)
Case #2:
Column delimiter is '::'
/(?<data>.+?)(\:\:|$)/gm
https://regex101.com/r/IzaAoA/1
Case #3:
Column delimiter is ':;'
(?<data>.+?)(\:\;|$)
https://regex101.com/r/g1MUb6/1
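The generic pattern can be tried outside Teradata with a short Python sketch. This is only an illustration of the lazy-group idea, not Teradata syntax: Python spells named groups (?P<data>...), tokenize is a hypothetical helper, and the sketch assumes there are no empty fields between delimiters.

```python
import re

def tokenize(record, delimiter):
    """Split `record` on a literal multi-character delimiter using the lazy-group pattern."""
    # re.escape turns the delimiter into a literal; the group is lazy so it
    # stops at the first full delimiter (or at the end of the string).
    pattern = re.compile(r"(?P<data>.+?)(?:" + re.escape(delimiter) + r"|$)")
    return [m.group("data") for m in pattern.finditer(record)]

print(tokenize("On-e - tw o - thr$ee", " - "))
print(tokenize("On:e::tw:o::thr$ee", "::"))
print(tokenize("On:e:;tw;o:;thr$ee", ":;"))
```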
Normally you would use STRTOK to split a string on a delimiter. But strtok can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multiple characters of the delimiter with a single character and split on that. For example:
select
strtok(oreplace(<your column>,' - ', '|'),'|',1) as one,
strtok(oreplace(<your column>,' - ', '|'),'|',2) as two,
strtok(oreplace(<your column>,' - ', '|'),'|',3) as three,
strtok(oreplace(<your column>,' - ', '|'),'|',4) as four,
strtok(oreplace(<your column>,' - ', '|'),'|',5) as five
If there are only three occurrences, like in your samples, it just returns null for the other two.
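The replace-then-split idea can be sketched in Python as follows. split_five is a hypothetical helper, None stands in for SQL NULL, and the sketch makes explicit an assumption the query relies on: the single-character placeholder ('|' here) must never occur in the data itself.

```python
def split_five(record, delimiter, placeholder="|"):
    """Swap the multi-character delimiter for one character, then split on it."""
    # Assumption: the placeholder never occurs in the data itself.
    if placeholder in record:
        raise ValueError("placeholder occurs in the data")
    tokens = record.replace(delimiter, placeholder).split(placeholder)
    # Pad with None (standing in for SQL NULL) up to five columns.
    return (tokens + [None] * 5)[:5]

print(split_five("On-e - tw o - thr$ee", " - "))
```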

Extract book name from a string in Hive

My data is something like this -
1124 An Orphan's Journey
234 Red Dragon
35600 You'll Know When It's Time
It has two values, the first one is Book ID, and the second one is the book name.
I used the split function in Hive but that doesn't look proper.
SELECT split(books, '\\ ')[0] book_id,
split(books, '\\ ')[1] + ' ' +
split(books, '\\ ')[2] + ' ' +
split(books, '\\ ')[3] + ' ' +
split(books, '\\ ')[4] as book_name
FROM books;
So far values are good but I don't feel it is the right approach.
Please help.
You may use
REGEXP_EXTRACT(books, '^\\d+', 0)
to get the book ID and
REGEXP_EXTRACT(books, '\\s+(\\S.*)', 1)
to extract the book name. The second regex can be more explicit: for example, you can also require digits at the start of the string with '^\\d+\\s+(\\S.*)'.
Here,
^\d+ - matches one or more (+) digits at the start of the string (^)
\s+(\S.*) - matches one or more whitespace chars (\s+) and then captures into Group 1 any non-whitespace char (\S) followed by the rest of the string (.* matches any zero or more chars other than line break chars, as many as possible). Note the index argument is set to 1 in the second call to REGEXP_EXTRACT so that only the Group 1 value is returned, without the initial whitespace.
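The two patterns can be checked against the sample rows with a small Python sketch. This uses Python's re module rather than Hive, so the backslashes are not doubled, and split_book is a hypothetical helper that assumes every row starts with a numeric ID followed by whitespace.

```python
import re

def split_book(line):
    """Return (book_id, book_name) using the same two patterns as the Hive calls."""
    # Group 0: the leading run of digits.
    book_id = re.search(r"^\d+", line).group(0)
    # Group 1: everything after the first run of whitespace.
    book_name = re.search(r"\s+(\S.*)", line).group(1)
    return book_id, book_name

for row in ["1124 An Orphan's Journey", "234 Red Dragon", "35600 You'll Know When It's Time"]:
    print(split_book(row))
```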

Regex for replacing all characters excepting last and first non space ones

I have emails stored in SAP Hana table column of char datatype. I need to replace all letters and digits with '*' char excepting first and last non-whitespace chars. I wrote the regex like this: regex_replace('abcd#efg.hij', '(?!^)[A-Za-z0-9](?!$)', '*')
It works fine and I get masked email 'a***#***.**j'.
But it goes wrong when there are some white spaces at the start and/or the end of the email. For example, if the email string is ' abcd#efg.hij ' the result would be
' ****#***.**** ' while I need ' a***#***.**j '
Unfortunately, I cannot trim email before regexing.
Denis, I tried the following in a SELECT statement with the REPLACE_REGEXPR function:
select
REPLACE_REGEXPR('(?!^)[\sA-Za-z0-9](?!$)' IN trim(' abcd#efg.hij ') WITH '*')
from dummy;
It removes the leading and trailing spaces and returns "a***#***.**j"
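For comparison, here is a Python sketch of two variants. mask_trimmed mirrors the trim-then-mask idea above using the question's original pattern. mask_keep_padding is an alternative that is not from this thread: it masks only characters that have a non-whitespace neighbour on both sides, under the assumption that the address itself contains no internal whitespace. Both function names are hypothetical, and the lookaround behaviour shown is Python's, which may differ from SAP HANA's.

```python
import re

def mask_trimmed(email):
    """Trim first, then mask every letter/digit that is not at the string edges."""
    return re.sub(r"(?!^)[A-Za-z0-9](?!$)", "*", email.strip())

def mask_keep_padding(email):
    """Mask letters/digits that have a non-whitespace character on both sides.

    Assumption: no whitespace inside the address itself.
    """
    return re.sub(r"(?<=\S)[A-Za-z0-9](?=\S)", "*", email)

print(repr(mask_trimmed(" abcd#efg.hij ")))
print(repr(mask_keep_padding(" abcd#efg.hij ")))
```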

Teradata regexp_replace to eliminate specific special characters

I imported a file that contains email addresses (email_source). I need to join this table to another using this field, but it contains commas (,) and double quotes (") before and after the email address (e.g. "johnsmith#gmail.com,","). I want to replace all commas and double quotes with a space.
What is the correct syntax in Teradata?
Just do this:
REGEXP_REPLACE(email_source, '[,"]', ' ',1,0,'i')
Breakdown:
REGEXP_REPLACE(email_source, -- sourcestring
'[,"]', -- regexp
' ', --replacestring
1, --startposition
0, -- occurrence, 0 = all
'i' -- match -> case insensitive
)
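You don't need a regex for this; a simple oTranslate should be more efficient. The replacement string needs one space per character being replaced, otherwise the characters without a counterpart are removed instead of replaced:
oTranslate(email_source, ',"', '  ')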
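Both approaches have close Python analogues, sketched here only to show that they produce the same result: re.sub with a character class, and str.translate with a one-to-one character map. The function names and the sample value are illustrative assumptions.

```python
import re

def clean_with_regex(value):
    """Replace every comma or double quote with a space via a character class."""
    return re.sub(r'[,"]', " ", value)

def clean_with_translate(value):
    """Same result via a character-to-character map (one space per source char)."""
    return value.translate(str.maketrans(',"', "  "))

sample = '"johnsmith#gmail.com,"'
print(repr(clean_with_regex(sample)))
print(repr(clean_with_translate(sample)))
```

If the field is then used as a join key, the padding spaces that replace the stripped characters still need trimming.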

PostgreSQL regexp_replace() to keep just one whitespace

I need to clean up a string column with both whitespaces and tabs included within, at the beginning or at the end of strings (it's a mess !). I want to keep just one whitespace between each word. Say we have the following string that includes every possible situation :
mystring = ' one two three four '
2 whitespaces before 'one'
1 whitespace between 'one' and 'two'
4 whitespaces between 'two' and 'three'
2 tabs after 'three'
1 tab after 'four'
Here is the way I do it :
I delete leading and trailing whitespaces
I delete leading and trailing tabs
I replace both 'whitespaces repeated at least two' and tabs by a sole whitespace
WITH
t1 AS (SELECT ' one two three four '::TEXT AS mystring),
t2 AS (SELECT TRIM(both ' ' from mystring) AS mystring FROM t1),
t3 AS (SELECT TRIM(both '\t' from mystring) AS mystring FROM t2)
SELECT regexp_replace(mystring, '(( ){2,}|\t+)', ' ', 'g') FROM t3 ;
I eventually get the following string, which looks nice but I still have a trailing whitespace...
'one two three four '
Any idea on doing it in a more simple way and solving this last issue ?
Many thanks !
SELECT trim(regexp_replace(col_name, '\s+', ' ', 'g')) as col_name FROM table_name;
Or In case of update :
UPDATE table_name SET col_name = trim(regexp_replace(col_name, '\s+', ' ', 'g'));
The regexp_replace flags are described in this section of the documentation.
SELECT trim(regexp_replace(mystring, '\s+', ' ', 'g')) as mystring FROM t1;
Posting an answer in case folks don't look at comments.
Use '\s+'
Not '\\s+'
Worked for me.
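The same collapse-then-trim logic can be sketched in Python to see what each step does. squeeze_whitespace is a hypothetical helper, and the sample string is rebuilt from the question's description (2 spaces, 1 space, 4 spaces, 2 tabs, 1 tab), so the exact input is an assumption.

```python
import re

def squeeze_whitespace(text):
    """Collapse every run of whitespace to one space, then trim the ends."""
    collapsed = re.sub(r"\s+", " ", text)
    # After collapsing, at most one plain space remains at each end.
    return collapsed.strip(" ")

sample = "  one two    three\t\tfour\t"
print(repr(squeeze_whitespace(sample)))
```

Trimming after the replacement matters: the leading and trailing tabs have already become single spaces by then, so a plain space trim is enough.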
It didn't work for me with trim and regexp_replace, so I came up with another solution:
SELECT trim(
array_to_string(
regexp_split_to_array(' test with many spaces for this test ', E'\\s+')
, ' ')
) as mystring;
First, regexp_split_to_array splits on every run of whitespace, leaving empty strings at the beginning and the end.
-- regexp_split_to_array output:
-- {"",test,with,many,spaces,for,this,test,""}
When array_to_string is applied, the elements are joined with single spaces
-- array_to_string output ( '_' instead of spaces for viewing ):
-- _test_with_many_spaces_for_this_test_
The trim then removes the leading and trailing space
-- trim output ( '_' instead of spaces for viewing ):
-- test_with_many_spaces_for_this_test
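The three steps above can be traced in Python, where re.split and str.join play the roles of regexp_split_to_array and array_to_string. This is only an analogue of the PostgreSQL calls, and squeeze_by_split is a hypothetical helper.

```python
import re

def squeeze_by_split(text):
    """Split on whitespace runs, rejoin with single spaces, trim the ends."""
    parts = re.split(r"\s+", text)   # empty strings appear at padded ends
    joined = " ".join(parts)         # one space between every pair of elements
    return joined.strip(" ")         # drop the spaces left by the empty ends

print(re.split(r"\s+", "  test with   many "))
print(repr(squeeze_by_split("  test with   many ")))
```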