How to compare Unicode characters in SQL Server? - regex

I am trying to find all rows in my database (SQL Server) that contain the character é in their text by executing the following queries.
SELECT COUNT(*) FROM t_question WHERE patindex(N'%[\xE9]%',question) > 0;
SELECT COUNT(*) FROM t_question WHERE patindex(N'%[\u00E9]%',question) > 0;
But I found two problems: (a) the two queries return different numbers of rows, and (b) both return rows that do not contain the specified character.
Is the way I am constructing the regular expression and comparing the Unicode correct?
EDIT:
The question column is stored using datatype nvarchar.
The following query gives the correct result though.
SELECT COUNT(*) FROM t_question WHERE question LIKE N'%é%';

Why not use SELECT COUNT(*) FROM t_question WHERE question LIKE N'%é%'?
NB: LIKE and PATINDEX do not accept regular expressions.
In the SQL Server pattern syntax, [\xE9] means: match any single character within the specified set, i.e. match \, x, E or 9. Likewise, [\u00E9] matches \, u, 0, E or 9, which is why the two queries return different counts. So any of the following strings would match the first pattern.
"Elephant"
"axis"
"99.9"

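To see why both queries over-match and why their counts differ, here is a rough Python model of how a plain bracket set in a LIKE/PATINDEX pattern behaves. The helper name matches_bracket_set is made up for illustration, and the model assumes a case-sensitive comparison; under a case-insensitive collation even more rows would match.

```python
def matches_bracket_set(text: str, set_body: str) -> bool:
    """Model of PATINDEX('%[set_body]%', text) > 0 for a plain set.

    Every character between the brackets is a literal member of the set;
    escapes such as \\xE9 or \\u00E9 are not interpreted.
    Ranges (a-z) and negation (^) are deliberately not modeled.
    """
    members = set(set_body)
    return any(ch in members for ch in text)


hex_set = r"\xE9"        # members: \ x E 9
unicode_set = r"\u00E9"  # members: \ u 0 E 9

for sample in ["Elephant", "axis", "99.9", "100 units", "café"]:
    print(sample,
          matches_bracket_set(sample, hex_set),
          matches_bracket_set(sample, unicode_set))
```

In this model "axis" matches only the first set, "100 units" only the second, and "café" matches neither, which mirrors both symptoms described in the question.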
Related

How to remove whitespaces from string in Redshift?

I've been trying to join two tables 'A' and 'B' using a column, say 'Col1'. The problem I'm facing is that the data in the two columns is in different formats. For example: 'A - Air' comes in as 'A-Air', 'B - Air' comes in as 'B-Air', etc.
Therefore, I'm trying to remove the whitespace from the data in Col1 of A, but I'm not able to remove it using any function given in the AWS documentation. I've tried TRIM and REPLACE, but they won't work in this case. This might be achievable using regular expressions, but I'm not able to find out how. Below is a snippet of how I tried using a regex, which didn't work.
select Col1, regexp_replace( Col1, '#.*\\.( )$')
from A
WHERE
date = TO_DATE('2020/08/01', 'YYYY/MM/DD')
limit 5
Please let me know how I can remove the spaces from a string using regular expressions or any other means in Redshift.
Col1, regexp_replace( Col1,'\\s','')
This worked for me.
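For anyone who wants to check the effect of the '\\s' pattern outside the database, a minimal Python sketch of the same idea follows. The function name strip_whitespace is hypothetical, and Python's re module is only standing in for Redshift's regexp_replace here.

```python
import re


def strip_whitespace(value: str) -> str:
    # Same idea as regexp_replace(Col1, '\\s', ''):
    # drop every whitespace character, wherever it occurs.
    return re.sub(r"\s", "", value)


print(strip_whitespace("A - Air"))
print(strip_whitespace("B - Air"))
```

After this normalization the values from table A line up with the 'A-Air' style values in table B, so the join key compares equal.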

Extract string after last match strings [duplicate]

I'm using BigQuery and I want to extract the string after the last occurrence of a specific match string; in my case, the match string is sc.
I have a string like this :
www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)
My expected result is:
titis)
Is this possible?
In general, across all RDBMS, the index of the last instance of a match in a string is easy to compute by first reversing the string. Then we only need to look for the first match.
Update: BigQuery
Follow the documentation for REGEXP_EXTRACT in the String Functions documentation for BigQuery
NOTE: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.
However, this problem can be solved without RegEx.
BigQuery supports array processing and has a SPLIT function, so you could split by the lookup variable and capture only the last result:
SELECT ARRAY_REVERSE(SPLIT( !YOUR COLUMN HERE! , "sc"))[OFFSET(0)]
Note that OFFSET is zero-based, so OFFSET(0) on the reversed array is the last piece of the split.
The following adaptation from my original submission may still work:
SELECT REVERSE(SUBSTR(REVERSE(@text), 1, STRPOS(REVERSE(@text), "cs") -1))
For those who have a similar requirement in MS SQL Server, the following syntax can be used.
Other RDBMS can use a similar query; you will have to use the appropriate platform functions to achieve the result.
DECLARE @text varchar(200) = 'www.xxss.com?psct=T-EST2%20.coms&.com/u[sc''sc(mascscin'', sc''.c(scscossccnfiscg.scjs'']-/ci=1(sctitis)'
SELECT REVERSE(LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text),1) -1))
Produces: titis)
You could achieve a similar result by obtaining the last index of 'sc' as above and passing that value to SUBSTRING; however, for that to work you need to recompute the length. This solution instead uses LEFT and then reverses the result, which saves one function call.
Step this through:
Reverse the value:
SELECT REVERSE(@text)
Results in:
)sititcs(1=ic/-]'sjcs.gcsifnccssocscs(c.'cs ,'nicscsam(cs'cs[u/moc.&smoc.02%2TSE-T=tcsp?moc.ssxx.www
Now we find the first index of 'cs'.
Note: we have to reverse the sequence of the lookup string as well!
SELECT CharIndex('cs', REVERSE(@text),1)
Result: 7
Select the characters before this index:
Note: we must use -1 here because CharIndex returns the 1-based position of the first character of the match, so we reduce it by 1 to keep only the characters before it.
SELECT LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text),1) -1)
Finally, we reverse the result:
SELECT REVERSE(LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text),1) -1))
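The same reverse-then-search steps can be sketched in Python to check the logic independently of any particular database. The function name text_after_last is hypothetical, and returning the whole string when the marker is absent is my own assumption (the SQL above does not handle that case). Python's find is 0-based, so no -1 adjustment is needed here.

```python
def text_after_last(text: str, marker: str) -> str:
    reversed_text = text[::-1]       # step 1: reverse the value
    reversed_marker = marker[::-1]   # the lookup string must be reversed too
    pos = reversed_text.find(reversed_marker)  # 0-based index, -1 if absent
    if pos == -1:
        return text                  # assumption: no marker, keep everything
    # Take the characters before the match, then reverse them back.
    return reversed_text[:pos][::-1]


sample = ("www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', "
          "sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)")
print(text_after_last(sample, "sc"))
```

For the sample string from the question this prints titis), the same value the T-SQL query produces.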
I guess you could use 'sc' as the separator and, if the string length is constant, define the string length in your query (wildcard):
STRING_SPLIT ( string , separator )

Vertica REGEXP_SUBSTR use /g flag

I am trying to extract all occurrences of a word before '=' in a string. I tried the regex '/\w+(?=\=)/g', but it returns null. When I remove the leading '/' and the trailing '/g', it returns only one occurrence, which is why I need the global flag. Any suggestions?
As Wiktor pointed out, by default, you only get the first string in a REGEXP_SUBSTR() call. But you can get the second, third, fourth, etc.
Embedded into SQL, you need to treat regular expressions differently from the way you would treat them in perl, for example. The pattern is just the pattern, modifiers go elsewhere, you can't use $n to get the n-th captured sub-expression, and you need to proceed in a specific way to get the n-th match of a pattern, etc.
The trick is to CROSS JOIN your queried table with an in-line created index table, consisting of as many consecutive integers as you expect occurrences of your pattern - and a few more for safety. And Vertica's REGEXP_SUBSTR() call allows for additional parameters to do that. See this example:
WITH
-- one exemplary input row; concatenating substrings for
-- readability
input(s) AS (
SELECT 'DRIVER={Vertica};COLUMNSASCHAR=1;CONNECTIONLOADBALANCE=True;'
||'CONNSETTINGS=set+search_path+to+public;DATABASE=sbx;'
||'LABEL=dbman;PORT=5433;PWD=;SERVERNAME=127.0.0.1;UID=dbadmin;'
)
,
-- an index table to CROSS JOIN with ... maybe you need more integers ...
loop_idx(i) AS (
SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
UNION SELECT 10
)
,
-- the query containing the REGEXP_SUBSTR() call
find_token AS (
SELECT
i -- the index from the in-line index table, needed
-- for ordering the outermost SELECT
, REGEXP_SUBSTR (
s -- the input string
, '(\w+)=' -- the pattern - a word followed by an equal sign; capture the word
, 1 -- start from pos 1
, i -- the i-th occurrence of the match
, '' -- no modifiers to regexp
, 1 -- the first and only sub-pattern captured
) AS token
FROM input CROSS JOIN loop_idx -- the CROSS JOIN with the in-line index table
)
-- the outermost query filtering the non-matches - the empty strings - away...
SELECT
token
FROM find_token
WHERE token <> ''
ORDER BY i
;
The result will be one row per found pattern:
token
DRIVER
COLUMNSASCHAR
CONNECTIONLOADBALANCE
CONNSETTINGS
DATABASE
LABEL
PORT
PWD
SERVERNAME
UID
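If you want to sanity-check which tokens the pattern should find before running it in Vertica, the occurrence-by-occurrence lookup can be mimicked in Python. This is only a sketch: nth_token is a made-up stand-in for the REGEXP_SUBSTR() call above, a missing occurrence is modeled as an empty string to match the filter in the outermost query, and the loop deliberately runs to 12 to show the non-matches being dropped.

```python
import re

CONNECTION_STRING = (
    "DRIVER={Vertica};COLUMNSASCHAR=1;CONNECTIONLOADBALANCE=True;"
    "CONNSETTINGS=set+search_path+to+public;DATABASE=sbx;"
    "LABEL=dbman;PORT=5433;PWD=;SERVERNAME=127.0.0.1;UID=dbadmin;"
)


def nth_token(s: str, occurrence: int) -> str:
    """Rough stand-in for REGEXP_SUBSTR(s, '(\\w+)=', 1, occurrence, '', 1)."""
    matches = re.findall(r"(\w+)=", s)  # captured word of every match, in order
    return matches[occurrence - 1] if occurrence <= len(matches) else ""


# The range plays the role of the loop_idx table; 11 and 12 find nothing.
tokens = [t for t in (nth_token(CONNECTION_STRING, i) for i in range(1, 13))
          if t != ""]
for token in tokens:
    print(token)
```

The ten tokens printed are the same ten rows as in the result above, in the same order.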
You can do all sorts of things in modern SQL - but you need to stick to the SQL and to the relational paradigm - that's all ...
Happy playing ...
Marco

Find a string with or without space in oracle using like or regex

I have a string that contains a specific 'winner code' which needs to be matched exactly, but in the database some records contain spaces and extra characters within the winner code, and if I use the LIKE operator it only returns the rows that match the pattern exactly. I want one simple query that returns all the records that contain the winner code. Please find my query and details below.
Winner code - أ4 ب3 ج10
Records with spaces - أ4 ب 3 ج 10
Records with extra character - (أ(4)
ب(3)
ج(10
My Query -
SELECT COLUMN_NAME
FROM TABLE_NAME
WHERE
((COLUMN_NAME LIKE '%أ4%ب3%ج10%') or(COLUMN_NAME LIKE '%أ 4%ب 3%ج 10%'))
The above query returns the rows with and without spaces, as they match the criteria.
Thanks
If I correctly understand your need, you may try :
with test(str) as (
select '10X3Y4Z' from dual union all
select '10 X 3 Y 4 Z' from dual union all
select '(10)X(3)Y(4)Z' from dual union all
select '10#X3Y4 Z' from dual union all
select '10 # X3Y4Z' from dual )
select str
from test
where regexp_instr(str, '10[ |\)]{0,1}X[ |\(]{0,1}3[ |\)]{0,1}Y[ |\(]{0,1}4[ |\)]{0,1}Z') != 0
This matches your "winner code" ( I used different characters to simplify my test) even if the numbers are surrounded by '()' or a single space.
This can be re-written in a more compact way, but I believe this form is clear enough; it uses regular expressions like [ |\)]{0,1} to match a space or a parenthesis, with zero or one occurrence.
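One possible compact form, sketched in Python so it can be tried quickly: inside a bracket expression the | is a literal character rather than an alternation, so [ |\)] also accepts a literal |. The version below drops it and uses ? instead of {0,1}. That simplification is my assumption about the intent, not something the original query states, and the name WINNER_PATTERN is made up.

```python
import re

# Optional single space or parenthesis between the parts of the code.
WINNER_PATTERN = re.compile(r"10[ )]?X[ (]?3[ )]?Y[ (]?4[ )]?Z")

samples = [
    "10X3Y4Z",
    "10 X 3 Y 4 Z",
    "(10)X(3)Y(4)Z",
    "10#X3Y4 Z",
    "10 # X3Y4Z",
]

# Keep only the rows where the pattern is found anywhere in the string,
# like regexp_instr(...) != 0 in the query above.
matching = [s for s in samples if WINNER_PATTERN.search(s)]
for s in matching:
    print(s)
```

The first three samples match and the two containing # do not, because # is neither a space nor a parenthesis.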

In Oracle, how do I select rows which contain a character within a certain numeric range?

I have a table in Oracle with a VARCHAR column called DESCRIPTION. Some of the rows contain non-printable characters such as the character with numeric value 150 (which is not in Latin-1 and is "Start of Protected Area" in Unicode).
I want to select all the rows whose DESCRIPTION columns contain a character whose numeric value is between 128 and 160. Is there a way to do this without a long list of LIKE clauses OR'ed together? I suppose it can be done with regular expressions, but I haven't found a way to do it.
I had to do something very like this recently and used some SQL like this:
with codes as (select rownum code from dual connect by level <= 160)
select distinct t.id, t.description
from mytable t, codes c
where t.description like '%' || chr(c.code) || '%'
and c.code >= 128;
Vincent's post helped me a lot with this problem! I wanted to find all rows that had any extended ASCII: 128-255, so I shortened the statement to this:
SELECT description
FROM your_table
WHERE regexp_like (description, '['||chr(128)||'-'||chr(255)||']');
Short way to grab a range.
You could use a regular expression; it may perform better than 30+ separate LIKE conditions, but it won't be much prettier:
SELECT *
FROM your_table
WHERE regexp_like(description, '['||chr(128)||chr(129)||...||chr(160)||']')
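To make the condition itself explicit, here is a small Python sketch of the check all three queries are aiming for: does the text contain at least one character whose code falls in a given range? The function name has_char_in_range is hypothetical, and the sketch works on Unicode code points, whereas Oracle's chr() depends on the database character set, so treat it as a model that assumes a single-byte character set rather than an exact equivalent.

```python
def has_char_in_range(text: str, low: int = 128, high: int = 160) -> bool:
    # True if any character's numeric value lies in [low, high], inclusive.
    return any(low <= ord(ch) <= high for ch in text)


rows = ["plain description", "bad" + chr(150) + "dash", "caf" + chr(233)]
for row in rows:
    print(repr(row), has_char_in_range(row), has_char_in_range(row, 128, 255))
```

With the default bounds only the row containing character 150 is flagged; widening the upper bound to 255 also flags the row containing character 233.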