I'm using ANTLR with Presto grammar in order to parse SQL queries.
This is the original string definition I've used to parse queries:
STRING
: '\'' ( '\\' .
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
This worked ok for most queries until I saw queries with different escaping rules. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
So I've modified my String definition and now it looks like:
STRING
: '\'' ( '\\' .
| '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
However, this won't work for the query mentioned above as I'm getting
'\\'',''),'
as a single string.
The predicate returns true for the query above.
Any idea how can I handle this query as well?
Thanks,
Nir.
In the end I was able to solve it. This is the expression I was using:
STRING
: '\'' ( '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}?
| '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
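To see why the original rule returns '\\'',''),' as one token while the final rule does not, here is a minimal Python sketch that approximates both rules with regular expressions. It is only an illustration under stated assumptions: Python's backtracking `re` engine is not ANTLR's lexer, the predicate HelperUtils.isNeedSpecialEscaping is assumed to always return true, and the names ORIGINAL_RULE, SPECIAL_RULE and string_literals are made up for this example.

```python
import re

# Approximation of the original STRING rule: backslash + any char,
# any char other than \ and ', or a doubled quote.
ORIGINAL_RULE = re.compile(r"'(?:\\.|[^\\']|'')*'")

# Approximation of the final rule with the predicate ASSUMED to be true:
# two backslashes + any char, or one backslash + a non-backslash char.
SPECIAL_RULE = re.compile(r"'(?:\\\\.|\\[^\\]|[^\\']|'')*'")

# The problematic line from the question, kept verbatim in a raw string.
QUERY = r"""table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features"""


def string_literals(rule, text):
    """Return every non-overlapping match of `rule` in `text`, left to right."""
    return rule.findall(text)


original_tokens = string_literals(ORIGINAL_RULE, QUERY)
special_tokens = string_literals(SPECIAL_RULE, QUERY)

# First literal found by the original rule (the over-long string from the question).
print(original_tokens[0])
# Literals found once \\ followed by a character is treated as one unit.
for token in special_tokens:
    print(token)
```

The original rule reads \\ as an escaped backslash and the following '' as an escaped quote, so the literal keeps growing. Once \\' is consumed as a single unit, the next quote can only close the string.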
grammar Question;
sql
@init {System.out.println("Question last update 2352");}
: replace+ EOF
;
replace
: REPLACE '(' expr ')'
;
expr
: ( replace | ID ) ',' STRING ',' STRING
;
REPLACE : 'replace' DIGIT? ;
ID : [a-zA-Z0-9_]+ ;
DIGIT : [0-9] ;
STRING : '\'' '\\\\\'' '\'' // '\\''
| '\'' '\'\'' '\'' // ''''
| '\'' ~[\\']* '\'\'' ~[\\']* '\'' // 'it is 8 o''clock'
| '\'' .*? '\'' ;
NL : '\r'? '\n' -> channel(HIDDEN) ;
WS : [ \t]+ -> channel(HIDDEN) ;
File input.txt (without more examples, I can only guess):
replace1(replace(some_col,'\\'',''),'\"' ,'')
replace2(some_col,'''','')
replace3(some_col,'abc\tdef\tghi','xyz')
replace4(some_col,'abc\ndef','xyz')
replace5(some_col,'it is 8 o''clock','8')
Execution :
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Question*.java
$ grun Question sql -tokens input.txt
[@0,0:7='replace1',<REPLACE>,1:0]
[@1,8:8='(',<'('>,1:8]
[@2,9:15='replace',<REPLACE>,1:9]
[@3,16:16='(',<'('>,1:16]
[@4,17:24='some_col',<ID>,1:17]
[@5,25:25=',',<','>,1:25]
[@6,26:30=''\\''',<STRING>,1:26]
[@7,31:31=',',<','>,1:31]
[@8,32:33='''',<STRING>,1:32]
[@9,34:34=')',<')'>,1:34]
[@10,35:35=',',<','>,1:35]
[@11,36:39=''\"'',<STRING>,1:36]
[@12,40:40=' ',<WS>,channel=1,1:40]
[@13,41:41=',',<','>,1:41]
[@14,42:43='''',<STRING>,1:42]
[@15,44:44=')',<')'>,1:44]
[@16,45:45='\n',<NL>,channel=1,1:45]
[@17,46:53='replace2',<REPLACE>,2:0]
[@18,54:54='(',<'('>,2:8]
[@19,55:62='some_col',<ID>,2:9]
[@20,63:63=',',<','>,2:17]
[@21,64:67='''''',<STRING>,2:18]
[@22,68:68=',',<','>,2:22]
[@23,69:70='''',<STRING>,2:23]
[@24,71:71=')',<')'>,2:25]
[@25,72:72='\n',<NL>,channel=1,2:26]
[@26,73:80='replace3',<REPLACE>,3:0]
[@27,81:81='(',<'('>,3:8]
[@28,82:89='some_col',<ID>,3:9]
[@29,90:90=',',<','>,3:17]
[@30,91:105=''abc\tdef\tghi'',<STRING>,3:18]
[@31,106:106=',',<','>,3:33]
[@32,107:111=''xyz'',<STRING>,3:34]
[@33,112:112=')',<')'>,3:39]
[@34,113:113='\n',<NL>,channel=1,3:40]
[@35,114:121='replace4',<REPLACE>,4:0]
[@36,122:122='(',<'('>,4:8]
[@37,123:130='some_col',<ID>,4:9]
[@38,131:131=',',<','>,4:17]
[@39,132:141=''abc\ndef'',<STRING>,4:18]
[@40,142:142=',',<','>,4:28]
[@41,143:147=''xyz'',<STRING>,4:29]
[@42,148:148=')',<')'>,4:34]
[@43,149:149='\n',<NL>,channel=1,4:35]
[@44,150:157='replace5',<REPLACE>,5:0]
[@45,158:158='(',<'('>,5:8]
[@46,159:166='some_col',<ID>,5:9]
[@47,167:167=',',<','>,5:17]
[@48,168:185=''it is 8 o''clock'',<STRING>,5:18]
[@49,186:186=',',<','>,5:36]
[@50,187:189=''8'',<STRING>,5:37]
[@51,190:190=')',<')'>,5:40]
[@52,191:191='\n',<NL>,channel=1,5:41]
[@53,192:191='<EOF>',<EOF>,6:0]
Question last update 2352
I would like to trim() a column and replace any run of whitespace and Unicode space separators with a single space. The idea is to sanitize usernames, preventing two users from having deceptively similar names such as foo bar (SPACE, U+0020) vs. foo bar (NO-BREAK SPACE, U+00A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it removes spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added \h to the regexp, but PostgreSQL doesn't support it (nor \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
Is there a way to achieve something similar?
Based on the Posix "space" character-class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
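For checking the same normalization outside the database (for example in application code that validates usernames before insert), a rough Python equivalent might look like the sketch below. The function name normalize_spaces is hypothetical, and Python's \s is not guaranteed to cover exactly the same characters as PostgreSQL's, so treat it as an approximation rather than a drop-in replacement.

```python
import re

# Same explicit code points as the custom class above. Python's \s is
# Unicode-aware for str patterns, so some of them are matched twice over.
SPACE_LIKE = re.compile(r"[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+")


def normalize_spaces(value: str) -> str:
    """Collapse each run of space-like characters to one space, then trim."""
    collapsed = SPACE_LIKE.sub(" ", value)
    # Trim AFTER replacing, so converted characters at the edges are removed too.
    return collapsed.strip(" ")


print(normalize_spaces("foo\u00a0bar") == normalize_spaces("foo bar"))
```

As in the SQL version, the trim happens last so that a leading or trailing NO-BREAK SPACE, once converted, is removed as well.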
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 | | \u00a0 | f | t
5760 | 1680 | | \u1680 | t | t
6158 | 180e | | \u180e | f | t
8192 | 2000 | | \u2000 | t | t
8193 | 2001 | | \u2001 | t | t
8194 | 2002 | | \u2002 | t | t
8195 | 2003 | | \u2003 | t | t
8196 | 2004 | | \u2004 | t | t
8197 | 2005 | | \u2005 | t | t
8198 | 2006 | | \u2006 | t | t
8199 | 2007 | | \u2007 | f | t
8200 | 2008 | | \u2008 | t | t
8201 | 2009 | | \u2009 | t | t
8202 | 200a | | \u200a | t | t
8203 | 200b | | \u200b | f | t
8204 | 200c | | \u200c | f | t
8205 | 200d | | \u200d | f | t
8206 | 200e | | \u200e | f | t
8207 | 200f | | \u200f | f | t
8239 | 202f | | \u202f | f | t
8287 | 205f | | \u205f | t | t
8288 | 2060 | | \u2060 | f | t
12288 | 3000 | | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
Related, with more explanation:
Trim trailing spaces with PostgreSQL
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespace characters (matched by \h in other regex flavors that support it) with a regular space char.
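One way to cross-check the \p{Zs} part of that list is to enumerate the category from a Unicode database. The sketch below uses Python's unicodedata module; the function name space_separators is made up for this example, and the result depends on the Unicode version bundled with the Python interpreter, which need not match the one PostgreSQL relies on.

```python
import sys
import unicodedata


def space_separators():
    """Return all code points whose general category is Zs (space separator)."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


zs_code_points = space_separators()
# Print in U+XXXX form so the list can be compared with the bracket expression.
print(" ".join(f"U+{cp:04X}" for cp in zs_code_points))
```

The tab (U+0009) does not appear in the output because it is a control character, not a space separator, which is why the bracket expression adds it separately.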
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze runs of two or more spaces into one
UPDATE mytable SET text=regexp_replace(text, ' {2,}', ' ', 'g') WHERE text ~ ' {2,}';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';
For regexp_like running on Oracle Database 11g, I want a pattern that matches strings that do not start with AM or AP. The string is usually a few letters followed by an underscore and then other letters or underscores.
For example :
String : AM_HTCEVOBLKHS_BX [false]
String : AP_HTCEVOBLKHSPBX [false]
String : BM_HTCEVOBLKHS_BX [true]
String : A_HTCEVODSAP_DSSD [true]
String : A_HTCEVOB_A_CDSED [true]
String : MP_HTCEVOBLKHS_BX [true]
Can you help me build this pattern?
My current solution doesn't work:
BEGIN
IF regexp_like('AM_HTCEVOBLKHS_BX','[^(AM)(AP)]+_.*') THEN
dbms_output.put_line('TRUE');
ELSE
dbms_output.put_line('FALSE');
END IF;
END;
/
Why do you need a regexp? Why not use a simple substr?
with t1 as
(select 'AM_HTCEVOBLKHS_BX' as f1
from dual
union all
select 'AP_HTCEVOBLKHSPBX'
from dual
union all
select 'BM_HTCEVOBLKHS_BX'
from dual
union all
select 'A_HTCEVODSAP_DSSD'
from dual
union all
select 'A_HTCEVOB_A_CDSED'
from dual
union all
select 'MP_HTCEVOBLKHS_BX' from dual
union all
select null from dual
union all
select '1' from dual)
select f1,
case
when substr(f1, 1, 2) in ('AM', 'AP') then
'false'
else
'true'
end as check_result
from t1
If you have a table of patterns then:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE strings ( string ) AS
SELECT 'AM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'AP_HTCEVOBLKHSPBX' FROM DUAL
UNION ALL SELECT 'BM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'A_HTCEVODSAP_DSSD' FROM DUAL
UNION ALL SELECT 'A_HTCEVOB_A_CDSED' FROM DUAL
UNION ALL SELECT 'MP_HTCEVOBLKHS_BX' FROM DUAL;
CREATE TABLE patterns ( pattern ) AS
SELECT '^AM' FROM DUAL
UNION ALL SELECT '^AP' FROM DUAL;
Query 1:
-- Negative Matches:
SELECT string
FROM strings s
LEFT OUTER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
WHERE p.pattern IS NULL
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 2:
-- Positive Matches:
SELECT DISTINCT
string
FROM strings s
INNER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 3:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string,
( SELECT LISTAGG( pattern, '|' ) WITHIN GROUP ( ORDER BY NULL )
FROM patterns )
)
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings s
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
If you want to pass the pattern as a single string then:
Query 4:
-- Negative Matches:
SELECT string
FROM strings
WHERE NOT REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 5:
-- Positive Matches:
SELECT string
FROM strings
WHERE REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 6:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string, '^(AM|AP)' )
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
Try this:
^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$
The idea is to describe all possible starts of the string.
( # a group
[B-Z][A-Z]* # the first character is not an "A"
| # OR
A[A-LNOQ-Z]? # a single "A", or an "A" followed by a letter other than "P" or "M"
| # OR
A[A-Z]{2,} # an "A" followed by more than one letter
) # close the group
^ and $ are anchors and mean "start of the string" and "end of the string".
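The pattern can be checked against the sample strings from the question with a quick Python sketch. This assumes Python's `re` module treats these simple constructs the same way Oracle's regular expressions do; the names STARTS_OK and SAMPLES are only for this example.

```python
import re

# The pattern from above, unchanged.
STARTS_OK = re.compile(r"^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$")

# Sample strings and the expected results, copied from the question.
SAMPLES = {
    "AM_HTCEVOBLKHS_BX": False,
    "AP_HTCEVOBLKHSPBX": False,
    "BM_HTCEVOBLKHS_BX": True,
    "A_HTCEVODSAP_DSSD": True,
    "A_HTCEVOB_A_CDSED": True,
    "MP_HTCEVOBLKHS_BX": True,
}

results = {s: STARTS_OK.search(s) is not None for s in SAMPLES}
for sample, matched in results.items():
    print(sample, matched)
```

Note that a string such as AMX_FOO also matches, because the pattern only rejects a prefix before the first underscore that is exactly AM or AP.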
I think you need just this:
not regexp_like( field, '^(AM_)|^(AP_)' )
Because REGEXP_LIKE returns true as soon as the pattern matches anywhere in the string, the anchored prefix is all the regular expression needs to describe.
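A short Python sketch can show why the attempt in the question reports TRUE and how negating an anchored prefix behaves instead. It assumes that Python's `re` and Oracle's regular expressions agree on these basic constructs; BROKEN, PREFIX and starts_with_am_or_ap are hypothetical names.

```python
import re

# Inside [...], parentheses are ordinary characters: this is the negated
# character set { '(', 'A', 'M', ')', 'P' }, not "neither AM nor AP".
BROKEN = re.compile(r"[^(AM)(AP)]+_.*")

# Anchored alternation: true only when the string begins with AM or AP.
PREFIX = re.compile(r"^(AM|AP)")


def starts_with_am_or_ap(value: str) -> bool:
    return PREFIX.search(value) is not None


broken_hit = BROKEN.search("AM_HTCEVOBLKHS_BX")
# The unanchored pattern simply finds a match further into the string.
print(broken_hit.group(0))
print(not starts_with_am_or_ap("AM_HTCEVOBLKHS_BX"))
print(not starts_with_am_or_ap("BM_HTCEVOBLKHS_BX"))
```

The broken pattern is not anchored, so it skips the leading AM and matches the remainder. Negating the anchored test avoids having to describe every allowed prefix.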