How to replace multiple sub occurrences of string - regex

I have one use case where I can have a text string which can contain anything. What I want to achieve is to replace a certain pattern within that given string.
Let's say I have given string as
:es1
es2
aaes1aa
:es3,
es1:
ees1,
ees1
{
"es1 :
What I am trying to do is replace every es1 in this string, but with one condition: it has to either start or end with one of [, | ; | : | " | ' | \\ | \s]. :es1, es1,, es1: and so on are accepted, but eees1sss is not.
I tried ([, | ; | : | " | ' | \\ | \s])(es1)([, | ; | : | " | ' | , | \s]) something like this but I don't think it's what I need.
Go program:
match := regexp.MustCompile(`([, | ; | : | " | ' | \\ | \s])(es1)([, | ; | : | " | ' | , | \s])`)
test := `:es1
es2
aaes1aa
:es3,
es1:
ees1,
ees1
{
"es1 :`
fmt.Println(match.ReplaceAllString(test, "$1es4$3"))
output:
es2
aaes1aa
:es3,
:
ees1,
ees1
{
:
I was expecting my output to be more like
:es4
es2
aaes1aa
:es3,
es4:
ees1,
ees1
{
"es4 :

The solution provided below is not well tested against all possibilities, but it seems to work.
package main
import (
"fmt"
"regexp"
)
func main() {
match := regexp.MustCompile(`([, | ; | : | " | ' | \\ | \s])es1|^es1([, | ; | : | " | ' | , | \s])`)
test := `:es1
es2
aaes1aa
:es3,
es1:
ees1,
ees1
{
"es1 :`
fmt.Println(match.ReplaceAllString(test, "${1}es4${2}"))
}
https://play.golang.org/p/E8lb9vmM_Sa

You can use
package main
import (
"fmt"
"regexp"
)
func main() {
match := regexp.MustCompile(`([,;:"'\\\s])es1\b|\bes1([,;:"'\\\s])`)
test := ":es1\n es2\n aaes1aa\n :es3,\n es1:\n ees1,\n ees1 \n \n {\n \"es1 :"
fmt.Println(match.ReplaceAllString(test, "${1}es4$2"))
}
See the Go demo and the regex demo. Note that spaces and | characters inside square brackets are meaningful: they match those characters literally, so you need to remove them all from your pattern.
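To see why the spaces and pipes matter, here is a minimal Python sketch. It assumes only that Python's re module treats bracket expressions like Go's regexp does on this point; the sample string and the helper name matched_chars are made up for illustration.

```python
import re

# Inside [...] every character is a set member: '|' and ' ' are literals,
# not alternation or padding.
padded = re.compile(r"[, | ;]")  # matches ',', ' ', '|' or ';'
tight = re.compile(r"[,;]")      # matches only ',' or ';'


def matched_chars(pattern, text):
    """Return the distinct characters of text that the class matches, sorted."""
    return sorted(set(pattern.findall(text)))


if __name__ == "__main__":
    sample = "a|b c,d;e"
    print(matched_chars(padded, sample))
    print(matched_chars(tight, sample))
```

The padded class also picks up the pipe and the space from the sample, which is why the original pattern matched more delimiters than intended.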
The regex matches:
([,;:"'\\\s])es1\b - Group 1: a comma, or a semi-colon, colon, double or single quotation mark, backslash or whitespace; then es1 as a whole word (\b is a word boundary)
| - or
\bes1 - a whole word es1
([,;:"'\\\s]) - Group 2: a comma, or a semi-colon, colon, double or single quotation mark, backslash or whitespace
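For comparison, the same pattern can be tried outside Go. The following is a rough Python sketch, not part of the original answer; the names ES1, replace_es1 and SAMPLE are illustrative, and it relies on Python 3.5+ expanding an unmatched group to an empty string.

```python
import re

# Same pattern as the Go answer. The triple-quoted raw string lets both
# quote characters appear unescaped inside the character class.
ES1 = re.compile(r"""([,;:"'\\\s])es1\b|\bes1([,;:"'\\\s])""")

SAMPLE = ':es1\nes2\naaes1aa\n:es3,\nes1:\nees1,\nees1\n{\n"es1 :'


def replace_es1(text, new="es4"):
    # \g<1> and \g<2> play the role of Go's ${1} and ${2}. A group that did
    # not take part in the match expands to "" (Python 3.5+).
    # Assumes `new` contains no backslashes, since it is spliced into the template.
    return ES1.sub(r"\g<1>" + new + r"\g<2>", text)


if __name__ == "__main__":
    print(replace_es1(SAMPLE))
```

Whichever delimiter was captured is written back next to the replacement, so only the es1 part changes.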

Related

Handling different escaping sequences?

I'm using ANTLR with Presto grammar in order to parse SQL queries.
This is the original string definition I've used to parse queries:
STRING
: '\'' ( '\\' .
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
This worked ok for most queries until I saw queries with different escaping rules. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
So I've modified my String definition and now it looks like:
STRING
: '\'' ( '\\' .
| '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
However, this won't work for the query mentioned above as I'm getting
'\\'',''),'
as a single string.
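The over-match can be reproduced outside ANTLR. The sketch below is only a regex approximation of the original STRING rule (without the predicate alternative), written in Python; SQL_STRING and first_string are hypothetical names, and a backtracking regex is not guaranteed to behave like an ANTLR lexer in every case.

```python
import re

# Regex approximation of the original lexer rule:
#   '\'' ( '\\' . | ~[\\'] | '\'\'' )* '\''
SQL_STRING = re.compile(r"'(?:\\.|''|[^\\'])*'")


def first_string(sql):
    """Return the first string literal this approximation matches, or None."""
    m = SQL_STRING.search(sql)
    return m.group(0) if m else None


if __name__ == "__main__":
    fragment = r"""replace(some_col,'\\'',''),'\"' ,'')"""
    print(first_string(fragment))
```

Here the two backslashes are consumed as one escape pair, so the quote that follows no longer closes the literal and the match runs on through the next arguments.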
The predicate returns True for the following query.
Any idea how can I handle this query as well?
Thanks,
Nir.
In the end I was able to solve it. This is the expression I was using:
STRING
: '\'' ( '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}?
| '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
grammar Question;
sql
@init {System.out.println("Question last update 2352");}
: replace+ EOF
;
replace
: REPLACE '(' expr ')'
;
expr
: ( replace | ID ) ',' STRING ',' STRING
;
REPLACE : 'replace' DIGIT? ;
ID : [a-zA-Z0-9_]+ ;
DIGIT : [0-9] ;
STRING : '\'' '\\\\\'' '\'' // '\\''
| '\'' '\'\'' '\'' // ''''
| '\'' ~[\\']* '\'\'' ~[\\']* '\'' // 'it is 8 o''clock'
| '\'' .*? '\'' ;
NL : '\r'? '\n' -> channel(HIDDEN) ;
WS : [ \t]+ -> channel(HIDDEN) ;
File input.txt (not having more examples, I can only guess) :
replace1(replace(some_col,'\\'',''),'\"' ,'')
replace2(some_col,'''','')
replace3(some_col,'abc\tdef\tghi','xyz')
replace4(some_col,'abc\ndef','xyz')
replace5(some_col,'it is 8 o''clock','8')
Execution :
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Question*.java
$ grun Question sql -tokens input.txt
[#0,0:7='replace1',<REPLACE>,1:0]
[#1,8:8='(',<'('>,1:8]
[#2,9:15='replace',<REPLACE>,1:9]
[#3,16:16='(',<'('>,1:16]
[#4,17:24='some_col',<ID>,1:17]
[#5,25:25=',',<','>,1:25]
[#6,26:30=''\\''',<STRING>,1:26]
[#7,31:31=',',<','>,1:31]
[#8,32:33='''',<STRING>,1:32]
[#9,34:34=')',<')'>,1:34]
[#10,35:35=',',<','>,1:35]
[#11,36:39=''\"'',<STRING>,1:36]
[#12,40:40=' ',<WS>,channel=1,1:40]
[#13,41:41=',',<','>,1:41]
[#14,42:43='''',<STRING>,1:42]
[#15,44:44=')',<')'>,1:44]
[#16,45:45='\n',<NL>,channel=1,1:45]
[#17,46:53='replace2',<REPLACE>,2:0]
[#18,54:54='(',<'('>,2:8]
[#19,55:62='some_col',<ID>,2:9]
[#20,63:63=',',<','>,2:17]
[#21,64:67='''''',<STRING>,2:18]
[#22,68:68=',',<','>,2:22]
[#23,69:70='''',<STRING>,2:23]
[#24,71:71=')',<')'>,2:25]
[#25,72:72='\n',<NL>,channel=1,2:26]
[#26,73:80='replace3',<REPLACE>,3:0]
[#27,81:81='(',<'('>,3:8]
[#28,82:89='some_col',<ID>,3:9]
[#29,90:90=',',<','>,3:17]
[#30,91:105=''abc\tdef\tghi'',<STRING>,3:18]
[#31,106:106=',',<','>,3:33]
[#32,107:111=''xyz'',<STRING>,3:34]
[#33,112:112=')',<')'>,3:39]
[#34,113:113='\n',<NL>,channel=1,3:40]
[#35,114:121='replace4',<REPLACE>,4:0]
[#36,122:122='(',<'('>,4:8]
[#37,123:130='some_col',<ID>,4:9]
[#38,131:131=',',<','>,4:17]
[#39,132:141=''abc\ndef'',<STRING>,4:18]
[#40,142:142=',',<','>,4:28]
[#41,143:147=''xyz'',<STRING>,4:29]
[#42,148:148=')',<')'>,4:34]
[#43,149:149='\n',<NL>,channel=1,4:35]
[#44,150:157='replace5',<REPLACE>,5:0]
[#45,158:158='(',<'('>,5:8]
[#46,159:166='some_col',<ID>,5:9]
[#47,167:167=',',<','>,5:17]
[#48,168:185=''it is 8 o''clock'',<STRING>,5:18]
[#49,186:186=',',<','>,5:36]
[#50,187:189=''8'',<STRING>,5:37]
[#51,190:190=')',<')'>,5:40]
[#52,191:191='\n',<NL>,channel=1,5:41]
[#53,192:191='<EOF>',<EOF>,6:0]
Question last update 2352

Remove all Unicode space separators in PostgreSQL?

I would like to trim() a column and replace any run of white space and Unicode space separators with a single space. The idea is to sanitize usernames, preventing two users from having deceptively similar names such as foo bar (SPACE U+20) vs foo bar (NO-BREAK SPACE U+A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it removes spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added to the regexp \h, but PostgreSQL doesn't support it (neither \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
Is there a way to achieve something similar?
Based on the Posix "space" character-class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
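If the same sanitizing also has to happen on the application side, the idea carries over. This is a minimal Python sketch under the assumption that the same code points are wanted; SPACE_LIKE and normalize_spaces are illustrative names. Note that Python's \s is not identical to PostgreSQL's (for str patterns it already covers characters such as NO-BREAK SPACE), so the explicit escapes mostly matter for the zero-width and format characters.

```python
import re

# Same code points as the custom class above. The re module itself
# understands \uXXXX escapes inside a raw-string pattern.
SPACE_LIKE = re.compile(r"[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+")


def normalize_spaces(name):
    """Collapse each run of space-like characters to one space, then trim."""
    # Trimming happens after the substitution so converted spaces are covered.
    return SPACE_LIKE.sub(" ", name).strip(" ")


if __name__ == "__main__":
    print(normalize_spaces("foo\u00a0bar") == normalize_spaces("foo bar"))
```

As in the SQL version, the trim comes last so that leading or trailing characters that were converted to spaces are removed as well.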
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 |   | \u00a0 | f | t
5760 | 1680 |   | \u1680 | t | t
6158 | 180e | ᠎ | \u180e | f | t
8192 | 2000 |   | \u2000 | t | t
8193 | 2001 |   | \u2001 | t | t
8194 | 2002 |   | \u2002 | t | t
8195 | 2003 |   | \u2003 | t | t
8196 | 2004 |   | \u2004 | t | t
8197 | 2005 |   | \u2005 | t | t
8198 | 2006 |   | \u2006 | t | t
8199 | 2007 |   | \u2007 | f | t
8200 | 2008 |   | \u2008 | t | t
8201 | 2009 |   | \u2009 | t | t
8202 | 200a |   | \u200a | t | t
8203 | 200b | ​ | \u200b | f | t
8204 | 200c | ‌ | \u200c | f | t
8205 | 200d | ‍ | \u200d | f | t
8206 | 200e | ‎ | \u200e | f | t
8207 | 200f | ‏ | \u200f | f | t
8239 | 202f |   | \u202f | f | t
8287 | 205f |   | \u205f | t | t
8288 | 2060 | ⁠ | \u2060 | f | t
12288 | 3000 |   | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
db<>fiddle here
Related, with more explanation:
Trim trailing spaces with PostgreSQL
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespace characters (matched by \h in other regex flavors that support it) with a regular space char.
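The list of \p{Zs} members can be cross-checked against a Unicode database. The sketch below uses Python's unicodedata module for that; it is an illustration, not part of the answer, and the result depends on the Unicode version bundled with the interpreter (the assumption here is a Python 3 release recent enough that U+180E is no longer classified as Zs). The helper names are hypothetical.

```python
import sys
import unicodedata


def space_separators():
    """Code points whose general category is Zs in Python's Unicode data."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Zs"]


def bracket_expression(code_points):
    """Build a bracket expression: a tab plus one escape per code point (no ranges)."""
    body = "".join("\\u%04X" % cp for cp in code_points)
    return "[\\u0009" + body + "]"


if __name__ == "__main__":
    print(bracket_expression(space_separators()))
```

The generated expression spells out U+2000 through U+200A one by one instead of using a range, but it covers the same characters as the class above.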
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze multiple spaces into one
UPDATE mytable SET text=regexp_replace(text, '[ ]+ ',' ','g') WHERE text LIKE '% %';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';
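The three steps can be tried on sample strings before running the updates. Below is a rough Python sketch of the same sequence, assuming the same set of code points; BLANKS and clean are made-up names, and the sample inputs are invented.

```python
import re

# The same code points as the pattern above, written as Python string
# escapes so the class contains the characters themselves.
BLANKS = re.compile(
    "[\u0009\u000B\u0088-\u008A\u00A0\u1680\u180E\u2000-\u200F\u202F\u205F"
    "\u2060\u2409\u240A\u2420\u2423\u2800\u3000\uFEFF\U000E0020]"
)


def clean(text):
    text = BLANKS.sub(" ", text)        # 1. every blank becomes U+0020
    text = re.sub(r" {2,}", " ", text)  # 2. squeeze runs of spaces into one
    return text.strip(" ")              # 3. trim leading/trailing spaces


if __name__ == "__main__":
    print(repr(clean("\u2800foo\u3000\u3000bar\U000E0020")))
```

Keeping the steps separate mirrors the three UPDATE statements; the order matters because step 1 can create the runs that step 2 squeezes.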

Remove certain letters in foma

I am trying to write a rule to remove the non-start [a | e | h | i | o | u | w | y] letters in a string. The rule should keep the first letter, but remove given letters in other locations.
For example,
vave -> vv
aeiou -> a
My code is as below:
?* [ a | e | h | i | o | u | w | y ]+:0 ?* [ a | e | h | i | o | u | w | y ]+:0;
However, when applying the rule on vaavaa, it returns
vaav
vava
vava
vav
vava
vava
vav
vvaa
vva
vva
vv
while vv is what I want.
Please share some advice. Thanks!
You may use this regex for search:
(?!^)[aehiouwy]+
and replace it with an empty string ""
RegEx Demo
RegEx Details:
(?!^): Negative lookahead to make sure the match does not begin at the start of the string
[aehiouwy]+: Match one or more of these letters inside [...]
You can use a captured group and alternation
^(.)|[aehiouwy]+
replace by \1
Regex demo
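Both patterns are PCRE-style regexes rather than foma rules, so they can be checked in any engine with lookaheads. Here is a small Python sketch of the two approaches; the function names are illustrative, and the second one relies on Python 3.5+ expanding an unset group to an empty string.

```python
import re

LETTERS = "aehiouwy"


def strip_lookahead(word):
    # (?!^) rejects any match that would begin at position 0.
    return re.sub(r"(?!^)[%s]+" % LETTERS, "", word)


def strip_alternation(word):
    # ^(.) keeps the first character through group 1. In the other branch
    # group 1 is unset and expands to "" (Python 3.5+).
    return re.sub(r"^(.)|[%s]+" % LETTERS, r"\1", word)


if __name__ == "__main__":
    for w in ("vave", "aeiou", "vaavaa"):
        print(w, strip_lookahead(w), strip_alternation(w))
```

Unlike the foma rule in the question, each call returns a single result, because the regex engine always consumes the longest run of listed letters it can.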

Match single character not enclosed by braces

I am making a property list syntax definition file (tmLanguage) for practice. It's in Sublime Text's YAML format, but I'll be using it in VS Code.
I need to match all characters (including unterminated { and }) that are not enclosed by {}.
I have tried using negative lookahead and lookbehind assertions, but that only skips the first and last characters inside the braces.
(?<!{).(?!})
Adding a greedy quantifier to consume all characters just matches the full line.
(?<!{).+(?!})
Adding a lazy quantifier just matches everything except the first character after the {. It also matches {} exactly.
(?<!{).+?(?!})
| Test | Expected Matches |
| ----------------- | ----------------------- |
| `{Ctrl}{Shift}D` | `D` |
| `D{Ctrl}{Shift}` | `D` |
| `{Ctrl}D{Shift}` | `D` |
| `{Ctrl}{Shift}D{` | `D` `{` |
| `{Ctrl}{Shift}D}` | `D` `}` |
| `D}{Ctrl}{Shift}` | `D` `}` |
| `D{{Ctrl}{Shift}` | `D` `{` |
| `{Shift` | `{` `S` `h` `i` `f` `t` |
| `Shift}` | `S` `h` `i` `f` `t` `}` |
| `{}` | `{` `}` |
Sample text file: https://raw.githubusercontent.com/spikespaz/windows-tiledwm/master/hotkeys.conf
Full syntax highlighter:
# [PackageDev] target_format: plist, ext: tmLanguage
---
name: Hotkeys Config
scopeName: source.hks
fileTypes: ["hks", "conf"]
uuid: c4bcacab-0067-43db-a1d7-7a74fffe2989
patterns:
- name: keyword.operator.assignment
  match: \=
- name: constant.language
  match: "null"
- name: constant.numeric.integer
  match: \{(?:Alt|Ctrl|Shift|Super)\}
- name: constant.numeric.float
  match: \{(?:Menu|RMenu|LMenu|Tab|Enter|PgUp|PgDown|Ins|Del|Home|End|PrntScr|Esc|Back|Space|F[0-12]|Up|Down|Left|Right)\}
- name: comment.line
  match: \#.*
...
You can use the following RegEx to match:
(?:{(Ctrl|Shift|Alt)})*
Then simply replace the matches with an empty string; what remains is the set of characters you want.
The regex is self-explanatory, but here's a short description:
It creates a non-capturing group consisting of one of your modifier keys (captured by the inner group) in curly brackets. The asterisk '*' at the right means the group repeats zero or more times; use '+' instead to avoid empty matches.
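The replace-then-read-the-rest idea can be checked against the test table from the question. The following is a minimal Python sketch, assuming only the three modifiers named in the answer (the syntax file lists more); MODIFIERS and unenclosed_chars are hypothetical names, and the braces are escaped for clarity.

```python
import re

# One or more complete {Modifier} tokens; '+' avoids empty matches.
MODIFIERS = re.compile(r"(?:\{(?:Ctrl|Shift|Alt)\})+")


def unenclosed_chars(text):
    """Drop complete {Modifier} tokens and return the leftover characters."""
    return list(MODIFIERS.sub("", text))


if __name__ == "__main__":
    for case in ("{Ctrl}{Shift}D", "D{{Ctrl}{Shift}", "{Shift", "{}"):
        print(case, unenclosed_chars(case))
```

Unterminated braces and the empty {} survive because they never form a complete modifier token.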

PL/SQL: regexp_like for string not start with letters

I need a pattern for regexp_like, running on Oracle Database 11g, that matches strings that do not start with AM or AP. The string is usually a few letters followed by an underscore and then other letters or underscores.
For example :
String : AM_HTCEVOBLKHS_BX [false]
String : AP_HTCEVOBLKHSPBX [false]
String : BM_HTCEVOBLKHS_BX [true]
String : A_HTCEVODSAP_DSSD [true]
String : A_HTCEVOB_A_CDSED [true]
String : MP_HTCEVOBLKHS_BX [true]
Can you help me write this pattern?
My current solution doesn't work:
BEGIN
IF regexp_like('AM_HTCEVOBLKHS_BX','[^(AM)(AP)]+_.*') THEN
dbms_output.put_line('TRUE');
ELSE
dbms_output.put_line('FALSE');
END IF;
END;
/
Why do you need a regexp? Why not use a simple substr?
with t1 as
(select 'AM_HTCEVOBLKHS_BX' as f1
from dual
union all
select 'AP_HTCEVOBLKHSPBX'
from dual
union all
select 'BM_HTCEVOBLKHS_BX'
from dual
union all
select 'A_HTCEVODSAP_DSSD'
from dual
union all
select 'A_HTCEVOB_A_CDSED'
from dual
union all
select 'MP_HTCEVOBLKHS_BX' from dual
union all
select null from dual
union all
select '1' from dual)
select f1,
case
when substr(f1, 1, 2) in ('AM', 'AP') then
'false'
else
'true'
end as check_result
from t1
If you have a table of patterns then:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE strings ( string ) AS
SELECT 'AM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'AP_HTCEVOBLKHSPBX' FROM DUAL
UNION ALL SELECT 'BM_HTCEVOBLKHS_BX' FROM DUAL
UNION ALL SELECT 'A_HTCEVODSAP_DSSD' FROM DUAL
UNION ALL SELECT 'A_HTCEVOB_A_CDSED' FROM DUAL
UNION ALL SELECT 'MP_HTCEVOBLKHS_BX' FROM DUAL;
CREATE TABLE patterns ( pattern ) AS
SELECT '^AM' FROM DUAL
UNION ALL SELECT '^AP' FROM DUAL;
Query 1:
-- Negative Matches:
SELECT string
FROM strings s
LEFT OUTER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
WHERE p.pattern IS NULL
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 2:
-- Positive Matches:
SELECT DISTINCT
string
FROM strings s
INNER JOIN
patterns p
ON ( REGEXP_LIKE( string, pattern ) )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 3:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string,
( SELECT LISTAGG( pattern, '|' ) WITHIN GROUP ( ORDER BY NULL )
FROM patterns )
)
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings s
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
If you want to pass the pattern as a single string then:
Query 4:
-- Negative Matches:
SELECT string
FROM strings
WHERE NOT REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| BM_HTCEVOBLKHS_BX |
| A_HTCEVODSAP_DSSD |
| A_HTCEVOB_A_CDSED |
| MP_HTCEVOBLKHS_BX |
Query 5:
-- Positive Matches:
SELECT string
FROM strings
WHERE REGEXP_LIKE( string, '^(AM|AP)' )
Results:
| STRING |
|-------------------|
| AM_HTCEVOBLKHS_BX |
| AP_HTCEVOBLKHSPBX |
Query 6:
-- All Matches:
SELECT string,
CASE WHEN REGEXP_LIKE( string, '^(AM|AP)' )
THEN 'True'
ELSE 'False'
END AS Matched
FROM strings
Results:
| STRING | MATCHED |
|-------------------|---------|
| AM_HTCEVOBLKHS_BX | True |
| AP_HTCEVOBLKHSPBX | True |
| BM_HTCEVOBLKHS_BX | False |
| A_HTCEVODSAP_DSSD | False |
| A_HTCEVOB_A_CDSED | False |
| MP_HTCEVOBLKHS_BX | False |
Try this:
^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$
The idea is to describe all possible starts of the string.
( # a group
[B-Z][A-Z]* # the first character is not an "A"
| # OR
A[A-LNOQ-Z]? # a single "A", or an "A" followed by a letter other than "P" or "M"
| # OR
A[A-Z]{2,} # an "A" followed by two or more letters
) # close the group
^ and $ are anchors that mean "start of the string" and "end of the string".
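The approaches in this thread can be compared side by side on the sample strings. The sketch below does that in Python rather than Oracle, so it only illustrates the regex logic; the function names are made up, and Oracle's regexp_like may differ in details such as NULL handling.

```python
import re

SAMPLES = [
    "AM_HTCEVOBLKHS_BX", "AP_HTCEVOBLKHSPBX", "BM_HTCEVOBLKHS_BX",
    "A_HTCEVODSAP_DSSD", "A_HTCEVOB_A_CDSED", "MP_HTCEVOBLKHS_BX",
]

ENUMERATED = re.compile(r"^([B-Z][A-Z]*|A[A-LNOQ-Z]?|A[A-Z]{2,})_[A-Z_]+$")


def by_prefix(s):
    # substr-style check: the first two characters are not AM or AP
    return s[:2] not in ("AM", "AP")


def by_negation(s):
    # NOT REGEXP_LIKE(string, '^(AM|AP)')
    return re.search(r"^(AM|AP)", s) is None


def by_enumeration(s):
    # positive pattern that lists every allowed start
    return ENUMERATED.search(s) is not None


if __name__ == "__main__":
    for s in SAMPLES:
        print(s, by_prefix(s), by_negation(s), by_enumeration(s))
```

All three agree on the six samples. They differ on a hypothetical input such as AMX_FOO: the enumerated pattern accepts it because the letters before the underscore are not exactly AM, while the prefix and negation checks reject it.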
I think you need just this:
not regexp_like( field, '^(AM_)|^(AP_)' )
Because regexp_like only needs to find a match somewhere in the string, the pattern does not have to describe the rest of it.