In PostgreSQL, I want to exclude rows if the desc field contains any forbidden words.
items:
| id | desc |
|----|------------------|
| 1 | apple foo cat bar|
| 2 | foo bar |
| 3 | foocatbar |
| 4 | foo dog bar |
The list of forbidden words is stored in another table, which currently holds 400 words to check.
forbidden_word_table:
| word |
|---------|
| apple |
| boy |
| cat |
| dog |
| .... |
SQL query:
select id, "desc"
from items
where
"desc" !~* (select '\y(' || string_agg(word, '|') || ')\y' from forbidden_word_table)
I am checking that "desc" (quoted, because desc is a reserved word) does not match the regular expression:
"desc" !~* '\y(apple|boy|cat|dog|.............)\y'
Results:
| id | desc |
|----|------------------|
| 2 | foo bar |
| 3 | foocatbar |
Row 3 is not excluded because "cat" does not appear there as a separate word.
My forbidden_word_table will likely grow by many rows, so the regex above will become a very long expression.
Do regular expressions have a maximum length (in bytes or characters)? I'm afraid my regex matching approach will stop working if forbidden_word_table keeps growing.
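It seems that Wiktor Stribiżew is right about "catastrophic backtracking".
I'd suggest using ILIKE and ANY instead. Note that this matches substrings rather than whole words, so row 3 (foocatbar) is excluded as well: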
SELECT *
FROM items i
WHERE NOT i."desc" ILIKE ANY
(
SELECT '%' || word || '%'
FROM forbidden_word_table
);
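To make the difference between whole-word and substring matching concrete, here is a minimal Python sketch run against the sample rows from the question. It is an illustration only, not PostgreSQL code: Python's \b stands in for PostgreSQL's \y, the four-word list stands in for forbidden_word_table, and the names FORBIDDEN, ITEMS, kept_whole_word and kept_substring are made up for this example.

```python
import re

# Stand-ins for forbidden_word_table and items (sample data from the question)
FORBIDDEN = ["apple", "boy", "cat", "dog"]
ITEMS = {1: "apple foo cat bar", 2: "foo bar", 3: "foocatbar", 4: "foo dog bar"}

# Whole-word check: one alternation wrapped in word boundaries.
# re.escape guards against words that contain regex metacharacters.
WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in FORBIDDEN) + r")\b",
    re.IGNORECASE,
)


def kept_whole_word(items):
    """Ids of rows with no forbidden word as a standalone word."""
    return [i for i, text in items.items() if not WORD_RE.search(text)]


def kept_substring(items):
    """Ids of rows with no forbidden word anywhere (like ILIKE '%word%')."""
    return [
        i for i, text in items.items()
        if not any(w.lower() in text.lower() for w in FORBIDDEN)
    ]


print(kept_whole_word(ITEMS))  # [2, 3]
print(kept_substring(ITEMS))   # [2]
```

Row 3 survives the whole-word check but not the substring check, which is the behavioural difference between the regex query and the ILIKE ANY query.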
I would like to trim() a column and replace any run of whitespace, including Unicode space separators, with a single space. The idea is to sanitize usernames and prevent two users from having deceptively similar names such as foo bar (SPACE, U+0020) vs. foo bar (NO-BREAK SPACE, U+00A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it handles spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added \h to the regexp, but PostgreSQL doesn't support it (nor \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
Is there a way to achieve something similar?
Based on the Posix "space" character-class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
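For comparison, the same normalization can be sketched outside the database. The Python snippet below reuses the custom character class from above; normalize_spaces and SPACE_LIKE are hypothetical names chosen for this example. One caveat: Python's \s on str patterns already covers more Unicode whitespace than PostgreSQL's \s does in the demo below (U+00A0, for instance), so some of the explicit code points are redundant there, though harmless.

```python
import re

# Same class as in the SQL above: \s plus space-like characters that
# PostgreSQL's \s does not cover (NBSP, zero-width characters, BOM, ...).
SPACE_LIKE = re.compile(r"[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+")


def normalize_spaces(name: str) -> str:
    """Collapse every run of space-like characters to one space, then trim."""
    # Trim after replacing, so converted characters at the ends are removed too.
    return SPACE_LIKE.sub(" ", name).strip(" ")


# U+0020 and U+00A0 variants of the same username normalize to the same value
print(normalize_spaces("foo bar") == normalize_spaces("foo\u00a0bar"))  # True
```

As in the SQL version, trimming comes last so that a leading NO-BREAK SPACE or trailing zero-width character does not survive as a converted space.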
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL SEPARATOR, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NO-BREAK SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 | | \u00a0 | f | t
5760 | 1680 | | \u1680 | t | t
6158 | 180e | | \u180e | f | t
8192 | 2000 | | \u2000 | t | t
8193 | 2001 | | \u2001 | t | t
8194 | 2002 | | \u2002 | t | t
8195 | 2003 | | \u2003 | t | t
8196 | 2004 | | \u2004 | t | t
8197 | 2005 | | \u2005 | t | t
8198 | 2006 | | \u2006 | t | t
8199 | 2007 | | \u2007 | f | t
8200 | 2008 | | \u2008 | t | t
8201 | 2009 | | \u2009 | t | t
8202 | 200a | | \u200a | t | t
8203 | 200b | | \u200b | f | t
8204 | 200c | | \u200c | f | t
8205 | 200d | | \u200d | f | t
8206 | 200e | | \u200e | f | t
8207 | 200f | | \u200f | f | t
8239 | 202f | | \u202f | f | t
8287 | 205f | | \u205f | t | t
8288 | 2060 | | \u2060 | f | t
12288 | 3000 | | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
Related, with more explanation:
Trim trailing spaces with PostgreSQL
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespace characters (matched by \h in other regex flavors that support it) with a regular space character.
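Since PostgreSQL cannot expand \p{Zs} itself, the list of code points has to come from somewhere else. One way to derive or double-check it is Python's unicodedata module, as in the rough sketch below. The function name space_separators is made up for this example, and the result depends on the Unicode version bundled with the Python interpreter in use.

```python
import sys
import unicodedata


def space_separators():
    """Code points whose Unicode general category is Zs (Space_Separator)."""
    return [
        cp for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


# Print the code points in the \uXXXX notation used in the bracket expression
print(" ".join("\\u%04X" % cp for cp in space_separators()))
```

With a recent Python this should list the same code points as the bracket expression above, apart from the tab (U+0009), which is a control character and was added by hand.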
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze multiple spaces into one
UPDATE mytable SET text=regexp_replace(text, ' {2,}', ' ', 'g') WHERE text ~ ' {2,}';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';
I want to capture all numbers in a string
for example:
+================+============+
| string | match |
+================+============+
| 5*-33 = 75.3 | 5|-33|75.3 |
+----------------+------------+
| s44+2=7 | 2|7 |
+----------------+------------+
| ii2*-5 = 46 | -5|46 |
+----------------+------------+
| -2*-2.1 = 0.1 | -2|-2.1|0.1|
+================+============+
I tried the following expression, but it does not work with signed numbers:
\b([0-9]+(\.\d+)?)\b
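Don't forget the optional -. The minus sign is not a word character, so the leading \b has to come after it; otherwise there is no word boundary between, say, * and - in 5*-33, and the sign is dropped.
-?\b(\d+(\.\d+)?)\b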
Of course, this will have issues with valid expressions such as:
4-3
But that seems to be a different problem.
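A pattern like this is easy to check against the question's sample strings. The Python sketch below compares two placements of the word boundary: before the optional minus sign and after it. The names BOUNDARY_FIRST, SIGN_FIRST and numbers are made up for this example, and the inner group is written as non-capturing so that re.findall returns whole matches.

```python
import re

# \b before the optional minus: no boundary exists between "*" and "-"
BOUNDARY_FIRST = re.compile(r"\b-?\d+(?:\.\d+)?\b")
# \b after the optional minus: the boundary sits between "-" and the digit
SIGN_FIRST = re.compile(r"-?\b\d+(?:\.\d+)?\b")


def numbers(pattern, text):
    """All non-overlapping matches of pattern in text, as strings."""
    return pattern.findall(text)


print(numbers(BOUNDARY_FIRST, "5*-33 = 75.3"))  # ['5', '33', '75.3']
print(numbers(SIGN_FIRST, "5*-33 = 75.3"))      # ['5', '-33', '75.3']
print(numbers(SIGN_FIRST, "s44+2=7"))           # ['2', '7']
print(numbers(SIGN_FIRST, "4-3"))               # ['4', '-3']
```

The last line shows the ambiguity mentioned above: in 4-3 the minus is an operator, yet both variants read it as the sign of 3.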
I don't know regex and need an expression that isolates lines containing the word "comp" plus at least one price (number).
Any ideas?
249.00 | 259.00 | 279.00 | comp | 349.00 | //I need to return this as match
369.00 | 359.00 | 599.00 | //don't want to return this as match
299.00 | 499.00 | //don't want to return this as match
329.00 | //don't want to return this as match
comp | 269.00 | 269.00 | //I need to return this as match
179.00 | 239.00 | comp | //I need to return this as match
comp | //don't want to return this as match
89.00 | 89.00 | 89.00 | //no match
249.00 | //don't want to return this as match
comp | 249.00 | //I need to return this as match
199.00 | comp | comp | //I need to return this as match
comp | comp | 99.00 | 99.00 | //I need to return this as match
comp | comp | comp | comp | comp | //I need to return this as match
Try
(\bcomp\b.+([0-9]*[.]?[0-9]+))|(([0-9]*[.]?[0-9]+).+\bcomp\b)
\bcomp\b matches comp as a whole word.
.+ matches one or more characters.
[0-9]*[.]?[0-9]+ matches a number, with or without a decimal part.
| is the "or" between the two orders: comp followed by a number, or a number followed by comp.
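To see how this pattern behaves on the sample data, here is a small Python sketch that runs it over a few of the lines from the question, with the // annotations stripped. The names PATTERN, LINES and matching_lines are made up for this example.

```python
import re

# The pattern from the answer, unchanged
PATTERN = re.compile(
    r"(\bcomp\b.+([0-9]*[.]?[0-9]+))|(([0-9]*[.]?[0-9]+).+\bcomp\b)"
)

LINES = [
    "249.00 | 259.00 | 279.00 | comp | 349.00 |",
    "369.00 | 359.00 | 599.00 |",
    "comp | 269.00 | 269.00 |",
    "179.00 | 239.00 | comp |",
    "comp |",
    "comp | comp | comp | comp | comp |",
]


def matching_lines(lines):
    """Lines that contain both the word comp and a number, in either order."""
    return [line for line in lines if PATTERN.search(line)]


for line in matching_lines(LINES):
    print(line)
```

This prints the first, third and fourth lines. The all-comp line, which the question marks as a wanted match, contains no price and is therefore not matched by this pattern.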
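Another attempt:
/\d?+comp\d?+/g
Be aware that this only matches "comp" with an optional digit directly next to it, so it also matches lines that contain comp and no price. In addition, ?+ is a possessive quantifier, which not every regex flavor accepts (JavaScript, for example, rejects it).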
I have a Hive table column that holds strings separated by '-', and I need to extract the substring between the first and last occurrence of '-'.
+-----------------+
| col1 |
+-----------------+
| abc-123-na-00-sf|
| 123-abc-01-sd |
| 123-abcd-sd |
+-----------------+
Required output:
+-----------+
| col1 |
+-----------+
| 123-na-00 |
| abc-01 |
| abcd |
+-----------+
Please suggest some regex to extract the desired output.
Thanks
with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'-(.*)-',1)
from t
;
123-na-00
abc-01
abcd
or
with t as (select explode(array('abc-123-na-00-sf','123-abc-01-sd','123-abcd-sd')) as str)
select regexp_extract (str,'(?<=-).*(?=-)',0)
from t
;
123-na-00
abc-01
abcd
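Both patterns rely on .* being greedy: it first runs to the end of the string and then backtracks to the last '-', while the search itself starts at the first '-'. The Python sketch below reproduces this with the same sample values. It is an illustration only (Hive uses Java regular expressions), and the function names between_dashes and between_dashes_lookaround are made up for this example.

```python
import re

SAMPLES = ["abc-123-na-00-sf", "123-abc-01-sd", "123-abcd-sd"]


def between_dashes(s):
    """Capture group variant, like regexp_extract(str, '-(.*)-', 1)."""
    m = re.search(r"-(.*)-", s)
    return m.group(1) if m else ""


def between_dashes_lookaround(s):
    """Lookaround variant, like regexp_extract(str, '(?<=-).*(?=-)', 0)."""
    m = re.search(r"(?<=-).*(?=-)", s)
    return m.group(0) if m else ""


for s in SAMPLES:
    print(between_dashes(s), between_dashes_lookaround(s))
```

Both functions return 123-na-00, abc-01 and abcd for the three samples. A string with fewer than two dashes yields an empty string in this sketch.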