In PostgreSQL, I want to exclude rows if the desc field contains any forbidden words.
items:
| id | desc |
|----|------------------|
| 1 | apple foo cat bar|
| 2 | foo bar |
| 3 | foocatbar |
| 4 | foo dog bar |
The forbidden words list is stored in another table, currently it has 400 words to check.
forbidden_word_table:
| word |
|---------|
| apple |
| boy |
| cat |
| dog |
| .... |
SQL query:
select id, desc
from items
where
desc !~* (select '\y(' || string_agg(word, '|') || ')\y' from forbidden_word_table)
I am checking if desc does not match the regex expression:
desc !~* '\y(apple|boy|cat|dog|.............)\y'
Results:
| id | desc |
|----|------------------|
| 2 | foo bar |
| 3 | foocatbar |
** 3rd is not excluded since cat is not a single word
My forbidden_word_table will likely grow with many rows, the above regex will become a very lengthy expression.
Do regex expressions have a maximum length limit (in bytes or characters)? I'm afraid of my regex matching approach will not work if forbidden_word_table keeps growing.
Seems, that Wiktor Stribiżew is right about "catastrophic backtracking".
I'd suggest to use ILIKE and ANY:
SELECT *
FROM items i
WHERE NOT i."desc" ILIKE ANY
(
SELECT '%' || word || '%'
FROM forbidden_word_table
);
db-fiddle
I would like to trim() a column and to replace any multiple white spaces and Unicode space separators to single space. The idea behind is to sanitize usernames, preventing 2 users having deceptive names foo bar (SPACE u+20) vs foo bar(NO-BREAK SPACE u+A0).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it removes spaces, tab and carriage return, but it lack support for Unicode space separators.
I would have added to the regexp \h, but PostgreSQL doesn't support it (neither \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
I there a way to achieve something similar?
Based on the Posix "space" character-class (class shorthand \s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor's post), I condensed this custom character class:
'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'
So use:
SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
Note: trim() comes after regexp_replace(), so it covers converted spaces.
It's important to include the basic space class \s (short for [[:space:]] to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
Consider this demo:
SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
, '\u' || lpad(to_hex(d), 4, '0') AS unicode
, chr(d) ~ '\s' AS in_posix_space_class
, chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM (
-- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
-- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202) AS dec -- UNICODE "Spaces"
UNION ALL
SELECT generate_series (8203, 8207) AS dec -- First 5 space-like UNICODE "Format characters"
) t(d)
ORDER BY d;
decimal | hex | glyph | unicode | in_posix_space_class | in_custom_class
---------+------+----------+---------+----------------------+-----------------
9 | 9 | | \u0009 | t | t
32 | 20 | | \u0020 | t | t
160 | a0 | | \u00a0 | f | t
5760 | 1680 | | \u1680 | t | t
6158 | 180e | | \u180e | f | t
8192 | 2000 | | \u2000 | t | t
8193 | 2001 | | \u2001 | t | t
8194 | 2002 | | \u2002 | t | t
8195 | 2003 | | \u2003 | t | t
8196 | 2004 | | \u2004 | t | t
8197 | 2005 | | \u2005 | t | t
8198 | 2006 | | \u2006 | t | t
8199 | 2007 | | \u2007 | f | t
8200 | 2008 | | \u2008 | t | t
8201 | 2009 | | \u2009 | t | t
8202 | 200a | | \u200a | t | t
8203 | 200b | | \u200b | f | t
8204 | 200c | | \u200c | f | t
8205 | 200d | | \u200d | f | t
8206 | 200e | | \u200e | f | t
8207 | 200f | | \u200f | f | t
8239 | 202f | | \u202f | f | t
8287 | 205f | | \u205f | t | t
8288 | 2060 | | \u2060 | f | t
12288 | 3000 | | \u3000 | t | t
65279 | feff | | \ufeff | f | t
(26 rows)
Tool to generate the character class:
SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM (
SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
UNION ALL
SELECT generate_series (8192, 8202)
UNION ALL
SELECT generate_series (8203, 8207)
) t(d)
WHERE chr(d) !~ '\s'; -- not covered by \s
[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
db<>fiddle here
Related, with more explanation:
Trim trailing spaces with PostgreSQL
You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespaces (match by \h in other regex flavors supporting it) with a regular space char.
Compiling blank characters from several sources, I've ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:
[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]
And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")
-- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
UPDATE
mytable
SET
text = regexp_replace(text, '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]', ' ', 'g')
WHERE
text ~ '[\x0009\x000B\x0088-\x008A\x00A0\x1680\x180E\x2000-\x200F\x202F\x205F\x2060\x2409\x240A\x2420\x2423\x2800\x3000\xFEFF\xE0020]';
-- then squeeze multiple spaces into one
UPDATE mytable SET text=regexp_replace(text, '[ ]+ ',' ','g') WHERE text LIKE '% %';
-- and finally, trim leading/ending spaces
UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';
I want to capture all numbers in a string
for example:
+================+============+
| string | match |
+================+============+
| 5*-33 = 75.3 | 5|-33|75.3 |
+----------------+------------+
| s44+2=7 | 2|7 |
+----------------+------------+
| ii2*-5 = 46 | -5|46 |
+----------------+------------+
| -2*-2.1 = 0.1 | -2|-2.1|0.1|
+================+============+
i tried with following expression, but its not working with signed numbers.
\b([0-9]+(\.\d+)?)\b
Regexr
Don't forget the optional -. - is not a number, so you have to capture it separately.
\b(-?\d+(\.\d+)?)\b
Of course, this will have issues with valid expressions such as:
4-3
But that seems to be a different problem.
I am trying to write a rule to remove the non-start [a | e | h | i | o | u | w | y] letters in a string. The rule should keep the first letter, but remove given letters in other locations.
For example,
vave -> vv
aeiou -> a
My code is as below:
?* [ a | e | h | i | o | u | w | y ]+:0 ?* [ a | e | h | i | o | u | w | y ]+:0;
However, when applying the rule on vaavaa, it returns
vaav
vava
vava
vav
vava
vava
vav
vvaa
vva
vva
vv
while vv is what I want.
Please share some advice. Thanks!
You may use this regex for search:
(?!^)[aehiouwy]+
and replace it by emptry string ""
RegEx Demo
RegEx Details:
(?!^): Lookahead to make sure it is not at start
[aehiouwy]+: Match one or more of these letters inside [...]
You can use a captured group and alternation
^(.)|[aehiouwy]+
replace by \1
Regex demo
I'm not great at programming and recently started to read tutorials on C++.
I decided I would attempt to make a simple blackjack program. I tried to make a title with "big text" but C++ is preventing me from doing it because it is detecting other things inside the text.
//Start Screen Begin
cout << " ____ _ _ _ _ ";
cout << "| __ )| | __ _ ___| | __(_) __ _ ___| | __ ";
cout << "| _ \| |/ _` |/ __| |/ /| |/ _` |/ __| |/ / ";
cout << "| |_) | | (_| | (__| < | | (_| | (__| < ";
cout << "|____/|_|\__,_|\___|_|\_\/ |\__,_|\___|_|\_\ ";
cout << " |__/ ";
//Start Screen End
This is what I am trying to display, yet keep getting the following error:
undefined reference to 'WinMain#16'
I am asking if there is a way to tell C++ I only want it to read and display the text, and not use any functions.
That's a better job for C++11 raw string literals than escaping \ with \\:
#include <iostream>
int main() {
using namespace std;
//Start Screen Begin
cout << R"( ____ _ _ _ _ )" << '\n';
cout << R"(| __ )| | __ _ ___| | __(_) __ _ ___| | __ )" << '\n';
cout << R"(| _ \| |/ _` |/ __| |/ /| |/ _` |/ __| |/ / )" << '\n';
cout << R"(| |_) | | (_| | (__| < | | (_| | (__| < )" << '\n';
cout << R"(|____/|_|\__,_|\___|_|\_\/ |\__,_|\___|_|\_\ )" << '\n';
cout << R"( |__/ )" << '\n';
//Start Screen End
}
Check the output here that it works for a decent compiler that support C++11: http://coliru.stacked-crooked.com/a/964b0d2b8bde8b3d
The following would also work:
#include <iostream>
int main() {
using namespace std;
//Start Screen Begin
cout <<
R"(
____ _ _ _ _
| __ )| | __ _ ___| | __(_) __ _ ___| | __
| _ \| |/ _` |/ __| |/ /| |/ _` |/ __| |/ /
| |_) | | (_| | (__| < | | (_| | (__| <
|____/|_|\__,_|\___|_|\_\/ |\__,_|\___|_|\_\
|__/
)";
//Start Screen End
}
http://coliru.stacked-crooked.com/a/b89a0461ab8cdc97
Your second-to-last text literal has several \ characters in it. That is an escape character, so to use the literal \ character you have to escape it as \\, eg:
cout << "|____/|_|\\__,_|\\___|_|\\_\\/ |\\__,_|\\___|_|\\_\\ ";
It won't look as good in code, but it will look fine when the app is run.
As for the reference error, WinMain() is the entry point for GUI apps, whereas main() is the entry point for console apps, so it sounds like you did not create/configure your project correctly if it is trying to link to WinMain() instead of main().
You'll need to escape the backslashes, instead of one \ have two \\.
These are used to indicate special characters like "\n" (line-break) appearing within a string literal ("..."). Your Big-Text misses all of those line-breaks BTW.
undefined reference to 'WinMain#16'
Apparently you are trying to compile a GUI project. Check your Code Blocks project type.