Selecting for a JSONB array contains regex match

Given a data structure as follows:
{"single":"someText", "many":["text1", "text2"]}
I can query a regex match on single with
WHERE JsonBColumn ->> 'single' ~ '^some.*'
And I can query a containment match on the array with
WHERE JsonBColumn -> 'many' ? 'text2'
What I would like to do is a containment match with a regex on the JSON array:
WHERE JsonBColumn -> 'many' {Something} '.*2$'

I found that it is also possible to convert the entire JSONB array to a plain text string and simply run the regular expression on that. A side effect, though, is that a search on something like
xt1", "tex
would end up matching, because the pattern can span the boundary between two elements.
This approach isn't as clean since it doesn't test each element individually, but it gets the job done with a visually simpler statement:
WHERE JsonBColumn ->> 'many' ~ 'text2'
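To see the false positive concretely, here is a minimal sketch (not from the original answer) matching a pattern that straddles two elements:
select '{"many": ["text1", "text2"]}'::jsonb ->> 'many' ~ 'xt1", "tex';
-- returns true, although no single element contains that text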

Use jsonb_array_elements_text() in a lateral join.
with the_data(id, jsonbcolumn) as (
    values
        (1, '{"single":"someText", "many": ["text1", "text2"]}'::jsonb)
)
select distinct on (id) d.*
from
    the_data d,
    jsonb_array_elements_text(jsonbcolumn->'many') many(elem)
where elem ~ '^text.*';
 id |                     jsonbcolumn
----+-----------------------------------------------------
  1 | {"many": ["text1", "text2"], "single": "someText"}
(1 row)
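An equivalent formulation (a sketch, not part of the original answer) uses EXISTS, which avoids the need for DISTINCT ON:
with the_data(id, jsonbcolumn) as (
    values
        (1, '{"single":"someText", "many": ["text1", "text2"]}'::jsonb)
)
select d.*
from the_data d
where exists (
    select from jsonb_array_elements_text(d.jsonbcolumn->'many') many(elem)
    where elem ~ '^text.*'
);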
If the feature is used frequently, you may want to write your own function:
create or replace function jsonb_array_regex_like(json_array jsonb, pattern text)
returns boolean language sql as $$
    select bool_or(elem ~ pattern)
    from jsonb_array_elements_text(json_array) arr(elem)
$$;
The function definitely simplifies the code:
with the_data(id, jsonbcolumn) as (
    values
        (1, '{"single":"someText", "many": ["text1", "text2"]}'::jsonb)
)
select *
from the_data
where jsonb_array_regex_like(jsonbcolumn->'many', '^text.*');
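A quick standalone check of the function with literal values (illustrative, not from the original answer):
select jsonb_array_regex_like('["text1", "text2"]'::jsonb, '.*2$');  -- true ("text2" matches)
select jsonb_array_regex_like('["text1", "text2"]'::jsonb, '^zzz');  -- false (no element matches)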

Related

Query to fetch data for a given list of regular expressions

I have to fetch data for a given list of regular expressions. For a single regular expression the query below works, but for a list I am facing an issue:
select id from res r where
(?1 is null or CAST(r.value AS TEXT) ~ cast(?1 as TEXT));
When ?1 is [^\d{3}\d{1,}133\d{1,}$] it works fine.
Now when I pass a list of regular expressions, [^\d{3}\d{1,}133\d{1,}$, 75$], it does not work.
If the type of ?1 is a text array (text[]), or a valid textual representation of one, then:
select id from res r where
?1 is null
or exists (select from unnest(?1::text[]) rx where r.value::text ~ rx);
Please note that the textual representation of the array of regular expressions must not be
[^\d{3}\d{1,}133\d{1,}$, 75$] but {"^\\d{3}\\d{1,}133\\d{1,}$", 75$}
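As a side note (my own addition, not part of the original answer), PostgreSQL also lets you match against any element of an array directly with ~ ANY(...), since a match on any one pattern is enough:
select id from res r
where ?1 is null
   or r.value::text ~ any (?1::text[]);  -- true if value matches any pattern in the array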
Edit
It might be a good idea to define a function that returns true if the first argument matches any of the regular expressions in an array - something similar to a non-existent but good-to-have REGEX_IN operator.
create or replace function regex_any(needle text, haystack_rules text[])
returns boolean language sql immutable as $$
    select exists (select from unnest(haystack_rules) haystack_rule where needle ~ haystack_rule);
$$;
Then your query will look like this:
select id from res r
where ?1 is null
or regex_any(r.value::text, ?1);
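A quick sanity check of regex_any with literal arguments (illustrative values, not from the original answer):
select regex_any('abc123', '{"^abc", "^xyz"}'::text[]);  -- true  ('^abc' matches)
select regex_any('abc123', '{"^xyz"}'::text[]);          -- false (no pattern matches)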
From the documentation, the ~ operator works against a single regular expression. You would need to update your query to work against a list of regular expressions. For example:
select id from res r where
(?1 is null or (select bool_and(r.value::text ~ x.exp) from unnest(?1) x(exp)))
The second part of the where clause returns true if the column matches all regular expressions in ?1. Depending on the size of your input, you can extract unnest(?1) into a CTE.
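A sketch of that CTE variant (the name patterns is illustrative, not from the original answer):
with patterns(exp) as (
    select unnest(?1::text[])
)
select id from res r
where ?1 is null
   or (select bool_and(r.value::text ~ p.exp) from patterns p);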

Vertica REGEXP_SUBSTR use /g flag

I am trying to extract all occurrences of a word before '=' in a string. I tried to use the regex '/\w+(?=\=)/g' but it returns null. When I remove the first '/' and the trailing '/g' it returns only one occurrence, which is why I need the global flag. Any suggestions?
As Wiktor pointed out, by default, you only get the first string in a REGEXP_SUBSTR() call. But you can get the second, third, fourth, etc.
Embedded into SQL, you need to treat regular expressions differently from the way you would treat them in perl, for example. The pattern is just the pattern, modifiers go elsewhere, you can't use $n to get the n-th captured sub-expression, and you need to proceed in a specific way to get the n-th match of a pattern, etc.
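A minimal sketch of the parameter positions in Vertica's REGEXP_SUBSTR() (illustrative input string):
SELECT REGEXP_SUBSTR('a=1;b=2', '(\w+)=', 1, 2, '', 1);
-- returns 'b': start at position 1, take the 2nd occurrence, no modifiers, 1st captured sub-pattern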
The trick is to CROSS JOIN your queried table with an in-line created index table, consisting of as many consecutive integers as you expect occurrences of your pattern - and a few more for safety. And Vertica's REGEXP_SUBSTR() call allows for additional parameters to do that. See this example:
WITH
-- one exemplary input row; concatenating substrings for readability
input(s) AS (
    SELECT 'DRIVER={Vertica};COLUMNSASCHAR=1;CONNECTIONLOADBALANCE=True;'
        || 'CONNSETTINGS=set+search_path+to+public;DATABASE=sbx;'
        || 'LABEL=dbman;PORT=5433;PWD=;SERVERNAME=127.0.0.1;UID=dbadmin;'
),
-- an index table to CROSS JOIN with ... maybe you need more integers ...
loop_idx(i) AS (
    SELECT 1
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
    UNION SELECT 5
    UNION SELECT 6
    UNION SELECT 7
    UNION SELECT 8
    UNION SELECT 9
    UNION SELECT 10
),
-- the query containing the REGEXP_SUBSTR() call
find_token AS (
    SELECT
        i             -- the index from the in-line index table, needed
                      -- for ordering the outermost SELECT
      , REGEXP_SUBSTR(
            s         -- the input string
          , '(\w+)='  -- the pattern: a word followed by an equal sign; capture the word
          , 1         -- start from pos 1
          , i         -- the i-th occurrence of the match
          , ''        -- no modifiers to regexp
          , 1         -- the first and only sub-pattern captured
        ) AS token
    FROM input CROSS JOIN loop_idx  -- the CROSS JOIN with the in-line index table
)
-- the outermost query, filtering the non-matches - the empty strings - away ...
SELECT
    token
FROM find_token
WHERE token <> ''
ORDER BY i;
The result will be one row per found pattern:
 token
-----------------------
 DRIVER
 COLUMNSASCHAR
 CONNECTIONLOADBALANCE
 CONNSETTINGS
 DATABASE
 LABEL
 PORT
 PWD
 SERVERNAME
 UID
You can do all sorts of things in modern SQL - but you need to stick to the SQL and to the relational paradigm - that's all ...
Happy playing ...
Marco

Use Regex from a column in Redshift

I have 2 tables in Redshift; one of them has a column containing regex strings, and I want to join them like so:
select *
from one o
join two t
on o.value ~ t.regex
But this query throws an error:
[Amazon](500310) Invalid operation: The pattern must be a valid UTF-8 literal character expression
Details:
-----------------------------------------------
error: The pattern must be a valid UTF-8 literal character expression
code: 8001
context:
query: 412993
location: cgx_impl.cpp:1911
process: padbmaster [pid=5211]
-----------------------------------------------;
As far as I understood from searching in the docs, the right side of a regex operator ~ must be a string literal.
So this would work:
select *
from one o
where o.value ~ 'regex'
And this would fail:
select *
from one o
where 'regex' ~ o.value
Is there any way around this? Anything I missed?
Thanks!
Here's a workaround I am using. Maybe it's not super fast, but it works:
First create a function:
CREATE FUNCTION is_regex_match(pattern text, s text) RETURNS BOOLEAN IMMUTABLE AS $$
    import re
    return True if re.search(pattern, s) else False
$$ LANGUAGE plpythonu;
Then use it like this (o.value contains a regex pattern):
select *
from one o
where is_regex_match(o.value, 'some string');
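Applied to the original join between the two tables (a sketch using the question's column names):
select *
from one o
join two t
  on is_regex_match(t.regex, o.value);  -- t.regex is the pattern, o.value the string to test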
You could try using the built-in function regexp_substr()
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html
select *
from one o
join two t
on regexp_substr(o.value, t.regex) <> ''
Edit: raw query example added.
It appears that the fields must be explicitly cast as varchars when built.
with fake_table as (
    SELECT 'sample value'::varchar as value, '[a-z]'::varchar as pattern
)
SELECT
    *
  , regexp_substr(value, pattern)
FROM fake_table
WHERE regexp_substr(value, pattern) <> ''

Regular expression in Redshift

I have data being fed in the format below:
2016-006-011 04:58:22.058
This is an incorrect date/timestamp format, and I need to convert it to the correct one:
2016-06-11 04:58:22.058
I'm trying to achieve this using regex in Redshift. Is there a way to remove the additional zero (0) in the date and month portions using regex? I need something generic rather than tailored to this example alone, as the date will vary.
The function regexp_replace() (see documentation) should do the trick:
select
    regexp_replace(
        '2016-006-011 04:58:22.058'   -- use your date column here instead
      , '\-0([0-9]{2}\-)0([0-9]{2})'  -- matches "-006-011", captures "06-" in $1, "11" in $2
      , '-$1$2'                       -- inserts $1 and $2 to give "-06-11"
    )
;
And so the result is, as required:
     regexp_replace
-------------------------
 2016-06-11 04:58:22.058
(1 row)
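From there (my own addition, not part of the original answer), the cleaned string can be cast to a real timestamp:
select regexp_replace('2016-006-011 04:58:22.058',
                      '\-0([0-9]{2}\-)0([0-9]{2})',
                      '-$1$2')::timestamp;  -- yields the timestamp 2016-06-11 04:58:22.058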

Is there a way to usefully index a text column containing regex patterns?

I'm using PostgreSQL, currently version 9.2 but I'm open to upgrading.
In one of my tables, I have a column of type text that stores regex patterns.
CREATE TABLE foo (
id serial,
pattern text,
PRIMARY KEY(id)
);
CREATE INDEX foo_pattern_idx ON foo(pattern);
Then I do queries on it like this:
INSERT INTO foo (pattern) VALUES ('^abc.*$');
SELECT * FROM foo WHERE 'abc literal string' ~ pattern;
I understand that this is sort of a reverse LIKE or reverse pattern match. If it were the other, more common way around - if my haystack were in the database and my needle were anchored - I could use a btree index more or less effectively, depending on the exact search pattern and data.
But the data that I have is a table of patterns and other data associated with the patterns. I need to ask the database which rows have patterns that match my query text. Is there a way to make this more efficient than a sequential scan that checks every row in my table?
There is no way.
Indexes require IMMUTABLE expressions. The result of your expression depends on the input string. I don't see any other way than to evaluate the expression for every row, meaning a sequential scan.
Related answer with more details for the IMMUTABLE angle:
Does PostgreSQL support "accent insensitive" collations?
There is no workaround for your case; it is impossible to index. The index would need to store constant values in its tuples, which are just not available here, because the resulting value for every row is computed based on the input string. And you cannot transform the input without looking at the column value.
Postgres index usage is bound to operators, and only indexes on expressions to the left of the operator can be used (due to the same logical restrictions). More:
Can PostgreSQL index array columns?
Many operators define a COMMUTATOR, which allows the query planner / optimizer to flip the indexed expression to the left. Simple example: the commutator of = is =; the commutator of > is < and vice versa. The documentation:
the index-scan machinery expects to see the indexed column on the left of the operator it is given.
The regular expression match operator ~ has no commutator, again, because that's not possible. See for yourself:
SELECT oprname, oprright::regtype, oprleft::regtype, oprcom
FROM pg_operator
WHERE oprname = '~'
AND 'text'::regtype IN (oprright, oprleft);
 oprname | oprright |  oprleft  | oprcom
---------+----------+-----------+--------
 ~       | text     | name      |      0
 ~       | text     | text      |      0
 ~       | text     | character |      0
 ~       | text     | citext    |      0
And consult the manual here:
oprcom ... Commutator of this operator, if any
...
Unused column contains zero. For example, oprleft is zero for a prefix operator.
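For contrast (an illustrative aside, not from the original answer), here is an operator that does define a commutator, so the planner is free to flip its operands:
SELECT oprname, oprcom::regoperator
FROM pg_operator
WHERE oprname = '@>' AND oprleft = 'anyarray'::regtype;
-- returns <@(anyarray,anyarray): array containment can be rewritten in either direction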
I have tried this before and had to accept that it's impossible in principle.