Vertica REGEXP_SUBSTR use /g flag - regex

I am trying to extract all occurrences of a word before '=' in a string, i tried to use this regex '/\w+(?=\=)/g' but it returns null, when i remove the first '/' and the last '/g' it returns only one occurrence that's why i need the global flag, any suggestions?

As Wiktor pointed out, by default, you only get the first string in a REGEXP_SUBSTR() call. But you can get the second, third, fourth, etc.
Embedded into SQL, you need to treat regular expressions differently from the way you would treat them in perl, for example. The pattern is just the pattern, modifiers go elsewhere, you can't use $n to get the n-th captured sub-expression, and you need to proceed in a specific way to get the n-th match of a pattern, etc.
The trick is to CROSS JOIN your queried table with an in-line created index table, consisting of as many consecutive integers as you expect occurrences of your pattern - and a few more for safety. And Vertica's REGEXP_SUBSTR() call allows for additional parameters to do that. See this example:
WITH
-- one exemplary input row; concatenating substrings for
-- readability
input(s) AS (
SELECT 'DRIVER={Vertica};COLUMNSASCHAR=1;CONNECTIONLOADBALANCE=True;'
||'CONNSETTINGS=set+search_path+to+public;DATABASE=sbx;'
||'LABEL=dbman;PORT=5433;PWD=;SERVERNAME=127.0.0.1;UID=dbadmin;'
)
,
-- an index table to CROSS JOIN with ... maybe you need more integers ...
loop_idx(i) AS (
SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
UNION SELECT 10
)
,
-- the query containing the REGEXP_SUBSTR() call
find_token AS (
SELECT
i -- the index from the in-line index table, needed
-- for ordering the outermost SELECT
, REGEXP_SUBSTR (
s -- the input string
, '(\w+)=' -- the pattern - a word followed by an equal sign; capture the word
, 1 -- start from pos 1
, i -- the i-th occurrence of the match
, '' -- no modifiers to regexp
, 1 -- the first and only sub-pattern captured
) AS token
FROM input CROSS JOIN loop_idx -- the CROSS JOIN with the in-line index table
)
-- the outermost query filtering the non-matches - the empty strings - away...
SELECT
token
FROM find_token
WHERE token <> ''
ORDER BY i
;
The result will be one row per found pattern:
token
DRIVER
COLUMNSASCHAR
CONNECTIONLOADBALANCE
CONNSETTINGS
DATABASE
LABEL
PORT
PWD
SERVERNAME
UID
You can do all sorts of things in modern SQL - but you need to stick to the SQL and to the relational paradigm - that's all ...
Happy playing ...
Marco

Related

Oracle REGEXP_SUBSTR - SEMICOLON STRING EXTRACTION

I have below input and need mentioned output. How I can get it. I tried different pattern but could not get through it.
so in brief, any value having all three 1#2#3 parts(if it is present) or first value should be returned
2#9#;2#37#65 -> 2#37#65
2#9#;2#37#65;2#37# -> 2#37#65
2#9#;2#37#65;2#37#;2#37#56 -> 2#37#65 or 2#37#56
2#37#65;2#99 -> 2#37#65
3#9#;3#37#65;3#37#36;2#37#56 -> 3#37#65 or 3#37#36 or 2#37#56
2#37#;2#99# -> 2#37 or 2#99# ( in this case any value)
I tried few patterns and other pattern but no help.
regexp_substr('2#9#;2#37#65;2#37#','#[^;]+',1)
SUBSTR(REGEXP_SUBSTR(SUBSTR(uo_filiere,1,INSTR(uo_filiere,';',1)-1), '#[^#]+$'),2)
You can use a REGEXP_REPLACE here:
REGEXP_REPLACE(uo_filiere, '^(.*;)?([0-9]+(#[0-9]+){2,}).*|^([^;]+).*', '\2\4')
See the regexp demo
Details:
^ - start of string
(.*;)? - an optional Group 1 capturing any text and then a ;
([0-9]+(#[0-9]+){2,}) - Group 2 (\2): one or more digits, and then two or more occurrences of # followed with one or more digits
.* - the rest of the string
| - or
^([^;]+).* - start of string, Group 4 capturing one or more chars other than ; and then any text till end of string.
The replacement is Group 2 + Group 4 values.
Here is a simple-minded way to solve this. It may prove more efficient than other approaches, given the particular nature of the problem.
First, use a regular expression to find the first token that has all three parts. This part of the solution should be the most efficient approach for those strings that do have a three-part token, and it performs work that must be performed on all input strings in any case.
In the second part, wrap within nvl - if no three-part token is found, select the first token regardless of how many parts are present. This part uses only substr and instr in a trivial manner, so it should be very fast too.
Here's the query, run on a few more sample inputs to test those cases too.
with
sample_data (uo_filiere) as (
select '2#9#;2#37#65' from dual union all
select '2#9#;2#37#65;2#37#' from dual union all
select '2#9#;2#37#65;2#37#;2#37#56' from dual union all
select '2#37#65;2#99' from dual union all
select '3#9#;3#37#65;3#37#36;2#37#56' from dual union all
select '2#37#;2#99#' from dual union all
select '1#22#333' from dual union all
select '33#444#' from dual
)
select uo_filiere,
nvl(regexp_substr(uo_filiere, '(;|^)(([^;#]+#){2}[^;]+)', 1, 1, null, 2)
, substr(uo_filiere, 1, instr(uo_filiere || ';', ';') - 1)
) as first_value
from sample_data
;
UO_FILIERE FIRST_VALUE
---------------------------- ----------------------------
2#9#;2#37#65 2#37#65
2#9#;2#37#65;2#37# 2#37#65
2#9#;2#37#65;2#37#;2#37#56 2#37#65
2#37#65;2#99 2#37#65
3#9#;3#37#65;3#37#36;2#37#56 3#37#65
2#37#;2#99# 2#37#
1#22#333 1#22#333
33#444# 33#444#

Oracle regex and replace

I have varchar field in the database that contains text. I need to replace every occurrence of a any 2 letter + 8 digits string to a link, such as VA12345678 will return /cs/page.asp?id=VA12345678
I have a regex that replaces the string but how can I replace it with a string where part of it is the string itself?
SELECT REGEXP_REPLACE ('test PI20099742', '[A-Z]{2}[0-9]{8}$', 'link to replace with')
FROM dual;
I can have more than one of these strings in one varchar field and ideally I would like to have them replaced in one statement instead of a loop.
As mathguy had said, you can use backreferences for your use case. Try a query like this one.
SELECT REGEXP_REPLACE ('test PI20099742', '([A-Z]{2}[0-9]{8})', '/cs/page.asp?id=\1')
FROM DUAL;
For such cases, you may want to keep the "text to add" somewhere at the top of the query, so that if you ever need to change it, you don't have to hunt for it.
You can do that with a with clause, as shown below. I also put some input data for testing in the with clause, but you should remove that and reference your actual table in your query.
I used the [:alpha:] character class, to match all letters - upper or lower case, accented or not, etc. [A-Z] will work until it doesn't.
with
text_to_add (link) as (
select '/cs/page.asp?id=' from dual
)
, sample_strings (str) as (
select 'test VA12398403 and PI83048203 to PT3904' from dual
)
select regexp_replace(str, '([[:alpha:]]{2}\d{8})', link || '\1')
as str_with_links
from sample_strings cross join text_to_add
;
STR_WITH_LINKS
------------------------------------------------------------------------
test /cs/page.asp?id=VA12398403 and /cs/page.asp?id=PI83048203 to PT3904

Find a string with or without space in oracle using like or regex

I have a string which contains specific 'winner code' which needs to be matched exactly but in the database some records contains spaces and extra characters within 'winners code' and if I use 'like operator' it only returns the matching criteria. I want to use one simplified query which can return all the records if it contains the winner code.Please find below my query and details
Winner code - أ4 ب3 ج10
Records with spaces - أ4 ب 3 ج 10
Records with extra character - (أ(4)
ب(3)
ج(10
My Query -
SELECT COLUMN_NAME,
FROM TABLE_NAME
WHERE
((COLUMN_NAME LIKE '%أ4%ب3%ج10%') or(COLUMN_NAME LIKE '%أ 4%ب 3%ج 10%'))
The above query returns with and without space data as its matching the criteria.
Thanks
If I correctly understand your need, you may try :
with test(str) as (
select '10X3Y4Z' from dual union all
select '10 X 3 Y 4 Z' from dual union all
select '(10)X(3)Y(4)Z' from dual union all
select '10#X3Y4 Z' from dual union all
select '10 # X3Y4Z' from dual )
select str
from test
where regexp_instr(str, '10[ |\)]{0,1}X[ |\(]{0,1}3[ |\)]{0,1}Y[ |\(]{0,1}4[ |\)]{0,1}Z') != 0
This matches your "winner code" ( I used different characters to simplify my test) even if the numbers are surrounded by '()' or a single space.
This can be re-written in a more compact way, but I believe this form is clear enough; it uses regular expressions like [ |\)]{0,1} to match a space or a parenthesis, with zero or one occurrence.

Remove substrings that vary in value in Oracle

I have a column in Oracle which can contain up to 5 separate values, each separated by a '|'. Any of the values can be present or missing. Here are come examples of how the data might look:
100-1
10-3|25-1|120/240
15-1|15-3|15-2|120/208
15-1|15-3|15-2|120/208|STA-2
112-123|120/208|STA-3
The values are arbitrary except for the order. The numerical values separated by dashes always come first. There can be 1 to 3 of these values present. The numerical values separated by a slash (if it is present) is next. The string, 'STA', and a numerical value separated by a dash is always last, if it is present.
What I would like to do is reformat this column to only ever include the first three possible values, those being the three numerical values separated by dashes. Afterwards, I want to replace 2nd numeric in each value (the numeric after the dash) using the following pattern:
1 = A
2 = B
3 = C
I would also like to remove the dash afterwards, but not the '|' that separates the values unless there is a trailing '|'.
To give you an idea, here's how the values at the beginning of the post would look after the reformatting:
100A
10C|25A
15A|15C|15B
15A|15C|15B
112ABC
I'm thinking this can be done with regex expressions but it's got me a little confused. Does anyone have a solution?
If I have to solve this problem I will solve it in following ways.
SELECT
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?',''),
REGEXP_REPLACE(column,'(\d+)-(1)([^\d])','\1A\3'),
REGEXP_REPLACE(column,'(\d+)-(2)([^\d])','\1B\3'),
REGEXP_REPLACE(column,'(\d+)-(3)([^\d])','\1C\3'),
REGEXP_REPLACE(column,'(\d+)-(123)([^\d])','\1ABC')
FROM table;
Explanation: Let us break down each REGEXP_REPLACE statement one by one.
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?','')
This will replace the end part like 120/208|STA-2 with empty string so that further processing is easy.
Finding match was easy but replacing A for 1, B for 2 and C for 3 was not possible ( as per my knowledge ) So I did those matching and replacements separately.
In each regex from second statement (\d+)-(yourNumber)([^\d]) first group is number before - then yourNumber is either 1,2,3 or 123 followed by |.
So the replacement will be according to yourNumber.
All demos here from version 1 to 5.
Note:- I have just done replacement for combination of yourNUmber for those present in question. You can do likewise for other combinations too.
you can do this in one line, but you can write simple function to do that
SELECT str, REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?','') cut
, REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4') rep3toC
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4') rep2toB
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4') rep1toA
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4'), '-', '') "rep-"
FROM (
SELECT '100-1' str FROM dual UNION
SELECT '10-3|25-1|120/240' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208|STA-2' str FROM dual UNION
SELECT '112-123|120/208|STA-3' FROM dual
) tab

Simple regular expression in Oracle

I have a table with the following values:
ID NAME ADDRESS
1 Bob Super stree1 here goes
2 Alice stree100 here goes
3 Clark Fast left stree1005
4 Magie Right stree1580 here goes
I need to make a query using LIKE and get only the row having stree1 (in this case only get the one with ID=1) and I use the following query:
select * from table t1 WHERE t1.ADDRESS LIKE '%stree1%';
But the problem is that I get all rows as each of them contains stree1 plus some char/number after.
I have found out that I can use REGEXP_LIKE as I am using oracle, what would be the proper regex to use in:
select * from table t1 WHERE regexp_like(t1.ADDRESS ,'stree1');
I would think that this would be the reg-ex you are seeking:
select * from table t1 WHERE regexp_like(t1.ADDRESS ,'stree1(?:[^[:word:]]|$)');
If you want to, you can further simplify this to:
select * from table t1 WHERE regexp_like(t1.ADDRESS ,'stree1(?:\W|$)');
That is, 'stree1' is not followed by a word character (i.e., is followed by space/punctuation/etc...) or 'stree1' appears at the end of the string. Of course there are many other ways to do the same thing, including word boundaries 'stree1\b', expecting particular characters after the 1 in stree1 (e.g., a white-space with 'stree1\s'), etc...
This may help:
stree1\b
The first '\W' is tells it it a non-word character since you need noting after 'stree1' but space
and '$' tells take it as a valid string if it ends with stree1
select *
from table1
where regexp_like(address,'stree1(\W|$)')