Extract all substrings bounded by the same characters - regex

Given a name_loc column of text like the following:
{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}
I'm trying to extract the names, ideally separated by commas:
Charlie, Wrigley, Ana
I've gotten this far:
SELECT SUBSTRING(CAST(name_loc AS VARCHAR) from '"([^ –]+)')
FROM table;
which returns
Charlie
How can I extend this query to extract all names?

You can do this with a combination of regexp_matches (to extract the names), array_agg (to regroup all matches in a row) and array_to_string (to format the array as you'd like, e.g. with a comma separator):
WITH input(name_loc) AS (
VALUES ('{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}')
, ('{"Other - somewhere"}') -- added this to show multiple rows do not get merged
)
SELECT array_to_string(names, ', ')
FROM input
CROSS JOIN LATERAL (
SELECT array_agg(name)
FROM regexp_matches(name_loc, '"(\w+)', 'g') AS f(name)
) AS f(names);
array_to_string
Charlie, Wrigley, Ana
Other
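To sanity-check the '"(\w+)' pattern outside the database, here is a minimal Python sketch. It assumes Python's re engine treats this simple pattern the same way PostgreSQL does; the helper name extract_names is made up for illustration.

```python
import re

rows = [
    '{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}',
    '{"Other - somewhere"}',
]

def extract_names(name_loc):
    # '"(\w+)' captures the run of word characters that directly follows a
    # double quote; closing quotes are followed by ',' or '}' and do not match.
    return ", ".join(re.findall(r'"(\w+)', name_loc))

for row in rows:
    print(extract_names(row))
```

Because \w stops at the first non-word character, a name containing a space would be cut short by this pattern.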

My two cents, though I'm rather new to PostgreSQL and had to copy the first piece from @Marth's answer:
WITH input(name_loc) AS (
VALUES ('{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}')
, ('{"Other - somewhere"}')
)
SELECT REGEXP_REPLACE(name_loc, '{?(,)?"(\w+)[^"]+"}?','\1\2', 'g') FROM input;
regexp_replace
Charlie,Wrigley,Ana
Other

Your string literal happens to be a valid array literal.
(Maybe not by coincidence? And the column should be type text[] to begin with?)
If that's the reliable format, there is a safe and simple solution:
SELECT t.id, x.names
FROM tbl t
CROSS JOIN LATERAL (
SELECT string_agg(split_part(elem, ' – ', 1), ', ') AS names
FROM unnest(t.name_loc::text[]) elem
) x;
Or:
SELECT id, string_agg(split_part(elem, ' – ', 1), ', ') AS names
FROM (SELECT id, unnest(name_loc::text[]) AS elem FROM tbl) t
GROUP BY id;
Steps
1: Unnest the array with unnest() in a CROSS JOIN LATERAL, or directly in the SELECT list. See: What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
2: Take the first part with split_part(). I chose ' – ' as the delimiter, not just ' ', to allow for names with a nested space like "Anne Nicole". See: Split comma separated column data into additional columns
3: Aggregate results with string_agg(). I added no particular order as you didn't specify one. See: Concatenate multiple result rows of one column into one, group by another column
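The same split-then-aggregate logic can be sketched in Python to show why the ' – ' delimiter matters. This assumes the array has already been unnested into a list of elements; names_from_elements is a hypothetical helper, not part of any library.

```python
from typing import List

def names_from_elements(elements: List[str]) -> str:
    # Like split_part(elem, ' – ', 1): keep the text before the first ' – '.
    firsts = [elem.split(" – ")[0] for elem in elements]
    # Like string_agg(..., ', '): join the parts with a comma and a space.
    return ", ".join(firsts)

elements = [
    "Charlie – White Plains, NY",
    "Anne Nicole – Minneapolis, MN",
    "Ana – Decatur, GA",
]
print(names_from_elements(elements))
```

Splitting on a bare space instead would reduce "Anne Nicole" to "Anne".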

Related

BigQuery regexp replace character between quotes

I'm trying to use the BigQuery function regexp_replace for the following scenario:
Given a string field with comma as a delimiter, I need to only remove the commas within double quotes.
I found the following regex, which works on the website below, but it seems that the BigQuery function doesn't support lookahead groups. Could you please help me find an equivalent expression that is supported by the BigQuery function regexp_replace?
https://regex101.com/r/nxkqtb/3
BigQuery example code (not supported):
WITH tbl AS (
SELECT 'LINE_NR="1",TXT_FIELD="Some text",CID="0"' as text
UNION ALL
SELECT 'LINE_NR="2",TXT_FIELD=",,Some text",CID="0"' as text
UNION ALL
SELECT 'LINE_NR="3",TXT_FIELD="Some text ,",CID="0"' as text
UNION ALL
SELECT 'LINE_NR="4",TXT_FIELD=",Some ,text,",CID="0"' as text
)
SELECT
REGEXP_REPLACE(text, r'(?m),(?=[^"]*"(?:[^"\r\n]*"[^"]*")*[^"\r\n]*$)', "")
FROM tbl;
Thank you
Consider the approach below (assuming you know the keys within the text field in advance):
select text,
( select string_agg(replace(kv, ',', ''), ',' order by offset)
from unnest(regexp_extract_all(text, r'((?:LINE_NR|TXT_FIELD|CID)=".*?")')) kv with offset
) corrected_text
from tbl;
if applied to sample data in your question - output is
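The extract, clean and re-join idea can be tried outside BigQuery with a small Python sketch. It uses Python's re engine rather than BigQuery's RE2, on the assumption that this particular pattern behaves the same in both; strip_quoted_commas is an illustrative name.

```python
import re

rows = [
    'LINE_NR="1",TXT_FIELD="Some text",CID="0"',
    'LINE_NR="2",TXT_FIELD=",,Some text",CID="0"',
    'LINE_NR="3",TXT_FIELD="Some text ,",CID="0"',
    'LINE_NR="4",TXT_FIELD=",Some ,text,",CID="0"',
]

# Same pattern as in the query: a known key, '=', then a lazily matched quoted value.
KV = re.compile(r'((?:LINE_NR|TXT_FIELD|CID)=".*?")')

def strip_quoted_commas(text):
    pairs = KV.findall(text)  # key="value" pairs in their original order
    # Commas left inside a pair can only be inside the quotes, so drop them,
    # then re-join the pairs with the real delimiter.
    return ",".join(p.replace(",", "") for p in pairs)

for row in rows:
    print(strip_quoted_commas(row))
```

The lazy .*? stops at the first closing quote, so this relies on values never containing a double quote themselves.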

Oracle SQL String Manipulation

My field contains short codes that I want to access, such as C-COR3.
The issue is some records have additional information (F and H with numbers). An example is C-COR3 F1.54H19, I only care about C-COR3. Anything after "F" I want to ignore.
Code below works, but only if I hard-code the full F1.54H19. I want to use wildcards to abstract this for other occurrences that have F and H info in the field. (Ex C-R3 F0.18H18 -> C-R3 or C-COR3 F0.23H8.5 -> C-COR3), note varying short code string lengths.
/* Translates C-COR3 F1.54H19 to C-COR3. */
select distinct SUBSTR(lud_code_short,1,INSTR(lud_code_short, 'F1.54H19')-2)
from rep_dba.mytable
I've read that SUBSTR does not allow wildcards, but have had no luck trying my hand at REGEXP_INSTR and REGEXP_SUBSTR instead. Any help appreciated.
Assuming that the "code" is always the first continuous sequence of non-space characters (and that there are no leading spaces - if there are, that's easy to handle), you could do something like this. Note the str || ' ' in the call to instr() - that takes care of the case when the input string has no spaces in it to begin with. Also notice the last input - since there are no spaces anywhere, the output is the same as the input. (Showing that if the "code" is not always separated from the "additional information" by at least one space, the solution would not work.)
with
test_data (str) as (
select 'C-COR3 F14H2.5' from dual union all
select 'C-AB3' from dual union all
select null from dual union all
select 'C-AB2F14H2.5' from dual
)
select str, substr(str, 1, instr(str || ' ', ' ') - 1) as code
from test_data
;
STR            CODE
-------------- --------------
C-COR3 F14H2.5 C-COR3
C-AB3          C-AB3
C-AB2F14H2.5   C-AB2F14H2.5
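The padding trick (appending a space before searching) may be easier to see in a minimal Python sketch. first_token is a hypothetical helper, and None stands in for SQL NULL.

```python
def first_token(s):
    # None plays the role of NULL and is passed through unchanged.
    if s is None:
        return None
    # Appending a space guarantees the search finds one, mirroring
    # instr(str || ' ', ' ') in the query above.
    end = (s + " ").index(" ")
    return s[:end]

for value in ["C-COR3 F14H2.5", "C-AB3", None, "C-AB2F14H2.5"]:
    print(repr(value), "->", repr(first_token(value)))
```

As in the SQL version, an input without any space comes back unchanged.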
Try using regexp_replace within your query like below
SELECT
regexp_replace('C-COR3 F14H2.5', '(C-[[:alnum:]]+) [FH].*', '\1')
FROM dual;

Oracle Regex remove duplicates

I have a requirement to remove duplicate values from a comma separated string.
Input String: a,a,a,b,c,a,b
Expected output: a,b,c
What I have tried:
with ct(str) as(
select 'a,a,a,b,c,a,b' from dual
)
select REGEXP_REPLACE(str,'([^,]*)(,\1)+($|,)','\1\3') col from ct
Output: a,b,c,a,b
The above query can remove repetitive characters which are consecutive.
I know that the above requirement can be solved by creating a table out of the comma separated values and do a listagg on the distinct values.
Is it possible to achieve the above requirement using a single regex statement?
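This should give you the required result (it does not use a regex, and it relies on apex_string.split, which is available only where Oracle APEX is installed):
with borken as (
select distinct column_value as str
from table(apex_string.split('a,a,a,b,c,a,b', ','))
)
select listagg(str, ',') within group (order by str) from borken;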
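For comparison, the split, de-duplicate and re-join logic looks like this in a minimal Python sketch. dedupe_csv is an illustrative name; unlike a sorted listagg, it keeps values in order of first appearance.

```python
def dedupe_csv(s):
    # dict.fromkeys keeps the first occurrence of each value and preserves
    # insertion order, so duplicates disappear without sorting.
    return ",".join(dict.fromkeys(s.split(",")))

print(dedupe_csv("a,a,a,b,c,a,b"))
```

For the sample input both orderings coincide, but they differ for input such as "b,a,b".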

Hive - regexp_replace function for multiple strings

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
from vp_nlp_protext_males
;
But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
from vp_nlp_protext_males
;
You can remove every occurrence of a character from a string with the TRANSLATE function by mapping it to the empty string. Note that this removes all spaces in ntext, not only the ones inside the listed phrases. For your query it would become this:
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
translate(ntext, ' ', '') as ntext1
from vp_nlp_protext_males
;
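If only the spaces inside the known phrases should go, one option is a single alternation over the phrase list with the space removed from each match. The following Python sketch illustrates that idea only; it is not Hive syntax, and PHRASES and join_phrases are hypothetical names.

```python
import re

# Hypothetical phrase list; in practice this would hold all ~100 n-grams.
PHRASES = ["hip hop", "rock music"]

# One alternation over all phrases, each escaped so it is matched literally.
PATTERN = re.compile("|".join(re.escape(p) for p in PHRASES))

def join_phrases(text):
    # Remove spaces only inside matched phrases; other spaces are untouched.
    return PATTERN.sub(lambda m: m.group(0).replace(" ", ""), text)

print(join_phrases("i like hip hop and rock music a lot"))
```

This needs one pass over the text instead of one nested regexp_replace call per phrase.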

Using Oracle Regular Expression - Masking based on pattern

With Oracle 11g PL/SQL, for the query below, can I get the capture groups' positions (something like what Matcher.start() provides in Java)?
`select regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2') from dual`
Result should look like: "zone", 9 (the start position of the text "zone").
The bigger problem I was trying to solve is to mask data like account number using patterns like '^.....(.*)..$' (this pattern can vary depending on installation).
Will something like below work for you?
select regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2') expr
,instr('1234bankzone1234',regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2')) pos from dual
or more readable subquery like
select a.*, instr(a.value,a.expr) from (
select '1234bankzone1234' value,
regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2') expr from dual
) a
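For reference, this is the kind of result the question asks for, sketched in Python, where the match object exposes group positions directly. group_position is a hypothetical helper, and the position is converted to 1-based to match instr().

```python
import re

def group_position(pattern, text, group):
    # Return (captured text, 1-based start position), or None if no match.
    m = re.search(pattern, text)
    if m is None:
        return None
    # m.start() is 0-based; add 1 to match Oracle's instr() convention.
    return m.group(group), m.start(group) + 1

print(group_position(r'^..(.*)bank(zone).(.*)..$', '1234bankzone1234', 2))
```

One caveat of the instr() workaround: it returns the first occurrence of the captured text, which is not necessarily where the group matched if the same text also appears earlier in the string.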
I couldn't find any direct equivalent of the Matcher API, and there is no way to access the group position buffer in SQL. As a workaround:
1: Invert the pattern, so that captured and uncaptured parts swap, using this:
regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace(
pattern, '(\()', '\1#') , '(\))', '#\1') , '\(#', ')#') , '\^\)#', '^') , '#\)\$', '$') , '#\)', '(#') , '#', '') , '\^([^\(]+\))', '^(\1') , '\(([^\)]+)\$', '(\1)$');
So "^(.)..(.).$" becomes "^.(..).(.)$".
2: Use this to bulk collect index and count of capture groups within both patterns
SELECT REGEXP_instr(pattern, '\(.*?\)+', 1, LEVEL) bulk collect into posCapture FROM v CONNECT BY LEVEL <= REGEXP_COUNT(pattern, '\(.*?\)');
3: Match both patterns against the text-to-be-masked. Merge them by the order found in step 2.
select regexp_replace(v_src, pattern, '\' || captureIndex) into tempStr from dual;