Split records with complex delimiter - regex

I have an incoming record with a complex column delimiter and need to tokenize the record.
One of the delimiter characters can be part of the data.
I am looking for a regular expression.
It must work on Teradata 16.1 with the function "REGEXP_SUBSTR".
There can be a maximum of 5 columns to tokenize.
I am planning to use CASE statements in Teradata to tokenize the columns.
I guess a regular expression for one token will do the trick.
Case#1: Column delimiter is ' - '
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]
My attempt : ([\S]*)\s{1}\-{1}\s{1}
Case#2 : Column delimiter is '::'
Sample data : On:e::tw:o::thr$ee
Required output : [On:e, tw:o, thr$ee]
Case#3 : Column delimiter is ':;'
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]
The above 3 cases are independent and do not occur together, i.e., 3 different solutions are required.

If you absolutely must use a regex for this, you could do it as in the examples below, using capture groups.
Generic example:
/(?<data>.+?)($delimiter|$)/gm
(?<data>.+?) named capture group data, matching:
. any character
+? occurring one or more times, as few times as possible (lazy)
followed by
($delimiter|$) another capture group, matching:
$delimiter - replace this with regex matching your delimiter string
| or
$ end of string
Picking up your examples:
Case #1:
Column delimiter is ' - '
/(?<data>.+?)(\s-\s|$)/gm
https://regex101.com/r/qMYxAY/1
Case #2:
Column delimiter is '::'
/(?<data>.+?)(\:\:|$)/gm
https://regex101.com/r/IzaAoA/1
Case #3:
Column delimiter is ':;'
/(?<data>.+?)(\:\;|$)/gm
https://regex101.com/r/g1MUb6/1
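The lazy-capture pattern above can be tried outside the database. Below is a minimal sketch, assuming Python's re module as a stand-in engine (it spells the named group (?P<data>...), and Teradata's REGEXP_SUBSTR may handle groups differently). The helper name tokenize is hypothetical, used only for illustration.

```python
import re

def tokenize(record, delimiter):
    """Split record on a multi-character delimiter using a lazy capture group."""
    # (?P<data>.+?) captures as few characters as possible, stopping as soon
    # as the literal delimiter or the end of the string follows.
    pattern = re.compile(r"(?P<data>.+?)(?:" + re.escape(delimiter) + r"|$)")
    return [m.group("data") for m in pattern.finditer(record)]

print(tokenize("On-e - tw o - thr$ee", " - "))
print(tokenize("On:e::tw:o::thr$ee", "::"))
print(tokenize("On:e:;tw;o:;thr$ee", ":;"))
```

Because .+? needs at least one character, empty tokens between two adjacent delimiters are not returned by this sketch.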

Normally you would use STRTOK to split a string on a delimiter, but STRTOK can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multi-character delimiter with a single character and split on that. The replacement character (here '|') must be one that never occurs in the data. For example, with somecol standing for your column:
select
strtok(oreplace(somecol,' - ', '|'),'|',1) as one,
strtok(oreplace(somecol,' - ', '|'),'|',2) as two,
strtok(oreplace(somecol,' - ', '|'),'|',3) as three,
strtok(oreplace(somecol,' - ', '|'),'|',4) as four,
strtok(oreplace(somecol,' - ', '|'),'|',5) as five
from <your table>
If there are only three tokens, as in your samples, it just returns null for the other two.
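The replace-then-split idea can be sketched in Python as well. This is only an illustration under stated assumptions: nth_token is a hypothetical helper, the placeholder '|' is assumed not to occur in the data, and dropping empty tokens is a choice made here, not a statement about how STRTOK behaves.

```python
def nth_token(record, delimiter, n, placeholder="|"):
    """Return the n-th (1-based) token, or None when there are fewer than n tokens."""
    if placeholder in record:
        # The trick only works if the placeholder never appears in the data.
        raise ValueError("placeholder already occurs in the data")
    single = record.replace(delimiter, placeholder)
    # Assumption for this sketch: empty tokens are dropped.
    tokens = [t for t in single.split(placeholder) if t]
    return tokens[n - 1] if 1 <= n <= len(tokens) else None

row = "On-e - tw o - thr$ee"
print([nth_token(row, " - ", i) for i in range(1, 6)])
```

Positions beyond the last token come back as None, which mirrors the null columns described above.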


Extract from string in BigQuery using regexp_extract

I have a long string in BigQuery from which I need to extract some data.
Part of the string looks like this:
... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...
I want to extract values such as agent, hi, and 0.61254.
I'm trying to use regexp_extract but I can't get the regexp to work correctly:
select
regexp_extract([col],r'score: [0-9]*\.[0-9]+') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]
What should the regexp be to just get agent or 0.61254 without the field name and no quotation marks?
Thank you in advance.
I love non-trivial approaches; below is one such:
select * except(col) from (
select col, split(kv, ': ')[offset(0)] key,
trim(split(kv, ': ')[offset(1)], '"') value,
from your_table,
unnest(regexp_extract_all(col, r'\w+: "?[\w.]+"?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))
if applied to sample data as in your question
with your_table as (
select '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... ' col union all
select '... source: "agent2" resolved_query: "hello" score: 0.12345 parameters ... ' col
)
the output has one row per input string, with the columns source, resolved_query, and score.
As you might have noticed, the benefit of this approach is that if you have more fields/attributes to extract, you do not need to clone a line of code for each attribute. You just add another value to the list in the last line; the rest of the code stays the same.
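The extract-all-then-pick idea can be illustrated outside BigQuery. Below is a rough Python sketch, assuming the same key/value regex as in the query; extract_fields and its wanted parameter are hypothetical names, and a dict stands in for the pivoted row.

```python
import re

def extract_fields(col, wanted=("source", "resolved_query", "score")):
    """Collect key: value pairs from col, keeping only the wanted keys."""
    fields = {}
    # Same pattern as in the query: a word, ': ', then an optionally quoted value.
    for kv in re.findall(r'\w+: "?[\w.]+"?', col):
        key, value = kv.split(": ", 1)
        if key in wanted:
            fields[key] = value.strip('"')
    return fields

sample = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... '
print(extract_fields(sample))
```

Adding another attribute only means adding its name to wanted, which is the same design point the query makes with its pivot list.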
You can use
select
regexp_extract([col],r'score:\s*(\d*\.?\d+)') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]
Here,
score:\s*(\d*\.?\d+) matches score: string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional . and then one or more digits
resolved_query:\s*"([^"]*)" matches a resolved_query: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char
source:\s*"([^"]*)" matches a source: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char.
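To see what "captures into Group 1" means in practice, here is a small sketch using Python's re module as a stand-in for regexp_extract (an assumption; the two are not the same engine). The helper first_group is a hypothetical name.

```python
import re

def first_group(pattern, text):
    """Return the text captured by group 1, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

text = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...'

score = first_group(r'score:\s*(\d*\.?\d+)', text)
resolved_query = first_group(r'resolved_query:\s*"([^"]*)"', text)
source = first_group(r'source:\s*"([^"]*)"', text)
print(score, resolved_query, source)
```

Only the parenthesized part is returned, so the field name and the quotation marks are left out.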

Extract book name from a string in Hive

My data is something like this -
1124 An Orphan's Journey
234 Red Dragon
35600 You'll Know When It's Time
It has two values, the first one is Book ID, and the second one is the book name.
I used the split function in Hive but that doesn't look proper.
SELECT split(books, '\\ ')[0] book_id,
split(books, '\\ ')[1] + ' ' +
split(books, '\\ ')[2] + ' ' +
split(books, '\\ ')[3] + ' ' +
split(books, '\\ ')[4] as book_name
FROM books;
So far values are good but I don't feel it is the right approach.
Please help.
You may use
REGEXP_EXTRACT(books, '^\\d+', 0)
to get the book ID and
REGEXP_EXTRACT(books, '\\s+(\\S.*)', 1)
to extract the book name. The second regex can be made stricter: for example, you can also require digits at the start of the string, '^\\d+\\s+(\\S.*)'.
Here,
^\d+ - matches one or more (+) digits at the start of the string (^)
\s+(\S.*) - matches one or more whitespace chars (\s+) and then captures into Group 1 any non-whitespace char (\S) and then the rest of the string (.* matches any zero or more chars other than line break chars, as many as possible). Note the index argument is set to 1 in the second call to REGEXP_EXTRACT so that only the Group 1 value is returned, without the initial whitespace.
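The two patterns can be checked quickly against the sample rows. This is a minimal Python sketch, assuming Python's re module behaves like Hive's regex engine for these simple patterns; parse_book is a hypothetical helper name.

```python
import re

def parse_book(line):
    """Return (book_id, book_name) using the two patterns from the answer."""
    id_match = re.search(r"^\d+", line)          # group 0: the leading digits
    name_match = re.search(r"\s+(\S.*)", line)   # group 1: text after the first whitespace run
    book_id = id_match.group(0) if id_match else None
    book_name = name_match.group(1) if name_match else None
    return book_id, book_name

for row in ["1124 An Orphan's Journey", "234 Red Dragon", "35600 You'll Know When It's Time"]:
    print(parse_book(row))
```

Unlike the split-based query, this does not depend on how many words the book name has.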

Concatenate special characters to the column values based on pattern matching in Postgres

I have a table with following column in postgres
col1
C29[40
D1305_D1306delinsKK
E602C[20
I would like to prepend the string 'p.' to every element and also append a closing square bracket to the elements in rows 1 and 3.
The expected output is:
col2
p.C29[40]
p.D1305_D1306delinsKK
p.E602C[20]
I am running following query, which runs without an error but the expected output is missing.
SELECT *,
CASE
WHEN t.c LIKE 'p.?=[%'
THEN 'p.'|| t.c || ']'
ELSE 'p.'|| t.c
END AS col2
FROM table;
You may use two chained REGEXP_REPLACE calls:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('C29[40', '^(.*\[\d+)$', 'p.\1]'), '^(?:p\.)?', 'p.')
See the regex demo #1 and regex demo #2 and the PostgreSQL demo.
Pattern details
^ - start of string
(.*\[\d+) - Group 1 (\1): any 0+ chars as many as possible (.*), then [ and 1+ digits
$ - end of string.
The ^(?:p\.)? pattern matches an optional p. substring at the beginning of the string, and thus either adds p. or replaces p. with p. (thus, keeping it).
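The two chained replacements can be tried on all three sample values. Below is a minimal sketch, assuming Python's re.sub as a stand-in for REGEXP_REPLACE; add_prefix is a hypothetical helper name.

```python
import re

def add_prefix(value):
    """Apply the two chained replacements from the answer."""
    # Step 1: if the value ends in '[' plus digits, add 'p.' and close the bracket.
    step1 = re.sub(r"^(.*\[\d+)$", r"p.\1]", value)
    # Step 2: make sure the value starts with 'p.' (kept if already present).
    return re.sub(r"^(?:p\.)?", "p.", step1)

for value in ["C29[40", "D1305_D1306delinsKK", "E602C[20"]:
    print(add_prefix(value))
```

The second step is what handles row 2, where the first pattern does not match and the value passes through unchanged.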

Split single row string into multiple rows by multi-character delimiter Oracle

I have attempted to use the question Splitting string into multiple rows in Oracle and adjust it to my needs; however, I'm not very confident with regex and have not been able to solve it by searching.
Currently that question's answers solve it with a lot of regexp_substr calls, using [^,]+ as the pattern so that it splits on a single comma. I need it to split on a multi-character delimiter (e.g. #;), but that kind of pattern splits on any single character in the class, so a # or ; elsewhere in the text also causes a split.
I've worked out that the pattern (#;+) will match every group of #;, but I cannot work out how to invert this, as done above, to split the row into multiple rows.
I'm sure I'm just missing something simple so any help would be greatly appreciated!
I think you should use:
[^#;+]+
instead of
(#;+)
This is a character class, so it matches any run of characters other than #, ; or +, and the string is split at each of those characters individually.
You can change it according to your requirement, but in the regex I shared, #, ; and + are all treated as delimiters.
So, in the end, the query would look something like this:
with tbl(str) as (
select ' My, Delimiter# Hello My; Delimiter World My Delimiter My Delimiter test My Delimiter ' from dual
)
SELECT LEVEL AS element,
REGEXP_SUBSTR( str ,'([^#;+]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;+]')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My
3 Delimiter World My Delimiter My Delimiter test My Deli
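The difference between a negated character class and a literal multi-character delimiter is easy to see on a small string. This is an illustrative Python sketch with made-up sample text, not Oracle code; str.split is used here only to show what splitting on the exact sequence #; would look like.

```python
import re

text = "alpha#;beta#gamma;delta"

# Negated character class: every single '#', ';' or '+' ends a token.
by_class = re.findall(r"[^#;+]+", text)

# Literal two-character delimiter: only the exact sequence '#;' splits.
by_literal = text.split("#;")

print(by_class)
print(by_literal)
```

With the character class, the lone # and ; inside the second field also cause splits, which is the behavior the question describes as unwanted.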
-- EDIT --
If you want to split only on a run of several # or ; characters, and not on a single occurrence, I found the regex below, but it is not supported by Oracle.
(?:(?:(?![;#]+).#(?![;#]+).|(?![;#]+).;(?![;#]+).|(?![;#]+).)*)+
So, I found no easy way apart from the query below, which will not split on a single # or ; if there is only one such instance between two delimiters:
with tbl(str) as (
select ' My, Delimiter;# Hello My Delimiter ;;# World My Delimiter ; My Delimiter test#; My Delimiter ' from dual
)
SELECT LEVEL AS element,
REGEXP_SUBSTR( str ,'([^#;]+#?[^#;]+;?[^#;]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;]{2,}')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My Delimiter
3 World My Delimiter ; My Delimiter test
4 My Delimiter
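The "split only on runs of two or more delimiter characters" idea can be cross-checked with a different technique: splitting directly on the run pattern [#;]{2,}. The sketch below uses Python's re.split, which is a swapped-in approach for illustration and not what the Oracle query does internally.

```python
import re

text = (" My, Delimiter;# Hello My Delimiter ;;# World My Delimiter ; "
        "My Delimiter test#; My Delimiter ")

# Split wherever two or more '#' or ';' characters appear in a row;
# a single ';' on its own is left inside the element.
elements = [part.strip() for part in re.split(r"[#;]{2,}", text)]

for level, value in enumerate(elements, start=1):
    print(level, value)
```

The number of elements is the number of delimiter runs plus one, which matches the CONNECT BY bound used in the query.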

Teradata regexp_replace to eliminate specific special characters

I imported a file that contains email addresses (email_source). I need to join this table to another, using this field but it contains commas (,) and double quotes (") before and after the email address (eg. "johnsmith#gmail.com,","). I want to replace all commas and double quotes with a space.
What is the correct syntax in teradata?
Just do this:
REGEXP_REPLACE(email_source, '[,"]', ' ', 1, 0, 'i')
Breakdown:
REGEXP_REPLACE(email_source, -- sourcestring
'[,"]', -- regexp
' ', -- replacestring
1, -- startposition
0, -- occurrence, 0 = all
'i' -- match -> case insensitive
)
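You don't need a regex for this; a simple oTranslate should be more efficient. The replacement string needs one space per character being replaced, because characters without a counterpart in it are removed rather than replaced:
oTranslate(email_source, ',"', '  ')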
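Both approaches, a character-class regex and a character-for-character translation, can be compared side by side. This is a minimal Python sketch of the two ideas, not Teradata code; clean_regex and clean_translate are hypothetical helper names, and the sample value is the one from the question.

```python
import re

def clean_regex(value):
    """Replace every comma and double quote with a space using a character class."""
    return re.sub(r'[,"]', " ", value)

def clean_translate(value):
    """Same result via a one-to-one character mapping (two characters, two spaces)."""
    return value.translate(str.maketrans(',"', "  "))

raw = '"johnsmith#gmail.com,","'
print(repr(clean_regex(raw)))
print(repr(clean_translate(raw)))
```

Since the cleaned value is padded with spaces, trimming it before the join is probably worth considering.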