Extract from string in BigQuery using regexp_extract


I have a long string in BigQuery from which I need to extract some data.
Part of the string looks like this:
... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...
I want to extract out data such as agent, hi, and 0.61254.
I'm trying to use regexp_extract but I can't get the regexp to work correctly:
select
regexp_extract([col],r'score: [0-9]*\.[0-9]+') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]
What should the regexp be to get just agent or 0.61254, without the field name or the quotation marks?
Thank you in advance.

I love non-trivial approaches; below is one of them:
select * except(col) from (
select col, split(kv, ': ')[offset(0)] key,
trim(split(kv, ': ')[offset(1)], '"') value,
from your_table,
unnest(regexp_extract_all(col, r'\w+: "?[\w.]+"?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))
if applied to sample data as in your question
with your_table as (
select '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... ' col union all
select '... source: "agent2" resolved_query: "hello" score: 0.12345 parameters ... ' col
)
the output is (row order may vary)
source  resolved_query  score
agent   hi              0.61254
agent2  hello           0.12345
As you might have noticed, the benefit of this approach is that if you have more fields/attributes to extract, you do not need to clone a line of code for each attribute. You just add another value to the list in the last line; the rest of the code stays the same.
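For readers who want to try the key/value idea outside BigQuery, here is a minimal Python sketch of the same steps: find every key: value pair with one pattern, then keep only the wanted keys. The names PAIR and extract_fields are made up for illustration, and Python's re module is not the RE2 engine that BigQuery uses, so treat this as an approximation rather than a port of the query.

```python
import re

# Same shape as the pattern in the query above, but with capture groups
# for the key and for the value (the optional quotes stay outside the groups).
PAIR = re.compile(r'(\w+): "?([\w.]+)"?')

def extract_fields(text, wanted=("source", "resolved_query", "score")):
    """Return a dict of the wanted keys; keys that are absent map to None."""
    found = dict(PAIR.findall(text))
    return {key: found.get(key) for key in wanted}

row = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... '
print(extract_fields(row))
```

Adding another attribute means adding one more name to wanted, which mirrors adding a value to the pivot list.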

You can use
select
regexp_extract([col],r'score:\s*(\d*\.?\d+)') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]
Here,
score:\s*(\d*\.?\d+) - matches the literal text score:, then zero or more whitespace characters, and then capturing group 1 captures zero or more digits, an optional ., and one or more digits
resolved_query:\s*"([^"]*)" - matches the literal text resolved_query:, zero or more whitespace characters and a ", then captures into group 1 zero or more characters other than ", and finally matches the closing "
source:\s*"([^"]*)" - matches the literal text source:, zero or more whitespace characters and a ", then captures into group 1 zero or more characters other than ", and finally matches the closing ".
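The effect of the capturing groups can be checked quickly in Python. This is only a sketch: PATTERNS and first_group are illustrative names, and Python's re engine stands in for BigQuery's RE2, which returns the captured substring from REGEXP_EXTRACT when the pattern contains a capturing group.

```python
import re

# The three patterns from the answer, keyed by the output column name.
PATTERNS = {
    "score": r'score:\s*(\d*\.?\d+)',
    "resolved_query": r'resolved_query:\s*"([^"]*)"',
    "source": r'source:\s*"([^"]*)"',
}

def first_group(pattern, text):
    """Return capturing group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

text = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...'
values = {name: first_group(pattern, text) for name, pattern in PATTERNS.items()}
print(values)
```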

Related

Split records with complex delimiter

I have an incoming record with a complex column delimiter and need to tokenize the record.
One of the delimiter characters can also be part of the data.
I am looking for a regular expression.
It has to work on Teradata 16.1 with the function "REGEXP_SUBSTR".
There can be a maximum of 5 columns to tokenize.
I am planning to use CASE statements in Teradata to tokenize the columns.
I guess a regular expression for one token will do the trick.
Case#1: Column delimiter is ' - '
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]
My attempt : ([\S]*)\s{1}\-{1}\s{1}
Case#2 : Column delimiter is '::'
Sample data : On:e::tw:o::thr$ee
Required output : [On:e, tw:o, thr$ee]
Case#3 : Column delimiter is ':;'
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]
The above 3 cases are independent and do not occur together, i.e., 3 different solutions are required.

If you absolutely must use RegEx for this, you could do it like in the examples shown below using capture groups.
Generic example:
/(?<data>.+?)($delimiter|$)/gm
(?<data>.+?) named capture group data, matching:
. any character
+? occurring one or more times, as few times as possible (lazy)
followed by
($delimiter|$) another capture group, matching:
$delimiter - replace this with regex matching your delimiter string
| or
$ end of string
Picking up your examples:
Case #1:
Column delimiter is ' - '
/(?<data>.+?)(\s-\s|$)/gm
https://regex101.com/r/qMYxAY/1
Case #2:
Column delimiter is '::'
/(?<data>.+?)(\:\:|$)/gm
https://regex101.com/r/IzaAoA/1
Case #3:
Column delimiter is ':;'
/(?<data>.+?)(\:\;|$)/gm
https://regex101.com/r/g1MUb6/1
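As a rough check of the generic pattern, the three cases can be run through Python's re module. This is a sketch under assumptions: split_on is a hypothetical helper, Python writes named groups as (?P<name>...), and Teradata's REGEXP_SUBSTR may not support every construct used here.

```python
import re

def split_on(delimiter, text):
    """Split text on a literal multi-character delimiter using lazy matching."""
    # re.escape makes the delimiter literal; the data group stops at the
    # first position that is followed by the delimiter or the end of the string.
    pattern = re.compile(r'(?P<data>.+?)(?:' + re.escape(delimiter) + r'|$)')
    return [m.group("data") for m in pattern.finditer(text)]

print(split_on(" - ", "On-e - tw o - thr$ee"))
print(split_on("::", "On:e::tw:o::thr$ee"))
print(split_on(":;", "On:e:;tw;o:;thr$ee"))
```

Single delimiter characters inside the data (the - in On-e, the : in On:e) are kept because the lazy group only stops where the full delimiter follows.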

Normally you would use STRTOK to split a string on a delimiter, but STRTOK can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multi-character delimiter with a single character that never occurs in the data, and split on that. For example:
select
strtok(oreplace(<your column>,' - ', '|'),'|',1) as one,
strtok(oreplace(<your column>,' - ', '|'),'|',2) as two,
strtok(oreplace(<your column>,' - ', '|'),'|',3) as three,
strtok(oreplace(<your column>,' - ', '|'),'|',4) as four,
strtok(oreplace(<your column>,' - ', '|'),'|',5) as five
If there are only three occurrences, like in your samples, it just returns null for the other two.
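The replace-then-split idea is easy to prototype before writing the Teradata query. The Python sketch below only approximates it: split_fixed is an invented name, the | placeholder is the one used in the query above, and STRTOK's handling of empty tokens is not modelled.

```python
def split_fixed(text, delimiter, columns=5, placeholder="|"):
    """Replace the delimiter with one placeholder character, split, pad with None."""
    if placeholder in text:
        # The trick only works when the placeholder never occurs in the data.
        raise ValueError("placeholder character occurs in the data")
    tokens = text.replace(delimiter, placeholder).split(placeholder)
    # Pad to the fixed column count; tokens beyond it are dropped.
    return (tokens + [None] * columns)[:columns]

print(split_fixed("On-e - tw o - thr$ee", " - "))
```

The explicit check on the placeholder is the part worth carrying over: if | can appear in the data, a different replacement character is needed.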

Regex to match variables only when inside quotes

I have many SQL files. I am trying to locate the files that contain a variable (in the format #varname) ONLY if it appears within matching single or double quotes. I don't need the matches themselves; I just need to know which files this occurs in.
I can match all the quoted strings, but can't figure out how to test that even just a single # char appears within the match
This matches single and double quote pairs: (["'])(.*?)\1
example file:
...sql statements
select #sql = 'select * from
Users where id = #id '
...more sql statements
Thanks in advance.
EDIT
Here is a better example file, with comments (sql comments) on which statements should match and examples of ones that shouldn't
...sql statements
-- only this quoted string would match
select #sql = "select * from
Users where id = #id "
-- other statements that wouldn't match because not in a pair of quotes
if ltrim(isnull(#stat,'')) <> '' and #stat <> '""'
begin
select #sql = #sql + " and Stat in ("+#stat+")"
end
if isnull(#atype,'') <> ''
begin
select #sql = #sql + " and Type in ("+#atype+")"
end
...more sql statements

For the sample text given....
Try:
(?:\s|=)(?:\"[^"]*#[^"]*\"|\'[^']*#[^']*\')
Demo:
https://regex101.com/r/BXcYt4/1

Using PCRE, and to address the requirement
"to test that even just a single # char appears within the match":
You can use an alternation in which each branch excludes its own quote character (" or '), and add # to the negated character class as well so that the # has to be matched explicitly.
To get both values in the same group, you can use a branch reset group.
=\h*(?|"([^"#]*#[^"#]+)"|'([^#']*#[^'#]*)')
The pattern matches:
=\h* Match = and optional horizontal whitespace chars
(?| Branch reset group
"( Match " and start group 1
[^"#]*# Match optional chars other than " or # and then match #
[^"#]+ Match 1+ chars other than " or #
)" Close group 1 and match "
| Or
'([^#']*#[^'#]*)' The same as the previous alternative, this time for ' (here the class after the # is quantified with *, so it may be empty)
) Close branch reset group
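Python's re module supports neither \h nor branch reset groups, so the sketch below is an adaptation rather than the PCRE pattern itself: \h becomes [ \t] and the branch reset becomes a plain alternation whose result is read from group 1 or group 2. QUOTED_VAR and quoted_variable are illustrative names.

```python
import re

# Adapted pattern: the quoted text lands in group 1 (double quotes)
# or in group 2 (single quotes).
QUOTED_VAR = re.compile(r"""=[ \t]*(?:"([^"#]*#[^"#]+)"|'([^'#]*#[^'#]*)')""")

def quoted_variable(sql_text):
    """Return the first quoted string that contains a # variable, or None."""
    match = QUOTED_VAR.search(sql_text)
    if match is None:
        return None
    return match.group(1) if match.group(1) is not None else match.group(2)

hit = 'select #sql = "select * from\nUsers where id = #id "'
miss = 'select #sql = #sql + " and Stat in ("+#stat+")"'
print(quoted_variable(hit) is not None, quoted_variable(miss) is not None)
```

To list matching files, the same function could be applied to each file's contents and the file name kept whenever the result is not None.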

Concatenate special characters to the column values based on pattern matching in Postgres

I have a table with following column in postgres
col1
C29[40
D1305_D1306delinsKK
E602C[20
I would like to prepend the string 'p.' to every value and, for the values in rows 1 and 3, also append a closing square bracket.
The expected output is:
col2
p.C29[40]
p.D1305_D1306delinsKK
p.E602C[20]
I am running the following query. It runs without an error, but it does not produce the expected output.
SELECT *,
CASE
WHEN t.c LIKE 'p.?=[%'
THEN 'p.'|| t.c || ']'
ELSE 'p.'|| t.c
END AS col2
FROM table;

You may use two chained REGEXP_REPLACE calls:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('C29[40', '^(.*\[\d+)$', 'p.\1]'), '^(?:p\.)?', 'p.')
Pattern details
^ - start of string
(.*\[\d+) - Group 1 (\1): any 0+ chars as many as possible (.*), then [ and 1+ digits
$ - end of string.
The ^(?:p\.)? pattern matches an optional p. substring at the beginning of the string, and thus either adds p. or replaces p. with p. (thus, keeping it).
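The two chained replacements can be tried on the three sample values with Python's re.sub. This is a sketch only: add_prefix_and_bracket is an invented name, and Python's regex flavor differs in places from PostgreSQL's, although both patterns here behave the same on these inputs.

```python
import re

def add_prefix_and_bracket(value):
    # Step 1: if the value ends with "[" plus digits, add "p." and the closing "]".
    value = re.sub(r'^(.*\[\d+)$', r'p.\1]', value)
    # Step 2: match an optional leading "p." and write "p." back,
    # so the prefix ends up there exactly once.
    return re.sub(r'^(?:p\.)?', 'p.', value)

for raw in ["C29[40", "D1305_D1306delinsKK", "E602C[20"]:
    print(raw, "->", add_prefix_and_bracket(raw))
```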

Having issues separating the required data using regex

I am trying to use a regex in a Rails application I'm building to separate input without splitting the string up manually.
My regex is:
(?<action>\S+)(?:\s(?<query>.*)\s)(?<id>(?<=.).*?(?=\s))
And the test data I am using is as follows:
add hello by name
remove first second by id
add first
From this, I want the following values:
action: add, query: hello, id: name
action: remove, query: first second, id: id
action: add, query: first, id: nil (or "")
What am I doing wrong? It won't match at all on the last line of test data. Any help would be great.

Try this one:
^(?<action>\S+)(?:\s(?<query>(?:(?! by ).)*))(?: by (?<id>\w+))?
The id is always preceded by " by ", so each character in your <query> group should repeat a negative lookahead for that " by " substring.
Also ensure that the group around the id is optional, so that the third line gets matched as well.
Another option, instead of repeating a negative lookahead, would be to have a single positive lookahead for " by " or the end of the string, and repeat lazily:
^(?<action>\S+)(?:\s(?<query>.*?(?= by |$)))(?: by (?<id>\w+))?$
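The lazy-lookahead variant can be checked against the three test lines. The sketch below uses Python instead of Ruby, so the named groups are written (?P<name>...); COMMAND and parse are illustrative names.

```python
import re

# Same pattern as above, with Python's spelling of named groups.
COMMAND = re.compile(
    r'^(?P<action>\S+)(?:\s(?P<query>.*?(?= by |$)))(?: by (?P<id>\w+))?$'
)

def parse(line):
    """Return (action, query, id); id is None when there is no ' by ' part."""
    match = COMMAND.match(line)
    if match is None:
        return None
    return match.group("action"), match.group("query"), match.group("id")

for line in ["add hello by name", "remove first second by id", "add first"]:
    print(parse(line))
```

A line with an action but no query, such as a bare "add", does not match because the whitespace after the action is required.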

How to make regular expression correctly?

I need to get the text between the third and the fourth occurrence of "*". This is what I do:
with t as (select 'T*76031*12558*test*received percents' as txt from dual)
select regexp_replace(txt, '.*(.{4})[*][^*].*$', '\1')
from t
I receive "test", which is right, but how do I get any number of characters, not just 4?

This should work given the example you have used:
REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
So the SELECT would be:
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
FROM t;
The regex looks for:
Group 1:
Start of string. Any number of characters up to a '*'. Any further characters up to another '*'. Any further characters up to the third '*'.
Group 2:
Any alphanumeric characters
Group 3:
A '*' followed by any other characters up to the end of the string.
Replace all of the above with whatever was found in Group 2.
Hope this helps.
EDIT:
Following on from a great answer from another thread by Rob van Wijk here:
Extracting substring from given string
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_SUBSTR( txt,'[^\*]+',1,4)
FROM t;

How about the following?
^([^*]*[*]){3}([^*]*)
The first part matches three runs of non-* characters, each followed by a *, and the second part captures everything up to the next * or the end of the line.
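A quick way to confirm what the second group captures is to run the pattern in Python. This is a sketch with an invented helper name, fourth_field; Oracle's regex functions use POSIX syntax, but this particular pattern uses nothing that differs between the two.

```python
import re

def fourth_field(text):
    """Return the text after the third '*' up to the next '*' (or the end)."""
    match = re.match(r'^([^*]*[*]){3}([^*]*)', text)
    # Group 1 only holds the last of the three repetitions; group 2 is the value.
    return match.group(2) if match else None

print(fourth_field('T*76031*12558*test*received percents'))
```

Because the second group is [^*]*, the captured value can be any length, including empty.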

You are assuming that the last * of your text is also the fourth. If this assumption is true, then this:
\b\w*\b(?=\*[^*]*$)
will get you what you want. But of course this only matches the last word before the last *. It only matches test in this case, or whatever word characters precede that last *.

Note: 10g REGEXP_SUBSTR doesn't support returning subexpressions, see comments below.
If you are really only selecting a part of the string I recommend using REGEXP_SUBSTR instead. I don't know if it's more efficient, but it will better document your intent:
SQL> select regexp_substr('T*76031*12558*test*received percents',
'^([^*]*[*]){3}([^*]*)', 1, 1, '', 2) from dual;
REGEXP_SUBST
------------
test
Above I have used regexp provided by Pieter-Bas.
See also http://www.regular-expressions.info/oracle.html

We Keep Coding
