I have a long string in BigQuery from which I need to extract some data.
Part of the string looks like this:
... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...
I want to extract out data such as agent, hi, and 0.61254.
I'm trying to use regexp_extract but I can't get the regexp to work correctly:
select
regexp_extract([col],r'score: [0-9]*\.[0-9]+') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]
What should the regexp be to just get agent or 0.61254 without the field name and no quotation marks?
Thank you in advance.
I love non-trivial approaches; below is one of them:
select * except(col) from (
select col, split(kv, ': ')[offset(0)] key,
trim(split(kv, ': ')[offset(1)], '"') value,
from your_table,
unnest(regexp_extract_all(col, r'\w+: "?[\w.]+"?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))
if applied to sample data as in your question
with your_table as (
select '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... ' col union all
select '... source: "agent2" resolved_query: "hello" score: 0.12345 parameters ... ' col
)
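the output has one row per input string, with the columns source, resolved_query and score holding agent, hi, 0.61254 and agent2, hello, 0.12345 respectively.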
As you might have noticed, the benefit of this approach is that if you have more fields/attributes to extract, you do not need to clone a line of code for each attribute: you just add another value to the list in the last line, and the rest of the code stays the same.
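The same idea (find every key/value pair, split it on ': ', strip the quotes, keep only the wanted keys) can be sketched outside BigQuery. The following is a minimal Python illustration, not the BigQuery implementation; the function name extract_fields and its wanted parameter are made up for this example.

```python
import re

# Same pair pattern as in the query above: a word key, ": ", then an
# optionally quoted value made of word characters and dots.
PAIR = re.compile(r'\w+: "?[\w.]+"?')

def extract_fields(text, wanted=("source", "resolved_query", "score")):
    """Return {key: value} for the wanted keys found in text.

    Hypothetical helper for illustration. If a key repeats, the last
    value wins here, whereas the query above keeps min(value).
    """
    fields = {}
    for kv in PAIR.findall(text):
        key, value = kv.split(": ", 1)
        if key in wanted:
            fields[key] = value.strip('"')
    return fields

row = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... '
print(extract_fields(row))
```

Adding another attribute only means adding another name to wanted, which mirrors adding a value to the pivot list.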
You can use
select
regexp_extract([col],r'score:\s*(\d*\.?\d+)') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]
Here,
score:\s*(\d*\.?\d+) matches score: string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional . and then one or more digits
resolved_query:\s*"([^"]*)" matches a resolved_query: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char
source:\s*"([^"]*)" matches a source: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char.
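To see what each capturing group picks up, here is a small Python sketch that applies the same three patterns to the sample string. It assumes Python's re module behaves like BigQuery's regex engine for these particular patterns; the PATTERNS dictionary and the extract function are names invented for the example.

```python
import re

# The three patterns from the answer above, unchanged.
PATTERNS = {
    "score": r'score:\s*(\d*\.?\d+)',
    "resolved_query": r'resolved_query:\s*"([^"]*)"',
    "source": r'source:\s*"([^"]*)"',
}

def extract(text):
    """Return the Group 1 capture for each pattern, or None if it does not match."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result

sample = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...'
print(extract(sample))
```

Because only the group is returned, the field name and the quotation marks never appear in the result.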
I have an incoming record with a complex column delimiter and need to tokenize the record.
One of the delimiter characters can be part of the data.
I am looking for a regex expression.
It has to work on Teradata 16.1 with the function "REGEXP_SUBSTR".
There can be a maximum of 5 columns to tokenize.
I am planning to use case statements in Teradata to tokenize the columns.
I guess a regular expression for one token will do the trick.
Case#1: Column delimiter is ' - '
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]
My attempt : ([\S]*)\s{1}\-{1}\s{1}
Case#2 : Column delimiter is '::'
Sample data : On:e::tw:o::thr$ee
Required output : [On:e, tw:o, thr$ee]
Case#3 : Column delimiter is ':;'
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]
The above 3 cases are independent and do not occur together, i.e., 3 different solutions are required.
If you absolutely must use RegEx for this, you could do it like in the examples shown below using capture groups.
Generic example:
/(?<data>.+?)($delimiter|$)/gm
(?<data>.+?) named capture group data, matching:
. any character
+? occurring between one and unlimited times, as few as possible (lazy)
followed by
($delimiter|$) another capture group, matching:
$delimiter - replace this with regex matching your delimiter string
| or
$ end of string
Picking up your examples:
Case #1:
Column delimiter is ' - '
/(?<data>.+?)(\s-\s|$)/gm
https://regex101.com/r/qMYxAY/1
Case #2:
Column delimiter is '::'
/(?<data>.+?)(\:\:|$)/gm
https://regex101.com/r/IzaAoA/1
Case #3:
Column delimiter is ':;'
/(?<data>.+?)(\:\;|$)/gm
https://regex101.com/r/g1MUb6/1
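The generic pattern above can be tried on all three cases with a short Python sketch. Note the assumptions: Python spells the named group (?P<data>...) instead of (?<data>...), and whether Teradata's REGEXP_SUBSTR accepts the same syntax is not verified here. The function name split_fields is hypothetical.

```python
import re

def split_fields(text, delimiter):
    """Split text on a multi-character delimiter with the lazy-capture pattern.

    Each match is "as few characters as possible, followed by the
    delimiter or the end of the string".
    """
    pattern = re.compile(r'(?P<data>.+?)(?:' + re.escape(delimiter) + r'|$)')
    return [m.group('data') for m in pattern.finditer(text)]

print(split_fields('On-e - tw o - thr$ee', ' - '))
print(split_fields('On:e::tw:o::thr$ee', '::'))
print(split_fields('On:e:;tw;o:;thr$ee', ':;'))
```

A single delimiter character inside the data (the - in On-e, the : in On:e) is kept because the lazy group only stops where the whole delimiter follows.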
Normally you would use STRTOK to split a string on a delimiter. But strtok can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multiple characters of the delimiter with a single character and split on that. For example:
select
strtok(oreplace(<your column>,' - ', '|'),'|',1) as one,
strtok(oreplace(<your column>,' - ', '|'),'|',2) as two,
strtok(oreplace(<your column>,' - ', '|'),'|',3) as three,
strtok(oreplace(<your column>,' - ', '|'),'|',4) as four,
strtok(oreplace(<your column>,' - ', '|'),'|',5) as five
If there are only three occurrences, like in your samples, it just returns null for the other two.
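The replace-then-split idea can be illustrated in Python as well. This is only a sketch of the technique, not of how STRTOK or OREPLACE behave in every edge case; the function tokenize and its parameters are invented for the example. One caveat applies to the SQL too: the replacement character must not occur in the data.

```python
def tokenize(text, delimiter, max_tokens=5, placeholder="|"):
    """Replace the multi-character delimiter with one character, then split.

    Hypothetical helper: missing tokens are padded with None, mirroring
    the NULLs returned for the columns that have no value.
    """
    if placeholder in text:
        # The trick breaks if the single character already appears in the data.
        raise ValueError("placeholder occurs in the data")
    parts = text.replace(delimiter, placeholder).split(placeholder)
    parts = parts[:max_tokens]
    return parts + [None] * (max_tokens - len(parts))

print(tokenize('On-e - tw o - thr$ee', ' - '))
```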
I have many sql files. I am trying to locate the files that contain a variable (in the format #varname), but ONLY if it appears within matching single or double quotes. I only care that such a variable exists; I just need to know which files this occurs in.
I can match all the quoted strings, but can't figure out how to test that even just a single # char appears within the match
matching single and double quote pairs (["'])(.*?)\1
example file:
...sql statements
select #sql = 'select * from
Users where id = #id '
...more sql statements
Thanks in advance.
EDIT
Here is a better example file, with comments (sql comments) on which statements should match and examples of ones that shouldn't
...sql statements
-- only this quoted string would match
select #sql = "select * from
Users where id = #id "
-- other statements that wouldn't match because not in a pair of quotes
if ltrim(isnull(#stat,'')) <> '' and #stat <> '""'
begin
select #sql = #sql + " and Stat in ("+#stat+")"
end
if isnull(#atype,'') <> ''
begin
select #sql = #sql + " and Type in ("+#atype+")"
end
...more sql statements
For the sample text given, try:
(?:\s|=)(?:\"[^"]*#[^"]*\"|(?:\s|=)\'[^']*#[^']*\')
Demo:
https://regex101.com/r/BXcYt4/1
Using PCRE, and to "test that even just a single # char appears within the match", you can use an alternation in which each branch excludes its own quote character (" or ') and also excludes # by adding it to the negated character class.
To get both values in the same group, you can use a branch reset group.
=\h*(?|"([^"#]*#[^"#]+)"|'([^#']*#[^'#]*)')
The pattern matches:
=\h* Match = and optional horizontal whitespace chars
(?| Branch reset group
"( Match " and start group 1
[^"#]*# Match optional chars other than " or # and then match #
[^"#]+ Match 1+ chars other than " or #
)" Close group 1 and match "
| Or
'([^#']*#[^'#]*)' The same as the previous alternative, this time for ' (with * instead of + after the #)
) Close branch reset group
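For a quick check against the sample statements, here is a Python approximation. Python's re module supports neither the branch reset group (?|...) nor \h, so this sketch uses a plain alternation with two groups and [ \t] instead; it is an adaptation, not the PCRE pattern itself. The names QUOTED_VAR and quoted_variables are hypothetical.

```python
import re

# Adapted from the PCRE pattern above: "=" and optional horizontal
# whitespace, then a double- or single-quoted string containing one "#".
QUOTED_VAR = re.compile(r"""=[ \t]*(?:"([^"#]*#[^"#]+)"|'([^#']*#[^'#]*)')""")

def quoted_variables(sql):
    """Return the quoted strings (without quotes) that contain a # variable."""
    return [m.group(1) or m.group(2) for m in QUOTED_VAR.finditer(sql)]

should_match = 'select #sql = "select * from\nUsers where id = #id "'
should_not_match = 'select #sql = #sql + " and Stat in ("+#stat+")"'

print(quoted_variables(should_match))
print(quoted_variables(should_not_match))
```

To list the files, it is enough to test whether quoted_variables returns a non-empty list for each file's contents.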
I have a table with following column in postgres
col1
C29[40
D1305_D1306delinsKK
E602C[20
I would like to prepend the string 'p.' to every element and also append a closing square bracket to the elements in rows 1 and 3.
The expected output is:
col2
p.C29[40]
p.D1305_D1306delinsKK
p.E602C[20]
I am running the following query, which runs without an error but does not produce the expected output.
SELECT *,
CASE
WHEN t.c LIKE 'p.?=[%'
THEN 'p.'|| t.c || ']'
ELSE 'p.'|| t.c
END AS col2
FROM table;
You may use two chained REGEXP_REPLACE calls:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('C29[40', '^(.*\[\d+)$', 'p.\1]'), '^(?:p\.)?', 'p.')
Pattern details
^ - start of string
(.*\[\d+) - Group 1 (\1): any 0+ chars as many as possible (.*), then [ and 1+ digits
$ - end of string.
The ^(?:p\.)? pattern matches an optional p. substring at the beginning of the string, and thus either adds p. or replaces p. with p. (thus, keeping it).
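The two chained replacements can be traced on all three sample values with a small Python sketch. It assumes Python's re.sub treats these two patterns the same way PostgreSQL's REGEXP_REPLACE does (one replacement per call); the function name add_prefix_and_bracket is made up for the example.

```python
import re

def add_prefix_and_bracket(value):
    """Apply the two chained replacements from the answer above."""
    # Step 1: if the value ends in "[digits", close the bracket and add "p.".
    closed = re.sub(r'^(.*\[\d+)$', r'p.\1]', value)
    # Step 2: add "p." if it is missing, or replace an existing "p." with itself.
    return re.sub(r'^(?:p\.)?', 'p.', closed, count=1)

for value in ['C29[40', 'D1305_D1306delinsKK', 'E602C[20']:
    print(add_prefix_and_bracket(value))
```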
I am trying to use regex in a Rails application I'm building to separate input without splitting the string up manually.
My regex is:
(?<action>\S+)(?:\s(?<query>.*)\s)(?<id>(?<=.).*?(?=\s))
And the test data I am using is as follows:
add hello by name
remove first second by id
add first
From this, I want the following values:
action: add, query: hello, id: name
action: remove, query: first second, id: id
action: add, query: first, id: nil (or "")
What am I doing wrong? It won't match at all on the last line of test data. Any help would be great.
Try this one:
^(?<action>\S+)(?:\s(?<query>(?:(?! by ).)*))(?: by (?<id>\w+))?
The id is always preceded by " by ", so each character in your <query> group should repeat a negative lookahead for that " by " substring.
Also ensure that the group around the id is optional, so that the third line gets matched as well.
Another option, instead of repeating a negative lookahead, would be to have a single positive lookahead for " by " or the end of the string, and repeat lazily:
^(?<action>\S+)(?:\s(?<query>.*?(?= by |$)))(?: by (?<id>\w+))?$
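Both variants can be compared on the three test lines with a short Python sketch. Python writes named groups as (?P<name>...) rather than Ruby's (?<name>...), so the patterns are adapted accordingly; TEMPERED, LAZY and parse are names chosen for this example only.

```python
import re

# Variant 1: every query character is checked by a negative lookahead for " by ".
TEMPERED = re.compile(
    r'^(?P<action>\S+)(?:\s(?P<query>(?:(?! by ).)*))(?: by (?P<id>\w+))?')
# Variant 2: lazy query that stops at " by " or at the end of the string.
LAZY = re.compile(
    r'^(?P<action>\S+)(?:\s(?P<query>.*?(?= by |$)))(?: by (?P<id>\w+))?$')

def parse(line, pattern=TEMPERED):
    """Return the named groups as a dict, or None when the line does not match."""
    match = pattern.match(line)
    return match.groupdict() if match else None

for line in ['add hello by name', 'remove first second by id', 'add first']:
    print(parse(line), parse(line, LAZY))
```

When the optional " by ..." part is absent, the id entry is None, which corresponds to nil in the Ruby match data.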
I need to get the data between the third and fourth occurrences of "*". This is what I have:
with t as (select 'T*76031*12558*test*received percents' as txt from dual)
select regexp_replace(txt, '.*(.{4})[*][^*].*$', '\1')
from t
I receive "test", which is right, but how do I get any number of characters, not just 4?
This should work given the example you have used:
REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
So the SELECT would be:
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
FROM t;
The regex looks for:
Group 1:
Start of string. Any number of characters up to a '*'. Any further characters up to another '*'. Any further characters up to the third '*'.
Group 2:
Any alphanumeric characters
Group 3:
A '*' followed by any other characters up to the end of the string.
Replace all of the above with whatever was found in Group 2.
Hope this helps.
EDIT:
Following on from a great answer from another thread by Rob van Wijk here:
Exracting substring from given string
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_SUBSTR( txt,'[^\*]+',1,4)
FROM t;
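The "fourth run of non-star characters" idea can be sketched in Python to show one behaviour worth knowing: runs are counted, not positions, so an empty field between two stars shifts the numbering. This is an illustration in Python, not Oracle; the function name nth_token is hypothetical.

```python
import re

def nth_token(text, n):
    """Return the n-th (1-based) run of characters other than '*', or None."""
    tokens = re.findall(r'[^*]+', text)
    return tokens[n - 1] if 1 <= n <= len(tokens) else None

print(nth_token('T*76031*12558*test*received percents', 4))
# With an empty second field there are only three non-empty runs.
print(nth_token('T**12558*test', 4))
```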
How about the following?
^([^*]*[*]){3}([^*]*)
The first part matches three runs of non-* characters, each followed by a *, and the second group captures everything up to the next * or the end of the line.
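A short Python sketch of this pattern shows that the wanted text ends up in the second group and that the count is positional, so empty fields are handled. The names FOURTH_FIELD and fourth_field are invented for the example.

```python
import re

# Three "field plus star" repetitions, then the fourth field in group 2.
FOURTH_FIELD = re.compile(r'^([^*]*[*]){3}([^*]*)')

def fourth_field(text):
    """Return the text between the third and fourth '*', or None if there
    are fewer than three stars."""
    match = FOURTH_FIELD.match(text)
    return match.group(2) if match else None

print(fourth_field('T*76031*12558*test*received percents'))
print(fourth_field('T**12558*test'))
```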
You are assuming that the last * of your text is also the fourth. If this assumption is true, then this:
\b\w*\b(?=\*[^*]*$)
Will get you what you want. But of course this only matches the last word between * before the last star. It only matches test in this case or whatever word characters are inside the *.
Note: 10g REGEXP_SUBSTR doesn't support returning subexpressions, see comments below.
If you are really only selecting a part of the string I recommend using REGEXP_SUBSTR instead. I don't know if it's more efficient, but it will better document your intent:
SQL> select regexp_substr('T*76031*12558*test*received percents',
'^([^*]*[*]){3}([^*]*)', 1, 1, '', 2) from dual;
REGEXP_SUBST
------------
test
Above I have used regexp provided by Pieter-Bas.
See also http://www.regular-expressions.info/oracle.html