REGEXP_SUBSTR in Teradata

I have data in a column like XXX/XXXX/XXXX/XYYUX/YYY. I am trying to extract only the first two characters after the 3rd slash (/), which is 'XY' in this example. Can you please help?
Thanks!

Try this:
REGEXP_SUBSTR('XXX/XXXX/XXXX/XYYUX/YYY','^([^/]*/){3}\K..',1,1,'i')
'^' start of string
'([^/]*/){3}' looks for 0 or more non-slashes followed by a slash, 3 times
'\K' match reset operator drops the part of the string that has been matched up to this point
'..' grabs the next two characters in the string
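
To check the logic outside Teradata, here is a minimal Python sketch of the same idea. Python's built-in re module does not support \K, so a capture group stands in for it; the function name is made up for this illustration and this is not Teradata code.

```python
import re

def two_chars_after_third_slash(value):
    # Hypothetical helper: skip three "non-slashes followed by a slash" runs
    # from the start of the string, then capture the next two characters
    # (the part that \K would leave as the match).
    match = re.match(r'(?:[^/]*/){3}(..)', value)
    return match.group(1) if match else None

print(two_chars_after_third_slash('XXX/XXXX/XXXX/XYYUX/YYY'))  # XY
```

If the value has fewer than three slashes, the helper returns None.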

Try using - STRTOK('/88/209/89/132]', ' /]', 3)
returns the 3rd octet, '89'
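
The tokenizing idea can be sketched in Python as follows. This is only an approximation that assumes STRTOK treats every character of the delimiter string as a separator and skips empty tokens, which is what the example above suggests; strtok_like is a hypothetical name.

```python
import re

def strtok_like(text, delimiters, n):
    # Hypothetical stand-in for STRTOK: split on any single delimiter
    # character, drop empty tokens, and return the n-th token (1-based).
    tokens = [t for t in re.split('[' + re.escape(delimiters) + ']', text) if t]
    return tokens[n - 1] if 0 < n <= len(tokens) else None

print(strtok_like('/88/209/89/132]', ' /]', 3))             # 89
# Applied to the question: 4th token, then its first two characters.
print(strtok_like('XXX/XXXX/XXXX/XYYUX/YYY', '/', 4)[:2])   # XY
```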

Related

regexextract isn't working the way I want in Google Sheets

...or rather...there's something wrong with my formula.
I have a series of item numbers, and I want to extract only the info between the first and 3rd dash, if any.
The info before the 1st dash must be letters.
The info between the 1st and second dash must be letters (i.e. A-z).
The info between the 2nd and 3rd dash must be numbers.
I want everything else to be ignored (I've wrapped my regexextract in an iferror to do this)
Here's my formula:
=arrayformula(iferror(regexextract(B1:B,"[A-z]+-([A-Z\{\\\]\^_`a-z]+-[0-9]+)-"),"")
It's working most of the time.
But for this: AAB-2971-PN-B-11-03
It extracts this: B-11
But I'm expecting this one to be an error/blank.
Other correct examples:
AAB-LL-1234-00 should extract LL-1234
AAN-1234 should error out
AAC-1234-LL should error out
AAC-1234-ll-123 should error out

Use this regex:
^[A-Za-z]+-([A-Za-z]+-[0-9]+)-
And extract group 1.
There are a few problems with your regex, but the main one is [A-z] does not mean "all letters", it means "all characters between A and z", which includes the characters between Z and a, ie [, \, ], ^, _ and the back tick.
I suspect [A-Z\{\\\]\^_`a-z]+ is your attempt at [A-Za-z]+.
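
The [A-z] pitfall and the effect of anchoring can be checked with a short Python sketch. Google Sheets uses a different regex engine (RE2), so treat this as an illustration of the pattern logic only; the anchored pattern and the extract helper are assumptions made for this example.

```python
import re

# [A-z] covers ASCII 65..122, so it also accepts [ \ ] ^ _ and the backtick.
print(bool(re.fullmatch(r'[A-z]+', 'A_b')))     # True
print(bool(re.fullmatch(r'[A-Za-z]+', 'A_b')))  # False

# Anchored at the start: letters, dash, (letters, dash, digits), dash.
pattern = re.compile(r'^[A-Za-z]+-([A-Za-z]+-[0-9]+)-')

def extract(item):
    # Return the captured "letters-digits" part, or '' when nothing matches.
    m = pattern.search(item)
    return m.group(1) if m else ''

for item in ['AAB-LL-1234-00', 'AAN-1234', 'AAC-1234-LL',
             'AAC-1234-ll-123', 'AAB-2971-PN-B-11-03']:
    print(item, '->', repr(extract(item)))
```

Only AAB-LL-1234-00 yields a value (LL-1234). Without the leading ^, AAB-2971-PN-B-11-03 would match further along the string and return B-11.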

try:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(INDEX(SPLIT(B1:B, "-"),,2)&"", "\D+")&
REGEXEXTRACT(INDEX("-"&SPLIT(B1:B, "-"),,3), "-\d+")))
or:
=ARRAYFORMULA(IFERROR(IF((REGEXMATCH(INDEX(SPLIT(B1:B, "-"),,1), "[A-Za-z]+"))*
(NOT(REGEXMATCH(INDEX(SPLIT(B1:B, "-"),,1), "[0-9]+"))),
IFNA(REGEXEXTRACT(INDEX(SPLIT(B1:B, "-"),,2)&"", "\D+")&
REGEXEXTRACT(INDEX("-"&SPLIT(B1:B, "-"),,3), "-\d+")), )))

Regex match everything after first and until 2nd occurrence of a slash

Need to match everything after the first / and until the 2nd / or end of string. Given the following examples:
/US
/CA
/DE/Special1
/FR/Special 1/special2
Need the following returned:
US
CA
DE
FR
Was using this in DataStudio which worked:
^(.+?)/
However the same in BigQuery is just returning null. After trying dozens of other examples here, decided to ask myself. Thanks for your help.

For such simple extraction - consider alternative of using cheaper string functions instead of more expensive regexp functions. See an example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '/US' line UNION ALL
SELECT '/CA' UNION ALL
SELECT '/DE/Special1' UNION ALL
SELECT '/FR/Special 1/special2'
)
SELECT line, SPLIT(line, '/')[SAFE_OFFSET(1)] value
FROM `project.dataset.table`
with result
Row  line                    value
1    /US                     US
2    /CA                     CA
3    /DE/Special1            DE
4    /FR/Special 1/special2  FR

Your regex matches any 1 or more chars as few as possible at the start of a string (up to the first slash) and puts this value in Group 1. Then it consumes a / char. It does not actually match what you need.
You can use a regex in BigQuery that matches a string partially and capture the part you need to get as a result:
/([^/]+)
It will match the first occurrence of a slash followed by one or more characters other than a slash, and the captured substring is what you get back as the result.
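
A quick way to see the difference between the two patterns is to run them in Python. This is a sketch only: BigQuery uses RE2 rather than Python's re, but both engines treat these simple patterns the same way. The helper name is made up.

```python
import re

lines = ['/US', '/CA', '/DE/Special1', '/FR/Special 1/special2']

def first_segment(line):
    # Find the first slash and capture the run of non-slash characters after it.
    m = re.search(r'/([^/]+)', line)
    return m.group(1) if m else None

print([first_segment(line) for line in lines])  # ['US', 'CA', 'DE', 'FR']

# The original pattern needs at least one character before a slash,
# so it finds nothing in '/US'.
print(re.search(r'^(.+?)/', '/US'))  # None
```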

Regular expression/ Redshift

I have the following data. How do I find the 11th occurrence of ':'? I want to display the information after the 11th occurrence of ':'.
https://www.example.com/rest/1/07/myself/urn:ads:accod:org:pki:71E4/Riken/List:abc:bcbc:hfhhf:ncnnc:shiv:hgh:bvbv:hghg:
I have tried the [^] construct, but it's not working.
select regexp_substr(id,'[:]{5}?.*') from tempnew;

regexp_substr does not care about capture-groups, so counting characters not included in the match is not possible. Counting from the end would work though:
-- Returns the substring after the 6th ':' from the end.
select regexp_substr(id, '([^:]*:){5}[^:]*$') from tempnew
-- If the string does not contain 5 ':', an empty string is returned.
If you need to count from the start, you could use regexp_replace instead:
-- Returns the substring after the 11th ':'
select regexp_replace(id, '^([^:]*:){11}') from tempnew
-- If the string does not contain 11 ':', the whole string is returned.
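
The count-from-the-start approach can be tried out in Python with re.sub, which plays the role of regexp_replace here. This is a sketch of the pattern logic, not Redshift code, and after_nth_colon is a hypothetical helper name.

```python
import re

url = ('https://www.example.com/rest/1/07/myself/urn:ads:accod:org:pki:71E4'
       '/Riken/List:abc:bcbc:hfhhf:ncnnc:shiv:hgh:bvbv:hghg:')

def after_nth_colon(text, n):
    # Delete the first n "non-colons followed by a colon" runs, anchored
    # at the start; whatever is left is the text after the n-th colon.
    return re.sub(r'^(?:[^:]*:){%d}' % n, '', text)

print(after_nth_colon(url, 11))    # shiv:hgh:bvbv:hghg:
print(after_nth_colon('a:b', 11))  # a:b (fewer than 11 colons: unchanged)
```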

see this demo https://regex101.com/r/wR9aU3/1
/^(?:[^:]*\:){11}(.*)$/
or
/^(?:.+\:){11}(.+)$/gm
https://regex101.com/r/oC5yQ6/1

I would split on ":" and use the 12th element (the one that follows the 11th colon).
But if you must use a regex:
^(?:[^:]*:){11}([^:]*)
And use group 1 of the match.

You can use split_part for this purpose:
select split_part(id, ':', 12) from tempnew
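
Note that taking the 12th part returns only the piece between the 11th and 12th colons, not everything after the 11th colon. The Python sketch below shows the difference. Both helpers are hypothetical, and returning an empty string for an out-of-range part is an assumption about how split_part behaves.

```python
url = ('https://www.example.com/rest/1/07/myself/urn:ads:accod:org:pki:71E4'
       '/Riken/List:abc:bcbc:hfhhf:ncnnc:shiv:hgh:bvbv:hghg:')

def split_part(text, delimiter, n):
    # Hypothetical mimic of split_part: 1-based index into the split pieces.
    parts = text.split(delimiter)
    return parts[n - 1] if 0 < n <= len(parts) else ''

def rest_after(text, delimiter, n):
    # Split at most n times so the last piece keeps the untouched remainder.
    parts = text.split(delimiter, n)
    return parts[n] if len(parts) > n else ''

print(split_part(url, ':', 12))  # shiv
print(rest_after(url, ':', 11))  # shiv:hgh:bvbv:hghg:
```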

Comma Separated Numbers Regex

I am trying to validate a comma separated list for numbers 1-8.
i.e. 2,4,6,8,1 is valid input.
I tried [0-8,]* but it seems to accept 1234 as valid. It is not requiring a comma and it is letting me type in a number larger than 8. I am not sure why.

[0-8,]* will match zero or more consecutive instances of 0 through 8 or ,, anywhere in your string. You want something more like this:
^[1-8](,[1-8])*$
^ matches the start of the string, and $ matches the end, ensuring that you're examining the entire string. It will match a single digit, plus zero or more instances of a comma followed by a digit after it.
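
As a quick check, the anchored pattern can be exercised in Python. This is a minimal sketch; fullmatch does the job of ^ and $, and is_valid is a made-up helper name.

```python
import re

# One digit 1-8, then zero or more ",digit" pairs; nothing else allowed.
VALID_LIST = re.compile(r'[1-8](?:,[1-8])*')

def is_valid(text):
    # fullmatch requires the whole string to match, like ^...$ would.
    return VALID_LIST.fullmatch(text) is not None

for sample in ['2,4,6,8,1', '1234', '1,,4', '9', '0,1', '']:
    print(repr(sample), is_valid(sample))

# The unanchored original happily matches inside '1234':
print(re.search(r'[0-8,]*', '1234').group(0))  # 1234
```

Only '2,4,6,8,1' passes; the other samples are rejected.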

/^\d+(,\d+)*$/
for at least one digit, otherwise you will accept 1,,,,,4

[0-9]+(,[0-9]+)+
This works better for me for comma separated numbers in general, like: 1,234,933

You can try with this Regex:
^[1-8](,[1-8])+$

If you are using Python and want to find all matching strings such as
XX,XX,XXX or X,XX,XXX
(for example 12,000 or 1,20,000) with a regex:
import re
string = "I spent 1,20,000 on new project "
re.findall(r'(\b[1-8]*(,[0-9]*[0-9])+\b)', string, re.IGNORECASE)
The result is [('1,20,000', ',000')]: each tuple holds the full match followed by the last repetition of the inner group.

You need a number + comma combination that can repeat:
^[1-8](,[1-8])*$
If you don't need the parentheses to capture anything, add ?: to make the group non-capturing, like so:
^[1-8](?:,[1-8])*$

Regex: find all IN clause with number of arguments greater or equal to

As input I've got a plain SQL query, something like:
select * from (
select * from Table where id in (1,2,3,4,5,6,642,7,8,9)
or another_id in (1,2,3,4,5,6, 34 ,7 , 8,9))
where yet_another_id in (1,2)
I want to find all IN clauses where the number of arguments passed in is greater than or equal to XXX.
So far I've come up with this solution:
^.*\s*+(?:in)+\s*+(\((?:\s*+\d+\s*+\,?+){XXX,}+\){1}).*$
where XXX is the number of arguments.
Obviously, the first part:
^.*
eats all IN clauses except the last one. How can I fix that? Any suggestions on how I can improve the regex?

Try this here
\bin\b\s*(?:\((?:\s*\d+\s*\,?){5,}\))
So I removed some stuff from your expression and fixed an obvious error (\(?: where you escaped the wrong bracket.
The \b is a word boundary.
This is working now for me here on Regexr

You seem to be massively overcomplicating this with + characters all over the place. In engines with possessive quantifiers (Java, PCRE, Python 3.11+), \s*+ is a possessive \s*, and older Python versions reject it as a multiple repeat; either way a plain \s* is sufficient here. Then (?:in)+ means you want to match in or ininininininininin, which doesn't seem right. Likewise, \,?+ is just an optional comma, so \,? is enough.
The real problem however is that after the literal \( you have ?: which isn't following open parentheses so that means \(?: is matching an optional ( followed by a non-optional :. You don't have any colons in the input so no possible matches.
Try something like this:
>>> import re
>>> text = '''select * from (
select * from Table where id in (1,2,3,4,5,6,642,7,8,9)
or another_id in (1,2,3,4,5,6, 34 ,7 , 8,9))
where yet_another_id in (1,2)'''
>>> re.findall(r"(?:in)\s*(\((?:[^),]+\,?){10,}\))", text)
['(1,2,3,4,5,6,642,7,8,9)', '(1,2,3,4,5,6, 34 ,7 , 8,9)']
You may or may not need the extra ^.*? and .*$ around the regex depending on how you are using this.
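
One caveat with the patterns above: because the comma is optional, the engine may split one long number into several one-character "arguments", so in (1234567890) can be counted as ten arguments. The sketch below makes the comma mandatory between arguments and takes the threshold as a parameter. It assumes the arguments are plain integers, and the function name is hypothetical.

```python
import re

def in_clauses_with_at_least(sql, min_args):
    # First argument, then at least (min_args - 1) further ", argument" pairs.
    # Assumes min_args >= 1 and purely numeric arguments.
    pattern = re.compile(
        r'\bin\b\s*(\(\s*\d+(?:\s*,\s*\d+){%d,}\s*\))' % (min_args - 1),
        re.IGNORECASE,
    )
    return pattern.findall(sql)

text = '''select * from (
select * from Table where id in (1,2,3,4,5,6,642,7,8,9)
or another_id in (1,2,3,4,5,6, 34 ,7 , 8,9))
where yet_another_id in (1,2)'''

print(in_clauses_with_at_least(text, 10))
# ['(1,2,3,4,5,6,642,7,8,9)', '(1,2,3,4,5,6, 34 ,7 , 8,9)']
print(in_clauses_with_at_least('x in (1234567890)', 2))  # []
```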
