Regex, select the number after a particular string - regex

I have the below string:
rollover#7500,another1#3000,another2#4000, another1#7000
I need to extract the number that comes directly after rollover#
So far, I have this, but it's matching rollover#7500
(?:rollover#[0-9]*)
I'm not sure how to extract only the numbers?
I will be running this in a Hive query

You may use
regexp_extract(your_col,'rollover#([0-9]+)', 1)
The rollover#([0-9]+) pattern finds rollover# and then captures one or more digits into Group 1. The third argument, 1, makes regexp_extract return only the Group 1 value.
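To try the pattern outside Hive, the same capture-group idea can be sketched with Python's re module. This is only an illustration: extract_rollover is a hypothetical helper name, and returning None on no match is a choice made here, not a statement about what Hive returns.

```python
import re

def extract_rollover(text):
    """Return the digits that directly follow 'rollover#', or None if absent.

    Hypothetical helper for illustration; it mirrors the pattern from the
    answer, not Hive's own implementation.
    """
    match = re.search(r"rollover#([0-9]+)", text)
    # group(1) is the text captured by the parentheses, without 'rollover#'
    return match.group(1) if match else None

sample = "rollover#7500,another1#3000,another2#4000, another1#7000"
print(extract_rollover(sample))  # prints 7500
```

Group 0 would be the whole match (rollover#7500), which is what the pattern in the question returned.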

Related

Finding nth occurrence of a pattern within a string in SQL (Presto)

I am writing a query in Presto SQL using the function regexp_extract
I have a string that may look like the following examples:
'1A2B2C3D3E'
'1A1B2C2D3E'
'1A2B1C2D2E'
What I'm trying to do is find for example the second occurrence of 1[A-E].
If I try
regexp_extract(col, '(1[A-E])(1[A-E])', 2)
This works for the second example (and for the first, since it returns nothing because there is no second occurrence). However, it fails for the third example and returns nothing. I know that is because my regex searches for a 1[A-E] followed directly by another 1[A-E].
So then I tried
regexp_extract(col, '(1[A-E])(.*)(1[A-E])', 3)
But this does not work either. I am not sure how I can account for the fact that I may have 1A1B2C or 1A2B1C to find that second 1. Any help?
Your second pattern does work in the latest version of Trino (formerly known as Presto SQL):
WITH t(col) AS (
VALUES
'1A2B2C3D3E',
'1A1B2C2D3E',
'1A2B1C2D2E')
SELECT regexp_extract(col, '(1[A-E])(.*)(1[A-E])', 3)
FROM t
_col0
-------
NULL
1B
1C
(3 rows)
As others have commented, you don't need the capture groups for the first match or for the .*, and you should use the lazy quantifier to avoid .* eagerly matching all characters between the first and last occurrence:
WITH t(col) AS (
VALUES
'1A2B2C3D3E',
'1A1B2C2D3E',
'1A2B1C2D2E',
'1A2B1C2D1E')
SELECT regexp_extract(col, '1[A-E].*?(1[A-E])', 1)
FROM t
_col0
-------
NULL
1B
1C
1C
(4 rows)
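The difference between the greedy and the lazy quantifier can be checked with a small Python sketch. It uses Python's re module rather than Trino, and second_match is a hypothetical helper name; the two patterns differ only in .* versus .*?.

```python
import re

# Greedy: .* runs to the end and backtracks, so the group lands on the LAST 1[A-E]
GREEDY = re.compile(r"1[A-E].*(1[A-E])")
# Lazy: .*? expands one character at a time, so the group lands on the SECOND 1[A-E]
LAZY = re.compile(r"1[A-E].*?(1[A-E])")

def second_match(pattern, text):
    """Return capture group 1 of the first match, or None (hypothetical helper)."""
    m = pattern.search(text)
    return m.group(1) if m else None

for s in ["1A2B2C3D3E", "1A1B2C2D3E", "1A2B1C2D2E", "1A2B1C2D1E"]:
    print(s, second_match(GREEDY, s), second_match(LAZY, s))
```

Only the last sample string distinguishes the two: the greedy pattern yields 1E there, while the lazy one yields 1C.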
You can drop the middle capture group (.*) so that only the two 1[A-E] groups remain in the result, and instead match the allowed characters in between, zero or more times.
From what I read in the documentation, you might also consider using regexp_extract_all to get all the matches, since regexp_extract returns only the first match.
As the example data consists of a digit followed by a char A-E, you could exclude matching the 1 from the character class to prevent overmatching and backtracking.
(1[A-E])[02-9A-E]*(1[A-E])
If using a single capture group to get the second value is also ok, you can use
1[A-E][02-9A-E]*(1[A-E])
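The "collect all matches, then pick one" idea can be sketched in Python with re.findall, which plays a role loosely similar to regexp_extract_all. The nth_occurrence helper is a hypothetical name, and the sketch assumes the data really is a sequence of digit-plus-letter pairs as in the question.

```python
import re

def nth_occurrence(text, n, pattern=r"1[A-E]"):
    """Return the n-th (1-based) non-overlapping match of pattern, or None.

    Hypothetical helper; it assumes input made of digit+letter pairs.
    """
    matches = re.findall(pattern, text)
    return matches[n - 1] if len(matches) >= n else None

# The single-group pattern from the answer: the class [02-9A-E] excludes '1'
SECOND = re.compile(r"1[A-E][02-9A-E]*(1[A-E])")

def second_via_class(text):
    m = SECOND.search(text)
    return m.group(1) if m else None

print(nth_occurrence("1A2B1C2D2E", 2))   # prints 1C
print(second_via_class("1A2B1C2D1E"))    # prints 1C
```

Because the character class cannot consume a 1, the engine has no way to skip past the second occurrence, so no lazy quantifier is needed.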

How to limit characters when using regexp_extract in hive?

I have a fixed length string in which I need to extract portions as fields. First 5 characters to ACCOUNT1, next 2 characters to ACCOUNT2 and so on.
I would like to use regexp_extract (not substring) but I am missing the point. They return nothing.
select regexp_extract('47t7916A2088M040323','(.*){0,5}',1) as ACCOUNT1,
regexp_extract('47t7916A2088M040323','(.*){6,8}',1) as ACCOUNT2 --and so on
If you want to use a regexp, use it as in this example. For Account1, the expression '^(.{5})' means: ^ is the beginning of the string, followed by capturing group 1 consisting of any 5 characters (the group is in round brackets). {5} is a quantifier meaning exactly 5 times. For Account2, capturing group 2 comes after group 1; (.{2}) means any two characters.
In this example the second regexp contains two groups (for the first and second columns), and we extract the second group.
hive> select regexp_extract('47t7916A2088M040323','^(.{5})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2;
OK
47t79 16
Time taken: 0.064 seconds, Fetched: 1 row(s)
Actually you can use the same regexp containing groups for all columns, extracting different capturing groups.
Example using the same regexp and extracting different groups:
hive> select regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2
> ;
OK
47t79 16
Time taken: 1.043 seconds, Fetched: 1 row(s)
Add more groups for each column. This approach works only for fixed-length columns. To parse a delimited string, put the delimiter characters between the groups, modify the groups to match everything except the delimiters, and remove or modify the quantifiers. For this example, substring (or split for a delimited string) looks much simpler and cleaner, but a regexp lets you parse very complex patterns. Hope you get the idea.
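The one-regexp, many-groups approach can be sketched in Python as follows. The names split_fixed and split_delimited are hypothetical, and the comma-delimited sample string is made up for illustration; only the fixed-length sample comes from the question.

```python
import re

# Fixed-length layout: first 5 characters, then the next 2 (as in the answer)
FIXED = re.compile(r"^(.{5})(.{2})")

# Delimited variant (assumption: comma delimiter); each group matches
# everything except the delimiter, and the fixed quantifiers are gone
DELIMITED = re.compile(r"^([^,]*),([^,]*)")

def split_fixed(record):
    m = FIXED.match(record)
    return m.groups() if m else None

def split_delimited(record):
    m = DELIMITED.match(record)
    return m.groups() if m else None

print(split_fixed("47t7916A2088M040323"))   # prints ('47t79', '16')
print(split_delimited("47t79,16,A2088"))    # prints ('47t79', '16')
```

Each additional column would be one more group appended to the pattern.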

REGEX : Extract group of number where digits are more than 3

Hi, I have a question regarding regex.
This sounds very simple and I remember doing it but somehow it got deleted and I am finding it hard to get it back.
I want to extract group of numbers from one line.
If the count of digits > 3 - select that.
EG:
ga3rdparty/phpMyAdmin/i0ndex.php?&t0oken=abf540063shakk
This line can be different every time, but there will be only 1 group of digits with more than 2 digits.
OUTPUT: 540063
Thank you in advance
You can use \d{3,} where 3 is the minimum number of digits. You can take a look at the following Python code:
import re

# Sample line from the question
var = "ga3rdparty/phpMyAdmin/i0ndex.php?&t0oken=abf540063shakk"
# \d{3,} matches runs of three or more consecutive digits
pattern = re.compile(r'\d{3,}')
for match in pattern.findall(var):
    print(match)

How can I get a list of regex matches for a group?

I have a group which can occur any number of times in the input string. I need to get a list of all the matching items.
For example, for input:
example repeattext 1 anything here repeattext 2 anything repeattext 3
My regex is:
(repeattext \d)
I want to get the list of matches for the group. Is it possible to use regex here or do I need to parse it myself?
Yes, you can use regex here. Your existing regex will do fine.
See http://rubular.com/r/fS8c9C61rG for it in use on your example.
If numbers will ever become 10 or higher, consider this regex:
(repeattext \d+)
              ^
              |
              `- matches 1 or more repetitions of the preceding \d
Use
result = subject.scan(/repeattext \d+/)
=> ["repeattext 1", "repeattext 2", "repeattext 3"]
See the docs for the .scan() method.
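For readers not using Ruby, a rough Python counterpart of .scan() is re.findall. The sketch below is an illustration only; all_matches and all_numbers are hypothetical helper names.

```python
import re

text = "example repeattext 1 anything here repeattext 2 anything repeattext 3"

def all_matches(s):
    """Every non-overlapping occurrence of the whole pattern."""
    return re.findall(r"repeattext \d+", s)

def all_numbers(s):
    """With one capture group, findall returns only the captured part."""
    return re.findall(r"repeattext (\d+)", s)

print(all_matches(text))  # prints ['repeattext 1', 'repeattext 2', 'repeattext 3']
print(all_numbers(text))  # prints ['1', '2', '3']
```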

What is wrong with this Regular Expression?

I am a beginner and have some problems with regexp.
The input text is: something idUser=123654; nick="Tom" something
I need to extract the value of idUser -> 123654
I tried this:
//idUser is already 8 digits number
MatchCollection matchsID = Regex.Matches(pk.html, @"\bidUser=(\w{8})\b");
Text = matchsID[1].Value;
but in the output I get idUser=123654; I need only the number.
The second problem is with nick="Tom": how can I get only the text Tom from this expression?
You don't show your output code, where you get the group from your match collection.
Hint: you will need group 1 and not group 0 if you want to have only what is in the parentheses.
.*?idUser=([0-9]+).*?
That regex should work for you :o)
Here's a pattern that should work:
\bidUser=(\d{3,8})\b|\bnick="(\w+)"
Given the input string:
something idUser=123654; nick="Tom" something
This yields 2 matches (as seen on rubular.com):
First match is idUser=123654, group 1 captures 123654
Second match is nick="Tom", group 2 captures Tom
Some variations:
In .NET regex, you can also use named groups for better readability.
If nick always appears after idUser, you can match the two at once instead of using alternation as above.
I've used {3,8} repetition to show how to match at least 3 and at most 8 digits.
API links
Match.Groups property
This is how you get what individual groups captured in a match
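The alternation pattern and the named-group variation can be sketched in Python. Note that Python writes named groups as (?P<name>...), whereas .NET uses (?<name>...). The group names user_id and nick and the helper parse_fields are hypothetical choices for this illustration.

```python
import re

# Same alternation as above, with named groups (Python syntax: (?P<name>...))
PATTERN = re.compile(r'\bidUser=(?P<user_id>\d{3,8})\b|\bnick="(?P<nick>\w+)"')

def parse_fields(text):
    """Collect whichever named group each match filled (hypothetical helper)."""
    found = {}
    for m in PATTERN.finditer(text):
        for name, value in m.groupdict().items():
            # In an alternation, the branch that did not match leaves None
            if value is not None:
                found[name] = value
    return found

line = 'something idUser=123654; nick="Tom" something'
print(parse_fields(line))  # prints {'user_id': '123654', 'nick': 'Tom'}
```

Reading the group rather than the whole match is what drops the idUser= prefix and the quotes around Tom.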
Use look-around
(?<=idUser=)\d{1,8}(?=(;|$))
To fix the number of digits at exactly 6, use (?<=idUser=)\d{6}(?=($|;))
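The look-around version can be tried in Python as well, since the look-behind here has a fixed width. This is a sketch using Python's re module, not .NET, and user_id_lookaround is a hypothetical helper name.

```python
import re

# Look-behind requires 'idUser=' just before; look-ahead requires ';' or end of string.
# Neither is part of the match, so group 0 is already just the digits.
LOOKAROUND = re.compile(r"(?<=idUser=)\d{1,8}(?=;|$)")

def user_id_lookaround(text):
    m = LOOKAROUND.search(text)
    return m.group(0) if m else None

print(user_id_lookaround('something idUser=123654; nick="Tom" something'))  # prints 123654
```

With this approach no capture group is needed, because the surrounding text is asserted but never consumed.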