How to limit characters when using regexp_extract in hive?

I have a fixed-length string from which I need to extract portions as fields: the first 5 characters into ACCOUNT1, the next 2 characters into ACCOUNT2, and so on.
I would like to use regexp_extract (not substring), but I am missing something: my attempts return nothing.
select regexp_extract('47t7916A2088M040323','(.*){0,5}',1) as ACCOUNT1,
regexp_extract('47t7916A2088M040323','(.*){6,8}',1) as ACCOUNT2 --and so on

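If you want to use a regexp, do it as in the example below. In your attempts the quantifier is attached to the group (.*), so {0,5} repeats "any run of characters" up to 5 times rather than limiting the match to 5 characters, and {6,8} is a repetition count, not a range of positions. For Account1, the expression '^(.{5})' means: ^ is the beginning of the string, followed by capturing group 1 (in round brackets) consisting of any 5 characters; {5} is a quantifier meaning exactly 5 times. For Account2, add a second capturing group after the first: (.{2}) means any two characters.
The second regexp in this example contains two groups (for the first and second columns), and we extract the second group.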
hive> select regexp_extract('47t7916A2088M040323','^(.{5})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2;
OK
47t79 16
Time taken: 0.064 seconds, Fetched: 1 row(s)
You can also use one regexp containing groups for all the columns and extract a different capturing group for each column:
hive> select regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',1) as Account1,
> regexp_extract('47t7916A2088M040323','^(.{5})(.{2})',2) as Account2
> ;
OK
47t79 16
Time taken: 1.043 seconds, Fetched: 1 row(s)
Add one more group for each additional column. This approach works only for fixed-length columns. To parse a delimited string, put the delimiter characters between the groups, change each group to match everything except the delimiter, and remove or adjust the quantifiers. For simple cases like these, substring (or split, for a delimited string) is simpler and cleaner; regexps pay off when the patterns are complex. Hope you get the idea.
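For anyone who wants to try the group idea outside Hive, here is a minimal sketch in Python. It assumes Python's re module behaves like Hive's regexp engine for these simple patterns (the two engines are not identical). The helper extract_group and the comma-delimited sample row are made up for illustration.

```python
import re

def extract_group(text, pattern, group):
    # Hypothetical helper: capture group of the first match, or None if no match.
    m = re.search(pattern, text)
    return m.group(group) if m else None

s = '47t7916A2088M040323'
fixed = r'^(.{5})(.{2})'            # 5 characters, then 2 characters
account1 = extract_group(s, fixed, 1)
account2 = extract_group(s, fixed, 2)

# Delimited variant: each group matches "anything except the delimiter",
# and the delimiter itself sits between the groups.
row = 'abc,de,f'                     # made-up sample
delimited = r'^([^,]*),([^,]*)'
first = extract_group(row, delimited, 1)
second = extract_group(row, delimited, 2)

print(account1, account2, first, second)
```

The same pattern is reused for both columns; only the group number changes.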

Related

Finding nth occurrence of a pattern within a string in SQL (Presto)

I am writing a query in Presto SQL using the function regexp_extract
I have a string that may look like the following examples:
'1A2B2C3D3E'
'1A1B2C2D3E'
'1A2B1C2D2E'
What I'm trying to do is find for example the second occurrence of 1[A-E].
If I try
regexp_extract(col, '(1[A-E])(1[A-E])', 2)
This works for the second example (and for the first, where it returns nothing because there is no second occurrence). However, it fails for the third example: it returns nothing. I know that is because my regex looks for a 1[A-E] followed directly by another 1[A-E].
So then I tried
regexp_extract(col, '(1[A-E])(.*)(1[A-E])', 3)
But this does not work either. I am not sure how to account for the fact that the string may look like 1A1B2C or 1A2B1C when looking for that second 1. Any help?

Your second pattern does work in the latest version of Trino (formerly known as Presto SQL):
WITH t(col) AS (
VALUES
'1A2B2C3D3E',
'1A1B2C2D3E',
'1A2B1C2D2E')
SELECT regexp_extract(col, '(1[A-E])(.*)(1[A-E])', 3)
FROM t
_col0
-------
NULL
1B
1C
(3 rows)
As others have commented, you don't need capture groups for the first match or for the .*, and you should use a lazy quantifier so that .* does not greedily match everything between the first and the last occurrence:
WITH t(col) AS (
VALUES
'1A2B2C3D3E',
'1A1B2C2D3E',
'1A2B1C2D2E',
'1A2B1C2D1E')
SELECT regexp_extract(col, '1[A-E].*?(1[A-E])', 1)
FROM t
_col0
-------
NULL
1B
1C
1C
(4 rows)
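The difference between the greedy and the lazy quantifier can be reproduced with a short Python sketch. This assumes Python's re module treats .* and .*? the same way Trino does for these patterns; the helper name second_occurrence is made up.

```python
import re

rows = ['1A2B2C3D3E', '1A1B2C2D3E', '1A2B1C2D2E', '1A2B1C2D1E']

def second_occurrence(text, pattern):
    # Hypothetical helper: group 1 of the first match, or None if no match.
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Greedy .* runs to the end and backtracks, so it finds the LAST 1[A-E].
greedy = [second_occurrence(r, r'1[A-E].*(1[A-E])') for r in rows]
# Lazy .*? stops as early as possible, so it finds the SECOND 1[A-E].
lazy = [second_occurrence(r, r'1[A-E].*?(1[A-E])') for r in rows]

print(greedy)
print(lazy)
```

Only the last row, which has three occurrences, tells the two patterns apart: greedy returns 1E there, lazy returns 1C.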

You don't need to capture the (.*) part: it is enough to keep the two 1[A-E] capture groups, and between them you can match just the allowed characters.
From what I read on this page you might also consider using regexp_extract_all to get all the matches, as regexp_extract returns only the first match.
As the example data consists of a digit followed by a character A-E, you could leave the 1 out of the character class to prevent overmatching and backtracking.
(1[A-E])[02-9A-E]*(1[A-E])
If using a single capture group to get the second value is also ok, you can use
1[A-E][02-9A-E]*(1[A-E])
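Both ideas from this answer can be tried in Python. In this sketch re.findall plays the role of regexp_extract_all (an assumption: it is only an analogue, not the Trino function), and the helper name nth_occurrence is made up.

```python
import re

rows = ['1A2B2C3D3E', '1A1B2C2D3E', '1A2B1C2D2E', '1A2B1C2D1E']

def nth_occurrence(text, n, pattern=r'1[A-E]'):
    # Collect every match, then pick the n-th one (1-based), if it exists.
    matches = re.findall(pattern, text)
    return matches[n - 1] if len(matches) >= n else None

via_all_matches = [nth_occurrence(r, 2) for r in rows]

# Character-class variant: the part in between may not contain a 1,
# so the capture group can only land on the second occurrence.
second = re.compile(r'1[A-E][02-9A-E]*(1[A-E])')
via_char_class = [m.group(1) if (m := second.search(r)) else None for r in rows]

print(via_all_matches)
print(via_char_class)
```

Collecting all matches generalises to the third or fourth occurrence without touching the pattern.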

Reg exp search in notes/comments/description data in PostgreSQL 10.7

I have a scenario that I am not able to solve in version 10.7. I have a data column holding notes/comments/descriptions, and I need to find a regexp pattern inside that data.
For example, data in the column: The SSN number is 760-56-6289
In the above data, 760-56-6289 is the actual SSN, which I need to find across all schemas/tables/columns using the defined regexp pattern. There can be text before or after the actual SSN value.
Could you please let me know how to achieve this in PostgreSQL 10.7?
Please let me know if you need more information.

SELECT
(regexp_matches(mycolumn, '^.*([\d]{3}-[\d]{2}-[\d]{4}).*$'))[1]
FROM mytable
The regex means:
start of text: ^
arbitrary number of characters: .*
group containing your number: (...)
3 digit characters: [\d]{3}
a literal - character
2 digits: [\d]{2}
a literal - character
4 digits: [\d]{4}
arbitrary number of characters: .*
end of text: $
regexp_matches() returns the captured groups as an array. Since there is only one group, the array contains only one value: your number, which you can get with the index [1].
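Here is a small Python sketch of the same pattern, assuming Python's re module is close enough to PostgreSQL's regex engine for this case. The second note, with two numbers in it, is made up to show one consequence of the greedy leading .*: when several candidates are present, the anchored pattern captures the last one.

```python
import re

note = 'The SSN number is 760-56-6289'
anchored = r'^.*(\d{3}-\d{2}-\d{4}).*$'

ssn = re.search(anchored, note).group(1)

# Made-up note with two candidates: the greedy ^.* swallows the first one.
two = 'old 111-11-1111, new 222-22-2222'
last_only = re.search(anchored, two).group(1)

# Without the anchors and the .* parts, every candidate can be listed.
all_candidates = re.findall(r'\d{3}-\d{2}-\d{4}', two)

print(ssn, last_only, all_candidates)
```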

SQL Regex Pattern, How to match only a specific variable between two characters? (see Sample Output)

I have these inputs:
John/Bean/4000-M100
John/4000-M100
John/4000
How can I get just the 4000? Note that the 4000 will change from time to time (it can be 3000 or 2000). How can I handle that with a regex pattern?
Here's what I have so far. It satisfies John/4000-M100 and John/4000, but the input with two slashes doesn't match the way I need:
REGEXP_REPLACE(REGEXP_SUBSTR(a.demand,'/(.*)-|/(.*)',1,1),'-|/','')

You can use this query to get the results you want:
select regexp_replace(data, '^.*/(\d{4})[^/]*$', '\1')
from test
The regex looks for a set of 4 digits following a / and then not followed by another / before the end of the line and replaces the entire content of the string with those 4 digits.
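The replacement can be checked against the three sample inputs with a short Python sketch. Python's re.sub stands in for regexp_replace here (an assumption; the pattern itself is unchanged).

```python
import re

samples = ['John/Bean/4000-M100', 'John/4000-M100', 'John/4000']

# The whole string is matched and replaced by the captured 4 digits.
pattern = r'^.*/(\d{4})[^/]*$'
results = [re.sub(pattern, r'\1', s) for s in samples]

# A string that does not match is returned unchanged by re.sub.
untouched = re.sub(pattern, r'\1', 'John')

print(results, untouched)
```

Because the pattern must match the whole string, an input without the expected shape comes back as-is rather than empty, which is worth checking for in real data.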

This would also work, unless you specifically need a digit followed by three zeros (this pattern accepts any four digits). See it in action here, for as long as it lives: http://sqlfiddle.com/#!4/23656/5
create table test_table
( data varchar2(200));
insert into test_table values('John/Bean/4000-M100');
insert into test_table values('John/4000-M100');
insert into test_table values('John/4000');
select a.*,
replace(REGEXP_SUBSTR(a.data,'/\d{4}'), '/', '')
from test_table a;

The following will match any multiple of 1000 less than 10000 when it is preceded by a slash:
\/[1-9]0{3}
To match any four-digit number that is preceded by a slash and not followed by another digit, such as 4031 in
Sal_AS_180763852/4200009751_S5_154552/4031
try:
\/\d{3}(?:(?:\d[^\d])|(?:\d$))
https://regex101.com/r/Am34WO/1
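Here is a Python sketch of what that pattern matches. The lookahead variant in the second half is my own assumption, not part of the answer, and it only helps on engines that support lookaheads.

```python
import re

text = 'Sal_AS_180763852/4200009751_S5_154552/4031'

# Pattern from the answer: the match includes the slash (and, when the
# number is not at the end, the one non-digit character that follows it).
original = re.search(r'\/\d{3}(?:(?:\d[^\d])|(?:\d$))', text).group(0)

# Assumed alternative: capture only the digits and use a negative
# lookahead so the following character is not consumed.
with_lookahead = re.search(r'/(\d{4})(?!\d)', text).group(1)
mid_string = re.search(r'/(\d{4})(?!\d)', 'John/4000-M100').group(1)

print(original, with_lookahead, mid_string)
```

Note that /4200009751 is skipped in both cases because the four digits are followed by more digits.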

Regex, select the number after a particular string

I have the below string:
rollover#7500,another1#3000,another2#4000, another1#7000
I need to extract the number that comes directly after rollover#
So far, I have this, but it's matching rollover#7500
(?:rollover#[0-9]*)
I'm not sure how to extract only the numbers?
I will be running this in a Hive query

You may use
regexp_extract(your_col,'rollover#([0-9]+)', 1)
The rollover#([0-9]+) pattern finds rollover# and then captures 1 or more digits into Group 1; the third argument, 1, makes regexp_extract return just the Group 1 value.
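The difference between the whole match and Group 1 is easy to see in a small Python sketch (Python's re is used here only as a stand-in for Hive's regexp_extract):

```python
import re

s = 'rollover#7500,another1#3000,another2#4000, another1#7000'
m = re.search(r'rollover#([0-9]+)', s)

whole_match = m.group(0)   # what the non-capturing attempt returned
digits_only = m.group(1)   # what group index 1 selects

print(whole_match, digits_only)
```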

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.

Oracle's regex does not work the way you intended. You could split the string and find the distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and also for strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
  SELECT DISTINCT TO_NUMBER(column_value) AS n
  FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
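The split, distinct, sort, join idea behind this query does not depend on regexes at all. A rough Python analogue (not Oracle code, just the same four steps) looks like this:

```python
raw = '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19'

# Split on the delimiter, convert to numbers, drop duplicates with a set,
# sort numerically, and join back with the same delimiter.
distinct_numbers = sorted({int(part) for part in raw.split('.')})
deduplicated = '.'.join(str(n) for n in distinct_numbers)

print(deduplicated)
```

Since the values are compared as whole tokens, the 11 versus 1.1 confusion cannot happen, and the input does not even need to be sorted.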

Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the older [[:<:]] and [[:>:]] are available.
But on this specific set you can use:
(\d+\.)(\1)+
Note: you forgot to escape the dot.

Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally a 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence of digits would be to use:
a lookbehind for either the start of the string or a dot,
then a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either the start of the string or a dot (group 1), instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits that group 2 caught.
)+ - End of group 3; it may occur multiple times.
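For illustration, the lookaround idea described above can be sketched in Python, whose re module does support lookarounds. This is not something known to run in Oracle's REGEXP_REPLACE. The lookahead here also accepts the end of the string, which is my own small extension. The sketch first replays the original pattern to show where the 115 comes from.

```python
import re

raw = '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19'

# Original pattern: the unescaped "." matches any character, and nothing
# forces a match to start or end at a delimiter.
broken = re.sub(r'([^.]+)(.[ ]*\1)+', r'\1', raw)

# Lookaround version: a value must be preceded by start-of-string or a dot
# and followed by a dot or end-of-string.
dedup = re.compile(r'(?<![^.])(\d+)(?:\.\1)+(?![^.])')
fixed = dedup.sub(r'\1', raw)

print(broken)
print(fixed)
```

The double negatives (?<![^.]) and (?![^.]) read as "no non-dot character on this side", which covers both the dot and the string boundary in one assertion.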

Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties was not escaping the period. In addition, the 115 in the resulting string can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match within 11.15 as follows (the same happens with the period escaped, as in this query):
SELECT
  '11.15' val,
  regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
  dual;

VAL   DEDUPLICATED
11.15 115
Here is a similar approach that addresses those problems.
Matching pattern composition:
- Look for a run of zero or more non-period characters (subexpression 1, referenced by \1).
'19' matches ([^.]*)
- Look for the repeats, which form subexpression 2, referenced by \2.
'19.19.19' matches ([^.]*)([.]\1)+
- Look for either a period or the end of the string (subexpression 3, referenced by \3). This prevents '11.15' from being collapsed to '115'.
([.]|$)
Replacement string:
Replace the match with the first instance of the non-period run, followed by whatever subexpression 3 matched.
\1\3
Solution:
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
  SELECT
    '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
  FROM
    dual
  UNION ALL
  SELECT
    '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
  FROM
    dual
  UNION ALL
  SELECT
    '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
  FROM
    dual
) SELECT
  val,
  regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
  tst;

VAL                                              DEDUPLICATE
------------------------------------------------ ---------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19               1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19         1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
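The same pattern and replacement can be replayed in Python as a quick check. This assumes Python's re.sub handles these backreferences the way Oracle's REGEXP_REPLACE does (the Oracle output above is the authoritative one). The helper name deduplicate is made up.

```python
import re

PATTERN = r'([^.]*)([.]\1)+([.]|$)'

def deduplicate(val):
    # \1 is the first copy of the value; \3 restores the trailing period, if any.
    return re.sub(PATTERN, r'\1\3', val)

samples = [
    '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19',
    '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19',
    '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19',
]
results = [deduplicate(s) for s in samples]

for s, r in zip(samples, results):
    print(s, '->', r)
```

The trailing ([.]|$) is what stops 11.15 from being treated as a repeated 1: after matching 1.1 the next character is 5, which is neither a period nor the end of the string.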
