Extract number (with different delimiters) from string using REGEXP_SUBSTR in plsql - regex

I need to extract patterned numbers from an input string. I have the following patterns:
xxx-xxx-xxxx
xxx xxx-xxxx
xxx xxx xxxx
I am using this query to find matching string:
select REGEXP_substr('phn: 678 987-0987 Date: 12/2029',
'[0-9]{3}(\-|\ |\ )[0-9]{3}(\-|\--|\ )[0-9]{4}')
from dual;
I also want to extract the following patterns:
xxxxxx-xxxx
xxxxxxxxxx
etc...
Where do I modify the query?

Change your regex to:
[0-9]{3}(\-|\ |\ )?[0-9]{3}(\-|\--|\ )?-?[0-9]{4}
(\-|\ |\ )? makes the whole group optional, and -? makes the hyphen optional. A ? after a token makes the preceding token optional, that is, it matches zero or one time.
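To try the optional-separator idea outside Oracle, here is a minimal sketch using Python's re module rather than REGEXP_SUBSTR. The character class [- ] is an assumed simplification of the (\-|\ |\ ) group, and extract_phone is a hypothetical helper name, not something from the original query.

```python
import re

# Three digits, optional separator, three digits, optional separator, four digits.
# [- ]? is an assumed simplification of the (\-|\ |\ )? group from the answer.
PHONE = re.compile(r'[0-9]{3}[- ]?[0-9]{3}[- ]?[0-9]{4}')

def extract_phone(text):
    """Return the first phone-like number in text, or None if there is none."""
    match = PHONE.search(text)
    return match.group(0) if match else None

for sample in ('phn: 678 987-0987 Date: 12/2029',
               'phn: 678987-0987',
               'phn: 6789870987'):
    print(extract_phone(sample))
```

Because both separators are optional, the same pattern covers xxx xxx-xxxx, xxxxxx-xxxx and xxxxxxxxxx without a separate alternative for each layout.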

Regular expressions are not always a good approach, since they are a resource-intensive feature. I would still use the old SUBSTR + INSTR technique:
16777216 * to_number(substr(ip, 1, instr(ip, '.', 1, 1) - 1))
+ 65536 * to_number(substr(ip, instr(ip, '.', 1, 1) + 1, instr(ip, '.', 1, 2) - instr(ip, '.', 1, 1) - 1))
+ 256 * to_number(substr(ip, instr(ip, '.', 1, 2) + 1, instr(ip, '.', 1, 3) - instr(ip, '.', 1, 2) - 1))
+ to_number(substr(ip, instr(ip, '.', 1, 3) + 1))
An IP address is simply a 32-bit (4-byte) integer presented in "dotted quad" format.
Each byte holds a value between 0 and 255,
so converting to a number and using BETWEEN is as efficient as possible.
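The same arithmetic can be sketched in Python to check the weights by hand. The function names ip_to_int and ip_in_range and the sample addresses are illustrative assumptions, not part of the answer above.

```python
def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    a, b, c, d = (int(part) for part in ip.split('.'))
    # Same weights as the SQL expression: 256**3, 256**2, 256**1, 256**0.
    return 16777216 * a + 65536 * b + 256 * c + d

def ip_in_range(ip, low, high):
    """Range check on the integer form, mirroring a SQL BETWEEN."""
    return ip_to_int(low) <= ip_to_int(ip) <= ip_to_int(high)

print(ip_to_int('192.168.1.10'))  # 3232235786
print(ip_in_range('192.168.1.10', '192.168.0.0', '192.168.255.255'))  # True
```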

Related

Regex match character after lazy grouping

I want to match specifically the comma "," after the two groups ( AS) and (.*?).
I have a positive lookbehind that skips the AS, but I can't get the grouping to skip the lazy wildcard group.
Regex:
(?<= AS)(.*?)(,)
Sample text
SELECT LEFT(CustomerCode, 5) AS SMSiteCode, SUBSTRING(CustomerCode, 6, LEN(CustomerCode) - 5) AS SMCustCode, SUBSTRING(AgreeNo, 6, LEN(AgreeNo) - 5)
AS SMAgreeNo, CAST(SeqNo AS int) AS SeqNo, SUBSTRING(TrxDate, 7, 2) + SUBSTRING(TrxDate, 4, 2) + SUBSTRING(TrxDate, 1, 2) AS TrxDate, TrxTime,
CAST(Charge AS bit) AS Charge, CASE WHEN LEN(AnalysisCode) > 5 THEN SUBSTRING(AnalysisCode, 6, LEN(AnalysisCode) - 5)
ELSE AnalysisCode END AS AnalysisCode, CAST(ISNULL(Description, N'') AS nvarchar(100)) AS Description, CAST(TaxAmt AS money) AS TaxAmt,
CAST(TotAmt AS money) AS TotAmt, CAST(Match AS bigint) AS Match, CAST(Confirmed AS bit) AS Confirmed, CAST(Balance AS money) AS Balance,
CAST(QtyBal AS money) AS QtyBal, CAST(ISNULL(Drawer, N'') AS nvarchar(50)) AS Drawer, SUBSTRING(DateBanked, 7, 2) + SUBSTRING(DateBanked, 4, 2)
+ SUBSTRING(DateBanked, 1, 2) AS DateBanked, CAST(ISNULL(BankBranch, N'') AS nvarchar(50)) AS BankBranch, CAST(Qty AS float) AS Qty, CAST(ISNULL(Narration,
N'') AS nvarchar(100)) AS Narration, SUBSTRING(DateFrom, 7, 2) + SUBSTRING(DateFrom, 4, 2) + SUBSTRING(DateFrom, 1, 2) AS DateFrom, SUBSTRING(DateTo, 7, 2)
+ SUBSTRING(DateTo, 4, 2) + SUBSTRING(DateTo, 1, 2) AS DateTo, CAST(PrintNarration AS bit) AS PrintNarration, CAST(DiscAmt AS float) AS DiscAmt,
CAST(ISNULL(CCAuthNo, N'') AS nvarchar(20)) AS CCAuthNo, CAST(ISNULL(CCTransID, N'') AS nvarchar(20)) AS CCTransID, CAST(UserLogin AS nvarchar(20))
AS UserLogin, CAST(Reconciled AS bit) AS Reconciled, SUBSTRING(DateReconciled, 7, 2) + SUBSTRING(DateReconciled, 4, 2) + SUBSTRING(DateReconciled, 1, 2)
AS DateReconciled, CAST(PrimaryKey AS bigint) AS PrimaryKey, SUBSTRING(InvDate, 7, 2) + SUBSTRING(InvDate, 4, 2) + SUBSTRING(InvDate, 1, 2) AS InvDate,
CAST(InvNo AS int) AS InvNo FROM SomeDatabase.dbo.tblTransaction WHERE IsDate(trxTime) = 1
You could try \K, but make sure to change the flavor in RegExr from JavaScript to PCRE (top right of the screen).
\K is defined as:
Sets the given position in the regex as the new "start" of the match. This means that nothing preceding the \K will be captured in the overall match.
With \K, you could try something like this:
(?<= AS).*?\K(,)
Example: https://regex101.com/r/X3AdbH/1/
If \K is supported, you could get your matches without a lookbehind or a capturing group by matching AS and using a negated character class to match any character except a comma.
AS [^,]+\K,
Explanation
" AS " - Match a space, AS, and a space
[^,]+ - Match any character except a comma, 1 or more times
\K, - Forget what was matched so far, then match a comma
I'm guessing that your expression is just fine; you may just want to limit the first capturing group to specific characters, for example:
(?<= AS)([A-Za-z\d\s]+)(,)
The expression is explained in the top right panel of regex101.com if you wish to explore, simplify, or modify it; there you can also watch how it matches against sample inputs.
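For engines without \K (Python's standard re module is one), the comma can still be located by capturing it and reading the position of the group. This is only a sketch: the shortened sample query and the name alias_comma_positions are assumptions made for illustration.

```python
import re

# " AS ", then one or more non-comma characters, then the comma we want (group 1).
AFTER_ALIAS = re.compile(r' AS [^,]+(,)')

def alias_comma_positions(sql):
    """Return the index of each comma that follows an ' AS <alias>' run."""
    return [m.start(1) for m in AFTER_ALIAS.finditer(sql)]

sample = 'SELECT a AS x, b AS y, c FROM t'
print(alias_comma_positions(sample))  # [13, 21]
```

Reading m.start(1) instead of m.start() plays the role of \K here: the text before the comma is matched but not reported.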

Scala: Tokenizing simple arithmetic expressions

How can I split 23+3*5 or 2 + 3*5 into a list List("23", "+", "3", "*", "5")?
I tried things like split and splitAt, but nothing gave the desired result.
I want it to split at the arithmetic operators.
Try something like
"2 + 4 - 3 * 5 / 7 / 3".split("(?=[+/*-])|(?<=[+/*-])").map(_.trim)
In this particular example, it gives you:
Array(2, +, 4, -, 3, *, 5, /, 7, /, 3)
The (?= ) are lookaheads, (?<= ) are lookbehinds. Essentially, it cuts the string before and after every operator. Note that - in [+/*-] is at the last position: otherwise it's interpreted as a character range (e.g. [a-z]).
I suggest matching on what you want as tokens.
e.g.
"\\d+|[-+*/]".r.findAllIn(" 23 + 4 * 5 / 7").toList
// List(23, +, 4, *, 5, /, 7)
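The token-matching approach carries over to other regex engines unchanged. As a rough sketch in Python (the name tokenize is hypothetical), applied to both inputs from the question:

```python
import re

# A token is either a run of digits or a single arithmetic operator.
# The hyphen comes first in the class so it is read literally, not as a range.
TOKEN = re.compile(r'\d+|[-+*/]')

def tokenize(expr):
    """Return the numbers and operators in expr, ignoring whitespace."""
    return TOKEN.findall(expr)

print(tokenize('23+3*5'))   # ['23', '+', '3', '*', '5']
print(tokenize('2 + 3*5'))  # ['2', '+', '3', '*', '5']
```

Note that anything that is neither a digit nor one of the four operators is silently skipped, so this sketch does not detect malformed input.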

get integers from string

I have data like this in the database:
61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28
A part like 10/16 (without #) is invalid and should not be used in the calculation;
all other parts have the format min_hr + "/" + min_hrv + "#" + max_hr + "/" + max_hrv.
The task is to get the average value with this pseudo-formula: [ sum(all(min_hrv)) + sum(all(max_hrv)) ] / count(all(min_hrv) + all(max_hrv)). For the example string the result will be ((10 + 12 + 28 + 23) + (12 + 33 + 34 + 28)) / 8 == 22 with integer division (22.5 exactly).
What I tried:
SELECT regexp_replace(
'61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28',
',\d+/\d+,', ',',
'g'
);
to remove the invalid data, but 10/16 is still in the string. The result is:
regexp_replace
--------------------------------------------------
61/10#61/12,10/16,0/12#61/33,0/28#0/34,0/23#0/28
If I manage to clean the string properly, my plan is to split it into an array like this. This works for max (not a full solution, since it yields an empty string); I have no solution for min:
SELECT
regexp_split_to_array(
regexp_replace(
'61/10#61/12,0/12,0/12#61/33,0/28#0/34,0/23#0/28',
',\d+/\d+,', ',',
'g'
)
,',?\d+/\d+#\d+/'
);
result is:
regexp_split_to_array
-----------------------
{"",12,33,34,28}
and then calculate the data, something like this:
SELECT ((
SELECT sum(tmin.unnest)
FROM
(SELECT unnest('{10,12,28,23}'::int[])) as tmin
)
+
(
SELECT sum(tmax.unnest)
FROM
(SELECT unnest('{12,33,34,28}'::int[])) as tmax
))
/
(SELECT array_length('{12,33,34,28}'::int[], 1) * 2)
Maybe someone knows a simpler and more correct way to solve this?
Use regexp_matches():
select (regexp_matches(
'61/10#61/12,0/12,0/12#61/33,0/28#0/34,0/23#0/28',
'\d+#\d+/(\d+)',
'g'))[1]
regexp_matches
----------------
12
33
34
28
(4 rows)
The whole calculation may look like this:
with my_data(str) as (
values
('61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28')
),
min_max as (
select
(regexp_matches(str, '(\d+)#\d+', 'g'))[1] as min_hrv,
(regexp_matches(str, '\d+#\d+/(\d+)', 'g'))[1] as max_hrv
from my_data
)
select avg(min_hrv::int+ max_hrv::int) / 2 as result
from min_max;
result
---------------------
22.5000000000000000
(1 row)
The pattern you are looking for should match a #, a streak of digits, and a / char, and then capture the digits that follow. With regexp_matches, you can extract only part of the match by wrapping that part in a pair of parentheses.
The solution is
regexp_matches(your_col, '#\d+/(\d+)', 'g')
Note that g stands for global, meaning that all occurrences found in the string will be returned.
Pattern details
# - a # char
\d+ - 1 or more (+) digits
/ - a / char
(\d+) - Capturing group 1: 1 or more digits
You may extract specific bits from your data if you use a single pair of parentheses in different parts of the '(\d+)/(\d+)#(\d+)/(\d+)' regex. To extract min_hr, you'd use '(\d+)/\d+#\d+/\d+'.
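To sanity-check the expected value independently of PostgreSQL, the four-group pattern can be run in Python. This is a sketch under the assumption that only parts containing # are valid; average_hrv is a hypothetical name.

```python
import re

# One valid part: min_hr/min_hrv#max_hr/max_hrv. Parts without '#' never match.
PART = re.compile(r'(\d+)/(\d+)#(\d+)/(\d+)')

def average_hrv(data):
    """Average of all min_hrv and max_hrv values in the valid parts of data."""
    values = []
    for _min_hr, min_hrv, _max_hr, max_hrv in PART.findall(data):
        values += [int(min_hrv), int(max_hrv)]
    return sum(values) / len(values) if values else None

data = '61/10#61/12,0/12,10/16,0/21,0/12#61/33,0/28#0/34,0/23#0/28'
print(average_hrv(data))  # 22.5
```

Matching whole valid parts makes the separate clean-up step unnecessary, because 0/12, 10/16 and 0/21 never satisfy the pattern.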

Regex Queries for Lines

I am trying to figure out a couple regular expressions for the below cases:
Lines with a length divisible by n, but not by m for integers n and m
Lines that do not contain a certain number n of a given character, but may contain more or less
I am a newcomer and would appreciate any clarification on these.
I've used JavaScript for my examples.
For the first one, the key is to note that 'multiples' are just repeats. So /(...)+/ matches 3 characters and then repeats that match as many times as it can. Each repeat doesn't need to be the same set of 3 characters, but the repeats do need to be consecutive. Suitable anchoring with ^ and $ ensures you're checking the exact string length, and a negative lookahead, (?!...), can be used to negate.
e.g. Multiple of 5 but not 3:
/^(?!(.{3})+$)(.{5})+$/gm
Note that JavaScript uses / to mark the beginning and end of the expression, and g and m are modifiers for global, multiline matching. I wasn't clear what you meant by matching 'lines', so I've assumed the string itself contains newline characters that must be taken into account. If you had, say, an array of lines and could check each one individually, things would get slightly easier, or a lot easier in the case of your second question.
Demo for the first question:
var chars = '12345678901234567890',
str = '';
for (var i = 1 ; i <= chars.length ; ++i) {
str += chars.slice(0, i) + '\n';
}
console.log('Full text is:');
console.log(str);
console.log('Lines with length that is a multiple of 2 but not 3:');
console.log(lineLength(str, 2, 3));
console.log('Lines with length that is a multiple of 3 but not 2:');
console.log(lineLength(str, 3, 2));
console.log('Lines with length that is a multiple of 5 but not 3:');
console.log(lineLength(str, 5, 3));
function lineLength(str, multiple, notMultiple) {
return str.match(new RegExp('^(?!(.{' + notMultiple + '})+$)(.{' + multiple + '})+$', 'gm'));
}
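The same pattern works in other engines that support lookahead. As a cross-check, here is a minimal Python sketch; it returns line lengths rather than the lines themselves to keep the output short, and the name line_lengths is an assumption.

```python
import re

def line_lengths(text, multiple, not_multiple):
    """Lengths of lines whose length is a multiple of `multiple` but not of `not_multiple`."""
    pattern = re.compile(
        r'^(?!(?:.{%d})+$)(?:.{%d})+$' % (not_multiple, multiple),
        re.MULTILINE,
    )
    # No capturing groups, so findall returns each whole matching line.
    return [len(line) for line in pattern.findall(text)]

chars = '12345678901234567890'
text = '\n'.join(chars[:i] for i in range(1, len(chars) + 1))

print(line_lengths(text, 2, 3))  # [2, 4, 8, 10, 14, 16, 20]
print(line_lengths(text, 5, 3))  # [5, 10, 20]
```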
For the second question I couldn't come up with a nice way to do it. This horror show is what I ended up with. It wasn't too bad to match n occurrences of a particular character in a line but matching 'not n' proved difficult. I ended up matching {0,n-1} or {n+1,} but the whole thing doesn't feel so great to me. I suspect there's a cleverer way to do it that I'm currently not seeing.
var str = 'a\naa\naaa\naaaa\nab\nabab\nababab\nabababab\nba\nbab\nbaba\nbbabbabb';
console.log('Full string:');
console.log(str);
console.log('Lines with 1 occurrence of a:');
console.log(mOccurrences(str, 'a', 1));
console.log('Lines with 2 occurrences of a:');
console.log(mOccurrences(str, 'a', 2));
console.log('Lines with 3 occurrences of a:');
console.log(mOccurrences(str, 'a', 3));
console.log('Lines with not 1 occurrence of a:');
console.log(notMOccurrences(str, 'a', 1));
console.log('Lines with not 2 occurrences of a:');
console.log(notMOccurrences(str, 'a', 2));
console.log('Lines with not 3 occurrences of a:');
console.log(notMOccurrences(str, 'a', 3));
function mOccurrences(str, character, m) {
return str.match(new RegExp('^[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + m + '}[^' + character + '\n]*$', 'gm'));
}
function notMOccurrences(str, character, m) {
return str.match(new RegExp('^([^' + character + '\n]*(' + character + '[^' + character + '\n]*){0,' + (m - 1) + '}[^' + character + '\n]*|[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + (m + 1) + ',}[^' + character + '\n]*)$', 'gm'));
}
The key to how that works is that it tries to find n occurrences of a separated by sequences of [^a], with \n thrown in to stop it walking onto the next line.
In a real world scenario I would probably do the splitting into lines first as that makes things much easier. Counting the number of occurrences of a particular character is then just:
str.replace(/[^a]/g, '').length;
// Could use this instead, but note that in JS match returns null when there are no matches, so this throws if the count is 0
str.match(/a/g).length;
Again, this assumes a JavaScript environment. If you were using regexes in an environment where you could literally only pass in the regex as an argument then it's back to my earlier horror show.
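The split-first approach can be sketched in Python as well. It uses a plain str.count instead of a regex, and the name lines_without_count is hypothetical.

```python
def lines_without_count(text, char, n):
    """Lines of text in which char does not occur exactly n times."""
    return [line for line in text.splitlines() if line.count(char) != n]

text = 'a\naa\naaa\naaaa\nab\nabab\nababab\nabababab\nba\nbab\nbaba\nbbabbabb'
print(lines_without_count(text, 'a', 2))
# ['a', 'aaa', 'aaaa', 'ab', 'ababab', 'abababab', 'ba', 'bab']
```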

how to get out string oracle regex

I have the following string and am trying to extract 1111111 and 33333333333 without the | character:
SELECT regexp_substr('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|','*[|]*[|][0-9]*')FROM dual
Using REGEXP_REPLACE may be a bit simpler;
SELECT REGEXP_REPLACE('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|',
'^([^|]*[|]){1}([^|]*).*$', '\2') FROM dual;
> 1111111
SELECT REGEXP_REPLACE('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|',
'^([^|]*[|]){3}([^|]*).*$', '\2') FROM dual;
> 33333333333
You can choose the column by choosing how many pipes to skip in the {1} part.
A short explanation of the regexp;
^([^|]*[|]){3} -- Matches 3 groups of {optional characters}{pipe} from the start of the string
([^|]*) -- Matches the next field (the one we want)
.*$ -- Matches the rest of the string
What we want is the second parenthesized group, that is, we replace the whole string by the back reference \2.
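The skip-then-capture idea can be tried outside Oracle too. Here is a minimal Python sketch; nth_field is a hypothetical helper, and it returns the captured group directly instead of replacing the whole string with a back reference.

```python
import re

def nth_field(s, skip):
    """Skip `skip` pipe-terminated fields, then return the next field (or None)."""
    m = re.match(r'(?:[^|]*\|){%d}([^|]*)' % skip, s)
    return m.group(1) if m else None

s = '7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|'
print(nth_field(s, 1))  # 1111111
print(nth_field(s, 3))  # 33333333333
```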
Because the "|" separators are always present, it's simpler to extract fields with a plain substring function than with regular expressions.
Just find the positions of the corresponding separators in the source string and extract the content between them:
with test_data as (
select
'7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC' as s,
8 as field_number -- test 1, 3, 8, 10 and 16
from dual
)
select
field_number,
substr(
s,
decode( field_number,
1,1,
instr(s,'|',1,field_number - 1) + 1
),
(
decode( instr(s,'|',1,field_number),
0, length(s)+ 1,
instr(s,'|',1,field_number)
)
-
decode( field_number,
1, 1,
instr(s,'|',1,field_number - 1) + 1
)
)
) as field_value
from
test_data
This variant works with empty fields, non-numeric fields and so on.
A possible simplification is to append additional separators to the start and the end of the string:
with test_data as (
select
(
'|' ||
'7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC' ||
'|'
) as s, -- additional separators appended before and after original string
10 as field_number -- test 1, 3, 8, 10 and 16
from dual
)
select
field_number,
substr(
s,
instr(s, '|', 1, field_number) + 1,
(
instr(s, '|', 1, field_number + 1)
-
(instr(s, '|', 1, field_number) + 1)
)
) as field_value
from
test_data
;
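The separator-position technique with padded separators can be sketched in Python to check the expected values for the test field numbers. The name field_at and its None result for out-of-range fields are assumptions for illustration.

```python
def field_at(s, field_number, sep='|'):
    """Return the 1-based field of s, located by separator positions."""
    padded = sep + s + sep          # sentinel separators, as in the SQL variant
    start = -1
    for _ in range(field_number):   # find the field_number-th separator
        start = padded.find(sep, start + 1)
        if start == -1:
            return None
    end = padded.find(sep, start + 1)
    return padded[start + 1:end] if end != -1 else None

s = '7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC'
for n in (1, 3, 8, 10, 16):
    print(n, repr(field_at(s, n)))
```

With the sentinels in place, the first and last fields need no special case, which is the point of the simplified SQL query.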