Regular Expression - Not matching certain characters and position of characters - regex

SOLUTION:
Finally solved it using the regex provided by Gary_W below and a simple PowerShell command that uses the replacement function discussed there, so there was no need to use the built-in regex activity in the software we use. Here's the PowerShell:
"100,000.00" -replace "([,.]\d{2}$)|[,.]",""
Regular expressions are freaking me out; I cannot get used to their logic. I think my current problem is quite a simple one, but I cannot make it work :(
So here's what I want to achieve:
I want the regex to match only the digits before the last two decimal places.
Thus, it must ignore every "." and "," and always drop the last two digits.
> Examples:
> 1.000.000,00 --> 1000000
> 123,456.00 --> 123456
> 100.000,00 --> 100000
> 10.000,00 --> 10000
> 10,000.00 --> 10000
> 1.000,00 --> 1000
> 100,00 --> 100
> 99.88 --> 99
> 99,88 --> 99
> 1,23 --> 1
> ...
Any ideas how to get this working?

Here's how I would do it in Oracle, for what it's worth. Maybe the regex used here will give you an idea. Read the regex as "match a comma or a period followed by 2 digits at the end of the line, OR any comma or period", and replace each match with nothing.
Note that the alternative for the decimal places at the end needs to come first in the regex. Otherwise the single separator character is matched first, so the two digits after it are no longer part of any match and are left in place.
SQL> with tbl(str) as (
select '1.000.000,00' from dual union all
select '123,456.00' from dual union all
select '100.000,00' from dual union all
select '10.000,00' from dual union all
select '10,000.00' from dual union all
select '1.000,00' from dual union all
select '100,00' from dual union all
select '99.88' from dual union all
select '99,88' from dual union all
select '1,23' from dual union all
select '3' from dual
)
select str,
regexp_replace(str, '([,.]\d{2}$)|[,.]') fixed
from tbl;
STR FIXED
------------ ------------
1.000.000,00 1000000
123,456.00 123456
100.000,00 100000
10.000,00 10000
10,000.00 10000
1.000,00 1000
100,00 100
99.88 99
99,88 99
1,23 1
3 3
11 rows selected.
SQL>
Just saw the regexr link, plugging in my regex looks like it works with the global flag. The characters you wish to remove are highlighted.
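For readers without an Oracle or PowerShell session at hand, here is a minimal sketch of the same alternation in Python. The function name `integer_part` and the `WRONG_ORDER` pattern are made up for illustration, and Python's `re` module is not the Oracle engine, although this particular pattern behaves the same way in both.

```python
import re

# Same alternation as in the answer: the anchored branch
# "separator + 2 digits at end of string" is tried before the bare separator.
STRIP = re.compile(r"([,.]\d{2}$)|[,.]")

# Branches swapped, for comparison only: the bare separator now wins,
# so the two trailing digits are never removed.
WRONG_ORDER = re.compile(r"[,.]|([,.]\d{2}$)")


def integer_part(text: str) -> str:
    """Drop separators and a trailing two-digit decimal part."""
    return STRIP.sub("", text)


if __name__ == "__main__":
    for sample in ["1.000.000,00", "123,456.00", "99.88", "1,23", "3"]:
        print(sample, "->", integer_part(sample))
    print("wrong order:", WRONG_ORDER.sub("", "99.88"))
```

The last line shows the effect described above: with the branches swapped, `99.88` keeps its decimal digits.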

In which language/with which tool? With sed, you can do:
sed 's/\(.*\)[\.,]../\1/;s/[\.,]//g'
In perl it's similar, just without the backslashes before the parentheses:
perl -pe 's/(.*)[\.,]../\1/;s/[\.,]//g'
This is done with two regexes, by the way. The first one reads "save all that you can, up to a dot or a comma followed by two chars, and then replace the whole match with that". The second one reads "replace all dots and commas with nothing", that is, "remove all dots and commas".
In regexr.com you can use "Replace" in Tools to replace the match with the first capture group. Just put (.*)[\.,].. in Expression, and $1 in Replace, to see the first regex working. Then you can do something similar with the second one, as regexr doesn't support chaining of expressions, as far as I can see.
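The two chained substitutions can be sketched in Python as follows. This is only an illustration: the function name `integer_part_two_step` is hypothetical, and `count=1` is used to mirror sed's non-global `s///`. As in the sed version, the first pattern is not anchored, so the sketch assumes every input really ends in a separator plus two characters.

```python
import re


def integer_part_two_step(text: str) -> str:
    # Step 1: the greedy (.*) keeps everything up to the last separator
    # that is followed by two characters; the match is replaced by group 1.
    kept = re.sub(r"(.*)[.,]..", r"\1", text, count=1)
    # Step 2: remove every remaining dot and comma.
    return re.sub(r"[.,]", "", kept)


if __name__ == "__main__":
    for sample in ["1.000.000,00", "123,456.00", "100,00", "99.88"]:
        print(sample, "->", integer_part_two_step(sample))
```

An input without a decimal part, such as `1.000`, would lose digits in step 1, which is one reason to prefer the single anchored pattern when the data is not uniform.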

Related

SQL Regex Pattern, How to match only a specific variable between two characters? (see Sample Output)

I have this inputs:
John/Bean/4000-M100
John/4000-M100
John/4000
How can I extract just the 4000? Note that this value changes from time to time (it can be 3000 or 2000), so how can I handle that with a regex pattern?
Here's my attempt so far. It satisfies John/4000-M100 and John/4000, but the input with two slashes (John/Bean/4000-M100) does not meet the match requirements of my regex:
REGEXP_REPLACE(REGEXP_SUBSTR(a.demand,'/(.*)-|/(.*)',1,1),'-|/','')
You can use this query to get the results you want:
select regexp_replace(data, '^.*/(\d{4})[^/]*$', '\1')
from test
The regex looks for a set of 4 digits following a / and then not followed by another / before the end of the line and replaces the entire content of the string with those 4 digits.
Demo on dbfiddle
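The same pattern can be tried outside the database. The sketch below uses Python's `re.sub` as a stand-in for `regexp_replace`; the helper name `extract_code` is an assumption for illustration.

```python
import re

# Same pattern as the query: four digits after the last "/" of the string.
LAST_FOUR_DIGITS = re.compile(r"^.*/(\d{4})[^/]*$")


def extract_code(demand: str) -> str:
    # Like regexp_replace, re.sub returns the input unchanged
    # when the pattern does not match.
    return LAST_FOUR_DIGITS.sub(r"\1", demand)


if __name__ == "__main__":
    for sample in ["John/Bean/4000-M100", "John/4000-M100", "John/4000"]:
        print(sample, "->", extract_code(sample))
```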
This would also work, unless you specifically need to match only a digit followed by three zeros. See it in action here, for as long as it lives, http://sqlfiddle.com/#!4/23656/5
create table test_table
( data varchar2(200));
insert into test_table values('John/Bean/4000-M100');
insert into test_table values('John/4000-M100');
insert into test_table values('John/4000');
select a.*,
replace(REGEXP_SUBSTR(a.data,'/\d{4}'), '/', '')
from test_table a;
The following will match any multiple of 1000 less than 10000 when it's preceded by a slash:
\/[1-9]0{3}
To match any four-digit number preceded by a slash and not followed by another digit, such as 4031 in
Sal_AS_180763852/4200009751_S5_154552/4031
try:
\/\d{3}(?:(?:\d[^\d])|(?:\d$))
https://regex101.com/r/Am34WO/1
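A quick way to see what the second pattern actually captures is to run it in Python. Note that the alternation consumes the character after the fourth digit when there is one. The `LOOKAHEAD` variant below is an assumption and not part of the answer: it only works in engines that support lookaheads, which rules out Oracle.

```python
import re

# Pattern from the answer: slash, four digits, then a non-digit or end of string.
ORIGINAL = re.compile(r"/\d{3}(?:(?:\d[^\d])|(?:\d$))")

# Assumed variant: capture the digits and use a negative lookahead,
# so that nothing after the fourth digit is consumed.
LOOKAHEAD = re.compile(r"/(\d{4})(?!\d)")

text = "Sal_AS_180763852/4200009751_S5_154552/4031"

if __name__ == "__main__":
    print(ORIGINAL.search(text).group(0))
    print(LOOKAHEAD.search(text).group(1))
    print(ORIGINAL.search("John/4000-M100").group(0))
```

With the original pattern the match on `John/4000-M100` includes the trailing hyphen, so the slash and that extra character still have to be stripped afterwards.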

Regex match numbers of at least 6 digits that only have up to 3 digits different from 0

EDIT
I need to identify all numbers of at least 6 digits and maximum 25 digits having only 1 to 3 digits that are different from 0.
Examples: 000123, 0103040000, 10320000, 70000000, 12000009000
I was trying something like this:
regexp_like(number, '[1-9]\d{1,3}') AND regexp_like(number,'(0){5,24}')
(it's ok to use more than one regular expression)
But this also matches numbers like:
0046700000031, 00394000007: these should not match, because they have more than 3 digits other than 0. A number must match only if it has at least 1 and at most 3 digits other than 0.
I'm using Oracle 12C.
SOLUTION
Here is an alternative I've found, which seems to work but I presume only in Oracle.
SELECT NUMBER
FROM TABLE t
WHERE LENGTH(NUMBER) > 5 HAVING(regexp_count(NUMBER, '0') > 2
AND regexp_count(NUMBER, '[1-9]') BETWEEN 1 AND 3)
GROUP BY NUMBER
Thanks
You cannot use a single regex to do what you want in Oracle 12C, because the regex engine is POSIX based and does not support lookarounds (neither lookbehinds nor lookaheads). You need one pattern to check the format of the string, plus the regular LENGTH function to check its length.
Here is a full demo:
WITH testdata(txt) AS (
SELECT '000123' from dual
UNION
SELECT '0103040000' from dual
UNION
SELECT '10320000' from dual
UNION
SELECT '70000000' from dual
UNION
SELECT '12000009000' from dual
UNION
SELECT '0046700000031' from dual
UNION
SELECT '00394000007' from dual
)
SELECT * FROM testdata WHERE REGEXP_LIKE(txt, '^(0*[1-9]){1,3}0*$') AND LENGTH(txt) > 5 AND LENGTH(txt) < 26
See the regex demo. Details:
^ - start of string
(0*[1-9]){1,3} - one, two or three repetitions of
0* - zero or more zeros
[1-9] - a non-zero digit
0* - 0+ zeros
$ - end of string.
See the Oracle demo online.
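The pattern-plus-length idea can be checked against the question's samples with a short Python sketch. The function name `qualifies` is hypothetical; `fullmatch` plays the role of the `^` and `$` anchors.

```python
import re

# One to three groups of "optional zeros + one non-zero digit",
# followed by optional trailing zeros.
ONE_TO_THREE_NONZERO = re.compile(r"(0*[1-9]){1,3}0*")


def qualifies(number: str) -> bool:
    """6 to 25 digits, of which 1 to 3 are non-zero."""
    if not 6 <= len(number) <= 25:
        return False
    return ONE_TO_THREE_NONZERO.fullmatch(number) is not None


if __name__ == "__main__":
    samples = ["000123", "0103040000", "10320000", "70000000",
               "12000009000", "0046700000031", "00394000007"]
    for sample in samples:
        print(sample, qualifies(sample))
```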
One option would be to use a positive lookahead to check for at least one non-zero digit, and a negative lookahead to rule out four or more:
^(?=.*[1-9])(?!.*[1-9].*[1-9].*[1-9].*[1-9])[0-9]{6,25}$
In a database such as Postgres, we could try the following query:
SELECT *
FROM yourTable
WHERE number ~ '^(?=.*[1-9])(?!.*[1-9].*[1-9].*[1-9].*[1-9])[0-9]{6,25}$';
Using a database like SQL Server, which does not directly support regex but has some regex-like capability in LIKE, we could try:
WHERE LEN(number) BETWEEN 6 AND 25 AND -- 6 to 25 characters
number LIKE '%[1-9]%' AND -- at least 1 non zero digit
number NOT LIKE '%[1-9]%[1-9]%[1-9]%[1-9]%' AND -- at most 3 non zero digits
number NOT LIKE '%[^0-9]%'; -- digits only
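Python's `re` module supports lookaheads, so the pattern can be tried there as well. The sketch below also includes a plain counting function that mirrors the four LIKE conditions, as a cross-check. Both function names are assumptions for illustration.

```python
import re

LOOKAHEAD = re.compile(
    r"^(?=.*[1-9])(?!.*[1-9].*[1-9].*[1-9].*[1-9])[0-9]{6,25}$"
)


def by_lookahead(number: str) -> bool:
    return LOOKAHEAD.match(number) is not None


def by_counting(number: str) -> bool:
    # Mirrors the LIKE conditions: length, digits only, 1 to 3 non-zero digits.
    digits_only = all(ch in "0123456789" for ch in number)
    nonzero = sum(ch in "123456789" for ch in number)
    return 6 <= len(number) <= 25 and digits_only and 1 <= nonzero <= 3


if __name__ == "__main__":
    for sample in ["000123", "12000009000", "0046700000031", "00394000007"]:
        print(sample, by_lookahead(sample), by_counting(sample))
```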
Try this pattern (?=^(0*[1-9]){1,3}0*$)\d{6,25}.
Explanation: it uses a lookahead to verify that what follows contains one to three non-zero digits, each optionally preceded by zeros: (?=^(0*[1-9]){1,3}0*$).
Demo

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's regex engine does not work the way you intended. You could split the string and find the distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and, with proper quoting, also for strings.
SELECT LISTAGG(n, '.') WITHIN
GROUP (
ORDER BY n
) AS n
FROM (
SELECT DISTINCT TO_NUMBER(column_value) AS n
FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Demo
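The split, deduplicate and re-aggregate steps of that query look like this outside the database. It is a minimal Python sketch; the function name `distinct_sorted` is made up for illustration.

```python
def distinct_sorted(values: str, sep: str = ".") -> str:
    """Split on the delimiter, keep distinct numbers, and join them in order."""
    # The set plays the role of SELECT DISTINCT TO_NUMBER(...).
    numbers = {int(token) for token in values.split(sep)}
    # sorted() plus join corresponds to LISTAGG ... ORDER BY n.
    return sep.join(str(n) for n in sorted(numbers))


if __name__ == "__main__":
    print(distinct_sorted("1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"))
```

Because the values are compared as whole numbers rather than as characters, this approach does not depend on the input being sorted.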
Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the ancient [[:<:]] or [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape the dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence
of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote
the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either the start of the string or a dot (group 1), instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits that group 2 caught.
)+ - End of group 3, which may occur multiple times.
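Both the original pattern and this one can be compared side by side in Python, as a rough sketch. The answer does not state a replacement string, so the `\1\2` used here is an assumption. Python's engine is also not Oracle's, although both patterns use only features common to the two.

```python
import re

raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"

# Pattern from the question: the unescaped "." matches any character,
# so the second "1" of 11 pairs up with the first "1" of 15.
buggy = re.sub(r"([^.]+)(.[ ]*\1)+", r"\1", raw)

# Pattern from this answer. Group 1 (start or dot) is put back together
# with the first copy of the digits (group 2).
fixed = re.sub(r"(^|\.)(\d+)(\.(\2))+", r"\1\2", raw)

if __name__ == "__main__":
    print(buggy)
    print(fixed)
```

One caveat of the sketch: the pattern has no boundary after the last repeat, so an input such as `1.11` would still be collapsed to `11`. It does not occur in the sample data.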
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties was not escaping the period. In addition, the 115 in the resulting string can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follows:
SCOTT#DB>SELECT
'11.15' val,
regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
dual;
VAL DEDUPLICATED
11.15 115
Here is a similar approach that addresses those problems:
Matching pattern composition
- Look for a non-period matching list of length 0 to N (this subexpression is referenced by \1). For example, '19' matches:
([^.]*)
- Look for the repeats, which form the second subexpression, referenced by \2. For example, '19.19.19' matches:
([^.]*)([.]\1)+
- Look for either a period or the end of the string. This subexpression is referenced by \3, and it is what stops '11.15' from being turned into '115':
([.]|$)
Replacement string
I replace the match with a replacement string composed of the first instance of the non-period matching list, followed by whatever the third subexpression matched:
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
SCOTT#db>WITH tst AS (
SELECT
'1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
FROM
dual
UNION ALL
SELECT
'1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
FROM
dual
UNION ALL
SELECT
'1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
FROM
dual
) SELECT
val,
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
tst;
VAL DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19 1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
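For a quick check outside SQL*Plus, the same pattern and replacement string can be run in Python. This is a sketch with an assumed helper name, `deduplicate`. Like the original, it relies on equal values being adjacent in the input.

```python
import re

# Same pattern as the regexp_replace call: a value, one or more repeats of
# ".value", then either a period or the end of the string.
DEDUP = re.compile(r"([^.]*)([.]\1)+([.]|$)")


def deduplicate(val: str) -> str:
    # Keep one copy of the value (\1) plus the trailing period, if any (\3).
    return DEDUP.sub(r"\1\3", val)


if __name__ == "__main__":
    for sample in [
        "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19",
        "1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19",
        "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19",
    ]:
        print(deduplicate(sample))
```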

How to replace a single number from a comma separated string in oracle using regular expressions?

I have the following set of data where I need to replace the number 41 with another number.
column1
41,46
31,41,48,55,58,121,122
31,60,41
41
We can see four conditions here:
41,
41
,41,
,41
I have written the following query
REGEXP_replace(column1,'^41$|^41,|,41,|,41$','xx')
where xx is the replacement number.
This query also replaces the commas, which is not what I want.
Example: 41,46 is replaced with xx46, but the expected output is xx,46. Please note that there are no spaces between the commas and the numbers.
Can somebody help me write the right regex?
Assuming the string is comma separated, you can concatenate a comma on each side and then use replace with trim to do the replacement. No regex is needed, and it is better avoided here, since a regex solution is likely to be slower.
with t (column1) as (
select '41,46' from dual union all
select '31,41,48,55,58,121,122' from dual union all
select '31,60,41' from dual union all
select '41' from dual
)
-- Test data setup. Actual solution is below: --
select
column1,
trim(',' from replace(','||column1||',', ',41,', ',17,')) as replaced
from t;
Output:
COLUMN1 REPLACED
41,46 17,46
31,41,48,55,58,121,122 31,17,48,55,58,121,122
31,60,41 31,60,17
41 17
4 rows selected.
Also, it's worth noting that comma-separated strings are not the right way to store data. Normalization is your friend.
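The pad-replace-trim trick translates directly to Python string methods. In the sketch below, both function names are hypothetical. The split-based variant is an addition, not part of the answer; it is included because the padded version skips every second occurrence when the same value appears twice in a row (the shared comma is consumed by the first replacement).

```python
def replace_token_padded(csv: str, old: str, new: str) -> str:
    # Wrap the list in commas so that every element looks like ",value,",
    # replace the delimited value, then trim the extra commas again.
    padded = f",{csv},"
    return padded.replace(f",{old},", f",{new},").strip(",")


def replace_token_split(csv: str, old: str, new: str) -> str:
    # Assumed alternative: compare whole elements after splitting.
    return ",".join(new if token == old else token for token in csv.split(","))


if __name__ == "__main__":
    for row in ["41,46", "31,41,48,55,58,121,122", "31,60,41", "41"]:
        print(row, "->", replace_token_padded(row, "41", "17"))
```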

Remove prefix and suffix with oracle regex

I have some records that come with a prefix and a suffix (either of which may or may not be present).
I'm trying to figure out the REGEXP_REPLACE that will always return the value without them.
My attempt so far is:
with teste as (
select '+123145#domain.com' as num from dual
union
select '0054321#domain.com' as num from dual
union
select '006789' as num from dual
union
select '+9876' as num from dual
union
select '13579#domain.com' as num from dual
union
select '123456789' as num from dual
)
select REGEXP_REPLACE(num,'^00(.*)\#.*$|^\+(.*)\#.*$','\1') from teste
but it is not quite there.
The output of that should be:
num
12345
54321
6789
9876
13579
123456789
Try this one here
REGEXP_REPLACE(num,'^(00|\+)?(\d*)(\#.*)?$','\2')
See it here online at Regexr
I am not sure what Oracle's regex engine is able to do. A critical point could be \d, meaning a digit; if it is not working, replace \d with [0-9].
(?:) are non-capturing groups: the pattern inside is not stored in a capturing group, so with them you could always replace with the first capturing group \1. The pattern above uses plain capturing groups instead, so the digits end up in group 2 and the replacement is \2.
I also changed your alternatives using | into optional parts using ? after the groups. The two bracketed groups in your "OR" meant that the result was sometimes in group 1 (when the first alternative matched) and sometimes in group 2 (when the second alternative matched).
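To see the optional-group version at work on the sample rows, here is a small Python sketch. The name `bare_number` is made up, and the `#` separator is taken literally from the sample data; the digits are read from group 2, as in the pattern above.

```python
import re

# Optional prefix (00 or +), the digits, and an optional "#..." suffix.
STRIP_AFFIXES = re.compile(r"^(00|\+)?(\d*)(#.*)?$")


def bare_number(num: str) -> str:
    # Group 2 always takes part in the match, so \2 is safe to reference.
    return STRIP_AFFIXES.sub(r"\2", num)


if __name__ == "__main__":
    for sample in ["+123145#domain.com", "0054321#domain.com", "006789",
                   "+9876", "13579#domain.com", "123456789"]:
        print(sample, "->", bare_number(sample))
```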