REGEXP_LIKE not selecting correct result - regex

I am trying to select values from a varchar column that start with a single + character and have only digits after it. I have used the REGEXP_LIKE operator, but it also returns values containing special characters.
Expected Correct value:
+369
+6589445
+5896552
Wrong:
693
+4534dfgfgf#
+3435435*%
I tried,
SELECT Column FROM Table WHERE REGEXP_LIKE(Column , '^[+][0-9]');

To select values starting with + and then 1 or more digits, use
^[+][0-9]+$
         ^^
The $ will force the end-of-string boundary and + will allow matching 1 or more occurrences of the construct the plus quantifies (the [0-9] character class).
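As a rough illustration of the difference between the two patterns, here is a minimal sketch using Python's re module rather than Oracle (an assumption made only so the example is runnable; the sample values are the ones from the question). re.search is used because REGEXP_LIKE also succeeds on a partial match.

```python
import re

# Sample values taken from the question.
values = ["+369", "+6589445", "+5896552", "693", "+4534dfgfgf#", "+3435435*%"]

loose = re.compile(r"^[+][0-9]")     # pattern from the question: only checks the first two characters
strict = re.compile(r"^[+][0-9]+$")  # pattern from the answer: digits all the way to the end

loose_hits = [v for v in values if loose.search(v)]
strict_hits = [v for v in values if strict.search(v)]

print(loose_hits)   # also contains the two unwanted values
print(strict_hits)  # only the three expected values
```

The unanchored pattern accepts anything after the first digit, which is why the values with trailing letters and symbols slipped through.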

Related

matching numbers after nth occurrence of a certain symbol in a line

I'm not sure if using regex is the correct way to go about this here, but I wanted to try solving this with regex first (if it's possible)
I have an EDIFACT file, where the data (in bold) in certain fields in some segments needs to be substituted (with different dates, same format)
UNA:+,? '
UNB+UNOC:3+000000000+000000000+20190801:1115+00001+DDMP190001'
UNH+00001+BRKE:01+00+0'
INV+ED Format 1+Brustkrebs+19880117+E000000001+**20080702**+++1+0'
FAL+087897044+0000000++name+000000000+0+**20080702**++1+++J+N+N+N+N+N+++0'
INL+181095200+385762115+++0'
BEE+20080702++++0'
BAA+++J+J++++++J+++++++J++0'
BBA++++++++J++++++J+J++++++J+++++J+++J+J++++++++J+0'
BHP+J+++++J+++++J+++++0'
BLA+++J+++++++++0'
BFA++++++++++++J++0'
BSA++J+++J+J+++0'
BAT+20190801+0'
DAT+**20080702**++++0'
UNT+000014+00001'
UNZ+00001+00001'
At first I was able to match those fields using a positive lookahead and a lookbehind (I had different expressions for matching each date).
Here, for example, is the expression I initially used to match the date in the "FAL" segment: (?<=\+[\d]{1}\+)\d{8}(?=\+\+). But then I saw that this date is sometimes preceded by 9 digits and sometimes by 1 (depending on the version), and followed by either ++ or a + and a date, so I added a logical OR like this: (?<=\+[\d]{9}\+|\+[\d]{1}\+)\d{8}(?=\+[\d]{8}\+|\+\+) and quickly realized it's not sustainable, because these EDIFACT files vary (far beyond only either 9 or 1 digits).
(I have 6 versions for each type, and I have 6 types in total.)
Because I have a scheme/map indicating how each version should be built, and I know at what position (based on the + separator) the date is written in each version, I thought about matching the date based on the +: after the 7th occurrence of a plus in a certain line (say, in the FAL segment), match the next 8 digits.
Is this possible to achieve with regex? And if yes, could someone please tell me how?
I suggest using a pattern like
^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)
where {7} can be adjusted to the value you need for each type of segment, and replace with the backreference to Group 1 followed by the new date. In Python, it is \g<1>20200101 (where 20200101 is your new date), in PHP/.NET, it is ${1}20200101. In JS, it will be just $1.
To run on a multiline text, use the m flag. In Python regex, you may embed it like (?m)^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+).
Details
^ - start of string/line
((?:[^+\n]*\+){7}) - Group 1: 7 repetitions of any chars other than + and newline, and then a +
\d{8} - 8 digits
(?=\+(?:\d{8})?\+) - that are followed with +, and optional chunk of 8 digits and a +.
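The replacement described above can be sketched in Python as follows. The helper name replace_date and its parameters are hypothetical, and the sample uses only three segments from the question, with the bold markers removed.

```python
import re

# Three segments from the question (bold markers removed).
sample = "\n".join([
    "INV+ED Format 1+Brustkrebs+19880117+E000000001+20080702+++1+0'",
    "FAL+087897044+0000000++name+000000000+0+20080702++1+++J+N+N+N+N+N+++0'",
    "DAT+20080702++++0'",
])

def replace_date(text, plus_count, new_date):
    """Replace the 8-digit field that follows the plus_count-th '+' on a line.

    Hypothetical helper: plus_count would come from the scheme/map that
    describes each segment version.
    """
    pattern = re.compile(
        r"(?m)^((?:[^+\n]*\+){%d})\d{8}(?=\+(?:\d{8})?\+)" % plus_count
    )
    # \g<1> keeps everything up to and including the counted '+' signs.
    return pattern.sub(r"\g<1>" + new_date, text)

updated = replace_date(sample, 7, "20200101")
print(updated)
```

With a count of 7, only the FAL line has 8 digits directly after its seventh +, so the INV and DAT lines are left alone. If several segment types could match the same count, prefixing the pattern with the segment name (for example ^(FAL\+...) would be one way to narrow it down.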

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's REGEX does not work the way you intended. You could split the string and find distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and also for strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
  SELECT DISTINCT TO_NUMBER(column_value) AS n
  FROM XMLTABLE(REPLACE('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
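The same split, de-duplicate and re-join idea, sketched outside the database in Python purely as an illustration (it assumes every token is an integer, as in the sample data):

```python
raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"

# Split on the delimiter, keep distinct numeric values, sort them numerically.
distinct_sorted = sorted({int(token) for token in raw.split(".")})

# Join the distinct values back together with the same delimiter.
deduplicated = ".".join(str(n) for n in distinct_sorted)
print(deduplicated)
```

Because the values are compared as whole tokens, 11 and 15 can never be confused with 1 and 5, which is the weakness of the character-based regex.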
Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the ancient [[:<:]] or [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence
of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote
the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either the start of the string or a dot (group 1), instead of
the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits which caught group 2.
)+ - End of group 3, it may occur multiple times.
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match part of 11.15 as follows (the same happens even with the period escaped, as in the query below):
SELECT
  '11.15' val,
  regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
  dual;

VAL    DEDUPLICATED
-----  ------------
11.15  115
Here is a similar approach to address those problems:
matching pattern composition
-Look for a non-period matching list of length 0 to N (the subexpression is referenced by \1).
'19' which matches ([^.]*)
-Look for the repeats, which form our second matching list, associated with subexpression 2 and referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
-Look for either a period or the end of the string. This matching list is referenced by \3. It prevents '11.15' from being collapsed to '115'.
([.]|$)
replacement string
I replace the match pattern with a replacement string composed of the first instance of the non-period matching list.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
  SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val FROM dual
  UNION ALL
  SELECT '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val FROM dual
  UNION ALL
  SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val FROM dual
)
SELECT
  val,
  regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
  tst;
VAL                                               DEDUPLICATE
------------------------------------------------  ---------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19  1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19          1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
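For readers who want to try the two patterns without an Oracle instance, here is a small sketch using Python's re module. This is an assumption made for convenience: Python's engine is not Oracle's, although on these inputs it reproduces the behaviour described above.

```python
import re

raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"

# Pattern from the question: the unescaped '.' and the missing boundary let
# "1.1" match across the end of 11 and the start of 15.
broken = re.sub(r"([^.]+)(.[ ]*\1)+", r"\1", raw)

# Pattern from this answer: the repeats must be followed by a period or the
# end of the string, and that trailing delimiter is put back via \3.
fixed = re.sub(r"([^.]*)([.]\1)+([.]|$)", r"\1\3", raw)

# Same fix applied to a value whose duplicates sit at the very end.
fixed_tail = re.sub(r"([^.]*)([.]\1)+([.]|$)", r"\1\3", raw + ".19.19")

print(broken)
print(fixed)
print(fixed_tail)
```

As in the SQL version, the approach relies on the values being sorted, so that duplicates are adjacent.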

Select only letters which are followed by a number

I am trying to select some codes from a PostgreSQL table.
I only want the codes that have numbers in them, e.g.
GD123
GD564
I don't want to pick any codes like GDTG or GDCNB.
Here's my query so far:
select regexp_matches(no_, '[a-zA-Z0-9]*$')
from myschema.mytable
which of course doesn't work.
Any help appreciated.
The pattern to match a string that has at least 1 letter followed by at least 1 number is '[A-Za-z]+[0-9]+'.
Now, if the valid patterns had to start with two letters and then have 3 digits after, as your examples show, then replace the + with {2} and {3} respectively, and enclose the pattern in ^ and $, like this: '^[A-Za-z]{2}[0-9]{3}$'
The regex match operator is ~ which you can use in the where clause:
SELECT no_
FROM myschema.mytable
WHERE no_ ~ '[A-Za-z]+[0-9]+'
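To see how the two patterns from this answer differ, here is a minimal sketch in Python instead of PostgreSQL (chosen only so it can be run standalone). The code XGD1234 is a made-up value added to show the effect of anchoring.

```python
import re

# Codes from the question, plus one hypothetical longer code.
codes = ["GD123", "GD564", "GDTG", "GDCNB", "XGD1234"]

contains = re.compile(r"[A-Za-z]+[0-9]+")     # letters followed by digits, anywhere in the string
exact = re.compile(r"^[A-Za-z]{2}[0-9]{3}$")  # exactly two letters, then exactly three digits

with_digits = [c for c in codes if contains.search(c)]
two_plus_three = [c for c in codes if exact.search(c)]

print(with_digits)
print(two_plus_three)
```

The unanchored pattern behaves like the ~ operator in the query above: it only needs to find the letters-then-digits sequence somewhere in the value.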
You may use
CREATE TABLE tb1
(s character varying)
;
INSERT INTO tb1
(s)
VALUES
('GD123'),
('12345'),
('GDFGH')
;
SELECT * FROM tb1 WHERE s ~ '^(?![A-Za-z]+$)[a-zA-Z0-9]+$';
Result: the rows GD123 and 12345 are returned; GDFGH is not.
Details
^ - start of string
(?![A-Za-z]+$) - a negative lookahead that fails the match if there are only letters to the end of the string
[a-zA-Z0-9]+ - 1 or more alphanumeric chars
$ - end of string.
If you want to avoid matching 12345, use
'^(?![A-Za-z]+$)(?![0-9]+$)[a-zA-Z0-9]+$'
Here, (?![0-9]+$) will similarly fail the match if, from the string start, all chars up to the end of the string are digits. Result: only GD123 is returned.
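The effect of the two lookaheads can be checked with a short Python sketch. Python is used here only for convenience; the lookahead syntax is the same as in the PostgreSQL patterns above, and the values are those inserted into tb1.

```python
import re

samples = ["GD123", "12345", "GDFGH"]

# Rejects strings made up of letters only.
not_only_letters = re.compile(r"^(?![A-Za-z]+$)[a-zA-Z0-9]+$")

# Additionally rejects strings made up of digits only.
mixed_only = re.compile(r"^(?![A-Za-z]+$)(?![0-9]+$)[a-zA-Z0-9]+$")

first_result = [s for s in samples if not_only_letters.search(s)]
second_result = [s for s in samples if mixed_only.search(s)]

print(first_result)
print(second_result)
```

Note that these patterns do not require the letters to come first, so a value such as 123GD would also pass the second check.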
Something like:
so=# with c(v) as (values('GD123'),('12345'),('GD ERT'))
select v ~ '[A-Z]{1,}[0-9]+', v from c;
 ?column? |   v
----------+--------
 t        | GD123
 f        | 12345
 f        | GD ERT
(3 rows)
?..
If the format of the data you want to obtain is a set of characters followed by a set of digits (e.g., GD123), you can use the regex:
[a-zA-Z0-9]+[0-9]
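This captures the letters together with the digits that follow them (note that the range has to be written as A-Za-z, because A-z also includes punctuation characters such as [ and _):
([A-Za-z]+\d+)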

Extract nested string from text column

I have following SQL result entries.
Result
---------
TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>
TW - 5657980 Priority updated from <strong> Medium</strong> to <strong>Low</strong> by <strong>System</strong>
TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243
TW - 5657980 Labor added <strong>Kelsey Franks / 14:00 hours </strong> by <strong>System</strong>#65197
Now I am trying to extract a short description from this result and migrate it to another column in the same table.
Expected result
--------------
Due Date Updated
Priority Updated
Material Added
Labor Added
Ignore the first 13 characters. In most cases the description ends with 'updated'. A few end with 'added'. It should be case insensitive.
Is there any way to get the expected result?
Solution with substring() using a regular expression. It skips the first 13 characters, then takes the string up to the first ' updated' or ' added', case-insensitive, with leading blank. Else NULL:
SELECT substring(result, '(?i)^.{13}(.*? (?:updated|added))')
FROM tbl;
The regexp explained:
(?i) .. meta-syntax to switch to case-insensitive matching
^ .. start of string
.{13} .. skip the first 13 characters
() .. capturing parenthesis (captures payload)
.*? .. any number of characters (non-greedy)
(?:) .. non-capturing parenthesis
(?:updated|added) .. 2 branches (string ends in 'updated' or 'added')
If we cannot rely on 13 leading characters, like you later commented, we need some other reliable definition instead. Your difficulty seems to lie with hazy requirements more than with the actual implementation.
Say, we are dealing with 1 or more non-digits, followed by 1 or more digits, a space and then the payload as defined above:
SELECT substring(result, '(?i)^\D+\d+ (.*? (?:updated|added))') ...
\d .. class shorthand for digits
\D .. non-digits, the opposite of \d
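Both patterns can be tried on the rows from the question with a small Python sketch. Python's re module is used here as a stand-in for PostgreSQL's substring(), and the helper short_description is a hypothetical name.

```python
import re

# Rows copied from the question.
rows = [
    "TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>",
    "TW - 5657980 Priority updated from <strong> Medium</strong> to <strong>Low</strong> by <strong>System</strong>",
    "TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243",
    "TW - 5657980 Labor added <strong>Kelsey Franks / 14:00 hours </strong> by <strong>System</strong>#65197",
]

fixed_prefix = re.compile(r"(?i)^.{13}(.*? (?:updated|added))")        # skip exactly 13 characters
flexible_prefix = re.compile(r"(?i)^\D+\d+ (.*? (?:updated|added))")   # skip non-digits, digits, a space

def short_description(pattern, text):
    """Return the captured payload, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

by_fixed = [short_description(fixed_prefix, r) for r in rows]
by_flexible = [short_description(flexible_prefix, r) for r in rows]

print(by_fixed)
print(by_flexible)
```

The captured text keeps the casing found in the data ("Due Date updated"), so matching the capitalisation shown in the expected result would need a separate step.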

Regular expression numeric format with length size of 22 and decimal size of 3 regex

I am using a profiling tool to validate the data inside the tables. I want to check whether the values entered for Current Market Value Amount (USD) have a length of 22 and a decimal size of 3. I am using the Ataccama profiling tool, which picks up the variables.
iif(
matches(#"^\d{22}.\d{3}$", Current_Market_Value_Amount__USD_),
true,false
)
I am looking to make this validation rule satisfy the requirement of:
Current Market Value Amount (USD) attribute should be in numeric format with length size of 22 and decimal size of 3
You need to escape the . by writing \. because it is a meta-character that means 'match any character'. To match a literal . you prefix it with the \ escape character.
On a side note: your [0-9] repetitions can be replaced easily:
^[0-9]{18}\.[0-9]{3}$
This assumes the total string length must be 22.
You don't need more than:
^\d{22}\.\d{3}$
Since \d matches a digit, the number inside the braces ensures that the preceding item is repeated exactly that many times (or from m to n times in the case of {m,n}).
By the way, regexes are not supported inside formulas; I suggest you take a look at "How do I get regex support in excel via a function, or custom function?"
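The difference between the escaped and unescaped dot, and between the two readings of "length 22", can be sketched in Python. This is only an illustration; the profiling tool's regex dialect may differ in details, and the test values below are made up.

```python
import re

unescaped = re.compile(r"^\d{22}.\d{3}$")        # '.' matches any character
escaped = re.compile(r"^\d{22}\.\d{3}$")         # 22 integer digits, a literal dot, 3 decimals
total_22 = re.compile(r"^[0-9]{18}\.[0-9]{3}$")  # 22 characters in total, dot included

# Made-up test values.
good = "1" * 22 + ".123"   # 22 digits before the decimal point
bad = "1" * 22 + "x123"    # same shape, but with a letter instead of the dot
short = "1" * 18 + ".123"  # 22 characters overall

print(bool(unescaped.search(bad)))  # the unescaped dot lets the letter through
print(bool(escaped.search(bad)))
print(bool(escaped.search(good)))
print(bool(total_22.search(short)), len(short))
```

Which of the two escaped patterns is right depends on whether "length size of 22" refers to the integer part or to the whole value.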