Postgres regex to get substring from path - regex

I need to get a substring from a path directory made of 2 numbers preceeded and followed by a '_'. The string is like:
'P:\pgdfecol\71698384737978\INFENTECONTROL\2011_9_43\2011_9_46_43_29_10.ZIP'
and in this case I want to get the 43 following the 46.
The path is stored following the next rule, after the last backslash:
'\TablesPK_twoCharactersClassification_twoCharactersDocumentType_anything.ZIP'
I want to get the classification. The catch is that the table's primary key could be more than one field, although I know in each case how many fields are in the pk.
I got with something like this:
select substring(substring(substring('P:\pgdfecol\71698384737978\INFENTECONTROL\2011_9_43\2011_9_46_43_29_10.ZIP' from '([^\\]*(\.ZIP|zip))') from '([^_]*_){4}') from '[0-9]{2}')
But I would like something simpler.
Other cases:
'P:\pgdfecol\71698384737978\INFENTECONTROL\2011_03_46\2011_03_46_46_48_.ZIP'
(need the second 46)
'P:\pgdfecol\71698384737978\INFCONTABLE\2009_05_INBP\2009_05_INBP_22_28_.ZIP'
(need the 22 after INBP)
'P:\pgdfecol\71698384737978\INFOFICIAL\2007_06_MB\2007_06_MB_29_28_.ZIP'
(need the 29)
'P:\pgdfecol\71698384737978\ASOCIADOS\8010625\8010625_02_04_20110111.ZIP'
(02 after 8010625)
In the last case, the pk is only one field, so I've changed the sentence as:
select substring(substring(substring('P:\pgdfecol\71698384737978\ASOCIADOS\8010625\8010625_02_04_20110111.ZIP' from '([^\\]*(\.ZIP|zip))') from '([^_]*_){2}') from '[0-9]{2}')
For one Pk, I need the second set of ([^_]*_), for three the fourth, and so on..
select substring(substring(substring('P:\pgdfecol\71698384737978\ACTASCOMITE\ACRE123\ACRE123_17_11_.ZIP' from '([^\\]*(\.ZIP|zip))') from '([^_]*_){2}') from '[0-9]{2}')
(I get 17)
I'm using postgres 9.0.

I am beginning to understand. Consider this test case:
WITH x(txt) AS ( VALUES
('P:\pgdfecol\71698384737978\INFENTECONTROL\2011_9_43\2011_9_46_43_29_10.ZIP') -- 43
,('P:\pgdfecol\71698384737978\INFENTECONTROL\2011_03_46\2011_03_46_46_48_.ZIP') --need the second 46
,('P:\pgdfecol\71698384737978\INFCONTABLE\2009_05_INBP\2009_05_INBP_22_28_.ZIP') --need the 22 after INBP
,('P:\pgdfecol\71698384737978\INFOFICIAL\2007_06_MB\2007_06_MB_29_28_.ZIP') --need the 29
)
SELECT txt, substring(txt, '\\(?:[^_\\]+_){3}(\d\d)_[^\\]*\.(?:ZIP|zip)$')
FROM x
(?:) .. non-capturing parenthesis
[^_\\].. character class with any character except \ and _
\d .. a digit, same as [0-9] effectively
+ .. 1 or more matches (greedy)
$ .. end of string
[_\d] .. character class with digits and _
The case with only one pk needs a different pattern. Use {1} instead of {3}.
Not sure why you escape the backslashes. In modern versions of PostgreSQL standard_conforming_strings is on by default so you don't need to escape backslashes in strings - but still in regular expression of course.

Related

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's REGEX does not work the way you intended. You could split the string and find distinct rows using the general method Splitting string into multiple rows in Oracle. Another option is to use XMLTABLE , which works for numbers and also strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN
GROUP (
ORDER BY n
) AS n
FROM (
SELECT DISTINCT TO_NUMBER(column_value) AS n
FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Demo
Unfortunately Oracle doesn't provide a token to match a word boundary position. Neither familiar \b token nor ancient [[:<:]] or [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence
of digits would be to use:
a loobehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote
the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either start of the string or a dot (group 1), instead of
the loobehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits which caught group 2.
)+ - End of group 3, it may occur multiple times.
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follow:
SCOTT#DB>SELECT
2 '11.15' val,
3 regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
4 FROM
5 dual;
VAL DEDUPLICATED
11.15 115
Here is a similar approach to address those problems:
matching pattern composition
-Look for a non-period matching list of length 0 to N (subexpression is referenced by \1).
'19' which matches ([^.]*)
-Look for the repeats which form our second matching list associated with subexression 2, referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
-Look for either a period or end of string. This is matching list referenced by \3. This fixes the match of '11.15' by '115'.
([.]|$)
replacement string
I replace the match pattern with a replacement string composed of the first instance of the non-period matching list.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
SCOTT#db>WITH tst AS (
2 SELECT
3 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
4 FROM
5 dual
6 UNION ALL
7 SELECT
8 '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
9 FROM
10 dual
11 UNION ALL
12 SELECT
13 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
14 FROM
15 dual
16 ) SELECT
17 val,
18 regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
19 FROM
20 tst;
VAL DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19 1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).

Regular Expression Extracting Text from a group

I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break down the name into groups which are separated by a underscore. Which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so go.
Now I need to extract characters from one of the group for example in group 2 I need the first 3 and 8 decimal ( keep mind they could be characters too ).
So I had try something like this :
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group but not the 38 ? So I’m lost at this point.
Any help would be great
Try the below Regex to match any first 3 char/decimal and one decimal
(.?)_([A-Z0-9]{3}[0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
Try the below Regex to match any first 3 char/decimal and one decimal/char
(.?)_([A-Z0-9]{3}[A-Z0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
It will match any 3 letters/digits followed by 1 letter/digit.
If your first two letter is a constant like "PH" then try the below
(.?)_([PH]+[0-9A-Z]{2})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
I am assuming that you are trying to match group2 starting with numbers. If that is the case then you have change the source string such as
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't help to show how in your language, but most languages employ a contains or indexOf method that can be used to determine whether or not a substring exists in a string.
Using regex alone, however, this can be accomplished using the following regular expression.
See regex in use here
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)

How to creating a regex pattern in VBA to extract dates from string and exclude false matches

I am trying to use Regex to parse a series of strings to extract one or more text dates that may be in multiple formats. The strings will look something like the following:
24 Aug 2016: nno-emvirt010a/b; 16 Aug 2016 nnt-emvirt010a/b nnd-emvirt010a/b COSI-1.6.5
24.16 nno-emvirt010a/b nnt-emvirt010a/b nnd-emvirt010a/b EI.01.02.03\
9/23/16: COSI-1.6.5 Logs updated at /vobs/COTS/1.6.5/files/Status_2016-07-27.log, Status_2016-07-28.log, Status_2016-08-05.log, Status_2016-08-08.log
I am not concerned about validating the individual date fields; just extracting the date string. The part I am unable to figure out is how to not match on number sequences that match the pattern but aren’t dates (‘1.6.5’ in ex. (1) and 01.02.03 in ex. (2)) and dates that are part of a file name (2016-07-27 in ex. (3)). In each of these exception cases in my input data, the initial numbers are preceded by either a period(.), underscore (_) or dash (-), but I cannot determine how to use this to edit the pattern syntax to not match these strings.
The pattern I have that partially works is below. It will only ignore the non date matches if it starts with 1 digit as in example 1.
/[^_\.\(\/]\d{1,4}[/\-\.\s*]([1-9]|0[1-9]|[12][0-9]|3[01]|[a-z]{3})[/\-\.\s*]\d{1,4}/ig`
I am not sure about vba check if this works . seems they have given so much options : https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch04s04.html
^(?:(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])|↵
(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9]))/(?:[0-9]{2})?[0-9]{2}$
^(?:
# m/d or mm/dd
(1[0-2]|0?[1-9])/(3[01]|[12][0-9]|0?[1-9])
|
# d/m or dd/mm
(3[01]|[12][0-9]|0?[1-9])/(1[0-2]|0?[1-9])
)
# /yy or /yyyy
/(?:[0-9]{2})?[0-9]{2}$
According to the test strings you've presented, you can use the following regex
See this regex in use here
(?<=[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
This regex ensures that specific date formats are met and are preceded by nothing (beginning of the string) or by a non-word character (specifically a-z, A-Z, 0-9) or dot .. The date formats that will be matched are:
24 Aug 2016
24.16
9/23/16
The regex could be further manipulated to ensure numbers are in the proper range according to days/month, etc., however, I don't feel that is really necessary.
Edits
Edit 1
Since VBA doesn't support lookbehinds, you can use the following. The date is in capture group 1.
(?:[^a-zA-Z\d.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:-\d{2}){2})|\d{2}\.\d{2})(?=[^a-zA-Z\d.])
Edit 2
As per bulbus's comment below
(?:[^\w.]|^)((?:\d{1,2}\s*[A-Z][a-z]{2}\s*\d{2,4})|(?:(?:\d{‌1,2}\/){2}\d{2,4})|(‌​?:\d{2,4}(?:-\d{2}){‌​2})|\d{2}\.\d{2})
Took liberty to edit that a bit.
replaced [^a-zA-Z\d.] with [^\w.], comes with added advantage of excluding dates with _2016-07-28.log
Due to 1 removed trailing condition (?=[^a-zA-Z\d.]).
Forced year digits from \d+ to \d{2,4}
Edit 3
Due to added conditions of the regex, I've made the following edits (to improve upon both previous edits). As per the OP:
The edited pattern above works in all but 2 cases:
it does not find dates with the year first (ex. 2016/07/11)
if the date is contained within parenthesis in the string, it returns the left parenthesis as part of the date (ex. match = (8/20/2016)
Can you provide the edit to fix these?
In the below regexes, I've changed years to \d+ in order for it to work on any year greater than or equal to 0.
See the code in use here
(?:[^\w.]|^)((?:\d{1,2}\s+[A-Z][a-z]{2}\s+\d+)|(?:(?:\d{1,2}\/){2}\d+)|(?:\d+(?:\/\d{1,2}){2})|(?:\d+(?:-\d{2}){2})|\d{2}\.\d+)
This regex adds the possibility of dates in the XXXX/XX/XX format where the date may appear first.
The reason you are getting ( as a match before the regex is the nature of the Full Match. You need to, instead, grab the value of the first capture group and not the whole regex result. See this answer on how to grab submatches from a regex pattern in VBA.
Also, note that any additional date formats you need to catch need to be explicitly set in the regex. Currently, the regex supports the following date formats:
\d{1,2}\s+[A-Z][a-z]{2}\s+\d+
12 Apr 17
12 Apr 2017
(?:\d{1,2}\/){2}\d+
1/4/17
01/04/17
1/4/2017
01/04/2017
\d+(?:\/\d{1,2}){2}
17/04/01
2017/4/1
2017/04/01
17/4/1
\d+(?:-\d{2}){2}
17-04-01
2017-04-01
\d{2}\.\d+ - Although I'm not sure what this date format is even used for and how it could be considered efficient if it's missing month
24.16

Extract nested string from text column

I have following SQL result entries.
Result
---------
TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>
TW - 5657980 Priority updated from <strong> Medium</strong> to <strong>Low</strong> by <strong>System</strong>
TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243
TW - 5657980 Labor added <strong>Kelsey Franks / 14:00 hours </strong> by <strong>System</strong>#65197
Now I am trying to extract a short description from this result and trying to migrate it to the another column in the same table.
Expected result
--------------
Due Date Updated
Priority Updated
Material Added
Labor Added
Ignore first 13 characters. For most of the cases it ends with 'updated'. Few ends with 'added'. It should be case insensitive.
Is there any way to get the expected result.
Solution with substring() using a regular expression. It skips the first 13 characters, then takes the string up to the first ' updated' or ' added', case-insensitive, with leading blank. Else NULL:
SELECT substring(result, '(?i)^.{13}(.*? (?:updated|added))')
FROM tbl;
The regexp explained:
(?i) .. meta-syntax to switch to case-insensitive matching
^ .. start of string
.{13} .. skip the first 13 characters
() .. capturing parenthesis (captures payload)
.*? .. any number of characters (non-greedy)
(?:) .. non-capturing parenthesis
(?:updated|added) .. 2 branches (string ends in 'updated' or 'added')
If we cannot rely on 13 leading characters like you later commented, we need some other reliable definition instead. Your difficulty seems with hazy requirements more than with the actual implementation.
Say, we are dealing with 1 or more non-digits, followed by 1 or more digits, a space and then the payload as defined above:
SELECT substring(result, '(?i)^\D+\d+ (.*? (?:updated|added))') ...
\d .. class shorthand for digits
\D .. non-digits, the opposite of \d

Regex failing to match number and dash with letter (or space and letter)

In the tester this works ... but not in PostgreSQL.
My data is like this -- usually a series of letters, followed by 2 numbers and a POSSIBLE '-' or 'space' with only ONE letter following. I am trying to isolate the 2 numbers and the Possible '-" or 'space' AND the ONE letter with my regex:
For ex:
AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F
FOOBAR 30 Apple ## should return 30
Regex's (that have worked in the tester, but not in PostgreSQL) that I've tried:
substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?')
&
substring(REF from '([0-9]+)-?([A-Za-z])?')
So far everything tests out in the tester...but not the PostgreSQL. I just keep getting the numbers returns -- AND NOTHING AFTER IT.
What I am getting now(for ex):
AJ 50-R Busboys ## returns as "50" NOT as "50-R"
Your looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')
In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.
This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.
SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(:?[\s-]?[[:alpha:]]\M)?)')
FROM (
VALUES
('AJ 50-R Busboys') -- should return: 50-R
,('APPLES 30 F') -- should return: 30 F
,('FOOBAR 30 Apple') -- should return: 30
,('FOOBAR 30x Apple') -- should return: 30x
,('sadfgag30 D 66 X foo') -- should return: 30 D - not: 66 X
) r(t);
->SQLfiddle
Explanation
^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(:? .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times
Letters as defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:
SELECT substring('oö','[[:alpha:]]*')
,substring('oö','[a-zA-Z]*');
More about character classes in Postgres regular expressions in the manual.
It's because of the parentheses.
I've looked everywhere in the documentation and found an interesting sentence on this page:
[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
I took your first expression:
([0-9]+)-?([:space:])?([A-Za-z])?
and wrapped it in parentheses:
(([0-9]+)-?([:space:])?([A-Za-z])?)
and it works fine (see SQLFiddle).
Update:
Also, because you're looking for - or space, you could rewrite your middle expression to [-|\s]? (thanks Matthew for pointing that out), which leads to the following possible REGEX:
(([0-9]+)[-|\s]?([A-Za-z])?)
(SQLFiddle)
Update 2:
While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.
You should use the regex provided by Matthew in his answer.