PCRE2 - Match every word whose suffix matches a backreference - regex

Given the string below,
ay bee ceefooh deefoo38 ee 37 ef gee38 aitch 38 eye19 jay38 kay 99 el88 em38 en 29 ou38 38 pee 12 q38 arr 999 esss 555
the goal is to match every word such that the suffix is a number that matches the number that appears after foo (which happens to be 38 in this case).
There is only one substring that begins with foo and ends with a number. The expected matches all exist after said substring.
Expected matches:
gee38
jay38
em38
ou38
q38
I've tried foo(\d+).*?(\w+\1)\b and foo(\d+).*(\w+\1)\b, but they fail to match all, because they either match the first one (gee38) or the last one (q38).
Is it possible to match all with just a single regex and, importantly, in just a single run?
The PCRE2 engine that I use behaves in the same way as https://regex101.com/r/uFEDOE/1. So, if the regex can match multiple substrings on regex101, then the engine that I use can too.

(?:foo|\G(?!^))(\d+).*?(?=(\w+))\w+(?=\1\b)
Demo
It could be some size or performance optimization.
@Niko Gambt, say if any optimization is important for you.
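For comparison, here is a minimal two-pass sketch in Python. It is not the single-regex, single-run PCRE2 solution asked for (Python's built-in re module has no \G anchor); it only reproduces the expected matches so that a candidate pattern can be checked against them. The variable names are illustrative.

```python
import re

text = ("ay bee ceefooh deefoo38 ee 37 ef gee38 aitch 38 eye19 jay38 kay 99 "
        "el88 em38 en 29 ou38 38 pee 12 q38 arr 999 esss 555")

# Pass 1: locate the only "foo<number>" and remember the number.
anchor = re.search(r"foo(\d+)", text)
number = anchor.group(1)

# Pass 2: after that point, collect words that have at least one
# word character in front of the same number as a suffix.
tail = text[anchor.end():]
matches = re.findall(r"\b\w+" + re.escape(number) + r"\b", tail)

print(matches)
```

The standalone 38 is skipped because \w+ requires at least one character before the number.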

Related

Regular expression to validate 2 character hex string

I have a data source that was converted from an Oracle database and loaded into Hadoop storage. One of the columns was a BLOB and therefore contained many control characters and unreadable ASCII characters outside the available code set. I am using Impala to write a regex replace function to deal with the characters that the regex library cannot understand. I would like to remove the offending two-character hex codes BEFORE I use the unhex query function, so that I can do the rest of the regex parsing on a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to capture \u0020-\u007f, which in two-digit hex is 20-7f.
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to capture 2 characters at a time (e.g. 01, 0A, 00), evaluate whether each pair falls within the acceptable range of two-digit hex values mentioned above, and return only the acceptable ones.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable number in my first range (5) and starts the capture from there, which throws off the alignment for the rest of the string. This is what my expression returns:
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string. And, if they don't fit the expression, iterate to the next two characters. But only in two character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a positive lookbehind to it. It starts at the beginning of the string and then matches 2 characters at a time, which ensures that the pair you're matching is always preceded by whole groups of 2 characters.
Positive lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})
Preview:
Regex101
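An unbounded lookbehind such as (?<=^(..)*) is not accepted by every engine (Python's re module, for one, requires fixed-width lookbehind). As a hedged alternative, the same "two characters at a time" idea can be sketched in Python by splitting the string into pairs first and filtering by value afterwards. The variable names are illustrative, and this runs outside Impala.

```python
import re

hex_string = ("010A000000153020405C00000000143020405CBC000000F5"
              "3320405C4C010000E12F204058540100002D01")

# Walk the string strictly two hex characters at a time.
pairs = re.findall(r"[0-9A-Fa-f]{2}", hex_string)

# Keep only the pairs whose value lies in 0x20..0x7F.
printable = [p for p in pairs if 0x20 <= int(p, 16) <= 0x7F]

print(" ".join(printable))
```

On the sample string this also reports the trailing 2D, which is inside the 20-7F range although the list in the question stops at 54.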

Regular Expression Extracting Text from a group

I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break the name down into groups separated by an underscore, which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so good.
Now I need to extract characters from one of the groups. For example, in group 2 I need the first 3 and 8 digits (keep in mind they could be letters too).
So I tried something like this:
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group, but not the 38. So I'm lost at this point.
Any help would be great
Try the below Regex to match any first 3 char/decimal and one decimal
(.?)_([A-Z0-9]{3}[0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
Try the below Regex to match any first 3 char/decimal and one decimal/char
(.?)_([A-Z0-9]{3}[A-Z0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
It will match any 3 letters/digits followed by 1 letter/digit.
If your first two letters are a constant like "PH", then try the following:
(.?)_([PH]+[0-9A-Z]{2})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
I am assuming that you are trying to match group 2 starting with numbers. If that is the case, then you have to change the source string to something such as
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't help to show how in your language, but most languages employ a contains or indexOf method that can be used to determine whether or not a substring exists in a string.
Using regex alone, however, this can be accomplished using the following regular expression.
See regex in use here
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
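Since no language was specified, the following is only a sketch in Python of both ideas from this answer: splitting on _ and testing the second element, and capturing the 38 with the last pattern shown. The variable names are illustrative.

```python
import re

filename = "0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS."

# Split-based check: look only at the second underscore-separated field.
fields = filename.split("_")
second_has_38 = "38" in fields[1]

# Regex-based capture, using the "capturing the 38" pattern from above.
pattern = re.compile(
    r"([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_"
    r"(.*?)(\d{16})(.*)"
)
m = pattern.match(filename)

print(fields[1], second_has_38)
print(m.group(2), m.group(3), m.group(4))
```

Here group 2 holds the text before the 38, group 3 the 38 itself, and group 4 the rest of the second field.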

Regular expression for A123ABC

I have a string in the format A123ABC
First letter cannot contain <I,O,Q,U,Z>
Next 3 digits (0-9) from 21-998
Last 3 letters cannot include <I,Q,Z>
I used the following expression [A-HJ-NPR-TV-Y]{1}[0-9]{2,3}[A-HJ-PR-Y]{3}
But I am not able to restrict the number in the range 21-998.
Your letter part is fine, below is just the numbers portion:
regex = "(?:2[1-9]|[3-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-8])"
(?:...) group, but do not capture.
2[1-9] covers 21-29
[3-9][0-9] covers 30-99
[1-8][0-9][0-9] covers 100-899
9[0-8][0-9] covers 900-989
99[0-8] covers 990-998
| stands for "or"
Note: [0-9] may be replaced by \d. So, a more concise representation would be:
regex = "(?:2[1-9]|[3-9]\d|[1-8]\d{2}|9[0-8]\d|99[0-8])"
One option would be matching (\d+) and checking if that falls in the range 21 - 998 outside a regex, in the language you're using, if possible.
If that is not feasible, you have to break it up (just showing the middle part):
(2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])
Breakdown:
2[1-9] matches 21 - 29
[3-9]\d matches 30 - 99
[1-8]\d\d matches 100 - 899
9[0-8]\d matches 900 - 989
99[0-8] matches 990 - 998
Also, the {1} is superfluous and can be omitted, making the complete regex
[A-HJ-NPR-TV-Y](2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])[A-HJ-PR-Y]{3}
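Both options from this answer can be sketched side by side in Python: the pure-regex range and the "capture the digits, compare outside the regex" variant. The function names are hypothetical, and rejecting leading zeros in the second variant is an assumption made so that the two agree.

```python
import re

# Complete pattern from the answer above.
FULL = re.compile(
    r"[A-HJ-NPR-TV-Y](2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])[A-HJ-PR-Y]{3}"
)

def valid_by_regex(s):
    """Range 21-998 enforced entirely inside the regex."""
    return FULL.fullmatch(s) is not None

def valid_by_range(s):
    """Capture the digits, then compare numerically outside the regex."""
    m = re.fullmatch(r"[A-HJ-NPR-TV-Y](\d+)[A-HJ-PR-Y]{3}", s)
    if m is None or m.group(1).startswith("0"):  # assumption: no leading zeros
        return False
    return 21 <= int(m.group(1)) <= 998

for sample in ["A21ABC", "A998ABC", "A20ABC", "A999ABC", "I123ABC"]:
    print(sample, valid_by_regex(sample), valid_by_range(sample))
```

The numeric comparison is easier to adjust if the allowed range ever changes, at the cost of a second step in the host language.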
Assuming the numbers between 21 and 99 are displayed with three digits (i.e. 021, 055, 099), here's a solution for the number part:
((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9(([0-8][0-9])|(9[0-8]))))
Entire regex:
[A-HJ-NPR-TV-Y]{1}((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9(([0-8][0-9])|(9[0-8]))))[A-HJ-PR-Y]{3}
There are probably easier ways to do this, but one way would be to use:
^((?=[^IOQUZ])([A-Z]))((02[1-9])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8]))((?=[^IQZ])([A-Z])){3}$
To explain:
^ denotes the beginning of the string.
((?=[^IOQUZ])([A-Z])) would give you any capital letter not in <I, O, Q, U, Z>.
((02[1-9])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8])) denotes any number between ((021 to 029) or (030 to 099) or (100 to 899) or (900 to 989) or (990 to 998)), with two-digit values written with a leading zero.
((?=[^IQZ])([A-Z])){3} would match any three capital letters not in <I, Q, Z>.
$ would denote the end of the string.

Regex failing to match number and dash with letter (or space and letter)

In the tester this works ... but not in PostgreSQL.
My data looks like this: usually a series of letters, followed by 2 numbers and a POSSIBLE '-' or space with only ONE letter following. I am trying to isolate the 2 numbers, the possible '-' or space, AND the ONE letter with my regex:
For ex:
AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F
FOOBAR 30 Apple ## should return 30
Regex's (that have worked in the tester, but not in PostgreSQL) that I've tried:
substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?')
&
substring(REF from '([0-9]+)-?([A-Za-z])?')
So far everything tests out in the tester, but not in PostgreSQL. I just keep getting the numbers returned, AND NOTHING AFTER THEM.
What I am getting now(for ex):
AJ 50-R Busboys ## returns as "50" NOT as "50-R"
You're looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')
In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.
This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.
SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(?:[\s-]?[[:alpha:]]\M)?)')
FROM (
VALUES
('AJ 50-R Busboys') -- should return: 50-R
,('APPLES 30 F') -- should return: 30 F
,('FOOBAR 30 Apple') -- should return: 30
,('FOOBAR 30x Apple') -- should return: 30x
,('sadfgag30 D 66 X foo') -- should return: 30 D - not: 66 X
) r(t);
->SQLfiddle
Explanation
^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(?: .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times
Letters as defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:
SELECT substring('oö','[[:alpha:]]*')
,substring('oö','[a-zA-Z]*');
More about character classes in Postgres regular expressions in the manual.
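For readers without a Postgres instance at hand, the pattern can be approximated in Python. This is a rough port, not an exact equivalent: [[:alpha:]] is replaced by [^\W\d_] (a word character that is neither a digit nor an underscore) and the end-of-word constraint \M by \b, both of which are assumptions of this sketch.

```python
import re

# Approximation of [[:alpha:]]: a word character that is not a digit or "_".
LETTER = r"[^\W\d_]"

# Port of the Postgres pattern; \M (end of word) is approximated by \b.
PATTERN = re.compile(rf"^(?:{LETTER}| )+(\d\d(?:[\s-]?{LETTER}\b)?)")

samples = [
    "AJ 50-R Busboys",
    "APPLES 30 F",
    "FOOBAR 30 Apple",
    "FOOBAR 30x Apple",
    "sadfgag30 D 66 X foo",
]

# Group 1 plays the role of the value returned by substring().
results = [PATTERN.search(s).group(1) for s in samples]
for sample, result in zip(samples, results):
    print(repr(sample), "->", repr(result))
```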
It's because of the parentheses.
I've looked everywhere in the documentation and found an interesting sentence on this page:
[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
I took your first expression:
([0-9]+)-?([:space:])?([A-Za-z])?
and wrapped it in parentheses:
(([0-9]+)-?([:space:])?([A-Za-z])?)
and it works fine (see SQLFiddle).
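The rule quoted from the documentation can be imitated in Python to see why the extra parentheses matter. pg_substring below is a hypothetical helper that only mimics that one rule (return the first parenthesized subexpression if there is one, otherwise the whole match); it is not a PostgreSQL API, and [:space:] is swapped for \s because Python's re module has no POSIX classes.

```python
import re

def pg_substring(text, pattern):
    """Hypothetical emulation of substring(text from pattern) as quoted above."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # With at least one capturing group, return the first one;
    # otherwise return the whole match.
    return m.group(1) if m.re.groups else m.group(0)

ref = "AJ 50-R Busboys"
inner = r"([0-9]+)-?(\s)?([A-Za-z])?"

plain = pg_substring(ref, inner)                # first group is ([0-9]+)
wrapped = pg_substring(ref, "(" + inner + ")")  # first group is the whole pattern

print(plain, wrapped)
```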
Update:
Also, because you're looking for - or space, you could rewrite your middle expression to [-\s]? (thanks Matthew for pointing that out; note that | inside a bracket expression is a literal character, not alternation), which leads to the following possible REGEX:
(([0-9]+)[-\s]?([A-Za-z])?)
(SQLFiddle)
Update 2:
While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.
You should use the regex provided by Matthew in his answer.

Matching inner pattern an unlimited amount of times within outer pattern

Say I have the following pattern:
INDICATOR\s+([a-z0-9]+)
which would match for example:
INDICATOR AA or INDICATOR B3
I need to edit this pattern so that it matches any string which starts with INDICATOR, followed by a space and then one or more matches of the inner pattern, e.g.
INDICATOR AA A3 66 B8 34 CD
INDICATOR BG 4D CS
INDICATOR HG
Is it possible to do this?
Solution
With thanks to Gumbo I came up with the following regex which suits my requirements:
INDICATOR((\s+)?([,-])?(\s+)?([a-z0-9]+))+
Try this:
INDICATOR(\s+([a-z0-9]+))+
Here the repeating pattern is wrapped in a group and quantified using + to allow one or more repetitions of the expression inside the group. But you won’t get every match of the inner group with this but only the last match (or to be more specific: it depends on the implementation you’re using).
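The "only the last match" behaviour can be observed in Python's re module, used here purely as one example implementation. The sketch also shows one possible workaround: capture the whole repeated run once and split it afterwards. The re.IGNORECASE flag is an assumption, needed because the sample data is upper case while the class is [a-z0-9].

```python
import re

line = "INDICATOR AA A3 66 B8 34 CD"

# Repeated capturing group: only the final repetition is kept.
m = re.search(r"INDICATOR(\s+([a-z0-9]+))+", line, re.IGNORECASE)
last_token = m.group(2)

# Workaround: capture the whole repeated run once, then split it.
run = re.search(r"INDICATOR((?:\s+[a-z0-9]+)+)", line, re.IGNORECASE)
tokens = run.group(1).split()

print(last_token)
print(tokens)
```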