Note: this question grew out of another answer, all of whose comments have since been removed.
When a lookaround construct is used within a regex, a backtrack (or something like one) takes place right before the closing bracket. As far as I know, this backtrack shows up in the output of the Perl and PCRE debuggers:
The question is: what is this backtrack, why is it there, and why is it interpreted as a backtrack?
The backtrack is a lie.
It's just a consequence of how the regex101 debugger is implemented. It uses a PCRE feature (flag) called PCRE_AUTO_CALLOUT. This flag tells the PCRE engine to invoke a user-defined function at every step of matching. This function receives the current match status as input.
The catch is that PCRE doesn't tell the callout when it really backtracks. Regex101 has to infer that from the match status.
As you can see, in the step before the "backtrack" occurs, the current matched text is a_, and just after you get out of the lookahead, it's reverted to a. Regex101 notices the matched text is shorter and therefore it infers that a backtrack must have happened, with the confusing outcome you noticed.
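The inference described above can be sketched in a few lines. This is an illustration only, not regex101's actual code: the function name infer_backtracks is made up, and the snapshot list is a hand-written stand-in for the matched text a callout would see at each step, not output captured from a real run.

```python
def infer_backtracks(snapshots):
    """Flag each step whose matched text is shorter than the previous step's.

    This mimics a debugger that only sees the current matched text and
    has to guess when the engine backtracked.
    """
    flags = []
    previous = ""
    for text in snapshots:
        flags.append(len(text) < len(previous))
        previous = text
    return flags


# Hypothetical snapshots for /a(?=_)_b/ against "a_b": the lookahead
# matches "_" (giving "a_"), then the engine leaves the lookahead and the
# matched text is "a" again before the real "_" and "b" are consumed.
steps = ["", "a", "a", "a_", "a", "a_", "a_b"]
flags = infer_backtracks(steps)
```

Only the step right after the lookahead is flagged, which is the spurious "backtrack" shown in the debugger.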
For reference, here's the internal PCRE representation of the pattern with auto-callout enabled:
$ pcretest
PCRE version 8.38 2015-11-23
re> /a(?=_)_b/DC
------------------------------------------------------------------
0 59 Bra
3 Callout 255 0 1
9 a
11 Callout 255 1 5
17 17 Assert
20 Callout 255 4 1
26 _
28 Callout 255 5 0
34 17 Ket
37 Callout 255 6 1
43 _
45 Callout 255 7 1
51 b
53 Callout 255 8 0
59 59 Ket
62 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options:
First char = 'a'
Need char = 'b'
As you can see, there's no branching opcode there, just an Assert.
Related
We have a "street_number" field which has been filled in freely over the years and which we now want to format. Using regular expressions, we'd like to extract the real "street_number" and the "street_number_suffix".
Ex: for 17 b, "street_number" would be 17 and "street_number_suffix" would be b.
As there are a dozen different patterns, I'm having trouble tuning the regular expression correctly. I'm considering using 2 different regexes, one to extract the "street_number" and another to extract the "street_number_suffix".
Here's an exhaustive set of patterns we'd like to format and the expected output:
# Extract street_number using PCRE
input    street_number    street_number_suffix
19-21    19               null
2 G      2                G
A        null             A
1 bis    1                bis
3 C      3                C
N°10     10               null
17 b     17               b
76 B     76               B
7 ter    7                ter
9/11     9                null
21.3     21               3
42       42               null
I know I could use an expression that matches any digits up to a hyphen, using \d+(?=\-).
It could be extended to match up to a hyphen OR a slash using \d+(?=\-|\/). However, once I add \s to this pattern, the 21 from 19-21 matches as well. Adding conditions may not be that simple, which is why I'm asking for your help.
Could anyone give me a helping hand on this? If it helps, here's a draft: https://regex101.com/r/jGK5Sa/4
Edit: at the time I'm editing, here's the closest regex I could find:
(?:(N°|(?<!\-|\/|\.|[a-z]|.{1})))\d+
However, the full match for N°10 isn't 10 but N°10 (and our ETL doesn't support capturing groups, so I can't use /......(\d+)/).
To get the street numbers, you could update the pattern to:
(?<![-/.a-z\d])\d+
Explanation
(?<! Negative lookbehind
[-/.a-z\d] Match any of the listed characters using a character class
) Close the negative lookbehind
\d+ Match 1+ digits
Regex demo
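As a quick check of the pattern against the inputs from the question, here is a minimal sketch. It assumes Python's re module rather than PCRE (the pattern uses nothing PCRE-specific), and the names STREET_NUMBER and street_number are made up for the example.

```python
import re

# A run of digits that is not preceded by '-', '/', '.', a lowercase
# letter, or another digit.
STREET_NUMBER = re.compile(r"(?<![-/.a-z\d])\d+")


def street_number(value):
    """Return the first street number found in value, or None."""
    match = STREET_NUMBER.search(value)
    return match.group() if match else None


samples = ["19-21", "2 G", "A", "1 bis", "N°10", "17 b", "9/11", "21.3", "42"]
results = {sample: street_number(sample) for sample in samples}
```

The full match is the number itself, so no capturing group is needed; an input such as A simply produces no match.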
VB2010: Using regex, I can't seem to get this seemingly easy pattern to work. I first look for a line with the keyword TRIPS, which has my data, and then from that line I want to extract repeated groups of data, each made up of an alpha code followed by a number.
MODES 1 0 0
OVERH X 28 H 0 Z 198
TRIPS X 23 D 1 Z 198
ITEMSQ 1 0 0
COSTU P 16 E 180
CALLS 0 0
I have
^TRIPS (?<grp>[A-Z]\s{1,4}\d{1,3})
Which gives me one match and the first group "X 23". So I extend it by allowing it to match up to 4 groups.
^TRIPS (?<grp>[A-Z]\s{1,4}\d{1,3}){0,4}
but I get one match with still only one group.
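You aren't allowing for whitespace between the groups. You need to do something like this:
^TRIPS ((?<grp>[A-Z]\s{1,4}\d{1,3})\s+){0,4}
Note that the trailing \s+ requires whitespace after every group, including the last one; use \s* instead if the input can end right after the final number.
Here is my regex demo.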
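For readers working outside .NET: Python's re module keeps only the last repetition of a repeated group, so the pattern above would not return all the pairs there. The following is a minimal sketch of a different, two-step approach in Python (not the VB code from the question): find the TRIPS line, then collect every code/number pair on it. The names PAIR and trips_pairs are made up for the example.

```python
import re

# One alpha code, 1 to 4 whitespace characters, then a 1 to 3 digit number.
PAIR = re.compile(r"([A-Z])\s{1,4}(\d{1,3})")

KEYWORD = "TRIPS "


def trips_pairs(text):
    """Return the (code, number) pairs from the first TRIPS line."""
    for line in text.splitlines():
        if line.startswith(KEYWORD):
            rest = line[len(KEYWORD):]
            return [(code, int(number)) for code, number in PAIR.findall(rest)]
    return []


data = """MODES 1 0 0
OVERH X 28 H 0 Z 198
TRIPS X 23 D 1 Z 198
ITEMSQ 1 0 0
COSTU P 16 E 180
CALLS 0 0"""

pairs = trips_pairs(data)
```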
As the question title states:
If the first digit is 1, return 1; but if the number starts with 145, return 145, and if it starts with 133, return 133.
Sample data:
K'8134567
K'81345678
K'6134516789
K'61345678
K'643456
K'646345678
K'1234567890
K'12345678901
K'1454567890 <<<--- want 145 returned and not 1
K'13345678901 <<<--- want 133 returned and not 1
K'3214567890123
K'32134567890123
K'3654567890123
K'8934567890123
K'6554567890123
Regex expression:
K'(?|(?P<name1>81)\d+|(61)\d+|(64)\d+|(1)\d+|(44)\d+|(86)\d+|(678)\d+|(41)\d+|(49)\d+|(33)\d+|(685)\d+|(\d{1,3})\d+)
the regex explained:
I am interested in the digits after K'
I am looking to do this using regex but not sure if it can be done.
What I want is:
if the number starts with 81 return 81
if the number starts with 61 return 61
...
if the number starts with something I am not interested in, return "other" (or its first 1-3 digits)
The criteria above work,
but my question is how do I do the following:
if the first digit is 1 then return 1 BUT
if the first digit is 1 and the 2nd and 3rd digits are 45, return 145 and don't return just 1
if the first digit is 1 and the 2nd and 3rd digits are 33, return 133 and don't return just 1
I presume I have to put something inside this part of the regex: |(1)\d+|
Some other questions for my own reference:
Does regex sort the data first?
Is the order of the alternatives in the regex important to how it is evaluated? Ideally I do not want this.
You can use this regex:
K'(?P<name1>81|61|64|44|86|678|41|49|33|685|1(?:33|45)?|\d{2,3})\d+
Updated RegEx Demo
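A short sketch of this pattern in use, assuming Python's re module (which accepts the (?P<name1>...) syntax as written). The helper name prefix is made up for the example.

```python
import re

PREFIX = re.compile(
    r"K'(?P<name1>81|61|64|44|86|678|41|49|33|685|1(?:33|45)?|\d{2,3})\d+"
)


def prefix(value):
    """Return the prefix captured in name1, or None if there is no match."""
    match = PREFIX.match(value)
    return match.group("name1") if match else None


for sample in ["K'8134567", "K'1234567890", "K'1454567890", "K'13345678901"]:
    print(sample, "->", prefix(sample))
```

The optional group (?:33|45)? is greedy, so 145 and 133 are preferred over a bare 1 whenever those digits are present.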
Try with:
K'(?|(?P<name1>81)\d+|(61)\d+|(64)\d+|(1(?:45|33)?)\d+|(44)\d+|(86)\d+|(678)\d+|(41)\d+|(49)\d+|(33)\d+|(685)\d+|(\d{1,3})\d+)
DEMO
Regex doesn't sort anything, but the order of the alternatives in your regex is important. The details vary with the regex engine, but since most engines are traditional NFA (backtracking) engines that try alternatives from left to right, the order matters.
In this case you can simply use the following regex, or add it to yours:
(?<=K')1(?:45|33)?
See demo https://regex101.com/r/rT2yJ0/1
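To make the point about ordering concrete, here is a small sketch using Python's re module, which is a backtracking engine of the kind described. The variable names are made up for the example.

```python
import re

subject = "K'1454567890"

# The first alternative that matches wins, even if a later one is longer.
short_first = re.search(r"(?<=K')(?:1|145)", subject).group()

# Listing the longer alternative first changes the result.
long_first = re.search(r"(?<=K')(?:145|1)", subject).group()

# The regex from this answer avoids the ordering issue by making the
# extra digits an optional, greedy tail.
optional_tail = re.search(r"(?<=K')1(?:45|33)?", subject).group()

# When the tail is absent, only the leading 1 is returned.
plain_one = re.search(r"(?<=K')1(?:45|33)?", "K'1234567890").group()
```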
I have a set of data that I want to extract from. Currently, I only want to extract the lines similar to 2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC where I am using the regex
(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\s+(.+)(?!EMPTY)
However, I do not want to get the line that contains EMPTY. I have tried the regex at regex101 but it seems like it still matches the line that contains the string EMPTY.
Also, is there any way to shorten the regex? I have tried (\d+)\s+(\S+)\s+(\w+)\d+(.+)(?!EMPTY) but then it captures everything from A (under the header Rev) all the way to the end of the line. Some of my other attempts have also captured blank spaces at the end. I have only used (?!) once, so I'm not sure if I can use it twice. Any help on this?
CATALYST_TH 1
BACKPLANE A
#Slot Type Serial # Rev Num Date XptA XptB Name
2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC
6 879-857-01 0x0253bb0 A 0 # 9517-0 15 16 PMM CC-01
7 000-000-00 0x0000000 P0 0 # 0000-0 13 14 EMPTY
8 000-000-00 0x0000000 P0 0 # 0000-0 11 12 EMPTY
9 000-000-00 0x0000000 P0 0 # 0000-0 9 10 EMPTY
10 000-000-00 0x0000000 P0 0 # 0000-0 7 8 EMPTY
20 000-000-00 0x0000000 P0 0 # 0000-0 37 38 EMPTY
21 000-000-00 0x0000000 P0 0 # 0000-0 39 40 EMPTY
22 000-000-00 0x0000000 P0 0 # 0000-0 41 42 EMPTY
23 000-000-00 0x01a2446 P0 0 # 0000-0 43 44 EMPTY
1 949-669-00 0x026a850 B 0 # 0809-0 3 0 HAS (Left HAS LA669-00)
13 949-668-00 0x200762d A 0 # 9530-0 0 0 CATALYST HAC
12 949-667-00 0x026a4ee D 0 # 0102-0 0 0 DIF
24 949-669-01 0x2006037 B 0 # 9717-0 4 0 HAS (Right HAS LA669-01)
END
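Put .+ or .* after the negative lookahead. The word boundary added before the negative lookahead is also essential: without it, \d+ could give back a digit so that the lookahead is tested in the middle of the number, where it succeeds.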
(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\b(?!\h+EMPTY\b)\s*(.*)
DEMO
You can use multiline mode and the following updated regex:
/(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+(?:\d+\s+){2}((?!.*EMPTY\b).+)$/m
See demo
The negative lookahead (?!.*EMPTY\b) in ((?!.*EMPTY\b).+) checks that no EMPTY (followed by a word boundary) occurs anywhere in the rest of the line after the previous subpattern.
It is difficult to shorten your regex since there is only 1 repetitive pattern \d+\s+ that we can shorten as (?:\d+\s+){2}.
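Here is a minimal sketch of that regex in Python. As an assumption for the example, it is applied one line at a time, so multiline mode is not needed and \s+ cannot run across line breaks. The names ROW, SAMPLE and parse_rows are made up, and SAMPLE is a shortened copy of the data from the question.

```python
import re

ROW = re.compile(
    r"(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+"
    r"(?:\d+\s+){2}((?!.*EMPTY\b).+)$"
)

SAMPLE = """#Slot Type Serial # Rev Num Date XptA XptB Name
2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC
6 879-857-01 0x0253bb0 A 0 # 9517-0 15 16 PMM CC-01
7 000-000-00 0x0000000 P0 0 # 0000-0 13 14 EMPTY
10 000-000-00 0x0000000 P0 0 # 0000-0 7 8 EMPTY
1 949-669-00 0x026a850 B 0 # 0809-0 3 0 HAS (Left HAS LA669-00)
END"""


def parse_rows(text):
    """Return (slot, type, serial, name) for every non-EMPTY data line."""
    rows = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            rows.append(match.groups())
    return rows


rows = parse_rows(SAMPLE)
```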
Use negative lookahead at the start:
^(?!.*EMPTY\s*$)\s+(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\s+(.+)
I used your regex and prepended ^(?!.*EMPTY\s*$)\s+. The reason is that the negative lookahead has to be anchored to something; otherwise .+ consumes EMPTY before the lookahead is tried, and the lookahead then succeeds even though the line ends in EMPTY. Here I anchored it to the beginning of the string.
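The difference between the two lookahead positions can be seen with a deliberately simplified sketch in Python. The two patterns below are cut-down stand-ins for the full regexes (only the last three columns, and no leading \s+ because the sample lines have no indentation); the names TRAILING and LEADING are made up for the example.

```python
import re

# Lookahead at the end: .+ has already consumed "EMPTY" by the time the
# lookahead runs, so nothing is left for it to reject.
TRAILING = re.compile(r"(\d+)\s+(\d+)\s+(.+)(?!EMPTY)")

# Lookahead anchored at the start: the whole line is inspected before
# anything is consumed.
LEADING = re.compile(r"^(?!.*EMPTY\s*$)(\d+)\s+(\d+)\s+(.+)")

empty_line = "13 14 EMPTY"
named_line = "23 24 PLFD CC"

trailing_on_empty = TRAILING.search(empty_line)  # still matches
leading_on_empty = LEADING.search(empty_line)    # rejected
leading_on_named = LEADING.search(named_line)    # matches as intended
```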
I have text that usually (but not always) looks like this:
- 30 30: 0 4 58 E
and that must be
- 30 30
: 0 4 58 E
or, in another case
- 32 32
: 0 2 63 All
must remain as it is
- 32 32
: 0 2 63 All
So any : must always be on the next line.
Is there a regex for fixing every case of this (so that it only does this when the : isn't already on a new line)?
I'm using Sublime Text as my editor.
When the ":" is already on a new line, it must not be given another one.
Then you want to use a negative lookbehind:
(?<!\n):
Replace that with \n:.
If lookbehind is not supported, you also could match colons that follow digits: Replace (\d): with $1\n: - using a capturing group.
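Both replacements can be tried outside the editor with a short sketch. It assumes Python's re module instead of Sublime Text, so the group reference is written \1 where the replacement above uses $1. The function names are made up for the example.

```python
import re


def colon_to_new_line(text):
    """Put a newline before every colon not already preceded by one."""
    return re.sub(r"(?<!\n):", "\n:", text)


def colon_after_digit_to_new_line(text):
    """Variant without lookbehind: only handles colons that follow a digit."""
    return re.sub(r"(\d):", r"\1\n:", text)


same_line = "- 30 30: 0 4 58 E"
already_split = "- 32 32\n: 0 2 63 All"

fixed = colon_to_new_line(same_line)
untouched = colon_to_new_line(already_split)
```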