I have a set of data that I want to extract fields from. Currently, I only want to extract lines similar to 2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC, for which I am using the regex
(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\s+(.+)(?!EMPTY)
However, I do not want to match the lines that contain EMPTY. I tried the regex at regex101, but it still matches the lines that contain the string EMPTY.
Also, is there any way to shorten the regex? I tried (\d+)\s+(\S+)\s+(\w+)\d+(.+)(?!EMPTY), but then it captures from A (under the header Rev) all the way to the end of the line. Some of my other attempts also captured blank spaces at the end. I have used (?!) once, so I'm not sure whether I can use it twice. Any help on this?
CATALYST_TH 1
BACKPLANE A
#Slot Type Serial # Rev Num Date XptA XptB Name
2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC
6 879-857-01 0x0253bb0 A 0 # 9517-0 15 16 PMM CC-01
7 000-000-00 0x0000000 P0 0 # 0000-0 13 14 EMPTY
8 000-000-00 0x0000000 P0 0 # 0000-0 11 12 EMPTY
9 000-000-00 0x0000000 P0 0 # 0000-0 9 10 EMPTY
10 000-000-00 0x0000000 P0 0 # 0000-0 7 8 EMPTY
20 000-000-00 0x0000000 P0 0 # 0000-0 37 38 EMPTY
21 000-000-00 0x0000000 P0 0 # 0000-0 39 40 EMPTY
22 000-000-00 0x0000000 P0 0 # 0000-0 41 42 EMPTY
23 000-000-00 0x01a2446 P0 0 # 0000-0 43 44 EMPTY
1 949-669-00 0x026a850 B 0 # 0809-0 3 0 HAS (Left HAS LA669-00)
13 949-668-00 0x200762d A 0 # 9530-0 0 0 CATALYST HAC
12 949-667-00 0x026a4ee D 0 # 0102-0 0 0 DIF
24 949-669-01 0x2006037 B 0 # 9717-0 4 0 HAS (Right HAS LA669-01)
END
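Put .+ or .* after the negative lookahead. The word boundary added before the negative lookahead is also essential: without it, \d+ can backtrack to a shorter number, so the lookahead is tested in the middle of the digits, where it passes, and the EMPTY line matches anyway.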
(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\b(?!\h+EMPTY\b)\s*(.*)
You can use multiline mode and the following updated regex:
/(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+(?:\d+\s+){2}((?!.*EMPTY\b).+)$/m
The negative lookahead (?!.*EMPTY\b) in ((?!.*EMPTY\b).+) fails the match if the word EMPTY occurs anywhere in the rest of the line after the preceding subpattern.
It is difficult to shorten your regex: the only repeated subpattern is \d+\s+, which can be written as (?:\d+\s+){2}.
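If you want to try this outside regex101, here is a minimal Python sketch of the multiline pattern, run against a few rows copied from the question. The names SAMPLE, PATTERN and extract_rows are mine, and Python's re module is only a stand-in for whatever engine you actually use, so treat it as an illustration rather than a drop-in.

```python
import re

# A few rows copied from the question's data (not the full table).
SAMPLE = """\
#Slot Type Serial # Rev Num Date XptA XptB Name
2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC
6 879-857-01 0x0253bb0 A 0 # 9517-0 15 16 PMM CC-01
7 000-000-00 0x0000000 P0 0 # 0000-0 13 14 EMPTY
1 949-669-00 0x026a850 B 0 # 0809-0 3 0 HAS (Left HAS LA669-00)
END
"""

# The pattern from the answer, without the /.../m delimiters;
# re.MULTILINE plays the role of the m flag.
PATTERN = re.compile(
    r"(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+(?:\d+\s+){2}((?!.*EMPTY\b).+)$",
    re.MULTILINE,
)


def extract_rows(text):
    """Return (slot, type, serial, name) tuples for every row that is not EMPTY."""
    return [m.groups() for m in PATTERN.finditer(text)]


for row in extract_rows(SAMPLE):
    print(row)
```

With this sample it prints the three populated rows and skips the EMPTY one.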
Use negative lookahead at the start:
^(?!.*EMPTY\s*$)\s+(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\s+(.+)
I used your regex and prepended ^(?!.*EMPTY\s*$)\s+. The negative lookahead has to be anchored to something; otherwise .+ consumes EMPTY before the lookahead is tested, and the lookahead has no effect even though EMPTY is at the end of the line. Here I anchored it to the beginning of the string.
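To see the difference between a lookahead placed after (.+) and one anchored at the start, here is a small Python sketch. It is only an illustration: the variable names are mine, and I use \s* instead of \s+ after the lookahead on the assumption that the lines, as pasted in the question, have no leading whitespace.

```python
import re

EMPTY_LINE = "7 000-000-00 0x0000000 P0 0 # 0000-0 13 14 EMPTY"
FILLED_LINE = "2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC"

# Shared part of the question's regex, up to the Name column.
PREFIX = r"(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\s+"

# Original attempt: (.+) swallows "EMPTY" first, then the lookahead is
# tested at the end of the line, where nothing follows, so it passes.
trailing = re.compile(PREFIX + r"(.+)(?!EMPTY)")

# Lookahead anchored at the start: it scans the whole line before
# anything has been consumed. (\s* is my assumption, see above.)
leading = re.compile(r"^(?!.*EMPTY\s*$)\s*" + PREFIX + r"(.+)")

print(trailing.search(EMPTY_LINE).group(4))   # EMPTY
print(leading.search(EMPTY_LINE))             # None
print(leading.search(FILLED_LINE).group(4))   # PLFD CC
```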
I'm trying to create a regex that
validates only strings of the following form:
the first 2 characters are digits 0 to 9
the next 3 characters are 000 to 366 or 501 to 866
the next 4 characters are digits 0 to 9
the last character is x or X
So I wrote the following:
([0-9]{2}(001|002|003|004|005|006|007... |366|501|...|866)[0-9]{4}[x|X]$)
This correctly validates 900033618x,
but it also accepts 9001033618x as valid.
How can I restrict the length to 9 characters before the x or X?
I tried the following:
(^\d[[0-9]{2}(001|002|003|004|005|006|007... |366|501|...|866)[0-9]{4}]{9}[x|X]$)
but this does not work.
Just do what you say:
^[0-9]{2}(?:[0-2][0-9]{2}|3[0-5][0-9]|36[0-6]|50[1-9]|5[1-9][0-9]|[67][0-9]{2}|8[0-5][0-9]|86[0-6])[0-9]{4}[xX]$
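Note the leading caret ^. Without it, the pattern is free to start matching at the second character, which is why 9001033618x was accepted.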
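As a quick check, here is a Python sketch that spells the two ranges out digit by digit (000 to 366 and 501 to 866, as described in the question) and compares the anchored pattern with an unanchored one. LOW, HIGH, BODY, ANCHORED and UNANCHORED are names I made up for the example.

```python
import re

# 000-366, spelled out by leading digit(s).
LOW = r"(?:[0-2][0-9]{2}|3[0-5][0-9]|36[0-6])"
# 501-866, spelled out the same way.
HIGH = r"(?:50[1-9]|5[1-9][0-9]|[67][0-9]{2}|8[0-5][0-9]|86[0-6])"

BODY = r"[0-9]{2}(?:" + LOW + "|" + HIGH + r")[0-9]{4}[xX]$"

ANCHORED = re.compile("^" + BODY)
UNANCHORED = re.compile(BODY)  # same pattern, but without the caret

print(bool(ANCHORED.search("900033618x")))     # True
print(bool(ANCHORED.search("9001033618x")))    # False
print(bool(UNANCHORED.search("9001033618x")))  # True: matches from the 2nd character
```

The last line reproduces the problem from the question: without the caret the engine simply skips the first digit.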
VB2010: Using regex, I can't seem to get this seemingly easy pattern to work. I first look for a line with the keyword TRIPS that holds my data, and from that line I want to extract repeated groups of data, each made up of an alpha code followed by a number.
MODES 1 0 0
OVERH X 28 H 0 Z 198
TRIPS X 23 D 1 Z 198
ITEMSQ 1 0 0
COSTU P 16 E 180
CALLS 0 0
I have
^TRIPS (?<grp>[A-Z]\s{1,4}\d{1,3})
Which gives me one match and the first group "X 23". So I extend it by allowing it to match up to 4 groups.
^TRIPS (?<grp>[A-Z]\s{1,4}\d{1,3}){0,4}
but I get one match with still only one group.
You aren't allowing for white space between the groups. You need to do something like this:
^TRIPS ((?<grp>[A-Z]\s{1,4}\d{1,3})\s+){0,4}
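For comparison, here is how the same extraction might look in Python. This is a different technique from the single-regex approach above, not a translation of it: Python's re module keeps only the last repetition of a repeated group, so the sketch first finds the TRIPS line and then collects the code/number pairs with findall. LINE, PAIR and trips_pairs are names invented for the example.

```python
import re

SAMPLE = """\
MODES 1 0 0
OVERH X 28 H 0 Z 198
TRIPS X 23 D 1 Z 198
ITEMSQ 1 0 0
COSTU P 16 E 180
CALLS 0 0
"""

# Step 1: locate the TRIPS line and capture everything after the keyword.
LINE = re.compile(r"^TRIPS[ \t]+(.*)$", re.MULTILINE)
# Step 2: one alpha code, 1-4 whitespace characters, then a 1-3 digit number.
PAIR = re.compile(r"([A-Z])\s{1,4}(\d{1,3})")


def trips_pairs(text):
    """Return [(code, number), ...] from the TRIPS line, or [] if there is none."""
    line = LINE.search(text)
    if line is None:
        return []
    return [(code, int(num)) for code, num in PAIR.findall(line.group(1))]


print(trips_pairs(SAMPLE))  # [('X', 23), ('D', 1), ('Z', 198)]
```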
Note: this question is an outcome of another answer whose comments have all since been removed.
When a lookaround construct is used within a regex, a backtrack, or something like one, takes place right before the closing bracket. As far as I know, this backtrack shows up in the output of the Perl and PCRE debuggers.
The question is: what is this backtrack, why is it there, and why is it reported as a backtrack?
The backtrack is a lie.
It's just a consequence of how the regex101 debugger is implemented. It uses a PCRE feature (flag) called PCRE_AUTO_CALLOUT. This flag tells the PCRE engine to invoke a user-defined function at every step of matching. This function receives the current match status as input.
The catch is that PCRE doesn't tell the callout when it really backtracks. Regex101 has to infer that from the match status.
As you can see, in the step before the "backtrack" occurs, the current matched text is a_, and just after you get out of the lookahead, it's reverted to a. Regex101 notices the matched text is shorter and therefore it infers that a backtrack must have happened, with the confusing outcome you noticed.
For reference, here's the internal PCRE representation of the pattern with auto-callout enabled:
$ pcretest
PCRE version 8.38 2015-11-23
re> /a(?=_)_b/DC
------------------------------------------------------------------
0 59 Bra
3 Callout 255 0 1
9 a
11 Callout 255 1 5
17 17 Assert
20 Callout 255 4 1
26 _
28 Callout 255 5 0
34 17 Ket
37 Callout 255 6 1
43 _
45 Callout 255 7 1
51 b
53 Callout 255 8 0
59 59 Ket
62 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options:
First char = 'a'
Need char = 'b'
As you can see, there's no branching opcode there, just an Assert.
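The zero-width behaviour that misleads the debugger is easy to observe directly. The sketch below uses Python's re module rather than PCRE, on the assumption that lookahead semantics are the same in both for this pattern: the text matched inside the lookahead is not part of the overall match, so the match position goes back to where it was before the assertion without any alternative being tried.

```python
import re

# The lookahead checks that "_" follows, but consumes nothing.
m = re.match(r"a(?=_)", "a_b")
print(repr(m.group(0)), m.end())  # 'a' 1

# The "_" is then matched for real by the next part of the pattern.
full = re.match(r"a(?=_)_b", "a_b")
print(repr(full.group(0)))        # 'a_b'
```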
In the sample text below, I want to match groups of text (newlines and all) starting with a line defined by \nI.*' and including the subsequent lines starting with \nA, only if none of the intermediate lines contains "BOM=". I.e. in the example, I would want to match the first "device" and its following attributes, but not the second device, as shown in my comments (after #s).
I 657 device:THAT 2 1290 400 0 1 ' # Start matching here because no lines have "BOM="
A 1335 425 12 0 5 0 some text
A 1335 455 12 0 5 0 some text
A 1300 440 12 0 9 3 some text
A 1370 375 12 0 3 0 some text # Finish matching here
C 655 1 3 0
A 1370 450 12 0 3 3 #=2
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
I 318 device:THIS 2 300 1840 0 1 ' # Do not match again here because there's a line with "BOM="
A 320 1880 12 0 7 3 some text
A 320 1880 12 0 9 3 some text
A 380 1880 12 0 1 1 BOM=1,2
A 345 1865 12 0 5 0 some text
A 380 1830 12 0 3 0 some text
C 666 1 3 0
In the sample text, "some text" is various descriptors for electrical devices, e.g. "RATING=63MW", "REFDES=R123". It may contain whitespace but not newlines.
The furthest I've gotten yet is the expression
((\n|^)I((?!misc).)*?'\n)((A.*\n)*(A.*BOM=.*\n)(A.*\n)*)
which matches the opposite of what I want, i.e. it finds the text blocks that DO contain BOM=. I thought I could switch this by changing (A.*BOM=.*\n) to (?!(A.*BOM=.*\n)) but this did not work.
I'm hoping to use this in Notepad++ when I'm done.
You can perhaps try this regex:
^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*
regex101 demo
I added a third block in which BOM= sits on a line starting with C. That device is still matched, because BOM= is not on one of the consecutive lines beginning with A.
In Notepad++, ^ matches at the start of every line by default, so (^|\n) is usually unnecessary, but you can put it back if you need it.
I also kept (?:(?!misc).)* because you had it in your expression, although it has nothing to do with your sample data.
(?!(?:A.*\n)*?A.*BOM=) is what's making the match fail when there's a BOM= in the lines. It's a negative lookahead which will prevent a match only if A.*BOM= matches after any number of lines of (?:A.*\n)*? (i.e. lines beginning with A).
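If you want to check the behaviour outside Notepad++, here is a Python sketch using the same pattern with re.MULTILINE. The sample is a shortened version of the text in the question with the # comments removed and "some text" replaced by the example descriptors; SAMPLE, PATTERN and blocks are names made up for the illustration, and Python's engine is only a stand-in for the one in Notepad++.

```python
import re

SAMPLE = (
    "I 657 device:THAT 2 1290 400 0 1 '\n"
    "A 1335 425 12 0 5 0 RATING=63MW\n"
    "A 1335 455 12 0 5 0 REFDES=R123\n"
    "C 655 1 3 0\n"
    "A 1370 450 12 0 3 3 #=2\n"
    "I 318 device:THIS 2 300 1840 0 1 '\n"
    "A 320 1880 12 0 7 3 some text\n"
    "A 380 1880 12 0 1 1 BOM=1,2\n"
    "A 345 1865 12 0 5 0 some text\n"
    "C 666 1 3 0\n"
)

# re.MULTILINE makes ^ match at every line start, as Notepad++ does.
PATTERN = re.compile(
    r"^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*",
    re.MULTILINE,
)

blocks = [m.group(0) for m in PATTERN.finditer(SAMPLE)]
print(len(blocks))  # 1
print(blocks[0])    # the THAT device line plus its two A lines
```

Only the first device is returned; the second is rejected because one of its A lines contains BOM=.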
675185538end432 204 9/9 4709 908 2
343269172end430 3 43 9335 975 7
590144128end89 7 29 3-5-4 420 2
337460105end8Y5 7A 78 2 23
292484648end70 A53 03 9235 93
These are the strings that I am working with. I want to find a regex that rewrites them as follows:
675185538
432 204 9/9 4709 908 2
343269172
430 3 43 9335 975 7
590144128
89 7 29 3-5-4 420 2
337460105
8Y5 7A 78 2 23
292484648
70 A53 03 9235 93
Wherever end occurs, \r\n should be inserted in its place.
The string before end is numeric, and the string after end is alphanumeric with whitespace characters.
I am using Notepad++.
To make the match strict, try this:
Find: ^(\d+)end(\w)
Replace: \1\r\n\2
This captures, then puts back via back references, the preceding number between start of line and "end" and the following digit/letter. This won't match "end" elsewhere.
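The same find/replace pair can be tried in Python, which is handy for checking the result before running it on a real file. This is only a sketch: SPLIT and split_at_end are names I chose, and two of the question's lines serve as input.

```python
import re

LINES = [
    "675185538end432 204 9/9 4709 908 2",
    "337460105end8Y5 7A 78 2 23",
]

# Same pattern as the Find field: digits from the start of the line,
# the literal "end", then one word character.
SPLIT = re.compile(r"^(\d+)end(\w)")


def split_at_end(line):
    """Replace the 'end' marker with CR LF, keeping the text on both sides."""
    return SPLIT.sub(r"\1\r\n\2", line)


for line in LINES:
    print(repr(split_at_end(line)))
```

A line with no leading number before end is returned unchanged, which is the strictness the anchored pattern is meant to provide.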
Kludgery:
Find (\d\d\d\d\d\d\d\d\d)end(\d)
Replace \1\r\n\2
Find creates two capture groups:
each group is bounded by an ( and a )
one capture group matches exactly nine numerals
the other capture group matches exactly one numeral.
In the replace:
the first capture group is referenced with \1
and the second group with \2.