RegEx: Reject sub-portion of complicated expression

In the sample text below, I want to match groups of text (newlines and all) starting with a line defined by \nI.*' and including the subsequent lines starting with \nA, only if none of the intermediate lines contains "BOM=". I.e. in the example, I would want to match the first "device" and its following attributes, but not the second device, as shown in my comments (after #s).
I 657 device:THAT 2 1290 400 0 1 ' # Start matching here because no lines have "BOM="
A 1335 425 12 0 5 0 some text
A 1335 455 12 0 5 0 some text
A 1300 440 12 0 9 3 some text
A 1370 375 12 0 3 0 some text # Finish matching here
C 655 1 3 0
A 1370 450 12 0 3 3 #=2
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
I 318 device:THIS 2 300 1840 0 1 ' # Do not match again here because there's a line with "BOM="
A 320 1880 12 0 7 3 some text
A 320 1880 12 0 9 3 some text
A 380 1880 12 0 1 1 BOM=1,2
A 345 1865 12 0 5 0 some text
A 380 1830 12 0 3 0 some text
C 666 1 3 0
In the sample text, "some text" is various descriptors for electrical devices, e.g. "RATING=63MW", "REFDES=R123". It may contain whitespace but not newlines.
The furthest I've gotten yet is the expression
((\n|^)I((?!misc).)*?'\n)((A.*\n)*(A.*BOM=.*\n)(A.*\n)*)
which matches the opposite of what I want, i.e. it finds the text blocks that DO contain BOM=. I thought I could switch this by changing (A.*BOM=.*\n) to (?!(A.*BOM=.*\n)), but this did not work.
I'm hoping to use this in Notepad++ when I'm done.

You can perhaps try this regex:
^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*
regex101 demo
I added a third block where BOM= is instead on a line starting with C. That device is still matched, because BOM= is not on one of the consecutive lines beginning with A.
In Notepad++, ^ matches at the start of every line by default, so it's usually not necessary to have (^|\n), but you can put it back if you need it.
I also kept (?:(?!misc).)* in because you had it in your expression, although it has no effect on your sample data.
(?!(?:A.*\n)*?A.*BOM=) is what's making the match fail when there's a BOM= in the lines. It's a negative lookahead which will prevent a match only if A.*BOM= matches after any number of lines of (?:A.*\n)*? (i.e. lines beginning with A).
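To see the lookahead at work outside Notepad++, here is a minimal Python sketch. It assumes Python's re module with re.MULTILINE standing in for Notepad++'s default line anchoring; the shortened sample and the third device (OTHER) are made up for illustration.

```python
import re

# Regex from the answer; re.MULTILINE makes ^ match at every line start.
BLOCK_RE = re.compile(
    r"^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*",
    re.MULTILINE,
)

# Hypothetical, shortened sample: the second device has BOM= on an A line,
# the third has BOM= only on a C line.
SAMPLE = (
    "I 657 device:THAT 2 1290 400 0 1 '\n"
    "A 1335 425 12 0 5 0 RATING=63MW\n"
    "A 1335 455 12 0 5 0 REFDES=R123\n"
    "C 655 1 3 0\n"
    "I 318 device:THIS 2 300 1840 0 1 '\n"
    "A 320 1880 12 0 7 3 some text\n"
    "A 380 1880 12 0 1 1 BOM=1,2\n"
    "C 666 1 3 0\n"
    "I 100 device:OTHER 2 10 20 0 1 '\n"
    "A 10 20 12 0 5 0 some text\n"
    "C 1 2 3 BOM=3\n"
)

matches = [m.group(0) for m in BLOCK_RE.finditer(SAMPLE)]
for block in matches:
    # Print only the header line of each matched device block.
    print(block.splitlines()[0])
```

Only the THAT and OTHER devices should be reported; THIS is skipped because one of its A lines contains BOM=.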

Related

TCL: How the regex for every line should look like?

In TCL, in output I have something like this:
ABBAA 1 BAABA 1 DNS3 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS1 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS7 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS5 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS3 0 0 200 300 400 500 0 0
I would like to sort this table-like dataset by the fourth column ascending (so the first row will be the one with DNS1UP1, then DNS2UP2, etc.). I figured a regexp would be the easiest method, by looking for a string with "DNS.." in it. But my method doesn't work exactly as I thought, because it matches only one line or no line at all.
My method:
regexp "ABB.*DNS1.*?\n"
ABB - match the beginning of a new line
.* - every character between ABB and DNS..
DNS1 - match the word I'm looking for
.* - every character between DNS... and the newline symbol
?\n - non-greedy occurrence of a newline
Where am I wrong?
If you have a list of lines in such a regular format, you can just lsort them… with the right options. In particular, -dictionary is good for mixed text/numbers and -index 4 lets you choose the column to sort by (indices are zero-based, so 4 is the column holding the DNS values).
set sortedLines [lsort -index 4 -dictionary $unsortedLines]
The only possible reasonable use of regexp in this would have been in preparing the data for the sort, but that string which you provided is already sortable (assuming you've done a split $data "\n" on it to actually convert it into a list of lines and are not just using a big ol' string).
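For readers without a Tcl shell at hand, the same idea (split into rows, then sort on one column instead of using a regexp) can be sketched in Python. This is only a rough analogue: natural_key is a hypothetical helper that approximates -dictionary ordering, not a reimplementation of it.

```python
import re

def natural_key(text):
    """Split into digit and non-digit runs so 'DNS10' sorts after 'DNS9'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", text)]

data = """ABBAA 1 BAABA 1 DNS3 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS1 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS7 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS5 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS3 0 0 200 300 400 500 0 0"""

# One list of fields per line, like Tcl's list of lists.
rows = [line.split() for line in data.splitlines()]

# Zero-based index 4 is the DNS column; sorted() is stable, so rows with
# equal keys keep their original relative order.
sorted_rows = sorted(rows, key=lambda row: natural_key(row[4]))

for row in sorted_rows:
    print(" ".join(row))
```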

Regex for 10 digit phone number with variable spacing

I need to validate that a string follows these rules:
contains numerals
may optionally contain any number of space characters in any position
may not contain any other kind of character
the first two numerals must be one of the set: 02; 03; 07; 08; 13; 18
and the number of numerals must be exactly 10 unless the first two numerals are 1 and 3, in which case the number of numerals may be 10 or 6.
Essentially these are Australian landline (with area code), free-call and 13 numbers.
Ideally the regex should be as implementation-agnostic as possible.
Examples of valid input:
0299998888
02 99998888
02 9999 8888
02 99 998 888
0299 998 888
0299 998888
131999
131 999
13 19 99
1300123456
1300 123456
1300 123 456
1300 12 34 56
1300 12 34 56
PS. I've checked at least 5 other answers and searched for multiple variations of this question, to no avail.
The nearest I have is:
^(?=\d{10}$)(02|03|04|07|08|13|18)\d+
... however this does not account for spacing and won't accept 6 digit numbers beginning with 13.
Note, in theory, the following is acceptable:
1 3 1999
1 3 1 9 9 9
By this I mean that first pair of numerals may have a space between them (as bad as that looks).
Following are examples of random numbers that should fail:
13145 (not enough numerals)
1300-123-456 (hyphens not permitted)
9999 8888 (not enough numerals)
(02) 9999 8888 (parentheses not permitted)
You can make a separate pattern for 13 in alternation:
^(?:(?=(?:\s*\d\s*){10}$)(?:0\s*[2378]|1\s*[38])|(?=(?:\s*\d\s*){6}$)1\s*3).*
Demo: https://regex101.com/r/Hkjus2/2
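A quick way to check the pattern against the examples from the question is a small Python sketch like the one below. It assumes Python's re module; is_valid and the two lists are names made up for this illustration.

```python
import re

# Pattern from the answer, split over two string literals for readability.
PHONE_RE = re.compile(
    r"^(?:(?=(?:\s*\d\s*){10}$)(?:0\s*[2378]|1\s*[38])"
    r"|(?=(?:\s*\d\s*){6}$)1\s*3).*"
)

def is_valid(number):
    """Return True if the string passes the answer's regex."""
    return PHONE_RE.match(number) is not None

VALID = [
    "0299998888", "02 9999 8888", "0299 998 888",
    "131999", "13 19 99", "1 3 1 9 9 9", "1300 12 34 56",
]
INVALID = [
    "13145",           # not enough numerals
    "1300-123-456",    # hyphens not permitted
    "9999 8888",       # not enough numerals
    "(02) 9999 8888",  # parentheses not permitted
]

for number in VALID + INVALID:
    print(repr(number), is_valid(number))
```

The lookaheads count the digits (10, or 6 for the 13 branch) while tolerating spaces anywhere, and the part after each lookahead checks the two-digit prefix.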

Matching across multiple lines regular expression

I have several lists in a single text file that look like the one below. Each always starts with 0 and always ends with the word Unique at the start of a new line. I would like to get rid of all of it apart from the line with Unique on it. I looked through Stack Overflow and tried the following, but it returns the whole text file (there are other strings in the file that I haven't put in this example). Basically, the problem is how to account for the newlines in the regex selection:
^0(.|\n)*
Input:
0 145
1 139
2 175
3 171
4 259
5 262
6 293
7 401
8 430
9 417
10 614
11 833
12 1423
13 3062
14 10510
15 57587
16 5057575
17 10071
18 375
19 152
20 70
21 55
22 46
23 31
24 25
25 22
26 25
27 14
28 16
29 16
30 8
31 10
32 8
33 21
34 8
35 51
36 65
37 605
38 32
39 2
40 1
41 2
44 1
48 2
51 1
52 1
57 1
63 2
68 1
82 1
94 1
95 1
101 3
102 7
103 1
110 1
111 1
119 1
123 1
129 2
130 3
131 2
132 1
135 1
136 2
137 7
138 4
Unique: 252851
Expected output:
Unique: 252851
You need to use something like
^0[\s\S]*?[\n\r]Unique:
and replace with Unique:.
^ - start of a line
0 - a literal 0
[\s\S]*? - zero or more characters incl. a newline as few as possible
[\n\r] - a linebreak symbol
Unique: - a whole word Unique:
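The breakdown above can be tried out with a short Python sketch. It assumes Python's re module with re.MULTILINE so that ^ matches at line starts; the header and footer lines are invented to stand in for the other strings in the file.

```python
import re

# Shortened, hypothetical input: one list surrounded by unrelated lines.
text = (
    "some header line\n"
    "0 145\n"
    "1 139\n"
    "2 175\n"
    "Unique: 252851\n"
    "some footer line\n"
)

# Lazy [\s\S]*? stops at the first line break followed by "Unique:".
pattern = re.compile(r"^0[\s\S]*?[\n\r]Unique:", re.MULTILINE)

# "Unique:" is consumed by the match, so it is put back by the replacement.
cleaned = pattern.sub("Unique:", text)
print(cleaned)
```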
Another possible regex is:
^0[^\r]*(?:\r(?!Unique:)[^\r]*)*
where \r is the line endings in the current file. Replace with an empty string.
Note that you could also use the regex (?m)^0.*?[\r\n]Unique: (to replace with Unique:) in a flavor where the m option makes the dot match newlines (as in Ruby; most other flavors use (?s) for that):
m: multi-line (dot (.) matches newline)
Your method of matching newlines should work, although it's not optimal (alternation is rather slow); the next problem is to make sure the match stops before Unique:
(?s)^0.*(?=Unique:)
should work if there is only one Unique: in your file.
Explanation:
(?s) # Start "dot matches all" mode (including newlines)
^0 # Match "0" at the start of the file
.* # Match as many characters as possible
(?=Unique:) # but then backtrack until you're right before "Unique:"
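The "only one Unique:" caveat is easy to demonstrate with a Python sketch. It assumes Python's re module; the two tiny inputs are made up, and the lazy variant at the end is my own adjustment rather than something proposed in the answers.

```python
import re

# Greedy version from the answer: .* runs to the end of the text and
# backtracks to the LAST "Unique:".
greedy = re.compile(r"(?s)^0.*(?=Unique:)")

one_list = "0 145\n1 139\nUnique: 252851\n"
two_lists = "0 145\nUnique: 252851\n0 7\nUnique: 9\n"

print(repr(greedy.sub("", one_list)))    # keeps the single Unique: line
print(repr(greedy.sub("", two_lists)))   # the first Unique: line is lost

# Assumed variant: lazy quantifier plus multiline ^, so every list is
# trimmed up to its own Unique: line.
lazy = re.compile(r"(?ms)^0.*?(?=Unique:)")
print(repr(lazy.sub("", two_lists)))
```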

Regex matching lines not containing word EMPTY

I have a set of data that I want to extract from. Currently, I only want to extract the lines similar to 2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC where I am using the regex
(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\s+(.+)(?!EMPTY)
However, I do not want to get the lines that contain EMPTY. I have tried the regex at regex101, but it seems like it still matches the lines that contain the string EMPTY.
Also, is there any way to shorten the regex? I have tried (\d+)\s+(\S+)\s+(\w+)\d+(.+)(?!EMPTY), but then it captures from A (under the header Rev) all the way to the end of the line. Some of my other attempts have also captured blank spaces at the end. I have used (?!) once, so I'm not sure if I can use it twice. Any help on this?
CATALYST_TH 1
BACKPLANE A
#Slot Type Serial # Rev Num Date XptA XptB Name
2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC
6 879-857-01 0x0253bb0 A 0 # 9517-0 15 16 PMM CC-01
7 000-000-00 0x0000000 P0 0 # 0000-0 13 14 EMPTY
8 000-000-00 0x0000000 P0 0 # 0000-0 11 12 EMPTY
9 000-000-00 0x0000000 P0 0 # 0000-0 9 10 EMPTY
10 000-000-00 0x0000000 P0 0 # 0000-0 7 8 EMPTY
20 000-000-00 0x0000000 P0 0 # 0000-0 37 38 EMPTY
21 000-000-00 0x0000000 P0 0 # 0000-0 39 40 EMPTY
22 000-000-00 0x0000000 P0 0 # 0000-0 41 42 EMPTY
23 000-000-00 0x01a2446 P0 0 # 0000-0 43 44 EMPTY
1 949-669-00 0x026a850 B 0 # 0809-0 3 0 HAS (Left HAS LA669-00)
13 949-668-00 0x200762d A 0 # 9530-0 0 0 CATALYST HAC
12 949-667-00 0x026a4ee D 0 # 0102-0 0 0 DIF
24 949-669-01 0x2006037 B 0 # 9717-0 4 0 HAS (Right HAS LA669-01)
END
Put .+ or .* after the negative lookahead. The word boundary added before the negative lookahead is also much needed.
(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\b(?!\h+EMPTY\b)\s*(.*)
DEMO
You can use multiline mode and the following updated regex:
/(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+(?:\d+\s+){2}((?!.*EMPTY\b).+)$/m
See demo
The negative lookahead (?!.*EMPTY\b) in ((?!.*EMPTY\b).+) checks that the rest of the line after the previous subpattern does not contain EMPTY followed by a word boundary.
It is difficult to shorten your regex since there is only 1 repetitive pattern \d+\s+ that we can shorten as (?:\d+\s+){2}.
Use negative lookahead at the start:
^(?!.*EMPTY\s*$)\s+(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+\#\s+\S+\s+\d+\s+\d+\s+(.+)
I used your regex and prepended ^(?!.*EMPTY\s*$)\s+. The reason is that the negative lookahead has to be anchored to something, or else the text it should inspect will be eaten by .+ and the lookahead will have no effect, even though you have EMPTY at the end. Here I anchored it to the beginning of the string.
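Here is a small Python sketch of both the failure and the anchored fix, applied line by line. It assumes Python's re module; the leading \s+ is relaxed to \s* because the sample lines as shown here have no leading whitespace, and the data is a shortened excerpt.

```python
import re

# Why a trailing lookahead fails: (.+) swallows "EMPTY" first, then
# (?!EMPTY) is tested at the end of the line, where it trivially succeeds.
naive = re.search(r"\d+\s+\d+\s+(.+)(?!EMPTY)", "13 14 EMPTY")
print(naive.group(1))

# Anchored version from the answer (with \s* instead of \s+, see above).
LINE_RE = re.compile(
    r"^(?!.*EMPTY\s*$)\s*(\d+)\s+(\S+)\s+(\w+)\s+\w+\s+\d*\s+#\s+\S+"
    r"\s+\d+\s+\d+\s+(.+)"
)

SAMPLE = """CATALYST_TH 1
BACKPLANE A
#Slot Type Serial # Rev Num Date XptA XptB Name
2 879-858-35 0x0109037 A 0 # 0131-0 23 24 PLFD CC
7 000-000-00 0x0000000 P0 0 # 0000-0 13 14 EMPTY
1 949-669-00 0x026a850 B 0 # 0809-0 3 0 HAS (Left HAS LA669-00)
END"""

# Matching one line at a time keeps \s+ from running across line breaks.
rows = [m.groups() for line in SAMPLE.splitlines()
        if (m := LINE_RE.match(line))]
for row in rows:
    print(row)
```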

Manipulate data in Awk

I am new to Awk programming. I have a question about manipulating a text file, which is needed to draw certain network-based images in a visualization tool (Circos, http://circos.ca).
I have input data whose values I want to manipulate using awk/grep/sed.
There are 9 pairs (18 lines). 5 pairs (the first 10 lines) are for "from=ABCB11", and 4 pairs (the next 8 lines) are for "from=ABCC8". What I want is to extract the values from the first line of the first pair and substitute them into every alternate line of the remaining pairs.
So the value for group-2 is 9 10, which should replace all later occurrences of the value for group-2.
The next value for group-2 is 28 29, which should be replaced by 9 10.
The stopping point should be determined by "from=name", which here is "from=ABCB11". The rows from which the value is captured and in which it is replaced need not belong to group-2 as in this instance; they could be group-3 or group-4, up to group-10. So the second set ("from=ABCC8") could have belonged to group-4/5/6, not necessarily group-2. It is just a coincidence here.
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 28 29 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 29 30 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 10 11 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 11 12 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
Below is the FINAL output,I am looking for:
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
Also, this is just sample data. Many pairs would have group-1, group-4, group-5, up to group-10. Here, only pairs from the lower groups are shown.
I want to loop through the lines for as long as the value in "from=name" remains the same, so that I can change all occurrences in every alternate line. Code:
awk -F, 'NR%2==1 {split($2,a,"="); print a[2]}' file.txt
The above code is able to extract the alternate lines and the "name" in "from=name"
The following is quite verbose (I love verbose variable names). Using your sample data, I get the output you want. This assumes that every odd-numbered line within a block takes its values from the first line with the same "from=xxxx" information.
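awk '
BEGIN {
    namevar = ""    # current "from=..." key
    val1var = ""    # first value remembered for this key
    val2var = ""    # second value remembered for this key
    linenum = 0     # line counter within the current key
}
{
    # Whitespace-separated fields; field 5 is the comma-separated attribute list.
    split($0, linearr)
    split(linearr[5], csvarr, ",")
    # csvarr[2] is "from=NAME"; a new name starts a new block.
    if (namevar != csvarr[2]) {
        namevar = csvarr[2]
        val1var = linearr[2]
        val2var = linearr[3]
        linenum = 0
    }
    linenum += 1
    if (linenum % 2 == 1) {
        # Odd line of the block: substitute the remembered values.
        print linearr[1], val1var, val2var, linearr[4], linearr[5]
    } else {
        # Even line of the block: print it unchanged.
        print linearr[1], linearr[2], linearr[3], linearr[4], linearr[5]
    }
}' file.txt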