Regex to match blocks of text with key phrases in the middle

Regex to match blocks of text with key phrases in the middle - regex

VB2010: I have text that consists of blocks of text that start with day and time DD HHMM and end only at the next day/time.
Here is my sample text:
18 2131 Z50000 ZZ-AAA
PR
PR
AGM TPS P773QQ 1500 DCA 22FEB
21,77,23,M10,F,26,3100,2
OK
18 2134 Z50000 ZZ-AAA
PR
QU HMKKDBB
.DDVZAZC 182134
ARR
FI US1500/AN P773QQ/DA KDCA/AD KMIA/IN 2026/FB 152/LA /LR
DT DDL DCAV 182134 M33A
- OS KMIA /GNO6541/R200RR
18 2134 Z50000 ZZ-AAA
PR
PR
ARR OPN P773QQ 1500 DCA 22FEB
0757
OK
18 2135 Z50000 ZZ-AAA
PR
PR
ARR M58 P773QQ 1500 DCA 22FEB
212
UNKNOWN POL/SPOL
QU HMKKDBB
.DDVZAZC 182134
ARR
FI US1500/AN P773QQ/DA KDCA/AD KMIA/IN 2026/FB 152/LA /LR
DT DDL DCAV 182134 M33A
- OS KMIA /GNO6541/R200RR
18 2136 Z50000 ZZ-AAA
PRF 1500/18 MIA IN 0152 333
18 2137 Z50000 ZZ-AAA
PR
PRZ 1500/18 MIA IN 2026 N/A 333
My goal is to get only the blocks of text that have key phrases ^FI and ^DT in the middle. The matching groups should contain only two blocks. The one from 18 2134 and end at M33A and then from 18 2135 to M33A.
I have tried:
This works for the most part except it starts the match at the prior block.
RegexOptions.Singleline Or RegexOptions.Multiline Or RegexOptions.IgnoreCase
^\d\d \d{4}(.*?)^FI US(.*?)^DT DDL(.*?)\r
This one I took from another post but cant seem to wrap my head around. It matches only the first part of every block.
RegexOptions.Multiline Or RegexOptions.IgnoreCase
^\d\d \d{4}.*\r[\s\S]*?(?=(?:^\d\d \d{4}|$))
Haven't used regex in a while so any help appreciated.

You may use
(?ms)^\d\d +\d{4}\b(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*?^(?:FI|DT).*?(?=^\d\d +\d{4}\b|\Z)
See the regex demo (Though it is a PCRE regex test, it will work the same in .NET).
Pattern details
(?ms) - multiline and singleline options
^ - start of a line
\d\d +\d{4}\b - 2 digits, 1 or more spaces and 4 digits as a whole word
(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*? - any char, 0+ repetitions, as few as possible, that does not start the sequence: start of a line, 2 digits, 1 or more spaces and 4 digits as a whole word, or FI or DT
^(?:FI|DT) - FI or DT at the start of a line
.*? - any 0+ chars, as few as possible
(?=^\d\d +\d{4}\b|\Z) - a positive lookahead that requires ^\d\d +\d{4}\b (start of a line, 2 digits, 1 or more spaces and 4 digits as a whole word) or \Z (end of string) to match immediately to the right of the current location.

This regex should find what you need, if single line enabled
[0-3]\d\s+[0-2]\d[0-5]\d.*?(FI.*?)\n(DT.*?)\n
Explanation:
[0-3]\d\s+[0-2]\d[0-5]\d day hour and minute check
.*? ungreedy capturing, . includes newline
(FI.*?)\n first group, FI line, until line break
(DT.*?)\n second group, same deal

Related

regExp for HH:mm format including intermediate value

I am trying to compose a regExp that accepts HH:mm time formats, but also accepts all of the intermediate values:
e.g. all of these are accepted:
0
1
12
12:
12:3
12:30
1:
1:3
1:30
For now, I came up with this: ^([\d]{1,2}):?([\d]{1,2})?$
But this accepts any numeric 1/2 digit values for hours and minutes (e.g. 25:66 is acceptable)
So I came relatively close to my goal, but I need to filter out values x>24 from the hours, and x>60 from the minutes?

Try this:
^((?:[01][0-9]?)|(?:2[0-4]?)|(?:[3-9]))(?::((?:[0-5][0-9]?)|(?:60))|:)?$
NOTE:
This accepts 24 for HH and 60 for MM as stated in your question:
but I need to filter out values x>24 from the hours, and x>60 from the minutes?
Thus ff. are accepted:
0
1
12
12:
12:3
12:30
1:
1:3
1:30
1:60
24:60
24:00
00:60
and below are not accepted:
25:30
00:61
Regex DEMO 1
If you want to exclude 24 HH and 60 MM, try this instead:
^((?:[01]\d?)|(?:2[0-3]?))(?::|(?::([0-5][0-9]?)))?$
Regex DEMO 2
Groups (applies to both cases):
\1 = HH
\2 = MM

You are looking for
^(?:[01]\d?|2[0-3]?)(?::(?:[0-5]\d?)?)?$
See the regex demo and the regex graph:
Details:
^ - start of string
(?:[01]\d?|2[0-3]?) - either a 0 or 1 followed by an optional digit, or a 2 followed with an optional 0, 1, 2 or 3
(?::(?:[0-5]\d?)?)? - an optional sequence of patterns:
: - a colon
(?:[0-5]\d?)? - an optional sequence of patterns:
[0-5] - a digit from 1 to 5
\d? - an optional digit
$ - end of string.

Need improvement on the Regex

I have wrote an easy regex for extracting user SC08.
https://regex101.com/r/L1DOzH/1/ Performance wise, its really bad taking around 1448 steps.
Jun 2 11:16:44 192.168.55.19 1 2020-06-02T10:16:43.721Z chisdsm#abcd.com dsm 4493 USR1278I [U#21513 sev="INFO" msg="user logged out due to inactivity" user="SC08"]
Jun 2 10:13:50 192.168.55.19 1 2020-06-02T09:13:50.297Z chisdsm#abcd.com dsm 4493 DO0426I [DA#21513 sev="INFO" msg="switch domain" admin="SC08"
Jun 2 10:13:43 192.168.55.19 1 2020-06-02T09:13:42.956Z chisdsm#abcd.com dsm 4493 DAO0267I [DA#21513 sev="INFO" msg="user logged in" admin="SC08" stime="2020-06-02 10:13:42.944" role="ALL_ADMIN" source="192.168.54.9"]
May 27 15:53:38 192.168.55.129 1 2020-05-27T14:53:37.669Z chisdsm#abcd.com dsm 4493 DAO0227I [DA#21513 sev="INFO" msg="delete file signature" user="SC08" filePath="/bin/rm"]

Alternation group as the first pattern in a regex cancels some optimizations that are in place for patterns that start with a more specific pattern.
Since your alternatives match = delimited strings, you may put it at the beginning of the pattern, and then use lookarounds, as in Michail's suggestion. Here is a small variation with 139 steps:
=(?:(?<=user=)"(?<user1>\w+)|(?<=admin=)"(?<user2>\w+))
See the regex demo. Details
= - an equals sign
(?:(?<=user=)"(?<user1>\w+)|(?<=admin=)"(?<user2>\w+)) - a non-capturing group:
(?<=user=) - user= must be immediately to the left of the current position
" - a " char
(?<user1>\w+) - Group "user1": 1+ word chars
| - or
(?<=admin=) - admin= must be immediately to the left of the current position
" - a " char
(?<user2>\w+) - Group "user2": 1+ word chars
If your matches are always preceded with a whitespace, use it as the first pattern:
\s(?:user="(?<user1>\w+)|admin="(?<user2>\w+))
See this regex demo, with 918 steps.
If you know the matches are somewhere close to the end of the line, use
.*\b(?:user="(?<user1>\w+)|admin="(?<user2>\w+))
See this regex demo, 568 steps. .* at the start will move the regex index at the end of a line/string and then backtrack to find either user= or admin=.

Find a String from a varying number block to the end

I have nearly 8000 lines of the following text:
DIL 2 M 006 SC SCHÜTZ 083 1 Stck
25215-1 BIN-SORT 2152310251724-1 BIN-SORT getestet 048 133 Stck
RBBE60-T3dsg 21S003 SEALING 6X8.9X2.4 MM 082 3 Stck
I am only interested in the 3 digit block at the end and the number behind.
So this should be the output:
083 1
048 133
082 3
It could be, that the same number e.g. 048 appears at the beginning of the line. this shouldn't be a hit.
Unfortunatelly i have no idea how to extract this strings with the help of notpad++.

This expression,
.*(\d{3}\s+\d+).*
with a replacement of $1 is likely to work here.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.

You may try the following find and replace, in regex mode:
Find: ^.*?(\d+ \d+) \S*$
Replace: $1
The logic here is to use .* to consume everything up until the last two consecutive digits in the line. Then, we replace with only the captured two digits.
Demo

Regex match line starting with whitespace and first character is non-digit

I am trying to create a regex that will only match lines that start with whitespace, then have 1-4 non-digits as the first characters, and then at least one or more spaces after the digits. The purpose of this regex is to use it in the "Find and Replace" option of Notepad++ to remove any lines that do not start with space(s) and then have a number as the first character in the line.
What I have now is allowing me to match the lines that start with whitespace and are followed with a group of digits and another space. However, these are the lines I want to keep. How can I modify the following regex so that it will match everything else other than these lines?
/^([\s]+\d[\s]|[\s]+\d\d[\s]|[\s]+\d\d\d[\s])/gm
Here's an example of the data we're using the regex on. The regex should only match the lines that DO NOT start with 1, 2, 49, 50, 99 and 100. Note that the lines that start with "40th" and "5/23/2017" should match.
Page 1
40th Marathon and 25th Marathon Relay
5/23/2017 USATF Certified Marathon (#RE98723UB) Downtown/City, ST
Timing: Race Services See our Calendar of Events at www.website.com
Results questions: http://www.website.com/fixresults
=====================================================================================
**** FINAL RESULTS IN NETTIME ORDER ****
Place Div/Tot Div Halfway 22miles Guntime Nettime Pace Name
===== ======== ===== ======= ======= ======= ======= ===== =======
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
Record 2:17:35 by Randy in 1986
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
# Under USATF OPEN guideline
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
* Under USATF Age-Group guideline
% For an Explanation of AgeGraded Percentages, See Here: http://www.website.com/agegrading
So if we used the regex in Notepad++ to find the matching strings/lines and replace (delete) them, the desired end result would be as follows (in other words, the following lines would NOT match the regex):
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
Any assistance would be greatly appreciated.

See regex in use here
^(?! +\d+ ).*\n*
^ Assert position at the start of the line
(?! +\d+ ) Negative lookahead ensuring what follows is not one or more spaces, then one or more digits, then a space
.* Match any character (except \n) any number of times
\n* Matches any number of newline characters
Result:
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan

If this is to use in the Find/Replace dialog then you can use a cunning trick...
^(pattern_I_want_to_keep)$|^.*$
And replace it with
\1
Anything that doesn't match what you want to keep will be removed, although it will leave a blank line. They can be removed with a plugin or another regex.
This is simpler to read than concocting a match for what you don't want to keep, or using a negative lookahead.

Identify an address number in a string

I have a list of addresses, currently quite unclean. They take the format:
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
What I would like to do is split up the street name and street number. I need a regex expression that is true for
955 - 959
95-99
4-9
95 - 99
99
I currently have this:
^[0-9][0-9]\s*+(\s*-\s*[0-9][0-9]+)
which works for the two digit addresses but does not work for the three or one digit addresses.
Thanks

I'm not sure what you're trying to do here \s*+ but you basically had the answer with the last part [0-9][0-9]+ that would find 2+ digits on the end.
Maybe try this (it's more concise). This searches for 1+ digits instead of 2+
\d+(\s*-\s*\d+)?

You can use braces {2,3} for 2-3 numbers - but also *+ isn't right.
/^(([0-9]{1,3}\s-\s)?[0-9]{1,3})\s/
I nested the braces so you only want the first result from the regex.
it breaks up like this
([0-9]{1,3}\s-\s)?
first, Is there a 1-3 digit number with a space-dash-space - OPTIONAL
then.. does it end in a 1-3 digit number followed by a space.

Starting from your regex:
^[0-9][0-9]\s*+(\s*-\s*[0-9][0-9]+)
You got an extra white space matcher in the second block:
^[0-9][0-9]\s*+(-\s*[0-9][0-9]+)
I would suggest you replace [0-9] with \d
^[\d][\d]\s*+(-\s*[\d][\d]+)
Use a + instead o 2 copies of \d meaning at least one number:
^[\d]+\s*+(-\s*[\d]+)
Make the last block optional, so it matches 99 Fake Address:
^[\d]+\s*+(-\s*[\d]+)?
If you know there's only going to be 1 white space, you could replace \s* with \s?:
^[\d]+\s?(-\s?[\d]+)?
That should match all of them :D

For your example, you can do:
/^(\d+[-\s\d]*)\s/gm
Demo
Explanation:
/^(\d+[-\s\d]*)\s/gm
^ start of line
^ at least 1 digit and as many digits as possible
^ any character of the set -, space, digit
^ zero or more
^ trailing space
^ multiline for the ^ start of line assertion

Another way could be
In [83]: s = '955 - 959 Fake Street'
In [84]: s1 = '95-99 Fake Street'
In [85]: s2 = '95 - 99 Fake Street'
In [86]: s3 = '99 Fake Street'
In [87]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s3)
In [88]: d.group()
Out[88]: '99 '
In [89]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s2)
In [90]: d.group()
Out[90]: '95 - 99'
In [91]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s1)
In [92]: d.group()
Out[92]: '95-99'
In [93]: d = re.search(r'^[0-9]+[ ]*(-[ ]*[0-9]+){0,1}', s)
In [94]: d.group()
Out[94]: '955 - 959'
the character set 0-9 cab be represented by \d like this
d = re.search(r'^[\d]+[ ]*(-[ ]*[\d]+){0,1}', s)
Here, in all the examples, we are searching at the beginning of the string, for a sequence of at least one digit followed by zero or more spaces and optionally followed by at most one sequence of only one - symbol followed by zero or more spaces and at least one or more digits.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex to match blocks of text with key phrases in the middle - regex

This regex should find what you need, if single line enabled [0-3]\d\s+[0-2]\d[0-5]\d.?(FI.?)\n(DT.?)\n Explanation: [0-3]\d\s+[0-2]\d[0-5]\d day hour and minute check .? ungreedy capturing, . includes newline (FI.?)\n first group, FI line, until line break (DT.?)\n second group, same deal

Related

regExp for HH:mm format including intermediate value

Need improvement on the Regex

Find a String from a varying number block to the end

Regex match line starting with whitespace and first character is non-digit

Identify an address number in a string

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex to match blocks of text with key phrases in the middle - regex

This regex should find what you need, if single line enabled [0-3]\d\s+[0-2]\d[0-5]\d.*?(FI.*?)\n(DT.*?)\n Explanation: [0-3]\d\s+[0-2]\d[0-5]\d day hour and minute check .*? ungreedy capturing, . includes newline (FI.*?)\n first group, FI line, until line break (DT.*?)\n second group, same deal

Related

regExp for HH:mm format including intermediate value

Need improvement on the Regex

Find a String from a varying number block to the end

Regex match line starting with whitespace and first character is non-digit

Identify an address number in a string

Categories

Resources

This regex should find what you need, if single line enabled [0-3]\d\s+[0-2]\d[0-5]\d.?(FI.?)\n(DT.?)\n Explanation: [0-3]\d\s+[0-2]\d[0-5]\d day hour and minute check .? ungreedy capturing, . includes newline (FI.?)\n first group, FI line, until line break (DT.?)\n second group, same deal