Regex match line starting with whitespace and first character is non-digit - regex

I am trying to create a regex that matches every line except those that start with whitespace, then have 1-4 digits as the first characters, and then one or more spaces after the digits. The purpose of this regex is to use it in the "Find and Replace" option of Notepad++ to remove any lines that do not start with space(s) followed by a number.
What I have now is allowing me to match the lines that start with whitespace and are followed with a group of digits and another space. However, these are the lines I want to keep. How can I modify the following regex so that it will match everything else other than these lines?
/^([\s]+\d[\s]|[\s]+\d\d[\s]|[\s]+\d\d\d[\s])/gm
Here's an example of the data we're using the regex on. The regex should only match the lines that DO NOT start with 1, 2, 49, 50, 99 and 100. Note that the lines that start with "40th" and "5/23/2017" should match.
Page 1
40th Marathon and 25th Marathon Relay
5/23/2017 USATF Certified Marathon (#RE98723UB) Downtown/City, ST
Timing: Race Services See our Calendar of Events at www.website.com
Results questions: http://www.website.com/fixresults
=====================================================================================
**** FINAL RESULTS IN NETTIME ORDER ****
Place Div/Tot Div Halfway 22miles Guntime Nettime Pace Name
===== ======== ===== ======= ======= ======= ======= ===== =======
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
Record 2:17:35 by Randy in 1986
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
# Under USATF OPEN guideline
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
* Under USATF Age-Group guideline
% For an Explanation of AgeGraded Percentages, See Here: http://www.website.com/agegrading
So if we used the regex in Notepad++ to find the matching strings/lines and replace (delete) them, the desired end result would be as follows (in other words, the following lines would NOT match the regex):
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
Any assistance would be greatly appreciated.

See regex in use here
^(?! +\d+ ).*\n*
^ Assert position at the start of the line
(?! +\d+ ) Negative lookahead ensuring what follows is not one or more spaces, then one or more digits, then a space
.* Match any character (except \n) any number of times
\n* Matches any number of newline characters
Result:
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
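For anyone who wants to try the same pattern outside Notepad++, here is a minimal sketch using Python's re module. The helper name keep_result_lines and the sample text are made up for illustration, and the leading spaces in the sample are an assumption based on the question's description (the data as pasted appears to have lost its indentation).

```python
import re

# Pattern from the answer above. re.MULTILINE makes ^ match at the start of
# every line instead of only at the start of the whole text.
NON_RESULT_LINE = re.compile(r"^(?! +\d+ ).*\n*", re.MULTILINE)


def keep_result_lines(text: str) -> str:
    """Delete every line that does not start with spaces, a number, and a space."""
    return NON_RESULT_LINE.sub("", text)


# Hypothetical sample; the indentation is assumed, not copied from the question.
sample = (
    "Page 1\n"
    "   40th Marathon and 25th Marathon Relay\n"
    " 5/23/2017 USATF Certified Marathon\n"
    "    1   1/153 M0139 Eric\n"
    "Record 2:17:35 by Randy in 1986\n"
    "   49   8/77 M4049 Joshua\n"
    "  100   3/35 F5059 Ashley\n"
    "* Under USATF Age-Group guideline\n"
)

print(keep_result_lines(sample))
```

Because \d+ must be followed by a space, "40th" and "5/23/2017" fail the lookahead test and their lines are removed along with the other header lines.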

If this is for use in the Find/Replace dialog, then you can use a cunning trick...
^(pattern_I_want_to_keep)$|^.*$
And replace it with
\1
Anything that doesn't match what you want to keep will be removed, although it will leave a blank line. They can be removed with a plugin or another regex.
This is simpler to read than concocting a match for what you don't want to keep, or using a negative lookahead.
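The keep-or-blank trick can be sketched in Python as well. The "pattern I want to keep" used here (spaces, a number, a space, then the rest of the line) and the sample text are assumptions for illustration. In Python 3.5 or later, a group that did not take part in the match is substituted as an empty string, which is what makes the \1 replacement work.

```python
import re

# Hypothetical "pattern I want to keep" in group 1; anything else falls
# through to the second alternative, ^.*$.
KEEP_OR_ANY = re.compile(r"^( +\d+ .*)$|^.*$", re.MULTILINE)

# Made-up sample with assumed indentation.
sample = (
    "Page 1\n"
    "    1   1/153 M0139 Eric\n"
    "Record 2:17:35 by Randy in 1986\n"
    "   49   8/77 M4049 Joshua\n"
)

# Kept lines are replaced by themselves (\1); every other line is replaced
# by the empty group 1, which leaves a blank line behind.
replaced = KEEP_OR_ANY.sub(r"\1", sample)

# Second pass: drop the blank lines that the first pass leaves.
cleaned = "\n".join(line for line in replaced.splitlines() if line.strip())

print(cleaned)
```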

Related

Limiting parts and total of numbers in a string (regex)

I'm trying to use regex to find tax numbers with the formats:
nnn-nnn-nnn | nn-nnn-nnn
nnn nnn nnn | nn nnn nnn
nnnnnnnnn | nnnnnnnn
EDIT: some samples are 062-225-505, 62-225-505, 062 225 505, 62 225 505, 062225502, 62225505. The numbers should not be any longer than 9 numbers in total
So far I have ([0-9]{2,3}(\s|-|)+[0-9]{3,8}(\s|-|)+[0-9]{3,9})
This works, BUT it is also finding 050821862257111 which is too long for what I'm trying to find. How do I limit the total string as well as each part being limited?
Thanks!
Try ^\d{1,9}(?:(?:-| )\d{1,9})*$
Explanation:
^ - match beginning of a string
\d{1,9} - match between 1 and 9 digits
(?:...) - non-capturing group
-|  - alternation: match a hyphen or a space
* - match zero or more times
$ - match end of a string
Demo
With a small change to your regex, you can limit the length to eight or nine numbers, although this would still allow a mix and match of the delimiters:
([0-9]{2,3}[\s-]?[0-9]{3}[\s-]?[0-9]{3})
If the actual number of delimiters is not important, then you could just remove them, and then just check the length of the remaining numbers.
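A minimal sketch of that "strip the delimiters, then check the length" idea in Python. The function name is_tax_number is hypothetical, and the 8-or-9-digit limit is taken from the samples in the question.

```python
import re


def is_tax_number(candidate: str) -> bool:
    """Hypothetical helper: drop spaces and hyphens, then require 8 or 9 digits."""
    digits_only = re.sub(r"[\s-]", "", candidate)
    return re.fullmatch(r"\d{8,9}", digits_only) is not None


for text in ["062-225-505", "62 225 505", "62225505", "050821862257111"]:
    print(text, is_tax_number(text))
```

Note that this accepts delimiters in any position, so it only suits the case where their placement really does not matter.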
^\d{2}\d?(?:-|\s)?\d{3}(?:-| )?\d{3}$
demo at regex101
This regex will only match if the spaces and dashes are in the right place.
This will match: 062-225-505
This will not match: 062-2255-05 or 062225--505
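To check those claims quickly, here is a small Python sketch that runs the regex above against the samples from the question and the counter-examples from this answer. The list names are made up for illustration.

```python
import re

# Regex from the answer above, unchanged.
TAX_NUMBER = re.compile(r"^\d{2}\d?(?:-|\s)?\d{3}(?:-| )?\d{3}$")

should_match = [
    "062-225-505", "62-225-505",
    "062 225 505", "62 225 505",
    "062225502", "62225505",
]
should_not_match = ["062-2255-05", "062225--505", "050821862257111"]

for text in should_match + should_not_match:
    print(text, bool(TAX_NUMBER.match(text)))
```

The two separators are still checked independently, so a mixed form such as 062-225 505 is accepted too.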
Found with a combination of all of your help! :)
\s\d{2,3}\d?(-|\s)?\d{3}(\1)?\d{3}(?!\d)
Found 62-225-505, 62225505, 062 225 505, and did not find 060821067254101
Thanks all :)

Find a String from a varying number block to the end

I have nearly 8000 lines of the following text:
DIL 2 M 006 SC SCHÜTZ 083 1 Stck
25215-1 BIN-SORT 2152310251724-1 BIN-SORT getestet 048 133 Stck
RBBE60-T3dsg 21S003 SEALING 6X8.9X2.4 MM 082 3 Stck
I am only interested in the 3 digit block at the end and the number behind.
So this should be the output:
083 1
048 133
082 3
It could be that the same number, e.g. 048, appears at the beginning of the line. This shouldn't be a hit.
Unfortunately I have no idea how to extract these strings with the help of Notepad++.
This expression,
.*(\d{3}\s+\d+).*
with a replacement of $1 is likely to work here.
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
You may try the following find and replace, in regex mode:
Find: ^.*?(\d+ \d+) \S*$
Replace: $1
The logic here is to use the lazy .*? to consume everything up to the last two space-separated numbers on the line, which are followed by one final word (Stck). Then we replace the whole line with only the two captured numbers.
Demo
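Both suggestions can be checked against the three sample lines with a short Python sketch. The names GREEDY and LAZY are made up here, and Python writes the backreference as \1 where Notepad++ uses $1.

```python
import re

lines = [
    "DIL 2 M 006 SC SCHÜTZ 083 1 Stck",
    "25215-1 BIN-SORT 2152310251724-1 BIN-SORT getestet 048 133 Stck",
    "RBBE60-T3dsg 21S003 SEALING 6X8.9X2.4 MM 082 3 Stck",
]

# First suggestion: the greedy .* backtracks to the last "3 digits, space, digits".
GREEDY = re.compile(r".*(\d{3}\s+\d+).*")
# Second suggestion: the lazy .*? stops at the number pair before the final word.
LAZY = re.compile(r"^.*?(\d+ \d+) \S*$")

greedy_results = [GREEDY.sub(r"\1", line) for line in lines]
lazy_results = [LAZY.sub(r"\1", line) for line in lines]

print(greedy_results)
print(lazy_results)
```

The patterns are applied one line at a time here because \s+ in the first pattern can also match a line break when it is run over the whole text at once.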

Regex to match blocks of text with key phrases in the middle

VB2010: I have text that consists of blocks of text that start with day and time DD HHMM and end only at the next day/time.
Here is my sample text:
18 2131 Z50000 ZZ-AAA
PR
PR
AGM TPS P773QQ 1500 DCA 22FEB
21,77,23,M10,F,26,3100,2
OK
18 2134 Z50000 ZZ-AAA
PR
QU HMKKDBB
.DDVZAZC 182134
ARR
FI US1500/AN P773QQ/DA KDCA/AD KMIA/IN 2026/FB 152/LA /LR
DT DDL DCAV 182134 M33A
- OS KMIA /GNO6541/R200RR
18 2134 Z50000 ZZ-AAA
PR
PR
ARR OPN P773QQ 1500 DCA 22FEB
0757
OK
18 2135 Z50000 ZZ-AAA
PR
PR
ARR M58 P773QQ 1500 DCA 22FEB
212
UNKNOWN POL/SPOL
QU HMKKDBB
.DDVZAZC 182134
ARR
FI US1500/AN P773QQ/DA KDCA/AD KMIA/IN 2026/FB 152/LA /LR
DT DDL DCAV 182134 M33A
- OS KMIA /GNO6541/R200RR
18 2136 Z50000 ZZ-AAA
PRF 1500/18 MIA IN 0152 333
18 2137 Z50000 ZZ-AAA
PR
PRZ 1500/18 MIA IN 2026 N/A 333
My goal is to get only the blocks of text that have key phrases ^FI and ^DT in the middle. The matching groups should contain only two blocks: the first should start at 18 2134 and end at M33A, and the second should run from 18 2135 to M33A.
I have tried:
This works for the most part except it starts the match at the prior block.
RegexOptions.Singleline Or RegexOptions.Multiline Or RegexOptions.IgnoreCase
^\d\d \d{4}(.*?)^FI US(.*?)^DT DDL(.*?)\r
This one I took from another post, but I can't seem to wrap my head around it. It matches only the first part of every block.
RegexOptions.Multiline Or RegexOptions.IgnoreCase
^\d\d \d{4}.*\r[\s\S]*?(?=(?:^\d\d \d{4}|$))
Haven't used regex in a while so any help appreciated.
You may use
(?ms)^\d\d +\d{4}\b(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*?^(?:FI|DT).*?(?=^\d\d +\d{4}\b|\Z)
See the regex demo (Though it is a PCRE regex test, it will work the same in .NET).
Pattern details
(?ms) - multiline and singleline options
^ - start of a line
\d\d +\d{4}\b - 2 digits, 1 or more spaces and 4 digits as a whole word
(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*? - any char, 0+ repetitions, as few as possible, that does not start the sequence: start of a line, 2 digits, 1 or more spaces and 4 digits as a whole word, or FI or DT
^(?:FI|DT) - FI or DT at the start of a line
.*? - any 0+ chars, as few as possible
(?=^\d\d +\d{4}\b|\Z) - a positive lookahead that requires ^\d\d +\d{4}\b (start of a line, 2 digits, 1 or more spaces and 4 digits as a whole word) or \Z (end of string) to match immediately to the right of the current location.
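The question is about VB2010, but the pattern does not rely on anything specific to .NET, so as a rough sketch it can be tried in Python too. The sample below is a shortened, made-up version of the question's text, and the name BLOCK is hypothetical.

```python
import re

# Pattern from the answer above, split over two string literals for width only.
BLOCK = re.compile(
    r"(?ms)^\d\d +\d{4}\b(?:(?!^(?:\d\d +\d{4}\b|FI|DT)).)*?"
    r"^(?:FI|DT).*?(?=^\d\d +\d{4}\b|\Z)"
)

# Shortened sample: five blocks, two of which contain FI/DT lines.
sample = (
    "18 2131 Z50000 ZZ-AAA\nPR\nOK\n"
    "18 2134 Z50000 ZZ-AAA\nPR\nQU HMKKDBB\n"
    "FI US1500/AN P773QQ\nDT DDL DCAV 182134 M33A\n"
    "18 2134 Z50000 ZZ-AAA\nPR\nOK\n"
    "18 2135 Z50000 ZZ-AAA\nPR\nUNKNOWN POL/SPOL\n"
    "FI US1500/AN P773QQ\nDT DDL DCAV 182134 M33A\n"
    "18 2136 Z50000 ZZ-AAA\nPRF 1500/18 MIA IN 0152 333\n"
)

blocks = [m.group(0) for m in BLOCK.finditer(sample)]
for block in blocks:
    print(block)
    print("-----")
```

Each match runs up to the next day/time header (or the end of the text), so any lines that follow the DT line inside the same block are included in the match.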
This regex should find what you need if single-line mode is enabled
[0-3]\d\s+[0-2]\d[0-5]\d.*?(FI.*?)\n(DT.*?)\n
Explanation:
[0-3]\d\s+[0-2]\d[0-5]\d day hour and minute check
.*? ungreedy capturing, . includes newline
(FI.*?)\n first group, FI line, until line break
(DT.*?)\n second group, same deal

Italian phone 10-digit number regex issue

I'm trying to use the regex from this site
/^([+]39)?((38[{8,9}|0])|(34[{7-9}|0])|(36[6|8|0])|(33[{3-9}|0])|(32[{8,9}]))([\d]{7})$/
for Italian mobile phone numbers, but a simple number such as 3491234567 is reported as invalid.
(Don't worry about spaces, as I'll trim them.)
should pass:
349 1234567
+39 349 1234567
TODO: 0039 349 1234567
TODO: (+39) 349 1234567
TODO: (0039) 349 1234567
regex101 and regexr both pass the validation..what's wrong?
UPDATE:
To clarify:
The regex should match any number that starts with either
388/389/380 (38[{8,9}|0])|
or
347/348/349/340 (34[{7-9}|0])|
or
366/368/360 (36[6|8|0])|
or
333/334/335/336/337/338/339/330 (33[{3-9}|0])|
328/329 (32[{8,9}])
plus 7 digits ([\d]{7})
and the +39 at the start optionally ([+]39)?
The following regex appears to fulfill your requirements. I took out the syntax errors and guessed a bit, and added the missing parts to cover your TODO comments.
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[7-90]|36[680]|33[3-90]|32[89])\d{7}$
Demo: https://regex101.com/r/yF7bZ0/1
Your test cases fail to cover many of the variations captured by the regex; perhaps you'll want to beef up the test set to make sure it does what you want.
The beginning allows for an optional international prefix with or without the parentheses. The basic pattern is (00|\+)39 and it is repeated with or without parentheses around it. (Perhaps a better overall approach would be to trim parentheses and punctuation as well as whitespace before processing begins; you'll want to keep the plus as significant, of course.)
Updated with information from @Edoardo's answer; wrapped for legibility and added comments:
^ # beginning of line
(\((00|\+)39\)|(00|\+)39)? # country code or trunk code, with or without parentheses
( # followed by one of the following
32[89]| # 328 or 329
33[013-9]| # 33x where x != 2
34[04-9]| # 34x where x not in 1,2,3
35[01]| # 350 or 351
36[068]| # 360 or 366 or 368
37[019]| # 370 or 371 or 379
38[089]) # 380 or 388 or 389
\d{6,7} # ... followed by 6 or 7 digits
$ # and end of line
There are obvious accidental gaps which will probably also get filled over time. Generalizing this further is likely to improve resilience toward future changes, but of course may at the same time increase the risk of false positives. Make up your mind about which is worse.
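The commented layout can be used as-is in engines that support a verbose (free-spacing) mode. Here is a minimal sketch with Python's re.VERBOSE; the helper is_italian_mobile is hypothetical, and it strips whitespace first because the question says spaces will be trimmed before matching.

```python
import re

ITALIAN_MOBILE = re.compile(
    r"""
    ^                           # beginning of string
    (\((00|\+)39\)|(00|\+)39)?  # optional +39 or 0039, with or without parentheses
    (                           # followed by one of the following prefixes
      32[89]    |
      33[013-9] |
      34[04-9]  |
      35[01]    |
      36[068]   |
      37[019]   |               # every alternative needs a | before the next one
      38[089]
    )
    \d{6,7}                     # followed by 6 or 7 digits
    $                           # end of string
    """,
    re.VERBOSE,
)


def is_italian_mobile(raw: str) -> bool:
    """Hypothetical helper: remove all whitespace, then apply the pattern."""
    compact = re.sub(r"\s+", "", raw)
    return ITALIAN_MOBILE.match(compact) is not None


for number in ["349 1234567", "(0039) 349 1234567", "321 1234567"]:
    print(number, is_italian_mobile(number))
```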
I found this and updated it with new operator and MVNO prefixes (Iliad, ho.)
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])\d{6,7}$
I improved the regex adding the case to handle space between numbers:
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])(\s?\d{3}\s?\d{3,4}|\d{6,7})$
so, for example, I can match phone number like this (0039) 349 123 4567 or this 349 123 4567
Following doc:
https://it.qaz.wiki/wiki/Telephone_numbers_in_Italy
A simple regex for Italian MOBILE numbers without special characters is:
/^3[0-9]{8,9}$/
It matches a string that starts with the digit '3' followed by 8 or 9 digits, e.g.:
3345678103
You can then add the Italian prefix, such as '+39 ' or '0039 ':
/^\+39 3[0-9]{8,9}$/ --- match --> +39 3345678103
/^0039 3[0-9]{8,9}$/ --- match --> 0039 3345678103
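As a sketch, the bare form and the two prefixed forms can be folded into one pattern with an optional group. The combined pattern and the name MOBILE are my own illustration rather than something taken from the linked page. The plus sign has to be escaped as \+ because an unescaped + is a quantifier.

```python
import re

# Optional "+39 " or "0039 " prefix, then 3 followed by 8 or 9 digits.
MOBILE = re.compile(r"^(?:(?:\+39|0039) )?3[0-9]{8,9}$")

for number in ["3345678103", "+39 3345678103", "0039 3345678103", "2345678103"]:
    print(number, bool(MOBILE.match(number)))
```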

How do I make a part of my Regex non-greedy?

FINAL SOLUTION
/(\d{1,5}\s[^\d].{5,20}(dr|drive)(\.|\s|\,))/i
ORIGINAL QUESTION
Regex
/([0-9]{1,5}.{5,20}(dr|drive)(\.|\s|\,))/i
Pattern
PO Box 66 23 Britton Drive Bloomfield CT 06002
This regex is returning '66 23 Britton Drive'. I want to return '23 Britton Drive'. I have tried the following variations of the Regex:
/(([0-9]{1,5}.{5,20})?(dr|drive)(\.|\s|\,))/i - adding a new capturing group and making it ungreedy
/([0-9]{1,5}.{5,20}?(dr|drive)(\.|\s|\,))/i - making the length of in between characters ungreedy
/([0-9]{1,5}.{5,20}(dr|drive)(\.|\s|\,))/Ui - adding ungreedy modifier
More Patterns That Don't Work
PO Box 156 430 S Wheeling Dr. Wheeling, IL 60090
Patterns That Do Work
1195 Columbia Dr PO Box 1256 Longview, WA 98632
3400 SW Washington drive PO Box 1349 Peoria, IL 61654
^[^\d]+\d{2} (\d{2} [^ ]+ [^ ]+)
Debuggex Demo
Obviously, I'd need more patterns to make this work for all situations
Small fix for your current RegEx
(\d{1,5} [^\d]{5,20}(dr|drive)(\.|\s|\,))
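A short Python sketch of why the ungreedy attempts did not help. A regex engine returns the leftmost match, so making .{5,20} lazy only changes where the match ends, not where it starts. Requiring a non-digit right after the house number and its space is what rules out starting at 66. The names ORIGINAL and FINAL are made up here, and the strings come from the question.

```python
import re

# Pattern from the original question and the one from "FINAL SOLUTION".
ORIGINAL = re.compile(r"([0-9]{1,5}.{5,20}(dr|drive)(\.|\s|\,))", re.IGNORECASE)
FINAL = re.compile(r"(\d{1,5}\s[^\d].{5,20}(dr|drive)(\.|\s|\,))", re.IGNORECASE)

addresses = [
    "PO Box 66 23 Britton Drive Bloomfield CT 06002",
    "PO Box 156 430 S Wheeling Dr. Wheeling, IL 60090",
    "1195 Columbia Dr PO Box 1256 Longview, WA 98632",
]

# The original pattern starts at the first digit run it can use (the PO box).
original_first = ORIGINAL.search(addresses[0]).group(1).strip()

# The final pattern skips any number that is directly followed by another number.
final_results = [FINAL.search(a).group(1).strip() for a in addresses]

print(original_first)
print(final_results)
```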