How to limit optional whitespace matches in regex - regex

YARP. (Yup, another regex problem).
Not sure the clearest way to describe this other than concrete examples.
Sample text:
4444 4444 4444 4444
4444444444444444
44 44 44 44 44 44 44 44
4444-4444-4444-4444
4444 (multiple spaces) 4444 (multiple spaces) 4444 (multiple spaces) 4444
0.4444444444444444
0.4444 4444 4444 4444
I need to build a regex that will match 1, 2 and 4 only. Requirements: 13-16 digits; dashes and spaces are optional, but only as single characters (no runs of spaces), and no more than 3 delimiters in total.
This is obviously CC info search related. I've done a ton of research and found many examples that match most, all or none of these, but nothing that will eliminate excessive false positives like 3 and 5 above. I'm using PowerGREP 5, I've read the entire tutorial on https://www.regular-expressions.info/tutorial.html and I cannot figure out how to limit the number of optional whitespace characters in the overall match, i.e. "1 2 3 4 5 6 7 8 9" matches just as well as "123 456 789" if I make the space(s) optional. Essentially, I want the regex to abandon the match if more than 3 spaces/dashes are detected.
Side note: I work for a company that deals with a TON of calendar data, so grepping a huge drive with many "1 2 3 4 5 6 7 8 ..." style text strings is generating a ton of false hits, even if I take time to tailor searches to CC inclusive patterns.
Any help would be super appreciated.
The closest I've found is:
\b(?:\d[ -]*?){13,16}\b
Which grabs any 13-16 digits (allowing for a dash or space in between) as expected, but it will also match "1 2 3 4 5 6 7 8 9 10 11" which is obviously not helpful.
An all-inclusive, brand-aware CC regex that fails to find valid numbers if they contain spaces/dashes (but will find UK telephone numbers, heh):
\b(?:4[0-9]{12}(?:[0-9]{3})?|(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|6(?:011|5[0-9]{2})[0-9]{12}|(?:2131|1800|35\d{3})\d{11})\b
So then I tried replacing any [0-9] character class instances above with (?:\d[ -]*?) and that will find valid CCs with dashes/spaces, but it also matches all the "1 2 3 4 5 6 7 8 9 10 11" type false positives.
I am very new to regex, so if I'm committing a huge noob error, please feel free to point me in the right direction. Thank you!
Edit:
Replacing [0-9] with (?:\d[ -]?) for just the bigger consecutive string parts seems to be pretty close to what I need. Grepped same drive as before and only got 311 matches, and all 3 positive files found, I can live with just 308 false matches, but I gotta imagine there's a better way to do this still. And it's still matching strings of 13-16 digits with more than 3 delimiters...
Current regex:
\b(?:4(?:\d[ -]?){12}(?:[0-9]{3})?|(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)(?:\d[ -]?){12}|3[47](?:\d[ -]?){13}|3(?:0[0-5]|[68][0-9])(?:\d[ -]?){11}|6(?:011|5[0-9]{2})(?:\d[ -]?){12}|(?:2131|1800|35\d{3})(?:\d[ -]?){11})\b

Since it looks like you want every fourth digit to be followed by either a dash, a single space, or nothing, the simplest way would be to use
^(\d{4}[\s\-]?){3}\d{4}$
This would meet your written criteria, but it allows a mixture of delimiters like: 1234-5678 9012 3456. If that's not acceptable, you can use a positive lookahead to validate that the same delimiter is used throughout
^(?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})(\d{4}[\s-]?){3}\d{4}$
The first regex
Starts at the beginning of the string: ^
Finds four digits (0-9), optionally followed by space or dash, and repeats this pattern 3 times: (\d{4}[\s\-]?){3}
Then is followed by four more digits and the end of the string: \d{4}$
Taking just the lookahead from the second regex: (?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})
Before the pattern starts to capture anything, the lookahead again starts at the beginning of the string, looks at the first three repeated groups and ensures that the delimiter between them is the same.
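As a rough check of the two patterns above, here is a minimal Python sketch. Python's re module is an assumption (the question uses PowerGREP, whose flavor may differ in details), and the names SIMPLE, SAME_DELIM and matches are illustrative only.

```python
import re

# The two patterns from the answer, copied verbatim.
SIMPLE = re.compile(r"^(\d{4}[\s\-]?){3}\d{4}$")
SAME_DELIM = re.compile(
    r"^(?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})(\d{4}[\s-]?){3}\d{4}$"
)


def matches(pattern, text):
    """Return True if the anchored pattern matches the given line."""
    return pattern.search(text) is not None


samples = [
    "4444 4444 4444 4444",      # single spaces
    "4444444444444444",         # no delimiters
    "4444-4444-4444-4444",      # dashes
    "1234-5678 9012 3456",      # mixed delimiters
    "44 44 44 44 44 44 44 44",  # too many delimiters
    "4444  4444  4444  4444",   # double spaces
]

for s in samples:
    print(s, matches(SIMPLE, s), matches(SAME_DELIM, s))
```

One caveat worth knowing: the first lookahead branch only inspects twelve digits, so a string such as 123456789012-3456 (one delimiter, only before the last group) still passes the second pattern.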
I see that in your example regex you want to allow 13-16 digits, and mine was specifically for 16. For 13-16 digits, you need to determine where you want those delimiters to be. Can they be anywhere, as long as there are only three of them and they don't repeat? I also see that you're using word boundaries, so I'm guessing that you're trying to match substrings. You can do that, but it'll be a little more difficult. Dashes and spaces both create word boundaries next to a digit, so you might get some false positives without some lookarounds.
As far as integrating this into your CC regex, (?:\d[ -]*?) lazily matches an unlimited number of dashes or spaces after each digit; you just want ? instead of *?. If you need more flexibility in where those spaces/dashes go, while still limiting how many there are, then I'd probably use a negative regex (e.g. a negative lookahead) to validate.
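One possible reading of that last suggestion, sketched in Python: lookarounds reject any run of digits that carries more than three delimiters. The pattern CARD_LIKE and the helper find_card_like are hypothetical and not from the thread, and Python's re is assumed rather than PowerGREP's flavor.

```python
import re

# Hypothetical pattern (an assumption, not the answerer's regex):
#   (?<![\d.])            not preceded by a digit or a decimal point
#   (?<!\d[ -])           not preceded by "digit + delimiter" (no starting mid-run)
#   (?!(?:\d+[ -]){4}\d)  reject runs with four or more delimiters
#   \d(?:[ -]?\d){12,15}  13-16 digits, each optionally preceded by ONE space/dash
#   (?![ -]?\d)           the run must not continue after the match
CARD_LIKE = re.compile(
    r"(?<![\d.])(?<!\d[ -])(?!(?:\d+[ -]){4}\d)\d(?:[ -]?\d){12,15}(?![ -]?\d)"
)


def find_card_like(text):
    """Return all substrings that look like a 13-16 digit number with <= 3 delimiters."""
    return [m.group(0) for m in CARD_LIKE.finditer(text)]


samples = [
    "4444 4444 4444 4444",
    "4444444444444444",
    "44 44 44 44 44 44 44 44",
    "4444-4444-4444-4444",
    "4444  4444  4444  4444",
    "0.4444444444444444",
    "0.4444 4444 4444 4444",
]

for s in samples:
    print(repr(s), find_card_like(s))
```

On the seven sample strings from the question this keeps 1, 2 and 4 only. The trade-off is that a valid number directly followed by a space and more digits (for example an expiry year) is rejected as well, because the whole run then has more than three delimiters.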

Related

Regular expression that finds strings where the text after the numbers doesn't start with "

I have strings like these:
1 2 3 4 5 "Test test"
1 2 3 4 5 Test test"
I need to find the second string: the one where the text after the numbers doesn't start with ".
I have read many Stack Overflow topics but haven't found an answer that works for me.
The regex has to work in Visual Studio Code on a .txt file.
Thanks so much for your help.
I tried:
^(?![0-9]+\t[0-9]+\t[0-9]+\t[0-9]+\t[0-9]+")
but it didn't work.
I've made the following assumptions about what is required.
the string must begin with one or more instances of one or more digits followed by one or more spaces; and
the last instance of one or more digits followed by one or more spaces must be followed by a character that is not a digit, space or double quote.
That can be tested by the following regular expression.
^(?:\d+ +)+[^"\d ].*$
This regular expression matches the last four strings below, but not the first two.
1 2 3 4 5 "Test test
11 22 33 44 "Test test"
11 22 33 44 The test"
1 2 3 4 5 Test test"
1 2 3 4 5 The "Test test"
11 22 33 44 The "Test test"
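To see which of the strings listed above the pattern accepts, here is a small Python sketch (Python's re is an assumption; the question targets Visual Studio Code's search, which uses a different engine). The names PATTERN, lines and matching_lines are illustrative.

```python
import re

# Pattern from the answer, copied verbatim.
PATTERN = re.compile(r'^(?:\d+ +)+[^"\d ].*$')

lines = [
    '1 2 3 4 5 "Test test',
    '11 22 33 44 "Test test"',
    '11 22 33 44 The test"',
    '1 2 3 4 5 Test test"',
    '1 2 3 4 5 The "Test test"',
    '11 22 33 44 The "Test test"',
]


def matching_lines(candidates):
    """Return the lines whose first character after the numbers is not a quote."""
    return [line for line in candidates if PATTERN.search(line)]


for line in matching_lines(lines):
    print(line)
```

Note that, as the strings are listed here, 11 22 33 44 The test" is also accepted, because the first character after the numbers is T rather than a double quote.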
It can be tricky to match on what isn't there, because everything that doesn't match a pattern is a match for the negation of that pattern.
You are looking for runs of digits followed by runs of whitespace, and this sequence itself repeats
(\d+\s+)+
You want the above to be followed by anything .* that doesn't start with a digit, whitespace or the double-quote character [^\d\s"].
([^\d\s"])
Put it together
(\d+\s+)+([^\d\s"].*)
You can also make the groups non-capturing. This has no logical effect but uses less memory, because the engine doesn't store the contents of each group as it searches the potential parse tree. This can be significant on large documents, especially when backreferences cause deep recursion.
(?:\d+\s+)+(?:[^\d\s"].*)
You're very close. You need to change the outer [] to (). You also need to put .* after the negative lookahead to match the rest of the line when the lookahead fails.
And you don't have tabs between the numbers, you have spaces, so \t should be \s.
^(?![0-9]+\s[0-9]+\s[0-9]+\s[0-9]+\s[0-9]+\s+").*
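A quick way to try the corrected lookahead outside the editor is the following Python sketch. The multi-line flag and the helper name unquoted_lines are assumptions made for the example.

```python
import re

# Pattern from the answer: skip lines where five numbers are followed by a double quote.
NOT_QUOTED = re.compile(
    r'^(?![0-9]+\s[0-9]+\s[0-9]+\s[0-9]+\s[0-9]+\s+").*', re.MULTILINE
)

text = '1 2 3 4 5 "Test test"\n1 2 3 4 5 Test test"'


def unquoted_lines(block):
    """Return every line that does NOT start with five numbers and then a quote."""
    return [m.group(0) for m in NOT_QUOTED.finditer(block)]


print(unquoted_lines(text))
```

Because the pattern is purely a negative test, it also accepts lines that contain no numbers at all (for example a line reading hello), which may or may not be what is wanted.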

Regular expression to extract number before/after word

I have 10000 descriptions and I want to use regular expressions to extract the number associated with the phrase "arrested".
For example:
"police arrests 4 people"
"7 people were arrested".
The numbers range from 1-99.
I have tried the following code:
gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")
I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.
You can use this regex:
(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))
It divides the search into two branches by alternation, depending on whether the number comes before or after 'arrests|arrested'.
The first branch is a non-capturing group containing a capturing group that matches an optional digit 1-9 followed by a digit 0-9. This is followed by 0-20 letters or spaces (the other words) and then 'arrests' or 'arrested'. That is then ORed with the opposite situation (where the number comes last).
This will match if the number is within 20 characters of 'arrests|arrested'.
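To pull out just the number rather than the whole match, one can check which of the two capture groups took part. A minimal Python sketch follows; Python is an assumption here (the question uses Stata), and arrest_count is a hypothetical helper name.

```python
import re

# Pattern from the answer, split over two lines for readability.
ARREST = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def arrest_count(description):
    """Return the number tied to 'arrests'/'arrested', or None if there is none."""
    m = ARREST.search(description)
    if m is None:
        return None
    # Only one branch matches, so exactly one of the two groups is set.
    return int(m.group(1) or m.group(2))


for text in ["police arrests 4 people", "7 people were arrested"]:
    print(text, "->", arrest_count(text))
```

Unrelated numbers earlier in the sentence are skipped as long as more than 20 letters/spaces, or another digit, separate them from the keyword.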
Perhaps something like this?
(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)
Keep in mind, this will not match textual versions of the number (e.g., five people were arrested), so you would have to incorporate that if desired.
Breaking down the pattern
(\d+)[^,.\d\n]+?(?=arrest|custody) - first option, used if the number comes before the watched terms
(\d+) - the number to capture; + means one or more digits
[^,.\d\n]+? - matches anything except a comma, a period, a digit (\d) or a newline (\n), one or more times (lazy). These exclusions prevent false positives that span different sentences (the number and the term must be in the same sentence)
(?=arrest|custody) - positive lookahead checking for either word
(?<=arrest|custody)[^,.\d\n]+?(\d+) - second option, used if the number comes after the watched terms
(?<=arrest|custody) - positive lookbehind checking that the word comes before the number
[^,.\d\n]+? - same as above: anything except a comma, a period, a digit or a newline, one or more times (lazy)
(\d+) - the number to capture; + means one or more digits
Miscellaneous Notes
If you want to add textual representations of your numbers, then you would incorporate that into the (\d+) capturing group.
If you have any additional terms to watch for other than arrested or custody, you would add those terms to both lookaround groups.
The following works for me (solution based on @PoulBak's idea):
clear
input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end
generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")
list
     +-------------------------------------------------------------------------------------+
     | var1                                                         var2                   |
     |-------------------------------------------------------------------------------------|
  1. | This is 1 long string saying that police arrests 4 people    arrests 4              |
  2. | 3 news outlets today reported that 7 people were arrested    7 people were arrested |
  3. | several witnesses saw 5 people arrested and other 3 killed   5 people arrested      |
     +-------------------------------------------------------------------------------------+

Regular expression: allow whitespace without counting it

How to get
[\d ]{6}
to match:
1 23456
1 2 3456
1 2 3 456
1 2 3 4 56
1 2 3 4 5 6
In other words, I would like the space to not be counted towards the char limit. Something like [\d]{6 + but allow spaces you can eat}
The following will match 6 digits, with any amount of whitespace between them.
(?:\d\s*){5}\d
?: at the beginning there makes the group non-capturing. It's not necessary if all you wish to do is a simple match.
A live example:
https://regex101.com/r/PZJ8DO/2
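For a quick local check of the same pattern, here is a minimal Python sketch. The use of fullmatch and the helper name is_six_digits are choices made for this example, not part of the answer.

```python
import re

# Pattern from the answer: five "digit + optional whitespace" groups, then a final digit.
SIX_DIGITS = re.compile(r"(?:\d\s*){5}\d")

samples = ["1 23456", "1 2 3456", "1 2 3 456", "1 2 3 4 56", "1 2 3 4 5 6"]


def is_six_digits(text):
    """True if the whole string is exactly six digits, ignoring whitespace between them."""
    # fullmatch requires the entire string to match, so a seventh digit is rejected.
    return SIX_DIGITS.fullmatch(text) is not None


for s in samples:
    print(repr(s), is_six_digits(s))
```

Without fullmatch (or anchors) the pattern would also find six digits inside a longer run of digits.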
Just to put my two cents in: you could use the opposite of \d which is \D in most flavors:
^(?:\d\D*){6}$
See a demo on regex101.com.
Note, that this would even allow something like
1a2b3c4d5e6
If this is not what you want (meaning you only want to allow spaces, nothing else), use \s* instead of \D*.
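The difference between the two variants can be seen side by side in a short Python sketch (the names ANY_SEPARATOR and SPACES_ONLY are made up for the example):

```python
import re

ANY_SEPARATOR = re.compile(r"^(?:\d\D*){6}$")  # pattern from the answer
SPACES_ONLY = re.compile(r"^(?:\d\s*){6}$")    # the \s* variant it suggests


def accepts(pattern, text):
    """Return True if the anchored pattern matches the text."""
    return pattern.search(text) is not None


for s in ["1 2 3 4 5 6", "1a2b3c4d5e6", "1234567"]:
    print(repr(s), accepts(ANY_SEPARATOR, s), accepts(SPACES_ONLY, s))
```

Both variants accept the space-separated input and reject seven digits; only the \D* version accepts letters between the digits.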
You can try to use
(?<=).*6.*
This will match any line that contains '6' even if there are some white spaces or other characters in the line.
The (?<=) is a positive lookbehind (it is empty here, so it always succeeds).
The . matches any character except line breaks.
The * matches 0 or more of the preceding token.
And 6 matches a "6" Character.
You can test Regular Expression here: RegExr
Note that the positive look behind feature is not supported in all flavors of RegEx.

Regex is possible to match?

I have files with these filename:
ZATR0008_2018.pdf
ZATR0018_2018.pdf
ZATR0218_2018.pdf
Where the 4 digits after ZATR are the issue number of the magazine.
With this regex:
([1-9][0-9]*)(?=_\d)
I can extract 8, 18 or 218, but I would like to keep a minimum of 2 digits and a maximum of 3, so the result should be 08, 18 and 218.
How can I do that?
You may use
0*(\d{2,3})_\d
and grab Group 1 value. See the regex demo.
Details
0* - zero or more 0 chars
(\d{2,3}) - Group 1: two or three digits
_\d - a _ followed with a digit.
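The details above can be tried out with a few lines of Python. This is a sketch only; issue_number is a hypothetical helper, and the question does not say which language is actually in use.

```python
import re

# Pattern from the answer: skip leading zeros, capture 2-3 digits before "_<digit>".
ISSUE = re.compile(r"0*(\d{2,3})_\d")


def issue_number(filename):
    """Return the issue number with at least 2 and at most 3 digits, or None."""
    m = ISSUE.search(filename)
    return m.group(1) if m else None


for name in ["ZATR0008_2018.pdf", "ZATR0018_2018.pdf", "ZATR0218_2018.pdf"]:
    print(name, "->", issue_number(name))
```

For 0008 the greedy 0* first swallows all three zeros, then gives one back so that \d{2,3} can still find two digits, which is how 08 is produced.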
Here is a PCRE variation that grabs the value you need into a whole match:
0*\K\d{2,3}(?=_\d)
See another regex demo
Here, \K makes the regex engine omit the text matched so far (zeros) and then matches 2 to 3 digits that are followed with _ and a digit.
(?:[1-9][0-9]?)?[0-9]{2}(?=_[0-9])
or perhaps:
(?:[1-9][0-9]+|[0-9]{2})(?=_[0-9])
(https://www.freeformatter.com/regex-tester.html, which claims to use the XRegExp library and which you mention in another answer, doesn't seem to backtrack into the (?:)? in my first suggestion where necessary. That makes it very different from any regex engine I've encountered before and makes it prefer to match just the 18 of 218, even though that match starts later in the string. But it does work with my second suggestion.)
([1-9]\d{2,3})(?=_\d)
{x,y} will match from x to y times the previous pattern, in this case \d
Edit: from your own regex it looked as you wanted the part of the number which starts with a non-zero. However since your examples include leading 0s, maybe you really wanted :
(\d{2,3})(?=_\d)
Which will give you the last 3 digits before underscore unless there are only 2 digits.
I propose you:
^ZATR0*(\d{2,3})_\d+\.pdf$
demo code here. Result:
Match 1  Full match  0-17   ZATR0008_2018.pdf  Group 1  6-8    08
Match 2  Full match  18-35  ZATR0018_2018.pdf  Group 1  24-26  18
Match 3  Full match  36-53  ZATR0218_2018.pdf  Group 1  41-44  218
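The same offsets can be reproduced with Python's re module. This is a sketch under the assumption that the three filenames sit on separate lines of one string, which is why the multi-line flag is used; match_report is a made-up name.

```python
import re

# Anchored pattern from the answer; MULTILINE lets ^ and $ work per line.
FILENAME = re.compile(r"^ZATR0*(\d{2,3})_\d+\.pdf$", re.MULTILINE)

text = "ZATR0008_2018.pdf\nZATR0018_2018.pdf\nZATR0218_2018.pdf"


def match_report(block):
    """Return (full-match span, group-1 span, group-1 text) for every filename."""
    return [(m.span(), m.span(1), m.group(1)) for m in FILENAME.finditer(block)]


for full_span, group_span, issue in match_report(text):
    print(full_span, group_span, issue)
```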

Can someone help me remove everything in a line except matching regex pattern?

I use ^.*?(\d{3}\D?\d{3}\D?\d{4}).*$ and replace with \1 or $1
so that everything in each separate line is removed except for the telephone number. Example link https://regex101.com/r/jK6eD8/3.
Basically it works like below
Line 1: this is crap text, only 818-333-2323 is kept in line
line 2: only the following number 4445553333 is kept in line.
What I need help with is finding matching regex patterns for the phone formats below, and remove everything else in its respective line EXCEPT the matching phone number JUST LIKE THE ABOVE LINK. The formats are below.
07123452670
07812 345 931
07412 123466
00447912345188
+971557017442
+971 557 856 832
0414 934 993
So basically, I need a regex for matching 11 digits. (07123456270)
Matching 5 digits, followed by space, followed by 3 digits, followed by space, followed by 3 digits. (07812 345 931)
Matching 5 digits, followed by space, followed by 6 digits (07412 123466)
Matching 14 digits (12345678901234)
Matching a + sign followed by 12 digits (+971557017442)
Matching + followed by 3 digits, space, 3 digits, space, 3 digits, space, 3 more digits (+971 557 856 832)
Last one, 4 digits, space, 3 digits, space, 3 digits. (0414 934 993)
Someone please help
This regex meets the requirements:
^.*?(\+?(?:\d{11,14})|(?:\d{5}\s(?:\d{3}\s\d{3}|\d{6}))|(?:\d{3}(?:\s\d{3}){3})|(?:\d{4}\s\d{3}\s\d{3})).*$
As you can see here: https://regex101.com/r/lY3jW0/1
I hope it helps
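A Python sketch of the replace-with-group-1 step, for anyone who wants to test the pattern locally. ORIGINAL is the regex above copied verbatim; WITH_PLUS is a hypothetical variant (not from the thread) that moves the optional + in front of all alternatives, because in the original it belongs to the first alternative only.

```python
import re

# The pattern from the answer, copied verbatim (split over two lines).
ORIGINAL = re.compile(
    r"^.*?(\+?(?:\d{11,14})|(?:\d{5}\s(?:\d{3}\s\d{3}|\d{6}))"
    r"|(?:\d{3}(?:\s\d{3}){3})|(?:\d{4}\s\d{3}\s\d{3})).*$"
)

# Hypothetical variant (an assumption): the optional "+" applies to every format.
WITH_PLUS = re.compile(
    r"^.*?(\+?(?:\d{11,14}|\d{5}\s(?:\d{3}\s\d{3}|\d{6})"
    r"|\d{3}(?:\s\d{3}){3}|\d{4}\s\d{3}\s\d{3})).*$"
)


def keep_phone(pattern, line):
    """Replace the line with its phone number; leave it unchanged if none is found."""
    return pattern.sub(r"\1", line)


for line in ["call 07123452670 now", "+971557017442", "+971 557 856 832"]:
    print(keep_phone(ORIGINAL, line), "|", keep_phone(WITH_PLUS, line))
```

With the original pattern the spaced international number is kept without its leading +; the variant keeps it.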
If the text you're analyzing doesn't contain other "long" numbers, you could just get strings of digits with optional spaces, full stops and dashes between them. It could look like this:
^.*?(\d[\d .-]{9,13}\d).*$
The match group must consist of
a digit, followed by
9-13 characters either being a digit, a space, a dash or a full stop. Then followed by
a final digit.
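The looser pattern can be exercised the same way. The sketch below assumes Python's re and uses a made-up helper name, keep_number.

```python
import re

# Pattern from the answer: a digit, 9-13 digits/spaces/dots/dashes, then a final digit.
LOOSE = re.compile(r"^.*?(\d[\d .-]{9,13}\d).*$")


def keep_number(line):
    """Replace the line with the first long digit run; leave it unchanged otherwise."""
    return LOOSE.sub(r"\1", line)


for line in [
    "this is crap text, only 818-333-2323 is kept in line",
    "0414 934 993",
    "+971 557 856 832",
]:
    print(keep_number(line))
```

Since + is not part of the character class, a leading + is dropped from international numbers, and any other digit run of 11 to 15 characters would be picked up as well.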
This isn't that strict on the composition of the number though, so it might not suit your needs. But then again, it might ;)
Regards