How to exclude non-numeric character in regex - regex

I have a string which goes like this
Section 78(1) of the blabla
These are my regex
\b\s(?!\b(\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)\b\S*
Expected output is: of the blabla
This regex works but it does not exclude "of" because of the (). Can anyone help me? Thank you

Try this pattern: .+\d\)?
Explanation:
.+ - match any character one or more times
\d - match digit
\)? - match ) zero or one time
Because + is greedy, .+ matches up to the last digit; if that digit is inside brackets, \)? also matches the closing bracket.
Alternatively use \d+(?:\(\d+\))?(.+)
Then desired output is in first capturing group.
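As a rough illustration of the two patterns above, here is a minimal JavaScript sketch. The variable names are made up for this example, and the trim() calls are an addition, needed because both approaches leave the space before "of" in place.

```javascript
const input = "Section 78(1) of the blabla";

// Pattern 1: the greedy .+ runs up to the last digit, then \)? takes the ")".
const prefix = input.match(/.+\d\)?/)[0];
// Removing that prefix leaves the wanted text (plus a leading space).
const restByRemoval = input.replace(/.+\d\)?/, "").trim();

// Pattern 2: the wanted text is in capturing group 1.
const groupMatch = /\d+(?:\(\d+\))?(.+)/.exec(input);
const restByGroup = groupMatch[1].trim();

console.log(prefix);        // Section 78(1)
console.log(restByRemoval); // of the blabla
console.log(restByGroup);   // of the blabla
```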

It seems all you need to change is to remove the \b before \S* and replace the \S* with .+ or .* (if the match can be an empty string).
\s(?!\b(?:\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)(.+)
See the regex demo, grab Group 1 value. Note I turned the first group matching digits in the negative lookahead into a non-capturing group to avoid clutter in the resulting match list.
VB.NET demo:
Dim r As New Regex("\s(?!\b(?:\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)(.+)")
Dim s As String
s = "Section 78(1) of the blabla"
For Each m As Match In r.Matches(s)
Console.WriteLine(m.Groups(1).Value)
Next
Result: of the blabla.
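For comparison, the same pattern can be tried in JavaScript. This is only a sketch that mirrors the VB.NET demo above; the variable names are illustrative.

```javascript
// Same pattern as in the VB.NET demo: skip a whitespace that is followed by
// a section number, then capture the rest of the line into Group 1.
const re = /\s(?!\b(?:\d{1,3}|\d{1,2}[a-zA-Z]|\d{5,})\b)(.+)/;
const s = "Section 78(1) of the blabla";

const match = re.exec(s);
const tail = match ? match[1] : null;

console.log(tail); // of the blabla
```

The space before 78 is rejected because the lookahead sees a 1-3 digit number there, so the first usable whitespace is the one before "of".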

Related

Match with optional positive lookahead

I've got 2 strings in the format:
Some_thing_here_1234 Match Me 1 & 1234 Match Me 1_1
In both cases I want the resultant match to be 1234 Match Me 1
So far I've got (?<=^|_)\d{4}\s.+ which works but in the case of string 2 also captures the _1 at the end. I thought I could use a lookahead at the end with an optional such as (?<=^|_)\d{4}\s.+(?=_\d{1}$|$) but it always seems to revert to the second option and so the _1 gets through.
Any help would be great
You can use
(?<=^|_)\d{4}\s[^_]+
See the regex demo.
Details:
(?<=^|_) - a positive lookbehind that matches a location that is immediately preceded with either start of string or a _ char (equal to (?<![^_]))
\d{4} - four digits
\s - a whitespace
[^_]+ - one or more chars other than _.
Your second pattern (?<=^|_)\d{4}\s.+(?=_\d{1}$|$) is greedy and at the end of the string the second alternative |$ will match so you will keep matching the whole line.
Note that you can omit {1}
If you want to use an optional part in the lookahead, you can make the match non-greedy and optionally match _\d in the lookahead, followed by the end of the string.
(?<=^|_)\d{4}\s.+?(?=(?:_\d)?$)
See a regex demo.
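Both suggested patterns can be checked against the two sample strings with a short JavaScript sketch. It assumes an engine with lookbehind support, such as a recent Node.js; the names negatedClass and lazyLookahead are invented for the example.

```javascript
const inputs = ["Some_thing_here_1234 Match Me 1", "1234 Match Me 1_1"];

// Variant 1: stop at the next underscore by using a negated character class.
const negatedClass = /(?<=^|_)\d{4}\s[^_]+/;
// Variant 2: lazy .+? plus a lookahead with an optional _\d before the end.
const lazyLookahead = /(?<=^|_)\d{4}\s.+?(?=(?:_\d)?$)/;

const results = inputs.map((text) => ({
  negated: (text.match(negatedClass) || [null])[0],
  lazy: (text.match(lazyLookahead) || [null])[0],
}));

console.log(results);
// Every entry is "1234 Match Me 1" for both inputs.
```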

Regex to match the letter string group between 2 numbers

Is it possible to match only the letter from the following string?
RO41 RNCB 0089 0957 6044 0001 FPS21098343
What I want: FPS
What I'm trying: [0-9]{4}\s*\S+\s+(\S+)
What I get: FPS21098343
Any help is much appreciated! Thanks.
You can try with this:
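const input = "0258 6044 0001 FPS21098343";
// Skip the leading 4-digit groups, capture the letters, require digits up to the end
const re = /^(?:\d{4} )+ *([a-zA-Z]+)(?:\d+)$/;
const match = re.exec(input);
console.log(match);    // the full match array
console.log(match[1]); // "FPS"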
You can match up to the first one or more letters in the following way:
^[^a-zA-Z]*([A-Za-z]+)
^.*?([A-Za-z]+)
^[\w\W]*?([A-Za-z]+)
(?s)^.*?([A-Za-z]+)
If the tool treats ^ as the start of a line, replace it with \A that always matches the start of string.
The point is to match
^ / \A - start of string
[^a-zA-Z]* - zero or more chars other than letters
([A-Za-z]+) - capture one or more letters into Group 1.
The .*? part matches any text (as short as possible) before the subsequent pattern(s). (?s) makes . match line break chars.
Replace A-Za-z in all the patterns with \p{L} to match any Unicode letters. Also, note that [^\p{L}] = \P{L}.
To grep all the groups of letters that go in a row in any place in the string you can simply use:
([a-zA-Z]+)
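A small JavaScript sketch of these patterns follows. Note that on the full sample line the "first run of letters" patterns return RO rather than FPS, so they only give FPS when the letters of interest come first. The second sample string is invented to show \p{L}, which needs the u flag in JavaScript.

```javascript
const text = "RO41 RNCB 0089 0957 6044 0001 FPS21098343";

// First run of letters in the string (Group 1).
const firstRun = /^[^a-zA-Z]*([A-Za-z]+)/.exec(text)[1];
// The same pattern on the shortened input used in the answer above.
const firstRunShort = /^[^a-zA-Z]*([A-Za-z]+)/.exec("0258 6044 0001 FPS21098343")[1];

// All runs of letters anywhere in the string (g flag).
const allRuns = text.match(/[a-zA-Z]+/g);

// Unicode-aware variant; the input string is a made-up example.
const unicodeRuns = "\u00DCber 42 Zo\u00EB".match(/\p{L}+/gu);

console.log(firstRun, firstRunShort); // RO FPS
console.log(allRuns);                 // [ 'RO', 'RNCB', 'FPS' ]
console.log(unicodeRuns);
```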
You could use a capture group to get FPS:
\b[0-9]{4}\s+\S+\s+([A-Z]+)
The pattern matches:
\b[0-9]{4} A word boundary to prevent a partial match, then match 4 digits
\s+\S+\s+ Match 1+ non whitespace chars between whitespace chars
([A-Z]+) Capture group 1, match 1+ chars A-Z
If the chars have to be followed by digits till the end of the string, you can add \d+$ to the pattern:
\b[0-9]{4}\s+\S+\s+([A-Z]+)\d+$
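To see what these two capture-group patterns return on the original input, a minimal JavaScript sketch (variable names are illustrative):

```javascript
const line = "RO41 RNCB 0089 0957 6044 0001 FPS21098343";

// Without the trailing \d+$: the first place where a 4-digit block is followed
// by one more token and then uppercase letters.
const loose = /\b[0-9]{4}\s+\S+\s+([A-Z]+)/.exec(line);
// With \d+$: the letters must additionally be followed by digits to the end.
const strict = /\b[0-9]{4}\s+\S+\s+([A-Z]+)\d+$/.exec(line);

console.log(loose[0]);  // 6044 0001 FPS
console.log(loose[1]);  // FPS
console.log(strict[1]); // FPS
```

The attempts starting at 0089 and 0957 fail because the third token begins with a digit, so the match starts at 6044.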

Regex: Match partial string

I need some help - my skills here falls short :) (and I don't know if it is possible with pure regex)
Case: I have some text inputs in the form of:
input1: "abc,clutter,01;xyz,clutter,02;" (should match)
input2: "abc,clutter,02;zyz,clutter,01;" (no match)
input3: "abc,clutter,02;abc,txt,txt,01;xyz,clutter,01" (should match)
Then match should be
Starts with abc (anywhere in the input)
Everything in between - unless ,02; is in-between
Ends with ,01;
So something like:
abc(.*)(?!,02;),01;
.. but this also matches input2, and that was not the intention :)
You might use for example a repeating pattern matching all chars except , and ;
\babc(?:,(?!02,)[^,;\n]+)*,01;
\babc A word boundary, match abc
(?: Non capture group
,(?!02,)[^,;\n]+ Match a comma, assert with a negative lookahead that 02, does not follow, then match 1+ chars other than , ; or a newline
)* Close the group and optionally repeat
,01; Match literally
If abc should only be matched once, you can also add that to the negative lookahead
\babc(?:,(?!(?:02|abc),)[^,;\n]+)*,01;
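The two patterns can be run against the three inputs from the question with a short JavaScript sketch. The helper firstMatch is a name invented for this example.

```javascript
const inputs = [
  "abc,clutter,01;xyz,clutter,02;",               // input1: should match
  "abc,clutter,02;zyz,clutter,01;",               // input2: no match
  "abc,clutter,02;abc,txt,txt,01;xyz,clutter,01", // input3: should match
];

const basic = /\babc(?:,(?!02,)[^,;\n]+)*,01;/;
const abcOnce = /\babc(?:,(?!(?:02|abc),)[^,;\n]+)*,01;/;

// Hypothetical helper: return the matched text, or null when nothing matches.
const firstMatch = (re, text) => {
  const m = re.exec(text);
  return m ? m[0] : null;
};

const found = inputs.map((text) => firstMatch(basic, text));
const foundOnce = inputs.map((text) => firstMatch(abcOnce, text));

console.log(found);     // [ 'abc,clutter,01;', null, 'abc,txt,txt,01;' ]
console.log(foundOnce); // same result for these inputs
```

Because the repeated group cannot consume a semicolon, a match can never span two ;-separated records, which is what keeps input2 from matching.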

How to extract all the strings between 2 patterns using regex Notepad++?

Extract all the string between 2 patterns:
Input:
test.output0 testx.output1 output3 testds.output2(\t)
Output:
output0 output1 output3 output2
Note: (\t) is the tab character.
You may try:
\.\w+$
Explanation of the above regex:
\. - Matches . literally. If you do not want the . to be included in the match, use (?<=\.) instead, or simply remove the \. from the pattern.
\w+ - Matches word character [A-Za-z0-9_] 1 or more time.
$ - Represents end of the line.
EDIT 2 by OP:
According to your latest edit; this might be helpful.
.*?\.?(\w+)(?=\t)
Explanation:
.*? - Match everything other than new line lazily.
\.? - Matches . literally zero or one time.
(\w+) - Represents a capturing group matching the word-characters one or more times.
(?=\t) - Represents a positive look-ahead matching tab.
$1 followed by a space - the replacement: $1 is the captured group, and the trailing space separates the outputs as you wanted. To restore the tab instead, use the replacement $1\t.
Try matching on the following pattern:
Find: (?<![^.\s])\w+(?!\S)
Here is an explanation of the above pattern:
(?<![^.\s]) assert that what precedes is either dot, whitespace, or the start of the input
\w+ match a word
(?!\S) assert that what follows is either whitespace or the end of the input
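The lookaround pattern can be sanity-checked outside Notepad++. The sketch below uses JavaScript, which is a different regex engine, so treat it as a check of the logic only; it assumes the sample line is separated by spaces and ends with a tab.

```javascript
// Sample line from the question; "\t" stands for the trailing tab.
const line = "test.output0 testx.output1 output3 testds.output2\t";

// A word preceded by a dot, whitespace or the start of the input,
// and followed by whitespace or the end of the input.
const outputs = line.match(/(?<![^.\s])\w+(?!\S)/g);

console.log(outputs.join(" ")); // output0 output1 output3 output2
```

Words such as test or testx are skipped because they are followed by a dot, which is a non-whitespace character.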

Regexp matching a string - positive lookahead

Regexp: (?=(\d+))\w+\1
String: 456x56
Hi,
I am not getting the concept, how this regex matches "56x56" in the string "456x56".
The lookaround, (?=(\d+)), captures 456 and put into \1, for (\d+)
The wordcharacter, \w+, matches the whole string("456x56")
\1, which is 456, should be followed by \w+
After backtracking the string, it should not find a match, as there is no "456" preceded by a word character
However the regexp matches 56x56.
5) The regex engine concludes that it cannot find a match if it starts searching from 4, so it skips one character and searches again. This time, it captures two digits into \1 and ends up matching 56x56
If you want to match only whole strings, use ^(?=(\d+))\w+\1$
^ matches beginning of string
$ matches end of string
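The difference between the unanchored and the anchored pattern can be seen in a few lines of JavaScript. This is a sketch only; the string 56x56 is an extra test input added for the example.

```javascript
const unanchored = /(?=(\d+))\w+\1/;
const anchored = /^(?=(\d+))\w+\1$/;

// Unanchored: the engine is free to start at the second character.
const partial = unanchored.exec("456x56");
console.log(partial[0], partial.index, partial[1]); // 56x56 1 56

// Anchored: the whole string must fit, so 456x56 is rejected...
const whole = anchored.exec("456x56");
console.log(whole); // null

// ...while a string that starts and ends with the same digits is accepted.
const wholeOk = anchored.exec("56x56");
console.log(wholeOk[0]); // 56x56
```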
You don't anchor your regex, as has been said. Another problem is that \w also matches digits... Now look at how the regex engine proceeds to match with your input:
# begin
regex: |(?=(\d+))\w+\1
input: |456x56
# lookahead (first group = '456')
regex: (?=(\d+))|\w+\1
input: |456x56
# \w+
regex: (?=(\d+))\w+|\1
input: 456x56|
# \1 cannot be satisfied: backtrack on \w+
regex: (?=(\d+))\w+|\1
input: 456x5|6
# And again, and again... Until the beginning of the input: \1 cannot match
# Regex engine therefore decides to start from the next character:
regex: |(?=(\d+))\w+\1
input: 4|56x56
# lookahead (first group = '56')
regex: (?=(\d+))|\w+\1
input: 4|56x56
# \w+
regex: (?=(\d+))\w+|\1
input: 456x56|
# \1 cannot be satisfied: backtrack
regex: (?=(\d+))\w+|\1
input: 456x5|6
# \1 cannot be satisfied: backtrack
regex: (?=(\d+))\w+|\1
input: 456x|56
# \1 satisfied: match
regex: (?=(\d+))\w+\1|
input: 4<56x56>
The points you listed are almost entirely, but not quite, wrong!
1) The group (?=(\d+)) matches a sequence of one or more digits
not necessarily 456
2) \w matches any word character: letters, digits and the underscore
3) \1 is a backreference to the text captured by the group
So the whole expression means: at the current position, look ahead for a sequence of digits, then match a run of word characters (which includes those digits) followed by the same digit sequence again. Hence the match 56x56.
Well that's what makes it a positive lookahead
(?=(\d+))\w+\1
You are correct when you say the first \d+ will match 456, so \1 must also be 456, but if that's the case: the expression won't match the string.
The only characters common to the part before the x and the part after the x are 56, and that is what the engine settles on to get a positive match.
The + quantifier is greedy and backtracks as necessary. The lookahead (?=(\d+)) captures 456 when the match attempt starts at the first character, 56 when it starts at the second, and would capture 6 if it started at the third. First attempt: the lookahead matches and Group 1 contains 456. Then \w+, being greedy, takes all of 456x56; nothing is left, but \1 (i.e. 456) still has to match. Thus: failure. Then \w+ backtracks one character at a time until it cannot give up any more, and the attempt still fails.
The engine then advances one character in the string and retries. The lookahead now matches the substring 56, so Group 1 contains 56. \w+ matches to the end of the string, taking 56x56, and \1 (56) cannot match: failure. So \w+ backtracks until 56 is left in the string; \1 then matches, and the overall match succeeds.
You should try it with RegexBuddy's debug mode.
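If no regex debugger is at hand, the "try each start position in turn" behaviour described above can be imitated in JavaScript with the sticky flag (y), which forces a match attempt to begin exactly at lastIndex. This is only a sketch of the idea, not a view into the engine's internals; the attempts array is made up for the example.

```javascript
const subject = "456x56";
// The y (sticky) flag pins each attempt to the position in lastIndex.
const sticky = /(?=(\d+))\w+\1/y;

const attempts = [];
for (let start = 0; start < subject.length; start++) {
  sticky.lastIndex = start;
  const m = sticky.exec(subject);
  attempts.push({
    start,
    match: m ? m[0] : null,
    group1: m ? m[1] : null,
  });
  if (m) break; // a normal search also stops at the first successful start
}

console.log(attempts);
// start 0: no match (Group 1 would be 456, which never reappears)
// start 1: match "56x56" with Group 1 = "56"
```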