My regex appears to be complete, yet it's missing matches

I'm currently working on a piece of regex that mostly works; however, there are a few matches that aren't being captured, even though they match when tested on their own. I'm hoping someone can point out what is probably an obvious error that I'm missing.
Specifically, the string kMad matches [Kk\D]+ by itself, but not when it's part of the bigger string.
For reference:
Full Regex showing missing matches
Specific Regex showing matches
Expected matches by line

The non-matching occurrence, kMad10:31-18:5771, does not have 4 digits at the end, because the two digits after the colon are already consumed by (\d{2}:\d{2}\-\d{2}:\d{2}). You can allow a range of digit counts in the part of the regex that follows it, using \d{2,4} instead of \d{4}.
The new regex will be:
(?:\d{4}\-\d{2}\-\d{2}+\.)?(?:Line\d{1,3})?(OFF|ADO|([Kk\D]+)?(\d{2}:\d{2}\-\d{2}:\d{2})(\d{2,4}))
Regex101 Demo
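A minimal Python sketch of the point above, assuming Python's re module behaves like the original engine for this part of the pattern. Only the time-range portion of the regex is reproduced, and the names TIME_RANGE, strict and relaxed are illustrative rather than taken from the original.

```python
import re

# Only the relevant tail of the full regex; the optional date and
# "Line" prefixes are left out because they do not affect this case.
TIME_RANGE = r"([Kk\D]+)?(\d{2}:\d{2}\-\d{2}:\d{2})"

strict = re.compile(TIME_RANGE + r"(\d{4})")     # demands exactly 4 trailing digits
relaxed = re.compile(TIME_RANGE + r"(\d{2,4})")  # accepts 2 to 4 trailing digits

sample = "kMad10:31-18:5771"

strict_match = strict.search(sample)
relaxed_match = relaxed.search(sample)

print(strict_match)            # None: only "71" is left after "10:31-18:57"
print(relaxed_match.groups())  # ('kMad', '10:31-18:57', '71')
```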

Related

Regex: repeated matches using start of line

Say that I would like to replace every a that comes after 2 initial a's and that has only a's between it and those first 2 a's. I can do this in Vim using the (very magic \v) regex s:\v(^a{2}a{-})@<=a:X:g:
aaaaaaaaaaa
goes to
aaXXXXXXXXX
However, why does s:\v^a{2}a{-}\zsa:X:g only replace the first occurrence? I.e., giving
aaXaaaaaaaa
I presume this is because the first match "consumes" the start of the line and the first 2 a's, so that later matches can only match what remains of the line, which can never match the ^ again. Is this true? Or rather, what is the most pedagogical explanation?
P.S. This is a minimal example of another problem.
Edit
Accepted answer corrected a typo in the original regex (a missing ^) and its comment answered the question: why can the ^ be "reused" in the lookbehind but not in the \zs case? (Ans: lookbehind doesn't consume the match whereas \zs does.)

The point here is that (a{2}a{-})@<=a matches any a (see the last a) that is preceded by two or more a chars. In NFA regex flavors, it is equal to (?<=a{2,}?)a, see its demo.
The ^a{2}a{-}\zsa regex matches the start of string, then two or more a's, then discards this matched text and matches an a. So, it cannot match other a's since the ^ anchors the match at the start of the string (and it does not allow matching anywhere else).
You probably want to go on using a lookbehind construct and add ^ there (if you want to only start matching if the string starts with two a's):
:%s/\v(^a{2}a{-})@<=a/X/g
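Python's re module has neither \zs nor variable-length lookbehind, so the Vim commands cannot be reproduced directly. The sketch below only illustrates the underlying point with a swapped-in technique (a replacement callback): a pattern that begins with ^ can match at most once per line, so all replacements have to happen inside that single match. The variable names are illustrative.

```python
import re

line = "a" * 11  # "aaaaaaaaaaa", the example from the question

# A pattern anchored with ^ can only match at the start of the string,
# so a global search finds it at most once.
anchored_hits = re.findall(r"^a{2}a*?a", line)

# Workaround: match the whole run once and build the replacement in a callback,
# keeping the first two a's and turning every following a into X.
result = re.sub(r"^(a{2})(a+)", lambda m: m.group(1) + "X" * len(m.group(2)), line)

print(len(anchored_hits))  # 1
print(result)              # aaXXXXXXXXX
```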

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 successfully using (?<=.art\:\s)([0-9A-Z]*). This gave me a single group match, allowing me to extract it as a column value.
I want to extract 6726376 the same way, with only one group appearing, because I am parsing it into a column value.
I am not sure why (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) is giving me the entire line after S.
Help getting it right, with an explanation, will go a long way.

Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we use a positive lookbehind to make sure we're after "art: ", then we match two numbers, separated by non-digits.
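To make the "assert but don't move" behaviour concrete, here is a small Python sketch. The first pattern is a simplified stand-in for the question's regex (an assumption, not the original expression); the second is the regex proposed in this answer.

```python
import re

line = "Start: 738392E, 6726376N"

# A lookahead only asserts: after it succeeds, matching resumes at the same
# position, so .* starts at "art" and runs to the end of the line.
lookahead_only = re.search(r"(?=art:\s)(.*)", line)

# The proposed regex: the lookbehind pins the start, then two numbers are captured.
proposed = re.search(r"(?<=(art\:\s))(\d+)\D+(\d+)", line)
easting, northing = proposed.group(2), proposed.group(3)

print(lookahead_only.group(1))  # art: 738392E, 6726376N
print(easting, northing)        # 738392 6726376
```

Note that the northing lands in group 3 here, since group 1 is the text inside the lookbehind.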

There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
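Both suggestions from this answer can be tried quickly in Python. This is only a sketch using the sample line from the question; the variable names are made up for illustration.

```python
import re

line = "Start: 738392E, 6726376N"

# One pattern, two capture groups: easting and northing in a single match.
both = re.search(r"Start: (\d+)E, (\d+)N", line).groups()

# One pattern, no groups: each number is a separate match,
# identified by the E or N that follows it.
each = re.findall(r"\b\d+(?=[EN]\b)", line)

print(both)  # ('738392', '6726376')
print(each)  # ['738392', '6726376']
```

With the second form, the northing is simply the second match, which may be convenient when only one value per column is wanted.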

You match the whole part after it because you use .* which will match until the end of the line.
Note that this part [0-9]* at the end of the pattern does not match because it is optional and the preceding .* already matches until the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPI regex module, which supports variable-length lookbehind:
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo
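The remark about the trailing [0-9]* can be checked with Python's standard re module. In this sketch the simplified pattern (.*)([0-9]*) is an assumption that isolates the greedy part of the question's regex; the second pattern is the lookaround-free one from this answer.

```python
import re

line = "Start: 738392E, 6726376N"
digits_at_end = line.rstrip("N")  # same line, but ending in digits

# Greedy .* takes everything; the optional [0-9]* is then satisfied by
# matching nothing, even when the string ends in digits.
greedy = re.search(r"(.*)([0-9]*)", digits_at_end)

# The pattern without lookarounds captures the prefix and both numbers.
plain = re.search(r"(art:\s)(\d+)[^,]+,\s(\d+)", line)

print(repr(greedy.group(2)))  # ''
print(plain.groups())         # ('art: ', '738392', '6726376')
```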

Regex to match everything except this regex

I think this is a simple thing for a lot of you, but I have very limited knowledge of regex at the moment. I want to match everything except a double-digit number in a string.
For example:
TEST22KLO4567
QE45C2C
LOP10G7G400
Now I found out the regex to match the double digit numbers:
\d{2}
Which matches the following:
TEST22KLO4567
QE45C2C
LOP10G7G400
Now it seems to me that it would be fairly easy to turn that regex around to match everything BUT "\d{2}". I searched a lot but I can't seem to get it done. I hope someone here can help.

This only works if your regex engine supports lookbehinds:
^.+?(?=\d{2})|(?<=\d{2}).+$
Explanation:
The | separates two cases where this would match:
^.+?(?=\d{2})
This matches everything from the start of the string (^) until \d{2} is encountered.
(?<=\d{2}).+$
This matches the end of the string, from the place just after two digits.
If your regex engine doesn't support lookbehinds (JavaScript before ES2018, for example), I don't think it is possible using a pure regex solution.
You can match the first part:
^.+?(?=\d{2})
Then get where the match ends, add 2 to that number, and get the substring from that index.
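Both approaches from this answer can be sketched in Python, which does support fixed-width lookbehind. The helper name split_around_first_pair is hypothetical, introduced only for this illustration.

```python
import re

def split_around_first_pair(text):
    """Return (before, after) around the first two consecutive digits, or None."""
    head = re.match(r"^.+?(?=\d{2})", text)
    if head is None:
        return None
    # The match ends just before the two digits; skip them with "+ 2".
    return head.group(), text[head.end() + 2:]

samples = ["TEST22KLO4567", "QE45C2C", "LOP10G7G400"]

# Manual approach: match the first part, then slice past the two digits.
manual = [split_around_first_pair(s) for s in samples]

# Pure regex approach with a lookbehind for the second part.
with_lookbehind = [re.findall(r"^.+?(?=\d{2})|(?<=\d{2}).+$", s) for s in samples]

print(manual)           # [('TEST', 'KLO4567'), ('QE', 'C2C'), ('LOP', 'G7G400')]
print(with_lookbehind)  # [['TEST', 'KLO4567'], ['QE', 'C2C'], ['LOP', 'G7G400']]
```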

You are right that negating a match in regex is usually rather tricky.
In your case I think you want something like [^\d{2}]. However, that is a character class: it excludes any single digit, "{" and "}" rather than negating \d{2}, and since your other strings also contain two digits, a regex using it won't select them.
I would go with this regex (using PCRE 8.36 but should work also in others):
\*{2}\w*\*{2}
Explanation:
\*{2} .... matches "*" literally exactly two times
\w* .... matches "word character" zero or unlimited times

I found a pretty straightforward regex:
^(.*?[^\d])\d{2}([^\d].*?)$
Explanation:
^ : matches the beginning of a line
(.*?[^\d]) : matches and captures the first part, before the two digits. It can contain anything (.*?) but must end with a non-digit ([^\d]), which ensures that there are only 2 digits in the middle
\d{2} : is the part you found yourself
([^\d].*?) : is the mirror image of (.*?[^\d]): it begins with a non-digit ([^\d]) and matches anything after it.
$ : up to the end of the line.
To test this regex you can use this link
It will match the first occurrence of a double-digit number, but because the OP said there was only one, it does the job correctly. I expect it to work with every regex engine, as nothing too complex is used.
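A quick Python check of this pattern against the question's samples follows. The last input, "22ABC", is a hypothetical string added here to show one limitation: because each side must contain at least one non-digit, a string that starts with the two digits is not matched.

```python
import re

pattern = re.compile(r"^(.*?[^\d])\d{2}([^\d].*?)$")

samples = ["TEST22KLO4567", "QE45C2C", "LOP10G7G400"]

# Group 1 is the text before the two digits, group 2 the text after them.
parts = [pattern.match(s).groups() for s in samples]

# Hypothetical input: nothing precedes the digits, so group 1 cannot match.
leading_digits = pattern.match("22ABC")

print(parts)           # [('TEST', 'KLO4567'), ('QE', 'C2C'), ('LOP', 'G7G400')]
print(leading_digits)  # None
```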

Regex searching for number and letter combination optional brackets

I need to get a regex that will find a match of a single lower case a-z character followed by 5 numbers that is either:
at the start of a line
at the end of a line
surrounded by () or []
surrounded by whitespace
So the following results are expected:
a12345 MATCH
(a12345) MATCH
[a12345] MATCH
text a12345 MATCH
aa12345 NO MATCH
At the moment I have this (?<=[])]*)[a-z]{1}[0-9]{5}(?=[])]*) but it is not working for all scenarios, for example it sees aa12345 and a12345a as being matches when I don't want them to.
Can anyone help?
EDIT:
Apologies, I should have mentioned this is for .NET C#.

First of all, you should mention the programming language.
The following solution is for PCRE.
Regex: ((?<=[\[( ])|^)[a-z]\d{5}((?=[\]\) ])|$)
Explanation:
((?<=[\[( ])|^) checks for a preceding bracket or whitespace, OR the beginning of the line.
[a-z]\d{5} checks for a lowercase letter followed by 5 digits.
((?=[\]\) ])|$) checks for a following bracket or whitespace, OR the end of the line.
Regex101 Demo

Does this work?
(\[[a-z]\d{5}\])|(\([a-z]\d{5}\))|(\b[a-z]\d{5}\b)
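The patterns from both answers can be checked against the question's expected results. This sketch uses Python's re module rather than .NET, on the assumption that the two engines agree for these constructs; the names lookaround, alternation and expected are illustrative.

```python
import re

lookaround = re.compile(r"((?<=[\[( ])|^)[a-z]\d{5}((?=[\]\) ])|$)")
alternation = re.compile(r"(\[[a-z]\d{5}\])|(\([a-z]\d{5}\))|(\b[a-z]\d{5}\b)")

# Inputs from the question, mapped to whether a match is wanted.
expected = {
    "a12345": True,
    "(a12345)": True,
    "[a12345]": True,
    "text a12345": True,
    "aa12345": False,
    "a12345a": False,
}

lookaround_ok = all(bool(lookaround.search(s)) == want for s, want in expected.items())
alternation_ok = all(bool(alternation.search(s)) == want for s, want in expected.items())

print(lookaround_ok, alternation_ok)           # True True
print(lookaround.search("(a12345)").group())   # a12345
print(alternation.search("(a12345)").group())  # (a12345)
```

The two patterns differ in what they return: the lookaround version yields only the code, while the alternation version includes the surrounding brackets in the match.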

Negative lookahead to match server directories not properly working

Given the following 3 example server paths, I am trying to create a skiplist for my FTP client via PCRE regular expressions, but I can't seem to get the desired result.
/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1
/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2
/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3
I want to skip all folders (not paths) that do not end with
-Publisher1
I am trying to create a working pattern with the help of this online help and this regex tester, but I don't get any further than this negative lookahead pattern
.*-(?!Publisher1)
But with this pattern all lines match, apparently because in each of them the substring before the pattern does not contain the pattern.
/subdir/subdir/.../Author1_-_Title1-(1234) -Publisher1
/subdir/subdir/.../Author2_-_Title2_(5678) -PUBLiSHER2
/subdir/subdir/.../Author3_-_Title3-4951 -publisher3
What is my mistake, and what would the correct pattern be to match only the second and third lines as lines to be skipped, while keeping the first line?
EDIT to make it clearer what to highlight and what not.
Everything from the beginning of the path to the last slash must be ignored (allowed).
Everything after the last slash that matches the defined regex must be skipped.
EDIT to present an advanced pattern matching only the red part
[^/]*(?<!-Publisher2)$
Debuggex Demo

The regex which you have used is:
.*-(?!Publisher1)
I will tell you what the fault in it is.
This regex matches any line that contains a - which is not followed by Publisher1. Do you notice the other - characters in your text, between author and title or after the title? Because of them, all the strings satisfy this condition. If you instead write the negative lookahead so that the hyphen is inside it together with Publisher1, you would expect the match to work.
So you move the hyphen inside the parentheses and make your regex like this:
^.*(?!-Publisher1)
But this will not work either, because here .* matches everything up to the end of the line, so when the lookahead runs there is no character left to look at and the negative lookahead always succeeds. Thus we will use a negative lookbehind:
.*(?<!-Publisher1)
What now? I have done everything, but I still cannot get it to work. Why is that?
Because a negative lookbehind looks back and tells you whether the current position is not preceded by -Publisher1.
This is complex, so just bear with me:
Suppose your string is
/subdir/subdir/.../Author1_-_Title1-(1234)-Publisher1
We do a negative lookbehind for -Publisher1. From the position after the final 1, i.e. at the end of the string, -Publisher1 is visible when we look back. BUT our condition is a negative lookbehind, so the engine backtracks one character to the left, to a position where it can no longer look back and say "Hey, I can see -Publisher1 from here", because from there only "-Publisher" is visible. Our condition is satisfied, and the regex matches the string up to that position.
So it is essential to bind the lookbehind to the end of the string, so that the engine cannot move one character to the left to find its match.
final regex:
.*(?<!-Publisher1)$
demo here : http://regex101.com/r/lE1vW2
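The three stages of this explanation can be reproduced with Python's re module, assuming it treats these lookarounds the same way PCRE does. The paths are the three examples from the question; the variable names are illustrative.

```python
import re

paths = [
    "/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1",
    "/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2",
    "/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3",
]

# 1) Original attempt: some hyphen not followed by Publisher1 always exists.
first_try = [bool(re.search(r".*-(?!Publisher1)", p)) for p in paths]

# 2) Unanchored lookbehind: the engine backtracks one character and succeeds.
unanchored = re.search(r".*(?<!-Publisher1)", paths[0]).group()

# 3) Anchored lookbehind: only paths not ending in -Publisher1 match.
anchored = [bool(re.search(r".*(?<!-Publisher1)$", p)) for p in paths]

print(first_try)                    # [True, True, True]
print(unanchored == paths[0][:-1])  # True: everything except the final "1"
print(anchored)                     # [False, True, True]
```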

This should suit your needs:
^.*(?<!-Publisher1)$
Debuggex Demo

I want to skip all folders that do not end with -Publisher1
You can use this negative lookahead based regex:
^(?!.*?-Publisher1$).+$
Working Demo

You could use the following regex in order to exclude lines containing Publisher1:
^((?!Publisher1).)*$
Online demo: http://regex101.com/r/gD8jK0
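The last two answers are not equivalent, which a short Python sketch can show. The fourth path below is hypothetical, added only to expose the difference: it contains Publisher1 in a parent directory but does not end with -Publisher1.

```python
import re

ends_with = re.compile(r"^(?!.*?-Publisher1$).+$")  # rejects paths ending in -Publisher1
contains = re.compile(r"^((?!Publisher1).)*$")      # rejects paths containing Publisher1

paths = [
    "/subdir-level-1/subdir-level-2/.../Author1_-_Title1-(1234)-Publisher1",
    "/subdir-level-1/subdir-level-2/.../Author2_-_Title2_(5678)-PUBLiSHER2",
    "/subdir-level-1/subdir-level-2/.../Author3_-_Title3-4951-publisher3",
    # Hypothetical path, not from the question:
    "/subdir-level-1/Publisher1-archive/Author4_-_Title4-(0000)-Publisher4",
]

# True means the path matches the pattern, i.e. it would be skipped.
skipped_by_suffix = [bool(ends_with.search(p)) for p in paths]
skipped_by_substring = [bool(contains.search(p)) for p in paths]

print(skipped_by_suffix)     # [False, True, True, True]
print(skipped_by_substring)  # [False, True, True, False]
```

On the question's three paths both patterns agree; they diverge only when Publisher1 appears somewhere other than the end.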
