Generating regex to exclude a string in path

I'm trying to write a regex that matches all 'component.ts' files whose path starts with 'src' but excludes files that have a 'debug' folder in their path, using:
(src\/.*[^((?!debug).)*$]/*.component.ts)
I'm testing the following strings on regex101 tester:
src/abcd/debug/xtz/component/ddd/xyz.component.ts
src/abcd/arad/xtz/xyz.component.ts
Both these strings are giving a perfect match, even though the first one has 'debug' in its path. Where am I going wrong?

You are putting a negative lookahead (?! inside a character class. Inside [...] those characters lose their special meaning, so [^((?!debug).)*$] is a negated character class that matches a single character that is not one of the listed characters.
What you could do is move the negative lookahead to the beginning to assert that /debug/ does not occur anywhere in the string:
^(?!.*\/debug\/)src\/.*component\.ts$
Explanation
^ Assert the start of the string
(?!.*\/debug\/) Negative lookahead to assert that /debug/ does not occur anywhere to the right
src Match literally
\/.*component\.ts Match a forward slash followed by any character zero or more times followed by component.ts
$ Assert the end of the string
Note that to match the dot literally you have to escape it \. or else it would match any character.
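As a quick sanity check, here is a minimal Python sketch that runs the pattern above against the two paths from the question. Python is an assumption here (the question does not name a language), and the names PATTERN, paths and accepted are only illustrative.

```python
import re

# Pattern from the answer above; "\/" is simply an escaped "/".
PATTERN = re.compile(r"^(?!.*\/debug\/)src\/.*component\.ts$")

paths = [
    "src/abcd/debug/xtz/component/ddd/xyz.component.ts",  # contains /debug/
    "src/abcd/arad/xtz/xyz.component.ts",                 # no debug folder
]

# Keep only the paths the pattern accepts.
accepted = [p for p in paths if PATTERN.search(p)]
print(accepted)  # only the second path is printed
```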

Your regex matches:
src/
followed by zero or more non-newline characters
followed by one character that is not one of the characters listed in the class [^((?!debug).)*$], i.e. not (, ?, !, d, e, b, u, g, ), ., * or $
followed by zero or more slashes
followed by a non-newline character
followed by component
followed by a non-newline character followed by ts.
In other words, [^((?!debug).)*$] is not a negative lookahead, as you probably intended, but rather a negated character class.
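The breakdown above can be checked with a small Python sketch (Python is an assumption; the names ORIGINAL and CLASS_ONLY are illustrative). It shows that the bracketed part behaves as a one-character negated class, which is why the path containing debug still matches.

```python
import re

# The asker's original pattern, unchanged.
ORIGINAL = re.compile(r"(src\/.*[^((?!debug).)*$]/*.component.ts)")

# The bracketed part on its own: a negated class that matches ONE character.
CLASS_ONLY = re.compile(r"[^((?!debug).)*$]")

path = "src/abcd/debug/xtz/component/ddd/xyz.component.ts"

matches_debug_path = ORIGINAL.search(path) is not None   # the unwanted match
class_accepts_z = CLASS_ONLY.fullmatch("z") is not None  # "z" is not listed
class_accepts_d = CLASS_ONLY.fullmatch("d") is not None  # "d" is listed
print(matches_debug_path, class_accepts_z, class_accepts_d)  # True True False
```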
We can rephrase the desired match to see what we need:
src
followed by one or more path segments, each of which is not equal to debug
followed by the filename
Which gives us:
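^src(?:/[^/]+(?<!/debug))+/[^/]+\.component\.ts$
(The lookbehind includes the leading slash so that only a segment exactly equal to debug is rejected; a plain (?<!debug) would also reject segments that merely end in debug, such as nodebug.)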
(Remember to escape the forward slashes if you’re using these in JavaScript.)
I added the ^ and $ because I assume you want the entire input to match. If you’re searching within a larger string, you can remove those and instead change both instances of [^/] to [^\n/].
By the way, there’s no need to place the entire regex inside parentheses, since the entire match is already available (as group 0) in most languages.
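The segment-by-segment approach can be tried out with a short Python sketch (Python and all names below are assumptions for illustration). It compares two forms of the lookbehind: (?<!debug), which rejects any segment ending in debug, and (?<!/debug), which rejects only a segment exactly equal to debug. The nodebug path is a hypothetical extra test case, not one from the question.

```python
import re

# Rejects any segment that ENDS in "debug" (e.g. "debug", "nodebug").
ENDS_WITH = re.compile(r"^src(?:/[^/]+(?<!debug))+/[^/]+\.component\.ts$")
# Rejects only a segment that is exactly "debug".
EXACT = re.compile(r"^src(?:/[^/]+(?<!/debug))+/[^/]+\.component\.ts$")

paths = [
    "src/abcd/debug/xtz/component/ddd/xyz.component.ts",
    "src/abcd/arad/xtz/xyz.component.ts",
    "src/abcd/nodebug/xyz.component.ts",  # hypothetical extra case
]

ends_with_ok = [p for p in paths if ENDS_WITH.match(p)]
exact_ok = [p for p in paths if EXACT.match(p)]
print(ends_with_ok)  # only the "arad" path
print(exact_ok)      # the "arad" and "nodebug" paths
```

Both variants reject the path with a debug folder; they differ only on segments that merely end in debug.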

Related

Regex to match string that does not contain slash

I am trying to set up a route using vue-router in a web app using regex to match the pattern. The pattern I am looking to match is any string that contains alphanumeric characters (and underscore) without slashes. Here are some examples (the first slash is just to show the string after the domain e.g. example.com/):
/codestack
/demo45
/i_am_long
Strings that should not match would be:
/data/files.xml
/share/home.html
/demo45/photos
The only regex I came up with so far is:
path: '/:Username([a-zA-Z0-9]+)'
That is not quite right because it matches all the characters except for the slash. Whereas I want to only match on the first set of alphanumeric characters (including underscore) before the first forward slash is encountered.
If a route contains a forward slash e.g. /data/files.xml then that should be a different regex route match. Therefore I also need a regex pattern to match the examples above containing slashes. Theoretically, they could contain any number of slashes e.g. /demo45/photos/holiday/2015/bahamas.

For the first part, you can match 1 or more word characters which will also match an underscore.
The anchors ^ and $ assert the start and end of the string.
^\w+$
For the second one, you can start the match with word characters followed by /.
To allow more forward slashes, you can optionally repeat that first pattern in a group.
The last part can be 1 or more word characters with an optional part matching a dot and word characters.
^\w+/(?:\w+/)*\w+(?:\.\w+)?$
If you want to match any character except / and whitespace, you can use [^/\s]:
^(?:[^/\s]+/)+[^/\s]+$
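To see how the two patterns divide the example paths, here is a minimal Python sketch. It only exercises the bare regexes, not vue-router itself; Python, the classify helper and the assumption that the leading slash has already been stripped are all illustrative.

```python
import re

NO_SLASH = re.compile(r"^\w+$")
WITH_SLASHES = re.compile(r"^\w+/(?:\w+/)*\w+(?:\.\w+)?$")


def classify(path: str) -> str:
    """Return which of the two patterns (if any) accepts the path."""
    if NO_SLASH.match(path):
        return "single"
    if WITH_SLASHES.match(path):
        return "nested"
    return "none"


# Examples from the question, written without the leading slash.
examples = ["codestack", "demo45", "i_am_long",
            "data/files.xml", "share/home.html",
            "demo45/photos/holiday/2015/bahamas"]

results = {p: classify(p) for p in examples}
for path, kind in results.items():
    print(f"{path}: {kind}")
```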

Regular Expression to match extension file depending of the drive letter

I have this regex which can detect specific extension file,
([a-zA-Z0-9\s_\\.\-\(\):])+(.cmd|.exe|.bat)$
but I would like to change it so that it never applies to c:\ , the goal is to detect these extension files only on secondary or external drives
Example
D:\test.bat match
c:\test.bat does not match
Thank you

In the pattern that you tried, you have to escape the dot before the extension to match it literally. Inside the character class, on the other hand, you don't have to escape the dot or the parentheses.
Note that \s could also match a newline.
For the listed examples, you can make use of a negative lookahead, if supported, to rule out c:\ or C:\
Without the capture groups, to get a match only:
^(?![cC]:\\)[a-zA-Z0-9\s_\\.():-]+\.(?:cmd|exe|bat)$
^ Start of string
(?![cC]:\\) Negative lookahead to assert what is directly to the right is not c:\ or C:\
[a-zA-Z0-9\s_\\.():-]+ Match 1+ times any of the listed in the character class
\.(?:cmd|exe|bat) Match a dot, and 1 of the alternatives
$ End of string
Or with the capture groups:
^(?![cC]:\\)([a-zA-Z0-9\s_\\.():-]+)(\.(?:cmd|exe|bat))$
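A minimal Python sketch of the lookahead version, assuming Python's re module (the question does not say which engine is used). Apart from the two examples from the question, the paths below are made up for illustration.

```python
import re

# Non-capturing variant from the answer above.
NOT_ON_C = re.compile(r"^(?![cC]:\\)[a-zA-Z0-9\s_\\.():-]+\.(?:cmd|exe|bat)$")

paths = [
    r"D:\test.bat",           # from the question: should match
    r"c:\test.bat",           # from the question: should not match
    r"C:\tools\run.exe",      # hypothetical: upper-case C drive
    r"E:\scripts\build.cmd",  # hypothetical: another secondary drive
    r"D:\notes.txt",          # hypothetical: wrong extension
]

flagged = [p for p in paths if NOT_ON_C.match(p)]
print(flagged)
```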

Assuming every path is in a separate line based on the $ you included in your pattern, here's a very simple solution you can build upon:
^[^cC].*(cmd|exe|bat)$
Explanation:
^ matches the beginning of a line.
[^cC] matches any single character except c or C.
.* matches any character except line terminators, zero or more times.
(cmd|exe|bat) matches your extensions. Note that the preceding .* can consume the dot but does not require one, so a name such as D:\testbat would also match; use \.(cmd|exe|bat) if the dot must be present.
$ matches end of line.
TL;DR: you forgot to match the beginning of your lines.
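A short Python sketch of this simpler pattern (Python is an assumption; D:\testbat is a made-up name). It also shows what changes when the dot before the extension is matched explicitly.

```python
import re

SIMPLE = re.compile(r"^[^cC].*(cmd|exe|bat)$")
WITH_DOT = re.compile(r"^[^cC].*\.(cmd|exe|bat)$")  # dot made mandatory

names = [r"D:\test.bat", r"c:\test.bat", r"D:\testbat"]

simple_hits = [n for n in names if SIMPLE.match(n)]
with_dot_hits = [n for n in names if WITH_DOT.match(n)]
print(simple_hits)    # D:\test.bat and D:\testbat
print(with_dot_hits)  # D:\test.bat only
```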

How negative lookahead works

I want to match a string not containing word "the"
The following solution looks logical to me:
^(?!.*the.*).*$
The following one (which I came across on SO) also works, but I cannot understand WHY it works:
^((?!the).)*$
In my view (?!the). should match a) ANY b) single character, which is then repeated by *, so the regex should match any string?
There is the great site I'm using for reference http://www.rexegg.com but no such example there

It basically matches one character at a time, and before each character it checks whether the literal string "the" starts at that position. If it does, the negative lookahead fails and the match is rejected.
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character)
( # Match the regular expression below and capture its match into backreference number 1
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
the # Match the characters “the” literally
)
. # Match any single character that is not a line break character
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character)

The above solution works, but it also rejects strings in which the characters the appear inside another word -- e.g., I was going there would be excluded. You need word boundaries if you want to match everything not containing the word the:
^((?!\bthe\b).)*$
or:
^(?!.*\bthe\b).*$

^((?!the).)*$
This checks at every position, before consuming a character, whether the is directly ahead. So in the string abcthe, after c the regex engine will see the and the lookahead will fail. Because of the ^ and $ anchors, the engine cannot make a complete match, so the regex fails and matches nothing. If you remove the $, it will match up to abc.
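The behaviour described in these answers can be reproduced with a minimal Python sketch (Python and the variable names are assumptions; "see the cat" is a made-up test string).

```python
import re

NO_THE = re.compile(r"^((?!the).)*$")            # no "the" anywhere
NO_WORD_THE = re.compile(r"^((?!\bthe\b).)*$")   # no standalone word "the"
NO_END_ANCHOR = re.compile(r"^((?!the).)*")      # same, without the "$"

# "abcthe": the lookahead fails right after "c", so the anchored match fails.
abcthe_rejected = NO_THE.match("abcthe") is None
# Without "$" the engine is satisfied with the part before "the".
prefix = NO_END_ANCHOR.match("abcthe").group(0)

# "there" contains the letters t-h-e but not the word "the".
there_rejected = NO_THE.match("I was going there") is None
there_accepted = NO_WORD_THE.match("I was going there") is not None
word_rejected = NO_WORD_THE.match("see the cat") is None

print(abcthe_rejected, prefix, there_rejected, there_accepted, word_rejected)
```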

Perl: Matching string not containing PATTERN

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:
/^(?:(?!PATTERN).)*$/; # Matches strings not containing PATTERN
Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.
Regular expression to match a line that doesn't contain a word? helps a lot in understanding, but why is the . in my example and the ?: and how do the outer parentheses work?
Can someone break up the regex and explain in simple words how it works?

Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
About newlines: there are two problems with your regex and newlines.
First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches.
Second, $ doesn't match just at the end of the string, it also can match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to only match strings with no whitespace; \z always only matches at the end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s.
So the correct general purpose regex is:
/^(?:(?!PATTERN).)*\z/s
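The same newline behaviour can be observed outside Perl. The sketch below uses Python as an assumed stand-in: re.DOTALL plays the role of the /s flag and Python's \Z plays the role of Perl's \z. The test strings other than "foo\nbar" and "foo\n" are made up.

```python
import re

s1 = "foo\nbar"
# Without DOTALL, "." stops at the newline, so the anchored match fails.
without_s = re.search(r"^(?:(?!baz).)*$", s1)
# With DOTALL (like /s), "." also matches "\n" and the match succeeds.
with_s = re.search(r"^(?:(?!baz).)*$", s1, re.DOTALL)

s2 = "foo\n"
# "$" also matches just before a trailing newline, so this still matches.
dollar = re.search(r"^(?:(?!\s).)*$", s2, re.DOTALL)
# "\Z" matches only at the very end, so the string with "\n" is rejected.
strict = re.search(r"^(?:(?!\s).)*\Z", s2, re.DOTALL)

# The "foo ... baz with no bar in between" use inside a larger pattern.
between = re.search(r"foo(?:(?!bar).)*baz", "foo qux baz")
blocked = re.search(r"foo(?:(?!bar).)*baz", "foo bar baz")

print(without_s is None, with_s is not None)
print(dollar is not None, strict is None)
print(between is not None, blocked is None)
```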

jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *
What does this group do? It contains
a negative lookahead (you may want to read up on lookarounds) asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.

Regular expression doesn't match if a character participated in a previous match

I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!

In a consecutive match, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using a look-around assertion, which does not consume the characters it checks, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
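The difference between consuming the trailing character and only looking ahead at it can be seen with a minimal Python sketch (Python is an assumption; the query string is the one from the question).

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Original pattern: the character after the "+" is consumed by each match.
consuming = re.findall(r"(?:\S)\++(?:\S)", query)
# Lookahead version: the following character is checked but not consumed.
lookahead = re.findall(r"\S\++(?=\S)", query)
# Simplest version: just the runs of "+".
plain = re.findall(r"\++", query)

print(consuming)  # ['s+n', 'e+c', 's+e']
print(lookahead)  # ['s+', 'e+', 's+', 'e+']
print(plain)      # ['+', '+', '+', '+']
```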

You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it - older JavaScript engines, before ES2018, don't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are directly preceded or followed by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.

You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
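The two lookbehind variants differ at the edges of the string, which a small Python sketch can show (Python is an assumption; "+a+b +c" is a made-up test string chosen to have a plus at the start and one after a space).

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Positive lookarounds: a non-whitespace character is REQUIRED on both sides.
both_sides = re.findall(r"(?<=\S)\++(?=\S)", query)

sample = "+a+b +c"
# Negative lookarounds: only whitespace is forbidden, so index 0 qualifies.
negative = [m.start() for m in re.finditer(r"(?<!\s)\++(?!\s)", sample)]
# Positive lookarounds: the leading "+" has nothing before it, so it is skipped.
positive = [m.start() for m in re.finditer(r"(?<=\S)\++(?=\S)", sample)]

print(both_sides)  # ['+', '+', '+', '+']
print(negative)    # [0, 2]
print(positive)    # [2]
```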
