Exclude file extensions regular expression - regex

I have a regular expression for a URL check written in VBScript.
regLinkEx.Pattern = "(^|[\s>='])((((http|ftp|https):\/\/)?([а-яёa-z\-_]{1,})(\.[а-яёa-z\-_]{2,})*(\.([^exe|EXE|xml|XML|dll|DLL|ini|INI|bat|BAT|dat|DAT|bin|BIN|mif|MIF|txt|TXT|]){2,}|рф)+)([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)"
I exclude the file extensions I need to, but I also want the extension to match letters from a to z.
This is the part I want to change, and I'm trying to do it like this...
(\.[a-z]*([^exe|EXE|xml|XML|dll|DLL|ini|INI|bat|BAT|dat|DAT|bin|BIN|mif|MIF|txt|TXT|]){2,}|рф)+)
...but it doesn't work.
Can anyone help me?

In a regular expression, square brackets indicate a match for "any character within". So, for example, the regular expression [^exe|EXE|xml]{2,} matches two or more characters, none of which is in the set [exEXml|]; inside brackets the | is just another literal character.
If you are looking to exclude certain file extensions, use negative lookahead. Since negative lookaheads are zero-length, you can string them together to create behavior like "X followed by none of the following: EXE, XML, DLL" (the regex for this would be X(?!EXE)(?!XML)(?!DLL)).
As a side note, VBScript fully supports negative lookahead but does not support negative lookbehind (a significantly more complex and intensive behavior).
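As a rough illustration of chaining lookaheads, here is a minimal sketch in Python (its re module, not VBScript's RegExp, so the syntax may need adapting). The blocked list is copied from the question; the names BLOCKED, ALLOWED_EXT and has_allowed_extension are made up for this example, and the pattern checks only the final extension of a name, not the whole URL pattern from the question.

```python
import re

# Extensions to reject, copied from the question.
BLOCKED = ("exe", "xml", "dll", "ini", "bat", "dat", "bin", "mif", "txt")

# One zero-length negative lookahead per blocked extension, chained after the dot.
# Each lookahead ends in $ so that only a whole extension is rejected.
LOOKAHEADS = "".join("(?!{}$)".format(ext) for ext in BLOCKED)
ALLOWED_EXT = re.compile(r"\." + LOOKAHEADS + r"[a-z]{2,}$", re.IGNORECASE)

def has_allowed_extension(name):
    """Return True if name ends in a letters-only extension that is not blocked."""
    return ALLOWED_EXT.search(name) is not None
```

Because the lookaheads consume nothing, the [a-z]{2,} that follows them still starts matching right after the dot.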

Related

Regular expression, match anything but these strings

Within Splunk I have a number of field extractions for extracting values from URI stems. I have a few which match a specific pattern; I now want another regex which matches anything but these.
^/SiteName/[^/]*/(?<a_request_type>((?!Process)|(?!process)|(?!Assets)|(?!assets))[^/]+)
The regex above is what I have so far. I am expecting the negative lookaheads to prevent it from matching Process, process, assets or Assets. However, it seems that the [^/]+ after these lookaheads can then go ahead and match these strings anyway, resulting in this regex sometimes overriding the other regexes I wrote to accept these strings.
What is the correct syntax for me to make the regex match any string, other than those specified in the negative lookaheads?
Thanks!
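Negative lookaheads do not consume any of the string being searched. Separating them with | (OR) means the group succeeds as soon as any one of them passes, and at least one always does, so nothing is excluded. When you want multiple negative lookaheads to all hold, place them one after another with no separator. Try this: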
^/SiteName/[^/]*/(?<a_request_type>((?![Pp]rocess)(?![Aa]ssets))[^/]+)
Note that I have combined your lookaheads ([Pp]rocess and [Aa]ssets) to make the regular expression more concise.
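To see the difference between the two forms, here is a small sketch in Python. It is an assumption that Splunk's engine behaves the same way on these inputs; Python spells named groups (?P<name>...) instead of (?<name>...), and the names BROKEN, FIXED and request_type are invented for the example.

```python
import re

# Original form: the lookaheads are alternatives, so any one passing is enough.
BROKEN = re.compile(
    r"^/SiteName/[^/]*/(?P<a_request_type>"
    r"((?!Process)|(?!process)|(?!Assets)|(?!assets))[^/]+)"
)

# Chained form: every lookahead must pass at the same position.
FIXED = re.compile(
    r"^/SiteName/[^/]*/(?P<a_request_type>(?![Pp]rocess)(?![Aa]ssets)[^/]+)"
)

def request_type(pattern, uri):
    """Return the captured a_request_type, or None if the pattern does not match."""
    m = pattern.search(uri)
    return m.group("a_request_type") if m else None
```

With the URI /SiteName/x/Process/1, the first pattern still captures Process (because the (?!process) alternative passes), while the chained pattern does not match at all.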

Negative lookahead alternative

For a URL pattern such as this one:
/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a
I would like Google Analytics to match only the URLs with the a and b parameters in it:
/orderdetail.php?a=BYGhs5w8e9o&b=234844617545
And thus strip out:
&h=9827a
The main goal is to be able to set up a goal in Google Analytics which covers only the a and b parameters and ignores the h parameter.
Is there an easy way to accomplish this without a negative lookahead?
Standard regular expressions do not need negative lookahead for this. Just do a match and replace. Searching for:
(/detail\.php\?a=\w+&b=\w+)&h=\w+
and replacing with \1 works with the regular expressions in Notepad++ version 6.5.5. Google's regular expressions may be subtly different.
This works by surrounding the wanted text with capturing parentheses and leaving the unwanted part outside. The . and the ? need escaping: unescaped, . matches any character and ? makes the previous item (i.e. the p) optional. The \w sequence means any "word" character, so \w+ means one or more word characters.
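The same capture-and-replace idea can be sketched in Python. This only illustrates the technique; whether Google Analytics offers an equivalent replace step is not something this example shows. The names H_PARAM and strip_h are hypothetical, and the dot in detail.php is escaped here so that it matches only a literal dot.

```python
import re

# Capture the part to keep in group 1; the trailing h parameter stays outside the group.
H_PARAM = re.compile(r"(/detail\.php\?a=\w+&b=\w+)&h=\w+")

def strip_h(url):
    """Replace the whole match with group 1, dropping the &h=... part."""
    return H_PARAM.sub(r"\1", url)
```

A URL without an h parameter does not match and is returned unchanged.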

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works (here is an example), so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)
Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same, but this version works in NPP.
Notice that I escaped the dot: since you are trying to match extensions, what you want is a literal dot (\.). Unescaped, as you had it, the dot is a wildcard that matches any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.
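Python's re module is another engine that, as far as I know, insists on fixed-width lookbehinds, so it can stand in for Notepad++ in a quick experiment. The sketch below is an illustration under that assumption; TEXT, FIXED and match_starts are names made up for the example.

```python
import re

TEXT = "casa.jpg flecha.jpg flecha1.jpg moto.jpg"

# A lookbehind whose alternatives differ in length is expected to be rejected.
try:
    re.compile(r"(?<!flecha|flecha1)\.jpg")
    variable_length_ok = True
except re.error:
    variable_length_ok = False

# Two fixed-length lookbehinds in a row express the same condition.
FIXED = re.compile(r"(?<!flecha)(?<!flecha1)\.jpg")

# Offsets of each ".jpg" that is not preceded by flecha or flecha1.
match_starts = [m.start() for m in FIXED.finditer(TEXT)]
```

Only the extensions of casa.jpg and moto.jpg are matched here; as in the answer above, the match covers just the .jpg part, not the file name.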
Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
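The three steps of the temporary-marker strategy can be sketched in Python as follows. The function name find_other_jpgs is hypothetical, and the sketch relies on the same assumption as the text: that ..jpg does not already occur in the input.

```python
import re

def find_other_jpgs(text):
    """Return (names that are not flecha files, text with markers removed again)."""
    # Step 1: mark the names to skip by inserting an extra dot before .jpg.
    marked = re.sub(r"(flecha[0-9]*)(\.jpg)", r"\1.\2", text)
    # Step 2: a simple pattern now skips the marked names, which end in ..jpg.
    found = re.findall(r"\w+\.jpg", marked)
    # Step 3: remove the temporary marker.
    restored = marked.replace("..jpg", ".jpg")
    return found, restored
```

In an editor the middle step would be whatever edit you actually want to make; here it just collects the remaining file names.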
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
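As a quick check of the breakdown above, the following Python sketch runs the pattern over the six unwanted names plus two ordinary ones (casa.jpg and moto.jpg, borrowed from the original question). The variable names are made up for the example.

```python
import re

OTHER_JPG = re.compile(r"(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg")

NAMES = (
    "flecha.jpg flecha1.jpg flecha00501275696.jpg aflecha.jpg "
    "img_flecha9.jpg abcflecha556677.jpg casa.jpg moto.jpg"
)

# Whole file names that survive the negative lookahead.
kept = OTHER_JPG.findall(NAMES)
```

Here the lookbehind (?<!\S) is a single fixed-length character, so it works even in engines that reject variable-length lookbehinds.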
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg
Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.
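A short Python sketch of this lookahead version, for illustration only; the sample names (including flechazo.jpg) are invented to show that only the exact names flecha and flecha1 are excluded.

```python
import re

NOT_FLECHA = re.compile(r"\b(?!flecha1?\b)\w+\.jpg")

SAMPLE = "flecha.jpg flecha1.jpg casa.jpg flechazo.jpg"

# Whole file names are matched, not just the extension.
names = NOT_FLECHA.findall(SAMPLE)
```

The \b inside the lookahead is what lets a longer name such as flechazo.jpg through.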

Regular Expression to search for the "Not Existence" of a pattern at the end of a URL

I am writing a Rational Functional Tester (RFT) script using the Java language, where I am trying to create an object in my object map with a regular expression that does not match a certain pattern.
The URL which I want not to match will look something like:
http://AnyHostName/index.jsp?safe=active&q=arab&ie=UTF-8&oe=UTF-8&start=10
http://AnyHostName/index.jsp?safe=active&q=arab&ie=UTF-8&oe=UTF-8&start=40
http://AnyHostName/index.jsp?safe=active&q=arab&ie=UTF-8&oe=UTF-8&start=210
I tried using the below expression but since the end of the URL is also any number of two or more digits the expression failed to fulfill the need:
^.*(?<!\start=10)$ or ^.*(?<!\start=40)$ or ^.*(?<!\start=110)$
When I tried using \d+ to replace the number in the above patterns, the expression stopped working correctly.
Note: It is worth mentioning that using any Java code will not be possible, since the regular expression will be given to the tool (i.e. RFT) and used internally for matching.
Any help please on this matter?
Use this expression:
^(?:(?!start=\d+).)*$
It has the advantage that it also excludes the cases where start=10 appears in the middle of the URL (e.g. http://AnyHostName/index.jsp?safe=active&q=arab&start=210&ie=UTF-8&oe=UTF-8).
It can be slow, though, since it's checking the negative look-ahead for every character.
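The behaviour of this expression can be checked with a small Python sketch (Python rather than the Java engine RFT uses, on the assumption that both treat this pattern alike). NO_START and lacks_start_param are hypothetical names.

```python
import re

# Every character must sit at a position where "start=<digits>" does not begin.
NO_START = re.compile(r"^(?:(?!start=\d+).)*$")

def lacks_start_param(url):
    """Return True if the URL contains no start=<digits> anywhere."""
    return NO_START.match(url) is not None
```

The per-character lookahead is the source of the slowness mentioned above: the check is repeated at every position in the URL.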
Why not just match the URLs you want to exclude:
^http://AnyHostName/index\.jsp\?safe=active&q=arab&ie=UTF-8&oe=UTF-8&start=\d+$
(in a Java string literal, each backslash has to be doubled)
and negate the result with a "!" in your Java if statement?
For example: if (!m.matches())...
According to regular-expressions.info, a lookbehind in Java has to be of finite length, and \d+ has no upper bound.
I am not sure, but you can try
^.*(?<!start=\d{1,20})$
The quantifier {1,20} allows any number of digits from 1 to 20 and should meet the finite-length requirement. Note that there is no backslash before start: \s would match a whitespace character rather than the letter s.

Regular Expression to exclude set of Keywords

I want an expression that will fail when it encounters words such as "boon.ini" and "http". The goal would be to take this expression and be able to construct for any set of keywords.
^(?:(?!boon\.ini|http).)*$\r?\n?
(taken from RegexBuddy's library) will match any line that does not contain boon.ini and/or http. Is that what you wanted?
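Since the goal is to construct the expression for any set of keywords, here is one possible way to generate it, sketched in Python. The function name build_exclusion_pattern is made up, and the optional \r?\n? tail of the expression above is left out because the sketch tests one line at a time.

```python
import re

def build_exclusion_pattern(keywords):
    """Compile a pattern that matches only lines containing none of the keywords."""
    # re.escape makes characters such as the dot in boon.ini literal.
    alternation = "|".join(re.escape(word) for word in keywords)
    return re.compile(r"^(?:(?!" + alternation + r").)*$")

no_boon_or_http = build_exclusion_pattern(["boon.ini", "http"])
```

Calling no_boon_or_http.match(line) returns a match object for a line that passes and None for a line that contains any of the keywords.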
An alternative expression that could be used:
^(?!.*IgnoreMe).*$
^ = indicates start of line
$ = indicates the end of the line
(?! Expression) = indicates zero width look ahead negative match on the expression
The ^ at the front is needed; otherwise, when evaluated, the negative lookahead could start from somewhere within or beyond the 'IgnoreMe' text and make a match where you don't want it to.
e.g. If you use the regex:
(?!.*IgnoreMe).*$
With the input "Hello IgnoreMe Please", this will result in something like "gnoreMe Please", as the negative lookahead finds that there is no complete string 'IgnoreMe' after the 'I'.
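The effect of the missing ^ can be reproduced with a few lines of Python (used here only as a convenient engine to test with; the variable names are invented).

```python
import re

TEXT = "Hello IgnoreMe Please"

# Without ^ the engine retries at each position until the lookahead passes.
unanchored = re.search(r"(?!.*IgnoreMe).*$", TEXT)

# With ^ only position 0 is tried, and the lookahead fails there.
anchored = re.search(r"^(?!.*IgnoreMe).*$", TEXT)
```

Here unanchored.group() is the tail starting just after the 'I', and anchored is None.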
Rather than negating the result within the expression, you should do it in your code. That way, the expression becomes pretty simple.
\b(boon\.ini|http)\b
This would return true if boon.ini or http was anywhere in your string. It won't match words like httpd or httpxyzzy because of the \b (word boundary) anchors. If you want, you can just remove them and it will match those too. To add more keywords, just add more pipes.
\b(boon\.ini|http|foo|bar)\b
You might be well served by writing a regex that succeeds when it encounters the words you're looking for, and then inverting the condition.
For instance, in Perl you'd use:
if (!/boon\.ini|http/) {
    # the string passed!
}
^[^£]*$
The above expression excludes only the pound symbol (£): it matches any string that does not contain that character.
Which language/regexp library? I thought your question was about ASP.NET, in which case you can see the "negative lookahead" section of this article:
http://msdn.microsoft.com/en-us/library/ms972966.aspx
Strictly speaking, the negation of a regular expression still defines a regular language, but very few libraries, languages, or tools allow you to express it directly.
Negative lookahead may serve the same purpose, but the actual syntax depends on what you are using. Tim's answer is an example with (?...)
I used this (based on Tim Pietzcker answer) to exclude non-production subdomain URLs for Google Analytics profile filters:
^\w+-*\w*\.(?!(?:alpha(123)*\.|beta(123)*\.|preprod\.)domain\.com).*$
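For anyone who wants to try this filter outside Google Analytics, here is a minimal Python sketch. It assumes Python's engine treats the pattern the same way, and the host names in the usage note are invented examples around the placeholder domain.com.

```python
import re

PRODUCTION_ONLY = re.compile(
    r"^\w+-*\w*\.(?!(?:alpha(123)*\.|beta(123)*\.|preprod\.)domain\.com).*$"
)

def is_production(host):
    """Return True unless the host sits under an alpha, beta or preprod subdomain."""
    return PRODUCTION_ONLY.match(host) is not None
```

For example, www.domain.com passes, while www.alpha.domain.com, www.beta123.domain.com and www.preprod.domain.com are rejected.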
You can see the context here: Regex to Exclude Multiple Words