Regex for any string not ending on .js - regex

This has been driving me nuts. I'm trying to match everything that doesn't end in .js. I'm using perl, so ?<! etc. is more than welcome.
What I'm trying to do:
Do match these
mainfile
jquery.1.1.11
my.module
Do NOT match these
mainfile.js
jquery.1.1.11.js
my.module.js
This should be an insanely simple task, but I'm just stuck. I looked in the docs for both regex, sed, perl and was even fiddling around for half an hour on regexr. Intuitively, this example (/^.*?(?!\.js)$/) should do it. I guess I just stared myself blind.
Thanks in advance.

You can use this regex to make sure your match doesn't end with .js:
^(?!.+\.js$).+$
RegEx Demo
(?!.+\.js$) is a negative lookahead condition to fail the match if line has .js at the end.

This one should suit your needs:
^.*(?<![.]js)$

The simplest approach when you only have negative matching conditions is to construct a positive regex and then check that it doesn't match.
if ($string !~ /\.js$/)
{
print "Doesn't end in .js";
}
This is easier to understand and more efficient than a negative look-around.
Look-arounds are only needed when you need to mix positive and negative conditions (for example, "I need to match "foo" out of a string, but only when it is not followed by "bar"). Even then, sometimes it is easier to use multiple simple patterns and logic, rather than meeting all your requirements with one complex pattern.

Related

Transform negative regex lookahead to greedy needed

The task I'm trying to solve seems pretty simple - I need to choose all font-changing tags except for the particular one (AIGDT). I'm going to cut them out in order to simplify further text processing.
I'm trying to use negative regex lookahead like this:
Font='(?!(AIGDT))(.*)'
But for the single-line text sample:
<StyleOverride Font='Arial' FontSize='0,32971'>[</StyleOverride><StyleOverride FontSize='0,21558'> </StyleOverride><StyleOverride Font='AIGDT' Italic='False'>n</StyleOverride><DimensionValue/> <StyleOverride Font='Arial' FontSize='0,32971'>]</StyleOverride>
It returns single 200+symbol match ... while I'm expecting two 12-symbol matches (Font='Arial').
I believe this is because the lookahead is greedy.
Can anybody hint me to what is my mistake?
Thanks in advance.
How does Font='(?!(AIGDT))([^']+)' work for you?
Basically, narrow down the second capture to "anything but a single quote".
(Full disclosure: On my phone at the moment so I haven't run it, but in theory it works nicely)

capturing a repeating pattern with regex

I'm trying to match a pattern like this CODE-UH87H-98HSH-HB383-JWWB2U and I have the following regex pattern CODE\-[A-Z0-9]+\-[A-Z0-9]+\-[A-Z0-9]+\-[A-Z0-9]+ but is there a better way of doing this? I tried CODE(\-[A-Z0-9]+\-){4} and it didn't work
I tried CODE(\-[A-Z0-9]+\-){4} and it didn't work
That does require two dashes in succession. In full, it would be CODE\-[A-Z0-9]+\-\-[A-Z0-9]+\-\-[A-Z0-9]+\-\-[A-Z0-9]+\-. What you want is
CODE(\-[A-Z0-9]+){4}
You were almost there. CODE(\-[A-Z0-9]+){4} should work!
When the pattern between the dashes may contain any character, the following regex is even shorter:
CODE(-[^-]+){4}
Of course you may have to add \ for escaping before the dash depending on what regex engine you will use.

Regex lookahead

I am using a regex to find:
test:?
Followed by any character until it hits the next:
test:?
Now when I run this regex I made:
((?:test:\?)(.*)(?!test:\?))
On this text:
test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2
I expected to get:
test:?foo2=bar2&baz2=foo2
test:?foo=bar&baz=foo
test:?foo2=bar2&baz2=foo2
But instead it matches everything. Does anyone with more regex experience know where I have gone wrong? I've used regexes for pattern matching before but this is my first experience of lookarounds/aheads.
Thanks in advance for any help/tips/pointers :-)
I guess you could explore a greedy version.
(expanded)
(test:\? (?: (?!test:\?)[\s\S])* )
The Perl program below
#! /usr/bin/env perl
use strict;
use warnings;
$_ = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2";
while (/(test:\? .*?) (?= test:\? | $)/gx) {
print "[$1]\n";
}
produces the desired output from your question, plus brackets for emphasis.
[test:?foo2=bar2&baz2=foo2]
[test:?foo=bar&baz=foo]
[test:?foo2=bar2&baz2=foo2]
Remember that regex quantifiers are greedy and want to gobble up as much as they can without breaking the match. Each subsegment to terminate as soon as possible, which means .*? semantics.
Each subsegment terminates with either another test:? or end-of-string, which we look for with (?=...) zero-width lookahead wrapped around | for alternatives.
The pattern in the code above uses Perl’s /x regex switch for readability. Depending on the language and libraries you’re using, you may need to remove the extra whitespace.
Three issues:
(?!) is a negative lookahead assertion. You want (?=) instead, requiring that what comes next is test:?.
The .* is greedy; you want it non-greedy so that you grab just the first chunk.
You're wanting the last chunk also, so you want to match $ as well at the end.
End result:
(?:test:\?)(.*?)(?=test:\?|$)
I've also removed the outer group, seeing no point in it. All RE engines that I know of let you access group 0 as the full match, or some other such way (though perhaps not when finding all matches). You can put it back if you need to.
(This works in PCRE; not sure if it would work with POSIX regular expressions, as I'm not in the habit of working with them.)
If you're just wanting to split on test:?, though, regular expressions are the wrong tool. Split the strings using your language's inbuilt support for such things.
Python:
>>> re.findall('(?:test:\?)(.*?)(?=test:\?|$)',
... 'test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2')
['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']
You probably want ((?:test:\?)(.*?)(?=test:\?)), although you haven't told us what language you're using to drive the regexes.
The .*? matches as few characters as possible without preventing the whole string from matching, where .* matches as many as possible (is greedy).
Depending, again, on what language you're using to do this, you'll probably need to match, then chop the string, then match again, or call some language-specific match_all type function.
By the way, you don't need to anchor a regex using a lookahead (you can just match the pattern to search for, instead), so this will (most likely) do in your case:
test:[?](.*?)test:[?]

Regex: Does not have/include pattern

I have a regex pattern to match an HTML script tag. How can I change this script tag pattern so that the patterns means "input string DOES NOT MATCH" the script tag pattern?
In other words, given a pattern, what is the alteration needed to change the meaning of the pattern to "does not match this pattern"?
For example, if I have a pattern: \d{3}-\d{3}-\d{4}, what is the equivalent pattern for this that means "does not match \d{3}-\d{3}-\d{4}"?
You can negate a regex pattern by using a negative lookahead. This is slightly different than simply negating the regex though. Negative lookahead would look like the following in Java (and many other languages):
(?!\d{3}-\d{3}-\d{4})
It should be noted that this doesn't exactly answer the question. Finding the inverse of a regular language is not an easy task using a regular expression (I don't think). A much easier way to solve the problem would be to inverse the program logic:
Instead of:
if (string.matches(yourRegex))
Do:
if (!string.matches(yourRegex))
That is not easily achievable for arbitrary patterns. In practice, it's almost always easier to do what you want in the surrounding code than in the pattern itself. For instance, instead of
grep '\d{3}-\d{3}-\d{4}' file
you could use
grep -v '\d{3}-\d{3}-\d{4|' file
Or in a program you could change something like
if (pattern.matches()) {
foo();
}
into something like
if (!pattern.matches()) {
foo();
}
In a more tedious approach, you would have to enumerate all possible values that should match instead of what should not match. So, say you want to match everything but the string <html>, you could write a regex like so:
([^<]|<([^h]|h([^t]|t([^m]|m([^l]|l[^>])))))
Reading that regex is like saying: "Okay, you can match any character but '<', or you could match '<' but then you can't match an 'h' after that... or you do match an 'h' after that but then you can't match a 't' after that... and so on.
It's butt ugly, but then again, for simple string matches, you can easily write a recursive function that transforms any given term into a pattern like the above.
easier to just negate the test surely? eg...
if (!regex.test(str)) ...
(javascript example)
Negating a character class is easy with ^ but a whole regex will get much more convoluted.
What language are you using? The easiest solution to the specific problem you stated is to simply prepend a negation operator (usually "!") to the match.
I definitely agree with the other answers saying you should negate testing for a match, but this should do what you want using just a regex:
(?!.*\d{3}-\d{3}-\d{4})
This is a negative lookahead, by not placing any characters outside of the lookahead the regex basically means "fail on any string that starts with any number of characters (.*) followed by the regex \d{3}-\d{3}-\d{4}".

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile('(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character" - alphanumeric characters. This is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
(?! is useful in practice. Although strictly speaking, looking ahead is not a regular expression as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line =~ !/Andrea/);