Regex: match pattern but not certain word - regex

Is it possible to write a regex that matches [a-zA-Z]{2,4} but not the word test? Or do I need to filter this in several steps?

Sure, you can use a negative lookahead.
(?!test)[a-zA-Z]{2,4}
I don't know if you'll need it for what you're doing, but note that you may need to use start and end anchors (^ and $) if you're checking that an entire input matches that pattern. Otherwise, it could match something like ouaeghAEtest because it will still find four chars somewhere that aren't "test".
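A minimal Python sketch of the anchoring point above, assuming the standard re module. The helper name is_allowed is made up for illustration, and the lookahead is written as (?!test$) so that, once the anchors are in place, it rejects exactly the word test.

```python
import re

# Unanchored: the lookahead only blocks a match that *starts* at "test".
UNANCHORED = re.compile(r"(?!test)[a-zA-Z]{2,4}")

# Anchored: the whole input must be 2-4 letters and must not be "test".
ANCHORED = re.compile(r"^(?!test$)[a-zA-Z]{2,4}$")


def is_allowed(word: str) -> bool:
    """Return True if word is 2-4 ASCII letters and is not exactly 'test'."""
    return ANCHORED.search(word) is not None


if __name__ == "__main__":
    # The unanchored pattern still finds a match inside unwanted input.
    print(UNANCHORED.search("ouaeghAEtest").group())  # ouae
    print(UNANCHORED.search("test").group())          # est
    for word in ["ab", "abcd", "test", "ouaeghAEtest"]:
        print(word, is_allowed(word))
```

Without anchors, even the input test produces a match (est), because the engine simply retries one character later.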

[A-Za-su-z][A-Za-df-z]{0,1}[A-Za-rt-z]{0,1}[A-Za-su-z]{0,1}
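Just an idea; I haven't tried it in real code. Be aware that it excludes each letter of "test" position by position, so when anchored it also rejects words like "tab" or "best", and it accepts single letters.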

Related

regex look ahead behind (look around) negative problems

I am having trouble understanding negative regex lookahead / lookbehind. I got the impression from reading tutorials that when you set a criterion to look for, the criterion doesn't form part of the search match.
That seems to hold for the positive lookahead examples I tried, but when I tried these negative ones, they matched the entire test string. First, they shouldn't have matched anything; second, even if they did, the match wasn't supposed to include the lookahead criterion.
(?<!^And).*\.txt$
with input
And.txt
See: https://regex101.com/r/vW0aXS/1
and
^A.*(?!\.txt$)
with input:
A.txt
See: https://regex101.com/r/70yeED/1
PS: if you're going to ask me which language, I don't know. We've been told to use regex without reference to any specific language. I tried clicking the various options on regex101.com and they all came up the same.
Lookarounds only try to match at their current position.
You are using a lookbehind at the beginning of the string (?<!^And).*\.txt$, and a lookahead at the end of the string ^A.*(?!\.txt$), which won't work. (.* will always consume the whole string as its first match.)
To disallow "And", for example, you can put the lookahead at the beginning of the string with a greedy quantifier .* inside it, so that it scans the whole string:
^(?!.*And).*\.txt$
https://regex101.com/r/1vF50O/1
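A small Python sketch of why the original patterns match and how a lookahead at the start behaves. It assumes the standard re module and a leading ^ anchor on the lookahead pattern (without it, the engine can retry from the second character and match "nd.txt"); the helper name accepts is hypothetical.

```python
import re

# Original attempts: each lookaround runs at a position where it has
# nothing to reject (before the first character, or after .* has
# already consumed the whole string), so the full input matches.
print(re.search(r"(?<!^And).*\.txt$", "And.txt").group())  # And.txt
print(re.search(r"^A.*(?!\.txt$)", "A.txt").group())       # A.txt

# Lookahead at the start of the string: it scans ahead for "And"
# before anything is consumed.
NO_AND = re.compile(r"^(?!.*And).*\.txt$")


def accepts(name: str) -> bool:
    """Return True for .txt names that do not contain 'And' anywhere."""
    return NO_AND.search(name) is not None


if __name__ == "__main__":
    for name in ["And.txt", "notes.txt", "MyAnd.txt", "notes.pdf"]:
        print(name, accepts(name))
```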
Your understanding is correct and the issue is not with the lookbehind/lookahead. The issue is with .*, which matches the entire string in both cases. The period . matches any character, and following it with * makes it match a string of any length. Remove it and both of your regexes will work:
(?<!^And)\.txt$
^A(?!\.txt$)
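A quick check of these two patterns in Python (standard re module assumed; the sample file names are invented). Note that without .* the match itself is only the part outside the lookaround, which is what the question expected.

```python
import re

LOOKBEHIND = re.compile(r"(?<!^And)\.txt$")  # ".txt" not preceded by a leading "And"
LOOKAHEAD = re.compile(r"^A(?!\.txt$)")      # leading "A" not followed by ".txt" at the end

if __name__ == "__main__":
    print(LOOKBEHIND.search("And.txt"))            # None
    print(LOOKBEHIND.search("Band.txt").group())   # .txt
    print(LOOKAHEAD.search("A.txt"))               # None
    print(LOOKAHEAD.search("A.doc").group())       # A
```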

Regex matching "java.util" but not "java.util.Collections"

I have a regex:
(abc|xyz|java\.util|)
However, I would like to ignore java.util.Collections. I'm stumped as to how to do this.
It's as simple as not matching a dot: [^.]
Of course, there might be other solutions that work better for you, depending on things like if that's the whole string, if a character is guaranteed to come after it, etc. If you give some more details, I can be more specific.
For example, if it's an import statement, you could just match a semicolon by putting a literal semicolon after it. If you plan to use the bit immediately afterwards, use a negative lookahead: (?!\.) If the string will end after the util, anchor it to the end with $.
If you want to fail on only java.util.Collections but accept anything else, then you want to use the specific negative lookahead (?!\.Collections). If you want to only allow one thing (say Random), you can add (?:\.Random)? immediately after java.util in your current regex.
You could use the end-of-line anchor:
(abc|xyz|java\.util|)$
Or negative look-ahead
(abc|xyz|java\.util|)(?!\.)
You can use a negative lookahead and use a regex like this:
java\.util(?!\.Collections)
So, you can add the pattern to your regex and have:
(abc|xyz|java\.util(?!\.Collections)|)
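The two lookahead variants from these answers can be compared with a short Python sketch (standard re module assumed; the helper name mentions_java_util is hypothetical). Only the java\.util alternative is tested here, because the empty alternative in the original group would match every input.

```python
import re

# Reject only java.util.Collections; anything else after java.util is fine.
NOT_COLLECTIONS = re.compile(r"java\.util(?!\.Collections)")

# Reject java.util whenever another dot follows (any sub-package or class).
NO_DOT_AFTER = re.compile(r"java\.util(?!\.)")


def mentions_java_util(text: str) -> bool:
    """True if text contains java.util not followed by .Collections."""
    return NOT_COLLECTIONS.search(text) is not None


if __name__ == "__main__":
    for text in ["java.util", "java.util.Random", "java.util.Collections"]:
        print(text, mentions_java_util(text), NO_DOT_AFTER.search(text) is not None)
```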

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end with a DOT FOLLOWED BY A NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which provides an anchor after the number. Therefore, it must be a dot followed only by a number at the end... not that you specified that in your question.
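To see how the two versions differ, here is a small Python sketch run against the sample strings from the question (standard re module assumed; the names STRICT and AT_END and the extra string v1.2beta are made up for illustration).

```python
import re

STRICT = re.compile(r"^((?!\.[\d]+)[\w.])+$")   # no ".digits" anywhere in the string
AT_END = re.compile(r"^((?!\.[\d]+$)[\w.])+$")  # no ".digits" at the very end

samples = [
    "abcd1", "abcdf12", "abcdf124",
    "abcd1.0", "abcd1.134", "abcdf12.13", "abcdf124.2", "abcdf124.21",
]

# Keep only the strings that do not end with a dot followed by digits.
kept = [s for s in samples if AT_END.search(s)]

if __name__ == "__main__":
    print(kept)
    # A dot-number in the middle is rejected only by the first version.
    print(STRICT.search("v1.2beta") is not None, AT_END.search("v1.2beta") is not None)
```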
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*\.?[^0-9]*$
You may omit anchoring metasymbols ^ and $ if you're using function/tool that matches against whole string.
Explanation: This regex allows any symbols except a dot until an (optional) dot is found, after which all non-numeric symbols are allowed. It won't work for numbers in an improper format, as in the strings abcd1...3 or abcd1.fdfd2. It also won't work correctly for some strings with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambiguous).
Philosophical explanation: when using regular expressions, you often don't need to do exactly what your task seems to require, so even a simple regex will do the job. An abstract example: you have a file whose lines are either numbers or complicated names (without digits), and you want to filter out all the numbers; then simple filtering by [^0-9] - grep '^[0-9]' will do the job.
But if your task is more complex and requires validating the format and doing other fancy stuff on the data, why not use a simple script (say, in awk, python, perl or another language)? Or a short "hand-written" function, if you're implementing a stand-alone application. Regexes are cool, but they are often not the right tool to use.
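I would just use a simple negative look-behind anchored at the end. The pattern below is written with doubled backslashes, as it would appear inside a string literal; the raw regex is .*(?<!\.\d+)$, and it needs an engine that allows variable-length lookbehind (.NET does; many others do not):
.*(?<!\\.\\d+)$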
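Python's built-in re module accepts only fixed-width lookbehind, so the pattern above cannot be used there as written. A rough equivalent, swapping the lookbehind for a negative lookahead at the start of the string, might look like this (the helper name keeps is hypothetical):

```python
import re

# From the start of the string, assert that it does NOT end in ".digits".
NOT_DOT_NUMBER = re.compile(r"^(?!.*\.\d+$).*$")


def keeps(value: str) -> bool:
    """True if value does not end with a dot followed by digits."""
    return NOT_DOT_NUMBER.search(value) is not None


if __name__ == "__main__":
    for value in ["abcd1", "abcdf124", "abcd1.0", "abcdf124.21", "a.b"]:
        print(value, keeps(value))
```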

Negative integer Regex doesn't match

I have Googled it, and found the following results:
http://icfun.blogspot.com/2008/03/regular-expression-to-handle-negative.html
http://regexlib.com/DisplayPatterns.aspx?cattabindex=2&categoryId=3
With some (very basic) Regex knowledge, I figured this would work:
r\.(^-?\d+)\.(^-?\d+)\.mcr
For parsing such strings:
r.0.0.mcr
r.-1.5.mcr
r.20.-1.mcr
r.-1.-1.mcr
But I don't get a match on these.
Since I'm learning (or trying to learn) Regex, could you please explain why my pattern doesn't match (instead of just writing a new working one for me)? From what I understood, it goes like so:
Match r
Match a period
Match a prefix negative sign or not, and store the group
Match a period
Match a prefix negative sign or not, and store the group
Match a period
Match mcr
But I'm wrong, apparently :).
You are very close. ^ matches the start of a string, so it should only be located at the start of a pattern (if you want to use it at all - that depends on whether you will also accept e.g. abcr.0.0.mcr or not). Similarly, one can use $ (but only at the end of the pattern) to indicate that you will only accept strings that do not contain anything after what the pattern matches (so that e.g. r.0.0.mcrabc won't be accepted). Otherwise, I think it looks good.
The ^ characters are telling it to match only at the beginning of a line; since it's obviously not at the beginning of a line in either case, it fails to match. In this case, you just need to remove both ^s. (I think what you're trying to say is "don't let anything else be in between these", but that's the default except at the start of the regex; you would need something like .* to make it allow additional characters between them.)
Careful: ^ only means 'not' when it is the first character inside a character class, as in [^-]. Outside brackets, as in your pattern, it is still the start-of-line anchor, so it can never match in the middle of the string.
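The effect of the stray ^ anchors can be checked with a short Python sketch (standard re module assumed). The function name parse_region and the choice to anchor the fixed pattern with ^ and $ are assumptions for illustration, not part of the question.

```python
import re

# Pattern from the question: the ^ inside each group demands "start of
# string" in the middle of the input, so it can never match.
BROKEN = re.compile(r"r\.(^-?\d+)\.(^-?\d+)\.mcr")

# Same pattern with the inner ^ removed, anchored at both ends.
FIXED = re.compile(r"^r\.(-?\d+)\.(-?\d+)\.mcr$")


def parse_region(name: str):
    """Return the two integers from a name like 'r.-1.5.mcr', or None."""
    match = FIXED.search(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    for name in ["r.0.0.mcr", "r.-1.5.mcr", "r.20.-1.mcr", "r.-1.-1.mcr"]:
        print(name, BROKEN.search(name), parse_region(name))
```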

How to get the inverse of a regular expression?

Let's say I have a regular expression that works correctly to find all of the URLs in a text file:
(http://)([a-zA-Z0-9\/\.])*
If what I want is not the URLs but the inverse - all other text except the URLs - is there an easy modification to make to get this?
You could simply search and replace everything that matches the regular expression with an empty string, e.g. in Perl s/(http:\/\/)([a-zA-Z0-9\/\.])*//g
This would give you everything in the original text, except those substrings that match the regular expression.
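The same search-and-replace idea, sketched in Python rather than Perl (standard re module assumed; the function names and the sample text are invented). Splitting on the URL pattern is a second option if the non-URL pieces are wanted separately.

```python
import re

# URL pattern from the question, written without the capture groups.
URL = re.compile(r"http://[a-zA-Z0-9/.]*")


def strip_urls(text: str) -> str:
    """Return text with every URL match replaced by an empty string."""
    return URL.sub("", text)


def non_url_parts(text: str) -> list:
    """Return the non-empty pieces of text found between URL matches."""
    return [part for part in URL.split(text) if part]


if __name__ == "__main__":
    sample = "see http://example.com/a for details and http://foo.org too"
    print(strip_urls(sample))
    print(non_url_parts(sample))
```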
If for some reason you need a regex-only solution, try this:
((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)
I expanded the set of URL characters a little ([a-zA-Z0-9\/\.#?/%]) to include a few important ones, but this is by no means meant to be exact or exhaustive.
The regex is a bit of a monster, so I'll try to break it down:
(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))
The first portion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character so that we are sure we are at the end. A lookahead is used so that the non-URL character is sought but not captured. The whole thing is wrapped in a lookbehind (?<=...) to look for it as the boundary of the match, again without capturing that portion.
We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file (\A), followed by a negative lookahead to make sure there's not a URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)
Both of those checks are put in parentheses and OR'd together with the | character. After that, .+? matches the string we are trying to capture.
Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z). Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]). The end of the file is also a pretty good sign that we've reached the end of our match, so we should look for that, too, using \Z. As with the first big group, we wrap it in parentheses and OR the two possibilities together.
The | symbol requires the parentheses because its precedence is very low, so you have to explicitly state the boundaries of the OR.
This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of perl), so you might want to check out Start of String and End of String Anchors and Lookahead and Lookbehind Zero-Width Assertions.
Corrections welcome, of course!
If I understand the question correctly, you can use search/replace...just wildcard around your expression and then substitute the first and last parts.
s/^(.*)(your regex here)(.*)$/$1$3/
I'm not sure if this will work exactly as you intend, but it might help:
Whatever you place in the brackets [] will be matched against. If you put ^ within the bracket, i.e [^a-zA-Z0-9/.] it will match everything except what is in the brackets.
http://www.regular-expressions.info/