Regex matching "java.util" but not "java.util.Collections" - regex

I have a regex:
(abc|xyz|java\.util|)
However, I would like to ignore java.util.Collections. I'm stumped as to how to do this.

It's as simple as not matching a dot: [^.]
Of course, there might be other solutions that work better for you, depending on things like if that's the whole string, if a character is guaranteed to come after it, etc. If you give some more details, I can be more specific.
For example, if it's an import statement, you could just match a semicolon by putting a literal semicolon after it. If you plan to use the bit immediately afterwards, use a negative lookahead: (?!\.) If the string will end after the util, anchor it to the end with $.
If you want to fail on only java.util.Collections but accept anything else, then you want to use the specific negative lookahead (?!\.Collections). If you want to only allow one thing (say Random), you can add (?:\.Random)? immediately after java.util in your current regex.
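To compare these variants, here is a minimal sketch in Python (chosen only for illustration; the lookahead syntax is the same in Java). The sample strings and the helper name `matches` are assumptions, and the trailing `$` in the third pattern is an addition here, since without an anchor the optional group would not exclude anything.

```python
import re

# Three variants of the "java.util" branch discussed above.
NO_DOT_AFTER = re.compile(r"java\.util(?!\.)")                # no dot may follow
NOT_COLLECTIONS = re.compile(r"java\.util(?!\.Collections)")  # only .Collections is rejected
ONLY_RANDOM = re.compile(r"java\.util(?:\.Random)?$")         # bare package or .Random, then end

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text (hypothetical helper)."""
    return pattern.search(text) is not None

samples = ["java.util", "java.util.Collections", "java.util.Random"]
results = {
    name: [matches(p, s) for s in samples]
    for name, p in [("no_dot", NO_DOT_AFTER),
                    ("not_collections", NOT_COLLECTIONS),
                    ("only_random", ONLY_RANDOM)]
}
```

Each list in `results` holds one boolean per sample string, in the order given in `samples`.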

You could use the end-of-line anchor:
(abc|xyz|java\.util|)$
Or a negative lookahead:
(abc|xyz|java\.util|)(?!\.)

You can use a negative lookahead and use a regex like this:
java\.util(?!\.Collections)
So, you can add the pattern to your regex and have:
(abc|xyz|java\.util(?!\.Collections)|)
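One thing worth checking with the combined pattern is the empty last alternative: because it matches the empty string, a search still "succeeds" on java.util.Collections, just with nothing matched. The sketch below, in Python for illustration, shows this; dropping the empty alternative is an assumption about what is intended, and the sample strings are made up.

```python
import re

# The pattern from above, including the empty last alternative.
WITH_EMPTY = re.compile(r"(abc|xyz|java\.util(?!\.Collections)|)")
# Same pattern with the empty alternative dropped (an assumption about intent).
WITHOUT_EMPTY = re.compile(r"(abc|xyz|java\.util(?!\.Collections))")

text = "java.util.Collections"
empty_hit = WITH_EMPTY.search(text)       # succeeds, but the match is empty
strict_hit = WITHOUT_EMPTY.search(text)   # no match at all

# A different class in the same package is still accepted.
ok_hit = WITHOUT_EMPTY.search("import java.util.List;")
```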

Related

Go regex, Negative Look Ahead alternative

I am trying to implement the regex (?<!\\{)\\[[a-zA-Z0-9_]+\\](?!\\}) with go regex.
It should match values like [ua] and [ua_enc], but not {[ua]} or {[ua_enc]}.
Since negative lookahead is not supported in Go, what would be an alternative expression for this?
There is no direct alternative expression for this. Using a plain (?:[^{]|^)(...)(?:[^}]|$) to capture the intended match and assert that the previous and next characters are not braces will kind of work, but you will need to work with the first capture group instead of the full match, and it will fail when there is only a single character between two matches (e.g. [foo]_[bar]), because that character is consumed by the first match. The best way, really, is to use FindAllStringSubmatchIndex and check the previous and next characters manually, outside of the regexp, to make sure they are not braces.
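The manual-check approach can be sketched as follows. This is written in Python purely to illustrate the logic (Python does support lookarounds, so it would not need this); in Go the start and end offsets would come from FindAllStringSubmatchIndex instead of `m.span()`. The helper name `unbraced_tokens` is hypothetical.

```python
import re

# Plain token pattern, with no lookaround at all.
TOKEN = re.compile(r"\[[a-zA-Z0-9_]+\]")

def unbraced_tokens(text):
    """Return tokens like [ua] that are not wrapped as {[ua]} (hypothetical helper)."""
    found = []
    for m in TOKEN.finditer(text):
        start, end = m.span()
        preceded_by_brace = start > 0 and text[start - 1] == "{"
        followed_by_brace = end < len(text) and text[end] == "}"
        # Mirror (?<!\{) ... (?!\}): reject if either neighbour is a brace.
        if not preceded_by_brace and not followed_by_brace:
            found.append(m.group(0))
    return found
```

Because no neighbouring character is consumed by the pattern itself, adjacent tokens such as [foo]_[bar] are both found.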

Regex: match pattern but not certain word

Is there a way to write a regex that matches [a-zA-Z]{2,4} but not the word test? Or do I need to filter this in several steps?
Sure, you can use a negative lookahead.
(?!test)[a-zA-Z]{2,4}
I don't know if you'll need it for what you're doing, but note that you may need to use start and end anchors (^ and $) if you're checking that an entire input matches that pattern. Otherwise, it could match something like ouaeghAEtest because it will still find four chars somewhere that aren't "test".
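The difference between the anchored and unanchored behaviour can be seen in a short sketch. It is in Python for illustration, where `fullmatch` stands in for wrapping the pattern in ^ and $; the helper name `is_valid` and the sample words are assumptions.

```python
import re

WORD = re.compile(r"(?!test)[a-zA-Z]{2,4}")

def is_valid(word):
    """True if the whole word is 2-4 letters and is not 'test' (hypothetical helper)."""
    # fullmatch requires the entire string to match, like anchoring both ends.
    return WORD.fullmatch(word) is not None

# Unanchored search happily finds four letters at the start of a longer string.
unanchored = WORD.search("ouaeghAEtest").group(0)
```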
[A-Za-su-z][A-Za-df-z]{0,1}[A-Za-rt-z]{0,1}[A-Za-su-z]{0,1}
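Just an idea; I haven't tried it in real code. Note that, when anchored, it also rejects other words that share a letter with "test" in the same position, such as "best".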

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
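Applied to the example string, the two expressions behave as follows. This is a minimal sketch in Python rather than C#, for illustration only; the variable names are assumptions.

```python
import re

url = "topics/install.xml#id_install"

before = re.search(r"^([^#]*)", url).group(1)   # everything up to the first '#'
after_match = re.search(r"#(.*)$", url)         # everything after the first '#'
after = after_match.group(1) if after_match else None

# With no '#' present, the first expression returns the whole string
# and the second finds nothing.
no_hash_before = re.search(r"^([^#]*)", "topics/install.xml").group(1)
no_hash_after = re.search(r"#(.*)$", "topics/install.xml")
```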
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that its regex syntax is PCRE-like. If so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
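The greedy behaviour described above is easy to reproduce. The following sketch uses Python's `re.match` for illustration; the sample string with two '#' signs is made up.

```python
import re

text = "a#b#c"

greedy = re.match(r"(.*)#.*", text).group(1)      # '.*' backtracks only to the LAST '#'
negated = re.match(r"([^#]*)#.*", text).group(1)  # stops at the FIRST '#'
whole = re.match(r"([^#]*)#.*", text).group(0)    # group(0) is the entire match
```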
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for a missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-capturing group to test the second half, and if it finds it, returns everything after the pound.
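The optional-section pattern can be tried out with a small sketch, written in Python for illustration. The helper name `split_fragment` is hypothetical, and the `re.DOTALL` flag is an assumption added so that the whole string always matches even if it contains newlines.

```python
import re

# '#' needs no escaping in Python outside verbose mode, so it is written plainly here.
SPLIT = re.compile(r"([^#]*)(?:#(.*))?", re.DOTALL)

def split_fragment(s):
    """Return (before, after); after is None when there is no '#' (hypothetical helper)."""
    m = SPLIT.fullmatch(s)
    return m.group(1), m.group(2)
```

Group 2 is `None` when the '#' is absent and an empty string when the '#' is the last character, so the two cases can be told apart.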
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ (edit: this is faster than /.*?(?=\#)/)
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
// The '#' is optional, so the second part may be missing.
string secondString = split.Length > 1 ? split[1] : null;
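The same no-regex idea, sketched in Python for illustration, can use `str.partition`, which splits on the first '#' only and copes with a missing separator. The variable names are assumptions.

```python
example = "topics/install.xml#id_install"

# partition splits on the first '#' only and never raises if it is missing.
first, sep, second = example.partition("#")

# When there is no '#', sep and the last part are both empty strings.
only, no_sep, nothing = "topics/install.xml".partition("#")
```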

How to simulate non-greedy quantifiers in languages that don't support them?

Consider this regex: <(.*)>
Applied against this string:
<2356> <my pal ned> <!#%#>
Obviously, it will match the entire string because of the greedy *. The best solution would be to use a non-greedy quantifier, like *?. However, many languages and editors don't support these.
For simple cases like the above, I've gotten around this limitation with a regex like this: <([^>]*)>
But what could be done with a regex like this? start (.*) end
Applied against this string:
start 2356 end start my pal ned end start !#%# end
Is there any recourse at all?
If the end condition is the presence of a single character you can use a negative character class instead:
<([^>]*)>
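The difference between the greedy pattern and the negated character class shows up directly on the sample string. This is a small sketch in Python, used here only for illustration (Python does support *?, so it would not need the workaround).

```python
import re

text = "<2356> <my pal ned> <!#%#>"

greedy = re.findall(r"<(.*)>", text)       # one match spanning the whole string
negated = re.findall(r"<([^>]*)>", text)   # one match per bracketed item
```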
For more complex cases where the end condition is multiple characters, you could try a negative lookahead, but if lazy matching is not supported, the chances are that lookaheads won't be either:
((?!end).)*
Your last recourse is to construct something horrible like this:
(en[^d]|e[^n]|[^e])*
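For engines that do have lookahead, the ((?!end).)* variant above can be checked against the sample string. The sketch below is in Python for illustration and wraps the repeated unit in a non-capturing group so that the capture holds the whole run; that wrapping is an adjustment made here, not part of the pattern as given.

```python
import re

text = "start 2356 end start my pal ned end start !#%# end"

# Greedy: '.*' runs to the last " end" in the string.
greedy = re.findall(r"start (.*) end", text)

# Each '.' is allowed only at positions where the text "end" does not begin.
tempered = re.findall(r"start ((?:(?!end).)*) end", text)
```

Note that "ned" in the second item is not cut short, because the lookahead only blocks positions where the literal text "end" starts.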
I replace . with [^>] where > in this case is the next character in the RE.

How to get the inverse of a regular expression?

Let's say I have a regular expression that works correctly to find all of the URLs in a text file:
(http://)([a-zA-Z0-9\/\.])*
If what I want is not the URLs but the inverse - all other text except the URLs - is there an easy modification to make to get this?
You could simply search and replace everything that matches the regular expression with an empty string, e.g. in Perl s/(http:\/\/)([a-zA-Z0-9\/\.])*//g
This would give you everything in the original text, except those substrings that match the regular expression.
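The same replace-with-nothing idea, sketched in Python rather than Perl for illustration, looks like this. The sample text is made up, and the URL pattern is the one from the question rewritten without capturing groups so that `split` returns only the non-URL pieces.

```python
import re

# Non-capturing version of the URL pattern from the question.
URL = re.compile(r"http://[a-zA-Z0-9/.]*")

text = "see http://a.com/x for docs and http://b.org now"

without_urls = URL.sub("", text)   # URLs deleted, everything else kept
pieces = URL.split(text)           # the non-URL text as separate chunks
```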
If for some reason you need a regex-only solution, try this:
((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)
I expanded the set of URL characters a little ([a-zA-Z0-9\/\.#?/%]) to include a few important ones, but this is by no means meant to be exact or exhaustive.
The regex is a bit of a monster, so I'll try to break it down:
(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))
The first portion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character so that we are sure we are at the end. A lookahead is used so that the non-URL character is sought but not captured. The whole thing is wrapped in a lookbehind (?<=...) to look for it as the boundary of the match, again without capturing that portion. Note that this lookbehind contains an unbounded +, which many engines (for example PCRE and Python's re) reject; .NET accepts it.
We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file (\A), followed by a negative lookahead to make sure there's not a URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)
Both of those checks are put in parentheses and OR'd together with the | character. After that, .+? matches the string we are trying to capture.
Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z). Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]). The end of the file is also a pretty good sign that we've reached the end of our match, so we should look for that, too, using \Z. As with the first big group, we wrap it in parentheses and OR the two possibilities together.
The | symbol requires the parentheses because its precedence is very low, so you have to explicitly state the boundaries of the OR.
This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of perl), so you might want to check out Start of String and End of String Anchors and Lookahead and Lookbehind Zero-Width Assertions.
Corrections welcome, of course!
If I understand the question correctly, you can use search/replace...just wildcard around your expression and then substitute the first and last parts.
s/^(.*)(your regex here)(.*)$/$1$3/
I'm not sure if this will work exactly as you intend, but it might help:
Whatever you place in the brackets [] will be matched against. If you put ^ at the start of the bracket, i.e. [^a-zA-Z0-9/.], it will match every character except those in the brackets.
http://www.regular-expressions.info/