regex: ignore potential matches after a "#" - regex

EDIT: Although I've marked this question with the java tag, I don't want a solution that requires java code. I just would like the pattern to be compatible with Java's regex implementation if possible (which unfortunately is not quite PCRE compatible). What I would like is just a single regex that produces the matches I want.
Suppose I have this string:
foo bar foo bar # foo bar foo bar
I'd like to match instances of "foo", but only if they are not after any "#" symbol (if one is present). In other words, I want this result:
foo bar foo bar # foo bar foo bar
^^^ ^^^
I tried using a negative look-behind like this:
(?<!#.*)\bfoo\b
...but this doesn't work because a look-behind cannot be of variable length. Any suggestions?

This one should do the job:
(?=.*#) is a lookahead that succeeds only while a "#" still lies ahead, so only text before the "#" can match.
The global flag "g" repeats the pattern to find every match.
/(?=.*#)(\bfoo\b)/g
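A minimal sketch of the same lookahead idea in Python, assuming one line of input at a time. The helper name foo_starts is made up for illustration. The lookahead is placed after the word here, which gives the same result, and the second call shows the limitation: when the line has no "#", nothing matches at all. If a line held several "#" characters, a foo sitting between them would also be matched.

```python
import re

# Keep a "foo" only while a "#" still lies somewhere to its right.
PATTERN = re.compile(r"\bfoo\b(?=.*#)")

def foo_starts(line):
    """Return the start offsets of every "foo" that has a "#" after it."""
    return [m.start() for m in PATTERN.finditer(line)]

print(foo_starts("foo bar foo bar # foo bar foo bar"))  # [0, 8]
print(foo_starts("foo bar foo bar"))                    # [] because there is no "#"
```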

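You can use the replaceFirst method to remove everything from the # onward and then do a simple word match:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Strip "#" and the rest of the line, then look for the whole word "foo".
final Pattern pattern = Pattern.compile("\\bfoo\\b");
final Matcher matcher = pattern.matcher(input.replaceFirst("#.*$", ""));
while (matcher.find()) {
    System.err.printf("Found Match: %s%n", matcher.group());
}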

Java regex is not powerful enough to do this with a single regex.
Lookbehind is fixed width, so that's not a solution.
Lookahead is only applicable when you can be sure that there is a # in the string.
Java does not allow failing a match and then continuing the search after the rejected text (as SKIP/FAIL does in PCRE). After a failed attempt it always resumes at the character following the position where that attempt started.
Matching #.*|(\bfoo\b) and then checking whether the first capturing group is defined would be a workaround here, but there's no pure way to match only the \bfoo\b sequences.
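The capture-group workaround can be sketched as follows. This is Python rather than Java, and the function name foo_before_hash is an assumption for illustration; the pattern itself is the one from the answer. The first alternative consumes "#" and the rest of the line, so any foo in that region can never reach the second alternative.

```python
import re

# The first alternative swallows "#" and everything after it on the line;
# only the second alternative fills capturing group 1.
PATTERN = re.compile(r"#.*|(\bfoo\b)")

def foo_before_hash(line):
    """Return start offsets of "foo" occurrences that are not after a "#"."""
    return [m.start(1) for m in PATTERN.finditer(line) if m.group(1) is not None]

print(foo_before_hash("foo bar foo bar # foo bar foo bar"))  # [0, 8]
print(foo_before_hash("foo bar foo"))                        # [0, 8]
```

Unlike the lookahead version, this still finds matches when the line contains no "#", at the price of one extra check per match in the calling code.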

As others have already said, there is no way to do it with a single regex. But there is a workaround.
Select # and everything after it:
#.*
Copy the highlighted part and paste it inside the parentheses, in place of HERE:
foo(?=.*\QHERE\E)

Related

Regex to find and replace where the Key is the same as Value {foo:foo} to {foo}

Summary:
Use regex with placeholders to find where the key and value are the same, and replace with just the key (in my case leveraging ES6 object-property shorthand syntax to clean up thousands of lines of broken ES5 code, where I can't find an auto helper in the eslint rules for use with --fix).
Example:
module.exports = {
foo: foo,
bar: bar,
baz: someFunctionNotCalledBaz,
someOther: () => console.log('Defined directly. Not a reference to same name function.')
};
What I want (cleaning up old, broken code and ES6'ing a NodeJS project):
module.exports = {
foo,
bar,
baz: someFunctionNotCalledBaz,
someOther: () => console.log('Defined directly. Not a reference to same name function.')
};
I'm pretty familiar with regex, but I'm not sure this is even possible. Using Vim, or an IDE's replace-with-regex feature, I'd like to find a way to say:
Find all "word: word", regardless of spaces, and then the matching key on the value side:
(\w+)(:{1}\s{0,})(*SOMEHOW_REFERENCE_FIRST_MATCHING_GROUP_WITHIN_FIND*)
Replace with reference (using placeholder that would already work with matching group):
$1
Is this "lookback" even possible within the same regex? I did look at a bunch of other posts that matched my query, but to no avail.
This should do it:
sed -E 's/(.+): \1/\1/g' file
If you're unfamiliar with sed, the first part looks for strings that match the pattern (.+): \1, and the second part replaces each one with \1.
The \1 you see are backreferences; they refer to a capturing group. A capturing group is the text inside parentheses (here, (.+)).
(.+): \1 will locate any string of 1 or more characters followed by a colon and a space, and then the same string again.
And finally, sed will replace any matching string with \1, which is the part before the colon.
Hope this makes sense!
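For comparison, here is a hedged sketch of the same backreference idea in Python. It is not the sed command above: it uses \w+ with word boundaries instead of .+, and \s* instead of a single space, so that a pair like "foo: foobar" is left alone. The names SAME_KEY_VALUE and shorten are made up for this example.

```python
import re

# \1 inside the pattern must repeat exactly what group 1 captured.
# The trailing \b stops "foo: foobar" from being treated as "foo: foo".
SAME_KEY_VALUE = re.compile(r"\b(\w+)\s*:\s*\1\b")

def shorten(source):
    """Replace every "key: key" pair with just "key"."""
    return SAME_KEY_VALUE.sub(r"\1", source)

sample = """module.exports = {
  foo: foo,
  bar: bar,
  baz: someFunctionNotCalledBaz,
};"""
print(shorten(sample))
```

A value such as foo.bar would still be shortened after the key foo, so the result is worth reviewing before committing it.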

Regex Expression - End search at linebreak

I am using Regex to search a file and find strings that are "sandwiched" between two other strings. This is my current code:
openingstring.*?closingstring
The issue I am having is that it is searching across multiple lines in the file. Let's say I want to find anything between "foo" and "bar" and my file looks like so:
foo this is NOT the string I want
foo this is the string I want bar
My regex expression is returning both lines, when what I would like is for it to only return line #2.
How can I go about only getting strings where foo and bar are on the same line?
I should also note that this is not being done in a text editor, or in a programming language necessarily, but in a user interface for automation software.
"." is supposed to match any character except a newline. Which language are you using?
Anyway, you can try something like this:
foo[^\r\n]*bar
Note that the "?" after "*" does not mean "optional" here; it makes the quantifier lazy. Without it the match runs to the last bar on the line instead of the first.
Why not use the inline modifier (?m)?
(?m)foo.*bar
Or, to also override Singleline mode (in which the dot matches newlines), use (?m-s):
(?m-s)foo.*bar
This is a case where .*? can appear to be greedy: once the engine has found the first foo, it just keeps going until it finds the next bar. In this case that can only happen if the dot is in Dot-All mode, so you should try to turn that off. If you have no choice, use [^\r\n]*? instead of .*?.
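The effect of Dot-All mode can be reproduced in a few lines. This sketch uses Python's re module, where the mode is called re.DOTALL; the automation software in the question may name or toggle it differently.

```python
import re

text = "foo this is NOT the string I want\nfoo this is the string I want bar"

# With DOTALL the dot also matches "\n", so the match starts at the first foo.
across_lines = re.findall(r"foo.*?bar", text, flags=re.DOTALL)

# Without DOTALL the dot stops at "\n".
dot_default = re.findall(r"foo.*?bar", text)

# The explicit class never crosses a line break, whatever the mode.
explicit_class = re.findall(r"foo[^\r\n]*?bar", text, flags=re.DOTALL)

print(across_lines)    # one match spanning both lines
print(dot_default)     # ['foo this is the string I want bar']
print(explicit_class)  # ['foo this is the string I want bar']
```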
The Regex Engine will process Strings from "left-to-right".
Since your input string starts with foo, the engine starts matching there on its very first attempt. Nothing tells the engine that .*? must not run past the second foo, so it proceeds until it finds bar:
foo .*? bar
foo this is NOT the string I want foo this is the string I want bar
perfect match.
It is always a good idea to exclude the opening and closing strings from being matched inside the pattern, to achieve the shortest possible match.
The pattern foo((?!foo|bar).)*bar matches the text between foo and bar only if that text contains neither foo nor bar:
foo((?!foo|bar).)*bar
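A small check of this pattern, sketched in Python with Dot-All deliberately switched on to show that the lookahead alone keeps the match from starting at the first foo. The group is written as non-capturing, (?:...), only so that findall returns whole matches; that is a change from the pattern as posted.

```python
import re

text = "foo this is NOT the string I want\nfoo this is the string I want bar"

# Each character is consumed only if neither "foo" nor "bar" starts there.
TEMPERED = re.compile(r"foo(?:(?!foo|bar).)*bar", re.DOTALL)

matches = TEMPERED.findall(text)
print(matches)  # ['foo this is the string I want bar']
```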

RegEx lookaround to find Start>Foo where Foo bar never appeared

I apologize for the horrendous topic name, but I couldn't think of a way to further abstract this question. I have been racking my brain trying to figure out the RegEx syntax for this problem and poring over questions about lookarounds, but to no avail.
I want to return results from start to the first instance of foo (unless it is immediately followed by bar) OR the end of the file. Additionally, if foo bar appears before foo !bar or end of file, I do not want anything returned.
Below is what I have been working with so far. I may be completely off track; however, I am definitely looking to stay within RegEx unless it's completely impossible to do. I've already solved this problem without RegEx, but I'm trying to expand my understanding of RegEx, as it bothers me that I couldn't work out how to do this search. The RegEx implementation I am using is PCRE.
Currently this RegEx reports a result regardless of whether foo bar appears as the first foo or not. I feel as though I am missing some simple solution, but with negative lookbehind and other methods I have not been able to make the search return nothing when foo bar appears as the first foo, while still returning the cases where foo !bar appears on its own, appears before foo bar, or where no foo appears at all.
Current Search:
start(?:\n|\r|.)*?(?:\Z|foo(?! +bar))
Here are three example files and what I want the search to return, delimited by single quotes.
Example 1: Should not return anything.
Start
Text
Text
Foo Bar
Foo Doo
Example 2: Should return text between quotes.
'Start
Text
Text
Foo Doo
Foo' Bar
Example 3: Should return text between quotes.
'Start
Text
Text'
Thanks!
First you need to prevent "foo" from appearing in the content after "start". There are several ways to do that. A well-known way is to use (?:(?!foo).)* (you ensure that each character you match is not the beginning of the word you don't want). However, this way generally doesn't perform well, since a lookahead is tested at each position.
Another way is to take the first character of the word you want to avoid and build a negated character class with it. So you can describe the content like this:
(?>[^f]+|f(?!oo))*
The advantage of this approach is that it limits the number of lookahead tests, which are only performed when the first letter "f" is encountered. The drawback is that you need to hardcode the letter and the rest of the word in the pattern, or build the pattern dynamically from substrings of the word. (sprintf can be handy in this case.)
Then the whole pattern becomes:
start(?>[^f]+|f(?!oo))*(?:foo(?! bar)|\z)
pattern description:
start
(?> # open an atomic group
[^f]+ # all characters except f (one or more times)
| # OR
f(?!oo) # f not followed by oo
)* # repeat the group zero or more times
(?:
foo(?! bar) # "foo" not followed by a space and "bar"
| # OR
\z # end of the string
)
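To try the idea against the three example files, here is a sketch in Python. It swaps in the simpler (?:(?!foo).)* form mentioned above, because atomic groups need Python 3.11 or later, and it uses \Z because that is Python's end-of-string anchor. The case-insensitive flag and the helper name leading_block are assumptions made for this example.

```python
import re

# "start", then text in which no "foo" begins, then either a "foo" that is
# not followed by " bar" or the end of the string.
PATTERN = re.compile(
    r"start(?:(?!foo).)*(?:foo(?! bar)|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def leading_block(text):
    """Return the matched text, or None when "foo bar" comes first."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(leading_block("Start\nText\nText\nFoo Bar\nFoo Doo"))  # None
print(repr(leading_block("Start\nText\nText\nFoo Doo\nFoo Bar")))
print(repr(leading_block("Start\nText\nText")))
```

In the second example the match ends at the first Foo that is not followed by Bar, that is, the Foo of "Foo Doo".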
It's a little messy but here we go:
((?(?=.*Foo Bar)Start.*?Foo(?= Bar(?![\s]*$)(?!.*?foo (?!bar)))|.*))
NOTE: You would need to enable the 's' modifier to enable dot to match newline.
The output is in the first capturing group (\1). The detailed explanation is at the bottom.
As a general comment, it will probably be easier to do conditional (if/else) logic in code than in the regex. It will also be more readable and easier to maintain.
Hope it helps! :D
( # first capturing group
(? # if conditional
(?=.*Foo Bar) # if(foo bar exists in this file), using look ahead
Start.*?Foo # Match Start to the first instance of Foo
(?= # Look ahead
Bar # Match space and Bar
(?![\s]*$) # Match !(white spaces and end of line)
(?!.*?foo (?!bar))) # Match !(foo !bar)
| # else
.* # Match everything
)
)

Regex — only zero or one 's'

I have a name, "foo bar", and in any string, foo, foos, bar and bars should be matched.
I thought this should work like this: (foo|bar)s?. I tried some other regexes as well, but they all were like this. How can I do this?
(foo|bar)s? is correct...
You should use a boundary like \b(foo|bar)s?\b. Else it would also match hihellofoos.
Your question seems to reflect perplexity over why you found a match in foosss. Note the difference between finding a match in a string, and matching the whole string.
You have several ways of dealing with this, and the right choice depends on your application.
Anchor the regex to the whole input line or input: ^(foo|bar)s?$
Anchor the regex to one word: \b(foo|bar)s?\b
Some APIs (but not preg_match) have a separate function to match the whole string.
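The difference between finding a match inside a string and matching the whole string can be shown with Python's re module, which is one of the APIs that has a separate whole-string function (re.fullmatch). The variable names below are illustrative only.

```python
import re

UNANCHORED = re.compile(r"(foo|bar)s?")
WORD = re.compile(r"\b(?:foo|bar)s?\b")

# search() is satisfied by a match anywhere inside the string.
found_inside = UNANCHORED.search("foosss").group(0)

# fullmatch() succeeds only if the entire string fits the pattern.
whole_string = UNANCHORED.fullmatch("foosss")

# Word boundaries restrict matches to standalone words.
words = WORD.findall("foo foos bar bars foosss hihellofoos")

print(found_inside)  # foos
print(whole_string)  # None
print(words)         # ['foo', 'foos', 'bar', 'bars']
```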

Regex with exception of particular words

I have a problem with a regex.
I need a regex that makes an exception for a set of specified words, for example: apple, orange, juice.
Given these words, it should match everything except the words above.
applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)
apple (should not match)
orange (should not match)
juice (should not match)
If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):
rx = /^(?!apple$|orange$|juice$)/
I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.
So, let's build a set of characters that count as a "boundary":
/[^-a-z0-9A-Z_]/ // Will match any character that is <NOT> - _ or
// between a-z 0-9 A-Z
/(?:^|[^-a-z0-9A-Z_])/ // Matches the beginning of the string, or one of those
// non-word characters.
/(?:[^-a-z0-9A-Z_]|$)/ // Matches a non-word or the end of string
/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/
// This should >match< apple/orange/juice ONLY when it is not preceded/followed by a
// 'word' character (or -). Just negate the result of the test to obtain your desired
// result.
In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...
If you are only testing 'single word' tests you can go with a much simpler:
/^(apple|orange|juice)$/ // and take the negation of this...
This gets some of the way there:
((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)
\A(?!apple\Z|juice\Z|orange\Z).*\Z
will match an entire string unless it only consists of one of the forbidden words.
Alternatively, if you're not using Ruby, or you're sure that your strings contain no line breaks, or you have set the option so that ^ and $ do not match at the beginnings/ends of lines,
^(?!apple$|juice$|orange$).*$
will also work.
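A quick way to check the negative-lookahead pattern against the examples from the question, sketched in Python. The alternation is grouped as (?:apple|orange|juice)\Z, which is equivalent to repeating \Z after each word, and the function name is_allowed is made up for illustration.

```python
import re

# Reject the string only when it consists of exactly one forbidden word.
NOT_ONLY_FORBIDDEN = re.compile(r"\A(?!(?:apple|orange|juice)\Z).*\Z")

def is_allowed(word):
    """True unless the whole string is one of the forbidden words."""
    return NOT_ONLY_FORBIDDEN.match(word) is not None

for candidate in ["applejuice", "orange-apple-juice", "apple", "juice"]:
    print(candidate, is_allowed(candidate))
```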
Here's some easy copy-paste code that works for more than just exact-words exceptions.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex (the built-in re module supports (?>...) only from Python 3.11 on)
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)
Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means it's not allowed to backtrack: if that group matches once, but then later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case.)
The (?<exceptions_group_1> creates a named capture group. It's just easier than using numbers. Note that the pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to match always(?<=fail). That pattern (as it says) will always fail, because it looks for the word "always" and then checks whether the four characters just matched, "ways", equal "fail", which they never will.
Because the conditional fails, the engine would have to backtrack into the atomic group, and because the group is atomic it is not allowed to go back and try the normal pattern once it has already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question in Ruby
/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Unlike other methods, this one can be modified to reject any pattern, for example to match only words that do not contain the substring "apple", "orange", or "juice".
/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Something like this (PHP):
$input = "The orange apple gave juice";
if (preg_match("/your regex for validating/", $input) && !preg_match("/apple|orange|juice/", $input))
{
// it's ok;
}
else
{
//throw validation error
}