I apologize for the horrendous topic name, but I couldn't think of a way to further abstract this question. I have been racking my brain trying to figure out the RegEx syntax for this problem and poring over questions about lookarounds, but to no avail.
I want to return results from start to the first instance of foo (unless it is immediately followed by bar) OR the end of the file. Additionally, if foo bar appears before foo !bar or end of file, I do not want anything returned.
Below is what I have been working with so far. I may be completely off track; however, I am definitely looking to stay within RegEx unless it's completely impossible to do. I've already solved this problem using not RegEx, but I'm trying to expand my understanding of RegEx as it bothers me I couldn't work out how to do this search. Also the RegEx implementation I am using is PCRE.
Currently this RegEx will report regardless of whether foo bar appears as the first foo or not. I feel as though I am missing some simple solution but using negative lookbehind and other methods I've not been able to get the search to not return anything if foo bar appears as the first foo while also returning cases where foo !bar appears either on its own, before foo bar, or where no foo appears at all.
Current Search:
start(?:\n|\r|.)*?(?:\Z|foo(?! +bar))
Here are three example files and what I want the search to return, delimited by single quotes.
Example 1: Should not return anything.
Start
Text
Text
Foo Bar
Foo Doo
Example 2: Should return text between quotes.
'Start
Text
Text
Foo Doo
Foo' Bar
Example 3: Should return text between quotes.
'Start
Text
Text'
Thanks!
You first need to prevent "foo" from appearing in the content after "start". There are several ways to do that. A well-known one is (?:(?!foo).)* (each character you match is checked to ensure it is not the beginning of the word you want to exclude). However, this is generally not very efficient, since a lookahead is tested at every position.
Another way is to take the first character of the word you want to avoid and build a negated character class from it. The content can then be described like this:
(?>[^f]+|f(?!oo))*
The advantage of this approach is that the lookahead is only tested when the first letter "f" is encountered. The drawback is that you need to hardcode the letter and the rest of the word in the pattern, or build the pattern dynamically from substrings of the word (sprintf can be handy in this case).
Then the whole pattern becomes:
start(?>[^f]+|f(?!oo))*(?:foo(?! bar)|\z)
pattern description:
start
(?> # open an atomic group
[^f]+ # all characters except f (one or more times)
| # OR
f(?!oo) # f not followed by oo
)* # repeat the group zero or more times
(?:
foo(?! bar) # "foo" not followed by a space and "bar"
| # OR
\z # end of the string
)
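For anyone who wants to check this against the three example files, here is a rough Python sketch of the same idea. A few things are my adaptations, not part of the pattern above: Python's re module spells end-of-string as \Z, atomic groups only exist in Python 3.11+, so the sketch uses the unrolled form [^f]*(?:f(?!oo)[^f]*)* (which should accept the same text), and case-insensitive matching is assumed because the samples write "Start" and "Foo". The name first_section is made up for illustration.

```python
import re

# Unrolled variant of start(?>[^f]+|f(?!oo))*(?:foo(?! bar)|\z).
# Assumptions: \Z stands in for \z (Python spelling) and IGNORECASE is on
# because the sample files use "Start", "Foo" and "Bar".
PATTERN = re.compile(
    r"start[^f]*(?:f(?!oo)[^f]*)*(?:foo(?! bar)|\Z)",
    re.IGNORECASE,
)


def first_section(text):
    """Return the matched text, or None when the first foo is 'foo bar'."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


example1 = "Start\nText\nText\nFoo Bar\nFoo Doo"
example2 = "Start\nText\nText\nFoo Doo\nFoo Bar"
example3 = "Start\nText\nText"

for sample in (example1, example2, example3):
    print(repr(first_section(sample)))
```

The first sample gives no match, the second stops at the first Foo that is not followed by " Bar", and the third runs to the end of the string.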
It's a little messy but here we go:
((?(?=.*Foo Bar)Start.*?Foo(?= Bar(?![\s]*$)(?!.*?foo (?!bar)))|.*))
NOTE: You would need to enable the 's' modifier to enable dot to match newline.
The output is in the first capturing group (\1). The detailed explanation is at the bottom.
As a general comment, it will probably be easier to do the conditional (if/else) logic in code than in the regex. It will also be more readable and easier to maintain.
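To illustrate what "in code" could look like, here is a minimal Python sketch that uses plain string searches instead of a regex. The function name section_before_foo is hypothetical, a single space between foo and bar is assumed, and lowercasing is only a shortcut for case-insensitive matching on ASCII-like text.

```python
def section_before_foo(text):
    """Hypothetical helper: text from 'start' to the first foo, or None.

    Returns None when there is no 'start' or when the first foo is
    immediately followed by ' bar'. Assumes lower() keeps string length
    unchanged (true for ASCII input).
    """
    lowered = text.lower()
    start = lowered.find("start")
    if start == -1:
        return None
    foo = lowered.find("foo", start)
    if foo == -1:
        return text[start:]          # no foo at all: run to end of file
    if lowered.startswith(" bar", foo + 3):
        return None                  # first foo is "foo bar": return nothing
    return text[start:foo + 3]       # include the foo itself


print(repr(section_before_foo("Start\nText\nText\nFoo Bar\nFoo Doo")))
print(repr(section_before_foo("Start\nText\nText\nFoo Doo\nFoo Bar")))
print(repr(section_before_foo("Start\nText\nText")))
```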
Btw, you can try this regex here.
Hope it helps! :D
( # first capturing group
(? # if conditional
(?=.*Foo Bar) # if(foo bar exists in this file), using look ahead
Start.*?Foo # Match Start to the first instance of Foo
(?= # Look ahead
Bar # Match space and Bar
(?![\s]*$) # Match !(white spaces and end of line)
(?!.*?foo (?!bar))) # Match !(foo !bar)
| # else
.* # Match everything
)
)
Related
I want to select dotted string combinations from a very long string (SQL). The full string can be a single line or multiple lines separated by newlines, and the combination can appear on the first line, on a later line, or in both places.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add prefix or suffix words (or a space) that must be present wherever this logic is checked? In the case above, if the statement is an update the regex should pick up test.table, but if the statement is like select test.table the regex should not pick up this combination, and the same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
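If it helps, here is a small Python sketch that runs this pattern over the two samples from the question. The variable names are mine, and Python is just what I had at hand; the pattern itself is unchanged.

```python
import re

# [^\s]+\.[^\s]+ : a run of non-whitespace containing at least one dot
DOTTED = re.compile(r"[^\s]+\.[^\s]+")

sentence = "I am testing something like test.test.test in sentence."
sql = (
    "UPDATE test.table\n"
    "SET access = 01\n"
    "WHERE access IN (\n"
    "SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"
)

sentence_hits = DOTTED.findall(sentence)
sql_hits = DOTTED.findall(sql)
print(sentence_hits)  # ['test.test.test']
print(sql_hits)       # ['test.table', 'project.dataset.tablename']
```

The trailing "sentence." is not picked up because the pattern needs at least one non-whitespace character after the dot.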
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
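Here is a hedged Python sketch of the keyword version applied to Example 3. One caveat that is specific to Python and not to the pattern: Python's re only accepts fixed-width lookbehinds, which works here because INTO and FROM have the same length; adding a keyword of a different length such as UPDATE would need a different approach in Python.

```python
import re

# Only match dotted names that directly follow INTO or FROM plus one
# whitespace character; (?i) makes the keywords case-insensitive.
KEYWORD_DOTTED = re.compile(r"(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+")

sql = (
    "INS INTO test.table\n"
    "SEL 'abcscsc', wu_id.Item_Nbr ,1\n"
    "FROM test.table as_t\n"
    "WHERE as_t.old <> 0 AND as_t.date = 11\n"
    "AND (as_t.numb IN ('11') )"
)

tables = KEYWORD_DOTTED.findall(sql)
print(tables)  # ['test.table', 'test.table']
```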
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches word characters (\w) repeated between zero and unlimited times, followed by a dot, followed by another series of word characters repeated between zero and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): if you have something like '.word' or 'word..word', it will still match, which I assume does not occur in your input.
I have a multi record csv file (No CRLF) so it is a single string when it is imported.
Having imported the file I am trying to get a set of records.
There is a fixed number of fields to a record.
The first field of each record has a known value, say 'Foo'.
I want the regex to match on Foo then capture everything that is not Foo
I am assuming this will give me a match collection of Records which I can then process.
I have mucked around with RegexBuddy trying 'negative look aheads' from various posts on SO but I can't figure it out.
I guess I simply don't understand the construction of 'capture anything except foo'
This regex matches Foo and everything up to, but not including, the next Foo:
Foo(.*?)(?=Foo|$)
The "not Foo" part will be in group 1.
The key things here are:
.*? is a reluctant quantifier - it consumes as little as possible while still matching - needed to avoid consuming everything up to the last Foo in the input
(?=Foo|$) a look ahead for either Foo or end of input, so the last record, which isn't followed by Foo, is matched too.
Look aheads don't consume input, so the next Foo is left in the input ready for the next match.
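As a quick check, here is a small Python sketch of those points, using Foo as the only record marker as in the question. The sample data is invented, and DOTALL is only a precaution in case a field contains a line break.

```python
import re

# Foo, then as little as possible, stopping before the next Foo or the end.
RECORD = re.compile(r"Foo(.*?)(?=Foo|$)", re.DOTALL)

data = "Foo,a,b,c,Foo,d,e,f,Foo,g,h,i"   # made-up single-line CSV
records = RECORD.findall(data)            # findall returns group 1 per match
print(records)  # [',a,b,c,', ',d,e,f,', ',g,h,i']
```

Because the lookahead does not consume the next Foo, each match starts exactly where the previous one stopped.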
EDIT: Although I've marked this question with the java tag, I don't want a solution that requires java code. I just would like the pattern to be compatible with Java's regex implementation if possible (which unfortunately is not quite PCRE compatible). What I would like is just a single regex that produces the matches I want.
Suppose I have this string:
foo bar foo bar # foo bar foo bar
I'd like to match instances of "foo", but only if they are not after any "#" symbol (if one is present). In other words, I want this result:
foo bar foo bar # foo bar foo bar
^^^ ^^^
I tried using a negative look-behind like this:
(?<!#.*)\bfoo\b
...but this doesn't work because a look-behind cannot be of variable length. Any suggestions?
This one should do the work
(?=.*#) is a lookahead that only succeeds while a "#" still lies ahead, so only text before the "#" can match.
The global flag "g" repeats the pattern.
/(?=.*#)(\bfoo\b)/g
You can use the replaceFirst method to remove the text after # and then do a simple word match:
final Pattern pattern = Pattern.compile("\\bfoo\\b");
final Matcher matcher = pattern.matcher(input.replaceFirst("#.*$", ""));
while (matcher.find()) {
System.err.printf("Found Match: %s%n", matcher.group());
}
Java regex is not powerful enough for doing it with a single regex.
Lookbehind is fixed width, so that's not a solution.
Lookahead is only applicable when you can be sure that there is a # in the string.
Java does not allow failing a match and then continuing searching at the end (like with SKIP/FAIL in PCRE). It always continues at the character after the last matching start.
#.*|(\bfoo\b) and then checking if the first matching group is defined would be a workaround here, but there's no pure way to just match \bfoo\b sequences.
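A minimal sketch of that workaround, written in Python purely for illustration (the same pattern should carry over to Java's Matcher, where you would check whether group(1) is null). The variable names are mine.

```python
import re

# Either swallow "#" and everything after it (group 1 stays unset),
# or capture a standalone "foo" in group 1.
TOKEN = re.compile(r"#.*|(\bfoo\b)")

line = "foo bar foo bar # foo bar foo bar"
starts = [m.start(1) for m in TOKEN.finditer(line) if m.group(1) is not None]
print(starts)  # [0, 8]
```

Only the two foo occurrences before the # are reported; the rest of the line is consumed by the first alternative.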
There is no way to do it with a single regex as others said already. But there is a workaround for this.
Select # and every thing after:
#.*
Copy the highlighted part and paste it inside the parentheses in place of HERE:
foo(?=.*\QHERE\E)
I am using Regex to search a file and find strings that are "sandwiched" between two other strings. This is my current code:
openingstring.*?closingstring
The issue I am having is that it is searching across multiple lines in the file. Let's say I want to find anything between "foo" and "bar" and my file looks like so:
foo this is NOT the string I want
foo this is the string I want bar
My regex expression is returning both lines, when what I would like is for it to only return line #2.
How can I go about only getting strings where foo and bar are on the same line?
I should also note that this is not being done in a text editor, or in a programming language necessarily, but in a user interface for automation software.
"." is supposed to match any characters except new line, which language are you using?
Anyway, You can try something like this:
foo[^\r\n]*bar
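Note that the "?" after "*" does not mean "optional" here; it makes the quantifier lazy. With the negated class it only matters when a line contains more than one "bar".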
Why not use the inline modifier (?m)?
(?m)foo.*bar
Or, to override Singleline mode, use (?m-s); in most flavors it is the -s part that stops the dot from matching line breaks:
(?m-s)foo.*bar
This is a case where .*? can appear greedy: once it finds the first foo, it simply keeps going until it finds the next bar. That can only happen here if the dot . is in dot-all mode, so try to turn that off. If you have no choice, use [^\r\n]*? instead of .*?.
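To see the difference, here is a small Python sketch that forces dot-all mode on (re.DOTALL is Python's name for it; your automation tool may call it something else) and compares the two forms on the sample from the question.

```python
import re

text = "foo this is NOT the string I want\nfoo this is the string I want bar"

# With dot-all on, the lazy dot still crosses the line break.
across_lines = re.findall(r"foo.*?bar", text, re.DOTALL)

# The negated class cannot cross \r or \n, whatever the dot mode is.
same_line = re.findall(r"foo[^\r\n]*?bar", text, re.DOTALL)

print(across_lines)
print(same_line)  # ['foo this is the string I want bar']
```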
The Regex Engine will process Strings from "left-to-right".
Since your input string starts with foo, the engine will start to match at that point in the very first attempt. Nothing tells the engine, that it should not match the second foo with the expression .*? - so it proceeds until it finds bar:
foo .*? bar
foo this is NOT the string I want foo this is the string I want bar
perfect match.
It is always a good idea to exclude the opening and closing strings from being matched inside the pattern, to achieve the shortest possible match:
The pattern foo((?!foo|bar).)*bar will match anything between foo and bar only if the text in between contains neither foo nor bar:
foo((?!foo|bar).)*bar
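A short Python sketch of this pattern on the two-line sample, with dot-all deliberately switched on to show that the lookahead alone keeps the match from starting at the first foo. The second input is invented to show the single-line case.

```python
import re

# Each character between foo and bar is only consumed if neither "foo"
# nor "bar" starts at that position.
TEMPERED = re.compile(r"foo((?!foo|bar).)*bar", re.DOTALL)

text = "foo this is NOT the string I want\nfoo this is the string I want bar"
match = TEMPERED.search(text)
print(match.group(0))  # foo this is the string I want bar

one_line = TEMPERED.search("foo one foo two bar")
print(one_line.group(0))  # foo two bar
```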
I'm having trouble with lookaround in regex.
Here is the problem: I have a big file to edit, in which I want to replace one function with another, keeping the first parameter but removing the second one.
Let's say we have:
func1(paramIWantToKeep, paramIDontWant)
or
func1(func3(paramIWantToKeep), paramIDontWant)
I want to change it to:
func2(paramIWantToKeep) in both cases.
So I tried using a positive lookahead:
func1\((?=.+), paramIDontWant\)
For now, I am just trying not to select the first parameter (I'll then manage to do the same with the parenthesis).
But it doesn't work. It appears that, after the positive lookahead (.+), my regex looks for (, paramIDontWant\)) at the same position it was at before the lookahead (so right after the opening parenthesis).
So my question is: how do I continue a regex after a matching group, here after (.+)?
Thanks.
PS: Sorry for the english and/or the bad construction of my question.
Edit : I use Sublime Text
The first thing you need to understand is that a regex will always match a consecutive string. There will never be gaps.
Therefore, if you want to replace 123abc456 with abc, you can't simply match 123456 and remove it.
Instead, you can use a capturing group. This will allow you to remember a section of the regex for later.
For example, to replace 123abc456 with abc, you could replace this regex:
\d+([a-z]+)\d+
with this string:
$1
What that does is replace the match with the contents of the first capturing group. In this case, the capturing group was ([a-z]+), which matches abc. Thus, the entire match is replaced with just abc.
An example you may find more useful:
Given:
func1(foo, bar)
replacing this regex:
\w+\((\w+),\s*\w+\)
with this string:
func2($1)
results in:
func2(foo)
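The same two replacements, sketched in Python in case you want to try them outside the editor. Note that Python's re.sub writes the backreference as \1 where Sublime Text uses $1; the variable names are mine.

```python
import re

# Keep only the letters captured by group 1.
letters = re.sub(r"\d+([a-z]+)\d+", r"\1", "123abc456")
print(letters)  # abc

# Keep the first argument (group 1) and drop the second.
call = re.sub(r"\w+\((\w+),\s*\w+\)", r"func2(\1)", "func1(foo, bar)")
print(call)  # func2(foo)
```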
import re
t = "func1(paramKeep,paramLose)"
t1 = "func1(paramKeep,((paramLose(dog,cat))))"
t2 = "func1(func3(paramKeep),paramDont)"
t3 = "func1(func3(paramKeep),paramDont,((i)),don't,want,these)"
reg = r'(\w+\(.*?(?=,))(,.*)(\))'
keep,lose,end = re.match(reg,t).groups()
print(keep+end)
keep,lose,end = re.match(reg,t1).groups()
print(keep+end)
keep,lose,end = re.match(reg,t2).groups()
print(keep+end)
keep,lose,end = re.match(reg,t3).groups()
print(keep+end)
Produces
>>>
func1(paramKeep)
func1(paramKeep)
func1(func3(paramKeep))
func1(func3(paramKeep))
Apply these two regexes in this order:
s/(func1)([^,]*)(, )?(paramIDontWant)(.)/func2$2$5/;
s/(func2\()(func3\()(paramIWantToKeep).*/$1$3)/;
These cope with the two examples you gave. I guess the real-world code you are editing is slightly more complicated, but the general idea of applying a series of regexes might be helpful.
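For reference, a rough Python port of the two substitutions, applied in the same order. The function name rewrite is made up, and the s/// syntax is translated to re.sub with \N backreferences; the patterns are otherwise unchanged.

```python
import re


def rewrite(line):
    """Apply the two substitutions in order (hypothetical helper)."""
    # Step 1: func1(<first>, paramIDontWant) -> func2(<first>)
    line = re.sub(
        r"(func1)([^,]*)(, )?(paramIDontWant)(.)", r"func2\2\5", line
    )
    # Step 2: unwrap func3(...) around the parameter we keep.
    line = re.sub(
        r"(func2\()(func3\()(paramIWantToKeep).*", r"\1\3)", line
    )
    return line


print(rewrite("func1(paramIWantToKeep, paramIDontWant)"))
print(rewrite("func1(func3(paramIWantToKeep), paramIDontWant)"))
```

Both calls print func2(paramIWantToKeep); the second substitution simply finds nothing to do on the first input.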