Simple regex question.
I have a very basic expression built to pull text out between two words:
BEGN: (.*?)DETAIL:
This works fine when both words exist, but sometimes there is no "DETAIL:", and in those cases I just want to capture to the end of the text. Is that possible with a single expression, or do I need a conditional statement of some type?
The simplest approach is to put the trailing delimiter in a group that alternates with $ (the end-of-string anchor):
BEGN: (.*?)(?:DETAIL:|$)
BEGN: (.*?)(?=DETAIL:|$)
(?<=BEGN: ).*?(?=DETAIL:|$)
See the regex demo.
The (?:DETAIL:|$) is a non-capturing group that matches DETAIL: or the end of the string. The other two patterns are similar, except that the left- and right-hand delimiters are put into non-consuming lookarounds so that the text they match is omitted from the match value.
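As a rough illustration, the three patterns above can be tried in Python. The sample strings are made up for this sketch, and Python's re module is assumed to behave like the original demo engine for these constructs.

```python
import re

# The three variants from the answer; each stops at DETAIL: or at end of string.
CAPTURE = re.compile(r"BEGN: (.*?)(?:DETAIL:|$)")
LOOKAHEAD = re.compile(r"BEGN: (.*?)(?=DETAIL:|$)")
LOOKAROUND = re.compile(r"(?<=BEGN: ).*?(?=DETAIL:|$)")

# Hypothetical inputs, one with the trailing delimiter and one without.
with_detail = "BEGN: some text DETAIL: more"
without_detail = "BEGN: some text"

for text in (with_detail, without_detail):
    # Group 1 holds the wanted text for the first two patterns;
    # the lookaround-only pattern returns it as the whole match.
    print(repr(CAPTURE.search(text).group(1)))
    print(repr(LOOKAHEAD.search(text).group(1)))
    print(repr(LOOKAROUND.search(text).group(0)))
```

Note that the captured text keeps the space before DETAIL: when the delimiter is present; strip it afterwards if that matters.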
There are alternative solutions.
If the trailing delimiter can be absent, use a tempered greedy token or an unrolled one:
BEGN: ((?:(?!DETAIL:).)*)
See a regex demo
The (?:(?!DETAIL:).)* matches any text up to the first DETAIL:, or up to the end of the string if there is none. You may add a word boundary \b before D so as to only match DETAIL as a whole word.
If the text can span multiple lines, do not forget the DOTALL modifier. The unrolled version does not need it, because it contains no dot:
BEGN: ([^D]*(?:D(?!ETAIL:)[^D]*)*)
See another demo
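A minimal Python sketch of both alternatives, run on an invented two-line sample. The tempered greedy token needs re.DOTALL here, while the unrolled pattern crosses line breaks on its own because [^D] also matches newlines.

```python
import re

# Tempered greedy token: consume any char that does not start "DETAIL:".
TEMPERED = re.compile(r"BEGN: ((?:(?!DETAIL:).)*)", re.DOTALL)
# Unrolled equivalent: runs of non-D chars, plus any D not followed by "ETAIL:".
UNROLLED = re.compile(r"BEGN: ([^D]*(?:D(?!ETAIL:)[^D]*)*)")

# Hypothetical inputs; the second line starts with a D that is not a delimiter.
multiline = "BEGN: line one\nDay two DETAIL: rest"
no_delimiter = "BEGN: line one\nDay two"

for text in (multiline, no_delimiter):
    print(repr(TEMPERED.search(text).group(1)))
    print(repr(UNROLLED.search(text).group(1)))
```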
My regex (PCRE):
\b([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
It matches the following (the actual match is surrounded by stars):
**this.is.an.error**
**this.IsAnerror**
**this.is.an.error**.
**this.is.an.error**(
bla **this_is-an-error**
**this.is.an.error**:
this is an (**error**)
not a match:
this.is.an.error.but.dont.match
this.is.an.error-but.dont.match
this.is.an.error/but.dont.match
this.is.an.error/
/this.is.an.error
For this sample: /this.is.an.error
I can't manage to add a condition that rejects the whole match if it starts with the character /.
Every combination I've tried resulted in a partial match, which is not what I want.
Is there any simple or fancy way to do that?
You can try adding lookbehinds at the beginning instead of a word boundary:
(?<!\/)(?<=[^\w-.])([\w-.]*error)\b(?:[^-\/.]|\.\W|\.$|$)
Explanation:
(?<!\/) - negative lookbehind assuring there is no / before the first character;
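(?<=[^\w-.]) - a word boundary implementation that takes into account your extended definition of the characters accepted in a word, [\w-.]. Note that, being a positive lookbehind, it requires some character before the match, so it cannot succeed at the very start of the input.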
Demo
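The lookbehind idea can be sketched in Python with two adaptations that are assumptions of this sketch, not part of the answer: the two lookbehinds are folded into a single negative one, (?<![\w./-]), so that a match at the very start of the string is still allowed, and [\w-.] is written as [\w.-] because Python's re module rejects a hyphen placed right after \w inside a class. The helper name first_error is made up.

```python
import re

# Variant of the answer's pattern: one negative lookbehind rejects a
# preceding word char, dot, hyphen or slash, but allows start of string.
ERROR = re.compile(r"(?<![\w./-])([\w.-]*error)\b(?:[^-/.]|\.\W|\.$|$)")

def first_error(text):
    """Return the captured error token, or None if nothing qualifies."""
    m = ERROR.search(text)
    return m.group(1) if m else None

should_match = [
    "this.is.an.error", "this.IsAnerror", "this.is.an.error.",
    "this.is.an.error(", "bla this_is-an-error", "this.is.an.error:",
    "this is an (error)",
]
should_not_match = [
    "this.is.an.error.but.dont.match", "this.is.an.error-but.dont.match",
    "this.is.an.error/but.dont.match", "this.is.an.error/",
    "/this.is.an.error",
]

for sample in should_match + should_not_match:
    print(sample, "->", first_error(sample))
```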
Prepend \/.*| to your regex:
\/.*|\b([\w-.]*error)\b(?=[^-\/.]|(?:\.\W?)?$)
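Now, just like before, the first capturing group holds the desired part. When the \/.* alternative matches instead, it consumes everything from the slash to the end of the line and group 1 stays empty, so that match can be discarded.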
See live demo here
Note: I made some modifications to your regex to remove unnecessary alternations.
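A hedged Python sketch of this "match the unwanted part, keep only the group" trick. The class [\w-.] is written as [\w.-] because Python's re module does not accept the original spelling, and the helper name error_token is hypothetical.

```python
import re

# First alternative swallows anything that starts with a slash;
# only the second alternative fills group 1.
SKIP_OR_KEEP = re.compile(r"/.*|\b([\w.-]*error)\b(?=[^-/.]|(?:\.\W?)?$)")

def error_token(text):
    """Return group 1 of the first match, or None when it is unset."""
    m = SKIP_OR_KEEP.search(text)
    return m.group(1) if m else None

print(error_token("this.is.an.error."))   # group 1 is set
print(error_token("/this.is.an.error"))   # slash branch matched, group 1 unset
print(error_token("this.is.an.error/"))   # only the trailing slash matches
```

The caller has to check group 1 rather than the match object itself, since a slash-led match still counts as a match.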
I'm trying to match the first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Specifically, I want to capture the URL of the first window.location.replace entry in the whole HTML string.
To capture the URL, I formulated these two rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve this, I think I need to use a lookbehind (for the 1st rule) and a lookahead (for the 2nd rule).
I ended up with this regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me translate my rules into a regex? Other ways of doing this (without lookahead/lookbehind) are also appreciated.
The pattern you wrote is not the one you need, as it matches something very different from what you expect: the text window.location.redirect("=") in window.location.redirect("=") something. It will also only work in PCRE/Python if you remove the ? before \" (as lookbehinds must be fixed-width there). It will work with the ? in .NET regex.
If it is JS, lookbehinds are only available in engines that implement ES2018 or later; older JS regex engines do not support them at all.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
Omitting the /g modifier makes the regex match only the first occurrence. Access the value you need in Group 1.
The ([^"]*) captures 0+ characters other than a double quote (the URLs you need should not contain one). If your URLs can contain a ", use the second approach, as (.*?) will match any 0+ characters other than a newline up to the first ").
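For comparison, here is the same capturing-group approach sketched in Python, where re.search plays the role of a JS regex without /g and returns only the first occurrence. The HTML snippet and the second URL are invented for the example.

```python
import re

# Capturing-group versions of the two patterns from the answer.
STRICT = re.compile(r'window\.location\.redirect\("([^"]*)"\)')
LAZY = re.compile(r'window\.location\.redirect\("(.*?)"\)')

# Hypothetical HTML with two redirect calls.
html = (
    '<script>window.location.redirect("http://stackoverflow.com");</script>'
    '<script>window.location.redirect("http://example.com");</script>'
)

# search() stops at the first occurrence; group(1) is the URL.
first_url = STRICT.search(html).group(1)
first_url_lazy = LAZY.search(html).group(1)
print(first_url, first_url_lazy)
```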
I have <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] to catch everything inside <autorpodpis>_this_is_an_example_of_what_I'd_like_to_match<.
If there is a space, a comma (,) or a semicolon (;), or a space before a comma or a semicolon, my RegEx matches everything, including these characters (see my link). Otherwise it works as expected.
Overall, the RegEx works fine with substitution \1 (or $1, which I use in AutoHotKey). But I'd like to match without using substitution.
You seem to mix up the terms substitution (a regex-based replacement operation) and capturing (storing a part of the matched value, captured with a part of the pattern enclosed in a pair of unescaped parentheses, inside a numbered or named stack).
If you want to just match a substring in specific context without capturing any subvalues, you might consider using lookarounds (lookbehind or lookahead).
In your case, since you need to match a string after some known string, you need a lookbehind. A lookbehind tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.
So, you could use
pos := RegExMatch(input, "(?<=<autorpodpis>)\p{L}+(?:\s+\p{L}+)*", Res)
Res should then hold WOJCIECH ZAŁUSKA if you supply <autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis> as input.
Explanation:
(?<=<autorpodpis>) - check if there is <autorpodpis> right before the currently tested location. If there is none, fail this match, go on to the next location in string
\p{L}+ - 1+ Unicode letters
(?:\s+\p{L}+)* - 0+ sequences of 1+ whitespaces followed with 1+ Unicode letters.
However, in most cases you should use capturing instead. That is always true in cases like this one, where the pattern in the lookbehind is known, the lookbehind is unanchored (say, it is the first subpattern in the pattern), and you do not need overlapping matches.
The version with capturing in place:
pos := RegExMatch(input, "<autorpodpis>(\p{L}+(?:\s+\p{L}+)*)", Res)
And then Res[1] will hold the WOJCIECH ZAŁUSKA value. Capturing is in most cases (96%) faster.
Now, your regex - <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] - is not efficient, as [^;,<\n\r] also matches whitespace that \s matches (spaces, tabs), and \s in turn matches the \n and \r in [;,<\n\r]. My regex is linear: no subpattern can match the same text as the one right before it.
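The difference between the lookbehind and the capturing versions can also be sketched outside AutoHotkey. The Python snippet below is only an approximation: the standard re module has no \p{L}, so [^\W\d_] is used as a stand-in for "a Unicode letter", which is an assumption of this sketch.

```python
import re

# Stand-in for \p{L}+(?:\s+\p{L}+)* : words made of letters, separated by whitespace.
LETTERS = r"[^\W\d_]+(?:\s+[^\W\d_]+)*"

# Lookbehind version: the tag is checked but not part of the match.
LOOKBEHIND = re.compile(r"(?<=<autorpodpis>)" + LETTERS)
# Capturing version: the tag is matched, the name lands in group 1.
CAPTURING = re.compile(r"<autorpodpis>(" + LETTERS + r")")

text = "<autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis>"

via_lookbehind = LOOKBEHIND.search(text).group(0)
via_capture = CAPTURING.search(text).group(1)
print(via_lookbehind, "|", via_capture)
```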
We have a tab-separated list of "key=value" pairs.
How can we split it using a regexp?
The case key=value must be transformed into value. The case key=value=value2 must be transformed into value=value2.
https://regex101.com/r/dR5dT0/1 - I've started a solution like this, but can't find an elegant way to remove only the "key=" part from the text.
UPD: BTW, do you know any good crash courses on regular expressions?
You can just use
=(\S*)
See regex demo
Since the list is already formatted, the = in the pattern will always be the name/value delimiter.
The \S matches any non-whitespace character.
The * is a quantifier meaning that the \S should occur zero or more times (\S* matches zero or more non-whitespace characters).
You can use this regex for matching:
/\w+=(\S+)/
and grab captured group #1
RegEx Demo
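Both suggested patterns can be compared with a short Python sketch. The tab-separated sample data is invented; note how the two patterns differ only when a value is empty.

```python
import re

# Hypothetical tab-separated input; the last value itself contains "=".
pairs = "id=42\tname=alice\texpr=a=b"

# Pattern from the first answer: everything after the first "=" of each pair.
values_loose = re.findall(r"=(\S*)", pairs)
# Pattern from the second answer: requires a key and a non-empty value.
values_strict = re.findall(r"\w+=(\S+)", pairs)
print(values_loose, values_strict)

# With an empty value, \S* still yields "" while \S+ skips the pair.
with_empty = "a=1\tempty="
print(re.findall(r"=(\S*)", with_empty), re.findall(r"\w+=(\S+)", with_empty))
```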
I'm trying to create a regex for extracting singers and lyricists. I was wondering how to make the lyricists search optional.
Sample Multiline String:
Fireworks Singer: Katy Perry
Vogue Singers: Madonna, Karen Lyricist: Madonna
Regex: /Singers?:(.*)\s?Lyricists?:(.*)/
This matches the second line correctly and extracts Singers(Madonna, Karen) and Lyricists(Madonna)
But it does not work with the first line, when there are no Lyricists.
How do I make Lyricists search optional?
You can enclose the part you want to match in a non-capturing group: (?:). Then it can be treated as a single unit in the regex, and subsequently you can put a ? after it to make it optional. Example:
/Singers?:(.*)\s?(?:Lyricists?:(.*))?/
Note that here the \s? is useless since .* will greedily eat all characters, and no backtracking will be necessary. This also means that the (?:Lyricists?:(.*)) part will never be matched for the same reason. You can use the non-greedy version of .*, .*? along with the $ to fix this:
/Singers?:(.*?)\s*(?:Lyricists?:(.*))?$/
Some extra whitespace ends up captured; this can be removed also, giving a final regex of:
/Singers?:\s*(.*?)\s*(?:Lyricists?:\s*(.*))?$/
Just to add to Cameron's solution: if the source string has multiple lines, each containing Singers and possibly Lyricists, you'll probably need to add the 'm' multi-line modifier so that the '$' will match at the ends of lines. (You didn't say what language you are using; you may want to add the 'i' modifier as well.)
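Since the language was not stated, here is a sketch in Python of the final regex combined with the multi-line modifier, applied to the two sample lines from the question. re.MULTILINE is Python's counterpart of the 'm' flag.

```python
import re

# Final regex from the answer; MULTILINE lets $ match at each line end.
CREDITS = re.compile(
    r"Singers?:\s*(.*?)\s*(?:Lyricists?:\s*(.*))?$",
    re.MULTILINE,
)

sample = (
    "Fireworks Singer: Katy Perry\n"
    "Vogue Singers: Madonna, Karen Lyricist: Madonna"
)

# Each tuple is (singers, lyricists); lyricists is None when absent.
credits = [m.groups() for m in CREDITS.finditer(sample)]
print(credits)
```

Add re.IGNORECASE to the flags if the labels can vary in case.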