RegEx: Find Matches Excluding a Specific Phrase - regex

I have tried a few of the answers to similar questions, but none work for what I am trying to do.
I am trying to find text that matches a specific phrase (with wildcard), but excludes any that include a second phrase.
Correct: John yawns.
Incorrect: John opens his mouth wide and yawns.
Essentially, I want to match off "(Someone) yawns." but not off "(Someone) opens his mouth wide and yawns." So the "opens his mouth wide and" is the match for exclusion, but I can't seem to get it to work.
Sadly, I am working with a log parsing application so I do not know what language is being used.

You probably want a negative lookbehind, as in (?<!opens his mouth wide and )yawns. Beware that these can slow down the regex matching algorithm, and are not available everywhere.
You really should reduce your test-case to simple patterns (e.g. .*bc but not abc)

Related

Use REGEXEXTRACT() to extract only upper case letters from a sentence in Google sheets

I'm thinking this should be basic but having tried a number of things I'm nowhere nearer a solution:
I have a list of names and want to extract the initials of the names in the next column:
Name (have)
Initials (want)
John Wayne
JW
Cindy Crawford
CC
Björn Borg
BB
Alexandria Ocasio-Cortez
AOC
Björk
B
Mesut Özil
MÖ
Note that some of these have non-English letters and they may also include hyphens. Using REGEXMATCH() I've been able to extract the first initial but that's where it stops working for me. For example this should work according to regex101:
=REGEXEXTRACT(AH2, "\b[A-Z]+(?:\s+[A-Z]+)*") but only yields the first letter.
You can use REGEXREPLACE to replace any non-capital letters with empty string and that will leave you with your desired string.
Try using this =REGEXREPLACE(A1,"[^A-ZÀ-Ö]","") and you can include additional character sets as per your needs that you want to retain.
Here is a working screenshot.
-edit for anyone showing up here-
Please have a look at the comment to the question. It has the best answer to this problem, by using Regexreplace() rather than Regexextract() to target anything that is not an upper case character and applying the relevant unicode pattern.
-/ edit -
Finally realised that RE2 (the flavour of regex used by Google sheets) doesn't support the global modifier and only returns the first match (explained here). The accepted solution to that question finds a smart workaround which allowed me to solve the problem:
=join("",REGEXEXTRACT(AH2,REGEXREPLACE(AH2, "([A-Z]+(?:\s+[A-Z]+)*)","($1)")))
When considering the non-English characters, I got lost in a rabbit hole here and here and decided that I suddenly don't care that much about international characters.
try:
=INDEX(REGEXREPLACE(A1:A, "[a-z ö-]", ))

Regex to match everything except a pattern

Regex noob here struggling with this, which I know it will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (ie the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify as per #phils question, all I care about extracting the tags value. However, this is via a package setting that asks for a regex string and I don't have much control over how it gets use. It seems to expect a regex to strip what I don't need from the string rather than matching what I do want, which is slightly annoying.. Also, the since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of lisp code1.)
You can still do it purely with regexp matching, but it's simply very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
1 In lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
After playing around with it for a bit and taken inspiration from #phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

Matching within matches by extending an existing Regex

I'm trying to see if its possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret a whole regex, which could be quite a long operation and then change its internal content to match produce the output as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the inital query might be [a-z]{4} but if I create (?<=^ca([a-z]{4})) it matches against 6 letter strings starting with ca, not 4 letter.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including #Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
Ok, thanks to #Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See #Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I want here to submit a very specific performance problem that i want to understand.
Goal
I'm trying to validate a custom synthax with a regex. Usually, i'm not encountering performance issues, so i like to use it.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid synthax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You could find the regex and a test text here :
https://regexr.com/3jama
I hope that be sufficient enough, i don't know how to explain what i want to match more than with a regex ;-).
Issue
Applying the regex on valid text is not costing much, it's almost instant.
But when it comes to specific not valid text case, the regexr app hangs. It's not specific to regexr app since i also encountered dramatic performances with my own java code or javascript code.
Thus, my needs is to validate all along the user is typing the text. I can even imagine validating the text on click, but i cannot afford that the app will be hanging if the text submited by the user is structured as the case below, or another that produce the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text
So the invalid text to raise the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test could be, and with no permformance drop:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'll be glad if a regex guru coming by could explain me what i'm doing wrong, or why my use case isn't adapted for regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+ which is an opening gate to the catasrophic backtracking, as string can be matched in quite a lot of different ways with this pattern.
I get what you try to do with this pattern - match an identifier - and maybe other other identifiers that are separated by comma. The proper way of doing this however is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there isn't an ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+))*\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Intelligent pattern matching in string

Let's say I have filenames which are formatted differently. I want to be able to extract certain aspects from said filename like a human would; pattern recognition.
Obviously I can bruteforce myself through with regular expressions but that's not what I'm after. Let's say I have these 4 strings:
[MAS] Hayate no Gotoku!! 20 [BD 720p] [21D138F8].mkv
[Leopard-Raws] Akatsuki no Yona - 05 RAW (MX 1280x720 x264 AAC).mp4
[BLAST] Wolf Girl and Black Prince - 05 [720p] [C1252A5E].mkv
[sage]_Mobile_Suit_Gundam_AGE_-_36_[720p][10bit][45C9E0D0].mkv
As you can see all these filenames have certain pattern in them but are not quite the same. So a silver bullet regular expression wouldn't cut it. Instead I want to look at computational intelligence techniques such as ANN's or another smart idea to solve this problem.
Let's say we want to extract the filenames. Humans would return these values:
Hayate no Gotoku!!
Akatsuki no Yona
Wolf Girl and Black Prince
Mobile Suit Gundam AGE
Or episode numbers: 20, 05, 05, 36. You get where I'm going with this.
What suggested techniques would be useful to achieve the desired result, or is this something that is being researched at universities and still has no solution?
What you are looking for is called grammar induction and it works but making a program figure out a regular expression (or some other type of pattern) that matches certain strings but not others. You have to give it the strings yourself however, called a training set, with positive examples (strings that should be matched) and negative examples (strings that shouldn't be matched).
An interesting technique is called boosting where you learn a lot of simple patterns which are precise (do not match negative examples) but match only a few positive examples; however when combined together will match a large amount of positive examples.
Since you want to extract substrings rather than just match strings, the way I would go about it is to take prefixes of the file names and try to match those. In this way you'd know where the substring starts. Here's an example:
Positives:
[MAS]
[Leopard-Raws]
[BLAST]
[sage]_
Negatives:
[MAS] H
[Leopard-Raws] Akat
[BL
[sage]_Mobile_Suit_Gundam_AGE_
If done correctly, you should obtain a regular expression which you can use on prefixes of the file names. By growing the prefix one letter at a time you can know where the content of interest starts. Like this:
[ False
[s False
[sa False
[sag False
[sage False
[sage] True
[sage]_ True
[sage]_M False
What happened here is that I increased the prefix of the file name one character at a time until the regular expression I learnt matched it. But I also wanted to find the longest prefix that matches (because otherwise I would have missed the underscore since [sage] is an acceptable prefix as well) so I continued moving forward until the regular expression stopped matching. In this way I would know that the prefix before the actual content starts is "[sage]_". You can do the same for matching where it ends as well by using prefixes which include the content of interest.
To learn about regular expression learning see this post. Keep in mind that automated learning will never be perfect but the more examples you use the more accurate it will be.