RegExp set contains one or multiple words - regex

Is there a way in regular expressions to match a subset of words against a set of words separated by a separator, without creating a new pattern for every word added to the set?
Right now I cannot think of anything else than creating a (?:{item1, item2, ...}) pattern for every extra item in the set (see example below).
Example matching a single word of the set:
Set: foo,bar,baz
Match: foo
RegExp: /^(foo|bar|baz)$/ <- MATCH
Example that will match a subset of words:
Set: foo,bar,baz
Match: foo,bar
RegExp: /^(foo|bar|baz)(?:,(foo|bar|baz)(?:,(foo|bar|baz))?)?$/ <- MATCH
The pattern grows rapidly when adding new items to the set. Is there some (magical) way to do this in a shorter version?

One general approach which looks slightly better than your current attempt would be to use lookaheads:
^(?=.*\bfoo\b)(?=.*\bbar\b).*$
You may add one lookahead assertion for each CSV term which needs to be matched in the input CSV list.
Edit: If you want OR behavior here, then we can use an alternation of lookaheads. To match either foo or bar as a CSV term we can try:
^(?:(?=.*\bfoo\b)|(?=.*\bbar\b)).*$
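As a quick check of the two patterns above, here is a minimal sketch in Python. The choice of language, the function names and the extra sample inputs are assumptions for illustration; the question does not name a regex flavor.

```python
import re

# AND: every listed term must appear somewhere in the CSV string.
ALL_TERMS = re.compile(r"^(?=.*\bfoo\b)(?=.*\bbar\b).*$")
# OR: at least one of the listed terms must appear.
ANY_TERM = re.compile(r"^(?:(?=.*\bfoo\b)|(?=.*\bbar\b)).*$")


def has_all(csv: str) -> bool:
    """True if both foo and bar occur as whole words."""
    return ALL_TERMS.match(csv) is not None


def has_any(csv: str) -> bool:
    """True if foo or bar occurs as a whole word."""
    return ANY_TERM.match(csv) is not None


if __name__ == "__main__":
    for sample in ["foo,bar,baz", "foo,baz", "baz"]:
        print(sample, has_all(sample), has_any(sample))
```

Note that these lookaheads only check for the presence of terms; they do not reject items that fall outside the set.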

Related

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is easily solved by replacing the unwanted words with empty strings, like so:
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
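For a language without the s///g syntax, the same substitution might look like this in Python. This is only a sketch: the helper name strip_unwanted is made up, and the pattern is the \b variant shown above.

```python
import re

# The \b variant from above: remove each keyword together with one adjacent hyphen.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")


def strip_unwanted(name: str) -> str:
    """Return name with the 'production' and 'public' segments removed."""
    return UNWANTED.sub("", name)


if __name__ == "__main__":
    print(strip_unwanted("california-public-local-card-production"))
    print(strip_unwanted("production-nevada-public"))
```

With the two sample inputs from the question this yields california-local-card and nevada.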
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.
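Collecting group 1 over repeated matches could be sketched like this in Python. One assumption to flag: the atomic group (?>...) is swapped for a plain non-capturing group (?:...), because Python's re module only accepts atomic groups from version 3.11 on, and the two behave the same for this simple alternation. The function name is hypothetical.

```python
import re

# Lazily capture everything up to the next keyword.
PIECE = re.compile(r"([\s\S]*?)(?:production|public)")


def pieces_before_keywords(text: str) -> list[str]:
    """Return group 1 of every successive match, in order."""
    return [m.group(1) for m in PIECE.finditer(text)]


if __name__ == "__main__":
    print(pieces_before_keywords("california-public-local-card-production"))
```

The pieces still carry their hyphens, and any text after the last keyword is not captured at all, so some cleanup remains after joining them.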

How to exclude a certain word in regex?

I'm using this expression and it's perfect for what I need:
.*(cq|conquest).*
It returns any word/phrase/sentence/etc. with the letters 'cq' or the word 'conquest' in it. However, from those matches I want to exclude all that contain the term 'conquest power'.
Examples:
some conquest here (should match)
another cq with some conquest here (should match)
too much cq or conquest power is bad (should not match)
How can I do that to the regex above? It has to be only one regex otherwise the program that I'm using (Advanced Combat Tracker) will create two different tabs.
If you want to match any string which contains "conquest" or "cq", but not if the string contains "conquest power", then the regex is
^(?!.*conquest power).*?(?:cq|conquest).*
The above will attempt to match from the start of the string to the end of the line. If you want to match from the start of each line, switch on multiline mode if available; adding (?m) to the start of the regex may do that.
If you want to match across newlines change . to [\s\S], or switch on singleline mode if available.
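To try that pattern against the three examples from the question, a small Python sketch follows. Python is an assumption here (the target program's regex engine may differ), and the function name is made up.

```python
import re

# Reject lines containing "conquest power", then require "cq" or "conquest".
PATTERN = re.compile(r"^(?!.*conquest power).*?(?:cq|conquest).*")


def is_wanted(line: str) -> bool:
    """True if the line mentions cq/conquest but not 'conquest power'."""
    return PATTERN.match(line) is not None


if __name__ == "__main__":
    for line in [
        "some conquest here",
        "another cq with some conquest here",
        "too much cq or conquest power is bad",
    ]:
        print(is_wanted(line), line)
```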
You have confused people by stating "I want to match 'cq' or 'conquest'" but also "I want the regex to extract that line".
I assume you don't really want to match just "cq" or "conquest", you want to match strings/lines (?) containing "cq" or "conquest".
From your original question I got that you want to match all strings which contain "cq" or "conquest" but do not contain "power". For this case the following regexp works:
^([^p]|p(?!ower))*(cq|conquest)([^p]|p(?!ower))*$
(regexpal)

Regex is behaving lazy, should be greedy

I thought that by default my Regex would exhibit the greedy behavior that I want, but it is not in the following code:
Regex keywords = new Regex(@"in|int|into|internal|interface");
var targets = keywords.ToString().Split('|');
foreach (string t in targets)
{
Match match = keywords.Match(t);
Console.WriteLine("Matched {0,-9} with {1}", t, match.Value);
}
Output:
Matched in with in
Matched int with in
Matched into with in
Matched internal with in
Matched interface with in
Now I realize that I could get it to work for this small example if I simply sorted the keywords by length descending, but I want to understand why this isn't working as expected, and the actual project I am working on has many more words in the Regex and it is important to keep them in alphabetical order.
So my question is: Why is this being lazy and how do I fix it?
Laziness and greediness apply to quantifiers only (?, *, +, {min,max}). Alternations always match in order and try the first possible match.
It looks like you're trying to word break things. To do that you need the entire expression to be correct, and your current one is not. Try this one instead:
new Regex(@"\b(in|int|into|internal|interface)\b");
The "\b" says to match word boundaries, and is a zero-width match. This is locale dependent behavior, but in general this means whitespace and punctuation. Being a zero width match it will not contain the character that caused the regex engine to detect the word boundary.
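The effect of the word boundaries can be seen in a short Python sketch. Python is used here as an assumption; its re module also tries alternatives left to right, so the unanchored pattern shows the same "in" result as the .NET output in the question.

```python
import re

KEYWORDS = ["in", "int", "into", "internal", "interface"]

# Without boundaries the first alternative that fits wins.
plain = re.compile("|".join(KEYWORDS))
# With \b on both sides, a short alternative fails and the engine backtracks.
bounded = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\b")

if __name__ == "__main__":
    for word in KEYWORDS:
        print(word, plain.match(word).group(), bounded.match(word).group())
```

The keyword list keeps its alphabetical order; only the boundaries change which alternative ends up matching.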
According to RegularExpressions.info, regular expressions are eager. Therefore, when it goes through your piped expression, it stops on the first solid match.
My recommendation would be to store all of your keywords in an array or list, then generate the sorted, piped expression when you need it. You would only have to do this once too as long as your keyword list doesn't change. Just store the generated expression in a singleton of some sort and return that on regex executions.
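One way this recommendation might look, sketched in Python rather than C# (an assumption), with a hypothetical helper name. The stored list stays alphabetical; only the generated pattern is ordered longest first.

```python
import re

KEYWORDS = ["in", "int", "into", "internal", "interface"]  # kept alphabetical


def build_keyword_regex(keywords):
    """Compile an alternation with the longest keywords first."""
    ordered = sorted(keywords, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered))


# Built once at import time and reused, as the answer suggests.
KEYWORD_RE = build_keyword_regex(KEYWORDS)

if __name__ == "__main__":
    for word in KEYWORDS:
        print(word, KEYWORD_RE.match(word).group())
```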

Regular Expression: how can I impose a perfect string matching?

Currently I am using this one (edit: I forgot to explain that I use this one for excluding exactly these words :p):
String REGEXP = "^[^(REG_)?].*";
but it also matches (and so excludes) ERG, EGR, GRE, etc.
P.S.
I removed super because it is another keyword that I must filter. Imagine an array list composed of many entries modeled on the following three words:
REG_info1, info2, SUPER_info3, etc...
I need three filters, each matching one model at a time; my question focuses only on the second filter, which parses keywords based on the model "info2".
Just type it literally:
REG
This will only match REG.
So:
String REGEXP = "^(REG_|SUPER_)?.*";
Edit: After you clarified that you want to match every word that does not begin with REG_ or SUPER_, you could try this:
\b(?!REG_|SUPER_)\w+
The \b is a word boundary and the expression (?!expr) is a look-ahead assertion.
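Applied to the sample list from the question, that expression could be tried like this. The sketch uses Python although the question uses Java (an assumption; the pattern itself uses no flavor-specific syntax), and the function name is made up.

```python
import re

# A word start that is not followed by REG_ or SUPER_, then the word itself.
NOT_PREFIXED = re.compile(r"\b(?!REG_|SUPER_)\w+")


def unprefixed_words(text: str) -> list[str]:
    """Return the words that do not begin with REG_ or SUPER_."""
    return NOT_PREFIXED.findall(text)


if __name__ == "__main__":
    print(unprefixed_words("REG_info1, info2, SUPER_info3"))
```

Because the underscore counts as a word character, there is no word boundary inside REG_info1, so the engine cannot restart in the middle of an excluded word.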
As everyone has already replied, if you want to match a line starting with REG, you use the regexp "^REG"; if you want to match any line that starts with REG or SUPER, you use "^(REG|SUPER)". Regular expression negation is, in general, a tricky problem.
To match all lines NOT starting with 'REG' you need to match "^([^R]|R[^E]|RE[^G])", and a regular expression to match all lines not starting with REG or SUPER can be constructed in a similar fashion (start by grouping the "not REG" in parentheses, then construct the "not SUPER" patterns as "[^S]|S[^U]|SU[^P]...", group this and use alternation for both groups).
How about
\mREG\M
// \mREG\M
//
// Options: ^ and $ match at line breaks
//
// Assert position at the beginning of a word «\m»
// Match the characters “REG” literally «REG»
// Assert position at the end of a word «\M»
The [] indicate character classes. This is not what you want. You can just use "REG" to match REG. (You can use REG|SUPER for REG or SUPER)
REGEXP = "^(REG_|SUPER_)"
would match anything that has REG_ or SUPER_ at the beginning of a string. You don't need more after the group "(..|..)"

Regex multi word search

What do I use to search for multiple words in a string? I would like the logical operation to be AND so that all the words are in the string somewhere. I have a bunch of nonsense paragraphs and one plain English paragraph, and I'd like to narrow it down by specifying a couple common words like, "the" and "and", but would like it match all words I specify.
Regular expressions support a "lookaround" condition that lets you search for a term within a string and then forget the location of the result; starting at the beginning of the string for the next search term. This will allow searching a string for a group of words in any order.
The regular expression for this is:
^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)
Where \b is a word boundary and (?=...) is a lookahead assertion.
If you have a variable number of words you want to search for, you will need to build this regular expression string with a loop - just wrap each word in the lookaround syntax and append it to the expression.
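Building the expression in a loop might look like the following Python sketch. The language and the function name are assumptions; re.escape is added so that words containing regex metacharacters are treated literally.

```python
import re


def all_words_regex(words):
    """Compile ^(?=.*\\bw1\\b)(?=.*\\bw2\\b)... for the given words."""
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile("^" + lookaheads)


if __name__ == "__main__":
    pattern = all_words_regex(["the", "and"])
    print(pattern.pattern)
    print(bool(pattern.search("cats and the dogs")))
    print(bool(pattern.search("the dog")))
```

As written, . does not cross newlines, so each paragraph would need to be tested on its own or the pattern compiled with a dot-matches-newline option.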
AND as concatenation
^(?=.*?\b(?:word1)\b)(?=.*?\b(?:word2)\b)(?=.*?\b(?:word3)\b)
OR as alternation
^(?=.*?\b(?:word1|word2|word3)\b)
^(?=.*?\b(?:word1)\b)|^(?=.*?\b(?:word2)\b)|^(?=.*?\b(?:word3)\b)
Maybe using a language recognition chart to recognize English would work. Some quick tests seem to work (this assumes paragraphs separated by newlines only).
The regexp will match if any one of those conditions holds: \bword\b is a word delimited by boundaries, word\b is a word ending, and a bare word matches anywhere in the paragraph.
my @paragraphs = split(/\n/,$text);
for my $p (@paragraphs) {
if ($p =~ m/\bthe\b|\band\b|\ban\b|\bin\b|\bon\b|\bthat\b|\bis\b|\bare\b|th|sh|ough|augh|ing\b|tion\b|ed\b|age\b|’s\b|’ve\b|n’t\b|’d\b/) {
print "Probable english\n$p\n";
}
}
Firstly I'm not certain what you're trying to return... the whole sentence? The words in between your two given words?
Something like:
\b(word1|word2)\b(\w+\b)*(word1|word2)\b(\w+\b)*\.
(where \b is the word boundary in your language)
would match a complete sentence that contained either of the two words or both.
You'd probably need to make it case insensitive so that if it appears at the start of the sentence it will still match.
Assuming PCRE (Perl regexes), I am not sure that you can do it at all easily. The AND operation is concatenation of regexes, but you want to be able to permute the order in which the words appear without having to formally generate the permutation. For N words, when N = 2, it is bearable; with N = 3, it is barely OK; with N > 3, it is unlikely to be acceptable. So, the simple iterative solution - N regexes, one for each word, and iterate ensuring each is satisfied - looks like the best choice to me.
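The iterative approach, one regex per word with every one required to succeed, could be sketched as follows. Python and the function name are assumptions; the case-insensitive flag is also an added choice, so that a word at the start of a sentence still counts.

```python
import re


def contains_all_words(text: str, words) -> bool:
    """True if every word occurs in text as a whole word, in any order."""
    return all(
        re.search(rf"\b{re.escape(w)}\b", text, re.IGNORECASE) is not None
        for w in words
    )


if __name__ == "__main__":
    print(contains_all_words("The cat and the dog", ["the", "and"]))
    print(contains_all_words("The cat", ["the", "and"]))
```

This avoids generating permutations: the cost grows linearly with the number of words, and each individual pattern stays trivial.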