Regex match a string within 2 different strings containing other characters

Given bar(alvin the chipmunk dude) and chipmunk(alvin the chipmunk dude), how would you match the word "chipmunk" only on the "bar" function?
I just asked another question, without the complexity I actually needed, and it is answered here. I do not believe this is a duplicate, given the answer to that question from @revo. That answer does solve the other question, but I see no way to adapt it so that the match must sit between two different strings ("bar(" and ")").
chipmunk(?=[^\)\(\\]*(?:\\.[^\)\(\\]*)*\)) (courtesy of @revo) matches "chipmunk" inside the parentheses, but I want to constrain it to being within "bar(" and ")".
Test here.
I'm using a JetBrains IDE, which uses the Java regex engine.

Since you are using a Java regex library, you may leverage the constrained-width lookbehind feature:
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
You may use
(?<=bar\([^()]{0,1000})chipmunk
It matches any chipmunk string that is preceded by bar( followed by 0 to 1000 characters other than ( and ).
You may test it at RegexPlanet.com.
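For readers outside Java: Python's built-in `re` module rejects variable-width lookbehind, so the pattern above does not carry over as-is. The sketch below is not the lookbehind technique. It is a two-pass stand-in that first finds each `bar(...)` call and then searches inside it. The function name `find_in_bar` is hypothetical, and unlike the Java pattern it assumes the closing `)` is present.

```python
import re

# First pass: a bar( ... ) call with no nested parentheses inside.
BAR_CALL = re.compile(r"bar\(([^()]*)\)")

def find_in_bar(text, word="chipmunk"):
    """Return (start, end) spans of `word` that sit inside bar(...)."""
    spans = []
    for call in BAR_CALL.finditer(text):
        offset = call.start(1)  # where the argument text begins in `text`
        # Second pass: search only inside the captured argument text.
        for hit in re.finditer(re.escape(word), call.group(1)):
            spans.append((offset + hit.start(), offset + hit.end()))
    return spans
```

On the example from the question, only the occurrence inside `bar(...)` is reported; the one inside `chipmunk(...)` is skipped.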

Related

Regular expression to select single words inside parentheses using stringr

I specified that I'm using stringr because its character escaping is not "standard" regex escaping.
I want to detect strings that have a single word inside parentheses. Thus, I want to detect
"Men's shirt (blue)"
and not detect
"Blade Runner (Director's cut)"
If it helps simplify the regex, all the parenthetical parts are always at the end of the string.
I have attempted
str_detect(my_string, "\\(\\w?\\)") which yields no results and
str_detect(my_string, "\\(\\S\\)$") which returns everything including multiple words
as well as various combinations using \\S, with or without the $, etc.
When I look for other Stack Overflow answers, I usually find slightly different questions whose answers are simply "use this:" along with what seems like an incomprehensible regex using lookaheads and other seemingly too-complicated things. I would appreciate a little explanation of why the (probably obvious and simple) regex works.

Looks like I didn't hit upon \\S+ in my combinations. I was checking for a single non-space character, not one or more non-space characters. This solution works:
str_detect(desc, "\\(\\S+\\)$")
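The same pattern can be checked outside R. A minimal Python sketch, assuming the two example strings from the question (a Python raw string needs single backslashes where the stringr call doubles them; the helper name is hypothetical):

```python
import re

# \( and \) are literal parentheses; \S+ is one or more non-space characters;
# $ anchors the closing parenthesis at the end of the string.
SINGLE_WORD_PARENS = re.compile(r"\(\S+\)$")

def has_single_word_in_parens(s):
    """True if the string ends with a parenthesized run of non-space characters."""
    return SINGLE_WORD_PARENS.search(s) is not None
```

A space inside the parentheses breaks the `\S+` run, which is why "Blade Runner (Director's cut)" is rejected.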

Regex: Match all permutations [duplicate]

First of all, I am aware that this is a problem you wouldn't usually use regex for, I am just trying to find out whether this is even possible.
That being said, what I am trying to do is match ALL occurrences of any permutation of a string (for now, I don't care whether overlapping occurrences match or not); for example, if I have the string abc, I want to match all occurrences of abc, acb, bac, bca, cab and cba.
What I have until now is the following regex: (?:([abc])(?!.{0,1}\1)){3} (note: I know that I could use + instead of {0,1}, but that only works for strings with length 3). This kind of works, but if there are two permutations next to each other where a letter of the first one is too close to a letter of the second one (eg. abc cba → c c), the first permutation does not match. Is it possible to solve this using regex?

Direct Approach
[abc]{3} would match too many results since it would also match aab.
In order to not double match a you would need to remove a from the group that follows leaving you with a[bc]{2}.
a[bc]{2} would match too many results since it would also match 'abb'.
In order to not double match b you would need to remove b from the group that follows, leaving you with ab[c]{1}, or abc for short.
abc would not match all combinations so you would need another group.
(abc)|([abc]{3}) which would match too many combinations again.
This path leads you down the road of having all permutations listed explicitly in groups.
Can you create combinations so that you do not need to write out all combinations?
(abc)|(acb) could be written as a((bc)|(cb)).
(bc)|(cb) cannot be shortened any further.
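If the permutations have to be listed explicitly anyway, the list can at least be generated instead of typed. A minimal Python sketch, assuming a hypothetical helper name `permutation_regex`:

```python
import re
from itertools import permutations

def permutation_regex(letters):
    """Build an alternation that lists every permutation of `letters`."""
    # The set drops duplicates when `letters` repeats a character;
    # sorted() keeps the generated pattern stable between runs.
    words = sorted({"".join(p) for p in permutations(letters)})
    return re.compile("|".join(re.escape(w) for w in words))
```

The pattern grows factorially with the number of letters, so this is only practical for short strings.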
Match too many and remove unwanted
Depending on the regex engine you may be able to express AND as a look ahead so that you can remove matches. THIS and not THAT consume THIS.
(?=[abc]{3})(?=(?!a.a))[abc]{3} would not match aca.
This problem is now similar to the one above, where you need to remove all combinations that would violate your permutations. In this example that is any expression containing the same character multiple times.
'(.)\1+' uses a numbered backreference and, on its own, matches the same character repeated. But it requires knowing how many groups exist in the expression and is very brittle: adding groups breaks it, e.g. ((.)\1+) no longer matches. Relative backreferences exist, but they depend on your specific regex engine; \k<-1> may be what you are looking for. I will assume .NET, since I happen to have a regex tester bookmarked for that.
The permutations that I want to exclude are: nn. n.n .nn nnn
So I create these patterns: ((?<1>.)\k<1>.) ((?<2>.).\k<2>) (.(?<3>.)\k<3>) ((?<4>.)\k<4>\k<4>)
Putting it all together gives me this expression. Note that I used relative back references as they are in .NET; your mileage may vary.
(?=[abc]{3})(?=(?!((?<1>.)\k<1>.)))(?=(?!((?<2>.).\k<2>)))(?=(?!(.(?<3>.)\k<3>)))(?=(?!((?<4>.)\k<4>\k<4>)))[abc]{3}
The answer is yes for a specific length.
Here is some testing data.
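The same idea can be written with plain numbered backreferences. This is a Python sketch rather than the .NET expression above: each negative lookahead rejects one placement of a repeated character (the nnn case is already covered by nn.), and the names `NO_REPEAT_ABC` and `find_permutations` are made up for illustration.

```python
import re

NO_REPEAT_ABC = re.compile(
    r"(?!(.)\1)"      # reject nn.  (first two characters equal)
    r"(?!(.).\2)"     # reject n.n  (first and third characters equal)
    r"(?!.(.)\3)"     # reject .nn  (last two characters equal)
    r"[abc]{3}"       # then consume three characters from the set
)

def find_permutations(text):
    """Return every non-overlapping permutation of 'abc' found in `text`."""
    return [m.group(0) for m in NO_REPEAT_ABC.finditer(text)]
```

Because the lookaheads only inspect the three characters about to be consumed, two permutations sitting next to each other (such as abc cba) do not interfere with one another.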

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill, either because they restrict their question to a very specific example or a specific use case (e.g. single chars only), because you need substitution for a successful approach, or because you'd need to use a programming language (e.g. C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field, which I was hoping to keep clear. Also, everything else is matched, but only on a character-by-character basis instead of in big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.

From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is not discontinuous; it is continuous!
Each match is a continuous, unbroken substring, so nothing inside that substring is skipped. Whatever matched the regular expression is included in that particular match result.
So within a single match, there is no inverting (i.e. "match everything but this") that can extend past the negated thing.
This is a tenet of regular expressions.
Further, in this case, since you want everything that is NOT something, you have to consume that something in the process.
This is easily done by capturing just what you want.
So, even with multiple matches, it's not good enough to say (?:(?!\bover\b).)+ because even though it will match up to (but not including) over, the next match will start at ver ....
There are ways to avoid this, but they are tedious and require variable-length lookbehinds.
The easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
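One way to read "match up to over, then over, then the rest" is to capture the wanted text and consume the unwanted text outside the group. A minimal Python sketch under that reading (Python's `re` has no \K, so a capture group is used instead; the helper name `text_outside` is hypothetical, and `pattern` is assumed to contain no capture groups of its own):

```python
import re

def text_outside(pattern, text):
    """Return the chunks of `text` that lie between matches of `pattern`."""
    # Group 1 lazily captures what we want; the unwanted match (or the end
    # of the string) is consumed outside the group so it cannot leak in.
    walker = re.compile(rf"([\s\S]*?)(?:{pattern}|\Z)")
    # Empty captures (e.g. at the very end of the string) are dropped.
    return [m.group(1) for m in walker.finditer(text) if m.group(1)]
```

Since each match consumes the unwanted text, the next match cannot resume in the middle of it, which is the "ver ..." problem described above.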

Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned in your question, when you have an efficient pattern that you use with a match method, the easiest (and most efficient) way to obtain the complement is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
This way is very simple to write (if lookahead is available), but it has the drawback of being slow, since every position of the string is tested with the lookahead.
The number of lookahead tests can be reduced if you can use the first character of the pattern (let's say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Let's say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$))))*
This way quickly becomes tedious when the number of characters is large, but it can be useful for regex flavors that lack many features, like POSIX regex.
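As a rough Python illustration of workaround 1 (alternation plus a capture group), named groups make the check for which branch succeeded explicit. The helper name `complement` is hypothetical, and the sketch assumes `pattern` contains no named groups or numbered backreferences, since it is embedded twice and wrapped in another group:

```python
import re

def complement(pattern, text):
    """Return the parts of `text` that are not matched by `pattern`."""
    combined = re.compile(
        rf"(?P<hit>{pattern})"                          # branch 1: the pattern itself
        rf"|(?P<other>[\s\S]+?)(?=(?:{pattern})|\Z)"    # branch 2: up to next hit or end
    )
    # Keep only the matches where the second branch succeeded.
    return [m.group("other") for m in combined.finditer(text)
            if m.group("other") is not None]
```

For a simple pattern like this, `re.split(pattern, text)` returns the same chunks, which is the split-based route mentioned at the start of this answer.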

Is it possible to match any wide character that appears more than once using only regexp?

For example, in this string with no \s:
abodnpjdcqe
only d should be matched.
But in my case there are thousands of different characters. Is it possible to use ONLY regexp to match all characters that appear in the string more than once? It seems that all other solutions use other tools.

It is possible to find characters that occur more than once in a string, as anubhava demonstrates, and I don't see any other regex pattern to do it.
However, there are problems with a regex-only approach:
1. The complexity of this kind of pattern is very high, and you will run into problems (backtracking limits and execution time) if your string is long and contains few duplicates.
2. This approach cannot tell whether a duplicate character has already been found. For example, with the string a123a456a789a, the pattern will return a three times instead of once. If your goal is to obtain a list of unique duplicate characters, this can be problematic (but it is easy to solve programmatically).
So, to answer your question: my answer is no.
A simple way to do it with code is to loop over the characters of your string and build an associative array where the keys are the characters and the values are the numbers of occurrences. Then remove each item that has the value 1 and extract the keys.
Note: you can solve the problem of duplicate results (2.) using this pattern:
(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$)
or, if possessive quantifiers are available:
(.)(?=(?:(?!\1).)*+\1(?:(?!\1).)*+$)
but I'm afraid the complexity may be even higher.
So, using your favorite language remains by far the best way.
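The counting approach described in this answer can be sketched in Python, with `collections.Counter` standing in for the associative array (the function name `repeated_chars` is hypothetical):

```python
from collections import Counter

def repeated_chars(s):
    """Return each character that occurs more than once, in first-seen order."""
    counts = Counter(s)  # character -> number of occurrences
    # Dropping the entries with count 1 leaves only the repeated characters.
    return [ch for ch, n in counts.items() if n > 1]
```

This makes a single pass over the string and reports each repeated character once, however many times it occurs.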

You can use this regex:
([a-zA-Z])(?=.*\1)
Explanation:
Regex uses ([a-zA-Z]) to match any letter and captures it as group #1 i.e. \1
A positive lookahead (?=.*\1) then makes sure this match is successful only when it is followed by at least one of the backreference \1 i.e. the character itself.
RegEx Demo
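A quick Python check of this lookahead, generalized from `[a-zA-Z]` to `.` (an assumption, so that it covers any character), next to the "exactly one more occurrence" variant quoted in the other answer. The function names are hypothetical:

```python
import re

def repeats_every_hit(s):
    # (.)(?=.*\1): a character that appears again somewhere later.
    # Every occurrence except the last one is reported.
    return re.findall(r"(.)(?=.*\1)", s, re.DOTALL)

def repeats_once_each(s):
    # Only the second-to-last occurrence matches: exactly one more copy
    # of the character may follow before the end of the string.
    return re.findall(r"(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*\Z)", s, re.DOTALL)
```

On a123a456a789a the first function reports a three times and the second reports it once, which matches the behavior described in the other answer.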

Regex no two consecutive characters are the same

How do I write a regular expression that matches a string x whose characters are all a, b, or c, but where no two consecutive characters are the same?
For example:
abcacb is true
acbaac is false

^(?!.*(.)\1)[abc]+$ works if you follow the original question exactly. However, it does not handle multiple "words" made of the characters a/b/c, e.g. "abc cba".
The way it works is that it asserts that no character is followed by itself, by using a capture group inside a negative lookahead, and that the entire string consists only of the characters "a", "b", or "c".
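A small Python check of that pattern against the examples from the question (the helper name `no_adjacent_repeat` is hypothetical):

```python
import re

# (?!.*(.)\1) fails if any character is immediately followed by itself;
# [abc]+ then requires the whole string to consist of a, b or c.
NO_ADJACENT_REPEAT = re.compile(r"^(?!.*(.)\1)[abc]+$")

def no_adjacent_repeat(s):
    """True if `s` is made of a/b/c with no two equal neighbors."""
    return NO_ADJACENT_REPEAT.search(s) is not None
```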

Since the number of chars is limited, you can get away without a back reference in the look ahead:
^(?!.*(aa|bb|cc))[abc]*$
But I like tenub's answer better :)

Using negative lookbehind: ^([abc]([abc](?<!(aa|bb|cc)))*)?$
Using negative lookahead: ^(((?!(aa|bb|cc))[abc])*[abc])?$
Prefer either of these (both do the same job, but differently) if you are going to use this regex as part of some bigger regex.
In short, they are reusable: copy and paste one, and it will do its work without disturbing any regex around it.
In my humble opinion, the regexes provided by @tenub and @Bohemian are not reusable, which can cause bugs.
Note: the empty string ("") will pass both regexes. If you don't want it to, remove the ? from the regex.
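Both variants can be compared side by side. A minimal Python sketch, relying on Python's `re` accepting the lookbehind because every alternative in (aa|bb|cc) has the same width (the helper name `agree` is hypothetical):

```python
import re

LOOKBEHIND = re.compile(r"^([abc]([abc](?<!(aa|bb|cc)))*)?$")
LOOKAHEAD = re.compile(r"^(((?!(aa|bb|cc))[abc])*[abc])?$")

def agree(s):
    """Return the shared verdict, or raise if the two patterns disagree."""
    behind = LOOKBEHIND.search(s) is not None
    ahead = LOOKAHEAD.search(s) is not None
    if behind != ahead:
        raise ValueError(f"patterns disagree on {s!r}")
    return behind
```

As the note says, the empty string passes both patterns while the trailing ? is in place.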
