How to write a regex that captures a word, and separately captures an optional suffix [duplicate] - regex

This question already has answers here:
Regular Expressions: How to Express \w Without Underscore
(7 answers)
Closed 12 days ago.
[Note: I don't believe that it is a dup of https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match, which only involves one capture group.]
Suppose that input tokens have two possible forms that we want to match:
some_word
some_word_id123
I'd like to end up with two capture groups: one that has the word, and one that has the id (or empty if there is no id).
So in the two examples above, the captures should be:
["some_word", ""]
["some_word", "123"]
What I've tried:
(\w+)_id(\d+): works for case 2, but obviously won't match anything in case 1
(\w+)(?:_id(\d+))?: works in case 1, but ends up with ["some_word_id123", ""] in case 2
(\w+?)(?:_id(\d+))?: trying to make the first group lazy. But that ends up causing it to only match the first letter of the word ("s" in the example)
So the challenge is: how do we prevent the first group from including the _id123 suffix (since they are valid word characters)?

The first example almost does what you need. As per the question, the _id... part is the unreliable bit. In your case, you simply need the optional operator (?) to tell the engine that _id... might not be there.
Thus, your solution, (\w+)_id(\d+), becomes ^(\w+?)(?:_id(\d+))?$. The ^ and $ anchors matter here: without them, the lazy \w+? is satisfied after a single character, which is what happened in your third attempt.
There are some issues with your original solution, mainly the (\w+) section. \w is shorthand for [a-zA-Z0-9_], and since you are using the greedy version of the + operator, the match spills over into the _ that belongs to the _id section, so the first group swallows the suffix.
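A minimal sketch of the anchored pattern in Python's re module, assuming whole tokens are matched one at a time. The helper name split_token is hypothetical, and mapping a missing id to "" is an assumption based on the output the question asks for.

```python
import re

# Anchored pattern from the answer: lazy word, optional _id<digits> suffix.
TOKEN = re.compile(r"^(\w+?)(?:_id(\d+))?$")

def split_token(token):
    """Return [word, id], with id as "" when there is no _id suffix."""
    match = TOKEN.match(token)
    if match is None:
        return None
    word, ident = match.groups()
    # An unmatched optional group is None in Python; the question wants "".
    return [word, ident or ""]

print(split_token("some_word"))        # ['some_word', '']
print(split_token("some_word_id123"))  # ['some_word', '123']

# The same lazy pattern without anchors stops after one character.
unanchored = re.match(r"(\w+?)(?:_id(\d+))?", "some_word_id123").groups()
print(unanchored)                      # ('s', None)
```

Other engines report an unmatched optional group differently (empty string, null, undefined), so the `or ""` step may not be needed elsewhere.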


RegEx to find count of special characters in String [duplicate]

This question already has answers here:
How to get the count of only special character in a string using Regex?
(6 answers)
Closed 2 years ago.
I need to form a RegEx that produces output only if two or more occurrences of special characters exist in the given string.
1) abcd##qwer - Match
2) abcd#dsfsdg#fffj - Match
3) abcd#qwetg - No Match
4) acwexyz - No Match
5) abcd#ds#$%fsdg#fffj - Match
Can anyone help me with this?
Note: I need to use this regular expression in an existing tool, not in any programming language.
UPDATE after OP edit
The edited OP introduces a small amount of additional complexity that necessitates a different pattern entirely. The keys here are that (a) there is now a significantly limited set of "special characters" and (b) that these characters must appear at least twice (c) in any position in the string.
To implement this, you would use something like:
(?:.*?[##$%].*?){2,}
Opens a non-capturing group,
Which contains any number of characters (matched lazily), followed by
Any character in the set ##$%,
Followed by any number of characters.
The {2,} quantifier ensures this pattern occurs at least twice in a given string.
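As a quick check, the pattern can be run against the five sample strings from the question. This is only a sketch in Python's re module; the question targets an unnamed tool whose regex flavor may differ, and the character set is copied from the answer exactly as written.

```python
import re

# Pattern from the answer, copied verbatim.
AT_LEAST_TWO = re.compile(r"(?:.*?[##$%].*?){2,}")

samples = [
    "abcd##qwer",
    "abcd#dsfsdg#fffj",
    "abcd#qwetg",
    "acwexyz",
    "abcd#ds#$%fsdg#fffj",
]

# True where the string contains at least two special characters.
results = {s: bool(AT_LEAST_TWO.search(s)) for s in samples}

for text, matched in results.items():
    print(text, "- Match" if matched else "- No Match")
```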
Original answer
By "special characters", I assume you mean anything outside standard alphanumeric characters. You can use the pattern below in most flavors of Regex:
([^A-Za-z0-9])\1
This (a) creates a set of all characters not including alphanumeric characters and matches a character against it, then (b) checks to see if the same character appears adjacent.
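A small Python sketch of that original pattern. It only detects the same special character appearing twice in a row, so it behaves differently from the updated pattern on strings where the special characters are separated.

```python
import re

# A non-alphanumeric character immediately followed by itself.
ADJACENT_PAIR = re.compile(r"([^A-Za-z0-9])\1")

adjacent = ADJACENT_PAIR.search("abcd##qwer")         # finds "##"
separated = ADJACENT_PAIR.search("abcd#dsfsdg#fffj")  # None: the two # are apart

print(adjacent.group(0) if adjacent else None)
print(separated)
```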

Dynamic class operations within a regular expression

I am trying to write a regex which excludes certain characters from a class based on the current content of capturing groups. The specific task that made me look for such a thing was to match lowercase letters in alphabetical order.
I searched through Rex's page (https://www.rexegg.com/regex-class-operations.html) to see if there was any way to change the class' content, but was unable to find anything.
Take the following attempt as a brief example: ([a-z])[a-z--[\1]]
Though it's not a correct regular expression, it demonstrates the concept. The idea is that it would match two letters that are not the same.
Note: the expression shown follows a Python-like syntax, and can also be written as:
([a-z])[a-z&&[^\1]] or ([a-z])(?![\1])[a-z]
But I am going to use the Python syntax.
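For reference, only the lookahead form of that two-letter example can be tried in Python's built-in re module. The sketch below writes it as (?!\1) rather than (?![\1]), because inside a character class re does not treat \1 as a backreference. It illustrates the "two different letters" idea only, not a dynamic class.

```python
import re

# A letter, then a second letter that must differ from the first.
DIFFERENT_PAIR = re.compile(r"([a-z])(?!\1)[a-z]")

different = DIFFERENT_PAIR.fullmatch("ab")  # match: the letters differ
same = DIFFERENT_PAIR.fullmatch("aa")       # None: the lookahead rejects it

print(bool(different), bool(same))
```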
In the examples above the nested brackets are optional (in certain engines), but for the ultimate goal they are necessary. The pattern I am trying to match the ordered letters with would be something like this:
^(?:([a-z])([a-z--[a-(?(2)\2|\1)]])*+)?$
The first character class matches a letter which is immediately captured by the group, meaning that the letter will be excluded from the group containing the conditional. The first time the second group tries to match, the condition inside the conditional statement evaluates to false, since there has not been a second capture yet, so it "matches" the first group's content, which should result in the exclusion of the first letter from the class. In later steps the second group will be set, meaning that all the letters between 'a' and the most recently captured letter will be excluded.
I know, it seems complicated. Maybe refactoring the pattern will help, take a look at this one:
^(?:([a-z])([(?(2)\2|\1)-z])*+)?$
This example makes no use of set operations, but the idea is roughly the same. The first group matches a letter, then the class inside the second group matches anything between the captured letter and 'z', which is noted by the [(?(2)\2|\1)-z] part. The conditional is there to ensure that the lower boundary of the character interval is the most recently captured character.
This could also be written using subroutine calls, but I doubt it would solve the problem. The issue might be that the classes are precompiled (and so are subroutines), so they cannot change during the matching process.
Are you guys aware of a workaround or an engine that supports such operations? I am interested in the dynamic class operation itself rather than a different way to match alphabetically ordered letters.

Regex: Match all permutations [duplicate]

This question already has answers here:
Regex to match all permutations of {1,2,3,4} without repetition
(4 answers)
Closed 4 years ago.
First of all, I am aware that this is a problem you wouldn't usually use regex for, I am just trying to find out whether this is even possible.
That being said, what I am trying to do is match ALL occurrences of any permutation of a string (for now, I don't care if overlapping occurrences match or not); for example, if I have the string abc, I want to match all occurrences of abc, acb, bac, bca, cab and cba.
What I have until now is the following regex: (?:([abc])(?!.{0,1}\1)){3} (note: I know that I could use + instead of {0,1}, but that only works for strings with length 3). This kind of works, but if there are two permutations next to each other where a letter of the first one is too close to a letter of the second one (e.g. abc cba → c c), the first permutation does not match. Is it possible to solve this using regex?
Direct Approach
[abc]{3} would match too many results since it would also match aab.
In order to not double match a you would need to remove a from the group that follows, leaving you with a[bc]{2}.
a[bc]{2} would match too many results since it would also match abb.
In order to not double match b you would need to remove b from the group that follows, leaving you with ab[c]{1}, or abc for short.
abc would not match all combinations, so you would need another group.
(abc)|([abc]{3}) would match too many combinations again.
This path leads you down the road of having all permutations listed explicitly in groups.
Can you create combinations so that you do not need to write out all combinations?
(abc)|(acb) could be written as a((bc)|(cb)).
(bc)|(cb) cannot be shortened any further.
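If listing every permutation is acceptable, the alternation does not have to be written by hand. A rough Python sketch, assuming the letters are distinct; the helper name permutation_pattern is hypothetical.

```python
import re
from itertools import permutations

def permutation_pattern(letters):
    """Build an explicit alternation of every ordering of `letters`."""
    alternatives = ("".join(p) for p in permutations(letters))
    return re.compile("|".join(re.escape(a) for a in alternatives))

pattern = permutation_pattern("abc")   # abc|acb|bac|bca|cab|cba
found = pattern.findall("abc cba aab bca")
print(found)                           # ['abc', 'cba', 'bca']
```

The alternation grows factorially with the number of letters, so this is only practical for short strings.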
Match too many and remove unwanted
Depending on the regex engine, you may be able to express AND as a lookahead so that you can filter out unwanted matches: assert THIS, assert not THAT, then consume THIS.
(?=[abc]{3})(?=(?!a.a))[abc]{3} would not match aca.
This problem is now similar to the one above, where you need to remove all combinations that would violate your permutations. In this example that is any expression containing the same character multiple times.
(.)\1+ uses a backreference and, on its own, matches the same character repeated. It depends on the group's number, though, which makes it brittle: wrap it in another group and ((.)\1+) no longer matches, because \1 now refers to the outer group. Relative back references exist but depend on your specific regex engine; \k<-1> may be what you are looking for. I will assume .NET, since I happen to have a regex tester bookmarked for that.
The permutations that I want to exclude are: nn. n.n .nn nnn
So I create these patterns: ((?<1>.)\k<1>.) ((?<2>.).\k<2>) (.(?<3>.)\k<3>) ((?<4>.)\k<4>\k<4>)
Putting it all together gives me this expression. Note that I used relative back references as they are in .NET; your mileage may vary.
(?=[abc]{3})(?=(?!((?<1>.)\k<1>.)))(?=(?!((?<2>.).\k<2>)))(?=(?!(.(?<3>.)\k<3>)))(?=(?!((?<4>.)\k<4>\k<4>)))[abc]{3}
The answer is yes for a specific length.
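The same idea can be sketched in Python's re module. This is an assumed port, not the .NET expression itself: it uses ordinary numbered backreferences, and it drops the nnn check because the nn. check already rejects three identical characters.

```python
import re

# Three characters from [abc], rejecting any window with a repeated character.
NO_REPEATS = re.compile(
    r"(?=[abc]{3})"   # the next three characters all come from the set
    r"(?!(.)\1)"      # reject nn. (first and second equal)
    r"(?!(.).\2)"     # reject n.n (first and third equal)
    r"(?!.(.)\3)"     # reject .nn (second and third equal)
    r"[abc]{3}"       # consume the permutation
)

text = "abc cba aab aca abb bca"
# finditer is used because findall would return the (empty) groups.
matches = [m.group(0) for m in NO_REPEATS.finditer(text)]
print(matches)        # ['abc', 'cba', 'bca']
```

Because every lookahead only inspects the three characters about to be consumed, two permutations sitting next to each other do not interfere with one another.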

Is it possible to match any wide character that appears more than once using only regexp?

For example, in this string with no \s:
abodnpjdcqe
only d should be matched.
But in my case there are thousands of different characters. Is it possible to use ONLY regexp to match all characters that appear in the string more than once? It seems that all other solutions use other tools.
It is possible to find characters that appear more than once in a string, as anubhava demonstrates, and I don't see any other regex pattern to do it.
However, there are problems with a regex-only approach:
1. The complexity of this kind of pattern is very high, and you will run into problems (backtracking limits and execution time) if your string is long and contains few duplicates.
2. The pattern cannot tell whether a duplicate character has already been found. For example, with the string a123a456a789a, the pattern returns a three times instead of once. If your goal is to obtain a list of unique duplicate characters, this can be problematic (but is easy to solve programmatically).
So, to answer your question: my answer is no.
A simple way to do it with code is to loop over the characters of your string and build an associative array where the keys are the characters and the values are the numbers of occurrences. Then remove every item whose value is 1 and extract the keys.
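That counting approach could look like this in Python, using collections.Counter as the associative array. The function name repeated_chars is hypothetical.

```python
from collections import Counter

def repeated_chars(text):
    """Return each character that occurs more than once, in first-seen order."""
    counts = Counter(text)  # character -> number of occurrences
    return [ch for ch, n in counts.items() if n > 1]

print(repeated_chars("abodnpjdcqe"))    # ['d']
print(repeated_chars("a123a456a789a"))  # ['a']
```

This makes a single pass over the string, so it stays cheap however many distinct characters there are.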
Note: you can solve the problem of duplicate results (2.) using this pattern:
(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$)
or if possessive quantifiers are available:
(.)(?=(?:(?!\1).)*+\1(?:(?!\1).)*+$)
but I'm afraid that the complexity may be even higher.
So using your favorite language remains by far the best way.
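A quick Python check of the first of those two patterns (the possessive variant needs an engine that supports possessive quantifiers). Each duplicated character is reported once, at the occurrence that has exactly one more copy after it.

```python
import re

# A character followed by exactly one further occurrence of itself
# in the rest of the string.
UNIQUE_DUPLICATES = re.compile(r"(.)(?=(?:(?!\1).)*\1(?:(?!\1).)*$)")

once = UNIQUE_DUPLICATES.findall("a123a456a789a")
single = UNIQUE_DUPLICATES.findall("abodnpjdcqe")

print(once)    # ['a']
print(single)  # ['d']
```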
You can use this regex:
([a-zA-Z])(?=.*\1)
Explanation:
Regex uses ([a-zA-Z]) to match any letter and captures it as group #1 i.e. \1
A positive lookahead (?=.*\1) then makes sure this match is successful only when it is followed by at least one of the backreference \1 i.e. the character itself.
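Tried in Python's re module, this pattern finds the repeated letter in the question's example. On a string where a letter occurs several times, it reports every occurrence except the last one.

```python
import re

# A letter that appears again somewhere later in the string.
REPEATED = re.compile(r"([a-zA-Z])(?=.*\1)")

question_example = REPEATED.findall("abodnpjdcqe")
many_repeats = REPEATED.findall("a123a456a789a")

print(question_example)  # ['d']
print(many_repeats)      # ['a', 'a', 'a']
```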

What does ?: do in regex

I have a regex that looks like this
/^(?:\w+\s)*(\w+)$*/
What is the ?:?
It indicates that the subpattern is a non-capture subpattern. That means whatever is matched in (?:\w+\s), even though it's enclosed by () it won't appear in the list of matches, only (\w+) will.
You're still looking for a specific pattern (in this case, a single whitespace character following one or more word characters), but you don't care what's actually matched.
It means: group only, but do not remember the grouped part.
By default, ( ) tells the regex engine to remember the part of the string that matches the pattern inside it. But at times we just want to group a pattern without triggering the regex memory; to do that we use (?: in place of (
Further to the excellent answers provided, its usefulness is also to simplify the code required to extract groups from the matched results. For example, your (\w+) group is known as group 1 without having to be concerned about any groups that appear before it. This may improve the maintainability of your code.
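The difference in group numbering can be seen with a short Python sketch. It uses the question's pattern without the stray * after $, which is an assumption about what was intended, and a sample string chosen for illustration.

```python
import re

s = "a eeee"

# Non-capturing prefix: only the last word is remembered, as group 1.
non_capturing = re.match(r"^(?:\w+\s)*(\w+)$", s).groups()

# Capturing prefix: the prefix becomes group 1 and the last word group 2.
capturing = re.match(r"^(\w+\s)*(\w+)$", s).groups()

print(non_capturing)  # ('eeee',)
print(capturing)      # ('a ', 'eeee')
```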
Let's understand this by taking an example.
Say I have been given the string s = "a eeee".
With your regex, /^(?:\w+\s)(\w+)$/, the engine starts at the beginning of the string and matches 'a' followed by the whitespace character. That part is matched but, because of ?:, it is not captured. If you had not included ?:, 'a ' (a with the whitespace character) would have been returned as the first captured group.
Since you have included it, as in (?:\w+\s)*, the only captured group is the trailing word, 'eeee'. In other words, ?: lets the group take part in the match while excluding what it matched from the captured results.
PS: I am a beginner in regex. This is what I have understood about ?:. Feel free to point out anything I have explained wrongly.