Regex - Match Collection Must Consume Entire String

Regex - Match Collection Must Consume Entire String - regex

Let's say I have the following string:
var goodStr = "abcabcabc";
I want to write a Regex pattern for it to return three (3) matches, which each one's value being "abc". However, if the string deviates from the repeated "abc" pattern at all, I do not want to return ANY matches.
Also, I do not know if there will always be 3 repetitions (there could be any number of repetitions).
For example, the following string should fail, and it should not have any matches:
var badStr = "abcabcabc123";
What pattern should I use that would return 3 matches in goodStr, but 0 matches in badStr?
In other words, the sum of the last match's index and length should equal the total length of the subject string.
I am trying not to use captures/back-references in this scenario also.
EDIT:
The pattern ^(?:abc)+$ does not suffice since it only returns 1 match.
The pattern ^abcabcabc$ does not suffice since it assumes there will only be 3 repetitions of "abc", and I don't know how many repetitions there will be in my scenario. Also, it only returns 1 match.
Solved
With Aaron's and anubhava's help we made this pattern that works for my scenario:
\Gabc(?=(?:abc)*$)

PHP doesn't support dynamic length lookbehind so you may use this regex using \G:
(?:^(?=(?:abc)+$)|(?!^)\G)abc
RegEx Demo
\G asserts position at the end of the previous match or the start of the string for the first match.
After some more iterations this regex turns out to be most efficient:
\Gabc(?=(?:abc)*$)
RegEx Demo 2

You should be able to use the following in C# :
(?<=^(?:abc)*)abc(?=(?:abc)*$)
This matches occurences of abc from which you're able to reach both the start and the end of the string using only other repetitions of abc. This relies on the capacity to use variable-width lookbehinds which is quite rare but that C#'s regex engine implements.
I've been able to test it on http://regexstorm.net/tester where it does return 3 matches for abcabcabc but 0 for abcabcabc123.

You could use \G in combination with a positive lookahead. \G matches at the start of the string or asserts the position at the end of the previous match.
You can capture abc and check if what is on the right is a repetition of the group until the end of the string.
\G(abc)(?=\1*$)
Regex demo

Related

Match all elements with n occurrences

I want to select the same element with exact n occurrences.
Match letters that repeats exact 3 times in this String: "aaaaabbbcccccccccdddee"
this should return "bbb" and "ddd"
If I define what I should match like "b{3}" or "d{3}", this would be easier, but I want to match all elements
I've tried and the closest I came up is this regex: (.)\1{2}(?!\1)
Which returns "aaa", "bbb", "ccc", "ddd"
And I can't add negative lookbehind, because of "non-fixed width" (?<!\1)

One possibility is to use a regex that looks for a character which is not followed by itself (or beginning of line), followed by three identical characters, followed by another character which is not the same as the second three i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101

You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo

This gets sticky because you cannot put a back reference inside a negative character set, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2))
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.

If your environment permits the use of (*SKIP)(*FAIL), you can manage to return a lean set of matches by consuming substrings of four or more consecutive duplicate characters then discard them. In the alternation, match the desired 3 consecutive duplicated characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
preg_match_all(
'/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
$string,
$m
)
? $m[0]
: 'no matches'
);
Output:
array (
0 => 'bbb',
1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward and by consuming the 4 or more consecutive duplicates, it can rule-out long substrings quickly.

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g
IN
OUT
OK THT PHP This is it 06222021
This is it
NO MTM PYT Get this content 111111
Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space Charachter
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY

You don't need 4 groups, you can use a single group 1 to be in the replacement and match 6-8 digits for the last part instead of only 6.
Note that this \w{0,2} will also match an empty string, you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1 match any char as least as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Regex demo
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it

My approach does not work with groups and does use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 ocurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(#"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value;

RegExp: How do I include 'avoid non-numeric characters' from a pattern search?

I want to filter out all .+[0-9]. (correct way?) patterns to avoid duplicate decimal points within a numeral: (e.g., .12345.); but allow non-numerals to include duplicate decimal points: (e.g. .12345*.) where * is any NON-NUMERAL.
How do I include a non-numeral negation value into the regexp pattern? Again,
.12345. <-- error: erroneous numeral.<br/>
.12345(.' or '.12345*.' <-- Good.

I think you are looking for
^\d*(?:\.\d+)?(?:(?<=\d)[^.\d\n]+\.)?$
Here is a demo
Remember to escape the regex properly in Swift:
let rx = "^\d*(?:\\.\\d+)?(?:(?<=\\d)[^.\\d\\n]+\\.)?$"
REGEX EXPLANATION:
^ - Start of string
\d* - Match a digit optionally
(?:\.\d+)? - Match decimal part, 0 or 1 time (due to ?)
(?:(?<=\d)[^.\d\n]+\.)? - Optionally (due to ? at the end) matches 1 or more symbols preceded with a digit (due to (?<=\d) lookbehind) other than a digit ([^\d]), a full stop ([^.]) or a linebreak ([^\n]) (this one is more for demo purposes) and then followed by a full stop (\.).
$ - End of string
I am using non-capturing groups (?:...) for better performance and usability.
UPDATE:
If you prefer an opposite approach, that is, matching the invalid strings, you can use a much simpler regex:
\.[0-9]+\.
In Swift, let rx = "\\.[0-9]+\\.". It matches any substrings starting with a dot, then 1 or more digits from 0 to 9 range, and then again a dot.
See another regex demo

The non-numeral regex delimited character is \D. Conversely, if you're looking for only numerals, \d would work.
Without further context of what you're trying to achieve it's hard to suggest how to build a regex for it, though based on your example, (I think) this should work: .+\d+\D+

Regex to match string not ending with pattern

I try to find a regex that matches the string only if the string does not end with at least three '0' or more. Intuitively, I tried:
.*[^0]{3,}$
But this does not match when there one or two zeroes at the end of the string.

If you have to do it without lookbehind assertions (i. e. in JavaScript):
^(?:.{0,2}|.*(?!000).{3})$
Otherwise, use hsz's answer.
Explanation:
^ # Start of string
(?: # Either match...
.{0,2} # a string of up to two characters
| # or
.* # any string
(?!000) # (unless followed by three zeroes)
.{3} # followed by three characters
) # End of alternation
$ # End of string

You can try using a negative look-behind, i.e.:
(?<!000)$
Tests:
Test Target String Matches
1 654153640 Yes
2 5646549800 Yes
3 848461158000 No
4 84681840000 No
5 35450008748 Yes
Please keep in mind that negative look-behinds aren't supported in every language, however.

What wrong with the no-look-behind, more general-purpose ^(.(?!.*0{3,}$))*$?
The general pattern is ^(.(?!.* + not-ending-with-pattern + $))*$. You don't have to reverse engineer the state machine like Tim's answer does; you just insert the pattern you don't want to match at the end.

This is one of those things that RegExes aren't that great at, because the string isn't very regular (whatever that means). The only way I could come up with was to give it every possibility.
.*[^0]..$|.*.[^0].$|.*..[^0]$
which simplifies to
.*([^0]|[^0].|[^0]..)$
That's fine if you only want strings not ending in three 0s, but strings not ending in ten 0s would be long. But thankfully, this string is a bit more regular than some of these sorts of combinations, and you can simplify it further.
.*[^0].{0,2}$

Regular expression for duplicate words

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?

Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here

I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html

The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)* : Any number of space followed by word which matches the previous word and ends the word boundary. Whole thing wrapped in * helps to find more than one repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.

Try this with below RE
\b start of word word boundary
\W+ any word character
\1 same word matched already
\b end of word
()* Repeating again
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source

The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1

Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed whitespace.
\1{2,} then looks for more than 2 instances of that phrase in the string to match. If there are 3 phrases that are identical, it matches.

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.

The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result

Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b

I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}

As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )

To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.

Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex - Match Collection Must Consume Entire String - regex

You could use \G in combination with a positive lookahead. \G matches at the start of the string or asserts the position at the end of the previous match. You can capture abc and check if what is on the right is a repetition of the group until the end of the string. \G(abc)(?=\1*$) Regex demo

Related

Match all elements with n occurrences

Regular Expression: Find a specific group within other groups in VB.Net

RegExp: How do I include 'avoid non-numeric characters' from a pattern search?

Regex to match string not ending with pattern

Regular expression for duplicate words

Categories

Resources