Regex : split by occurrences groups

Regex : split by occurrences groups - regex

I am trying to find a solution to split a string by occurrences in groups.
Strings are formatted like this: "AAA/BBB/CCC/DDD/BBB/CCC/DDD/BBB/DDD"
I want the string to split like this:
1 ) AAA/BBB/CCC/DDD
2 ) BBB/CCC/DDD
3 ) BBB/DDD
'/' is always the separator and words are always AAA, BBB, CCC and DDD.
I tried regex expression (AAA|BBB|CCC|DDD){x} with {x} to specify the number of occurrences but it seems {} works only for characters, not words.

You can use re.findall with the following positive lookahead patterns to ensure that slashes are included only if they are followed by characters that are allowed in the sequence, and use ? as a repeater to make a match of each word optional (but greedy):
import re
s = 'AAA/BBB/CCC/DDD/BBB/CCC/DDD/BBB/DDD'
re.findall('(?=[ABCD])(?:AAA(?:/(?=[BCD]))?)?(?:BBB(?:/(?=[CD]))?)?(?:CCC(?:/(?=D))?)?(?:DDD)?', s)
This returns:
['AAA/BBB/CCC/DDD', 'BBB/CCC/DDD', 'BBB/DDD']

You can use re.split with an alternation pattern that includes slashes that are surrounded by positive lookbehind and lookahead patterns to ensure that the character preceding the slash is to be latter in the sequence than the character following the slash:
import re
s = 'AAA/BBB/CCC/DDD/BBB/CCC/DDD/BBB/DDD'
re.split('(?:(?<=[BCD])/(?=A)|(?<=[CD])/(?=B)|(?<=D)/(?=C))', s)
This returns:
['AAA/BBB/CCC/DDD', 'BBB/CCC/DDD', 'BBB/DDD']

Related

Regular Expression to match first word with a character in each line

I am trying to write a regex that finds the first word in each line that contains the character a.
For a string like:
The cat ate the dog
and the mouse
The expression should find cat and
So far, I have:
/\b\w*a\w*\b/g
However this will return every match in each line, not just the first match (cat ate and).
What is the easiest way to only return the first occurrence?

Assuming you are onluy looking for words without numbers and underscores (\w would include those), I'd advise to maybe use:
(?i)^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)
And use whatever is in the 1st capture group. See an online demo. Or, if supported:
(?i)^.*?\K(?<!\S)[b-z]*a[a-z]*(?!\S)
See an online demo.
Please note that I used lookaround to assert that the word is not inbetween anything other than whitespace characters. You may also use word-boundaries if you please and swap those lookarounds for \b. Also, depending on your application you can probably scratch the inline case-insensitive switch to a 'flag'. For example, if you happen to use JavaScript /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi should probably be your option. See for example:
var myString = "The cat ate the dog\nand the mouse";
var myRegexp = new RegExp("^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)", "gmi");
m = myRegexp.exec(myString);
while (m != null) {
console.log(m[1])
m = myRegexp.exec(myString);
}

If you want to match a word using \w you might also use a negated character class matching any character except a or a newline.
Then match a word that consists of at least an a char with word boundaries \b
^[^a\n\r]*\b([^\Wa]*a\w*)
The pattern matches:
^ Start of string
[^a\n\r]*\b Optionally match any character except a or a newline
( Capture group 1
[^\Wa]*a\w* Optionally match a word character without a, then match a and optional word characters
) Close group 1
Regex demo
Using whitespace boundaries on the left and right:
^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)
Regex demo

The text could be matched with the regular expression
(?=(\b[a-z]*a[a-z]*\b)).*\r?\n
with the multiline and case-indifferent flags set. For each match capture group 1 contains the first word (comprised only of letters) in a line that contains an "a". There are no matches in lines that do not contain an "a".
Demo
The expression can be broken down as follows.
(?= # begin a positive lookahead
\b # match a word boundary
([a-z]*a[a-z]*) # match a word containing an "a" and save to
# capture group 1
)
.*\r?\n # match the remainder of the line including the
# line terminator

Regex for first part of a string to match repeated (consecutive or non-consecutive) character

Basically I have a string and I want to find the shortest sub-string (including the beginning) that matches the repetition of a character N times, it doesn't matter if consecutive or not. I want to use it in Javascript.
Example:
Let's figure out the character is '/' and we want it to match 5 repetitions.
For this string:
http://remote-computer.example.local/home/dev/proj/sdk/docs/index.html#/api
The matching string would be:
http://remote-computer.example.local/home/dev/
For this string:
////remote-computer/example/local/home
The matching string would be:
////remote-computer/

How about this regex:
^((?:[^/]*/){5})
The sub-string you want will be catched in group 1.
In javascript you could do:
var re = new RegExp("^((?:[^/]*/){5})); // excape the slashes is not mandatory
or
var re = /^((?:[^\/]*\/){5})/; // here you have to excape the slashes
Explanation:
^ : begining of the string
( : start capture group 1
(?: : start non capture group
[^/]*/ : 0 or more any character that is not a slash, followed by a slash
){5} : the non capture group occurs 5 times
) : end of group 1

You can use (?:.*?/){5}. See a demo here.
This matches the exact same substrings as Toto’s regexp but is shorter:
There’s no need to use ^; regexps start matching at the beginning by default.
You don’t need a capture group because you want the whole match.
.*?/ matches "everything until the next /, including it", which can also be written as [^/]*/ like Toto did.

How to write a RegEx pattern that accepts a string with at most one of each letter, but unordered?

I have tried this:
[a]?[b]?[c]?[d]?[e]?[f]?[g]?[h]?[i]?[j]?[k]?[l]?[m]?[n]?[o]?[p]?[q]?[r]?[s]?[t]?[u]?[v]?[w]?[x]?[y]?[z]?
But this RegEx rejects string where the order in not alphabetical, like these:
"zabc"
"azb"
I want patterns like these two to be accepted too. How could I do that?
EDIT 1
I don't want letter repetitions, i.e., I want the following strings to be rejected:
aazb
ozob
Thanks.

You can use a negative lookahead assertion to make sure no two characters are the same:
^(?!.*(.).*\1)[a-z]*$
Explanation:
^ # Start of string
(?! # Assert that it's impossible to match the following:
.* # any number of characters
(.) # followed by one character (capture that in group 1)
.* # followed by any number of characters
\1 # followed by the same character as the one captured before
) # End of lookahead
[a-z]* # Match any number of ASCII lowercase letters
$ # End of string
Test it live on regex101.com.
Note: This regex needs to brute-force check all possible character pairs, so performance may be a problem with larger strings. If you can use anything besides regex, you're going to be happier. For example, in Python:
if re.search("^[a-z]*$", mystring) and len(mystring) == len(set(mystring)):
# valid string

regex to match entire words containing only certain characters

I want to match entire words (or strings really) that containing only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)

This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""

The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+

Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, or at the end of the string or followed by whitespace. If you can't use lookbehind (for example, if you're using JavaScript) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").

Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Example in javascript
Example in php
Anything between two [](brackets) is a character class.. it will match any character between the brackets. You can also use ranges.. [0-9], [a-z], etc, but it will only match 1 character. The + and * are quantifiers.. the + searches for 1 or more characters, while the * searches for zero or more characters. You can specify an explicit character range with curly brackets({}), putting a digit or multiple digits in-between: {2} will match only 2 characters, while {1,3} will match 1 or 3.
Anything between () parenthesis can be used for callbacks, say you want to return or use the values returned as replacements in the string. The ?! is a negative lookahead, it won't match the character class after it, in order to ensure that strings with the characters are not matched when the characters are present.

Regular expression for duplicate words

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?

Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here

I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html

The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)* : Any number of space followed by word which matches the previous word and ends the word boundary. Whole thing wrapped in * helps to find more than one repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.

Try this with below RE
\b start of word word boundary
\W+ any word character
\1 same word matched already
\b end of word
()* Repeating again
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
Example
Source

The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1

Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.

This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed whitespace.
\1{2,} then looks for more than 2 instances of that phrase in the string to match. If there are 3 phrases that are identical, it matches.

Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.

The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.

This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result

Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b

I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*/
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*/
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.

You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
var matcher = pattern.matcher(input);
while (matcher.find()) {
input = input.replaceAll(matcher.group(), matcher.group(1));
}
return input;
}

As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )

To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.

Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex : split by occurrences groups - regex

Related

Regular Expression to match first word with a character in each line

Regex for first part of a string to match repeated (consecutive or non-consecutive) character

How to write a RegEx pattern that accepts a string with at most one of each letter, but unordered?

regex to match entire words containing only certain characters

Regular expression for duplicate words

Categories

Resources