Compound Words - Regex [duplicate] - regex

I would expect this line of JavaScript:
"foo bar baz".match(/^(\s*\w+)+$/)
to return something like:
["foo bar baz", "foo", " bar", " baz"]
but instead it returns only the last captured match:
["foo bar baz", " baz"]
Is there a way to get all the captured matches?

When you repeat a capturing group, in most flavors, only the last capture is kept; any previous capture is overwritten. In some flavor, e.g. .NET, you can get all intermediate captures, but this is not the case with Javascript.
That is, in Javascript, if you have a pattern with N capturing groups, you can only capture exactly N strings per match, even if some of those groups were repeated.
So generally speaking, depending on what you need to do:
If it's an option, split on delimiters instead
Instead of matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop
Do note that these two aren't exactly equivalent, but it may be an option
Do multilevel matching:
Capture the repeated group in one match
Then run another regex to break that match apart
References
regular-expressions.info/Repeating a Capturing Group vs Capturing a Repeating Group
Javascript flavor notes
Example
Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com):
var text = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";
var r = /<(\w+(;\w+)*)>/g;
var match;
while ((match = r.exec(text)) != null) {
print(match[1].split(";"));
}
// c,d,e,f
// xx,yy,zz
The pattern used is:
_2__
/ \
<(\w+(;\w+)*)>
\__________/
1
This matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
Related questions
How do you access the matched groups in a javascript regex?

How's about this? "foo bar baz".match(/(\w+)+/g)

Unless you have a more complicated requirement for how you're splitting your strings, you can split them, and then return the initial string with them:
var data = "foo bar baz";
var pieces = data.split(' ');
pieces.unshift(data);

try using 'g':
"foo bar baz".match(/\w+/g)

You can use LAZY evaluation.
So, instead of using * (GREEDY), try using ? (LAZY)
REGEX: (\s*\w+)?
RESULT:
Match 1: foo
Match 2: bar
Match 3: baz

Related

Capturing all occurences of a repeated group in a string and reference them for substitution

I have following regex that matches any number in the string and returns it in the group, which then i replace with another text.
For the sample string:
/text_1/123456/text_2
With /^(.*[^0-9])+([0-9]{3,}+)+(.*)$ and using substitution like $1captured_group$3 i get my desired result i.e. /text_1/captured_group/text_2
However for scenarios where the capturing groups appears more than once in the give string such as:
/text_1/123456/text_2/789011
/text_1/123456/text_2/789011/abc/12345
The given regex would only capture last group i.e. 789011 and 12345 respectively. However, what i want is to capture all of the groups and be able to reference them later to replace them.
An explanation given on regex101.com i beleive addresses my scenario:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
However, i am not sure how to Put a capturing group around the repeated group to capture all iterations and later reference all the matched values?
As Hao Wu commented:
"If you want to match multiple occurrences you need to get rid of the anchors (^, $) and add a global (g) modifier, such as /\b[0-9]{3,}\b/g"
As for storing matches and referencing them for later use, you could have an array of objects wherein each object has the match and an array of two indices -- the first index being the index of the start of the match and the second index being the index of the end of the match:
// string = `123`
{match: 123, indices: [0, 2]}
In the example below, the function tagMatches(str, rgx) uses .matchAll() method.
const tagMatches = (str, rgx) => {
const matches = str.matchAll(rgx);
let result = [];
for (const match of matches) {
result.push({"match": +match[0], "indices": [match.index, match.index + match.length]});
}
return result;
}
const string = `utfuduyiutcv fvtycy 1sdtyveaf 678900 amsiofjsogifn979/125487/`;
const regexp = /\b(\d){3,}\b/g;
const tagged = tagMatches(string, regexp)
console.log(tagged);
console.log("first match: "+tagged[0].match);
console.log("second match start: "+tagged[1].indices[0]);
console.log("first match end: "+tagged[0].indices[1]);

Match same number of repetitions as previous group

I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit As I feared, and pointed out by answers and comments, this is not possible to represent in regex, at least not without some sort of recursion which the Rust regex crate doesn't for the time being support. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
let group2_start = string.find(group2).unwrap();
let group1_reps = (string.len() - group2_start) / group1.len();
let group2_reps = group2_start / group2.len();
group1_reps == group2_reps
}
fn main() {
assert_eq!(matching_reps("abc123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
playground
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is however a simple way to compute it :
use a regexp to make sure your string is a ^((abc)*)((123)*)$
if your string matches, take the two captured substrings, and compare their lengths
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
See proof. (?:)? is redundant, it matches no text, (?:...)? matches an optional pattern.
Rust snippet:
let a = "abc"; // Prefix
let b = "123"; // Suffix
let level = 3; // Recursion (repetition) level
let mut result = "".to_string();
for _n in 0..level {
result = format!("{}(?:{})?{}", a, result, b);
}
println!("{}", result);
// abc(?:abc(?:abc(?:)?123)?123)?123
There's an extension to the regexp libraries, that is implemented from the old times unix and that allows to match (literally) an already scanned group literally after the group has been matched.
For example... let's say you have a number, and that number must be equal to another (e.g. the score of a soccer game, and you are interested only in draws between the two teams) You can use the following regexp:
([0-9][0-9]*) - \1
and suppose we feed it with "123-123" (it will match) but if we use "123-12" that will not match, as the \1 is not the same string as what was matched in the first group. When the first group is matched, the actual regular expression converts the \1 into the literal sequence of characters that was matched in the first group.
But there's a problem with your sample... is that there's no way to end the first group if you try:
([0-9][0-9]*)\1
to match 123123, because the automaton cannot close the first group (you need at least a nondigit character to make the first group to finalize)
But for example, this means that you can use:
\+(\([0-9][0-9]*\))\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(the number in between the parenthesys is catched as a sample, and then you use the group to build a sequence of that number separated by dashes. You can se the expression working in this demo.)

Regex - capture multiple groups and combine them multiple times in one string

I need to combine some text using regex, but I'm having a bit of trouble when trying to capture and substitute my string. For example - I need to capture digits from the start, and add them in a substitution to every section closed between ||
I have:
||10||a||ab||abc||
I want:
||10||a10||ab10||abc10||
So I need '10' in capture group 1 and 'a|ab|abc' in capture group 2
I've tried something like this, but it doesn't work for me (captures only one [a-z] group)
(?=.*\|\|(\d+)\|\|)(?=.*\b([a-z]+\b))
I would achieve this without a complex regular expression. For example, you could do this:
input = "||10||a||ab||abc||"
parts = input.scan(/\w+/) # => ["10", "a", "ab", "abc"]
parts[1..-1].each { |part| part << parts[0] } # => ["a10", "ab10", "abc10"]
"||#{parts.join('||')}||"
str = "||10||a||ab||abc||"
first = nil
str.gsub(/(?<=\|\|)[^\|]+/) { |s| first.nil? ? (first = s) : s + first }
#=> "||10||a10||ab10||abc10||"
The regular expression reads, "match one or more characters in a pipe immediately following two pipes" ((?<=\|\|) being a positive lookbehind).

regex to duplicate repeated patterns, substituting part of the pattern

I'd like to duplicate a multiple matches in a line, substituting part of the match, but keeping the runs of matches together (that seems to be the tricky part).
e.g.:
Regex:
(x(\d)(,)?)
Replacement:
X$2,O$2$3
Input:
x1,x2,Z3,x4,Z5,x6
Output: (repeated groups broken apart)
X1,O1,X2,O2,Z3,X4,O4,Z5,X6,O6
Desired output (repeated groups, "X1,X2" kept together):
X1,X2,O1,O2,Z3,X4,O4,Z5,X6,O6
Demo: https://regex101.com/r/gH9tL9/1
Is this possible with regex or do I need to use something else?
Update: Wills answer is what I expected. It occurs to me that it might be possible with multiple passes of regex.
You would have to capture the repeating patterns as one match and write out replacements for the whole repeating pattern at once. your current pattern cannot tell that your first and second matches, x1, and x2, respectively, are adjacent.
Im going to say no, this is not possible with one pure regex.
This is because of two important facts about capture groups and replacing.
Repeated capture groups will return the last capture:
Regex's are able to capture patterns which repeat an arbitrary amount of time by using the form <PATTERN>{1,},<PATTERN>+ or <PATTERN>*. However any capture group within <PATTERN> would only return the captures from the last iteration of the pattern. This would prevent your desired ability to capture matches that arbitrarily repeat.
"Hold on", you might say, "I only want to capture patterns that repeat one or two times, I could use (x(\d)(,)?)(x(\d)(,)?)?", which brings us to point 2.
There is no conditional replacement
Using the above pattern we could get your desired output for the repeated match, but not without mangling the solo match replacement.
See: https://regex101.com/r/gH9tL9/2 Without the ability to turn off sections of the replacement based on the existence of capture groups, we cannot achieve the desired output.
But "No, you can't do that" is a challenge to a hacker, I hope I am shown up by a true regex ninja.
Solution with 2 regexes and some code
There's definitely ways to achieve this goal with some code.
Here's a quick and dirty python hack using two regexes http://pythonfiddle.com/wip-soln-for-so-q/
This makes use of python's re.sub(), which can pass matches to one regex to a function ordered_repl which returns the replacement string. By using your original regex within the ordered_repl we can extract the information we want and get the right order by buffering our lists of Xs and Os.
import re
input_string="x1,x2,Z3,x4,Z5,x6"
re1 = re.compile("(?:x\d,?)+") # captures the general thing you want to match using a repeating non-capturing group
re2 = re.compile("(x(\d)(,)?)") # your actual matcher
def ordered_repl(m): # m is a matchobj
buf1 = []
buf2 = []
cap_iter = re.finditer(re2,m.group(0)) # returns an iterator of MatchObjects for all non-overlapping matches
for cap_group in cap_iter:
capture = cap_group.group(2) # capture the digit
buf1.append("X%s" % capture) # buffer X's of this submatch group
buf2.append("O%s" % capture) # buffer O's of this submatch group
return "%s,%s," % (",".join(buf1),",".join(buf2)) # concatenate the buffers and return
print re.sub(re1,ordered_repl,input_string).rstrip(',') # searches string for matches to re1 and passes them to the ordered_repl function
In my specific case I'm using powershell, so I was able to come up with the following:
(linebreaks added for readability)
("x1,x2,z3,x4,z5,x6"
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
| Foreach-Object {
if ($_ -match 'x') {
$_ + ',' + ($_ -replace 'x','y')
} else {$_}
}
) -join ''
Outputs:
x1,x2,y1,y2,z3,x4,y4,z5,x6,y6
Where:
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
breaks apart the string into these groups:
x1,x2
,
z3
,
x4
,
z5
,
x6
using positive and negative lookahead and lookbehind:
comma with x\d before and without x after:
(?<=x\d),(?!x)
comma without x\d before and with x after:
(?<!x\d),(?=x)

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in sudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which does not have single-line mode):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, #"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}