Match at every second occurrence - regex

Is there a way to specify a regular expression to match every 2nd occurrence of a pattern in a string?
Examples
searching for a against string abcdabcd should find one occurrence at position 5
searching for ab against string abcdabcd should find one occurrence at position 5
searching for dab against string abcdabcd should find no occurrences
searching for a against string aaaa should find two occurrences at positions 2 and 4

Use capturing groups.
foo.*?(foo)
Use a regex like this to match all occurrences in a string. Every returned match will contain a second occurrence as its first captured group.
Here's an example that matches every second occurrence of \d+ in Python using findall:
import re
input = '10 is less than 20, 5 is less than 10'
second_occurrences = re.findall(r'\d+.*?(\d+)', input)
print(second_occurrences)
Output:
['20', '10']

Suppose the pattern you want is abc+d. You want to match the second occurrence of this pattern in a string.
You would construct the following regex:
abc+d.*?(abc+d)
This would match strings of the form: <your-pattern>...<your-pattern>. Since we're using the reluctant qualifier *? we're safe that there cannot be another match of between the two. Using matcher groups which pretty much all regex implementations provide you would then retrieve the string in the bracketed group which is what you want.

If you're using C#, you can either get all the matches at once (i.e. use Regex.Matches(), which returns a MatchCollection, and check the index of the item: index % 2 != 0).
If you want to find the occurrence to replace it, use one of the overloads of Regex.Replace() that uses a MatchEvaluator (e.g. Regex.Replace(String, String, MatchEvaluator). Here's the code:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string input = "abcdabcd";
// Replace *second* a with m
string replacedString = Regex.Replace(
input,
"a",
new SecondOccuranceFinder("m").MatchEvaluator);
Console.WriteLine(replacedString);
Console.Read();
}
class SecondOccuranceFinder
{
public SecondOccuranceFinder(string replaceWith)
{
_replaceWith = replaceWith;
_matchEvaluator = new MatchEvaluator(IsSecondOccurance);
}
private string _replaceWith;
private MatchEvaluator _matchEvaluator;
public MatchEvaluator MatchEvaluator
{
get
{
return _matchEvaluator;
}
}
private int _matchIndex;
public string IsSecondOccurance(Match m)
{
_matchIndex++;
if (_matchIndex % 2 == 0)
return _replaceWith;
else
return m.Value;
}
}
}
}

Would something like
(pattern.*?(pattern))*
work for you?
Edit:
The problem with this is that it uses the non-greedy operator *?, which can require an awful lot of backtracking along the string instead of just looking at each letter once. What this means for you is that this could be slow for large gaps.

Back references can find interesting solutions here. This regex:
([a-z]+).*(\1)
will find the longest repeated sequence.
This one will find a sequence of 3 letters that is repeated:
([a-z]{3}).*(\1)

There's no "direct" way of doing so but you can specify the pattern twice as in: a[^a]*a that match up to the second "a".
The alternative is to use your programming language (perl? C#? ...) to match the first occurence and then the second one.
EDIT: I've seen other responded using the "non-greedy" operators which might be a good way to go, assuming you have them in your regex library!

Related

Match same number of repetitions as previous group

I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit As I feared, and pointed out by answers and comments, this is not possible to represent in regex, at least not without some sort of recursion which the Rust regex crate doesn't for the time being support. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
let group2_start = string.find(group2).unwrap();
let group1_reps = (string.len() - group2_start) / group1.len();
let group2_reps = group2_start / group2.len();
group1_reps == group2_reps
}
fn main() {
assert_eq!(matching_reps("abc123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
playground
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is however a simple way to compute it :
use a regexp to make sure your string is a ^((abc)*)((123)*)$
if your string matches, take the two captured substrings, and compare their lengths
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
See proof. (?:)? is redundant, it matches no text, (?:...)? matches an optional pattern.
Rust snippet:
let a = "abc"; // Prefix
let b = "123"; // Suffix
let level = 3; // Recursion (repetition) level
let mut result = "".to_string();
for _n in 0..level {
result = format!("{}(?:{})?{}", a, result, b);
}
println!("{}", result);
// abc(?:abc(?:abc(?:)?123)?123)?123
There's an extension to the regexp libraries, that is implemented from the old times unix and that allows to match (literally) an already scanned group literally after the group has been matched.
For example... let's say you have a number, and that number must be equal to another (e.g. the score of a soccer game, and you are interested only in draws between the two teams) You can use the following regexp:
([0-9][0-9]*) - \1
and suppose we feed it with "123-123" (it will match) but if we use "123-12" that will not match, as the \1 is not the same string as what was matched in the first group. When the first group is matched, the actual regular expression converts the \1 into the literal sequence of characters that was matched in the first group.
But there's a problem with your sample... is that there's no way to end the first group if you try:
([0-9][0-9]*)\1
to match 123123, because the automaton cannot close the first group (you need at least a nondigit character to make the first group to finalize)
But for example, this means that you can use:
\+(\([0-9][0-9]*\))\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(the number in between the parenthesys is catched as a sample, and then you use the group to build a sequence of that number separated by dashes. You can se the expression working in this demo.)

Match a group of words using regex

I have a long string in the format:
' Random Key : Random Value\n Random Long Key : Random Long Value\n...'
and so on.
I am trying to change it to
Random Value:Random Key, Random Long Value:Random Long Key,...
by using regex. I can match a single word by
\w+
but in order to match more than one word i am doing
\w+(\s\w+)*
but that's not giving me the wanted result.
You might use this piece of code to find the key-value pairs:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var regex = new Regex(#"\s*(?<key>(\w+\s?\w+)*)\s*:\s*(?<val>(\w+\s?\w+))\s*");
var input = #" Key : Value\n Long Key : Long Value\n...";
Console.WriteLine(regex.Replace(input, "${key}:${val}").Replace("\\n", ", "));
}
}
The trick is to match "any number of (at least one word character, an optional space, and at least another word character)", which finds keys for us with spaces. The shortest no-space key is two characters though.
I admit that the escaped line breaks aren't replaced by a regex, but this way both expression and code are fairly simple.
If your \n really is part of the string you can match and replace it like this:
/(?:\s*([^\\]+?)\s*:\s*([^\\]+?)\s*)+\\n/g
and substitute it with
$1:$2,
See Demo.
If you have it line-by-line then its easier because you can use multiline matching:
/^\s*(.+?)\s*:\s*(.+?)\s*$/mg
and also substitute it with
$1:$2,
See Demo.
None of the other answers worked fully for me, as they kept matching the space after the final word. the regex that worked as intended is
(\w+(?:\s\w+)*)

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) will be considered as case insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any alphabet from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before dot, then use this regex and replace it with $1 (or may be \1)
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is replace below with empty:
(?i)^[a-z\d]\.
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
Use this regex:
\w.[A-Z]\d.[A-Z]{3}.\d{7}
Use the above regex like this:
String[] in = {
"A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
Pattern p = Pattern.compile("\\w.[A-Z]\\d.[A-Z]{3}.\\d{7}");
for (String s: in ) {
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println("Result: " + m.group().substring(2));
}
}
Live demo: http://ideone.com/tns9iY

Using regex to match certain text

I try to look for this answer for a while but no luck (sorry if I could describe it well). I am still newbie with regex. I am trying to match a string with only number and a certain delimiter. For example: the patter would be 8/16/32/64/.... the number will be split by '/' with arbitrary amount of number, I could find a way to match them.
My attempt is \d+/\d+? but couldn't get it to work.
You could remove the '/' delimiter and then test for the existence of a number
Here is some C# as an example:
static void Main(string[] args)
{
string text = "8/16/32/64/";
Console.WriteLine(text);
TestForNum(text);
text = "8/16/32/64/b";
Console.WriteLine(text);
TestForNum(text);
Console.ReadKey();
}
private static void TestForNum(string text)
{
string tmp = Regex.Replace(text, #"/", "");
Match m = Regex.Match(tmp, #"^\d+$");
if(m.Success)
{
Console.WriteLine("\t" + m.Groups[0]);
}
else Console.WriteLine("\tno match");
}
A naive approach would be
[\d/]+
However, this does match //// as well as just 12345. To match only "proper" strings:
\d+(/\d+)+
Reads digits followed by delimiter+digits repeated at least once. If trailing/leading delimiters are allowed, then
/?(\d+/)+\d*
If you're using a flavor that uses slashes to quote the regex (like javascript), you'll need to escape them:
/\d+(\/\d+)+/
You can do:
(\d+)(\D|$)
See this work That will split a list of digits delimited by any non digit, so 1?2!3.4 would match
If you want a specific delimiter, such as /:
(\d+)(?:/|$)
As simple as possible:
(\d+\/?)+
Every digit followed by [a] slash, as many as possible. You may use g flag for all matches.

Capture multiple texts.

I have a problem with Regular Expressions.
Consider we have a string
S= "[sometext1],[sometext],[sometext]....,[sometext]"
The number of the "sometexts" is unknown,it's user's input and can vary from one to ..for example,1000.
[sometext] is some sequence of characters ,but each of them is not ",",so ,we can say [^,].
I want to capture the text by some regular expression and then to iterate through the texts in cycle.
QRegExp p=new QRegExp("???");
p.exactMatch(S);
for(int i=1;i<=p.captureCount;i++)
{
SomeFunction(p.cap(i));
}
For example,if the number of sometexts is 3,we can use something like this:
([^,]*),([^,]*),([^,]*).
So,i don't know what to write instead of "???" for any arbitrary n.
I'm using Qt 4.7,I didn't find how to do this on the class reference page.
I know we can do it through the cycles without regexps or to generate the regex itself in cycle,but these solutions don't fit me because the actual problem is a bit more complex than this..
A possible regular expression to match what you want is:
([^,]+?)(,|$)
This will match string that end with a coma "," or the end of the line. I was not sure that the last element would have a coma or not.
An example using this regex in C#:
String textFromFile = "[sometext1],[sometext2],[sometext3],[sometext4]";
foreach (Match match in Regex.Matches(textFromFile, "([^,]+?)(,|$)"))
{
String placeHolder = match.Groups[1].Value;
System.Console.WriteLine(placeHolder);
}
This code prints the following to screen:
[sometext1]
[sometext2]
[sometext3]
[sometext4]
Using an example for QRegex I found online here is an attempt at a solution closer to what you are looking for:
(example I found was at: http://doc.qt.nokia.com/qq/qq01-seriously-weird-qregexp.html)
QRegExp rx( "([^,]+?)(,|$)");
rx.setMinimal( TRUE ); // this is if the Qregex does not understand the +? non-greedy notation.
int pos = 0;
while ( (pos = rx.search(text, pos)) != -1 )
{
someFunction(rx.cap(1));
}
I hope this helps.
We can do that, you can use non-capturing to hook in the comma and then ask for many of the block:
Try:
QRexExp p=new QRegExp("([^,]*)(?:,([^,]*))*[.]")
Non-capturing is explained in the docs: http://doc.qt.nokia.com/latest/qregexp.html
Note that I also bracketed the . since it has meaning in RegExp and you seemed to want it to be a literal period.
I only know of .Net that lets you specify a variable number of captures with a single
expression. Example - (capture.*me)+
It creates a capture object that can be itterated over. Even then it only simulates
what every other regex engine provides.
Most engines provide an incremental match until no matches left from within a
loop. The global flag tells the engine to keep matching from where the last
sucessfull match left off.
Example (in Perl):
while ( $string =~ /([^,]+)/g ) { print $1,"\n" }