Finding all matches with a regular expression - greedy and non greedy! - regex

Take the following string: "Marketing and Cricket on the Internet".
I would like to find all the possible matches for "Ma" -any text- "et" using a regex. So..
Market
Marketing and Cricket
Marketing and Cricket on the Internet
The regex Ma.*et returns "Marketing and Cricket on the Internet". The regex Ma.*?et returns Market. But I'd like a regex that returns all 3. Is that possible?
Thanks.

As far as I know: No.
But you could match non-greedy first and then generate a new regexp with a quantifier to get the second match.
Like this:
Ma.*?et
Ma.{3,}?et
...and so on...

Thanks guys, that really helped. Here's what I came up with for PHP:
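function preg_match_ubergreedy($regex, $text) {
    $matched = array();
    for ($i = 0; $i < strlen($text); $i++) {
        // Turn the * quantifier into an exact repeat count of $i.
        $exp = str_replace("*", "{" . $i . "}", $regex);
        // preg_match() returns 1 only when the pattern matched.
        if (preg_match($exp, $text, $matches) === 1) {
            $matched[] = $matches[0];
        }
    }
    return $matched;
}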
$text = "Marketing and Cricket on the Internet";
$matches = preg_match_ubergreedy("#Ma.*?et#is",$text);

Sadly, this is not possible with a standard POSIX regex, which returns a single match (the best candidate under the regex rules). To get all of them you will need an extension feature of the programming language in which you are using this regex, if it offers one.

For a more general regular expression, another option would be to recursively match the greedy regular expression against the previous match, discarding the first and last characters in turn to ensure that you're matching only a substring of the previous match. After matching Marketing and Cricket on the Internet, we test both arketing and Cricket on the Internet and Marketing and Cricket on the Interne for submatches.
It goes something like this in C#...
public static IEnumerable<Match> SubMatches(Regex r, string input)
{
var result = new List<Match>();
var matches = r.Matches(input);
foreach (Match m in matches)
{
result.Add(m);
if (m.Value.Length > 1)
{
string prefix = m.Value.Substring(0, m.Value.Length - 1);
result.AddRange(SubMatches(r, prefix));
string suffix = m.Value.Substring(1);
result.AddRange(SubMatches(r, suffix));
}
}
return result;
}
This version can, however, end up returning the same submatch several times. For example, it would find Marmoset twice in Marketing and Marmosets on the Internet: first as a submatch of Marketing and Marmosets on the Internet, then as a submatch of Marmosets on the Internet.
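One way to avoid duplicates for the original question is to drop the recursion and test every (start, end) pair exactly once. What follows is a minimal Python sketch, not a tuned implementation: the helper name all_matches is made up for illustration, it assumes Python 3.4+ for Pattern.fullmatch, and it makes a quadratic number of regex calls, so it only suits short inputs.

```python
import re

def all_matches(pattern, text):
    """Return every substring of text that the whole pattern matches."""
    rx = re.compile(pattern, re.DOTALL)
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch(string, pos, endpos) succeeds only if the pattern
            # spans exactly text[start:end].
            if rx.fullmatch(text, start, end):
                found.append(text[start:end])
    return found

print(all_matches(r"Ma.*et", "Marketing and Cricket on the Internet"))
# ['Market', 'Marketing and Cricket', 'Marketing and Cricket on the Internet']
```

Because each span is tested once, the same span can never be reported twice, which is the problem the recursive version runs into.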

Related

Building a Regex String - Any assistance provided

I'm very new to regex. I understand its purpose, but I'm still struggling to fully comprehend how to use it. I'm trying to build a regex to pull the A8OP2B out of the following (or whatever gets dumped in that 5th group).
{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}
The other items in the line above will change in character length, so I cannot say the 51st to the 56th character. What I want to pull out will always be the 5th group in quotation marks, though.
I've tried building up various regex strings, but it's still mostly a foreign language to me and I still have much reading to do on it.
Could anyone provide me a working example with the above, so I can reverse engineer and understand better?
Thanks
Demo 1: Reference the JSON to a var, then use either dot or bracket notation.
Demo 2: Using RegEx is not recommended, but here's one in JavaScript:
/\b(\w{6})(?=","RfKey":)/g
First match, step by step:
zero-width word boundary: \b, here between the non-word character " and the word character A
consuming match: A8OP2B
begin capture: (, any word character = \w, 6 times = {6}
end capture: )
non-consuming match: ","RfKey":
lookahead: (?= ... ) requires ","RfKey": to follow, without consuming it
Demo 1
var obj = {"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}};
var dataDot = obj.RfReceived.Data;
var dataBracket = obj['RfReceived']['Data'];
console.log(dataDot);
console.log(dataBracket)
Demo 2
Note: The input string below contains 3 consecutive records, so 3 matches are expected.
var rgx = /\b(\w{6})(?=","RfKey":)/g;
var str = `{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}},{"RfReceived":{"Sync":8080,"Low":102,"High":1200,"Data":"PFN07U","RfKey":"None"}},{"RfReceived":{"Sync":7580,"Low":471,"High":360,"Data":"XU89OM","RfKey":"None"}}`;
var res = str.match(rgx);
console.log(res);
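The same two approaches can be sketched in Python. This is only an illustration with made-up variable names. The regex here assumes the value contains no escaped quotes, and unlike \w{6} it does not assume the value is exactly 6 characters long.

```python
import json
import re

line = '{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}'

# Preferred: parse the JSON and read the field by name.
data_parsed = json.loads(line)["RfReceived"]["Data"]

# Regex fallback: capture whatever sits between the quotes after "Data":
match = re.search(r'"Data":"([^"]*)"', line)
data_regex = match.group(1) if match else None

print(data_parsed, data_regex)  # A8OP2B A8OP2B
```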

Parsing of a string with the length specified within the string

Example data:
029Extract this specific string. Do not capture anything else.
In the example above, I would like to capture the first n characters immediately after the 3 digit entry which defines the value of n. I.E. the 29 characters "Extract this specific string."
I can do this within a loop, but it is slow. I would like (if it is possible) to achieve this with a single regex statement instead, using some kind of backreference. Something like:
(\d{3})(.{\1})
With perl, you can do:
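use feature 'say';
my $str = '029Extract this specific string. Do not capture anything else.';
# \d{3} reads exactly the 3-digit length; /e evaluates the replacement as Perl code
$str =~ s/^(\d{3})(.*)$/substr($2,0,$1)/e;
say $str;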
output:
Extract this specific string.
You cannot do it with a single regex, but you can use the position where the regex stopped matching as the starting point for substr. For example, in JavaScript you can do something like this: http://jsfiddle.net/75Tm5/
var input = "blahblah 011I want this, and 029Extract this specific string. Do not capture anything else.";
var regex = /(\d{3})/g;
var matches;
while ((matches = regex.exec(input)) != null) {
alert(input.substr(regex.lastIndex, parseInt(matches[1], 10)));
}
This returns both lines:
I want this
Extract this specific string.
Depending on what you really want, you can modify the regex to match only numbers at the beginning of a line, to stop after the first match, and so on.
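A variation on the loop above, sketched in Python. It resumes the search after each extracted payload, so digits inside a payload are not mistaken for a length prefix. The function name is hypothetical, and the 3-digit prefix width is taken from the question.

```python
import re

def extract_length_prefixed(text):
    """Collect every field announced by a 3-digit length prefix."""
    prefix = re.compile(r"\d{3}")
    fields = []
    pos = 0
    while True:
        m = prefix.search(text, pos)
        if m is None:
            break
        length = int(m.group(0))
        start = m.end()
        fields.append(text[start:start + length])
        # Continue after the payload, not just after the prefix.
        pos = start + length
    return fields

sample = ("blahblah 011I want this, and "
          "029Extract this specific string. Do not capture anything else.")
print(extract_length_prefixed(sample))
# ['I want this', 'Extract this specific string.']
```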
Are you sure you need a regex?
From https://stackoverflow.com/tags/regex/info:
Fools Rush in Where Angels Fear to Tread
The tremendous power and expressivity of modern regular expressions
can seduce the gullible — or the foolhardy — into trying to use
regular expressions on every string-related task they come across.
This is a bad idea in general, ...
Here's a Python three-liner:
foo = "029Extract this specific string. Do not capture anything else."
substr_len = int(foo[:3])
print(foo[3:substr_len+3])
And here's a PHP three-liner:
$foo = "029Extract this specific string. Do not capture anything else.";
$substr_len = (int) substr($foo,0,3);
echo substr($foo, 3, $substr_len); // the third argument of substr() is a length, not an end index

Capture multiple texts.

I have a problem with Regular Expressions.
Consider we have a string
S= "[sometext1],[sometext],[sometext]....,[sometext]"
The number of "sometexts" is unknown; it is user input and can vary from one to, for example, 1000.
[sometext] is some sequence of characters, none of which is ",", so we can say [^,].
I want to capture the texts with a regular expression and then iterate through them in a loop.
QRegExp p=new QRegExp("???");
p.exactMatch(S);
for(int i=1;i<=p.captureCount;i++)
{
SomeFunction(p.cap(i));
}
For example, if the number of sometexts is 3, we can use something like this:
([^,]*),([^,]*),([^,]*).
So I don't know what to write instead of "???" for an arbitrary n.
I'm using Qt 4.7, and I didn't find how to do this on the class reference page.
I know we can do it with loops and no regexps, or generate the regex itself in a loop, but these solutions don't suit me because the actual problem is a bit more complex than this.
A possible regular expression to match what you want is:
([^,]+?)(,|$)
This will match a string that ends with a comma "," or with the end of the line. I was not sure whether the last element would have a comma or not.
An example using this regex in C#:
String textFromFile = "[sometext1],[sometext2],[sometext3],[sometext4]";
foreach (Match match in Regex.Matches(textFromFile, "([^,]+?)(,|$)"))
{
String placeHolder = match.Groups[1].Value;
System.Console.WriteLine(placeHolder);
}
This code prints the following to screen:
[sometext1]
[sometext2]
[sometext3]
[sometext4]
Using an example for QRegExp that I found online, here is an attempt at a solution closer to what you are looking for:
(example I found was at: http://doc.qt.nokia.com/qq/qq01-seriously-weird-qregexp.html)
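QRegExp rx("([^,]+?)(,|$)");
rx.setMinimal(true); // this is in case QRegExp does not understand the +? non-greedy notation.
int pos = 0;
// indexIn() is the Qt 4 name for the Qt 3 search() call
while ((pos = rx.indexIn(text, pos)) != -1)
{
    someFunction(rx.cap(1));
    pos += rx.matchedLength(); // move past this match, otherwise the loop never ends
}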
I hope this helps.
We can do that: you can use a non-capturing group to hook in the comma and then ask for many repetitions of the block.
Try:
QRegExp p=new QRegExp("([^,]*)(?:,([^,]*))*[.]");
Non-capturing is explained in the docs: http://doc.qt.nokia.com/latest/qregexp.html
Note that I also bracketed the . since it has meaning in RegExp and you seemed to want it to be a literal period.
As far as I know, only .NET lets you retrieve a variable number of captures from a single
expression. Example - (capture.*me)+
It creates a capture object that can be iterated over. Even then it only simulates
what every other regex engine provides.
Most engines provide incremental matching from within a loop until no matches are left.
The global flag tells the engine to keep matching from where the last
successful match left off.
Example (in Perl):
while ( $string =~ /([^,]+)/g ) { print $1,"\n" }
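For comparison, the same incremental loop as a Python sketch. re.finditer plays the role of Perl's /g in a while loop; the sample string and variable names are made up for the example.

```python
import re

s = "[sometext1],[sometext2],[sometext3]"

# finditer resumes right after the previous match, so each pass
# yields the next run of non-comma characters.
texts = [m.group(0) for m in re.finditer(r"[^,]+", s)]

for text in texts:
    print(text)
```

The number of items never has to appear in the pattern, which is what the "???" in the question was trying to express.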

Regular Expression to match two characters unless they're within two positions of another character

I'm trying to create a regular expression to match some certain characters, unless they appear within two of another character.
For example, I would want to match abc or xxabcxx but not tabct or txxabcxt.
Although with something like tabctxxabcxxtabcxt I'd want to match the middle abc and not the other two.
Currently I'm trying this in Java if that changes anything.
Try this:
String s = "tabctxxabcxxtabcxt";
Pattern p = Pattern.compile("t[^t]*t|(abc)");
Matcher m = p.matcher(s);
while (m.find())
{
String group1 = m.group(1);
if (group1 != null)
{
System.out.printf("Found '%s' at index %d%n", group1, m.start(1));
}
}
output:
Found 'abc' at index 7
t[^t]*t consumes anything that's enclosed in ts, so if the (abc) in the second alternative matches, you know it's the one you want.
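If "within two positions" is meant literally, another option is to spell out the exclusion zone with lookarounds. This is a different technique from the alternation above, shown as a Python sketch (java.util.regex accepts the same lookaround syntax). It assumes the forbidden character is t and the zone is two characters on each side of abc.

```python
import re

# (?<!t)   no t directly before abc
# (?<!t.)  no t two characters before abc
# (?!t)    no t directly after abc
# (?!.t)   no t two characters after abc
pattern = re.compile(r"(?<!t)(?<!t.)abc(?!t)(?!.t)")

def match_positions(s):
    """0-based start index of every abc that is clear of t on both sides."""
    return [m.start() for m in pattern.finditer(s)]

for s in ["abc", "xxabcxx", "tabct", "txxabcxt", "tabctxxabcxxtabcxt"]:
    print(s, match_positions(s))
```

On the five sample strings from the question this reports a match for abc, xxabcxx and the middle abc of the long string, and none for tabct or txxabcxt.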
EDITED! It was way wrong before.
Oooh, this one's tougher than I thought. Awesome. Using fairly standard syntax:
[^t]{2,}abc[^t]{2,}
That will catch xxabcxx but not abc, xabc, abcx, xabcx, xxabc, xxabcx, abcxx, or xabcxx. Maybe the best thing to do would be:
if 'abc' in string:
    if 't' in string:
        return regex match [^t]{2,}abc[^t]{2,}
    else:
        return true   # no t anywhere, so any abc qualifies
else:
    return false
Is that sufficient for your intention?

Match at every second occurrence

Is there a way to specify a regular expression to match every 2nd occurrence of a pattern in a string?
Examples
searching for a against string abcdabcd should find one occurrence at position 5
searching for ab against string abcdabcd should find one occurrence at position 5
searching for dab against string abcdabcd should find no occurrences
searching for a against string aaaa should find two occurrences at positions 2 and 4
Use capturing groups.
foo.*?(foo)
Use a regex like this to match all occurrences in a string. Every returned match will contain a second occurrence as its first captured group.
Here's an example that matches every second occurrence of \d+ in Python using findall:
import re
input = '10 is less than 20, 5 is less than 10'
second_occurrences = re.findall(r'\d+.*?(\d+)', input)
print(second_occurrences)
Output:
['20', '10']
Suppose the pattern you want is abc+d. You want to match the second occurrence of this pattern in a string.
You would construct the following regex:
abc+d.*?(abc+d)
This would match strings of the form <your-pattern>...<your-pattern>. Since we're using the reluctant quantifier *?, we can be sure that there is no other match between the two. Using capture groups, which pretty much all regex implementations provide, you would then retrieve the string in the bracketed group, which is the one you want.
If you're using C#, you can get all the matches at once (i.e. use Regex.Matches(), which returns a MatchCollection) and check the index of each item: index % 2 != 0.
If you want to find the occurrence in order to replace it, use one of the overloads of Regex.Replace() that takes a MatchEvaluator, e.g. Regex.Replace(String, String, MatchEvaluator). Here's the code:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string input = "abcdabcd";
// Replace every *second* a with m
string replacedString = Regex.Replace(
input,
"a",
new SecondOccuranceFinder("m").MatchEvaluator);
Console.WriteLine(replacedString);
Console.Read();
}
class SecondOccuranceFinder
{
public SecondOccuranceFinder(string replaceWith)
{
_replaceWith = replaceWith;
_matchEvaluator = new MatchEvaluator(IsSecondOccurance);
}
private string _replaceWith;
private MatchEvaluator _matchEvaluator;
public MatchEvaluator MatchEvaluator
{
get
{
return _matchEvaluator;
}
}
private int _matchIndex;
public string IsSecondOccurance(Match m)
{
_matchIndex++;
if (_matchIndex % 2 == 0)
return _replaceWith;
else
return m.Value;
}
}
}
}
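The "collect all matches, keep the odd indices" idea can also be sketched in Python. The helper name every_second is made up, and positions are reported 1-based to line up with the examples in the question.

```python
import re

def every_second(pattern, text):
    """1-based start positions of the 2nd, 4th, ... non-overlapping matches."""
    return [m.start() + 1
            for i, m in enumerate(re.finditer(pattern, text))
            if i % 2 == 1]

print(every_second("a", "abcdabcd"))    # [5]
print(every_second("ab", "abcdabcd"))   # [5]
print(every_second("dab", "abcdabcd"))  # []
print(every_second("a", "aaaa"))        # [2, 4]
```

Keeping the counting outside the regex means the pattern itself does not have to be repeated or made non-greedy.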
Would something like
(pattern.*?(pattern))*
work for you?
Edit:
The problem with this is that it uses the non-greedy operator *?, which can require an awful lot of backtracking along the string instead of just looking at each letter once. What this means for you is that this could be slow for large gaps.
Back references can find interesting solutions here. This regex:
([a-z]+).*(\1)
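will find a repeated sequence: the longest one that starts at the leftmost position where any repeat exists, which is not necessarily the longest repeat in the whole string.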
This one will find a sequence of 3 letters that is repeated:
([a-z]{3}).*(\1)
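A quick Python check of both patterns on made-up sample strings. The engine reports the leftmost match, so the first sample returns ab even though cdcd is a longer repeated sequence further along.

```python
import re

# Leftmost position wins: "ab" repeats starting at index 0,
# so the later, longer repeat "cdcd" is never reported.
m = re.search(r"([a-z]+).*(\1)", "abab cdcdcdcd")
print(m.group(1))  # ab

# Exactly three letters that occur again later in the string.
m3 = re.search(r"([a-z]{3}).*(\1)", "xabcyyabcz")
print(m3.group(1))  # abc
```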
There's no "direct" way of doing so, but you can specify the pattern twice, as in a[^a]*a, which matches up to the second "a".
The alternative is to use your programming language (Perl? C#? ...) to match the first occurrence and then the second one.
EDIT: I've seen other responders using the "non-greedy" operators, which might be a good way to go, assuming you have them in your regex library!