Regular Expressions: about Greediness, Laziness and Substrings

I have the following string:
123322
In theory, the regex 1.*2 should match the following:
12 (because * can be zero characters)
12332
123322
If I use the regex 1.*2 it matches 123322.
Using 1.*?2, it will match 12.
Is there a way to match 12332 too?
Ideally, I would get all possible matches in the string, even when one match is a substring of another.

No, unless there is something else added to the regex to clarify what it should do it will either be greedy or non-greedy. There is no in-betweeny ;)
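To see the two behaviours side by side, here is a minimal sketch. Python's re module is an assumption here, since the question does not name an engine. A single quantifier gives exactly one result per starting position: the longest candidate for .* and the shortest for .*?.

```python
import re

text = "123322"

# Greedy: .* takes as much as it can, then backtracks to the last '2'.
greedy = re.search(r"1.*2", text).group()

# Lazy: .*? takes as little as it can and stops at the first '2'.
lazy = re.search(r"1.*?2", text).group()

print(greedy)  # 123322
print(lazy)    # 12
```

Neither form reports 12332, which is why the answers below resort to loops or engine-specific tricks.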

1(.*?2)*$
you will have multiple captures which you can concatenate to form all possible matches
see here:regex tester
click on 'table' and expand the captures tree

You would need a separate expression for each case, depending on the number of twos you want to match:
1(.*?2){1} #same as 1.*?2
1(.*?2){2}
1(.*?2){3}
...

Generally, this isn't possible. A regex matching engine isn't really designed to find overlapping matches. A quick solution is simply to check the pattern on all substrings manually:
string text = "1123322";
for (int start = 0; start < text.Length - 1; start++)
{
    for (int length = 0; length <= text.Length - start; length++)
    {
        string subString = text.Substring(start, length);
        if (Regex.IsMatch(subString, "^1.*2$"))
            Console.WriteLine("{0}-{1}: {2}", start, start + length, subString);
    }
}
Working example: http://ideone.com/aNKnJ
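For comparison, the same brute-force check can be sketched in Python. The function name all_matches is hypothetical, and re.fullmatch stands in for the ^...$ anchors used in the C# version.

```python
import re

def all_matches(text, pattern):
    """Return (start, end, substring) for every substring the whole pattern matches."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            # fullmatch succeeds only if the pattern covers the entire candidate
            if compiled.fullmatch(candidate):
                found.append((start, end, candidate))
    return found

for start, end, sub in all_matches("1123322", r"1.*2"):
    print(start, end, sub)
```

This tests every substring, so the number of regex calls grows quadratically with the length of the text. Because each candidate is matched in isolation, patterns that rely on lookarounds or on the surrounding context may behave differently than they would inside the full string.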
Now, is it possible to get a whole-regex solution? Mostly, the answer is no. However, .NET does have a few tricks up its sleeve to help us: it allows variable-length lookbehind, and it lets each capturing group remember all of its captures (most engines only return the last match of each group). Abusing these, we can simulate the same for loop inside the regex engine:
string text = "1123322!";
string allMatchesPattern = @"
(?<=^  # Starting at the local end position, look all the way to the back
  (
    (?=(?<Here>1.*2\G))?   # on each position from the start until here (\G),
    .                      # *try* to match our pattern and capture it,
  )*                       # but advance even if you fail to match it.
)
";
MatchCollection matches = Regex.Matches(text, allMatchesPattern,
    RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
foreach (Match endPosition in matches)
{
    foreach (Capture startPosition in endPosition.Groups["Here"].Captures)
    {
        Console.WriteLine("{0}-{1}: {2}", startPosition.Index,
            endPosition.Index - 1, startPosition.Value);
    }
}
Note that currently there's a small bug there: the engine doesn't try to match at the last ending position ($), so you lose a few matches. For now, adding a ! at the end of the string solves that issue.
Working example: http://ideone.com/eB8Hb

Related

How to create "blocks" with Regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
is still detected as a correct string even though the order is wrong. The characters must appear in exactly the order below, and in no other order:
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!

To match a \ followed by x, followed by 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
See the regex demo
To make it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()){
res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [a-fA-F] to [a-f] if we add a case-insensitive modifier (?i) to the start of the pattern, or just use \p{XDigit}, which matches any hexadecimal digit in a Java regex.
The String#matches method always anchors the regex at both ends, so we do not need the leading ^ and trailing $ anchors when using the pattern inside it.
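The same two tasks, extracting blocks and validating a whole string, can be sketched in Python as well. The helper names extract_blocks and is_valid are made up for this example; re.fullmatch plays the role of the ^ and $ anchors.

```python
import re

BLOCK = re.compile(r"\\x[0-9A-Fa-f]{2}")
ONLY_BLOCKS = re.compile(r"(?:\\x[0-9A-Fa-f]{2})+")

def extract_blocks(s):
    """Return every \\xHH block found anywhere in s."""
    return BLOCK.findall(s)

def is_valid(s):
    """True only if s consists of one or more \\xHH blocks and nothing else."""
    return ONLY_BLOCKS.fullmatch(s) is not None

samples = [r"\xyz\yzx", r"x\12", r"12\x", r"\x12\x13\x14\x00\xff\xff"]
for s in samples:
    print(s, is_valid(s), extract_blocks(s))
```

Only the last sample is reported as valid; the first three contain no complete block at all.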

Regular expression to pull the inner content from a WordPress shortcode

I have a WordPress shortcode that opens with a [pullquote] and ends with [/pullquote]. I'm trying to get whatever is inside of the opening and closing tags.
I'm new to regular expressions, so I started with a simple one that captures letters, numbers and spaces.
\[pullquote\]([0-9a-zA-z\s]*)\[\/pullquote\]
That works fine but doesn't account for punctuation etc. so I tried (.*) which was doing too much and not nearly specific enough.
Finally I tried this
\[pullquote\](^(?:\[\/pullquote\])*)\[\/pullquote\]
I'm not clear on the terminology here, but essentially I wanted to match anything that starts with [pullquote], capture whatever comes after it as long as it isn't [/pullquote], and end with [/pullquote].
At least on regexr.com it didn't work but I assume that means I did something wrong.
Text used on regexr
[pullquote]Something[/pullquote]
[pullquote]Something else.[/pullquote]
How can I make this work, and am I doing anything else wrong here?
Thanks

You need just this:
(\[pullquote\])(.+)(\[\/pullquote\])
Then take only the content of group 2 ($2).
See it here: https://regex101.com/r/dS8eZ0/2
The information pulled out of the link:
MATCH INFORMATION
"(\[pullquote\])(.+)(\[\/pullquote\])/g"
1st Capturing group "(\[pullquote\])"
"\[" matches the character [ literally
"pullquote" matches the characters pullquote literally (case sensitive)
"\]" matches the character ] literally
2nd Capturing group "(.+)"
".+" matches any character (except newline)
"Quantifier: +" Between one and unlimited times, as many times as possible,
giving back as needed [greedy]
3rd Capturing group "(\[\/pullquote\])"
"\[" matches the character [ literally
"\/" matches the character / literally
"pullquote" matches the characters pullquote literally (case sensitive)
"\]" matches the character ] literally
"g" modifier: global. All matches (don't return on first match)

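One caveat with the greedy (.+) in the pattern above: when two shortcodes sit on the same line, the group stretches from the first opening tag to the last closing tag. A lazy quantifier avoids that. The following is a Python sketch rather than WordPress code; the name pullquote_contents is hypothetical, and re.DOTALL is added on the assumption that quotes may span several lines.

```python
import re

# .*? stops at the first closing tag; DOTALL lets the content contain newlines
PULLQUOTE = re.compile(r"\[pullquote\](.*?)\[/pullquote\]", re.DOTALL)

def pullquote_contents(text):
    """Return the inner text of every [pullquote]...[/pullquote] pair."""
    return PULLQUOTE.findall(text)

text = "[pullquote]Something[/pullquote] and [pullquote]Something else.[/pullquote]"
print(pullquote_contents(text))  # ['Something', 'Something else.']
```
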
Here is a rudimentary search using strpos(). You might try something like this for the sake of a performance comparison.
function extract_shortcode_content($needle, $haystack) {
    if (empty($needle) || empty($haystack) || !is_string($needle) || !is_string($haystack)) {
        throw new Exception('Bad input');
    }
    // $needle is just intended to be shortcode value (i.e. 'pullquote')
    // we will build appropriate start and end tags
    $needle_trimmed = trim(trim($needle), '[]');
    $start_code = '[' . $needle_trimmed . ']';
    $end_code = '[/' . $needle_trimmed . ']';
    $start_code_length = strlen($start_code);
    $end_code_length = strlen($end_code);
    $haystack_length = strlen($haystack);
    // last offset at which a start tag can begin and still leave room for an end tag
    $last_searchable_position = $haystack_length - $start_code_length - $end_code_length;
    $return_array = array();
    // iterate through haystack extracting content
    $search_offset = 0;
    while ($search_offset <= $last_searchable_position) {
        $start_code_found = strpos($haystack, $start_code, $search_offset);
        if ($start_code_found === false) {
            // no match in remainder of string
            return $return_array;
        }
        // extract content
        $content_start_position = $start_code_found + $start_code_length;
        $end_code_found = strpos($haystack, $end_code, $content_start_position);
        if ($end_code_found === false) {
            // we couldn't find close for current shortcode open tag.
            // we don't count this as a match, so let's just return matches we have
            return $return_array;
        }
        $match_length = $end_code_found - $content_start_position;
        // add content to result array
        $return_array[] = substr($haystack, $content_start_position, $match_length);
        // set new search offset position for next iteration
        $search_offset = $end_code_found + $end_code_length;
    }
    return $return_array;
}
Now, I am not suggesting that you should use this instead of the regex approach. Certainly the regex approach can get the same result in a few lines of code. I am simply suggesting that this approach may perform better than regex for this use case. This may however be a micro-optimization for your use case and not be worth the extra code complexity.
I just wanted to provide an alternate suggestion to regex.
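For readers who want to try the plain string-search idea outside PHP, here is a compact Python sketch of the same scan. It uses str.find in place of strpos(); the function name mirrors the one above and makes no performance claim.

```python
def extract_shortcode_content(name, haystack):
    """Collect the text between [name] and [/name] using plain string searches."""
    name = name.strip().strip("[]")
    start_tag = "[" + name + "]"
    end_tag = "[/" + name + "]"
    results = []
    offset = 0
    while True:
        start = haystack.find(start_tag, offset)
        if start == -1:
            break  # no further opening tag
        content_start = start + len(start_tag)
        end = haystack.find(end_tag, content_start)
        if end == -1:
            break  # unclosed tag: do not count it as a match
        results.append(haystack[content_start:end])
        offset = end + len(end_tag)
    return results

sample = "[pullquote]Something[/pullquote]\n[pullquote]Something else.[/pullquote]"
print(extract_shortcode_content("pullquote", sample))
```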

Regex to match the longest repeating substring

I'm writing a regular expression to check whether a string contains a substring made of at least 2 adjacent repeats of some pattern. I compare the result of the regex with the original string: if they are equal, such a pattern exists. An example says it better: 1010 contains the pattern 10 twice in a row. On the other hand, 10210 has no such pattern, because the two 10s are not adjacent.
What's more, I need to find the longest possible pattern, and its length is at least 1. I have written the expression ^.*?(.+)(\1).*?$ to check for it. To find the longest pattern, I used a non-greedy match for whatever precedes the pattern; then the pattern is captured in group 1, and the same text that group 1 matched must appear once more. Finally the rest of the string is matched, so the whole match equals the original string. The problem is that the regex returns as soon as it finds the first pattern, and doesn't take into account that I want the substrings before and after to be as short as possible (leaving the rest as long as possible). So for the string 01011010 I correctly get a match, but the pattern stored in group 1 is just 01, though I'd expect 101.
As far as I can tell, I can't make the pattern "more greedy" or the junk before and after "more non-greedy". My only idea is to make the regex less eager, but I'm not sure that is possible.
Further examples:
56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match
What I would get using current expression (same strings used):
no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match

In Perl you can do it with one expression with help of (??{ code }):
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
Output:
101
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.
To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.
The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).
Thus the full expression would have the form:
(?=(.+)\1)(?!.+?(..{N,})\2)
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
Usage example:
use v5.10;
sub longest_rep{
$_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}
say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
Output:
101
10001
191919
101

You can do it with a single regex; you just have to pick the longest match from the list of results manually.
import re

def longestrepeating(strg):
    regex = re.compile(r"(?=(.+)\1)")
    matches = regex.findall(strg)
    if matches:
        return max(matches, key=len)
This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):
>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']

Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.
It's pretty rudimentary code, but hopefully you'll get the gist of it.
use v5.10;
use strict;
use warnings;
while (<DATA>) {
    chomp;
    print "$_ : ";
    my $longest = foo($_);
    if ($longest) {
        say $longest;
    } else {
        say "No matches found";
    }
}
sub foo {
    my $num = shift;
    my @hits;
    for my $i (0 .. length($num)) {
        my $part = substr $num, $i;
        push @hits, $part =~ /(.+)(?=\1)/g;
    }
    my $long = shift @hits;
    for (@hits) {
        if (length($long) < length) {
            $long = $_;
        }
    }
    return $long;
}
__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919

Not sure if anyone's thought of this...
my $originalstring = "pdxabababqababqh1234112341";
my $max = int(length($originalstring) / 2);
my @result;
foreach my $n (reverse(1..$max)) {
    @result = $originalstring =~ m/(.{$n})\1/g;
    last if @result;
}
print join(",", @result), "\n";
The longest doubled match cannot exceed half the length of the original string, so we count down from there.
If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
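my $n = $max; # stays at $max if every length up to $max still matches
foreach (1..$max) {
    unless (@result = $originalstring =~ m/(.{$_,})\1/g) {
        $n = $_ - 1; # back up to the last length that still matched
        last;
    }
}
@result = $originalstring =~ m/(.{$n})\1/g;
print join(",", @result), "\n";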

Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
if($s =~ m/(.{$i})\1/)
{
$one = $1;
last;
}
}
# now $one is '101'
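The same countdown over candidate lengths can be sketched in Python. The function name longest_repeat is made up for this example, and it returns None when no adjacent repeat exists.

```python
import re

def longest_repeat(s):
    """Return the longest pattern that occurs twice in a row in s, or None."""
    # A doubled pattern cannot be longer than half the string, so start there.
    for size in range(len(s) // 2, 0, -1):
        m = re.search(r"(.{%d})\1" % size, s)
        if m:
            return m.group(1)
    return None

for s in ["01011010", "56712453289", "22010110100", "2323191919191919"]:
    print(s, longest_repeat(s))
```

Because the loop stops at the first size that matches, the regex engine's preference for the leftmost match only decides between patterns of equal length, never between a short and a long one.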

Capture multiple texts.

I have a problem with Regular Expressions.
Consider we have a string
S= "[sometext1],[sometext],[sometext]....,[sometext]"
The number of "sometexts" is unknown: it is user input and can vary from one to, for example, 1000.
[sometext] is some sequence of characters, none of which is ",", so we can describe each character as [^,].
I want to capture the texts with a regular expression and then iterate through them in a loop.
QRegExp p=new QRegExp("???");
p.exactMatch(S);
for(int i=1;i<=p.captureCount;i++)
{
SomeFunction(p.cap(i));
}
For example, if the number of sometexts is 3, we can use something like this:
([^,]*),([^,]*),([^,]*).
So I don't know what to write instead of "???" for an arbitrary n.
I'm using Qt 4.7, and I didn't find how to do this on the class reference page.
I know we could do it with loops without regexps, or generate the regex itself in a loop, but these solutions don't suit me because the actual problem is a bit more complex than this.

A possible regular expression to match what you want is:
([^,]+?)(,|$)
This will match a string that ends with a comma "," or at the end of the line. I was not sure whether the last element would have a trailing comma.
An example using this regex in C#:
String textFromFile = "[sometext1],[sometext2],[sometext3],[sometext4]";
foreach (Match match in Regex.Matches(textFromFile, "([^,]+?)(,|$)"))
{
String placeHolder = match.Groups[1].Value;
System.Console.WriteLine(placeHolder);
}
This code prints the following to screen:
[sometext1]
[sometext2]
[sometext3]
[sometext4]
Using an example for QRegExp that I found online, here is an attempt at a solution closer to what you are looking for:
(example I found was at: http://doc.qt.nokia.com/qq/qq01-seriously-weird-qregexp.html)
QRegExp rx("([^,]+?)(,|$)");
rx.setMinimal(true); // this is in case QRegExp does not understand the +? non-greedy notation.
int pos = 0;
while ((pos = rx.indexIn(text, pos)) != -1)
{
    someFunction(rx.cap(1));
    pos += rx.matchedLength(); // move past this match, otherwise the loop never ends
}
I hope this helps.

This can be done: use a non-capturing group to hook in the comma, then repeat that block as many times as needed.
Try:
QRegExp p("([^,]*)(?:,([^,]*))*[.]");
Non-capturing is explained in the docs: http://doc.qt.nokia.com/latest/qregexp.html
Note that I also bracketed the . since it has meaning in RegExp and you seemed to want it to be a literal period.

.NET is the only engine I know of that lets you obtain a variable number of captures
from a single expression. Example: (capture.*me)+
It creates a capture object that can be iterated over. Even then it only simulates
what every other regex engine provides.
Most engines provide an incremental match from within a loop until no matches
are left. The global flag tells the engine to keep matching from where the last
successful match left off.
Example (in Perl):
while ( $string =~ /([^,]+)/g ) { print $1,"\n" }
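The same incremental matching can be sketched in Python, where re.finditer plays the role of the /g loop. The function name split_fields is hypothetical.

```python
import re

def split_fields(s):
    """Yield-style scan: each match resumes where the previous one ended."""
    return [m.group(0) for m in re.finditer(r"[^,]+", s)]

s = "[sometext1],[sometext2],[sometext3]"
for field in split_fields(s):
    print(field)
```

With [^,]+ an empty field between two commas is skipped; whether that is desirable depends on the input, which the question does not specify.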

Get second part of a string using RegEx

I have string like this "first#second", and I wonder how to get "second" part without "#" symbol as result of RegEx, not as match capture using brackets
upd: I forgot to add one more special char at the end of string, real string is "first#second*"

Simple regex:
/#(.*)$/
If you really don't want it to be a match capture, and you know there's a # in the string but none in the part you want, you can do
/[^#]*$/
and the whole regex is what you want.

If you must use regex, and you insist on not using capturing groups, you can use lookbehind in flavors that support them like this:
(?<=#).*
Or you can also capture just anything but #, to the end of the string, so something like this:
[^#]*$
The capturing group option, of course, is:
#(.*)
\__/
1
This matches the # too, but group 1 captures the part that you want.
Lastly, a non-regex alternative may look something like this:
secondPart = wholeString.substring( wholeString.indexOf("#") + 1 )
There may be issues with some of these solutions if # can also appear (perhaps escaped) anywhere else in the string.
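The options above can be compared side by side in a short Python sketch. Python is an assumption, since the question names no language, and the last line handles the updated input "first#second*" by stopping at the * with a negated character class.

```python
import re

s = "first#second"

via_lookbehind = re.search(r"(?<=#).*", s).group()   # whole match, # excluded
via_tail = re.search(r"[^#]*$", s).group()           # everything after the last #
via_group = re.search(r"#(.*)", s).group(1)          # capturing group
via_partition = s.partition("#")[2]                  # no regex at all

# Updated input with a trailing special character
with_star = re.search(r"(?<=#)[^*]*", "first#second*").group()

print(via_lookbehind, via_tail, via_group, via_partition, with_star)
```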
References
regular-expressions.info
Lookarounds, Brackets for Capturing, Anchors

/[a-z]+#([a-z]+)/

You can use lookaround to exclude parts of an expression.
http://www.regular-expressions.info/lookaround.html

If you're using Java, you can consider the Pattern and Matcher classes. Pattern gives you a compiled, optimized version of the regular expression. Matcher gives you the full details of each match.
A Pattern/Matcher loop and String.split give the same result; the former is comparatively faster.
For example:
String s = "first#second#third";
String re = "#";
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(s);
int ms = 0; // start of the current segment
int me = 0; // end of the current segment
while (m.find()) {
    System.out.println("start " + m.start() + " end " + m.end() + " group " + m.group());
    me = m.start();
    System.out.println(s.substring(ms, me));
    ms = m.end();
}
System.out.println(s.substring(ms)); // the part after the last #
In other languages you can also consider back-references and groups if the text contains repetitions.
