bash: extract executed line numbers from gcov report - regex

gcov is a GNU toolchain utility that produces code coverage reports (see documentation) formatted as follows:
-: 0:Source:../../../edg/attribute.c
-: 0:Graph:tmp.gcno
-: 0:Data:tmp.gcda
-: 0:Runs:1
-: 0:Programs:1
-: 1:#include <stdio.h>
-: 2:
-: 3:int main (void)
1: 4:{
1: 5: int i, total;
-: 6:
1: 7: total = 0;
-: 8:
11: 9: for (i = 0; i < 10; i++)
10: 10: total += i;
-: 11:
1: 12: if (total != 45)
#####: 13: printf ("Failure\n");
-: 14: else
1: 15: printf ("Success\n");
1: 16: return 0;
-: 17:}
From a bash script, I need to extract the line numbers of the lines that were executed. $ egrep --regexp='^\s+[1-9]' example_file.c.gcov seems to return the relevant lines. An example of typical output would be:
1: 978: attr_name_map = alloc_hash_table(NO_MEMORY_REGION_NUMBER,
79: 982: for (k = 0; k<KNOWN_ATTR_TABLE_LENGTH; ++k) {
78: 989: attr_name_map_entries[k].descr = &known_attr_table[k];
78: 990: *ep = &attr_name_map_entries[k];
1: 992:} /* init_attr_name_map */
519: 2085: new_attr_seen = FALSE;
519: 2103: p_attributes = last_attribute_link(p_attributes);
519: 2104: } while (new_attr_seen);
519: 2106: return attributes;
16: 3026:void transform_type_with_gnu_attributes(a_type_ptr *p_type,
16: 3041: for (ap = attributes; ap != NULL; ap = ap->next) {
1: 6979:void process_alias_fixup_list(void)
1: 6984: an_alias_fixup_ptr entries = alias_fixup_list, entry;
I subsequently must extract the line number strings. The expected output from this example would be:
978
982
989
990
992
2085
2103
2104
2106
3026
3041
6979
6984
Could someone suggest a reliable, robust way to achieve this?
NOTE:
My idea was to eliminate everything that is not placed between the first and the second instance of the character :, which I tried to do with sed without much success so far.

This is fairly simple to do using awk:
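awk -F: '/^ *[0-9]+:/ {gsub(/ /, "", $2); print $2}' file.gcov
That is, use : as the field separator,
and for lines that start with optional spaces followed by digits and a : (the execution count),
remove the spaces from the 2nd field and print the 2nd field.
The ^ anchor matters: without it the pattern could also match the line-number column of lines that were never executed.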
But if you really want to use sed,
and you want something robust, you could do this:
sed -e '/^ *[0-9][0-9]*: *[0-9][0-9]*:/!d' -e 's/[^:]*: *//' -e 's/:.*//' file.gcov
What's happening here?
The first command uses a pattern to match lines starting with zero or more spaces followed by 1 or more digits followed by a : followed by zero or more spaces followed by 1 or more digits followed by a :. Then comes the interesting part: we invert this selection with ! and delete it with d. We effectively delete all lines except the ones we need.
The second command is a simple substitution that removes a sequence of characters that are not :, the : that follows, and zero or more spaces after it. The first match always starts at the beginning of the line, so there is no need for a leading ^, and there is no need to be stricter about the spaces, because the previous command has already checked the layout of the line.
The last command is even simpler: remove the : and everything after it.
Some versions of sed will give you shortcuts for a more compact writing style, for example [0-9]+ instead of [0-9][0-9]*, but the example above will work with a wider variety of implementations (notably BSD).
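For comparison, the same filter can be sketched in Python, which may be handy when the surrounding tooling is not pure shell. This is only an illustrative sketch: the names EXECUTED, executed_lines and SAMPLE are made up here, and the leading whitespace in SAMPLE is an assumption about how gcov pads its columns. It follows the same idea as the sed pattern above: keep the lines whose first field is a number and return the second field.

```python
import re

# Optional spaces, a numeric execution count, ':', optional spaces,
# the source line number (captured), ':'.
EXECUTED = re.compile(r'^\s*[0-9]+:\s*([0-9]+):')


def executed_lines(report_lines):
    """Return the source line numbers whose execution count is a number."""
    numbers = []
    for line in report_lines:
        match = EXECUTED.match(line)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


# Hypothetical excerpt of a .gcov report (the column padding is assumed).
SAMPLE = [
    '        -:    0:Source:tmp.c',
    '        -:    3:int main (void)',
    '        1:    4:{',
    '       11:    9:  for (i = 0; i < 10; i++)',
    '    #####:   13:    printf ("Failure\\n");',
    '        1:   16:  return 0;',
]

print(executed_lines(SAMPLE))  # [4, 9, 16]
```

Lines marked with - (no code) or ##### (never executed) do not start with a numeric count, so they are skipped.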

How to move repeated parts and lines of c++ code using awk or sed?

I have a huge chunk of C++ code with thousands of lines like this:
case 14: //OrderSelect
Execute_OrderSelect();
break;
case 15: // OrderGetDouble
Execute_OrderGetDouble();
break;
case 16: //OrderGetInteger
Execute_OrderGetInteger();
break;
My task is to make them look like this:
case 14: Execute_OrderSelect(); break; // OrderSelect
case 15: Execute_OrderGetDouble(); break; // OrderGetDouble
case 16: Execute_OrderGetInteger(); break; // OrderGetInteger
Note that both the Execute... calls and the comments can be any string.
I suppose that schematically we could write the original like this:
AAA NN BBB
CCC
DDD
and try to turn it into: AAA NN CCC DDD BBB.
I have tried unsuccessfully with all sorts of sed expressions, and the best I could do was the trivial operation of combining the Execute...() with the break;, but was not able to move the comment around. I am thinking I am using the wrong tool for this, and perhaps awk would be a better option or simpler to use?
Here are some awk variables:
FNR The input record number in the current input file.
FS The input field separator, a space by default.
NF The number of fields in the current input record.
NR The total number of input records seen so far.
OFMT The output format for numbers, "%.6g", by default.
OFS The output field separator, a space by default.
ORS The output record separator, by default a newline.
RS The input record separator, by default a newline.
RT The record terminator. Gawk sets RT to the input
text that matched the character or regular expression
specified by RS.
RSTART The index of the first character matched by match(); 0 if no match
How can I make my day brighter?
Related Questions:
AWK or sed way to paste non-adjacent lines
change the position of a line in a file using sed
How to select lines between two marker patterns which may occur multiple times with awk/sed
Here are the bones, massage to suit:
$ cat tst.awk
/^[[:space:]]*case[[:space:]]/ {
    comment = ""
    if ( match($0,"//") ) {
        comment = substr($0,RSTART)
        $0 = substr($0,1,RSTART-1)
    }
    caseLineNr = 1
}
caseLineNr {
    if ( caseLineNr++ > 1 ) {
        sub(/^[[:space:]]+/,"")
    }
    sub(/[[:space:]]+$/,"")
    printf "%s\t", $0
    if ( /^break[[:space:]]*;/ ) {
        print comment
        caseLineNr = 0
    }
}
$ awk -f tst.awk file
case 14: Execute_OrderSelect(); break; //OrderSelect
case 15: Execute_OrderGetDouble(); break; // OrderGetDouble
case 16: Execute_OrderGetInteger(); break; //OrderGetInteger
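If awk is not available, the AAA NN BBB / CCC / DDD to AAA NN CCC DDD BBB rearrangement can also be sketched with a multi-line regular expression in Python. This is a rough sketch, not a port of the awk script: CASE_BLOCK and fold_case_blocks are hypothetical names, and it assumes exactly one statement line between the case label and the break, whereas the awk version above accepts any number of lines.

```python
import re

CASE_BLOCK = re.compile(
    r'^([ \t]*case[ \t]+[^:\n]+:)[ \t]*(//[^\n]*)?\n'  # label, optional comment
    r'[ \t]*([^\n]*?)[ \t]*\n'                         # the single statement line
    r'[ \t]*(break[ \t]*;)[ \t]*$',                    # the break line
    re.MULTILINE,
)


def fold_case_blocks(source):
    """Join each three-line case block into one line, comment last."""
    def join(match):
        label, comment, statement, brk = match.groups()
        parts = [label, statement, brk]
        if comment:
            parts.append(comment)
        return ' '.join(parts)
    return CASE_BLOCK.sub(join, source)


SOURCE = (
    'case 14: //OrderSelect\n'
    'Execute_OrderSelect();\n'
    'break;\n'
    'case 15: // OrderGetDouble\n'
    '  Execute_OrderGetDouble();\n'
    '  break;\n'
)

print(fold_case_blocks(SOURCE), end='')
```

Blocks that do not fit the assumed three-line shape are left untouched, which makes it easy to spot what still needs manual attention.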

How to get string from in between 2 strings

I am currently trying to get a string that is in between 2 substrings. In this case the string I need to manipulate is a block of code. I am not sure if it is the regex or the search function, but I keep getting None back and I shouldn't. I need to get the offset on line 53, but I need to use Gusset To Backplate Left Gus 1 as the start marker, and I think ENDFOR could be the end marker. I am just not quite sure how the syntax for something like this would work in Python. I have tried some of the examples that I have seen online and have had no luck so far. Any help would be appreciated. I would also like to do it with compile, since the offsets could be accessed multiple times.
s = '''!GUSSET TO BACKPLATE LEFT GUS 1 ;
45: E_NO(8) ;
46: FOR R[191:COUNTER B]=1 TO R[199:CHANNELS] ;
47: ;
48: CALL CHAN_BP_TO_GR ;
49: ;
50: PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]-R[197:X OFFSET MM] ;
51: --eg:THESE OFFSETS ONLY APPLY TO THIS BLOCK AND INCREASE THE AMOUNT GIVEN
: EACH LOOP ;
52: !X OFFSET ;
53: PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+21 ;
54: !Y OFFSET ;
55: PR[GP1:2,2:OFFSET]=PR[GP1:2,2:OFFSET]+0 ;
56: !Z OFFSET ;
57: PR[GP1:2,3:OFFSET]=PR[GP1:2,3:OFFSET]+0 ;
58: ENDFOR ;'''
string1 = re.compile('!GUSSET TO BACKPLATE LEFT GUS 1 ;')
string2 = re.compile('PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+[0-9]* ;')
string3 = re.compile('ENDFOR ;')
result = re.search(r'!GUSSET TO BACKPLATE LEFT GUS 1 ;, (PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+[0-9]* ;),ENDFOR ;', s)
'.(PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+[0-9]* ;'
print(result)
re.M only changes how ^ and $ behave, so it is harmless here but not actually required.
What you do need is the re.DOTALL flag, so that . also matches newlines.
!GUSSET.*PR\[GP1:2,1:OFFSET\]= will match all text up to the OFFSET on line 53 (.* is greedy, so it runs to the last PR[GP1:2,1:OFFSET]= in the string; the square brackets must be escaped, otherwise they are read as a character class). Then we match anything that's not a space or ; and save that to be returned by result.group(1) as shown below.
(?!ENDFOR).*ENDFOR.* is meant to match anything that's not ENDFOR, followed by ENDFOR.
Be aware that the lookahead is only tested at the single position where it appears and both .* are greedy, so if s contained several such blocks the match would not be confined to the first one.
Try:
result = re.search(r'!GUSSET.*PR\[GP1:2,1:OFFSET\]=([^; ]+)(?!ENDFOR).*ENDFOR.*', s, re.M|re.DOTALL)
print(result.group(1))
This will return:
PR[GP1:2,1:OFFSET]+21
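Since the question asks for compiled patterns and for a match that stays between the start marker and ENDFOR, a two-step variant may be easier to reason about. The following is only a sketch under assumptions: BLOCK, X_OFFSET, x_offset and the shortened SAMPLE are made-up names and data, and it assumes the wanted value is always the signed number on the line right after the !X OFFSET comment.

```python
import re

# Step 1: lazily capture everything between the start marker and the
# first ENDFOR that follows it.
BLOCK = re.compile(r'!GUSSET TO BACKPLATE LEFT GUS 1 ;(.*?)ENDFOR ;', re.DOTALL)

# Step 2: inside that block, find the assignment after the "!X OFFSET"
# comment and capture the signed number. Square brackets are escaped so
# they are not treated as a character class.
X_OFFSET = re.compile(
    r'!X OFFSET ;\s*\d+:\s*'
    r'PR\[GP1:2,1:OFFSET\]=PR\[GP1:2,1:OFFSET\]([+-]\d+) ;'
)


def x_offset(program):
    """Return the X offset of the block as an int, or None if not found."""
    block = BLOCK.search(program)
    if block is None:
        return None
    match = X_OFFSET.search(block.group(1))
    return int(match.group(1)) if match else None


# Shortened, hypothetical version of the program text from the question.
SAMPLE = '''!GUSSET TO BACKPLATE LEFT GUS 1 ;
  50:  PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]-R[197:X OFFSET MM] ;
  52:  !X OFFSET ;
  53:  PR[GP1:2,1:OFFSET]=PR[GP1:2,1:OFFSET]+21 ;
  54:  !Y OFFSET ;
  55:  PR[GP1:2,2:OFFSET]=PR[GP1:2,2:OFFSET]+0 ;
  58:  ENDFOR ;'''

print(x_offset(SAMPLE))  # 21
```

Both patterns are compiled once at import time, so calling x_offset repeatedly does not recompile them.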

How to Ignore capital using lapply(str_subset)

I am trying to create a new column (D$NEW) in Data.table D which matches each row of D to a whole column (D2$COLUMN1) in Data.table D2 using str_subset. (My data structure is at the bottom)
D[,NEW:= lapply(D[,C1],function(x)str_subset(as.character(D2$COLUMN1), x))]
This works fine.
But I also want str_subset to ignore case.
But when I use ignore.case(x)
D[,NEW:= lapply(D[,C1],function(x)str_subset(as.character(D2$COLUMN1), ignore.case(x)))]
I get the following error
## PLEASE use (fixed|coll|regexp)(x, ignore_case=TRUE)
When I use ignore_case=TRUE
D[,F:= lapply(D[,V1],function(x) str_subset(as.character(D2$COLUMN1), x, ignore_case=TRUE))]
I get the following error:
Error in str_subset(as.character(), x, ignore_case = TRUE) : unused argument (ignore_case = TRUE)
How can I force this function to ignore case?
Data:
D<-data.table(C1=c("a","b","c","d","e","A","B","C"), C2=c(1,2,3,4,5,6,7,8,9,10))
D2<-data.table(COLUMN1=c("a"), COLUMN2=c("b"), COLUMN3=c(1:10))
The first error tells you that ignore.case() can no longer be used as a function. The second error arises because the str_subset function has no ignore_case argument.
Use an inline case-insensitive modifier (?i):
D[,NEW:= lapply(D[,C1],function(x)str_subset(as.character(D2$COLUMN1), paste0("(?i)",x)))]
                                                                       ^^^^^^^^^^^^^^^^
The inline case-insensitive modifier (?i) does the same thing as ignore.case / ignore_case: it makes matching case-insensitive. See more details on inline modifiers at regular-expressions.info. When placed somewhere inside the pattern, it makes the part after it match in a case-insensitive way. So, by placing it at the start of the pattern, you make the whole pattern case-insensitive.
Else, you may pass the TRUE to the regex function:
D[,NEW:= lapply(D[,C1],function(x)str_subset(as.character(D2$COLUMN1), regex(x, TRUE)))]
                                                                       ^^^^^^^^^^^^^^
The TRUE is the value of the ignore_case argument (you may write it as regex(x, ignore_case=TRUE)). See more details on the options you may use in the stri_opts_regex section here. For some reason, the case_insensitive=TRUE does not work. I got an error:
Error in stri_opts_regex(case_insensitive = ignore_case, multiline = multiline, :
formal argument case_insensitive matched by multiple actual arguments
So, I had to replace it with ignore_case.
Result:
> D
C1 C2 NEW
1: a 1 a,a,a,a,a,a,
2: b 2
3: c 3
4: d 4
5: e 5
6: A 6 a,a,a,a,a,a,
7: B 7
8: C 8
9: a 9 a,a,a,a,a,a,
10: b 10

Unable to match Indian Rupee currency symbol using regex in Perl

Following is my text:
Total: ₹ 131.84
Thanks for choosing Uber, Pradeep
I would like to match the amount part, using the following code:
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
$amount = $1;
}
But, it does not match, tried using regex debugging, here's the output:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: DIGIT (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
UTF-8 pattern...
Match failed
Freeing REx: "Total: \x{20B9} (\d+)"
The full code is at http://pastebin.com/TGdFX7hg.
Disclaimer: This feels more like a comment than an answer, but I need more space.
I've never used MIME::Parser and friends before, but from what I've read in the documentation, the following might work:
use Encode qw(decode);
# according to your code, $text_mail is a MIME::Entity object
my $charset = $text_mail->head->mime_attr('content-type.charset');
my $mail_body_raw = $text_mail->bodyhandle->as_string;
my $mail_body = decode $charset, $mail_body_raw;
The idea is to get the charset from the MIME::Head object, then use Encode to decode the body accordingly.
Of course, if you know that it's always going to be UTF-8 text, you could also hardcode that:
my $mail_body = decode 'UTF-8', $mail_body_raw;
After that, your regex may still fail to work because according to the debugging output in your question the character between ₹ and the number is actually not a simple space (ASCII 32, U+0020), but a non-breaking space (U+00A0). You should be able to match that with \s:
if ( $mail_body =~ /Total: \x{20B9}\s(\d+)/ ) {
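The effect of decoding, and of the non-breaking space, can be illustrated outside Perl as well. The sketch below uses Python rather than Perl, purely as an illustration of the same idea; the byte string is reconstructed from the octal escapes in the debugging output of the question, and the variable names are made up.

```python
import re

# Raw body bytes as they appear in the debug output:
# E2 82 B9 is the UTF-8 encoding of U+20B9 (rupee sign),
# C2 A0 is the UTF-8 encoding of U+00A0 (non-breaking space).
raw_body = b'Total: \xe2\x82\xb9\xc2\xa0131.84\n\nThanks for choosing Uber, Pradeep'

# Decode the bytes into characters before matching.
body = raw_body.decode('utf-8')

# A literal ASCII space does not match the non-breaking space ...
with_space = re.compile('Total: \u20b9 (\\d+)')
# ... but \s does, because it covers Unicode whitespace for str patterns.
with_any_space = re.compile('Total: \u20b9\\s(\\d+)')

print(with_space.search(body))               # None
print(with_any_space.search(body).group(1))  # 131
```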
This is a bit of searching for the explanation, not an outright answer. Please bear with me.
I believe your $mail_body does not contain what you think it does. You posted the input data as plain text. Was that copied from a mail client?
If I take the code and the input data from the question and run it with use re 'debug' I get a different output.
use utf8;
use strict;
use warnings;
use re 'debug';
my $mail_body = qq{Total: ₹ 131.84
Thanks for choosing Uber, Pradeep};
if ( $mail_body =~ /Total: \x{20B9} (\d+)/ ) {
my $amount = $1;
}
It will produce this:
Compiling REx "Total: \x{20B9} (\d+)"
Final program:
1: EXACT <Total: \x{20b9} > (5)
5: OPEN1 (7)
7: PLUS (9)
8: POSIXU[\d] (0)
9: CLOSE1 (11)
11: END (0)
anchored utf8 "Total: %x{20b9} " at 0 (checking anchored) minlen 10
Matching REx "Total: \x{20B9} (\d+)" against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
UTF-8 pattern and string...
Intuit: trying to determine minimum start position...
Found anchored substr "Total: %x{20b9} " at offset 0...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <Total: > | 1:EXACT <Total: \x{20b9} >(5)
11 < %x{20b9} > <131.84%n%n>| 5:OPEN1(7)
11 < %x{20b9} > <131.84%n%n>| 7:PLUS(9)
POSIXU[\d] can match 3 times out of 2147483647...
14 <%x{20b9} 131> <.84%n%nTha>| 9: CLOSE1(11)
14 <%x{20b9} 131> <.84%n%nTha>| 11: END(0)
Match successful!
Freeing REx: "Total: \x{20B9} (\d+)"
Let's compare the line with the Matching REx to your output:
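against "Total: %x{20b9} 131.84%n%nThanks for choosing Uber, Pradeep"
against "Total: %342%202%271%302%240131.84%n%nThanks for choosing Ube"...
As we can see, my output has the single character %x{20b9}, while yours has the byte sequence %342%202%271 (octal for E2 82 B9, the UTF-8 encoding of that character) followed by %302%240 (C2 A0, a UTF-8 encoded non-breaking space).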
When I started trying this code I forgot to put use utf8 in my code, so I got a bunch of single characters when the regex engine tried to match:
%x{e2}%x{82}%x{b9}
It then rejected the match.
So my conclusion is: Perl doesn't know your input data is utf8.

Why can't you use repetition quantifiers in zero-width look behind assertions?

I was always under the impression that you couldn't use repetition quantifiers in zero-width assertions (Perl Compatible Regular Expressions [PCRE]). However, it has recently come to my attention that you can use them in look ahead assertions.
How does the PCRE regex engine handle zero-width look behinds, and what about that precludes repetition quantifiers from being used in them?
Here is a simple example from a PCRE in R:
# Our string
x <- 'MaaabcccM'
## Does it contain a 'b', preceded by an 'a' and followed by zero or more 'c',
## then an 'M'?
grepl( '(?<=a)b(?=c*M)' , x , perl=T )
# [1] TRUE
## Does it contain a 'b': (1) preceded by an 'M' and then zero or more 'a' and
## (2) followed by zero or more 'c' then an 'M'?
grepl( '(?<=Ma*)b(?=c*M)' , x , perl = TRUE )
# Error in grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) :
# invalid regular expression '(?<=Ma*)b(?=c*M)'
# In addition: Warning message:
# In grepl("(?<=Ma*)b(?=c*M)", x, perl = TRUE) : PCRE pattern compilation error
# 'lookbehind assertion is not fixed length'
# at ')b(?=c*M)'
The ultimate answer to such a question is in the engine's code, and at the bottom of the answer you'll be able to dive into the section of the PCRE engine's code responsible for ensuring fixed-length in lookbehinds—if you're interested in knowing the finest details. In the meantime, let's gradually zoom into the question from higher levels.
Variable-Width Lookbehind vs. Infinite-Width Lookbehind
First off, a quick clarification on terms. A growing number of engines (including PCRE) support some form of variable-width lookbehind, where the variation falls within a determined range, for instance:
the engine knows that the width of what precedes must be within 5 to 10 characters (not supported in PCRE)
the engine knows that the width of what precedes must be either 5 or 10 characters (supported in PCRE)
In contrast, in infinite-width lookbehind, you can use quantified tokens such as a+.
Engines that Support Infinite-Width Lookbehind
For the record, these engines support infinite lookbehind:
.NET (C#, VB.NET etc.)
Matthew Barnett's regex module for Python
JGSoft (EditPad etc.; not available in a programming language).
As far as I know, they are the only ones.
Variable Lookbehind in PCRE
In PCRE, the most relevant section in the documentation is this:
The contents of a lookbehind assertion are restricted such that all
the strings it matches must have a fixed length. However, if there are
several top-level alternatives, they do not all have to have the same
fixed length.
Therefore, the following lookbehind is valid:
(?<=a |big )cat
However, none of these are:
(?<=a\s?|big )cat (the sides of the alternation do not have a fixed width)
(?<=#{1,10})cat (variable width)
(?<=\R)cat (\R does not have a fixed-width as it can match \n, \r\n, etc.)
(?<=\X)cat (\X does not have a fixed-width as a Unicode grapheme cluster can contain a variable number of bytes.)
(?<=a+)cat (clearly not fixed)
Lookbehind with Zero-Width Match but Infinite Repetition
Now consider this:
(?<=(?=#+))(cat#+)
On the face of it, this is a fixed-width lookbehind, because it can only ever find a zero-width match (defined by the lookahead (?=#+)). Is that a trick to get around the infinite lookbehind limitation?
No. PCRE will choke on this. Even though the content of the lookbehind is zero-width, PCRE will not allow infinite repetition in the lookbehind. Anywhere. When the documentation says all the strings it matches must have a fixed length, it should really be:
All the strings that any of its components matches must have a fixed
length.
Workarounds: Life without Infinite Lookbehind
In PCRE, the two main solutions to problems where infinite lookbehinds would help are \K and capture groups.
Workaround #1: \K
The \K assertion tells the engine to drop what was matched so far from the final match it returns.
Suppose you want (?<=#+)cat#+, which is not legal in PCRE. Instead, you can use:
#+\Kcat#+
Workaround #2: Capture Groups
Another way to proceed is to match whatever you would have placed in a lookbehind, and to capture the content of interest in a capture group. You then retrieve the match from the capture group.
For instance, instead of the illegal (?<=#+)cat#+, you would use:
#+(cat#+)
In R, this could look like this:
matches <- regexpr("#+(cat#+)", subject, perl=TRUE);
result <- attr(matches, "capture.start")[,1]
attr(result, "match.length") <- attr(matches, "capture.length")[,1]
regmatches(subject, result)
In languages that don't support \K, this is often the only solution.
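Python's standard re module is one such case: it has no \K, and it rejects the quantified lookbehind at compile time. A minimal sketch of the capture-group workaround there (the subject string and the variable names are made up for illustration):

```python
import re

subject = 'x###cat##y'

# The quantified lookbehind is refused when the pattern is compiled.
try:
    re.compile(r'(?<=#+)cat#+')
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False

# Workaround: match the prefix normally and capture only the part of interest.
match = re.search(r'#+(cat#+)', subject)
captured = match.group(1)

print(lookbehind_accepted)  # False
print(captured)             # cat##
```

The overall match still includes the leading # characters; only group 1 holds what the lookbehind version would have returned.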
Engine Internals: What Does the PCRE Code Say?
The ultimate answer is to be found in pcre_compile.c. If you examine the code block that starts with this comment:
If lookbehind, check that this branch matches a fixed-length string
You find that the grunt work is done by the find_fixedlength() function.
I reproduce it here for anyone who would like to dive into further details.
static int
find_fixedlength(pcre_uchar *code, BOOL utf, BOOL atend, compile_data *cd)
{
  int length = -1;
  register int branchlength = 0;
  register pcre_uchar *cc = code + 1 + LINK_SIZE;
  /* Scan along the opcodes for this branch. If we get to the end of the
  branch, check the length against that of the other branches. */
  for (;;)
  {
    int d;
    pcre_uchar *ce, *cs;
    register pcre_uchar op = *cc;
    switch (op)
    {
      /* We only need to continue for OP_CBRA (normal capturing bracket) and
      OP_BRA (normal non-capturing bracket) because the other variants of these
      opcodes are all concerned with unlimited repeated groups, which of course
      are not of fixed length. */
      case OP_CBRA:
      case OP_BRA:
      case OP_ONCE:
      case OP_ONCE_NC:
      case OP_COND:
        d = find_fixedlength(cc + ((op == OP_CBRA)? IMM2_SIZE : 0), utf, atend, cd);
        if (d < 0) return d;
        branchlength += d;
        do cc += GET(cc, 1); while (*cc == OP_ALT);
        cc += 1 + LINK_SIZE;
        break;
      /* Reached end of a branch; if it's a ket it is the end of a nested call.
      If it's ALT it is an alternation in a nested call. An ACCEPT is effectively
      an ALT. If it is END it's the end of the outer call. All can be handled by
      the same code. Note that we must not include the OP_KETRxxx opcodes here,
      because they all imply an unlimited repeat. */
      case OP_ALT:
      case OP_KET:
      case OP_END:
      case OP_ACCEPT:
      case OP_ASSERT_ACCEPT:
        if (length < 0) length = branchlength;
        else if (length != branchlength) return -1;
        if (*cc != OP_ALT) return length;
        cc += 1 + LINK_SIZE;
        branchlength = 0;
        break;
      /* A true recursion implies not fixed length, but a subroutine call may
      be OK. If the subroutine is a forward reference, we can't deal with
      it until the end of the pattern, so return -3. */
      case OP_RECURSE:
        if (!atend) return -3;
        cs = ce = (pcre_uchar *)cd->start_code + GET(cc, 1); /* Start subpattern */
        do ce += GET(ce, 1); while (*ce == OP_ALT); /* End subpattern */
        if (cc > cs && cc < ce) return -1; /* Recursion */
        d = find_fixedlength(cs + IMM2_SIZE, utf, atend, cd);
        if (d < 0) return d;
        branchlength += d;
        cc += 1 + LINK_SIZE;
        break;
      /* Skip over assertive subpatterns */
      case OP_ASSERT:
      case OP_ASSERT_NOT:
      case OP_ASSERTBACK:
      case OP_ASSERTBACK_NOT:
        do cc += GET(cc, 1); while (*cc == OP_ALT);
        cc += PRIV(OP_lengths)[*cc];
        break;
      /* Skip over things that don't match chars */
      case OP_MARK:
      case OP_PRUNE_ARG:
      case OP_SKIP_ARG:
      case OP_THEN_ARG:
        cc += cc[1] + PRIV(OP_lengths)[*cc];
        break;
      case OP_CALLOUT:
      case OP_CIRC:
      case OP_CIRCM:
      case OP_CLOSE:
      case OP_COMMIT:
      case OP_CREF:
      case OP_DEF:
      case OP_DNCREF:
      case OP_DNRREF:
      case OP_DOLL:
      case OP_DOLLM:
      case OP_EOD:
      case OP_EODN:
      case OP_FAIL:
      case OP_NOT_WORD_BOUNDARY:
      case OP_PRUNE:
      case OP_REVERSE:
      case OP_RREF:
      case OP_SET_SOM:
      case OP_SKIP:
      case OP_SOD:
      case OP_SOM:
      case OP_THEN:
      case OP_WORD_BOUNDARY:
        cc += PRIV(OP_lengths)[*cc];
        break;
      /* Handle literal characters */
      case OP_CHAR:
      case OP_CHARI:
      case OP_NOT:
      case OP_NOTI:
        branchlength++;
        cc += 2;
#ifdef SUPPORT_UTF
        if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
        break;
      /* Handle exact repetitions. The count is already in characters, but we
      need to skip over a multibyte character in UTF8 mode. */
      case OP_EXACT:
      case OP_EXACTI:
      case OP_NOTEXACT:
      case OP_NOTEXACTI:
        branchlength += (int)GET2(cc,1);
        cc += 2 + IMM2_SIZE;
#ifdef SUPPORT_UTF
        if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
        break;
      case OP_TYPEEXACT:
        branchlength += GET2(cc,1);
        if (cc[1 + IMM2_SIZE] == OP_PROP || cc[1 + IMM2_SIZE] == OP_NOTPROP)
          cc += 2;
        cc += 1 + IMM2_SIZE + 1;
        break;
      /* Handle single-char matchers */
      case OP_PROP:
      case OP_NOTPROP:
        cc += 2;
        /* Fall through */
      case OP_HSPACE:
      case OP_VSPACE:
      case OP_NOT_HSPACE:
      case OP_NOT_VSPACE:
      case OP_NOT_DIGIT:
      case OP_DIGIT:
      case OP_NOT_WHITESPACE:
      case OP_WHITESPACE:
      case OP_NOT_WORDCHAR:
      case OP_WORDCHAR:
      case OP_ANY:
      case OP_ALLANY:
        branchlength++;
        cc++;
        break;
      /* The single-byte matcher isn't allowed. This only happens in UTF-8 mode;
      otherwise \C is coded as OP_ALLANY. */
      case OP_ANYBYTE:
        return -2;
      /* Check a class for variable quantification */
      case OP_CLASS:
      case OP_NCLASS:
#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
      case OP_XCLASS:
        /* The original code caused an unsigned overflow in 64 bit systems,
        so now we use a conditional statement. */
        if (op == OP_XCLASS)
          cc += GET(cc, 1);
        else
          cc += PRIV(OP_lengths)[OP_CLASS];
#else
        cc += PRIV(OP_lengths)[OP_CLASS];
#endif
        switch (*cc)
        {
          case OP_CRSTAR:
          case OP_CRMINSTAR:
          case OP_CRPLUS:
          case OP_CRMINPLUS:
          case OP_CRQUERY:
          case OP_CRMINQUERY:
          case OP_CRPOSSTAR:
          case OP_CRPOSPLUS:
          case OP_CRPOSQUERY:
            return -1;
          case OP_CRRANGE:
          case OP_CRMINRANGE:
          case OP_CRPOSRANGE:
            if (GET2(cc,1) != GET2(cc,1+IMM2_SIZE)) return -1;
            branchlength += (int)GET2(cc,1);
            cc += 1 + 2 * IMM2_SIZE;
            break;
          default:
            branchlength++;
        }
        break;
      /* Anything else is variable length */
      case OP_ANYNL:
      case OP_BRAMINZERO:
      case OP_BRAPOS:
      case OP_BRAPOSZERO:
      case OP_BRAZERO:
      case OP_CBRAPOS:
      case OP_EXTUNI:
      case OP_KETRMAX:
      case OP_KETRMIN:
      case OP_KETRPOS:
      case OP_MINPLUS:
      case OP_MINPLUSI:
      case OP_MINQUERY:
      case OP_MINQUERYI:
      case OP_MINSTAR:
      case OP_MINSTARI:
      case OP_MINUPTO:
      case OP_MINUPTOI:
      case OP_NOTMINPLUS:
      case OP_NOTMINPLUSI:
      case OP_NOTMINQUERY:
      case OP_NOTMINQUERYI:
      case OP_NOTMINSTAR:
      case OP_NOTMINSTARI:
      case OP_NOTMINUPTO:
      case OP_NOTMINUPTOI:
      case OP_NOTPLUS:
      case OP_NOTPLUSI:
      case OP_NOTPOSPLUS:
      case OP_NOTPOSPLUSI:
      case OP_NOTPOSQUERY:
      case OP_NOTPOSQUERYI:
      case OP_NOTPOSSTAR:
      case OP_NOTPOSSTARI:
      case OP_NOTPOSUPTO:
      case OP_NOTPOSUPTOI:
      case OP_NOTQUERY:
      case OP_NOTQUERYI:
      case OP_NOTSTAR:
      case OP_NOTSTARI:
      case OP_NOTUPTO:
      case OP_NOTUPTOI:
      case OP_PLUS:
      case OP_PLUSI:
      case OP_POSPLUS:
      case OP_POSPLUSI:
      case OP_POSQUERY:
      case OP_POSQUERYI:
      case OP_POSSTAR:
      case OP_POSSTARI:
      case OP_POSUPTO:
      case OP_POSUPTOI:
      case OP_QUERY:
      case OP_QUERYI:
      case OP_REF:
      case OP_REFI:
      case OP_DNREF:
      case OP_DNREFI:
      case OP_SBRA:
      case OP_SBRAPOS:
      case OP_SCBRA:
      case OP_SCBRAPOS:
      case OP_SCOND:
      case OP_SKIPZERO:
      case OP_STAR:
      case OP_STARI:
      case OP_TYPEMINPLUS:
      case OP_TYPEMINQUERY:
      case OP_TYPEMINSTAR:
      case OP_TYPEMINUPTO:
      case OP_TYPEPLUS:
      case OP_TYPEPOSPLUS:
      case OP_TYPEPOSQUERY:
      case OP_TYPEPOSSTAR:
      case OP_TYPEPOSUPTO:
      case OP_TYPEQUERY:
      case OP_TYPESTAR:
      case OP_TYPEUPTO:
      case OP_UPTO:
      case OP_UPTOI:
        return -1;
      /* Catch unrecognized opcodes so that when new ones are added they
      are not forgotten, as has happened in the past. */
      default:
        return -4;
    }
  }
  /* Control never gets here */
}
Regex engines are designed to work from left to right.
For lookaheads, the engine matches the entire text at the right of current position. However, for lookbehinds, the regex engine determines the length of string to step back and then checks for the match (again left to right).
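So, if you provide some infinite quantifiers like * or +, lookbehind won't work because the engine does not know how many steps to go backward.
I'll give an example of how lookbehind works (the example is pretty silly though, and it is conceptual only: \w{5,7} is itself variable-width, so PCRE would reject this exact pattern, while engines that accept bounded repetition in lookbehind, such as Java's, would take it).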
Suppose you want to match the last name Panta, only if the first name is 5-7 characters long.
Let's take the string:
Full name is Subigya Panta.
Consider the regex:
(?<=\b\w{5,7}\b)\sPanta
How the engine works
The engine acknowledges the existence of a positive lookbehind and so it first searches for the word Panta (with a whitespace character before it). It is a match.
Now, the engine looks to match the regex inside the lookbehind. It steps backward 7 characters (as the quantifier is greedy). The word boundary matches the position between space and S. Then it matches all the 7 characters, and then the next word boundary matches the position between a and the space.
The regex inside the lookbehind is a match and thus the whole regex returns true because the matched string contains Panta. (Note that lookaround assertions are zero-width, and do not consume any characters.)
The pcrepattern man page documents the restriction that lookbehind assertions must either be fixed-width or consist of several fixed-width patterns separated by |'s, and then explains that this is because:
The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the
current position, the assertion fails.
I'm not sure why they do it this way, but my guess is that they spent a lot of time writing a good backtracking RE-matching engine that runs forward, and they didn't want to duplicate all that effort to write another that runs backwards. The obvious approach would be to run over the string backwards -- that's easy -- while matching a "reverse" version of your lookbehind assertion. Reversing a "real" (DFA-matchable) RE is possible -- the reverse of a regular language is a regular language -- but PCRE's "extended" REs are IIRC Turing complete, and it may not even be possible to flip one around to run backwards efficiently in general. And even if it were, probably no one has actually cared enough to bother. After all, lookbehind assertions are a pretty minor feature in the grand scheme of things.
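The behaviour described in the man page excerpt quoted above can be mimicked with a toy function. This is only a sketch of the idea in Python, not PCRE's actual code: lookbehind_matches and the (pattern, length) pairs are hypothetical, and the caller is assumed to supply the correct fixed length for each alternative.

```python
import re


def lookbehind_matches(alternatives, text, pos):
    """Emulate a fixed-length lookbehind at position pos.

    alternatives is a list of (pattern, fixed_length) pairs, one per
    top-level alternative of the lookbehind.
    """
    for pattern, length in alternatives:
        start = pos - length
        if start < 0:
            # Insufficient characters before the current position.
            continue
        # Step back by the fixed length, then match forward up to pos.
        if re.compile(pattern).fullmatch(text, start, pos):
            return True
    return False


# Emulates (?<=a |big )cat: two alternatives of length 2 and 4.
ALTERNATIVES = [(r'a ', 2), (r'big ', 4)]
TEXT = 'a cat and big cat'

print(lookbehind_matches(ALTERNATIVES, TEXT, TEXT.index('cat')))   # True
print(lookbehind_matches(ALTERNATIVES, TEXT, TEXT.rindex('cat')))  # True
print(lookbehind_matches(ALTERNATIVES, TEXT, TEXT.index('and')))   # False
```

With an unbounded quantifier such as a+ there is no single length to step back by, which is why this scheme cannot handle it.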