Non-Greedy Regular Expression Matching in Flex

Non-Greedy Regular Expression Matching in Flex - regex

I have just started with Flex and can't seem to figure out how to match the following Expression :
"Dog".*"Cat"
------------------
Input :
Dog Ca Cat Cc Cat
------------------
Output:
Dog Ca Cat Cc Cat
But I want a non-greedy matching, with the following output :
Output:
Dog Ca Cat
How can this be acheived on Flex ?
EDIT
Tried the following :
%%
Dog.*Cat/.*Cat printf("Matched : ||%s||", yytext);
dog.*cat printf("Matched : ||%s||", yytext);
dOg[^c]*cAt printf("Matched : ||%s||", yytext);
DOG.*?CAT printf("Matched : ||%s||", yytext);
%%
Input :
Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT
Output :
Matched : ||Dog Ca Cat Cc Cat||
Matched : ||dog Ca cat Cc cat||
Matched : ||dOg Ca cAt|| Cc cAt
Matched : ||DOG CA CAT CC CAT||
Also receiving a warning :
lex4.l:2: warning, dangerous trailing context
Flex Version :
flex 2.5.35 Apple(flex-31)

This is quite a common issue with using the lex/flex tools that stumps beginners (and sometime non-beginners). There are two solutions to the problem that require two different advanced features of the tools. A phrase like dog ... cat is much the same problem as matching comments in various programming languages, such as the C comment form /* ... */ or even 'comment' ... 'tnemmoc'. These have exactly the same characteristics as your example. Consider the following C code:
/* This is a comment */ "This is a String */"
A greedy lexical match of that would match the wrong comment terminator (and is a good test of a student lexer BTW!).
There are suggested solutions on several university compiler courses. The one that explains it well is here (at Manchester). Which cites a couple of good books which also cover the problems:
J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.)
M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator
The two techniques described are to use Start Conditions to explicity specify the state machine, or manual input to read characters directly.
For your cat ... dog problem they can be programmed in the following ways:
Start Conditions
In this solution we need several states. The keyword dog causes causes it to enter the DOG state which continues until a letter c is encountered. This then enters the LETTERC state which must be followed by a letter a, if not the DOG state continues; a letter a causes the CAT state to be entered which must be followed by a letter t which causes the entire phrase to be matched and returns to the INITIAL state. The yymore causes the entire dog ... cat text to be retained for use.
%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [ \t\r\n]+
%%
<INITIAL>{d}{o}{g} {
BEGIN(DOG);
printf("DOG\n");
yymore();
}
<DOG>[^cC]*{c} {
printf("C: %s\n",yytext);
yymore();
BEGIN(LETTERC);
}
<LETTERC>{a} {
printf("A: %s\n",yytext);
yymore();
BEGIN(CAT);
}
<LETTERC>[^aA] {
BEGIN(DOG);
yymore();
}
<CAT>{t} {
printf("CAT: %s\n",yytext);
BEGIN(INITIAL);
}
<CAT>[^tT] {
BEGIN(DOG);
yymore();
}
<INITIAL>{ws} /* skip */ ;
Manual Input
The Manual input method just matches the start phrase dog and the enters C code which swallows up input characters until the desired cat sequence is encountered. (I did not bother with both upper and lower case letters). The problem with this solution is that it is hard to retain the input text value in yytext for later use in the parser. It discards it, which would be OK if the construct is a comment, but no so useful otherwise.
d [dD]
o [oO]
g [gG]
ws [ \t\r\n]+
%%
{d}{o}{g} {
register int c;
for ( ; ; )
{
/* Not dealt with upper case .. left as an exercise */
while ( (c = input()) != 'c' &&
c != EOF )
; /* eat up text of dog */
if ( c == 'c' )
{
if ( ( c = input()) == 'a' )
if ( (c = input()) == 't' )
break; /* found the end */
}
if ( c == EOF )
{
REJECT;
break;
}
}
/* because we have used input() yytext always contains "dog" */
printf("cat: %s\n", yytext);
}
{ws} /* skip */ ;
(Both these solutions have been tested)

A good question. Here is a pure regex solution, without using the non-greedy .*? syntax:
Dog([^C]|C+(aC+)*([^Ca]|a[^Ct]))*C+(aC+)*at

Here's a minimal C++ flex lexer for this problem. The key for nongreedy matching is start conditions as mentioned in the flex manual and elsewhere.
A start condition is just another state for the lexer. When nongreedy matching is needed there's some pattern that needs to terminate the matching on its first occurrence
In general regardless of state if you're looking for a target string or pattern you just need to make sure there are no other more general patterns that could match a longer stretch of input containing the target pattern
Start conditions help when the target pattern is conditional and needs to be enabled after some earlier match. You turn on the start condition to enable matching the target pattern and turn it off by resetting the state to 0 or INITIAL - or switching to another state for even more conditional matching
States are switched with BEGIN - there's also a state stack for use through yy_push_state and yy_pop_state
There are many examples of start conditions in the flex manual
Here are the flex rules that show nongreedy matching with flex start conditions - the lexer matches the first occurrence of dog on a line till the first occurrence of cat - matching is case insensitive
The complete file is posted at the end - for people unfamiliar with flex please note many lines begin with a space - this is not accidental and required by flex
%%
/* flex rules section */
string match;
dog {
// found a dog, change state to HAVE_DOG to start looking for a cat
BEGIN(HAVE_DOG);
// save the found dog
match = yytext;
}
/* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;
<HAVE_DOG>cat {
// save the found cat
match += yytext;
// output the matched dog and cat
cout << match << "\n";
// ignore rest of line
BEGIN(SKIP_LINE);
}
/* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);
/* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
.*
\n BEGIN(0);
}
/* nothing to do yet */
.|\n
Here's some test input
$ cat dogcat.in.txt
Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT
cat dog dog cat cat
dog kitten cat dog cat
dig cat dog can dog cut
dig dug dog cow cat cat
doc dogcat catdog
dog dog dog
cat cat cat
Build with
flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp
Run with
$ ./dogcat < dogcat.in.txt
Dog Ca Cat
dog Ca cat
dOg Ca cAt
DOG CA CAT
dog dog cat
dog kitten cat
dog cow cat
dogcat
The complete flex file
/* dogcat.flex.l */
/*
Build with:
flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp
*/
/*
A minimal C++ flex lexer that shows nongreedy matching with flex
start conditions
matches the first occurrence of dog on a line till the first
occurrence of cat
matching is case insensitive
*/
/* C++ lexer using yyFlexLexer in FlexLexer.h */
%option c++
/* case-insensitive patterns */
%option case-insensitive
/* generate main function for executable */
%option main
/* all input must be matched, no echo by default */
%option nodefault
/* debug output with lexer.set_debug(1) */
%option debug
/* start condition means dog was matched */
%x HAVE_DOG
/* start condition means to ignore remaining line */
%x SKIP_LINE
%{
#include <string>
#include <iostream>
// C++ flex lexer class
// needed because header itself has no guard
#ifndef yyFlexLexerOnce
# include <FlexLexer.h>
#endif
using namespace std;
namespace {
// the C++ lexer class from flex
yyFlexLexer lexer;
// main generated by flex still calls free yylex function even for C++ lexer
int yylex() {
return lexer.yylex();
}
}
%}
%%
/* flex rules section */
string match;
dog {
// found a dog, change state to HAVE_DOG to start looking for a cat
BEGIN(HAVE_DOG);
// save the found dog
match = yytext;
}
/* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;
<HAVE_DOG>cat {
// save the found cat
match += yytext;
// output the matched dog and cat
cout << match << "\n";
// ignore rest of line
BEGIN(SKIP_LINE);
}
/* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);
/* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
.*
\n BEGIN(0);
}
/* nothing to do yet */
.|\n

Related

Alphabetic order regex using backreferences

I recently came across a puzzle to find a regular expression that matches:
5-character-long strings comprised of lowercase English letters in ascending ASCII order
Valid examples include:
aaaaa
abcde
xxyyz
ghost
chips
demos
Invalid examples include:
abCde
xxyyzz
hgost
chps
My current solution is kludgy. I use the regex:
(?=^[a-z]{5}$)^(a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*)$
which uses a non-consuming capture group to assert a string length of 5, and then verifies that the string comprises of lowercase English letters in order (see Rubular).
Instead, I'd like to use back references inside character classes. Something like:
^([a-z])([\1-z])([\2-z])([\3-z])([\4-z])$
The logic for the solution (see Rubular) in my head is to capture the first character [a-z], use it as a backrefence in the second character class and so on. However, \1, \2 ... within character classes seem to refer to ASCII values of 1, 2... effectively matching any four- or five-character string.
I have 2 questions:
Can I use back references in my character classes to check for ascending order strings?
Is there any less-hacky solution to this puzzle?

I'm posting this answer more as a comment than an answer since it has better formatting than comments.
Related to your questions:
Can I use back references in my character classes to check for ascending order strings?
No, you can't. If you take a look a backref regular-expressions section, you will find below documentation:
Parentheses and Backreferences Cannot Be Used Inside Character Classes
Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, (, and ).
Backreferences, too, cannot be used inside a character class. The \1 in a regex like (a)[\1b] is either an error or a needlessly escaped literal 1. In JavaScript it's an octal escape.
Regarding your 2nd question:
Is there any less-hacky solution to this puzzle?
Imho, your regex is perfectly well, you could shorten it very little at the beginning like this:
(?=^.{5}$)^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$
^--- Here
Regex demo

If you are willing to use Perl (!), this will work:
/^([a-z])((??{"[$1-z]"}))((??{"[$2-z]"}))((??{"[$3-z]"}))(??{"[$4-z]"})$/

Since someone has broken the ice by using Perl, this is a
Perl solution I guess ..
Note that this is a basic non-regex solution that just happens to be
stuffed into code constructs inside a Perl regex.
The interesting thing is that if a day comes when you need the synergy
of regex/code this is a good choice.
It is possible then that instead of a simple [a-z] character, you may
use a very complex pattern in it's place and using a check vs. last.
That is power !!
The regex ^(?:([a-z])(?(?{ $last gt $1 })(?!)|(?{ $last = $1 }))){5}$
Perl code
use strict;
use warnings;
$/ = "";
my #DAry = split /\s+/, <DATA>;
my $last;
for (#DAry)
{
$last = '';
if (
/
^ # BOS
(?: # Cluster begin
( [a-z] ) # (1), Single a-z letter
# Code conditional
(?(?{
$last gt $1 # last > current ?
})
(?!) # Fail
| # else,
(?{ $last = $1 }) # Assign last = current
)
){5} # Cluster end, do 5 times
$ # EOS
/x )
{
print "good $_\n";
}
else {
print "bad $_\n";
}
}
__DATA__
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
Output
good aaaaa
good abcde
good xxyyz
good ghost
good chips
good demos
bad abCde
bad xxyyzz
bad hgost
bad chps

Ah, well, it's a finite set, so you can always enumerate it with alternation! This emits a "brute force" kind of regex in a little perl REPL:
#include <stdio.h>
int main(void) {
printf("while (<>) { if (/^(?:");
for (int a = 'a'; a <= 'z'; ++a)
for (int b = a; b <= 'z'; ++b)
for (int c = b; c <= 'z'; ++c) {
for (int d = c; d <= 'y'; ++d)
printf("%c%c%c%c[%c-z]|", a, b, c, d, d);
printf("%c%c%czz", a, b, c);
if (a != 'z' || b != 'z' || c != 'z') printf("|\n");
}
printf(")$/x) { print \"Match!\\n\" } else { print \"No match.\\n\" }}\n");
return 0;
}
And now:
$ gcc r.c
$ ./a.out > foo.pl
$ cat > data.txt
aaaaa
abcde
xxyyz
ghost
chips
demos
abCde
xxyyzz
hgost
chps
^D
$ perl foo.pl < data.txt
Match!
Match!
Match!
Match!
Match!
Match!
No match.
No match.
No match.
No match.
The regex is only 220Kb or so ;-)

Lex: match ignore space

I have a work to recognize hex number,
my problem is how to ignore space, but not allow any character before.
like this:
0x7f6e ---->match,and print"0x7f6e"
0X2146 ---->match,and print"0X21467"
acns0x8972 ----> not match
my work now:
hex \s*0[X|x][0-9a-fA-f]{1,4}(^.)*(\n)
{hex} { ECHO;}
.|\n {}
and it print:
0x7f6e
0X2146
how can i print it without space?
like this:
0x7f6e
0X2146

I got a working version which should do what you expect:
%{
#include <ctype.h>
#include <stdio.h>
%}
%%
^[ \t]*0[Xx][0-9a-fA-f]{1,4}(.*)$ {
/* skip spaces at begin of line */
const char *bol = yytext;
while (isspace((unsigned char)*bol)) ++bol;
/* echo rest of line */
puts(bol);
}
.|\n { }
%%
int main(int argc, char **argv) { return yylex(); }
int yywrap() { return 1; }
Notes:
\s seems to be unsupported (at least in my version 2.6.3 of flex). I replaced it by [ \t]. Btw. \s usually matches also carriage return, newline, formfeed what's not intended in my case.
(^.)* replaced by (.*). (I didn't understand the intention of the original one. Mistake?)
I added a ^ at begin of 1st pattern so that pattern is attached to begin of line.
I replaced \n at the end of hex line with $. The puts() function adds a newline to output. (Newlines are always matched by 2nd rule and thus skipped.)
I replaced ECHO; with some C code to (1st) remove spaces at begin of line and (2nd) output the rest of line to standard output channel.
Compiled and tested in cygwin on Windows 10 (64 bit):
$ flex --version
flex 2.6.3
$ flex -o test-hex.c test-hex.l ; gcc -o test-hex test-hex.c
$ echo "
0x7f6e
0X2146
acns0x8972
" | ./test-hex
0x7f6e
0X2146
$
Note: I used echo to feed your sample data via pipe into standard input channel of test-hex.

A regular expression to exclude a Comments?

I have a regular expression as follows:
(\/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*+\/)|(\/\/.*)
And my test string as follows:
<?
/* This is a comment */
cout << "Hello World"; // prints Hello World
/*
* C++ comments can also
*/
cout << "Hello World";
/* Comment out printing of Hello World:
cout << "Hello World"; // prints Hello World
*/
echo "//This line was not a Comment, but ... ";
echo "http://stackoverflow.com";
echo 'http://stackoverflow.com/you can not match this line';
array = ['//', 'no, you can not match this line!!']
/* This is * //a comment */
https://regex101.com/r/lx2f5F/1
It can matches the line 2, 4, 7~9, 13~17 correctly.
But it also matches single quotes('), double quotes(") and array in the last line,
how to Non-greedy Matching?
Any help would be gratefully appreciated.

I believe I have a new best pattern for you./\/\*[\s\S]*?\*\/|(['"])[\s\S]+?\1(*SKIP)(*FAIL)|\/{2}.*/This will accurately process the following block of text in just 683 steps:
<?
/* This is a comment */
cout << "Hello World"; // prints Hello World
/*
* C++ comments can also
*/
cout << "Hello World";
/* Comment out printing of Hello World:
cout << "Hello World"; // prints Hello World
*/
echo "//This line was not a Comment, but ... ";
echo "http://stackoverflow.com";
echo 'http://stackoverflow.com/you can not match this line';
array = ['//', 'no, you can not match this line!!']
/* This is * //a comment */
Pattern Explanation: (Demo *you can use the Substitution box at the bottom to replace the comment substrings with an empty string -- effectively removing all comments.)
/\/\*[\s\S]*?\*\/ Match \* then 0 or more characters then */
| OR
(['"])[\s\S]*?\1(*SKIP)(*FAIL) Don't match ' or " then 1 or more characters then the leading (captured) character
| OR
\/{2}.*/ Match // then zero or more non-newline characters
Using [\s\S] is like . except it allows newline characters, this is deliberately used in the first two alternatives. The third alternative intentionally uses . to stop when a newline character is found.
I have checked every sequence of alternatives, to ensure that the fastest alternatives come first and the pattern is optimized. My pattern correctly matches the OP's sample input. If anyone finds an issue with my pattern, please leave me a comment so that I can try to fix it.
Jan's pattern correctly matches all of the OP's desired substrings in 1006 steps using: ~([\'\"])(?<!\\).*?\1(*SKIP)(*FAIL)|(?|(?P<comment>(?s)\/\*.*?\*\/(?-s))|(?P<comment>\/\/.+))~gx
Sahil's pattern fails to completely match the final comment in your UPDATED sample input. This means either the question is wrong and should be closed as "unclear what you are asking", or Sahil's answer is wrong and it should not be awarded the green tick. When you updated your question, you should have requested that Sahil update his answer. When incorrect answers fail to satisfy the question, future SO readers are likely to become confused and SO becomes a less reliable resource.

With PCRE you can use the (*SKIP)(*FAIL) mechanism:
([\'\"])(?<!\\).*?\1(*SKIP)(*FAIL)
|
(?|
(?P<comment>(?s)/\*.*?\*/(?-s))
|
(?P<comment>//.+)
)
See a working demo on regex101.com.
Note: The branch reset (?|...) is not really needed here but was merely used to make clear the group called comment.

Flex RegEx to find string not starting with a pattern

I'm writing a lexer to scan a modified version of an INI file.
I need to recognize the declaration of variables, comments and strings (between double quotes) to be assigned to a variable. For example, this is correct:
# this is a comment
var1 = "string value"
I've successfully managed to recognize these tokens forcing the # at the begging of the comment regular expression and " at the end of the string regular expression, but I don't want to do this because later on, using Bison, the tokens I get are exactly # this is a comment and "string value". Instead I want this is a comment (without #) and string value (without ")
These are the regular expressions that I currently use:
[a-zA-Z][a-zA-Z0-9]* { return TOKEN_VAR_NAME; }
["][^\n\r]*["] { return TOKEN_STRING; }
[#][^\n\r]* { return TOKEN_COMMENT; }
Obviously there can be any number of white spaces, as well as tabs, inside the string, the comment and between the variable name and the =.
How could I achieve the result I want?
Maybe it will be easier if I show you a complete example of a correct input file and also the grammar rules I use with Flex and Bison.
Correct input file example:
[section1]
var1 = "string value"
var2 = "var1 = text"
# this is a comment
# var5 = "some text" this is also a valid comment
These are the regular expressions for the lexer:
"[" { return TOKEN::SECTION_START; }
"]" { return TOKEN::SECTION_END; }
"=" { return TOKEN::ASSIGNMENT; }
[#][^\n\r]* { return TOKEN::COMMENT; }
[a-zA-Z][a-zA-Z0-9]* { *m_yylval = yytext; return TOKEN::ID; }
["][^\n\r]*["] { *m_yylval = yytext; return TOKEN::STRING; }
And these are the syntax rules:
input : input line
| line
;
line : section
| value
| comment
;
section : SECTION_START ID SECTION_END { createNewSection($2); }
;
value : ID ASSIGNMENT STRING { addStringValue($1, $3); }
;
comment : COMMENT { addComment($1); }
;

To do that you have to treat " and # as different tokens (so they get scanned as individual tokens, different from the one you are scanning now) and use a %s or %x start condition to change the accepted regular patterns on reading those tokens with the scanner input.
This adds another drawback, that is, you will receive # as an individual token before the comment and " before and after the string contents, and you'll have to cope with that in your grammar. This will complicate your grammar and the scanner, so I have to discourage you to follow this approach.
There is a better solution, by writting a routine to unescape things and allow the scanner to be simpler by returning all the input string in yytext and simply
m_yylval = unescapeString(yytext); /* drop the " chars */
return STRING;
or
m_yylval = uncomment(yytext); /* drop the # at the beginning */
return COMMENT; /* return EOL if you are trying the exmample at the end */
in the yylex(); function.
Note
As comments are normally ignored, the best thing is to ignore using a rule like:
"#".* ; /* ignored */
in your flex file. This makes generated scanner not return and ignore the token just read.
Note 2
You probably don't have taken into account that your parser will allow you to introduce lines on the form:
var = "data"
in front of any
[section]
line, so you'll run into trouble trying to addStringvalue(...); when no section has been created. One possible solution is to modify your grammar to separate file in sections and force them to begin with a section line, like:
compilation: file comments ;
file: file section
| ; /* empty */
section: section_header section_body;
section_header: comments `[` ident `]` EOL
section_body: section_body comments assignment
| ; /* empty */
comments: comments COMMENT
| ; /* empty */
This has complicated by the fact that you want to process the comments. If you were to ignore them (with using ; in the flex scanner) The grammar would be:
file: empty_lines file section
| ; /* empty */
empty_lines: empty_lines EOL
| ; /* empty */
section: header body ;
header: '[' IDENT ']' EOL ;
body: body assignment
| ; /* empty */
assignment: IDENT '=' strings EOL
| EOL ; /* empty lines or lines with comments */
strings:
strings unit
| unit ;
unit: STRING
| IDENT
| NUMBER ;
This way the first thing allowed in your file is, apart of comments, that are ignored and blank space (EOLs are not considered blank space as we cannot ignore them, they terminate lines)

Regex Replacing : to ":" etc

I've got a bunch of strings like:
"Hello, here's a test colon:. Here's a test semi-colon;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom

The big advantage of using a regex is to deal with the tricky cases like &#38; Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need two digits and if you have 3 digits, the first one will be a 1. Printable ASCII is after all -~. For that reason, you could consider &#1?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace :
For each match // Using regex_iterator<>
Print the prefix of the match
Remove the first 2 and last character of the match (&#;)
lexical_cast the result to int, then truncate to char and append.
Print the suffix of the last match.

This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.
NUMS = '1234567890'
MAIN LINE = INPUT :F(END)
SWAP LINE ? '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
OUTPUT = LINE :(MAIN)
END

* Repaired SNOBOL4 Solution
* &#38; -> &
digit = '0123456789'
main line = input :f(end)
result =
swap line arb . l
+ '&#' span(digit) . n ';' rem . line :f(out)
result = result l char(n) :(swap)
out output = result line :(main)
end

I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.
Here's a Python implementation:
s = "Hello, here's a test colon:. Here's a test semi-colon;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)
Producing:
"Hello, here's a test colon:. Here's a test semi-colon;"
I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.

The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:
dd = "0123456789"
ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end

Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.
echo "Hello, here's a test colon:. Here's a test semi-colon; Further test &#65;. abc.~.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) ) { return sprintf("%c",$x); } else { return "&#".$x.";"; } } while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'
Pretty-printing that:
#!/usr/bin/perl -w
sub translate
{
my $x=$_[0];
if ( ($x >= 32) && ($x <= 126) )
{
return sprintf( "%c", $x );
}
else
{
return "&#" . $x . ";" ;
}
}
while (<>)
{
s/&#(1?\d\d);/&translate($1)/ge;
print;
}
Though perl being perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.

Here's another Perl's one-liner (see #mrree's answer):
a test file:
$ cat ent.txt
Hello,  here's a test colon:.
Here's a test semi-colon; ''
the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt
or using more specific regex:
$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt
both one-liners produce the same output:
Hello,  here's a test colon:.
Here's a test semi-colon; ''

boost::spirit parser generator framework allows easily to create a parser that transforms desirable NCRs.
// spirit_ncr2a.cpp
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>
int main() {
using namespace BOOST_SPIRIT_CLASSIC_NS;
std::string line;
while (std::getline(std::cin, line)) {
assert(parse(line.begin(), line.end(),
// match "&#(\d+);" where 32 <= $1 <= 126 or any char
*(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
| anychar_p[&putchar])).full);
putchar('\n');
}
}
compile:
$ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
run:
$ echo "Hello,  here's a test colon:." | spirit_ncr2a
output:
"Hello,  here's a test colon:."

I did think I was pretty good at regex but I have never seen lambdas been used in regex, please enlighten me!
I'm currently using python and would have solved it with this oneliner:
''.join([x.isdigit() and chr(int(x)) or x for x in re.split('&#(\d+);',THESTRING)])
Does that make any sense?

Here's a NCR scanner created using Flex:
/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
To make an executable:
$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl
Example:
$ echo "Hello,  here's a test colon:.
> Here's a test semi-colon; ''
> &#59; <-- may be recursive" \
> | ncr2a
It prints for non-recursive version:
Hello,  here's a test colon:.
Here's a test semi-colon; ''
; <-- may be recursive
And the recursive one produces:
Hello,  here's a test colon:.
Here's a test semi-colon; ''
; <-- may be recursive

This is one of those cases where the original problem statement apparently isn't very complete, it seems, but if you really want to only trigger on cases which produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in-range and others are not).
dd = "0123456789"
ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
It would not be particularly difficult to handle that case (e.g. ;#131;#58; produces ";#131;:" as well:
dd = "0123456789"
ccp = "#" (span(dd) $ n ";") $ enc
+ *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = replace(line,char(10),"#") :(rdl)
done
end

Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.
#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
int main()
{
boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
const int subs[] = {-1, 1}; // non-match & subexpr
boost::sregex_token_iterator end;
std::string line;
while (std::getline(std::cin, line)) {
boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);
for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
if (isncr) { // convert NCR e.g., ':' -> ':'
const int d = boost::lexical_cast<int>(*tok);
assert(32 <= d && d < 127);
std::cout << static_cast<char>(d);
}
else
std::cout << *tok; // output as is
}
std::cout << '\n';
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Non-Greedy Regular Expression Matching in Flex - regex

A good question. Here is a pure regex solution, without using the non-greedy .? syntax: Dog([^C]|C+(aC+)([^Ca]|a[^Ct]))C+(aC+)at

Related

Alphabetic order regex using backreferences

Lex: match ignore space

A regular expression to exclude a Comments?

Flex RegEx to find string not starting with a pattern

Regex Replacing : to ":" etc

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Non-Greedy Regular Expression Matching in Flex - regex

A good question. Here is a pure regex solution, without using the non-greedy .*? syntax: Dog([^C]|C+(aC+)*([^Ca]|a[^Ct]))*C+(aC+)*at

Related

Alphabetic order regex using backreferences

Lex: match ignore space

A regular expression to exclude a Comments?

Flex RegEx to find string not starting with a pattern

Regex Replacing : to ":" etc

Categories

Resources

A good question. Here is a pure regex solution, without using the non-greedy .? syntax: Dog([^C]|C+(aC+)([^Ca]|a[^Ct]))C+(aC+)at