Lex: match ignore space

I have a task to recognize hex numbers.
My problem is how to ignore leading spaces while not allowing any other character before the number.
Like this:
0x7f6e ----> match, and print "0x7f6e"
0X2146 ----> match, and print "0X2146"
acns0x8972 ----> no match
My attempt so far:
hex \s*0[X|x][0-9a-fA-f]{1,4}(^.)*(\n)
{hex} { ECHO;}
.|\n {}
and it prints the matches together with their leading spaces:
0x7f6e
0X2146
How can I print them without the leading spaces?
Like this:
0x7f6e
0X2146

I got a working version which should do what you expect:
%{
#include <ctype.h>
#include <stdio.h>
%}
%%
^[ \t]*0[Xx][0-9a-fA-F]{1,4}(.*)$ {
  /* skip spaces at begin of line */
  const char *bol = yytext;
  while (isspace((unsigned char)*bol)) ++bol;
  /* echo rest of line */
  puts(bol);
}
.|\n { }
%%
int main(int argc, char **argv) { return yylex(); }
int yywrap() { return 1; }
Notes:
\s seems to be unsupported (at least in my version of flex, 2.6.3), so I replaced it with [ \t]. By the way, \s usually also matches carriage return, newline and form feed, which is not intended in this case.
I replaced (^.)* with (.*). (I didn't understand the intention of the original. A mistake?)
I added a ^ at the beginning of the first pattern so that it is anchored to the beginning of a line.
I replaced the \n at the end of the hex pattern with $. The puts() function appends a newline to the output. (Newlines are always matched by the second rule and thus skipped.)
I replaced ECHO; with some C code that first skips the spaces at the beginning of the line and then writes the rest of the line to standard output.
Compiled and tested in cygwin on Windows 10 (64 bit):
$ flex --version
flex 2.6.3
$ flex -o test-hex.c test-hex.l ; gcc -o test-hex test-hex.c
$ echo "
0x7f6e
0X2146
acns0x8972
" | ./test-hex
0x7f6e
0X2146
$
Note: I used echo to feed your sample data via pipe into standard input channel of test-hex.
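For comparison, the same idea can be sketched outside flex. This is only an illustration in Python, not part of the flex solution: the name hex_lines is made up here, and the hex digit class is written as [0-9a-fA-F]. The pattern anchors at the beginning of a line, allows blanks, and captures everything from the 0x prefix onward.

```python
import re

# Optional leading blanks, then 0x/0X, 1-4 hex digits, then the rest of the line.
# Only the part after the blanks is captured, so nothing needs to be stripped later.
HEX_LINE = re.compile(r"^[ \t]*(0[Xx][0-9a-fA-F]{1,4}.*)$", re.MULTILINE)


def hex_lines(text):
    """Return the matching lines with their leading blanks removed."""
    return [m.group(1) for m in HEX_LINE.finditer(text)]


if __name__ == "__main__":
    sample = "\n   0x7f6e\n   0X2146\nacns0x8972\n"
    for line in hex_lines(sample):
        print(line)
```

As in the flex rule, a line where other characters precede the number is not matched, because the pattern only allows blanks between the line start and the 0x prefix.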

Related

Non-Greedy Regular Expression Matching in Flex

I have just started with Flex and can't seem to figure out how to match the following Expression :
"Dog".*"Cat"
------------------
Input :
Dog Ca Cat Cc Cat
------------------
Output:
Dog Ca Cat Cc Cat
But I want a non-greedy matching, with the following output :
Output:
Dog Ca Cat
How can this be achieved in Flex?
EDIT
Tried the following :
%%
Dog.*Cat/.*Cat printf("Matched : ||%s||", yytext);
dog.*cat printf("Matched : ||%s||", yytext);
dOg[^c]*cAt printf("Matched : ||%s||", yytext);
DOG.*?CAT printf("Matched : ||%s||", yytext);
%%
Input :
Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT
Output :
Matched : ||Dog Ca Cat Cc Cat||
Matched : ||dog Ca cat Cc cat||
Matched : ||dOg Ca cAt|| Cc cAt
Matched : ||DOG CA CAT CC CAT||
Also receiving a warning :
lex4.l:2: warning, dangerous trailing context
Flex Version :
flex 2.5.35 Apple(flex-31)

This is quite a common issue with the lex/flex tools that stumps beginners (and sometimes non-beginners). There are two solutions to the problem, and they require two different advanced features of the tools. A phrase like dog ... cat is much the same problem as matching comments in various programming languages, such as the C comment form /* ... */ or even 'comment' ... 'tnemmoc'. These have exactly the same characteristics as your example. Consider the following C code:
/* This is a comment */ "This is a String */"
A greedy lexical match of that would match the wrong comment terminator (and is a good test of a student lexer BTW!).
Several university compiler courses suggest solutions. One that explains it well is the Manchester course, which cites a couple of good books that also cover the problem:
J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.)
M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator
The two techniques described are to use start conditions to explicitly specify the state machine, or manual input to read characters directly.
For your dog ... cat problem they can be programmed in the following ways:
Start Conditions
In this solution we need several states. The keyword dog causes the lexer to enter the DOG state, which continues until a letter c is encountered. This enters the LETTERC state, which must be followed by a letter a; if not, the DOG state continues. A letter a causes the CAT state to be entered, which must be followed by a letter t; this causes the entire phrase to be matched and returns to the INITIAL state. The yymore() call causes the entire dog ... cat text to be retained for use.
%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [ \t\r\n]+
%%
<INITIAL>{d}{o}{g} {
BEGIN(DOG);
printf("DOG\n");
yymore();
}
<DOG>[^cC]*{c} {
printf("C: %s\n",yytext);
yymore();
BEGIN(LETTERC);
}
<LETTERC>{a} {
printf("A: %s\n",yytext);
yymore();
BEGIN(CAT);
}
<LETTERC>[^aA] {
BEGIN(DOG);
yymore();
}
<CAT>{t} {
printf("CAT: %s\n",yytext);
BEGIN(INITIAL);
}
<CAT>[^tT] {
BEGIN(DOG);
yymore();
}
<INITIAL>{ws} /* skip */ ;
Manual Input
The manual input method just matches the start phrase dog and then enters C code which swallows input characters until the desired cat sequence is encountered. (I did not bother with both upper and lower case letters.) The problem with this solution is that it is hard to retain the input text in yytext for later use in the parser. It discards it, which would be fine if the construct were a comment, but is not so useful otherwise.
d [dD]
o [oO]
g [gG]
ws [ \t\r\n]+
%%
{d}{o}{g} {
register int c;
for ( ; ; )
{
/* Not dealt with upper case .. left as an exercise */
while ( (c = input()) != 'c' &&
c != EOF )
; /* eat up text of dog */
if ( c == 'c' )
{
if ( ( c = input()) == 'a' )
if ( (c = input()) == 't' )
break; /* found the end */
}
if ( c == EOF )
{
REJECT;
break;
}
}
/* because we have used input() yytext always contains "dog" */
printf("cat: %s\n", yytext);
}
{ws} /* skip */ ;
(Both these solutions have been tested)

A good question. Here is a pure regex solution, without using the non-greedy .*? syntax:
Dog([^C]|C+(aC+)*([^Ca]|a[^Ct]))*C+(aC+)*at
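A quick way to get a feel for this pattern is to run it next to a lazy pattern in an engine that supports one. The sketch below uses Python's re module purely as a checking tool (flex itself has no .*?); the helper name first_span and the sample strings are made up for illustration.

```python
import re

# The pure-regex pattern from above: the part between Dog and the final Cat
# is built so that it can never contain the substring "Cat".
NO_LAZY = re.compile(r"Dog([^C]|C+(aC+)*([^Ca]|a[^Ct]))*C+(aC+)*at")
# Python's lazy quantifier, used here only as a reference to compare against.
LAZY = re.compile(r"Dog.*?Cat")


def first_span(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in ["Dog Ca Cat Cc Cat", "DogCaCat Cat", "DogCCat", "Dog Ca Cc"]:
        print(repr(sample), "->", first_span(NO_LAZY, sample))
```

On these samples both patterns stop at the first Cat, which is the behaviour the question asks for.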

Here's a minimal C++ flex lexer for this problem. The key for nongreedy matching is start conditions, as mentioned in the flex manual and elsewhere.
A start condition is just another state for the lexer. When nongreedy matching is needed, there is some pattern that needs to terminate the matching on its first occurrence.
In general, regardless of state, if you're looking for a target string or pattern you just need to make sure there are no other, more general patterns that could match a longer stretch of input containing the target pattern.
Start conditions help when the target pattern is conditional and needs to be enabled after some earlier match. You turn on the start condition to enable matching the target pattern, and turn it off by resetting the state to 0 or INITIAL, or by switching to another state for even more conditional matching.
States are switched with BEGIN. There is also a state stack, used through yy_push_state and yy_pop_state.
There are many examples of start conditions in the flex manual.
Here are the flex rules that show nongreedy matching with flex start conditions. The lexer matches from the first occurrence of dog on a line to the first occurrence of cat. Matching is case insensitive.
The complete file is posted at the end. For people unfamiliar with flex, please note that many lines begin with a space; this is not accidental and is required by flex.
%%
 /* flex rules section */
 string match;
dog {
  // found a dog, change state to HAVE_DOG to start looking for a cat
  BEGIN(HAVE_DOG);
  // save the found dog
  match = yytext;
}
 /* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;
<HAVE_DOG>cat {
  // save the found cat
  match += yytext;
  // output the matched dog and cat
  cout << match << "\n";
  // ignore rest of line
  BEGIN(SKIP_LINE);
}
 /* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);
 /* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
  .*
  \n BEGIN(0);
}
 /* nothing to do yet */
.|\n
Here's some test input
$ cat dogcat.in.txt
Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT
cat dog dog cat cat
dog kitten cat dog cat
dig cat dog can dog cut
dig dug dog cow cat cat
doc dogcat catdog
dog dog dog
cat cat cat
Build with
flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp
Run with
$ ./dogcat < dogcat.in.txt
Dog Ca Cat
dog Ca cat
dOg Ca cAt
DOG CA CAT
dog dog cat
dog kitten cat
dog cow cat
dogcat
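The expected output above can be cross-checked without flex. The following is a rough sketch in Python, not a translation of the lexer: it uses two plain str.find calls instead of start conditions, and the function name first_dog_to_first_cat is made up here. It assumes ASCII input, so that lowercasing does not change string length.

```python
def first_dog_to_first_cat(line):
    """Return the text from the first 'dog' to the first following 'cat', or None."""
    lowered = line.lower()  # case-insensitive, like %option case-insensitive
    start = lowered.find("dog")
    if start < 0:
        return None  # no dog on this line
    end = lowered.find("cat", start + len("dog"))
    if end < 0:
        return None  # dog found, but no cat after it
    return line[start:end + len("cat")]


if __name__ == "__main__":
    for sample in ["Dog Ca Cat Cc Cat", "cat dog dog cat cat", "dog dog dog"]:
        print(repr(sample), "->", first_dog_to_first_cat(sample))
```

The two find calls play the role of the INITIAL and HAVE_DOG states: the first looks for dog, the second only starts looking for cat once a dog has been seen.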
The complete flex file
/* dogcat.flex.l */
/*
Build with:
flex -o dogcat.flex.cpp dogcat.flex.l && g++ -o dogcat dogcat.flex.cpp
*/
/*
A minimal C++ flex lexer that shows nongreedy matching with flex
start conditions
matches the first occurrence of dog on a line till the first
occurrence of cat
matching is case insensitive
*/
/* C++ lexer using yyFlexLexer in FlexLexer.h */
%option c++
/* case-insensitive patterns */
%option case-insensitive
/* generate main function for executable */
%option main
/* all input must be matched, no echo by default */
%option nodefault
/* debug output with lexer.set_debug(1) */
%option debug
/* start condition means dog was matched */
%x HAVE_DOG
/* start condition means to ignore remaining line */
%x SKIP_LINE
%{
#include <string>
#include <iostream>
// C++ flex lexer class
// needed because header itself has no guard
#ifndef yyFlexLexerOnce
# include <FlexLexer.h>
#endif
using namespace std;
namespace {
// the C++ lexer class from flex
yyFlexLexer lexer;
// main generated by flex still calls free yylex function even for C++ lexer
int yylex() {
return lexer.yylex();
}
}
%}
%%
 /* flex rules section */
 string match;
dog {
  // found a dog, change state to HAVE_DOG to start looking for a cat
  BEGIN(HAVE_DOG);
  // save the found dog
  match = yytext;
}
 /* save and keep going till cat is found */
<HAVE_DOG>. match += yytext;
<HAVE_DOG>cat {
  // save the found cat
  match += yytext;
  // output the matched dog and cat
  cout << match << "\n";
  // ignore rest of line
  BEGIN(SKIP_LINE);
}
 /* no cat on this line, reset state */
<HAVE_DOG>\n BEGIN(0);
 /* rules to ignore rest of the line then reset state */
<SKIP_LINE>{
  .*
  \n BEGIN(0);
}
 /* nothing to do yet */
.|\n

Finding newlines between $$ $$ or $ $

I want to replace all "\r\n" with two backslashes + newline "\\ \r\n", except the "\r\n" inside "$$ $$", "$ $" or "\[ \]". (This is LaTeX syntax.)
The following text
1.$$ Test
2.
3.$$ $
4. $
5. Test $
6.
7. $
8.
9. Test
should be
1.$$ Test
2.
3.$$ $
4. $ \\
5. Test $
6.
7. $ \\
8. \\
9. Test
One of my attempts:
First I replaced the new lines between $$ $$, $ $ or \[ \] with --newline--.
Then I replaced all remaining new lines with double new lines (in LaTeX, \\ equals a double new line).
Then I replaced --newline-- with a new line.
private static String replaceNewLines(String original) {
String text = original;
text = replaceBetween(text, "\\[", "\\]");
text = replaceBetween(text, "$$", "$$");
text = replaceBetween(text, "$", "$");
text = text.replace("\r\n", "\r\n\r\n").replace("--newline--", "\r\n");
return text;
}
private static String replaceBetween(String text, String start, String end) {
int i = text.indexOf(start);
while (i >= 0) {
int j = text.indexOf(end, i + 1);
String before = text.substring(0, i);
String after = text.substring(j);
text = before + text.substring(i, j).replace("\r\n", "--newline--")
+ after;
i = text.indexOf(start, j + 1);
}
return text;
}

I would suggest going through the file in a single pass, with a flag that records whether you are in math mode or not. Depending on the flag, you replace the newline or not.
In the more general case, when nesting is possible, I would suggest using a stack.
Deque<String> queue = new ArrayDeque<>(Collections.emptyList());
In this case you can go through the file in one run adding appropriate strings to the stack when entering into math mode and removing them when leaving it. Again depending on the mode (i.e. on the string which is on the top of stack) replace newline or not.
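The single-pass idea with a flag can be sketched as follows. This is a minimal illustration in Python rather than Java, under several assumptions: no nesting, the delimiters are only $$, $ and \[ \], and the replacement is a space plus two backslashes before the \r\n, as in the expected output of the question. The names add_line_breaks, TOKEN and CLOSER are made up for this sketch.

```python
import re

# Every token that matters: math delimiters and the Windows line ending.
# "$$" is listed before "$" so that a double dollar is not read as two singles.
TOKEN = re.compile(r"\\\[|\\\]|\$\$|\$|\r\n")
# Opening delimiter -> the delimiter that closes it.
CLOSER = {"\\[": "\\]", "$$": "$$", "$": "$"}


def add_line_breaks(text):
    """Append ' \\\\' before each CRLF that is outside math mode (no nesting)."""
    mode = None  # opening delimiter of the current math block, or None

    def handle(match):
        nonlocal mode
        tok = match.group(0)
        if tok == "\r\n":
            # Inside math mode the newline is kept as it is.
            return tok if mode else " \\\\\r\n"
        if mode is None:
            if tok in CLOSER:
                mode = tok  # entering math mode
        elif tok == CLOSER[mode]:
            mode = None  # leaving math mode
        return tok

    return TOKEN.sub(handle, text)


if __name__ == "__main__":
    sample = "1.$$ Test\r\n2.\r\n3.$$ $\r\n4. $\r\n5. Test $\r\n6.\r\n7. $\r\n8.\r\n9. Test"
    print(add_line_breaks(sample))
```

For the nested case, the single mode variable would become a stack of opening delimiters, with the newline decision based on whether the stack is empty.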
You may ask when nesting could appear in LaTeX. Look at this rough example of a definition of the Dirichlet function:
\[
\mathbb{1}(x)
=
\begin{cases}
1&\text{when $x
$ is rational number}\\
0&\text{when $
x$ is not rational number}
\end{cases}
\]
Here $ $ is inside \[ \]. Additionally, you have to take \text{ and } into account as something that causes nesting. Things get complicated.
Finally, I think that you should also take into account the pair \( and \), which is equivalent to $ $.
Apart from that in LaTeX there are also environments so if you have a real LaTeX source then you have to deal with \begin{equation} etc.
BTW \\ in LaTeX just breaks the line. Double \r\n starts a new paragraph. This is not the same.
You mentioned regex in tags. Achieving the same in regex is a longer story and it depends a bit on specific regex flavour. You can read about it on the page https://www.regular-expressions.info/balancing.html

How does flex match the beginning of line anchor?

I've always wondered how the beginning of input anchor (^) was converted to a FSA in flex. I know that the end of line anchor ($) is matched by the expression r/\n where r is the expression to match. How's the beginning of input anchor matched? The only solution I see is to use start conditions. How can it be implemented in a program?

End of line marker $ is different from \n in that it matches EOF as well, even if the end-of-line marker \n or \r\n is not found at the end of the file.
I did not look at flex's implementation, but I would implement both ^ and $ using boolean flags. The ^ flag would be initially set, then reset to false after the first character in a line, then set back to true after the next end-of-line marker, and so on.
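The flag idea described above can be illustrated with a tiny hand-written scanner. This is a sketch in Python of the flag technique only, not of how flex implements ^; the function name number_lines is made up, and the example simply prefixes each line with its number.

```python
def number_lines(text):
    """Prefix every line with its number, using a beginning-of-line flag."""
    out = []
    at_bol = True  # set initially: the start of input counts as start of line
    line_no = 1
    for ch in text:
        if at_bol:
            # This is the position where a ^-anchored rule would be allowed to fire.
            out.append(f"{line_no}\t")
            line_no += 1
            at_bol = False  # reset after the first character of the line
        out.append(ch)
        if ch == "\n":
            at_bol = True  # set again after the end-of-line marker
    return "".join(out)


if __name__ == "__main__":
    print(number_lines("first\nsecond\n"), end="")
```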

If your scanner uses the ^ anchor, then every start condition needs two initial-state entries:
Beginning-of-line, and
otherwise.
Flex does this, and peeks behind the input pointer to determine which entry to consult.

The beginning of line anchor is matched by the pattern:
beginningOfLine ^.
(a caret followed by a point)
Example (numbering lines of a text):
%{
int ln = 1;
%}
beginningOfLine ^.
newline \n
%%
{beginningOfLine} { if (ln == 1) {
                      printf("%d \t", ln);
                      printf("%s", yytext); /* never pass yytext as the format string */
                      ln++;
                    } else {
                      printf("%s", yytext);
                    }
                  }
{newline} { printf("\n");
            printf("%d \t", ln);
            ln++; }
%%

How can I extract a substring after a match position?

I have a requirement to grep a string or pattern (with around 200 characters before and after it) from a file consisting of one extremely long line. The file contains streams of data (market trading data) coming from a remote server, which keep getting appended to this one line of the file.
I know that I can match lines containing a specific pattern using grep (or other tools), but once I have such lines, how can I extract a portion of the line? I want to grab the part of the line with the pattern plus roughly 200 characters before and after the pattern. I would be especially interested in answers using...(supply tools or languages you're comfortable with here).

If what you need is the 200 characters before and after the expression plus the expression itself, then you are looking at:
/.{200}aaa.{200}/
If you need captures for each (allowing you to extract each part as a unit), then you use this regexp:
/(.{200})(aaa)(.{200})/

If your grep has -o then that will output only the matched part.
echo "abc def ghi jkl mno pqr" | egrep -o ".{4}ghi.{4}"
produces:
def ghi jkl

(.{0,200}(pattern).{0,200}), or something?
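The bounded-context pattern can be tried out quickly in a scripting language. The sketch below uses Python; the function name context is made up, the needle is treated as literal text (an assumption, via re.escape), and the padding is a parameter so that the example can use 5 instead of 200.

```python
import re


def context(haystack, needle, padding=200):
    """Return needle plus up to `padding` characters on each side, or None."""
    # .{0,N} instead of .{N} so that matches near either end of the line still work.
    # re.escape makes the needle literal; drop it to pass a regex instead.
    pattern = re.compile(
        ".{0,%d}%s.{0,%d}" % (padding, re.escape(needle), padding),
        re.DOTALL,
    )
    m = pattern.search(haystack)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(context("123456789 ASDF 123456789", "ASDF", 5))
```

Using {0,200} rather than {200} matters when the pattern occurs within the first or last 200 characters of the line, where a fixed count would fail to match.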

Is this what you want (in C)?
If it is, feel free to adapt to your specific needs.
#include <stdio.h>
#include <string.h>
void prt_grep(const char *haystack, const char *needle, int padding) {
    const char *ptr, *start, *finish;
    size_t tail;
    ptr = strstr(haystack, needle);
    if (!ptr) return;
    /* clamp the window to the start of the string */
    start = (ptr - haystack > padding) ? ptr - padding : haystack;
    /* clamp the window to the end of the string */
    finish = ptr + strlen(needle);
    tail = strlen(finish);
    finish += ((size_t)padding < tail) ? (size_t)padding : tail;
    for (ptr = start; ptr < finish; ptr++) putchar(*ptr);
}
int main(void) {
    const char *longline = "123456789 ASDF 123456789";
    const char *pattern = "ASDF";
    prt_grep(longline, pattern, 5); /* you want 200 */
    return 0;
}

I think I might approach the problem by matching the part of the string I need, then using the match position as the starting point for the substring extraction. In Perl, once your regex succeeds, the pos built-in tells you where you left off (pos is only set by a match with the /g flag in scalar context):
if( $long_string =~ m/$regex/g ) {
    $substring = substr( $long_string, pos( $long_string ), 200 );
}
I tend to write my programs in Perl instead of doing everything in the regular expression. There's nothing particularly special about Perl in this case.

I think this may be more basic than everybody is thinking; correct me if I'm wrong...
Do you want to print before and after the string excluding the string?
awk -F "ASDF" '{print "Before ASDF" $1 "\n" "After ASDF" $2}' $FILE
This will print something like:
Before ASDF blablabla
After ASDF blablablabla
Change it to match your needs: remove the "\n" and/or the "Before..." and "After..." labels.
Do you want to suppress the string from the file?
This will replace the string with a blank space; again, change it to whatever you need.
sed -i 's/ASDF/\ /' longstring.txt
HTH

Regex Replacing &#58; to ":" etc

I've got a bunch of strings like:
"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom

The big advantage of using a regex is to deal with the tricky cases like &#38; Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need at least two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; to &#126;. For that reason, you could consider &#1?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace :
For each match // Using regex_iterator<>
Print the prefix of the match
Remove the first 2 and last character of the match (&#;)
lexical_cast the result to int, then truncate to char and append.
Print the suffix of the last match.
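The algorithm above can be sketched compactly in a language whose regex library accepts a replacement callback. The following is an illustration in Python, not Boost code; the names NCR and decode_printable_ncrs are made up, and the 32..126 range check is done in the callback instead of in the regex.

```python
import re

NCR = re.compile(r"&#(\d+);")


def decode_printable_ncrs(text):
    """Replace decimal NCRs for printable ASCII (32..126) in a single pass."""

    def repl(match):
        code = int(match.group(1))
        # Anything outside printable ASCII is left exactly as it was.
        return chr(code) if 32 <= code <= 126 else match.group(0)

    # re.sub scans the original text once, so "&#38;#59;" becomes "&#59;"
    # and is not decoded a second time.
    return NCR.sub(repl, text)


if __name__ == "__main__":
    print(decode_printable_ncrs("a test colon&#58;. a test semi-colon&#59;"))
```

The prefix and suffix handling from the steps above is done by re.sub itself, which copies the unmatched text between matches unchanged.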

This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.
NUMS = '1234567890'
MAIN LINE = INPUT :F(END)
SWAP LINE ? '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
OUTPUT = LINE :(MAIN)
END

* Repaired SNOBOL4 Solution
* &#38; -> &
digit = '0123456789'
main line = input :f(end)
result =
swap line arb . l
+ '&#' span(digit) . n ';' rem . line :f(out)
result = result l char(n) :(swap)
out output = result line :(main)
end

I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.
Here's a Python implementation:
import re
s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)
Producing:
"Hello, here's a test colon:. Here's a test semi-colon;"
I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.

The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:
dd = "0123456789"
ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end

Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.
echo "Hello, here's a test colon:. Here's a test semi-colon; Further test &#65;. abc.~.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) ) { return sprintf("%c",$x); } else { return "&#".$x.";"; } } while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'
Pretty-printing that:
#!/usr/bin/perl -w
sub translate
{
my $x=$_[0];
if ( ($x >= 32) && ($x <= 126) )
{
return sprintf( "%c", $x );
}
else
{
return "&#" . $x . ";" ;
}
}
while (<>)
{
s/&#(1?\d\d);/&translate($1)/ge;
print;
}
Though perl being perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.

Here's another Perl one-liner (see @mrree's answer):
a test file:
$ cat ent.txt
Hello,  here's a test colon:.
Here's a test semi-colon; ''
the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt
or using more specific regex:
$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt
both one-liners produce the same output:
Hello,  here's a test colon:.
Here's a test semi-colon; ''

The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.
// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>
int main() {
using namespace BOOST_SPIRIT_CLASSIC_NS;
std::string line;
while (std::getline(std::cin, line)) {
assert(parse(line.begin(), line.end(),
// match "&#(\d+);" where 32 <= $1 <= 126 or any char
*(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
| anychar_p[&putchar])).full);
putchar('\n');
}
}
compile:
$ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
run:
$ echo "Hello,  here's a test colon:." | spirit_ncr2a
output:
"Hello,  here's a test colon:."

I thought I was pretty good at regex, but I have never seen lambdas being used in a regex; please enlighten me!
I'm currently using python and would have solved it with this oneliner:
''.join([x.isdigit() and chr(int(x)) or x for x in re.split('&#(\d+);',THESTRING)])
Does that make any sense?

Here's a NCR scanner created using Flex:
/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
To make an executable:
$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl
Example:
$ echo "Hello,  here's a test colon:.
> Here's a test semi-colon; ''
> &#38;#59; <-- may be recursive" \
> | ncr2a
It prints for non-recursive version:
Hello,  here's a test colon:.
Here's a test semi-colon; ''
&#59; <-- may be recursive
And the recursive one produces:
Hello,  here's a test colon:.
Here's a test semi-colon; ''
; <-- may be recursive

This is one of those cases where the original problem statement apparently isn't very complete, but if you really want to trigger only on cases which produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).
dd = "0123456789"
ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
It would not be particularly difficult to handle that case as well (e.g. ;#131;#58; produces ";#131;:"):
dd = "0123456789"
ccp = "#" (span(dd) $ n ";") $ enc
+ *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = replace(line,char(10),"#") :(rdl)
done
end

Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.
#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
int main()
{
boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
const int subs[] = {-1, 1}; // non-match & subexpr
boost::sregex_token_iterator end;
std::string line;
while (std::getline(std::cin, line)) {
boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);
for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
if (isncr) { // convert NCR e.g., '&#58;' -> ':'
const int d = boost::lexical_cast<int>(*tok);
assert(32 <= d && d < 127);
std::cout << static_cast<char>(d);
}
else
std::cout << *tok; // output as is
}
std::cout << '\n';
}
}
