Remove comments from C/C++ code - c++

Is there an easy way to remove comments from a C/C++ source file without doing any preprocessing? (I.e., I think you can use gcc -E, but this will expand macros.) I just want the source code with comments stripped; nothing else should be changed.
EDIT:
Preference towards an existing tool. I don't want to have to write this myself with regexes, I foresee too many surprises in the code.

Run the following command on your source file:
gcc -fpreprocessed -dD -E test.c
Thanks to KennyTM for finding the right flags. Here’s the result for completeness:
test.c:
#define foo bar
foo foo foo
#ifdef foo
#undef foo
#define foo baz
#endif
foo foo
/* comments? comments. */
// c++ style comments
gcc -fpreprocessed -dD -E test.c:
#define foo bar
foo foo foo
#ifdef foo
#undef foo
#define foo baz
#endif
foo foo

It depends on how perverse your comments are. I have a program scc to strip C and C++ comments. I also have a test file for it, and I tried GCC (4.2.1 on MacOS X) with the options in the currently selected answer - and GCC doesn't seem to do a perfect job on some of the horribly butchered comments in the test case.
NB: This isn't a real-life problem - people don't write such ghastly code.
Consider this subset (36 of 135 lines in total) of the test case:
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
On my Mac, the output from GCC (gcc -fpreprocessed -dD -E subset.c) is:
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
The output from 'scc' is:
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
The output from 'scc -C' (which recognizes double-slash comments) is:
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
Source for SCC now available on GitHub
The current version of SCC is 6.60 (dated 2016-06-12), though the Git versions were created on 2017-01-18 (in the US/Pacific time zone). The code is available from GitHub at https://github.com/jleffler/scc-snapshots. You can also find snapshots of the previous releases (4.03, 4.04, 5.05) and two pre-releases (6.16, 6.50) — these are all tagged release/x.yz.
The code is still primarily developed under RCS. I'm still working out how I want to use sub-modules or a similar mechanism to handle common library files like stderr.c and stderr.h (which can also be found in https://github.com/jleffler/soq).
SCC version 6.60 attempts to understand C++11, C++14 and C++17 constructs such as binary constants, numeric punctuation, raw strings, and hexadecimal floats. It defaults to C11 mode operation. (Note that the meaning of the -C flag — mentioned above — flipped between version 4.0x described in the main body of the answer and version 6.60 which is currently the latest release.)
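To see why the test case above is hard: the compiler deletes every backslash-newline pair before it looks for comments, so a `/` and a `*` on different lines can still open a comment. Here is a minimal Python sketch of just that splicing step. `splice_lines` is a name made up for this illustration and is not part of scc; the sketch ignores trigraphs and raw strings, and a real tool would also need to preserve line numbers.

```python
def splice_lines(source: str) -> str:
    """Delete each backslash-newline pair, as the compiler does before tokenizing."""
    return source.replace("\\\n", "")


# "Regular C comment number 1" from the test case, written as a Python string.
sample = "/\\\n*\\\nRegular\ncomment\n*\\\n/\nint x;\n"
print(repr(splice_lines(sample)))

# "/\" followed by "\/" splices to "/\/", which is not a // comment.
print(repr(splice_lines("/\\\n\\/ not a comment")))
```

After splicing, the first sample starts with an ordinary `/*` and ends with `*/`, which is why a stripper that scans the raw text character by character misses it.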

gcc -fpreprocessed -dD -E did not work for me, but this program does it (it strips /* ... */ comments only and leaves // comments alone):
#include <stdio.h>

static void process(FILE *f)
{
    int c;
    while ((c = getc(f)) != EOF)
    {
        if (c == '\'' || c == '"') /* string or character literal */
        {
            int q = c;
            do
            {
                putchar(c);
                if (c == '\\' && (c = getc(f)) != EOF)
                    putchar(c); /* copy the escaped character */
                if (c != EOF)
                    c = getc(f);
            } while (c != q && c != EOF);
            if (c == EOF)
                return; /* unterminated literal */
            putchar(c);
        }
        else if (c == '/') /* opening comment? */
        {
            c = getc(f);
            if (c != '*') /* no, recover */
            {
                putchar('/');
                if (c != EOF)
                    ungetc(c, f);
            }
            else
            {
                int p = 0; /* previous character inside the comment */
                putchar(' '); /* replace comment with a space */
                while ((c = getc(f)) != EOF && !(p == '*' && c == '/'))
                    p = c;
            }
        }
        else
        {
            putchar(c);
        }
    }
}

int main(void)
{
    process(stdin);
    return 0;
}
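For comparison, here is a rough Python sketch of the same kind of character-by-character scanner, extended to drop // comments as well. It is only an illustration under simplifying assumptions: `strip_comments` is a made-up name, and the sketch ignores backslash-newline splicing, trigraphs and C++ raw strings.

```python
def strip_comments(src: str) -> str:
    """Remove /* */ and // comments while copying string and char literals verbatim."""
    out = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c in "\"'":
            # Copy the literal up to its closing quote, honouring backslash escapes.
            quote = c
            out.append(c)
            i += 1
            while i < n:
                out.append(src[i])
                if src[i] == "\\" and i + 1 < n:
                    out.append(src[i + 1])
                    i += 2
                    continue
                if src[i] == quote:
                    i += 1
                    break
                i += 1
        elif src.startswith("/*", i):
            # Block comment: replace it with one space so tokens do not merge.
            end = src.find("*/", i + 2)
            out.append(" ")
            i = n if end == -1 else end + 2
        elif src.startswith("//", i):
            # Line comment: skip to the newline, which is kept.
            end = src.find("\n", i)
            i = n if end == -1 else end
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(strip_comments('s = "/* kept */"; /* dropped */ x = a / b; // dropped\n'))
```

Replacing a block comment with a single space follows the C program above; it keeps `a/**/b` from turning into `ab`.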

There is a stripcmt program that can do this:
StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the command line.
(per hlovdal's answer to: question about Python code for this)

This is a perl script to remove //one-line and /* multi-line */ comments
#!/usr/bin/perl
undef $/;
$text = <>;
$text =~ s/\/\/[^\n\r]*(\n\r)?//g;
$text =~ s/\/\*+([^*]|\*(?!\/))*\*+\///g;
print $text;
It takes your source file as a command-line argument.
Save the script to a file, say remove_comments.pl,
and run it with the following command: perl -w remove_comments.pl [your source file]
Note that it does not treat string literals specially, so a // or /* inside a string is removed as well.
Hope it helps.
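A common way to make a regex-based stripper safer is to match string and character literals in the same pattern as the comments, and put the literals back unchanged. The following Python sketch shows that idea. It is not a port of the Perl script above; `strip_comments_regex` is a made-up name, and line continuations and raw strings are not handled.

```python
import re

# One alternation; whichever construct starts first in the text wins.
TOKEN = re.compile(
    r"""
      //[^\n]*                 # line comment
    | /\*.*?\*/                # block comment (non-greedy, may span lines)
    | "(?:\\.|[^"\\])*"        # string literal
    | '(?:\\.|[^'\\])*'        # character literal
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_comments_regex(src: str) -> str:
    def replace(match: re.Match) -> str:
        text = match.group(0)
        if text.startswith("//"):
            return ""       # drop line comments, the newline is not part of the match
        if text.startswith("/*"):
            return " "      # keep one space so neighbouring tokens stay apart
        return text         # literals are copied back untouched

    return TOKEN.sub(replace, src)


print(strip_comments_regex('url = "http://example.com"; // trailing comment'))
```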

I had this problem as well. I found this tool (Cpp-Decomment), which worked for me. However, it does not handle a line comment that is continued onto the next line with a backslash. E.g.:
// this is my comment \
comment continues ...
In this case I couldn't find a way to handle it in the program, so I just searched for the lines it had missed and fixed them manually. There may be an option for that, or you could change the program's source to handle it.

Because you use C, you might want to use something that's "natural" to C. You can use the C preprocessor to just remove comments. The examples given below work with the C preprocessor from GCC. They should work the same or in similar ways with other C preprocessors as well.
For C, use
cpp -dD -fpreprocessed -o output.c input.c
It also works for removing comments from JSON, for example like this:
cpp -P -o - - <input.json >output.json
In case your C preprocessor is not accessible directly, you can try to replace cpp with cc -E, which calls the C compiler telling it to stop after the preprocessor stage.
In case your C compiler binary is not cc you can replace cc with the name of your C compiler binary, for example clang. Note that not all preprocessors support -fpreprocessed.

I wrote a C program of around 200 lines, using only the standard C library, which removes comments from a C source file.
qeatzy/removeccomments
behavior
C-style comments that span multiple lines or occupy an entire line are zeroed out.
C-style comments in the middle of a line remain unchanged, e.g. void init(/* do initialization */) {...}
C++-style comments that occupy an entire line are zeroed out.
C string literals are respected, by checking for " and \".
Line continuation is handled: if the previous line ends with \, the current line is treated as part of the previous line.
Line numbers remain the same. Zeroed-out lines or parts of lines become empty.
testing & profiling
I tested it on the largest CPython source file, which contains many comments.
In this case it does the job correctly and fast, 2-5 times faster than gcc:
time gcc -fpreprocessed -dD -E Modules/unicodeobject.c > res.c 2>/dev/null
time ./removeccomments < Modules/unicodeobject.c > result.c
usage
/path/to/removeccomments < input_file > output_file

I believe you can easily remove comments from C code with a one-line command:
perl -i -pe 's/\/\*.*?\*\///g' file.c removes C-style /* */ comments that start and end on the same line.
perl -i -pe 's/\/\/.*//' file.cpp removes C++-style // comments.
The only problem with these commands is that they can't remove comments that span more than one line, but the regex can serve as a starting point for multi-line comment removal.

Recently I wrote some Ruby code to solve this problem. I considered the following special cases:
comments inside strings
multiple block comments on one line (fixing the greedy match)
block comments that span multiple lines
Here is the code:
It uses the following placeholder strings to preprocess each line, in case comment markers appear inside strings. If one of these placeholders appears in your own code, uh, bad luck. You can replace them with more complex strings.
MUL_REPLACE_LEFT = "MUL_REPLACE_LEFT"
MUL_REPLACE_RIGHT = "MUL_REPLACE_RIGHT"
SIG_REPLACE = "SIG_REPLACE"
USAGE: ruby -w inputfile outputfile

I know it's late, but I thought I'd share my code and my first attempt at writing a compiler.
Note: this does not account for "*/" inside a multiline comment, e.g. /*...."*/"...*/. Then again, gcc 4.8.1 doesn't either.
#include <stdio.h>

void function_removeComments(char *pchar_sourceFile, long long_sourceFileSize)
{
    long long_sourceFileIndex = 0;
    long long_logIndex = 0;
    int int_EOF = 0;
    for (long_sourceFileIndex = 0; long_sourceFileIndex < long_sourceFileSize; long_sourceFileIndex++)
    {
        if (pchar_sourceFile[long_sourceFileIndex] == '/' && int_EOF == 0)
        {
            long_logIndex = long_sourceFileIndex; // log "possible" start of comment
            if (long_sourceFileIndex + 1 < long_sourceFileSize) // array bounds check given we want to peek at the next character
            {
                if (pchar_sourceFile[long_sourceFileIndex + 1] == '*') // multiline comment
                {
                    int int_found = 0; // set once the terminating "*/" is seen
                    // stop one character early so the peek at index+1 stays in bounds
                    for (long_sourceFileIndex += 2; long_sourceFileIndex + 1 < long_sourceFileSize; long_sourceFileIndex++)
                    {
                        if (pchar_sourceFile[long_sourceFileIndex] == '*' && pchar_sourceFile[long_sourceFileIndex + 1] == '/')
                        {
                            // since we've found the end of multiline comment
                            // we want to increment the pointer position two characters
                            // accounting for "*" and "/"
                            long_sourceFileIndex += 2;
                            int_found = 1;
                            break; // terminating sequence found
                        }
                    }
                    // didn't find terminating sequence so it must be eof.
                    // set file pointer position to initial comment start position
                    // so we can display file contents.
                    if (!int_found)
                    {
                        long_sourceFileIndex = long_logIndex;
                        int_EOF = 1;
                    }
                }
                else if (pchar_sourceFile[long_sourceFileIndex + 1] == '/') // single line comment
                {
                    // since we know its a single line comment, increment file pointer
                    // until we encounter a new line or its the eof
                    for (long_sourceFileIndex++; long_sourceFileIndex < long_sourceFileSize && pchar_sourceFile[long_sourceFileIndex] != '\n' && pchar_sourceFile[long_sourceFileIndex] != '\0'; long_sourceFileIndex++);
                }
            }
        }
        if (long_sourceFileIndex < long_sourceFileSize) // a comment may end exactly at the end of the buffer
            printf("%c", pchar_sourceFile[long_sourceFileIndex]);
    }
}

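#include <stdio.h>

int main(void)
{
    int c; // int rather than char, so that EOF can be detected reliably
    char tmp = '\0'; // holds a '/' that may start a comment
    int inside_comment = 0; // A flag to check whether we are inside comment
    while ((c = getchar()) != EOF) {
        if (tmp) {
            if (c == '/') {
                // line comment: skip to the end of the line
                while ((c = getchar()) != '\n' && c != EOF);
                tmp = '\0';
                putchar('\n');
                continue;
            } else if (c == '*') {
                // block comment: skip until the closing "*/"
                int prev = '\0';
                inside_comment = 1;
                while (inside_comment && (c = getchar()) != EOF) {
                    if (prev == '*' && c == '/')
                        inside_comment = 0;
                    prev = c;
                }
                tmp = '\0';
                continue;
            } else {
                putchar(tmp); // the '/' was not the start of a comment
                putchar(c);
                tmp = '\0';
                continue;
            }
        }
        if (c == '/') {
            tmp = c;
        } else {
            putchar(c);
        }
    }
    if (tmp)
        putchar(tmp); // a trailing '/' at end of input
    return 0;
}
This program handles both kinds of comments, i.e. // and /* ... */. It does not treat string literals specially.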

Related

SED to delete C Program comments

I need to delete the comment line in a C program with sed in linux, assuming that each comment line contains the start and end tokens without any other statements before and after.
For example, the code below:
/* a comment line in a C program */
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
[TAB][SPACE] /* another empty comment line here */
/* another weird line, but not a comment line */ y = 0;
becomes
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another weird line, but not a comment line */ y = 0;
I know that this regex
^\s?\/\*.*\*\/$
matches the lines that I need to delete. However, the following command:
sed -i -e 's/^\s?\/\*.*\*\/$//g' filename
does not do the trick.
I am not too sure what I am doing wrong...
Thanks for your help.
This does it:
$ sed -e '/^\s*\/\*.*\*\/$/d' file
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another weird line, but not a comment line */ y = 0;
Notes:
^\s? matches zero or one spaces. It looks like you want to match zero or one or more spaces. So, we use instead ^\s*.
Since you want to delete the lines rather than replace them with empty lines, the command to use is d for delete.
It is not necessary to delimit a regex with /. We can use |, for example:
sed -e '\|^\s*/\*.*\*/$|d' file
This eliminates the need to escape the /. Depending on how many times / appears in a regex, this may or may not be simpler and clearer.
This might be what you're looking for:
$ awk '{o=$0; gsub(/\*\//,"\n"); gsub(/\/\*[^\n]*\n/,"")} NF{print o}' file
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another weird line, but not a comment line */ y = 0;
/* first comment */ non comment /* second comment */
The above was run on this input file:
$ cat file
/* a comment line in a C program */
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another empty comment line here */
/* another weird line, but not a comment line */ y = 0;
/* first comment */ non comment /* second comment */
and uses awk because once you're past a simple s/old/new/, everything's easier (and more efficient, more portable, etc.) with awk. The above will delete any empty lines; if that's a problem then update your sample input/output to include that, but it's an easy fix.
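The same filter can be sketched in Python. The pattern below only accepts a line that consists of exactly one comment, so the last two lines of the sample file survive. `drop_comment_lines` is a made-up name, and unlike the awk version this sketch leaves empty lines alone.

```python
import re

# Optional blanks, "/*", any characters that do not begin "*/", then "*/" and optional blanks.
ONLY_COMMENT = re.compile(r"[ \t]*/\*(?:(?!\*/).)*\*/[ \t]*")


def drop_comment_lines(lines):
    """Return the lines that are not made up of a single /* ... */ comment."""
    return [line for line in lines if not ONLY_COMMENT.fullmatch(line)]


sample = [
    "/* a comment line in a C program */",
    'printf("It is /* NOT a comment line */\\n");',
    "x = 5; /* This is an assignment, not a comment line */",
    "\t /* another empty comment line here */",
    "/* another weird line, but not a comment line */ y = 0;",
    "/* first comment */ non comment /* second comment */",
]
for line in drop_comment_lines(sample):
    print(line)
```

The `(?!\*/)` lookahead stops the match at the first `*/`, which is what keeps a line like `/* first */ code /* second */` from being treated as one long comment.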
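What you are doing is replacing the text matched by your regex with an empty string, which leaves an empty line behind:
sed -i -e 's/^\s?\/\*.*\*\/$//g' filename
which has the form
sed -i -e 's/pattern_to_find/replacement/g' : g means every match on a line, not just the first.
What you need to do is delete the lines that match the regex (note \s* rather than \s?, so that any amount of leading whitespace is allowed):
sed -i -e '/^\s*\/\*.*\*\/$/d' filename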

How to do multiple multi-line replacements using text from the pattern match?

I'm implementing an annotation feature in Bash and am looking for either an awk or sed solution for some text manipulation.
I'd like to transform text in a file from:
^version 10.2 tag1 tag2
^audit arg1 arg2
f()
{
...
}
g()
{
...
}
^version 10.2
h() { ... }
^version 10.2
i() { ... } # Not annotated: doesn't immediately follow an annotation
to:
annotate f^1 version 10.2 tag1 tag2
annotate f^1 audit arg1 arg2
f^1()
{
...
}
g()
{
...
}
annotate h^2 version 10.2
h^2() { ... }
i() { ... } # Not annotated: doesn't immediately follow an annotation
Replacements are done as follows:
lines beginning with ^ are replaced by annotate, a space, the function name found after the annotation lines, a ^, an index, and the rest of the line
the function name is suffixed with a ^ and the index (after this, the index is incremented)
Function names begin in column 1 and are Bash function names that do not require POSIX compliance (see Bash source code builtins/declare.def: shell function names don't have to be valid identifiers; and, in parse.y, a function is a WORD). An acceptably imperfect regex for the function part of the pattern is given below (but I'll upvote solutions that can figure out a better regex, even if they don't answer the bigger question; it was hard to figure out from reading the source code):
^[^'"()]\+\s*(\s*)
Note that an annotation applies only to the immediately following function following the match. If the function does not immediately follow the annotation lines, then the annotations should not be emitted at all.
The solution should be general and not include strings found in the example above (version, audit, f, g, h, etc.).
Solutions must not require utilities/packages that are not found in CentOS 7 Minimal. So, unfortunately, Perl cannot be considered. I would prefer an awk solution.
Your answer will be used to improve the code for an open-source Bash project: Eggsh.
Try something like this:
/^\^/ { if (ann == 0) count++; ann++; acc[ann] = substr($0, 2); next; }
# \( and \) match literal parentheses; a bare ( ) would only be a group
/^[a-zA-Z0-9_]+[ \t]*\([ \t]*\)/ && ann {
    ind = index($0, "(");
    fname = substr($0, 1, ind-1)
    for (i = 1; i <= ann; i++) {
        print "annotate " fname "^" count " " acc[i];
    }
    print fname "^" count substr($0, ind);
    ann = 0;
    next;
}
{ ann = 0; print; }
Note that I have not bothered to do the research necessary to find a better function name regexp.
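To make the intended transformation easier to check, here is a small Python sketch of the rules from the question. It is only for illustration, since the question asks for awk or sed. `annotate` and `FUNC_RE` are made-up names, the function-name pattern is a simplification of the regex in the question, and the index is only advanced when annotations are actually applied.

```python
import re

# A function definition starts in column 1: a name, then "(" and ")" with optional blanks.
FUNC_RE = re.compile(r"^([^'\"()\s]+)\s*\(\s*\)")


def annotate(lines):
    out, pending, index = [], [], 0
    for line in lines:
        if line.startswith("^"):
            pending.append(line[1:])      # remember the annotation without the caret
            continue
        match = FUNC_RE.match(line)
        if match and pending:
            index += 1
            tagged = f"{match.group(1)}^{index}"
            out.extend(f"annotate {tagged} {text}" for text in pending)
            out.append(tagged + line[match.end(1):])
        else:
            out.append(line)              # pending annotations are dropped silently
        pending = []
    return out


sample = ["^version 10.2 tag1 tag2", "^audit arg1 arg2", "f()", "{", "...", "}",
          "^version 10.2", "", "i() { ... }"]
print("\n".join(annotate(sample)))
```

In the sample, the blank line between the last annotation and `i()` is what prevents `i` from being annotated.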

Lex: match ignore space

My task is to recognize hex numbers.
My problem is how to ignore leading spaces but not allow any other character before the number,
like this:
0x7f6e ----> match, and print "0x7f6e"
0X2146 ----> match, and print "0X2146"
acns0x8972 ----> not match
my work now:
hex \s*0[X|x][0-9a-fA-f]{1,4}(^.)*(\n)
{hex} { ECHO;}
.|\n {}
and it prints (with the leading spaces kept):
0x7f6e
0X2146
How can I print it without the leading spaces,
like this:
0x7f6e
0X2146
I got a working version which should do what you expect:
%{
#include <ctype.h>
#include <stdio.h>
%}
%%
^[ \t]*0[Xx][0-9a-fA-F]{1,4}(.*)$ {
    /* skip spaces at begin of line */
    const char *bol = yytext;
    while (isspace((unsigned char)*bol)) ++bol;
    /* echo rest of line */
    puts(bol);
  }
.|\n { }
%%
int main(int argc, char **argv) { return yylex(); }
int yywrap() { return 1; }
Notes:
\s seems to be unsupported (at least in my version 2.6.3 of flex). I replaced it with [ \t]. By the way, \s usually also matches carriage return, newline and form feed, which is not intended in this case.
(^.)* replaced by (.*). (I didn't understand the intention of the original one. Mistake?)
I added a ^ at the beginning of the first pattern so that the pattern is anchored to the beginning of the line.
I replaced \n at the end of the hex line with $. The puts() function adds a newline to the output. (Newlines are always matched by the second rule and thus skipped.)
I replaced ECHO; with some C code to (1) remove spaces at the beginning of the line and (2) output the rest of the line to standard output.
Compiled and tested in cygwin on Windows 10 (64 bit):
$ flex --version
flex 2.6.3
$ flex -o test-hex.c test-hex.l ; gcc -o test-hex test-hex.c
$ echo "
0x7f6e
0X2146
acns0x8972
" | ./test-hex
0x7f6e
0X2146
$
Note: I used echo to feed your sample data via pipe into standard input channel of test-hex.
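The pattern can be tried out quickly in Python before putting it into a flex file. This is just a sketch for experimenting with the regex. `hex_numbers` is a made-up name, and unlike the flex rule above it returns only the number itself rather than the rest of the line.

```python
import re

# Leading blanks are allowed, any other character before "0x" is not.
HEX_LINE = re.compile(r"[ \t]*(0[Xx][0-9A-Fa-f]{1,4})")


def hex_numbers(text: str):
    """Return the hex number at the start of each line, without leading blanks."""
    found = []
    for line in text.splitlines():
        match = HEX_LINE.match(line)   # match() anchors at the start of the line
        if match:
            found.append(match.group(1))
    return found


print(hex_numbers("   0x7f6e\n 0X2146\nacns0x8972\n"))
```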

(F)Lex : get text not matched by rules / get default output

I've read a lot about (F)Lex so far, but I couldn't find an answer.
Actually I have 2 questions, and getting the answer for one would be enough.
I have strings like:
TOTO 123 CD123 RGF 32/FDS HGGH
For each token I find, I put it in a vector. For example, for this string, I get a vector like this:
vector = TOTO, whitespace, CD, 123, whitespace, RGF, whitespace, 32, FDS, whitespace, HGGH
The "/" does not match any rules, but still, i would like to put it in my vector when I reach it and get:
vector = TOTO, whitespace, CD, 123, whitespace, RGF, whitespace, 32, /, FDS, whitespace, HGGH
So my questions are:
1) Is there a possibility to modify the default action when an input does not match any rule? (instead of print on stdout ?)
2) If it is not possible, how to catch this ? because here, "/" is an example but it can be everything ( % , C, 3, Blabblabla, etc that does not match my rules), and I can't put
.* { else(); }
cause Flex uses the regex which matches the longest string. I would like that my rules to be "sorted", and ".*" would be the last, like changing the "preferences" of Flex.
Any idea ?
The usual way is to have a rule something like
. { do_something_with_extra_char(*yytext); }
at the END of your rules. This will match any single character (other than newline -- you need a rule that matches newline somewhere too) that doesn't match any other rule. If you have multiple unmatched characters, this rule will trigger multiple times, but generally that is fine.
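The effect of such a catch-all rule can be imitated in Python with an ordered alternation whose last branch matches any single character. This is only an analogy: the token names are made up for the example, and Python's `re` picks the first alternative that matches, whereas flex picks the longest match and uses rule order only to break ties.

```python
import re

# The catch-all branch OTHER comes last, so it only sees what nothing else matched.
TOKEN = re.compile(
    r"(?P<WORD>[A-Za-z]+)|(?P<NUM>[0-9]+)|(?P<SPACE>[ \t]+)|(?P<OTHER>.)",
    re.DOTALL,
)


def tokenize(text: str):
    """Split text into (kind, lexeme) pairs, never dropping a character."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


for kind, lexeme in tokenize("RGF 32/FDS"):
    print(kind, repr(lexeme))
```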
EDIT: I think Chris Dodd's answer is better. Here are two alternative solutions.
One solution would be to use states. When you read a single unrecognized character, enter into a different state, and build up the unrecognized token.
%{
char str[1024];
int strUsed;
%}
%x UNRECOGNIZED
%%
{SOME_RULE} {/* do processing */ }
. {BEGIN(UNRECOGNIZED); str[0] = yytext[0]; strUsed = 1; }
<UNRECOGNIZED>{bad_input} { strcpy(str+strUsed, yytext); strUsed+=yyleng; }
<UNRECOGNIZED>{good_input} { str[strUsed] = 0; vector_add(str); BEGIN(INITIAL); }
This solution works well if it's easy to write a regular expression to match "bad" input. Another solution is to slowly build up bad characters until the next valid match:
%{
char str[1024];
int strUsed = 0;
void goodMatch() {
if(strUsed) {
str[strUsed] = 0;
vector_add(str);
strUsed = 0;
}
}
%}
%%
{SOME_RULE} { goodMatch(); /* do processing */ }
. {str[strUsed++] = yytext[0]; }
Note that this requires you to modify all existing rules to add in a call to function goodMatch.
Note for both solutions: if you use a statically sized buffer, you'll have to ensure you don't overflow it on the strcpy. If you end up using a dynamically sized string, you'll have to be sure to correctly clean up memory.

Regex Replacing &#58; to ":" etc

I've got a bunch of strings like:
"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom
The big advantage of using a regex is to deal with the tricky cases like &#38; Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; to &#126;. For that reason, you could consider &#1?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace :
For each match // Using regex_iterator<>
Print the prefix of the match
Remove the first 2 and last character of the match (&#;)
lexical_cast the result to int, then truncate to char and append.
Print the suffix of the last match.
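The steps above can be sketched in a few lines of Python, with `re.finditer` standing in for `regex_iterator<>`. This is an illustration of the algorithm and not Boost code. `decode_ncrs` is a made-up name, and the range check that leaves codes outside 32..126 untouched is an added assumption.

```python
import re

NCR = re.compile(r"&#([0-9]+);")


def decode_ncrs(text: str) -> str:
    pieces, last = [], 0
    for match in NCR.finditer(text):
        pieces.append(text[last:match.start()])       # prefix of this match
        code = int(match.group(1))                    # the digits between "&#" and ";"
        # Only printable ASCII is converted; anything else is copied unchanged.
        pieces.append(chr(code) if 32 <= code <= 126 else match.group(0))
        last = match.end()
    pieces.append(text[last:])                        # suffix of the last match
    return "".join(pieces)


print(decode_ncrs("a test colon&#58;. a test semi-colon&#59;"))
```

Because the text is scanned once from left to right, the output of one replacement is never scanned again, which is the "not iterative" behaviour described above.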
This will probably earn me some down votes, seeing as this is not a c++, boost or regex response, but here's a SNOBOL solution. This one works for ASCII. Am working on something for Unicode.
NUMS = '1234567890'
MAIN LINE = INPUT :F(END)
SWAP LINE ? '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
OUTPUT = LINE :(MAIN)
END
* Repaired SNOBOL4 Solution
* &#38; -> &
digit = '0123456789'
main line = input :f(end)
result =
swap line arb . l
+ '&#' span(digit) . n ';' rem . line :f(out)
result = result l char(n) :(swap)
out output = result line :(main)
end
I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.
Here's a Python implementation:
import re
s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)
Producing:
"Hello, here's a test colon:. Here's a test semi-colon;"
I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.
The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:
dd = "0123456789"
ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.
echo "Hello, here's a test colon&#58;. Here's a test semi-colon&#59; Further test &#65;. abc.&#126;.def." | perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) ) { return sprintf("%c",$x); } else { return "&#".$x.";"; } } while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'
Pretty-printing that:
#!/usr/bin/perl -w
sub translate
{
my $x=$_[0];
if ( ($x >= 32) && ($x <= 126) )
{
return sprintf( "%c", $x );
}
else
{
return "&#" . $x . ";" ;
}
}
while (<>)
{
s/&#(1?\d\d);/&translate($1)/ge;
print;
}
Though perl being perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
Here's another Perl one-liner (see @mrree's answer):
a test file:
$ cat ent.txt
Hello, &#12; here's a test colon&#58;.
Here's a test semi-colon&#59; '&#131;'
the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt
or using more specific regex:
$ perl -pe"s~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg" ent.txt
both one-liners produce the same output:
Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
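It is easy to get such a hand-written numeric range wrong, so a brute-force check is worthwhile. The Python snippet below tests the number part of the more specific regex against every integer below 1000. It assumes the 32..126 range discussed in this thread.

```python
import re

# The number part of the more specific regex, without the "&#" and ";" around it.
PRINTABLE = re.compile(r"1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]")

accepted = [n for n in range(1000) if PRINTABLE.fullmatch(str(n))]

# The pattern should accept exactly the printable ASCII codes 32..126.
assert accepted == list(range(32, 127))
print(accepted[0], accepted[-1], len(accepted))  # prints: 32 126 95
```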
The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.
// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>
int main() {
using namespace BOOST_SPIRIT_CLASSIC_NS;
std::string line;
while (std::getline(std::cin, line)) {
assert(parse(line.begin(), line.end(),
// match "&#(\d+);" where 32 <= $1 <= 126 or any char
*(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
| anychar_p[&putchar])).full);
putchar('\n');
}
}
compile:
$ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
run:
$ echo "Hello, &#12; here's a test colon&#58;." | spirit_ncr2a
output:
"Hello, &#12; here's a test colon:."
I did think I was pretty good at regex, but I have never seen lambdas being used in regex, please enlighten me!
I'm currently using Python and would have solved it with this one-liner:
''.join([x.isdigit() and chr(int(x)) or x for x in re.split(r'&#(\d+);', THESTRING)])
Does that make any sense?
Here's a NCR scanner created using Flex:
/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
To make an executable:
$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl
Example:
$ echo "Hello, &#12; here's a test colon&#58;.
> Here's a test semi-colon&#59; '&#131;'
> &#38;#59; <-- may be recursive" \
> | ncr2a
It prints for non-recursive version:
Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
&#59; <-- may be recursive
And the recursive one produces:
Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
; <-- may be recursive
This is one of those cases where the original problem statement apparently isn't very complete, but if you really want to only trigger on cases which produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in-range and others are not).
dd = "0123456789"
ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = line :(rdl)
done
end
It would not be particularly difficult to handle that case as well (e.g. &#131;&#58; produces "&#131;:"):
dd = "0123456789"
ccp = "#" (span(dd) $ n ";") $ enc
+ *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
+ fence (*ccp | null)
rdl line = input :f(done)
repl line "&" *?(s = ) ccp = s :s(repl)
output = replace(line,char(10),"#") :(rdl)
done
end
Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.
#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
int main()
{
boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
const int subs[] = {-1, 1}; // non-match & subexpr
boost::sregex_token_iterator end;
std::string line;
while (std::getline(std::cin, line)) {
boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);
for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
if (isncr) { // convert NCR e.g., '&#58;' -> ':'
const int d = boost::lexical_cast<int>(*tok);
assert(32 <= d && d < 127);
std::cout << static_cast<char>(d);
}
else
std::cout << *tok; // output as is
}
std::cout << '\n';
}
}