SED to delete C Program comments - regex

I need to delete comment lines in a C program with sed on Linux, assuming that each comment line contains the start and end tokens with no other statements before or after them.
For example, the code below:
/* a comment line in a C program */
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
[TAB][SPACE] /* another empty comment line here */
/* another weird line, but not a comment line */ y = 0;
becomes
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another weird line, but not a comment line */ y = 0;
I know that this regex
^\s?\/\*.*\*\/$
matches the lines that I need to delete. However, the following command:
sed -i -e 's/^\s?\/\*.*\*\/$//g' filename
does not do the trick.
I am not too sure what I am doing wrong...
Thanks for your help.

This does it:
$ sed -e '/^\s*\/\*.*\*\/$/d' file
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another weird line, but not a comment line */ y = 0;
Notes:
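^\s? is meant to match zero or one whitespace character. It looks like you want to match zero or more, so we use ^\s* instead. Note also that in sed's default basic regular expressions an unescaped ? is a literal question mark (the quantifier is \? in GNU sed, or ? with -E), so ^\s? does not mean "optional whitespace" there.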
Since you want to delete the lines rather than replace them with empty lines, the command to use is d for delete.
It is not necessary to delimit a regex with /. We can use |, for example:
sed -e '\|^\s*/\*.*\*/$|d' file
This eliminates the need to escape the /. Depending on how many times / appears in a regex, this may or may not be simpler and clearer.
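For comparison, the same whole-line test can be sketched in Python. This is only an illustration of the regex, not part of the sed answer; the function name drop_comment_lines is made up for this example.

```python
import re

# Same idea as the sed pattern: optional leading whitespace, "/*",
# anything, then "*/" at the very end of the line.
COMMENT_LINE = re.compile(r'^\s*/\*.*\*/$')

def drop_comment_lines(text):
    """Return text without the lines that start with /* and end with */."""
    kept = [line for line in text.split('\n') if not COMMENT_LINE.match(line)]
    return '\n'.join(kept)

sample = '\n'.join([
    '/* a comment line in a C program */',
    'printf("It is /* NOT a comment line */\\n");',
    'x = 5; /* This is an assignment, not a comment line */',
    '\t  /* another empty comment line here */',
    '/* another weird line, but not a comment line */ y = 0;',
])
result = drop_comment_lines(sample)
print(result)
```

Like the sed pattern, this also drops a line such as /* first comment */ non comment /* second comment */, because that line starts with /* and ends with */.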

This might be what you're looking for:
$ awk '{o=$0; gsub(/\*\//,"\n"); gsub(/\/\*[^\n]*\n/,"")} NF{print o}' file
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another weird line, but not a comment line */ y = 0;
/* first comment */ non comment /* second comment */
The above was run on this input file:
$ cat file
/* a comment line in a C program */
printf("It is /* NOT a comment line */\n");
x = 5; /* This is an assignment, not a comment line */
/* another empty comment line here */
/* another weird line, but not a comment line */ y = 0;
/* first comment */ non comment /* second comment */
and uses awk because once you're past a simple s/old/new/ everything's easier (and more efficient, more portable, etc.) with awk. The above will delete any empty lines. If that's a problem then update your sample input/output to include that, but it's an easy fix.
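The idea behind the awk command (strip the comments from a working copy, then print the original line only if something is left) can be sketched in Python as follows. The helper name keep_line is an assumption made for this illustration.

```python
import re

def keep_line(line):
    """True if something other than whitespace remains once /* ... */ spans are removed."""
    stripped = re.sub(r'/\*.*?\*/', '', line)
    return stripped.strip() != ''

lines = [
    '/* a comment line in a C program */',
    'x = 5; /* This is an assignment, not a comment line */',
    '/* first comment */ non comment /* second comment */',
    '',
]
# The original line is kept unchanged; only the decision uses the stripped copy.
kept = [line for line in lines if keep_line(line)]
print('\n'.join(kept))
```

As with the awk version, empty lines are dropped too, since nothing remains on them either.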

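What you are doing is replacing the text matched by your regex with an empty string:
sed -i -e 's/^\s?\/\*.*\*\/$//g' filename
that means
sed -i -e 's/pattern_to_find/replacement/g' : g means every match on a line, not the whole file. The line itself, now empty, is still printed.
What you need to do is delete the lines that match the regex (using \s* so that any amount of leading whitespace is allowed):
sed -i -e '/^\s*\/\*.*\*\/$/d' filename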

Related

I need to replace C comments with a blank space or new line using sed

The input file is like this
#include <stdio.h>
int main()
{
// this is a function
float alpha = 0;
// test
/* */
int y = 11; // comment
y = y + 15;
//
char z = 'n';
/* end of file
c */
}
And the desired output should look like this
#include <stdio.h>
int main()
{
*
float alpha = 0;*
*
*
int y = 11; *
y = y + 15;*
*
char z = 'n';*
*
*
}
Here the * represents EOL.
I have tried this but it simply deletes the spaces and new lines too.
sed '/^[ \t]*\/\//d;/\/*\*\//d;/^[ \t]*\/\*/d' $[input file]
This might work for you (GNU sed):
sed -z 's#//[^\n]*##g
s#/\*#\x00#g
s#\*/#\x01#g
s/\x00[^\n\x00\x01]*\x01//g;tb
:b;s/\x00[^\x01\n]*\n/\n\x00/;tb;s/\x00[^\n]*\x01//;tb' file
The solution comes in 3 parts:
Single line comments are removed
Multi line comments on a single line are removed.
Multi line comments on multiple lines are removed but the newlines of such lines are retained.
The solution uses the -z option which may cause problems if the file contains null characters.
N.B. This solution is only partial as many corner cases may break it e.g. literal comments as part of a variable value.
sed 's/\/\/.*//;s/\/\*.*//;s/.*\*\///' file
The command works as follows: the first part searches for "//" and deletes it together with the rest of the line. The second part searches for "/*" and deletes it together with the rest of the line. The third part searches for "*/" and deletes it together with everything before it on the same line. This solution fails when a multi-line comment spans more than two lines, because the lines in the middle are left untouched.
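The requirement in the question (remove the comments but keep every line break) can be sketched in Python like this. It is a minimal illustration, not a port of either sed answer, and like them it does not protect comment markers inside string literals; the name blank_comments is made up here.

```python
import re

# Matches either a // comment up to the end of the line or a /* ... */
# comment that may span several lines.
COMMENT = re.compile(r'//[^\n]*|/\*.*?\*/', re.DOTALL)

def blank_comments(source):
    """Remove comments but keep every newline, so line numbers are unchanged."""
    return COMMENT.sub(lambda m: '\n' * m.group(0).count('\n'), source)

code = (
    'int main()\n'
    '{\n'
    '// this is a function\n'
    'int y = 11; // comment\n'
    '/* end of file\n'
    'c */\n'
    '}\n'
)
out = blank_comments(code)
print(out)
```

The replacement callback returns as many newlines as the removed comment contained, which is what keeps the multi-line comment from collapsing into a single line.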

A regular expression to exclude a Comments?

I have a regular expression as follows:
(\/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*+\/)|(\/\/.*)
And my test string as follows:
<?
/* This is a comment */
cout << "Hello World"; // prints Hello World
/*
* C++ comments can also
*/
cout << "Hello World";
/* Comment out printing of Hello World:
cout << "Hello World"; // prints Hello World
*/
echo "//This line was not a Comment, but ... ";
echo "http://stackoverflow.com";
echo 'http://stackoverflow.com/you can not match this line';
array = ['//', 'no, you can not match this line!!']
/* This is * //a comment */
https://regex101.com/r/lx2f5F/1
It matches lines 2, 4, 7~9 and 13~17 correctly.
But it also matches text inside single quotes ('), double quotes (") and the array on the last line.
How do I do non-greedy matching here?
Any help would be gratefully appreciated.
I believe I have a new best pattern for you.
/\/\*[\s\S]*?\*\/|(['"])[\s\S]+?\1(*SKIP)(*FAIL)|\/{2}.*/
This will accurately process the following block of text in just 683 steps:
<?
/* This is a comment */
cout << "Hello World"; // prints Hello World
/*
* C++ comments can also
*/
cout << "Hello World";
/* Comment out printing of Hello World:
cout << "Hello World"; // prints Hello World
*/
echo "//This line was not a Comment, but ... ";
echo "http://stackoverflow.com";
echo 'http://stackoverflow.com/you can not match this line';
array = ['//', 'no, you can not match this line!!']
/* This is * //a comment */
Pattern Explanation: (Demo; you can use the Substitution box at the bottom to replace the comment substrings with an empty string, effectively removing all comments.)
/\/\*[\s\S]*?\*\/ Match /* then 0 or more characters then */
| OR
(['"])[\s\S]+?\1(*SKIP)(*FAIL) Don't match ' or " then 1 or more characters then the leading (captured) character
| OR
\/{2}.*/ Match // then zero or more non-newline characters
Using [\s\S] is like . except it allows newline characters, this is deliberately used in the first two alternatives. The third alternative intentionally uses . to stop when a newline character is found.
I have checked every sequence of alternatives, to ensure that the fastest alternatives come first and the pattern is optimized. My pattern correctly matches the OP's sample input. If anyone finds an issue with my pattern, please leave me a comment so that I can try to fix it.
Jan's pattern correctly matches all of the OP's desired substrings in 1006 steps using: ~([\'\"])(?<!\\).*?\1(*SKIP)(*FAIL)|(?|(?P<comment>(?s)\/\*.*?\*\/(?-s))|(?P<comment>\/\/.+))~gx
Sahil's pattern fails to completely match the final comment in your UPDATED sample input. This means either the question is wrong and should be closed as "unclear what you are asking", or Sahil's answer is wrong and it should not be awarded the green tick. When you updated your question, you should have requested that Sahil update his answer. When incorrect answers fail to satisfy the question, future SO readers are likely to become confused and SO becomes a less reliable resource.
With PCRE you can use the (*SKIP)(*FAIL) mechanism:
([\'\"])(?<!\\).*?\1(*SKIP)(*FAIL)
|
(?|
(?P<comment>(?s)/\*.*?\*/(?-s))
|
(?P<comment>//.+)
)
See a working demo on regex101.com.
Note: The branch reset (?|...) is not really needed here but was merely used to make clear the group called comment.
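Python's standard re module has no (*SKIP)(*FAIL), but the same effect can be approximated by matching quoted strings first and putting them back unchanged in a replacement callback. The sketch below is an assumption-laden illustration (the names TOKEN and strip_comments are invented here), not a translation of either PCRE pattern.

```python
import re

# Alternatives are tried left to right at each position, so a quoted
# string is consumed as a whole before "//" or "/*" inside it can match.
TOKEN = re.compile(
    r'''(?P<string>(['"])(?:\\.|(?!\2).)*\2)'''   # quoted string, escapes allowed
    r'''|(?P<comment>/\*.*?\*/|//[^\n]*)''',      # block comment or line comment
    re.DOTALL,
)

def strip_comments(source):
    """Delete comments; return string literals untouched."""
    def repl(m):
        return m.group('string') if m.group('string') is not None else ''
    return TOKEN.sub(repl, source)

sample = (
    '/* This is a comment */\n'
    'cout << "Hello World"; // prints Hello World\n'
    'echo "http://stackoverflow.com";\n'
    "array = ['//', 'no, you can not match this line!!']\n"
    '/* This is * //a comment */\n'
)
cleaned = strip_comments(sample)
print(cleaned)
```

A stray apostrophe or quote outside of a string or comment would still confuse this sketch, so it is a starting point rather than a complete tokenizer.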

Why in my flex lexer line count is incremented in one case and is not incremented in the other?

My assignment (it is not graded and I get nothing from solving it) is to write a lexer/scanner/tokenizer (whatever you want to call it). flex is used for this class. The lexer is written for the Classroom Object Oriented Language, or COOL.
In this language multi-line comments start and end like this:
(* line 1
line 2
line 3 *)
These comments can be nested. In other words the following is valid:
(* comment1 start (* comment 2 start (* comment 3 *) comment 2 end *) comment 1 end *)
Strings in this language are regular quoted strings, just like in C. Here is an example:
"This is a string"
"This is another string"
There is also an extra rule saying that there cannot be an EOF in the comment or in the string. For example the following is invalid:
(* comment <EOF>
"My string <EOF>
I wrote a lexer to handle it. It keeps track of the line count by looking for \n.
Here is the problem that I'm having:
When the lexer encounters an EOF in a comment it increments the line count by 1, but when it encounters an EOF in a string it doesn't.
For example when lexer encounters the following code
Line 1: (* this is a comment <EOF>
the following error is displayed:
#2 ERROR "EOF in comment"
However when it encounters this code:
Line 1: "This is a string <EOF>
the following error is displayed:
#1 ERROR "String contains EOF character"
I can't understand why this (line number is incremented in one case and is not incremented in the other) is happening. Below are some of the rules that I used to match comments and string. If you need more then just ask, I will post them.
<BLOCK_COMMENT>{
[^\n*)(]+ ; /* Eat the comment in chunks */
")" ; /* Eat a lonely right paren */
"(" ; /* Eat a lonely left paren */
"*" ; /* Eat a lonely star */
\n curr_lineno++; /* increment the line count */
}
/*
Can't have EOF in the middle of a block comment
*/
<BLOCK_COMMENT><<EOF>> {
cool_yylval.error_msg = "EOF in comment";
/*
Need to return to INITIAL, otherwise the program will be stuck
in the infinite loop. This was determined experimentally.
*/
BEGIN(INITIAL);
return ERROR;
}
/* Match <back slash>\n or \n */
<STRING>\\\n|\n {
curr_lineno++;
}
<STRING><<EOF>> {
/* String may not have an EOF character */
cool_yylval.error_msg = "String contains EOF character";
/*
Need to return to INITIAL, otherwise the program will be stuck
in the infinite loop. This was determined experimentally.
*/
BEGIN(INITIAL);
return ERROR;
}
So the question is
Why in the case of a comment the line number is incremented and in the case of a string it stays the same?
Any help is appreciated.
Because the pattern for a comment doesn't require a newline to exist in order to increment the line number, and the pattern for a string does.

Regex for old type of comment

I have this kind of comments (a few examples):
//========================================================================
// some text some text some text some text some text some text some text
//========================================================================
// some text some text some text some text some text some text some text some text
// some text some text
// (..)
I want to replace it with comment of this style:
/*****************************************************************************\
Description:
some text some text
some text some text some text
\*****************************************************************************/
So I need regular expression for this. I managed to make this regex:
//=+\r//(.+)+
It captures the comment in a group, but only a single line (example 1). How can I make it work with multi-line comments (like example 2)?
Thanks for help
Using sed:
sed -n '
\_^//==*_!p;
\_^//==*_{
s_//_/*_; s_=_\*_g; s_\*$_\*\\_;
h; p; i\
Description:
: l; n; \_//[^=]_{s_//_\t_;p;};t l;
x;s_^/_\\_;s_\\$_/_;p;x;p;
}
' input_file
Commented version:
sed -n '
# just print non comment lines
\_^//==*_!p;
# for old-style block comments:
\_^//==*_{
# generate header line
s_//_/*_; s_=_\*_g; s_\*$_\*\\_;
# remember header, add description
h; p; i\
Description:
# while comment continues, replace // with tab
: l; n; \_//[^=]_{s_//_\t_;p;};t l;
# modify the header as footer and print
x;s_^/_\\_;s_\\$_/_;p
# also print the non-comment line
x;p;
}
' input_file
This regex matches the whole comment
(\/\/=+)(\s*\/\/ .+?$)+
A short perl script that should do what you need, explained in comments:
#!/usr/bin/perl -p
$ast = '*' x 75; # Number of asterisks.
if (m{//=+}) { # Beginning of a comment.
$inside = 1;
s{.*}{/$ast\\\nDescription:};
next;
}
if ($inside) {
unless (m{^//}) { # End of a comment.
undef $inside;
print '\\', $ast, "/\n" ;
}
s{^//}{}; # Remove the comment sign for internal lines.
}
If a regex is still wanted, this is what I came up with (I don't know whether there is a better solution):
(?<=\/{2}\s)[\w()\.\s]+
Should get all the text that is of interest.
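The line-by-line approach taken by the sed and Perl answers can also be sketched in Python. The function name convert_comments, the width parameter and the four-space indent are assumptions made for this illustration; 75 asterisks is borrowed from the Perl script.

```python
import re

def convert_comments(text, width=75):
    """Rewrite '//====' rules followed by '//' lines as one /*** ... ***/ block."""
    stars = '*' * width
    out = []
    inside = False
    for line in text.split('\n'):
        if re.match(r'//=+\s*$', line):
            if not inside:                 # the first rule line opens the block
                out.append('/' + stars + '\\')
                out.append('Description:')
                inside = True
            continue                       # later rule lines are dropped
        if inside and line.startswith('//'):
            out.append('    ' + line[2:].strip())
            continue
        if inside:                         # first non-comment line closes the block
            out.append('\\' + stars + '/')
            inside = False
        out.append(line)
    if inside:                             # block ran to the end of the input
        out.append('\\' + stars + '/')
    return '\n'.join(out)

old = '\n'.join([
    '//====',
    '// some text',
    '//====',
    '// more text',
    'int x;',
])
new = convert_comments(old, width=5)
print(new)
```

Plain // comments that are not preceded by a //==== rule are left alone, which may or may not be what the original files need.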

Remove comments from C/C++ code

Is there an easy way to remove comments from a C/C++ source file without doing any preprocessing? (I.e., I think you can use gcc -E, but this will expand macros.) I just want the source code with comments stripped; nothing else should be changed.
EDIT:
Preference towards an existing tool. I don't want to have to write this myself with regexes, I foresee too many surprises in the code.
Run the following command on your source file:
gcc -fpreprocessed -dD -E test.c
Thanks to KennyTM for finding the right flags. Here’s the result for completeness:
test.c:
#define foo bar
foo foo foo
#ifdef foo
#undef foo
#define foo baz
#endif
foo foo
/* comments? comments. */
// c++ style comments
gcc -fpreprocessed -dD -E test.c:
#define foo bar
foo foo foo
#ifdef foo
#undef foo
#define foo baz
#endif
foo foo
It depends on how perverse your comments are. I have a program scc to strip C and C++ comments. I also have a test file for it, and I tried GCC (4.2.1 on MacOS X) with the options in the currently selected answer - and GCC doesn't seem to do a perfect job on some of the horribly butchered comments in the test case.
NB: This isn't a real-life problem - people don't write such ghastly code.
Consider this subset (36 of 135 lines total) of the test case:
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
On my Mac, the output from GCC (gcc -fpreprocessed -dD -E subset.c) is:
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
The output from 'scc' is:
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
The output from 'scc -C' (which recognizes double-slash comments) is:
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
Source for SCC now available on GitHub
The current version of SCC is 6.60 (dated 2016-06-12), though the Git versions were created on 2017-01-18 (in the US/Pacific time zone). The code is available from GitHub at https://github.com/jleffler/scc-snapshots. You can also find snapshots of the previous releases (4.03, 4.04, 5.05) and two pre-releases (6.16, 6.50) — these are all tagged release/x.yz.
The code is still primarily developed under RCS. I'm still working out how I want to use sub-modules or a similar mechanism to handle common library files like stderr.c and stderr.h (which can also be found in https://github.com/jleffler/soq).
SCC version 6.60 attempts to understand C++11, C++14 and C++17 constructs such as binary constants, numeric punctuation, raw strings, and hexadecimal floats. It defaults to C11 mode operation. (Note that the meaning of the -C flag — mentioned above — flipped between version 4.0x described in the main body of the answer and version 6.60 which is currently the latest release.)
gcc -fpreprocessed -dD -E did not work for me but this program does it:
#include <stdio.h>
static void process(FILE *f)
{
int c;
while ( (c=getc(f)) != EOF )
{
if (c=='\'' || c=='"') /* literal */
{
int q=c;
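do
{
putchar(c);
if (c=='\\') { c=getc(f); if (c==EOF) break; putchar(c); } /* copy the escaped character */
c=getc(f);
} while (c!=q && c!=EOF); /* stop at the closing quote or at end of input */
if (c!=EOF) putchar(c);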
}
else if (c=='/') /* opening comment ? */
{
c=getc(f);
if (c!='*') /* no, recover */
{
putchar('/');
ungetc(c,f);
}
else
{
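int p;
putchar(' '); /* replace comment with space */
c=getc(f); /* first character of the comment body, so the opening star cannot close it */
do
{
p=c;
c=getc(f);
} while (c!=EOF && (c!='/' || p!='*'));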
}
}
else
{
putchar(c);
}
}
}
int main(int argc, char *argv[])
{
process(stdin);
return 0;
}
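The character-by-character idea of the C program above can be sketched compactly in Python. This is an illustration under simplifying assumptions: it also handles // comments (which the C program does not), and it ignores line continuations, trigraphs and raw strings. The name strip_c_comments is made up for this example.

```python
def strip_c_comments(src):
    """Copy src, replacing each /* ... */ or // comment with one space."""
    out = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c in '"\'':                      # string or character literal
            j = i + 1
            while j < n and src[j] != c:
                j += 2 if src[j] == '\\' else 1   # skip escaped characters
            out.append(src[i:j + 1])
            i = j + 1
        elif src.startswith('/*', i):       # block comment
            end = src.find('*/', i + 2)
            i = n if end == -1 else end + 2
            out.append(' ')
        elif src.startswith('//', i):       # line comment, the newline is kept
            end = src.find('\n', i)
            i = n if end == -1 else end
            out.append(' ')
        else:
            out.append(c)
            i += 1
    return ''.join(out)

print(strip_c_comments('x = 5; /* assignment */ y = "/* text */"; // done'))
```

Searching for the closing */ only from two characters past the opener means that /*/ is not mistaken for a complete comment.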
There is a stripcmt program that can do this:
StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the command line.
(per hlovdal's answer to: question about Python code for this)
This is a perl script to remove //one-line and /* multi-line */ comments
#!/usr/bin/perl
undef $/;
$text = <>;
$text =~ s/\/\/[^\n\r]*(\n\r)?//g;
$text =~ s/\/\*+([^*]|\*(?!\/))*\*+\///g;
print $text;
It requires your source file as a command line argument.
Save the script to a file, let say remove_comments.pl
and call it using the following command: perl -w remove_comments.pl [your source file]
Hope it will be helpful
I had this problem as well. I found this tool (Cpp-Decomment), which worked for me. However, it does not handle a comment line that continues onto the next line. E.g.:
// this is my comment \
comment continues ...
I couldn't find a way to handle this case in the program, so I just searched for the lines it missed and fixed them manually. There may be an option for that, or you could change the program's source to handle it.
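To show what handling the continuation would involve, here is a minimal Python sketch that removes // comments and treats a backslash directly before the newline as continuing the comment. It is not related to Cpp-Decomment's source, it ignores string literals, and the name strip_line_comments is invented for this example.

```python
def strip_line_comments(src):
    """Remove // comments, treating backslash-newline as a continuation."""
    out = []
    i, n = 0, len(src)
    while i < n:
        if src.startswith('//', i):
            i += 2
            while i < n and src[i] != '\n':
                # A backslash directly before the newline continues the comment.
                i += 2 if src.startswith('\\\n', i) else 1
        else:
            out.append(src[i])
            i += 1
    return ''.join(out)

text = 'int a; // this is my comment \\\ncomment continues ...\nint b;'
print(strip_line_comments(text))
```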
Because you use C, you might want to use something that's "natural" to C. You can use the C preprocessor to just remove comments. The examples given below work with the C preprocessor from GCC. They should work the same or in similar ways with other C preprocessors as well.
For C, use
cpp -dD -fpreprocessed -o output.c input.c
It also works for removing comments from JSON, for example like this:
cpp -P -o - - <input.json >output.json
In case your C preprocessor is not accessible directly, you can try to replace cpp with cc -E, which calls the C compiler telling it to stop after the preprocessor stage.
In case your C compiler binary is not cc you can replace cc with the name of your C compiler binary, for example clang. Note that not all preprocessors support -fpreprocessed.
I wrote a C program of around 200 lines, using only the standard C library, which removes comments from a C source file.
qeatzy/removeccomments
behavior
C style comment that span multi-line or occupy entire line gets zeroed out.
C style comment in the middle of a line remain unchanged. eg, void init(/* do initialization */) {...}
C++ style comment that occupy entire line gets zeroed out.
C string literal being respected, via checking " and \".
handles line-continuation. If previous line ending with \, current line is part of previous line.
line number remain the same. Zeroed out lines or part of line become empty.
testing & profiling
I tested it with the largest CPython source file, which contains many comments.
In this case it does the job correctly and fast, 2-5 times faster than gcc:
time gcc -fpreprocessed -dD -E Modules/unicodeobject.c > res.c 2>/dev/null
time ./removeccomments < Modules/unicodeobject.c > result.c
usage
/path/to/removeccomments < input_file > output_file
I believe you can remove comments from C with a single statement. Each command deletes everything from the comment opener to the end of the line:
perl -i -pe 's/\/\*(.*)//g' file.c This command removes /* C-style comments
perl -i -pe 's/\/\/(.*)//g' file.cpp This command removes // C++-style comments
The only problem with these commands is that they cannot remove comments that span more than one line, but you can use these regexes as a starting point for logic that removes multi-line comments.
Recently I wrote some Ruby code to solve this problem. I have considered the following cases:
comments in strings
several multi-line-style comments on one line (fixing the greedy match)
multi-line comments spanning multiple lines
Here is the code:
It uses the following placeholders to preprocess each line in case comment tokens appear in strings. If one of these placeholders appears in your code, uh, bad luck. You can replace them with more complex strings.
MUL_REPLACE_LEFT = "MUL_REPLACE_LEFT"
MUL_REPLACE_RIGHT = "MUL_REPLACE_RIGHT"
SIG_REPLACE = "SIG_REPLACE"
USAGE: ruby -w inputfile outputfile
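The Ruby script itself is not reproduced above, so the following Python sketch is only a guess at the placeholder idea it describes: mask comment tokens inside double-quoted strings, strip the comments, then restore the tokens. The three placeholder names come from the answer; the function names and regexes are assumptions.

```python
import re

# Placeholder names are taken from the answer; everything else is assumed.
MUL_REPLACE_LEFT = "MUL_REPLACE_LEFT"
MUL_REPLACE_RIGHT = "MUL_REPLACE_RIGHT"
SIG_REPLACE = "SIG_REPLACE"

def protect(match):
    """Hide comment tokens that occur inside a double-quoted string."""
    s = match.group(0)
    return (s.replace('/*', MUL_REPLACE_LEFT)
             .replace('*/', MUL_REPLACE_RIGHT)
             .replace('//', SIG_REPLACE))

def strip_line(line):
    line = re.sub(r'"(?:\\.|[^"\\])*"', protect, line)   # 1. mask strings
    line = re.sub(r'/\*.*?\*/', '', line)                 # 2. non-greedy block comments
    line = re.sub(r'//.*', '', line)                      # 3. line comments
    return (line.replace(MUL_REPLACE_LEFT, '/*')          # 4. restore the tokens
                .replace(MUL_REPLACE_RIGHT, '*/')
                .replace(SIG_REPLACE, '//'))

print(strip_line('puts("http://x /* y */"); /* a */ x = 1; /* b */ // c'))
```

As the answer warns, a source line that already contains one of the placeholder words would be corrupted by the restore step.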
I know it's late, but I thought I'd share my code and my first attempt at writing a compiler.
Note: this does not account for "*/" inside a multiline comment, e.g. /*...."*/"...*/. Then again, gcc 4.8.1 doesn't either.
void function_removeComments(char *pchar_sourceFile, long long_sourceFileSize)
{
long long_sourceFileIndex = 0;
long long_logIndex = 0;
int int_EOF = 0;
for (long_sourceFileIndex=0; long_sourceFileIndex < long_sourceFileSize;long_sourceFileIndex++)
{
if (pchar_sourceFile[long_sourceFileIndex] == '/' && int_EOF == 0)
{
long_logIndex = long_sourceFileIndex; // log "possible" start of comment
if (long_sourceFileIndex+1 < long_sourceFileSize) // array bounds check given we want to peek at the next character
{
if (pchar_sourceFile[long_sourceFileIndex+1] == '*') // multiline comment
{
for (long_sourceFileIndex+=2;long_sourceFileIndex < long_sourceFileSize; long_sourceFileIndex++)
{
if (pchar_sourceFile[long_sourceFileIndex] == '*' && pchar_sourceFile[long_sourceFileIndex+1] == '/')
{
// since we've found the end of multiline comment
// we want to increment the pointer position two characters
// accounting for "*" and "/"
long_sourceFileIndex+=2;
break; // terminating sequence found
}
}
// didn't find terminating sequence so it must be eof.
// set file pointer position to initial comment start position
// so we can display file contents.
if (long_sourceFileIndex >= long_sourceFileSize)
{
long_sourceFileIndex = long_logIndex;
int_EOF = 1;
}
}
else if (pchar_sourceFile[long_sourceFileIndex+1] == '/') // single line comment
{
// since we know its a single line comment, increment file pointer
// until we encounter a new line or its the eof
for (long_sourceFileIndex++; pchar_sourceFile[long_sourceFileIndex] != '\n' && pchar_sourceFile[long_sourceFileIndex] != '\0'; long_sourceFileIndex++);
}
}
}
printf("%c",pchar_sourceFile[long_sourceFileIndex]);
}
}
#include<stdio.h>
int main(void)
{
int c; // int rather than char, so that EOF can be told apart from real characters
char tmp = '\0';
int inside_comment = 0; // A flag to check whether we are inside comment
while((c = getchar()) != EOF) {
if(tmp) {
if(c == '/') {
while((c = getchar()) != '\n' && c != EOF); // skip to end of line, stop at end of input
tmp = '\0';
putchar('\n');
continue;
}else if(c == '*') {
inside_comment = 1;
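while(inside_comment) {
while((c = getchar()) != '*' && c != EOF); // find the next star
while((c = getchar()) == '*'); // skip any further stars
if(c == '/' || c == EOF){ // end of comment or end of input
tmp = '\0';
inside_comment = 0;
}
}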
continue;
}else {
putchar(c);
tmp = '\0';
continue;
}
}
if(c == '/') {
tmp = c;
} else {
putchar(c);
}
}
return 0;
}
This program handles both kinds of comments, i.e. // and /* ... */.