VIM, Automatic formatting, Code-Guidelines, C++ - c++

I want to be able to automatically format code for the following rules using vim:
Rule 1): Multi-line if expressions must be indented with 3 spaces. Example:
if(a &&
   b)
(Note: b has three space-indent relative to the parent if, note that current vim behavior is 4)
Rule 2): parameters separated by space. Example:
function_call(a, b, c);
Rule 3): No space around assignment operators. Example:
int a=x;
Rule 4): Reference/dereference operator is attached to variable name not type. Example:
int &x = b;
Where possible, I want vim to do this automatically as I am typing; however, if this is not possible, identifying formatting that runs counter to the above rules (by marking it as an error) would also be helpful.

You can set auto-indentation rules in a custom indent file. Check out examples in the "indent" directory, somewhere like /usr/share/vim/vim74/indent, or in the Vim source code distribution.
You can set error highlighting rules in a custom syntax file. Find examples in the "syntax" directory, somewhere like /usr/share/vim/vim74/syntax, or again in the Vim source code distribution. Here's an example for JSON files:
" Syntax: Decimals smaller than one should begin with 0 (so .1 should be 0.1).
syn match jsonNumError "\:\#<=[[:blank:]\r\n]*\zs\.\d\+"
If you want to actually re-format code automatically as you go you might need a special plugin like vim-autoformat and/or an external tool like ClangFormat.

Regarding indenting, and so on, check the options :h 'sw', :h 'cindent', :h 'cinoptions'...
Regarding where spaces and newlines shall be inserted,
For code already typed, clang-format is indeed the best way to go to reformat code. There is a plugin for vim.
For snippets, brackets and so on, lately I've worked on a plugin aimed at formatting text inserted by other plugins. Excessively inspired, I've named the core plugin lh-style. It's used by mu-template (my snippet/templating plugin), and lh-brackets.
For other stuff you'll want to reformat on the fly, it'll be a little bit more complex. Maybe lh-style could help, I don't know; I haven't given much thought to the subject yet.
For instance, outside comments and strings, = shall be expanded into :
itself after a [ (lambdas),
<BS>=<space>, after =, >, <, ! followed by a space
<space>=<space> otherwise
EDIT: I got it all wrong, it does exactly the contrary of what you're looking for.
It'd be something like:
" ftplugin/c/mymappings.vim
function! s:InsertExpr(char) abort
  let col = col('.')
  let line = getline('.')
  " Insert the key unchanged inside comments and string/character literals
  let syn = synIDattr(synID(line('.'), col-1, 1), 'name')
  if syn =~? 'comment\|string\|character\|doxygen'
    return a:char
  endif
  " Text to the left of the cursor decides what goes before the operator
  let lcut = getline('.')[: col-2]
  let before =
        \ lcut =~ '[=<>!] $' ? "\<bs>"
        \ : lcut =~ "[=<>![ \t\n]$" ? ''
        \ : ' '
  let after = line[col-1] =~ "[ \t\n\\]]" ? '' : ' '
  return before.a:char.after
endfunction
inoremap <buffer> <expr> = <sid>InsertExpr('=')
inoremap <buffer> <expr> < <sid>InsertExpr('<')
inoremap <buffer> <expr> > <sid>InsertExpr('>')

Related

Error while compiling regex function, why am I getting this issue?

My RAKU Code:
sub comments {
    if ($DEBUG) { say "<filtering comments>\n"; }
    my @filteredtitles = ();
    # This loops through each track
    for @tracks -> $title {
        ##########################
        # LAB 1 TASK 2 #
        ##########################
        ## Add regex substitutions to remove superfluous comments and all that follows them
        ## Assign to $_ with smartmatcher (~~)
        ##########################
        $_ = $title;
        if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
            # Repeat for the other symbols
            ########################## End Task 2
            # Add the edited $title to the new array of titles
            @filteredtitles.push: $_;
        }
    }
    # Updates @tracks
    return @filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '@_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with @raiph's answer, here's what I have:
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that will match one character, as long as it is an open parenthesis (represented by the \() or a caret (^). When they go inside the angle brackets/square brackets combo, it means that is an Enumerated character class.
Then, the: S introduces the non-destructive substitution, i.e., a quoting construct that will make regex-based substitutions over the topic variable $_ but will not modify it, just return its value with the modifications requested. In the code above, S:g brings the adverb :g or :global (see the global adverb in the adverbs section of the documentation) to play, meaning (in the case of the substitution) "please make as many as possible of this substitution" and the final / marks the end of the substitution text, and as it is adjacent to the second /, that means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the: .map method, documented here, will be applied to any Iterable (List, Array and all their alikes) and will return a sequence based on each element of the Iterable, altered by a Code passed to it. So, something like:
@x.map({ S:g / foo /bar/ })
essentially means "please return a Sequence of every item in @x, modified by substituting any appearance of the substring foo for bar" (nothing will be altered in @x). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable @tracks (that has my scope)
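For readers more at home in Python, the same "map a global substitution over a list" idea can be sketched as follows. This is only an analogy to the Raku one-liner above, not a translation of Raku semantics; the variable name titles is a stand-in introduced here for the list of titles.

```python
import re

titles = ["Foo", "Ba(r", "B^az"]  # stand-in for the list of titles

# Inside a character class, "(" is literal and so is a "^" that is not first.
# re.sub replaces every match, like the :g adverb, and returns a new string.
tracks = [re.sub(r"[(^]", "", title) for title in titles]

print(tracks)
```

As with S///, the original strings are left untouched; the comprehension builds a new list.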
Here's what I ended up with:
my @tracks = <Foo Ba(r B^az>;
sub comments {
    my @filteredtitles;
    for @tracks -> $_ is copy {
        s:g / <[\(^]> //;
        @filteredtitles.push: $_;
    }
    return @filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { @_ }
But there is no way the code you've shared could generate that error because it requires that there is an @_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)

A Perl 6 Regex to match a Perl 6 delimited comment

Anyone have a Perl 6 regular expression that will match Perl 6 delimited comments? I would prefer something that's short rather than a full grammar, but I rule out nothing.
As an example of what I am looking for, I want something that can parse the comments in here:
#`{ foo {} bar }
#`« woo woo »
say #`(
This is a (
long )
multiliner()) "You rock!"
#`{{ { And don't forget the tricky repeating delimiters }}
My overall goal is to be able to take a source file and strip the pod and comments and then do interesting things with the code that is left. Stripping line comments and pod is pretty easy, but delimited comments requires additional finesse. I also want this solution to be small and using only Perl 6 core so I can stick it in my dotfiles repo without having external dependencies.
Matching your examples
my %openers-closers = < { } « » ( ) >; # (many more in reality)
my @openers = %openers-closers.keys; # { « ( ...
my ($open, $close); # possibly multiple chars
my token comment { '#`' <&open> <&middle> <&close> }
my token open {
    # Store first delimiter char: Slurp as many as are repeated:
    ( ( @openers ) $0* )
    # Store the full (possibly multiple character) delimiters:
    { $open = ~$0; $close = %openers-closers{$0[0]} x $0.chars }
}
my token middle {
    :my $nest-level; # for tracking nesting
    [
        # Continue if nested: or if not at unnested end delimiter:
        [ <?{$nest-level}> || <!&close> ]
        # Match either a nested delimiter: or a single character:
        ( $open || $close || . )
        # Keep track of nesting:
        { $_ = ~$0.tail; # set topic to latest match in list
          $nest-level++ when $open; $nest-level-- when $close }
    ]*
}
my token close { $close }
.say for $your-examples ~~ m:g / <.&comment> /
displays:
「{ foo {} bar }」
「« woo woo »」
「(
This is a (
long )
multiliner())」
「{{ { And don't forget the tricky repeating delimiters }}」
Hopefully the code is self-explanatory if you know Raku regexes. Please use the comments if you want clarification of any of it.
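For anyone who wants to check the nesting logic outside of Raku, here is a rough Python sketch of the same approach: read the opener, note how many times it repeats, then count nested opener/closer runs until the depth returns to zero. The names PAIRS and find_delimited_comments are made up for this illustration, only the three delimiter pairs from the examples are included, and it mirrors the regex above rather than Rakudo's real rules.

```python
# Only the three opener/closer pairs that occur in the examples.
PAIRS = {"{": "}", "«": "»", "(": ")"}

def find_delimited_comments(src):
    """Return the delimited part of every #`... comment found in src."""
    found = []
    i = 0
    while True:
        i = src.find("#`", i)
        if i == -1:
            return found
        start = i + 2
        if start >= len(src) or src[start] not in PAIRS:
            i = start  # not a delimited comment, keep searching
            continue
        ch = src[start]
        n = 1
        while start + n < len(src) and src[start + n] == ch:
            n += 1  # count repeated opener characters, e.g. "{{"
        opener, closer = ch * n, PAIRS[ch] * n
        pos, depth = start + n, 1
        while pos < len(src) and depth:
            if src.startswith(opener, pos):
                depth += 1
                pos += n
            elif src.startswith(closer, pos):
                depth -= 1
                pos += n
            else:
                pos += 1
        if depth:  # unterminated comment: nothing more to find
            return found
        found.append(src[start:pos])
        i = pos
```

Called on the four examples from the question, this returns the same four delimited spans that the Raku version displays.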
Looking at related Rakudo source code
I wrote the above without referring to Rakudo's source code. (I wanted to see what I came up with without doing so.)
But I've now looked at the source code, which imo would be a more or less mandatory thing to do for anyone trying to do what you're trying to do and serious about understanding how well it might work in the general case.
As a starting point, I was particularly interested in seeing if I could figure out why feeding this code to rakudo (2018.12):
#`{{ {{ And don't forget the tricky repeating delimiters } }}
yields the rather LTA (Less Than Awesome) compiler error:
Starter {{ is immediately followed by a combining codepoint...
This doesn't look directly relevant to your question but I encountered it when trying to understand the nested delimiter rules.
So when I got to this part of my answer I started by searching the Rakudo repo for "immediately followed". That led to a fail-terminator method in the Raku grammar. (Perhaps not of interest to you but it is to me.)
Here's what else I found in the standard grammar that imo is directly related to what you're trying to do, or at least understanding precisely what the code says the rules are about matching comments:
The comment:sym<#`(...)> token that parses these comments. This leads to:
The list of openers. This list should replace the measly 3 opener/closer pairs in my code that just match your examples.
The quibble token. This seems to be a generic "parse 'quoted' (delimited) thing". It leads to:
The babble token. This establishes a "start" and "stop" with this code:
$<B>=[<?before .>]
{
# Work out the delimiters.
my $c := $/;
my @delims := $c.peek_delimiters($c.target, $c.pos);
my $start := @delims[0];
my $stop := @delims[1];
The rule peek_delimiters is not in the Raku grammar file.
A search in the Rakudo repo shows it's not anywhere in Rakudo or Raku.
A search in NQP yields a routine in nqp's grammar (from which the Raku grammar inherits, which is why the peek_delimiters call works and why I looked in NQP when I didn't find it in Rakudo/Raku).
I'll stop at this point to draw a conclusion.
Conclusion
You've got a regex. It might work out as you intend. I don't know.
If you end up investigating the above Rakudo/NQP code and understand it well enough to write a walk through of what quibble, babble, nibble, et al do, or discover a good existing write up (I haven't searched for one yet), please add a comment to this answer linking to it. I'll do likewise. TIA!

Rewriting C macro code with VIM search & replace

I've got a file that uses an outdated macro to read 32 bit integers,
READ32(dest, src)
I need to replace all calls with
dest = readUint32(&src);
I'm trying to write a SED style Vim search & replace command, but not having luck.
I can match the 1st part using READ32([a-z]\+, cmd) using the / search prompt, but it does not seem to match in the :s syntax.
Here's what I finally figured out to work:
:%s/READ32(\(\a\+\),\(\a\+\)/\1 = readUint32(\&\2);
The trick is wrapping the values you want to store in \1 and \2 in \( and \). The other trick is that you have to escape the & operator, because & in a Vim replacement means "the whole match".
EDIT: improved further as I refined it:
:%s/READ32(\(\w\+\),\s*\(\w\+\)/\1 = readUint32(\&\2);
Changed \a to \w as I had variables with _ in them.
Added \s* to take care of white space issues between the , and second variable.
Now just trying to deal with c++ style variables of style class.variable.subvariable
EDIT 2:
replaced \w with [a-zA-Z0-9_.] to catch all of the ways my variables were named.
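For comparison, the same rewrite can be sketched outside Vim with Python's re module. This is only an illustration under a few assumptions: the helper name rewrite_read32 is made up here, the character class is the one from EDIT 2, and the pattern also consumes the closing parenthesis so that none is left behind.

```python
import re

# Group 1 is the destination, group 2 the source. The character class is the
# one from EDIT 2, so names like class.variable.subvariable are accepted.
READ32 = re.compile(r"READ32\(\s*([a-zA-Z0-9_.]+)\s*,\s*([a-zA-Z0-9_.]+)\s*\)")

def rewrite_read32(text):
    """Turn READ32(dest, src) into dest = readUint32(&src); (hypothetical helper)."""
    # Unlike in Vim, "&" has no special meaning in a Python replacement
    # template, so it does not need escaping.
    return READ32.sub(r"\1 = readUint32(&\2);", text)

print(rewrite_read32("READ32(hdr.size, cmd)"))
```

If the original calls already end in a semicolon, drop the one in the replacement template to avoid doubling it.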
This should do what you want or at least get you started:
%s-READ32(\s*\(\i\+\)\s*,\s*\(\i\+\)\s*)-\1 = readUint32(\&\2);-g
I'd do the macro style again: hit * to 'highlight' search for READ32.
Now, we are going to record a macro (q..qq):
n (move to next match)
cwreadUint32Esc (change the function name)
wwdt, (delete the first argument)
"_dw (remove the redundant ,)
bbPa=Esc (insert the result variable appending = before readUint32)
A; (append ; to the end of the line)
Now you can just repeat the macro (1000@q).

Vim: Get content of syntax element under cursor

I'm on a highlighted complex syntax element and would like to get its content. Can you think of any way to do this?
Maybe there's some way to search for a regular expression so that it contains the cursor?
EDIT
Okay, example. The cursor is inside a string, and I want to get the text, the content of this syntactic element. Consider the following line:
String myString = "Foobar, [CURSOR]cool \"string\""; // And some other "rubbish"
I want to write a function that returns
"Foobar, cool \"string\""
If I understood the question correctly, this may help. I found this gem some time ago (I don't remember where) and used it to understand how syntax highlighting works in Vim:
" Show syntax highlighting groups for word under cursor
nmap <leader>z :call <SID>SynStack()<CR>
function! <SID>SynStack()
  if !exists("*synstack")
    return
  endif
  echo map(synstack(line('.'), col('.')), 'synIDattr(v:val, "name")')
endfunc
The textobj-syntax plugin might help. It creates a custom text object, so that you can run viy to visually select the current syntax highlighted element. The plugin depends on the textobj-user plugin, which is a framework for creating custom text objects.
This is a good use for text objects (:help text-objects). To get the content you're looking for (Foobar, cool \"string\"), you can just do:
yi"
y = yank
i" = the text object "inner quoted string"
The yank command by default uses the unnamed register ("", see :help registers), so you can access the yanked contents programmatically using the getreg() function or the shorthand @{register-name}:
:echo 'String last yanked was:' getreg('"')
:echo 'String last yanked was:' @"
Or you can yank the contents into a different register:
"qyi"
yanks the inner quoted string into the "q register, so it doesn't conflict with standard register usage (and can be accessed as the @q variable).
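Outside Vim, the "which string literal contains the cursor" idea can be sketched in Python. This is an illustration only: string_at and STRING_LITERAL are names invented here, the column is assumed to be 0-based, and only double-quoted literals with backslash escapes are considered.

```python
import re

# A double-quoted literal: any run of escaped characters or non-quote characters.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')

def string_at(line, col):
    """Return the double-quoted literal that spans 0-based column col, or None."""
    for match in STRING_LITERAL.finditer(line):
        if match.start() <= col < match.end():
            return match.group(0)
    return None

line = 'String myString = "Foobar, cool \\"string\\""; // And some other "rubbish"'
print(string_at(line, line.index("cool")))
```

Because the literals are matched from left to right, the quotes inside the trailing comment are only paired with each other and never with the real string.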
EDIT: Seeing that the plugin mentioned by nelstrom works similar to my original approach, I settled on this slightly more elegant solution:
fu s:OnLink()
  let stack = synstack(line("."), col("."))
  return !empty(stack)
endf
normal mc
normal $
let lineLength = col(".")
normal `c
while col(".") > 1
  normal h
  if !s:OnLink()
    normal l
    break
  endif
endwhile
normal ma`c
while col(".") < lineLength
  normal l
  if !s:OnLink()
    normal h
    break
  endif
endwhile
normal mb`av`by

Remove C and C++ comments using Python?

I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)
I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*, or having a // inside a /* */.
Ideally, I would prefer a non-naive implementation that properly handles awkward cases.
This handles C++-style comments, C-style comments, strings and simple nesting thereof.
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    # Comments first, then character and string literals (kept unchanged)
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)
Strings need to be included, because comment markers inside them do not start a comment.
Edit: re.sub didn't take any flags, so had to compile the pattern first.
Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.
Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather than an empty string.
C (and C++) comments cannot be nested. Regular expressions work well:
//.*?\n|/\*.*?\*/
This requires the “Single line” flag (re.S) because a C comment can span multiple lines.
import re

def stripcomments(text):
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)
This code should work.
/EDIT: Notice that my above code actually makes an assumption about line endings! This code won't work on a Mac text file. However, this can be amended relatively easily:
//.*?(\r\n?|\n)|/\*.*?\*/
This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).
/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.
Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...
" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"
"And escaped double quotes at the end of a string\""
aa '\\
n' OK
aa "\""
aa "\
\n"
This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C fla
The C++/C99 comment number 1 has finished.
This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.
This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line, but the line splicing doesn't care about how many there are, but the subsequent processing might. Etc. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).
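To illustrate just the line-splicing point, here is a small Python sketch that deletes backslash-newline pairs before applying the comment/literal regex used elsewhere in this thread. It is deliberately incomplete: trigraphs are ignored, original line numbers are lost, and the names splice_lines and strip_comments_after_splicing are invented for this example.

```python
import re

# Same pattern as in the regex answers: comments first, then char and string literals.
COMMENT_OR_LITERAL = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def splice_lines(text):
    """Delete every backslash that is immediately followed by a newline."""
    return text.replace("\\\n", "")

def strip_comments_after_splicing(text):
    def replacer(match):
        s = match.group(0)
        # Comments become a single space; literals are kept unchanged.
        return " " if s.startswith("/") else s
    return COMMENT_OR_LITERAL.sub(replacer, splice_lines(text))

sample = "a /\\\n/ hidden comment\nb /\\\n* block *\\\n/ c\n"
print(repr(strip_comments_after_splicing(sample)))
```

Splicing first means that a comment opener or closer split across lines is seen as ordinary // or /* ... */ by the regex, at the cost of changing the line structure of the output.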
This posting provides a coded-out version of the improvement to Markus Jarderot's code that was described by atikat, in a comment to Markus Jarderot's posting. (Thanks to both for providing the original code, which saved me a lot of work.)
To describe the improvement somewhat more fully: The improvement keeps the line numbering intact. (This is done by keeping the newline characters intact in the strings by which the C/C++ comments are replaced.)
This version of the C/C++ comment removal function is suitable when you want to generate error messages to your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).
import re
def removeCCppComment( text ) :
    def blotOutNonNewlines( strIn ) :  # Return a string containing only the newline chars contained in strIn
        return "" + ("\n" * strIn.count('\n'))
    def replacer( match ) :
        s = match.group(0)
        if s.startswith('/'):  # Matched string is //...EOL or /*...*/ ==> Blot out all non-newline chars
            return blotOutNonNewlines(s)
        else:                  # Matched string is '...' or "..." ==> Keep unchanged
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)
I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be used using the following code:
import subprocess

# source_code is a string with the source code.
# -f reads the sed script from a file; the script is written to be run as sed -nf.
result = subprocess.run(['sed', '-n', '-f', '/path/to/remccoms3.sed'],
                        input=source_code, capture_output=True, text=True)
return_code = result.returncode
stripped_code = result.stdout
In this program, source_code is the variable holding the C/C++ source code, and stripped_code will eventually hold the C/C++ code with the comments removed. Of course, if you have the file on disk, you could pass file handles pointing to those files as stdin and stdout instead (input opened in read mode, output in write mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.
This will probably be better than a pure Python solution; no need to reinvent the wheel.
The regular expression cases will fall down in some situations, like where a string literal contains a subsequence which matches the comment syntax. You really need a parse tree to deal with this.
you may be able to leverage py++ to parse the C++ source with GCC.
Py++ does not reinvent the wheel. It
uses GCC C++ compiler to parse C++
source files. To be more precise, the
tool chain looks like this:
source code is passed to GCC-XML
GCC-XML passes it to GCC C++ compiler
GCC-XML generates an XML description
of a C++ program from GCC's internal
representation. Py++ uses pygccxml
package to read GCC-XML generated
file. The bottom line - you can be
sure, that all your declarations are
read correctly.
or, maybe not. regardless, this is not a trivial parse.
Regarding RE-based solutions: you are unlikely to find an RE that handles all possible 'awkward' cases correctly, unless you constrain the input (e.g. no macros). For a bulletproof solution, you really have no choice but to leverage the real grammar.
I'm sorry this not a Python solution, but you could also use a tool that understands how to remove comments, like your C/C++ preprocessor. Here's how GNU CPP does it.
cpp -fpreprocessed foo.c
There is also a non-python answer: use the program stripcmt:
StripCmt is a simple utility written
in C to remove comments from C, C++,
and Java source files. In the grand
tradition of Unix text processing
programs, it can function either as a
FIFO (First In - First Out) filter or
accept arguments on the commandline.
The following worked for me:
from subprocess import check_output

class Util:
    def strip_comments(self, source_code):
        # source_code is the path of the file to run through cpp
        process = check_output(['cpp', '-fpreprocessed', source_code],
                               shell=False, universal_newlines=True)
        return process

if __name__ == "__main__":
    util = Util()
    print(util.strip_comments("somefile.ext"))
This is a combination of the subprocess and the cpp preprocessor. For my project I have a utility class called "Util" that I keep various tools I use/need.
I have been using pygments to parse the string and then ignore all tokens that are comments. It works like a charm with any lexer on the pygments list, including JavaScript, SQL, and C-like languages.
from pygments import lex
from pygments.token import Token as ParseToken
def strip_comments(replace_query, lexer):
    generator = lex(replace_query, lexer)
    line = []
    lines = []
    for token in generator:
        token_type = token[0]
        token_text = token[1]
        if token_type in ParseToken.Comment:
            continue
        line.append(token_text)
        if token_text == '\n':
            lines.append(''.join(line))
            line = []
    if line:
        line.append('\n')
        lines.append(''.join(line))
    strip_query = "\n".join(lines)
    return strip_query
Working with C like languages:
from pygments.lexers.c_like import CLexer
strip_comments("class Bla /*; complicated // stuff */ example; // out",CLexer())
# 'class Bla example; \n'
Working with SQL languages:
from pygments.lexers.sql import SqlLexer
strip_comments("select * /* this is cool */ from table -- more comments",SqlLexer())
# 'select * from table \n'
Working with Javascript Like Languages:
from pygments.lexers.javascript import JavascriptLexer
strip_comments("function cool /* not cool*/(x){ return x++ } /** something **/ // end",JavascriptLexer())
# 'function cool (x){ return x++ } \n'
Since this code only removes the comments, any strange value will remain. So, this is a very robust solution that is able to deal even with invalid inputs.
You don't really need a parse tree to do this perfectly, but you do in effect need the token stream equivalent to what is produced by the compiler's front end. Such a token stream must necessarily take care of all the weirdness such as line-continued comment start, comment start in string, trigraph normalization, etc. If you have the token stream, deleting the comments is easy. (I have a tool that produces exactly such token streams, as, guess what, the front end of a real parser that produces a real parse tree :).
The fact that the tokens are individually recognized by regular expressions suggests that you can, in principle, write a regular expression that will pick out the comment lexemes. The real complexity of the set regular expressions for the tokenizer (at least the one we wrote) suggests you can't do this in practice; writing them individually was hard enough. If you don't want to do it perfectly, well, then, most of the RE solutions above are just fine.
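To make the idea of scanning lexemes, rather than pattern-matching whole comments, a little more concrete, here is a minimal hand-written scanner in Python. It is a sketch that is far simpler than a real front end: the name strip_comments_scanner is invented here, and it assumes no trigraphs, no backslash-newline splices, no C++ raw string literals and no digit separators.

```python
def strip_comments_scanner(text):
    """Copy text, skipping // and /* */ comments but keeping literals verbatim."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in "\"'":
            # String or character literal: copy through the matching quote,
            # stepping over backslash escapes.
            j = i + 1
            while j < n and text[j] != ch:
                j += 2 if text[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(text[i:j])
            i = j
        elif text.startswith("//", i):
            # Line comment: skip to the newline, which is kept.
            j = text.find("\n", i)
            i = n if j == -1 else j
        elif text.startswith("/*", i):
            # Block comment: replace with one space so tokens stay separated.
            j = text.find("*/", i + 2)
            out.append(" ")
            i = n if j == -1 else j + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

code = 'int/**/x = 5; // set x\nchar *s = "// not a comment"; /* gone */\n'
print(strip_comments_scanner(code))
```

Each branch corresponds to one kind of lexeme the scanner has to recognise; a real tokenizer has many more, which is where the practical difficulty mentioned above comes from.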
Now, why you would want to strip comments is beyond me, unless you are building a code obfuscator. In this case, you have to have it perfectly right.
I ran across this problem recently when I took a class where the professor required us to strip javadoc from our source code before submitting it to him for a code review. We had to do this several times, but we couldn't just remove the javadoc permanently because we were required to generate javadoc html files as well. Here is a little python script I made to do the trick. Since javadoc starts with /** and ends with */, the script looks for these tokens, but the script can be modified to suit your needs. It also handles single line block comments and cases where a block comment ends but there is still non-commented code on the same line as the block comment ending. I hope this helps!
WARNING: This script modifies the contents of files passed in and saves them to the original files. It would be wise to have a backup somewhere else.
#!/usr/bin/python
"""
A simple script to remove block comments of the form /** */ from files
Use example: ./strip_comments.py *.java
Author: holdtotherod
Created: 3/6/11
"""
import sys
import fileinput

for file in sys.argv[1:]:
    inBlockComment = False
    for line in fileinput.input(file, inplace = 1):
        if "/**" in line:
            inBlockComment = True
        if inBlockComment and "*/" in line:
            inBlockComment = False
            # If the */ isn't last, remove through the */
            if line.find("*/") != len(line) - 3:
                line = line[line.find("*/")+2:]
            else:
                continue
        if inBlockComment:
            continue
        sys.stdout.write(line)