ANTLR4 Lexer Matching Start of Line End Of Line - regex

How do you achieve the Perl regular expression ^ and $ in the ANTLR4 lexer? I.e., how do you match the start of a line and the end of a line without consuming any characters?
I am trying to use the ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line. For example, I want to isolate and toss out all C++ preprocessor directives, regardless of which directive they are, while disregarding a # inside a string literal. (Normally we could tokenize C++ string literals to eliminate a # appearing in the middle of a line, but assume we're not doing that.) That means I only want to specify # .*? without bothering with #if, #ifndef, #pragma, etc.
Also, the C++ standard allows whitespace and multi-line comments right before and after the #, e.g.
/* helo
world*/ # /* hel
l
o
*/ /*world */ifdef .....
is considered a valid preprocessor directive appearing on a single line. (The CRLFs inside the ML_COMMENTs are tossed.)
This is what I am doing currently:
PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR);
But the problem is that I have to rely on the existence of a CRLF before the # and toss out that CRLF along with the directive. I need to replace the CRLF that was tossed out with the CRLF of the directive line itself, so I have to make sure the directive is terminated by a CRLF.
However, that means my grammar cannot handle a directive appearing right at the start of the file (i.e. with no preceding CRLF) or terminated by EOF without a trailing CRLF.
If the Perl-style ^ and $ regex syntax were available, I could match the SOL/EOL instead of explicitly matching and consuming the CRLF.

You can use semantic predicates for the conditions.
PPLINE
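  // ~ Perl's ^ : the predicate below succeeds only at column 0, consuming nothing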
: {getCharPositionInLine() == 0}?
(ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+
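  // ~ Perl's $ : require the next char to be \r or \n, without consuming it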
{_input.LA(1) == '\r' || _input.LA(1) == '\n'}?
-> channel(PPDIR)
;

You could try having multiple rules with gated semantics (different lexer rules in different states) or with modes (pushMode; see http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules), with an alternative rule for the beginning of the file and then switching to the core rules when the directives end, but it could be a long job.
First, though, I would check whether there really are problems in parsing #pragma/preprocessor directives without changing anything, because if the problem with finding a # is that it could also be present in strings and comments, then just by ordering the rules you should be able to direct it to the right case (though this could be a problem for languages where you can put directives inside comments).

Related

Add text to the end if not already added

I have the following lines:
source = "git::ssh://git#github.abc.com/test//bar"
source = "git::ssh://git#github.abc.com/test//foo?ref=tf12"
resource = "bar"
I want to update any line that contains the words source and git by adding ?ref=tf12 to the end of the line, but inside the ". If the line already contains ?ref=tf12, it should be skipped:
source = "git::ssh://git#github.abc.com/test//bar?ref=tf12"
source = "git::ssh://git#github.abc.com/test//foo?ref=tf12"
resource = "bar"
I have the following sed expression, but its output is wrong:
sed 's#source.*git.*//.*#&?ref=tf12#' file.tf
source = "git::ssh://git#github.abc.com/test//bar"?ref=tf12
source = "git::ssh://git#github.abc.com/test//foo"?ref=tf12?ref=tf12
resource = "bar"
Using simple regular expressions for this is rather brittle; if at all possible, using a more robust configuration file parser would probably be a better idea. If that's not possible, you might want to tighten up the regular expressions to make sure you don't modify unrelated lines. But here is a really simple solution, at least as a starting point.
sed -e '/^ *source *= *"git/!b' -e '/?ref=tf12" *$/b' -e 's/" *$/?ref=tf12"/' file.tf
This consists of three commands. Remember that sed examines one line at a time.
/^ * source *= *"git/!b - if this line does not begin with source="git (with optional spaces between the tokens) leave it alone. (! means "does not match" and b means "branch (to the end of this script)" i.e. skip this line and fetch the next one.)
/?ref=tf12" *$/b similarly says to leave alone lines which match this regex. In other words, if the line ends with ?ref=tf12" (with optional spaces after) don't modify it.
s/"* $/?ref=tf12"/ says to change the last double quote to include ?ref=tf12 before it. This will only happen on lines which were not skipped by the two previous commands.
sed '/?ref=tf12"/!s#\(source.*git.*//.*\)"#\1?ref=tf12"#' file.tf
/?ref=tf12"/! Only run substitude command if this pattern (?ref=tf12") doesn't match
\(...\)", \1 Instead of appending to the entire line using &, only match the line until the last ". Use parentheses to match everything before that " into a group which I can then refer with \1 in the replacement. (Where we re-add the ", so that it doesn't get lost)

Perl search and replace until positive lookahead over several lines - not working as expected?

The overall goal here is to remove a block of text starting with a particular string and ending with a positive lookahead. From the testing I've done, it seems that newlines are causing the problem, but I'm not sure what exactly is going on or the best way to fix it.
More context: I want to remove taxa from a .fasta file, including the taxon name and header information and the associated sequence. (fasta format begins with a header >locusname-locusnumber-species_name |locusname-locusnumber \n). Missing data in the sequence is coded as "-". Eventually I would like to do this for several species_names and do so for each of several thousand files in a directory.
I presumed this would be a simple task to do as a perl one-liner in bash (Ubuntu 18.04.2).
As an example, from the excerpt below I would like to remove the entire sequence of Pseudomyrmex seminole D1367, i.e. the string that starts with >uce-483_Pseudomyrmex_seminole_D1367 |uce-483 and ends with the newline before >uce-483_Pseudomyrmex_seminole_D1435.
For this, I have: perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367)[\s\S]+(?=>)//' infile.fasta > outfile.fasta
or equivalently perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367(.)+(?=>)//s' infile.fasta > outfile.fasta
Both of these seem to have no effect at all (i.e. diff infile.fasta outfile.fasta is empty). If I remove the positive lookahead, it works correctly, but only up to the first newline.
Here's an excerpt from the .fasta for context and testing:
>uce-483_Pseudomyrmex_seminole_D1366 |uce-483
------------------------------------------------------------
---------------------------------------------------tgtaaacgt
tataatacatgcgtatgaaaaaaaaaagtgaacacccggtacgtacccgtgctgaaacgt
tcagatttacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata----tgtgtgtgtgtgcgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaatgttgaattattttcactcgctc
cacgcgcgagtgttcgctccttttacgcacaacgagtccttctgctgcagc--gagatag
aaaatatttttgcgcggtaatcgtaaacgtatgagtgcctttcgacgtgaattctcttat
ggcagttctcacggtgtaaattataatcgaattaacattgcgagtgtgatctcaatataa
ttatagcgtctaagaacaaacacgtaacatgcacacacacacacacacac----------
---
>uce-483_Pseudomyrmex_seminole_D1367 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--ttcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatg---atatatatgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactctgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtacgcgcgcacacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatg-----------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
>uce-483_Pseudomyrmex_seminole_D1435 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------tacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaa-----------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
With -p (or -n) the one-liner reads a line at a time, so it simply can't match multiline patterns. One solution is to "slurp" the whole file in, if it isn't too large (see the end for a line-by-line solution):
perl -0777 -pe'...' in > out
See Command Switches in perlrun.
Then, the code shown in the question has an unbalanced parenthesis and doesn't compile. Further, there is no reason to capture those .s, so drop the parentheses around them. Next, the pattern
s/>.+Pseudomyrmex_seminole_D1367...//;
matches everything from the very first > to the name of interest, so all preceding sequences are matched and removed as well. Instead match, for example, >[^>]+...D1367 -- everything after a > that isn't another >, up to that phrase.
Finally, the last .+(?=>) will match everything to the very last > and thus the regex will remove all following sequences, not what you want according to the description. Instead, limit it to match to the first following >, either by making it "non-greedy" with .+?(?=>) or, more simply, with [^>]+.
All corrected
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367[^>]+//' in > out
Note that there is no need for the /s modifier now: its purpose is to make . match a newline, and we don't need that here since [^>] already matches newlines (anything other than >). The quantifier is +? to (hopefully) prevent backtracking through each whole sequence that doesn't match.
Or, with your original use of lookahead
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367.+?(?=>)//s' in > out
These work as expected with your sample, as well as with an extended example I made up with further sequences (>...) added.
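The same slurp-and-substitute idea ports directly to other languages; for instance, a rough Python equivalent (an illustrative sketch, assuming the file fits in memory and using the same in/out names):

import re

with open('in') as f:
    text = f.read()
# [^>]+ keeps the match confined to a single sequence, as explained above
text = re.sub(r'>[^>]+?Pseudomyrmex_seminole_D1367[^>]+', '', text)
with open('out', 'w') as f:
    f.write(text)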
For reference, and since a fasta file can be too big to slurp into a string, here it is line by line.
Once you see the >... line of interest, set a flag; print a line if that flag isn't set (and if we aren't on that very line). Once you reach the next >, clear the flag (and print that line, too).
perl -ne'
if (/^>.+?Pseudomyrmex_seminole_D1367/) { $f = 1 }
elsif (not $f) { print }
elsif (/^>/) { $f = 0; print }
' in > out
I suspect that this may also perform considerably better on very large files.
The regex in the first solution has to scan each sequence whole in order to find that it is not the one of interest; it is only once it hits the next > that it can decide that the sequence doesn't match (and with no backtracking, hopefully, since +? would've stopped it had the right phrase been encountered).
Here the code mostly checks the first character and a flag.
So it's an incomparably smaller workload here -- but the regex engine is started up on every line, and that is expensive. I can't tell with confidence how they stack up against each other without trying.
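For what it's worth, the flag technique reads almost line for line the same in Python (a sketch, not from the original answer, with the same in/out file names):

skip = False
with open('in') as src, open('out', 'w') as dst:
    for line in src:
        if line.startswith('>'):
            # Header line: skip the whole record if it names the unwanted taxon
            skip = 'Pseudomyrmex_seminole_D1367' in line
        if not skip:
            dst.write(line)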
You can also use > as the input record separator. This way you avoid slurping the whole file, and since the main loop loads your file block by block, you only have to test which block is the target in order not to print it (without having to describe the whole block in a pattern):
perl -ln076e's/\n$//;print ">$_" if $_ && !/Pseudomyrmex_seminole_D1367/' file
The l switch sets the output record separator to the input record separator (a newline by default).
The 0 switch sets the input record separator to > (character code 076 in octal).
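Python has no input record separator, but splitting the slurped text on > gives the same block-by-block view (a minimal sketch, not from the original answer):

with open('file') as f:
    blocks = f.read().split('>')
# Re-prepend the > that split() consumed; drop the block naming the taxon
kept = ('>' + b for b in blocks if b and 'Pseudomyrmex_seminole_D1367' not in b)
print(''.join(kept), end='')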

ANTLR4 skips empty line only

I am using ANTLR4 to parse a text file and I am new to it. Here is part of the file:
abcdef
//emptyline
abcdef
As a stream, the file content will look like this:
abcdef\r\n\r\nabcdef\r\n
ANTLR4 offers the "skip" command to skip things like whitespace, tabs, and newline symbols while lexing, e.g.
WS : [\t\s\r\n]+ -> skip ; // skip spaces, tabs, newlines
My problem is that I want to skip empty lines only; I don't want to skip every single "\r\n". That means when two or more "\r\n" appear together, I only want to skip the second one and the following ones. How should I write the regular expression? Thank you.
grammar INIGrammar_1;
init: (section|NEWLINE)+ ;
section: '[' phase_name ':' v ']' (contents)+
| '[' phase_name ']' (contents)+ ;
//
//
phase_name : STRING
|MTT
|MPI_GET
|MPI_INSTALL
|MPI_DETAILS
|TEST_GET
|TEST_BUILD
|TEST_RUN
|REPORTER
;
v : STRING ;
contents: kvpairs
| include_section_pairs
| if_statement
| NEWLINE
| EOT
;
keylhs : STRING
;
valuerhs : STRING
|multiline_valuerhs
|kvpairs
|url
;
kvpairs: keylhs '=' valuerhs NEWLINE
;
include_section_pairs: INCLUDE_SECTION '=' STRING
;
if_statement: IF if_statement_condition THEN NEWLINE (ELSEIF if_statement_condition THEN NEWLINE)*? STRING NEWLINE IFEND NEWLINE
;
if_statement_condition:STRING '=' STRING ';'//here, semicolon has problem, either I use ';' or SEMICOLON
;
multiline_valuerhs:STRING (',' (' ')*? ( '\\' (' ')*? NEWLINE)? STRING)+
;
url:(' ')*?'http'':''//''www.';//ignore this, not finished.
IF: 'if';
ELSEIF:'elif';
IFEND:'fi';
THEN: 'then';
SEMICOLON: ';';
STRING : [a-z|A-Z|0-9|''| |.|\-|_|(|)|#|&|""|/|#|<|>|$]+ ;
//Keywords
MTT: 'MTT';
MPI_GET: 'MPI get';
MPI_INSTALL:'MPI install';
MPI_DETAILS:'MPI Details';
TEST_GET:'Test get';
TEST_BUILD: 'Test build';
TEST_RUN: 'Test run';
REPORTER: 'Reporter';
INCLUDE_SECTION: 'include_section';
//INCLUDE_SECTION_VALUE:STRING;
EOT:'EOT';
NEWLINE: ('\r' ? '\n')+ ;
WS : [\t]+ -> skip ; // skip spaces, tabs, newlines
COMMENT: '#' .*? '\r'?'\n' -> skip;
EMPTYLINE: '\r\n' -> skip;
Part of the INI file
#======================================================================
# MPI run details
#======================================================================
[MPI Details: Open MPI]
# MPI tests
#exec = mpirun #hosts# -np &test_np() #mca# --prefix &test_prefix() &test_executable() &test_argv()
exec = mpirun #hosts# -np &test_np() --prefix &test_prefix() &test_executable() &test_argv()
hosts = &if(&have_hostfile(), "--hostfile " . &hostfile(), \
&if(&have_hostlist(), "--host " . &hostlist(), ""))
One more small thing: it seems like ";" cannot be recognized as itself in the result. ANTLR4 just keeps saying it expects something else and treats the semicolon as an unknown symbol.
The short answer to your question is that whitespace is not significant to your parser, so skip it all in the lexer.
The longer answer is to recognize that skipping whitespace (or any other character sequence) does not mean that it is not significant in the lexer. All it means is that no corresponding token is produced for consumption by the parser. Skipped whitespace will therefore still operate as a delimiter for generated tokens.
A couple of additional observations:
ANTLR does not do regexes - thinking along those lines will lead to further conceptual difficulties.
Don't ignore warning and error messages produced during generation of the lexer/parser - they almost always require correction before the generated code will function correctly.
It really helps to verify that the lexer is producing your intended token stream before trying to debug parser rules. See this answer, which shows how to dump the token stream.
I ran into the same issue trying to design a language that does not require a ; command delimiter.
What resolved it for me was adding the newline as a valid parse rule that does nothing.
I am no expert on this matter but it worked:
nl : NEWLINE{};
The newline rule looks like this (no skipping; note the optional '\r' has to sit outside a character set):
NEWLINE: '\r'? '\n';

What would be the best approach to this substitution in Vim?

A document of many lines has header/title sections, each with about 10 listings under it. I need to put the header/title info in with each of the listings so that they can be properly uploaded into a website (using comma and pipe delimiters). It looks like this:
SectionName1 and TitleName1
1111 - The SubSectionName A
222 - The SubSectionName B
3333 - The SubSectionName C
SectionName2 and TitleName2
444 - The SubSectionName D
55555 - The SubSectionName E
66 - The SubSectionName F
This repeats several hundred times. What I need to produce is something like:
SectionName1,TitleName1,1111,SubSectionNameA
SectionName1,TitleName1,222,SubSectionNameB
SectionName1,TitleName1,3333,SubSectionNameC
SectionName2,TitleName2,444,SubSectionNameD
SectionName2,TitleName2,55555,SubSectionNameE
SectionName2,TitleName2,66,SubSectionNameF
I realize there can be multiple approaches to this solution, but I'm having a difficult time pulling the trigger on any one method. I understand submatches, joins and getline, but I am not good at the practical use of them in this scenario.
Any help to get me mentally started would be greatly appreciated.
Let me propose the following quite general Ex command solving the
issue.1
:g/^\s*\h/d|let#"=substitute(#"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|
\ 'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=#".','.submatch(1).','/|'i,'js/\s\+//g
At the top level, this is the :global command that enumerates the lines
starting with zero or more whitespace characters followed by a Latin letter or
an underscore (see :help /\h). The lines matching this pattern are supposed
to be the header lines containing section and title names. The rest of the
command, after the pattern describing the header lines, are instructions to be
executed for each of those lines.
The actions to be performed on the headers can be divided into three steps.
Delete the current header line, at the same time extracting section
and title names from it.
:d|let#"=substitute(#"[:-2],'\s\+and\s\+',',','')
First, remove the current line, saving it into the unnamed register, using the :delete command. Then, update the contents of that register (referred to as @"; see :help @r and :help "") to be the result of the substitution that changes the word and, surrounded by whitespace characters, into a single comma. The actual replacement is carried out by the substitute() function.
However, the input is not the exact string containing the whole header
line, but its prefix leaving out the last character, which is
a newline symbol. The [:-2] notation is a short form of the
[0:-2] subscript expression that designates the substring from the
very first byte to the second one counting from the end (see :help
expr-[:]). This way, the unnamed register holds the section and the
title names separated by comma.
Determine the range of dependent subsection lines.
:ki|/\n\s*\h\|\%$/kj
After the first step, the subsection records belonging to the just-parsed header line are located starting from the current line (the one that followed the header) until the next header line or, if there is no such line below, the end of the buffer. The numbers of these lines are stored in the marks i and j, respectively. (See :helpg ^A mark is for a description of marks.)
The marks are placed using the :k command that sets a specified mark
at the last line of a given range which is the current line, by
default. So, unlike the first line of the considered block, the last
one requires a specific line range to point out its location.
A particular form of range, denoting the next line where a given
pattern matches, is used in this case (see :help :range). The
pattern defining the location of the line to be found, is composed in
such a way that it matches a line immediately preceding a header (a
line starting with possible whitespace followed by an alphabetical
character), or the very last line. (See :help pattern for details
about syntax of Vim regular expressions.)
Transform the delineated subsection lines according to desired format,
prepending section and title names found in the corresponding header
line.
:'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
This step is comprised of two :substitute commands that are run
over the range of lines delimited by the locations labelled by the
marks i and j (see :help [range]).
The first substitution command matches the beginning of a subsection
line—an identifier followed by a hyphen and the word The, all
floating in whitespace—and replaces it with the contents of the
unnamed register, holding the section and title names concatenated
with a comma, the matched identifier, and another comma. The second
substitution finalizes the transformation by squeezing all whitespace
characters on the line to gum the subsection name and the following
letter together.
To construct the replacement string in the first :substitute
command, the substitute-with-an-expression feature is used (see :help
sub-replace-\=). The substitution part of the command should start
with \= for Vim to interpret the remaining text not in a regular
way, but as an expression (see :help expression). The result of
that expression's evaluation becomes the substitution string. Note
the use of the submatch() function in the substitute expression to
retrieve the text of a submatch by its number.
1 The command is wrapped for better readability; its one-line version is listed below for ease of copy-pasting into the Vim command line. Note that the wrapped command can be used in a Vim script without any change.
:g/^\s*\h/d|let#"=substitute(#"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=#".','.submatch(1).','/|'i,'js/\s\+//g
Simplest/fastest way I can think of is a simple macro. Do once, rinse, repeat.
Assuming your cursor is initially on the first character of the first line (S of SectionName), this macro should work as long as the document is exactly in the same format as posted above.
f ctT,<Esc>yyjpjjpjddkkkddkkkJr,f ctS,<Esc>f xjJr,f ctS,f xjJr,f ctS,<Esc>f xjdd
Well, I think the question is not that clear. Why, in your demo input, after "-", is the text like:
55555 - The SubSectionName E
but in your expected output it turned into:
55555,SubSectionNameE
All spaces were removed, which is OK, but why was "The" removed as well? Is there a pattern for "The"?
I wrote an awk one-liner; it removes all spaces in the output but leaves the "The"s there. You can change it to get the exact output you need.
awk -F' and ' -vOFS="," 'NF>1{s=$1;t=$2;next;}$1{gsub(/\s+/,"");gsub(/-/,",");print s,t,$0} ' input
test on your example input:
kent$ cat v
SectionName1 and TitleName1
1111 - The SubSectionName A
222 - The SubSectionName B
3333 - The SubSectionName C
SectionName2 and TitleName2
444 - The SubSectionName D
55555 - The SubSectionName E
66 - The SubSectionName F
kent$ awk -F' and ' -vOFS="," 'NF>1{s=$1;t=$2;next;}$1{gsub(/\s+/,"");gsub(/-/,",");print s,t,$0} ' v
SectionName1,TitleName1,1111,TheSubSectionNameA
SectionName1,TitleName1,222,TheSubSectionNameB
SectionName1,TitleName1,3333,TheSubSectionNameC
SectionName2,TitleName2,444,TheSubSectionNameD
SectionName2,TitleName2,55555,TheSubSectionNameE
SectionName2,TitleName2,66,TheSubSectionNameF
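If awk isn't to your taste, the same pass can be written in Python, this time also dropping the leading "The" so the output matches the expected form exactly (a sketch, not from the original answers; the input file name is assumed):

import re

section = title = None
with open('input') as f:
    for raw in f:
        line = raw.strip()
        if ' and ' in line:
            # Header line: remember section and title for the listings below
            section, title = line.split(' and ', 1)
        elif line:
            num, name = line.split(' - ', 1)
            name = re.sub(r'^The\s*', '', name)   # drop the leading "The"
            name = re.sub(r'\s+', '', name)       # squeeze out remaining spaces
            print(f'{section},{title},{num.strip()},{name}')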

Remove C and C++ comments using Python?

I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)
I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*, or having a // inside a /* */.
Ideally, I would prefer a non-naive implementation that properly handles awkward cases.
This handles C++-style comments, C-style comments, strings and simple nesting thereof.
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)
Strings need to be included, because comment markers inside them do not start a comment.
Edit: re.sub didn't take any flags, so had to compile the pattern first.
Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.
Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; (which would not compile) by replacing the comment with a space rather than an empty string.
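A quick sanity check of that last point (hypothetical usage, not part of the original answer):

print(comment_remover('int/**/x=5;'))                 # -> int x=5;
print(comment_remover('s = "/* not a comment */";'))  # string literal kept intact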
C (and C++) comments cannot be nested. Regular expressions work well:
//.*?\n|/\*.*?\*/
This requires the “Single line” flag (re.S) because a C comment can span multiple lines.
import re

def stripcomments(text):
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)
This code should work.
/EDIT: Notice that my above code actually makes an assumption about line endings! This code won't work on a Mac text file. However, this can be amended relatively easily:
//.*?(\r\n?|\n)|/\*.*?\*/
This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).
/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.
Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...
" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"
"And escaped double quotes at the end of a string\""
aa '\\
n' OK
aa "\""
aa "\
\n"
This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C flag)
The C++/C99 comment number 1 has finished.
This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.
This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line; the line splicing doesn't care how many there are, but the subsequent processing might. Etc. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).
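If you do want to attack this in Python, the two pre-passes themselves are simple on their own; the hard part remains the comment/string scanning. A minimal sketch of the trigraph and line-splicing phases described above (illustrative only, not from the original answer):

import re

# Phase 1: trigraph replacement (??/ becomes a backslash, etc.)
TRIGRAPHS = {'??=': '#', '??/': '\\', "??'": '^', '??(': '[', '??)': ']',
             '??!': '|', '??<': '{', '??>': '}', '??-': '~'}

def replace_trigraphs(text):
    return re.sub(r"\?\?[=/'()!<>-]", lambda m: TRIGRAPHS[m.group(0)], text)

# Phase 2: line splicing deletes every backslash-newline pair, which is
# how a comment can be spread across several physical lines.
def splice_lines(text):
    return text.replace('\\\n', '')

prepared = splice_lines(replace_trigraphs('int a; ??/\n/* spliced */'))
# prepared == 'int a; /* spliced */' -- ready for a comment stripper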
This posting provides a coded-out version of the improvement to Markus Jarderot's code that was described by atikat in a comment on Markus Jarderot's posting. (Thanks to both for providing the original code, which saved me a lot of work.)
To describe the improvement somewhat more fully: it keeps the line numbering intact. (This is done by keeping the newline characters intact in the strings with which the C/C++ comments are replaced.)
This version of the C/C++ comment removal function is suitable when you want to generate error messages to your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).
import re

def removeCCppComment(text):

    def blotOutNonNewlines(strIn):
        # Return a string containing only the newline chars contained in strIn
        return "\n" * strIn.count('\n')

    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):  # Matched string is //...EOL or /*...*/  ==> blot out all non-newline chars
            return blotOutNonNewlines(s)
        else:                  # Matched string is '...' or "..."  ==> keep unchanged
            return s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)
I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be used using the following code:
import subprocess

# source_code is a string with the source code.
# sed -f runs the remccoms3.sed script over whatever arrives on stdin.
process = subprocess.Popen(['sed', '-f', '/path/to/remccoms3.sed'],
                           stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                           universal_newlines=True)
stripped_code, _ = process.communicate(source_code)
return_code = process.returncode
In this program, source_code is the variable holding the C/C++ source code, and eventually stripped_code will hold the C/C++ code with the comments removed. Of course, if you have the file on disk, you could pass open file handles as stdin and stdout instead (input in read mode, output in write mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.
This will probably be better than a pure Python solution; no need to reinvent the wheel.
The regular expression cases will fall down in some situations, like where a string literal contains a subsequence which matches the comment syntax. You really need a parse tree to deal with this.
You may be able to leverage Py++ to parse the C++ source with GCC.
Py++ does not reinvent the wheel. It uses the GCC C++ compiler to parse C++ source files. To be more precise, the tool chain looks like this: source code is passed to GCC-XML; GCC-XML passes it to the GCC C++ compiler; GCC-XML generates an XML description of a C++ program from GCC's internal representation. Py++ uses the pygccxml package to read the GCC-XML generated file. The bottom line: you can be sure that all your declarations are read correctly.
Or, maybe not. Regardless, this is not a trivial parse.
Re: RE-based solutions -- you are unlikely to find a RE that handles all possible 'awkward' cases correctly unless you constrain the input (e.g. no macros). For a bulletproof solution, you really have no choice but to leverage the real grammar.
I'm sorry this is not a Python solution, but you could also use a tool that understands how to remove comments, like your C/C++ preprocessor. Here's how GNU CPP does it:
cpp -fpreprocessed foo.c
There is also a non-Python answer: use the program StripCmt:
StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the command line.
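Since it can act as a filter, driving it from Python is one subprocess call away (a sketch, not from the original answer; assumes a stripcmt binary is on your PATH):

import subprocess

def strip_comments(source: str) -> str:
    # Pipe the source through stripcmt and collect the filtered output
    result = subprocess.run(['stripcmt'], input=source,
                            capture_output=True, text=True, check=True)
    return result.stdout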
The following worked for me:
from subprocess import check_output

class Util:
    def strip_comments(self, source_file):
        # cpp -fpreprocessed strips comments without expanding macros
        return check_output(['cpp', '-fpreprocessed', source_file], shell=False)

if __name__ == "__main__":
    util = Util()
    print(util.strip_comments("somefile.ext"))
This is a combination of subprocess and the cpp preprocessor. For my project I have a utility class called "Util" where I keep the various tools I use/need.
I used pygments to parse the string and then ignored all tokens that are comments. This works like a charm with any lexer on the pygments list, including JavaScript, SQL, and C-like languages.
from pygments import lex
from pygments.token import Token as ParseToken

def strip_comments(replace_query, lexer):
    generator = lex(replace_query, lexer)
    line = []
    lines = []
    for token in generator:
        token_type = token[0]
        token_text = token[1]
        if token_type in ParseToken.Comment:
            continue
        line.append(token_text)
        if token_text == '\n':
            lines.append(''.join(line))
            line = []
    if line:
        line.append('\n')
        lines.append(''.join(line))
    strip_query = "\n".join(lines)
    return strip_query
Working with C like languages:
from pygments.lexers.c_like import CLexer
strip_comments("class Bla /*; complicated // stuff */ example; // out",CLexer())
# 'class Bla example; \n'
Working with SQL languages:
from pygments.lexers.sql import SqlLexer
strip_comments("select * /* this is cool */ from table -- more comments",SqlLexer())
# 'select * from table \n'
Working with Javascript Like Languages:
from pygments.lexers.javascript import JavascriptLexer
strip_comments("function cool /* not cool*/(x){ return x++ } /** something **/ // end",JavascriptLexer())
# 'function cool (x){ return x++ } \n'
Since this code only removes the comments, any strange value will remain untouched. So this is a very robust solution, able to deal even with invalid inputs.
You don't really need a parse tree to do this perfectly, but you do in effect need the token stream equivalent to what is produced by the compiler's front end. Such a token stream must necessarily take care of all the weirdness, such as line-continued comment starts, comment starts in strings, trigraph normalization, etc. If you have the token stream, deleting the comments is easy. (I have a tool that produces exactly such token streams as, guess what, the front end of a real parser that produces a real parse tree :).
The fact that the tokens are individually recognized by regular expressions suggests that you can, in principle, write a regular expression that will pick out the comment lexemes. The real complexity of the set of regular expressions for the tokenizer (at least the one we wrote) suggests you can't do this in practice; writing them individually was hard enough. If you don't want to do it perfectly, well, then, most of the RE solutions above are just fine.
Now, why you would want to strip comments is beyond me, unless you are building a code obfuscator. In that case, you have to get it perfectly right.
I ran across this problem recently when I took a class where the professor required us to strip javadoc from our source code before submitting it to him for a code review. We had to do this several times, but we couldn't just remove the javadoc permanently because we were required to generate javadoc HTML files as well. Here is a little Python script I made to do the trick. Since javadoc starts with /** and ends with */, the script looks for these tokens, but it can be modified to suit your needs. It also handles single-line block comments and cases where a block comment ends but there is still non-commented code on the same line as the end of the block comment. I hope this helps!
WARNING: This script modifies the contents of the files passed in and saves them to the original files. It would be wise to have a backup somewhere else.
#!/usr/bin/python
"""
A simple script to remove block comments of the form /** */ from files
Use example: ./strip_comments.py *.java
Author: holdtotherod
Created: 3/6/11
"""
import sys
import fileinput

for file in sys.argv[1:]:
    inBlockComment = False
    for line in fileinput.input(file, inplace = 1):
        if "/**" in line:
            inBlockComment = True
        if inBlockComment and "*/" in line:
            inBlockComment = False
            # If the */ isn't last, remove through the */
            if line.find("*/") != len(line) - 3:
                line = line[line.find("*/") + 2:]
            else:
                continue
        if inBlockComment:
            continue
        sys.stdout.write(line)