Groovy Regex illegal Characters

Groovy Regex illegal Characters - regex

I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.
The code that isn't compiling is this:
def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0
The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.
Thanks!
Update:
The code compiles, but it's not filtering as I'd expected it to.
In regexpal I put the regex:
[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]
and the test data:
name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field
name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field
name='class1'>TS</field><field name='freq'>S</field><field
name='class2'>616.079</field><field name='text'>Subcellular Localization of the
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field
name='page'>89-97</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field
It's a grab from a file with one of the illegal characters, so it's a little random. But regexpal highlights only the illegal character, but in Groovy it's replacing even the '<' and '>' characters with empty strings, so it's basically annihilating the entire document.
The code snippet:
def List parseFile(File file){
println "reading File name: ${file.name}"
def lineCount = 0
List data = new ArrayList()
file.eachLine {
String input ->
lineCount ++
String line = input
if(input =~ illegalChars){
line = input.replaceAll(illegalChars, " ")
}
Map document = new HashMap()
elementNames.each(){
token ->
def val = getValue(line, token)
if(val != null){
if(token.equals("ISSUE")){
List entries = val.split(";")
document.putAt("year",entries.getAt(0).trim())
if(entries.size() > 1){
document.putAt("volume", entries.getAt(1).trim())
}
if(entries.size() > 2){
document.putAt("issue", entries.getAt(2).trim())
}
} else {
document.putAt(token, val)
}
}
}
data.add(document)
}
println "done"
return data
}
I don't see any reason that the two should behave differently; am I missing something?
Again, thanks!

line 23:26: unexpected char: 0x0
This error message points to this part of the code:
def illegalChars = ~/[\u0000-...
12345678901234567890123
It looks like for some reason the compiler doesn't like having Unicode 0 character in the source code. That said, you should be able to fix this by doubling the slash. This prevents Unicode escapes at the source code level, and let the regex engine handle the unicode instead:
def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/
Note that I've also combined the character classes into one instead of as alternates. I've also removed the range definition when they're not necessary.
References
regular-expressions.info/Character Classes
On doubling the slash
Here's the relevant quote from java.util.regex.Pattern
Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
To illustrate, in Java:
System.out.println("\n".matches("\\u000A")); // prints "true"
However:
System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is because \u000A, which is the newline character, is escaped in the second snippet at the source code level. The source code essentially becomes:
System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is not a legal Java source code.

Try this Regular Expression to remove unicode char from the string :
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/

OK here's my finding:
>>> print "XYZ".replaceAll(
/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
"-"
)
---
>>> print "X\0YZ".replaceAll(
/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
"-"
)
X-YZ
>>> print "X\0YZ".replaceAll(
"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
"-"
)
X-YZ
In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that 0-\ becomes part of the range, and this includes <, > and all capital letters.
The \\uNNNN only works in "pattern", not in /pattern/.
I will edit my official answer based on comments to this "answer".
Related questions
How to escape Unicode escapes in Groovy’s /pattern/ syntax

try
def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/`

Related

error: multiple repeat for regex in robot [duplicate]

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'

The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in the others, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group. In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|:" (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the ..
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.

The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In python simply write:
if term in string:
#do whatever

i have an example_str = "i love you c++" when using regex get error multiple repeat Error. The error I'm getting here is because the string contains "++" which is equivalent to the special characters used in the regex. my fix was to use re.escape(example_str ), here is my code.
example_str = "i love you c++"
regex_word = re.search(rf'\b{re.escape(word_filter)}\b', word_en)

Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).

A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

JSONCPP is adding extra double quotes to string

I have a root in JSONcpp having string value like this.
Json::Value root;
std::string val = "{\"stringval\": \"mystring\"}";
Json::Reader reader;
bool parsingpassed = reader.parse(val, root, false);
Now when I am trying to retrieve this value using this piece of code.
Json::StreamWriterBuilder builder;
builder.settings_["indentation"] = "";
std::string out = Json::writeString(builder, root["stringval"]);
here out string ideally should be giving containing:
"mystring"
whereas it is giving output like this:
"\"mystring\"" \\you see this in debug mode if you check your string content
by the way if you print this value using stdout it will be printed something like this::
"mystring" \\ because \" is an escape sequence and prints " in stdout
it should be printing like this in stdout:
mystring \\Expected output
Any idea how to avoid this kind of output when converting json output to std::string ?
Please avoid suggesting fastwriter as it also adds newline character and it deprecated API as well.
Constraint: I do not want to modify the string by removing extra \" with string manipulation rather I am willing to know how I can I do that with JSONcpp directly.
This is StreamWriterBuilder Reference code which I have used
Also found this solution, which gives optimal solution to remove extra quotes from your current string , but I don't want it to be there in first place

I had this problem also until I realized you have to use the Json::Value class accessor functions, e.g. root["stringval"] will be "mystring", but root["stringval"].asString() will be mystring.

Okay so This question did not get answer after thorough explanation as well and I had to go through JSONCPP apis and documentation for a while.
I did not find any api as of now which takes care of this scenario of extra double quote addition.
Now from their wikibook I could figure out that some escape sequences might come in String. It is as designed and they haven't mentioned exact scenario.
\" - quote
\\ - backslash
\/ - slash
\n - newline
\t - tabulation
\r - carriage return
\b - backspace
\f - form feed
\uxxxx , where x is a hexadecimal digit - any 2-byte symbol
Link Explaining what all extra Escape Sequence might come in String
Anyone coming around this if finds out better explanation for the same issue , please feel free to post your answer.Till then I guess only string manipulation is the option to remove those extra escape sequence..

lua gsub special replacement producing invalid capture index

I have a piece of lua code (executing in Corona):
local loginstr = "emailAddress={email} password={password}"
print(loginstr:gsub( "{email}", "tester#test.com" ))
This code generates the error:
invalid capture index
While I now know it is because of the curly braces not being specified appropriately in the gsub pattern, I don't know how to fix it.
How should I form the gsub pattern so that I can replace the placeholder string with the email address value?
I've looked around on all the lua-oriented sites I can find but most of the documentation seems to revolve around unassociated situations.

As I've suggested in the comments above, when the e-mail is encoded as a URL parameter, the %40 used to encode the '#' character will be used as a capture index. Since the search pattern doesn't have any captures (let alone 40 of them), this will cause a problem.
There are two possible solutions: you can either decode the encoded string, or encode your replacement string to escape the '%' character in it. Depending on what you are going to do with the end result, you may need to do both.
the following routine (I picked up from here - not tested) can decode an encoded string:
function url_decode(str)
str = string.gsub (str, "+", " ")
str = string.gsub (str, "%%(%x%x)",
function(h) return string.char(tonumber(h,16)) end)
str = string.gsub (str, "\r\n", "\n")
return str
end
For escaping the % character in string str, you can use:
str:gsub("%%", "%%%%")
The '%' character is escaped as '%%', and it needs to be ascaped on both the search pattern and the replace pattern (hence the amount of % characters in the replace).

Are you sure your problem isn't that you're trying to gsub on loginurl rather than loginstr?
Your code gives me this error (see http://ideone.com/wwiZk):
lua: prog.lua:2: attempt to index global 'loginurl' (a nil value)
and that sounds similar to what you're seeing. Just fixing it to use the right variable:
print(loginstr:gsub( "{email}", "tester#test.com" ))
says (see http://ideone.com/mMj0N):
emailAddress=tester#test.com password={password}
as desired.

I had this in value part so You need to escape value with: value:gsub("%%", "%%%%").
Example of replacing "some value" in json:
local resultJSON = json:gsub(, "\"SOME_VALUE\"", value:gsub("%%", "%%%%"))

Flex 3 Regular Expression Problem

I've written a url validator for a project I am working on. For my requirements it works great, except when the last part for the url goes longer than 22 characters it breaks. My expression:
/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i
It expects input that looks like "http(s)://hostname:port/location".
When I give it the input:
https://demo10:443/111112222233333444445
it works, but if I pass the input
https://demo10:443/1111122222333334444455
it breaks. You can test it out easily at http://ryanswanson.com/regexp/#start. Oddly, I can't reproduce the problem with just the relevant (I would think) part /(:\d+\/\S+)/i. I can have as many characters after the required / and it works great. Any ideas or known bugs?
Edit:
Here is some code for a sample application that demonstrates the problem:
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute">
<mx:Script>
<![CDATA[
private function click():void {
var value:String = input.text;
var matches:Array = value.match(/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i);
if(matches == null || matches.length < 1 || matches[0] != value) {
area.text = "No Match";
}
else {
area.text = "Match!!!";
}
}
]]>
</mx:Script>
<mx:TextInput x="10" y="10" id="input"/>
<mx:Button x="178" y="10" label="Button" click="click()"/>
<mx:TextArea x="10" y="40" width="233" height="101" id="area"/>
</mx:Application>

I debugged your regular expression on RegexBuddy and apparently it takes millions of steps to find a match. This usually means that something is terribly wrong with the regular expression.
Look at ([^\s.]+.)+([^\s.]+)(:\d+\/\S+).
1- It seems like you're trying to match subdomains too, but it doesn't work as intended since you didn't escape the dot. If you escape it, demo10:443/123 won't match because it'll need at least one dot. Change ([^\s.]+\.)+ to ([^\s.]+\.)* and it'll work.
2- [^\s.]+ is a bad character class, it will match the whole string and start backtracking from there. You can avoid this by using [^\s:.] which will stop at the colon.
This one should work as you want:
https?:\/\/([^\s:.]+\.)*([^\s:.]+):\d+\/\S+

This is a bug, either in Ryan's implementation or within Flex/Flash.
The regular expression syntax used above (less surrounding slashes and flags) matches Python which provides the following output:
# ignore case insensitive flag as it doesn't matter in this case
>>> import re
>>> rx = re.compile('((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)')
>>> print rx.match('https://demo10:443/1111122222333334444455').groups()
('https://', 'https', 'demo1', '0', ':443/1111122222333334444455')

Remove C and C++ comments using Python?

I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)
I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*, or having a // inside a /* */.
Ideally, I would prefer a non-naive implementation that properly handles awkward cases.

This handles C++-style comments, C-style comments, strings and simple nesting thereof.
def comment_remover(text):
def replacer(match):
s = match.group(0)
if s.startswith('/'):
return " " # note: a space and not an empty string
else:
return s
pattern = re.compile(
r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
re.DOTALL | re.MULTILINE
)
return re.sub(pattern, replacer, text)
Strings needs to be included, because comment-markers inside them does not start a comment.
Edit: re.sub didn't take any flags, so had to compile the pattern first.
Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.
Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather then an empty string.

C (and C++) comments cannot be nested. Regular expressions work well:
//.*?\n|/\*.*?\*/
This requires the “Single line” flag (Re.S) because a C comment can span multiple lines.
def stripcomments(text):
return re.sub('//.*?\n|/\*.*?\*/', '', text, flags=re.S)
This code should work.
/EDIT: Notice that my above code actually makes an assumption about line endings! This code won't work on a Mac text file. However, this can be amended relatively easily:
//.*?(\r\n?|\n)|/\*.*?\*/
This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).
/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.

Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...
" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"
"And escaped double quotes at the end of a string\""
aa '\\
n' OK
aa "\""
aa "\
\n"
This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C fla
The C++/C99 comment number 1 has finished.
This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.
This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.
/\
\/ This is not a C++/C99 comment!
This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.
/\
\* This is not a C or C++ comment!
This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.
This is followed by regular C comment number 3.
/\
\
\
\
* C comment */
This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line, but the line splicing doesn't care about how many there are, but the subsequent processing might. Etc. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).

This posting provides a coded-out version of the improvement to Markus Jarderot's code that was described by atikat, in a comment to Markus Jarderot's posting. (Thanks to both for providing the original code, which saved me a lot of work.)
To describe the improvement somewhat more fully: The improvement keeps the line numbering intact. (This is done by keeping the newline characters intact in the strings by which the C/C++ comments are replaced.)
This version of the C/C++ comment removal function is suitable when you want to generate error messages to your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).
import re
def removeCCppComment( text ) :
def blotOutNonNewlines( strIn ) : # Return a string containing only the newline chars contained in strIn
return "" + ("\n" * strIn.count('\n'))
def replacer( match ) :
s = match.group(0)
if s.startswith('/'): # Matched string is //...EOL or /*...*/ ==> Blot out all non-newline chars
return blotOutNonNewlines(s)
else: # Matched string is '...' or "..." ==> Keep unchanged
return s
pattern = re.compile(
r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
re.DOTALL | re.MULTILINE
)
return re.sub(pattern, replacer, text)

I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be used using the following code:
import subprocess
from cStringIO import StringIO
input = StringIO(source_code) # source_code is a string with the source code.
output = StringIO()
process = subprocess.Popen(['sed', '/path/to/remccoms3.sed'],
input=input, output=output)
return_code = process.wait()
stripped_code = output.getvalue()
In this program, source_code is the variable holding the C/C++ source code, and eventually stripped_code will hold C/C++ code with the comments removed. Of course, if you have the file on disk, you could have the input and output variables be file handles pointing to those files (input in read-mode, output in write-mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.
This will probably be better than a pure Python solution; no need to reinvent the wheel.

The regular expression cases will fall down in some situations, like where a string literal contains a subsequence which matches the comment syntax. You really need a parse tree to deal with this.

you may be able to leverage py++ to parse the C++ source with GCC.
Py++ does not reinvent the wheel. It
uses GCC C++ compiler to parse C++
source files. To be more precise, the
tool chain looks like this:
source code is passed to GCC-XML
GCC-XML passes it to GCC C++ compiler
GCC-XML generates an XML description
of a C++ program from GCC's internal
representation. Py++ uses pygccxml
package to read GCC-XML generated
file. The bottom line - you can be
sure, that all your declarations are
read correctly.
or, maybe not. regardless, this is not a trivial parse.
# RE based solutions - you are unlikely to find a RE that handles all possible 'awkward' cases correctly, unless you constrain input (e.g. no macros). for a bulletproof solution, you really have no choice than leveraging the real grammar.

I'm sorry this not a Python solution, but you could also use a tool that understands how to remove comments, like your C/C++ preprocessor. Here's how GNU CPP does it.
cpp -fpreprocessed foo.c

There is also a non-python answer: use the program stripcmt:
StripCmt is a simple utility written
in C to remove comments from C, C++,
and Java source files. In the grand
tradition of Unix text processing
programs, it can function either as a
FIFO (First In - First Out) filter or
accept arguments on the commandline.

The following worked for me:
from subprocess import check_output
class Util:
def strip_comments(self,source_code):
process = check_output(['cpp', '-fpreprocessed', source_code],shell=False)
return process
if __name__ == "__main__":
util = Util()
print util.strip_comments("somefile.ext")
This is a combination of the subprocess and the cpp preprocessor. For my project I have a utility class called "Util" that I keep various tools I use/need.

I have using the pygments to parse the string and then ignore all tokens that are comments from it. Works like a charm with any lexer on pygments list including Javascript, SQL, and C Like.
from pygments import lex
from pygments.token import Token as ParseToken
def strip_comments(replace_query, lexer):
generator = lex(replace_query, lexer)
line = []
lines = []
for token in generator:
token_type = token[0]
token_text = token[1]
if token_type in ParseToken.Comment:
continue
line.append(token_text)
if token_text == '\n':
lines.append(''.join(line))
line = []
if line:
line.append('\n')
lines.append(''.join(line))
strip_query = "\n".join(lines)
return strip_query
Working with C like languages:
from pygments.lexers.c_like import CLexer
strip_comments("class Bla /*; complicated // stuff */ example; // out",CLexer())
# 'class Bla example; \n'
Working with SQL languages:
from pygments.lexers.sql import SqlLexer
strip_comments("select * /* this is cool */ from table -- more comments",SqlLexer())
# 'select * from table \n'
Working with Javascript Like Languages:
from pygments.lexers.javascript import JavascriptLexer
strip_comments("function cool /* not cool*/(x){ return x++ } /** something **/ // end",JavascriptLexer())
# 'function cool (x){ return x++ } \n'
Since this code only removes the comments, any strange value will remain. So, this is a very robust solution that is able to deal even with invalid inputs.

You don't really need a parse tree to do this perfectly, but you do in effect need the token stream equivalent to what is produced by the compiler's front end. Such a token stream must necessarilyy take care of all the weirdness such as line-continued comment start, comment start in string, trigraph normalization, etc. If you have the token stream, deleting the comments is easy. (I have a tool that produces exactly such token streams, as, guess what, the front end of a real parser that produces a real parse tree :).
The fact that the tokens are individually recognized by regular expressions suggests that you can, in principle, write a regular expression that will pick out the comment lexemes. The real complexity of the set regular expressions for the tokenizer (at least the one we wrote) suggests you can't do this in practice; writing them individually was hard enough. If you don't want to do it perfectly, well, then, most of the RE solutions above are just fine.
Now, why you would want strip comments is beyond me, unless you are building a code obfuscator. In this case, you have to have it perfectly right.

I ran across this problem recently when I took a class where the professor required us to strip javadoc from our source code before submitting it to him for a code review. We had to do this several times, but we couldn't just remove the javadoc permanently because we were required to generate javadoc html files as well. Here is a little python script I made to do the trick. Since javadoc starts with /** and ends with */, the script looks for these tokens, but the script can be modified to suite your needs. It also handles single line block comments and cases where a block comment ends but there is still non-commented code on the same line as the block comment ending. I hope this helps!
WARNING: This scripts modifies the contents of files passed in and saves them to the original files. It would be wise to have a backup somewhere else
#!/usr/bin/python
"""
A simple script to remove block comments of the form /** */ from files
Use example: ./strip_comments.py *.java
Author: holdtotherod
Created: 3/6/11
"""
import sys
import fileinput
for file in sys.argv[1:]:
inBlockComment = False
for line in fileinput.input(file, inplace = 1):
if "/**" in line:
inBlockComment = True
if inBlockComment and "*/" in line:
inBlockComment = False
# If the */ isn't last, remove through the */
if line.find("*/") != len(line) - 3:
line = line[line.find("*/")+2:]
else:
continue
if inBlockComment:
continue
sys.stdout.write(line)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Groovy Regex illegal Characters - regex

Try this Regular Expression to remove unicode char from the string : /*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/

try def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/`

Related

error: multiple repeat for regex in robot [duplicate]

JSONCPP is adding extra double quotes to string

lua gsub special replacement producing invalid capture index

Flex 3 Regular Expression Problem

Remove C and C++ comments using Python?

Categories

Resources