Regex for comments in strings, strings in comments, etc

This is a question I've solved and wanted to post in Q&A style because I think more people could use the solution. Or maybe improve the solution, show where it breaks.
The problem
You wanna do something with quoted strings and/or comments in a body of text. You wanna extract them, highlight them, what have you. But some quoted strings are inside comments, and sometimes comment characters are inside strings. And string delimiters can be escaped, and comments can be line comments or block comments. And just when you thought you had a solution, somebody complains that it doesn't work when there's a regex literal in their JavaScript. What do you do?
Concrete example
var ret = row.match(/'([^']+)'/i); // Get 1st single quoted string's content
if (!ret) return ''; /* return if there's no matches
Otherwise turn into xml: */
var message = '\t<' + ret[1].replace(/\[1]/g, '').replace(/\/#(\w+)/i, ' $1=""') + '></' + ret[1].match(/[A-Z_]\w*/i)[0] + '>';
alert('xml: \'' + message + '\''); /*
alert("xml: '" + message + "'"); // */
var line = prompt('How do line-comments start? (e.g. //)', '//');
// do something with line
This code is nonsense, but how do I do the right thing in each of the cases of the above JavaScript?
The only thing I found that comes close is this: Comments in string and strings in comments where Jan Goyvaerts himself answered with a similar approach. But that one doesn't handle apostrophe-escaping yet.

I've broken the regex into 4 lines corresponding with the 4 paths in the graph; don't keep those line breaks in there if you ever use this.
(['"])(?:(?!\1|\\).|\\.)*\1|
\/(?![*/])(?:[^\\/]|\\.)+\/[igm]*|
\/\/[^\n]*(?:\n|$)|
\/\*(?:[^*]|\*(?!\/))*\*\/
Debuggex Demo
This code grabs 4 types of "blocks" that can contain the other 3. You can iterate through this and do with each one whatever you want or discard it because it's not the one you wanna do anything to.
This one is specific for JavaScript as it's a language I'm familiar with. But you could easily adapt this to the language of your preference.
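As a rough illustration of iterating over the matches, here is a minimal Python sketch (Python rather than JavaScript, purely for the example). The group names and the find_blocks helper are made up for this sketch, and the \1 backreference is written as a named backreference; otherwise the four alternatives are the ones shown above.

```python
import re

# The four alternatives from above, joined into one pattern.
# Group names are an addition for this sketch; (?P=q) plays the role of \1.
TOKEN_RE = re.compile(
    r"""(?P<string>(?P<q>['"])(?:(?!(?P=q)|\\).|\\.)*(?P=q))"""
    r"""|(?P<regex>/(?![*/])(?:[^\\/]|\\.)+/[igm]*)"""
    r"""|(?P<line_comment>//[^\n]*(?:\n|$))"""
    r"""|(?P<block_comment>/\*(?:[^*]|\*(?!/))*\*/)"""
)


def find_blocks(source):
    """Return a (kind, text) pair for every string, regex literal or comment."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


if __name__ == "__main__":
    sample = "var s = 'a // b'; // real comment\nvar r = /x\\/y/g; /* block */"
    for kind, text in find_blocks(sample):
        print(kind, repr(text))
```

One case where this does break: the pattern has no notion of where an expression can start, so a / b / c is picked up as the regex literal / b /.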
Anyone see a way in which this code breaks?
Edit I have since been notified that the general pattern is described very well here: https://stackoverflow.com/a/23589204/2684660, neato!

Related

D: split string by comma, but not quoted string

I need to split string by comma, that not quoted like:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;
void main()
{
auto str = `foo,bar,"hello, user",baz`;
foreach (row; csvReader(str))
{
writeln(row);
}
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data, as std.csv wouldn't correctly parse it because of the space ( ) before the first quote (").
You can use the following snippet to complete this task:
import std.stdio, std.regex;
File fileContent;
string fileFullName = `D:\code\test\example.csv`;
fileContent = File (fileFullName, "r");
auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
foreach(line;fileContent.byLine)
{
auto result = split(line, r);
writeln(result);
}
If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates things compared to just reading the characters because you might then have to recombine lines later when you find one ends with a closing quote! And then your beautiful byLine loop suddenly isn't so beautiful.
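To make the "one character at a time with a few flags" idea concrete, here is a minimal Python sketch. It is not the csv.d code linked above, just an illustration of the three rules (quoted commas, quoted newlines, doubled quotes); parse_csv is a name invented for this example.

```python
def parse_csv(text):
    """Split CSV text into rows of fields, one character at a time."""
    rows, row, field = [], [], []
    in_quotes = False  # the only state flag this sketch needs
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')  # "" inside quotes is a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # commas and newlines are plain content here
        elif ch == '"':
            in_quotes = True
        elif ch == ',':
            row.append(''.join(field))
            field = []
        elif ch == '\n':
            row.append(''.join(field))
            field = []
            rows.append(row)
            row = []
        elif ch != '\r':
            field.append(ch)
        i += 1
    if field or row:  # last row without a trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows
```

Fed the three-line example above, this yields two rows, the second of which holds the embedded newline, comma and quote mark in its first field.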
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.

Minify HTML with Boost regex in C++

Question
How to minify HTML using C++?
Resources
An external library could be the answer, but I'm more looking for improvements of my current code. Although I'm all ears for other possibilities.
Current code
This is my interpretation in c++ of the following answer.
The only part I had to change from the original post is this part on top: "(?ix)"
...and a few escape signs
#include <string>
#include <boost/regex.hpp>
void minifyhtml(std::string* s) {
boost::regex nowhitespace(
"(?ix)"
"(?>" // Match all whitespace other than single space.
"[^\\S ]\\s*" // Either one [\t\r\n\f\v] and zero or more ws,
"| \\s{2,}" // or two or more consecutive-any-whitespace.
")" // Note: The remaining regex consumes no text at all...
"(?=" // Ensure we are not in a blacklist tag.
"[^<]*+" // Either zero or more non-"<" {normal*}
"(?:" // Begin {(special normal*)*} construct
"<" // or a < starting a non-blacklist tag.
"(?!/?(?:textarea|pre|script)\\b)"
"[^<]*+" // more non-"<" {normal*}
")*+" // Finish "unrolling-the-loop"
"(?:" // Begin alternation group.
"<" // Either a blacklist start tag.
"(?>textarea|pre|script)\\b"
"| \\z" // or end of file.
")" // End alternation group.
")" // If we made it here, we are not in a blacklist tag.
);
// #todo Don't remove conditional html comments
boost::regex nocomments("<!--(.*)-->");
*s = boost::regex_replace(*s, nowhitespace, " ");
*s = boost::regex_replace(*s, nocomments, "");
}
Only the first regex is from the original post, the other one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I try to accomplish though.
Regexps are a powerful tool, but I think that using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare. By looking at this regexp you can't quickly understand what the heck it is supposed to match.
You need an HTML parser that tokenizes the input file and lets you access the tokens either as a stream or as an object tree. Basically, read tokens, discard the tokens and attributes you don't need, then write what remains into the output. Using something like this would allow you to develop a solution faster than if you tried to tackle it using regexps.
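The read-tokens, discard, write-out loop can be sketched with the HTML tokenizer in Python's standard library. This is only an illustration of the approach under the assumption that pre, textarea and script are the tags to protect (as in the question's regex); the class and function names are invented for the example, and it drops every comment, conditional ones included.

```python
import re
from html.parser import HTMLParser


class WhitespaceMinifier(HTMLParser):
    """Re-emit the token stream, collapsing whitespace outside protected tags."""

    PROTECTED = {"pre", "textarea", "script"}

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.protected_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.PROTECTED:
            self.protected_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in self.PROTECTED and self.protected_depth > 0:
            self.protected_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.protected_depth == 0:
            data = re.sub(r"\s+", " ", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_comment(self, data):
        pass  # comments are discarded


def minify_html(html):
    parser = WhitespaceMinifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Each rule lives in its own small handler, which is the maintainability argument made above: changing what gets discarded means editing one method rather than one long pattern.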
I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.
In C++, there are libxml (which might have an HTML support module), Qt 4 and tinyxml; libstrophe also uses some kind of XML parser that could work.
Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike python, python has "Beautiful Soup" module that would work very well for this kind of problem.
Qt 4 might work because it provides decent unicode string type (and you'll need it if you're going to parse html).

Controlling Word Wrap in a container

I have a peculiar problem. I have an email group that pipes emails to a message board. The word wrap of the emails varies. In yahoo, the messages tend to fill the entire container on the message board. But in all other mail clients, only part of the container width is filled, because the original mail was wrapped. I want all of the email messages to fill the entire width of the container. I've thought of two possible solutions: CSS, or a Regex that eliminates line breaks. Because I am only a garage mechanic (at these sorts of things), I simply cannot get the job done. Any help out there?
Here is a link that shows the issue: http://seanwilson.org/forum/index.php?t=msg&th=1729&start=0&S=171399e41f2c10c4357dd9b217caaa3f
(compare the message of "sean" with that of "rob." One fills the container, the other not).
Can any of you suggest how to get all the mail to fill the container?
You gave too little information - what programming language are you using - PHP/Javascript/anything different?
I think you only need to replace \n, \r and \r\n with whitespace. PHP code for that:
// Double quotes are required: PHP does not interpret \r or \n inside single quotes
$nowrap = str_replace("\r\n", ' ', $nowrap);
$nowrap = str_replace("\r", ' ', $nowrap);
$nowrap = str_replace("\n", ' ', $nowrap);
You can do that analogously in other languages (for JS see string.replace method: http://www.tizag.com/javascriptT/javascript-string-replace.php).
Depending on the situation (people always seem to add 2 linebreaks between paragraphs), you could say the problem is: replace all newlines not directly preceded or followed by a newline with a space.
//just to be sure, remove \r's
$string = str_replace("\r",'',$string);
$string = preg_replace('/(?<!\n)\n(?!\n)/',' ',$string);
While allowing \r's:
$string = preg_replace('/(?<!\r|\n)\r?\n(?!\r|\n)/',' ',$string);
Edit: nevermind: do not use: while people tend to write their email text in paragraphs, you will break their signature / signoff with this regex. One could fiddle around with a minimum linelength before deeming it 'breakable' (i chose 63), but fiddly it will be:
$string = preg_replace('/([^\r\n]{63,})\r?\n(?!\r|\n)/','$1 ',$string);
The problem is: there is no assurance the linebreak wasn't intended. With a fiddleable line-length you could base it on average users, but the question is: what do they mind more: the differences between breaking & non-breaking paragraphs, or the breaking of their signatures?
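For anyone who wants to try different thresholds before committing to one, the same minimum-line-length idea can be sketched in Python. The unwrap name and the min_length parameter are invented for this example; the pattern is the one from the last preg_replace above with the 63 made adjustable.

```python
import re


def unwrap(text, min_length=63):
    """Join a line to the next one only if it is at least min_length long."""
    pattern = r"([^\r\n]{%d,})\r?\n(?!\r|\n)" % min_length
    return re.sub(pattern, r"\1 ", text)


if __name__ == "__main__":
    mail = "x" * 70 + "\nrest of the paragraph\n\nRegards,\nBob"
    print(unwrap(mail))
```

Short lines such as a sign-off stay on their own line; a signature line longer than the threshold would still be joined to the next one, which is the trade-off described above.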
Thanks for getting back so quickly!
The discussion board uses php (and also CSS). The only trouble is that I am somewhat limited in my ability to tinker with its programming. If I am to do this at my current level of skill, I have only one of two options.
using a preg-replace in php. The discussion board allows us to do this from a control panel. So if I could do it with one preg-replace statement, it should work.
Would Wrikken's solution work if I do not remove \r's? Because that seems to be spot on. (could the \r's be added to the preg-replace?)
I had hoped the solution could come through a css property of some sort. I guess that isn't possible.
Thanks so much for your help!
[NOTE: thanks so much for your help! The solution worked!!! I changed the number to 53 or so. It needed to be a little smaller. I don't care that a rare, long signature line may lose its carriage return. That's a small price to pay for a full message box! You easily saved me several days of learning something that was bound to be moderately frustrating. Thanks so much for that quick fix. I am joyous at the help I received here.]

Replace C style comments by C++ style comments

How can I automatically replace all C style comments (/* comment */) by C++ style comments (// comment)?
This has to be done automatically in several files. Any solution is okay, as long as it works.
This tool does the job:
https://github.com/cenit/jburkardt/tree/master/recomment
RECOMMENT is a C++ program which
converts C style comments to C++ style
comments.
It also handles all the non-trivial cases mentioned by other people:
This code incorporates suggestions and
coding provided on 28 April 2005 by
Steven Martin of JDS Uniphase,
Melbourne Florida. These suggestions
allow the program to ignore the
internal contents of strings, (which
might otherwise seem to begin or end
comments), to handle lines of code
with trailing comments, and to handle
comments with trailing bits of code.
This is not a trivial problem.
int * /* foo
/* this is not the beginning of a comment.
int * */ var = NULL;
What do you want to replace that with? Any real substitution requires sometimes splitting lines.
int * // foo
// this is not the beginning of a comment.
// int *
var = NULL;
How do you intend to handle situations like this:
void CreateExportableDataTable(/*[out, retval]*/ IDispatch **ppVal)
{
//blah
}
Note the comment inside the parens... this is a common way of documenting things in generated code, or mentioning default parameter values in the implementation of a class, etc. I'm usually not a fan of such uses of comments, but they are common and need to be considered. I don't think you can convert them to C++ style comments without doing some heavy thinking.
I'm with the people who commented in your question. Why do it? Just leave it.
it wastes time, adds useless commits to version control, risk of screwing up
EDIT:
Adding details from the comments from the OP
The fundamental reason of preferring C++-style comment is that you can comment out a block of code which may have comments in it. If that comment is in C-style, this block-comment-out of code is not straight forward. – unknown (yahoo)
that might be a fair/ok thing to want to do, but I have two comments about that:
I know of no one who would advocate changing all existing code - that is a preference for new code. (IMO)
If you feel the need to "comment out code" (another iffy practice) then you can do it as needed - not before
It also appears that you want to use the c-style comments to block out a section of code? Or are you going to use the // to block out many lines?
One alternative is a preprocessor #ifdef for that situation. I cringe at that but it is just as bad as commenting out lines/blocks. Neither should be left in the production code.
I recently converted all C-style comments to C++-style for all files in our repository. Since I could not find a tool that would do it automatically, I wrote my own: c-comments-to-cpp
It is not fool-proof, but way better than anything else I've tried (including RECOMMENT). Among other things, it supports converting Doxygen style comments, for instance:
/**
* @brief My foo struct.
*/
struct foo {
int bar; /*!< This is a member.
It also has a meaning. */
};
Gets converted to:
/// @brief My foo struct.
struct foo {
int bar; ///< This is a member.
///< It also has a meaning.
};
Here's a Python script that will (mostly) do the job. It handles most edge cases, but it does not handle comment characters inside of strings, although that should be easy to fix.
#!/usr/bin/python
import sys
out = ''
in_comment = False
file = open(sys.argv[1], 'r+')
for line in file:
    if in_comment:
        end = line.find('*/')
        if end != -1:
            out += '//' + line[:end] + '\n'
            out += ' ' * (end + 2) + line[end+2:]
            in_comment = False
        else:
            out += '//' + line
    else:
        start = line.find('/*')
        cpp_start = line.find('//')
        if start != -1 and (cpp_start == -1 or cpp_start > start):
            out += line[:start] + '//' + line[start+2:]
            in_comment = True
        else:
            out += line
file.seek(0)
file.write(out)
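The gap admitted above (comment characters inside strings) can be closed by matching string literals in the same pattern so they get skipped instead of rewritten. The following is a separate sketch of that idea, not a patch to the script above; C_TOKEN_RE and convert_block_comments are names invented here, and it makes no attempt to preserve indentation or handle line continuations.

```python
import re

# Strings, character literals and // comments are matched only so they can be
# passed through untouched; the last alternative is the one that is rewritten.
C_TOKEN_RE = re.compile(
    r'"(?:\\.|[^"\\])*"'               # string literal: skipped
    r"|'(?:\\.|[^'\\])*'"              # character literal: skipped
    r"|//[^\n]*"                       # existing // comment: skipped
    r"|/\*(?P<body>.*?)\*/[ \t]*\n?",  # block comment: rewritten
    re.DOTALL,
)


def _rewrite(match):
    body = match.group("body")
    if body is None:
        return match.group(0)  # not a block comment, leave as-is
    lines = ["//" + line.rstrip() for line in body.split("\n")]
    # Always end with a newline so trailing code is not swallowed by the //.
    return "\n".join(lines) + "\n"


def convert_block_comments(source):
    """Turn every /* ... */ outside of literals into // lines."""
    return C_TOKEN_RE.sub(_rewrite, source)
```

Code that follows a comment on the same line is pushed onto the next line, so the /*[out, retval]*/ style of parameter annotation still compiles but ends up split across two lines.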
Why don't you write a C app to parse its own source files? You could find the /* comments */ sections with a relatively easy regex query. You could then replace the newline characters with a newline character + "//".
Anyway, just a thought. Good luck with that.
If you write an application/script to process the C source files, here are some things to be careful of:
comment characters within strings
comment characters in the middle of a line (you might not want to split the code line)
You might be better off trying to find an application that understands how to actually parse the code as code.
There are a few suggestions that you might like to try out:
a)Write your own code (C/ Python/ any language you like) to replace the comments. Something along the lines of what regex said or this naive solution 'might' work:
[Barring cases like the one rmeador, Darron posted]
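buf = ''
flag = False  # True while inside a /* ... */ comment
for line in file:
    text = line.rstrip('\n')
    if not flag and text.startswith('/*'):
        text = text[2:]  # all characters in the line except '/*'
        flag = True
    if flag:
        if text.endswith('*/'):
            text = text[:-2]  # strip off '*/'
            flag = False
        buf += '//' + text + '\n'
    else:
        buf += line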
b)Find a tool to do it. (I'll look up some and post, if I find them.)
c)Almost all modern IDEs (if you are using one) or text editors have an auto comment feature. You can then manually open up each file, select comment lines, decide how to handle the situation and comment C++ style using an accelerator (say Ctrl + M). Then, you can simply 'Find and Replace' all "/*" and "*/", again using your judgment. I have Gedit configured to do this using the "Code Comment" plugin. I don't remember the way I did it in Vim offhand. I am sure this one can be found easily.
If there are just "several files" is it really necessary to write a program? Opening it up in a text editor might do the trick quicker in practice, unless there's a whole load of comments. emacs has a comment-region command that (unsurprisingly) comments a region, so it'd just be a case of ditching the offending '/*' and '*/'.
Very old question, I know, but I just achieved this using "pure emacs". In short, the solution looks as follows:
Run M-x query-replace-regexp. When prompted, enter
/\*\(\(.\|^J\)*?\)*\*/
as the regex to search for. The ^J is a newline, which you can enter by pressing ^Q (Ctrl+Q in most keyboards), and then pressing the enter key. Then enter
//\,(replace-regexp-in-string "[\n]\\([ ]*?\\) \\([^ ]\\)" "\n\\1// \\2" \1))
as the replacement expression.
Essentially, the idea is that you use two nested regex searches. The main one simply finds C-style comments (the *? non-greedy repetition comes in very handy for this). Then, an elisp expression is used to perform a second replacement inside the comment text only. In this case, I'm looking for newlines followed by space, and replacing the last three space characters by //, which is nice for preserving the comment formatting (works only as long as all comments are indented, though).
Changes to the secondary regex will make this approach work in other cases, for example
//\,(replace-regexp-in-string "[\n]" " " \1))
will just put the whole contents of the original comment into a single C++-style comment.
From the PHP team convention... some reasoning has to exist if the question was asked. Just answer if you know.
Never use C++ style comments (i.e. // comment). Always use C-style
comments instead. PHP is written in C, and is aimed at compiling
under any ANSI-C compliant compiler. Even though many compilers
accept C++-style comments in C code, you have to ensure that your
code would compile with other compilers as well.
The only exception to this rule is code that is Win32-specific,
because the Win32 port is MS-Visual C++ specific, and this compiler
is known to accept C++-style comments in C code.

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
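As an illustration of the last item in the list, here is roughly what that stack-trace replacement looks like when written out in Python instead of an editor's find/replace dialog. The pattern is an assumption based on the single sample line above (the address part is made optional in case some frames omit it), and function_names_only is a name invented for this sketch.

```python
import re

# "#<n>", an optional "0x... in " prefix, the function name, then the rest.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(\w+) .*$", re.MULTILINE)


def function_names_only(trace):
    """Replace each stack frame line with just its function name."""
    return FRAME_RE.sub(r"\1", trace)


if __name__ == "__main__":
    line = ("#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) "
            "at ../../mvl/src/mvl_serv.c:850")
    print(function_names_only(line))
```

In an editor the same thing is a single search for the pattern with \1 (or $1) as the replacement.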
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regex make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
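A small Python sketch of the word-boundary idea, using a harmless word for the demonstration; replace_word is a name made up for this example.

```python
import re


def replace_word(text, word, replacement):
    """Replace word only where it stands alone, not inside longer words."""
    pattern = r"\b%s\b" % re.escape(word)
    # A function is used as the replacement so backslashes in it stay literal.
    return re.sub(pattern, lambda m: replacement, text)


if __name__ == "__main__":
    print(replace_word("cat concatenate cat.", "cat", "dog"))
```

Without the \b anchors the same substitution would also rewrite the middle of "concatenate".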
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_0-9]+) .*/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
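The manual renumbering step is the part a plain editor replace cannot do, but a few lines of script can. A minimal Python sketch, assuming one column definition per line as in the text above; fields_to_setters is a name invented here, and picking setInt() versus setString() would still be done by hand.

```python
import re
from itertools import count


def fields_to_setters(ddl_fragment):
    """Turn 'name TYPE ...' lines into numbered pstmt.setString() calls."""
    position = count(1)  # JDBC parameter positions start at 1

    def to_call(match):
        return "pstmt.setString(%d, %s);" % (next(position), match.group(1))

    return re.sub(r"^[ \t]*([a-z_0-9]+) .*$", to_call, ddl_fragment,
                  flags=re.MULTILINE)


if __name__ == "__main__":
    print(fields_to_setters("field_1 VARCHAR2(100) NULL,\nfield_2 VARCHAR2(10) NULL,"))
```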
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a correct email address (syntactically speaking) into a web form, this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
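The same replacement can be reproduced outside an editor; here is a short Python sketch of it. The split_literal name and the default separator (closing quote, newline, indentation, plus sign, opening quote) are choices made for this example.

```python
import re


def split_literal(text, sep='"\n    + "'):
    """Insert sep after every run of 20 to 60 characters that ends in a space."""
    return re.sub(r".{20,60} ", lambda m: m.group(0) + sep, text)


if __name__ == "__main__":
    long_text = ("I recently discussed editors with a co-worker. He uses one of "
                 "the less popular editors and I use another.")
    print('String s = "' + split_literal(long_text) + '";')
```

Because .{20,60} is greedy, each chunk is as long as it can be while still ending on a space, which is what keeps the breaks between words.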
The first thing I do with any editor is try to figure out its regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex but could not live without anonymous keyboard macros.