Stack overflow during std::regex_replace - c++

I'm trying to execute the following C++ STL-based code to replace text in a relatively large SQL script (~8MB):
std::basic_regex<TCHAR> reProc("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$\n((^(?![ \t]*go[ \t]*).*$\n)+)^[ \t]*go[ \t]*$");
std::basic_string<TCHAR> replace = _T("ALTER $1 $2\n$3\ngo");
return std::regex_replace(strInput, reProc, replace);
The result is a stack overflow, and it's hard to find information about that particular error on this particular site since that's also the name of the site.
Edit: I am using Visual Studio 2013 Update 5
Edit 2: The original file is over 23,000 lines. I cut the file down to 3,500 lines and still get the error. When I cut it by another ~50 lines down to 3,456 lines, the error goes away. If I put just those cut lines into the file, the error is still gone. This suggests that the error is not related to specific text, but just too much of it.
Edit 3: A full working example is demonstrated operating properly here:
https://regex101.com/r/iD1zY6/1
It doesn't work in that STL code, though.

The following trimmed-down version of your regex saves about 20% of processing steps according to regex101 (see here).
\\bcreate[ \t]+(view|procedure|proc)[ \t]+(.+)\n(((?![ \t]*go[ \t]*).*\n)+)[ \t]*go[ \t]*
Modifications:
inline anchors removed: you are expressly testing for newline characters
repetition operator for the db object keywords removed - a repetition at this point would make the original script syntactically invalid.
initial whitespace pattern replaced by word boundary (note the double backslash - the escape sequence is for the regex engine, not for the compiler)
If you can be sure that ...
the create ... statements do not occur in string literals, and
you do not need to distinguish between create ... statements followed by a go or not (eg. because all statements are trailed by a go)
...it might even be easier to just replace these strings:
std::basic_regex<TCHAR> reProc("\bcreate[ \t]+(view|procedure|proc)");
std::basic_string<TCHAR> replace = _T("ALTER $1");
return std::regex_replace(strInput, reProc, replace);
(Here is a demo for the latter approach - reduces the steps to a little more than 1/4 th).

It turns out that STL regular expressions are tragic under-performers versus Perl (about 100 times slower if you can believe https://stackoverflow.com/a/37016671/78162), so it's apparently necessary to absolutely minimize the use of regular expressions in STL/C++ when performance is a serious concern. (The degree to which C++/STL under-performs here blew my mind considering I presume C++ to generally be one of the more performant languages). I ended up passing the file stream to read one line at a time and only run the expression on lines that needed processing like this:
std::basic_string<TCHAR> result;
std::basic_string<TCHAR> line;
std::basic_regex<TCHAR> reProc(_T("^[ \t]*create[ \t]+(view|procedure|proc)+[ \t]+(.+)$"), std::regex::optimize);
std::basic_string<TCHAR> replace = _T("ALTER $1 $2");
do {
std::getline(input, line);
int pos = line.find_first_not_of(_T(" \t"));
if ((pos != std::basic_string<TCHAR>::npos)
&& (_tcsnicmp(line.substr(pos, 6).data(), _T("create"), 6)==0))
result.append(std::regex_replace(line, reProc, replace));
else
result.append(line);
result.append(_T("\n"));
} while (!input.eof());
return result;

Related

D: split string by comma, but not quoted string

I need to split string by comma, that not quoted like:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;
void main()
{
auto str = `foo,bar,"hello, user",baz`;
foreach (row; csvReader(str))
{
writeln(row);
}
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data. As std.csv wouldn't correctly parse it, because of space () before first quote (").
You can use next snippet to complete this task:
File fileContent;
string fileFullName = `D:\code\test\example.csv`;
fileContent = File (fileFullName, "r");
auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
foreach(line;fileContent.byLine)
{
auto result = split(line, r);
writeln(result);
}
If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates compared to just reading the characters because you might then have to recombine lines later when you find one ends with a closing quote! And then your beautiful byLine loop suddenly isn't so beautiful.
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.

Minify HTML with Boost regex in C++

Question
How to minify HTML using C++?
Resources
An external library could be the answer, but I'm more looking for improvements of my current code. Although I'm all ears for other possibilities.
Current code
This is my interpretation in c++ of the following answer.
The only part I had to change from the original post is this part on top: "(?ix)"
...and a few escape signs
#include <boost/regex.hpp>
void minifyhtml(string* s) {
boost::regex nowhitespace(
"(?ix)"
"(?>" // Match all whitespans other than single space.
"[^\\S ]\\s*" // Either one [\t\r\n\f\v] and zero or more ws,
"| \\s{2,}" // or two or more consecutive-any-whitespace.
")" // Note: The remaining regex consumes no text at all...
"(?=" // Ensure we are not in a blacklist tag.
"[^<]*+" // Either zero or more non-"<" {normal*}
"(?:" // Begin {(special normal*)*} construct
"<" // or a < starting a non-blacklist tag.
"(?!/?(?:textarea|pre|script)\\b)"
"[^<]*+" // more non-"<" {normal*}
")*+" // Finish "unrolling-the-loop"
"(?:" // Begin alternation group.
"<" // Either a blacklist start tag.
"(?>textarea|pre|script)\\b"
"| \\z" // or end of file.
")" // End alternation group.
")" // If we made it here, we are not in a blacklist tag.
);
// #todo Don't remove conditional html comments
boost::regex nocomments("<!--(.*)-->");
*s = boost::regex_replace(*s, nowhitespace, " ");
*s = boost::regex_replace(*s, nocomments, "");
}
Only the first regex is from the original post, the other one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I try to accomplish though.
Regexps are a powerful tool, but I think that using them in this case will be a bad idea. For example, regexp you provided is maintenance nightmare. By looking at this regexp you can't quickly understand what the heck it is supposed to match.
You need a html parser that would tokenize input file, or allow you to access tokens either as a stream or as an object tree. Basically read tokens, discards those tokens and attributes you don't need, then write what remains into output. Using something like this would allow you to develop solution faster than if you tried to tackle it using regexps.
I think you might be able to use xml parser or you could search for xml parser with html support.
In C++, libxml (which might have HTML support module), Qt 4, tinyxml, plus libstrophe uses some kind of xml parser that could work.
Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike python, python has "Beautiful Soup" module that would work very well for this kind of problem.
Qt 4 might work because it provides decent unicode string type (and you'll need it if you're going to parse html).

Creating a simple parser in (V)C++ (2010) similar to PEG

For an school project, I need to parse a text/source file containing a simplified "fake" programming language to build an AST. I've looked at boost::spirit, however since this is a group project and most seems reluctant to learn extra libraries, plus the lecturer/TA recommended leaning to create a simple one on C++. I thought of going that route. Is there some examples out there or ideas on how to start? I have a few attempts but not really successful yet ...
parsing line by line
Test each line with a bunch of regex (1 for procedure/function declaration), one for assignment, one for while etc...
But I will need to assume there are no multiple statements in one line: eg. a=b;x=1;
When I reach a container statement, procedures, whiles etc, I will increase the indent. So all nested statements will go under this
When I reach a } I will decrement indent
Any better ideas or suggestions? Example code I need to parse (very simplified here ...)
procedure Hello {
a = 1;
while a {
b = a + 1 + z;
}
}
Another idea was to read whole file into a string, and go top down. Match all procedures, then capture everything in { ... } then start matching statements (end with ;) or containers while { ... }. This is similar to how PEG does things? But I will need to read entire file
Multipass makes things easier. On a first pass, split things into tokens, like "=", or "abababa", or a quote-delimited string, or a block of whitespace. Don't be destructive (keep the original data), but break things down to simple chunks, and maybe have a little struct or enum that describes what the token is (ie, whitespace, a string literal, an identifier type thing, etc).
So your sample code gets turned into:
identifier(procedure) whitespace( ) identifier(Hello) whitespace( ) operation({) whitespace(\n\t) identifier(a) whitespace( ) operation(=) whitespace( ) number(1) operation(;) whitespace(\n\t) etc.
In those tokens, you might also want to store line number and offset on the line (this will help with error message generation later).
A quick test would be to turn this back into the original text. Another quick test might be to dump out pretty-printed version in html or something (where you color whitespace to have a pink background, identifiers as light blue, operations as light green, numbers as light orange), and see if your tokenizer is making sense.
Now, your language may be whitespace insensitive. So discard the whitespace if that is the case! (C++ isn't, because you need newlines to learn when // comments end)
(Note: a professional language parser will be as close to one-pass as possible, because it is faster. But you are a student, and your goal should be to get it to work.)
So now you have a stream of such tokens. There are a bunch of approaches at this point. You could pull out some serious parsing chops and build a CFG to parse them. (Do you know what a CFG is? LR(1)? LL(1)?)
An easier method might be to do it a bit more ad-hoc. Look for operator({) and find the matching operator(}) by counting up and down. Look for language keywords (like procedure), which then expects a name (the next token), then a block (a {). An ad-hoc parser for a really simple language may work fine.
I've done exactly this for a ridiculously simple language, where the parser consisted of a really simple PDA. It might work for you guys. Or it might not.
Since you mentioned PEG i'll like to throw in my open source project : https://github.com/leblancmeneses/NPEG/tree/master/Languages/npeg_c++
Here is a visual tool that can export C++ version: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-language-workbench
Documentation for rule grammar: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-dsl-documentation
If i was writing my own language I would probably look at the terminals/non-terminals found in System.Linq.Expressions as these would be a great start for your grammar rules.
http://msdn.microsoft.com/en-us/library/system.linq.expressions.aspx
System.Linq.Expressions.Expression
System.Linq.Expressions.BinaryExpression
System.Linq.Expressions.BlockExpression
System.Linq.Expressions.ConditionalExpression
System.Linq.Expressions.ConstantExpression
System.Linq.Expressions.DebugInfoExpression
System.Linq.Expressions.DefaultExpression
System.Linq.Expressions.DynamicExpression
System.Linq.Expressions.GotoExpression
System.Linq.Expressions.IndexExpression
System.Linq.Expressions.InvocationExpression
System.Linq.Expressions.LabelExpression
System.Linq.Expressions.LambdaExpression
System.Linq.Expressions.ListInitExpression
System.Linq.Expressions.LoopExpression
System.Linq.Expressions.MemberExpression
System.Linq.Expressions.MemberInitExpression
System.Linq.Expressions.MethodCallExpression
System.Linq.Expressions.NewArrayExpression
System.Linq.Expressions.NewExpression
System.Linq.Expressions.ParameterExpression
System.Linq.Expressions.RuntimeVariablesExpression
System.Linq.Expressions.SwitchExpression
System.Linq.Expressions.TryExpression
System.Linq.Expressions.TypeBinaryExpression
System.Linq.Expressions.UnaryExpression

Replace C style comments by C++ style comments

How can I automatically replace all C style comments (/* comment */) by C++ style comments (// comment)?
This has to be done automatically in several files. Any solution is okay, as long as it works.
This tool does the job:
https://github.com/cenit/jburkardt/tree/master/recomment
RECOMMENT is a C++ program which
converts C style comments to C++ style
comments.
It also handles all the non-trivial cases mentioned by other people:
This code incorporates suggestions and
coding provided on 28 April 2005 by
Steven Martin of JDS Uniphase,
Melbourne Florida. These suggestions
allow the program to ignore the
internal contents of strings, (which
might otherwise seem to begin or end
comments), to handle lines of code
with trailing comments, and to handle
comments with trailing bits of code.
This is not a trivial problem.
int * /* foo
/* this is not the beginning of a comment.
int * */ var = NULL;
What do you want to replace that with? Any real substitution requires sometimes splitting lines.
int * // foo
// this is not the beginning of a comment.
// int *
var = NULL;
How do you intend to handle situations like this:
void CreateExportableDataTable(/*[out, retval]*/ IDispatch **ppVal)
{
//blah
}
Note the comment inside the parens... this is a common way of documenting things in generated code, or mentioning default parameter values in the implementation of a class, etc. I'm usually not a fan of such uses of comments, but they are common and need to be considered. I don't think you can convert them to C++ style comments without doing some heavy thinking.
I'm with the people who commented in your question. Why do it? Just leave it.
it wastes time, adds useless commits to version control, risk of screwing up
EDIT:
Adding details from the comments from the OP
The fundamental reason of preferring C++-style comment is that you can comment out a block of code which may have comments in it. If that comment is in C-style, this block-comment-out of code is not straight forward. – unknown (yahoo)
that might be a fair/ok thing to want to do, but I have two comments about that:
I know of no one who would advocate changing all existing code - that is a preference for new code. (IMO)
If you feel the need to "comment out code" (another iffy practice) then you can do it as needed - not before
It also appears that you want to use the c-style comments to block out a section of code? Or are you going to use the // to block out many lines?
One alternative is a preprocessor #ifdef for that situation. I cringe at that but it is just as bad as commenting out lines/blocks. Neither should be left in the production code.
I recently converted all C-style comments to C++-style for all files in our repository. Since I could not find a tool that would do it automatically, I wrote my own: c-comments-to-cpp
It is not fool-proof, but way better than anything else I've tried (including RECOMMENT). Among other things, it supports converting Doxygen style comments, for instance:
/**
* #brief My foo struct.
*/
struct foo {
int bar; /*!< This is a member.
It also has a meaning. */
};
Gets converted to:
/// #brief My foo struct.
struct foo {
int bar; ///< This is a member.
///< It also has a meaning.
};
Here's a Python script that will (mostly) do the job. It handles most edge cases, but it does not handle comment characters inside of strings, although that should be easy to fix.
#!/usr/bin/python
import sys
out = ''
in_comment = False
file = open(sys.argv[1], 'r+')
for line in file:
if in_comment:
end = line.find('*/')
if end != -1:
out += '//' + line[:end] + '\n'
out += ' ' * (end + 2) + line[end+2:]
in_comment = False
else:
out += '//' + line
else:
start = line.find('/*')
cpp_start = line.find('//')
if start != -1 and (cpp_start == -1 or cpp_start > start):
out += line[:start] + '//' + line[start+2:]
in_comment = True
else:
out += line
file.seek(0)
file.write(out)
Why don't you write a C app to parse it's own source files? You could find the /* comments */ sections with a relatively easy Regex query. You could then replace the new line characters with new line character + "//".
Anyway, just a thought. Good luck with that.
If you write an application/script to process the C source files, here are some things to be careful of:
comment characters within strings
comment characters in the middle of a line (you might not want to split the code line)
You might be better off trying to find an application that understands how to actually parse the code as code.
There are a few suggestions that you might like to try out:
a)Write your own code (C/ Python/ any language you like) to replace the comments. Something along the lines of what regex said or this naive solution 'might' work:
[Barring cases like the one rmeador, Darron posted]
for line in file:
if line[0] == "\*":
buf = '//' + all charachters in the line except '\*'
flag = True
if flag = True:
if line ends with '*/':
strip off '*/'
flag = False
add '//' + line to buf
b)Find a tool to do it. (I'll look up some and post, if I find them.)
c)Almost all modern IDE's (if you are using one) or text editors have an auto comment feature. You can then manually open up each file, select comment lines, decide how to handle the situation and comment C++ style using an accelerator (say Ctrl + M). Then, you can simply 'Find and Replace' all "/*" and "*/", again using your judgment. I have Gedit configured to do this using the "Code Comment' plugin. I don't remember the way I did it in Vim off hand. I am sure this one can be found easily.
If there are just "several files" is it really necessary to write a program? Opening it up in a text editor might do the trick quicker in practice, unless there's a whole load of comments. emacs has a comment-region command that (unsurprisingly) comments a region, so it'd just be a case of ditching the offending '/*' and '*/'.
Very old question, I know, but I just achieved this using "pure emacs". In short, the solution looks as follows:
Run M-x query-replace-regexp. When prompted, enter
/\*\(\(.\|^J\)*?\)*\*/
as the regex to search for. The ^J is a newline, which you can enter by pressing ^Q (Ctrl+Q in most keyboards), and then pressing the enter key. Then enter
//\,(replace-regexp-in-string "[\n]\\([ ]*?\\) \\([^ ]\\)" "\n\\1// \\2" \1))
as the replacement expression.
Essentially, the idea is that you use two nested regex searches. The main one simply finds C-style comments (the *? eager repetition comes very handy for this). Then, an elisp expression is used to perform a second replacement inside the comment text only. In this case, I'm looking for newlines followed by space, and replacing the last three space characters by //, which is nice for preserving the comment formatting (works only as long as all comments are indented, though).
Changes to the secondary regex will make this approach work in other cases, for example
//\,(replace-regexp-in-string "[\n]" " " \1))
will just put the whole contents of the original comment into a single C++-style comment.
from PHP team convention... some reasonning has to exist if the question was asked. Just answer if you know.
Never use C++ style comments (i.e. // comment). Always use C-style
comments instead. PHP is written in C, and is aimed at compiling
under any ANSI-C compliant compiler. Even though many compilers
accept C++-style comments in C code, you have to ensure that your
code would compile with other compilers as well.
The only exception to this rule is code that is Win32-specific,
because the Win32 port is MS-Visual C++ specific, and this compiler
is known to accept C++-style comments in C code.

Why is this regular expression faster?

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
if (rTest.IsMatch(formattedString))
return rTest.Replace(formattedString, string.Empty);
else
return formattedString;
}
I'm new to regular expressions and I was suggested to use this:
static Regex rText = new Regex(#"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
new Regex(#"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster to and reduced the impact on my text rendering. Can someone explain to a regexp newbie, why? :)
Do you really want to do run the regexp twice? Without having checked (bad me) I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? is going to do lazy quantifing. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.
I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]
Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)