I want to perform a batch replace operation on a project by following some rules. For example, I am taking notes in the code like this:
On every code piece, which is commented like this, I want to perform a replace operation, which will replace the input code piece with the output code piece in the following examples:
Input 1:
//+
a++;
//+(+SomeException$SomeMessage)
Output 1:
try
{
a++;
}
catch (AnException)
{
throw;
}
catch (Exception ex)
{
throw new SomeException("SomeMessage", "15", ex);
}
Input 2:
//+
a++;
//-(+InvalidOperationException$SomeMessage)
Output 2:
try
{
a++;
}
catch (InvalidOperationException ex)
{
throw new AnException("SomeMessage", "16", ex);
}
Input 3:
//+
a++;
//-(SomeMessage)
Output 3:
try
{
a++;
}
catch (Exception ex)
{
throw new AnException("SomeMessage", "17", ex);
}
The magic numbers (15, 16, 17) will increase for each code piece commented like this. I know this is not the best practice, but I am not making the decisions and I am expected to handle exceptions like this, so I thought I could ease the pain by taking notes and batch replacing at the end. What is the best way to do this? Should I write my own code to perform the replacements, or does some regex replace tool or something similar exist that can do this for me automatically?
Update: This is a one time job and my magic number has to be globally unique. So if it was 25 for the last match in a file, it must be 26 for the first match in the next file.
What is the best way to do this? Should I write my own code to perform the replacements, or does some regex replace tool or something similar exist that can do this for me automatically?
I'd write a little program in C++ or C# to do this. There are presumably other tools and script languages that can do it; but given that it's a trivial job in C++ or C# and given that I already know how to do it in these languages, why not?
I don't know what you mean by the "best" way, but for me at least this would be one of the easiest ways.
This looks like a simple language that you're going to compile into another language that looks like Java. A compiler is the right tool for a job like this, especially because you need to keep around the state of the current magic number. It also seems likely that whoever is making the decisions would want to add new features to the language, in which case a solution glued together with regular expressions might not work properly.
If I'm right about what you really want, your question is reduced to the problem of "How do I write a Domain Specific Language?" I'm not sure what the best method would be for this, but if you know Perl you could probably put together a solution with Parse::RecDescent.
I think it's possible to do this with scripting and regular expressions, but this is the type of problem for which compilers were invented. If you end up making something hacky, God help the person that has to maintain it after you! :)
You could write a CodeSmith template that reads that input and outputs that output. But I'm not sure you could do it in-line. That is, you would need a file of just inputs and then your template could give you the file of outputs. I'm not sure if that is acceptable, though.
There's a lot of ways you could do this, even though you probably shouldn't (as you seem to realize, this will just result in meaningless exceptions). Nevertheless, here's a sed/sh combo to do the first one. It doesn't handle the autonumbering or your other variants. I'll leave that as an exercise for the OP.
P1='\/\/+'; P2='\(.*\)'; P3='\/\/+(+\([^$]*\)$\(.*\))';
echo 'foo()\n//+\na++\n//+(+SomeException$Message)'|sed ' /'$P1'/ { N; /'$P2'/ { N; /'$P3'/ { s/'$P1'\n'$P2'\n'$P3'/try\n{\n\t\1\n}\ncatch (AnException)\n{\n\tthrow;\n}\ncatch (Exception ex)\n{\n\tthrow new \2("\3", "0", ex);\n}/ } } } '
The echo is just a test string.
As an Emacs user, for a one time job I'd do this by defining keyboard macros,
then use set/increment/insert-register for the autonumbering magic. There
shouldn't really be any need for writing your own elisp functions.
Though if you need to perform this on more than just a couple of files, you'll
probably be better off writing a script to do the job.
If you do not happen to use an IDE like Emacs (as answered by many) with strong regex support, I would write a little script. Note that text manipulation is in general more of a scripting operation, e.g. in Perl or Ruby, due to regex support in the language itself. On the other hand, if you are very familiar with, say, Java's Pattern, then writing it in Java is probably the fastest solution, even if it means more overhead, especially for a one-time operation.
So a little Ruby script might look like this (beware, I did not test it):
$cnt = 1
IO.readlines(filename).collect { |line|
  if line =~ /^\s*\/\/\+\s*$/
    $cnt += 1
    ["try\n", "{\n" ]
  elsif line =~ /^\s*\/\/\+\(\+(.+)\$(.+)\)\s*/
    ["}\n", "catch (#{$1} ex)\n", "{\n",
     "throw new AnException(\"#{$2}\", \"#{$cnt}\", ex);\n", "}\n"]
  # probably more elsif branches for all cases
  else
    line
  end
}.flatten
# save the file again
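For comparison, here is a minimal Python sketch of the "write a little program" route that covers all three marker variants from the question and keeps the number running across files. The names `BLOCK` and `rewrite` are made up for illustration, the output is left unindented, and it assumes the message never contains a closing parenthesis; treat it as a starting point rather than a finished tool.

```python
import itertools
import re

# One annotated block: an opening "//+" line, the guarded code, a closing marker.
BLOCK = re.compile(
    r"^[ \t]*//\+[ \t]*\n"
    r"(?P<body>.*?)"
    r"^[ \t]*//(?P<sign>[+-])\((?P<spec>[^)\n]*)\)[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def rewrite(source, numbers):
    """Replace every annotated block in source, drawing ids from numbers."""

    def replace(match):
        body = match.group("body").rstrip("\n")
        sign, spec = match.group("sign"), match.group("spec")
        if spec.startswith("+"):
            # "+ExceptionName$Message"
            exception, _, message = spec[1:].partition("$")
        else:
            exception, message = None, spec
        if sign == "+" and exception is None:
            return match.group(0)  # variant not described in the question
        number = next(numbers)
        if sign == "+":
            # Input 1: rethrow AnException, wrap everything else
            handlers = [
                "catch (AnException)", "{", "throw;", "}",
                "catch (Exception ex)", "{",
                f'throw new {exception}("{message}", "{number}", ex);', "}",
            ]
        else:
            # Inputs 2 and 3: catch one type (default Exception), throw AnException
            caught = exception or "Exception"
            handlers = [
                f"catch ({caught} ex)", "{",
                f'throw new AnException("{message}", "{number}", ex);', "}",
            ]
        return "\n".join(["try", "{", body, "}"] + handlers)

    return BLOCK.sub(replace, source)


# Sharing one counter across calls keeps the number globally unique:
numbers = itertools.count(15)
first_file = rewrite("//+\na++;\n//+(+SomeException$SomeMessage)\n", numbers)
second_file = rewrite("//+\na++;\n//-(SomeMessage)\n", numbers)
```

Passing the same `numbers` iterator to every call is what makes the first match in the next file continue from the last match in the previous one.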
I need to split a string on commas that are not inside quotes, like:
foo, bar, "hello, user", baz
to get:
foo
bar
hello, user
baz
Using std.csv:
import std.csv;
import std.stdio;

void main()
{
    auto str = `foo,bar,"hello, user",baz`;
    foreach (row; csvReader(str))
    {
        writeln(row);
    }
}
Application output:
["foo", "bar", "hello, user", "baz"]
Note that I modified your CSV example data, because std.csv wouldn't parse it correctly with the space ( ) before the first quote (").
You can use the following snippet to complete this task:
// needs: import std.stdio; import std.regex;
File fileContent;
string fileFullName = `D:\code\test\example.csv`;
fileContent = File(fileFullName, "r");
auto r = regex(`(?!\B"[^"]*),(?![^"]*"\B)`);
foreach (line; fileContent.byLine)
{
    auto result = split(line, r);
    writeln(result);
}
If you are parsing a specific file format, splitting by line and using regex often isn't correct, though it will work in many cases. I prefer to read it in character by character and keep a few flags for state (or use someone else's function where appropriate that does it for you for this format). D has std.csv: http://dlang.org/phobos/std_csv.html or my old old csv.d which is minimal but basically works too: https://github.com/adamdruppe/arsd/blob/master/csv.d (haha 5 years ago was my last change to it, but hey, it still works)
Similarly, you can kinda sorta "parse" html with regex... sometimes, but it breaks pretty quickly outside of simple cases and you are better off using an actual html parser (which probably is written to read char by char!)
Back to quoted commas, reading csv, for example, has a few rules with quoted content: first, of course, commas can appear inside quotes without going to the next field. Second, newlines can also appear inside quotes without going to the next row! Third, two quote characters in a row is an escaped quote that is in the content, not a closing quote.
foo,bar
"this item has
two lines, a comma, and a "" mark!",this is just bar
I'm not sure how to read that with regex (eyeballing, I'm pretty sure yours gets the escaped quote wrong at least), but it isn't too hard to do when reading one character at a time (my little csv reader is about fifty lines, doing it by hand). Splitting the lines ahead of time also complicates compared to just reading the characters because you might then have to recombine lines later when you find one ends with a closing quote! And then your beautiful byLine loop suddenly isn't so beautiful.
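To make the character-by-character idea concrete, here is a minimal sketch in Python (not the csv.d code mentioned above). The function name `parse_csv` is made up for illustration, and it only handles the three rules listed: commas inside quotes, newlines inside quotes, and doubled quotes. It assumes `\n` line endings; in real Python code the standard `csv` module already does this.

```python
def parse_csv(text):
    """Split CSV text into rows of fields using a single in_quotes flag."""
    rows, row, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')  # doubled quote: literal quote character
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # commas and newlines are plain content here
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            row.append("".join(field))
            field = []
        elif ch == "\n":
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
        else:
            field.append(ch)
        i += 1
    if field or row:  # last row without a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'foo,bar\n"this item has\ntwo lines, a comma, and a "" mark!",this is just bar\n'
rows = parse_csv(sample)
```

Because the reader never splits on lines up front, the quoted newline needs no special recombination step.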
Besides, when looking back later, I find simple character readers and named functions to be more understandable than a regex anyway.
So, your answer is correct for the limited scope you asked about, but might be missing the big picture of other cases in the file format you are actually trying to read.
edit: one last thing I want to pontificate on, these corner cases in CSV are an example of why people often say "don't reinvent the wheel". It isn't that they are really hard to handle - look at my csv.d code, it is short, pretty simple, and works at everything I've thrown at it - but that's the rub, isn't it? "Everything I've thrown at it". To handle a file format, you need to be aware of what the corner cases are so you can handle them, at least if you want it to be generic and take arbitrary user input. Knowing these edge cases tends to come more from real world experience than just taking a quick glance. Once you know them though, writing the code again isn't terribly hard, you know what to test for! But if you don't know it, you can write beautiful code with hundreds of unittests... but miss the real world case your user just happens to try that one time it matters.
My problem: long chemical terms, without any guidance to a browser about where to break the term. Some terms are over 70 characters.
My goal: introduce <wbr> at logical insertion points.
Example of problem:
isoquinolinetetramethylenesulfoxidetetrachlororuthenate (55 chars)
Example of opportunities to break a chemical term (e.g. the way a person would pronounce the term as opposed to typing the term):
iso<wbr>quinoline
tetra<wbr>methylene
methylene<wbr>sulfoxide
tetra<wbr>chloro
Usually (but not always) iso, tetra, and methyl are word_break_opportunities.
In general how should I set up an environment with:
control file with "rules" that introduce word_break opportunities
file on which to apply the rules from the control file
The control file will be updated with new rules as new chemical terms are encountered.
Would like to use: sed, awk, regex.
Perhaps the environment would look like:
awk rules.awk inputfile.txt > outputfile.txt
Am prepared for trial and error so would appreciate basic explanation so I can refine the control file.
My platform: Windows 7; 64 bit; 8 GB memory; GNUwin32; sed 4.1.5.4013; awk 3.1.6.2962
Thank you in advance.
Your first job is to come up with a list of what is and isn't breakable. Once you have this you can define a format to interpret, and build some code around it.
For example, I would probably go something like:
1. Opening chars:
iso
tetra
then some code like:
for each openingString {
    if (string.startsWith(openingString)) {
        insert wbr after openingString
    }
}
2. Opening chars, unless followed by:
iso|"tope|bob"
tetra|"pak"
for each openingString {
    if (string.startsWith(openingString)) {
        get the exception list from the row (after the |, surrounded by ")
        split it around the |
        blocked = false
        for each part {
            if (string.startsWith(part, openingString.length)) {
                blocked = true
            }
        }
        if (!blocked) {
            insert wbr after openingString
        }
    }
}
Then build up from there. It's a pretty monumental task, though, and it will take a lot of building on to get to something useful, but it's doable if you're committed to it. The first task is to decide how you're going to hold these mappings.
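One possible way to hold the mappings, sketched in Python rather than awk or sed: a table from each fragment to the continuations that suppress the break. `RULES` and `add_word_breaks` are hypothetical names, and unlike the pseudocode above this version looks for fragments anywhere in the term, not only at the start, since the examples in the question break in the middle of a term too.

```python
import re

# Hypothetical rule table: fragment -> continuations that must NOT be split off.
RULES = {
    "iso": ["tope", "bob"],
    "tetra": ["pak"],
}


def add_word_breaks(term, rules=RULES):
    """Insert <wbr> after each rule fragment unless an exception follows it."""
    breaks = set()
    for fragment, exceptions in rules.items():
        for match in re.finditer(re.escape(fragment), term):
            rest = term[match.end():]
            # No break at the very end, and none before a listed exception.
            if rest and not any(rest.startswith(e) for e in exceptions):
                breaks.add(match.end())
    pieces = []
    for index, char in enumerate(term):
        if index in breaks:
            pieces.append("<wbr>")
        pieces.append(char)
    return "".join(pieces)


broken = add_word_breaks("isoquinolinetetramethylenesulfoxidetetrachlororuthenate")
```

Collecting the break positions first and inserting afterwards keeps the rules from interfering with each other's `<wbr>` output. Loading `RULES` from a control file in the `iso|"tope|bob"` format would be a small parsing step on top of this.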
For a school project, I need to parse a text/source file containing a simplified "fake" programming language to build an AST. I've looked at boost::spirit; however, this is a group project and most of the group seem reluctant to learn extra libraries, and the lecturer/TA recommended learning to create a simple parser in C++, so I thought of going that route. Are there some examples out there, or ideas on how to start? I have made a few attempts, but none has really been successful yet ...
parsing line by line
Test each line with a bunch of regexes (one for procedure/function declaration, one for assignment, one for while, etc.)
But I will need to assume there are no multiple statements on one line, e.g. a=b;x=1;
When I reach a container statement, procedures, whiles etc, I will increase the indent. So all nested statements will go under this
When I reach a } I will decrement indent
Any better ideas or suggestions? Example code I need to parse (very simplified here ...)
procedure Hello {
	a = 1;
	while a {
		b = a + 1 + z;
	}
}
Another idea was to read whole file into a string, and go top down. Match all procedures, then capture everything in { ... } then start matching statements (end with ;) or containers while { ... }. This is similar to how PEG does things? But I will need to read entire file
Multipass makes things easier. On a first pass, split things into tokens, like "=", or "abababa", or a quote-delimited string, or a block of whitespace. Don't be destructive (keep the original data), but break things down to simple chunks, and maybe have a little struct or enum that describes what the token is (ie, whitespace, a string literal, an identifier type thing, etc).
So your sample code gets turned into:
identifier(procedure) whitespace( ) identifier(Hello) whitespace( ) operation({) whitespace(\n\t) identifier(a) whitespace( ) operation(=) whitespace( ) number(1) operation(;) whitespace(\n\t) etc.
In those tokens, you might also want to store line number and offset on the line (this will help with error message generation later).
A quick test would be to turn this back into the original text. Another quick test might be to dump out pretty-printed version in html or something (where you color whitespace to have a pink background, identifiers as light blue, operations as light green, numbers as light orange), and see if your tokenizer is making sense.
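A minimal sketch of such a first pass, written in Python for brevity (the same structure carries over to C++ with a struct and an enum). The token kinds mirror the ones named above; `Token`, `TOKEN_RE` and `tokenize` are illustrative names, and the operator set is just what the sample code needs.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str     # "whitespace", "identifier", "number" or "operation"
    text: str     # the exact source text, so nothing is lost
    line: int     # 1-based line where the token starts
    column: int   # 1-based column where the token starts


TOKEN_RE = re.compile(
    r"(?P<whitespace>\s+)"
    r"|(?P<identifier>[A-Za-z_]\w*)"
    r"|(?P<number>\d+)"
    r"|(?P<operation>[{}=+;])"
)


def tokenize(source):
    """Split source into tokens without discarding any characters."""
    tokens = []
    pos, line, column = 0, 1, 1
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(
                f"unexpected {source[pos]!r} at line {line}, column {column}"
            )
        text = match.group(0)
        tokens.append(Token(match.lastgroup, text, line, column))
        # Advance the line/column bookkeeping past this token.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            column = len(text) - text.rfind("\n")
        else:
            column += len(text)
        pos = match.end()
    return tokens


source = "procedure Hello {\n\ta = 1;\n}\n"
tokens = tokenize(source)
# The round-trip check suggested above:
round_trip = "".join(token.text for token in tokens)
```

Since every token keeps its original text, joining the tokens reproduces the input exactly, which is the quick sanity test described above.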
Now, your language may be whitespace insensitive. So discard the whitespace if that is the case! (C++ isn't, because you need newlines to learn when // comments end)
(Note: a professional language parser will be as close to one-pass as possible, because it is faster. But you are a student, and your goal should be to get it to work.)
So now you have a stream of such tokens. There are a bunch of approaches at this point. You could pull out some serious parsing chops and build a CFG to parse them. (Do you know what a CFG is? LR(1)? LL(1)?)
An easier method might be to do it a bit more ad-hoc. Look for operator({) and find the matching operator(}) by counting up and down. Look for language keywords (like procedure), which then expects a name (the next token), then a block (a {). An ad-hoc parser for a really simple language may work fine.
I've done exactly this for a ridiculously simple language, where the parser consisted of a really simple PDA. It might work for you guys. Or it might not.
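As an illustration of the ad-hoc approach (not the parser mentioned above), here is a small hand-written parser in Python for the sample language in the question. The grammar is guessed from that one example: procedures, `while` blocks, and assignments whose right-hand side is a chain of `+`. The `Parser` class and the tuple-based AST are illustrative choices; the throwaway tokenizer drops whitespace and silently skips characters it does not know.

```python
import re


def tokenize(source):
    # Whitespace is discarded: this toy language is whitespace-insensitive.
    return re.findall(r"[A-Za-z_]\w*|\d+|[{}=+;]", source)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, wanted=None):
        """Consume the next token, optionally checking that it is `wanted`."""
        token = self.peek()
        if token is None or (wanted is not None and token != wanted):
            raise SyntaxError(f"expected {wanted!r}, got {token!r}")
        self.pos += 1
        return token

    def parse_program(self):
        procedures = []
        while self.peek() is not None:
            procedures.append(self.parse_procedure())
        return procedures

    def parse_procedure(self):
        self.expect("procedure")  # keyword, then a name, then a block
        name = self.expect()
        return ("procedure", name, self.parse_block())

    def parse_block(self):
        self.expect("{")
        statements = []
        while self.peek() != "}":
            statements.append(self.parse_statement())
        self.expect("}")
        return statements

    def parse_statement(self):
        if self.peek() == "while":
            self.expect("while")
            condition = self.expect()
            return ("while", condition, self.parse_block())
        target = self.expect()
        self.expect("=")
        operands = [self.expect()]
        while self.peek() == "+":
            self.expect("+")
            operands.append(self.expect())
        self.expect(";")
        return ("assign", target, operands)


source = "procedure Hello {\n a = 1;\n while a {\n  b = a + 1 + z;\n }\n}\n"
ast = Parser(tokenize(source)).parse_program()
```

Each `parse_*` method handles one construct and calls the others for nested parts, so nesting falls out of the call stack instead of manual brace counting.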
Since you mentioned PEG, I'd like to throw in my open source project: https://github.com/leblancmeneses/NPEG/tree/master/Languages/npeg_c++
Here is a visual tool that can export C++ version: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-language-workbench
Documentation for rule grammar: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-dsl-documentation
If I were writing my own language, I would probably look at the terminals/non-terminals found in System.Linq.Expressions, as these would be a great start for your grammar rules.
http://msdn.microsoft.com/en-us/library/system.linq.expressions.aspx
System.Linq.Expressions.Expression
System.Linq.Expressions.BinaryExpression
System.Linq.Expressions.BlockExpression
System.Linq.Expressions.ConditionalExpression
System.Linq.Expressions.ConstantExpression
System.Linq.Expressions.DebugInfoExpression
System.Linq.Expressions.DefaultExpression
System.Linq.Expressions.DynamicExpression
System.Linq.Expressions.GotoExpression
System.Linq.Expressions.IndexExpression
System.Linq.Expressions.InvocationExpression
System.Linq.Expressions.LabelExpression
System.Linq.Expressions.LambdaExpression
System.Linq.Expressions.ListInitExpression
System.Linq.Expressions.LoopExpression
System.Linq.Expressions.MemberExpression
System.Linq.Expressions.MemberInitExpression
System.Linq.Expressions.MethodCallExpression
System.Linq.Expressions.NewArrayExpression
System.Linq.Expressions.NewExpression
System.Linq.Expressions.ParameterExpression
System.Linq.Expressions.RuntimeVariablesExpression
System.Linq.Expressions.SwitchExpression
System.Linq.Expressions.TryExpression
System.Linq.Expressions.TypeBinaryExpression
System.Linq.Expressions.UnaryExpression
How can I move away (in C++) from annoying menus like:
(a) Do something
(b) Do something else
(c) Do that 3rd thing
(x) exit
Basically I want to be able to run the program then do something like "calc 32 / 5" or "open data.csv", where obviously I would have written the code for "calc" and "open". Just a shove in the right direction would be great, I am sure I can figure it all out, I just need something to google-fu.
I think you want to do something like this:
string cmd;
cout << "Enter your command:" << endl;
cin >> cmd;
if (cmd == "open") {
    // read file name and open file
} else if (cmd == "calc") {
    // read and evaluate expression
} ...
Though depending on how complex you want your command language to be, a more elaborate design (maybe even using a parser generator) might be appropriate.
You should pick up The C++ Programming Language, which is the book on C++ (there are others, but this one is great). It has an example program, spread over a few chapters, on tokenizing, parsing arguments, and making a calculator.
What you want is a command line parser. I can't remember the name, but there is actually a design pattern to this. However, this site gives you some sample code you can use to write one. Hope that's not giving you too much of the answer :)
Instead of looking for input like a, b, etc, just ask for generic input. Split the input at spaces, do a "switch" on the first one to match it up to your function calls, treat the rest as arguments.
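The split-and-dispatch idea is language-independent; here is a minimal sketch in Python, where a dictionary plays the role of the "switch". The handlers `cmd_calc` and `cmd_open` are placeholders made up for illustration, and `calc` only understands a single `number operator number` expression.

```python
import operator

OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def cmd_calc(args):
    # Placeholder handler: expects exactly "<number> <operator> <number>".
    left, symbol, right = args
    return str(OPERATIONS[symbol](float(left), float(right)))


def cmd_open(args):
    # Placeholder handler: a real program would open the file here.
    return f"would open {args[0]}"


COMMANDS = {"calc": cmd_calc, "open": cmd_open}


def dispatch(line):
    """Split the line on whitespace; first word picks the handler."""
    words = line.split()
    if not words:
        return ""
    name, args = words[0], words[1:]
    handler = COMMANDS.get(name)
    if handler is None:
        return f"unknown command: {name}"
    return handler(args)


result = dispatch("calc 32 / 5")
```

In C++ the same shape would be a `std::map` from command name to function, fed by `std::getline` and a `std::istringstream` to split the words.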
Is your menu based on a call to getchar()? If you want to allow entering an entire line before processing it, you can use fgets() or, in C++ land, std::getline.
Some folks will package their C++ class definitions as Python classes by adding a Python interface to the C++.
Then they write the top-level interpreter in Python using the built-in cmd library.
Take a look at:
ANTLR, which is a very good and easy-to-use parser generator that can also generate C++ code.
You can also take a look at Natural CLI (Java) to get inspired. (Disclaimer: I'm the developer of that project.)
I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
    if (rTest.IsMatch(formattedString))
        return rTest.Replace(formattedString, string.Empty);
    else
        return formattedString;
}
I'm new to regular expressions and I was suggested to use this:
static Regex rText = new Regex(@"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
    new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster and reduced the impact on my text rendering. Can someone explain to a regexp newbie why? :)
Do you really want to run the regexp twice? Without having checked (bad me) I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
    return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
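The single-pass idea can be tried out quickly in Python; this is an adaptation, not the C# code. Python's `re` has no `\e` escape, so `\x1b` stands in for it, and the outer optional group of the tolerant pattern is dropped so that the pattern cannot match the empty string. `ANSI` and `strip_formatting` are illustrative names.

```python
import re

# ESC, optionally followed by "[", parameters, and an optional final m or z.
# Every part after ESC is optional, so truncated sequences are removed too.
ANSI = re.compile(r"\x1b(?:\[[\d;]*[mz]?)?")


def strip_formatting(text):
    # One pass: sub() returns the input unchanged when nothing matches,
    # so a separate "is there a match?" check is not needed.
    return ANSI.sub("", text)


plain = strip_formatting("\x1b[1;32mThis is bright green.\x1b[0m This is the default color.")
```

The same holds for `Regex.Replace` in .NET: it returns the input unchanged when there is no match, which is why the `IsMatch` guard only adds a second scan.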
The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? gives you lazy quantifying. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.
I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]
Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
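The distinction behind that doubt can be checked directly. A short Python illustration (the same quantifier semantics apply in .NET): a `?` placed after `+` or `*` makes that quantifier lazy, while a `?` placed after a group or character class just makes it optional and stays greedy.

```python
import re

# Greedy: "+" takes as many characters as it can.
greedy = re.match(r"a+", "aaa").group()

# Lazy: "+?" stops as soon as the overall match can succeed.
lazy = re.match(r"a+?", "aaa").group()

# Optional, not lazy: "(ab)?" still prefers to match the group once.
optional = re.match(r"(ab)?", "abab").group()
```

Here `greedy` is `"aaa"`, `lazy` is `"a"`, and `optional` is `"ab"`, so the question marks in the tolerant pattern from the question are of the "optional" kind rather than the "lazy" kind.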
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)