I'm wondering if Perl is a good (easy to use and to learn) tool for this:
I'd like to do some custom preprocessing on my C/C++ source code. Basically, this is to allow me to insert my own custom annotations into the source code and generate new code based on them. The required processing is mainly line-oriented search/replace and insertion of new source code lines.
I can now think of 2 tools to achieve this: (1) Ultraedit's scripting feature (or any other capable editors). (2) Perl scripting.
Ultraedit's scripting looks good and I'm familiar with it. Best of all, its natural line oriented processing is a good abstraction for processing source code lines.
I'm wondering if Perl is also a good tool. I've ZERO experience with Perl except that I'm familiar with Perl style Regexpr used in other contexts. Is Perl a good tool for line oriented text processing? I'll have to search forward and backward and replace source code lines with some other texts.
Yes, Perl is a good tool for what you want. I'd go for Python: it's quick, easy, beautiful and has a good regex interface (the re module) in its standard library. But it's purely a matter of taste.
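To give a feel for how little code the line-oriented part takes, here is a minimal Python sketch. The annotation syntax (`//@trace <message>`) and the generated `fprintf` call are made up for illustration; they are assumptions, not anything from the question.

```python
import re

# Hypothetical annotation syntax: a line of the form "//@trace <message>".
ANNOTATION = re.compile(r'^(?P<indent>\s*)//@trace\s+(?P<msg>.*)$')

def preprocess(lines):
    """Return a new list of source lines with each annotation expanded."""
    out = []
    for line in lines:
        m = ANNOTATION.match(line)
        if m:
            # Replace the annotation line with a generated C statement,
            # keeping the original indentation.
            out.append(f'{m["indent"]}fprintf(stderr, "{m["msg"]}\\n");')
        else:
            out.append(line)
    return out

source = [
    "int main(void) {",
    "    //@trace entering main",
    "    return 0;",
    "}",
]
result = preprocess(source)
```

Because the whole file is held as a list of lines, searching backward or forward from a match is just a matter of indexing into that list. The sketch does not escape quotes inside the message, which a real tool would need to do.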
Perl is an excellent tool for this if you are familiar with it. It's essentially geared towards that kind of text analysis and translation, so you'll find that it has all of the extensions you could ask for.
Another option is to use UltraEdit's JavaScript functionality. The execution speed on it is a little slower than what you can get in Perl, but it provides a decent user interface where you can use UltraEdit to indicate where you want the changes to be made. Also, UltraEdit JavaScript has a great deal more flexibility than UltraEdit scripting.
I can't personally recommend Python for it, but I'm currently part of a company initiative to use it for exactly that kind of function, so hopefully the previous answerer is right.
I know this post, I've already read it but still I'd like to learn what language does an html parser (may) use? I mean, does it parse the whole source with a regex or it uses a normal programming language such as c# or python?
Apart from the question above can you also brief me on from where I should start to create my own parser? (I'd like to create an html parser for my personal needs :)
Python, Java, and Perl are all fine languages for learning to write an HTML parser. Perl is very pleasant for regular expressions, but that's not what you need for a parser. It is a bit more pleasant to write OO programs in Python or Java. C/C++/C#, etc., are also common, for very fast parsers. However, as a learning exercise, I recommend Python or Java, so that you can compare your work with standard parsers.
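As a starting point for that comparison, Python ships an event-style HTML parser in its standard library (`html.parser.HTMLParser`). The sketch below only records start tags; the `TagCollector` class name and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the name of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the base class each time it recognises an opening tag.
        self.tags.append(tag)

collector = TagCollector()
collector.feed("<html><body><p>Hello <b>world</b></p></body></html>")
collector.close()
tags = collector.tags
```

Feeding the same input to your own parser and comparing the resulting tag list is a cheap way to check your work.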
The standard way is to use a Yacc/Lex duo: Lex generates code that splits the input into tokens, and Yacc generates code that converts the token stream into the desired structure.
There is also a more tempting option, Ragel. Here you write one big regexp-like structure capable of matching the entire file and define hooks that fire when a certain sub-pattern is matched.
I work on an open source project focused around Biblical texts. I would like to create a standard string format to build up a search string. I would then need to parse the search string and run the search with the options given. There are a number of different options, from scope of the search, to searching multiple texts, to wildcards, etc.
I'm thinking that using something like lex/yacc to generate a parser for this format might be a good idea. I think the Xapian project uses Lemon to achieve a similar goal. My question is, is using one (or more) of these tools the best way to accomplish this?
In addition to the question, I would appreciate any links to resources on these tools (and any others that might be options). The biggest problem I've run into so far is that most of the examples and tutorials are either geared towards a programming language or something simple like a calculator rather than parsing a string format.
Tools like Lex and Yacc are suitable for your purposes. A parser for a search string is not that different from a parser for a programming language (the big difference is that a search string parser generates rules guiding the search, while the programming language parser generates a parse tree from which code is generated).
I assume your syntax will contain rules like the following:
    expression : word
               | expression AND expression
               | expression OR expression
               | NOT expression
               | '(' expression ')'
               ;
all of which are easy to express in Yacc.
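For comparison, the same grammar can also be handled by a small hand-written recursive-descent parser. The Python sketch below is only an illustration: the precedence (NOT binds tighter than AND, which binds tighter than OR), the tuple-based tree and the sample query are assumptions, since the grammar above leaves precedence open.

```python
import re

def parse(text):
    """Parse a boolean search string into a nested tuple tree."""
    # Tokens are parentheses or runs of non-space, non-parenthesis characters.
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        node = parse_and()
        while peek() == 'OR':
            advance()
            node = ('OR', node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        while peek() == 'AND':
            advance()
            node = ('AND', node, parse_not())
        return node

    def parse_not():
        if peek() == 'NOT':
            advance()
            return ('NOT', parse_not())
        return parse_atom()

    def parse_atom():
        tok = peek()
        if tok is None:
            raise SyntaxError('unexpected end of search string')
        if tok == '(':
            advance()
            node = parse_or()
            if peek() != ')':
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok in (')', 'AND', 'OR'):
            raise SyntaxError(f'unexpected {tok!r}')
        advance()
        return ('WORD', tok)

    tree = parse_or()
    if peek() is not None:
        raise SyntaxError(f'unexpected {peek()!r}')
    return tree

tree = parse('faith AND (hope OR NOT fear)')
```

Each grammar rule becomes one function, which is what makes this style easy to extend with options such as search scope or wildcards.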
You can look at A Compact Guide to Lex & Yacc, which I've found very useful for learning Lex and Yacc.
If you're trying to build a parser in C++, have a look at
boost::spirit
It certainly is advanced C++, but it will build quite complex and performant parsers from C++ templates without code generation. It took me a few days to get into it, but after that, using and modifying the samples was straightforward. I also recommend reading the following book:
C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond by David Abrahams and Aleksey Gurtovoy
Keep "syntax error diagnosis and message" in your mind uppermost - if the user makes a mistake, a handcrafted recursive-descent-style parser can have some idea based on what it has scanned so far, what mistake the user might have made. If you're going to use an automated tool, be sure to test how it responds to typical user typos - genius-programmers can handle cryptic messages from their compiler, while it sounds like you are targeting a much less sophisticated user who therefore needs a friendlier interface.
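As a small illustration of what a friendlier message can look like, here is a Python sketch that checks only one kind of typo, unbalanced parentheses, and points at the offending column. The function name, the wording of the messages and the sample query are all made up for the example.

```python
def diagnose_parens(query):
    """Return a friendly message for unbalanced parentheses, or None if balanced."""
    open_positions = []  # indexes of '(' characters not yet closed
    for i, ch in enumerate(query):
        if ch == '(':
            open_positions.append(i)
        elif ch == ')':
            if not open_positions:
                return f"Unmatched ')' at column {i + 1}:\n{query}\n{' ' * i}^"
            open_positions.pop()
    if open_positions:
        i = open_positions[-1]
        return f"This '(' at column {i + 1} is never closed:\n{query}\n{' ' * i}^"
    return None

message = diagnose_parens('faith AND (hope OR love')
```

The point is that the code knows where it was and what it was expecting when things went wrong, and can say so in the user's terms.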
Perl has been one of my go-to programming language tools for years and years. Perl 6 grammars look like a great language feature. I'd like to know if someone has started something like this for Ruby.
If you want to use actual Perl 6 grammars in Ruby, your best bet is going to be Cardinal, a ruby compiler on Parrot. It's currently unfinished and VERY SLOW, but I'm quite hopeful about it being a viable ruby implementation eventually. It's currently mostly-inactive, pending some infrastructure changes in Parrot to support improved parsing speed and additional features.
No. And, since Perl6 grammars are a language feature, and Ruby doesn't allow the language to be extended, it is actually impossible to implement this in an "addon".
However, there are numerous libraries for Ruby which implement different kinds of Parsing or Grammar Systems. The standard library already contains racc, which is an LALR(1) parser generator (comparable to and somewhat compatible with the venerable yacc). Then there is the ANTLR parser generator, which has a Ruby backend (although I am not sure whether that actually works).
The closest thing to Perl6 grammars in Ruby would be the Ruby-OMeta project (make sure to also take a look at Ryan Davis's fork), which unfortunately is still under development. (Or rather, no longer under active development.)
So, keeping to stuff that actually exists, I recommend you take a look at the Grammar project and Treetop.
Don't know of anything similar for Ruby.
However there is something similar for Perl5, see Regexp::Grammars
Are Regular Expressions useful for a Web Designer (XHTML/CSS)? Can a web designer get any help if they learn regular expressions?
Anyone who works with text files on a regular basis, which includes all programmers, can benefit from learning regular expressions. They make find-and-replace tasks much easier, and save you a lot of manual editing. Almost all text-editing programs support regular expression searches.
You won't be able to use them in your code, if all you're doing is HTML and CSS. But if you start to use JavaScript, you'll find them useful for things like testing the value of input fields.
Yes.
Regular expressions are not part of (X)HTML or CSS, but they are part of the tools you will probably use with them: Javascript, XSLT, and any server scripts.
The key is to remember that, because of the difficulty in parsing a language with quoted strings and SGML-style tags, that regex shouldn't be used to parse (X)HTML except as a last resort. Tokenizers exist so you don't have to do that particular hard work. You would find most regular expressions in use for checking and sanitizing input.
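Checking and sanitizing input usually means short patterns like the ones in this Python sketch. The two helpers and the choice of a CSS hex colour as the thing being validated are assumptions made for illustration.

```python
import re

# A CSS-style hex colour: '#' followed by exactly 3 or exactly 6 hex digits.
HEX_COLOUR = re.compile(r'#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})')

def is_hex_colour(value):
    """Check that the whole input is a hex colour, not just part of it."""
    return HEX_COLOUR.fullmatch(value) is not None

def normalise_whitespace(value):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', value).strip()
```

Note the use of `fullmatch`: for validation the pattern has to cover the entire input, otherwise something like `#fffjunk` would slip through.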
In short: yes, but always use the best tool for the job.
Among other things, it's great for validating input and cleansing HTML.
Even if you don't use them directly, regexes are a way of thinking about manipulating text that has proven very valuable to me over the years, from the very first project I was paid money to write in Perl to things I do today in .NET.
Can a web designer get any help if they learn regular expressions?
If you're asking for some resources to help you with regular expressions, I find the MDC page on Regular Expressions and the Regular Expression Cheat Sheet quite useful references, and the Regular Expression Tester handy for testing regular expressions.
Steve
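CSS has no regular expressions as such, but its attribute selectors offer limited pattern matching (for example [href^="http"], [href$=".pdf"] and [class*="col"]), so they might be of some use in that area.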
Other than that, only if you code Javascript.
There are languages and protocols you use when creating web artifacts (pages, scripts, styles, etc.) and there are tools you use to manipulate those artifacts (editors, utilities, interpreters, etc.).
Regular expressions belong to the latter set, at the more advanced end of its spectrum (given that the average web designer is likely to have no clue about them). But any computer-literate person would benefit from learning and using regular expressions, not just web designers.
Anything that makes you stand out (in a positive way, of course) from the crowd will benefit you, so go ahead and learn regex. You won't regret it.
One of my developers has started using RegexBuddy for help in interpreting legacy code, which is a usage I fully understand and support. What concerns me is using a regex tool for writing new code. I have actually discouraged its use for new code in my team. Two quotes come to mind:
Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. - Jamie Zawinski
And:
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. - Brian Kernighan
My concerns are (respectively):
That the tool may make it possible to solve a problem using a complicated regular expression that really doesn't need it. (See also this question).
That my one developer, using regex tools, will start writing regular expressions which (even with comments) can't be maintained by anyone who doesn't have (and know how to use) regex tools.
Should I encourage or discourage the use of regex tools, specifically with regard to producing new code? Are my concerns justified? Or am I being paranoid?
Poor programming is rarely the fault of the tool. It is the fault of the developer not understanding the tool. To me, this is like saying a carpenter should not own a screwdriver because he might use a screw where a nail would have been more appropriate.
Regular expressions are just one of the many tools available to you. I don't generally agree with the oft-cited Zawinski quote; as with any technology or technique, there are both good and bad ways to apply them.
Personally, I see things like RegexBuddy and the free Regex Coach primarily as learning tools. There are certainly times when they can be helpful to debug or understand existing regexes, but generally speaking, if you've written your regex using a tool, then it's going to be very hard to maintain it.
As a Perl programmer, I'm very familiar with both good and bad regular expressions, and have been using even complicated ones in production code successfully for many years. Here are a few of the guidelines I like to stick to that have been gathered from various places:
Don't use a regex when a string match will do. I often see code where people use regular expressions in order to match a string case-insensitively. Simply lower- or upper-case the string and perform a standard string comparison.
Don't use a regex to see if a string is one of several possible values. This is unnecessarily hard to maintain. Instead place the possible values in an array, hash (whatever your language provides) and test the string against those.
Write tests! Having a set of tests that specifically target your regular expression makes development significantly easier, particularly if it's a vaguely complicated one. Plus, a few tests can often answer many of the questions a maintenance programmer is likely to have about your regex.
Construct your regex out of smaller parts. If you really need a big complicated regex, build it out of smaller, testable sections. This not only makes development easier (as you can get each smaller section right individually), but it also makes the code more readable, flexible and allows for thorough commenting.
Build your regular expression into a dedicated subroutine/function/method. This makes it very easy to write tests for the regex (and only the regex). It also makes the code in which your regex is used easier to read (a nicely named function call is considerably less scary than a block of random punctuation!). Dropping huge regular expressions into the middle of a block of code (where they can't easily be tested in isolation) is extremely common, and usually very easy to avoid.
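The guidelines above can be sketched in a few lines. This is a Python illustration rather than Perl, and everything in it (the unit set, the function names, the CSS-length example) is made up to show the shape of the idea.

```python
import re

# Guideline: test membership against a collection, not an alternation regex.
ALLOWED_UNITS = {'px', 'em', 'rem', '%'}  # hypothetical set of values

def same_text(a, b):
    """Guideline: case-insensitive comparison without a regex."""
    return a.casefold() == b.casefold()

# Guideline: construct the regex out of smaller, individually readable parts.
SIGN = r'[+-]?'
NUMBER = r'\d+(?:\.\d+)?'
UNIT = r'[a-z%]+'
LENGTH = re.compile(rf'({SIGN}{NUMBER})({UNIT})')

def parse_length(text):
    """Guideline: keep the regex inside one small, testable function."""
    m = LENGTH.fullmatch(text)
    if m is None or m.group(2) not in ALLOWED_UNITS:
        return None
    return float(m.group(1)), m.group(2)
```

With the regex wrapped this way, the tests are one-liners such as `parse_length('1.5em') == (1.5, 'em')`, and the calling code never sees the punctuation.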
You should encourage the use of tools that make your developers more efficient. Having said that, it is important to make sure they're using the right tool for the job. You'll need to educate all of your team members on when it is appropriate to use a regular expression, and when (less|more) powerful methods are called for. Finally, any regular expression (IMHO) should be thoroughly commented to ensure that the next generation of developers can maintain it.
I'm not sure why there is so much distrust of regexes.
Yes, they can become messy and obscure, exactly like any other piece of code somebody may write, but they have an advantage over code: they represent the set of strings one is interested in in a formally specified way (at least as specified by your language, if there are extensions). Understanding which set of strings is accepted by a piece of code requires "reverse engineering" the code.
Sure, you could discourage the use of regexes, as has already been done with recursion and gotos, but to me this would be justified only if there's a good alternative.
I would rather maintain a single-line regex than a convoluted hand-made function that tries to capture a set of strings.
On using a tool to understand a regex (or write a new one), I think it's perfectly fine! If somebody wrote it with the tool, somebody else can understand it with the tool! Actually, if you are worried about this, I would see tools like RegexBuddy as your best insurance that the code will not become unmaintainable just because of the regexes.
Regex testing tools are invaluable. I use them all the time. My job isn't even particularly regex heavy, so having a program to guide me through the nuances as I build my knowledge base is crucial.
Regular expressions are a great tool for a lot of text handling problems. If you have someone on your team who is writing regexes that the rest of the team don't understand, why not get them to teach the rest of you how they are working? Rather than a threat, you could be seeing this as an opportunity. That way you wouldn't have to feel threatened by the unknown and you'll have another very valuable tool in your arsenal.
Zawinski's comments, though entertainingly glib, are fundamentally a display of ignorance, and writing regular expressions is not the whole of coding, so I wouldn't worry about those quotes. Nobody ever got the whole of an argument into a one-liner anyway.
If you came across a Regular Expression that was too complicated to understand even with comments, then probably a regex wasn't a good solution for that particular problem, but that doesn't mean they have no use. I'd be willing to bet that if you've deliberately avoided them, there will be places in your codebase where you have many lines of code and a single, simple, Regex would have done the same job.
RegexBuddy is a useful shortcut for making sure that the regular expressions you are writing do what you expect. It certainly makes life easier, but what seems important to me about your question is whether to use regular expressions at all.
Like others have said, I think using or not using such a tool is a neutral issue. More to the point: if a regular expression is so complicated that it needs inline comments, it is too complicated. I never comment my regexps. I approach large or complex matching problems by breaking them down into several steps of matching, either with multiple match statements (=~), or by building up a regexp from sub-regexps.
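Here is what "several steps of matching" can look like, written in Python rather than with Perl's =~. The log-line format and the `parse_entry` name are assumptions chosen only to have something to match.

```python
import re

def parse_entry(line):
    """Parse '<date> <level> <message>' in two simple matching steps."""
    # Step 1: split the line into its three coarse fields.
    outer = re.fullmatch(r'(\S+)\s+(\S+)\s+(.*)', line)
    if outer is None:
        return None
    date_text, level, message = outer.groups()

    # Step 2: check the date field on its own with a second small pattern.
    date = re.fullmatch(r'(\d{4})-(\d{2})-(\d{2})', date_text)
    if date is None:
        return None
    year, month, day = map(int, date.groups())
    return {'date': (year, month, day), 'level': level, 'message': message}

entry = parse_entry('2009-03-14 ERROR disk full')
```

Neither pattern needs a comment, and a failure tells you which step rejected the input.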
Having said all that, I think any developer worth his salt should be reasonably proficient in regular expression writing and reading. I've been using regular expressions for years and have never encountered a time where I needed to write or read one that was terrifically complex. But a moderately sized one may be the most elegant and concise way to do a validation or match, and regexps should not be shied away from only because an inexperienced developer may not be able to read it -- better to educate that developer.
What you should be doing is getting your other devs hooked up with RB.
Don't worry about that whole "2 probs" quote; it seems that may have been a blast on Perl (said back in 1997) not regex.
I prefer not to use regex tools. If I can't write it by hand, then it means the output of the tool is something I don't understand and thus can't maintain. I'd much rather spend the time reading up on some regex feature than learning the regex tool. I don't understand the attitude of many programmers that regexes are a black art to be avoided/insulated from. It's just another programming language to be learned.
It's entirely possible that a regex tool would save me some time implementing regex features that I do know, but I doubt it... I can type pretty fast, and if you understand the syntax well (using a text editor where regexes are idiomatic really helps -- I use gVim), most regexes really aren't that complex. I think you're nearly always better served by learning a technology better rather than learning a crutch, unless the tool is something where you can put in simple info and get out a lot of boilerplate code.
Well, it sounds like the cure for that is for some smart person to introduce a regex tool that annotates itself as it matches. That would suggest that using a tool is not as much the issue as whether there is a big gap between what the tool understands and what the programmer understands.
So, documentation can help.
A really trivial example is a table like the following (just a suggestion):
    Expression          Match    Reason
    ^                   Pos 0    Start of input
    \s+                 " "      At least one space
    (abs|floor|ceil)    ceil     One of "abs", "floor", or "ceil"
    ...
I see the issue, though. You probably want to discourage people from building more complex regular expression than they can parse. I think standards can address this, by always requiring expanded REs and check that the annotation is proper.
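Several regex flavours already support an expanded form in which whitespace is ignored and each piece can carry its own comment. As a sketch, here is the expression from the table written with Python's `re.VERBOSE` flag; the `FUNCTION_NAME` name and the sample input are made up.

```python
import re

FUNCTION_NAME = re.compile(r'''
    ^                    # start of input
    \s+                  # at least one whitespace character
    (abs|floor|ceil)     # one of "abs", "floor", or "ceil"
''', re.VERBOSE)

m = FUNCTION_NAME.match(' ceil(x)')
name = m.group(1) if m else None
```

A coding standard could require this form for anything beyond a trivial pattern, which keeps the annotation next to the expression instead of in a separate tool.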
However, if they just want to debug an RE, to make sure it's acting as they think it's acting, then it's not really much different from writing code you have to debug.
It's relative.
A couple of regex tools (for Node/JS, PHP and Python) that I made for some other projects are available online to play and experiment with.
regex-analyzer and regex-composer
github repo