Tcl string match vs regexps

Is it right that we should avoid using regexp because it is slow, and use string operations instead? Are there cases where both can be used but regexp is better?

You should use the appropriate tool for the job. That means, you should not avoid regex, you should use it when it is necessary.
If you are just searching for a fixed sequence of characters, use string operations.
If you are searching for a pattern, then use regular expressions.
Example
Search for the word "Foo". If you use string operations, the search will also find "Foobar". Is this OK? No? Well, then maybe search for "Foo ", but then it will not find "Foo," and "Foo.".
With a regex this is no problem: you can match on word boundaries with /\mFoo\M/, and this regex will not be slow.
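The difference described above can be sketched in a few lines. This is only an illustration in Python, not Tcl: Python's \b stands in for Tcl's \m and \M, and the sample sentence and the helper name count_word are made up for the example.

```python
import re

def count_word(text, word):
    """Count standalone occurrences of word, using word boundaries."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text))

text = "Foobar is here. Did you see Foo, or maybe Foo."

# Plain substring search also counts the "Foo" inside "Foobar".
print(text.count("Foo"))        # 3

# Adding a trailing space misses "Foo," and "Foo.".
print(text.count("Foo "))       # 0

# The word-boundary regex finds only the standalone word.
print(count_word(text, "Foo"))  # 2
```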
I think this negative image comes from special problems like catastrophic backtracking.
There has been a recent example (catastrophic-backtracking-shouldnt-be-happening-on-this-regex) where this behaviour was unexpected.
Conclusion
A regex has to be well designed, if it isn't then the performance can be catastrophic. But the same can also happen to your normal code if you use a bad algorithm.
For a small job it should nearly never be a problem to use a regex, if your task is bigger and has to be repeated often, do a benchmark.
From my own experience, I am analyzing really big text files (some hundred MB) and use regexes to find the rows I am interested in and I don't experience performance problems because of regex.
Here is an interesting read about code optimization

Regular expressions (REs) are a marvelous hammer. They can solve some problems elegantly, and many more with brute force, but it won't be pretty. And some problems can be solved with REs if you hit them enough, but there are much better solutions available (for example, things that are a good fit for string map)
string match - or globbing - can be thought of as a simplified version of regular expressions. The glob pattern will usually be shorter than the equivalent regular expression (character classes are an exception - REs support them, with globs you need to spell them out). I don't know offhand how the performance differs; I'd expect string match to be slightly faster on equivalent patterns because of the simpler logic, but time is much more reliable than expectations.
For a specific case where REs are easier to use, extracting a substring contextually vs. by simple character position is a good example. Or for matching one of several alternatives.
My rule of thumb is to use the simplest thing that works. If that's string match, then great. If it seems like the pattern is too complex for that, go to a regexp and be happy you have the choice.

The best advice I can give, and the advice I use myself is, use regular expressions only when a simpler solution won't work.
If you can use simple string matching, or use glob patterns, use them. It's only when those cannot work that you should be using regular expressions.
To address your specific question I would say that, no, there is no time when you can use either but that regular expressions are the better choice. Maybe there's an edge case I'm not thinking of, but generally speaking, simpler solutions are always better.

I don't know about Tcl in particular, but generally it can be said that if you're looking for exact text matches (e.g. find all lines that start with #define) then string operations are faster. But if you're looking for patterns (e.g. all lines that contain a word that starts with c and ends with t) then regular expressions are the right tool for this (\bc\w*t\b would be a good regex for this - compare this to the program logic you'd need if you had to write this yourself).
And even if regex is slower in a case like this, chances are high that it won't matter in terms of execution speed, but it'll matter a lot when changes to the matching logic are required (oh, now we need to look for a word that starts with c and ends with t but contains at least two as and no x --> \bc(?=\w*a\w*a)(?!\w*x)\w*t\b).
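To make the two patterns above concrete, here is a small Python sketch. The regexes are the ones from the answer; the word list is invented for the example.

```python
import re

# A word that starts with c and ends with t.
SIMPLE = re.compile(r"\bc\w*t\b")
# The same, but with at least two "a"s and no "x" anywhere in the word.
STRICT = re.compile(r"\bc(?=\w*a\w*a)(?!\w*x)\w*t\b")

words = "cat cot caravanist cataxt catapult"

print(SIMPLE.findall(words))  # ['cat', 'cot', 'caravanist', 'cataxt', 'catapult']
print(STRICT.findall(words))  # ['caravanist', 'catapult']
```

Changing the requirement only meant adding two lookaheads; the equivalent hand-written loop would need new branches for each rule.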
A place where most regex engines don't want to go is recursion (matching nested tags, nested parentheses and all that). That's where parsers enter the picture.

Regular expression matching is a kind of string operation. While it's not as fast as some of the more basic operations, it is enormously more capable too. It's also more difficult to use, especially if you don't already know the basic syntax of REs, but that's not a reason to avoid them. However, replacing a regular expression with a collection of basic string operations can just lead to the program getting enormously longer: sometimes, you simply need complex manipulations.
Tcl does a number of things to make RE operations more efficient. Notably, it detects particularly simple REs and converts them into glob-like matches (as in string match) which are faster but much less powerful, and it does a number of things to cache the compiled form of REs so that matching has less overhead. It also uses an automata-theoretic matching engine that has fewer surprises during match time (at a cost of more time to compile the RE in the first place).
In short, don't avoid them. Use them where appropriate. (And time if you're in doubt about speed.)

regexp (regular expressions) can match many different strings with a single pattern, and the patterns can become very complex. They are also used to validate specific input.
string match only allows wildcards such as * and ? and basic character grouping with [] as in regexp.
You can read about it here: http://www.tcl.tk/man/tcl8.5/TclCmd/string.htm#M40
A basic guide to what regexp can do, with some examples, can be found here: http://www.regular-expressions.info/
So in short: if you don't need regexp, or don't know much about it, I recommend that you not use it. If you just want to compare two strings for equality, use string equal.

Related

Should I create one complex RegEx or multiple and less complex ones?

Should I create one complex regex to tackle all cases at hand, or should I break it into multiple simpler regexes?
I'm concerned regarding performance using complex Regex.
Will breaking the complex Regex into smaller simple regex perform better?
If you want a meaningful answer to the performance question, you need to benchmark both cases.
Regarding readability/maintainability, you can write unreadable code in any language, and the same is true of regular expressions. If you write a big one, be sure to use the x modifier (IgnorePatternWhitespace in C#) and use comments to document your regex.
A randomly chosen example from one of my past answers in C#:
MatchCollection result = Regex.Matches(testingString,
    @"
    (?<=\$) # Ensure there is a $ before the string
    [^|]*   # Match any character that is not a |
    (?=\|)  # Until a | is ahead
    ",
    RegexOptions.IgnorePatternWhitespace);
I don't think there would be much of a difference now because of compiler optimization, however, using a simple one would make understanding your code easier which in turn makes maintenance easier.
Complex regular expressions can be VERY slow, but it depends on your regular expression and your environment. Take the case of string.trim(). It can be trivially implemented with regular expressions. You might use one regex or two (remove front and back whitespace separately). Here is somebody that took 11 different javascript trim implementations and benchmarked them in different browsers: http://blog.stevenlevithan.com/archives/faster-trim-javascript. In that case, one regex loses big time in most situations.

When should I prefer regex over built-in string functions?

Some say I should use regex whenever possible, others say I should use it as little as possible. Is there something like a "Perl Etiquette" about that matter or just TIMTOWTDI?
The level of complexity generally dictates whether I use a regex or not. Some of the questions I ask when deciding whether or not to use a regex are:
Is there no built-in string function that handles this relatively easily?
Do I need to capture substring groups?
Do I need complex features like look behind or negative sets?
Am I going to make use of character sets?
Will using a regex make my code more readable?
If I answer yes to any of these, I generally use a regex.
I think a lot of the answers you got already are good. I want to address the etiquette part because I think there is some.
Summed up: if there is a robust parser available, use it instead of regular expressions; 100% of the time. Never recommend anything else to a novice. So–
Don'ts
Don't split or match against commas for CSV, use Text::CSV/Text::CSV_XS.
Don't write regexes against HTML or XML, use XML::LibXML, XML::Twig, HTML::TreeBuilder, HTML::TokeParser::Simple, et cetera.
Don't write regexes for things that are trivial to split or unpack.
Dos
Do use substr, index, and rindex where appropriate but recognize they can come off "unperly" so they are best used when benchmarking shows them superior to regular expressions; regexes can be surprisingly fast in many cases.
Do use regular expressions when there is no good parser available and writing a Parse::RecDescent grammar is overkill, too much work, or will be too slow.
Do use regular expressions for throw-away code like one-liners on well-known/predictable data including the HTML/CSV previously banned from regular expression use.
Do be aware of alternatives for bigger problems like P::RecD, Parse::Yapp, and Marpa.
Do keep your own counsel. Perl is supposed to be fun. Do whatever you like; just be prepared to get bashed if you complain when not following advice and it goes sideways. :P
I don't know of any "etiquette" about this.
Perl regexes are highly optimized (that's one of the things the language is known for, although there are engines that are faster), and in the end, if your regex is so simple that it could be replaced by a string function, I don't believe that the regex will be significantly less performant. If the problem you are trying to solve is that time sensitive, you might want to look into other possibilities for optimization.
Another important aspect is readability. I think that handling all string transformations through regexes adds to it, instead of mixing and matching different approaches.
Just my two cents.
Though I would classify this as too opinionated for SO, I'll give my point of view.
Use regex when the string is:
"Too Dynamic" The string could have so much variation that using the string library(ies) would be cumbersome.
"Contains patterns" If there is a genuine pattern to the string (it may be as simple as 1 character or a group of characters), this is where (I feel) regex excels.
"Too Complex" If you find yourself declaring a whole function block just to do what a single pattern can do, I can see it being worthwhile just to use regex. (However, see "Too Complex" below, too).
Do not use regex to be:
"Fast" Consider the overhead involved in spinning up a regex library over grabbing information directly from a string.
"Too Complex" Good code isn't always short. If you begin making a huge pattern to circumvent several lines of code, that's fine, but keep in mind it's at the risk of readability. Coming back to that piece and trying to wrap your head around it again may not be worth just doing the plain-jane method.
I'd say, if you need more than one or two string function calls to do it, use a regex. ;)
Use regexes for things that are not so complex that the regex becomes bloated, hurts the readability of the code and causes performance issues. Those cases you can handle in a series of steps, using built-in functions and other means. You may not have a cool single-line regex, but your code will be readable and maintainable.
And also not for problems that are too simple because, again, regexes are heavyweight and there are usually built-in functions that handle the simple scenarios.
It is going to depend on what you are going to do. Of course, please don't use regex for parsing (especially HTML etc.).
Perl is a great language for regex. It honestly has one of the greatest parsers of any language, so that is why you see so many "use regex" answers. I am not sure what the aversion to regex is, however.
My answer would be: can you sum up the work in a single pattern easier than using the string function, or do you need to use multiple string functions versus a single regex? In either case, I would aim for regex. Otherwise, do what feels comfortable for you.

does regex comparisons consume lots of resources?

i dunno, but will your machine suffer great slowdown if you use a very complex regex?
like for example the famous email validation module proposed just recently? which can be found here RFC822
update: sorry i had to ask this question in a hurry anyway i posted the link to the email regex i was talking about
It highly depends on the individual regex: features like look-behind or look-ahead can get very expensive, while simple regular expressions are fine for most situations.
Tutorials on http://www.regular-expressions.info/ offer performance advice, so that can be a good start.
Regexes are usually implemented as one of two algorithms (NFA or DFA) that correspond to two different FSMs. Different languages and even different versions of the same language may have a different type of regex. Naturally, some regexes work faster in one and some work faster in the other. If it's really critical, you might want to find what type of regex FSM is implemented.
I'm no expert here. I got all this from reading Mastering Regular Expressions by Jeffrey E. F. Friedl. You might want to look that up.
Depends also on how well you optimise your query, and knowing the internal working of regex.
Using a negated character class, for example, saves the engine the cost of backtracking (i.e. /<[^>]+>/ instead of /<.+?>/) (*). This is trivial in small matches, but saves a lot of cycles when you have to match inside a big chunk of text.
And there are many other ways to save resources in regex operations, so performance can vary wildly.
(*) Example taken from http://www.regular-expressions.info/repeat.html
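As a rough illustration of the two tag patterns, the following Python sketch checks that they find the same tags on simple input and times both. The sample text is made up, the two patterns are not equivalent on every input (for example "<>" followed by another tag), and the timings depend on the engine and the data, so treat the numbers as something to measure rather than assume.

```python
import re
import timeit

LAZY = re.compile(r"<.+?>")       # lazy dot: extends one character at a time
NEGATED = re.compile(r"<[^>]+>")  # negated class: cannot run past the closing >

html = 'text <a href="x"> more <b> text'
print(LAZY.findall(html))     # ['<a href="x">', '<b>']
print(NEGATED.findall(html))  # ['<a href="x">', '<b>']

# Time both on a larger chunk of text; no particular result is claimed here.
big = html * 2000
for name, rx in (("lazy", LAZY), ("negated", NEGATED)):
    print(name, timeit.timeit(lambda: rx.findall(big), number=20))
```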
You might be interested by articles like: Regular Expression Matching Can Be Simple And Fast or Understanding Regular Expressions.
It is, alas, easy to write inefficient REs, which can match quite quickly on success but can search for hours if no match is found, because the engine stupidly tries a long match at every position of a long string!
There are a few recipes for this, like anchoring whenever it is possible, avoiding greediness if possible, etc.
Note that the giant e-mail expression isn't recent, and not necessarily slow: a short, simple expression can be slower than a more convoluted one!
Note also that in some situations (like e-mail, precisely), it can be more efficient (and maintainable!) to use a mix of regexes and code to handle cases, like splitting at @, handling different cases (first part starts with " or not, second part is IP address or domain, etc.).
Regexes are not the ultimate tool able to do everything, but they are a very useful tool, well worth mastering!
It depends on your regexp engine. As explained here (Regular Expression Matching Can Be Simple And Fast) there may be some important difference in the performance depending on the implementation.
You can't talk about regexes in general any more than you can talk about code in general.
Regular expressions are little programs on their own. Just as any given program may be fast or slow, any given regex may be fast or slow.
One thing to remember, however, is that the regular expression handler is very well optimized to do its job and run the regex quickly.
I once made a program that analyzed a lot of text (a big code base, >300k lines). First I used regex but when I switched to using regular string functions it got a lot faster, like taking 40% of the time of the regex version. So while of course it depends, my thing got a lot faster.
Once I had written a greedy - accidentally, of course :-) - a multi-line regex and had it search/replace on 10 * 200 GB of text files. It was damn slow... So it depends what you write, and what you check.
Depends on the complexity of the expression and the language the expression is used with.
In JavaScript, you have to optimize everything. In C#, not so much.

How to generate random strings that match a given regexp?

Duplicate:
Random string that matches a regexp
No, it isn't. I'm looking for an easy and universal method, one that I could actually implement. That's far more difficult than randomly generating passwords.
I want to create an application that takes a regular expression, and shows 10 randomly generated strings that match that expression. It's supposed to help people better understand their regexps, and to decide e.g. if they're secure enough for validation purposes. Does anyone know of an easy way to do that?
One obvious solution would be to write (or steal) a regexp parser, but that seems really over my head.
I repeat, I'm looking for an easy and universal way to do that.
Edit: A brute force approach is out of the question. Assuming the random strings would just be [a-z0-9]{10} and 1 million iterations per second, it would take roughly 116 years to iterate through the space of all 10-char strings (36^10, about 3.7 × 10^15 of them).
Parse your regular expression into a DFA, then traverse your DFA randomly until you end up in an accepting state, outputting a character for each transition. Each walk will yield a new string that matches the expression.
This doesn't work for "regular" expressions that aren't really regular, though, such as expressions with backreferences. It depends on what kind of expression you're after.
Take a look at Perl's String::Random.
One rather ugly solution that may or may not be practical is to leverage an existing regex diagnostics option. Some regex libraries have the ability to figure out where the regex failed to match. In this case, you could use what is in effect a form of brute force, but using one character at a time and trying to get longer (and further-matching) strings until you get a full match. This is a very ugly solution. However, unlike a standard brute force solution, a failure on a string like ab will also tell you whether there exists a string ab.* which will match (if not, stop and try ac; if so, try a longer string). This is probably not feasible with all regex libraries.
On the bright side, this kind of solution is probably pretty cool from a teaching perspective. In practice it's probably similar in effect to a dfa solution, but without the requirement to think about dfas.
Note that you won't want to use random strings with this technique. However, you can use random characters to start with if you keep track of what you've tested in a tree, so the effect is the same.
if your only criteria are that your method is easy and universal, then there ain't nothing easier or more universal than brute force. :)
for (let i = 0; i < 10; ++i) {
  let str;
  do {
    str = generateRandomString(); // hypothetical helper, not defined here
  } while (!myRegex.test(str));   // retry until the regex accepts the string
  myListOfGoodStrings.push(str);
}
Of course, this is a very silly way to do things and mostly was meant as a joke.
I think your best bet would be to try writing your own very basic parser, teaching it just the things which you're expecting to encounter (eg: letter and number ranges, repeating/optional characters... don't worry about look-behinds etc)
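A very basic parser of the kind suggested above might look like the following Python sketch. It is only an illustration under stated assumptions: it handles literals, [...] sets with ranges, (...) groups, | alternation and the quantifiers ?, *, +, {m} and {m,n}, and it rejects anchors, negated sets, (?...) groups and escapes like \d. The names parse, generate and MAX_REPEAT are invented here, and the cap on * and + is arbitrary. It generates strings straight from the parsed tree rather than from a DFA.

```python
import random
import re

MAX_REPEAT = 5  # arbitrary cap for the unbounded quantifiers * and +

def parse(pattern):
    """Parse a small regex subset into a nested tuple tree."""
    node, pos = _alt(pattern, 0)
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at position {pos}")
    return node

def _alt(p, i):
    branches = []
    while True:
        seq, i = _seq(p, i)
        branches.append(seq)
        if i < len(p) and p[i] == "|":
            i += 1
        else:
            return ("alt", branches), i

def _seq(p, i):
    items = []
    while i < len(p) and p[i] not in "|)":
        atom, i = _atom(p, i)
        lo, hi, i = _quant(p, i)
        items.append(("rep", atom, lo, hi))
    return ("seq", items), i

def _atom(p, i):
    c = p[i]
    if c in "^$":
        raise ValueError("anchors are not supported in this sketch")
    if c == "(":
        if p[i + 1:i + 2] == "?":
            raise ValueError("(?...) groups are not supported in this sketch")
        node, i = _alt(p, i + 1)
        if i >= len(p) or p[i] != ")":
            raise ValueError("missing )")
        return node, i + 1
    if c == "[":
        end = p.index("]", i + 1)
        body = p[i + 1:end]
        if body.startswith("^"):
            raise ValueError("negated sets are not supported in this sketch")
        return ("set", _expand(body)), end + 1
    if c == "\\":
        if p[i + 1].isalnum():
            raise ValueError("escapes like \\d are not supported in this sketch")
        return ("set", p[i + 1]), i + 2
    return ("set", c), i + 1

def _expand(body):
    """Turn a set body such as 'a-z0-9' into the string of all its characters."""
    chars, k = [], 0
    while k < len(body):
        if k + 2 < len(body) and body[k + 1] == "-":
            chars.extend(chr(x) for x in range(ord(body[k]), ord(body[k + 2]) + 1))
            k += 3
        else:
            chars.append(body[k])
            k += 1
    return "".join(chars)

def _quant(p, i):
    """Return (min, max, next_index) for an optional quantifier at position i."""
    if i < len(p):
        if p[i] == "?":
            return 0, 1, i + 1
        if p[i] == "*":
            return 0, MAX_REPEAT, i + 1
        if p[i] == "+":
            return 1, MAX_REPEAT, i + 1
        if p[i] == "{":
            end = p.index("}", i)
            lo, _, hi = p[i + 1:end].partition(",")
            return int(lo), int(hi or lo), end + 1  # "{m,}" is treated as "{m}"
    return 1, 1, i

def generate(node, rng=random):
    """Walk the tree, making a random choice at every set, branch and repeat."""
    kind = node[0]
    if kind == "set":
        return rng.choice(node[1])
    if kind == "alt":
        return generate(rng.choice(node[1]), rng)
    if kind == "seq":
        return "".join(generate(item, rng) for item in node[1])
    _, atom, lo, hi = node  # a "rep" node
    return "".join(generate(atom, rng) for _ in range(rng.randint(lo, hi)))

pattern = "[a-z0-9]{10}"
tree = parse(pattern)
for _ in range(3):
    s = generate(tree)
    print(s, bool(re.fullmatch(pattern, s)))
```

Checking each generated string against the real engine with re.fullmatch, as the last lines do, is a cheap way to catch constructs the sketch mishandles.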
The universality criterion is impossible. Given the regular expression "^To be, or not to be -- that is the question:$", there will not be ten unique random strings that match.
For non-degenerate cases:
moonshadow's link to Perl's String::Random is the answer. A Perl program that reads a RegEx from stdin and writes the output from ten invocations of String::Random to stdout is trivial. Compile it to either a Windows or Unix exe with Perl2exe and invoke it from PHP, Python, or whatever.
Also see Random Text generator based on regex

I don’t get regular expressions

I don’t understand or see the need for regular expressions.
Can someone explain them in simple terms and provide some basic examples where they could be useful, or even critical?
Use them where you need to use/manipulate patterns. For instance, suppose you need to recognise the following pattern:
Any letter, A-Z, either upper or lower case, 5 or 6 times
3 digits
a single letter a-z (definitely lower case)
(Things like this crop up for zip code, credit card, social security number validation etc.)
That's not really hard to write in code - but it becomes harder as the pattern becomes more complicated. With a regular expression, you describe the pattern (rather than the code to validate it) and let the regex engine do the work for you.
The pattern here would be something like
[A-Za-z]{5,6}[0-9]{3}[a-z]
(There are other ways of expressing it too.) Grouping constructs make it easy to match a whole pattern and grab (or replace) different bits of it, too.
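For illustration, here is how the pattern above could be used as a validator in Python. The function name is_valid_code and the sample inputs are made up; fullmatch is used so that the whole string has to fit the pattern, not just a part of it.

```python
import re

# 5 or 6 letters, then 3 digits, then one lower-case letter.
CODE = re.compile(r"[A-Za-z]{5,6}[0-9]{3}[a-z]")

def is_valid_code(s):
    """Return True if the entire string matches the pattern."""
    return CODE.fullmatch(s) is not None

print(is_valid_code("Hello123x"))   # True
print(is_valid_code("Python007z"))  # True
print(is_valid_code("Hell123x"))    # False: only 4 letters
print(is_valid_code("Hello123X"))   # False: last letter is upper case
```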
A few downsides though:
Regexes can become complicated and hard to read quite quickly. Document thoroughly!
There are variations in behaviour between different regex engines
The complexity can be hard to judge if you're not an expert (which I'm certainly not!); there are "gotchas" which can make the patterns really slow against particular input, and these gotchas aren't obvious at all
Some people overuse regular expressions massively (and some underuse them, of course). The worst example I've seen was where someone asked (on a C# group) how to check whether a string was length 3 - this is clearly a job for using String.Length, but someone seriously suggested matching a regex. Madness. (They also got the regex wrong, which kinda proves the point.)
Regexes use backslashes to escape various things (e.g. use \. to mean "a dot" rather than just "any character"). In many languages the backslash itself needs escaping.
What regular expressions are used for:
Regular expressions is a language in itself that allows you to perform complex validation of string inputs. I.e. you pass it a string and it will return true or false if it is a match or not.
How regular expressions are used:
Form validation, determine if what the user entered is of the format you want
Finding the position of a certain pattern in a block of text
Search and replace where the search term is a regex and what to replace is a normal string.
Some regular expression language features:
Alternation: allows you to select one thing or another. Example match only yes or no.
yes|no
Grouping: You can define scope and have precedence using parentheses. For example match 3 color shades.
gr(a|e)y|black|white
Quantification: You can quantify how much of something you want. ? means 1 or 0, * means 0 or more. + means at least one. Example: Accept a binary string that is not empty:
(0|1)+
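The three features above can be tried out directly. This Python sketch uses the patterns exactly as given; the helper name matches and the test strings are invented for the example.

```python
import re

def matches(pattern, s):
    """Return True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None

# Alternation
print(matches(r"yes|no", "yes"), matches(r"yes|no", "maybe"))  # True False

# Grouping
shades = r"gr(a|e)y|black|white"
print(matches(shades, "gray"), matches(shades, "grey"), matches(shades, "groy"))  # True True False

# Quantification: a non-empty binary string
binary = r"(0|1)+"
print(matches(binary, "0110"), matches(binary, ""), matches(binary, "012"))  # True False False
```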
Why regular expressions?
Regular expressions make it easy to match strings, it can often replace several dozen lines of source code with a simple small regular expression string.
Not for all types of matching:
To understand how something is useful, you should also understand how it is not useful. Regular expressions are bad for certain tasks for example when you need to guarantee that a string has an equal number of parentheses.
Available in just about all languages:
Regular expressions are available in just about any programming language.
Formal language:
Any regular expression can be converted to a deterministic finite state machine. And in this same way you can figure out how to make source code that will validate your regular expression.
Example:
[hc]+at
matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at"
Source, further reading
They look a bit cryptic but they provide a very powerful tool for finding patterns in text. Anything from href tags in HTML pages to validating email addresses.
And they can be processed into a very efficient data structure (FSA) that finds matches very fast.
They are a bit tricky, but extremely powerful and worth learning. The web is full of tutorials and examples, start for example from here and look at the examples here.
If I could direct the OP to some of the answers/comments on one of my own questions: How important is knowing Regexs?
Regular expressions are a very concise way to specify most pattern-matching and -replacement problems, and regexp engines can be very highly optimized.
If you wanted to do the same job as even a relatively simple regexp, you'd have to write a lot of code, which probably would contain a number of bugs, be hard to understand and perform badly.
Whereas doing the same with a regexp is much shorter, almost certainly performs as well as is technically possible, and is easier to understand for anyone familiar with regexps (though it should be commented in either case).
The email example is actually a bad example for regular expressions. Regexes can be used, but the resulting expression (for example this one which doesn't handle "John Doe " style addresses) is hugely complicated - take a look at the email address specification and you'll see why...
However regexes are very useful in a host of other situations, extracting ip addresses from text, tags from html etc. Finding all versioned files would be another example. Something along the lines of:
my_versioned_file_(\d{4}-\d{2}-\d{2}).txt
will match any filenames of the format my_versioned_file_2009-02-26.txt and pull out the date as a captured group (the part wrapped in "()") for you to further analyse.
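A minimal Python sketch of the capture group in action follows. One deliberate difference from the pattern above: the dot before txt is escaped here so that it matches only a literal dot. The function name extract_date is made up for the example.

```python
import re

# The parentheses capture the date part of the file name.
VERSIONED = re.compile(r"my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt")

def extract_date(filename):
    """Return the captured date string, or None if the name does not match."""
    m = VERSIONED.fullmatch(filename)
    return m.group(1) if m else None

print(extract_date("my_versioned_file_2009-02-26.txt"))  # 2009-02-26
print(extract_date("my_versioned_file_latest.txt"))      # None
```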
No, regexes are not necessary, but they can save a world of time compared with writing a hand-rolled parser for something a regex can easily achieve.
Whenever you've got some pattern to find in a lot of textual data or if you want to check that a string is in a certain format.
For example an email address...
The code for checking for an at symbol and the presence of a valid domain will look quite big where you could just use a regular expression and have an answer in 2 lines of code.
Regex r = new Regex("<An Email Address Regex>");
bool isValidEmail = r.IsMatch(MyInput);
Other examples would be for checking numbers are in the correct format before parsing them into integers etc.
Jon and Sqook gave a fine explanation and definition of Regular Expressions, and for simple problems it is pretty understandable, but if you use it for complex problems regular expressions can be a &$#( (at least for me ;-))
I use Expresso a lot to help me build complex regular expression code.
http://www.ultrapico.com/Expresso.htm
It has a built-in library with expressions you can use, a design mode where you can build your code and a test mode where you can test and validate the code. It helped me build and understand complex expressions better!
Good luck!
Some practical real world usages:
Finding abstract classes that extend JUnit's TestCase:
abstract\s+class\s+\w+\s+extends\s+TestCase
This is useful for finding test cases that cannot be instantiated and will need excluding from an ant build script that runs test cases. You cannot search for regular text because you don't know the class names in advance, hence the \w+ (at least one word character).
Finding running bash or bourne shell scripts:
ps -e | grep -e " sh| bash"
This is useful if you want to kill them all or something. If you did a search for just sh, you'd not get the bash ones and would have to run the command again for bash scripts. Again, more serviceable than perfect, but nearly no regex you write on the fly will be.
It's not perfect, but most regexes won't be, or they'll take so long to write they're not worth it. The ones you perfect are the ones you commit as part of some sort of validation or built application.
Example of critical use is JavaScript:
If you need to do search or replace on a string, the only matching you can do is a regular expression. It's in the JavaScript API on those string methods...
Personally, I mostly use regular expressions only when I need some advanced matching in some automated find/replace in a text editor (TextPad or Visual Studio). The most powerful feature in my view is the ability to match a pattern that can be inserted in the replace.
To give you some examples:
Email Address
Password requires at least 1 alphabet and 1 digit
How can you achieve these requirements?
The best way is to use a regular expression.
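As a sketch of the password rule above, two lookaheads can express "at least one letter and at least one digit". The assumptions here are that "alphabet" means an ASCII letter and that there are no other rules such as a minimum length; the name is_acceptable_password is made up for the example.

```python
import re

# (?=.*[A-Za-z]) requires a letter somewhere, (?=.*[0-9]) requires a digit somewhere.
PASSWORD = re.compile(r"(?=.*[A-Za-z])(?=.*[0-9]).+")

def is_acceptable_password(s):
    """Return True if s contains at least one ASCII letter and one digit."""
    return PASSWORD.fullmatch(s) is not None

print(is_acceptable_password("abc123"))  # True
print(is_acceptable_password("abcdef"))  # False: no digit
print(is_acceptable_password("123456"))  # False: no letter
```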
Read the following links to learn more:
How To: Use Regular Expressions to Constrain Input in ASP.NET
http://msdn.microsoft.com/en-us/library/ms998267.aspx