Regex Performance Optimization Tips and Tricks [closed] - regex

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
After reading a pretty good article on regex optimization in Java, I was wondering: what are some other good tips for creating fast and efficient regular expressions?

Use the non-capturing group (?:pattern) when you need to repeat a grouping but don't need to use the captured value that comes from a traditional (capturing) group.
Use the atomic group (or non-backtracking subexpression) when applicable (?>pattern).
Avoid catastrophic backtracking like the plague by designing your regular expressions to terminate early for non-matches.
I created a video demonstrating these techniques. I started with the very poorly written regular expression from the catastrophic backtracking article, (x+x+)+y, and then made it 3 million times faster through a series of optimizations, benchmarking after every change. The video is specific to .NET, but many of these things apply to most other regex flavors as well:
.NET Regex Lesson: #5: Optimization
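For readers without .NET at hand, here is a minimal Python sketch of the same kind of comparison. The rewrite xx+y is just one equivalent pattern (both match a run of two or more x followed by y); it is an assumption for illustration, not necessarily one of the steps from the video, and the helper name time_search is made up.

```python
import re
import time

# Pattern from the catastrophic backtracking article: nested quantifiers.
SLOW = re.compile(r"(x+x+)+y")
# One equivalent rewrite (an assumption, not taken from the video):
# a run of two or more x followed by y, with no nested quantifiers.
FAST = re.compile(r"xx+y")

def time_search(regex, text):
    """Return (matched, seconds) for a single search of `text`."""
    start = time.perf_counter()
    matched = regex.search(text) is not None
    return matched, time.perf_counter() - start
```

Calling time_search(SLOW, "x" * n) and time_search(FAST, "x" * n) for slowly growing n (no trailing y, so the match fails) should show the first one blowing up while the second stays flat. Keep n small at first; the slow pattern gets out of hand quickly.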

Use the any-character (dot) operator sparingly. If you can express the match any other way, do so; the dot will always come back to bite you.
I'm only familiar with PCRE, and I'm not sure whether it is an NFA, but + and * are usually greedy by default: they match as much as possible. To turn this around, use +? and *? to match as little as possible. Bear both points in mind while writing your regexp.
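A small Python sketch of both points, using a made-up input string: the greedy dot overshoots, the lazy quantifier stops at the first closing bracket, and a negated character class gets the same result without using the dot at all.

```python
import re

text = "<a><b>"  # made-up sample input

# Greedy: .+ grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? grabs as little as possible, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", text).group()

# No dot at all: a negated class cannot run past the first '>'.
negated = re.search(r"<[^>]+>", text).group()

print(greedy)   # <a><b>
print(lazy)     # <a>
print(negated)  # <a>
```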

Know when not to use a regular expression -- sometimes a hand coded solution is more efficient and more understandable.
Example: suppose you want to match an integer that's evenly divisible by 3. It's trivial to design a finite state machine to accomplish this, and therefore a corresponding regex must exist, but writing it out is not so trivial -- and I'd sure hate to have to debug it!
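To give an idea of what the hand-coded version might look like, here is a minimal Python sketch of that finite state machine. The function name is made up, and the state is simply the remainder modulo 3 of the digits read so far (appending a digit d turns remainder r into (10*r + d) mod 3, which equals (r + d) mod 3).

```python
def is_multiple_of_3(digits):
    """Three-state machine: the state is the value so far, modulo 3."""
    if not digits:
        return False  # an empty string is not an integer
    state = 0
    for ch in digits:
        if ch not in "0123456789":
            return False  # reject anything that is not an ASCII digit
        # 10 % 3 == 1, so appending digit d moves state r to (r + d) % 3.
        state = (state + int(ch)) % 3
    return state == 0
```

For example, is_multiple_of_3("123") is True and is_multiple_of_3("124") is False.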

Related

why sed doesn't have non greedy regex [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am not asking whether sed has non-greedy regex; I already know that it does not. What I am asking is this: sed is known to be the best, or one of the best, stream editors in existence. So why didn't the developers of this tool implement non-greedy regex? It looks simple compared to all the things this tool can do.
History
Non-greedy matching is a feature of Perl-Compatible Regular Expressions. PCREs were only available as part of the Perl language until the 1997 implementation of libpcre, whereas the POSIX implementation of sed was first introduced in 1992 -- and the implementation of standard-C-library regular expression routines which it references predates even that, having been published in 1988.
Standards-Body Definitions
The POSIX specification for sed supports BRE; only BRE ("Basic Regular Expressions") and ERE ("Extended Regular Expressions") are specified in POSIX at all, and neither form contains non-greedy matching.
Thus, for PCRE support (or, otherwise, non-greedy matching support) to be standardized for inclusion in all sed implementations, it would first need to be standardized in the POSIX regular expression definition.
However, it's highly unlikely that this would occur in practice (except as an extension to be present, or not, at the implementor's option), given the practical reasons for which PCRE support can be undesirable; see the following section:
Implementation Considerations
sed is typically considered a "core tool", thus implemented with only minimal dependencies. Requiring libpcre in order to install sed thus makes libpcre a part of your operating system that needs to be included even in images where size is at a premium (initrd/initramfs images, etc).
Multiple strategies for implementing regular expressions are available. The historical (very high-performance) implementation compiles the expression into a nondeterministic finite automaton, which can be executed in O(n) time against a string of size n given a fixed regular expression. The libpcre implementation uses backtracking, which permits easier implementation of features such as non-greedy matching, lookahead, and lookbehind, but can often have far-worse-than-linear performance.
See https://swtch.com/~rsc/regexp/regexp1.html for a discussion of the performance advantages of Thompson NFAs over backtracking implementations.
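To illustrate the O(n) claim, here is a minimal Python sketch of the set-of-states simulation. It is not how sed or any real library is implemented: the NFA below is hand-built for the single pattern (x+x+)+y, matched against the whole string, and the state numbering and names are made up for illustration.

```python
# Hand-built NFA for (x+x+)+y, matched against the whole string.
# State 0: start; state 1: inside the first x+ of a group;
# state 2: inside the second x+ of a group; state 3: accepted after y.
TRANSITIONS = {
    (0, "x"): {1},
    (1, "x"): {1, 2},  # stay in the first x+, or begin the second x+
    (2, "x"): {1, 2},  # start a new group, or stay in the second x+
    (2, "y"): {3},
}
ACCEPTING = {3}

def nfa_fullmatch(text):
    """Track every state the NFA could be in; one pass, no backtracking."""
    current = {0}
    for ch in text:
        nxt = set()
        for state in current:
            nxt |= TRANSITIONS.get((state, ch), set())
        current = nxt
        if not current:
            return False  # no live states left, so the match cannot succeed
    return bool(current & ACCEPTING)
```

Because the set of live states can never be larger than the number of NFA states, each input character costs a bounded amount of work, so an input like "x" * 100000 is rejected in a single linear pass.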

Is it OK to mix string parsing while learning reg ex? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm performing some regular expression exercises in Python 2.7.3, on Windows 7. Some of these exercises have me looking for similar patterns on the same line. For example, I want to use regex to capture name1 and name2...
<XML tag><more tags>[name1]</XML tag><XML2 tag>[name2]<XML2 tag></more tags>
Would it be "cheating" or "missing the point" if I used any string parsing to capture name2? I feel like using regex the correct way alone should be able to capture both of those names, but string parsing is what I've always been familiar with.
An analogy would be someone studying recursion in C++ but using a while loop. Recursion should NOT involve any while loops (although of course it may be part of some other grand design).
Good question! Many beginners come into it believing they should be able to do everything with one regex match. After all, people are always saying how powerful regexes are, and what you're trying to do is so simple...
But no, the regex is responsible for finding the next match, that's all. Retrieving the substring that it matched, finding multiple matches, or performing substitutions is all external to the act of matching the regex. That's why languages provide methods like Python's findall() and sub(): to do the kind of "string parsing" operations you're talking about, so you don't have to.
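As a rough Python illustration of that division of labour: the regex below only describes one bracketed name, while findall() and sub() do the looping and the substitution. The sample line and its tag names are made up, loosely modelled on the one in the question.

```python
import re

# Made-up line in the spirit of the question's example.
line = "<person><first>[Alice]</first><last>[Smith]</last></person>"

# The pattern describes a single bracketed name and nothing more.
name = re.compile(r"\[([^\]]+)\]")

# findall() repeats the match across the line and collects the groups.
names = name.findall(line)

# sub() repeats the match and replaces each one (here: drop the brackets).
unbracketed = name.sub(r"\1", line)

print(names)        # ['Alice', 'Smith']
print(unbracketed)  # <person><first>Alice</first><last>Smith</last></person>
```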
It occurred to me a while back that the process of mastering regexes is one of learning everything you can't do with them, and why not. Understanding which parts of the regex matching operation are performed by the regex engine, and which parts are the responsibility of the enclosing language or tool, is a good start.

Strings that can never be parsed with regular expression [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am teaching regular expressions to some good programmers. They are good at programming but hardly ever use regex. My task is to train them so they know when to use regex and when not to.
After I showed them most regular expression features, I found they were parsing everything with regex. This is not what I want. I want them to know that there are some texts that can never be parsed with regex.
But I am out of luck. I know regular expressions can parse regular languages, and that a non-regular language cannot be parsed with them. So I am looking for an example of a non-regular language.
My aim is that when they fail to parse it, they will come up with a custom parser.
So, could you provide some good examples of such non-regular languages?
The best example is parsing HTML.
Show your students this:
<div>
<div>some shit</div>
<div>
This is some shit again
<div>
Really? Is this parsable?
</div>
</div>
</div>
and ask them to match the content of the innermost div, given that the HTML is dynamic.
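For contrast, here is a minimal sketch of the kind of custom parser the students could end up with, using Python's standard html.parser module. The class and function names are made up, and the sketch assumes "innermost" means the text found at the greatest div nesting depth.

```python
from html.parser import HTMLParser

class InnermostDivText(HTMLParser):
    """Remember the first text seen at the greatest <div> nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current <div> nesting depth
        self.best_depth = 0   # deepest depth at which text was seen
        self.best_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth > self.best_depth:
            self.best_depth = self.depth
            self.best_text = text

def innermost_div_text(html):
    parser = InnermostDivText()
    parser.feed(html)
    parser.close()
    return parser.best_text
```

The depth counter is exactly the piece of state that a true regular expression has no way to keep.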
In general, ask your students not to parse any other language using regex.
The best way to teach them that is by making them read this answer
In other words:
Use regex only when there is a uniform pattern of something
Also,
You cannot parse palindromes
You cannot parse another regex
You cannot match people's names and emails, as they vary. (Email can be matched, but it is overkill.)
A simple and understandable example of a non-regular language would be the language of palindromes, or in other words, strings that are equal to their reverses. It's pretty easy to demonstrate its non-regularity with the pumping lemma (see Wikipedia: http://en.wikipedia.org/wiki/Pumping_lemma)
Mind though, that in practical computing, the distinction isn't quite as clear, as many regular expression engines support features such as back references that allow recognition of certain non-regular languages. A regex engine with back references can match, for example, the language of squares or repetitions ("PonyPony", "123123", "gg" etc): (.*)\1 which isn't possible without back references.
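A quick Python check of the back-reference pattern from the paragraph above, applied to whole strings with re.fullmatch (the helper name is_square is made up):

```python
import re

# (.*)\1 : some text followed by an exact copy of itself.
SQUARE = re.compile(r"(.*)\1")

def is_square(s):
    """True if the whole string is some text repeated twice."""
    return SQUARE.fullmatch(s) is not None

for s in ["PonyPony", "123123", "gg", "Pony", "abc"]:
    print(s, is_square(s))
```

The first three strings are accepted and the last two rejected. Note that with .* the empty string also counts as a square; use (.+)\1 to exclude it.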

How to tell if some language's/command's regular expression supports things such as \d \w? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
In some languages or commands (e.g. JavaScript) I can use \d and \w.
In others I have to use [0-9] and [a-zA-Z].
How can I tell when I can use \d and \w?
[A side question: can Notepad++ and grep use them?]
Simple answer: RTFM for each language or function to see what kind of regex it supports, as there are different dialects. At the least (I really don't remember all the differences and have to read the manual each time for a new function):
Perl-compatible
POSIX
GNU
And of course not every language fully supports any one standard, so after reading the general manual you still have to wrestle with the peculiarities of the particular implementation.
Look up a reference for the language you're using the regexes in or try them out; we can't give an exhaustive reference here.
Notepad++ does have regex support, but I've found it to be flaky. It's supposed to support the basic character classes, though. Grep, I'm less familiar with. I'd expect it to, but...try and see.
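As an example of "try them out", here is a small probe for Python's own re module. It says nothing about Notepad++ or grep, and the helper name accepts is made up. One thing such a probe can reveal is that, for Python 3 text patterns, \d is wider than [0-9] unless the re.ASCII flag is given.

```python
import re

def accepts(pattern, sample, flags=0):
    """True if `pattern` compiles and matches the whole of `sample`."""
    try:
        return re.fullmatch(pattern, sample, flags) is not None
    except re.error:
        return False  # this dialect does not support the construct

print(accepts(r"\d+", "2024"))              # True
print(accepts(r"\w+", "abc_1"))             # True
print(accepts(r"\d+", "\u0663"))            # True: ARABIC-INDIC DIGIT THREE
print(accepts(r"[0-9]+", "\u0663"))         # False
print(accepts(r"\d+", "\u0663", re.ASCII))  # False
```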

Inverted index for regex? Regex search engine? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I was wondering if it would be possible at all to build an inverted index for all possible regular expressions... I have had a few ideas, but they are extremely vague at the moment.
My reasoning behind this is that I think a search engine that uses regex would be pretty useful (I'm sure many people would agree), although the problem with a search engine is that there are quite a lot of things to search. This is why there are inverted indexes, I guess.
Maybe something similar? I don't really know.
Here's a description of my idea:
The search engine should be a regex search engine. Instead of being like a normal search engine which only matches words, this will match specific regex specified by the user.
An example of a search: [^ ]*ell[^ ]* .*\.
Something like that, for example. The reasoning behind this is that sometimes I want to search for something that can't be found because of the limitations of normal search engines.
It'll be a simple sed-like regex, maybe a bit JavaScript-like.
They are all similar anyway (in the basics).
Edit: I've seen a regular expression search engine, but that's not what I am asking about. I'm wondering whether it's possible to build one.
Edit 2: Maybe an inverted index that has bits of words, and numbers (and their length), etc. Maybe some kind of table where I can quickly pick things out, so if I have a number of a certain length in my regex, I can quickly filter all the numbers that I have indexed that have that length?
Combining those ideas, I just realized that it could work as multiple searches over a shrinking data source, until all that is left is what matches the regex. E.g., ell.*\. would search for everything with an e, then everything with an l following the e, then everything with another l following the el, and then any number of characters followed by a literal period.
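One possible sketch of the "bits of words" idea in Python: index every three-character substring (trigram) of each document, use the index to shrink the candidate set, and only then run the real regex on what is left. All names here are made up, and the literal that every match must contain ("ell" for the example search) is supplied by hand; extracting it automatically from an arbitrary regex is the hard part and is not attempted.

```python
import re
from collections import defaultdict

def trigrams(text):
    """All three-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Inverted index: trigram -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def search(docs, index, pattern, required_literal):
    """Filter by the trigrams of a literal every match must contain,
    then confirm the surviving candidates with the full regex."""
    candidates = set(range(len(docs)))
    # A literal shorter than three characters has no trigrams,
    # so in that case every document stays a candidate.
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    return [docs[i] for i in sorted(candidates) if regex.search(docs[i])]

docs = ["hello world.", "yellow submarine", "bell rings.", "nothing here."]
index = build_index(docs)
print(search(docs, index, r"[^ ]*ell[^ ]* .*\.", "ell"))
# ['hello world.', 'bell rings.']
```

The index alone can only narrow things down ("yellow submarine" survives the trigram filter), which is why the final regex pass is still needed.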