Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am not asking whether sed has non-greedy regex; I already know that it does not. What I am asking is this: sed is known to be the best, or one of the best, stream editors in existence. So why didn't its developers implement non-greedy regex? It looks simple compared to everything else this tool can do.
History
Non-greedy matching is a feature of Perl-Compatible Regular Expressions. PCREs were only available as part of the Perl language until the 1997 implementation of libpcre, whereas the POSIX implementation of sed was first introduced in 1992 -- and the implementation of standard-C-library regular expression routines which it references predates even that, having been published in 1988.
Standards-Body Definitions
The POSIX specification for sed supports BRE; only BRE ("Basic Regular Expressions") and ERE ("Extended Regular Expressions") are specified in POSIX at all, and neither form contains non-greedy matching.
Thus, for PCRE support (or, otherwise, non-greedy matching support) to be standardized for inclusion in all sed implementations, it would first need to be standardized in the POSIX regular expression definition.
However, it's highly unlikely that this would occur in practice (except as an extension to be present, or not, at the implementor's option), given the practical reasons for which PCRE support can be undesirable; see the following section:
Implementation Considerations
sed is typically considered a "core tool", thus implemented with only minimal dependencies. Requiring libpcre in order to install sed thus makes libpcre a part of your operating system that needs to be included even in images where size is at a premium (initrd/initramfs images, etc).
Multiple strategies for implementing regular expressions are available. The historical (very high-performance) implementation compiles the expression into a nondeterministic finite automaton which can be executed in O(n) time against a string of size n given a fixed regular expression. The libpcre implementation uses backtracking -- which permits easier implementation of features such as non-greedy matching, lookahead, and lookbehind -- but can often have far-worse-than-linear performance.
See https://swtch.com/~rsc/regexp/regexp1.html for a discussion of the performance advantages of Thompson NFAs over backtracking implementations.
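The linear-time behaviour described above comes from tracking every state the automaton could be in at once, instead of trying one path and backtracking. A minimal sketch, assuming a hand-built NFA for the expression (a|b)*abb; the state numbering and the TRANSITIONS table are made up for illustration and are not taken from any sed implementation:

```python
# Hypothetical NFA for (a|b)*abb: state -> {symbol -> set of next states}.
TRANSITIONS = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START, ACCEPT = 0, 3


def nfa_matches(text):
    """Simulate the NFA by tracking every state it could be in at once."""
    current = {START}
    for ch in text:
        # One pass over the input: each character advances all live states.
        current = {
            nxt
            for state in current
            for nxt in TRANSITIONS[state].get(ch, set())
        }
        if not current:  # no live states left, so no match is possible
            return False
    return ACCEPT in current
```

The work per input character is bounded by the number of states, which is fixed once the expression is fixed, so the whole run is linear in the length of the input.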
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
If I use a lot of backslashes, can I do any regular expression with grep? So far, it seems so.
I mean, without grep -E also. So just "grep regex file", with many backslashes where needed.
man egrep
"In addition, two variant programs egrep and fgrep are available. egrep is the same as grep -E. fgrep is the same as grep -F. Direct invocation as either egrep or fgrep is deprecated, but is provided to allow historical applications that rely on them to run unmodified."
-E means extended regular expressions. Again from man egrep:
"grep understands three different versions of regular expression syntax: “basic,” “extended” and “perl.” In GNU grep, there is no difference in available functionality between basic and extended syntaxes. In other implementations, basic regular expressions are less powerful. The following description applies to extended regular expressions; differences for basic regular expressions are summarized afterwards. Perl regular expressions give additional functionality, and are documented in pcresyntax(3) and pcrepattern(3), but may not be available on every system."
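The "no difference in available functionality" remark for GNU grep amounts to this: the characters ( ) { } | + ? are special when backslash-escaped in basic syntax and special when bare in extended syntax. A rough sketch of that correspondence, assuming GNU behaviour only; gnu_bre_to_ere is a hypothetical helper, and it deliberately ignores bracket expressions and other corner cases (such as a leading *):

```python
# Characters whose special meaning is toggled by a backslash in GNU grep:
# special when escaped in basic syntax, special when bare in extended syntax.
TOGGLED = set("(){}|+?")


def gnu_bre_to_ere(pattern):
    """Rewrite a GNU basic regex as an extended regex.

    Simplification: bracket expressions such as [+?] are not handled.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( is a group in basic syntax; extended syntax writes a bare (.
            out.append(nxt if nxt in TOGGLED else ch + nxt)
            i += 2
        elif ch in TOGGLED:
            # A bare ( is a literal in basic syntax; extended syntax needs \(.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, the basic pattern \(ab\|cd\)\+ comes out as (ab|cd)+, which is what one would pass to grep -E.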
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
In some languages or commands (e.g. JavaScript) I can use \d and \w.
In others I have to use [0-9] and [a-zA-Z].
How can I tell when I can use \d and \w?
[A side question: can Notepad++ and grep use them?]
Simple answer: read the manual for each language or function to find out which kind of regex it supports, because there are different dialects. There are at least these (I really don't remember all the differences and have to reread the manual for each new function):
Perl Compatible
Posix
GNU
And of course not every language fully supports any one standard, so after reading the general manual you still have to struggle with the peculiarities of the particular implementation.
Look up a reference for the language you're using the regexes in or try them out; we can't give an exhaustive reference here.
Notepad++ does have regex support, but I've found it to be flaky. It's supposed to support the basic character classes, though. Grep, I'm less familiar with. I'd expect it to, but...try and see.
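As one illustration of why the manual matters: even where \d is accepted, it is not always the same set as [0-9]. A small sketch using Python's re module (chosen here only as an example dialect; the sample string is made up):

```python
import re

# The last character is U+0663, ARABIC-INDIC DIGIT THREE.
text = "room 42, floor \u0663"

# In Python 3, \d in a str pattern matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d", text)

# re.ASCII restricts \d to the same characters as [0-9].
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)
bracket_digits = re.findall(r"[0-9]", text)
```

Here unicode_digits has three entries while the other two lists have only the "4" and the "2".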
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
Man that's an awkwardly formed question.
My project right now is reading ASCII lines from a serial port. I'm using a private library that reads the output line by line. Each line is terminated by the \r character. I'm limited by this library because I have to specify which escape character ends a line.
Anyway, I found this documentation online and in particular, I am interested in the escape characters \> and \s because \> checks for the end of a string and \s checks for any whitespace escape characters.
However, I don't think this is available by default in C++. I'm not even sure what language that documentation is for!
So I ask the gurus of Stack Overflow: is there a way to check for multiple escape characters in C++ with only one escape character?
Thanks for reading..
You could try using std::regex from the C++11 standard.
If your compiler does not support that, you can also use QRegularExpression, if you do not mind using Qt 5.
You could probably also use the regex support in the Boost library for this.
If you want to drop down to C in order to support older compilers, you could even use regex(3).
Yes, it definitely is an "awkwardly formed question".
Regarding the second point, you are confusing escape characters with regular expression constructs/escapes. The latter are valid only within regular expressions. Unless your library supports them, you will not be able to use them successfully. C++ has no built-in syntax for regular expressions; they are available only through library functions.
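The underlying idea, treating several different terminators as one "end of line", is independent of the language. A minimal sketch in Python (the thread's own suggestions are C++ libraries, and the private serial library from the question is not modelled here; LINE_END, split_lines and the sample data are made up):

```python
import re

# Any one of the common ASCII line terminators. \r\n is listed first so that
# a CRLF pair is consumed as a single terminator rather than as two.
LINE_END = re.compile(r"\r\n|\r|\n")


def split_lines(buffer):
    """Split raw text on any terminator, dropping empty pieces."""
    return [line for line in LINE_END.split(buffer) if line]
```

With a buffer such as "OK\rTEMP 21\r\nDONE\n" this yields the three lines "OK", "TEMP 21" and "DONE".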
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I need a library which will take in two regular expressions and determine whether they are isomorphic (i.e. match exactly the same set of strings or not)
For example a|b is isomorphic to [ab]
As I understand it, a regular expression can be converted to an NFA, which in some cases can be efficiently converted to a DFA. The DFA can then be converted to a minimal DFA, which, if I understand it correctly, is unique, and so these minimal DFAs can then be compared for equality. I realize that not all regular expression NFAs can be efficiently transformed into DFAs (especially when they were generated from Perl regexps, which are not truly "regular"), in which case ideally the library would just return an error or some other indication that the conversion is not possible.
I see tons of articles and academic papers online about doing this (and even some programming assignments for classes asking students to do this), but I can't seem to find a library which implements this functionality. I would prefer a Python and/or C/C++ library, but a library in any language will do. Does anyone know of such a library? If not, does someone know of a library that gets close that I can use as a starting point?
Haven't tried it, but Regexp::Compare for Perl looks promising: two regexes are equivalent if the language of the first is a subset of the language of the second, and vice versa.
The brics automaton library for Java supports this.
It can be used to convert regular expressions to minimal Deterministic Finite State Automata, and check if these are equivalent:
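import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public static boolean isIsomorphic(String regexA, String regexB) {
    Automaton a = new RegExp(regexA).toAutomaton();
    Automaton b = new RegExp(regexB).toAutomaton();
    // equals() compares the languages accepted by the two automata
    return a.equals(b);
}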
Note that this library only works for regular expressions that describe a regular language: it does not support some more advanced features, such as backreferences.
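Where no automaton library is at hand, a bounded brute-force comparison can at least catch non-equivalent expressions. This is only a sanity check under stated assumptions (a small alphabet and a length limit), not a proof of equivalence and not the DFA-minimization approach described above; agree_up_to is a hypothetical helper built on Python's re module:

```python
import re
from itertools import product


def agree_up_to(regex_a, regex_b, alphabet, max_len):
    """Return True if both patterns accept exactly the same strings
    among all strings over `alphabet` of length 0..max_len."""
    a, b = re.compile(regex_a), re.compile(regex_b)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if bool(a.fullmatch(s)) != bool(b.fullmatch(s)):
                return False  # s is a counterexample
    return True
```

The number of strings grows exponentially with max_len, so this is practical only for short lengths; a result of True means "no difference found up to that length", nothing more.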
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
After reading a pretty good article on regex optimization in Java, I was wondering: what are other good tips for creating fast and efficient regular expressions?
Use the non-capturing group (?:pattern) when you need to repeat a grouping but don't need to use the captured value that comes from a traditional (capturing) group.
Use the atomic group (or non-backtracking subexpression) when applicable (?>pattern).
Avoid catastrophic backtracking like the plague by designing your regular expressions to terminate early for non-matches.
I created a video demonstrating these techniques. I started with the very poorly written regular expression in the catastrophic backtracking article (x+x+)+y. And then I made it 3 million times faster after a series of optimizations, benchmarking after every change. The video is specific to .NET but many of these things apply to most other regex flavors as well:
.NET Regex Lesson: #5: Optimization
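One way to see what is wrong with (x+x+)+y: the nested quantifiers allow the same run of x's to be split in many different ways, and a backtracking engine tries them all before giving up. A sketch of one possible rewrite in Python's re module (this is not necessarily the sequence of optimizations shown in the video; SLOW, FAST and same_result are names made up here):

```python
import re

SLOW = re.compile(r"(x+x+)+y")  # nested quantifiers: many ways to split the x's
FAST = re.compile(r"xx+y")      # same language, only one way to match


def same_result(text):
    """Check that both patterns agree on whether `text` contains a match."""
    return bool(SLOW.search(text)) == bool(FAST.search(text))
```

Both patterns find "two or more x followed by y". Keep test inputs for SLOW short: on a long run of x with no y, its running time grows very quickly.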
Use the any-character (dot) operator sparingly. If you can express the match any other way, do so; the dot will always come back to bite you.
I'm not sure whether PCRE is an NFA engine, and PCRE is the only flavor I'm familiar with, but + and * are usually greedy by default: they match as much as possible. To reverse this, use +? and *? to match as little as possible. Bear both behaviors in mind while writing your regex.
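The difference between the greedy and non-greedy quantifiers can be seen on a short sample. A sketch using Python's re module (the sample string is made up); the third pattern shows that a negated character class often gives the same result without the dot and without a non-greedy quantifier:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)       # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", html)        # .+? stops at the first possible >
negated = re.findall(r"<[^>]+>", html)   # no dot, no non-greedy quantifier
```

Here greedy contains the whole string as a single match, while lazy and negated both contain the four individual tags.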
Know when not to use a regular expression -- sometimes a hand coded solution is more efficient and more understandable.
Example: suppose you want to match an integer that's evenly divisible by 3. It's trivial to design a finite state machine to accomplish this, and therefore a corresponding regex must exist, but writing it out is not so trivial -- and I'd sure hate to have to debug it!
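The finite state machine mentioned in this example needs only three states: the remainder, modulo 3, of the digits read so far. A minimal hand-coded sketch (divisible_by_3 is a name made up here):

```python
DIGITS = "0123456789"


def divisible_by_3(digits):
    """Run a three-state machine over a decimal string.

    The state is the value of the digits read so far, modulo 3.
    """
    if not digits or any(ch not in DIGITS for ch in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    state = 0
    for ch in digits:
        # Appending a digit d turns the value v into 10*v + d.
        state = (state * 10 + int(ch)) % 3
    return state == 0
```

The loop is a handful of lines and easy to check by hand, which is the point of the example: the equivalent regular expression exists but is far harder to read and debug.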