Are long regular expressions worse than short ones? - regex

I was trying to learn about regular expressions for a project where I want to create a TextMate grammar. Regexes seem relatively simple, but they are really hard for me to read, so I tried to create a utility module that could generate them. It kind of works as intended and generates regular expressions that actually work, all aliased by easy-to-understand names.
for example:
struc_enum = OrGroup("struct", "enum")
whitespace = TAB_SPACE.at_least(1)
results in:
(?:struct|enum)
[ \t]+
In this case there's not much benefit in using Python aliases, but then I can do:
valid_name = r"\b" + Group(ALPHA, ALPHANUMERIC.repeated())
struc_enum = OrGroup("struct", "enum")
typed_name = (struc_enum + whitespace).optional() + valid_name + whitespace + valid_name.captured()
and this is what print(typed_name) displays:
(?:(?:(?:struct|enum)[ \t]+)?\b[a-zA-Z][a-zA-Z\d]*[ \t]+(\b[a-zA-Z][a-zA-Z\d]*))
This method can be used to create small snippets and concatenate them to construct more complex patterns, but with each level of concatenation the expression grows very quickly, so that I could easily end up with something like this:
(?:(func)[\s]+([a-zA-Z_]+[a-zA-Z\d_]*)[\s]*\([\s]*(?:[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?(?:[\s]*,[\s]*[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?)*[\s]*)?\))
In an Atom grammar, the big regex above can match lines like these, but it doesn't seem to work elsewhere:
func myfunc(asd asd*, asd*, asdasd)
func do_foo01(type arg1, int arg2)
With enough patience, a human could probably construct an equivalent but much shorter expression, which raises the question: are big regular expressions better or worse than equivalent shorter ones in terms of computational overhead? At which point can we consider a regex too big?

Since the original problem you set out to solve is that long regular expressions are difficult to read, you may wish to consider extended (verbose) regular expressions. Extended regular expressions allow whitespace and comments, which can make a regular expression much easier to read.
Contrast this regular expression:
charref = re.compile("&#(0[0-7]+"
"|[0-9]+"
"|x[0-9a-fA-F]+);")
with the same regular expression, with comments:
charref = re.compile(r"""
&[#] # Start of a numeric entity reference
(
0[0-7]+ # Octal form
| [0-9]+ # Decimal form
| x[0-9a-fA-F]+ # Hexadecimal form
)
; # Trailing semicolon
""", re.VERBOSE)
Example taken from Regular Expression HOWTO

I think this is a fine idea, but you need to be clear with yourself about the scale of the project you're undertaking.
We almost never need to use regular expressions; we could take apart every string and write our own parsing operations using starts_with, ifs, etc. But regex syntax is a mature, powerful system that lets us succinctly express certain kinds of logic.
Often regexes are hard to read. There are some tools that can help, but the idea of a less succinct system for doing the stuff we currently do with regexes is sound. The hard part will be replicating the breadth, power, and reliability of existing regex systems.
My guess is you'll be best served by learning to tolerate the density of regular expressions. Possibly we might be better served by you building an easier-to-read system for string parsing, but you'll have about 20 years of catching up to do.
Regarding performance: Regexes are (can be) compiled. Depending on the context, this can have a big performance benefit.
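For instance, in Python the usual pattern is to compile the expression once and reuse the compiled object everywhere; a minimal sketch (names are illustrative):

import re

WORD = re.compile(r"\b[a-zA-Z][a-zA-Z\d]*\b")   # compiled once, at import time

def count_identifiers(lines):
    # the precompiled pattern is reused for every line instead of being re-parsed each call
    return sum(len(WORD.findall(line)) for line in lines)

(Python also caches recently used patterns internally, but explicit compilation makes the intent and the cost visible.)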
Anyway, like any sufficiently powerful language, the length of the instruction is a poor indicator of its run-time complexity.

Related

Regular Expression Vs. String Parsing

At the risk of opening a can of worms and getting negative votes, I find myself needing to ask:
When should I use Regular Expressions and when is it more appropriate to use String Parsing?
And I'm going to need examples and reasoning as to your stance. I'd like you to address things like readability, maintainability, scaling, and probably most of all performance in your answer.
I found another question Here that only had 1 answer that even bothered giving an example. I need more to understand this.
I'm currently playing around in C++, but regular expressions are in almost every higher-level language, and I'd also like to know how different languages use and handle regular expressions, but that's more of an afterthought.
Thanks for the help in understanding it!
Edit: I'm still looking for more examples and talk on this but the response so far has been great. :)
It depends on how complex the language you're dealing with is.
Splitting
This is great when it works, but only works when there are no escaping conventions.
It does not work for CSV for example because commas inside quoted strings are not proper split points.
foo,bar,baz
can be split, but
foo,"bar,baz"
cannot.
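For example, in Python a naive split mishandles the quoted case, while a CSV-aware parser handles it (a small illustration, not from the original answer):

import csv, io

line = 'foo,"bar,baz"'
print(line.split(','))                       # ['foo', '"bar', 'baz"']  -- wrong
print(next(csv.reader(io.StringIO(line))))   # ['foo', 'bar,baz']       -- right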
Regular
Regular expressions are great for simple languages that have a "regular grammar". Perl 5 regular expressions are a little more powerful due to back-references but the general rule of thumb is this:
If you need to match brackets ((...), [...]) or other nesting like HTML tags, then regular expressions by themselves are not sufficient.
You can use regular expressions to break a string into a known number of chunks -- for example, pulling out the month/day/year from a date. They are the wrong tool for parsing complicated arithmetic expressions, though.
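A quick illustrative sketch of the date example in Python:

import re

m = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})$", "7/4/1776")
if m:
    month, day, year = m.groups()   # ('7', '4', '1776')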
Obviously, if you write a regular expression, walk away for a cup of coffee, come back, and can't easily understand what you just wrote, then you should look for a clearer way to express what you're doing. Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions.
Context free
Parser generators and hand-coded pushdown/PEG parsers are great for dealing with more complicated input where you need to handle nesting so you can build a tree or deal with operator precedence or associativity.
Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.
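As a toy Python sketch of that tokenizing step (the token classes are illustrative):

import re

# a regex chunks the input into tokens; a grammar would then turn the token stream into a tree
token_re = re.compile(r"\s+|(?P<ident>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<punct>[^\w\s])")
tokens = [(m.lastgroup, m.group()) for m in token_re.finditer("foo(bar, 42)") if m.lastgroup]
# [('ident', 'foo'), ('punct', '('), ('ident', 'bar'), ('punct', ','), ('num', '42'), ('punct', ')')]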
The rule of thumb for CF grammars is
If regular expressions are insufficient but all words in the language have the same meaning regardless of prior declarations then CF works.
Non context free
If words in your language change meaning depending on context, then you need a more complicated solution. These are almost always hand-coded solutions.
For example, in C,
#ifdef X
typedef int foo
#endif
foo * bar
If foo is a type, then foo * bar is the declaration of a foo pointer named bar. Otherwise it is a multiplication of a variable named foo by a variable named bar.
It should be regular expressions AND string parsing.
You can use both of them to your advantage! Many times programmers try to make a SINGLE regular expression for parsing a text and then find it very difficult to maintain. You should use both as and when required.
The regex engine is FAST. A simple match takes less than a microsecond. But it's not recommended for parsing HTML.

Using regular expression for validating data is correct or not?

I have been finding some articles and posts which suggest not using regular expressions to validate user data. I am not sure of all the details, but I usually see this in the case of email address verification.
So I want to be clear: is using regular expressions for validating user input good or not? If it is good, then what is bad about it for validating email addresses?
Edit:
So can we say that for basic, primary validation of data types we can use regexes and that is fine, but for full validation we need to combine them with another parser?
And for the second part, for email validation in general usage we can use them, but per the standard they are not sufficient. Is that right?
Now I'm having trouble selecting the one correct answer.
It’s good because you can use regular expressions to express and test complex patterns in an easy way.
It’s bad because regular expressions can be complicated and there is much you can do wrong.
Edit: Well, ok. Here's some real advice: first make sure that the expected valid values can be expressed using a regular expression at all, i.e. that the language of valid values is a regular language. Otherwise you simply cannot use regular expressions (or at least not regular expressions only)!
Now that we know what can be validated using regular expressions, we should discuss what is viable to be validated using regular expressions. If we take an e-mail address as an example (like many others did), we should know what a valid e-mail address may look like (see RFC 5322):
addr-spec      = local-part "@" domain
local-part     = dot-atom / quoted-string / obs-local-part
domain         = dot-atom / domain-literal / obs-domain
domain-literal = [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
dtext          = %d33-90 /   ; Printable US-ASCII
                 %d94-126 /  ;  characters not including
                 obs-dtext   ;  "[", "]", or "\"
Here we see that the local-part may consist of a quoted-string that may contain any printable US-ASCII character (excluding \ and ", but including @). So it is not sufficient to test whether the e-mail address contains just one @ if we want to allow addresses according to RFC 5322.
On the other hand, if we want to allow any valid e-mail address according to RFC 5322, we would also allow addresses that probably do not exist or are just senseless in most cases (e.g. ""@localhost).
Your question seems to have two parts: (1) is using regular expressions for data validation bad, and (2) is using them for validating email addresses bad?
Re (1), this really depends upon the situation. In many situations a regular expression will be more than adequate to validate user input; for example, validating that a username has only alphanumeric characters. Where a set of regular expressions will probably be inadequate is when the input might be passed to something like a database query or an eval() statement. In these instances there may be language constructs like recursion that cannot be handled with regular expressions, and, more generally, you will want something that knows a lot about the target language to do the validation (and sanitization).
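For the simple cases a short pattern is usually all you need; a minimal sketch of the alphanumeric username check (the length limit is an assumption, purely for illustration):

import re

def is_valid_username(name):
    # letters and digits only, 1-32 characters
    return re.fullmatch(r"[A-Za-z0-9]{1,32}", name) is not None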
In most cases you'll want to escape the input so that it will be an innocuous string in the target language.
If you are validating the correctness of code, you will want a full-blown parser for this. A parser may make use of regular expressions, but typically parsers use other things to do the heavy lifting.
Regular expressions can be bad for three reasons:
They can get really complicated, and eventually unmaintainable. It's very easy to make mistakes.
There are certain types of text that cannot be parsed with regular expressions at all (e.g. HTML). Basically, anything with nested patterns cannot be parsed with regular expressions. You wouldn't be able to parse a programming language with regex, for example.
Depending on what kind of text you are working with, it may be easier and clearer if you just write your own code to parse it.
But if none of these is an issue for whatever you are working with, then there is nothing wrong with using regular expressions. I would say validating email addresses is a good use of regex.
Regular expressions are a tool like any other, albeit a very powerful one.
They are so powerful that people using them tend to suffer from the problem of everything looking like a nail (when you have a hammer). This leads to them being used in situations where another method would be more verbose but more efficient and more maintainable.
In the specific case of email addresses, the main problem here is that there are a very large number of regular expressions out there which claim to validate email address syntax, but are loaded up with problems that cause false negatives.
The main problems with them include:
Disallowing plus characters in the first half of the address (despite them being relatively common)
Limiting the TLD to three characters (thus blocking out the .museum TLD)
Limiting the TLD to two character country code TLDs or a list of specific TLDs (thus forcing it to be updated whenever a new TLD comes into play — guess what never happens?)
Email addresses are so complex that a regular expression shouldn't really try to do anything more than:
Something that doesn't include an @
An @
Something that doesn't include an @
A .
Something that doesn't include an @
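As a rough sketch (illustrative only, not from the original answer), that checklist corresponds to something like:

import re

# non-@ text, an @, non-@ text, a dot, more non-@ text
loose_email = re.compile(r"^[^@]+@[^@]+\.[^@]+$")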
For e-mail addresses it is good to use regular expressions. They will work in most cases.
In general: you should validate with regular expressions whatever can be expressed as a regular language
If the pattern of the data you are validating can be expressed completely and correctly using regular expressions, you can use them safely with no worries. However, not all textual patterns can be expressed using regular expressions (some require, for example, a context-free grammar). In such cases you might need to write a parser or a custom method for validating the data.
The concerns are probably about the fact that often the regular expressions in use do not cover all the possible (valid) inputs and/or restrict the user too much in what he can input.
I see no other way to validate whether some user input matches a certain schema (I mean, that is what regular expressions are for), so they are essential (imo) for user input validation. But you definitely have to put some time into designing an expression to make sure it really works, even in extreme cases.
Take credit card numbers. You have to consider the ways a user might enter them:
1234-5678
// or
1234 5678
// or
1234 - 5678
And now you have two possibilities:
You restrict the input to the first case which will result in an easier expression but will restrict (and maybe annoy) the user the most.
You create an expression that accepts any of these possibilities, making the expression more complicated (hence harder to maintain) but more user friendly.
It is a trade-off.
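For illustration (an assumption about the format, not part of the original answer), a pattern accepting all three input styles above might look like:

import re

# accepts "1234-5678", "1234 5678" and "1234 - 5678"
card_fragment = re.compile(r"^\d{4}\s*-?\s*\d{4}$")

The first, stricter option would instead be something like r"^\d{4}-\d{4}$".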
Regexes aren't bad for validating most data, if it's a Regular Language.
But, as has been noted, sometimes they can become difficult to maintain, and the programmers introduce errors.
The simplest way to mitigate the situation is with tests/TDD. These tests should be calling a method that uses the regular expression to validate email addresses (I currently use the regex /^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$/i, which is working well enough). This way, when you get a false positive or false negative, you can add another test for that case, adjust your regular expression, and ensure you didn't break some other condition.
If TDD seems a bit much, a tool like Expresso lets you save regexes with test data, and that can aid in keeping track of values that should pass/fail and aid in creating and understanding your regex.
WARNING:
Take some care in constructing regular expressions. There is potential for introducing ReDoS vulnerabilities.
See: http://msdn.microsoft.com/en-us/magazine/ff646973.aspx
In short, a poorly constructed regex, given the right input, can take hours to execute, effectively killing your server's performance.
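A classic illustration (not from the linked article) is a nested quantifier that backtracks catastrophically:

import re

evil = re.compile(r"^(a+)+$")
# evil.match("a" * 30 + "b")   # never matches, but the backtracking engine tries
#                              # an exponential number of ways to split up the a's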

Most efficient method to parse small, specific arguments

I have a command line application that needs to support arguments of the following kinds:
all: return everything
search: return the first match to search
all*search: return everything matching search
X*search: return the first X matches to search
search#Y: return the Yth match to search
Where search can be either a single keyword or a space separated list of keywords, delimited by single quotes. Keywords are a sequence of one or more letters and digits - nothing else.
A few examples might be:
2*foo
bar#8
all*'foo bar'
This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.
What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?
It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.
Thanks!
I wouldn't recommend a full lex/yacc parser just for this. What you described fits a simple regular expression:
((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?
If you have a regex engine that supports captures, it's easy to extract the single pieces of information you need (most probably in captures 1, 3 and 4).
If I understood what you mean, you will probably want to check that capture 1 and capture 4 are not both non-empty at the same time.
If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
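A quick Python sketch of that approach, using the regex from the answer (capture 2 is the count without the trailing '*'):

import re

arg_re = re.compile(r"((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?")

m = arg_re.fullmatch("2*foo")
if m:
    count, search, index = m.group(2), m.group(3), m.group(4)
    # count == "2", search == "foo", index is None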
Even without regex, I would hand write a function. It would be simpler than dealing with lex/yacc and I guess you could put together something that is even more efficient than a regular expression.
The answer mostly depends on a balance between how much coding you want to do and how many libraries you want to depend on - if your application can depend on other libraries, you can use any of the many regular expression libraries - e.g. POSIX regex, which comes with all Linux/Unix flavors.
OR
If you just want those specific syntaxes, I would use the string tokenizer (strtok) - split on '*' and split on '#' - then handle each case.
In this case the strtok approach would be much better, since the number of command forms to be parsed is small.
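The answer has C's strtok in mind; purely as an illustration of the same split-on-'*'-then-'#' idea, a Python analogue might look like:

def parse(arg):
    prefix, star, rest = arg.partition('*')
    if not star:
        prefix, rest = '', prefix        # no '*': the whole argument is the search part
    search, _, index = rest.partition('#')
    return prefix or None, search, index or None

# parse("2*foo")         -> ('2', 'foo', None)
# parse("bar#8")         -> (None, 'bar', '8')
# parse("all*'foo bar'") -> ('all', "'foo bar'", None)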

Are there particular cases where native text manipulation is more desirable than regex?

Are there particular cases where native text manipulation is more desirable than regex?
In particular .net?
Note:
Regex appears to be a highly emotive subject, so I am wary of asking such a question. This question is not inviting personal/professional opinions on regex, only specific situations where a solution including its use is not as good as language-native commands (including those which have underlying code using regex), and why.
Also, note that Desirable can mean performance, can mean code-readability; it does not mean panacea, as each solution for a problem has its benefits and limitations.
Apologies if this is a duplicate, I have searched SO for a similar question.
I prefer text manipulation over regular expressions to parse delimited string input. It's far simpler (for me at least) to issue a string split than to manage a regular expression.
Given some text:
value1, value2, value3
You can parse the line easily:
var values = myString.Split(',');
I'm sure there's a better way but with regular expressions you'd have to do something like:
var match = Regex.Match(myString, "^([^,]*),([^,]*),([^,]*)$");
var value1 = match.Groups[1].Value;
...
When you can do it simply with native text manipulation, it is usually preferable (simpler to read & better performance) not to use regex.
Personal rule of thumb: if it's tricky or relatively longer to do it "manually" and that performance gain is negligible, don't. Else do.
Don't examples:
split
simple find & replace
long text
loop
existing native functions (like, in PHP, strrchr, ucwords...)
Using a regex basically means embedding a tiny program, written in a different programming language, in the middle of your program. I'll ignore the inefficiency of using a regex over native string manipulation, because it probably isn't relevant in most cases.
I prefer native text manipulation over regex any time native text manipulation will be easier to follow for other people. Which is true quite frequently, since plenty of the people around me are not strongly familiar with regex. Unless working with something that is very much about parsing (via regex) they should not need to be!
Regular expressions are usually slower, less readable, and harder to debug than native string manipulation.
The main case where I'll prefer regex over string manipulation is when I want to be able to have different ways to parse strings depending on the source, and the types of sources will increase over time. Native string manipulation is not really practical in this case. I've had cases where I've stuck a regex column in a database...
Regexes are very flexible and powerful, because they are in many ways similar to an eval() statement. That being said, depending on the implementation, they can be a bit slow. Normally this is not an issue; however, if they can be avoided in a particularly costly loop, that can boost performance.
That being said, I tend to use them, and only worry about performance when the app is "done" and I have real benchmarks to prove I need to tweak performance. I.e., avoid premature optimization.
Whenever the same result can be achieved with a reasonable amount of code.
Regular expressions are very powerful, but they tend to get hard to read. If you can do the same with simple string operations that usually means that the code gets easier to manage and maintain.
There is some overhead in setting up the object and parsing the expression. For simpler string manipulation you can get better performance with simple string methods.
Example:
Getting the file name from a file path (yes, I know that the Path class should be used for that, it's just an example...)
string name = Regex.Match(path, @"([^\\]+)$").Groups[0].Value;
vs.
string name = path.Substring(path.LastIndexOf('\\') + 1);
The second solution is straightforward and does the minimal work needed to get the result. The regular expression solution produces the same result, but it does more work to parse the string, and it produces a bunch of objects that are not needed for the result.
Regex parsing and execution require the host language to defer processing to its regex "engine". This adds overhead, so for any instance where native string manipulation could be used, it is preferable for speed (and readability!).
I'll usually just use text manipulation for simple string replacements (e.g. replacing tokens in a template with actual values). You could certainly do this with Regex, but replacements are much easier.
Yes. Example:
#include <string.h>

const char* basename (const char* path)
{
    const char* p = strrchr(path, '/');
    return (p != NULL) ? (p + 1) : path;
}

Efficiently querying one string against multiple regexes

Let's say that I have 10,000 regexes and one string, and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster,more efficient way to do it?
EDIT:
I have tried substituting this with DFAs (lex).
The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.
This is the way lexers work.
The regular expressions are converted into a single non-deterministic automaton (NFA) and possibly transformed into a deterministic automaton (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here; they are called "lexer generators" and there are solutions that work with most languages.
You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course the traditional (f)lex is always an option.
I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
Martin Sulzmann has done quite a bit of work in this field.
He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.
The language used is Haskell and thus will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.
Also the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1) so may be uninterested/unable to supply any help or input.
(1) this is a joke - I like professors, the less the smart ones try to work the more chance I have of getting paid!
10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful would be: if you had 1000 tests that would only return true if "ello" exists somewhere in the string, and your test string is "goodbye", you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't work, and because of this you won't have to do them.
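A minimal Python sketch of that prefiltering idea (the branch and its patterns are illustrative):

import re

ello_branch = [re.compile(p) for p in (r"[Hh]ello", r".{0,20}ello")]   # every pattern here requires "ello"

def match_branch(target):
    # a cheap substring test rules out the whole branch before any regex runs
    if "ello" not in target:
        return []
    return [r.pattern for r in ello_branch if r.search(target)]

# match_branch("goodbye") == []   (no regex is ever executed)
# match_branch("hello")   == ['[Hh]ello', '.{0,20}ello']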
If you're thinking in terms of "10,000 regexes" you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something has gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.
Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main caveat: the patterns to match were all literal language patterns, not regex patterns, e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick

A = ahocorasick.Automaton()

patterns = [
    [['cat', 'dog'], 'mammals'],
    [['bass', 'tuna', 'trout'], 'fish'],
    [['toad', 'crocodile'], 'amphibians'],
]

for row in patterns:
    vals = row[0]
    for val in vals:
        A.add_word(val, (row[1], val))

A.make_automaton()

_string = 'tom loves lions tigers cats and bass'

def test():
    vals = []
    for item in A.iter(_string):
        vals.append(item)
    return vals
Running %timeit test() on my 2000 categories, with about 2-3 traces per category and a _string length of about 100,000, got me 2.09 ms vs 631 ms doing sequential re.search() - 315x faster!
You'd need to have some way of determining whether a given regex was "additive" compared to another one, creating a regex "hierarchy" of sorts allowing you to determine that all regexes of a certain branch did not match.
You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have its matched text. If not, it would be undefined/None/null/...
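A hedged Python sketch of this trick, using the two patterns from the question as placeholders:

import re

combined = re.compile(r"(?=([Hh]ello)?)(?=(.{0,20}ello)?)")

m = combined.match("hello")
matched = [i for i, g in enumerate(m.groups(), start=1) if g is not None]
# matched == [1, 2]: both sub-patterns hit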
If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
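In Python, for example, one way to sketch this and also recover which alternative fired first (an illustration, not part of the original answer; it breaks if the patterns contain their own capture groups):

import re

patterns = [r"[Hh]ello", r"\d{4}-\d{2}-\d{2}", r"foo+"]
union = re.compile("|".join("(%s)" % p for p in patterns))

m = union.search("hello world")
if m:
    print(m.lastindex)   # 1-based index of the first alternative that matched

As the answer says, this only tells you that at least one of the original expressions matches, not which others would also match.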
I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use Ragel, although it produces a single DFA, which can result in many states and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.
I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
Disclaimer: the only one I know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.
I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.
You could compile the regexes into a hybrid DFA/Büchi automaton where each time the automaton enters an accept state you flag which regex rule "hit".
Büchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.
I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
/.+ello/ % ello |
any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Ragel's regular expressions are quite limited and the state machine language is preferred instead; the braces from your example only work with the more general language.
Try combining them into one big regex?
I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.
The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
    List<Regex> matches = new List<Regex>();
    foreach (Regex r in regexes)
    {
        if (r.IsMatch(s))
        {
            matches.Add(r);
        }
    }
    return matches;
}
Oh, you meant the fastest code? I don't know then...