A colleague posed an interesting question about an operational pain point we currently have, and I'm curious whether there's anything out there (utility/library/algorithm) that might help automate this.
Say you have a list of literal values (in our case, they are URLs). What we want to do is, based on this list, come up with a single regex that matches all of those literal items.
So, if my list is:
http://www.example.com
http://www.example.com/subdir
http://foo.example.com
The simplest answer is
^(http://www.example.com|http://www.example.com/subdir|http://foo.example.com)$
but this gets large for lots of data, and we have a length limit we're trying to stay under.
Currently we write the regexes by hand, but this doesn't scale well, nor is it a great use of anyone's time. Is there a more automated way of decomposing the source data to come up with a length-optimal regex that matches all of the source values?
The Aho-Corasick matching algorithm constructs a finite automaton to match multiple strings. You could convert the automaton to its equivalent regex but it is simpler to use the automaton directly (this is what the algorithm does.)
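If you end up in Python at any point, a sketch of that idea might look like the following; it assumes the third-party pyahocorasick package, and the module name and calls shown are my assumption, not something this answer specifies:

import ahocorasick

urls = [
    "http://www.example.com",
    "http://www.example.com/subdir",
    "http://foo.example.com",
]

automaton = ahocorasick.Automaton()
for url in urls:
    automaton.add_word(url, url)        # store the URL itself as the payload

# Exact membership test (the automaton doubles as a trie-backed dictionary):
print("http://foo.example.com" in automaton)    # True

# Substring scanning over arbitrary text:
automaton.make_automaton()
for end_index, matched in automaton.iter("see http://foo.example.com for details"):
    print(matched, "ends at", end_index)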
Today I was searching for exactly that. I didn't find one, so I created a tool: kemio.com.ar/tools/lst-trie-re.php
You put a list on the right side, submit it, and get the regexp on the left.
I tried it with a 6 KB list of words, and it produced a 4 KB regexp (which I put in a JS file) like: var re=new RegExp(/..../,"mib");
Please don't abuse it.
The Emacs utility function regexp-opt (source code) does not do exactly what you want (it only works on fixed strings), but it might be a useful starting point.
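Outside Emacs, the same fixed-string trick is easy to sketch: build a character trie of the literals and serialize it back into a grouped alternation. A minimal Python illustration of that idea (not regexp-opt's actual algorithm):

import re

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}                      # end-of-word marker
    return root

def trie_to_regex(node):
    branches = []
    for ch, child in sorted(node.items()):
        if ch == "":
            branches.append("")            # a word may end here
        else:
            branches.append(re.escape(ch) + trie_to_regex(child))
    if len(branches) == 1:
        return branches[0]
    optional = "" in branches
    alternatives = [b for b in branches if b]
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    return body + ("?" if optional else "")

urls = ["http://www.example.com", "http://www.example.com/subdir", "http://foo.example.com"]
pattern = "^" + trie_to_regex(build_trie(urls)) + "$"
# e.g. ^http://(?:foo\.example\.com|www\.example\.com(?:/subdir)?)$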
If you want to compare against all the strings in a set and only against those, use a trie, or compressed trie, or even better a directed acyclic word graph. The latter should be particularly efficient for URLs IMO.
You would have to abandon regexps though.
I think it would make sense to take a step back and think about what you're doing, and why.
To match all those URLs, only those URLs and no other, you don't need a regexp; you can probably get acceptable performance from doing exact string comparisons over each item in your list of URLs.
If you do need regexps, then what are the variable differences you're trying to accommodate? I.e. which part of the input must match verbatim, and where is there wiggle room?
If you really do want to use a regexp to match a fixed list of strings, perhaps for performance reasons, then it should be simple enough to write a method that glues all your input strings together as alternatives, as in your example. The state machine doing regexp matching behind the scenes is quite clever and will not run more slowly if your match alternatives have common (and thus possibly redundant) substrings.
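That gluing step is only a couple of lines in most languages; here is a Python sketch, assuming the inputs really are fixed strings so each one is escaped first:

import re

def literals_to_regex(strings):
    # Escape each literal so metacharacters like '.' match themselves,
    # then join them as one anchored alternation.
    return "^(?:" + "|".join(re.escape(s) for s in strings) + ")$"

urls = ["http://www.example.com", "http://www.example.com/subdir", "http://foo.example.com"]
matcher = re.compile(literals_to_regex(urls))
print(bool(matcher.match("http://www.example.com/subdir")))   # True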
Taking the cue from the other two answers: if all you need to match is the strings supplied, you are probably better off doing a straight string match (slow) or constructing a simple FSM that matches those strings (fast).
A regex actually creates an FSM and then matches your input against it, so if the inputs come from a previously known set, it is possible and often easier to build the FSM yourself instead of trying to auto-generate a regex.
Aho-Corasick has already been suggested. It is fast, but can be tricky to implement. How about putting all the strings in a Trie and then querying on that instead (since you are matching entire strings, not searching for substrings)?
An easy way to do this is to use Python's hachoir_regex module:
import hachoir_regex
from functools import reduce
urls = ['http://www.example.com','http://www.example.com/subdir','http://foo.example.com']
as_regex = [hachoir_regex.parse(url) for url in urls]
reduce(lambda x, y: x | y, as_regex)
creates the simplified regular expression
http://(www.example.com(|/subdir)|foo.example.com)
The code first creates a simple regex type for each URL, then concatenates these with | in the reduce step.
The idea is to look for URLs in the emails of an entire mailbox. There are at least 20 URLs to look for, and the email bodies could amount to 10 MB or 100 GB collectively.
I planned to use regular expressions to look for matches, once per URL over every email body, i.e. 20 regex_match calls on one string.
The application is written in C++ and I intended to use std::regex_match, favoring speed over memory usage.
I think my approach can be improved (a linear search/for loop over each email body string calling regex_match), but I don't have much experience parsing unknown text at this scale. Do you have suggestions on how to implement this idea?
Unrelated but important (I will come to the answer in a while): if you have access to multiple machines, scale out: use MapReduce or Spark.
You can check for the URLs in the mails in parallel, and thus your problem naturally fits in the distributed framework.
For example, if you end up using map reduce, just feed one mail per mapper, and you are done.
Back to your question, the fastest approach to solve this depends on the kind of dataset that we are dealing with here (Are there lots of URLs? Only a few? Lots of URLs but only a few match the 20?)
Thus, one could think of several strategies.
Yes, you can use regex; there are classical references you can find for this.
If the set of URLs that you have is fixed, you can create a trivial single regex out of them, and call the library.
A single call to a regex must be faster than 20 calls to string matching functions.
As a final tweak, if you are in fact matching 20 strings, first match the string that you expect to be present most often.
Alternatively, you can use a series of checks.
Add a super fast first level check to see if it is a url (presence of a colon at 5th location etc.)
If this check is satisfied, call a regex that checks if the string is a URL.
If this check is also passed, check the string against a hash set of the 20 URLs.
You can add more stages to it. Such machinery will obviously yield gains if there are not many URLs in your data.
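A rough sketch of that staged pipeline, in Python for brevity (the question is about C++, but the same staging maps directly onto std::regex and std::unordered_set; the URL-shape regex here is a deliberately crude placeholder):

import re

KNOWN_URLS = {
    "http://www.example.com",
    "http://foo.example.com",
    # ... the rest of the ~20 target URLs
}
URL_SHAPE = re.compile(r"https?://\S+")      # crude "looks like a URL" check

def is_target_url(token):
    # Stage 1: super fast positional check ("http:"/"https:" put a colon at index 4 or 5).
    if len(token) < 8 or ":" not in token[4:6]:
        return False
    # Stage 2: regex check that the token is URL-shaped at all.
    if not URL_SHAPE.fullmatch(token):
        return False
    # Stage 3: exact membership test against the known URLs.
    return token in KNOWN_URLS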
I'm looking for an efficient algorithm able to find all patterns that match a specific string. The pattern set can be very large (more than 100,000) and dynamic (patterns added or removed at any time). Patterns are not necessarily standard regexps; they can be a subset of regexp or something similar to a shell pattern (e.g. file-*.txt). A solution for a subset of regex is preferred (as explained below).
FYI: I'm not interested in brute-force approaches based on a list of RegExp.
By simple regexp, I mean a regular expression that supports ?, *, + , character classes [a-z] and possibly the logical operator |.
To clarify my need: I wish to find all patterns that match the URL:
http://site1.com/12345/topic/news/index.html
The response should be these patterns, based on the pattern set further below:
http://*.site1.com/*/topic/*
http://*.site1.com/*
http://*
Pattern set:
http://*.site1.com/*/topic/*
http://*.site1.com/*/article/*
http://*.site1.com/*
http://*.site2.com/topic/*
http://*.site2.com/article/*
http://*.site2.com/*
http://*
Here is an approach we've used pretty successfully (implementation here):
Adding Patterns:
For any pattern there exists a set of sub-strings a string must contain in order to have a chance of matching against it. Call these meta words. For example:
dog*fish -> [dog, fish]
[lfd]og -> [og]
dog? -> [dog]
When you add a pattern to the data structure, break it up into meta words and store them in an Aho-Corasick string matching data structure. Maintain an internal data structure to map meta words back to pattern words.
Running Queries:
Given an input string, use the Aho-Corasick data structure you've built to get all the meta words contained in that string. Then, using the map you've created, test the patterns that correspond to those meta words.
This works well because, while string matching is fairly slow, you can narrow down the number of patterns you actually have to match against very quickly. Our implementation can perform about 200,000 queries per second, on a regular laptop, against sets of 150,000+ patterns. See the benchmarking mode in the program to test that.
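A hedged Python sketch of that scheme, using plain substring checks in place of a real Aho-Corasick automaton and fnmatch-style glob semantics as a stand-in for whatever pattern dialect you actually support:

import fnmatch
import re
from collections import defaultdict

def meta_words(pattern):
    # Literal runs of the pattern: "dog*fish" -> ["dog", "fish"], "[lfd]og" -> ["og"]
    return [w for w in re.split(r"[*?]|\[[^\]]*\]", pattern) if w]

class PatternIndex:
    def __init__(self):
        self.by_word = defaultdict(set)   # meta word -> patterns containing it
        self.wordless = set()             # patterns with no literal part, e.g. "*"

    def add(self, pattern):
        words = meta_words(pattern)
        if not words:
            self.wordless.add(pattern)
        for w in words:
            self.by_word[w].add(pattern)

    def match(self, text):
        # Gather candidate patterns whose meta words occur in the text
        # (a real implementation would find the meta words with Aho-Corasick),
        # then confirm each candidate with a full pattern match.
        candidates = set(self.wordless)
        for word, patterns in self.by_word.items():
            if word in text:
                candidates |= patterns
        return [p for p in candidates if fnmatch.fnmatchcase(text, p)]

index = PatternIndex()
for p in ["http://*", "http://*.site1.com/*", "http://*.site2.com/topic/*"]:
    index.add(p)
print(index.match("http://www.site1.com/12345/topic/news/index.html"))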
One approach that comes to mind is to create tree structures of patterns.
Example: http://* would contain all the patterns (listed above). http://*.site1.com/* would contain all the site1.com ones. This could significantly reduce the number of patterns that need to be checked.
Additionally, you could determine which patterns are mutually exclusive to further prune the list you search.
So first take all the patterns and create trees out of them. Search all roots to determine which branches and nodes need to be analyzed.
Improve the algorithm by determining which branches are mutually exclusive so once you find a hit on a given branch you would know which branches/nodes do not need to be visited.
To get started you could be lazy: your first pass could be to sort the patterns and do simple "does the next pattern contain this pattern" logic to determine whether "this" is contained in the next. E.g.: if( "http://*.site1.com/*".startsWith("http://*") == true )
You could get more sophisticated in your ability to determine if one pattern does actually contain another but this would get you started.
To get any better at determining the question:
"Does this pattern contain that pattern?"
I believe you would need to be able to parse the regex... This article looks like a good place to start for understanding how to accomplish that: Parsing regular expressions with recursive descent
If the set of URLs doesn't change very fast, you really should use a regex engine that compiles down its patterns. Java provides one of these, but it might not be satisfactory if you want to know which pattern matches.
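One lightweight way to recover "which pattern matched" from a single compiled alternation, sketched here in Python with named groups (the pattern names and regexes are made up for illustration, and this reports the first matching alternative rather than all of them):

import re

patterns = {
    "site1_topic": r"http://[^/]*\.site1\.com/.*/topic/.*",
    "site1_any":   r"http://[^/]*\.site1\.com/.*",
    "any_http":    r"http://.*",
}
# Wrap each pattern in its own named group so the combined regex remembers
# which alternative succeeded.
combined = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in patterns.items()))

m = combined.fullmatch("http://www.site1.com/12345/topic/news/index.html")
if m:
    print("matched pattern:", m.lastgroup)    # first alternative that matched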
A widely used mechanism for doing this and determining which match, are various lexer generators, e.g., FLEX and similar tools. They accept what amount to a regex for each "lexeme", and build an integrated FSA to recognize any of them which is extremely efficient to execute.
You could invoke Flex when your set changes. If that's too slow, get an open source version of Flex and integrate into your engine; it internally builds the FSA so you could use it directly. (Some engineering likely necessary). But if you really have a high performance matching problem, some work to do it well won't bother you.
If the set of URLs changes faster than FLEX can generate FSAs (odd), you have a real problem. In that case, you might build an online discrimination tree by scanning the "regex" from left to right and integrating the characters/predicates you see into your existing discrimination tree. Matching then consists of walking down the discrimination tree, performing the various tests; if you arrive at a leaf, you have a match, otherwise not. This might be just as fast as a FLEX-generated automaton if done right, but likely a lot, lot bigger.
I'm developing an application where users enter a regular expression as a filter criterion, however I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), then this could be easily sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could see if the expression is one or more repetitions of .*, i.e. if it matches (\.\*)+ (quotations/escapes may not be entirely accurate, but you get the idea). The problem with this is that there may be other forms of writing a global match (e.g. with $ and ^) that are too exhaustive to even think of upfront, let alone test.
I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)
Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test. Just generate some random strings and see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
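A quick sketch of that probabilistic test in Python (the trial count and alphabet are arbitrary choices):

import random
import re
import string

def looks_like_match_anything(user_pattern, trials=1000, max_len=40):
    # Heuristic only: if every random string matches, the pattern is probably
    # overly broad; a single non-matching string proves it is not ".*"-like.
    compiled = re.compile(user_pattern)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    for _ in range(trials):
        candidate = "".join(random.choice(alphabet)
                            for _ in range(random.randint(0, max_len)))
        if not compiled.fullmatch(candidate):
            return False
    return True

print(looks_like_match_anything(r".*.*"))        # True
print(looks_like_match_anything(r"http://.*"))   # False (almost certainly)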
There are many, many possibilities for achieving something equivalent to .*, e.g. just put any character class and its complement into a class or an alternation and it will match anything.
So I think that with a regular expression it's not possible to test another regular expression for equivalence to .*.
Here are some examples that would match the same as .* (they will additionally match newline characters):
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.
Thanks everyone,
I did miss the testing for equivalence entry on the wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.
I have a remote "agent" that returns "yes" or "no" when handed a string. Communicating with this agent is expensive, so I'm hoping to find a library that will allow me to iteratively build a regular expression given positive and negative feedback, while being intelligent about its construction. This would allow me to cache answers on the sending side.
For example, suppose we query the agent with "good" and receive a "yes". The initial derived regular expression ought to be "good".
Suppose I query then with "goop" and receive a "yes". I would expect the derived regular expression to be "goo[dp]", not "good|goop".
And so forth.
I do not need backtracking or any other fancy non-linear time operations in my derived regex. Presumably the generated regex would be a DFA under the hood. Is anyone aware of any c/c++ regular expression libraries capable of doing this? Alternatively, reasons why this is a dumb idea and better solutions to my real problem would also be useful.
Rather than a regular expression, you could use a Trie.
Then for each new string you walk the trie one node for each character. I suspect that you would also want a marker character for end of string - once you reach this character, if the node exists, it holds the yes/no answer.
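A minimal sketch of such a trie cache in Python, with a sentinel key playing the role of the end-of-string marker:

class TrieCache:
    _END = object()                            # end-of-string marker key

    def __init__(self):
        self.root = {}

    def insert(self, s, answer):
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})
        node[TrieCache._END] = answer          # the terminal node holds the yes/no answer

    def lookup(self, s):
        node = self.root
        for ch in s:
            node = node.get(ch)
            if node is None:
                return None                    # never seen: time to ask the remote agent
        return node.get(TrieCache._END)        # None if s is only a prefix of cached strings

cache = TrieCache()
cache.insert("good", True)
cache.insert("goop", True)
print(cache.lookup("good"), cache.lookup("bad"))   # True None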
Well, unless I'm missing something in your situation, I think that memory is cheap enough to just straight up implement a dumb cache, say an unordered_map<std::string, bool>. Not only will this be much easier to build, it will probably be faster too, since you're building a hash map. The only downside is that if you were going to query the remote service with a bazillion different keys, this might not be the best approach.
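The same idea in Python terms, with a dict standing in for the unordered_map and query_agent as a placeholder name for the expensive remote call:

cache = {}

def cached_answer(s, query_agent):
    if s not in cache:
        cache[s] = query_agent(s)              # only pay the remote round-trip once per string
    return cache[s]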
I'm currently trying to solve a problem that's similar to Testing intersection of two regular languages with the exception that I know how to do the intersection, but have an additional requirement.
The intersection logic I intend to use is the Dragon Book's algorithm for converting an NFA to a DFA, but executed on two NFA's at the same time. Since all DFA's are NFA's (but with very little non-determinism), you can repeat this as needed for more intersections.
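For the "executed on two automata at the same time" part, the usual shortcut is a product construction over two already-determinized DFAs; here is a hedged sketch of that standard construction, with each DFA given as a (start, accepting_states, transitions) tuple and missing transitions treated as a dead state:

def intersect_dfas(dfa1, dfa2, alphabet):
    # Product construction: states of the intersection DFA are pairs of states,
    # accepting exactly when both components accept.
    start1, accept1, delta1 = dfa1
    start2, accept2, delta2 = dfa2
    start = (start1, start2)
    seen = {start}
    frontier = [start]
    transitions = {}
    accepting = set()
    while frontier:
        state = frontier.pop()
        s1, s2 = state
        if s1 in accept1 and s2 in accept2:
            accepting.add(state)
        for symbol in alphabet:
            t1 = delta1.get((s1, symbol))
            t2 = delta2.get((s2, symbol))
            if t1 is None or t2 is None:
                continue                       # one side dies: no product transition
            target = (t1, t2)
            transitions[(state, symbol)] = target
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return start, accepting, transitions

def languages_intersect(dfa1, dfa2, alphabet):
    # The two languages share a sentence iff some reachable product state accepts.
    _, accepting, _ = intersect_dfas(dfa1, dfa2, alphabet)
    return bool(accepting)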
My problem is that one of my regexes has groups that can be used further on as a part of a new regex. Concretely:
bin/x86/a.out: obj/x86/.*\.o
obj/{[a-zA-Z0-9]+}/{.*}.o: src/\2.c
At the end of the first line I have a regex that matches all objects for x86 targets. In the second line I have a regex that specifies a possible build line, which should match the first group against the fixed "x86" and the second against any given string after it. In the example the first match isn't used yet, but it should be retrievable. To make sure that the matching ends (and to allow recursive rules), I want to use the information gained from the first regex in matching the second. The rule is selected by taking the second regex from the first line and the first regex from the second line, and determining whether the intersection of the two (the DFA resulting from the intersection) has an accepting state. If it does, there are sentences that both can parse and therefore some values that the group can take.
Is it possible, in general, to extract information from the first regex for use in matching the group of the second regex?
If not in general, what kinds of restrictions do I need to add?
I believe back-references make the language non-regular, so you won't be able to convert it to a finite automaton.
Why do your examples look like Makefile rules, even though Makefiles don't support regular expressions?
Because that's the thing I'm trying to make (no pun intended).
Which regex library do you use?
None, as of yet. I'm considering to write my own based on the output of this question. If this isn't possible, I may make do with an existing one that supports this. If this is theoretically possible, I'll develop my own to do exactly this & make the application as I intend it.
Some support lookahead expressions, which are another way of expressing intersections.
The idea behind the intersection is to define rules that are generic and can contain multiple varying left-side parts (like the use of % in usual makefiles, but without the requirement to do some sort of recursive make when you have more than one variation point, such as the platform, build type or file name). If I can't take the second regex into account for the group, I can't use such a rule recursively, because the recursion wouldn't change anything between steps/levels. That would reduce the genericity but would still be acceptable. Still, it's an interesting question to know the answer to (i.e. can it be done generically), and it'll be decisive for the requirements I'll have for a regex library.
(Not posted as original author because I lost my cookie & am waiting for the accounts to be merged).