Are there any good / interesting analogs to regular expressions in 2d?

Are there any good (or at least interesting but flawed) analogs to regular expressions in two dimensions?
In one dimension I can write something like /aaac?(bc)*b?aaa/ to quickly pull out a region of alternating bs and cs with a border of at least three as. Perhaps as importantly, I can come back a month later and see at a glance what it's looking for.
I'm finding myself writing custom code for analogous problems in 2d (some much more complicated / constrained) and it would be nice to have a more concise and standardized notation, even if I have to write the engine behind it myself.
A second example might be called "find the +". The goal is to locate a column of three or more as above a b that is bracketed left and right by three or more as, with three or more as below it. It should match:
..7...hkj.k f
7...a h o j
----a--------
j .a,g- 8 9
.aaabaaaaa7 j
k .a,,g- h j
hh a----? j
a hjg
and might be written as [b^(a{3})v(a{3})>(a{3})<(a{3})] or...
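For concreteness, a brute-force matcher for this particular "+" might look like the Python sketch below (find_plus and arm are names I just made up; the grid is a list of strings):

def find_plus(grid, arm=3):
    # Report each 'b' that has at least `arm` consecutive 'a's
    # directly above, below, left, and right of it.
    def run(r, c, dr, dc):
        n = 0
        r, c = r + dr, c + dc
        while 0 <= r < len(grid) and 0 <= c < len(grid[r]) and grid[r][c] == 'a':
            n += 1
            r, c = r + dr, c + dc
        return n
    return [(r, c)
            for r, row in enumerate(grid)
            for c, ch in enumerate(row)
            if ch == 'b'
            and all(run(r, c, dr, dc) >= arm
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)))]
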
Suggestions?

I'm not a regex expert, but I find the problem interesting, so I looked around and came across this blog entry. The syntax used there for defining a 2D regex looks especially appealing. The paper linked there might tell you more than I can.
Update from comment: Here is the link to the primary author's page where you can download the linked paper "Two-dimensional languages": http://www.mat.uniroma2.it/~giammarr/Research/pubbl.html

Nice problem.
First, ask yourself whether you can constrain the pattern to a "+" shape, or whether it would need to test/match rectangles as well. For instance, a rectangle of [bc] with a border of a would match the center rectangle below, but so would the "+" pattern [c^([bc]*a)v([bc]*a)>([bc]*a)<([bc]*a)] (in your syntax):
xxxaaaaaxxx
yzyabcba12d
defabcbass3
oreabcba3s3
s33aaaaas33
k388x.egiee
If you can constrain it to a "+", your task is much easier. As ShuggyCoUk said, matching an RE is usually equivalent to running a DFSM -- but over a single, serial input, which simplifies things greatly.
In your "RE+" engine, you'll have to debug not only the engine, but also each place that the expressions are entered. I'd like the compiler to catch any errors it could. Given that, you could also use something that was explicitly four RE's, like:
REPlus engine = new REPlus('b').North("a{3}")
.East("a{3}").South("a{3}").West("a{3}");
However, depending on your implementation this may be cumbersome.
And with regard to the traversal engine: would the North/West patterns match right-to-left or left-to-right? It could matter if the patterns match differently with or without greedy sub-matches.
Incidentally, I think the '^' in your example is supposed to be one character to the left, outside the parenthesis.

Regular expressions are designed to model patterns in one dimension. As I understand it, you want to match patterns in a two dimensional array of characters.
Can you use regular expressions? That depends on whether the property that you are searching for is decomposable into components which can be matched in one dimension or not. If so, you can run your regular expressions on columns and rows, and look for intersections in the solution sets from those. Depending on the particular problem you have, this may either solve the problem, or it may find areas in your 2d array which are likely to be solutions.
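A sketch of that decomposition in Python (find_2d is a made-up name; ragged rows are padded with spaces so the columns line up):

import re

def find_2d(grid, row_re, col_re):
    # Cells covered by a row-wise match of row_re AND a column-wise
    # match of col_re. Intersecting the two 1-D solution sets is the
    # decomposition described above.
    width = max(len(line) for line in grid)
    padded = [line.ljust(width) for line in grid]
    row_hits = {(r, c)
                for r, line in enumerate(padded)
                for m in re.finditer(row_re, line)
                for c in range(m.start(), m.end())}
    col_hits = {(r, c)
                for c in range(width)
                for m in re.finditer(col_re, ''.join(line[c] for line in padded))
                for r in range(m.start(), m.end())}
    return row_hits & col_hits

For the "+" example, the centre b survives both the row filter a{3}ba{3} and the column filter a{3}ba{3}, while the arm cells typically fall away because each sits in only one of the two match sets.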
Whether your problem is decomposable or not, I think writing some custom code will be inevitable. But at least it sounds like an interesting problem, so the work should be pleasant.

You're essentially talking about a spatial query language. There are plenty out there if you look up spatial query, geographic query and graphic querying. The spatial part generally comes down to points, lines and objects in a region that have other given attributes. Regions can be specified as polygons, distance from a point (e.g. circles), distance from a linear feature such as a road, all points on one side of a linear feature, etc. You can then get into more complex queries like the set of all nearest neighbours, shortest path, travelling salesman, and tessellations like Delaunay TINs and Voronoi diagrams.

Related

Detect flexible patterns?

I need to detect a flexible pattern in a data set. For example a pattern like:
0{1},1{*},0{1}
(the number between { and } is how many times a number may occur)
This will match:
0,1,0
0,1,1,1,1,1,0
The one above is still easy; here is a more difficult one:
0{1},( 1{*} | 1{*},2{1},1{*} ),0{1}
(the pipe character | means "or")
This will match:
0,1,0
0,1,1,1,1,1,0
0,1,2,1,0
0,1,1,1,1,1,2,1,0
I'm finding it difficult to work out how I can detect these flexible patterns.
My first thought was to generate some kind of decision tree, but that becomes quite a difficult task when the patterns get even more complex, especially because you then have to go backwards in such a tree, mark paths as "not working", and try an alternative route.
Is there maybe a better solution for this kind of problem?
[edit] Oops, I might have oversimplified my question: the little numbers 0, 1, 2 are in my case not numbers but objects with many properties. My pattern definition will match one or a combination of these objects' properties.
It sounds like you are talking about building a regular expression matcher. Every regular expression can be formulated as a DFA. Here is a link explaining how.
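For the original, digit-based version of the question, the notation is close enough to regex syntax that a mechanical translation works. A sketch (compile_pattern is a made-up name; per the edit above, you would first map each object, via its properties, to a single symbol):

import re

def compile_pattern(spec):
    # '0{1},( 1{*} | 1{*},2{1},1{*} ),0{1}'  ->  '^0(1*|1*21*)0$'
    # Handles only the {1} and {*} counts used in the question.
    out = spec.replace(' ', '').replace(',', '')
    out = out.replace('{*}', '*').replace('{1}', '')
    return re.compile('^' + out + '$')

p = compile_pattern('0{1},( 1{*} | 1{*},2{1},1{*} ),0{1}')
print(bool(p.match('01210')))    # True
print(bool(p.match('012210')))   # False: 2{1} allows only a single 2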

Create regex from samples algorithm

AFAIK no one has implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I gave my algorithm these two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give back more than one regex, or even no possible regex, and you might find 1000 reasons why such an algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about the regex itself, either. Basically what I want is an algorithm that takes samples like the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want in a string, without having to write any regex or code manually.
I don't have proof, but I suspect no such discrete algorithm with a finite output can exist, since you are asking for the creation of a regular language which could be "large" with respect to the input size.
With that said, I suggest you take a peek at txt2re, which can break down sample texts one by one and help you build regexes.
Flash Fill, a new feature of MS Excel 2013, does exactly the task you want, but it does not give you the regular expression. It's an NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulations from multiple examples, go to the official Flash Fill website and read a few papers; they have pseudo-code and demo movies as well.
There are in fact many such algorithms. This is a research area called "grammatical inference".
I know RPNI, for example (you could also look at the probabilistic branch: ALERGIA, MDI, DEES). These algorithms generate DSAs (deterministic state automata). In fact, you don't need to enter the full strings from your example at all -- only the substrings.
There are also some algorithms that generate non-deterministic automata directly.
Of course, getting the regular expression from a non-deterministic automaton is easy.
The main ideas are simple:
Generate a PTSA (prefix tree state automaton) from your sample.
Then try to "merge" some states. From these merges, loops will emerge (i.e. * in the regular expression). All the difficulty is in choosing the right rule for merging.
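The first step of that recipe fits in a few lines of Python (a sketch; build_pta is my name for it). The hard part, as said above, is the merging rule, which is what RPNI and the probabilistic variants differ on:

def build_pta(samples):
    # Prefix tree state automaton: one path per sample string,
    # shared prefixes merged, end states marked accepting.
    root = {'accept': False, 'edges': {}}
    for word in samples:
        node = root
        for ch in word:
            node = node['edges'].setdefault(ch, {'accept': False, 'edges': {}})
        node['accept'] = True
    return root

# build_pta(['1234', '456']) branches once at the root; a learner
# would then try merging states pairwise and keep any merge that
# does not accept a known negative sample.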
Here you go:
import re
pattern = '|'.join(map(re.escape, substrings))  # escaped, so each sample matches literally
If you want anything more general, your algorithm is going to have to make educated guesses about what kinds of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Is it possible to calculate the edit distance between a regexp and a string?

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, turning xyz into XYZ takes 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.
You can use finite state machines to do this efficiently (that is, in linear time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simple inserts or deletes -- see Wikipedia on finite state transducers as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web demo). There are lots of libraries for FSA manipulation; I don't mean to suggest those two are your only or best options, just two I've heard of.
If, however, you merely want efficient approximate searching, a less flexible but already-implemented option exists: TRE, which has an approximate matching function that returns the cost of the match -- i.e., the distance to the match, from your perspective.
If you mean the smallest Levenshtein distance between the sample string and the closest string the regex matches, then I'm pretty sure it can be done: convert the regex to a DFA, then try to match, and whenever something fails, non-deterministically continue as if it had passed while keeping track of the number of differences. Naive backtracking for this is exponential in the worst case, but an A*-style search over (position in the string, DFA state) pairs keeps it tractable.
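That search is small enough to sketch. Below, the DFA is supplied as an explicit transition table (hand-built for [0-9]{3}; regex-to-DFA conversion is a separate step), and Dijkstra stands in for A* since every edit costs 0 or 1; all names are illustrative:

import heapq

def regex_edit_distance(s, trans, start, accepting, alphabet):
    # Fewest single-character edits (insert/delete/substitute) turning
    # s into some string the DFA accepts. Search states are pairs
    # (chars of s consumed, DFA state).
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, q = heapq.heappop(heap)
        if d > dist.get((i, q), d):
            continue
        if i == len(s) and q in accepting:
            return d          # popped in cost order, so this is optimal
        steps = []
        if i < len(s):
            steps.append((i + 1, q, 1))                  # delete s[i]
        for c in alphabet:
            nq = trans.get((q, c))
            if nq is not None:
                steps.append((i, nq, 1))                 # insert c
                if i < len(s):
                    steps.append((i + 1, nq, 0 if c == s[i] else 1))  # match/substitute
        for ni, nq, cost in steps:
            if d + cost < dist.get((ni, nq), float('inf')):
                dist[(ni, nq)] = d + cost
                heapq.heappush(heap, (d + cost, ni, nq))
    return None               # the DFA accepts nothing at all

digits = '0123456789'
trans = {(q, c): q + 1 for q in range(3) for c in digits}   # DFA for [0-9]{3}
print(regex_edit_distance('a23', trans, 0, {3}, digits))    # 1
print(regex_edit_distance('ab3', trans, 0, {3}, digits))    # 2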

Efficient algorithm for searching one of several strings in a text?

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
e.g., if the strings are abcx, bcy, and cz, and I've read abc so far, I should be in a combined state that means I'm either 3 chars into string 1, 2 chars into string 2, or 1 char into string 3. Then reading x next moves me to the string-1-matched state, etc., and any char other than x, y, or z moves me back to the initial state, and I will not need to retract back to the b.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
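For reference, a compact sketch of the idea -- a trie over the needles plus "failure" links computed with a BFS, which is exactly the combined state machine described in the question (names are mine):

from collections import deque

def build_ac(needles):
    # goto: trie edges; fail[n]: longest proper suffix of the path to n
    # that is also a trie path; out[n]: needles that end at n
    # (directly or via a failure link).
    goto, fail, out = [{}], [0], [[]]
    for word in needles:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({}); fail.append(0); out.append([])
            node = goto[node][ch]
        out[node].append(word)
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] += out[fail[child]]
    return goto, fail, out

def search(text, goto, fail, out):
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            yield i - len(word) + 1, word

goto, fail, out = build_ac(['abcx', 'bcy', 'cz'])
print(list(search('zabcy cz', goto, fail, out)))   # [(2, 'bcy'), (6, 'cz')]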
Take a look at suffix trees.
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong hint that Boost is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Knuth-Morris-Pratt builds a DFSM out of the string (or strings) for which you are searching, so that the test and the cursor motion are combined in a single operation. (Rabin-Karp, by contrast, is a rolling-hash method.) KMP was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right when possible. Boyer-Moore does this, and it's normally built for a single needle. Construct a table of every character and the rightmost position at which it appears in the needle (if at all). Now position the cursor at len(needle)-1. The table entry tells you either (a) at what leftward offset from the cursor the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.
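For the single-needle case, that table-plus-skip idea is only a few lines; here is Horspool's simplification of Boyer-Moore as a Python sketch (illustrative names):

def horspool(needle, haystack):
    # Bad-character table: for each char of the needle (except the
    # last), how far the window can shift when that char lines up
    # with the needle's final position.
    m = len(needle)
    skip = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    i = m - 1
    while i < len(haystack):
        if haystack[i - m + 1:i + 1] == needle:
            return i - m + 1
        i += skip.get(haystack[i], m)   # unseen char: jump a full needle
    return -1

print(horspool('aaab', 'xaaacaaab'))   # 5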
When you have more than one needle, the construction and use of your table grows more complicated, but it can still save you an order of magnitude on probes. You still might want to make a DFSM, but instead of calling a general search method you call does_this_DFSM_match_at_this_offset().
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time, computes a simple hash code to fit the result into 12 bits, and then looks in a table to see if there's a hit. It has four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern), and this way, despite thousands of patterns in the table, it only tests one or two per 32-bit word in the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
I've been looking at the answers, but none seem quite explicit... and most boil down to a couple of links.
What intrigues me here is the uniqueness of your problem: the solutions proposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP / Boyer-Moore for sure, but I would not apply them blindly (at least if you have some time on your hands), because they are tailored for a single needle, and I am pretty convinced we could capitalize on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, C's memcmp() should do fine. If you can tell us the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards, rbo
There is always Boyer-Moore.
Besides the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm, my algorithms book suggests a finite state machine for string matching. For every search string you need to build such a finite state machine.
You can do it with the very popular Lex & Yacc tools, with the support of Flex and Bison.
Use Lex to get the tokens of the string.
Compare your pre-defined strings with the tokens returned from the lexer.
When a match is found, perform the desired action.
There are many sites which describe Lex and Yacc. One such site is http://epaperpress.com/lexandyacc/

Distance between regular expressions

Can we compute a sort of distance between regular expressions?
The idea is to measure in what way two regular expressions are similar.
You can build deterministic finite-state machines for both regular expressions and compare the transitions. The difference between the two transition sets can then be used to measure the distance between the regular expressions.
There are a few metrics you could use:
The length of a valid match. Some regexes have a fixed size, some an upper limit, and some a lower limit. Compare how similar their lengths or possible lengths are.
The characters that can match. Any regex has a set of characters a match can contain (maybe all characters). Compare the sets of included characters.
Use a large document and see how many matches each regex makes and how many of those are identical.
Are you looking for strict equivalence?
I suppose you could compute the Levenshtein distance between the actual regular expression strings. That's certainly one way of measuring a "distance" between two different regular expressions.
Of course, I think it's possible that regular expressions are not required here at all, and computing the Levenshtein distance between the actual "value" strings that the regular expressions would otherwise be applied to may yield a better result.
If you have two regular expressions and have a set of example inputs you could try matching every input against each regex. For each input:
If they both match or both don't match, score 0.
If one matches and the other doesn't, score 1.
Sum this score over all inputs, and you get a 'distance' between the regular expressions: an idea of how often the two will differ on typical input. It will be very slow to calculate if your sample input set is large, and it won't work at all if both regexes fail to match almost all random strings while your expected input is entirely random. For example, the regexes 'sgjlkwren' and 'ueuenwbkaalf' would probably both never match anything on random input, so this metric would say the distance between them is zero. That may or may not be what you want (probably not).
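That scoring scheme is nearly a one-liner in practice. A sketch (sample_distance is a made-up name; here "matches" means a full match):

import re

def sample_distance(re1, re2, inputs):
    # Fraction of sample inputs on which the two regexes disagree.
    a, b = re.compile(re1), re.compile(re2)
    return sum(bool(a.fullmatch(s)) != bool(b.fullmatch(s))
               for s in inputs) / len(inputs)

print(sample_distance('[0-9]+', '[0-7]+', ['12', '89', 'x', '07']))   # 0.25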
You might be able to analyze the structure of the regexes and use biased random sampling to deliberately hit strings that match more frequently than they would in completely random input. For example, if both regexes require that the string start with 'foo', you could make sure that your test inputs also always start with foo, to avoid wasting time testing strings that you know will fail for both.
So in conclusion: unless you have a very specific situation with a restricted input set and/or a restricted regular expression language, I'd say it's not possible. If you do have some restrictions on your input and on the regular expressions, it might be possible. Please specify what these restrictions are and maybe I can come up with something better.
There's an answer hidden in an earlier question here on SO: Generating strings from regexes. You can calculate an (asymmetric) distance measure by generating strings using one regex and checking how many of those match the other regex.
This can be optimized by stripping out shared prefixes/suffixes. E.g. a[0-9]* and a[0-7]* share the a prefix, so you can calculate the distance between [0-9]* and [0-7]* instead.
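A sketch of that generate-and-check measure, using the third-party exrex package to enumerate strings a regex matches (I'm assuming its generate API here; any enumerator of a regex's language would do):

import re
import itertools
import exrex   # pip install exrex (third-party; API assumed)

def asym_distance(re_from, re_to, limit=1000):
    # Fraction of strings generated from re_from that re_to rejects:
    # an asymmetric distance in [0, 1].
    target = re.compile(re_to)
    sample = list(itertools.islice(exrex.generate(re_from), limit))
    return sum(not target.fullmatch(s) for s in sample) / len(sample)

print(asym_distance('a[0-7]*', 'a[0-9]*'))   # 0.0: a[0-9]* accepts everything a[0-7]* generates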
I think first you need to understand for yourself how you see a "difference" between two expressions. Basically, define a distance metric.
In the general case this would be quite difficult to do. Depending on what you need, you might consider allowing one different character in some position a big difference, while in another case allowing any number of consecutive identical characters might not make much difference.
I'd also like to emphasize that normally, when people talk about distance functions, they apply them to... well, let's call them tokens -- in our case, character sequences. What you want to do is apply the method not to the tokens themselves, but to the rules that a multitude of tokens will match. I'm not quite sure that even makes sense.
Still, I believe we could think of something -- not in general, but for one particular and quite restricted case. Do you have some sort of example to show us?