Iterating metacharacters for regex in Stata

I'm currently using regular expressions to manipulate street names in Stata. I'm faced with a problem that requires me to select observations based on how long a certain word in the string is. I know that you can specify repetition of an expression using curly brackets in other engines, but this doesn't seem to work in Stata. Specifically, I want to select observations that have three or more alphanumeric characters at a certain point in the string, which should be coded as
[a-zA-Z0-9]{3,}
However, this doesn't work when I try it in Stata, nor does any other use of {}, even though online debuggers say the expression is correct. Is this a deficiency in Stata's implementation of regex? I'm working on a solution that doesn't need that repetition, but I'd like to hear from the community what is lacking in Stata's regex support, and whether there's a different way to express repetition in the program.

I think the new Unicode regex parser in Stata 14 (based on ICU) can use this notation to find patterns that repeat at least k times:
clear
input str50 address
"221B Baker Street"
"56B, Whitehaven Mansions"
"Danemead, High street, St. Mary Mead"
end
compress
list address if ustrregexm(address,"([0-9]){3,}")
This will only give you Sherlock's address, since it is the only one with three or more consecutive digits. It also looks like you can use POSIX-style character classes:
list address if ustrregexm(address,"([:digit:]){3,}")
The legacy regexp parser (regexm()) has never supported this quantifier syntax.

There is no limiting quantifier in Stata's built-in regex syntax, according to the documentation:
Other popular regular-expression syntaxes include the POSIX standard and Perl's standard. Both expand on these basic operators by including counting operators (use of curly braces), metacharacters (usually of the form [:alpha:], etc.), and other syntax-specific additions.
When presented with the choice of which regular-expression syntax to adopt, Stata has several options. Different operating systems offer their own regular-expression parsers for applications to use, but there is no guarantee that these parsers are consistent. Stata avoids this ambiguity by using its own parser.
You just need to repeat the subpattern "manually" (as in some examples on the documentation Web page):
[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]+
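For what it's worth, a quick way to convince yourself the two forms are equivalent is to test them in an engine that supports both, e.g. Python's re module (the test strings below are made up):

import re

tests = ["AB", "AB1", "AB12", "x", "abc9"]
expanded = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]+$")
quantified = re.compile(r"^[a-zA-Z0-9]{3,}$")
for s in tests:
    # Both patterns should agree: three or more alphanumeric characters.
    assert bool(expanded.match(s)) == bool(quantified.match(s))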

Best string-comparison algorithm for regex

Given a regex, I want to compare it with a list of other regex, and output a similarity score.
There are several edit-distance algorithms out there (e.g. Levenshtein distance), but they fail to compare regexes, e.g.:
R1: [a-z0-9]+
R2: [0-9]{1}[a-z0-9]+
Distance: 9
In the example above, both regexes are quite similar; however, they have quite a high edit distance. I suppose an approach using character n-grams would be more suitable for such cases.
What algorithm/approach would you consider for this problem?
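To make the n-gram idea concrete, here is a rough sketch in Python; the bigram-overlap score is my own ad hoc measure, not an established algorithm:

from collections import Counter

def ngrams(s, n=2):
    return Counter(s[i:i+n] for i in range(len(s) - n + 1))

def similarity(r1, r2, n=2):
    # Jaccard-style overlap of character bigrams, treating each regex
    # as a plain string.
    a, b = ngrams(r1, n), ngrams(r2, n)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

print(similarity("[a-z0-9]+", "[0-9]{1}[a-z0-9]+"))  # 0.5: high overlap despite the large edit distance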
It seems unlikely that you'll improve on the parsing algorithm inside a regex engine itself, because you're ultimately going to be making inferences about combinations of rules.
There are a number of open-source regular-expression engines, many listed on Wikipedia, possibly including the one you're using.
Without having looked at the internals myself (not an insignificant caveat), my recommendation is to see whether it's possible to modify a regex engine (or leverage some pre-existing debugging or testing code) to output pertinent rule-processing metadata (sub-scores, if you will) from which you can then calculate an aggregate. The engines ultimately do their work deterministically, so this is theoretically possible.
If it works, this will, amongst other things, enable you to classify constructs that you define as similar with similar weights, and possibly to ignore others entirely.

Reasonable assumptions about digit grouping

I've been working on a C++ class to extract arbitrarily sized numbers from a stream and would like to leverage the number punctuation locale facets. Needless to say, std::num_get isn't going to extract my arbitrary-size number class; it only extracts built-in number types. But the extractor can get formatting information from the locale's numpunct and moneypunct facets.
The aspect I'm having the most trouble grappling with is digit grouping. I get that not all cultures group digits in threes, and some cultures have variably-sized number groups.
I've come across a blog (http://blogs.msdn.com/b/oldnewthing/archive/2006/04/17/577483.aspx) which shows some examples. Wikipedia (http://en.wikipedia.org/wiki/Decimal_mark#Examples_of_use) also has a table of examples.
The C and C++ standards provide a way to handle this through the locale mechanism. But the specification leaves semantic room for some very complicated situations. Recognizing a sequence of digits coming in with no end in sight, when we've told the recognizer to require correct digit grouping, is going to be extremely complicated.
So, can we cut down on the complexity by making some assumptions? These come from commonalities I've observed in the examples provided.
(Assumption 1) Only the least-significant group of digits can have a different size, and it can't be smaller than the other groups' size.
Failing assumption 1, we might fall back on:
(Assumption 2a) There are no more than a small number of different sizes. (Hopefully 2. I haven't seen any examples with more than two different sizes.)
(Assumption 2b) A less-significant digit group is always longer than all other groups for more-significant digits.
It bothered me that no one ever addressed this, but recently I stumbled across the Unicode Consortium's Common Locale Data Repository (CLDR).
Drilling down further, I found a summary chart (here) of the number formatting patterns in CLDR. This contains two basic grouping patterns (plus a third I discovered later):
#,##,##0.###: Indian languages, traditional use
#,##0.###: everyone else
# ####: Chinese and Japanese traditional (not in CLDR; I discovered this later)
So even my naïve assumption #1 from several years ago gave far more latitude than was needed.
However, the number formatting chart does appear to cover only base 10 modern languages. For example, it does not include Hittite, Mayan, or Babylonian.
Finally, I don't believe std::num_get is adapted to non-positional notations (like Roman numerals).
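As a footnote, here is a minimal sketch in Python (rather than C++, purely for brevity) of validating an already-tokenized digit string against a CLDR-style list of group sizes; the function and its lenient treatment of ungrouped input are my own choices:

def valid_grouping(s, groups):
    # groups = group sizes from least to most significant, with the last
    # entry repeating, CLDR-style: [3] -> "1,234,567" and
    # [3, 2] -> "12,34,567" (the Indian pattern above).
    parts = s.split(",")
    if len(parts) == 1:
        return True  # completely ungrouped input: accept (lenient choice)
    sizes = [len(p) for p in reversed(parts)]  # least significant first
    for i, size in enumerate(sizes):
        expected = groups[i] if i < len(groups) else groups[-1]
        if i == len(sizes) - 1:
            if not 1 <= size <= expected:  # most significant group may be short
                return False
        elif size != expected:
            return False
    return True

print(valid_grouping("1,234,567", [3]))     # True
print(valid_grouping("12,34,567", [3, 2]))  # True
print(valid_grouping("12,34,567", [3]))     # False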

Creating a regular expression for a list of strings

I have extracted a series of tables from the scientific literature which consist of columns, each of which is a distinct type. Here is an example.
I'd like to be able to automatically generate regular expressions for each column. Obviously there are trivial solutions such as .*, so I would add the constraints that they use only:
[A-Z] [a-z] [0-9]
explicit punctuation (e.g. ',' and ''')
"simple" quantifiers (e.g. {3,4})
A "best" answer for the table above would be:
[A-Z]{3}
[A-Za-z\s\.]+
\d{4}\sm
\d{2}°\d{2}'\d{2}"N,\d{2}°\d{2}'\d{2}"E
(speciosissima|intermediate|troglodytes)
(hf|sr)
\d{4}
Of course the 4th regex would break if we moved outside the geographical area, but the software doesn't know that. The aim would be to collect many regexes for, say, "Coordinates" and generalize them, probably partially manually. The enums would only be created if there were a small number of distinct strings.
I'd be grateful for examples of (especially F/OSS) software that can do this, especially in Java. (It's similar to Google's Refine.) I am aware of this question from 4 years ago, but it didn't really answer the question, and of the text2re site, which appears to be interactive.
NOTE: I note a vote to close as "too localised". This is a very general problem (the table given is only an example), as shown by Google/Freebase developing Refine to tackle it. It potentially applies to a very wide variety of tables (e.g. financial, journalism, etc.). Here's one with floating-point values:
It would be useful to determine automatically that some authorities report ages in real numbers (e.g. not months, days) and use 2 digits of precision.
Your particular issue is a special case of "programming by demonstration". That is, given a bunch of input/output examples, you want to generate a program. For you, the inputs are strings and the output is whether each string belongs to the given column. In the end, you want to generate a program in the language of limited regular expressions that you proposed.
This particular instance of programming by demonstration seems closely related to Flash Fill, a recent project from MSR. There, instead of generating regular expressions to match data, they automatically generated programs to transform string data based on input/output examples.
I only skimmed through one of their papers, but I'll try to lay out what I understand here.
There are basically two important insights in this paper. The first was to design a small programming language to represent string transformations. Even using full-on regular expressions created too many possibilities to search through quickly. They designed their own abstract language for manipulating strings; however, your constraints (e.g. only using simple quantifiers) would probably play the same role as their custom language. This is largely possible because your particular problem has a somewhat smaller scope than theirs.
The second insight was on how to actually find programs in this abstract language that match with given input/output pairs. My understanding is that the key idea here is to use a technique called version space algebra. The rough idea about version space algebra is that you maintain a representation of the space of possible programs and repeatedly prune it by introducing additional constraints. The exact details of this process fall well outside my main interests, so you're better off reading something like this introduction to version space algebra, which includes some sample code as well.
They also have some clever approaches to rank different candidate programs and even guess which inputs might be problematic for an already-generated program. I saw a demo where they generated a program without giving it enough input/output pairs, and the program could actually highlight new inputs that were likely to be incorrect. This sort of ranking is very interesting, but requires some more sophisticated machine learning techniques and is probably not immediately applicable to your use case. Might still be interesting though. (Also, this might have been detailed in a different paper than the one I linked.)
So yeah, long story short, you can generate your expressions by feeding input/output examples into a system based on version space algebra. I hope that helps.
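To make that concrete, here is a toy illustration in Python of the prune-by-constraints idea; this is not real version space algebra, and the candidate patterns and examples are made up:

import re

# Start with a small space of candidates in the limited regex language,
# then prune every candidate that rejects a positive example.
candidates = [r"[A-Z]+", r"[A-Z]{3}", r"[A-Za-z]+", r"\d+", r"[A-Z]{2,4}"]
positives = ["ABC", "XYZ", "QRS"]

for example in positives:
    candidates = [p for p in candidates if re.fullmatch(p, example)]

print(candidates)  # ['[A-Z]+', '[A-Z]{3}', '[A-Za-z]+', '[A-Z]{2,4}']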
I'm currently researching the same thing (or something similar) (here). In general, this is called grammar induction, or in the case of regular expressions, induction of regular languages. There is the StaMinA competition in this field. Common algorithms are RPNI and Blue-Fringe.
Here is another related question. And here another one. And here another one.
My own approach (which I have partially prototyped) is heuristic and based on the premise that a given column will often have entries of the same or similar character lengths with similar punctuation. I would welcome comments (and the resulting code will be open source). The steps are:
flatten [A-Z] to 'A'
flatten [a-z] to 'a'
flatten [0-9] to '0'
flatten any other special codepoint sets (e.g. greek characters) to a single character (e.g. alpha)
The columns then become:
"AAA"
"Aaaaaaaaaa", "Aaaaaaaaaaaaa", "Aaa aaa Aaaaaa", etc.
"0000 a"
"00\u00b000'00"N,00\u00b000'00"E
...
...
"0000"
I shall then replace these by regular expressions such as
"([A-Z])([A-Z])([A-Z])"
...
"(\d)(\d)(\d)(\d)\s([a-z])"
and capture the individual characters into sets. This will show that (say) in 3. the final char is always "m", so \d\d\d\d\s[m], and for 7. the value is [2][0][0][458].
For the columns that don't fit this model, we search using "(.*)" and see if we can create useful sets (cols 5. and 6.) with a heuristic such as "at least two strings occurring more than once and no more than 50% unique strings".
By using dynamic programming (cf. Kruskal) I hope to be able to align similar regexes, which will be useful for me, at least!
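Here is a rough Python sketch of the flattening step and the per-position character sets described above; the function is mine, and the column values are invented to match example column 3:

import re

def flatten(s):
    # Collapse character classes as described: uppercase -> 'A',
    # lowercase -> 'a', digits -> '0' (other codepoint sets omitted).
    s = re.sub(r"[A-Z]", "A", s)
    s = re.sub(r"[a-z]", "a", s)
    return re.sub(r"[0-9]", "0", s)

column = ["1862 m", "2097 m", "1433 m"]
print([flatten(v) for v in column])  # ['0000 a', '0000 a', '0000 a']

# Per-position character sets over the raw values recover that the
# final character is always 'm'.
sets = [set(chars) for chars in zip(*column)]
print(sets[-1])  # {'m'}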

Producing all possible matches of a regular expression

Given a regular expression, I want to produce the set of strings that that regular expression would match. It is important to note that this set would not be infinite, because there would be a maximum length for each string. Are there any well-known algorithms to do this? Are there any research papers I could read to gain insight into this problem?
Thanks.
P.S. Would this sort of question be more appropriate on the Theoretical CS Stack Exchange?
Are there any well known algorithms in place to do this?
In the Perl ecosystem, the Regexp::Genex CPAN module does this.
In Python, the sre_yield module generates the matching words. The regex inverter also does this.
A recursive algorithm is described here (link1, link2), and several libraries that do this in Java are mentioned here.
Generation of random words/strings that match a given regex: xeger (Python)
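If you want to roll your own, here is a toy recursive enumerator in Python over a hand-built pattern; the (choices, min, max) encoding is my own simplification (no parsing step), in the spirit of the recursive algorithm linked above:

from itertools import product

def enumerate_strings(parts):
    # parts is a list of (choices, min, max) triples, e.g. the pattern
    # [A-C]{1,2}X becomes [("ABC", 1, 2), ("X", 1, 1)].
    if not parts:
        yield ""
        return
    (choices, lo, hi), rest = parts[0], parts[1:]
    for n in range(lo, hi + 1):
        for combo in product(choices, repeat=n):
            for suffix in enumerate_strings(rest):
                yield "".join(combo) + suffix

print(list(enumerate_strings([("ABC", 1, 2), ("X", 1, 1)])))
# ['AX', 'BX', 'CX', 'AAX', 'ABX', ..., 'CCX'] -- 12 strings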
Are there any research papers I could read to gain insight into this problem?
Yes, the following papers are available for counting the strings that would match a regex (or obtaining generating functions for them):
"Counting occurrences for a finite set of words: an inclusion-exclusion approach", by F. Bassino, J. Clement, J. Fayolle, and P. Nicodeme (2007): paper, slides
"Regexpcount, a symbolic package for counting problems on regular expressions and words", by Pierre Nicodeme (2003): paper, links, code
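The counting itself reduces to a simple dynamic program over a DFA (the transfer-matrix idea underlying these papers). A sketch in Python, with a made-up DFA accepting strings over {0,1} with an even number of 1s:

delta = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
start, accepting = 0, {0}

def count_accepted(n):
    # counts[q] = number of length-i strings that reach state q
    counts = {start: 1}
    for _ in range(n):
        nxt = {}
        for state, c in counts.items():
            for ch in "01":
                q = delta[(state, ch)]
                nxt[q] = nxt.get(q, 0) + c
        counts = nxt
    return sum(c for q, c in counts.items() if q in accepting)

print([count_accepted(n) for n in range(5)])  # [1, 1, 2, 4, 8]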

Create a program that inputs a regular expression and outputs strings that satisfy that regular expression

I think that the title accurately summarizes my question, but just to elaborate a bit.
Instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.
Note: The function doesn't need to generate every string that satisfies the regular expression (because that would be an infinite number of strings for a lot of regexes). Just a sampling of the many valid strings is sufficient.
How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.NET) that do this.
Well, a regex is convertible to a DFA, which can be thought of as a graph. To generate a string given this DFA graph, you'd just find a path from the start state to an accepting state. You'd have to think about how you want to handle cycles (maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.
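A minimal Python sketch of that idea: breadth-first search over a hand-built DFA, bounding string length so cycles stay finite (the machine below, accepting 'a' or 'b' followed by any number of 'c's, is invented for illustration):

from collections import deque

delta = {(0, "a"): 1, (0, "b"): 1, (1, "c"): 1}
start, accepting, max_len = 0, {1}, 3

results, queue = [], deque([(start, "")])
while queue:
    state, s = queue.popleft()
    if state in accepting:
        results.append(s)  # this path spells an accepted string
    if len(s) < max_len:
        for (q, ch), nxt in delta.items():
            if q == state:
                queue.append((nxt, s + ch))

print(results)  # ['a', 'b', 'ac', 'bc', 'acc', 'bcc']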
This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:
[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
#|TH[12]
#(#|TH[12])?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
This can be done by traversing the DFA (includes pseudocode) or else by walking the regex's abstract-syntax tree directly or converting to NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)
These all work on regular expressions without back-references -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features it'd be easiest to add on a post-filter.
Added: code for this in Python, by Peter Norvig and me.
Since it is trivially possible to write a regular expression that matches no strings at all, and I believe it is also possible to write one for which calculating a matching string requires an exhaustive search over strings of all lengths, you'll probably need an upper bound when requesting an answer.
The easiest approach to implement, but definitely the most CPU-intensive, would be to simply brute-force it.
Set up a character table with the characters that your string should contain, then sequentially generate strings and run Regex.IsMatch on each.
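A rough Python equivalent of that brute-force loop, with re.fullmatch standing in for Regex.IsMatch (the pattern is one of the test cases above):

import re
from itertools import product

pattern = re.compile(r"fooba[rz]{2}")
alphabet = "fobarz"  # the character table
matches = [s for n in range(1, 8)
             for s in ("".join(t) for t in product(alphabet, repeat=n))
             if pattern.fullmatch(s)]
print(matches)  # ['foobarr', 'foobarz', 'foobazr', 'foobazz']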
I personally believe that this is the holy grail of regex. If you could implement this, even only 3/4 working, I have no doubt that you'd be rich in about 5 minutes.
All joking aside, I'm not sure that what you are truly going after is feasible. Regex is a very open, flexible language, and giving the computer enough sample input to truly and accurately find what you need is probably not feasible.
If I'm proven wrong, kudos to that developer.
To look at this from a different perspective, this is almost (not quite) like giving a computer its output and having it, based on that, write a program for you. That's a little overboard, but it kind of illustrates my point.