I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos, sometimes those writing instructions for their team in the game will misspell a building name and obviously the regex will then not pick it up (i.e. spelling "University" and "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that's able to handle a lot more load so I have as a last resort to algorithmically create versions of each word with a letter missing and then another with letters added in.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regex's for each special case.
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex() ? – Teifion (28 mins ago)
A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the days we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.
Related
The situation:
We created a tool Google Analytics Referrer Spam Killer, which automatically adds filters to Google Analytics to filter out spam.
These filters exclude traffic which comes from certain spammy domains. Right now we have 400+ spammy domains in our list.
To remove the spam, we add a regex (like so domain1.com|domain2.com|..) as a filter to Analytics and tell Analytics to ignore all traffic which matches this filter.
The problem:
Google Analytics has a 255 character limit for each regex (one regex per filter). Because of that we must to create a lot of filters to add all 400+ domains (now 30+ filters). The problem is, there is another limit. Number of write operation per day. Each new filter is 3 more write operations.
The question:
What I want to find the shortest regex to exactly match another regex.
For example you need to match the following strings:
`abc`, `abbc` and `aac`
You could match them with the following regexes: /^abc|abbc|aac$/, /^a(b|bb|a)c$/, /^a(bb?|a)c$/, etc..
Basically I'm looking for an expression which exactly matches /^abc|abbc|aac$/, but is shorter in length.
I found multiregexp, but as far as I can tell, it doesn't create a new regex out of another expression which I can use in Analytics.
Is there a tool which can optimize regexes for length?
I found this C tool which compiles on Linux: http://bisqwit.iki.fi/source/regexopt.html
Super easy:
$ ./regex-opt '123.123.123.123'
(123.){3}123
$ ./regex-opt 'abc|abbc|aac'
(aa|ab{1,2})c
$ ./regex-opt 'aback|abacus|abacuses|abaft|abaka|abakas|abalone|abalones|abamp'
aba(ck|ft|ka|lone|mp|(cu|ka|(cus|lon)e)s)
I wasn't able to run the tool suggested by #sln. It looks like it makes an even shorter regex.
I'm not aware of an existing tool for combining / compressing / optimising regexes. There may be one. Maybe by building a finite-state machine out of a regex and then generating a regex back out of that?
You don't need to solve the problem for the general case of arbitrary regexes. I think it's a better bet to look at creating compact regexes to match any of a given set of fixed strings.
There may already be some existing code for making an optimised regex to match a given set of fixed strings, again, IDK.
To do it yourself, the simplest thing would be to sort your strings and look for common prefixes / suffixes. ((afoo|bbaz|c)bar.com). Looking for common strings in the middle is less easy. You might want to look at algorithms used for lossless data compression for finding redundancy.
You'd ideally want to spot cases where you could use a foo[a-d] range instead of a foo(a|b|c|d), and various other things.
Is there any proven way to overcome the difficulty of authoring and managing large regex patterns in your code? Preferably in a visual tool? Is there any way to build up a pattern from smaller reusable pieces? I could not find an web based regex visualizers that supported multine regex for instance.
We are currently using a technique to split patterns and store the pieces in variables, but this mixes languages - an architectural no-no for us - and also hinders the ability to paste the pattern into a visualizer.
I am using .NET/Powershell/JavaScript - but I am interested in the flavor agnostic perspective as well.
At my old job we used regex for everything. The best tools I found where the below:
Best regex editor in my opinion (it explains each segment and has a reference sheet): http://regex101.com
Best web multi-line regex editor:http://regexpal.com/
Best regex editor overall (a download for the price of $40): http://www.regexbuddy.com/
As far as managing regexs, we used to keep all regexs in a properties file separate from the code, where the code loads the property (regex) in real time. We also shared regexbuddy files for exchanging regex patterns. There was one file that was saved on source control that had lines and lines of simple patterns for matching certain things. It helped to create larger ones, using the smaller pieces. However, what I have learned is that basically all regexs need to be tweaked for your specific purposes. It is not as simple as piecing small ones together. The small ones just help get started in the right direction.
Over here in flavor agnostic land, I sometimes do something like this (actual working code I just happened to be revisiting):
street = "(#{names}[A-Za-z0-9']+)((?:\\s+(?:#{StreetType.regexp}))?)"
space = '[\s.,]+'
at_a_street =
'(?:and|&|&|at|#|by|just\s+\w+\s+of|just\s+past|looking(?:\s+\w+)?\s+(?:at|to|towards?)|near)' +
"#{space}#{street}"
between_streets =
"(?:between|(?:betw?|btwn)\\.?)#{space}#{street}#{space}(?:and|&|&)#{space}#{street}"
address = '(\b\d+)(?:\s*-\s*\d+|[a-z])?\s+' + street
#regexps = [
/#{street}#{space}#{at_a_street}/i,
/#{street}#{space}#{between_streets}/i,
/#{address}/i,
/#{address}#{space}#{at_a_street}/i,
/#{address}#{space}#{between_streets}/i
]
Namely break the regexp up into meaningful bits, give them comprehensible names, and concatenate them as necessary. (You need to think a little extra about whether each bit can be safely concatenated with others, e.g. watch out for greedy expressions at the end.)
Is there a good way to split English document into sentences? I mean English document frequently includes Mr. Mrs. U.S.A, etc. It is difficult to separate them out. Do we need a special natural language library to accomplish this? I suspect that we need it.
Thank you.
Technically, you need a complete understanding of English to do the job.
As a decent "almost" solution, you could use a dictionary of "things that end in period" and split on periods which do not immediately follow one of those tokens.
If every sentence begins with a capital and ends with a period, then I would define a sentence as the above but contains >1 word and does not end with (common abbreviation list or regex [a-zA-Z].+)
You can use sentence detector provided by numerous NLP tools such as OpenNLP or Stanford CoreNLP. They can handle cases like Mr. Mrs. U.S.A, etc.
Both OpenNLP and Stanford CoreNLP are written in Java.
SharpNLP is C# (ported) version of OpenNLP.
What are the common ways to verify the given regex pattern works well of the given scenario and check the results ?
I would like to know in general , not in the particular programming language and what is the best way to learn about writing regular expression ?
Books: Mastering Regular Expressions is the definitive guide to regular expressions. The Regular Expressions Cookbook is said to be lighter and more easily applicable.
Sites: Friedel's companion site is a good start. Regexlib is a source of idioms and patterns.
Software: RegexBuddy is a good, per pay, regex verifier.
I've used this resource when learning: http://www.regular-expressions.info/ and found myself going back there whenever there was something I needed to remember. It's very useful for learning and covers the basics very well. They also have various links to programs which can be used to verify regular expressions.
This is not a "real" verification, but RegexBuddy allows you to verify that your regex does what you expect it to do on any sample data you provide. It also translates the regex into an English description that can help to figure out mistakes. Plus, it knows all major regex flavors and can translate regexes between them.
For testing regular expression you can use RegEx Test tools like one below :
http://www.regextester.com/
To know more about how to learn regular expressions please check following SO threads :
Learning Regular Expressions
How to master Regular Expressions?
https://stackoverflow.com/questions/465119/how-do-i-learn-regular-expressions-closed
RAD Rexexp designer is a great tool
Set up an automated test using your tools of choice (because regex implementations vary from language to language and library to library) which applies the regex to a variety of both matching and non-matching inputs to verify that you get the correct results.
While RegexBuddy and the like may be helpful for initially creating the regex (or may not; I've never used them), you will still need to maintain it, just like any other code. When that time comes, it's vastly preferable to have a test script that will run through all your old test inputs (plus the new ones which created the need for the change) in a matter of seconds rather than having to sit on a website for tens of minutes, if not hours, trying to remember all your test inputs and manually re-run them to make sure you didn't break anything.
I'm not sure if Jeff coined it but it's the joke/saying that people who say "oh, I know I'll use regular expressions!" now have two problems. I've always taken this to mean that people use regular expressions in very inappropriate contexts.
However, under what circumstances are regular expressions really the best answer? What problems are they really the best or maybe only way to solve a situation?
RexExprs are good for:
Text Format Validations (email, url, numbers)
Text searchs/substitution.
Mappings (e.g. url pattern to function call)
Filtering some texts (related to substitution)
Lexical analysis during parsing.
They can be used to validate anything that have a pattern like :
Social Security Number
Telephone Number ( 555-555-5555 )
Email Address (something#example.com)
IP Address (but it's more complex to make sure it's valid)
All those have patterns and are easily verifiable by RegEx.
They are difficultly used for entry that have a logic instead of a pattern like a credit card number but they still can be used to do some client validation.
So the best ways?
To sanitize data entry on the client
side before sanitizing them on the
server.
To make "Search and Replace" of some
strings that contains pattern
I'm sure I am missing a lot of other cases.
Regular expressions are a great way to parse text that doesn't already have a parser (i.e. XML) I have used it to create a parser for the mod_rewrite syntax in the .htaccess file or in my URL Rewriter project http://www.codeplex.com/urlrewriter for example
they are really good when you want to be more specific than "*" or "?" like "3 letters then 2 numbers then a $ sign then a period"
The quote is from an anti-Perl rant from Jamie Zawinski. I think Perl used to do regex really badly but now it seems to be a standard engine for a lot of programs.
But the same sentiment still applies. If you don't know how to use regex, you better not try something real fancy other wise you get one of these tags too (see bronze list) ;o)
https://stackoverflow.com/users/730/keng
They are good for matching or finding text that takes a very specific and simple format. By "simple" I mean not nested and smaller than the entire html spec, for example.
They are primarily of value for highly structured text parsing. If you used named groups (and option in most mature regex systems), you have a phenomenally powerful and crisp way to handle the strings.
Here's an example. Consider that netstat in its various iterations on different linux OSes, and versions of netstat can return different results. Sometimes there is an extra column, sometimes there is a shift if the date/time format. Regexes give you a powerful way to handle that with a single expression. Couple that with named groups, and you can retrieve the data without hacks like:
1) split on spaces
2) ok, the netstat version is X so add I need to add 1 to all array references past column 5.
3) ok, the netstat version is Y so I need to make sure that I use multiple array references for the date info.
YUCK. Simple to fix in a Regex :-)